# Week 9 Practical

In this practical we will be looking at a dataset called "Symptom2Disease" 
from https://www.kaggle.com/datasets/niyarrbarman/symptom2disease 

Authors:

- Niyar R Barman
- Faizal Karim
- Krish Sharma

It's not quite real, but it is believable. The authors describe their methodology as follows:

> We collected disease symptoms for 24 common diseases. We then used LLMs to convert the
> raw data into natural language description of symptoms.

## Data preparation

The data set is in the same folder as this Jupyter notebook, in a file named `Symptom2Disease.csv`.

Load it into a dataframe and view it.

How many different diseases are listed ("label")? Is this a balanced data set?
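As a minimal sketch of the loading and counting steps, here is the same idea on a tiny made-up CSV. The `text` column name and the example rows are assumptions for illustration; only the `label` column is named above.

```python
import io
import pandas as pd

# Made-up stand-in for Symptom2Disease.csv (the "text" column name is an assumption).
toy_csv = io.StringIO(
    "label,text\n"
    "Migraine,I have a pounding headache and bright lights hurt my eyes\n"
    "Migraine,My head throbs on one side and I feel sick\n"
    "Psoriasis,I have red scaly patches on my elbows\n"
    "Common Cold,My nose is runny and I keep sneezing\n"
)

# In the practical you would pass the file name instead: pd.read_csv("Symptom2Disease.csv")
df = pd.read_csv(toy_csv)

n_diseases = df["label"].nunique()    # number of distinct diseases
counts = df["label"].value_counts()   # rows per disease, to judge balance

print(df.head())
print(n_diseases)
print(counts)
```

If every disease has roughly the same count, the data set is balanced with respect to `label`.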

We'll start with a slightly easier problem: instead of predicting the disease specifically, let's
just see whether someone with those symptoms should visit a general practitioner to be treated, or whether
they will end up seeing a specialist.

The diseases that are very common, and that a general practitioner might be able to treat, are:

- Common Cold

- Bronchial Asthma

- Hypertension

- Migraine

- Allergy

- Drug Reaction

- Urinary Tract Infection

Add a column `requires_specialist` to your dataframe that is true for the diseases not in that list.

Is it a balanced data set now?
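One possible way to build the new column is with `Series.isin`. This is a sketch on a made-up dataframe, and storing the flag as 0/1 integers is an assumption (it is convenient for Keras later). `isin` matches strings exactly, so check the spelling and capitalisation of the labels in your own dataframe.

```python
import pandas as pd

# Diseases a general practitioner might treat (from the list above).
GP_DISEASES = [
    "Common Cold", "Bronchial Asthma", "Hypertension", "Migraine",
    "Allergy", "Drug Reaction", "Urinary Tract Infection",
]

# Made-up rows for illustration only.
df = pd.DataFrame({"label": ["Migraine", "Psoriasis", "Common Cold", "Dengue", "Malaria"]})

# 1 if the disease is NOT in the general practitioner list, otherwise 0.
df["requires_specialist"] = (~df["label"].isin(GP_DISEASES)).astype(int)

# Proportion of rows in each class, to judge balance.
balance = df["requires_specialist"].value_counts(normalize=True)
print(df)
print(balance)
```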

Split the data into training, validation and test data. You can do this by running 
`sklearn.model_selection.train_test_split` twice.
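A sketch of the two-step split on placeholder data. The 60/20/20 proportions, the `random_state` and the use of `stratify` are assumptions, not requirements of the practical.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 fake texts with alternating 0/1 labels.
texts = [f"symptom description {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First split: hold out 20% of everything as test data.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Second split: 25% of the remaining 80% (i.e. 20% of the whole) becomes validation data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the class proportions similar in all three parts, which matters more when the classes are unbalanced.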

Create a `keras.layers.TextVectorization` object with `output_mode='tf_idf'`, and `.adapt()` it to 
the text of your training data.

Convert your training, validation and test data into TF-IDF vectors using your vectorizer.
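The vectorizer steps might look something like this. The sentences are made up, and the variable names are only for illustration. Note that the vectorizer is adapted on the training text only and then reused for the other splits.

```python
import numpy as np
import keras  # with some TensorFlow installs you may need: from tensorflow import keras

# Made-up stand-ins for the training and validation text.
train_texts = np.array([
    "I have a sore throat and a runny nose",
    "My skin is itchy and covered in a red rash",
    "I get a pounding headache every morning",
])
val_texts = np.array(["I have a rash and a headache"])

vectorizer = keras.layers.TextVectorization(output_mode="tf_idf")
vectorizer.adapt(train_texts)  # learns the vocabulary and IDF weights from training text only

train_vectors = vectorizer(train_texts)
val_vectors = vectorizer(val_texts)  # same fitted vectorizer, so the columns line up

vocab_size = vectorizer.vocabulary_size()
print(vocab_size, train_vectors.shape, val_vectors.shape)
```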

## A small logistic regression classifier using Keras

Let's see if we can predict who is going to need a specialist.

`prog3f.py` is a good sample to work from.

Create a `keras.Input` object. Its shape should be the size of the vocabulary in your vectorizer.

Create a `keras.layers.Dense` object to be your output layer, and pass it your input object
as a function argument.

Because there is only one value ("needs a specialist or not"), it should have one neuron, and a sigmoid activation.

Create a `keras.Model` object, with `inputs=` set to the input object and `outputs=` set to the output layer.

Compile the model. There are only two classes, so you can use `binary_crossentropy`.

We want to know accuracy, recall and precision. (Given how unbalanced the data set is, it would
be nice to have F1Score calculated for us. That would require a bit more code than we
have time to write in this practical.)

_Note to cohorts after 2023-10: check to see if F1Score is now part of default Keras releases._

Display a summary of your model.
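Putting the model steps together, a minimal sketch could look like the following. It is not taken from `prog3f.py`; the `vocab_size` value is a placeholder and the `adam` optimizer is an assumption.

```python
import keras  # with some TensorFlow installs you may need: from tensorflow import keras

vocab_size = 500  # placeholder; in the practical use the vocabulary size of your vectorizer

inputs = keras.Input(shape=(vocab_size,))
# One neuron with a sigmoid: the output is the predicted probability of needing a specialist.
outputs = keras.layers.Dense(1, activation="sigmoid")(inputs)
model = keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer="adam",  # assumption: the practical does not name an optimizer
    loss="binary_crossentropy",
    metrics=[
        "accuracy",
        keras.metrics.Recall(name="recall"),
        keras.metrics.Precision(name="precision"),
    ],
)

model.summary()
```

A single dense layer with a sigmoid output is exactly a logistic regression: one weight per vocabulary entry plus one bias.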

Fit the model. 

You will need:

- `x` will be your training vectors

- `y` will be the `requires_specialist` column

- `validation_data` will be the same as for training, but using the validation data

- You might want a callback to stop training when the validation loss stops improving.

- You won't need many epochs: fewer than 100 should be enough, and they should be quick to run.

Save the history into a variable so that we can look at it.
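Here is a sketch of a `fit` call with early stopping, run on random placeholder data rather than the real TF-IDF vectors. The `patience` value, the optimizer and the epoch count are assumptions.

```python
import numpy as np
import keras  # with some TensorFlow installs you may need: from tensorflow import keras

# Random placeholder data standing in for TF-IDF vectors and the requires_specialist column.
rng = np.random.default_rng(0)
n_features = 20
x_train = rng.random((200, n_features)).astype("float32")
y_train = (x_train[:, 0] > 0.5).astype("float32")
x_val = rng.random((50, n_features)).astype("float32")
y_val = (x_val[:, 0] > 0.5).astype("float32")

inputs = keras.Input(shape=(n_features,))
outputs = keras.layers.Dense(1, activation="sigmoid")(inputs)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once the validation loss has not improved for 5 epochs, and keep the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

history = model.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_val, y_val),
    epochs=50,
    callbacks=[early_stop],
    verbose=0,
)

print(len(history.history["loss"]), "epochs run")
print(sorted(history.history.keys()))
```

`history.history` is a dictionary with one list per metric, with one entry per epoch.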

Make a matplotlib chart showing the accuracy and training loss over time.
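A small helper along these lines can draw the chart from the `history.history` dictionary. The function name `plot_history` is hypothetical, and the numbers in the example call are placeholders to show the call, not real results.

```python
import matplotlib.pyplot as plt

def plot_history(history_dict):
    """Plot accuracy and loss per epoch for training and validation data."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

    ax_acc.plot(history_dict["accuracy"], label="training")
    ax_acc.plot(history_dict["val_accuracy"], label="validation")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()

    ax_loss.plot(history_dict["loss"], label="training")
    ax_loss.plot(history_dict["val_loss"], label="validation")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()

    return fig

# Placeholder numbers only; in the practical pass history.history from your own fit.
example_history = {
    "accuracy": [0.60, 0.70, 0.80],
    "val_accuracy": [0.55, 0.65, 0.70],
    "loss": [0.70, 0.50, 0.40],
    "val_loss": [0.75, 0.60, 0.55],
}
fig = plot_history(example_history)
```

A validation curve that flattens or gets worse while the training curve keeps improving is the usual visual sign of overfitting.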

Calculate the corpus size, the vocabulary size and their ratio (i.e. the corpus size divided by the vocabulary size).
Based on this and the previous charts, decide whether the model is likely to be overfitting.
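A sketch of the calculation in plain Python, on made-up sentences. It assumes "corpus size" means the total number of word tokens in the training text, and the `tokenize` helper only roughly imitates the default lowercasing and punctuation stripping of `TextVectorization`.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

# Made-up training sentences for illustration.
train_texts = [
    "I have a headache.",
    "I have a rash on my arm.",
    "My arm hurts!",
]

word_counts = Counter(word for text in train_texts for word in tokenize(text))

corpus_size = sum(word_counts.values())  # total number of word tokens
vocabulary_size = len(word_counts)       # number of distinct words
ratio = corpus_size / vocabulary_size

print(corpus_size, vocabulary_size, round(ratio, 2))
```

A low ratio means each word has been seen only a few times on average, so the model has one weight per word but little evidence for each weight.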

Calculate an appropriate vocabulary size if you only wanted to include words that appeared in the
training data at least 3 times. Go back to where you defined your TextVectorization object and 
set `max_tokens` to (this value + 1). It might give you a tiny improvement in your validation scores.
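One way to count the words that appear at least 3 times, sketched on made-up sentences with a simple lowercase-and-split tokenizer (an assumption; your vectorizer may tokenize slightly differently). The extra 1 leaves room for the out-of-vocabulary token that `TextVectorization` includes in `max_tokens`.

```python
from collections import Counter

# Made-up training sentences for illustration.
train_texts = [
    "I have a fever and a cough",
    "I have a rash",
    "I have a fever",
    "my cough is worse at night",
]

word_counts = Counter(word for text in train_texts for word in text.lower().split())

min_count = 3
frequent_vocab_size = sum(1 for count in word_counts.values() if count >= min_count)

# One extra slot for the out-of-vocabulary token.
max_tokens = frequent_vocab_size + 1

print(frequent_vocab_size, max_tokens)
```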

Also try using bigrams. Does this help? 
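`TextVectorization` has an `ngrams` argument for this. As a sketch on made-up sentences, passing `ngrams=2` should give a vocabulary containing both single words and adjacent word pairs.

```python
import numpy as np
import keras  # with some TensorFlow installs you may need: from tensorflow import keras

# Made-up training sentences for illustration.
train_texts = np.array([
    "I have a sore throat",
    "my throat is sore and dry",
])

# ngrams=2 creates n-grams up to length 2, i.e. unigrams and bigrams.
bigram_vectorizer = keras.layers.TextVectorization(output_mode="tf_idf", ngrams=2)
bigram_vectorizer.adapt(train_texts)

vocabulary = bigram_vectorizer.get_vocabulary()
print(len(vocabulary))
print(vocabulary)
```

Bigrams make the vocabulary much larger, so the corpus-to-vocabulary ratio drops; it is worth keeping an eye on `max_tokens` when you try this.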

Use `.evaluate` on your test data to confirm that the test results are still close to your validation results,
i.e. that tuning our parameters against the validation data hasn't skewed the results too much.

## Explainability

People's health is important. We can't let them use a black-box classifier for analysing their health conditions.

What words were the most predictive that a general practitioner could help, and what words were most predictive
that a specialist would be required?

The `wordeffects.py` file might be helpful here.
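If you want to see the idea without that file, here is a sketch of one approach; it is not the contents of `wordeffects.py`. In a one-layer sigmoid model, each vocabulary entry has one weight: a large positive weight pushes the prediction towards "requires specialist" and a large negative weight pushes it towards "general practitioner". The function name `top_words`, the vocabulary and the weights below are all made up for illustration.

```python
import numpy as np

def top_words(vocabulary, weights, k=3):
    """Return the k most negative-weight and the k most positive-weight words."""
    order = np.argsort(weights)  # indices from most negative to most positive weight
    most_gp = [vocabulary[i] for i in order[:k]]
    most_specialist = [vocabulary[i] for i in order[::-1][:k]]
    return most_gp, most_specialist

# Placeholder values. With a real model you might use something like:
#   vocabulary = vectorizer.get_vocabulary()
#   weights = model.layers[-1].get_weights()[0][:, 0]
vocabulary = ["[UNK]", "sneezing", "rash", "headache", "jaundice"]
weights = np.array([0.0, -1.2, 0.8, -0.5, 1.5])

most_gp, most_specialist = top_words(vocabulary, weights, k=2)
print("general practitioner:", most_gp)
print("specialist:", most_specialist)
```

Treat the ranking as a rough guide: with TF-IDF inputs the size of a weight also depends on how the feature is scaled.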

## Harder task: predicting the actual disease, not just who to go to

Unlike (say) the sklearn classifiers, Keras can't work with text labels for classes. Use
`sklearn.preprocessing.LabelEncoder` to convert the disease labels into integer labels.
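A short sketch of the encoder on a few made-up labels. `LabelEncoder` numbers the classes in sorted order, and `inverse_transform` converts the integers back to names when you want to read the predictions.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration.
disease_labels = ["Migraine", "Acne", "Migraine", "Dengue"]

encoder = LabelEncoder()
y = encoder.fit_transform(disease_labels)  # integer class ids, one per row

print(list(encoder.classes_))              # class names in the order of their ids
print(list(y))
print(encoder.inverse_transform([0])[0])   # back from id to disease name
```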

You can re-use the existing `keras.Input` since that hasn't changed, or define a new one.
Then create an output layer:

- It will need as many neurons as there are diseases

- Use a 'softmax' activation to normalise the probabilities

And then create a model using those inputs and outputs.

To compile this model we will need to use `sparse_categorical_crossentropy` as the
loss. Let's use accuracy as the only other metric (we know how to add more if we need to).
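As a sketch, the multi-class model might be put together like this. The `vocab_size` is a placeholder, the `adam` optimizer is an assumption, and 24 is the number of diseases given in the dataset description (with your own data, the number of classes found by the label encoder is safer).

```python
import keras  # with some TensorFlow installs you may need: from tensorflow import keras

vocab_size = 500  # placeholder; use the vocabulary size of your vectorizer
n_diseases = 24   # from the dataset description; or len(encoder.classes_)

inputs = keras.Input(shape=(vocab_size,))
# One neuron per disease; softmax turns the outputs into probabilities that sum to 1.
outputs = keras.layers.Dense(n_diseases, activation="softmax")(inputs)
disease_model = keras.Model(inputs=inputs, outputs=outputs)

disease_model.compile(
    optimizer="adam",  # assumption: the practical does not name an optimizer
    loss="sparse_categorical_crossentropy",  # expects integer class labels
    metrics=["accuracy"],
)

disease_model.summary()
```

The "sparse" variant of the loss is what lets us pass integer labels directly instead of one-hot vectors.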

Fit the model as before, using the integer disease labels as `y`, then plot the accuracy and loss for the training data and validation data.

It's less obvious how to improve this model now, but we do have fairly good accuracy anyway. 

Confirm that the accuracy is also quite good on the test data.

## Free-form

Try to optimise the model by adding extra layers, modifying the vocabulary or any other ideas you want to try out.