# Machine Learning Applications for Health (COMP90089_2022_SM2)
# Tutorial 6: Natural Language Processing with Machine Learning Classifiers

## NLP tasks employed here:
* **Low-level task**: Split text into smaller units (tokens) and turn them into numerical vectors.
* **High-level task**: Make inferences based on the labelled text information.
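
As a small illustration of the low-level task, the sketch below tokenizes one sentence with the analyzer that `CountVectorizer` builds by default. The sentence is invented for this example and does not come from the tutorial data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-sentence report, made up for illustration.
sentence = "No evidence of pneumonia in the left lower lobe."

# build_analyzer() returns the callable CountVectorizer applies to each document:
# with default settings it lowercases the text and keeps tokens of two or more
# word characters, dropping punctuation.
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer(sentence)
print(tokens)
```

Note that common words such as "of" and "the" are kept, because no stop-word list is applied by default.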


## Types of NLP Models:

* **Rule-based**: Logic for processing text is defined and implemented manually.
* **Statistical**: Machine Learning models are trained on annotated data.

> Here we are going to work with a statistical NLP model.
 

### Set up the environment: we will mainly use the **sklearn** library.

In [1]:
import pandas as pd
import numpy as np
import sys

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import linear_model, model_selection, preprocessing, metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore") 

#Display all DataFrame columns and print full NumPy arrays without truncation
pd.set_option('display.max_columns', None)
np.set_printoptions(threshold=sys.maxsize)

* ### **Data source:** Alec Chapman [GitHub](https://github.com/abchapman93/Melbourne_COMP90089_NLP) > Notebooks (05)

"The dataset in these examples is a set of MIMIC-II **radiology reports**. The annotations were created by University of Utah physician-scientist and **pneumonia** extraordinaire Dr. Barbara Jones. (A note which contains a positive or possible mention of a term referring to pneumonia might consider: Pneumonia; Pna; Opacity; Infiltrate; Consolidation)"




**Read the data sets**: the training set and the testing set.

Each set has **4 columns** as follows: 
* The document name: **'filename'**
* The text: **'text'**
* The annotator's document classification (whether or not the patient has pneumonia): **'document_classification'**
* The baseline NLP system's document classification (this is the prediction result from another method. We will compare our model with it): **'baseline_document_classification'**
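
For contrast with the statistical model trained later, here is a minimal sketch of what a rule-based classifier could look like, using the terms listed in the dataset description. This is not the baseline system (its internals are not described here); the function name `mentions_pneumonia` and the matching rule are assumptions made for illustration.

```python
import re

# Terms taken from the dataset description. The matching rule itself is an
# illustrative assumption and ignores negation ("no pneumonia").
PNEUMONIA_TERMS = ["pneumonia", "pna", "opacity", "infiltrate", "consolidation"]
PATTERN = re.compile(r"\b(" + "|".join(PNEUMONIA_TERMS) + r")\b", re.IGNORECASE)


def mentions_pneumonia(text: str) -> int:
    """Return 1 if any term appears as a whole word in the text, else 0."""
    return int(PATTERN.search(text) is not None)


# Hypothetical report snippets, made up for illustration.
examples = [
    "Right lower lobe consolidation.",
    "Lungs are clear.",
    "No evidence of pneumonia.",  # negated, but still flagged by this naive rule
]
flags = [mentions_pneumonia(t) for t in examples]
print(flags)
```

The third snippet shows the main weakness of such a rule: handling negation and uncertainty requires additional hand-written logic.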

In [None]:
#Download the data sets from Alec Chapman's GitHub repository
url_train = 'https://raw.githubusercontent.com/abchapman93/Melbourne_COMP90089_NLP/main/melbourne_comp90089_nlp/data/pneumonia_data_train.json'
url_test = 'https://raw.githubusercontent.com/abchapman93/Melbourne_COMP90089_NLP/main/melbourne_comp90089_nlp/data/pneumonia_data_test.json'

#Read the data in the proper format
df_train = pd.read_json(url_train)
df_test = pd.read_json(url_test)

#Check what the first rows look like:
print(df_train.head())

In [None]:
#Check specifically the Text column which contains the raw information (radiology report notes)
print(df_train['text'][0])


### Verify how balanced the annotated data is with respect to pneumonia:
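
A numeric check can complement a count plot. The sketch below uses a small invented label series (not the tutorial data) to show how `value_counts(normalize=True)` reports the share of each class.

```python
import pandas as pd

# Hypothetical labels, made up for illustration (not the tutorial data).
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="document_classification")

# normalize=True returns the proportion of each class instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions)
```

The same call applied to `df_train["document_classification"]` would give the class shares for the training set.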

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
#Draw on the axes created above so the figure size is applied to this plot
sns.countplot(x='document_classification', data=df_train, ax=ax)
ax.set_title("Count of Pneumonia Annotations (0/1)")
plt.show()

## Text Analysis: Building Vectors (Tokens)

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, **cannot be fed directly to the algorithms themselves**, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.


* The sklearn class **CountVectorizer()** converts a collection of text documents to a matrix of token counts. Read more about it [here](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

* The method **fit_transform()** learns the vocabulary dictionary and returns the document-term matrix.


In [None]:
#CountVectorizer: tokenizing strings and giving an integer id for each possible token. Also counting the occurrences of tokens in each document.
count_vectorizer = CountVectorizer()

#Learn the vocabulary dictionary and return document-term matrix
train_vectors = count_vectorizer.fit_transform(df_train["text"])
train_vectors #Sparse matrix with one row per document and one column per unique token (here 140 rows and 1358 columns)

In [None]:
#Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix
count_vectorizer.get_feature_names_out()


Create Vectors (Tokens) for the testing set

In [7]:
## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(df_test["text"])

### Machine Learning Model:

* Random Forest [Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) applying [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find the best set of parameters.
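
As a rough sketch of how a grid search scales, the example below runs `GridSearchCV` on synthetic data with a deliberately tiny grid. The data, the grid values, and the name `small_grid` are assumptions chosen so that the search finishes in seconds; they are not the settings used in this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic data standing in for a document-term matrix (assumption).
X, y = make_classification(n_samples=60, n_features=10, random_state=0)

# Deliberately tiny, hypothetical grid: 2 * 3 = 6 combinations.
small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4, 6]}
n_combinations = len(ParameterGrid(small_grid))
n_folds = 3
print(n_combinations, "combinations,", n_combinations * n_folds, "cross-validation fits")

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=n_folds)
search.fit(X, y)

# With refit=True (the default), the best combination is retrained on all the
# data, so best_estimator_ can predict directly.
print(search.best_params_)
predictions = search.best_estimator_.predict(X)
```

The number of model fits grows multiplicatively with every parameter added to the grid, which is worth keeping in mind before extending it.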

In [None]:
#Initialise the Random Forest Classifier
rfc = RandomForestClassifier(random_state=0)

#Check the Random Forest documentation to get the parameters. Let's set some possible values for each of them:
#Note: 'auto' was removed from max_features in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'.
param_grid = { 
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

#Call GridSearchCV to evaluate every combination of parameters, in this case: 2 * 2 * 5 * 2 = 40 (each one is fitted 5 times because cv=5)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(train_vectors, df_train["document_classification"])


* What is the **best set of parameters for the Random Forest** classifier using this data?

In [None]:
CV_rfc.best_params_

* Take these parameters and **instantiate the classifier again**, now passing this information:

In [10]:
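#max_features='sqrt' replaces the removed 'auto' option, which meant the same thing for classifiers
rfc = RandomForestClassifier(criterion='gini', max_depth=5, max_features='sqrt', n_estimators=500, random_state=0)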
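
Instead of copying the values by hand, a parameter dictionary can be unpacked straight into the constructor. The sketch below uses a hypothetical dictionary named `best_params` in place of a real search result.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical search result, made up for illustration.
best_params = {"criterion": "gini", "max_depth": 5, "n_estimators": 500}

# ** unpacks the dictionary into keyword arguments.
clf = RandomForestClassifier(random_state=0, **best_params)
print(clf.get_params()["max_depth"])
```

With a fitted search object, the same pattern would apply to its `best_params_` attribute, which avoids transcription mistakes.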


In [None]:
rfc.fit(train_vectors, df_train["document_classification"])
df_test["new_prediction_rforest"] = rfc.predict(test_vectors)
df_test.head()

### Evaluate the model: how good is it?

* The **sklearn.metrics** module implements several loss, score, and utility functions to measure classification performance.
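
To see how these metrics can disagree on imbalanced data, the sketch below scores a small invented set of labels (8 negatives, 2 positives) in which one positive case is missed. The labels are made up for illustration.

```python
from sklearn import metrics

# Hypothetical labels: 8 negatives and 2 positives, one positive is missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = metrics.accuracy_score(y_true, y_pred)                    # (tp + tn) / total
balanced_accuracy = metrics.balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
f1 = metrics.f1_score(y_true, y_pred)                                # 2*tp / (2*tp + fp + fn)
mcc = metrics.matthews_corrcoef(y_true, y_pred)

print(tn, fp, fn, tp)
print(accuracy, balanced_accuracy, round(f1, 3), round(mcc, 3))
```

Accuracy is 0.9 here even though half of the positive cases are missed. Balanced accuracy (0.75), F1 and MCC (both about 0.667) reflect that miss more clearly.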

In [None]:
#Accuracy classification score
acc_rfc = float(round(metrics.accuracy_score(df_test["document_classification"], df_test["new_prediction_rforest"]),3))

#Compute the balanced accuracy.
bacc_rfc = float(round(metrics.balanced_accuracy_score(df_test["document_classification"], df_test["new_prediction_rforest"]),3))

#Compute the Matthews correlation coefficient (MCC)
mcc_rfc = float(round(metrics.matthews_corrcoef(df_test["document_classification"], df_test["new_prediction_rforest"]),3))

#Compute the F1 score, also known as balanced F-score or F-measure.
f1_rfc = float(round(metrics.f1_score(df_test["document_classification"], df_test["new_prediction_rforest"]),3))

#Save results as a DataFrame:
results_rfc = {'Accuracy' : [acc_rfc], 'Balanced Accuracy' : [bacc_rfc], 'MCC' : [mcc_rfc], 'F1-Score' : [f1_rfc]}
rfc_results = pd.DataFrame.from_dict(data = results_rfc, orient='columns')
print(rfc_results)

### **Comparison**: How does the baseline model perform?

* Get the output predictions for the **baseline** model in the column df_test["baseline_document_classification"] and check the metrics again for them.
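
Because the same four metrics are computed for every set of predictions, the repeated steps could be wrapped in a small helper. The function name `summarise_metrics`, the toy DataFrame, and the columns `model_a` and `model_b` below are hypothetical and only illustrate the pattern.

```python
import pandas as pd
from sklearn import metrics


def summarise_metrics(y_true, y_pred):
    """Return the four tutorial metrics, rounded to 3 decimals, as a dict."""
    return {
        "Accuracy": round(metrics.accuracy_score(y_true, y_pred), 3),
        "Balanced Accuracy": round(metrics.balanced_accuracy_score(y_true, y_pred), 3),
        "MCC": round(metrics.matthews_corrcoef(y_true, y_pred), 3),
        "F1-Score": round(metrics.f1_score(y_true, y_pred), 3),
    }


# Toy data, made up for illustration: one gold column and two prediction columns.
toy = pd.DataFrame({
    "document_classification": [1, 0, 1, 0, 1, 0],
    "model_a": [1, 0, 0, 0, 1, 0],
    "model_b": [1, 1, 1, 0, 0, 1],
})

# One row per model, one column per metric.
comparison = pd.DataFrame({
    col: summarise_metrics(toy["document_classification"], toy[col])
    for col in ["model_a", "model_b"]
}).T
print(comparison)
```

Putting all models in one table makes a side-by-side comparison easier than reading separate printouts.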

In [None]:
#Accuracy classification score
acc_base = float(round(metrics.accuracy_score(df_test["document_classification"], df_test["baseline_document_classification"]),3))

#Compute the balanced accuracy.
bacc_base = float(round(metrics.balanced_accuracy_score(df_test["document_classification"], df_test["baseline_document_classification"]),3))

#Compute the Matthews correlation coefficient (MCC)
mcc_base = float(round(metrics.matthews_corrcoef(df_test["document_classification"], df_test["baseline_document_classification"]),3))

#Compute the F1 score, also known as balanced F-score or F-measure.
f1_base = float(round(metrics.f1_score(df_test["document_classification"], df_test["baseline_document_classification"]),3))

#Save results as a DataFrame:
results_base = {'Accuracy' : [acc_base], 'Balanced Accuracy' : [bacc_base], 'MCC' : [mcc_base], 'F1-Score' : [f1_base]}
base_results = pd.DataFrame.from_dict(data = results_base, orient='columns')
print(base_results,"\n")
print("Compare with our result: \n",rfc_results)

### What is your conclusion?



---



### **Bonus**: check how good a linear model would be with this data. 

* Linear model: a classifier based on [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html) regression.
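
As a brief sketch of how a ridge-based classifier reaches a decision, the example below fits `RidgeClassifier` on synthetic data (an assumption, not the tutorial data) and checks that the predicted class follows the sign of the regression score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic binary data standing in for the document-term matrix (assumption).
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

clf = RidgeClassifier().fit(X, y)

# decision_function returns the signed regression score for each sample;
# for binary labels 0/1, predict returns 1 where the score is positive.
scores = clf.decision_function(X)
predictions = clf.predict(X)
manual = (scores > 0).astype(int)

print(np.array_equal(predictions, manual))
```

Internally the labels are mapped to -1 and 1 and a ridge regression is fitted, so the model is linear in the token counts.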


In [None]:
linear = linear_model.RidgeClassifier()

scores = model_selection.cross_val_score(linear, train_vectors, df_train["document_classification"], cv=3, scoring="f1")
scores

In [None]:
linear.fit(train_vectors, df_train["document_classification"])
df_test["new_prediction_linear"] = linear.predict(test_vectors)
df_test.head()

### Evaluate the model: how good is it?

* The **sklearn.metrics** module implements several loss, score, and utility functions to measure classification performance.

In [None]:
#Accuracy classification score
acc_linear = float(round(metrics.accuracy_score(df_test["document_classification"], df_test["new_prediction_linear"]),3))

#Compute the balanced accuracy.
bacc_linear = float(round(metrics.balanced_accuracy_score(df_test["document_classification"], df_test["new_prediction_linear"]),3))

#Compute the Matthews correlation coefficient (MCC)
mcc_linear = float(round(metrics.matthews_corrcoef(df_test["document_classification"], df_test["new_prediction_linear"]),3))

#Compute the F1 score, also known as balanced F-score or F-measure.
f1_linear = float(round(metrics.f1_score(df_test["document_classification"], df_test["new_prediction_linear"]),3))

#Save results as a DataFrame:
results = {'Accuracy' : [acc_linear], 'Balanced Accuracy' : [bacc_linear], 'MCC' : [mcc_linear], 'F1-Score' : [f1_linear]}
linear_results = pd.DataFrame.from_dict(data = results, orient='columns')
print(linear_results)