# Abstract

- In medical science, many diseases are related to memory loss, and Alzheimer's disease is one of them. Detecting such a disease at an early stage is important for treatment. Combining natural language processing (NLP) and deep learning (DL) can lead toward a solution. NLP can transform words into vectors, known as word embeddings. A classifier is then trained on these word embeddings to distinguish Alzheimer's disease (AD) from healthy control (HC). Computing word embeddings is computationally expensive and time-consuming. It is also important to examine precision and recall to check the integrity of the classification model.

# Introduction
- Good mental health is required for a better world. Medical science has progressed remarkably, but some diseases still cannot be identified early enough to be treated. A combination of natural language processing (NLP) and deep learning (DL) can be used to identify such diseases with a classification algorithm.


- Alzheimer's is one of the diseases that is hard to detect in its early phases, and even later on. Data science offers a possible approach: a combination of natural language processing and deep learning can be used to detect the disease. NLP transforms sentences into word embeddings, in which words are represented as vectors of real numbers. This can be done with BERT (Bidirectional Encoder Representations from Transformers). BERT has several variants, such as RoBERTa and ELECTRA.


- RoBERTa (Robustly Optimized BERT Pretraining Approach) is a powerful model for word embedding. Each sentence is split into tokens, and the model's encoder layers turn these tokens into embeddings.

- Gaussian process classification is useful for these high-dimensional word embeddings, since the data carry a large amount of uncertainty and a Gaussian process gives probabilistic predictions.

In [1]:
# environment for this notebook
import torch

from transformers import RobertaModel, RobertaTokenizer

import numpy as np

import pandas as pd

import glob

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

from scipy.stats import entropy


### Dataset

- There are two types of data sets: one with sentences and one with patient health descriptions. The first type is used for this word embedding task. The data were collected from picture descriptions given by patients affected by Alzheimer's disease and by healthy persons.

### Pre-processing

- There are several CSV files based on the patients' descriptions, so they must be converted into an appropriate format for word embedding. All CSV files of a group are combined into one table, and a label is added to identify Alzheimer's disease (AD) or healthy control (HC). The same process is applied to the test data set.

In [2]:
# Combine all csv files of AD_train and add label column
AD_train=[]
path=r"C:\Users\uttam\OneDrive\Desktop\NLP Project\NLP-Project\Data\AD_train"
csv_files=glob.glob(path+'/*.csv')
for f in csv_files:
    df1 = pd.read_csv(f)
    df1 = df1.loc[:, ~df1.columns.str.contains('^Unnamed')]
    AD_train.append(df1)


df_AD_train=pd.concat(AD_train,axis=0,ignore_index=True)
label_AD=0 #AD : 0 label
df_AD_train['label']=label_AD

In [3]:
# Combine all csv files of AD_test and add label column
AD_test=[]
path=r"C:\Users\uttam\OneDrive\Desktop\NLP Project\NLP-Project\Data\AD_train\AD_Test"
csv_files=glob.glob(path+'/*.csv')
for f in csv_files:
    df1 = pd.read_csv(f)
    df1 = df1.loc[:, ~df1.columns.str.contains('^Unnamed')]
    AD_test.append(df1)


df_AD_test=pd.concat(AD_test,axis=0,ignore_index=True)
label_AD=0 #AD : 0 label
df_AD_test['label']=label_AD

In [4]:
# Combine all csv files of HC_train and add label column
HC_train=[]
path=r"C:\Users\uttam\OneDrive\Desktop\NLP Project\NLP-Project\Data\HC_train"
csv_files=glob.glob(path+'/*.csv')
for f in csv_files:
    df2=pd.read_csv(f)
    df2 = df2.loc[:, ~df2.columns.str.contains('^Unnamed')]
    HC_train.append(df2)


df_HC_train=pd.concat(HC_train,axis=0,ignore_index=True)
label_HC=1 #HC : 1 label
df_HC_train['label']=label_HC


In [5]:
# Combine all csv files of HC_test and add label column
HC_test=[]
path=r"C:\Users\uttam\OneDrive\Desktop\NLP Project\NLP-Project\Data\HC_train\HC_Test"
csv_files=glob.glob(path+'/*.csv')
for f in csv_files:
    df2=pd.read_csv(f)
    df2 = df2.loc[:, ~df2.columns.str.contains('^Unnamed')]
    HC_test.append(df2)


df_HC_test=pd.concat(HC_test,axis=0,ignore_index=True)
label_HC=1 #HC : 1 label
df_HC_test['label']=label_HC

In [6]:
# concatenation of the AD & HC train sets, and similarly for the test sets
df_train=pd.concat([df_AD_train,df_HC_train], ignore_index=True)
df_test=pd.concat([df_AD_test,df_HC_test], ignore_index=True)

- Furthermore, the data samples are shuffled so that AD and HC samples are mixed rather than grouped by label, which helps build a robust classification model. After the pre-processing step, there are 1968 samples in the training set and 649 samples in the test set. The training and test sets are kept separate from the beginning so that sentences from one patient's script do not end up in both sets; keeping each patient's script intact is essential.
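
The difference between a true shuffle and sampling with replacement can be seen in a small dependency-free sketch. The row names are hypothetical stand-ins for data rows; in pandas, the corresponding switch is the `replace` argument of `DataFrame.sample`.

```python
import random

rows = [f"sentence_{i}" for i in range(10)]  # hypothetical stand-ins for data rows

rng = random.Random(0)

# Shuffle (sampling without replacement): every row appears exactly once.
shuffled = rng.sample(rows, k=len(rows))

# Bootstrap (sampling with replacement): rows may repeat and others may be missing.
bootstrapped = rng.choices(rows, k=len(rows))

print(sorted(shuffled) == sorted(rows))     # True: a permutation of the original rows
print(len(set(bootstrapped)) <= len(rows))  # True: duplicates are possible
```

Both results have the same length as the input, so the printed data set size alone does not reveal which of the two was used.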

In [7]:
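# shuffling
# NOTE: replace=True samples rows with replacement, so some rows repeat and others are dropped;
# use replace=False for a pure permutation of the rows.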
df_train = df_train.sample(frac=1,random_state=0,replace=True).reset_index(drop=True)
df_test = df_test.sample(frac=1,random_state=0,replace=True).reset_index(drop=True)
print('length of train', len(df_train),'.','length of test', len(df_test)) # output of train and test length
df_train

length of train 1968 . length of test 649


| | sentence | label |
|---|---|---|
| 0 | I heard it might but I'm sayin(g) let's tell h... | 0 |
| 1 | yes . | 0 |
| 2 | and the woman is standing in water . | 1 |
| 3 | the little girl's got her hand up for one . | 1 |
| 4 | wonder if he got a cookie . | 0 |
| ... | ... | ... |
| 1963 | fell down . | 0 |
| 1964 | uhhuh . | 1 |
| 1965 | that's a mess . | 0 |
| 1966 | touching lip . | 1 |


### Word Embedding

- The next step is to compute word embeddings for the sentences. This requires a tokenizer and the RoBERTa model. The tokenizer splits each sentence into tokens. The RoBERTa tokenizer uses the character 'Ġ' to mark a preceding space, as shown in the example below.


Example: 'My name is Khodiyar'
           ['My', 'Ġname', 'Ġis', 'ĠKhodiyar'] =>  Tokenized sentence (simplified; rare words may be split into several subword tokens)


- The tokenizer converts each token into a unique ID, which the RoBERTa model takes as input. The tokenizer also adds two special tokens, `<s>` at the beginning and `</s>` at the end of each sentence, to mark the sentence boundaries.


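- For convenience, all sentences are brought to the same length, that of the longest sentence, which is 47 tokens in our case. Shorter sentences are padded with zeros. Note that in the `roberta-base` vocabulary ID 0 is the `<s>` token and the padding token `<pad>` has ID 1, so zero padding without an attention mask can still influence the model output. The RoBERTa model configuration can be inspected with **model.config**.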
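
As a hedged sketch of padding, the helper below pads lists of token IDs to a common length and builds a matching attention mask. The function name `pad_and_mask` and the example IDs are hypothetical; the pad ID of 1 is the `<pad>` ID in the `roberta-base` vocabulary and is better read from `tokenizer.pad_token_id` in practice.

```python
def pad_and_mask(sequences, max_len, pad_id):
    """Pad each list of token IDs to max_len and build a 0/1 attention mask."""
    padded, mask = [], []
    for ids in sequences:
        if len(ids) > max_len:
            raise ValueError("sequence longer than max_len")
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


# Hypothetical token IDs: 0 and 2 stand for <s> and </s>, the rest are made up.
example = [[0, 100, 200, 2], [0, 300, 2]]
ids, mask = pad_and_mask(example, max_len=5, pad_id=1)
print(ids)   # [[0, 100, 200, 2, 1], [0, 300, 2, 1, 1]]
print(mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
```

Both lists can be wrapped in `torch.tensor(...)` and passed to the model as `input_ids` and `attention_mask`, so that the padded positions are ignored by self-attention.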

In [8]:
# loading of tokenizer and roberta model
tokenizer=RobertaTokenizer.from_pretrained('roberta-base')
Roberta = RobertaModel.from_pretrained('roberta-base',output_hidden_states=True)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
sentences_train=df_train['sentence'] # train
sentences_test=df_test['sentence']  # test

In [10]:
sentences_pool={'sentences_train':sentences_train,'sentences_test':sentences_test} # train and test pool
input_id_train=[] # list
input_id_test=[]  # list
max_len = 47

# convert each sentence into token IDs
for key in sentences_pool.keys():
    input_ids = [torch.tensor(tokenizer.encode(sentence, add_special_tokens=True)) for sentence in sentences_pool[key]]

    # Pad the sequences to have the same length
    
    padded_input = [torch.cat((input_id, torch.zeros((max_len - len(input_id)), dtype=torch.long))) for input_id in input_ids]
    padded_input = torch.stack(padded_input).to(torch.int64) # tensor

    if key=='sentences_train':
        input_id_train.append(padded_input)
    else :
        input_id_test.append(padded_input)
    

- After the words have been converted into IDs, the RoBERTa model is applied. The embeddings are computed with gradient tracking switched off: inside `torch.no_grad()`, PyTorch does not record the operations of the forward pass, so no gradients can be computed for them during backpropagation. This reduces the memory usage and computation time of the forward pass and avoids accidentally computing gradients for the embeddings. Since the model is used here only to extract embeddings and is not trained, gradients are not needed.

In [11]:
input_id_train=input_id_train[0] # get tensor from list

#  word embedding of train
with torch.no_grad():
    output = Roberta(input_id_train)
    last_hidden_state_train = output[2] # tuple of hidden states from all layers (embedding output + 12 encoder layers)

In [12]:
input_id_test=input_id_test[0] # get tensor from list

# word embedding of test
with torch.no_grad():
    output = Roberta(input_id_test)
    last_hidden_state_test = output[2]

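- The RoBERTa base model has 12 encoder layers, and each layer produces one embedding of the input, giving twelve embeddings in total (the returned hidden states additionally contain the output of the initial embedding layer). Every token is represented by a 768-dimensional vector. A better embedding can be obtained by combining layers; in this case, the sum of the last few encoder layers performed better for the classification task (the code below sums the last six). Each sentence then has a dimension of [1,47,768]. The mean or the entropy over the last dimension can reduce this: either function converts each 768-value vector into a single value, leaving 47 values per sentence.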
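
A dependency-free sketch of the reduction over the last dimension, using either the mean or the entropy. The toy matrix is hypothetical (3 tokens with 4 features instead of 47 with 768), and the function names are introduced here for illustration. The entropy variant first shifts all values to be non-negative and then normalizes each row to sum to one, which mirrors how `scipy.stats.entropy` treats unnormalized input.

```python
import math


def row_mean(matrix):
    """Mean over the last (feature) axis: one value per token."""
    return [sum(row) / len(row) for row in matrix]


def row_entropy(matrix):
    """Shannon entropy (natural log) over the last axis: one value per token.

    Assumes every shifted row has a positive sum.
    """
    lowest = min(min(row) for row in matrix)
    result = []
    for row in matrix:
        shifted = [v - lowest for v in row]   # make all values non-negative
        total = sum(shifted)
        probs = [v / total for v in shifted]  # normalize the row to sum to 1
        result.append(-sum(p * math.log(p) for p in probs if p > 0))
    return result


# Hypothetical embedding of one sentence: 3 tokens x 4 features.
sentence = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
    [0.5, 1.5, 2.5, 3.5],
]
print(row_mean(sentence))     # [1.0, 1.0, 2.0]
print(row_entropy(sentence))  # first value is log(4), second is log(2)
```

Either function maps a token's feature vector to a single number, so a sentence of 47 tokens becomes a vector of length 47.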

In [21]:
# Combine encoder outputs, standardize them, and reduce each sentence from [1,47,768] to [1,47].
# The mean over the last (feature) dimension is used; entropy is kept as a commented-out alternative.

'''
Dimension: shape of whole dataset after embedding: [2617,47,768],
           each sentence dimension: [1,47,768] shape 
           each word dimension: [1,768]
'''
x_train=[]
x_test=[]
embedding_pool={'x_train':last_hidden_state_train, 'x_test':last_hidden_state_test}

for key in embedding_pool.keys():
    final_embedding_sum=torch.stack(embedding_pool[key][-6:]).sum(0) # sum the outputs of the last six encoder layers
    #--------------------------------------------
    n_samples = final_embedding_sum.shape[0]
    n_features = np.prod(final_embedding_sum.shape[1:])

    X = final_embedding_sum.reshape(n_samples, n_features)

    # NOTE: the scaler is re-fit for each split, so the test set is standardized with its own statistics
    scaler = StandardScaler()
    scaler.fit(X)

    # Transform data using StandardScaler
    X_scaled = scaler.transform(X)

    # Reshape scaled data back to original tensor shape
    tensor_scaled = torch.tensor(X_scaled.reshape(n_samples, *final_embedding_sum.shape[1:]))

    
    #-------------------------------------------------------------
    print('Total dimension of', key, 'is (without entropy):' ,tensor_scaled.shape)

    final_embedding = torch.mean(tensor_scaled, dim=-1) # mean over the 768 features: one value per token

    # entropy
    # tensor_scaled=tensor_scaled-(torch.min(tensor_scaled))
    # final_embedding = entropy(tensor_scaled, axis=-1) 

    print('Final embedding shape', key, 'is (after mean):',final_embedding.shape) # each sentence becomes a vector of length 47
    print('----------')
    if key=='x_train':
        x_train.append(final_embedding)
    else:
        x_test.append(final_embedding)


Total dimension of x_train is (without entropy): torch.Size([1968, 47, 768])
Final embedding shape x_train is (after mean): torch.Size([1968, 47])
----------
Total dimension of x_test is (without entropy): torch.Size([649, 47, 768])
Final embedding shape x_test is (after mean): torch.Size([649, 47])
----------


## Classification

- The embedding part is now complete, and the classification part begins. The embeddings obtained above are used for training, together with their labels.

- A Gaussian process classifier is used because the data are fairly high-dimensional and uncertain; a Gaussian process gives probabilistic predictions that account for this uncertainty.
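
For reference, a minimal sketch of the kernel that `1.0 * RBF(1.0)` denotes in scikit-learn: a constant factor times a squared-exponential (RBF) kernel. The function and parameter names below are hypothetical; the values of 1.0 are only starting points, since scikit-learn optimizes kernel hyperparameters during fitting by default.

```python
import math


def rbf_kernel(x, y, length_scale=1.0, variance=1.0):
    """Scaled RBF kernel: variance * exp(-||x - y||^2 / (2 * length_scale^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return variance * math.exp(-squared_distance / (2.0 * length_scale ** 2))


print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # 1.0: identical points are maximally similar
print(rbf_kernel([0.0], [1.0]))            # exp(-0.5), about 0.607
```

The kernel value decays with the distance between two embeddings, so sentences with similar embeddings are expected to receive similar class probabilities.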

In [22]:
# classification task
x_train, x_test, y_train, y_test= x_train[0], x_test[0], df_train['label'], df_test['label']

# Classification model: GaussianProcessClassifier 
Gaussian = GaussianProcessClassifier(kernel=1.0 * RBF(1.0),max_iter_predict=1000)


In [23]:
Gaussian.fit(x_train,y_train) # fit the model on the training embeddings
y_pred=Gaussian.predict(x_test) # predict labels for the test set

# Result Analysis

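- The classification results varied between runs as the data set was shuffled. Accuracy generally lay between 54% and 70% with the mean/entropy technique. Recall is the share of the samples that truly belong to a class that the model assigns to that class. Precision is the share of the samples assigned to a class that truly belong to it. In the run reported below, recall is 1.00 for AD and 0.00 for HC: the model assigns almost every sample to AD, so the accuracy of about 55% essentially equals the share of AD samples in the test set (355 of 649).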
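
Precision and recall can be computed by hand from the true and predicted labels. The sketch below uses hypothetical toy labels (not the notebook's data) and a helper name introduced here for illustration; `classification_report` reports the same quantities per class.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (0 = AD, 1 = HC); the classifier predicts AD for most samples.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
print(precision_recall(y_true, y_pred, positive=0))  # (0.6, 1.0)
print(precision_recall(y_true, y_pred, positive=1))  # precision 1.0, recall 1/3
```

A class can reach a recall of 1.0 simply because the model predicts it for nearly every sample, which is why recall should be read together with precision and with the recall of the other class.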

In [26]:
# accuracy score (results vary as the data set is shuffled each time: 54%-68%)
print(f"Accuracy score :{metrics.accuracy_score(y_test,y_pred)}")

# classification report
print('-----classification report-------')
class_labels=  [0,1] # AD : 0, HC : 1
print(classification_report(y_test, y_pred, labels=class_labels)) 

Accuracy score :0.5469953775038521
-----classification report-------
              precision    recall  f1-score   support

           0       0.55      1.00      0.71       355
           1       0.50      0.00      0.01       294

    accuracy                           0.55       649
   macro avg       0.52      0.50      0.36       649
weighted avg       0.53      0.55      0.39       649



# Conclusion

- In sum, selecting a good combination of encoder layers is important for obtaining the best word embedding, and the quality of the word embedding is the main factor in classification accuracy. RoBERTa is a powerful BERT variant for embedding. The selection is also challenging for generalization, because a chosen combination of encoder layers might not be a good choice for every word. Word embeddings are high-dimensional, and the dimensionality can be reduced with the mean or entropy function. Because of this high dimensionality, the Gaussian process model might be more useful than others. The achieved accuracy ranges from 54% to 70%, which might be improved by changing the classification model.
