# Abstract

- Numerous illnesses in medicine are connected to memory loss, and Alzheimer's disease is one of them. Early detection is crucial for treating this type of illness effectively. Deep learning (DL) and natural language processing (NLP) can point toward a solution. NLP makes word embedding possible, a process that turns words into vectors. These word embeddings are then classified into two classes: Alzheimer's disease (AD) and healthy control (HC). Word embedding takes a lot of effort and is computationally expensive. To evaluate the effectiveness of classification models, it is also crucial to look at precision and recall values. The accuracy of the model can still be improved by applying new approaches.

# Introduction
- Good mental health is necessary for the development of a better world. Although medical research has advanced significantly, some diseases still cannot be detected early enough to be treated. A classification method that combines **natural language processing** (NLP) and **deep learning** (DL) can help identify such disorders. Because of their busy schedules, people often neglect sleep, which may raise their risk of developing Alzheimer's disease as they get older.

- Alzheimer's is one of the diseases that remain hard to detect in their early phases, even today. Data science offers a possible approach: a combination of natural language processing and deep learning can be used to detect the disease. The Boston cookie theft picture was shown to people from both groups (AD and HC), and Alzheimer's can be classified based on their descriptions of the picture. NLP can transform sentences into word embeddings, in which words are represented as vectors of real numbers. This can be done with BERT (Bidirectional Encoder Representations from Transformers). BERT has several variants, such as RoBERTa and ELECTRA.

- RoBERTa (Robustly Optimized BERT Pretraining Approach) is a powerful model for word embedding. Each sentence is first split into tokens, and the encoder layers of the model then turn these tokens into embeddings.

- A support vector classification (SVC) model is useful for this high-dimensional word embedding, as the data carry a large amount of uncertainty. Accuracy is especially crucial in the medical field, and there are some limitations that cannot be overlooked. The right choice of word embedding and feature engineering can help improve classification accuracy.

In [1]:
# environment for this notebook
import torch

from transformers import RobertaModel, RobertaTokenizer

import numpy as np

import pandas as pd

import glob

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


from scipy.stats import entropy


### Data types

- There are two types of datasets. The first contains sentences collected from AD and HC participants, and the second contains health descriptions of the same participants together with their IDs. The first data type is the more important one for classifying Alzheimer's: it is used first for word embedding and later for classification. These data were collected from image descriptions given by people affected by Alzheimer's and people not affected by it.

- The second data type has several attributes: ID_Pitt, Gender, Age, Scholarity, MMSE_Pitt, and Session. The goal of this dataset is to analyze which parameter is most strongly connected with Alzheimer's disease. It has the same two classes, AD and HC.


### Pre-processing

- There is one csv file per patient description, so the files must be converted into an appropriate format for word embedding. All csv files are combined into one table, and a label is added to identify Alzheimer's disease (AD) and healthy control (HC). It is also important to set aside 30% of the files for the test set at the very beginning. This prevents sentences from the same transcript from ending up in both the train set and the test set, which is essential for keeping each patient's script unique to one set.
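The file-level split described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used for this notebook: the helper `split_files`, the file names, and the seed are assumptions made for the example, and only the 30% test fraction comes from the text.

```python
import random

def split_files(file_paths, test_fraction=0.3, seed=42):
    """Split a list of file paths into (train, test) at the file level."""
    paths = sorted(file_paths)          # sort first so the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(paths)
    n_test = round(len(paths) * test_fraction)
    return paths[n_test:], paths[:n_test]

# Hypothetical file names, one csv per patient transcript
files = [f"patient_{i:03d}.csv" for i in range(10)]
train_files, test_files = split_files(files)
print(len(train_files), len(test_files))   # 7 3
```

Because whole files are assigned to one side or the other, no sentence from a test transcript can appear in the train set.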

In [2]:
# Combine all csv files of AD_train and add label column
AD_train=[]
path=r"C:\Users\uttam\OneDrive\Desktop\NLP Project\NLP-Project\Data\AD_train"
csv_files=glob.glob(path+'/*.csv')
for f in csv_files:
    df1 = pd.read_csv(f)
    df1 = df1.loc[:, ~df1.columns.str.contains('^Unnamed')]
    AD_train.append(df1)


df_AD_train=pd.concat(AD_train,axis=0,ignore_index=True)
label_AD=0 #AD : 0 label
df_AD_train['label']=label_AD

In [3]:
# Combine all csv files of AD_test and add label column
AD_test=[]
path=r"C:\Users\uttam\OneDrive\Desktop\NLP Project\NLP-Project\Data\AD_train\AD_Test"
csv_files=glob.glob(path+'/*.csv')
for f in csv_files:
    df1 = pd.read_csv(f)
    df1 = df1.loc[:, ~df1.columns.str.contains('^Unnamed')]
    AD_test.append(df1)


df_AD_test=pd.concat(AD_test,axis=0,ignore_index=True)
label_AD=0 #AD : 0 label
df_AD_test['label']=label_AD

In [4]:
# Combine all csv files of HC_train and add label column
HC_train=[]
path=r"C:\Users\uttam\OneDrive\Desktop\NLP Project\NLP-Project\Data\HC_train"
csv_files=glob.glob(path+'/*.csv')
for f in csv_files:
    df2=pd.read_csv(f)
    df2 = df2.loc[:, ~df2.columns.str.contains('^Unnamed')]
    HC_train.append(df2)


df_HC_train=pd.concat(HC_train,axis=0,ignore_index=True)
label_HC=1 #HC : 1 label
df_HC_train['label']=label_HC


In [5]:
# Combine all csv files of HC_test and add label column
HC_test=[]
path=r"C:\Users\uttam\OneDrive\Desktop\NLP Project\NLP-Project\Data\HC_train\HC_Test"
csv_files=glob.glob(path+'/*.csv')
for f in csv_files:
    df2=pd.read_csv(f)
    df2 = df2.loc[:, ~df2.columns.str.contains('^Unnamed')]
    HC_test.append(df2)


df_HC_test=pd.concat(HC_test,axis=0,ignore_index=True)
label_HC=1 #HC : 1 label
df_HC_test['label']=label_HC

In [6]:
# concatenation of AD & HC train, and similarly for the test set
df_train=pd.concat([df_AD_train,df_HC_train], ignore_index=True)
df_test=pd.concat([df_AD_test,df_HC_test], ignore_index=True)

- Furthermore, the samples are shuffled so that the class order of the concatenated tables does not carry over into training, which helps in building a robust classification model. After the pre-processing step, there are a total of 1968 samples in the train set and 649 samples in the test set.

In [7]:
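# shuffling (note: replace=True draws rows with replacement, so some rows repeat and
# others are dropped; replace=False would give a plain permutation of the rows)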
df_train = df_train.sample(frac=1,random_state=42,replace=True).reset_index(drop=True)
df_test = df_test.sample(frac=1,random_state=42,replace=True).reset_index(drop=True)
print('length of train', len(df_train),'.','length of test', len(df_test)) # output of train and test length
df_train

length of train 1968 . length of test 649


| | sentence | label |
|---|---|---|
| 0 | the mama's washin(g) dishes and the sink's run... | 1 |
| 1 | that little girl . | 1 |
| 2 | and over in the window you see through the kit... | 0 |
| 3 | oh this is gonna be like looking at those arti... | 1 |
| 4 | you want something else ? | 1 |
| ... | ... | ... |
| 1963 | I guess you take a big sledge hammer to that . | 0 |
| 1964 | and two children . | 0 |
| 1965 | the little girl is holding out her hand for a ... | 1 |
| 1966 | and then back there's a yard . | 1 |
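The shuffling cell calls `sample` with `replace=True`. The difference between a plain shuffle and sampling with replacement can be illustrated with a small sketch; the ten integer "rows" and the seed are assumptions for the example, and the standard library stands in for pandas.

```python
import random

rows = list(range(10))                 # stand-in for DataFrame row indices
rng = random.Random(42)

# Plain shuffle: every row appears exactly once (comparable to replace=False)
shuffled = rng.sample(rows, k=len(rows))

# Sampling with replacement: same length, but rows may repeat (comparable to replace=True)
resampled = rng.choices(rows, k=len(rows))

print(sorted(shuffled) == rows)        # True: a permutation keeps every row
print(len(set(resampled)), "unique rows out of", len(resampled))
```

With replacement, the number of samples stays the same, so the row count alone does not reveal whether some sentences were duplicated or left out.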


### Word Embedding

- Word embedding of the sentences is the next stage. It requires a tokenizer and the RoBERTa model. The tokenizer splits each sentence into tokens. RoBERTa's tokenizer uses the 'Ġ' character to represent a preceding space, as shown in the example below.


- Example: 'My name is Khodiyar'
           ['My', 'Ġname', 'Ġis', 'ĠKhodiyar'] =>  Tokenized sentence


- Each token is then mapped by the tokenizer to an integer id that the RoBERTa model can use for further processing. The tokenizer also inserts two special tokens, one at the start and one at the end of each sentence, to mark the sentence boundaries.

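- To make this job more convenient, we set the length of the longest sentence, 47 tokens in our case, as the standard length for all sentences. Shorter sentences are padded with zeros up to this length. Note that id 0 is RoBERTa's start-of-sentence token rather than its padding token (**tokenizer.pad_token_id**), and no attention mask is passed to the model, so the padded positions still take part in the forward pass. **model.config** can be used to verify the RoBERTa model configuration.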

In [8]:
# loading of tokenizer and roberta model
tokenizer=RobertaTokenizer.from_pretrained('roberta-base')
Roberta = RobertaModel.from_pretrained('roberta-base',output_hidden_states=True)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
sentences_train=df_train['sentence'] # train
sentences_test=df_test['sentence']  # test

In [10]:
sentences_pool={'sentences_train':sentences_train,'sentences_test':sentences_test} # train and test pool
input_id_train=[] # list
input_id_test=[]  # list
max_len = 47

# convert sentence into token
for key in sentences_pool.keys():
    input_ids = [torch.tensor(tokenizer.encode(sentence, add_special_tokens=True)) for sentence in sentences_pool[key]]

    # Pad the sequences to have the same length
    
    padded_input = [torch.cat((input_id, torch.zeros((max_len - len(input_id)), dtype=torch.long))) for input_id in input_ids]
    padded_input = torch.stack(padded_input).to(torch.int64) # tensor

    if key=='sentences_train':
        input_id_train.append(padded_input)
    else :
        input_id_test.append(padded_input)
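A common alternative to zero-padding is to pad with the tokenizer's own padding id and to build an attention mask that marks which positions are real tokens. The sketch below shows the idea in plain Python; the helper `pad_with_mask` and the example ids are hypothetical, and the value 1 is the padding id of `roberta-base` as far as I know. It is not what the cell above does.

```python
def pad_with_mask(token_ids, max_len, pad_id):
    """Right-pad a list of token ids and build the matching attention mask."""
    if len(token_ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(token_ids)
    padded = token_ids + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return padded, mask

# Hypothetical ids; 0 and 2 stand for the start and end special tokens
ids = [0, 2387, 766, 16, 2]
padded, mask = pad_with_mask(ids, max_len=8, pad_id=1)
print(padded)   # [0, 2387, 766, 16, 2, 1, 1, 1]
print(mask)     # [1, 1, 1, 1, 1, 0, 0, 0]
```

In a real pipeline the mask would be passed to the model as its `attention_mask` argument, so that padded positions are ignored by the attention layers.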
    

- Once the words have been converted into ids, they are passed through the RoBERTa model. Gradient tracking is switched off for this step with **torch.no_grad()**, so PyTorch does not record the operations used to compute the word embeddings and cannot backpropagate through them. Because the model serves here only as a fixed feature extractor, this reduces the memory usage and computation time of the forward pass. It also prevents errors that could occur if gradients for the word embeddings were computed unintentionally.


In [11]:
input_id_train=input_id_train[0] # get tensor from list

#  word embedding of train
with torch.no_grad():
    output = Roberta(input_id_train)
    last_hidden_state_train = output[2] # output[2] is the tuple of hidden states from all layers (not only the last one), used in the next step

In [12]:
input_id_test=input_id_test[0] # get tensor from list

# word embedding of test
with torch.no_grad():
    output = Roberta(input_id_test)
    last_hidden_state_test = output[2]

- The RoBERTa base model consists of 12 encoder layers, and each layer produces its own embedding in which every token is represented by a 768-dimensional vector. Combining the outputs of several encoders can give a better representation than a single layer; in this instance, the sum of the final six encoder outputs served the classification objective best. The dimension of each sentence is now [1,47,768]. The entropy or mean function is then used for dimension reduction: applied over the last axis, it transforms each sentence into a single array of dimension [1,47].

In [13]:
# combine the outputs of several encoders, then use the mean (or entropy) over the last axis to reduce each sentence from [1,47,768] to [1,47]

'''
Dimension: shape of whole dataset after embedding: [2617,47,768],
           each sentence dimension: [1,47,768] shape 
           each word dimension: [1,768]
'''
x_train=[]
x_test=[]
embedding_pool={'x_train':last_hidden_state_train, 'x_test':last_hidden_state_test}

for key in embedding_pool.keys():
    final_embedding_sum=torch.stack(embedding_pool[key][-6:]).sum(0) # sum the outputs of the last six encoder layers
    #--------------------------------------------
#     n_samples = final_embedding_sum.shape[0]
#     n_features = np.prod(final_embedding_sum.shape[1:])

#     X = final_embedding_sum.reshape(n_samples, n_features)

#     scaler = MinMaxScaler()
#     scaler.fit(X)

#    # Transform data using StandardScaler
#     X_scaled = scaler.transform(X)

#    # Reshape scaled data back to original tensor shape
#     tensor_scaled = torch.tensor(X_scaled.reshape(n_samples, *final_embedding_sum.shape[1:]))

    
    #-------------------------------------------------------------
    print('Total dimension of', key, 'is (without entorpy):' ,final_embedding_sum.shape)

    # mean
    final_embedding = torch.mean(final_embedding_sum, dim=-1) # mean of last dimension

    # entropy
    # tensor_scaled=final_embedding_sum-(torch.min(final_embedding_sum))
    # final_embedding = entropy(tensor_scaled, axis=-1) 


    print(torch.min(final_embedding),torch.max(final_embedding))
    print('Final embedding shape', key, 'is (after entropy):',final_embedding.shape) # one value per token position: each sentence becomes a vector of length 47
    print('----------')
    if key=='x_train':
        x_train.append(final_embedding)
    else:
        x_test.append(final_embedding)


Total dimension of x_train is (without entorpy): torch.Size([1968, 47, 768])
tensor(0.1589) tensor(0.2211)
Final embedding shape x_train is (after entropy): torch.Size([1968, 47])
----------
Total dimension of x_test is (without entorpy): torch.Size([649, 47, 768])
tensor(0.1589) tensor(0.2220)
Final embedding shape x_test is (after entropy): torch.Size([649, 47])
----------
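The shape bookkeeping of this step can be checked with a small NumPy sketch. Random arrays stand in for the real hidden states, and the count of 13 arrays (the embedding output plus 12 encoder layers) reflects my understanding of what `output_hidden_states=True` returns for `roberta-base`. The second reduction, over the token axis, is shown only for comparison and is not used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sentences, seq_len, hidden = 4, 47, 768

# Stand-in for the tuple of hidden states: 13 arrays of shape [n, seq_len, hidden]
hidden_states = tuple(
    rng.normal(size=(n_sentences, seq_len, hidden)) for _ in range(13)
)

summed = np.stack(hidden_states[-6:]).sum(axis=0)   # sum of the last six layers

per_token_mean = summed.mean(axis=-1)    # average over the 768 features -> [n, 47]
per_sentence_mean = summed.mean(axis=1)  # average over the 47 tokens   -> [n, 768]

print(summed.shape, per_token_mean.shape, per_sentence_mean.shape)
# (4, 47, 768) (4, 47) (4, 768)
```

Averaging over the last axis keeps one number per token position, which matches the [1968, 47] and [649, 47] shapes printed above.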


## Classification

- The first part, word embedding, is now complete, and classification follows. The embeddings obtained above are used for training together with their labels.

- A support vector classification (SVC) model works better than the other models here, as the data are fairly high-dimensional. The data also do not appear to follow a normal distribution, so a normalization technique (min-max scaling) is applied before classification.

In [14]:
# classification task
x_train, x_test, y_train, y_test= x_train[0], x_test[0], df_train['label'], df_test['label']
 
# Classification model: SVC with an RBF kernel, preceded by min-max scaling
SVC_model = SVC(gamma=2, C=1)
model = make_pipeline(MinMaxScaler(), SVC_model)

In [15]:
model.fit(x_train,y_train) # fit to the model
y_pred=model.predict(x_test) # evaluation
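The pipeline above fits the scaler on the train set only and then applies the same transformation to the test set. A minimal NumPy sketch of that behaviour follows; the helpers `fit_min_max` and `apply_min_max` and the toy arrays are assumptions for illustration, not scikit-learn's implementation.

```python
import numpy as np

def fit_min_max(x_train):
    """Return the per-feature minimum and range computed on the training data only."""
    lo = x_train.min(axis=0)
    span = x_train.max(axis=0) - lo
    span[span == 0] = 1.0               # avoid division by zero for constant features
    return lo, span

def apply_min_max(x, lo, span):
    """Scale x with statistics that were fitted on the training data."""
    return (x - lo) / span

train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
test = np.array([[1.0, 40.0]])

lo, span = fit_min_max(train)
print(apply_min_max(train, lo, span))   # every training column now spans [0, 1]
print(apply_min_max(test, lo, span))    # [[0.25 1.5 ]]: test values may leave [0, 1]
```

Reusing the train statistics keeps information about the test set out of the fitted model.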

# Result Analysis

- The accuracy generally lay between 56% and 70% with the mean/entropy technique. Recall (TP/(TP+FN)) is the share of the items that truly belong to a class that the algorithm found. Precision (TP/(TP+FP)) is the share of the items assigned to a class that truly belong to it. In the report below, recall is 63% for class 0 (AD) and 48% for class 1 (HC), which means the SVC is more reliable at finding the AD class than the HC class. Similarly, precision is higher for AD (59%) than for HC (52%). From these observations, the classification model can distinguish the transcripts of Alzheimer's patients from those of healthy people only to a limited extent.

In [16]:
# accuracy score (results vary between runs: 54%-68%)
print(f"Accuracy score :{metrics.accuracy_score(y_test,y_pred)}")

# classification report
print('-----classification report-------')
class_labels=  [0,1] # AD : 0, HC : 1
print(classification_report(y_test, y_pred, labels=class_labels)) 

Accuracy score :0.559322033898305
-----classification report-------
              precision    recall  f1-score   support

           0       0.59      0.63      0.61       353
           1       0.52      0.48      0.50       296

    accuracy                           0.56       649
   macro avg       0.55      0.55      0.55       649
weighted avg       0.56      0.56      0.56       649
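The precision and recall definitions used in the report can be worked through by hand. The sketch below computes them from raw counts; the function name and the counts are hypothetical and are not taken from the run above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for a single class
p, r, f1 = precision_recall_f1(tp=60, fp=40, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.60 recall=0.75 f1=0.67
```

Here 60 of the 100 items predicted as the class are correct (precision), while 60 of the 80 true members of the class are found (recall).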



# Conclusion

- In conclusion, busy schedules lead people to sleep less, which may occasionally contribute to Alzheimer's disease. Memory loss may be lessened if Alzheimer's disease is identified earlier. RoBERTa, a BERT variant, is a powerful model for word embedding. To obtain a good word embedding, it is crucial to choose a suitable combination of encoders; this is a key element in accurate classification. Word embeddings are high-dimensional, and the mean/entropy function can be used to reduce the dimensionality. The SVC model may be more suitable than others because of this high dimensionality. Class 0 (AD) and class 1 (HC) achieved recall results of 63% and 48%, respectively.
However, there are certain limitations, such as the possibility that the chosen combination of encoders is not suitable for every word, which can prevent the model from generalizing. Even so, the accuracy can be increased by altering the process used to create the word embedding and by finding a more suitable classification model.