## Import Libraries

In [1]:
#Import the libraries
import pandas as pd
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score

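**Comment**: We import pandas for data handling, four scikit-learn classifiers (SVC, MultinomialNB, DecisionTreeClassifier, LogisticRegression), TfidfVectorizer for turning text into numeric features, and the model-selection helpers used for splitting, cross-validation, and grid search.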

## Load the Dataset

In [2]:
# Load the dataframe from csv
df = pd.read_csv('https://raw.githubusercontent.com/ktxdev/symp-check/main/backend/data/processed/symptoms_disease.csv')
# Display dataframe
df.head()

| Unnamed: 0 | diseases | symptoms |
|---|---|---|
| 0 | panic disorder | anxiety and nervousness ,shortness of breath ,... |
| 1 | panic disorder | shortness of breath ,depressive or psychotic s... |
| 2 | panic disorder | anxiety and nervousness ,depression ,shortness... |
| 3 | panic disorder | anxiety and nervousness ,depressive or psychot... |
| 4 | panic disorder | anxiety and nervousness ,depression ,insomnia ... |


**Comment**: In this step, we load the dataset from a CSV file and display the first five rows. The output shows two columns of interest: `diseases` (the label) and `symptoms` (a comma-separated list of symptoms stored as text).
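The `Unnamed: 0` column in the output usually means the row index was written into the CSV when it was saved. A minimal sketch of reading that column back as the index with `index_col=0` follows. It uses a small inline sample instead of the real file, and the two sample rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the layout of the real CSV
sample_csv = io.StringIO(
    ",diseases,symptoms\n"
    '0,panic disorder,"anxiety and nervousness ,insomnia"\n'
    '1,panic disorder,"shortness of breath ,insomnia"\n'
)

# index_col=0 treats the first (unnamed) column as the row index
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

The same argument could be passed to the `pd.read_csv` call on the real URL, assuming the first column there is indeed a saved index.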

## Split the data into a training and testing set

In [3]:
# Split the data into training and test sets
train, test = train_test_split(df, test_size=0.2, random_state=42)

# assigning the train and test variables
X_train = train['symptoms']
X_test = test['symptoms']
y_train = train['diseases'] 
y_test = test['diseases']

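**Comment**: In the cell above we split the data into a training set (80%) and a test set (20%), with a fixed `random_state` so the split is reproducible. We then separate the features (X, the `symptoms` text) from the labels (y, the `diseases`), so that each model can be evaluated on data it did not see during training.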
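A plain random split does not guarantee that every disease appears in the same proportion in both sets. One option, shown here as a sketch on a made-up toy frame rather than the real data, is the `stratify` argument of `train_test_split`. It requires every class to have at least two rows, which may not hold for the full dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 8 rows of one disease and 2 of another
toy = pd.DataFrame({
    "symptoms": [f"symptom {i}" for i in range(10)],
    "diseases": ["panic disorder"] * 8 + ["asthma"] * 2,
})

# stratify keeps the 80/20 class ratio in both halves of the split
toy_train, toy_test = train_test_split(
    toy, test_size=0.5, random_state=42, stratify=toy["diseases"]
)

train_counts = toy_train["diseases"].value_counts().to_dict()
test_counts = toy_test["diseases"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```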


## Text Vectorization

In [4]:
# Vectorize the text data
tfidf_vectorizer = TfidfVectorizer(min_df=10)
# Fit the training data
X_train_bow = tfidf_vectorizer.fit_transform(X_train)
# Transform the test data
X_test_bow = tfidf_vectorizer.transform(X_test)

print(X_train_bow.shape)
print(X_test_bow.shape)

(395112, 340)
(98778, 340)


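**Comment**: We used TfidfVectorizer to turn the symptom text into numeric features. Setting `min_df=10` drops any term that appears in fewer than 10 training documents. The vectorizer is fit on the training data only and then applied to the test data, so the test set does not influence the vocabulary. The resulting shapes are (395112, 340) for training and (98778, 340) for testing: rows are documents and columns are vocabulary terms. Note that, despite the `_bow` suffix in the variable names, the values are TF-IDF weights rather than raw counts.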
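With its default settings, TfidfVectorizer tokenizes on word boundaries, so a multi-word symptom such as "shortness of breath" becomes three separate terms. The sketch below contrasts that with a comma-based tokenizer that keeps each symptom as one feature. The helper `split_on_commas` and the three toy documents are hypothetical, and this is an alternative to consider rather than what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents in the same comma-separated style as the dataset
docs = [
    "anxiety and nervousness ,shortness of breath ,insomnia",
    "shortness of breath ,chest tightness",
    "anxiety and nervousness ,insomnia",
]


def split_on_commas(text):
    """Return one token per comma-separated symptom, with whitespace trimmed."""
    return [part.strip() for part in text.split(",") if part.strip()]


# Default behaviour: one feature per word
word_vec = TfidfVectorizer()
word_matrix = word_vec.fit_transform(docs)

# Custom tokenizer: one feature per whole symptom
symptom_vec = TfidfVectorizer(tokenizer=split_on_commas, token_pattern=None)
symptom_matrix = symptom_vec.fit_transform(docs)

print(sorted(word_vec.vocabulary_))
print(sorted(symptom_vec.vocabulary_))
```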


## Model Selection and Cross Validation
### Support Vector Machine

In [5]:
# fit the SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_bow, y_train)
# Perform cross-validation
svm_model_acc = cross_val_score(estimator=svm_model, X = X_train_bow, y = y_train, cv = 5, n_jobs = -1)
# Return the array of accuracy scores for each fold
svm_model_acc



array([0.86428002, 0.86462169, 0.86120321, 0.86363291, 0.86383539])

**Comment**: We applied a Support Vector Machine (SVM) with a linear kernel to the vectorized data and assessed its performance with 5-fold cross-validation on the training set. The fold accuracies are [0.86428002, 0.86462169, 0.86120321, 0.86363291, 0.86383539].
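Training `SVC` scales at least quadratically with the number of samples, so it can be slow on a training set of this size. scikit-learn's `LinearSVC` is a commonly used alternative for large datasets. It is not a drop-in equivalent: it uses a different loss by default and a one-vs-rest multiclass scheme, so its scores may differ. A minimal sketch on made-up toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data, not taken from the real dataset
toy_symptoms = [
    "anxiety insomnia", "anxiety palpitations", "insomnia palpitations",
    "cough wheezing", "cough fever", "wheezing fever",
]
toy_diseases = ["panic disorder"] * 3 + ["asthma"] * 3

toy_features = TfidfVectorizer().fit_transform(toy_symptoms)

# LinearSVC fits a linear SVM with a solver designed for many samples
linear_svm = LinearSVC(C=1.0)
linear_svm.fit(toy_features, toy_diseases)

toy_predictions = linear_svm.predict(toy_features)
print(toy_predictions)
```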

### Logistic Regression

In [6]:
# Initialize Logistic Regression model
lg_model = LogisticRegression()
# Fit the model on the training data
lg_model.fit(X_train_bow, y_train)
# Perform cross-validation
lg_model_acc = cross_val_score(estimator=lg_model, X = X_train_bow, y = y_train, cv = 5, n_jobs = -1)
# Return the array of accuracy scores for each fold
lg_model_acc



array([0.86234387, 0.86164787, 0.85943155, 0.86127914, 0.86165878])

**Comment**: We trained a Logistic Regression model on the vectorized data and evaluated its performance using cross-validation. The model demonstrated consistent accuracy scores of [0.86234387, 0.86164787, 0.85943155, 0.86127914, 0.86165878] across different folds.
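Because the vectorizer was fit on the whole training set before cross-validation, each validation fold has already contributed to the vocabulary and IDF weights. The effect is often small, but it can be avoided by wrapping the vectorizer and the classifier in a pipeline so that both are refit inside every fold. A sketch on made-up toy data (the `max_iter=1000` setting is an assumption to help convergence, not something the notebook uses):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, four documents per class
toy_symptoms = [
    "anxiety insomnia", "anxiety palpitations", "insomnia palpitations",
    "anxiety insomnia palpitations",
    "cough wheezing", "cough fever", "wheezing fever",
    "cough wheezing fever",
]
toy_diseases = ["panic disorder"] * 4 + ["asthma"] * 4

# The pipeline refits TF-IDF on the training part of each fold only
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Raw text goes in; vectorization happens inside each fold
fold_scores = cross_val_score(pipeline, toy_symptoms, toy_diseases, cv=2)
print(fold_scores)
```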

### Decision Tree Classifier

In [7]:
# Initialize Decision Tree Classifier model
dtc_model = DecisionTreeClassifier()
# Fit the model on the training data
dtc_model.fit(X_train_bow, y_train)
# Perform cross-validation
dtc_model_acc = cross_val_score(estimator=dtc_model, X=X_train_bow, y=y_train, cv=5, n_jobs=-1)
# Return the array of accuracy scores for each fold
dtc_model_acc



array([0.83662984, 0.83680701, 0.83412214, 0.8357546 , 0.83476753])

**Comment**: We have trained a Decision Tree Classifier on the vectorized data and evaluated its accuracy using cross-validation. The accuracy scores obtained from the cross-validation are as follows: 0.83662984, 0.83680701, 0.83412214, 0.8357546, 0.83476753.

### Multinomial Naive Bayes

In [8]:
# Initialize Multinomial Naive Bayes model
mnb_model = MultinomialNB()
# Fit the model on the training data
mnb_model.fit(X_train_bow, y_train)
# Perform cross-validation
mnb_model_acc = cross_val_score(estimator=mnb_model, X=X_train_bow, y=y_train, cv=5, n_jobs=-1)
# Return the array of scores for each fold
mnb_model_acc



array([0.79803348, 0.79622388, 0.79699324, 0.79581635, 0.79634785])

**Comment**: We trained a Multinomial Naive Bayes model on the vectorized data and assessed its performance using cross-validation. The fold accuracies are 0.79803348, 0.79622388, 0.79699324, 0.79581635, and 0.79634785.
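Five fold scores per model are easier to compare as a mean and a standard deviation. The sketch below summarizes the fold accuracies copied from the four cell outputs above; the dictionary and variable names are introduced here for illustration.

```python
import numpy as np

# Fold accuracies copied from the cross-validation outputs above
cv_scores = {
    "Support Vector Machine": [0.86428002, 0.86462169, 0.86120321, 0.86363291, 0.86383539],
    "Logistic Regression": [0.86234387, 0.86164787, 0.85943155, 0.86127914, 0.86165878],
    "Decision Tree Classifier": [0.83662984, 0.83680701, 0.83412214, 0.8357546, 0.83476753],
    "Multinomial Naive Bayes": [0.79803348, 0.79622388, 0.79699324, 0.79581635, 0.79634785],
}

# Mean and standard deviation of the fold accuracies for each model
summary = {name: (np.mean(s), np.std(s)) for name, s in cv_scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} +/- {std:.4f}")

# Model with the highest mean cross-validation accuracy
best_model = max(summary, key=lambda name: summary[name][0])
print("Best by mean CV accuracy:", best_model)
```

In the notebook itself, the same summary could be computed directly from `svm_model_acc`, `lg_model_acc`, `dtc_model_acc`, and `mnb_model_acc`.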

## Evaluation

In [9]:
# Evaluate Support Vector Machine model accuracy on test set
print("Support Vector Machine Accuracy:\t", svm_model.score(X_test_bow, y_test))

# Evaluate Decision Tree Classifier accuracy on test set
print("Decision Tree Classifier Accuracy:\t", dtc_model.score(X_test_bow, y_test))

# Evaluate Multinomial Naive Bayes accuracy on test set
# Note: MultinomialNB accepts sparse matrices, so no .toarray() conversion is needed
print("Multinomial Naive Bayes Accuracy:\t", mnb_model.score(X_test_bow, y_test))

# Evaluate Logistic Regression accuracy on test set
print("Logistic Regression Accuracy:\t\t", lg_model.score(X_test_bow, y_test))

Support Vector Machine Accuracy:	 0.8644839944117111
Decision Tree Classifier Accuracy:	 0.841381684180688
Multinomial Naive Bayes Accuracy:	 0.8022029196784709
Logistic Regression Accuracy:		 0.863066674765636
Logistic Regression Accuracy:		 0.863066674765636


**Comment**: We evaluated each model on the held-out test data. The Support Vector Machine reached an accuracy of 86.45%, Logistic Regression 86.31%, the Decision Tree Classifier 84.14%, and Multinomial Naive Bayes 80.22%. The ranking matches the one from cross-validation.
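Accuracy alone can hide poor performance on rare diseases, because frequent classes dominate the score. A macro-averaged F1 score weights every class equally and makes that visible. The sketch below uses made-up labels to show the difference; it does not reflect the real models' results.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: the rare class "migraine" is never predicted
y_true = ["panic disorder", "panic disorder", "asthma", "asthma", "asthma", "migraine"]
y_pred = ["panic disorder", "asthma", "asthma", "asthma", "asthma", "asthma"]

# Fraction of predictions that are correct
accuracy = accuracy_score(y_true, y_pred)

# Unweighted mean of per-class F1; a class with no predictions scores 0
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
```

On the real data, the same two calls could be applied to `y_test` and the output of each model's `predict(X_test_bow)`.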

## Tuning Hyperparameters

In [10]:
# Define hyperparameters to search
params = {'C': [0.1, 1, 10], 'solver': ['liblinear']}

# Initialize Logistic Regression model
# (no separate fit is needed: GridSearchCV clones and fits it for every candidate)
lg_model = LogisticRegression()

# Setting up GridSearchCV
gscv = GridSearchCV(lg_model, params, cv=5, n_jobs=-1)
# Perform grid search on training data
gscv.fit(X_train_bow, y_train)

# Print best parameters found
print("Best Params:\t", gscv.best_params_)

# Print accuracy on test set
print("Accuracy:\t\t", gscv.score(X_test_bow, y_test))



Best Params:	 {'C': 10, 'solver': 'liblinear'}
Accuracy:		 0.8639373139767964


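**Comment**: We tuned the Logistic Regression model by grid-searching the regularization strength `C` over [0.1, 1, 10] with the `liblinear` solver, using 5-fold cross-validation on the training data. The best setting found was {'C': 10, 'solver': 'liblinear'}, which raised test accuracy from 86.31% to 86.39%. This is a small gain that still leaves the model just below the SVM's 86.45%. Because the best `C` sits at the upper edge of the grid, larger values may be worth trying.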
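`best_params_` reports only the winning setting. To see how close the other candidates were, the `cv_results_` attribute of a fitted `GridSearchCV` holds the mean validation score for every parameter combination. A sketch on made-up toy data, using the default solver and `max_iter=1000` as assumptions rather than the notebook's settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data, six documents per class
toy_symptoms = [
    "anxiety insomnia", "anxiety palpitations", "insomnia palpitations",
    "anxiety insomnia palpitations", "anxiety dizziness", "insomnia dizziness",
    "cough wheezing", "cough fever", "wheezing fever",
    "cough wheezing fever", "cough chest tightness", "wheezing chest tightness",
]
toy_diseases = ["panic disorder"] * 6 + ["asthma"] * 6

toy_features = TfidfVectorizer().fit_transform(toy_symptoms)

# Search over the same C values as the notebook, with 3 folds for the tiny sample
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=3)
search.fit(toy_features, toy_diseases)

# One mean validation score per candidate setting
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, round(mean_score, 3))
```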