# Unit Test 3

Topics Covered:
* Generalized Linear Models
* K-Nearest Neighbors
* CART
* Random Forests
* Boosting
* Support Vector Machines

## Background
Storytelling is a key component of interpersonal communication, and the study of narrative ability in children can provide critical insights into their language development. Narrative sample analysis is a process in which an individual produces a narrative and a Speech-Language Pathologist (or similar practitioner) then analyzes its quality. One tool for measuring this quality is the Monitoring Indicators of Scholarly Language (MISL). It provides an objective measure of the macrostructure, or story elements (e.g., Characters, Setting, Initiating Event), as well as the microstructure, or grammatical elements.

Scoring the macrostructure can be very time-consuming, however, which makes ongoing monitoring less effective. This dataset provides the first publicly accessible data for attempting to automate macrostructure scoring via machine learning.

## Dataset
`AutomatedNarrativeAnalysisMISLData.csv`

## Task

Your goal is to predict the Initiating Event (`IE`) label. `IE` is scored as 0, 1, 2, or 3, but for our purposes it is acceptable to predict it as either a continuous or a categorical output. Note that if you predict it as continuous, you must constrain the predictions in some way so they stay on the 0 to 3 scale; for that reason, the categorical approach may be easier.
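
If you take the continuous route, one simple way to constrain the output is to round each prediction to the nearest integer and clip it to the valid range. The sketch below is only an illustration: the helper name `to_ie_score` and the sample values are made up, and other constraints (such as ordinal regression) are equally reasonable.

```python
import numpy as np

def to_ie_score(raw_predictions, low=0, high=3):
    """Map continuous predictions onto the integer IE scale [low, high].

    Hypothetical helper: rounds to the nearest integer, then clips.
    """
    rounded = np.rint(np.asarray(raw_predictions, dtype=float))
    return np.clip(rounded, low, high).astype(int)

# Made-up regression outputs, including values outside the 0-3 range
raw = [-0.4, 0.6, 1.49, 2.2, 3.7]
scores = to_ie_score(raw)
print(scores)  # [0 1 1 2 3]
```

Once predictions are mapped this way, they can be scored with the same classification metrics as a categorical model.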

For predictor variables, you have two choices: the raw text or the text features (or both, technically). The text features are every column **except** `Char`, `Sett`, `IE`, `Plan`, `Act`, and `Con`. Those six variables are the output scores, but again we'll focus only on `IE` for now. Also, exclude the `ID` column.
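
The column selection described above can be sketched as follows. This is a minimal example on a toy frame, not the real dataset: the feature names `wordCount` and `uniqueWords` are hypothetical placeholders, and the helper `split_features_target` is introduced only for illustration. It also sets aside the raw-text column `vecOfNarratives`, on the assumption that you are using the numeric text features alone.

```python
import pandas as pd

# The six output scores, plus the columns that are not numeric text features
OUTPUT_COLS = ["Char", "Sett", "IE", "Plan", "Act", "Con"]
NON_FEATURE_COLS = OUTPUT_COLS + ["ID", "vecOfNarratives"]

def split_features_target(df, target="IE"):
    """Return (X, y): X holds every column not listed in NON_FEATURE_COLS."""
    feature_cols = [c for c in df.columns if c not in NON_FEATURE_COLS]
    return df[feature_cols], df[target]

# Toy stand-in for the real CSV; wordCount and uniqueWords are hypothetical
toy = pd.DataFrame({
    "ID": [1, 2],
    "vecOfNarratives": ["a story", "another story"],
    "Char": [1, 2], "Sett": [0, 1], "IE": [2, 3],
    "Plan": [0, 0], "Act": [1, 1], "Con": [0, 2],
    "wordCount": [12, 30],
    "uniqueWords": [9, 21],
})

X_toy, y_toy = split_features_target(toy)
print(list(X_toy.columns))  # ['wordCount', 'uniqueWords']
```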

Using cross-validation, explore the many different classification algorithms we discussed to find the model with the highest performance (I'll leave it to you to define performance).
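
As a rough sketch of how such a comparison could be set up, the example below cross-validates two candidate models and reports two metrics. Everything here is an assumption made for illustration: the data is synthetic (from `make_classification`), macro-averaged F1 is just one possible definition of performance, and the two models are arbitrary picks from the topic list. The scaler sits inside a pipeline so that it is refit on the training portion of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class data standing in for the real features and IE labels
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0)

candidates = {
    # Scaling happens inside the pipeline, so each fold fits its own scaler
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv,
                            scoring=["accuracy", "f1_macro"])
    results[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "f1_macro": scores["test_f1_macro"].mean(),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"macro F1={results[name]['f1_macro']:.3f}")
```

Macro F1 weights every class equally, which can matter when some `IE` scores are rare; accuracy alone may hide poor performance on those classes.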

**Bonus**: The column `vecOfNarratives` contains the raw text. If you would like, feel free to use the TF-IDF method we discussed in the SVM lecture to create feature columns from the raw text. It's in the notebook titled `BBC_Text_preprocessing.ipynb`.
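
For orientation, a TF-IDF pipeline might look roughly like the sketch below. It is not a reproduction of the lecture notebook: the narratives and labels are invented, and the choices of `stop_words="english"` and `LinearSVC` are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented narratives and IE-style labels, for illustration only
texts = [
    "the boy walked to the park and played",
    "suddenly a dragon attacked the village so the girl made a plan",
    "the dog sat in the yard all day",
    "a storm destroyed the bridge and the family had to find another way",
    "the cat slept on the couch",
    "the robot broke down so the children decided to fix it",
]
labels = [0, 2, 0, 2, 0, 2]

# TfidfVectorizer turns each text into a weighted bag-of-words row;
# LinearSVC then classifies those rows
text_model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LinearSVC(),
)
text_model.fit(texts, labels)

# Inspect the shape of the generated feature matrix: (n_texts, n_terms)
tfidf = text_model.named_steps["tfidfvectorizer"]
tfidf_shape = tfidf.transform(texts).shape
print(tfidf_shape)

# Predict on an unseen, equally made-up narrative
predictions = text_model.predict(["the wolf chased the pigs so they built a house"])
print(predictions)
```

On the real data, `texts` would be `df['vecOfNarratives']` and `labels` would be `df['IE']`, with the pipeline passed to cross-validation in place of a bare model.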

*This is still very much an open task so any major improvements would likely be publication worthy.*

In [12]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [3]:
# Read in dataframe
df = pd.read_csv('AutomatedNarrativeAnalysisMISLData.csv')

# Clean data
df = df.drop(columns=['ID', 'vecOfNarratives', 'Char', 'Sett', 'Plan', 'Act', 'Con']) # Drop columns
df = df.dropna() # Drop rows with missing values

In [4]:
# Split data into features and target variable
X = df.drop(columns=['IE'])
y = df['IE']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [5]:
# Define model evaluation function
def evaluate_model(model, X_train, y_train):
    # Perform 5-fold cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"\tCross-validated accuracy scores: {cv_scores}")
    print(f"\tMean accuracy: {np.mean(cv_scores)}")
    print(f"\tStandard deviation: {np.std(cv_scores)}\n")
    
    # Train the model on training set and return it
    model.fit(X_train, y_train)
    return model

# List of models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Support Vector Machine": SVC()}

# Evaluate each model on the training set
for name, model in models.items():
    print(f"Evaluating {name}:")
    trained_model = evaluate_model(model, X_train, y_train)
    models[name] = trained_model

# Evaluate the models on the test set
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"Performance of {name} on the test set:")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted')}")
    print(classification_report(y_test, y_pred))
    print("\n")

Evaluating Logistic Regression:
	Cross-validated accuracy scores: [0.56716418 0.5        0.48484848 0.40909091 0.5       ]
	Mean accuracy: 0.4922207146087743
	Standard deviation: 0.05040331689500523

Evaluating K-Nearest Neighbors:
	Cross-validated accuracy scores: [0.52238806 0.46969697 0.48484848 0.5        0.42424242]
	Mean accuracy: 0.48023518769787427
	Standard deviation: 0.03296979991478287

Evaluating Decision Tree:
	Cross-validated accuracy scores: [0.41791045 0.42424242 0.28787879 0.54545455 0.43939394]
	Mean accuracy: 0.42297602894617814
	Standard deviation: 0.08191344603100235

Evaluating Random Forest:
	Cross-validated accuracy scores: [0.6119403  0.5        0.56060606 0.48484848 0.59090909]
	Mean accuracy: 0.5496607869742197
	Standard deviation: 0.049732861206520367

Evaluating Support Vector Machine:
	Cross-validated accuracy scores: [0.64179104 0.45454545 0.46969697 0.51515152 0.57575758]
	Mean accuracy: 0.5313885119855268
	Standard deviation: 0.06947182882212444

Perfor

In [14]:
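# Define hyperparameter grid for SVM.
# gamma has no effect on the linear kernel, so it is searched only for rbf.
svm_param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}]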

# GridSearchCV object for SVM
svm_grid_search = GridSearchCV(SVC(), svm_param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Perform Grid Search for SVM
svm_grid_search.fit(X_train, y_train)
best_svm = svm_grid_search.best_estimator_
print(f"Best SVM parameters: {svm_grid_search.best_params_}")
print(f"Best SVM cross-validated accuracy: {svm_grid_search.best_score_}")

# Evaluate the best SVM model on the test set
print("Performance of the best SVM on the test set:")
y_pred_svm = best_svm.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm)}")
print(f"F1 Score: {f1_score(y_test, y_pred_svm, average='weighted')}")
print(classification_report(y_test, y_pred_svm))

Best SVM parameters: {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
Best SVM cross-validated accuracy: 0.5313885119855268
Performance of the best SVM on the test set:
Accuracy: 0.5301204819277109
F1 Score: 0.4883527435660328
              precision    recall  f1-score   support

           0       0.83      0.59      0.69        17
           1       0.23      0.13      0.17        23
           2       0.53      0.83      0.65        36
           3       1.00      0.14      0.25         7

    accuracy                           0.53        83
   macro avg       0.65      0.42      0.44        83
weighted avg       0.55      0.53      0.49        83

