# Sample ML model testing
This notebook will create a model based on a few features we see are well correlated in multiple datasets. It will then run this model over all existing assembled datasets.

### Importing libraries
Because we want to display and interact with our data, this will take many libraries.

In [None]:
#basic ds
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#basic system
import sys
import os
import glob

# math and signals
import math
from scipy.stats import entropy
from scipy.signal import savgol_filter
from scipy.signal import find_peaks

# demo stuff
import ipywidgets as widgets
import seaborn 

In [None]:
# ml stuff
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn import tree
from sklearn import metrics
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import confusion_matrix
import joblib

## Importing and loading data
Here we take data from hardcoded paths. This can be changed later, but at this point the notebook is a proof of concept for our specific datasets, not something for all scientists everywhere to use without modification.
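
One way to loosen the hardcoded path without touching the rest of the notebook is to read the root folder from an environment variable and fall back to the current value. This is only a sketch: the variable name `BRAINSPIN_DATA_ROOT` and the helper `dataset_path` are assumptions made for illustration, not part of the project.

```python
import os

# Fallback: the path currently hardcoded in this notebook
DEFAULT_ROOT = "C:/Projects/brainspin/not_pushed/data_anonymized/assembled"

def dataset_path(filename, root=None):
    """Return the full path of an assembled dataset file (hypothetical helper)."""
    if root is None:
        # BRAINSPIN_DATA_ROOT is an assumed variable name, not an existing convention
        root = os.environ.get("BRAINSPIN_DATA_ROOT", DEFAULT_ROOT)
    return os.path.join(root, filename)

TOP_data = dataset_path("top_stitched.csv")
```

With this in place a collaborator could set the variable once instead of editing the cell below.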

In [None]:
root_data_path ="C:/Projects/brainspin/not_pushed/data_anonymized/assembled"
TOP_data       = os.path.join(root_data_path,'top_stitched.csv') 
StrokeMRI_data = os.path.join(root_data_path,'StrokeMRI_stitched.csv')
Insight46_data = os.path.join(root_data_path,'Insight46_stitched.csv')

TOP_frame = pd.read_csv(TOP_data)
StrokeMRI_frame = pd.read_csv(StrokeMRI_data)
Insight46_frame = pd.read_csv(Insight46_data)

## Examining relationships in data

Here we will separate out a subset of variables that earlier experiments and clinical
logic have shown to be of interest, then show heatmaps and examine plots.

In [None]:
#TOP_frame.columns

In [None]:
TOP_small_frame = TOP_frame[['Age','Sex','GM_ICVRatio', 'GMWM_ICVRatio','WMH_vol', 'WMH_count',]]
StrokeMRI_small_frame = StrokeMRI_frame[['Age','Sex','GM_ICVRatio', 'GMWM_ICVRatio','WMH_vol', 'WMH_count',]]
Insight46_small_frame = Insight46_frame[['Age','Sex','GM_ICVRatio', 'GMWM_ICVRatio','WMH_vol', 'WMH_count',]]

In [None]:
#StrokeMRI_small_frame.head(3)

In [None]:
#Insight46_small_frame.head(3)

In [None]:
#TOP_small_frame.head(3)

After examining the datasets we see that small adjustments are needed: we must drop the rows containing NaN, and remove one misplaced header row from the TOP dataset.
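
The same clean-up could be wrapped in one helper so that every dataset goes through identical steps. The sketch below is an alternative to the cells that follow, not the notebook's method: instead of slicing off the first row, it coerces non-numeric cells to NaN with `pd.to_numeric(errors='coerce')` and then drops them. The helper name `clean_frame` and the toy values are made up for illustration.

```python
import pandas as pd

FEATURES = ['Age', 'Sex', 'GM_ICVRatio', 'GMWM_ICVRatio', 'WMH_vol', 'WMH_count']

def clean_frame(frame, columns=FEATURES):
    """Select columns, coerce them to numeric, and drop rows with missing values."""
    # Non-numeric cells (such as a stray header row) become NaN and are dropped
    small = frame[columns].apply(pd.to_numeric, errors='coerce')
    return small.dropna().reset_index(drop=True)

# Made-up toy data: the first row imitates a misplaced header row,
# the last row has a missing Age
toy = pd.DataFrame({
    'Age': ['years', '61', '70', None],
    'Sex': ['integer', '0', '1', '1'],
    'GM_ICVRatio': ['ratio', '0.45', '0.41', '0.40'],
    'GMWM_ICVRatio': ['ratio', '0.80', '0.76', '0.75'],
    'WMH_vol': ['mL', '1.2', '3.4', '2.0'],
    'WMH_count': ['n', '5', '9', '7'],
})
cleaned = clean_frame(toy)
print(cleaned)
```

Coercion is convenient but silent: a column with many unparseable values will lose many rows, so it is worth comparing row counts before and after.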

In [None]:
TOP_small_frame = TOP_small_frame[1:]
TOP_small_frame = TOP_small_frame.dropna()
StrokeMRI_small_frame = StrokeMRI_small_frame.dropna()
Insight46_small_frame = Insight46_small_frame.dropna()

In [None]:
TOP_small_frame = TOP_small_frame.apply( pd.to_numeric)
StrokeMRI_small_frame = StrokeMRI_small_frame.apply(pd.to_numeric)
Insight46_small_frame = Insight46_small_frame.apply(pd.to_numeric)


In [None]:
%matplotlib inline
seaborn.heatmap(TOP_small_frame.corr(), annot = True)

In [None]:
%matplotlib inline
seaborn.heatmap(Insight46_small_frame.corr(), annot = True)

In [None]:
%matplotlib inline
seaborn.heatmap(StrokeMRI_small_frame.corr(), annot = True)

So we see that in the StrokeMRI and TOP datasets there is a strong negative correlation between age and brain size relative to cranial volume. There is a positive correlation between age and both the volume and the count of white matter hyperintensities. The Insight46 dataset does not have these expected correlations. Let's examine a bit further.

In [None]:
Insight46_small_frame.describe()

OK....the Insight46 dataset (or what is left when we get rid of NaN containing rows) basically has very little variance in age... so the results are not so strange after all.
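
A quick way to put the age spread of several datasets side by side is a small summary table. The sketch below uses synthetic ages, not the real data, and the helper name `age_spread` is made up. When the age range is this narrow, weak correlations with age are expected (range restriction).

```python
import numpy as np
import pandas as pd

def age_spread(frames):
    """Tabulate count, standard deviation and range of Age for each named frame."""
    rows = {}
    for name, frame in frames.items():
        age = frame['Age']
        rows[name] = {'n': len(age), 'std': age.std(), 'range': age.max() - age.min()}
    return pd.DataFrame(rows).T

# Synthetic ages for illustration only: one wide cohort, one very narrow cohort
rng = np.random.default_rng(0)
wide = pd.DataFrame({'Age': rng.uniform(20, 80, size=200)})
narrow = pd.DataFrame({'Age': rng.uniform(69, 71, size=200)})

summary = age_spread({'wide': wide, 'narrow': narrow})
print(summary)
```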

## Model making and saving
We will make a model on the TOP dataset and save it.

In [None]:
ml_matrix = TOP_small_frame.copy()
X = ml_matrix.drop('Age', axis =1)


In [None]:
y = ml_matrix['Age']
#y=y.astype('int')

In [None]:
#y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape

In [None]:
# # scale
# sc = StandardScaler()
# sc.fit(X_train)
# X_train = sc.transform(X_train)
# X_test = sc.transform(X_test)

In [None]:
linr = LinearRegression()
linr.fit(X_train, y_train)

In [None]:
y_pred = linr.predict(X_test)

In [None]:

# compare the true values with the predictions (not with themselves)
print('R2 score : %.3f' % metrics.r2_score(y_test, y_pred))
print('Explained variance score: %.3f'  % metrics.explained_variance_score(y_test, y_pred))
print('Mean absolute error: %.3f'  % metrics.mean_absolute_error(y_test, y_pred))

## Important note: results may vary. That is why we save the models. The model saved on 17 June 2023 had an MAE of 5.382

R2 score: 1 (an artifact: this value was computed by scoring `y_test` against itself, which always gives 1)
Explained variance score: 0.506
Mean absolute error: 5.382

Saving name was:
    '../result_models/T1_obvious4_linearreg_notricks_noints.sav'

Yeah! We have a model that can predict age with a mean absolute error of about 5 years, with no fancy k-folding or other optimization tricks. Those will come later.

In [None]:
# check if model folder exists and if not , then create
model_folder = '../result_models/'
if not os.path.exists(model_folder):
    os.makedirs(model_folder)

In [None]:
#file_given_name = input()

In [None]:
# save off file
file_given_name = 'T1_obvious4_linearreg_notricks_nointsB'
joblib.dump(linr, ('../result_models/' + file_given_name + '.sav'))
    

## Now we can apply the model to our other datasets
We will pull it from the saved file, as an example of reloading a model.

Let's also save it with pickle for convenience.

In [None]:
import pickle
pickle_filename = '../result_models/pickle_T1_obvious4_linearreg_notricks_nointsB.sav'
# use a context manager so the file is closed after writing
with open(pickle_filename, 'wb') as f:
    pickle.dump(linr, f)

In [None]:
model_filename = '../result_models/pickle_T1_obvious4_linearreg_notricks_nointsB.sav'
# use a context manager so the file is closed after reading
with open(model_filename, 'rb') as f:
    loaded_model = pickle.load(f)


In [None]:
MRIml_matrix = StrokeMRI_small_frame.copy()
MRIX = MRIml_matrix.drop('Age', axis =1)


len(MRIX)

In [None]:
MRIy = MRIml_matrix['Age']
#MRIy=y.astype('float')
len(MRIy)

In [None]:
MRIX_train, MRIX_test, MRIy_train, MRIy_test = train_test_split(MRIX, MRIy, test_size=0.2, random_state=42)

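The scaling step below is left commented out. Fitting a fresh scaler on this dataset is probably a very bad idea, since the saved model was fitted without it, but the cell is kept because this is just a demo.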
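
If scaling is reintroduced later, one option that avoids fitting a separate scaler on each external dataset is to wrap the scaler and the regressor in a single scikit-learn pipeline, so the scaling parameters learned on the training data are reused at prediction time. This is a minimal sketch on synthetic arrays, not the notebook's data or its saved model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training set and a shifted "external" set, for illustration only
rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_other = rng.normal(loc=6.0, scale=2.0, size=(20, 3))

# The scaler is fitted once, on the training data, inside the pipeline
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Prediction on the external set reuses the training mean and variance
pred = model.predict(X_other)
scaler = model.named_steps['standardscaler']
print(scaler.mean_)
```

Saving the whole pipeline with `joblib.dump` would then keep the scaler and the regressor together in one file.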

In [None]:
# # scale
# sc = StandardScaler()
# sc.fit(MRIX_train)
# MRIX_train = sc.transform(MRIX_train)
# MRIX_test = sc.transform(MRIX_test)

In [None]:
MRIy_pred = loaded_model.predict(MRIX_test)

In [None]:
loaded_model

In [None]:

# compare the true values with the predictions (not with themselves)
print('R2 score : %.3f' % metrics.r2_score(MRIy_test, MRIy_pred))
print('Explained variance score: %.3f'  % metrics.explained_variance_score(MRIy_test, MRIy_pred))
print('Mean absolute error: %.3f'  % metrics.mean_absolute_error(MRIy_test, MRIy_pred))

Ouch...this hurts...Our model had a relatively large absolute error on the StrokeMRI dataset. It improved once scaling was removed.

This makes sense, as the means of X were different. We must however ask whether these numbers should be different or whether they should be normalized. This tsv normalization is a possible next step.
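
To make the question concrete, the feature means can be tabulated per dataset, before and after a per-dataset standardisation. The sketch below uses synthetic stand-ins for two datasets; the helper names `compare_means` and `zscore_within` are made up, and per-dataset z-scoring is only one possible normalization, not a choice this notebook has made.

```python
import numpy as np
import pandas as pd

def compare_means(frames, columns):
    """Return per-dataset feature means, one row per dataset."""
    return pd.DataFrame({name: f[columns].mean() for name, f in frames.items()}).T

def zscore_within(frame, columns):
    """Standardise columns with this dataset's own mean and standard deviation."""
    out = frame.copy()
    out[columns] = (frame[columns] - frame[columns].mean()) / frame[columns].std()
    return out

# Synthetic stand-ins with deliberately different means (not real data)
rng = np.random.default_rng(1)
cols = ['GM_ICVRatio', 'WMH_vol']
site_a = pd.DataFrame({'GM_ICVRatio': rng.normal(0.45, 0.03, 100),
                       'WMH_vol': rng.normal(2.0, 0.5, 100)})
site_b = pd.DataFrame({'GM_ICVRatio': rng.normal(0.40, 0.03, 100),
                       'WMH_vol': rng.normal(4.0, 0.5, 100)})

means_before = compare_means({'site_a': site_a, 'site_b': site_b}, cols)
means_after = compare_means({'site_a': zscore_within(site_a, cols),
                             'site_b': zscore_within(site_b, cols)}, cols)
print(means_before)
print(means_after)
```

Note that standardising within each dataset also erases real differences between cohorts, which is exactly the point that needs discussion.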

## Trying it on the final dataset

In [None]:
Insight46ml_matrix = Insight46_small_frame.copy()
Insight46X = Insight46ml_matrix.drop('Age', axis =1)
len(Insight46X)

In [None]:
Insight46y = Insight46ml_matrix['Age']

len(Insight46y)

In [None]:
Insight46X_train, Insight46X_test, Insight46y_train, Insight46y_test = train_test_split(Insight46X, Insight46y, test_size=0.2, random_state=42)

In [None]:
Insight46y_pred = loaded_model.predict(Insight46X_test)

In [None]:

# compare the true values with the predictions (not with themselves)
print('R2 score : %.3f' % metrics.r2_score(Insight46y_test, Insight46y_pred))
print('Explained variance score: %.3f'  % metrics.explained_variance_score(Insight46y_test, Insight46y_pred))
print('Mean absolute error: %.3f'  % metrics.mean_absolute_error(Insight46y_test, Insight46y_pred))

The model does even worse on the Insight46 data. This needs discussion with scientists about the direction for a stronger model.

I would suggest training on a pooled dataset that combines all datasets, using k-fold methods and a fancier linear regression, but only after data cleaning with scientists.
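
As a rough idea of what that could look like, here is a minimal k-fold sketch. It runs on synthetic data standing in for a pooled, cleaned dataset, and it uses ridge regression as one example of a "fancier" linear model. Both choices are assumptions for illustration, not decisions made in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the pooled datasets (features X, age-like target y)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = 60 + X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=2.0, size=150)

# 5-fold cross-validation with shuffling, scored by mean absolute error
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring='neg_mean_absolute_error')

# scikit-learn returns negated errors so that larger is better; flip the sign
mae_per_fold = -scores
print('MAE per fold:', mae_per_fold)
print('Mean MAE: %.3f' % mae_per_fold.mean())
```

Reporting the spread across folds, not just the mean, would also show how much the "results may vary" effect matters.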

Now let's move on and try KNN (k-nearest neighbors).

In [None]:
from sklearn.neighbors import KNeighborsRegressor

neigh = KNeighborsRegressor(n_neighbors=5)

In [None]:
neigh.fit(X_train, y_train)

In [None]:
y_pred = neigh.predict(X_test)

In [None]:

# compare the true values with the predictions (not with themselves)
print('R2 score : %.3f' % metrics.r2_score(y_test, y_pred))
print('Explained variance score: %.3f'  % metrics.explained_variance_score(y_test, y_pred))
print('Mean absolute error: %.3f'  % metrics.mean_absolute_error(y_test, y_pred))

In [None]:
# Looks bad, but let's continue
Insight46y_pred = neigh.predict(Insight46X_test)

In [None]:

# compare the true values with the predictions (not with themselves)
print('R2 score : %.3f' % metrics.r2_score(Insight46y_test, Insight46y_pred))
print('Explained variance score: %.3f'  % metrics.explained_variance_score(Insight46y_test, Insight46y_pred))
print('Mean absolute error: %.3f'  % metrics.mean_absolute_error(Insight46y_test, Insight46y_pred))

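## How depressing. Let's just stop here for now. Note: the R2 score code needs checking. `metrics.r2_score` should compare the true values with the predictions; scoring the test values against themselves always gives 1.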
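
The R2 issue can be reproduced on a handful of made-up numbers (these are not results from the datasets above):

```python
import numpy as np
from sklearn import metrics

# Made-up true ages and predictions, for illustration only
y_true = np.array([60.0, 65.0, 70.0, 75.0])
y_pred = np.array([62.0, 64.0, 73.0, 71.0])

# Scoring the true values against themselves is always a perfect 1
r2_self = metrics.r2_score(y_true, y_true)

# Scoring against the predictions gives the informative value:
# 1 - SS_res / SS_tot = 1 - 30 / 125 = 0.76
r2_model = metrics.r2_score(y_true, y_pred)

print('R2 (y_true vs y_true): %.3f' % r2_self)   # 1.000
print('R2 (y_true vs y_pred): %.3f' % r2_model)  # 0.760
```

The argument order is `r2_score(y_true, y_pred)`, and the score is not symmetric, so the order matters as well.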