# Advanced Machine Learning 2nd Project
### Authors: Guilherme Cepeda - 62931, Pedro Serrano - 54853


This second project is divided into two parts. We were given two files, `worms_trainset.csv` and `worms_testset.csv`, and had to answer two questions:
* Can we classify the type of worm using the information provided by the eigenworm series?
* For a specific worm, how can we model its motion, i.e., the eigenworm?

In [52]:
#imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, confusion_matrix
from sklearn.metrics import accuracy_score
from tslearn.neighbors import KNeighborsTimeSeriesClassifier
from tslearn.piecewise import PiecewiseAggregateApproximation, SymbolicAggregateApproximation
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

### Load Data 

The files have no header row, so loading them with the default settings would turn the first data row into column names and lose it. We therefore pass explicit column names: the target variable `class` followed by the 900 time instants `t1` to `t900`.

Here we load the two files into two **DataFrames**.
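To illustrate why the explicit names matter, here is a minimal sketch on a made-up two-row CSV (the in-memory `raw` string and the short column list are assumptions for illustration, not the project files). Without `names`, `pd.read_csv` consumes the first data row as the header; with `names`, every row is kept.

```python
import io

import pandas as pd

# Made-up header-less CSV: a class label followed by two time instants (illustrative only)
raw = "1.0,0.5,0.7\n2.0,-0.1,0.3\n"
column_names = ["class"] + [f"t{i}" for i in range(1, 3)]

# Default behaviour: the first data row is consumed as the header
df_default = pd.read_csv(io.StringIO(raw))

# Explicit names: both data rows are kept
df_named = pd.read_csv(io.StringIO(raw), names=column_names)

print(df_default.shape)        # only one data row is left
print(df_named.shape)          # both rows are kept
print(list(df_named.columns))
```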

In [2]:
#creates a dataframe from each file

#the csv files have no column names, so we have to add them
#column_names = ['class', 't1', ..., 't900'] represents the 901 columns we have
column_names = ['class'] + ['t' + str(i) for i in range(1, 901)]

#names=column_names uses this list as the header of the dataframe
df_trainset = pd.read_csv("worms_trainset.csv", names=column_names)

df_testset = pd.read_csv("worms_testset.csv", names=column_names)

#info
print(df_trainset.info())

#info
print(df_testset.info())

print(df_trainset.shape)
print(df_testset.shape)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Columns: 901 entries, class to t900
dtypes: float64(901)
memory usage: 1.2 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Columns: 901 entries, class to t900
dtypes: float64(901)
memory usage: 542.1 KB
None
(181, 901)
(77, 901)


In [7]:
#statistical info of the data
df_trainset.describe()

Unnamed: 0,class,t1,t2,t3,t4,t5,t6,t7,t8,t9,...,t891,t892,t893,t894,t895,t896,t897,t898,t899,t900
count,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,...,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0,181.0
mean,1.58011,-0.054762,-0.055049,-0.05332,-0.044472,-0.046638,-0.054311,-0.060297,-0.064308,-0.059014,...,0.097132,0.101416,0.102469,0.088442,0.087919,0.075541,0.070701,0.064898,0.064639,0.055676
std,0.49491,1.231458,1.217009,1.212576,1.21189,1.212344,1.211266,1.216288,1.214133,1.209916,...,1.280118,1.284467,1.264096,1.265234,1.264899,1.274256,1.260681,1.257045,1.278832,1.279609
min,1.0,-3.739104,-3.719033,-3.731076,-3.715019,-3.75516,-3.751146,-3.803328,-3.799314,-3.731076,...,-3.485523,-3.623585,-3.027088,-3.27843,-3.00373,-3.27843,-3.002307,-2.97036,-2.957012,-3.071337
25%,1.0,-0.854008,-0.831337,-0.80441,-0.80441,-0.855404,-0.822698,-0.85016,-0.883371,-0.8709,...,-0.904257,-0.8828,-0.87703,-0.874598,-0.849661,-0.846069,-0.849122,-0.934296,-0.920635,-0.931001
50%,2.0,0.00437,-0.034666,-0.055647,-0.033979,-0.012762,0.003513,0.003733,-0.038728,-0.014604,...,0.101768,0.072891,0.021509,0.045957,0.011206,0.016229,-0.004682,0.035938,0.029757,0.029757
75%,2.0,0.764857,0.70145,0.795012,0.845768,0.863678,0.849089,0.857694,0.859221,0.853187,...,1.130934,1.173048,1.112793,1.082254,1.079085,1.094289,1.049968,1.053081,1.072255,1.081842
max,2.0,3.482405,3.29594,3.109476,2.923011,2.736547,2.603414,2.583677,2.563941,2.542011,...,3.799677,3.934791,3.754639,4.024867,3.799677,3.777158,3.777158,3.574488,3.844715,3.574488


### Exploratory Data Analysis (EDA)


We first check for duplicated rows and null values, then flag outliers, and finally analyze the correlation between each time instant and the target variable `class`.

In [15]:
#We don't think duplicates are a problem here; we just want to be aware of them

#check for duplicates training set
print("Train set duplicates:",df_trainset.duplicated().sum())

#check for duplicates test set
print("Test set duplicates:",df_testset.duplicated().sum())

#Checking for nulls - there are none

#check for null values in the entire train set dataframe
print(df_trainset.isnull().any().any())

#check for null values in the entire test set dataframe
print(df_testset.isnull().any().any())


Train set duplicates: 23
Test set duplicates: 0
False
False


#### Check for Outliers
Identify outliers and anomalies in the data.

In [14]:
#calculate the z-score for each point of the training set
z_scores = np.abs((df_trainset - df_trainset.mean()) / df_trainset.std())

#define a threshold value
threshold = 3 # a point is considered an outlier when it lies more than 3 standard deviations from the mean of its column

#Identify the outliers
outliers = df_trainset[z_scores > threshold]

#Count the number of outliers
num_outliers = outliers.count().sum()


print(f"outliers \n {outliers} \n") # non null values represent the outliers
print(f"outliers count \n {num_outliers} \n")
'''
For now, I will leave this aside.
I will normalize the dataset and, if the classifications turn out to be poor, come back to this.

'''

outliers 
      class  t1        t2        t3        t4       t5        t6        t7  \
0      NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   
1      NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   
2      NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   
3      NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   
4      NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   
..     ...  ..       ...       ...       ...      ...       ...       ...   
176    NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   
177    NaN NaN -3.719033 -3.731076 -3.715019 -3.75516 -3.751146 -3.803328   
178    NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   
179    NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   
180    NaN NaN       NaN       NaN       NaN      NaN       NaN       NaN   

           t8        t9  ...  t891  t892  t893      t894  t895  

'\nFor now, I will leave this aside.\nI will normalize the dataset and, if the classifications turn out to be poor, come back to this.\n\n'
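Printing the masked frame is hard to read because almost every entry is NaN. As a hedged alternative, the same z-score rule can be summarised as a count of outliers per series. The sketch below uses a small made-up matrix (the array `X` and the injected value are assumptions for illustration, not the worm data).

```python
import numpy as np

# Toy data standing in for the training matrix: 20 series x 5 time instants (illustrative only)
X = np.array([[(-1.0) ** i] * 5 for i in range(20)])
X[3, 2] = 50.0  # inject one extreme value

# Column-wise z-scores; ddof=1 matches the pandas default used by DataFrame.std()
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))
is_outlier = z_scores > 3

outliers_per_series = is_outlier.sum(axis=1)          # one count per row (series)
flagged_series = np.flatnonzero(outliers_per_series)  # indices of rows with at least one outlier

print("total outliers:", int(is_outlier.sum()))
print("series with outliers:", flagged_series.tolist())
```

Applied to the real data, `X` would be the training features without the `class` column, so that the label does not take part in the outlier check.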

In [41]:
#creates a matrix of correlations
corr_matrix = df_trainset.corr() 
#how much each attribute correlates with the class target variable; the closer the value is to 0, the less relevant the feature is
print("\nCorrelations Matrix\n")
print(corr_matrix['class'].sort_values(ascending=False))#all columns are already float64, so no type conversion is needed

#plot to present the 2 features with the largest absolute correlation with the target variable
#sns.pairplot(df_trainset, vars=['t217', 't218'], hue='class')


Correlations Matrix

class    1.000000
t56      0.183297
t54      0.181402
t55      0.177794
t57      0.175693
           ...   
t219    -0.206140
t216    -0.209983
t215    -0.211973
t218    -0.216237
t217    -0.217713
Name: class, Length: 901, dtype: float64


### Data Processing

### Train Test Split

In [114]:
#Here, we split the data into X and y
#y is the target variable 'class', and X is everything else

y_train = df_trainset['class']
#X_test = X_test.transpose()
X_train = df_trainset.drop(['class'], axis=1)
y_test = df_testset['class']
#y_test = y_test.transpose()
X_test = df_testset.drop(['class'], axis=1)

print("X_train:\n")
print(X_train)
print("\ny_train:\n")
print(y_train)
print("\nX_test:\n")
print(X_test)
print("\ny_test:\n")
print(y_test)
print(X_train.shape)

X_train:

           t1        t2        t3        t4        t5        t6        t7  \
0    1.660505  1.739092  1.812766  1.847148  1.901176  1.935558  1.906088   
1   -0.379133  0.242145 -0.517195 -0.033979  0.587299 -0.517195 -0.172040   
2    0.534425  0.444349  0.399312  0.511906  0.669539  0.714577  0.511906   
3   -2.438882 -2.412564 -2.438882 -2.333611 -2.267818 -2.307294 -2.412564   
4    1.601259  1.601259  1.589440  1.589440  1.589440  1.589440  1.589440   
..        ...       ...       ...       ...       ...       ...       ...   
176 -0.816431 -0.804662 -0.706590 -0.612441 -0.624210 -0.624210 -0.628132   
177 -3.739104 -3.719033 -3.731076 -3.715019 -3.755160 -3.751146 -3.803328   
178 -1.010301 -1.151468 -1.201885 -1.232135 -1.332969 -1.353136 -1.403553   
179  1.511671  1.577663  1.569414  1.618907  1.618907  1.602410  1.635405   
180  0.732443  0.698494  0.694251  0.673033  0.711225  0.723955  0.728199   

           t8        t9       t10  ...      t891      t892      t

### Best Model/Representation Method for Classification


The `KNeighborsTimeSeriesClassifier` model from tslearn implements the k-nearest neighbors classifier for time series.

We consider three possible metrics, shown in the comments of the cell below:
* 1-NN with Euclidean distance
* 1-NN with DTW
* 1-NN with SAX. This metric needs two extra parameters, `n_segments` and `alphabet_size_avg`. The first is the number of Piecewise Aggregate Approximation pieces to compute (we start with 16) and the second is the number of SAX symbols to use (we start with 10). Both are passed to the classifier as a dictionary through its `metric_params` parameter.

The accuracy score (from scikit-learn) of each variant is recorded as a comment next to it; the cell itself reports precision, recall, F1 and the Matthews correlation coefficient. Our data is already split into a train and a test set, so no further splitting is needed.
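To make the difference between the Euclidean and DTW metrics concrete, here is a minimal NumPy sketch of dynamic time warping and a 1-NN prediction. It is a textbook O(n·m) dynamic program written for illustration, not tslearn's implementation; the helper names `dtw_distance` and `one_nn_predict` and the toy series are assumptions.

```python
import numpy as np


def dtw_distance(a, b):
    """Plain dynamic-programming DTW with squared point-wise cost (illustrative sketch)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible warping steps
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))


def one_nn_predict(train_X, train_y, query, dist=dtw_distance):
    """Return the label of the training series closest to the query."""
    distances = [dist(x, query) for x in train_X]
    return train_y[int(np.argmin(distances))]


# Toy series: the query is the first training series shifted by one step
train_X = [[0, 0, 1, 2, 1, 0], [2, 2, 1, 0, 1, 2]]
train_y = [1, 2]
query = [0, 1, 2, 1, 0, 0]

euclidean = float(np.linalg.norm(np.array(train_X[0]) - np.array(query)))
print("Euclidean:", euclidean)                          # penalises the shift
print("DTW:", dtw_distance(train_X[0], query))          # warping absorbs the shift
print("1-NN label:", one_nn_predict(train_X, train_y, query))
```

The shifted copy is at Euclidean distance 2 from the original but at DTW distance 0, which is why DTW can help when similar motion patterns occur at different times.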

In [108]:
#K nearest Neighbors for time series

#c = KNeighborsTimeSeriesClassifier(n_neighbors = 3, metric = 'euclidean')#0.6103896103896104
c = KNeighborsTimeSeriesClassifier(n_neighbors = 1, metric = 'dtw') #0.6233766233766234
#dict = {'n_segments' : 16 , 'alphabet_size_avg': 10}
#c = KNeighborsTimeSeriesClassifier(n_neighbors = 1, metric = 'sax', metric_params = dict)#0.5974025974025974
#c = SVC(C=50,gamma='auto')
#c  = RandomForestClassifier(n_estimators=12, random_state=0)
#c = LogisticRegression(C = 0.01)
#c = DecisionTreeClassifier(criterion = 'gini')
#c = KNeighborsClassifier(n_neighbors = 5, algorithm = 'ball_tree',weights = 'distance')
#c = GaussianNB()

c.fit(X_train, y_train)
preds = c.predict(X_test)

#accuracy = accuracy_score(y_test, preds)

#print(accuracy)

#the labels are 1 and 2; with scikit-learn's default pos_label=1, class 1 is the positive class
precision = precision_score(y_test, preds)
recall = recall_score(y_test, preds)
f1 = f1_score(y_test, preds)
mcc = matthews_corrcoef(y_test, preds)


print("The Precision is: %7.4f" % precision)
print("The Recall is: %7.4f" % recall)
print("The F1 score is: %7.4f" % f1)
print("The Matthews correlation coefficient is: %7.4f" % mcc)
print()
print("This is the Confusion Matrix")
print(pd.DataFrame(confusion_matrix(y_test, preds)))

#Test randomForest , SVM, decision trees , KNN 

The Precision is:  0.5455
The Recall is:  0.7273
The F1 score is:  0.6234
The Matthews correlation coefficient is:  0.2727

This is the Confusion Matrix
    0   1
0  24   9
1  20  24
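The reported scores can be re-derived by hand from the confusion matrix above. The short sketch below does the arithmetic, reading the first row and column as class 1 (scikit-learn orders the labels, and class 1 is the positive class under the default `pos_label=1`); the variable names are illustrative.

```python
import math

# Counts read off the confusion matrix above (rows = true class, columns = predicted class)
tp, fn = 24, 9    # true class 1: predicted 1, predicted 2
fp, tn = 20, 24   # true class 2: predicted 1, predicted 2

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} mcc={mcc:.4f} accuracy={accuracy:.4f}")
```

The accuracy works out to 48/77, which matches the value recorded in the comment next to the DTW classifier.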


Now, we are going to explore some representation methods, namely the Piecewise Aggregate Approximation (PAA) and the Symbolic Aggregate Approximation (SAX).
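As a rough illustration of what these two representations compute, here is a minimal NumPy sketch. It is not tslearn's implementation: the helper names `paa` and `sax`, the use of `np.array_split` for segmentation, and the toy series are assumptions, so segment boundaries and symbols may differ from the library's output.

```python
from statistics import NormalDist

import numpy as np


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each of n_segments consecutive chunks."""
    series = np.asarray(series, dtype=float)
    # np.array_split also handles lengths that are not a multiple of n_segments
    return np.array([chunk.mean() for chunk in np.array_split(series, n_segments)])


def sax(series, n_segments, alphabet_size):
    """SAX: z-normalise, apply PAA, then bin each segment mean into an integer symbol."""
    series = np.asarray(series, dtype=float)
    z = (series - series.mean()) / series.std()
    # Breakpoints split the standard normal distribution into equiprobable bins
    breakpoints = [NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]
    return np.searchsorted(breakpoints, paa(z, n_segments))


toy = np.array([0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0])
print(paa(toy, n_segments=2))                    # segment means
print(sax(toy, n_segments=2, alphabet_size=4))   # lowest and highest of the 4 symbols

long_series = np.sin(np.linspace(0, 20, 900))    # stand-in for one 900-point series
print(paa(long_series, n_segments=16).shape)     # 900 values compressed to 16
```

PAA trades temporal resolution for a much shorter series, and SAX additionally discretises the values, so both reduce noise and computation at the cost of detail.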

In [126]:
# Run every combination of scaler, (optional) representation method and classifier, and present the results
def test_models (scalers, representation_models, models, X_train, y_train, X_test,y_test, show_rep_model, show_model):
    results =[]
    ct = 0
    for name_scaler, scaler in scalers:
        if show_rep_model:
            for name_rep_method, rep_method in representation_models:
                for name_mod, model in models:
                    #scaling
                    scaler.fit(X_train)
                    Xt_train = scaler.transform(X_train)
                    Xt_test  = scaler.transform(X_test)

                    #representation methods
                    rep_method.fit(Xt_train)
                    Xt_train = rep_method.transform(Xt_train)
                    Xt_test  = rep_method.transform(Xt_test)

                    #len of the array
                    if len(Xt_train.shape) == 2:
                        model.fit(Xt_train, y_train)
                        preds = model.predict(Xt_test)
                    else:
                        model.fit(Xt_train[:,:,0], y_train) #[:,:,0] to have only 2 dimensions
                        preds = model.predict(Xt_test[:,:,0]) #PREDICTION

                    #save results
                    results = save_results (name_scaler, scaler, name_rep_method, rep_method, name_mod, model, results,y_test, preds, show_model)
        else:
            for name_mod, model in models:
                    #scaling
                    scaler.fit(X_train)
                    Xt_train = scaler.transform(X_train)
                    Xt_test  = scaler.transform(X_test)

                    #len of the array
                    if len(Xt_train.shape) == 2:
                        model.fit(Xt_train, y_train)
                        preds = model.predict(Xt_test)
                    else:
                        model.fit(Xt_train[:,:,0], y_train) #[:,:,0] to have only 2 dimensions
                        preds = model.predict(Xt_test[:,:,0]) #PREDICTION

                    #save results
                    results = save_results (name_scaler, scaler, "", "", name_mod, model, results,y_test, preds, show_model)

        #present model number
        if show_model:
            ct += 1
            print("\nModel %d" % ct)

    
    results_sorted = sorted(results, key=lambda x: x[8], reverse=True) #f1 sorted decreasing
    display_results(results_sorted, show_rep_model)
    return results



# Save the model scores and present intermediate results (w/ show_model)
# Returns the list with the saved results 
def save_results(name_scaler, scaler,name_rep_method, rep_method, name_mod, model, results, y_test, preds, show_model):

    # Calculate the precision, recall, f1 and mcc scores
    precision = precision_score(y_test, preds)
    recall = recall_score(y_test, preds)
    f1 = f1_score(y_test, preds)
    mcc = matthews_corrcoef(y_test, preds)
    
    if show_model:
        print(f"Scaler: {scaler} rep method: {rep_method} classifier: {name_mod} {model}")
        print("The Precision is: %7.4f" % precision)
        print("The Recall is: %7.4f" % recall)
        print("The F1 score is: %7.4f" % f1)
        print("The Matthews correlation coefficient is: %7.4f" % mcc)
        print()
        print("This is the Confusion Matrix")
        print(pd.DataFrame(confusion_matrix(y_test, preds)))


    results.append((name_scaler,
                    scaler,
                    name_rep_method, 
                    rep_method, 
                    name_mod, 
                    model,
                    precision,
                    recall,
                    f1,
                    mcc,                    
                    ))
    return results

# Display the model final results. Receives the ordered results to present
def display_results (results, show_rep_model):        
    
    noshow = ""
    if show_rep_model:
        print (f"\n--------------------------Results for Representation Methods Performance--------------------------")
    else:
        print (f"\n--------------------------Results for Classification Models Performance--------------------------")
    for res in results:
        name_scaler = res [0]
        scaler = res [1]
        name_rep_method = res [2]
        rep_method = res [3]
        name_mod = res [4]
        model = res [5]
        precision = res [6]
        recall = res [7]
        f1 = res [8]
        mcc = res [9]

        if show_rep_model:
            print(f"{name_mod.ljust(25)} | precision     {precision:.4f} | recall     {recall:.4f} | f1     {f1:.4f}| mcc     {mcc:.4f}")
            print(f"{noshow.ljust(25)} | scaler {scaler} | rep method {rep_method}")
        else:
             print(f"{name_mod.ljust(25)} | precision     {precision:.4f} | recall     {recall:.4f} | f1     {f1:.4f}| mcc     {mcc:.4f}")
             print(f"{noshow.ljust(25)} | scaler {scaler}")
    

In [127]:
# Defining a list of scalers
scalers = [
    ('PowerTransformer', PowerTransformer()),
    ('MinMaxScaler', MinMaxScaler()),
    ('StandardScaler', StandardScaler()),
    ('TimeSeriesScalerMeanVariance', TimeSeriesScalerMeanVariance(mu=0, std=1))
]

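# Defining a list of representation methods
# NOTE: some labels do not match the parameters (the 'ns10' PAA entry uses n_segments=12, and both SAX
# entries share the label 'ns10' although they use n_segments=10 and n_segments=32);
# the labels are kept unchanged so that they match the recorded output
representation_models = [
    ('PiecewiseAggregateApproximation_ns10', PiecewiseAggregateApproximation(n_segments=12)),
    ('PiecewiseAggregateApproximation_ns16', PiecewiseAggregateApproximation(n_segments=16)),
    ('SymbolicAggregateApproximation_ns10', SymbolicAggregateApproximation(n_segments=10, alphabet_size_avg=40)),
    ('SymbolicAggregateApproximation_ns10', SymbolicAggregateApproximation(n_segments=32, alphabet_size_avg=40))
]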

# parameters for the 'sax' metric (named sax_params to avoid shadowing the built-in dict)
sax_params = {'n_segments' : 16 , 'alphabet_size_avg': 10}

# Defining a list of classification models
# NOTE: the 'DecisionTree_minsl20' label does not match its parameter (min_samples_leaf = 5);
# the label is kept unchanged so that it matches the recorded output
classification_models = [
    ('LogisticRegression', LogisticRegression(C = 0.01)),
    ('DecisionTree_maxd10', DecisionTreeClassifier(max_depth = 10)),
    ('DecisionTree_minsl20', DecisionTreeClassifier(min_samples_leaf = 5)),
    ('DecisionTree_critgini', DecisionTreeClassifier(criterion = 'gini')),
    ('DecisionTree_critentropy', DecisionTreeClassifier(criterion = 'entropy')),
    ('GaussianNB', GaussianNB()),
    ('KNN_K1_wdist', KNeighborsClassifier(n_neighbors = 1,weights = 'distance')),
    ('KNNTM_K1_eu', KNeighborsTimeSeriesClassifier(n_neighbors = 1, metric = 'euclidean')),
    ('KNNTM_K1_dtw', KNeighborsTimeSeriesClassifier(n_neighbors = 1, metric = 'dtw')),
    ('KNNTM_K1_sax', KNeighborsTimeSeriesClassifier(n_neighbors = 1, metric = 'sax',metric_params = sax_params)),
    ('RandomForestClassifier_ne50',RandomForestClassifier(n_estimators=50, random_state=0)),
    ('RandomForestClassifier_ne10',RandomForestClassifier(n_estimators=10, random_state=0)),
    ('SVC_c50',SVC(C=50,gamma='auto')),
    ('SVC_c10',SVC(C=10,gamma='auto'))
]

SHOW_REP_MODEL = False #false will not present the representation methods
SHOW_MODEL = False #True to present/print the progress of the model performance

#First run will be just with the scalers and the classification models
test_models(scalers, representation_models, classification_models, X_train, y_train, X_test,y_test, SHOW_REP_MODEL, SHOW_MODEL)

SHOW_REP_MODEL = True

#Second run, for representation models
test_models(scalers, representation_models, classification_models, X_train, y_train, X_test,y_test, SHOW_REP_MODEL, SHOW_MODEL)


--------------------------Results for Classification Models Performance--------------------------
SVC_c10                   | precision     0.5750 | recall     0.6970 | f1     0.6301| mcc     0.3077
                          | scaler PowerTransformer()
SVC_c10                   | precision     0.5750 | recall     0.6970 | f1     0.6301| mcc     0.3077
                          | scaler StandardScaler()
KNNTM_K1_dtw              | precision     0.5455 | recall     0.7273 | f1     0.6234| mcc     0.2727
                          | scaler TimeSeriesScalerMeanVariance(mu=0, std=1)
KNNTM_K1_dtw              | precision     0.5610 | recall     0.6970 | f1     0.6216| mcc     0.2855
                          | scaler StandardScaler()
SVC_c10                   | precision     0.5500 | recall     0.6667 | f1     0.6027| mcc     0.2551
                          | scaler TimeSeriesScalerMeanVariance(mu=0, std=1)
KNNTM_K1_sax              | precision     0.4286 | recall     1.0000 | f1     0.6000

[Warning output, truncated: repeated warnings at `X_transformed[i_ts, i_seg, :] = segment.mean(axis=0)`, `ret = um.true_divide(` and `_warn_prf(average, modifier, msg_start, len(result))`, plus the following solver message]
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


--------------------------Results for Representation Methods Performance--------------------------
KNNTM_K1_dtw              | precision     0.6279 | recall     0.8182 | f1     0.7105| mcc     0.4530
                          | scaler StandardScaler() | rep method PiecewiseAggregateApproximation(n_segments=16)
KNNTM_K1_dtw              | precision     0.5854 | recall     0.7273 | f1     0.6486| mcc     0.3381
                          | scaler PowerTransformer() | rep method PiecewiseAggregateApproximation(n_segments=16)
DecisionTree_critentropy  | precision     0.5854 | recall     0.7273 | f1     0.6486| mcc     0.3381
                          | scaler MinMaxScaler() | rep method PiecewiseAggregateApproximation(n_segments=16)
SVC_c50                   | precision     0.5854 | recall     0.7273 | f1     0.6486| mcc     0.3381
                          | scaler StandardScaler() | rep method PiecewiseAggregateApproximation(n_segments=16)
DecisionTree_maxd10       | precision     0.6286

[('PowerTransformer',
  PowerTransformer(),
  'PiecewiseAggregateApproximation_ns10',
  PiecewiseAggregateApproximation(n_segments=12),
  'LogisticRegression',
  LogisticRegression(C=0.01),
  0.0,
  0.0,
  0.0,
  -0.09933992677987828),
 ('PowerTransformer',
  PowerTransformer(),
  'PiecewiseAggregateApproximation_ns10',
  PiecewiseAggregateApproximation(n_segments=12),
  'DecisionTree_maxd10',
  DecisionTreeClassifier(max_depth=10),
  0.6333333333333333,
  0.5757575757575758,
  0.6031746031746033,
  0.3305736828770171),
 ('PowerTransformer',
  PowerTransformer(),
  'PiecewiseAggregateApproximation_ns10',
  PiecewiseAggregateApproximation(n_segments=12),
  'DecisionTree_minsl20',
  DecisionTreeClassifier(min_samples_leaf=5),
  0.5172413793103449,
  0.45454545454545453,
  0.4838709677419355,
  0.1392715036327889),
 ('PowerTransformer',
  PowerTransformer(),
  'PiecewiseAggregateApproximation_ns10',
  PiecewiseAggregateApproximation(n_segments=12),
  'DecisionTree_critgini',
  DecisionTre

In [None]:
#Question 2 of the project will most likely be done using the material from TP7 (time series forecasting)

#Histograms in the EDA are a no-go: too many rows in the datasets
#The same goes for plotting the outliers; find another way (counting them is one solution)

#Not sure whether the outliers of the test set also have to be checked
#Verify whether a correlation matrix is possible; it sounds difficult as we don't have a target variable

#Scaling the data might be needed: verify the 4 scalers PowerTransformer, StandardScaler, MinMaxScaler and, I believe, the Normalizer, which might be useful as it works on rows (check the AA notes/slides); no need for imputation though
#When the project statement says to find the best classifier model, TP6 only uses the KNeighborsClassifier; check whether other classification models can be used, or whether that is even needed

#Check whether scaling is needed
