# Test Models

Many algorithms can be used for classification problems. In this work, we compare the results of some of the main ones and evaluate the impact of our feature treatment on each of them.

### Objectives
1. Evaluate different algorithms in terms of accuracy
2. Evaluate the impact of data categorization on each algorithm

#### Tested Algorithms 
1. Linear Regression (a regressor, included as a baseline; its `score` is R² rather than accuracy)
2. Gaussian Naive Bayes
3. KNN Classifier
4. Random Forest Classifier

In [51]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [52]:
df_train = pd.read_csv('../data/interim/train.csv')
df_test = pd.read_csv('../data/interim/test.csv')

In [53]:
X_train = df_train.drop(['Survived', 'PassengerId'], axis=1)
Y_train = df_train['Survived']

X_test = df_test.drop('PassengerId', axis=1)

print(X_train)

     Unnamed: 0  Pclass  Sex   Age  SibSp  Parch      Fare  Embarked
0             0       3    0  22.0      1      0    7.2500         0
1             1       1    1  38.0      1      0   71.2833         2
2             2       3    1  26.0      0      0    7.9250         0
3             3       1    1  35.0      1      0   53.1000         0
4             4       3    0  35.0      0      0    8.0500         0
5             5       3    0   NaN      0      0    8.4583         1
6             6       1    0  54.0      0      0   51.8625         0
7             7       3    0   2.0      3      1   21.0750         0
8             8       3    1  27.0      0      2   11.1333         0
9             9       2    1  14.0      1      0   30.0708         2
10           10       3    1   4.0      1      1   16.7000         0
11           11       1    1  58.0      0      0   26.5500         0
12           12       3    0  20.0      0      0    8.0500         0
13           13       3    0  39.0
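
The printed frame shows an `Unnamed: 0` column, which typically appears when a CSV was written with its index and read back without `index_col`, and a `NaN` in `Age`. A minimal sketch of one way to handle both, using a small inline CSV rather than the real files; the sample rows and the choice of median imputation are assumptions for illustration, not the treatment used in this notebook:

```python
import io

import pandas as pd

# Hypothetical miniature CSV that mimics a file saved with its index column
csv_text = """,PassengerId,Survived,Pclass,Age
0,1,0,3,22.0
1,2,1,1,38.0
2,3,1,3,
"""

# index_col=0 turns the leading unnamed column into the index
# instead of keeping it as an "Unnamed: 0" feature
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Assumed strategy: replace missing ages with the median of the known ages
demo['Age'] = demo['Age'].fillna(demo['Age'].median())

print(demo)
```

Whether the median is a sensible fill value depends on the feature; it is used here only to keep the sketch short.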

## Algorithms with partially categorized features
Our first batch of tests uses the datasets without full categorization. Afterwards, we compare the results with and without categorization.
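
The comparison described above can be sketched as a loop over models and feature variants. This is a minimal sketch on synthetic data: the `demo` frame, the label rule, the bin edges, and the use of `pd.cut`/`pd.qcut` as the "categorization" step are all assumptions for illustration and may differ from the treatment applied to the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real features (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=200),
    'Fare': rng.uniform(5, 100, size=200),
})
target = (demo['Fare'] > 50).astype(int)  # synthetic label, for illustration only

# "Categorized" variant: continuous columns binned into integer codes
categorized = demo.copy()
categorized['Age'] = pd.cut(demo['Age'], bins=[0, 12, 18, 40, 60, 120], labels=False)
categorized['Fare'] = pd.qcut(demo['Fare'], q=4, labels=False)

models = {
    'naive_bayes': GaussianNB(),
    'knn': KNeighborsClassifier(n_neighbors=3),
    'random_forest': RandomForestClassifier(n_estimators=10, random_state=0),
}
variants = {'raw': demo, 'categorized': categorized}

# Training accuracy of every model on every feature variant
results = {}
for name, model in models.items():
    for variant, features in variants.items():
        results[(name, variant)] = model.fit(features, target).score(features, target)

for key, accuracy in results.items():
    print(key, round(accuracy, 3))
```

Keeping the results in one dictionary makes it easy to place the two variants side by side for each model.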

### Test Linear Regression

In [42]:
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, Y_train)
prd = linear_regressor.predict(X_test)

# For a regressor, score() returns R^2 on the given data, not classification accuracy
linear_regressor.score(X_train, Y_train)

0.39777264019086606
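
Because `LinearRegression` is a regressor, the value above is the coefficient of determination (R²), so it is not directly comparable with the accuracies reported for the classifiers. A minimal sketch of one way to obtain an accuracy from a linear regression, assuming a 0.5 threshold on the predictions and synthetic data from `make_classification` (both are assumptions for illustration, not part of the original notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (stand-in for the real training set)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_regressor = LinearRegression().fit(X_demo, y_demo)

# score() on a regressor is R^2
r2 = demo_regressor.score(X_demo, y_demo)

# Assumed decision rule: predictions at or above 0.5 count as class 1
labels = (demo_regressor.predict(X_demo) >= 0.5).astype(int)
acc = accuracy_score(y_demo, labels)

print('R^2:', round(r2, 3))
print('training accuracy:', round(acc, 3))
```

`LogisticRegression` from the same module is the usual linear model for classification and reports accuracy from `score` directly; it would be a swap, not what this notebook tests.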

### Test Naive Bayes

In [44]:
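# GaussianNB is a classifier despite the variable name
nb_regressor = GaussianNB()
nb_regressor.fit(X_train, Y_train)
prd = nb_regressor.predict(X_test)

# score() returns mean accuracy, here measured on the training data
nb_regressor.score(X_train, Y_train)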

0.792368125701459

### Test KNN

In [46]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
prd = knn.predict(X_test)

# score() returns mean accuracy, here measured on the training data
knn.score(X_train, Y_train)

0.8069584736251403

### Test Random Forest

In [49]:
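# RandomForestClassifier is a classifier despite the variable name
rf_regressor = RandomForestClassifier(n_estimators=10)
rf_regressor.fit(X_train, Y_train)
prd = rf_regressor.predict(X_test)

# score() returns mean accuracy, here measured on the training data
rf_regressor.score(X_train, Y_train)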

0.9842873176206509
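
The score above is computed on the same rows the forest was trained on, so it may overstate how well the model generalizes. A minimal sketch of a held-out estimate using 5-fold cross-validation; the synthetic data, the fold count, and `random_state=0` are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (stand-in for the real training set)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)

# Accuracy on the rows used for fitting
train_acc = forest.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy on held-out folds: each fold is scored by a model that never saw it
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print('training accuracy:', round(train_acc, 3))
print('cross-validated accuracy:', round(cv_scores.mean(), 3))
```

The same call works for the other classifiers, which would put all of them on an equal, held-out footing.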

## Algorithms with totally categorized features

### Test Linear Regression

In [None]:
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, Y_train)
prd = linear_regressor.predict(X_test)

# For a regressor, score() returns R^2 on the given data, not classification accuracy
linear_regressor.score(X_train, Y_train)

### Test Naive Bayes

In [None]:
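# GaussianNB is a classifier despite the variable name
nb_regressor = GaussianNB()
nb_regressor.fit(X_train, Y_train)
prd = nb_regressor.predict(X_test)

# score() returns mean accuracy, here measured on the training data
nb_regressor.score(X_train, Y_train)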

### Test KNN

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
prd = knn.predict(X_test)

# score() returns mean accuracy, here measured on the training data
knn.score(X_train, Y_train)

### Test Random Forest

In [None]:
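# RandomForestClassifier is a classifier despite the variable name
rf_regressor = RandomForestClassifier(n_estimators=10)
rf_regressor.fit(X_train, Y_train)
prd = rf_regressor.predict(X_test)

# score() returns mean accuracy, here measured on the training data
rf_regressor.score(X_train, Y_train)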

## Conclusion
Although linear regression is very sensitive to feature treatment, the other algorithms seem to adapt to the ranges of each feature.

In these tests, Random Forest coped with features that had different ranges and were not fully categorized. It also scored much higher than the other algorithms, although every score reported here was measured on the training data.