# Machine Learning Code Using Logistic Regression

In [390]:
#import the libraries needed
import numpy as np              #used for algebra
import pandas as pd             #used for reading csv files
import matplotlib.pyplot as plt #used to plot the data
import seaborn as sns           #used to make statistical graphs

In [391]:
#calling and checking the dataset
datatrain = pd.read_csv('train.csv')
datatrain

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [392]:
#used to check if there are empty/null values in the dataset
datatrain.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [393]:
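#cleaning the dataset: drop the columns that are not used as features and fill in the missing values
datatrain = datatrain.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
#assign the filled column back; fillna(inplace=True) on a column selection triggers warnings in newer pandas versions
datatrain["Age"] = datatrain["Age"].fillna(datatrain["Age"].mean())  #missing ages become the mean age
datatrain["Embarked"] = datatrain["Embarked"].fillna("U")            #"U" marks an unknown port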
datatrain

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.000000,1,0,7.2500,S
1,1,1,female,38.000000,1,0,71.2833,C
2,1,3,female,26.000000,0,0,7.9250,S
3,1,1,female,35.000000,1,0,53.1000,S
4,0,3,male,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S
887,1,1,female,19.000000,0,0,30.0000,S
888,0,3,female,29.699118,1,2,23.4500,S
889,1,1,male,26.000000,0,0,30.0000,C


In [394]:
#preprocessing of the data to change the string values to numerical values
from sklearn import preprocessing
#uses labelencoder to change the labels of each cell in the existing columns into numerical values
le = preprocessing.LabelEncoder()
cols = ["Sex", "Embarked"]

for col in cols:
    datatrain[col] = le.fit_transform(datatrain[col])
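
LabelEncoder replaces each category with an integer code, which can suggest an ordering (for example C < S < U) that the ports do not really have. As a hedged alternative, the sketch below one-hot encodes the same two columns with `pd.get_dummies`. The small frame `sample` is hypothetical toy data, not the Titanic file.

```python
import pandas as pd

# Hypothetical toy rows standing in for the cleaned Titanic columns.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "U"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(sample, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

Whether this helps depends on the model: distance-based and linear models are usually more sensitive to arbitrary integer codes than tree-based ones.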

In [395]:
#defining the values of X and y based on the independent and dependent variables in the csv file
X = datatrain.drop("Survived", axis=1)
y = datatrain["Survived"]

In [396]:
#import train_test_split and use it to hold out 20% of the rows as the test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [397]:
#importing the libraries needed to create a model, in this case, I used LogisticRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#defining and fitting the variables into the model
LR = LogisticRegression(random_state=42, max_iter=100)
LR.fit(X_train,y_train)
prediction = LR.predict(X_test)

In [398]:
#checking the accuracy of the original model
pred_acc = accuracy_score(y_test, prediction)
pred_acc

0.8100558659217877
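
A single accuracy figure does not show which class the mistakes fall in. As a minimal sketch, a confusion matrix breaks the same comparison down by class. The two label lists below are hypothetical and chosen only to keep the arithmetic easy to follow; in the notebook the inputs would be `y_test` and `prediction`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 0 = did not survive, 1 = survived.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
print(cm)
print(acc)
```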

In [399]:
#importing the libraries used for hyperparameter tuning to determine the best parameters to use
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
#stored the parameters that can be used into arrays to be called later
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
#creates a gridsearch of parameters by testing which parameter will yield the highest accuracy
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=LR, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)
#prints the best parameter to use 
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
#prints the list of parameter combinations with their respective accuracy scores
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))

Best: 0.801493 using {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
0.799596 with: {'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
0.799596 with: {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
0.800065 with: {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
0.799126 with: {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'}
0.799596 with: {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
0.799602 with: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
0.800072 with: {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
0.800535 with: {'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
0.801493 with: {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
0.800522 with: {'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg'}
0.800522 with: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
0.794953 with: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
0.724146 with: {'C': 0.01, 'penalty': 'l2', 'solver': 'newton-cg'}
0.724146 with: {'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs'}
0.710113 with: {'C': 0.01

In [400]:
#prints the best parameter result of the gridsearch
print(grid_result.best_params_)

{'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}


In [401]:
#refitting the model with tuned parameters
#note: the values below (C=0.1, solver="newton-cg") are not the best combination reported by the grid search (C=1.0, solver="liblinear")
LR1 = LogisticRegression(penalty="l2", C=0.1, solver="newton-cg", random_state=42, max_iter=100)
LR1.fit(X_train,y_train)
prediction_tuned = LR1.predict(X_test)
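
Copying the winning values by hand makes it easy for the refit model to drift from what the search selected. A minimal sketch of two ways to avoid that, assuming a fitted `GridSearchCV` object: reuse `best_estimator_` (available because `refit=True` is the default), or unpack `best_params_` into a new model. The synthetic data from `make_classification` and the shortened grid are stand-ins, not the Titanic setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: any binary classification table works here).
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_grid = {"solver": ["lbfgs", "liblinear"], "C": [10, 1.0, 0.1]}
search = GridSearchCV(LogisticRegression(max_iter=1000), demo_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# Option 1: the search has already refit the best combination on all of X_tr.
best_model = search.best_estimator_

# Option 2: build a fresh model from the winning parameters without retyping them.
rebuilt = LogisticRegression(max_iter=1000, **search.best_params_)
rebuilt.fit(X_tr, y_tr)

print(search.best_params_)
print(best_model.score(X_te, y_te), rebuilt.score(X_te, y_te))
```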

In [402]:
#prints the predicted accuracy of the tuned model, in this case, the accuracy was the same as the original
pred_acc_tuned_LR = accuracy_score(y_test, prediction_tuned)
pred_acc_tuned_LR

0.8100558659217877

In [403]:
#comparison between the Original Accuracy and the hyper tuned Accuracy
print("Original Accuracy: ", pred_acc)
print("Tuned Accuracy: ", pred_acc_tuned_LR)

Original Accuracy:  0.8100558659217877
Tuned Accuracy:  0.8100558659217877


In [404]:
#prints the final predictions of the tuned model on the held-out test split (X_test), since it has the same accuracy as the original one
prediction_final_LR = LR1.predict(X_test)
prediction_final_LR

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 1], dtype=int64)

# Machine Learning Code Using K-Nearest Neighbors

In [405]:
#import the libraries needed
import numpy as np              #used for algebra
import pandas as pd             #used for reading csv files
import matplotlib.pyplot as plt #used to plot the data
import seaborn as sns           #used to make statistical graphs

In [406]:
#calling and checking the dataset
datatrain = pd.read_csv('train.csv')
datatrain

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [407]:
#used to check if there are empty/null values in the dataset
datatrain.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [408]:
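#cleaning the dataset: drop the columns that are not used as features and fill in the missing values
datatrain = datatrain.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
#assign the filled column back; fillna(inplace=True) on a column selection triggers warnings in newer pandas versions
datatrain["Age"] = datatrain["Age"].fillna(datatrain["Age"].mean())  #missing ages become the mean age
datatrain["Embarked"] = datatrain["Embarked"].fillna("U")            #"U" marks an unknown port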
datatrain

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.000000,1,0,7.2500,S
1,1,1,female,38.000000,1,0,71.2833,C
2,1,3,female,26.000000,0,0,7.9250,S
3,1,1,female,35.000000,1,0,53.1000,S
4,0,3,male,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S
887,1,1,female,19.000000,0,0,30.0000,S
888,0,3,female,29.699118,1,2,23.4500,S
889,1,1,male,26.000000,0,0,30.0000,C


In [409]:
#preprocessing of the data to change the string values to numerical values
from sklearn import preprocessing
#uses labelencoder to change the labels of each cell in the existing columns into numerical values
le = preprocessing.LabelEncoder()
cols = ["Sex", "Embarked"]

for col in cols:
    datatrain[col] = le.fit_transform(datatrain[col])

In [410]:
#defining the values of X and y based on the independent and dependent variables in the csv file
X = datatrain.drop("Survived", axis=1)
y = datatrain["Survived"]

In [411]:
#import train_test_split and use it to hold out 20% of the rows as the test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [412]:
#import the preprocessing library for standardscaler to use in the KNN Classifier
from sklearn.preprocessing import StandardScaler
#standardizes each feature to zero mean and unit variance; the scaler is fit on the training split only and then applied to both splits
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
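
The scaler here is fit once on the whole training split. If these arrays are then passed to cross-validation, each fold's validation rows have already influenced the scaler's mean and variance. One hedged way to keep scaling inside each fold is a `Pipeline`; the sketch below uses synthetic data and a shortened grid as assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)

knn_pipe = Pipeline([
    ("scale", StandardScaler()),      # refit on the training part of every fold
    ("knn", KNeighborsClassifier()),
])

# Step parameters are addressed as <step name>__<parameter name>.
knn_grid = {
    "knn__n_neighbors": [3, 7, 13],
    "knn__weights": ["uniform", "distance"],
}
knn_search = GridSearchCV(knn_pipe, knn_grid, cv=5, scoring="accuracy")
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_)
```

The fitted `knn_search` can then predict on unscaled test rows directly, because the pipeline applies the scaler before the classifier.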

In [413]:
#importing the libraries needed to create a model, in this case, I used K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
#defining and fitting the variables into the model
classifier = KNeighborsClassifier(n_neighbors=7)
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)

In [414]:
#checking the accuracy of the original model
pred_acc = accuracy_score(y_test, prediction)
pred_acc

0.8212290502793296

In [415]:
#importing the libraries used for hyperparameter tuning to determine the best parameters to use
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
#stored the parameters that can be used into arrays to be called later
n_neighbors = range(1, 21, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
#creates a gridsearch of parameters by testing which parameter will yield the highest accuracy
grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=classifier, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)
#prints the best parameter to use 
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
#prints the list of parameter combinations with their respective accuracy scores
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))

Best: 0.822105 using {'metric': 'manhattan', 'n_neighbors': 13, 'weights': 'uniform'}
0.746668 with: {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'uniform'}
0.746668 with: {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'distance'}
0.788778 with: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
0.782251 with: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'distance'}
0.796753 with: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'uniform'}
0.793936 with: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'distance'}
0.800033 with: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'uniform'}
0.796283 with: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'distance'}
0.807544 with: {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'uniform'}
0.799544 with: {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'distance'}
0.811783 with: {'metric': 'euclidean', 'n_neighbors': 11, 'weights': 'uniform'}
0.791608 with: {'metric': 'euclidean', 

In [416]:
#prints the best parameter result of the gridsearch
print(grid_result.best_params_)

{'metric': 'manhattan', 'n_neighbors': 13, 'weights': 'uniform'}


In [417]:
#application of the best parameters in the gridsearch to the model and refitting the data
classifier1 = KNeighborsClassifier(n_neighbors=13, weights="uniform", metric="manhattan")
classifier1.fit(X_train, y_train)
prediction_tuned = classifier1.predict(X_test)

In [418]:
#prints the predicted accuracy of the tuned model, in this case, the accuracy became higher than the original parameters
pred_acc_tuned_KNN = accuracy_score(y_test, prediction_tuned)
pred_acc_tuned_KNN

0.8324022346368715

In [419]:
#comparison between the Original Accuracy and the hyper tuned Accuracy
print("Original Accuracy: ", pred_acc)
print("Tuned Accuracy: ", pred_acc_tuned_KNN)

Original Accuracy:  0.8212290502793296
Tuned Accuracy:  0.8324022346368715


In [420]:
#prints the final predictions of the tuned model on the held-out test split, since its accuracy is around 1 percentage point higher
prediction_final_KNN = classifier1.predict(X_test)
prediction_final_KNN

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 1], dtype=int64)

# Neural Network Code Using MLP in sklearn

In [421]:
#import the libraries needed
import numpy as np              #used for algebra
import pandas as pd             #used for reading csv files
import matplotlib.pyplot as plt #used to plot the data
import seaborn as sns           #used to make statistical graphs

In [422]:
#calling and checking the dataset
datatrain = pd.read_csv('train.csv')
datatrain

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [423]:
#used to check if there are empty/null values in the dataset
datatrain.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [424]:
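#cleaning the dataset: drop the columns that are not used as features and fill in the missing values
datatrain = datatrain.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
#assign the filled column back; fillna(inplace=True) on a column selection triggers warnings in newer pandas versions
datatrain["Age"] = datatrain["Age"].fillna(datatrain["Age"].mean())  #missing ages become the mean age
datatrain["Embarked"] = datatrain["Embarked"].fillna("U")            #"U" marks an unknown port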
datatrain

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.000000,1,0,7.2500,S
1,1,1,female,38.000000,1,0,71.2833,C
2,1,3,female,26.000000,0,0,7.9250,S
3,1,1,female,35.000000,1,0,53.1000,S
4,0,3,male,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,13.0000,S
887,1,1,female,19.000000,0,0,30.0000,S
888,0,3,female,29.699118,1,2,23.4500,S
889,1,1,male,26.000000,0,0,30.0000,C


In [425]:
#preprocessing of the data to change the string values to numerical values
from sklearn import preprocessing
#uses labelencoder to change the labels of each cell in the existing columns into numerical values
le = preprocessing.LabelEncoder()
cols = ["Sex", "Embarked"]

for col in cols:
    datatrain[col] = le.fit_transform(datatrain[col])

In [426]:
#defining the values of X and y based on the independent and dependent variables in the csv file
X = datatrain.drop("Survived", axis=1)
y = datatrain["Survived"]

In [427]:
#import train_test_split and use it to hold out 20% of the rows as the test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [428]:
#importing the libraries needed to create a model, in this case, I used MLP Classifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
#defining and fitting the variables into the model
mlp = MLPClassifier(random_state=42)
mlp.fit(X_train, y_train)
prediction = mlp.predict(X_test)

In [429]:
#checking the accuracy of the original model
pred_acc = accuracy_score(y_test, prediction)
pred_acc

0.7653631284916201

In [430]:
#importing the libraries used for hyperparameter tuning to determine the best parameters to use
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
#stored the parameters that can be used into arrays to be called later
hidden_layer_sizes = [[5]]  #a single candidate: one hidden layer with 5 units
activation = ["identity", "logistic", "tanh", "relu"]
solver = ["lbfgs", "sgd", "adam"]
learning_rate = ["constant", "invscaling", "adaptive"]
#creates a gridsearch of parameters by testing which parameter will yield the highest accuracy
grid = dict(hidden_layer_sizes=hidden_layer_sizes,activation=activation,solver=solver,learning_rate=learning_rate)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=mlp, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)
#prints the best parameter to use 
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
#prints the list of parameter combinations with their respective accuracy scores
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))

Best: 0.803339 using {'activation': 'logistic', 'hidden_layer_sizes': [5], 'learning_rate': 'constant', 'solver': 'lbfgs'}
0.799596 with: {'activation': 'identity', 'hidden_layer_sizes': [5], 'learning_rate': 'constant', 'solver': 'lbfgs'}
0.704969 with: {'activation': 'identity', 'hidden_layer_sizes': [5], 'learning_rate': 'constant', 'solver': 'sgd'}
0.788328 with: {'activation': 'identity', 'hidden_layer_sizes': [5], 'learning_rate': 'constant', 'solver': 'adam'}
0.799596 with: {'activation': 'identity', 'hidden_layer_sizes': [5], 'learning_rate': 'invscaling', 'solver': 'lbfgs'}
0.563322 with: {'activation': 'identity', 'hidden_layer_sizes': [5], 'learning_rate': 'invscaling', 'solver': 'sgd'}
0.788328 with: {'activation': 'identity', 'hidden_layer_sizes': [5], 'learning_rate': 'invscaling', 'solver': 'adam'}
0.799596 with: {'activation': 'identity', 'hidden_layer_sizes': [5], 'learning_rate': 'adaptive', 'solver': 'lbfgs'}
0.749961 with: {'activation': 'identity', 'hidden_layer_si

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
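
The warning above points to two remedies: more iterations or scaled inputs. A minimal sketch of both together, assuming synthetic data with one deliberately wide-range column; `max_iter=2000` is an illustrative guess, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
X_demo[:, 0] *= 100.0  # mimic a wide-range column such as a fare

# Scaling happens inside the pipeline, so the network only sees standardized inputs.
mlp_pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                  solver="lbfgs", max_iter=2000, random_state=42),
)
mlp_pipe.fit(X_demo, y_demo)
print(mlp_pipe.score(X_demo, y_demo))
```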


In [431]:
#prints the best parameter result of the gridsearch
print(grid_result.best_params_)

{'activation': 'logistic', 'hidden_layer_sizes': [5], 'learning_rate': 'constant', 'solver': 'lbfgs'}


In [432]:
#refitting the model with tuned parameters, with max_iter raised to 1000
#note: solver="adam" below differs from the solver="lbfgs" reported as best by the grid search
mlp1 = MLPClassifier(activation="logistic", hidden_layer_sizes=5, learning_rate="constant", solver="adam",random_state=42, max_iter=1000)
mlp1.fit(X_train, y_train)
prediction_tuned = mlp1.predict(X_test)

In [433]:
#prints the predicted accuracy of the tuned model, in this case, the accuracy became higher than the original parameters
pred_acc_tuned_MLP = accuracy_score(y_test, prediction_tuned)
pred_acc_tuned_MLP

0.7821229050279329

In [434]:
#comparison between the Original Accuracy and the hyper tuned Accuracy
print("Original Accuracy: ", pred_acc)
print("Tuned Accuracy: ", pred_acc_tuned_MLP)

Original Accuracy:  0.7653631284916201
Tuned Accuracy:  0.7821229050279329


In [435]:
#prints the final predictions of the tuned model on the held-out test split, since its accuracy is around 2 percentage points higher
prediction_final_MLP = mlp1.predict(X_test)
prediction_final_MLP

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 1], dtype=int64)

# Comparison, Analysis and Conclusion

Data Comparison:

Accuracy:

In [446]:
print("Logistic Regression Accuracy: ",pred_acc_tuned_LR, "\n" "K-Nearest Neighbors Accuracy: ",pred_acc_tuned_KNN, "\n" "MultiLayer Perceptron Accuracy: ",pred_acc_tuned_MLP)

Logistic Regression Accuracy:  0.8100558659217877 
K-Nearest Neighbors Accuracy:  0.8324022346368715 
MultiLayer Perceptron Accuracy:  0.7821229050279329


Predictions:

In [451]:
print("Logistic Regression Prediction:\n",prediction_final_LR)

Logistic Regression Prediction:
 [0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 0 0 0
 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 0 1 1 1 1 1
 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 1 1]


In [449]:
print("K-Nearest Neighbors Prediction:\n",prediction_final_KNN)

K-Nearest Neighbors Prediction:
 [0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 0 1 1 1 0 1
 0 0 1 1 1 1 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
 0 1 1 0 0 0 0 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 0 0
 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1]


In [450]:
print("MultiLayer Perceptron Prediction:\n",prediction_final_MLP)

MultiLayer Perceptron Prediction:
 [0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0
 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1
 0 0 1 1 1 1 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 1 1]


Analysis and Conclusion:

Of the models that I used, I found Logistic Regression to be the easiest and most straightforward. The hardest one was the MultiLayer Perceptron neural network, since it required me to research a lot of syntax that I did not know before. However, after working on it since the first day, I managed to complete it with a reasonable accuracy. Comparing the three models, K-Nearest Neighbors had the highest accuracy (about 83.2%), followed by Logistic Regression (about 81.0%) and the MLP (about 78.2%). This may be because the cleaned data can be classified fairly easily: I dropped the mostly empty Cabin column and filled in the missing Age and Embarked values. It may also matter that only the KNN features were standardized before fitting. Both LR and KNN had an accuracy of more than 80% after hyperparameter tuning, and the MLP came close to 80%.

Looking at the prediction arrays, the three models gave the same prediction for most passengers in the test split. From this I conclude that the models are picking up on common factors related to whether a passenger survived or died. We cannot guarantee this, since no model reached 100% accuracy, but it is reasonable to assume that some of these shared predictions reflect real patterns.
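
The visual comparison of the arrays can be turned into a number. The sketch below measures pairwise agreement on the first 22 entries of each printed prediction array; the helper name `agreement` is hypothetical, and in the notebook the full `prediction_final_LR`, `prediction_final_KNN` and `prediction_final_MLP` arrays would be passed instead.

```python
import numpy as np

# First 22 entries of each prediction array printed in the Predictions section.
pred_lr = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0])
pred_knn = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1])
pred_mlp = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0])

def agreement(a, b):
    """Fraction of positions where two prediction arrays match."""
    return float(np.mean(a == b))

# Share of passengers on which all three models give the same label.
all_agree = float(np.mean((pred_lr == pred_knn) & (pred_knn == pred_mlp)))

print("LR vs KNN :", agreement(pred_lr, pred_knn))
print("LR vs MLP :", agreement(pred_lr, pred_mlp))
print("KNN vs MLP:", agreement(pred_knn, pred_mlp))
print("All three :", all_agree)
```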

From this coding experience, I gained a lot of knowledge about neural networks, especially in TensorFlow. However, I could not use that knowledge here, since the TensorFlow neural networks I learned about mostly focused on image processing. The project also gave me ideas on how to process my data. For example, in the Titanic dataset I could use the Cabin data to check whether there is a correlation between survival and having a cabin, since passengers would spend much of their time in their cabins. I could not implement this, because I did not know how to process the data to make it work that way.
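
One hedged way to start on the Cabin idea is to reduce the column to a yes/no flag before it is dropped. The sketch below uses the first five rows of the table shown earlier; the column name `HasCabin` is hypothetical, and five rows are far too few to draw any conclusion from.

```python
import numpy as np
import pandas as pd

# First five rows of the relevant columns, copied from the table shown earlier.
cabin_demo = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# 1 when a cabin is recorded, 0 when the cell is empty.
cabin_demo["HasCabin"] = cabin_demo["Cabin"].notna().astype(int)

# Survival rate within each group; on five rows this is only illustrative.
rates = cabin_demo.groupby("HasCabin")["Survived"].mean()
print(rates)
```

On the full dataset the same flag could be kept as an extra feature in `X` in place of the raw Cabin text.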