# Notebook Description
> - Purpose: This notebook is a solution to the Titanic challenge as part of the DSC-UJ competition.
> - Date: November 2020.
> - Done by: Arwa Alghamdi.

This notebook trains several models. The table below lists their test-set accuracy, ordered from highest to lowest. The highest accuracy achieved is **85.47%**.

| Model | Accuracy |
| :- | :-: |
| DecisionTreeClassifier | 85.47% |
| RandomForestClassifier | 84.92% |
| LogisticRegression | 79.89% |
| SVM | 78.77% |
| Gaussian Naive Bayes | 78.77% |
| KNeighborsClassifier | 73.18% |

*Note: RandomForestClassifier is trained without a fixed random seed, so its results vary between runs; it sometimes reaches 85.47%, as in the run shown in section 6.*

## 1. Importing the required modules

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics

## 2. Reading the dataset csv file

In [2]:
df = pd.read_csv("Titanic - Data.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## 3. Dealing with NaN values in "Age" column
This is done by replacing NaN values with the mean of the non-NaN "Age" values, truncated to an integer (29).

In [3]:
n_known = df['Age'].count()  # number of rows with a known age
mean_age = int(df['Age'].sum() / n_known)  # mean of the known ages, truncated to an integer
print(n_known)
print(mean_age)
df['Age'] = df['Age'].fillna(mean_age)

714
29
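The imputation step can be illustrated on a small column. This is a minimal sketch with a hypothetical toy `ages` series (not the Titanic data); it relies on `Series.mean()` skipping NaN values by default, so the number of known rows does not have to be typed in by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: three known ages and two missing ones.
ages = pd.Series([20.0, np.nan, 31.0, np.nan, 40.0], name="Age")

# Series.mean() ignores NaN, so this is (20 + 31 + 40) / 3 = 30.33...,
# which int() truncates to 30.
fill_value = int(ages.mean())

# Replace every missing value with the truncated mean.
filled = ages.fillna(fill_value)
print(filled.tolist())  # [20.0, 30.0, 31.0, 30.0, 40.0]
```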


## 4. Selecting the feature columns

### 4.1 Creating a LabelEncoder object to transform text data into numbers

In [4]:
# Importing LabelEncoder to transform text data into numbers.
from sklearn import preprocessing
# Creating labelEncoder object.
features_encoder = preprocessing.LabelEncoder()

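### 4.2 Creating a new dataframe with every column encoded as strings, to take the "Embarked" column from it

In [7]:
# Cast each column to str before encoding, so missing "Embarked" values become their own category.
df2 = df.apply(lambda col: features_encoder.fit_transform(col.astype(str)), axis=0, result_type='expand')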

### 4.3 Transforming the "Sex" column's values into numbers

In [8]:
Sex_encoded = features_encoder.fit_transform(df['Sex'])
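To make the encoding concrete, here is a small sketch of what `LabelEncoder` does with text values. The input lists are hypothetical toy data, and the literal `"nan"` string is an assumption about how a missing "Embarked" value looks after the `astype(str)` cast used in 4.2.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# LabelEncoder sorts the distinct values and numbers them from 0.
sex = ["male", "female", "female", "male"]
sex_encoded = encoder.fit_transform(sex)
print(encoder.classes_.tolist())  # ['female', 'male']
print(sex_encoded.tolist())       # [1, 0, 0, 1]

# Assumption: after a str cast, a missing port shows up as the string "nan",
# which the encoder treats as just another category.
embarked = ["S", "nan", "C", "S"]
embarked_encoded = encoder.fit_transform(embarked)
print(encoder.classes_.tolist())   # ['C', 'S', 'nan']
print(embarked_encoded.tolist())   # [1, 2, 0, 1]
```

Because the same encoder object is refit on each call, `classes_` only reflects the most recently fitted column.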

### 4.4 Choosing "Sex", "Pclass", "Age", "Fare", and "Embarked" as features and zipping them into one list called *features*

In [9]:
features = list(zip(Sex_encoded, df['Pclass'], df['Age'], df['Fare'], df2['Embarked']))

### 4.5 Splitting the data into train and test sets, each with features (x) and labels (y)
The result is four sets: x_train, x_test, y_train, and y_test. 20% of the rows are held out for testing.

In [10]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, df['Survived'],
                                                    test_size=0.20, random_state=0)
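The effect of `test_size` and `random_state` can be seen on a tiny dataset. This is a sketch on hypothetical toy data (10 samples), not the notebook's features; it uses the same two arguments as the cell above.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with one feature each and alternating labels.
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]

# Same arguments as the notebook: hold out 20% and fix the shuffle seed.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
print(len(x_train), len(x_test))  # 8 2

# Repeating the call with the same random_state gives the identical split.
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.20, random_state=0)
print(x_test == x_test2)  # True
```

Fixing `random_state` is what makes the accuracy figures of the different models comparable, since every model is scored on the same held-out rows.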

## 5. DecisionTreeClassifier

In [11]:
from sklearn.tree import DecisionTreeClassifier

# Create the decision tree classifier.
# Note: min_impurity_split was removed in scikit-learn 1.0, so this cell needs an older release.
clf = DecisionTreeClassifier(criterion='gini', max_depth=10, min_samples_split=8,
                             max_leaf_nodes=30, min_impurity_split=0.23)

# Train the decision tree classifier.
clf = clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Model precision: what fraction of predicted positives are actually positive?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model recall: what fraction of actual positives are predicted as positive?
print("Recall:", metrics.recall_score(y_test, y_pred))

Accuracy: 0.8547486033519553
Precision: 0.9056603773584906
Recall: 0.6956521739130435
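The three scores follow directly from the counts of a confusion matrix. The sketch below uses hypothetical counts (`tp`, `fp`, `fn`, `tn`) that were chosen to be consistent with the scores printed above; they are an assumption, not values read from the notebook.

```python
# Hypothetical confusion-matrix counts, chosen to be consistent with the
# scores printed above (assumption, not taken from the notebook's output).
tp, fp, fn, tn = 48, 5, 21, 105

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Precision: share of predicted positives that are actually positive.
precision = tp / (tp + fp)

# Recall: share of actual positives that are predicted as positive.
recall = tp / (tp + fn)

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.8547 0.9057 0.6957
```

A high precision paired with a lower recall, as here, means the model rarely mislabels a non-survivor as a survivor but misses a fair share of actual survivors.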




## 6. RandomForestClassifier

In [12]:
from sklearn.ensemble import RandomForestClassifier

# Create the random forest classifier.
clf = RandomForestClassifier(bootstrap=False,
                             max_depth=110,
                             max_features=1,
                             min_samples_leaf=3,
                             min_samples_split=10,
                             n_estimators=200)

# Train the model on the training set.
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

# Use the score method to get the accuracy of the model.
score = clf.score(x_test, y_test)
print(score)
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Model precision: what fraction of predicted positives are actually positive?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model recall: what fraction of actual positives are predicted as positive?
print("Recall:", metrics.recall_score(y_test, y_pred))

0.8547486033519553
Accuracy: 0.8547486033519553
Precision: 0.8909090909090909
Recall: 0.7101449275362319
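The run-to-run variation of the random forest comes from its random feature selection at each split. One way to get repeatable numbers, sketched here on hypothetical synthetic data with smaller illustrative hyperparameters, is to pass `random_state`; this is a suggestion and not something the notebook's cell does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

def fit_predict(seed):
    # Illustrative settings, smaller than the notebook's; random_state fixes
    # the per-split feature sampling that max_features=1 introduces.
    model = RandomForestClassifier(n_estimators=20, max_features=1,
                                   bootstrap=False, random_state=seed)
    return model.fit(X, y).predict_proba(X)

# Two forests built with the same seed produce identical probabilities.
same = bool((fit_predict(42) == fit_predict(42)).all())
print(same)  # True
```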


## 7. LogisticRegression

In [13]:
from sklearn.linear_model import LogisticRegression

# All parameters not specified are set to their defaults.
logisticRegr = LogisticRegression()

logisticRegr.fit(x_train, y_train)

# Use the score method to get the accuracy of the model.
score = logisticRegr.score(x_test, y_test)

y_pred = logisticRegr.predict(x_test)

print(score)
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Model precision: what fraction of predicted positives are actually positive?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model recall: what fraction of actual positives are predicted as positive?
print("Recall:", metrics.recall_score(y_test, y_pred))

0.7988826815642458
Accuracy: 0.7988826815642458
Precision: 0.7323943661971831
Recall: 0.7536231884057971


## 8. SVM

In [14]:
# Import the SVM module.
from sklearn import svm

# Create an SVM classifier with a linear kernel.
clf = svm.SVC(kernel='linear')

# Train the model on the training set.
clf.fit(x_train, y_train)

# Predict the response for the test set.
y_pred = clf.predict(x_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Model precision: what fraction of predicted positives are actually positive?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model recall: what fraction of actual positives are predicted as positive?
print("Recall:", metrics.recall_score(y_test, y_pred))

Accuracy: 0.7877094972067039
Precision: 0.7313432835820896
Recall: 0.7101449275362319


## 9. Gaussian Naive Bayes

In [15]:
# Import the Gaussian Naive Bayes model.
from sklearn.naive_bayes import GaussianNB

# Create the Gaussian Naive Bayes classifier.
gnb = GaussianNB(var_smoothing=1e-20)

# Train the model on the training set.
gnb.fit(x_train, y_train)

# Predict the response for the test set.
y_pred = gnb.predict(x_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Model precision: what fraction of predicted positives are actually positive?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model recall: what fraction of actual positives are predicted as positive?
print("Recall:", metrics.recall_score(y_test, y_pred))

Accuracy: 0.7877094972067039
Precision: 0.7012987012987013
Recall: 0.782608695652174


## 10. KNeighborsClassifier

In [16]:
from sklearn.neighbors import KNeighborsClassifier

# Create a k-nearest-neighbors classifier with k = 12.
model = KNeighborsClassifier(n_neighbors=12)

# Train the model on the training set.
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Model precision: what fraction of predicted positives are actually positive?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model recall: what fraction of actual positives are predicted as positive?
print("Recall:", metrics.recall_score(y_test, y_pred))

Accuracy: 0.7318435754189944
Precision: 0.7441860465116279
Recall: 0.463768115942029
