#### [Titanic Project](https://www.kaggle.com/datasets/shubhamgupta012/titanic-dataset)

The Titanic Passenger dataset provides information about passengers who were aboard the RMS Titanic during its ill-fated maiden voyage. This dataset is often used for exploring patterns and factors associated with survival on the Titanic.

<mark>The dataset includes the following columns:</mark>

PassengerId: Unique identifier for each passenger.

Survived: Survival status of the passenger (0 = Not Survived, 1 = Survived).

Pclass: Passenger class (1 = First class, 2 = Second class, 3 = Third class).

Sex: Gender of the passenger.

Age: Age of the passenger.

SibSp: Number of siblings/spouses aboard the Titanic.

Parch: Number of parents/children aboard the Titanic.

Fare: Fare paid by the passenger.

Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). In the CSV loaded below, this column is already stored as the integer codes 1, 2 and 3.

In [1]:
import pandas as pd
import numpy as np

In [2]:
passengers = pd.read_csv('SVMtrain.csv')
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,Male,22.0,1,0,7.25,3
1,2,1,1,female,38.0,1,0,71.2833,1
2,3,1,3,female,26.0,0,0,7.925,3
3,4,1,1,female,35.0,1,0,53.1,3
4,5,0,3,Male,35.0,0,0,8.05,3


In [12]:
X = passengers.drop(['Survived', 'PassengerId'], axis=1)
y = passengers['Survived']

In [14]:
X.head(3)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,Male,22.0,1,0,7.25,3
1,1,female,38.0,1,0,71.2833,1
2,3,female,26.0,0,0,7.925,3


In [15]:
y.head(3)

0    0
1    1
2    1
Name: Survived, dtype: int64

In [4]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train);

ValueError: could not convert string to float: 'Male'

In [6]:
passengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Survived     889 non-null    int64  
 2   Pclass       889 non-null    int64  
 3   Sex          889 non-null    object 
 4   Age          889 non-null    float64
 5   SibSp        889 non-null    int64  
 6   Parch        889 non-null    int64  
 7   Fare         889 non-null    float64
 8   Embarked     889 non-null    int64  
dtypes: float64(2), int64(6), object(1)
memory usage: 62.6+ KB


<span style="color:pink">`Sex` is stored as strings (dtype `object`), which the classifier cannot fit on, so we need to convert categorical values to numeric values...</span>

In [7]:
passengers['Pclass'].value_counts()

3    491
1    214
2    184
Name: Pclass, dtype: int64

In [None]:
passengers['Sex'].value_counts()

Male      577
female    312
Name: Sex, dtype: int64

In [8]:
passengers['SibSp'].value_counts()

0    606
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

In [None]:
passengers['Parch'].value_counts()

0    676
1    118
2     80
3      5
5      5
4      4
6      1
Name: Parch, dtype: int64

In [10]:
passengers['Embarked'].value_counts()

3    644
1    168
2     77
Name: Embarked, dtype: int64

In [22]:
X.head(1)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,Male,22.0,1,0,7.25,3


In [23]:
# Turn categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                  one_hot,
                                  categorical_features)],
                                  remainder='passthrough')
transformed_X = transformer.fit_transform(X)
transformed_X

<889x24 sparse matrix of type '<class 'numpy.float64'>'
	with 6208 stored elements in Compressed Sparse Row format>

In [29]:
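# transformed_X is a SciPy sparse matrix, so pandas puts each row into a single cell;
# pd.DataFrame(transformed_X.toarray()) would give one column per encoded feature
pd.DataFrame(transformed_X)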

Unnamed: 0,0
0,"(0, 2)\t1.0\n (0, 3)\t1.0\n (0, 6)\t1.0\n ..."
1,"(0, 0)\t1.0\n (0, 4)\t1.0\n (0, 6)\t1.0\n ..."
2,"(0, 2)\t1.0\n (0, 4)\t1.0\n (0, 5)\t1.0\n ..."
3,"(0, 0)\t1.0\n (0, 4)\t1.0\n (0, 6)\t1.0\n ..."
4,"(0, 2)\t1.0\n (0, 3)\t1.0\n (0, 5)\t1.0\n ..."
...,...
884,"(0, 1)\t1.0\n (0, 3)\t1.0\n (0, 5)\t1.0\n ..."
885,"(0, 0)\t1.0\n (0, 4)\t1.0\n (0, 5)\t1.0\n ..."
886,"(0, 2)\t1.0\n (0, 4)\t1.0\n (0, 6)\t1.0\n ..."
887,"(0, 0)\t1.0\n (0, 3)\t1.0\n (0, 5)\t1.0\n ..."
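The single-column frame above is hard to read because the encoder output is sparse. A minimal sketch of a dense view follows, assuming a small made-up frame (`toy_X` and `toy_transformer` are hypothetical names, not part of the notebook) with the same kinds of columns as `X`:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the same kinds of columns as X
toy_X = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["Male", "female", "female", "Male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

toy_transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Pclass", "Sex"])],
    remainder="passthrough",
)
toy_transformed = toy_transformer.fit_transform(toy_X)

# ColumnTransformer can return a sparse or a dense result depending on
# how many zeros the output contains, so handle both cases.
if sparse.issparse(toy_transformed):
    toy_dense = toy_transformed.toarray()
else:
    toy_dense = np.asarray(toy_transformed)

# One column per encoded category, followed by the passthrough column (Age)
dense_df = pd.DataFrame(toy_dense)
print(dense_df.shape)
print(dense_df.head())
```

The one-hot columns come first (three for `Pclass`, two for `Sex`), and the untouched `Age` column is appended at the end.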


In [30]:
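# By default get_dummies only expands object/category columns,
# so only Sex is encoded here; the integer columns pass through unchanged
dummies = pd.get_dummies(passengers[['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']])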
dummies.head()

Unnamed: 0,Pclass,SibSp,Parch,Embarked,Sex_Male,Sex_female
0,3,1,0,3,1,0
1,1,1,0,1,0,1
2,3,0,0,3,0,1
3,1,1,0,3,0,1
4,3,0,0,3,1,0
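In the output above, only `Sex` was expanded. If the integer-coded columns should be one-hot encoded as well, one option is the `columns` argument of `pd.get_dummies`. The sketch below uses a small made-up frame (`toy` is a hypothetical name) rather than the real data:

```python
import pandas as pd

# Hypothetical toy data mimicking three of the notebook's columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["Male", "female", "female", "Male"],
    "Embarked": [3, 1, 3, 2],
})

# Listing the columns explicitly forces encoding even for integer dtypes
toy_dummies = pd.get_dummies(toy, columns=["Pclass", "Sex", "Embarked"])
print(list(toy_dummies.columns))
```

Each listed column is replaced by one indicator column per distinct value, named `<column>_<value>`.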


In [26]:
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)
clf.fit(X_train, y_train)

RandomForestClassifier()

In [28]:
# let's score our model
clf.score(X_test, y_test)

0.7584269662921348

In [31]:
# let's make a prediction
y_preds = clf.predict(X_test)
y_preds

array([0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 1])

In [32]:
clf.score(X_train, y_train)

0.9845288326300985

In [33]:
# let's look at some other metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.82      0.77      0.80       109
           1       0.67      0.74      0.70        69

    accuracy                           0.76       178
   macro avg       0.75      0.75      0.75       178
weighted avg       0.76      0.76      0.76       178



In [34]:
confusion_matrix(y_test, y_preds)

array([[84, 25],
       [18, 51]])
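The figures in the classification report can be recovered from this confusion matrix by hand. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([[84, 25],
               [18, 51]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision_1 = tp / (tp + fp)      # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)         # of actual survivors, how many were found
precision_0 = tn / (tn + fn)      # same two measures for the non-survivor class
recall_0 = tn / (tn + fp)

print(f"accuracy    {accuracy:.4f}")
print(f"class 1     precision {precision_1:.2f}, recall {recall_1:.2f}")
print(f"class 0     precision {precision_0:.2f}, recall {recall_0:.2f}")
```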

In [35]:
accuracy_score(y_test, y_preds)

0.7584269662921348

In [36]:
# let's try to improve our model by changing some of the hyperparameters...
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [38]:
# Let's see what we get with different n_estimators...
np.random.seed(42)
for i in range(10, 100, 10):
    print(f'Trying model with {i} estimators...')
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f'Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%')
    print('')

Trying model with 10 estimators...
Model accuracy on test set: 75.84%

Trying model with 20 estimators...
Model accuracy on test set: 74.72%

Trying model with 30 estimators...
Model accuracy on test set: 74.72%

Trying model with 40 estimators...
Model accuracy on test set: 76.97%

Trying model with 50 estimators...
Model accuracy on test set: 77.53%

Trying model with 60 estimators...
Model accuracy on test set: 76.40%

Trying model with 70 estimators...
Model accuracy on test set: 76.40%

Trying model with 80 estimators...
Model accuracy on test set: 74.72%

Trying model with 90 estimators...
Model accuracy on test set: 75.84%



`n_estimators` represents the number of trees in the forest.

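There is no fixed optimal number of trees: adding trees generally stabilizes the predictions with diminishing returns, at the cost of longer training. The runs above fluctuate between 74.72% and 77.53% with no clear trend. The square-root rule of thumb applies to `max_features` instead: for classification, a common choice is to consider roughly the square root of the number of features at each split (about 4.9 for the 24 encoded columns here).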
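In scikit-learn, the square-root rule is exposed through the `max_features` parameter, which controls how many features each split considers. The sketch below uses synthetic data from `make_classification` with 24 features (an assumption standing in for the encoded Titanic matrix; `X_demo`, `y_demo` and `forest` are hypothetical names) and reads back the value a fitted tree actually uses:

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 24 features, like the encoded matrix above
X_demo, y_demo = make_classification(n_samples=200, n_features=24, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree records how many features it considers per split
features_per_split = forest.estimators_[0].max_features_
print(f"sqrt(24) = {math.sqrt(24):.2f}, features per split = {features_per_split}")
```

The square root is truncated to an integer, so each split considers 4 of the 24 features.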

Here are some helpful videos for understanding [decision trees](https://www.youtube.com/watch?v=ZVR2Way4nwQ) and [random forest classifier](https://www.youtube.com/watch?v=v6VJ2RO66Ag).

Some other hyperparameters we can tune:

n_estimators = number of trees in the forest

max_features = max number of features considered for splitting a node

max_depth = max number of levels in each decision tree

min_samples_split = min number of data points a node must contain before it can be split

min_samples_leaf = min number of data points allowed in a leaf node

bootstrap = whether each tree is trained on a bootstrap sample (drawn with replacement) or on the whole training set
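One way to explore these hyperparameters together is a randomized search with cross-validation. The sketch below is illustrative only: it runs on synthetic data from `make_classification` rather than the Titanic split, and the candidate values in `param_distributions` are arbitrary assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; swap in the real training split when adapting this
X_demo, y_demo = make_classification(n_samples=300, n_features=24, random_state=0)

# Candidate values for the hyperparameters listed above (arbitrary choices)
param_distributions = {
    "n_estimators": [10, 50, 100],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Try 5 random combinations, each scored with 3-fold cross-validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

Scoring each candidate by cross-validation rather than on a single test split makes the comparison less sensitive to which rows happen to land in the test set.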