### Predicting Titanic Survival using..
#### **sklearn (Scikit Learn) machine learning (ML) library**
- **sklearn** is one of the most widely used Python libraries for machine learning, offering a broad selection of ready-to-use algorithms

The **titanic dataset** is a classic dataset used for Machine Learning practice and training
- the seaborn version contains 891 **observations** (rows) of 15 **variables** (columns) each
- the dataset is used to train ML models to predict the probability of surviving the Titanic disaster based on variables such as passenger class, age and gender (spoiler alert: first class female passengers had a *much* better chance of survival than third class males)
- **survived** column: 1 means survived; 0 means perished
- **sibsp** stands for siblings and spouses
- **parch** stands for parents and children
- some columns are repeats of each other:
  - **alive** (no, yes) is the string version of **survived** (0,1)
  - **pclass** (1,2,3) is the numeric version of **class** (First, Second, Third)
  - **who** (man, woman, child) overlaps **sex** (male, female)
  - **embarked** (S, C, Q) is short for **embark_town** (Southampton, Cherbourg, Queenstown)

**seaborn** is a visualization / plotting library built on top of matplotlib
- seaborn also has the titanic dataset built in, so no csv file required

**Kaggle Titanic Competition**
- If you would like to submit your predictions to **Kaggle Titanic Competition**, use their datasets, which come pre-divided into:
    - train (**train_titanic.csv**)
    - test (**test_titanic.csv**)

In [219]:
# import the usual basics:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
import pprint as pp
import seaborn as sns

In [220]:
# import sklearn machine learning packages:

# LabelEncoder converts string data to numeric:


# StandardScaler standardizes the data: each feature is rescaled
# to mean=0 and standard deviation=1, so for roughly normal data
# about 95% of all values fall within +-2 std (-2 to 2)


# import RandomForestClassifier for training model


#### **Scikit Learn (sklearn) packages**
- **sklearn.preprocessing.LabelEncoder** encodes target labels with value between **0** and **n_classes-1**
  - all data must be numeric for ML model training
  - so, passenger classes First, Second and Third become 0, 1 and 2
  - we need to convert object (string) columns ("sex", "embarked") into numbers
- **sklearn.model_selection.train_test_split** splits data into training and testing sets
  - **model** is trained on training set (fed data with the answers)
  - **model** is tested on testing set by having it predict answers (testing labels)
- **sklearn.preprocessing.StandardScaler** rescales each feature (column) so that its mean is 0 and its standard deviation is 1; values within +/- 1 std of the mean land in the -1 to 1 range
  - scale-sensitive models need all features on a comparable scale, so that age, which ranges from 0-80, does not outweigh passenger class, which ranges from just 0-2, simply because its numbers are bigger
  - tree-based models such as random forest split on one feature at a time, so they are largely insensitive to feature scaling
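A minimal sketch of what StandardScaler does, using a handful of made-up age and passenger class values (illustrative numbers, not rows from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up values for illustration: columns are [age, pclass]
X = np.array([[22.0, 3.0],
              [38.0, 1.0],
              [26.0, 3.0],
              [35.0, 1.0],
              [54.0, 2.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (value - column mean) / column std

# after scaling, every column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Both columns end up on the same scale, even though the raw ages were much larger numbers than the raw passenger classes.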

- **sklearn.metrics.confusion_matrix** makes a **confusion matrix** for each model (a 2x2 array showing TP, TN, FP, FN)

**What Is Random Forest?**
- random forest is a tree-based supervised learning algorithm
- random forest can be used for **regression** or **classification** problems; sklearn provides **RandomForestRegressor** and **RandomForestClassifier** respectively
- A random forest combines many decision trees (often hundreds), training each tree on a different random sample of the observations.
- The final prediction is made by averaging the predictions of the individual trees (regression) or by having the trees vote (classification).
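The "many trees, one combined answer" idea can be sketched on made-up toy data. Note that scikit-learn's RandomForestClassifier combines its trees by averaging their class probabilities and picking the most probable class, rather than counting hard votes. The data and the variable names below are assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# made-up toy data: 2 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# each fitted decision tree is available in forest.estimators_
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# the forest's probabilities are the average of the per-tree probabilities
avg_probs = tree_probs.mean(axis=0)
print(np.allclose(avg_probs, forest.predict_proba(X)))

# the predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]
print((manual_pred == forest.predict(X)).all())
```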

**What is Confusion Matrix?**
- a confusion matrix is a set of 4 values used to describe the results of a binary (two-class) classification: True Positive, False Positive, True Negative, False Negative
  - **True Positive** : model said it's a 'cat' and it is a 'cat'
  - **True Negative** : model said it's not a 'cat' and that's right
  - **False Positive** : model said it's a 'cat' but it's not
  - **False Negative** : model said it's not a 'cat' but it is
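As a small sketch of the four categories, here is sklearn's confusion_matrix applied to made-up labels (1 = 'cat', 0 = 'not cat'); the label lists are invented for illustration:

```python
from sklearn.metrics import confusion_matrix

# made-up labels for illustration: 1 = 'cat', 0 = 'not cat'
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# rows are the actual classes, columns are the predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Accuracy is then (TP + TN) divided by the total number of predictions.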


In [221]:
# load titanic dataset from within seaborn (no csv file)


In [222]:
# load the same data again BUT this time from train_titanic.csv file provided by Kaggle


In [223]:
# load up the test set:


In [224]:
print() # (418, 11)





In [225]:
# concat the two df's into one big df, w passenger ID's from 1-1309
# axis=0 so the concat happens along the row axis (vertically), stacking
# the test rows under the train rows: new rows, NOT new columns (for new cols specify axis=1)


In [226]:
# (1309, 12)
# titanic_df[888:894]
# L@@K: The index starts over after 890 (due to concat)
# FIX: reset the index

In [227]:
# reset index, specify drop=True to prevent old wonky index from being preserved as a new column:
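A minimal sketch of the concat and index reset described above, using tiny made-up stand-ins for the train and test frames (the names train_df and test_df are assumptions; the real frames have 891 and 418 rows):

```python
import pandas as pd

# tiny stand-ins for the Kaggle train and test frames (made-up rows)
train_df = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Pclass": [3, 1, 3]})
test_df = pd.DataFrame({"PassengerId": [4, 5], "Pclass": [2, 3]})  # no Survived column

# axis=0 stacks test_df underneath train_df (more rows, same columns)
titanic_df = pd.concat([train_df, test_df], axis=0)
index_before = titanic_df.index.tolist()
print(index_before)  # each frame kept its own 0-based index

# drop=True discards the old index instead of saving it as a new column
titanic_df = titanic_df.reset_index(drop=True)
print(titanic_df.index.tolist())
print(titanic_df.shape)
```

The test rows get NaN in the Survived column, since only the train set carries the answers.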


In [228]:
# instantiate LabelEncoder


In [229]:
# Convert 'Sex' from 'female', 'male' to 0, 1
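A sketch of this encoding step on a few made-up rows, saving the result in a 'SexEnc' column (the variable name label_encoder is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# made-up rows for illustration
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

label_encoder = LabelEncoder()
# classes are sorted alphabetically, so 'female' -> 0 and 'male' -> 1
df["SexEnc"] = label_encoder.fit_transform(df["Sex"])

print(label_encoder.classes_)
print(df)
```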


In [230]:
# (1309, 13)


In [231]:
# make a new column 'CabinKnown' with value of 1 if Cabin exists (only 204 of those) and 0 if it is NA
# Use a vectorized check; no need for apply(lambda)
# .notna() returns a boolean
# .astype(int) casts the bool as int (0,1)
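The vectorized check described in these comments could look like this on a few made-up Cabin values:

```python
import numpy as np
import pandas as pd

# made-up Cabin values; NaN means the cabin is unknown
df = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan, np.nan]})

# .notna() gives True/False per row; .astype(int) turns that into 1/0
df["CabinKnown"] = df["Cabin"].notna().astype(int)
print(df)
```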


In [232]:
# (1309, 14)


In [233]:
# add the 'Fare' -- one of the 1309 is missing.. fill with median


In [234]:
# 'PassengerId'
# 'SexEnc'

In [235]:
# get the median Age for filling in the nearly 200 missing ages


In [236]:
# fill all missing ages with the median (28.0)


In [237]:
# fill 2 missing Embarked values with most common, 'S'
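The three fill steps above (Fare, Age, Embarked) can be sketched on a tiny made-up frame. Here the most common port is looked up with mode() rather than hard-coded as 'S'; that choice is an assumption for the sketch:

```python
import numpy as np
import pandas as pd

# made-up rows with one gap per column type
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 53.10, 8.05],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# numeric columns: fill gaps with the column median (NaN is skipped when computing it)
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# categorical column: fill gaps with the most common value (the mode)
most_common_port = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common_port)

print(df)
```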


In [238]:
# merge embarked one-hot encoded cols with main titanic df:
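One way to one-hot encode the port column and merge the result back in is pd.get_dummies followed by a column-wise concat. This is a sketch on made-up rows; the name embarked_dummies is an assumption:

```python
import pandas as pd

# made-up rows for illustration
df = pd.DataFrame({"PassengerId": [1, 2, 3, 4], "Embarked": ["S", "C", "Q", "S"]})

# one 0/1 column per port; dtype=int avoids boolean columns in newer pandas
embarked_dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)

# axis=1 places the new columns side by side with the existing ones
df = pd.concat([df, embarked_dummies], axis=1)
print(df)
```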


In [239]:
print()





In [240]:
# KISS: make "Big X" the simplest possible feature / training set for a quick initial Kaggle submission
# , 'Age', 'Fare'

# [['Pclass', 'SexEnc', 'CabinKnown']] got a 0.77272 (good score)

In [241]:


# if there might be stray strings/empties:


In [242]:
# X

In [243]:
# X['Pclass']

In [244]:
print()





In [245]:
# save Survived as the target "little y_train"
# There is NO y_test (only Kaggle has those "answers")


In [246]:
# Now that we have X and y, we can train a model BUT first instantiate:
# instantiate RandomForestClassifier


# rand_forest_model = RandomForestClassifier(
#     n_estimators=400,       # many trees
#     max_depth=None,         # let trees grow, control with min_* instead
#     min_samples_split=4,
#     min_samples_leaf=1,
#     max_features="sqrt",    # good default for classification
#     bootstrap=True,
#     class_weight="balanced", # Titanic is imbalanced (~38% survived)
#     n_jobs=-1,
#     random_state=42,
#     criterion="gini"        # often edges out 'entropy' for RF
# )

In [247]:
# train the RandomForestClassifier model on X_train and y_train


In [248]:
# have the model predict survival of the test set (X_test)
# AGAIN : WE DO NOT HAVE y_test, those being the answers (ONLY Kaggle has that)
# Ergo: WE can only know how well our model did if we upload our predictions to Kaggle
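The split, train and predict steps above can be sketched end to end on a made-up stand-in for the combined frame. The row values, n_train and the hyperparameters are assumptions for illustration; in the real data the first 891 rows are the labeled train set:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# made-up stand-in for the combined frame: 6 labeled rows, then 3 unlabeled
titanic_df = pd.DataFrame({
    "Pclass":     [3, 1, 3, 1, 2, 3, 1, 3, 2],
    "SexEnc":     [1, 0, 0, 0, 1, 1, 0, 1, 0],
    "CabinKnown": [0, 1, 0, 1, 0, 0, 1, 0, 0],
    "Survived":   [0, 1, 1, 1, 0, 0, np.nan, np.nan, np.nan],
})
n_train = 6  # 891 in the real data

# "Big X" holds the features; split it back into train and test rows
X = titanic_df[["Pclass", "SexEnc", "CabinKnown"]]
X_train, X_test = X.iloc[:n_train], X.iloc[n_train:]

# "little y" is the target; it exists only for the train rows
y_train = titanic_df["Survived"].iloc[:n_train].astype(int)

rand_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
rand_forest_model.fit(X_train, y_train)

y_pred = rand_forest_model.predict(X_test)  # one 0/1 prediction per test row
print(y_pred)
```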


In [249]:
print()




In [250]:
print()




In [251]:
# check the model's predictions (even though we cannot know how accurate they are):


In [252]:
# make the required Kaggle df:
# MUST have exactly two cols: "PassengerId" and "Survived"
# "PassengerId" value: consec ints from 892-1309 (we can generate this w range())
# "Survived" value: our model's predictions as y_pred


In [253]:
print() # (418, 2)





In [254]:
# save the predictions df to csv for uploading to Kaggle:
# specify index=False or else the csv gets an extra index column (read back as 'Unnamed: 0'), which Kaggle will reject: the file MUST have exactly TWO cols ONLY
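A sketch of building, saving and re-loading the submission file. The predictions here are stand-in values and the file name is hypothetical; the file is written to a temp folder so the demo leaves no clutter behind:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# stand-in predictions; in the notebook these come from the trained model
y_pred = np.array([0, 1, 0, 0, 1])
first_test_id = 892  # first PassengerId in the Kaggle test set

submission_df = pd.DataFrame({
    "PassengerId": range(first_test_id, first_test_id + len(y_pred)),
    "Survived": y_pred,
})

# hypothetical file name, written to a temp folder for this demo
csv_path = os.path.join(tempfile.mkdtemp(), "titanic_submission.csv")
submission_df.to_csv(csv_path, index=False)  # index=False: no extra index column

# read it back to confirm that exactly two columns were saved
check_df = pd.read_csv(csv_path)
print(check_df.columns.tolist(), check_df.shape)
```

With the real 418 predictions, the PassengerId range runs from 892 to 1309 and the shape is (418, 2).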


In [255]:
# load the csv right back up to df to make sure it's good to go:


In [256]:
# predict 100% did not survive just to gain insight into the mysterious kaggle 'answer key'


In [257]:
# split() passenger name to isolate the title:
passenger_names = [ "Boulos, Mr. Hanna", "Duff Gordon, Sir. Cosmo Edmund (Mr Morgan)",
    "Jacobsohn, Mrs. Sidney Samuel (Amy Frances)", "Slabenoff, Mr. Petco",
    "Olsen, Mr. Henry Margido", "Lang, Mr. Fang", "Daly, Mr. Eugene Patrick",
    "Webber, Mr. James", "McGough, Mr. James Robert",
    "Andersson, Mrs. Anders Johan (Alfrida Konstant)", "Jardin, Mr. Jose Neto",
    "Laroche, Mrs. Joseph (Juliette Marie Louise)", "Shutes, Miss. Elizabeth W" ]

In [258]:
# challenge: get the title as "Mr" from the name
pass_name = "Duff Gordon, Sir. Cosmo Edmund (Mr Morgan)"
# first, split the name on the comma:

print()
# get the 2nd of the 2 items:

print()
# split the titled name on the dot:

print()
# get the 1st item from list--this is the title result we want:
# remove leading space
print()







In [259]:
# process the list of passenger names on a loop that calls the extract_title_from_name() function

    # first, split the name on the comma:
    # print(pass_name_list)
    # get the 2nd of the 2 items:

    # print(pass_name_titled)
    # split the titled name on the dot:

    # print(pass_title_name_list)
    # get the 1st item from list--this is the title result we want: replace leading space with empty string

    # print(pass_title)
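The commented steps above could be filled in roughly as follows. This is a sketch, not necessarily the original cell: it removes every space with replace(), which is an assumption based on 'theCountess' appearing without a space in the list of unique titles.

```python
def extract_title_from_name(pass_name):
    """Return the title (e.g. 'Mr') from a 'Surname, Title. Given names' string."""
    # first, split the name on the comma:
    pass_name_list = pass_name.split(",")
    # get the 2nd of the 2 items, e.g. ' Sir. Cosmo Edmund (Mr Morgan)':
    pass_name_titled = pass_name_list[1]
    # split the titled name on the dot:
    pass_title_name_list = pass_name_titled.split(".")
    # the 1st item is the title; replace spaces with empty string:
    pass_title = pass_title_name_list[0].replace(" ", "")
    return pass_title


# a few names from the passenger_names list
passenger_names = [
    "Boulos, Mr. Hanna",
    "Duff Gordon, Sir. Cosmo Edmund (Mr Morgan)",
    "Shutes, Miss. Elizabeth W",
]

# call the function on each name and collect the titles in a new list
titles = [extract_title_from_name(name) for name in passenger_names]
print(titles)
```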


In [260]:
# run a loop that iterates the passenger_names list, passing each name to function. Save return value of function -- the title -- to a new list:

print()




In [261]:
# make a new title column for the train_titanic.csv


In [262]:
# make Title col again BUT chain all title-extractor code into inline lambda:
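The chained, inline version might look like this on a small made-up frame with a 'Name' column (a sketch; the split-and-replace logic is the same assumption as in the step-by-step version):

```python
import pandas as pd

# a few names for illustration
df = pd.DataFrame({"Name": [
    "Boulos, Mr. Hanna",
    "Jacobsohn, Mrs. Sidney Samuel (Amy Frances)",
    "Shutes, Miss. Elizabeth W",
]})

# split on the comma, keep the 2nd part, split on the dot, keep the 1st part
df["Title"] = df["Name"].apply(
    lambda name: name.split(",")[1].split(".")[0].replace(" ", "")
)
print(df["Title"].unique())
```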


In [263]:
# print unique values from Title col


In [264]:
# ['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms', 'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'theCountess', 'Jonkheer']
# consolidate rare titles into two labels: "RareFemale" and "RareMale"
rare_male_titles = ['Don', 'Rev', 'Dr', 'Major', 'Sir', 'Col', 'Capt', 'Jonkheer']
rare_female_titles = ['Lady', 'theCountess', 'Ms', 'Mme', 'Mlle']
common_male_titles = ['Mr', 'Master']
common_female_titles = ['Mrs', 'Miss']
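One possible way to do the consolidation is to build a lookup dict from the two rare lists and pass it to replace(), which leaves the common titles untouched. The frame below is made up, and whether the original cell also relabeled the common titles is not known:

```python
import pandas as pd

rare_male_titles = ['Don', 'Rev', 'Dr', 'Major', 'Sir', 'Col', 'Capt', 'Jonkheer']
rare_female_titles = ['Lady', 'theCountess', 'Ms', 'Mme', 'Mlle']

# build one lookup dict: rare title -> consolidated label
title_groups = {title: "RareMale" for title in rare_male_titles}
title_groups.update({title: "RareFemale" for title in rare_female_titles})

# made-up Title values for illustration
df = pd.DataFrame({"Title": ["Mr", "Dr", "Mlle", "Miss", "Master", "Lady"]})

# replace() swaps only the titles found in the dict; the rest stay as they are
df["Title"] = df["Title"].replace(title_groups)
print(df["Title"].unique())
```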


In [265]:
# re-print unique values from Title col


In [266]:
# check survival of 'Master'

In [267]:
print()




In [268]:
# check survival of 'RareFemale' titles:


In [269]:
# check survival of 'RareMale' titles:


In [270]:
# check survival of 'CommonMale' titles:


In [271]:
# check survival of 'CommonFemale' titles:
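Survival checks like the ones above can be sketched with groupby, or with isin() for a whole group of titles. The rows below are made up, so the rates shown are not the real Titanic figures:

```python
import pandas as pd

# made-up rows for illustration only
df = pd.DataFrame({
    "Title":    ["Mr", "Mr", "Mrs", "Miss", "Master", "RareMale", "Mrs", "Mr"],
    "Survived": [0, 0, 1, 1, 1, 0, 1, 1],
})

# the mean of a 0/1 column is the share of survivors in each group
survival_by_title = df.groupby("Title")["Survived"].agg(["mean", "count"])
print(survival_by_title)

# a whole group of titles can be checked at once with isin()
common_male_titles = ["Mr", "Master"]
common_male_rate = df.loc[df["Title"].isin(common_male_titles), "Survived"].mean()
print(common_male_rate)
```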


**LabelEncoder.fit_transform(list_of_strings)** takes a list of strings and returns a corresponding NumPy array of integer codes (one code per distinct string, assigned in sorted order)




- **n_estimators=10**
  - Definition: Specifies the number of trees in the forest (i.e., the number of decision trees).
  - Explanation:
    - In a Random Forest model, multiple decision trees are built, and their predictions are averaged (for regression) or voted on (for classification).
    - The more trees you have, the more stable and accurate the model can be, although it may take longer to train.
  - In this case: You are using 10 decision trees.
- **criterion='entropy'**
  - Definition: This parameter specifies the function used to measure the quality of a split when constructing each tree.
  - Explanation:
    - The two common criteria are gini (Gini impurity) and entropy (information gain).
    - entropy: Measures how much information is gained by making a split. It uses the concept of information theory to find splits that reduce uncertainty (entropy) in the target labels.
    - gini: Measures the degree of "impurity" in the nodes and tends to be slightly faster.
  - In this case: The Random Forest uses entropy to evaluate how splits reduce the uncertainty of the class labels in the dataset.
- **random_state=42**
  - Definition: This parameter sets the seed for the random number generator.
  - Explanation:
    - Random Forests introduce randomness by selecting random subsets of data for each tree and selecting random subsets of features for splitting at each node.
    - random_state ensures reproducibility by controlling this randomness. Using the same seed (e.g., 42) ensures the same results across different runs (all else being equal).
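A small sketch of these three parameters in use, on made-up toy data (the helper fit_forest is hypothetical). Fitting twice with the same seed shows the reproducibility that random_state provides:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# made-up toy data: 3 features, binary target
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)


def fit_forest(seed):
    """Fit a 10-tree forest that scores splits by entropy, using the given seed."""
    model = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=seed)
    return model.fit(X, y)


model_a = fit_forest(42)
model_b = fit_forest(42)

print(len(model_a.estimators_))  # number of trees in the forest
# same seed, same data: the two forests give identical probabilities
print(np.array_equal(model_a.predict_proba(X), model_b.predict_proba(X)))
```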

- it's all good to go -- time to submit the csv file to kaggle
- download the csv file and upload it to kaggle at

 **https://www.kaggle.com/competitions/titanic**