### Predicting Titanic Survival using..
#### **sklearn (Scikit Learn) machine learning (ML) library**
- **sklearn** is one of the most widely used Python libraries for machine learning; it provides a broad selection of ready-made algorithms to apply to your data

**titanic dataset** is an iconic dataset used for Machine Learning practice and training
- it contains 891 **observations** (rows) of 15 **variables** (columns) each
- the dataset is used to train ML models to predict the probability of surviving the Titanic disaster based on variables such as passenger class, age and gender (spoiler alert: first class female passengers had a *much* better chance of survival than third class males)
- **survived** column: 1 means survived; 0 means perished
- **sibsp** stands for siblings and spouses
- **parch** stands for parents and children
- some columns are repeats of each other:
  - **alive** (no, yes) is the string version of **survived** (0,1)
  - **pclass** (1,2,3) is the numeric version of **class** (First, Second, Third)
  - **who** (man, woman, child) overlaps **sex** (male, female)
  - **embarked** (S, C, Q) is short for **embark_town** (Southampton, Cherbourg, Queenstown)

**seaborn** is another visualization / plotting library, similar to matplotlib
- seaborn also has the titanic dataset built in, so no csv file required

**Kaggle Titanic Competition**
- If you would like to submit your predictions to the **Kaggle Titanic Competition**, use their datasets, which come pre-divided into:
    - train (**train_titanic.csv**)
    - test (**test_titanic.csv**)

In [693]:
# import the usual basics:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
import pprint as pp
import seaborn as sns

In [694]:
# import sklearn machine learning packages:

# LabelEncoder converts string data to numeric:
from sklearn.preprocessing import LabelEncoder

# StandardScaler standardizes each feature to mean 0 and standard deviation 1;
# for roughly bell-shaped data, about 95% of the scaled values
# fall within +/-2 standard deviations (-2 to 2)
from sklearn.preprocessing import StandardScaler

# import RandomForestClassifier for training model
from sklearn.ensemble import RandomForestClassifier

In [695]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### **Scikit Learn (sklearn) packages**
- **sklearn.preprocessing.LabelEncoder** encodes target labels with value between **0** and **n_classes-1**
  - all data must be numeric for ML model training
  - so, passenger classes First, Second and Third become 0, 1 and 2
  - we need to convert object (string) columns ("sex", "embarked") into numbers
- **sklearn.model_selection.train_test_split** splits data into training and testing sets
  - **model** is trained on training set (fed data with the answers)
  - **model** is tested on testing set by having it predict answers (testing labels)
- **sklearn.preprocessing.StandardScaler** rescales every feature to a standard range,  
where each feature's mean becomes 0 and values within +/- 1 std of the mean land in the -1 to 1 range
  - scale-sensitive models need all features in a similar range, so that they don't
  conclude that age, which ranges from 0-80,  
  is 40 times more important in predicting survival than passenger class, which ranges from just 1-3
  - tree-based models such as random forest split on one feature at a time, so they are largely insensitive to feature scale

- **sklearn.metrics.confusion_matrix** makes a **confusion matrix** for a model (a 2x2 array showing TN, FP, FN, TP)
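
A minimal sketch of what the two preprocessing classes described above do, using a few made-up values rather than the Titanic data (all variable names here are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# LabelEncoder: strings -> ints, with the classes sorted alphabetically
sex = ["male", "female", "female", "male"]   # made-up sample
encoder = LabelEncoder()
sex_codes = encoder.fit_transform(sex)       # female -> 0, male -> 1
print(encoder.classes_, sex_codes)

# StandardScaler: two features on very different scales (age, passenger class)
features = np.array([[22.0, 3], [38.0, 1], [26.0, 3], [35.0, 1]])  # made-up rows
scaled = StandardScaler().fit_transform(features)

# after scaling, every column has mean 0 and standard deviation 1
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, age and passenger class are on the same footing, which is what scale-sensitive models need.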

**What Is Random Forest?**
- random forest is a tree-based supervised learning algorithm
- random forests can solve both **classification** and **regression** problems; **RandomForestClassifier** is the classification version (sklearn also has **RandomForestRegressor**)
- A random forest combines many decision trees (often hundreds) and trains each tree on a different random sample of the observations.
- For classification, the final prediction combines the votes of the individual trees (sklearn averages the trees' predicted class probabilities); for regression, it averages the trees' predictions.
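
To make the "many trees, combined" idea concrete, here is a small sketch on synthetic stand-in data (not the Titanic set). It assumes sklearn's documented behavior that the forest's class probabilities are the average of its trees' probabilities; the names `X_demo`, `y_demo` and `forest` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data (NOT the Titanic set): 200 rows, 6 features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# the fitted forest holds one decision tree per estimator
print(len(forest.estimators_))

# average the class probabilities predicted by every individual tree...
tree_probs = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

# ...and check that this matches the forest's own probabilities
print(np.allclose(tree_probs, forest.predict_proba(X_demo)))
```

The predicted class is then simply the one with the highest averaged probability.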

**What is Confusion Matrix?**
- a confusion matrix is a set of 4 counts that describes how a classifier's predictions compare with the true labels:
- the 4 categories are: True Positive, False Positive, True Negative, False Negative
  - **True Positive** : model said it's a 'cat' and it is a 'cat'
  - **True Negative** : model said it's not a 'cat' and that's right
  - **False Positive** : model said it's a 'cat' but it's not
  - **False Negative** : model said it's not a 'cat' but it is
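
The four categories can be counted with sklearn's `confusion_matrix`. This sketch uses made-up labels (1 = survived, 0 = perished); with the Kaggle data it would only work on a held-out slice of the training set, since the true test labels are not available.

```python
from sklearn.metrics import confusion_matrix

# made-up labels: 1 = survived, 0 = perished
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
# rows are the true labels, columns are the predicted labels:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

Here the model gets 3 survivors and 3 non-survivors right, wrongly predicts survival once (FP) and misses one survivor (FN).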


In [696]:
# load titanic dataset from within seaborn (no csv file)
titanic_sns_df = sns.load_dataset('titanic')

In [697]:
print(titanic_sns_df.shape) # (891, 15)
titanic_sns_df.head()

(891, 15)


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [698]:
# check for missing data
titanic_sns_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [699]:
# get the sns titanic df col names into a list
# the sns dataset is the SAME 891 passengers as the Kaggle train_titanic.csv
sns_titanic_cols = list(titanic_sns_df.columns)
print("SNS Titanic Cols:", len(sns_titanic_cols), sns_titanic_cols)

SNS Titanic Cols: 15 ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']


In [700]:
base_path = "/content/drive/MyDrive/____Intro-Python-Machine-Learning-Dec-2025"

In [701]:
# load the same data again BUT this time from train_titanic.csv file provided by Kaggle
titanic_train_df = pd.read_csv(base_path + '/csv/train_titanic.csv')

In [702]:
print(titanic_train_df.shape)
titanic_train_df.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [703]:
kaggle_titanic_cols = list(titanic_train_df.columns)
print("Kaggle Titanic Cols:", len(kaggle_titanic_cols), kaggle_titanic_cols)

Kaggle Titanic Cols: 12 ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


In [704]:
# compare sns to kaggle datasets:

# SNS Titanic Cols: 15
# ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']

# Kaggle Titanic Cols: 12
# ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

In [705]:
# load up the test set:
titanic_test_df = pd.read_csv(base_path + '/csv/test_titanic.csv')

In [706]:
print(titanic_test_df.shape) # (418, 11)
titanic_test_df.head()
# PassengerId is consecutive ints from 892-1309
# this col must be included in the 2-col csv submitted to Kaggle
# the csv is saved from a 2-col df: PassengerId, Survived

(418, 11)


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


### **Feature Engineering Game Plan**
Feature Engineering means making new columns to include in the feature set for Machine Learning training and testing

1. Make a new column **'SexInt'** where 'female'=0 and 'male'=1

2. Make a new column **'CabinKnown'** with value of:
    - 1 if Cabin exists (only 204 of those in the training set)
    - 0 if it is NA

   To make this, use a vectorized check; no need for apply(lambda):
    - .notna() returns a boolean
    - .astype(int) casts the bool as int (0,1)

3. Make a new column **'Deck'** from the first letter of 'Cabin',
     with 'U' (for 'unknown') being its own deck -- there are a lot of unknowns
     - consolidate low-freq decks, A F G T, into one cat: 'R' for rare

4. One hot encode 'Deck', where all 6 unique deck letters get their own columns, prefixed with 'Deck':
    - **'Deck_B'  'Deck_C'  'Deck_D' 'Deck_E'  'Deck_R' 'Deck_U'**
    - in one hot encoding only one of the cols gets a 1; the other cols get 0

5. Fill the one missing 'Fare' (in the test set) with median fare--no other changes

6. 'Age' col is missing a lot of values, fill w median age--no other changes

7. Make a new **'AgeKnown'** col, with values of 1 for known and 0 for unknown;
    - same procedure used for making 'CabinKnown' : .notna().astype(int)

8. Make a **'FamilySize'** col, the value of which is SibSp + Parch + 1.

9. Fill the 2 missing Embarked values with most common, 'S' (Southampton)

10. One hot encode Embarked, where all 3 towns, S, C, Q, get their own columns, with 'Town' prefix:
    - **'Town_C' 'Town_Q' 'Town_S'**
    - in one hot encoding only one of the cols gets a 1; the other cols get 0
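
The notebook carries out these steps one cell at a time, separately for train and test. As an alternative sketch (not the notebook's own method), the same plan could be wrapped in one helper so both sets are guaranteed to get identical columns. `engineer_features`, `RARE_DECKS`, `DECK_COLS`, `TOWN_COLS` and the toy rows are hypothetical names and data made up for this illustration.

```python
import pandas as pd

# hypothetical constants for this sketch
RARE_DECKS = {"A": "R", "F": "R", "G": "R", "T": "R"}
DECK_COLS = ["Deck_B", "Deck_C", "Deck_D", "Deck_E", "Deck_R", "Deck_U"]
TOWN_COLS = ["Town_C", "Town_Q", "Town_S"]

def engineer_features(df, median_age, median_fare):
    """Hypothetical helper: build the model features from raw Kaggle-style columns."""
    out = pd.DataFrame(index=df.index)
    out["Pclass"] = df["Pclass"]
    out["SexInt"] = (df["Sex"] != "female").astype(int)      # female=0, male=1
    out["AgeFilled"] = df["Age"].fillna(median_age)
    out["AgeKnown"] = df["Age"].notna().astype(int)
    out["Fare"] = df["Fare"].fillna(median_fare)
    out["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    # first letter of Cabin, 'U' for unknown, rare decks merged into 'R'
    deck = df["Cabin"].str[0].fillna("U").replace(RARE_DECKS)
    # reindex so every expected dummy column exists, even if a category is absent
    decks = pd.get_dummies(deck, prefix="Deck").reindex(columns=DECK_COLS, fill_value=0)
    towns = pd.get_dummies(df["Embarked"].fillna("S"), prefix="Town").reindex(
        columns=TOWN_COLS, fill_value=0
    )
    return pd.concat([out, decks.astype(int), towns.astype(int)], axis=1)

# three made-up passengers with some values missing
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, None, 40.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 2],
    "Fare": [7.25, 71.28, None],
    "Cabin": [None, "C85", "T1"],
    "Embarked": ["S", "C", None],
})

features = engineer_features(toy, median_age=28.0, median_fare=14.45)
print(features.shape)
print(features.iloc[2])
```

The `reindex` calls are the design choice worth noting: they keep the column set fixed even when a deck or town never appears in a given dataset, which is exactly the situation with deck 'T' in the test set.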

**map()** is a function that takes a function argument
- it runs the function arg on every item in an iterable
- the function can be a named function OR a **lambda**
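- Python's built-in map returns a lazy iterator, so wrap it in **list()** to get a new list
- map can be called on a df col, where it returns a new col (Series):
  - **df['new_col'] = df['col'].map(doStuff)**
- the built-in map takes the list to iterate as its second arg: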
  - **new_list = list(map(doStuff,my_list))**
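
A quick sketch of the two flavors of map side by side. The `shout` function and the sample values are made up; the dict form of `Series.map` is shown as one more option alongside passing a function.

```python
import pandas as pd

def shout(word):
    return word.upper() + "!"

# built-in map: returns a lazy iterator, nothing is computed until it is consumed
lazy = map(shout, ["ahoy", "iceberg"])
words = list(lazy)
print(words)

# pandas Series.map: returns a new Series; it accepts a function OR a dict
sex = pd.Series(["male", "female", "female"])
sex_int = sex.map({"female": 0, "male": 1})
print(sex_int.tolist())
```

With a dict, any value that is not one of the keys becomes NaN, which makes unexpected categories easy to spot.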

In [707]:
# example: give map this list and a func for it to run on each item:
fruits = ['apple','banana','blueberry','cherry','grape','kiwi','lemon','mango','orange','peach','papaya','plum','pineapple','raspberry','strawberry']

In [708]:
# define a function with an input and an output:
# 1. keyword def
# 2. function_name
# 3. inputs (parameters) inside the ()
# 4. do stuff inside the func
# 5. return the value / answer
# then: call the function
# and set the func call equal to a var to catch the return value
def make_jellybean(fru):
  fru_jbean = fru + ' jellybean'
  return fru_jbean

In [709]:
# call the function, passing it apple:
jbean1 = make_jellybean('apple')
jbean2 = make_jellybean(fruits[-1])
print(jbean1, jbean2)

apple jellybean strawberry jellybean


In [710]:
# a tad more complex: define a function that accepts a whole list of fruits
# and mass produces treats based on conditions
# func returns a list of treats: so, it takes in a list and returns a list
def make_fruit_treats(fru_list):
  treats = []
  for fru in fru_list:
    if len(fru) == 5:
      treats.append(fru + " jelly")
    elif fru.startswith('p'):
      treats.append(fru + " popsicle")
    elif 'berry' in fru:
      treats.append(fru + " jam")
    else:
      treats.append(fru + " jellybean")
  return treats

In [711]:
# call the func, passing it fruits list, save result to var (new list):
treats_list = make_fruit_treats(fruits)
pp.pprint(treats_list)

['apple jelly',
 'banana jellybean',
 'blueberry jam',
 'cherry jellybean',
 'grape jelly',
 'kiwi jellybean',
 'lemon jelly',
 'mango jelly',
 'orange jellybean',
 'peach jelly',
 'papaya popsicle',
 'plum popsicle',
 'pineapple popsicle',
 'raspberry jam',
 'strawberry jam']


In [712]:
def make_fruit_treat(fru):
  if len(fru) == 5:
    treat = fru + " jelly"
  elif fru.startswith('p'):
    treat = fru + " popsicle"
  elif 'berry' in fru:
    treat = fru + " jam"
  else:
    treat = fru + " jellybean"
  return treat

In [713]:
# call the function repeatedly for various fruits
# this way is not so efficient:
# better would be to loop over fruits and call the function once per fruit
fru_treat_1 = make_fruit_treat('apple')
print('fru_treat_1:',fru_treat_1)

fru_treat_2 = make_fruit_treat('pineapple')
print('fru_treat_2:',fru_treat_2)

fru_treat_3 = make_fruit_treat('blueberry')
print('fru_treat_3:',fru_treat_3)

fru_treat_4 = make_fruit_treat('orange')
print('fru_treat_4:',fru_treat_4)

fru_treat_1: apple jelly
fru_treat_2: pineapple popsicle
fru_treat_3: blueberry jam
fru_treat_4: orange jellybean


In [714]:
# better way: loop fruits list
# call the make_fruit_treat function from inside a loop
# each time the loop runs, pass the current fru to the func call
fru_treats = []
for fru in fruits:
  treat = make_fruit_treat(fru)
  fru_treats.append(treat)

pp.pprint(fru_treats)

['apple jelly',
 'banana jellybean',
 'blueberry jam',
 'cherry jellybean',
 'grape jelly',
 'kiwi jellybean',
 'lemon jelly',
 'mango jelly',
 'orange jellybean',
 'peach jelly',
 'papaya popsicle',
 'plum popsicle',
 'pineapple popsicle',
 'raspberry jam',
 'strawberry jam']


In [715]:
# map(func,list) version of the above: a one-liner
fruit_treats = list(map(make_fruit_treat,fruits))
pp.pprint(fruit_treats)

['apple jelly',
 'banana jellybean',
 'blueberry jam',
 'cherry jellybean',
 'grape jelly',
 'kiwi jellybean',
 'lemon jelly',
 'mango jelly',
 'orange jellybean',
 'peach jelly',
 'papaya popsicle',
 'plum popsicle',
 'pineapple popsicle',
 'raspberry jam',
 'strawberry jam']


In [716]:
# define a function that takes in a sex str and returns a num
def convert_sex_str_to_int(sex_str):
  if sex_str == "female":
    sex_int = 0
  else:
    sex_int = 1
  return sex_int

In [717]:
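# map version: convert 'Sex' from 'female', 'male' to 0, 1
# (commented out here; the lambda version further down is used instead)
# titanic_train_df["SexInt"] = titanic_train_df["Sex"].map(convert_sex_str_to_int)
# first, check the value counts of 'Sex':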
print(titanic_train_df["Sex"].value_counts())

Sex
male      577
female    314
Name: count, dtype: int64


In [718]:
titanic_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [719]:
# OR: lambda version of the above just cuz:
titanic_train_df["SexInt"] = titanic_train_df["Sex"].apply(lambda sex : 0 if sex=='female' else 1)

In [720]:
print(titanic_train_df.shape) # (891,13)
titanic_train_df.head()

(891, 13)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SexInt
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


In [721]:
# Using map to convert 'Sex' from 'female', 'male' to 0, 1
titanic_test_df["SexInt"] = titanic_test_df["Sex"].map(convert_sex_str_to_int)
print(titanic_test_df.shape) # (418, 12)
titanic_test_df.head()

(418, 12)


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SexInt
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0


In [722]:
# make a new column 'CabinKnown' with value of 1 if Cabin exists (only 204 of those) and 0 if it is NA
# Use a vectorized check; no need for apply(lambda)
# .notna() returns a boolean
# .astype(int) recasts the bool as integer (0,1)
titanic_train_df['CabinKnown'] = titanic_train_df['Cabin'].notna().astype(int)

In [723]:
print(titanic_train_df.shape) # (891, 14)
titanic_train_df.head(2)

(891, 14)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SexInt,CabinKnown
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,1


In [724]:
titanic_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
 11  SexInt       418 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [725]:
# same for titanic_test, which is also missing most Cabins:
titanic_test_df['CabinKnown'] = titanic_test_df['Cabin'].notna().astype(int)

In [726]:
print(titanic_test_df.shape) # (418, 13)
titanic_test_df.head(2)

(418, 13)


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SexInt,CabinKnown
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0,0


In [727]:
# make a new col 'Deck' from first letter of 'Cabin', with 'U' (Unknown) assigned to missing values
# (there are only 204 known values in this col)
titanic_train_df['Deck'] = titanic_train_df['Cabin'].str[0].fillna('U')

In [728]:
titanic_test_df['Deck'] = titanic_test_df['Cabin'].str[0].fillna('U')

In [729]:
# get the unique deck values:
unique_decks_train = titanic_train_df['Deck'].unique()
print(unique_decks_train) # ['U' 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']

['U' 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']


In [730]:
print(titanic_train_df['Deck'].value_counts())

Deck
U    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: count, dtype: int64


In [731]:
unique_decks_test = titanic_test_df['Deck'].unique()
print(unique_decks_test) # expect the same letters as train, minus 'T' (see value_counts below)


In [732]:
print(titanic_test_df['Deck'].value_counts())

Deck
U    327
C     35
B     18
D     13
E      9
F      8
A      7
G      1
Name: count, dtype: int64


In [733]:
print(titanic_train_df.shape) # (891, 15)
titanic_train_df.head(3)

(891, 15)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SexInt,CabinKnown,Deck
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,0,U
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,1,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0,U


In [734]:
# consolidate low-freq decks, A F G T, into one cat: 'R' for rare
# df['col'].replace({"old":"new"})
titanic_train_df['Deck'] = titanic_train_df['Deck'].replace({"A":"R","F":"R","G":"R","T":"R"})

In [735]:
titanic_test_df['Deck'] = titanic_test_df['Deck'].replace({"A":"R","F":"R","G":"R"})

In [736]:
# One hot encode 'Deck', where all 6 unique deck letters get their own columns, prefixed with 'Deck_':
# 'Deck_B'  'Deck_C'  'Deck_D' 'Deck_E'  'Deck_R' 'Deck_U'
# in one hot encoding only one of the cols gets a 1; the other cols get 0
decks_train_df = pd.get_dummies(titanic_train_df['Deck'], prefix='Deck').astype(int)

In [737]:
print(decks_train_df.shape) # (891, 6)
decks_train_df.head(2)

(891, 6)


Unnamed: 0,Deck_B,Deck_C,Deck_D,Deck_E,Deck_R,Deck_U
0,0,0,0,0,0,1
1,0,1,0,0,0,0


In [738]:
decks_test_df = pd.get_dummies(titanic_test_df['Deck'], prefix='Deck').astype(int)

In [739]:
print(decks_test_df.shape) # (418, 6)
decks_test_df.sample(5)

(418, 6)


Unnamed: 0,Deck_B,Deck_C,Deck_D,Deck_E,Deck_R,Deck_U
32,0,0,0,0,0,1
223,0,0,0,0,0,1
407,0,1,0,0,0,0
96,0,1,0,0,0,0
381,0,0,0,0,0,1


In [740]:
# concat the Deck cols onto the main dfs:
titanic_train_df = pd.concat([titanic_train_df,decks_train_df],axis=1)

In [741]:
print(titanic_train_df.shape) # (891, 21)
titanic_train_df.head(2)

(891, 21)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked,SexInt,CabinKnown,Deck,Deck_B,Deck_C,Deck_D,Deck_E,Deck_R,Deck_U
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,S,1,0,U,0,0,0,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,C,0,1,C,0,1,0,0,0,0


In [742]:
# concat the Deck cols onto the main dfs:
titanic_test_df = pd.concat([titanic_test_df,decks_test_df],axis=1)

In [743]:
print(titanic_test_df.shape) # (418, 20)
titanic_test_df.head(2)

(418, 20)


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SexInt,CabinKnown,Deck,Deck_B,Deck_C,Deck_D,Deck_E,Deck_R,Deck_U
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1,0,U,0,0,0,0,0,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0,0,U,0,0,0,0,0,1


In [744]:
# one 'Fare' value is missing in the test set.. fill it with the median
median_fare_titanic_test = round(titanic_test_df['Fare'].median(),2)
print(median_fare_titanic_test)
titanic_test_df['Fare'] = titanic_test_df['Fare'].fillna(median_fare_titanic_test)
# NOTE: filling missing Fare not required for training set

14.45


In [745]:
titanic_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 20 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
 11  SexInt       418 non-null    int64  
 12  CabinKnown   418 non-null    int64  
 13  Deck         418 non-null    object 
 14  Deck_B       418 non-null    int64  
 15  Deck_C       418 non-null    int64  
 16  Deck_D       418 non-null    int64  
 17  Deck_E       418 non-null    int64  
 18  Deck_R       418 non-null    int64  
 19  Deck_U  

In [746]:
# fill missing 'Age' values w median
median_age_train = round(titanic_train_df['Age'].median(),1)
print('median_age_train:',median_age_train)
median_age_test = round(titanic_test_df['Age'].median(),1)
print('median_age_test:',median_age_test)

median_age_train: 28.0
median_age_test: 27.0


In [747]:
# even though median age is not same in train vs test, fill missing w train median, so as to not peek or cheat:
titanic_train_df['AgeFilled'] = titanic_train_df['Age'].fillna(median_age_train)
titanic_test_df['AgeFilled'] = titanic_test_df['Age'].fillna(median_age_train)
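
The "learn the fill value from train, reuse it on test" idea above can also be expressed with sklearn's `SimpleImputer`. This is a sketch on made-up ages, not a replacement for the cell above; `train_age`, `test_age` and `imputer` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up ages with gaps, standing in for the train and test 'Age' cols
train_age = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 26.0]})
test_age = pd.DataFrame({"Age": [np.nan, 47.0, 62.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_age)  # learns the median from train only
test_filled = imputer.transform(test_age)        # reuses the train median, no peeking

print(imputer.statistics_)   # the learned train median
print(test_filled.ravel())
```

Calling `fit_transform` on train and only `transform` on test is what keeps test-set information out of the fill value.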

In [748]:
# make new 'AgeKnown' col w 1,0
titanic_test_df['AgeKnown'] = titanic_test_df['Age'].notna().astype(int)

In [749]:
titanic_train_df['AgeKnown'] = titanic_train_df['Age'].notna().astype(int)

In [750]:
print(titanic_train_df.shape) # (891, 23)
titanic_train_df.head(2)

(891, 23)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,CabinKnown,Deck,Deck_B,Deck_C,Deck_D,Deck_E,Deck_R,Deck_U,AgeFilled,AgeKnown
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,0,U,0,0,0,0,0,1,22.0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,1,C,0,1,0,0,0,0,38.0,1


In [751]:
print(titanic_train_df['AgeKnown'].value_counts())

AgeKnown
1    714
0    177
Name: count, dtype: int64


In [752]:
# make a 'FamilySize' col, the value of which is SibSp + Parch + 1
titanic_train_df['FamilySize'] = titanic_train_df['SibSp'] + titanic_train_df['Parch'] + 1
titanic_test_df['FamilySize'] = titanic_test_df['SibSp'] + titanic_test_df['Parch'] + 1

In [753]:
print(titanic_train_df.shape) # (891, 24)
titanic_train_df.head(2)

(891, 24)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Deck,Deck_B,Deck_C,Deck_D,Deck_E,Deck_R,Deck_U,AgeFilled,AgeKnown,FamilySize
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,U,0,0,0,0,0,1,22.0,1,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,C,0,1,0,0,0,0,38.0,1,2


In [754]:
titanic_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 24 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
 12  SexInt       891 non-null    int64  
 13  CabinKnown   891 non-null    int64  
 14  Deck         891 non-null    object 
 15  Deck_B       891 non-null    int64  
 16  Deck_C       891 non-null    int64  
 17  Deck_D       891 non-null    int64  
 18  Deck_E       891 non-null    int64  
 19  Deck_R  

In [755]:
print(titanic_train_df['Embarked'].value_counts())

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64


In [756]:
# fill 2 missing Embarked values with most common, 'S'
titanic_train_df['Embarked'] = titanic_train_df['Embarked'].fillna('S')

In [757]:
# one hot encoding on Embarked: result: one new col per category: 'S', 'C', 'Q'
# in one hot encoding, only the active ('hot') col = 1 (all others = 0)
embark_df = pd.get_dummies(titanic_train_df['Embarked'], prefix='Town').astype(int)

In [758]:
print(embark_df.shape) # (891, 3)
embark_df.sample(3)

(891, 3)


Unnamed: 0,Town_C,Town_Q,Town_S
49,0,0,1
863,0,0,1
561,0,0,1


In [759]:
titanic_train_df = pd.concat([titanic_train_df,embark_df],axis=1)

In [760]:
print(titanic_train_df.shape) # (891, 27)
titanic_train_df.sample(3)

(891, 27)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Deck_D,Deck_E,Deck_R,Deck_U,AgeFilled,AgeKnown,FamilySize,Town_C,Town_Q,Town_S
309,310,1,1,"Francatelli, Miss. Laura Mabel",female,30.0,0,0,PC 17485,56.9292,...,0,1,0,0,30.0,1,1,1,0,0
857,858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.55,...,0,1,0,0,51.0,1,1,0,0,1
111,112,0,3,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,...,0,0,0,1,14.5,1,2,1,0,0


In [761]:
# one-hot encode Embarked on the test set too (no missing Embarked values there):
embark_test_df = pd.get_dummies(titanic_test_df['Embarked'], prefix='Town').astype(int)
titanic_test_df = pd.concat([titanic_test_df,embark_test_df],axis=1)
print(titanic_test_df.shape) # (418, 26)

(418, 26)


### **Making y_train from the titanic_train_df Survived col**
- make **y_train** -- a vector of just the target values of 0 and 1
- there is NO **y_test** (only Kaggle has those "answers")


In [762]:
# pop the 'Survived' col off the train set and save as y_train
y_train = titanic_train_df.pop('Survived')

In [None]:
print(y_train[:5])

In [794]:
# get all cols for both dfs into list:
train_cols = titanic_train_df.columns.tolist()
print(train_cols)
test_cols = titanic_test_df.columns.tolist()
print(test_cols)

['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'SexInt', 'CabinKnown', 'Deck', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_R', 'Deck_U', 'AgeFilled', 'AgeKnown', 'FamilySize', 'Town_C', 'Town_Q', 'Town_S']
['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'SexInt', 'CabinKnown', 'Deck', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_R', 'Deck_U', 'AgeFilled', 'AgeKnown', 'FamilySize', 'Town_C', 'Town_Q', 'Town_S']


### **Making X_train and X_test from the prepared data**
- **X_train** and **X_test** must have the exact same cols
- **X_train** cannot have the **Survived** col--that's the answers
- **Survived** col is saved to **y_train**
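
One way to guarantee the "exact same cols" rule is to keep a single list of feature names and select both sets from it. The sketch below uses tiny made-up frames; `feature_cols`, `train_demo` and `test_demo` are hypothetical names, and the `reindex` call is an optional safety net, not something the notebook itself does.

```python
import pandas as pd

# one shared list of feature names (shortened for this illustration)
feature_cols = ["Pclass", "SexInt", "Town_C", "Town_Q", "Town_S"]

train_demo = pd.DataFrame({
    "Pclass": [3, 1], "SexInt": [1, 0],
    "Town_C": [0, 1], "Town_Q": [0, 0], "Town_S": [1, 0],
    "Survived": [0, 1],
})
# made-up test frame: different column order and no 'Town_Q' col at all
test_demo = pd.DataFrame({"SexInt": [1], "Pclass": [2], "Town_S": [1], "Town_C": [0]})

X_train_demo = train_demo[feature_cols]
# reindex puts the cols in the same order and adds any missing one filled with 0
X_test_demo = test_demo.reindex(columns=feature_cols, fill_value=0)

# cheap sanity check before training
assert list(X_train_demo.columns) == list(X_test_demo.columns)
print(X_test_demo)
```

A missing dummy column typically happens when a category appears in train but not in test, so the check is worth running before every fit/predict pair.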

In [795]:
# make X_train -- a df of just features (cols) used for training
X_train = titanic_train_df[[ 'Pclass', 'SexInt', 'AgeFilled', 'AgeKnown', 'Fare', 'FamilySize', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_R', 'Deck_U', 'Town_C', 'Town_Q', 'Town_S' ]]

In [797]:
print(X_train.shape) # (891, 15)
X_train.head(3)

(891, 15)


Unnamed: 0,Pclass,SexInt,AgeFilled,AgeKnown,Fare,FamilySize,Deck_B,Deck_C,Deck_D,Deck_E,Deck_R,Deck_U,Town_C,Town_Q,Town_S
0,3,1,22.0,1,7.25,2,0,0,0,0,0,1,0,0,1
1,1,0,38.0,1,71.2833,2,0,1,0,0,0,0,1,0,0
2,3,0,26.0,1,7.925,1,0,0,0,0,0,1,0,0,1


In [796]:
# make X_test -- a df of just features (cols) used for testing
# the X_test features must exactly match X_train
X_test = titanic_test_df[[ 'Pclass', 'SexInt', 'AgeFilled', 'AgeKnown', 'Fare', 'FamilySize', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_R', 'Deck_U', 'Town_C', 'Town_Q', 'Town_S' ]]

In [798]:
print(X_test.shape) # (418, 15)
X_test.head(3)

(418, 15)


Unnamed: 0,Pclass,SexInt,AgeFilled,AgeKnown,Fare,FamilySize,Deck_B,Deck_C,Deck_D,Deck_E,Deck_R,Deck_U,Town_C,Town_Q,Town_S
0,3,1,34.5,1,7.8292,1,0,0,0,0,0,1,0,1,0
1,3,0,47.0,1,7.0,2,0,0,0,0,0,1,0,0,1
2,2,1,62.0,1,9.6875,1,0,0,0,0,0,1,0,1,0


### **Training the RandomForestClassifier model**

In [799]:
# Now that we have X_train and y_train, we can train a model BUT first instantiate:
# instantiate RandomForestClassifier

rand_forest_model = RandomForestClassifier(
    n_estimators=1000,       # many trees
    max_depth=None,         # let trees grow, control with min_* instead
    min_samples_split=4,
    min_samples_leaf=1,
    max_features="sqrt",    # good default for classification
    bootstrap=True,
    class_weight="balanced", # Titanic is imbalanced (~38% survived)
    n_jobs=-1,
    random_state=42,
    criterion="gini"        # sklearn's default; usually performs about the same as 'entropy'
)

**n_estimators=1000**: number of trees. More trees = more stable predictions (slower, but usually better than 100).

**max_depth=None**: trees can grow fully. Can boost training score but risks overfitting; your min_* settings are the brakes.

**min_samples_split=4**: a node needs at least 4 samples to split. Slightly regularizes vs default 2.

**min_samples_leaf=1**: leaves can be as small as 1 sample. This is the most overfit-prone setting here; try 2-5 if you're stuck.

**max_features="sqrt"**: each split considers only √(num_features). Adds randomness/diversity; good RF default.

**bootstrap=True**: each tree trains on a bootstrapped sample (sample-with-replacement). Standard RF behavior.

**class_weight="balanced"**: up-weights the minority class (survived) during training. Sometimes helps, sometimes hurts; worth testing both.

**n_jobs=-1**: use all CPU cores (faster).

**random_state=42**: reproducible results.

**criterion="gini"**: how splits are chosen. Gini is fast and standard; differences vs entropy are usually tiny.
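
Settings flagged as "worth testing both", such as `class_weight`, can be compared with cross-validation on the labeled training data. The following is a minimal sketch on synthetic stand-in data (`X_demo` and `y_demo` come from `make_classification`, not from the Titanic files), with fewer trees than the model above to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 891 rows, 15 features, roughly a 62% / 38% class split
X_demo, y_demo = make_classification(
    n_samples=891, n_features=15, n_informative=6,
    weights=[0.62, 0.38], random_state=42,
)

cv_scores = {}
for weight in (None, "balanced"):
    model = RandomForestClassifier(
        n_estimators=50,          # reduced from 1000 so the demo runs fast
        min_samples_split=4,
        max_features="sqrt",
        class_weight=weight,
        random_state=42,
    )
    # mean accuracy over 5 cross-validation folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[weight] = scores.mean()
    print(f"class_weight={weight}: mean accuracy {scores.mean():.3f}")
```

On the real data, `X_train` and `y_train` would replace the synthetic arrays; which setting wins there is something only the run can tell.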

In [800]:
# train the RandomForestClassifier model on X_train and y_train
rand_forest_model.fit(X_train, y_train)

In [801]:
# have the model predict survival of the test set (X_test)
# AGAIN : WE DO NOT HAVE y_test, those being the answers (ONLY Kaggle has that)
# Ergo: WE can only know how well our model did if we upload our predictions to Kaggle
y_pred = rand_forest_model.predict(X_test)
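
Because Kaggle keeps the test labels, a common way to get a local accuracy estimate before uploading is to hold out part of the labeled training data. This is a sketch on synthetic stand-in data (`X_demo`, `y_demo`, and `val_model` are illustrative names, not variables from this notebook); with the real data, `X_train` and `y_train` would be split instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 891-row labeled training set
X_demo, y_demo = make_classification(n_samples=891, n_features=15, random_state=42)

# Hold out 20% of the labeled rows; stratify keeps the class ratio similar in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

val_model = RandomForestClassifier(n_estimators=50, min_samples_split=4, random_state=42)
val_model.fit(X_fit, y_fit)

# Accuracy on rows the model never saw during fitting
val_accuracy = accuracy_score(y_val, val_model.predict(X_val))
print(f"validation accuracy: {val_accuracy:.3f}")
```

The held-out score is only an estimate; the Kaggle leaderboard score can still differ.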

In [None]:
# print the model's predictions (even though we cannot know how accurate they are):
print(y_pred)

In [804]:
# make the required Kaggle df:
# MUST have exactly two cols: "PassengerId" and "Survived"
# "PassengerId" value: consec ints from 892-1309 (we can generate this w range())
# "Survived" value: our model's predictions as y_pred
kaggle_comp_df = pd.DataFrame()

In [805]:
kaggle_comp_df["PassengerId"] = range(892,1310)
kaggle_comp_df["Survived"] = y_pred

In [806]:
print(kaggle_comp_df.shape) # (418, 2)
kaggle_comp_df.head()

(418, 2)


Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,1
4,896,1


In [807]:
kaggle_comp_df.tail()

Unnamed: 0,PassengerId,Survived
413,1305,0
414,1306,1
415,1307,0
416,1308,0
417,1309,1


**saving a dataframe to csv**
- **df.to_csv(file_path/file_name, encoding='utf-8', index=False)**
- saves df as file_name to file_path
- **encoding='utf-8'** writes the file as UTF-8, the standard text encoding, which handles plain English letters as well as accented and non-Latin characters
- **index=False** means do not make a column for the index
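
The effect of `index=False` can be seen without touching the disk by writing to an in-memory buffer. A small sketch with a two-row stand-in frame (`demo_df` is a hypothetical name):

```python
import io
import pandas as pd

demo_df = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

# Write once with the default (index=True) and once with index=False
with_index = io.StringIO()
demo_df.to_csv(with_index)
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)

# Read both back: the saved index comes back as an extra unnamed column
reloaded_with = pd.read_csv(io.StringIO(with_index.getvalue()))
reloaded_without = pd.read_csv(io.StringIO(without_index.getvalue()))

print(list(reloaded_with.columns))     # ['Unnamed: 0', 'PassengerId', 'Survived']
print(list(reloaded_without.columns))  # ['PassengerId', 'Survived']
```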

In [809]:
# save the predictions df to csv for uploading to Kaggle:
# specify index=False or else you get a new 'Unnamed: 0' col containing the index values -- which we definitely do NOT want as it will be rejected by Kaggle -- MUST have exactly TWO cols ONLY
kaggle_comp_df.to_csv(base_path + '/csv/kaggle-titanic-comp-brian.csv', encoding='utf-8', index=False)

In [810]:
# load the csv right back up to df to make sure it's good to go:
titanic_kaf_comp_df = pd.read_csv(base_path + '/csv/kaggle-titanic-comp-brian.csv')

In [811]:
print(titanic_kaf_comp_df.shape)
titanic_kaf_comp_df.head()

(418, 2)


Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,1
4,896,1


In [778]:
# predict 100% did not survive just to gain insight into the mysterious kaggle 'answer key'
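
The cell above has no code in the source. As a sketch of what such an all-zeros baseline could look like (`baseline_df` is a hypothetical name): Kaggle scores this competition by accuracy, so the score of a submission that predicts 0 for everyone would equal the share of test passengers who did not survive.

```python
import pandas as pd

# Every test passenger (ids 892 through 1309) predicted as not surviving
baseline_df = pd.DataFrame({
    "PassengerId": range(892, 1310),
    "Survived": 0,   # the scalar is repeated for all rows
})

print(baseline_df.shape)              # (418, 2)
print(baseline_df["Survived"].sum())  # 0

# It could then be saved the same way as the real predictions, e.g.
# baseline_df.to_csv(<path of your choice>, encoding='utf-8', index=False)
```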


**LabelEncoder.fit_transform(list_of_strings)** takes a list of strings and returns a corresponding array of integer codes, one code per unique string
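
A short illustration of the encoder on a made-up list of town names (the list is an example, not a column from the dataset). The codes follow the sorted order of the unique strings:

```python
from sklearn.preprocessing import LabelEncoder

towns = ["Southampton", "Cherbourg", "Queenstown", "Southampton"]

encoder = LabelEncoder()
town_codes = encoder.fit_transform(towns)

print(town_codes)        # [2 0 1 2]
print(encoder.classes_)  # ['Cherbourg' 'Queenstown' 'Southampton']

# inverse_transform maps codes back to the original strings
print(encoder.inverse_transform([0, 2]))
```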




- **n_estimators=10**
  - Definition: the number of decision trees in the forest.
  - Explanation:
    - A Random Forest builds multiple decision trees, and their predictions are averaged (for regression) or voted on (for classification).
    - More trees generally make the model more stable and accurate, although training takes longer.
  - In this case: a value of 10 would build 10 decision trees; the model trained above uses 1000.
- **criterion='entropy'**
  - Definition: the function used to measure the quality of a split when constructing each tree.
  - Explanation:
    - The two common criteria are gini (Gini impurity) and entropy (information gain).
    - entropy: measures how much information is gained by making a split. It uses information theory to find splits that reduce uncertainty (entropy) in the target labels.
    - gini: measures the degree of "impurity" in the nodes and tends to be slightly faster.
  - In this case: with criterion='entropy', the forest evaluates splits by how much they reduce the uncertainty of the class labels; the model trained above uses gini instead.
- **random_state=42**
  - Definition: sets the seed for the random number generator.
  - Explanation:
    - Random Forests introduce randomness by selecting random subsets of data for each tree and random subsets of features for splitting at each node.
    - random_state ensures reproducibility by controlling this randomness. Using the same seed (e.g., 42) gives the same results across different runs (all else being equal).
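
The two split criteria can be worked out by hand for a single node. The helper functions below are illustrative (they are not sklearn's internal code): Gini impurity is 1 minus the sum of squared class proportions, and entropy is the sum of -p * log2(p) over the classes.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p^2) over the class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: sum(-p * log2(p)), skipping empty classes."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# A perfectly mixed node (50% survived, 50% perished) is maximally impure
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5 1.0

# A node with roughly the training set's class mix (~38% survived)
print(gini([0.62, 0.38]), entropy([0.62, 0.38]))
```

Both measures are 0 for a pure node and largest for an even mix, which is why they usually pick similar splits.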

- it's all good to go -- time to submit the csv file to Kaggle
- download the csv file and upload it to Kaggle at

 **https://www.kaggle.com/competitions/titanic**