### Predicting Titanic Survival using..
#### **sklearn (Scikit Learn) machine learning (ML) library**
- **sklearn** is one of the most widely used Python libraries for machine learning, offering a broad selection of ready-made algorithms to apply to your data

**titanic dataset** is an iconic dataset used for Machine Learning practice and training
- it contains 891 **observations** (rows) of 15 **variables** (columns) each
- the dataset is used to train ML models to predict the probability of surviving the Titanic disaster based on variables such as passenger class, age and gender (spoiler alert: first class female passengers had a *much* better chance of survival than third class males)
- **survived** column: 1 means survived; 0 means perished
- **sibsp** stands for siblings and spouses
- **parch** stands for parents and children
- some columns are repeats of each other:
  - **alive** (no, yes) is the string version of **survived** (0,1)
  - **pclass** (1,2,3) is the numeric version of **class** (First, Second, Third)
  - **who** (man, woman, child) overlaps **sex** (male, female)
  - **embarked** (S, C, Q) is short for **embark_town** (Southampton, Cherbourg, Queenstown)

**seaborn** is another visualization / plotting library, built on top of matplotlib
- seaborn also has the titanic dataset built in, so no csv file required

**Kaggle Titanic Competition**
- If you would like to submit your predictions to the **Kaggle Titanic Competition**, use their datasets, which come pre-divided into:
    - train (**train_titanic.csv**)
    - test (**test_titanic.csv**)

In [1]:
# import the usual basics:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
import pprint as pp
import seaborn as sns

In [2]:
# import sklearn machine learning packages:

# LabelEncoder converts string data to numeric:
from sklearn.preprocessing import LabelEncoder

# StandardScaler standardizes the data: each feature is rescaled
# to mean=0 and standard deviation=1;
# for roughly normal data, about 95% of values land within +-2 std (-2 to 2)
from sklearn.preprocessing import StandardScaler

# import RandomForestClassifier for training model
from sklearn.ensemble import RandomForestClassifier

In [56]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### **Scikit Learn (sklearn) packages**
- **sklearn.preprocessing.LabelEncoder** encodes target labels with value between **0** and **n_classes-1**
  - all data must be numeric for ML model training
  - so, passenger classes First, Second and Third become 0, 1 and 2
  - we need to convert object (string) columns ("sex", "embarked") into numbers
- **sklearn.model_selection.train_test_split** splits data into training and testing sets
  - **model** is trained on training set (fed data with the answers)
  - **model** is tested on testing set by having it predict answers (testing labels)
- **sklearn.preprocessing.StandardScaler** rescales each feature (column) to a standard range,  
where the mean is 0 and the standard deviation is 1, so values within +/- 1 std of the mean land in the -1 to 1 range
  - scale-sensitive models need all features in the same range, so that they don't
  treat age, which ranges from 0-80,  
  as far more important in predicting survival than passenger class, which ranges from just 0-2
  - tree-based models such as random forest split on one feature at a time, so they are largely unaffected by scaling

- **sklearn.metrics.confusion_matrix** makes a **confusion matrix** for a model (a 2x2 array showing TP, TN, FP, FN)
  - import it with **from sklearn.metrics import confusion_matrix**
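The split-then-scale idea described above can be sketched on a tiny made-up array. The numbers and the variable names (`X`, `y`, `scaler`) are illustrative assumptions, not the notebook's actual data. The scaler learns its mean and std from the training rows only and then applies them to the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy feature matrix: column 0 = age (wide range), column 1 = pclass (narrow range)
X = np.array([[22, 3], [38, 1], [26, 3], [35, 1], [35, 3],
              [54, 1], [2, 3], [27, 2], [14, 2], [58, 1]], dtype=float)
y = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1])

# hold out 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# learn mean and std from the training rows only, then apply to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
print(X_train_scaled.mean(axis=0).round(6))  # each column's mean is now ~0
print(X_train_scaled.std(axis=0).round(6))   # each column's std is now ~1
```

Fitting the scaler on the training rows alone keeps information about the test rows out of the preprocessing step.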

**What Is Random Forest?**
- random forest is a tree-based supervised learning algorithm
- random forests can be used for **classification** (**RandomForestClassifier**) or **regression** (**RandomForestRegressor**) problems
- A random forest combines many decision trees (often hundreds), training each tree on a different random sample of the observations.
- The final prediction combines the individual trees: for classification the trees vote (scikit-learn averages their predicted class probabilities and picks the most probable class); for regression their predictions are averaged.
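To make the "many trees, one combined answer" idea concrete, here is a small sketch on synthetic data (made with `make_classification`, a stand-in and not the Titanic data). It rebuilds the forest's prediction by hand from the individual trees stored in `estimators_`; the tree count of 25 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic binary-classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# each fitted decision tree is available in forest.estimators_
tree_probs = np.array([tree.predict_proba(X) for tree in forest.estimators_])

# average the per-tree class probabilities, then pick the most probable class
avg_probs = tree_probs.mean(axis=0)
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]

print(tree_probs.shape)  # (n_trees, n_samples, n_classes)
print(np.allclose(avg_probs, forest.predict_proba(X)))
print((manual_pred == forest.predict(X)).all())
```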

**What Is a Confusion Matrix?**
- a confusion matrix is a set of 4 counts that summarizes how a binary classifier's predictions compare with the true labels
- the 4 categories are True Positive, False Positive, True Negative and False Negative:
  - **True Positive** : model said it's a 'cat' and it is a 'cat'
  - **True Negative** : model said it's not a 'cat' and that's right
  - **False Positive** : model said it's a 'cat' but it's not
  - **False Negative** : model said it's not a 'cat' but it is
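A minimal sketch of the four categories using `sklearn.metrics.confusion_matrix`. The 'cat' labels below are made up for illustration (1 = 'cat', 0 = 'not a cat'). For 0/1 labels, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# 1 = 'cat', 0 = 'not a cat' (made-up labels for illustration)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # what each item really is
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # what the model said

cm = confusion_matrix(y_true, y_pred)

# rows are the true labels, columns are the predicted labels
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)  # TP: 3 TN: 3 FP: 1 FN: 1
```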


In [3]:
# load titanic dataset from within seaborn (no csv file)
titanic_sns_df = sns.load_dataset('titanic')

In [4]:
print(titanic_sns_df.shape) # (891, 15)
titanic_sns_df.head()

(891, 15)


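| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |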


In [5]:
# check for missing data
titanic_sns_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [62]:
# get the sns titanic df col names into a list
# (the sns dataset is the SAME 891 passengers as the Kaggle train_titanic.csv)
sns_titanic_cols = list(titanic_sns_df.columns)
print("SNS Titanic Cols:", len(sns_titanic_cols), sns_titanic_cols)

SNS Titanic Cols: 15 ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']


In [59]:
base_path = "/content/drive/MyDrive/____Intro-Python-Machine-Learning-Dec-2025"

In [64]:
# load the same data again BUT this time from train_titanic.csv file provided by Kaggle
titanic_train_df = pd.read_csv(base_path + '/csv/train_titanic.csv')

In [65]:
print(titanic_train_df.shape)
titanic_train_df.head()

(891, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [66]:
kaggle_titanic_cols = list(titanic_train_df.columns)
print("Kaggle Titanic Cols:", len(kaggle_titanic_cols), kaggle_titanic_cols)

Kaggle Titanic Cols: 12 ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


In [None]:
# compare sns to kaggle datasets:

# SNS Titanic Cols: 15
# ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']

# Kaggle Titanic Cols: 12
# ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

In [67]:
# load up the test set:
titanic_test_df = pd.read_csv(base_path + '/csv/test_titanic.csv')

In [70]:
print(titanic_test_df.shape) # (418, 11)
titanic_test_df.head()
# PassengerId is consecutive ints from 892-1309
# this col must be included in the 2-col csv submitted to Kaggle
# the csv is saved from a 2-col df: PassengerId, Survived

(418, 11)


| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


### **Feature Engineering Game Plan**
Feature Engineering means making new columns to include in the feature set for Machine Learning training and testing

1. Using LabelEncoder convert 'Sex' from 'female', 'male' to 0, 1

2. Make a new column 'Cabin Known' with value of:
    - 1 if Cabin exists (only 204 of those)
    - 0 if it is NA

   To make this, use a vectorized check (no need for apply(lambda)):
    - .notna() returns a boolean
    - .astype(int) casts the bool to int (0, 1)

3. Possibly make a new col 'Deck', derived from the first letter of 'Cabin',
     with 'Unknown' being its own deck value, since there are a lot of unknowns

4. Fill the one missing Fare with median and keep that col for feature set

5. Age is missing a lot of values (177 of 891). One option is to fill them with the median age. A possibly better option is to make a new col 'Age Categories' with values by decade, plus 'Unknown' as its own category; in that case the actual age numbers would not be included in the feature set at all.

6. Fill the 2 missing Embarked values with the most common, 'S' (Southampton)

7. One-hot encode Embarked, where all 3 towns, S, C, Q, get their own columns
    - in one-hot encoding only one of the cols gets a 1; the other cols get 0

8. Make a 'Family Size' col, the value of which is SibSp + Parch + 1 (the relatives aboard plus the passenger).

9. Possibly derive titles from names and bucket those: Mr, Miss, Mrs, Elite.
- caveat: this is a complex process, which may not help accuracy, anyway
- better perhaps to just ignore Name altogether

**map()** is a function that takes a function argument
- it runs the function arg on every item in an iterable
- the function can be a named function OR a **lambda**
- the built-in **map(doStuff, my_list)** takes the function and the iterable as its args and returns a lazy iterator; wrap it in **list()** to get a new list
- map can also be called on a df col: **df['col'].map(doStuff)** returns a new Series (a dict can be passed in place of the function)

In [72]:
# example: give map this list and a func for it to run on each item:
fruits = ['apple','banana','blueberry','cherry','grape','kiwi','lemon','mango','orange','peach','papaya','plum','pineapple','raspberry','strawberry']

In [78]:
# anatomy of a function with an input and an output:
# - keyword def
# - function name
# - parentheses () holding the inputs (parameters)
# - do stuff inside the func
# - return the value / answer
# to use it: call the function
# and set the call equal to a var to catch the return value
def make_jellybean(fru):
  fru_jbean = fru + ' jellybean'
  return fru_jbean

In [79]:
# call the function, passing it apple:
jbean1 = make_jellybean('apple')
jbean2 = make_jellybean(fruits[-1])
print(jbean1, jbean2)

apple jellybean strawberry jellybean


In [82]:
# a tad more complex: define a function that accepts a whole list of fruits
# and mass produces treats based on conditions
# func returns a list of treats: so, it takes in a list and returns a list
def make_fruit_treats(fru_list):
  treats = []
  for fru in fru_list:
    if len(fru) == 5:
      treats.append(fru + " jelly")
    elif fru.startswith('p'):
      treats.append(fru + " popsicle")
    elif 'berry' in fru:
      treats.append(fru + " jam")
    else:
      treats.append(fru + " jellybean")
  return treats

In [None]:
# call the func, passing it fruits list, save result to var (new list):
treats_list = make_fruit_treats(fruits)
pp.pprint(treats_list)

In [84]:
def make_fruit_treat(fru):
  if len(fru) == 5:
    treat = fru + " jelly"
  elif fru.startswith('p'):
    treat = fru + " popsicle"
  elif 'berry' in fru:
    treat = fru + " jam"
  else:
    treat = fru + " jellybean"
  return treat

In [86]:
# call the function repeatedly for various fruits
# this way is not so efficient:
# better would be to loop over fruits and call the function on each one
fru_treat_1 = make_fruit_treat('apple')
print('fru_treat_1:',fru_treat_1)

fru_treat_2 = make_fruit_treat('pineapple')
print('fru_treat_2:',fru_treat_2)

fru_treat_3 = make_fruit_treat('blueberry')
print('fru_treat_3:',fru_treat_3)

fru_treat_4 = make_fruit_treat('orange')
print('fru_treat_4:',fru_treat_4)

fru_treat_1: apple jelly
fru_treat_2: pineapple popsicle
fru_treat_3: blueberry jam
fru_treat_4: orange jellybean


In [None]:
# call the function from inside the loop
# each time the loop runs pass func call the current fru
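One possible way to fill in this cell is sketched below: first an explicit loop, then the same work done with `map()`. The function is a condensed copy of `make_fruit_treat` so the block runs on its own, and `some_fruits` is a shortened, hypothetical list.

```python
# condensed copy of the notebook's per-fruit function, so this block runs alone
def make_fruit_treat(fru):
    if len(fru) == 5:
        return fru + " jelly"
    elif fru.startswith('p'):
        return fru + " popsicle"
    elif 'berry' in fru:
        return fru + " jam"
    return fru + " jellybean"

some_fruits = ['apple', 'pineapple', 'blueberry', 'orange']

# option 1: explicit loop, calling the function once per fruit
treats_from_loop = []
for fru in some_fruits:
    treats_from_loop.append(make_fruit_treat(fru))

# option 2: map() applies the function lazily; list() collects the results
treats_from_map = list(map(make_fruit_treat, some_fruits))

print(treats_from_loop)
print(treats_from_map == treats_from_loop)  # True: both approaches agree
```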

In [13]:
# Using map to convert 'Sex' from 'female', 'male' to 0, 1
# titanic_train_df["Sex"] = titanic_train_df["Sex"].map({"female": 0, "male": 1})
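A small sketch of the dict-based `map` on a stand-in frame. `toy_df` is a hypothetical four-row example, not the Kaggle data.

```python
import pandas as pd

# tiny stand-in for the training df (values are illustrative)
toy_df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Series.map with a dict looks up each value; a value missing from the dict becomes NaN
toy_df["Sex"] = toy_df["Sex"].map({"female": 0, "male": 1})
print(toy_df["Sex"].tolist())  # [1, 0, 0, 1]
```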

In [15]:
# make a new column 'Cabin Known' with value of 1 if Cabin exists (only 204 of those) and 0 if it is NA
# Use a vectorized check (no need for apply(lambda))
# .notna() returns a boolean
# .astype(int) casts the bool as int (0,1)
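The vectorized check might look like the sketch below, shown on a hypothetical five-row `toy_df` rather than the real training data.

```python
import numpy as np
import pandas as pd

# stand-in Cabin column with some missing values (illustrative)
toy_df = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", "E46"]})

# notna() gives True/False per row; astype(int) turns that into 1/0
toy_df["Cabin Known"] = toy_df["Cabin"].notna().astype(int)
print(toy_df["Cabin Known"].tolist())  # [0, 1, 0, 1, 1]
```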


In [16]:
# possibly (tbd) make a new col 'Deck' which we derive from first letter of 'Cabin', again with 'deck unknown' handled
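If the 'Deck' column is wanted, one way to derive it is sketched here on a hypothetical `toy_df`. Using the label 'Unknown' for missing cabins follows the game plan above. The `.str[0]` accessor takes the first character and leaves missing values missing.

```python
import numpy as np
import pandas as pd

# stand-in Cabin column (illustrative values)
toy_df = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", "E46"]})

# first letter of the cabin string; missing cabins stay NaN until filled
toy_df["Deck"] = toy_df["Cabin"].str[0].fillna("Unknown")
print(toy_df["Deck"].tolist())  # ['Unknown', 'C', 'Unknown', 'C', 'E']
```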


In [17]:
# add the 'Fare' -- one of the fares is missing.. fill with median
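A sketch of the median fill on a hypothetical `toy_df` with one missing fare. In the Kaggle files the single missing Fare sits in the test set, so in the notebook the same pattern would be applied to that df.

```python
import numpy as np
import pandas as pd

# stand-in Fare column with one missing value (illustrative)
toy_df = pd.DataFrame({"Fare": [7.25, 71.2833, np.nan, 8.05, 53.1]})

# median() skips NaN by default; fillna() puts the median into the gap
fare_median = toy_df["Fare"].median()
toy_df["Fare"] = toy_df["Fare"].fillna(fare_median)

print(fare_median)
print(toy_df["Fare"].isna().sum())  # 0 missing values remain
```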


In [20]:
# possibly (tbd) make a new col 'Age Categories' w values by decade and Unknown as its own category
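One possible take on decade buckets uses `pd.cut`. The bin edges, the label format ('20s', '30s', ...) and `toy_df` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# stand-in Age column with one missing value (illustrative)
toy_df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 4.0, 80.0, 30.0]})

# decade bins [0, 10), [10, 20), ... [80, 90); right=False puts 30.0 in '30s'
bin_edges = list(range(0, 100, 10))
bin_labels = [f"{low}s" for low in bin_edges[:-1]]
age_cat = pd.cut(toy_df["Age"], bins=bin_edges, labels=bin_labels, right=False)

# missing ages get their own 'Unknown' category
toy_df["Age Categories"] = age_cat.astype(object).fillna("Unknown")
print(toy_df["Age Categories"].tolist())
```

The resulting string categories would still need encoding (for example one-hot) before model training.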

In [21]:
# fill 2 missing Embarked values with most common, 'S'


In [22]:
# do one-hot encoding on Embarked: result: one new col per category: 'S', 'C', 'Q'
# in one hot encoding, only the active ('hot') col = 1 (all others = 0)
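The two Embarked steps (fill the gaps, then one-hot encode) could be sketched as follows. `toy_df` is hypothetical, and `pd.get_dummies` is one of several ways to one-hot encode; it names the new columns `Embarked_C`, `Embarked_Q`, `Embarked_S`.

```python
import numpy as np
import pandas as pd

# stand-in Embarked column with one missing value (illustrative)
toy_df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q", "S", "S"]})

# mode() returns the most common value(s); [0] takes the first
most_common = toy_df["Embarked"].mode()[0]
toy_df["Embarked"] = toy_df["Embarked"].fillna(most_common)

# one new 0/1 column per category; the original Embarked column is replaced
toy_df = pd.get_dummies(toy_df, columns=["Embarked"], dtype=int)
print(toy_df)
```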

In [None]:
# make a 'Family Size' col, the value of which is sibsp + parch + 1
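A sketch of the family-size column on a hypothetical three-row `toy_df`, using the Kaggle column names `SibSp` and `Parch`.

```python
import pandas as pd

# stand-in columns (illustrative values)
toy_df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# siblings/spouses + parents/children + the passenger themselves
toy_df["Family Size"] = toy_df["SibSp"] + toy_df["Parch"] + 1
print(toy_df["Family Size"].tolist())  # [2, 1, 6]
```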

### **Making X_train and X_test from the prepared data**
- **X_train** and **X_test** must have the exact same cols
-  **X_train** cannot have **Survived** col--that's the answers
- **Survived** col is saved to **y_train**

In [26]:
# make X_train -- a df of just features (cols) used for training

In [29]:
# make X_test -- a df of just features (cols) used for testing
# the X_test features must exactly match X_train

### **Making y_train from the titanic_train_df Survived col**

In [29]:
# make y_train -- a vector of just the target values of 0 and 1
# There is NO y_test (only Kaggle has those "answers")


### **Training the RandomForestClassifier model**

In [30]:
# Now that we have X_train and y_train, we can train a model BUT first instantiate:
# instantiate RandomForestClassifier

# rand_forest_model = RandomForestClassifier(
#     n_estimators=400,       # many trees
#     max_depth=None,         # let trees grow, control with min_* instead
#     min_samples_split=4,
#     min_samples_leaf=1,
#     max_features="sqrt",    # good default for classification
#     bootstrap=True,
#     class_weight="balanced", # Titanic is imbalanced (~38% survived)
#     n_jobs=-1,
#     random_state=42,
#     criterion="gini"        # default split criterion; usually performs much like 'entropy'
# )

In [31]:
# train the RandomForestClassifier model on X_train and y_train


In [32]:
# have the model predict survival of the test set (X_test)
# AGAIN : WE DO NOT HAVE y_test, those being the answers (ONLY Kaggle has that)
# Ergo: WE can only know how well our model did if we upload our predictions to Kaggle
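The X_train / y_train / X_test workflow from the last few cells can be sketched end to end on toy data. The frames `toy_train` and `toy_test`, the three-column `feature_cols` list and the tree count are illustrative assumptions, not the notebook's final feature set or settings.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy, already-numeric stand-ins for the prepared train and test frames
toy_train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2, 3, 1],
    "Sex":      [1, 0, 0, 0, 1, 1, 1, 0],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 21.08, 51.86],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})
toy_test = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex":    [1, 0, 0],
    "Fare":   [7.83, 82.27, 26.00],
})

feature_cols = ["Pclass", "Sex", "Fare"]   # hypothetical feature set
X_train = toy_train[feature_cols]          # features only, no Survived col
y_train = toy_train["Survived"]            # the answers
X_test = toy_test[feature_cols]            # same columns, same order

# instantiate, then train, then predict
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred)  # one 0/1 prediction per test row
```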


In [33]:
print()




In [34]:
print()




In [35]:
# check the model's predictions (even though we cannot know how accurate they are):


In [36]:
# make the required Kaggle df:
# MUST have exactly two cols: "PassengerId" and "Survived"
# "PassengerId" value: consec ints from 892-1309 (we can generate this w range())
# "Survived" value: our model's predictions as y_pred


In [37]:
print() # (418, 2)





In [38]:
# save the predictions df to csv for uploading to Kaggle:
# specify index=False, or else the csv gets an extra unnamed index column
# (it shows up as 'Unnamed: 0' when the csv is read back in)
# Kaggle will reject that: the file MUST have exactly TWO cols ONLY


In [39]:
# load the csv right back up to df to make sure it's good to go:
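The build, save and reload steps for the submission could look like this sketch. The all-zero `y_pred` is a placeholder for the model's predictions, and an in-memory buffer stands in for a csv file on disk.

```python
import io
import numpy as np
import pandas as pd

# placeholder predictions; in the notebook these would come from model.predict(X_test)
y_pred = np.zeros(418, dtype=int)

submission_df = pd.DataFrame({
    "PassengerId": list(range(892, 1310)),   # 892..1309 inclusive = 418 ids
    "Survived": y_pred,
})

# write to an in-memory buffer here; a real run would pass a file path instead
buffer = io.StringIO()
submission_df.to_csv(buffer, index=False)   # index=False: no extra index column

# read it straight back to confirm the shape and the column names
buffer.seek(0)
check_df = pd.read_csv(buffer)
print(check_df.shape)           # (418, 2)
print(list(check_df.columns))   # ['PassengerId', 'Survived']
```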


In [40]:
# predict 100% did not survive just to gain insight into the mysterious kaggle 'answer key'


In [41]:
# split() passenger name to isolate the title:
passenger_names = [ "Boulos, Mr. Hanna", "Duff Gordon, Sir. Cosmo Edmund (Mr Morgan)",
    "Jacobsohn, Mrs. Sidney Samuel (Amy Frances)", "Slabenoff, Mr. Petco",
    "Olsen, Mr. Henry Margido", "Lang, Mr. Fang", "Daly, Mr. Eugene Patrick",
    "Webber, Mr. James", "McGough, Mr. James Robert",
    "Andersson, Mrs. Anders Johan (Alfrida Konstant)", "Jardin, Mr. Jose Neto",
    "Laroche, Mrs. Joseph (Juliette Marie Louise)", "Shutes, Miss. Elizabeth W" ]

In [42]:
# challenge: get the title ("Sir") from the name
pass_name = "Duff Gordon, Sir. Cosmo Edmund (Mr Morgan)"
# first, split the name on the comma:

print()
# get the 2nd of the 2 items:

print()
# split the titled name on the dot:

print()
# get the 1st item from list--this is the title result we want:
# remove leading space
print()







In [43]:
# process the list of passenger names on a loop that calls the extract_title_from_name() function

    # first, split the name on the comma:
    # print(pass_name_list)
    # get the 2nd of the 2 items:

    # print(pass_name_titled)
    # split the titled name on the dot:

    # print(pass_title_name_list)
    # get the 1st item from list--this is the title result we want: replace leading space with empty string

    # print(pass_title)


In [44]:
# run a loop that iterates the passenger_names list, passing each name to function. Save return value of function -- the title -- to a new list:

print()
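One way the split-based title extraction might be written is sketched below. The function name `extract_title_from_name` comes from the cell above; the body is an assumption that follows the commented steps, and `sample_names` is a shortened copy of the passenger list. It uses `strip()` for the leading space, which leaves a multi-word title such as "the Countess" intact, whereas `replace(" ", "")` would turn it into "theCountess".

```python
def extract_title_from_name(pass_name):
    # split on the comma: ["Duff Gordon", " Sir. Cosmo Edmund (Mr Morgan)"]
    pass_name_list = pass_name.split(",")
    # get the 2nd of the 2 items: " Sir. Cosmo Edmund (Mr Morgan)"
    pass_name_titled = pass_name_list[1]
    # split the titled name on the dot: [" Sir", " Cosmo Edmund (Mr Morgan)"]
    pass_title_name_list = pass_name_titled.split(".")
    # the 1st item is the title; strip() removes the leading space
    pass_title = pass_title_name_list[0].strip()
    return pass_title

sample_names = ["Boulos, Mr. Hanna",
                "Duff Gordon, Sir. Cosmo Edmund (Mr Morgan)",
                "Jacobsohn, Mrs. Sidney Samuel (Amy Frances)",
                "Shutes, Miss. Elizabeth W"]

# loop the names, saving each returned title to a new list
titles = []
for name in sample_names:
    titles.append(extract_title_from_name(name))

print(titles)  # ['Mr', 'Sir', 'Mrs', 'Miss']
```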




In [45]:
# make a new title column for the train_titanic.csv


In [46]:
# make Title col again BUT chain all title-extractor code into inline lambda:
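The chained, inline-lambda version might look like the sketch below. `toy_df` is a hypothetical three-row stand-in for the training df, with names taken from the head() output earlier.

```python
import pandas as pd

# stand-in Name column (names from the training data preview)
toy_df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris",
                                "Heikkinen, Miss. Laina",
                                "Futrelle, Mrs. Jacques Heath (Lily May Peel)"]})

# split on comma, take the 2nd piece, split on dot, take the 1st piece, trim spaces
toy_df["Title"] = toy_df["Name"].apply(
    lambda name: name.split(",")[1].split(".")[0].strip())

print(toy_df["Title"].tolist())         # ['Mr', 'Miss', 'Mrs']
print(toy_df["Title"].unique().tolist())
```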


In [47]:
# print unique values from Title col


In [48]:
# ['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms', 'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'theCountess', 'Jonkheer']
# consolidate rare titles into two labels: "RareFemale" and "RareMale"
rare_male_titles = ['Don', 'Rev', 'Dr', 'Major', 'Sir', 'Col', 'Capt', 'Jonkheer']
rare_female_titles = ['Lady', 'theCountess', 'Ms', 'Mme', 'Mlle']
common_male_titles = ['Mr', 'Master']
common_female_titles = ['Mrs', 'Miss']
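A sketch of how the four title lists could be turned into group labels and then used to compare survival rates. The group names 'CommonMale' and 'CommonFemale' are taken from the later cells. The lookup dict `title_to_group` and the six-row `toy_df` (with made-up Survived values) are assumptions for illustration.

```python
import pandas as pd

rare_male_titles = ['Don', 'Rev', 'Dr', 'Major', 'Sir', 'Col', 'Capt', 'Jonkheer']
rare_female_titles = ['Lady', 'theCountess', 'Ms', 'Mme', 'Mlle']
common_male_titles = ['Mr', 'Master']
common_female_titles = ['Mrs', 'Miss']

# build one lookup dict: title -> group label
title_to_group = {}
for group_name, title_list in [("RareMale", rare_male_titles),
                               ("RareFemale", rare_female_titles),
                               ("CommonMale", common_male_titles),
                               ("CommonFemale", common_female_titles)]:
    for title in title_list:
        title_to_group[title] = group_name

# made-up rows, just to show the mechanics
toy_df = pd.DataFrame({
    "Title":    ["Mr", "Dr", "Miss", "Mlle", "Master", "Lady"],
    "Survived": [0, 0, 1, 1, 1, 1],
})
toy_df["Title"] = toy_df["Title"].map(title_to_group)

# the mean of a 0/1 column is the survival rate for each group
survival_by_title = toy_df.groupby("Title")["Survived"].mean()
print(survival_by_title)
```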


In [49]:
# re-print unique values from Title col


In [50]:
# check survival of 'Master'

In [51]:
print()




In [52]:
# check survival of 'RareFemale' titles:


In [53]:
# check survival of 'RareMale' titles:


In [54]:
# check survival of 'CommonMale' titles:


In [55]:
# check survival of 'CommonFemale' titles:


**LabelEncoder.fit_transform(list_of_strings)** takes a list of strings and returns a corresponding NumPy array of integer codes (one code per unique string, assigned in sorted order)
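A minimal sketch of `fit_transform` on a made-up list, `sexes`. The encoder sorts the unique strings, so 'female' gets 0 and 'male' gets 1, and `classes_` records that order.

```python
from sklearn.preprocessing import LabelEncoder

sexes = ["male", "female", "female", "male"]   # illustrative input

encoder = LabelEncoder()
encoded = encoder.fit_transform(sexes)

print(encoded.tolist())                             # [1, 0, 0, 1]
print(encoder.classes_.tolist())                    # ['female', 'male']
print(encoder.inverse_transform([0, 1]).tolist())   # ['female', 'male']
```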




- **n_estimators=10**
  - Definition: Specifies the number of trees in the forest (i.e., the number of decision trees).
  - Explanation:
    - In a Random Forest model, multiple decision trees are built, and their predictions are averaged (for regression) or voted on (for classification).
    - The more trees you have, the more stable and accurate the model can be, although it may take longer to train.
  - In this case: You are using 10 decision trees.
- **criterion='entropy'**
  - Definition: This parameter specifies the function used to measure the quality of a split when constructing each tree.
  - Explanation:
    - The two common criteria are gini (Gini impurity) and entropy (information gain).
      - entropy: Measures how much information is gained by making a split. It uses the concept of information theory to find splits that reduce uncertainty (entropy) in the target labels.
      - gini: Measures the degree of "impurity" in the nodes and tends to be slightly faster.
  - In this case: The Random Forest uses entropy to evaluate how splits reduce the uncertainty of the class labels in the dataset.
- **random_state=42**
  - Definition: This parameter sets the seed for the random number generator.
  - Explanation:
    - Random Forests introduce randomness by selecting random subsets of data for each tree and selecting random subsets of features for splitting at each node.
    - random_state ensures reproducibility by controlling this randomness. Using the same seed (e.g., 42) ensures the same results across different runs (all else being equal).

- it's all good to go -- time to submit the csv file to kaggle
- download the csv file and upload it to kaggle at

 **https://www.kaggle.com/competitions/titanic**