# Predicting Titanic Survival using..
## **sklearn (Scikit Learn) machine learning (ML) library**
- **sklearn** is one of the most widely used Python libraries for classical machine learning, with a broad selection of ready-made algorithms that await your data

**titanic dataset** is an iconic dataset used for Machine Learning practice and training
- it contains 891 **observations** (rows) of 15 **variables** (columns) each
- the dataset is used to train ML models to predict the probability of surviving the Titanic disaster based on variables such as passenger class, age and gender (spoiler alert: first class female passengers had a *much* better chance of survival than third class males)
- **survived** column: 1 means survived; 0 means perished
- **sibsp** stands for siblings and spouses
- **parch** stands for parents and children
- some columns are repeats of each other:
  - **alive** (no, yes) is the string version of **survived** (0,1)
  - **pclass** (1,2,3) is the numeric version of **class** (First, Second, Third)
  - **who** (man, woman, child) overlaps **sex** (male, female)
  - **embarked** (S, C, Q) is short for **embark_town** (Southampton, Cherbourg, Queenstown)

### **seaborn** is another visualization / plotting library, built on top of matplotlib
- seaborn also has the titanic dataset built in, so no csv file required

In [None]:
# imp

In [None]:
# fro

In [None]:
# import sklearn machine learning packages:

# LabelEncoder converts string data to numeric:
# fro

# standardize the data: rescale each column so that its mean=0
# and every value is expressed in standard deviations from the mean;
# about 95% of all values fall within +-2 std (-2 to 2)
# fr

# import RandomForestClassifier for training model
# fr
# import confusion matrix for quantifying model's accuracy
# as TP, TN, FP, FN
# fro

#### **Scikit Learn (sklearn) packages**
- **sklearn.preprocessing.LabelEncoder** encodes target labels with values between **0** and **n_classes-1**
  - all data must be numeric for ML model training
  - so, passenger classes First, Second and Third become 0, 1 and 2
  - we need to convert object (string) columns ("sex", "embarked") into numbers
- **sklearn.model_selection.train_test_split** splits data into training and testing sets
  - **model** is trained on training set (fed data with the answers)
  - **model** is tested on testing set by having it predict answers (testing labels)
- **sklearn.preprocessing.StandardScaler** is for getting all values in a standard range,  
where all mean values are 0; values within +/- 1 std are in the 1 to -1 range
  - model needs all values in same range, so that it doesn't
  conclude that age, which ranges from 0-80,  
  is 40 times more important in predicting survival than passenger class, which ranges from just 0-2
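
The train / test split described above can be sketched as follows. This is a minimal illustration, not this notebook's actual cell: the toy arrays and the `test_size=0.2` and `random_state=42` settings are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (assumed for illustration): 10 rows, 2 feature columns
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)  # the "answers" (labels), one per row

# hold back 20% of the rows for testing; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The model only ever sees `X_train` and `y_train` during training; `X_test` is kept aside so that its predictions can be compared against `y_test`.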

- make a **confusion matrix** for each model (a 2x2 array showing TP, TN, FP, FN): `from sklearn.metrics import confusion_matrix`

**What Is Random Forest?**
- random forest is a tree-based supervised learning algorithm
- a random forest can be used for **regression** or **classification** problems; the **random forest classifier** is the classification version
- A random forest combines many decision trees (often hundreds) and trains each decision tree on a different random sample of the observations.
- For classification, the final prediction is decided by a vote across the individual trees; for regression, the predictions of the individual trees are averaged.

**What Is a Confusion Matrix?**
- a confusion matrix is a set of 4 values used to describe the results of a two-category classification
- its 4 categories are True Positive, False Positive, True Negative and False Negative:
  - **True Positive** : model said it's a 'cat' and it is a 'cat'
  - **True Negative** : model said it's not a 'cat' and that's right
  - **False Positive** : model said it's a 'cat' but it's not
  - **False Negative** : model said it's not a 'cat' but it is
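
As a small sketch of the four categories above, the following builds a confusion matrix from two hand-made label lists. The lists `y_true` and `y_pred` are invented for illustration; sklearn lays out the 2x2 result with true labels as rows and predicted labels as columns.

```python
from sklearn.metrics import confusion_matrix

# invented example labels: 1 = 'cat', 0 = not a 'cat'
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # the real answers
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # what a model said

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 3]]

# for 0/1 labels, ravel() flattens the matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```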


In [None]:
# load titanic dataset from within seaborn (no csv file)
# t
print() # (891, 15)
# t




In [None]:
# ti

In [None]:
# load the same data again BUT this time from train_titanic.csv file provided by Kaggle
# ba
# cs
# tk_
print() # (891, 12)
# tk_




In [None]:
# print column names for both for comparing:
print('seaborn columns:\n')
print('\nkaggle columns:\n')
# 3 L@@K: the seaborn data has a lot of redundant columns
# if we know gender and age, we can infer man/woman/child

seaborn columns:


kaggle columns:



In [None]:
# print a few cols unique values to see overlapping and redundancy:
print()
print()





In [None]:
# view statistical data for the numeric columns
# ti

In [None]:
# tk_

In [None]:
# get info about the df:
# tk_
# L@@K: we are missing a lot of ages and almost all Cabin values are null
# the move is to drop Cabin column altogether and drop missing Age rows
# BUT definitely DO KEEP Age column

In [None]:
# drop the Cabin column:
# tk_

In [None]:
print() # (891, 11)
# tk_




In [None]:
# make a new column for titanic kaggle called "who"
# set it equal to the "who" col from other titanic_df
# this way, we can know if the person is a child
# which is useful for filling missing ages w more accurate age
# tk_

In [None]:
print()
# tk_




In [None]:
# fill the 177 missing ages w mean age

In [None]:
# calculate the mean age of the adults and the children
# man_
# man

In [None]:
# mea
# me

In [None]:
# mea
# mea

In [None]:
# mea
# mea

- **df['col'].fillna(value)** is for filling with same value for all null / missing values
- if you need to fill based on a condition (man, woman, child), use **df['col'].apply(lambda)**

In [None]:
# fill the 2 missing Embarked with the most common value, which is "S"
# add inplace=True if you don't want to assign the changed df back to itself
# tk_

In [None]:
# tk_

In [None]:
# fill the 177 missing ages with the correct mean age
# call apply with a lambda on the WHOLE df (axis=1), since we need access to 2 different columns
# the lambda expression takes each row as its argument and returns the value for that row
# tk

#### **How the correct value is chosen from the dictionary passed to `map()`**

The correct value is chosen from the dictionary passed to `map()` based on the value in the `who` column for each row. Here's how it works:

1. **Mapping operation**: The `map()` function takes each value in the `who` column and looks it up in the dictionary `{'man': 33.2, 'woman': 32.0, 'child': 6.4}`.

2. **Dictionary lookup**: For each row, the `map()` function uses the value in the `who` column (either `'man'`, `'woman'`, or `'child'`) as the key to look up the corresponding value (either `33.2`, `32.0`, or `6.4`) in the dictionary.

3. **Assignment of values**: It then assigns that value (e.g., `33.2` for `'man'`) to the corresponding `Age` value, but only for rows where `Age` is `NaN` because we use `loc` to specifically target those rows.

This allows for a very efficient and direct mapping between the categories in the `who` column and the desired `Age` values.
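
The lookup described above can be sketched on a tiny stand-in dataframe. The five rows are invented for illustration; the dictionary values are the mean ages quoted in the text.

```python
import numpy as np
import pandas as pd

# tiny stand-in for the Titanic data (rows invented for illustration)
df = pd.DataFrame({
    "who": ["man", "woman", "child", "man", "child"],
    "Age": [40.0, np.nan, np.nan, np.nan, 4.0],
})

mean_ages = {"man": 33.2, "woman": 32.0, "child": 6.4}

# boolean mask: True only on rows where Age is missing
missing = df["Age"].isna()

# look up each missing row's 'who' value in the dict and assign the result
df.loc[missing, "Age"] = df.loc[missing, "who"].map(mean_ages)

print(df["Age"].tolist())  # [40.0, 32.0, 6.4, 33.2, 4.0]
```

Rows that already had an age (the first and last) are left untouched because the mask excludes them.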


In [None]:
# shorter version: use loc to target only the rows where Age is NaN, and
# then set the values based on the who column. Here's a more concise solution:
# tk

In [None]:
# tk

In [None]:
# tk_

- **df['col'].value_counts()** returns an itemized count for each discrete category in the col

In [None]:
# 'perished / survived' ratio of full 891 rows
# 549 / 342 (61.62% of the 891 did not survive)
# get the 'perished / survived' ratio (value_counts) for the 891 rows
# tk_ # 61.62% of 891 did not survive

In [None]:
# get the breakdown of the 3 passenger classes
# tk_
# 'Pclass'

In [None]:
# tk_

**groupby()**
- **group_count_df = df.groupby('col_1')[['col_2']].count()**
- **group_mean_df = df.groupby('col_1')[['col_2']].mean()**
- **group_mean_df** has 'col_1' as its index, where each category of the column is a row
- **group_mean_df** has one col, 'col_2', whose values are the mean (or, for **group_count_df**, the count) for that row category
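
A short sketch of the difference between the aggregations, using a six-row stand-in dataframe invented for illustration. It also shows why `count()` on a 0/1 column gives the number of rows per group rather than the number of survivors.

```python
import pandas as pd

# invented mini dataset: passenger class and survival flag
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

count_df = df.groupby("Pclass")[["Survived"]].count()  # rows per class
sum_df = df.groupby("Pclass")[["Survived"]].sum()      # survivors per class
mean_df = df.groupby("Pclass")[["Survived"]].mean()    # survival rate per class

print(count_df["Survived"].tolist())  # [2, 1, 3]
print(sum_df["Survived"].tolist())    # [2, 0, 1]
```

Because the flag is 1 for survived and 0 for perished, `sum()` counts survivors and `mean()` gives the survival rate, while `count()` simply counts non-null rows.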


In [None]:
# get a break down of passengers by class:
# 'Pclass''Survived'
# p
# L@@K: careful -- the result is not how many survived per class,
# but how many passengers there were per class

In [None]:
# pc

In [None]:
# change the column with the value counts to 'count'
# pc

In [None]:
# pc

In [None]:
# do a groupby to get survival rates by sex
# 'Sex' 'Survived'
# su
print()
# su




In [None]:
# do a groupby to get survival totals by sex
# co
# cou
print()
# cou




In [None]:
# but this is not showing survival, but rather the number of males vs females
# the total is 891, which is the total number of observations (rows) in our df
# 'sex'

**double and triple groupby()**
- pass a list of columns to groupby() method
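
As a sketch of grouping by two columns at once, the example below uses a small dataframe invented for illustration. The result has a two-level index, one level per grouping column.

```python
import pandas as pd

# invented mini dataset
df = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# survival rate for every (Sex, Pclass) combination
rates_df = df.groupby(["Sex", "Pclass"])[["Survived"]].mean()

# index a row of the two-level index with a tuple
print(rates_df.loc[("male", 3), "Survived"])  # 0.5

# reset_index() turns the index levels back into ordinary columns
flat_df = rates_df.reset_index()
print(flat_df.columns.tolist())  # ['Sex', 'Pclass', 'Survived']
```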

In [None]:
# 'Sex' 'Pclass' 'Survived'
# 'Survived' 'Count'
# pc
# p
# pc

In [None]:
# 'Sex' 'Pclass' 'Embarked' 'Survived'
# 'Survived' 'Count'
# pc
# pc
# pc

In [None]:
# if 'survived' is actually showing total number of passengers
# how can we break down the triple group by survival?
# add a 4th item to group by list: 'survived'
# 'survived', 'sex', 'pclass', 'embarked'
# su
# L@@K: survived at left is 0 vs 1 -- the perished vs. survived breakdown we want to see

In [None]:
# su

In [None]:
# do a groupby to get survival totals by sex
# s
# sur
# surv

In [None]:
# plot survived_counts_df by calling plot() on the survived_counts_df dataframe
# set plot equal to ax so that we can show bar values
# ax

# 'Titanic Passengers: Survived vs. Perished')
# pl
# 'Number of Passengers'
# pl
# CHALLENGE: Label the bars with their counts. Hint: this requires the ax variable as
# well as a for loop
# p
# plt'Perished','Survived'

# for bar in ax.containers:
#   ax.bar_label(bar, padding=5, color="coral")
# ax
print()
# p




**seaborn** provides another way to make bar charts, which are side-by-side numeric comparisons of a column's values:
- **sns.countplot(x='col_name', data=df)**
  - makes a bar chart comparing counts
  - does not require a "grouped df" (like survived_counts_df) for it to work
  - **x** is the 'col_name' ; category values appear below x-axis
  - **data** is the dataframe

In [None]:
# make a seaborn countplot showing survived vs perished
# one bar per category of the 'survived' column;
# bar height is the count of that category
# ax

# 'Titanic Passengers: Survived vs. Perished')
# pl
# 'Perished','Survived'
# 'Number of Passengers'

# show values above bars
# ax

# pl

In [None]:
# make a seaborn countplot of the counts of each passenger class
# ax

# 'Titanic Passengers by Class'
# p
# 'First','Second','Third'
# 'Number of Passengers'

# show values above bars
# for bar in ax.containers:
#   ax.bar_label(bar, padding=5, color="coral")
# a

# pl

**making a bar chart with data from two columns at once**
- **sns.countplot(x='col_1', hue='col_2', data=df)**
  - makes a bar chart comparing counts of *two* columns
  - **x** is column 1; its category values appear below the x-axis
  - **hue** is column 2; each x category is split into one colored bar per hue value
  - **data** is the dataframe
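
A minimal sketch of a two-column countplot, assuming seaborn and a matplotlib version that provides `bar_label` (3.4 or later). The ten-row dataframe and the helper column `Outcome` are invented for illustration and are not part of the Titanic data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# invented mini dataset
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})
# hypothetical helper column so the legend reads as words instead of 0 / 1
df["Outcome"] = df["Survived"].map({0: "Perished", 1: "Survived"})

# one group of bars per class, one bar per outcome within each group
ax = sns.countplot(x="Pclass", hue="Outcome", data=df)
ax.set_title("Survival by Class (toy data)")
ax.set_ylabel("Number of Passengers")

# each hue value gets its own container of bars; label them all
for container in ax.containers:
    ax.bar_label(container, padding=3)

plt.show()
```

Looping over `ax.containers` labels every bar; calling `ax.bar_label(ax.containers[0])` alone would label only the bars of the first hue value.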

In [None]:
# make a seaborn countplot of survival counts broken down by passenger class
# ax

# 'Titanic Survival by Class'
# pl
# 'Perished','Survived'
# 'Number of Passengers'
# plt
# show values above bars
# f
# ax.bar_label(ax.containers[0], padding=5, color="coral")

# pl

In [None]:
# divide survived vs perished by sex
# 'Survived' 'Sex'
# ax

# 'Titanic Survival by Gender'
# pl
# 'Perished','Survived'
# 'Number of Passengers'
# pl
# show values above bars
# fo
# ax.bar_label(ax.containers[0], padding=5, color="coral")

# pl

**subdivide Pclass by Sex**
- create a new column that combines both Pclass and Sex
- then use this new column as the hue.

#### How to Arrange Bars in a Specific Order (1st Class Female/Male, 2nd Class Female/Male, 3rd Class Female/Male)

To arrange the bars in a specific order in your seaborn `countplot`, follow these steps:

#### Step 1: Set the desired order for the combined `Pclass_Sex` column
Convert the `Pclass_Sex` column into a categorical type and specify the order of categories:

```python
order = ['1_female', '1_male', '2_female', '2_male', '3_female', '3_male']
tk_df['Pclass_Sex'] = pd.Categorical(tk_df['Pclass_Sex'], categories=order, ordered=True)
```


In [None]:
# make a new column whose value is a concatenated string of Pclass and Sex
# '1_female', '1_male', '2_female', '2_male', '3_female', '3_male'
# 'Pclass_Sex'
# 'Pclass_Sex'

In [None]:
# tk

In [None]:
# survived vs perished by pclass AND sex using the new combo col as the hue
# 'Survived' 'Sex'
# 'Survived', 'Pclass_Sex'

# 'Titanic Survival by Class and Gender'
# pl
# 'Perished','Survived'
# 'Number of Passengers'
# pl
# show values above bars
# fo
# ax.bar_label(ax.containers[0], padding=5, color="coral")

# pl

In [None]:
# Challenge: plot survival by Embarked Port
# hint: no combo new column necessary, just an x for the main col
# and hue for the secondary column
# label the bars w their values
# divide survived vs perished by port of embarkation
# 'Survived' 'Embarked'
# ax

# 'Titanic Survival by Port of Embarkation'
# pl
# 'Perished','Survived'
# 'Number of Passengers'
# pl
# show values above bars
# fo
  # ax

# rename the legend from S,Q,C to Southampton, Queenstown, Cherbourg
# make a dictionary with current legend labels as keys and new labels as values
# embarked_dict = {'S':'Southampton', 'Q':'Queenstown', 'C':'Cherbourg'}
# get handles and labels from the legend
# ha
# replace the labels

# use list comprehension to look up each current label in the dict:
# list comprehension provides a "one-liner" alternative to a loop:
# new_labels = [ embarked_dict[key] for key in labels ]
# the equivalent for loop:
# new_labels = []
# for key in labels:
#   new_labels.append(embarked_dict[key])

# or if you just know the new label names:

# update the legend
# 'Southampton', 'Queenstown', 'Cherbourg'
# "Port of Embarkation"

# pl
# L@@K: More than half the passengers who embarked at Cherbourg survived
# vs. 2/3 of the passengers who got on at Southampton perished

In [None]:
# 1. Given a list of fruits, we want a new list of just the berries.
# Without list comprehension you would write a for loop containing an if statement
# that checked the current fruit to see if it contained the substring "berry":
# "apple", "banana", "blackberry", "blueberry", "cherry", "cranberry", "grape",
# "kiwi", "lemon", "mango", "peach", "raspberry", "strawberry", "tangerine", "watermelon"

# 2. Declare a list for holding the result (the berries)
# ber

# 3. Loop the list and check the condition, appending only berries to the new list
# for
print('for loop result:')

for loop result:



#### **map() vs filter**
- like **map(func, list)**, **filter(func, list)** takes 2 arguments:
  - a function and a list
- you can pass a named function or an inline anonymous lambda function
- each item is passed as the arg of the function
- **map()** yields the SAME NUMBER of items as in the original
- **filter()** yields the same number or FEWER items than the original
- the function passed to filter must return a **boolean** (True / False)
- only items for which it returns **True** are kept in the new list
- both return lazy objects (a map object, a filter object) that you unpack by passing them to **list()**

**example: berries2 = list(filter(lambda fru : 'berry' in fru, fruits))**
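
A side-by-side sketch of the two functions on a short list. The four-item `fruits` list and the `lengths` name are chosen for illustration.

```python
fruits = ["apple", "blueberry", "cherry", "raspberry"]

# map: one output per input, so the result has the same number of items
lengths = list(map(lambda fru: len(fru), fruits))
print(lengths)  # [5, 9, 6, 9]

# filter: keeps only the items for which the lambda returns True
berries2 = list(filter(lambda fru: "berry" in fru, fruits))
print(berries2)  # ['blueberry', 'raspberry']
```

Note that "cherry" is dropped: it ends in "erry" but does not contain the substring "berry".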

In [None]:
# ber
('berries2 via filter:')

'berries2 via filter:'

### list comprehension
- List comprehension offers a shorter way to make a new list from items in an existing list.
- The syntax is: **new_list = [return_value for_loop if_condition]**
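
As a sketch of the syntax above, here is the berry example written both ways, using the fruit list from earlier in this notebook. The names `berries` and `berries3` follow the notebook's pattern but are otherwise arbitrary.

```python
fruits = ["apple", "banana", "blackberry", "blueberry", "cherry", "cranberry",
          "grape", "kiwi", "lemon", "mango", "peach", "raspberry", "strawberry",
          "tangerine", "watermelon"]

# for loop version: declare the list, loop, test, append
berries = []
for fru in fruits:
    if "berry" in fru:
        berries.append(fru)

# list comprehension version: [ return_value for_loop if_condition ]
berries3 = [fru for fru in fruits if "berry" in fru]

print(berries3)
# ['blackberry', 'blueberry', 'cranberry', 'raspberry', 'strawberry']
```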

In [None]:
# list comprehension way:
# [ return_value for loop if condition ]
# ber
# 'berries3 via list comprehension:'

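**pivot tables: summarize one column's values in a grid, with one column's categories as the rows and another column's categories as the columns**
- **df.pivot_table(index='col_1', columns='col_2', values='col_3')**
- by default, each cell holds the **mean** of the values for that row / column combination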
- index can have two or more levels by passing **index** a list
- **df.pivot_table(index=['col_1','col_2'], columns='col_3', values='col_4')**
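
A minimal pivot table sketch on an eight-row dataframe invented for illustration. It relies on `pivot_table` aggregating with the mean by default, so a 0/1 `Survived` column turns into survival rates.

```python
import pandas as pd

# invented mini dataset
df = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass":   [1, 2, 3, 1, 2, 3, 3, 1],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 1],
})

# rows = sex, columns = passenger class, cells = mean of Survived
pivot_df = df.pivot_table(index="Sex", columns="Pclass", values="Survived")

# rename the 1, 2, 3 columns and convert rates to percentages
pivot_df.columns = ["First", "Second", "Third"]
pivot_pct_df = pivot_df * 100

print(pivot_pct_df.loc["male", "Third"])    # 50.0
print(pivot_pct_df.loc["female", "First"])  # 100.0
```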

In [None]:
# make a pivot_table() where sex categories ('male', 'female') are the index values
# columns are passenger class string ('First', 'Second', 'Third') and values are 'survived'
# 'Sex', 'Pclass', 'Survived'
# "First","Second","Third"
# format as pcts: multiply the entire df by 100 -- this is a matrix operation, cuz it's 2D
# as opposed to a vector operation which is math done across all items in a column vector
# piv
# pi

In [None]:
# Challenge: make a pivot table where:
# row names are passenger classes (First, Second, Third -- NOT 1,2,3)
# column names are sex
# values are survival pcts
# 'Survived'
# "First","Second","Third"
# pi
# piv

In [None]:
# 'Embarked' 'Sex' 'Pclass' 'Survived'
# 'Sex','Embarked', 'Pclass', 'Survived'
# pivo
# p
# pivo

**label_encoder.fit_transform()** method takes a list of strings and returns corresponding numbers

In [None]:
# test the label encoder on a simple dataset we make from scratch
# 'aardvark', 'barracuda', 'crocodile', 'dolphin', 'elephant',
#              'frog', 'giraffe', 'hawk', 'kangaroo', 'lizard', 'marlin',
#              'narwhal', 'ostrich', 'piranha', 'quail', 'raccoon', 'salamander',

# 'mammal', 'fish', 'reptile', 'mammal', 'mammal', 'amphibian',
#               'mammal', 'bird', 'mammal', 'reptile', 'fish', 'mammal',
#               'bird', 'fish', 'bird', 'mammal', 'amphibian'


In [None]:
# make dataframe from animals_dict:
# an
print() # (17, 2)
# a




**list(df['col_name'].unique())** returns a list of unique column values


In [None]:
# an
# ani

**LabelEncoder.fit_transform(list_of_strings)** takes a list of strings and returns a corresponding list of numbers
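
A small sketch of the encoder on a short list of animal classes (a shortened version of the list above, chosen for illustration). LabelEncoder numbers the categories in sorted order, which is why the codes follow the alphabet rather than the order of first appearance.

```python
from sklearn.preprocessing import LabelEncoder

classes = ["mammal", "fish", "reptile", "mammal", "bird", "amphibian"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(classes)

# the learned categories, in sorted order: their positions are the codes
print(list(label_encoder.classes_))
# ['amphibian', 'bird', 'fish', 'mammal', 'reptile']
print(list(codes))  # [3, 2, 4, 3, 1, 0]

# inverse_transform converts numbers back to the original strings
print(list(label_encoder.inverse_transform([0, 4])))  # ['amphibian', 'reptile']
```

The same rule explains the Titanic encodings: 'female' / 'male' become 0 / 1, and 'C' / 'Q' / 'S' become 0 / 1 / 2.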




In [None]:
# instantiate a label encoder for converting string data to numeric:
# l

In [None]:
# convert the animal classes (mammal, bird, fish, amphibian, reptile) to numbers 0-4
# 'class num'
# an

In [None]:
# an

In [None]:
# L@@K: we have 2 cols with string values that we need to make numeric
# sex has 2 categories (male, female), which need to be converted to 0,1
# embarked has 3 categories (S,C,Q), which need to become 0,1,2

In [None]:
# use label_encoder.fit_transform(list_of_strings) to convert 'sex' and 'embarked' from strings to ints
# tk
# tk_

In [None]:
# CHALLENGE: label encode the 'Embarked' column: change S C Q into numbers 0,1,2
# tk
# tk_

In [None]:
# SPLIT the DATA into TRAINING INPUTS and TRAINING LABELS
# X is the training inputs, which is all rows and all cols EXCEPT for the survived column
# the training inputs do not contain the survived column, because survived (0,1) is the answer
# that we want the model to be able to learn to predict
# X : titanic_kag_df[['']].va # X is all rows, all cols except 'survived'
# y .il : .va # y is all rows, just first col, so 'survived' col
# 'Pclass', 'Sex', 'Age'
# 'Pclass','Embarked','Sex','Age','Fare','SibSp','Parch'

In [None]:
print()
# X_




In [None]:
# 'Survived'
print() # (891,)
# y_




**StandardScaler.fit_transform(col)** creates an "even playing field" for the data
- it takes a column and recalculates all the values, such that the mean is 0
- a value 1 standard deviation above (or below) the mean becomes 1 (or -1)
- works best for continuous values (age, fare) and is NOT meant for discrete values (pclass 1,2,3)
- for discrete values, we can use one hot encoding instead, whereby each class/category gets its own col (1st, 2nd, 3rd cols from pclass)
- this way, columns with bigger numbers (such as age compared to passenger class) are not judged to be more important
- for roughly bell-shaped data, about 68% of the values will be in the -1 to 1 range (plus or minus 1 standard deviation from the mean)
- about 95% of the values will be in the -2 to 2 range (plus or minus 2 standard deviations from the mean)
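
A minimal sketch of scaling only the continuous columns, on a five-row dataframe invented for illustration. The discrete `Pclass` column is deliberately left alone.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# invented mini dataset
X = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 1],
    "Age":    [10.0, 20.0, 30.0, 40.0, 50.0],
    "Fare":   [80.0, 20.0, 8.0, 7.0, 60.0],
})

scaler = StandardScaler()

# scale just the continuous columns and write them back in place
X[["Age", "Fare"]] = scaler.fit_transform(X[["Age", "Fare"]])

print(X["Age"].round(3).tolist())  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After scaling, each scaled column has mean 0 and standard deviation 1. A common practice is to call `scaler.transform()` (not `fit_transform()`) on a separate test set, so that it is scaled with the mean and standard deviation learned from the training data.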



In [None]:
# instantiate the StandardScaler object:
# sc

In [None]:
# X

In [None]:
# The y data is already only 0 or 1, so
# we don't need to scale that -- just scale the X (input) data
# use .loc to specify all rows as : and target cols as list of col names
# 'Age','Fare'

In [None]:
# X_

In [None]:
# 'images/random-forest-classifier-2.jpg'

In [None]:
# 'images/random-forest-tree-w-text.jpg'

- **n_estimators=10**
  - Definition: Specifies the number of trees in the forest (i.e., the number of decision trees).
  - Explanation:
    - In a Random Forest model, multiple decision trees are built, and their predictions are averaged (for regression) or voted on (for classification).
    - The more trees you have, the more stable and accurate the model can be, although it may take longer to train.
  - In this case: You are using 10 decision trees.
- **criterion='entropy'**
  - Definition: This parameter specifies the function used to measure the quality of a split when constructing each tree.
  - Explanation:
    - The two common criteria are gini (Gini impurity) and entropy (information gain).
    - entropy: Measures how much information is gained by making a split. It uses the concept of information theory to find splits that reduce uncertainty (entropy) in the target labels.
    - gini: Measures the degree of "impurity" in the nodes and tends to be slightly faster.
  - In this case: The Random Forest uses entropy to evaluate how splits reduce the uncertainty of the class labels in the dataset.
- **random_state=42**
  - Definition: This parameter sets the seed for the random number generator.
  - Explanation:
    - Random Forests introduce randomness by selecting random subsets of data for each tree and selecting random subsets of features for splitting at each node.
    - random_state ensures reproducibility by controlling this randomness. Using the same seed (e.g., 42) ensures the same results across different runs (all else being equal).
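
The three parameters above can be sketched as follows. The random input data and the names `forest_a` and `forest_b` are assumptions made for illustration; the point is that two forests built with the same `random_state` on the same data behave identically.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# invented data: 100 rows, 3 features, label depends on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# two forests with identical settings, including the seed
forest_a = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=42)
forest_b = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=42)
forest_a.fit(X, y)
forest_b.fit(X, y)

# n_estimators=10 means 10 fitted decision trees inside the forest
print(len(forest_a.estimators_))  # 10

# same seed + same data -> same predictions
same_predictions = np.array_equal(forest_a.predict(X), forest_b.predict(X))
print(same_predictions)  # True
```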

In [None]:
# instantiate a RandomForestClassifier model:
# fo

**model.fit(X_train,y_train)** trains the model on the features with labels (that is, it gets the answers along with the questions)

In [None]:
# fo

**predictions_list = model.predict(X_test)** returns a NumPy array of model predictions, one per row of X_test
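
A minimal fit / predict sketch on invented data. The arrays, the simple 80 / 20 slice used as the split, and the `accuracy` name are assumptions for illustration, not the notebook's actual Titanic cells.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# invented data: 100 rows, 4 features, label depends on the first feature
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

# first 80 rows for training, last 20 held back for testing
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)           # learns from questions plus answers

predictions = model.predict(X_test)   # answers only the questions
print(type(predictions).__name__, predictions.shape)  # ndarray (20,)

# share of test rows where the prediction matches the real label
accuracy = (predictions == y_test).mean()
print(accuracy)
```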

In [None]:
# load the test set
# tk

In [None]:
print()
# t




In [None]:
# tk_

In [None]:
# there is no 'who' col of 'man', 'woman', 'child' to help fill missing ages
# so just fill all age NA w 30
# tk_t

In [None]:
# tk_t

In [None]:
# get the avg fare
# avg_
# avg_

In [None]:
# fill that one and only missing Fare w mean:
# tk_

In [None]:
# tk_

In [None]:
# 'Pclass','Embarked','Sex','Age','Fare','SibSp','Parch'

In [None]:
print()
# X_




In [None]:
# Label Encode the Embarked and Sex cols -- strings become ints
# 'Embarked'

In [None]:
# X_t

In [None]:
# X_t

In [None]:
# standard scaler the continuous values (non-categorical)
# 'Age','Fare'

In [None]:
# X

**testing our model on the test.csv from kaggle**

In [None]:
# have the new model predict survival on the kaggle 418 row test set
# sur

**imaginary passenger check**

In [None]:
# imaginary passenger check
# the list must have all 7 expected "independent variables" in correct order
# Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
# ['male' 'female'] = [1 0]
# ['S' 'C' 'Q'] = [2 0 1]

# inputs have to be a 2D matrix, so double wrap the list:
# 3	1	1	0.339424	-0.498407	0	0
# sec [2,	0, 20, 0,	0,	13.5,	0]
# thi [3,	1, 25, 0,	0,	9.5,	2]
# fir [1,	0, -0.3, 1,	0,	0.8,	1]
# [1,	0, -0.3, 1,	0,	0.8,	1],
#                                     [3,	1,  -0.3, 1,	0,	-0.5,	2],
#                                    [3,	1,  -0.3, 1,	0,	-0.5,	2],
#                                    [2,	1,  -0.3, 1,	0,	0.25,	1]
# ima

In [None]:
# ima

In [None]:
# imag

**Kaggle Titanic Survival Prediction Submission File Format:**

- You should submit a csv file with exactly 418 entries plus a header row.
- Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

- The file should have exactly 2 columns:

  - **PassengerId** (sorted in any order)
  - **Survived** (contains your binary predictions: 1 for survived, 0 for deceased)

In [None]:
# make a df for our predictions to submit to kaggle
# kaggle requires a csv of 2 cols: PassengerId and Survived
# my_ti
# 'PassengerId'
# 'Survived'

In [None]:
print()
# my




**saving a dataframe to csv**
- **df.to_csv(file_path/file_name, encoding='utf-8', index=False)**
- saves df as file_name to file_path
- **encoding='utf-8'** writes the file as UTF-8, a standard text encoding that handles English letters as well as all other Unicode characters
- **index=False** means do not make a column for the index
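
A small sketch of what `index=False` changes. To stay self-contained it writes to in-memory text buffers instead of files, and the three-row `submission_df` is invented for illustration.

```python
import io
import pandas as pd

# invented 3-row stand-in for the submission dataframe
submission_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# saved WITH index=False: only the two real columns are written
buffer = io.StringIO()
submission_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded_df = pd.read_csv(buffer)
print(reloaded_df.shape)  # (3, 2)

# saved WITHOUT index=False: the index comes back as an extra column
buffer_with_index = io.StringIO()
submission_df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index_df = pd.read_csv(buffer_with_index)
print(with_index_df.columns.tolist())  # ['Unnamed: 0', 'PassengerId', 'Survived']
```

With a real file, replace the buffer with a path such as a csv file name; the behavior of `index=False` is the same.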

In [None]:
# save dataframe as csv
# index=False prevents the index, which are ints from 0-417 from becoming a 3rd col
# my_

In [None]:
# load the csv to make sure it worked as expected
# my_
print() # (418, 2) expected; (418, 3) means we got an extra column
# my_
# kag_
# L@@K: an unwanted "Unnamed: 0" column of the old index values shows up if the csv was saved without index=False




- it's all good to go -- time to submit the csv file to kaggle
- download the csv file and upload it to kaggle at

 **https://www.kaggle.com/competitions/titanic**