# Individual Project - Titanic


## Table of Contents

[**Step 3: Data Preparation**](#Step-3:-Data-Preparation)
- [**Deal with Missing Data**](#Deal-with-Missing-Data)
- [**Feature Engineering**](#Feature-Engineering)

[**Step 4: Modeling**](#Step-4:-Modeling)


[Back to Top](#Table-of-Contents)


This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting that knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
#### Titanic Story
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class passengers.

#### Objective
We will build a regression model to predict the ticket price (Fare).



[Back to Top](#Table-of-Contents)

## Step 3: Data Preparation
Create new features through feature engineering, deal with missing values, and clean up the data (e.g., strip extra white space in string values). We will focus on dealing with missing data in this phase.
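As a minimal sketch of the clean-up step mentioned above, extra white space in string columns can be removed with the pandas `.str.strip()` accessor. The small frame `toy` below is made up for illustration and is not taken from `titanic.csv`.

```python
import pandas as pd

# Hypothetical toy data with stray spaces (not from titanic.csv)
toy = pd.DataFrame({
    'Name': ['  Smith, Mr. John ', 'Doe, Miss. Jane  '],
    'Embarked': [' S', 'C '],
})

# Strip leading and trailing white space in each string column
for col in ['Name', 'Embarked']:
    toy[col] = toy[col].str.strip()

print(toy['Name'].tolist())
```

The same loop could be pointed at the string columns of `df_titanic`, assuming they contain such padding.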

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# Load the data
df_titanic = pd.read_csv('titanic.csv')

# df_titanic.info()

# Check all missing data (only the last expression in a cell is displayed)
df_titanic.isnull().sum()
df_titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,$7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,$71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,$7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,$53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,$8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,$13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,$30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,$23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,$30.0,C148,C


In [4]:
# Find the most common first letter among the non-missing Cabin values (computed once)
most_common_letter = df_titanic['Cabin'].dropna().str[0].value_counts().index[0]

# Fill missing values in Cabin column with the most common first letter
df_titanic['Cabin'] = df_titanic['Cabin'].fillna(most_common_letter)

# Check for missing values in the DataFrame
print(df_titanic.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       2
dtype: int64


### Deal with Missing Data
We will demonstrate two approaches: filling with the mean or mode, and estimating the value from other columns.

#### Fill with Mean/Mode
Embarked has only 2 missing values, and there is no obvious way to estimate them, so we will simply fill them with the mode of the column, 'S'.

##### Task12: Fill missing Embarked with mode

In [4]:
# Fill missing values in 'Embarked' column with the mode
df_titanic['Embarked'] = df_titanic['Embarked'].fillna(df_titanic['Embarked'].mode()[0])
print(df_titanic['Embarked'].isnull().sum())


0


#### Fill with Estimated Value

A title is a word attached to a person's name in certain contexts. It may signify veneration, an official position, or a professional or academic qualification. It is also a good indicator of age: for example, Mr is used for adult men, while Master is used for young boys.

If we look at the names of the Titanic passengers, we can see that each name follows the format Last, Title. First. We can use this information to estimate missing ages.

- First, we will use a regular expression to extract the title from the name.
- Then we will convert the title to upper case.
- Then we will fill each missing age with the mean age for that title.
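The three steps above can be sketched end to end on a small made-up frame. This is only an illustration under stated assumptions: the names and ages in `toy` are hypothetical, and the fill uses `groupby(...).transform('mean')` as a compact alternative to looping over titles.

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers (names and ages are made up for illustration)
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Smith, Master. Tom', 'Doe, Mr. Alan',
             'Doe, Master. Sam', 'Roe, mr. Carl'],
    'Age': [40.0, 4.0, np.nan, np.nan, 30.0],
})

# Step 1: capture the text between the comma and the first period
toy['Title'] = toy['Name'].str.extract(r',\s*([^.]*)\.', expand=False)

# Step 2: normalise case so 'mr' and 'Mr' fall into the same group
toy['Title'] = toy['Title'].str.upper()

# Step 3: fill each missing age with the mean age of that title
toy['Age'] = toy['Age'].fillna(toy.groupby('Title')['Age'].transform('mean'))

print(toy[['Name', 'Title', 'Age']])
```

Because `transform('mean')` returns a Series aligned with the original rows, the result can be assigned straight back to the column, which avoids modifying a temporary copy.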

In [9]:
# Extract the title from each name: the text between the comma and the first period
df_titanic['Title'] = df_titanic['Name'].str.extract(r',\s*([^\.]*)\.', expand=False)

df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,$7.25,C,S,MR
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,$71.2833,C85,C,MRS
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,$7.925,C,S,MISS
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,$53.1,C123,S,MRS
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,$8.05,C,S,MR


##### Task13: Convert title to upper case
To ensure we get an accurate mean age for each title, we convert every title to upper case.

In [10]:
# Convert titles to uppercase
df_titanic['Title'] = df_titanic['Title'].str.upper()


##### Task14: Fill missing age with mean age of the title

In [16]:
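# Fill missing ages with the mean age of the corresponding title.
# mean_age_by_title is computed in cell In [13], shown below.
# Assign through .loc on the original DataFrame: calling fillna(inplace=True)
# on a .loc selection only modifies a temporary copy.
for title, mean_age in mean_age_by_title.items():
    mask = (df_titanic['Title'] == title) & df_titanic['Age'].isna()
    df_titanic.loc[mask, 'Age'] = mean_age
df_titanic[['Name', 'Age', 'Title']].head()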

Unnamed: 0,Name,Age,Title
0,"Braund, Mr. Owen Harris",22.0,MR
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,MRS
2,"Heikkinen, Miss. Laina",26.0,MISS
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,MRS
4,"Allen, Mr. William Henry",35.0,MR


In [13]:
# Calculate mean age for each title
mean_age_by_title = df_titanic.groupby('Title')['Age'].mean()
mean_age_by_title

Title
CAPT            70.000000
COL             58.000000
DON             40.000000
DR              40.242731
JONKHEER        38.000000
LADY            48.000000
MAJOR           48.500000
MASTER           7.086662
MISS            23.341584
MLLE            24.000000
MME             24.000000
MR              31.753762
MRS             35.055080
MS              28.000000
REV             43.166667
SIR             49.000000
THE COUNTESS    33.000000
Name: Age, dtype: float64

### Feature Engineering
We'll create a new column, FamilySize. Two columns relate to family size: Parch is the number of parents or children aboard, and SibSp is the number of siblings or spouses aboard.

Take the name 'Asplund' as an example: the total family size is 7 (Parch + SibSp + 1), and each family member has the same Fare, which means the Fare is for the whole group. Family size should therefore be an important feature for predicting Fare. Only 4 of the 7 Asplunds appear in the dataset because the dataset is only a subset of all passengers.
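To make the shared-fare observation concrete, here is a minimal sketch on a made-up family. The rows in `toy` are hypothetical and are not the Asplund records; the `Surname` column and the `summary` frame are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical family of four travelling on one ticket (values are made up)
toy = pd.DataFrame({
    'Name': ['Example, Mr. A', 'Example, Mrs. A',
             'Example, Miss. B', 'Example, Master. C'],
    'SibSp': [1, 1, 1, 1],   # spouse for the parents, sibling for the children
    'Parch': [2, 2, 2, 2],   # children for the parents, parents for the children
    'Fare': [30.0, 30.0, 30.0, 30.0],
})

# Family size counts relatives aboard plus the passenger
toy['FamilySize'] = toy['Parch'] + toy['SibSp'] + 1

# Group by surname to see how many members are present and whether they share one fare
toy['Surname'] = toy['Name'].str.split(',').str[0]
summary = toy.groupby('Surname').agg(
    members_in_data=('Name', 'size'),
    family_size=('FamilySize', 'first'),
    distinct_fares=('Fare', 'nunique'),
)
print(summary)
```

A `distinct_fares` value of 1 for a surname suggests the fare was recorded per group rather than per person; in the real data, `members_in_data` can be smaller than `family_size` because some relatives are missing from the subset.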

In [17]:
# df_titanic.Name.str.contains('Asplund')




                                                Name  Parch  SibSp  FamilySize
0                            Braund, Mr. Owen Harris      0      1           2
1  Cumings, Mrs. John Bradley (Florence Briggs Th...      0      1           2
2                             Heikkinen, Miss. Laina      0      0           1
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)      0      1           2
4                           Allen, Mr. William Henry      0      0           1


##### Task15: Create column 'FamilySize'
FamilySize = Parch + SibSp + 1

In [18]:
# Calculate FamilySize by adding Parch, SibSp, and 1 for the passenger themselves
df_titanic['FamilySize'] = df_titanic['Parch'] + df_titanic['SibSp'] + 1

# Print the first few rows to check the new FamilySize column
df_titanic[['Name', 'Parch', 'SibSp', 'FamilySize']].head()

Unnamed: 0,Name,Parch,SibSp,FamilySize
0,"Braund, Mr. Owen Harris",0,1,2
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1,2
2,"Heikkinen, Miss. Laina",0,0,1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1,2
4,"Allen, Mr. William Henry",0,0,1


[Back to Top](#Table-of-Contents)

## Step 4: Modeling

Now we have a relatively clean dataset (except for the Cabin column, which had many missing values). We could run a classification on Survived to predict whether a passenger survived the disaster, or a regression on Fare to predict the ticket fare. This dataset is not well suited to regression, but since classification is not covered in this workshop, we will construct a linear regression on Fare in this exercise.

##### Task16: Construct a regression on Fare
Construct a regression model with statsmodels.

Use Pclass, Embarked, and FamilySize as the independent variables.

In [26]:
import statsmodels.formula.api as smf

# Remove non-numeric characters ('$' and ',') from the 'Fare' column
df_titanic['Fare'] = df_titanic['Fare'].replace(r'[\$,]', '', regex=True)

# Convert 'Fare' column to numeric
df_titanic['Fare'] = pd.to_numeric(df_titanic['Fare'])

# Construct the regression model; C() marks Pclass and Embarked as categorical
model = smf.ols(formula="Fare ~ C(Pclass) + C(Embarked) + FamilySize", data=df_titanic)

# Fit the model
result = model.fit()

# Display the summary
result.summary()




Dep. Variable:,Fare,R-squared:,0.426
Model:,OLS,Adj. R-squared:,0.423
Method:,Least Squares,F-statistic:,131.0
Date:,"Mon, 22 Apr 2024",Prob (F-statistic):,8.01e-104
Time:,17:31:10,Log-Likelihood:,-4486.7
No. Observations:,889,AIC:,8985.0
Df Residuals:,883,BIC:,9014.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,79.2468,3.551,22.314,0.000,72.277,86.217
C(Pclass)[T.2],-59.0069,3.936,-14.990,0.000,-66.733,-51.281
C(Pclass)[T.3],-68.7956,3.269,-21.045,0.000,-75.211,-62.380
C(Embarked)[T.Q],-11.8535,5.454,-2.173,0.030,-22.557,-1.150
C(Embarked)[T.S],-14.9724,3.422,-4.375,0.000,-21.689,-8.256
FamilySize,7.8315,0.790,9.913,0.000,6.281,9.382

Omnibus:,1041.0,Durbin-Watson:,2.04
Prob(Omnibus):,0.0,Jarque-Bera (JB):,117978.06
Skew:,5.717,Prob(JB):,0.0
Kurtosis:,58.265,Cond. No.,13.4
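To read the coefficient table, it can help to evaluate the fitted equation by hand. The sketch below copies the coefficients from the summary above; `predict_fare` is a hypothetical helper written for this illustration, not part of statsmodels. First class and Embarked 'C' are the baseline levels, so they contribute no extra term.

```python
# Coefficients copied from the OLS summary above
coef = {
    'Intercept': 79.2468,
    'C(Pclass)[T.2]': -59.0069,
    'C(Pclass)[T.3]': -68.7956,
    'C(Embarked)[T.Q]': -11.8535,
    'C(Embarked)[T.S]': -14.9724,
    'FamilySize': 7.8315,
}

def predict_fare(pclass, embarked, family_size):
    """Hypothetical helper: evaluate the fitted linear equation by hand."""
    fare = coef['Intercept'] + coef['FamilySize'] * family_size
    # Baseline levels (Pclass 1, Embarked 'C') have no entry, so they add 0
    fare += coef.get(f'C(Pclass)[T.{pclass}]', 0.0)
    fare += coef.get(f'C(Embarked)[T.{embarked}]', 0.0)
    return fare

first_class = predict_fare(1, 'C', 1)   # 79.2468 + 7.8315
third_class = predict_fare(3, 'S', 1)   # 79.2468 + 7.8315 - 68.7956 - 14.9724
print(round(first_class, 4), round(third_class, 4))
```

With these coefficients, a solo first-class passenger from Cherbourg is predicted to pay about 87.08, while a solo third-class passenger from Southampton is predicted to pay only about 3.31. The very low second value, together with the R-squared of 0.426 and the strong skew reported above, is consistent with the earlier remark that this dataset is not well suited to linear regression.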


Conclusion: