# Machine Learning Foundations: Data Preparation

### Description
In this assignment, you will carry out data preparation tasks using The Complete Titanic Dataset. This is a well-
known training dataset about the tragedy of the Titanic, the British ocean liner that sank in the Atlantic Ocean
on April 15, 1912. You will acquire hands-on experience with data cleaning, preprocessing, encoding, and feature
engineering, which are essential steps in the machine-learning pipeline.
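
As a rough illustration of what these steps can look like, the sketch below applies one cleaning, one encoding, and one feature-engineering operation to a tiny made-up DataFrame. The sample rows, the choice of median imputation, and the `is_female` and `family_size` columns are assumptions chosen for illustration, not the assignment's required solution.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "age": [22.0, np.nan, 38.0],
    "sex": ["male", "female", "female"],
    "sibsp": [1, 0, 1],
    "parch": [0, 2, 0],
})

# Cleaning: fill missing ages with the median of the observed ages
toy["age"] = toy["age"].fillna(toy["age"].median())

# Encoding: turn the categorical 'sex' column into a 0/1 indicator
toy["is_female"] = (toy["sex"] == "female").astype(int)

# Feature engineering: family size = siblings/spouses + parents/children + the passenger
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

print(toy)
```

Other imputation or encoding strategies may suit the real data better; the point here is only the shape of each step.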

This dataset includes the following columns:
1. **survived**: Survival (0 = No; 1 = Yes).
2. **pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd).
3. **name**: Name.
4. **sex**: Sex.
5. **sibsp**: Number of Siblings/Spouses Aboard.
6. **parch**: Number of Parents/Children Aboard.
7. **ticket**: Ticket Number.
8. **fare**: Passenger Fare.
9. **cabin**: Cabin.
10. **embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).
11. **boat**: Lifeboat (if survived).
12. **body**: Body number (if did not survive and the body was recovered).
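
Several of these columns store codes rather than readable labels. One possible way to make them easier to read during exploration is `Series.map`, sketched below. The sample rows and the `*_label` column names are hypothetical, and the sketch assumes the survival column is named `survived`, as it appears in the summary statistics in Step 1.

```python
import pandas as pd

# Hypothetical sample rows using the codes documented in the list above
sample = pd.DataFrame({
    "survived": [0, 1, 1],
    "embarked": ["S", "C", "Q"],
})

# Code-to-label mappings taken from the column descriptions
survived_labels = {0: "No", 1: "Yes"}
port_labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Series.map replaces each code with its label; codes missing from the mapping become NaN
sample["survived_label"] = sample["survived"].map(survived_labels)
sample["embarked_label"] = sample["embarked"].map(port_labels)

print(sample)
```

Keeping the original coded columns alongside the labels leaves the numeric values available for later modelling.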

This data can be used to train models that answer a variety of questions. For instance: Will a passenger
survive? What factors influence survival? Can passengers be grouped? What was the ticket price? Are there any
anomalies in the data? For this assignment, we concentrate on data exploration and feature engineering for the
following ML problem: **Will a passenger survive?**
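
For this prediction problem, the survival indicator is the target and the remaining columns are candidate features. A minimal sketch of that split follows, assuming the column names `survived`, `boat`, and `body` and a few made-up rows. Dropping `boat` and `body` is a suggestion rather than part of the assignment text: according to the column descriptions, both are recorded only once the outcome is known, so they would likely leak the answer to a model.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "survived": [1, 0, 0],
    "pclass": [1, 3, 2],
    "fare": [211.34, 7.25, 13.0],
    "boat": ["2", None, None],
    "body": [np.nan, 135.0, np.nan],
})

# Columns that are only known after the sinking (assumed leakage candidates)
leaky = ["boat", "body"]

# y is the target; X holds the remaining candidate features
y = toy["survived"]
X = toy.drop(columns=["survived"] + leaky)

print(X.columns.tolist())
print(y.mean())  # share of survivors in the toy sample
```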

In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


### Step 1: Data Loading & Initial Exploration

In [9]:
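# Load the dataset (reading .xls files requires the xlrd package)
df = pd.read_excel('titanic.xls')
print(df.head(5))      # first five rows
print(df.describe())   # summary statistics for the numeric columns
df.info()              # prints dtypes and non-null counts itself; it returns None, so no print() is needed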


            pclass     survived          age        sibsp        parch  \
count  1309.000000  1309.000000  1046.000000  1309.000000  1309.000000   
mean      2.294882     0.381971    29.881135     0.498854     0.385027   
std       0.837836     0.486055    14.413500     1.041658     0.865560   
min       1.000000     0.000000     0.166700     0.000000     0.000000   
25%       2.000000     0.000000    21.000000     0.000000     0.000000   
50%       3.000000     0.000000    28.000000     0.000000     0.000000   
75%       3.000000     1.000000    39.000000     1.000000     0.000000   
max       3.000000     1.000000    80.000000     8.000000     9.000000   

              fare        body  
count  1308.000000  121.000000  
mean     33.295479  160.809917  
std      51.758668   97.696922  
min       0.000000    1.000000  
25%       7.895800   72.000000  
50%      14.454200  155.000000  
75%      31.275000  256.000000  
max     512.329200  328.000000  
<class 'pandas.core.frame.DataFrame'