# Titanic Notebook
 
This dataset contains information about the 891 passengers on board the Titanic.

- **PassengerId**: ID assigned to the passenger
- **Pclass**: The class of the ticket
- **Name**: The name of the passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger
- **SibSp**: Number of siblings and spouses accompanying the passenger
- **Parch**: Number of parents and children accompanying the passenger
- **Ticket**: Ticket number of passenger
- **Fare**: Fare paid for trip
- **Cabin**: Cabin ID assigned to the passenger
- **Embarked**: Port at which the passenger embarked: Queenstown (Q), Cherbourg (C), or Southampton (S)
- **Survived**: 1 if the passenger survived, 0 if the passenger died

## Cleaning Data

In [57]:
# Import libraries
import pandas as pd
import numpy as np

In [58]:
# Import the data
titanicDf = pd.read_csv('./titanic.csv')

In [59]:
# Drop unnecessary columns
titanicDf.drop(['Name', 'Ticket','PassengerId'], axis=1, inplace=True)

In [60]:
# Preview change
titanicDf.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,,S


In [61]:
# Add indicator column 'Cabin_ind': 0 if 'Cabin' is missing, 1 otherwise
titanicDf['Cabin_ind'] = np.where(titanicDf['Cabin'].isnull(), 0, 1)
titanicDf.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Cabin_ind
0,0,3,male,22.0,1,0,7.25,,S,0
1,1,1,female,38.0,1,0,71.2833,C85,C,1
2,1,3,female,26.0,0,0,7.925,,S,0
3,1,1,female,35.0,1,0,53.1,C123,S,1
4,0,3,male,35.0,0,0,8.05,,S,0
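
The same indicator can also be built without NumPy by casting the boolean mask from `.notna()` to integers. This is a minimal sketch on a toy frame; the `toy` name and its values are invented for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: two missing cabins and one known cabin
toy = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan]})

# Approach used in the notebook: 0 where Cabin is missing, 1 otherwise
via_where = np.where(toy['Cabin'].isnull(), 0, 1)

# Pandas-only alternative: boolean mask (True where present) cast to int
via_astype = toy['Cabin'].notna().astype(int)

print(list(via_where))
print(via_astype.tolist())
```

Both forms should give the same 0/1 column; which one to use is a matter of preference.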


In [62]:
# Convert 'Sex' to 1 (female) or 0 (male)
# .map() replaces each value with its entry in the given dictionary
sexFlag = {
    'male': 0,
    'female': 1
}

titanicDf['Sex'] = titanicDf['Sex'].map(sexFlag)
titanicDf.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Cabin_ind
0,0,3,0,22.0,1,0,7.25,,S,0
1,1,1,1,38.0,1,0,71.2833,C85,C,1
2,1,3,1,26.0,0,0,7.925,,S,0
3,1,1,1,35.0,1,0,53.1,C123,S,1
4,0,3,0,35.0,0,0,8.05,,S,0


In [63]:
# Convert 'Embarked' to 1, 2, 3 instead of S, Q, C
# .map() replaces each value with its entry in the given dictionary
embarkedFlag = {
    'S': 1,
    'Q': 2,
    'C': 3
}

titanicDf['Embarked'] = titanicDf['Embarked'].map(embarkedFlag)
titanicDf.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Cabin_ind
0,0,3,0,22.0,1,0,7.25,,1.0,0
1,1,1,1,38.0,1,0,71.2833,C85,3.0,1
2,1,3,1,26.0,0,0,7.925,,1.0,0
3,1,1,1,35.0,1,0,53.1,C123,1.0,1
4,0,3,0,35.0,0,0,8.05,,1.0,0
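
Note that `Embarked` now displays as `1.0` and `3.0` rather than `1` and `3`. A likely explanation is that `.map()` returns `NaN` for any value with no entry in the dictionary, including values that were already missing, and a column holding `NaN` is stored as `float64`. The sketch below illustrates this on a toy series; `ports`, `embarked_flag`, and the code `'X'` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy series, invented for illustration: one missing port and one unexpected code
ports = pd.Series(['S', 'C', np.nan, 'X'])

embarked_flag = {'S': 1, 'Q': 2, 'C': 3}

# Values with no dictionary entry (NaN and 'X' here) become NaN
mapped = ports.map(embarked_flag)

print(mapped)
print(mapped.dtype)  # float64, because NaN cannot be stored in an int64 column
```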


In [64]:
# Count missing values in the dataframe
titanicDf.isnull().sum()

Survived       0
Pclass         0
Sex            0
Age          177
SibSp          0
Parch          0
Fare           0
Cabin        687
Embarked       2
Cabin_ind      0
dtype: int64

In [65]:
# Show the rows where 'Embarked' is null
titanicDf[titanicDf['Embarked'].isnull()]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Cabin_ind
61,1,1,1,38.0,0,0,80.0,B28,,1
829,1,1,1,62.0,0,0,80.0,B28,,1


In [66]:
# Keep only the rows where 'Embarked' is not missing
# .copy() makes the result an independent frame, so later in-place edits do not warn
titanicDf = titanicDf[titanicDf['Embarked'].notna()].copy()

In [67]:
# Drop 'Cabin'; 'Cabin_ind' already records whether a cabin was listed
titanicDf.drop(['Cabin'], axis=1, inplace=True)
titanicDf.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Cabin_ind
0,0,3,0,22.0,1,0,7.25,1.0,0
1,1,1,1,38.0,1,0,71.2833,3.0,1
2,1,3,1,26.0,0,0,7.925,1.0,0
3,1,1,1,35.0,1,0,53.1,1.0,1
4,0,3,0,35.0,0,0,8.05,1.0,0


In [68]:
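# Fill missing values in 'Age' with the mean age
# Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
titanicDf['Age'] = titanicDf['Age'].fillna(titanicDf['Age'].mean())
titanicDf.isnull().sum()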

Survived     0
Pclass       0
Sex          0
Age          0
SibSp        0
Parch        0
Fare         0
Embarked     0
Cabin_ind    0
dtype: int64
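
To make the imputation step concrete, here is a small sketch of mean imputation on a toy column. The `ages` frame and its values are invented for illustration. It relies on `Series.mean()` skipping `NaN` by default, and it assigns the filled column back instead of modifying a column selection in place.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one missing age
ages = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})

# mean() ignores NaN by default, so this is (20 + 40) / 2
mean_age = ages['Age'].mean()

# Fill the gap and assign the result back to the column
ages['Age'] = ages['Age'].fillna(mean_age)

print(mean_age)
print(ages['Age'].tolist())
```

Mean imputation keeps every row but pulls the filled values toward the center of the distribution, so the median or a group-wise mean may be worth comparing on the real data.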

In [69]:
titanicDf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Survived   889 non-null    int64  
 1   Pclass     889 non-null    int64  
 2   Sex        889 non-null    int64  
 3   Age        889 non-null    float64
 4   SibSp      889 non-null    int64  
 5   Parch      889 non-null    int64  
 6   Fare       889 non-null    float64
 7   Embarked   889 non-null    float64
 8   Cabin_ind  889 non-null    int64  
dtypes: float64(3), int64(6)
memory usage: 69.5 KB
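
`Embarked` is still `float64` in the summary above, even though it no longer has missing values. If integer codes are preferred, one option is to cast the column once the missing rows are gone. This is a sketch on a toy series, not a step from the original notebook; the `embarked` name and its values are invented for illustration.

```python
import pandas as pd

# Toy series, invented for illustration: float codes with no missing values
embarked = pd.Series([1.0, 3.0, 1.0], name='Embarked')

# Safe only when the column has no NaN; astype raises an error otherwise
embarked_int = embarked.astype('int64')

print(embarked_int.dtype)
print(embarked_int.tolist())
```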


In [70]:
# Export new file from updated data frame
titanicDf.to_csv('./titanic_cleaned.csv', index=False)