# Titanic Survival Classification – Data Preparation

1. Setup and Data Load
2. Initial Exploration
3. Preprocessing
4. Feature and Target Selection
5. Save Cleaned Data (Optional)

In [10]:
import pandas as pd

# Load Titanic dataset
file_path = 'data/titanic.csv'  # adjust if needed
titanic_data = pd.read_csv(file_path)

In [11]:
# View basic dataset shape
print("Shape:", titanic_data.shape)

# Preview column names
print("\nColumns:")
print(titanic_data.columns.tolist())

# Show first few rows
print("\nTop rows:")
print(titanic_data.head())

# Summary statistics
print("\nSummary statistics:")
print(titanic_data.describe(include='all'))

# Check for missing values
print("\nMissing values per column:")
print(titanic_data.isnull().sum())

Shape: (891, 12)

Columns:
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Top rows:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2     

Initial inspection shows missing values in several columns (e.g., Age, Cabin). The preprocessing cell that follows drops Cabin, fills missing Age values with the median, and drops rows with a missing Embarked value.
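
To make the handling decision easier, it can help to look at the share of missing values per column rather than raw counts alone. The sketch below is illustrative only: `missing_report` is a hypothetical helper, and the small `sample` frame uses made-up values in place of `titanic_data`.

```python
import pandas as pd

# Tiny hypothetical frame standing in for titanic_data (values are made up).
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and missing fraction per column, worst first."""
    report = pd.DataFrame({
        "missing": df.isnull().sum(),
        "fraction": df.isnull().mean(),
    })
    return report.sort_values("fraction", ascending=False)


report = missing_report(sample)
print(report)
```

A column that is mostly empty is a candidate for dropping, while one with a small share of gaps is usually a candidate for imputation or for dropping the affected rows.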

In [12]:
# Drop columns unlikely to help with classification or too incomplete
titanic_data = titanic_data.drop(columns=['Cabin', 'Ticket', 'Name'])

# Drop rows with missing target ('Survived') or critical features
titanic_data = titanic_data.dropna(subset=['Survived', 'Embarked'])

# Fill missing Age with median
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())

# Encode 'Sex' and 'Embarked'
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data['Embarked'] = titanic_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
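
Two things may be worth checking after the encoding step: `Series.map` turns any label missing from the dictionary into NaN, and integer codes for Embarked imply an order (S < C < Q) that the ports do not have. A minimal sketch of a guard and a one-hot alternative follows, assuming a made-up `sample` frame in place of `titanic_data`; it is not what the cell above does.

```python
import pandas as pd

# Hypothetical rows standing in for titanic_data after dropna (values made up).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Integer mapping as in the cell above; unmapped labels would become NaN.
sex_codes = sample["Sex"].map({"male": 0, "female": 1})
assert sex_codes.notna().all(), "unexpected label in Sex"

# One-hot alternative for Embarked: one indicator column per port,
# which avoids implying an order between the ports.
embarked_dummies = pd.get_dummies(sample["Embarked"], prefix="Embarked", dtype=int)

encoded = pd.concat([sex_codes, embarked_dummies], axis=1)
print(encoded)
```

Tree-based models are usually indifferent to the choice, while linear models tend to behave better with the one-hot form.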

In [13]:
# Define target and features
y = titanic_data['Survived']

feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = titanic_data[feature_names]

print(X.head())
print(y.head())

   Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0       3    0  22.0      1      0   7.2500         0
1       1    1  38.0      1      0  71.2833         1
2       3    1  26.0      0      0   7.9250         0
3       1    1  35.0      1      0  53.1000         0
4       3    0  35.0      0      0   8.0500         0
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
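
Before modeling, a few quick checks on `X` and `y` can catch preparation mistakes early. The sketch below uses a hypothetical `cleaned` frame with made-up values and a shortened feature list; the same checks should apply to the real `X` and `y`.

```python
import pandas as pd

# Hypothetical cleaned frame (values made up) with the same column roles.
cleaned = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
})
feature_names = ["Pclass", "Sex", "Age"]
X = cleaned[feature_names]
y = cleaned["Survived"]

# Features and target must refer to the same rows.
assert X.index.equals(y.index)

# No missing values should remain in the selected columns.
assert not X.isnull().any().any() and not y.isnull().any()

# All features should be numeric after encoding.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in X.dtypes)

# Share of each class in the target, useful for judging class balance.
class_share = y.value_counts(normalize=True)

print(all_numeric)
print(class_share)
```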


In [14]:
# Save cleaned data (still includes PassengerId, which is not in feature_names)
titanic_data.to_csv('data/titanic_cleaned.csv', index=False)
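
A quick round-trip check can confirm that the saved file reloads with the same values and dtypes. This is a minimal sketch that writes a made-up `cleaned` frame to a temporary directory, so it does not touch `data/titanic_cleaned.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame (values made up).
cleaned = pd.DataFrame({
    "Survived": [0, 1],
    "Age": [22.0, 38.0],
    "Embarked": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "titanic_cleaned.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# DataFrame.equals compares values, column labels, and dtypes.
round_trip_ok = reloaded.equals(cleaned)
print(round_trip_ok)
```

Saving without `index=False` would add an extra unnamed column on reload, which this check would flag.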

## Notes – Day 4

- Loaded Titanic dataset and explored structure and missing values.
- Dropped columns that are too incomplete or unlikely to help classification (Name, Ticket, Cabin).
- Filled missing Age values with the median and dropped rows with a missing Embarked value.
- Encoded 'Sex' and 'Embarked' for modeling.
- Selected 7 features to use in classification modeling tomorrow.
- Dataset is clean and ready for training.
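
The steps listed above could be collected into one function so they can be rerun on a fresh copy of the data. The sketch below is one possible arrangement, not the notebook's actual code: `prepare_titanic` is a hypothetical name, and the three-row `raw` frame is made up for illustration.

```python
import pandas as pd


def prepare_titanic(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the Day 4 cleaning steps to a raw Titanic-style frame."""
    # Drop columns that are too incomplete or unlikely to help.
    df = raw.drop(columns=["Cabin", "Ticket", "Name"])
    # Drop rows missing the target or the embarkation port.
    df = df.dropna(subset=["Survived", "Embarked"]).copy()
    # Fill missing ages with the median of the remaining rows.
    df["Age"] = df["Age"].fillna(df["Age"].median())
    # Encode the categorical columns as integers.
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return df


# Hypothetical three-row input (values made up).
raw = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [20.0, None, 40.0],
    "Ticket": ["t1", "t2", "t3"],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
})

cleaned = prepare_titanic(raw)
print(cleaned)
```

Keeping the steps in one function also makes it easier to apply identical preprocessing to any held-out data later.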