# Titanic: Data Analysis & Machine Learning (Google Colab)

**Objective:** Build a simple machine learning pipeline to predict passenger survival using the Titanic dataset. This notebook is ready to open in Google Colab.

Sections:
1. Setup
2. Load data (seaborn)
3. Exploratory Data Analysis (EDA)
4. Preprocessing
5. Model training (Random Forest)
6. Evaluation
7. Feature importance
8. Conclusion & how to run in Google Colab

To run it in Colab, use `File > Upload notebook`, or save it to Google Drive or GitHub and use `File > Open notebook`.

In [None]:
# Setup (Colab-ready)
# If running in Colab and packages are missing, uncomment the following lines to install them.
# !pip install seaborn scikit-learn pandas matplotlib

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

sns.set_theme()  # sns.set() is an older alias for the same call

print('Libraries imported successfully')

In [None]:
# Load the Titanic dataset from seaborn
titanic = sns.load_dataset('titanic')
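display(titanic.head())  # only a cell's last expression is shown automatically, so display() is explicit

# Quick info: column dtypes and non-null counts
titanic.info()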

## Exploratory Data Analysis (EDA)
The plots below show the overall survival count, survival by passenger class, and the age distribution of survivors versus non-survivors.

In [None]:
# EDA plots
plt.figure(figsize=(8,5))
sns.countplot(x='survived', data=titanic)
plt.title('Survival Count')
plt.show()

plt.figure(figsize=(8,5))
sns.countplot(x='class', hue='survived', data=titanic)
plt.title('Survival by Class')
plt.show()

plt.figure(figsize=(8,5))
sns.boxplot(x='survived', y='age', data=titanic)
plt.title('Age vs Survival')
plt.show()
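
The count plots compare raw totals, so a numeric survival rate per group can make the same comparison easier to read. The following is a minimal sketch, assuming the seaborn column names used above (`survived` plus a grouping column). The helper name `survival_rate_by` is made up for illustration, and the small frame is hand-made so the block runs on its own; it is not real Titanic data.

```python
import pandas as pd


def survival_rate_by(data: pd.DataFrame, column: str) -> pd.Series:
    """Return the mean of the 0/1 `survived` column per group (the survival rate)."""
    return (
        data.groupby(column, observed=True)["survived"]
        .mean()
        .sort_values(ascending=False)
    )


# Hand-made illustrative rows, NOT taken from the Titanic dataset.
demo = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "survived": [1, 1, 0, 1, 0],
})

rates = survival_rate_by(demo, "sex")
print(rates)
```

In the notebook itself, the same helper could be called as `survival_rate_by(titanic, 'class')` or `survival_rate_by(titanic, 'sex')`.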

## Preprocessing
We'll select a subset of features, handle missing values, encode categoricals, and prepare data for modeling.
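
Before imputing, it can help to see which columns actually contain gaps and how large they are. The sketch below is one possible way to do that; `missing_summary` is a hypothetical helper name, and the demo frame is hand-made so the block is self-contained.

```python
import numpy as np
import pandas as pd


def missing_summary(data: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain missing values, with counts and fractions."""
    counts = data.isna().sum()
    summary = pd.DataFrame({"missing": counts, "fraction": counts / len(data)})
    # Keep only columns with at least one gap, largest gap first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hand-made illustrative rows, NOT taken from the Titanic dataset.
demo = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.1],
    "embarked": ["S", "C", None, "S"],
})

summary = missing_summary(demo)
print(summary)
```

Applied to the notebook's data, `missing_summary(titanic[features])` would show which of the selected features need imputation.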

In [None]:
# Select features and preprocess
df = titanic.copy()
features = ['pclass','sex','age','sibsp','parch','fare','embarked']
df = df[features + ['survived']]

# Handle missing values: simple imputation
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# One-hot encode categorical features
df = pd.get_dummies(df, columns=['sex','embarked'], drop_first=True)
df.head()

In [None]:
# Train/Test split and Random Forest training
X = df.drop('survived', axis=1)
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
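
An accuracy figure is easier to judge next to a trivial baseline. The sketch below is an addition, not part of the workflow above: it compares a Random Forest with a majority-class `DummyClassifier` on a stratified split. It uses synthetic data from `make_classification` so that it runs on its own, and `compare_to_baseline` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_to_baseline(X, y, random_state=42):
    """Return test accuracy for a majority-class baseline and a Random Forest."""
    # stratify=y keeps the class ratio the same in the train and test parts.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
    forest = RandomForestClassifier(
        n_estimators=100, random_state=random_state
    ).fit(X_tr, y_tr)
    return {
        "baseline": accuracy_score(y_te, baseline.predict(X_te)),
        "random_forest": accuracy_score(y_te, forest.predict(X_te)),
    }


# Synthetic, mildly imbalanced data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, weights=[0.6, 0.4], random_state=0
)
scores = compare_to_baseline(X_demo, y_demo)
print(scores)
```

With the notebook's variables, the call would be `compare_to_baseline(X, y)`. A model that barely beats the baseline is learning little beyond the class ratio.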

In [None]:
# Detailed evaluation
print('\nClassification Report:\n')
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
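
The four cells of a binary confusion matrix can also be unpacked by name, which makes precision and recall straightforward to compute by hand. A minimal sketch with hand-made labels (not model output from this notebook):

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn orders the flattened matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the predicted survivors, how many survived
recall = tp / (tp + fn)     # of the actual survivors, how many were found
print(f"TN={tn} FP={fp} FN={fn} TP={tp} precision={precision:.2f} recall={recall:.2f}")
```

The same unpacking works on the notebook's matrix: `tn, fp, fn, tp = cm.ravel()`.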

## Feature importance
Check which features the Random Forest relied on most. `feature_importances_` holds impurity-based importances, normalized to sum to 1.

In [None]:
importances = model.feature_importances_
feat_importance = pd.Series(importances, index=X.columns).sort_values(ascending=False)
print(feat_importance)  # a bare expression in the middle of a cell is not displayed

plt.figure(figsize=(8,4))
feat_importance.plot(kind='bar')
plt.title('Feature Importances')
plt.show()
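
Impurity-based importances are computed from the training data and can favor features with many distinct values, so permutation importance on held-out data is a common cross-check. The sketch below is an addition to the notebook. It uses synthetic data in which only the first two columns (`f0`, `f1`) are informative; the column names are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative features in the first two columns.
X_arr, y_arr = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.25, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set in turn and record the drop in score.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm_importance = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_importance)
```

With the notebook's objects, the equivalent call would be `permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)`.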

## Conclusion & How to run in Google Colab
- This notebook uses seaborn's built-in Titanic dataset and a Random Forest classifier to predict survival.

To open in Google Colab:
1. Save this notebook to your computer as an `.ipynb` file.
2. Go to **https://colab.research.google.com/**
3. Click **File > Upload notebook** and select the `.ipynb` file.

Optional improvements:
- Use cross-validation and hyperparameter search (GridSearchCV / RandomizedSearchCV).
- Try other models (Logistic Regression, XGBoost).
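- Add pipelines (`sklearn.pipeline`), plus scaling for models that need it, such as Logistic Regression.
- Use more feature engineering, for example a family size feature (`sibsp + parch + 1`). Title extraction from names needs a version of the dataset that includes a name column, which seaborn's copy does not.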
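
The first and third suggestions can be combined in one sketch: a pipeline that imputes and encodes inside each cross-validation fold, wrapped in `GridSearchCV`. This differs from the preprocessing cell above, which imputes on the full dataset before splitting. The parameter grid, the `build_search` helper name, and the synthetic demo frame are assumptions made for illustration. Scaling is left out because tree-based models are insensitive to monotonic rescaling of features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

NUMERIC = ["pclass", "age", "sibsp", "parch", "fare"]
CATEGORICAL = ["sex", "embarked"]


def build_search(cv=5):
    """Build a grid search over a preprocessing + Random Forest pipeline."""
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(random_state=42)),
    ])
    # Small illustrative grid; the values are not tuned recommendations.
    param_grid = {
        "model__n_estimators": [50, 100],
        "model__max_depth": [None, 5],
    }
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")


# Synthetic stand-in with the same column names, NOT real Titanic data.
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "pclass": rng.integers(1, 4, n),
    "sex": rng.choice(["male", "female"], n),
    "age": rng.uniform(1, 70, n),
    "sibsp": rng.integers(0, 3, n),
    "parch": rng.integers(0, 3, n),
    "fare": rng.uniform(5, 100, n),
    "embarked": rng.choice(["S", "C", "Q"], n),
})
demo.loc[::10, "age"] = np.nan                            # simulate missing ages
demo["survived"] = (demo["sex"] == "female").astype(int)  # toy target

search = build_search(cv=3).fit(demo[NUMERIC + CATEGORICAL], demo["survived"])
print(search.best_params_, round(search.best_score_, 3))
```

On the notebook's data, the search could be fitted on the raw columns, for example `build_search().fit(titanic[NUMERIC + CATEGORICAL], titanic['survived'])`, since imputation and encoding happen inside the pipeline.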