<center> <img src="JHU.png" width = 200/> </center>


## Autograder Assignment: Titanic Survival Prediction


### Overview:
In this assignment, you will develop a classification pipeline to predict whether a Titanic passenger survived. You will preprocess the data, train a machine learning model, and generate predictions for the test set. Finally, you will save your predictions to a CSV file.
### Learning Objectives:

- Load and preprocess data, including handling missing values and encoding categorical features.
- Build a model using RandomForestClassifier to predict whether a Titanic passenger survived.
- Export your predictions to a CSV file.

### Data Dictionary

| **Column**      | **Description**                                                                                          |
|-----------------|----------------------------------------------------------------------------------------------------------|
| `PassengerId`   | Unique identifier for each passenger                                                                      |
| `Survived`      | Survival status (0 = No, 1 = Yes)                                                                         |
| `Pclass`        | Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)                                                                  |
| `Name`          | Passenger's name, including title (e.g., Mr., Mrs., Miss.)                                                |
| `Sex`           | Gender of the passenger                                                                                   |
| `Age`           | Age of the passenger (in years)                                                                           |
| `SibSp`         | Number of siblings or spouses aboard the Titanic                                                          |
| `Parch`         | Number of parents or children aboard the Titanic                                                          |
| `Ticket`        | Ticket number                                                                                             |
| `Fare`          | Passenger fare                                                                                            |
| `Cabin`         | Cabin number (if available)                                                                               |
| `Embarked`      | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)                                       |




### Problem 1: Load and Preprocess the Dataset
- Use the `train.csv` and `test.csv` datasets downloaded from Kaggle.
- Load the datasets into your development environment.
- Drop the columns that are not used for modeling (`Cabin`, `Ticket`, and `Name`) from both the training and test datasets.




In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

In [2]:
def dropcolumns(df_train,df_test):
    ## Drop the unnecessary columns Cabin, Ticket, and Name from both the training and test data (in place)
    df_train.drop(["Cabin", "Ticket", "Name"], axis=1, inplace=True)
    df_test.drop(["Cabin", "Ticket", "Name"], axis=1, inplace=True)
    
    return (df_train.shape,df_test.shape)

In [3]:
## Note: This is a read-only cell and cannot be edited.
## Read Training data
df_train = pd.read_csv('train.csv')
print(df_train.shape)
## Read Test data
df_test = pd.read_csv('test.csv')
df_test_org = df_test.copy()  # Keep an original copy for saving predictions later
df_trainshape,df_testshape=dropcolumns(df_train,df_test)
print("Train data shape after dropping the columns=",df_trainshape)
print("Test data shape after dropping the columns=",df_testshape)




(891, 12)
Train data shape after dropping the columns= (891, 9)
Test data shape after dropping the columns= (418, 8)


In [4]:
df_train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Fare', 'Embarked'],
      dtype='object')

In [5]:
df_test.columns

Index(['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')
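
The read-only cell above copies `df_test` into `df_test_org` before calling `dropcolumns`, because `drop(..., inplace=True)` modifies the caller's DataFrame. The following is a minimal sketch of that behavior on a made-up two-row frame; the values are illustrative and are not taken from the Kaggle files.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["A", "B"],
    "Ticket": ["T1", "T2"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.28],
})

toy_org = toy.copy()  # independent copy taken before the in-place drop
toy.drop(["Cabin", "Ticket", "Name"], axis=1, inplace=True)

print(list(toy.columns))      # columns remaining after the drop
print(list(toy_org.columns))  # the copy still has all five columns
```

Without the copy, the `PassengerId` column would still be available here, but any dropped column needed later would be gone.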

### Problem 2: Identify missing values in the datasets and return the column names that contain missing values, along with their respective counts




In [6]:
def missing_values(df):
    """
    Identify missing values in the dataframe and return a Series containing
    column names with missing values and their respective counts.
    """
    # Calculate missing values count per column
    missing = df.isnull().sum()
    
    # Filter out only columns with missing values
    missing = missing[missing > 0]
    
    # Ensure the output is sorted by index (column names) to match test expectations
    missing = missing.sort_index()
    
    return missing




In [7]:
## Note: This is a read-only cell and cannot be edited.
missing_train = missing_values(df_train)
missing_test = missing_values(df_test)
print("missing_train",missing_train)
print("missing_test",missing_test)


missing_train Age         177
Embarked      2
dtype: int64
missing_test Age     86
Fare     1
dtype: int64
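
Raw counts are easier to interpret next to the share of rows they represent. The sketch below extends the same `isnull().sum()` approach with a percentage column; `missing_summary` is a hypothetical helper, not part of the graded solution, and the toy frame uses made-up values.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_index()  # keep only columns with gaps
    return pd.DataFrame({
        "count": counts,
        "percent": (counts / len(df) * 100).round(1),
    })

# Made-up data: Age has two gaps, Fare has one, Pclass has none.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 3, 3],
})
summary = missing_summary(toy)
print(summary)
```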


### Problem 3: Impute missing values and apply One-Hot Encoding to categorical columns


In [8]:
# Impute missing values, then one-hot encode the categorical columns.

def preprocessdata(df_train,df_test):
    ### Fill missing values in the Age column with the median.
    ### Fill missing values in the Embarked column with the mode.
    ### Fill missing values in the Fare column with the median.
    ### Apply one-hot encoding to the Sex and Embarked columns.
    # Fill missing values; each frame is filled with its own statistics.
    # Assign the result back rather than calling fillna(inplace=True) on
    # df['col'], which is chained assignment and warns in recent pandas 2.x.
    for df in [df_train, df_test]:
        df['Age'] = df['Age'].fillna(df['Age'].median())
        df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
        df['Fare'] = df['Fare'].fillna(df['Fare'].median())

    # Apply One-Hot Encoding to categorical columns
    df_train = pd.get_dummies(df_train, columns=['Sex', 'Embarked'], drop_first=True)
    df_test = pd.get_dummies(df_test, columns=['Sex', 'Embarked'], drop_first=True)

    return df_train, df_test

In [9]:
## Note: This is a read-only cell and cannot be edited.
df_train,df_test=preprocessdata(df_train,df_test)
print("Train data shape after dropping the columns=",df_trainshape)
print("Test data shape after dropping the columns=",df_testshape)


Train data shape after dropping the columns= (891, 9)
Test data shape after dropping the columns= (418, 8)


In [10]:
df_train

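|     | PassengerId | Survived | Pclass | Age  | SibSp | Parch | Fare    | Sex_male | Embarked_Q | Embarked_S |
|-----|-------------|----------|--------|------|-------|-------|---------|----------|------------|------------|
| 0   | 1           | 0        | 3      | 22.0 | 1     | 0     | 7.2500  | 1        | 0          | 1          |
| 1   | 2           | 1        | 1      | 38.0 | 1     | 0     | 71.2833 | 0        | 0          | 0          |
| 2   | 3           | 1        | 3      | 26.0 | 0     | 0     | 7.9250  | 0        | 0          | 1          |
| 3   | 4           | 1        | 1      | 35.0 | 1     | 0     | 53.1000 | 0        | 0          | 1          |
| 4   | 5           | 0        | 3      | 35.0 | 0     | 0     | 8.0500  | 1        | 0          | 1          |
| ... | ...         | ...      | ...    | ...  | ...   | ...   | ...     | ...      | ...        | ...        |
| 886 | 887         | 0        | 2      | 27.0 | 0     | 0     | 13.0000 | 1        | 0          | 1          |
| 887 | 888         | 1        | 1      | 19.0 | 0     | 0     | 30.0000 | 0        | 0          | 1          |
| 888 | 889         | 0        | 3      | 28.0 | 1     | 2     | 23.4500 | 0        | 0          | 1          |
| 889 | 890         | 1        | 1      | 26.0 | 0     | 0     | 30.0000 | 1        | 0          | 0          |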


In [11]:
df_train.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
Sex_male       0
Embarked_Q     0
Embarked_S     0
dtype: int64

In [12]:
df_test

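|     | PassengerId | Pclass | Age  | SibSp | Parch | Fare     | Sex_male | Embarked_Q | Embarked_S |
|-----|-------------|--------|------|-------|-------|----------|----------|------------|------------|
| 0   | 892         | 3      | 34.5 | 0     | 0     | 7.8292   | 1        | 1          | 0          |
| 1   | 893         | 3      | 47.0 | 1     | 0     | 7.0000   | 0        | 0          | 1          |
| 2   | 894         | 2      | 62.0 | 0     | 0     | 9.6875   | 1        | 1          | 0          |
| 3   | 895         | 3      | 27.0 | 0     | 0     | 8.6625   | 1        | 0          | 1          |
| 4   | 896         | 3      | 22.0 | 1     | 1     | 12.2875  | 0        | 0          | 1          |
| ... | ...         | ...    | ...  | ...   | ...   | ...      | ...      | ...        | ...        |
| 413 | 1305        | 3      | 27.0 | 0     | 0     | 8.0500   | 1        | 0          | 1          |
| 414 | 1306        | 1      | 39.0 | 0     | 0     | 108.9000 | 0        | 0          | 0          |
| 415 | 1307        | 3      | 38.5 | 0     | 0     | 7.2500   | 1        | 0          | 1          |
| 416 | 1308        | 3      | 27.0 | 0     | 0     | 8.0500   | 1        | 0          | 1          |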


In [13]:
df_test.isnull().sum()

PassengerId    0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
Sex_male       0
Embarked_Q     0
Embarked_S     0
dtype: int64
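
In this dataset both files contain every `Sex` and `Embarked` level, so encoding the two frames separately yields matching dummy columns, as the outputs above show. On other data a level may be absent from one frame, and `drop_first=True` can then drop a different level in each. The sketch below shows one possible safeguard, assuming the training levels are taken as the reference: declare the column as a `Categorical` with fixed categories before encoding. The toy frames use made-up values.

```python
import pandas as pd

# Made-up frames: the test frame contains only port S.
train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [8.66, 12.29]})

# Fix the category levels from the training data so both frames encode alike.
levels = sorted(train["Embarked"].dropna().unique())
for df in (train, test):
    df["Embarked"] = pd.Categorical(df["Embarked"], categories=levels)

# get_dummies emits a column per declared category, even if unobserved,
# and drop_first removes the same first level ("C") in both frames.
train_enc = pd.get_dummies(train, columns=["Embarked"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["Embarked"], drop_first=True)

print(list(train_enc.columns))
print(list(test_enc.columns))
```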

### Problem 4: Use a machine learning classifier such as RandomForestClassifier to predict whether a Titanic passenger survived

  * Use the following parameters in the RandomForestClassifier: `n_estimators=500, n_jobs=-1, random_state=42`

In [14]:
def build_model(df_train,df_test):
    ## Drop the PassengerId and Survived columns from the training set and assign the result to X_train.
    X_train = df_train.drop(['PassengerId', 'Survived'], axis=1)
    ## Use the target column Survived from the dataframe and assign it to variable y_train.
    y_train = df_train["Survived"]
    ## Drop the PassengerId from the test set and assign it to a variable X_test.
    X_test = df_test.drop(["PassengerId"], axis=1)
    ## Model training: create the Random Forest classifier object.
    model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
    ## Fit the Data
    model.fit(X_train, y_train)
    ## Make Predictions on the test set.
    y_pred = model.predict(X_test)
    
    
    return y_pred


In [15]:
## Note: This is a read-only cell and cannot be edited.
y_pred=build_model(df_train,df_test)
print(y_pred[0:5])

[0 0 0 1 0]
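
The Kaggle test file has no `Survived` column, so the predictions above cannot be scored locally. One common way to estimate accuracy is to hold out part of the labeled training data. The sketch below illustrates that idea with `train_test_split` and `accuracy_score` on a synthetic stand-in for the preprocessed frame; the data, the labeling rule, and the smaller `n_estimators=50` (chosen only to keep the example fast) are all assumptions and are not part of the graded solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training features (values are made up).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Age": rng.uniform(1, 80, n),
    "Sex_male": rng.integers(0, 2, n),
})
# Made-up labeling rule so the target is learnable: label 1 when Sex_male is 0.
y = (X["Sex_male"] == 0).astype(int)

# Hold out 20% of the labeled rows, preserving the class balance.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)

val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_acc:.3f}")
```

With the real data, `X` and `y` would correspond to `X_train` and `y_train` inside `build_model`.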


### Problem 5: Export your predictions to a csv file `predictions.csv`


In [16]:
df_test_org.head()

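|   | PassengerId | Pclass | Name                                         | Sex    | Age  | SibSp | Parch | Ticket  | Fare    | Cabin | Embarked |
|---|-------------|--------|----------------------------------------------|--------|------|-------|-------|---------|---------|-------|----------|
| 0 | 892         | 3      | Kelly, Mr. James                             | male   | 34.5 | 0     | 0     | 330911  | 7.8292  | NaN   | Q        |
| 1 | 893         | 3      | Wilkes, Mrs. James (Ellen Needs)             | female | 47.0 | 1     | 0     | 363272  | 7.0     | NaN   | S        |
| 2 | 894         | 2      | Myles, Mr. Thomas Francis                    | male   | 62.0 | 0     | 0     | 240276  | 9.6875  | NaN   | Q        |
| 3 | 895         | 3      | Wirz, Mr. Albert                             | male   | 27.0 | 0     | 0     | 315154  | 8.6625  | NaN   | S        |
| 4 | 896         | 3      | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1     | 1     | 3101298 | 12.2875 | NaN   | S        |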


In [17]:
# Function to save predictions
def save_preds(_fn, _y_pred, _df):
    import csv
    # newline='' lets the csv writer control line endings on every platform
    with open(_fn, 'w', newline='') as fout:
        writer = csv.writer(fout, delimiter=',', lineterminator='\n')
        writer.writerow(['PassengerId', 'Survived'])
        for yid, ypred in zip(_df['PassengerId'], _y_pred):
            writer.writerow([yid, ypred])
            
## Call the function to save your predictions to a CSV file named predictions.csv. Use df_test_org as your dataframe.

save_preds("predictions.csv", y_pred, df_test_org)

In [18]:
## Note: This is a read-only cell and cannot be edited.
print("The csv file is successfully created")

The csv file is successfully created
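
Before submitting, it can help to read the file back and confirm its layout. The sketch below writes a small file with pandas `to_csv` (an alternative to the `csv.writer` approach above, not the graded method) and checks the header and values; the IDs and predictions are made-up stand-ins for `df_test_org['PassengerId']` and `y_pred`, and the file goes to a temporary directory so it does not overwrite `predictions.csv`.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-ins for df_test_org['PassengerId'] and y_pred.
ids = [892, 893, 894]
preds = [0, 1, 0]

# Write to a temporary directory so the real predictions.csv is untouched.
path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
pd.DataFrame({"PassengerId": ids, "Survived": preds}).to_csv(path, index=False)

# Read the file back and check the header and the allowed label values.
check = pd.read_csv(path)
assert list(check.columns) == ["PassengerId", "Survived"]
assert set(check["Survived"]).issubset({0, 1})
print(check.shape)
```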


Congratulations on completing this autograded assignment!