
#**Titanic Dataset Analysis Summary**

This notebook analyzes the Titanic dataset to predict passenger survival from features such as age, gender, class, and fare. The data comes in two files: train.csv, which includes the `Survived` label, and test.csv, which does not. Models are fitted on the training file. Because the test file carries no labels, model performance has to be estimated on a validation split held out from the training data.

1. Data Loading and Preprocessing:
   - The dataset is loaded using Pandas' read_csv() function.
   - Missing values are handled by imputation. The cells below fill them with 0, although the mean or median is usually a better choice for numerical features such as Age.
   - Categorical variables are encoded using one-hot encoding.

2. Feature Selection and Engineering:
   - Relevant features are selected based on domain knowledge and data exploration.
   - Feature engineering is performed if necessary to create new informative features.

3. Model Training:
   - The dataset is split into training and validation sets using train_test_split().
   - A Random Forest Classifier model is trained using the training data.
   - Hyperparameters such as the number of trees (n_estimators) can be adjusted for optimal performance.

4. Model Evaluation:
   - Predictions are made on the validation set using the trained model.
   - Evaluation metrics such as accuracy and classification report are computed.
   - The classification report provides detailed information on precision, recall, and F1-score for each class.
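
Step 2 leaves feature engineering open, so here is a minimal sketch of what it could look like. The `FamilySize` and `IsAlone` features are assumptions chosen for illustration, not features used in this notebook, and the toy values are made up.

```python
import pandas as pd

# Toy rows using the Titanic column names; the values are made up for illustration.
toy = pd.DataFrame({
    "SibSp": [1, 0, 3],   # siblings/spouses aboard
    "Parch": [0, 0, 2],   # parents/children aboard
})

# Hypothetical engineered features (not part of the original notebook).
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1    # the passenger plus relatives aboard
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)  # 1 if travelling alone, else 0

print(toy)
```

Whether such features help is an empirical question that a validation score would have to answer.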


By following these steps, we aim to build an accurate predictive model to determine passenger survival on the Titanic.

For detailed code implementation and analysis, please refer to the code cells below.


1. Installing necessary modules


In [None]:
!pip install pandas
!pip install numpy



2. Uploading necessary dataset file

In [None]:
from google.colab import files


uploaded = files.upload()


Saving test.csv to test (1).csv
Saving train.csv to train (1).csv


3. Importing the necessary libraries

In [None]:
# Importing necessary libraries
import pandas as pd
import io
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Reading the training and testing datasets
# 'train (1).csv' contains the training data
# 'test (1).csv' contains the testing data
training = pd.read_csv(io.BytesIO(uploaded['train (1).csv']))
testing = pd.read_csv(io.BytesIO(uploaded['test (1).csv']))

# Printing the testing dataset
print(testing)

     PassengerId  Pclass                                          Name  \
0            892       3                              Kelly, Mr. James   
1            893       3              Wilkes, Mrs. James (Ellen Needs)   
2            894       2                     Myles, Mr. Thomas Francis   
3            895       3                              Wirz, Mr. Albert   
4            896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)   
..           ...     ...                                           ...   
413         1305       3                            Spector, Mr. Woolf   
414         1306       1                  Oliva y Ocana, Dona. Fermina   
415         1307       3                  Saether, Mr. Simon Sivertsen   
416         1308       3                           Ware, Mr. Frederick   
417         1309       3                      Peter, Master. Michael J   

        Sex   Age  SibSp  Parch              Ticket      Fare Cabin Embarked  
0      male  34.5      0      0 
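
The loading cell relies on `files.upload()` returning a dictionary that maps each saved filename to the raw bytes of the file. The sketch below mimics that outside Colab with a hand-written stand-in dictionary (`fake_uploaded` and its CSV content are made up for illustration) to show why the bytes are wrapped in `io.BytesIO` before being handed to `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for the dict returned by files.upload(): filename -> raw file bytes.
# The CSV content here is invented; the real files have many more columns and rows.
fake_uploaded = {"train (1).csv": b"PassengerId,Survived,Age\n1,0,22\n2,1,\n"}

# read_csv expects a path or a file-like object, so the bytes go through BytesIO.
df = pd.read_csv(io.BytesIO(fake_uploaded["train (1).csv"]))

print(df.shape)
print(df.isnull().sum())
```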

4. Converting the data to DataFrames

In [None]:
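# read_csv already returns DataFrames; work on copies so the later
# in-place edits leave the originally loaded data untouched
data_train = training.copy()
data_test = testing.copy()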
data_test.head(1)

   PassengerId  Pclass              Name   Sex   Age  SibSp  Parch  Ticket    Fare  Cabin  Embarked
0          892       3  Kelly, Mr. James  male  34.5      0      0  330911  7.8292    NaN         Q


#**Data cleaning**
- Checking whether the datasets contain null values.

In [None]:
#checking for the number of null values in the dataset
data_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
# Fill every null value with 0 and display the null counts again
# Note: 0 is only a crude placeholder, especially for Age and for the text column Embarked
data_train.fillna(0, inplace=True)  # inplace=True modifies data_train itself
data_train.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
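
Filling every gap with 0 treats a missing age as a newborn and a missing port as the number 0. A gentler alternative, sketched here on a toy frame with made-up values, is to impute numeric columns with their mean and categorical columns with their most frequent value. This is one common choice under those assumptions, not the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same kinds of columns as the Titanic data (values invented).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 30.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Numeric columns: replace missing values with the column mean.
for col in ["Age", "Fare"]:
    toy[col] = toy[col].fillna(toy[col].mean())

# Categorical column: replace missing values with the most frequent category.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

When a separate test set is involved, the fill values would normally be computed on the training data and reused for the test data so that both are treated identically.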

In [None]:
# Repeat the null check on the test dataset
data_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [None]:
# Fill every null value with 0 and display the null counts again
# Note: 0 is only a crude placeholder, especially for Age and Fare
data_test.fillna(0, inplace=True)  # inplace=True modifies data_test itself
data_test.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [None]:
# Drop the columns that are not used as features from both datasets
data_train.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
data_test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
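
After this drop, `Sex` and `Embarked` are still text columns, and scikit-learn estimators need numeric input. A minimal sketch of the one-hot encoding mentioned in the summary, using `pd.get_dummies` on a toy frame with made-up rows:

```python
import pandas as pd

# Toy rows with the same kinds of columns that remain after the drop (values invented).
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Each listed text column is replaced by one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each category becomes its own column, which is why a category that appears in only one of the two datasets produces a column the other dataset lacks.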

In [None]:
#checking the sample
data_train.head(1)

   PassengerId  Survived  Pclass   Sex   Age  SibSp  Parch  Fare  Embarked
0            1         0       3  male  22.0      1      0  7.25         S


# Model Training and Testing


- Training the model using a Random Forest Classifier

In [None]:
# Training the model
# Creating training data and labels
# We drop the 'Survived' column to create the features (train_data) and one-hot encode the
# remaining text columns ('Sex', 'Embarked'), because scikit-learn needs numeric input
# The 'Survived' column is used as the target variable (train_prediction)
train_data = pd.get_dummies(data_train.drop('Survived', axis=1))
train_prediction = data_train['Survived']

# Creating and training a Random Forest Classifier model
# We create a Random Forest Classifier model with 100 trees (n_estimators = 100)
# We train the model using the fit() method with the training data (train_data) and corresponding labels (train_prediction)
model = RandomForestClassifier(n_estimators=100)
model.fit(train_data, train_prediction)
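
To judge whether a different `n_estimators` actually helps, the model can be scored with cross-validation instead of being fitted once. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the Titanic features, so the printed numbers say nothing about this notebook's data. The candidate values 10 and 100 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with the encoded Titanic features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

scores = {}
for n_trees in (10, 100):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # 5-fold cross-validation; the default score for a classifier is accuracy.
    scores[n_trees] = cross_val_score(clf, X, y, cv=5).mean()

for n_trees, score in scores.items():
    print(f"n_estimators={n_trees}: mean CV accuracy = {score:.3f}")
```

Fixing `random_state` makes the comparison repeatable; without it, each run of a random forest gives slightly different results.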

In [None]:
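# Predicting on the test dataset
# One-hot encode the test features the same way as the training features. 'Embarked_0' exists
# only in the training data (its two missing 'Embarked' values were filled with 0), so it is
# added here as an all-zero column to give both datasets the same columns
data_test = pd.get_dummies(data_test)
data_test['Embarked_0'] = 0
train_columns_order = train_data.columns

# Reorder the columns in the test dataset to match the column ordering of the train dataset
data_test = data_test[train_columns_order]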
m_pred = model.predict(data_test)
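
The cell above adds `Embarked_0` by hand and then reorders the columns. A more general way to do both at once, sketched here on toy frames with made-up values, is `DataFrame.reindex`, which inserts any missing column filled with a chosen value and orders the result like the training columns. The frame names `train_X` and `test_X` are placeholders.

```python
import pandas as pd

# Toy encoded frames: the test frame lacks 'Embarked_0' and has a different column order.
train_X = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex_male": [1, 0],
    "Embarked_0": [0, 1],
    "Embarked_S": [1, 0],
})
test_X = pd.DataFrame({
    "Sex_male": [0, 1],
    "Pclass": [2, 3],
    "Embarked_S": [1, 1],
})

# Columns missing from test_X are created and filled with 0; extra columns would be dropped.
aligned = test_X.reindex(columns=train_X.columns, fill_value=0)

print(aligned)
```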

In [None]:
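# Accuracy score
# Caution: test.csv has no 'Survived' column, so this compares the predictions for the 418 test
# passengers with the labels of the first 418 training passengers, who are different people.
# The resulting number therefore says nothing about the quality of the model.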
accuracy = accuracy_score( train_prediction[:418], m_pred[:418])
print(accuracy, ' is the accuracy of the model')

0.5  is the accuracy of the model
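
The score above pairs test predictions with training labels, so it cannot measure accuracy. A meaningful estimate needs labels for the rows being predicted, which is what the `train_test_split()` step from the summary provides. The sketch below shows the pattern on synthetic data from `make_classification`, used here only as a stand-in. The 80/20 split and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded training features and the 'Survived' labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out 20% of the labelled rows; stratify keeps the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predictions and labels now refer to the same held-out rows.
val_pred = clf.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

With the Titanic data, `X` and `y` would be the encoded training features and the `Survived` column, and the unlabeled test file would only be used for the final predictions.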


In [None]:
# Training a Logistic Regression model on the same features for comparison
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(max_iter=2000)
logistic_regression.fit(train_data, train_prediction)

In [None]:
y_pred = logistic_regression.predict(data_test)
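# The test set has 418 rows and no labels, so the first 418 training labels are used as the
# reference here. They belong to different passengers, so this is not a real accuracy.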
accuracy1 = accuracy_score( train_prediction[:418], y_pred[:418])
print(accuracy1, ':is the accuracy of the model')

0.48564593301435405 :is the accuracy of the model


In [None]:
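# As above, the reference labels are the first 418 training labels, not true test labels,
# so this report does not measure real model performance
print("\nClassification Report:")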
print(classification_report(train_prediction[:418], y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.58      0.58      0.58       255
           1       0.34      0.34      0.34       163

    accuracy                           0.49       418
   macro avg       0.46      0.46      0.46       418
weighted avg       0.49      0.49      0.49       418

