In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

In [None]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

In [None]:
#Check the column names for the testing data set
test_data.columns

In [None]:
#Check the column names for the training data set
train_data.columns

In [None]:
#Check info of the training data set
train_data.info()

In [None]:
#Check info of the test data set
test_data.info()

The training dataset contains 12 columns and 891 rows. There are 7 numeric columns (5 integer and 2 float) and 5 categorical (object) columns. The test dataset contains 11 columns (no Survived column) and 418 rows. There are 6 numeric columns (4 integer and 2 float) and 5 categorical columns.
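The counts above are read off the `info()` output by hand. As a minimal sketch, the same tally can be computed directly; the helper name `summarize_dtypes` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_dtypes(df: pd.DataFrame) -> dict:
    """Return row/column counts and how many columns fall into each dtype family."""
    n_numeric = df.select_dtypes(include="number").shape[1]
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "integer": df.select_dtypes(include=np.integer).shape[1],
        "float": df.select_dtypes(include=np.floating).shape[1],
        "non_numeric": df.shape[1] - n_numeric,
    }

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 38.0, None],
    "Sex": ["male", "female", "female"],
})
print(summarize_dtypes(toy))
```

In the notebook this would be called as `summarize_dtypes(train_data)` and `summarize_dtypes(test_data)`.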

## Running Descriptive Statistics and Correlations 

In [None]:
#Descriptive Statistics on all variables for training data set
train_data.describe().style.background_gradient(cmap='Oranges')

In [None]:
#Descriptive Statistics on all variables for testing data set
test_data.describe().style.background_gradient(cmap='Oranges')

The descriptive statistics of the training and testing data sets look fairly similar column by column. Nothing stands out as a large difference, so there should be no issue using the training dataset to predict the test dataset.

For Pclass, the mean of approximately 2.3 in both datasets indicates that most passengers travelled in 2nd or 3rd class, with 1st class a minority. About half of the passengers with a recorded age were between 20 and approximately 39. For siblings and spouses (SibSp), most people had neither aboard. The same holds for parents and children (Parch): most people had neither aboard and were traveling alone. As for the Fare, the mean was 33.92 and the median was 14.45, with a highest ticket price of 512.33 and a lowest of 0.00. The mean and median prices should fall in line with the pricing for 3rd and 2nd class tickets, which make up the majority of the Titanic passengers.

In [None]:
#Looking at the training data categorical variables
train_data.describe(include='object')

In [None]:
#Looking at the testing data categorical variables
test_data.describe(include='object')

In the Sex column of both the training and testing datasets, males make up approximately 64.1% of the passengers. In the Embarked column, approximately 68.5% of passengers across both datasets embarked from Southampton. In both datasets, about 81.6% of ticket numbers are unique.
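The percentages above come from dividing the `freq` row of `describe(include='object')` by the `count` row. A small sketch of that calculation follows; `top_category_share` is a hypothetical helper and the toy Series is illustrative only.

```python
import pandas as pd

def top_category_share(series: pd.Series) -> tuple:
    """Return the most frequent value and its share (in percent) of non-missing entries."""
    shares = series.value_counts(normalize=True)  # NaN is excluded by default
    return shares.index[0], round(shares.iloc[0] * 100, 1)

# Toy column standing in for train_data['Sex'] (illustrative values only)
toy_sex = pd.Series(["male", "male", "female", "male", None])
print(top_category_share(toy_sex))
```

Applied to the notebook's data, this would be `top_category_share(train_data['Sex'])` or `top_category_share(train_data['Embarked'])`.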

In [None]:
#Creating a function to count and calculate percentage of missing data in datasets
#Calculating the output of the counts and percentage of missing data for training datasets
def missing_value(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, largest first."""
    num_missing = df.isna().sum()
    total_record = df.shape[0]
    perc_missing = round((num_missing / total_record) * 100, 2)
    summary = pd.DataFrame(data={'columns_name': num_missing.index,
                                 'num_missing': num_missing.values,
                                 'perc_missing': perc_missing.values})

    return summary.sort_values(by='perc_missing', ascending=False)
missing_value(train_data)

In [None]:
#Counting and calculating percentage of missing data in the testing data
missing_value(test_data)

Within both the training and testing datasets, the Cabin column has the highest percentage of missing data, with 77-78% missing. Age has the second highest, with about 20% missing in both datasets. Beyond that, the training set is missing 2 Embarked values and the test set is missing 1 Fare value. All other columns have no missing data. Thus, Cabin will probably not be a useful feature because so much of it is missing, while Age can still be used, either by eliminating the rows with missing values or by filling them in.
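The two options mentioned for Age, dropping incomplete rows or filling them in, can be sketched as follows. The toy frame is illustrative only, and the median is just one possible fill value; neither is taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, np.nan, 13.00],
})

# Option 1: drop rows where Age is missing (their other columns are lost too)
dropped = toy.dropna(subset=["Age"])

# Option 2: fill missing Age with the median of the observed ages
filled = toy.assign(Age=toy["Age"].fillna(toy["Age"].median()))

print(len(dropped), filled["Age"].tolist())
```

Dropping rows keeps only observed values but shrinks the data set, while filling keeps every row at the cost of adding values that were never recorded.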

Let's produce some correlation matrices on the numeric variables and the categorical variables to see which of them are possibly affecting survival on the Titanic to examine further and concentrate on.

In [None]:
#Create a training data set where the categorical variables are converted into numeric variables
train_data_num = train_data.copy()

train_data_num[['Sex']] = train_data_num[['Sex']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)
train_data_num[['Ticket']] = train_data_num[['Ticket']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)
train_data_num[['Cabin']] = train_data_num[['Cabin']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)
train_data_num[['Embarked']] = train_data_num[['Embarked']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)

#Drop the PassengerId and Name columns because they aren't needed
cols_to_drop = ['PassengerId', 'Name']
train_data_num.drop(cols_to_drop, axis=1, inplace=True)

# printing Dataframe
train_data_num

In [None]:
train_data_num.describe().style.background_gradient(cmap='Oranges')

In [None]:
#Correlation matrix for training numeric variables

corr = train_data_num.corr()

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
corr[mask] = np.nan
(corr
 .style
 .background_gradient(cmap='coolwarm', axis=None, vmin=-1, vmax=1)
 .highlight_null(color='#f1f1f1')  # Color NaNs grey
 .format(precision=3))

Looking at this correlation matrix, there are moderately strong correlations between the Survived variable and Pclass (passenger class), Sex (male vs. female), and Fare (ticket price, which is likely correlated with Pclass). There is a weaker correlation between Survived and Ticket (likely related to Pclass and Fare) and Embarked (port of embarkation). All other variables (Age, SibSp, Parch, Cabin) have a weak or no correlation with Survived. Keep in mind that Sex, Ticket, Cabin, and Embarked were converted to arbitrary category codes, so the sign and size of their coefficients depend on that coding.

***Note that Age and Cabin were missing about 20% and 77% of their data in the training set when this correlation matrix was computed.
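Because `DataFrame.corr()` excludes missing values pair by pair, each coefficient in the matrix can rest on a different number of rows. The sketch below counts the rows available for each pair; `pairwise_counts` is a hypothetical helper and the toy frame is illustrative only.

```python
import numpy as np
import pandas as pd

def pairwise_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Number of rows where both columns of each pair are non-missing."""
    present = df.notna().astype(int)
    return present.T @ present

# Toy frame standing in for train_data_num (illustrative values only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, 1.0, np.nan, np.nan, 0.0],
})
counts = pairwise_counts(toy)
print(counts)
```

Running `pairwise_counts(train_data_num)` would show how few rows support the Cabin coefficients compared with the others.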

Let's run a Principal Component Analysis to see how much variance is explained by these variables.

## Running Principal Component Analysis 

In [None]:
#Prepping the data for PCA and separating out the features and the target (Survived)
from sklearn.preprocessing import StandardScaler

train_data_num['Age'] = train_data_num['Age'].fillna((train_data_num['Age'].mean()))
#Embarked holds category codes, so fill with the most frequent code rather than a fractional mean
train_data_num['Embarked'] = train_data_num['Embarked'].fillna(train_data_num['Embarked'].mode()[0])
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked']

cols_to_drop = ['Cabin']
train_data_num.drop(cols_to_drop, axis=1, inplace=True)

# Separating out the features
x = train_data_num.loc[:, features].values

# Separating out the target
y = train_data_num.loc[:,['Survived']].values

# Standardizing the features
x = StandardScaler().fit_transform(x)

In [None]:
#Gather the principal components for the analysis
from sklearn.decomposition import PCA

pca = PCA(n_components=8)

principalComponents = pca.fit_transform(x)

principalDf = pd.DataFrame(data = principalComponents,
                           columns = [f'principal component {i}' for i in range(1, 9)])

In [None]:
#Get the final dataframe for the PCA with Survived variable
finalDf = pd.concat([principalDf, train_data_num[['Survived']]], axis = 1)

finalDf.head()

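Below is the share of variance explained by each principal component. PC 1 explains the most variance in the standardized features, PC 2 the second most, PC 3 the third most, and so on. Note that the PCA is fitted on the features only, so these ratios describe variance in the features rather than in Survived.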

In [None]:
pca.explained_variance_ratio_

Looking at 8 of the variables (excluding Cabin, which is missing about 77% of its values), 5 principal components explain more than 80% of the variance in the training features. This suggests some redundancy among the variables, so when modeling the data it will more than likely not be necessary to include all of them. This will be taken into consideration as the data is further prepared and features are selected for modeling.
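The "number of components needed to reach 80%" can be read from the cumulative sum of `explained_variance_ratio_`. A minimal sketch follows; the helper name `components_needed` is an assumption, and the ratios are made-up placeholders rather than the notebook's actual output.

```python
import numpy as np

def components_needed(explained_variance_ratio, threshold=0.80):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first position where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1), cumulative

# Hypothetical ratios for illustration only, not the notebook's actual output
ratios = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.06, 0.04])
n_components, cumulative = components_needed(ratios)
print(n_components, cumulative.round(2))
```

In the notebook, the call would be `components_needed(pca.explained_variance_ratio_)`.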

We want to ensure to not overfit any model and to not include unnecessary variables.

In [None]:
#Due to so much missing data and weak correlation, excluding Cabin from the training data (it is dropped from the test data later)
#Drop the Name and PassengerId columns because they aren't needed
cols_to_drop = ['Cabin', 'Name', 'PassengerId']
train_data.drop(cols_to_drop, axis=1, inplace=True)

## Visualizing the Different Variables of the Training Data Set 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize']=(6,6)

ax = sns.countplot(data=train_data, x='Survived', hue='Survived', dodge=False)

for c in ax.containers:
    
    # custom label calculates percent and add an empty string so 0 value bars don't have a number
    labels = [f'{h/train_data.Survived.count()*100:0.1f}%' if (h := v.get_height()) > 0 else '' for v in c]
    
    ax.bar_label(c, labels=labels, label_type='edge')
    
plt.xlabel('Not Survived vs Survived')
plt.suptitle('Counts and Rates of Titanic Survivors')
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

#Survival rate by sex (left panel)
sns.barplot(x='Sex', y='Survived', data=train_data, palette='plasma', ax=ax1)
ax1.set_title("Rate of Titanic Survivors by their Sex")

#Survival rate by passenger class (right panel)
sns.barplot(x='Pclass', y='Survived', data=train_data, palette='plasma', ax=ax2)
ax2.set_xlabel('Titanic Passenger Class')
ax2.set_title("Rate of Titanic Survivors by their Class")

plt.show()

Looking at the graphs and comparing them to the correlation matrix, women were much more likely to survive than men on the Titanic. When it comes to class, 1st class passengers had a higher likelihood of surviving than 2nd class passengers and a much higher likelihood than 3rd class passengers, which makes sense since 3rd class cabins were located in the lower portions of the ship.
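The bar heights in these plots are group means of the 0/1 Survived column, so the same rates can be tabulated with a `groupby`. The sketch below uses a made-up toy frame for illustration only, not the real passenger data.

```python
import pandas as pd

# Toy frame standing in for train_data (illustrative values only)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Survived": [0, 1, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the survival rate within each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Adding the group size shows how many passengers each rate is based on
rate_by_class = toy.groupby("Pclass")["Survived"].agg(["mean", "count"])

print(rate_by_sex)
print(rate_by_class)
```

Reporting the group size next to each rate helps judge how much weight a given bar deserves.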

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

#Survival rate by port of embarkation (left panel)
sns.barplot(x='Embarked', y='Survived', data=train_data, palette='plasma', ax=ax1)
ax1.set_xlabel('Port of Embarkation')
ax1.set_title("Rate of Titanic Survivors by Port of Embarkation")

#Age distribution by survival (right panel)
sns.boxplot(y='Age', x='Survived', data=train_data, ax=ax2)
ax2.set_title("Titanic Survivors by their Age")

plt.show()

Looking at the port of embarkation, those who embarked from Cherbourg had the highest survival rate and those from Southampton had the lowest. This could be related to class, ticket, or fare, but that is speculation.

Looking at age, there is almost no discernible difference in age between survivors and non-survivors. Those who did not survive appear to be slightly older than those who did, but the difference is small and the distributions overlap heavily. This falls in line with the lack of correlation in the matrix.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

#Survival rate by number of siblings/spouses aboard (left panel)
sns.barplot(x='SibSp', y='Survived', data=train_data, palette='plasma', ax=ax1)
ax1.set_xlabel('Sibling/Spouse')
ax1.set_title("Rate of Titanic Survivors with Siblings and/or Spouses")

#Survival rate by number of parents/children aboard (right panel)
sns.barplot(x='Parch', y='Survived', data=train_data, palette='plasma', ax=ax2)
ax2.set_xlabel('Parent/Child')
ax2.set_title("Rate of Titanic Survivors with Parents and/or Children")

plt.show()

Looking at siblings/spouses, there is a difference in survival rate between passengers with no siblings/spouses and those with 1 sibling/spouse. Beyond that, the error bars overlap so much that there is no discernible difference. It is hard to identify a pattern or confirm any significance, which falls in line with the correlation matrix.

Looking at parents/children, there is a difference in survival rate between passengers with 0 and those with 1-2 parents/children. Beyond that, the error bars again overlap so much that there is no discernible difference, and it is hard to identify a pattern or confirm any significance. This also falls in line with the correlation matrix.

In [None]:
plt.rcParams['figure.figsize']=(6,6)  
sns.boxplot(y='Fare',x='Survived',data = train_data)
plt.suptitle('Ticket Fare of Titanic Passengers by Survival')

plt.show()

While there is a lot of overlap among the outlier data points, the min, max, median, 1st quartile, and 3rd quartile show that those who paid a lower ticket fare had a lower likelihood of surviving. Among those who didn't survive, the middle 50% of fares fell between 7.86 and 26.00, with a median of 10.80. Among those who did survive, the middle 50% of fares fell between 12.50 and 57.00, with a median of 26.00. This graph shows a difference in ticket fare by survivorship, which is consistent with the correlation matrix.
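The quartiles quoted above can be computed rather than read off the boxplot. A minimal sketch follows, with a made-up toy frame in place of the real data.

```python
import pandas as pd

# Toy frame standing in for train_data (illustrative values only)
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Fare": [7.0, 8.0, 10.0, 26.0, 12.0, 26.0, 30.0, 80.0],
})

# One row per Survived value, one column per quantile
fare_quartiles = (
    toy.groupby("Survived")["Fare"]
    .quantile([0.25, 0.50, 0.75])
    .unstack()
)
print(fare_quartiles)
```

The 0.25 and 0.75 columns bound the middle 50% of fares for each group, and the 0.50 column is the median.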

## Preparing the Data for the XGBoost Model 

In [None]:
#There are 3 main data sets to work with for modeling. The first is the training data as it currently is
#Split data into features (X) and target variable (y)

train_data_num1 = train_data.copy()
test_data_num1 = test_data.copy()

cols_to_drop = ['Name', 'Cabin', 'PassengerId']
test_data_num1.drop(cols_to_drop, axis=1, inplace=True)

train_data_num1[['Sex']] = train_data_num1[['Sex']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)
test_data_num1[['Sex']] = test_data_num1[['Sex']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)

#Encode Ticket against the categories of both sets combined so that a given
#ticket string maps to the same code in the training and test data
ticket_categories = pd.Categorical(pd.concat([train_data_num1['Ticket'], test_data_num1['Ticket']])).categories
train_data_num1['Ticket'] = pd.Categorical(train_data_num1['Ticket'], categories=ticket_categories).codes
test_data_num1['Ticket'] = pd.Categorical(test_data_num1['Ticket'], categories=ticket_categories).codes

X_main = train_data_num1.drop('Survived', axis=1)
y_main = train_data_num1['Survived']

In [None]:
#Train and test data excluding Age, SibSp, and Parch (sub1)
cols_to_drop = ['Age', 'SibSp', 'Parch']
train_sub1 = train_data_num1.drop(cols_to_drop, axis=1)

test_data_num2 = test_data_num1.copy()
cols_to_drop = ['Age', 'SibSp', 'Parch']
test_data_num2 = test_data_num2.drop(cols_to_drop, axis=1)

#Split data into features (X) and target variable (y)
X_sub1 = train_sub1.drop('Survived', axis=1)
y_sub1 = train_sub1['Survived']

In [None]:
#Train and test data excluding Age (sub2)
cols_to_drop = ['Age']
train_sub2 = train_data_num1.drop(cols_to_drop, axis=1)

test_data_num3 = test_data_num1.copy()
cols_to_drop = ['Age']
test_data_num3 = test_data_num3.drop(cols_to_drop, axis=1)

#Split data into features (X) and target variable (y)
X_sub2 = train_sub2.drop('Survived', axis=1)
y_sub2 = train_sub2['Survived']

In [None]:
#Split the main features and target into training (80%) and validation (20%) sets
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_main, y_main, test_size=0.2, random_state=42)

In [None]:
#Split the sub1 features and target into training (80%) and validation (20%) sets
from sklearn.model_selection import train_test_split
X_train_sub1, X_val_sub1, y_train_sub1, y_val_sub1 = train_test_split(X_sub1, y_sub1, test_size=0.2, random_state=42)

In [None]:
#Split the sub2 features and target into training (80%) and validation (20%) sets
from sklearn.model_selection import train_test_split
X_train_sub2, X_val_sub2, y_train_sub2, y_val_sub2 = train_test_split(X_sub2, y_sub2, test_size=0.2, random_state=42)

## Running the XGBoost Model with All Variables (sans Cabin feature) 

In [None]:
#Encode categorical variables - training main data
from sklearn.preprocessing import OneHotEncoder

# Identify categorical columns
categorical_cols = ['Embarked']

# Initialize OneHotEncoder
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)

# Fit encoder on training data
encoder.fit(X_train[categorical_cols])

# Transform categorical columns
encoded_cols_train = pd.DataFrame(encoder.transform(X_train[categorical_cols]))
encoded_cols_val = pd.DataFrame(encoder.transform(X_val[categorical_cols]))
encoded_cols_test = pd.DataFrame(encoder.transform(test_data_num1[categorical_cols]))

# Reindexing encoded columns to match original indices
encoded_cols_train.index = X_train.index
encoded_cols_val.index = X_val.index
encoded_cols_test.index = test_data_num1.index

# Drop original categorical columns and concatenate encoded ones
X_train_encoded = pd.concat([X_train.drop(categorical_cols, axis=1), encoded_cols_train], axis=1)
X_val_encoded = pd.concat([X_val.drop(categorical_cols, axis=1), encoded_cols_val], axis=1)
test_data_encoded = pd.concat([test_data_num1.drop(categorical_cols, axis=1), encoded_cols_test], axis=1)

In [None]:
#XGBoost Model Training (Using this model because it can handle missing data) - Using the largest model with main training data
import xgboost as xgb

model = xgb.XGBClassifier()
model.fit(X_train_encoded, y_train)

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_val_encoded)
accuracy = accuracy_score(y_val, y_pred)

print("Validation Accuracy:", accuracy)

The XGBoost model using all the Titanic variables besides Cabin has a validation accuracy of 82.1%, which is decent. I want to re-run the model excluding Age, SibSp, and Parch to see if the accuracy improves, as the correlation matrix and the PCA variance output suggest it might.

Below are the predictions for the first XGBoost model and the possible submission output.

In [None]:
#XGBoost Model Predictions on Test Data from Main Training Data
test_predictions1 = model.predict(test_data_encoded)
#print(test_predictions1)

In [None]:
#Create Possible Submission File
submission1 = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': test_predictions1})

## Rerunning the Model without the Weakly Correlated Variables (Age, SibSp, Parch)

Preparing the data for the second XGBoost model as we did for the first model, using the second training data set that was prepared (training sub1).
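Since the same encoding steps are repeated for each feature subset, they could be wrapped in one helper. The sketch below is an alternative that uses `pd.get_dummies` with categories fixed from the training column instead of the notebook's `OneHotEncoder`; the function name `one_hot_like_train` is hypothetical. One difference to be aware of: missing or unseen values become all-zero rows here rather than their own category.

```python
import pandas as pd

def one_hot_like_train(train_col: pd.Series, *others: pd.Series, prefix: str = "Embarked"):
    """One-hot encode several Series using only the categories seen in train_col."""
    categories = sorted(train_col.dropna().unique())
    encoded = []
    for col in (train_col, *others):
        # Fixing the categories keeps the output columns identical across data sets
        fixed = pd.Series(pd.Categorical(col, categories=categories), index=col.index)
        encoded.append(pd.get_dummies(fixed, prefix=prefix, drop_first=True, dtype=float))
    return encoded

# Toy columns standing in for the Embarked feature (illustrative values only)
toy_train = pd.Series(["S", "C", "Q", "S"], index=[10, 11, 12, 13])
toy_val = pd.Series(["Q", None, "S"], index=[20, 21, 22])
enc_train, enc_val = one_hot_like_train(toy_train, toy_val)
print(enc_val)
```

Each returned frame keeps the row index of its input, so it can be concatenated with the remaining features using `pd.concat(..., axis=1)`.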

In [None]:
#Encode categorical variables - training sub1 data
from sklearn.preprocessing import OneHotEncoder

#Identify categorical columns
categorical_cols1 = ['Embarked']

#Initialize OneHotEncoder
#sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)

#Fit encoder on training sub1 data
encoder.fit(X_train_sub1[categorical_cols1])

#Transform categorical columns for the training sub1 data
encoded_cols_trains1 = pd.DataFrame(encoder.transform(X_train_sub1[categorical_cols1]))
encoded_cols_vals1 = pd.DataFrame(encoder.transform(X_val_sub1[categorical_cols1]))
encoded_cols_tests1 = pd.DataFrame(encoder.transform(test_data_num2[categorical_cols1]))

#Reindexing encoded columns to match original indices
encoded_cols_trains1.index = X_train_sub1.index
encoded_cols_vals1.index = X_val_sub1.index
encoded_cols_tests1.index = test_data_num2.index

#Drop original categorical columns and concatenate encoded ones
X_train_encodeds1 = pd.concat([X_train_sub1.drop(categorical_cols1, axis=1), encoded_cols_trains1], axis=1)
X_val_encodeds1 = pd.concat([X_val_sub1.drop(categorical_cols1, axis=1), encoded_cols_vals1], axis=1)
test_data_encodeds1 = pd.concat([test_data_num2.drop(categorical_cols1, axis=1), encoded_cols_tests1], axis=1)

In [None]:
#XGBoost Model Training Part 2 - Using the model with sub1 training data
import xgboost as xgb

model = xgb.XGBClassifier()
model.fit(X_train_encodeds1, y_train_sub1)

from sklearn.metrics import accuracy_score

y_pred2 = model.predict(X_val_encodeds1)
accuracy2 = accuracy_score(y_val_sub1, y_pred2)

print("Validation Accuracy:", accuracy2)

The 2nd XGBoost model, with Sex, Pclass, Fare, Ticket, and Embarked, has a higher validation accuracy (83.8%) than the first model, which includes all variables except Cabin. So far, this is the strongest model, and it falls in line with what was shown in the correlation matrix and the PCA variance output.

Let's re-run the model one last time, including SibSp and Parch but excluding Age. This is to exclude all variables that are missing a significant amount of data to see if there is any improvement or change in the model.

Below are the output and possible submission file of the second XGBoost model.

In [None]:
#XGBoost Model Predictions on Test Data from Sub1 Training Data
test_predictions2 = model.predict(test_data_encodeds1)
#print(test_predictions2)

In [None]:
#Create Possible Submission File
submission2 = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': test_predictions2})

## Rerunning the Model with Only Age Removed Now (Checking for Improvement) 

Preparing the 3rd training data set for the XGBoost model in the same way as the main and sub1 training data sets (using training sub2).

In [None]:
#Encode categorical variables - training sub2 data
from sklearn.preprocessing import OneHotEncoder

#Identify categorical columns
categorical_cols2 = ['Embarked']

#Initialize OneHotEncoder (sparse_output replaces the older `sparse` argument, scikit-learn >= 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)

#Fit encoder on training sub2 data
encoder.fit(X_train_sub2[categorical_cols2])

#Transform categorical columns for the training sub2 data
encoded_cols_trains2 = pd.DataFrame(encoder.transform(X_train_sub2[categorical_cols2]))
encoded_cols_vals2 = pd.DataFrame(encoder.transform(X_val_sub2[categorical_cols2]))
encoded_cols_tests2 = pd.DataFrame(encoder.transform(test_data_num3[categorical_cols2]))

#Reindexing encoded columns to match original indices
encoded_cols_trains2.index = X_train_sub2.index
encoded_cols_vals2.index = X_val_sub2.index
encoded_cols_tests2.index = test_data_num3.index

#Drop original categorical columns and concatenate encoded ones
X_train_encodeds2 = pd.concat([X_train_sub2.drop(categorical_cols2, axis=1), encoded_cols_trains2], axis=1)
X_val_encodeds2 = pd.concat([X_val_sub2.drop(categorical_cols2, axis=1), encoded_cols_vals2], axis=1)
test_data_encodeds2 = pd.concat([test_data_num3.drop(categorical_cols2, axis=1), encoded_cols_tests2], axis=1)

In [None]:
#XGBoost Model Training Part 3 - Using the model with sub2 training data
import xgboost as xgb

model = xgb.XGBClassifier()
model.fit(X_train_encodeds2, y_train_sub2)

from sklearn.metrics import accuracy_score

y_pred3 = model.predict(X_val_encodeds2)
accuracy3 = accuracy_score(y_val_sub2, y_pred3)

print("Validation Accuracy:", accuracy3)

With only Age and Cabin removed (the variables missing a considerable amount of data), the model accuracy is 83.8%, the same as the XGBoost model with Age, Cabin, SibSp, and Parch removed.

The model used for submission will therefore be the one using Pclass, Sex, Fare, Embarked, and Ticket (2nd model), because it is just as accurate and simpler, which lowers the risk of overfitting. An accuracy of 83.8% is decent given the variables. I did not fill in the missing data because I did not want to misrepresent the data or overfit the model.
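To put the gap between 82.1% and 83.8% in context, it helps to convert the accuracies back into counts on the validation split. The sketch below assumes the 891 training rows and the 20% split used above; the helper names are hypothetical, and the standard error uses a simple binomial approximation.

```python
import math

n_rows = 891                     # training rows reported earlier
n_val = math.ceil(0.2 * n_rows)  # train_test_split rounds the test share up

def correct_count(accuracy: float, n: int) -> int:
    """Number of correct predictions implied by an accuracy on n rows."""
    return round(accuracy * n)

def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimated from n rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

gap = correct_count(0.838, n_val) - correct_count(0.821, n_val)
se = standard_error(0.838, n_val)
print(n_val, gap, round(se, 3))
```

On a validation set of this size, the difference between the models amounts to a handful of passengers and is of similar magnitude to the standard error, so the ranking may shift with a different `random_state`. Cross-validation would give a steadier comparison.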

In [None]:
#XGBoost Model Predictions on Test Data from Sub2 Training Data
test_predictions3 = model.predict(test_data_encodeds2)
#print(test_predictions3)

In [None]:
#Create Possible Submission File
submission3 = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': test_predictions3})

## Final Submission Selection for Competition 

In [None]:
submission2.head(15)

In [None]:
#Final Submission File for Competition
submission2.to_csv('submission.csv', index=False)