# Data Overviews
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

## Target
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.

The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.


## Data Description

### Variable Definitions

| Variable   | Definition                          | Key                                 |
|------------|-------------------------------------|-------------------------------------|
| survival   | Survival                            | 0 = No, 1 = Yes                     |
| pclass     | Ticket class                        | 1 = 1st, 2 = 2nd, 3 = 3rd           |
| sex        | Sex                                 |                                     |
| Age        | Age in years                        |                                     |
| sibsp      | # of siblings / spouses aboard the Titanic |                             |
| parch      | # of parents / children aboard the Titanic |                             |
| ticket     | Ticket number                       |                                     |
| fare       | Passenger fare                      |                                     |
| cabin      | Cabin number                        |                                     |
| embarked   | Port of Embarkation                 | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

- **pclass:** A proxy for socio-economic status (SES)
  - 1st = Upper
  - 2nd = Middle
  - 3rd = Lower
- **age:** Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
- **sibsp:** The dataset defines family relations in this way:
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import pandas, numpy, matplotlib and seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data display customization
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [None]:
# import the csv files 
gs = pd.read_csv("/kaggle/input/titanic/gender_submission.csv")
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test= pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
gs.head()

In [None]:
train_df.head()

In [None]:
test.head()

In [None]:
print("The number of rows and columns of gs dataset:", gs.shape)
print("The number of rows and columns of train dataset:", train_df.shape)
print("The number of rows and columns of test dataset:", test.shape)

In [None]:
gs.info()

In [None]:
train_df.info()

In [None]:
test.info()

In [None]:
train_df.describe()

In [None]:
test.describe()

In [None]:
train_df.isnull().mean() * 100  # percentage of null values per column

In [None]:
test.isnull().mean() * 100  # percentage of null values per column

In [None]:
# Fill null values in 'Age' and 'Fare' with the median values of their respective columns
# (plain assignment avoids the chained-assignment pitfalls of inplace=True on a column)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
train_df['Fare'] = train_df['Fare'].fillna(train_df['Fare'].median())

test['Age'] = test['Age'].fillna(test['Age'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
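
The cell above fills each dataset with its own medians. A variant, sketched here under the assumption that one prefers not to use any statistics from the test set, learns the medians on the training rows only and reuses them for both frames. The `toy_train` and `toy_test` frames and their values are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for train.csv and test.csv (values are made up)
toy_train = pd.DataFrame({"Age": [22.0, None, 30.0, 40.0],
                          "Fare": [7.25, 8.0, None, 50.0]})
toy_test = pd.DataFrame({"Age": [None, 60.0], "Fare": [10.0, None]})

# Learn the medians from the training rows only
medians = toy_train[["Age", "Fare"]].median()

# Apply the same fill values to both frames
toy_train_filled = toy_train.fillna(medians)
toy_test_filled = toy_test.fillna(medians)

print(medians.to_dict())
print(toy_test_filled)
```

Passing a Series to `fillna` on a DataFrame fills each column with the value whose label matches the column name.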

In [None]:
# Drop the columns that are not used as features
train_df = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

In [None]:
# Drop the same columns from the test set, but keep 'PassengerId' for the submission file
test = test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

### Univariate Analysis

In [None]:
train_df['Sex'].value_counts() # Value Counts of Sex

In [None]:
# Count the occurrences of each value in the 'Sex' column
sex_counts = train_df['Sex'].value_counts()

# Create the pie plot
plt.pie(sex_counts, labels=sex_counts.index, autopct='%1.1f%%', startangle=140, wedgeprops=dict(width=0.3))

# Add a circle at the center to create a doughnut shape
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Add title
plt.title('Distribution of Passenger Sex')

# Display the plot
plt.show()

In [None]:
train_df['Pclass'].value_counts() # value counts of Pclass

In [None]:

# Count the occurrences of each value in the 'Pclass' column
pclass_counts = train_df['Pclass'].value_counts()

# Create the pie plot
plt.pie(pclass_counts, labels=pclass_counts.index, autopct='%1.1f%%', startangle=140, wedgeprops=dict(width=0.3))

# Add a circle at the center to create a doughnut shape
centre_circle = plt.Circle((0,0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Add title
plt.title('Distribution of Passenger Classes')

# Display the plot
plt.show()

In [None]:
train_df['Survived'].value_counts() # Value counts of Survived

In [None]:
# Count the occurrences of each value in the 'Survived' column
survived_counts = train_df['Survived'].value_counts()

# Create the bar plot
survived_counts.plot(kind='bar', color=['blue', 'orange'])
plt.title('Survival Count')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.xticks([0, 1], ['Not Survived', 'Survived'], rotation=0)
plt.show()

In [None]:
train_df['Embarked'].value_counts() # value counts of Embarked

In [None]:

# Create the count plot for the 'Embarked' column

sns.countplot(data=train_df, x='Embarked', palette='viridis')

# Add title and labels
plt.title('Count of Passengers by Embarkation Point')
plt.xlabel('Embarkation Point')
plt.ylabel('Count')

# Display the plot
plt.show()


In [None]:
train_df['SibSp'].value_counts() # value counts of SibSp

In [None]:
train_df['Parch'].value_counts() # value counts of Parch

In [None]:
# let's join the two columns and place the value in new column
# Create the new column 'total_member'
train_df['total_member'] = train_df['SibSp'] + train_df['Parch']

In [None]:
# Create the 'family_size' column based on the 'total_member' column
train_df['family_size'] = train_df['total_member'].apply(
    lambda x: 'single person' if x == 0 else ('medium family' if 1 <= x <= 4 else 'large family')
)
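
The same three-way binning can also be written with `pd.cut`, which keeps the bin edges in one place. This is a minimal sketch on a made-up `total_member` Series, not a change to the notebook's own column.

```python
import pandas as pd

# Made-up relative counts standing in for SibSp + Parch
total_member = pd.Series([0, 1, 4, 5, 10])

# Bins are right-inclusive: (-1, 0], (0, 4], (4, inf]
family_size = pd.cut(
    total_member,
    bins=[-1, 0, 4, float("inf")],
    labels=["single person", "medium family", "large family"],
)
print(family_size.tolist())
```

The result is a categorical column, so `value_counts()` and plots follow the label order given in `labels`.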

In [None]:
train_df['family_size'].value_counts() # value counts of family size

In [None]:
# Count the occurrences of each family size category
family_size_counts = train_df['family_size'].value_counts()

# Plot the family size distribution
family_size_counts.plot(kind='bar', color=['blue', 'green', 'red'])
plt.title('Family Size Distribution')
plt.xlabel('Family Size')
plt.ylabel('Number of Passengers')
plt.xticks(rotation=0)
plt.show()

In [None]:

# Set the style of the visualization
sns.set(style="whitegrid")

# Create a histogram for the 'Age' column
sns.histplot(train_df['Age'], bins=10, kde=True)
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:

# Set the style of the visualization
sns.set(style="whitegrid")

# Create a histogram for the 'Fare' column
sns.histplot(train_df['Fare'], bins=10, kde=True)
plt.title('Histogram of Fare')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()

### Multivariate Analysis

In [None]:
# Create a count plot for the 'Survived' and 'Sex' columns
sns.countplot(data=train_df, x='Survived', hue='Sex')
plt.title('Count Plot of Survival by Sex')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

In [None]:
# Convert the 'Pclass' column to string type
train_df['pclass'] = train_df['Pclass'].astype(str)

# Create a count plot for the 'Survived' and 'Pclass' columns
sns.countplot(data=train_df, x='Survived', hue='pclass')
plt.title('Count Plot of Survival by Pclass')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

In [None]:
# Create a cross-tabulation table to count the number of passengers who survived and didn't survive for each embarked port
survived_counts = pd.crosstab(train_df['Embarked'], train_df['Survived'])

# Print the count of passengers who didn't survive and survived for each embarked port
print("Count of passengers who didn't survive and survived for each embarked port:")
print(survived_counts)


In [None]:
# Create a count plot for the 'Survived' and 'Embarked' columns
sns.countplot(data=train_df, x='Survived', hue='Embarked')
plt.title('Count Plot of Survival by Embarked')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

In [None]:
# Create a cross-tabulation table to count the number of passengers who survived and didn't survive for each family size
survived_counts = pd.crosstab(train_df['family_size'], train_df['Survived'])

# Print the count of passengers who didn't survive and survived for each family size
print("Count of passengers who didn't survive and survived for family_size feature:")
print(survived_counts)


In [None]:
# Create a count plot for the 'Survived' and 'family_size' columns
sns.countplot(data=train_df, x='Survived', hue='family_size')
plt.title('Count Plot of Survival by family_size')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
# Create a histogram with a KDE overlay for Age, separated by Survived
sns.histplot(data=train_df, x='Age', hue='Survived', kde=True, element='step')
plt.title('Distribution of Age by Survival Status')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()


In [None]:
plt.figure(figsize = (10,6))
# Create a histogram with a KDE overlay for Fare, separated by Survived
sns.histplot(data=train_df, x='Fare', hue='Survived', kde=True, element='step')

plt.title('Distribution of Fare by Survival Status')
plt.xlabel('Fare')
plt.ylabel('Count')
plt.show()

In [None]:
train_df.head(3)

In [None]:
# Convert 'Sex' column to binary values: 'male' to 1 and 'female' to 0
train_df['Sex'] = train_df['Sex'].map({'male': 1, 'female': 0})
test['Sex'] = test['Sex'].map({'male': 1, 'female': 0})

In [None]:
train_df.head(3)

In [None]:
test.head(3)

In [None]:

# Create dummy variables for the 'Embarked' column with the prefix 'Embarked'
train_df = pd.get_dummies(train_df, columns=['Embarked'], prefix=['Embarked'])

# Display the DataFrame to confirm the changes
train_df.head()

In [None]:
# Fetch the boolean dummy columns and convert them to integers
bool_columns = train_df.select_dtypes(include='bool').columns
train_df[bool_columns] = train_df[bool_columns].astype('int')
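
Because `get_dummies` only creates columns for the categories it actually sees, the training and test frames can end up with different dummy columns. A minimal sketch of one way to guard against that, using made-up `toy_train` and `toy_test` frames, is to reindex the test dummies to the training columns:

```python
import pandas as pd

toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["S", "C"]})  # no "Q" rows here

# dtype=int yields 0/1 integers directly, so no bool-to-int cast is needed
train_dummies = pd.get_dummies(toy_train, columns=["Embarked"], dtype=int)
test_dummies = pd.get_dummies(toy_test, columns=["Embarked"], dtype=int)

# Reindex the test frame to the training columns; missing dummies become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```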

In [None]:
# Drop the helper columns that were created for the analysis
train_df = train_df.drop(['total_member', 'family_size'], axis=1)

In [None]:
train_df = train_df.drop(['pclass'],axis =1)

In [None]:
# Let's look at the correlations
# Calculate correlations
correlation_matrix = train_df.corr()
correlation_matrix

In [None]:
# Set a threshold for correlation
threshold = 0.8

# Calculate correlations
correlation_matrix = train_df.corr()

# Find column pairs with correlation above the threshold
high_correlation_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            high_correlation_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j], correlation_matrix.iloc[i, j]))

print("Column pairs with correlation above the threshold:")
for pair in high_correlation_pairs:
    print(f"{pair[0]} - {pair[1]} : {pair[2]}")
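
The nested loop above can also be expressed with a mask over the upper triangle of the correlation matrix. The helper name `high_corr_pairs` and the `toy` frame are hypothetical; this is a sketch of the same idea, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs.abs() > threshold]

# Made-up data: 'b' is almost a multiple of 'a', 'c' is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
})
pairs_found = high_corr_pairs(toy)
print(pairs_found)
```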


In [None]:
# Put the feature variables in X_train
X_train = train_df.drop(['Survived'], axis=1)
X_train.head()

In [None]:
# Create the target series from the 'Survived' column
y_train = train_df['Survived']
y_train.head()

In [None]:
# Check the survival rate (in percent)
survival_rate = train_df['Survived'].mean() * 100
survival_rate

The survival rate is about 38%, which means that most passengers in the training set did not survive.

In [None]:
# import the StandardScaler
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
X_train[['Age','Fare']] = scaler.fit_transform(X_train[['Age','Fare']])
X_train.head()

## Model Building

In [None]:
import statsmodels.api as sm

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Initialize the logistic regression model
logreg = LogisticRegression()

# Initialize RFE with the logistic regression estimator and the number of features to select
# (15 exceeds the number of columns in X_train, so RFE keeps every feature here)
rfe = RFE(estimator=logreg, n_features_to_select=15)

# Fit RFE to the training data
rfe.fit(X_train, y_train)
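
RFE only removes features when `n_features_to_select` is smaller than the number of input columns. A minimal sketch on synthetic data (generated with `make_classification`, so unrelated to the Titanic columns) shows what the fitted selector exposes in that case:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which 3 are informative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3,
    n_redundant=0, random_state=0,
)

# Ask RFE to keep 3 of the 6 features
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_toy, y_toy)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 for kept features, larger values were dropped earlier
```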


In [None]:
rfe.support_

In [None]:
#list of RFE supported columns
col = X_train.columns[rfe.support_]
col

In [None]:
# Building Model 1
X_train_sm = sm.add_constant(X_train[col])
logm1 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm1.fit()
res.summary()

In [None]:
# Drop the column with a high p-value
col = col.drop('Embarked_C')

In [None]:
# Building Model 2
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Drop the column with a high p-value
col = col.drop('Embarked_Q')

In [None]:
# Building Model 3
X_train_sm = sm.add_constant(X_train[col])
logm3 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm3.fit()
res.summary()

In [None]:
# Drop the column with a high p-value
col = col.drop('Parch')

In [None]:
# Building Model 4
X_train_sm = sm.add_constant(X_train[col])
logm4 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm4.fit()
res.summary()

In [None]:
# Drop the column with a high p-value
col = col.drop('Fare')

In [None]:
# Building Model 5
X_train_sm = sm.add_constant(X_train[col])
logm5 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm5.fit()
res.summary()

In [None]:
# Import the variance inflation factor from statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# create a dataframe where we see the all features and their vif
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

The VIFs are under control, as all values are below 5.
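
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the remaining ones. The sketch below computes it by hand with NumPy and always includes an intercept in the auxiliary regression; the function name `vif_with_intercept` and the random data are assumptions for illustration, and the values it returns need not match those from a design matrix without a constant column.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column, regressing each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Made-up data: x3 is nearly a copy of x1, x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

In this toy setup the two collinear columns get large VIFs while the independent one stays close to 1.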

In [None]:
# Let's check the predicted value on the train dataset
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
# Let's compare the original and predicted (on the train dataset) target values
y_train_pred_final = pd.DataFrame({'Survived':y_train.values, 'Survived_prob':y_train_pred})
# Note: the original 'PassengerId' column was dropped earlier, so this holds the row index of train_df
y_train_pred_final['PassengerId'] = y_train.index
y_train_pred_final.head()

In [None]:
# Create a new column 'Predicted' with 1 if Survived_prob > 0.5 else 0
y_train_pred_final['Predicted'] = y_train_pred_final.Survived_prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head(10)

In [None]:
from sklearn import metrics  # import the metrics from sklearn

# Check the confusion matrix
confusion = metrics.confusion_matrix(y_train_pred_final.Survived, y_train_pred_final.Predicted)
print(confusion)

In [None]:
# Let's check the overall accuracy.
print('Accuracy:',(metrics.accuracy_score(y_train_pred_final.Survived, y_train_pred_final.Predicted)))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
print('Sensitivity:',(TP / float(TP+FN)))

In [None]:
# Let us calculate specificity
print('Specificity:',(TN / float(TN+FP)))

In [None]:
# Calculate the false positive rate - predicting survival when the passenger did not survive
print('False positive rate:',(FP/ float(TN+FP)))

In [None]:
# positive predictive value 
print('Positive predictive rate:', (TP / float(TP+FP)))

In [None]:
# Negative predictive value
print('Negative predictive rate:', (TN / float(TN+ FN)))
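
The rates computed in the cells above all come from the same four counts, so they can be bundled into one helper. This is a sketch with a hypothetical function name, `rates_from_confusion`, and a made-up confusion matrix; it assumes the scikit-learn layout where rows are actual classes and columns are predicted classes.

```python
import numpy as np

def rates_from_confusion(cm):
    """Derive common rates from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate (recall)
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (tn + fp),  # 1 - specificity
        "positive_predictive_value": tp / (tp + fp),  # precision
        "negative_predictive_value": tn / (tn + fn),
    }

# Made-up counts for illustration only
toy_cm = np.array([[50, 10],
                   [5, 35]])
rates = rates_from_confusion(toy_cm)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```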

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Survived, y_train_pred_final.Survived_prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Survived, y_train_pred_final.Survived_prob)

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Survived_prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy, sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame(columns=['prob', 'accuracy', 'sensi', 'speci'])

# Confusion matrix layout: [0,0] = TN, [0,1] = FP, [1,0] = FN, [1,1] = TP
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Survived, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()
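
One common heuristic for reading a plot like this is to pick the cutoff where sensitivity and specificity are closest to each other. A minimal sketch, using a made-up `toy_cutoffs` table whose numbers are illustrative and are not results from this notebook:

```python
import pandas as pd

# Made-up cutoff table with the same column names as cutoff_df
toy_cutoffs = pd.DataFrame({
    "prob":  [0.2, 0.3, 0.4, 0.5, 0.6],
    "sensi": [0.95, 0.88, 0.80, 0.70, 0.55],
    "speci": [0.40, 0.62, 0.78, 0.86, 0.93],
})

# Absolute gap between sensitivity and specificity at each cutoff
gap = (toy_cutoffs["sensi"] - toy_cutoffs["speci"]).abs()

# Cutoff with the smallest gap
best_cutoff = toy_cutoffs.loc[gap.idxmin(), "prob"]
print(best_cutoff)
```

Other criteria, such as maximizing accuracy or balancing precision and recall, can lead to a different cutoff.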

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Survived_prob.map( lambda x: 1 if x > 0.4 else 0)

y_train_pred_final.head(10)

In [None]:
# Let's check the overall accuracy.
print("Accuracy:", metrics.accuracy_score(y_train_pred_final.Survived, y_train_pred_final.final_predicted))


In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Survived, y_train_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
print("Sensitivity:",(TP / float(TP+FN)))

In [None]:
# Let us calculate specificity
print('Specificity:',(TN / float(TN+FP)))

In [None]:
# Calculate the false positive rate - predicting survival when the passenger did not survive
print('False positive rate:',(FP/ float(TN+FP)))

In [None]:
# Positive predictive value 
print('Positive predictive rate:',(TP / float(TP+FP)))

In [None]:
# Negative predictive value
print('Negative predictive rate:', (TN / float(TN+ FN)))

In [None]:
#Looking at the confusion matrix again
confusion = metrics.confusion_matrix(y_train_pred_final.Survived, y_train_pred_final.Predicted )
confusion

In [None]:
# Precision = TP / (TP + FP)
print('Precision:',(confusion[1,1]/(confusion[0,1]+confusion[1,1])))

In [None]:
# Recall = TP / (TP + FN)
print('Recall:',(confusion[1,1]/(confusion[1,0]+confusion[1,1])))

In [None]:
from sklearn.metrics import precision_recall_curve  # Import the precision-recall curve helper from scikit-learn

In [None]:
# Precision and recall for every candidate threshold on the predicted probabilities
p, r, thresholds = precision_recall_curve(y_train_pred_final.Survived, y_train_pred_final.Survived_prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-", label="Precision")
plt.plot(thresholds, r[:-1], "r-", label="Recall")
plt.xlabel("Threshold")
plt.legend()
plt.show()

In [None]:
test.head()

In [None]:
test.shape

In [None]:
# Scale the test features with the scaler that was already fitted on the training data
test[['Age','Fare']] = scaler.transform(test[['Age','Fare']])
test.head()

In [None]:

# Create dummy variables for the 'Embarked' column with the prefix 'Embarked'
test = pd.get_dummies(test, columns=['Embarked'], prefix=['Embarked'])


In [None]:
# Fetch the boolean dummy columns and convert them to integers
bool_columns = test.select_dtypes(include='bool').columns
test[bool_columns] = test[bool_columns].astype('int')

In [None]:
test.head()

In [None]:
# Drop the identifier and the features that were eliminated during model building
X_test = test.drop(['PassengerId','Parch','Fare','Embarked_C','Embarked_Q'], axis=1)
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test[col])

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
test.shape

In [None]:
# Convert the predicted probabilities (a Series) to a DataFrame
y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head
y_pred_1.head()

In [None]:
# Concatenate test and y_pred_1 column-wise
y_pred_final = pd.concat([test, y_pred_1],axis=1)
y_pred_final.head()

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'Survived_prob'})
y_pred_final.head(2)

In [None]:
# Take a copy so that adding a column later does not trigger a SettingWithCopyWarning
y_pred_final = y_pred_final[['PassengerId','Survived_prob']].copy()

In [None]:
y_pred_final.head(2)

In [None]:
y_pred_final.shape

In [None]:
# Apply the 0.4 cutoff chosen on the training data
y_pred_final['Survived'] = y_pred_final.Survived_prob.map(lambda x: 1 if x > 0.4 else 0)
y_pred_final.head()

In [None]:
# Create the submission dataset with the 'PassengerId' and 'Survived' columns (the column names used in gender_submission.csv)
gender_submission_pred = y_pred_final[['PassengerId', 'Survived']]

# Display the first few rows of the new dataset
print(gender_submission_pred.head())


In [None]:
gender_submission_pred.shape

In [None]:
# Export the dataset to a CSV file
gender_submission_pred.to_csv('gender_submission_pred.csv', index=False)