
# Kaggle #1: Titanic Dataset - Introduction


Welcome to my Kaggle project using the Titanic dataset! This classic machine learning challenge involves predicting passenger survival based on various features such as age, gender, and class. The dataset provides a glimpse into the historical context of the Titanic disaster, making this an engaging and informative project.

Objective:
The primary goal is to build a predictive model that accurately determines whether a given passenger survived or not. Through data exploration, preprocessing, and the application of machine learning algorithms, we aim to create a robust model for predicting survival outcomes.

Key Tasks:


*   Data Exploration: Understand the dataset and its features.
*   Data Preprocessing: Handle missing values and perform feature engineering.
*   Model Building: Implement machine learning algorithms for prediction.

This project is a good opportunity to practice data science skills on a real-world dataset. Let's get started with the Titanic dataset on Kaggle!

# Section 1: Importing Libraries

In [97]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

# Section 2: Loading Data

In [98]:
train_data = pd.read_csv("https://alanszpetmanski.blob.core.windows.net/kaggle/titanic/train.csv")
test_data = pd.read_csv("https://alanszpetmanski.blob.core.windows.net/kaggle/titanic/test.csv")

# Section 3: Exploring and Preprocessing Data

In [99]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [100]:
train_data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [101]:
# Fill missing ages with the mean age of each passenger class
train_data['Age'] = train_data['Age'].fillna(train_data.groupby('Pclass')['Age'].transform('mean'))
test_data['Age'] = test_data['Age'].fillna(test_data.groupby('Pclass')['Age'].transform('mean'))

# Fill missing test fares with the mean training fare of each passenger class.
# Mapping through Pclass avoids matching train rows to test rows by index.
class_fare_means = train_data.groupby('Pclass')['Fare'].mean()
test_data['Fare'] = test_data['Fare'].fillna(test_data['Pclass'].map(class_fare_means))

# Fill missing values for embarked with the most common value in the training set
embarked_mode = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(embarked_mode)
test_data['Embarked'] = test_data['Embarked'].fillna(embarked_mode)

# Drop columns that are not useful for the model
train_data.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis=1, inplace=True)
test_data.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis=1, inplace=True)
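
The class-wise imputation above can be sketched on a toy frame. This is only an illustration under assumptions: the names `toy_train`, `toy_test`, and `class_means` and all values are made up, and the sketch fills both frames from training-set means, which is one possible choice rather than necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames (values are invented).
toy_train = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 20.0, 30.0, np.nan],
})
toy_test = pd.DataFrame({
    "Pclass": [1, 3],
    "Age": [np.nan, np.nan],
})

# Mean age per class, computed from the training rows only.
class_means = toy_train.groupby("Pclass")["Age"].mean()

# Look up each row's class mean and use it wherever Age is missing.
toy_train["Age"] = toy_train["Age"].fillna(toy_train["Pclass"].map(class_means))
toy_test["Age"] = toy_test["Age"].fillna(toy_test["Pclass"].map(class_means))

print(toy_train)
print(toy_test)
```

Mapping through `Pclass` keeps the lookup independent of row indices, so it works even when the two frames have different lengths.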

In [102]:
# Convert column names to lowercase
# Convert data types for specific columns

train_data.columns = [col_name.lower() for col_name in train_data.columns]
train_data = train_data.astype({'survived':'int8',
              'pclass':'int8',
              'age':'float32',
              'sibsp':'int8',
              'parch':'int8',
              'fare':'float32',
              'embarked':'category'})

test_data.columns = [col_name.lower() for col_name in test_data.columns]
test_data = test_data.astype({
              'pclass':'int8',
              'age':'float32',
              'sibsp':'int8',
              'parch':'int8',
              'fare':'float32',
              'embarked':'category'})
# Create a new feature 'num_family_on_board' by combining 'sibsp' and 'parch' in both sets

train_data['num_family_on_board'] = train_data.sibsp + train_data.parch
test_data['num_family_on_board'] = test_data.sibsp + test_data.parch

In [103]:
# Define fare bins and categorize the 'fare' column.
# With an IntervalIndex, pd.cut ignores `labels`, so the intervals are renamed below instead.
fare_bins = pd.IntervalIndex.from_tuples([(-1, 8), (8, 30), (30, 513)])
train_data['fare'] = pd.cut(train_data['fare'], fare_bins)

# Rename the interval categories to the ordinal codes 1 (low), 2 (medium), 3 (high)
train_data['fare'] = train_data['fare'].cat.rename_categories([1, 2, 3])

# Define age bins and categorize the 'age' column
age_bins = pd.IntervalIndex.from_tuples([(-1, 17), (17, 37), (37, 100)])
train_data['age'] = pd.cut(train_data['age'], age_bins)

# Rename the interval categories to the ordinal codes 1 (low), 2 (medium), 3 (high)
train_data['age'] = train_data['age'].cat.rename_categories([1, 2, 3])

train_data.head()

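| | survived | pclass | sex | age | sibsp | parch | fare | embarked | num_family_on_board |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 2 | 1 | 0 | 1 | S | 1 |
| 1 | 1 | 1 | female | 3 | 1 | 0 | 3 | C | 1 |
| 2 | 1 | 3 | female | 2 | 0 | 0 | 1 | S | 0 |
| 3 | 1 | 1 | female | 2 | 1 | 0 | 3 | S | 1 |
| 4 | 0 | 3 | male | 2 | 0 | 0 | 2 | S | 0 |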
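
As a side note, the same three-way binning can be written with plain bin edges, in which case `pd.cut` applies the labels directly. The sketch below uses invented fare values and the hypothetical names `fares` and `binned`; it reuses the fare edges from the cell above.

```python
import pandas as pd

# Invented fares chosen to sit on and around the bin edges.
fares = pd.Series([7.25, 8.0, 8.05, 30.0, 71.28, 512.33])

# Intervals are right-closed by default: (-1, 8], (8, 30], (30, 513].
binned = pd.cut(fares, bins=[-1, 8, 30, 513], labels=[1, 2, 3])
print(binned.astype("int8").tolist())

# A value outside every interval becomes NaN rather than raising an error.
out_of_range = pd.cut(pd.Series([600.0]), bins=[-1, 8, 30, 513], labels=[1, 2, 3])
print(out_of_range.isna().tolist())
```

Because out-of-range values silently turn into NaN, it is worth checking the minimum and maximum of a column before choosing the outer edges.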


In [104]:
# Repeat the binning steps for test_data (pd.cut ignores `labels` with an IntervalIndex)
test_data['fare'] = pd.cut(test_data['fare'], fare_bins)
test_data['fare'] = test_data['fare'].cat.rename_categories([1, 2, 3])

test_data['age'] = pd.cut(test_data['age'], age_bins)
test_data['age'] = test_data['age'].cat.rename_categories([1, 2, 3])

test_data.head()

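| | pclass | sex | age | sibsp | parch | fare | embarked | num_family_on_board |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 2 | 0 | 0 | 1 | Q | 0 |
| 1 | 3 | female | 3 | 1 | 0 | 1 | S | 1 |
| 2 | 2 | male | 3 | 0 | 0 | 2 | Q | 0 |
| 3 | 3 | male | 2 | 0 | 0 | 2 | S | 0 |
| 4 | 3 | female | 2 | 1 | 1 | 2 | S | 2 |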


In [105]:
# Convert 'sex' column to numerical values (male = 1, female = 0)
train_data['sex'] = train_data['sex'].map(lambda sex: 1 if sex == 'male' else 0).astype('int8')
test_data['sex'] = test_data['sex'].map(lambda sex: 1 if sex == 'male' else 0).astype('int8')

# Map 'embarked' categories to numerical values
train_data.embarked = train_data.embarked.cat.rename_categories([1, 2, 3]).astype('int8')
test_data.embarked = test_data.embarked.cat.rename_categories([1, 2, 3]).astype('int8')

# Drop 'sibsp' and 'parch' columns
train_data.drop(['sibsp', 'parch'], axis=1, inplace=True)
test_data.drop(['sibsp', 'parch'], axis=1, inplace=True)

# Convert 'age' and 'fare' columns to 'int8' data type
train_data.age = train_data.age.astype('int8')
train_data.fare = train_data.fare.astype('int8')

test_data.age = test_data.age.astype('int8')
test_data.fare = test_data.fare.astype('int8')

# Extract the target variable 'survived'
y = train_data.pop('survived')

# Standardize the features using StandardScaler
ss = StandardScaler()
X = ss.fit_transform(train_data)


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    shuffle=True)

# Scale the Kaggle test set with the mean and standard deviation learned from the training data
test_data = ss.transform(test_data)
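
One detail of `StandardScaler` is worth illustrating: `fit_transform` learns a mean and standard deviation from the data it is given, while `transform` reuses statistics that were already learned. The toy arrays and variable names below are assumptions made up for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented single-feature data.
train_features = np.array([[0.0], [2.0], [4.0]])
new_features = np.array([[2.0], [6.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_features)  # learns mean=2 from the training rows
new_scaled = scaler.transform(new_features)          # reuses the training mean and std

# Refitting on the new data uses a different mean (4) and std,
# so the same raw value 2.0 ends up at a different scaled position.
refit_scaled = StandardScaler().fit_transform(new_features)

print(new_scaled.ravel())
print(refit_scaled.ravel())
```

A model trained on features scaled one way will generally see shifted inputs if new data is scaled with freshly fitted statistics.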

# Section 4: Building, Training and Evaluating Models

In [106]:
models = [KNeighborsClassifier(), RandomForestClassifier(), SVC(), GaussianNB(), DecisionTreeClassifier(), xgb.XGBClassifier()]
models_names = ['KNeighborsClassifier', 'RandomForestClassifier', 'SVC', 'GaussianNB', 'DecisionTreeClassifier', 'xgb.XGBClassifier']

for model, name in zip(models, models_names):
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))




KNeighborsClassifier 0.8071748878923767
RandomForestClassifier 0.8026905829596412
SVC 0.8116591928251121
GaussianNB 0.757847533632287
DecisionTreeClassifier 0.7982062780269058
xgb.XGBClassifier 0.8161434977578476
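
The scores above come from a single random split, so differences of a percentage point or so between models may be noise. A minimal sketch of a cross-validated comparison follows; it assumes synthetic data from `make_classification` and two arbitrary scikit-learn estimators rather than the notebook's own features or model list.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

cv_results = {}
for name, estimator in candidates.items():
    # Five accuracy scores, one per held-out fold.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether one model is really ahead of another.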


# Section 5: Tune the Model

In [107]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200, 300, 400, 500],
                'max_depth': [1, 2, 3, 4, 5],
                'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5]}
model = xgb.XGBClassifier()
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(X_test, y_test))

{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
0.815912916619908
0.820627802690583
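
After fitting, a `GridSearchCV` object already holds a model refit on the full training split with the winning parameters (its `best_estimator_`, available because `refit=True` is the default). The sketch below shows this on synthetic data with a decision tree; the data, the grid, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a fixed seed so the run is repeatable.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_grid = {"max_depth": [1, 2, 3], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_grid, cv=5)
search.fit(X_tr, y_tr)

# The refit model can be used directly, without retyping the parameters.
best_model = search.best_estimator_

# Equivalent manual rebuild: unpack best_params_ into a fresh estimator.
rebuilt = DecisionTreeClassifier(random_state=0, **search.best_params_).fit(X_tr, y_tr)

print(search.best_params_)
print(best_model.score(X_te, y_te), rebuilt.score(X_te, y_te))
```

Unpacking `best_params_` avoids copying values by hand, which is where mismatches between the search result and the final model tend to creep in.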


In [108]:
# Rebuild the model with the best parameters found by the grid search
model = xgb.XGBClassifier(**grid.best_params_)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

0.820627802690583


# Section 6: Predict!


In [109]:
predictions = model.predict(test_data)
test_data = pd.read_csv('https://alanszpetmanski.blob.core.windows.net/kaggle/titanic/test.csv')
passenger_id = test_data['PassengerId']
output = pd.DataFrame({'PassengerId': passenger_id, 'Survived': predictions})
output.to_csv('submission_final.csv', index=False)
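
Before uploading, it can help to sanity-check the submission frame. The helper below is a hypothetical sketch, not part of the notebook: `check_submission` is an invented name, the toy IDs are made up, and it assumes the expected format is exactly the two columns written above, `PassengerId` and `Survived`, with 0/1 predictions.

```python
import pandas as pd

def check_submission(frame, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(frame.columns) != ["PassengerId", "Survived"]:
        return ["columns must be exactly PassengerId, Survived"]
    problems = []
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    if frame["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not frame["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Toy frames with invented values.
good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": [892, 892, 894], "Survived": [0, 2, 0]})

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

In the notebook itself, a call along the lines of `check_submission(output, expected_rows=len(test_data))` would apply the same checks to the real predictions.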