<a href="https://www.kaggle.com/code/benotroussel/notebook-titanic?scriptVersionId=137834125" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<font size="3"> 
    
<b> Welcome to my notebook on the Titanic dataset. 
    
Although I'm relatively new to Kaggle, I've tried to ensure the techniques applied here are valuable and interesting, providing my approach to this well-known challenge. The intention of sharing this notebook is not just to showcase my efforts, but also to invite constructive feedback from the Kaggle community. By opening my methods to scrutiny, I hope to refine my data science skills and improve upon future iterations. This notebook aims to serve as a stepping stone for fellow beginners, illustrating my process from data exploration to prediction. Nevertheless, I'm aware that there's always room for improvement. So, please feel free to suggest alternative strategies or areas for refinement. Your input is appreciated.

<b> Enjoy exploring this notebook! 

</font>

# Titanic Dataset


## Dataset Description

### Overview

The data has been split into two groups:
- training set (`train.csv`)
- test set (`test.csv`)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers' gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.



### Data Dictionary

| Variable  | Definition                              | Key                             |
| --------- | -------------------------------------- | ------------------------------- |
| `survival`  | Survival                               | 0 = No, 1 = Yes                 |
| `pclass`    | Ticket class                           | 1 = 1st, 2 = 2nd, 3 = 3rd      |
| `sex`       | Sex                                    |                                 |
| `Age`       | Age in years                           |                                 |
| `sibsp`     | # of siblings / spouses aboard the Titanic |                              |
| `parch`     | # of parents / children aboard the Titanic |                              |
| `ticket`    | Ticket number                          |                                 |
| `fare`      | Passenger fare                         |                                 |
| `cabin`     | Cabin number                           |                                 |
| `embarked`  | Port of Embarkation                    | S = Southampton (UK), C = Cherbourg (France), Q = Queenstown (Ireland)  |


The ports are listed in embarkation order, in the direction of the USA.

### Variable Notes
- `pclass`: A proxy for socio-economic status (SES)
  - 1st = Upper
  - 2nd = Middle
  - 3rd = Lower
- `age`: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
- `sibsp`: The dataset defines family relations in this way...
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)
- `parch`: The dataset defines family relations in this way...
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson



### Some help for the EDA

https://www.kaggle.com/code/allohvk/titanic-missing-age-imputation-tutorial-advanced/notebook

https://www.kaggle.com/code/allohvk/titanic-advanced-eda?scriptVersionId=77739368

https://github.com/Kaggle/kaggle-api

## Imports and configuration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = "retina"

!kaggle config set -n competition -v titanic

In [None]:
titanic_train = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
combined_dataset = pd.concat([titanic_train, titanic_test])

titanic_train.head()


When visualizing the data and making assumptions about passenger survival, use only the training set (`titanic_train`): we do not want any leakage from the test set into the training set. Later, when post-processing or filling missing values, you can apply the findings from the training set to the test set (`combined_dataset` holds both sets together).
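
A minimal sketch of that rule, using small made-up frames rather than the real data: the statistic is computed on the training frame only and then reused to fill the test frame. The column name `Fare` matches the dataset, but the values and the choice of the median are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for titanic_train / titanic_test (values are invented)
toy_train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, np.nan]})
toy_test = pd.DataFrame({"Fare": [np.nan, 13.0]})

# Learn the fill value from the training data only
train_median_fare = toy_train["Fare"].median()

# Reuse that value for both sets, so nothing flows from test to train
toy_train["Fare"] = toy_train["Fare"].fillna(train_median_fare)
toy_test["Fare"] = toy_test["Fare"].fillna(train_median_fare)
```

The same pattern is what the later fare and age imputation cells follow: group means come from `titanic_train` and are then applied to `titanic_test`.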

## Dataset Visualization

### Types

In [None]:
titanic_train.dtypes


In [None]:
titanic_train.Name.apply(lambda x: type(x).__name__).value_counts()


Every value in the `object` columns is a string; anything else is a `NaN` value.
There are no type issues in this dataset.
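
The cell above only inspects `Name`. A possible way to run the same check on every `object` column at once is sketched below on an invented frame; the variable names `toy` and `type_counts` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented sample with two object columns and one numeric column
toy = pd.DataFrame(
    {
        "Name": pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], dtype=object),
        "Cabin": pd.Series(["C85", np.nan], dtype=object),
        "Age": [22.0, 26.0],
    }
)

# For each object column, count the Python types of its values
type_counts = {
    col: toy[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    for col in toy.columns
    if toy[col].dtype == object
}
```

A `float` entry in the counts corresponds to `NaN`, since missing values in an `object` column are stored as floats.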

### Missing values

In [None]:
print(titanic_train.isnull().sum())
print(titanic_test.isnull().sum())


### Duplicates

In [None]:
titanic_train.duplicated(["PassengerId"]).sum() + titanic_train.duplicated(["Name"]).sum()


### Summary statistics / Univariate analysis

In [None]:
print(titanic_train.Survived.value_counts())
print()
print(titanic_train.Pclass.value_counts())
print()
print(titanic_train.Embarked.value_counts())


In [None]:
print(titanic_train.SibSp.value_counts())
print()
print(titanic_train.Parch.value_counts())

In [None]:
titanic_train.loc[:, ["SibSp", "Parch", "Age", "Fare"]].describe()

### Bivariate analysis

In [None]:
sns.histplot(titanic_train, x="Age", hue="Survived")


In [None]:
sns.histplot(titanic_train, x="Sex", hue="Survived")


In [None]:
sns.histplot(titanic_train, x="Fare", hue="Survived")

In [None]:
sns.displot(data=titanic_train, x="Fare", hue="Pclass", kind="kde")
sns.displot(data=titanic_train, x="Fare", hue="Survived", kind="kde")


In [None]:
sns.histplot(titanic_train, x="Pclass", hue="Survived")


In [None]:
sns.histplot(titanic_train, x="Embarked", hue="Survived")

In [None]:
sns.histplot(titanic_train, x="SibSp", hue="Survived")


In [None]:
sns.histplot(titanic_train, x="Parch", hue="Survived")

## Extract features

### Family Names & Titles

In [None]:
titanic_train.Name

In [None]:
titanic_train["Last_name"] = titanic_train.Name.apply(lambda x: str.split(x, ",")[0])
titanic_test["Last_name"] = titanic_test.Name.apply(lambda x: str.split(x, ",")[0])

In [None]:
titanic_train.Name.apply(lambda x: ((str.split(x, ",")[1]).split(".")[0])).value_counts()

In [None]:
Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royal",
    "Don": "Royal",
    "Sir": "Royal",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess": "Royal",
    "Dona": "Royal",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Royal",
}

In [None]:
titanic_train["Title"] = titanic_train.Name.apply(
    lambda x: Title_Dictionary[((str.split(x, ",")[1]).split(".")[0]).strip()]
)

titanic_test["Title"] = titanic_test.Name.apply(
    lambda x: Title_Dictionary[((str.split(x, ",")[1]).split(".")[0]).strip()]
)

titanic_train.Title.value_counts()


In [None]:
round(titanic_train.groupby(["Title"]).Survived.mean(), 3)


`Title` is strongly related to a passenger's chance of survival.
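
The extraction above looks each title up directly in `Title_Dictionary`, which raises a `KeyError` for any title that is not listed. A more forgiving variant, sketched here on invented names, uses a regular expression plus `Series.map` with a fallback. The `"Other"` label and the `title_map` subset are assumptions for illustration, not part of the original mapping.

```python
import pandas as pd

names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
        "Doe, Prof. Jane",  # invented name with a title missing from the mapping
    ]
)

# Subset of Title_Dictionary, enough for this example
title_map = {"Mr": "Mr", "Mrs": "Mrs", "the Countess": "Royal"}

# Capture the text between the comma and the first period
raw_titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Unknown titles become NaN after map(), then fall back to a default label
titles = raw_titles.map(title_map).fillna("Other")
```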

### Fare per person

#### PeopleInTicket

In [None]:
titanic_train["PeopleInTicket"] = titanic_train["Ticket"].map(
    combined_dataset["Ticket"].value_counts()
)

titanic_test["PeopleInTicket"] = titanic_test["Ticket"].map(
    combined_dataset["Ticket"].value_counts()
)



#### Fare Outliers / missing values

In [None]:
titanic_train[(titanic_train["Fare"] == 0)]

How can their fare be 0? All of them are middle-aged men, and all embarked at the same port. Most likely they were members of the crew. These zero fares can be treated as outliers and need to be handled.

In [None]:
round(
    titanic_train.loc[:, ["Pclass", "Embarked", "PeopleInTicket", "Fare"]]
    .groupby(["Pclass", "Embarked", "PeopleInTicket"])
    .agg(("count", "min", "mean", "max")),
    2,
)


In [None]:
# Calculate the mean fares from the train set
mean_fares = titanic_train.groupby(["Pclass", "Embarked", "PeopleInTicket"])["Fare"].mean()


# Define a function to fill the missing values
def fill_fare(row):
    if row["Fare"] == 0 or pd.isna(row["Fare"]):
        return mean_fares[row["Pclass"], row["Embarked"], row["PeopleInTicket"]]
    else:
        return row["Fare"]


# Use the function to fill the missing values in the train set
titanic_train["Fare"] = titanic_train.apply(fill_fare, axis=1)

# Use the function to fill the missing values in the test set
titanic_test["Fare"] = titanic_test.apply(fill_fare, axis=1)

#### Family size

In [None]:
(titanic_train.Last_name).value_counts().value_counts()


In [None]:
titanic_train["FamilySize"] = titanic_train.SibSp + titanic_train.Parch + 1
titanic_test["FamilySize"] = titanic_test.SibSp + titanic_test.Parch + 1
titanic_train["FamilySize"].value_counts()


#### PeopleInGroup

In [None]:
test = titanic_train.loc[
    :, ["Survived", "Pclass", "Fare", "FamilySize", "Embarked", "PeopleInTicket"]
]
test["PeopleInGroup"] = test[["FamilySize", "PeopleInTicket"]].max(axis=1)

# test = test.loc[test.Embarked == "S"]
# test = test.loc[test.Embarked == "C"]
# test = test.loc[test.Embarked == "Q"]

test["FarePerPerson1"] = test["Fare"] / test["PeopleInTicket"]
test["FarePerPerson2"] = test["Fare"] / test["FamilySize"]
test["FarePerPerson3"] = test["Fare"] / test["PeopleInGroup"]

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 5))

sns.histplot(test, x="Fare", hue="Pclass", binwidth=4, ax=ax1)
ax1.set_xlim(0, 100)

sns.histplot(test, x="FarePerPerson1", hue="Pclass", binwidth=4, ax=ax2)
ax2.set_xlim(0, 100)

sns.histplot(test, x="FarePerPerson2", hue="Pclass", binwidth=4, ax=ax3)
ax3.set_xlim(0, 100)

sns.histplot(test, x="FarePerPerson3", hue="Pclass", binwidth=4, ax=ax4)
ax4.set_xlim(0, 100)

There are multiple ways to derive the new `FarePerPerson` column from the `Fare` column:
1. From the number of people on the same ticket, `PeopleInTicket`
2. From the size of the family, `FamilySize`
3. From a mix of both columns, `max(PeopleInTicket, FamilySize)`.

By varying the `Embarked` filter (ticket prices from the same port of embarkation should be closer to each other), we see that none of the methods is clearly preferable to the others. However, the values are more tightly grouped with the first option (`PeopleInTicket`). Members of the same family did not necessarily buy their tickets together.


In [None]:
round(
    test.loc[
        :, ["Pclass", "Embarked", "Fare", "FarePerPerson1", "FarePerPerson2", "FarePerPerson3"]
    ]
    .groupby(["Embarked", "Pclass"])
    .agg(("count", "mean", "std")),
    2,
)

From this final table, we see that the `FarePerPerson1` option reduces the variance the most. The `FarePerPerson3` option is still a safe choice, though.
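
To make the comparison concrete, here is a small sketch on invented data, deliberately built so that the fare is proportional to the number of people on the ticket. It only illustrates how the per-class spread can be compared; the numbers are not taken from the Titanic data.

```python
import pandas as pd

# Invented third-class tickets: 8.0 per person, ticket shared by 1 to 3 people
toy = pd.DataFrame(
    {
        "Pclass": [3, 3, 3, 3],
        "Fare": [8.0, 16.0, 24.0, 8.0],
        "PeopleInTicket": [1, 2, 3, 1],
        "FamilySize": [1, 1, 3, 2],
    }
)

toy["FarePerPerson1"] = toy["Fare"] / toy["PeopleInTicket"]
toy["FarePerPerson2"] = toy["Fare"] / toy["FamilySize"]

# Standard deviation per class: the smaller, the more consistent the price
spread = toy.groupby("Pclass")[["Fare", "FarePerPerson1", "FarePerPerson2"]].std()
```

In this toy case, dividing by `PeopleInTicket` removes all the spread, while dividing by `FamilySize` only reduces it.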

In [None]:
titanic_train["FarePerPerson"] = titanic_train["Fare"] / titanic_train["PeopleInTicket"]

titanic_test["FarePerPerson"] = titanic_test["Fare"] / titanic_test["PeopleInTicket"]

### Cabin deck letter

In [None]:
titanic_train["Cabin_deck"] = titanic_train.Cabin.apply(lambda x: x[0] if isinstance(x, str) else x)
titanic_test["Cabin_deck"] = titanic_test.Cabin.apply(lambda x: x[0] if isinstance(x, str) else x)
titanic_train["Cabin_deck"].value_counts()


In [None]:
table_count = titanic_train.loc[:, ["Ticket", "Cabin", "PeopleInTicket"]].groupby("Ticket").count()
table_count.loc[(table_count["Cabin"] > 0) & (table_count["PeopleInTicket"] > table_count["Cabin"])]


In [None]:
titanic_train[titanic_train.Ticket == "PC 17757"]

Filling the blanks in the `Cabin` column would be too complicated. The cabin deck letter could have helped to determine a passenger's position on the ship, but that would be even more complicated to implement. We will therefore drop this column later and not use it.

## Complete data

### Cabin

There is no reason to fill it, since we will drop the column.

### Embarked

In [None]:
titanic_train[titanic_train.Embarked.isnull()]

In [None]:
titanic_train[(titanic_train.Pclass == 1)].groupby("Embarked").agg(
    {"FarePerPerson": "mean", "Fare": "mean", "PassengerId": "count"}
)
# Assign the result back instead of using inplace=True on an attribute access
titanic_train["Embarked"] = titanic_train["Embarked"].fillna("C")


### Age

Filling in the missing ages matters because age indicates how autonomous a person is. The mean `Age` on board is about 30. If we assign that same age to a child, we will probably not get the right result.

Use `transform` when you want to keep the same shape as the original DataFrame but replace values based on group-based calculations. Use `agg` when you want to obtain a summary of each group.
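
The difference can be seen on a tiny invented frame: `agg` returns one value per group, while `transform` returns one value per original row, which makes it convenient for `fillna`. The frame and variable names below are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Title": ["Mr", "Mr", "Master", "Master"],
        "Age": [30.0, np.nan, 4.0, np.nan],
    }
)

# agg: one row per group (a summary)
group_means = toy.groupby("Title")["Age"].agg("mean")

# transform: one value per original row, aligned on the original index
row_means = toy.groupby("Title")["Age"].transform("mean")

# Because the shapes match, the result can be used directly to fill gaps
filled_age = toy["Age"].fillna(row_means)
```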

In [None]:
titanic_train["Age"].describe()


In [None]:
round(
    titanic_train.loc[:, ["Pclass", "Embarked", "Age", "Title", "Sex", "SibSp", "Parch"]]
    .groupby(["Embarked", "Pclass", "Title"])
    .agg(("count", "min", "mean", "max")),
    2,
)


In [None]:
round(
    titanic_train.loc[:, ["Pclass", "Embarked", "Age", "Title", "Sex"]]
    .groupby(["Embarked", "Pclass", "Title", "Sex"])
    .agg(("count", "min", "mean", "max")),
    2,
)

In [None]:
round(
    titanic_train.loc[:, ["Pclass", "Embarked", "Age", "Title", "Sex"]]
    .groupby(["Pclass", "Title", "Sex"])
    .agg(("count", "min", "mean", "max")),
    2,
)


In [None]:
# Calculate the mean ages from the train set
mean_ages = titanic_train.groupby(["Pclass", "Sex", "Title"])["Age"].mean()


# Define a function to fill the missing values
def fill_age(row):
    if pd.isnull(row["Age"]):
        return mean_ages[row["Pclass"], row["Sex"], row["Title"]]
    else:
        return row["Age"]


# Use the function to fill the missing values in the train set
titanic_train["Age"] = titanic_train.apply(fill_age, axis=1)

# Use the function to fill the missing values in the test set
titanic_test["Age"] = titanic_test.apply(fill_age, axis=1)


## Select features

After checking that we have done as much as we can to clean the data, fill missing values, detect outliers, and create new features, it is time to select the features for the prediction model and to adjust the format of certain features (categorical features, cyclic features, etc.).

In [None]:
titanic_train = titanic_train.loc[
    :,
    [
        "Survived",
        "Pclass",
        "Sex",
        "Age",
        "SibSp",
        "Parch",
        "Embarked",
        "Title",
        "FamilySize",
        "PeopleInTicket",
        "FarePerPerson",
    ],
]

titanic_test = titanic_test.loc[
    :,
    [
        "Pclass",
        "Sex",
        "Age",
        "SibSp",
        "Parch",
        "Embarked",
        "Title",
        "FamilySize",
        "PeopleInTicket",
        "FarePerPerson",
    ],
]


### Binning

#### Age

In [None]:
sns.histplot(titanic_train, x="Age", hue="Survived")


In [None]:
# Define the bin edges
bins = [0, 16, 30, 40, 55, np.inf]

# Define the labels for the bins
labels = ["Young", "YoungAdult", "Adults", "Old", "VeryOld"]

# Create the new column
titanic_train["Age"] = pd.cut(titanic_train["Age"], bins=bins, labels=labels)
titanic_test["Age"] = pd.cut(titanic_test["Age"], bins=bins, labels=labels)

#### FarePerPerson

In [None]:
sns.histplot(titanic_train, x="FarePerPerson", hue="Survived")


In [None]:
# Define the bin edges
bins = [0, 10, 20, 40, np.inf]

# Define the labels for the bins
labels = ["LowFare", "MediumFare", "HighFare", "VeryHighFare"]

# Create the new column
titanic_train["FarePerPerson"] = pd.cut(titanic_train["FarePerPerson"], bins=bins, labels=labels)
titanic_test["FarePerPerson"] = pd.cut(titanic_test["FarePerPerson"], bins=bins, labels=labels)


In [None]:
titanic_train["FarePerPerson"]

In [None]:
sns.histplot(titanic_train, x="FarePerPerson", hue="Survived")


#### FamilySize / PeopleInTicket

In [None]:
plt.figure()
sns.countplot(data=titanic_train, x="FamilySize", hue="Survived")
plt.show()

plt.figure()
sns.countplot(data=titanic_train, x="PeopleInTicket", hue="Survived")
plt.show()


The two bar plots are very similar, because most people who share a ticket belong to the same family.
We can clearly bin these values into 3 categories.
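
As a quick illustration of how the bin edges used in the next cell behave, here is a sketch on invented group sizes. By default `pd.cut` builds right-closed intervals, so the edges `[0, 1, 4, inf]` mean (0, 1], (1, 4] and (4, inf).

```python
import numpy as np
import pandas as pd

sizes = pd.Series([1, 2, 4, 5, 11])  # invented group sizes
bins = [0, 1, 4, np.inf]
labels = ["Solo", "SmallGroup", "LargeGroup"]

groups = pd.cut(sizes, bins=bins, labels=labels)

# A value equal to the lowest edge falls outside the first interval and becomes NaN
zero_group = pd.cut(pd.Series([0]), bins=bins, labels=labels)
```

A size of exactly 4 therefore still counts as `SmallGroup`, and a value of 0 would not be assigned to any bin.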

In [None]:
# Define the bin edges
bins = [0, 1, 4, np.inf]

# Define the labels for the bins
labels = ["Solo", "SmallGroup", "LargeGroup"]

# Create the new column
titanic_train["FamilySize"] = pd.cut(titanic_train["FamilySize"], bins=bins, labels=labels)
titanic_test["FamilySize"] = pd.cut(titanic_test["FamilySize"], bins=bins, labels=labels)

titanic_train["PeopleInTicket"] = pd.cut(titanic_train["PeopleInTicket"], bins=bins, labels=labels)
titanic_test["PeopleInTicket"] = pd.cut(titanic_test["PeopleInTicket"], bins=bins, labels=labels)

In [None]:
sns.histplot(titanic_train, x="PeopleInTicket", hue="Survived")


#### SibSp / ParCh

In [None]:
plt.figure()
sns.countplot(data=titanic_train, x="SibSp", hue="Survived")
plt.show()

plt.figure()
sns.countplot(data=titanic_train, x="Parch", hue="Survived")
plt.show()


In [None]:
# Define the bin edges
bins = [-np.inf, 0, 2, np.inf]

# Define the labels for the bins
labels = ["NoSibSP", "FewSibSP", "LotSibSP"]

# Create the new column
titanic_train["SibSp"] = pd.cut(titanic_train["SibSp"], bins=bins, labels=labels)
titanic_test["SibSp"] = pd.cut(titanic_test["SibSp"], bins=bins, labels=labels)

# Define the bin edges
bins = [-np.inf, 0, 3, np.inf]

# Define the labels for the bins
labels = ["NoParch", "FewParch", "LotParch"]

titanic_train["Parch"] = pd.cut(titanic_train["Parch"], bins=bins, labels=labels)
titanic_test["Parch"] = pd.cut(titanic_test["Parch"], bins=bins, labels=labels)

In [None]:
sns.histplot(titanic_train, x="Parch", hue="Survived")


### One-hot encoding


In [None]:
titanic_train["Sex"] = (titanic_train["Sex"] == "male").astype(int)
titanic_test["Sex"] = (titanic_test["Sex"] == "male").astype(int)

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Instantiate the OneHotEncoder with dense output
# (scikit-learn >= 1.2 uses `sparse_output`; older versions used `sparse`)
enc = OneHotEncoder(sparse_output=False)

# Fit on the train set and transform the categorical columns into a new DataFrame

columns_to_ohe = ["Embarked", "Title"]
df_encoded = pd.DataFrame(enc.fit_transform(titanic_train[columns_to_ohe]))

# Give names to the new columns and concatenate with the original df
df_encoded.columns = enc.get_feature_names_out(columns_to_ohe)
titanic_train = pd.concat([titanic_train, df_encoded], axis=1).drop(columns_to_ohe, axis=1)

df_encoded = pd.DataFrame(enc.transform(titanic_test[columns_to_ohe]))
df_encoded.columns = enc.get_feature_names_out(columns_to_ohe)
titanic_test = pd.concat([titanic_test, df_encoded], axis=1).drop(columns_to_ohe, axis=1)


### Ordinal Encoding
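
One point to check before running the cell below: as far as I know, `OrdinalEncoder` with its default `categories="auto"` orders string categories alphabetically, which is not the order of the bins created with `pd.cut`. The sketch below uses only pandas, on invented ages, to compare the codes that follow the bin order (`.cat.codes`) with the codes an alphabetical ordering would give.

```python
import numpy as np
import pandas as pd

labels = ["Young", "YoungAdult", "Adults", "Old", "VeryOld"]
ages = pd.Series([5, 25, 35, 50, 70])  # invented ages, one per bin
binned = pd.cut(ages, bins=[0, 16, 30, 40, 55, np.inf], labels=labels)

# Codes that respect the order of the bins
bin_order_codes = binned.cat.codes.tolist()

# Codes obtained if the labels are ranked alphabetically instead
alphabetical = sorted(labels)
alphabetical_codes = [alphabetical.index(label) for label in binned.astype(str)]
```

If the bin order matters for the model, one option is to pass the label lists explicitly through the `categories` parameter of `OrdinalEncoder`, or to use `.cat.codes` directly.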

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Instantiate the OrdinalEncoder
enc = OrdinalEncoder()

# Columns to apply the encoding
columns_to_encode = [
    "Pclass",
    "Age",
    "FamilySize",
    "PeopleInTicket",
    "SibSp",
    "Parch",
    "FarePerPerson",
]
# columns_to_encode = ["Pclass"]

df_encoded = titanic_train[columns_to_encode].copy()
df_encoded = enc.fit_transform(df_encoded)
df_encoded = pd.DataFrame(df_encoded, columns=columns_to_encode)
titanic_train = pd.concat([titanic_train.drop(columns_to_encode, axis=1), df_encoded], axis=1)

df_encoded = titanic_test[columns_to_encode].copy()
df_encoded = enc.transform(df_encoded)
df_encoded = pd.DataFrame(df_encoded, columns=columns_to_encode)
titanic_test = pd.concat([titanic_test.drop(columns_to_encode, axis=1), df_encoded], axis=1)

### Standard scaling

In [None]:
# from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler()

# columns_to_scale = []
# titanic_train[columns_to_scale] = scaler.fit_transform(titanic_train[columns_to_scale])
# titanic_test[columns_to_scale] = scaler.transform(titanic_test[columns_to_scale])


In [None]:
print(titanic_train.isnull().sum())

In [None]:
print(titanic_test.isnull().sum())

In [None]:
titanic_train.dtypes


### Final Selection



In [None]:
correlation_matrix = titanic_train.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.show()


In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# separate your data into X and y
X = titanic_train.drop("Survived", axis=1)
y = titanic_train["Survived"]

# apply SelectKBest class to extract the best features
bestfeatures = SelectKBest(score_func=chi2, k="all")
fit = bestfeatures.fit(X, y)

# create a DataFrame to visualize the scores
df_scores = pd.DataFrame(fit.scores_)
df_columns = pd.DataFrame(X.columns)

# concatenate dataframes for better visualization and sort values
feature_scores = pd.concat([df_columns, df_scores], axis=1)
feature_scores.columns = ["Feature", "Score"]
feature_scores_sorted = feature_scores.sort_values(by="Score", ascending=False)

print(feature_scores_sorted)


In [None]:
titanic_train = titanic_train.loc[
    :,
    [
        "Survived",
        "Pclass",
        "Sex",
        "Age",
        "SibSp",
        "Parch",
        "Embarked_C",
        "Embarked_Q",
        "Embarked_S",
        "Title_Master",
        "Title_Miss",
        "Title_Mr",
        "Title_Mrs",
        "Title_Officer",
        "Title_Royal",
        "FamilySize",
        "PeopleInTicket",
        "FarePerPerson",
    ],
]

titanic_test = titanic_test.loc[
    :,
    [
        "Pclass",
        "Sex",
        "Age",
        "SibSp",
        "Parch",
        "Embarked_C",
        "Embarked_Q",
        "Embarked_S",
        "Title_Master",
        "Title_Miss",
        "Title_Mr",
        "Title_Mrs",
        "Title_Officer",
        "Title_Royal",
        "FamilySize",
        "PeopleInTicket",
        "FarePerPerson",
    ],
]


## Machine learning

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

classifiers = {
    "LogisticRegression": LogisticRegression(),
    "SVC": SVC(),
    "RandomForestClassifier": RandomForestClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    # "DecisionTreeClassifier": DecisionTreeClassifier(),
    "GradientBoostingClassifier": GradientBoostingClassifier(),
    "XGBClassifier": XGBClassifier(eval_metric="logloss"),
}

In [None]:
# params = {
#     "LogisticRegression": {
#         "C": [0.1, 1.0, 10.0],
#         "solver": ["lbfgs", "liblinear"],
#         "max_iter": [5000],
#     },
#     "SVC": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
#     "RandomForestClassifier": {"n_estimators": [10, 100, 1000], "max_depth": [None, 10, 20]},
#     "KNeighborsClassifier": {
#         "n_neighbors": range(1, 21),
#         "p": [1, 2],
#     },
#     "DecisionTreeClassifier": {
#         "max_depth": [None, 10, 20, 30, 40],
#         "min_samples_split": [2, 10, 20],
#     },
#     "GradientBoostingClassifier": {
#         "learning_rate": [0.01, 0.1, 1.0],
#         "n_estimators": [100, 500, 1000],
#     },
#     "XGBClassifier": {"learning_rate": [0.01, 0.1, 1.0], "n_estimators": [100, 500, 1000]},
# }

In [None]:
params = {
    "LogisticRegression": {
        "C": [0.1],
        "solver": ["lbfgs", "liblinear"],
        "max_iter": [5000],
    },
    "SVC": {
        "C": [0.4],
        "kernel": ["rbf"],
        "gamma": ["scale"],
    },
    "RandomForestClassifier": {
        "n_estimators": [1000],
        "max_depth": [5],
        "max_features": ["sqrt"],
    },
    "KNeighborsClassifier": {
        "n_neighbors": range(1, 41),
        "p": [2],
    },
    "GradientBoostingClassifier": {
        "learning_rate": [0.01],
        "n_estimators": [350],
        "max_depth": [5],
        "subsample": [0.8],
    },
    "XGBClassifier": {
        "learning_rate": [0.007],
        "n_estimators": [300],
        "max_depth": [7],
        "subsample": [0.5],
    },
}

In [None]:
X_train = titanic_train.drop("Survived", axis=1)
y_train = titanic_train["Survived"]


In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score

model_score = {}
folds = 10

for classifier_name in classifiers.keys():
    clf = classifiers[classifier_name]
    param_grid = params[classifier_name]

    # Perform grid search with cross-validation
    grid_search = GridSearchCV(clf, param_grid, cv=folds)  # `folds`-fold cross-validation
    grid_search.fit(X_train, y_train)

    # Calculate additional scores
    f1_score = cross_val_score(
        grid_search.best_estimator_, X_train, y_train, cv=folds, scoring="f1"
    ).mean()
    auc_score = cross_val_score(
        grid_search.best_estimator_, X_train, y_train, cv=folds, scoring="roc_auc"
    ).mean()

    model_score[classifier_name] = {
        "Best Score": grid_search.best_score_,
        "F1 Score": f1_score,
        "AUC": auc_score,
    }
    print(classifier_name)
    print(f"Best parameters : {grid_search.best_params_}")
    print(f"Best score : {grid_search.best_score_}")
    print(f"F1 score : {f1_score}")
    print(f"AUC : {auc_score}")
    print()


In [None]:
# Convert nested dictionary into flat dictionary
flat_data = []
for model, scores in model_score.items():
    flat_scores = {"Model": model, **scores}
    flat_data.append(flat_scores)

# Create dataframe from flat dictionary
df_model_score = pd.DataFrame(flat_data)

# Sort by Score
df_model_score.sort_values(by="Best Score", ascending=False)

## Prediction
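
Since the loop above fits several classifiers, it can help to pick the winner by name from the collected scores before predicting. The sketch below assumes a dictionary shaped like `model_score`; the numbers are placeholders chosen for the example, not results from this notebook.

```python
# Placeholder scores with the same structure as `model_score`
example_scores = {
    "LogisticRegression": {"Best Score": 0.10, "F1 Score": 0.10, "AUC": 0.10},
    "SVC": {"Best Score": 0.30, "F1 Score": 0.20, "AUC": 0.20},
    "XGBClassifier": {"Best Score": 0.20, "F1 Score": 0.30, "AUC": 0.30},
}

# Name of the classifier with the highest cross-validated "Best Score"
best_name = max(example_scores, key=lambda name: example_scores[name]["Best Score"])
```

To use this in the notebook, the fitted `grid_search.best_estimator_` of each classifier would also need to be stored inside the loop, for example in a dictionary keyed by classifier name.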

In [None]:
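# Note: after the loop, `grid_search` holds the last classifier fitted (XGBClassifier),
# which is not necessarily the best-scoring model in `df_model_score`
best_model = grid_search.best_estimator_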
X_test = titanic_test

# Make predictions
predictions = best_model.predict(X_test)

# Assuming that the test DataFrame is ordered correctly
titanic_test["PassengerId"] = range(892, 892 + len(titanic_test))

# Create a DataFrame with the passenger ids and the corresponding predictions
submission = pd.DataFrame({"PassengerId": titanic_test["PassengerId"], "Survived": predictions})

# Save the DataFrame to a CSV file
submission.to_csv("submission.csv", index=False)

In [None]:
# !kaggle competitions submit -f submission.csv -m ""


In [None]:
# !kaggle competitions submissions titanic


In [None]:
df_leaderboard = pd.read_csv("leaderboard.csv")
df_leaderboard.head()


In [None]:
df_myscore = pd.DataFrame({"ScoreId": [0, 1], "Score": [0.75358, 0.7703]})
df_myscore.head()

In [None]:
plt.figure()
sns.histplot(df_leaderboard, x="Score")
plt.xlim([0.7, 0.85])

In [None]:
import matplotlib.patches as mpatches

plt.figure(figsize=(10, 5))
plt.hist(df_leaderboard["Score"], bins=300, alpha=0.5, label="Leaderboard Scores")
for index, row in df_myscore.iterrows():
    plt.axvline(row["Score"], color="r", linestyle="dashed", linewidth=2)


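# Build one legend with both entries; a second plt.legend() call would replace the first
hist_handles, _ = plt.gca().get_legend_handles_labels()
red_patch = mpatches.Patch(color="red", label="Your Scores")
plt.xlim([0.7, 0.85])
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.legend(handles=hist_handles + [red_patch])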
plt.title("Leaderboard Scores and Your Score")
plt.show()
