<a href="https://colab.research.google.com/github/davidofitaly/06_classification_projects/blob/main/Untitled296.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Table of contents:
1. [Import of libraries](#0)
2. [Data loading](#1)
3. [Label Encoder](#2)
4. [Pandas get_dummies()](#3)
5. [Data description](#4)

### <a name='0'> </a> Import of libraries

In [38]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import sklearn
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff


# Set the font scale for Seaborn plots
sns.set(font_scale=1.3)

np.set_printoptions(precision=6, suppress=True, edgeitems=10, linewidth=100000,
                    formatter= dict(float=lambda x: f'{x:.2f}'))

# Print the version of the imported libraries for reference
print(f'Pandas: {pd.__version__}')
print(f'Numpy: {np.__version__}')
print(f'Sklearn: {sklearn.__version__}')
print(f'Seaborn: {sns.__version__}')

Pandas: 2.2.2
Numpy: 1.26.4
Sklearn: 1.5.2
Seaborn: 0.13.1


### <a name='1'> </a> Data loading

In [39]:
titanic_df = sns.load_dataset('titanic')



In [40]:
titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [41]:
# Convert all string (object) columns to the memory-efficient 'category' dtype
object_columns = titanic_df.select_dtypes(include=['object']).columns
titanic_df[object_columns] = titanic_df[object_columns].astype('category')


In [42]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    category
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    category
 8   class        891 non-null    category
 9   who          891 non-null    category
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    category
 13  alive        891 non-null    category
 14  alone        891 non-null    bool    
dtypes: bool(2), category(7), float64(2), int64(4)
memory usage: 50.8 KB
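
The `info()` output above shows that `age`, `embarked`, `deck`, and `embark_town` have fewer than 891 non-null values. A minimal sketch of how missing values could be counted per column follows. It uses a small hypothetical frame (`toy_df`, with made-up rows) instead of the full dataset, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df (values are illustrative)
toy_df = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0],
    'deck': [None, 'C', None, 'C'],
    'fare': [7.25, 71.2833, 7.925, 53.1],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()

# Fraction of missing values in each column
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share)
```

The same two calls can be applied to `titanic_df` directly to decide which columns need imputation or removal before modelling.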


### <a name='2'> </a> Label Encoder

In [44]:
from sklearn.preprocessing import LabelEncoder

# Encode the 'alive' labels as integers (classes are sorted: 'no' -> 0, 'yes' -> 1)
le = LabelEncoder()
titanic_df['alive'] = le.fit_transform(titanic_df['alive'])

In [45]:
titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,0,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,1,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,1,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,1,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,0,True
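
To check which integer each label received, the fitted encoder can be inspected. The sketch below assumes a short hypothetical sample of the `alive` column (`alive_sample`) and not the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the 'alive' column
alive_sample = pd.Series(['no', 'yes', 'yes', 'no'])

le = LabelEncoder()
encoded = le.fit_transform(alive_sample)

# classes_ is sorted, and a label's position in it is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# inverse_transform recovers the original labels from the codes
decoded = le.inverse_transform(encoded)

print(mapping)
print(list(decoded))
```

`LabelEncoder` is intended for target labels. For input features with more than two categories, the integer codes imply an order that may not exist, which is one reason the next section uses one-hot encoding.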


### <a name='3'> </a> Pandas get_dummies()

In [47]:
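# One-hot encode the categorical columns; drop_first=True drops the first level
# of each column, because it is implied by the remaining dummy columns
titanic_df = pd.get_dummies(data=titanic_df, drop_first=True, dtype=int)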

titanic_df.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,adult_male,alive,alone,sex_male,...,who_man,who_woman,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.25,True,0,False,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1,1,38.0,1,0,71.2833,False,1,False,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,3,26.0,0,0,7.925,False,1,True,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1,1,35.0,1,0,53.1,False,1,False,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0,3,35.0,0,0,8.05,True,0,True,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
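
The effect of `drop_first=True` is easier to see on a small example. The sketch below uses a hypothetical three-row frame (`toy_df`) with one numeric and one categorical column.

```python
import pandas as pd

# Hypothetical toy frame: one numeric and one categorical column
toy_df = pd.DataFrame({
    'fare': [7.25, 71.28, 7.92],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown'],
})

# Numeric columns pass through unchanged; the categorical column is expanded.
# The first level in sorted order ('Cherbourg') is dropped.
dummies = pd.get_dummies(toy_df, drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

A row with zeros in both `embark_town_Queenstown` and `embark_town_Southampton` corresponds to the dropped level, `Cherbourg`, so no information is lost.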


### <a name='4'> </a> Data description


The Titanic dataset describes passengers of the RMS Titanic, which sank in 1912. The goal of the analysis is to predict whether a given passenger survived the disaster, based on the features listed below.

### Features

- **survived**: Indicates whether the passenger survived the disaster (1 = survived, 0 = did not survive).
- **pclass**: The ticket class of the passenger (1 = first, 2 = second, 3 = third).
- **sex**: The gender of the passenger (male, female).
- **age**: The age of the passenger in years.
- **sibsp**: The number of siblings/spouses aboard.
- **parch**: The number of parents/children aboard.
- **fare**: The ticket fare in British pounds.
- **embarked**: The port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

### Target: Survival

- **survived**: The target variable, indicating whether the passenger survived (the `alive` column carries the same information as a label):
  - `0`: Did not survive
  - `1`: Survived

The dataset contains **891** passenger records, each described by the above features. Training a classifier on these records can show which characteristics were associated with survival.
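
As a possible next step, the target can be separated from the features and the data split into training and test sets. The sketch below is one way to do this, not necessarily the approach taken later in this project. It uses a hypothetical eight-row frame (`toy_df`, made-up values) in place of the encoded `titanic_df`. Note that `alive` duplicates `survived`, so it is dropped from the features to avoid leaking the target.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the encoded titanic_df
toy_df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0, 0, 1, 0],
    'alive':    [0, 1, 1, 1, 0, 0, 1, 0],
    'pclass':   [3, 1, 3, 1, 3, 3, 1, 3],
    'fare':     [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08],
})

# 'alive' is a copy of the target, so it must not be used as a feature
X = toy_df.drop(columns=['survived', 'alive'])
y = toy_df['survived']

# Stratify so both classes appear in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

The `test_size` and `random_state` values here are arbitrary choices for illustration.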
