### Titanic Dataset - Model Training and Evaluation
##### Author: Danilo Jelovac  
##### Dataset Source: [Kaggle - M Yasser H.](https://www.kaggle.com/datasets/yasserh/titanic-dataset/data)
---

> *“A grand ship, a tragic story — and a dataset that still teaches us about people, chance, and survival.”*

This notebook represents **Part 3: Model Training and Evaluation**.  
After data cleaning and exploratory analysis, our goal in this notebook is to:
- Train several machine learning models on preprocessed Titanic data
- Evaluate how well each model predicts Survival
- Rank them by their performance
- Save the prepared dataframe so the best-performing model can be retrained and saved in a later script

We are not aiming for complex ML.
The focus is on beginner-friendly model comparison and understanding how accuracy changes across algorithms.

---


### Table of Contents

1. [Introduction](#introduction)  
2. [Importing Libraries](#importing-libraries)
3. [Loading data and preview](#loading-data-and-preview)
4. [ML data preparation](#ml-data-preparation)  
5. [Model training](#model-training)    
6. [Ranking the models](#ranking-the-models)
7. [Saving prepared dataframe](#saving-prepared-dataframe)
8. [Conclusions & Next Steps](#conclusions--next-steps)

---

### Introduction

In this notebook, we apply several basic machine learning classification models to the Titanic dataset.
Each model learns patterns from the data and attempts to predict whether a passenger survived.
Machine learning here is used to:
- Test whether our cleaned dataset contains useful predictive features
- Compare algorithm performance
- Select the best model for simple survival prediction

No deep mathematical knowledge is required: we rely on scikit-learn’s built-in implementations.

The Titanic dataset remains one of the best introductory datasets for practicing  
**model training, evaluation, and comparison** on data that has already been inspected and cleaned.

---

##### `Importing Libraries`

In [1]:
# -------
# Imports:
# -------

try:
    import pandas as pd
    
    # import matplotlib.pyplot as plt
    # import seaborn as sns
    
    # --
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    # --
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import LinearSVC
    # --
    from sklearn.metrics import classification_report
    
    print("If you see this message, the libraries were loaded successfully!")
except ModuleNotFoundError:
    print("Modules not found! Please check if modules are installed.")
    print("Check 'README.md' under 'Requirements' for info.")

If you see this message, the libraries were loaded successfully!


##### `Loading data and preview`

In [2]:
# ----------------------------------------
# Loading the dataset, previewing the data:
# ----------------------------------------


# ---------------------------
FOLDER_NAME = 'datasets'
FILE_NAME = 'Titanic-Dataset'
EXTENSION = '.csv'

FILE_PATH = f'../{FOLDER_NAME}/{FILE_NAME}_Cleaned_MLRdy{EXTENSION}'
# ---------------------------

# --Loading and previewing the dataset:
try:
    data = pd.read_csv(FILE_PATH)
    print(f"File '{FILE_NAME}_Cleaned_MLRdy{EXTENSION}' loaded!\n")
except FileNotFoundError:
    print(f"File not found! Please check if path is correct: '{FILE_PATH}'")
    raise  # -- stop here: the cells below need `data`


print("Previewing data:\n---------------")
display(data.head(5))


File 'Titanic-Dataset_Cleaned_MLRdy.csv' loaded!

Previewing data:
---------------


| | PassengerId | Title | Name | Sex | Age | FamilyStatus | Class | Fare | Survived | SurvivedBinary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Mr | Braund, Mr. Owen Harris | male | 22.0 | with_family | 3 | 7.25 | no | 0 |
| 1 | 2 | Mrs | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | with_family | 1 | 71.2833 | yes | 1 |
| 2 | 3 | Miss | Heikkinen, Miss. Laina | female | 26.0 | single | 3 | 7.925 | yes | 1 |
| 3 | 4 | Mrs | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | with_family | 1 | 53.1 | yes | 1 |
| 4 | 5 | Mr | Allen, Mr. William Henry | male | 35.0 | single | 3 | 8.05 | no | 0 |
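
Before training, it can help to confirm that the loaded frame has no missing values and to look at how the two target classes are balanced. The sketch below is a minimal illustration, assuming a small stand-in frame (`sample`, a hypothetical name) built from the five preview rows rather than the full `data` frame.

```python
import pandas as pd

# Stand-in for `data`: the five preview rows shown above (subset of columns).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Class": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "SurvivedBinary": [0, 1, 1, 1, 0],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of each target class; a strong imbalance would make accuracy misleading.
class_share = sample["SurvivedBinary"].value_counts(normalize=True)
print(class_share)
```

On the real notebook, the same two checks would be run on `data` directly.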


##### `ML data preparation`

In [3]:
# --------------------------------------------------
# Double check: preparing data, removing unnecessary:
# --------------------------------------------------


# -- Dropping unnecessary columns, saving to new dataframe:
# -- 'PassengerId' and 'Name' are identifiers; 'Survived' duplicates the target 'SurvivedBinary'.
columns_to_drop = ['PassengerId', 'Name', 'Survived']
ml_data = data.drop(columns=columns_to_drop)

print("===== CHECKING THE DATA =====")

# -- Checking new dataframe:
display(ml_data.head())
ml_data.info()  # -- info() prints its report itself and returns None, so no display() is needed


===== CHECKING THE DATA =====


| | Title | Sex | Age | FamilyStatus | Class | Fare | SurvivedBinary |
|---|---|---|---|---|---|---|---|
| 0 | Mr | male | 22.0 | with_family | 3 | 7.25 | 0 |
| 1 | Mrs | female | 38.0 | with_family | 1 | 71.2833 | 1 |
| 2 | Miss | female | 26.0 | single | 3 | 7.925 | 1 |
| 3 | Mrs | female | 35.0 | with_family | 1 | 53.1 | 1 |
| 4 | Mr | male | 35.0 | single | 3 | 8.05 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Title           889 non-null    object 
 1   Sex             889 non-null    object 
 2   Age             889 non-null    float64
 3   FamilyStatus    889 non-null    object 
 4   Class           889 non-null    int64  
 5   Fare            889 non-null    float64
 6   SurvivedBinary  889 non-null    int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 48.7+ KB
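
The `Survived` column is dropped because it carries the same information as the target `SurvivedBinary`. Keeping it as a feature would let a model read the answer directly. A minimal sketch of how that redundancy could be checked before dropping, assuming a stand-in frame (`sample`, a hypothetical name) built from the preview rows and the `no`/`yes` labels seen there:

```python
import pandas as pd

# Stand-in for `data`: the two target columns from the five preview rows.
sample = pd.DataFrame({
    "Survived": ["no", "yes", "yes", "yes", "no"],
    "SurvivedBinary": [0, 1, 1, 1, 0],
})

# Map the text label to 0/1 and compare it row by row with the numeric column.
mapped = sample["Survived"].map({"no": 0, "yes": 1})
is_redundant = bool((mapped == sample["SurvivedBinary"]).all())
print(f"'Survived' duplicates 'SurvivedBinary': {is_redundant}")
```

If the check ever came back `False`, the two columns would disagree somewhere and the cleaning step would be worth revisiting.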

##### `Model training`

In [4]:
# ------------------------------
# Training and evaluating models:
# ------------------------------


data_for_prediction = ml_data[['Title', 'Sex', 'Age', 'FamilyStatus', 'Class', 'Fare']]

# -- Model preparation:
X = data_for_prediction
y = ml_data['SurvivedBinary']

# -- Splitting into training and test sets (20% held out so models are scored on unseen rows):
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -- Grouping features by type (categorical vs. numerical):
numerical_features = ['Age', 'Class', 'Fare']
categorical_features = ['Title', 'Sex', 'FamilyStatus']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('num', MinMaxScaler(), numerical_features)
    ]
)

# -- Choosing models:
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Linear SVM": LinearSVC()
}

# -- Training...
results = {}

for name, model in models.items():
    print(f"\n===== MODEL -> [{name}] =====\n{'-'*42}")
    
    # -- preprocessing and model:
    clf = Pipeline(steps=[
        ('preprocess', preprocessor),
        ('model', model)
    ])
    
    # -- training:
    clf.fit(X_train, y_train)
    
    # -- prediction:
    y_pred = clf.predict(X_test)
    
    # -- report:
    print(classification_report(y_test, y_pred), "\n", "-"*42)
    
    # -- saving accuracy data:
    results[name] = clf.score(X_test, y_test)
    
    


===== MODEL -> [Logistic Regression] =====
------------------------------------------
              precision    recall  f1-score   support

           0       0.86      0.83      0.84       109
           1       0.74      0.78      0.76        69

    accuracy                           0.81       178
   macro avg       0.80      0.80      0.80       178
weighted avg       0.81      0.81      0.81       178
 
 ------------------------------------------

===== MODEL -> [Naive Bayes] =====
------------------------------------------
              precision    recall  f1-score   support

           0       0.85      0.82      0.83       109
           1       0.73      0.77      0.75        69

    accuracy                           0.80       178
   macro avg       0.79      0.79      0.79       178
weighted avg       0.80      0.80      0.80       178
 
 ------------------------------------------

===== MODEL -> [Decision Tree] =====
------------------------------------------
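
One detail of the preprocessing above is worth illustrating: `MultinomialNB` only accepts non-negative feature values. `MinMaxScaler` maps each numeric column to the range [0, 1], so it is compatible with every model in the list. The notebook does not state this as the reason for the choice, so treat it as an assumption. The sketch uses synthetic numbers (`X_demo`, `y_demo` are hypothetical names), not the Titanic data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for two numeric columns (think Age and Fare); not real data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(low=1.0, high=80.0, size=(20, 2))
y_demo = np.array([0, 1] * 10)

# MinMaxScaler maps each column to [0, 1], so MultinomialNB accepts it.
X_minmax = MinMaxScaler().fit_transform(X_demo)
MultinomialNB().fit(X_minmax, y_demo)
minmax_ok = True

# StandardScaler centers each column on 0, which produces negative values.
X_standard = StandardScaler().fit_transform(X_demo)
try:
    MultinomialNB().fit(X_standard, y_demo)
    standard_ok = True
except ValueError as err:
    standard_ok = False
    print(f"MultinomialNB rejected standardized input: {err}")
```

Swapping in a scaler that can output negative values would therefore require replacing `MultinomialNB` as well, for example with `GaussianNB`.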
         

##### `Ranking the models`

In [5]:
# --------
# Rankings:
# --------

print("\n=== MODEL ACCURACY RANKING ===\n")

for model_name, score in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"{model_name}: {score:.4f}")

print("-"*30)



=== MODEL ACCURACY RANKING ===

Logistic Regression: 0.8090
Linear SVM: 0.8090
Naive Bayes: 0.7978
Random Forest: 0.7978
Decision Tree: 0.7416
------------------------------
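
The ranking above comes from a single train/test split, and the Decision Tree and Random Forest are created without a `random_state`, so their scores can shift between runs. A common way to get a steadier comparison is k-fold cross-validation. The sketch below is a minimal illustration on synthetic data (`demo`, `target`, and `candidates` are hypothetical names, and the target is made up), not a rerun of the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same kind of columns as `ml_data`; not real data.
rng = np.random.default_rng(42)
n_rows = 200
demo = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n_rows),
    "Age": rng.uniform(1, 80, size=n_rows),
    "Fare": rng.uniform(5, 100, size=n_rows),
})
# Made-up target that depends on Sex (with 20% noise), so there is a pattern to learn.
target = ((demo["Sex"] == "female") ^ (rng.random(n_rows) < 0.2)).astype(int)

preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ("num", MinMaxScaler(), ["Age", "Fare"]),
])

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    clf = Pipeline(steps=[("preprocess", preprocessor), ("model", model)])
    # Five folds: each row is used for testing exactly once.
    scores = cross_val_score(clf, demo, target, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())

# Rank by mean accuracy and show the spread across folds.
for name, (mean, std) in sorted(cv_results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: {mean:.4f} (+/- {std:.4f})")
```

Because the preprocessing sits inside the `Pipeline`, the encoder and scaler are refit on each training fold, which keeps test-fold information out of the fit.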


##### `Saving prepared dataframe`

In [6]:
# --------------------
# Saving the dataframe:
# --------------------


# ---------------------------
NEW_FILE_NAME = 'Titanic-Dataset_ML_Train'
# EXTENSION = '.csv'

SAVE_FILE_PATH = f'../{FOLDER_NAME}/{NEW_FILE_NAME}{EXTENSION}'
# ---------------------------

try:
    ml_data.to_csv(SAVE_FILE_PATH, index=False)
    print(f"File '{NEW_FILE_NAME}' saved! Path: '{SAVE_FILE_PATH}'.")
except Exception as e:
    print(f"Something went wrong... Error: {e}")

File 'Titanic-Dataset_ML_Train' saved! Path: '../datasets/Titanic-Dataset_ML_Train.csv'.
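
The `index=False` argument in the save call matters when the file is read back later. A small round-trip sketch, assuming an in-memory buffer in place of a real file and a two-row stand-in frame (`frame` is a hypothetical name):

```python
import io

import pandas as pd

# Tiny stand-in for `ml_data` (two preview rows, subset of columns).
frame = pd.DataFrame({"Sex": ["male", "female"], "SurvivedBinary": [0, 1]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)      # ['Unnamed: 0', 'Sex', 'SurvivedBinary']

# index=False: only the real columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)   # ['Sex', 'SurvivedBinary']
```

Without `index=False`, the extra `Unnamed: 0` column would have to be dropped again before training.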


##### `Conclusions`

- Multiple ML algorithms were tested on the Titanic dataset
- Models were compared based on accuracy
- This completes the core ML step in the Titanic project
- The next phase is integrating everything into a clean project structure and pushing to GitHub

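Across all trained models, the differences in test accuracy were small. The two best-performing models on this dataset were:
- Logistic Regression
- Linear SVM

These models achieved the highest accuracy (0.8090) and the most balanced precision and recall scores.
On a 178-row test set, the gap to Naive Bayes and Random Forest (0.7978) is only two passengers, so the ranking is indicative rather than conclusive.
They are the leading candidates for final model training in the production script.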

---

##### `Next Steps`

The final model will be retrained and saved inside a dedicated Python script (`train_model.py`) located in the `model_source/` directory.
That script will produce a **.pkl model file** that can be used for predictions in separate applications.
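
As a rough idea of what producing and reusing a `.pkl` file could look like, here is a minimal sketch using Python's standard `pickle` module. It is not the contents of `train_model.py`: the data is synthetic, the pipeline is simplified, and the file name `titanic_model.pkl` is a hypothetical placeholder written to a temporary folder.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic numeric data; not the Titanic features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 2))
y_demo = (X_demo[:, 0] > 50).astype(int)

# Preprocessing and model are bundled so they are saved and loaded together.
clf = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=200)),
])
clf.fit(X_demo, y_demo)

# "titanic_model.pkl" is a hypothetical file name, written to a temp folder here.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "titanic_model.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump(clf, fh)
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

# The restored pipeline should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(f"Predictions match after reload: {same_predictions}")
```

Saving the whole `Pipeline` rather than only the model means the application that loads the file can pass in raw feature values. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.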

---


##### `About the Author`

**Danilo Jelovac** — Aspiring Data Analyst & Python Developer  
Focused on clean, understandable code and data-driven storytelling. 
 
> Portfolio: *[[GitHub link](https://github.com/d-jlvc/data-ml-portfolio)]*  
> LinkedIn: *[[LinkedIn link](https://www.linkedin.com/in/danilo-jelovac-b1b7a5396/)]*  