# Logistic Regression Exercise

Now it's your turn to implement logistic regression on a new data set. For this purpose we use the Titanic dataset. It includes personal information about 891 passengers on the Titanic as well as whether they survived the sinking or died.

Here’s the **Data Dictionary** of the dataset:

- PassengerId: unique identifier of each passenger (integer)

- Survived: whether the passenger survived (0 = no, 1 = yes)

- Pclass: travel class of the passenger (1 = first, 2 = second, 3 = third)

- Name: the name of the passenger

- Sex: gender

- Age: age of passengers

- SibSp: number of siblings/spouses aboard

- Parch: number of parents/children aboard

- Ticket: Ticket number

- Fare: the price paid for the ticket

- Cabin: cabin number

- Embarked: the port at which the passenger embarked
    - C: Cherbourg, S: Southampton, Q: Queenstown


You will find the data in the data folder (it's a zip folder, so you first have to unzip it).
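Unzipping can also be done from Python. The following is a minimal sketch, assuming the archive is called `data/titanic.zip` (a hypothetical name; use whatever the file in your data folder is actually called) and that its contents should land in the `data` folder. The helper name `unzip_data` is made up for illustration.

```python
import zipfile
from pathlib import Path


def unzip_data(archive_path, target_dir):
    """Extract all files from a zip archive and return the extracted names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Hypothetical archive name; adjust it to the actual file in the data folder.
archive = Path("data/titanic.zip")
if archive.exists():
    print(unzip_data(archive, "data"))
```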


## What you should do:

- conduct a brief EDA to become familiar with the data
- use Logistic Regression to predict if a passenger died or not

## How to do it:

Time is short, so aim for the simplest viable product first:
1. Load the data

2. Separate features and target 

3. Split the data in train and test

4. Get a quick overview of the train data

5. Agree on a classification metric for the task

6. Create a simple heuristic/educated guess for the classification first. This is called a "baseline model". It serves as a reference point for the more complex models you build later (in this case: logistic regression). As a data scientist you want to show how much your work/ML improves the business metric, so you need a baseline model for comparison. In some cases you want to improve on a model that already exists in your company, which is then your baseline model. In other cases, there are typical baseline models used in the specific field. For other tasks, you have to come up with a simple but meaningful idea of how to classify the data based on your business understanding (EDA). A baseline model should follow the principle of Occam’s razor: prefer the simplest model that does the job.
    - Example of a baseline model: 
    If the task is to classify cats and dogs, a baseline model could be: classify every animal as a cat if its weight is < 5 kg, otherwise as a dog. (The value of 5 kg is an educated guess, based on our business understanding/EDA.)

7. Use one or two already numerical features to create a simple first model
    - Did it even beat your baseline model?

8. Now you can go through the data science lifecycle again and again:
    - clean the data better

    - get more insights with EDA

    - add more features

    - do feature engineering

    and check if your work improves your model further!

9. Stop whenever time is up or you cannot improve your model any further.
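The cat/dog baseline from the list above boils down to a single rule. Here is a minimal sketch in plain Python; the animals and their weights are made up for illustration, and the function name `baseline_cat_or_dog` is an assumption, not part of any library.

```python
def baseline_cat_or_dog(weight_kg, threshold_kg=5.0):
    """Classify an animal by weight alone: lighter than the threshold means cat."""
    return "cat" if weight_kg < threshold_kg else "dog"


# Made-up example animals: (weight in kg, true label)
animals = [(3.8, "cat"), (4.5, "cat"), (6.2, "cat"), (4.0, "dog"), (12.0, "dog"), (30.5, "dog")]

predictions = [baseline_cat_or_dog(weight) for weight, _ in animals]
correct = sum(pred == label for pred, (_, label) in zip(predictions, animals))
accuracy = correct / len(animals)
print(f"Baseline accuracy: {accuracy:.2f}")
```

Any later model has to beat the score of such a rule to justify its added complexity.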

This repo contains a solution to this problem. If you want to compare your final result with the repo's solution, use **25** as the random seed and a test size of 30% for your train-test split.

In [2]:
import pandas as pd

df = pd.read_csv("data/titanic.csv")

df.head()
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [3]:
y = df["Survived"]
X = df.drop(columns=["Survived"])


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,
    random_state=25
)


In [5]:
y_train.mean()


0.38362760834670945
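The survival rate in the training data already implies the simplest possible baseline: always predict the majority class. Below is a minimal sketch in plain Python, with the rate copied from the output above; the helper name `majority_class_accuracy` is made up for illustration, and the resulting number describes the training data, so the test accuracy may differ.

```python
def majority_class_accuracy(positive_rate):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(positive_rate, 1 - positive_rate)


survival_rate_train = 0.38362760834670945  # y_train.mean() from the cell above
print(f"{majority_class_accuracy(survival_rate_train):.3f}")
```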

In [6]:
pd.crosstab(X_train["Sex"], y_train, normalize="index")


| Sex | Survived = 0 | Survived = 1 |
|--------|----------|----------|
| female | 0.252252 | 0.747748 |
| male | 0.817955 | 0.182045 |


Let's see how accurate a prediction based on sex alone is: predict that every woman survives and every man dies.

In [8]:
y_pred_baseline = (X_test["Sex"] == "female").astype(int)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_baseline)



0.7723880597014925
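Accuracy is only one candidate metric. The sketch below computes accuracy, precision and recall from true and predicted labels in plain Python. The labels are made-up toy values, not Titanic results, and the function name `classification_metrics` is an assumption for illustration; `sklearn.metrics` provides `precision_score` and `recall_score` for the same purpose.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels for illustration only
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(y_true_toy, y_pred_toy))
```

Which metric fits best depends on which kind of error matters more for the task at hand.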

Let's now make our first logistic regression model based on numerical values only.

In [10]:
features = ["Age", "Fare"]

X_train_simple = X_train[features]
X_test_simple = X_test[features]

# Fill in missing values with the median of the training data
train_median = X_train_simple.median()
X_train_simple = X_train_simple.fillna(train_median)
X_test_simple = X_test_simple.fillna(train_median)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_simple, y_train)

y_pred_simple = lr.predict(X_test_simple)
accuracy_score(y_test, y_pred_simple)




0.6716417910447762
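Filling the test data with medians learned from the training data can also be delegated to scikit-learn. The following is a sketch of one possible alternative, not the approach used in this notebook: a pipeline of `SimpleImputer` and `LogisticRegression`, shown on synthetic stand-in data (all `toy` names and values are made up).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with missing ages (not the Titanic data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
toy_target = (toy["Fare"] > 50).astype(int)
toy.loc[rng.choice(n, size=30, replace=False), "Age"] = np.nan

toy_train, toy_test = toy.iloc[:150], toy.iloc[150:]
target_train, target_test = toy_target.iloc[:150], toy_target.iloc[150:]

# The imputer learns the medians from the training data during fit()
# and reuses them when the test data passes through predict()/score().
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression(max_iter=1000))
model.fit(toy_train, target_train)
print(model.score(toy_test, target_test))
```

Bundling both steps in one object makes it harder to accidentally compute imputation values from the test data.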

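So sex is a dominant feature: with Age and Fare alone, the model reaches about 67% accuracy, well below the 77% of the sex-only baseline. A model without sex throws away crucial information.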

In [13]:
X_train_better = X_train[["Age", "Fare", "Sex"]].copy()
X_test_better = X_test[["Age", "Fare", "Sex"]].copy()

X_train_better["Sex"] = (X_train_better["Sex"] == "female").astype(int)
X_test_better["Sex"] = (X_test_better["Sex"] == "female").astype(int)

# Fill missing values with the median of the training data
train_median = X_train_better.median()
X_train_better = X_train_better.fillna(train_median)
X_test_better = X_test_better.fillna(train_median)

lr.fit(X_train_better, y_train)
y_pred_better = lr.predict(X_test_better)
accuracy_score(y_test, y_pred_better)


0.7574626865671642

The logistic regression (75.7%) almost matches the baseline (77.2%), but doesn’t beat it yet.

In [18]:
features = ["Age", "Fare", "Sex", "Pclass", "SibSp", "Parch", "Embarked"]

X_train_full = X_train[features].copy()
X_test_full = X_test[features].copy()

# Convert Sex
X_train_full["Sex"] = (X_train_full["Sex"] == "female").astype(int)
X_test_full["Sex"] = (X_test_full["Sex"] == "female").astype(int)

# Fill missing values
X_train_full["Age"] = X_train_full["Age"].fillna(X_train_full["Age"].median())
X_test_full["Age"] = X_test_full["Age"].fillna(X_train_full["Age"].median())
X_train_full["Embarked"] = X_train_full["Embarked"].fillna("S")  # most common port
X_test_full["Embarked"] = X_test_full["Embarked"].fillna("S")

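X_train_full = pd.get_dummies(X_train_full, columns=["Embarked"], drop_first=True)
X_test_full = pd.get_dummies(X_test_full, columns=["Embarked"], drop_first=True)
# Make sure the test data has the same dummy columns as the training data
X_test_full = X_test_full.reindex(columns=X_train_full.columns, fill_value=0)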
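One caveat when calling `pd.get_dummies` separately on train and test: if a category is missing from one split, the two frames end up with different columns, and the model can no longer predict on the test data. Here is a small sketch on made-up toy data (not the Titanic split) that shows the mismatch and one way to align the columns with `DataFrame.reindex`.

```python
import pandas as pd

# Toy data: the test split happens to contain no passenger from port "Q"
toy_train = pd.DataFrame({"Embarked": ["C", "Q", "S", "S"]})
toy_test = pd.DataFrame({"Embarked": ["C", "S"]})

train_dummies = pd.get_dummies(toy_train, columns=["Embarked"], drop_first=True)
test_dummies = pd.get_dummies(toy_test, columns=["Embarked"], drop_first=True)
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# Align the test columns to the training columns; missing dummies become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```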



In [19]:
lr.fit(X_train_full, y_train)
y_pred_full = lr.predict(X_test_full)
accuracy_score(y_test, y_pred_full)


0.7835820895522388

This score is the same as without "Embarked" (I checked), so adding this feature doesn't change the test accuracy.

Summary of the test accuracies:

- Baseline (sex only) → 77.2%

- Age + Fare → 67.2% (worse than baseline)

- Add Sex → 75.7%

- Add Pclass + SibSp + Parch → 78.4%

- Add Embarked → same 78.4% (no change)
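To put these percentages into perspective, the accuracies can be converted back into counts of correctly classified test passengers. The sketch below assumes a test set of 268 rows (30% of 891, rounded up, which is how `train_test_split` sizes the test set) and uses the accuracy values printed in the cells above.

```python
n_test = 268  # ceil(0.30 * 891) rows end up in the test set

accuracies = {
    "Baseline (sex only)": 0.7723880597014925,
    "Age + Fare": 0.6716417910447762,
    "Age + Fare + Sex": 0.7574626865671642,
    "All features": 0.7835820895522388,
}

# Turn each accuracy into the number of correctly classified test passengers
correct_counts = {name: round(acc * n_test) for name, acc in accuracies.items()}
for name, count in correct_counts.items():
    print(f"{name}: {count} of {n_test} correct")
```

The full model gets three more test passengers right than the baseline, so differences of this size should be read with caution.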

In [21]:
feature_names = X_train_full.columns
coefficients = lr.coef_[0]
intercept = lr.intercept_[0]

for f, c in zip(feature_names, coefficients):
    print(f"{f}: {c:.3f}")

print(f"Intercept: {intercept:.3f}")


Age: -0.047
Fare: 0.005
Sex: 2.640
Pclass: -0.888
SibSp: -0.396
Parch: -0.165
Embarked_Q: -0.103
Embarked_S: -0.291
Intercept: 2.216


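Positive coefficient → a higher feature value increases the predicted probability of survival (Sex is encoded as 1 = female)

Negative coefficient → a higher feature value decreases the predicted probability of survival

The coefficients are on the log-odds scale and depend on the units of each feature, so their magnitudes are not directly comparable.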
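The coefficients can be turned into a predicted probability by hand: sum the intercept and the weighted feature values, then apply the logistic function. This is a minimal sketch using the rounded coefficients printed above; the two passengers are made up, and `survival_probability` is a hypothetical helper, not part of scikit-learn.

```python
import math

# Coefficients and intercept copied from the output above
coefficients = {
    "Age": -0.047, "Fare": 0.005, "Sex": 2.640, "Pclass": -0.888,
    "SibSp": -0.396, "Parch": -0.165, "Embarked_Q": -0.103, "Embarked_S": -0.291,
}
intercept = 2.216


def survival_probability(passenger):
    """Apply the logistic function to the linear combination of the features."""
    z = intercept + sum(coefficients[name] * value for name, value in passenger.items())
    return 1 / (1 + math.exp(-z))


# Two made-up passengers (Sex: 1 = female, 0 = male); unlisted features count as 0
woman_first_class = {"Age": 30, "Fare": 50, "Sex": 1, "Pclass": 1, "Embarked_S": 1}
man_third_class = {"Age": 30, "Fare": 8, "Sex": 0, "Pclass": 3, "Embarked_S": 1}

print(f"{survival_probability(woman_first_class):.2f}")
print(f"{survival_probability(man_third_class):.2f}")

# exp(coefficient) is the factor by which the odds change per unit of the feature
print(f"Odds ratio for Sex: {math.exp(coefficients['Sex']):.1f}")
```

For a fitted model, `lr.predict_proba` returns these probabilities directly; the manual version only illustrates what the coefficients mean.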