# Logistic Regression with Python

For this project we will work with the [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic). It is a very famous data set and is often a student's first step in machine learning!

We'll be trying to predict a binary class: survived or deceased.
Let's see how to implement logistic regression in Python for classification.

## Import Libraries
Let's import some libraries to get started!
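
A typical import cell for this kind of notebook might look like the following. The exact set is an assumption, since the original cell is not shown here; it covers the libraries named in later sections (pandas, seaborn) plus NumPy and matplotlib, which they build on.

```python
import numpy as np                # numerical arrays
import pandas as pd               # tabular data (DataFrame)
import matplotlib.pyplot as plt   # base plotting library
import seaborn as sns             # statistical plots built on matplotlib

# In a Jupyter notebook, plots usually render inline by default; the
# "%matplotlib inline" magic is only needed on older setups.
```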

## The Data

Let's start by reading the `titanic_train.csv` file into a pandas DataFrame.

**Quick exploration: the first 5 rows of the data**
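
Reading the file and previewing it could look roughly like this. So that the snippet runs without the CSV on disk, it parses a tiny made-up sample with the same twelve columns; the rows and names in `SAMPLE_CSV` are invented for illustration and are not real passenger records.

```python
import io
import pandas as pd

# Made-up sample with the same columns as titanic_train.csv (illustration only).
# In the notebook, read the real file instead:
#     train = pd.read_csv("titanic_train.csv")
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,C3,7.92,,S
4,0,3,"Low, Mr. Sam",male,,0,0,D4,8.05,,Q
5,0,2,"Ray, Mr. Tom",male,54,0,0,E5,26.00,,S
6,1,2,"Kim, Mrs. Eve",female,27,0,2,F6,11.13,,C
"""

train = pd.read_csv(io.StringIO(SAMPLE_CSV))

# head() returns the first 5 rows by default; empty CSV fields become NaN.
print(train.head())
```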


**Titanic Dataset – Column Definitions**

**1. PassengerId**

A unique ID number assigned to each passenger (just for identification).

**2. Survived**

Whether the passenger survived the Titanic disaster.

* `0` = Did **not** survive
* `1` = **Survived**

**3. Pclass**

Passenger’s **ticket class** (a proxy for socio-economic status):

* `1` = First class (highest)
* `2` = Second class
* `3` = Third class (lowest)

**4. Name**

The full name of the passenger, including title (Mr., Mrs., Miss, etc.).

**5. Sex**

Passenger’s gender (`male` or `female`).

**6. Age**

Passenger’s age in years.
(Some values are missing → unknown age.)

**7. SibSp**

Number of **siblings** or **spouses** the passenger had onboard.
Examples:

* 1 sister + 1 brother → **2**
* Traveling with husband → **1**

**8. Parch**

Number of **parents** or **children** the passenger had onboard.
Examples:

* Traveling with mom → **1**
* Traveling with both parents + 1 child → **3**

**9. Ticket**

Ticket number the passenger used to board the ship.

**10. Fare**

Amount of money the passenger paid for the ticket.

**11. Cabin**

Cabin number assigned to the passenger.
Most values are missing (no cabin is recorded for the majority of passengers).

**12. Embarked**

Port where the passenger boarded the Titanic:

* `C` = Cherbourg
* `Q` = Queenstown
* `S` = Southampton

# Exploratory Data Analysis

Let's begin some exploratory data analysis! We'll start by checking out missing data!

## Missing Data

We can use seaborn to create a simple heatmap to see where we are missing data!
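
A minimal sketch of such a heatmap is shown here. The small `train` frame is a made-up stand-in for the Titanic data, and the `viridis` colour map is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up stand-in for the Titanic frame; NaN marks a missing value.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Age":      [22, 38, np.nan, 35, np.nan, 27],
    "Cabin":    [np.nan, "C85", np.nan, np.nan, np.nan, np.nan],
})

# isnull() gives a True/False frame; the heatmap colours True cells
# differently, so missing values show up as stripes in the affected columns.
ax = sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap="viridis")
ax.set_title("Missing values by column")

# The same information as numbers: share of missing values per column.
missing_share = train.isnull().mean()
print(missing_share)
```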

Roughly 20 percent of the Age data is missing, which is likely small enough to fill in with some form of imputation. The Cabin column, on the other hand, is missing too much data to be useful at a basic level. We'll probably drop it later, or turn it into another feature such as "Cabin Known: 1 or 0".

**Gender Distribution of The Passengers Based on Survival**

**Class Distribution of The Passengers Based on Survival**
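
The gender and class distributions named in the two headings above can both be drawn with seaborn's `countplot`, as in this sketch. The `train` frame is made-up illustration data, and the side-by-side layout is just one possible arrangement.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up stand-in for the Titanic frame (illustration only).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Sex":      ["male", "female", "female", "male",
                 "male", "female", "male", "male"],
    "Pclass":   [3, 1, 3, 3, 2, 2, 3, 1],
})

fig, (ax_sex, ax_class) = plt.subplots(1, 2, figsize=(10, 4))

# One bar per (Survived, hue) combination; hue splits each group by category.
sns.countplot(x="Survived", hue="Sex", data=train, ax=ax_sex)
sns.countplot(x="Survived", hue="Pclass", data=train, ax=ax_class)
ax_sex.set_title("Survival by sex")
ax_class.set_title("Survival by passenger class")

# The counts behind the bars, as tables.
sex_counts = pd.crosstab(train["Survived"], train["Sex"])
class_counts = pd.crosstab(train["Survived"], train["Pclass"])
print(sex_counts)
print(class_counts)
```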

**Age Distribution**

**Fare Distribution**
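
Histograms are one simple way to look at the age and fare distributions. The sketch below uses made-up values, and the bin counts (30 and 40) are arbitrary choices rather than anything prescribed by the data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the Titanic frame (illustration only).
train = pd.DataFrame({
    "Age":  [22, 38, np.nan, 35, 54, 2, 27, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 8.05],
})

fig, (ax_age, ax_fare) = plt.subplots(1, 2, figsize=(10, 4))

# Missing ages are dropped explicitly so only known values are counted.
known_ages = train["Age"].dropna()
ax_age.hist(known_ages, bins=30)
ax_age.set_xlabel("Age (years)")
ax_age.set_ylabel("Passengers")

ax_fare.hist(train["Fare"], bins=40)
ax_fare.set_xlabel("Fare")
ax_fare.set_ylabel("Passengers")
```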

## Data Cleaning
We want to fill in the missing ages instead of dropping every row that lacks one. One way to do this is to fill in the mean age of all passengers (imputation).
However, we can be smarter about this and use the average age within each passenger class.


**To check this, draw a boxplot with Pclass on the x-axis and Age on the y-axis, split by Sex**
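
A sketch of that boxplot, again on a small made-up frame, is given here. The group means printed at the end are the numbers the plot lets you compare visually.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up stand-in for the Titanic frame (illustration only).
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Sex":    ["male", "male", "female", "female"] * 3,
    "Age":    [45, 51, 38, 35, 30, 34, 28, 27, 22, 26, 21, np.nan],
})

# One box per (Pclass, Sex) pair; rows with a missing Age are ignored.
ax = sns.boxplot(x="Pclass", y="Age", hue="Sex", data=train)
ax.set_title("Age by passenger class and sex")

# Average age per group (mean() skips NaN by default).
mean_age = train.groupby(["Pclass", "Sex"])["Age"].mean()
print(mean_age)
```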

## Imputation for the Age
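
One way to carry out the class-based fill described above is sketched here: each missing age is replaced by the mean age of that passenger's class. Using the mean (rather than, say, the median) follows the "average age" wording above but is otherwise an assumption, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic frame (illustration only).
train = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age":    [40, np.nan, 30, 28, np.nan, 24],
})

# transform("mean") returns, for every row, the mean Age of that row's class.
class_mean_age = train.groupby("Pclass")["Age"].transform("mean")

# fillna only touches the rows where Age is missing.
train["Age"] = train["Age"].fillna(class_mean_age)
print(train)
```

Rows with a known age are left unchanged; only the two gaps are filled, with 40 for first class and 24 for third class in this toy data.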

Great! Let's go ahead and drop the Cabin column, since most of its values are NaN.
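
Dropping a column is a single call in pandas. The sketch below uses made-up data; the final `dropna()` step, which removes any rows that still contain a missing value, is an optional extra and an assumption about how one might finish the cleaning.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic frame (illustration only).
train = pd.DataFrame({
    "Age":      [22.0, 38.0, 26.0, 35.0],
    "Cabin":    [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Remove the mostly-empty Cabin column.
train = train.drop(columns="Cabin")

# Optional (assumption): drop the few rows that still have a missing value.
train = train.dropna()
print(train)
```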

## Converting Categorical Features 

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.
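
A sketch of this conversion with `pd.get_dummies` follows. The data is made up, and dropping `Name` and `Ticket` is an assumption: they are free-text columns that a basic model cannot use directly.

```python
import pandas as pd

# Made-up stand-in for the Titanic frame (illustration only).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 2],
    "Name":     ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Low, Mr. Sam"],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0],
    "Ticket":   ["A1", "B2", "C3", "D4"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Free-text columns are dropped here (an assumption, see the note above).
train = train.drop(columns=["Name", "Ticket"])

# drop_first=True removes one level per feature (female, C), because that
# level is implied when all remaining dummy columns are 0.
train = pd.get_dummies(train, columns=["Sex", "Embarked"],
                       drop_first=True, dtype=int)
print(train)
```

Without `drop_first=True` the dummy columns of each feature would always sum to 1, which makes them perfectly collinear.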

Great! Our data is ready for our model!

# Building a Logistic Regression model

Let's start by splitting our data into a training set and test set (there is another test.csv file that you can play around with in case you want to use all this data for training).

## Train Test Split
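
A minimal sketch of the split with scikit-learn's `train_test_split` is shown below. The data is randomly generated stand-in data, and the 30 percent test size and the `random_state` value are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for the cleaned Titanic frame (illustration only).
rng = np.random.default_rng(0)
n = 100
data = pd.DataFrame({
    "Pclass":   rng.integers(1, 4, size=n),
    "Age":      rng.uniform(1, 70, size=n).round(1),
    "Sex_male": rng.integers(0, 2, size=n),
    "Survived": rng.integers(0, 2, size=n),
})

X = data.drop(columns="Survived")   # features
y = data["Survived"]                # target

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)
print(X_train.shape, X_test.shape)
```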

## Training and Predicting
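
Fitting the model and predicting on the held-out set could look like this sketch. The data is synthetic, generated from a made-up rule so that there is something to learn, and `max_iter=1000` is an assumption that simply gives the solver room to converge.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustration only).
rng = np.random.default_rng(0)
n = 200
sex_male = rng.integers(0, 2, size=n)
pclass = rng.integers(1, 4, size=n)

# Made-up rule: women and passengers in higher classes survive more often.
p_survive = 1 / (1 + np.exp(-(2.0 - 2.5 * sex_male - 0.6 * pclass)))
survived = (rng.random(n) < p_survive).astype(int)

X = pd.DataFrame({"Pclass": pclass, "Sex_male": sex_male})
y = pd.Series(survived, name="Survived")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train, y_train)

predictions = logmodel.predict(X_test)                 # hard 0/1 labels
probabilities = logmodel.predict_proba(X_test)[:, 1]   # P(Survived = 1)
print(predictions[:10])
```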

Let's move on to evaluate our model!
## Evaluation
We can check precision, recall, and F1-score using scikit-learn's classification report!
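
A short sketch of the report, using hand-written labels in place of real model output:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written labels for illustration; in the notebook these would be the
# held-out targets and the model's predictions.
y_test      = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
predictions = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]

# One row per class with precision, recall, f1-score and support.
print(classification_report(y_test, predictions))

# Rows are actual classes, columns are predicted classes, labels sorted
# ascending, so for 0/1 labels the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, predictions)
print(cm)
```

Note that scikit-learn orders the matrix by label value, so the negative class (0) comes first.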


---

# **Classification Evaluation Metrics**

In classification problems, we compare the model’s predictions with the true labels using metrics such as **Precision**, **Recall**, and **F1-score**.

These metrics come from the **confusion matrix**:

|                     | Predicted Positive      | Predicted Negative      |
| ------------------- | ----------------------- | ----------------------- |
| **Actual Positive** | **True Positive (TP)**  | **False Negative (FN)** |
| **Actual Negative** | **False Positive (FP)** | **True Negative (TN)**  |

---

# **Definitions of TP, FP, FN, TN**

### **True Positive (TP)**

Cases where the model **correctly predicted Positive**.
Example: model says “disease” and the person actually has the disease.

### **False Positive (FP)**

Cases where the model **predicted Positive**, but the actual class was Negative.
Example: model says “spam”, but the email is actually normal.
(Also called **Type I Error**.)

### **False Negative (FN)**

Cases where the model **predicted Negative**, but the actual class was Positive.
Example: model says “no disease”, but the person is actually sick.
(Also called **Type II Error**.)

### **True Negative (TN)**

Cases where the model **correctly predicted Negative**.
Example: model says “not spam”, and the email is indeed not spam.
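
The four definitions above can be turned into a small counting routine. This is a plain-Python sketch; `confusion_counts` is a hypothetical helper name, not a library function, and the labels are made up.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # correctly predicted positive
        elif predicted == positive:
            fp += 1   # predicted positive, actually negative (Type I error)
        elif actual == positive:
            fn += 1   # predicted negative, actually positive (Type II error)
        else:
            tn += 1   # correctly predicted negative
    return tp, fp, fn, tn


# Made-up labels for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```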

---

# **1. Precision**

**Precision** tells us:

> “Out of all the predictions the model said were *positive*, how many were actually positive?”

It measures the **accuracy of positive predictions**.

$$
\text{Precision} = \frac{TP}{TP + FP}
$$


* High precision = very few **false positives**
* Useful when **false positives are costly**
  (e.g., spam detection → don’t mark real emails as spam)

---

# **2. Recall (Sensitivity / True Positive Rate)**

**Recall** tells us:

> “Out of all the actual positive cases, how many did the model correctly identify?”

It measures the model’s ability to **find all positive cases**.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$


* High recall = very few **false negatives**
* Useful when **missing a positive is costly**
  (e.g., disease detection → do not miss sick patients)

---

# **3. F1-Score**

**F1-score** is the **harmonic mean** of Precision and Recall.

It balances both metrics, giving a single score that considers both **false positives** and **false negatives**.

$$
\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

* High F1-score = good balance between precision and recall
* Most useful when:

  * Classes are **imbalanced**
  * You care about both catching positives **and** not raising false alarms
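
As a worked example of the three formulas so far, the sketch below computes precision, recall, and F1 from made-up counts. The helper names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not part of the definitions.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 if the model made no positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 30, 10, 20

p = precision(tp, fp)   # 30 / 40 = 0.75
r = recall(tp, fn)      # 30 / 50 = 0.60
score = f1(p, r)        # 2 * 0.75 * 0.60 / 1.35, about 0.667
print(round(p, 3), round(r, 3), round(score, 3))
```

Because the harmonic mean is pulled toward the smaller value, the F1-score here (about 0.667) sits below the arithmetic mean of 0.75 and 0.60.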

---

# **4. Support**

**Support** is simply:

> “How many actual samples of this class exist in the dataset?”

It does *not* measure performance; it only tells you how many samples actually belong to each class.

$$
\text{Support} = \text{Number of actual instances of each class}
$$


Support helps you evaluate:

* class imbalance
* how reliable the metrics are (small support → unstable or misleading values)
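
Support can be counted directly from the true labels. A minimal sketch with made-up labels:

```python
from collections import Counter

# Made-up true labels for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1]

# Support = number of actual samples per class; predictions play no role.
support = Counter(y_true)
print(support[0], support[1])  # 6 3

# Share of each class, a quick check for imbalance.
total = len(y_true)
class_share = {label: count / total for label, count in support.items()}
print(class_share)
```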

---

# **Summary of Interpretation**

| Metric        | Measures                          | Use When                    | Limitation                                                   |
| ------------- | --------------------------------- | --------------------------- | ------------------------------------------------------------ |
| **Precision** | Quality of positive predictions   | False positives must be low | Does not show how many positives the model misses            |
| **Recall**    | Ability to find all positives     | False negatives must be low | Does not penalize false positives                            |
| **F1-Score**  | Balance of precision and recall   | Need one combined score     | Adds little when classes are balanced and accuracy is enough |
| **Support**   | Count of actual samples per class | Check class imbalance       | Not a performance measure                                    |

---
