Can you solve this riddle?


What five-letter word typed in all capital letters can be read the same upside down?

## Supervised Learning

Supervised Learning = we have labeled data → input features (X) + correct output/target (y).

1. We train the model to learn the mapping: X → y
2. Goal: generalize well to new, unseen data.


Two main branches:

1. Regression → predict continuous numeric values
Examples: house price, temperature, salary

2. Classification → predict discrete categories/classes
Examples: spam/not spam, disease/no disease, will buy/not buy, cat/dog/sheep

# 1. Logistic Regression

Logistic regression helps us model the probability of something belonging to a particular class.

For example, given a patient's age, blood pressure, and cholesterol, what's the probability they have heart disease (yes or no)?

It uses a mathematical function called the logistic function (also known as the sigmoid function) to squeeze predictions into a range between 0 and 1, representing probabilities.

If the probability is above a threshold (usually 0.5), we classify it as "yes"; below, "no."
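A minimal sketch of the two steps just described, using only the Python standard library. The helper names `sigmoid` and `classify` are illustrative, and treating a probability of exactly 0.5 as "yes" is an assumption, since the rule above only covers values above and below the threshold.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability: float, threshold: float = 0.5) -> str:
    """Turn a probability into a yes/no label (ties go to "yes" by assumption)."""
    return "yes" if probability >= threshold else "no"


# Large negative scores map close to 0, large positive scores close to 1.
for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    print(f"z={z:+.1f} -> p={p:.3f} -> {classify(p)}")
```

Lowering the threshold makes the classifier say "yes" more often, which can be useful when missing a positive case (for example, heart disease) is costlier than a false alarm.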

# Recap of Linear Regression

Linear regression is for regression tasks in supervised learning: predicting continuous values.
- It fits a straight line: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon$ ($y$ is the output, $\epsilon$ is the error term).

- Goal: Minimize the sum of squared errors between predicted and actual values.

Output: Can be any real number, anywhere from −∞ to +∞.

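But what if we try using linear regression for classification? Say, predicting whether an email is spam (1) or not (0). The fitted line is unbounded, so it can return values below 0 or above 1, which cannot be read as probabilities.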
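The trouble with fitting a straight line to 0/1 labels shows up even on a tiny example. The sketch below uses hypothetical toy data (not the lab's dataset) and the closed-form least-squares fit for one feature. The variable names are illustrative.

```python
# Hypothetical toy data: one feature and a 0/1 label (not from the lab).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

# Closed-form least-squares fit of y = intercept + slope * x.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x


def predict(x: float) -> float:
    """Linear-regression prediction; nothing keeps it inside [0, 1]."""
    return intercept + slope * x


# Inputs far from the training range give "probabilities" outside [0, 1].
for x in (-10, 4.5, 20):
    print(f"x={x:>5}: prediction={predict(x):.3f}")
```

The predictions for the extreme inputs fall below 0 and above 1, which is the gap the sigmoid is meant to close.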


# From Linear to Logistic:

1. Start with the linear model: $z = \beta_0 + \beta_1 x_1 + \dots$

2. Apply sigmoid: $p = \frac{1}{1 + e^{-z}}$

For training, we use maximum likelihood estimation (not least squares). This maximizes the chance of observing the actual labels given the model's probabilities.

Intuitively: Adjust β coefficients so the model assigns high probability to correct classes.
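To make the likelihood idea concrete, here is a small sketch that scores two candidate coefficient pairs by their log-likelihood on hypothetical toy data. The function name `log_likelihood` and the coefficient values are made up for illustration; a real fit would search for the best pair instead of comparing two guesses.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta0: float, beta1: float, xs, ys) -> float:
    """Sum of log-probabilities the model assigns to the observed labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)  # model's P(y = 1 | x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical toy data: the label switches from 0 to 1 between x=3 and x=4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

uninformed = log_likelihood(0.0, 0.0, xs, ys)   # predicts p = 0.5 everywhere
informed = log_likelihood(-7.0, 2.0, xs, ys)    # boundary near x = 3.5

print(f"uninformed: {uninformed:.3f}")
print(f"informed:   {informed:.3f}")
```

The pair that puts high probability on the correct classes gets the higher (less negative) log-likelihood, which is the quantity training tries to maximize.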

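Why This Works for Classification:
The decision boundary is linear in the feature space: the model predicts p = 0.5 exactly where z = 0. The probability itself follows an S-shaped sigmoid curve as z changes.
The model is still "linear" at its core, because the log-odds of the positive class are a linear function of the features.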
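A short derivation of the log-odds relationship, using only the definitions of $z$ and $p$ given earlier:

$$
1 - p = \frac{e^{-z}}{1 + e^{-z}}, \qquad \frac{p}{1 - p} = \frac{1}{e^{-z}} = e^{z}, \qquad \log\frac{p}{1 - p} = z = \beta_0 + \beta_1 x_1 + \dots
$$

So $p \ge 0.5$ exactly when $z \ge 0$, and raising $x_1$ by one unit multiplies the odds $\frac{p}{1-p}$ by $e^{\beta_1}$, with the other features held fixed.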

# Fitting a Logistic Regression Model - Lab

1. Prepare Data: Collect labeled data (features + binary labels, e.g., 0/1).
2. Choose Features: Select relevant inputs (e.g., age, income for predicting loan approval).
3. Initialize Model: Start with guess coefficients (β).
4. Train (Fit): Use an algorithm like gradient descent to iteratively adjust β. Goal: Maximize likelihood (make predictions match labels as closely as possible in probability terms).
Loss function: Binary cross-entropy (measures how wrong probabilities are).

5. Evaluate: Check accuracy, precision, recall, or the ROC curve (plots the true positive rate against the false positive rate).
6. Predict: For new data, compute probability and classify.
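Steps 3, 4, and 6 can be sketched from scratch for a single feature. This is a teaching sketch under stated assumptions, not how scikit-learn fits its models: the toy data, the learning rate, the step count, and the function names are all hypothetical choices.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(beta0, beta1, xs, ys) -> float:
    """Mean binary cross-entropy of the model's probabilities."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def fit(xs, ys, learning_rate=0.1, n_steps=2000):
    """Batch gradient descent on the mean binary cross-entropy."""
    beta0, beta1 = 0.0, 0.0  # step 3: initial guess
    n = len(xs)
    for _ in range(n_steps):  # step 4: iterative updates
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(beta0 + beta1 * x) - y  # gradient w.r.t. z for one row
            grad0 += error
            grad1 += error * x
        beta0 -= learning_rate * grad0 / n
        beta1 -= learning_rate * grad1 / n
    return beta0, beta1


# Hypothetical toy data with some overlap between the classes.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

loss_before = binary_cross_entropy(0.0, 0.0, xs, ys)
beta0, beta1 = fit(xs, ys)
loss_after = binary_cross_entropy(beta0, beta1, xs, ys)

print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
print(f"P(y=1 | x=4.0) = {sigmoid(beta0 + beta1 * 4.0):.3f}")  # step 6
```

Minimizing binary cross-entropy is the same as maximizing the likelihood, so the loss drop printed here is the "make probabilities match labels" goal from step 4.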

Assumptions: Observations are independent, features show little multicollinearity, the log-odds are linear in the features, and the sample is large enough for reliable estimates.

# Hands-On Lab

Predicting the likelihood of survival for passengers on the Titanic from their age, fare, class, sex, and number of siblings/spouses aboard (sibsp).


In [15]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix # for evaluation

# Load the Titanic dataset
df = sns.load_dataset('titanic')

In [16]:
df.head(10)

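|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |
| 5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | Q | Third | man | True | NaN | Queenstown | no | True |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.075 | S | Third | child | False | NaN | Southampton | no | False |
| 8 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 | S | Third | woman | False | NaN | Southampton | yes | False |
| 9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | C | Second | child | False | NaN | Cherbourg | yes | False |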


In [17]:
df.describe()

|   | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [18]:
# A notebook cell displays only its last expression, so only df.head() is shown below.
df.shape
df.columns
df.head()

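|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |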


In [19]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [20]:
# Names of the feature columns and of the target column
X = ['pclass', 'sex', 'age', 'fare', 'sibsp']
y = 'survived'

In [21]:
titanic_df = df[X + [y]].copy()

In [22]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    891 non-null    int64  
 1   sex       891 non-null    object 
 2   age       714 non-null    float64
 3   fare      891 non-null    float64
 4   sibsp     891 non-null    int64  
 5   survived  891 non-null    int64  
dtypes: float64(2), int64(3), object(1)
memory usage: 41.9+ KB


### Handling Missing Values in the Age Column

In [23]:
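# Fill missing ages with the median, which is less sensitive to outliers than the mean.
#titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())
# mode() returns a Series, so take its first value with [0].
#titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mode()[0])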
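An alternative to one global median is to fill each gap with the median age of that passenger's class, on the assumption that age varies by class. The sketch below shows the pattern on a hypothetical six-row frame named `sample_df`, which stands in for `titanic_df`. It would need to run before any global fill, since it only touches values that are still missing.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample standing in for titanic_df (not the real data).
sample_df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age within each class, aligned row by row with the original frame.
class_median_age = sample_df.groupby("pclass")["age"].transform("median")

# Fill only the missing ages with their class median.
sample_df["age"] = sample_df["age"].fillna(class_median_age)
print(sample_df)
```

`transform` returns a Series with the same index as the input, which is what lets `fillna` line the medians up with the right rows.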

### Encode Sex as a Number

In [24]:
titanic_df['sex'] = titanic_df['sex'].map({'male':0,'female':1})

### Train Test Split

In [29]:
X_array = titanic_df[X]
y_array = titanic_df[y]

In [30]:
X_train, X_test, y_train, y_test = train_test_split(
    X_array, y_array, test_size=0.25, random_state=42, stratify=y_array
)
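The `stratify` argument keeps the share of each class the same in the training and test sets. A small sketch on hypothetical labels (60 negatives, 40 positives) shows the effect. The array names here are illustrative and separate from the lab's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 negatives and 40 positives (40% positive).
labels = np.array([0] * 60 + [1] * 40)
features = np.arange(100).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

# With stratification, both splits keep the overall positive share.
print(f"positive share overall:  {labels.mean():.2f}")
print(f"positive share in train: {y_tr.mean():.2f}")
print(f"positive share in test:  {y_te.mean():.2f}")
```

Without `stratify`, a random split of an imbalanced dataset can leave the test set with noticeably more or fewer positives than the training set, which makes accuracy harder to interpret.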

### Fit Our Model

In [31]:
LogR = LogisticRegression(max_iter=1000)
LogR.fit(X_train,y_train)

In [32]:
y_pred = LogR.predict(X_test)
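`predict` returns class labels, but the fitted model also exposes the underlying probabilities and coefficients. The sketch below uses a hypothetical one-feature toy dataset rather than the Titanic split, so it runs on its own. In the lab, the same calls would apply to `LogR` and `X_test`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, binary label.
X_toy = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Column 1 of predict_proba holds P(y = 1) for each row.
proba_positive = model.predict_proba(X_toy)[:, 1]

# Applying the 0.5 threshold by hand reproduces predict().
manual_labels = (proba_positive > 0.5).astype(int)
print(np.array_equal(manual_labels, model.predict(X_toy)))

# exp(coefficient) is the multiplicative change in the odds per unit of x.
print("odds ratio:", np.exp(model.coef_[0]))
```

Working with `predict_proba` directly makes it possible to pick a threshold other than 0.5 when the two kinds of error have different costs.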

In [35]:
accuracy_on_titanic = accuracy_score(y_test,y_pred)
print(f"Accuracy Score of the model we just fitted:{accuracy_on_titanic:.3f}  ({accuracy_on_titanic*100:.1f}%)")

Accuracy Score of the model we just fitted:0.780  (78.0%)
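Accuracy alone hides which kind of mistake the model makes. The `confusion_matrix` import at the top of the lab can be used for that, alongside precision and recall. The sketch below runs on hypothetical labels so that it is self-contained. In the lab, `y_test` and `y_pred` would take the place of the demo lists.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels; in the lab these would be y_test and y_pred.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)

print(cm)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

For the Titanic model, recall answers "of the passengers who survived, how many did the model find?", while precision answers "of those predicted to survive, how many actually did?".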
