# **Logistic Regression**

`Logistic regression` is a `classification` algorithm used to assign observations to a discrete set of classes. 
Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the `logistic sigmoid function` to return a `probability value` which can then be mapped to two or more discrete classes.

`Logistic regression can be used for`:
1. Binary Classification
2. Multi-class Classification
3. One-vs-Rest Classification

#### `Assumptions of Logistic regression`
1. The dependent variable must be categorical in nature.
2. The independent variables(features) must be independent.
3. There should be no outliers in the data. Check for outliers.
4. There should be no high correlations among the independent variables. This can be checked using a correlation matrix.

### **1.0 Libraries**

In [99]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, precision_score, classification_report, f1_score

### **1.1 Dataset**

In [100]:
# load the data
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [101]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [102]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

### **1.3 Preprocessing**

In [103]:
# remove the deck column 
df.drop('deck', axis=1, inplace=True)
# impute missing values in age and fare
df['age'].fillna(df['age'].median(), inplace=True)
df['fare'].fillna(df['fare'].median(), inplace=True)
# impute missing values in embarked and embarked town
df['embarked'].fillna(df['embarked'].mode()[0], inplace=True)
df['embark_town'].fillna(df['embark_town'].mode()[0], inplace=True)
# encoding the categorical column using for loop
for col in df.columns:
    if df[col].dtype == 'object' or df[col].dtype.name == 'category':
        df[col] = LabelEncoder().fit_transform(df[col])


In [104]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,2,2,1,True,2,0,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,1,False
2,1,3,0,26.0,0,0,7.925,2,2,2,False,2,1,True
3,1,1,0,35.0,1,0,53.1,2,0,2,False,2,1,False
4,0,3,1,35.0,0,0,8.05,2,2,1,True,2,0,True


In [105]:
df.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [106]:
df.drop('alive', axis=1) # drop the alive column as its similar to our target column survived

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alone
0,0,3,1,22.0,1,0,7.2500,2,2,1,True,2,False
1,1,1,0,38.0,1,0,71.2833,0,0,2,False,0,False
2,1,3,0,26.0,0,0,7.9250,2,2,2,False,2,True
3,1,1,0,35.0,1,0,53.1000,2,0,2,False,2,False
4,0,3,1,35.0,0,0,8.0500,2,2,1,True,2,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,1,27.0,0,0,13.0000,2,1,1,True,2,True
887,1,1,0,19.0,0,0,30.0000,2,0,2,False,2,True
888,0,3,0,28.0,1,2,23.4500,2,2,2,False,2,False
889,1,1,1,26.0,0,0,30.0000,0,0,1,True,0,True


### **1.4 Feature Selection $ Scaling**

In [107]:
# X and y column 
X = df.drop('survived', axis=1)
y = df['survived']

In [108]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [109]:
X

array([[ 0.82737724,  0.73769513, -0.56573646, ...,  0.58595414,
        -0.78927234, -1.2316449 ],
       [-1.56610693, -1.35557354,  0.66386103, ..., -1.9423032 ,
         1.2669898 , -1.2316449 ],
       [ 0.82737724, -1.35557354, -0.25833709, ...,  0.58595414,
         1.2669898 ,  0.81192233],
       ...,
       [ 0.82737724, -1.35557354, -0.1046374 , ...,  0.58595414,
        -0.78927234, -1.2316449 ],
       [-1.56610693,  0.73769513, -0.25833709, ..., -1.9423032 ,
         1.2669898 ,  0.81192233],
       [ 0.82737724,  0.73769513,  0.20276197, ..., -0.67817453,
        -0.78927234,  0.81192233]])

In [110]:
# train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

### **1.5 Trainig Model**

In [111]:
# model call
model = LogisticRegression()

In [112]:
# train the model 
model.fit(X_train, y_train)

### **1.6 Evaluation | Prediction**

In [113]:
# predict the model 
y_pred = model.predict(X_test)

In [114]:
# evalute the model 
print('accuracy_score: ', accuracy_score(y_test, y_pred))
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Recall Score: ', recall_score(y_test, y_pred))
print('Precision Score: ', precision_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))
print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred))
print('Classification Report: \n', classification_report(y_test, y_pred))

accuracy_score:  1.0
Accuracy Score:  1.0
Recall Score:  1.0
Precision Score:  1.0
F1 Score:  1.0
Confusion Matrix: 
 [[59  0]
 [ 0 31]]
Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        59
           1       1.00      1.00      1.00        31

    accuracy                           1.00        90
   macro avg       1.00      1.00      1.00        90
weighted avg       1.00      1.00      1.00        90



##### **`ASAD ABBAS SHEIKH`**
`MONDAY, APRIL 29, 2024`