## Logistic Regression 

Concept:
1. A binary classification model that takes the output of a linear regression model as its input.
2. The output is a probability in the range (0, 1).

Core Idea:
1. Use the linear model $z = f(x) = w^{T}x + b$ to compute a score from the features.
2. Convert the score into a probability $p$ with the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$.
    - if $p > 0.5$, the output is class 1
    - if $p < 0.5$, the output is class 0 ($p = 0.5$ is the decision boundary)
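
The two steps above can be sketched in a few lines of NumPy. The weights `w`, the bias `b`, and the sample matrix `X` are made-up values chosen only for illustration; they are not taken from the dataset used later.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and samples (illustration only).
w = np.array([0.8, -0.5])
b = 0.1
X = np.array([[2.0, 1.0],
              [-1.0, 3.0],
              [1.0, 1.0]])

z = X @ w + b                    # step 1: linear model f(x) = w^T x + b
p = sigmoid(z)                   # step 2: probability of class 1
labels = (p > 0.5).astype(int)   # threshold at 0.5

print(p)
print(labels)
```

A positive score gives a probability above 0.5 and a negative score gives one below 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the score.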


Bernoulli probability of a single label, where $\hat{y}_i$ is the predicted probability that $y_i = 1$:
$$
P(y_i \mid x_i) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{\,1-y_i}
$$

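Maximum likelihood estimation (MLE) chooses $w$ and $b$ to maximize the likelihood of all $m$ training samples, assuming the samples are independent: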
$$
L(w, b) 
= \prod_{i=1}^{m} 
\hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{\,1-y_i}
$$

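Because the logarithm turns a product into a sum, it is easier to maximize the log of the likelihood than the product itself:
$$
\log(abc) = \log(a) + \log(b) + \log(c)
$$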
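
Applied to the likelihood $L(w, b)$, this identity gives the log-likelihood. The step below is a worked expansion using the same symbols as the likelihood, with $\log(a^{k}) = k \log(a)$ used for the exponents:

$$
\log L(w, b)
= \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]
$$

Maximizing this sum is equivalent to minimizing its negative, and dividing by $m$ turns the sum into an average over the samples.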



Loss function in Logistic Regression (the negative log-likelihood averaged over the $m$ samples, also known as binary cross-entropy or log loss):
$$
L = -\frac{1}{m}\sum_{i=1}^{m} \left[ y_i\log(\hat{y}_i) + (1 - y_i)\log(1-\hat{y}_i) \right]
$$
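
As a rough illustration, the loss can be computed directly with NumPy. The function name `binary_cross_entropy`, the clipping constant `eps`, and the example labels and probabilities are assumptions made for this sketch. Clipping is a common guard against `log(0)` and is not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Made-up labels and predictions (illustration only).
y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]

loss_good = binary_cross_entropy(y_true, mostly_right)
loss_bad = binary_cross_entropy(y_true, mostly_wrong)
print(loss_good, loss_bad)
```

Confident predictions that agree with the labels give a small loss, while confident predictions that disagree are penalized heavily.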


In [41]:
import os

import kagglehub
import pandas as pd

# Download the dataset and list the files in the download directory
path = kagglehub.dataset_download("mariolisboa/breast-cancer-wisconsin-original-data-set")
print(os.listdir(path))

# Load the CSV file into a DataFrame
file_path = os.path.join(path, 'breast_cancer_bd.csv')
df = pd.read_csv(file_path)

['breast_cancer_bd.csv']


In [42]:
print(df.columns)
print(df.head())

Index(['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
       'Uniformity of Cell Shape', 'Marginal Adhesion',
       'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
       'Normal Nucleoli', 'Mitoses', 'Class'],
      dtype='object')
   Sample code number  Clump Thickness  Uniformity of Cell Size  \
0             1000025                5                        1   
1             1002945                5                        4   
2             1015425                3                        1   
3             1016277                6                        8   
4             1017023                4                        1   

   Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  \
0                         1                  1                            2   
1                         4                  5                            7   
2                         1                  1                            2   
3        

In [43]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code number           699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB


(699, 11)

### Clean Data

In [44]:
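import numpy as np

# Missing values are encoded as '?'; mark them as NaN and drop those rows
df = df.replace('?', np.nan)
df.dropna(axis=0, how='any', inplace=True)
# 'Bare Nuclei' was read as strings because of the '?' entries; convert it to integers
df['Bare Nuclei'] = df['Bare Nuclei'].astype(int)
df.shape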

(683, 11)
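
The cleaning step can be tried on a small stand-in table before running it on the real data. The `toy` DataFrame below is invented for illustration and reuses only two of the dataset's column names. It shows how to count the `'?'` placeholders, drop the affected rows, and convert the column to integers.

```python
import numpy as np
import pandas as pd

# Invented stand-in data with one '?' placeholder (illustration only).
toy = pd.DataFrame({
    "Bare Nuclei": ["1", "10", "?", "3"],
    "Class": [2, 4, 2, 4],
})

# Count the '?' placeholders in each column before cleaning.
missing_per_column = toy.isin(["?"]).sum()

# Replace placeholders with NaN, drop incomplete rows, and cast to integers.
cleaned = (
    toy.replace("?", np.nan)
       .dropna(axis=0, how="any")
       .astype({"Bare Nuclei": int})
)

print(missing_per_column)
print(cleaned)
```

Counting placeholders first makes it easy to confirm that the number of dropped rows matches the change in `df.shape` (699 rows before cleaning, 683 after).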

### Import Packages

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

In [46]:
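# Split features and target
print(df.shape)
data = df.iloc[:, 1:10]     # nine feature columns; column 0 (sample ID) is skipped
target = df.iloc[:, 10:11]  # 'Class' column (2 = benign, 4 = malignant), kept as a one-column DataFrame
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)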

(683, 11)


### Feature Engineering


In [47]:
# Fit the scaler on the training set only, then apply the same statistics to the test set
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
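
To make the scaling step concrete, here is a NumPy-only sketch of the arithmetic behind standardization: subtract the training mean and divide by the training standard deviation (population form, `ddof=0`). The arrays `train` and `test` are made-up values. The code mirrors what `fit_transform` and `transform` compute and is not scikit-learn's own implementation.

```python
import numpy as np

# Made-up feature matrices (illustration only).
train = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# "fit": learn per-column statistics from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)

# "transform": apply the training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.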

### Model Training

In [48]:
estimator = LogisticRegression()
# y_train is a one-column DataFrame; flatten it to 1-D to avoid a DataConversionWarning
estimator.fit(x_train, y_train.values.ravel())
y_pre = estimator.predict(x_test)
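
The fitted model can be tied back to the formulas at the top of this section. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the breast cancer data). It recomputes the class-1 probabilities from `coef_` and `intercept_` with the sigmoid and compares them with `predict_proba` for a default binary `LogisticRegression`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Linear score z = w^T x + b from the learned parameters.
z = X @ model.coef_.ravel() + model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))          # sigmoid of the score
p_sklearn = model.predict_proba(X)[:, 1]     # probability of class 1

manual_labels = (p_manual > 0.5).astype(int)
print(np.allclose(p_manual, p_sklearn))
print(np.array_equal(manual_labels, model.predict(X)))
```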


In [50]:
print(accuracy_score(y_test, y_pre))

0.9562043795620438
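
Accuracy is the fraction of test samples whose predicted label matches the true label. The helper `accuracy` and the toy labels below are assumptions for illustration. The labels use the dataset's class codes (2 = benign, 4 = malignant).

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return float(np.mean(y_true == y_pred))

# Toy labels (illustration only): four of five predictions are correct.
y_true = [2, 2, 4, 4, 2]
y_pred = [2, 4, 4, 4, 2]
acc = accuracy(y_true, y_pred)
print(acc)

# Ratio that matches the score printed by the notebook.
ratio = 131 / 137
print(ratio)
```

The reported score is consistent with 131 correct predictions out of 137 test samples (20% of the 683 cleaned rows, rounded up), although the notebook does not print those counts.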
