# Steps to use Logistic Regression in Scikit-Learn

### What is Scikit-Learn?
- Simple and efficient tools for predictive data analysis
- Accessible to everybody and reusable in various contexts
- Built on NumPy, SciPy and Matplotlib
- Open source and commercially usable (BSD license)

## Step 1: Import the algorithm

Logistic Regression is available in Scikit-Learn's linear_model module.

In [1]:
from sklearn.linear_model import LogisticRegression

## Step 2: Prepare the data

- X: Feature matrix (e.g., customer characteristics such as age and salary)
- y: Target variable (binary labels such as spam or not spam)
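
As a minimal sketch of what X and y look like in practice, the block below builds both from a small DataFrame. The records and the column name `bought` are invented for illustration and are not part of the example dataset used later.

```python
import pandas as pd

# Hypothetical customer records, invented purely for illustration
df = pd.DataFrame({
    "age": [22, 35, 47, 51],
    "salary": [30000, 58000, 82000, 95000],
    "bought": [0, 0, 1, 1],  # binary target: 0 = No, 1 = Yes
})

X = df[["age", "salary"]]  # double brackets -> 2-D DataFrame (n_samples, n_features)
y = df["bought"]           # single brackets -> 1-D Series (n_samples,)

print(X.shape)  # (4, 2)
print(y.shape)  # (4,)
```

Scikit-Learn estimators expect X to be two-dimensional even when there is only one feature, which is why the feature columns are selected with a list.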

## Step 3: Train the model

## Step 4: Predict using the trained model

## Step 5: Evaluate the model using accuracy, precision, recall and F1-score

# Example

In [3]:
# Step 1 - Import

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

In [4]:
# Sample dataset: Predicting whether a customer will buy a product based on age and salary
data = {'age': [25, 45, 35, 50, 23, 40, 60, 30, 20, 24, 31, 36, 47, 53, 55, 60, 32, 38, 49, 65, 27, 29, 42, 44, 56],
        'salary': [40000, 90000, 60000, 120000, 35000, 70000, 150000, 50000, 40000, 60000, 80000, 85000, 90000, 100000, 101000, 80000, 70000, 66000, 110000, 96000, 125000, 100000, 87000, 94000, 10000],
        'buy_product': [0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]}  # 0 = No, 1 = Yes

In [5]:
# A plain dict cannot be sliced by column, so convert 'data' to a DataFrame before selecting features and target
df = pd.DataFrame(data)
df

   age  salary  buy_product
0   25   40000            0
1   45   90000            1
2   35   60000            0
3   50  120000            1
4   23   35000            0
5   40   70000            1
6   60  150000            1
7   30   50000            0
8   20   40000            0
9   24   60000            1
(first 10 of 25 rows shown)


In [6]:
# Features and target variable
# By convention, the feature matrix is written in uppercase (X) and the target vector in lowercase (y).
X = df[['age', 'salary']]
y = df['buy_product']
X

   age  salary
0   25   40000
1   45   90000
2   35   60000
3   50  120000
4   23   35000
5   40   70000
6   60  150000
7   30   50000
8   20   40000
9   24   60000
(first 10 of 25 rows shown)


In [7]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
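
With only 25 rows, `test_size=0.2` leaves 5 rows for testing, so the class balance of the test set depends heavily on the shuffle. The sketch below compares the plain split with a stratified one. It reuses the labels from the sample dataset, but the single feature is a placeholder (the row index) chosen only to keep the example short.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels copied from the sample dataset: 11 buyers out of 25 customers
y = np.array([0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
              1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0])
# Placeholder feature (the row index), used only to keep the sketch short
X = np.arange(len(y)).reshape(-1, 1)

# Plain split, as in the cell above: 20% of 25 rows -> 5 test rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))

# Stratified split: keeps the buyer / non-buyer ratio similar in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_te.sum(), y_te_s.sum())  # number of buyers in each test set
```

Whether stratification is worth using here is a judgment call; it mainly helps when the dataset is small or the classes are imbalanced.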

In [8]:
print(X_test)

    age  salary
8    20   40000
16   32   70000
0    25   40000
23   44   94000
11   36   85000


In [9]:
print(y_test)

8     0
16    0
0     0
23    1
11    0
Name: buy_product, dtype: int64


In [10]:
# Initialize the Logistic Regression model
logreg = LogisticRegression()

In [11]:
# Train the model on the training set
logreg.fit(X_train, y_train)
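
The import cell brings in `StandardScaler`, but the notebook never applies it, so age (tens) and salary (tens of thousands) reach the model on very different scales. One possible way to add scaling, sketched here as an assumption rather than as part of the original workflow, is to chain the scaler and the model with `make_pipeline`. The variable names `scaled_logreg` and `y_pred_scaled` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same sample dataset as in the example above
data = {'age': [25, 45, 35, 50, 23, 40, 60, 30, 20, 24, 31, 36, 47, 53, 55, 60, 32, 38, 49, 65, 27, 29, 42, 44, 56],
        'salary': [40000, 90000, 60000, 120000, 35000, 70000, 150000, 50000, 40000, 60000, 80000, 85000, 90000, 100000, 101000, 80000, 70000, 66000, 110000, 96000, 125000, 100000, 87000, 94000, 10000],
        'buy_product': [0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)
X, y = df[['age', 'salary']], df['buy_product']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fitted on the training rows only, then applied to the test rows
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_train, y_train)

y_pred_scaled = scaled_logreg.predict(X_test)
print(y_pred_scaled)
```

Wrapping both steps in a pipeline means the test rows never influence the scaling statistics. The predictions may differ from those of the unscaled model.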

In [12]:
# Make predictions
y_pred = logreg.predict(X_test)
print(y_pred)

[1 1 0 1 1]
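
`predict` returns hard 0/1 labels, but a fitted `LogisticRegression` also exposes class probabilities through `predict_proba`. The sketch below uses synthetic data from `make_classification` (an assumption, not the customer dataset) to show how the two relate. The 0.7 threshold is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature binary problem, used only for illustration
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X[:5])  # shape (5, 2); columns follow model.classes_
labels = model.predict(X[:5])       # the class with the higher probability
print(model.classes_)
print(proba.round(3))
print(labels)

# Stricter rule: predict 1 only when the model is at least 70% confident
strict_labels = (proba[:, 1] >= 0.7).astype(int)
print(strict_labels)
```

Raising the threshold typically trades recall for precision, which matters when reading the metrics in the evaluation step.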


In [13]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")

Accuracy: 0.4
Precision: 0.25
Recall: 1.0
F1-Score: 0.4


In [21]:
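# Accuracy  = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)  -> for the predictions above, TP = 1 and FP = 3, so precision = 0.25
# Recall    = TP / (TP + FN)  -> for the predictions above, FN = 0, so recall = 1.0
# F1-Score  = 2 * Precision * Recall / (Precision + Recall)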
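
To connect these formulas to the numbers printed above, the following sketch rebuilds the counts with `confusion_matrix`, using the `y_test` and `y_pred` values shown in the earlier outputs, and recomputes each metric by hand.

```python
from sklearn.metrics import confusion_matrix

# Values copied from the printed y_test and y_pred above
y_true = [0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 1 3 0 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The three false positives explain the pattern in the scores: the single actual buyer is found (recall 1.0), but most predicted buyers are wrong (precision 0.25).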