# Logistic Regression
# Machine Learning Model on Social Media Ads Dataset

**Logistic Regression Explained**
Logistic regression is a statistical method used in machine learning for **binary classification tasks**. It estimates the probability that an observation belongs to one of two classes, such as spam versus not spam in email classification or default versus no default in loan applications.

Unlike linear regression, which predicts continuous values, logistic regression models a **categorical dependent variable** with only two possible outcomes (often encoded as 0 or 1).

Here's the gist:
- It analyzes the relationship between one or more **independent variables** (features) and a **binary dependent variable** (target).
- It uses a mathematical function called the **logistic function** to transform the linear combination of the independent variables into a probability between 0 and 1.
- This probability represents the likelihood of an observation belonging to a specific class (e.g., spam email).
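
To make the steps above concrete, here is a minimal sketch of the logistic function applied to a linear combination of two features. The intercept and coefficients are made-up values chosen for illustration; they are not fitted to the ads dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters (assumptions for illustration, not fitted values)
intercept = -4.0
coefs = np.array([0.08, 0.00002])      # weight per year of age, per unit of salary

features = np.array([40.0, 60000.0])   # one illustrative observation: age, salary

# Linear combination of the features gives the log-odds
log_odds = intercept + coefs @ features
# The logistic function turns the log-odds into a probability
probability = sigmoid(log_odds)

print(f"log-odds = {log_odds:.2f}, probability = {probability:.3f}")
```

A log-odds of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the usual default cut-off between the two classes.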

**Where to Use Logistic Regression**
Logistic regression is a versatile tool for various classification problems, including:
- **Fraud detection:** Classifying transactions as fraudulent or legitimate.
- **Spam filtering:** Identifying spam emails.
- **Risk assessment:** Predicting loan default risk or creditworthiness.
- **Medical diagnosis:** Supporting medical professionals in diagnosing diseases based on symptoms.
- **Customer churn prediction:** Identifying customers at risk of leaving a service.
- **Marketing campaign targeting:** Predicting which potential customers are likely to respond to a campaign.

**Data Requirements for Logistic Regression**
Logistic regression has specific data requirements to ensure its effectiveness:
- **Binary dependent variable:** The target variable you want to predict must have two distinct categories.
- **Independent variables:** These features can be continuous (e.g., age, income) or categorical (e.g., gender, occupation).
- **Linear relationship:** The relationship between each continuous independent variable and the log-odds of the dependent variable should be approximately linear. This can be checked by plotting empirical log-odds against binned values of a feature, or with a statistical test such as the Box-Tidwell test.
- **Large enough dataset:** The coefficients are estimated by maximum likelihood, which becomes unreliable with few observations, especially when one of the two classes is rare.
- **Absence of multicollinearity:** The independent variables should not be highly correlated with each other, as this leads to unstable coefficient estimates that are hard to interpret.

By understanding these requirements, you can determine if logistic regression is a suitable choice for your classification problem.
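
As one possible way to check the multicollinearity requirement, the sketch below computes pairwise correlations between features and flags strongly correlated pairs. The data is synthetic, the column names merely mirror the ads dataset, and the 0.8 threshold is an arbitrary illustrative choice rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: not the real ads dataset)
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 60, size=n)
# Salary is constructed to depend partly on age, so the two are correlated
salary = 20000 + 1000 * age + rng.normal(0, 15000, size=n)
gender = rng.integers(0, 2, size=n)

features = pd.DataFrame(
    {"Gender": gender, "Age": age, "EstimatedSalary": salary}
)

# Pairwise Pearson correlations between the features
corr = features.corr()
print(corr.round(2))

# Collect feature pairs whose absolute correlation exceeds the threshold
threshold = 0.8  # illustrative cut-off, an assumption
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print("Highly correlated pairs:", high_pairs)
```

Pairwise correlation only catches relationships between two features at a time; a feature that is a combination of several others would need a different diagnostic.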

## Import

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression



### Import Dataset

In [2]:
df = pd.read_csv('D:\\Data Practice JN\\Pre-Processing\\Wrangled Data of Social Media Ads.csv')
df

Unnamed: 0.1,Unnamed: 0,Gender,Age,EstimatedSalary,Purchased,Box-Cox_X,Box-Cox_Y,Yeo-Johnson_X,Yeo-Johnson_Y,Quantile_X,Quantile_Y
0,0,0,19.0,69987.0,0,-1.980702,0.133869,-1.980702,0.133869,-2.052715,-0.012565
1,1,0,37.0,20000.0,0,0.001299,-1.770340,0.001299,-1.770340,-0.006282,-1.749524
2,2,1,26.0,43000.0,0,-1.137619,-0.741294,-1.137619,-0.741294,-1.073141,-0.704369
3,3,1,37.0,57000.0,0,0.001299,-0.259357,0.001299,-0.259357,-0.006282,-0.326084
4,4,0,19.0,76000.0,0,-1.980702,0.302718,-1.980702,0.302718,-2.052715,0.289852
...,...,...,...,...,...,...,...,...,...,...,...
395,395,1,46.0,41000.0,1,0.821376,-0.816635,0.821376,-0.816635,0.716498,-0.787349
396,396,0,51.0,23000.0,1,1.245128,-1.608888,1.245128,-1.608888,1.186084,-1.456106
397,397,1,50.0,20000.0,1,1.161960,-1.770340,1.161960,-1.770340,1.142773,-1.749524
398,398,0,36.0,33000.0,0,-0.095203,-1.139506,-0.095203,-1.139506,-0.125978,-0.987682


In [3]:
df.drop(columns=['Unnamed: 0','Box-Cox_X','Box-Cox_Y','Yeo-Johnson_X','Yeo-Johnson_Y','Quantile_X','Quantile_Y'], inplace=True)
df.head()

Unnamed: 0,Gender,Age,EstimatedSalary,Purchased
0,0,19.0,69987.0,0
1,0,37.0,20000.0,0
2,1,26.0,43000.0,0
3,1,37.0,57000.0,0
4,0,19.0,76000.0,0


## Model Building

### Define Features and Labels

In [4]:
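# Features: every column except the last (Gender, Age, EstimatedSalary)
x = df.iloc[:, :-1]
# Label: the last column (Purchased), kept as a one-column DataFrame
y = df.iloc[:, -1:]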
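
Positional slicing works here, but it silently depends on the column order. As an alternative sketch, the features and label can be selected by name, which also yields a 1-D label. The frame `df_demo` below is a small stand-in built from the five rows shown by `df.head()`; the names `feature_cols`, `X_demo` and `y_demo` are introduced only for this example.

```python
import pandas as pd

# Stand-in frame with the same column names as the cleaned dataset
df_demo = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 0],
    "Age": [19.0, 37.0, 26.0, 37.0, 19.0],
    "EstimatedSalary": [69987.0, 20000.0, 43000.0, 57000.0, 76000.0],
    "Purchased": [0, 0, 0, 0, 0],
})

feature_cols = ["Gender", "Age", "EstimatedSalary"]
X_demo = df_demo[feature_cols]   # 2-D DataFrame of features
y_demo = df_demo["Purchased"]    # 1-D Series of labels

print(X_demo.shape, y_demo.shape)
```

Selecting the label with single brackets returns a Series rather than a one-column DataFrame, which is the shape scikit-learn estimators expect for `y`.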

In [5]:
x

Unnamed: 0,Gender,Age,EstimatedSalary
0,0,19.0,69987.0
1,0,37.0,20000.0
2,1,26.0,43000.0
3,1,37.0,57000.0
4,0,19.0,76000.0
...,...,...,...
395,1,46.0,41000.0
396,0,51.0,23000.0
397,1,50.0,20000.0
398,0,36.0,33000.0


In [6]:
y

Unnamed: 0,Purchased
0,0
1,0
2,0
3,0
4,0
...,...
395,1
396,1
397,1
398,0


### Train Test Split

In [7]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, random_state=0)
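
When the two classes are unbalanced, a plain random split can leave the train and test sets with different class proportions. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the proportions roughly equal. It uses synthetic arrays (`X_demo`, `y_demo`) with an assumed 70/30 class balance, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive labels (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive share in train:", y_tr.mean(), "in test:", y_te.mean())
```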

In [8]:
x_test

Unnamed: 0,Gender,Age,EstimatedSalary
132,0,30.0,87000.0
309,1,38.0,50000.0
341,0,35.0,75000.0
196,1,30.0,79000.0
246,1,35.0,50000.0
...,...,...,...
216,0,49.0,65000.0
259,1,45.0,131000.0
49,1,31.0,89000.0
238,1,46.0,82000.0


In [9]:
y_test

Unnamed: 0,Purchased
132,0
309,0
341,0
196,0
246,0
...,...
216,0
259,1
49,0
238,0


In [10]:
x_train

Unnamed: 0,Gender,Age,EstimatedSalary
92,0,26.0,15000.0
223,0,60.0,102000.0
234,1,38.0,112000.0
232,0,40.0,107000.0
377,1,42.0,53000.0
...,...,...,...
323,1,48.0,30000.0
192,0,29.0,43000.0
117,0,36.0,52000.0
47,1,27.0,54000.0


In [11]:
y_train

Unnamed: 0,Purchased
92,0
223,1
234,0
232,1
377,0
...,...
323,1
192,0
117,0
47,0


### Model Fitting

In [12]:
model = LogisticRegression()
model.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
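
The output above contains two warnings: one because `y` was passed as a column vector instead of a 1-D array, and one because the solver hit its iteration limit, which commonly happens when features sit on very different scales (here, age in tens and salary in tens of thousands). The following is a minimal sketch of one way to address both, using a scaling pipeline and `ravel()`. It runs on synthetic data with an invented labelling rule, so the numbers it prints say nothing about the ads dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with ranges loosely resembling the notebook's features
rng = np.random.default_rng(0)
n = 300
gender = rng.integers(0, 2, size=n)
age = rng.uniform(18, 60, size=n)
salary = rng.uniform(15000, 150000, size=n)
X_demo = np.column_stack([gender, age, salary])

# Invented rule for the label (assumption, for illustration only)
score = 0.15 * (age - 40) + 0.00004 * (salary - 80000) + rng.normal(0, 1, size=n)
y_demo = (score > 0).astype(int).reshape(-1, 1)  # column vector, like df.iloc[:, -1:]

# Standardize the features, then fit logistic regression on the scaled values
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_demo, y_demo.ravel())  # ravel() flattens the labels to 1-D

# Number of solver iterations actually used (default limit is max_iter=100)
n_iter = pipe.named_steps["logisticregression"].n_iter_
train_accuracy = pipe.score(X_demo, y_demo.ravel())
print("iterations:", n_iter, "training accuracy:", train_accuracy)
```

Wrapping the scaler and the model in one pipeline means the same scaling learned from the training data is applied automatically at prediction time. Raising `max_iter` is the other remedy the warning suggests, and the two can be combined.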


In [13]:
df.head()

Unnamed: 0,Gender,Age,EstimatedSalary,Purchased
0,0,19.0,69987.0,0
1,0,37.0,20000.0,0
2,1,26.0,43000.0,0
3,1,37.0,57000.0,0
4,0,19.0,76000.0,0


### Prediction

In [14]:
model.predict([[1, 19, 30000]])



array([0], dtype=int64)

In [15]:
model.predict([[1, 90, 300000]])



array([1], dtype=int64)

In [16]:
model.predict([[0, 19, 30000]])



array([0], dtype=int64)

In [17]:
model.predict([[0, 90, 300000]])



array([1], dtype=int64)
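
`predict` returns only the class label. To see how confident the model is, `predict_proba` returns the estimated probability of each class. The sketch below trains a stand-in model on synthetic data with an invented labelling rule, then scores the same two inputs used above, passed as a DataFrame so the column names match those seen during fitting. Note that an age of 90 and a salary of 300000 appear to lie well outside the rows shown in this notebook, so such predictions are extrapolations.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption: not the real ads dataset)
rng = np.random.default_rng(1)
n = 300
train = pd.DataFrame({
    "Gender": rng.integers(0, 2, size=n),
    "Age": rng.uniform(18, 60, size=n),
    "EstimatedSalary": rng.uniform(15000, 150000, size=n),
})
# Invented rule: older or higher-paid users are labelled as purchasers
labels = ((train["Age"] > 40) | (train["EstimatedSalary"] > 100000)).astype(int)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(train, labels)

# New observations as a DataFrame with the training column names
new_rows = pd.DataFrame(
    [[1, 19, 30000], [1, 90, 300000]],
    columns=["Gender", "Age", "EstimatedSalary"],
)

proba = clf.predict_proba(new_rows)  # one row per input, one column per class
print("classes:", clf.classes_)
print(proba.round(3))
```

The columns of `proba` follow the order of `clf.classes_`, so the second column is the estimated probability of class 1.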

### Evaluation of Model

In [19]:
print('Training Score of Model =',model.score(x_train, y_train)*100)
print('Testing Score of Model = ',model.score(x_test, y_test)*100)

Training Score of Model = 83.92857142857143
Testing Score of Model =  88.33333333333333
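
For a classifier, `score` reports accuracy, which does not show which class the errors fall in. A confusion matrix and a per-class report give that breakdown. The sketch below is a self-contained illustration on synthetic data with an invented labelling rule; the variable names and the resulting numbers are assumptions and do not reproduce the scores above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real ads dataset)
rng = np.random.default_rng(2)
n = 400
age = rng.uniform(18, 60, size=n)
salary = rng.uniform(15000, 150000, size=n)
gender = rng.integers(0, 2, size=n)
X_demo = np.column_stack([gender, age, salary])
score = 0.15 * (age - 40) + 0.00004 * (salary - 80000) + rng.normal(0, 1, size=n)
y_demo = (score > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
acc = accuracy_score(y_te, y_pred)

print(cm)
print("accuracy:", acc)
print(classification_report(y_te, y_pred))
```

Accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test rows, so the matrix contains the accuracy figure plus the split between false positives and false negatives.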
