# Logistic Regression Project 

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

## Checking Out the Data

In [26]:
ad_data = pd.read_csv('advertising.csv')
ad_data.head()

| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.95 | 35 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 2016-03-27 00:53:11 | 0 |
| 1 | 80.23 | 31 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 2016-04-04 01:39:02 | 0 |
| 2 | 69.47 | 26 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 2016-03-13 20:35:42 | 0 |
| 3 | 74.15 | 29 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 2016-01-10 02:31:19 | 0 |
| 4 | 68.37 | 35 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 2016-06-03 03:36:18 | 0 |


**Use info() and describe() on ad_data**

In [27]:
ad_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB


In [28]:
# Convert Timestamp from string (object) to datetime
ad_data['Timestamp'] = pd.to_datetime(ad_data['Timestamp'])

# Derive new columns from Timestamp to explore as features
ad_data['Hour'] = ad_data['Timestamp'].dt.hour
ad_data['Month'] = ad_data['Timestamp'].dt.month
ad_data['Day of Week'] = ad_data['Timestamp'].dt.dayofweek

# Drop the original Timestamp column now that its parts are extracted
ad_data = ad_data.drop(columns=['Timestamp'])
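
As a quick sanity check on how these derived columns are encoded, the sketch below applies the same extraction to the first three timestamps shown in the `head()` output. The `sample` frame is a hypothetical stand-in for `ad_data`. Note that pandas numbers weekdays from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Hypothetical stand-in for ad_data: the first three timestamps from head()
sample = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2016-03-27 00:53:11',
        '2016-04-04 01:39:02',
        '2016-03-13 20:35:42',
    ])
})

# The .dt accessor exposes datetime components for the whole column
sample['Hour'] = sample['Timestamp'].dt.hour
sample['Month'] = sample['Timestamp'].dt.month
sample['Day of Week'] = sample['Timestamp'].dt.dayofweek  # Monday=0 ... Sunday=6

print(sample)
```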

# Logistic Regression

##### Splitting the Data

In [49]:
X = ad_data[['Daily Time Spent on Site','Age', 'Area Income','Daily Internet Usage','Male']]
y = ad_data['Clicked on Ad']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, 
                                                    random_state=42)
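
The split above is purely random. One optional variation, not used in this notebook, is to pass `stratify=y` so that the training and test sets keep the same proportion of clicks. The sketch below demonstrates the idea on a small synthetic target. `X_demo` and `y_demo` are made-up placeholders, not the advertising data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up placeholder data: 1000 rows, 5 features, balanced binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0, 1] * 500)

# stratify=y_demo keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```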

##### Fitting the Model

In [52]:
from sklearn.linear_model import LogisticRegression

In [53]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
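
The features used here sit on very different scales (`Area Income` is in the tens of thousands, while `Male` is 0 or 1). This can slow convergence of the default `lbfgs` solver and makes the fitted coefficients hard to compare. One common option is to standardize the features inside a pipeline. The following is a minimal sketch on synthetic data, not the notebook's actual model. `X_demo`, `y_demo`, `scaled_model` and the rule that generates the target are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (assumed values)
time_on_site = rng.normal(65, 15, size=500)
income = rng.normal(55000, 13000, size=500)
X_demo = np.column_stack([time_on_site, income])

# Made-up rule for the target: it depends only on time_on_site
y_demo = (time_on_site + rng.normal(0, 5, size=500) < 65).astype(int)

# Standardize each feature, then fit logistic regression
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

# After scaling, coefficient magnitudes are comparable across features
coefs = scaled_model.named_steps['logisticregression'].coef_[0]
print(coefs)
```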

### Predictions and Evaluations

In [84]:
predictions = logmodel.predict(X_test)

# Creating a classification report (precision, recall, F1-score and support per class)

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.85      0.96      0.90       146
           1       0.96      0.84      0.89       154

    accuracy                           0.90       300
   macro avg       0.90      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300



In [56]:
# Importing confusion_matrix from sklearn.metrics
from sklearn.metrics import confusion_matrix

# Printing the confusion matrix (rows: actual class, columns: predicted class)
print(confusion_matrix(y_test, predictions))

[[140   6]
 [ 25 129]]
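
In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, both in sorted label order. For labels 0 and 1 the layout is therefore `[[TN, FP], [FN, TP]]`. The sketch below unpacks the matrix printed above into named counts. The array is typed in by hand from that output rather than recomputed.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[140, 6],
               [25, 129]])

# ravel() flattens row by row, giving TN, FP, FN, TP for binary labels 0/1
tn, fp, fn, tp = cm.ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("misclassified:", fp + fn, "of", cm.sum())
```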


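* PRECISION - of all samples predicted to be in a class, the fraction that actually belong to it: TP / (TP + FP). High precision means few false positives, which usually goes with a low FPR.


* RECALL - of all samples that actually belong to a class, the fraction the model identified: TP / (TP + FN). Also called the true positive rate (TPR).


* F1-SCORE - the harmonic mean of precision and recall.


* SUPPORT - the number of test samples whose true label is that class.


* ACCURACY - the fraction of all predictions that are correct. It is most informative when the classes are balanced and FP and FN have similar cost.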
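
These definitions can be checked by hand against the confusion matrix printed earlier, taking class 1 (clicked) as the positive class, so TN = 140, FP = 6, FN = 25 and TP = 129. The sketch below uses plain arithmetic only. Rounded to two decimals, the results agree with the class 1 row and the accuracy line of the classification report.

```python
# Counts read off the confusion matrix above, with class 1 as "positive"
tn, fp, fn, tp = 140, 6, 25, 129

precision = tp / (tp + fp)                           # correct among predicted clicks
recall = tp / (tp + fn)                              # found among actual clicks (TPR)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)           # correct among all predictions
fpr = fp / (fp + tn)                                 # false positive rate

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f} fpr={fpr:.3f}")
```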

### Interpreting the Model Evaluation

* Confusion Matrix:

Of the 300 test users, 129 were predicted to click on the ad and actually clicked (true positives), and 140 were predicted not to click and actually did not click (true negatives).

6 users were predicted to click on the ad but did not (false positives), and 25 users were predicted not to click but actually clicked (false negatives).

In total, 31 of the 300 test points (about 10%) are mislabelled, which is not bad given the size of the dataset.

* Classification Report:

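From the report, the averaged precision and recall are both 0.90 and the overall accuracy is 0.90, so the model labels about 90% of the test users correctly. For the click class (1), precision is 0.96 and recall is 0.84: when the model predicts a click it is right 96% of the time, but it misses 16% of the users who actually clicked. Note that 0.90 is not the probability that a given user clicks on the ad; it describes how often the model's predictions are correct on the test set.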
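
Precision and recall summarise performance over the whole test set. An estimated click probability for an individual user comes from the model's `predict_proba` method instead. The following is a minimal sketch on synthetic data, with `X_demo`, `y_demo` and `demo_model` as illustrative names rather than the notebook's objects.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: two features, target is 1 when their sum is positive
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# One row per sample: column 0 is P(class 0), column 1 is P(class 1)
proba = demo_model.predict_proba(X_demo[:5])
print(proba)

# In the binary case, predict() matches thresholding P(class 1) at 0.5
labels = (proba[:, 1] > 0.5).astype(int)
print(labels, demo_model.predict(X_demo[:5]))
```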