Logistic regression is a classification algorithm. It is used to predict a <b><i>binary</i></b> outcome based on a set of independent variables.

A binary outcome is one where there are only two possible scenarios—either the event happens (i.e., 1) or it does not happen (i.e., 0). Independent variables are those variables or factors which may influence the outcome (or dependent variable).

So: logistic regression is the correct type of analysis to use when you’re working with binary data. You know you’re dealing with binary data when the output or dependent variable is dichotomous; in other words, when it fits into one of exactly two categories (such as “yes” or “no”, “pass” or “fail”, and so on).

However, the independent variables can fall into any of the following categories:

<b>Continuous</b>: such as temperature in degrees Celsius or weight in grams. In technical terms, continuous data is categorized as either interval data, where the intervals between values are equally spaced, or ratio data, where the intervals are equally spaced and there is a true or meaningful “zero”. For example, temperature in degrees Celsius is interval data: the difference between 10 and 11 degrees C is equal to the difference between 30 and 31 degrees C, but there is no true zero, because a temperature of zero degrees does not mean there is “no temperature”. Weight in grams, on the other hand, is ratio data: it has equal intervals and a true zero. In other words, if something weighs zero grams, it truly weighs nothing.
 
<b>Discrete, ordinal</b>: that is, data which can be placed in some kind of order on a scale. For example, if you are asked to state how happy you are on a scale of 1-5, the points on the scale represent ordinal data. A score of 1 indicates a lower degree of happiness than a score of 5, but the distance between adjacent points on the scale is not defined; the step from 1 to 2 need not equal the step from 4 to 5. Ordinal data is the kind of data you might get from a customer satisfaction survey.
 
<b>Discrete, nominal</b>: that is, data which fits into named groups that do not represent any kind of order or scale. For example, eye color may fit into the categories “blue”, “brown”, or “green”, but there is no hierarchy to these categories.

So, to determine whether logistic regression is the correct type of analysis to use, ask yourself the following:

Is the dependent variable dichotomous? In other words, does it fit into one of two set categories? Remember: The dependent variable is the outcome; the thing that you’re measuring or predicting.
 
Are the independent variables continuous (interval or ratio), ordinal, or nominal? See the examples above for a reminder of what these terms mean. Remember: The independent variables are those which may impact, or be used to predict, the outcome.

In addition to the two criteria mentioned above, there are some further requirements that must be met in order to use logistic regression correctly. These requirements are known as <b>“assumptions”</b>; in other words, when conducting logistic regression, you’re assuming that these criteria have been met. Let’s take a look at those now.


Logistic regression assumptions

The dependent variable is binary or dichotomous, i.e., it fits into one of two clear-cut categories. This applies to binary logistic regression, which is the type of logistic regression we’ve discussed so far. We’ll explore some other types of logistic regression in section five.
 
There should be no, or very little, multicollinearity between the predictor variables—in other words, the predictor variables (or the independent variables) should be independent of each other. This means that there should not be a high correlation between the independent variables. In statistics, certain tests can be used to calculate the correlation between the predictor variables; if you’re interested in learning more about those, just search “Spearman’s rank correlation coefficient” or “the Pearson correlation coefficient.”
 
The independent variables should be linearly related to the log odds. If you’re not familiar with log odds, we’ve included a brief explanation below.
 
Logistic regression requires fairly large sample sizes—the larger the sample size, the more reliable (and powerful) you can expect the results of your analysis to be. 
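
The multicollinearity check mentioned in the assumptions above can be sketched in plain Python. The snippet below computes the Pearson correlation coefficient between two predictors. The five age and salary values are simply the rows shown in the `df.head()` output later in this notebook, so the result is illustrative only, and the helper name `pearson` is hypothetical. In practice, pandas' `DataFrame.corr(method='pearson')` or `DataFrame.corr(method='spearman')` computes the same kind of coefficient for every pair of columns at once.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the Social_Network_Ads data (illustrative sample only).
age = [19, 35, 26, 27, 19]
salary = [19000, 20000, 43000, 57000, 76000]

r = pearson(age, salary)
print(f"Pearson r between Age and EstimatedSalary: {r:.3f}")
```

A coefficient near +1 or -1 between two predictors would be a warning sign of multicollinearity; values near 0 suggest the predictors carry largely separate information.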


What are log odds?
In very simple terms, log odds are an alternate way of expressing probabilities. To understand log odds, it’s important to understand a key difference between odds and probabilities: odds are the ratio of something happening to something not happening, while probability is the ratio of something happening to everything that could possibly happen.

For example: if you and your friend play ten games of tennis, and you win four of them, the odds of you winning are 4 to 6 (or, as a fraction, 4/6). The probability of you winning, however, is 4 out of 10 (4/10), as ten games were played in total. As we can see, odds describe the ratio of successes to failures. In logistic regression, every probability of the dependent variable can be converted into log odds by first computing the odds, p / (1 - p), and then taking the natural logarithm. This conversion is known as the logit function.
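
The tennis example can be worked through in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are chosen here for illustration.

```python
import math

wins, losses = 4, 6
games = wins + losses

probability = wins / games          # 4 / 10 = 0.4
odds = wins / losses                # 4 / 6, about 0.667
log_odds = math.log(odds)           # natural log of the odds (the logit)

# The same log odds computed directly from the probability.
logit_from_p = math.log(probability / (1 - probability))

# Applying the logistic (sigmoid) function to the log odds recovers the probability.
recovered_p = 1 / (1 + math.exp(-log_odds))

print(probability, odds, log_odds, recovered_p)
```

Odds below 1 (a probability below 0.5) give negative log odds, while odds above 1 give positive log odds.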



![Why not linear regression](img/3_Why_Not_Linear_Regression.png)

![Logistic function equation](img/3_Logistic_Function_Equation.png)

![Logistic function graph](img/3_Logistic_Function_Graph.png)
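
For reference, the relationship between log odds and probability can be written out as follows. This is the standard formulation of binary logistic regression rather than a transcription of the figures above. The symbols $p$ (probability that the outcome is 1), $x_1, \dots, x_k$ (independent variables), and $\beta_0, \dots, \beta_k$ (intercept and coefficients) are notation assumed here.

```latex
\begin{align*}
\text{odds} &= \frac{p}{1 - p} \\
\operatorname{logit}(p) &= \ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k = z \\
p &= \sigma(z) = \frac{1}{1 + e^{-z}}
\end{align*}
```

The second line is the linearity assumption: the log odds, not the probability itself, are a linear function of the independent variables.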

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv(r'H:\PythonWork\PracticeData\Social_Network_Ads.csv')

In [3]:
df.head()

Unnamed: 0,UserID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
UserID             400 non-null int64
Gender             400 non-null object
Age                400 non-null int64
EstimatedSalary    400 non-null int64
Purchased          400 non-null int64
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


In [5]:
df_new = df.drop('UserID',axis='columns')

In [6]:
df_new.head()

Unnamed: 0,Gender,Age,EstimatedSalary,Purchased
0,Male,19,19000,0
1,Male,35,20000,0
2,Female,26,43000,0
3,Female,27,57000,0
4,Male,19,76000,0


In [7]:
X = df_new.drop('Purchased',axis='columns')

In [8]:
y = df_new.Purchased

In [None]:
# Transforming categorical variables to numerical

In [15]:
# drop_first=True keeps a single 'Male' column; dtype=int gives 0/1 instead of booleans
Gender_Dummy = pd.get_dummies(X['Gender'], drop_first=True, dtype=int)

In [16]:
Gender_Dummy

Unnamed: 0,Male
0,1
1,1
2,0
3,0
4,1
...,...
395,0
396,1
397,0
398,1


In [17]:
X = X.drop('Gender', axis = 'columns')

In [18]:
X.head()

Unnamed: 0,Age,EstimatedSalary
0,19,19000
1,35,20000
2,26,43000
3,27,57000
4,19,76000


In [19]:
X = pd.concat([X,Gender_Dummy], axis = 'columns')

In [20]:
X

Unnamed: 0,Age,EstimatedSalary,Male
0,19,19000,1
1,35,20000,1
2,26,43000,0
3,27,57000,0
4,19,76000,1
...,...,...,...
395,46,41000,0
396,51,23000,1
397,50,20000,0
398,36,33000,1


In [21]:
from sklearn.model_selection import train_test_split

In [22]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)

In [24]:
X_test.shape

(80, 3)

In [25]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [26]:
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [27]:
y_predicted = model.predict(X_test)

In [28]:
y_predicted

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int64)

In [29]:
model.predict_proba(X_test)

array([[0.74082505, 0.25917495],
       [0.51299194, 0.48700806],
       [0.71527918, 0.28472082],
       [0.54491558, 0.45508442],
       [0.54662131, 0.45337869],
       [0.84828632, 0.15171368],
       [0.65634503, 0.34365497],
       [0.58035652, 0.41964348],
       [0.69175349, 0.30824651],
       [0.65405826, 0.34594174],
       [0.79481735, 0.20518265],
       [0.61352396, 0.38647604],
       [0.72930615, 0.27069385],
       [0.6737293 , 0.3262707 ],
       [0.81882743, 0.18117257],
       [0.44674392, 0.55325608],
       [0.69320239, 0.30679761],
       [0.82655966, 0.17344034],
       [0.20534858, 0.79465142],
       [0.79332656, 0.20667344],
       [0.55256124, 0.44743876],
       [0.24879569, 0.75120431],
       [0.47724865, 0.52275135],
       [0.53519335, 0.46480665],
       [0.71160587, 0.28839413],
       [0.23665943, 0.76334057],
       [0.5513659 , 0.4486341 ],
       [0.75974252, 0.24025748],
       [0.50025285, 0.49974715],
       [0.50853025, 0.49146975],
       [0.
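
The class labels returned by `predict` and the probabilities returned by `predict_proba` are two views of the same result: for a binary model, a row is labelled 1 when its second-column probability (the probability of class 1) exceeds 0.5. Below is a small plain-Python sketch using a few rows copied from the outputs above (rows 14 to 18 of the test set); the variable names are illustrative.

```python
# Probability of class 1 (second column of predict_proba) for test rows 14-18,
# copied from the output above.
p_class_1 = [0.18117257, 0.55325608, 0.30679761, 0.17344034, 0.79465142]

# Labels reported by model.predict for the same rows.
labels_from_predict = [0, 1, 0, 0, 1]

# Thresholding the probability at 0.5 reproduces the predicted labels.
labels_from_threshold = [1 if p > 0.5 else 0 for p in p_class_1]

print(labels_from_threshold)
print(labels_from_threshold == labels_from_predict)
```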

In [35]:
X_test

Unnamed: 0,Age,EstimatedSalary,Male
132,30,87000,1
309,38,50000,0
341,35,75000,1
196,30,79000,0
246,35,50000,0
...,...,...,...
14,18,82000,1
363,42,79000,0
304,40,60000,0
361,53,34000,0


In [30]:
model.score(X_test,y_test)

0.7875
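
For a classifier, `model.score` reports accuracy: the fraction of test rows whose predicted label matches the true label. With 80 test rows, a score of 0.7875 corresponds to 63 correct predictions. Here is a minimal sketch of the calculation, using short made-up label lists because `y_test` is not printed in this notebook; the function name `accuracy` is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels, for illustration only (not the notebook's data).
y_true_demo = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [0, 0, 1, 0, 0, 1, 1, 0]

print(accuracy(y_true_demo, y_pred_demo))  # 6 of 8 labels match

# The score reported above, 0.7875 on 80 rows, corresponds to 63 correct rows.
print(63 / 80)
```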

In [None]:
# Let's go back to the theory again

In [31]:
model.coef_

array([[ 4.50163509e-02,  7.99870107e-06, -9.34102843e-01]])

In [32]:
model.intercept_

array([-2.16253584])

In [33]:
import math

def sigmoid(x):
    # Logistic function: maps any real number to a value between 0 and 1
    return 1 / (1 + math.exp(-x))

In [34]:
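def prediction_function(age, EstimatedSalary, Male):
    # Coefficients and intercept copied from model.coef_ and model.intercept_ above
    z = 4.50163509e-02 * age + 7.99870107e-06 * EstimatedSalary - 9.34102843e-01 * Male - 2.16253584
    # The sigmoid converts the log odds z into the probability of Purchased = 1
    y = sigmoid(z)
    return y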

In [36]:
age = 30
EstimatedSalary = 87000
Male = 1
prediction_function(age, EstimatedSalary, Male)

0.2591749535243671

In [37]:
age = 38
EstimatedSalary = 50000
Male = 0
prediction_function(age, EstimatedSalary, Male)

0.4870080619968321
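
The hand-written `prediction_function` above hard-codes the fitted numbers. A slightly more general sketch takes the coefficients and intercept as arguments, so the same code works for any number of features. The values below are copied from the `model.coef_` and `model.intercept_` outputs above; the function name `predict_probability` is hypothetical.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Probability of class 1: sigmoid of the linear combination z."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1 / (1 + math.exp(-z))

# Fitted parameters copied from model.coef_ and model.intercept_ above,
# in feature order: Age, EstimatedSalary, Male.
coefficients = [4.50163509e-02, 7.99870107e-06, -9.34102843e-01]
intercept = -2.16253584

# First two rows of X_test shown above.
rows = [(30, 87000, 1), (38, 50000, 0)]
probabilities = [predict_probability(r, coefficients, intercept) for r in rows]
print(probabilities)
```

Up to rounding, these agree with the second column of the first two `predict_proba` rows (0.25917495 and 0.48700806), which shows that the model's probabilities are the sigmoid applied to a weighted sum of the features.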