# Logistic Regression

## Why Not Just Use A Linear Regression?

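### Assumptions for Linear Models:
- Gaussian distribution of residuals (errors)
- Y (target variable) is continuous on the prediction interval

A binary target (0 or 1) violates both assumptions: the residuals cannot be Gaussian when Y takes only two values, and a fitted line can predict values below 0 or above 1.

![Binary target](images/binary.png)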
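
To make the problem concrete, here is a minimal sketch that fits an ordinary least squares line to a binary target. The toy data and variable names are hypothetical and are not part of the review dataset used later.

```python
import numpy as np

# Hypothetical toy data: one feature, binary target (not the review data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Predict at points outside the training range
x_new = np.array([-5.0, 15.0])
linear_pred = slope * x_new + intercept
print(linear_pred)  # one value falls below 0 and the other above 1
```

Neither prediction can be read as a probability, which is the motivation for modelling the log-odds instead.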

### Finding A Decision Boundary
![Finding a decision boundary](images/lr1.png)

### Log of Equal Odds 
![Log of equal odds](images/lr2.png)
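
In standard notation, writing $p$ for $P(y = 1 \mid x)$ (a symbol assumed here for illustration), the odds of the positive class are $p / (1 - p)$. When both classes are equally likely:

$$
p = 0.5 \;\Rightarrow\; \frac{p}{1 - p} = \frac{0.5}{0.5} = 1 \;\Rightarrow\; \log \frac{p}{1 - p} = \log 1 = 0
$$

A log-odds of zero therefore marks the decision boundary: positive log-odds favour class 1 and negative log-odds favour class 0.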

### Logit Link Function
![Logit link function](images/lr3.png)
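
A common way to write the logit link, assuming coefficients $\beta_0, \dots, \beta_k$ and features $x_1, \dots, x_k$ (names chosen here, not taken from the figure), is:

$$
\operatorname{logit}(p) = \log \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
$$

The left side ranges over all real numbers, so it can be modelled with a linear function even though $p$ itself is confined to $(0, 1)$.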

### Solving for Each Class (Binary Target)
![Solving for each class (binary target)](images/lr4.png)
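
As a sketch of the algebra, let $z$ stand for the linear predictor and $p$ for $P(y = 1 \mid x)$ (both symbols are assumptions for illustration). Starting from $\log \frac{p}{1 - p} = z$ and exponentiating both sides:

$$
\frac{p}{1 - p} = e^{z} \;\Rightarrow\; p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}, \qquad 1 - p = \frac{1}{1 + e^{z}}
$$

The expression for $p$ is the sigmoid function, and the two class probabilities sum to one.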

### Log Likelihood
![Log likelihood](images/lr5.png)
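
For reference, the Bernoulli log likelihood in its usual form, assuming $n$ observations with labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$ (notation chosen here for illustration), is:

$$
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
$$

Fitting a logistic regression amounts to choosing the coefficients that maximize this quantity.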

In [7]:
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued z to the open interval (0, 1)
    return 1 / (1 + np.exp(-z))
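
The sketch below checks numerically that the sigmoid inverts the log-odds and evaluates a Bernoulli log likelihood on toy values. The helpers `logit` and `log_likelihood` and the toy arrays are hypothetical additions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    # Log-odds, the inverse of the sigmoid
    return np.log(p / (1 - p))

def log_likelihood(y_true, p):
    # Sum of y*log(p) + (1-y)*log(1-p) over all observations
    return np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

z = np.array([-2.0, 0.0, 3.0])
p = sigmoid(z)
print(p)                          # sigmoid(0) is exactly 0.5
print(np.allclose(logit(p), z))   # True: logit undoes sigmoid

y_toy = np.array([0, 1, 1])       # hypothetical labels
print(log_likelihood(y_toy, p))   # always negative; closer to 0 is a better fit
```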

In [2]:
import pandas as pd

# Read one review per line; the with-blocks close the files afterwards
with open("poor_amazon_toy_reviews.txt") as f:
    poor = f.readlines()
with open("good_amazon_toy_reviews.txt") as f:
    good = f.readlines()

# Label good reviews 1 and poor reviews 0
good_reviews = [(review, 1) for review in good]
poor_reviews = [(review, 0) for review in poor]

all_reviews = good_reviews + poor_reviews
all_reviews_df = pd.DataFrame(all_reviews, columns=["review", "positive"])
all_reviews_df.head()

| | review | positive |
|---|---|---|
| 0 | Excellent!!!\n | 1 |
| 1 | "Great quality wooden track (better than some ... | 1 |
| 2 | my daughter loved it and i liked the price and... | 1 |
| 3 | Great item. Pictures pop thru and add detail a... | 1 |
| 4 | I was pleased with the product.\n | 1 |


In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams only; keep the 1000 most frequent alphabetic tokens of two or more letters
vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words="english",
                             max_features=1000,
                             token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b")
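
To see what these settings do, here is a small sketch on two hypothetical documents (they are not taken from the review files). It uses the same `ngram_range`, `stop_words` and `token_pattern` as above, and reads the learned vocabulary from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
docs = ["Great toy, 5 stars!", "Great price, my kid loved it"]

toy_vectorizer = CountVectorizer(ngram_range=(1, 1),
                                 stop_words="english",
                                 token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b")
toy_X = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept token to its column index
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)            # lowercase alphabetic tokens only; digits and stop words are gone
print(toy_X.toarray())  # one row per document, one column per token
```

The token pattern drops the digit `5`, and the English stop word list drops words such as `my` and `it`.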

In [4]:
X = vectorizer.fit_transform(all_reviews_df["review"])
y = all_reviews_df["positive"].values
X

<114917x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 926619 stored elements in Compressed Sparse Row format>

In [5]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
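
The warning above says the solver stopped at its iteration limit. One of the remedies it suggests is raising `max_iter`. A minimal sketch on synthetic stand-in data follows; the value 1000 and the variable names are assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the review matrix
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

# Allow more iterations than the default of 100
lr_more_iter = LogisticRegression(max_iter=1000)
lr_more_iter.fit(X_toy, y_toy)

print(lr_more_iter.n_iter_)  # iterations the solver actually used
```

If the warning persists on real data, the scaling and alternative solver options linked in the message are the other routes to try.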

In [8]:
y_pred = lr.predict(X)

# calculate accuracy
np.mean(y_pred == y)

from sklearn.metrics import confusion_matrix

confusion_matrix(y, y_pred)

array([[  9087,   3613],
       [  1049, 101168]])
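
The confusion matrix can be broken down into per-class rates. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows = true label, columns = predicted label
cm = np.array([[9087, 3613],
               [1049, 101168]])
tn, fp, fn, tp = cm.ravel()

recall_positive = tp / (tp + fn)     # share of good reviews predicted as good
recall_negative = tn / (tn + fp)     # share of poor reviews predicted as poor
precision_positive = tp / (tp + fp)  # share of "good" predictions that are correct

print(round(recall_positive, 4))     # 0.9897
print(round(recall_negative, 4))     # 0.7155
print(round(precision_positive, 4))  # 0.9655
```

The model finds almost all good reviews but misses more than a quarter of the poor ones, which overall accuracy alone does not reveal.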

In [31]:
# all_reviews is the original list of string text documents
# wrong_predictions is an array of the indices where your prediction was not correct
wrong_predictions = np.nonzero(y_pred != y)[0]

np.take(np.array(all_reviews), wrong_predictions)

array(['1',
       '"Exactly what I was looking for, shipped quickly, my cousin loved it!"\n',
       "Adrienne - wonderful personality --see her on the NetFlix Movie. Representing a fashion doll who is SMART and PRETTY.<br />Best NEW product for young girls - FANTASTIC MESSAGE ! LOVE LOVE LOVE<br />There is a 3 (30minute) Netflix Movie introducing Project Mc2 and all their adventures!<br />Perfect. It's about time / dolls that are great in Science Technology Engineering and Math!\n",
       ..., '1', '1',
       '"Got package really fast, everything looks good, happy overall."\n'],
      dtype='<U12052')
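
The `'1'` entries in the output above appear because `all_reviews` holds `(review, label)` tuples: `np.array` turns it into a 2-D array, and `np.take` without an `axis` indexes the flattened version. A minimal sketch of the effect and one possible fix, using hypothetical stand-in data:

```python
import numpy as np

# Hypothetical stand-ins for all_reviews, y and y_pred
toy_reviews = [("good toy\n", 1), ("loved it\n", 1), ("broke fast\n", 0)]
toy_y = np.array([1, 1, 0])
toy_y_pred = np.array([1, 0, 0])

wrong = np.nonzero(toy_y_pred != toy_y)[0]  # indices of misclassified rows

# 2-D array of strings; np.take without axis picks from the flattened array
flattened_pick = np.take(np.array(toy_reviews), wrong)

# Indexing only the review texts returns the misclassified document
texts = np.array([text for text, _ in toy_reviews])
misclassified = texts[wrong]

print(flattened_pick)  # a label string, not a review
print(misclassified)   # the review at the wrong index
```

In this notebook the equivalent would be to index `all_reviews_df["review"].values` with `wrong_predictions`.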

In [None]:
np.mean(y_pred == y)

In [None]:
y.mean()
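
Neither cell above shows an output, but both quantities can be recovered from the confusion matrix reported earlier, since it was computed from the same `y` and `y_pred`. A small sketch with the matrix hard-coded:

```python
import numpy as np

# Confusion matrix reported earlier: rows = true label, columns = predicted label
cm = np.array([[9087, 3613],
               [1049, 101168]])

accuracy = np.trace(cm) / cm.sum()      # same quantity as np.mean(y_pred == y)
positive_rate = cm[1].sum() / cm.sum()  # same quantity as y.mean()

print(round(accuracy, 4))       # 0.9594
print(round(positive_rate, 4))  # 0.8895
```

Because about 89% of the reviews are positive, a model that always predicts "positive" would already reach roughly 89% accuracy, so the 96% figure should be judged against that baseline. It is also measured on the same data the model was trained on.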

## AUROC (Area Under the Receiver Operating Characteristic Curve)

In [17]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y, y_pred)

0.8526246651114863
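
When `roc_auc_score` is given hard 0/1 predictions, the ROC curve has a single operating point and the score reduces to the average of the true positive rate and the true negative rate, here (0.9897 + 0.7155) / 2 ≈ 0.8526. AUROC is usually computed from continuous scores instead. The sketch below contrasts the two on synthetic stand-in data; the data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Synthetic, imbalanced stand-in data (roughly 90% positive)
X_toy, y_toy = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.1, 0.9], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

hard_labels = clf.predict(X_toy)              # 0/1 predictions at the 0.5 threshold
scores = clf.predict_proba(X_toy)[:, 1]       # estimated P(y = 1)

auc_from_labels = roc_auc_score(y_toy, hard_labels)
auc_from_scores = roc_auc_score(y_toy, scores)
balanced_acc = balanced_accuracy_score(y_toy, hard_labels)

print(auc_from_labels, balanced_acc)  # identical: one threshold only
print(auc_from_scores)                # uses every threshold
```

For the review model, the analogous call would pass `lr.predict_proba(X)[:, 1]` as the second argument.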

In [18]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
data["TARGET"] = y

In [19]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data)
X_train = train_df.loc[:, ~train_df.columns.isin(['TARGET'])]
X_test = test_df.loc[:, ~test_df.columns.isin(['TARGET'])]


y_train = train_df["TARGET"]
y_test = test_df["TARGET"]
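
The split above uses the defaults, so it is neither reproducible nor stratified. With an imbalanced target it can help to pass `stratify` and `random_state`. A minimal sketch on hypothetical toy arrays (the seed 42 is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90% ones, 10% zeros
X_toy = np.arange(800).reshape(400, 2)
y_toy = np.array([1] * 360 + [0] * 40)

# stratify keeps the class proportions the same in both parts;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())  # both close to 0.9
```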

In [20]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(86187, 1000)
(86187,)
(28730, 1000)
(28730,)


In [21]:
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [1]:
y_pred = lr.predict(X_test)

np.mean(y_pred == y_test)


## Cross Validation

In [23]:
from sklearn.model_selection import cross_validate

X = data.loc[:, ~data.columns.isin(['TARGET'])]
cv_results = cross_validate(lr, X, y, cv=10, return_train_score=False)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [24]:
cv_results['test_score']

array([0.95501218, 0.95475113, 0.95744866, 0.95544727, 0.95475113,
       0.95857988, 0.95466411, 0.95570446, 0.95709686, 0.95561744])
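
The ten fold scores are easier to compare when summarized by their mean and spread. The sketch below hard-codes the array printed above; in the notebook the same numbers are available as `cv_results['test_score']`.

```python
import numpy as np

# Fold accuracies copied from the output above
test_scores = np.array([0.95501218, 0.95475113, 0.95744866, 0.95544727, 0.95475113,
                        0.95857988, 0.95466411, 0.95570446, 0.95709686, 0.95561744])

mean_score = test_scores.mean()
std_score = test_scores.std()

print(round(mean_score, 4))  # 0.9559
print(std_score)             # small spread: the folds agree closely
```

The held-out accuracy of about 0.956 is close to the training-set accuracy, which suggests little overfitting with these 1000 features.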