<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Logistic Regression Practice
**Possums**

<img src="./images/pos2.jpg" style="height: 250px">

*The common brushtail possum (Trichosurus vulpecula, from the Greek for "furry tailed" and the Latin for "little fox", previously in the genus Phalangista) is a nocturnal, semi-arboreal marsupial of the family Phalangeridae, native to Australia, and the second-largest of the possums.* -[Wikipedia](https://en.wikipedia.org/wiki/Common_brushtail_possum)

In [1]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

### Get the data

Read in the `possum.csv` data (located in the `data` folder).

In [2]:
df = pd.read_csv('data/possum.csv')

In [3]:
df.head()

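| | site | pop | sex | age | head_l | skull_w | total_l | tail_l |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 2 | 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 3 | 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 4 | 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |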


In [4]:
df.shape

(104, 8)

### Preprocessing

> Check for & deal with any missing values.  
Convert categorical columns to numeric.  
Do any other preprocessing you feel is necessary.

In [5]:
df.isnull().sum()

site       0
pop        0
sex        0
age        2
head_l     0
skull_w    0
total_l    0
tail_l     0
dtype: int64

In [6]:
df.shape

(104, 8)

In [7]:
df.dropna(inplace=True)

In [8]:
df.dtypes

site         int64
pop         object
sex         object
age        float64
head_l     float64
skull_w    float64
total_l    float64
tail_l     float64
dtype: object

In [9]:
df['sex'] = df['sex'].map(lambda x: 0 if x == 'f' else 1)
df.rename(columns={'sex': 'sex_male'}, inplace=True)

In [10]:
df['pop'].unique()
df['pop'] = df['pop'].map(lambda x: 1 if x =='Vic' else 0)
df.rename(columns={'pop': 'pop_Vic'}, inplace=True)
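
The `lambda` encodings above send every value other than `'f'` (or `'Vic'`) to a single code, so an unexpected label would be encoded silently. A minimal sketch of a stricter alternative uses explicit dictionaries, so that unmapped labels become `NaN` and can be caught. The toy frame and the `'other'` label are assumptions made for illustration, not values read from `possum.csv`.

```python
import pandas as pd

# Toy frame standing in for the possum data (illustrative values only).
toy = pd.DataFrame({
    "sex": ["m", "f", "f", "m"],
    "pop": ["Vic", "Vic", "other", "other"],  # 'other' is an assumed label
})

# Explicit dictionaries make the encoding visible; unmapped labels become NaN.
toy["sex_male"] = toy["sex"].map({"f": 0, "m": 1})
toy["pop_Vic"] = toy["pop"].map({"Vic": 1, "other": 0})

# Fail loudly if any label was not covered by the mappings.
assert toy[["sex_male", "pop_Vic"]].notna().all().all()

encoded = toy.drop(columns=["sex", "pop"])
print(encoded)
```

If the assertion fails on real data, `df['sex'].unique()` and `df['pop'].unique()` show which labels the dictionaries are missing.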

### Modeling

> Build a logistic regression model to predict `pop`, the region of origin.  
Examine the performance of the model.

In [11]:
# Step 1: Define X and y
X = df[['site', 'sex_male', 'age', 'head_l', 'skull_w', 'total_l', 'tail_l']]
type(X)

pandas.core.frame.DataFrame

In [12]:
# Step 1: Define X and y
y = df['pop_Vic']
type(y)

pandas.core.series.Series

In [24]:
y.value_counts(normalize=True)

0    0.568627
1    0.431373
Name: pop_Vic, dtype: float64
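
The class proportions above also give the baseline accuracy: a model that always predicts the majority class would be right about 57% of the time, so any useful classifier should beat that. The sketch below rebuilds a stand-in target from counts inferred from the printed proportions (58 and 44 out of the 102 rows left after `dropna`); the counts are a back-calculation, not values printed by the notebook.

```python
import pandas as pd

# Stand-in target: 58 zeros and 44 ones, inferred from the proportions above.
y_demo = pd.Series([0] * 58 + [1] * 44, name="pop_Vic")

# Accuracy of always predicting the most frequent class.
baseline_accuracy = y_demo.value_counts(normalize=True).max()
print(round(baseline_accuracy, 6))
```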

In [48]:
# Step 1 (continued): Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.15,
                                                    stratify=y,
                                                    random_state=42
                                                   )

In [49]:
# Step 2: Instantiate the model
logreg = LogisticRegression(solver='liblinear', random_state=42)

# Class imbalance: one target class has very few observations,
# e.g. 99% pop = 1 and 1% pop = 0 (fraudulent transactions per million are a typical case).
# It can be mitigated by setting class_weight, for example class_weight='balanced'.

# n_jobs: if -1, all of the computer's processors are used.
# Note that n_jobs is ignored when solver='liblinear'.
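
To make the `class_weight` note concrete, here is a minimal sketch on synthetic, deliberately imbalanced data. The possum target itself is fairly balanced, so nothing here is needed for this exercise. The data from `make_classification` and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 95% of rows in class 0 and 5% in class 1.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.95, 0.05], random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Same model twice: once unweighted, once with classes reweighted
# inversely to their frequency.
plain = LogisticRegression(solver="liblinear", random_state=42).fit(Xtr, ytr)
balanced = LogisticRegression(
    solver="liblinear", class_weight="balanced", random_state=42
).fit(Xtr, ytr)

# Recall on the rare class is usually the metric that imbalance hurts most.
recall_plain = recall_score(yte, plain.predict(Xte))
recall_balanced = recall_score(yte, balanced.predict(Xte))
print(f"recall (unweighted): {recall_plain:.3f}")
print(f"recall (balanced):   {recall_balanced:.3f}")
```

Reweighting typically trades some precision for recall on the rare class, so both metrics are worth checking before choosing a setting.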

In [50]:
# Step 3: Fit the model
logreg.fit(X_train, y_train)

LogisticRegression(random_state=42, solver='liblinear')

In [51]:
logreg.score(X_train, y_train)

1.0

In [52]:
logreg.score(X_test, y_test)

1.0
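
A score of 1.0 on both the training and testing sets is unusual enough to warrant a check for target leakage. One candidate, stated here as an assumption rather than a verified fact about `possum.csv`, is that `site` identifies the region directly. The sketch below shows how such a check could look; the toy site-to-region mapping is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: this site-to-region mapping is illustrative only.
toy = pd.DataFrame({
    "site":    [1, 1, 2, 2, 3, 3, 4, 4],
    "pop_Vic": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Rows are sites, columns are target classes.
# A leak shows up as exactly one non-zero cell in every row.
table = pd.crosstab(toy["site"], toy["pop_Vic"])
print(table)

# True when every site maps to exactly one target class.
site_determines_target = bool(
    (toy.groupby("site")["pop_Vic"].nunique() == 1).all()
)
print(site_determines_target)
```

Running the same two checks on `df['site']` and `df['pop_Vic']` would show whether the feature should be dropped from `X` before refitting.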

### Interpretation & Predictions

> Interpret at least one coefficient from your model.  
> Generate predicted probabilities for your testing set.  
> Generate predictions for your testing set.

In [53]:
# Step 4: Coefs and intercepts
logreg.intercept_

array([0.12638267])

In [55]:
# Step 4: Coefs and intercepts
pd.DataFrame(logreg.coef_[0], index = X.columns)

| | 0 |
|---|---|
| site | -2.595162 |
| sex_male | -0.207272 |
| age | -0.131705 |
| head_l | 0.259275 |
| skull_w | -0.205366 |
| total_l | -0.069781 |
| tail_l | 0.026559 |


In [56]:
# Interpret one coef (sex_male):
np.exp(-0.207272)

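# Holding all other features constant, the odds of being from Victoria for a male possum
# are about 0.81 times the odds for a female possum (roughly 19% lower).
# Note that this is a ratio of odds, not of probabilities.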

0.8127985387152706
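
The same exponentiation can be applied to every coefficient at once. The sketch below copies the coefficients printed above into a Series (in the notebook itself, `logreg.coef_[0]` with `X.columns` would serve the same purpose) and converts each to an odds ratio and a percentage change in odds per one-unit increase.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the fitted model's output above.
coefs = pd.Series({
    "site": -2.595162,
    "sex_male": -0.207272,
    "age": -0.131705,
    "head_l": 0.259275,
    "skull_w": -0.205366,
    "total_l": -0.069781,
    "tail_l": 0.026559,
})

# exp(coef) is the multiplicative change in the odds of pop_Vic = 1
# for a one-unit increase in the feature, other features held constant.
odds_ratios = np.exp(coefs)
pct_change_in_odds = (odds_ratios - 1) * 100

summary = pd.DataFrame({
    "coef": coefs,
    "odds_ratio": odds_ratios,
    "pct_change_in_odds": pct_change_in_odds,
}).round(3)
print(summary)
```

Odds ratios below 1 correspond to negative coefficients (lower odds of Victoria), and ratios above 1 to positive ones.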

In [57]:
# Step 5: Make predictions
# Generate class predictions for the testing set
logreg.predict(X_test)

array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0])

In [59]:
# Generate predicted probabilities for the testing set (columns: P(pop_Vic=0), P(pop_Vic=1))
np.round(logreg.predict_proba(X_test), 3)

array([[0.142, 0.858],
       [1.   , 0.   ],
       [0.353, 0.647],
       [0.998, 0.002],
       [0.016, 0.984],
       [0.012, 0.988],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [0.011, 0.989],
       [0.998, 0.002],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [0.059, 0.941],
       [0.988, 0.012],
       [0.023, 0.977],
       [0.998, 0.002]])
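
For a binary model, `predict` is equivalent to thresholding the class-1 probability at 0.5. The sketch below checks this against the outputs printed above; the arrays are copied from those outputs, and the 0.9 threshold is an arbitrary value chosen only to show how a stricter cut-off changes the predictions.

```python
import numpy as np

# Second column of predict_proba above: P(pop_Vic = 1) for each test row.
proba_vic = np.array([
    0.858, 0.000, 0.647, 0.002, 0.984, 0.988, 0.000, 0.000,
    0.989, 0.002, 0.000, 0.000, 0.941, 0.012, 0.977, 0.002,
])

# Class predictions printed by logreg.predict(X_test) above.
predicted = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0])

# Thresholding at 0.5 reproduces the default predictions.
thresholded = (proba_vic >= 0.5).astype(int)
matches_default = np.array_equal(thresholded, predicted)
print(matches_default)

# A stricter, arbitrarily chosen threshold labels fewer rows as Victoria.
strict = (proba_vic >= 0.9).astype(int)
print(strict.sum(), "of", len(strict), "rows predicted as Victoria at 0.9")
```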

In [61]:
logreg.predict(X_train)[:5]

array([1, 1, 0, 0, 1])

In [63]:
np.round(logreg.predict_proba(X_test), 3)[:5]

array([[0.142, 0.858],
       [1.   , 0.   ],
       [0.353, 0.647],
       [0.998, 0.002],
       [0.016, 0.984]])