<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Logistic Regression Practice
**Possums**

<img src="../images/pos2.jpg" style="height: 250px">

*The common brushtail possum (Trichosurus vulpecula, from the Greek for "furry tailed" and the Latin for "little fox", previously in the genus Phalangista) is a nocturnal, semi-arboreal marsupial of the family Phalangeridae, native to Australia, and the second-largest of the possums.* -[Wikipedia](https://en.wikipedia.org/wiki/Common_brushtail_possum)

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

### Get the data

Read in the `possum.csv` data (located in the `data` folder).

In [2]:
possums = pd.read_csv('../data/possum.csv')

In [3]:
possums.head()

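|   | site | pop | sex | age | head_l | skull_w | total_l | tail_l |
|---|------|-----|-----|-----|--------|---------|---------|--------|
| 0 | 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 2 | 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 3 | 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 4 | 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |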


### Preprocessing

> Check for & deal with any missing values.  
Convert categorical columns to numeric.  
Do any other preprocessing you feel is necessary.

In [4]:
# Count missing values in each column
possums.isnull().sum()

site       0
pop        0
sex        0
age        2
head_l     0
skull_w    0
total_l    0
tail_l     0
dtype: int64

In [5]:
# Drop the rows with missing values (only 2 rows, both missing `age`)
possums.dropna(inplace = True)

In [6]:
# Convert sex m/f to 0/1
possums['sex_f'] = possums['sex'].map({'m': 0, 'f': 1})
possums.drop(columns = 'sex', inplace = True)

In [7]:
# Check out the pop column
possums['pop'].value_counts()

other    58
Vic      44
Name: pop, dtype: int64

In [8]:
# Convert pop column to 0/1 (other = 0, Vic = 1)
possums['pop'] = possums['pop'].map({'other': 0, 'Vic': 1})
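
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN` without raising an error. The sketch below illustrates this on a hypothetical toy column (the `'NSW'` label is invented for the example and does not appear in the possum data); a quick `isna()` check after mapping would catch such cases.

```python
import pandas as pd

# Hypothetical toy column; 'NSW' is an invented label that the mapping does not cover
pop_raw = pd.Series(['Vic', 'other', 'NSW', 'Vic'])
pop_encoded = pop_raw.map({'other': 0, 'Vic': 1})

# Values missing from the dictionary are silently mapped to NaN
n_unmapped = pop_encoded.isna().sum()

print(pop_encoded.tolist())
print("Unmapped values:", n_unmapped)
```

In this notebook the `value_counts()` output shows only `other` and `Vic`, so the mapping covers every row.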

### Modeling

> Build a logistic regression model to predict `pop`, the region of origin.  
Examine the performance of the model.

In [9]:
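# Set up X and y
# Note: `site` (a trapping-site code) stays in X as a numeric feature. If site
# determines region, it leaks the target and can inflate the accuracy scores.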
X = possums.drop(columns = 'pop')
y = possums['pop']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
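
The split above uses the defaults, so the class balance in the training and testing sets is left to chance. A possible variation, sketched here on synthetic stand-in data with the same class counts as `pop` (58 `other`, 44 `Vic`), is to pass `stratify=y` so both splits keep roughly the same class proportions. The names `X_demo` and `y_demo` are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels with the same class counts as `pop` (58 zeros, 44 ones)
y_demo = np.array([0] * 58 + [1] * 44)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

print("Overall share of class 1:", round(y_demo.mean(), 3))
print("Train share of class 1:  ", round(y_tr.mean(), 3))
print("Test share of class 1:   ", round(y_te.mean(), 3))
```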

In [10]:
# Instantiate model
logreg = LogisticRegression(solver = 'newton-cg') # changing solver b/c of convergence warning

In [11]:
# Fit the model
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
# training accuracy
logreg.score(X_train, y_train)

1.0

In [13]:
# testing accuracy
logreg.score(X_test, y_test)

0.9615384615384616
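
To put these scores in context, it helps to compare them with the accuracy of always guessing the majority class. The sketch below uses the class counts from the `value_counts()` output above; those counts describe the full data set rather than the test split, so the baseline is only approximate.

```python
# Class counts copied from the value_counts() output (full data set, after dropna)
n_other, n_vic = 58, 44

# Accuracy obtained by always predicting the majority class
baseline_accuracy = max(n_other, n_vic) / (n_other + n_vic)

# The reported test score corresponds to 25 correct predictions out of 26 test rows
test_accuracy = 25 / 26

print("Baseline accuracy:", round(baseline_accuracy, 3))
print("Test accuracy:    ", round(test_accuracy, 3))
```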

### Interpretation & Predictions

> Interpret at least one coefficient from your model.  
> Generate predicted probabilities for your testing set.  
> Generate predictions for your testing set.

In [14]:
# Check out coefficients
pd.Series(logreg.coef_[0], index = X.columns)

site      -2.080098
age        0.085753
head_l    -0.155235
skull_w   -0.159104
total_l    0.157248
tail_l    -0.660218
sex_f     -0.122904
dtype: float64

In [15]:
# Interpret coefficient for age:
np.exp(0.085753)

1.0895371793832975

> A one-year increase in a possum's age multiplies the odds that it lives in the Vic region by about 1.09 (roughly a 9% increase in the odds), holding all else constant.
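
The same exponentiation can be applied to every coefficient at once. The sketch below hard-codes the coefficient values printed above so it runs on its own; in the notebook, `np.exp(logreg.coef_[0])` would work on the fitted model directly. Exponentiated coefficients are odds ratios: values above 1 raise the odds of `Vic`, and values below 1 lower them.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the model output above (log-odds scale)
coefs = pd.Series({
    'site': -2.080098,
    'age': 0.085753,
    'head_l': -0.155235,
    'skull_w': -0.159104,
    'total_l': 0.157248,
    'tail_l': -0.660218,
    'sex_f': -0.122904,
})

# Exponentiate to get the multiplicative change in odds per one-unit increase
odds_ratios = np.exp(coefs)

print(odds_ratios.round(3))
```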

In [16]:
# Predicted probabilities for test set
logreg.predict_proba(X_test)

array([[6.22005473e-03, 9.93779945e-01],
       [9.92282175e-01, 7.71782488e-03],
       [9.94511461e-01, 5.48853931e-03],
       [6.60151230e-01, 3.39848770e-01],
       [1.19972141e-02, 9.88002786e-01],
       [1.16963250e-01, 8.83036750e-01],
       [9.99901173e-01, 9.88274942e-05],
       [6.31924877e-01, 3.68075123e-01],
       [4.63660383e-02, 9.53633962e-01],
       [1.30018841e-02, 9.86998116e-01],
       [3.70468359e-03, 9.96295316e-01],
       [3.81667245e-02, 9.61833276e-01],
       [9.99856153e-01, 1.43846873e-04],
       [9.99905266e-01, 9.47343442e-05],
       [9.95112397e-01, 4.88760349e-03],
       [1.48072970e-02, 9.85192703e-01],
       [9.96638534e-01, 3.36146632e-03],
       [1.04744212e-01, 8.95255788e-01],
       [1.67041247e-02, 9.83295875e-01],
       [5.37326258e-03, 9.94626737e-01],
       [9.99818722e-01, 1.81278235e-04],
       [9.81947145e-01, 1.80528554e-02],
       [8.19358190e-03, 9.91806418e-01],
       [9.97524050e-01, 2.47594962e-03],
       [4.560015

In [17]:
# Predictions for test set
logreg.predict(X_test)

array([1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       1, 0, 1, 0])
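
For a binary model, these class predictions correspond to thresholding the class-1 probability at 0.5. As a minimal sketch, the block below copies the first eight rows of the `predict_proba` output above and applies that threshold by hand; the result should match the first eight predictions.

```python
import numpy as np

# First eight rows of the predict_proba output (columns: P(class 0), P(class 1))
proba = np.array([
    [6.22005473e-03, 9.93779945e-01],
    [9.92282175e-01, 7.71782488e-03],
    [9.94511461e-01, 5.48853931e-03],
    [6.60151230e-01, 3.39848770e-01],
    [1.19972141e-02, 9.88002786e-01],
    [1.16963250e-01, 8.83036750e-01],
    [9.99901173e-01, 9.88274942e-05],
    [6.31924877e-01, 3.68075123e-01],
])

# Predict class 1 (Vic) when its probability exceeds 0.5
preds = (proba[:, 1] > 0.5).astype(int)

print(preds)
```

Rows 4 and 8 are the least certain of these (about 0.34 and 0.37 for `Vic`), so a different threshold would change those predictions first.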