## Module 7: Logistic Regression

### Step 0

Load the required libraries and bring in the data. Because CodeGrade cannot access the internet, we cannot download the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) directly through scikit-learn. Instead, the script below builds scikit-learn's cached copy of the dataset from a local archive, so that `fetch_california_housing` returns the same data it normally would.

In [1]:
# CodeGrade step0

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
import os
import tarfile
import joblib  # import joblib directly
from sklearn.datasets._base import _pkl_filepath, get_data_home
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from statsmodels.stats.outliers_influence import variance_inflation_factor

archive_path = "cal_housing.tgz" # change the path if it's not in the current directory
data_home = get_data_home(data_home=None) # change data_home if you are not using ~/scikit_learn_data
if not os.path.exists(data_home):
    os.makedirs(data_home)
filepath = _pkl_filepath(data_home, 'cal_housing.pkz')

with tarfile.open(mode="r:gz", name=archive_path) as f:
    cal_housing = np.loadtxt(
        f.extractfile('CaliforniaHousing/cal_housing.data'),
        delimiter=',')
    # Columns are not in the same order compared to the previous
    # URL resource on lib.stat.cmu.edu
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    cal_housing = cal_housing[:, columns_index]

    joblib.dump(cal_housing, filepath, compress=6) # Now using the directly imported joblib

# Load dataset
california = fetch_california_housing(as_frame=True)
data = california.data
data['MedianHouseValue'] = california.target

Print the basic information about the data using `.info()` and `.describe()`.

In [2]:
# Display structure and summary
print(data.info())
print(data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MedInc            20640 non-null  float64
 1   HouseAge          20640 non-null  float64
 2   AveRooms          20640 non-null  float64
 3   AveBedrms         20640 non-null  float64
 4   Population        20640 non-null  float64
 5   AveOccup          20640 non-null  float64
 6   Latitude          20640 non-null  float64
 7   Longitude         20640 non-null  float64
 8   MedianHouseValue  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
None
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.4

### Step 1

Define `threshold` as the median of `MedianHouseValue`.

Next, create a binary target variable called `HighValue` like so:

> `data['HighValue'] = (data['MedianHouseValue'] > threshold).astype(int)`

Finally, store the unique values of `HighValue` in an array called `unique_values`.


In [3]:
# CodeGrade step1
# threshold is the median of MedianHouseValue
threshold = data['MedianHouseValue'].median()

# create a binary target value called HighValue
data['HighValue'] = (data['MedianHouseValue'] > threshold).astype(int)

# create array of unique_values that returns unique values of HighValue
unique_values = data['HighValue'].unique()
unique_values

array([1, 0])
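
The output above lists `1` before `0` because `.unique()` returns values in order of first appearance rather than sorted order. The short sketch below illustrates this, along with how the strict `>` comparison treats values equal to the median. The toy values and the `toy_*` names are made up for illustration and are not part of the housing data.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical, not from the housing data); the median is 2.0
toy_values = pd.Series([5.0, 1.0, 2.0, 2.0, 3.0])
toy_threshold = toy_values.median()

# A strict ">" sends values equal to the median to class 0
toy_high = (toy_values > toy_threshold).astype(int)

# unique() keeps the order of first appearance; np.sort gives ascending order
toy_unique = toy_high.unique()
toy_sorted = np.sort(toy_unique)

print("Labels:", toy_high.tolist())
print("unique():", toy_unique, "sorted:", toy_sorted)
print("Class counts:", toy_high.value_counts().to_dict())
```

Because ties go to class 0, the two classes are only approximately balanced when many rows share the median value.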

### Step 2

Select `MedInc`, `AveRooms`, and `AveOccup` as the predictors `X`, and let `y` be the variable `HighValue`.

Let `seed` be set to 42.

Now split the data into `X_train`, `X_test`, `y_train`, and `y_test`, with a test size of 30% and a random state of 42.

Return the shapes of these four arrays in the same order as listed above.

In [4]:
# CodeGrade step2

X = data[['MedInc', 'AveRooms', 'AveOccup']]
y = data['HighValue']

# set seed to 42
seed = 42

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

# return the shapes of the arrays
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((14448, 3), (6192, 3), (14448,), (6192,))
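
To see what fixing the random state does, here is a minimal sketch on placeholder arrays with the same number of rows as the housing data. The `X_demo`, `y_demo`, and `split_*` names are illustrative assumptions, and the placeholder values have nothing to do with the real predictors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 20640  # same row count as the housing data
X_demo = np.arange(n_rows * 3).reshape(n_rows, 3)  # placeholder predictors
y_demo = np.arange(n_rows) % 2                     # placeholder binary target

# Two splits with the same seed, and one with a different seed
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_c = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Same seed: every piece of the split is identical
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
# Different seed: the test rows differ
different_seed_matches = np.array_equal(split_a[1], split_c[1])

print("Shapes:", [part.shape for part in split_a])
print("Same seed gives same split:", same_seed_matches)
print("Different seed gives same split:", different_seed_matches)
```

The shapes depend only on the row count and `test_size`, while the seed controls which rows land in each set.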

### Step 3

Using `scaler = StandardScaler()`, apply `fit_transform` to `X_train` and call the result `X_train_scaled`. Likewise, use `.transform` on `X_test` and call the result `X_test_scaled`.

Now return the shape of `X_test_scaled`.

In [5]:
# CodeGrade step3
# scale predictors

scaler = StandardScaler()

# transform training and testing predictors
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# shape of X_test_scaled
X_test_scaled.shape

(6192, 3)
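
The reason for using `fit_transform` on the training set but only `transform` on the test set is that the test data should be scaled with the training set's means and standard deviations. The sketch below checks this on synthetic data; the arrays and the `*_demo` names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a training and a test set (not the housing data)
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
test_demo = rng.normal(loc=10.0, scale=3.0, size=(50, 2))

demo_scaler = StandardScaler()
train_demo_scaled = demo_scaler.fit_transform(train_demo)  # learns mean and std
test_demo_scaled = demo_scaler.transform(test_demo)        # reuses them

# Manual version: subtract the TRAINING mean, divide by the TRAINING std
manual_test_scaled = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print("Train means after scaling:", train_demo_scaled.mean(axis=0))
print("Test means after scaling: ", test_demo_scaled.mean(axis=0))
print("Matches manual scaling:", np.allclose(test_demo_scaled, manual_test_scaled))
```

The scaled training columns have mean 0 and standard deviation 1 by construction, whereas the scaled test columns are only close to that, since they were not used to compute the statistics.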

### Step 0

Run the code below.

In [6]:
# CodeGrade step0

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

### Step 4

Return the model's intercept.

In [7]:
# CodeGrade step4

# output intercept
print("Intercept: ", model.intercept_)

Intercept:  [0.12755905]


### Step 5

Return the model's coefficients.

In [8]:
# CodeGrade step5

# output coefficients
print("Coefficients:", model.coef_)

Coefficients: [[ 2.33711986 -0.88891482 -2.55688063]]
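
One common way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below does this for the coefficients printed above, which are copied in by hand; the `feature_names` ordering is assumed to follow the column order of `X`. Since the predictors were standardized, each odds ratio refers to a one-standard-deviation increase in that predictor, holding the others fixed.

```python
import numpy as np

# Coefficients copied from the Step 5 output; order assumed to match X's columns
feature_names = ["MedInc", "AveRooms", "AveOccup"]
coefficients = np.array([2.33711986, -0.88891482, -2.55688063])

# exp(coefficient) is the multiplicative change in the odds of HighValue = 1
odds_ratios = np.exp(coefficients)

for name, coef, ratio in zip(feature_names, coefficients, odds_ratios):
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: coef={coef:+.3f}, odds ratio={ratio:.3f} ({direction} the odds)")
```

An odds ratio above 1 means the predictor is associated with higher odds of a high-value district, and a ratio below 1 means lower odds.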


### Step 6

Using the model, predict the probabilities for `X_test_scaled`, calling this `y_pred_prob`, and predict the classes for `X_test_scaled`, calling this `y_pred_class`.

Now return the first five elements of both of these arrays, `y_pred_prob` and `y_pred_class`.

In [9]:
# CodeGrade step6
# predict probabilities and classes
y_pred_prob = model.predict_proba(X_test_scaled)[:, 1]
y_pred_class = model.predict(X_test_scaled)

# display predictions
print("Predicted probabilities:", y_pred_prob[:5])
print("Predicted classes:", y_pred_class[:5])

Predicted probabilities: [0.09349518 0.21617122 0.63030197 0.88899024 0.51316254]
Predicted classes: [0 0 1 1 1]
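
The link between the two arrays is that, for a binary model fitted with default settings, the predicted probability is the logistic (sigmoid) function of the linear score, and the predicted class is 1 when that probability exceeds 0.5. The following sketch checks this on synthetic data from `make_classification`; the data and the `demo_model`, `manual_*` names are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the housing data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Linear score: intercept + sum of coefficient * feature
linear_score = demo_model.intercept_ + X_demo @ demo_model.coef_.ravel()

# Sigmoid of the score gives P(class = 1); thresholding at 0.5 gives the class
manual_prob = expit(linear_score)
manual_class = (manual_prob > 0.5).astype(int)

print("Probabilities match:", np.allclose(manual_prob, demo_model.predict_proba(X_demo)[:, 1]))
print("Classes match:", np.array_equal(manual_class, demo_model.predict(X_demo)))
```

This is why the third and fifth predictions above, with probabilities just over 0.5, are still assigned to class 1.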


### Step 7

Give the confusion matrix of `y_test` and `y_pred_class`.

In [10]:
# CodeGrade step7
# confusion matrix
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred_class))

Confusion matrix:
 [[2478  591]
 [ 810 2313]]


### Step 8

Rounding to four decimal places, give the accuracy score of `y_test` and `y_pred_class`.

In [11]:
# CodeGrade step8
print("Accuracy Score:", round(accuracy_score(y_test, y_pred_class), 4))

Accuracy Score: 0.7737
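
The accuracy can also be recovered by hand from the Step 7 confusion matrix, whose rows are the true classes and whose columns are the predicted classes in scikit-learn's convention. The sketch below copies the four counts from that output and also derives precision and recall for class 1, which the assignment does not ask for; treat those two as an optional extra.

```python
import numpy as np

# Counts copied from the Step 7 output: rows = true class, columns = predicted class
cm = np.array([[2478, 591],
               [810, 2313]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # of predicted 1s, the share that are truly 1
recall = tp / (tp + fn)          # of true 1s, the share that were predicted 1

print("Accuracy: ", round(float(accuracy), 4))
print("Precision:", round(float(precision), 4))
print("Recall:   ", round(float(recall), 4))
```

The accuracy computed this way agrees with the value returned by `accuracy_score`.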


### Step 9

Rounding to 3 decimal places for each, give the VIFs for each of the three columns of `X_train_scaled`.

In [12]:
# CodeGrade step9
# check multicollinearity using VIF
vif = pd.DataFrame()
vif['Variable'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train_scaled.values, i) for i in range(X_train_scaled.shape[1])]

print(round(vif, 3))

   Variable    VIF
0    MedInc  1.118
1  AveRooms  1.117
2  AveOccup  1.001
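
For reference, the VIF of a predictor is $1 / (1 - R^2)$, where $R^2$ comes from regressing that predictor on the remaining ones. The sketch below compares a hand-rolled version with `variance_inflation_factor` on synthetic, standardized data. The data and the helper `manual_vif` are assumptions made for illustration. Note that the statsmodels function does not add an intercept itself; the two versions agree here because standardized columns are already centered.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors (not the housing data): x2 is correlated with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.6 * x1 + rng.normal(size=500)
x3 = rng.normal(size=500)
X_demo = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))


def manual_vif(X, i):
    """Hypothetical helper: VIF of column i as 1 / (1 - R^2)."""
    others = np.delete(X, i, axis=1)
    target = X[:, i]
    r_squared = LinearRegression().fit(others, target).score(others, target)
    return 1.0 / (1.0 - r_squared)


manual = [manual_vif(X_demo, i) for i in range(X_demo.shape[1])]
library = [variance_inflation_factor(X_demo, i) for i in range(X_demo.shape[1])]

print("Manual VIFs: ", np.round(manual, 3))
print("Library VIFs:", np.round(library, 3))
```

VIFs close to 1, like those reported above for the three housing predictors, indicate little multicollinearity.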
