# Machine learning: Classification
## Part 3. Neural networks and logistic regression
## Lecture objectives

1. Learn how to estimate neural network models
2. Learn how to estimate logistic regression models
3. More practice with train-test splits and assessing model performance
4. Learn how to standardize data

In the previous video lecture, we estimated a random forests model using `scikit-learn`. 

Here, we'll explore other machine learning algorithms.

Most have almost identical syntax, so once you are familiar with one model it's easy to apply another. However, each has its own hyperparameters, such as the number of trees in a random forest.

To start with, let's do the following:
* load the data we previously saved as a pickle
* recreate the dummy variables
* create a dataframe with the subset of variables that we want to use, and drop the NaNs

In [None]:
# this is the same code as from the last lecture
import pandas as pd
joinedDf = pd.read_pickle('joined_permits.pandas')

dummies1 = pd.get_dummies(joinedDf.UseType, prefix='usetype_')  # creates a dataframe of dummies
dummies2 = pd.get_dummies(joinedDf.UseDescription, prefix='usedesc_')
joinedDf = joinedDf.join(dummies1).join(dummies2) 

xvars = (dummies1.columns.tolist() + dummies2.columns.tolist() + 
            ['YearBuilt1', 'Units1', 'Bedrooms1', 'Bathrooms1', 'SQFTmain1', 
             'Roll_LandValue', 'Roll_ImpValue', 'Roll_LandBaseYear', 
             'Roll_ImpBaseYear', 'CENTER_LAT', 'CENTER_LON' ])
yvar = 'hasADU'

# create a dataframe with no NaNs
df_to_fit = joinedDf[xvars+[yvar]].dropna()
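
As a quick aside, here is a small sketch of what `pd.get_dummies` does with a prefix. The toy values below are made up for illustration and are not from the permits data. Because pandas adds its own `_` separator after the prefix, a prefix of `'usetype_'` gives column names with a double underscore, which is why a name like `usedesc__Single` shows up later in this lecture.

```python
import pandas as pd

# Hypothetical toy data; the real UseType values come from the permits dataset
toy = pd.DataFrame({'UseType': ['Residential', 'Commercial', 'Residential', None]})

# prefix='usetype_' plus the default separator '_' gives a double underscore
toy_dummies = pd.get_dummies(toy.UseType, prefix='usetype_')
print(toy_dummies.columns.tolist())

# A row with a missing category gets all-False (or all-zero) dummies, not NaN
print(toy_dummies)
```

So the `dropna()` above removes rows with missing values in the numeric columns; a missing use type does not produce a NaN in the dummy columns.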

Let's also import the relevant `scikit-learn` functions.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay

## Standardizing data
Many machine learning algorithms are more robust if we *standardize* the data: for each variable, subtract the mean and divide by the standard deviation. This puts each variable on a common scale.

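It didn't matter for random forests, which split on one variable at a time and so are unaffected by rescaling. But it does for neural networks, whose training is sensitive to the scale of the inputs.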
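
To make "subtract the mean and divide by the standard deviation" concrete, here is a minimal sketch on made-up numbers (the `sqft` and `bedrooms` columns are hypothetical). It standardizes by hand and checks that the result matches scikit-learn's `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values, chosen only to show two very different scales
toy = pd.DataFrame({'sqft': [800.0, 1200.0, 1500.0, 2500.0],
                    'bedrooms': [1.0, 2.0, 3.0, 4.0]})

# By hand: subtract the column mean, divide by the column standard deviation.
# ddof=0 (the population standard deviation) matches what StandardScaler uses.
by_hand = (toy - toy.mean()) / toy.std(ddof=0)

# The same thing with scikit-learn (this returns a numpy array)
by_sklearn = StandardScaler().fit_transform(toy)

print(np.allclose(by_hand.values, by_sklearn))
```

One detail to be aware of: pandas' `.std()` and `.describe()` default to the sample standard deviation (`ddof=1`), so a column scaled by `StandardScaler` can show a standard deviation slightly different from exactly 1 in pandas output.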

Let's do this. Note that we need to exclude the dummy variable columns and the dependent variable.

To identify them, we'll use a Python [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp). 

In [None]:
# Example: add a suffix to each column
oldcols = ['units', 'squarefeet','lotsize']
newcols = []
for col in oldcols:
    newcols.append(col+'_2020')
print(newcols)

In [None]:
# This is much cleaner as a list comprehension
[col+'_2020' for col in oldcols]

In [None]:
cols_to_exclude = [col for col in df_to_fit.columns 
                if col.startswith('usetype_') or col.startswith('usedesc_') 
                or col=='hasADU']

# this is the same as
cols_to_exclude = []
for col in df_to_fit.columns:
    if col.startswith('usetype_') or col.startswith('usedesc_') or col=='hasADU':
        cols_to_exclude.append(col)


Now let's create a list of the columns that we *don't* want to exclude. 

We can use a list comprehension again: we'll include that column in the new list if the condition (`col not in cols_to_exclude`) is `True`.

In [None]:
otherCols = [col for col in df_to_fit.columns if col not in cols_to_exclude]
otherCols

Now let's scale `otherCols`. Note that the `StandardScaler` returns a numpy array, not a pandas DataFrame. So we need to convert the array to a dataframe and specify the column names and the index.

In [None]:
# see https://scikit-learn.org/stable/modules/preprocessing.html for standardization
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(df_to_fit[otherCols])

# convert to DataFrame and specify the column names and index
df_scaled = pd.DataFrame(scaler.transform(df_to_fit[otherCols]), 
                         columns=otherCols, index=df_to_fit.index)

# create a DataFrame with these scaled columns joined to the columns that we didn't scale
df_scaled = df_scaled.join(df_to_fit[cols_to_exclude])

df_scaled.head()

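We can see that the standardization works: each scaled column now has a mean of approximately 0 and a standard deviation of approximately 1, while the dummy variables and `hasADU` are unchanged.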

In [None]:
df_scaled.describe()

We'll do our train/test split as before.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
      df_scaled[xvars], df_scaled[yvar], test_size = 0.25, random_state = 1)
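
A common variation, which is *not* what we do in this lecture, is to split first and then fit the scaler on the training rows only, so that the test set plays no part in computing the means and standard deviations. The sketch below shows that ordering on synthetic stand-in data. The column names, the random values, and the name `toy_scaler` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns, not the permits dataset)
rng = np.random.default_rng(1)
toy = pd.DataFrame({'sqft': rng.normal(1500, 400, 200),
                    'year': rng.normal(1960, 20, 200),
                    'hasADU': rng.integers(0, 2, 200)})
num_cols = ['sqft', 'year']

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[num_cols], toy['hasADU'], test_size=0.25, random_state=1)

# Learn the mean and standard deviation from the training rows only...
toy_scaler = StandardScaler().fit(X_tr)

# ...then apply that same transformation to both sets
X_tr_scaled = pd.DataFrame(toy_scaler.transform(X_tr), columns=num_cols, index=X_tr.index)
X_te_scaled = pd.DataFrame(toy_scaler.transform(X_te), columns=num_cols, index=X_te.index)

print(X_tr_scaled.mean().round(3).tolist())
print(X_te_scaled.mean().round(3).tolist())
```

With this ordering, the training columns have a mean of zero, while the test columns are only close to zero because they were scaled with the training statistics.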

And estimate our neural network model. 

Note that the workflow and syntax are very similar to those for random forests:
* Initialize the classifier object - here, we call it `mlp`
* Fit to the data
* Predict

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
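
To underline how consistent the initialize-fit-predict pattern is, here is a sketch that runs three classifiers through the same loop. It uses synthetic data from `make_classification` rather than our permits data, and the hyperparameter values and the names `toy_models` and `toy_scores` are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (not the permits dataset)
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=1)

# Step 1: initialize each classifier object
toy_models = {
    'random forest': RandomForestClassifier(n_estimators=50, random_state=1),
    'neural network': MLPClassifier(hidden_layer_sizes=(10, 10, 10),
                                    max_iter=1000, random_state=1),
    'logistic regression': LogisticRegression(max_iter=1000),
}

toy_scores = {}
for name, model in toy_models.items():
    model.fit(X_tr, y_tr)                 # Step 2: fit to the training data
    toy_pred = model.predict(X_te)        # Step 3: predict on the test data
    toy_scores[name] = accuracy_score(y_te, toy_pred)

print(toy_scores)
```

Only the initialization line differs between models; the `fit` and `predict` calls are identical.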

How did we do?

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
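
If it's not obvious how the numbers in the classification report relate to the confusion matrix, the sketch below works through a tiny made-up example. The labels in `toy_true` and `toy_pred` are hypothetical. It pulls the four cells out of the matrix and computes accuracy, precision and recall by hand.

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = has an ADU, 0 = no ADU
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of the predicted ADUs, how many are real?
recall = tp / (tp + fn)                     # of the real ADUs, how many did we find?

print(tn, fp, fn, tp)
print(accuracy, precision, recall)

# These match scikit-learn's own metric functions
print(accuracy_score(toy_true, toy_pred),
      precision_score(toy_true, toy_pred),
      recall_score(toy_true, toy_pred))
```

When the positive class is rare, as with ADUs, accuracy alone can look good even if recall is poor, which is why it helps to look at the full matrix.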

Interestingly, we get very similar results to the random forests. Perhaps this indicates the inherent unpredictability of ADUs, given how rarely they are constructed. Or we might be able to do better with additional predictors or by adjusting the hyperparameters.

## Logistic regression
As a point of comparison, how would a more traditional logistic regression fare?

Many different regression estimators are implemented in `scikit-learn`, and the syntax should be familiar by now. Note that standardizing the data, as we did for the neural network, helps here too.



In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

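Note that it doesn't even converge! (`scikit-learn` reports this as a `ConvergenceWarning`, meaning the solver reached its iteration limit before settling on a solution.) Methods like logistic regression don't handle highly correlated variables very well.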
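
Another thing that is sometimes tried is raising the iteration limit through the `max_iter` argument, which defaults to 100. The sketch below shows the syntax on synthetic data. We haven't tested whether it resolves the convergence problem on the permits data, and the names `X_toy`, `y_toy` and `lr_toy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the permits dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=1)

# max_iter defaults to 100; a larger cap gives the solver more room to converge
lr_toy = LogisticRegression(max_iter=1000)
lr_toy.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used
print(lr_toy.n_iter_)
```

If `n_iter_` equals `max_iter`, the solver stopped because it hit the cap rather than because it converged.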

We might be able to do better with a smaller set of predictors.

In [None]:
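# a smaller set of predictors: the numeric variables plus a single dummy
# (note the double underscore: the prefix 'usedesc_' plus the '_' separator)
xvars = ['YearBuilt1', 'Units1', 'Bedrooms1', 'Bathrooms1', 'SQFTmain1', 'Roll_LandValue',
         'Roll_ImpValue', 'Roll_LandBaseYear', 'Roll_ImpBaseYear', 'CENTER_LAT', 'CENTER_LON', 'usedesc__Single']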

lr = LogisticRegression()
lr.fit(X_train[xvars], y_train)
y_pred = lr.predict(X_test[xvars])
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

Not so great, eh? So our random forest and neural network approaches look much better by comparison.

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>There are several different approaches to machine learning. Random forests and neural networks are two of the most popular.</li>
  <li>scikit-learn provides a consistent syntax: initialize-fit-predict. So once you've done one ML model, others are much simpler.</li>
  <li>Confusion matrices are an excellent way to assess predictive performance.</li>
</ul>
</div>