# Machine learning: Classification

## Part 1. Data preparation

## Lecture objectives

1. Introduce the principles of machine learning
2. More practice with data wrangling

Machine learning is a very general term that covers many parts of data science. Here, we will look at two specific problems that machine learning is well equipped to handle:
* Classification (this week)
* Clustering (next week)

As a broad generalization, machine learning-based classification focuses on *prediction*. For example: [which neighborhoods are likely to gentrify](https://journals.sagepub.com/doi/abs/10.1177/0042098018789054)? [Which facilities are likely to be violating environmental standards?](https://www.nature.com/articles/s41893-018-0142-9) What is demand likely to be at a new bikeshare station? [What is the race and gender of an author on a course reading list](http://syllabusdiversity.org)?

There are also applications that raise more serious concerns about ethics and justice (yes, [predictive policing](https://www.technologyreview.com/2020/07/17/1005396/predictive-policing-algorithms-racist-dismantled-machine-learning-bias-criminal-justice/), I'm talking about you). We'll come back to these issues in a couple of weeks.

Machine learning is less successful with questions of *causation* and *hypothesis testing*. Here, a statistical approach (frequentist or Bayesian) is likely to be more appropriate, although there is quite a bit of overlap between "statistics" and "machine learning."

There are at least three widely used approaches to classification.
* Logistic regression. This is often used in a more statistical setting, but is the starting point for much machine learning analysis. 
* Random forests. We'll focus on this technique.
* Neural networks. Often used for image recognition, this can be a "black box" approach to prediction and classification.

Important: machine learning is a very large field, and there are entire courses on its theory and applications. Here, we will give a very high-level overview. We'll focus on the big-picture applicability of machine learning techniques, and on actually implementing them in Python. We'll skate over the theoretical underpinnings and the details of the various algorithms.

## Example: ADUs in LA
The example we will use is whether property owners construct Accessory Dwelling Units (ADUs) in the City of Los Angeles. You might imagine that a predictive approach could be useful to planners and policymakers. Not least, they could predict future ADU growth, and the neighborhoods where ADUs are most likely to be built.

We can obtain the data from the City's building permits database (which tells us whether or not an ADU was built), and the County Assessor parcel database (which provides covariates such as lot size). Because both of these datasets are very large, I preprocessed them and saved a slimmed-down version that is in your GitHub folder. Specifically, I extracted a subset of fields, limited the building permits to those that include an ADU, and limited the parcels to those in the City of LA.

## Wrangling the data
We have two input data files: permits and parcels. The aim: add a column to the parcels dataframe that is `True` if an ADU has been permitted on that parcel, and `False` otherwise.
    
Even with this preprocessing, there is some work to do in joining the datasets together. 

In [None]:
import pandas as pd

# get building permit data
# this is an abbreviated version of the data here (>500 MB):
# https://data.lacity.org/City-Infrastructure-Service-Requests/Building-and-Safety-Permit-Information-Old/yv23-pmwf

# this code was used to read in the data and save a subset (ADU permits only)
# that is manageable in size
if 0:  # if 0 means this block won't be executed (because 0 is False)
    cols_to_use = ['Assessor Book', 'Assessor Page', 'Assessor Parcel', '# of Accessory Dwelling Units']
    df = pd.read_csv('Building_and_Safety_Permit_Information_Old.csv', usecols=cols_to_use)
    df = df[df['# of Accessory Dwelling Units']>0]
    df.to_csv('ADU_permits.csv', index=False)

permits = pd.read_csv('data/ADU_permits.csv')  # this file should be in your GitHub folder
permits.head()    

In [None]:
# original data: https://egis-lacounty.hub.arcgis.com/datasets/parcels
# this code was used to read in the data and save a subset 
# (City of LA only, subset of columns) that is manageable in size

import geopandas as gpd

if 0: # if 0 means this block won't be executed
    gdf = gpd.read_file('/Users/adammb/Desktop/LACounty_Parcels.gdb', driver='FileGDB', layer='LACounty_Parcels')
    gdf.dropna(subset=['SitusCity'], inplace=True)
    gdf = gdf[gdf['SitusCity'].str.startswith('LOS ANGELE')]
    cols_to_use = ['APN', 'UseType', 'UseDescription','YearBuilt1', 'Units1','Bedrooms1', 'Bathrooms1', 
         'SQFTmain1','Roll_LandValue', 'Roll_ImpValue', 'Roll_LandBaseYear', 'Roll_ImpBaseYear', 'CENTER_LAT', 'CENTER_LON']
    parceldf = pd.DataFrame(gdf)[cols_to_use]  # drops the geometry column as well
    parceldf.to_csv('parcels.csv', index=False)
    del gdf   # frees up space

parcels = pd.read_csv('data/parcels.csv')
parcels.head()

Note that the `APN` column in `parcels` has a format that corresponds to three columns in `permits`: `Assessor Book`-`Assessor Page`-`Assessor Parcel`. 

So the first step is to create this `APN` column in `permits`.

We first convert each column to an integer (to drop the decimal point), then to a string, then pad with zeros, and then concatenate the columns separated by `-`.
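To see what each step does, here is a minimal sketch on a toy dataframe. The values are made up for illustration and are not taken from the real permits file; the column names match `permits`.

```python
import pandas as pd

# Made-up values for illustration only (not from the real permits file)
toy = pd.DataFrame({
    'Assessor Book': [5041.0, 212.0],
    'Assessor Page': [12.0, 3.0],
    'Assessor Parcel': ['7', '015'],
})

# float -> int drops the decimal point; int -> str lets us pad with zeros
book = toy['Assessor Book'].astype(int).astype(str).str.zfill(4)
page = toy['Assessor Page'].astype(int).astype(str).str.zfill(3)
parcel = toy['Assessor Parcel'].astype(int).astype(str).str.zfill(3)

# concatenate the three pieces, separated by '-'
toy_apn = book + '-' + page + '-' + parcel
print(toy_apn.tolist())
```

Note how `212.0` becomes `0212`: without the padding, the result would not match the format of the `APN` column in `parcels`.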

In [None]:
# build the APN column from its three parts (this first attempt will fail; see below)
permits['APN'] = (permits['Assessor Book'].astype(int).astype(str).str.zfill(4) + '-' 
                   + permits['Assessor Page'].astype(int).astype(str).str.zfill(3) + '-'
                   + permits['Assessor Parcel'].astype(int).astype(str).str.zfill(3))

What happened? The cell raises an error. Note two things:
* The problem is that we are trying to convert `'***'` to an integer.
* The error is being caused by the `permits['Assessor Parcel'].astype(int)` part of the code.

So let's look at those rows.
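If the error message were less helpful, one way to hunt for values that can't be converted is `pd.to_numeric` with `errors='coerce'`, which turns anything non-numeric into `NaN`. This is a sketch on made-up data, not the approach the rest of the lecture relies on.

```python
import pandas as pd

# Made-up values for illustration; '***' stands in for a missing parcel number
toy = pd.Series(['001', '***', '027', '***'], name='Assessor Parcel')

# values that can't be parsed as numbers become NaN instead of raising an error
as_numbers = pd.to_numeric(toy, errors='coerce')

# keep only the original values that failed to convert
bad_values = toy[as_numbers.isna()]
print(bad_values.value_counts())
```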

In [None]:
permits[permits['Assessor Parcel']=='***']

It seems like the parcel number is just missing in these rows, so let's drop them.

Note the `!=` operator means "not equal to." So we are keeping the rows that are *not* `***`.

In [None]:
permits = permits[permits['Assessor Parcel']!='***'].copy()  # .copy() avoids a SettingWithCopyWarning when we add a column later

Now let's try to create the column again.

In [None]:
permits['APN'] = (permits['Assessor Book'].astype(int).astype(str).str.zfill(4) + '-' 
                   + permits['Assessor Page'].astype(int).astype(str).str.zfill(3) + '-'
                   + permits['Assessor Parcel'].astype(int).astype(str).str.zfill(3))
permits.head()

<div class="alert alert-block alert-info">
<strong>Question:</strong> What type of join do we want? Left? Right? Inner? Outer? 1:1? 1:many?
</div>

Note two things:
* We need to keep all of the parcels, even if there isn't a corresponding permit. Otherwise, we can't do any prediction: we'd have a dataset where *every* parcel has an ADU. That implies a left join, with the parcels dataframe on the left.
* We don't want to duplicate parcels. So let's drop any duplicates (on the `APN` column) in both the permits and parcels dataframes. That will guarantee a 1:1 join.
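To see why duplicates matter, here is a small sketch with made-up dataframes (`toy_parcels`, `toy_permits`, `lot_size` and `n_adus` are hypothetical names). It uses `merge` rather than `join` so that we can pass `validate='one_to_one'`, which asks pandas to raise an error if the keys turn out not to be unique.

```python
import pandas as pd

# Made-up data: three parcels, and two permits for the same parcel
toy_parcels = pd.DataFrame({'APN': ['A', 'B', 'C'],
                            'lot_size': [5000, 6200, 4100]})
toy_permits = pd.DataFrame({'APN': ['B', 'B'], 'n_adus': [1, 1]})

# A left join keeps every parcel, but the duplicate permit duplicates parcel B
dup_join = toy_parcels.merge(toy_permits, on='APN', how='left')
print('Parcels: {}, rows after join: {}'.format(len(toy_parcels), len(dup_join)))

# Dropping duplicate keys first gives one row per parcel;
# validate='one_to_one' would raise an error if duplicates remained
clean_join = toy_parcels.merge(toy_permits.drop_duplicates(subset='APN'),
                               on='APN', how='left', validate='one_to_one')
print('Rows after 1:1 join: {}'.format(len(clean_join)))
```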

Let's first check to see if duplicates exist.

In [None]:
permits.APN.is_unique

In [None]:
parcels.APN.is_unique

There are two ways to drop duplicates. One is the [pandas `drop_duplicates()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html).

But sometimes it's easier to use `groupby`, and then take the first in each group. If there is only one row in a group, it will be returned unchanged.

A byproduct of using `groupby` on the `APN` column is that `APN` is now our index. That will make the join easier.
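The two approaches usually give the same result, but there is one difference worth knowing: `groupby(...).first()` returns the first *non-missing* value of each column within a group, while `drop_duplicates()` keeps the first row as it is. A sketch on made-up data (the `YearBuilt1` values are invented):

```python
import pandas as pd

# Made-up data: parcel A appears twice, and its first row has a missing value
toy = pd.DataFrame({'APN': ['A', 'A', 'B'],
                    'YearBuilt1': [float('nan'), 1950.0, 1987.0]})

# drop_duplicates keeps the whole first row for A, including the NaN
by_drop = toy.drop_duplicates(subset='APN').set_index('APN')

# groupby().first() skips the NaN and takes the first non-missing value
by_group = toy.groupby('APN').first()

print(by_drop)
print(by_group)
```

For our purposes either behavior is fine, since we only need one row per `APN`.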

In [None]:
# in the permits, take the first row of any duplicates for convenience
print('Before dropping duplicates: {}'.format(len(permits)))
permits = permits.groupby('APN').first()
print('After dropping duplicates: {}'.format(len(permits)))
permits.index.is_unique  # make sure the index (APN) is unique

In [None]:
permits.head()

In [None]:
print('Before dropping duplicates: {}'.format(len(parcels)))
parcels = parcels.groupby('APN').first()
print('After dropping duplicates: {}'.format(len(parcels)))
parcels.index.is_unique  # make sure the index (APN) is unique

In [None]:
parcels.head()

Now let's do the join.

In [None]:
joinedDf = parcels.join(permits, how='left') # left is the default so we could omit that argument
print('N parcels: {}'.format(len(joinedDf)))
print('N joined: {}'.format(joinedDf['# of Accessory Dwelling Units'].count()))
joinedDf.head()

That seems good enough. We join almost all of the permits to the parcels dataframe. 
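If you want to see which permits failed to join, one option is to check each permit's `APN` against the parcels index. This is a sketch on made-up dataframes (all names and values below are hypothetical); on the real data you would use `permits` and `parcels` after the `groupby` step, when `APN` is the index of both.

```python
import pandas as pd

# Made-up data, indexed by APN like our real dataframes after the groupby step
toy_parcels = pd.DataFrame({'lot_size': [5000, 6200, 4100]},
                           index=pd.Index(['A', 'B', 'C'], name='APN'))
toy_permits = pd.DataFrame({'n_adus': [1, 1]},
                           index=pd.Index(['B', 'Z'], name='APN'))

# True for each permit whose APN also appears in the parcels index
matched = toy_permits.index.isin(toy_parcels.index)
print('Permits matched: {} of {}'.format(matched.sum(), len(toy_permits)))

# the APNs that would be lost in the left join
unmatched_apns = toy_permits.index[~matched].tolist()
print(unmatched_apns)
```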

Now let's create a column that is `False` if there is no ADU (i.e., if the permit data did not join), and `True` otherwise.

We'll use our `lambda` function again. If the value of the column is Null (using the handy `pd.isnull`), we'll return `False`. Otherwise, `True`.
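As an aside, pandas also has a built-in `notnull()` method that should give the same result without a `lambda`. A quick sketch on a made-up Series:

```python
import pandas as pd

# Made-up values: NaN stands in for a parcel with no joined permit
toy = pd.Series([1.0, float('nan'), 2.0])

# the lambda approach: check each value one at a time
with_lambda = toy.apply(lambda x: False if pd.isnull(x) else True)

# the built-in approach: check the whole column at once
vectorized = toy.notnull()

print(with_lambda.tolist())
print(vectorized.tolist())
```

The built-in version is typically faster on large dataframes, but the `lambda` version generalizes to more complicated rules.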

In [None]:
joinedDf['hasADU'] = joinedDf['# of Accessory Dwelling Units'].apply(
                        lambda x: False if pd.isnull(x) else True)
joinedDf.head()

Let's stop there for now. We'll save the data so that we can reload it at the start of the next video lecture.

You could save it as a `csv`. But we can also save the pandas DataFrame object itself by "pickling" it. This is convenient when you want to save something temporarily, but it's not advisable for long-term archiving or sharing your work.

In [None]:
joinedDf.to_pickle('joined_permits.pandas')
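To load a pickled dataframe again, use `pd.read_pickle`. Here is a sketch of the round trip on a made-up dataframe, written to a temporary folder so that it doesn't touch your real files. Unlike a `csv`, the pickle preserves the index and the column types.

```python
import os
import tempfile

import pandas as pd

# Made-up dataframe, indexed by APN like joinedDf
toy = pd.DataFrame({'hasADU': [True, False]},
                   index=pd.Index(['A', 'B'], name='APN'))

# save to a temporary folder (hypothetical file name), then load it back
path = os.path.join(tempfile.mkdtemp(), 'toy.pandas')
toy.to_pickle(path)
reloaded = pd.read_pickle(path)

print(reloaded.equals(toy))
print(reloaded.dtypes)
```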

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Machine learning is particularly valuable for prediction, and when there are many highly correlated variables.</li>
  <li>Data wrangling is almost always your first step, and joins will come with practice.</li>
</ul>
</div>