# STEM Presentation: Predictive Modeling

Predictive modeling powers much of the cutting-edge software that comes out of
Silicon Valley.

Let's start with a basic example of how we can perform some data analysis using
Python.

Let us begin with some data. Our data is in a CSV (comma-separated values)
file. A CSV file is simply a text file in which each line is one row of data
and the values within a row are separated by commas.
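
To make the file layout concrete, here is a minimal sketch that parses a tiny CSV string with Python's standard `csv` module. The sample text and its values are made up for illustration and are not taken from `Boston.csv`.

```python
import csv
import io

# Made-up sample laid out like a CSV file: a header line, then one line per row.
sample_csv = "town,crim,chas\nA,0.02,0\nB,0.35,1\n"

# DictReader uses the first line as column names and splits each later line on commas.
reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)

print(reader.fieldnames)  # the column names from the header line
print(rows[0])            # the first data row, keyed by column name
```

Note that every value comes back as text; a library such as pandas converts numeric columns for us when it reads the file.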

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

boston_crime_data = pd.read_csv('Boston.csv')

print(boston_crime_data.head())

As we can see, each row of data describes a different town, and each column
holds a different variable. This data contains information on per-capita crime
rates by town, along with other information by town such as the age of houses,
whether or not a town sits on the Charles River, the property-tax rate, the
pupil-teacher ratio, the proportion of residential land zoned for large lots,
etc.

Before we consider building a model, let's take a look at our data. Let's focus
on determinants of crime. Our outcome variable of interest will be 'crim'.
Let's plot the distribution of it and take a look.

In [None]:
fig, ax = plt.subplots(figsize=(14, 12))
ax.hist(boston_crime_data['crim'], bins=10)
ax.set_xlabel('Crime Rate (%)', fontsize=16)
ax.set_ylabel('Number of towns', fontsize=16)
ax.tick_params(axis='both', which='major', labelsize=14)

As we can see, most towns have a crime rate close to 0, and only a few towns
have high rates. Why would we expect this with something like crime rate?

Because the values in the 'crim' column are concentrated so close to 0, linear
predictive modeling is difficult to perform: a linear model can predict
negative rates, which are not possible.
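
The problem can be seen in a small sketch. The data here is synthetic (a made-up outcome that decays toward 0), and `np.polyfit` stands in for a linear model. The straight-line fit dips below zero, while a fit on the logarithm of the outcome, transformed back with `np.exp`, cannot.

```python
import numpy as np

# Synthetic, made-up data: a non-negative outcome that decays toward 0.
x = np.arange(0, 11, dtype=float)
y = 10.0 * np.exp(-0.8 * x)

# Straight-line fit on the raw outcome (polyfit returns slope first, then intercept).
slope, intercept = np.polyfit(x, y, deg=1)
linear_pred = slope * x + intercept

# Straight-line fit on log(y), then transform the predictions back with exp.
log_slope, log_intercept = np.polyfit(x, np.log(y), deg=1)
log_pred = np.exp(log_slope * x + log_intercept)

print("smallest linear prediction:", linear_pred.min())
print("smallest log-model prediction:", log_pred.min())
```

The exponential of any real number is positive, so predictions made on the log scale always map back to positive rates.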

Now, let's take a look at some of our other variables:

* crim: per capita crime rate by town.
* zn: proportion of residential land zoned for lots over 25,000 sq.ft.
* indus: proportion of non-retail business acres per town.
* chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
* nox: nitrogen oxides concentration (parts per 10 million).
* rm: average number of rooms per dwelling.
* age: proportion of owner-occupied units built prior to 1940.
* dis: weighted mean of distances to five Boston employment centres.
* rad: index of accessibility to radial highways.
* tax: full-value property-tax rate per \$10,000.
* ptratio: pupil-teacher ratio by town.
* black: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town.
* lstat: lower status of the population (percent).
* medv: median value of owner-occupied homes in \$1000s.

Now let's return to 'crim'. Taking its natural logarithm spreads out the values
bunched near 0, so let's plot the distribution of the log crime rate.

In [None]:
boston_crime_data['crim_log'] = np.log(boston_crime_data['crim'])
fig, ax = plt.subplots(figsize=(14, 12))
ax.hist(boston_crime_data['crim_log'], bins=10)
ax.set_xlabel('Log Crime Rate', fontsize=16)
ax.set_ylabel('Number of towns', fontsize=16)
ax.tick_params(axis='both', which='major', labelsize=14)

Now let's try building a simple model. In this model, we are attempting to
estimate the impact of individual influences on the crime rate, holding all
other relevant influences constant.
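
Written out, the formula used for this model corresponds to the linear specification below. The coefficient symbols $\beta_0, \dots, \beta_4$ and the error term $\varepsilon_i$ are notation introduced here for illustration, with $i$ indexing towns.

$$
\log(\text{crim}_i) = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{chas}_i + \beta_3\,\text{zn}_i + \beta_4\,\text{ptratio}_i + \varepsilon_i
$$

Because the outcome is on a log scale, a one-unit increase in a variable, with the others held constant, is associated with a change of roughly $100\,\beta_j$ percent in the crime rate when $\beta_j$ is small (the exact multiplicative factor is $e^{\beta_j}$).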

In [36]:
import statsmodels.formula.api as smf
model = smf.ols(formula='crim_log ~ age + chas + zn + ptratio',
                data=boston_crime_data)
model_fit = model.fit()
print(model_fit.summary())

Let's see how our model did at predicting within the sample:

In [None]:
predicted_values = model_fit.predict()
fig, ax = plt.subplots(figsize=(14, 12))
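# The model predicts log crime rate, so np.exp converts the predictions back
# to the original scale before comparing them with the observed 'crim' values.
ax.scatter(boston_crime_data['crim'], np.exp(predicted_values))
ax.set_xlabel('Observed crime rate', fontsize=16)
ax.set_ylabel('Predicted crime rate', fontsize=16)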
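
A scatter plot gives a visual impression of fit; one way to put a number on it is the coefficient of determination, R-squared. The sketch below uses a hypothetical helper, `r_squared`, and made-up arrays rather than the Boston data. It only assumes that the observed and predicted values are on the same scale.

```python
import numpy as np

def r_squared(observed, predicted):
    """Share of the variance in `observed` that `predicted` accounts for."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual_ss = np.sum((observed - predicted) ** 2)      # unexplained variation
    total_ss = np.sum((observed - observed.mean()) ** 2)   # total variation
    return 1.0 - residual_ss / total_ss

# Made-up numbers purely for illustration; not taken from Boston.csv.
example_observed = np.array([1.0, 2.0, 3.0, 4.0])
example_predicted = np.array([1.1, 1.9, 3.2, 3.8])
example_r2 = r_squared(example_observed, example_predicted)
print(round(example_r2, 3))
```

The `summary()` output printed earlier also reports an R-squared for the fitted model, measured on the log scale the model was fit on.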