# STEM Presentation: Predictive Modeling

Predictive modeling underlies much of the cutting-edge software coming out of
Silicon Valley.

Let's start with a basic example of how we perform some simple data analysis
using Python.

Let us begin with some data. Our data is in a CSV file. A CSV (comma-separated
values) file is simply a text file in which each line holds one row of data and
the values within a row are separated by commas.
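To make the format concrete, here is a minimal sketch that parses a tiny piece of CSV text with Python's standard `csv` module. The text in `sample_csv_text` is made up for illustration and is not taken from Boston.csv; only the column names 'crim' and 'chas' come from the data set used below.

```python
import csv
import io

# Made-up sample text for illustration only; these are not rows from Boston.csv.
sample_csv_text = "town,crim,chas\nA,0.02,0\nB,1.35,1\n"

# Each line is one row; commas separate the values within a row.
# The first line is treated as the header that names the columns.
reader = csv.DictReader(io.StringIO(sample_csv_text))
rows = list(reader)

for row in rows:
    print(row)
```

Note that the `csv` module returns every value as a string. `read_csv` in pandas does the same kind of parsing and additionally converts numeric columns to numbers.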

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data; Boston.csv must be in the working directory.
boston_crime_data = pd.read_csv('Boston.csv')

# Show the first five rows.
print(boston_crime_data.head())

As we can see, each row of data holds information for a different town, and
each column holds a different variable. The data contains per-capita crime
rates by town, along with other town-level information such as the age of
houses, whether or not a town sits on the Charles River, the property tax rate,
the pupil-teacher ratio, the proportion of land zoned for large residential
lots, etc.

Before we consider building a model, let's take a look at our data. Let's focus
on determinants of crime. Our outcome variable of interest will be 'crim'.
Let's plot the distribution of it and take a look.

In [None]:
plt.hist(boston_crime_data['crim'], bins=10)

As we can see, most of the observations are concentrated near 0. Why would we
expect this with something like crime rate?
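One way to put a number on what the histogram shows is to compare the mean with the median. The sketch below is an illustration, not part of the original analysis: `summarize_skew` is a hypothetical helper, and the values in `example` are made up rather than taken from Boston.csv.

```python
import numpy as np

def summarize_skew(values):
    """Return the mean, the median, and the share of values below the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    median = np.median(values)
    share_below_mean = np.mean(values < mean)
    return {"mean": mean, "median": median, "share_below_mean": share_below_mean}

# Made-up values for illustration: many small rates and one large one.
example = [0.01, 0.02, 0.05, 0.1, 0.3, 9.5]
summary = summarize_skew(example)
print(summary)
```

When most values sit near 0 and a few are large, the mean is pulled well above the median. On the real data the same helper could be called as `summarize_skew(boston_crime_data['crim'])`.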

Now, let's take a look at some of our other variables.
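A quick way to look at several variables at once is a table of summary statistics. The following is a minimal sketch: `summarize_columns` is a hypothetical helper, and the `toy` table holds made-up numbers, not rows from Boston.csv. Only the column names 'age' and 'ptratio' come from the data set.

```python
import pandas as pd

def summarize_columns(data, columns):
    """Return count, mean, spread, and quartiles for the chosen columns."""
    return data[columns].describe()

# Made-up rows for illustration only; not taken from Boston.csv.
toy = pd.DataFrame({
    'age': [10.0, 40.0, 70.0, 100.0],
    'ptratio': [14.0, 16.0, 18.0, 20.0],
})
toy_summary = summarize_columns(toy, ['age', 'ptratio'])
print(toy_summary)
```

On the real data, this could be called as `summarize_columns(boston_crime_data, ['age', 'chas', 'zn', 'ptratio'])`.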

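Given that the 'crim' column is distributed so close to 0, linear predictive
modeling is difficult to perform: a linear model can predict negative values,
but negative rates are not possible. One common remedy is to take the logarithm
of 'crim', which spreads out the values near 0: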

In [None]:
plt.hist(np.log(boston_crime_data['crim']), bins=10)
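The log transform used above has a useful property for rates: any real number on the log scale maps back to a strictly positive rate. The sketch below illustrates this with made-up numbers (`rates` and `hypothetical_log_predictions` are not taken from Boston.csv or from any fitted model). It assumes every rate is strictly positive, since `np.log(0)` is negative infinity.

```python
import numpy as np

# Made-up positive rates for illustration; not taken from Boston.csv.
rates = np.array([0.01, 0.1, 1.0, 10.0])

# On the log scale the values can be negative, zero, or positive.
log_rates = np.log(rates)

# Exponentiating undoes the log and recovers the original rates.
recovered = np.exp(log_rates)

# Any real-valued number on the log scale maps back to a positive rate.
hypothetical_log_predictions = np.array([-8.0, -1.0, 0.0, 3.0])
implied_rates = np.exp(hypothetical_log_predictions)

print(log_rates)
print(implied_rates)
```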

Now let's try building a simple model:

In [None]:
import statsmodels.formula.api as smf
model = smf.ols(formula='crim ~ age + chas + zn + ptratio',
                data=boston_crime_data)
model_fit = model.fit()
print(model_fit.summary())
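The model above uses 'crim' directly as the outcome. Since we saw that the log of 'crim' is better behaved, one possible variation is to model the log instead. The following is only a sketch under stated assumptions: `fit_log_crime_model` is a hypothetical helper, and `synthetic` is randomly generated stand-in data with the same column names, used here so the example runs without Boston.csv. Its coefficients say nothing about the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_log_crime_model(data):
    """Fit OLS with log(crim) as the outcome and the same four predictors."""
    model = smf.ols(formula='np.log(crim) ~ age + chas + zn + ptratio',
                    data=data)
    return model.fit()

# Randomly generated stand-in data; NOT the Boston data.
rng = np.random.default_rng(0)
n = 200
synthetic = pd.DataFrame({
    'age': rng.uniform(0, 100, n),
    'chas': rng.integers(0, 2, n),
    'zn': rng.uniform(0, 100, n),
    'ptratio': rng.uniform(12, 22, n),
})
# By construction, log(crim) depends on age only, plus noise.
synthetic['crim'] = np.exp(0.02 * synthetic['age'] - 2.0
                           + rng.normal(0, 0.5, n))

log_fit = fit_log_crime_model(synthetic)

# Predictions are on the log scale; exponentiating gives positive rates.
predicted_rates = np.exp(log_fit.predict(synthetic))
print(log_fit.params)
```

On the real data, this would be called as `fit_log_crime_model(boston_crime_data)`, assuming 'crim' contains no zeros.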