# Week 14: In-class exercise
## 12/22/2021

**1. Using the documentation for Recursive Feature Selection, apply this process to the crime dataset to create the best multivariate linear regression model https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html . You can select what you’re trying to predict, but be sure to indicate what that is. Be sure to explain what RFE is in the markdown. You should be able to answer this using what’s on the documentation page + what you already know.**

**What is RFE?**

Recursive Feature Elimination (RFE) is a feature selection method. It fits a model on all of the features, ranks them by how much they contribute (for a linear model, by the size of their coefficients), removes the least helpful (or 'weakest') one, and repeats on the remaining features until the number you asked for is reached. For example, if you want 3 features in your model and you have 20 to choose from, RFE starts with all 20 and whittles the set down, eliminating the weakest feature each round, until only 3 remain.
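
To make the elimination loop concrete, here is a minimal sketch of the idea using plain NumPy least squares on synthetic data. The data, the `rank_by_elimination` helper, and the choice to rank features by absolute coefficient size are assumptions for illustration only. scikit-learn's `RFE` relies on the estimator's `coef_` or `feature_importances_` and may differ in its details.

```python
import numpy as np

def rank_by_elimination(X, y, n_features_to_select):
    """Hypothetical helper: drop the feature with the smallest absolute
    least-squares coefficient until n_features_to_select remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit a linear model (with intercept) on the remaining features
        design = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Skip the intercept, then remove the weakest remaining feature
        weakest = int(np.argmin(np.abs(coefs[1:])))
        remaining.pop(weakest)
    return remaining

# Synthetic data (made up): only features 1 and 3 actually drive y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 1] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

kept = rank_by_elimination(X_demo, y_demo, n_features_to_select=2)
print(kept)  # column indices of the surviving features
```

Because all synthetic features here share the same scale, comparing raw coefficient sizes is reasonable. With features in different units, the comparison would partly reflect those units.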

In [19]:
# Load in libraries
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols

# Load file
crime_df = pd.read_csv('crime_data.csv')
crime_df.head()

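|   | X1 | X2 | X3 | X4 | X5 | X6 | X7 |
|---|---|---|---|---|---|---|---|
| 0 | 478 | 184 | 40 | 74 | 11 | 31 | 20 |
| 1 | 494 | 213 | 32 | 72 | 11 | 43 | 18 |
| 2 | 643 | 347 | 57 | 70 | 18 | 16 | 16 |
| 3 | 341 | 565 | 31 | 71 | 11 | 25 | 19 |
| 4 | 773 | 327 | 67 | 72 | 9 | 29 | 24 |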


With the code below, we will try to predict the X1 column (total overall reported crime rate per 1 million residents).

We will use RFE to choose the features for the best model we can build for predicting the reported crime rate.

**About the Data set**

The data (X1, X2, X3, X4, X5, X6, X7) are for each city.

X1 = total overall reported crime rate per 1 million residents

X2 = reported violent crime rate per 100,000 residents

X3 = annual police funding in $/resident

X4 = % of people 25 years+ with 4 yrs. of high school

X5 = % of 16 to 19 year-olds not in high school and not high school graduates

X6 = % of 18 to 24 year-olds in college

X7 = % of people 25 years+ with at least 4 years of college

**Compute the RFE**

In [20]:
# Get our predictive variables (everything except X1)
X = crime_df.drop('X1', axis=1)
# Get our outcome variable (only X1)
y = crime_df['X1']

In [21]:
# FROM DOCUMENTATION EXAMPLE: a support vector regression is the estimator
#     estimator = SVR(kernel="linear")
# The kernel sets the form of the model; "linear" fits a linear one

# Use a linear regression as our estimator instead of SVR
estimator = LinearRegression()

# Run the recursive feature elimination (RFE)
# We want to end up with 2 features here
# The step is how many features to remove with each iteration 
selector = RFE(estimator, n_features_to_select=2, step=1)

# Fit the selector to our data above
selector2 = selector.fit(X, y)

# Shows the mask of selected features
print(selector2.support_)

# Print the rankings from our iterations
print(selector2.ranking_)

[False  True  True False False False]
[5 1 1 3 4 2]


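The output above shows the two best features (since we used n_features_to_select=2). In the first array (`support_`), the selected features are marked `True`. The second array (`ranking_`) gives more detail: the two selected features have a ranking of 1 (the best), and the remaining features are ranked in the reverse of the order in which they were eliminated, so the feature ranked 5 was dropped first. Both arrays follow the column order of `X` (X2 through X7). Seeing how the other features ranked could still be helpful for us.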

In this case, RFE is telling us that the two best features for predicting total overall reported crime rate per 1 million residents are X3 (annual police funding in $ per resident) and X4 (percent of people 25 years+ with 4 yrs. of high school).
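
Reading feature names off a boolean mask by position is easy to get wrong, so it can help to let pandas do the lookup. The sketch below uses a made-up DataFrame (columns `A` through `D` and the `demo_` names are assumptions, not the crime data) to show one way to label the mask and the rankings with column names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the crime data (values are made up)
rng = np.random.default_rng(1)
demo_X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])
demo_y = 5.0 * demo_X["B"] - 2.0 * demo_X["D"] + rng.normal(scale=0.1, size=100)

demo_selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
demo_selector.fit(demo_X, demo_y)

# Names of the selected features, taken from the boolean mask
selected = demo_X.columns[demo_selector.support_].tolist()

# Rankings labeled by column name, best (1) first
rankings = pd.Series(demo_selector.ranking_, index=demo_X.columns).sort_values()

print(selected)
print(rankings)
```

Applied to the notebook above, the same pattern would be `X.columns[selector2.support_]`.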

**Apply our findings by creating the best model**

In [22]:
# Build the model
crime_model = ols('X1 ~ X3 + X4', data=crime_df).fit()

# Extract the parameters
print(crime_model.params)

Intercept    621.426036
X3            11.858331
X4            -5.973412
dtype: float64
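
As a quick check on what these parameters mean, the fitted equation is X1 ≈ 621.43 + 11.86·X3 − 5.97·X4. The sketch below plugs in the first row of the data shown earlier (X3 = 40, X4 = 74, actual X1 = 478). The coefficients are copied by hand from the printed output, and the variable names are only for illustration.

```python
# Coefficients copied from the printed crime_model.params output
intercept = 621.426036
coef_x3 = 11.858331   # police funding, $ per resident
coef_x4 = -5.973412   # % of people 25 years+ with 4 yrs. of high school

# First row of crime_df.head(): X1 = 478, X3 = 40, X4 = 74
x3, x4, actual_x1 = 40, 74, 478

predicted_x1 = intercept + coef_x3 * x3 + coef_x4 * x4
residual = actual_x1 - predicted_x1

print(round(predicted_x1, 1))  # 653.7
print(round(residual, 1))      # -175.7
```

For this city the model overpredicts reported crime by roughly 176 per 1 million residents.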


In [24]:
# View summary data
crime_model.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | X1 | R-squared: | 0.325 |
| Model: | OLS | Adj. R-squared: | 0.296 |
| Method: | Least Squares | F-statistic: | 11.3 |
| Date: | Wed, 22 Dec 2021 | Prob (F-statistic): | 9.84e-05 |
| Time: | 22:09:44 | Log-Likelihood: | -344.79 |
| No. Observations: | 50 | AIC: | 695.6 |
| Df Residuals: | 47 | BIC: | 701.3 |
| Df Model: | 2 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 621.4260 | 222.685 | 2.791 | 0.008 | 173.441 | 1069.411 |
| X3 | 11.8583 | 2.568 | 4.618 | 0.000 | 6.692 | 17.024 |
| X4 | -5.9734 | 3.561 | -1.677 | 0.100 | -13.138 | 1.191 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 14.866 | Durbin-Watson: | 1.581 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 16.549 |
| Skew: | 1.202 | Prob(JB): | 0.000255 |
| Kurtosis: | 4.47 | Cond. No. | 453.0 |


**Interpretation**

Even with the two features RFE selected as "best", this model is not very good at predicting reported crime. It explains only 32.5% of the variability in X1 (R-squared = 0.325). X3 (annual police funding in $ per resident) was a significant predictor (p < 0.001), but X4 (percent of people 25 years+ with 4 yrs. of high school) was not (p = 0.100). Clearly, we would need to refine our model to make more accurate predictions of reported crime.
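
One possible refinement, sketched here on synthetic data rather than the crime data, is to stop fixing the number of features in advance and let cross-validation choose it with scikit-learn's `RFECV`. The sketch also standardizes the features first: RFE with a linear estimator ranks features by coefficient size, and coefficient size depends on each feature's units. The data, the `_cv` variable names, and the choice of 5 folds with R-squared scoring are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): features 0-2 drive y, features 3-5 are noise,
# and the columns sit on very different scales
rng = np.random.default_rng(2)
X_raw = rng.normal(size=(150, 6)) * np.array([1, 10, 100, 1, 10, 100])
y_cv = (3.0 * X_raw[:, 0] + 0.5 * X_raw[:, 1] + 0.02 * X_raw[:, 2]
        + rng.normal(scale=0.5, size=150))

# Standardize so coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(X_raw)

# Let 5-fold cross-validated R-squared choose how many features to keep
selector_cv = RFECV(LinearRegression(), step=1, cv=5, scoring="r2")
selector_cv.fit(X_scaled, y_cv)

print(selector_cv.n_features_)  # number of features chosen by cross-validation
print(selector_cv.support_)     # mask of selected features
print(selector_cv.ranking_)     # 1 = selected
```

With only 50 observations in the crime data, cross-validated scores would be noisy, so any result from this approach should be read with caution.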

**2. Create a list of preprocessing steps you should try when working to build a model. Briefly describe what each step is. Work with your group to come up with the most comprehensive list you can.**

- Plot the overall distribution of your data
    - This will help us know if our data are very skewed
    - This will also help us know what further preprocessing steps we need
- Look at what data types you have, how many features, how many rows, etc.
    - How many categorical vs. numeric?
    - Any datetime values? etc.
- Cleaning nulls 
    - removing features that have too many nulls (e.g. over 50%)
- Handle missing data
    - Might want to replace null with 0, mean, median, etc. depending on our data
    - Need to be careful on how this affects our data
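- Standardize your data
    - Rescale each numeric feature to a mean of 0 and a standard deviation of 1 (min-max normalization to a fixed range such as 0 to 1 is a related option)
- Scale your data
    - Transform data to be on the same scale (especially useful for distance-based methods, which are very sensitive to scale)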
- Convert categorical to numeric
    - Most machine learning algorithms require numeric input
- One-hot encoding
    - Turn each category of a categorical variable into its own binary (0/1) column
- Cleaning up your data in general 
    - Missing values
    - Changing strings to numeric or proper dates
    - Spaces in odd places
    - Weird characters, etc.
- Correlation matrix
    - See where you have multicollinearity
    - If applicable, drop features that are highly correlated with each other before running your model
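
Several of the steps above can be sketched in a few lines of pandas. The dataset below is made up, and the specific choices (the 50% null threshold from the list, median imputation, z-score standardization) are only one reasonable set of assumptions. The right choices depend on the data.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with nulls, a categorical column, and stray spaces
raw = pd.DataFrame({
    "income": [52000, 61000, np.nan, 48000, 75000, 58000],
    "age": [34, 45, 29, np.nan, 52, 41],
    "region": [" north", "south", "north ", "east", "south", "east"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
})

# Look at data types and null counts
print(raw.dtypes)
print(raw.isna().sum())

# Remove features with more than 50% nulls
clean = raw.loc[:, raw.isna().mean() <= 0.5].copy()

# Clean up spaces in odd places
clean["region"] = clean["region"].str.strip()

# Handle remaining missing numeric values with the column median
numeric_cols = clean.select_dtypes(include="number").columns
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())

# One-hot encode the categorical column into 0/1 columns
clean = pd.get_dummies(clean, columns=["region"], dtype=int)

# Standardize numeric columns to mean 0 and standard deviation 1
clean[numeric_cols] = (
    clean[numeric_cols] - clean[numeric_cols].mean()
) / clean[numeric_cols].std()

# Correlation matrix to check for multicollinearity
corr = clean[numeric_cols].corr()
print(corr)
```

Plotting distributions (for example with `clean.hist()`) would normally come first. It is left out here to keep the sketch free of plotting dependencies.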