### Week In Class Assignment 
##### Angela Spencer - December 22, 2021

#### 1. Using the documentation for Recursive Feature Selection, apply this process to the crime dataset to create the best multivariate linear regression model
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html .
#### You can select what you’re trying to predict, but be sure to indicate what that is. Be sure to explain what RFE is in the markdown. You should be able to answer this using what’s on the documentation page + what you already know.


In [56]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import logit
from sklearn.linear_model import LogisticRegression
from statsmodels.formula.api import ols

The data (X1, X2, X3, X4, X5, X6, X7) are for each city.

    X1 = total overall reported crime rate per 1 million residents
    X2 = reported violent crime rate per 100,000 residents
    X3 = annual police funding in USD/resident
    X4 = % of people 25 years+ with 4 yrs. of high school
    X5 = % of 16 to 19 year-olds not in highschool and not highschool graduates.
    X6 = % of 18 to 24 year-olds in college
    X7 = % of people 25 years+ with at least 4 years of college

Reference: Life In America's Small Cities, By G.S. Thomas

In [2]:
crime_df = pd.read_csv('..\Datasets\crime_data.csv')
crime_df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7
0,478,184,40,74,11,31,20
1,494,213,32,72,11,43,18
2,643,347,57,70,18,16,16
3,341,565,31,71,11,25,19
4,773,327,67,72,9,29,24


In [18]:
#predicted variable = X3 (annual police funding)
X = crime_df.drop('X3', axis=1)
y = crime_df[['X3']]

In [22]:
#initialize RFE model
rfe = RFE(estimator=LinearRegression(), n_features_to_select = 2)

#transform data with RFE
X_rfe = rfe.fit_transform(X,y)

#fit data to model
model.fit(X_rfe,y)

print(rfe.support_)
print(rfe.ranking_)

[False False False  True False  True]
[4 5 3 1 2 1]


In [20]:
#number of features
numf_list = np.arange(1,6)
high_score = 0

#variable to store optimum features
numf = 0
score_list = []

#loop
for n in range(len(numf_list)):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
    model = LinearRegression()
    rfe = RFE(estimator=LinearRegression(),n_features_to_select = numf_list[n])
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    
    if(score > high_score):
        high_score = score
        numf = numf_list[n]
print('Optimum number of features: %d' %numf)
print('Score with %d features: %f' %(numf, high_score))

Optimum number of features: 3
Score with 3 features: 0.083759


In [23]:
#initialize RFE model
rfe = RFE(estimator=LinearRegression(), n_features_to_select = 3)

#transform data with RFE
X_rfe = rfe.fit_transform(X,y)

#fit data to model
model.fit(X_rfe,y)

print(rfe.support_)
print(rfe.ranking_)

[False False False  True  True  True]
[3 4 2 1 1 1]


In [54]:
#the optimum number of features to include in the X variable is 3
#these features correspond to columns X5, X6 and X7
X = crime_df[['X5', 'X6', 'X7']]
y = crime_df[['X3']]

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

array([[38.04799974],
       [47.39194819],
       [30.18392442],
       [33.6918988 ],
       [46.18570699],
       [33.85530005],
       [26.66815046],
       [34.64010725],
       [40.16663065],
       [45.29550244],
       [36.51965446],
       [35.20256553],
       [42.05961964],
       [33.99759484],
       [40.1000183 ]])

In [66]:
#standardization - transforming values so mean is 0 SD is 1
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

#logistic regression with interaction
reg = LogisticRegression()

reg.fit(X_train,y_train)

y_pred = reg.predict(X_test)
y_pred

array([44, 36, 29, 44, 33, 44, 36, 33, 32, 31, 44, 36, 33, 44, 32],
      dtype=int64)

In [67]:
reg.score(X_test, y_test)

0.0

In [58]:
#Ordinary Least Squares
model = ols('X3~X5*X6*X7', data=crime_df).fit()
print(model.params)

Intercept    63.529765
X5           -3.383845
X6           -1.001016
X5:X6         0.067283
X7           -2.605845
X5:X7         0.327290
X6:X7         0.046423
X5:X6:X7     -0.003778
dtype: float64


#### 2. Create a list of preprocessing steps you should try when working to build a model. Briefly describe what each step is. Work with your group to come up with the most comprehensive list you can.

1. remove or fill NaNs 
    - replace NaNs with values when possible, otherwise remove columns and rows to eliminate all NaNs
2. converting dtypes to numeric
    - regression models can only interpret numerical values, ensure that all numeric values have a dtype of float or integer
3. converting categorical variables to numerical
    - string values and string categories must be converted to numeric values with a binary system of with encoding of each category
4. feature engineering
    - utilize numerical value to create new columns such as differece and mean
5. standardizing, normalizing, scaling
    - features that have a high variance compared to others should be standardized to bring the scales into a similar range
6. feature selection / dimensionality reduction
    - select features that are of interest and eliminate non-significant features; also eliminate any redundant columns already present of from preprocessing steps above
7. stratified sampling
    - split the dataset into training and testing data