## Predictors of Domestic Violence in Washington, D.C.

Contributors: Seoho Hahm, Allison Lee

### Table of Contents
1. <a href='#prob'>Problem Statement</a>
2. <a href='#sources'>Data Sources</a>
3. <a href='#approach'>Approach</a>
4. <a href='#eda'>Exploratory Data Analysis</a>
5. <a href='#feature'>Feature Selection</a>
6. <a href='#reg'>Regression</a>
7. <a href='#analysis'>Findings and Analysis</a>
8. <a href='#concl'>Conclusions and Recommendations</a>

In [382]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import statsmodels.api as sm
import statsmodels.stats.outliers_influence as smd
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# load the project's helper functions (data cleaning and regression utilities)
%run ../pyfiles/data_cleaning
%run ../pyfiles/regression

<a id='prob'></a>
### Problem Statement

This analysis seeks to identify which factors best predict incidents of domestic violence in cities such as Washington, D.C.

Specifically, we look at the strength and contribution of different variables to incidents of domestic violence in Washington, D.C., and the potential interactions between variables. 

<a id='sources'></a>
### Data Sources

We use a dataset of 101 features and 431 observations from a study conducted by Caterina Gouvis Roman of the Urban Institute, "Alcohol Availability, Type of Alcohol Establishment, Distribution Policies, and Their Relationship to Crime and Disorder in the District of Columbia, 2000-2006." Each observation represents a block in Washington, D.C.

The original dataset is available here: https://www.icpsr.umich.edu/icpsrweb/NACJD/studies/25763/summary

<a id='approach'></a>
### Approach

We take a statistical modeling approach, using interpretable linear regression to understand which factors best predict incidents of domestic violence in Washington, D.C. All of our features are continuous.

We used Cook's distance (Cook's D) to identify outlying observations with high influence on our regression results.

We used three different methods for feature selection: recursive feature elimination, domain knowledge / forward selection, and Lasso regression. We ran a model on each resulting feature set and compared model metrics. We selected our final model based on the adjusted R-squared metric.

We then checked the residual plots to ensure they met our assumptions of normal distribution and homoscedasticity. 

Finally, we present our main findings and recommendations. 

<a id='eda'></a>
### Exploratory Data Analysis

In [383]:
df_orig = pd.read_csv('../data/data.tsv', sep = '\t')
df = clean_orig_dataset(df_orig)
X = df.drop('AVGDV', axis = 1)
y = df['AVGDV']

There appears to be a high degree of interaction among our variables, which is something to keep in mind throughout our analysis. For example, these scatterplots illustrate how race interacts with poverty in D.C.

<a id='feature'></a>
### Feature Selection

As a first step, we removed the features that are directly correlated with our dependent variable (for example, domestic violence incidents on the weekend). We tried three methods of feature selection to come up with a subset of features on which to run a model. 

We then used Cook's D to identify variables that have outliers with high influence over our regression results. We eliminated those outliers from our dataset (i.e., we dropped the blocks with outlier data).

In [384]:
cooksddf = cooksd(df, list(df.columns))
cleaned_df = dropping_outliers(list(df.columns), df, cooksddf)

In [385]:
# Scale features
scaled_df = scale_dataset(cleaned_df)
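As a rough sketch of what a scaling step like this might do (the actual `scale_dataset` helper lives in `../pyfiles` and may differ), standardization with scikit-learn's `StandardScaler` could look like the following. The values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values; only the column names echo the dataset.
demo = pd.DataFrame({"VACANT": [60.0, 62.0, 117.0, 92.0],
                     "AVGDV": [1.0, 0.5, 2.0, 1.5]})

# Standardize every column to mean 0 and unit (population) standard deviation,
# keeping the original column names and index.
scaler = StandardScaler()
scaled_demo = pd.DataFrame(scaler.fit_transform(demo),
                           columns=demo.columns, index=demo.index)
```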

In [386]:
X = scaled_df.drop(['AVGDV'], axis = 1)
y = scaled_df['AVGDV']

**Recursive Feature Elimination**

First, we tried recursive feature elimination (RFE). In our case, RFE works by recursively selecting smaller and smaller sets of features. The importance of each feature is taken from its linear regression coefficient (we scaled the features before passing them to the RFE model so that coefficients are comparable). After initially training on the full set of features, RFE "prunes" the least important feature and repeats this process until the desired number of features is reached. We set our desired number of features to 7 for ease of interpretation.
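The procedure described above can be sketched with scikit-learn's `RFE`. This is a minimal illustration on synthetic data with hypothetical feature names, not the `recursive_feature_elimination` helper itself; it keeps 2 features here only because the toy data has 2 informative ones.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: only f0 and f3 actually drive the target.
rng = np.random.default_rng(1)
names = [f"f{i}" for i in range(10)]
X_demo = pd.DataFrame(rng.normal(size=(200, 10)), columns=names)
y_demo = 3 * X_demo["f0"] - 2 * X_demo["f3"] + rng.normal(scale=0.1, size=200)

# Scale first so coefficient magnitudes are comparable across features.
X_std = pd.DataFrame(StandardScaler().fit_transform(X_demo), columns=names)

# Drop the weakest feature (smallest squared coefficient) one at a time.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X_std, y_demo)
selected = list(X_std.columns[selector.support_])
```

Setting `n_features_to_select=7` would mirror the choice described in the text.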

In [387]:
rfe_df = recursive_feature_elimination(X, y)
rfe_df

| | VACANT | SAMEHOUSE | YOUNGPOP | RESSTAB | PUBHOUSPT | PHYSDIS0203 | ON_SQMI |
|---|---|---|---|---|---|---|---|
| 0 | -0.033439 | 0.011864 | -0.111383 | -0.195090 | -0.182802 | -0.250426 | -0.351701 |
| 1 | -0.001685 | -0.527070 | -1.392122 | -0.300838 | -0.182802 | -1.048205 | -0.166552 |
| 2 | 0.871544 | -1.344184 | -1.734189 | -1.146821 | -0.182802 | -0.934237 | 2.425525 |
| 3 | 0.474622 | -1.727537 | -1.925344 | -1.757809 | -0.182802 | -0.535347 | 5.303740 |
| 4 | -0.446238 | -0.045390 | 0.901739 | -0.747329 | -0.182802 | -0.848760 | -0.351701 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 419 | -0.414484 | -0.023609 | 1.688493 | -0.947075 | -0.182802 | -1.304634 | -0.351701 |
| 420 | 1.093821 | -0.783469 | 2.108028 | -1.158571 | -0.182802 | -0.307410 | -0.351701 |
| 421 | 1.903542 | 0.435045 | 1.413834 | -0.042343 | -0.182802 | 2.066933 | -0.351701 |
| 422 | -0.747899 | -2.748151 | -1.578247 | -2.333547 | -0.182802 | -1.304634 | -0.351701 |


In [388]:
rfe_df.corr() > 0.7

| | VACANT | SAMEHOUSE | YOUNGPOP | RESSTAB | PUBHOUSPT | PHYSDIS0203 | ON_SQMI |
|---|---|---|---|---|---|---|---|
| VACANT | True | False | False | False | False | False | False |
| SAMEHOUSE | False | True | False | True | False | False | False |
| YOUNGPOP | False | False | True | False | False | False | False |
| RESSTAB | False | True | False | True | False | False | False |
| PUBHOUSPT | False | False | False | False | True | False | False |
| PHYSDIS0203 | False | False | False | False | False | True | False |
| ON_SQMI | False | False | False | False | False | False | True |
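One caveat with a `> 0.7` comparison is that it only flags strong positive correlations; a strongly negative pair would show up as `False`. A hedged sketch of a check on absolute correlations follows. The helper name `high_corr_pairs` and the toy data are hypothetical and not part of our `../pyfiles` code.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.7):
    """Return variable pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold]

# Toy data: "a" and "b" are strongly negatively correlated.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "b": [-1.1, -1.9, -3.2, -3.9, -5.1],
                     "c": [2.0, -1.0, 0.5, 3.0, -2.0]})
flagged = high_corr_pairs(demo)
```

In this toy example the pair `("a", "b")` is flagged even though `demo.corr() > 0.7` would report `False` for it.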


**Lasso Regression**

After standardizing our features, we selected a subset of features using Lasso regression (an embedded method). We tried this approach because we are interested in the importance of the features, and there is a high degree of multicollinearity within our features given the size of our dataset.
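The selection idea described above can be sketched as follows: fit a Lasso on standardized features and keep those with nonzero coefficients. This is an illustration on synthetic data, not the `run_lasso` helper; the penalty strength `alpha=0.1` and the feature names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only f1 and f4 actually drive the target.
rng = np.random.default_rng(2)
names = [f"f{i}" for i in range(8)]
X_demo = pd.DataFrame(rng.normal(size=(300, 8)), columns=names)
y_demo = 1.5 * X_demo["f1"] - 1.0 * X_demo["f4"] + rng.normal(scale=0.2, size=300)

# Standardize so the L1 penalty treats all features on the same scale.
X_std = StandardScaler().fit_transform(X_demo)

# alpha is an assumed penalty strength; larger values zero out more coefficients.
lasso = Lasso(alpha=0.1)
lasso.fit(X_std, y_demo)

# Features whose coefficients were not shrunk to exactly zero are kept.
kept = [name for name, coef in zip(names, lasso.coef_) if coef != 0]
```

In practice the penalty could be tuned with cross-validation (for example `LassoCV`) rather than fixed by hand.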

In [389]:
X_lasso = cleaned_df.drop('AVGDV', axis = 1)
y_lasso = cleaned_df['AVGDV']
lasso_df = run_lasso(cleaned_df, X_lasso, y_lasso)

In [390]:
lasso_df

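| | VACANT | YOUNGPOP | CONCDIS | METRO_BG | PUBHOUSPT | PHYSDIS0203 |
|---|---|---|---|---|---|---|
| 0 | 60 | 18.92 | 0.38 | 0 | 0 | 139 |
| 1 | 62 | 6.19 | -0.97 | 0 | 0 | 55 |
| 2 | 117 | 2.79 | -0.97 | 0 | 0 | 67 |
| 3 | 92 | 0.89 | -0.91 | 0 | 0 | 109 |
| 4 | 34 | 28.99 | 0.32 | 0 | 0 | 76 |
| ... | ... | ... | ... | ... | ... | ... |
| 426 | 36 | 36.81 | 0.35 | 0 | 0 | 28 |
| 427 | 131 | 40.98 | 2.05 | 0 | 0 | 133 |
| 428 | 182 | 34.08 | 0.68 | 0 | 0 | 383 |
| 429 | 15 | 4.34 | -0.51 | 1 | 0 | 28 |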


In [391]:
lasso_df.corr() > 0.7

| | VACANT | YOUNGPOP | CONCDIS | METRO_BG | PUBHOUSPT | PHYSDIS0203 |
|---|---|---|---|---|---|---|
| VACANT | True | False | False | False | False | False |
| YOUNGPOP | False | True | True | False | False | False |
| CONCDIS | False | True | True | False | False | False |
| METRO_BG | False | False | False | True | False | False |
| PUBHOUSPT | False | False | False | False | True | False |
| PHYSDIS0203 | False | False | False | False | False | True |


The matrix flags YOUNGPOP and CONCDIS as correlated above 0.7, so we further removed variables to reduce multicollinearity.

**Forward Selection**

In this method, we started with all variables in the model and selected the subset with the lowest p-values. We then added these variables to a new model one at a time, stopping when an added variable was no longer statistically significant.

In [392]:
forward_df = forward_selection(scaled_df)
forward_df

| | POV | AA_POP | UNEMPL | FEMALE |
|---|---|---|---|---|
| 0 | 1.022728 | 0.081237 | -0.064911 | -2.388464 |
| 1 | -0.834808 | -0.066985 | -0.787580 | -0.205504 |
| 2 | -0.581413 | 0.323418 | -0.893725 | -0.343946 |
| 3 | -0.288926 | 2.502726 | -0.770774 | 0.101474 |
| 4 | 0.640191 | -0.299527 | 0.836434 | -1.220741 |
| ... | ... | ... | ... | ... |
| 419 | -1.065168 | -0.299527 | 0.645374 | 1.072571 |
| 420 | 2.853039 | -0.299527 | 1.606868 | -0.105184 |
| 421 | 0.530596 | -0.254057 | 0.211949 | 0.498741 |
| 422 | -0.660293 | -0.299527 | -0.992793 | -7.775640 |


The correlation matrix shows that no pair of variables has a correlation coefficient above 0.7.

In [None]:
forward_df.corr()

<a id='reg'></a>
### Multiple Linear Regression

We employ multiple linear regression with average incidents of domestic violence over 2005 to 2006 as our dependent variable. Multiple linear regression models make the following assumptions:
 - a linear relationship between the predictor variables and the dependent variable
 - the residuals are normally distributed
 - independent variables are not highly correlated with each other
 - homoscedasticity: the variance of the error terms is constant across values of the independent variables

We ran three models, one for each feature set produced by our feature selection methods.

We compared the models using the adjusted R-squared metric, which penalizes R-squared for the number of variables in the model. Our findings are summarized below:


In [393]:
run_model(lasso_df, np.array(y).reshape(-1, 1))

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.553
Model:                            OLS   Adj. R-squared (uncentered):              0.547
Method:                 Least Squares   F-statistic:                              86.27
Date:                Wed, 04 Dec 2019   Prob (F-statistic):                    5.03e-70
Time:                        16:12:45   Log-Likelihood:                         -430.82
No. Observations:                 424   AIC:                                      873.6
Df Residuals:                     418   BIC:                                      897.9
Df Model:                           6                                                  
Covariance Type:            nonrobust                                                  
                  coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------

In [380]:
run_model(rfe_df, y)

                                 OLS Regression Results                                
Dep. Variable:                  AVGDV   R-squared (uncentered):                   0.651
Model:                            OLS   Adj. R-squared (uncentered):              0.645
Method:                 Least Squares   F-statistic:                              111.1
Date:                Wed, 04 Dec 2019   Prob (F-statistic):                    3.06e-91
Time:                        16:11:37   Log-Likelihood:                         -378.42
No. Observations:                 424   AIC:                                      770.8
Df Residuals:                     417   BIC:                                      799.2
Df Model:                           7                                                  
Covariance Type:            nonrobust                                                  
                  coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------

In [381]:
run_model(forward_df, y)

                                 OLS Regression Results                                
Dep. Variable:                  AVGDV   R-squared (uncentered):                   0.318
Model:                            OLS   Adj. R-squared (uncentered):              0.312
Method:                 Least Squares   F-statistic:                              49.04
Date:                Wed, 04 Dec 2019   Prob (F-statistic):                    7.57e-34
Time:                        16:11:40   Log-Likelihood:                         -520.38
No. Observations:                 424   AIC:                                      1049.
Df Residuals:                     420   BIC:                                      1065.
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
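To make the comparison explicit, the adjusted R-squared values can be recomputed from the numbers in the three summaries. The summaries report uncentered R-squared, which indicates the models were fit without an intercept; in that case the adjustment appears to be 1 - (n / df_resid) * (1 - R^2). The sketch below assumes that formula and uses only figures printed in the summaries; the function name is hypothetical.

```python
# (observations, residual degrees of freedom, reported uncentered R-squared)
reported = {
    "lasso": (424, 418, 0.553),
    "rfe": (424, 417, 0.651),
    "forward": (424, 420, 0.318),
}

def adjusted_r2_no_intercept(n_obs, df_resid, r2):
    """Adjusted R-squared for a model without an intercept (assumed formula)."""
    return 1 - (n_obs / df_resid) * (1 - r2)

# Recompute each model's adjusted R-squared and pick the highest.
adj = {name: round(adjusted_r2_no_intercept(*vals), 3) for name, vals in reported.items()}
best = max(adj, key=adj.get)
```

Under this formula the recomputed values match the reported ones, and the model built on the recursive feature elimination features has the highest adjusted R-squared. Note that uncentered R-squared is not directly comparable to the usual centered R-squared from a model with an intercept.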

### Findings and Analysis