## Domestic Violence and Alcohol Availability

Contributors: Seoho Hahm, Allison Lee

### Table of Contents
1. <a href='#prob'>Problem Statement</a>
2. <a href='#sources'>Data Sources</a>
3. <a href='#approach'>Approach</a>
4. <a href='#eda'>Exploratory Data Analysis</a>
5. <a href='#reg'>Regression</a>
6. <a href='#analysis'>Findings and Analysis</a>
7. <a href='#concl'>Conclusions and Recommendations</a>

In [7]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

<a id='prob'></a>
### Problem Statement

This analysis seeks to understand which factors are most effective at predicting domestic violence in cities such as Washington, D.C.

Specifically, we look at the strength and contribution of different variables to incidents of domestic violence in Washington, D.C., and the potential interactions between variables. 

<a id='sources'></a>
### Data Sources

We use a dataset of 101 features and 431 observations from a study led by Caterina Gouvis Roman of the Urban Institute, "Alcohol Availability, Type of Alcohol Establishment, Distribution Policies, and Their Relationship to Crime and Disorder in the District of Columbia, 2000-2006." Each observation represents a census block group in Washington, D.C.

The original dataset is available here: https://www.icpsr.umich.edu/icpsrweb/NACJD/studies/25763/summary

<a id='approach'></a>
### Approach

We take a statistical approach to this analysis to understand which factors are most effective at predicting incidents of domestic violence in Washington, D.C. All of our features are continuous.

After standardizing our features, we select a subset of them using Lasso regression (an embedded feature-selection method). We chose this approach because we are interested in feature importance, and because our features show a high degree of multicollinearity, which is a concern given the size of our dataset (101 features for 431 observations).
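The selection step described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, the column names (`feat_0` and so on) are hypothetical stand-ins for the real features, and `alpha=0.1` is an arbitrary penalty strength chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 431 rows, 6 made-up features).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(431, 6)), columns=[f"feat_{i}" for i in range(6)])
# Only feat_0 and feat_1 actually drive the synthetic target.
y = 3.0 * X["feat_0"] - 2.0 * X["feat_1"] + rng.normal(scale=0.5, size=431)

# Standardize so the L1 penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Lasso shrinks uninformative coefficients to zero; SelectFromModel keeps
# the features whose absolute coefficient is at least `threshold`.
selector = SelectFromModel(Lasso(alpha=0.1), threshold=1e-5)
selector.fit(X_scaled, y)

selected = list(X.columns[selector.get_support()])
print("Selected features:", selected)
```

In practice the penalty strength would be tuned (for example with cross-validation) rather than fixed by hand.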


 - Optional: we tested several threshold values and compared the resulting models against each other using model metrics.

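We use the following diagnostics to evaluate the model:
 - Cook's D, to identify influential observations
 - the F-statistic, to test the overall significance of the model
 - the distribution of the residuals, to check the normality assumption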

<a id='eda'></a>
### Exploratory Data Analysis

<a id='reg'></a>
### Regression

We employ multiple linear regression with the average number of domestic violence incidents over 2005 to 2006 as our dependent variable. Multiple linear regression models make the following assumptions:
 - a linear relationship between the predictor variables and the dependent variable
 - the residuals are normally distributed
 - independent variables are not highly correlated with each other
 - homoscedasticity: the error terms have constant variance across values of the independent variables


Feature Selection

As a first step, we removed the features that are, by construction, directly correlated with our dependent variable (for example, domestic violence incidents on the weekend). We then used backward selection to identify a subset of features that are relevant to our dependent variable (refer to notebook: AL_notes), refitting the model each time we removed a feature with a high p-value.

Multicollinearity

After we had a subset of features with 


Question - standard scaling?

VIF / Multicollinearity


Cook's D

F-Statistic

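The F-statistic tests the null hypothesis that all of the regression coefficients (excluding the intercept) are zero, against the alternative that at least one of them is non-zero. In other words, the F-statistic will take a value close to 1 if there is no relationship between domestic violence and the predictors, while values well above 1 are evidence against the null hypothesis.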

## Next steps
 - Clean outliers
 - Clear on variables 
 - Add features
 - Run a couple of models
 - Look at features and coefficients 
 - Focus on features: which ones are more dominant, and how do they interact? Understand the importance of features and their effects on domestic violence. Which are most effective at predicting DV?
 - Ratio BG / population
 - We already have some sort of bias if we only choose 6 features.
 - Run a large model, compare models, and look at coefficients, p-values, and F-scores.
 - R-squared is not especially important; p-values and F-scores matter more.
  - Is there a significant difference? --> statistical realm. ANOVA test.
  - But also use a linear regression model.
  - Read the paper.
