# Goal: To see whether an area's Social Vulnerability Index score correlates with its amount of crime

#### Intended Result: to walk away with some understanding of whether certain types of crime correlate with an area's level of vulnerability

###### Background: Social Vulnerability Index (SVI)

The Socially Vulnerable Population Analysis uses measurements of a geographical area's relative level of vulnerability across multiple variables measured by the American Community Survey. These measurements are derived from the Centers for Disease Control and Prevention's 2016 Social Vulnerability Index (SVI).

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import spearmanr, pearsonr
import pyspark

Let's read in the data from two CSV files.

In [2]:
# read ZIP codes as strings (dtype=object) so they are treated as identifiers, not numbers
df = pd.read_csv('clean_crime_data.csv', dtype={'zip_code': object})
svi = pd.read_csv('census-data/svi_data.csv', dtype={'zip': object})
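
Passing `dtype` for the ZIP code columns matters because pandas otherwise parses them as integers and drops any leading zeros, which would break a later join on ZIP code. Here is a minimal sketch of the difference, assuming a tiny made-up CSV (the rows and ZIP codes below are invented for illustration and are not from the project's files):

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the values are invented for illustration only.
csv_text = "zip_code,arr\n02134,3\n19104,5\n"

# Default parsing: the column is inferred as integers, so "02134" becomes 2134.
as_default = pd.read_csv(io.StringIO(csv_text))

# With dtype=object the column is kept as text, preserving the leading zero.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": object})

print(as_default["zip_code"].tolist())
print(as_text["zip_code"].tolist())
```

Keeping both ZIP columns as strings also means the two tables are joined on the same type, rather than comparing a string in one table to a number in the other.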

Right now the data is structured much like an Excel spreadsheet. We are going to turn the two pandas dataframes into Spark dataframes, which are useful for working with large data sets because the computation can be distributed. This is overkill for datasets of this size, but it's cool, and it also lets us query the data with SQL.

In [3]:
spark = pyspark.sql.SparkSession.builder.appName('pandasToSparkDF').getOrCreate()

# create spark dataframes
crime_df = spark.createDataFrame(df)
svi_df = spark.createDataFrame(svi)

# register each dataframe as a temporary view so it can be queried with spark.sql
for name, data in zip(['crime', 'svi'], [crime_df, svi_df]):
    data.createOrReplaceTempView(name)

In [4]:
spark.sql("""
select *
from crime
join svi
  on svi.zip = crime.zip_code
""").limit(1).show()

+-------------+---------+------------------+--------+---------+------+-----------------+---+--------+----------+---------------+-------------+-----------------+---------------+---------------+----------------+-----------------+---------+---------+---------+---------+---------+---------+---------+-----------+---------+--------+----------+-----+-------+---------+----------+---------+---------+------------------+----------+----------+----------+---------+----------+---------+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|     crime_id|from_date|       description|zip_code|charge_id|dvflag|firearm_used_flag|arr|male_arr|female_arr|black_race_flag|nan_race_flag|unknown_race_flag|white_race_flag|asian_race_flag|indian_race_flag|pacific_race_flag|age_minor|age_18_24|age_25_29|age_30_34|age_35_39|age_40_44|age_45_49|age_50_plus|total_vic|male_vic|male_vic.1|  zip|epl_pov|epl_unemp|epl_nohsdp|epl_age65|epl_age17|        epl_disabl|