# 501 Project: Crime Rate Analysis in DMV Area

## Introduction






## Analysis 
## 1. **About the Data**
### *Variables considered:*
- Employment
- Education
- Housing
- Population Mobility
- Transportation
- Diabetes
- Poverty
- Crime Rate (label)

### *Data Source:*
- Crime Rate (Virginia and Maryland): The Federal Bureau of Investigation (FBI).
- Crime Rate (DC): Open Data DC.
- Employment, Education, Housing, Poverty, Transportation, Population Mobility: US Census Bureau.
- Diabetes: Centers for Disease Control and Prevention (CDC).


### *Data Gathering:*
### API Use 

**Crime records for DC are gathered from [Open Data DC](https://opendata.dc.gov/datasets/crime-incidents-in-2016). The site exposes an [ArcGIS REST API](https://developers.arcgis.com/rest/services-reference/query-feature-service-layer-.htm), which can provide detailed geometric information for a specific area. There are more than 30,000 detailed crime records, each with a unique id. A single query returns at most 1,000 records, but there is no limit on the number of record ids that can be returned per query. We therefore fetch all the record ids first and use them to track our data-gathering progress. More details can be found in dc_crime_cleaning.py.**
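
The id-first strategy described above can be sketched as follows. This is only an illustration, not the code in dc_crime_cleaning.py: the helper names `chunk_ids` and `build_query_params` are hypothetical and the ids are dummy values. The parameter names (`where`, `returnIdsOnly`, `objectIds`, `outFields`, `f`) are standard ArcGIS REST query parameters. No request is actually sent.

```python
def chunk_ids(ids, size=1000):
    """Split a list of record ids into batches of at most `size` ids."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def build_query_params(id_batch):
    """Query-string parameters that ask for the full records of one batch."""
    return {
        'objectIds': ','.join(str(i) for i in id_batch),
        'outFields': '*',   # all attribute columns
        'f': 'json',
    }


# Step 1 (not executed here): a query with these parameters asks for ids only.
ids_only_params = {'where': '1=1', 'returnIdsOnly': 'true', 'f': 'json'}

# Step 2: fetch the records batch by batch; finished ids mark the progress.
all_ids = list(range(1, 2501))          # stand-in for the ids from step 1
batches = chunk_ids(all_ids)
done = set()
for batch in batches:
    params = build_query_params(batch)  # would be sent to the layer's query URL
    done.update(batch)

print(len(batches), [len(b) for b in batches], len(done))
```

Keeping the set of finished ids makes it possible to resume an interrupted download by querying only the ids that are still missing.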

### Web Scraping
**For [census data](https://www.census.gov/quickfacts/geo/chart/ameliacountyvirginia/EDU635217), we chose 11 variables covering Education, Housing, Employment, Poverty and Transportation.**

**The census website does not allow downloading data for more than 6 counties at a time, so downloading everything by hand would be time-consuming. We scrape the data from the website instead.**

In [231]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import warnings

# base_url = 'https://www.census.gov/quickfacts/geo/chart/ameliacountyvirginia/'
areas = []
# complete url is base_url + suffix, following is (suffix, variable_name) tuple.
url_h_grad = 'EDU635217', 'h_grad' # high school graduation rate
url_b_grad = 'EDU685217', 'b_grad' # bachelor percentage
url_o_occ_r = 'HSG445217', 'o_occ_r' # owner-occupied house rate
url_o_occ_mv = 'HSG445217', 'o_occ_mv' # owner-occupied house value; NOTE: same suffix as o_occ_r, so this column repeats the owner-occupied rate
url_o_m_cst = 'HSG650217', 'o_m_cst' # Median selected monthly owner costs -with a mortgage
url_gos_ret = 'HSG860217', 'gos_ret' # median gross rent
url_ps_pr_hh = 'HSD310217', 'ps_pr_hh' # Persons per household
url_lv_sm = 'POP715217', 'lv_sm' # live in same house in past 1 year
url_tvl_t = 'LFE305217', 'tvl_t' # average travel time to work
url_hh_inc = 'INC110217', 'hh_inc' # house hold income
url_ca_inc = 'INC910217', 'ca_inc' # per capita income
url_ps_pvt = 'IPE120218', 'ps_pvt' # poverty rate
url_emp_chg = 'BZA115216', 'emp_chg' # employment change

all_url = [url_h_grad, url_b_grad, url_o_occ_r, url_o_occ_mv, url_ps_pvt,url_hh_inc,
           url_ca_inc,url_o_m_cst, url_gos_ret,url_ps_pr_hh,url_lv_sm,url_tvl_t, url_emp_chg]


# make complete URL
# for i in range(len(all_url)):
#     all_url[i] = base_url + all_url[i][0], all_url[i][1]

In [222]:
len(census_df.County.values)

133

**All the census data are stored as strings inside div tags that share a common parent div tag. This parent div tag has a unique class, *'qf-graph-scroll'*.**

**Once this is known, scraping all the data from the website is straightforward.**
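
As a rough sketch of how those strings are turned into columns: once the text under the parent div is flattened into a list, county names and values alternate, so slicing with a stride separates them. The list below is a hand-made stand-in for the real page content, and `split_pairs` is a hypothetical helper. The scraping function uses a stride of 4 for the poverty table, presumably because each county there carries two extra strings.

```python
def split_pairs(strings, stride=2, value_offset=1):
    """Drop layout strings, then separate names from values by position."""
    cleaned = [s for s in strings if s != '\n' and s != '1']
    names = cleaned[::stride]
    values = cleaned[value_offset::stride]
    return names, values


# Hand-made stand-in for list(soup.find(class_='qf-graph-scroll').strings)
raw = ['\n', 'Amelia County, Virginia', '80.8%', '\n',
       'Accomack County, Virginia', '82.2%', '\n']

names, values = split_pairs(raw)
for name, value in zip(names, values):
    print(name, value)
```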

In [232]:
def scraping_census_data(url_list, county):
    base_url = 'https://www.census.gov/quickfacts/geo/chart/'
    all_url = url_list.copy()
    # encode url
    for i in range(len(url_list)):
        all_url[i] = base_url + county + url_list[i][0], url_list[i][1]
    census_df = pd.DataFrame()
    # each iteration scrape one variable in the given url list
    for url, var_name in all_url:
#         print(url)
        page = urlopen(url)
        soup = BeautifulSoup(page, 'lxml')
        # web scraping
        d_list = [d for d in list(soup.find(class_='qf-graph-scroll').strings) if d != '\n' and d != '1']
        # print(len(d_list))
        if var_name == 'ps_pvt':
            Counties = d_list[::4]
            data = d_list[3::4]
        else:
            Counties = d_list[::2]
            data = d_list[1::2]
#         print(Counties)
#         print(len(data))
        if 'County' in census_df.columns.values:
#             print(var_name)
            try:
                if (census_df.County.values == Counties).all():
                    census_df.loc[:, var_name] = data
                else:
                    # if the counties order change
                    warnings.warn('The counties do not match.')
            except AttributeError:
                print(d_list)
                break
        else:
            census_df.loc[:, 'County'] = Counties
            census_df.loc[:, var_name] = data
    return census_df

# scraped data
census_df_va = scraping_census_data(all_url, 'ameliacountyvirginia/')
census_df_md = scraping_census_data(all_url, 'baltimorecountymaryland/')


Unnamed: 0,County,h_grad,b_grad,o_occ_r,o_occ_mv,ps_pvt,hh_inc,o_m_cst,gos_ret,ps_pr_hh,lv_sm,tvl_t,emp_chg
0,"Amelia County, Virginia",80.8%,14.5%,83.6%,83.6%,11.4%,"$26,118","$1,332",$729,2.69,90.6%,38.0,-1.4%
1,"Accomack County, Virginia",82.2%,19.6%,70.0%,70.0%,17.8%,"$24,266","$1,147",$771,2.35,93.4%,22.0,4.0%
2,"Albemarle County, Virginia",91.4%,52.3%,63.6%,63.6%,7.9%,"$39,273","$1,769","$1,189",2.44,81.3%,21.9,3.2%
3,"Alexandria city, Virginia (County)",91.4%,61.8%,43.1%,43.1%,10.1%,"$57,019","$2,648","$1,663",2.23,78.0%,31.8,0.2%
4,"Alleghany County, Virginia",86.5%,15.8%,76.1%,76.1%,14.5%,"$25,952",$950,$653,2.21,89.3%,24.6,-2.1%


In [233]:
# preview
census_df_md.head()

Unnamed: 0,County,h_grad,b_grad,o_occ_r,o_occ_mv,ps_pvt,hh_inc,ca_inc,o_m_cst,gos_ret,ps_pr_hh,lv_sm,tvl_t,emp_chg
0,"Baltimore County, Maryland",91.1%,37.8%,65.8%,65.8%,8.3%,"$71,810","$37,270","$1,724","$1,224",2.58,87.6%,29.5,2.3%
1,"Allegany County, Maryland",89.5%,18.2%,68.8%,68.8%,17.0%,"$42,771","$22,355","$1,060",$672,2.34,84.6%,20.9,-3.0%
2,"Anne Arundel County, Maryland",92.0%,40.1%,74.3%,74.3%,6.1%,"$94,502","$43,258","$2,089","$1,579",2.67,86.2%,30.2,3.1%
3,"Baltimore city, Maryland (County)",84.2%,30.4%,47.4%,47.4%,22.1%,"$46,641","$28,488","$1,424","$1,009",2.48,83.1%,30.7,Z
4,"Calvert County, Maryland",93.8%,30.1%,83.4%,83.4%,6.0%,"$100,350","$41,469","$2,107","$1,612",2.87,88.5%,41.9,1.8%


In [234]:
census_df_va.to_csv('raw_data/census_va_raw.csv')
census_df_md.to_csv('raw_data/census_md_raw.csv')

## 2. Data Cleaning


### Crime data in Maryland and Virginia:

In [284]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff

desired_width = 600
pd.set_option('display.width', desired_width)

input_file = ['raw_data/virginia.xls', 'raw_data/maryland.xls']


crime_va, crime_maryland = pd.read_excel(input_file[0]), pd.read_excel(input_file[1])
# crime_maryland.head(10)
# crime_va.head(10)
############## drop description rows and columns
crime_va.drop([0, 1, 2, 3, 4, 100, 101], axis=0, inplace=True)
crime_va.drop(['Table 8', 'Unnamed: 5'], axis=1, inplace=True)
crime_maryland.drop([0,1,2,3,27, 28,29], axis=0, inplace=True)
crime_maryland.drop(['MARYLAND', 'Unnamed: 5'], axis=1, inplace=True)

############# set column names
columns = ['County', 'Violent', 'Murder_and_nonnegligent_manslaughter',
           'Rape','Robbery', 'Aggravated_assault', 'Property_crime',
           'Burglary', 'Larceny_theft', 'Motor_vehicle_theft', 'Arson', 'Population']
crime_va.columns = columns
crime_maryland.columns = columns
crime_va["State"] = 'Virginia'
crime_maryland["State"] = 'Maryland'

# crime_va.Population.to_csv('cleaned_data/va_pop.csv')
# crime_maryland.Population.to_csv('cleaned_data/md_pop.csv')

crime_DF = pd.concat([crime_va, crime_maryland], ignore_index=True)

print('columns:', crime_DF.columns)
print('len:', len(crime_DF))
crime_DF.head()

columns: Index(['County', 'Violent', 'Murder_and_nonnegligent_manslaughter', 'Rape', 'Robbery', 'Aggravated_assault', 'Property_crime', 'Burglary', 'Larceny_theft', 'Motor_vehicle_theft', 'Arson', 'Population', 'State'], dtype='object')
len: 118


Unnamed: 0,County,Violent,Murder_and_nonnegligent_manslaughter,Rape,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,Population,State
0,Albemarle County Police Department,88,1,29,21,37,1271,145,1061,65,8,105715,Virginia
1,Amelia,18,2,6,2,8,125,28,87,10,0,12856,Virginia
2,Amherst,32,1,13,6,12,279,35,222,22,2,29930,Virginia
3,Appomattox,14,0,5,2,7,129,18,107,4,0,15388,Virginia
4,Arlington County Police Department,363,1,56,103,203,3252,170,2912,170,5,236691,Virginia


In [259]:
# read population data
pop_va = pd.read_csv('raw_data/pop_va_2016.csv')
pop_md = pd.read_csv('raw_data/pop_md_2016.csv')
pop_df = pd.concat([pop_va, pop_md], ignore_index=True)

In [248]:
pop_df.head()

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,rescen42010,resbase42010,respop72010,respop72011,respop72012,respop72013,respop72014,respop72015,respop72016
0,0500000US51001,51001,"Accomack County, Virginia",33164,33164,33164,33292,33324,33012,33024,32995,32947
1,0500000US51003,51003,"Albemarle County, Virginia",98970,98998,99240,100621,101920,102799,104235,105603,106878
2,0500000US51005,51005,"Alleghany County, Virginia",16250,16261,16212,16335,16239,16193,15884,15685,15595
3,0500000US51007,51007,"Amelia County, Virginia",12690,12695,12742,12754,12731,12687,12764,12869,12913
4,0500000US51009,51009,"Amherst County, Virginia",32353,32354,32386,32127,32458,32209,32060,31636,31633


In [249]:
pop_df.columns

Index(['GEO.id', 'GEO.id2', 'GEO.display-label', 'rescen42010', 'resbase42010', 'respop72010', 'respop72011', 'respop72012', 'respop72013', 'respop72014', 'respop72015', 'respop72016'], dtype='object')

In [256]:
pop_df.loc[pop_df['GEO.display-label']=='Accomack County, Virginia', 'respop72016'][0]

32947

**The Excel file does not have a clean tabular structure like a CSV, so we need to extract the useful information from it.**

*We only show the cleaning work for VA here. The cleaning for MD is similar; refer to the code for details.*

In [285]:
########## clean county name and set index
## remove if exist
s = ' Police Department'
States = crime_DF.State.values.copy()
for n, c in enumerate(crime_DF.County.values):
    if c.endswith(s):
        county_name = c[:-len(s)].strip()
#         crime_DF.County.values[n] = c[:-len(s)]
#     crime_DF.County.values[n] = crime_DF.County.values[n].strip() + ', ' + States[n]
    else:
        county_name = c.strip() + ' County'
    county_name = county_name + ', ' + States[n]
    crime_DF.County.values[n] = county_name
    try:
        crime_DF.Population.values[n] = pop_df.loc[pop_df['GEO.display-label']==county_name, 
                                                   'respop72016'].values[0]
    except IndexError:  # no row in pop_df matches this county name
        print(county_name)
        print(pop_df.loc[pop_df['GEO.display-label']==county_name, 
                                                   'respop72016'])
#         print(pop_df.loc[pop_df['GEO.display-label']==county_name, 'respop72016'])
# crime_DF.index = np.arange(len(crime_DF))
crime_DF.sort_values(by='County', inplace=True)
crime_DF.index = crime_DF.County.values.copy()
crime_DF.drop(['County'], axis=1, inplace=True)
crime_DF.drop(['State'], axis=1, inplace=True)

### change data store type
crime_DF = crime_DF.astype(float)  # np.float was removed in NumPy 1.24

### calculate crime rate (per 100,000 people)
for c in crime_DF.columns[0:-1]:
    crime_DF.loc[:, c] = crime_DF.loc[:, c] / crime_DF.iloc[:, -1] * 100000

crime_DF.describe()

Unnamed: 0,Violent,Murder_and_nonnegligent_manslaughter,Rape,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,Population
count,118.0,118.0,118.0,118.0,118.0,118.0,118.0,118.0,118.0,118.0,118.0
mean,115.959191,2.913479,25.438908,15.588311,72.018493,909.994565,176.613647,681.889366,51.491552,6.035046,95436.21
std,73.825809,4.743298,20.085099,23.630772,52.326912,466.383518,87.851426,391.04116,35.131204,6.692835,192009.9
min,1.101262,0.0,0.0,0.0,0.0,0.110126,0.0,0.110126,0.0,0.0,2216.0
25%,63.939169,0.0,11.460232,0.507089,37.318245,604.679151,113.569666,395.754753,30.139666,0.0,15859.0
50%,101.761154,1.079679,21.859626,7.947331,54.556364,828.812196,165.541223,603.552217,44.978665,3.543777,31347.5
75%,151.486385,3.619177,33.700788,17.157898,94.769241,1142.158918,217.805525,846.08428,68.002819,8.871978,74848.75
max,529.104986,24.713733,106.894709,180.138768,306.488606,2809.057719,526.776169,2117.743609,248.36768,29.157353,1138652.0
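
As a worked check of the rate calculation in the cell above, a count is divided by the county population and scaled to 100,000 inhabitants. The sketch below uses the Amelia row (18 violent crimes) together with the 2016 population of 12,913 from the population table; `rate_per_100k` is a hypothetical helper name.

```python
def rate_per_100k(count, population):
    """Convert a raw incident count into a rate per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError('population must be positive')
    return count / population * 100000


amelia_violent_rate = rate_per_100k(18, 12913)
print(round(amelia_violent_rate, 2))
```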


In [281]:
pop_df.loc[pop_df['GEO.display-label']==county_name, 'respop72016'].values[0]

array([37278])

In [292]:
crime_DF.Population.values

array([  32947.,  106878.,   72130.,   15595.,   12913.,   31633.,
        568346.,   15475.,  230050.,   74997.,  831026.,    4476.,
         77960.,    6513.,   33231.,   16243.,   22178.,   17048.,
         91251.,   54952.,   32850.,   30178.,  167656.,   29531.,
        102603.,    7071.,  157705.,   12129.,  339009.,   14374.,
          5158.,   50083.,    9652.,   14968.,   28144.,   32258.,
         11123., 1138652.,   69069.,   15731.,   26271.,   56069.,
        247591.,   84421.,   29425.,   16857.,   37214.,   22668.,
         15107.,   19371.,   11706.,   34992.,  104392.,  251032.,
        326501.,   51445.,    2216.,  317233.,   36596.,   74404.,
         19730.,   25984.,   16334.,    7159.,   10972.,   24179.,
        385945.,   35236.,   12273.,   13078.,    8782.,   30892.,
         10778., 1043863.,   98602.,   14869.,   21147.,   12139.,
         12222.,   15595.,   35533.,   23654.,   17923.,   61687.,
         28443.,   23142.,   37845.,  908049.,  455210.,   342

In [297]:
# violin plot of population
fig = px.violin(crime_DF, y="Population", box=True, points='all')
fig.update_layout(
    title='County Population Distribution',
    yaxis_title='Population'
)
fig.show()




# Create distplot with the histogram hidden (density curve and rug only)
fig = ff.create_distplot([crime_DF.Population.values], 
                         group_labels=['Population'], show_hist=False)

# Add title
fig.update_layout(title_text='Curve and Rug Plot')
fig.show()

The violin plot for population is highly squeezed: county populations appear to follow a roughly exponential distribution. There is one extreme value, but it is a correct record rather than a data error.

We may consider dividing the counties into the following groups:

- A: population > 20,000
- B: population <= 20,000
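
The grouping rule above can be sketched in plain Python. The threshold of 20,000 comes from the text; `pop_group` is a hypothetical helper, and the populations are the first five 2016 values from the population table shown earlier.

```python
from collections import Counter


def pop_group(population, threshold=20000):
    """Return 'A' for counties above the threshold, otherwise 'B'."""
    return 'A' if population > threshold else 'B'


populations = [32947, 106878, 15595, 12913, 31633]
groups = [pop_group(p) for p in populations]
counts = Counter(groups)
print(groups, dict(counts))
```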

In [26]:
# crime_DF.loc[crime_DF.Population.values <= 20000, 'Pop_Gp'] = 'B'
# crime_DF.loc[(crime_DF.Population.values > 20000), 'Pop_Gp'] = 'A'
# print('Number of counties in group A:', sum(crime_DF.Population.values > 20000))
# print('Number of counties in group B:', sum(crime_DF.Population.values <= 20000))

In [298]:
# check distribution of different violent crime types
fig = go.Figure()

fig.add_trace(go.Box(y=crime_DF.Rape, name='Rape', boxpoints='all', jitter=0.3, pointpos=-1.8))
fig.add_trace(go.Box(y=crime_DF.Robbery, name='Robbery', boxpoints='all', jitter=0.3, pointpos=-1.8))
fig.add_trace(go.Box(y=crime_DF.Murder_and_nonnegligent_manslaughter,
                     name='Murder_and_nonnegligent_manslaughter', boxpoints='all', jitter=0.3, pointpos=-1.8))
fig.add_trace(go.Box(y=crime_DF.Aggravated_assault, 
                     name='Aggravated_assault', boxpoints='all', jitter=0.3, pointpos=-1.8))

fig.update_layout(
    title='Violent crime distribution',
    yaxis_title='crime per 100,000 inhabitants',
    
)
fig.show()

In [125]:
# check distribution of different property crime types
fig = go.Figure()
# fig.add_trace(go.Box(y=crime_DF.Burglary, x=crime_DF.Pop_Gp, name='Burglary',
#                      boxpoints='all', jitter=0.3, pointpos=-1.8))
# fig.add_trace(go.Box(y=crime_DF.Larceny_theft, x=crime_DF.Pop_Gp, name='Larceny_theft',
#                      boxpoints='all', jitter=0.3, pointpos=-1.8))
# fig.add_trace(go.Box(y=crime_DF.Motor_vehicle_theft, x=crime_DF.Pop_Gp, name='Motor_vehicle_theft',
#                      boxpoints='all', jitter=0.3, pointpos=-1.8))

fig.add_trace(go.Box(y=crime_DF.Burglary, name='Burglary',
                     boxpoints='all', jitter=0.3, pointpos=-1.8))
fig.add_trace(go.Box(y=crime_DF.Larceny_theft, name='Larceny_theft',
                     boxpoints='all', jitter=0.3, pointpos=-1.8))
fig.add_trace(go.Box(y=crime_DF.Motor_vehicle_theft, name='Motor_vehicle_theft',
                     boxpoints='all', jitter=0.3, pointpos=-1.8))


fig.update_layout(
    title='Property crime distribution',
    yaxis_title='crime per 100,000 inhabitants',
    
)
fig.show()

In [181]:
crime_DF.to_csv('cleaned_data/crime_VM_cleaned.csv')
crime_DF.Population.to_csv('cleaned_data/pop_cleaned.csv')





In [127]:
# bar chart for 25% counties with lowest violent rate and 
# 25% counties with highest violent rate
num_total = len(crime_DF)
p_25 = int(0.25*num_total)  # np.int was removed in NumPy 1.24
p_75 = int(0.75*num_total)
sorted_by_violent = crime_DF.sort_values(by='Violent')
violent_low = sorted_by_violent.iloc[0:p_25]
violent_high = sorted_by_violent.iloc[p_75:]

types = ['Robbery', 'Rape', 'Murder', 'Assault']
means_lo = violent_low.mean()
means_hi = violent_high.mean()
violent_means_lo = means_lo.loc[['Robbery','Rape',
                                      'Murder_and_nonnegligent_manslaughter',
                                      'Aggravated_assault']].values
violent_means_hi = means_hi.loc[['Robbery','Rape',
                                      'Murder_and_nonnegligent_manslaughter',
                                      'Aggravated_assault']].values

fig = go.Figure(data=[
    go.Bar(name='Low-violent-rate Counties', x=types, y=violent_means_lo),
    go.Bar(name='High-violent-rate Counties', x=types, y=violent_means_hi)
])
# Change the bar mode
fig.update_layout(title='Violent Crime Comparison',barmode='group')
fig.show()

ratio = violent_means_hi/violent_means_lo
fig = go.Figure(data=[
    go.Bar(name='Ratio', x=types, y=ratio, text=["{r:.2f}".format(r=r) for r in ratio],
           textposition='auto')
])
# Change the bar mode
fig.update_layout(title='Violent Crime rate ratio')
fig.show()

# fig = go.Figure()
# fig.add_trace(go.Bar(
#     x=sorted_by_violent.index,
#     y=sorted_by_violent.Robbery,
#     name='Robbery',
# ))
# fig.add_trace(go.Bar(
#     x=sorted_by_violent.index,
#     y=sorted_by_violent.Rape,
#     name='Rape'
# ))
# fig.add_trace(go.Bar(
#     x=sorted_by_violent.index,
#     y=sorted_by_violent.Murder_and_nonnegligent_manslaughter,
#     name='Murder_and_nonnegligent_manslaughter'
# ))
# fig.add_trace(go.Bar(
#     x=sorted_by_violent.index,
#     y=sorted_by_violent.Aggravated_assault,
#     name='Aggravated_assault'
# ))
# fig.show()

In [128]:
sorted_by_property = crime_DF.sort_values(by='Property_crime')
property_lo = sorted_by_property.iloc[0:p_25]
property_hi = sorted_by_property.iloc[p_75:]

types = ['Burglary', 'Larceny_theft', 'Motor_vehicle_theft']
pmeans_lo = property_lo.mean()
pmeans_hi = property_hi.mean()
property_means_lo = pmeans_lo.loc[['Burglary','Larceny_theft','Motor_vehicle_theft']].values
property_means_hi = pmeans_hi.loc[['Burglary','Larceny_theft','Motor_vehicle_theft']].values

fig = go.Figure(data=[
    go.Bar(name='Low-property-rate Counties', x=types, y=property_means_lo),
    go.Bar(name='High-property-rate Counties', x=types, y=property_means_hi)
])
# Change the bar mode
fig.update_layout(title='Property Crime Comparison',barmode='group')
fig.show()

ratio = property_means_hi/property_means_lo
fig = go.Figure(data=[
    go.Bar(name='Ratio', x=types, y=ratio, text=["{r:.2f}".format(r=r) for r in ratio],
           textposition='auto')
])
# Change the bar mode
fig.update_layout(title='Property Crime rate ratio')
fig.show()

Some counties seem to have extremely high crime rates and look like outliers. We will explore why those counties have such high crime rates in the analysis part.
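
One common way to make "looks like an outlier" concrete is the 1.5 x IQR rule that box plots are usually based on. The source does not say which rule, if any, was applied, so this is only an illustrative sketch with made-up rates; `iqr_outliers` is a hypothetical helper.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


rates = [60, 100, 150, 110, 90, 130, 529]  # illustrative rates per 100,000
print(iqr_outliers(rates))
```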

### Diabetes data:

In [82]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px

desired_width = 600
pd.set_option('display.width', desired_width)

input_file = ['raw_data/DiabetesVA.csv', 'raw_data/DiabetesMD.csv']

diabetes_va, diabetes_md = pd.read_csv(input_file[0]), pd.read_csv(input_file[1])

In [83]:
diabetes_va.head()

Unnamed: 0,Diagnosed Diabetes; Age-Adjusted; Percentage; Adults Aged 20+ Years; Virginia Counties; 2016,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,Data downloaded on 5-October-2019,,,,,
1,County,State,CountyFIPS,Percentage,Lower Limit,Upper Limit
2,Accomack County,Virginia,51001,12.7,9.5,16.8
3,Albemarle County,Virginia,51003,6.2,4.5,8.4
4,Alexandria City,Virginia,51510,6.6,4.9,8.8


In [84]:
diabetes_md.head()

Unnamed: 0,Diagnosed Diabetes; Age-Adjusted; Percentage; Adults Aged 20+ Years; Maryland Counties; 2016,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,Data downloaded on 8-October-2019,,,,,
1,County,State,CountyFIPS,Percentage,Lower Limit,Upper Limit
2,Allegany County,Maryland,24001,13.2,11.1,15.5
3,Anne Arundel County,Maryland,24003,8.8,8.0,9.7
4,Baltimore City,Maryland,24510,12.2,11.1,13.4


In [241]:
len(diabetes_md)

24

In [85]:
############### drop description rows and columns
diabetes_va.columns = diabetes_va.iloc[1].values
diabetes_va.drop([0, 1, 136], axis=0, inplace=True)
diabetes_va.drop(['Lower Limit', ' Upper Limit'], axis=1, inplace=True)
diabetes_md.columns = diabetes_md.iloc[1].values
diabetes_md.drop([0, 1, 26], axis=0, inplace=True)
diabetes_md.drop(['Lower Limit', ' Upper Limit'], axis=1, inplace=True)
diabetes_DF = pd.concat([diabetes_va, diabetes_md], ignore_index=True)

In [237]:
diabetes_va.tail()

Unnamed: 0,County,State,CountyFIPS,Percentage
131,Williamsburg City,Virginia,51830,11.3
132,Winchester City,Virginia,51840,7.3
133,Wise County,Virginia,51195,16.0
134,Wythe County,Virginia,51197,10.6
135,York County,Virginia,51199,7.6


In [86]:
diabetes_DF.head()

Unnamed: 0,County,State,CountyFIPS,Percentage
0,Accomack County,Virginia,51001,12.7
1,Albemarle County,Virginia,51003,6.2
2,Alexandria City,Virginia,51510,6.6
3,Alleghany County,Virginia,51005,13.2
4,Amelia County,Virginia,51007,12.1


In [87]:
############### we rename counties to match the crime data and only keep county data.
s1 = 'County'
s2 = 'City'

def bf(s):
    if s == 'Virginia':
        return ', VA'
    elif s == 'Maryland':
        return ', MD'
    else:
        raise ValueError('unexpected state: ' + str(s))
    
for n, c in enumerate(diabetes_DF.County.values):
    if c.endswith(s1):
        diabetes_DF.loc[n, 'County'] = c[:-7] + bf(diabetes_DF.loc[n, 'State'])
    elif c.endswith(s2):
        diabetes_DF.drop(n, axis=0, inplace=True)
############### missing values
print('Missing values:', np.where(pd.isnull(diabetes_DF)))

Missing values: (array([], dtype=int64), array([], dtype=int64))


In [88]:
diabetes_DF.tail()

Unnamed: 0,County,State,CountyFIPS,Percentage
153,"St. Marys, MD",Maryland,24037,11.2
154,"Talbot, MD",Maryland,24041,8.8
155,"Washington, MD",Maryland,24043,12.0
156,"Wicomico, MD",Maryland,24045,10.6
157,"Worcester, MD",Maryland,24047,9.3


In [89]:
############### Reassign index
diabetes_DF.sort_values(by='County', inplace=True)
diabetes_DF.index = diabetes_DF.County.values
diabetes_DF.drop(['County', 'State'], axis=1, inplace=True)

In [92]:
diabetes_DF.head()

Unnamed: 0,CountyFIPS,Percentage
"Accomack, VA",51001,12.7
"Albemarle, VA",51003,6.2
"Allegany, MD",24001,13.2
"Alleghany, VA",51005,13.2
"Amelia, VA",51007,12.1


In [93]:
# histogram for distribution of diabetes percentage in VA
# fig = px.histogram(diabetes_DF, x="Percentage",nbins=20)
# fig.update_layout(
#     title='Diabetes Distribution Histogram',
#     xaxis_title='Percentage',
#     yaxis_title='Count'
# )
# fig.show()

import plotly.figure_factory as ff
import numpy as np

# hist_data = [x]
group_labels = ['Percentage'] # name of the dataset

fig = ff.create_distplot([diabetes_DF.Percentage.values.astype(float)],
                         group_labels, bin_size=.5)
fig.update_layout(
    title='Diabetes Distribution Histogram',
    xaxis_title='Percentage',
    yaxis_title='Count'
)
fig.show()

In [94]:
diabetes_DF.to_csv('cleaned_data/diabetes_cleaned.csv')

In [95]:
diabetes_DF = pd.read_csv('cleaned_data/diabetes_cleaned.csv')

fig = ff.create_choropleth(fips=diabetes_DF.CountyFIPS, 
                           values=diabetes_DF.Percentage.astype(float),
                           scope=['VA', 'MD', 'DE','WV'], 
                           binning_endpoints=[1,2,3,4,5,6,7,8,9,10,11,12,13],
#                            simplify_county=0.5,
                           state_outline={'width': 1},
                           county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
                           legend_title='Diabetes by County', 
                           title='Diabetes rates by Counties in VA and MD')
fig.layout.template = None
fig.show()







### Mobility data

**The population mobility data are county-to-county migration flows during 2012-2016.**

**For example, suppose we want to know the migration flows of a county in VA; call it county A. People can move from county A to many other counties, and from many other counties into county A. Call one such county B; it is not necessarily in VA.**

**This dataset records the number of people who moved from B to A and from A to B during 2012-2016.**

In [129]:
# import pandas as pd
# import numpy as np
# import plotly.graph_objects as go
# import plotly.figure_factory as ff


# desired_width = 600
# pd.set_option('display.width', desired_width)

input_file = 'raw_data/mobility_2012_2016.xlsx'

mobility_va_ori = pd.read_excel(input_file, sheet_name='Virginia', header=1)
mobility_md_ori = pd.read_excel(input_file, sheet_name='Maryland', header=1)
mobility_va = mobility_va_ori.copy()
mobility_md = mobility_md_ori.copy()

In [130]:
mobility_va.head()

Unnamed: 0,State Code of Geography A,FIPS County Code of Geography A,State/U.S. Island Area/Foreign Region Code of Geography B,FIPS County Code of Geography B,State Name of Geography A,County Name of Geography A,State/U.S. Island Area/Foreign Region of Geography B,County Name of Geography B,Flow from Geography B to Geography A,Unnamed: 9,Counterflow from Geography A to Geography B1,Unnamed: 11,Net Migration from Geography B to Geography A1,Unnamed: 13,Gross Migration between Geography A and Geography B1,Unnamed: 15
0,,,,,,,,,Estimate,MOE,Estimate,MOE,Estimate,MOE,Estimate,MOE
1,51.0,1.0,1.0,3.0,Virginia,Accomack County,Alabama,Baldwin County,0,25,96,152,-96,152,96,152
2,51.0,1.0,1.0,89.0,Virginia,Accomack County,Alabama,Madison County,0,25,12,19,-12,19,12,19
3,51.0,1.0,2.0,130.0,Virginia,Accomack County,Alaska,Ketchikan Gateway Borough,0,25,8,12,-8,12,8,12
4,51.0,1.0,4.0,13.0,Virginia,Accomack County,Arizona,Maricopa County,3,6,0,28,3,6,3,6


In [131]:
mobility_md.head()

Unnamed: 0,State Code of Geography A,FIPS County Code of Geography A,State/U.S. Island Area/Foreign Region Code of Geography B,FIPS County Code of Geography B,State Name of Geography A,County Name of Geography A,State/U.S. Island Area/Foreign Region of Geography B,County Name of Geography B,Flow from Geography B to Geography A,Unnamed: 9,Counterflow from Geography A to Geography B1,Unnamed: 11,Net Migration from Geography B to Geography A1,Unnamed: 13,Gross Migration between Geography A and Geography B1,Unnamed: 15
0,,,,,,,,,Estimate,MOE,Estimate,MOE,Estimate,MOE,Estimate,MOE
1,24.0,1.0,1.0,127.0,Maryland,Allegany County,Alabama,Walker County,29,35,0,27,29,35,29,35
2,24.0,1.0,2.0,16.0,Maryland,Allegany County,Alaska,Aleutians West Census Area,0,28,6,7,-6,7,6,7
3,24.0,1.0,4.0,13.0,Maryland,Allegany County,Arizona,Maricopa County,3,6,0,28,3,6,3,6
4,24.0,1.0,4.0,27.0,Maryland,Allegany County,Arizona,Yuma County,15,17,0,28,15,17,15,17


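The raw table has 16 columns, most of which we do not need. The next cell keeps only the county identifiers and the inflow and outflow estimates.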

In [132]:
##################### extract useful data #####################
mobility_va = mobility_va.loc[:, ['FIPS County Code of Geography A',
                                  'County Name of Geography A', 'County Name of Geography B',
                                  'Flow from Geography B to Geography A', 'Counterflow from Geography A to Geography B1']]
mobility_va.drop([0, 18817, 18818,18819,18820,18821], axis=0, inplace=True)
mobility_va.columns = ['FIPS','County', 'County2', 'In', 'Out']
mobility_va['State'] = 'VA'
mobility_md = mobility_md.loc[:, ['FIPS County Code of Geography A',
                                  'County Name of Geography A', 'County Name of Geography B',
                                  'Flow from Geography B to Geography A', 'Counterflow from Geography A to Geography B1']]
mobility_md.drop([0, 6094,6095,6096,6097,6098], axis=0, inplace=True)
mobility_md.columns = ['FIPS','County', 'County2', 'In', 'Out']
mobility_md['State'] = 'MD'

In [133]:
mobility_va.head()

Unnamed: 0,FIPS,County,County2,In,Out,State
1,1.0,Accomack County,Baldwin County,0,96,VA
2,1.0,Accomack County,Madison County,0,12,VA
3,1.0,Accomack County,Ketchikan Gateway Borough,0,8,VA
4,1.0,Accomack County,Maricopa County,3,0,VA
5,1.0,Accomack County,Washington County,0,35,VA


In [134]:
##################### look at it again
print(mobility_va.head())

   FIPS           County                    County2 In Out State
1   1.0  Accomack County             Baldwin County  0  96    VA
2   1.0  Accomack County             Madison County  0  12    VA
3   1.0  Accomack County  Ketchikan Gateway Borough  0   8    VA
4   1.0  Accomack County            Maricopa County  3   0    VA
5   1.0  Accomack County          Washington County  0  35    VA


**Next we want to generate new features from the inflow and outflow data.**

**For example, for Accomack County above, we add up all the inflows and all the outflows. These two sums cannot be used directly as features because counties differ in population size.**

**An easy way to normalize is to divide the total inflow by the total outflow (in/out). If this value equals 1, the number of people moving in equals the number moving out. If it is greater than 1, more people move into the county than move out of it.**

**We also generate a second feature. The original idea was the overseas inflow divided by the total inflow, representing the share of people moving in from overseas; in the code below that calculation is commented out, and `mob_in_ratio` is instead the total inflow divided by the county population.**
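
The aggregation can be sketched on a few rows before looking at the full function. The pairs below are the first five Accomack County rows from the preview above, so the totals are not the real county totals. The population of 32,947 is the 2016 Accomack value from the population table, used here only for illustration, and `flow_features` is a hypothetical helper that mirrors the two ratios computed in the next cell.

```python
def flow_features(flows, population):
    """flows: list of (inflow, outflow) pairs for one county."""
    total_in = sum(i for i, _ in flows)
    total_out = sum(o for _, o in flows)
    # Guard against a county with no recorded outflow.
    in_out_ratio = total_in / total_out if total_out else float('nan')
    mob_in_ratio = total_in / population
    return total_in, total_out, in_out_ratio, mob_in_ratio


accomack_rows = [(0, 96), (0, 12), (0, 8), (3, 0), (0, 35)]  # (In, Out)
total_in, total_out, in_out, mob_in = flow_features(accomack_rows, 32947)
print(total_in, total_out, in_out, mob_in)
```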

In [182]:
##################### generate new features #####################
def fe_gen(df, pops, state_fips, state):
    """Aggregate the county-to-county flows of one state into per-county features."""
    agg_flows = []
    for area in pd.unique(df.County.values):
        # Keep counties only; independent cities (names ending in 'city') are skipped.
        if not area.endswith('County'):
            continue
        area_df = df.loc[df['County'] == area]
        area_dic = {}
        # State prefix + zero-padded 3-digit county code
        area_dic['FIPS'] = state_fips + str(int(area_df.FIPS.values[0])).zfill(3)
        # 'Accomack County' -> 'Accomack, VA', the key used to look up the population
        area_dic['County'] = area[:-7].strip() + ', ' + state
        # County population
        pop = pops.loc[area_dic['County']].values
        area_dic['Population'] = pop[0]
        outflow = np.sum(area_df['Out'])
        # overseas = np.sum(area_df.loc[area_df['County2']=='-','In'])
        # inflow = np.sum(area_df['In']) - overseas
        inflow = np.sum(area_df['In'])
        area_dic['mob_in_ratio'] = (inflow / pop)[0]   # inflow per resident
        area_dic['in_out_ratio'] = inflow / outflow    # > 1 means more people moved in than out
        agg_flows.append(area_dic)
    return pd.DataFrame.from_dict(agg_flows)

pop = pd.read_csv('cleaned_data/pop_cleaned.csv', names=['Population'] , index_col=0)
mob_va_cleaned = fe_gen(mobility_va, pop, '051', "VA")
mob_md_cleaned = fe_gen(mobility_md, pop, '024', "MD")
mob_cleaned = pd.concat([mob_va_cleaned, mob_md_cleaned], ignore_index=True)
print('Missing values:', np.where(pd.isnull(mob_cleaned)))
#################################################################

Missing values: (array([], dtype=int64), array([], dtype=int64))


95

In [183]:
mob_cleaned

Unnamed: 0,County,FIPS,Population,in_out_ratio,mob_in_ratio
0,"Accomack, VA",051001,27073.0,0.804408,0.053928
1,"Albemarle, VA",051003,105715.0,1.292739,0.121601
2,"Alleghany, VA",051005,11822.0,0.652778,0.067586
3,"Amelia, VA",051007,12856.0,0.393004,0.044571
4,"Amherst, VA",051009,29930.0,1.032148,0.070799
5,"Appomattox, VA",051011,15388.0,0.336057,0.021380
6,"Arlington, VA",051013,236691.0,1.092936,0.128636
7,"Augusta, VA",051015,74809.0,0.876044,0.063107
8,"Bath, VA",051017,4652.0,2.170507,0.101247
9,"Bedford, VA",051019,70904.0,1.060827,0.067641


In [184]:
mob_cleaned.sort_values(by='County', inplace=True)
mob_cleaned.index = mob_cleaned.County.values
mob_cleaned.drop(['County'], axis=1, inplace=True)
mob_cleaned

Unnamed: 0,FIPS,Population,in_out_ratio,mob_in_ratio
"Accomack, VA",051001,27073.0,0.804408,0.053928
"Albemarle, VA",051003,105715.0,1.292739,0.121601
"Allegany, MD",024001,72130.0,1.066239,0.055345
"Alleghany, VA",051005,11822.0,0.652778,0.067586
"Amelia, VA",051007,12856.0,0.393004,0.044571
"Amherst, VA",051009,29930.0,1.032148,0.070799
"Anne Arundel, MD",024003,568346.0,0.960957,0.059763
"Appomattox, VA",051011,15388.0,0.336057,0.021380
"Arlington, VA",051013,236691.0,1.092936,0.128636
"Augusta, VA",051015,74809.0,0.876044,0.063107


In [168]:
# mob_cleaned.loc[mob_cleaned.index.values=='Norton city, VA']

Unnamed: 0,FIPS,Population,in_out_ratio,mob_in_ratio


In [185]:
fig = go.Figure()
fig.add_trace(go.Box(x=mob_cleaned.in_out_ratio, name='in/out',jitter=0.3,
                    boxpoints='suspectedoutliers',marker_color = 'indianred'
                    ))

# fig.add_trace(go.Box(y=mob_cleaned.mob_in_ratio, name='in/population', 
#                      boxpoints='all',jitter=0.3,pointpos=-1.8,
#                     ))
fig.update_layout(
    title='Mobility in/out ratio boxplot',
    xaxis_title='ratio'
)
fig.show()

fig = go.Figure()
fig.add_trace(go.Box(x=mob_cleaned.mob_in_ratio, name='in/population',
                     jitter=0.3,
                    boxpoints='suspectedoutliers'
                    ))

# fig.add_trace(go.Box(y=mob_cleaned.mob_in_ratio, name='in/population', 
#                      boxpoints='all',jitter=0.3,pointpos=-1.8,
#                     ))
fig.update_layout(
    title='Mobility in/population ratio boxplot',
    xaxis_title='ratio'
)
fig.show()

#### The boxplot shows that one in/out ratio is far higher than the majority,
#### so we check what happened.
# with pd.option_context('display.max_columns', None):
#     print(mob_va_cleaned.sort_values(by='mob_ratio').iloc[-5:])

In [187]:
### The boxplot shows that one in/out ratio is far higher than the majority,
### so we check what happened.
mob_cleaned.sort_values(by='in_out_ratio').iloc[-10:]

Unnamed: 0,FIPS,Population,in_out_ratio,mob_in_ratio
"Sussex, VA",51183,9672.0,1.666063,0.190343
"Patrick, VA",51141,18039.0,1.669281,0.070791
"Montgomery, VA",51121,98440.0,1.688818,0.145601
"New Kent, VA",51127,20895.0,1.979825,0.108016
"Somerset, MD",24039,25928.0,2.027011,0.133138
"Shenandoah, VA",51171,25854.0,2.068985,0.140365
"Bath, VA",51017,4652.0,2.170507,0.101247
"Northumberland, VA",51133,12089.0,2.402135,0.055836
"Bland, VA",51021,6571.0,2.789474,0.11292
"Highland, VA",51091,2300.0,4.064516,0.054783


In [188]:
mob_cleaned.sort_values(by='mob_in_ratio').iloc[-10:]

Unnamed: 0,FIPS,Population,in_out_ratio,mob_in_ratio
"Richmond, VA",51159,7534.0,1.568297,0.12344
"Arlington, VA",51013,236691.0,1.092936,0.128636
"Somerset, MD",24039,25928.0,2.027011,0.133138
"Shenandoah, VA",51171,25854.0,2.068985,0.140365
"Prince George, VA",51149,36656.0,1.145676,0.140959
"Montgomery, VA",51121,98440.0,1.688818,0.145601
"Nottoway, VA",51135,9962.0,0.992547,0.147059
"Prince Edward, VA",51147,15424.0,1.080418,0.154175
"Greensville, VA",51081,11625.0,1.663957,0.158452
"Sussex, VA",51183,9672.0,1.666063,0.190343


In [189]:
mob_cleaned.to_csv('cleaned_data/mob_cleaned.csv')
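
**To make the outlier check from the boxplots more concrete, here is a minimal sketch of the conventional 1.5 × IQR rule. It uses only the `in_out_ratio` values of the first ten counties shown earlier plus Highland, so the fences are illustrative and will differ from those of the full table; the variable names are ours, and pandas' default quantile interpolation may not match the quartile method of the plotting library.**

```python
import pandas as pd

# in_out_ratio values copied from the tables above: the first ten Virginia
# counties plus Highland, the largest value in the data set.
ratios = pd.Series({
    'Accomack, VA': 0.804408, 'Albemarle, VA': 1.292739,
    'Alleghany, VA': 0.652778, 'Amelia, VA': 0.393004,
    'Amherst, VA': 1.032148, 'Appomattox, VA': 0.336057,
    'Arlington, VA': 1.092936, 'Augusta, VA': 0.876044,
    'Bath, VA': 2.170507, 'Bedford, VA': 1.060827,
    'Highland, VA': 4.064516,
})

# Tukey fences: points beyond 1.5 * IQR from the quartiles are flagged.
q1, q3 = ratios.quantile(0.25), ratios.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ratios[(ratios < lower_fence) | (ratios > upper_fence)]
print(outliers)
```

**Both flagged counties have small populations (4,652 and 2,300), so a handful of movers is enough to shift their ratios substantially.**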

In [203]:
mob_cleaned = pd.read_csv('cleaned_data/mob_cleaned.csv')
fig = ff.create_choropleth(fips=mob_cleaned.FIPS, values=mob_cleaned.mob_in_ratio,
                           scope=['VA', 'MD', 'DE','WV'], 
                           binning_endpoints=[0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
                           0.1, 0.11, 0.14],
                           county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
                           legend_title='Mobility in/population ratio by County', 
                           title='Mobility by county in VA and MD')
fig.layout.template = None
fig.show()

### Census data:


In [77]:
census_df = pd.read_csv('raw_data/census_raw.csv')

                               County h_grad b_grad o_occ_r o_occ_mv o_m_cst  \
0             Amelia County, Virginia  80.8%  14.5%   83.6%    83.6%  $1,332   
1           Accomack County, Virginia  82.2%  19.6%   70.0%    70.0%  $1,147   
2          Albemarle County, Virginia  91.4%  52.3%   63.6%    63.6%  $1,769   
3  Alexandria city, Virginia (County)  91.4%  61.8%   43.1%    43.1%  $2,648   
4          Alleghany County, Virginia  86.5%  15.8%   76.1%    76.1%    $950   

  gos_ret ps_pr_hh  lv_sm tvl_t emp_chg  
0    $729     2.69  90.6%  38.0   -1.4%  
1    $771     2.35  93.4%  22.0    4.0%  
2  $1,189     2.44  81.3%  21.9    3.2%  
3  $1,663     2.23  78.0%  31.8    0.2%  
4    $653     2.21  89.3%  24.6   -2.1%  


In [79]:


perc_col = ['h_grad', 'b_grad', 'o_occ_r', 'o_occ_mv', 'lv_sm', 'emp_chg']
mny_col = ['o_m_cst', 'gos_ret']

for c in perc_col:
    # '80.8%' -> 0.808
    census_df.loc[:, c] = [float(v[:-1]) / 100 for v in census_df.loc[:, c].values]

for c in mny_col:
    # '$1,332' -> 1332.0
    census_df.loc[:, c] = [float(v[1:].replace(',', '')) for v in census_df.loc[:, c].values]

In [None]:
census_df.iloc[:, 1:].astype(float)

In [None]:
census_df.to_csv('cleaned_data/census_cleaned.csv')
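
**The same percent and dollar cleaning can also be written with vectorized string methods. The sketch below is an alternative, not the code used above: the helper names `percent_to_fraction` and `dollars_to_number` are hypothetical, and the toy frame reuses two rows of the raw census output.**

```python
import pandas as pd

# Two rows in the same string format as census_raw.csv.
raw = pd.DataFrame({
    'County': ['Amelia County, Virginia', 'Accomack County, Virginia'],
    'h_grad': ['80.8%', '82.2%'],
    'emp_chg': ['-1.4%', '4.0%'],
    'o_m_cst': ['$1,332', '$1,147'],
})

def percent_to_fraction(col):
    """'80.8%' -> 0.808"""
    return pd.to_numeric(col.str.rstrip('%')) / 100

def dollars_to_number(col):
    """'$1,332' -> 1332"""
    return pd.to_numeric(col.str.replace('[$,]', '', regex=True))

clean = raw.copy()
for c in ['h_grad', 'emp_chg']:
    clean[c] = percent_to_fraction(raw[c])
for c in ['o_m_cst']:
    clean[c] = dollars_to_number(raw[c])
print(clean)
```

**If some cells might not parse, `pd.to_numeric(..., errors='coerce')` turns them into NaN instead of raising, which makes them easy to find afterwards.**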

### Crime data in DC:

In [29]:
import requests
from pprint import pprint
import pandas as pd
import logging
import json
import numpy as np
from datetime import datetime


# First let's have a preview of the data
crime_df = pd.read_csv('raw_data/crime_df.csv')
print(crime_df.head())
print('Shape of the DF:', crime_df.shape)

   Unnamed: 0                                       BLOCK      END_DATE  \
0           0   600 - 669 BLOCK OF PENNSYLVANIA AVENUE SE  1.453975e+12   
1           1       2600 - 2799 BLOCK OF JASPER STREET SE  1.454784e+12   
2           2  4500 - 4529 BLOCK OF CONNECTICUT AVENUE NW  1.463393e+12   
3           3       2400 - 2499 BLOCK OF MARKET STREET NE  1.473258e+12   
4           4     1300 - 1353 BLOCK OF MARYLAND AVENUE NE  1.480817e+12   

    LATITUDE  LONGITUDE  METHOD NEIGHBORHOOD_CLUSTER       OFFENSE  \
0  38.885133 -76.997326  OTHERS           Cluster 26   THEFT/OTHER   
1  38.851744 -76.969241  OTHERS           Cluster 36   THEFT/OTHER   
2  38.948353 -77.065951  OTHERS           Cluster 12   THEFT/OTHER   
3  38.919914 -76.952698  OTHERS           Cluster 24  THEFT F/AUTO   
4  38.898528 -76.987354  OTHERS           Cluster 25  THEFT F/AUTO   

     REPORT_DAT     SHIFT    START_DATE  
0  1.455026e+12       DAY  1.453223e+12  
1  1.455284e+12       DAY  1.454783e+12  
2 

In [30]:
# Extract data columns we need
cols = ['START_DATE','END_DATE','SHIFT','LATITUDE','LONGITUDE',
        'BLOCK','OFFENSE','METHOD']
crime_df = crime_df[cols]

# check missing values
print(np.where(pd.isnull(crime_df)))
print('Missing values in column', np.unique(np.where(pd.isnull(crime_df))[1]))
# count the missing cells
print('Number of missing values', pd.isnull(crime_df).values.sum())

# check data values
print('Number of unique BLOCK:', len(np.unique(crime_df.BLOCK.values)))
print('Number of unique METHOD:', len(np.unique(crime_df.METHOD.values)))
print('All METHODS:', np.unique(crime_df.METHOD.values))
print('Number of unique OFFENSE:', len(np.unique(crime_df.OFFENSE.values)))
print('All OFFENSE:', np.unique(crime_df.OFFENSE.values))
print('All SHIFT:', np.unique(crime_df.SHIFT.values))

(array([   37,    43,    89, ..., 36988, 37105, 37139]), array([1, 1, 1, ..., 1, 1, 1]))
Missing values in column [1]
Number of missing values 1219
Number of unique BLOCK: 7226
Number of unique METHOD: 3
All METHODS: ['GUN' 'KNIFE' 'OTHERS']
Number of unique OFFENSE: 9
All OFFENSE: ['ARSON' 'ASSAULT W/DANGEROUS WEAPON' 'BURGLARY' 'HOMICIDE'
 'MOTOR VEHICLE THEFT' 'ROBBERY' 'SEX ABUSE' 'THEFT F/AUTO' 'THEFT/OTHER']
All SHIFT: ['DAY' 'EVENING' 'MIDNIGHT']
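
**A per-column summary is often easier to read than the raw index arrays. The following is a minimal sketch on a made-up four-row frame (the values are illustrative, not taken from `crime_df`) that counts missing cells per column and reports their share.**

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gap as crime_df: END_DATE is sometimes missing.
toy = pd.DataFrame({
    'START_DATE': [1.453223e12, 1.454783e12, 1.463000e12, 1.473000e12],
    'END_DATE':   [1.453975e12, np.nan, 1.463393e12, np.nan],
    'SHIFT':      ['DAY', 'DAY', 'EVENING', 'MIDNIGHT'],
})

missing_per_column = toy.isnull().sum()    # number of missing cells in each column
missing_share = toy.isnull().mean()        # fraction of rows missing in each column
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(missing_share)
print('Total missing cells:', total_missing)
```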


**All dates are UNIX timestamps in milliseconds, so we convert them to human-readable times.**

**None of the variables other than the date variables have missing values, incorrect values, or format issues.**

**The data frame has missing values only in column 1 (END_DATE), which holds the date and time at which the crime ended. About 3% of the samples are missing this value.**

**We fill each missing END_DATE with the corresponding START_DATE plus the average duration between START_DATE and END_DATE, computed over the rows where both are present.**

In [25]:
# locate the missing values
not_null = pd.notnull(crime_df.END_DATE)
is_null = pd.isnull(crime_df.END_DATE)

## dates in this df are all UNIX timestamps in milliseconds (seconds + 000 at the end)
## convert them to seconds so they can be turned into human-readable times
crime_df['START_DATE'] = (crime_df.START_DATE.values / 1000).astype(int)
crime_df.loc[not_null, 'END_DATE'] = (crime_df.loc[not_null, 'END_DATE'].values / 1000).astype(int)

## Compute the average time between START_DATE and END_DATE and fill the missing values in END_DATE
t_mean = int(np.mean(crime_df.loc[not_null, 'END_DATE'].values - crime_df.loc[not_null, 'START_DATE'].values))
crime_df.loc[is_null, 'END_DATE'] = crime_df.loc[is_null, 'START_DATE'].values + t_mean
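
**The same conversion and imputation can be sketched with pandas datetime types, which avoids the manual division by 1000. This is an alternative illustration on made-up timestamps, not the code used in this notebook; `pd.to_datetime(..., unit='ms')` interprets the numbers as milliseconds since the epoch and turns NaN into NaT.**

```python
import numpy as np
import pandas as pd

# Toy millisecond timestamps; the second END_DATE is missing.
toy = pd.DataFrame({
    'START_DATE': [1453223000000, 1454783000000, 1463390000000],
    'END_DATE':   [1453226600000, np.nan, 1463397200000],
})

start = pd.to_datetime(toy['START_DATE'], unit='ms')
end = pd.to_datetime(toy['END_DATE'], unit='ms')    # NaN becomes NaT

# Mean duration over the rows where END_DATE is present (NaT is skipped).
mean_duration = (end - start).mean()

# Fill each missing END_DATE with START_DATE + mean duration.
end_filled = end.fillna(start + mean_duration)
print(mean_duration)
print(end_filled)
```

**One difference to keep in mind: `pd.to_datetime` returns times in UTC, whereas `datetime.fromtimestamp` converts to the local time zone of the machine, so the two approaches can disagree by the UTC offset.**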

**Next we want to check if the samples are between 1/1/2016 and 12/31/2016.**

In [None]:
start_time = datetime(2016, 1, 1, 0, 0, 0)
end_time = datetime(2017, 1, 1, 0, 0, 0)
start_dates = [datetime.fromtimestamp(t) for t in crime_df.START_DATE.values]
end_dates = [datetime.fromtimestamp(t) for t in crime_df.END_DATE.values]
crime_df.START_DATE = start_dates
crime_df.END_DATE = end_dates

## Keep rows whose START_DATE falls within 2016
true_time = [start_time <= t < end_time for t in start_dates]
crime_df = crime_df[true_time]
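
**For reference, a year filter of this kind can also be expressed as a boolean mask on a datetime column. The sketch below uses four made-up timestamps and a half-open interval, so the first instant of 2016 is kept and the first instant of 2017 is dropped.**

```python
import pandas as pd

# Made-up start times around the boundaries of 2016.
starts = pd.Series(pd.to_datetime([
    '2015-12-31 23:59:59',
    '2016-01-01 00:00:00',
    '2016-07-04 12:00:00',
    '2017-01-01 00:00:00',
]))

# Half-open interval [2016-01-01, 2017-01-01).
in_2016 = (starts >= '2016-01-01') & (starts < '2017-01-01')
kept = starts[in_2016]
print(kept)
```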

## 3. Initial Visual Analysis

We may want to see which counties have high crime rates:

In [87]:
# bar chart for 10 counties with lowest violent rate and 10 counties with highest violent rate

crime_DF = pd.read_csv('cleaned_data/crime_VM_cleaned.csv')
sorted_by_violent = crime_DF.sort_values(by='Violent').iloc[np.r_[0:10, -10:0]]

fig = go.Figure()
fig.add_trace(go.Bar(
    x=sorted_by_violent.County,
    y=sorted_by_violent.Robbery,
    name='Robbery',
))
fig.add_trace(go.Bar(
    x=sorted_by_violent.County,
    y=sorted_by_violent.Rape,
    name='Rape'
))
fig.add_trace(go.Bar(
    x=sorted_by_violent.County,
    y=sorted_by_violent.Murder_and_nonnegligent_manslaughter,
    name='Murder_and_nonnegligent_manslaughter'
))
fig.add_trace(go.Bar(
    x=sorted_by_violent.County,
    y=sorted_by_violent.Aggravated_assault,
    name='Aggravated_assault'
))

fig.update_layout(
    title='10 counties with the highest violent crime rates and 10 with the lowest',
    yaxis_title='crimes per 100,000 inhabitants',
)
fig.show()

In [88]:
# bar chart for 10 counties with lowest Property_crime rate and 10 counties with highest Property_crime rate
sorted_by_property = crime_DF.sort_values(by='Property_crime').iloc[np.r_[0:10, -10:0]]

fig = go.Figure()
fig.add_trace(go.Bar(
    x=sorted_by_property.County,
    y=sorted_by_property.Burglary,
    name='Burglary',
))
fig.add_trace(go.Bar(
    x=sorted_by_property.County,
    y=sorted_by_property.Larceny_theft,
    name='Larceny_theft'
))
fig.add_trace(go.Bar(
    x=sorted_by_property.County,
    y=sorted_by_property.Motor_vehicle_theft,
    name='Motor_vehicle_theft'
))

fig.update_layout(
    title='10 counties with the highest property crime rates and 10 with the lowest',
    yaxis_title='crimes per 100,000 inhabitants',
)
fig.show()

## Results

## Conclusions