# Analysis Tutorial

This notebook showcases a few of the metrics we developed and their general distributions, for anyone who wants to consume our output dataset.

## Imports

In [1]:
import pandas as pd
import plotly.express as px

## Load Data

In [59]:
df = pd.read_csv('../data/output.csv', low_memory=False)
df.sample(2)

Unnamed: 0,case_number,date_of_incident,date_of_death,age,gender,race,latino,manner_of_death,primarycause,primarycause_linea,...,inhalant_related_primary,cannabis_related_primary,death_datetime,death_time,death_date,death_year,death_month,death_day,death_week,motel
39564,ME2020-00360,01/13/2020 10:30:00 PM,01/14/2020 11:11:00 PM,63.0,Male,Black,False,NATURAL,ORGANIC CARDIOVASCULAR DISEASE,,...,False,False,2020-01-14 23:11:00,23:11:00,2020-01-14,2020.0,1.0,14.0,3.0,False
11735,ME2016-02560,05/24/2016 08:00:00 AM,05/24/2016 10:00:00 AM,50.0,Male,Black,False,SUICIDE,INTRAORAL GUNSHOT WOUND OF HEAD,,...,False,False,2016-05-24 10:00:00,10:00:00,2016-05-24,2016.0,5.0,24.0,21.0,False
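The derived `death_*` columns in the sample look consistent with parsing `date_of_death`. The sketch below shows one way such columns could be produced. It is an assumption about the approach, not the pipeline's actual code: the format string and the use of ISO week numbers are inferred from the two sample rows only.

```python
import pandas as pd

# Two timestamps copied from the sample rows above.
raw = pd.DataFrame(
    {"date_of_death": ["01/14/2020 11:11:00 PM", "05/24/2016 10:00:00 AM"]}
)

# Assumed format: month/day/year with a 12-hour clock and an AM/PM marker.
parsed = pd.to_datetime(raw["date_of_death"], format="%m/%d/%Y %I:%M:%S %p")

derived = pd.DataFrame({
    "death_datetime": parsed,
    "death_date": parsed.dt.date,
    "death_time": parsed.dt.time,
    "death_year": parsed.dt.year,
    "death_month": parsed.dt.month,
    "death_day": parsed.dt.day,
    # ISO week number: an assumption that agrees with both sample rows.
    "death_week": parsed.dt.isocalendar().week,
})
print(derived)
```

In the output dataset these columns are floats (for example `2020.0`), which suggests some records have missing dates; the sketch does not handle that case.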


Distribution of our geocoding scores, as returned by the ArcGIS geocoding web service.

In [60]:
px.histogram(df, x='coded_score', nbins=10, histnorm='probability')
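To read the same distribution as numbers, a binned share table can help. This is a minimal sketch on toy values standing in for `coded_score`; the 0 to 100 score range and the ten equal-width bins are assumptions, and plotly chooses its own bin edges, so the figure may not match exactly.

```python
import pandas as pd

# Toy scores standing in for df['coded_score']; not real data.
scores = pd.Series([100, 100, 98.5, 97, 100, 85, 100, 92, 100, 79.9])

# Ten equal-width bins over an assumed 0-100 score range.
binned = pd.cut(scores, bins=range(0, 101, 10), include_lowest=True)

# normalize=True divides each bin count by the total number of records,
# so the shares sum to 1, as with a probability-normalized histogram.
share = binned.value_counts(normalize=True, sort=False)
print(share)
```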

Distance (in miles) from each record to the nearest pharmacy.

In [69]:
px.violin(df, y='nearest_pharmacy', points='suspectedoutliers')
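A numeric summary can complement the violin plot. The sketch below uses toy distances in place of `nearest_pharmacy` and flags outliers with the conventional 1.5 × IQR fences. That rule is an assumption chosen for illustration and is not necessarily the rule plotly applies for `suspectedoutliers`.

```python
import pandas as pd

# Toy distances in miles standing in for df['nearest_pharmacy']; not real data.
distances = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 5.0])

q1, q3 = distances.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences (an assumption, not plotly's exact rule).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = distances[(distances < lower) | (distances > upper)]

print(distances.describe())
print(outliers.tolist())
```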

Various LandUse categories and the counts of records in each category.

In [27]:
# Count records in each land-use category (major class, sub-class, name).
keys = ['major_name', 'sub_name', 'name']
groups = df.groupby(keys).count().reset_index()
# count() tallies non-null values per column; keep case_number as the record count.
groups = groups[keys + ['case_number']].rename(columns={'case_number': 'case_count'})
groups.set_index(keys)

major_name,sub_name,name,case_count
AGRICULTURE,AGRICULTURE,AGRICULTURE,38
Non-Parcel Areas,Non-Parcel Areas,NON-PARCEL AREAS,9063
Not Classifiable,Not Classifiable,Not Classifiable,4
Open Space,Golf Course,Golf Course,35
Open Space,Non-Public Open Space,Non-Public Open Space,1
Open Space,Primarily Conservation,Open Space - Primarily Conservation,58
Open Space,Primarily Recreation,Open Space - Primarily Recreation,213
Open Space,Trail or Greenway,Trail or Greenway,1
Urbanized,Commercial,Cultural/Entertainment,79
Urbanized,Commercial,Hotel/Motel,710
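The same kind of table can also be built with `size()`, which counts rows per group directly. This is a sketch on toy rows: the column names follow the notebook, but the values are illustrative only. Note that `size()` counts every row, whereas `count()` counts only non-null `case_number` values, so the two can differ if case numbers are missing.

```python
import pandas as pd

# Toy rows; column names follow the notebook, values are illustrative only.
toy = pd.DataFrame({
    "case_number": ["A1", "A2", "A3", "A4"],
    "major_name": ["Open Space", "Open Space", "Urbanized", "Urbanized"],
    "sub_name": ["Golf Course", "Golf Course", "Commercial", "Commercial"],
    "name": ["Golf Course", "Golf Course", "Hotel/Motel", "Hotel/Motel"],
})

keys = ["major_name", "sub_name", "name"]

# size() returns one row count per group; name it and keep the MultiIndex.
case_counts = toy.groupby(keys).size().rename("case_count").to_frame()
print(case_counts)
```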


A heatmap of our drug-related extractions. Examining the secondary cause in addition to the primary cause lets us extract roughly 3k more records.

In [72]:
drug_counts = df[['drug_related_primary', 'drug_related_secondary']]
px.density_heatmap(
    drug_counts, 
    x='drug_related_primary', 
    y='drug_related_secondary', 
    marginal_x='histogram',
    marginal_y='histogram',
    color_continuous_scale=px.colors.sequential.Blues,
)
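The cells of this heatmap can also be read as a 2×2 table. The sketch below uses toy boolean flags in place of the real columns, so the counts are not the dataset's. It shows how the records picked up only through the secondary cause could be counted.

```python
import pandas as pd

# Toy flags standing in for the two boolean columns; not real counts.
flags = pd.DataFrame({
    "drug_related_primary":   [True, True, False, False, False],
    "drug_related_secondary": [True, False, True, True, False],
})

# 2x2 table of primary vs. secondary flags, one cell per heatmap tile.
table = pd.crosstab(flags["drug_related_primary"], flags["drug_related_secondary"])

# Records flagged by the secondary cause but not the primary one.
secondary_only = (~flags["drug_related_primary"] & flags["drug_related_secondary"]).sum()

print(table)
print(secondary_only)
```

Run against `df`, the `secondary_only` count would be the figure to compare with the "roughly 3k" estimate.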