<IMG SRC="https://github.com/jacquesroy/byte-size-data-science/raw/master/images/Banner.png" ALT="BSDS Banner" WIDTH=1195 HEIGHT=200>

## Chicago Car Accident Data Analysis
In this notebook, we analyze the data using a Python environment.<br/>
We also use Pixiedust as the engine over Mapbox to display maps in the later part of the analysis.

In an additional section, we see how we could use additional data to add the city name to each record.

## Additional Information
The Chicago accident information includes three files:

<ul><li>Traffic_Crashes_-_Crashes.csv</li>
    <li>Traffic_Crashes_-_People.csv</li>
    <li>Traffic_Crashes_-_Vehicles.csv</li>
</ul>
We could add information coming from the people or vehicle files to our crashes data. This is beyond the scope of this example.

Useful data could include the number of people involved in each accident, the number of vehicles involved in each accident, and the types of vehicles involved in each accident.
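
As a rough sketch of how such per-accident counts could be added, the example below aggregates a tiny made-up people table and joins it to a made-up crashes table. It assumes the people file shares the `RD_NO` key with the crashes file, which is not verified here; the sample rows and the `PERSON_ID` and `people_count` names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the crashes and people files (made-up rows).
crashes = pd.DataFrame({"RD_NO": ["A1", "A2", "A3"]})
people = pd.DataFrame({
    "RD_NO": ["A1", "A1", "A2"],          # assumed join key
    "PERSON_ID": ["P1", "P2", "P3"],      # hypothetical column
})

# Count the people recorded for each crash.
people_count = people.groupby("RD_NO").size().rename("people_count").reset_index()

# A left join keeps crashes with no matching people rows; fill those with 0.
enriched = crashes.merge(people_count, on="RD_NO", how="left")
enriched["people_count"] = enriched["people_count"].fillna(0).astype(int)
print(enriched)
```

The same pattern would apply to a vehicles table, with one aggregated column per quantity of interest.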

### 018-Python Pandas Data Exploration
Execute the next cell if you want to see the `Byte Size Data Science` YouTube channel video.

In [None]:
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/AeeHapnLhyE?rel=0&amp;controls=0&amp;showinfo=0", width=560, height=315)


## Read the crash data
In this section, we read the data into a pandas DataFrame.

In [None]:
# PixieDust is an open source library that was contributed by IBM
!pip install --user --upgrade pixiedust

In [None]:
import pixiedust

In [None]:
import sys
import types
import pandas as pd
import urllib.request
import zipfile

url = 'https://github.com/jacquesroy/byte-size-data-science/raw/master/data/ChicagoTrafficCrashes20180917.csv.zip'
# The archive name comes from the url: "ChicagoTrafficCrashes20180917.csv.zip"
# The CSV file inside the archive has the same name without ".zip"
zip_name = url.rsplit('/', 1)[-1]
csv_name = zip_name.rsplit('.', 1)[0]

urllib.request.urlretrieve(url, zip_name)
with zipfile.ZipFile(zip_name) as compressed_file:
    with compressed_file.open(csv_name) as csv_file:
        collisions_pd = pd.read_csv(csv_file)

print("Number of records: {}".format(collisions_pd['RD_NO'].count()))
collisions_pd.head(1)

## Basic Statistics

In [None]:
# Display the DataFrame schema
collisions_pd.dtypes

In [None]:
# Convert the two datetime columns to the proper type
# Values that cannot be parsed become NaT (errors='coerce')
collisions_pd['CRASH_DATE'] = pd.to_datetime(collisions_pd['CRASH_DATE'], errors='coerce')
collisions_pd['DATE_POLICE_NOTIFIED'] = pd.to_datetime(collisions_pd['DATE_POLICE_NOTIFIED'], errors='coerce')

In [None]:
collisions_pd.dtypes

### Count of non-null values in each column

In [None]:
collisions_pd.count()

In [None]:
# Column statistics for numerical columns
# It is important to note that the datetime columns are not included.
collisions_pd.describe()

In [None]:
# Min/max statistics on the datetime columns
print("                      Min                Max")
print("CRASH_DATE           " + str(collisions_pd['CRASH_DATE'].min()) + " " + 
                                str(collisions_pd['CRASH_DATE'].max()) )
print("DATE_POLICE_NOTIFIED " + str(collisions_pd['DATE_POLICE_NOTIFIED'].min()) + " " + 
                                str(collisions_pd['DATE_POLICE_NOTIFIED'].max()) )
      

In [None]:
# Number of unique values in each column
collisions_pd.nunique()

From the previous outputs, we see that CRASH_DATE_EST_I is mostly null (92.5% of the time).<br/>
Several other columns have even lower counts of non-null values.<br/>
The POSTED_SPEED_LIMIT maximum is 99 and there are 35 distinct speed limits, which raises some questions.

So these statistics already provide a lot of information.
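
One way to obtain a null percentage like the one quoted above is to average the result of `isna()` per column. The sketch below uses a small made-up frame rather than the real data, so its numbers are illustrative only; in the notebook the same expression would be applied to `collisions_pd`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the notebook would use collisions_pd instead.
df = pd.DataFrame({
    "RD_NO": ["A1", "A2", "A3", "A4"],
    "CRASH_DATE_EST_I": [np.nan, "Y", np.nan, np.nan],
    "POSTED_SPEED_LIMIT": [30, 30, 99, 25],
})

# isna() marks missing cells; the mean of those booleans is the null fraction.
null_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(null_pct)
```

Sorting in descending order puts the sparsest columns first, which makes candidates for exclusion easy to spot.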

### Exploring further
We saw earlier that the minimum CRASH_DATE was 2014-01-21 and the minimum DATE_POLICE_NOTIFIED was 2015-07-25.<br/>
A gap of more than a year between the earliest crash and the earliest police notification indicates that there are probably some errors in the data.

The POSTED_SPEED_LIMIT value has a maximum of 99. This is suspicious.

There are more anomalies than these two.

In [None]:
# DATE_POLICE_NOTIFIED should always be greater than or equal to CRASH_DATE
# nlargest(n) returns n rows, but groupby('days') then merges rows that share the
# same number of days, so fewer than n lines are displayed
# (for example, 21 lines with n=30 and 11 lines with n=20)
import numpy as np
x = collisions_pd[collisions_pd['DATE_POLICE_NOTIFIED'].notna()][['DATE_POLICE_NOTIFIED', 'CRASH_DATE']].copy()
# Whole days between the crash and the police notification
x['days'] = (x.DATE_POLICE_NOTIFIED - x.CRASH_DATE).dt.days
# x.describe()
x[x['days'] > 0][['days']].nlargest(10, columns='days').groupby('days').size()

In [None]:
# Number of date differences larger than 30
x[x['days'] > 30]['days'].count()

## Posted Speed Limit
We saw earlier that there are 35 different speed limits. Let's see what they are and how often each one occurs.

We could do a similar analysis with other columns.
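
A similar analysis for other columns could be wrapped in a small helper, as sketched here on made-up data. The helper name `value_summary` is hypothetical, and `WEATHER_CONDITION` is only an assumed column name used for illustration.

```python
import pandas as pd

def value_summary(df, column):
    """Return the count of each distinct value in a column, nulls included."""
    return df[column].value_counts(dropna=False)

# Toy data with made-up values; WEATHER_CONDITION is an assumed column name.
toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 30, 25, 99],
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "CLEAR", "CLEAR"],
})

for col in ["POSTED_SPEED_LIMIT", "WEATHER_CONDITION"]:
    print(value_summary(toy, col))
```

Passing `dropna=False` keeps missing values in the tally, which helps when a column is sparsely populated.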

In [None]:
collisions_pd.groupby('POSTED_SPEED_LIMIT').size()

## Count accidents, accidents with injuries, accidents with fatalities

In [None]:
# Count only accidents that have longitude and latitude
has_location = collisions_pd['LONGITUDE'].notna() & collisions_pd['LATITUDE'].notna()
# Each comparison needs its own parentheses because & binds more tightly than >
print( "Number of accidents                : " + str(collisions_pd[has_location]['RD_NO'].count()) )
print( "Number of accidents with injuries  : " + str(collisions_pd[has_location \
                                                                   & (collisions_pd['INJURIES_TOTAL'] > 0)]['RD_NO'].count()) )
print( "Number of accidents with fatalities: " + str(collisions_pd[has_location \
                                                                   & (collisions_pd['INJURIES_FATAL'] > 0)]['RD_NO'].count()) )


## Additional stuff
<ul><li>Visualization: grouping by street/type of accident, beat/type of accident, month, day, week, hour</li>
    <li>Plot all accidents, plot by accident type</li>
    <li>Plot accidents and street with most accidents</li>
</ul>
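
As a minimal sketch of the time-based groupings listed above, the pandas `.dt` accessor can split a datetime column into calendar parts. The timestamps below are made up; in the notebook the same calls would apply to `collisions_pd['CRASH_DATE']` once it has been converted to a datetime type.

```python
import pandas as pd

# Toy crash timestamps (made-up values).
toy = pd.DataFrame({
    "CRASH_DATE": pd.to_datetime([
        "2018-01-05 08:15", "2018-01-05 17:40",
        "2018-01-06 08:55", "2018-02-10 23:05",
    ])
})

# The .dt accessor exposes the calendar parts of a datetime column.
by_hour = toy.groupby(toy["CRASH_DATE"].dt.hour).size()
by_month = toy.groupby(toy["CRASH_DATE"].dt.month).size()
by_weekday = toy.groupby(toy["CRASH_DATE"].dt.day_name()).size()

print(by_hour)
print(by_month)
print(by_weekday)
```

Each result is a Series indexed by the calendar part, so it can be passed directly to a bar plot.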

In [None]:
import pandas as pd
import numpy as np
# seaborn builds on matplotlib and adds graphical features and new plot types
import seaborn as sns

import matplotlib.pyplot as plt
# matplotlib.patches lets us create colored patches, which we can use for legends in plots
import matplotlib.patches as mpatches
# adjust settings
# The inline statement ensures that the plot will show in the cell output. Look at the documentation for more information
%matplotlib inline
sns.set_style("white")
plt.rcParams['figure.figsize'] = (15, 15)

### Grouping accidents
First by street for 3 categories:

- All accidents
- Accidents with injuries
- Accidents with fatalities

In [None]:
# Plot the top 15 streets by accident count
plt.figure(figsize=(8,5))
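# Keep the 15 streets with the most accidents, then sort ascending so that the bars
# and the labels set by plt.yticks below are in the same order
streets = collisions_pd.groupby('STREET_NAME')['RD_NO'].agg(['count']).sort_values('count', ascending=False).head(15) \
                       .sort_values('count', ascending=True).reset_index(drop=False)
colors = ['g','0.75','y','k','b','r']
streets['count'].plot.barh(color=colors)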
plt.xlabel('Collisions')
plt.ylabel('Street')
plt.title('Total Number of Collisions by Street', size=15)
plt.yticks(range(0,15),streets['STREET_NAME'])
plt.tight_layout()
plt.show()

In [None]:
#divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above
killed_pd = collisions_pd[collisions_pd['INJURIES_FATAL']>0]
injured_pd = collisions_pd[np.logical_and(collisions_pd['INJURIES_TOTAL']>0, collisions_pd['INJURIES_FATAL']==0)]
nothing_pd = collisions_pd[np.logical_and(collisions_pd['INJURIES_FATAL']==0, collisions_pd['INJURIES_TOTAL']==0)]

In [None]:
#create scatterplots
plt.figure(figsize=(15,10))
plt.scatter(collisions_pd.LONGITUDE, collisions_pd.LATITUDE, alpha=0.05, s=4, color='darkseagreen')

#adjust more settings
plt.title('Motor Vehicle Collisions in Chicago', size=25)
plt.xlim((-87.92,-87.52))
plt.ylim((41.64,42.03))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)

plt.show()

## Enhance the scatter plot to identify accident severity
We use the pandas DataFrames we created earlier to plot each severity level in a different color.

In [None]:
#adjust settings
plt.figure(figsize=(15,10))

#create scatterplots
plt.scatter(nothing_pd.LONGITUDE, nothing_pd.LATITUDE, alpha=0.04, s=1, color='blue')
plt.scatter(injured_pd.LONGITUDE, injured_pd.LATITUDE, alpha=0.1, s=1, color='yellow')
plt.scatter(killed_pd.LONGITUDE, killed_pd.LATITUDE, color='red', s=5)

#create legend
blue_patch = mpatches.Patch( label='car body damage', alpha=0.2, color='blue')
yellow_patch = mpatches.Patch(color='yellow', label='personal injury', alpha=0.5)
red_patch = mpatches.Patch(color='red', label='fatal accidents')
plt.legend([blue_patch, yellow_patch, red_patch],('car body damage', 'personal injury', 'fatal accidents'), 
           loc='upper left', prop={'size':20})

#adjust more settings
plt.title('Severity of Motor Vehicle Collisions in Chicago', size=20)
plt.xlim((-87.92,-87.52))
plt.ylim((41.64,42.03))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)
plt.savefig('anothertry.png')

plt.show()

## Using K-Means to find hot spots
We are using K-means to find the center of groupings of accidents.

The process is as follows:
<ul>
    <li>We extract the longitude and latitude of all accidents</li>
    <li>We create a model (for, arbitrarily, 10 clusters)</li>
    <li>We extract the centers and convert them to a pandas DataFrame</li>
    <li>We display the result on a map using PixieDust</li>
</ul>

In [None]:
# Create dataframes for all accidents, accidents with injuries and accidents with fatalities
data_pd = collisions_pd[collisions_pd['LONGITUDE'].notna() \
                        & collisions_pd['LATITUDE'].notna()][['INJURIES_TOTAL','INJURIES_FATAL','LONGITUDE','LATITUDE']]
data_injuries_pd = data_pd[data_pd['INJURIES_TOTAL'] > 0]
data_fatal_pd = data_pd[data_pd['INJURIES_FATAL'] > 0]

In [None]:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm

%matplotlib inline

In [None]:
# K Means Cluster
k=10
model = KMeans(n_clusters=k)
kmeans = model.fit(data_pd[['LONGITUDE','LATITUDE']])
vals=[0] * k
for i in kmeans.labels_ :
    vals[i] = vals[i] + 1

In [None]:
# Create a pandas DataFrame for display
d = {'longitude': kmeans.cluster_centers_[:,0], 'latitude': kmeans.cluster_centers_[:,1], 'total' : vals}
k_pd = pd.DataFrame(data=d)

In [None]:
display(k_pd)

## K-Means for accidents with fatalities

In [None]:
# K Means Cluster
k=10
model = KMeans(n_clusters=k)
kmeans = model.fit(data_fatal_pd[['LONGITUDE','LATITUDE']])
vals=[0] * k
for i in kmeans.labels_ :
    vals[i] = vals[i] + 1

In [None]:
# Create a pandas DataFrame for display
d2 = {'longitude': kmeans.cluster_centers_[:,0], 'latitude': kmeans.cluster_centers_[:,1], 'total' : vals}
k2_pd = pd.DataFrame(data=d2)

In [None]:
display(k2_pd)