# Data Visualization

Python has a seemingly endless number of tools for data visualization. Today, we're going to focus on Matplotlib and Seaborn.

> Matplotlib is a standard Python 2D plotting library (https://matplotlib.org/) <br> 
> Seaborn is a Python data visualization library built on top of Matplotlib (https://seaborn.pydata.org/)

We'll also delve into some geographic plotting using [Bokeh](https://bokeh.pydata.org/en/latest/index.html).

In [None]:
# rendering our plots inline (i.e., in our Jupyter notebook) and raising the figure resolution

%matplotlib inline 
%config InlineBackend.figure_format = 'retina' # allowing us to use highest possible resolution

In [None]:
# importing all of our libraries

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# setting some more styling

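sns.set(style="whitegrid", rc={'figure.figsize': (20, 20)}) # sns.set() resets the style, so we pass the style and figure size together
# controls the figure aesthetic; Matplotlib 3.6+ renamed these styles to 'seaborn-v0_8-talk' and 'seaborn-v0_8-ticks'
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks'])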

## Data

Today we are going to use the NYC Vehicle Collisions '[accidents.csv](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions-Crashes/h9gi-nx95)' dataset again. Remember that the curl command below takes a while, so I recommend uploading the CSV from Brightspace directly into your Colab environment instead.

In [None]:
# !curl 'https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD' -o accidents.csv

In [None]:
data = pd.read_csv("./accidents.csv",low_memory=False)

## Dtypes

As usual, we need to take a moment and convert some of our dtypes:

In [None]:
data.dtypes # let's check our data types

As we did previously, let's create our new 'DATETIME' column, and convert 'CRASH TIME' and 'CRASH DATE' to datetime format (the converted date goes into a new 'DATE' column).

In [None]:
data['DATETIME'] = data['CRASH DATE'] + ' ' + data['CRASH TIME'] # create a new field called 'datetime' that combines date and time
data['DATETIME'] = pd.to_datetime(data['DATETIME'], format="%m/%d/%Y %H:%M") # format this new column as a datetime

# https://docs.python.org/3/library/datetime.html

In [None]:
data['CRASH TIME'] = pd.to_datetime(data['CRASH TIME'], format="%H:%M")

In [None]:
data['DATE'] = pd.to_datetime(data['CRASH DATE'], format="%m/%d/%Y")

In [None]:
data.head()
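To see what these conversions do on a small scale, here is a minimal sketch on a made-up two-row frame. The `toy` name and its values are illustrative assumptions, not rows from the dataset. It combines the date and time strings, parses them with an explicit format, and then uses the `.dt` accessor to pull out components.

```python
import pandas as pd

# hypothetical two-row frame that mimics the raw string columns
toy = pd.DataFrame({
    'CRASH DATE': ['07/04/2019', '12/31/2020'],
    'CRASH TIME': ['09:05', '23:40'],
})

# concatenate the strings, then parse them with an explicit format
toy['DATETIME'] = pd.to_datetime(
    toy['CRASH DATE'] + ' ' + toy['CRASH TIME'], format="%m/%d/%Y %H:%M")

# once the column is a datetime, the .dt accessor exposes its components
toy['HOUR'] = toy['DATETIME'].dt.hour
toy['MONTH'] = toy['DATETIME'].dt.month

print(toy.dtypes)
print(toy[['DATETIME', 'HOUR', 'MONTH']])
```

Derived columns like `HOUR` are handy later for grouping collisions by time of day.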

---

# ⭕ **QUESTIONS?**

---

## Feature Creation

We also want to create two new columns: 'INJURY', which is True if there was at least one injury in an accident, and 'DEATH', which is True if there was at least one death.

In [None]:
# we'll also create two new columns, 'injury' and 'death' 

data['INJURY'] = (data['NUMBER OF PERSONS INJURED']>0) # true if there's at least one injury, false if otherwise
data['DEATH'] = (data['NUMBER OF PERSONS KILLED']>0) # true if there's at least one death, false if otherwise
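Boolean columns like these are convenient because pandas treats True as 1 and False as 0. The sketch below uses a small made-up frame (the `toy` values are assumptions for illustration) to show that `sum()` counts the flagged rows and `mean()` gives their share.

```python
import pandas as pd

# hypothetical counts for five collisions
toy = pd.DataFrame({
    'NUMBER OF PERSONS INJURED': [0, 2, 0, 1, 0],
    'NUMBER OF PERSONS KILLED': [0, 0, 0, 1, 0],
})

toy['INJURY'] = toy['NUMBER OF PERSONS INJURED'] > 0
toy['DEATH'] = toy['NUMBER OF PERSONS KILLED'] > 0

injury_count = toy['INJURY'].sum()   # True counts as 1
injury_share = toy['INJURY'].mean()  # fraction of rows that are True
print(injury_count, injury_share)
```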

## Overplotting

As you can see, when we try to plot our Lat/Long there is clearly an issue... we seem to have overplotted.

In [None]:
data.plot(kind='scatter', x='LONGITUDE', y='LATITUDE')

To solve this, we can create a mask that restricts the Lat/Long data to what Google tells us are the bounds of NYC.

In [None]:
clean_mask = (data.LATITUDE > 40) & (data.LATITUDE < 41) & (data.LONGITUDE < -72) & (data.LONGITUDE > -74.5)
cleandf = data[clean_mask]
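Here is a small sketch of how such a mask behaves, using made-up coordinates (the `toy` frame is an assumption for illustration). Each comparison produces a boolean Series, `&` combines them element by element, and any comparison against a missing value is False, so rows without coordinates are filtered out as well.

```python
import numpy as np
import pandas as pd

# hypothetical coordinates: two plausible NYC points, one (0, 0) placeholder, one missing
toy = pd.DataFrame({
    'LATITUDE': [40.71, 40.65, 0.0, np.nan],
    'LONGITUDE': [-74.00, -73.95, 0.0, np.nan],
})

mask = (toy.LATITUDE > 40) & (toy.LATITUDE < 41) & (toy.LONGITUDE < -72) & (toy.LONGITUDE > -74.5)
kept = toy[mask]

print(mask.tolist())
print(f"kept {len(kept)} of {len(toy)} rows")
```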

In [None]:
cleandf.plot(kind='scatter', x='LONGITUDE', y='LATITUDE')

This is definitely better. Let's try increasing the figure size, too.

In [None]:
cleandf.plot(kind='scatter', x='LONGITUDE', y='LATITUDE', figsize=(20, 15))

## Addressing Overplotting

Other than using our mask and increasing the figure size, there are a few other ways to address overplotting: 

## `sampling` 

We can specify how many points we want to plot by passing either an integer (`n`) or a fraction (`frac`).

In [None]:
sample = cleandf.sample(n=10000) # keep 10,000 data points

sample.plot(kind='scatter', x='LONGITUDE', y='LATITUDE', figsize=(20, 15))

In [None]:
sample = cleandf.sample(frac=0.01) # keep 1% of the dataset

sample.plot(kind='scatter', x='LONGITUDE', y='LATITUDE', figsize=(20, 15))
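Because `sample` draws rows at random, each run of the cells above plots a different subset. A minimal sketch, assuming a made-up 1,000-row frame in place of `cleandf`, shows both sampling styles and how `random_state` makes the draw repeatable.

```python
import pandas as pd

toy = pd.DataFrame({'value': range(1000)})  # hypothetical stand-in for cleandf

by_count = toy.sample(n=100, random_state=42)         # exactly 100 rows
by_fraction = toy.sample(frac=0.01, random_state=42)  # 1% of 1,000 rows

# the same seed returns the same rows on every run
again = toy.sample(n=100, random_state=42)
print(len(by_count), len(by_fraction), by_count.index.equals(again.index))
```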

## `marker size`

In [None]:
cleandf.plot(kind='scatter', x='LONGITUDE', y='LATITUDE', figsize=(20, 15), s=0.5) # altering the marker size

## `marker transparency`

In [None]:
cleandf.plot(
    kind='scatter',
    x='LONGITUDE',
    y='LATITUDE',
    figsize=(20, 15),
    s=0.5, 
    alpha=0.05) # altering the marker transparency
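A rough way to see why a low `alpha` reveals density: assuming the renderer uses standard "over" alpha compositing and all markers share one color, k overlapping markers with transparency alpha combine to an opacity of 1 - (1 - alpha)^k. The helper name `stacked_opacity` is made up for this illustration.

```python
def stacked_opacity(alpha, k):
    """Opacity of k overlapping markers, each drawn with the given alpha."""
    return 1 - (1 - alpha) ** k

# how dark a spot gets as more points pile up on it
for k in (1, 14, 50, 100):
    print(k, round(stacked_opacity(0.05, k), 3))
```

Under these assumptions, with `alpha=0.05` it takes about 14 overlapping points to pass 50% opacity, so only genuinely crowded areas appear dark.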

---

# ⭕ **QUESTIONS?**

---

## Histograms, Density Plots, and Contour Plots

A hexbin (hexagonal bin) plot is a 2D histogram: the color of each hexagon signals the number of points that fall inside it. The `gridsize` parameter sets the number of hexagons along the x-axis, so a larger value gives smaller bins.

In [None]:
cleandf.plot(
    kind='hexbin',
    x='LONGITUDE',
    y='LATITUDE',
    gridsize=100,
    cmap=plt.cm.Blues,
    figsize=(15, 12))
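The counting behind a 2D histogram can be sketched with NumPy. This is an analogy rather than what `hexbin` does internally: it uses rectangular cells instead of hexagons, and the coordinates are synthetic values made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical coordinates clustered around a single point
lon = rng.normal(-73.95, 0.05, size=5000)
lat = rng.normal(40.72, 0.05, size=5000)

# count the points that fall in each cell of a 20 x 20 grid
counts, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=20)

print(counts.shape)       # one count per cell
print(int(counts.sum()))  # every point lands in exactly one cell
print(int(counts.max()))  # the densest cell
```

A hexbin plot colors each cell by a count like the ones in `counts`.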

## Density Plots

In [None]:
plt.subplots(figsize=(20, 15))

sample = cleandf.sample(10000) # take a sample because density plots take a while to compute

# keyword names below follow seaborn 0.11+ (formerly shade / shade_lowest / n_levels)
sns.kdeplot(
    x=sample.LONGITUDE,
    y=sample.LATITUDE,
    gridsize=100,  # controls the resolution
    cmap=plt.cm.rainbow,  # color scheme
    fill=True,  # filled density plot (True), or just the contours (False)
    alpha=0.5,
    thresh=0.05,  # leave the lowest-density region unfilled
    levels=50  # how many contours/levels to have
)

## Contour Plots

In [None]:
plt.subplots(figsize=(20, 15))

sample = cleandf.sample(10000)

# keyword names below follow seaborn 0.11+ (formerly shade / n_levels)
sns.kdeplot(
    x=sample.LONGITUDE,
    y=sample.LATITUDE,
    gridsize=100,
    cmap=plt.cm.rainbow,
    fill=False,  # contours only
    levels=25)
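For intuition about what `kdeplot` draws, here is a hand-rolled sketch of a two-dimensional Gaussian kernel density estimate in NumPy. It is not seaborn's implementation: the fixed bandwidth `h`, the synthetic points, and the grid bounds are all assumptions chosen for illustration. Each data point contributes a small Gaussian bump, and the density at a grid location is the average of those bumps.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(0.0, 1.0, size=(200, 2))  # hypothetical (x, y) sample
h = 0.3                                        # assumed fixed bandwidth

# evaluation grid, similar in spirit to the gridsize parameter
axis = np.linspace(-6, 6, 121)
gx, gy = np.meshgrid(axis, axis)

# squared distance from every grid location to every data point
d2 = (gx[..., None] - points[:, 0]) ** 2 + (gy[..., None] - points[:, 1]) ** 2

# average of isotropic Gaussian kernels centered on the data points
density = np.exp(-d2 / (2 * h ** 2)).mean(axis=-1) / (2 * np.pi * h ** 2)

# a density should integrate to roughly 1 over the grid
step = axis[1] - axis[0]
total = density.sum() * step ** 2
print(density.shape, round(total, 3))
```

The contour lines in the plots above connect grid locations of equal estimated density.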

## Combining plots

We can combine multiple plots using the `ax` parameter. An `ax` is a Matplotlib Axes object, i.e., a single plot area; plotting calls that receive the same `ax` draw on top of each other.

In [None]:
# imagine we want to combine the scatter plot with the contour plot above...

sample = cleandf.sample(10000)

scatterplot = cleandf.plot( # we're defining our scatterplot...
    kind='scatter',
    x='LONGITUDE',
    y='LATITUDE',
    figsize=(20, 15),
    s=0.5,
    alpha=0.1)

# ...and a KDE plot, drawn on top of the scatterplot with ax=scatterplot
# keyword names below follow seaborn 0.11+ (formerly shade / n_levels)
sns.kdeplot(
    x=sample.LONGITUDE,
    y=sample.LATITUDE,
    gridsize=100,
    cmap=plt.cm.rainbow,
    fill=False,
    levels=20,
    alpha=1,
    ax=scatterplot)
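The same layering idea works with any plotting call that accepts `ax`. A minimal sketch with synthetic data (the `toy` frame is an assumption for illustration): we create the Axes ourselves, let pandas draw onto it, and then add a second layer with plain Matplotlib.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
toy = pd.DataFrame({'x': rng.normal(size=200), 'y': rng.normal(size=200)})  # hypothetical data

fig, ax = plt.subplots(figsize=(6, 4))

# first layer: pandas draws the scatter onto the Axes we pass in
returned = toy.plot(kind='scatter', x='x', y='y', s=5, alpha=0.3, ax=ax)

# second layer: Matplotlib draws a horizontal reference line on the same Axes
ax.axhline(toy['y'].mean(), color='red')

print(returned is ax)                      # pandas hands back the Axes it drew on
print(len(ax.collections), len(ax.lines))  # one scatter layer, one line layer
```

Creating the Axes up front with `plt.subplots` also gives one place to set the figure size for every layer.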

## Adding Geographic Boundaries using Bokeh

In [None]:
cleandf = cleandf.dropna(subset=["LATITUDE", "LONGITUDE"])
# drop any rows where LATITUDE or LONGITUDE is missing (reassigning avoids a SettingWithCopyWarning on the filtered frame)

We'll create a truncated version of our dataset that only has certain columns...

In [None]:
lat_long = cleandf[["LATITUDE","LONGITUDE","CRASH DATE","CRASH TIME","BOROUGH","VEHICLE TYPE CODE 1"]]

In [None]:
lat_long.head()

In [None]:
test = lat_long[:100]

In [None]:
test

For Bokeh, we'll then cast these columns as lists...

In [None]:
lat_list = list(test['LATITUDE'])
lon_list = list(test['LONGITUDE'])

date_list = list(test['CRASH DATE'])
time_list = list(test['CRASH TIME'])
borough_list = list(test['BOROUGH'])
vehicle_list = list(test['VEHICLE TYPE CODE 1'])
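As an alternative sketch, pandas can produce all of the lists in one step with `to_dict(orient='list')`. The `toy` frame and the short key names below are assumptions for illustration; the result is a dictionary of lists, the same shape as one assembled by hand from the individual list variables.

```python
import pandas as pd

# hypothetical rows standing in for the first few collisions
toy = pd.DataFrame({
    'LATITUDE': [40.71, 40.65],
    'LONGITUDE': [-74.00, -73.95],
    'BOROUGH': ['MANHATTAN', 'BROOKLYN'],
})

# map each source column to the shorter key we want to use in the plot
renames = {'LATITUDE': 'lat', 'LONGITUDE': 'lon', 'BOROUGH': 'borough'}
columns_as_lists = toy.rename(columns=renames).to_dict(orient='list')

print(columns_as_lists)
```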

Note: If you want to avoid the "For Dev Purposes Only" message on the following map, go [here](https://developers.google.com/maps/get-started) and follow the instructions to set up a Google API account.

In [None]:
# https://docs.bokeh.org/en/latest/

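import bokeh.io

from bokeh.io import output_file, show, output_notebook
# explicit imports instead of a wildcard, so it is clear where each name comes from
from bokeh.models import (
    BoxSelectTool, BoxZoomTool, Circle, ColumnDataSource, GMapOptions,
    GMapPlot, HoverTool, PanTool, Range1d, WheelZoomTool)

bokeh.io.output_notebook() # render Bokeh plots inline in the notebook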


map_options = GMapOptions(lat=40.7128, lng=-74.0060, map_type="roadmap", zoom=11)

plot = GMapPlot(x_range=Range1d(), y_range=Range1d(), map_options=map_options,api_key = "AIzaSyDmyE8tAty-Lhd-rJQvIsGk8ocOIdHwYSE")

source = ColumnDataSource(
    data = dict(
        lat=lat_list,
        lon=lon_list,
        date = date_list,
        time = time_list,
        borough = borough_list, 
        vehicle = vehicle_list
    ))

circle = Circle(x="lon", y="lat", size=15, fill_color="blue", fill_alpha=0.8, line_color=None)
plot.add_glyph(source, circle)

plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool(), BoxZoomTool())

plot.title.text="NYC Accidents"

plot.add_tools(HoverTool(
    tooltips=[
        ( 'date',   '@date' ),
        ( 'time',  '@time{%H:%M}' ), # 'CRASH TIME' is a datetime, so we give it an explicit format
        ( 'borough', '@borough' ), 
        ( 'vehicle', '@vehicle' )
    ],

    # formatters only apply to fields that carry a {format} in the tooltip;
    # Bokeh 2.0+ expects the '@' prefix in the key
    formatters={
        '@time' : 'datetime' # use the 'datetime' formatter for the 'time' field
    },

    mode='vline'
))

#output_file("gmap_plot.html")

bokeh.io.show(plot)

---

# ⭕ **QUESTIONS?**

---

# Example: Analyzing Citibike Station Activity using Pandas

We are going to download `201306-citibike-tripdata.csv` from [this AWS S3 bucket](https://s3.amazonaws.com/tripdata/index.html).

In [None]:
df = pd.read_csv("./201306-citibike-tripdata.csv",encoding="UTF-8")

In [None]:
len(df)

In [None]:
df.head()

---

## Examining Time Series per Station

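Let's create a pivot table to examine trip duration over time. Since the call below has no `columns` argument, it averages `tripduration` across all stations for each start time.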

In [None]:
df['starttime'] = pd.to_datetime(df['starttime'], format="%Y-%m-%d %H:%M:%S")

df['tripduration'] = df['tripduration'].astype(int)
# astype(int) allows you to cast an entire column, whereas int(x) only works for scalar values

In [None]:
station_timeseries = df.pivot_table(
                        index='starttime', 
                        values='tripduration', 
                        aggfunc='mean'
                    ).ffill() # forward fill: replaces NaNs with the last existing value (same as the deprecated interpolate(method='pad'))

station_timeseries.head(5)

Then we plot that over time.

In [None]:
%matplotlib inline

station_timeseries.plot(alpha=.5, figsize=(18, 9))
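To get one series per station, a pivot table needs a `columns` argument. The sketch below uses four made-up trips (the `trips` frame and its values are assumptions for illustration) to show how gaps appear when a station has no trip at a given start time, and how a forward fill closes them.

```python
import pandas as pd

# hypothetical trips from two stations
trips = pd.DataFrame({
    'starttime': pd.to_datetime([
        '2013-06-01 00:00:00', '2013-06-01 00:00:00',
        '2013-06-01 00:05:00', '2013-06-01 00:10:00']),
    'start station id': [161, 375, 161, 375],
    'tripduration': [600, 900, 300, 1200],
})

per_station = trips.pivot_table(
    index='starttime',
    columns='start station id',  # one column per station
    values='tripduration',
    aggfunc='mean')

filled = per_station.ffill()  # carry the last known value forward into the gaps
print(per_station)
print(filled)
```

Plotting `filled` would draw one line per station column.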

---

# Exercise 2:

Let's limit our plot to just two stations:
* Station at "Mercer St & Bleecker St"
* Station at "LaGuardia Pl & W 3 St"

These two stations are close to each other and tend to exhibit similar behavior. Remember that the list of stations is [available as JSON](https://feeds.citibikenyc.com/stations/stations.json).

In [None]:
# your code here

# Solution

In [None]:
df[df['start station name'].str.contains("Mercer") & df['start station name'].str.contains("Bleecker")].head()
# str.contains() tests whether a pattern or regex is contained in each string of a Series or Index

In [None]:
df[df['start station name'].str.contains("LaGuardia") ].head()

In [None]:
station_ids = [161,375]

mercer_lga_df = df[df['start station id'].isin(station_ids)]
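The two filtering tools used in this solution can be tried on a tiny made-up table. The `stations` frame below, including its id and name pairings, is an assumption for illustration. `str.contains` matches on text, while `isin` matches on membership in a list.

```python
import pandas as pd

# hypothetical station table; 999 is a made-up id
stations = pd.DataFrame({
    'start station id': [161, 375, 999],
    'start station name': [
        'LaGuardia Pl & W 3 St', 'Mercer St & Bleecker St', 'Broadway & W 3 St'],
})

names = stations['start station name']

# str.contains returns a boolean Series; & requires both substrings to appear
mercer = names.str.contains("Mercer") & names.str.contains("Bleecker")

# regex=False treats the pattern as a literal string rather than a regex
w3 = names.str.contains("W 3 St", regex=False)

# isin keeps rows whose value appears in the list
picked = stations[stations['start station id'].isin([161, 375])]

print(mercer.tolist(), w3.tolist(), picked['start station name'].tolist())
```

Filtering on ids with `isin` is less fragile than matching on names, since a substring can match more stations than intended.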

In [None]:
mercer_lga_df

In [None]:
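station_timeseries = mercer_lga_df.pivot_table(
                        index='starttime', 
                        columns='start station id', # one series per station, so the plot shows a line for each
                        values='tripduration', 
                        aggfunc='mean'
                    ).ffill() # forward fill: replaces NaNs with the last existing value (same as the deprecated interpolate(method='pad'))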

station_timeseries.head(5)

In [None]:
%matplotlib inline

station_timeseries.plot(alpha=.5, figsize=(18, 9))

----