# Data Visualization

---

There is a seemingly endless number of tools for data visualization in Python. For today, we're going to focus on Matplotlib and Seaborn.

> Matplotlib is a standard 2D plotting library for Python (https://matplotlib.org/) <br> 
> Seaborn is a Python data visualization library built atop Matplotlib (https://seaborn.pydata.org/)

We'll also delve into some geographic plotting using geopandas and [bokeh](https://bokeh.pydata.org/en/latest/index.html).

---

In [None]:
# rendering our plots inline (aka, in our Jupyter notebook) and changing the layout a bit

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# importing all of our libraries

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# setting some more styling

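# sns.set() resets the style to seaborn's defaults, so pass the style and the figure size together
sns.set(style="whitegrid", rc={'figure.figsize': (20, 20)})
# on matplotlib 3.6+ these styles are named 'seaborn-v0_8-talk' and 'seaborn-v0_8-ticks'
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks'])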

## Data

We'll begin with our [NYPD Crashes CSV](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions-Crashes/h9gi-nx95). Each row in the CSV represents a crash event with associated details.

In [None]:
# DATA

In [None]:
# DTYPES

## Dtypes

As usual, we need to take a moment and convert some of our dtypes:

In [None]:
# DATETIME COLUMN

In [None]:
# TIME

In [None]:
# DATE

## Feature Creation

We also want to create two new columns: 'Injury', which is true if there was at least one injury in an accident, and 'Death', which is true if there was at least one death in an accident.

In [None]:
# FEATURE CREATION
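
A minimal sketch of this step on a toy frame. The column names `NUMBER OF PERSONS INJURED` and `NUMBER OF PERSONS KILLED` are assumptions about the CSV headers, and the values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the crash data; column names are assumed, values are invented
toy = pd.DataFrame({
    "NUMBER OF PERSONS INJURED": [0, 2, 0, 1],
    "NUMBER OF PERSONS KILLED": [0, 0, 1, 0],
})

# A comparison returns a boolean Series: True when the count is at least one
toy["Injury"] = toy["NUMBER OF PERSONS INJURED"] > 0
toy["Death"] = toy["NUMBER OF PERSONS KILLED"] > 0
```

Boolean columns like these can later be used directly as masks, for example `toy[toy["Injury"]]`.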

## Overplotting

In [None]:
# SCATTER

In [None]:
# MASK

In [None]:
# PLOT MASK

In [None]:
# PLOT MASK WITH NEW FIGSIZE

## Addressing Overplotting

## `sampling` 

We can specify how many points we want to plot by passing either an integer (the number of rows) or a fraction (the share of rows) to `sample`.

In [None]:
# SAMPLE INT

In [None]:
# SAMPLE FRAC
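
As a rough illustration of the two options, here is a sketch on a toy frame (the data and the `random_state` seed are placeholders, not part of the crash dataset):

```python
import pandas as pd

# Toy frame with 1,000 rows standing in for the crash coordinates
toy = pd.DataFrame({"LONGITUDE": range(1000), "LATITUDE": range(1000)})

# n= takes an absolute number of rows
by_count = toy.sample(n=100, random_state=0)

# frac= takes a fraction of the rows (here 10%)
by_fraction = toy.sample(frac=0.1, random_state=0)
```

Both calls return 100 rows here; `frac` is the more convenient choice when the size of the frame may change.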

## `marker size`

In [None]:
# MARKER SIZE

## `marker transparency`

In [None]:
# MARKER TRANSPARENCY

---

## Histograms, Density Plots, and Contour Plots

The hexbin (Hexagonal Bin Plot) creates a 2-d histogram, where the color signals the number of points within a particular area. The `gridsize` parameter sets the number of hexagons along the x-axis, so a larger value gives smaller bins.

In [None]:
# HEXBIN
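
To see the effect of `gridsize`, here is a sketch that uses Matplotlib's `hexbin` directly on synthetic points. The coordinates are randomly generated stand-ins, not the crash data, and the `Agg` backend is only selected so the snippet also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic points standing in for crash coordinates
lon = rng.normal(-73.95, 0.05, 5000)
lat = rng.normal(40.72, 0.05, 5000)

fig, (coarse_ax, fine_ax) = plt.subplots(1, 2, figsize=(12, 5))
# Few, large hexagons
coarse = coarse_ax.hexbin(lon, lat, gridsize=10, cmap="viridis")
# Many, small hexagons
fine = fine_ax.hexbin(lon, lat, gridsize=50, cmap="viridis")
fig.colorbar(fine, ax=fine_ax, label="points per hexagon")
```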

## Density Plots

In [None]:
plt.subplots(figsize=(20, 15))

sample = cleandf.sample(10000)  # take a sample because density plots take a while to compute

# note: seaborn 0.11+ expects x=/y= keywords and renames shade, shade_lowest, n_levels
# to fill, thresh, levels; the call below uses the older API
sns.kdeplot(
    sample.LONGITUDE,
    sample.LATITUDE,
    gridsize=100,  # controls the resolution
    cmap=plt.cm.rainbow,  # color scheme
    shade=True,  # whether to have a density plot (True), or just the contours (False)
    alpha=0.5,
    shade_lowest=False,
    n_levels=50  # how many contours/levels to have
)

## Contour Plots

In [None]:
plt.subplots(figsize=(20, 15))

sample = cleandf.sample(10000)

sns.kdeplot(
    sample.LONGITUDE,
    sample.LATITUDE,
    gridsize=100,
    cmap=plt.cm.rainbow,
    shade=False,
    shade_lowest=False,
    n_levels=25)

## Combining plots

We can draw multiple plots on the same axes by passing the same `ax` object to each plotting call (think of an `ax` as an individual plot).

In [None]:
# COMBINE 
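
A small sketch of the idea, using a toy frame rather than the crash data. The two plot kinds are chosen only for illustration, and the `Agg` backend is selected so the snippet also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy coordinates, invented for illustration
toy = pd.DataFrame({
    "LONGITUDE": [-74.0, -73.9, -73.8],
    "LATITUDE": [40.6, 40.7, 40.8],
})

# Create one figure with one axes, then hand that axes to both plotting calls
fig, ax = plt.subplots(figsize=(6, 6))
toy.plot(kind="scatter", x="LONGITUDE", y="LATITUDE", s=40, alpha=0.5, ax=ax)
toy.plot(kind="line", x="LONGITUDE", y="LATITUDE", color="red", legend=False, ax=ax)
```

Because both calls receive the same `ax`, the line is drawn on top of the scatter instead of in a new figure.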

## Adding Geographic Boundaries using Bokeh

In [None]:
# READ

In [None]:
# DROP NA

In [None]:
# LAT LONG

In [None]:
# TEST

In [None]:
# LISTS

In [None]:
# https://docs.bokeh.org/en/latest/

from bokeh.io import output_file, show
from bokeh.models import (
    BoxSelectTool, BoxZoomTool, Circle, ColumnDataSource, GMapOptions,
    GMapPlot, HoverTool, PanTool, Range1d, WheelZoomTool,
)

map_options = GMapOptions(lat=40.7128, lng=-74.0060, map_type="roadmap", zoom=11)

plot = GMapPlot(x_range=Range1d(), y_range=Range1d(), map_options=map_options, api_key="{KEY HERE}")

source = ColumnDataSource(
    data = dict(
        lat=lat_list,
        lon=lon_list,
        date = date_list,
        time = time_list,
        borough = borough_list, 
        vehicle = vehicle_list
    ))

circle = Circle(x="lon", y="lat", size=15, fill_color="blue", fill_alpha=0.8, line_color=None)
plot.add_glyph(source, circle)

plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool(), BoxZoomTool())

plot.title.text="NYC Accidents"

plot.add_tools(HoverTool(
    tooltips=[
        ( 'date',   '@date' ),
        ( 'time',  '@time' ), 
        ( 'borough', '@borough' ), 
        ( 'vehicle', '@vehicle' )
    ],

    formatters={
        'date' : 'datetime', # use 'datetime' formatter for 'date' field
        'time' : 'printf',
        'borough' : 'numeral',
        'vehicle' : 'numeral'
    },

    mode='vline'
))

# output_file("gmap_plot.html")

show(plot)

---

# Example: Analyzing Citibike Station Activity using Pandas

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import matplotlib 
import matplotlib.pyplot as plt
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks', 'seaborn-whitegrid'])

First, let's fetch our data as we did in week 1:

In [None]:
# CON

Unlike in Week 1, though, we are using a script that runs continuously via a crontab (shown below), so our database is continually populated with recent data. 

The .py script is called citibike_cron_script.py and can be found in the Class 7 folder of the course repo. 

> The crontab used is: 
>> `*/1 * * * * /Users/siegmanA/anaconda3/bin/python $(which python3) ~/Desktop/NYU-Projects-in-Programming-Fall-2019/\(Class\ 7\)\ Data\ Visualization/citibike_cron_script.py >> ~/Desktop/tmp/citiCron.log 2>&1`

Now we want to create a query that gets us the average capacity of a given station in hourly intervals.

In [None]:
# CHECK

In [None]:
df = pd.read_sql("""SELECT station_id,
                    stationName,
                    availableBikes, 
                    availableDocks,
                    totalDocks,
                    latitude, 
                    longitude,
                    lastCommunicationTime
                FROM StationsData""", con=con)

# %I (12-hour clock) is needed for %p (AM/PM) to take effect; with %H the AM/PM marker is ignored
df['lastCommunicationTime'] = pd.to_datetime(df['lastCommunicationTime'], format='%Y-%m-%d %I:%M:%S %p')

df.head()
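
The hourly averaging mentioned earlier can also be done in pandas once the rows are loaded. The sketch below is one possible approach on a toy frame; it assumes "capacity" means the `availableBikes` column, and the values are invented:

```python
import pandas as pd

# Toy stand-in for the rows returned by the query
toy = pd.DataFrame({
    "station_id": [161, 161, 161, 3260],
    "availableBikes": [10, 20, 6, 4],
    "lastCommunicationTime": pd.to_datetime([
        "2019-10-10 06:05:00", "2019-10-10 06:45:00",
        "2019-10-10 07:10:00", "2019-10-10 06:30:00",
    ]),
})

# Round each timestamp down to the hour, then average per station and hour
hourly = (
    toy.assign(hour=toy["lastCommunicationTime"].dt.floor("h"))
       .groupby(["station_id", "hour"])["availableBikes"]
       .mean()
)
```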

---

# Exercise 1: 

Create a new column in our df called 'percent_full' that tells us how full a bike station is at a given time.

In [None]:
# your code here

---

## Examining Time Series per Station

Let's create a pivot table to examine the time series for individual stations.

In [None]:
# TIMESERIES
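
A sketch of what such a pivot table could look like, using a toy frame with the column names from this notebook (`percent_full` comes from Exercise 1; the values are invented):

```python
import pandas as pd

# Toy stand-in for df: two stations observed at two timestamps
toy = pd.DataFrame({
    "station_id": [161, 3260, 161, 3260],
    "lastCommunicationTime": pd.to_datetime([
        "2019-10-10 06:00:00", "2019-10-10 06:00:00",
        "2019-10-10 06:01:00", "2019-10-10 06:01:00",
    ]),
    "percent_full": [0.50, 0.40, 0.55, 0.35],
})

# One row per timestamp, one column per station
station_timeseries = toy.pivot_table(
    index="lastCommunicationTime",
    columns="station_id",
    values="percent_full",
    aggfunc="mean",
)
```

Each column of the result is the time series of one station, which is the shape that `plot()` and `corr()` expect later on.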

It looks like there's an erroneous entry where we have a last communication time from 1969. Let's get rid of that. 

In [None]:
# DROP 1969
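
One way to drop such an entry is a boolean mask on the timestamp column. This is a sketch on toy rows; the cutoff date is an arbitrary assumption chosen to fall well before the collection period:

```python
import pandas as pd

# Toy rows, including one bogus timestamp from 1969
toy = pd.DataFrame({
    "station_id": [161, 161, 3260],
    "lastCommunicationTime": pd.to_datetime([
        "1969-12-31 19:00:00", "2019-10-10 06:00:00", "2019-10-10 06:00:00",
    ]),
})

# Keep only rows on or after the assumed cutoff
cutoff = pd.Timestamp("2019-01-01")
toy = toy[toy["lastCommunicationTime"] >= cutoff]
```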

In [None]:
# TIMESERIES

Then we plot that over time.

In [None]:
# PLOT

Let's limit our plot to just two stations:
* Station 3260 at "Mercer St & Bleecker St"
* Station 161 at "LaGuardia Pl & W 3 St"

which are nearby and tend to exhibit similar behavior. Remember that the list of stations is [available as a JSON](https://feeds.citibikenyc.com/stations/stations.json) 

In [None]:
# MERCER

In [None]:
# LAGUARDIA

In [None]:
# TIMESERIES

---

# Exercise 2:

Plot a time series graph for stations 3260 and 161 only.

In [None]:
# your code here

---

## Finding Bike Stations with Similar Behavior

For our next analysis, we are going to try to find bike stations that have similar behaviors over time. A very simple technique that we can use to find similar time series is to treat the time series as vectors, and compute their correlation. Pandas provides the `corr` function that can be used to calculate the correlation of columns. (If we want to compute the correlation of rows, we can just take the transpose of the dataframe using the `transpose()` function, and compute the correlations there.)

In [None]:
# PEARSON
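
To make the idea concrete, here is a sketch with three invented station columns, where the outcome is easy to predict: one pair moves together and one pair moves in opposite directions.

```python
import pandas as pd

# Toy time series: each column is one station (values invented for illustration)
station_timeseries = pd.DataFrame({
    161: [0.2, 0.4, 0.6, 0.8],
    3260: [0.1, 0.2, 0.3, 0.4],  # rises together with 161
    393: [0.8, 0.6, 0.4, 0.2],   # falls while 161 rises
})

# Pairwise Pearson correlation between columns
similarities = station_timeseries.corr(method="pearson")
```

Here the correlation between 161 and 3260 is 1, and between 161 and 393 it is -1, up to floating-point rounding.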

Let's see the similarities of the two stations that we examined above.

In [None]:
# SIMILARITIES

In [None]:
# 393: E 5 St & Avenue C
# 2003: 1 Ave & E 18 St

# ...

For bookkeeping purposes, we are going to drop stations that generate NaN values, as we cannot use such entries for our analysis.

In [None]:
# number of stations with non-NaN similarity per station

check = similarities.count()

# find the stations with fewer than the max number of similarities

todrop = check[check < check.max()].index.values
similarities.drop(todrop, axis='index', inplace=True)
similarities.drop(todrop, axis='columns', inplace=True)

### Clustering Based on Distances

Without explaining too much about clustering, we are going to use a clustering technique to group together bike stations that are "nearby" according to our similarity analysis. For this, we first need to convert our **similarities** into **distances**.

Our distance values will always be non-negative and bounded between 0 and 1:

* If two stations have correlation 1, they behave identically, and therefore have distance 0.
* If two stations have correlation -1, they have exactly opposite behaviors, and therefore we want them to have distance 1 (the max).

In [None]:
# similarity goes from -1 to 1, so 1-similarity goes from 0 to 2.
# so, we multiply with 0.5 to get it between 0 and 1, and then take the square

distances = ((.5*(1-similarities))**2)
distances.head(5)
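
As a quick check of the formula, here it is applied to the three reference correlations (the labels are only for readability):

```python
import pandas as pd

# Reference correlation values
similarity = pd.Series(
    [1.0, 0.0, -1.0],
    index=["identical", "uncorrelated", "opposite"],
)

# Same transformation as above: halve (1 - similarity), then square
distance = (0.5 * (1 - similarity)) ** 2
# identical -> 0.0, uncorrelated -> 0.25, opposite -> 1.0
```

Note that squaring pulls the middle of the range down: uncorrelated stations end up at 0.25 rather than 0.5.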

The clustering code is very simple. The code below creates two groups of stations.

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=2)
cluster.fit(distances.values)

We will now take the results of the clustering and assign each station to its cluster.

In [None]:
labels = pd.DataFrame(list(zip(distances.index.values.tolist(), cluster.labels_)), columns = ["station_id", "cluster"])
labels

Let's see how many stations are in each cluster.

In [None]:
labels.pivot_table(
    index = 'cluster',
    aggfunc = 'count'
)

### Visualizing the Time Series Clusters

We will start by assigning a color to each cluster, so that we can plot each station timeline with the cluster color. (We use a long list of colors so that we can play with the number of clusters in the earlier code and still get nicely colored results.)

In [None]:
colors = ['red', 'black', 'green', 'magenta', 'yellow', 'blue', 'white', 'cyan']
labels['color'] = labels['cluster'].apply(lambda cluster_id : colors[cluster_id]) 
labels.head(10)

In [None]:
stations_plot = station_timeseries.plot(
    alpha=0.5, 
    legend=False, 
    figsize=(20,5), 
    linewidth=1,
    color=labels['color'],
    xlim=('2019-10-10 06', '2019-10-10 06:30'),
    ylim=(0,1)
)

The plot still looks messy. Let's instead plot a single line for each cluster. To represent the cluster, we are going to use the _median_ fullness value across all stations that belong to a cluster, for each timestamp. For that, we can again use a pivot table: we define `lastCommunicationTime` as one dimension of the table and `cluster` as the other, and we use the `median` function. 

To do so, we first _join_ our original dataframe with the results of the clustering, using the `merge` command, which adds a column with the cluster id for each station. Then we compute the pivot table.

In [None]:
median_cluster = df.merge(
    labels,
    how='inner',
    on='station_id'
).pivot_table(
    index='lastCommunicationTime', 
    columns='cluster', 
    values='percent_full', 
    aggfunc='median'
)

median_cluster.head(15)

Now, we can plot the medians for the two clusters.

In [None]:
median_cluster.plot(
    figsize=(20,5), 
    linewidth = 2, 
    alpha = 0.75,
    color=colors,
    ylim = (0,1),
    xlim=('2019-10-10 06', '2019-10-10 06:05'),
    grid = True
)

And just for fun and for visual decoration, let's put the two plots together. We are going to fade the individual station time series heavily (by setting `alpha=0.005`) and make the median lines more prominent by increasing their line width. We will limit our plot to the time window set by `xlim`:

In [None]:
stations_plot = station_timeseries.plot(
    alpha=0.005, 
    legend=False, 
    figsize=(20,5), 
    color=labels["color"]
)

median_cluster.plot(
    figsize=(20,5), 
    linewidth = 3, 
    alpha = 0.5,
    color=colors, 
    xlim=('2019-10-10 06', '2019-10-10 06:05'),
    ylim=(0,1),
    ax = stations_plot
)