<a href="https://colab.research.google.com/github/afeld/python-public-policy/blob/main/lecture_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NYU Wagner - Python Coding for Public Policy**
# Class 3: Data visualization

# LECTURE

## Announcement

Office hours for the next two weeks have moved to Monday at 6pm Eastern Time (US).

## **Today's goal**: Visualizing requests per community district to help us better understand trends across the city

## Start by importing necessary packages

In [1]:
import pandas as pd
import plotly.express as px

## Data from where we left off last class

A derived dataset containing the count of complaints per community district.

In [2]:
df = pd.read_csv('https://storage.googleapis.com/python-public-policy/data/311_community_districts.csv')
df.head()

| Unnamed: 0 | borocd | Borough | CD Name | 2010 Population | count_of_311_requests | request_per_capita |
|---|---|---|---|---|---|---|
| 0 | 112 | Manhattan | Washington Heights, Inwood | 190020 | 81403 | 0.428392 |
| 1 | 405 | Queens | Ridgewood, Glendale, Maspeth | 169190 | 71506 | 0.422637 |
| 2 | 412 | Queens | Jamaica, St. Albans, Hollis | 225919 | 70362 | 0.311448 |
| 3 | 301 | Brooklyn | Williamsburg, Greenpoint | 173083 | 68104 | 0.393476 |
| 4 | 303 | Brooklyn | Bedford Stuyvesant | 152985 | 66360 | 0.433768 |
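
The `request_per_capita` column looks like the request count divided by the 2010 population. That is an assumption about how the dataset was derived, but it matches the values shown. A quick sketch that recomputes it for three of the rows above, typed in by hand (`sample` is a name made up for this example):

```python
import pandas as pd

# Three rows copied by hand from the table above, for illustration only.
sample = pd.DataFrame({
    'borocd': [112, 405, 412],
    '2010 Population': [190020, 169190, 225919],
    'count_of_311_requests': [81403, 71506, 70362],
})

# Assumed derivation: requests per resident = request count / population.
sample['request_per_capita'] = (
    sample['count_of_311_requests'] / sample['2010 Population']
)

print(sample)
```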


## Let's start with a histogram to better visualize the difference in scale of 311 requests across community districts

Adapting [the basic histogram example](https://plotly.com/python/histograms/):

In [3]:
fig = px.histogram(df, x='count_of_311_requests')
fig.show()
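
A histogram sorts values into ranges (bins) and counts how many land in each one. Here is a rough sketch of that counting step using `pd.cut`, with the five request counts shown earlier and bin edges picked by hand for illustration. Plotly chooses its own bins, so this won't match the chart exactly.

```python
import pandas as pd

# The five request counts from the df.head() output, typed in by hand.
request_counts = pd.Series([81403, 71506, 70362, 68104, 66360])

# Hand-picked bin edges (an assumption for this example).
bin_edges = [60_000, 70_000, 80_000, 90_000]

# Assign each district to a bin, then count districts per bin.
binned = pd.cut(request_counts, bins=bin_edges)
districts_per_bin = binned.value_counts().sort_index()

print(districts_per_bin)
```

If you want to influence the binning in the chart itself, `px.histogram()` accepts an `nbins` argument.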

As we said before, raw volume is probably less useful than density (requests per capita).

1. [Open Homework 3](https://colab.research.google.com/github/afeld/python-public-policy/blob/main/hw_3.ipynb)
1. `Save a copy in Drive`
1. Complete `In-class exercise 1`

How does it compare to the chart of the raw counts?

In [4]:
fig = px.histogram(df, x='request_per_capita')
fig.show()

In [5]:
fig = px.histogram(df, x='count_of_311_requests')
fig.show()

Let's [improve the formatting](https://plotly.com/python/figure-labels/) (based on [the `.histogram()` documentation](https://plotly.com/python-api-reference/generated/plotly.express.histogram.html)):

In [6]:
fig = px.histogram(df,
                   x='request_per_capita',
                   title='Volume of 311 requests, 2018-2019',
                   labels={'request_per_capita': '311 requests per capita'})

# y-axis needs to be done separately, since it's derived
fig.update_layout(yaxis_title_text='Number of community districts')
fig.show()

## Creating a stacked bar chart

In [7]:
fig = px.bar(df,
             x='Borough',
             y='count_of_311_requests',
             hover_data=['borocd', 'CD Name'])

fig.show()
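
Why is this chart stacked? Each community district is its own row, so `px.bar()` draws one segment per district and, by default, stacks the segments that share a `Borough`. The full height of each bar should therefore equal that borough's total. A sketch of the same sum with `groupby`, using only the five rows shown earlier (so these are not the real borough totals):

```python
import pandas as pd

# Five rows copied by hand from the df.head() output.
sample = pd.DataFrame({
    'Borough': ['Manhattan', 'Queens', 'Queens', 'Brooklyn', 'Brooklyn'],
    'borocd': [112, 405, 412, 301, 303],
    'count_of_311_requests': [81403, 71506, 70362, 68104, 66360],
})

# Sum the district counts within each borough: the height of each stacked bar.
borough_totals = sample.groupby('Borough')['count_of_311_requests'].sum()

print(borough_totals)
```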

## Make a scatterplot of count of 311 requests per CD against CD population

In [8]:
fig = px.scatter(df,
                 x='2010 Population',
                 y='count_of_311_requests',
                 hover_data=['borocd', 'CD Name'])

fig.show()

**Exercise 2:** [Add a trendline](https://plotly.com/python/linear-fits/).

In [9]:
fig = px.scatter(df,
                 x='2010 Population',
                 y='count_of_311_requests',
                 hover_data=['borocd', 'CD Name'],
                 trendline='ols')

fig.show()

Let's take a look at the statistical summary, via the [`statsmodels`](https://www.statsmodels.org/) package:

In [10]:
trend_results = px.get_trendline_results(fig).iloc[0,0]
trend_results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.471 |
| Model: | OLS | Adj. R-squared: | 0.462 |
| Method: | Least Squares | F-statistic: | 50.73 |
| Date: | Wed, 27 Oct 2021 | Prob (F-statistic): | 1.99e-09 |
| Time: | 13:07:39 | Log-Likelihood: | -626.67 |
| No. Observations: | 59 | AIC: | 1257.0 |
| Df Residuals: | 57 | BIC: | 1261.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.528e+04 | 4424.730 | 3.453 | 0.001 | 6416.292 | 2.41e+04 |
| x1 | 0.2173 | 0.031 | 7.122 | 0.000 | 0.156 | 0.278 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.008 | Durbin-Watson: | 1.996 |
| Prob(Omnibus): | 0.996 | Jarque-Bera (JB): | 0.065 |
| Skew: | 0.006 | Prob(JB): | 0.968 |
| Kurtosis: | 2.837 | Cond. No. | 488000.0 |


["In general, the higher the R-squared, the better the model fits your data."](https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)
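
In the summary above, an R-squared of 0.471 means population accounts for a bit under half of the variation in request counts, and the `x1` coefficient of 0.2173 is the slope of the trendline (roughly 0.22 more requests per additional resident). To see where R-squared comes from, here is a sketch that computes it by hand with NumPy instead of `statsmodels`. It uses only the five districts shown earlier, so its numbers will not match the full 59-district fit.

```python
import numpy as np

# Five districts from the df.head() output, typed in by hand.
population = np.array([190020, 169190, 225919, 173083, 152985], dtype=float)
requests_311 = np.array([81403, 71506, 70362, 68104, 66360], dtype=float)

# Fit a straight line: requests is roughly slope * population + intercept.
slope, intercept = np.polyfit(population, requests_311, deg=1)
predicted = slope * population + intercept

# R-squared = 1 - (unexplained variation / total variation).
ss_residual = np.sum((requests_311 - predicted) ** 2)
ss_total = np.sum((requests_311 - requests_311.mean()) ** 2)
r_squared = 1 - ss_residual / ss_total

print(f'slope={slope:.4f}, intercept={intercept:.0f}, R-squared={r_squared:.3f}')
```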

## Let's try styling the scatter plot with different colors for each borough

In [11]:
fig = px.scatter(df,
                 x='2010 Population',
                 y='count_of_311_requests',
                 color='Borough')
fig.show()

## Bonus: Produce a map of complaints per capita by CD

Following [this example](https://plotly.com/python/choropleth-maps/#indexing-by-geojson-properties), using [community district GIS data](https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4):

In [12]:
import requests

geojson = 'https://data.cityofnewyork.us/api/geospatial/yfnk-k7r4?method=export&format=GeoJSON'
response = requests.get(geojson)
response.json()

{'type': 'FeatureCollection',
 'features': [{'type': 'Feature',
   'properties': {'boro_cd': '206',
    'shape_area': '42664311.7513',
    'shape_leng': '35875.7113204'},
   'geometry': {'type': 'MultiPolygon',
    'coordinates': [[[[-73.8718461029101, 40.84376077785579],
       [-73.87191691517187, 40.84345374314911],
       [-73.87196432223092, 40.84323825327201],
       [-73.87213357912808, 40.84249779505227],
       [-73.87231748546176, 40.84169028202038],
       [-73.87234056327031, 40.841583626189475],
       [-73.87236365249346, 40.84147696098337],
       [-73.87239558060982, 40.84133637492926],
       [-73.872527839146, 40.84075398300681],
       [-73.8726780587479, 40.84013719660415],
       [-73.87277204658531, 40.83975128233603],
       [-73.87298043005399, 40.83895248468385],
       [-73.87312728158592, 40.838335096918485],
       [-73.87314714674842, 40.8382638989731],
       [-73.87317603862375, 40.83816583089217],
       [-73.87332746763535, 40.837651949332916],
       [

In [13]:
fig = px.choropleth_mapbox(df,
                           geojson=geojson,
                           locations='borocd',
                           featureidkey='properties.boro_cd',
                           color='request_per_capita',
                           hover_data=['CD Name'],
                           center={'lat': 40.73, 'lon': -73.98},
                           zoom=9,
                           mapbox_style='carto-positron')

fig.update_layout(height=700)
fig.show()

Midtown (`borocd` 105) is an outlier, and it stretches the color scale so that differences among the other districts are hard to see. Let's exclude it.

In [14]:
no_midtown = df[df.borocd != 105]

fig = px.choropleth_mapbox(no_midtown,
                           geojson=geojson,
                           locations='borocd',
                           featureidkey='properties.boro_cd',
                           color='request_per_capita',
                           hover_data=['CD Name'],
                           center={'lat': 40.73, 'lon': -73.98},
                           zoom=9,
                           mapbox_style='carto-positron')

fig.update_layout(height=700)
fig.show()

**Tip:** Make sure your `locations` match your `featureidkey`.
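
One way to check this before mapping is to compare the two sets of IDs. In the data above, `borocd` is a number (`112`) while the GeoJSON stores `boro_cd` as text (`'206'`), so this sketch compares both as strings. The tiny `districts` DataFrame and `geojson_data` dict are made-up stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the real DataFrame and GeoJSON.
districts = pd.DataFrame({'borocd': [112, 405, 999]})
geojson_data = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'boro_cd': '112'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'boro_cd': '405'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'boro_cd': '206'}, 'geometry': None},
    ],
}

# Compare as strings, because the GeoJSON stores boro_cd as text.
data_ids = set(districts['borocd'].astype(str))
shape_ids = {
    feature['properties']['boro_cd'] for feature in geojson_data['features']
}

# Rows that would have no shape on the map, and shapes with no data.
rows_without_shape = sorted(data_ids - shape_ids)
shapes_without_row = sorted(shape_ids - data_ids)

print('In data but not in GeoJSON:', rows_without_shape)
print('In GeoJSON but not in data:', shapes_without_row)
```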

## Chart hygiene

- Always include a title
- Make sure you label the independent and dependent variables (X and Y axes)
- Consider whether you are working with continuous vs. discrete values

## Other pandas/Jupyter best practices

- Make variable names descriptive
    - Don't copy the examples' habit of naming everything `df`
- Make notebooks [idempotent](https://en.wikipedia.org/wiki/Idempotence)
    - Makes your work reproducible
    - Use `Restart and run all`
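
What does idempotent look like in a notebook cell? Running the cell twice should leave you in the same state as running it once. A sketch with two made-up functions standing in for cells (`not_idempotent` and `idempotent` are names invented for this example):

```python
import pandas as pd

counts = pd.DataFrame({'count_of_311_requests': [81403, 71506]})


def not_idempotent(frame):
    # Overwrites its own input column, so every re-run changes the result.
    frame['count_of_311_requests'] = frame['count_of_311_requests'] / 1000
    return frame


def idempotent(frame):
    # Writes a new column from an untouched input column, so re-runs agree.
    frame['requests_thousands'] = frame['count_of_311_requests'] / 1000
    return frame


# "Run the cell" once versus twice, starting from a fresh copy each time.
bad_once = not_idempotent(counts.copy())
bad_twice = not_idempotent(not_idempotent(counts.copy()))
good_once = idempotent(counts.copy())
good_twice = idempotent(idempotent(counts.copy()))

print('Overwriting cell gives same result on re-run:', bad_once.equals(bad_twice))
print('Derived-column cell gives same result on re-run:', good_once.equals(good_twice))
```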

# HOMEWORK 3

Please refer to the [HW3 Starter Notebook](https://colab.research.google.com/github/afeld/python-public-policy/blob/main/hw_3.ipynb).