# Class 3: Data visualization

_"Data visualization", "chart", and "graph" will be used interchangeably._

- Solutions coming
- No class next week

## **Today's goal**: Visualizing requests per community district

This should help us better understand trends across the city.

## Start by importing necessary packages

In [1]:
import pandas as pd
import plotly.express as px

In [2]:
# boilerplate for allowing PDF export
import plotly.io as pio

pio.renderers.default = "notebook_connected+pdf"

### Data from where we left off last class

Derived dataset containing count of complaints per community district.

In [3]:
districts = pd.read_csv(
    "https://storage.googleapis.com/python-public-policy/data/311_community_districts.csv.zip"
)
districts.head()

Unnamed: 0 | borocd | Borough | CD Name | 2010 Population | count_of_311_requests | request_per_capita
--: | --: | :-- | :-- | --: | --: | --:
0 | 112 | Manhattan | Washington Heights, Inwood | 190020 | 81403 | 0.428392
1 | 405 | Queens | Ridgewood, Glendale, Maspeth | 169190 | 71506 | 0.422637
2 | 412 | Queens | Jamaica, St. Albans, Hollis | 225919 | 70362 | 0.311448
3 | 301 | Brooklyn | Williamsburg, Greenpoint | 173083 | 68104 | 0.393476
4 | 303 | Brooklyn | Bedford Stuyvesant | 152985 | 66360 | 0.433768
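
Judging from the rows above, `request_per_capita` appears to be `count_of_311_requests` divided by `2010 Population`. A quick sanity check, assuming only the five rows shown (the `sample` DataFrame is hand-typed for illustration, not loaded from the CSV):

```python
import pandas as pd

# Hand-typed copy of three columns from the head() output above.
sample = pd.DataFrame(
    {
        "borocd": [112, 405, 412, 301, 303],
        "2010 Population": [190020, 169190, 225919, 173083, 152985],
        "count_of_311_requests": [81403, 71506, 70362, 68104, 66360],
    }
)

# Requests divided by residents, rounded to the six decimals pandas displays.
sample["request_per_capita"] = (
    sample["count_of_311_requests"] / sample["2010 Population"]
).round(6)
print(sample)
```

The computed column should line up with the `request_per_capita` values in the table.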


## Let's start with a histogram to better visualize the difference in scale of 311 requests across community districts

Adapting [the basic histogram example](https://plotly.com/python/histograms/):

In [4]:
fig = px.histogram(
    districts,
    x="count_of_311_requests",
    title="Distribution of number of 311 requests by number of Community Districts",
)
fig.show()

As we said before, looking at raw volume is probably less useful than density (requests per capita).

1. [Open Homework 3](https://python-public-policy.afeld.me/en/{{school_slug}}/hw_3.html)
1. Complete `In-class exercise 0`

How does it compare to the chart of the raw counts?

In [5]:
fig = px.histogram(districts, x="request_per_capita", height=200)
fig.show()

In [6]:
fig = px.histogram(districts, x="count_of_311_requests", height=200)
fig.show()

Let's [improve the formatting](https://plotly.com/python/figure-labels/) (based on [the `.histogram()` documentation](https://plotly.com/python-api-reference/generated/plotly.express.histogram.html)):

In [7]:
fig = px.histogram(
    districts,
    x="request_per_capita",
    title="Volume of 311 requests, 2018-2019",
    labels={"request_per_capita": "311 requests per capita"},
)

# y-axis needs to be done separately, since it's derived
fig.update_layout(yaxis_title_text="Number of community districts")
fig.show()
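
The y-axis is "derived" in the sense that the histogram counts how many rows fall into each bin, rather than reading a column. A rough sketch of that counting, using only the five per-capita values from the `head()` output and a hypothetical bin width of 0.05 (Plotly picks its own bins by default):

```python
from collections import Counter

# Per-capita values copied from the head() output earlier in the notebook.
per_capita = [0.428392, 0.422637, 0.311448, 0.393476, 0.433768]

BIN_WIDTH = 0.05  # hypothetical; Plotly chooses its own bin size by default


def bin_index(value, width=BIN_WIDTH):
    """Return which bin a value falls into (bin 0 covers [0, width))."""
    return int(value // width)


# The histogram's y-value for each bin is the number of rows that land in it.
counts = Counter(bin_index(v) for v in per_capita)

for index in sorted(counts):
    start = index * BIN_WIDTH
    print(f"{start:.2f} to {start + BIN_WIDTH:.2f}: {counts[index]} district(s)")
```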

## Scatterplot

In [8]:
fig = px.scatter(
    districts,
    x="2010 Population",
    y="count_of_311_requests",
    hover_data=["borocd", "CD Name"],
    title="Number of 311 requests per Community District by population",
)

fig.show()

**Exercise 1:** [Add a trendline](https://plotly.com/python/linear-fits/).

In [9]:
fig = px.scatter(
    districts,
    x="2010 Population",
    y="count_of_311_requests",
    hover_data=["borocd", "CD Name"],
    title="Number of 311 requests per Community District by population",
    trendline="ols",
)

fig.show()

Let's take a look at the statistical summary, via the [`statsmodels`](https://www.statsmodels.org/) package, following [Plotly's example](https://plotly.com/python/linear-fits/#fitting-multiple-lines-and-retrieving-the-model-parameters):

In [10]:
trend_results = px.get_trendline_results(fig).iloc[0, 0]
trend_results.summary()

| | | | |
|:--|:--|:--|--:|
| Dep. Variable: | y | R-squared: | 0.471 |
| Model: | OLS | Adj. R-squared: | 0.462 |
| Method: | Least Squares | F-statistic: | 50.73 |
| Date: | Tue, 11 Apr 2023 | Prob (F-statistic): | 1.99e-09 |
| Time: | 18:14:30 | Log-Likelihood: | -626.67 |
| No. Observations: | 59 | AIC: | 1257.0 |
| Df Residuals: | 57 | BIC: | 1261.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 1.528e+04 | 4424.730 | 3.453 | 0.001 | 6416.292 | 2.41e+04 |
| x1 | 0.2173 | 0.031 | 7.122 | 0.000 | 0.156 | 0.278 |

| | | | |
|:--|--:|:--|--:|
| Omnibus: | 0.008 | Durbin-Watson: | 1.996 |
| Prob(Omnibus): | 0.996 | Jarque-Bera (JB): | 0.065 |
| Skew: | 0.006 | Prob(JB): | 0.968 |
| Kurtosis: | 2.837 | Cond. No. | 488000.0 |
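
One way to read the coefficients in the summary above: `const` is the intercept and `x1` is the slope on population, so the fitted line is roughly `requests = const + x1 * population`. A small sketch of using it to make a prediction (the `predict_requests` helper and the 150,000 population are made up for illustration):

```python
# Coefficients copied from the regression summary above.
INTERCEPT = 1.528e04  # "const": predicted requests at a population of zero
SLOPE = 0.2173  # "x1": additional requests per additional resident


def predict_requests(population):
    """Predicted 311 request count for a district of the given population."""
    return INTERCEPT + SLOPE * population


# 150,000 is a hypothetical population chosen for illustration.
print(round(predict_requests(150_000)))
```

For that hypothetical district, the line predicts roughly 47,875 requests.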


["In general, the higher the R-squared, the better the model fits your data."](https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)
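
R-squared is the share of the variation in `y` that the fitted line accounts for, so the 0.471 above suggests population explains a bit under half of the variation in request counts. A minimal sketch of the calculation, using made-up points rather than the district data (`ols_fit` and `r_squared` are illustrative helpers, not part of Plotly or `statsmodels`):

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """1 minus (unexplained variation / total variation)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up, nearly linear data purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = ols_fit(xs, ys)
print(f"R-squared: {r_squared(xs, ys, intercept, slope):.3f}")
```

Because these toy points sit close to a straight line, the result is near 1; noisier data would pull it toward 0.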

### Let's try styling the scatter plot with different colors for each borough

In [11]:
fig = px.scatter(
    districts,
    x="2010 Population",
    y="count_of_311_requests",
    color="Borough",
    title="Number of 311 requests per Community District by population by borough",
)
fig.show()

## Map complaint counts by CD

We'll follow [this example](https://plotly.com/python/choropleth-maps/#indexing-by-geojson-properties), using [community district GIS data](https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4).


First, let's take a look at the GeoJSON data. We're looking for a property we can [match our `borocd` column up to](https://plotly.com/python/mapbox-county-choropleth/#indexing-by-geojson-properties). One way to inspect it:

1. Open [Chrome](https://www.google.com/chrome/downloads/)
1. Install [JSON Viewer](https://chrome.google.com/webstore/detail/json-viewer/gbmdgpbipfallnflgajpaliibnhdgobh)
1. Open https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson

Load the GeoJSON data using [the requests package](https://docs.python-requests.org/) (nothing to do with 311 requests):

In [12]:
import requests

response = requests.get("https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson")
shapes = response.json()
print("loaded")

# intentionally not outputting the data here since it's large

loaded


_This is equivalent to the use of [`urlopen()`](https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen) and [`json.load()`](https://docs.python.org/3/library/json.html) in [the Plotly examples](https://plotly.com/python/mapbox-county-choropleth/)._
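
For comparison, a sketch of what the standard-library version might look like (`load_json` is a hypothetical helper name, not something from the Plotly examples):

```python
import json
from urllib.request import urlopen


def load_json(url):
    """Fetch a URL and parse the response body as JSON, standard library only."""
    with urlopen(url) as response:
        return json.load(response)
```

Calling `load_json()` with the GeoJSON URL above should give the same nested dictionary as `response.json()`.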

The structure looks something like:

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
          [
            [
              [-73.8718461029101, 40.843760777855834],
              …
            ]
          ]
        ]
      },
      "properties": {
        "boro_cd": "206",
        "shape_leng": "35875.7117328",
        "shape_area": "42664311.5086"
      }
    },
    …
  ]
}
```

Peek at the `properties` of one of the `features` a.k.a. shapes a.k.a. Community Districts:

In [13]:
shapes["features"][0]["properties"]

{'boro_cd': '308',
 'shape_leng': '38232.8866494',
 'shape_area': '45603787.0874'}

Notes:

- `boro_cd` is the property we're looking for. We'll [specify this as the `featureidkey`](https://plotly.com/python/mapbox-county-choropleth/#indexing-by-geojson-properties).
- `response.json()` turns JSON data into nested Python objects: `shapes` is a dictionary, `features` is a list beneath it, etc.
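
To make the nesting concrete, here is a small sketch that parses a trimmed-down stand-in for the response (the `raw` string and `sample_shapes` name are made up for illustration; the real data has many more features plus coordinates):

```python
import json

# A trimmed-down stand-in for the real response body; coordinates are omitted.
raw = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"boro_cd": "206"}},
    {"type": "Feature", "properties": {"boro_cd": "308"}}
  ]
}
"""

sample_shapes = json.loads(raw)  # JSON object -> dict, JSON array -> list

first_feature = sample_shapes["features"][0]  # index into the list
print(type(sample_shapes).__name__, type(sample_shapes["features"]).__name__)

# Collect the property we plan to match on from every feature.
district_ids = [f["properties"]["boro_cd"] for f in sample_shapes["features"]]
print(district_ids)
```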

In [14]:
def plot_nyc(df):
    fig = px.choropleth_mapbox(
        df,
        locations="borocd",  # column name to match on
        color="request_per_capita",  # column name for values
        geojson=shapes,
        featureidkey="properties.boro_cd",  # GeoJSON property to match on
        hover_data=["CD Name"],
        center={"lat": 40.71, "lon": -73.98},
        zoom=9,
        mapbox_style="carto-positron",
        height=600,
        title="Requests per capita across Community Districts",
    )

    fig.show()

Wrapping this Plotly code in a function to:

- Save space on subsequent slides
- Make the code reusable for plotting different DataFrames

In [15]:
plot_nyc(districts)

Midtown, as an outlier, is skewing our results. Let's exclude it.

In [16]:
no_midtown = districts[districts.borocd != 105]
plot_nyc(no_midtown)
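
Why does a single outlier matter so much? Assuming the color scale stretches linearly from the smallest to the largest value (which is how continuous color scales generally behave by default), one extreme value squeezes everything else into a narrow band of color. A sketch with made-up numbers (`scale_positions` is an illustrative helper, not a Plotly function):

```python
def scale_positions(values):
    """Map each value to a 0-1 position between the smallest and largest value."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Made-up per-capita values: three typical districts plus one extreme outlier.
typical = [0.30, 0.40, 0.45]
with_outlier = typical + [3.00]

print([round(p, 2) for p in scale_positions(with_outlier)])
print([round(p, 2) for p in scale_positions(typical)])
```

With the outlier included, the three typical districts all land near the bottom of the scale; without it, they spread across the full range.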

**Fun fact** (for a certain kind of person): [What the Mapbox zoom level means](https://docs.mapbox.com/help/glossary/zoom-level/)
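
As I understand the linked glossary, each zoom level doubles the map's scale, so the world is split into twice as many tiles along each edge. A sketch of that arithmetic (both helper functions are made up for illustration):

```python
def tiles_per_side(zoom):
    """Number of map tiles along each edge of the world at an integer zoom level."""
    return 2**zoom


def relative_scale(zoom_a, zoom_b):
    """How many times larger features appear at zoom_b than at zoom_a."""
    return 2 ** (zoom_b - zoom_a)


print(tiles_per_side(0))  # the whole world in a single tile
print(tiles_per_side(9))  # the zoom used in plot_nyc()
print(relative_scale(9, 10))
```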

## Chart hygiene

- Always include a title
- Make sure you label the independent and dependent variables (X and Y axes)
- Consider whether you are working with continuous vs. discrete values
- If you're trying to show more than three variables at once (e.g. X axis, Y axis, and color), try simplifying

## What visualization should I use?

Rudimentary guidelines:

What do you want to do? | Chart type
:-- | :-:
Show changes over time | Line chart
Compare values for categorical data | Bar chart
Compare two numeric variables | Scatter plot
Count things / show distribution across a range | Histogram
Show geographic trends | [Map (choropleth, hexbin, bubble, etc.)](https://plotly.com/python/maps/)

The [Data Design Standards](https://xdgov.github.io/data-design-standards/visualizations/) goes into more detail.

## Pivoting

FYI: Pandas supports [reshaping](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html) DataFrames through [pivoting](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-by-pivoting-dataframe-objects), [like spreadsheets do](https://support.google.com/docs/answer/1272900).

<video controls width="700" src="https://github.com/afeld/python-public-policy/raw/main/extras/img/pivot.mp4"></video>
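
A minimal sketch of pivoting in pandas, using a made-up "long" table of complaint counts (the column names and numbers are hypothetical, not from the 311 dataset):

```python
import pandas as pd

# Made-up long-format data: one row per borough and complaint type record.
requests_long = pd.DataFrame(
    {
        "Borough": ["Brooklyn", "Brooklyn", "Queens", "Queens", "Queens"],
        "complaint_type": ["Noise", "Heat", "Noise", "Noise", "Heat"],
        "count": [3, 5, 2, 4, 1],
    }
)

# Reshape to one row per borough and one column per complaint type,
# summing the counts wherever a combination appears more than once.
wide = requests_long.pivot_table(
    index="Borough",
    columns="complaint_type",
    values="count",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```

This mirrors a spreadsheet pivot table: rows come from `index`, columns from `columns`, and cells are filled by aggregating `values`.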

## [Homework 3](https://python-public-policy.afeld.me/en/{{school_slug}}/hw_3.html)

## Final Project

In an ideal world, you start with a specific question, then find data to answer it:

![project flow](extras/img/projectflow.png)

_Source: [Big Data and Social Science](https://textbook.coleridgeinitiative.org/chap-intro.html#the-structure-of-the-book)_

In practice, the data needed often doesn't exist or is hard (or impossible) to find or access:

![project flow](extras/img/projectflow_amended.png)

[Final Project](https://python-public-policy.afeld.me/en/{{school_slug}}/final_project.html)