## Class 10
## Plotting Points data: 311 data 📲 📲 📲 📲 

c4sue 2021 NYU @avigailvantu

Today we will continue to work with Pandas and Matplotlib, and we will also create some maps using GeoPandas. Looking at 311 complaints from the past month and from the same period in 2020, we will compare, group and visualize the city’s trends. Along the way we will create a GeoDataFrame: a format that is similar to a data frame but carries an extra dimension of geographical attributes. Think of the times we loaded CSV data into QGIS and needed to merge it with a shapefile or assign a column to a geographical unit.

This week we will do something similar, only this time we will read a CSV into a data frame and then assign columns in the data to represent geometry. That will let us visualize the data quite easily. We’ll see some pretty simple, yet cool, ways to do so!

## Installing GeoPandas: 

Before getting started with today's class you will need to install the GeoPandas library (in case you don't have it already). The easiest way is to follow the instructions on the library page:

https://geopandas.org/getting_started.html 

If you are facing issues installing through Conda, you can create a new environment. There are some details about it on the GeoPandas page. In addition, there are more instructions inside the "class 10" folder on GitHub. 

https://github.com/avigailvantu/c4sue2021/tree/main/labs/class_10 


In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd 
import matplotlib 
import matplotlib.pyplot as plt
#from shapely.geometry import Point
#from geopandas import GeoDataFrame

For this assignment I downloaded the 311 data from the NYC Open Data platform. I wanted to look into what the city's complaint patterns were in the past month. To get a relative understanding, we will compare data from 2020 and 2021. Comparing the same period across years is a common way to highlight changes and trends. Many phenomena are seasonal, which is why comparing one month to the previous month is tricky. Having said that, even the same period in two separate years is likely to show some differences, but hopefully fewer. 


- Data 2020: March 1st - March 31st 2020
- Data 2021: March 1st - March 31st 2021 

* Note: you will need to fetch the data yourself from the NYC Open Data platform 
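
If your download covers a longer period than one month, a date filter can trim it to the March window. This is a minimal sketch, assuming the export has a `Created Date` column stored as text in month/day/year format; the tiny table below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for a 311 export; the 'Created Date' column name is an assumption
raw = pd.DataFrame({
    "Unique Key": [1, 2, 3, 4],
    "Created Date": [
        "02/28/2021 11:59:00 PM",
        "03/01/2021 12:00:00 AM",
        "03/31/2021 11:59:00 PM",
        "04/01/2021 12:00:00 AM",
    ],
})

# Parse the text timestamps into real datetimes
raw["Created Date"] = pd.to_datetime(raw["Created Date"], format="%m/%d/%Y %I:%M:%S %p")

# Keep rows from March 1st up to (but not including) April 1st
in_march = (raw["Created Date"] >= "2021-03-01") & (raw["Created Date"] < "2021-04-01")
march21 = raw.loc[in_march]
print(march21["Unique Key"].tolist())
```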


In [None]:
#load the 311 data: one file for 2020 and one for 2021
data20 = pd.read_csv('311_March2020.csv')
data21 = pd.read_csv('311_March2021.csv')



In [None]:
data20.head(5)

In [None]:
data21.head(5)

In [None]:
print ('shape 2020',data20.shape)
print ('shape 2021',data21.shape)

### Question 1: 
What are the changes between the 2020 and 2021 data in terms of the number of non-emergency complaints in NYC? 

In [None]:
#What are the columns in the data? 
print ('2020 columns:',data20.columns)
print ('2021 columns:',data21.columns)

# 311 data for 2020 and 2021 by agency: 

Let's look into the "value_counts" function. It returns how many times each value appears in the Agency column, meaning we will get a list of how many complaints were channeled to each agency. 

Check out this URL for the agency acronyms:
https://www1.nyc.gov/site/mocs/about/agencies-acronyms-initialisms.page

In [None]:
data20['Agency'].value_counts()

In [None]:
data21['Agency'].value_counts()

### Question 2:

- What are some of the differences between 2020 and 2021 in which agencies the calls were channeled to? Which agencies have been seeing less activity, and which ones more? 

So far we worked mainly with Pandas (also some pyplot, NumPy and datetime). In addition to all these libraries, Python also has some pretty neat geographical features! Let's check out a few of them on our data: 
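
One possible way to line up the two years is to put both sets of counts in a single table. The sketch below uses two small made-up tables in place of data20 and data21; the names `toy20`, `toy21` and `by_agency` are only for illustration.

```python
import pandas as pd

# Toy stand-ins for data20 and data21 (made-up values)
toy20 = pd.DataFrame({"Agency": ["NYPD", "NYPD", "HPD", "DOT"]})
toy21 = pd.DataFrame({"Agency": ["NYPD", "HPD", "HPD", "HPD", "DSNY"]})

# Put both years' counts side by side; agencies missing in one year get 0
by_agency = pd.concat(
    [toy20["Agency"].value_counts(), toy21["Agency"].value_counts()],
    axis=1,
    keys=["2020", "2021"],
).fillna(0).astype(int)

# Positive change = more complaints in 2021 than in 2020
by_agency["change"] = by_agency["2021"] - by_agency["2020"]
print(by_agency.sort_values("change"))
```

Sorting by the "change" column puts the agencies that lost the most complaints at the top and the ones that gained the most at the bottom.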

## From DataFrame to GeoDataFrame 🧮

A GeoDataFrame is a data frame that includes one column with a "special" status. This is the "geometry" column, which enables Python to treat the data as geographical. In many cases, like ours, the data will not come with a built-in "geometry" column. Instead, we will usually have x and y, or Latitude and Longitude, columns that we will transform into the needed format. 

To go from DataFrame---> GeoDataFrame:
- we would want to tell python which columns can be used as "geometry". 

Note that for point data a typical geometry column looks like this:


- POINT (LON LAT) 

The POINT values will be created by the "points_from_xy" function and handed to GeoDataFrame as its geometry. We only need to tell Python which column in the data is which (lon, lat).  
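
Here is a small sketch of the same idea on a made-up table. It adds two steps that are not part of the class code and are only suggestions: dropping rows with no coordinates before building the points, and declaring the coordinate reference system as EPSG:4326 (plain longitude/latitude), which is an assumption about how the coordinates are stored.

```python
import pandas as pd
import geopandas as gpd

# Toy table with made-up coordinates; the second row has no location
toy = pd.DataFrame({
    "Agency": ["HPD", "NYPD", "DOT"],
    "Longitude": [-73.99, None, -73.95],
    "Latitude": [40.73, None, 40.78],
})

# Drop rows without coordinates so every record gets a usable point
located = toy.dropna(subset=["Longitude", "Latitude"])

# x is the longitude and y is the latitude; EPSG:4326 is an assumption
toy_gdf = gpd.GeoDataFrame(
    located,
    geometry=gpd.points_from_xy(located["Longitude"], located["Latitude"]),
    crs="EPSG:4326",
)
print(toy_gdf[["Agency", "geometry"]])
```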


In [None]:
#transform data into geo data frame: 


#one geodata frame for 2020 
gdf20 = gpd.GeoDataFrame(
    data20, geometry=gpd.points_from_xy(data20.Longitude, data20.Latitude))

#and another one for 2021 
gdf21 = gpd.GeoDataFrame(
    data21, geometry=gpd.points_from_xy(data21.Longitude, data21.Latitude))

#note that here we tell Python that the column data20.Longitude
#is the longitude (x) and data20.Latitude is the latitude (y). 

In [None]:
#DataFrame

data20.head()

In [None]:
# and GeoDataFrame
#check out our GeoDataFrame--> note the "geometry" column was added (all the way to the right)
gdf20.head(3)

In [None]:
#check the shape of the data: 

gdf20.shape

In [None]:
gdf21.shape

## Now we can finally visualize the data: 

First: plot all points for the layer. Note that I am setting the marker size to 0.1 since there are so many of them!! 

In [None]:
#plot all 2020 data:
gdf20.plot( color='red',legend=True,figsize=(12, 12),markersize=0.1)
plt.axis('off')
plt.title('311 complaints March 2020')
plt.show()

In [None]:
#plot all 2021 data: 
gdf21.plot( color='blue',legend=True,figsize=(12, 12),markersize=0.1)
plt.axis('off')
plt.title('311 complaints March 2021')
plt.show()

# Examine one agency: 
### HPD (Housing Preservation & Development)

To make better sense of what people are reporting less in these past weeks, we will take a closer look at the complaints of different agencies. 

We will start with HPD: 

In [None]:
#filter only hpd

hpd20 = gdf20.loc[gdf20['Agency']=='HPD']

hpd21 = gdf21.loc[gdf21['Agency']=='HPD']

In [None]:
print (len(hpd20))
print (len(hpd21))

## Plot HPD data for both 2020 and 2021

In [None]:
hpd20.plot( color='blue',legend=True,figsize=(12, 12), markersize=1)
plt.axis('off')
plt.title('311 HPD complaints March 2020')
hpd21.plot( color='red',legend=True,figsize=(12, 12),markersize=1)
plt.axis('off')
plt.title('311 HPD complaints March 2021')
plt.show()

## Question 3
Which areas seem to have more or less HPD complaints between March 2020 and March 2021?
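
To compare areas by eye it can help to place the two maps next to each other with the same extent. This is only a sketch: it uses two tiny made-up point layers (`pts20`, `pts21`) in place of hpd20 and hpd21, and passes each plot an `ax` so both land in one figure.

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Toy point layers with made-up coordinates, standing in for hpd20 and hpd21
pts20 = gpd.GeoDataFrame(
    geometry=gpd.points_from_xy([-73.99, -73.95], [40.73, 40.78]))
pts21 = gpd.GeoDataFrame(
    geometry=gpd.points_from_xy([-73.98, -73.90, -73.93], [40.70, 40.85, 40.76]))

# One figure with two panels that share the same x and y limits
fig, (ax20, ax21) = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)

pts20.plot(ax=ax20, color="blue", markersize=5)
ax20.set_title("March 2020")
ax20.set_axis_off()

pts21.plot(ax=ax21, color="red", markersize=5)
ax21.set_title("March 2021")
ax21.set_axis_off()

fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a plain script you would add plt.show() at the end.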
 
## Another way for us to look into the data is to sub-slice it again: 

Now we dive into the complaint types within the HPD complaints, so we can learn what types of housing complaints we are seeing. That will also help us compare some of the changes between both periods.

In [None]:
#plot hpd by complaint type:

#1. for 2020
ax = hpd20.plot(column='Complaint Type',legend=True,figsize=(12, 12), alpha = 0.6,markersize=2)
#we can also visualize HPD complaints based on the complaint type: 
plt.title('311 HPD complaints March 2020 by Complaint type')
plt.axis('off')
plt.show()


#2. for 2021 
ax = hpd21.plot(column='Complaint Type',legend=True,figsize=(12, 12),alpha = 0.6,markersize=2)
#we can also visualize HPD complaints based on the complaint type: 
plt.title('311 HPD complaints March 2021 by Complaint type')
plt.axis('off')
plt.show()

## Question 4: 

What information can we take away from these two maps? 

# Another way to look into the complaint types: 
On top of visualizing the data we can also look into the number of complaints of each type. An easy way to do so is the "groupby" command. This is a pretty simple command that has a lot of options (more about it in other classes!). 

The main thing to know about "groupby" right now is that it operates on a data frame and does 3 main things: 

1. Split: takes the data and splits it according to the grouping condition 
2. Apply: calculates what we want for each group: sum, mean, count etc.
3. Combine: combines the results into a new table, one row per group 
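
The three steps can be spelled out by hand on a small made-up table and checked against the one-line version. The complaint type names and the variables `manual` and `one_line` below are just for illustration.

```python
import pandas as pd

# Toy complaints table (made-up rows)
toy = pd.DataFrame({
    "Unique Key": [1, 2, 3, 4, 5],
    "Complaint Type": ["HEAT/HOT WATER", "PLUMBING", "HEAT/HOT WATER",
                       "PAINT/PLASTER", "HEAT/HOT WATER"],
})

# The three steps done by hand
manual = {}
for complaint_type, group in toy.groupby("Complaint Type"):  # 1. split
    manual[complaint_type] = len(group)                      # 2. apply (count rows)
manual = pd.Series(manual)                                   # 3. combine

# The same result in one line
one_line = toy.groupby("Complaint Type").size()
print(one_line)
```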




In [None]:
hpd20['Complaint Type'].unique()

In [None]:
len(hpd20)

In [None]:
len(hpd21)

In our case we will group by complaint type and count, so that Python will Split the data according to each type of complaint (hot water, windows etc). Then it will Apply, meaning it will count how many complaints of each type the data has. Finally, Python will Combine the results into a new table. So in our case that would be the number of complaints per complaint type. Note that by doing so, our data frame structure will change completely: each row will represent a complaint type, and the data in the cells will be the count of how many of them there are in our data.  All that in one line of code :-) 
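
One detail worth knowing: "count" counts the non-missing values in every column separately, so a column with gaps can show a smaller number than a column that is always filled in. A small made-up example (the `Latitude` gaps are invented) shows the difference next to "size", which counts rows per group:

```python
import pandas as pd

# Toy table: two of the PLUMBING rows have no latitude (made-up values)
toy = pd.DataFrame({
    "Unique Key": [1, 2, 3, 4],
    "Complaint Type": ["PLUMBING", "PLUMBING", "HEAT/HOT WATER", "PLUMBING"],
    "Latitude": [40.73, None, 40.78, None],
})

# count(): one number per column, missing values are skipped
per_column = toy.groupby("Complaint Type").count()

# size(): one number per group, every row is counted
per_group = toy.groupby("Complaint Type").size()

print(per_column)
print(per_group)
```

This is why picking a column that is filled in for every complaint, or using "size", gives the number of complaints per type.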

In [None]:
#groupby hpd complaints 

#1. for 2020
hpd20_count_type = hpd20.groupby(['Complaint Type']).count()
#2. for 2021
hpd21_count_type = hpd21.groupby(['Complaint Type']).count()

In [None]:
#look at our new data for 2020 
hpd20_count_type.head(10)

In [None]:
#and for 2021
hpd21_count_type.head()

In [None]:
#every column now holds a count per complaint type, so we keep only one of them:
#'Unique Key', which should be filled in for every complaint

hpd20_count_type = hpd20_count_type['Unique Key']
hpd21_count_type = hpd21_count_type['Unique Key']


In [None]:
hpd20_count_type.head(10)

In [None]:
#now let's see the most common HPD complaints for both March 2020 and March 2021:

# sort data from the most to the least common complaint type

hpd20_count_type = hpd20_count_type.sort_values(ascending=False)

hpd21_count_type = hpd21_count_type.sort_values(ascending=False)

In [None]:
#5 most common complaints in 2020 were: 
hpd20_count_type.head(5)

In [None]:

#5 most common complaints in 2021  were: 
hpd21_count_type.head(5)

## Assignment:

Your turn: 

So far we worked on the HPD data. 

We will now divide into groups, and each group will look into another agency's complaints: 

# Group 1: NYPD
# Group 2: DOT 
# Group 3: DEP 
# Group 4: DSNY 
# Group 5: DOHMH 

For each group: 

1. Please filter the subset of the data that has *YOUR* Agency name
2. Plot, summarize and groupby the data for both 2020 and 2021 
 
Deliver:  
- a. What are the patterns in *YOUR* agency's complaints between the 2020 and 2021 data? 
- b. What geographical patterns are you seeing when comparing both years?

In class: present your main findings. For your homework: submit your Jupyter notebook. In addition, write a short summary of your findings in your NYU Classes submission. 



In [None]:
#your code... 