# Module 11 - Cleaning and Visualizing Data with Pandas and Matplotlib


**_Author: Jessica Cervi_**

**Expected time = 3 hours**

**Total points =  55 points**



    
## Assignment Overview

In this assignment, we will use `pandas` to clean the data in a given dataframe. After performing some basic exploratory data analysis, we will use several `pandas` functions to profile and understand our data. Finally, we will convert some values in our dataframe into a more convenient format and filter our data to obtain a dataframe with only meaningful data.

The last part of the assignment is designed as a learning experience for `folium`, a `gmaps` alternative.


This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question. 
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 


### Learning Objectives

- Clean, filter, and group data with Pandas
- Visualize data efficiently in Pandas





## Index: 

#### Module 11: Cleaning and Visualizing Data with Pandas and Matplotlib

- [Question 1](#q1)
- [Question 2](#q2)
- [Question 3](#q3)
- [Question 4](#q4)
- [Question 5](#q5)
- [Question 6](#q6)
- [Question 7](#q7)
- [Question 8](#q8)
- [Question 9](#q9)
- [Question 10](#q10)

## The Dataset


For this assignment, we will be using a dataset similar to the one in the lectures, which provides a sample of the 311 calls in New York City. The full dataset is very large, with more than 1,000,000 calls for 2019. For this reason, we have selected a random sample of 2019's calls to make the dataset more manageable. The complete dataset can be explored [here](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9).

If you want to download the dataset separately, you will need to download the repository by clicking on 'clone or download' and then clicking 'download as zip':

<img src="./images/clone.png" width="300px">

When the repository has been downloaded, search for `nyc_311_data_subset-2.csv` and place the file in a directory named *data* inside the folder where this Jupyter Notebook is saved.


#### Reading in the Dataset

As usual, we will begin by importing the necessary libraries and reading the dataset into a dataframe called `nyc_311` with the `pd.read_csv()` function, passing the following keyword arguments:

- `index_col`: By specifying the index column, we select a column from our source dataset to be used as the dataframe index instead of the auto-generated `pandas.core.indexes.range.RangeIndex`. This helps us keep a record of the rows of our source dataset, which is a helpful reference if the dataset is changed (e.g. if someone else copies it, removes some rows, and then overwrites it). If importing from a database export file, this is usually the primary key of the table.

- `dtype`: When we import a csv file, pandas tries to guess the type of each column. The `landmark`, `vehicle_type` and `incident_zip` columns contain mixed types (i.e. strings, floats, and ints in the same column), so pandas will raise a warning. We can resolve the warning by telling pandas which data type to expect with this argument, in this case `object` for all three columns. We could attempt to convert `incident_zip` from string to integer with the `converters` keyword argument and a function containing a try/except block, but we'll be doing some string processing on that column later, so we'll leave it as a string for our import.
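To make the difference between `dtype` and `converters` concrete, here is a minimal sketch on a made-up three-row CSV. The CSV text and the helper `zip_to_int` are hypothetical and exist only for illustration; the assignment itself uses the `dtype` approach.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file.
csv_text = "id,incident_zip,landmark\n0,11432,\n1,11432-1234,CITY HALL\n2,unknown,\n"

def zip_to_int(value):
    """Return the ZIP as an int, or None when the raw string cannot be parsed."""
    try:
        return int(value)
    except ValueError:
        return None

# dtype keeps the column as text; index_col reuses the file's own row ids.
as_text = pd.read_csv(io.StringIO(csv_text), index_col=0,
                      dtype={"incident_zip": object, "landmark": object})

# converters runs zip_to_int on every raw string in the column instead.
converted = pd.read_csv(io.StringIO(csv_text), index_col=0,
                        converters={"incident_zip": zip_to_int})

print(as_text["incident_zip"].tolist())
print(converted["incident_zip"].tolist())
```

With `dtype`, the ZIP+4 string survives untouched and can be cleaned later. With the converter, anything that is not a plain integer is already lost at import time.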


We will begin to clean and visualize data with `pandas`, after downloading and reading in the dataset. We will start by performing exploratory data analysis to understand and profile the missing values.


**Note: Most of the questions in this assignment are connected and need to be solved in sequence.**

In [2]:
%matplotlib inline 
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [3]:
nyc_311 = pd.read_csv('data/nyc_311_data_subset-2.csv',index_col=0,dtype={'landmark': object,'vehicle_type': object,
        'incident_zip': object})

In [4]:
nyc_311.head()

Unnamed: 0,address_type,agency,agency_name,bbl,borough,bridge_highway_direction,bridge_highway_name,bridge_highway_segment,city,closed_date,...,resolution_description,road_ramp,status,street_name,taxi_company_borough,taxi_pick_up_location,unique_key,vehicle_type,x_coordinate_state_plane,y_coordinate_state_plane
0,ADDRESS,NYPD,New York City Police Department,4102260000.0,QUEENS,,,,JAMAICA,2019-01-06T13:06:59.000,...,The Police Department responded to the complai...,,Closed,105 AVENUE,,,41352433,,1043267.0,194959.0
1,ADDRESS,NYPD,New York City Police Department,4100860000.0,QUEENS,,,,JAMAICA,2019-01-12T23:20:05.000,...,The Police Department responded to the complai...,,Closed,WALTHAM STREET,,,41407968,,1039186.0,191847.0
2,ADDRESS,DFTA,Department for the Aging,4066900000.0,QUEENS,,,,FLUSHING,2019-02-21T09:37:06.000,...,The Department for the Aging contacted you and...,,Closed,,,,41658034,,,
3,ADDRESS,NYPD,New York City Police Department,4137450000.0,QUEENS,,,,ROSEDALE,2019-05-05T02:20:18.000,...,The Police Department responded to the complai...,,Closed,148 AVENUE,,,42587192,,1055616.0,177905.0
4,ADDRESS,DEP,Department of Environmental Protection,1008398000.0,MANHATTAN,,,,NEW YORK,2019-04-10T11:16:00.000,...,The Department of Environmental Protection det...,,Closed,WEST 38 STREET,,,42180774,,988485.0,213174.0


[Back to top](#Index:) 
<a id='q1'></a>

### Question 1:

*5 points*
    

Use the `.info()` method to examine the number of non-null values in the `bridge_highway_segment` column. Note that `.info()` prints its summary and returns `None`, so read the count from the printed output and save it as an integer to `ans1` below.
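As a small illustration on a hypothetical toy dataframe (not the assignment data), the sketch below shows that `.info()` only prints, and that `.count()` is one way to cross-check the non-null count you read from the printout.

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing value.
toy = pd.DataFrame({
    "agency": ["NYPD", "DEP", "DOT", "NYPD"],
    "road_ramp": [None, "Ramp", None, None],
})

printed = toy.info()                      # prints the summary table, returns None
non_null = int(toy["road_ramp"].count())  # count() skips missing values

print(printed, non_null)
```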

In [5]:
### GRADED

### YOUR SOLUTION HERE
ans1 = nyc_311.info()

###
### YOUR CODE HERE
###


<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 43 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   address_type                    97774 non-null   object 
 1   agency                          100000 non-null  object 
 2   agency_name                     100000 non-null  object 
 3   bbl                             82780 non-null   float64
 4   borough                         100000 non-null  object 
 5   bridge_highway_direction        197 non-null     object 
 6   bridge_highway_name             197 non-null     object 
 7   bridge_highway_segment          251 non-null     object 
 8   city                            96353 non-null   object 
 9   closed_date                     95178 non-null   object 
 10  community_board                 100000 non-null  object 
 11  complaint_type                  100000 non-null  object 
 12  created_date     

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q2'></a>

### Question 2:

*5 points*
    

Recall that the `.isnull()` method returns boolean values as to whether or not a value is null.  We can find the total number of null values by summing the values returned by the `.isnull()` method.


Use the `.isnull()` method together with the `.sum()` method to return a series of null counts for each column.
Turn these values into a percentage and save your series to `ans2` below.
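The pattern can be sketched on a hypothetical toy dataframe as follows. Dividing by the number of rows gives a fraction between 0 and 1, which matches the scale used in the `mostly_missing` check output later in this notebook; that this is the scale expected here is an assumption.

```python
import pandas as pd

# Hypothetical toy frame with two missing city values.
toy = pd.DataFrame({
    "city": ["JAMAICA", None, "FLUSHING", None],
    "status": ["Closed", "Closed", "Open", "Closed"],
})

null_counts = toy.isnull().sum()     # Series: number of nulls per column
null_share = null_counts / len(toy)  # fraction of rows that are null

print(null_share)
```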


In [None]:
### GRADED

### YOUR SOLUTION HERE
ans2 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q3'></a>

### Question 3:

*10 points*
    

**Data profiling** is the process of examining the data available from an existing [information source](https://en.wikipedia.org/wiki/Data_profiling).

We saw in the previous problem that our data has missing values. We now want to write a function that takes in a `dataframe` and returns a `series` of only those columns with more than x% of their data missing (defaulting to 50%, or 0.5 as a float). Each value in the returned `series` is a float giving the fraction of missing data for the column named in the corresponding index entry.


Define a function `mostly_missing` that takes, as input, a dataframe and a float `level = 0.5`. This function
returns a series with the values for the features missing a larger percentage than the threshold value.
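The thresholding step relies on boolean filtering of a series. A minimal sketch, using a hypothetical series of missing-value fractions rather than the real data:

```python
import pandas as pd

# Hypothetical missing-value fractions, indexed by column name.
missing_share = pd.Series({"landmark": 0.99, "city": 0.04, "due_date": 0.59})

level = 0.5
# The comparison builds a boolean mask; indexing with it keeps matching entries.
above = missing_share[missing_share > level]

print(above)
```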


In [None]:
### GRADED

### YOUR SOLUTION HERE
def mostly_missing(df, level = 0.5):
    '''
    This function accepts a DataFrame
    and returns a series with the values
    for the features missing a larger percentage
    than the threshold value.
    
    -------------------
    Check:
    mostly_missing(nyc_311)
    
    returns ===>>>
    
    bridge_highway_direction    0.99803
    bridge_highway_name         0.99803
    bridge_highway_segment      0.99749
    due_date                    0.59381
    intersection_street_1       0.89402
    intersection_street_2       0.89468
    landmark                    0.99976
    road_ramp                   0.99803
    taxi_company_borough        0.99929
    taxi_pick_up_location       0.99681
    vehicle_type                0.99995
    '''
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q4'></a>

### Question 4:

*10 points*
    
In addition to deciding how to handle our missing data, we must work with `datetime` objects to ensure they are useful for analysis. To avoid repeating operations, we will write a function that tests whether a column name contains the word `date` and then attempts the `datetime` conversion on those column(s). When completed, the dataframe is returned with every column whose name contains `date` converted to a datetime type.

*Note*: Explore the Pandas file reader methods such as the `.read_csv()` method.  There is a default datetime conversion argument that sometimes is an easier approach.

Define a function `date_timer` that takes, as input, a dataframe. Your function should look for any column containing the word "date". It should convert the data in these columns to datetime values using the function `pd.to_datetime()` and return the updated dataframe `df_copy`.
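The conversion itself can be sketched on a hypothetical two-row dataframe. The `errors="coerce"` argument is an optional choice made for this sketch; it is not required by the question.

```python
import pandas as pd

# Hypothetical toy frame; only one column name contains "date".
toy = pd.DataFrame({
    "created_date": ["2019-01-06T13:06:59.000", "2019-01-12T23:20:05.000"],
    "agency": ["NYPD", "DFTA"],
})

converted = toy.copy()  # leave the original frame untouched
for col in converted.columns:
    if "date" in col:
        # errors="coerce" turns unparseable entries into NaT instead of raising
        converted[col] = pd.to_datetime(converted[col], errors="coerce")

print(converted.dtypes)
```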

In [None]:
### GRADED

### YOUR SOLUTION HERE
def date_timer(df):
    '''
    This function takes in a DataFrame
    and looks for any column containing the word "date".
    These columns are changed to datetime datatypes where possible,
    and the new updated dataframe is returned.
    
    **HINT**
    Your new dataframe should contain four features that
    are datetime datatypes.
    '''
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q5'></a>

### Question 5:

*5 points*
    
Now that we have columns in `datetime` format, we can create a new column named `time_to_close`. To do so, we will take the difference between the `closed_date` and `created_date` columns.

Create a new column `time_to_close` that has, as entries, the difference between `closed_date` and `created_date` in `df_copy`. Save your new dataframe including the new column to `ans5` below.
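Subtracting one datetime column from another yields a timedelta column. A minimal sketch on hypothetical timestamps:

```python
import pandas as pd

# Hypothetical opening and closing times for two requests.
toy = pd.DataFrame({
    "created_date": pd.to_datetime(["2019-01-01 08:00", "2019-01-02 09:30"]),
    "closed_date": pd.to_datetime(["2019-01-01 12:00", "2019-01-05 09:30"]),
})

# Element-wise subtraction; the result has a timedelta64 dtype.
elapsed = toy["closed_date"] - toy["created_date"]

print(elapsed)
```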



In [None]:
### GRADED

### YOUR SOLUTION HERE

df_copy['time_to_close'] = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q6'></a>

### Question 6:

*5 points*
    
Now that we have our new `time_to_close` column, we can explore it using the `.nlargest()` method, to examine the 10 longest closing times from the 2019 311 sample. 


Use your dataframe with the `time_to_close` column and the `.nlargest()` method to select the 10 longest closing times in our data from `df_copy`. Save your results as a dataframe to `ans6` below.
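For reference, `.nlargest()` also works on timedelta columns. The sketch below uses a hypothetical toy dataframe and a made-up column name `wait`:

```python
import pandas as pd

# Hypothetical toy frame with a timedelta column.
toy = pd.DataFrame({
    "unique_key": [101, 102, 103, 104],
    "wait": pd.to_timedelta(["2 days", "5 hours", "9 days", "1 days"]),
})

# nlargest(n, column) returns the n rows with the largest values, largest first.
longest = toy.nlargest(2, "wait")

print(longest)
```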

In [None]:
### GRADED

### YOUR SOLUTION HERE

ans6 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q7'></a>

### Question 7:

*5 points*
    
The `.groupby()` method works on a categorical column to group like values.  Once grouped, we will apply a function to each of these groups.  For example,

```python
nyc_311.groupby('agency_name').size().sort_values(ascending = False)
```
returns a series of counts of observations within each agency.  

Use the code example above to determine the 5 agency names with the most observations in our dataset `df_copy`. Save your results as a series to `ans7` below. Sort the values in descending order.



In [None]:
### GRADED

### YOUR SOLUTION HERE

ans7 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q8'></a>

### Question 8:

*5 points*
    
When we group by multiple features, we are returned a multi-index series.  To get back to a familiar `dataframe` object, we use the `.unstack()` method.  For example:

```python
agency_borough = nyc_311.groupby(['agency', 'borough']).size().unstack()
```

This will first group our data by `agency`, and then by `borough`, counting the number of occurrences for each agency and borough combination. Run the code above on `df_copy` and examine the output. Use the `.loc[]` indexer to locate the number of DOE incidents in QUEENS. Save your solution to `ans8` as a float.
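A minimal sketch of the same steps on a hypothetical five-row dataframe shows how `.loc[]` takes a row label and a column label, and why the counts come back as floats: combinations that never occur become `NaN` after `.unstack()`, which forces a float column.

```python
import pandas as pd

# Hypothetical toy frame; DOE never appears in BRONX.
toy = pd.DataFrame({
    "agency": ["NYPD", "NYPD", "DOE", "NYPD", "DOE"],
    "borough": ["QUEENS", "BRONX", "QUEENS", "QUEENS", "QUEENS"],
})

counts = toy.groupby(["agency", "borough"]).size()  # MultiIndex Series
table = counts.unstack()  # agencies as rows, boroughs as columns

# .loc uses square brackets: [row label, column label]
nypd_queens = table.loc["NYPD", "QUEENS"]

print(table)
```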

In [None]:
### GRADED

### YOUR SOLUTION HERE

ans8 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q9'></a>

### Question 9:

*5 points*
    

As in the lecture, we have certain problems with our `incident_zip` column. The function below is from the lecture and demonstrates an approach to cleaning our zip codes.

```python
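def fix_zip(input_zip):
    # First try: plain numeric input such as '11432' or 11432.0
    try:
        input_zip = int(float(input_zip))
    except (ValueError, TypeError, OverflowError):
        # Second try: ZIP+4 strings such as '11432-1234'
        try:
            input_zip = int(input_zip.split('-')[0])
        except (ValueError, AttributeError):
            return np.nan
    # Keep only five-digit codes in the 10000-19999 range
    if input_zip < 10000 or input_zip > 19999:
        return np.nan
    return str(input_zip)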
```


Use the function fix_zip above to clean the `incident_zip` column in `df_copy`. Use the `.apply()` method to clean the feature and save your cleaned series to `ans9` below.
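To see what the cleaning function does to different kinds of input, the sketch below applies it to a hypothetical series of messy values. The function is reproduced here only so that the example is self-contained, and `np.nan` is used as the missing-value marker.

```python
import numpy as np
import pandas as pd

def fix_zip(input_zip):
    # Reproduced from the lecture function for a self-contained example.
    try:
        input_zip = int(float(input_zip))
    except (ValueError, TypeError, OverflowError):
        try:
            input_zip = int(input_zip.split('-')[0])
        except (ValueError, AttributeError):
            return np.nan
    if input_zip < 10000 or input_zip > 19999:
        return np.nan
    return str(input_zip)

# Hypothetical messy inputs: plain, ZIP+4, out of range, text, float, missing.
raw = pd.Series(["10001", "11234-5678", "00083", "unknown", 11201.0, np.nan])

# .apply() calls fix_zip once per element and collects the results.
cleaned = raw.apply(fix_zip)

print(cleaned)
```

Valid codes come back as five-character strings, and everything else becomes a null value that can be filtered out afterwards.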

In [None]:
### GRADED

### YOUR SOLUTION HERE

ans9 = None 

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q10'></a>

### Question 10:

*10 points*
    
Now that we have cleaned up the zip codes, let's drop the rows whose zip code is still null.

Create a dataframe that contains only the rows of `df_copy` whose zip code is non-null. Save the dataframe to `ans10` below.
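One way to keep only the rows with a non-null value in a given column is a boolean mask built with `.notnull()`. A minimal sketch on a hypothetical toy dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing zip code.
toy = pd.DataFrame({
    "incident_zip": ["11432", np.nan, "10018"],
    "agency": ["NYPD", "DFTA", "DEP"],
})

# The mask is True where a zip code is present; indexing keeps those rows.
kept = toy[toy["incident_zip"].notnull()]

print(len(toy), len(kept))
```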

In [None]:
### GRADED

### YOUR SOLUTION HERE

ans10 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Part II: Mapping with `folium`

As we saw in the lectures,  `gmaps` is a powerful plugin for embedding Google maps in Jupyter notebooks. 
Normally, to use `gmaps`, you would need to get an API key which prevents us from immediately making maps here.  Instead, we will use another library called `folium`. If you do not have `folium` installed, you can install it from within your notebook by typing:

```python
!pip install folium
```
and executing the cell.

- **NOTE:** The following questions are not graded.
- You can apply the techniques below to any data with latitude and longitude!

#### A Basic Map

We will create a basic map centered at the first latitude/longitude pair in the dataset. This is a demonstration meant only for exploration.

In [None]:
import folium

In [None]:
nyc_311[['latitude', 'longitude']]  = nyc_311[['latitude', 'longitude']].astype('float')
start = nyc_311.loc[0, ['latitude', 'longitude']]

In [None]:
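#select the coordinates by label; positional access such as start[0]
#is deprecated for label-indexed Series in recent pandas
m = folium.Map(location = (start['latitude'], start['longitude']))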

In [None]:
m

#### Adding a Marker

We can add a marker at the first location using the `folium.Marker` class. Here, we provide the location and add the marker to our map `m` with the `.add_to()` method.

In [None]:
folium.Marker(location=(start['latitude'], start['longitude']),
                 popup = nyc_311.loc[0, 'complaint_type']).add_to(m)

In [None]:
m

### Choropleth Map

Many municipalities make GeoJson files available with information about district boundaries of all kinds.  Below, we use the zip code data from New York City to draw in boundaries by zip codes.  Then, we can bind our data to the map using the zip codes that are housed in the `.json` files. The main idea here is that we will count incidents by zip code and color the map based on this.  

In [None]:
#creating a new DataFrame of complaints by zipcodes
complaints = nyc_311.groupby('incident_zip')[['complaint_type']].size().to_frame()

In [None]:
#renaming the size column
complaints.columns = ['num']

In [None]:
#create a zipcode column
#based on the index
complaints['zip'] = complaints.index

In [None]:
complaints.info()

In [None]:
#create boundaries for values
#to be colored on
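#take the quantiles of the numeric column only; the string 'zip' column
#makes DataFrame.quantile fail in recent pandas
vals = complaints['num'].quantile([.1, .3, .5, .7, .9])
vals = list(vals)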

In [None]:
vals.append(complaints.num.max() + 1)

In [None]:
vals

In [None]:
import json
import requests
#url to import the geojson data from nyc
url = 'http://data.beta.nyc//dataset/3bf5fb73-edb5-4b05-bb29-7c95f4a727fc/resource/6df127b1-6d04-4bb7-b983-07402a2c3f90/download/f4129d9aa6dd4281bc98d0f701629b76nyczipcodetabulationareas.geojson'
#create a dictionary like object from geojson 
geo_json_data = json.loads(requests.get(url).text)
#create our map
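#'Stamen Toner' tiles are no longer bundled with recent folium releases,
#so a built-in tile set is used here instead
m = folium.Map([start['latitude'], start['longitude']], zoom_start=9.5, tiles = 'CartoDB positron')
#recent folium requires every value to fall inside the bins,
#so the smallest count is added as the lowest boundary
bins = sorted(set([complaints['num'].min()] + vals))
#add the boundaries and colors
#(folium.Choropleth replaces the removed Map.choropleth method)
folium.Choropleth(geo_data = geo_json_data, data = complaints, columns = ['zip','num'],
             #this is the key from json dictionary
             key_on = 'feature.properties.postalCode',
             bins = bins,
             #these are ColorBrewer codes
             fill_color= 'BuPu').add_to(m)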



In [None]:
m