# Homework 2

In [1]:
import ipytest

ipytest.autoconfig()

## Coding: Using keywords to categorize 311 requests

**Problem Statement:** When you read through the `descriptor` and `resolution_description` columns in the 311 data, you will see that complaints related to graffiti are actually scattered throughout multiple `complaint_type` categories. We want to identify all complaints related to graffiti and see which community districts have the most instances of graffiti.

To help make this assignment easier, there's a smaller subset of the 311 data for you to use:

https://storage.googleapis.com/python-public-policy/data/cleaned_311_data_hw2.csv.zip

This smaller dataset only contains ~65,000 records from relevant complaint type categories, and has columns renamed to be lowercase and underscored.

### Hints

- You can adapt the `recode_borocd_counts()` example from [Lecture 2](lecture_2.ipynb) for this problem.
- You may run into issues with empty values; see [how to deal with null values in pandas](https://medium.com/geekculture/dealing-with-null-values-in-pandas-dataframe-1a67854fe834).
- See [ways to do case-insensitive string comparison in pure Python](https://www.geeksforgeeks.org/case-insensitive-string-comparison-in-python/); the same approaches carry over to pandas.
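
To make the last two hints concrete, here is a minimal sketch of a case-insensitive substring check that also tolerates missing values. The function name `mentions_keyword` and the sample strings are hypothetical and only for illustration; this is one possible approach, not the required one.

```python
import pandas as pd


def mentions_keyword(text, keyword):
    """Return True if keyword appears anywhere in text, ignoring case."""
    # Missing values (None, NaN) cannot contain the keyword.
    if pd.isna(text):
        return False
    # Lowercase both sides so capitalization does not matter.
    return keyword.lower() in str(text).lower()


print(mentions_keyword("Loud MUSIC after midnight", "music"))
print(mentions_keyword(float("nan"), "music"))
```

The `in` operator checks for a substring, so the keyword can sit anywhere inside a longer string.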

### Step 0

Load the data.
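
As a rough sketch of how this can work: `pd.read_csv()` can read a zipped CSV directly from a path or URL, as long as the archive holds a single file. The example below builds a tiny zip file locally so it runs offline; the file name `toy.csv.zip` and its contents are made up for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny zipped CSV so the sketch runs offline (hypothetical toy data).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy.csv.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("toy.csv", "id,complaint_type\n1,Noise\n2,Rodent\n")

# read_csv infers zip compression from the ".zip" extension;
# the archive must contain exactly one file.
toy = pd.read_csv(zip_path)
print(toy.shape)
```

The same call should accept a URL ending in `.zip` in place of the local path.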

In [2]:
import pandas as pd

In [3]:
# your code here

### Step 1

Create a `flag_graffiti` function that checks each row in the 311 dataframe to see if the word "graffiti" is present in the `complaint_type`, `descriptor`, and/or `resolution_description`. Any of the columns may contain the word, so you should check all of them. If the word "graffiti" is found, the function should return the boolean value `True`. If "graffiti" is not found, the function should return the boolean value `False`.

#### Hints

- Make sure to look for "graffiti" _in_ those strings. The strings may contain more than just that word.
- Capitalization may be inconsistent.

In [4]:
# your code here

Test by passing in a fake row.

In [5]:
%%ipytest --tb=short


def test_complaint_type():
    test_row = pd.Series({
        "complaint_type": "graffiti",
        "descriptor": "",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == True

    
def test_descriptor():
    test_row = pd.Series({
        "complaint_type": "",
        "descriptor": "graffiti",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == True

    
def test_description():
    test_row = pd.Series({
        "complaint_type": "",
        "descriptor": "",
        "resolution_description": "graffiti"
    })
    assert flag_graffiti(test_row) == True

    
def test_none():
    test_row = pd.Series({
        "complaint_type": "",
        "descriptor": "",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == False


def test_mixed_cases():
    test_row = pd.Series({
        "complaint_type": "GrafFiti",
        "descriptor": "",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == True

    
def test_substring():
    test_row = pd.Series({
        "complaint_type": "",
        "descriptor": "there's graffiti on the wall",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == True

[31mF[0m[31mF[0m[31mF[0m[31mF[0m[31mF[0m[31mF[0m[31m                                                                                       [100%][0m
[31m[1m_______________________________________ test_complaint_type ________________________________________[0m
[1m[31m/var/folders/kg/1ys0dccx4237f5wsd_w10dt80000gn/T/ipykernel_12687/623815755.py[0m:7: in test_complaint_type
    [94massert[39;49;00m flag_graffiti(test_row) == [94mTrue[39;49;00m
[1m[31mE   NameError: name 'flag_graffiti' is not defined[0m
[31m[1m_________________________________________ test_descriptor __________________________________________[0m
[1m[31m/var/folders/kg/1ys0dccx4237f5wsd_w10dt80000gn/T/ipykernel_12687/623815755.py[0m:16: in test_descriptor
    [94massert[39;49;00m flag_graffiti(test_row) == [94mTrue[39;49;00m
[1m[31mE   NameError: name 'flag_graffiti' is not defined[0m
[31m[1m_________________________________________ test_description ________________________________

### Step 2

Apply the function created in Step 1 to the 311 dataframe and create a new column called `graffiti_flag` that captures the output from the function.

Tip: There are two checks you can use to confirm that the function worked as expected.

- Make sure some records have `graffiti_flag` set to `True`.
- Make sure that more than one `complaint_type` has records with `graffiti_flag` set to `True` (and likewise for `False`).
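
The general pattern, applying a row-wise function and then running the two checks, might look like the sketch below. It uses a made-up toy dataframe and a hypothetical `flag_music` function rather than the 311 data, so treat it as an illustration of the technique only.

```python
import pandas as pd

# Hypothetical toy data, not the 311 dataset.
toy = pd.DataFrame({
    "category": ["Noise", "Noise", "Street", "Street"],
    "notes": ["loud music", "barking dog", "music from car", "pothole"],
})


def flag_music(row):
    """Row-wise check: does the notes field mention music?"""
    return "music" in row["notes"].lower()


# axis=1 passes each row to the function as a Series.
toy["music_flag"] = toy.apply(flag_music, axis=1)

# Check 1: how many rows got each flag value?
print(toy["music_flag"].value_counts())

# Check 2: how do the flag values spread across categories?
print(pd.crosstab(toy["category"], toy["music_flag"]))
```

If the crosstab shows `True` in only one category, or in none, that is a sign the function is not checking every column it should.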

In [6]:
# your code here

In [7]:
%%ipytest --tb=short

def test_graffiti_flag():
    assert 'graffiti_flag' in df.columns, "column missing"
    assert df.dtypes['graffiti_flag'] == 'bool', "column should be booleans"

[31mF[0m[31m                                                                                            [100%][0m
[31m[1m________________________________________ test_graffiti_flag ________________________________________[0m
[1m[31m/var/folders/kg/1ys0dccx4237f5wsd_w10dt80000gn/T/ipykernel_12687/4193169000.py[0m:2: in test_graffiti_flag
    [94massert[39;49;00m [33m'[39;49;00m[33mgraffiti_flag[39;49;00m[33m'[39;49;00m [95min[39;49;00m df.columns, [33m"[39;49;00m[33mcolumn missing[39;49;00m[33m"[39;49;00m
[1m[31mE   NameError: name 'df' is not defined[0m
FAILED tmpkp8bxd_c.py::test_graffiti_flag - NameError: name 'df' is not defined
[31m[31m[1m1 failed[0m[31m in 0.01s[0m[0m


### Step 3

Create another dataframe `df_graffiti` that only contains records where `graffiti_flag` is `True`.
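
Filtering on a boolean column is usually done with a boolean mask. A minimal sketch on made-up data (the `is_open` column and `toy_open` name are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the 311 dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "is_open": [True, False, True, False],
})

# Indexing with a boolean Series keeps only the rows where it is True.
# .copy() makes the result independent of the original dataframe.
toy_open = toy[toy["is_open"]].copy()
print(toy_open)
```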

In [8]:
# your code here

In [9]:
%%ipytest --tb=short

def test_all_have_graffiti():
    assert df_graffiti['graffiti_flag'].all(), "not all have graffiti_flag set to True"

[31mF[0m[31m                                                                                            [100%][0m
[31m[1m______________________________________ test_all_have_graffiti ______________________________________[0m
[1m[31m/var/folders/kg/1ys0dccx4237f5wsd_w10dt80000gn/T/ipykernel_12687/601934799.py[0m:2: in test_all_have_graffiti
    [94massert[39;49;00m df_graffiti[[33m'[39;49;00m[33mgraffiti_flag[39;49;00m[33m'[39;49;00m].all(), [33m"[39;49;00m[33mnot all have graffiti_flag set to True[39;49;00m[33m"[39;49;00m
[1m[31mE   NameError: name 'df_graffiti' is not defined[0m
FAILED tmp29s3ixf3.py::test_all_have_graffiti - NameError: name 'df_graffiti' is not defined
[31m[31m[1m1 failed[0m[31m in 0.01s[0m[0m


### Step 4

Group `df_graffiti` by `community_board` to get the count of requests per community district. Identify which community district has the highest count.
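
One common way to count rows per group and find the largest group is `groupby()` followed by `size()`. The sketch below uses a made-up `district` column rather than the real data:

```python
import pandas as pd

# Hypothetical toy data, not the 311 dataset.
toy = pd.DataFrame({
    "district": ["A", "B", "A", "C", "A", "B"],
})

# size() counts the rows in each group; sorting puts the largest first.
counts = toy.groupby("district").size().sort_values(ascending=False)
print(counts)

# idxmax() returns the index label (here, the district) with the highest count.
top_district = counts.idxmax()
print(top_district)
```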

In [10]:
# your code here

### Bonus 1

_0.2 points_

Create a `graffiti_flag2` column using only built-in pandas operations, i.e. without using a custom function (`def`). Another way to think about it: Instead of operating on a single row at a time, how can you operate across entire columns? See [working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) for clues.
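
To illustrate what operating across entire columns can look like, here is a small sketch using the `.str` accessor on made-up data. The column names and keyword are hypothetical, and this is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical toy data, not the 311 dataset.
toy = pd.DataFrame({
    "title": ["Loud MUSIC", None, "pothole"],
    "notes": ["", "music from a car", None],
})

# case=False ignores capitalization; na=False treats missing values as no match.
in_title = toy["title"].str.contains("music", case=False, na=False)
in_notes = toy["notes"].str.contains("music", case=False, na=False)

# | combines the two boolean Series element by element (logical OR).
toy["music_flag"] = in_title | in_notes
print(toy)
```

Each `.str.contains()` call processes a whole column at once, so no `def` or `apply()` is needed.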

In [11]:
# your code here

### Bonus 2

_0.2 points_

Clean another column of the dataset. Include an explanation and the code showing how you got there.

In [12]:
# your code here

Now [turn in the assignment](https://python-public-policy.afeld.me/en/latest/README.html#turning-in-assignments).

## Tutorials

- [Pythonic Data Cleaning With Pandas and NumPy](https://realpython.com/python-data-cleaning-numpy-pandas/)
- ["You're Not Mapping Rats, You're Mapping Gentrification"](https://theconcourse.deadspin.com/you-re-not-mapping-rats-you-re-mapping-gentrification-1835005060), an article about bias in 311 data
- [Read about the Spatial Equity Data Tool](https://medium.com/@urban_institute/introducing-a-spatial-equity-data-tool-b959c40298cf)
- [Intro to Plotly Express](https://plotly.com/python/plotly-express/). You don't have to work through every one of these examples; just review to get familiar with what types of charts are possible.

### Optional

- [Python Tools for Record Linkage](https://pbpython.com/record-linking.html)
- [Reshaping and pivot tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)
- [How to reshape the layout of tables](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/07_reshape_table_layout.html)

## Participation

Reminder about the [between-class participation requirement](https://python-public-policy.afeld.me/en/latest/syllabus.html#participation).