
# Answering Data Science Questions

Time for a case study to reinforce all of your learning so far! You'll use all the containers and data types you've learned about to answer several real-world questions about a dataset containing information about crime in Chicago. Have fun!

# Case Study - Counting Crimes

## Data Set Overview

- data is in a CSV file
- data has already been cleaned up
- source: Chicago Open Data Portal https://data.cityofchicago.org/

```csv
Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District
05/23/2016 05:35:00 PM,024XX W DIVISION ST,ASSAULT,SIMPLE,STREET,false,true,14
03/26/2016 08:20:00 PM,019XX W HOWARD ST,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,false,false,24
```



## Part 1 - Step 1
- Read data from the CSV file
- Build a list to hold the data
 
 
```python 
In [1]: import csv

In [2]: csvfile = open('ART_GALLERY.csv', 'r')

In [3]: for row in csv.reader(csvfile):
   ...:     print(row)
```
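A minimal sketch of the same read step applied to the crime data, assuming the column layout shown in the Data Set Overview. The in-memory `sample_csv` is a stand-in for the real file so the snippet runs on its own, and `header` and `crime_rows` are names chosen here for illustration.

```python
import csv
import io

# Stand-in for the real file, using the two sample rows from the overview.
sample_csv = io.StringIO(
    "Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District\n"
    "05/23/2016 05:35:00 PM,024XX W DIVISION ST,ASSAULT,SIMPLE,STREET,false,true,14\n"
    "03/26/2016 08:20:00 PM,019XX W HOWARD ST,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,false,false,24\n"
)

reader = csv.reader(sample_csv)

# Consume the header row first so it never ends up in the data list.
header = next(reader)

# Every remaining row is a list of strings, one per column.
crime_rows = [row for row in reader]

print(header[2])
print(len(crime_rows))
```

With a real file, the same loop can run inside `with open('crime_sampler.csv', newline='') as csvfile:` so the file is closed automatically.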




## Part 1 - Step 2


- Create and use a Counter with a slight twist
- count data by month



```python 
In [1]: from collections import Counter

In [2]: nyc_eatery_count_by_types = Counter(nyc_eatery_types)
```
   
- use the month date part to group and count the data by month
- use date parts for grouping, as in Chapter 4

```python
In [1]: daily_violations = defaultdict(int)

In [2]: for violation in parking_violations:
   ...:     violation_date = datetime.strptime(violation[4], '%m/%d/%Y')
   ...:     daily_violations[violation_date.day] += 1

```
#### twist

- in the date part grouping example, we used a `defaultdict()`
- here we're going to be using a `Counter()`
- incrementing a key works the same way in both, since both are `dict` subclasses that treat a missing key as zero
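The twist can be checked with a small side-by-side sketch. The three date strings are taken from the sample rows in these notes; `by_month_dd` and `by_month_counter` are illustrative names, not part of the course code.

```python
from collections import Counter, defaultdict
from datetime import datetime

dates = [
    '05/23/2016 05:35:00 PM',
    '03/26/2016 08:20:00 PM',
    '05/28/2016 08:00:00 PM',
]
date_format = '%m/%d/%Y %I:%M:%S %p'

by_month_dd = defaultdict(int)
by_month_counter = Counter()

for date_string in dates:
    month = datetime.strptime(date_string, date_format).month
    # The increment looks identical for both containers.
    by_month_dd[month] += 1
    by_month_counter[month] += 1

# Both hold the same month -> count mapping.
print(dict(by_month_dd) == dict(by_month_counter))  # True

# Only the Counter offers most_common().
print(by_month_counter.most_common(1))  # [(5, 2)]
```

One practical difference: reading a missing key from a `Counter` returns 0 without storing the key, while a `defaultdict(int)` stores the key as soon as it is read.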



## Part 1 - Step 3

- extract the data into a dictionary keyed by month
    - each key stores a list of the locations of the crimes that occurred that month
- use `defaultdict()` along with the date component grouping
- group the data by month
- uses the date components we learned about earlier

```python
In [1]: from collections import defaultdict

In [2]: eateries_by_park = defaultdict(list)

In [3]: for park_id, name in nyc_eateries_parks:
   ...:     eateries_by_park[park_id].append(name)
```
## Part 1 - Final

To answer the real question, what are the 5 most common crime locations per month, we'll use a `Counter()` on each month's list in our dictionary to find the answer we are seeking.


- Find 5 most common locations for crime each month.
 
```python
In [1]: print(nyc_eatery_count_by_types.most_common(3))
[('Mobile Food Truck', 114), ('Food Cart', 74), ('Snack Bar', 24)]
```


## EXERCISE 1


```python
# Import the csv module
import csv

# Create the file object: csvfile
csvfile = open('crime_sampler.csv','r')

# Create an empty list: crime_data
crime_data = []

# Loop over a csv reader on the file object
for row in csv.reader(csvfile):

    # Append the date, type of crime, location description, and arrest
    crime_data.append((row[0], row[2], row[4], row[5]))
    
# Remove the first element from crime_data

crime_data.pop(0)

# Print the first 10 records
print(crime_data[:10])

```

- took 1hr to figure out :(

    
```python
# Import necessary modules
from collections import Counter
from datetime import datetime

# Create a Counter Object: crimes_by_month
crimes_by_month = Counter()

# Each crime_data row here holds the full CSV record:
# [date-time, block, type of crime, description, location, arrest (bool), domestic (bool), district (num)]

# Loop over the crime_data list
for i in crime_data:
    
    # Convert the first element of each item into a Python Datetime Object: date
    date = datetime.strptime(i[0], '%m/%d/%Y %I:%M:%S %p')
    
    # Increment the counter for the month of the row by one
    
    # O:10/16/2016 06:00:00 PM, N:2016-10-16 18:00:00
    #print('O:{}, N:{}'.format(i[0],date))
    
    #print(date.month)
    # 1,..12
    crimes_by_month[date.month] += 1
    #crimes_by_month.update(date.month) += 1
    
# Print the 3 most common months for crime
print(crimes_by_month.most_common(3))

```
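A quick check of the format string used above, as a sketch: `%I` reads a 12-hour clock value and `%p` reads the AM/PM marker, so an evening time comes back in 24-hour form. The date string is the one from the comment in the block above.

```python
from datetime import datetime

raw = '10/16/2016 06:00:00 PM'

# %m/%d/%Y is month/day/four-digit year; %I:%M:%S %p is 12-hour time plus AM/PM.
parsed = datetime.strptime(raw, '%m/%d/%Y %I:%M:%S %p')

print(parsed)        # 2016-10-16 18:00:00
print(parsed.month)  # 10
```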



- I had problems setting the key to the month, thinking I had to write one of:
    - `locations_by_month.update(date.month) += row[4]`
    - `locations_by_month[date.month] += row[4]`
- but each value in the dictionary is a list, so you add to it with `.append()`

```python
# Import necessary modules
from collections import defaultdict
from datetime import datetime


# Create a dictionary that defaults to a list: locations_by_month
locations_by_month = defaultdict(list)

# Loop over the crime_data list
for row in crime_data:
    # Convert the first element to a date object
    date = datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p')
    
    # If the year is 2016 
    if date.year == 2016:
        # Set the dictionary key to the month and add the location (fifth element) to the values list
        locations_by_month[date.month].append(row[4])
    
# Print the dictionary
print(locations_by_month)
```
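A small sketch of why `.append()` is the right call here and `+=` is not: on a list, `+=` with a string extends the list one character at a time. The names `appended` and `extended` are illustrative only.

```python
from collections import defaultdict

appended = defaultdict(list)
extended = defaultdict(list)

# append() adds the whole string as a single list element.
appended[5].append('STREET')

# += treats the string as an iterable and adds each character separately.
extended[5] += 'STREET'

print(appended[5])  # ['STREET']
print(extended[5])  # ['S', 'T', 'R', 'E', 'E', 'T']
```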



```python
# Import Counter from collections
from collections import Counter

# Loop over the items from locations_by_month using tuple expansion of the month and locations
for month, locations in locations_by_month.items():
    # Make a Counter of the locations
    location_count = Counter(locations)
    # Print the month 
    print(month)
    # Print the most common location
    print(location_count.most_common(5))
```

## Dictionaries with Time Windows for Keys


# Case Study - Crimes by District and Differences by Block

1. Determine how many crimes occur by district
2. Determine how the types of crimes differ between city blocks

## Part 2 - Step 1

- Read in the CSV data as a dictionary

```python
In [1]: import csv

In [2]: csvfile = open('ART_GALLERY.csv', 'r')

In [3]: for row in csv.DictReader(csvfile):
   ...:     print(row)
```

- Pop the key out of the dict and store the value it returns
- `.pop()` leaves the original dict with everything but that key and its value

```python 
In [1]: galleries_10310 = art_galleries.pop('10310')
```
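The same `.pop()` behavior on a single crime record, as a sketch. The `crime` dict below is a shortened, hand-built version of the first sample row, not the output of `DictReader`.

```python
# Shortened version of the first sample row, built by hand for illustration.
crime = {
    'Date': '05/23/2016 05:35:00 PM',
    'Primary Type': 'ASSAULT',
    'District': '14',
}

# pop() removes the key and hands back its value in one step.
district = crime.pop('District')

print(district)             # 14
print('District' in crime)  # False
print(crime)
```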


## Part 2 - Step 2

### How many crimes occur by district

- Pythonically iterate over the Dictionary
- use `Counter()` and `defaultdict()`
 
```python 
In [1]: for zip_code, galleries in art_galleries.items():
   ...:     print(zip_code)
   ...:     print(galleries)
```
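Applied to the crime data, counting by district could look like the sketch below. It assumes the column names from the Data Set Overview and uses an in-memory stand-in for the file; `district_counts` is an illustrative name. In this tiny sample every district appears only once.

```python
import csv
import io
from collections import Counter

# Stand-in for the real file, using sample rows that appear in these notes.
sample_csv = io.StringIO(
    "Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District\n"
    "05/23/2016 05:35:00 PM,024XX W DIVISION ST,ASSAULT,SIMPLE,STREET,false,true,14\n"
    "03/26/2016 08:20:00 PM,019XX W HOWARD ST,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,false,false,24\n"
    "04/25/2016 03:05:00 PM,001XX W 79TH ST,THEFT,RETAIL THEFT,DEPARTMENT STORE,true,false,6\n"
)

# DictReader yields one dict per row, keyed by the header names.
district_counts = Counter(row['District'] for row in csv.DictReader(sample_csv))

# Iterate over key/value pairs with .items(), as in the example above.
for district, count in district_counts.items():
    print(district, count)
```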




## Wrapping Up

We identify a few blocks that we want to work on and look at how the crimes that occur in these locations differ:
1. take the list of crimes for each block and build a set of its unique crimes
2. find the difference between the unique crime sets using the `.difference()` method

- Use sets for uniqueness

```python
In [1]: cookies_eaten_today = ['chocolate chip', 'peanut butter', 
   ...: 'chocolate chip', 'oatmeal cream', 'chocolate chip']

In [2]: types_of_cookies_eaten = set(cookies_eaten_today)

In [3]: print(types_of_cookies_eaten)
{'chocolate chip', 'oatmeal cream', 'peanut butter'}

# difference() set method, as at the end of Chapter 1
In [1]: cookies_jason_ate.difference(cookies_hugo_ate)
{'oatmeal cream', 'peanut butter'}
```
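A runnable sketch of the `.difference()` call. The contents of the two cookie sets are assumptions made here so the example is complete; they are chosen to match the result shown above. The output is sorted because sets have no fixed order.

```python
# Assumed contents for the two sets (not given in these notes).
cookies_jason_ate = {'chocolate chip', 'oatmeal cream', 'peanut butter'}
cookies_hugo_ate = {'chocolate chip', 'anzac'}

# Items in the first set that are not in the second.
only_jason = cookies_jason_ate.difference(cookies_hugo_ate)

# difference() is not symmetric: swapping the sets gives a different answer.
only_hugo = cookies_hugo_ate.difference(cookies_jason_ate)

print(sorted(only_jason))  # ['oatmeal cream', 'peanut butter']
print(sorted(only_hugo))   # ['anzac']
```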







## EXERCISE 2


```python
# Import the modules this block relies on
import csv
from collections import defaultdict

# Create the CSV file: csvfile
csvfile = open('crime_sampler.csv', 'r')

# Create a dictionary that defaults to a list: crimes_by_district
crimes_by_district = defaultdict(list)

# Loop over a DictReader of the CSV file
for row in csv.DictReader(csvfile):
    # Pop the district from each row: district
    district = row.pop('District')
    # Append the rest of the data to the list for proper district in crimes_by_district
    crimes_by_district[district].append(row)
```
---

```python
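# Import the modules this block relies on
from collections import Counter
from datetime import datetime

# Loop over the crimes_by_district using expansion as district and crimes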
for district, crimes in crimes_by_district.items():
    # Print the district
    print(district)
    
    # Create an empty Counter object: year_count
    year_count = Counter()
    
    # Loop over the crimes:
    for crime in crimes:
        # If there was an arrest
        if crime['Arrest'] == 'true':
            # Convert the Date to a datetime and get the year
            year = datetime.strptime(crime['Date'], '%m/%d/%Y %I:%M:%S %p').year
            # Increment the Counter for the year
            year_count[year] += 1
            
    # Print the counter
    print(year_count)
```


```python
# crimes_by_block is assumed to be already available here:
# a dictionary mapping each block to the list of crime types recorded there

# Create a unique set of crimes for the first block: n_state_st_crimes
n_state_st_crimes = set(crimes_by_block['001XX N STATE ST'])

# Print the set
print(n_state_st_crimes)

# Create a unique set of crimes for the second block: w_terminal_st_crimes
w_terminal_st_crimes = set(crimes_by_block['0000X W TERMINAL ST'])

# Print the set
print(w_terminal_st_crimes)

# Find the differences between the two blocks: crime_differences
crime_differences = n_state_st_crimes.difference(w_terminal_st_crimes)

# Print the differences
print(crime_differences)



```
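The block above uses `crimes_by_block` without showing where it comes from. One way it could be built is sketched below, assuming it maps each block to a list of crime types. The in-memory CSV stands in for the real file and uses sample rows from these notes, so the block names differ from the ones in the exercise.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the real file, using sample rows that appear in these notes.
sample_csv = io.StringIO(
    "Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District\n"
    "05/23/2016 05:35:00 PM,024XX W DIVISION ST,ASSAULT,SIMPLE,STREET,false,true,14\n"
    "03/26/2016 08:20:00 PM,019XX W HOWARD ST,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,false,false,24\n"
)

# Assumed shape: block -> list of crime types recorded on that block.
crimes_by_block = defaultdict(list)
for row in csv.DictReader(sample_csv):
    crimes_by_block[row['Block']].append(row['Primary Type'])

# Same pattern as the exercise: unique crimes per block, then the difference.
division_crimes = set(crimes_by_block['024XX W DIVISION ST'])
howard_crimes = set(crimes_by_block['019XX W HOWARD ST'])

print(division_crimes.difference(howard_crimes))  # {'ASSAULT'}
```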

## Final Thoughts





```python
In [10]: from collections import Counter

In [11]: Counter
Out[11]: collections.Counter

In [12]: crimes = Counter()

In [13]: crime_data = [
    ...:     ['05/23/2016 05:35:00 PM', '024XX W DIVISION ST', 'ASSAULT', 'SIMPLE', 'STREET', 'false', 'true', '14'],
    ...:     ['03/26/2016 08:20:00 PM', '019XX W HOWARD ST', 'BURGLARY', 'FORCIBLE ENTRY', 'SMALL RETAIL STORE', 'false', 'false', '24'],
    ...:     ['04/25/2016 03:05:00 PM', '001XX W 79TH ST', 'THEFT', 'RETAIL THEFT', 'DEPARTMENT STORE', 'true', 'false', '6'],
    ...:     ['04/26/2016 05:30:00 PM', '010XX N PINE AVE', 'BATTERY', 'SIMPLE', 'SIDEWALK', 'false', 'false', '15'],
    ...:     ['06/19/2016 01:15:00 AM', '027XX W AUGUSTA BLVD', 'BATTERY', 'AGGRAVATED: HANDGUN', 'SIDEWALK', 'false', 'false', '12'],
    ...:     ['05/28/2016 08:00:00 PM', '070XX S ASHLAND AVE', 'BATTERY', 'DOMESTIC BATTERY SIMPLE', 'GAS STATION', 'false', 'true', '7'],
    ...: ]

In [14]: type(crime_data)
Out[14]: list

In [8]: crimes.update(crime_data[0][0])

In [16]: crimes[1] += 1

In [17]: crimes
Out[17]: Counter({1: 1})

SyntaxError: invalid syntax (<ipython-input-18-9dc79ae5a081>, line 1)
```