# Assignment 4

Before working on this assignment, please read these instructions fully. In the submission area, you will notice that you can click the link to **Preview the Grading** for each step of the assignment. These are the criteria that will be used for peer grading. Please familiarize yourself with the criteria before beginning the assignment.

This assignment requires that you find **at least two datasets** on the web which are related, and that you visualize these datasets to answer the assignment question. You are free to use datasets from any location or domain; the use of **Ann Arbor sports and athletics** datasets in the example is just a suggestion.

You are welcome to choose datasets at your discretion, but keep in mind **they will be shared with your peers**, so choose appropriate datasets. Sensitive, confidential, illicit, and proprietary materials are not good choices for datasets for this assignment. You are welcome to upload datasets of your own as well, and link to them using a third-party repository such as GitHub, Pastebin, etc. Please be aware of the Coursera terms of service with respect to intellectual property.

Also, you are welcome to preserve data in its original language, but for the purposes of grading you should provide English translations. You are welcome to provide multiple visuals in different languages if you would like!

As this assignment is for the whole course, you must incorporate principles discussed in the first week, such as having a high data-ink ratio (Tufte) and aligning with Cairo’s principles of truth, beauty, function, and insight.

Here are the assignment instructions:

 * You must state a question you are seeking to answer with your visualizations.
 * You must provide at least two links to available datasets. These could be links to files such as CSV or Excel files, or links to websites which might have data in tabular form, such as Wikipedia pages.
 * You must upload an image which addresses the research question you stated. In addition to addressing the question, this visual should follow Cairo's principles of truthfulness, functionality, beauty, and insightfulness.
 * You must contribute a short (1-2 paragraph) written justification of how your visualization addresses your stated research question.

## Tips
* Wikipedia is an excellent source of data, and I strongly encourage you to explore it for new data sources.
* Many governments run open data initiatives at the city, region, and country levels, and these are wonderful resources for localized data sources.
* Several international agencies, such as the [United Nations](http://data.un.org/), the [World Bank](http://data.worldbank.org/), and the [Global Open Data Index](http://index.okfn.org/place/), are other great places to look for data.
* This assignment requires you to convert and clean data files. Check out the discussion forums for tips on how to do this from various sources, and share your successes with your fellow students!

## Example
Looking for an example? Here's what our course assistant put together as an example! [Example Solution File](./readonly/Assignment4_example.pdf)

# Impact of local sports teams' winning percentages on crime

The winning percentage of the Pittsburgh Steelers, Pittsburgh Pirates, and Pittsburgh Penguins will be compared to the number of arrests over that time frame.

Does a winning season or a losing season impact the number of arrests, or is there no correlation?
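One common way to put a number on "is there a correlation?" is the Pearson correlation coefficient between a team's winning percentage and the arrest count for the same season. The sketch below is only an illustration of that idea: the `seasons` table and every value in it are made up, not drawn from the datasets used here.

```python
import pandas as pd

# Made-up numbers purely for illustration; they are NOT taken from the
# Pittsburgh arrest data or from any team's real record.
seasons = pd.DataFrame(
    {
        "win_pct": [0.40, 0.45, 0.50, 0.55, 0.60],
        "arrests": [1200, 1150, 1100, 1050, 1000],
    },
    index=[2001, 2002, 2003, 2004, 2005],
)

# Pearson r ranges from -1 (move in opposite directions) through 0
# (no linear relationship) to +1 (move together).
r = seasons["win_pct"].corr(seasons["arrests"])
print(f"Pearson r = {r:.2f}")
```

A value of r near zero would suggest no linear relationship; a value far from zero would still not show that wins cause a change in arrests.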

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Collection

### Crime Data

The crime data from the City of Pittsburgh is provided at https://data.wprdc.org/dataset/arrest-data.

The ARRESTTIME field will be used to obtain the day and year of the arrest to compare against the Penguins, Pirates, and Steelers seasons. Since the crime data only goes back to 1998, the range of comparisons will be 1998 to 2023.

In [49]:
# A forward slash keeps the relative path portable across Windows, macOS, and Linux.
crime = pd.read_csv("data/pittsburgh_arrest_data.csv").set_index("_id")
crime["ARRESTTIME"] = pd.to_datetime(crime["ARRESTTIME"])
crime.sort_values("ARRESTTIME").head()

| _id | PK | CCR | AGE | GENDER | RACE | ARRESTTIME | ARRESTLOCATION | OFFENSES | INCIDENTLOCATION | INCIDENTNEIGHBORHOOD | INCIDENTZONE | INCIDENTTRACT | COUNCIL_DISTRICT | PUBLIC_WORKS_DIVISION | X | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 73367 | 2048259 | 21130139 | 37.0 | M | B | 2022-05-31 07:40:00 | #1 Lois LANE Greenwood, WV 26415 | 3929 Retail Theft. | 400 Block N Highland AV Pittsburgh, PA 15206 | East Liberty | 5 | 1115.0 | 9.0 | 2.0 | -79.922175 | 40.465166 |
| 57382 | 2034201 | 20076161 | 21.0 | M | B | 2021-01-23 08:56:00 | 0900 Block 2nd AV Pittsburgh, PA 15219 | 3925 Receiving Stolen Property. / 6106 Firearm... | 1400 Block Washington BL Pittsburgh, PA 15206 | Highland Park | 5 | 1106.0 | 9.0 | 2.0 | -79.908617 | 40.470306 |
| 23947 | 2004923 | 18167846 | 42.0 | M | H | 2018-09-18 13:42:00 | 10 Block 35th ST Pittsburgh, PA 15201 | 3921(a) Theft by Unlawful Taking or Dispositio... | 6000 Block Harvard SQ Pittsburgh, PA 15206 | East Liberty | 5 | 1115.0 |  |  |  |  |
| 4190 | 1979900 | 17000760 | 34.0 | M | W | 2017-01-02 09:12:00 | 10 Block 40th ST Pittsburgh, PA 15201 | 13(a)(16) Possession of Controlled Substance /... | 10 Block 40th ST Pittsburgh, PA 15201 | Central Lawrenceville | 2 | 901.0 | 7.0 | 2.0 | -79.96488 | 40.470229 |
| 22079 | 2002540 | 18128084 | 30.0 | F | B | 2018-07-06 18:27:00 | 10 Block Ainsworth ST Pittsburgh, PA 15220 | 2702 Aggravated Assault. | 10 Block Ainsworth ST Pittsburgh, PA 15220 | Elliott | 6 | 2020.0 | 2.0 | 5.0 | -80.039603 | 40.44402 |
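Pulling the year and month out of ARRESTTIME could look roughly like the sketch below. It uses a tiny stand-in frame (`sample`, a hypothetical name) built from three of the timestamps shown above, so it runs without the CSV; the same `.dt` calls should apply to the full `crime` frame once ARRESTTIME has been parsed as a datetime.

```python
import pandas as pd

# Tiny stand-in for the arrest table; only ARRESTTIME matters here.
# The three timestamps are copied from the sample rows shown above.
sample = pd.DataFrame(
    {
        "ARRESTTIME": pd.to_datetime(
            ["2017-01-02 09:12:00", "2018-07-06 18:27:00", "2018-09-18 13:42:00"]
        )
    }
)

# The .dt accessor exposes calendar parts of a datetime column.
sample["year"] = sample["ARRESTTIME"].dt.year
sample["month"] = sample["ARRESTTIME"].dt.month

# Count arrests per calendar year.
arrests_per_year = sample.groupby("year").size()
print(arrests_per_year)
```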


### Pittsburgh Penguins win/loss data

The Pittsburgh Penguins win/loss data is obtained from Wikipedia. The NHL season typically runs from October until June. The time frame of October 1998 until June 2023 will be compared.

In [135]:
penguins = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_Pittsburgh_Penguins_seasons"
)[2]
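# Keep rows from the 1998–99 season onward (dropping the final row) and the first 13 columns.
penguins = penguins.iloc[31:-1, :13]
# The table has two header levels; keep the second level as the column names.
penguins.columns = [col[1] for col in penguins.columns]
penguins["Season"] = penguins["Season"].astype(str)
penguins = penguins.set_index("Season")
# Drop the lockout season; .copy() avoids a SettingWithCopyWarning on the next assignment.
penguins = penguins[penguins["Conference"] != "Season not played due to lockout"].copy()
# Winning percentage here is wins divided by games played.
penguins["Pct"] = penguins["W"].astype(int) / penguins["GP"].astype(int)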
penguins.head()

| Season | Penguins season | Conference | Division | Finish | GP | W | L | T[5] | OT[6] | Pts | GF | GA | Pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1998–99 | 1998–99 | Eastern | Atlantic | 3rd | 82 | 38 | 30 | 14 | — | 90 | 242 | 225 | 0.463415 |
| 1999–2000[e] | 1999–2000 | Eastern | Atlantic | 3rd | 82 | 37 | 31 | 8 | 6 | 88 | 241 | 236 | 0.45122 |
| 2000–01 | 2000–01 | Eastern | Atlantic | 3rd | 82 | 42 | 28 | 9 | 3 | 96 | 281 | 256 | 0.512195 |
| 2001–02 | 2001–02 | Eastern | Atlantic | 5th | 82 | 28 | 41 | 8 | 5 | 69 | 198 | 249 | 0.341463 |
| 2002–03 | 2002–03 | Eastern | Atlantic | 5th | 82 | 27 | 44 | 6 | 5 | 65 | 189 | 255 | 0.329268 |
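Because an NHL season straddles two calendar years, each arrest has to be assigned to a season rather than to a year. One possible approach is sketched below. It assumes a season is keyed by its starting year and that July through September counts as off-season; the variable names and the 2019 timestamp are illustrative and do not come from the notebook.

```python
import pandas as pd

times = pd.Series(
    pd.to_datetime(
        [
            "2018-09-18 13:42:00",  # September: treated as off-season
            "2021-01-23 08:56:00",  # January: belongs to the season that began in 2020
            "2022-05-31 07:40:00",  # May: belongs to the season that began in 2021
            "2019-10-15 20:00:00",  # made-up October timestamp: season beginning 2019
        ]
    )
)

month = times.dt.month
year = times.dt.year

# October-December belong to the season starting that year;
# earlier months belong to the season that started the previous year.
season_start = year.where(month >= 10, year - 1)

# Keep only arrests that fall inside the assumed October-June window.
in_season = (month >= 10) | (month <= 6)
season_start = season_start[in_season]
print(season_start.tolist())
```

Grouping arrests by `season_start` would then give one count per season, which can be lined up against the start year of each row in the Penguins table.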


### Pittsburgh Pirates win/loss data

The Pittsburgh Pirates win/loss data is obtained from Wikipedia. The Major League Baseball regular season typically runs from late March or early April into early October. This analysis compares the crime data from May through October of each year, from May 1998 through October 2023.

In [107]:
pirates = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_Pittsburgh_Pirates_seasons",
    skiprows=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
)[2]
# The table has two header levels; keep the first level as the column names.
pirates.columns = [col[0] for col in pirates.columns]
pirates = pirates.iloc[:-6, :-1]
pirates = pirates.set_index("MLB season")
pirates = pirates[pirates.index >= "1998"]
pirates.head()

| MLB season | Team season | League | Division | Finish | Wins | Losses | Win% | GB | Post-season | Awards[7] |
|---|---|---|---|---|---|---|---|---|---|---|
| 1998 | 1998 | NL | Central | 6th | 69 | 93 | 0.426 | 33 |  |  |
| 1999 | 1999 | NL | Central | 3rd | 78 | 84 | 0.484 | 18½ |  |  |
| 2000 | 2000[r] | NL | Central | 5th | 69 | 93 | 0.426 | 26 |  |  |
| 2001 | 2001 | NL | Central | 6th | 62 | 100 | 0.383 | 31 |  |  |
| 2002 | 2002 | NL | Central | 4th | 72 | 89 | 0.447 | 24½ |  |  |
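For a season that sits inside a single calendar year, restricting arrests to the comparison window is a matter of filtering on the month. A minimal sketch, assuming the May through October window described above and using four timestamps from the arrest sample (the variable names are hypothetical):

```python
import pandas as pd

times = pd.Series(
    pd.to_datetime(
        [
            "2017-01-02 09:12:00",  # January: outside the window
            "2018-07-06 18:27:00",
            "2018-09-18 13:42:00",
            "2022-05-31 07:40:00",
        ]
    )
)

# between() is inclusive on both ends, so this keeps May (5) through October (10).
in_window = times.dt.month.between(5, 10)

# One arrest count per calendar year, which matches the "MLB season" index.
arrests_by_season = times[in_window].dt.year.value_counts().sort_index()
print(arrests_by_season)
```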


### Pittsburgh Steelers win/loss data

The Pittsburgh Steelers win/loss data is obtained from Wikipedia. The National Football League season typically runs from September through January. The crime data will be compared from September 1998 through January 2024.

In [108]:
steelers = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_Pittsburgh_Steelers_seasons"
)[1]
# The table has two header levels; keep the second level as the column names.
steelers.columns = [col[1] for col in steelers.columns]
steelers = steelers.iloc[17:-3, :].set_index("Season")
steelers = steelers[steelers.index >= "1998"]
steelers.head()

| Season | Team | League | Conference | Division | Finish | W | L | T | Pct[1] | Postseason results | Awards | Head coaches |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1998 | 1998 | NFL | AFC | Central | 3rd | 7 | 9 | 0 | 0.438 |  |  | Bill Cowher |
| 1999 | 1999 | NFL | AFC | Central | 4th | 6 | 10 | 0 | 0.375 |  |  | Bill Cowher |
| 2000 | 2000 | NFL | AFC | Central | 3rd | 9 | 7 | 0 | 0.563 |  |  | Bill Cowher |
| 2001 | 2001 | NFL | AFC | Central | 1st | 13 | 3 | 0 | 0.813 | Won Divisional Playoffs (Ravens) 27–10 Lost AF... |  Kendrell Bell (DROY) | Bill Cowher |
| 2002 | 2002 | NFL | AFC | North | 1st | 10 | 5 | 1 | 0.656 | Won Wild Card Playoffs (Browns) 36–33 Lost Div... | Tommy Maddox (CBPOY) | Bill Cowher |
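The Pct column in this table is not simply wins divided by games, which matters when comparing it with the Penguins figure computed earlier as W / GP. The sketch below checks one plausible reading, that a tie counts as half a win, against three rows copied from the table above. The formula is an assumption inferred from the 2002 row, and the `records` frame is a hypothetical stand-in for `steelers`.

```python
import pandas as pd

# W, L, T, and Pct copied from the 1998, 2001, and 2002 rows shown above.
records = pd.DataFrame(
    {
        "W": [7, 13, 10],
        "L": [9, 3, 5],
        "T": [0, 0, 1],
        "Pct": [0.438, 0.813, 0.656],
    },
    index=[1998, 2001, 2002],
)

# Assumed formula: a tie counts as half a win.
games = records["W"] + records["L"] + records["T"]
records["computed"] = (records["W"] + 0.5 * records["T"]) / games

# Largest difference between the computed value and the published, rounded Pct.
max_gap = (records["computed"] - records["Pct"]).abs().max()
print(records[["Pct", "computed"]])
print(f"largest gap: {max_gap:.4f}")
```

If the gap stays within rounding error, the published Pct can be used as is; otherwise recomputing a single consistent percentage for all three teams may be the safer choice.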
