# Introduction to data analysis: `Pandas` and visualization

-- [SICSS Zürich 2021](https://github.com/computational-social-science-zurich/sicss-zurich) -- 
## Preamble

This notebook walks you through some basic programming exercises in Python to get you acquainted with using Python and Jupyter notebooks.

 <span style='color:green'> Questions to answer are in green</span>

### Core python basic knowledge
 => See the `R-stata-cookbook-for-python` notebook

### Loading Packages

One of the things that makes Python such a powerful programming and statistical tool is the large number of freely available, high-quality packages that users keep contributing. For our purposes we will mainly rely on three packages:

* `pandas` package for data management;
* `matplotlib` for plotting, and `seaborn` for nicer plots

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sn
pd.set_option('display.max_columns', 500)

### Background

We will be using the [Stop, Question and Frisk Data from the NYPD](https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page), which contains information about over 100,000 police-citizen interactions between 2003 and 2016 (we use the 2016 data only).

Throughout the notebook, we study how individual-level characteristics correlate with being arrested, with the aim of describing whether arrests show evidence of racism.

We will explore the connection between the following variables:

* **arstmade** - Was an arrest made?
* **race** - Race of the suspect.
* **timestop** - Time that the suspect was stopped. 
* **datestop** - Date that the suspect was stopped. 
* **age** - Suspect's age.


## Importing and cleaning data

In [None]:
# Load the data from a URL (a local file path works the same way)
df = pd.read_csv("https://www1.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/stop-question-frisk/sqf-2016.csv")

### Looking at the data

In [None]:
df.columns

In [None]:
df.shape

This tells us that we have 12,405 observations (stops) and 112 variables collected for each stop.

In [None]:
# have a look at the first rows
df.head()

## Summary statistics

Show some summary statistics:

### <span style='color:green'> Share of people arrested $=>$ clean this variable</span>

There seems to be a missing value: remove the corresponding observation.
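
One possible way to approach this cleaning step is sketched below on a small toy frame. The values are invented for illustration and are not taken from the NYPD file, and the sketch assumes that the valid codes are exactly "Y" and "N" and that the missing entry appears as a blank string or as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the real NYPD file).
toy = pd.DataFrame({"arstmade": ["Y", "N", "N", " ", "N", "Y", np.nan]})

# Inspect the raw categories, including missing ones.
print(toy["arstmade"].value_counts(dropna=False))

# Assumption: valid codes are exactly "Y" and "N"; every other row is dropped.
toy_clean = toy[toy["arstmade"].isin(["Y", "N"])].copy()

# Share of each category among the remaining rows.
share = toy_clean["arstmade"].value_counts(normalize=True)
print(share)
```

On the real data, it is worth checking the output of `value_counts(dropna=False)` first to see how the missing entry is actually coded before filtering.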

### <span style='color:green'>Distribution of race</span>

### <span style='color:green'>Number and % of people arrested and % of people not arrested within each racial category.</span>

In other words, of the people who are arrested, what percent are Black, White, Hispanic, etc.? Of the people who are not arrested, what percent are Black, White, Hispanic, etc.?

Use the `groupby` syntax from `pandas`.
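
As a rough sketch of the `groupby` pattern, the example below computes counts and within-group percentages on a toy frame. The rows and the race labels are made up for illustration; the real file may code race differently.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "arstmade": ["Y", "Y", "N", "N", "N", "N"],
    "race": ["Black", "White", "Black", "Black", "Hispanic", "White"],
})

# Number of stops for each (arrest status, race) pair.
counts = toy.groupby(["arstmade", "race"]).size()

# Divide by the total of each arrest-status group to get within-group percentages.
totals = counts.groupby(level="arstmade").transform("sum")
pct = 100 * counts / totals

print(pd.DataFrame({"n": counts, "pct": pct.round(1)}))
```

A shorter alternative that should give the same percentages is `toy.groupby("arstmade")["race"].value_counts(normalize=True)`, multiplied by 100.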

**Percentage of people from each racial category, for the frisked group and the non-frisked group, respectively**

## Visualization

### <span style='color:green'>Plot the distribution of race among stopped individuals</span>

### Create a plot of arrests made vs. race 

#### <span style='color:green'>Create a variable `arst_dummy` equal to 1 if the stop led to an arrest and 0 if not.</span>

You can use the `apply` syntax from Pandas. 
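
A minimal sketch of the `apply` approach on a toy column (values invented for illustration; it assumes the column has already been cleaned so that it only contains "Y" and "N"):

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"arstmade": ["Y", "N", "N", "Y"]})

# Element-wise mapping with apply: 1 if an arrest was made, 0 otherwise.
toy["arst_dummy"] = toy["arstmade"].apply(lambda x: 1 if x == "Y" else 0)
print(toy)
```

A vectorized alternative is `(toy["arstmade"] == "Y").astype(int)`, which avoids calling a Python function on every row.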

#### <span style='color:green'>Represent the previous summary statistics (the percent of people within each racial group who were arrested) with a bar plot instead.</span>

The bar plot should have the title "Percent of Racial Group Arrested".
Now you can plot arrests made vs. race using the previous dummy. 
You can use the `barplot` function from `seaborn`.
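
The sketch below shows one way this could look with `seaborn`, using the `sn` alias from the import cell and a made-up toy frame. It relies on `barplot` drawing the mean of the `y` variable for each category, so the mean of a 0/1 dummy scaled by 100 gives the percent arrested. The column `arst_pct` is a helper introduced here for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "race": ["Black", "Black", "White", "White", "Hispanic", "Hispanic"],
    "arst_dummy": [1, 0, 0, 0, 1, 1],
})

# Scale the dummy so that the bar heights read as percentages.
toy["arst_pct"] = 100 * toy["arst_dummy"]

# barplot shows the mean of y within each x category (with error bars by default).
ax = sn.barplot(data=toy, x="race", y="arst_pct")
ax.set_title("Percent of Racial Group Arrested")
ax.set_xlabel("Race")
ax.set_ylabel("Percent arrested")
```

In a notebook with `%matplotlib inline`, the figure is displayed when the cell finishes running.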

### What is the distribution of ages for those who are arrested vs. not?

#### <span style='color:green'>Cleaning step: convert `age` to a numeric</span>
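
One common way to do this conversion is `pd.to_numeric` with `errors="coerce"`, sketched here on a toy column. The non-numeric placeholders below are invented; the real file may use different ones.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"age": ["25", "31", "**", "unknown", "19"]})

# errors="coerce" turns every value that cannot be parsed into NaN.
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

print(toy["age"])
print(toy["age"].isna().sum(), "values could not be converted")
```

After the conversion, it is worth deciding explicitly whether rows with a missing age should be dropped or kept for the other analyses.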

#### <span style='color:green'>Plot the age distribution by those who are arrested vs. not</span>

You can use `displot` from seaborn or `hist` from matplotlib 

### Writing functions & using loops

Many of the variables in the stop and frisk data are coded as "Y" for "Yes" and "N" for "No".

#### <span style='color:green'>Propose an easy means of recoding every variable in the stop and frisk data set using a function that you define. </span>

To avoid recoding by hand every single variable that contains a "Y" or an "N", write a function that transforms:

* "Y" codings to 1
* "N" codings to 0
* " " codings to `np.nan` (missing)

You should be able to use this function in an `apply` framework.
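
A possible shape for such a function is sketched below. The handling of values other than "Y", "N" and " " is an assumption made here (they are returned unchanged); the exercise does not specify it.

```python
import numpy as np
import pandas as pd

def yesno(value):
    """Recode "Y" to 1, "N" to 0, and a blank " " to missing."""
    if value == "Y":
        return 1
    if value == "N":
        return 0
    if value == " ":
        return np.nan
    # Assumption: any other value is returned unchanged.
    return value

# Toy data, invented for illustration only.
toy = pd.Series(["Y", "N", " ", "Y"])
recoded_toy = toy.apply(yesno)
print(recoded_toy)
```

Because the function takes a single value and returns a single value, it can be passed directly to `Series.apply`.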

#### <span style='color:green'>`for` loop </span>

Using the `yesno` function, write a loop that transforms every variable in the "stopandfrisk2016" data frame (loaded as `df` in this notebook) that contains a "Y" or "N" coding into 1, 0 or missing, as specified above.

Save these newly coded variables in a data frame called `recoded` and use the `head()` function to print out the first few observations of the new dataframe that you created.
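
The loop could look roughly like the sketch below, shown on a toy frame with hypothetical column names and a compact version of `yesno`. The rule used to decide which columns to recode (at least one "Y" or "N" in the column) is an assumption, and it would also catch a column that uses those letters with a different meaning.

```python
import numpy as np
import pandas as pd

def yesno(value):
    # Compact version of the recoding; other values pass through unchanged.
    return {"Y": 1, "N": 0, " ": np.nan}.get(value, value)

# Toy data with hypothetical column names, invented for illustration only.
toy = pd.DataFrame({
    "arstmade": ["Y", "N", " "],
    "frisked": ["N", "N", "Y"],
    "age": ["25", "31", "19"],
})

recoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    # Recode only the columns that contain at least one "Y" or "N".
    if toy[col].isin(["Y", "N"]).any():
        recoded[col] = toy[col].apply(yesno)

print(recoded.head())
```

Here `age` is left out of `recoded` because it contains neither "Y" nor "N".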

## Going further: Plot arrest rate depending on hour of the day
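
As a starting point, the sketch below walks through the three steps on toy data. It assumes that `timestop` is stored as an integer in HHMM form (for example 1430 for 14:30), which should be checked against the real file, and that the `arst_dummy` variable from the earlier exercise is available. All values are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "timestop": [30, 915, 945, 1430, 1455, 2310],
    "arst_dummy": [0, 1, 0, 0, 0, 1],
})

# Overall arrest rate: the mean of a 0/1 dummy is the share of ones.
overall_rate = toy["arst_dummy"].mean()

# Assumption: timestop is an HHMM integer, so integer division by 100 gives the hour.
toy["hour"] = toy["timestop"] // 100

# Hourly arrest rate: mean of the dummy within each hour.
hourly_arrest_rate = toy.groupby("hour")["arst_dummy"].mean()
print(overall_rate)
print(hourly_arrest_rate)

# Line plot of the hourly rate (pandas plotting uses matplotlib underneath).
ax = hourly_arrest_rate.plot(kind="line", marker="o")
ax.set_xlabel("Hour of the day")
ax.set_ylabel("Arrest rate")
```

If `timestop` turns out to be stored as text instead, it would need to be converted to a number (or parsed as a time) before the hour can be extracted.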

In [None]:
# Calculate the overall arrest rate

In [None]:
# Calculate the hourly arrest rate


In [None]:
# Plot of 'hourly_arrest_rate'
