# Introduction to data analysis: Data Frames and visualization

-- [SICSS Zürich 2021](https://github.com/computational-social-science-zurich/sicss-zurich) -- 
## Preamble

This notebook walks you through some basic programming exercises in Julia to get you acquainted with Julia and Jupyter notebooks.

<span style='color:green'> Questions to answer are in green</span>

### Core Julia basic knowledge
See the `R-stata-cookbook-for-python` notebook.

### Loading Packages

One of the things that makes Julia such a powerful programming and statistical tool is its large collection of freely available, high-quality packages, which users and contributors keep extending. For our purposes we will load the following packages:

* `DataFrames` for data management;
* `HTTP` and `CSV` for downloading and parsing the data file;
* `Plots` for plotting, and `StatsPlots` for nicer plots.

In [1]:
using DataFrames
using HTTP
using CSV
using Plots
using StatsPlots

### Background

We will be using the [Stop, Question and Frisk Data from the NYPD](https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page), which contains information about over 100,000 police-citizen interactions between 2003 and 2016 (we use the 2016 data only).

Throughout the notebook, we study how individual-level characteristics correlate with being arrested, with the aim of describing whether arrests show evidence of racism.

We will explore the connection between the following variables:

* **arstmade** - Was an arrest made?
* **race** - Race of the suspect.
* **timestop** - Time that the suspect was stopped. 
* **datestop** - Date that the suspect was stopped. 
* **age** - Suspect's age.


## Importing and cleaning data

In [2]:
# Get the file from URL
url = "https://www1.nyc.gov/assets/nypd/downloads/excel/analysis_and_planning/stop-question-frisk/sqf-2016.csv"
df = DataFrame(CSV.File(HTTP.get(url).body))

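| Row | year | pct | ser_num | datestop | timestop | recstat | inout | trhsloc | perobs | crimsusp |
|-----|------|-----|---------|----------|----------|---------|-------|---------|--------|----------|
|     | String | String | String | String | String | String | String | String | String | String |
| 1 | 2016 | 41 | 22 | 2072016 | 100 | A | O | P | 1.00 | BURG |
| 2 | 2016 | 10 | 22 | 2182016 | 30 | 1 | O | P | 8.00 | MISDEMEANOR |
| 3 | 2016 | 66 | 1 | 1012016 | 30 | 1 | I | P | 2.00 | FEL |
| 4 | 2016 | 47 | 18 | 1012016 | 40 | 1 | O | H | 1.00 | FEL |
| 5 | 2016 | 79 | 1 | 1012016 | 50 | 1 | O | P | 3.00 | D.W.I. |
| 6 | 2016 | 73 | 1 | 1012016 | 100 | 1 | O | P | 1.00 | FELONY |
| 7 | 2016 | 43 | 6 | 1012016 | 130 | 1 | O | P | 1.00 | FEL |
| 8 | 2016 | 67 | 3 | 1012016 | 135 | 1 | O | P | 1.00 | FEL |
| 9 | 2016 | 43 | 12 | 1012016 | 210 | 1 | O | P | 1.00 | MISD |
| 10 | 2016 | 73 | 2 | 1012016 | 210 | 1 | O | P | 2.00 | MISD |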


### Looking at the data

In [3]:
names(df)

112-element Vector{String}:
 "year"
 "pct"
 "ser_num"
 "datestop"
 "timestop"
 "recstat"
 "inout"
 "trhsloc"
 "perobs"
 "crimsusp"
 "perstop"
 "typeofid"
 "explnstp"
 ⋮
 "city"
 "state"
 "zip"
 "addrpct"
 "sector"
 "beat"
 "post"
 "xcoord"
 "ycoord"
 "dettypCM"
 "lineCM"
 "detailCM"

In [4]:
size(df)

(12405, 112)

This tells us that we have 12,405 observations (stops) and 112 variables collected for each stop.

In [5]:
first(df, 5)

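| Row | year | pct | ser_num | datestop | timestop | recstat | inout | trhsloc | perobs | crimsusp |
|-----|------|-----|---------|----------|----------|---------|-------|---------|--------|----------|
|     | String | String | String | String | String | String | String | String | String | String |
| 1 | 2016 | 41 | 22 | 2072016 | 100 | A | O | P | 1.0 | BURG |
| 2 | 2016 | 10 | 22 | 2182016 | 30 | 1 | O | P | 8.0 | MISDEMEANOR |
| 3 | 2016 | 66 | 1 | 1012016 | 30 | 1 | I | P | 2.0 | FEL |
| 4 | 2016 | 47 | 18 | 1012016 | 40 | 1 | O | H | 1.0 | FEL |
| 5 | 2016 | 79 | 1 | 1012016 | 50 | 1 | O | P | 3.0 | D.W.I. |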


## Summary statistics

Show some summary statistics:

### <span style='color:green'> Share of people arrested $=>$ clean this variable</span>

There seems to be a missing value in this variable: remove the corresponding observation.
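One possible way to approach this step is sketched here on a small toy data frame. The values are made up for illustration, and the assumption that the problematic entry is a blank string `" "` should be checked against the real `arstmade` column.

```julia
using DataFrames

# Toy stand-in for the real data (values are made up for illustration)
toy = DataFrame(arstmade = ["Y", "N", "N", " ", "N", "Y"])

# Count each distinct value to spot unexpected codes such as a blank
counts = combine(groupby(toy, :arstmade), nrow => :count)

# Keep only the rows with a valid "Y"/"N" code (assumption: anything else is invalid)
clean = filter(:arstmade => x -> x in ("Y", "N"), toy)

# Share of the remaining stops that led to an arrest
share_arrested = count(==("Y"), clean.arstmade) / nrow(clean)
```

Looking at `counts` first makes it easier to see which code is the odd one out before deciding what to drop.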

### <span style='color:green'>Distribution of race</span>
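A minimal sketch of one way to tabulate a categorical column, shown on toy data. The race codes used here are placeholders, not necessarily the codes in the real file.

```julia
using DataFrames

# Toy stand-in for the `race` column (codes are placeholders)
toy = DataFrame(race = ["B", "Q", "B", "W", "B", "Q"])

# Number of stops per racial category
race_dist = combine(groupby(toy, :race), nrow => :count)

# Add the share of each category among all stops
race_dist.share = race_dist.count ./ sum(race_dist.count)

# Largest category first
sort!(race_dist, :count, rev = true)
```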

### <span style='color:green'>Number and % of people arrested and % of people not arrested within each racial category.</span>

In other words: of the people who are arrested, what percentage are Black, White, Hispanic, etc.? Of the people who are not arrested, what percentage are Black, White, Hispanic, etc.?

Use the `groupby` syntax from `DataFrames`: https://dataframes.juliadata.org/stable/lib/functions/#DataFrames.groupby
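As a hedged illustration of the `groupby` pattern, the sketch below first counts stops per (arrest status, race) cell and then converts the counts to percentages within each arrest status. The toy values and the column names `n` and `pct` are assumptions made for the example.

```julia
using DataFrames

# Toy stand-in for the real data (values are made up for illustration)
toy = DataFrame(
    arstmade = ["Y", "N", "N", "Y", "N", "N"],
    race     = ["B", "B", "W", "W", "B", "Q"],
)

# Number of stops in each (arstmade, race) cell
cells = combine(groupby(toy, [:arstmade, :race]), nrow => :n)

# Within each arrest status, turn the counts into percentages across races
by_status = transform(
    groupby(cells, :arstmade),
    :n => (n -> 100 .* n ./ sum(n)) => :pct,
)
```

Grouping by `:race` instead of `:arstmade` in the second step would give the percentage arrested within each racial category.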

**Percentage of people from different racial categories, for the frisked group and non-frisked group, respectively**

## Visualization

### <span style='color:green'>Plot the distribution of race among stopped individuals</span>

### Create a plot of arrests made vs. race 

#### <span style='color:green'>Create a variable `arst_dummy` equal to 1 if the stop led to an arrest and 0 if not.</span>

You can use the `transform` function from DataFrames. https://dataframes.juliadata.org/stable/lib/functions/#DataFrames.transform
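A minimal sketch of `transform` with `ByRow`, on toy data and assuming that `arstmade` only contains `"Y"` and `"N"` after cleaning:

```julia
using DataFrames

# Toy stand-in for the cleaned `arstmade` column
toy = DataFrame(arstmade = ["Y", "N", "N", "Y"])

# ByRow applies the function to each element and stores the result in a new column
toy = transform(toy, :arstmade => ByRow(x -> x == "Y" ? 1 : 0) => :arst_dummy)
```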

#### <span style='color:green'>Represent the previous summary statistics (the percentage of people within each racial group who were arrested) as a bar plot instead.</span>

The bar plot should have the title "Percent of Racial Group Arrested".
Now you can plot arrests made vs. race using the previous dummy. 
You can try the `groupedbar` function from `StatsPlots`, or the line type `bar` for the `plot` function call.
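One possible version using `bar` from `Plots` is sketched here on toy data. The column name `pct_arrested` and the axis labels are choices made for the example.

```julia
using DataFrames, Statistics, Plots

# Toy stand-in: race codes and the arrest dummy (values are made up)
toy = DataFrame(
    race       = ["B", "B", "W", "W", "Q", "Q"],
    arst_dummy = [1, 0, 0, 0, 1, 1],
)

# The mean of a 0/1 dummy is a share; multiply by 100 for a percentage
pct = combine(groupby(toy, :race), :arst_dummy => (x -> 100 * mean(x)) => :pct_arrested)

# One bar per racial category
p = bar(
    pct.race, pct.pct_arrested,
    title = "Percent of Racial Group Arrested",
    xlabel = "Race", ylabel = "Percent arrested",
    legend = false,
)
```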

### What is the distribution of ages for those who are arrested vs. not?

#### <span style='color:green'>Cleaning step: convert `age` to a numeric type</span>
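Since the columns were read in as `String`, one hedged way to convert is `tryparse`, which returns `nothing` instead of throwing on bad input. The `"**"` entry and the helper name `parse_age` are hypothetical, and non-numeric entries are mapped to `missing` as an assumption.

```julia
using DataFrames

# Toy stand-in for the `age` column; "**" is a hypothetical non-numeric entry
toy = DataFrame(age = ["25", "**", "31"])

# Parse a string as an integer; fall back to `missing` when it is not a number
parse_age(s) = something(tryparse(Int, s), missing)

# Store the parsed values in a new column and keep the original for comparison
toy = transform(toy, :age => ByRow(parse_age) => :age_num)
```

Later computations then need to handle the missing entries, for example by wrapping the column in `skipmissing`.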

#### <span style='color:green'>Plot the age distribution by those who are arrested vs. not</span>

You can use `density` from StatsPlots or `histogram` from Plots 
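A sketch of the `density` route with the `@df` macro from `StatsPlots`, on toy data with an already numeric `age` column. The labels and title are choices made for the example.

```julia
using DataFrames, Plots, StatsPlots

# Toy stand-in with a numeric age and the arrest indicator (values are made up)
toy = DataFrame(
    age      = [19, 22, 25, 31, 40, 18, 23, 27, 35, 52],
    arstmade = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"],
)

# `group` draws one density curve per value of `arstmade`
p = @df toy density(
    :age, group = :arstmade,
    xlabel = "Age", ylabel = "Density",
    title = "Age distribution by arrest status",
)
```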

### Writing function & using loops

Many of the variables in the stop and frisk data are coded as "Y" for "Yes" and "N" for "No".

#### <span style='color:green'>Propose an easy means of recoding every variable in the stop and frisk data set using a function that you define. </span>

In order to save some time from having to recode every single variable that contains a "Y" or a "N", write a function that transforms:

* "Y" codings to 1
* "N" codings to 0
* " " codings to `missing`

You should be able to apply this function element-wise to a whole column, for example with `map` or broadcasting (the Julia counterpart of an `apply` framework).
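A minimal sketch of such a function. The mapping of `"Y"`, `"N"` and `" "` follows the list above. Sending any other value to `missing` is an assumption made for this example.

```julia
# Recode a single "Y"/"N" value; works element-wise via broadcasting
function yesno(x)
    ismissing(x) && return missing   # pass existing missings through
    x == "Y" && return 1
    x == "N" && return 0
    return missing                   # " " and (by assumption) anything else
end

# Broadcasting with the dot syntax applies it to every element of a vector
recoded_example = yesno.(["Y", "N", " "])
```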

#### <span style='color:green'>`for` loop </span>

Using the `yesno` function, write a loop that transforms every variable in the stop and frisk data frame (`df` in this notebook) that contains a "Y" or "N" coding into `1`, `0` or `missing`, as specified above.

Save these newly coded variables in a data frame called `recoded` and use the `first()` function to print out the first few observations of the new dataframe that you created.
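One way the loop could look, sketched on a toy data frame. The rule used to detect a Y/N-coded column (only `"Y"`, `"N"` or `" "`, with at least one `"Y"` or `"N"`) is an assumption, and `yesno` is redefined here so that the block runs on its own.

```julia
using DataFrames

# Same recoding rule as described above, repeated so this block is self-contained
yesno(x) = x == "Y" ? 1 : (x == "N" ? 0 : missing)

# Toy stand-in for the real data (values are made up for illustration)
toy = DataFrame(
    arstmade = ["Y", "N", " "],
    frisked  = ["N", "Y", "Y"],
    age      = ["25", "31", "40"],
)

recoded = DataFrame()
for col in names(toy)
    column = toy[!, col]
    # Assumption: a column is Y/N-coded if it holds only "Y", "N" or " "
    only_yn = all(v -> v in ("Y", "N", " "), column)
    has_yn  = any(v -> v in ("Y", "N"), column)
    if only_yn && has_yn
        recoded[!, col] = yesno.(column)
    end
end

first(recoded, 5)
```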

## Going further: Plot arrest rate depending on hour of the day

In [6]:
# Calculate the overall arrest rate

In [7]:
# Calculate the hourly arrest rate


In [8]:
# Plot of 'hourly_arrest_rate'
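
A hedged sketch of the three steps above on toy data. It assumes that `timestop` encodes the time as `HHMM` without leading zeros (so `"135"` means 01:35) and that an `arst_dummy` column already exists. Both assumptions should be checked against the real data.

```julia
using DataFrames, Statistics, Plots

# Toy stand-in (values are made up for illustration)
toy = DataFrame(
    timestop   = ["30", "135", "210", "1405", "1450", "2359"],
    arst_dummy = [0, 1, 0, 1, 1, 0],
)

# Overall arrest rate: the mean of the 0/1 dummy
overall_arrest_rate = mean(toy.arst_dummy)

# Hour of the day: integer division of HHMM by 100 (assumed encoding)
toy.hour = div.(parse.(Int, toy.timestop), 100)

# Arrest rate within each hour
hourly_arrest_rate = combine(groupby(toy, :hour), :arst_dummy => mean => :arrest_rate)
sort!(hourly_arrest_rate, :hour)

# Hourly rate as a line, with the overall rate as a dashed reference line
p = plot(
    hourly_arrest_rate.hour, hourly_arrest_rate.arrest_rate,
    xlabel = "Hour of the day", ylabel = "Arrest rate", label = "Hourly",
)
hline!(p, [overall_arrest_rate], label = "Overall", linestyle = :dash)
```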
