# Situated data

Our reading described the importance of considering data's context.

Within CORGIS, let's look at another dataset in which it is important to consider context, inclusion, and exclusion.
* Police shootings: https://corgis-edu.github.io/corgis/csv/police_shootings/

Import Pandas:

In [1]:
import pandas as pd

Import the data (I've copied the CSV file into our class repo, so we're importing it from a link).

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLA-23W-DH140/main/Weeks/Week04/data/police_shootings.csv')

We can use our newly learned dataframe methods to investigate this data.

First, let's just look at a snapshot.

In [4]:
df

Unnamed: 0,Person.Name,Person.Age,Person.Gender,Person.Race,Incident.Date.Month,Incident.Date.Day,Incident.Date.Year,Incident.Date.Full,Incident.Location.City,Incident.Location.State,Factors.Armed,Factors.Mental-Illness,Factors.Threat-Level,Factors.Fleeing,Shooting.Manner,Shooting.Body-Camera
0,Tim Elliot,53,Male,Asian,1,2,2015,2015/01/02,Shelton,WA,gun,True,attack,Not fleeing,shot,True
1,Lewis Lee Lembke,47,Male,White,1,2,2015,2015/01/02,Aloha,OR,gun,True,attack,Not fleeing,shot,True
2,John Paul Quintero,23,Male,Hispanic,1,3,2015,2015/01/03,Wichita,KS,unarmed,True,other,Not fleeing,shot and Tasered,True
3,Matthew Hoffman,32,Male,White,1,4,2015,2015/01/04,San Francisco,CA,toy weapon,True,attack,Not fleeing,shot,True
4,Michael Rodriguez,39,Male,Hispanic,1,4,2015,2015/01/04,Evans,CO,nail gun,True,attack,Not fleeing,shot,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6564,Unknown,0,Male,Unknown,9,7,2021,2021/09/07,Fruit Cove,FL,gun,True,attack,Not fleeing,shot,True
6565,Anthony Cravo,52,Male,Unknown,9,7,2021,2021/09/07,Lufkin,TX,gun,True,attack,Not fleeing,shot,True
6566,Cedric Williams,29,Male,Unknown,9,10,2021,2021/09/10,Oxon Hill,MD,toy weapon,True,other,unknown,shot,True
6567,Desmond Lewis,30,Male,Unknown,9,11,2021,2021/09/11,Shreveport,LA,gun,True,attack,Foot,shot,True


* What considerations do we need to make about the data that is present here?
* What does the documentation tell us?
  * https://corgis-edu.github.io/corgis/csv/police_shootings/
* Is any information obscured, misrepresented, or unusual?
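
One concrete thing to check is how missing values are encoded. In the snapshot above, row 6564 has the name `Unknown` and an age of `0`, which suggests that unknown ages may be stored as `0`. That is an inference from the snapshot, not something confirmed by the documentation. The sketch below retypes four of the displayed rows into a small frame called `sample` (a name introduced here for illustration) and shows how such a placeholder would distort an average.

```python
import pandas as pd

# Four rows copied from the snapshot above (rows 0, 1, 6564, 6565).
# `sample` is an illustrative stand-in for the full `df`.
sample = pd.DataFrame({
    "Person.Name": ["Tim Elliot", "Lewis Lee Lembke", "Unknown", "Anthony Cravo"],
    "Person.Age": [53, 47, 0, 52],
    "Person.Race": ["Asian", "White", "Unknown", "Unknown"],
})

# Count rows whose age looks like a placeholder rather than a real value.
n_zero_age = (sample["Person.Age"] == 0).sum()

# Assumption: an age of 0 means "unknown". Turn those entries into missing
# values so that mean() skips them instead of being pulled down by them.
age_known = sample["Person.Age"].where(sample["Person.Age"] > 0)

mean_with_zeros = sample["Person.Age"].mean()
mean_known_only = age_known.mean()

print(n_zero_age, mean_with_zeros, round(mean_known_only, 1))
```

The same check on the full data would use `df` in place of `sample`. Text columns such as `Person.Race` can be inspected in a similar way with `df['Person.Race'].unique()`.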

It will help to know how to use `groupby`, which lets us compute aggregates for each unique value in a column. For example:
* How many Male vs. Female vs. Unknown gender persons are in the data?
  * `df.groupby('Person.Gender').count()` and `df.groupby('Person.Gender')['Person.Gender'].count()`
* How many people were considered mentally ill?
  * `df.groupby('Factors.Mental-Illness')['Factors.Mental-Illness'].count()`
  * `df['Factors.Mental-Illness'].unique()` lists the distinct values in that column
* What is the distribution of values by race? by age? by state?
  * `df.groupby('Person.Race')['Person.Race'].count()`
  * `df.groupby('Person.Age')['Person.Age'].count()`
  * `df.groupby('Incident.Location.State')['Incident.Location.State'].count()`
* How can plots help?
  * `df.plot(y='Person.Age', kind='hist')`
  * maps
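
As a minimal sketch of what these `groupby` calls return, the block below applies them to the first five rows of the snapshot, typed in by hand as `first_rows` (an illustrative name, not part of the notebook). `value_counts()` is shown as a shortcut that gives the same per-category tallies for a single column.

```python
import pandas as pd

# Race, age, and state for rows 0-4 of the snapshot above.
first_rows = pd.DataFrame({
    "Person.Race": ["Asian", "White", "Hispanic", "White", "Hispanic"],
    "Person.Age": [53, 47, 23, 32, 39],
    "Incident.Location.State": ["WA", "OR", "KS", "CA", "CO"],
})

# Rows per race category: group on the column, then count that same column.
race_counts = first_rows.groupby("Person.Race")["Person.Race"].count()

# value_counts() gives the same tallies, sorted from most to least frequent.
race_counts_alt = first_rows["Person.Race"].value_counts()

# groupby can also aggregate a different column, e.g. mean age per category.
mean_age_by_race = first_rows.groupby("Person.Race")["Person.Age"].mean()

print(race_counts)
print(mean_age_by_race)
```

On the full data the same calls run on `df`. Note that `.count()` skips missing values in the counted column, while `.size()` counts every row in each group.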

When looking at statistics like this, it is also important to keep in mind that the numbers are aggregated over multiple years, and that you may need to "normalize" the data to make meaningful comparisons.

For example, comparing **counts** of shootings across race categories is different from comparing **rates** of shootings across those categories. [We can look at an example under the assumptions that the numbers were the same across all years and that the populations are similar to those in 2022: the population of African Americans is ~42 million, of Hispanic Americans ~62 million, and of White Americans ~192 million.]
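
The normalization described above can be sketched as a small helper. `rate_per_million_per_year` and `population_millions` are names introduced here for illustration. The populations are the approximate 2022 figures quoted above, and the number of years is left as a parameter because it depends on the date range the data covers.

```python
# Approximate 2022 populations from the text, in millions.
# The keys are illustrative labels; they have not been checked against
# the values that actually appear in the Person.Race column.
population_millions = {
    "African American": 42,
    "Hispanic": 62,
    "White": 192,
}


def rate_per_million_per_year(count, pop_millions, years):
    """Convert a raw count into an average yearly rate per million people.

    Assumes the count is spread evenly over `years` and that the
    population stayed constant over that period.
    """
    if pop_millions <= 0 or years <= 0:
        raise ValueError("population and years must be positive")
    return count / pop_millions / years
```

Given per-category counts from `df.groupby('Person.Race')['Person.Race'].count()`, each count can be passed to this helper together with the matching population. The category labels in `Person.Race` should be checked first so that counts and populations are paired correctly.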

If we are diligent, we can also try to seek other sources of confirmation for our findings:
* https://www.statista.com/statistics/1123070/police-shootings-rate-ethnicity-us/
* We can compare numbers
  * Do they compare well?
  * What do you think our weekly reading's authors would have to say about the plot title on that page?
* We can also compare sources
  * What do you see when you do that?
* What happens when you go to the Washington Post article?
  * https://www.washingtonpost.com/national/how-the-washington-post-is-examining-police-shootings-in-the-united-states/2016/07/07/d9c52238-43ad-11e6-8856-f26de2537a9d_story.html

## Mapping

In [24]:
state_data = df.groupby('Incident.Location.State')['Incident.Location.State'].count()

In [25]:
state_data

Incident.Location.State
AK     47
AL    124
AR     95
AZ    301
CA    968
CO    237
CT     23
DC     20
DE     16
FL    431
GA    240
HI     36
IA     38
ID     53
IL    129
IN    120
KS     62
KY    116
LA    129
MA     43
MD     97
ME     26
MI    105
MN     76
MO    162
MS     82
MT     40
NC    185
ND     14
NE     34
NH     18
NJ     76
NM    131
NV    108
NY    120
OH    186
OK    190
OR    107
PA    135
RI      4
SC    101
SD     20
TN    174
TX    584
UT     81
VA    113
VT     11
WA    178
WI    106
WV     62
WY     15
Name: Incident.Location.State, dtype: int64

In [10]:
import folium

In [47]:
state_geo = "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/us-states.json"

In [49]:
m = folium.Map(location=[36, -98], zoom_start=4)

folium.Choropleth(
    geo_data=state_geo,
    data=state_data,
    key_on="feature.id",
    fill_color="YlGn",
).add_to(m)

m

AttributeError: 'str' object has no attribute 'to_linear'

This is another context in which normalization matters: to compare states meaningfully, we should divide each state's count by that state's population rather than map the raw counts.
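
A minimal sketch of that normalization for a handful of states, using counts copied from the `state_data` output above. The populations are rounded 2020 census figures in millions, included here as an assumption that should be verified against an official source. `counts`, `pop_millions`, and `per_million` are illustrative names.

```python
import pandas as pd

# Counts copied from the `state_data` output above.
counts = pd.Series({"AK": 47, "CA": 968, "NM": 131, "NY": 120, "TX": 584})

# Assumed populations in millions (rounded 2020 census values; verify before use).
pop_millions = pd.Series({"AK": 0.73, "CA": 39.5, "NM": 2.1, "NY": 20.2, "TX": 29.1})

# Series division aligns on the state codes, so entry order does not matter.
per_million = (counts / pop_millions).sort_values(ascending=False)

print(per_million.round(1))
```

Under these assumed populations, California has the largest raw count of the five, but Alaska and New Mexico rank higher once population is taken into account. As with the race comparison, these figures are still aggregated over all years in the data.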