# Sprint 2: Data 

**Data Source**: [Mapping Police Violence](https://airtable.com/appzVzSeINK1S3EVR/shroOenW19l1m3w0H/tblxearKzw8W7ViN8) retrieved on 2/21/2024

*Mapping Police Violence is a research collective that collects data from a variety of sources in order to track incidents of lethal force by police officers on civilians. They have discovered that Black people are 3 times as likely to be killed by a police officer as their White counterparts. This data was selected because it offers the most detailed information on specific cases of lethal police force from 2013 to February 6, 2024. Furthermore, the data is collected and reviewed by researchers at Mapping Police Violence before being added to the dataset, which allows for more accurate and detailed data to be shared.* 

> For further exploration of the data collection process or the variable explanations, refer to the [Mapping Police Violence Codebook](https://mappingpoliceviolence.org/files/MappingPoliceViolence_Methodology.pdf).

## DATA CLEANING

In [58]:
# importing dependencies 
import pandas as pd
import numpy as np
import plotly.express as px
import datetime
import seaborn as sns

In [2]:
#loading dataset
dtst = pd.read_csv('mapping_police_violence.csv')
dtst.head()

|  | name | age | gender | race | victim_image | date | street_address | city | state | zip | ... | congressperson_party | prosecutor_head | prosecutor_race | prosecutor_gender | prosecutor_party | prosecutor_term | prosecutor_in_court | prosecutor_special | independent_investigation | prosecutor_url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jonathan Foster | 38.0 | Male | Unknown race |  | 2/6/2024 | 43543 20th St W | Lancaster | CA | 93534.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | Eric Seckington | 65.0 | Male | White | https://hips.hearstapps.com/vidthumb/11640a0a-... | 2/6/2024 | 900 block of Great Bend Road | Altamonte Springs | FL |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | Sterling Ramon Alavache | 36.0 | Male | Black | https://wareham.theweektoday.com/sites/beaverd... | 2/6/2024 | 13099 US Highway 41 S.E./Cleveland Ave. | Fort Myers | FL | 33907.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | Decarlos Cornelius Long | 43.0 | Male | Black | https://postellsmortuary.com/wp-content/upload... | 2/6/2024 | Davisson Ave and Edgewater Drive | Fairview Shores | FL | 32804.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | Chase Ditter | 17.0 | Male | White | https://cdn.batesvilletechnology.com/fh_live/1... | 2/6/2024 | 3600 block of 39th Avenue | Columbus | NE | 68601.0 | ... |  |  |  |  |  |  |  |  |  |  |


In [3]:
#getting shape of the dataset
dtst.shape

(12717, 62)

In [4]:
#getting a list of all of the variables 
dtst.columns.values.tolist()

['name',
 'age',
 'gender',
 'race',
 'victim_image',
 'date',
 'street_address',
 'city',
 'state',
 'zip',
 'county',
 'agency_responsible',
 'ori',
 'cause_of_death',
 'circumstances',
 'disposition_official',
 'officer_charged',
 'news_urls',
 'signs_of_mental_illness',
 'allegedly_armed',
 'wapo_armed',
 'wapo_threat_level',
 'wapo_flee',
 'wapo_body_camera',
 'wapo_id',
 'off_duty_killing',
 'geography',
 'mpv_id',
 'fe_id',
 'encounter_type',
 'initial_reason',
 'officer_names',
 'officer_races',
 'officer_known_past_shootings',
 'call_for_service',
 'tract',
 'urban_rural_uspsai',
 'urban_rural_nchs',
 'hhincome_median_census_tract',
 'latitude',
 'longitude',
 'pop_total_census_tract',
 'pop_white_census_tract',
 'pop_black_census_tract',
 'pop_native_american_census_tract',
 'pop_asian_census_tract',
 'pop_pacific_islander_census_tract',
 'pop_other_multiple_census_tract',
 'pop_hispanic_census_tract',
 'congressional_district_113',
 'congressperson_lastname',
 'congressperso

In [5]:
#creating dataset with the columns of interest for this project
df = dtst[['name', 'age', 'gender', 'race', 'date', 'city','state','agency_responsible','hhincome_median_census_tract', 'latitude', 
           'longitude', 'pop_total_census_tract','pop_white_census_tract','pop_black_census_tract','pop_native_american_census_tract',
           'pop_asian_census_tract','pop_pacific_islander_census_tract','pop_other_multiple_census_tract','pop_hispanic_census_tract']]
df.head()

|  | name | age | gender | race | date | city | state | agency_responsible | hhincome_median_census_tract | latitude | longitude | pop_total_census_tract | pop_white_census_tract | pop_black_census_tract | pop_native_american_census_tract | pop_asian_census_tract | pop_pacific_islander_census_tract | pop_other_multiple_census_tract | pop_hispanic_census_tract |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jonathan Foster | 38.0 | Male | Unknown race | 2/6/2024 | Lancaster | CA | L.A. County Sheriff’s Department | 56250.0 | 34.673161 | -118.165536 | 4145.0 | 47% | 14% | 0% | 5% | 1% | 5% | 29% |
| 1 | Eric Seckington | 65.0 | Male | White | 2/6/2024 | Altamonte Springs | FL | Altamonte Springs Police Department | 58487.0 | 28.661109 | -81.365624 | 4381.0 | 69% | 5% | 0% | 3% | 0% | 3% | 19% |
| 2 | Sterling Ramon Alavache | 36.0 | Male | Black | 2/6/2024 | Fort Myers | FL | Lee County Sheriff's Office,Federal Bureau of ... | 37708.0 | 26.598555 | -81.871586 | 2391.0 | 57% | 11% | 0% | 1% | 0% | 8% | 22% |
| 3 | Decarlos Cornelius Long | 43.0 | Male | Black | 2/6/2024 | Fairview Shores | FL | Orange County Sheriff's Office | 87125.0 | 28.601872 | -81.40442 | 3866.0 | 67% | 6% | 0% | 6% | 0% | 2% | 19% |
| 4 | Chase Ditter | 17.0 | Male | White | 2/6/2024 | Columbus | NE | Columbus Police Department | 80238.0 | 41.453312 | -97.374876 | 2914.0 | 71% | 0% | 0% | 0% | 0% | 8% | 21% |


> I only need 19 variables to answer my questions about the victims of police brutality, so I dropped all of the other variables that are not necessary for my dashboard. 

In [6]:
df.dtypes

name                                  object
age                                  float64
gender                                object
race                                  object
date                                  object
city                                  object
state                                 object
agency_responsible                    object
hhincome_median_census_tract         float64
latitude                             float64
longitude                            float64
pop_total_census_tract               float64
pop_white_census_tract                object
pop_black_census_tract                object
pop_native_american_census_tract      object
pop_asian_census_tract                object
pop_pacific_islander_census_tract     object
pop_other_multiple_census_tract       object
pop_hispanic_census_tract             object
dtype: object

Goal of data cleaning: ensuring that all variables are in the correct data type

`age` from float -> int             
`date` from object -> datetime      
`pop_white_census_tract` from object -> float       
`pop_black_census_tract` from object -> float       
`pop_native_american_census_tract` from object -> float         
`pop_asian_census_tract` from object -> float       
`pop_pacific_islander_census_tract` from object -> float        
`pop_other_multiple_census_tract` from object -> float          
`pop_hispanic_census_tract` from object -> float 

In [7]:
#checking missing values
df.isnull().sum()

name                                   0
age                                  504
gender                                 9
race                                 223
date                                   0
city                                  10
state                                  0
agency_responsible                    24
hhincome_median_census_tract         102
latitude                               0
longitude                              0
pop_total_census_tract                62
pop_white_census_tract                62
pop_black_census_tract                62
pop_native_american_census_tract      62
pop_asian_census_tract                62
pop_pacific_islander_census_tract     62
pop_other_multiple_census_tract       62
pop_hispanic_census_tract             62
dtype: int64

> Given the size of the dataset, there are very few missing values. I am therefore choosing not to remove the rows that contain them because I don't believe they will have a great impact on the functioning of the dashboard. 

In [8]:
#checking how the missing values are encoded
df['gender'].unique().tolist()

['Male',
 'Female',
 'Unknown',
 nan,
 'Transgender Male',
 'Non-Binary',
 'Transgender']

In [9]:
#consolidating value types 
print(df['gender'].count()) #count() only counts non-null values, so this is the number of known genders before the replacement
df['gender'] = df['gender'].replace(np.nan, 'Unknown') #using the replace function to put all nan values into the unknown category
df['gender'] = df['gender'].replace("Transgender Male", "Transgender")
print(df['gender'].count()) #should now equal the total number of rows because no nan values remain
df['gender'].unique().tolist() #checking all of the nan values are gone

12708
12717


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['gender'] = df['gender'].replace(np.nan, 'Unknown') #using the replace function to put all nan values into the unknown category
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['gender'] = df['gender'].replace("Transgender Male", "Transgender")


['Male', 'Female', 'Unknown', 'Transgender', 'Non-Binary']
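
The `SettingWithCopyWarning` messages above appear because `df` was created by selecting columns from `dtst`, so pandas cannot tell whether the assignment is also meant to affect the original frame. A minimal sketch of one common way to avoid this, assuming a small made-up frame (`raw` and `subset` are illustrative names, not part of this notebook), is to take an explicit copy when selecting the columns:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the full dataset (illustrative values only)
raw = pd.DataFrame({
    "name": ["A", "B", "C"],
    "gender": ["Male", np.nan, "Transgender Male"],
    "zip": [93534.0, np.nan, 33907.0],
})

# .copy() makes the selection an independent DataFrame, so later
# assignments are unambiguous and do not touch `raw`
subset = raw[["name", "gender"]].copy()

subset["gender"] = subset["gender"].replace(np.nan, "Unknown")
subset["gender"] = subset["gender"].replace("Transgender Male", "Transgender")

print(subset["gender"].tolist())
print(raw["gender"].isna().sum())  # the original frame still has its NaN
```

Whether the warning is shown at all depends on the pandas version, but the explicit copy keeps the intent clear either way.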

In [10]:
#checking the unique values of race
df['race'].unique().tolist()

['Unknown race',
 'White',
 'Black',
 'Hispanic',
 nan,
 'Native Hawaiian and Pacific Islander',
 'Asian',
 'Native American']

In [11]:
#placing nan values and those listed as "unknown race" into the "unknown" category
df['race'] = df['race'].replace([np.nan, "Unknown race"], "Unknown")
df['race'].unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['race'] = df['race'].replace([np.nan, "Unknown race"], "Unknown")


array(['Unknown', 'White', 'Black', 'Hispanic',
       'Native Hawaiian and Pacific Islander', 'Asian', 'Native American'],
      dtype=object)

In [12]:
#seeing how missing values are encoded
df['city'].unique().tolist()

['Lancaster',
 'Altamonte Springs',
 'Fort Myers',
 'Fairview Shores',
 'Columbus',
 'Ozark',
 'Hackettstown ',
 'Graham',
 'Ansonia',
 'Cherryville',
 'Beaumont',
 'Jurupa Valley',
 'Los Angeles ',
 'Carol Stream',
 'Omaha ',
 'Tuscon',
 'Lauderhill',
 'Port St Lucie',
 'East Ridge',
 'Hamburg',
 'Massapequa',
 'Hanover',
 'Summerville',
 'North Fond du Lac',
 'Suitland',
 'Spokane',
 'New Carrollton',
 'Fryeburg',
 'Springfield',
 'Temple',
 'Peoria',
 'Decatur',
 'Willmar',
 'Palm Bay',
 'Fife Lake',
 'Ooltewah',
 'Phoenix',
 'Security-Widefield',
 'Evergreen',
 'Olympic Valley',
 'Philadelphia',
 'Piedmont',
 'San Antonio',
 'Makaha',
 'Greenfield',
 'Indianapolis',
 'Wichita',
 'Albuquerque ',
 'Washington',
 'Boise',
 'South Brunswick',
 'Elko',
 'Pittsburgh',
 'Bluffdale',
 'Lakewood',
 'Appleton',
 'Eureka',
 'Wildwood',
 'Nashua',
 'Gilbert',
 'Augusta',
 'Winston-Salem',
 'Crescent',
 'Bakersfield',
 'Norfolk',
 'Cleveland',
 'Landrum',
 'Miami',
 'Toledo',
 'Portland',
 'Sum

In [13]:
#replacing nan values with unknown 
df['city'] = df['city'].replace(np.nan, 'Unknown')
#removing the leading space from ' Kea‘au' and using a plain apostrophe so it matches the other Kea'au entries
df['city'] = df['city'].replace(" Kea‘au", "Kea\'au")
#df['city'].unique().tolist()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['city'] = df['city'].replace(np.nan, 'Unknown')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['city'] = df['city'].replace(" Kea‘au", "Kea\'au")
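
Several city names in the listing above carry trailing spaces (for example `'Hackettstown '`, `'Los Angeles '`, and `'Omaha '`), so they would be counted separately from the same names without the space. A possible extra cleaning step, sketched here on a small sample rather than on the full column, is to strip whitespace from every value:

```python
import pandas as pd

# A few city values copied from the listing above, including trailing spaces
cities = pd.Series(["Lancaster", "Hackettstown ", "Los Angeles ", "Omaha "])

# str.strip() removes leading and trailing whitespace from every value
cleaned = cities.str.strip()

print(cleaned.tolist())
```

This only handles whitespace; misspellings in the source data would still need to be corrected individually.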


In [14]:
#looking to see how missing values are encoded in the population columns 
df['pop_asian_census_tract'].unique().tolist()

['5%',
 '3%',
 '1%',
 '6%',
 '0%',
 '9%',
 '11%',
 '2%',
 '4%',
 '21%',
 '7%',
 '8%',
 '49%',
 '14%',
 '16%',
 '10%',
 '25%',
 '44%',
 '13%',
 '27%',
 '22%',
 '17%',
 '19%',
 '24%',
 nan,
 '50%',
 '18%',
 '36%',
 '39%',
 '12%',
 '28%',
 '29%',
 '32%',
 '26%',
 '64%',
 '15%',
 '88%',
 '20%',
 '23%',
 '65%',
 '47%',
 '55%',
 '40%',
 '35%',
 '43%',
 '42%',
 '41%',
 '45%',
 '51%',
 '37%',
 '31%',
 '34%',
 '74%',
 '30%',
 '53%',
 '52%',
 '33%',
 '38%',
 '59%',
 '83%',
 '73%',
 '56%',
 '77%',
 '84%',
 '91%',
 '69%',
 '79%',
 '61%',
 '68%',
 '78%',
 '62%',
 '46%',
 '75%',
 '54%',
 '58%',
 '89%',
 '63%',
 '67%',
 '72%',
 '48%',
 '57%',
 '70%',
 '80%',
 '60%',
 '66%']

In [15]:
#creating a function to take the % sign out of the race population columns and convert them to proportions between 0 and 1
def perc_to_dec(data, column_name):
    data[column_name] = data[column_name].str.replace("%", '').astype(float).div(100)
    return data

df = perc_to_dec(df, "pop_black_census_tract")
df = perc_to_dec(df, "pop_white_census_tract")
df = perc_to_dec(df, "pop_asian_census_tract")
df = perc_to_dec(df, "pop_native_american_census_tract")
df = perc_to_dec(df, "pop_pacific_islander_census_tract")
df = perc_to_dec(df, "pop_other_multiple_census_tract")
df = perc_to_dec(df, "pop_hispanic_census_tract")

df.head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column_name] = data[column_name].str.replace("%", '').astype(float).div(100)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column_name] = data[column_name].str.replace("%", '').astype(float).div(100)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column_name] = data[column_name].str

|  | name | age | gender | race | date | city | state | agency_responsible | hhincome_median_census_tract | latitude | longitude | pop_total_census_tract | pop_white_census_tract | pop_black_census_tract | pop_native_american_census_tract | pop_asian_census_tract | pop_pacific_islander_census_tract | pop_other_multiple_census_tract | pop_hispanic_census_tract |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jonathan Foster | 38.0 | Male | Unknown | 2/6/2024 | Lancaster | CA | L.A. County Sheriff’s Department | 56250.0 | 34.673161 | -118.165536 | 4145.0 | 0.47 | 0.14 | 0.0 | 0.05 | 0.01 | 0.05 | 0.29 |
| 1 | Eric Seckington | 65.0 | Male | White | 2/6/2024 | Altamonte Springs | FL | Altamonte Springs Police Department | 58487.0 | 28.661109 | -81.365624 | 4381.0 | 0.69 | 0.05 | 0.0 | 0.03 | 0.0 | 0.03 | 0.19 |
| 2 | Sterling Ramon Alavache | 36.0 | Male | Black | 2/6/2024 | Fort Myers | FL | Lee County Sheriff's Office,Federal Bureau of ... | 37708.0 | 26.598555 | -81.871586 | 2391.0 | 0.57 | 0.11 | 0.0 | 0.01 | 0.0 | 0.08 | 0.22 |
| 3 | Decarlos Cornelius Long | 43.0 | Male | Black | 2/6/2024 | Fairview Shores | FL | Orange County Sheriff's Office | 87125.0 | 28.601872 | -81.40442 | 3866.0 | 0.67 | 0.06 | 0.0 | 0.06 | 0.0 | 0.02 | 0.19 |
| 4 | Chase Ditter | 17.0 | Male | White | 2/6/2024 | Columbus | NE | Columbus Police Department | 80238.0 | 41.453312 | -97.374876 | 2914.0 | 0.71 | 0.0 | 0.0 | 0.0 | 0.0 | 0.08 | 0.21 |
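
The same percentage-to-proportion conversion can also be written so that it handles several columns in one call and returns a new frame instead of modifying its input. The sketch below is one possible variant, not the function used in this notebook; `percent_columns_to_proportion` and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

def percent_columns_to_proportion(data, columns):
    """Return a copy of `data` with "47%"-style strings converted to 0.47."""
    out = data.copy()
    for col in columns:
        # rstrip("%") drops the percent sign; NaN stays NaN through astype(float)
        out[col] = out[col].str.rstrip("%").astype(float).div(100)
    return out

# Made-up sample with one missing row, mirroring the NaN seen in the real columns
sample = pd.DataFrame({
    "pop_white_census_tract": ["47%", "69%", np.nan],
    "pop_black_census_tract": ["14%", "5%", np.nan],
})

converted = percent_columns_to_proportion(sample, list(sample.columns))
print(converted)
```

Working on a copy means the input frame keeps its original strings, which makes the step safe to re-run.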


In [16]:
#type casting the date variable from object to date
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')


In [17]:
df.dtypes

name                                         object
age                                         float64
gender                                       object
race                                         object
date                                 datetime64[ns]
city                                         object
state                                        object
agency_responsible                           object
hhincome_median_census_tract                float64
latitude                                    float64
longitude                                   float64
pop_total_census_tract                      float64
pop_white_census_tract                      float64
pop_black_census_tract                      float64
pop_native_american_census_tract            float64
pop_asian_census_tract                      float64
pop_pacific_islander_census_tract           float64
pop_other_multiple_census_tract             float64
pop_hispanic_census_tract                   float64
dtype: objec

> All of the data types are now in a suitable form for this dashboard, except that `age` is still float64 rather than the planned int because the column contains missing values. 
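
A plain `astype(int)` raises an error on a column that contains `NaN`, which is the case for `age` (504 missing values). One option for getting whole-number ages anyway, offered as an assumption about what would suit the dashboard rather than something this notebook does, is pandas' nullable `Int64` dtype, sketched on a small sample:

```python
import numpy as np
import pandas as pd

# Small sample with one missing age (illustrative values)
ages = pd.Series([38.0, 65.0, np.nan, 17.0], name="age")

# "Int64" (capital I) is pandas' nullable integer dtype; missing values become <NA>
ages_int = ages.astype("Int64")

print(ages_int.dtype)
print(ages_int.isna().sum())
```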

Goal for this next section of cleaning: renaming the median income and population columns to be more concise

`hhincome_median_census_tract` -> `median_income` 

`pop_total_census_tract` -> `total_pop`   

`pop_white_census_tract` -> `white_pop`   

`pop_black_census_tract` -> `black_pop`   

`pop_native_american_census_tract` -> `native_american_pop`    

`pop_asian_census_tract` -> `asian_pop`    

`pop_pacific_islander_census_tract` -> `pacific_islander_pop`     

`pop_other_multiple_census_tract` -> `other_multiple_pop`     

`pop_hispanic_census_tract` -> `hispanic_pop`    

In [18]:
df = df.rename(columns = {"hhincome_median_census_tract":"median_income", "pop_total_census_tract":"total_pop", 
                          "pop_white_census_tract":"white_pop", "pop_black_census_tract":"black_pop", 
                          "pop_native_american_census_tract":"native_american_pop", "pop_asian_census_tract":"asian_pop", 
                          "pop_pacific_islander_census_tract":"pacific_islander_pop", "pop_other_multiple_census_tract":"other_multiple_pop", 
                          "pop_hispanic_census_tract":"hispanic_pop"})
df.head()

|  | name | age | gender | race | date | city | state | agency_responsible | median_income | latitude | longitude | total_pop | white_pop | black_pop | native_american_pop | asian_pop | pacific_islander_pop | other_multiple_pop | hispanic_pop |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jonathan Foster | 38.0 | Male | Unknown | 2024-02-06 | Lancaster | CA | L.A. County Sheriff’s Department | 56250.0 | 34.673161 | -118.165536 | 4145.0 | 0.47 | 0.14 | 0.0 | 0.05 | 0.01 | 0.05 | 0.29 |
| 1 | Eric Seckington | 65.0 | Male | White | 2024-02-06 | Altamonte Springs | FL | Altamonte Springs Police Department | 58487.0 | 28.661109 | -81.365624 | 4381.0 | 0.69 | 0.05 | 0.0 | 0.03 | 0.0 | 0.03 | 0.19 |
| 2 | Sterling Ramon Alavache | 36.0 | Male | Black | 2024-02-06 | Fort Myers | FL | Lee County Sheriff's Office,Federal Bureau of ... | 37708.0 | 26.598555 | -81.871586 | 2391.0 | 0.57 | 0.11 | 0.0 | 0.01 | 0.0 | 0.08 | 0.22 |
| 3 | Decarlos Cornelius Long | 43.0 | Male | Black | 2024-02-06 | Fairview Shores | FL | Orange County Sheriff's Office | 87125.0 | 28.601872 | -81.40442 | 3866.0 | 0.67 | 0.06 | 0.0 | 0.06 | 0.0 | 0.02 | 0.19 |
| 4 | Chase Ditter | 17.0 | Male | White | 2024-02-06 | Columbus | NE | Columbus Police Department | 80238.0 | 41.453312 | -97.374876 | 2914.0 | 0.71 | 0.0 | 0.0 | 0.0 | 0.0 | 0.08 | 0.21 |


### Data Dictionary 

| Name of Variable | Definition of Variable |
| ---------------- | -------------------------- | 
| `name` | The name of the victim killed by police violence |
| `age` | The age of the victim at the time of their death |
| `gender` | The gender of the victim indicated by news and official reports |
| `race` | The race of the victim as reported by news and official reports | 
| `date` | Date of the incident in Y-M-D format |
| `city` | City of the incident |
| `state` | State of the incident |
| `agency_responsible` | Name of the agency that employs the officer that killed the victim |
| `median_income` | Median household income, from the American Community Survey, of the census tract where the incident occurred |
| `latitude` | Latitude coordinates of the incident |
| `longitude` | Longitude coordinates of the incident |
| `total_pop` | Total population of the census tract where the incident occurred, from the American Community Survey 5-year estimate from 2019 |
| `white_pop` | Proportion of the census tract's total population that listed White as their race |
| `black_pop` | Proportion of the census tract's total population that listed Black as their race |
| `native_american_pop` | Proportion of the census tract's total population that listed Native American as their race |
| `asian_pop` | Proportion of the census tract's total population that listed Asian as their race |
| `pacific_islander_pop` | Proportion of the census tract's total population that listed Pacific Islander as their race |
| `other_multiple_pop` | Proportion of the census tract's total population that listed Other or Multiple as their race |
| `hispanic_pop` | Proportion of the census tract's total population that listed Hispanic as their race |


## DATA EXPLORATION

In [19]:
#finding the dimensions of the cleaned dataset
df.shape

(12717, 19)

> This dataset has 19 columns and 12,717 observations.

In [20]:
# showing the number of missing values for each variable
df.isnull().sum()

name                      0
age                     504
gender                    0
race                      0
date                      0
city                      0
state                     0
agency_responsible       24
median_income           102
latitude                  0
longitude                 0
total_pop                62
white_pop                62
black_pop                62
native_american_pop      62
asian_pop                62
pacific_islander_pop     62
other_multiple_pop       62
hispanic_pop             62
dtype: int64

> Given that there are over 12,000 observations, I am not concerned that the missing values will affect how the dashboard functions. 

#### Numerical Data Exploration

In [21]:
# displaying a statistical summary table for the numerical variables
num_vars = ["age", "median_income", "total_pop", "white_pop", "black_pop", "native_american_pop", "asian_pop", "pacific_islander_pop", 
            "other_multiple_pop", "hispanic_pop"]

df[num_vars].describe()

|  | age | median_income | total_pop | white_pop | black_pop | native_american_pop | asian_pop | pacific_islander_pop | other_multiple_pop | hispanic_pop |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 12213.0 | 12615.0 | 12655.0 | 12655.0 | 12655.0 | 12655.0 | 12655.0 | 12655.0 | 12655.0 | 12655.0 |
| mean | 37.183002 | 55552.444867 | 4658.278546 | 0.524926 | 0.16513 | 0.012506 | 0.039357 | 0.002131 | 0.030157 | 0.223682 |
| std | 13.202758 | 27125.690591 | 2401.31268 | 0.310543 | 0.240287 | 0.064872 | 0.080588 | 0.013348 | 0.031422 | 0.254452 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 27.0 | 36965.0 | 3148.5 | 0.24 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.04 |
| 50% | 35.0 | 50258.0 | 4352.0 | 0.57 | 0.06 | 0.0 | 0.01 | 0.0 | 0.02 | 0.11 |
| 75% | 45.0 | 67102.0 | 5719.0 | 0.81 | 0.21 | 0.01 | 0.04 | 0.0 | 0.04 | 0.33 |
| max | 107.0 | 250001.0 | 61133.0 | 1.0 | 1.0 | 1.0 | 0.91 | 0.47 | 0.5 | 1.0 |


> Based on the statistical summary table above, the age variable indicates that victims of lethal police force are generally younger than 45 years old: 75% of the recorded ages are 45 or below. This suggests that younger people are often the targets of police brutality. Additionally, the median income variable shows that most incidents occur in lower-income areas, with 75% of incidents taking place in a census tract whose median household income is about $67,100 or less. 

In [25]:
#histogram of age
px.histogram(df, x = 'age', nbins = 30, title = "Age Histogram").show()

#boxplot of age
px.box(df, x = 'age', title = "Age Boxplot")

> The histogram shows the age variable is unimodal and right-skewed. The peak is at ages 30-34. The majority of the data is concentrated between ages 15 and 64. The boxplot shows that there are some outliers in the upper extreme of the distribution and any age over 72 is considered an outlier. There are no outliers on the lower extreme of the distribution. 
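
The outlier cutoff of 72 follows from the usual 1.5 × IQR rule for boxplots, using the age quartiles reported in the summary table (25% = 27, 75% = 45). A quick check, assuming that rule is what the plot applies:

```python
# Quartiles of age taken from the describe() table above
q1, q3 = 27.0, 45.0

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers

print(iqr, lower_fence, upper_fence)  # 18.0 0.0 72.0
```

The lower fence of 0 sits below the minimum recorded age of 1, which is why no outliers appear on the lower end.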

In [26]:
#histogram of the income variable 
px.histogram(df, x = "median_income", nbins = 40, title = "Median Income Histogram").show()

#boxplot of income variable
px.box(df, x = 'median_income', title = "Median Income Boxplot")

> The histogram for the median income variable indicates that most of the incidents in this dataset occurred in lower-income communities, with median household incomes below $60,000. However, lethal police force also occurs in wealthier communities, which shows that the issue is not confined to poor areas. The boxplot displays a large number of outliers on the upper end of the distribution.

In [27]:
# histogram of the total population variable
px.histogram(df, x = "total_pop", nbins = 70, title = "Total Population Histogram").show()

# boxplot of the total population variable
px.box(df, x = 'total_pop', title = "Total Population Boxplot")

> The histogram of total population shows that most cases of lethal police force occur in census tracts with a population under 10,000. Because `total_pop` measures the census tract rather than the whole city, and tracts are drawn to hold only a few thousand residents each, this does not show whether police violence is more common in small towns or in large cities. There are many outliers on the upper end of the boxplot. 

In [31]:
#looking at value counts to see how many zeros are in total population
df['total_pop'].value_counts()

total_pop
0.0        18
2861.0      9
2452.0      9
4179.0      9
4352.0      9
           ..
7182.0      1
11627.0     1
4134.0      1
5412.0      1
3657.0      1
Name: count, Length: 5925, dtype: int64

In [38]:
#removing the zeros to see how they affect the distribution because you can't have a population of 0 and saving to a new dataframe
df1 = df.drop(df[df['total_pop'] == 0].index)
df1[df1['total_pop'] == 0].value_counts()

Series([], Name: count, dtype: int64)

In [40]:
#histogram of total population without 0
px.histogram(df1, x = 'total_pop', title="Total Population Histogram without zeros", nbins = 70).show()

#boxplot of total population without 0 
px.box(df1, x = "total_pop", title = "Total Population Boxplot without zeros")

> Since there were only 18 zeros in this column, the plots didn't change drastically (the median shifted by 4 and the upper fence shifted by 3). I don't want to drop these rows because they still hold valuable information in other columns, such as name, age, and race. However, I am going to replace the zero values with NaN because a population of 0 does not make sense.

In [42]:
#replacing the 0 values in total population with NaN values 
df['total_pop'] = df['total_pop'].replace(0, np.nan)
df['total_pop'].unique().tolist()

[4145.0,
 4381.0,
 2391.0,
 3866.0,
 2914.0,
 2325.0,
 4585.0,
 5476.0,
 4712.0,
 2617.0,
 2602.0,
 3518.0,
 2757.0,
 4703.0,
 3893.0,
 4257.0,
 5266.0,
 4395.0,
 5886.0,
 3667.0,
 7460.0,
 3621.0,
 6285.0,
 6593.0,
 4555.0,
 2975.0,
 6276.0,
 3400.0,
 5010.0,
 1644.0,
 3423.0,
 3145.0,
 2774.0,
 2649.0,
 3805.0,
 6689.0,
 8079.0,
 6402.0,
 6948.0,
 4060.0,
 1651.0,
 5031.0,
 7028.0,
 5759.0,
 3162.0,
 8332.0,
 2600.0,
 2357.0,
 1735.0,
 1602.0,
 2758.0,
 7847.0,
 1685.0,
 1630.0,
 5661.0,
 3485.0,
 6972.0,
 2883.0,
 4680.0,
 14170.0,
 4005.0,
 4409.0,
 4008.0,
 3461.0,
 4917.0,
 2278.0,
 5684.0,
 2803.0,
 3730.0,
 3619.0,
 3073.0,
 3486.0,
 3220.0,
 3738.0,
 4773.0,
 3914.0,
 5435.0,
 4176.0,
 3806.0,
 5466.0,
 3359.0,
 7761.0,
 6216.0,
 736.0,
 2215.0,
 4804.0,
 3428.0,
 3561.0,
 3243.0,
 7714.0,
 8390.0,
 5564.0,
 2935.0,
 4621.0,
 4236.0,
 5032.0,
 1672.0,
 5597.0,
 6686.0,
 4007.0,
 3944.0,
 7753.0,
 3233.0,
 3460.0,
 3881.0,
 3160.0,
 2147.0,
 7874.0,
 5828.0,
 5552.0,
 11793.0,


In [44]:
#creating a new df for the race proportions to then melt 
race_df = df[['name', 'asian_pop', 'white_pop', 'black_pop', 'native_american_pop', 'pacific_islander_pop', 'other_multiple_pop', 'hispanic_pop']]
race_df = pd.melt(race_df, id_vars='name', 
                  value_vars= ['asian_pop', 'white_pop', 'black_pop', 'native_american_pop', 'pacific_islander_pop', 'other_multiple_pop', 
                               'hispanic_pop'],
                               var_name= "race", 
                               value_name= "proportion")

race_df.head()



|  | name | race | proportion |
| --- | --- | --- | --- |
| 0 | Jonathan Foster | asian_pop | 0.05 |
| 1 | Eric Seckington | asian_pop | 0.03 |
| 2 | Sterling Ramon Alavache | asian_pop | 0.01 |
| 3 | Decarlos Cornelius Long | asian_pop | 0.06 |
| 4 | Chase Ditter | asian_pop | 0.0 |


In [45]:
#generating a grouped boxplot for each race proportion
px.box(race_df, x = 'race', y = 'proportion', title="Grouped Boxplot of Race Proportions")

> This grouped boxplot shows the proportion of each race in the communities where incidents occurred. The White boxplot is the only one without outliers on the upper end. Pacific Islander has the narrowest box and the lowest proportions. After looking at some of the data points, it appears that the many upper-end outliers could be due to the number of 0's in the data for non-White racial groups. 

#### Categorical Data Exploration

In [51]:
#bar chart of race
px.histogram(df, x = 'race', title="Race Bar Chart")


> Based on this data, the majority of the victims are White people. However, this chart is simply based on counts and not on the true proportions of each race in the US population. As stated in the data provenance section, the researchers found that Black people are 3x more likely than White people to become a victim of lethal police brutality, based on proportional data. 

In [52]:
#grouped bar chart for gender
px.histogram(df, x = 'gender', title = "Gender Grouped Bar Chart")

In [53]:
#bar chart was unclear for 'unknown', 'transgender' and 'non-binary' values. Now displaying value counts
df['gender'].value_counts()

gender
Male           11977
Female           700
Unknown           26
Transgender       13
Non-Binary         1
Name: count, dtype: int64

> Most of the victims in this dataset are male. Gender non-conforming individuals, as identified in the news and official reports, make up the smallest number of victims in this dataset. 
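
To put these counts on a common scale, they can be converted to shares of all victims. The sketch below rebuilds the counts from the output above by hand; on the real column, `df['gender'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
gender_counts = pd.Series(
    {"Male": 11977, "Female": 700, "Unknown": 26, "Transgender": 13, "Non-Binary": 1}
)

# Dividing by the total turns counts into shares of all victims
gender_share = (gender_counts / gender_counts.sum()).round(4)

print(gender_share)
```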

In [54]:
#displaying the value counts for each state
df['state'].value_counts()

state
CA    1817
TX    1197
FL     875
AZ     553
GA     494
CO     405
NC     363
OH     346
WA     339
TN     329
OK     327
MO     326
IL     300
PA     285
NY     276
LA     248
NM     248
AL     246
IN     229
MI     223
VA     220
SC     205
NV     203
KY     198
MD     193
OR     191
WI     186
MS     181
AR     172
NJ     160
UT     147
MN     127
KS     121
WV     114
ID      95
MA      87
MT      73
IA      73
AK      72
NE      64
CT      58
HI      57
ME      56
DC      45
SD      41
WY      37
NH      32
DE      31
ND      23
VT      18
RI      11
Name: count, dtype: int64

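> California and Texas are the two most common states in this dataset, meaning that, in terms of raw numbers, the most incidents of lethal police force occurred in these two states. They are also the two most populous states, so raw counts alone do not show where the rate of incidents is highest. Rhode Island is the state with the fewest incidents. 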

In [57]:
# value counts of the agencies
df['agency_responsible'].value_counts().to_dict()

{'Los Angeles Police Department': 184,
 'Phoenix Police Department': 159,
 "Los Angeles County Sheriff's Department": 143,
 'New York Police Department': 115,
 'Houston Police Department': 109,
 'Chicago Police Department': 100,
 'Las Vegas Metropolitan Police Department': 93,
 'San Antonio Police Department': 89,
 'U.S. Marshals Service ': 89,
 'California Highway Patrol ': 68,
 'Miami-Dade Police Department': 67,
 "Jacksonville Sheriff's Office": 65,
 'Pennsylvania State Police': 60,
 'Oklahoma City Police Department': 60,
 'Albuquerque Police Department': 57,
 'Kentucky State Police': 57,
 "Riverside County Sheriff's Department": 57,
 "San Bernardino County Sheriff's Department": 56,
 'Philadelphia Police Department': 53,
 'Dallas Police Department': 50,
 'San Diego Police Department': 50,
 'Tucson Police Department': 50,
 'Denver Police Department': 49,
 'Austin Police Department': 47,
 'Columbus Division of Police': 46,
 'St. Louis Metropolitan Police Department': 46,
 "Harris Cou

> While we can't see all of the data due to the number of departments in this dataset, it is apparent that the majority of police departments have lethal incidents in the single digits. Furthermore, the top 6 departments are the only ones with 100 or more lethal police force cases. 
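
How many agencies clear the 100-case mark depends on how the boundary is treated, because the Chicago Police Department sits at exactly 100. A small check on the largest counts from the output above:

```python
import pandas as pd

# The largest agency counts, copied from the output above
top_counts = pd.Series({
    "Los Angeles Police Department": 184,
    "Phoenix Police Department": 159,
    "Los Angeles County Sheriff's Department": 143,
    "New York Police Department": 115,
    "Houston Police Department": 109,
    "Chicago Police Department": 100,
    "Las Vegas Metropolitan Police Department": 93,
})

over_100 = int((top_counts > 100).sum())       # strictly more than 100
at_least_100 = int((top_counts >= 100).sum())  # 100 or more

print(over_100, at_least_100)  # 5 6
```

The same comparison pattern, applied to the full `value_counts()` result with `< 10`, would be one way to count how many agencies have single-digit totals.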

In [63]:
# exporting the final dataset
df.to_csv("data.csv")
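
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. If the dashboard does not need that column, passing `index=False` is one way to leave it out. The sketch below uses a small in-memory sample (`sample` is illustrative) instead of writing `data.csv`:

```python
import io
import pandas as pd

sample = pd.DataFrame({"name": ["A", "B"], "age": [38.0, 65.0]})

# Default behaviour: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'name', 'age']
print(without_index.columns.tolist())  # ['name', 'age']
```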

## UI COMPONENTS BRAINSTORM

I want every graphic to be highly interactive and let the user customize many of the elements based on the kind of data they are looking for. Therefore, each graphic is going to have at least one interactive element. For example, I am planning on making a heatmap that the user can filter by age, race, gender, state, or police department using radio buttons. A dropdown will then update its accepted values based on which radio button the user selects. Additionally, I am planning on having a date range slider that the user can use on the heatmap. The user will also be able to select distribution charts for age, race, and gender to see the counts for each of those variables. Finally, the user will be able to select states and towns to see the population proportions of the different races that were recorded. 

- Radio buttons
- Dropdown
- Date range slider
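
The dependent dropdown described above needs, for whichever variable the radio buttons select, the list of values that variable can take. The dashboard framework is not specified here, so the sketch below covers only the data side; `dropdown_options` and the sample rows are hypothetical.

```python
import pandas as pd

def dropdown_options(data, column):
    """Return the sorted unique non-missing values of `column` for a dropdown."""
    return sorted(data[column].dropna().unique().tolist())

# Made-up rows standing in for the cleaned dataset
sample = pd.DataFrame({
    "race": ["White", "Black", "Unknown", "Black"],
    "gender": ["Male", "Female", "Male", "Male"],
    "state": ["CA", "FL", "NE", "FL"],
})

print(dropdown_options(sample, "race"))
print(dropdown_options(sample, "state"))
```

A continuous variable such as age would likely need bins or a range control rather than one dropdown entry per value.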

## DATA VISUALIZATION BRAINSTORM

1. The first visual in my app will be a heatmap of the United States shaded by the number of lethal police force incidents in each state. The user will use radio buttons to select a variable to filter on and then choose a value (e.g., race, then Black), producing a heatmap of the number of matching victims in each state/region. 

2. My next visual will be a changing histogram to show how some of the variables are distributed such as age, race, and gender. The user will be able to select which variable they want to focus on via selecting radio buttons. 

3. The final visual will be the number of victims killed by police in a specific city, state, or police department. Users will be able to choose which search option they would like and then use a search/dropdown box to find the desired city, state, or police department. Additionally, the names of the victims will be displayed below the number to ensure that their names are not forgotten. 
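
The data behind the first and third visuals can be sketched with plain pandas, independent of whichever dashboard framework ends up being used. Both helper functions (`state_counts`, `victims_for`) and the sample rows below are hypothetical and use placeholder names rather than real records.

```python
import pandas as pd

def state_counts(data, column, value, start, end):
    """Count incidents per state where `column` equals `value` within [start, end]."""
    in_range = data["date"].between(pd.Timestamp(start), pd.Timestamp(end))
    selected = data[in_range & (data[column] == value)]
    return selected["state"].value_counts()

def victims_for(data, field, value):
    """Return the number of victims and their names for one city, state, or agency."""
    names = data.loc[data[field] == value, "name"].tolist()
    return len(names), names

# Made-up rows standing in for the cleaned dataset
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "race": ["Black", "White", "Black", "Black"],
    "state": ["FL", "FL", "FL", "NE"],
    "date": pd.to_datetime(["2024-02-06", "2024-02-06", "2023-05-01", "2019-07-15"]),
})

counts = state_counts(sample, "race", "Black", "2023-01-01", "2024-12-31")
total, names = victims_for(sample, "state", "FL")
print(counts.to_dict(), total, names)
```

The per-state counts could then be passed to a Plotly map such as `px.choropleth` with `locationmode="USA-states"`, while the count and name list would feed the third visual.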
