# Data processing

In this part, the goal is to import our data, convert it to a format suitable for producing descriptive statistics, and export the result so that these databases can be used by the other programs.

The databases in question are:
- the database listing all gun-violence incidents in the USA between 2013 and 2018
- the database listing the general characteristics of counties and their inhabitants.

In [3]:
#Standard data processing
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import MultiPolygon
from tqdm import tqdm

#Interaction with the API
import requests
from statistics import mean
import time

## Processing the gun-violence incidents database

In [2]:
#Gun-violence incidents database
url="https://drive.google.com/file/d/1GGOLMc_Ow9yZC9sICegPegDggQuHOD3t/view?usp=drive_link"
url="https://drive.google.com/uc?export=download&confirm=1&id=" + url.split("/")[-2]
gun_violence_db = pd.read_csv(url)
gun_violence_db.sample(5)

Unnamed: 0,incident_id,date,state,city_or_county,address,n_killed,n_injured,incident_url,source_url,incident_url_fields_missing,...,participant_age,participant_age_group,participant_gender,participant_name,participant_relationship,participant_status,participant_type,sources,state_house_district,state_senate_district
105952,477684,2016-01-02,California,Richmond,2200 block of Cutting Boulevard,0,1,http://www.gunviolencearchive.org/incident/477684,http://www.insidebayarea.com/crime-courts/ci_2...,False,...,0::52,0::Adult 18+||1::Adult 18+,0::Female||1::Male,0::Claire Dugan,,0::Injured||1::Unharmed,0::Victim||1::Subject-Suspect,http://www.insidebayarea.com/breaking-news/ci_...,15.0,9.0
127111,563704,2016-05-21,Michigan,Detroit,13900 block of Wyoming,0,3,http://www.gunviolencearchive.org/incident/563704,http://www.detroitnews.com/story/news/local/de...,False,...,0::20||1::21||2::21,0::Adult 18+||1::Adult 18+||2::Adult 18+,0::Male||1::Male||2::Male,,,0::Injured||1::Injured||2::Injured,0::Victim||1::Victim||2::Victim,http://www.detroitnews.com/story/news/local/de...,7.0,3.0
40017,216857,2014-10-03,California,Oakland,1100 10th Avenue,0,0,http://www.gunviolencearchive.org/incident/216857,https://data.oaklandnet.com/Public-Safety/Crim...,False,...,,,0::Male||1::Male,,,,0::Victim||1::Subject-Suspect,https://data.oaklandnet.com/Public-Safety/Crim...,18.0,9.0
157829,737796,2016-11-22,Alaska,Soldotna,,0,0,http://www.gunviolencearchive.org/incident/737796,http://www.ktuu.com/content/news/Soldotna-conv...,False,...,0::32,0::Adult 18+,0::Male,0::Scott Hashemian,,"0::Unharmed, Arrested||1::Unharmed",0::Subject-Suspect||1::Subject-Suspect,http://www.ktuu.com/content/news/Soldotna-conv...,29.0,
171978,774387,2017-02-14,Arkansas,Jonesboro,109 Harvester Dr,0,0,http://www.gunviolencearchive.org/incident/774387,http://www.kait8.com/story/34500618/deputys-ca...,False,...,,,,,,,,http://www.kait8.com/story/34500618/deputys-ca...,58.0,21.0


As described in the database documentation, some columns are encoded so that they can be converted back into dictionaries:

In [3]:
def convert_to_dict(value):
    if pd.isna(value):
        return value

    pairs = value.split('||')
    result_dict = {}
    for pair in pairs:
        #Some entries are malformed: a single ':' separator instead of '::'
        if '::' in pair:
            key, val = pair.split('::', 1)
            result_dict[int(key)] = val
        else:
            key, val = pair.split(':', 1)
            result_dict[int(key)] = val
    return result_dict

list_of_dict_columns = ['gun_stolen', 'gun_type', 'participant_age', 'participant_age_group', 'participant_gender', 'participant_name', 'participant_relationship', 'participant_status', 'participant_type']
#Column-wise Series.map is used instead of DataFrame.applymap, which is deprecated in recent pandas versions
gun_violence_db[list_of_dict_columns] = gun_violence_db[list_of_dict_columns].apply(lambda col: col.map(convert_to_dict))
gun_violence_db.head()





Unnamed: 0,incident_id,date,state,city_or_county,address,n_killed,n_injured,incident_url,source_url,incident_url_fields_missing,...,participant_age,participant_age_group,participant_gender,participant_name,participant_relationship,participant_status,participant_type,sources,state_house_district,state_senate_district
0,461105,2013-01-01,Pennsylvania,Mckeesport,1506 Versailles Avenue and Coursin Street,0,4,http://www.gunviolencearchive.org/incident/461105,http://www.post-gazette.com/local/south/2013/0...,False,...,{0: '20'},"{0: 'Adult 18+', 1: 'Adult 18+', 2: 'Adult 18+...","{0: 'Male', 1: 'Male', 3: 'Male', 4: 'Female'}",{0: 'Julian Sims'},,"{0: 'Arrested', 1: 'Injured', 2: 'Injured', 3:...","{0: 'Victim', 1: 'Victim', 2: 'Victim', 3: 'Vi...",http://pittsburgh.cbslocal.com/2013/01/01/4-pe...,,
1,460726,2013-01-01,California,Hawthorne,13500 block of Cerise Avenue,1,3,http://www.gunviolencearchive.org/incident/460726,http://www.dailybulletin.com/article/zz/201301...,False,...,{0: '20'},"{0: 'Adult 18+', 1: 'Adult 18+', 2: 'Adult 18+...",{0: 'Male'},{0: 'Bernard Gillis'},,"{0: 'Killed', 1: 'Injured', 2: 'Injured', 3: '...","{0: 'Victim', 1: 'Victim', 2: 'Victim', 3: 'Vi...",http://losangeles.cbslocal.com/2013/01/01/man-...,62.0,35.0
2,478855,2013-01-01,Ohio,Lorain,1776 East 28th Street,1,3,http://www.gunviolencearchive.org/incident/478855,http://chronicle.northcoastnow.com/2013/02/14/...,False,...,"{0: '25', 1: '31', 2: '33', 3: '34', 4: '33'}","{0: 'Adult 18+', 1: 'Adult 18+', 2: 'Adult 18+...","{0: 'Male', 1: 'Male', 2: 'Male', 3: 'Male', 4...","{0: 'Damien Bell', 1: 'Desmen Noble', 2: 'Herm...",,"{0: 'Injured, Unharmed, Arrested', 1: 'Unharme...","{0: 'Subject-Suspect', 1: 'Subject-Suspect', 2...",http://www.morningjournal.com/general-news/201...,56.0,13.0
3,478925,2013-01-05,Colorado,Aurora,16000 block of East Ithaca Place,4,0,http://www.gunviolencearchive.org/incident/478925,http://www.dailydemocrat.com/20130106/aurora-s...,False,...,"{0: '29', 1: '33', 2: '56', 3: '33'}","{0: 'Adult 18+', 1: 'Adult 18+', 2: 'Adult 18+...","{0: 'Female', 1: 'Male', 2: 'Male', 3: 'Male'}","{0: 'Stacie Philbrook', 1: 'Christopher Ratlif...",,"{0: 'Killed', 1: 'Killed', 2: 'Killed', 3: 'Ki...","{0: 'Victim', 1: 'Victim', 2: 'Victim', 3: 'Su...",http://denver.cbslocal.com/2013/01/06/officer-...,40.0,28.0
4,478959,2013-01-07,North Carolina,Greensboro,307 Mourning Dove Terrace,2,2,http://www.gunviolencearchive.org/incident/478959,http://www.journalnow.com/news/local/article_d...,False,...,"{0: '18', 1: '46', 2: '14', 3: '47'}","{0: 'Adult 18+', 1: 'Adult 18+', 2: 'Teen 12-1...","{0: 'Female', 1: 'Male', 2: 'Male', 3: 'Female'}","{0: 'Danielle Imani Jameison', 1: 'Maurice Eug...",{3: 'Family'},"{0: 'Injured', 1: 'Injured', 2: 'Killed', 3: '...","{0: 'Victim', 1: 'Victim', 2: 'Victim', 3: 'Su...",http://myfox8.com/2013/01/08/update-mother-sho...,62.0,27.0
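
To make the encoding concrete, here is a minimal, self-contained sketch of the same parsing applied to a single cell value. `parse_encoded_field` is a hypothetical stand-in for `convert_to_dict`, and the sample string is made up for illustration (it is not taken from the data); its third pair deliberately uses the malformed single-colon separator.

```python
def parse_encoded_field(value):
    """Parse 'index::value||index::value' into {index: value} (hypothetical helper)."""
    result = {}
    for pair in value.split("||"):
        # Fall back to a single ':' when the '::' separator is missing
        sep = "::" if "::" in pair else ":"
        key, val = pair.split(sep, 1)
        result[int(key)] = val
    return result


# Illustrative value only: the third pair uses ':' instead of '::'
sample = "0::Male||1::Female||2:Male"
parsed = parse_encoded_field(sample)
print(parsed)
```

Keys are converted to integers so that the same participant index can be looked up across columns such as `participant_age` and `participant_gender`.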


In [4]:
gun_violence_db.to_csv("data/gun_violence_db.csv", index=False)

## Processing the county databases from the API


The documentation describing how to interact with the St. Louis Fed (FRED) API is available at https://fred.stlouisfed.org/docs/api/fred/#API.
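
As a rough illustration of the request format used in this section, the sketch below assembles a `fred/category/children` URL without sending it. The helper name `build_children_url` and the placeholder key `YOUR_API_KEY` are assumptions made for the example; the endpoint and parameter names are the ones used in the cells of this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.stlouisfed.org/fred/category/children"


def build_children_url(category_id, api_key="YOUR_API_KEY"):
    """Return the full URL listing the child categories of `category_id` (hypothetical helper)."""
    params = {"api_key": api_key, "file_type": "json", "category_id": category_id}
    # urlencode keeps the dict order and converts the values to strings
    return BASE_URL + "?" + urlencode(params)


# 27281 is the category id used as the starting point in this notebook
example_url = build_children_url(27281)
print(example_url)
```

Passing the same dictionary through `requests.get(url, params=...)` produces an equivalent query string, which is what the notebook relies on.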

In [4]:
#Each request is categorized with an url and an id
#The gist here is to recover the proper id to retrieve data
api_key = "180de2e6a1d1e953d270ebf38341cd44"
param = {"api_key" : api_key, "file_type" : "json", "category_id" : "27281"}
url = "https://api.stlouisfed.org/fred/category/children?"

In [5]:
def request_db(index):
    #this function requests to the API the database associated with the category id index
    param["category_id"] = index #adjust the request parameters to ask for the right category
    response = requests.get(url, params = param)
    data = response.json()
    return data

In [7]:
#Some names are ambiguous between dframes
def simplify_name(name):
    if name.endswith("County"):
        return name.rsplit("County", 1)[0].strip()
    if name.endswith("Parish"):
        return name.rsplit("Parish", 1)[0].strip()
    if name.endswith("Census Area"):
        return name.replace("Census Area", "CA")
    if name.endswith("Borough/city"):
        return name.replace("Borough/city", "Cty&Bor")      
    if name.endswith("Municipality"):
        return name.replace("Municipality", "Muny")
    if name.endswith("Borough/municipality"):
        return name.replace("Borough/municipality", "Muny")    
    if name.endswith("County/city"):
        return name.rsplit("County/city", 1)[0].strip() 

        
    return name

In [8]:
us_data = request_db(27281)['categories']
#We create our dframe by creating a list of dicts, each element is a new row
database = list()
for state in us_data:
    id_state = state['id']
    state_name = state['name']
    
    #Request to recover id in order to extract counties
    state_info = request_db(id_state)["categories"]
    if state_info != []: #One exception : which one ?
        id_list_of_state_counties = state_info[0]['id']
        list_of_state_counties = request_db(id_list_of_state_counties)["categories"]
        for county in list_of_state_counties:
            id_county = county['id']
            
            parts = county['name'].split(', ')
            county_name, state_code = parts[0], parts[-1]
            
            #A category may cover two counties ("A + B"): add one row per county.
            #A new dict is built for each row: appending the same dict twice and then
            #updating it would make both rows identical.
            for name in county_name.split(' + '):
                database.append({
                    'Nom': simplify_name(name),
                    'Etat': state_name,
                    'Code_Etat': state_code,
                    'id_Etat': id_state,
                    'id_county': id_county,
                })

counties_db = pd.DataFrame(database)

In [9]:
counties_db.sample(5)

Unnamed: 0,Nom,Etat,Code_Etat,id_Etat,id_county
2149,Choctaw,Oklahoma,OK,27318,29510
3009,Brooke,West Virginia,WV,27332,30381
1413,Bolivar,Mississippi,MS,153,612
1767,Pershing,Nevada,NV,27310,29120
1329,Carlton,Minnesota,MN,27305,28676


In [10]:
#Handling of exceptions for the merge (done case by case, since the merge is on Code_Etat + Nom)

counties_db.loc[(counties_db['Code_Etat'] == 'Aleutian Islands Census Area'), 'Code_Etat'] = 'AK'
counties_db.loc[(counties_db['Code_Etat'] == 'District of Columbia'), 'Code_Etat'] = 'DC'
counties_db.loc[(counties_db['Nom'] == 'De Soto'), 'Nom'] = 'DeSoto'
counties_db.loc[(counties_db['Nom'] == 'DeSoto') & (counties_db['Code_Etat'] == 'LA'), 'Nom'] = 'De Soto'
counties_db.loc[(counties_db['Nom'] == 'De Kalb'), 'Nom'] = 'DeKalb'
counties_db.loc[(counties_db['Nom'] == 'Du Page'), 'Nom'] = 'DuPage'
counties_db.loc[(counties_db['Nom'] == 'La Salle'), 'Nom'] = 'LaSalle'
counties_db.loc[(counties_db['Nom'] == 'La Porte'), 'Nom'] = 'LaPorte'
counties_db.loc[(counties_db['Nom'] == 'Lagrange'), 'Nom'] = 'LaGrange'
counties_db.loc[(counties_db['Nom'] == 'LaFourche'), 'Nom'] = 'Lafourche'
counties_db.loc[(counties_db['Nom'] == 'Lac Qui Parle'), 'Nom'] = 'Lac qui Parle'
counties_db.loc[(counties_db['Nom'] == 'Dona Ana'), 'Nom'] = 'Doña Ana'
counties_db.loc[(counties_db['Nom'] == 'La Moure'), 'Nom'] = 'LaMoure'
counties_db.loc[(counties_db['Nom'] == 'De Witt'), 'Nom'] = 'DeWitt'
counties_db.loc[(counties_db['Nom'] == 'DeWitt') & (counties_db['Code_Etat'] == 'IL'), 'Nom'] = 'De Witt'
counties_db.loc[(counties_db['Nom'] == 'LaSalle') & (counties_db['Code_Etat'] == 'TX'), 'Nom'] = 'La Salle'
counties_db.loc[(counties_db['Code_Etat'] == 'WI (includes Menominee)'), 'Code_Etat'] = 'WI'
counties_db.loc[(counties_db['Nom'] == 'Fond Du Lac'), 'Nom'] = 'Fond du Lac'

We now have a first DataFrame listing all US counties along with the IDs used to find them in the API. We can now extract, for each county, the socio-demographic information needed to produce our descriptive statistics and our model.

Note: id_county serves as the primary key in this database (within the FRED API).

In [11]:
counties_db.to_csv("data/counties_db.csv", index=False)

## Retrieving geographic data by county

In [12]:
#We change the URL to retrieve geographic data for all counties (along with the FIPS code and the simplified name)
#The challenge is to match the data to the right counties: apart from the name, the API id codes have no counterpart in the geographic data
url = "https://api.stlouisfed.org/geofred/shapes/file?shape=county"

In [13]:
geom_counties_db = request_db(29802)#The code is arbitrary here, each request gives the geometries of all counties
geom_counties_db = gpd.GeoDataFrame.from_features(geom_counties_db['features'])
geom_counties_db = geom_counties_db.loc[geom_counties_db['hc-group'] == 'admin2']
geom_counties_db['Code_Etat'] = geom_counties_db['hc-key'].apply(lambda x: x.split('-')[1].upper() if len(x.split('-')) > 1 else None)
geom_counties_db['name'] = geom_counties_db['name'].apply(lambda x : x.rsplit("Parish", 1)[0].strip())
geom_counties_db.sample(5)


Unnamed: 0,geometry,hc-group,hc-middle-x,hc-middle-y,hc-key,hc-a2,fips,name,Code_Etat
2132,"POLYGON ((5972.000 7969.000, 5998.000 7971.000...",admin2,0.5,0.52,us-wi-015,CA,55015,Calumet,WI
1904,"POLYGON ((3678.000 5110.000, 3677.000 5072.000...",admin2,0.51,0.5,us-tx-433,ST,48433,Stonewall,TX
2524,"POLYGON ((4953.000 5902.000, 4970.000 5902.000...",admin2,0.13,0.51,us-ar-007,BE,5007,Benton,AR
1465,"POLYGON ((7192.000 5038.000, 7193.000 5038.000...",admin2,0.64,0.61,us-ga-225,PE,13225,Peach,GA
352,"POLYGON ((1532.000 7434.000, 1502.000 7440.000...",admin2,0.54,0.48,us-ut-005,CA,49005,Cache,UT


In [14]:
counties_db = pd.merge(left= counties_db, right=geom_counties_db, how='left', left_on=['Nom', 'Code_Etat'], right_on=['name', 'Code_Etat'])
counties_db = counties_db.loc[:,['Nom', 'Etat', 'Code_Etat', 'id_Etat', 'id_county', 'fips', 'geometry']]
counties_db.sample(10)

Unnamed: 0,Nom,Etat,Code_Etat,id_Etat,id_county,fips,geometry
2068,Darke,Ohio,OH,27317,29428,39037,"POLYGON ((6846.000 7003.000, 6854.000 6945.000..."
338,Dixie,Florida,FL,27291,27674,12029,"MULTIPOLYGON (((7531.000 4373.000, 7527.000 43..."
1352,Jackson,Minnesota,MN,27305,28699,27063,"MULTIPOLYGON (((4653.000 7807.000, 4705.000 78..."
390,Walton,Florida,FL,27291,27725,12131,"POLYGON ((6726.000 4425.000, 6743.000 4435.000..."
1715,Lancaster,Nebraska,NE,27309,29067,31109,"POLYGON ((4369.000 6943.000, 4369.000 6965.000..."
563,Bingham,Idaho,ID,27294,27899,16011,"POLYGON ((1366.000 7867.000, 1349.000 7847.000..."
1203,Baltimore City,Maryland,MD,27302,28547,24510,"POLYGON ((8448.000 7029.000, 8428.000 7029.000..."
186,Van Buren,Arkansas,AR,149,658,5141,"POLYGON ((5308.000 5728.000, 5308.000 5709.000..."
2551,Brazos,Texas,TX,27326,29919,48041,"POLYGON ((4503.000 4452.000, 4502.000 4443.000..."
2078,Greene,Ohio,OH,27317,29438,39057,"POLYGON ((7019.000 6857.000, 6981.000 6855.000..."


Counties for which the geographic join is not possible:

In [15]:
counties_db.loc[counties_db['geometry'].isna()]

Unnamed: 0,Nom,Etat,Code_Etat,id_Etat,id_county,fips,geometry
68,Aleutian Islands CA,Alaska,AK,27283,33743,,
90,Prince of Wales-Outer Ketchikan CA,Alaska,AK,27283,27421,,
92,Skagway-Hoonah-Angoon CA,Alaska,AK,27283,27423,,
96,Wade Hampton CA,Alaska,AK,27283,27426,,
97,Wrangell Borough/City,Alaska,AK,27283,33518,,
98,Wrangell-Petersburg CA,Alaska,AK,27283,27427,,
99,Yakutat City and Borough,Alaska,Yakutat City and Borough,27283,32212,,
2425,Shannon,South Dakota,SD,27324,29791,,
2856,Clifton Forge City,Virginia,VA,27330,30228,,
2945,South Boston City,Virginia,VA,27330,32143,,
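
Another way to list unmatched rows, shown here as a small sketch on toy tables rather than on the real data, is to pass `indicator=True` to `pd.merge`: pandas then adds a `_merge` column whose value is `left_only` for rows without a counterpart on the right. The two toy frames below reuse the notebook's column names; their contents are illustrative.

```python
import pandas as pd

# Toy tables (illustrative content, same column names as in the notebook)
left = pd.DataFrame({"Nom": ["Autauga", "Shannon"], "Code_Etat": ["AL", "SD"]})
right = pd.DataFrame({"name": ["Autauga"], "Code_Etat": ["AL"], "fips": ["01001"]})

merged = pd.merge(
    left,
    right,
    how="left",
    left_on=["Nom", "Code_Etat"],
    right_on=["name", "Code_Etat"],
    indicator=True,  # adds a '_merge' column: 'both' or 'left_only' for a left join
)

# Rows present in the left table only, i.e. counties without geometry
unmatched = merged.loc[merged["_merge"] == "left_only", ["Nom", "Code_Etat"]]
print(unmatched)
```

This makes the diagnosis independent of how missing geometries are represented in the merged column.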


## Retrieving the FRED series

Here, the goal is to retrieve key data for each county: population, unemployment rate, welfare recipients, etc.

In [23]:
def recover_unemp_mean_rate(id_county):
    #Aim: recover the unemployment rate data for each county
    #We choose here to keep only the mean value from January 2013 to March 2018

    #Request type to retrieve the id of the series
    url = "https://api.stlouisfed.org/fred/category/series?"
    param = {"api_key" : api_key, "file_type" : "json", "category_id" : id_county}
    response = requests.get(url, params= param)
    series_id = response.json()

    if 'seriess' not in series_id:
        #Typically an error payload, e.g. when the rate limit is exceeded
        print(series_id)
        return None
    series_id = series_id['seriess']

    id_unemp_series = None
    for serie in series_id:
        if ("Unemployment Rate" in serie["title"]) and ("Monthly" in serie["frequency"]):
            id_unemp_series = serie['id']
    
    #Now that the id of the unemployment series is known, it is time to retrieve values
    #Request type to retrieve the values in the series
    unemployment_rate_mean = np.nan
    if id_unemp_series:
        url = "https://api.stlouisfed.org/fred/series/observations?" #fetch the monthly unemployment series
        param = {"api_key" : api_key, "file_type" : "json", "series_id" : id_unemp_series, "observation_start" : "2013-01-01", "observation_end" : "2018-03-01"} 
        response = requests.get(url, params= param)
        data_unemp = response.json()

        if 'observations' not in data_unemp:
            print(data_unemp)
            return None
        data_unemp = data_unemp['observations']

        #Mean of all monthly unemp rates (FRED encodes missing values as ".", which float() cannot parse)
        list_of_unemp_rates = [float(obs["value"]) for obs in data_unemp if obs["value"] != "."]
        if len(list_of_unemp_rates) > 1:
            unemployment_rate_mean = mean(list_of_unemp_rates)

    return unemployment_rate_mean
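
The two steps of the function (pick the monthly unemployment series, then average its observations) can be tried offline on toy payloads. The sketch below is only an illustration: the series ids, titles, and values are made up, the helper names are hypothetical, and unlike the loop above it returns the first matching series rather than the last one.

```python
from statistics import mean

# Toy payloads mimicking the shape of the API responses (illustrative values)
series_list = [
    {"id": "SERIES_A", "title": "Unemployment Rate in Example County", "frequency": "Annual"},
    {"id": "SERIES_M", "title": "Unemployment Rate in Example County", "frequency": "Monthly"},
    {"id": "SERIES_P", "title": "Resident Population in Example County", "frequency": "Annual"},
]
observations = [
    {"date": "2013-01-01", "value": "5.0"},
    {"date": "2013-02-01", "value": "."},  # "." marks a missing observation
    {"date": "2013-03-01", "value": "6.0"},
]


def pick_monthly_unemployment(series):
    """Return the id of the first monthly unemployment-rate series, or None."""
    for s in series:
        if "Unemployment Rate" in s["title"] and "Monthly" in s["frequency"]:
            return s["id"]
    return None


def mean_rate(obs_list):
    """Average the numeric observations, skipping missing ones."""
    values = [float(o["value"]) for o in obs_list if o["value"] != "."]
    return mean(values) if values else float("nan")


selected_id = pick_monthly_unemployment(series_list)
average = mean_rate(observations)
print(selected_id, average)
```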

In [24]:
tqdm.pandas(desc = 'Extraction données chômage')
counties_db['unemp_rate'] = counties_db['id_county'].progress_apply(recover_unemp_mean_rate)

Extraction données chômage:   2%|▏         | 63/3195 [00:31<23:41,  2.20it/s]

{'error_code': 429, 'error_message': 'Too Many Requests.  Exceeded Rate Limit'}


Extraction données chômage:   2%|▏         | 69/3195 [00:32<24:44,  2.11it/s]


KeyboardInterrupt: 
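
The run above was interrupted because the API started answering with error code 429 (rate limit exceeded). One possible way to cope with this, given here as a sketch and not as the approach used in the notebook, is to retry the call after an increasing pause. `call_with_backoff` is a hypothetical helper: `fetch` stands for any function returning the decoded JSON payload, and the default delays are arbitrary assumptions.

```python
import time


def call_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning a 429 payload (hypothetical helper)."""
    data = None
    for attempt in range(max_retries):
        data = fetch()
        if data.get("error_code") != 429:
            return data
        # Exponential backoff: wait base_delay, then twice as long, and so on
        sleep(base_delay * 2 ** attempt)
    return data  # still rate-limited after max_retries attempts


# Usage with a fake fetch: two rate-limit answers, then a valid payload
responses = iter([
    {"error_code": 429, "error_message": "Too Many Requests.  Exceeded Rate Limit"},
    {"error_code": 429, "error_message": "Too Many Requests.  Exceeded Rate Limit"},
    {"observations": []},
])
waits = []  # records the requested pauses instead of actually sleeping
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

In the notebook, `fetch` could wrap the `requests.get(...).json()` calls inside `recover_unemp_mean_rate`, so that a county is not silently left without a value when the limit is hit.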

In [78]:
counties_db.head(100)

Unnamed: 0,Nom,Etat,Code_Etat,id_Etat,id_county,fips,geometry,unemp_rate
0,Autauga,Alabama,AL,27282,27336,01001,"POLYGON ((6581.000 4919.000, 6555.000 4969.000...",5.219048
1,Baldwin,Alabama,AL,27282,27337,01003,"MULTIPOLYGON (((6355.000 4470.000, 6354.000 44...",5.520635
2,Barbour,Alabama,AL,27282,27338,01005,"POLYGON ((6976.000 4890.000, 6979.000 4880.000...",8.660317
3,Bibb,Alabama,AL,27282,27339,01007,"POLYGON ((6431.000 5078.000, 6453.000 5080.000...",6.439683
4,Blount,Alabama,AL,27282,27340,01009,"POLYGON ((6608.000 5424.000, 6619.000 5411.000...",5.401587
...,...,...,...,...,...,...,...,...
95,Valdez-Cordova CA,Alaska,AK,27283,27425,02261,"MULTIPOLYGON (((710.000 3060.000, 717.000 3078...",9.007937
96,Wade Hampton CA,Alaska,AK,27283,27426,,,23.627778
97,Wrangell Borough/City,Alaska,AK,27283,33518,,,7.788889
98,Wrangell-Petersburg CA,Alaska,AK,27283,27427,,,
