![BTS](img/Logo-BTS.jpg)

# Session 12: Data gathering

### Juan Luis Cano Rodríguez <juan.cano@bts.tech> - Data Science Foundations (2018-11-16)

Open this notebook in Google Colaboratory: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Juanlu001/bts-mbds-data-science-foundations/blob/master/sessions/12-Data-gathering.ipynb)

## Exercise 1: Extract gender from names

<div class="alert alert-danger">The <code>genderize.io</code> API is rate limited, and if you make too many requests you might not be able to continue with the exercise. Avoid making too many queries in a short period of time.</div>

1. Create a function that returns the list of **first names** from the data below
2. Use https://genderize.io/ with your name. Does it give the correct gender?
3. Include your home country as part of the query as explained in https://genderize.io/. Does it change the result or the probability?
4. Write a function that receives a string and returns the same string in lowercase **without** special characters
5. Write a function that receives a name, queries `genderize.io` and, if the status code is correct, returns the result; otherwise it prints an error message and returns an empty dictionary
6. Modify this function so it **does not** repeat a request to `genderize.io` if the name was already requested before (keyword: cache)
7. Plot a bar chart with the probability of each name. Do the same with the counts.
8. Create another function with cache that downloads the data for all the names **in one request** and returns a pandas `DataFrame`

In [2]:
import requests

In [4]:
names = [
    "Martín Cano",
    "Benjamin Wein",
    "Francisco Coreas",
    "Jordi Hurtado",
    "Jonas Cristens",
    "Allison Walker",
    "Cen Liang",
    "Mahmouhd Belhaj",
    "David Ordóñez",
    "Sebastian Natalevich",
    "Alan Kwan",
    "Anastasia Gracheva",
    "Pelin Gundogdu",
    "Sibasis Dash",
    "Bekzod Sharipov",
    "Annalaura Ianiro",
    "Christian Siever",
    "Seung Bin Yoo",
    "Miah Mohammad Rashedeul Hasan",
]

In [6]:
def first_names(names):
    first_names_list = []
    for name in names:
        first_names_list.append(name.split()[0])

    return first_names_list

print(first_names(names))

['Martín', 'Benjamin', 'Francisco', 'Jordi', 'Jonas', 'Allison', 'Cen', 'Mahmouhd', 'David', 'Sebastian', 'Alan', 'Anastasia', 'Pelin', 'Sibasis', 'Bekzod', 'Annalaura', 'Christian', 'Seung', 'Miah']


In [7]:
def first_names(names):
    return [name.split()[0] for name in names]

print(first_names(names))

['Martín', 'Benjamin', 'Francisco', 'Jordi', 'Jonas', 'Allison', 'Cen', 'Mahmouhd', 'David', 'Sebastian', 'Alan', 'Anastasia', 'Pelin', 'Sibasis', 'Bekzod', 'Annalaura', 'Christian', 'Seung', 'Miah']


In [8]:
import requests

In [9]:
r = requests.get("https://api.genderize.io/?name=martín")

In [26]:
requests.get("https://api.genderize.io", params={'name': 'andrea'}).json()

{'name': 'andrea', 'gender': 'female', 'probability': 0.79, 'count': 5794}

In [25]:
requests.get("https://api.genderize.io", params={'name': 'andrea', 'language_id': 'it'}).json()

{'name': 'andrea', 'gender': 'male', 'probability': 0.99, 'count': 1070}

In [35]:
requests.get("https://api.genderize.io/blabla", params={'name': 'martín', 'language_id': 'es'})

<Response [404]>

In [20]:
r = requests.get("https://api.genderize.io", params={'name': 'martín', 'language_id': 'es'})

In [21]:
r

<Response [200]>

In [22]:
r.status_code

200

In [23]:
r.text

'{"name":"martín","gender":null}'

In [19]:
r.json()['gender']

In [28]:
import unidecode

In [31]:
def normalize(word):
    return unidecode.unidecode(word).lower()

In [33]:
normalized_first_names = [normalize(name) for name in first_names(names)]

In [34]:
print(normalized_first_names)

['martin', 'benjamin', 'francisco', 'jordi', 'jonas', 'allison', 'cen', 'mahmouhd', 'david', 'sebastian', 'alan', 'anastasia', 'pelin', 'sibasis', 'bekzod', 'annalaura', 'christian', 'seung', 'miah']
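`unidecode` is a third-party package. If installing it is not an option, a rough standard-library alternative is to decompose each character with `unicodedata` and drop the combining marks. The sketch below assumes that only accents need to be removed; unlike `unidecode`, it does not transliterate characters such as `ß` or `ø`. The name `normalize_stdlib` is made up for this example.

```python
import unicodedata

def normalize_stdlib(word):
    # NFKD splits an accented letter into its base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", word)
    # Keep everything except the combining marks, then lowercase
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(normalize_stdlib("Martín"))   # martin
print(normalize_stdlib("Ordóñez"))  # ordonez
```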


In [27]:
'Martín'

'Martín'

In [37]:
def get_gender(name):
    r = requests.get(
        "https://api.genderize.io",
        params={'name': normalize(name)}
    )
    if r.status_code == 200:
        return r.json()
    else:
        print("Something went wrong: {}".format(r.text))
        return {}

In [38]:
get_gender("Martín")

{'name': 'martin', 'gender': 'male', 'probability': 1, 'count': 3568}

In [39]:
get_gender("")

Something went wrong: {"error":"Missing 'name' parameter"}


{}

In [40]:
name_cache = dict()

In [41]:
name_cache['martin'] = 'male'

In [42]:
name_cache

{'martin': 'male'}

In [50]:
name_cache = dict()

def get_name_cached(name):
    # Normalize once and use the result as the cache key
    key = normalize(name)
    if key not in name_cache:
        r = requests.get(
            "https://api.genderize.io",
            params={'name': key}
        )
        # Parse the response once and store only the gender
        name_cache[key] = r.json()['gender']
    return name_cache[key]

In [51]:
name_cache

{}

In [56]:
from functools import lru_cache

In [58]:
@lru_cache()
def get_gender(name):
    r = requests.get(
        "https://api.genderize.io",
        params={'name': normalize(name)}
    )
    if r.status_code == 200:
        return r.json()
    else:
        print("Something went wrong: {}".format(r.text))
        return {}

In [59]:
get_gender("Martín")

{'name': 'martin', 'gender': 'male', 'probability': 1, 'count': 3568}

In [60]:
get_gender("Martín")

{'name': 'martin', 'gender': 'male', 'probability': 1, 'count': 3568}

In [62]:
get_gender("martín")

{'name': 'martin', 'gender': 'male', 'probability': 1, 'count': 3568}

In [55]:
get_name_cached("martín")

'male'

In [53]:
name_cache

{'martin': 'male'}
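Task 8 (all the names in one request) is not solved above. According to the `genderize.io` documentation, several names can be sent at once by repeating the `name[]` parameter, with a limit of ten names per request; both details are worth checking against the current docs. Under that assumption the 19 names need at least two requests. The sketch below splits the names into batches and builds the parameters for each one. `chunked`, `batch_params` and `get_genders` are illustrative names, the names passed in are expected to be normalized already, and caching is left out.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def batch_params(names):
    # requests encodes a list value as a repeated query parameter:
    # name[]=a&name[]=b&...
    return {"name[]": list(names)}

def get_genders(names, batch_size=10):
    # Imported here so the helpers above work without these packages installed
    import pandas as pd
    import requests

    rows = []
    for batch in chunked(names, batch_size):
        r = requests.get("https://api.genderize.io", params=batch_params(batch))
        r.raise_for_status()   # stop on errors such as hitting the rate limit
        rows.extend(r.json())  # assumed: a list with one dictionary per name
    return pd.DataFrame(rows)

batches = list(chunked(["martin", "benjamin", "francisco"], 2))
print(batches)  # [['martin', 'benjamin'], ['francisco']]
```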

## Exercise 2: `GET` vs `POST`

1. Use http://postcodes.io/ to compare how to make a request using `GET` or `POST`.
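As a starting point, the main difference is where the data travels: with `GET` the postcode is part of the URL, while with `POST` a list of postcodes is sent in the request body as JSON. The sketch below assumes the endpoints described in the postcodes.io documentation (`/postcodes/<postcode>` for a single lookup and `/postcodes` for a bulk lookup). The helper names are made up for this example, and the two functions that actually send requests are defined but not called.

```python
BASE_URL = "https://api.postcodes.io/postcodes"

def single_lookup_url(postcode):
    # GET: the postcode is part of the URL itself
    return "{}/{}".format(BASE_URL, postcode.replace(" ", ""))

def bulk_lookup_body(postcodes):
    # POST: the postcodes go in a JSON body
    return {"postcodes": list(postcodes)}

def lookup_one(postcode):
    import requests  # imported lazily; only needed when a request is sent
    return requests.get(single_lookup_url(postcode)).json()

def lookup_many(postcodes):
    import requests
    # json= serializes the body and sets the Content-Type header
    return requests.post(BASE_URL, json=bulk_lookup_body(postcodes)).json()

print(single_lookup_url("SW1A 1AA"))
print(bulk_lookup_body(["SW1A 1AA", "M1 1AE"]))
```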

## Exercise 3: Postcode data

1. Register at https://geoapi.es/ (the site is in Spanish; use Google Translate or ask the Spaniards in the room) to obtain an API key
2. Look for the endpoint that lists all the communities ("comunidades") in Spain
3. Create a function that receives the name of a community in Spain ("Andalucía", "Cataluña") and retrieves the code (`CCOM`) of that community
4. Create a function that receives the name of a community and lists all its provinces ("provincias")
5. Create a function that receives the name of a province and lists all its zipcodes ("códigos postales")

In [1]:
import requests

In [2]:
r = requests.get('https://apiv1.geoapi.es/comunidades',
                 params={'key': '8d96c2590d61dec2fc25cea9984bd484bcb82802d766cced2413d1c844b0c15d'})
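The API key is written out in every request in this notebook. A common alternative is to read it from an environment variable so that it does not end up in shared notebooks. The sketch below assumes the key has been stored in a variable called `GEOAPI_KEY`; that name and the helper `geoapi_params` are made up for this example.

```python
import os

def geoapi_params(**extra):
    # Read the key at call time so a missing variable fails with a clear message
    key = os.environ.get("GEOAPI_KEY")
    if not key:
        raise RuntimeError("Set the GEOAPI_KEY environment variable first")
    params = {"key": key}
    params.update(extra)
    return params
```

With the variable set, `geoapi_params(CCOM="09")` returns a dictionary that can be passed directly as `params=` to `requests.get`.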

In [3]:
r.status_code

200

In [4]:
import pandas as pd

In [5]:
r.json()

{'update_date': '2016.07',
 'size': 19,
 'data': [{'CCOM': '01', 'COM': 'ANDALUCÍA'},
  {'CCOM': '02', 'COM': 'ARAGÓN'},
  {'CCOM': '03', 'COM': 'PRINCIPADO DE ASTURIAS'},
  {'CCOM': '04', 'COM': 'ISLAS BALEARES'},
  {'CCOM': '05', 'COM': 'CANARIAS'},
  {'CCOM': '06', 'COM': 'CANTABRIA'},
  {'CCOM': '07', 'COM': 'CASTILLA-LA MANCHA'},
  {'CCOM': '08', 'COM': 'CASTILLA Y LEÓN'},
  {'CCOM': '09', 'COM': 'CATALUÑA'},
  {'CCOM': '10', 'COM': 'COMUNIDAD VALENCIANA'},
  {'CCOM': '11', 'COM': 'EXTREMADURA'},
  {'CCOM': '12', 'COM': 'GALICIA'},
  {'CCOM': '13', 'COM': 'LA RIOJA'},
  {'CCOM': '14', 'COM': 'MADRID'},
  {'CCOM': '15', 'COM': 'NAVARRA'},
  {'CCOM': '16', 'COM': 'PAÍS VASCO'},
  {'CCOM': '17', 'COM': 'MURCIA'},
  {'CCOM': '18', 'COM': 'CEUTA'},
  {'CCOM': '19', 'COM': 'MELILLA'}]}

In [6]:
communities = pd.DataFrame(r.json()['data']).set_index('COM')['CCOM']
communities.head()

COM
ANDALUCÍA                 01
ARAGÓN                    02
PRINCIPADO DE ASTURIAS    03
ISLAS BALEARES            04
CANARIAS                  05
Name: CCOM, dtype: object

In [7]:
communities

COM
ANDALUCÍA                 01
ARAGÓN                    02
PRINCIPADO DE ASTURIAS    03
ISLAS BALEARES            04
CANARIAS                  05
CANTABRIA                 06
CASTILLA-LA MANCHA        07
CASTILLA Y LEÓN           08
CATALUÑA                  09
COMUNIDAD VALENCIANA      10
EXTREMADURA               11
GALICIA                   12
LA RIOJA                  13
MADRID                    14
NAVARRA                   15
PAÍS VASCO                16
MURCIA                    17
CEUTA                     18
MELILLA                   19
Name: CCOM, dtype: object

In [8]:
communities.index[communities == '04']

Index(['ISLAS BALEARES'], dtype='object', name='COM')

In [9]:
communities['ISLAS BALEARES']

'04'

In [10]:
communities = pd.DataFrame(r.json()['data']).set_index('COM')['CCOM']

def get_ccom_df(community_name):
    return communities[community_name.upper()]

In [11]:
for ii in range(5):
    print(ii)
    #if ii > 2:
    #    print("Breaking!")
    #    break
else:
    print("I reached the end!")

0
1
2
3
4
I reached the end!
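The `else` clause of a `for` loop runs only when the loop finishes without hitting `break`, which is why the message is printed here while the `break` is commented out. A small sketch of the typical use, a search with a fallback, using a hypothetical `search` function:

```python
def search(values, target):
    for index, value in enumerate(values):
        if value == target:
            result = "found at {}".format(index)
            break  # leaving the loop this way skips the else clause
    else:
        # Runs only if the loop was exhausted without a break
        result = "not found"
    return result

print(search([1, 2, 3], 2))  # found at 1
print(search([1, 2, 3], 9))  # not found
```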


In [12]:
def get_ccom_alt(community_name):
    for row in r.json()['data']:
        if row['COM'] == community_name.upper():
            return row['CCOM']
    # Reached only when no row matched, because a match returns early
    raise KeyError("That community does not exist")

In [13]:
get_ccom_alt("Andalucía")

'01'

In [14]:
get_ccom_alt("Cataluña")

'09'
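A third option, between the pandas `Series` and the explicit loop, is a plain dictionary built once from the response. The sketch below uses a few rows copied from the `comunidades` output so that it runs without a request; `ccom_by_name` and `get_ccom_dict` are names invented for this example.

```python
# A few rows copied from the `comunidades` response
data = [
    {'CCOM': '01', 'COM': 'ANDALUCÍA'},
    {'CCOM': '04', 'COM': 'ISLAS BALEARES'},
    {'CCOM': '09', 'COM': 'CATALUÑA'},
]

# Map each community name to its code once, instead of scanning on every call
ccom_by_name = {row['COM']: row['CCOM'] for row in data}

def get_ccom_dict(community_name):
    try:
        return ccom_by_name[community_name.upper()]
    except KeyError:
        # Replace the bare key in the error with a readable message
        raise KeyError("That community does not exist") from None

print(get_ccom_dict("Cataluña"))   # 09
print(get_ccom_dict("Andalucía"))  # 01
```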

In [15]:
r.json()

{'update_date': '2016.07',
 'size': 19,
 'data': [{'CCOM': '01', 'COM': 'ANDALUCÍA'},
  {'CCOM': '02', 'COM': 'ARAGÓN'},
  {'CCOM': '03', 'COM': 'PRINCIPADO DE ASTURIAS'},
  {'CCOM': '04', 'COM': 'ISLAS BALEARES'},
  {'CCOM': '05', 'COM': 'CANARIAS'},
  {'CCOM': '06', 'COM': 'CANTABRIA'},
  {'CCOM': '07', 'COM': 'CASTILLA-LA MANCHA'},
  {'CCOM': '08', 'COM': 'CASTILLA Y LEÓN'},
  {'CCOM': '09', 'COM': 'CATALUÑA'},
  {'CCOM': '10', 'COM': 'COMUNIDAD VALENCIANA'},
  {'CCOM': '11', 'COM': 'EXTREMADURA'},
  {'CCOM': '12', 'COM': 'GALICIA'},
  {'CCOM': '13', 'COM': 'LA RIOJA'},
  {'CCOM': '14', 'COM': 'MADRID'},
  {'CCOM': '15', 'COM': 'NAVARRA'},
  {'CCOM': '16', 'COM': 'PAÍS VASCO'},
  {'CCOM': '17', 'COM': 'MURCIA'},
  {'CCOM': '18', 'COM': 'CEUTA'},
  {'CCOM': '19', 'COM': 'MELILLA'}]}

In [16]:
def get_provinces(community_name):
    ccom = get_ccom_alt(community_name)
    r = requests.get(
        'https://apiv1.geoapi.es/provincias',
        params={
            'key': '8d96c2590d61dec2fc25cea9984bd484bcb82802d766cced2413d1c844b0c15d',
            'CCOM': ccom
        }
    )
    #return [p['PRO'].capitalize() for p in r.json()['data']]
    return r.json()['data']

In [17]:
def get_all_provinces():
    r = requests.get(
        'https://apiv1.geoapi.es/provincias',
        params={
            'key': '8d96c2590d61dec2fc25cea9984bd484bcb82802d766cced2413d1c844b0c15d',
        }
    )
    #return [p['PRO'].capitalize() for p in r.json()['data']]
    return r.json()['data']

In [18]:
get_all_provinces()

[{'CCOM': '16', 'CPRO': '01', 'PRO': 'ÁLAVA'},
 {'CCOM': '07', 'CPRO': '02', 'PRO': 'ALBACETE'},
 {'CCOM': '10', 'CPRO': '03', 'PRO': 'ALICANTE'},
 {'CCOM': '01', 'CPRO': '04', 'PRO': 'ALMERÍA'},
 {'CCOM': '03', 'CPRO': '33', 'PRO': 'PRINCIPADO DE ASTURIAS'},
 {'CCOM': '08', 'CPRO': '05', 'PRO': 'ÁVILA'},
 {'CCOM': '11', 'CPRO': '06', 'PRO': 'BADAJOZ'},
 {'CCOM': '09', 'CPRO': '08', 'PRO': 'BARCELONA'},
 {'CCOM': '08', 'CPRO': '09', 'PRO': 'BURGOS'},
 {'CCOM': '11', 'CPRO': '10', 'PRO': 'CÁCERES'},
 {'CCOM': '01', 'CPRO': '11', 'PRO': 'CÁDIZ'},
 {'CCOM': '06', 'CPRO': '39', 'PRO': 'CANTABRIA'},
 {'CCOM': '10', 'CPRO': '12', 'PRO': 'CASTELLÓN'},
 {'CCOM': '07', 'CPRO': '13', 'PRO': 'CIUDAD REAL'},
 {'CCOM': '01', 'CPRO': '14', 'PRO': 'CÓRDOBA'},
 {'CCOM': '12', 'CPRO': '15', 'PRO': 'LA CORUÑA'},
 {'CCOM': '07', 'CPRO': '16', 'PRO': 'CUENCA'},
 {'CCOM': '09', 'CPRO': '17', 'PRO': 'GERONA'},
 {'CCOM': '01', 'CPRO': '18', 'PRO': 'GRANADA'},
 {'CCOM': '07', 'CPRO': '19', 'PRO': 'GUADALAJARA

In [19]:
get_provinces("Cataluña")

[{'CCOM': '09', 'CPRO': '08', 'PRO': 'BARCELONA'},
 {'CCOM': '09', 'CPRO': '17', 'PRO': 'GERONA'},
 {'CCOM': '09', 'CPRO': '25', 'PRO': 'LÉRIDA'},
 {'CCOM': '09', 'CPRO': '43', 'PRO': 'TARRAGONA'}]

In [47]:
def get_zipcodes(province_name):
    zipcodes = []
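    # Take the single matching province code; fail loudly if the name is unknown
    matches = [row['CPRO'] for row in get_all_provinces() if row['PRO'] == province_name.upper()]
    if not matches:
        raise KeyError("That province does not exist")
    cpro = matches[0]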

    r = requests.get(
        'https://apiv1.geoapi.es/municipios',
        params={
            'key': '8d96c2590d61dec2fc25cea9984bd484bcb82802d766cced2413d1c844b0c15d',
            'CPRO': cpro,
        }
    )

    all_municipios = [row['CMUM'] for row in r.json()['data']]

    for cmum in all_municipios:
        r = requests.get(
            'https://apiv1.geoapi.es/poblaciones',
            params={
                'key': '8d96c2590d61dec2fc25cea9984bd484bcb82802d766cced2413d1c844b0c15d',
                'CPRO': cpro,
                'CMUM': cmum,
            }
        )
        all_poblaciones = [row['CUN'] for row in r.json()['data']]

        for cun in all_poblaciones:
            r = requests.get(
                'https://apiv1.geoapi.es/cps',
                params={
                    'key': '8d96c2590d61dec2fc25cea9984bd484bcb82802d766cced2413d1c844b0c15d',
                    'CPRO': cpro,
                    'CMUM': cmum,
                    'CUN': cun,
                }
            )

            zipcodes.extend([row['CPOS'] for row in r.json()['data']])

    return zipcodes

In [48]:
# Super slow: one request per municipality plus one per population unit
get_zipcodes("ALMERÍA")

['04510',
 '04510',
 '04510',
 '04510',
 '04510',
 '04510',
 '04510',
 '04520',
 '04533',
 '04533',
 '04533',
 '04770',
 '04779',
 '04778',
 '04779',
 '04779',
 '04779',
 '04770',
 '04779',
 '04770',
 '04779',
 '04778',
 '04779',
 '04778',
 '04778',
 '04770',
 '04778',
 '04778',
 '04770',
 '04770',
 '04770',
 '04770',
 '04770',
 '04779',
 '04778',
 '04779',
 '04857',
 '04857',
 '04858',
 '04858',
 '04858',
 '04858',
 '04858',
 '04857',
 '04858',
 '04858',
 '04858',
 '04858',
 '04858',
 '04858',
 '04858',
 '04531',
 '04558',
 '04600',
 '04800',
 '04814',
 '04814',
 '04814',
 '04692',
 '04814',
 '04812',
 '04812',
 '04814',
 '04812',
 '04813',
 '04813',
 '04480',
 '04768',
 '04768',
 '04897',
 '04897',
 '04897',
 '04897',
 '04897',
 '04898',
 '04897',
 '04897',
 '04276',
 '04567',
 '04400',
 '04400',
 '04409',
 '04001',
 '04002',
 '04003',
 '04004',
 '04005',
 '04006',
 '04007',
 '04008',
 '04009',
 '04120',
 '04130',
 '04160',
 '04120',
 '04150',
 '04120',
 '04002',
 '04120',
 '04160',


Ideas:

* Print only unique?
* Use generators (not useful for the whole list)
* Use asynchronicity https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
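Two of these ideas can be sketched with the standard library alone. The example below removes duplicates while keeping the original order, and runs the lookups concurrently with a thread pool, which is a simpler substitute for the `asyncio`/`aiohttp` approach in the link rather than the same technique. `unique_in_order`, `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the request to the `cps` endpoint so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def unique_in_order(items):
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(items))

def fetch_all(fetch, keys, max_workers=8):
    # Run fetch(key) for every key in worker threads; results keep the order of keys
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, keys))

def fake_fetch(cun):
    # Stand-in for one request; returns the zipcodes of one population unit
    return ["04510", "04520"] if cun == "a" else ["04510"]

pages = fetch_all(fake_fetch, ["a", "b"])
zipcodes = [code for page in pages for code in page]
print(unique_in_order(zipcodes))  # ['04510', '04520']
```

Threads help here because most of the time is spent waiting for the network, but sending many requests at once may run into the API's rate limits.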