# Multiple data sources

![rdb](https://cdn.pixabay.com/photo/2016/12/09/18/30/database-schema-1895779_960_720.png)

## Relational data

The simplest kind of data we have seen consists of a single table with some variables (columns) and some records (rows). This kind of data is easy to analyze, and we can often reduce our data to a single table before running machine learning algorithms on it.

However, real-world data is not necessarily this "tidy". Most of the real data we come across is complex and messy, and it cannot easily be organized into a single table without a fair amount of preprocessing first.

In addition, we can often reduce the cost of storing the data in memory by distributing it across several tables with well-defined relationships, instead of a single table that concentrates all the information.
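
As a minimal sketch of that idea, the hypothetical table below repeats a country-level attribute on every row; splitting it into two tables linked by a key stores each country once, and a merge rebuilds the original. The `continent` column and the variable names are illustrative assumptions, not part of the class data.

```python
import pandas as pd

# Hypothetical denormalized table: country-level attributes repeat on every row.
wide = pd.DataFrame({
    "company": ["Walmart", "Amazon.com", "Sinopec Group", "State Grid"],
    "country": ["United States", "United States", "China", "China"],
    "continent": ["North America", "North America", "Asia", "Asia"],
})

# Split into two related tables linked by the "country" key.
companies = wide[["company", "country"]]
countries = wide[["country", "continent"]].drop_duplicates().reset_index(drop=True)

# Joining on the key rebuilds the original wide table.
rebuilt = companies.merge(countries, on="country", how="left")
print(rebuilt.equals(wide))
```

The saving grows with the number of repeated attributes and rows; with four rows it is only meant to show the mechanics.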

Today we will look at how to combine data from different sources, and at how doing so lets us build quite useful features.

As an example, we will use data on the top 10 companies in the [Fortune Global 500](https://en.wikipedia.org/wiki/Fortune_Global_500) index. To work with it, we will use the pandas function `read_html`, which lets us ingest the data directly from the web page.

In [1]:
# Import libraries
import pandas as pd

In [2]:
# Load the data into memory using pd.read_html
data = pd.read_html("https://en.wikipedia.org/wiki/Fortune_Global_500")

In [4]:
type(data)

list

In [8]:
for table in data:
    print(type(table))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
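
Since `read_html` returns every table it finds, it helps to look at each one before picking an index. The sketch below does this offline on a small, made-up HTML string (the `html` content is an assumption for illustration) and also shows the `match` parameter, which keeps only tables containing a given text. It assumes one of the parsers pandas relies on (`lxml`, or `bs4` with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second one mentions "Rank".
html = """
<table><tr><th>Year</th><th>Note</th></tr><tr><td>2020</td><td>demo</td></tr></table>
<table><tr><th>Rank</th><th>Company</th></tr><tr><td>1</td><td>Walmart</td></tr></table>
"""

# One DataFrame per <table> element found in the document.
all_tables = pd.read_html(StringIO(html))
for i, table in enumerate(all_tables):
    print(i, table.shape, list(table.columns))

# `match` keeps only the tables whose text matches the given string or regex.
ranked = pd.read_html(StringIO(html), match="Rank")
```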


In [9]:
fortune500 = data[0]
fortune500

Unnamed: 0,Rank,Company,Country,Industry,Revenue in USD
0,1,Walmart,United States,Retail,$524 billion
1,2,Sinopec Group,China,Petroleum,$407 billion
2,3,State Grid,China,Energy,$384 billion
3,4,China National Petroleum,China,Petroleum,$379 billion
4,5,Royal Dutch Shell,Netherlands,Petroleum,$352 billion
5,6,Saudi Aramco,Saudi Arabia,Energy,$330 billion
6,7,Volkswagen,Germany,Automobiles,$283 billion
7,8,BP,United Kingdom,Petroleum,$283 billion
8,9,Amazon.com,United States,Internet Services and Retailing,$281 billion
9,10,Toyota Motor,Japan,Automobiles,$275 billion


### What is behind the `pd.read_html` function?

The steps can be described in as much detail as we like, but they are essentially the following:

1. Make a **GET request** to the web page (using the [requests](https://docs.python-requests.org/en/master/) library):

In [10]:
# Import the requests library
import requests

In [11]:
help(requests.get)

Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response



In [22]:
# Make a GET request to the page
response = requests.get("https://en.wikipedia.org/wiki/Fortune_Global_500")
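
In practice it is worth guarding the request a little. The sketch below is one possible wrapper, not what pandas does internally: `fetch_html` is a hypothetical helper that sets a timeout and raises on HTTP errors. The second part builds a GET request without sending it, just to show what it contains; the `example` query parameter is made up for illustration.

```python
import requests

URL = "https://en.wikipedia.org/wiki/Fortune_Global_500"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Build (but do not send) a GET request to inspect its method and final URL.
prepared = requests.Request("GET", URL, params={"example": "1"}).prepare()
print(prepared.method, prepared.url)
```

Calling `fetch_html(URL)` would perform the actual download, so it needs network access.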

In [27]:
help(response.json)

Help on method json in module requests.models:

json(**kwargs) method of requests.models.Response instance
    Returns the json-encoded content of a response, if any.
    
    :param \*\*kwargs: Optional arguments that ``json.loads`` takes.
    :raises ValueError: If the response body does not contain valid json.



What do we get from this request?

In [23]:
# The text attribute
response.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Fortune Global 500 - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"99191cad-aeca-489c-a88e-f9d1b5451840","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Fortune_Global_500","wgTitle":"Fortune Global 500","wgCurRevisionId":1029039347,"wgRevisionId":1029039347,"wgArticleId":17581425,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","All articles with unsourced statements","Articles with unsourced

Inspect the page ...

So we get all of the page's data. The "only" thing left is:

2. Convert this data into a suitable format using an **HTML parser** (we use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)):

In [24]:
# Import bs4.BeautifulSoup
from bs4 import BeautifulSoup

In [25]:
help(BeautifulSoup)

Help on class BeautifulSoup in module bs4:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(na

In [28]:
# Instantiate a BeautifulSoup object with the contents of the response
soup = BeautifulSoup(response.text, "html")

[Stack Overflow post discussing the different parsers](https://stackoverflow.com/questions/25714417/beautiful-soup-and-table-scraping-lxml-vs-html-parser)

What does our "soup" contain?

In [29]:
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Fortune Global 500 - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"99191cad-aeca-489c-a88e-f9d1b5451840","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Fortune_Global_500","wgTitle":"Fortune Global 500","wgCurRevisionId":1029039347,"wgRevisionId":1029039347,"wgArticleId":17581425,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","All articles with unsourced statements","Articles with unsourced state

We can search for different elements:

In [30]:
# Title
soup.find("title")

<title>Fortune Global 500 - Wikipedia</title>

In [35]:
# Tables: take the text of the first one
table = soup.find_all("table")[0].text

In [39]:
table.split("\n1")[1].split("\n2")

['\n\nWalmart\n\n\xa0United States\n\nRetail\n\n$524 billion\n\n',
 '\n\nSinopec Group\n\n\xa0China\n\nPetroleum\n\n$407 billion\n\n\n3\n\nState Grid\n\n\xa0China\n\nEnergy\n\n$384 billion\n\n\n4\n\nChina National Petroleum\n\n\xa0China\n\nPetroleum\n\n$379 billion\n\n\n5\n\nRoyal Dutch Shell\n\n\xa0Netherlands\n\nPetroleum\n\n$352 billion\n\n\n6\n\nSaudi Aramco\n\n\xa0Saudi Arabia\n\nEnergy\n\n$330 billion\n\n\n7\n\nVolkswagen\n\n\xa0Germany\n\nAutomobiles\n\n$283 billion\n\n\n8\n\nBP\n\n\xa0United Kingdom\n\nPetroleum\n\n$283 billion\n\n\n9\n\nAmazon.com\n\n\xa0United States\n\nInternet Services and Retailing\n\n$281 billion\n\n']
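
Splitting the raw text works, but it is brittle. A rough alternative is to walk the table by its tags, one `<tr>` per row and one `<th>`/`<td>` per cell. The sketch below does that on a small made-up table (the `html` string and the names `soup_demo`, `rows`, `header`, and `records` are assumptions for illustration), using the `html.parser` backend that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical table with the same layout as the one on the page.
html = """
<table>
  <tr><th>Rank</th><th>Company</th><th>Revenue in USD</th></tr>
  <tr><td>1</td><td>Walmart</td><td>$524 billion</td></tr>
  <tr><td>2</td><td>Sinopec Group</td><td>$407 billion</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")
table_tag = soup_demo.find("table")

# Collect the stripped text of every cell, row by row.
rows = []
for tr in table_tag.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, records = rows[0], rows[1:]
print(header)
print(records)
```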

We see that we could parse the table by hand using `str` methods. Instead, we can use the parser that pandas provides:

In [42]:
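# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
from io import StringIO
fortune500 = pd.read_html(StringIO(str(soup.find("table"))))[0]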

In [43]:
fortune500

Unnamed: 0,Rank,Company,Country,Industry,Revenue in USD
0,1,Walmart,United States,Retail,$524 billion
1,2,Sinopec Group,China,Petroleum,$407 billion
2,3,State Grid,China,Energy,$384 billion
3,4,China National Petroleum,China,Petroleum,$379 billion
4,5,Royal Dutch Shell,Netherlands,Petroleum,$352 billion
5,6,Saudi Aramco,Saudi Arabia,Energy,$330 billion
6,7,Volkswagen,Germany,Automobiles,$283 billion
7,8,BP,United Kingdom,Petroleum,$283 billion
8,9,Amazon.com,United States,Internet Services and Retailing,$281 billion
9,10,Toyota Motor,Japan,Automobiles,$275 billion


This way we can extract relevant information from public web pages.

There is a lot more to web scraping.

- If getting information from a page requires navigating it, clicking buttons, and the like, there is another library that helps automate these tasks: [Selenium](https://selenium-python.readthedocs.io/).

- On the other hand, when a website does not want its content to be fetched massively and repeatedly, it usually includes "anti-bot" systems:

![antibots](https://miro.medium.com/max/1400/1*4NhFKMxr-qXodjYpxtiE0w.gif)

- Another common practice is to limit requests when the site detects that they come from the same IP address.
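
One way to stay on the right side of these limits is to check a site's `robots.txt` and pause between requests. The sketch below uses the standard-library `urllib.robotparser` on made-up rules; the `robots_lines` content, the `my-course-bot` user agent, and the helpers `allowed` and `polite_pause` are all assumptions for illustration, and a real crawler would read the rules from the site itself.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)


def allowed(url: str, user_agent: str = "my-course-bot") -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_pause(user_agent: str = "my-course-bot", default: float = 1.0) -> float:
    """Sleep for the advertised crawl delay, or a default if none is given."""
    delay = parser.crawl_delay(user_agent)
    delay = float(delay) if delay is not None else default
    time.sleep(delay)
    return delay


print(allowed("https://example.com/wiki/Page"))
print(allowed("https://example.com/private/report"))
```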

Back to our data:

In [44]:
# Fortune Global 500 data
fortune500

Unnamed: 0,Rank,Company,Country,Industry,Revenue in USD
0,1,Walmart,United States,Retail,$524 billion
1,2,Sinopec Group,China,Petroleum,$407 billion
2,3,State Grid,China,Energy,$384 billion
3,4,China National Petroleum,China,Petroleum,$379 billion
4,5,Royal Dutch Shell,Netherlands,Petroleum,$352 billion
5,6,Saudi Aramco,Saudi Arabia,Energy,$330 billion
6,7,Volkswagen,Germany,Automobiles,$283 billion
7,8,BP,United Kingdom,Petroleum,$283 billion
8,9,Amazon.com,United States,Internet Services and Retailing,$281 billion
9,10,Toyota Motor,Japan,Automobiles,$275 billion


One question we would like to answer is: what is the average revenue per employee?

We can look for this data on Wikipedia as well. I already "scraped" it by hand for you so that we can use it in class:

In [48]:
other_data = [
    {"name": "Walmart",
     "employees": 2200000,
     "year founded": 1962
    },
    {"name": "State Grid Corporation of China",
     "employees": 1566000,
     "year founded": 2002
    },
    {"name": "China National Petroleum Corporation",
     "employees": 460724,
     "year founded": 1988
    },
    {"name": "Berkshire Hathaway Inc.",
     "employees": 360000,
     "year founded": 1839
    },
    {"name": "BP plc",
     "employees": 70100,
     "year founded": 1909
    },
    {"name": "China Petrochemical Corporation",
     "employees": 582648,
     "year founded": 1998
     },
    {"name": "Royal Dutch Shell",
     "employees": 86000,
     "year founded": 1907
    },
    {"name": "Toyota Motor Corporation",
     "employees": 364445,
     "year founded": 1937
    },
    {"name": "Saudi Aramco",
     "employees": 66800,
     "year founded": 1933
    },
    {"name": "Apple Inc.",
     "employees": 147000,
     "year founded": 1976
    },
    {"name": "Volkswagen AG",
     "employees": 307342,
     "year founded": 1937
    },
    {"name": "Amazon.com, Inc.",
     "employees":1298000,
     "year founded": 1994
    }
]

In [55]:
employees_info = pd.DataFrame(other_data)

In [56]:
employees_info

Unnamed: 0,name,employees,year founded
0,Walmart,2200000,1962
1,State Grid Corporation of China,1566000,2002
2,China National Petroleum Corporation,460724,1988
3,Berkshire Hathaway Inc.,360000,1839
4,BP plc,70100,1909
5,China Petrochemical Corporation,582648,1998
6,Royal Dutch Shell,86000,1907
7,Toyota Motor Corporation,364445,1937
8,Saudi Aramco,66800,1933
9,Apple Inc.,147000,1976
10,Volkswagen AG,307342,1937
11,"Amazon.com, Inc.",1298000,1994


In [51]:
fortune500

Unnamed: 0,Rank,Company,Country,Industry,Revenue in USD
0,1,Walmart,United States,Retail,$524 billion
1,2,Sinopec Group,China,Petroleum,$407 billion
2,3,State Grid,China,Energy,$384 billion
3,4,China National Petroleum,China,Petroleum,$379 billion
4,5,Royal Dutch Shell,Netherlands,Petroleum,$352 billion
5,6,Saudi Aramco,Saudi Arabia,Energy,$330 billion
6,7,Volkswagen,Germany,Automobiles,$283 billion
7,8,BP,United Kingdom,Petroleum,$283 billion
8,9,Amazon.com,United States,Internet Services and Retailing,$281 billion
9,10,Toyota Motor,Japan,Automobiles,$275 billion


We might think it would be as easy as merging the two tables on the company name. However, it is easy to see that not all of the names match.
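
With only ten rows the mismatches are visible by eye, but on larger tables a quick check helps. One option, sketched here on a three-row excerpt (the frames `left` and `right` are illustrative stand-ins for the two tables), is to merge with `indicator=True` and look at the rows marked `left_only`.

```python
import pandas as pd

left = pd.DataFrame({"Company": ["Walmart", "Sinopec Group", "BP"]})
right = pd.DataFrame({
    "name": ["Walmart", "China Petrochemical Corporation", "BP plc"],
    "employees": [2200000, 582648, 70100],
})

# indicator=True adds a "_merge" column telling where each key was found.
check = left.merge(right, left_on="Company", right_on="name",
                   how="left", indicator=True)

# Rows present only on the left side found no partner in the other table.
unmatched = check.loc[check["_merge"] == "left_only", "Company"].tolist()
print(unmatched)
```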

In [53]:
# Dictionary mapping one set of names to the other
name_map = {"Walmart": "Walmart",
            "China Petrochemical Corporation": "Sinopec Group",
            "State Grid Corporation of China": "State Grid",
            "China National Petroleum Corporation": "China National Petroleum",
            "Royal Dutch Shell": "Royal Dutch Shell",
            "Saudi Aramco": "Saudi Aramco",
            "Volkswagen AG": "Volkswagen",
            "BP plc": "BP",
            "Amazon.com, Inc.": "Amazon.com",
            "Toyota Motor Corporation": "Toyota Motor"
           }

In [57]:
# Map the names in the employees dataframe
employees_info["company"] = employees_info["name"].map(name_map)
employees_info

Unnamed: 0,name,employees,year founded,company
0,Walmart,2200000,1962,Walmart
1,State Grid Corporation of China,1566000,2002,State Grid
2,China National Petroleum Corporation,460724,1988,China National Petroleum
3,Berkshire Hathaway Inc.,360000,1839,
4,BP plc,70100,1909,BP
5,China Petrochemical Corporation,582648,1998,Sinopec Group
6,Royal Dutch Shell,86000,1907,Royal Dutch Shell
7,Toyota Motor Corporation,364445,1937,Toyota Motor
8,Saudi Aramco,66800,1933,Saudi Aramco
9,Apple Inc.,147000,1976,
10,Volkswagen AG,307342,1937,Volkswagen
11,"Amazon.com, Inc.",1298000,1994,Amazon.com


In [63]:
# Do the merge
fortune500_with_employees = fortune500.merge(right=employees_info[["company", "employees"]],
                                             left_on="Company",
                                             right_on="company",
                                             how="left")
fortune500_with_employees

Unnamed: 0,Rank,Company,Country,Industry,Revenue in USD,company,employees
0,1,Walmart,United States,Retail,$524 billion,Walmart,2200000
1,2,Sinopec Group,China,Petroleum,$407 billion,Sinopec Group,582648
2,3,State Grid,China,Energy,$384 billion,State Grid,1566000
3,4,China National Petroleum,China,Petroleum,$379 billion,China National Petroleum,460724
4,5,Royal Dutch Shell,Netherlands,Petroleum,$352 billion,Royal Dutch Shell,86000
5,6,Saudi Aramco,Saudi Arabia,Energy,$330 billion,Saudi Aramco,66800
6,7,Volkswagen,Germany,Automobiles,$283 billion,Volkswagen,307342
7,8,BP,United Kingdom,Petroleum,$283 billion,BP,70100
8,9,Amazon.com,United States,Internet Services and Retailing,$281 billion,Amazon.com,1298000
9,10,Toyota Motor,Japan,Automobiles,$275 billion,Toyota Motor,364445
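
Two small refinements may be worth considering here, shown as a sketch on a two-row excerpt (the frames `revenue` and `staff` are illustrative): `validate="one_to_one"` makes pandas raise an error if a company name repeats on either side, and dropping the lowercase `company` column removes the duplicated key.

```python
import pandas as pd

revenue = pd.DataFrame({"Company": ["Walmart", "BP"],
                        "Revenue in USD": ["$524 billion", "$283 billion"]})
staff = pd.DataFrame({"company": ["Walmart", "BP"],
                      "employees": [2200000, 70100]})

merged = revenue.merge(
    staff,
    left_on="Company",
    right_on="company",
    how="left",
    validate="one_to_one",   # raises pandas.errors.MergeError if a key repeats
).drop(columns="company")    # same information as "Company" after the merge

print(merged)
```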


In [64]:
fortune500_with_employees.dtypes

Rank               int64
Company           object
Country           object
Industry          object
Revenue in USD    object
company           object
employees          int64
dtype: object

In [65]:
fortune500_with_employees["Revenue in USD"] = \
    fortune500_with_employees["Revenue in USD"].apply(
        lambda s: int(s[1:4]) * 10**9
    )
fortune500_with_employees

Unnamed: 0,Rank,Company,Country,Industry,Revenue in USD,company,employees
0,1,Walmart,United States,Retail,524000000000,Walmart,2200000
1,2,Sinopec Group,China,Petroleum,407000000000,Sinopec Group,582648
2,3,State Grid,China,Energy,384000000000,State Grid,1566000
3,4,China National Petroleum,China,Petroleum,379000000000,China National Petroleum,460724
4,5,Royal Dutch Shell,Netherlands,Petroleum,352000000000,Royal Dutch Shell,86000
5,6,Saudi Aramco,Saudi Arabia,Energy,330000000000,Saudi Aramco,66800
6,7,Volkswagen,Germany,Automobiles,283000000000,Volkswagen,307342
7,8,BP,United Kingdom,Petroleum,283000000000,BP,70100
8,9,Amazon.com,United States,Internet Services and Retailing,281000000000,Amazon.com,1298000
9,10,Toyota Motor,Japan,Automobiles,275000000000,Toyota Motor,364445


In [66]:
fortune500_with_employees.dtypes

Rank               int64
Company           object
Country           object
Industry          object
Revenue in USD     int64
company           object
employees          int64
dtype: object
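
The slice `s[1:4]` works because every value in this table has exactly three digits after the dollar sign. A somewhat more forgiving parser could look like the sketch below; `parse_usd`, `SCALE`, and `PATTERN` are hypothetical names, and the set of scale words is an assumption about what the column might contain.

```python
import re

# Multipliers for the scale words we expect; extend as needed.
SCALE = {"million": 10**6, "billion": 10**9, "trillion": 10**12}
PATTERN = re.compile(
    r"^\$?\s*([\d,]+(?:\.\d+)?)\s*(million|billion|trillion)$",
    re.IGNORECASE,
)


def parse_usd(text: str) -> int:
    """Turn a string such as '$524 billion' into an integer number of dollars."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized amount: {text!r}")
    number = float(match.group(1).replace(",", ""))
    return round(number * SCALE[match.group(2).lower()])


print(parse_usd("$524 billion"))
```

Passing `parse_usd` to `.apply` on the original string column could stand in for the slice-based lambda.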

In [67]:
# Answer the question
fortune500_with_employees["Revenue per employee"] = \
    fortune500_with_employees["Revenue in USD"] / fortune500_with_employees["employees"]

In [69]:
fortune500_with_employees.sort_values(by="Revenue per employee", ascending=False)

Unnamed: 0,Rank,Company,Country,Industry,Revenue in USD,company,employees,Revenue per employee
5,6,Saudi Aramco,Saudi Arabia,Energy,330000000000,Saudi Aramco,66800,4940120.0
4,5,Royal Dutch Shell,Netherlands,Petroleum,352000000000,Royal Dutch Shell,86000,4093023.0
7,8,BP,United Kingdom,Petroleum,283000000000,BP,70100,4037090.0
6,7,Volkswagen,Germany,Automobiles,283000000000,Volkswagen,307342,920798.3
3,4,China National Petroleum,China,Petroleum,379000000000,China National Petroleum,460724,822618.3
9,10,Toyota Motor,Japan,Automobiles,275000000000,Toyota Motor,364445,754572.0
1,2,Sinopec Group,China,Petroleum,407000000000,Sinopec Group,582648,698535.0
2,3,State Grid,China,Energy,384000000000,State Grid,1566000,245210.7
0,1,Walmart,United States,Retail,524000000000,Walmart,2200000,238181.8
8,9,Amazon.com,United States,Internet Services and Retailing,281000000000,Amazon.com,1298000,216486.9
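
The table gives one ratio per company. To reduce it to a single "average", there are at least two reasonable readings, sketched below with the values copied from the table: the plain mean of the ten ratios, where every company counts the same, and the pooled ratio of total revenue to total employees, where every employee counts the same. The variable names are illustrative.

```python
import pandas as pd

# Values copied from the merged table.
df = pd.DataFrame({
    "Company": ["Walmart", "Sinopec Group", "State Grid", "China National Petroleum",
                "Royal Dutch Shell", "Saudi Aramco", "Volkswagen", "BP",
                "Amazon.com", "Toyota Motor"],
    "Revenue in USD": [r * 10**9 for r in
                       (524, 407, 384, 379, 352, 330, 283, 283, 281, 275)],
    "employees": [2200000, 582648, 1566000, 460724, 86000,
                  66800, 307342, 70100, 1298000, 364445],
})
df["Revenue per employee"] = df["Revenue in USD"] / df["employees"]

# Mean of the per-company ratios: every company weighs the same.
simple_mean = df["Revenue per employee"].mean()

# Total revenue over total headcount: every employee weighs the same.
pooled = df["Revenue in USD"].sum() / df["employees"].sum()

top_company = df.loc[df["Revenue per employee"].idxmax(), "Company"]
print(round(simple_mean), round(pooled), top_company)
```

The two figures differ noticeably because the companies with the most employees have the lowest revenue per employee.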



<footer id="attribution" style="float:right; color:#808080; background:#fff;">
Created with Jupyter by Esteban Jiménez Rodríguez.
</footer>