#Big Data

##**Challenge 2 - Introduction to Big Data**

###__Francisca Pinto__
### December 27, 2021

**Exercise 1 - Ingesting semi-structured data**

1. We start by importing the required modules.
2. Using <code>get</code>, we query the <code>balldontlie</code> API, requesting the first 100 games.
3. The response is kept as a <code>json</code> object.
4. The data and metadata of the object are inspected.
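As a small illustration of step 2, the sketch below shows how a parameter dictionary such as `{"per_page": 100}` can become the query string of the request URL. It uses only the standard library and sends nothing over the network. The helper name `build_url` is hypothetical, and it is an assumption that `requests` produces an equivalent URL for this simple case.

```python
from urllib.parse import urlencode

def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params)}"

bdl_url = "https://www.balldontlie.io/api/v1/games"
bdl_params = {"per_page": 100}

# The dictionary becomes "?per_page=100" at the end of the URL
full_url = build_url(bdl_url, bdl_params)
print(full_url)
```

Keeping the parameters in a dictionary, rather than concatenating them into the URL by hand, makes it easier to add further parameters later.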

In [73]:
# add execution-time reporting for each cell
!pip install ipython-autotime
%load_ext autotime

import requests
import json
import pandas as pd
import functools
import numpy as np

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 3.15 s (started: 2021-12-28 00:09:08 +00:00)


In [74]:
bdl_url = "https://www.balldontlie.io/api/v1/games"
bdl_params = {"per_page" : 100}

bdl = requests.get(bdl_url, params = bdl_params)

time: 1.32 s (started: 2021-12-28 00:09:12 +00:00)


The status code is checked:

In [75]:
bdl.status_code

200

time: 4.24 ms (started: 2021-12-28 00:09:13 +00:00)
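A status code of 200 means the request succeeded. The following is a minimal sketch of how such a check could be made explicit with the standard library's `http.HTTPStatus`; the function name `describe_status` is hypothetical and not part of the notebook.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (is_success, reason phrase) for an HTTP status code."""
    is_success = 200 <= code < 300  # 2xx codes signal success
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code not registered in the standard library's enum
        phrase = "Unknown"
    return is_success, phrase

print(describe_status(200))
print(describe_status(404))
```

When working with `requests`, calling `raise_for_status()` on the response is a common alternative, since it raises an exception for 4xx and 5xx responses instead of requiring a manual comparison.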


The response is kept in the semi-structured <code>json</code> format.

In [76]:
bdl_json = bdl.json()

time: 1.78 ms (started: 2021-12-28 00:09:13 +00:00)
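To illustrate what parsing the response amounts to, the sketch below decodes a JSON text with the standard `json` module. The payload is an abbreviated, illustrative string that only mimics the top-level layout of the API response (a `data` list and a `meta` object); it is not the real response body.

```python
import json

# Abbreviated, illustrative payload with the same top-level layout as the API response
raw_text = '''
{
  "data": [{"id": 47179, "home_team_score": 126, "visitor_team_score": 94}],
  "meta": {"current_page": 1, "per_page": 100}
}
'''

payload = json.loads(raw_text)     # JSON object -> Python dict
top_level_keys = sorted(payload)   # keys available at the top level
first_game = payload["data"][0]    # JSON array -> Python list of dicts

print(top_level_keys)
print(first_game["home_team_score"])
```

JSON objects map to dictionaries and JSON arrays map to lists, which is why the later cells can index the result with `["data"]` and `["meta"]`.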


The data and metadata contained in the newly created object are inspected.

In [77]:
bdl_json["data"]

[{'date': '2019-01-30T00:00:00.000Z',
  'home_team': {'abbreviation': 'BOS',
   'city': 'Boston',
   'conference': 'East',
   'division': 'Atlantic',
   'full_name': 'Boston Celtics',
   'id': 2,
   'name': 'Celtics'},
  'home_team_score': 126,
  'id': 47179,
  'period': 4,
  'postseason': False,
  'season': 2018,
  'status': 'Final',
  'time': ' ',
  'visitor_team': {'abbreviation': 'CHA',
   'city': 'Charlotte',
   'conference': 'East',
   'division': 'Southeast',
   'full_name': 'Charlotte Hornets',
   'id': 4,
   'name': 'Hornets'},
  'visitor_team_score': 94},
 {'date': '2019-02-09T00:00:00.000Z',
  'home_team': {'abbreviation': 'BOS',
   'city': 'Boston',
   'conference': 'East',
   'division': 'Atlantic',
   'full_name': 'Boston Celtics',
   'id': 2,
   'name': 'Celtics'},
  'home_team_score': 112,
  'id': 48751,
  'period': 4,
  'postseason': False,
  'season': 2018,
  'status': 'Final',
  'time': '     ',
  'visitor_team': {'abbreviation': 'LAC',
   'city': 'LA',
   'conferenc

time: 71.3 ms (started: 2021-12-28 00:09:13 +00:00)


Requesting <code>["data"]</code> returns the data retrieved from the API, which in this case is the information for the 100 games.

In [78]:
bdl_json["meta"]

{'current_page': 1,
 'next_page': 2,
 'per_page': 100,
 'total_count': 51163,
 'total_pages': 512}

time: 4.62 ms (started: 2021-12-28 00:09:13 +00:00)


Requesting <code>["meta"]</code> returns the metadata of the query, which puts it in context relative to the total amount of data stored in the API: this is page 1 of 512, holding 100 of the 51163 games available.
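The relationship between the metadata fields can be checked with a little arithmetic. This sketch copies the `meta` values shown above into a plain dictionary and assumes that the API fills every page except possibly the last one.

```python
import math

# Values copied from the "meta" output above
meta = {"current_page": 1, "next_page": 2, "per_page": 100,
        "total_count": 51163, "total_pages": 512}

# Number of pages needed to hold every game at 100 games per page
expected_pages = math.ceil(meta["total_count"] / meta["per_page"])

# Games left over for the final page
games_on_last_page = meta["total_count"] - (expected_pages - 1) * meta["per_page"]

# Fraction of all stored games covered by this single request
share_retrieved = meta["per_page"] / meta["total_count"]

print(expected_pages)
print(games_on_last_page)
print(f"{share_retrieved:.2%}")
```

The computed page count agrees with `total_pages`, and the small share retrieved is a reminder that the 100 games analysed here are only a thin slice of the full dataset.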

**Exercise 2 - Organizing the data**

1. The <code>json</code> object is stored in a new dataframe with only the necessary data (keeping only the <code>data</code> section).
2. Columns that are not used in the analysis are dropped.
3. Using list comprehensions, lists are built from the values inside the <code>home_team</code> and <code>visitor_team</code> dictionaries, and then added as new columns to the existing dataframe.
4. The teams and statistics of <code>home_team</code> and <code>visitor_team</code> are processed in parallel.
5. Finally, the dataframe is inspected to verify its state.
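The flattening described in steps 3 and 4 can be sketched on a single record with plain Python. The helper `flatten_game` and the prefixed key names it produces (`home_name`, `visitor_division`, and so on) are hypothetical; the notebook instead overwrites `home_team` and `visitor_team` with the team names. The sample record is an abbreviated version of the first game shown earlier.

```python
def flatten_game(game, nested_keys=("home_team", "visitor_team")):
    """Copy a game record, replacing nested team dicts with prefixed flat keys."""
    flat = {}
    for key, value in game.items():
        if key in nested_keys and isinstance(value, dict):
            prefix = key.replace("_team", "")  # "home_team" -> "home"
            for sub_key, sub_value in value.items():
                flat[f"{prefix}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

# Abbreviated version of the first game in the response
game = {
    "home_team": {"name": "Celtics", "conference": "East", "division": "Atlantic"},
    "home_team_score": 126,
    "visitor_team": {"name": "Hornets", "conference": "East", "division": "Southeast"},
    "visitor_team_score": 94,
}

flat_game = flatten_game(game)
# Point differential from each side's perspective
flat_game["home_differential"] = flat_game["home_team_score"] - flat_game["visitor_team_score"]
flat_game["visitor_differential"] = -flat_game["home_differential"]
print(flat_game)
```

pandas also offers `pd.json_normalize` for this kind of flattening, which may be worth considering when many nested fields are needed.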

In [79]:
bdl_df = pd.DataFrame(data = bdl_json["data"]).copy()

time: 9.41 ms (started: 2021-12-28 00:09:13 +00:00)


In [80]:
bdl_json_data = bdl_json["data"]

time: 1.2 ms (started: 2021-12-28 00:09:13 +00:00)


In [81]:
bdl_df.drop(columns = ["id",
                       "date",
                       "postseason",
                       "status",
                       "time"],
            inplace = True)

time: 4.66 ms (started: 2021-12-28 00:09:13 +00:00)


In [82]:
bdl_df

,home_team,home_team_score,period,season,visitor_team,visitor_team_score
0,"{'id': 2, 'abbreviation': 'BOS', 'city': 'Bost...",126,4,2018,"{'id': 4, 'abbreviation': 'CHA', 'city': 'Char...",94
1,"{'id': 2, 'abbreviation': 'BOS', 'city': 'Bost...",112,4,2018,"{'id': 13, 'abbreviation': 'LAC', 'city': 'LA'...",123
2,"{'id': 23, 'abbreviation': 'PHI', 'city': 'Phi...",117,4,2018,"{'id': 8, 'abbreviation': 'DEN', 'city': 'Denv...",110
3,"{'id': 30, 'abbreviation': 'WAS', 'city': 'Was...",119,4,2018,"{'id': 6, 'abbreviation': 'CLE', 'city': 'Clev...",106
4,"{'id': 26, 'abbreviation': 'SAC', 'city': 'Sac...",102,4,2018,"{'id': 16, 'abbreviation': 'MIA', 'city': 'Mia...",96
...,...,...,...,...,...,...
95,"{'id': 30, 'abbreviation': 'WAS', 'city': 'Was...",112,4,2018,"{'id': 12, 'abbreviation': 'IND', 'city': 'Ind...",119
96,"{'id': 3, 'abbreviation': 'BKN', 'city': 'Broo...",99,4,2018,"{'id': 25, 'abbreviation': 'POR', 'city': 'Por...",113
97,"{'id': 6, 'abbreviation': 'CLE', 'city': 'Clev...",111,4,2018,"{'id': 24, 'abbreviation': 'PHX', 'city': 'Pho...",98
98,"{'id': 3, 'abbreviation': 'BKN', 'city': 'Broo...",116,4,2018,"{'id': 30, 'abbreviation': 'WAS', 'city': 'Was...",125


time: 47.7 ms (started: 2021-12-28 00:09:13 +00:00)


In [83]:
# iterate over the games directly instead of hard-coding the 100-game range
bdl_home = [game["home_team"]["name"] for game in bdl_json_data]
bdl_visitor = [game["visitor_team"]["name"] for game in bdl_json_data]

bdl_home_conference = [game["home_team"]["conference"] for game in bdl_json_data]
bdl_visitor_conference = [game["visitor_team"]["conference"] for game in bdl_json_data]

bdl_home_division = [game["home_team"]["division"] for game in bdl_json_data]
bdl_visitor_division = [game["visitor_team"]["division"] for game in bdl_json_data]

bdl_home_differential = [game["home_team_score"] - game["visitor_team_score"] for game in bdl_json_data]
bdl_visitor_differential = [game["visitor_team_score"] - game["home_team_score"] for game in bdl_json_data]

time: 12.4 ms (started: 2021-12-28 00:09:13 +00:00)


In [84]:
bdl_df["home_team"] = bdl_home
bdl_df["visitor_team"] = bdl_visitor

bdl_df["home_conference"] = bdl_home_conference
bdl_df["visitor_conference"] = bdl_visitor_conference

bdl_df["home_division"] = bdl_home_division
bdl_df["visitor_division"] = bdl_visitor_division

bdl_df["home_differential"] = bdl_home_differential
bdl_df["visitor_differential"] = bdl_visitor_differential

time: 10.3 ms (started: 2021-12-28 00:09:13 +00:00)


In [85]:
bdl_df

,home_team,home_team_score,period,season,visitor_team,visitor_team_score,home_conference,visitor_conference,home_division,visitor_division,home_differential,visitor_differential
0,Celtics,126,4,2018,Hornets,94,East,East,Atlantic,Southeast,32,-32
1,Celtics,112,4,2018,Clippers,123,East,West,Atlantic,Pacific,-11,11
2,76ers,117,4,2018,Nuggets,110,East,West,Atlantic,Northwest,7,-7
3,Wizards,119,4,2018,Cavaliers,106,East,East,Southeast,Central,13,-13
4,Kings,102,4,2018,Heat,96,West,East,Pacific,Southeast,6,-6
...,...,...,...,...,...,...,...,...,...,...,...,...
95,Wizards,112,4,2018,Pacers,119,East,East,Southeast,Central,-7,7
96,Nets,99,4,2018,Trail Blazers,113,East,West,Atlantic,Northwest,-14,14
97,Cavaliers,111,4,2018,Suns,98,East,West,Central,Pacific,13,-13
98,Nets,116,4,2018,Wizards,125,East,East,Atlantic,Southeast,-9,9


time: 32.2 ms (started: 2021-12-28 00:09:13 +00:00)


**Exercise 3 - The effect of playing at home**

1. With <code>np.where</code>, the columns <code>home_wins</code> and <code>visitor_wins</code> are created to flag the games won by the home team and by the visiting team.
2. With <code>sort_values</code>, the dataframe is sorted to find which teams perform best at home and away. Since <code>home_differential</code> and <code>visitor_differential</code> are opposites of each other, the sort is performed only once: the largest home differentials sit at one end of the sorted dataframe and the largest visitor differentials at the other.
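The two steps above can be sketched without pandas on the first five games of the table shown earlier. This is only an illustration of the logic, assuming that a positive differential means a win for that side; the variable names are hypothetical.

```python
# (home_team, home_score, visitor_team, visitor_score): first five rows of the dataframe
games = [
    ("Celtics", 126, "Hornets", 94),
    ("Celtics", 112, "Clippers", 123),
    ("76ers", 117, "Nuggets", 110),
    ("Wizards", 119, "Cavaliers", 106),
    ("Kings", 102, "Heat", 96),
]

rows = []
for home, home_score, visitor, visitor_score in games:
    home_diff = home_score - visitor_score
    rows.append({
        "home_team": home,
        "visitor_team": visitor,
        "home_differential": home_diff,
        "visitor_differential": -home_diff,        # always the opposite sign
        "home_wins": 1 if home_diff > 0 else 0,    # same rule as np.where(diff > 0, 1, 0)
        "visitor_wins": 1 if home_diff < 0 else 0,
    })

# One ascending sort serves both questions
rows_sorted = sorted(rows, key=lambda r: r["home_differential"])
best_visitor_game = rows_sorted[0]    # most negative home differential
best_home_game = rows_sorted[-1]      # most positive home differential

print(best_home_game["home_team"], best_home_game["home_differential"])
print(best_visitor_game["visitor_team"], best_visitor_game["visitor_differential"])
```

Note that sorting by differential ranks individual games rather than teams, so a team can appear more than once near either end of the sorted result.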

In [86]:
bdl_df["home_wins"] = np.where(bdl_df["home_differential"] > 0, 1, 0)
bdl_df["visitor_wins"] = np.where(bdl_df["visitor_differential"] > 0, 1, 0)

time: 4.48 ms (started: 2021-12-28 00:09:13 +00:00)


In [87]:
bdl_df_sorted = bdl_df.sort_values(by = ["home_differential"],
                                   ascending = True)

time: 2.61 ms (started: 2021-12-28 00:09:13 +00:00)


In [100]:
bdl_df_sorted.iloc[-10: , :]

,home_team,home_team_score,period,season,visitor_team,visitor_team_score,home_conference,visitor_conference,home_division,visitor_division,home_differential,visitor_differential,home_wins,visitor_wins
76,Magic,119,4,2018,76ers,98,East,East,Southeast,Atlantic,21,-21,1,0
39,Trail Blazers,129,4,2018,Warriors,107,West,West,Northwest,Pacific,22,-22,1,0
61,Pistons,131,4,2018,Bulls,108,East,East,Central,Central,23,-23,1,0
12,76ers,143,4,2018,Lakers,120,East,West,Atlantic,Pacific,23,-23,1,0
69,Kings,129,4,2018,Bulls,102,West,East,Pacific,Central,27,-27,1,0
84,Trail Blazers,116,4,2018,Grizzlies,89,West,West,Northwest,Southwest,27,-27,1,0
0,Celtics,126,4,2018,Hornets,94,East,East,Atlantic,Southeast,32,-32,1,0
88,Magic,149,4,2018,Hawks,113,East,East,Southeast,Southeast,36,-36,1,0
93,Magic,127,4,2018,Hornets,89,East,East,Southeast,Southeast,38,-38,1,0
51,Nets,127,4,2018,Mavericks,88,East,West,Atlantic,Southwest,39,-39,1,0


time: 61.8 ms (started: 2021-12-28 00:12:13 +00:00)


The teams with the best performance at home are:

1. <code>Nets</code>
2. <code>Magic</code>
3. <code>Celtics</code>
4. <code>Trail Blazers</code>
5. <code>Kings</code>

The teams with the best performance as visitors are:

1. <code>Warriors</code>
2. <code>Bulls</code>
3. <code>Pelicans</code>
4. <code>Cavaliers</code>
5. <code>Bucks</code>

**Exercise 4 - Computing the win percentage at home and away**

1. <code>arrays</code> with the name of each team are created with the <code>unique</code> method.
2. Dataframes are created by grouping on each team and computing the proportion of games won at home and away (two dataframes, since these are two different groupings, which are joined in a later step).
3. The teams whose win proportion is the same at home and away are filtered.
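The grouping and filtering described above can be sketched in plain Python as follows. The records are a hypothetical mini-sample with made-up team labels, not the notebook's data, and `win_rate` is a hypothetical helper that mirrors the idea of averaging a 0/1 win flag per team.

```python
from collections import defaultdict
import math

def win_rate(records, team_key, win_key):
    """Average the 0/1 win flag per team (same idea as a group-by followed by a mean)."""
    totals = defaultdict(lambda: [0, 0])  # team -> [wins, games]
    for record in records:
        totals[record[team_key]][0] += record[win_key]
        totals[record[team_key]][1] += 1
    return {team: wins / games for team, (wins, games) in totals.items()}

# Hypothetical mini-sample, not the notebook's data
records = [
    {"home_team": "A", "visitor_team": "B", "home_wins": 1, "visitor_wins": 0},
    {"home_team": "B", "visitor_team": "A", "home_wins": 0, "visitor_wins": 1},
    {"home_team": "A", "visitor_team": "C", "home_wins": 0, "visitor_wins": 1},
    {"home_team": "C", "visitor_team": "A", "home_wins": 1, "visitor_wins": 0},
    {"home_team": "B", "visitor_team": "C", "home_wins": 1, "visitor_wins": 0},
    {"home_team": "C", "visitor_team": "B", "home_wins": 1, "visitor_wins": 0},
]

home_rate = win_rate(records, "home_team", "home_wins")
visitor_rate = win_rate(records, "visitor_team", "visitor_wins")

# Teams whose win rate is the same at home and away
same_rate = sorted(
    team for team in home_rate
    if team in visitor_rate and math.isclose(home_rate[team], visitor_rate[team])
)
print(same_rate)
```

Using `math.isclose` instead of `==` is an optional safeguard when comparing floating-point rates; with small integer counts like these, both comparisons are expected to agree.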

In [89]:
home_team = bdl_df["home_team"].unique()
visitor_team = bdl_df["visitor_team"].unique()

time: 2.34 ms (started: 2021-12-28 00:09:13 +00:00)


In [90]:
groupby_home_team = bdl_df.groupby(by = ["home_team"])[["home_wins"]].mean().copy()
groupby_visitor_team = bdl_df.groupby(by = ["visitor_team"])[["visitor_wins"]].mean().copy()

bdl_wins_df = groupby_home_team.join(groupby_visitor_team.visitor_wins)

time: 15.2 ms (started: 2021-12-28 00:09:13 +00:00)
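By default, `DataFrame.join` performs a left join on the index, so the joined table keeps only teams that appear in `groupby_home_team`, and a team with no away games in the sample would receive a missing value. The sketch below mimics that behaviour with dictionaries; the team labels and rates are hypothetical, and `None` stands in for the `NaN` that pandas would produce.

```python
home_rate = {"A": 0.75, "B": 0.0}     # hypothetical teams and rates
visitor_rate = {"A": 0.5, "C": 1.0}

# Left join: keep every key of the left table and look up the right one
left_joined = {team: (rate, visitor_rate.get(team)) for team, rate in home_rate.items()}

# Outer join: keep the union of keys from both tables
all_teams = sorted(set(home_rate) | set(visitor_rate))
outer_joined = {team: (home_rate.get(team), visitor_rate.get(team)) for team in all_teams}

print(left_joined)
print(outer_joined)
```

If teams missing from one side should remain visible, passing `how = "outer"` to `join` is one option to consider.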


In [91]:
groupby_home_team

home_team,home_wins
76ers,0.8
Bucks,0.666667
Bulls,0.2
Cavaliers,0.5
Celtics,0.75
Grizzlies,0.333333
Hawks,0.2
Heat,1.0
Hornets,0.0
Jazz,1.0


time: 16.8 ms (started: 2021-12-28 00:09:13 +00:00)


In [92]:
groupby_visitor_team

visitor_team,visitor_wins
76ers,0.666667
Bucks,1.0
Bulls,0.25
Cavaliers,0.0
Celtics,0.666667
Clippers,1.0
Grizzlies,0.0
Hawks,0.0
Heat,0.333333
Hornets,0.285714


time: 16.1 ms (started: 2021-12-28 00:09:13 +00:00)


In [93]:
bdl_wins_df[bdl_wins_df["home_wins"] == bdl_wins_df["visitor_wins"]]

home_team,home_wins,visitor_wins
Knicks,0.0,0.0
Lakers,0.5,0.5
Pelicans,0.333333,0.333333
Suns,0.0,0.0


time: 15.1 ms (started: 2021-12-28 00:09:13 +00:00)


Finally, the teams that show the same win rate at home and away in the games analyzed (setting aside that this is only 100 games, covering one season and a small part of a second) are:

1. <code>Knicks</code>: no wins on record, neither at home nor away.
2. <code>Lakers</code>: won 50% of its games both at home and away.
3. <code>Pelicans</code>: won 1 of every 3 games, both at home and away.
4. <code>Suns</code>: same as <code>Knicks</code>.