# Business Intelligence: Lab 1
## Team members: Group 17

* Mariana Díaz Arenas - [m.diaza2](mailto:m.diaza2@uniandes.edu.co) 
* Esteban Gonzales Ruales - [e.gonzalez5](mailto:e.gonzalez5@uniandes.edu.co) 
* Juan Diego Yepes - [j.yepes](mailto:j.yepes@uniandes.edu.co) 


This Jupyter notebook implements our solution to the proposed lab: [assignment statement](https://gitlab.virtual.uniandes.edu.co/ISIS3301/laboratorios/blob/master/202310/Lab%201%20-%20Clustering/Laboratorio1_enunciado.md)

# 1. Data understanding

In [2]:
# Import libraries and create the DataFrame

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

In [4]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

df_roads = pd.read_csv('./data/BiciAlpes.csv', index_col = 0, encoding = "latin1", sep = ";")

# Print the shape of the DataFrame
df_roads.shape

(5338, 14)

In [28]:
# Check that the data loaded correctly
# TODO: why does the Unnamed: 14 column appear?
df_roads

Time,Number_of_Casualties,Day_of_Week,Road_Type,Speed_limit,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Vehicle_Type,Did_Police_Officer_Attend_Scene_of_Accident,Junction_Detail,Number_of_Vehicles,Accident_Severity,Unnamed: 14
Mañana,1,Día laboral,6,30.0,1,1,1,1,bike,1,3,1,3,
Mañana,1,Día laboral,6,30.0,1,1,1,1,bike,1,0,1,3,
Tarde,1,Fin de semana,6,30.0,1,1,1,1,bike,1,3,1,3,
Tarde,2,Día laboral,6,30.0,1,1,1,1,bike,1,6,1,2,
Mañana,2,Día laboral,6,30.0,1,1,1,1,bike,1,6,1,3,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Tarde,1,Día laboral,6,20.0,1,1,2,1,bike,1,9,1,3,
Noche,2,Día laboral,6,30.0,1,1,1,2,bike,2,0,1,2,
Tarde,1,Día laboral,6,30.0,4,5,2,1,bike,2,3,1,2,
Noche,1,Día laboral,6,30.0,1,1,1,1,bike,1,6,1,3,
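
One plausible explanation for the empty `Unnamed: 14` column (an assumption, since the raw file is not shown here) is that every line of the CSV ends with a trailing `;`, which pandas reads as one extra, unnamed field. The sketch below reproduces that behavior on a small in-memory sample with made-up rows and then drops columns that are entirely null.

```python
import io

import pandas as pd

# Hypothetical sample: every line ends with a trailing ';'
# (assumed for illustration, not taken from BiciAlpes.csv)
raw = (
    "Time;Speed_limit;Accident_Severity;\n"
    "Mañana;30;3;\n"
    "Tarde;20;2;\n"
)

toy = pd.read_csv(io.StringIO(raw), sep=";", index_col=0)
print(list(toy.columns))  # the trailing separator yields an extra 'Unnamed: 3' column

# Drop every column whose values are all missing
toy_clean = toy.dropna(axis="columns", how="all")
print(list(toy_clean.columns))
```

If the same pattern holds for the real file, `df_roads.dropna(axis="columns", how="all")` would remove `Unnamed: 14` and leave the other columns untouched.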


In [9]:
# Completeness

(df_roads.isnull().sum() / df_roads.shape[0]).sort_values(ascending = False)

Unnamed: 14                                    1.000000
Day_of_Week                                    0.003559
Number_of_Casualties                           0.000000
Road_Type                                      0.000000
Speed_limit                                    0.000000
Light_Conditions                               0.000000
Weather_Conditions                             0.000000
Road_Surface_Conditions                        0.000000
Urban_or_Rural_Area                            0.000000
Vehicle_Type                                   0.000000
Did_Police_Officer_Attend_Scene_of_Accident    0.000000
Junction_Detail                                0.000000
Number_of_Vehicles                             0.000000
Accident_Severity                              0.000000
dtype: float64
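
The fractions above are easier to act on next to absolute counts: 0.003559 of 5338 rows is about 19 records with a missing `Day_of_Week`. A minimal sketch of such a report follows; the helper name `missing_report` and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and fractions per column, worst first."""
    n_missing = df.isnull().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "fraction": n_missing / len(df),
    })
    return report.sort_values("fraction", ascending=False)


# Toy data (illustrative only, not the real dataset)
toy = pd.DataFrame({
    "Day_of_Week": ["Día laboral", None, "Fin de semana", "Día laboral"],
    "Speed_limit": [30.0, 30.0, 20.0, 30.0],
    "Unnamed: 14": [np.nan] * 4,
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(df_roads)` would give the same fractions as the cell above plus the row counts behind them.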

In [21]:
# Uniqueness
df_roads.loc[df_roads.duplicated(subset = df_roads.columns[1:], keep = False)].sort_values(by = "Time")

Time,Number_of_Casualties,Day_of_Week,Road_Type,Speed_limit,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Vehicle_Type,Did_Police_Officer_Attend_Scene_of_Accident,Junction_Detail,Number_of_Vehicles,Accident_Severity,Unnamed: 14
Madrugada,1,Fin de semana,6,30.0,4,1,1,1,bike,1,3,1,3,
Madrugada,1,,6,30.0,4,1,1,1,bike,1,0,1,3,
Madrugada,1,Día laboral,6,30.0,1,1,1,1,bike,1,9,1,3,
Madrugada,1,Día laboral,6,30.0,4,1,1,1,bike,1,0,1,3,
Mañana,1,Día laboral,6,30.0,1,1,1,1,bike,1,3,1,3,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Tarde,2,Fin de semana,6,30.0,1,1,1,1,bike,1,0,1,2,
Tarde,1,Día laboral,6,30.0,1,1,2,1,bike,1,3,1,2,
Tarde,1,Día laboral,6,30.0,1,1,1,1,bike,1,6,1,3,
Tarde,1,Día laboral,6,30.0,1,1,1,1,bike,1,6,1,3,


In [23]:
# Number of duplicated records
# TODO: does it make sense to have this many repeated values?
print('Total duplicated records:', df_roads.loc[df_roads.duplicated(subset = df_roads.columns[1:], keep = False)].shape[0])

Total duplicated records: 4375
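
Two details affect how the count above should be read. With `keep = False`, every member of a duplicate group is flagged, so the figure counts all rows that have at least one twin, not the number of surplus copies. Also, because `Time` is the index, `df_roads.columns[1:]` starts at `Day_of_Week`, so neither `Time` nor `Number_of_Casualties` takes part in the comparison. The sketch below contrasts the variants on made-up rows; whether excluding those columns was intended is not stated, so the full-row comparison is only one possible alternative.

```python
import pandas as pd

# Made-up rows mirroring the notebook layout: Time is the index
toy = pd.DataFrame(
    {
        "Number_of_Casualties": [1, 2, 1, 1],
        "Day_of_Week": ["Día laboral"] * 4,
        "Speed_limit": [30.0, 30.0, 30.0, 30.0],
    },
    index=pd.Index(["Mañana", "Mañana", "Tarde", "Tarde"], name="Time"),
)

# Same pattern as the notebook: the first column is skipped and the index is ignored
subset_mask = toy.duplicated(subset=toy.columns[1:], keep=False)

# Alternative: bring Time back as a column and compare complete rows
full = toy.reset_index()
all_members = full.duplicated(keep=False)     # every row that has a twin
extra_copies = full.duplicated(keep="first")  # only the surplus copies

print(subset_mask.sum(), all_members.sum(), extra_copies.sum())
```

On the TODO: the dataset has no accident identifier and most columns are low-cardinality codes, so many coinciding rows may well be distinct accidents. That is an interpretation, not something the data confirms.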


In [26]:
# Consistency
df_roads.groupby(['Time', 'Day_of_Week']).size()

Time       Day_of_Week  
Madrugada  Día laboral         3
           Fin de semana       1
Mañana     Día laboral      1249
           Fin de semana     429
Noche      Día laboral      1041
           Fin de semana     428
Tarde      Día laboral      1539
           Fin de semana     629
dtype: int64
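
The counts above add up to 5319, which is 19 fewer than the 5338 rows in the DataFrame. `groupby` drops rows whose key is null by default, and the gap matches the share of missing `Day_of_Week` values found in the completeness check. A minimal sketch on made-up data that keeps those rows visible by labeling them first (the `(missing)` label is an arbitrary choice):

```python
import pandas as pd

# Made-up rows; Time is the index, as in the notebook
toy = pd.DataFrame(
    {"Day_of_Week": ["Día laboral", None, "Fin de semana", "Día laboral"]},
    index=pd.Index(["Mañana", "Madrugada", "Tarde", "Mañana"], name="Time"),
)
flat = toy.reset_index()

# Default behavior: the row with a null Day_of_Week disappears from the counts
default_counts = flat.groupby(["Time", "Day_of_Week"]).size()

# Label missing values explicitly so every row is counted
labeled = flat.fillna({"Day_of_Week": "(missing)"})
full_counts = labeled.groupby(["Time", "Day_of_Week"]).size()

print(default_counts.sum(), full_counts.sum())
```

Comparing the sum of the grouped counts with `len(df_roads)` is a quick way to notice rows that fell out of a consistency check.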