# Events Cleaning

This notebook cleans the data in events.csv: we make sure each column has the correct type and uses as little memory as possible, and we assess which data are relevant and which are not.

In [5]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

When reading the CSV, we specify the column types with the dtype parameter and parse the columns that hold dates with the parse_dates parameter (having inspected the CSV beforehand to identify what each column contains). This reduces both the read time and the memory the data occupies.

In [6]:
events = pd.read_csv('../Data/events.csv.gzip', compression = 'gzip', 
                     dtype={'event_id':np.int16,'ref_type': 'category', 'application_id':np.int16, 
                            'device_countrycode': 'category', 'connection_type':'category'}, parse_dates=['date'])
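
As a rough illustration of the memory effect, the sketch below reads a small in-memory CSV twice, once with default types and once with explicit ones. The sample rows, the 'Cellular' value and the name sample_csv are made up for the example and are not taken from events.csv.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data; the real file has far more rows and columns.
lines = ["date,event_id,connection_type"]
for i in range(1000):
    conn = "Cable/DSL" if i % 2 else "Cellular"
    lines.append(f"2019-03-05 00:00:{i % 60:02d},{i % 100},{conn}")
sample_csv = "\n".join(lines)

# Default read: event_id is stored as int64 and the text columns as strings.
default_df = pd.read_csv(io.StringIO(sample_csv))

# Typed read: small integers, categories and parsed dates.
typed_df = pd.read_csv(
    io.StringIO(sample_csv),
    dtype={"event_id": np.int16, "connection_type": "category"},
    parse_dates=["date"],
)

# deep=True also counts the memory held by string values.
default_bytes = default_df.memory_usage(deep=True).sum()
typed_bytes = typed_df.memory_usage(deep=True).sum()
print(default_bytes, typed_bytes)
```

The exact byte counts depend on the pandas version, but the typed frame should come out noticeably smaller.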

In [7]:
events.head()

Unnamed: 0,date,event_id,ref_type,ref_hash,application_id,attributed,device_countrycode,device_os_version,device_brand,device_model,...,trans_id,user_agent,event_uuid,carrier,kind,device_os,wifi,connection_type,ip_address,device_language
0,2019-03-05 00:09:36.966,0,1891515180541284343,2688759737656491380,38,False,6333597102633388268,5.908703e+17,,5.990117e+18,...,,,a9c0b263-acb2-4577-92c5-cbde5d7a5db1,2.248157e+17,5.516623e+18,7.531669e+18,,Cable/DSL,7858558567428669000,4.077062e+17
1,2019-03-05 00:09:38.920,1,1891515180541284343,2688759737656491380,38,False,6333597102633388268,5.908703e+17,,5.990117e+18,...,,,1cd98205-0d97-4ec2-a019-667997dbfe7a,2.248157e+17,9.97766e+17,7.531669e+18,,Cable/DSL,7858558567428669000,4.077062e+17
2,2019-03-05 00:09:26.195,0,1891515180541284343,2688759737656491380,38,False,6333597102633388268,5.908703e+17,,5.990117e+18,...,,,f02e2924-21ae-492b-b625-9021ae0a4eca,2.248157e+17,5.516623e+18,7.531669e+18,,Cable/DSL,7858558567428669000,4.077062e+17
3,2019-03-05 00:09:31.107,2,1891515180541284343,2688759737656491380,38,False,6333597102633388268,5.908703e+17,,5.990117e+18,...,,,a813cf45-a36e-4668-85e2-5395f1564e98,2.248157e+17,8.561153e+18,7.531669e+18,,Cable/DSL,6324037615828123965,4.077062e+17
4,2019-03-09 21:00:36.585,3,1891515180541284343,2635154697734164782,38,False,6333597102633388268,7.391844e+18,,5.960896e+18,...,,,63a4f0aa-e147-469f-8c55-4ca4f8d0e310,2.248157e+17,8.731902e+17,7.531669e+18,,Cable/DSL,2894495631302821483,3.301378e+18


In [8]:
events.describe(include='all')

Unnamed: 0,date,event_id,ref_type,ref_hash,application_id,attributed,device_countrycode,device_os_version,device_brand,device_model,...,trans_id,user_agent,event_uuid,carrier,kind,device_os,wifi,connection_type,ip_address,device_language
count,2494423,2494423.0,2494423.0,2494423.0,2494423.0,2494423,2494423.0,1022066.0,1164963.0,2406456.0,...,82,1391527.0,2489324,616434.0,2489324.0,657667.0,1378872,612463,2494423.0,2406604.0
unique,2488829,,2.0,,,2,1.0,,,,...,13,,2489324,,,,2,3,,
top,2019-03-12 14:36:58.017000,,1.8915151805412844e+18,,,False,6.333597102633388e+18,,,,...,{hash},,be356e52-ebf2-434c-aba8-ac0634dc0c04,,,,True,Cable/DSL,,
freq,3,,1882743.0,,,2489324,2494423.0,,,,...,33,,1,,,,930902,331948,,
first,2019-03-05 00:00:00.255000,,,,,,,,,,...,,,,,,,,,,
last,2019-03-13 23:59:59.984000,,,,,,,,,,...,,,,,,,,,,
mean,,99.70445,,4.641486e+18,99.10934,,,4.986001e+18,1.633891e+18,4.478847e+18,...,,4.856492e+18,,1.470186e+18,5.364362e+18,7.251101e+18,,,4.620786e+18,5.865447e+18
std,,107.0903,,2.660724e+18,57.80986,,,2.394834e+18,1.626674e+18,2.718014e+18,...,,2.487552e+18,,2.575962e+18,2.242979e+18,5.247066e+17,,,2.672746e+18,2.281192e+18
min,,0.0,,163367500000000.0,0.0,,,1.004084e+16,7.949737e+16,953021600000000.0,...,,5072532000000000.0,,2.248157e+17,7.75827e+16,2.748831e+18,,,5287755000000.0,2.025809e+16
25%,,22.0,,2.326142e+18,63.0,,,4.35375e+18,3.083059e+17,2.331947e+18,...,,2.723465e+18,,2.248157e+17,4.647949e+18,6.941825e+18,,,2.33341e+18,3.301378e+18


In [9]:
events.shape

(2494423, 22)

We count the non-null values in each column.

In [10]:
events.count().sort_values()

trans_id                   82
connection_type        612463
device_city            614698
carrier                616434
device_os              657667
device_os_version     1022066
device_brand          1164963
wifi                  1378872
user_agent            1391527
device_model          2406456
device_language       2406604
session_user_agent    2482637
event_uuid            2489324
kind                  2489324
device_countrycode    2494423
attributed            2494423
application_id        2494423
ref_hash              2494423
ref_type              2494423
event_id              2494423
ip_address            2494423
date                  2494423
dtype: int64
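
A small sketch of the same tally on toy data, using DataFrame.count and also expressing the result as a share of the total rows. The frame toy and its columns are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (illustrative data only).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Non-null values per column, fewest first.
non_null = toy.count().sort_values()

# The same information as a fraction of the total number of rows.
non_null_ratio = toy.notnull().mean().sort_values()

print(non_null)
print(non_null_ratio)
```

Working with the ratio avoids hard-coding the row count when comparing columns.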

In [11]:
events['device_countrycode'].value_counts()

6333597102633388268    2494423
Name: device_countrycode, dtype: int64

We see that "device_countrycode" has a single value.
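
One possible way to spot such constant columns across a whole frame is to count distinct values per column. This is a sketch on toy data; the frame toy is hypothetical and only reuses the column names from above.

```python
import pandas as pd

# Toy frame: one constant column and one that varies (illustrative data only).
toy = pd.DataFrame({
    "device_countrycode": ["6333597102633388268"] * 4,
    "event_id": [0, 1, 0, 2],
})

# nunique(dropna=False) counts distinct values per column, treating NaN as a value.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```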

In [12]:
events['trans_id'].value_counts()

{hash}                                                                                                           33
0                                                                                                                16
103430dcab4b60eb4f                                                                                                9
433f38e2c758468ab632dcab7281d4be_Y2NhPTEwLzI1LzIwMTggMTA6Mjk6MjUgUE0mb2ZmZXJJZD0zMzQ1NjQ0NiZhZmZJZD0yMjMyNzUx     7
210a4c5786d249c78bb30237abcac890_Y2NhPTQvMjEvMjAxOCA1OjI2OjM3IFBNJm9mZmVySWQ9MzM0NTY0NDYmYWZmSWQ9MTY2MTgxNQ==     6
1901171053a509cd7317f2c6                                                                                          2
0941bb7b-866f-4d5a-9b85-63e77b27d562                                                                              2
77ca31a9-b0e0-4884-8de8-c2ee74f1cc32                                                                              2
73f1hsvh52g4soo                                                         

Of the 2,494,423 records, only 82 (about 0.003%) have a value in the "trans_id" column, so it contributes very little to the exploratory analysis. We therefore drop that column.

In [13]:
events.drop(columns='trans_id',inplace=True)
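
The same decision could be expressed as a rule on the share of non-null values. The sketch below uses toy data and a hypothetical threshold min_fill; the notebook itself only drops "trans_id" and keeps other sparsely filled columns, so the threshold is an assumption, not the criterion used here.

```python
import pandas as pd

# Toy frame where trans_id is almost entirely empty (illustrative data only).
toy = pd.DataFrame({
    "event_id": [0, 1, 2, 3, 4],
    "trans_id": [None, None, None, None, "0"],
})

# Hypothetical threshold: keep columns with at least 50% non-null values.
min_fill = 0.5
fill_ratio = toy.notnull().mean()
sparse_cols = fill_ratio[fill_ratio < min_fill].index.tolist()

# drop returns a new frame, so the original stays available for inspection.
cleaned = toy.drop(columns=sparse_cols)
print(sparse_cols, list(cleaned.columns))
```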

We recode the wifi column as a categorical column so that its information is easier to read; missing values get their own label.

In [14]:
events['wifi'].value_counts()

True     930902
False    447970
Name: wifi, dtype: int64

In [15]:
events['Wifi_cat'] = ''

In [16]:
events.loc[events['wifi'] == 0, 'Wifi_cat'] = 'Sin Wifi'
events.loc[events['wifi'] == 1, 'Wifi_cat'] = 'Con Wifi'
events.loc[events['wifi'].isnull(), 'Wifi_cat'] = 'Sin Definir'

In [17]:
events['Wifi_cat'].value_counts()

Sin Definir    1115551
Con Wifi        930902
Sin Wifi        447970
Name: Wifi_cat, dtype: int64

In [18]:
events['Wifi_cat'] = events['Wifi_cat'].astype('category')

In [19]:
events.drop(columns=['wifi'], inplace=True)
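
The recoding above can also be written as a single mapping step. This is a sketch on a toy Series; the names wifi and labels are chosen for the example, while the label strings are the ones used in the notebook.

```python
import pandas as pd

# Toy column with True, False and missing values (illustrative data only).
wifi = pd.Series([True, False, None, True], dtype="object")

# Values without an entry in the dict (here, the missing ones) become NaN,
# which fillna then replaces with the "undefined" label.
labels = {True: "Con Wifi", False: "Sin Wifi"}
wifi_cat = wifi.map(labels).fillna("Sin Definir").astype("category")
print(wifi_cat.value_counts())
```

Compared with three separate .loc assignments, this avoids creating the column with an empty-string placeholder first.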

Using the results of the installs cleaning, we replace the hashes in ref_type with Google or Apple.

In [20]:
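# Assign the result back instead of using inplace=True on a column selection,
# which may act on a temporary copy in recent pandas versions.
events['ref_type'] = events['ref_type'].replace({'1891515180541284343':'Google','1494519392962156891':'Apple'})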

In [25]:
events['ref_type'] = events['ref_type'].astype('category')

In [26]:
events['ref_type'].value_counts()

Google    1882743
Apple      611680
Name: ref_type, dtype: int64

In [27]:
events.head()

Unnamed: 0,date,event_id,ref_type,ref_hash,application_id,attributed,device_countrycode,device_os_version,device_brand,device_model,...,session_user_agent,user_agent,event_uuid,carrier,kind,device_os,connection_type,ip_address,device_language,Wifi_cat
0,2019-03-05 00:09:36.966,0,Google,2688759737656491380,38,False,6333597102633388268,5.908703e+17,,5.990117e+18,...,7.164321e+18,,a9c0b263-acb2-4577-92c5-cbde5d7a5db1,2.248157e+17,5.516623e+18,7.531669e+18,Cable/DSL,7858558567428669000,4.077062e+17,Sin Definir
1,2019-03-05 00:09:38.920,1,Google,2688759737656491380,38,False,6333597102633388268,5.908703e+17,,5.990117e+18,...,7.164321e+18,,1cd98205-0d97-4ec2-a019-667997dbfe7a,2.248157e+17,9.97766e+17,7.531669e+18,Cable/DSL,7858558567428669000,4.077062e+17,Sin Definir
2,2019-03-05 00:09:26.195,0,Google,2688759737656491380,38,False,6333597102633388268,5.908703e+17,,5.990117e+18,...,7.164321e+18,,f02e2924-21ae-492b-b625-9021ae0a4eca,2.248157e+17,5.516623e+18,7.531669e+18,Cable/DSL,7858558567428669000,4.077062e+17,Sin Definir
3,2019-03-05 00:09:31.107,2,Google,2688759737656491380,38,False,6333597102633388268,5.908703e+17,,5.990117e+18,...,7.164321e+18,,a813cf45-a36e-4668-85e2-5395f1564e98,2.248157e+17,8.561153e+18,7.531669e+18,Cable/DSL,6324037615828123965,4.077062e+17,Sin Definir
4,2019-03-09 21:00:36.585,3,Google,2635154697734164782,38,False,6333597102633388268,7.391844e+18,,5.960896e+18,...,7.164321e+18,,63a4f0aa-e147-469f-8c55-4ca4f8d0e310,2.248157e+17,8.731902e+17,7.531669e+18,Cable/DSL,2894495631302821483,3.301378e+18,Sin Definir


In [28]:
events.dtypes

date                  datetime64[ns]
event_id                       int16
ref_type                    category
ref_hash                       int64
application_id                 int16
attributed                      bool
device_countrycode          category
device_os_version            float64
device_brand                 float64
device_model                 float64
device_city                  float64
session_user_agent           float64
user_agent                   float64
event_uuid                    object
carrier                      float64
kind                         float64
device_os                    float64
connection_type             category
ip_address                     int64
device_language              float64
Wifi_cat                    category
dtype: object