# Events Cleanup

In this notebook we clean the data in the events.csv file: we make sure each column has the correct type and uses as little memory as possible, and we analyze which data is relevant and which is not.

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

When reading the CSV, we specify the data type of the columns with the dtype parameter and parse the columns that hold dates with the parse_dates parameter (having inspected the CSV beforehand to identify what each column contains). This reduces both the time needed to read the CSV and the memory the data occupies.

In [2]:
events = pd.read_csv('../Data/events.csv.gzip', compression = 'gzip', 
                     dtype={'index':np.int32, 'event_id':np.int16,'ref_type': 'category', 'application_id':np.int16, 
                            'device_countrycode': 'category', 'connection_type':'category'}, parse_dates=['date'])
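
As a rough illustration of what dtype and parse_dates change, the following sketch reads a tiny in-memory CSV twice and compares the resulting types. The three rows are made up for the example (including the "Cellular" value); only the column names come from events.csv.

```python
import io

import numpy as np
import pandas as pd

# Tiny synthetic sample; the values are invented for illustration only.
csv_text = (
    "index,date,event_id,connection_type\n"
    "0,2019-04-20 01:42:49.120,0,Cable/DSL\n"
    "1,2019-04-20 01:42:49.340,1,Cable/DSL\n"
    "2,2019-04-20 01:42:49.365,1,Cellular\n"
)

# Default read: integers become int64 and the date stays as plain text.
default_df = pd.read_csv(io.StringIO(csv_text))

# Typed read: narrower integers, a categorical column and a parsed datetime column.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"index": np.int32, "event_id": np.int16, "connection_type": "category"},
    parse_dates=["date"],
)

print(default_df.dtypes)
print(typed_df.dtypes)

# Per-column memory in bytes; deep=True also counts the string payloads.
print(default_df.memory_usage(deep=True))
print(typed_df.memory_usage(deep=True))
```

On a sample this small the categorical column can cost more than it saves because of its fixed overhead; the benefit is expected to show up on columns with millions of rows and few distinct values.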

In [3]:
events.head()

Unnamed: 0,index,date,event_id,ref_type,ref_hash,application_id,attributed,device_countrycode,device_os_version,device_brand,...,trans_id,user_agent,event_uuid,carrier,kind,device_os,wifi,connection_type,ip_address,device_language
0,2130678,2019-04-20 01:42:49.120,0,1891515180541284343,5857744372586891366,210,False,6287817205707153877,,,...,,5.046185e+18,5b506964-5f47-4b28-a8c2-8a92d6c23379,,5.882882e+18,,False,,7544543351571901618,3.301378e+18
1,2130680,2019-04-20 01:42:49.340,1,1891515180541284343,7642521036780133571,210,False,6287817205707153877,,,...,,,f1fb9d15-1a7b-4116-8d3b-c4c403e197e2,,4.017674e+18,,False,,6949523255335024165,
2,2130681,2019-04-20 01:42:49.365,1,1891515180541284343,2548841562898283198,210,False,6287817205707153877,,,...,,,c85a0b15-a5d7-472e-8116-6bfa3db19687,,4.017674e+18,,False,,6428537280982666957,
3,2130684,2019-04-20 01:42:51.438,2,1891515180541284343,609402887625919085,210,False,6287817205707153877,,,...,,,f4aa0a97-2de6-4f22-95c6-1b3150112cb9,,6.168309e+18,,False,,7607371352198017145,
4,2130688,2019-04-20 01:42:51.838,1,1891515180541284343,9114651763556439823,210,False,6287817205707153877,,,...,,,08e2f7f7-875f-4aa0-b337-b9b87b0d83ea,,4.017674e+18,,False,,2901772839007473756,


In [4]:
events.describe(include='all')

Unnamed: 0,index,date,event_id,ref_type,ref_hash,application_id,attributed,device_countrycode,device_os_version,device_brand,...,trans_id,user_agent,event_uuid,carrier,kind,device_os,wifi,connection_type,ip_address,device_language
count,7744581.0,7744581,7744581.0,7744581.0,7744581.0,7744581.0,7744581,7744581.0,2332975.0,2553424.0,...,37642.0,3341483.0,7714809,1925901.0,7714809.0,1870190.0,7744581,1809296,7744581.0,5665409.0
unique,,7693028,,2.0,,,2,1.0,,,...,11042.0,,7714809,,,,2,4,,
top,,2019-04-24 14:41:30.558000,,1.8915151805412844e+18,,,False,6.287817205707153e+18,,,...,0.0,,b0ad49ad-00c2-4ed8-9b8a-588016f46742,,,,False,Cable/DSL,,
freq,,4,,6421584.0,,,7714809,7744581.0,,,...,7446.0,,1,,,,5478103,1291512,,
first,,2019-04-18 00:00:00.027000,,,,,,,,,...,,,,,,,,,,
last,,2019-04-26 23:59:59.881000,,,,,,,,,...,,,,,,,,,,
mean,343249.1,,59.31369,,4.581569e+18,141.3932,,,6.315609e+18,3.072329e+18,...,,4.979959e+18,,7.04185e+18,5.386691e+18,7.485012e+18,,,4.586159e+18,5.347474e+18
std,549595.5,,87.38097,,2.658818e+18,76.09497,,,2.526906e+18,2.469124e+18,...,,2.866372e+18,,2.542362e+18,1.860761e+18,1.591975e+17,,,2.657233e+18,1.981312e+18
min,0.0,,0.0,,40621410000000.0,1.0,,,1.004084e+16,4359146000000000.0,...,,504638200000000.0,,2.359613e+16,1.621526e+16,6.941825e+18,,,33114280000000.0,6.645612e+16
25%,40315.0,,2.0,,2.275713e+18,68.0,,,4.821386e+18,3.083059e+17,...,,2.527115e+18,,6.647944e+18,4.017674e+18,7.531669e+18,,,2.278616e+18,3.301378e+18


In [5]:
events.shape

(7744581, 23)

We compute the number of non-null values in each column.

In [6]:
# count() returns the number of non-null values per column, without hard-coding the row count
events.count().sort_values()

trans_id                37642
connection_type       1809296
device_os             1870190
device_city           1894935
carrier               1925901
device_os_version     2332975
device_brand          2553424
user_agent            3341483
device_language       5665409
device_model          5668092
session_user_agent    7702301
event_uuid            7714809
kind                  7714809
attributed            7744581
ip_address            7744581
application_id        7744581
ref_hash              7744581
ref_type              7744581
event_id              7744581
wifi                  7744581
date                  7744581
device_countrycode    7744581
index                 7744581
dtype: int64
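
The same information can be expressed as a fraction of the total number of rows, which makes sparse columns easier to spot. This is a minimal sketch on a made-up DataFrame; only the column names are taken from events.csv.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the column names come from events.csv.
sample = pd.DataFrame({
    "event_id": [0, 1, 1, 2],
    "kind": [5.0, np.nan, 4.0, 4.0],
    "trans_id": [np.nan, "0", np.nan, np.nan],
})

# notnull() yields booleans, so the column mean is the share of non-null values.
non_null_share = sample.notnull().mean().sort_values()
print(non_null_share)
```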

In [7]:
events['device_countrycode'].value_counts()

6287817205707153877    7744581
Name: device_countrycode, dtype: int64

We see that "device_countrycode" has a single value.
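
Columns with a single value can also be found programmatically instead of checking them one by one. The sketch below assumes a small made-up DataFrame; the helper name constant_cols is hypothetical.

```python
import pandas as pd

# Made-up sample: device_countrycode repeats one value, event_id does not.
sample = pd.DataFrame({
    "device_countrycode": ["6287817205707153877"] * 3,
    "event_id": [0, 1, 1],
})

# dropna=False counts NaN as a value, so a column mixing one value and NaN is not flagged.
constant_cols = [col for col in sample.columns if sample[col].nunique(dropna=False) == 1]
print(constant_cols)
```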

In [8]:
events['trans_id'].value_counts()

0                                                                                                                                              7446
3gggAwDzJw5PjvtG5aQp6rsN9kPdqswE_LajCYSD-MLFWk1YAAgxMzQzIDU3OTc3MjNiNTM1YTc4ZGJSODk4NDg2MDJfSkIyMDE5MDQxODAwNDM5NVZOMTlXVVNLTktLTE9HT1YAAAA      36
{hash}                                                                                                                                           34
3gggAwBKYrbPH7pE44CNe2webp2viNkS_LajCbb9j6DCWk1YAC4zOTQyMDAyLTgzM18yMzMxMF85NzY2MzA1Y2IwYTQ1OTQ4YmFmMzAwMDE2M2FmMGEAAAAA                         33
58ba31291a73b10bbcca8542                                                                                                                         32
58effea4                                                                                                                                         29
v.2_g.88415_a.6611679423314634635_c.19_t.ua_u.ae002bd04471b299                                                  

Of the 7744581 records we have, only 37642 (about 0.5%) have a value in the "trans_id" column, so that information contributes little to the exploratory analysis. We therefore decide to drop the column.

In [9]:
events.drop(columns='trans_id',inplace=True)

We redefine the wifi column as a categorical one so that the information it holds is easier to understand.

In [10]:
events['wifi'].value_counts()

False    5478103
True     2266478
Name: wifi, dtype: int64

In [11]:
events['Wifi_cat'] = ''

In [12]:
events.loc[events['wifi'] == 0, 'Wifi_cat'] = 'Sin Wifi'
events.loc[events['wifi'] == 1, 'Wifi_cat'] = 'Con Wifi'
events.loc[events['wifi'].isnull(), 'Wifi_cat'] = 'Sin Definir'

In [13]:
events['Wifi_cat'].value_counts()

Sin Wifi    5478103
Con Wifi    2266478
Name: Wifi_cat, dtype: int64

In [14]:
events['Wifi_cat'] = events['Wifi_cat'].astype('category')

In [15]:
events.drop(columns=['wifi'], inplace=True)
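
The labelling steps above can also be written as a single expression. This is only a sketch of an alternative on a made-up Series, not a replacement for the cells above; it keeps the same three labels.

```python
import pandas as pd

# Made-up sample with one missing value to exercise the 'Sin Definir' branch.
wifi = pd.Series([False, True, None, False], dtype="object")

# map() translates the booleans; anything not in the dict (here, None) becomes NaN,
# which fillna() then turns into the 'Sin Definir' label.
wifi_cat = (
    wifi.map({False: "Sin Wifi", True: "Con Wifi"})
    .fillna("Sin Definir")
    .astype("category")
)
print(wifi_cat.value_counts())
```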

Using the results of the installs cleanup, we replace the ref_type hashes with Google or Apple.

In [16]:
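# Assign the result back: calling replace(..., inplace=True) on a selected column may not update the DataFrame
events['ref_type'] = events['ref_type'].replace({'1891515180541284343':'Google','1494519392962156891':'Apple'})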

In [17]:
events['ref_type'] = events['ref_type'].astype('category')

In [18]:
events['ref_type'].value_counts()

Google    6421584
Apple     1322997
Name: ref_type, dtype: int64
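
Since ref_type is already read as a category, another option is to rename the categories directly, which keeps the categorical dtype without a second conversion. This is a sketch on a made-up Series that reuses the two hashes from the cell above.

```python
import pandas as pd

# Made-up sample that reuses the two ref_type hashes shown above.
ref_type = pd.Series(
    ["1891515180541284343", "1494519392962156891", "1891515180541284343"],
    dtype="category",
)

# rename_categories only relabels the category values; the underlying codes stay the same.
ref_type_named = ref_type.cat.rename_categories(
    {"1891515180541284343": "Google", "1494519392962156891": "Apple"}
)
print(ref_type_named.value_counts())
```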

In [19]:
events.head()

Unnamed: 0,index,date,event_id,ref_type,ref_hash,application_id,attributed,device_countrycode,device_os_version,device_brand,...,session_user_agent,user_agent,event_uuid,carrier,kind,device_os,connection_type,ip_address,device_language,Wifi_cat
0,2130678,2019-04-20 01:42:49.120,0,Google,5857744372586891366,210,False,6287817205707153877,,,...,3.819516e+18,5.046185e+18,5b506964-5f47-4b28-a8c2-8a92d6c23379,,5.882882e+18,,,7544543351571901618,3.301378e+18,Sin Wifi
1,2130680,2019-04-20 01:42:49.340,1,Google,7642521036780133571,210,False,6287817205707153877,,,...,3.819516e+18,,f1fb9d15-1a7b-4116-8d3b-c4c403e197e2,,4.017674e+18,,,6949523255335024165,,Sin Wifi
2,2130681,2019-04-20 01:42:49.365,1,Google,2548841562898283198,210,False,6287817205707153877,,,...,3.819516e+18,,c85a0b15-a5d7-472e-8116-6bfa3db19687,,4.017674e+18,,,6428537280982666957,,Sin Wifi
3,2130684,2019-04-20 01:42:51.438,2,Google,609402887625919085,210,False,6287817205707153877,,,...,3.819516e+18,,f4aa0a97-2de6-4f22-95c6-1b3150112cb9,,6.168309e+18,,,7607371352198017145,,Sin Wifi
4,2130688,2019-04-20 01:42:51.838,1,Google,9114651763556439823,210,False,6287817205707153877,,,...,3.819516e+18,,08e2f7f7-875f-4aa0-b337-b9b87b0d83ea,,4.017674e+18,,,2901772839007473756,,Sin Wifi


In [20]:
events.dtypes

index                          int32
date                  datetime64[ns]
event_id                       int16
ref_type                    category
ref_hash                       int64
application_id                 int16
attributed                      bool
device_countrycode          category
device_os_version            float64
device_brand                 float64
device_model                 float64
device_city                  float64
session_user_agent           float64
user_agent                   float64
event_uuid                    object
carrier                      float64
kind                         float64
device_os                    float64
connection_type             category
ip_address                     int64
device_language              float64
Wifi_cat                    category
dtype: object
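
Several hash-like columns (for example user_agent or kind) end up as float64, which is what pandas falls back to when an integer column contains missing values. A float64 cannot represent every 19-digit integer exactly, so the stored hash may differ from the one in the file. The sketch below shows the effect on a made-up value and one possible workaround, reading the column as text; whether exact hashes matter for this analysis is an assumption.

```python
import io

import pandas as pd

# Made-up two-column CSV; the second row has no user_agent value.
csv_text = "id,user_agent\n0,4611686018427387905\n1,\n2,4611686018427387907\n"

# Default read: the missing value forces the column to float64.
as_float = pd.read_csv(io.StringIO(csv_text))["user_agent"]

# Reading the column as text keeps every digit and leaves the gap as a missing value.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"user_agent": str})["user_agent"]

print(as_float.dtype, int(as_float.iloc[0]))
print(as_text.iloc[0])
```

If the column has few distinct values, passing 'category' instead of str in dtype would likely use less memory while still keeping the exact text.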