# Clicks Cleaning


This notebook cleans the data in the clicks.csv file: we make sure that every column has the correct type and takes up as little memory as possible, and we assess which data are relevant and which are not.

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')


When reading the CSV, we specify the data type of each column with the dtype parameter and parse the columns that hold dates with the parse_dates parameter (having inspected the CSV beforehand to identify the data in each column). This reduces both the time needed to read the CSV and the memory the data occupy.

In [2]:
clicks = pd.read_csv('../Data/clicks.csv.gzip', compression = 'gzip', 
                     dtype={'ref_type': 'category', 'timeToClick': np.float64, 
                            'specs_brand': 'category', 'country_code': 'category', 'os_minor': np.float32, 'os_major': np.float32, 
                            'agent_device': np.float16, 'advertiser_id': np.int8, 'source_id': np.int8}, parse_dates=['created'])
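
The effect of passing dtype and parse_dates can be checked on a small scale. The following is a minimal sketch on a synthetic CSV (the rows are made up for illustration and only mimic a few clicks.csv columns), comparing the memory footprint of a default read against a typed read:

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for clicks.csv; the values are made up for illustration.
rows = ["advertiser_id,source_id,country_code,created"]
rows += [
    f"{i % 3},{i % 5},6287817205707153877,2019-04-18 05:27:{i % 60:02d}.197"
    for i in range(1000)
]
csv_text = "\n".join(rows)

# Default read: pandas infers 64-bit integers and keeps the dates as text.
default_df = pd.read_csv(io.StringIO(csv_text))

# Typed read: small integers, a categorical column and parsed dates.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"advertiser_id": np.int8, "source_id": np.int8, "country_code": "category"},
    parse_dates=["created"],
)

# deep=True also counts the memory held by Python strings.
default_bytes = default_df.memory_usage(deep=True).sum()
typed_bytes = typed_df.memory_usage(deep=True).sum()
print(f"default: {default_bytes} bytes, typed: {typed_bytes} bytes")
```

The same comparison could be run on the real file by calling memory_usage(deep=True) on clicks before and after the type changes.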

In [3]:
clicks['touchX'] = clicks['touchX'].astype('float32')
clicks['touchY'] = clicks['touchY'].astype('float32')

In [4]:
clicks.head()

Unnamed: 0,advertiser_id,action_id,source_id,created,country_code,latitude,longitude,wifi_connection,carrier_id,trans_id,os_minor,agent_device,os_major,specs_brand,brand,timeToClick,touchX,touchY,ref_type,ref_hash
0,1,,2,2019-04-18 05:27:42.197,6287817205707153877,1.714547,0.871535,False,3.0,9JMAfrb-b9cSEVCJb0P9JfihGthaS7E,1.517644e+18,,5.131615e+18,71913840936116953,0.0,2.317,0.968,0.503,1891515180541284343,1293710398598742392
1,1,,1,2019-04-18 05:27:03.164,6287817205707153877,1.714512,0.871062,True,2.0,r3xtTRv2lInfiXG8JI3NQsNcBo8GyFQ,1.288578e+18,,3.90839e+18,3576558787748411622,1.0,7.653,0.712,1.689,1891515180541284343,1663930990551616564
2,1,,1,2019-04-18 05:42:07.926,6287817205707153877,1.714547,0.871535,True,4.0,WOnHFqQtY48z_ygKZ-030U_g0TMGVMw,2.238736e+18,,3.581233e+18,3576558787748411622,,464.796,0.227,0.251,1891515180541284343,8488038938665586188
3,1,,1,2019-04-18 05:26:04.446,6287817205707153877,1.708041,0.870772,True,1.0,wQMLLmYqiFhSuha9p9B13PMtcyBW_vM,2.41164e+18,,3.90839e+18,3576558787748411622,,225.311,0.696,6.587,1891515180541284343,6488361690105189959
4,1,,1,2019-04-18 05:23:37.764,6287817205707153877,1.715514,0.870772,True,2.0,GeFoyBzMA7taylMxxjzlNPTU-n4FXFs,1.517644e+18,,5.131615e+18,3576558787748411622,0.0,84.736,0.059,0.142,1891515180541284343,1348993302102753419


In [5]:
clicks.describe(include='all')

Unnamed: 0,advertiser_id,action_id,source_id,created,country_code,latitude,longitude,wifi_connection,carrier_id,trans_id,os_minor,agent_device,os_major,specs_brand,brand,timeToClick,touchX,touchY,ref_type,ref_hash
count,64296.0,7.0,64296.0,64296,64296.0,64296.0,64296.0,64296,63097.0,64296,64261.0,8920.0,64261.0,64296.0,15035.0,38178.0,43678.0,43678.0,64296.0,64296.0
unique,,,,64275,1.0,,,2,,64279,,,,5.0,,,,,2.0,
top,,,,2019-04-23 15:15:05.754000,6.287817205707153e+18,,,True,,jSeHencK5ZYAfQ2azB9GasPmUdM4v6U,,,,7.191384093611695e+16,,,,,1.8915151805412844e+18,
freq,,,,2,64296.0,,,40254,,2,,,,35613.0,,,,,60492.0,
first,,,,2019-04-12 00:00:01.981000,,,,,,,,,,,,,,,,
last,,,,2019-04-26 23:59:22.065000,,,,,,,,,,,,,,,,
mean,1.560595,122478.285714,1.251166,,,1.74019,0.867035,,10.431558,,3.576229e+18,inf,5.03092e+18,,1.400798,206.954231,inf,inf,,4.561908e+18
std,0.518691,8510.626039,1.407715,,,0.045593,0.017515,,14.161776,,2.320198e+18,,1.151353e+18,,1.798718,921.497533,,,,2.662462e+18
min,0.0,103178.0,0.0,,,1.660487,0.810223,,0.0,,5.106671e+17,inf,6.90065e+17,,0.0,0.013,0.0,0.0,,693609700000000.0
25%,1.0,125695.0,0.0,,,1.714512,0.863469,,1.0,,1.517644e+18,inf,3.90839e+18,,0.0,2.03025,0.361,0.105,,2.236736e+18
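
The inf entries reported for agent_device are consistent with an overflow caused by the float16 type chosen when reading the file: float16 cannot represent magnitudes above 65504, so any larger identifier becomes inf. In a similar way, float32 keeps only about seven significant digits, so large hash-like identifiers such as those in os_minor and os_major may collapse into the same value. The sketch below illustrates both effects with made-up identifiers (they are not taken from the dataset) and shows one possible alternative, storing such identifiers as category:

```python
import numpy as np
import pandas as pd

# Made-up hash-like identifiers; the first two differ only in the last digits.
ids = np.array(
    [1517644000000000123, 1517644000000000456, 2238736000000000789],
    dtype=np.int64,
)

# float16 overflows: every value above 65504 is stored as inf.
with np.errstate(over="ignore"):
    as_f16 = ids.astype(np.float16)

# float32 does not overflow here, but nearby identifiers become indistinguishable.
as_f32 = ids.astype(np.float32)

# Assumed alternative: keep identifiers as categories so they stay distinct.
as_cat = pd.Series(ids).astype("category")

print(np.isinf(as_f16).all())
print(len(np.unique(as_f32)), "distinct float32 values out of", len(np.unique(ids)))
print(as_cat.nunique(), "distinct categories")
```

Whether the precision loss matters depends on how the columns are used later; if they serve as identifiers, category (or a 64-bit integer type) avoids it.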


In [6]:
clicks.dtypes

advertiser_id                int8
action_id                 float64
source_id                    int8
created            datetime64[ns]
country_code             category
latitude                  float64
longitude                 float64
wifi_connection              bool
carrier_id                float64
trans_id                   object
os_minor                  float32
agent_device              float16
os_major                  float32
specs_brand              category
brand                     float64
timeToClick               float64
touchX                    float32
touchY                    float32
ref_type                 category
ref_hash                    int64
dtype: object

In [7]:
clicks['brand'].isnull().sum()

49261

We use -1 as a sentinel for null values so that the column's data type can be changed to int8.

In [8]:
# Assign the result back instead of filling in place on the selected column
clicks['brand'] = clicks['brand'].fillna(-1)

In [9]:
clicks['brand'] = clicks['brand'].astype('int8')
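
Casting to a small integer type silently wraps around if a value does not fit, and a sentinel is only safe if it never occurs as a real value. The following sketch wraps the fill-and-cast step in a helper that checks both conditions first. The name fill_and_downcast is hypothetical and the sample values are made up:

```python
import numpy as np
import pandas as pd


def fill_and_downcast(col, sentinel, dtype):
    """Hypothetical helper: fill nulls with a sentinel, then cast to a small integer type."""
    info = np.iinfo(dtype)
    # The sentinel must not collide with a value that is really present.
    if (col == sentinel).any():
        raise ValueError("sentinel already appears in the column")
    filled = col.fillna(sentinel)
    # Refuse to cast if any value falls outside the range of the target type.
    if filled.min() < info.min or filled.max() > info.max:
        raise ValueError(f"values do not fit in {np.dtype(dtype).name}")
    # Refuse to cast if any value has a fractional part.
    if not (filled == filled.round()).all():
        raise ValueError("column holds non-integer values")
    return filled.astype(dtype)


# Made-up sample resembling the brand column.
brand = pd.Series([0.0, 1.0, np.nan, np.nan, 0.0])
brand_int = fill_and_downcast(brand, sentinel=-1, dtype=np.int8)
print(brand_int.tolist())
```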

In [10]:
clicks['carrier_id'].isnull().sum()

1199

We take an id of -1 to mean that the click has no carrier id, so that the column's data type can be changed to int16.

In [11]:
# Assign the result back instead of filling in place on the selected column
clicks['carrier_id'] = clicks['carrier_id'].fillna(-1)

In [12]:
clicks['carrier_id'] = clicks['carrier_id'].astype('int16')
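
As an alternative to a sentinel, pandas offers nullable integer types (written with a capital letter, such as "Int16") that keep missing values as missing. This is not what the notebook does; the sketch below only illustrates the option on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up sample resembling carrier_id before cleaning.
carrier = pd.Series([3.0, 2.0, np.nan, 4.0])

# "Int16" (capital I) is the nullable extension type; nulls stay null.
carrier_nullable = carrier.astype("Int16")

print(carrier_nullable.dtype)
print(carrier_nullable.isna().sum())
```

The trade-off is that a nullable column stores a boolean mask next to the values, so it uses somewhat more memory than a plain int16 column with a sentinel.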

In [13]:
clicks['action_id'].count()

7

We drop the "action_id" column because almost all of its values are null (only 7 of the 64296 rows have a value), so the information it provides is minimal.

In [14]:
clicks.drop(columns='action_id',inplace=True)
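
The same decision can be made systematically by measuring the fraction of nulls per column. The sketch below uses a made-up frame and a hypothetical threshold of 0.9; both are assumptions for illustration, not values taken from the notebook:

```python
import numpy as np
import pandas as pd

# Made-up frame: action_id is almost entirely null, brand is half null.
df = pd.DataFrame({
    "action_id": [103178.0] + [np.nan] * 19,
    "brand": [0.0, np.nan] * 10,
    "source_id": list(range(20)),
})

# Fraction of null values in each column.
null_fraction = df.isnull().mean()

# Hypothetical cut-off: drop columns that are more than 90% null.
threshold = 0.9
to_drop = null_fraction[null_fraction > threshold].index.tolist()

cleaned = df.drop(columns=to_drop)
print(to_drop)
print(cleaned.columns.tolist())
```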

Using the results obtained from the installs cleaning, we replace the hashes in ref_type with Google or Apple.

In [15]:
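# ref_type is already categorical, so relabel its categories and assign the result back
clicks['ref_type'] = clicks['ref_type'].cat.rename_categories({'1891515180541284343':'Google','1494519392962156891':'Apple'})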

In [16]:
clicks['ref_type'] = clicks['ref_type'].astype('category')

In [17]:
clicks['ref_type'].value_counts()

Google    60492
Apple      3804
Name: ref_type, dtype: int64

In [18]:
clicks.head()

Unnamed: 0,advertiser_id,source_id,created,country_code,latitude,longitude,wifi_connection,carrier_id,trans_id,os_minor,agent_device,os_major,specs_brand,brand,timeToClick,touchX,touchY,ref_type,ref_hash
0,1,2,2019-04-18 05:27:42.197,6287817205707153877,1.714547,0.871535,False,3,9JMAfrb-b9cSEVCJb0P9JfihGthaS7E,1.517644e+18,,5.131615e+18,71913840936116953,0,2.317,0.968,0.503,Google,1293710398598742392
1,1,1,2019-04-18 05:27:03.164,6287817205707153877,1.714512,0.871062,True,2,r3xtTRv2lInfiXG8JI3NQsNcBo8GyFQ,1.288578e+18,,3.90839e+18,3576558787748411622,1,7.653,0.712,1.689,Google,1663930990551616564
2,1,1,2019-04-18 05:42:07.926,6287817205707153877,1.714547,0.871535,True,4,WOnHFqQtY48z_ygKZ-030U_g0TMGVMw,2.238736e+18,,3.581233e+18,3576558787748411622,-1,464.796,0.227,0.251,Google,8488038938665586188
3,1,1,2019-04-18 05:26:04.446,6287817205707153877,1.708041,0.870772,True,1,wQMLLmYqiFhSuha9p9B13PMtcyBW_vM,2.41164e+18,,3.90839e+18,3576558787748411622,-1,225.311,0.696,6.587,Google,6488361690105189959
4,1,1,2019-04-18 05:23:37.764,6287817205707153877,1.715514,0.870772,True,2,GeFoyBzMA7taylMxxjzlNPTU-n4FXFs,1.517644e+18,,5.131615e+18,3576558787748411622,0,84.736,0.059,0.142,Google,1348993302102753419


In [19]:
clicks.dtypes

advertiser_id                int8
source_id                    int8
created            datetime64[ns]
country_code             category
latitude                  float64
longitude                 float64
wifi_connection              bool
carrier_id                  int16
trans_id                   object
os_minor                  float32
agent_device              float16
os_major                  float32
specs_brand              category
brand                        int8
timeToClick               float64
touchX                    float32
touchY                    float32
ref_type                 category
ref_hash                    int64
dtype: object