# Clicks Cleaning


This notebook cleans the data in clicks.csv: we make sure each column has the correct type and takes up as little memory as possible, and we assess which data is relevant and which is not.

In [16]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')


When reading the CSV, we specify the column types with the dtype parameter and parse the date columns with the parse_dates parameter (having previously inspected the CSV to identify the data in each column). This reduces both the read time and the memory the data occupies.

In [17]:
clicks = pd.read_csv(
    '../Data/clicks.csv.gzip', compression='gzip', parse_dates=['created'],
    dtype={'ref_type': 'category', 'specs_brand': 'category', 'country_code': 'category',
           'touchX': np.float32, 'touchY': np.float32, 'timeToClick': np.float64,
           'os_minor': np.float32, 'os_major': np.float32, 'agent_device': np.float16,
           'advertiser_id': np.int8, 'source_id': np.int8})
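
To see the effect of the dtype and parse_dates parameters in isolation, the following sketch reads the same small CSV twice and compares memory usage. The data is synthetic and only borrows a few column names from clicks.csv; the values and the row count are assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for clicks.csv (hypothetical values, not the real data).
n = 1000
csv_text = pd.DataFrame({
    'source_id': np.arange(n) % 5,
    'ref_type': np.where(np.arange(n) % 2 == 0,
                         '1891515180541284343', '1494519392962156891'),
    'touchX': np.linspace(0, 1, n),
    'created': pd.date_range('2019-03-05', periods=n, freq='min')
                 .strftime('%Y-%m-%d %H:%M:%S'),
}).to_csv(index=False)

# Default read: pandas infers int64 / float64 and keeps dates as text.
default = pd.read_csv(io.StringIO(csv_text))

# Typed read: narrower numeric types, a categorical, and parsed dates.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'source_id': np.int8, 'ref_type': 'category', 'touchX': np.float32},
    parse_dates=['created'],
)

default_bytes = default.memory_usage(deep=True).sum()
typed_bytes = typed.memory_usage(deep=True).sum()
print(default_bytes, typed_bytes)
```

The typed frame should come out noticeably smaller, mostly because the date strings and the repeated hashes are no longer stored as text or 64-bit values.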

In [18]:
clicks.head()

Unnamed: 0,advertiser_id,action_id,source_id,created,country_code,latitude,longitude,wifi_connection,carrier_id,trans_id,os_minor,agent_device,os_major,specs_brand,brand,timeToClick,touchX,touchY,ref_type,ref_hash
0,2,,4,2019-03-06 22:42:12.755,6333597102633388268,1.205689,1.070234,False,1.0,iGgClCM9exiHF4K31g94XmvHEBSLKIY,6.768137e+18,,3.072849e+18,2733035977666442898,,1.563,0.905,0.078,1891515180541284343,1904083516767779093
1,0,,0,2019-03-08 10:24:30.641,6333597102633388268,1.218924,1.071209,False,4.0,MMHTOJ6qKAOeIH_Eywh1KIcCaxtO9oM,3.025219e+18,,1.774085e+18,392184377613098015,,,,,1891515180541284343,3086509764961796666
2,0,,0,2019-03-08 15:24:16.069,6333597102633388268,1.205689,1.070234,False,6.0,vIrEIdf9izUaWdAri6Ezk7T3nHFvNQU,5.975656e+18,,3.072849e+18,392184377613098015,,,0.946,0.473,1891515180541284343,6958163894863846647
3,2,,3,2019-03-06 03:08:51.543,6333597102633388268,1.205689,1.070234,False,45.0,YaKxxEAs2UmZhSpRfiCO9Zpa82B_AKM,6.768137e+18,,3.072849e+18,2733035977666442898,,19.013,0.035,0.431,1891515180541284343,4368617728156436525
4,2,,3,2019-03-06 03:32:55.570,6333597102633388268,1.205689,1.070234,False,45.0,X5XTOcYQovkl6yadYdAD7xioVGU9jiY,6.768137e+18,,3.072849e+18,2733035977666442898,,28.11,0.054,0.423,1891515180541284343,4368617728156436525


In [19]:
clicks.describe(include='all')

Unnamed: 0,advertiser_id,action_id,source_id,created,country_code,latitude,longitude,wifi_connection,carrier_id,trans_id,os_minor,agent_device,os_major,specs_brand,brand,timeToClick,touchX,touchY,ref_type,ref_hash
count,26351.0,0.0,26351.0,26351,26351.0,26351.0,26351.0,26351,26340.0,26351,26339.0,3243.0,26339.0,26351.0,6235.0,22977.0,23011.0,23011.0,26351.0,26351.0
unique,,,,26347,1.0,,,1,,26351,,,,5.0,,,,,4.0,
top,,,,2019-03-10 05:02:10.703000,6.333597102633388e+18,,,False,,lZtRCOnk2WKiTu69dL_vbhyvElo0A3Q,,,,3.92184377613098e+17,,,,,1.8915151805412844e+18,
freq,,,,2,26351.0,,,26351,,1,,,,16172.0,,,,,25549.0,
first,,,,2019-03-05 01:17:30.663000,,,,,,,,,,,,,,,,
last,,,,2019-03-13 23:59:59.298000,,,,,,,,,,,,,,,,
mean,2.991993,,1.245266,,,1.206906,1.070233,,7.743812,,4.635112e+18,inf,3.913305e+18,,1.482277,230.403309,0.638791,1.478659,,4.611581e+18
std,0.16407,,2.188948,,,0.004484,0.001896,,7.017027,,1.642969e+18,,1.885866e+18,,1.583764,976.849149,0.30198,2.622726,,2.673175e+18
min,0.0,,0.0,,,1.205058,1.058204,,0.0,,6.666626e+17,inf,7.436481e+17,,0.0,0.017,0.0,0.0,,928619200000000.0
25%,3.0,,0.0,,,1.205689,1.070234,,3.0,,3.37864e+18,inf,1.774085e+18,,0.0,2.915,0.426,0.183,,2.273798e+18
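
The inf values reported for agent_device in the summary above are consistent with float16 overflow: float16 cannot represent anything larger than 65504. The sketch below assumes that agent_device holds identifier-sized values similar in magnitude to os_minor and os_major; the numbers used are taken from the preview rows only as examples of that magnitude. It also shows that float32 keeps the magnitude but not the exact identifier.

```python
import numpy as np

# Hypothetical identifier-sized values (same magnitude as os_minor above).
ids = np.array([6.768137e18, 3.072849e18])

f16_max = float(np.finfo(np.float16).max)  # largest finite float16 value

with np.errstate(over='ignore'):  # silence the overflow warning from the cast
    as_f16 = ids.astype(np.float16)  # every value overflows to inf
as_f32 = ids.astype(np.float32)      # finite, but with reduced precision

print(f16_max)
print(as_f16)
print(as_f32)

# float32 has a 24-bit significand, so a 19-digit id does not survive a round trip.
exact_id = 2733035977666442898
round_trip = int(np.float32(exact_id))
print(round_trip == exact_id)
```

If these columns are identifiers rather than measurements, reading them as 'category' (as is done for specs_brand) would avoid both problems.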


In [20]:
clicks.dtypes

advertiser_id                int8
action_id                 float64
source_id                    int8
created            datetime64[ns]
country_code             category
latitude                  float64
longitude                 float64
wifi_connection              bool
carrier_id                float64
trans_id                   object
os_minor                  float32
agent_device              float16
os_major                  float32
specs_brand              category
brand                     float64
timeToClick               float64
touchX                    float32
touchY                    float32
ref_type                 category
ref_hash                    int64
dtype: object

In [21]:
clicks['brand'].isnull().sum()

20116

We treat the value -1 as null. A NumPy integer column cannot hold NaN, so filling the nulls with this sentinel is what allows us to change the column's data type to int8.

In [22]:
clicks['brand'] = clicks['brand'].fillna(-1)

In [23]:
clicks['brand'] = clicks['brand'].astype('int8')
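
The sentinel approach used for brand can be sketched on a small, made-up series. The range check against np.iinfo is an addition for safety, not something the notebook does. The last line shows pandas' nullable Int8 dtype as an alternative that keeps missing values explicit; it is a different technique from the one used here.

```python
import numpy as np
import pandas as pd

# Hypothetical brand values with missing entries.
brand = pd.Series([0.0, np.nan, 3.0, np.nan, 1.0])

# Sentinel approach: -1 stands for "no brand".
filled = brand.fillna(-1)

# Confirm every value fits in int8 before casting, to avoid silent wrap-around.
info = np.iinfo(np.int8)
fits = bool(filled.between(info.min, info.max).all())
as_int8 = filled.astype('int8')

# Alternative (not used in this notebook): nullable integer dtype keeps <NA>.
as_nullable = brand.astype('Int8')

print(fits)
print(as_int8.tolist())
print(as_nullable.isna().sum())
```

With the sentinel, any later aggregation over brand has to exclude -1 explicitly; the nullable dtype skips missing values on its own.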

In [24]:
clicks['carrier_id'].isnull().sum()

11

We treat the id -1 as meaning that the record has no carrier id, which allows us to change the column's data type to int16.

In [25]:
clicks['carrier_id'] = clicks['carrier_id'].fillna(-1)

In [26]:
clicks['carrier_id'] = clicks['carrier_id'].astype('int16')
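
Instead of choosing the integer width by hand, pd.to_numeric with downcast='integer' picks the smallest signed integer type that holds the data. This is a sketch on made-up carrier ids, not the notebook's method; the notebook fixes the type to int16.

```python
import numpy as np
import pandas as pd

# Hypothetical carrier ids with one missing value.
carrier_id = pd.Series([1.0, 4.0, 6.0, 45.0, np.nan])

# Fill the sentinel first, then let pandas choose the narrowest integer type.
small = pd.to_numeric(carrier_id.fillna(-1), downcast='integer')

# A value above 127 no longer fits in int8, so a wider type is selected.
wide = pd.to_numeric(pd.Series([1.0, 300.0, np.nan]).fillna(-1), downcast='integer')

print(small.dtype, wide.dtype)
```

Fixing the type explicitly, as the notebook does, has the advantage that the schema does not change if a later data extract contains larger ids.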

In [27]:
clicks['action_id'].count()

0

We drop the "action_id" column because all of its values are null, so it carries no information.

In [28]:
clicks.drop(columns='action_id',inplace=True)
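
The same check can be run over every column at once rather than one column at a time. The sketch below uses a small made-up frame that borrows three column names from clicks; the values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which one column is entirely null.
df = pd.DataFrame({
    'advertiser_id': [2, 0, 0],
    'action_id': [np.nan, np.nan, np.nan],
    'touchX': [0.905, np.nan, 0.946],
})

# Columns where every value is missing.
empty_cols = df.columns[df.isna().all()].tolist()

# Drop them in one step; partially null columns such as touchX are kept.
cleaned = df.drop(columns=empty_cols)

print(empty_cols)
print(list(cleaned.columns))
```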

Using the results from the installs cleaning, we replace the hashes in ref_type with Google or Apple; the remaining hashes are grouped under Otros (others).

In [40]:
clicks['ref_type'] = clicks['ref_type'].replace({'1891515180541284343': 'Google', '1494519392962156891': 'Apple',
                                                 '5016171802147987303': 'Otros', '6323871695571587575': 'Otros'})

In [43]:
clicks['ref_type'] = clicks['ref_type'].astype('category')

In [44]:
clicks['ref_type'].value_counts()

Google    25549
Apple       739
Otros        63
Name: ref_type, dtype: int64
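
A variation on the relabeling above, sketched on a made-up series: only the known hashes are listed, and anything not in the dictionary falls back to Otros. This differs from the notebook, which names every hash explicitly; the fallback is an assumption that all unlisted hashes belong in the same group.

```python
import pandas as pd

# Hash-to-label mapping taken from the replacement above.
labels = {'1891515180541284343': 'Google', '1494519392962156891': 'Apple'}

# Hypothetical ref_type values read as strings.
ref_type = pd.Series(['1891515180541284343', '1494519392962156891',
                      '5016171802147987303', '1891515180541284343'])

# map() yields NaN for hashes missing from the dict; fillna groups them as Otros.
named = ref_type.map(labels).fillna('Otros').astype('category')

counts = named.value_counts()
print(counts)
```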

In [45]:
clicks.head()

Unnamed: 0,advertiser_id,source_id,created,country_code,latitude,longitude,wifi_connection,carrier_id,trans_id,os_minor,agent_device,os_major,specs_brand,brand,timeToClick,touchX,touchY,ref_type,ref_hash
0,2,4,2019-03-06 22:42:12.755,6333597102633388268,1.205689,1.070234,False,1,iGgClCM9exiHF4K31g94XmvHEBSLKIY,6.768137e+18,,3.072849e+18,2733035977666442898,-1,1.563,0.905,0.078,Google,1904083516767779093
1,0,0,2019-03-08 10:24:30.641,6333597102633388268,1.218924,1.071209,False,4,MMHTOJ6qKAOeIH_Eywh1KIcCaxtO9oM,3.025219e+18,,1.774085e+18,392184377613098015,-1,,,,Google,3086509764961796666
2,0,0,2019-03-08 15:24:16.069,6333597102633388268,1.205689,1.070234,False,6,vIrEIdf9izUaWdAri6Ezk7T3nHFvNQU,5.975656e+18,,3.072849e+18,392184377613098015,-1,,0.946,0.473,Google,6958163894863846647
3,2,3,2019-03-06 03:08:51.543,6333597102633388268,1.205689,1.070234,False,45,YaKxxEAs2UmZhSpRfiCO9Zpa82B_AKM,6.768137e+18,,3.072849e+18,2733035977666442898,-1,19.013,0.035,0.431,Google,4368617728156436525
4,2,3,2019-03-06 03:32:55.570,6333597102633388268,1.205689,1.070234,False,45,X5XTOcYQovkl6yadYdAD7xioVGU9jiY,6.768137e+18,,3.072849e+18,2733035977666442898,-1,28.11,0.054,0.423,Google,4368617728156436525


In [46]:
clicks.dtypes

advertiser_id                int8
source_id                    int8
created            datetime64[ns]
country_code             category
latitude                  float64
longitude                 float64
wifi_connection              bool
carrier_id                  int16
trans_id                   object
os_minor                  float32
agent_device              float16
os_major                  float32
specs_brand              category
brand                        int8
timeToClick               float64
touchX                    float32
touchY                    float32
ref_type                 category
ref_hash                    int64
dtype: object