# Installs Cleanup

This notebook cleans the data in the installs.csv file: we make sure each column has the correct type and uses as little memory as possible, and we assess which data is relevant and which is not.

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

When reading the CSV, we specify the column types with the dtype parameter and parse the columns that hold dates with the parse_dates parameter (having inspected the CSV beforehand to identify what each column contains). This reduces both the time needed to read the CSV and the memory the data occupies.

In [2]:
# Note: int8 only covers -128..127; confirm that every application_id fits in that range.
installs = pd.read_csv('../Data/installs.csv.gzip', compression='gzip',
                       dtype={'ref_type': 'category', 'application_id': np.int8,
                              'device_countrycode': 'category'},
                       parse_dates=['created'])
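Before forcing a narrow integer type, it can help to check that the values actually fit. The describe() output further down reports a minimum of -128 for application_id, which is the lower bound of int8 and may indicate that some IDs do not fit. The sketch below uses a small made-up CSV (the values are hypothetical, not taken from installs.csv) to show one way to check the range and let pandas choose the smallest safe integer type.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample standing in for installs.csv
csv_text = "application_id,ref_type\n1,a\n121,b\n300,a\n"
sample = pd.read_csv(io.StringIO(csv_text), dtype={'ref_type': 'category'})

col = sample['application_id']
info = np.iinfo(np.int8)

# True only if every value lies inside the int8 range
fits_int8 = bool(col.min() >= info.min and col.max() <= info.max)

# pd.to_numeric picks the smallest signed integer type that holds all values
downcast = pd.to_numeric(col, downcast='integer')

print(fits_int8, downcast.dtype)
```

In this sample the value 300 does not fit in int8, so the downcast column ends up wider than one byte per value.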

In [3]:
installs.head()

Unnamed: 0,created,application_id,ref_type,ref_hash,click_hash,attributed,implicit,device_countrycode,device_brand,device_model,session_user_agent,user_agent,event_uuid,kind,wifi,trans_id,ip_address,device_language
0,2019-04-24 06:23:29.495,1,1494519392962156891,4716708407362582887,,False,True,6287817205707153877,,3.739127e+17,adjust.com,,79837499-2f2a-4605-a663-e322f759424f,app_open,,,4243443387795468703,3.301378e+18
1,2019-04-24 02:06:01.032,1,1494519392962156891,7143568733100935872,,False,False,6287817205707153877,,7.805539e+18,adjust.com,,,,,,4724288679627032761,3.301378e+18
2,2019-04-20 10:15:36.274,1,1494519392962156891,5230323462636548010,,False,True,6287817205707153877,,8.355496e+18,adjust.com,,dda99e3c-9c4b-487d-891c-79f0a02cb4a8,app_open,,,8291809486355890410,4.06093e+18
3,2019-04-20 21:56:47.151,1,1494519392962156891,5097163995161606833,,False,True,6287817205707153877,,2.355772e+18,adjust.com,,7010c3ce-0fcf-46c6-9be8-374cc0e20af4,app_open,,,4006811922873399949,3.301378e+18
4,2019-04-20 22:40:41.239,1,1494519392962156891,6328027616411983332,,False,False,6287817205707153877,,6.156971e+18,adjust.com,,,,,,3386455054590810771,3.301378e+18


In [4]:
installs.describe(include='all')

Unnamed: 0,created,application_id,ref_type,ref_hash,click_hash,attributed,implicit,device_countrycode,device_brand,device_model,session_user_agent,user_agent,event_uuid,kind,wifi,trans_id,ip_address,device_language
count,481511,481511.0,481511.0,481511.0,1142,481511,481511,481511.0,276443.0,454619.0,466672,330768,103168,103168,294829,8933.0,481511.0,453934.0
unique,480962,,2.0,,1142,2,2,1.0,,,4537,13727,103083,183,2,4053.0,,
top,2019-04-20 16:34:38.892000,,1.8915151805412844e+18,,Dcd6LLZcvDwqV0DL9DcBzxuO49JkJAc,False,False,6.287817205707153e+18,,,http-kit/2.0,Grability/17420 CFNetwork/978.0.7 Darwin/18.5.0,b53e3149-2621-4a35-8f7a-f7820555108d,Open,True,0.0,,
freq,4,,398906.0,,1,480369,378343,481511.0,,,335311,10809,2,37257,235130,3145.0,,
first,2019-04-18 00:00:01.560000,,,,,,,,,,,,,,,,,
last,2019-04-26 23:59:58.788000,,,,,,,,,,,,,,,,,
mean,,29.923584,,4.60827e+18,,,,,3.048272e+18,5.100845e+18,,,,,,,4.600401e+18,5.80185e+18
std,,72.792831,,2.664215e+18,,,,,2.512359e+18,2.504734e+18,,,,,,,2.668897e+18,1.927775e+18
min,,-128.0,,40621410000000.0,,,,,1892301000000000.0,796734600000000.0,,,,,,,33114280000000.0,6.645612e+16
25%,,-30.0,,2.302034e+18,,,,,3.083059e+17,3.057402e+18,,,,,,,2.285332e+18,3.301378e+18


In [5]:
installs.shape

(481511, 18)

We count the non-null values in each column.

In [6]:
# count() returns the number of non-null values per column
installs.count().sort_values()

click_hash              1142
trans_id                8933
kind                  103168
event_uuid            103168
device_brand          276443
wifi                  294829
user_agent            330768
device_language       453934
device_model          454619
session_user_agent    466672
ip_address            481511
implicit              481511
attributed            481511
ref_hash              481511
ref_type              481511
application_id        481511
device_countrycode    481511
created               481511
dtype: int64

As the output shows, very few values in the "click_hash" and "trans_id" columns are non-null (1142 and 8933 out of 481511 rows, respectively). Since they contribute little information to the analysis, we drop both columns from the DataFrame.

In [7]:
installs.drop(columns=['click_hash','trans_id'], inplace=True)
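The same decision can be expressed as a rule: compute the share of non-null values per column and drop the columns below a cutoff. This is a minimal sketch on a toy DataFrame; the 5% threshold and the toy values are assumptions chosen for illustration, not something the notebook specifies.

```python
import pandas as pd

# Toy data: click_hash is almost entirely null, kind is partially filled
n = 40
toy = pd.DataFrame({
    'ref_hash': list(range(n)),
    'click_hash': ['abc'] + [None] * (n - 1),
    'kind': ['app_open'] * 10 + [None] * (n - 10),
})

# Share of non-null values per column (between 0 and 1)
non_null_share = toy.notnull().mean()

# Hypothetical cutoff: drop columns with fewer than 5% non-null values
threshold = 0.05
sparse_cols = non_null_share[non_null_share < threshold].index.tolist()

cleaned = toy.drop(columns=sparse_cols)
print(sparse_cols)
```

With the counts shown above, a 5% cutoff would select "click_hash" and "trans_id" while keeping "kind" and "event_uuid", which are roughly 21% non-null.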

We recode the wifi column as a categorical column so that the information it holds is easier to interpret. Missing values get their own label, "Sin Definir" (undefined).

In [8]:
installs['wifi'].value_counts()

True     235130
False     59699
Name: wifi, dtype: int64

In [9]:
installs['Wifi_cat'] = ''

In [10]:
installs.loc[installs['wifi'] == 0, 'Wifi_cat'] = 'Sin Wifi'
installs.loc[installs['wifi'] == 1, 'Wifi_cat'] = 'Con Wifi'
installs.loc[installs['wifi'].isnull(), 'Wifi_cat'] = 'Sin Definir'

In [11]:
installs['Wifi_cat'].value_counts()

Con Wifi       235130
Sin Definir    186682
Sin Wifi        59699
Name: Wifi_cat, dtype: int64

In [12]:
installs['Wifi_cat'] = installs['Wifi_cat'].astype('category')

In [13]:
installs.drop(columns=['wifi'], inplace=True)
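The recoding steps above can also be written as a single expression. The sketch below is an alternative on a toy Series, not the notebook's own code; it maps the boolean values to the same labels and fills the missing entries.

```python
import numpy as np
import pandas as pd

# Toy stand-in for installs['wifi']: booleans with missing values
wifi = pd.Series([True, False, np.nan, True], dtype='object')

wifi_cat = (
    wifi.map({True: 'Con Wifi', False: 'Sin Wifi'})  # NaN stays NaN here
        .fillna('Sin Definir')                       # label the missing values
        .astype('category')
)

print(wifi_cat.value_counts())
```

This avoids creating an empty string column first and makes the mapping from values to labels visible in one place.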

We group the values of the "ref_type" column with those of "session_user_agent" to identify which ads come from Google and which come from Apple.

In [14]:
installs.groupby(['ref_type','session_user_agent']).count().reset_index()[['ref_type','session_user_agent']]

Unnamed: 0,ref_type,session_user_agent
0,1494519392962156891,HasOffers Mobile AppTracking v1.0
1,1494519392962156891,Mozilla/5.0 (iPad; CPU OS 12_1_4 like Mac OS X...
2,1494519392962156891,Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_6 like...
3,1494519392962156891,Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like...
4,1494519392962156891,Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like M...
5,1494519392962156891,Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like M...
6,1494519392962156891,Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like...
7,1494519392962156891,Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like...
8,1494519392962156891,Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_4 like...
9,1494519392962156891,adjust.com
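Rather than reading the grouped table by eye, one could measure how often each ref_type appears together with an iPhone or iPad user agent. The sketch below is an assumption-laden illustration: the ref_type values '111' and '222' and the Android user agent string are made up, and only the iPhone/iPad pattern comes from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    'ref_type': ['111', '111', '222', '222'],  # hypothetical hashes
    'session_user_agent': [
        'Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X)',
        'adjust.com',
        'Dalvik/2.1.0 (Linux; U; Android 9)',  # hypothetical Android agent
        'http-kit/2.0',
    ],
})

# Flag rows whose user agent mentions an Apple device
is_ios = toy['session_user_agent'].str.contains('iPhone|iPad', na=False)

# Share of Apple-device user agents within each ref_type
ios_share = is_ios.groupby(toy['ref_type']).mean()
print(ios_share)
```

A ref_type with a nonzero share of iPhone or iPad agents is a plausible candidate for the Apple label, while one with no such agents is not.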


Using these results, we replace the hashes with Google or Apple.

In [14]:
# Assign the result back instead of using inplace=True on a selected column
installs['ref_type'] = installs['ref_type'].replace({'1891515180541284343':'Google','1494519392962156891':'Apple'})

In [15]:
installs['ref_type'] = installs['ref_type'].astype('category')

In [16]:
installs['ref_type'].value_counts()

Google    398906
Apple      82605
Name: ref_type, dtype: int64
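Since ref_type is read as a categorical column, the labels can also be changed by renaming the categories directly, which keeps the categorical dtype without a second conversion. This is a sketch of that alternative on a toy Series, not the approach the notebook takes.

```python
import pandas as pd

# Toy stand-in for installs['ref_type'], using the two hashes from the notebook
ref_type = pd.Series(
    ['1891515180541284343', '1494519392962156891', '1891515180541284343'],
    dtype='category',
)

# rename_categories accepts a dict mapping old labels to new labels
renamed = ref_type.cat.rename_categories({
    '1891515180541284343': 'Google',
    '1494519392962156891': 'Apple',
})

print(renamed.value_counts())
```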

In [17]:
installs.head()

Unnamed: 0,created,application_id,ref_type,ref_hash,attributed,implicit,device_countrycode,device_brand,device_model,session_user_agent,user_agent,event_uuid,kind,ip_address,device_language,Wifi_cat
0,2019-04-24 06:23:29.495,1,Apple,4716708407362582887,False,True,6287817205707153877,,3.739127e+17,adjust.com,,79837499-2f2a-4605-a663-e322f759424f,app_open,4243443387795468703,3.301378e+18,Sin Definir
1,2019-04-24 02:06:01.032,1,Apple,7143568733100935872,False,False,6287817205707153877,,7.805539e+18,adjust.com,,,,4724288679627032761,3.301378e+18,Sin Definir
2,2019-04-20 10:15:36.274,1,Apple,5230323462636548010,False,True,6287817205707153877,,8.355496e+18,adjust.com,,dda99e3c-9c4b-487d-891c-79f0a02cb4a8,app_open,8291809486355890410,4.06093e+18,Sin Definir
3,2019-04-20 21:56:47.151,1,Apple,5097163995161606833,False,True,6287817205707153877,,2.355772e+18,adjust.com,,7010c3ce-0fcf-46c6-9be8-374cc0e20af4,app_open,4006811922873399949,3.301378e+18,Sin Definir
4,2019-04-20 22:40:41.239,1,Apple,6328027616411983332,False,False,6287817205707153877,,6.156971e+18,adjust.com,,,,3386455054590810771,3.301378e+18,Sin Definir


In [18]:
installs.dtypes

created               datetime64[ns]
application_id                  int8
ref_type                    category
ref_hash                       int64
attributed                      bool
implicit                        bool
device_countrycode          category
device_brand                 float64
device_model                 float64
session_user_agent            object
user_agent                    object
event_uuid                    object
kind                          object
ip_address                     int64
device_language              float64
Wifi_cat                    category
dtype: object
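Several columns are still stored as object. For a column with few distinct values, such as "kind" (183 unique values in the describe() output), converting to category may reduce memory further. The sketch below compares both representations on made-up data; the repeated labels are only illustrative.

```python
import pandas as pd

# Toy column with few distinct values repeated many times
kind = pd.Series(['app_open', 'Open', 'app_open', 'Open'] * 1000)

# deep=True includes the memory used by the string values themselves
as_object = kind.memory_usage(deep=True)
as_category = kind.astype('category').memory_usage(deep=True)

print(as_object, as_category)
```

The gain depends on how many distinct values the column has, so columns that are nearly unique per row, such as "event_uuid", are unlikely to benefit.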