# Practical Assignment 2 - Organización de Datos
## Machine Learning Competition
### School of Engineering, Universidad de Buenos Aires
### 95-58: Organización de Datos - 2nd Term 2018

#### Team members: Gonzalo Diz, Ariel Windey, Gabriel Robles and Matías El Dócil




#### Objective
Determine, for each user presented, the probability that the user
makes a conversion on Trocafone within a given period.

#### Sources
The file "events_up_to_01062018.csv" contains, in the same format used in TP1,
information about the events performed on the platform by a set of users up to
31/05/2018.

The file "labels_training_set.csv" indicates, for a subset of the
users included in the event set "events_up_to_01062018.csv", whether they
made a conversion (column label = 1) or not (column label = 0) between 01/06/2018
and 15/06/2018.

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn as skl

%matplotlib inline

pd.set_option('display.max_columns', 51)

In [158]:
# Load the events dataset
eventos = pd.read_csv('../dataset/events_up_to_01062018.csv', low_memory=False)
# Load the labels dataset
labels = pd.read_csv('../dataset/labels_training_set.csv', low_memory=False)


In [159]:
eventos.shape

(2341681, 23)

In [160]:
eventos.sample(5)

Unnamed: 0,timestamp,event,person,url,sku,model,condition,storage,color,skus,search_term,staticpage,campaign_source,search_engine,channel,new_vs_returning,city,region,country,device_type,screen_resolution,operating_system_version,browser_version
2193035,2018-05-29 21:11:34,visited site,aa7eae2f,,,,,,,,,,,,Direct,Returning,Guarulhos,Sao Paulo,Brazil,Computer,1366x768,Windows 7,Chrome 66.0
40830,2018-05-12 04:04:14,viewed product,dc2d6173,,8513.0,Samsung Galaxy S7 Edge,Bom,32GB,Preto,,,,,,,,,,,,,,
995298,2018-04-15 22:48:12,viewed product,58ee8efb,,2704.0,Samsung Galaxy S4 i9515,Bom,16GB,Preto,,,,,,,,,,,,,,
1661912,2018-05-26 23:57:30,viewed product,a221b1bf,,2820.0,Samsung Galaxy Win Duos,Bom,8GB,Branco,,,,,,,,,,,,,,
1771972,2018-05-29 03:13:52,brand listing,caf64c34,,,,,,,"6371,6357,3371,2777,2718,6413,10896,3191,2773,...",,,,,,,,,,,,,


In [161]:
# Parse the event timestamps as datetime values
eventos['timestamp'] = pd.to_datetime(eventos['timestamp'])
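
Once the timestamps are parsed, a quick sanity check is to confirm that every event falls before the label window described at the top of the notebook (01/06/2018 onwards). The sketch below uses a toy frame built from three timestamps taken from the sample output above; `toy_events` and `label_window_start` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for `eventos`, using three timestamps from the sample shown earlier.
toy_events = pd.DataFrame({
    'timestamp': ['2018-05-29 21:11:34',
                  '2018-05-12 04:04:14',
                  '2018-04-15 22:48:12'],
})
toy_events['timestamp'] = pd.to_datetime(toy_events['timestamp'])

# First day of the label window (assumed from the dataset description).
label_window_start = pd.Timestamp('2018-06-01')

# True when no event falls inside or after the label window.
all_before_window = (toy_events['timestamp'] < label_window_start).all()
print(toy_events['timestamp'].min(), toy_events['timestamp'].max(), all_before_window)
```

Running the same comparison on the full `eventos` frame would show whether any event leaks into the prediction period.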


In [162]:
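# Average number of events per person.
# Note: dividing by eventos_por_persona.count() broadcasts over both columns,
# so the same value is reported once for 'person' and once for 'event'.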

eventos_por_persona = eventos[['person', 'event']]
eventos_por_persona = eventos_por_persona.groupby(['person']).count().reset_index()
promedio_eventos_por_persona = eventos_por_persona['event'].sum() / eventos_por_persona.count()
promedio_eventos_por_persona

person    60.307528
event     60.307528
dtype: float64
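
The same average can be obtained as a single number by taking the mean of the per-user group sizes. This is a minimal sketch on a toy frame: the rows of `toy_events` are made up for illustration, and only the `person` and `event` column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for `eventos`; the values are invented for illustration.
toy_events = pd.DataFrame({
    'person': ['a', 'a', 'a', 'b', 'c', 'c'],
    'event': ['visited site', 'viewed product', 'checkout',
              'visited site', 'viewed product', 'viewed product'],
})

# Number of rows per user, then the mean across users (a scalar, not a Series).
events_per_person = toy_events.groupby('person').size()
mean_events_per_person = events_per_person.mean()
print(mean_events_per_person)
```

Applied to `eventos`, this should give the same figure as the cell above, provided the `event` column has no missing values.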

## Feature 1: Number of events per user

In [163]:
# Count the events of each type per person

eventos['cantidad'] = 1
grouped = eventos.groupby(['person','event']).agg({'cantidad':'sum'})
grouped = grouped.unstack().reset_index()
grouped = grouped.fillna(value=0)

In [164]:
grouped.sample(5)


Unnamed: 0_level_0,person,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad
event,Unnamed: 1_level_1,ad campaign hit,brand listing,checkout,conversion,generic listing,lead,search engine hit,searched products,staticpage,viewed product,visited site
29416,c26cc685,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
27220,b42f6d1e,12.0,1.0,0.0,0.0,17.0,0.0,1.0,0.0,0.0,21.0,11.0
22850,97426cb4,3.0,0.0,1.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,1.0
6624,2c1ca63e,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
18809,7ca248dd,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [165]:
grouped.columns = ['person','ad campaign hit', 'brand listing', 'checkout', 'conversion', 'generic listing', 'lead', 'search engine hit', 'searched products', 'staticpage', 'viewed product', 'visited site']
grouped.sample(5)

Unnamed: 0,person,ad campaign hit,brand listing,checkout,conversion,generic listing,lead,search engine hit,searched products,staticpage,viewed product,visited site
33121,daaec93b,6.0,2.0,1.0,1.0,3.0,0.0,7.0,16.0,0.0,15.0,5.0
3273,15bf8430,4.0,1.0,1.0,0.0,2.0,0.0,1.0,0.0,0.0,4.0,6.0
31211,ce445806,2.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0,12.0,1.0
25572,a92e7ad2,7.0,0.0,4.0,0.0,1.0,0.0,4.0,0.0,0.0,12.0,6.0
2478,10a81b9e,1.0,12.0,1.0,0.0,19.0,0.0,19.0,37.0,0.0,42.0,9.0


In [166]:
grouped = grouped.set_index('person')
grouped.sample(5)

Unnamed: 0_level_0,ad campaign hit,brand listing,checkout,conversion,generic listing,lead,search engine hit,searched products,staticpage,viewed product,visited site
person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
52ac5248,13.0,1.0,1.0,0.0,9.0,0.0,11.0,1.0,0.0,71.0,11.0
e9299833,11.0,35.0,1.0,0.0,18.0,0.0,17.0,94.0,0.0,112.0,11.0
1c2e4361,1.0,4.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,12.0,0.0
d96dc82e,3.0,3.0,1.0,0.0,3.0,0.0,3.0,0.0,0.0,5.0,2.0
9e015932,2.0,17.0,1.0,0.0,4.0,0.0,1.0,0.0,0.0,45.0,3.0


We now have the number of events of each type for every person.
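
As an alternative to the helper column, `unstack` and manual column renaming used above, `pd.crosstab` builds the same kind of person-by-event count table in one call. The sketch below runs on an invented `toy_events` frame; it is one possible shortcut, not the method the notebook uses.

```python
import pandas as pd

# Toy stand-in for `eventos`; the values are invented for illustration.
toy_events = pd.DataFrame({
    'person': ['a', 'a', 'a', 'b', 'c', 'c'],
    'event': ['visited site', 'viewed product', 'viewed product',
              'visited site', 'checkout', 'viewed product'],
})

# One row per person, one column per event type.
# Combinations that never occur are filled with 0 automatically.
event_counts = pd.crosstab(toy_events['person'], toy_events['event'])
print(event_counts)
```

The result is already indexed by `person` with flat column names, and the counts are integers rather than the floats produced by `fillna` after `unstack`.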

#### Processing the labels

In [167]:
labels.shape

(19414, 2)

In [168]:
labels.sample(5)

Unnamed: 0,person,label
995,b87cf285,0
3171,4ec33d23,0
2656,be0ad131,0
10436,1d417563,0
9560,d0854e2d,0


In [169]:
labels['label'].value_counts()

0    18434
1      980
Name: label, dtype: int64
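
These counts show a strong class imbalance: roughly 5% of the labeled users converted. The sketch below rebuilds a label column with the same class counts to make the proportion explicit; `toy_labels` is an illustrative stand-in for `labels['label']`.

```python
import pandas as pd

# Rebuild a label column with the class counts shown above (18434 zeros, 980 ones).
toy_labels = pd.Series([0] * 18434 + [1] * 980, name='label')

# normalize=True divides each class count by the total number of rows.
class_share = toy_labels.value_counts(normalize=True)
positive_rate = class_share.loc[1]
print(round(positive_rate * 100, 2))  # percentage of users who converted
```

With so few positives, plain accuracy would look high even for a model that always predicts 0, which is worth keeping in mind when choosing an evaluation metric or a validation split.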

In [170]:
labels = labels.set_index('person')

In [171]:
labels.sample(5)

Unnamed: 0_level_0,label
person,Unnamed: 1_level_1
645da9c9,0
c7641d93,1
e1c9c3c7,0
6d4292b2,0
4167535c,1


## Feature 1: Number of events per user
#### Joining events with labels

In [181]:
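# Inner join on 'person': only users that have both events and a label are kept,
# so 'label' has no missing values and needs no fillna.
merged = pd.merge(left=grouped, right=labels, how='inner', on='person')
merged.sample(10)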


Unnamed: 0_level_0,ad campaign hit,brand listing,checkout,conversion,generic listing,lead,search engine hit,searched products,staticpage,viewed product,visited site,label
person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0f1f1f7b,1.0,1.0,1.0,0.0,2.0,0.0,1.0,0.0,0.0,4.0,1.0,0
48716370,5.0,3.0,2.0,11.0,19.0,0.0,3.0,17.0,3.0,103.0,18.0,1
de8f267e,0.0,0.0,1.0,0.0,2.0,0.0,0.0,1.0,0.0,7.0,3.0,0
4a7119be,54.0,1.0,3.0,0.0,0.0,0.0,5.0,34.0,0.0,118.0,35.0,0
07ea5933,8.0,2.0,1.0,0.0,6.0,0.0,4.0,8.0,0.0,8.0,3.0,0
ebb5df7d,0.0,1.0,1.0,0.0,6.0,0.0,2.0,0.0,1.0,4.0,2.0,0
7fc9c33c,6.0,0.0,1.0,0.0,4.0,0.0,0.0,0.0,0.0,22.0,8.0,0
d71787f7,0.0,0.0,4.0,1.0,2.0,0.0,1.0,0.0,1.0,2.0,1.0,0
a4f28212,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0
d132afeb,10.0,14.0,3.0,2.0,2.0,0.0,2.0,4.0,1.0,11.0,4.0,0
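
Because an inner join silently drops users that appear on only one side, it can be useful to check how many labeled users have no events at all. The sketch below does this with a left join from the labels, using the `indicator` and `validate` options of `merge`. The frames `toy_counts` and `toy_labels` and the user ids are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for `grouped` and `labels`, both indexed by 'person'.
toy_counts = pd.DataFrame(
    {'viewed product': [4.0, 22.0], 'checkout': [1.0, 1.0]},
    index=pd.Index(['u1', 'u2'], name='person'),
)
toy_labels = pd.DataFrame(
    {'label': [0, 0, 1]},
    index=pd.Index(['u1', 'u2', 'u3'], name='person'),
)

# Left join from the labels so that no labeled user is dropped.
# validate raises if 'person' is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from.
checked = toy_labels.reset_index().merge(
    toy_counts.reset_index(), how='left', on='person',
    validate='one_to_one', indicator=True)

# Labeled users with no matching row in the event counts.
missing_events = (checked['_merge'] == 'left_only').sum()
print(missing_events)
```

If this count is 0 on the real data, the inner join above loses no labeled users.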


## Feature 2: Device

In [185]:
eventos.sample(5)

Unnamed: 0,timestamp,event,person,url,sku,model,condition,storage,color,skus,search_term,staticpage,campaign_source,search_engine,channel,new_vs_returning,city,region,country,device_type,screen_resolution,operating_system_version,browser_version,cantidad
918817,2018-05-29 04:26:04,viewed product,dd83b0fc,,3361.0,Samsung Galaxy S6 Flat,Excelente,32GB,Preto,,,,,,,,,,,,,,,1
798754,2018-05-21 20:04:21,viewed product,272aa790,,10477.0,Motorola Moto G4 DTV,Muito Bom,16GB,Preto,,,,,,,,,,,,,,,1
1639507,2018-05-29 05:19:18,viewed product,c7a15318,,7519.0,LG G4 H818P,Bom,32GB,Branco,,,,,,,,,,,,,,,1
339504,2018-04-28 14:33:15,viewed product,de62319a,,14296.0,Samsung Galaxy Note 8,Novo,64GB,Preto,,,,,,,,,,,,,,,1
18129,2018-05-15 18:47:40,viewed product,3be84255,,10198.0,iPhone 7 Plus,Excelente,128GB,Prateado,,,,,,,,,,,,,,,1


In [198]:
dispositivos = eventos[['person', 'device_type']]
dispositivos = dispositivos.dropna()

In [199]:
dispositivos['device_type'].value_counts()

Smartphone    103502
Computer       97485
Tablet          2799
Unknown          283
Name: device_type, dtype: int64
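
One possible way to turn these rows into a per-user feature is to count device types per person and keep the most frequent one. This is a hedged sketch: the notebook does not show this step here, and `toy_devices`, `device_counts` and `main_device` are illustrative names with invented data. In the samples shown above, `device_type` is only populated on `visited site` rows, so each count roughly corresponds to a visit.

```python
import pandas as pd

# Toy stand-in for `dispositivos`; the values are invented for illustration.
toy_devices = pd.DataFrame({
    'person': ['u1', 'u1', 'u1', 'u2', 'u3', 'u3'],
    'device_type': ['Smartphone', 'Smartphone', 'Computer',
                    'Computer', 'Tablet', 'Tablet'],
})

# Rows per user and device type, one column per device.
device_counts = pd.crosstab(toy_devices['person'], toy_devices['device_type'])

# Most frequent device per user; on a tie, idxmax keeps the first column
# (columns are sorted alphabetically by crosstab).
main_device = device_counts.idxmax(axis=1)
print(main_device)
```

Either `device_counts` (numeric columns) or a one-hot encoding of `main_device` could then be joined to the per-user table on `person`.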