# Practical Assignment 2 - Organización de Datos
## Machine Learning Competition
### Facultad de Ingeniería de la Universidad de Buenos Aires
### 95-58: Organización de Datos - 2nd Term 2018

#### Team members: Gonzalo Diz, Ariel Windey, Gabriel Robles, and Matías El Dócil




#### Objective
For each user in the data set, estimate the probability that the user
makes a conversion on Trocafone within a given period.

#### Sources
The file "events_up_to_01062018.csv" contains, in the same format used in TP1,
the events recorded on the platform for a set of users up to
31/05/2018.

The file "labels_training_set.csv" indicates, for a subset of the
users included in the event set "events_up_to_01062018.csv", whether they
made a conversion (column label = 1) or not (column label = 0) between 01/06/2018
and 15/06/2018.
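
The relationship between the two files can be sketched on toy data. This is only an illustration: the user ids and values below are made up, the `person` column name in the labels file is an assumption (the source only names the `label` column), and `n_events` is a hypothetical feature.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; all values are made up.
events = pd.DataFrame({
    "person": ["a1", "a1", "b2", "c3", "c3", "c3"],
    "event": ["visited site", "viewed product", "visited site",
              "visited site", "checkout", "conversion"],
})
# Assumption: the labels file identifies users with a "person" column.
labels = pd.DataFrame({"person": ["a1", "c3"], "label": [0, 1]})

# Every labelled user should also appear in the event log.
labelled_in_events = labels["person"].isin(events["person"]).all()

# One row per user with a simple feature, then attach it to the labels.
features = events.groupby("person").size().rename("n_events").reset_index()
training = labels.merge(features, on="person", how="left")
```

Users present in the events but absent from the labels (here `b2`) stay out of `training`; they are the ones for whom a probability would have to be predicted.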

In [256]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn as skl

%matplotlib inline

pd.set_option('display.max_columns', 51)

In [257]:
# Load the events data set
eventos = pd.read_csv('../dataset/events_up_to_01062018.csv', low_memory=False)
# Load the labels data set
labels = pd.read_csv('../dataset/labels_training_set.csv', low_memory=False)


In [258]:
eventos.shape

(2341681, 23)

In [259]:
eventos.sample(5)

Unnamed: 0,timestamp,event,person,url,sku,model,condition,storage,color,skus,search_term,staticpage,campaign_source,search_engine,channel,new_vs_returning,city,region,country,device_type,screen_resolution,operating_system_version,browser_version
2043123,2018-04-26 13:08:22,generic listing,fb3969c5,,,,,,,"2820,2750,6706,6721,12618,2773,2774,5447,2812,...",,,,,,,,,,,,,
979596,2018-05-16 19:48:46,viewed product,1dffe825,,9020.0,iPhone 6S Plus,Bom,32GB,Cinza espacial,,,,,,,,,,,,,,
1227131,2018-05-31 21:14:04,searched products,a9e0eed1,,,,,,,"4883,3000,4919,4871,3012,10868,10854,6595,6636...",Celular,,,,,,,,,,,,
685136,2018-04-10 18:50:25,viewed product,171e75cb,,6957.0,iPhone 6S,Excelente,128GB,Dourado,,,,,,,,,,,,,,
235703,2018-05-24 01:08:35,searched products,22b9085f,,,,,,,"6357,3371,6371,2777,10896,2718,2694,6413,3191,...",j7 32 gb,,,,,,,,,,,,


In [260]:
# Parse the event timestamps as datetimes
eventos['timestamp'] = pd.to_datetime(eventos['timestamp'])
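
The timestamps in the sample above follow a `YYYY-MM-DD HH:MM:SS` layout. Assuming that holds for every row (not verified here), passing an explicit `format` to `pd.to_datetime` avoids format inference. A minimal sketch on two values copied from the sample:

```python
import pandas as pd

# Two timestamps taken from the sample output above.
raw = pd.Series(["2018-04-26 13:08:22", "2018-05-16 19:48:46"])

# An explicit format skips inference; a row that does not match raises an error.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")
```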


In [261]:
# Average number of events per person
# Note: dividing by .count() returns the same value once per column

eventos_por_persona = eventos[['person', 'event']]
eventos_por_persona = eventos_por_persona.groupby(['person']).count().reset_index()
promedio_eventos_por_persona = eventos_por_persona['event'].sum() / eventos_por_persona.count()
promedio_eventos_por_persona

person    60.307528
event     60.307528
dtype: float64
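
The same average can be obtained as a single number. The sketch below uses made-up data and assumes every row has a non-null `person`; it is an alternative formulation, not the cell used above.

```python
import pandas as pd

# Toy event log; all values are made up.
events = pd.DataFrame({
    "person": ["a1", "a1", "a1", "b2", "c3", "c3"],
    "event": ["visited site", "viewed product", "checkout",
              "visited site", "visited site", "viewed product"],
})

# Rows per user, then the mean of those counts.
events_per_person = events.groupby("person").size()
mean_events = events_per_person.mean()

# Equivalent shortcut: total rows divided by the number of distinct users.
mean_events_alt = len(events) / events["person"].nunique()
```

Applied to the real data, the shortcut is 2341681 / 38829, which agrees with the 60.307528 shown above.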

In [268]:
personas = eventos['person'].to_frame().drop_duplicates()
personas.shape

(38829, 1)

## Feature 1: Number of events per user

In [275]:
# Count the events per person and event type

eventos['cantidad'] = 1
grouped = eventos.groupby(['person','event']).agg({'cantidad':'sum'})
grouped = grouped.unstack().reset_index()
grouped = grouped.fillna(value=0)

In [276]:
grouped.sample(5)


Unnamed: 0_level_0,person,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad,cantidad
event,Unnamed: 1_level_1,ad campaign hit,brand listing,checkout,conversion,generic listing,lead,search engine hit,searched products,staticpage,viewed product,visited site
36143,ee58d70c,1.0,0.0,1.0,0.0,1.0,0.0,1.0,3.0,0.0,5.0,1.0
33094,da8fcfea,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
29225,c10a2db1,5.0,0.0,1.0,0.0,1.0,0.0,4.0,0.0,0.0,16.0,3.0
20454,87d3961c,1.0,3.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,11.0,1.0
11525,4c2000d4,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,13.0,1.0


In [277]:
grouped.columns = ['person','ad campaign hit', 'brand listing', 'checkout', 'conversion', 'generic listing', 'lead', 'search engine hit', 'searched products', 'staticpage', 'viewed product', 'visited site']
grouped.sample(5)

Unnamed: 0,person,ad campaign hit,brand listing,checkout,conversion,generic listing,lead,search engine hit,searched products,staticpage,viewed product,visited site
29961,c622d79d,3.0,1.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,2.0
32628,d76ff6ff,7.0,1.0,4.0,0.0,0.0,0.0,3.0,6.0,0.0,139.0,7.0
26126,acbc9760,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
23092,98ffe412,7.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,13.0,7.0
34363,e29a9761,3.0,1.0,1.0,0.0,2.0,0.0,3.0,0.0,0.0,9.0,2.0


In [278]:
grouped = grouped.set_index('person')
grouped.sample(5)

Unnamed: 0_level_0,ad campaign hit,brand listing,checkout,conversion,generic listing,lead,search engine hit,searched products,staticpage,viewed product,visited site
person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0eca7ed3,31.0,12.0,2.0,0.0,30.0,0.0,44.0,90.0,0.0,127.0,14.0
d72d80b3,0.0,0.0,1.0,0.0,2.0,0.0,2.0,4.0,0.0,7.0,2.0
a896b933,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
843e9159,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1a71716a,0.0,0.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,0.0


In [282]:
non_visit = grouped.loc[grouped['visited site'] == 0]
non_visit.shape

(587, 11)

We now have the number of events of each type for each person.

In [267]:
grouped.shape

(38829, 11)

Since every person has at least one event (38,829 rows, the same as `personas`), there is no need to join with `personas`.

In [299]:
grouped.to_csv('events_qty_per_person.csv')
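
The per-person event counts built in this section can also be sketched with `pd.crosstab`, which needs no helper column of ones and yields flat, integer-valued columns directly. This is an alternative on made-up data, not the approach used above.

```python
import pandas as pd

# Toy event log; all values are made up.
events = pd.DataFrame({
    "person": ["a1", "a1", "a1", "b2", "c3", "c3"],
    "event": ["visited site", "viewed product", "viewed product",
              "visited site", "checkout", "viewed product"],
})

# Rows are users, columns are event types, cells are counts (0 when absent).
counts = pd.crosstab(events["person"], events["event"])
```

Because absent combinations are filled with 0 as integers, no separate `fillna` step or column renaming is required.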

## Feature 2: Dispositivos

For each person, we count the events recorded for each "device type". This feature comes from the "visited site" event, which is the one that carries `device_type`.

In [271]:
eventos.sample(5)

Unnamed: 0,timestamp,event,person,url,sku,model,condition,storage,color,skus,search_term,staticpage,campaign_source,search_engine,channel,new_vs_returning,city,region,country,device_type,screen_resolution,operating_system_version,browser_version,cantidad
5404,2018-05-29 19:46:20,viewed product,c7fba740,,9286.0,Samsung Galaxy J7 Prime,Bom,32GB,Dourado,,,,,,,,,,,,,,,1
2068040,2018-05-22 23:54:33,generic listing,1db0efa0,,,,,,,"6594,6636,1061,6707,2750,12619,11346,2766,1260...",,,,,,,,,,,,,,1
826480,2018-05-25 23:53:59,viewed product,85e44e49,,2939.0,Sony Xperia Z2,Bom,16GB,Preto,,,,,,,,,,,,,,,1
2091479,2018-05-02 15:27:19,brand listing,fcead3d1,,,,,,,"12758,12744,12772,8541,8513,8485,8471,8527,641...",,,,,,,,,,,,,,1
336391,2018-04-30 13:25:58,viewed product,a8ece5ba,,12745.0,Samsung Galaxy S8,Muito Bom,64GB,Ametista,,,,,,,,,,,,,,,1


In [286]:
dispositivos = eventos.groupby(['person', 'device_type']).agg({'cantidad':'sum'})
dispositivos = dispositivos.unstack().reset_index()
dispositivos = dispositivos.fillna(value=0)

In [287]:
dispositivos.columns = ['person','Computer', 'Smartphone', 'Tablet', 'Unknown']
dispositivos = dispositivos.set_index('person')
dispositivos.sample(5)

Unnamed: 0_level_0,Computer,Smartphone,Tablet,Unknown
person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
42d2dd49,1.0,0.0,0.0,0.0
94d38c82,0.0,5.0,0.0,0.0
4f8a4a9f,1.0,0.0,0.0,0.0
6a989c01,3.0,0.0,0.0,0.0
03bcd854,1.0,0.0,0.0,0.0


In [288]:
dispositivos.shape

(38242, 4)

Some people were left out (38,829 - 38,242 = 587): they have no `device_type` associated because they have no recorded "visited site" event, which matches the 587 people in `non_visit` above. We therefore merge with `personas`.

In [297]:
dispositivos = pd.merge(dispositivos, personas, on='person', how='right')
dispositivos = dispositivos.fillna(value=0)

In [300]:
dispositivos.to_csv('device_types_qty_per_person.csv')
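
The reason users drop out of the device table, and one way to restore them, can be sketched on toy data. `groupby` skips rows whose key is missing, so a user with no `device_type` value never appears in the result. The sketch uses `reindex` instead of the merge above; it is an alternative under made-up data, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy event log; only "visited site" rows carry a device_type.
events = pd.DataFrame({
    "person": ["a1", "a1", "b2", "c3"],
    "event": ["visited site", "viewed product", "viewed product", "visited site"],
    "device_type": ["Smartphone", np.nan, np.nan, "Computer"],
})

# groupby drops rows whose key is NaN, so user "b2" vanishes here.
devices = events.groupby(["person", "device_type"]).size().unstack(fill_value=0)

# Reindex against the full list of users; the missing ones get zeros.
all_people = events["person"].drop_duplicates()
devices_full = devices.reindex(all_people, fill_value=0)
```

With `reindex`, `person` stays as the index, so the result has the same layout as the event-count table from Feature 1.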