# This notebook is also probably a bit janky. You might not be able to run it.
Takes the original collection of ad data, groups it by phone with some summary stats, and adds some additional phone-level columns.

It roughly approximates the procedures that Jeff was using to create his classifier data; see Part I of `jeff_classifier.py` for more information. I've tried to clean up the process a little, while maintaining his transformations.


## Setup

In [1]:
from itertools import chain
import ujson as json
import multiprocessing as mp
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import ShuffleSplit, StratifiedKFold
from sklearn.metrics import auc, roc_auc_score, roc_curve

from helpers import disaggregated_df
from helpers import aggregated_df
from helpers import dummify_df
from helpers import phone_str_to_dd_format

## Read Svebor's Merge

In [2]:
df = pd.read_csv('../../data/merged/data_to_use_by_ad_v3_with_exp_imgs.csv')
print(df.shape)

(191949, 63)


Some initial transformations

In [3]:
# df['has_images'] = df['images_count'].notnull()
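# fillna returns a new Series, so assign it back for the fill to take effect
df['images_count'] = df['images_count'].fillna(0)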
df['class'] = df['class'].isin(['positive'])
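
As a sanity check on the two transformations above, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not rows from the real file). It shows that `fillna` only affects the frame once its result is assigned back, and that `isin(['positive'])` turns the string label into a boolean.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the ad-level data (values are made up).
toy = pd.DataFrame({'images_count': [3.0, np.nan, 1.0],
                    'class': ['positive', 'negative', 'positive']})

# fillna returns a new Series; without assignment the frame is unchanged.
toy['images_count'].fillna(0)
still_missing = int(toy['images_count'].isnull().sum())

# Assigning the result back is what actually removes the NaN.
toy['images_count'] = toy['images_count'].fillna(0)
now_missing = int(toy['images_count'].isnull().sum())

# isin(['positive']) maps the string label to True/False.
toy['class'] = toy['class'].isin(['positive'])
```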

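At some point, a formatting issue got introduced into the ethnicity data: multiple values ended up separated by spaces rather than by `|`. We fix that here, and mark missing values with the string `'nan'`.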

In [4]:
df.ethnicity = df.ethnicity.fillna('nan').apply(lambda x: x.replace(' ','|'))
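
A small sketch of what this replacement does, using invented ethnicity strings (the values below are assumptions for illustration only). Note that any label that itself contains a space would also be split by this approach.

```python
import numpy as np
import pandas as pd

# Made-up examples: one space-delimited pair, one single value, one missing.
eth = pd.Series(['white latina', 'asian', np.nan])

# Missing values become the literal string 'nan'; spaces become pipes.
fixed = eth.fillna('nan').apply(lambda x: x.replace(' ', '|'))
```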

## Join in Steve's Data

In [5]:
steve = pd.read_csv('../../data/phone_aggregates/phones.csv')
steve_cols = ['n_ads',
              'n_distinct_locations',
              'location_tree_length',
              'n_outcall',
              'n_incall',
              'n_incall_and_outcall',
              # 'average_n_days_before_revisit',
              'n_cooccurring_phones']
steve_phone = steve.loc[:, ['phone'] + steve_cols].drop_duplicates()

print(steve_phone.shape)

(1527450, 8)


In [6]:
df = df.merge(steve_phone, how='left')
df = df.loc[df[['dd_id', 'phone']].drop_duplicates().index]
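
To illustrate the join and de-duplication, here is a hedged sketch on toy frames (`ads`, `phones`, and all values are hypothetical). It passes `on='phone'` explicitly, which is an assumption; the cell above lets pandas pick the shared columns. Ads whose phone has no match in the phone table keep their row but get `NaN` in the joined columns.

```python
import pandas as pd

# Hypothetical ad-level rows; the first two are exact duplicates.
ads = pd.DataFrame({'dd_id': [1, 1, 2],
                    'phone': ['111', '111', '333'],
                    'price': [100, 100, 80]})

# Hypothetical phone-level aggregates; phone '333' is absent.
phones = pd.DataFrame({'phone': ['111', '222'],
                       'n_ads': [5, 9]})

# Left join keeps every ad, matched or not.
merged = ads.merge(phones, how='left', on='phone')

# Keep one row per (dd_id, phone) pair.
deduped = merged.drop_duplicates(subset=['dd_id', 'phone'])
```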

print(df.shape)

(29039, 70)


## Sort out aggregation of continuous variables

In [7]:
numerical_vars = ['age',
                  'price',
                  'duration_in_mins',
                  'price_per_min',
                  'images_count',
                  'exp_ads_from_simimages_count',
                  'similar_images_count']

phone_level_vars = ['n_ads',
                    'n_distinct_locations',
                    'location_tree_length',
                    'n_outcall',
                    'n_incall',
                    'n_incall_and_outcall',
                    'n_cooccurring_phones']

In [8]:
missing_vars = ['missing_{}'.format(col) for col in numerical_vars]

# Missing images means 0 images
# We've solved that with fillna(0)
missing_vars.remove('missing_images_count')

for col in missing_vars:
    # 1 where the underlying value is missing, 0 otherwise
    df[col] = df[col[len('missing_'):]].isnull().astype(int)
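
A minimal sketch, on a made-up `age` column, of how a 0/1 missing-value indicator behaves and why the order of `~` and `.astype(int)` matters. Method calls bind tighter than the unary `~`, so inverting after the integer cast performs a bitwise NOT on integers rather than a logical negation.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (values are made up).
toy = pd.DataFrame({'age': [25.0, np.nan, 31.0]})

# .astype(int) runs first, then ~ flips the bits of each integer:
# ~1 == -2 and ~0 == -1, which is not a usable 0/1 flag.
bitwise = ~toy['age'].notnull().astype(int)

# Using isnull() directly yields the intended 0/1 indicator.
indicator = toy['age'].isnull().astype(int)
```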

In [9]:
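# Written against an older pandas, where groupby(...).describe() stacks the
# statistics in the index; newer versions already return one column per
# (variable, statistic) pair, so the unstack() may need to be dropped there.
numerical_df = df.groupby('phone')[numerical_vars + missing_vars].describe().unstack()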
print(numerical_df.shape)
numerical_df = numerical_df.dropna(axis=0, how='all')
print(numerical_df.shape)



(567, 104)
(567, 104)


In [10]:
phone_level_df = df.groupby('phone')[phone_level_vars].max()
print(phone_level_df.shape)
phone_level_df = phone_level_df.dropna(axis=0, how='all')
print(phone_level_df.shape)
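
The drop in row count after `dropna` comes from phones whose phone-level columns are all missing. A sketch on toy data (the frame and its values are assumptions) of how the per-phone `max` followed by an all-`NaN` row filter behaves:

```python
import numpy as np
import pandas as pd

# Phone '222' has no phone-level values at all (made-up data).
toy = pd.DataFrame({'phone': ['111', '111', '222'],
                    'n_ads': [5.0, 5.0, np.nan],
                    'n_incall': [1.0, 1.0, np.nan]})

# One row per phone; an all-NaN group stays NaN under max().
per_phone = toy.groupby('phone')[['n_ads', 'n_incall']].max()

# Drop only the rows in which every column is missing.
kept = per_phone.dropna(axis=0, how='all')
```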

(567, 7)
(390, 7)


## Sort out categorical variables

In [11]:
flag_dummies = dummify_df(df.loc[:, ['phone', 'flag', 'ethnicity']], ['flag', 'ethnicity'], '|')
discrete_df = flag_dummies.groupby('phone').mean()
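
The implementation of `dummify_df` lives in `helpers` and is not shown here. As a rough sketch of the idea only, and assuming it produces one indicator column per `|`-separated token, the same effect can be approximated with pandas' built-in `Series.str.get_dummies`. The toy frame and the `flag_` prefix are assumptions.

```python
import pandas as pd

# Made-up ads: phone '111' has two ads, phone '222' has one.
toy = pd.DataFrame({'phone': ['111', '111', '222'],
                    'flag': ['a|b', 'a', 'b']})

# One 0/1 column per pipe-separated token.
dummies = toy['flag'].str.get_dummies(sep='|').add_prefix('flag_')
dummies['phone'] = toy['phone']

# The per-phone mean is the share of that phone's ads carrying each token.
shares = dummies.groupby('phone').mean()
```

Taking the mean rather than the max keeps some information about how often a flag or ethnicity appears across a phone's ads, not just whether it ever appears.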

## Merge all and clean up.

This process includes fixing column names: because we used `describe`, many of the columns have tuples as names, such as `('age', 'mean')`. We fix that here, joining each tuple into a single colon-separated string such as `age:mean`.

In [12]:
phone_level_df = phone_level_df.join([numerical_df, discrete_df], how='outer')
# phone_level_df['has_images'] = df.groupby('phone')['has_images'].max()
phone_level_df['class'] = df.groupby('phone')['class'].max()

phone_level_df = phone_level_df.fillna(-1).reset_index()
phone_level_df['phone'] = phone_level_df['index']
del phone_level_df['index']

print(phone_level_df.shape)

(567, 150)


In [13]:
phone_level_df.columns = [x if not isinstance(x, tuple) else ':'.join(x) for x in phone_level_df.columns]
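
The renaming can be checked in isolation on a plain list of labels. This sketch uses hypothetical column names; plain strings pass through unchanged and tuples are joined with a colon.

```python
# Hypothetical mix of plain and tuple column labels.
columns = ['n_ads', ('age', 'mean'), ('age', 'std')]

# Same comprehension as above, applied to the list directly.
flat = [c if not isinstance(c, tuple) else ':'.join(c) for c in columns]
```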

In [14]:
phone_level_df.to_csv('../../data/merged/data_to_use_by_phone_v4.csv', index=False)
phone_level_df.to_pickle('../../data/merged/data_to_use_by_phone_v4.pkl')