In [27]:
import datetime
import json
import os

import keras
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from sklearn.model_selection import train_test_split


MIN_WEEK_DATA = 10
countries = ['INDIA', 'RUSSIA', 'SOUTHAFRICA', 'UK', 'US', 'CHINA', 'BRAZIL']
country_labels = {}
for idx, country in enumerate(countries):
    country_labels[country] = idx

Load in all the CSV data.

In [45]:
load_data = False

if 'country_data' not in locals():
    locals()['country_data'] = {}
    
if load_data:    
    for country in countries:
        # List the contents of each extracted ZIP file
        # Expected to be one directory per user; directory name = userID
        users = os.listdir('../data/%s' % country)
        country_data[country] = {}
        for user_id in users:
            csv_path = '../data/%s/%s/QueryResults.csv' % (country, user_id)
            if os.path.exists(csv_path):
                country_data[country][user_id] = pd.read_csv(csv_path)
            else:
                print('does not exist ' + csv_path)

Expand each pandas DataFrame loaded from CSV by parsing `creationdate` into a pandas `Timestamp` column (`parsed`) and adding `minute_of_week` and `week_of_year` columns.

In [9]:
for country in country_data:
    user_dfs = country_data[country]
    for user_id in user_dfs:
        user_df = user_dfs[user_id]
        user_df['parsed'] = pd.to_datetime(user_df['creationdate'], format='%Y-%m-%d %H:%M:%S')
        user_df['minute_of_week'] = user_df['parsed'].apply(lambda row: (row.dayofweek * 24 * 60) + (row.hour * 60) + (row.minute))
        user_df['week_of_year'] = user_df['parsed'].apply(lambda row: row.weekofyear)
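
As a quick check of the `minute_of_week` arithmetic, here is a minimal sketch using only the standard library. It assumes that `datetime.weekday()` (Monday = 0) and the ISO week number line up with pandas' `dayofweek` and `weekofyear`; the helper name `minute_of_week` and the sample timestamp are made up for this example.

```python
from datetime import datetime


def minute_of_week(ts):
    """Minutes elapsed since Monday 00:00 of the week containing ts."""
    return ts.weekday() * 24 * 60 + ts.hour * 60 + ts.minute


# Hypothetical sample timestamp in the same format as `creationdate`
ts = datetime.strptime('2018-03-14 09:30:00', '%Y-%m-%d %H:%M:%S')
sample_minute = minute_of_week(ts)   # Wednesday 09:30
week_of_year = ts.isocalendar()[1]   # ISO week number

# The largest possible value is Sunday 23:59
last_minute = minute_of_week(datetime(2018, 3, 18, 23, 59))
```

Every value falls in the range 0 to 10079, which is what lets a week be encoded later as a fixed-length array.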

Create `country_weekly_data`, which holds, for every country, a list of weeks, where each week is a list of `minute_of_week` timestamps. It is built by looping through every data frame for every user. Whenever `week_of_year` changes between consecutive rows, we start a new week of timestamps.


**WARNING** this cell is very slow. There are functions in a cell below to help save the output for faster loading if the notebook crashes.

In [10]:
process_data = False

if 'country_weekly_data' not in locals():
    locals()['country_weekly_data'] = {}

if process_data:
    for country in country_data:
        user_dfs = country_data[country]
        country_weekly_data[country] = []
        print('processing %s' % country)
        for user_id in user_dfs:
            user_df = user_dfs[user_id]
            last_week_of_year = None
            cur_week = None
            for index, row in user_df.iterrows():
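                # This is the start of a new week
                # NOTE: a week is only appended when the next one starts, so each
                # user's final week is never added to country_weekly_data
                if row['week_of_year'] != last_week_of_year:
                    # Only add weeks with more than MIN_WEEK_DATA unique timestamps
                    if cur_week and len(cur_week) > MIN_WEEK_DATA: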
                        country_weekly_data[country].append(list(cur_week))
                    last_week_of_year = row['week_of_year']
                    # We use a set here to prevent duplicates
                    cur_week = set()
                cur_week.add(row['minute_of_week'])

processing INDIA
processing RUSSIA
processing SOUTHAFRICA
processing UK
processing US
processing CHINA
processing BRAZIL
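
The processing loop only appends a week when the next one begins, so the last week of each user is left out. Below is a minimal sketch of one way to do the grouping without that edge case, using `itertools.groupby` on plain `(week_of_year, minute_of_week)` pairs. The function name `split_into_weeks` and the tuple input are assumptions for illustration, not the notebook's code; the `> MIN_WEEK_DATA` threshold is kept as in the loop.

```python
from itertools import groupby

MIN_WEEK_DATA = 10


def split_into_weeks(rows, min_week_data=MIN_WEEK_DATA):
    """Group consecutive (week_of_year, minute_of_week) pairs into weeks.

    A new week starts whenever week_of_year changes between adjacent rows,
    and the final run is handled like any other.
    """
    weeks = []
    for _, run in groupby(rows, key=lambda pair: pair[0]):
        minutes = {minute for _, minute in run}  # a set drops duplicates
        if len(minutes) > min_week_data:
            weeks.append(sorted(minutes))  # sorted here only for readability
    return weeks


# Toy input: week 1 has 12 minutes, week 2 has 5, week 3 (the last) has 11
rows = ([(1, m) for m in range(12)]
        + [(2, m) for m in range(5)]
        + [(3, m) for m in range(11)])
weeks = split_into_weeks(rows)
```

With real data, the pairs could presumably come from `zip(user_df['week_of_year'], user_df['minute_of_week'])`. Using this instead of the loop would change the week counts reported below.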


Many users have timestamps at exactly midnight with no seconds (YYYY-MM-DD 00:00:00), which is suspicious.

**TODO** Investigate this more later.

Examine how many weeks we have for each country. 

**CONCERN 1**: China, Russia and South Africa are low.

**CONCERN 2**: The number of timestamps in each of these weeks probably varies a lot (US, UK, India probably have weeks with many timestamps; vice-versa for the lower countries)

These concerns could lead to an overfit or otherwise poor model.

In [26]:
for country in country_weekly_data:
    print('%s %d' % (country, len(country_weekly_data[country])))

INDIA 53228
RUSSIA 18883
SOUTHAFRICA 7485
UK 86500
US 94413
CHINA 7246
BRAZIL 12625
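
CONCERN 2 can be checked directly by summarising how many timestamps each week contains. This is a minimal sketch under the assumption that the input has the same shape as `country_weekly_data` (country name mapped to a list of weeks, each week a list of minutes); the function name `week_length_summary` and the toy data are made up.

```python
from statistics import median


def week_length_summary(weekly_data):
    """Return {country: (n_weeks, min, median, max)} of timestamps per week."""
    summary = {}
    for country, weeks in weekly_data.items():
        lengths = [len(week) for week in weeks]
        if lengths:  # skip countries with no weeks at all
            summary[country] = (len(lengths), min(lengths),
                                median(lengths), max(lengths))
    return summary


# Toy data standing in for country_weekly_data
toy = {'A': [[1, 2, 3], [4, 5], [6, 7, 8, 9]], 'B': [[10]]}
summary = week_length_summary(toy)
```

If the medians differ widely between countries, the density of a week alone may be enough for a model to guess the label.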


Below are some functions to save and load the processed `country_weekly_data`, which saves time and memory when replaying the notebook. They use a simple JSON format, since the arrays are not very large.

In [10]:
save_data = False
load_data = True

def save(arr, pth):
    with open(pth, 'w') as fh:
        fh.write(json.dumps(arr))
        fh.flush()
        os.fsync(fh.fileno())

def load(pth):
    with open(pth, 'r') as fh:
        return json.loads(fh.read())
    
if save_data:
    for country in country_weekly_data:
        save(country_weekly_data[country], '../data/processed/%s.json' % country)
if load_data:
    locals()['country_weekly_data'] = {}
    for country in countries:
        country_weekly_data[country] = load('../data/processed/%s.json' % country)

Here we select the weeks we will train and test on. China and South Africa have fewer than 7500 weeks, so we use all of them; for the other countries we naively grab the first 7500.

**TODO**: check the lengths of the weeks inside all the countries and determine which ones would be best to use to accommodate countries with thinner weeks.

In [11]:
weeks_to_use = {
    'CHINA': country_weekly_data['CHINA'],
    'SOUTHAFRICA': country_weekly_data['SOUTHAFRICA'],
    'INDIA': country_weekly_data['INDIA'][:7500],
    'RUSSIA': country_weekly_data['RUSSIA'][:7500],
    'US': country_weekly_data['US'][:7500],
    'UK': country_weekly_data['UK'][:7500],
    'BRAZIL': country_weekly_data['BRAZIL'][:7500],
}

Next we encode the weeks we are going to use into a format the neural networks will like. There are `10080` minutes in a week, so for each week we create an array of length `10080` filled with `0`s and set the indexes of the minutes a user is active to `1`.

In [42]:
encoded_weeks = {}
for country in weeks_to_use:
    week_data = weeks_to_use[country]
    encoded_weeks[country] = []
    print('Encoding ' + country)
    for week in week_data:
        encoded = np.zeros(10080, dtype=int)
        for minute in week:
            encoded[minute] = 1
        encoded_weeks[country].append(encoded)
        

Encoding CHINA
Encoding SOUTHAFRICA
Encoding INDIA
Encoding RUSSIA
Encoding US
Encoding UK
Encoding BRAZIL
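
To make the encoding concrete, here is a minimal dependency-free sketch of the same idea. A plain Python list stands in for the NumPy array, and the names `MINUTES_PER_WEEK` and `encode_week` are assumptions for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def encode_week(minutes, size=MINUTES_PER_WEEK):
    """Return a 0/1 list with a 1 at every minute the user was active."""
    encoded = [0] * size
    for minute in minutes:
        encoded[minute] = 1
    return encoded


# A toy week with activity at Monday 00:00, Monday 01:30 and Sunday 23:59
encoded = encode_week([0, 90, 10079])
```

Since real weeks contain far fewer than 10080 active minutes, each encoded vector is almost entirely zeros.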


Simple sanity check

In [35]:
assert encoded_weeks['US'][0][weeks_to_use['US'][0][0]] == 1

Create the labels for the data

In [36]:
labels = []
for country in encoded_weeks:
    weeks = encoded_weeks[country]
    for week in weeks:
        labels.append([country_labels[country]])
one_hot_labels = keras.utils.to_categorical(labels, num_classes=7)
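
For reference, a one-hot label is a vector of zeros with a single 1 at the class index. The sketch below is a plain-Python stand-in for what `keras.utils.to_categorical` produces for a single label; the helper `one_hot` is hypothetical and only meant to show the layout.

```python
def one_hot(label, num_classes):
    """Return a list of num_classes floats with a 1.0 at index `label`."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row


countries = ['INDIA', 'RUSSIA', 'SOUTHAFRICA', 'UK', 'US', 'CHINA', 'BRAZIL']
country_labels = {country: idx for idx, country in enumerate(countries)}

# 'US' is at index 4 in the countries list
us_row = one_hot(country_labels['US'], len(countries))
```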

Group all the data into one array

In [17]:
data = []
for country in encoded_weeks:
    data += encoded_weeks[country]
data = np.array(data)

Train/test split the data and labels

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, one_hot_labels, test_size=0.33, random_state=42)

Basic single input classifier model

In [38]:
basic_model = Sequential()
basic_model.add(Dense(32, activation='relu', input_dim=10080))
basic_model.add(Dense(7, activation='softmax'))
basic_model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

basic_model.fit(X_train, y_train, epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f52603ecba8>

Evaluate the basic model. It performs poorly, with a test accuracy of about 36% (chance for seven roughly balanced classes is about 14%).

In [39]:
basic_model.evaluate(X_test, y_test, batch_size=32)



[2.656821145766253, 0.3625920984016367]
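
To put the accuracy in context, it helps to compare it with trivial baselines. This sketch assumes the class sizes follow from the week counts printed earlier and the `[:7500]` slices, and it computes the baselines over the whole dataset rather than the exact test split.

```python
# Weeks per country after selection (assumed from the printed counts)
class_counts = {
    'CHINA': 7246,
    'SOUTHAFRICA': 7485,
    'INDIA': 7500,
    'RUSSIA': 7500,
    'US': 7500,
    'UK': 7500,
    'BRAZIL': 7500,
}

total = sum(class_counts.values())

# Accuracy of always predicting the most common class
majority_baseline = max(class_counts.values()) / total

# Accuracy of guessing uniformly at random among the classes
uniform_baseline = 1 / len(class_counts)
```

Both baselines come out at roughly 14%, so the models do learn something, but much less than one would hope.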

A deeper multi-layer perceptron (MLP) model with two hidden layers and dropout.

In [40]:
mlp_model = Sequential()
mlp_model.add(Dense(64, activation='relu', input_dim=10080))
mlp_model.add(Dropout(0.5))
mlp_model.add(Dense(64, activation='relu'))
mlp_model.add(Dropout(0.5))
mlp_model.add(Dense(7, activation='softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
mlp_model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

mlp_model.fit(X_train, y_train,
          epochs=20,
          batch_size=128)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f525850e080>

Evaluate the MLP model. Its accuracy is slightly lower than the basic model's (35.6% vs. 36.3%), although its loss is lower (2.17 vs. 2.66).

In [41]:
mlp_model.evaluate(X_test, y_test, batch_size=128)



[2.1737916718961166, 0.35574635960472467]

**TODO** Analyze results, see where models are performing poorly.
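
One possible starting point for this analysis is a confusion matrix with per-class recall, which shows which countries get mistaken for which. The sketch below works on plain lists of integer class indices; the function names and the toy labels are assumptions for illustration, not results from the models above.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class predicted correctly (None if class is empty)."""
    return [row[i] / sum(row) if sum(row) else None
            for i, row in enumerate(matrix)]


# Toy labels for three classes
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 0, 2]
matrix = confusion_matrix(true_labels, predicted_labels, 3)
recall = per_class_recall(matrix)
```

In the notebook, the inputs could presumably be obtained with `np.argmax(y_test, axis=1)` and `np.argmax(mlp_model.predict(X_test), axis=1)`.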

**TODO** More analysis of the week lengths for a given country, since none was done. Suspect that the countries with larger datasets have more timestamps in a week and the model is overfitting towards