# Today you are an MLE in the Personalization Department in Macy's cosmetics!
Your goal is to predict the outcomes of online browsing sessions: specifically, whether a session's sequence of events will result in a purchase.

The models used in this assignment are similar to those in https://github.com/guillaume-chevalier/seq2seq-signal-prediction/blob/master/seq2seq.ipynb



# Task 0: Getting familiar with the Data

...if you're in Colab...

In [None]:
#Mount the RAW session level data: shopping.pkl
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import all libraries
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Bidirectional

In [None]:
#Read and look at the RAW data
data = pd.read_pickle('/content/drive/MyDrive/Datasets/week_10/Sequence_Models/shopping.pkl') # This is where I stored my data. Where'd you put yours?
print('Shape of data=', data.shape)
data.head()

So, for each unique user_session ID, the fields collected are the time the event occurred, the type of event (view, cart, remove, purchase), the product ID, category ID, brand, price, user ID, year, month, weekday, and hour. Notice that we have 1.5M data samples.

# Task 1: Set up and train a simple RNN on this time series data

There's a lot of data we could use here, but we're going to start with something pretty simple. Think of a user session as being made up of a series of events (e.g. ['view', (add to)'cart', 'view', (add to)'cart', 'purchase']). We want a model that can take in a series of non-purchase events, and predict whether a purchase is going to occur. In the example above, ['view', (add to)'cart', 'view', (add to)'cart'] is a **sequence** which culminates in a purchase. We'll create a representation of sequences of events, and train a model to predict whether or not a purchase is going to occur. 

This work is adapted from the paper: https://arxiv.org/ftp/arxiv/papers/2010/2010.02503.pdf

## Step 1: Create sequence data for each session

In [None]:
# Convert the event types to numeric values
events = {'purchase':1,'cart': 2,'view': 3, 'remove_from_cart':4}
data['event'] = data.event_type.map(events)

In [None]:
# Sort the events by 'event_time'
data = data.sort_values('event_time')
data.head()

^ The `event` column we added is just a numerical representation of the event type (purchase, cart, view, remove_from_cart).

Next, process the data into a new dataframe `sequence`, to have three columns:
1. There will still be a `user_session` column, but now only one row for each user session
2. The `event` column will now be a list of the events that occurred in that session, in the order they occurred.
3. A `purchase` column indicates whether the events for that session included a purchase. The number 1 will indicate a purchase, 0 will indicate no purchase.

In [None]:
sequence = data.groupby('user_session')['event'].apply(list)
sequence = sequence.reset_index()
sequence['purchase'] = sequence['event'].apply(lambda x: 1 if 1 in x else 0)
sequence = sequence[sequence['event'].map(len)> 1]

In [None]:
sequence.head()

...one problem we have is that the event lists still contain '1's for the purchase events. Let's get rid of those, because the purchase column already records whether a purchase occurred...
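
To make the target format concrete, here is a minimal sketch of the same three steps (group by session, label, strip purchase codes) on a made-up event log, in plain Python rather than pandas. The `toy_log` rows and the `build_sequences` helper are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Same event codes as the notebook's `events` dictionary.
EVENT_CODES = {'purchase': 1, 'cart': 2, 'view': 3, 'remove_from_cart': 4}

# Hypothetical toy log: (user_session, event_time, event_type)
toy_log = [
    ('s1', 1, 'view'), ('s1', 2, 'cart'), ('s1', 3, 'purchase'),
    ('s2', 1, 'view'), ('s2', 4, 'view'),
    ('s3', 2, 'view'),
]

def build_sequences(log):
    """Return {session: (non_purchase_event_codes, purchase_label)}."""
    by_session = defaultdict(list)
    # Sort by event time so each session's events stay in order.
    for session, _, event_type in sorted(log, key=lambda row: row[1]):
        by_session[session].append(EVENT_CODES[event_type])
    result = {}
    for session, codes in by_session.items():
        if len(codes) <= 1:  # mirror the notebook: drop single-event sessions
            continue
        label = 1 if 1 in codes else 0
        # Remove the purchase code; the label already carries that information.
        result[session] = ([c for c in codes if c != 1], label)
    return result

toy_sequences = build_sequences(toy_log)
print(toy_sequences)
```

Session `s1` keeps its view and cart events and gets label 1, `s2` gets label 0, and the single-event session `s3` is dropped.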

In [None]:
# The event sequences should not contain purchase events (code 1), so filter them out
sequence['event']= sequence.event.apply(lambda row: list(filter(lambda a: a != 1, row)))
print('Total number of records=', sequence.shape[0])
sequence.head()

In [None]:
temp_one_hot = np.array(pd.get_dummies(sequence['purchase'],prefix='Purchase'))
fraction_with_purchase=np.sum(temp_one_hot[:,1])/len(temp_one_hot)
print('Fraction of sessions containing a purchase:', fraction_with_purchase)

What does an average sequence look like? Can some sequences be especially long and others be very short?

In [None]:
#Find the length of events per user-session
length = sequence['event'].map(len).to_list()

In [None]:
import seaborn as sns

In [None]:
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(length)

### So we see that most sequences are about 100 or shorter. 
One difficult task in time series modeling is finding the sequence length that gives the highest prediction accuracy. We won't attempt that today, but let's at least split the data into short and long sequences, to see if there's a difference between models trained on different sequence lengths.
### We'll focus on sequences that have lengths less than or equal to 10.
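
Before committing to a cutoff, it can help to check how much data each candidate cutoff keeps. The sketch below uses made-up lengths (`toy_lengths`) and a hypothetical helper, `share_within_cutoff`; in the notebook you could pass the `length` list instead.

```python
def share_within_cutoff(lengths, cutoff):
    """Fraction of sequences whose length is <= cutoff."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

# Made-up sequence lengths, for illustration only.
toy_lengths = [1, 2, 2, 3, 5, 8, 10, 14, 40, 120]

for cutoff in (5, 10, 100):
    print(cutoff, share_within_cutoff(toy_lengths, cutoff))
```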

In [None]:
# Select all sequences that are up to 10 events long. Discard the remaining sequences.
short_sequence_10 = sequence[sequence['event'].map(len) <= 10]

In [None]:
# Let's see how many records remain
short_sequence_10

Let's do a little more preprocessing to arrange this data in a form amenable to training a TensorFlow neural network...
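
The key preprocessing step is padding: every sequence is extended to the same length with zeros. Because the event codes start at 1, the value 0 is free to act as "no event". The following is a plain-Python sketch of what left-padding does; `pad_pre` is a hypothetical stand-in that mimics the default behaviour of Keras `pad_sequences` (pad and truncate at the front, fill with 0), not the library function itself.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep only the last maxlen events
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# Two made-up sessions: one shorter and one longer than maxlen.
toy_padded = pad_pre([[3, 2], [3, 3, 2, 4, 3]], maxlen=4)
print(toy_padded)
```

The short session gains leading zeros, and the long one loses its earliest event.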

In [None]:
event_sequence = short_sequence_10['event'].to_list()

In [None]:
event_sequence[0:10]

In [None]:
# Pad all sequences with leading zeros so every input has the same length of 10
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
event = pad_sequences(event_sequence, maxlen=10)  # maxlen makes the length explicit

In [None]:
event[0:10]

In [None]:
# One Hot Encoding the Purchase label
y = np.array(pd.get_dummies(short_sequence_10['purchase'],prefix='Purchase'))
z=np.sum(y[:,1])/len(y)
print('Fraction of purchase sessions for the length-10 sequence data:',z)

In [None]:
#Define a function to generate a 70/30 data split followed by reshaping
def prepare_train_test_data(data,y):
  #input is data[nxd] and Y[nx2], outputs 70/30 split formatted for the sequence models
  X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3)
  #Keras recurrent layers expect 3-D input (samples, timesteps, features);
  #here each sample becomes a single timestep with d features: (n, 1, d)
  X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
  X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
  return (X_train,X_test,y_train,y_test)

In [None]:
X_train, X_test, y_train, y_test=prepare_train_test_data(np.array(event),y)
print(X_train.shape, y_train.shape)

The helper functions below display model accuracy and loss over the course of training, and evaluate the performance of a trained model on the test data:

In [None]:
import matplotlib.pyplot as plt

# demonstration of calculating metrics for a neural network model using sklearn
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score


def plot_history(history):
  # This function will plot the model fit process
  print(history.history.keys())
  # summarize history for accuracy
  plt.plot(history.history['acc'])
  plt.plot(history.history['val_acc'])
  plt.title('model accuracy')
  plt.ylabel('acc')
  plt.xlabel('epoch')
  plt.legend(['train', 'validation'], loc='upper left')
  plt.show()
  # summarize history for loss
  plt.plot(history.history['loss'])
  plt.plot(history.history['val_loss'])
  plt.title('model loss')
  plt.ylabel('loss')
  plt.xlabel('epoch')
  plt.legend(['train', 'validation'], loc='upper left')
  plt.show()


def evaluate_on_test(X_test, y_test, training_model):
  #This function will evaluate the fit model on test data
  model_output=training_model.predict(X_test)
  g_preds = np.argmax(model_output,axis=1)
  gaccuracy = accuracy_score(y_test[:,1], g_preds)
  print('Accuracy: %f' % gaccuracy)
  # precision tp / (tp + fp)
  gprecision = precision_score(y_test[:,1], g_preds)
  print('Precision: %f' % gprecision)
  # recall: tp / (tp + fn)
  grecall = recall_score(y_test[:,1], g_preds)
  print('Recall: %f' % grecall)
  # f1: 2 tp / (2 tp + fp + fn)
  gf1 = f1_score(y_test[:,1], g_preds)
  print('F1 score: %f' % gf1)
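
As a sanity check on the formulas in the comments above, here is a small hand-computable sketch in plain Python. The scores and labels are made up, `argmax_rows` and `binary_metrics` are hypothetical helpers, and returning 0.0 when a denominator is zero is an assumption made here for simplicity.

```python
def argmax_rows(scores):
    """Index of the largest score in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
        'f1': 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0,
    }

# Made-up softmax outputs: column 0 = no purchase, column 1 = purchase.
toy_scores = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]
toy_true = [0, 1, 1, 0, 0]

toy_pred = argmax_rows(toy_scores)
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_pred, toy_metrics)
```

Here there is one true positive, one false positive, one false negative and two true negatives, so precision, recall and F1 all come out to 0.5 while accuracy is 0.6.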


We'll start with training a [`SimpleRNN`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN) and then move on to fancier things.

In [None]:
from tensorflow.keras.layers import GRU, Embedding, SimpleRNN, Activation
import tensorflow as tf

In [None]:
#This is a simple RNN model
def simple_RNN_model(neurons=40, op=10):
    model = Sequential()
    model.add(SimpleRNN(neurons, return_sequences = True, input_shape = (1,op))) ##neurons: units (layer output shape), op: input vector size = sequence length
    model.add(SimpleRNN(2*neurons))
    model.add(Dense(2, activation='softmax'))
    model.compile(
      optimizer=tf.optimizers.Adam(learning_rate=0.0003),
      loss='binary_crossentropy',
      metrics=['acc'])
    return model

In [None]:
#Visualize the Model
tf.keras.backend.clear_session()
RNN_model = simple_RNN_model(neurons=40, op=10)
RNN_model.summary()

In [None]:
#Fit the model using 80/20 validation split at runtime
r_history = RNN_model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=1000,
                    validation_split=0.2)
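
Note that `validation_split=0.2` applies to the 70% training portion, not to the whole dataset, and Keras takes the validation samples from the end of the arrays passed to `fit`. The sketch below works out the approximate effective split; `split_sizes` is a hypothetical helper, and the exact rounding used by scikit-learn and Keras may differ by a sample.

```python
def split_sizes(n_samples, test_size=0.3, validation_split=0.2):
    """Approximate (train, validation, test) counts for the two-stage split."""
    n_test = round(n_samples * test_size)
    n_fit = n_samples - n_test             # what gets passed to model.fit
    n_val = round(n_fit * validation_split)
    return n_fit - n_val, n_val, n_test

toy_split = split_sizes(1000)
print(toy_split)
```

So roughly 56% of the sessions are used for gradient updates, 14% for validation curves, and 30% for the final test metrics.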

In [None]:
plot_history(r_history)

In [None]:
evaluate_on_test(X_test,y_test,RNN_model)

Okay, so that was pretty straightforward: we preprocessed the data into numerical sequences of length 10 (corresponding to different online shopping events), and then trained a model with two SimpleRNN layers from TensorFlow on this data. The recall here is fairly low, presumably because of the imbalance between purchase and non-purchase sessions, but this was a good start.
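
To see why accuracy alone can hide low recall on imbalanced labels, consider a "model" that never predicts a purchase. The 90/10 label ratio below is made up for illustration, and `never_purchase_baseline` is a hypothetical helper.

```python
def never_purchase_baseline(labels):
    """Accuracy and recall of always predicting class 0 (no purchase)."""
    predictions = [0] * len(labels)
    correct = sum(1 for t, p in zip(labels, predictions) if t == p)
    positives = sum(labels)
    true_positives = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 1)
    accuracy = correct / len(labels)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall

# Made-up labels: 90 sessions without a purchase, 10 with one.
toy_labels = [0] * 90 + [1] * 10
baseline_accuracy, baseline_recall = never_purchase_baseline(toy_labels)
print(baseline_accuracy, baseline_recall)
```

This baseline reaches 90% accuracy with zero recall, which is why the precision, recall and F1 numbers from `evaluate_on_test` are worth reading alongside the accuracy.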

# Task 2: Train GRU-based and LSTM-based models on the length-10 sequence data.


Define a network based on GRU layers called `GRU_model`. Just copy and paste the architecture of `simple_RNN_model` and exchange the `SimpleRNN` layers for `GRU` layers (these have already been imported above).

In [None]:
def GRU_model(neurons=40, op=10):
  model = Sequential()
  ############### START CODE HERE ################
  ####### END CODE HERE #######
  return model

In [None]:
#Visualize the Model
tf.keras.backend.clear_session()
G_model = GRU_model()
G_model.summary()

^ Note the increase in the number of parameters.
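
The parameter counts reported by `summary()` can be checked by hand. The sketch below assumes the standard Keras weight layouts (an input kernel, a recurrent kernel and biases per gate) and the TensorFlow 2 default `reset_after=True` for GRU, which adds a second bias vector. The helper names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    # One "gate": input kernel + recurrent kernel + bias.
    return units * (input_dim + units + 1)

def gru_params(input_dim, units, reset_after=True):
    # Three gates; reset_after=True keeps two bias vectors per gate.
    biases = 2 if reset_after else 1
    return 3 * units * (input_dim + units + biases)

def lstm_params(input_dim, units):
    # Four gates, one bias vector each.
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    return units * (input_dim + 1)

def stack_params(cell_params, op=10, neurons=40):
    """Two stacked recurrent layers (neurons, 2*neurons) plus Dense(2)."""
    return (cell_params(op, neurons)
            + cell_params(neurons, 2 * neurons)
            + dense_params(2 * neurons, 2))

for name, fn in [('SimpleRNN', simple_rnn_params),
                 ('GRU', gru_params),
                 ('LSTM', lstm_params)]:
    print(name, stack_params(fn))
```

Under these assumptions the two-layer stack has 11,882 parameters with SimpleRNN layers, 35,682 with GRU layers and 47,042 with plain (non-bidirectional) LSTM layers.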

In [None]:
#Train the G_model (20 epochs, 1000 samples per batch, validation split=0.2)
g_history = G_model.fit(X_train, y_train,
                        epochs=20,
                        batch_size=1000,
                        validation_split=0.2)

In [None]:
#Use plot_history function to plot the model curves for loss and accuracy
plot_history(g_history)

In [None]:
# Use evaluate_on_test function to note accuracy, precision, recall and F1 score on test data
evaluate_on_test(X_test,y_test, G_model)

This model may perform very similarly to the simple RNN, but you can check that it produces slightly different outputs. GRUs add gating to the basic recurrent cell, which often helps on sequence prediction tasks.

Do the same as you did above for GRU, but this time for a model based on [`LSTM`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) layers. Try wrapping the first LSTM layer with [`Bidirectional`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional).



In [None]:
#Define an LSTM model function. Use LSTM layers in place of the SimpleRNN layers.
#Notice the change in number of parameters
def LSTM_model(neurons=40, op=10):
  model = Sequential()
  #### YOUR CODE HERE #####
  #### END CODE HERE #####
  return model

In [None]:
tf.keras.backend.clear_session()
l_model = LSTM_model(neurons=40, op=10)
l_model.summary()

In [None]:
#Train l_model (20 epochs, 1000 samples per batch, validation split=0.2)
l_history = l_model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=1000,  # 1000 samples per gradient update
                    validation_split=0.2)

In [None]:
#Use plot_history function to plot the model curves for loss and accuracy
plot_history(l_history)

In [None]:
# Use evaluate_on_test function to note accuracy, precision, recall and F1 score on test data
evaluate_on_test(X_test,y_test,l_model)

...LSTMs can sometimes take longer to fit because they have more free parameters.

# Task 3: Let's try running the GRU model on a different dataset with more informative features.

In [None]:
# Next, let's look at some other session "features"
feat = pd.read_pickle('/content/drive/MyDrive/Datasets/week_10/Sequence_Models/Session_features.pkl')
print('Shape of data=', feat.shape)
feat.head()

This is a different dataset focused on whole sessions, with a wider variety of features (36 columns).

Suppose we were to treat each row of features as a "sequence." This is not the most common setup for RNNs, since each element in the sequence (each column) represents a different type of quantity, but it'll work in our case. We'll take the first 35 columns as the elements of our sequence and the 36th column, the purchase column, as the target:

In [None]:
Xf=feat.iloc[:,0:35]
Yf=feat.iloc[:,35]
yf = np.array(pd.get_dummies(Yf, prefix='Purchase'))
Xf_train, Xf_test, yf_train, yf_test=prepare_train_test_data(np.array(Xf), np.array(yf)) # Function call to 'prepare_train_test_data' to create 70/30 split data
print(Xf_train.shape, yf_train.shape)

In [None]:
#Initialize ANY model (RNN or GRU or LSTM)
def GRU_model(neurons=40, op=10):
  model = Sequential()
  model.add(GRU(neurons, return_sequences = True, input_shape = (1,op)))
  model.add(GRU(2*neurons))
  model.add(Dense(2, activation='softmax'))
  model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.0003),
    loss='binary_crossentropy',
    metrics=['acc'])
  return model

#Visualize the Model (Notice the increase in parameters)
tf.keras.backend.clear_session()
gru_model = GRU_model(neurons=40, op=Xf_train.shape[2])
gru_model.summary()

In [None]:
#Fit your model on Training data (20 epochs, 1000 samples per batch, validation_split=0.2)
gru_history = gru_model.fit(Xf_train, yf_train,
                    epochs=20,
                    batch_size=1000,  # 1000 samples per gradient update
                    validation_split=0.2)

In [None]:
#Use plot_history function to plot the model curves for loss and accuracy
plot_history(gru_history)

In [None]:
# Use evaluate_on_test function to note accuracy, precision, recall and F1 score on test data
evaluate_on_test(Xf_test,yf_test,gru_model)

# Comment on what you would suggest to your manager


*  Is feature-level data necessary? What metrics suggest that?
*  What would you suggest to improve the quality of the model trained on event-sequence data (the first dataset)?

