# Basic Item recommender in Keras

### Summary

This notebook demonstrates one implementation of an item recommendation deep neural network, built using Keras. The strategy is borrowed shamelessly from [YouTube's paper](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf), which is very detailed. They follow a two-stage process. First, a "candidate generation" neural network effectively tries to predict a customer's next purchase using their purchase history, amongst other things. The predictions from this first network are then combined with items seeded by other sources and fed into a second network that ranks them, with the top N finally shown to the client.

I've written this notebook because the way YouTube encoded purchase history was interesting, but also difficult to figure out using the standard tools of Pandas, Numpy and Keras. Here's the representation of the network architecture from the YouTube paper:

![Youtube's Model](yt_model.png)

The embedded video watches and embedded search tokens are what took me a little while to puzzle out. In essence, though, they are represented in a Pandas dataframe as a feature containing an array of every video id watched up until the current row. These are then embedded normally and the embeddings averaged, so that the whole watch-history array is "flattened" into however many dimensions you chose during embedding.
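
To make the "embed, then average" step concrete, here is a minimal NumPy sketch of the arithmetic. It is not the model itself: the embedding matrix is random rather than learned, and `vocab_size`, `embedding_dim`, `pad_id` and `average_history` are names made up for illustration. It assumes histories are right-padded with id 0 so that they all have the same length.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 6      # hypothetical: ids 1..5 are items, 0 is reserved for padding
embedding_dim = 4   # hypothetical embedding width

# Stand-in for a learned embedding matrix: one row per item id.
embeddings = rng.normal(size=(vocab_size, embedding_dim))

# Histories of different lengths, right-padded with 0.
histories = np.array([
    [3, 1, 4, 0],   # three items
    [2, 0, 0, 0],   # one item
    [0, 0, 0, 0],   # empty history (a customer's first row)
])

def average_history(histories, embeddings, pad_id=0):
    """Look up each id's vector and average over the real (non-padding) ids."""
    vectors = embeddings[histories]                   # (batch, steps, dim)
    mask = (histories != pad_id)[..., np.newaxis]     # (batch, steps, 1)
    counts = np.maximum(mask.sum(axis=1), 1)          # avoid dividing by zero
    return (vectors * mask).sum(axis=1) / counts      # (batch, dim)

pooled = average_history(histories, embeddings)
print(pooled.shape)  # (3, 4)
```

Whatever the length of a history, the result is one vector of `embedding_dim` numbers per row. In Keras, roughly the same effect should be achievable with an `Embedding` layer (with `mask_zero=True`) followed by `GlobalAveragePooling1D`.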

As data for this demo, I'm using the UCI Machine Learning Repository's [Online Retail](http://archive.ics.uci.edu/ml/datasets/Online+Retail) dataset. It's pretty well suited to the problem, and a good practice set if you want to try it yourself.

So, let's import the data and do some dumps to take a look at what we have.

In [3]:
import pandas as pd
from keras.models import Model
from keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense, Dropout, Concatenate

In [4]:
df = pd.read_csv('online_retail.csv')

In [5]:
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01/12/2010 08:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 01/12/2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01/12/2010 08:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01/12/2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01/12/2010 08:26 | 3.39 | 17850.0 | United Kingdom |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


We'll just drop InvoiceNo. In production we might use it, but it isn't needed for this demo. StockCode is the item identifier; we're going to add a new feature called ItemHistory that collects all the StockCodes from this customer's previous rows (at least, those with different InvoiceNos) into an array. We'll do a similar thing for Description. First, though, we'll quickly fix the dtypes for InvoiceDate and CustomerID. The latter has a number of null values, which are useless to us because we can't build a purchase history for them, so those rows will be dropped.

In [22]:
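# Dates are day-first, e.g. 01/12/2010 is 1 December 2010
df.InvoiceDate = pd.to_datetime(df.InvoiceDate, format='%d/%m/%Y %H:%M')

# Rows without a CustomerID can't be tied to a purchase history, so drop them
df.dropna(subset=['CustomerID'], inplace=True)
# Cast via int to strip the trailing ".0" left over from the float column
df.CustomerID = df.CustomerID.astype('int').astype('object')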
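
As a rough illustration of the ItemHistory idea described above, here is a sketch on a made-up toy frame that only counts StockCodes from a customer's *earlier* invoices, so items bought on the same invoice never appear in each other's history. The helper `add_item_history` and the toy values are hypothetical, and the sketch assumes the frame has a unique index and that ordering invoices by InvoiceDate is what "previous" should mean.

```python
import pandas as pd

# Toy frame with the same column names as the retail data (values are made up).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2],
    "InvoiceNo": ["A", "A", "B", "C", "D"],
    "StockCode": ["x", "y", "z", "x", "y"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-05", "2011-02-01", "2011-01-03"]
    ),
})

def add_item_history(df):
    """Attach, to every row, the StockCodes from the customer's earlier invoices."""
    df = df.sort_values(["CustomerID", "InvoiceDate", "InvoiceNo"]).copy()
    histories = {}  # row label -> list of earlier StockCodes
    for _, customer in df.groupby("CustomerID", sort=False):
        seen = []  # codes from invoices strictly before the current one
        for _, invoice in customer.groupby("InvoiceNo", sort=False):
            for idx in invoice.index:
                histories[idx] = list(seen)
            seen.extend(invoice["StockCode"].tolist())
    # Build the column in the frame's own row order.
    df["ItemHistory"] = pd.Series([histories[i] for i in df.index], index=df.index)
    return df

result = add_item_history(toy)
print(result[["CustomerID", "InvoiceNo", "StockCode", "ItemHistory"]])
```

The nested Python loops are easy to read but slow. On the full dataset of roughly 400,000 rows, a vectorised approach would likely be preferable.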

In [37]:
df.sort_values(by='CustomerID').head(11)

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 61619 | 541431 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 74215 | 2011-01-18 10:01:00 | 1.04 | 12346 | United Kingdom |
| 61624 | C541433 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | -74215 | 2011-01-18 10:17:00 | 1.04 | 12346 | United Kingdom |
| 286628 | 562032 | 21578 | WOODLAND DESIGN COTTON TOTE BAG | 6 | 2011-08-02 08:48:00 | 2.25 | 12347 | Iceland |
| 72263 | 542237 | 47559B | TEA TIME OVEN GLOVE | 10 | 2011-01-26 14:30:00 | 1.25 | 12347 | Iceland |
| 72264 | 542237 | 21154 | RED RETROSPOT OVEN GLOVE | 10 | 2011-01-26 14:30:00 | 1.25 | 12347 | Iceland |
| 72265 | 542237 | 21041 | RED RETROSPOT OVEN GLOVE DOUBLE | 6 | 2011-01-26 14:30:00 | 2.95 | 12347 | Iceland |
| 72266 | 542237 | 21035 | SET/2 RED RETROSPOT TEA TOWELS | 6 | 2011-01-26 14:30:00 | 2.95 | 12347 | Iceland |
| 72267 | 542237 | 22423 | REGENCY CAKESTAND 3 TIER | 3 | 2011-01-26 14:30:00 | 12.75 | 12347 | Iceland |
| 72268 | 542237 | 84969 | BOX OF 6 ASSORTED COLOUR TEASPOONS | 6 | 2011-01-26 14:30:00 | 4.25 | 12347 | Iceland |
| 72269 | 542237 | 22134 | MINI LADLE LOVE HEART RED | 12 | 2011-01-26 14:30:00 | 0.42 | 12347 | Iceland |


In [35]:
# For each customer, list the StockCodes of every row before row i (positional slice)
[x.StockCode.iloc[:i].tolist() for j, x in df.groupby('CustomerID')
                                          for i in range(len(x))]

[[],
 ['23166'],
 [],
 ['85116'],
 ['85116', '22375'],
 ['85116', '22375', '71477'],
 ['85116', '22375', '71477', '22492'],
 ['85116', '22375', '71477', '22492', '22771'],
 ['85116', '22375', '71477', '22492', '22771', '22772'],
 ['85116', '22375', '71477', '22492', '22771', '22772', '22773'],
 ['85116', '22375', '71477', '22492', '22771', '22772', '22773', '22774'],
 ['85116',
  '22375',
  '71477',
  '22492',
  '22771',
  '22772',
  '22773',
  '22774',
  '22775'],
 ['85116',
  '22375',
  '71477',
  '22492',
  '22771',
  '22772',
  '22773',
  '22774',
  '22775',
  '22805'],
 ['85116',
  '22375',
  '71477',
  '22492',
  '22771',
  '22772',
  '22773',
  '22774',
  '22775',
  '22805',
  '22725'],
 ['85116',
  '22375',
  '71477',
  '22492',
  '22771',
  '22772',
  '22773',
  '22774',
  '22775',
  '22805',
  '22725',
  '22726'],
 ['85116',
  '22375',
  '71477',
  '22492',
  '22771',
  '22772',
  '22773',
  '22774',
  '22775',
  '22805',
  '22725',
  '22726',
  '22727'],
 ['85116',
  '22375'