# DATA 586 PROJECT: BESTBUY - RECOMMENDATION SYSTEM

### TEAM MEMBERS: CLAUDIA NIKEL, VINCENT PHAN

Goal: predict which Xbox game a visitor will be most interested in, based on their search query.  
Data source: https://www.kaggle.com/c/acm-sf-chapter-hackathon-small/data

### DATA DESCRIPTION  

The main data for this competition is in the train.csv and test.csv files. These files contain information on what items users clicked on after making a search.

Each line of train.csv describes a user's click on a single item. It contains the following fields:

**user**: A user ID  
**sku**: The stock-keeping unit (item) that the user clicked on   
**category**: The category the sku belongs to  
**query**: The search terms that the user entered  
**click_time**: Time the sku was clicked on  
**query_time**: Time the query was run 

test.csv contains the same fields as train.csv except for sku; the competition task is to estimate which SKUs were clicked on in those test queries. Note: the labels for test.csv are not provided, so the model cannot be validated against it. Instead, we build the training, validation, and test sets from train.csv.


### METHODOLOGY

We use a feedforward neural network with a 776-unit hidden layer and a 388-unit output layer (one unit per distinct SKU) to predict the SKU for each query. To improve accuracy, a pre-trained text embedding serves as the first layer. This has two advantages: the model benefits from transfer learning, and the embedding has a fixed size, which makes it simpler to process. In this example we use a pre-trained text embedding model from TensorFlow Hub, google/tf2-preview/gnews-swivel-20dim/1.

The workflow consists of the following steps.

(Note: steps 1 and 2 can be skipped by loading the preprocessed dataset in the code for step 3.)

1. Loading the data: 42,365 queries
2. Preprocessing the query text (34,199 queries remain after this step)
    - remove non-ASCII characters
    - remove punctuation
    - collapse characters repeated more than twice in a row down to two
    - remove non-English queries
    - apply stemming to the query text
3. Creating the training, validation, and test sets
    - 80% for training
    - 10% for validation
    - 10% for testing
4. Setting up the neural network
5. Training the model and evaluating it on the test set

The reported accuracy of the model on the test data is 92%.


#### 1. LOADING THE DATA

In [31]:
import numpy as np
import pandas as pd
import string
df = pd.read_csv("train.csv")
print(df.shape)
df.head()

(42365, 6)


| | user | sku | category | query | click_time | query_time |
|---|---|---|---|---|---|---|
| 0 | 0001cd0d10bbc585c9ba287c963e00873d4c0bfd | 2032076 | abcat0701002 | gears of war | 2011-10-09 17:22:56.101 | 2011-10-09 17:21:42.917 |
| 1 | 00033dbced6acd3626c4b56ff5c55b8d69911681 | 9854804 | abcat0701002 | Gears of war | 2011-09-25 13:35:42.198 | 2011-09-25 13:35:33.234 |
| 2 | 00033dbced6acd3626c4b56ff5c55b8d69911681 | 2670133 | abcat0701002 | Gears of war | 2011-09-25 13:36:08.668 | 2011-09-25 13:35:33.234 |
| 3 | 00033dbced6acd3626c4b56ff5c55b8d69911681 | 9984142 | abcat0701002 | Assassin creed | 2011-09-25 13:37:23.709 | 2011-09-25 13:37:00.049 |
| 4 | 0007756f015345450f7be1df33695421466b7ce4 | 2541184 | abcat0701002 | dead island | 2011-09-11 15:15:34.336 | 2011-09-11 15:15:26.206 |


#### 2. PREPROCESSING THE QUERY TEXT

In [2]:
import tensorflow as tf

In [32]:
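import re

# convert to string
df['query_mod'] = df['query'].astype(str)
# remove non-ASCII characters
# (regex=True is needed because newer pandas versions treat the pattern literally by default)
df['query_mod'] = df['query_mod'].str.replace(r'[^\x00-\x7F]', '', regex=True)
# remove punctuation (re.escape keeps characters such as \ and ] literal inside the class)
df['query_mod'] = df['query_mod'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)
# collapse any lowercase letter repeated 3 or more times in a row down to 2
def replaceRepeat(x):
    return re.sub(r'([a-z])\1{2,}', r'\1\1', x)
df['query_mod'] = df['query_mod'].map(replaceRepeat)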
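
To make the effect of the three regex steps concrete, here is a minimal standalone sketch that applies the same patterns to single strings using only the standard library. The helper name `clean_query` and the sample queries are illustrative assumptions, not part of the original notebook.

```python
import re
import string

# Same three patterns as in the preprocessing cell, compiled once
NON_ASCII = re.compile(r'[^\x00-\x7F]')
PUNCTUATION = re.compile('[{}]'.format(re.escape(string.punctuation)))
REPEATS = re.compile(r'([a-z])\1{2,}')   # a lowercase letter repeated 3+ times

def clean_query(text):
    """Hypothetical helper: strip non-ASCII, strip punctuation, collapse repeats."""
    text = NON_ASCII.sub('', str(text))
    text = PUNCTUATION.sub('', text)
    return REPEATS.sub(r'\1\1', text)

print(clean_query("Gears of waaaar!!!"))   # Gears of waar
print(clean_query("café: halo 3"))         # caf halo 3
print(clean_query("HALOOOO"))              # HALOOOO (pattern only matches lowercase)
```

Note that the repeat pattern is case-sensitive, so uppercase runs are left untouched unless the text is lowercased first.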

In [33]:
# Drop queries that langid does not classify as English
import langid as ld
def enDect(x):
    # True when the detected language is not English
    return ld.classify(x)[0] != 'en'
nonEn_index = df[df['query_mod'].map(enDect)].index.tolist()
df = df.drop(nonEn_index)

In [34]:
# Apply stemming: strip suffixes so that each word is reduced to its root form
# (word_tokenize requires NLTK's tokenizer data, installed via nltk.download)
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
porter = PorterStemmer()
def stemSentence(sentence):
    token_words = word_tokenize(sentence)
    # stem each token and join the stems with single spaces
    return " ".join(porter.stem(word) for word in token_words)
df['query_mod'] = df['query_mod'].map(stemSentence)

In [36]:
print(df.shape)

(34199, 7)


In [11]:
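# Save the cleaned query and its SKU under the column names that step 3 reads back
df[['query_mod', 'sku']].rename(columns={'query_mod': 'feature', 'sku': 'label'}).to_csv("postprocessdata.csv")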

#### 3. CREATING TRAINING AND TESTING DATA

In [1]:
import numpy as np
import tensorflow as tf
#!pip install -q tensorflow-hub
#!pip install -q tfds-nightly
import tensorflow_hub as hub
import tensorflow_datasets as tfds
import pandas as pd
import string
import matplotlib.pyplot as plt

In [4]:
# If you are not loading postprocessdata.csv, uncomment this cell instead
#df2 = pd.concat([df.query_mod, df.sku], axis=1)
#df2.columns = ['feature', 'label']
#dftf = tf.data.Dataset.from_tensor_slices((df2.feature, df2.label))

In [5]:
df2 = pd.read_csv("postprocessdata.csv", header=0)
df2 = pd.concat([df2.feature, df2.label], axis=1)
df2['feature'] = df2['feature'].astype(str)
dftf = tf.data.Dataset.from_tensor_slices((df2.feature, df2.label))

In [6]:
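train_size = int(0.8 * len(df2))
val_size   = int(0.1 * len(df2))
# take/skip are chained so that the three splits do not overlap
train_dataset = dftf.take(train_size)
remaining     = dftf.skip(train_size)
val_dataset   = remaining.take(val_size)
test_dataset  = remaining.skip(val_size)   # the remaining ~10%
# (the logs in section 5 were recorded with val_dataset = dftf.skip(test_size) and
#  test_dataset = dftf.take(test_size), which overlap the training rows)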
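
`take(n)` keeps the first n elements of a dataset and `skip(n)` drops them, so disjoint splits come from chaining the two. The sketch below mimics that behaviour with list slicing on row indices (an assumption made so that it runs without TensorFlow) for the 34,199 rows reported above. The helper name `split_indices` is hypothetical.

```python
def split_indices(n_rows, train_frac=0.8, val_frac=0.1):
    """Return disjoint (train, val, test) index lists; test gets the remainder."""
    indices = list(range(n_rows))
    train_size = int(train_frac * n_rows)
    val_size = int(val_frac * n_rows)
    train = indices[:train_size]        # like dataset.take(train_size)
    remaining = indices[train_size:]    # like dataset.skip(train_size)
    val = remaining[:val_size]          # like remaining.take(val_size)
    test = remaining[val_size:]         # like remaining.skip(val_size)
    return train, val, test

train, val, test = split_indices(34199)
print(len(train), len(val), len(test))  # 27359 3419 3421

# No row index appears in more than one split
assert not (set(train) & set(val))
assert not (set(train) & set(test))
assert not (set(val) & set(test))
```

Because the slices are contiguous, the split follows the row order of the file; shuffling beforehand would be a separate design decision.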

#### 4. SETTING UP THE NEURAL NETWORK MODEL

In [7]:
# Determine the number of output units (one per distinct SKU label)
output_number = len(df2.label.unique())

# Represent each query as an embedding vector,
# using a pre-trained text embedding model from TensorFlow Hub
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)

NNmodel = tf.keras.Sequential()
NNmodel.add(hub_layer)
NNmodel.add(tf.keras.layers.Dense(output_number*2, activation='relu'))
NNmodel.add(tf.keras.layers.Dense(output_number,activation='sigmoid' ))
NNmodel.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 776)               16296     
_________________________________________________________________
dense_1 (Dense)              (None, 388)               301476    
Total params: 717,792
Trainable params: 717,792
Non-trainable params: 0
_________________________________________________________________
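
The parameter counts in the summary can be checked by hand: a dense layer with n_in inputs and n_out units has n_in × n_out weights plus n_out biases. A short sketch of that arithmetic follows. The helper name `dense_params` is illustrative, and the embedding count is simply copied from the summary.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

embedding_params = 400020               # taken from the summary above
hidden_params = dense_params(20, 776)   # 20-dim embedding -> 776 hidden units
output_params = dense_params(776, 388)  # 776 hidden units -> 388 output units
total_params = embedding_params + hidden_params + output_params

print(hidden_params)   # 16296
print(output_params)   # 301476
print(total_params)    # 717792
```

All three values agree with the `Param #` column and the reported total of 717,792.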


In [8]:
# Setting an optimizer and a loss function
NNmodel.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=1e-1), metrics=['accuracy'])
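
Keras's `categorical_crossentropy` expects one-hot targets, whereas `sparse_categorical_crossentropy` accepts integer class indices in the range 0 to K-1 and is usually paired with a softmax output. If the labels are raw SKU numbers, as the commented-out cell in section 3 suggests, they would first need to be mapped to such indices. The following is a minimal pure-Python sketch of that mapping and of the loss for a single example. It is not the notebook's code, and the helper names are hypothetical.

```python
import math

def build_label_index(skus):
    """Map each distinct SKU to a class index 0..K-1 (sorted for determinism)."""
    return {sku: i for i, sku in enumerate(sorted(set(skus)))}

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, class_index):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[class_index])

# SKUs taken from the first rows of train.csv shown in section 1
skus = [2032076, 9854804, 2670133, 9984142, 2541184, 2032076]
label_index = build_label_index(skus)
encoded = [label_index[s] for s in skus]
print(encoded)                                   # [0, 3, 2, 4, 1, 0]

# With uniform scores over 5 classes the loss equals log(5)
print(sparse_cross_entropy([0.0] * 5, encoded[1]))
```

With index-encoded labels the loss stays on the scale of log(K) for an untrained model, which makes unusually large loss values easier to spot.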

#### 5. TRAINING THE NN MODEL

In [10]:
NNmodel.fit(train_dataset.batch(200),
            epochs=20,
            validation_data=val_dataset.batch(200))


Train for 137 steps, validate for 154 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x168799fac18>

In [11]:
results = NNmodel.evaluate(test_dataset.batch(200), verbose=2)
for name, value in zip(NNmodel.metrics_names, results):
    print("%s: %.3f" % (name, value))

18/18 - 0s - loss: 8890656312.8889 - accuracy: 0.9204
loss: 8890656312.889
accuracy: 0.920
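
The step counts in the logs above (137 training steps, 154 validation steps, and 18 evaluation batches) are numbers of batches, i.e. the dataset size divided by the batch size, rounded up. A small sketch of that arithmetic, assuming the 34,199 preprocessed rows and the batch size of 200 used above (the helper name `steps_per_epoch` is illustrative):

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to cover n_examples once."""
    return math.ceil(n_examples / batch_size)

n_rows = 34199
train_size = int(0.8 * n_rows)   # 27359 rows
tenth = int(0.1 * n_rows)        # 3419 rows

print(steps_per_epoch(train_size, 200))       # 137
print(steps_per_epoch(n_rows - tenth, 200))   # 154
print(steps_per_epoch(tenth, 200))            # 18
```

The 137 steps match the 80% training split and the 18 batches match a 10% test set. The 154 validation steps correspond to 30,780 examples, that is, every row after the first 10%, which is far more than a 10% validation set and overlaps the training rows.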
