# Product recommendation engine
---
![](resources/groceries.jpg)

## Main objective:
---
Imagine we own a grocery store and want to maximize how much each customer buys per visit.

A first strategy that comes to mind is to place products that are usually bought together next to each other.

Since we have successfully completed a data science task in the past, we immediately realize that this problem can be formulated as a recommendation task.


The machine learning part has the following goal:


Essentially, we will try to predict the last item of a customer's purchase list, given all the other items already in their shopping basket. These predictions are a helpful first heuristic for the placement of certain products in our grocery store.

Thus we start collecting the purchase histories of past customers and write down the steps needed to build our recommendation pipeline:


### Plan of attack:
1. Load the customer purchase data, located in 'data/training_data.csv' and 'data/training_labels.csv'
    - Note on the dataset: each row in each of the data files refers to one 'incomplete' item list of a customer's purchase.
    - The labels represent the item that the customer purchased in addition to the items in the dataset.
    
    
2. Plot the following statistics:
    - histogram of 10 most purchased products
    - pie chart of all product purchase frequencies
    - Which other interesting plots can you think of? -> extra points



3. Compute and present the following results (you are free to choose any method to present your results):
    - Find the pair of products that are bought together the most
    - How many customers purchased all the products?
    - Which product was the least purchased?


4. Transform it into a format that a machine learning classifier can digest:
    - Machine learning algorithms consume data that has a unified format!
    - For example, it should look like this:
    
    
    | customer | feature 1 (e.g. product/grocery): "apple" | feature 2: "banana" | ... | feature N: "mango" |
    |---|---|---|---|---|
    | customer 1 (purchased only banana) | no | yes | ... | no |
    | customer 2 (purchased all 3 shown) | yes | yes | ... | yes |
    | ... | ... | ... | ... | ... |
    | customer N (purchased nothing) | no | no | ... | no |
    


5. Train your model on the training set, and predict an item for each row in the test set (DON'T change the order of the test set):
    - Item predictions should be in the original string format (= item name)



6. Save the predictions for the test set in a CSV file
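
The three results asked for in step 3 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's implementation: the `baskets` list and the `basket_statistics` helper are hypothetical, and the real baskets would be built from the CSV files.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; the real data would come from the CSV files.
baskets = [
    ["apple", "banana", "milk"],
    ["banana", "milk"],
    ["apple", "banana"],
    ["banana", "milk", "bread"],
]


def basket_statistics(baskets):
    """Return (most frequent pair, customers who bought every product, least purchased product)."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        # Sort the unique items so that each pair is always counted under the same key.
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    all_products = set(item_counts)
    top_pair, _ = pair_counts.most_common(1)[0]
    bought_everything = sum(set(basket) == all_products for basket in baskets)
    least_purchased = min(item_counts, key=item_counts.get)
    return top_pair, bought_everything, least_purchased


top_pair, bought_everything, least_purchased = basket_statistics(baskets)
print(top_pair, bought_everything, least_purchased)
```

Counting pairs per basket keeps the work proportional to the basket sizes rather than to the square of the number of products.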


### Note on implementation:
- You are free to use any classification algorithm you want. If you find better recommendation approaches on the web (there certainly are better, but also more involved, ones), you are free to use those. The main goal though will be to 
- Try to implement classes


### Note on grading:
- End result = 25%
- Clean code (e.g. classes instead of script-like functions etc.) = 25%
- Documentation = 25%
- Usage of numpy, pandas, pyplot etc. functions for faster computation = 25%

In [1]:
#all modules used
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import torch, torchvision
import torch.nn.functional as F
import torchvision.transforms as transforms
import seaborn as sns
import nltk
nltk.download('stopwords')
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'png')
plt.rcParams['savefig.dpi'] = 90

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\apple\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
train_x = pd.read_csv('data/training_data.csv', header=None)
test_x = pd.read_csv('data/test_data.csv', header=None)
train_y = pd.read_csv('data/training_labels.csv', header=None)
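
For the "10 most purchased products" histogram from the plan, the product frequencies can be computed directly from a frame shaped like `train_x`. The sketch below uses a hypothetical `toy_x` as a stand-in, assuming one row per basket with missing cells where a basket is shorter than the widest one.

```python
import pandas as pd

# Hypothetical stand-in for train_x: one row per basket, padded with missing values.
toy_x = pd.DataFrame([
    ["apple", "banana", None],
    ["banana", "milk", "bread"],
    ["banana", None, None],
    ["milk", "apple", None],
])

# stack() flattens all cells into one Series; value_counts() ignores missing
# cells and sorts the products by how often they were purchased.
product_counts = toy_x.stack().value_counts()
top10 = product_counts.head(10)
print(top10)
```

Calling `top10.plot.bar()` would then draw the bar chart through pyplot, and `product_counts.plot.pie()` the pie chart of all frequencies.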

In [3]:
class ProductRcmd:
    
    def __init__(self, df_items, df_labels):
        self.rawdata = df_items
        self.rawlabels = df_labels
        
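    def Data_preproc(self, NAdrop=True, onehot=True, fwrite=True,
                     dataname="new_data.csv", labelsname="new_labels.csv"):
        """Drop empty baskets, one-hot encode the items and align the labels."""
        if NAdrop:
            # drop baskets without any item (this modifies the input frame in place)
            self.rawdata.dropna(axis=0, how='all', inplace=True)
        if onehot:
            # one-hot encode every basket position, then sum the columns that share a product name
            # (transposing avoids groupby(axis=1), which newer pandas versions deprecate)
            self.data = pd.get_dummies(self.rawdata, prefix='').T.groupby(level=0).sum().T
            self.data.columns = self.data.columns.str.replace("_", "")
        else:
            # without one-hot encoding, work on a copy of the raw items
            self.data = self.rawdata.copy()
            
        self.ItemType = self.data.shape[1]
        self.CustNum = self.data.shape[0]
        
        # label the rows with customer ids and keep only the labels of the remaining baskets
        self.data.index = ["Custom No." + str(index) for index in range(self.CustNum)]
        self.labels = self.rawlabels.iloc[self.rawdata.index,:]
        self.LabelType = np.unique(self.labels).shape[0]
        self.labels.index = ["Custom No." + str(index) for index in range(self.CustNum)]
        if fwrite:
            self.data.to_csv(dataname)
            self.labels.to_csv(labelsname)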
            
    def Data_split(self, train_size=0.8, shuffle=False):
        self.train_x, self.test_x, self.train_y, self.test_y =\
        train_test_split(self.data, self.labels, train_size = train_size, shuffle=shuffle)
    
    

In [4]:
myRcmd = ProductRcmd(train_x, train_y)
myRcmd.Data_preproc()
myRcmd.Data_split(shuffle = True)

In [5]:
myRcmd.train_y.shape,myRcmd.train_x.shape

((4136, 1), (4136, 119))

In [6]:
# Preprocessing: drop empty rows, one-hot encode, and label the rows with customer ids
data = train_x
data.dropna(axis=0, how='all', inplace=True)
new_data = pd.get_dummies(data, prefix='')
# sum the one-hot columns that refer to the same product
# (transposing avoids groupby(axis=1), which newer pandas versions deprecate)
nn_data = new_data.T.groupby(level=0).sum().T
nn_data.columns = nn_data.columns.str.replace("_", "")
# derive the number of customers from the data instead of hard-coding 5171
customer_ids = ["Custom No." + str(index) for index in range(len(nn_data))]
nn_data.index = customer_ids
nn_data.to_csv("new_data.csv")
labels = train_y.iloc[data.index,:]
labels.index = customer_ids
labels.to_csv("new_labels.csv")

In [7]:
np.unique(labels).shape

(109,)
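
To see what the one-hot step produces, it helps to run it on a tiny example that mirrors the table in the plan. This is a sketch on a hypothetical `toy_x`. Unlike the cells above, it passes `prefix_sep=""` so that no underscore has to be stripped afterwards.

```python
import pandas as pd

# Hypothetical baskets: customer 0 bought only banana, customer 1 bought all three.
toy_x = pd.DataFrame([
    ["banana", None, None],
    ["apple", "banana", "mango"],
    ["mango", "apple", None],
])

# One indicator column per (basket position, product) combination.
dummies = pd.get_dummies(toy_x, prefix="", prefix_sep="")

# Sum the columns that share a product name so that each product appears once.
one_hot = dummies.T.groupby(level=0).sum().T.astype(int)
print(one_hot)
```

Each row is now a fixed-length 0/1 vector over the same product columns, which is the unified format a classifier needs.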

## Method 1: Neural Network
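Slightly modified: the model is now trained on the train data and tested on the test data, and the network was changed to 2 layers. The result is still 7%.
I do not know why.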

In [8]:
train_x = myRcmd.train_x
test_x = myRcmd.test_x
train_y = pd.get_dummies(myRcmd.train_y)
test_y = pd.get_dummies(myRcmd.test_y)

In [9]:
# Get the data

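###### MAP THE TRAIN LABELS TO CLASS INDICES #####
def Get_train_targets(train_y):
    """Return, for each label in train_y[0], its column position in nn_data."""
    train_y_dummies = pd.get_dummies(list(train_y[0]))

    # map each product name to its column position in the one-hot data
    product_dict = {product: i for i, product in enumerate(nn_data.columns)}

    # rename the dummy columns from product names to class indices
    train_y_idx = train_y_dummies.rename(columns=product_dict)

    # each row contains exactly one 1, so idxmax returns that row's class index
    train_targets = train_y_idx.idxmax(axis=1)
    
    return train_targets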

#### GET THE TRAINING DATAPOINT #####
def Get_train_datas(train_x):
    #### receive one hot data!!!! #####
    one_hot_training_data = train_x
    return one_hot_training_data

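class CustomDataset(torch.utils.data.Dataset):
    """Pairs one-hot feature rows with integer class targets."""

    def __init__(self, data, labels):

        # explicit dtypes: float features for the linear layers, long class indices for the loss
        self.target = torch.as_tensor(np.asarray(labels), dtype=torch.long)
        self.data = torch.as_tensor(np.asarray(data), dtype=torch.float32)
        
    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return self.data[idx], self.target[idx]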
    
#### BATCH THE DATAPOINT ####
def Get_batch(one_hot_training_data, train_targets, size):
    # zip the data and target
    training_data = CustomDataset(one_hot_training_data,train_targets)
    # batch the data point
    train_loader = torch.utils.data.DataLoader(training_data, batch_size=size, shuffle=True)
    return train_loader
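
The training and evaluation cells further down iterate over `train_loader` and `test_loader`, so both loaders have to be built before those cells run. The sketch below shows one way to do that on hypothetical toy arrays. `make_loader` is an illustrative helper rather than part of the notebook, and it uses `TensorDataset` in place of `CustomDataset` to stay self-contained.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy inputs: 6 baskets over 4 products; targets are class indices.
toy_features = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
], dtype=np.float32)
toy_targets = np.array([1, 2, 3, 0, 1, 3], dtype=np.int64)


def make_loader(features, targets, batch_size, shuffle):
    """Wrap arrays in a DataLoader yielding (float features, long targets) batches."""
    dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(targets))
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


# Shuffle only the training batches; keep the test order fixed.
train_loader = make_loader(toy_features[:4], toy_targets[:4], batch_size=2, shuffle=True)
test_loader = make_loader(toy_features[4:], toy_targets[4:], batch_size=2, shuffle=False)

x, y = next(iter(test_loader))
print(x.shape, y)
```

With the functions defined in this notebook, the corresponding call would presumably be `Get_batch(Get_train_datas(myRcmd.train_x), Get_train_targets(myRcmd.train_y), size)`, with the targets given as integer class indices, which is the usual target format for `torch.nn.CrossEntropyLoss`.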

In [10]:
class Neural_Network(torch.nn.Module):
    
    def __init__(self, input_dim, num_classes):
        
        super(Neural_Network, self).__init__()
        
        self.input_dim = input_dim
        self.num_classes = num_classes
        
        self.layer1 = torch.nn.Linear(self.input_dim, 10)
        
        self.layer2 = torch.nn.Linear(10, self.num_classes)
        
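    def forward(self, x):
        
        # flatten to (batch, input_dim), then apply the hidden layer with a sigmoid activation
        x = self.layer1(x.view(-1, self.input_dim))
        x = torch.sigmoid(x)  # F.sigmoid is deprecated in favor of torch.sigmoid
        
        # return raw logits; CrossEntropyLoss applies log-softmax itself
        x = self.layer2(x)
        
        return x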

In [11]:
neural_net = Neural_Network(119, 119)
optimizer = torch.optim.SGD(params=neural_net.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

In [12]:
# training loop:

for epoch in range(20):
    running_loss = 0.0
    
    for i, (x, y) in enumerate(train_loader, 1):
        
        # set optimizer gradients to zero
        optimizer.zero_grad()
        
        # forward pass (calling the module runs forward() together with any registered hooks)
        predictions = neural_net(x)
                
        # backward pass + optimization step
        loss = loss_fn(predictions, y)
        loss.backward()
        optimizer.step()
        
        # print statistics
        running_loss += loss.item()
        
        if i % 1000 == 0:
            print(f'Epoch: {epoch}, loss: {running_loss / i}')
        
    print(f'Loss after epoch: {epoch} = {running_loss / len(train_loader)}')

NameError: name 'train_loader' is not defined

In [None]:
correct = 0
total = 0
with torch.no_grad():
    for data, labels in test_loader:
        outputs = neural_net(data)
        _, predicted = torch.max(outputs.data, 1)
        print(predicted)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy: %d %%' % (
    100 * correct / total))

## Method 2: SVM

In [None]:
def Get_targets(train_y):
    train_y_dummies = pd.get_dummies(list(train_y[0]))

    # get a dictionary
    product_dict = {}
    for i,product in enumerate(nn_data.columns):
        product_dict[product] = i

    #rename the dummies
    train_y_idx = train_y_dummies.rename(columns=product_dict)

    #return the idxmax value
    train_targets = train_y_idx.idxmax(axis=1)
    
    return train_targets

In [None]:
train_x = myRcmd.train_x
test_x = myRcmd.test_x
train_y = Get_targets(myRcmd.train_y)
test_y = Get_targets(myRcmd.test_y)

In [None]:
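from sklearn import svm
clf = svm.SVC(decision_function_shape='ovo')
# scikit-learn expects a 1-D label array, so flatten instead of reshaping to a column
clf.fit(np.asarray(train_x), np.asarray(train_y).ravel())
# switch to one-vs-rest scores of shape (n_samples, n_classes)
clf.decision_function_shape = "ovr"
dec = clf.decision_function(np.asarray(myRcmd.test_x))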

In [None]:
dec
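
The decision values in `dec` are scores, not predictions yet. As a minimal sketch on hypothetical toy data, the snippet below shows the shape of the one-vs-rest scores and two ways to turn them into class predictions.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data: three well-separated classes, two samples each.
X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = svm.SVC(decision_function_shape="ovr")
clf.fit(X, y)

# One score per (sample, class) when the shape is 'ovr'.
scores = clf.decision_function(X)

# Pick the class with the highest score, or let the classifier decide directly.
by_argmax = clf.classes_[scores.argmax(axis=1)]
by_predict = clf.predict(X)
print(scores.shape, by_predict)
```

In practice `clf.predict` is the simpler route, and mapping the predicted indices back through the product columns gives the item names that the plan asks for.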

## Method 3: Logistic Regression

In [None]:
train_x = myRcmd.train_x
test_x = myRcmd.test_x
train_y = Get_targets(myRcmd.train_y)
test_y = Get_targets(myRcmd.test_y)

In [None]:
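from sklearn.svm import SVR

# Note: SVR is support vector regression, not logistic regression. It fits the class
# indices as if they were numeric values, and the predictions are truncated to integers.
clf = SVR(C=500, epsilon=0.8)
clf.fit(train_x, train_y)
predict_y = clf.predict(test_x).astype(int)
list(predict_y), test_y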

In [None]:
# count the predictions that match the true class index (vectorized comparison)
k = int((np.asarray(predict_y) == np.asarray(test_y)).sum())
k
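
Since this section is titled logistic regression, here is a sketch of what a classifier-based variant could look like with the `LogisticRegression` and `accuracy_score` imports from the first cell. The toy arrays are hypothetical. Fitting on the string labels directly means the predictions already come back as item names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data: columns are one-hot products [apple, banana, milk, bread].
X_train = np.array([
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
])
# The label is the item bought in addition to the basket, kept as a string.
y_train = np.array(["banana", "banana", "milk", "milk", "bread", "bread"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predictions are item names, so no index-to-name mapping is needed.
predictions = clf.predict(X_train)
accuracy = accuracy_score(y_train, predictions)
print(predictions, accuracy)
```

The accuracy here is measured on the training rows only, to keep the example short; on the real data it would be computed on the held-out split. The predictions can then be written out with something like `pd.Series(predictions).to_csv("predictions.csv", index=False, header=False)`, where the file name is an assumption.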