#**Big Data Application in E-commerce**
##Customer Behavior Analysis and Recommendation
###Project Contributor: **Tao Liu**



In this project, we use [Amazon Review Data](http://deepyeti.ucsd.edu/jianmo/amazon/index.html) and apply several data analysis methods to study customers' purchasing behavior, then give recommendations based on that analysis.

##**Step 0** - Package import
All the packages will be imported here.

In [1]:
import numpy
import sklearn
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen
import requests
import array

##**Step 1** - Data Loading and Cleaning

We start by downloading the Amazon Review Data product metadata and cleaning it where possible.



In [2]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Gift_Cards.json.gz

--2020-11-02 17:05:49--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Gift_Cards.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 380174 (371K) [application/octet-stream]
Saving to: ‘meta_Gift_Cards.json.gz’


2020-11-02 17:05:50 (704 KB/s) - ‘meta_Gift_Cards.json.gz’ saved [380174/380174]



In [4]:
def parse(path):
    # yield one product record (a dict) per line of the gzipped file;
    # each line is parsed as strict JSON
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)

def getDF(path):
    # load every record into a DataFrame, one row per product
    return pd.DataFrame.from_dict(dict(enumerate(parse(path))), orient='index')

def build_database(path="meta_Gift_Cards.json.gz"):
    asin_list = []
    database = []
    # also write a strict-JSON copy of the metadata, one record per line
    with open("output.strict", 'w') as f:
        for record in parse(path):
            f.write(json.dumps(record) + '\n')
            asin_list.append(record['asin'])
            # keep only the fields used for recommendation
            database.append({
                'asin': record['asin'],
                'also_view': record['also_view'],
                'also_buy': record['also_buy'],
                'similar_item': record['similar_item'],
            })
    return asin_list, database

asin_list, database = build_database()
print("This is the asin number",asin_list[1])
print("This is the data it contained in dictonary",database[1])

This is the asin number B001GXRQW0
This is the data it contained in dictonary {'also_view': ['BT00DC6QU4', 'B01I4AHZXC', 'B0719C5P56', 'B01K8RLHZG', 'B00X4SHPFS', 'B01K8RL9C2', 'B01K8RJDEI', 'B01DCN6SFM', 'B01JQSONCC', 'B01K8RMDO0', 'B0091JKU5Q', 'B01C9MW8Z6', 'B0153R37XQ', 'B01K8RL0AI', 'BT00DDC7BK', 'B01L0KQ1WO', 'B06WVJBVT4', 'B06ZY43PDR', 'B072F9T6VX', 'B079ZR4DC8', 'B01K8RK2KW', 'B0084AVVOM', 'B0725JM87R', 'B01N5TMK8I', 'B071JKLGT5', 'B0753GRNQZ'], 'also_buy': [], 'similar_item': '', 'asin': 'B001GXRQW0'}
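
The `also_view` lists above can already drive a simple recommendation before any classifier is trained. The following is a minimal sketch, not part of the original notebook: it ranks products by the Jaccard overlap of their `also_view` lists. The helper names `jaccard` and `most_similar` and the toy records are hypothetical; only the record layout (`asin`, `also_view`, `also_buy`, `similar_item`) comes from the cell above.

```python
def jaccard(a, b):
    """Jaccard similarity of two ASIN collections (0.0 when both are empty)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def most_similar(target_asin, database, top_n=3):
    """Rank the other products by how much their 'also_view' lists overlap the target's."""
    by_asin = {entry['asin']: entry for entry in database}
    target_views = by_asin[target_asin]['also_view']
    scores = [
        (jaccard(target_views, entry['also_view']), entry['asin'])
        for entry in database
        if entry['asin'] != target_asin
    ]
    # highest similarity first; ties are broken by ASIN for a stable order
    scores.sort(key=lambda s: (-s[0], s[1]))
    return scores[:top_n]


# hypothetical toy records in the same shape as `database`
toy_database = [
    {'asin': 'A1', 'also_view': ['X', 'Y', 'Z'], 'also_buy': [], 'similar_item': ''},
    {'asin': 'A2', 'also_view': ['X', 'Y'], 'also_buy': [], 'similar_item': ''},
    {'asin': 'A3', 'also_view': ['Q'], 'also_buy': [], 'similar_item': ''},
]
ranking = most_similar('A1', toy_database)
print(ranking)
```

In the toy data, `A2` shares two of the three items viewed alongside `A1`, so it ranks first; `A3` shares none and scores 0.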


##**Step 2** - KNN Implementation


In [5]:
import math
from collections import Counter

# k-nearest-neighbours (KNN) classifier
class KNN_Classifier:
    def __init__(self, k):
        self.k = k

    # Euclidean distance between two d-dimensional points
    def euclidean_distance(self, point1, point2):
        if len(point1) != len(point2):
            raise ValueError("points must have the same number of dimensions")
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

    # pick the most frequent label; ties go to the label seen first,
    # which is the one belonging to the nearest neighbour
    def pick_label(self, top_k_labels):
        if not top_k_labels:
            return None
        return Counter(top_k_labels).most_common(1)[0][0]

    def classify(self, point, sample_points, sample_labels):
        # pair each sample's distance to `point` with its label
        neighbours = [(self.euclidean_distance(s, point), label)
                      for s, label in zip(sample_points, sample_labels)]
        # sort once by distance and keep the labels of the k closest samples
        neighbours.sort(key=lambda x: x[0])
        top_k_labels = [label for _, label in neighbours[:self.k]]
        return self.pick_label(top_k_labels)
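
To show what the classifier computes, here is a small self-contained sketch of the same technique (Euclidean distance plus majority vote) on toy 2-D points. The function `knn_predict` and the sample points are hypothetical and exist only for illustration; they are not taken from the Amazon data.

```python
import math
from collections import Counter


def knn_predict(point, sample_points, sample_labels, k):
    """Label `point` by majority vote among its k nearest samples (Euclidean distance)."""
    distances = [
        (math.dist(sample, point), label)
        for sample, label in zip(sample_points, sample_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]


# hypothetical toy data: three points near the origin, two far away
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
labels = ['low', 'low', 'low', 'high', 'high']

near_origin = knn_predict((0.5, 0.5), points, labels, k=3)
far_away = knn_predict((5.0, 5.5), points, labels, k=3)
print(near_origin, far_away)  # low high
```

With `k=3`, the query near the origin sees three `low` neighbours, while the far query sees two `high` neighbours and one `low`, so the vote still returns `high`.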

##**Step 3** - Neural Network Implementation

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import keras
from sklearn.model_selection import train_test_split
from keras import Sequential
from keras.layers import Dense
from sklearn.metrics import accuracy_score
import time

#reference: https://towardsdatascience.com/building-our-first-neural-network-in-keras-bdc8abbc17f5

class NeuralNetworkClassifier:
    def classify(self, dataset):
        first = time.time()  # start time, used to report the total run time

        self.dataset = dataset  # keep a reference to the dataset

        # data_x holds the input features and data_y the target label.
        # Assumption: the feature columns are already numerically encoded and
        # 'asin' holds integer class indices; the raw ASIN strings and lists
        # from Step 1 must be encoded before this step.
        data_x = dataset[['similar_item', 'also_buy', 'also_view']].to_numpy(dtype='float32')
        data_y = keras.utils.to_categorical(dataset['asin'])

        # split the data; test_size = 0.8 holds out 80% of the rows for testing
        X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size = 0.8, random_state=0)

        # build the model sequentially: the output of each layer is the input to the next;
        # the input and output sizes are taken from the data
        model = Sequential()
        model.add(Dense(10, input_dim = data_x.shape[1], activation = 'relu'))
        model.add(Dense(5, activation='relu'))
        model.add(Dense(data_y.shape[1], activation='softmax'))

        # specify the loss function and optimizer
        model.compile(loss='categorical_crossentropy', optimizer = 'adam',
                      metrics=['accuracy'])

        # train the model
        history = model.fit(X_train, y_train, epochs = 10, batch_size = 64, verbose = 0)

        # check the accuracy: compare predicted and true class indices
        y_pred = model.predict(X_test)
        pred = np.argmax(y_pred, axis=1)
        test = np.argmax(y_test, axis=1)

        a = accuracy_score(test, pred)
        print('time for nn', time.time() - first)
        return a
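
`keras.utils.to_categorical` expects integer class indices, while the ASINs loaded in Step 1 are strings, so a label-encoding step is needed before the network can be trained. The sketch below shows one possible way to do this in plain Python. The helpers `encode_labels` and `one_hot` are hypothetical, and the sample ASINs are copied from the Step 1 output; this is an illustration of the idea, not the encoding the project actually used.

```python
def encode_labels(labels):
    """Map each distinct label to an integer index, assigned in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def one_hot(indices, num_classes):
    """Turn integer class indices into one-hot rows, like to_categorical does."""
    return [
        [1.0 if i == idx else 0.0 for i in range(num_classes)]
        for idx in indices
    ]


# sample ASINs taken from the Step 1 output
asins = ['B001GXRQW0', 'BT00DC6QU4', 'B001GXRQW0']
indices, classes = encode_labels(asins)
encoded = one_hot(indices, len(classes))
print(indices)  # [0, 1, 0]
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

Keeping the `classes` list makes it possible to map a predicted index back to its ASIN with `classes[predicted_index]`.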



