#**Big Data Application in E-commerce**
##——Customer Recommendation Strategy
###Project Contributor : **Tao Liu**

---

In this project, the [Amazon Review Data](http://deepyeti.ucsd.edu/jianmo/amazon/index.html) is used as the database. The [User-Based Collaborative Filtering](https://medium.com/sfu-cspmp/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0) method is applied to recommend items that customers may want to purchase next.

The features for this strategy are the customers' **Also View Records** and **Also Buy Records**.

##**Step 0** - Package import
---
All the packages will be imported here.

In [10]:
import numpy as np
import sklearn
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen
import requests
import array
import collections
from sklearn.metrics.pairwise import cosine_similarity
from operator import itemgetter

##**Step 1** - Data Implementation
---

We start by downloading the Amazon Review Data metadata for the Gift Cards category, which provides the raw database in the local environment.



In [3]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Gift_Cards.json.gz

--2020-11-18 13:58:48--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Gift_Cards.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 380174 (371K) [application/octet-stream]
Saving to: ‘meta_Gift_Cards.json.gz.1’


2020-11-18 13:58:48 (1.39 MB/s) - ‘meta_Gift_Cards.json.gz.1’ saved [380174/380174]







The raw database is shown below:

In [4]:
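def parse(path):
    # Yield each record of the gzipped file as a strict-JSON string.
    with gzip.open(path, 'r') as g:
        for l in g:
            # NOTE: eval executes arbitrary code; only use it on trusted data.
            yield json.dumps(eval(l))
def getDF(path):
    # Collect the parsed records into a single-column DataFrame.
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')
def raw_database():
    # Write the strict-JSON records to disk, then load them as a DataFrame.
    with open("output.strict", 'w') as f:
        for l in parse("meta_Gift_Cards.json.gz"):
            f.write(l + '\n')
    df = getDF('meta_Gift_Cards.json.gz')
    return df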

#running the function 
df = raw_database()
print("THE RAW DATABASE ")
print("The size of DATABASE :", len(df))
print(df)


THE RAW DATABASE 
The size of DATABASE : 1547
                                                      0
0     {"category": ["Gift Cards", "Gift Cards"], "te...
1     {"category": ["Gift Cards", "Gift Cards"], "te...
2     {"category": ["Gift Cards", "Gift Cards"], "te...
3     {"category": ["Gift Cards", "Gift Cards"], "te...
4     {"category": ["Gift Cards", "Gift Cards"], "te...
...                                                 ...
1542  {"category": ["Gift Cards", "Gift Cards"], "te...
1543  {"category": ["Gift Cards", "Gift Cards"], "te...
1544  {"category": ["Gift Cards", "Gift Cards"], "te...
1545  {"category": ["Gift Cards", "Gift Cards"], "te...
1546  {"category": ["Gift Cards", "Gift Cards"], "te...

[1547 rows x 1 columns]
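
Each record above is still a JSON string stored in a single column. As a minimal sketch, assuming every line of the gzipped file is strict JSON, the records could instead be decoded with `json.loads` so that fields such as `asin` are directly accessible. The helper name `load_records` and the two sample records are hypothetical; they are not part of the notebook or the dataset.

```python
import gzip
import json
import os
import tempfile

def load_records(path):
    """Decode one JSON object per line from a gzipped text file."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Build a tiny stand-in file so the sketch runs without a download.
sample = [
    {"asin": "A1", "also_view": ["A2"], "also_buy": []},
    {"asin": "A2", "also_view": [], "also_buy": ["A1"]},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as handle:
    for record in sample:
        handle.write(json.dumps(record) + "\n")

records = load_records(tmp_path)
print(len(records), records[0]["asin"])
```

Working with decoded dictionaries avoids calling `eval` on file contents, at the cost of requiring the input to be valid JSON.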


##**Step 2** - Data Cleaning and Reformatting
---
Because the raw data identifies items by alphanumeric ASINs such as **"B00H5BPJQC"** rather than numeric IDs such as **"1"**, the identifiers must be remapped to integers so that they can serve as row and column indices in the later steps.

Only a limited set of attributes (**"asin", "also_buy", "also_view"**) is kept, chosen for their relevance to the customer recommendation strategy.
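
The remapping idea can be illustrated on a toy example. This is a sketch, not the notebook's implementation: the ASIN strings and the helper `index_asins` are made up for illustration. Catalogue items are numbered in order of appearance, and any ASIN that is only referenced by another item receives the next free integer (the notebook's own function starts these extra IDs at 2000 instead).

```python
def index_asins(records):
    """Map every ASIN to an integer index.

    Catalogue items come first (0, 1, 2, ...); ASINs that only appear in
    an also_view / also_buy list get the next free index.
    """
    asin_to_index = {}
    for record in records:
        asin_to_index.setdefault(record["asin"], len(asin_to_index))
    for record in records:
        for key in ("also_view", "also_buy"):
            for asin in record.get(key, []):
                asin_to_index.setdefault(asin, len(asin_to_index))
    return asin_to_index

# Hypothetical records with made-up ASINs.
records = [
    {"asin": "B00AAA", "also_view": ["B00BBB", "B00ZZZ"], "also_buy": []},
    {"asin": "B00BBB", "also_view": [], "also_buy": ["B00AAA", "B00YYY"]},
]
asin_to_index = index_asins(records)
print(asin_to_index)
```

Numbering the extra ASINs contiguously keeps every index smaller than the size of the dictionary, which is convenient when the indices are later used as list positions.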

In [5]:
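def build_database(asin_dict):
    # Rewrite the strict-JSON dump, then load the single column of JSON strings.
    with open("output.strict", 'w') as f:
        for l in parse("meta_Gift_Cards.json.gz"):
            f.write(l + '\n')
    df = getDF('meta_Gift_Cards.json.gz')[0]
    # ASINs that are referenced but absent from the catalogue get new IDs starting at 2000.
    n = 2000
    #also_view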
    also_view_dict ={}
    also_view = []
    for i in df:
          dictionary = json.loads(i)
          asin =dictionary['asin']
          asin = asin_dict[asin]
          also_view_raw = dictionary['also_view']
          sub_list1=[]
          for j in also_view_raw:
                try:
                    sub_list1.append(asin_dict[j])
                    also_view_dict.setdefault(asin_dict[j],[]).append(asin)
                except KeyError:  # ASIN not in the catalogue yet: assign a new ID
                    sub_list1.append(n)
                    also_view_dict.setdefault(n,[]).append(asin)
                    asin_dict[j] = n
                    n = n+1
          also_view.append(sub_list1)
    #also_buy
    also_buy_dict ={}
    also_buy = []
    for i in df:
          dictionary = json.loads(i)
          asin =dictionary['asin']
          asin = asin_dict[asin]
          also_buy_raw = dictionary['also_buy']
          sub_list2=[]
          for k in also_buy_raw:
                try:
                    sub_list2.append(asin_dict[k])
                    also_buy_dict.setdefault(asin_dict[k],[]).append(asin)
                except KeyError:  # ASIN not in the catalogue yet: assign a new ID
                    sub_list2.append(n)
                    also_buy_dict.setdefault(n,[]).append(asin)
                    asin_dict[k] = n
                    n = n+1
          also_buy.append(sub_list2)  
    return  also_view,also_buy,also_view_dict,also_buy_dict,asin_dict
def reformat_asin(df):
    asin_dict = {}
    n = 0
    for i in df[0]:
          dictionary = json.loads(i)
          asin = dictionary['asin']
          asin_dict[asin] = n
          n = n+1
    return asin_dict
# running the function    
asin_dict = reformat_asin(df)
also_view,also_buy,also_view_dict,also_buy_dict,asin_dict = build_database(asin_dict)

##**Step 3** - Data Presentation
---
Here is a brief presentation of the reformatted dataset.

In [6]:
def list_presentation(items,start_number, end_number):
  # Works for both lists (IndexError) and dictionaries (KeyError).
  for i in range(start_number,end_number):
    try:
      print("Item ",i,": ",items[i])
    except (IndexError, KeyError):
      print("Item ",i, ":  does not have sub_items in it")

print(len(asin_dict)," Lines of Reformatted ASIN Reference dictionary")
print(asin_dict)
print("---------------------------------------------------")
print("also_view List: The first 20 lines")
list_presentation(also_view,0,20)
print("---------------------------------------------------")
print("also_buy List: The first 20 lines")
list_presentation(also_buy,0,20)
print("---------------------------------------------------")
print("also_view Dictionary: Item 2000 to 2020")
list_presentation(also_view_dict,2000,2020)
print("---------------------------------------------------")
print("also_buy Dictionary: Item 2000 to 2020")
list_presentation(also_buy_dict,2000,2020)


4561  Lines of Reformatted ASIN Reference dictionary
{'B001BKEWF2': 0, 'B001GXRQW0': 1, 'B001H53QE4': 2, 'B001H53QEO': 3, 'B001KMWN2K': 4, 'B001M1UVQO': 5, 'B001M1UVZA': 6, 'B001M5GKHE': 7, 'B002BSHDJK': 8, 'B002DN7XS4': 9, 'B002H9RN0C': 10, 'B002MS7BPA': 11, 'B002NZXF9S': 12, 'B002O018DM': 13, 'B002O0536U': 14, 'B002OOBESC': 15, 'B002PY04EG': 16, 'B002QFXC7U': 17, 'B002QTM0Y2': 18, 'B002QTPZUI': 19, 'B002SC9DRO': 20, 'B002UKLD7M': 21, 'B002VFYGC0': 22, 'B002VG4AR0': 23, 'B002VG4BRO': 24, 'B002W8YL6W': 25, 'B002XNLC04': 26, 'B002XNOVDE': 27, 'B002YEWXZ0': 28, 'B002YEWXMI': 29, 'B003755QI6': 30, 'B003CMYYGY': 31, 'B003NALDC8': 32, 'B003XNIBTS': 33, 'B003ZYIKDM': 34, 'B00414Y7Y6': 35, 'B0046IIHMK': 36, 'B004BVCHDC': 37, 'B004CG61UQ': 38, 'B004CZRZKW': 39, 'B004D01QJ2': 40, 'B004KNWWPE': 41, 'B004KNWWP4': 42, 'B004KNWWR2': 43, 'B004KNWWRC': 44, 'B004KNWWT0': 45, 'B004KNWWRW': 46, 'B004KNWWQ8': 47, 'B004KNWWNG': 48, 'B004KNWWPO': 49, 'B004KNWWXQ': 50, 'B004KNWWUE': 51, 'B004KNWWYU': 52, 'B

##**Step 4** - Recommendation Rate Calculation
---
An example of the recommendation rate calculation:

If Customer A purchases Item 1, and Item 2 appears in the also_view or also_buy list of Item 1, then Item 2's similarity score with respect to Item 1 is increased.

If Item 2 appears in the also_view record of Item 1, 2 is added to the similarity score.

If Item 2 appears in the also_buy record of Item 1, 10 is added to the similarity score.

If Item 2 appears in both records, 12 is added to the similarity score.

Otherwise, the similarity score is set to a mean value: the average of the lengths of Item 1's also_view and also_buy lists.
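
The rule above can be sketched for a single purchased item on toy data. This is a simplified illustration rather than the notebook's function: `score_row` and its arguments are hypothetical names, the weights are parameters whose defaults mirror the constants used in the notebook's code (2 for also_view, 10 for also_buy), and the fallback for untouched entries is assumed to be the average of the two list lengths.

```python
def score_row(n_items, viewed, bought, view_weight=2, buy_weight=10):
    """Score every item index against one purchased item."""
    row = [0] * n_items
    for j in viewed:
        row[j] += view_weight
    for j in bought:
        row[j] += buy_weight
    # Entries that received no score get the fallback value (an assumption).
    fallback = (len(viewed) + len(bought)) / 2
    return [value if value != 0 else fallback for value in row]

# The purchased item lists items 1 and 2 in also_view and items 2 and 3 in also_buy.
row = score_row(4, viewed=[1, 2], bought=[2, 3])
print(row)
```

Item 2 appears in both lists, so it collects both weights, while item 0 appears in neither and receives the fallback value.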

The recommendation rate calculation function is shown below:


In [7]:
def itemtoitem_relationship(also_view,also_buy,also_view_dict,also_buy_dict,asin_dict):
  matrix =[]
  for itemNO in range(len(also_view)):  # one row per catalogue item
      asin_list = [0]*len(asin_dict)
      for asin in also_view[itemNO]:
          try:
            asin_list[asin] = asin_list[asin]+2
          except:
            print(len(asin_list))
          try:
              for i in also_view_dict[asin]:
                asin_list[i] = asin_list[i]+2
          except:
              pass
      for asin in also_buy[itemNO]:
          try:
               for i in also_buy_dict[asin]:
                asin_list[i] = asin_list[i]+10
          except:
              pass
      try:
        avg = (len(also_view[itemNO]) + len(also_buy[itemNO]))/2
      except:
        avg = 0
      asin_list = [i if i != 0 else avg for i in asin_list]
      matrix. append(asin_list)
  return matrix
matrix = itemtoitem_relationship(also_view,also_buy,also_view_dict,also_buy_dict,asin_dict)

print("The matrix represents for the recommendation rate: The first 20 lines")
list_presentation(matrix,0,20)

The matrix represents for the recommendation rate: The first 20 lines
Item  0 :  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0

##**Step 5** - Recommendation Strategy Similarity Calculation
Here, the [Cosine Similarity Strategy](https://medium.com/sfu-cspmp/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0) is used to compute the similarity between the rows of the recommendation rate matrix.
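
For reference, the cosine similarity of two score vectors is their dot product divided by the product of their lengths. The following pure-Python sketch shows the computation on made-up vectors; the helper name `cosine_of` is hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here for the sketch.

```python
import math

def cosine_of(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_u * norm_v)

# Two rows pointing in the same direction, and two rows with no overlap.
same_direction = cosine_of([2, 0, 10], [4, 0, 20])
orthogonal = cosine_of([2, 0, 0], [0, 10, 0])
print(same_direction, orthogonal)
```

Rows that score the same items in the same proportions yield a value close to 1, whereas rows with no scored items in common yield 0.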


In [8]:

cosine = cosine_similarity(matrix)
np.fill_diagonal(cosine, 0 )
print("A shortcut for the similarity table will be the following:")
similarity_example =pd.DataFrame(cosine)
similarity_example.head()


A shortcut for the similarity table will be the following:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,1507,1508,1509,1510,1511,1512,1513,1514,1515,1516,1517,1518,1519,1520,1521,1522,1523,1524,1525,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535,1536,1537,1538,1539,1540,1541,1542,1543,1544,1545,1546
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.928569,0.0,0.0,0.0,0.0,0.0,0.873242,0.0,0.932718,0.895627,0.718031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.957735,0.766296,0.806564,0.940168,0.766296,0.820201,0.879648,0.863465,0.0,0.0,0.901829,0.9105,0.788548,0.814934,0.787332,0.887501,0.821184,0.856774,0.834135,0.791907,0.783412,0.904253,0.910979,0.678149,0.904549,0.898554,0.538294,0.529602,0.549931,0.564857,0.552024,0.728267,0.563946,0.431074,0.0,0.551786,0.366387,0.567856,0.876188
2,0.0,0.928569,0.0,0.0,0.0,0.0,0.0,0.0,0.91606,0.0,0.930553,0.89268,0.736485,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.950722,0.847784,0.892851,0.851972,0.847784,0.904136,0.931512,0.921625,0.0,0.0,0.944907,0.809825,0.90011,0.911231,0.858773,0.953365,0.907724,0.933473,0.921867,0.902232,0.898392,0.948418,0.957183,0.747432,0.949056,0.954262,0.653532,0.646918,0.664976,0.687954,0.672772,0.834744,0.68325,0.516743,0.0,0.673049,0.454307,0.69012,0.93301
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##**Step 6** - Recommendation Strategy Showcase
The strategy is showcased below.

For example, suppose Customer A has purchased the gift cards ["B00H5BPJQC","B00GOLH1LK","B002NZXF9S"]. The system can then use the cosine similarity from the previous step to compute a list of recommended items for the customer.

In [26]:
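# Item-to-item relationship matrix (same computation as in Step 4).
trainingdata  = itemtoitem_relationship(also_view,also_buy,also_view_dict,also_buy_dict,asin_dict)
database = ["B00H5BPJQC","B00GOLH1LK","B002NZXF9S"]
def recommendation_based_on_purchased(database,n):
  # Start every catalogue item with the same base score.
  possibility_list =[1]*len(cosine)
  for i in database:
    internal_code = asin_dict[i]
    # Add the similarity of every item to this purchased item.
    possibility_list = [a+b for a, b in zip(possibility_list, cosine[internal_code])]
  # Indices of the n highest-scoring items, best first.
  index_sorted = np.argsort(possibility_list)[::-1][:n]
  # Map the internal indices back to ASINs.
  index_to_asin = {index: asin for asin, index in asin_dict.items()}
  return [index_to_asin[index] for index in index_sorted]
recommendation_based_on_purchased(database,3)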

['B00KF5MK08', 'B00NM4IXMS', 'B00I0DMP98']
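
One possible refinement, sketched here on a made-up 4x4 similarity table, is to exclude items the customer already owns from the ranking. The function `recommend` and the toy data are hypothetical and are not part of the notebook.

```python
def recommend(similarity, purchased, n):
    """Rank unpurchased items by their summed similarity to the purchased items."""
    scores = [0.0] * len(similarity)
    for item in purchased:
        for other, value in enumerate(similarity[item]):
            scores[other] += value
    # Only items the customer does not already own are candidates.
    candidates = [i for i in range(len(scores)) if i not in purchased]
    # Highest score first; ties are broken by the lower index.
    candidates.sort(key=lambda i: (-scores[i], i))
    return candidates[:n]

# Toy symmetric similarity table with a zero diagonal.
similarity = [
    [0.0, 0.9, 0.1, 0.4],
    [0.9, 0.0, 0.2, 0.8],
    [0.1, 0.2, 0.0, 0.3],
    [0.4, 0.8, 0.3, 0.0],
]
top = recommend(similarity, purchased=[0, 1], n=2)
print(top)
```

Here item 3 is ranked ahead of item 2 because it is more similar to both purchased items.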