## Item - Item Collaborative Filtering Model
 I started this project assuming that a customer-based collaborative filtering model was the way to go, but gradually realised that item-item filtering would work better, for the following reasons:
 1. The number of products a company has is finite, while the number of customers is potentially unlimited.
 2. When the products are visualised as a network of connections, they cluster by category and other product attributes.
 3. Most importantly, this particular dataset has about 91,000 unique customers and only about 31,000 unique products. This means:
   - more connections between the products, because each product is purchased more frequently
   - roughly 30 times shorter calculation time for the full product set (64 days versus 1,895)
   - smaller file size for storage
  
  The methodology is otherwise essentially the same as for the customer-based model.
  
This notebook goes through the steps needed to create the model and any code for the accompanying app:
1. Importing Libraries and data
2. TFIDF Vectorizer
3. Singular Value Decomposition
4. Cosine Similarity Calculation
5. Making Recommendations
6. Network Visualisation
7. Web Application

#### Import Relevant Libraries and load data

In [4]:
import numpy as np
import seaborn as sns
import pandas as pd
from scipy import stats
import barnum
from pprint import pprint
import json

from sklearn.metrics import jaccard_score

from collections import Counter

import matplotlib
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
%matplotlib inline
# import warnings

plt.style.use('fivethirtyeight')

In [2]:
df = pd.read_csv('./datasets/cleaned-df.csv')

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from tqdm import tqdm

#### Create initial product and customer lists for formatting

In [4]:
cust_names = list(df.customer_name.unique())
prod_names = list(df.product_name.unique())

In [5]:
prod_sales = df.apply(lambda x: list([x['customer_name'], x['product_name']]),axis=1) 

In [6]:
#creation of a dictionary with all the products and the customers that purchased them
prod_cust = {}
for cust, prod in prod_sales:
    prod_cust.setdefault(prod, set()).add(cust)

In [8]:
prod_lists = {k:list(prod_cust[k]) for k in list(prod_cust)}

In [10]:
#Adjust the dictionary to be able to be used by the TFIDF vectorizer
prod_vecs = pd.Series(prod_lists)
prod_vecs = prod_vecs.apply(lambda x: ', '.join(x))

In [11]:
prod_vecs.head()

limit Lisa           Jarod Morrow, Palmer Bankston, Genaro Cheatham...
archeology curler    Major Geer, Blake Thomason, Vikki Torrez, Inez...
aunt leg                  Carla Dellinger, Betty Nestor, Chris Baldwin
rotate macrame       Elden Eddy, Caren Whitt, Billie Millard, Laver...
apple lentil         Catharine Mancini, Tracy Dupre, Hazel Chester,...
dtype: object

#### Creation of TFIDF dataframe for comparison of products and the customers that purchased them

In [12]:
tfidf = TfidfVectorizer(lowercase = False, ngram_range=(1,2),vocabulary=cust_names)

In [13]:
prod_tfidfvecs = pd.DataFrame(tfidf.fit_transform(prod_vecs).todense(), columns=tfidf.get_feature_names())

In [14]:
prod_tfidfvecs.set_index(prod_vecs.index,inplace=True)

In [15]:
prod_tfidfvecs.head(10)

Unnamed: 0,Rodrigo Keefe,Julianna Queen,Palmer Bankston,Lupe Pettigrew,Genaro Cheatham,Rod Nesbitt,Jarod Morrow,Betty Nestor,Ellis Whittle,Major Geer,...,Armando Adams,Ursula Christie,Seymour Hutchens,Owen Curtin,Bernadette Burgess,Shad Stroud,Elvis Fisher,Elvia Bell,Jana Horsley,Hattie Olivas
limit Lisa,0.334724,0.334724,0.334724,0.334724,0.334724,0.334724,0.334724,0.321992,0.334724,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
archeology curler,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.288675,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aunt leg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.569743,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
rotate macrame,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
apple lentil,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ceramic barber,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
cafe advertisement,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gum chronometer,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
segment ocean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
grip decade,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
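
To make the role of the `vocabulary` and `ngram_range` settings more concrete, here is a minimal sketch on toy data. The customer names, products and variable names (`toy_customers`, `toy_docs`, `toy_tfidf`, `toy_matrix`) are invented for illustration and are not part of the dataset. It assumes the default scikit-learn tokenizer, under which a two-word name such as "Ann Lee" is matched as a bigram.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: three customers and three products
toy_customers = ["Ann Lee", "Bob Ray", "Cy Dee"]
toy_docs = [
    "Ann Lee, Bob Ray",  # product 0 was bought by Ann and Bob
    "Bob Ray",           # product 1 was bought by Bob only
    "Ann Lee, Cy Dee",   # product 2 was bought by Ann and Cy
]

# Same settings as the cell above: keep case, allow two-word terms,
# and restrict the columns to the known customer names.
toy_tfidf = TfidfVectorizer(lowercase=False, ngram_range=(1, 2),
                            vocabulary=toy_customers)
toy_matrix = toy_tfidf.fit_transform(toy_docs).toarray()

# One row per product, one column per customer
print(toy_matrix.round(3))
```

Each row is L2-normalised, so a product bought by a single customer gets a weight of 1 in that customer's column, and a customer who bought fewer products (Cy) gets a higher weight than one who bought more (Ann). One caveat under the default tokenizer: names containing hyphens or single-character parts would probably not match their vocabulary entry and would end up as all-zero columns.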


#### Now I want to do an SVD on the matrix to reduce the dimensionality and therefore create closer similarities

In [16]:
svd = TruncatedSVD(n_components=2000)

In [17]:
prod_svd = svd.fit(tfidf.fit_transform(prod_vecs)) 

print(prod_svd.explained_variance_ratio_.sum())
print(prod_svd.components_.shape)

0.10741621847906922
(2000, 91593)


#### While this SVD only explains about 10% of the variance, this is OK because most of the variance comes from products that are not related to each other, so the SVD is just capturing the parts that are
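
As a rough illustration of how the explained variance depends on `n_components`, the sketch below fits `TruncatedSVD` on a small random sparse matrix. The matrix, its size and its density are assumptions made up for this example (they are not the notebook's data), so the printed numbers say nothing about the real model; the point is only that the total grows as components are added and stays below 1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical sparse matrix: 200 "products" x 500 "customers", about 2% non-zero
dense = rng.random((200, 500)) * (rng.random((200, 500)) < 0.02)
toy_sparse = csr_matrix(dense)

# Total explained variance for two different numbers of components
explained = {}
for k in (10, 50):
    toy_svd = TruncatedSVD(n_components=k, random_state=0).fit(toy_sparse)
    explained[k] = toy_svd.explained_variance_ratio_.sum()
    print(k, round(explained[k], 3))
```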

In [18]:
prod_svds = pd.DataFrame(prod_svd.transform(prod_tfidfvecs))

In [19]:
#I want to round the values in the matrix to 5 decimal places so that the similarity calculations 
#will be more efficient and will hopefully not take as long to compute
prod_svds = prod_svds.apply(lambda x: round(x,5))

In [20]:
prod_svds['prod_names'] = prod_names

prod_svds.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1991,1992,1993,1994,1995,1996,1997,1998,1999,prod_names
0,-0.0,-0.0,-0.0,0.0,0.0,-1e-05,-0.0,-1e-05,2e-05,-2e-05,...,-0.00536,-0.00452,0.00215,0.00284,-0.00243,-0.00856,0.00256,-4e-05,0.00484,limit Lisa
1,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,...,-0.00365,0.00872,0.00173,0.00558,-0.00573,-0.00789,0.00236,0.00675,-1e-05,archeology curler
2,-0.0,-0.0,-0.0,0.0,-0.0,-1e-05,-0.0,-1e-05,2e-05,-2e-05,...,-0.00057,-0.00838,0.00461,0.00162,-0.00712,-0.01198,-0.00304,-0.00056,0.00569,aunt leg
3,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-1e-05,0.0,...,-0.00665,-0.00964,-0.01027,0.00301,-0.00603,-0.00968,0.00678,0.00973,0.00196,rotate macrame
4,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,-1e-05,...,0.00274,-0.0166,-0.00468,0.00674,0.00451,0.01203,-0.01105,-0.00731,-0.00966,apple lentil
5,-0.0,-0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,1e-05,1e-05,...,-0.00848,0.00748,0.00296,0.00739,0.00273,0.00053,-0.00458,-0.0032,-0.00022,ceramic barber
6,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-1e-05,-0.0,-0.0,...,-0.01435,-0.00282,0.00797,-0.00617,-0.00488,0.00358,0.00408,-0.00171,-4e-05,cafe advertisement
7,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,-0.0,-1e-05,-1e-05,...,-0.0106,-0.00116,0.00424,-0.00343,0.00282,0.0033,0.00839,0.00994,0.0069,gum chronometer
8,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,-0.0,...,0.00713,0.00563,0.00044,-0.00265,0.00438,-0.00613,0.00517,0.001,-0.00246,segment ocean
9,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-1e-05,0.0,-0.0,...,-0.01413,-0.0042,0.00572,0.00222,-0.00287,0.00156,-0.00056,0.0091,-0.00383,grip decade


In [21]:
from itertools import islice

In [22]:
cosine_similarity(prod_svds[prod_svds.prod_names == 'limit Lisa'].iloc[:,0:-1],
                            prod_svds[prod_svds.prod_names == 'aunt leg'].iloc[:,0:-1])[0][0]

0.9340011391985048

In [40]:
## for initializing the dictionary
# prod_cosine = {}

In [25]:
# for adding to the pre-saved dictionary
prod_cosine = data

#### I want to start with the most popular products for my model

In [26]:
col_counts = df['product_name'].value_counts()
top_100_prod = col_counts[0:100].index.to_list()

In [27]:
# this run processes the batch of products ranked 701 to 800 by purchase count
top_800_prod = col_counts[700:800].index.to_list()

#### Creation of a dictionary of the related cosine scores between a product and every other product 

In [28]:
for prod in tqdm(top_800_prod):
    cosine_scores = {}
    for n in prod_cust:
        cosine_scores[n] = cosine_similarity(prod_svds[prod_svds.prod_names == prod].iloc[:,0:-1],
                                             prod_svds[prod_svds.prod_names == n].iloc[:,0:-1])[0][0]
    prod_cosine[prod] = cosine_scores    

100%|██████████| 100/100 [5:57:38<00:00, 219.35s/it] 
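
The loop above calls `cosine_similarity` once per product pair, which is where most of the roughly 219 seconds per product goes. A possible alternative, sketched here on made-up data, is to pass the whole batch and the whole product matrix to `cosine_similarity` in a single call. The helper name `batch_cosine` and the toy frame `toy_svds` are hypothetical; the sketch assumes a frame laid out like `prod_svds`, with numeric component columns plus a `prod_names` column.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def batch_cosine(svds, batch):
    """Return {product: {other_product: cosine score}} for every product in batch."""
    features = svds.drop(columns='prod_names').to_numpy(dtype=float)
    names = svds['prod_names'].to_list()
    mask = svds['prod_names'].isin(batch).to_numpy()
    # one call: rows are the batch products, columns are all products
    sims = cosine_similarity(features[mask], features)
    batch_names = svds.loc[mask, 'prod_names'].to_list()
    return {b: dict(zip(names, map(float, row)))
            for b, row in zip(batch_names, sims)}

# Hypothetical stand-in for prod_svds: 6 products with 4 components each
toy_svds = pd.DataFrame(np.random.default_rng(0).normal(size=(6, 4)))
toy_svds['prod_names'] = ['p0', 'p1', 'p2', 'p3', 'p4', 'p5']

toy_cosine = batch_cosine(toy_svds, ['p1', 'p4'])
print(toy_cosine['p1'])
```

The result has the same nested-dictionary shape as `prod_cosine`, so in principle it could be merged in with `prod_cosine.update(...)`. I have not timed this against the full dataset.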


In [29]:
len(prod_cosine)

1077

* Saving to json for storage

In [31]:
with open('./datasets/prod_cosine.json', 'w') as fp:
    json.dump(prod_cosine, fp)

In [24]:
with open('./datasets/prod_cosine.json', 'r') as fp:
    data = json.load(fp)

In [7]:
cosined_products = list(data.keys())

#### Creation of the comparative weights for multiple item purchase history
* An issue to take into account with item-to-item collaborative filtering is whether to average the cosine similarities across a customer's purchases, or to take each purchased item's most similar products and recommend the highest scores.
* In this case, I have chosen to recommend the highest scores for each product rather than average them, because it makes more sense to me that a customer would want the products most similar to each item they have purchased, not something in the middle.
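
The difference between the two options can be seen on a small made-up example. The scores in `toy_scores` are invented for illustration: product C is very similar to purchased product A and unrelated to B, while product D is moderately similar to both.

```python
# Hypothetical similarity scores for two purchased products (A and B)
toy_scores = {
    'A': {'A': 1.0, 'B': 0.1, 'C': 0.9, 'D': 0.5},
    'B': {'A': 0.1, 'B': 1.0, 'C': 0.0, 'D': 0.5},
}
purchased = ['A', 'B']
candidates = [p for p in toy_scores['A'] if p not in purchased]

# Option 1: keep the highest score each candidate gets from any purchased product
max_scores = {c: max(toy_scores[p][c] for p in purchased) for c in candidates}

# Option 2: average the scores over the purchased products
mean_scores = {c: sum(toy_scores[p][c] for p in purchased) / len(purchased)
               for c in candidates}

best_by_max = max(max_scores, key=max_scores.get)
best_by_mean = max(mean_scores, key=mean_scores.get)
print(best_by_max, best_by_mean)  # prints: C D
```

Taking the maximum recommends C, the closest match to something the customer actually bought, while averaging prefers D, the "something in the middle" product.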

In [8]:
def combine_weights(prod_set, prod_dict = data):
    """
       Takes products and returns a dictionary with the highest similarity scores of those 
       products for each of the entire set of products
       
       Parameters:
       prod_set
         set of product names in string format
       prod_dict
         dictionary of all products with their cosine similarities to all other products
      
       Returns:
       dictionary
         a dictionary of the highest similarity scores for each product 
        """
    prods = list(prod_set)
    # copy the first product's scores so the stored dictionary is not modified in place
    weights = dict(prod_dict[prods[0]])
    for prod in prods[1:]:
        for k, v in weights.items():
            # keep the highest similarity seen so far for product k
            if prod_dict[prod][k] > v:
                weights[k] = prod_dict[prod][k]
    return weights

In [14]:
combine_weights(cust_prods['Betty Nestor'])

{'limit Lisa': 0.9999999999999998,
 'archeology curler': -0.004519703387372783,
 'aunt leg': 1.0000000000000004,
 'rotate macrame': -0.035217512630506626,
 'apple lentil': -0.058534911094064085,
 'ceramic barber': 0.006433362019971469,
 'cafe advertisement': -0.034776066172946686,
 'gum chronometer': 0.012080735240454117,
 'segment ocean': 0.0007895193125048326,
 'grip decade': -0.04078707084062931,
 'January physician': 0.4856142247973135,
 'stop radio': 0.07077637918057055,
 'back cowbell': -0.017302014525895096,
 'butane crocodile': -0.037669713925346265,
 'rose Abyssinian': 0.005172991704375899,
 'broker ethernet': 0.12178199084515486,
 'battery letter': -0.03590236453550776,
 'cause hedge': 0.010672384895336708,
 'idea meteorology': 0.012506388976320608,
 'mother house': -0.012925457517473003,
 'Brian Sagittarius': 0.03999130625069794,
 'port captain': -0.00021173816593716283,
 'agreement pentagon': -0.023373521789304488,
 'path smash': 0.006953959760698738,
 'metal lyric': 0.0102

#### Creation of the recommendations

In [15]:
def recommend_for_prods(prod_set, n_recs = 5):
    """
    Return the top n_recs recommended products (five by default)
    based on the products in the customer's previous purchases.
    
    Parameters:
    prod_set
      set of products the customer has previously purchased, of any length
    n_recs
      integer giving the number of recommendations required
      
    Returns:
      list of recommended products, most similar first
    
    """
    #pulling out the max weights of each product against this set
    weights = combine_weights(prod_set)

    #sorting the weighted dictionary by the weight    
    sorted_w = sorted(weights.items(), key=lambda kv: kv[1], reverse=True) 

    list_prods = []
    
    #iterating through the sorted weights to collect n_recs products
    for name, score in sorted_w:
        if len(list_prods) == n_recs:
            break
        #skip perfect matches (the purchased products themselves) and anything already purchased
        if round(score, 5) == 1 or name in prod_set:
            continue
        list_prods.append(name)
                                       
    return list_prods
    

In [31]:
rec_products = recommend_for_prods({'grip decade', 'back cowbell', 'aunt leg','January physician','limit Lisa'})

In [192]:
recommend_for_prods(cust_prods['Betty Nestor'])

{'French jewel',
 'January physician',
 'cell basketball',
 'hyena peer-to-peer',
 'touch panty'}

##### Creation of a dictionary containing all the customers and the products they have previously purchased, so that a customer's name can be used to look up their purchases if needed

In [13]:
cust_prods = {}
for cust, prod in prod_sales:
    cust_prods.setdefault(cust, set()).add(prod)

#### Setting up network data for this model
 For a network model you need to provide the nodes (each product represented as a point) and the edges (the connections between them; in this case, the customers who have purchased both items).
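
Before running this on the real data, the idea can be sketched on a tiny made-up purchase table. The frame `toy` and its values are invented for illustration: joining the table with itself on the customer column pairs up every two products bought by the same customer, and counting those pairs gives the edge weights.

```python
import pandas as pd

# Hypothetical purchases: customer x bought A, B and C; customer y bought A and B
toy = pd.DataFrame({
    'product_name':  ['A', 'B', 'C', 'A', 'B'],
    'customer_name': ['x', 'x', 'x', 'y', 'y'],
})

# Self-join on the customer: one row per (product, product_2) pair per customer
pairs = toy.merge(toy.rename(columns={'product_name': 'product_name_2'}),
                  on='customer_name')

# Keep each unordered pair once and drop self-connections
pairs = pairs[pairs['product_name'] < pairs['product_name_2']]

# Edge weight = number of customers who bought both products
edges = (pairs.groupby(['product_name', 'product_name_2'])['customer_name']
              .count()
              .reset_index())
print(edges)
```

Here A and B end up connected with a weight of 2 (both customers bought them), while A-C and B-C each get a weight of 1.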

In [420]:
products_df = df.loc[df['product_name'].isin(cosined_products)]

In [421]:
products_df.shape

(33681, 32)

In [422]:
column_edge = 'customer_name'
column_ID = 'product_name'

# select columns, remove NaN
data_to_merge = products_df[[column_ID, column_edge]].dropna(subset=[column_edge]).drop_duplicates() 

# To create connections between products that have been purchased by the same person,
# join the data with itself on the customer (edge) column.
data_to_merge = data_to_merge.merge(
    data_to_merge[[column_ID, column_edge]].rename(columns={column_ID:column_ID+"_2"}), 
    on=column_edge)

In [423]:
data_to_merge.head()

Unnamed: 0,product_name,customer_name,product_name_2
0,limit Lisa,Rodrigo Keefe,limit Lisa
1,limit Lisa,Julianna Queen,limit Lisa
2,limit Lisa,Palmer Bankston,limit Lisa
3,limit Lisa,Lupe Pettigrew,limit Lisa
4,limit Lisa,Genaro Cheatham,limit Lisa


In [424]:
# By joining the data with itself, products will have a connection with themselves.
# Remove self connections, to keep only connected products which are different.
d = data_to_merge[~(data_to_merge[column_ID]==data_to_merge[column_ID+"_2"])] \
    .dropna()[[column_ID, column_ID+"_2", column_edge]]
    
# To avoid counting each connection twice (product 1 connected to product 2 and product 2 connected to product 1),
# we force the first ID to be "lower" than ID_2
d.drop(d.loc[d[column_ID+"_2"]<d[column_ID]].index.tolist(), inplace=True)

In [425]:
d.head()

Unnamed: 0,product_name,product_name_2,customer_name
9,aunt leg,limit Lisa,Betty Nestor
20,January physician,aunt leg,Chris Baldwin
46,cafe advertisement,grip decade,Joni Bennett
53,cafe advertisement,grip decade,Jerry Mccarty
84,back cowbell,grip decade,Nina Cooper


In [426]:
flourish_edges = d.groupby([column_ID, column_ID+'_2'])[column_edge].count().reset_index()

In [427]:
flourish_edges.head()

Unnamed: 0,product_name,product_name_2,customer_name
0,America passbook,button Romania,1
1,America passbook,deposit Bangladesh,1
2,Barbara motorcycle,cattle meat,1
3,CD owner,scraper Maria,1
4,Carol ease,alley buffet,1


In [428]:
flourish_edges.shape

(389, 3)

##### I only want to graph the products that are connected to others as an example of the connections. Otherwise, I will have too many products in my graph

In [429]:
nodes_list = flourish_edges['product_name'].to_list()

In [430]:
nodes_list.extend(flourish_edges['product_name_2'].to_list())

In [431]:
nodes_list = list(dict.fromkeys(nodes_list))

In [432]:
len(nodes_list)

393

In [433]:
flourish_edges.to_csv(r'./datasets/network_edges.csv', index=False)

In [434]:
flourish_nodes = df[df['product_name'].isin(nodes_list)][['product_name','category','brand_name','price']]\
                                             .drop_duplicates(subset=['product_name'],keep='first')

In [435]:
flourish_nodes.shape

(393, 4)

In [436]:
flourish_nodes.to_csv(r'./datasets/network_nodes.csv', index=False)

##### Trialling pieces of code for my web app

In [56]:
product_1 = str('grip decade')
product_2 = str('')
product_3 = str('aunt leg')
product_4 = str()
product_5 = str('limit Lisa')

In [57]:
product = {product_1,product_2,product_3,product_4,product_5}
products = set()
for n in product:
    if n != '':
        products.add(n)

In [58]:
products

{'aunt leg', 'grip decade', 'limit Lisa'}

In [None]:
hello = {'grip decade', 'back cowbell', 'aunt leg','January physician','limit Lisa'}

In [None]:
type(hello)

In [None]:
recommend_for_prods(hello)

In [None]:
rec_products

In [None]:
final_recs = {}
for n in range(1, (len(rec_products)+1)):
        final_recs['rec_prod_'+str(n)] = rec_products[n-1]

In [None]:
final_recs

#### Collating images for each of the products I will use in the app demo

In [69]:
hello = {'karate tugboat'}

In [70]:
recommend_for_prods(hello)

['margin balance', 'deal mirror', 'fan draw', 'mayonnaise mom', 'text Italy']

In [11]:
prod_images = {'broker ethernet':"http://wiznetmuseum.com/wp/wp-content/uploads/2018/03/106-300x183.jpg",
          'baseball step-grandmother':"https://i.etsystatic.com/14835601/r/il/f8cb7e/2054134143/il_794xN.2054134143_htjh.jpg",
          'delivery order':"https://www.dinnerfactory.ca/wp-content/uploads/2018/07/dinner-factory-delivery.png",
          'epoch bat':"https://www.clipartwiki.com/clipimg/detail/11-113504_cute-bat-clipart-cute-bat-clipart-png.png",
          'cafe advertisement': "https://public-media.si-cdn.com/filer/20120718075006coffee_stepheye.jpg",
          'duckling court': "https://i.kinja-img.com/gawker-media/image/upload/s--5OWqTQeR--/c_scale,f_auto,fl_progressive,q_80,w_800/dk1kxw5lrjpdx4alhjud.jpg",
          'grip decade': "https://cdn.shopify.com/s/files/1/0095/7552/products/mcc-new-decade-7colors_grande.jpg?v=1534546352",
          'back cowbell': "https://i.pinimg.com/originals/13/8d/fa/138dfaf91ad2694eab65e8c89253bb19.jpg",
          'aunt leg': "https://ctl.s6img.com/society6/img/qR5kUo42dezjcO774h570hlyQxY/w_700/leggings/medium/front/~artwork,fw_7503,fh_9001,fx_-274,fy_-1300,iw_9000,ih_9000/s6-original-art-uploads/society6/uploads/misc/39bf0cb86f994ff18a88dfacc5b5b752/~~/aunt-smiling-leggings.jpg",
          'limit Lisa' : "https://media1.popsugar-assets.com/files/thumbor/IAIBqKwqZX4M8aIImjr2hu9KR4Y/fit-in/2048xorig/filters:format_auto-!!-:strip_icc-!!-/2018/12/10/940/n/1922283/1c694d4c5c0edbcdf0bee0.07759189_lisa_simpson_pin/i/Lisa-Simpson-Pin.jpg",
          'nerve motorboat': "https://www.flemingyachts.com/photos/yachts/large/711.jpg",
          'motion cave': "https://s.abcnews.com/images/Health/180709_vod_orig_cavedisease_hpMain_16x9_992.jpg",
          'sphere room': "https://i.pinimg.com/736x/1b/5e/b7/1b5eb76969c882fcd9981f64ea36addd--pinterest-account.jpg",
          'offer ear': "https://fiverr-res.cloudinary.com/images/t_main1,q_auto,f_auto/gigs/110627907/original/6f346bcbc8cb84a777719e392438483d13ddbc7a/lend-a-listening-ear-and-offer-advice-if-there-is-a-need.png",
          'picture toad': "https://www.abc.net.au/news/image/131608-3x2-700x467.jpg",
          'January physician': "https://www.myhaliburtonnow.com/wp-content/uploads/2016/02/doctor-health-wellness.jpg",
          'rayon enemy': "https://shop.r10s.jp/palm-nut/cabinet/04917692/06046630/imgrc0074368302.jpg",
          'comparison Mexico': "https://i.redd.it/1ywm6fflfpq21.jpg",
          'Kenneth pipe': "http://credo.library.umass.edu/images/resize/600x600/mums887-s03-f19-i001-001.jpg",
          'margin balance': "https://www.itaubba-economia.com/cdn/uf/images/Imagem4(610).png",
          'router cup': "http://www.thecivilian.co.nz/wp-content/uploads/2019/08/elderlyrugbyfeature-300x205.jpg",
          'Norwegian tomatoes': "https://image.sciencenorway.no/1554342.jpg?imageId=1554342&width=353&height=265",
          'William llama': "https://www.yellowstonesafari.com/modules/mod_btimagegallery/images/original/536fbf0469ea8777708cbd6b2c4a465b.jpg",
          'eggnog bathtub': "https://activerain.com/image_store/uploads/4/3/7/5/8/ar122998894885734.jpg",
          'bicycle particle': "https://media.istockphoto.com/vectors/cyclist-rides-a-bicycle-particle-divergent-silhouette-vector-id825171206",
          'deal mirror': "https://assets.loaf.com/images/product_800/2282480-real-deal-brass-round-mirror.jpg",
          'fan draw': "https://www.miyaguikenbrasil.com/upload/2019/03/07/table-fan-drawing-architectural-drawings-of-ceiling-fans-l-f14f7442658f9edb.png",
          'mayonnaise mom': "https://cdn10.phillymag.com/wp-content/uploads/sites/3/2018/08/mayonnaise-industry-millennials.jpg",
          'text Italy': "https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Banner_of_Giovine_Italia.png/200px-Banner_of_Giovine_Italia.png",
          'feeling biology': "https://cdn.the-scientist.com/assets/articleNo/32941/iImg/7018/afab1467-56c2-4623-8e75-b1ba1d274129-feature21.jpg",
          'spring invoice': "https://i.ebayimg.com/images/g/JGkAAOSwHhRb5gRm/s-l1600.jpg",
          'thunder random': "http://www.dumpaday.com/wp-content/uploads/2018/03/photos-14-10.jpg",
          'tuna swordfish': "https://caperfrasers.files.wordpress.com/2010/08/tuna1.jpg",
          'pajama dinner': "http://4.bp.blogspot.com/-WAd2LtjVuHw/TvU6O2Fp3HI/AAAAAAAAEko/9na37ZrpfKA/s400/IMG_1992.JPG",
          'pink poet': "https://cdn.shopify.com/s/files/1/2376/4853/products/c.jpg?v=1556234864",
          'rule kayak':"https://i2.wp.com/www.paddlinglight.com/pl/wp-content/uploads/2016/08/hansel_bryan_051016-215.jpg?w=960&ssl=1",
          'ball authorization': "https://tshop.r10s.jp/matsucame/cabinet/sam/tsu6322-sam.jpg?fitin=330:330",
          'barge Siamese': "https://grangerprints.printstoreonline.com/image/497/7499435/7499435_600_450_676_0_fit_0_66f9b956dc0363e92e291fffb02b4f3e.jpg",
          'wallaby pig': "https://external-preview.redd.it/O0mqZ57Fwe_iuWBkNF4YVl0n9L6kZyLYdmeL0pTNMZw.jpg?auto=webp&s=ce0bc6723e5da4815ed5c2088e5b09807bf63321",
          'root badge': "https://cdn.shopify.com/s/files/1/1545/2923/products/root-red-logo-badge-161207_500x.jpg?v=1481150915",
          'samurai brow': "https://cdn.asiatatler.com/asiatatler/i/hk/2018/11/06204003-story-image-10262_cover_1000x616.jpg",
          'creditor hour':"https://www.experian.com/blogs/insights/wp-content/uploads/2018/03/Blog-credit-freeze-930x420.jpg",
          'karate tugboat': "https://m.media-amazon.com/images/M/MV5BMDI2NWQxOGUtNWM5Yi00YjE5LTg2ZjctOThmM2Y0NDAwNGVkXkEyXkFqcGdeQXVyMjM1ODU5MDU@._V1_UY268_CR43,0,182,268_AL_.jpg",
          'stocking insulation': "http://www.ybsinsulation.com/wp-content/uploads/2017/07/brands2-244x300.jpg",
          'hyena peer-to-peer': "https://www.pbs.org/wgbh/nova/media/images/Coalition_photo_credit_Kate_Yoshida_kIuFHhE.width-1500.jpg",
          'French jewel': "https://1.bp.blogspot.com/-sKPFZItIuh4/W7ulkokNCLI/AAAAAAAAwI4/x1XsCFHAmEU5axwEUl2F2FEaJQMQokc_gCLcBGAs/s320/hopediamond.jpg",
          'question match': "https://miro.medium.com/max/800/1*Km98PgzRp9yRYfVZeSzwzQ.png",
          'desert vest': "https://www.varusteleka.com/pictures/thumbs500a/6149.jpg",
          'stitch cause': "https://cdn.hswstatic.com/gif/treat-side-stitch-2.jpg",
          'view reaction': "https://cdn.shopify.com/s/files/1/0065/4917/6438/products/a-scientist-creating-a-chemical-reaction-and-a-view-of-the-city-from-a-top-of-a-building-during-the-day-background_740x.jpg?v=1544771614",
          'law icon': "https://image.freepik.com/free-vector/background-with-advocacy-elements_23-2147802094.jpg",
          'teaching softdrink':"https://i0.wp.com/happyhomefairy.com/wp-content/uploads/2012/04/soda-11.jpg?resize=392%2C666&ssl=1",
          'Nancy mistake':"https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1174333611l/382090.jpg",
          'weapon pigeon': "https://futureforce.navylive.dodlive.mil/files/2015/01/Project-Pigeon-Web.jpg",
          'tree oatmeal': "https://media.istockphoto.com/photos/oatmeal-in-the-bowl-and-ears-of-oats-on-a-piece-of-bark-tree-picture-id856898374",
          'expansion boot': "https://images-na.ssl-images-amazon.com/images/I/61YP743Nv3L._AC_UY675_.jpg",
          'advertisement squash': "https://d3nuqriibqh3vw.cloudfront.net/styles/aotw_card_ir/s3/baxters_bof_press_251115_3_aotw_0.jpg?itok=o1o8EG7p",
          'vein minibus': "https://assets.newatlas.com/dims4/default/66909f0/2147483647/strip/true/crop/1440x961+0+60/resize/1160x774!/quality/90/?url=https%3A%2F%2Fassets.newatlas.com%2Farchive%2Fsuzuki-air-triser-5.jpg",
          'acrylic television': "https://p.globalsources.com/IMAGES/PDT/B1163660980/Acrylic-TV-stand.jpg"}
                


In [12]:
prod_images['broker ethernet']

'http://wiznetmuseum.com/wp/wp-content/uploads/2018/03/106-300x183.jpg'

In [13]:
with open('./datasets/prod_images.json', 'w') as fp:
    json.dump(prod_images, fp)

##### Trial of code for showing no image if an image is missing

In [33]:
final_recs = []
for prod in recommend_for_prods(hello):
    attr = {}
    attr['name'] = prod
    try:
        attr['image'] = images[prod]
    except KeyError:
        attr['image'] = ""
    final_recs.append(attr)

In [34]:
final_recs

[{'name': 'baseball step-grandmother',
  'image': 'https://i.etsystatic.com/14835601/r/il/f8cb7e/2054134143/il_794xN.2054134143_htjh.jpg'},
 {'name': 'delivery order',
  'image': 'https://www.dinnerfactory.ca/wp-content/uploads/2018/07/dinner-factory-delivery.png'},
 {'name': 'epoch bat',
  'image': 'https://www.clipartwiki.com/clipimg/detail/11-113504_cute-bat-clipart-cute-bat-clipart-png.png'},
 {'name': 'cafe advertisement',
  'image': 'https://cdn.atulhost.com/wp-content/uploads/2017/10/advertise-cafe-facebook.jpg'},
 {'name': 'duckling court', 'image': ''}]

##### Splitting dictionary into two so that the file size is small enough for my app

In [35]:
prod_cosine_1 = {key: value for i, (key, value) in enumerate(prod_cosine.items()) if i % 2 == 0}
prod_cosine_2 = {key: value for i, (key, value) in enumerate(prod_cosine.items()) if i % 2 == 1}

In [36]:
len(prod_cosine_1)

539

In [37]:
len(prod_cosine_2)

538

In [38]:
with open('./datasets/prod_cosine_1.json', 'w') as fp:
    json.dump(prod_cosine_1, fp)

In [39]:
with open('./datasets/prod_cosine_2.json', 'w') as fp:
    json.dump(prod_cosine_2, fp)

#### Now I want to compress the json files to further reduce their size

In [7]:
import gzip
import shutil

In [10]:
files = ['prod_images']
# files = ['prod_cosine_1', 'prod_cosine_2', 'prod_images']
for n in files:
    with open('./datasets/' + n +'.json', 'rb') as f_in:
        with gzip.open('./web_application/datasets/' + n + '.json.gz', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

##### and check the new loading code will work

In [51]:
data_2 = {}
for n in range(1,3):
    with gzip.open('./web application/datasets/prod_cosine_' + str(n) + '.json.gz', 'rb') as fp:
        json_bytes = fp.read()
    data_2.update(json.loads(json_bytes))

In [52]:
len(data_2)

1077
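
For reference, the compress-then-reload round trip can also be done in one step by writing JSON straight into a gzip file in text mode. This is only a sketch on a made-up dictionary written to a temporary directory; `toy_cosine`, its scores and the file name are hypothetical and separate from the real `prod_cosine` files.

```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-in for one of the prod_cosine dictionaries
toy_cosine = {
    'prod_a': {'prod_a': 1.0, 'prod_b': 0.25},
    'prod_b': {'prod_a': 0.25, 'prod_b': 1.0},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_cosine.json.gz')

    # 'wt' opens the gzip file in text mode so json.dump can write to it directly
    with gzip.open(path, 'wt', encoding='utf-8') as f_out:
        json.dump(toy_cosine, f_out)

    # 'rt' reads it back as text, so json.load works without decoding bytes by hand
    with gzip.open(path, 'rt', encoding='utf-8') as f_in:
        reloaded = json.load(f_in)

print(reloaded == toy_cosine)  # prints: True
```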