# Initial Recommendation Engine

*Given*: The profile of the user and possibly some situational context, e.g. the user making a purchase or rating an item.

*Required*: Creating a set of items, and creating a score for each recommendable item in that set.

*Profile*: The user profile may contain past purchases, ratings in implicit or explicit form, demographics, and interest scores for item features.

*Problem*: We want to learn a function that predicts the relevance score for a given (typically unseen) item based on user profile and context.

## Utility Matrix

In a recommendation system there are two types of entities: `users` and `items`. Users have preferences for certain items, and these preferences must be teased out of the data. The data itself is represented as a `utility matrix` that gives, for each user-item pair, a value representing what is known about the degree of preference of that user for that item.

Values come from an ordered set (e.g. integers 1-5 representing the number of stars that the user gave as a rating for that item). We assume that the matrix is sparse, meaning that most entries are unknown.

The goal of a recommendation system is to predict the blanks in the utility matrix. In many cases, it is not necessary to predict every blank entry in a utility matrix. Rather, it is only necessary to discover some entries in each row that are likely to be high. In most applications, the recommendation system does not offer users a ranking of all items, but rather suggests a few that the user should value highly. It may not even be necessary to find all items with the highest expected ratings, but only to find a large subset of those with the highest ratings. 

Without a utility matrix, it is almost impossible to recommend items. There are two general approaches to acquiring data to fill in the matrix:
1. Ask users to rate items. This approach is limited in its effectiveness because many users are unwilling to provide responses.
2. Make inferences from user behavior. If a user buys a product, it can likely be said that the user "likes" that item. This sort of rating system only really has one value: 1 means that the user likes the item. If a user views an item, we can say that they are interested in it as well.
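
The second approach can be sketched with pandas. The `events` log below is made up for illustration (it is not data from this notebook), and treating views and purchases alike as a 1 is an assumption, not a rule:

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, item) event.
events = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u3"],
    "item": ["hat", "tee", "tee", "hat", "short"],
    "event": ["purchase", "view", "purchase", "view", "purchase"],
})

# Implicit feedback: any recorded interaction counts as a 1.
events["value"] = 1

# Rows are users, columns are items. Unobserved pairs stay NaN (unknown),
# which is different from a known rating of 0.
utility = events.pivot_table(index="user", columns="item",
                             values="value", aggfunc="max")

# Fraction of user-item pairs with no information at all.
sparsity = utility.isna().to_numpy().mean()
print(utility)
print(f"unknown entries: {sparsity:.0%}")
```

Keeping the blanks as `NaN` rather than 0 preserves the distinction between "unknown" and "disliked", which matters once the matrix is used for prediction.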

## Content-Based Recommenders
Content-based systems focus on properties of items. Similarity of items is determined by measuring the similarity in their properties.

#### Item Profiles
In a content-based system we must construct for each item a `profile`, which is a record or collection of records representing important characteristics of that item. 

Document Feature Extraction -- identification of words that characterize the topic of a document. First, eliminate stop words. For the remaining words, compute the TF-IDF score for each word in the document. The ones with the highest scores are the words that characterize the document. We may then take the n words with the highest TF-IDF scores. 

Now, documents are represented by sets of words. We could use several distance measurements:
- Jaccard distance between the sets of words 
- Cosine distance between the sets, treated as vectors
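
Both measures can be written in a few lines for word sets. This sketch assumes "cosine distance" means one minus the cosine similarity (some texts use the angle itself instead), and the two word sets are invented:

```python
import math

def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def cosine_distance(a: set, b: set) -> float:
    """1 - cosine similarity, with each set viewed as a Boolean vector.

    For Boolean vectors the dot product is the intersection size and
    each norm is the square root of the set size.
    """
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

doc_a = {"nylon", "surf", "hat"}
doc_b = {"nylon", "surf", "short", "pocket"}

print(jaccard_distance(doc_a, doc_b))  # 1 - 2/5
print(cosine_distance(doc_a, doc_b))   # 1 - 2/sqrt(12)
```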

The ultimate goal for content-based recommendation is to create both an item profile consisting of feature-value pairs and a user profile summarizing the preferences of the user, based on their row of the utility matrix.

Represent features as vectors -- for example, if one feature of movies is the set of actors, then imagine there is a component for each actor, with 1 if the actor is in the movie, and 0 if not. 

There is also another class of features that is not readily represented by Boolean vectors: those features that are numerical. For instance, we might take the average rating for movies to be a feature, and this average is a real number. It does not make sense to have one component for each of the possible average ratings, and doing so would cause us to lose the structure implicit in the numbers. That is, two ratings that are close should be considered more similar than widely different ratings. 

Numerical features should be represented by single components of vectors representing items. Those components hold the exact value of that feature. 

There is no harm having some components of the vectors be Boolean and others be real-valued or integer-valued. We can still compute the cosine distance between vectors, although if we do so, we should give some thought to the appropriate scaling of the non-Boolean components, so that they neither dominate the calculation nor are irrelevant. 
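
The effect of scaling can be seen in a small sketch. The three item vectors and the scaling factor `alpha` are hypothetical values chosen for illustration; in practice the factor would need tuning:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Columns: [actor_A, actor_B, average_rating]
items = np.array([
    [1.0, 0.0, 4.5],  # item 0: actor A, rated 4.5
    [0.0, 1.0, 4.4],  # item 1: actor B, rated 4.4
    [1.0, 0.0, 1.0],  # item 2: actor A, rated 1.0
])

# Unscaled: the rating component dominates, so items 0 and 1 look most
# alike even though they share no actor.
unscaled = cosine_similarity(items)

# Scale the numerical component down (alpha is an assumed value).
alpha = 0.2
scaled_items = items.copy()
scaled_items[:, 2] *= alpha
scaled = cosine_similarity(scaled_items)

print("unscaled sim(0,1), sim(0,2):", unscaled[0, 1], unscaled[0, 2])
print("scaled   sim(0,1), sim(0,2):", scaled[0, 1], scaled[0, 2])
```

After scaling, the shared actor outweighs the small rating difference, and item 2 becomes the closer match for item 0.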

#### User Profiles
We need to create not only vectors describing items, but also vectors with the same components that describe the user's preferences. The utility matrix represents the connection between users and items.

The best estimate we can make regarding which items the user likes is some aggregation of the profiles of those items. If the utility matrix has only 1's, then the natural aggregation is the average of the components of the vectors representing the item profiles for the items in which the utility matrix has 1 for that user. 

Example: suppose items are movies, represented by Boolean profiles with components corresponding to actors. Also, the utility matrix has a 1 if the user has seen the movie and is blank otherwise. If 20% of the movies that user U likes have Julia Roberts as one of the actors, then the user profile for U will have 0.2 in the component for Julia Roberts. 
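
The same aggregation as a NumPy sketch. The six movies, the two actor components, and the utility row are all invented so that the numbers match the 20% example:

```python
import numpy as np

# Boolean item profiles. Columns: [Julia Roberts, some other actor].
item_profiles = np.array([
    [1, 0],
    [0, 1],
    [0, 1],
    [0, 0],
    [0, 1],
    [1, 1],
])

# User U's row of the utility matrix: 1 = seen, NaN = blank.
utility_row = np.array([1, 1, 1, 1, 1, np.nan])

# Average the profiles of the items that have a 1 for this user.
seen = ~np.isnan(utility_row)
user_profile = item_profiles[seen].mean(axis=0)

print(user_profile)  # Julia Roberts appears in 1 of the 5 seen movies
```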

#### Classification Algorithms

Regard the given data as a training set, and for each user, build a classifier that predicts the rating of all items. 

One example: Decision Trees -- in our case the decision at the leaves would be "likes" or "does not like". Each interior node is a condition on the objects being classified. 
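
A minimal per-user sketch with scikit-learn's `DecisionTreeClassifier`. The item features and labels are hypothetical, and a real system would need far more rated items per user than the four shown here:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Item features for items this user has rated.
# Columns: [is_jacket, has_merino, is_waterproof]
X_rated = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])
# Labels from the user's row of the utility matrix: 1 = likes, 0 = does not.
y_rated = np.array([1, 1, 0, 0])

# One classifier per user; interior nodes test item features.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_rated, y_rated)

# Predict for items the user has not rated yet.
X_unseen = np.array([
    [1, 1, 1],  # a waterproof merino jacket
    [0, 0, 0],  # not a jacket
])
predictions = clf.predict(X_unseen)
print(predictions)
```

In this toy data only `is_jacket` separates the labels perfectly, so the tree learns that single condition.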

# ----------------------------------

## Import Libraries

In [1]:
import numpy as np 
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup

# NLP Libraries
import nltk 
from nltk.corpus import stopwords 
from nltk.collocations import * 
from nltk.stem.wordnet import WordNetLemmatizer
import string 
import re 



## Load and Process Data

In [2]:
# import data from csv 
df = pd.read_csv('/Users/akg/Documents/shopkit/clothing_recs/complete_product_lists/halfdays_unexploded.csv')

In [3]:
df.columns

Index(['id', 'title', 'body_html', 'vendor', 'product_type', 'created_at',
       'handle', 'updated_at', 'published_at', 'template_suffix', 'status',
       'published_scope', 'tags', 'admin_graphql_api_id', 'variants',
       'options', 'images', 'image.id', 'image.position', 'image.created_at',
       'image.updated_at', 'image.alt', 'image.width', 'image.height',
       'image.src', 'image.variant_ids', 'image.admin_graphql_api_id',
       'image'],
      dtype='object')

In [4]:
len(df)

46

In [5]:
# filter relevant columns
df = df[['id', 'title', 'body_html', 'vendor', 'product_type', 'handle', 
         'status', 'tags']]

In [6]:
# remove test products
df = df.loc[df['title'] != 'A Test Product']

In [7]:
len(df)

45

In [8]:
# Fill missing body html with empty string
df['body_html'] = df['body_html'].fillna(value='')

In [9]:
# further downselect features to use for initial recommendations 
clean_df = df.copy()
clean_df = clean_df[['id', 'title', 'body_html', 'vendor', 'product_type',
                     'status', 'tags']]

In [10]:
# Remove Archived Products
clean_df = clean_df.loc[clean_df['status'] != 'archived']

In [11]:
# Remove Route Package Protection
clean_df = clean_df.loc[clean_df['title'] != 'Route Package Protection']

# Remove Navidium Shipping Protection
clean_df = clean_df.loc[clean_df['title'] != 'Navidium Shipping Protection']

In [12]:
# Remove all remaining test products 
clean_df = clean_df.loc[~clean_df['title'].str.contains('TEST')]

In [13]:
# Remove Gift Cards
clean_df = clean_df.loc[clean_df['title'] != 'Gift Card']

In [14]:
# Extract paragraph from body_html as description 
clean_df['description'] = clean_df['body_html'].apply(lambda x: BeautifulSoup(x, 'html.parser').p if BeautifulSoup(x, 'html.parser').p else x)

In [15]:
# fill missing tag values with empty string 
clean_df['tags'] = clean_df['tags'].fillna(value='')

## Apply NLP to Descriptions

In [16]:
# set up stopwords 
# stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list = list(string.punctuation)

# add p and /p to stopwords list 
stopwords_list += ['p', '/p']

In [17]:
# function to process text descriptions
# (the parameter is named stopword_list so it does not shadow nltk's `stopwords` import)
def process_product_description(description: str, stopword_list: list) -> list:
    
    # tokenize
    tokens = nltk.word_tokenize(description)
    
    # lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # remove stopwords and lowercase
    processed_tokens = [token.lower() for token in tokens if token.lower() not in stopword_list]
    
    # return the cleaned tokens
    return processed_tokens

In [18]:
# process descriptions
clean_df['description_tokens'] = clean_df.apply(lambda x: process_product_description(str(x.description), stopwords_list), axis=1)

In [19]:
# replace description with processed, joined description
clean_df['description'] = clean_df['description_tokens'].apply(lambda x: " ".join(x))

In [20]:
clean_df = clean_df.reset_index(drop=True)

In [21]:
# Build Vectorizer
tfidf = TfidfVectorizer(stop_words='english')

# construct tfidf matrix 
description_matrix = tfidf.fit_transform(clean_df['description'])

# output shape 
description_matrix.shape

(30, 219)

Every product is represented by 219 features (words). To find the similarity between descriptions we can use `cosine similarity`. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` gives the same result.
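
The equivalence can be checked on a toy corpus. The documents below are invented; the empty string is included because a description with no vocabulary terms yields an all-zero row, which may be why the first row of the output below is all zeros, diagonal included:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["", "soft shell pant", "soft shell bib pant", "merino beanie"]

# Default settings use norm="l2", so every non-empty row has unit length.
matrix = TfidfVectorizer().fit_transform(docs)

lin = linear_kernel(matrix, matrix)      # plain dot products
cos = cosine_similarity(matrix, matrix)  # dot products of normalized rows

same = bool(np.allclose(lin, cos))
print("linear_kernel == cosine_similarity:", same)
print("self-similarity of the empty document:", lin[0, 0])
```

`linear_kernel` skips the normalization step, so it does slightly less work on vectors that are already unit length.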

In [22]:
# build similarity matrix 
similarity_matrix = linear_kernel(description_matrix, description_matrix)
similarity_matrix

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 1.        , 0.        , 0.        ,
        0.        , 0.25072523, 0.25072523, 0.        , 0.14485776,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.03286786, 0.05816447,
        0.11786709, 0.        , 0.05041585, 0.        , 0.        ,
        0.03624914, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 1.        , 0.        , 0.        ,
        0.        , 0.25072523, 0.25072523, 0.        , 0.14485776,
        0.        , 0.        , 0.        , 0.

Create a series that maps each product title to its row index in the similarity matrix, so that we can pass in a product title and get recommendations back.

In [23]:
# index mapping
mapping = pd.Series(clean_df.index, index = clean_df['title'])
mapping[:10]

title
Adams Nylon Short            0
Alessandra Pant              1
Alessandra Pant - Short      2
Aston Jacket                 3
Aston Jacket Belt            4
Banana Beanie                5
Carson Bib Pant              6
Carson Bib Pant - Short      7
Douglas Nylon Windbreaker    8
Emma Soft Shell Pant         9
dtype: int64

In [24]:
# write recommendation function 
def recommend_item_based_on_description(product_input):
    item_idx = mapping[product_input]
    
    # get similarity values with other items 
    # similarity_score is a list of (item index, similarity) pairs
    similarity_score = list(enumerate(similarity_matrix[item_idx]))
    
    # sort in descending order 
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    
    # keep the next 19 entries, skipping the first (assumed to be the query item itself)
    similarity_score = similarity_score[1:20]
    
    # return product names 
    item_indices = [i[0] for i in similarity_score]
    
    return (clean_df['title'].iloc[item_indices])    

In [25]:
# recommend
recommend_item_based_on_description('Pieper Fleece')

3                      Aston Jacket
21                      Johnson Top
19         Hunter Merino Rib Beanie
6                   Carson Bib Pant
7           Carson Bib Pant - Short
18           Hayes Knit Neck Warmer
9              Emma Soft Shell Pant
23    Nellie Packable Puffer Jacket
20       Isabel Soft Shell Bib Pant
5                     Banana Beanie
28                      Tabei Parka
27                   Sophia Legging
1                   Alessandra Pant
2           Alessandra Pant - Short
0                 Adams Nylon Short
4                 Aston Jacket Belt
8         Douglas Nylon Windbreaker
10                   Fay Merino Top
11            Georgie Puffer Jacket
Name: title, dtype: object
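
Skipping the first sorted entry assumes the query item always ranks first. When another product has an identical description (similarity 1.0, as with the two Alessandra Pant variants in the matrix above), the query item can land in second place and be returned as its own recommendation. A possible variant drops the query by index instead. This is a sketch on a made-up 4x4 similarity matrix, and `recommend` and `top_n` are illustrative names rather than part of the notebook:

```python
import numpy as np
import pandas as pd

titles = pd.Series([
    "Alessandra Pant",
    "Alessandra Pant - Short",
    "Aston Jacket",
    "Banana Beanie",
])

# Toy similarity matrix: the first two products have identical descriptions.
similarity = np.array([
    [1.0, 1.0, 0.2, 0.0],
    [1.0, 1.0, 0.2, 0.0],
    [0.2, 0.2, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
])

title_to_idx = pd.Series(titles.index, index=titles)

def recommend(title, top_n=2):
    """Return the top_n most similar titles, never including the query."""
    idx = title_to_idx[title]
    # Drop the query item by its index rather than by its sorted position.
    scores = pd.Series(similarity[idx], index=titles.index).drop(idx)
    top = scores.sort_values(ascending=False).head(top_n)
    return titles.loc[top.index].tolist()

print(recommend("Alessandra Pant - Short"))
```

Exposing `top_n` as a parameter also avoids hard-coding the slice length.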

In [26]:
clean_df.loc[clean_df['title'] == 'Johnson Top']

Unnamed: 0,id,title,body_html,vendor,product_type,status,tags,description,description_tokens
21,5982175559839,Johnson Top,<p>A mock neck active layer designed to take y...,Halfdays,Shirts & Tops,draft,,a mock neck active layer designed to take you ...,"[a, mock, neck, active, layer, designed, to, t..."


# ----------------------------------

# Ola Canvas

## Load and Process Data

In [27]:
# import data from csv 
df = pd.read_csv('/Users/akg/Documents/shopkit/clothing_recs/complete_product_lists/ola_canvas_products.csv')

In [28]:
# filter relevant columns
df = df[['id', 'title', 'body_html', 'vendor', 'product_type', 'handle', 
         'status', 'tags']]

In [29]:
# further downselect features to use for initial recommendations 
clean_df = df.copy()
clean_df = clean_df[['id', 'title', 'body_html', 'vendor', 'product_type',
                     'status', 'tags']]

In [30]:
# fill missing values with empty string
clean_df = clean_df.fillna(value='')

In [31]:
# Remove Draft Products
# clean_df = clean_df.loc[clean_df['status'] != 'draft']

In [32]:
clean_df.iloc[0]['body_html']

'<p><strong>Surf Hat\xa0</strong></p>\n<p>Made for the water. \xa0Unstructured hat made from UV resistant quick dry nylon. \xa0 Proprietary polywrap elastic and turnbuckle back closure. \xa0The back closure on each hat we make is done by hand in our studio. \xa0</p>\n<p><span></span><strong>Product Details</strong></p>\n<ul>\n<li>Color: Black</li>\n<li>Unstructured Cap</li>\n<li>100% Nylon\xa0</li>\n<li>Quick Dry</li>\n<li>Elastic ploy wrap back band</li>\n<li>Signature OLA CANVAS closure</li>\n</ul>\n<span style="font-size: 1.4em;">\xa0</span><br>\n<p><strong>Hat Sizing</strong><br></p>\n<p data-mce-fragment="1">SM: 6 3/4 - 7 1/4</p>\n<p data-mce-fragment="1">L/XL: 7 1/4 -7 3/4</p>\n<blockquote>\n<h5><span style="font-size: 1.4em;" data-mce-style="font-size: 1.4em;"></span></h5>\n</blockquote>\n<br>\n<p>Designed in California by\xa0OLA\xa0CANVAS</p>'

In [33]:
# Extract all visible text from body_html as description 
clean_df['description'] = clean_df['body_html'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text() if BeautifulSoup(x, 'html.parser').get_text() else x)

In [34]:
clean_df.iloc[0]['description']

'Surf Hat\xa0\nMade for the water. \xa0Unstructured hat made from UV resistant quick dry nylon. \xa0 Proprietary polywrap elastic and turnbuckle back closure. \xa0The back closure on each hat we make is done by hand in our studio. \xa0\nProduct Details\n\nColor: Black\nUnstructured Cap\n100% Nylon\xa0\nQuick Dry\nElastic ploy wrap back band\nSignature OLA CANVAS closure\n\n\xa0\nHat Sizing\nSM: 6 3/4 - 7 1/4\nL/XL: 7 1/4 -7 3/4\n\n\n\n\nDesigned in California by\xa0OLA\xa0CANVAS'

## Apply NLP to Descriptions

In [35]:
# set up stopwords 
# stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list = list(string.punctuation)

# add non-breaking space and newline characters to stopwords list 
stopwords_list += ["\xa0", "\n"]

In [36]:
# process descriptions
clean_df['description_tokens'] = clean_df.apply(lambda x: process_product_description(str(x.description), stopwords_list), axis=1)

# replace description with processed, joined description
clean_df['description'] = clean_df['description_tokens'].apply(lambda x: " ".join(x))

# reset index 
clean_df = clean_df.reset_index(drop=True)

In [37]:
# construct tfidf matrix 
description_matrix = tfidf.fit_transform(clean_df['description'])

# output shape 
description_matrix.shape

(196, 567)

In [38]:
# build similarity matrix 
similarity_matrix = linear_kernel(description_matrix, description_matrix)
similarity_matrix

array([[1.        , 0.98368497, 0.07530808, ..., 0.10289489, 0.10529019,
        0.12166241],
       [0.98368497, 1.        , 0.06827899, ..., 0.10195145, 0.1043248 ,
        0.1205469 ],
       [0.07530808, 0.06827899, 1.        , ..., 0.05013   , 0.05129699,
        0.09685194],
       ...,
       [0.10289489, 0.10195145, 0.05013   , ..., 1.        , 0.94411694,
        0.03519429],
       [0.10529019, 0.1043248 , 0.05129699, ..., 0.94411694, 1.        ,
        0.03601359],
       [0.12166241, 0.1205469 , 0.09685194, ..., 0.03519429, 0.03601359,
        1.        ]])

In [39]:
# index mapping
mapping = pd.Series(clean_df.index, index = clean_df['title'])
mapping[:10]

title
"O"RIGINAL SURF HAT- Black              0
"O"RIGINAL SURF HAT- Navy               1
ALBUM x OLA CANVAS Boardshort- BLACK    2
Anthem Tee - Black                      3
Anthem Tee - Pale Pink                  4
Anthem Tee - Signal Yellow              5
Back Bay Tee - Bone                     6
Back Bay Tee - Sulfur                   7
Blackball Boardshort- Black             8
Blackball Boardshort- Hazard Yellow     9
dtype: int64

In [40]:
clean_df.loc[clean_df['title'].str.contains('SOUVENIR')]

Unnamed: 0,id,title,body_html,vendor,product_type,status,tags,description,description_tokens
162,6662860505264,SOUVENIR SHORT - LEGION GREEN,<p>Called the souvenir short as throwback to m...,OLA Canvas,,active,,called the souvenir short a throwback to milit...,"[called, the, souvenir, short, a, throwback, t..."
163,6985682944176,SOUVENIR SHORT - MARINE BLUE,<p>Called the souvenir short as throwback to m...,OLA Canvas,,active,,called the souvenir short a throwback to milit...,"[called, the, souvenir, short, a, throwback, t..."
164,6743654498480,SOUVENIR SHORT - OBSIDIAN BLUE,<p>Called the souvenir short as throwback to m...,OLA Canvas,,active,,called the souvenir short a throwback to milit...,"[called, the, souvenir, short, a, throwback, t..."
165,7097190645936,SOUVENIR SHORT PALM EMB - MARINE BLUE,<p>Called the souvenir short as throwback to m...,OLA Canvas,,active,,called the souvenir short a throwback to milit...,"[called, the, souvenir, short, a, throwback, t..."
166,7097190318256,SOUVENIR SHORT TIGER EMB - MARINE BLUE,<p>Limited Edition Embroidered Souvenir Shorts...,OLA Canvas,,active,,limited edition embroidered souvenir shorts pr...,"[limited, edition, embroidered, souvenir, shor..."


In [41]:
# generate recommendations
recommend_item_based_on_description('SOUVENIR SHORT - MARINE BLUE')

165     SOUVENIR SHORT PALM EMB - MARINE BLUE
164            SOUVENIR SHORT - OBSIDIAN BLUE
162             SOUVENIR SHORT - LEGION GREEN
187                    UTILITY SHORT 01- SAIL
186             UTILITY SHORT 01- CARGO KHAKI
185                          UTILITY SHORT 01
166    SOUVENIR SHORT TIGER EMB - MARINE BLUE
60                       OG Boardshort- Black
119               Sample OG Boardshort- Black
61                       OG Boardshort- Olive
46                Dockside Chino-Desert Khaki
45                       Dockside Chino-Birch
182                            UTILITY PANT01
8                 Blackball Boardshort- Black
184                      UTILITY PANT01- SAIL
183               UTILITY PANT01- CARGO KHAKI
9         Blackball Boardshort- Hazard Yellow
2        ALBUM x OLA CANVAS Boardshort- BLACK
83        OLA CANVAS LEVI PKT TEE - Off White
Name: title, dtype: object