# Day 1 Recommenders

## Overview

Introduction and level setting

Review the types of Recommender Systems 

1. Rule

2. Content-Based

3. Collaborative Filtering

4. Hybrid

Implement a Python version of the Collaborative model with a small data set


Describe the Collaborative model (ALS) and how it is implemented


Lab: Building Recommenders 


Recommenders at larger scale


Lab: Using Spark to build recommenders


Evaluation of Recommender Systems
1. RMSE
2. Precision & Recall
3. DCG 


Practical issues when using Recommender Systems
1. Latency
2. Memory
3. Fallbacks
4. In-session updating


### Non-personalized

Non-personalized recommendations are usually where most sites start because they’re easy to produce and don’t require that you know anything specific about the users. They’re also useful because you can **always show them**, however little you know about the users. People might say that non-personalized recs should only be shown until the system knows enough about the user to show more personalized recs, but always remember that humans are flock animals by nature, so most will be suckers for knowing which content items are the most popular, if for no other reason than to decide what not to like.

![Cupon non personalized](img/cupon_alt.jpg)

Cupon.com uses non-personalized recommendations to recommend more offers. At the top, it lists the popular categories and brands. The central part of the screen contains lists of vouchers to save money on, but it’s hard to say how those are calculated. Cupon.com is one of many such sites, and it’s a way for sellers to get in contact with people who are happy to get things cheaper by buying in quantity, thus spending more money.

A recommendation, personalized or not, is based on data and calculated from data. To keep the definition from being too murky, we’ll restrict it to recommendations that are computer-calculated from usage data. That means the popular categories at cupon.com are recommendations (calculated by counting which categories are viewed most). Before you start calculating, let’s look at examples of what a site can do if it doesn’t have any data at all.
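As a minimal sketch of a recommendation calculated from usage data, the snippet below counts category views and returns the most viewed ones. The log format, the category names, and the function `most_popular` are hypothetical and only serve as illustration.

```python
from collections import Counter

# Hypothetical usage log: one (session_id, category) pair per viewed item.
view_log = [
    ("s1", "groceries"),
    ("s1", "electronics"),
    ("s2", "groceries"),
    ("s3", "toys"),
    ("s3", "groceries"),
    ("s4", "electronics"),
]


def most_popular(log, take=2):
    """Return the `take` most viewed categories with their view counts."""
    counts = Counter(category for _, category in log)
    return counts.most_common(take)


top_categories = most_popular(view_log)
print(top_categories)
```

Because the result doesn't depend on who is looking, the same list can be shown to every visitor, including brand-new ones.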

![Dzone](img/dzone.jpg)

Above we see how editors or tastemakers may produce non-personalized recs.

#### Frequently bought items similar to the one you’re viewing

Can you create a frequently bought together (FBT) recommender by finding all the products that are bought together with the current product and then taking the top X of those? You could, but as you’ll see later, it doesn’t work well.

One of the challenges of showing an FBT recommendation is that most products are bought together with other popular items. A classic example is that most people leaving a supermarket in Denmark will have a liter of milk in the bag, so almost no matter what else is in the basket, you could say that those items were frequently bought together with a liter of milk.

As I was writing this, my wife came home from the supermarket with not only one, but two liters of milk (figure 5.11 shows the shopping receipt). Most products in a supermarket that are bought often will also be purchased with milk. When two or more items are often seen together, the group is called a frequency set, so this yields frequency sets containing milk and most other products.
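The milk problem can be illustrated with a small sketch of the naive approach: count how often pairs of items share a basket and rank the partners of a product by raw count. The baskets and the function names below are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; milk is in almost all of them.
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "hard disk", "cable"],
    ["milk", "bread"],
    ["milk", "diapers"],
    ["hard disk", "cable"],
]


def co_purchase_counts(baskets):
    """Count how often each unordered pair of items shares a basket."""
    pair_counts = Counter()
    for basket in baskets:
        # set() ignores repeated items, sorted() gives a stable pair order.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts


def naive_fbt(item, baskets, take=2):
    """Rank the items bought together with `item` by raw count only."""
    partners = Counter()
    for (a, b), count in co_purchase_counts(baskets).items():
        if item == a:
            partners[b] += count
        elif item == b:
            partners[a] += count
    return partners.most_common(take)


bread_partners = naive_fbt("bread", baskets)
disk_partners = naive_fbt("hard disk", baskets)
print(bread_partners)
print(disk_partners)
```

In this toy data, milk ends up among the top partners of both bread and the hard disk, simply because it is in nearly every basket. Raw counts reward popularity rather than a real connection between the products.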

#### Association rules
Instead of looking at most popular objects, there’s the idea of association rules, something a bit closer to kindness than marketing talk. Association rules in the commerce scenario can be thought of as well-meaning advice. Most people hate to come home with a new hard disk only to realize that they don’t have any cable to connect it. If you buy things on Amazon, then you’re in luck because they remind you that most people buy a cable with the hard drive, as you can see:

![Amazon](img/amazon.jpg)

Definition: Confidence

![Conf](img/confidence.jpg)

where T(X) is the set of transactions that contain X


Let’s calculate the confidence that milk will be in the basket when bread is bought. This can be written like this:

![Support](img/support.jpg)

Next you need to find all the transactions containing both bread and milk, and then all the transactions containing bread (with or without milk).
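As a worked sketch of that calculation, the snippet below assumes the usual definition of confidence: the number of transactions containing both the source and the target, divided by the number of transactions containing the source. The transactions and the helper names are hypothetical.

```python
# Hypothetical transactions, each one a set of bought items.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "jam"},
    {"milk"},
    {"milk", "eggs"},
]


def transactions_with(items, transactions):
    """T(X): all transactions that contain every item in `items`."""
    return [t for t in transactions if items <= t]


def support(items, transactions):
    """Share of all transactions that contain `items`."""
    return len(transactions_with(items, transactions)) / len(transactions)


def confidence(source, target, transactions):
    """Share of the transactions with `source` that also contain `target`."""
    with_source = transactions_with(source, transactions)
    if not with_source:
        return 0.0
    with_both = transactions_with(source | target, transactions)
    return len(with_both) / len(with_source)


conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)
sup_bread_milk = support({"bread", "milk"}, transactions)
print(conf_bread_milk, sup_bread_milk)
```

Here three transactions contain bread and two of those also contain milk, so the confidence of bread leading to milk is 2/3, while the pair appears in 2 of the 5 transactions overall.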

#### Implementing association rules
The procedure described in the previous section goes something like the following:


1.  Settle on a minimum support and minimum confidence level.

2.  Get all transactions.

3.  Create a list of one-item itemsets, one for each item, calculate the support of each (the number of transactions it appears in divided by the total number of transactions), and set its confidence to one.

4.  Build a list of itemsets containing more than one item and calculate their support and confidence: for each transaction, find all combinations of its items and add one to the count of each resulting itemset.

5.  Iterate through the itemsets and remove the ones that don’t fulfill the confidence requirement.

Let’s translate this into Python code, but we’ll wait a bit before setting the minimum support and confidence level, calculating everything to start with.



In [2]:
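def retrieve_buy_events():
    """Fetch all 'buy' events from the collector log, ordered by session."""

    sql = """
    SELECT *
    FROM  Collector_log
    WHERE event = 'buy'
    ORDER BY session_id, content_id
    """

    # data_helper is not defined in this notebook; it is assumed to be
    # the project's database helper module.
    cursor = data_helper.get_query_cursor(sql)
    data = data_helper.dictfetchall(cursor)

    return data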


In [None]:
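def generate_transactions(data):
    """Group the bought content ids by session, one transaction per session."""
    transactions = dict()

    for trans_item in data:
        session_id = trans_item["session_id"]  # avoid shadowing the built-in id()
        if session_id not in transactions:
            transactions[session_id] = []
        transactions[session_id].append(trans_item["content_id"])

    return transactions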

In [4]:
def calculate_support_confidence(transactions, min_sup=0.01):
    """Build association rules from a dict mapping session id to bought items."""

    N = len(transactions)  # total number of transactions
    one_itemsets = calculate_itemsets_one(transactions, min_sup)
    two_itemsets = calculate_itemsets_two(transactions,
                                          one_itemsets, min_sup)
    rules = calculate_association_rules(one_itemsets,
                                        two_itemsets, N)

    return sorted(rules)

In [5]:
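from collections import defaultdict

def calculate_itemsets_one(transactions, min_sup=0.01):
    """Count the transactions containing each item; keep items above min_sup."""

    N = len(transactions)

    temp = defaultdict(int)
    one_itemsets = dict()

    for key, items in transactions.items():
        # Count each item once per transaction, matching the de-duplication
        # done in calculate_itemsets_two.
        for item in set(items):
            inx = frozenset({item})
            temp[inx] += 1

    # Remove all items that don't have enough support.
    for key, count in temp.items():
        if count > min_sup * N:
            one_itemsets[key] = count

    return one_itemsets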

In [None]:
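from collections import defaultdict
from itertools import combinations

def calculate_itemsets_two(transactions, one_itemsets, min_sup=0.01):
    """Count the transactions containing each pair of supported items."""
    two_itemsets = defaultdict(int)

    for key, items in transactions.items():
        items = list(set(items))  # ignore repeated purchases within a session

        # combinations() yields nothing for fewer than two items, so
        # single-item transactions are skipped automatically.
        for pair in combinations(items, 2):
            # has_support is not shown here; it is assumed to check that
            # both items of the pair are present in one_itemsets.
            if has_support(pair, one_itemsets):
                two_itemsets[frozenset(pair)] += 1

    return two_itemsets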

In [None]:
from datetime import datetime

def calculate_association_rules(one_itemsets, two_itemsets, N):
    """Create one (timestamp, source, target, confidence, support) rule per direction of each pair."""
    timestamp = datetime.now()

    rules = []
    for source, source_freq in one_itemsets.items():
        for key, group_freq in two_itemsets.items():
            if source.issubset(key):
                target = key.difference(source)
                support = group_freq / N               # share of all transactions
                confidence = group_freq / source_freq  # share of the source's transactions
                rules.append((timestamp, next(iter(source)), next(iter(target)),
                              confidence, support))
    return rules
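To see the whole procedure run without a database, the functions above can be condensed into the self-contained sketch below. The toy `transactions` dict stands in for the output of `generate_transactions`, the function names are shortened for this example, and `has_support` is a hypothetical implementation of the helper that the code above calls but doesn't show.

```python
from collections import defaultdict
from itertools import combinations

# Toy stand-in for the output of generate_transactions: session -> items.
transactions = {
    "s1": ["bread", "milk"],
    "s2": ["bread", "milk", "butter"],
    "s3": ["bread", "jam"],
    "s4": ["milk"],
    "s5": ["hard disk", "cable", "milk"],
}


def count_one_itemsets(transactions, min_sup):
    """Count sessions per single item and keep the items above min_sup."""
    n = len(transactions)
    counts = defaultdict(int)
    for items in transactions.values():
        for item in set(items):
            counts[frozenset({item})] += 1
    return {key: count for key, count in counts.items() if count > min_sup * n}


def has_support(pair, one_itemsets):
    """Hypothetical helper: both items of the pair are frequent on their own."""
    return all(frozenset({item}) in one_itemsets for item in pair)


def count_two_itemsets(transactions, one_itemsets):
    """Count sessions per pair of items that each have enough support."""
    counts = defaultdict(int)
    for items in transactions.values():
        for pair in combinations(sorted(set(items)), 2):
            if has_support(pair, one_itemsets):
                counts[frozenset(pair)] += 1
    return counts


def build_rules(one_itemsets, two_itemsets, n):
    """Return sorted (source, target, confidence, support) tuples."""
    rules = []
    for source, source_freq in one_itemsets.items():
        for pair, pair_freq in two_itemsets.items():
            if source.issubset(pair):
                (src,) = source
                (target,) = pair - source
                rules.append((src, target, pair_freq / source_freq, pair_freq / n))
    return sorted(rules)


one = count_one_itemsets(transactions, min_sup=0.3)
two = count_two_itemsets(transactions, one)
rules = build_rules(one, two, len(transactions))
for rule in rules:
    print(rule)
```

With a minimum support of 0.3, only bread and milk survive as single items, so the only rules left are bread to milk (confidence 2/3) and milk to bread (confidence 1/2), both with a support of 0.4.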

In [6]:
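from django.http import JsonResponse

def get_association_rules_for(request, content_id, take=6):
    # SeededRecs is not defined in this notebook; it is assumed to be the
    # project's Django model that stores the calculated rules.
    data = SeededRecs.objects.filter(source=content_id) \
               .order_by('-confidence') \
               .values('target', 'confidence', 'support')[:take]

    return JsonResponse(dict(data=list(data)), safe=False)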

### Personalized

#### Content Recommenders

For a movie, content can include categories such as genres, actors, and directors. On other sites, it can be things such as clothing sizes, brand, style and colors, or engine sizes for cars.

Content-based filtering is about extracting properties or knowledge from the content. You’ll try to extract precise definitions of each content item and represent each item as a list of values. Described like this it sounds easy, but it does pose challenges. The first image below illustrates a simple version of how to train a content-based recommender (offline), while the second figure shows how it’s used when a user arrives at your site (online).

![content1](img/offline-content.jpg)

To sum up, you need the following to make things work:


1.  Content analyzer—Creates a model based on the content. In a way, it creates a profile for each item. It’s where the training of the model is done.

2.  User profiler—Creates a user profile; sometimes the user profile is a simple list of items consumed by the user.

3.  Item retriever—Retrieves relevant items found by comparing the user profiles to the item profiles as shown in figure 10.5. If the user profile is a list of items, this list is iterated, and similar items are found for each item in the user’s list.

![online-content](img/online-content.jpg)
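The three components can be sketched in a few lines. The item profiles below are hand-written stand-ins for what a content analyzer would produce, and cosine similarity is an assumed choice of similarity measure; all names and weights are hypothetical.

```python
import math

# Hypothetical item profiles, as a content analyzer might produce them:
# one weight per feature (here, genres).
item_profiles = {
    "space_thriller": {"sci-fi": 1.0, "thriller": 0.5},
    "jungle_adventure": {"adventure": 1.0, "action": 0.8},
    "space_adventure": {"sci-fi": 1.0, "adventure": 0.7, "action": 0.5},
    "city_romance": {"romance": 1.0, "comedy": 0.8},
}


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_user_profile(consumed, item_profiles):
    """User profiler: sum the feature vectors of the items the user consumed."""
    profile = {}
    for title in consumed:
        for feature, weight in item_profiles[title].items():
            profile[feature] = profile.get(feature, 0.0) + weight
    return profile


def retrieve_items(user_profile, item_profiles, exclude, take=1):
    """Item retriever: rank unseen items by similarity to the user profile."""
    scored = [
        (cosine(user_profile, profile), title)
        for title, profile in item_profiles.items()
        if title not in exclude
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored[:take]]


consumed = ["space_thriller"]
user_profile = build_user_profile(consumed, item_profiles)
recs = retrieve_items(user_profile, item_profiles, exclude=set(consumed))
print(recs)
```

The only unseen item that shares a feature with the user's profile is the other science fiction title, so it is ranked first.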

Metadata about a film is everything that you can find on an IMDb page, such as genre, starring artists, and production year. It could also be something like the type of filming or the style of clothing worn by the actors in the film, or in other domains, the shade of paint on the car or the number of freckles on men on dating sites. I like to split the metadata loosely into two types:

* Facts
* Tags

This isn’t a division normally used, but it’s a beneficial one to think about. Facts are things such as the production year or the starring actors in a movie: they can’t be disputed, so you can use them directly as input. Tags can mean different things to different people and should be considered carefully before you add them.

The social internet has made it popular for people to add descriptive tags to content. Tags can be something as simple as “uplifting” or more subjective like “breaking the fourth wall.” I’ve no idea what that means, but 10 people describing Deadpool said that it was relevant, and apparently it applies to a number of films across genres and decades.

Facts and tags have no clear divisions, so remember that facts are something that people often agree on, while tags can be a bit more subjective. In this light, you should probably put genres in the tag category, but that’s a matter of debate also.

One of the biggest issues for developers trying to use content-based recommenders is that they can’t get the data about the items. What options do you have? You could try to build it yourself or you could hire people to go through the content and tag it. But beware, that can produce strange recommendations. Entire companies exist where people tag content for a living.  



#### EXTRACTING METADATA FROM DESCRIPTIONS

In the scope of content-based filtering, news articles are interesting because they’re often only relevant for a short time (and so more affected by cold-start), which means that they’re hard to recommend using collaborative filtering. But you might still want to recommend them.

Besides using popularity, you can analyze the content. One way of doing that is to look at what words are in the article, how many times each occurs, and how commonly they appear in all the news items in the database. This can be done using TF-IDF and NLP on text descriptions. An article is text, so the content itself serves as the description, while a movie or a fashion item has a description that somebody wrote.


  * tf(word,document) = 1 + log(word frequency)
  
  ![idf](img/idf.jpg)
 
  * tf-idf(word,document) = tf(word,document)*idf(word,document)
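A minimal sketch of these formulas in plain Python follows. The `tf` function uses the definition given above. The idf formula is only shown as an image, so the sketch assumes the common definition, the logarithm of the number of documents divided by the number of documents containing the word. The documents and the whitespace tokenization are made up for illustration.

```python
import math

# Hypothetical mini-corpus of item descriptions.
documents = {
    "doc1": "the film is a space adventure",
    "doc2": "the film is a romantic comedy",
    "doc3": "a space documentary about the moon",
}

# Naive tokenization: split on whitespace.
corpus_tokens = {doc_id: text.split() for doc_id, text in documents.items()}


def tf(word, tokens):
    """1 + log(word frequency), or 0 if the word is absent."""
    count = tokens.count(word)
    return 1 + math.log(count) if count > 0 else 0.0


def idf(word, corpus_tokens):
    """Assumed definition: log(number of documents / documents containing word)."""
    containing = sum(1 for tokens in corpus_tokens.values() if word in tokens)
    return math.log(len(corpus_tokens) / containing) if containing else 0.0


def tf_idf(word, doc_id, corpus_tokens):
    return tf(word, corpus_tokens[doc_id]) * idf(word, corpus_tokens)


score_the = tf_idf("the", "doc1", corpus_tokens)              # in every document
score_adventure = tf_idf("adventure", "doc1", corpus_tokens)  # only in doc1
print(score_the, score_adventure)
```

A word that appears in every document, such as "the", gets a score of zero, while a word that is specific to one document gets a high score. That is what makes the scores usable as content features.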
  
  Besides TF-IDF, we often use LDA (latent Dirichlet allocation) to generate content features from text, based on the topics that the text discusses.  
  ![LDA](img/LDA.jpg)

Note that LDA is sensitive to both the corpus used to train it and the choice of K, the number of topics you want. It shares many of the frustrations of other unsupervised models, such as clustering. Make sure to train on a corpus of documents about the same subject as the text you want to process. If you are describing fashion, don't use academic journals!

![dashboard](img/LDAdashboard.jpg)

In [8]:
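# Fragment of a larger function: movies, ratings, user_avg, sum_rating,
# movie_dtos, MovieDto, genres_ratings and genres_count are assumed to be
# defined earlier (the last two as dicts keyed by genre name).
for movie in movies:
    movie_id = movie.movie_id

    rating = ratings[movie_id]  # this user's rating of the movie

    r = rating.rating
    sum_rating += r
    movie_dtos.append(MovieDto(movie_id, movie.title, r))
    for genre in movie.genres.all():

        # Accumulate how far above or below the user's average each genre is rated.
        if genre.name in genres_ratings:
            genres_ratings[genre.name] += r - user_avg
            genres_count[genre.name] += 1

# Normalize by the largest value, but never by less than 1.
max_value = max(genres_ratings.values())
max_value = max(max_value, 1)
max_count = max(genres_count.values())
max_count = max(max_count, 1)

genres = []
for key, value in genres_ratings.items():
    genres.append((key, 'rating', value/max_value))
    genres.append((key, 'count', genres_count[key]/max_count))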


![sample-user](img/user100.jpg)
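The fragment above depends on objects from the surrounding application, so it doesn't run on its own. The same idea can be sketched with plain dictionaries: for each genre, sum how far the user's ratings deviate from the user's average, count the rated movies, and normalize both. The data and the function `genre_profile` are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: one user's ratings and the genres of each movie.
user_ratings = {"m1": 5.0, "m2": 4.0, "m3": 1.0}
movie_genres = {
    "m1": ["Action", "Sci-Fi"],
    "m2": ["Sci-Fi"],
    "m3": ["Romance"],
}


def genre_profile(user_ratings, movie_genres):
    """Return a normalized rating score and count per genre for one user."""
    if not user_ratings:
        return {}
    user_avg = sum(user_ratings.values()) / len(user_ratings)

    genre_scores = defaultdict(float)
    genre_counts = defaultdict(int)
    for movie_id, rating in user_ratings.items():
        for genre in movie_genres.get(movie_id, []):
            genre_scores[genre] += rating - user_avg
            genre_counts[genre] += 1

    # Normalize by the largest value, but never by less than 1.
    max_score = max(max(genre_scores.values(), default=0), 1)
    max_count = max(max(genre_counts.values(), default=0), 1)

    return {
        genre: {
            "rating": genre_scores[genre] / max_score,
            "count": genre_counts[genre] / max_count,
        }
        for genre in genre_scores
    }


profile = genre_profile(user_ratings, movie_genres)
for genre, values in profile.items():
    print(genre, values)
```

Genres the user rates above their own average get a positive score and genres rated below it get a negative one, so the profile captures taste rather than how generous the user is with ratings.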

#### PROS AND CONS OF CONTENT-BASED FILTERING
Here are some things to consider when you build a content-based filtering algorithm:

Pros:
  1. New items are easy to add, overcoming the COLD START problem. Create the item feature vector, and you’re set to go.  
  2. You don’t require much traffic. Because you can find similarity based on content descriptions, you can start recommending things from the first visit or rating.
  3. It recommends regardless of popularity. Content-based recommenders don’t care which content is popular right now, so a film that nobody ever watched is as likely to be recommended as one that everybody watched.
  
Cons:
  1. Conflates liking with importance. If you like science fiction films with Harrison Ford, the system will also give you films with Harrison Ford that aren’t science fiction.
  2. No serendipity; it’s specialized.
  3. Limited understanding of content. It might be hard to include all features that mark the aspects that make content favorable to a user, which means that the system can easily misunderstand what the user likes.


An example of this is the first Thor movie. It could be that a user likes everything that comes out of the Shakespearian school but normally dislikes action, yet the system interprets the user liking Thor as a liking for action films. Or, as Joseph Konstan says in his “Introduction to Recommender Systems”, if I like Sandra Bullock in action films and Meg Ryan in comedies, but hate Meg Ryan in action films and Sandra Bullock in comedies, there’s no way for that to be captured in the feature vector. That is, unless you start combining them into features such as “Action film starring Sandra Bullock” and “Comedy starring Sandra Bullock,” and so on.
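The difference between plain and combined features can be shown with a small sketch. It compares how many features a liked-film profile shares with a candidate film, first with separate actor and genre features and then with crossed ones. The function `crossed_features` and the sets are hypothetical.

```python
from itertools import product


def crossed_features(actors, genres):
    """Combine every actor with every genre into one crossed feature each."""
    return {f"{genre} starring {actor}" for actor, genre in product(actors, genres)}


# Plain features: the user likes Sandra Bullock in action and Meg Ryan in comedy.
plain_liked = {"Sandra Bullock", "Action", "Meg Ryan", "Comedy"}
plain_candidate = {"Sandra Bullock", "Comedy"}  # a Sandra Bullock comedy

# Crossed features for the same likes and candidates.
crossed_liked = (crossed_features(["Sandra Bullock"], ["Action"])
                 | crossed_features(["Meg Ryan"], ["Comedy"]))
crossed_candidate_a = crossed_features(["Sandra Bullock"], ["Comedy"])
crossed_candidate_b = crossed_features(["Meg Ryan"], ["Comedy"])

plain_overlap = len(plain_liked & plain_candidate)
crossed_overlap_a = len(crossed_liked & crossed_candidate_a)
crossed_overlap_b = len(crossed_liked & crossed_candidate_b)
print(plain_overlap, crossed_overlap_a, crossed_overlap_b)
```

With plain features, the Sandra Bullock comedy matches the profile on both actor and genre even though the user would dislike it. With crossed features it matches on nothing, while the Meg Ryan comedy still matches. The price is a much larger feature space, since every actor and genre combination becomes its own feature.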

### Activity: 
Get into groups and discuss content-based and rule-based recommenders. What can you come up with?