# Recommendation system in Python

## Introduction

__Context__

The goal of this notebook is to build a recommendation system based on the [Entree Chicago Recommendation](http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data) dataset using [Spark](https://spark.apache.org/), a distributed computing engine. Along the way, we cover several methods used in natural language processing and recommendation systems:
- Content-based recommendation
- Collaborative filtering
- TF-IDF algorithm
- Hybrid filtering

__Dataset Information__

The dataset contains two subdirectories. One holds the data about users' sessions and the other the restaurant information. Each subdirectory contains a readme file with more details about the dataset.

__1.__ Users' Sessions

This data records interactions with the Entree Chicago restaurant recommendation system (originally [Web Link](http://infolab.cs.uchicago.edu/entree)) from September 1996 to April 1999. The data is organized into files roughly spanning a quarter year each, with Q3 1996 and Q2 1999 containing only one month each.

Each line in a session file represents a session of user interaction with the system. The (tab-separated) fields are as follows:

- date: Date of the connection
- ip: IP address of the user
- entry point: Users can use a restaurant from any city as an entry point, but they always get recommendations for Chicago restaurants. Entry points have the form nnnX, where nnn is a numeric restaurant ID and X is a character A-H that encodes the city.
- rates: These are all Chicago restaurants. These entries have the form nnnX, where nnn is a numeric restaurant ID and X is a character L-T that encodes the navigation operation (see readme for more details).
- end point: Just the numeric id for the (Chicago) restaurant that the user saw last

Below is an example of the data:

| date                 | ip                 | entry point | rates     | end point |
| -------------------- | ------------------ | ----------- | --------- | --------- |
| 01/Oct/1996:10:08:41 | keeper.tribune.com | 0           | 369N 369P | -1        |
| 01/Oct/1996:11:34:23 | 128.103.79.152     | 0           | 387L 245L | 245       |
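
As an illustration of the field layout described above, here is a minimal pure-Python sketch that parses one session line. It assumes the fields are tab-separated, that every field between the entry point and the last field is an `nnnX` code, and that the last field is the numeric end point. The helper names `parse_code` and `parse_session_line` are hypothetical and not part of the dataset or of Spark.

```python
import re

# Hypothetical helpers; the field layout follows the list above.
CODE_PATTERN = re.compile(r"^(\d+)([A-Z])$")

def parse_code(code):
    """Split a code such as '369N' into (restaurant_id, letter)."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unexpected code: {code!r}")
    return int(match.group(1)), match.group(2)

def parse_session_line(line):
    """Parse one tab-separated session line into a dict."""
    fields = line.rstrip("\n").split("\t")
    date, ip, entry_point = fields[:3]
    # Everything after the entry point is a rate code, except the last field.
    *rates, end_point = fields[3:]
    return {
        "date": date,
        "ip": ip,
        "entry_point": entry_point,
        "rates": [parse_code(code) for code in rates],
        "end_point": int(end_point),
    }

sample = "01/Oct/1996:11:34:23\t128.103.79.152\t0\t387L\t245L\t245"
session = parse_session_line(sample)
print(session["rates"])      # [(387, 'L'), (245, 'L')]
print(session["end_point"])  # 245
```

Splitting each code into an ID and a letter up front makes it easy to group the navigation operations by restaurant later on.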

__2.__ Restaurant information

In addition to the users' interactions, there is also data linking each restaurant ID with its name and features such as "fabulous wine lists", "good for younger kids", and "Ethopian" cuisine. This data is stored by city (e.g. Atlanta, Boston, etc.) and is in the following format:

- id: ID of the restaurant
- name: name of the restaurant
- description: features about the restaurant

Below is an example of the data:

| id      | name          | description                                         |
| ------- | ------------- | --------------------------------------------------- |
| 0000436 | La Fontanella | 214 249 229 125 075 205 052 165                     |
| 0000437 | Retro Bistro  | 137 174 200 196 193 191 192 025 092 076 206 053 166 |

In [2]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
import pyspark.ml.feature as proc
import pyspark.sql.functions as F
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
from pathlib import Path
import numpy as np

project_dir = Path.cwd().parents[1]
data_dir = project_dir / 'data'

In [3]:
sc = SparkContext()
spark = SparkSession(sc)

## Get the data

In [6]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/entree-mld/entree_data.tar.gz'
!cd $data_dir && wget $url && tar -xzf entree_data.tar.gz

--2019-12-01 12:47:10--  http://archive.ics.uci.edu/ml/machine-learning-databases/entree-mld/entree_data.tar.gz
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1074349 (1.0M) [application/x-httpd-php]
Saving to: ‘entree_data.tar.gz’


2019-12-01 12:47:13 (527 KB/s) - ‘entree_data.tar.gz’ saved [1074349/1074349]



## Content based filtering

We use content-based filtering to produce recommendations for a user. From the restaurant features provided in the dataset, we compute a TF-IDF vector for each restaurant, then recommend the restaurants whose vectors are most similar to that of a restaurant the user likes.

In [32]:
# Get mapping for tokens to words
tokens_to_words = dict(sc.textFile(str(data_dir / 'entree/data/features.txt'))\
                    .map(lambda line: line.split('\t'))
                    .collect())

In [33]:
def transform_restaurants_data(file):
    # Each line is: id <tab> name <tab> space-separated feature codes.
    # The last code is split off into its own `price` column.
    raw_data = sc.textFile(file)\
                 .map(lambda line: line.split('\t'))\
                 .map(lambda column: (column[0], column[1], ' '.join(column[2].split(' ')[:-1]), column[2].split(' ')[-1]))
    return raw_data.toDF(('id', 'name', 'description', 'price'))
    
df = transform_restaurants_data(str(data_dir / "entree/data/chicago.txt"))

In [34]:
df.show()

+-------+--------------------+--------------------+-----+
|     id|                name|         description|price|
+-------+--------------------+--------------------+-----+
|0000000|          Moti Mahal|214 035 149 021 1...|  163|
|0000001|             Village|026 249 174 004 1...|  165|
|0000002|Millrose Brewing ...|137 249 194 215 2...|  165|
|0000003|       Dover Straits|137 190 174 249 2...|  165|
|0000004| Eat Your Hearts Out|214 249 249 197 1...|  164|
|0000005|  Pizzeria Uno & Due|026 249 004 132 1...|  163|
|0000006|    Trattoria Franco|214 136 125 078 2...|  170|
|0000007|    Little Bucharest|214 004 132 249 1...|  164|
|0000008|             Pattaya| 026 235 074 205 051|  163|
|0000009|      House of Hunan|026 191 192 024 0...|  164|
|0000010| Morton's of Chicago|137 174 099 249 2...|  168|
|0000011|             Jezebel|214 174 249 200 1...|  166|
|0000012|               Capri|137 249 174 249 1...|  164|
|0000013|          Don Roth's|137 174 249 063 2...|  166|
|0000014|     

__Tf-IDF__

The numeric feature codes given for the restaurants cannot be used directly by Spark's text-processing tools. So we map each code to its words, turning the features of each restaurant into a sentence, and then apply TF-IDF with Spark.

In [35]:
tokenizer = proc.Tokenizer(inputCol="description", outputCol="tokens")
tokens_df = tokenizer.transform(df).rdd.map(lambda row: (row.id, row.name, row.description, row.price, 
                                           [tokens_to_words[token].split(' ') for token in row.tokens]))\
                                            .toDF(('id', 'name', 'description', 'price', 'tokens'))
tokens_df = tokens_df.withColumn('tokens', F.flatten(tokens_df['tokens']))

In [36]:
count_vectorizer = proc.CountVectorizer(inputCol='tokens', outputCol='tf')\
                       .fit(tokens_df)
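
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a made-up corpus. It assumes raw term counts for the term frequency and the smoothed inverse document frequency documented for Spark's `IDF` estimator, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term. The corpus and the helper names `idf` and `tf_idf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (made-up data).
docs = [
    ["indian", "cheap", "casual"],
    ["indian", "romantic"],
    ["italian", "romantic", "romantic"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((n_docs + 1) / (doc_freq[term] + 1))

def tf_idf(doc):
    """Map each term of a document to its TF-IDF weight."""
    counts = Counter(doc)  # raw term counts
    return {term: count * idf(term) for term, count in counts.items()}

weights = tf_idf(docs[2])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 3))
```

In the third document, "italian" appears once but in no other document, while "romantic" appears twice but is shared with another document, so the rarer term still receives the larger weight.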

__Cosine similarity__

Cosine similarity is a common similarity measure. It can be thought of geometrically if one treats a given item's row of the ratings matrix as a vector. For content-based filtering, the rows are the items' feature vectors (here, the TF-IDF vectors), and the similarity between two items is the cosine of the angle between their vectors. The class below computes the matrix of pairwise cosine similarities between the restaurants in our dataset.

In [37]:
class CosinusSimilarityClassifier:
    """Compute the cosine similarity matrix from a token column
    of the input dataframe.
    """
    
    def __init__(self):
        self.normalizer = None
        self.cosine_similarity = None
        self.count_vectorizer = None
        self.idfModel = None
    
    def _fit_tf_idf(self, df, tokens_col, output_col):
        self.count_vectorizer = proc.CountVectorizer(inputCol=tokens_col, outputCol='tf').fit(df)
        tf_df = self.count_vectorizer.transform(df)
        
        self.idfModel = proc.IDF(inputCol="tf", outputCol="idf").fit(tf_df)
        idf_df = self.idfModel.transform(tf_df)
        
        self.normalizer = proc.Normalizer(inputCol='idf', outputCol=output_col)
        return self.normalizer.transform(idf_df)
    
    def _fit_matrix_similarity(self, df, id_col, embedding_col):
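        # Use the column names passed as arguments; the restaurant id becomes the row index
        mat = IndexedRowMatrix(df.select(id_col, embedding_col)\
                                 .rdd.map(lambda row: IndexedRow(int(row[id_col]), row[embedding_col].toArray())))\
                                 .toBlockMatrix()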
        dot = mat.multiply(mat.transpose())
        self.cosine_similarity = dot.toLocalMatrix().toArray()
        return self
    
    def fit(self, df, tokens_col, id_col):
        df = self._fit_tf_idf(df, tokens_col=tokens_col, output_col='norm')
        self._fit_matrix_similarity(df, id_col=id_col, embedding_col='norm')

In [38]:
cos_sim_classifier = CosinusSimilarityClassifier()
cos_sim_classifier.fit(df=tokens_df, tokens_col='tokens', id_col='id')
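
The computation performed by the class above (L2-normalize each row, then multiply the matrix by its transpose) can be illustrated on a small NumPy example. This is only a sketch on made-up vectors and a matrix small enough to fit in memory; it is not a replacement for the distributed version.

```python
import numpy as np

# Made-up feature vectors, one row per item.
vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Scale every row to unit L2 norm.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms

# For unit vectors, the dot product equals the cosine of the angle.
similarity = unit @ unit.T
print(np.round(similarity, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1, while row 2 is orthogonal to both and gets a similarity of 0.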

__Test with users__

For a user who likes the first restaurant of the dataset (Moti Mahal), which serves Indian food, we can test the cosine similarity method.

In [41]:
# List the k highest similarities obtained for the first restaurant of the matrix
k = 10
similarities = np.sort(cos_sim_classifier.cosine_similarity[0])[::-1][:k]
restaurants = [df.select('name').where('id = {}'.format(restaurant)).collect()[0][0] 
               for restaurant in cos_sim_classifier.cosine_similarity[0].argsort()[::-1][:k]]

for i in range(k):
    print(restaurants[i], similarities[i])

Moti Mahal 0.9999999999999999
Standard India 0.9489745950547768
Udupi Palace 0.8097034711916234
Natraj 0.8073763310007285
Shree 0.7855653336442743
Gandhi Indian 0.6909826463283292
Mei Shung 0.6519763691217288
Formosa 0.6495665383643229
Anna Maria Pasteria 0.6437300279238267
Pine Yard 0.6422507002813812
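
Note that the top result is the query restaurant itself, with a similarity of about 1. When serving recommendations, one would usually exclude it. A minimal sketch, assuming the similarity matrix is a NumPy array indexed by restaurant; the helper name `top_k_similar` and the toy matrix are hypothetical:

```python
import numpy as np

def top_k_similar(similarity, item, k):
    """Return the indices of the k items most similar to `item`, excluding itself."""
    scores = similarity[item].astype(float).copy()
    scores[item] = -np.inf  # make sure the query item is never returned
    return np.argsort(scores)[::-1][:k]

# Made-up similarity matrix for four items.
similarity = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])
print(top_k_similar(similarity, item=0, k=2))  # [1 3]
```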


## Collaborative filtering

In [50]:
def transform_session_data(file):
    raw_data = sc.textFile(file)\
                 .map(lambda line: line.split('\t', 3))\
                 .map(lambda features: (features[0], features[1], features[2], 
                                        ' '.join(features[3].split('\t')[:-1]), features[3].split('\t')[-1]))
    return raw_data.toDF(('date', 'ip', 'entry_point', 'rates', 'end_point'))

In [51]:
df_session = transform_session_data(str(data_dir / 'entree/session/session.1996-Q4'))

In [52]:
df_session.show()

+--------------------+--------------------+-----------+--------------------+---------+
|                date|                  ip|entry_point|               rates|end_point|
+--------------------+--------------------+-----------+--------------------+---------+
|01/Oct/1996:10:08...|  keeper.tribune.com|          0|           369N 369P|       -1|
|01/Oct/1996:11:32...|        mail.smc.com|          0|                204L|      505|
|01/Oct/1996:11:34...|      128.103.79.152|          0|           387L 245L|      245|
|01/Oct/1996:11:33...|        mail.smc.com|          0|                369N|       -1|
|01/Oct/1996:11:35...|        mail.smc.com|          0|                465L|      438|
|25/Sep/1996:19:53...|www-r5.proxy.aol.com|          0|       19L 558N 543N|      192|
|01/Oct/1996:11:49...|       proxy.hud.gov|          0|            19L 558L|      558|
|01/Oct/1996:12:04...|www-r5.proxy.aol.com|          0|      598L 483L 421L|      598|
|26/Sep/1996:21:18...|www-q5.proxy.aol.com|

__Get taste of users__

The session data shows us the ratings given by the users. Each user who navigates the website can rate a particular restaurant with a letter. The type of evaluation depends on the letter (see the readme of the data for more details). By extracting the letters used by a user, it is possible to obtain some information about their tastes.

In [53]:
letters_to_personalities = dict(M='spendthrift', P='traditional', Q='creative', R='fun', S='quiet')

def get_personalities(rates, mapping):
    letters = ''.join(filter(str.isalpha, rates))
    personalities = ' '.join(set([mapping.get(letter) for letter in letters if mapping.get(letter)]))
    return personalities

In [54]:
df = df_session.rdd.map(lambda row: (*row, get_personalities(row.rates, mapping=letters_to_personalities)))\
              .toDF(('date', 'ip', 'entry_point', 'rates', 'end_point', 'personality'))

In [55]:
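def get_dummies(df, col_name):
    # Collect the distinct space-separated attributes, skipping the empty string
    col_attributes = set(df.select(col_name).distinct().rdd\
                                        .flatMap(lambda row: row[col_name].split(' '))\
                                        .filter(lambda attribute: attribute != '')\
                                        .collect())
    # One 0/1 column per attribute; a row with several attributes sets several columns
    attributes = F.split(F.col(col_name), ' ')
    col_expr = [F.when(F.array_contains(attributes, ty), 1).otherwise(0).alias(ty) for ty in col_attributes]
    return df.select("*", *col_expr)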
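
Because a session can carry several traits at once (for example "traditional fun"), the encoding we are after is a multi-hot one: one 0/1 indicator per trait, with possibly several set in the same row. A minimal plain-Python sketch of that encoding, where the helper name `multi_hot` is hypothetical:

```python
def multi_hot(values, categories):
    """Encode a space-separated string as one 0/1 indicator per category."""
    present = set(values.split())
    return {category: int(category in present) for category in categories}

categories = ["spendthrift", "traditional", "creative", "fun", "quiet"]
print(multi_hot("traditional fun", categories))
# {'spendthrift': 0, 'traditional': 1, 'creative': 0, 'fun': 1, 'quiet': 0}
```

An empty string yields all zeros, which corresponds to sessions where none of the mapped letters were used.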

In [56]:
columns = ['date', 'ip'] + list(letters_to_personalities.values()) + ['rates']
users_rates = get_dummies(df, 'personality').select(columns)

In [57]:
users_rates.show()

+--------------------+--------------------+-----------+-----------+--------+---+-----+--------------------+
|                date|                  ip|spendthrift|traditional|creative|fun|quiet|               rates|
+--------------------+--------------------+-----------+-----------+--------+---+-----+--------------------+
|01/Oct/1996:10:08...|  keeper.tribune.com|          0|          1|       0|  0|    0|           369N 369P|
|01/Oct/1996:11:32...|        mail.smc.com|          0|          0|       0|  0|    0|                204L|
|01/Oct/1996:11:34...|      128.103.79.152|          0|          0|       0|  0|    0|           387L 245L|
|01/Oct/1996:11:33...|        mail.smc.com|          0|          0|       0|  0|    0|                369N|
|01/Oct/1996:11:35...|        mail.smc.com|          0|          0|       0|  0|    0|                465L|
|25/Sep/1996:19:53...|www-r5.proxy.aol.com|          0|          0|       0|  0|    0|       19L 558N 543N|
|01/Oct/1996:11:49...|      

Our categorical features are not very useful for collaborative filtering: we need more information about the users. However, it is possible to weight the predictions made with cosine similarity using information such as whether a restaurant is quiet or fun. By doing that, we would be able to give better recommendations based on both the restaurant description and the user's tastes.
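
One possible way to do this weighting is sketched below. Everything here is an assumption made for illustration: the function `rerank`, the blending weight `alpha`, the 0/1 trait matrices, and the toy numbers are not part of the notebook or the dataset. The idea is to blend the cosine similarity with the fraction of the user's traits that each restaurant matches.

```python
import numpy as np

def rerank(similarities, restaurant_traits, user_traits, alpha=0.7):
    """Blend content similarity with a taste-match score (hypothetical scheme).

    similarities:      cosine similarity of each candidate to the liked restaurant
    restaurant_traits: 0/1 matrix, one row per candidate, one column per trait
    user_traits:       0/1 vector of the traits inferred for the user
    alpha:             weight given to the content similarity
    """
    # Fraction of the user's traits that each restaurant satisfies.
    taste_match = restaurant_traits @ user_traits / max(user_traits.sum(), 1)
    return alpha * similarities + (1 - alpha) * taste_match

# Made-up example with two traits: [fun, quiet].
similarities = np.array([0.9, 0.8, 0.6])
restaurant_traits = np.array([[1, 0], [0, 1], [0, 1]])
user_traits = np.array([0, 1])  # this user prefers quiet places

scores = rerank(similarities, restaurant_traits, user_traits)
ranking = np.argsort(scores)[::-1]
print(ranking)  # [1 2 0]
```

In this toy example the most similar restaurant drops to last place because it does not match the user's preference for quiet places. The value of `alpha` would have to be tuned on real data.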