# Content-based Recommendation System with Python

### The Overall Concept/Direction of this Recommendation System

Here is the overall concept:

1. Prepare a list of keywords (or a bag of words, BOW) from the text content
1. Count the occurrences of each word using `CountVectorizer()`
1. Use the output from above and calculate the similarities using `cosine_similarity()`
1. Create a function to return the top `N` similar topics using the topic/title as input

Reference: https://www.kaggle.com/code/annalee7/content-based-movie-recommendation-engine/notebook
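
Before going into the details, here is a minimal sketch of steps 2 to 4 on toy data. The three "articles" and their keyword strings are made up for illustration; the real notebook builds them with `jieba` first.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: each row already holds its space-separated keywords.
toy = pd.DataFrame({
    "title": ["cat care", "dog care", "sleep tips"],
    "bow": ["cat pet health vet", "dog pet health vet", "sleep pillow health"],
})

# Step 2: count keyword occurrences (rows = articles, columns = keywords).
counts = CountVectorizer().fit_transform(toy["bow"])

# Step 3: pairwise cosine similarity between all articles.
sim = cosine_similarity(counts)

# Step 4: rank the other articles by similarity to article 0.
order = sim[0].argsort()[::-1]
top = [toy["title"][i] for i in order if i != 0]
print(top)
```

"dog care" shares three keywords with "cat care" while "sleep tips" shares only one, so "dog care" should come first.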

#### Install required libraries

In [1]:
# !pip install paddlepaddle
# !pip install pandas numpy
# !pip install scikit-learn tqdm
# !pip install jieba

#### Import required libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer # counts how often each token appears in a collection of text documents
import jieba
import paddle
import datetime
from zipfile import ZipFile, ZIP_DEFLATED
import os
import jieba.analyse
from tqdm import tqdm

## Step 1: Prepare a list of keywords (or a bag of words, BOW) from text content

### Points to note:

- This is the most time-consuming step; expect it to take around 80% of the development time
- This is just one way of preparing content. **It is not the only way!**

### Remember, our main objective is to get a list of keywords (BOWs) for each article

Here is how:

- To get the BOW, we need to extract keywords from `post_title`, `post_content`, `post_category` and `post_tags`

  - We use [`jieba`](https://github.com/fxsjy/jieba) to split (**tokenize**) each article into keywords
  - You can provide your own word list (a user dictionary in `jieba` terms) to specify your own keywords
  - You can also use `jieba.add_word()` to add your own keywords one at a time. I use both here.

- **In this case**, the BOW has 2 parts

  1. `post_category` and `post_tags`, used directly as keywords
  1. Keywords extracted from `post_title` and `post_content` combined, using `jieba.analyse.extract_tags()` and `jieba.analyse.textrank()`
    - [`jieba.analyse.extract_tags()`](https://github.com/fxsjy/jieba#%E5%9F%BA%E4%BA%8E-tf-idf-%E7%AE%97%E6%B3%95%E7%9A%84%E5%85%B3%E9%94%AE%E8%AF%8D%E6%8A%BD%E5%8F%96) - uses TF-IDF (a weight based on how often a word appears in the article versus how common the word is in general) to extract the top N keywords
    - [`jieba.analyse.textrank()`](https://github.com/fxsjy/jieba#%E5%9F%BA%E4%BA%8E-textrank-%E7%AE%97%E6%B3%95%E7%9A%84%E5%85%B3%E9%94%AE%E8%AF%8D%E6%8A%BD%E5%8F%96) - uses the TextRank algorithm to extract the top N keywords


- Once you have both parts, combine them into one BOW **for each article (one row in the DataFrame)**
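
To get a feel for what TF-IDF ranking does, here is a rough sketch in plain Python. It is an illustration only: the tokenized articles are made up, and the IDF is computed from this tiny corpus, whereas `jieba` ships with its own prebuilt IDF table. The function name `top_keywords` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized articles (jieba would produce the tokens).
docs = [
    ["pillow", "sleep", "neck", "pillow"],
    ["sleep", "insomnia", "stress"],
    ["cat", "dog", "vet"],
]

def top_keywords(doc, corpus, top_k=2):
    """Rank the words of `doc` by term frequency times inverse document frequency."""
    tf = Counter(doc)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)   # documents containing the word
        idf = math.log(len(corpus) / df)
        scores[word] = (count / len(doc)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(top_keywords(docs[0], docs))
```

"pillow" wins because it is frequent in the article and rare elsewhere; "sleep" loses points because another article also uses it.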
 

In [3]:
data = pd.read_json("search.json")

In [4]:
all_data = pd.DataFrame(data["hits"]["hits"])
# all_data[['post_id', 'post_title', 'post_content', 'post_type', 'post_category']]

In [5]:
all_data['keywords'] = all_data[["post_category", "post_tags"]].apply(lambda x: x['post_category'] + x['post_tags'], axis=1)

Create a unique list (or `set()`) of keywords from `post_category` and `post_tags`

In [6]:
full_list = []
for list_item in all_data['keywords']:
    full_list += list_item

categories_and_tags = set(full_list)

Find missing data

In [7]:
# all_data.isnull().sum().to_frame()
# all_data[['keywords', 'post_category', 'post_tags']].isnull().sum().to_frame()
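
If the check above does report missing values, the list concatenation in the earlier cell would fail on those rows. One possible way to guard against that, sketched on a made-up DataFrame (the helper `as_list` is hypothetical, not part of the original notebook), is to turn anything that is not a list into an empty list first:

```python
import pandas as pd

# Hypothetical rows: the second article has no tags at all.
df = pd.DataFrame({
    "post_category": [["health"], ["pets"]],
    "post_tags": [["sleep", "pillow"], None],
})

def as_list(value):
    """Return the value if it is a list, otherwise an empty list."""
    return value if isinstance(value, list) else []

for col in ["post_category", "post_tags"]:
    df[col] = df[col].apply(as_list)

# Element-wise list concatenation now works for every row.
df["keywords"] = df["post_category"] + df["post_tags"]
print(df["keywords"].tolist())
```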

Join `post_title` and `post_content` together

In [8]:
all_data['title_content'] = all_data['post_title'] + all_data['post_content']

Clean up the content by replacing/removing full-width text and unwanted text

**Yours may be different**

In [9]:
%%time

with open("special_chars.txt", encoding="utf8") as f:
    stop_words = list(f.read())

def replace_words(text, stop_words):
    for s in stop_words:
        if(s in text):
            text = text.replace(s, ' ')
    
    text = text.replace("延伸閱讀", '') \
                .replace("\xa0", ' ') \
                .replace("  ", ' ')
                
    return text

all_data['title_content'] = all_data['title_content'].apply(replace_words, args=(stop_words,))

CPU times: total: 15.6 ms
Wall time: 4.36 ms
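
For larger datasets, looping over every unwanted character with `str.replace()` can get slow. A possible alternative, sketched here with a made-up character set (the real one comes from `special_chars.txt`), is to build a translation table once and clean the text in a single pass. Unlike the cell above, this sketch also collapses every run of spaces and trims the ends.

```python
# Hypothetical set of unwanted characters; the notebook reads these from special_chars.txt.
unwanted = "，。！？｜"

# Map every unwanted character to a space in one pass over the text.
table = str.maketrans({ch: " " for ch in unwanted})

def clean(text):
    text = text.translate(table).replace("\xa0", " ")
    # Collapse any run of whitespace into a single space.
    return " ".join(text.split())

print(clean("鼻鼾改善｜選擇枕頭，有助改善！"))
```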


Next, we extract keywords from the combined `post_title` and `post_content` text

- First we are going to use `jieba` to find keywords.
  - To use `jieba`, you need to set it up first.

### What is `jieba.enable_paddle()`?  What is Paddle?

[PaddlePaddle](https://github.com/PaddlePaddle) is a general-purpose machine learning library.  If you know TensorFlow, Keras or PyTorch, you can think of it as a similar framework.

`jieba` normally tokenizes text using a dictionary, which can come from two places:

1. Your own list of keywords (called a "dictionary" in `jieba`)
1. An existing, public keyword list

However, language keeps evolving and new words keep appearing, so it can be better to let a machine-learning model, trained by others, "understand" your content.

To do that, `jieba` has a feature that lets you use Paddle to tokenize your content.  All you need is to run `jieba.enable_paddle()` to enable it.  That is why you need to install `paddlepaddle` in the first place.

This feature is only available starting from `jieba` 0.40.  Make sure you are using the right version.

In [10]:
# Run this line before `jieba.enable_paddle()`, or you will see an error
paddle.enable_static() 
jieba.enable_paddle()

Paddle enabled successfully......
DEBUG 2022-06-30 12:41:16,736 _compat.py:47] Paddle enabled successfully......


Then, we use `jieba.add_word()` to add each keyword in `categories_and_tags` to the dictionary that `jieba` uses to tokenize text.

### What is `freq=5`?  Why are you doing this?

In this case `post_category` and `post_tags` are entered by our content editors, so I think they should carry a higher weight (`freq`, i.e. be more important).  That is why I set it here.

This is optional: you can just use `jieba.add_word(w)` instead.

In [11]:
for w in categories_and_tags:
    jieba.add_word(w, freq=5)

Building prefix dict from the default dictionary ...
DEBUG 2022-06-30 12:41:16,746 __init__.py:113] Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\elleryl\AppData\Local\Temp\jieba.cache
DEBUG 2022-06-30 12:41:16,747 __init__.py:132] Loading model from cache C:\Users\elleryl\AppData\Local\Temp\jieba.cache
Loading model cost 0.598 seconds.
DEBUG 2022-06-30 12:41:17,344 __init__.py:164] Loading model cost 0.598 seconds.
Prefix dict has been built successfully.
DEBUG 2022-06-30 12:41:17,346 __init__.py:166] Prefix dict has been built successfully.


Sometimes you may want to add keywords that are not in `categories_and_tags`.  You can put them in a text file; in this case it is called `custom_dictionary.txt`.

The format of this file is simple: one keyword per line, followed by an optional frequency and an optional POS (part-of-speech) tag, separated by spaces.  [More details here](https://github.com/fxsjy/jieba#%E8%BD%BD%E5%85%A5%E8%AF%8D%E5%85%B8)

To use it, run: `jieba.load_userdict("custom_dictionary.txt")`
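
As a rough illustration of that file format, the sketch below writes a small dictionary file to a temporary folder and reads it back. The three entries, their frequencies and the POS tag are made-up examples, not taken from the real `custom_dictionary.txt`.

```python
import os
import tempfile

# Hypothetical entries: word, optional frequency, optional part-of-speech tag.
entries = [
    "打鼻鼾 5 v",   # word + frequency + POS tag
    "脊醫 5",       # word + frequency
    "蕎麥枕",       # word only
]

path = os.path.join(tempfile.mkdtemp(), "custom_dictionary.txt")
with open(path, "w", encoding="utf8") as f:
    f.write("\n".join(entries) + "\n")

# Read it back to show how each line splits into its fields.
with open(path, encoding="utf8") as f:
    parsed = [line.split() for line in f if line.strip()]
print(parsed)

# With jieba installed, the file could then be loaded with:
# jieba.load_userdict(path)
```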

### This is the most time-consuming code

For around 5,700 Chinese articles (around 100 - 300 words each), TF-IDF keyword extraction takes less than 3 minutes to complete.

For tqdm in pandas, [reference here](https://stackoverflow.com/a/34365537/1802483)

In [12]:
%%time
tqdm.pandas()

jieba.analyse.set_stop_words("stop_words.txt")
topK = 10
all_data['title_content_keywords_tfidf'] = all_data['title_content'].progress_apply(lambda x: jieba.analyse.extract_tags(x, topK=topK) )

# This is slow
# all_data['title_content_keywords_textrank'] = all_data['title_content'].progress_apply(lambda x: jieba.analyse.textrank(x, topK=topK) )

100%|██████████| 3/3 [00:00<00:00, 76.45it/s]

CPU times: total: 46.9 ms
Wall time: 44 ms





Let's pick an article and see the result

In [14]:
print(all_data['title_content_keywords_tfidf'][1])
# print(all_data['title_content_keywords_textrank'][112])

['枕頭', '選擇', '脊醫', '打鼻鼾', '合適', '頸部', '蕎麥', '適合', '時間', '睡覺']


Combine the keywords generated by `jieba.analyse.extract_tags()` with the keywords in `all_data['keywords']`

Update: `jieba.analyse.textrank()` did not seem to generate very good results, so I did not use `title_content_keywords_textrank` here.  If you want to add it back, you can do this:

`all_data['bow_tfidf_kw'] = all_data['title_content_keywords_tfidf'] + all_data['title_content_keywords_textrank'] + all_data['keywords']`

Beware: `jieba.analyse.textrank()` is slow.  It usually takes around 3 hours to complete the whole process.

In [15]:
all_data['bow_tfidf_kw'] = all_data['title_content_keywords_tfidf'] + all_data['keywords']

The above produces a `list` in each row of the `Series`, so I convert each `list` to a plain space-separated string

In [16]:
all_data['bow_to_str'] = all_data['bow_tfidf_kw'].progress_apply(lambda x: ' '.join(x))

100%|██████████| 3/3 [00:00<?, ?it/s]


## Congratulations! You have finally finished Step 1!

And yes!  This is how time-consuming, and how important, data preparation is in every machine learning/AI/data science project.

So don't think that the algorithm is the most important part of machine learning/AI/data science.  It is not.

Think about this: even if you have the most perfect algorithm in the world, wrong data will still produce wrong results.  Period.

## Step 2: Count the occurrences of each word using CountVectorizer()

Hold on, we move fast from here

In [17]:
%%time
cv = CountVectorizer()
cv_mx = cv.fit_transform(all_data['bow_to_str'])

CPU times: total: 0 ns
Wall time: 1.3 ms
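
One thing worth checking at this step: with its default settings, `CountVectorizer` only keeps tokens of two or more characters, so single-character Chinese keywords are silently dropped. The sketch below shows the effect on made-up keyword strings, together with one possible workaround (a custom `token_pattern`). Whether you want single-character keywords at all depends on your content.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical keyword strings; "貓" and "狗" are single-character keywords.
bows = ["貓 狗 疾病", "枕頭 睡覺 疾病"]

default_cv = CountVectorizer()
default_cv.fit(bows)
print(sorted(default_cv.vocabulary_))   # single-character keywords are missing

# Treat every whitespace-separated chunk as one token instead.
keep_all_cv = CountVectorizer(token_pattern=r"\S+")
keep_all_cv.fit(bows)
print(sorted(keep_all_cv.vocabulary_))
```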


## Step 3: Use the output from above and calculate the similarities using cosine_similarity()

In [18]:
cosine_sim = cosine_similarity(cv_mx, cv_mx)
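
If you wonder what `cosine_similarity()` computes, here is a small sketch that does the same calculation by hand with NumPy: the dot product of two count vectors divided by the product of their lengths. The count matrix is made up for illustration.

```python
import numpy as np

# Hypothetical count matrix: 3 articles (rows) x 4 keywords (columns).
counts = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 2],
], dtype=float)

# cos(a, b) = (a . b) / (|a| * |b|), computed for every pair of rows at once.
norms = np.linalg.norm(counts, axis=1, keepdims=True)
unit = counts / norms
sim = unit @ unit.T

print(np.round(sim, 3))
```

Each article has a similarity of 1 with itself, and articles that share no keywords score 0.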

**Sidenote: Back up and zip the old cosine data file as `models/model_YYYY-MM-DD-HH-MM.zip`, then save the latest one as `models/model_latest.npy`**

In [19]:
%%time
date_time = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M")
backup_file = f"models/model_{date_time}.zip"
model_file = "models/model_latest.npy" # np.save() appends .npy if the name does not already end with it

# Make sure the output folder exists
os.makedirs("models", exist_ok=True)

# Back up and remove the old file, if there is one (there is none on the first run)
if os.path.exists(model_file):
    with ZipFile(backup_file, "w", ZIP_DEFLATED, compresslevel=9) as z:
        z.write(model_file)
    os.remove(model_file)

# Write new file
np.save(model_file, cosine_sim)

CPU times: total: 47.2 s
Wall time: 47.5 s


### Pause here for a moment...

The process of creating a "model" is actually complete here.  The steps below generate results using the file created above.

In a real project, we usually create a cron/scheduled job to generate the model file.  Then we create an API to:

- Read the model file
- Get the result by passing in an existing article title

So what we do below should be considered a separate API, which means **it has no code dependency on the steps above.**

I only put the code below here for your reference, to show how it works.

**Repeat: In a real-life project, the code below would be a separate API**
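
To make that separation a bit more concrete, here is a rough sketch of what the two sides could look like. It assumes the scheduled job also saves the article titles, in the same order as the matrix rows, because the `.npy` file only contains numbers. The file name `titles.json`, the function `load_and_recommend` and the similarity values are all hypothetical.

```python
import json
import os
import tempfile

import numpy as np

workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model_latest.npy")
titles_path = os.path.join(workdir, "titles.json")   # hypothetical companion file

# --- scheduled job side: persist the matrix and the titles in matching order ---
np.save(model_path, np.array([[1.0, 0.8, 0.1],
                              [0.8, 1.0, 0.3],
                              [0.1, 0.3, 1.0]]))
with open(titles_path, "w", encoding="utf8") as f:
    json.dump(["article A", "article B", "article C"], f)

# --- API side: knows nothing about jieba or CountVectorizer ---
def load_and_recommend(title, n=2):
    sim = np.load(model_path)
    with open(titles_path, encoding="utf8") as f:
        titles = json.load(f)
    if title not in titles:
        return []
    idx = titles.index(title)
    ranked = np.argsort(sim[idx])[::-1]          # most similar first
    return [titles[i] for i in ranked if i != idx][:n]

print(load_and_recommend("article C"))
```

A real API would load the files once at start-up rather than on every call; the sketch reloads them only to keep the example short.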

#### Load the model from file and get result

In [20]:
cosine_sim = np.load(model_file)

In [21]:
print(all_data.index)
print(type(all_data.index))

RangeIndex(start=0, stop=3, step=1)
<class 'pandas.core.indexes.range.RangeIndex'>


In [22]:
print(all_data['post_title'])
print(type(all_data['post_title']))

0            貓狗不舒服  主人要知道！ 洞悉10大常見疾病
1    鼻鼾改善｜脊醫教你選擇枕頭4大重點 有助改善疼痛、打鼻鼾情況！
2        精神健康｜明明好攰 點解都係瞓得唔好？失眠原因知多少！
Name: post_title, dtype: object
<class 'pandas.core.series.Series'>


### Create list of indices for later matching

In [23]:
indices = pd.Series(all_data.index, index = all_data['post_title'])
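
A caveat with this lookup: if two articles share the same title, looking up that title returns several positions instead of one. The sketch below shows the behaviour on made-up titles, plus one possible way to handle it (keeping the first occurrence only).

```python
import pandas as pd

# Hypothetical titles; the first and last are identical.
titles = pd.Series(["post A", "post B", "post A"])

lookup = pd.Series(titles.index, index=titles)
print(type(lookup["post B"]))   # a single integer position
print(type(lookup["post A"]))   # a Series, because the label is not unique

# One option: keep only the first occurrence of each title.
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(unique_lookup["post A"])
```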

## Step 4: Create a function to return the top N similar topics using the topic/title as input

In [24]:
def recommend_article(title, n = 10, cosine_sim = cosine_sim):
    # retrieve the index of the matching article title
    if title not in indices.index:
        print("Article not found")
        return
    else:
        idx = indices[title]
    
    # cosine similarity scores of all articles in descending order
    scores = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    
    # indexes of the top n most similar articles
    # use 1:n+1 because position 0 is the article itself
    top_n_idx = list(scores.iloc[1:n + 1].index)
        
    return all_data['post_title'].iloc[top_n_idx]

## Test your machine learning!

In [26]:
recommend_article("貓狗不舒服  主人要知道！ 洞悉10大常見疾病", n = 10, cosine_sim = cosine_sim)

1    鼻鼾改善｜脊醫教你選擇枕頭4大重點 有助改善疼痛、打鼻鼾情況！
2        精神健康｜明明好攰 點解都係瞓得唔好？失眠原因知多少！
Name: post_title, dtype: object