<h1 style="text-align: center;">LLM Recommendation System using RAG with Query Expansion and Cross-Encoder re-Ranking</h1>
<img src="https://media.licdn.com/dms/image/D4D12AQHY76w85U8W5g/article-cover_image-shrink_600_2000/0/1695787886133?e=2147483647&v=beta&t=JAtf8m765q6X46fw6yAoF1VKyd9hTKtjp6YfPNqCCYM" width=500px>

## Introduction
Recently, DeepLearning.AI came up with a new course called "Advanced Retrieval for AI with Chroma". I took some notes and you can see it below:

The classic approach we use RAG is basically to embbed the documents, store them in a Vector Database and when we ask a question about the document, this question is encoded using the same technique as before. Finally, we get the top K parts of the document that have similar embeddings to the question.

In other words, you transform a model that only REMEMBERS some facts into a model that PROCESSES information.

However... the documents returned that have similar topics don't necessarily contain the answer to the user's question (query). So what do we do?

## Query Expansion Technique
The first technique is called "Query expansion", where we rewrite the user's query by adding / removing words so that the search for the most similar documents contains the answer.

There are two ways of doing this:
- **First one**: The query goes through an LLM whose purpose is to generate questions with a similar topic to the original query and use all these new questions to search for the most similar documents.
- **Second one**: The query goes through an LLM which generates a possible answer to the user's question and, based on this question and the possible answer, we search for the documents.

## Cross-Encoder Re-Ranking Technique
The second technique is called Cross-encoder re-ranking.

Bi-encoders process two inputs separately (query -> BERT -> Pooling -> vector -> Cosine similarity).

Cross-encoders, on the other hand, process the two inputs together as a single input (queries -> BERT -> Classifier -> 0...1), which allows the model to directly compare and contrast the inputs, understanding their relationships in a more integrated and detailed way. In the end, they generate a score of how similar these inputs are, with a higher score meaning greater similarity.

In other words, we're going to use a cross-encoder by passing the user's query and one of the K documents selected from the similarity search as inputs. These selected documents can be those that we obtained using Query Expansion techniques.

We then get the scores for each document and can sort them according to which are the most similar! Hence the name re-ranking. This way, we only use the most similar documents as input for our final LLM.

## Embedding Adaptors Technique
The third technique is called Embedding Adaptors, which uses user feedback to adapt the search for the most similar documents (like an RLHF). We will receive an output from the LLM and we will evaluate it with 1 and -1 (1 and -1 because these are the respective cosine similarity values when the vectors are in the same and opposite directions, respectively).

## Conclusion
Given this background let's jump into the project.

In [129]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from os import chdir
from warnings import filterwarnings
from tqdm import tqdm

from sentence_transformers import SentenceTransformer, CrossEncoder

from qdrant_client import QdrantClient, models 
from qdrant_client.http.models import Distance, VectorParams, PointStruct
from qdrant_client.models import Filter
from qdrant_client.http import models

import openai


chdir(r'C:\Users\Anwar\Desktop\VSCode\LLMRecommendation')
filterwarnings('ignore')
plt.style.use('ggplot')
openai.api_key = "SUA_API_KEY"

In [144]:
# Helper Functions
def wrap_words(scored_point):
    return df.iloc[scored_point.id, :]['sentences']

# Function to create related queries
def generate_related_queries(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that helps people to find good articles."
            "Your users are trying to find good articles about certain area. "
            "Suggest up to five additional related search texts to help them find the article they need"
            "Suggest only short search texts without compound sentences. Suggest a variety of texts that cover different aspects of the topic."
            "Make sure they are complete search texts, and that they are related to the original text."
            "Output one text per line. Do not number the texts."
        },
        {"role": "user", "content": query}
    ]

    content = openai.ChatCompletion.create(
            model = model,
            messages = messages
            )['choices'][0]['message']['content']
    content = content.split("\n")
    return content

# Function to create the answer
def generate_titles(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that helps people to find good articles."
            "Your users are trying to find good articles about certain area. "
            "Create five possible titles for the articles they need"
            "Create only short titles without compound sentences. Suggest a variety of titles that cover the main aspect of the topic."
            "Make sure they are complete titles, and that they are related to the original text."
            "Output one title per line. Do NOT number the texts."
        },
        {"role": "user", "content": query}
    ]

    content = openai.ChatCompletion.create(
            model = model,
            messages = messages
            )['choices'][0]['message']['content']
    content = content.split("\n")
    return content

# Function to create related words
def generate_related_words(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that helps people to find good articles."
            "Your users are trying to find good articles about certain area. "
            "Suggest up to 10 related words for the articles they need"
            "Suggest a variety of words that cover the main aspect of the topic."
            "Make sure they are related to the original text."
            "Output one word per line. Do NOT number the words."
        },
        {"role": "user", "content": query}
    ]

    content = openai.ChatCompletion.create(
            model = model,
            messages = messages
            )['choices'][0]['message']['content']
    content = content.split("\n")
    return content

In [2]:
# Loading the data
df = pd.read_csv('articles.csv')
df.head()

Unnamed: 0,author,claps,reading_time,link,title,text
0,Justin Lee,8.3K,11,https://medium.com/swlh/chatbots-were-the-next...,Chatbots were the next big thing: what happene...,"Oh, how the headlines blared:\nChatbots were T..."
1,Conor Dewey,1.4K,7,https://towardsdatascience.com/python-for-data...,Python for Data Science: 8 Concepts You May Ha...,If you’ve ever found yourself looking up the s...
2,William Koehrsen,2.8K,11,https://towardsdatascience.com/automated-featu...,Automated Feature Engineering in Python – Towa...,Machine learning is increasingly moving from h...
3,Gant Laborde,1.3K,7,https://medium.freecodecamp.org/machine-learni...,Machine Learning: how to go from Zero to Hero ...,If your understanding of A.I. and Machine Lear...
4,Emmanuel Ameisen,935,11,https://blog.insightdatascience.com/reinforcem...,Reinforcement Learning from scratch – Insight ...,Want to learn about applied Artificial Intelli...


In [3]:
# Number of rows (articles)
df.shape

(337, 6)

Now, let's create an embedding example concatenating the title with the article. And for this we're going to use the BERT base Multilingual, from QCRI. You can use this model <a href="https://huggingface.co/QCRI/bert-base-multilingual-cased-pos-english" target="_blank">clicking here</a>

In [4]:
# Picking the first instance as an example
fi = df.iloc[0, :]
example = fi['title'] + '\n\n' + fi['text']

# The model
model_name = "bert-base-uncased"
encoder = SentenceTransformer(model_name_or_path = model_name)

# Encoding the example
example_embedding = encoder.encode(example)

# Visualizing the first 30 numbers
print(f"Length of the embedding vector: {len(example_embedding)}")
print(example_embedding[:30])

.gitattributes:   0%|          | 0.00/491 [00:00<?, ?B/s]

LICENSE:   0%|          | 0.00/11.4k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

(…)kage/Data/com.apple.CoreML/model.mlmodel:   0%|          | 0.00/165k [00:00<?, ?B/s]

weight.bin:   0%|          | 0.00/532M [00:00<?, ?B/s]

(…)sk/float32_model.mlpackage/Manifest.json:   0%|          | 0.00/617 [00:00<?, ?B/s]

model.onnx:   0%|          | 0.00/532M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

No sentence-transformers model found with name C:\Users\Anwar/.cache\torch\sentence_transformers\bert-base-uncased. Creating a new one with MEAN pooling.


'NoneType' object has no attribute 'cadam32bit_grad_fp32'
Length of the embedding vector: 768
[-0.0151996   0.01303956  0.39059126  0.08283757  0.15611067  0.05184288
  0.09080029  0.6347407   0.10698275 -0.23789361  0.21131742 -0.20583372
 -0.2811767   0.2144163  -0.10719037  0.35887092  0.21524167 -0.07407676
 -0.20916578  0.416417    0.21774074  0.12373762 -0.04202881  0.8123086
  0.3441775  -0.1543962  -0.17182498 -0.21816976 -0.22428831 -0.1127854 ]


In [5]:
# Creating a new column with the embedding for all the rows
sentences, encoded_sentences = list(), list()
for title, text in tqdm(zip(df.title.values, df.text.values)):
    sentence = title + '\n\n' + text
    sentences.append(sentence)
    encoded_sentences.append(encoder.encode(sentence))

df['sentences'] = sentences    
df['encoded_sentences'] = encoded_sentences

337it [04:48,  1.17it/s]


In [6]:
# Seeing the result
df.head()

Unnamed: 0,author,claps,reading_time,link,title,text,sentences,encoded_sentences
0,Justin Lee,8.3K,11,https://medium.com/swlh/chatbots-were-the-next...,Chatbots were the next big thing: what happene...,"Oh, how the headlines blared:\nChatbots were T...",Chatbots were the next big thing: what happene...,"[-0.015199602, 0.01303956, 0.39059126, 0.08283..."
1,Conor Dewey,1.4K,7,https://towardsdatascience.com/python-for-data...,Python for Data Science: 8 Concepts You May Ha...,If you’ve ever found yourself looking up the s...,Python for Data Science: 8 Concepts You May Ha...,"[-0.20942944, 0.16949809, 0.21063836, -0.04316..."
2,William Koehrsen,2.8K,11,https://towardsdatascience.com/automated-featu...,Automated Feature Engineering in Python – Towa...,Machine learning is increasingly moving from h...,Automated Feature Engineering in Python – Towa...,"[-0.16634433, 0.07990255, 0.2653013, 0.0159984..."
3,Gant Laborde,1.3K,7,https://medium.freecodecamp.org/machine-learni...,Machine Learning: how to go from Zero to Hero ...,If your understanding of A.I. and Machine Lear...,Machine Learning: how to go from Zero to Hero ...,"[-0.033402447, 0.12892093, 0.30454758, 0.03839..."
4,Emmanuel Ameisen,935,11,https://blog.insightdatascience.com/reinforcem...,Reinforcement Learning from scratch – Insight ...,Want to learn about applied Artificial Intelli...,Reinforcement Learning from scratch – Insight ...,"[-0.21362075, -0.039915938, 0.31208166, 0.0312..."


Nice! Now we have an extra column "encoded_sentences" which has the encoded title, text and author information about the article. Now we're going to store this information in a vector database (open-source) called Qdrant.

## Qdrant
Qdrant is an open-source vector database and allows you to create both local and remote connection with databases. For this project I'll use the local option.

In [7]:
# Setting up
client = QdrantClient(path = "./vectordb")

In [8]:
# Creating a collection in the database that will store the embeddings
client.create_collection(
    collection_name = "medium-articles",
    vectors_config = models.VectorParams(
        size = encoder.get_sentence_embedding_dimension(),
        distance = models.Distance.COSINE # Using cosine similarity to calculate the distance
    )
)

print(client.get_collections())

collections=[CollectionDescription(name='medium-articles')]


In [9]:
# Creating the vector points
points = list()

for articleID, (author, claps, reading_time, link, title, text, sentences, encoded_sentences) in enumerate(df.values):
    points.append(PointStruct(
        id = articleID,
        vector = encoded_sentences.tolist(),
        payload = {
            'claps': claps if str(claps) not in ['None', 'nan'] else '',
            'reading_time': reading_time if str(reading_time) not in ['None', 'nan'] else '',
            'link': link if str(link) not in ['None', 'nan'] else '',
            'title': title if str(title) not in ['None', 'nan'] else '',
            'author': author if str(author) not in ['None', 'nan'] else ''
        }
    ))

In [10]:
# Now we can upload all these points to the database
SIZE = 1000
n_chunks = np.ceil(len(points)/SIZE)

for i, points_chunk in enumerate(np.array_split(points, n_chunks)):
    client.upsert(
        collection_name = 'medium-articles',
        wait = True,
        points = points_chunk.tolist()
    )

Now it's done! Let's put an example query, encode it and retrieval the most similar documents.

In [131]:
# Example query
query = "How to create chatbots"
query_vector = encoder.encode(query).tolist()

# Visualizing the first 30 numbers
print(f"Length of the query vector: {len(query_vector)}")
print(query_vector[:30])

Length of the query vector: 768
[0.13361723721027374, -0.3658362925052643, -0.4088641107082367, 0.2660626471042633, 0.500251293182373, -0.3838483393192291, 0.20698142051696777, 0.15636451542377472, 0.15276719629764557, -0.23971343040466309, 0.023343849927186966, -0.017142565920948982, -0.1754218339920044, 0.12766329944133759, -0.6844878792762756, -0.30942267179489136, -0.45385321974754333, 0.19815373420715332, -0.0013708818005397916, 0.429248571395874, -0.2735697329044342, 0.23619826138019562, -0.4373869299888611, 0.2992454469203949, 0.3642025589942932, 0.21089240908622742, -0.3363164961338043, 0.3862217366695404, -0.2940928339958191, -0.4071648418903351]


In [132]:
# Searching for the most similar documents

results_original = client.search(
            collection_name = "medium-articles",
            query_vector = query_vector,
            query_filter = None
          )

for result in results_original[:5]:
    print(f"Title: {result.payload['title']} - {result.payload['author']}\nLink: {result.payload['link']}\n")

Title: What Are The Best Intelligent Chatbots or AI Chatbots Available Online? - Maruti Techlabs
Link: https://chatbotsmagazine.com/which-are-the-best-intelligent-chatbots-or-ai-chatbots-available-online-cc49c0f3569d?source=tag_archive---------8----------------

Title: How to Create a Chatbot Without Coding a Single Line - Dominik Felix
Link: https://chatbotsmagazine.com/how-to-create-a-chatbot-without-coding-a-single-line-e716840c7245?source=tag_archive---------8----------------

Title: How to learn Deep Learning in 6 months – Towards Data Science - Bargava
Link: https://towardsdatascience.com/how-to-learn-deep-learning-in-6-months-e45e40ef7d48?source=tag_archive---------5----------------

Title: The Complete Beginner’s Guide To Chatbots – Chatbots Magazine - Matt Schlicht
Link: https://chatbotsmagazine.com/the-complete-beginner-s-guide-to-chatbots-8280b7b906ca?source=tag_archive---------3----------------

Title: How we grew from 0 to 4 million women on our fashion app, with a vertica

You see that we got a good result. Now, let's try to use Query Expansion to improve the result even more.

## Query Expansion
Now we're going to use the query expansion technique, which allow us to get more accurate outputs from the retrieval.

### Related Queries

In [91]:
# Generating related queries
original_query = "How to create chatbots"
related_queries = generate_related_queries(original_query)

for query in related_queries:
    print(query)

- Chatbot development best practices
- Benefits of using chatbots in customer service
- How to design effective conversational flows for chatbots
- Chatbot platforms comparison
- Examples of successful chatbot implementation in businesses


In [92]:
# Query using all the queries (original and generated)
queries = [original_query] + related_queries

retrieved = list()
for query in queries:
    query_vector = encoder.encode(query)
    results = client.search(collection_name = "medium-articles", query_vector = query_vector, query_filter = None, limit = 5)
    retrieved.append(results)

In [93]:
# Deduplicate the retrieved documents
unique_documents = list()
titles = set()
for documents in retrieved:
    for document in documents:
        doc_title = document.payload['title']
        if doc_title not in titles:
            unique_documents.append(document)
            titles.add(doc_title)

In [94]:
# Sorting the documents by the score and shoing it's title and link
sorted_documents_related_queries = sorted(unique_documents, key = lambda x: x.score, reverse = True)

for result in sorted_documents_related_queries[:5]:
    print(f"Title: {result.payload['title']} - {result.payload['author']}\nLink: {result.payload['link']}\n")

Title: Reinforcement Learning from scratch – Insight Data - Emmanuel Ameisen
Link: https://blog.insightdatascience.com/reinforcement-learning-from-scratch-819b65f074d8?source=---------4----------------

Title: Chatbots, could we talk? – AIMA: AI Marketing Magazine – Medium - Gabriel Jiménez
Link: https://medium.com/aimarketingassociation/chatbots-could-we-talk-edd6ccbd8f5a?source=---------7----------------

Title: Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data - Stefan Kojouharov
Link: https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463?source=tag_archive---------6----------------

Title: The beginner’s guide to conversational commerce – The Startup – Medium - Justin Lee
Link: https://medium.com/swlh/the-beginners-guide-to-conversational-commerce-96f9c7dbaefb?source=---------5----------------

Title: Architecting a Machine Learning System for Risk – Airbnb Engineering & Data Science – Mediu

We can see that it is not completely focused on Chatbots as the first result. Now, let's use the second approach, generating possible answers instead of multiple queries. Actually, as this problem is a recommendation system, does not make sense to generate answers, but possible TITLES!

### Example Titles

In [117]:
# Generating the titles
original_query = "How to create chatbots"
titles = generate_titles(original_query)

for query in titles:
    print(query)

1. "A Beginner's Guide to Creating Chatbots"
2. "Step-by-Step Tutorial: How to Create Your Own Chatbot"
3. "The Basics of Building Chatbots: A Comprehensive Guide"
4. "Mastering the Art of Chatbot Development: Tips and Tricks"
5. "From Zero to Chatbot: A Crash Course in Creating Virtual Assistants"


In [118]:
# Query using all the queries (original and generated)
queries = [original_query] + titles

retrieved = list()
for query in queries:
    query_vector = encoder.encode(query)
    results = client.search(collection_name = "medium-articles", query_vector = query_vector, query_filter = None, limit = 5)
    retrieved.append(results)

In [119]:
# Deduplicate the retrieved documents
unique_documents = list()
titles = set()
for documents in retrieved:
    for document in documents:
        doc_title = document.payload['title']
        if doc_title not in titles:
            unique_documents.append(document)
            titles.add(doc_title)

In [120]:
# Sorting the documents by the score and shoing it's title and link
sorted_documents_example_titles = sorted(unique_documents, key = lambda x: x.score, reverse = True)

for result in sorted_documents_example_titles[:5]:
    print(f"Title: {result.payload['title']} - {result.payload['author']}\nLink: {result.payload['link']}\n")

Title: How you can train an AI to convert your design mockups into HTML and CSS - Emil Wallner
Link: https://medium.freecodecamp.org/how-you-can-train-an-ai-to-convert-your-design-mockups-into-html-and-css-cc7afd82fed4?source=tag_archive---------9----------------

Title: Artificial Intelligence Top 10 Articles — June 2018 - Brandon Morelli
Link: https://techburst.io/artificial-intelligence-top-10-articles-june-2018-4b3fa7572b46?source=---------8----------------

Title: 30 Amazing Python Projects for the Past Year (v.2018) - Mybridge
Link: https://medium.mybridge.co/30-amazing-python-projects-for-the-past-year-v-2018-9c310b04cdb3?source=tag_archive---------7----------------

Title: 30 Amazing Machine Learning Projects for the Past Year (v.2018) - Mybridge
Link: https://medium.mybridge.co/30-amazing-machine-learning-projects-for-the-past-year-v-2018-b853b8621ac7?source=tag_archive---------4----------------

Title: Machine Learning Exercises In Python, Part 1 – John Wittenauer – Medium - 

It is worst than the first approach and even worst than the common approach, without query expansion.

While I was writing the text above, I had an ideia: "Instead of creating possibile titles or related queries, why not generating single words that are related?". In our case, the words could be "NLP", "Deep Learning", "Chat", "AI" etc.

### Related words
Let's try this:

In [122]:
# Generating the titles
original_query = "How to create chatbots"
words = generate_related_words(original_query)

for query in words:
    print(query)

1. Artificial Intelligence
2. Natural Language Processing
3. Machine Learning
4. Bot Development
5. Conversational User Interface
6. Virtual Assistant
7. Text Analytics
8. Chatbot Platforms
9. Programming
10. User Experience Design


In [123]:
# Query using all the queries (original and generated)
queries = [original_query] + words

retrieved = list()
for query in queries:
    query_vector = encoder.encode(query)
    results = client.search(collection_name = "medium-articles", query_vector = query_vector, query_filter = None, limit = 5)
    retrieved.append(results)

In [124]:
# Deduplicate the retrieved documents
unique_documents = list()
titles = set()
for documents in retrieved:
    for document in documents:
        doc_title = document.payload['title']
        if doc_title not in titles:
            unique_documents.append(document)
            titles.add(doc_title)

In [125]:
# Sorting the documents by the score and shoing it's title and link
sorted_documents_related_words = sorted(unique_documents, key = lambda x: x.score, reverse = True)

for result in sorted_documents_related_words[:5]:
    print(f"Title: {result.payload['title']} - {result.payload['author']}\nLink: {result.payload['link']}\n")

Title: 30 Amazing Python Projects for the Past Year (v.2018) - Mybridge
Link: https://medium.mybridge.co/30-amazing-python-projects-for-the-past-year-v-2018-9c310b04cdb3?source=tag_archive---------7----------------

Title: 30 Amazing Machine Learning Projects for the Past Year (v.2018) - Mybridge
Link: https://medium.mybridge.co/30-amazing-machine-learning-projects-for-the-past-year-v-2018-b853b8621ac7?source=tag_archive---------4----------------

Title: Loading a TensorFlow graph with the C++ API – Jim Fleming – Medium - Jim Fleming
Link: https://medium.com/jim-fleming/loading-a-tensorflow-graph-with-the-c-api-4caaff88463f?source=tag_archive---------1----------------

Title: What Are The Best Intelligent Chatbots or AI Chatbots Available Online? - Maruti Techlabs
Link: https://chatbotsmagazine.com/which-are-the-best-intelligent-chatbots-or-ai-chatbots-available-online-cc49c0f3569d?source=tag_archive---------8----------------

Title: How to Create a Chatbot Without Coding a Single Line

Only the last result is a good output. Now, let's try using cross-encoder re-ranking.

## Cross-encoder re-ranking
Now, we're going to use the cross-encoder to capture the similarity and contrasts between the user query and the documents retrieved.

In [130]:
# Creating an instance of the cross-encoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Here I'm using the retrieved documents using the original query, without query expansion.

In [145]:
# Getting the scores for each pair of (query, document)
pairs = [[query, wrap_words(doc)] for doc in results_original]
scores = cross_encoder.predict(pairs)
print("Scores:")
for score in scores:
    print(score)

Scores:
2.3161333
6.451616
-10.328022
2.6004615
-9.616787
-10.434868
-8.065396
-8.860885
-8.860885
-9.980819


In [158]:
# Getting the results
top_5_indexes = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]

for index in top_5_indexes:
    result = results_original[index]
    print(f"Title: {result.payload['title']} - {result.payload['author']}\nLink: {result.payload['link']}\n")

Title: How to Create a Chatbot Without Coding a Single Line - Dominik Felix
Link: https://chatbotsmagazine.com/how-to-create-a-chatbot-without-coding-a-single-line-e716840c7245?source=tag_archive---------8----------------

Title: The Complete Beginner’s Guide To Chatbots – Chatbots Magazine - Matt Schlicht
Link: https://chatbotsmagazine.com/the-complete-beginner-s-guide-to-chatbots-8280b7b906ca?source=tag_archive---------3----------------

Title: What Are The Best Intelligent Chatbots or AI Chatbots Available Online? - Maruti Techlabs
Link: https://chatbotsmagazine.com/which-are-the-best-intelligent-chatbots-or-ai-chatbots-available-online-cc49c0f3569d?source=tag_archive---------8----------------

Title: Turn your Raspberry Pi into homemade Google Home – Becoming Human: Artificial Intelligence Magazine - Keval Patel
Link: https://becominghuman.ai/turn-your-raspberry-pi-into-homemade-google-home-9e29ad220075?source=tag_archive---------6----------------

Title: An intro to Machine Learni

In [159]:
# Getting the documents
results_reranking = list()
for k in top_5_indexes:
    results_reranking.append(results_original[k])

## Comparison of these methods
Now, let's create a Dataframe with all the results that we've got using the common approach and with query expansion.

In [161]:
titles_result, links_result, authors_result, scores_result, techniques = list(), list(), list(), list(), list()
for common, ex_queries, ex_titles, ex_words, ex_reranking in zip(results_original[:5], sorted_documents_related_queries[:5], sorted_documents_example_titles[:5], sorted_documents_related_words[:5], results_reranking):
    title1, link1, author1, score1 = common.payload['title'], common.payload['link'], common.payload['author'], common.score
    title2, link2, author2, score2 = ex_queries.payload['title'], ex_queries.payload['link'], ex_queries.payload['author'], ex_queries.score
    title3, link3, author3, score3 = ex_titles.payload['title'], ex_titles.payload['link'], ex_titles.payload['author'], ex_titles.score
    title4, link4, author4, score4 = ex_words.payload['title'], ex_words.payload['link'], ex_words.payload['author'], ex_words.score
    title5, link5, author5, score5 = ex_reranking.payload['title'], ex_reranking.payload['link'], ex_reranking.payload['author'], ex_reranking.score
    
    titles_result += [title1, title2, title3, title4, title5]
    links_result += [link1, link2, link3, link4, link4]
    authors_result += [author1, author2, author3, author4, author4]
    scores_result += [score1, score2, score3, score4, score4]
    techniques += ['Original', 'Related Queries', 'Example Titles', 'Related Words', 'Re-Ranking']

In [162]:
# Creating the dataframe
comparison = pd.DataFrame({
    'title': titles_result,
    'author': authors_result,
    'link': links_result,
    'score': scores_result,
    'technique': techniques
})

comparison.head()

Unnamed: 0,title,author,link,score,technique
0,What Are The Best Intelligent Chatbots or AI C...,Maruti Techlabs,https://chatbotsmagazine.com/which-are-the-bes...,0.622205,Original
1,Reinforcement Learning from scratch – Insight ...,Emmanuel Ameisen,https://blog.insightdatascience.com/reinforcem...,0.716298,Related Queries
2,How you can train an AI to convert your design...,Emil Wallner,https://medium.freecodecamp.org/how-you-can-tr...,0.779914,Example Titles
3,30 Amazing Python Projects for the Past Year (...,Mybridge,https://medium.mybridge.co/30-amazing-python-p...,0.650976,Related Words
4,How to Create a Chatbot Without Coding a Singl...,Mybridge,https://medium.mybridge.co/30-amazing-python-p...,0.650976,Re-Ranking


In [163]:
# Sorting by the score
comparison.sort_values(by = 'score', ascending = False).head(10)

Unnamed: 0,title,author,link,score,technique
2,How you can train an AI to convert your design...,Emil Wallner,https://medium.freecodecamp.org/how-you-can-tr...,0.779914,Example Titles
7,Artificial Intelligence Top 10 Articles — June...,Brandon Morelli,https://techburst.io/artificial-intelligence-t...,0.774829,Example Titles
12,30 Amazing Python Projects for the Past Year (...,Mybridge,https://medium.mybridge.co/30-amazing-python-p...,0.769492,Example Titles
17,30 Amazing Machine Learning Projects for the P...,Mybridge,https://medium.mybridge.co/30-amazing-machine-...,0.76324,Example Titles
22,"Machine Learning Exercises In Python, Part 1 –...",John Wittenauer,https://medium.com/@jdwittenauer/machine-learn...,0.730944,Example Titles
1,Reinforcement Learning from scratch – Insight ...,Emmanuel Ameisen,https://blog.insightdatascience.com/reinforcem...,0.716298,Related Queries
6,"Chatbots, could we talk? – AIMA: AI Marketing ...",Gabriel Jiménez,https://medium.com/aimarketingassociation/chat...,0.705268,Related Queries
11,"Cheat Sheets for AI, Neural Networks, Machine ...",Stefan Kojouharov,https://becominghuman.ai/cheat-sheets-for-ai-n...,0.70415,Related Queries
16,The beginner’s guide to conversational commerc...,Justin Lee,https://medium.com/swlh/the-beginners-guide-to...,0.70092,Related Queries
21,Architecting a Machine Learning System for Ris...,AirbnbEng,https://medium.com/airbnb-engineering/architec...,0.697742,Related Queries


## Comparing the original method with and without re-ranking using cross-encoder

In [175]:
# Which articles are in both?
intersection = set(comparison[comparison['technique'] == 'Original'].title.tolist()).intersection(set(comparison[comparison['technique'] == 'Re-Ranking'].title.tolist()))
intersection

{'How to Create a Chatbot Without Coding a Single Line',
 'The Complete Beginner’s Guide To Chatbots – Chatbots Magazine',
 'What Are The Best Intelligent Chatbots or AI Chatbots Available Online?'}

In [178]:
# Which are not?
print("Recommended only by the Original:")
for title in set(comparison[comparison['technique'] == 'Original'].title.tolist()).difference(intersection):
    print('\t', title)

print("\nRecommended only by Re-Ranking:")
for title in set(comparison[comparison['technique'] == 'Re-Ranking'].title.tolist()).difference(intersection):
    print('\t', title)

Recommended only by the Original:
	 How to learn Deep Learning in 6 months – Towards Data Science
	 How we grew from 0 to 4 million women on our fashion app, with a vertical machine learning approach

Recommended only by Re-Ranking:
	 An intro to Machine Learning for designers – UX Collective
	 Turn your Raspberry Pi into homemade Google Home – Becoming Human: Artificial Intelligence Magazine


Note that the additional recommendations of the original method make more sense than the re-ranking additional ones. At least for me.

## End of the project
We can conclude that besides the best score was achieved by example titles, in pratice the best results came from the original method, without using query expansion. In other situations this may be helpful! You should always test.