<a href="https://colab.research.google.com/github/ashraful-iut/NLP_Basic-/blob/main/Search_Engine_over_Medium_with_Bag_of_Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

Notebook of the course [Practical NLP with Python](https://www.nlplanet.org/course-practical-nlp/).

Lesson: [Project: Search Engine over Medium with Bag of Words](https://www.nlplanet.org/course-practical-nlp/01-intro-to-nlp/07-search-engine-bow.html)

Made by: [Fabio Chiusano](https://www.linkedin.com/in/fabio-chiusano-b6a3b311b/)

Table of Contents:
- Lesson Code
  - Libraries
  - Download the Dataset
  - Data Preprocessing
  - Make Queries
  - Removing Stopwords
- Code Exercises

---

# Lesson Code

## Libraries

In [None]:
# huggingface_hub provides hf_hub_download, used below to fetch the dataset
!pip install huggingface_hub

In [None]:
from huggingface_hub import hf_hub_download

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases
from nltk.tokenize import word_tokenize

from collections import Counter
import numpy as np
import pandas as pd

## Download the Dataset

In [None]:
# download dataset of Medium articles from 
# https://huggingface.co/datasets/fabiochiu/medium-articles
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset",
                  filename="medium_articles.csv")
)

# There are 192,368 articles in total, but let's keep only a random sample of
# 10,000 to make computations faster
df_articles = df_articles.sample(n=10000)

df_articles.head()

## Data Preprocessing

In [None]:
# count the number of occurrences of each token in each text
texts_lowercase = df_articles["text"].str.lower()
texts_lowercase_tokenized = texts_lowercase.apply(word_tokenize)
token_counters = texts_lowercase_tokenized.apply(Counter).values.tolist()

# show the tokens that occur at least 10 times in the first article
print({token: n_occ for token, n_occ in token_counters[0].items() if n_occ >= 10})
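
To see what these per-document counters look like on a small input, here is a minimal sketch on two made-up sentences. It assumes plain `str.split` in place of `word_tokenize` so that it runs without NLTK data; `toy_texts` and `toy_counters` are hypothetical names, not part of the lesson.

```python
from collections import Counter

# Toy documents (hypothetical, not taken from the Medium dataset)
toy_texts = [
    "Data science is fun and data is everywhere",
    "NLP is a branch of data science",
]

# Lowercase, split on whitespace, and count the tokens of each document
toy_counters = [Counter(text.lower().split()) for text in toy_texts]

print(toy_counters[0]["data"])     # 2
print(toy_counters[1]["missing"])  # 0: a Counter returns 0 for unseen tokens
```

The fact that a `Counter` returns 0 for a missing key is what lets the scoring code below look up query tokens without checking whether they exist in the document.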

## Make Queries

In [None]:
# tokenize the query
query = "data science nlp"
query_tokens = word_tokenize(query)

In [None]:
# Compute a matching score for each text with respect to the query. The score is
# the total number of occurrences of the query tokens in that text.
def get_scores(query_tokens, token_counters):
  scores = []
  for token_counter in token_counters:
    matches = [token_counter[query_token] for query_token in query_tokens]
    total_score = sum(matches)
    scores.append(total_score)
  return scores

scores = get_scores(query_tokens, token_counters)
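
As a worked example of the scoring rule, the sketch below applies a compact but equivalent version of `get_scores` to three toy counters. The toy documents are assumptions made up for illustration.

```python
from collections import Counter

def get_scores(query_tokens, token_counters):
    # Sum the occurrences of every query token in each document
    return [sum(counter[token] for token in query_tokens)
            for counter in token_counters]

# Hypothetical documents, already lowercased and split into tokens
toy_counters = [
    Counter("data science data".split()),
    Counter("nlp for science".split()),
    Counter("cooking recipes".split()),
]

toy_scores = get_scores(["data", "science", "nlp"], toy_counters)
print(toy_scores)  # [3, 2, 0]
```

The first document scores 3 because "data" appears twice and "science" once, so repeated words raise the score even when other query tokens are absent.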

In [None]:
# retrieve the top_n articles with the highest scores and show them
def show_best_results(df_articles, scores, top_n=10):
  best_indexes = np.argsort(scores)[::-1]
  for position, idx in enumerate(best_indexes[:top_n]):
    row = df_articles.iloc[idx]
    title = row["title"]
    score = scores[idx]
    print(f"{position + 1} [score = {score}]: {title}")

show_best_results(df_articles, scores)
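
`np.argsort(scores)[::-1]` sorts the document indexes by ascending score and then reverses them, which yields the best matches first. The same ranking idea can be sketched in plain Python; `top_indexes` is a hypothetical helper, and the order of documents with tied scores may differ from the NumPy version.

```python
# Hypothetical scores for four documents (indexes 0 to 3)
toy_scores = [3, 0, 5, 3]

def top_indexes(scores, top_n=10):
    # Sort document indexes by descending score; ties keep their original order
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]

print(top_indexes(toy_scores, top_n=3))  # [2, 0, 3]
```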

In [None]:
# try a different query
query = "how to learn data science"
query_tokens = word_tokenize(query)
scores = get_scores(query_tokens, token_counters)
show_best_results(df_articles, scores)

## Removing Stopwords

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string

# use a set so that membership checks are fast
english_stopwords = set(stopwords.words('english'))

In [None]:
print(string.punctuation)

In [None]:
# count the number of occurrences of each token in each text
texts_lowercase = df_articles["text"].str.lower()
texts_lowercase_tokenized = texts_lowercase.apply(word_tokenize)
texts_lowercase_tokenized_no_sw = texts_lowercase_tokenized.apply(
  lambda token_list: [token for token in token_list
                      if token not in english_stopwords and
                      token not in string.punctuation]
)
token_counters = texts_lowercase_tokenized_no_sw.apply(Counter).values.tolist()

# show the tokens that occur at least 6 times in the first article
print({token: n_occ for token, n_occ in token_counters[0].items() if n_occ >= 6})

In [None]:
# tokenize the query and remove stopwords
query = "how to learn data science"
query_tokens = word_tokenize(query)
query_tokens_no_sw = [token for token in query_tokens
                      if token not in english_stopwords and
                      token not in string.punctuation]
print(f"Tokenized query without stopwords: {query_tokens_no_sw}")
print()

# show best results, scoring with the query tokens without stopwords
scores = get_scores(query_tokens_no_sw, token_counters)
show_best_results(df_articles, scores)
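
The effect of stopword removal can be seen on a toy case. The sketch below assumes a tiny hand-written stopword set (`toy_stopwords`, hypothetical) instead of the NLTK list, and whitespace splitting instead of `word_tokenize`.

```python
from collections import Counter
import string

# Hypothetical mini stopword list, used only for this illustration
toy_stopwords = {"how", "to", "the", "a", "is"}

def bag_of_words(text, remove_stopwords):
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens
                  if t not in toy_stopwords and t not in string.punctuation]
    return Counter(tokens)

def score(query_tokens, counter):
    return sum(counter[t] for t in query_tokens)

docs = ["how to go to the store to buy a cake",
        "learn data science with python"]
query = "how to learn data science".split()
query_no_sw = [t for t in query if t not in toy_stopwords]

raw_scores = [score(query, bag_of_words(d, False)) for d in docs]
filtered_scores = [score(query_no_sw, bag_of_words(d, True)) for d in docs]

print(raw_scores)       # [4, 3]: the unrelated document wins thanks to "to"
print(filtered_scores)  # [0, 3]: only the relevant document matches
```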

# Code Exercises

## Exercise: Reimplement the search engine logic using the `CountVectorizer` class from `sklearn`

In [None]:
# WRITE CODE HERE

## Exercise: Reimplement the search engine logic using the `CountVectorizer` class from `sklearn`. Set the `max_df` parameter of the `CountVectorizer` so that the query "how to learn data science" returns good results even without manually removing stopwords

In [None]:
# WRITE CODE HERE

## Exercise: Write a new scoring function for the search engine that counts the number of tokens in the query that have at least one occurrence in the document (instead of summing all the occurrences in the document). Test the new scoring function with some queries

In [None]:
# WRITE CODE HERE
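
One possible starting point for this last exercise, sketched on toy data rather than on the Medium articles. The function name `get_presence_scores` is hypothetical, and counting each distinct query token once is an assumption about how repeated query words should be handled.

```python
from collections import Counter

def get_presence_scores(query_tokens, token_counters):
    # Score = number of distinct query tokens found at least once in the text
    unique_query_tokens = set(query_tokens)
    return [
        sum(1 for token in unique_query_tokens if counter[token] > 0)
        for counter in token_counters
    ]

# Hypothetical documents, already lowercased and split into tokens
toy_counters = [
    Counter("data data data data".split()),
    Counter("data science nlp".split()),
]

print(get_presence_scores(["data", "science", "nlp"], toy_counters))  # [1, 3]
```

With the occurrence-sum score the first document would win with 4 points; counting presence instead favors the document that covers more of the query.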