
# I. Data Collection
(Do not run this cell again; the output file has already been generated.)


In [None]:
import requests
from bs4 import BeautifulSoup
import json
from queue import Queue

def scrape_wikipedia(url):
    # Send an HTTP GET request to the Wikipedia URL
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')

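        # Extract relevant information (title and content)
        title_tag = soup.find('h1', {'class': 'firstHeading'})
        content_tag = soup.find('div', {'id': 'mw-content-text'})
        if title_tag is None or content_tag is None:
            # soup.find returns None when the element is missing
            print(f"Unexpected page structure at {url}")
            return None
        title = title_tag.text
        content = content_tag.text

        # Clean the data from HTML tags
        cleaned_content = clean_html_tags(content)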

        # Create a dictionary representing the article
        article = {'title': title, 'url': url, 'content': cleaned_content}

        return article
    else:
        print(f"Failed to retrieve content from {url}")
        return None

def clean_html_tags(text):
    # Use BeautifulSoup to remove HTML tags
    soup = BeautifulSoup(text, 'html.parser')
    cleaned_text = soup.get_text(separator=' ')
    return cleaned_text

def save_to_json(data, filename):
    # Save the scraped data to a JSON file
    with open(filename, 'w', encoding='utf-8') as json_file:
        json.dump(data, json_file, ensure_ascii=False, indent=4)

def get_wikipedia_links(url):
    # Extract all Wikipedia links from a given page
    response = requests.get(url)
    links = set()

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            if href.startswith('/wiki/') and ':' not in href:
                full_link = f"https://en.wikipedia.org{href}"
                links.add(full_link)

    return links

# Starting from the NLP page
seed_url = 'https://en.wikipedia.org/wiki/Natural_language_processing'

# Queue to manage pages to be visited
page_queue = Queue()
page_queue.put(seed_url)

# Set to keep track of visited pages
visited_pages = set()

# Limit the number of pages to scrape
max_pages = 10000
pages_scraped = 0

# List to store scraped pages
all_pages = []

# Scraping loop
while not page_queue.empty() and pages_scraped < max_pages:
    current_url = page_queue.get()

    # Skip if already visited
    if current_url in visited_pages:
        continue

    # Scrape the current page
    current_page = scrape_wikipedia(current_url)
    if current_page:
        # Append the current page to the list
        all_pages.append(current_page)
        print(f"Article '{current_page['title']}' has been added to the list.")
        pages_scraped += 1

        # Add links from the current page to the queue
        linked_pages = get_wikipedia_links(current_url)
        for linked_page in linked_pages:
            page_queue.put(linked_page)

    # Mark the current page as visited
    visited_pages.add(current_url)

# Save all scraped pages to a single JSON file
save_to_json(all_pages, 'all_scraped_pages.json')
print("All scraped pages have been saved to 'all_scraped_pages.json'.")



Article 'Natural language processing' has been added to the list.
Article 'Explicit semantic analysis' has been added to the list.
Article 'Joseph Weizenbaum' has been added to the list.
Article 'Feedforward neural network' has been added to the list.
Article 'Computer-assisted reviewing' has been added to the list.
Article 'Semantic similarity' has been added to the list.
Article 'John Searle' has been added to the list.
Article 'UBY' has been added to the list.
Article 'Formal grammar' has been added to the list.
Article 'Yingli Tian' has been added to the list.
Article 'Transfer-based machine translation' has been added to the list.
Article 'Racter' has been added to the list.
Article 'Adjective' has been added to the list.
Article 'Conditional (computer programming)' has been added to the list.
Article 'Capitalization' has been added to the list.
Article 'Language technology' has been added to the list.
Article 'Decision tree' has been added to the list.
Article 'Ontology (informat



[The output stream was truncated and contains only the last 5000 lines.]
Article 'Plaintext Players' has been added to the list.
Article 'Dworkin's Game Driver' has been added to the list.
Article 'Loot (video games)' has been added to the list.
Article 'Simutronics' has been added to the list.
Article 'Kill stealing' has been added to the list.
Article 'Breaking character' has been added to the list.
Article 'Engadget' has been added to the list.
Article 'Lysator' has been added to the list.
Article 'Breaking character' has been added to the list.
Article 'Cybersex' has been added to the list.
Article 'Costly signaling theory in evolutionary psychology' has been added to the list.
Article 'Behavioral epigenetics' has been added to the list.
Article 'Management' has been added to the list.
Article 'Logic Theorist' has been added to the list.
Article 'Open-source software' has been added to the list.
Article 'Optical illusion' has been added to the list.
Article 'Phil
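
The crawl above is a breadth-first traversal driven by a queue and a visited set. One possible refinement, sketched here under the assumption that links can be looked up in an in-memory dictionary (`toy_links` and `crawl_order` are hypothetical names, with `toy_links` standing in for `get_wikipedia_links`), is to record a URL as seen when it is enqueued rather than when it is dequeued, so the queue never holds the same URL twice.

```python
from collections import deque

# Hypothetical link graph standing in for live Wikipedia pages.
toy_links = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["D"],
    "D": ["A"],
}

def crawl_order(seed, links, max_pages):
    """Breadth-first crawl that deduplicates URLs at enqueue time."""
    queue = deque([seed])
    seen = {seed}   # every URL that has ever been enqueued
    order = []      # pages in the order they would be scraped
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(crawl_order("A", toy_links, max_pages=10))  # ['A', 'B', 'C', 'D']
```

This does not address pages that are reachable under several different URLs; it only keeps the queue from growing with repeated entries.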

# II. Data Processing


In [None]:
!pip install nltk




In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
import torch

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [None]:
from nltk.tokenize import word_tokenize

def tokenize_text(text):
    return word_tokenize(text.lower())


In [None]:
from nltk.corpus import stopwords

def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]


In [None]:
from nltk.stem import WordNetLemmatizer

def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]


In [None]:
import re

def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)
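
A small illustration of what this substitution does to words joined by punctuation may be useful. The sample string is hypothetical; it uses en dashes (`\u2013`), and the fused token it produces resembles the `united statesmexicocanada` text that appears in the search output later in this notebook. Replacing special characters with a space instead of deleting them is one possible alternative, shown here only as a sketch.

```python
import re

# Hypothetical sample: words joined by en dashes (\u2013).
sample = "united states\u2013mexico\u2013canada agreement"

# Deleting special characters fuses the neighbouring words.
deleted = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

# Replacing them with a space keeps the words separate;
# the outer substitution collapses any repeated whitespace.
spaced = re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z0-9\s]', ' ', sample)).strip()

print(deleted)  # united statesmexicocanada agreement
print(spaced)   # united states mexico canada agreement
```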


In [None]:
def preprocess_text(text):
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize_tokens(tokens)
    processed_text = ' '.join(tokens)
    processed_text = remove_special_characters(processed_text)
    return processed_text

In [None]:
def preprocess_pages(json_file_path):
    with open(json_file_path, 'r', encoding='utf-8') as json_file:
        all_pages = json.load(json_file)

    processed_documents = []
    article_names = []  # New list to store article names

    # Preprocessing steps for each page
    for page in all_pages:
        title = page['title']
        content = page['content']

        # Your preprocessing steps go here
        processed_content = preprocess_text(content)

        # Additional processing or indexing steps can be added here

        processed_documents.append(processed_content)
        article_names.append(title)  # Store the name of the article

    return processed_documents, article_names


In [None]:
import json
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize


json_file_path = 'all_scraped_pages.json'
documents, article_names = preprocess_pages(json_file_path)

# The preprocessed content is not good enough for search. Let's try another way to build this search engine.

# III. New Method: Focusing on TF-IDF

In [None]:
import json

def load_json_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data


In [None]:
json_file_path = 'all_scraped_pages.json'
articles = load_json_file(json_file_path)

In [None]:

# Check if articles were loaded and list is not empty
if articles and isinstance(articles, list):
    # Print the title and a snippet of the content from the first article as a sample
    print("Title:", articles[0]['title'])
    print("Content Snippet:", articles[0]['content'][:500])  # Print first 500 characters of the content
else:
    print("The articles list is empty or not loaded correctly.")

Title: Natural language processing
Content Snippet: Field of linguistics and computer science
For other uses, see NLP.
This article is about natural language processing done by computers. For the natural language processing done by the human brain, see Language processing in the brain.
Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as t


# Here we try three different algorithms to compare their performance (running these cells is optional; they are slow)



In [None]:
# User query
user_query = "United States"

# Ensure documents is a list
if documents is None or not isinstance(documents, list):
    raise ValueError("The preprocess_pages function should return a list of documents.")

# TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents + [user_query])
tfidf_matrix_normalized = normalize(tfidf_matrix, axis=1, norm='l2')
user_query_tfidf = tfidf_matrix_normalized[-1]
document_tfidf = tfidf_matrix_normalized[:-1]

# Retrieve the name of the article for each document
article_names_for_results = [article_names[i] for i in range(len(documents))]

# Calculate similarity scores
similarity_tfidf = cosine_similarity(user_query_tfidf.reshape(1, -1), document_tfidf)[0]

# Rank documents by similarity and keep only the top result
ranked_documents_tfidf = sorted(enumerate(similarity_tfidf), key=lambda x: x[1], reverse=True)[:1]

# Output the result
output_paragraph = ''
for doc_index, score in ranked_documents_tfidf:
    article_name = article_names_for_results[doc_index]
    article_text = documents[doc_index]

    # Find the position of the query in the article.
    # str.find returns -1 when the query is absent; fall back to the start of the text.
    query_start = max(0, article_text.lower().find(user_query.lower()))
    query_end = query_start + len(user_query)

    # Extract a relevant window around the query
    window_start = max(0, query_start - 50)
    window_end = min(len(article_text), query_end + 50)

    relevant_paragraph = article_text[window_start:window_end].strip()

    output_paragraph += f'The query "{user_query}" was found in the following article:\n'
    output_paragraph += f'-Name of the Article- {article_name}:\n{relevant_paragraph}...\n\n'
    output_paragraph += f'Similarity Score - {score:.4f}...\n\n'

print(output_paragraph)


The query "United States" was found in the following article:
-Name of the Article- United States:
ssue  country america organization american state united statesmexicocanada free trade agreement  south america...

Similarity Score - 0.0718...




In [None]:
# Bag of Words (BoW) representation
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(documents + [user_query])
bow_matrix_normalized = normalize(bow_matrix, axis=1, norm='l2')
user_query_bow = bow_matrix_normalized[-1]
document_bow = bow_matrix_normalized[:-1]

# Calculate similarity scores for BoW
similarity_bow = cosine_similarity(user_query_bow.reshape(1, -1), document_bow)[0]

# Rank documents by BoW similarity and keep only the top result
ranked_documents_bow = sorted(enumerate(similarity_bow), key=lambda x: x[1], reverse=True)[:1]

# Output the result for BoW
output_paragraph_bow = ''
for doc_index, score in ranked_documents_bow:
    article_name = article_names_for_results[doc_index]
    article_text = documents[doc_index]

    # Find the position of the query in the article.
    # str.find returns -1 when the query is absent; fall back to the start of the text.
    query_start = max(0, article_text.lower().find(user_query.lower()))
    query_end = query_start + len(user_query)

    # Extract a relevant window around the query
    window_start = max(0, query_start - 50)
    window_end = min(len(article_text), query_end + 50)

    relevant_paragraph = article_text[window_start:window_end].strip()

    output_paragraph_bow += f'The query "{user_query}" was found in the following article:\n'
    output_paragraph_bow += f'-Name of the Article- {article_name}:\n{relevant_paragraph}...\n\n'
    output_paragraph_bow += f'Similarity Score - {score:.4f}...\n\n'

print(output_paragraph_bow)


The query "United States" was found in the following article:
-Name of the Article- Foreign policy of the United States:
state  organization security cooperation europe  united statesmexicocanada agreement  asiapacific economic coope...

Similarity Score - 0.2781...




In [None]:
# N-grams representation (using CountVectorizer)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 4))
ngram_matrix = ngram_vectorizer.fit_transform(documents + [user_query])
ngram_matrix_normalized = normalize(ngram_matrix, axis=1, norm='l2')
user_query_ngram = ngram_matrix_normalized[-1]
document_ngram = ngram_matrix_normalized[:-1]

# Calculate similarity scores for N-grams
similarity_ngram = cosine_similarity(user_query_ngram.reshape(1, -1), document_ngram)[0]

# Rank documents by N-gram similarity and keep only the top result
ranked_documents_ngram = sorted(enumerate(similarity_ngram), key=lambda x: x[1], reverse=True)[:1]

# Output the result for N-grams
output_paragraph_ngram = ''
for doc_index, score in ranked_documents_ngram:
    article_name = article_names_for_results[doc_index]
    article_text = documents[doc_index]

    # Find the position of the query in the article.
    # str.find returns -1 when the query is absent; fall back to the start of the text.
    query_start = max(0, article_text.lower().find(user_query.lower()))
    query_end = query_start + len(user_query)

    # Extract a relevant window around the query
    window_start = max(0, query_start - 50)
    window_end = min(len(article_text), query_end + 50)

    relevant_paragraph = article_text[window_start:window_end].strip()

    output_paragraph_ngram += f'The query "{user_query}" was found in the following article:\n'
    output_paragraph_ngram += f'-Name of the Article- {article_name}:\n{relevant_paragraph}...\n\n'
    output_paragraph_ngram += f'Similarity Score - {score:.4f}...\n\n'

print(output_paragraph_ngram)


The query "United States" was found in the following article:
-Name of the Article- Historical sociology:
i germany failure transplant historical sociology united states  international journal politics  culture  society...

Similarity Score - 0.0099...
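
The very low N-gram score may be easier to interpret with a look at what features `ngram_range=(2, 4)` produces. The helper below is a hypothetical, simplified sketch: it splits on whitespace, which is not the tokenization rule `CountVectorizer` applies, but it shows that a two-word query yields a single feature while even a short sentence yields many.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# The two-word query contributes exactly one feature.
print(word_ngrams("United States", 2, 4))  # ['united states']

# A six-word sentence already contributes 5 + 4 + 3 = 12 features.
print(len(word_ngrams("foreign policy of the united states", 2, 4)))  # 12
```

Because the query vector has one non-zero entry and a full article has a very large number of them, the cosine similarity is expected to be small even when the bigram is present.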




In [None]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text2(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove numbers and punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Next, we preprocess the articles for search, keeping the original text alongside the normalized version.

In [None]:
processed_articles = []
for article in articles:
    original_content = article['content']
    searchable_content = preprocess_text2(original_content)
    processed_articles.append({
        'title': article['title'],
        'original_content': original_content,
        'searchable_content': searchable_content
    })


## 2. Splitting Articles into Paragraphs

This function divides articles into individual paragraphs by splitting on double newline characters, then discards paragraphs below a minimum word count (50 in the code) so that only meaningful content remains.

We want to avoid short paragraphs such as:

    Python
    NLP
    ...
    (Short paragraph)

In [None]:
def split_article_into_paragraphs(article_content):
    # Split the article on blank lines (double newline characters)
    paragraphs = article_content.split('\n\n')

    # Filter out paragraphs that don't meet a minimum word count, which helps exclude standalone words or headings
    min_word_count = 50  # Adjust based on content analysis
    paragraphs = [para for para in paragraphs if len(para.split()) >= min_word_count]

    return paragraphs
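
As a quick check of the filtering behaviour, here is a minimal sketch using a hypothetical variant, `split_paragraphs`, that takes the threshold as a parameter so a short sample can be used. The sample text is invented for illustration.

```python
def split_paragraphs(article_content, min_word_count=50):
    """Hypothetical variant of split_article_into_paragraphs with a configurable threshold."""
    paragraphs = article_content.split('\n\n')
    return [p for p in paragraphs if len(p.split()) >= min_word_count]

# Invented article text: a heading, one real paragraph, and a stray word.
sample_article = (
    "History\n\n"
    + "Parallel computing has a long history. " * 3
    + "\n\nNLP"
)

kept = split_paragraphs(sample_article, min_word_count=5)
print(len(kept))  # 1
```

Only the 18-word middle block survives; the one-word heading and the stray word are dropped. If the scraped text separates paragraphs with single newlines, splitting on `'\n\n'` will return larger chunks than expected, so it may be worth inspecting a sample before fixing the threshold.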



## 3. Expanding Paragraph Context

Given a list of paragraphs and a specific index, this function retrieves additional context by including paragraphs before and after the indexed paragraph.

In [None]:
def get_expanded_paragraph_context(paragraphs, match_index, expansion_range=1):
    """
    Expand the context around a matched paragraph by including additional paragraphs
    before and after the match, based on the specified expansion range.
    """
    start_index = max(0, match_index - expansion_range)
    end_index = min(len(paragraphs), match_index + expansion_range + 1)
    expanded_context = "\n\n".join(paragraphs[start_index:end_index])
    return expanded_context
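
The clamping at both ends of the list is easiest to see on a toy input. The sketch below repeats the same slicing logic under a hypothetical name, `expand_context`, so that it runs on its own; the paragraph strings are placeholders.

```python
def expand_context(paragraphs, match_index, expansion_range=1):
    """Same slicing logic as get_expanded_paragraph_context."""
    start = max(0, match_index - expansion_range)
    end = min(len(paragraphs), match_index + expansion_range + 1)
    return "\n\n".join(paragraphs[start:end])

paras = ["p0", "p1", "p2", "p3"]  # placeholder paragraphs

print(repr(expand_context(paras, 2)))     # 'p1\n\np2\n\np3'
print(repr(expand_context(paras, 0)))     # 'p0\n\np1'
print(repr(expand_context(paras, 3, 2)))  # 'p1\n\np2\n\np3'
```

A match at the first or last paragraph simply returns fewer neighbours instead of raising an error.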

## 4. Preprocessing Paragraphs

In [None]:
def preprocess_paragraphs(paragraphs):
    processed_paragraphs = []
    for paragraph in paragraphs:
        processed = preprocess_text2(paragraph)  # Uses preprocess_text2, defined earlier
        processed_paragraphs.append(processed)
    return processed_paragraphs


## 5. Indexing Paragraphs

Each paragraph is indexed with details like its original content, processed form, and its index within the article for precise search result mapping.

In [None]:
indexed_paragraphs = []
for article_index, article in enumerate(articles):
    original_paragraphs = split_article_into_paragraphs(article['content'])
    processed_paragraphs = preprocess_paragraphs(original_paragraphs)

    for idx, (original, processed) in enumerate(zip(original_paragraphs, processed_paragraphs)):
        indexed_paragraphs.append({
            'title': article['title'],
            'original_paragraph': original,
            'processed_paragraph': processed,
            # Position in `articles`; enumerate avoids a linear articles.index() lookup
            # per paragraph and stays correct if two articles are identical
            'article_index': article_index,
            'paragraph_index': idx  # Position within the article
        })


## 6. Vectorizing Paragraphs with TF-IDF

This creates a TF-IDF matrix for the processed paragraphs, enabling efficient similarity searches.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
processed_texts = [para['processed_paragraph'] for para in indexed_paragraphs]
tfidf_matrix = vectorizer.fit_transform(processed_texts)
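
To make the weighting concrete, here is a standard-library sketch of TF-IDF. It is not the scikit-learn implementation; it is intended to approximate what I understand to be the default settings of `TfidfVectorizer` (raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization), with tokenization simplified to whitespace splitting. The function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Smoothed TF-IDF with L2 normalization, using only the standard library."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Empty documents get an empty vector instead of a division by zero.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf

docs = [
    "parallel computing history",
    "parallel programming",
    "natural language processing",
]
vectors, idf = tfidf_vectors(docs)

# A term found in two documents is weighted lower than one found in a single document.
print(idf["parallel"] < idf["history"])  # True
```

Rare terms therefore dominate a paragraph's vector, which is what lets a short query pick out the paragraphs that mention its distinctive words.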


## 7. Implementing the Search Function

This function preprocesses the query, projects it into the same TF-IDF space as the indexed paragraphs, ranks the paragraphs by cosine similarity, and prints the top five matches with their original content, paragraph index, and score.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def search_and_retrieve_originals(query):
    # Use the same preprocessing that was applied to the indexed paragraphs
    processed_query = preprocess_text2(query)
    query_vector = vectorizer.transform([processed_query])
    similarity_scores = cosine_similarity(query_vector, tfidf_matrix).flatten()

    top_indices = similarity_scores.argsort()[-5:][::-1]  # Top 5 paragraphs
    for idx in top_indices:
        matched_paragraph_info = indexed_paragraphs[idx]
        print(f"Title: {matched_paragraph_info['title']}")
        print(f"Original Paragraph: {matched_paragraph_info['original_paragraph']}\n")
        print(f"Paragraph Index: {matched_paragraph_info['paragraph_index']}\n")
        print(f"Score: {similarity_scores[idx]:.4f}\n")
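
The ranking step can be sketched without scikit-learn or NumPy. The example below is an illustration under stated assumptions, not the code used in this notebook: vectors are stored as plain dictionaries, `cosine` and `top_k` are hypothetical helper names, and the term weights are invented.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, score) pairs for the k highest-scoring documents."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])

# Invented term-weight vectors.
doc_vecs = [
    {"parallel": 1.0, "computing": 1.0},
    {"natural": 1.0, "language": 1.0},
    {"parallel": 1.0, "history": 1.0, "computing": 1.0},
]
query_vec = {"parallel": 1.0, "computing": 1.0}

for idx, score in top_k(query_vec, doc_vecs, k=2):
    print(idx, round(score, 4))
# 0 1.0
# 2 0.8165
```

Note that a query sharing no terms with the index scores zero everywhere, yet a top-k selection still returns k entries. Filtering out zero-score results before printing is one way to avoid showing irrelevant paragraphs.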



## 8. Running a Search Query

Finally, this executes a search with a given query, demonstrating the end-to-end functionality from preprocessing to displaying search results.

In [None]:
search_query = "What is the history of parallel computing"
search_and_retrieve_originals(search_query)


Title: Theoretical computer science
Original Paragraph: Parallel computation[edit]
Main article: Parallel computation
Parallel computing is a form of computation in which many calculations are carried out simultaneously,[33] operating on the principle that large problems can often be divided into smaller ones, which are then solved "in parallel". There are several different forms of parallel computing: bit-level, instruction level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling.[34] As power consumption (and consequently heat generation) by computers has become a concern in recent years,[35] parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.[36]
Parallel computer programs are more difficult to write than sequential ones,[37] because concurrency introduces 

## 9. Summary


This workflow builds a search engine over Wikipedia articles. It first loads the articles from a JSON file, then normalizes the text for search by lowercasing, removing numbers and punctuation, dropping stopwords, and lemmatizing.

It splits articles into paragraphs, indexing each one with both its original and processed forms.

The processed paragraphs are vectorized using TF-IDF to create a searchable index.

A search function then uses cosine similarity to rank the paragraphs most relevant to a query. It displays the original, unprocessed paragraphs, so the search results are presented in natural language.