
### Task
1. News Extraction: Develop a script to scrape news articles from provided URLs. Ensure the extracted content captures the full text and headline of the articles.
2. GenAI-driven Summarization and Topic Identification:
- Use a GenAI platform or tool (e.g. OpenAI's GPT models, or any other LLM) to analyze the articles. Your tasks will include generating a summary that captures key points and identifying the main topics of each article.
- The focus should be on effectively integrating and utilizing GenAI tools rather than building from scratch.
3. Semantic Search with GenAI:
- Store the extracted news, their GenAI-generated summaries, and topics in a vector database.
- Implement a semantic search feature leveraging GenAI tools to interpret and find relevant articles based on user queries. This search should understand the context of the queries and match them effectively with the summaries and topics. Search should handle semantically close search terms like synonyms.


### Part 1. Downloading a page

##### Option 1: Simply download page source using Requests
**Pros:**
- Simple to implement. The requests library is straightforward to use and does not require additional resources like a browser.  

**Cons:**
- Does not support dynamic content, Single Page Applications (SPA), and more. The content returned is only what is present in the HTML source code, which may not be the full content of the page if JavaScript is used to load additional content.
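
For comparison, Option 1 could be sketched roughly as follows. The download itself would be a single `requests.get(url).text` call, so the sketch focuses on turning the returned HTML into a headline and plain text. The names `TextExtractor` and `extract_title_and_text` are illustrative assumptions, and only the static HTML is parsed, so content loaded by JavaScript is still missing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of a static HTML page."""

    # Content of these tags is never shown to the reader, so it is skipped.
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.in_title:
            self.title += text
        else:
            self.chunks.append(text)


def extract_title_and_text(html):
    """Return (title, body_text) for an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)
```

With Requests installed, usage would look like `extract_title_and_text(requests.get(url, timeout=10).text)`.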

##### Option 2: Render the page using Selenium and extract text (Chosen)
**Pros:**
- Any content, including SPA, can be obtained. This method allows for the full rendering of the page, including any content loaded by JavaScript.  

**Cons:**
- Longer page retrieval time. Because a full browser is required to render the page, this method can take longer than simply downloading the page source.
- Requires more resources. Running a browser, even a headless one, requires more system resources than simply making a request to the server.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

def get_page_text_by_url(urls):
    pages_text = []
    options = Options()
    options.set_preference('permissions.default.image', 2)
    options.set_preference('media.autoplay.default', 0)
    options.set_preference('permissions.default.stylesheet', 2)
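    # The `headless` attribute is deprecated in recent Selenium 4 releases; pass the browser flag instead
    options.add_argument('-headless')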
    driver = webdriver.Firefox(options=options)
    for url in urls:
        driver.get(url)
        cleared_page_text = driver.find_element(By.TAG_NAME, "body").text.replace('\n', ' ')
        pages_text.append(cleared_page_text)
    driver.quit()
    
    return pages_text


Let's test the function:

In [2]:
get_page_text_by_url(['https://example.com'])

['Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information...']

### Part 1. Conclusion
I selected Selenium as the primary tool for web page downloading due to its effectiveness with dynamic content, a feature common to most modern websites. I disabled the loading of images and styles, which significantly expedited the site rendering process.  

### Part 2. GenAI-driven Summarization and Topic Identification

##### Option 1: Local LLM (Chosen)

**Pros:**
- **Privacy and Security:** Data never leaves your network, which can be crucial for sensitive information.
- **Customization:** You can tailor the model to your specific needs and data.
- **Control:** You have full control over the model, including updates and maintenance.

**Cons:**
- **Resource Intensive:** Requires significant computational resources and expertise to maintain.
- **Update and Maintenance:** Keeping the model updated and maintained can be a complex and costly process.
- **Scalability:** May not scale as easily as public models for large data volumes.

##### Option 2: Proprietary public LLM

**Pros:**
- **Ease of Use:** Typically easier to use and maintain as the provider handles updates and maintenance.
- **Scalability:** Can handle large volumes of data and scale easily.
- **Access to Latest Technology:** You get access to the latest advancements and updates in the model.

**Cons:**
- **Privacy Concerns:** Data is processed on the provider's servers, which may not be ideal for sensitive information.
- **Customization Limitations:** There may be limitations on how much you can customize the model to your specific needs.
- **Dependency:** You are dependent on the provider for service availability and quality.

In [3]:
import requests
import json

SUSTEM_PROMPT_FOR_SUMMARIZATION = "Your task is to summarize the provided text and identify the main topics. The summary should be concise, capturing only the most critical elements of the text. Limit the summary to a maximum of 500 words. The list of main topics should not exceed 5 elements. Ensure that each topic is a single, clear concept and does not contain special characters, such as hyphens. Separate topics with commas. Each topic should begin with a small letter except for names and titles. Format your response as 'Summary text: [summary text]. List of topics:[list of topics]'. Avoid any unnecessary details or tangential information in your summary."
BASE_LMM_SERVER_URL = "http://localhost:1234/v1/"

def get_ai_answer(system_prompt, message, temperature, model):
    url = f"{BASE_LMM_SERVER_URL}chat/completions"

    payload = {
        "messages": [
            { "role": "system", "content":  system_prompt},
            { "role": "user", "content": message }
        ],
        "temperature": temperature,
        "model": model,
        "max_tokens": -1,
        "stream": False
    }

    headers = {
        "Content-Type": "application/json"
    }

    response = requests.post(url, headers=headers, data=json.dumps(payload))
    response.raise_for_status()  # fail loudly if the local server returns an error

    return response.json()['choices'][0]['message']['content']


I used the local model 'dolphin-2.9.3-mistral-nemo-12b' because, in a quick test, it performed better than other models such as 'dolphin-2.9.4-llama3.1-8b' and 'qwen2.5-14b_uncensored_instruct'. Its output matches the required format more reliably, and the generated text is of reasonably good quality.

In [6]:
import re

def get_summary_adn_topics_by_pattern(urls, temperature = 0.8, model='dolphin-2.9.3-mistral-nemo-12b', tries = 10):
    # Expected answer format: 'Summary text: ... List of topics: a, b, c'
    pattern = re.compile("summary text:(.*)list of topics:(.*)", re.DOTALL | re.IGNORECASE)
    pages_text = get_page_text_by_url(urls)
    results = []

    for page_text in pages_text:
        # Regenerate until the answer matches the format; a page is skipped if all tries fail
        for _ in range(tries):
            model_answer = get_ai_answer(SUSTEM_PROMPT_FOR_SUMMARIZATION, page_text, temperature, model)
            match = pattern.search(model_answer)
            if match:
                summary_text = match.group(1).strip()
                list_of_topics = [topic.strip() for topic in match.group(2).split(',')]
                # An overly long "topic" usually means a wrong or missing delimiter
                if any(len(s) > 42 for s in list_of_topics):
                    continue
                results.append({'page_text': page_text, 'summary_text': summary_text, 'list_of_topics': list_of_topics})
                break

    return results


I force the model to respond in a specific format that contains a summary followed by a list of topics. If the response does not match this structure, the request is retried. The model also occasionally produces a malformed topic list, using the wrong delimiter or none at all. To catch this, there is an extra validation step: if any topic is longer than 42 characters, the response is regenerated. The number 42 is somewhat arbitrary, chosen for its mythical connotations in science fiction.
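
The validation logic can also be isolated from the network call, which makes it easy to check on hand-written answers. The following is a minimal sketch under that assumption; `parse_model_answer` is a hypothetical helper, and stripping a trailing period from each topic is an extra step that the notebook code does not perform.

```python
import re

# Same pattern as in the notebook: a summary followed by a list of topics.
ANSWER_PATTERN = re.compile(r"summary text:(.*)list of topics:(.*)", re.DOTALL | re.IGNORECASE)
MAX_TOPIC_LENGTH = 42


def parse_model_answer(answer, max_topic_length=MAX_TOPIC_LENGTH):
    """Return (summary, topics) or None if the answer should be regenerated."""
    match = ANSWER_PATTERN.search(answer)
    if match is None:
        return None  # the answer does not follow the requested format

    summary = match.group(1).strip()
    # Assumption: a trailing period after a topic is noise and can be dropped.
    topics = [topic.strip().rstrip(".") for topic in match.group(2).split(",")]
    topics = [topic for topic in topics if topic]

    # A very long "topic" usually means the model used the wrong delimiter.
    if not topics or any(len(topic) > max_topic_length for topic in topics):
        return None
    return summary, topics
```

A caller would retry the model request whenever this function returns `None`.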

In [7]:
urls = [
    'https://habr.com/en/articles/868822/',
    'https://habr.com/en/articles/865754/',
    'https://habr.com/en/articles/865274/',
    'https://habr.com/en/articles/865216/',
    'https://habr.com/en/articles/864270/',
    'https://habr.com/en/articles/861974/',
    'https://habr.com/en/articles/861368/',
    'https://habr.com/en/articles/847854/',
    'https://habr.com/en/articles/846898/',
    'https://habr.com/en/articles/841820/'
]

dataset = get_summary_adn_topics_by_pattern(urls)

Let's process 10 recent articles from [https://habr.com/](https://habr.com/) and output a short summary and a list of topics for each. The full text is not printed, to keep the console output clean.

In [8]:
[(el['summary_text'], el['list_of_topics']) for el in dataset]

[('This article explores the multifaceted strategy of RABBIT testing in software development which includes Regression Testing, Automated Testing, Black Box Testing, Beta Testing, Integration Testing, and Test-Driven Development methodologies. Each branch addresses specific aspects to ensure thorough validation from various perspectives. The approach is particularly suitable for CI/CD pipelines, complex projects, environments with frequent updates, user-centric applications, and high-risk industries.',
  ['regression testing',
   'automated testing',
   'black box testing',
   'beta testing',
   'integration testing',
   'test-driven development.']),
 ("Grok AI is an artificial intelligence tool created by xAI and designed for meaningful conversations with users. Integrated into X (formerly Twitter), Grok AI offers tools for chatting, image creation, writing assistance, learning, coding help, and problem-solving. It stands out due to its humor-infused responses and easy accessibility t

### Part 2. Conclusion
I processed 10 articles and extracted the main ideas and a list of topics for each. The chosen model handled the technical texts well. Its context window can be expanded to 1024000 tokens, which would allow several hundred news items or articles to be processed without losing context.

### Part 3. Semantic Search with GenAI

To address this task, I will use the 'text-embedding-nomic-embed-text-v1.5' model. I will convert the text (summary_text + list_of_topics) into a vector and store it in a vector database. To search for an article, I will transform the user's query into a vector in the same way and then look for the closest vector in the database. I will not apply boosting to the list of topics, and for the MVP solution I will use an in-memory database.

This approach will enable me to find the article that is semantically most similar to the user's query. Moreover, by using an in-memory database, I can ensure faster data retrieval and processing, which will enhance the user experience. The absence of boosting techniques will ensure that all topics are treated equally, maintaining a balanced approach to information retrieval.

The 'text-embedding-nomic-embed-text-v1.5' model is chosen for its efficiency in converting text into numerical vectors, which can be easily compared and matched in a vector database. This will ensure that the system can handle large volumes of data and still provide accurate and relevant results.
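
The search step described above boils down to a nearest-neighbour lookup. Here is a minimal pure-Python sketch of that idea, with made-up 3-dimensional vectors standing in for the 768-dimensional embeddings. The names `squared_l2` and `nearest_index` are illustrative and not part of FAISS; an exact flat L2 index performs the same comparison, only much faster.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(query_vector, stored_vectors):
    """Return the position of the stored vector closest to the query."""
    distances = [squared_l2(query_vector, vector) for vector in stored_vectors]
    return min(range(len(distances)), key=distances.__getitem__)


# Toy "embeddings" of three articles (assumed values, for illustration only).
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
best = nearest_index([0.9, 0.1, 0.0], stored)  # closest to the first article
```

The returned position is then used to look up the matching article in the dataset.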

In [16]:
import requests
import json
import faiss
import numpy as np

SYSTEM_PROMPT_FOR_ANSWER_QUESTION = """
You are a system designed to assist the user with their inquiries. An incoming request and context are fed into your system. Based on this context, you are required to provide a response. The response should be as precise and concise as possible. If the user's request does not align with the provided context, the context may be disregarded.
The input data will be presented in the following format: User Request: [user request] Context: [context].
The output should be presented simply as text without additional context and user query sections.
To elaborate, your primary function is to interpret the user's request, consider the given context (if applicable), and generate a response that best answers the user's query. Your responses should be direct and to the point, ensuring that the user receives the information they need in the most efficient manner possible.
Remember, the context is provided to help refine your response, but it is not always necessary. If the user's request is clear and does not require additional context, you can generate a response based solely on the user's request. This flexibility allows you to effectively handle a wide range of user inquiries.
The format of the input data is standardized to ensure consistency and ease of processing. The 'User Request' field will contain the user's query, while the 'Context' field will provide any additional information that may be relevant to the request.
"""

def get_embedding_lmstudio(query, model='text-embedding-nomic-embed-text-v1.5'):
    url = f"{BASE_LMM_SERVER_URL}embeddings"

    payload = {
        "model": model,
        "input": query
    }

    headers = {
        "Content-Type": "application/json"
    }

    return requests.post(url, headers=headers, data=json.dumps(payload)).json()['data'][0]['embedding']

def semantic_search(query, index):
    query_vector = get_embedding_lmstudio(query)
    _, I = index.search(np.array([query_vector], dtype='float32'), 1)  # FAISS works with float32 vectors
    
    return I[0][0]

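def build_index(dataset):
    index = faiss.IndexFlatL2(768)  # 768 is the embedding size of nomic-embed-text-v1.5

    for data in dataset:
        # Embed the summary together with the topics; the extra space keeps the two parts from merging
        combined_vector = get_embedding_lmstudio(data['summary_text'] + ' ' + ' '.join(data['list_of_topics']))
        index.add(np.array([combined_vector], dtype='float32'))

    return index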

def get_best_fit_article(query, index, dataset):
    best_fit_article_index = semantic_search(query, index)
    
    return dataset[best_fit_article_index] 

def get_RAG_answer(query, best_fit_article, temperature = 0.8, model = 'dolphin-2.9.3-mistral-nemo-12b'):
    user_query = f"User Request: {query}. Context:{best_fit_article['page_text']}"
    
    return get_ai_answer(SYSTEM_PROMPT_FOR_ANSWER_QUESTION, user_query, temperature, model)

Example 1

In [17]:
query = 'I want to bypass the captcha on the site. What is the best way for me to do this?'

index = build_index(dataset)
best_fit_article = get_best_fit_article(query, index, dataset)
RAG_answer = get_RAG_answer(query, best_fit_article)

print(f"Best fit article summary: {best_fit_article['summary_text']}")
print(f"RAG answer: {RAG_answer}")

Best fit article summary: This article provides a detailed guide on how to bypass CAPTCHA challenges effectively in automation processes. It covers various methods such as IP rotation, User-Agent rotation, cookie management, simulating human behavior, using CAPTCHA recognition services, and combining these strategies for optimal results.
RAG answer: The best way to bypass CAPTCHA is by employing hybrid strategies that combine prevention techniques with fallback CAPTCHA-solving methods. This approach involves using techniques like rotating IP addresses and altering User-Agent strings to prevent triggers, while solving remaining CAPTCHAs as a fallback solution when necessary. It offers versatility for handling websites with varying protection levels and balances cost-effectiveness with stability and adaptability.


Example 2

In [18]:
query = 'Explain the basic principles of clean code for Python.'
best_fit_article = get_best_fit_article(query, index, dataset)
RAG_answer = get_RAG_answer(query, best_fit_article)

print(f"Best fit article summary: {best_fit_article['summary_text']}")
print(f"RAG answer: {RAG_answer}")

Best fit article summary: The article explores principles from "Clean Code" by Robert C. Martin to improve Python coding practices, including meaningful naming conventions, functions doing one thing, unnecessary comments, error handling, DRY principle (Don't Repeat Yourself), test-driven development, avoiding side effects, and command query separation.
RAG answer: The basic principles of clean code for Python include meaningful naming, functions doing one thing only, avoiding unnecessary comments, proper error handling, consistent formatting, following the DRY principle (Don't Repeat Yourself), test-driven development, avoiding side effects, and adhering to command query separation. These principles help to create clear, readable, maintainable Python code that is easier to understand and modify.



### Part 3. Conclusion
We have built a basic RAG (Retrieval-Augmented Generation) system. Each long text is compressed into a brief summary, and the summary is converted into a vector. Because the input is condensed, the resulting vector is likely to represent the article's main content more precisely than an embedding of the full page text would. At query time, we search the vector database for the most relevant article and pass its text to the model as context for answering the user's query. This approach lets us use a pre-trained model without fine-tuning it. For instance, corporate documents can be processed in this way so that an assistant can draw on them, which is particularly relevant when documents are updated very frequently.

#### Improvements that can be made to the current implementation include:
- Using the OpenAI library instead of requests, since the local server exposes an OpenAI-compatible API; this would remove the hand-written payload and response handling.
- Searching not just for the single best document but for all nearest documents in the vector space up to a certain distance threshold. Passing several articles as context would require a larger context length, which in turn demands more memory.
- Employing models with more parameters, such as 70B instead of 12B, which could potentially improve the accuracy and effectiveness of the system. This would, however, also require more computational resources.
- Using the topic list as a facet or as an element with a high boosting factor.
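
The threshold-based retrieval mentioned in the list could look roughly like the sketch below. It is written in pure Python with toy vectors so that it runs without a vector database; `search_within_threshold`, `k`, and `max_distance` are hypothetical names and values. With FAISS, the same candidates would come from `index.search(query, k)`, which returns the distances alongside the indices.

```python
def search_within_threshold(query_vector, stored_vectors, k=3, max_distance=0.5):
    """Return up to k indices whose squared L2 distance is within max_distance."""
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query_vector, vector)), index)
        for index, vector in enumerate(stored_vectors)
    )
    # Keep the k nearest candidates, then drop those that are too far away.
    return [index for distance, index in scored[:k] if distance <= max_distance]


# Toy 2-dimensional "embeddings" (assumed values, for illustration only).
stored = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
hits = search_within_threshold([1.0, 0.0], stored, k=3, max_distance=0.5)
```

The texts of all returned articles could then be concatenated into the context, as long as the result fits into the model's context length.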


### Installed libraries:

In [1]:
!pip install selenium
!pip install faiss-cpu

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
