
In [None]:
import pandas as pd

# Load the dataset first; the cells below all reference financial_news_df
financial_news_df = pd.read_csv('/content/financial_news.csv')
display(financial_news_df.head())

In [None]:
# Regex for a URL at the end of the string: http(s)://, an optional "www.",
# a run of letters, digits, and . / _ - characters, then an optional query string.
url_regex = r'(https?://(?:www\.)?[a-zA-Z0-9./_\-]+(?:\?[a-zA-Z0-9=&_\-]+)?)'

# Extract URLs into a new 'URL' column (NaN where the text has no trailing URL)
financial_news_df['URL'] = financial_news_df['text'].str.extract(url_regex + r'$', expand=False)

# Remove URLs from the 'text' column
financial_news_df['text'] = financial_news_df['text'].str.replace(url_regex + r'$', '', regex=True)

# Display the updated DataFrame
display(financial_news_df.head())
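
The URL-extraction cell in this section depends on the pattern being anchored to the end of the string. The sketch below tries the same pattern on a few plain strings using only the standard `re` module. The helper name `split_trailing_url` and the sample headlines are hypothetical, made up for illustration and not taken from the dataset. Unlike the notebook cell, the helper also strips the whitespace left behind once the URL is removed.

```python
import re

# Same pattern as the notebook cell, anchored to the end of the string.
URL_AT_END = re.compile(
    r'(https?://(?:www\.)?[a-zA-Z0-9./_\-]+(?:\?[a-zA-Z0-9=&_\-]+)?)$'
)

def split_trailing_url(text):
    """Return (text_without_url, url); url is None when no trailing URL matches."""
    match = URL_AT_END.search(text)
    if match is None:
        return text, None
    return text[:match.start()].rstrip(), match.group(1)

# Hypothetical sample headlines (not from financial_news.csv).
samples = [
    "Stocks rally as inflation cools https://example.com/markets/rally",
    "Fed holds rates steady",
    "Oil slips https://example.com/oil?id=42&src=feed",
    "Link in the middle https://example.com/a of the sentence",
]
for sample in samples:
    print(split_trailing_url(sample))
```

A URL in the middle of a sentence is left untouched because the `$` anchor requires the match to reach the end of the string.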

# Task
Create a semantic search tool for the `financial_news_df` dataset using `sentence-transformers` and `gradio`. The tool should allow users to input a query, and it will return the top 5 most semantically similar news articles from the `financial_news_df` DataFrame, displaying the original text, label, and URL of each result in a Gradio interface.

## Install Libraries

### Subtask:
Install necessary Python libraries: 'sentence-transformers' for generating embeddings and 'gradio' for building the user interface.


**Reasoning**:
Install the 'sentence-transformers' library using pip to enable the generation of embeddings.



In [None]:
%pip install sentence-transformers

**Reasoning**:
Install the 'gradio' library using pip to build the user interface.



In [None]:
%pip install gradio

## Load Pre-trained Model

### Subtask:
Load a pre-trained sentence embedding model (e.g., 'all-MiniLM-L6-v2') from the 'sentence-transformers' library.


**Reasoning**:
Import the `SentenceTransformer` class and load the 'all-MiniLM-L6-v2' model to prepare for generating embeddings.



In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Model 'all-MiniLM-L6-v2' loaded successfully.")

**Reasoning**:
The previous code cell successfully loaded the `all-MiniLM-L6-v2` model. The next logical step is to use this model to generate embeddings for the text data in the `financial_news_df` DataFrame.



In [None]:
sentences = financial_news_df['text'].tolist()
embeddings = model.encode(sentences, show_progress_bar=True)
financial_news_df['embedding'] = embeddings.tolist()

print("Embeddings generated and added to 'financial_news_df'.")
display(financial_news_df.head())

**Reasoning**:
With the embeddings generated, the next logical step is to define a function that will take a user query, convert it into an embedding using the same model, and then calculate the cosine similarity between the query embedding and the document embeddings to find the most similar articles. This function will be central to the semantic search tool.



In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query, model, financial_news_df, top_n=5):
    # Encode the query
    query_embedding = model.encode(query)

    # Calculate cosine similarity between the query and all document embeddings.
    # cosine_similarity expects 2-D inputs, so wrap the query embedding in a list.
    similarities = cosine_similarity([query_embedding], list(financial_news_df['embedding'].values))

    # Indices of the top_n most similar articles, highest similarity first
    top_indices = similarities[0].argsort()[::-1][:top_n]

    # Return the top_n articles
    results = financial_news_df.iloc[top_indices]
    return results

print("Semantic search function defined.")
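
To make the ranking step concrete, here is a dependency-free sketch of what `semantic_search` computes: cosine similarity between the query vector and each document vector, followed by a descending sort. The function names `cosine` and `top_n_indices` and the 2-D toy vectors are assumptions for illustration. The notebook itself uses `sklearn.metrics.pairwise.cosine_similarity` on the embeddings produced by the model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n_indices(query_vec, doc_vecs, top_n=5):
    """Indices of the top_n documents, most similar to the query first."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]

# Hypothetical 2-D "embeddings" for illustration only.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
print(top_n_indices(query, docs, top_n=2))  # [0, 2]
```

Because cosine similarity depends only on the angle between vectors, the document pointing in nearly the same direction as the query ranks first regardless of its length.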

**Reasoning**:
Now that the semantic search function is defined, the next step is to create a Gradio interface to allow users to interact with the search tool. This involves defining a wrapper function for Gradio that uses the `semantic_search` function and setting up the Gradio UI components.



In [None]:
import gradio as gr

def search_interface(query):
    results = semantic_search(query, model, financial_news_df)
    # The output component is "html", so use HTML tags rather than Markdown syntax
    output_parts = []
    for _, row in results.iterrows():
        output_parts.append(f"<b>Text:</b> {row['text']}<br><b>Label:</b> {row['label']}<br><b>URL:</b> {row['URL']}<hr>")
    return "".join(output_parts)

# Create the Gradio interface
iface = gr.Interface(
    fn=search_interface,
    inputs=gr.Textbox(lines=2, placeholder="Enter your search query here..."),
    outputs="html",
    title="Semantic Search Tool for Financial News",
    description="Enter a query to find the top 5 most semantically similar news articles."
)

# Launch the interface
iface.launch(debug=True)
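
The wrapper above interpolates article text directly into an HTML output, and rows without a trailing URL carry a missing value in the `URL` column. The following is a minimal sketch of a more defensive formatter, assuming missing URLs appear as `None` or a float NaN. The helper names `is_missing` and `format_result`, the `(none)` placeholder, and the sample label are hypothetical and are not part of the notebook.

```python
import html
import math

def is_missing(value):
    """True for None or a float NaN (a common representation of a missing cell)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def format_result(text, label, url):
    """Render one search result as an HTML fragment, escaping article content."""
    safe_text = html.escape(str(text))
    safe_label = html.escape(str(label))
    if is_missing(url):
        url_html = "(none)"
    else:
        safe_url = html.escape(str(url), quote=True)
        url_html = f'<a href="{safe_url}" target="_blank">{safe_url}</a>'
    return (
        f"<b>Text:</b> {safe_text}<br>"
        f"<b>Label:</b> {safe_label}<br>"
        f"<b>URL:</b> {url_html}<hr>"
    )

# Hypothetical row values for illustration.
print(format_result("Profits < forecasts & guidance cut", "negative", float("nan")))
```

A wrapper such as `search_interface` could call a helper like this once per row and join the fragments, so that characters such as `<` and `&` in the article text are displayed literally instead of being interpreted as markup.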


# Task
Modify the Gradio launch command in the last code cell by removing the `debug=True` argument from `iface.launch()`, with the aim of a potentially faster interface startup, and then confirm that the Gradio interface has been relaunched with the new settings.

## Modify Gradio Launch

### Subtask:
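Modify the existing Gradio launch command in the last code cell to remove the `debug=True` argument. With `debug=True`, `launch()` blocks the main thread so that errors are printed in the cell output; without it, the cell returns once the interface is up, which may make startup feel faster.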


**Reasoning**:
The subtask is to remove the `debug=True` argument from the `iface.launch()` call. The cell below is the same as the previous one except that `iface.launch(debug=True)` is replaced with `iface.launch()`.



In [None]:
import gradio as gr

def search_interface(query):
    results = semantic_search(query, model, financial_news_df)
    # The output component is "html", so use HTML tags rather than Markdown syntax
    output_parts = []
    for _, row in results.iterrows():
        output_parts.append(f"<b>Text:</b> {row['text']}<br><b>Label:</b> {row['label']}<br><b>URL:</b> {row['URL']}<hr>")
    return "".join(output_parts)

# Create the Gradio interface
iface = gr.Interface(
    fn=search_interface,
    inputs=gr.Textbox(lines=2, placeholder="Enter your search query here..."),
    outputs="html",
    title="Semantic Search Tool for Financial News",
    description="Enter a query to find the top 5 most semantically similar news articles."
)

# Launch the interface
iface.launch()


## Final Task

### Subtask:
Confirm that the Gradio interface has been relaunched without the `debug=True` argument.


## Summary:

### Data Analysis Key Findings
*   The `debug=True` argument was successfully removed from the `iface.launch()` command as intended.
*   Gradio automatically configured `share=True` upon launch, generating a public URL for accessing the interface, due to the execution environment being a hosted Jupyter notebook (Colab).
*   The Gradio interface was successfully relaunched and became accessible via the provided public URL.
*   An informational message from Gradio was displayed, advising to set `debug=True` explicitly in `launch()` for Colab notebooks to show errors, which is a general recommendation and not an error related to the modification.

### Insights or Next Steps
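*   The Gradio interface is now launched without the `debug=True` flag, so `launch()` no longer blocks the cell. The trade-off is that errors raised inside `search_interface` are no longer printed in the cell output, and any effect on startup time was not measured here.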
*   The interface is successfully deployed and accessible via a public URL, ready for external testing or usage.
