<a href="https://colab.research.google.com/github/awaisfarooqchaudhry/IB9AU-GenerativeAI-2026/blob/main/Task5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

You can upload a CSV file from your local machine. When prompted, click 'Choose Files' and select the CSV file you want to upload.

In [None]:
from google.colab import files
import pandas as pd
import io # Import the io module

uploaded = files.upload()

for fn, content in uploaded.items():
  print(f'User uploaded file "{fn}" with length {len(content)} bytes')
  # Assuming the uploaded file is a UTF-8 encoded CSV, read it into a pandas DataFrame.
  # If several files are uploaded, df ends up holding the last one.
  df = pd.read_csv(io.StringIO(content.decode('utf-8')))

print("\nFirst 5 rows of the uploaded CSV file:")
display(df.head())
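
The upload cell assumes the file is UTF-8 encoded. As a minimal sketch, assuming some CSV exports arrive in a legacy encoding instead, a small fallback decoder could be applied to the uploaded bytes before they are handed to `pd.read_csv(io.StringIO(...))`. The helper name `decode_csv_bytes` and the choice of fallback encodings are illustrative assumptions, not part of the original notebook.

```python
def decode_csv_bytes(raw, encodings=("utf-8-sig", "latin-1")):
    """Decode uploaded bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding
    raise ValueError(f"Could not decode data with any of: {encodings}")

# Bytes written in Latin-1 are not valid UTF-8, so the fallback is used here.
text = decode_csv_bytes("id,text\n1,café\n".encode("latin-1"))
print(text.splitlines()[1])
```

`utf-8-sig` also strips a byte-order mark if one is present. Because Latin-1 accepts any byte sequence, it should stay last in the list; it can silently produce wrong characters if the file is really in another encoding.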

# Task
Clean the 'text' column of the `df` DataFrame by extracting URLs into a new 'URL' column and removing them from the original 'text' column.

## Data Cleaning - URL Extraction and Removal

### Subtask:
Extract URLs from the 'text' column using regular expressions, create a new 'URL' column with the extracted URLs, and then remove these URLs from the original 'text' column.


**Reasoning**:
To extract, store, and remove URLs from the 'text' column, I need to define a regular expression pattern and then apply it to each row of the DataFrame. I will use the `re` module for this task.



In [None]:
import re

# Define a regular expression pattern for URLs
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# Create a new 'URL' column and initialize with empty strings
df['URL'] = ''

# Iterate through each row to extract URLs and clean the text
for index, row in df.iterrows():
    text = row['text']
    # Skip missing values (NaN is a float); findall() only accepts strings
    if not isinstance(text, str):
        continue
    found_urls = url_pattern.findall(text)
    if found_urls:
        df.at[index, 'URL'] = ', '.join(found_urls) # Store URLs as a comma-separated string
        # Remove URLs from the text column
        df.at[index, 'text'] = url_pattern.sub('', text).strip()

print("DataFrame after URL extraction and removal:")
display(df.head())
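
To see what the pattern does on a single string, here is a small standalone sketch. It reuses the same regular expression and adds two refinements that are assumptions rather than part of the notebook: trailing sentence punctuation is trimmed from each match (the pattern accepts characters such as `,` and `.`, so it tends to swallow them), and the whitespace left behind by a removed URL is collapsed. The helper name `split_urls` is hypothetical.

```python
import re

# Same pattern as the notebook cell above.
url_pattern = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

def split_urls(text):
    """Return (text_without_urls, list_of_urls) for one string."""
    # Heuristic: trim punctuation that usually belongs to the sentence, not the URL.
    urls = [u.rstrip('.,;:!?)') for u in url_pattern.findall(text)]
    # Remove the matches, then collapse the whitespace left behind.
    cleaned = re.sub(r'\s+', ' ', url_pattern.sub('', text)).strip()
    return cleaned, urls

sample = "Markets rally https://example.com/a?id=1, more soon."
cleaned, urls = split_urls(sample)
print(cleaned)
print(urls)
```

The trimming is only a heuristic: a URL that legitimately ends in `)` or `?` would be shortened, so it is worth checking against a sample of the real data before adopting it.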

## Embeddings Generation - Install Libraries

### Subtask:
Install the `sentence-transformers` library, which is necessary for generating sentence embeddings.


**Reasoning**:
The subtask requires installing the `sentence-transformers` library. The `!pip install` command is used for this purpose, with the `--quiet` flag to keep the output concise.



In [None]:
print("Installing sentence-transformers...")
!pip install sentence-transformers --quiet
print("sentence-transformers installed successfully.")

## Embeddings Generation - Load Model and Generate Embeddings

### Subtask:
Load the 'all-MiniLM-L6-v2' sentence-transformer model and generate embeddings for the cleaned text data, storing them for later use in semantic search.


**Reasoning**:
To generate embeddings, I will import the `SentenceTransformer` class, load the 'all-MiniLM-L6-v2' model, handle potential NaN values in the 'text' column by filling them with empty strings, and then generate embeddings for the cleaned text data.



In [None]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained 'all-MiniLM-L6-v2' model
print("Loading SentenceTransformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded successfully.")

# Handle potential NaN values in the 'text' column by filling them with empty strings
# This ensures the encoder receives valid string inputs.
df['text_cleaned'] = df['text'].fillna('')

# Generate embeddings for the 'text_cleaned' column
print("Generating embeddings...")
text_embeddings = model.encode(df['text_cleaned'].tolist(), show_progress_bar=True)
print("Embeddings generated successfully.")

# Display the shape of the generated embeddings
print(f"Shape of generated embeddings: {text_embeddings.shape}")

**Reasoning**:
The previous step successfully generated text embeddings. To make these embeddings easily accessible and associated with their respective text entries for subsequent semantic search operations, they should be added as a new column to the DataFrame `df`.



In [None]:
df['text_embeddings'] = text_embeddings.tolist()

print("Embeddings successfully added to the DataFrame.")
print("DataFrame head with new 'text_embeddings' column:")
display(df.head())

## Semantic Search - Similarity Function

### Subtask:
Define a function that takes a user query, generates its embedding, computes the cosine similarity against all generated text embeddings, and returns the top 5 most similar records along with their similarity scores.


**Reasoning**:
To implement the semantic search functionality, I need to define a Python function that takes a user query, generates its embedding, computes cosine similarity with existing text embeddings, and returns the top similar results from the DataFrame.



In [None]:
from sentence_transformers import util
import torch

def semantic_search_function(user_query, model, df, text_embeddings, top_n=5):
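    # Generate embedding for the user query
    query_embedding = model.encode(user_query, convert_to_tensor=True)

    # Put the stored embeddings on the same device as the query embedding,
    # so cos_sim does not hit a device mismatch when the model runs on a GPU
    corpus_embeddings = torch.as_tensor(text_embeddings, device=query_embedding.device)

    # Calculate cosine similarities (shape: 1 x number of records)
    cosine_scores = util.cos_sim(query_embedding, corpus_embeddings)

    # Get the top N similarity scores and their indices
    # Squeeze to convert from 1xN tensor to N tensor for topk;
    # cap k so topk does not fail when there are fewer than top_n records
    k = min(top_n, cosine_scores.shape[1])
    top_results = torch.topk(cosine_scores.squeeze(0), k=k)

    # Prepare results
    results = []
    for score, idx in zip(top_results.values, top_results.indices):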
        results.append({
            'similarity_score': score.item(),
            'record': df.iloc[idx.item()].to_dict() # Convert Series to dict for better display
        })
    return results

print("Semantic search function defined successfully.")
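
The ranking step can be illustrated without any model. The sketch below computes cosine similarity by hand on toy two-dimensional vectors and returns the best matches, which is the same arithmetic the search function applies to the 384-dimensional embeddings. The function names and the toy vectors are illustrative assumptions; the notebook itself relies on `util.cos_sim` and `torch.topk`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=5):
    """Return up to k (index, score) pairs, best score first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k([2.0, 0.0], corpus, k=2)
print(ranked)
```

The query points in the same direction as the first vector, so that record ranks first with a score of 1.0 regardless of the query's length; cosine similarity compares direction only.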

## Gradio UI - Build Interface

### Subtask:
Create a Gradio interface that provides a text input for the user's query, uses the semantic search function to find relevant records, and displays the top 5 closest records along with their similarity scores.


**Reasoning**:
The first step to building the Gradio interface is to import the `gradio` library.



In [None]:
import gradio as gr
print("Gradio library imported successfully.")

**Reasoning**:
To complete the Gradio interface, I need to define the `search_news` wrapper function, call the `semantic_search_function` within it, format the results, and then create and launch the `gr.Interface`.



In [None]:
import html

def search_news(user_query):
    # Call the previously defined semantic_search_function
    results = semantic_search_function(user_query, model, df, text_embeddings, top_n=5)

    # Nothing to show if the search returned no records
    if not results:
        return "<p>No results found for your query.</p>"

    # Format the results into a readable HTML string
    html_output = "<h3>Search Results:</h3>"
    for i, res in enumerate(results, start=1):
        record = res['record']
        similarity_score = res['similarity_score']
        # Escape the record text so characters such as < or & display literally
        text = html.escape(str(record.get('text', 'N/A')))
        html_output += "<div>"
        html_output += f"<p><b>Result {i} (Similarity: {similarity_score:.4f}):</b></p>"
        html_output += f"<p><b>Text:</b> {text}</p>"
        if record.get('URL'):
            html_output += "<p><b>URL(s):</b> "
            for url in record['URL'].split(', '):
                # Escape quotes too, since the URL is placed inside an attribute
                safe_url = html.escape(url, quote=True)
                html_output += f"<a href='{safe_url}' target='_blank'>{safe_url}</a> "
            html_output += "</p>"
        html_output += "<hr></div>"

    return html_output

# Create a Gradio interface
news_interface = gr.Interface(
    fn=search_news,
    inputs=gr.Textbox(lines=2, label='Enter your news query'),
    outputs=gr.HTML(label='Search Results'),
    title='Semantic News Search',
    description='Enter a query to find semantically similar news articles.'
)

# Launch the Gradio interface
print("Launching Gradio interface...")
news_interface.launch(share=True)

## Final Task

### Subtask:
Summarize the performed tasks and provide instructions on how to use the generated Gradio semantic search interface.


## Summary:

### Data Analysis Key Findings

*   **Data Cleaning:** URLs were extracted from the 'text' column into a new 'URL' column (comma-separated when a row contains several) and then removed from the original 'text' column.
*   **Library Installation:** The `sentence-transformers` library was successfully installed, which is essential for generating text embeddings.
*   **Embeddings Generation:**
    *   The 'all-MiniLM-L6-v2' SentenceTransformer model was loaded successfully.
    *   `NaN` values in the 'text' column were handled by converting them to empty strings.
    *   Embeddings were generated for 16,990 text entries, resulting in 384-dimensional vectors, and stored in a new `text_embeddings` column in the DataFrame.
*   **Semantic Search Functionality:** A `semantic_search_function` was defined, capable of taking a user query, generating its embedding, calculating cosine similarity against existing text embeddings, and returning the top 5 most similar records along with their similarity scores.
*   **Gradio User Interface:** A Gradio interface was built, allowing users to input a query, which then uses the semantic search function to display the top 5 relevant news articles with their text, URL(s) (as clickable links), and similarity scores in a user-friendly HTML format. The interface was launched with a public URL.

### Insights or Next Steps

*   The Gradio interface lets users run semantic searches over the news article dataset from a browser, ranking records by the cosine similarity between the sentence embedding of the query and that of each article's text.
*   Consider implementing a feedback mechanism within the Gradio interface to allow users to rate the relevance of search results, which could be used to fine-tune the semantic search model or improve result ranking in the future.
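
As a rough sketch of the feedback idea, assuming ratings are simply appended to a CSV log for later analysis, a recording helper might look like the following. The function name `record_feedback`, the 1 to 5 rating scale, and the column layout are all hypothetical; wiring it to a Gradio component is left out.

```python
import csv
import io
from datetime import datetime, timezone

def record_feedback(writer, query, record_index, rating):
    """Append one relevance rating (1-5) as a CSV row."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    timestamp = datetime.now(timezone.utc).isoformat()
    writer.writerow([timestamp, query, record_index, rating])

buffer = io.StringIO()  # Stands in for a file opened in append mode
writer = csv.writer(buffer)
record_feedback(writer, "central bank rates", 42, 4)

# Read the log back to confirm what was stored (timestamp column skipped).
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[0][1:])
```

Storing the query alongside the record index keeps enough context to evaluate ranking quality later, for example by checking how often highly rated records appear in the top positions.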
