## Project Plan
1. Data Collection: Scrape free-course data from the Analytics Vidhya website.
2. Data Preprocessing: Clean and preprocess the data for search.
3. Embedding Model: Use a pre-trained language model (a BERT-style sentence encoder, `all-MiniLM-L6-v2` in the code below) to generate vector embeddings for course titles and descriptions.
4. Search Tool Development: Build a Gradio search interface that answers keyword-style user queries.
5. Testing and Evaluation: Test the functionality and performance of the search tool.
6. Deployment: Deploy the search tool for public access on Hugging Face.

## Data Information

In [1]:
import pandas as pd
data = pd.read_csv('AnalyticsVidhya.csv')

In [2]:
data.head(5)

Unnamed: 0,S.No,CourseTitle,Level,Time(Hours),Category,NumberOfLessons,Description,Curriculum,Price
0,1,GenAI Applied to Quantitative Finance: For Con...,Intermediate,1.0,Generative AI,5,This course explores the application of Genera...,"Introduction,Overview,Problem Definition: Comm...",Free
1,2,Navigating LLM Tradeoffs: Techniques for Speed...,Beginner,1.0,Generative AI,6,This course provides a concise guide to optimi...,"Introduction, Resources, Technique to Increase...",Free
2,3,Creating Problem-Solving Agents using GenAI fo...,Beginner,1.0,Generative AI,6,This introductory course provides a concise ov...,"Introduction, Overview- Count the Number of Ag...",Free
3,4,Improving Real World RAG Systems: Key Challeng...,Beginner,1.0,Generative AI,12,This course explores the key challenges in bui...,"Introduction to RAG Systems, Resources, RAG Sy...",Free
4,5,Framework to Choose the Right LLM for your Bus...,,1.0,Generative AI,6,This course will guide you through the process...,"Introduction, It’s an LLM World!, Understand Y...",Free


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   S.No             61 non-null     int64  
 1   CourseTitle      61 non-null     object 
 2   Level            58 non-null     object 
 3   Time(Hours)      59 non-null     float64
 4   Category         61 non-null     object 
 5   NumberOfLessons  61 non-null     int64  
 6   Description      61 non-null     object 
 7   Curriculum       61 non-null     object 
 8   Price            61 non-null     object 
dtypes: float64(1), int64(2), object(6)
memory usage: 4.4+ KB


## Data Cleaning And Preprocessing

In [4]:
### Handling Missing Values

In [5]:
# Check for missing values in each column
missingvalues = data.isnull().sum()

In [6]:
print(missingvalues)

S.No               0
CourseTitle        0
Level              3
Time(Hours)        2
Category           0
NumberOfLessons    0
Description        0
Curriculum         0
Price              0
dtype: int64


In [7]:
# Display rows with missing values
missingrows = data[data.isnull().any(axis=1)]
missingrows

Unnamed: 0,S.No,CourseTitle,Level,Time(Hours),Category,NumberOfLessons,Description,Curriculum,Price
4,5,Framework to Choose the Right LLM for your Bus...,,1.0,Generative AI,6,This course will guide you through the process...,"Introduction, It’s an LLM World!, Understand Y...",Free
24,25,Certified AI & ML BlackBelt+ Program,,,Data Science,33,This comprehensive certified program combines ...,Introduction to Data Science and Machine Learn...,80000
26,27,AI Ethics by Fractal,,,AI Ethics,5,AI has a huge influence on our lives. From typ...,Introduction and need for Ethical AI FREE PREV...,999


In [8]:
# Drop rows where 'Time(Hours)' or 'Level' is missing
data.dropna(subset=['Level', 'Time(Hours)'], inplace=True)

In [9]:
data

Unnamed: 0,S.No,CourseTitle,Level,Time(Hours),Category,NumberOfLessons,Description,Curriculum,Price
0,1,GenAI Applied to Quantitative Finance: For Con...,Intermediate,1.0,Generative AI,5,This course explores the application of Genera...,"Introduction,Overview,Problem Definition: Comm...",Free
1,2,Navigating LLM Tradeoffs: Techniques for Speed...,Beginner,1.0,Generative AI,6,This course provides a concise guide to optimi...,"Introduction, Resources, Technique to Increase...",Free
2,3,Creating Problem-Solving Agents using GenAI fo...,Beginner,1.0,Generative AI,6,This introductory course provides a concise ov...,"Introduction, Overview- Count the Number of Ag...",Free
3,4,Improving Real World RAG Systems: Key Challeng...,Beginner,1.0,Generative AI,12,This course explores the key challenges in bui...,"Introduction to RAG Systems, Resources, RAG Sy...",Free
5,6,Building Smarter LLMs with Mamba and State Spa...,Advanced,2.0,Generative AI,14,Unlock the Power of State Space Models (SSM) l...,"Course Overview, An Alternative to Transformer...",Free
6,7,Generative AI - A Way of Life - Free Course,Beginner,6.0,Generative AI,31,This course is a transformative journey tailor...,"Introduction to Generative AI, Text Generation...",Free
7,8,Building LLM Applications using Prompt Enginee...,Beginner,2.1,Generative AI,17,This free course offers a comprehensive guide ...,"How to build different LLM Applications?, Gett...",Free
8,9,Building Your First Computer Vision Model - Fr...,Beginner,0.5,Deep Learning,7,This course will help you gain a deep understa...,"Pixel Perfect - Decoding Images, Understanding...",Free
9,10,Bagging and Boosting ML Algorithms - Free Course,Intermediate,1.0,Machine Learning,16,This course will provide you with a hands-on u...,"Bagging, Boosting",Free
10,11,MidJourney: From Inspiration to Implementation...,Intermediate,0.5,Generative AI,5,This course will provide you with a practical ...,"MidJourney - Storm _ Story, MidJourney - Inspi...",Free
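
Dropping incomplete rows removes three courses (index 4, 24, and 26) from the searchable set. As a hedged alternative, the sketch below compares dropping with filling the gaps using a placeholder, the same idea the later FAISS cell applies inside its f-strings. The toy frame and its values are illustrative only, not rows from the real CSV.

```python
import pandas as pd

# Toy frame standing in for the course data (values are illustrative only)
toy = pd.DataFrame({
    "CourseTitle": ["Course A", "Course B", "Course C"],
    "Level": ["Beginner", None, "Advanced"],
    "Time(Hours)": [1.0, None, 2.0],
})

# Option 1: drop incomplete rows (what the notebook does)
dropped = toy.dropna(subset=["Level", "Time(Hours)"])

# Option 2: keep every course and mark the missing level explicitly
filled = toy.copy()
filled["Level"] = filled["Level"].fillna("Not specified")

print(len(dropped), len(filled))
```

Which option fits depends on whether courses with unknown level or duration should still appear in search results.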


### Preprocessing the Data for Embedding

We clean the text in the CourseTitle and Description columns before generating embeddings:
1. Lowercase all text.
2. Remove every character that is not a letter or whitespace (this also strips digits and hyphens).

In [10]:
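import re

def cleantext(text):
    # Lowercase so matching is case-insensitive
    text = text.lower()
    # Delete everything except letters and whitespace (digits, punctuation, hyphens)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

# Clean the two text columns used for embeddings
data['CourseTitle'] = data['CourseTitle'].apply(cleantext)
data['Description'] = data['Description'].apply(cleantext)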


In [11]:
data.head(3)

Unnamed: 0,S.No,CourseTitle,Level,Time(Hours),Category,NumberOfLessons,Description,Curriculum,Price
0,1,genai applied to quantitative finance for cont...,Intermediate,1.0,Generative AI,5,this course explores the application of genera...,"Introduction,Overview,Problem Definition: Comm...",Free
1,2,navigating llm tradeoffs techniques for speed ...,Beginner,1.0,Generative AI,6,this course provides a concise guide to optimi...,"Introduction, Resources, Technique to Increase...",Free
2,3,creating problemsolving agents using genai for...,Beginner,1.0,Generative AI,6,this introductory course provides a concise ov...,"Introduction, Overview- Count the Number of Ag...",Free
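
The cleaned output above shows "problemsolving": deleting the hyphen fuses two words into one. The sketch below reproduces that behavior and contrasts it with a variant that replaces non-letters with a space instead. The variant (`clean_text_spaced`) is an assumption for illustration, not the notebook's method, and the sample title is only the visible prefix of the truncated row.

```python
import re

def clean_text_original(text):
    # Same logic as cleantext above: lowercase, then delete non-letters
    return re.sub(r'[^a-zA-Z\s]', '', text.lower())

def clean_text_spaced(text):
    # Hypothetical variant: replace non-letters with a space,
    # then collapse runs of whitespace into a single space
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

title = "Creating Problem-Solving Agents using GenAI"
print(clean_text_original(title))
print(clean_text_spaced(title))
```

Keeping word boundaries intact may give the embedding model more familiar tokens, though the effect on search quality would need to be measured.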


## Embedding Model

In [12]:
# Install Library
!pip install transformers sentence-transformers



In [13]:
#Import Pre-trained Model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')



In [14]:
# Generate embeddings for the course descriptions
data['DescriptionEmbeddings'] = data['Description'].apply(lambda desc: model.encode(desc))
# Generate embeddings for course titles 
data['TitleEmbeddings'] = data['CourseTitle'].apply(lambda x: model.encode(x))

In [15]:
data.to_csv('courseswithembeddings.csv', index=False)
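
Writing the frame to CSV stores each embedding array as a text string, so reading the file back does not return numeric arrays without extra parsing. A minimal sketch of one alternative, assuming the vectors are kept in a separate NumPy file next to the CSV. The file name is hypothetical and the random matrix only stands in for `model.encode` output (384 matches the output size of `all-MiniLM-L6-v2`).

```python
import os
import tempfile
import numpy as np

# Stand-in for model.encode(...) output: 4 vectors of dimension 384
# (random values, illustrative only)
rng = np.random.default_rng(0)
embeddings = rng.random((4, 384)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "description_embeddings.npy")  # hypothetical file name
    np.save(path, embeddings)   # binary format keeps dtype and shape
    restored = np.load(path)

print(restored.shape, restored.dtype)
```

Row order in the saved matrix must match row order in the CSV for the two files to stay aligned.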

## Search Tool Development

In [16]:
#Implement the Search Functionality
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Set display options
pd.set_option('display.max_colwidth', 100) 
pd.set_option('display.width', 1000) 

def search_courses(query, data, model):
    # Generate embedding for the user query
    queryembedding = model.encode(query)
    
    # Compute cosine similarity between query and course descriptions
    similarities = cosine_similarity([queryembedding], list(data['DescriptionEmbeddings']))[0]
    
    # Top 5 results
    top5 = 5  
    topindices = similarities.argsort()[-top5:][::-1]
    
    # Retrieve the top results
    results = data.iloc[topindices]
    return results[['CourseTitle', 'Level', 'Category']]
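
The ranking step in `search_courses` can be sketched on toy vectors without loading a model. The helper name `top_k_cosine` and the two-dimensional vectors are assumptions made for illustration; the logic mirrors the cosine-similarity-then-argsort pattern used above.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=5):
    # Normalize so that a dot product equals cosine similarity
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_matrix, dtype=float)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = d @ q
    # Sort descending and keep at most k results
    k = min(k, len(sims))
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]

# Four toy "documents" in two dimensions (illustrative only)
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
indices, scores = top_k_cosine([1.0, 0.1], docs, k=2)
print(indices, scores)
```

Clamping `k` to the number of documents avoids surprises when fewer than five courses remain after filtering.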



### Keyword-Based Search for Free Courses 

In [17]:
#Create the User Interface with Gradio
import gradio as gr
data['CourseTitle'] = data['CourseTitle'].str.title()
def searchinterface(query):
    results = search_courses(query, data, model)
    return results.to_string(index=False)

# Create Gradio interface
interface = gr.Interface(
    fn=searchinterface, 
    inputs="text", 
    outputs="text", 
    title="Smart Engine for Free Courses",
    description="Enter a topic to find relevant free courses."
)

# Launch the interface
interface.launch()


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


* Running on local URL:  http://127.0.0.1:7862

To create a public link, set `share=True` in `launch()`.
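
The tokenizers warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set at the top of the notebook before `sentence_transformers` is imported and that disabling parallel tokenization is acceptable for a dataset of this size:

```python
import os

# Set before importing sentence_transformers / transformers so the
# tokenizers library sees the value from the start.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```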




### Semantic Search for Free Courses

In [23]:
import gradio as gr
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util

courseEmbeddings = model.encode(data['CourseTitle'].tolist(), convert_to_tensor=True)

def searchCourses(query):
    # Encode the user query to get its embedding
    queryEmbedding = model.encode(query, convert_to_tensor=True)
    
    # Compute cosine similarities between the query and course embeddings
    cosineScores = util.pytorch_cos_sim(queryEmbedding, courseEmbeddings)[0]
    topResults = torch.topk(cosineScores, k=3)

    results = []
    for score, idx in zip(topResults[0], topResults[1]):
        idx = idx.item()  
        results.append(f"{data.iloc[idx]['CourseTitle']} (Score: {score.item():.4f})")
    return "\n".join(results)

def searchInterface(query):
    if query.strip() == "":
        return "Please enter a search term."
    
    results = searchCourses(query)
    return results if results else "No results found."

interface = gr.Interface(
    fn=searchInterface, 
    inputs=gr.Textbox(label="Search Topic", placeholder="Enter a topic..."), 
    outputs=gr.Textbox(label="Results"),
    title="Semantic Search for Free Courses",
    description="Enter a topic to find relevant free courses using semantic search."
)

interface.launch()




* Running on local URL:  http://127.0.0.1:7866

To create a public link, set `share=True` in `launch()`.




### Semantic Search with Re-Ranking for Free Courses

In [26]:
import gradio as gr
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util

def searchCourses(query):
    queryEmbedding = model.encode(query, convert_to_tensor=True)
    cosineScores = util.pytorch_cos_sim(queryEmbedding, courseEmbeddings)[0]
    topResults = torch.topk(cosineScores, k=5)
    initialResults = [(score.item(), idx.item()) for score, idx in zip(topResults[0], topResults[1])]
    reRankedResults = reRankResults(initialResults, queryEmbedding)
    return reRankedResults

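def reRankResults(initialResults, queryEmbedding):
    # Note: each title is re-encoded with the same model and scored against the
    # same query embedding, so the scores (and order) should essentially match
    # the first-stage results. A different scorer is needed to change the ranking.
    scores = []
    for score, idx in initialResults:
        courseTitle = data.iloc[idx]['CourseTitle']
        courseEmbedding = model.encode(courseTitle, convert_to_tensor=True)
        reRankedScore = util.pytorch_cos_sim(queryEmbedding, courseEmbedding).item()
        scores.append((reRankedScore, courseTitle))
    # Sort by score, highest first
    scores.sort(key=lambda x: x[0], reverse=True)
    results = [f"{title} (Score: {score:.4f})" for score, title in scores]
    return "\n".join(results)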

def searchInterface(query):
    if query.strip() == "":
        return "Please enter a search term."
    
    results = searchCourses(query)
    return results if results else "No results found."

# Create Gradio interface
interface = gr.Interface(
    fn=searchInterface, 
    inputs=gr.Textbox(label="Search Topic", placeholder="Enter a topic..."), 
    outputs=gr.Textbox(label="Results"),
    title="Semantic Search with Re-Ranking for Free Courses",
    description="Enter a topic to find relevant free courses using semantic search and re-ranking."
)

# Launch the interface
interface.launch()




* Running on local URL:  http://127.0.0.1:7869

To create a public link, set `share=True` in `launch()`.
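
Re-ranking only changes the order when the second stage uses a signal the first stage did not. The sketch below shows one possible two-stage scheme on toy vectors: shortlist by title similarity, then re-score the shortlist with a blend of title and description similarity. The blending rule, the `title_weight` parameter, and the function name are assumptions for illustration, not the notebook's method.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_then_rerank(query_vec, title_vecs, desc_vecs, k=3, title_weight=0.5):
    # Stage 1: shortlist the k best candidates by title similarity
    title_scores = [cosine(query_vec, t) for t in title_vecs]
    shortlist = sorted(range(len(title_scores)),
                       key=lambda i: title_scores[i], reverse=True)[:k]
    # Stage 2: re-score only the shortlist with a second signal (descriptions)
    reranked = []
    for i in shortlist:
        blended = (title_weight * title_scores[i]
                   + (1 - title_weight) * cosine(query_vec, desc_vecs[i]))
        reranked.append((blended, i))
    reranked.sort(reverse=True)
    return shortlist, [i for _, i in reranked]

# Toy vectors (illustrative only): course 0 has the best title match,
# course 1 has the best description match
query = [1.0, 0.0]
title_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
desc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
first_stage, final_order = retrieve_then_rerank(query, title_vecs, desc_vecs, k=2)
print(first_stage, final_order)
```

Here the second stage promotes course 1 over course 0 because its description matches the query more closely.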




In [1]:
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import gradio as gr


df = pd.read_csv('AnalyticsVidhya.csv')

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Step 1: Combine the relevant columns to create a single text representation
df['CombinedText'] = df.apply(
    lambda row: (
        f"Title: {str(row['CourseTitle'])}. "
        f"Level: {str(row['Level']) if pd.notnull(row['Level']) else 'Not specified'}. "
        f"Duration: {str(row['Time(Hours)']) if pd.notnull(row['Time(Hours)']) else 'Unknown'} hours. "
        f"Category: {str(row['Category']) if pd.notnull(row['Category']) else 'Not specified'}. "
        f"Description: {str(row['Description']) if pd.notnull(row['Description']) else 'No description available'}"
    ), axis=1)

# Step 2: Generate embeddings for all courses
embeddings = model.encode(df['CombinedText'].tolist(), show_progress_bar=True)

# Step 3: Create FAISS index
embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)

# Add the embeddings to the index
index.add(np.array(embeddings))

# Step 4: Search function to query courses
def search_courses(query, top_k=5):
    # Generate embedding for the query
    query_embedding = model.encode([query])[0]

    # Perform the search in FAISS
    distances, indices = index.search(np.array([query_embedding]), top_k)

    # Retrieve the corresponding course information
    results = []
    for idx in indices[0]:
        course = {
            'CourseTitle': df.iloc[idx]['CourseTitle'],
            'Description': df.iloc[idx]['Description'],
            'Level': df.iloc[idx]['Level'],
            'Category': df.iloc[idx]['Category'],
            'NumberOfLessons': df.iloc[idx]['NumberOfLessons']
        }
        results.append(course)
    
    return results

# Step 5: Define Gradio interface function
def gradio_search(query, top_k):
    # The slider may deliver a float; FAISS expects an integer k
    results = search_courses(query, int(top_k))
    display_results = [
        f"Title: {res['CourseTitle']}\n"
        f"Description: {res['Description']}\n"
        f"Category: {res['Category']}\n"
        f"Level: {res['Level']}"
        for res in results
    ]
    return "\n\n".join(display_results)

# Step 6: Create Gradio interface
interface = gr.Interface(
    fn=gradio_search,
    inputs=[
        gr.Textbox(label="Search in our courses"),
        gr.Slider(minimum=1, maximum=10, step=1, value=5, label="Number of Results")
    ],
    outputs="text",
    title="Smart Course Search",
    description="Analytics Vidhya Free Courses",
    flagging_mode="never"  # Disable flagging button
)

# Step 7: Launch the Gradio interface
interface.launch()



* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
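
`IndexFlatL2` ranks by squared Euclidean distance, while the earlier cells rank by cosine similarity. For unit-length vectors the two orderings agree, because the squared distance equals 2 minus twice the cosine. The sketch below checks that identity on random vectors. It assumes normalized embeddings, which is worth verifying for the real data (or enforcing with `normalize_embeddings=True` in `model.encode`).

```python
import numpy as np

# Two random vectors scaled to unit length (illustrative only)
rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(np.dot(a, b))

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(np.isclose(squared_l2, 2.0 - 2.0 * cosine))
```

If the embeddings are not normalized, the L2 ranking and the cosine ranking can differ, so results from this cell and the earlier cells would not be directly comparable.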


