**Goal**: Build a tool that does two things:<br>
1- ***Question Classification***: automatically categorize a question (e.g. "SQL", "Arrays", "System Design").<br>
2- ***Semantic Similarity***: find the questions most similar to one a user provides.

### Stage 1: Sourcing & Loading our Data

From the HuggingFace `datasets` library, we'll load the ***`databricks/databricks-dolly-15k`*** dataset, which has instructions, contexts, responses, and categories.<br>
We only need the ***`instruction`*** (the question) and the ***`category`***.

*<div class="alert alert-block alert-warning">In a notebook cell, prefix `pip` commands with an exclamation mark (`!`); without it, the cell may raise an error.</div>*

In [1]:
# Installing the needed libraries
# ! pip install datasets pandas

**Python script to load, inspect, and prep the data**

In [2]:
import pandas as pd
from datasets import load_dataset

In [3]:
# Loading the dataset from HuggingFace
print("Loading dataset...")
dataset = load_dataset("databricks/databricks-dolly-15k")

dataset

Loading dataset...


DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 15011
    })
})

In [4]:
# Converting the dataset into Pandas dataframe
df = pd.DataFrame(dataset["train"])

df.head()

Unnamed: 0,instruction,context,response,category
0,When did Virgin Australia start operating?,"Virgin Australia, the trading name of Virgin A...",Virgin Australia commenced services on 31 Augu...,closed_qa
1,Which is a species of fish? Tope or Rope,,Tope,classification
2,Why can camels survive for long without water?,,Camels use the fat in their humps to keep them...,open_qa
3,"Alice's parents have three daughters: Amy, Jes...",,The name of the third daughter is Alice,open_qa
4,When was Tomoaki Komorida born?,Komorida was born in Kumamoto Prefecture on Ju...,"Tomoaki Komorida was born on July 10,1981.",closed_qa


In [5]:
# Exploring the categories
print("\nAvailable categories and their counts:")
print(df["category"].value_counts())


Available categories and their counts:
category
open_qa                   3742
general_qa                2191
classification            2136
closed_qa                 1773
brainstorming             1766
information_extraction    1506
summarization             1188
creative_writing           709
Name: count, dtype: int64


For this project, let's filter for just a few strong categories *to make training faster and more focused*.

In [6]:
# Filtering
target_categories = ["information_extraction", "classification", "open_qa", "brainstorming", "summarization"]
df_filtered = df[df["category"].isin(target_categories)].copy()

df_filtered.head()

Unnamed: 0,instruction,context,response,category
1,Which is a species of fish? Tope or Rope,,Tope,classification
2,Why can camels survive for long without water?,,Camels use the fat in their humps to keep them...,open_qa
3,"Alice's parents have three daughters: Amy, Jes...",,The name of the third daughter is Alice,open_qa
5,If I have more pieces at the time of stalemate...,Stalemate is a situation in chess where the pl...,No. \nStalemate is a drawn position. It doesn'...,information_extraction
7,Who gave the UN the land in NY to build their HQ,,John D Rockerfeller,open_qa


In [7]:
# Extracting our final Dataframe where we only need 'instruction' and 'category'
df_final = df_filtered[["instruction", "category"]].copy()

df_final.head()

Unnamed: 0,instruction,category
1,Which is a species of fish? Tope or Rope,classification
2,Why can camels survive for long without water?,open_qa
3,"Alice's parents have three daughters: Amy, Jes...",open_qa
5,If I have more pieces at the time of stalemate...,information_extraction
7,Who gave the UN the land in NY to build their HQ,open_qa


We will convert our **text labels** (like 'classification') **into numbers**, because the model works with numeric class IDs rather than strings.

In [8]:
# Converting text labels into numbers
df_final["label"] = df_final["category"].astype("category").cat.codes

df_final["label"].value_counts()

label
3    3742
1    2136
0    1766
2    1506
4    1188
Name: count, dtype: int64

In [9]:
# Our final data
print("\nOur final, prepped data:")
print(df_final.head())
print(f"\nTotal samples: {len(df_final)}")


Our final, prepped data:
                                         instruction                category  \
1           Which is a species of fish? Tope or Rope          classification   
2     Why can camels survive for long without water?                 open_qa   
3  Alice's parents have three daughters: Amy, Jes...                 open_qa   
5  If I have more pieces at the time of stalemate...  information_extraction   
7   Who gave the UN the land in NY to build their HQ                 open_qa   

   label  
1      1  
2      3  
3      3  
5      2  
7      3  

Total samples: 10338


In [10]:
# Saving our final data to a csv file
df_final.to_csv("interview_questions.csv", index=False)

<div class="alert alert-block alert-success">

**Deliverable 1**: We now have a clean `interview_questions.csv` file with three columns: `instruction` (the question), `category` (the category name), and `label` (a number representing the category)

</div>

### Stage 2: Building the Classification Model

Now we'll `fine-tune` a pre-trained **HuggingFace Transformer** model.

In [11]:
# Installing the core "transformers" library
# ! pip install transformers torch scikit-learn

`torch` is the *PyTorch* backend; `scikit-learn` is for splitting our data

In [12]:
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import json

In [13]:
# Loading our previously saved csv file
df = pd.read_csv("interview_questions.csv")

df.head()

Unnamed: 0,instruction,category,label
0,Which is a species of fish? Tope or Rope,classification,1
1,Why can camels survive for long without water?,open_qa,3
2,"Alice's parents have three daughters: Amy, Jes...",open_qa,3
3,If I have more pieces at the time of stalemate...,information_extraction,2
4,Who gave the UN the land in NY to build their HQ,open_qa,3


Let's create a dictionary that maps each numeric label ID back to its category name

In [14]:
label_map = dict(enumerate(df["category"].astype("category").cat.categories))
num_labels = len(label_map)
print(f"Number of categories: {num_labels} \nCategories: {label_map}")

Number of categories: 5 
Categories: {0: 'brainstorming', 1: 'classification', 2: 'information_extraction', 3: 'open_qa', 4: 'summarization'}


Let's split the data into train and test dataframes

In [None]:
train_df, test_df = train_test_split(df, train_size=0.8, random_state=42, stratify=df["label"])
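
The `stratify` argument keeps the label proportions roughly equal in both splits. Below is a simplified plain-Python sketch of that idea; `stratified_split` is a hypothetical helper written for illustration, and scikit-learn's actual algorithm differs in its details.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=42):
    """Return (train_indices, test_indices), splitting each label in the same proportion."""
    rng = random.Random(seed)

    # Group row indices by their label
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_indices, test_indices = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                          # shuffle within each label
        cut = round(len(indices) * train_fraction)    # same fraction for every label
        train_indices.extend(indices[:cut])
        test_indices.extend(indices[cut:])
    return train_indices, test_indices

labels = [0] * 10 + [1] * 5   # toy labels: ten of class 0, five of class 1
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Without stratification, a rare category could by chance be under-represented in the test set.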

Let's convert our new `train dataframe` and `test dataframe` into HuggingFace Dataset objects

In [16]:
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

train_dataset[:5]

{'instruction': ['Identify which of the following are episode titles from "The X-Files": The Pine Bluff Variant, My Teacher is an Alien, Memento Mori, The Day the Earth Stood Still, Millennium, Arcadia, The Matrix',
  'Why indian Marriage is so long process',
  'What checks should be done when a data pipeline fails',
  'Where are good locations to scuba dive?',
  'Name five countries in the Southern Hemisphere'],
 'category': ['classification',
  'brainstorming',
  'brainstorming',
  'brainstorming',
  'brainstorming'],
 'label': [1, 0, 0, 0, 0],
 '__index_level_0__': [5662, 5534, 2957, 1632, 9333]}

**Tokenization**

`Tokenization` is a fundamental step in Natural Language Processing (NLP) that involves breaking down raw text into smaller, meaningful units called tokens.
For a machine learning model to process human language, these `tokens` are converted into `numerical IDs` (*encoding the tokens*) based on a predefined vocabulary. This numerical representation is the actual input the model uses to learn patterns and make predictions.
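
As a toy illustration of the text-to-IDs step described above, here is a minimal sketch with a made-up six-entry vocabulary. It is not how DistilBERT's tokenizer works internally (that one splits text into WordPiece subwords and uses a far larger vocabulary); `toy_vocab`, `toy_tokenize`, and `toy_encode` are hypothetical names.

```python
# Made-up vocabulary: every known token has a fixed numeric ID
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "what": 2, "is": 3, "python": 4, "?": 5}

def toy_tokenize(text):
    # Lowercase, separate the question mark, and split on whitespace
    return text.lower().replace("?", " ?").split()

def toy_encode(text, max_length=6):
    tokens = toy_tokenize(text)[:max_length]                      # truncation
    ids = [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]
    padding = max_length - len(ids)                               # pad to a fixed size
    return {
        "input_ids": ids + [toy_vocab["[PAD]"]] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,         # 1 = real token, 0 = padding
    }

encoded = toy_encode("What is Rust?")
print(encoded)
```

The word "rust" is not in the toy vocabulary, so it maps to the `[UNK]` ID, and the two trailing zeros in `input_ids` are padding.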

In [17]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    # 'padding="max_length"' ensures all inputs are the same size
    return tokenizer(examples["instruction"], padding="max_length", truncation=True)

print("Tokenizing datasets...")
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

Tokenizing datasets...


Map:   0%|          | 0/2067 [00:00<?, ? examples/s]

Map:   0%|          | 0/8271 [00:00<?, ? examples/s]

**ML Model**

In [18]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Setting/Creating the Trainer**

In [19]:
# Run this because using the `Trainer` with PyTorch requires `accelerate>=0.26.0`
# (the quotes stop the shell from treating `>` as an output redirect)
# ! pip install "accelerate>=0.26.0"

In [20]:
training_args = TrainingArguments(
    output_dir="./results",                    # Where to save the model
    eval_strategy="epoch",                      # Check performance every epoch
    num_train_epochs=1,                         # A single epoch keeps this demo run short
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset
)

Now, let's **train** our model!!!

In [21]:
print("Starting training...")
trainer.train()

Starting training...


Epoch,Training Loss,Validation Loss
1,No log,0.765351


TrainOutput(global_step=259, training_loss=1.2265973772321428, metrics={'train_runtime': 3678.0829, 'train_samples_per_second': 0.562, 'train_steps_per_second': 0.07, 'total_flos': 273824762065920.0, 'train_loss': 1.2265973772321428, 'epoch': 1.0})
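
The table above reports only the validation loss. One way to also get an accuracy figure is a small metric function in the shape that `Trainer` accepts through its `compute_metrics` argument. The sketch below is an assumption-laden example, not part of the run shown here: `compute_accuracy` is a hypothetical name, and the logits and labels are invented.

```python
def compute_accuracy(eval_pred):
    """Accuracy from a (logits, labels) pair; rows may be lists or NumPy arrays."""
    logits, labels = eval_pred
    correct = 0
    for row, label in zip(logits, labels):
        # The predicted class is the index of the largest logit
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
    return {"accuracy": correct / len(labels)}

# Invented example: three samples, three classes
example_logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 1.5], [0.9, 1.1, 0.0]]
example_labels = [0, 2, 0]
result = compute_accuracy((example_logits, example_labels))
print(result)
```

Passing the function as `Trainer(..., compute_metrics=compute_accuracy)` should add the metric to each evaluation, though that wiring was not run in this notebook.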

Let's **save** the `model`

Alongside it, we'll also save:
- the `tokenizer`, and
- our `label map`

In [22]:
# Saving the model
print("Training complete. Saving model...")
model_save_path = "./my_question_classifier"
trainer.save_model(model_save_path)

# Saving the tokenizer
tokenizer.save_pretrained(model_save_path)

# Saving the label map
with open(f"{model_save_path}/label_map.json", "w") as f:
    json.dump(label_map, f)

print(f"Model saved to {model_save_path}")

Training complete. Saving model...
Model saved to ./my_question_classifier
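
One detail to keep in mind when reading `label_map.json` back: JSON object keys are always strings, so the integer IDs are stored as `"0"`, `"1"`, and so on. A minimal sketch of the round trip, done in memory rather than through the file, with a hypothetical predicted class index:

```python
import json

label_map = {
    0: "brainstorming", 1: "classification", 2: "information_extraction",
    3: "open_qa", 4: "summarization",
}

# Serialising and parsing again turns the integer keys into strings
restored_raw = json.loads(json.dumps(label_map))

# Convert the keys back to integers before looking up a predicted class ID
restored = {int(key): name for key, name in restored_raw.items()}

predicted_id = 3  # hypothetical model output, for illustration only
print(restored[predicted_id])
```

Skipping the `int(...)` conversion would make `restored_raw[3]` raise a `KeyError`, because only the key `"3"` exists.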


<div class="alert alert-block alert-success">
<b>Deliverable 2</b>: We now have a folder <code>./my_question_classifier</code> that contains our fully trained model ready to make predictions.
</div>

### Stage 3: Building the Semantic Similarity Engine

Here, rather than *training* a model, we will use a specific type of model called **Sentence-BERT** (from the `sentence-transformers` library).
Its job is to read a sentence and turn it into a list of numbers (a `vector` or `embedding`) that captures its meaning.

***Note***: Sentences with similar meanings will have mathematically similar vectors.
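
"Mathematically similar" here means a high cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below uses tiny made-up 2-D vectors; real sentence embeddings have hundreds of dimensions, and `cosine_similarity` is a hypothetical helper written for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way, different length
orthogonal = [0.0, 3.0]       # unrelated direction
opposite = [-1.0, 0.0]        # points the opposite way

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
print(cosine_similarity(query, opposite))
```

The score depends only on direction, not on vector length, and ranges from -1 (opposite) to 1 (same direction).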

In [23]:
# Installing the needed library
# ! pip install sentence-transformers

In [24]:
import torch
from sentence_transformers import SentenceTransformer, util

Loading the model

In [25]:
# "all-MiniLM-L6-v2" is a fantastic, fast, all-purpose model
print("Loading Sentence-BERT model...")
model = SentenceTransformer("all-MiniLM-L6-v2")

Loading Sentence-BERT model...


In [26]:
# Extracting questions from our dataframe
questions = df["instruction"].tolist()

# questions[:5]

Creating Embeddings

In [27]:
print(f"Creating embeddings for {len(questions)} questions...")
question_embeddings = model.encode(questions, convert_to_tensor=True)

print("Embeddings created.")

Creating embeddings for 10338 questions...
Embeddings created.


In a real app, we would save these embeddings to a file or database so we don't have to recalculate them every time.<br>
For our demo, we'll just keep them in memory.

Creating a "`Finder`" Function

In [28]:
def find_similar_questions(query, top_k=3):
    print(f"\nFinding results for query: '{query}'")
    
    # Converting the new query into an embedding
    query_embedding = model.encode(query, convert_to_tensor=True)
    
    # Compute cosine similarity between the query and all our questions
    cos_scores = util.cos_sim(query_embedding, question_embeddings)[0]
    
    # Find the top_k highest scores
    top_results = torch.topk(cos_scores, k=top_k+1)         # Fetch one extra hit in case the query itself is in the list and gets skipped
    
    print("Top results:")
    for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
        if questions[idx] == query:                         # Skip self-comparison
            continue
        print(f" {i+1}, {questions[idx]} (Score: {score:.4f})")

Let's test our `Finder` Function

In [29]:
find_similar_questions("How do I write a SQL query to find all employees with a salary over $50k?")
find_similar_questions("Explain the difference between a list and a tuple in Python.")


Finding results for query: 'How do I write a SQL query to find all employees with a salary over $50k?'
Top results:
 1, Extract the total number of employees  in 2021 at Westpac Bank in Australia? (Score: 0.4307)
 2, Extract the 2021 Total Profit or Operating Income from the following text (Score: 0.4080)
 3, Summarise the ways an employer can find workers using the given text as a reference (Score: 0.3683)
 4, Who is the largest employer in the world? (Score: 0.3650)

Finding results for query: 'Explain the difference between a list and a tuple in Python.'
Top results:
 1, What is Python? (Score: 0.4515)
 2, What is Python? (Score: 0.4515)
 3, What is the difference between NumPy and pandas? (Score: 0.4324)
 4, What is the difference between bills, notes, and bonds? (Score: 0.3299)
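
The output above lists four results for `top_k=3` because the extra hit is only dropped when the query itself appears in the dataset, and the same question can show up twice. A possible variant that returns exactly `top_k` distinct questions is sketched below in plain Python; `top_k_matches` is a hypothetical helper, and the questions and scores are invented for illustration.

```python
def top_k_matches(query, questions, scores, top_k=3):
    """Return up to top_k (question, score) pairs, best first, skipping the query and repeats."""
    ranked = sorted(zip(questions, scores), key=lambda pair: pair[1], reverse=True)
    results = []
    seen = {query}                     # treat the query itself as already seen
    for question, score in ranked:
        if question in seen:
            continue                   # skip self-matches and duplicate questions
        seen.add(question)
        results.append((question, score))
        if len(results) == top_k:
            break
    return results

# Invented questions and similarity scores
sample_questions = ["What is Python?", "What is Python?", "What is NumPy?", "What is a bond?", "What is SQL?"]
sample_scores = [0.45, 0.45, 0.43, 0.33, 0.10]

matches = top_k_matches("Explain lists and tuples.", sample_questions, sample_scores)
for rank, (question, score) in enumerate(matches, start=1):
    print(f" {rank}, {question} (Score: {score:.4f})")
```

In the notebook, `scores` could come from `cos_scores.tolist()`. Sorting every score is slower than `torch.topk` on large collections, so this trades speed for simplicity.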


<div class="alert alert-block alert-success">

**Deliverable 3**: We now have the `logic` and `code` to find semantically similar questions for any new query.

</div>

### Stage 4: Creating the Streamlit Demo App

Now that we have our two "*brains*": the `classifier` and the `similarity finder`, let's build a body for them.<br>
This will be a simple web UI using `Streamlit`, which lets you build a web app just by writing a Python script.

<div class="alert alert-block alert-info">

This is the `app.py` Python code.

</div>