# BERT Question Answering System


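I used a BERT model fine-tuned on SQuAD to build a question answering (QA) system over Spotify's design guidelines. Because the model is extractive, every answer is a span copied from the supplied text, so a QA system limited to specific documents can keep answers consistent with company policy and branding and reduce hallucinations. The approach has strong potential for internal questions and HR purposes.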

In [None]:
# Importing Libraries and initializing model
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [None]:
def get_answer(question, paragraph):
    # Tokenize the question/paragraph pair. BERT accepts at most 512 tokens,
    # so truncate the paragraph (never the question) when the pair is too long.
    encoding = tokenizer.encode_plus(text=question, text_pair=paragraph,
                                     truncation='only_second', max_length=512)
    input_ids = encoding['input_ids']  # Token IDs
    segment_ids = encoding['token_type_ids']  # Segment IDs: 0 = question, 1 = paragraph
    tokens = tokenizer.convert_ids_to_tokens(input_ids)  # Input tokens

    # Get the model's answer prediction (no gradients needed for inference)
    with torch.no_grad():
        output = model(input_ids=torch.tensor([input_ids]),
                       token_type_ids=torch.tensor([segment_ids]))
    start_index = torch.argmax(output.start_logits).item()
    end_index = torch.argmax(output.end_logits).item()

    # An end position before the start means no valid span was found
    if end_index < start_index:
        return ''

    # Extract the answer and merge '##' subword pieces into the preceding word
    corrected_answer = ''
    for word in tokens[start_index:end_index + 1]:
        if word.startswith('##'):
            corrected_answer += word[2:]
        else:
            corrected_answer += ' ' + word

    return corrected_answer.strip()
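
The post-processing in `get_answer` can be illustrated without loading the model. The sketch below is not part of the notebook: `best_span` and `merge_wordpieces` are hypothetical helper names, and the tokens and logit values are made up for illustration. It shows one common alternative to taking the two argmaxes independently, namely picking the highest-scoring pair with `start <= end`, followed by the same `##` merging used above.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit,
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), float('-inf')
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuation pieces to the previous word."""
    text = ''
    for token in tokens:
        if token.startswith('##'):
            text += token[2:]
        else:
            text += ' ' + token
    return text.strip()


# Made-up example: the independent argmaxes would give start=4, end=3 (invalid)
tokens = ['[CLS]', 'what', 'color', '[SEP]', 'spot', '##ify', 'green', '[SEP]']
start_logits = [0.1, 0.0, 0.0, 0.0, 2.0, 0.5, 1.0, 0.0]
end_logits = [0.1, 0.0, 0.0, 3.0, 0.2, 0.4, 2.5, 0.0]

start, end = best_span(start_logits, end_logits)
print(merge_wordpieces(tokens[start:end + 1]))
```

The nested loop is quadratic in the sequence length, which is acceptable for inputs of at most 512 tokens. The `max_answer_len` cap of 30 tokens is an assumption, not a value taken from the notebook.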

In [None]:
# Read the paragraph from a text file
file_path = "/content/Spotify_Design_Guidelines_Processed.txt"
with open(file_path, 'r', encoding='utf-8') as file:
    paragraph = file.read()

# Chatbot loop: keep answering questions until the user types 'exit'
while True:
    question = input("Ask a question (or type 'exit' to stop): ").strip()
    if question.lower() == 'exit':
        break
    if not question:
        continue  # Ignore empty input
    answer = get_answer(question, paragraph)
    print("Answer:", answer or "(no answer found)")
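
BERT checkpoints such as the one used here accept at most 512 tokens per input, so passing a whole guidelines file as one paragraph only works when the file is short. A minimal sketch of one workaround, assuming word counts are an acceptable proxy for token counts: split the text into overlapping windows and query each window separately. `split_into_chunks` is a hypothetical helper, and the default sizes are assumptions chosen to leave headroom under the 512-token limit.

```python
def split_into_chunks(text, max_words=200, overlap=50):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError('overlap must be smaller than max_words')
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # The last window already reaches the end of the text
    return chunks


# Tiny demonstration with ten "words" and a window of four
sample = ' '.join(str(i) for i in range(10))
for chunk in split_into_chunks(sample, max_words=4, overlap=2):
    print(chunk)
```

Each chunk could then be passed to `get_answer` in place of the full paragraph. Choosing among the per-chunk answers needs a confidence score, for example the sum of the selected start and end logits, which `get_answer` would have to return alongside the text.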