In [None]:
pip install beautifulsoup4

In [None]:
pip install torch transformers

## Importing the data from a Wikipedia page

I use the Wikipedia page https://en.wikipedia.org/wiki/Atal_Tunnel as the context for the Q&A bot. The link can be replaced with any other Wikipedia page of the user's choice.

In [13]:
import torch
from transformers import RobertaTokenizer, RobertaForQuestionAnswering
import urllib.request
from bs4 import BeautifulSoup

In [9]:
def extract_wikipedia_data(wikipedia_link):
    """
    Extracts data from a Wikipedia page and returns the content.

    Parameters:
    - wikipedia_link (str): The link to the Wikipedia page.

    Returns:
    - wikipedia_data (str): The extracted data from the Wikipedia page.
    """
    try:
        # Fetch HTML content from the Wikipedia page
        page = urllib.request.urlopen(wikipedia_link)
        soup = BeautifulSoup(page, 'html.parser')

        # Extract text data
        text_data = []
        for paragraph in soup.find_all('p'):
            text_data.append(paragraph.text)

        # Combine paragraphs into a single string
        wikipedia_data = '\n'.join(text_data)

        return wikipedia_data

    except Exception as e:
        # Handle exceptions, e.g., invalid URL or network issues
        print(f"Error: {e}")
        return None

#execution
wikipedia_link = 'https://en.wikipedia.org/wiki/Atal_Tunnel'
extracted_data = extract_wikipedia_data(wikipedia_link)
wikipedia_text = extracted_data

if extracted_data:
    print("Data Extracted Successfully:")
    print(extracted_data[:500])  # Displaying the first 500 characters as a sample
else:
    print("Failed to extract data from the Wikipedia page.")


Data Extracted Successfully:


Atal Tunnel (also known as Rohtang Tunnel),[1] named after former Prime Minister of India, Atal Bihari Vajpayee is a highway tunnel built under the Rohtang Pass in the eastern Pir Panjal range of the Himalayas on the National Highway 3 in Himachal Pradesh, India.[1][2] At a length of 9.02 km, it is the highest highway single-tube tunnel above 10,000 feet (3,048 m) in the world.[3][4][5] With the existing Atal Tunnel and after the completion of under-construction Shinku La Tunnel, which is targ


## Cleaning Text

The text extracted from Wikipedia still contains citation markers such as [1], line breaks, and possibly stray HTML tags, so I wrote the clean_wikipedia_text function to remove them.

In [10]:
import re

def clean_wikipedia_text(text):
    """
    Cleans Wikipedia text by removing HTML tags, citation markers
    such as [1], <s> tags, and newline characters.
    
    Args:
    text (str): The raw text from a Wikipedia page or similar source.
    
    Returns:
    str: Cleaned text.
    """
    # Remove HTML tags
    clean_text = re.sub('<.*?>', '', text)

    # Remove citation markers like [1], [2], etc.
    clean_text = re.sub(r'\[\d+\]', '', clean_text)

    # Replace newline characters with a space
    clean_text = clean_text.replace('\n', ' ')

    # Remove <s> tags
    clean_text = clean_text.replace('<s>', '').replace('</s>', '')

    # Additional cleaning steps can be added here if needed

    return clean_text

# execution
raw_text = wikipedia_text
cleaned_text = clean_wikipedia_text(raw_text)
print(cleaned_text[:500])


  Atal Tunnel (also known as Rohtang Tunnel), named after former Prime Minister of India, Atal Bihari Vajpayee is a highway tunnel built under the Rohtang Pass in the eastern Pir Panjal range of the Himalayas on the National Highway 3 in Himachal Pradesh, India. At a length of 9.02 km, it is the highest highway single-tube tunnel above 10,000 feet (3,048 m) in the world. With the existing Atal Tunnel and after the completion of under-construction Shinku La Tunnel, which is targeted to be complet
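The cleaned output above still starts with leading spaces, and pages can contain other bracketed markers besides numeric citations. As a minimal sketch of an additional cleaning step, the helper below (the name `strip_reference_markers` and the extra marker patterns `[a]` and `[citation needed]` are assumptions for illustration, not part of the notebook) removes such markers and collapses whitespace. The sample string is made up for the demonstration.

```python
import re

def strip_reference_markers(text):
    """Remove bracketed reference markers and collapse whitespace (illustrative helper)."""
    # Numeric citations such as [1] or [23]
    text = re.sub(r'\[\d+\]', '', text)
    # Assumed extra markers: single-letter notes like [a] and [citation needed]
    text = re.sub(r'\[(?:[a-z]|citation needed)\]', '', text)
    # Collapse any run of whitespace (spaces, newlines) into a single space
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample in the style of the extracted text
sample = ("\n\nAtal Tunnel (also known as Rohtang Tunnel),[1] is a highway tunnel.[1][2]\n"
          "It is 9.02 km long.[citation needed]")
print(strip_reference_markers(sample))
```

Collapsing whitespace at the end also takes care of the double spaces that are left behind when a marker is removed from the middle of a sentence.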


## Q&A Bot using Transformer Model

Since we do not have a Q&A dataset on which to train a model, I use a pre-trained transformer model (the deep learning model required by the case study) to answer questions about the text of the Wikipedia page. The model is deepset/roberta-base-squad2, a RoBERTa model already fine-tuned on SQuAD 2.0, so it is used here without any further fine-tuning on our data.

In [14]:
# Load pre-trained RoBERTa model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('deepset/roberta-base-squad2')
model = RobertaForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2')

In [15]:
def answer_question(model, tokenizer, question, context, max_length=512):
    # Tokenize the question to understand its length
    question_tokens = tokenizer.encode(question, add_special_tokens=True)

    # Initialize variables to store processed segments
    segments = []

    # Tokenize the context and split into segments
    remaining_tokens = tokenizer.encode(context, add_special_tokens=False, add_prefix_space=True)
    while remaining_tokens:
        segment_tokens = []
        while remaining_tokens and len(question_tokens) + len(segment_tokens) < max_length:
            segment_tokens.append(remaining_tokens.pop(0))
        segments.append(tokenizer.decode(segment_tokens))

    # Run the inputs on the same device that holds the model weights
    device = next(model.parameters()).device

    # Iterate over each segment and look for the answer
    for segment in segments:
        inputs = tokenizer.encode_plus(question, segment, add_special_tokens=True, return_tensors='pt', truncation=True, max_length=max_length)
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        input_ids = inputs['input_ids']

        with torch.no_grad():
            outputs = model(**inputs)

        answer_start_scores, answer_end_scores = outputs.start_logits, outputs.end_logits

        answer_start = torch.argmax(answer_start_scores)
        answer_end = torch.argmax(answer_end_scores) + 1

        # Position 0 is the <s> token, which SQuAD 2.0 models point at to signal "no answer",
        # so only accept spans that start after it
        if 0 < answer_start < answer_end:
            return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[0][answer_start:answer_end]))

    return "Sorry, I could not find an answer in the text."

# Execution
context = cleaned_text
question = "What is the length of Atal Tunnel?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")


Token indices sequence length is longer than the specified maximum sequence length for this model (1738 > 512). Running this sequence through the model will result in indexing errors
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Answer:  9.02 km
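The truncation warning above suggests that some segments end up slightly longer than `max_length` once the special tokens of the question/segment pair are added, so a few tokens at segment boundaries may be dropped. The segments also do not overlap, which means an answer that straddles two segments can be missed. A minimal sketch of one way to address both points is shown below. It works on plain lists of token ids. The function name `split_into_windows` and the parameters `stride` and `num_special` are hypothetical, and the value 4 assumes RoBERTa's pair format `<s> question </s></s> context </s>`.

```python
def split_into_windows(context_ids, question_len, max_length=512, stride=128, num_special=4):
    """Split token ids into overlapping windows that fit next to the question.

    question_len counts the question tokens WITHOUT special tokens.
    num_special is the number of special tokens added to a question/context
    pair (assumed to be 4 for RoBERTa).
    stride is the number of tokens shared by two consecutive windows.
    """
    window_size = max_length - question_len - num_special
    if window_size <= 0:
        raise ValueError("question is too long for the chosen max_length")
    if not 0 <= stride < window_size:
        raise ValueError("stride must be non-negative and smaller than the window size")

    windows = []
    start = 0
    while start < len(context_ids):
        windows.append(context_ids[start:start + window_size])
        if start + window_size >= len(context_ids):
            break  # this window already reaches the end of the context
        start += window_size - stride  # step forward, keeping `stride` tokens of overlap
    return windows

# Toy example: integers stand in for token ids
toy_ids = list(range(20))
windows = split_into_windows(toy_ids, question_len=2, max_length=14, stride=3)
for window in windows:
    print(window)
```

Each window could then be paired with the question and passed to the model in place of the decoded segments. The tokenizers in transformers also accept `stride` and `return_overflowing_tokens` arguments that serve a similar purpose, which may be a simpler route than splitting by hand.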


## Execution

In [25]:

question = "What is the size of Atal tunnel?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  9.02 km


In [26]:

question = "Where is Atal tunnel?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  Himachal Pradesh, India


In [17]:

question = "Who built it?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  Atal Bihari Vajpayee


In [18]:

question = "what is the elevation of atal tunnel?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  3,100 metres


In [19]:

question = "what is the cost?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  ₹3,200 crore (US$438 million).


In [20]:

question = "By whom it was completed?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  Border Roads Organisation (BRO) under Ministry of Defence


In [21]:

question = "which state's tourism will it boost?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  Himachal


In [22]:

question = "what challanges did it face?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  road blockades, avalanches, and traffic snarls


In [23]:

question = "what are the specififcations of atal tunnel?"

answer = answer_question(model, tokenizer, question, context)
print(f"Answer: {answer}")



Answer:  it is the highest highway single-tube tunnel above 10,000 feet (3,048 m) in the world


## Evaluation

To evaluate the model, since we do not have an existing test set, we need to create an evaluation set consisting of questions and their correct answers, get the Q&A bot's answer for each question, compare it with the correct answer, and compute a score such as exact match or F1 (or any other evaluation metric).
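The evaluation described above can be sketched as follows, using exact match and token-level F1 in the style commonly used for SQuAD. This is a minimal illustration rather than an official scoring script: the helper names are my own, and the reference answers in `truths` are hypothetical hand-written examples, not an existing test set. In practice the predictions would come from calling `answer_question(model, tokenizer, question, context)` for each question.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))

def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # If either side is empty, score 1.0 only when both are empty
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(predictions, truths):
    """Average exact match and F1 over paired predictions and references."""
    n = len(truths)
    em = sum(exact_match(p, t) for p, t in zip(predictions, truths)) / n
    f1 = sum(f1_score(p, t) for p, t in zip(predictions, truths)) / n
    return {"exact_match": em, "f1": f1}

# Hypothetical reference answers, paired with answers the bot printed above
truths = ["9.02 km", "Himachal Pradesh"]
predictions = [" 9.02 km", " Himachal Pradesh, India"]
print(evaluate(predictions, truths))
```

Exact match is strict, so a correct but longer answer such as "Himachal Pradesh, India" scores zero on it, while F1 gives partial credit for the overlapping words. Reporting both gives a fuller picture.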

## References

I referred to the Hugging Face documentation for RoBERTa to load and run the model. The docstrings were written with ChatGPT by pasting in my functions, and I used Stack Overflow to fix a few bugs I ran into during development. I have not copied the entire code directly from any source.