Formatting our CSV data into the SQuAD-style structure expected by DistilBERT

In [1]:
import pandas as pd

# Load CSV data into a DataFrame
data = pd.read_csv('C:/Users/adith/Downloads/nlp testttt/C_Data.csv', encoding='latin1')

# Initialize an empty list to store formatted data
formatted_data = []

# Iterate through each row in the CSV data
for idx, row in data.iterrows():
    # Handle NaN values by replacing them with empty strings
    context = str(row['Context']) if not pd.isnull(row['Context']) else ''
    answer = str(row['Answer']) if not pd.isnull(row['Answer']) else ''

    # Create a unique ID for the question within each context
    question_id = f"{row['CID']}_{idx}"  # Using CID and index for unique question ID
    
    # Create a dictionary for each row in the required format
    formatted_row = {
        "context": context,
        "qas": [
            {
                "question": row['Question'],
                "id": question_id,  # Using the generated unique question ID
                "answers": [
                    {
                        "text": answer,
                        "answer_start": context.find(answer) if context and answer else -1  # Start position of the answer in the context
                    }
                ]
            }
        ],
        "CID": row['CID'],  # Optionally including CID in the formatted data
        "Category": row['Category']  # Optionally including Category in the formatted data
    }
    
    # Append the formatted row to the list
    formatted_data.append(formatted_row)

# Print one formatted row (index 18) to verify the structure
print(formatted_data[18])

{'context': 'The UNT Writing Center provides support for undergraduate and graduate writers across the disciplines. Our mission is to help students at all levels improve as writers. Here are some frequently asked questions about what we do and how to use our services. Everyone! We believe that all writers will benefit from sharing their work with a tutor. Whether you are an undergraduate student writing your first college paper or a graduate student working on a dissertation we can help. In addition to helping with coursework we can also help with resums cover letters and personal statements. We are excited to work with writers in the humanities and social sciences as well as writers in the physical and biological sciences. We also work with short and long writing projects at any stage. We have highly skilled undergraduate tutors from many different majors who were selected for their strong writing ability and their desire to help their peers. Tutors receive initial and ongoing trainin
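The formatting step above can also be written as a small helper that is easy to check on its own. The following is a minimal sketch, not the notebook's actual code: format_row, is_missing and has_valid_span are hypothetical names, and the row is assumed to be a plain dict with the same keys as the CSV columns (Context, Answer, Question, CID, Category). It additionally guards against a missing Question, which the loop above does not do.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def format_row(row, idx):
    """Build one SQuAD-style record from a dict-like row (hypothetical helper)."""
    context = "" if is_missing(row["Context"]) else str(row["Context"])
    answer = "" if is_missing(row["Answer"]) else str(row["Answer"])
    question = "" if is_missing(row["Question"]) else str(row["Question"])

    # str.find returns -1 when the answer is absent; the empty-string case is
    # guarded explicitly because "abc".find("") would return 0.
    answer_start = context.find(answer) if context and answer else -1

    return {
        "context": context,
        "qas": [{
            "question": question,
            "id": f"{row['CID']}_{idx}",
            "answers": [{"text": answer, "answer_start": answer_start}],
        }],
        "CID": row["CID"],
        "Category": row["Category"],
    }


def has_valid_span(record):
    """Check that answer_start really points at the answer text in the context."""
    answer = record["qas"][0]["answers"][0]
    start = answer["answer_start"]
    if start < 0:
        return False
    end = start + len(answer["text"])
    return record["context"][start:end] == answer["text"]
```

Assuming the DataFrame from the cell above, something like [format_row(row, idx) for idx, row in data.iterrows()] should produce the same list, and filtering with has_valid_span would flag rows whose answer text does not appear verbatim in the context.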

In [5]:
!pip install transformers




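Building the inference loop: no training happens here. We load a pretrained DistilBertForQuestionAnswering checkpoint (distilled on SQuAD) and use it to score each sentence of the selected Context against the user's question.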

In [8]:
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch
import nltk


# Load the tokenizer from the same checkpoint as the model; DistilBERT does not use token_type_ids
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

# Display the available CIDs and their respective categories
cid_categories = {row['CID']: row['Category'] for row in formatted_data}
print("Available CIDs and Categories:")
for cid, category in cid_categories.items():
    print(f"CID: {cid}, Category: {category}")

while True:
    # Ask user to input a CID
    selected_cid = input("Enter the CID (or type 'exit' to quit): ")

    if selected_cid.lower() == 'exit':
        print("Exiting...")
        break

    try:
        selected_cid = int(selected_cid)
    except ValueError:
        print("Please enter a valid CID or 'exit' to quit.")
        continue

    selected_context = None
    # Use a distinct loop variable so the `data` DataFrame is not overwritten
    for entry in formatted_data:
        if entry['CID'] == selected_cid:
            selected_context = entry['context']
            break

    if selected_context:
        # Ask user to input a question
        question_to_predict = input("Enter your question: ")

        # Tokenize the context into sentences
        sentences = nltk.sent_tokenize(selected_context)

        # Calculate sentence similarity with the question
        similarity_scores = []
        for sentence in sentences:
            encoding = tokenizer.encode_plus(
                question_to_predict, sentence, return_tensors="pt", max_length=512, truncation=True
            )
            input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

            with torch.no_grad():
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)

            # Use the highest start logit as a rough relevance score for this sentence
            # (a heuristic, not a true similarity measure; no answer span is extracted)
            similarity_score = torch.max(outputs.start_logits)
            similarity_scores.append((sentence, similarity_score.item()))

        # Sort sentences by similarity score
        sorted_sentences = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

        # Select top sentences based on relevance to the question
        selected_sentences = [sentence[0] for sentence in sorted_sentences[:3]]  # Adjust the number of sentences to select

        answers = " ".join(selected_sentences)

        print("\nQuestion:", question_to_predict)
        print("\nAnswer:")
        print(answers)
        print("\n----------------Ask Another Question----------------------\n")
    else:
        print("CID not found or no context available for the selected CID.")

Available CIDs and Categories:
CID: 1, Category: Policy
CID: 2, Category: Writing Center
CID: 3, Category: Integrity
CID: 4, Category: Legal
CID: 5, Category: Admissions
CID: 6, Category: Advising
CID: 7, Category: Navigation
CID: 8, Category: Affairs
CID: 9, Category: Future Students
CID: 10, Category: Dining
CID: 11, Category: Honors
CID: 12, Category: General
CID: 13, Category: Music Dept.
Enter the CID (or type 'exit' to quit): 9
Enter your question: is it mandatory to go to college

Question: is it mandatory to go to college

Answer:
There is no separate admissions process for incoming students and no portfolio is required to begin. Yes all newly accepted students must attend orientation. No there is no animation program at UNT in any college.

----------------Ask Another Question----------------------

Enter the CID (or type 'exit' to quit): 9
Enter your question:  is it mandatory to go to school

Question:  is it mandatory to go to school

Answer:
There is no separate admissions

KeyboardInterrupt: Interrupted by user
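
The loop above ranks sentences by their maximum start logit and joins the top three in score order, which is why the answer for CID 9 reads as unrelated sentences. A possible refinement, sketched below under assumptions, is to score each sentence by its best valid answer span (start logit plus end logit, with start <= end) and to print the selected sentences in their original order. Both best_span_score and top_k_in_order are hypothetical helpers that work on plain Python lists; max_answer_len=30 is an arbitrary choice, and the sketch does not mask out question or special tokens.

```python
def best_span_score(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) for the best span with start <= end.

    Taking argmax of the start and end logits independently can yield an
    end position before the start; searching over pairs avoids that.
    """
    best = (float("-inf"), 0, 0)
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best[0]:
                best = (score, start, end)
    return best


def top_k_in_order(scored_sentences, k=3):
    """Pick the k highest-scoring (sentence, score) pairs, then restore document order."""
    indexed = list(enumerate(scored_sentences))
    top = sorted(indexed, key=lambda item: item[1][1], reverse=True)[:k]
    top_by_position = sorted(top, key=lambda item: item[0])
    return [sentence for _, (sentence, _score) in top_by_position]
```

Inside the notebook loop, the logits could be converted with outputs.start_logits[0].tolist() and outputs.end_logits[0].tolist() before calling best_span_score, and top_k_in_order(similarity_scores, k=3) could replace the sort-and-slice step. Whether this improves the answers on this dataset has not been measured here.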

In [7]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adith\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True