<a href="https://colab.research.google.com/github/dianakhutorna/myrepo/blob/main/PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone Project: Question & Answering with NLP

## Project Tasks

### **Main tasks:**

*   Use classic QA methods: RoBERTa, DistilBERT
*   Use a self-hosted LLM from Hugging Face: LLaMA-based models + (optional) fine-tuning
*   Use GPT and Prompt engineering: strategies for question formulation and answer extraction

### **Extra tasks:**

*   Text summarization for more efficient interaction
*   Information Retrieval for Multi-Select Questions
*   Detection and Handling of Missing Responses

## Code

### **1. Generating sample data**

To begin, we import the necessary packages to work on the project.

In [1]:
import json
import pandas as pd
from transformers import pipeline  # Hugging Face pipeline

The first step is to merge all five questionnaire files into a single file called "questionnaire.json". We upload this file to the "Files" folder of our project so that it can be loaded as the data for further work.
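The merge step is not shown in the notebook. As a minimal sketch, assuming each of the five source files holds a JSON list of question objects, it could look like the following. The function name `merge_questionnaires` and any input file names are hypothetical and not part of the original project.

```python
import json


def merge_questionnaires(paths, out_path="questionnaire.json"):
    """Concatenate several JSON files into one JSON list and write it to out_path."""
    merged = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            part = json.load(f)
        # Assumption: every input file contains a JSON list of question objects.
        if not isinstance(part, list):
            raise ValueError(f"{path} does not contain a JSON list")
        merged.extend(part)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return merged
```

Calling `merge_questionnaires([...])` with the five file paths would write the combined list to `questionnaire.json`, which can then be uploaded to Colab.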

In [4]:
!ls # check the file in the "Files" directory

questionnaire.json  sample_data


In [13]:
# open file
with open('/content/questionnaire.json', 'r') as f:
    questionnaires = json.load(f)

# show data
print(questionnaires)

[{'id': 'aa2d8cdd-0758-4035-b0b6-ca18e2f380d8', 'type': 'SINGLE_SELECT', 'question': 'Data processing consent', 'options': [{'id': 'ee0437c0-6335-4b88-8bc5-d4eb8e2c68bf', 'option': 'Yes'}, {'id': 'd357ab84-929f-440a-b9ad-42ff36402a53', 'option': 'No'}]}, {'id': '12e1ed1d-edaa-4e93-8645-de3850e998f9', 'type': 'SINGLE_SELECT', 'question': 'Customer group', 'options': [{'id': '53cc44fa-397b-40bf-8af8-3f629818c93b', 'option': 'End User'}, {'id': 'bab850d1-2de8-4cae-9f87-2c86d41615e8', 'option': 'Wholesaler, Distributor'}, {'id': '31ffae1e-f742-4a0b-b279-eb393483f075', 'option': 'Consultant, Planner, Architect'}, {'id': 'ed56f777-a7cd-48db-9d36-e2579c57e361', 'option': 'R&D'}]}, {'id': '625012ae-9192-4cf6-a73d-55e1813d6014', 'type': 'MULTI_SELECT', 'question': 'Products interested in', 'options': [{'id': '4ccbe8ae-ce37-4822-830a-542aa0475d30', 'option': 'MY-SYSTEM'}, {'id': '58fdea0d-46db-41ef-a089-825f6ebfbb99', 'option': 'Notion'}, {'id': '13906f18-4e39-4d39-8da9-b5e49d9e4cc5', 'option': 
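The raw printout is hard to read. One possible way to inspect the structure is to flatten it into a pandas table with one row per answer option. This is only a sketch: the helper name `questionnaire_to_frame` is hypothetical, and the sample below reuses just the first question from the output above.

```python
import pandas as pd


def questionnaire_to_frame(questionnaires):
    """Return a DataFrame with one row per (question, option) pair."""
    rows = []
    for q in questionnaires:
        # Assumption: questions without an "options" key simply yield no rows.
        for opt in q.get("options", []):
            rows.append({
                "question_id": q["id"],
                "type": q["type"],
                "question": q["question"],
                "option_id": opt["id"],
                "option": opt["option"],
            })
    return pd.DataFrame(rows)


# First question copied from the notebook output above
sample = [{
    "id": "aa2d8cdd-0758-4035-b0b6-ca18e2f380d8",
    "type": "SINGLE_SELECT",
    "question": "Data processing consent",
    "options": [
        {"id": "ee0437c0-6335-4b88-8bc5-d4eb8e2c68bf", "option": "Yes"},
        {"id": "d357ab84-929f-440a-b9ad-42ff36402a53", "option": "No"},
    ],
}]

df = questionnaire_to_frame(sample)
print(df[["question", "type", "option"]])
```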

We can use Hugging Face or OpenAI to generate texts. Let's use the text-generation pipeline with GPT-2.

In [22]:
# Initialize the text-generation model
generator = pipeline("text-generation", model="gpt2")

# Generate a free-text answer for one question and its options
def generate_text(question, options):
    prompt = f"Question: {question}\nOptions: {', '.join(options)}\nAnswer:"
    # Use max_new_tokens instead of max_length
    response = generator(prompt, max_new_tokens=100, num_return_sequences=1, temperature=0.7, top_p=0.9)
    return response[0]['generated_text']

# Example: generate data for all questionnaire entries
generated_data = []
for q in questionnaires:
    options = [opt['option'] for opt in q['options']]
    generated_answer = generate_text(q['question'], options)
    generated_data.append({
        "question_id": q['id'],
        "question": q['question'],
        "generated_answer": generated_answer
    })

# Save the generated data as JSON
with open('generated_data.json', 'w') as outfile:
    json.dump(generated_data, outfile)



Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [23]:
with open('/content/generated_data.json', 'r') as infile:
    generated_data = json.load(infile)
    print(generated_data)

[{'question_id': 'aa2d8cdd-0758-4035-b0b6-ca18e2f380d8', 'question': 'Data processing consent', 'generated_answer': 'Question: Data processing consent\nOptions: Yes, No\nAnswer:\nData processing consent is a form of consent that requires that the parties agree to share information with each other and that is intended to provide a safe and secure environment for the processing of personal information.\nData processing consent is not a legal obligation of any jurisdiction.\nData processing consent can only be obtained by the parties involved.\nInformation about the data processing consent is available in the Privacy Policy.\nData processing consent is not required to be signed by the parties involved.\nData processing consent is a'}, {'question_id': '12e1ed1d-edaa-4e93-8645-de3850e998f9', 'question': 'Customer group', 'generated_answer': 'Question: Customer group\nOptions: End User, Wholesaler, Distributor, Consultant, Planner, Architect, R&D\nAnswer: Customer group\nOptions: End User, W
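The echoed prompt in this output shows a side effect of joining options with `', '`: the option "Wholesaler, Distributor" becomes indistinguishable from two separate options. A possible alternative, sketched here under the assumption that a numbered list is acceptable as a prompt format, puts each option on its own line. The function name `build_prompt` is hypothetical and is not the notebook's `generate_text`.

```python
def build_prompt(question, options):
    """Build a prompt that lists each option on its own numbered line."""
    lines = [f"Question: {question}", "Options:"]
    # Numbering keeps options that contain commas intact and unambiguous.
    lines += [f"{i}. {opt}" for i, opt in enumerate(options, start=1)]
    lines.append("Answer:")
    return "\n".join(lines)


# Options taken from the "Customer group" question in the questionnaire
prompt = build_prompt(
    "Customer group",
    ["End User", "Wholesaler, Distributor", "Consultant, Planner, Architect", "R&D"],
)
print(prompt)
```

Whether GPT-2 produces better answers with this layout is not tested here; the sketch only removes the ambiguity in the prompt text itself.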

In [24]:
for item in generated_data[:5]:
    print(f"Question: {item['question']}")
    print(f"Generated Answer: {item['generated_answer']}")
    print("-" * 50)

Question: Data processing consent
Generated Answer: Question: Data processing consent
Options: Yes, No
Answer:
Data processing consent is a form of consent that requires that the parties agree to share information with each other and that is intended to provide a safe and secure environment for the processing of personal information.
Data processing consent is not a legal obligation of any jurisdiction.
Data processing consent can only be obtained by the parties involved.
Information about the data processing consent is available in the Privacy Policy.
Data processing consent is not required to be signed by the parties involved.
Data processing consent is a
--------------------------------------------------
Question: Customer group
Generated Answer: Question: Customer group
Options: End User, Wholesaler, Distributor, Consultant, Planner, Architect, R&D
Answer: Customer group
Options: End User, Wholesaler, Distributor, Consultant, Planner, Architect, R&D
Options: End User, Wholesaler, D
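As the printout shows, the generated text repeats the prompt and does not necessarily name any of the options. A rough post-processing sketch, assuming the prompt format used in `generate_text`, strips the echoed prompt and checks which options occur as whole words in the remainder. An empty result can be treated as a missing response. The helper `extract_selected_options` is hypothetical, and whole-word matching is only a heuristic.

```python
import re


def extract_selected_options(generated_text, prompt, options):
    """Return (answer_text, selected_options) for one generated answer."""
    # Drop the echoed prompt if the model output starts with it.
    if generated_text.startswith(prompt):
        answer = generated_text[len(prompt):]
    else:
        answer = generated_text
    answer = answer.strip()
    selected = []
    for opt in options:
        # Match the option only as a whole word, so "No" does not match "not".
        pattern = r"(?<!\w)" + re.escape(opt) + r"(?!\w)"
        if re.search(pattern, answer, flags=re.IGNORECASE):
            selected.append(opt)
    return answer, selected


prompt = "Question: Data processing consent\nOptions: Yes, No\nAnswer:"
generated = prompt + "\nData processing consent is not a legal obligation of any jurisdiction."
answer, selected = extract_selected_options(generated, prompt, ["Yes", "No"])
is_missing = len(selected) == 0  # no option was named in the answer
print(answer, selected, is_missing)
```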

### **2. Applying a QA model**

### **3. Performance evaluation**