Install the required libraries

In [3]:
!pip install pymupdf python-docx openai==0.28 numpy scikit-learn python-dotenv




In [4]:
import os
import openai
import numpy as np
import docx
import fitz
from sklearn.metrics.pairwise import cosine_similarity


Set the OpenAI API key

In [5]:
from dotenv import load_dotenv
load_dotenv()

openai.api_key = os.getenv("OPENAI_API_KEY")

# Treat a missing or empty key as an error
if not openai.api_key:
    raise ValueError("OpenAI API Key not found. Please set it in the .env file.")


Functions to load text

In [7]:
def load_text_from_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

def load_text_from_docx(file_path):
    doc = docx.Document(file_path)
    text = [para.text for para in doc.paragraphs]
    return '\n'.join(text)

def load_text_from_pdf(file_path):
    # Open the PDF in a context manager so the file handle is released afterwards
    with fitz.open(file_path) as pdf:
        text = [page.get_text() for page in pdf]
    return '\n'.join(text)

# Function to clean and preprocess text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.replace('\n', ' ')  # Replace newlines with spaces
    return text
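
The embedding model has a maximum input length, so a long document may need to be split before it is vectorized. The following is a minimal sketch of one way to do that, assuming word-based chunks with a fixed overlap; `chunk_text`, `chunk_size`, and `overlap` are hypothetical names that do not appear in the original notebook.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of at most chunk_size words.

    Hypothetical helper: word counts only approximate token counts.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk could then be passed through `preprocess_text` and embedded separately instead of embedding the whole document at once.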

Functions to vectorize text

In [8]:
# Function to vectorize text
def vectorize_text(text):
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    embeddings = response['data'][0]['embedding']
    return np.array(embeddings)

# Function to search text using cosine similarity
def search_text(query, text):
    query_vector = vectorize_text(query)
    text_vector = vectorize_text(text)
    similarity = cosine_similarity([query_vector], [text_vector])[0][0]
    return similarity
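
`search_text` compares the query with a single vector for the whole document. If the document were split into several pieces, the same cosine-similarity idea could be used to rank them. The sketch below assumes the vectors have already been computed (for example with `vectorize_text`) and uses only the standard library so it can be run without an API key; `cosine_sim` and `rank_chunks` are hypothetical names.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)

def rank_chunks(query_vector, chunk_vectors, top_k=3):
    """Return (index, score) pairs for the top_k most similar chunks."""
    scores = [cosine_sim(query_vector, v) for v in chunk_vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]
```

The returned indices point back into the list of chunks, so only the best-matching pieces need to be sent to the generation step.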


Function to generate text using the OpenAI ChatCompletion API

In [9]:
def generate_text(retrieved_text, user_input):
    # Supply the retrieved text as context inside the user message. An
    # "assistant" message placed after the question would be read by the
    # model as its own earlier reply rather than as source material.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an AI specialized in answering questions based on provided text data. Limit your response to the context of the provided text."},
            {"role": "user", "content": f"Context:\n{retrieved_text}\n\nQuestion: {user_input}"},
        ],
        max_tokens=200,
        temperature=0.7
    )
    return response.choices[0]['message']['content'].strip()


Main function to handle the RAG application

In [10]:
def rag_application(file_path, query):
    ext = os.path.splitext(file_path)[1].lower()
    if ext == ".txt":
        text = load_text_from_txt(file_path)
    elif ext == ".docx":
        text = load_text_from_docx(file_path)
    elif ext == ".pdf":
        text = load_text_from_pdf(file_path)
    else:
        raise ValueError("Unsupported file type")

    # Preprocess the loaded text
    text = preprocess_text(text)

    similarity = search_text(query, text)

    if similarity > 0.5:
        generated_text = generate_text(text, query)
        return generated_text
    else:
        return "No relevant information found in the text."
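
The extension checks in `rag_application` can also be written as a lookup table, which keeps the list of supported formats in one place. This is only a sketch of that alternative; `pick_loader` and the `loaders` mapping are hypothetical and not part of the original notebook.

```python
import os

def pick_loader(file_path, loaders):
    """Return the loader registered for the file's extension.

    loaders maps a lowercase extension such as ".txt" to a function
    that takes a file path and returns the file's text.
    """
    ext = os.path.splitext(file_path)[1].lower()
    try:
        return loaders[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext!r}") from None

# In this notebook the mapping could be built from the loaders defined earlier:
# LOADERS = {
#     ".txt": load_text_from_txt,
#     ".docx": load_text_from_docx,
#     ".pdf": load_text_from_pdf,
# }
# text = pick_loader(file_path, LOADERS)(file_path)
```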

In [11]:

# Convenience wrapper around rag_application; reads the global file_path set below
def rag_generate(query):
    return rag_application(file_path, query)

# Set the file path (update this path to your file's location)
file_path = "C:\\Users\\pc\\OneDrive\\Bureau\\ragapprroject\\NLP models.pdf"

In [12]:
prompt = "What is the subject of this text?"

In [13]:
response = rag_generate(prompt)
response

'The subject of this text is a research report on pre-trained NLP models, focusing on their significance in advancing machine learning operations and software development life cycles. It explores various pre-trained models in NLP, such as BERT, GPT-3, ELMo, Transformer-XL, and RoBERTa, along with innovative approaches like Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). The report delves into the architecture, training processes, challenges, and improvements associated with these pre-trained models, highlighting their transformative potential in shaping the future of AI applications.'

In [17]:
prompts = [
    "What is NLP?",
    "How does Pre-training work?",
    "What are Challenges with large language models (LLMs)?",
    "What is Retrieval augmented generation (RAG)?",
]

for prompt in prompts:
    response = rag_generate(prompt)
    print(response)
    print("\n")

NLP stands for Natural Language Processing. It involves the use of pre-trained models and language models to enhance the efficiency and effectiveness of tasks related to understanding and processing human language. Pre-trained models are deep learning models that have been trained on large amounts of data before being fine-tuned for specific tasks in NLP, such as language translation, sentiment analysis, and text summarization. Some of the best pre-trained models in NLP include BERT (Bidirectional Encoder Representations from Transformers), GPT-3 (Generative Pretrained Transformer 3), ELMo (Embeddings from Language Models), Transformer-XL, and RoBERTa (Robustly Optimized BERT). These models are used for various NLP tasks due to their superior performance and resource-saving capabilities.


Pre-training in the context of the provided text refers to training deep learning models on extensive datasets before fine-tuning them for specific tasks. Pre-training allows these models to learn pa