### Data Extraction
The next step is to extract the content of these files and convert it into a structured text format for further processing.

#### Extracting Text from PDFs
To work with PDFs (e.g., resumes, cover letters), we need a library that can extract their text content. One common choice is PyMuPDF, which is imported under the name fitz.

In [1]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    # Open the provided PDF file
    document = fitz.open(pdf_path)
    text = ''
    
    # Iterate over each page in the PDF
    for page in document:
        # Extract text from the page and add it to the overall text
        text += page.get_text()

    document.close()
    return text

# Specify the path to your PDF file
pdf_path = 'resume.pdf'
extracted_text = extract_text_from_pdf(pdf_path)
#print(extracted_text)


#### Extracting Text from Word Documents
For .docx files, you can use the python-docx library to extract text.

In [2]:
from docx import Document

def extract_text_from_docx(docx_path):
    doc = Document(docx_path)
    text = ""
    for para in doc.paragraphs:
        text += para.text + "\n"
    return text

# Example usage
#docx_text = extract_text_from_docx("document.docx")
#print(docx_text)


#### Handling Plain Text Files
For .txt files, you can use standard Python file handling.

In [3]:
def extract_text_from_txt(txt_path):
    with open(txt_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Example usage
#txt_text = extract_text_from_txt("notes.txt")
#print(txt_text)
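
With one extractor per file type, it can be convenient to pick the right one from the file extension. The following is a minimal sketch of that idea, not part of the original notebook: `extract_text`, `read_txt`, and the `EXTRACTORS` registry are hypothetical names, and only the `.txt` handler is wired in so that the example runs without third-party libraries.

```python
from pathlib import Path

def read_txt(path):
    # Same logic as extract_text_from_txt above
    with open(path, 'r', encoding='utf-8') as file:
        return file.read()

# Hypothetical registry mapping a lowercase extension to an extractor.
# In the notebook, '.pdf' and '.docx' could map to extract_text_from_pdf
# and extract_text_from_docx.
EXTRACTORS = {'.txt': read_txt}

def extract_text(path, extractors=EXTRACTORS):
    # Choose the extractor from the file extension, ignoring case
    suffix = Path(path).suffix.lower()
    try:
        extractor = extractors[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return extractor(path)
```

Raising an error for unknown extensions, rather than returning an empty string, makes it obvious when a file was skipped.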


#### Extracting Web Data
If we want to include content from personal websites, we can use web scraping tools like BeautifulSoup for static pages or Selenium for dynamic ones.

In [4]:
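import requests
from bs4 import BeautifulSoup

def extract_text_from_website(url):
    # Fail fast on slow servers and HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    # Drop script and style elements so that only visible text remains
    for tag in soup(['script', 'style']):
        tag.decompose()
    return soup.get_text(separator=' ', strip=True)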

# Example usage
web_text = extract_text_from_website("https://medium.com/@spaw.co/best-websites-to-practice-web-scraping-9df5d4df4d1")
#print(web_text)


### Processing the Ingested Data
#### Text Cleaning
The raw text you've extracted contains noise, such as inconsistent line breaks, special characters, and repeated segments, so the first step is to clean it. The function below takes a deliberately aggressive approach:

- Normalize case: Convert all text to lowercase.
- Remove special characters: Replace every run of non-word characters, punctuation included, with a single space.
- Handle line breaks and extra spaces: Collapse line breaks and repeated whitespace into single spaces.

Because punctuation is stripped, sentence boundaries are not preserved in the cleaned text.

In [5]:
import re

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove special characters and multiple spaces
    text = re.sub(r'\W+', ' ', text)  # Replace non-word characters with a space
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    
    return text.strip()

# Use the previously extracted text
preprocessed_text = preprocess_text(extracted_text)
#print(preprocessed_text)
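
The `preprocess_text` function above removes case and punctuation, which discards sentence boundaries. If those matter for retrieval, a lighter cleaning pass may be worth trying. The sketch below is an assumption on our part, not the notebook's method: `clean_keep_punctuation` is a hypothetical helper that only rejoins words hyphenated across line breaks and collapses whitespace.

```python
import re

def clean_keep_punctuation(text):
    # Rejoin words hyphenated across a line break, e.g. "analy-\nsis" -> "analysis"
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Collapse every run of whitespace (spaces, tabs, line breaks) into one space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = "Led data analy-\nsis projects.\n\nBuilt   ML models, e.g. for churn."
print(clean_keep_punctuation(sample))
# Led data analysis projects. Built ML models, e.g. for churn.
```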


### Vectorizing Text Chunks with Sentence Transformers
This code utilizes the SentenceTransformer model to generate vector embeddings for chunks of text, which can be used for efficient retrieval or further AI processing.

- Imports the SentenceTransformer model and defines the vectorize_text function to embed text chunks using the pre-trained all-MiniLM-L6-v2 model.
- Splits the pre-processed text into smaller chunks (200 characters each).
- Passes the text chunks to the vectorize_text function, which encodes each chunk into vector embeddings (embeddings_1), facilitating downstream tasks like retrieval or RAG-based generation.

In [6]:
from sentence_transformers import SentenceTransformer

def vectorize_text(chunks):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(chunks)
    return embeddings

# Split the pre-processed text into fixed-size chunks of 200 characters
text_chunks = [preprocessed_text[i:i+200] for i in range(0, len(preprocessed_text), 200)]
embeddings_1 = vectorize_text(text_chunks)
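
Fixed 200-character slices can cut a phrase in half at a chunk boundary. One common mitigation is to let consecutive chunks overlap. The following is a minimal sketch under that assumption; `chunk_text` and its `overlap` parameter are hypothetical and do not appear in the original notebook.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Each chunk starts (chunk_size - overlap) characters after the previous one
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

demo = "".join(str(i % 10) for i in range(500))
demo_chunks = chunk_text(demo)
print(len(demo_chunks))  # 3
```

The last 50 characters of each chunk are repeated at the start of the next one, so a phrase that straddles a boundary still appears whole in at least one chunk.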




### Indexing Personal Data into ChromaDB
This code snippet creates a collection in ChromaDB and indexes personal data by adding text chunks and their corresponding embeddings into the collection for efficient retrieval.

- Initializes the ChromaDB client and creates a collection named "personal_data".
- Defines the index_text function to iterate through text chunks and embeddings, adding them to the collection.
- Each text chunk is stored with its corresponding embedding, metadata (such as id and source), and a unique ID.
- Calls the index_text function to index the pre-processed text chunks (text_chunks) and their embeddings (embeddings_1) into the collection.

In [13]:
import chromadb

client = chromadb.Client()
collection = client.create_collection("personal_data", get_or_create=True)

def index_text(collection, text_chunks, embeddings):
    for i, chunk in enumerate(text_chunks):
        collection.add(
            documents=[chunk],
            embeddings=[embeddings[i]],
            metadatas=[{"id": str(i), "source": "resume"}],
            ids=[str(i)]
        )

index_text(collection, text_chunks, embeddings_1)
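
`collection.add` accepts lists, so the chunks can also be indexed in a single call rather than one call per chunk. The sketch below only prepares the arguments for such a call. `build_add_payload` is a hypothetical helper, and the `"<source>-<index>"` ID scheme is an assumption chosen so that IDs from different sources would not collide.

```python
def build_add_payload(text_chunks, embeddings, source):
    # Build the keyword arguments for one batched collection.add(...) call
    if len(text_chunks) != len(embeddings):
        raise ValueError("Each chunk needs exactly one embedding")
    indices = range(len(text_chunks))
    return {
        "documents": list(text_chunks),
        # NumPy rows expose tolist(); plain sequences are copied into lists
        "embeddings": [e.tolist() if hasattr(e, "tolist") else list(e) for e in embeddings],
        "metadatas": [{"source": source, "chunk_index": i} for i in indices],
        "ids": [f"{source}-{i}" for i in indices],
    }

payload = build_add_payload(["chunk a", "chunk b"], [[0.1, 0.2], [0.3, 0.4]], "resume")
print(payload["ids"])  # ['resume-0', 'resume-1']
```

In the notebook this would be used as `collection.add(**payload)`. Chroma collections also provide `upsert` with the same arguments, which may be preferable when the cell is run more than once against the same collection.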


Insert of existing embedding ID: 0
Add of existing embedding ID: 0
[the same two messages repeat for IDs 1 through 14; output truncated]

### Question Embedding and Retrieval from ChromaDB

- Defines the model for embedding text using SentenceTransformer('all-MiniLM-L6-v2').
- Takes a user's question and vectorizes it into an embedding using the pre-trained model.
- Queries the ChromaDB collection to find the top 5 closest matches to the question's embedding.
- Returns the retrieved documents (the five chunks closest to the question), which can then be used to answer the user's question.

In [14]:
# Define the model for vectorization
model = SentenceTransformer('all-MiniLM-L6-v2')  # Load the embedding model

def retrieve_relevant_data(question, collection, model):
    # Vectorize the input question
    question_embedding = model.encode([question])[0]
    # Retrieve the five closest matches from ChromaDB
    results = collection.query(query_embeddings=[question_embedding], n_results=5)
    return results['documents'][0]
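
To make "closest matches" concrete, here is a minimal pure-Python sketch of ranking documents by vector similarity. It is an illustration only: ChromaDB uses an approximate nearest-neighbour index with a configurable distance metric, whereas this example does an exact scan with cosine similarity. The function names and the toy two-dimensional vectors are made up.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 0.0 if either has zero length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, documents, k=5):
    # Score every document against the query and keep the k best
    scored = [(cosine_similarity(query_vec, v), doc) for v, doc in zip(doc_vecs, documents)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

docs = ["python", "cooking", "pandas"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], vecs, docs, k=2))  # ['python', 'pandas']
```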


### Setting Up ChromaDB as a Retriever with Hugging Face Embeddings
This code initializes ChromaDB as a vector store, using embeddings from a Hugging Face model to retrieve relevant data based on vector similarity.

- Loads the HuggingFaceEmbeddings model (all-MiniLM-L6-v2) for embedding text into vectors.
- Sets up ChromaDB as a vector store with the collection "personal_data" and associates it with the embeddings model for future queries.
- Converts the ChromaDB collection into a retriever by enabling similarity-based search (k=5), allowing the retrieval of the top 5 most relevant documents based on query embeddings.

In [15]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

# Initialize embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

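# Initialize ChromaDB vector store with collection name and embeddings.
# Passing the existing client makes sure the store reads the same
# "personal_data" collection that was indexed above.
vector_store = Chroma(
    client=client,
    collection_name="personal_data",
    embedding_function=embeddings
)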

# Convert ChromaDB collection to a retriever
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})




### Initializing OpenAI's GPT-3.5-Turbo for Language Model Integration
This code snippet initializes OpenAI's GPT-3.5-turbo model through the LangChain framework, enabling its use for generating responses in a Retrieval-Augmented Generation (RAG) system.

- Imports ChatOpenAI from langchain.chat_models to interact with OpenAI's chat-based language models.
- Initializes the GPT-3.5-turbo model by providing the OpenAI API key and model name (gpt-3.5-turbo), setting up the model for generating AI-driven responses.

In [18]:
import os
from langchain.chat_models import ChatOpenAI

# Initialize the chat model; the API key is read from the OPENAI_API_KEY
# environment variable so that it never appears in the notebook
llm = ChatOpenAI(openai_api_key=os.environ["OPENAI_API_KEY"], model="gpt-3.5-turbo")




### Setting Up a Retrieval-Augmented Generation (RAG) Chain for Question Answering
This code creates a RetrievalQA chain that combines a language model with a retriever, enabling the system to answer questions based on relevant documents from ChromaDB.

- Initializes a RetrievalQA chain using the GPT-3.5-turbo model (llm) and the Chroma retriever (retriever) to process queries.
- Enables the option to return the source documents used to generate the answer, providing transparency and context for the AI-generated response.
- Asks a question ("What is my experience in Data Science?") using the chain, which retrieves relevant documents and generates a response.
- Prints the generated answer, while optionally printing the retrieved documents for reference.

In [19]:
from langchain.chains import RetrievalQA

# Set up the retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True  # Optional: to return the documents retrieved from Chroma
)

# Ask a question using the chain
question = "What is my experience in Data Science?"
result = qa_chain.invoke({"query": question})

# Print the generated answer and the source documents
print("Generated Answer:", result['result'])
#print("Retrieved Documents:", result['source_documents'])




Generated Answer: Based on the provided context, you have experience as a physicist, data scientist, and AI/ML specialist in both academic and corporate settings. You are skilled in data analysis, machine learning, and problem-solving, with a focus on applying AI/ML solutions for business insights and research. Your experience includes designing and delivering comprehensive courses in data analytics and data science as an online educator. You also have strong communication and analytical skills with a proven ability to work independently.
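
By default, `RetrievalQA.from_chain_type` uses the "stuff" chain type, which places all retrieved documents into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The `build_prompt` function and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_prompt(question, documents):
    # Concatenate the retrieved chunks into one context block
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

retrieved = ["data scientist with ml experience", "online educator in data analytics"]
print(build_prompt("What is my experience in Data Science?", retrieved))
```

Because every retrieved chunk is included, the number of chunks (`k`) and the chunk size together determine how much of the model's context window the prompt consumes.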


### Creating a Conversational Retrieval Chain with Memory for Follow-up Questions
This code sets up a Conversational Retrieval Chain using OpenAI's GPT-3.5-turbo, a ChromaDB retriever, and memory to handle follow-up questions while maintaining context.

- Conversation Buffer Memory - Initializes ConversationBufferMemory to store and manage the conversation history, allowing the AI to remember previous interactions.
- Conversational Retrieval Chain - Uses from_llm to create a chain that retrieves relevant data from ChromaDB and generates responses while considering the conversation history.
- First Question - Queries the user's experience in web development, retrieving relevant information and generating a response based on the context.
- Follow-up Question - Asks a second question ("Can you explain my most recent project?"), where the chain uses memory to maintain the conversation's context and provide a coherent, contextually appropriate answer.

Both questions and responses are printed, showcasing the conversational capabilities of the chain.

In [20]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI


# Initialize memory to store conversation context
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Use the from_llm method to create a conversational retrieval chain
conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,  # ChromaDB retriever
    memory=memory
)

# Ask the first question
question1 = "What is my experience in web development?"
response1 = conversational_chain.run(question1)
print(f"Q: {question1}\nA: {response1}")

# Follow-up question
question2 = "Can you explain my most recent project?"
response2 = conversational_chain.run(question2)
print(f"Q: {question2}\nA: {response2}")




Q: What is my experience in web development?
A: I don't have enough information to determine your experience in web development based on the context provided.
Q: Can you explain my most recent project?
A: I don't have access to the specific details of the user's most recent project in web development based on the provided context.
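
A conversational retrieval chain typically rewrites a follow-up question into a standalone question, using the chat history, before it queries the retriever. The sketch below shows how such a rewriting prompt could be assembled. The function names and the prompt wording are hypothetical; LangChain's own template differs.

```python
def format_history(history):
    # history is a list of (question, answer) pairs, oldest first
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)

def build_condense_prompt(history, follow_up):
    # Ask the model to fold the conversation context into the new question
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{format_history(history)}\n\n"
        f"Follow-up question: {follow_up}\nStandalone question:"
    )

history = [("What is my experience in web development?", "Not enough information.")]
print(build_condense_prompt(history, "Can you explain my most recent project?"))
```

This step would explain why the second answer above mentions web development: the follow-up was interpreted in light of the first question.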


### Setting Up a Gradio Interface for Conversational AI with RAG
This code creates a Gradio interface that allows users to ask questions, retrieves relevant information using a conversational retrieval chain, and generates AI responses.

- Answer Function - Defines answer_question to handle user input and use the conversational chain to retrieve relevant data and generate a response.
- Gradio Interface Setup - Uses Gradio to create an interactive web interface where users can ask questions in a text box (inputs="text") and see the AI-generated response in another text box (outputs="text").
- Customization - Provides a title ("Conversational AI with RAG") and a description that invites users to ask questions about the author's experience.
- Launch - Calls gr_interface.launch() to start the Gradio web interface, enabling users to interact with the AI system directly via a browser.

In [21]:
import gradio as gr

# Define a function to process user input and return AI response
def answer_question(question):
    response = conversational_chain.run(question)
    return response

# Set up the Gradio interface
gr_interface = gr.Interface(
    fn=answer_question,  # The function that processes user input
    inputs="text",  # Input type is a simple text box
    outputs="text",  # Output is a text box showing the AI's response
    title="Conversational AI with RAG",
    description="Ask me anything about my experience!"
)

# Launch the interface
gr_interface.launch()

Running on local URL:  http://127.0.0.1:7864

To create a public link, set `share=True` in `launch()`.
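
In an interactive interface, empty submissions and API failures are likely to occur. A small wrapper can turn both into readable messages instead of an error in the browser. This is a minimal sketch: `make_safe_answer` is a hypothetical helper, and `fake_chain` stands in for `conversational_chain.run` so that the example runs on its own.

```python
def make_safe_answer(chain_fn):
    # Wrap a question-answering callable with input and error handling
    def safe_answer(question):
        question = (question or "").strip()
        if not question:
            return "Please enter a question."
        try:
            return chain_fn(question)
        except Exception as exc:
            # Show a short message rather than a stack trace
            return f"Sorry, something went wrong: {exc}"
    return safe_answer

def fake_chain(question):
    # Stand-in for conversational_chain.run
    return f"You asked: {question}"

safe_answer = make_safe_answer(fake_chain)
print(safe_answer("  What is my experience?  "))  # You asked: What is my experience?
print(safe_answer(""))                            # Please enter a question.
```

In the notebook, `make_safe_answer(conversational_chain.run)` could be passed as `fn` to `gr.Interface` in place of `answer_question`.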


