### Installations

In [None]:
# !pip install openai
# !pip install langchain
# !pip install langchain_community
# !pip install faiss-cpu
# !pip install python-dotenv
# !pip install langchain pypdf
# !pip install tiktoken

### Imports

In [None]:
import openai
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

### Model Loading

In [None]:
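import os
from dotenv import load_dotenv

# Load environment variables from the local .env file
load_dotenv(".env")

# Read the API key and fail early if it is missing
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to the .env file.")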

In [None]:
from langchain.llms import OpenAI

# Completion-style LLM wrapper; no model is specified, so the library default is used
llm = OpenAI(openai_api_key=OPENAI_API_KEY)


In [None]:
llm.invoke("Explain EDA in 2 lines")

'\n\nEDA is the process of analyzing data to get a better understanding of the data, identify patterns, and summarize the main characteristics of the data. It involves using statistical methods and visualizations to explore and interpret data in order to make informed decisions.'

### Upload a PDF

In [None]:
from google.colab import files

# upload a file
uploaded_files = files.upload()

# get uploaded file path
pdf_file_path = list(uploaded_files.keys())[0]

# check file extension
if not pdf_file_path.lower().endswith(".pdf"):
    raise ValueError("Please upload a PDF file only!")

print(f"Uploaded PDF file: {pdf_file_path}")

Saving Aroosh_Ahmad_AI_Engineer.pdf to Aroosh_Ahmad_AI_Engineer (2).pdf
Uploaded PDF file: Aroosh_Ahmad_AI_Engineer (2).pdf


In [None]:
from langchain.document_loaders import PyPDFLoader

def load_document(pdf_file_path):
    """Load a PDF into a list of Document objects, one per page."""
    loader = PyPDFLoader(pdf_file_path)
    documents = loader.load()

    # Report how many pages were loaded
    print(f"Number of pages loaded: {len(documents)}")

    # Optional: preview the first 500 characters of the first page
    print(documents[0].page_content[:500])
    return documents

In [None]:
documents = load_document(pdf_file_path)

Number of pages loaded: 2
A r o o s h  A h m a d
A I / M L  E n g i n e e r  |  L L M  &  N L P  S p e c i a l i s t  |  P r o d u c t i o n - R e a d y  S y s t e m s
a r o o s h a h m d a . d a t a @ g m a i l . c o m  |  + 9 2 - 3 1 9 - 4 0 4 0 0 6 7  |  g i t h u b . c o m / a r u s h a h m d  |  l i n k e d i n . c o m / i n / a r u s h a h m d  |
L a h o r e ,  P a k i s t a n .
S u m m a r y
 
A I  E n g i n e e r  w i t h  3 +  y e a r s  o f  e x p e r i e n c e  d e l i v e r i n g  p r o d u c t i o n - r e a 


### Splitting Text to Chunks

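**Note:**
I use RecursiveCharacterTextSplitter here so that each chunk keeps as much coherent context from the PDF as possible.

**Info:**
RecursiveCharacterTextSplitter tries a list of separators in order (paragraph breaks first, then line breaks, then spaces) and only falls back to a finer separator when a piece is still larger than the chunk size. This keeps paragraphs and sentences together where possible, and the chunk overlap repeats some text between neighbouring chunks so that less context is lost at chunk boundaries.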
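
To make the separator fallback concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration and not LangChain's implementation: the function name `recursive_split` and the sample text are hypothetical, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Simplified sketch: pack pieces split on the coarsest separator,
    recursing to finer separators only for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)       # close the open chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text with three paragraphs.
sample = "Para one line.\n\nPara two is a bit longer here.\n\nShort."
chunks = recursive_split(sample, chunk_size=20)
print(chunks)
```

The second paragraph is longer than 20 characters, so only that paragraph is split further on spaces, while the short paragraphs stay whole.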

In [None]:
# 500-character chunks, with 100 characters shared between neighbouring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

In [None]:
chunks[:2]

[Document(metadata={'producer': 'Canva', 'creator': 'Canva', 'creationdate': '2025-08-17T15:16:56+00:00', 'title': 'Aroosh_Ahmad_AI_Engineer - Latest', 'moddate': '2025-08-17T15:16:55+00:00', 'keywords': 'DAGtyjwAprs,BAFyE4SWrw4,0', 'author': 'Aroosh Ahmad', 'source': 'Aroosh_Ahmad_AI_Engineer.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='A r o o s h  A h m a d\nA I / M L  E n g i n e e r  |  L L M  &  N L P  S p e c i a l i s t  |  P r o d u c t i o n - R e a d y  S y s t e m s\na r o o s h a h m d a . d a t a @ g m a i l . c o m  |  + 9 2 - 3 1 9 - 4 0 4 0 0 6 7  |  g i t h u b . c o m / a r u s h a h m d  |  l i n k e d i n . c o m / i n / a r u s h a h m d  |\nL a h o r e ,  P a k i s t a n .\nS u m m a r y'),
 Document(metadata={'producer': 'Canva', 'creator': 'Canva', 'creationdate': '2025-08-17T15:16:56+00:00', 'title': 'Aroosh_Ahmad_AI_Engineer - Latest', 'moddate': '2025-08-17T15:16:55+00:00', 'keywords': 'DAGtyjwAprs,BAFyE4SWrw4,0', 'author': 'Aroosh Ah

### Vector Index/DB

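FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. Here it stores the chunk embeddings so that the chunks most similar to a question can be retrieved.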
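
As a rough mental model, similarity search over dense vectors can be sketched as an exhaustive nearest-neighbour lookup. The snippet below is only an illustration in plain Python, with hypothetical two-dimensional vectors and hypothetical names (`toy_index`, `search`); FAISS itself uses optimized index structures and real embeddings have far more dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(index, query, k=2):
    """Return the k (distance, key) pairs closest to the query vector."""
    scored = sorted((l2_distance(vec, query), key) for key, vec in index.items())
    return scored[:k]


# Hypothetical toy "embeddings" for three chunks.
toy_index = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.0, 1.0],
    "chunk_c": [0.9, 0.1],
}

results = search(toy_index, [1.0, 0.0], k=2)
for distance, key in results:
    print(f"{key}: {distance:.3f}")
```

A retriever does the same thing at scale: embed the question, find the closest chunk vectors, and pass the matching chunks to the LLM as context.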

In [None]:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings  # embedding classes for other providers are available too

# Embed every chunk and build the FAISS index from the resulting vectors
embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
db = FAISS.from_documents(documents=chunks, embedding=embeddings)

In [None]:
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate

CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(
    """
      Given the following conversation and a follow-up question, rephrase the follow-up question
      to be a standalone question.

      {chat_history}
      Follow-up input: {question}
      Standalone question:
    """
)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=db.as_retriever(),
    condense_question_prompt=CONDENSE_QUESTION_PROMPT,
    return_source_documents=True,
    verbose=False,
)
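
Because the chain is created with `return_source_documents=True`, each result should also carry the retrieved chunks under the `"source_documents"` key. The sketch below shows one way to inspect them. The helper `list_sources` is hypothetical, and `fake_result` is a stand-in with made-up content so that the snippet runs without an API call; with the real chain you would pass the dictionary returned by `qa.invoke(...)`.

```python
from types import SimpleNamespace


def list_sources(result, preview_chars=80):
    """Return (page, preview) pairs for the chunks attached to a chain result."""
    sources = []
    for doc in result.get("source_documents", []):
        page = doc.metadata.get("page")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        sources.append((page, preview))
    return sources


# Stand-in result with placeholder content (not real chain output).
fake_result = {
    "answer": "placeholder answer",
    "source_documents": [
        SimpleNamespace(metadata={"page": 0}, page_content="Summary\nAI Engineer"),
        SimpleNamespace(metadata={"page": 1}, page_content="Projects"),
    ],
}

sources = list_sources(fake_result)
for page, preview in sources:
    print(f"page {page}: {preview}")
```

Checking the sources is a quick way to see whether an answer is grounded in the right part of the PDF.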

In [None]:
chat_history = []
query = "Tell me about this person: what has he done, and what are his top skills?"
result = qa.invoke({"question": query, "chat_history": chat_history})

print(result["answer"])

 The person, Arush Ahmad, is an AI/ML engineer with 3+ years of experience delivering production-ready systems in NLP, computer vision, and LLM applications. He started as a Full-Stack Engineer before transitioning into AI/ML and has experience in reducing manual review by 60% and building a GPT-3 powered fitness assistant with Django REST. His top skills include PyTorch, Hugging Face, LangChain, RAG, FAISS, OpenCV, OCR, YOLOv5, Transformers, NLP, and CV. He also has experience with backend and APIs such as FastAPI, Django REST, React.js, Node.js, .NET Core/MVC, and tools like Docker, Azure ML, GCP Vertex AI, Git, and CI/CD. He is based in Lahore, Pakistan and his contact information can be found on his GitHub and LinkedIn profiles.


In [None]:
chat_history = []
query = "What projects were done, and in which domains?"
result = qa.invoke({"question": query, "chat_history": chat_history})

print(result["answer"])

 From the context, it appears that the individual has worked on various AI and ML projects in the domains of NLP, computer vision, and LLM applications. They have also contributed to projects involving YOLO-based disease/tower component detection and OCR preprocessing workflows. Additionally, they have also worked on 20+ AI projects in various areas such as classification, sentiment analysis, regression, and OCR APIs.


In [None]:
chat_history = []
query = "Can you tell me more about the OCR project work: which models were used and what impact was created? Only cover OCR."
result = qa.invoke({"question": query, "chat_history": chat_history})

print(result["answer"])

 From the information provided, it appears that the individual worked on OCR projects while interning at I C R L Labs, K I C S, U E T, and also as a freelance developer on Fiverr. They mention building and deploying OCR engines, specifically a Urdu OCR engine using a CNN-LSTM model with 98% accuracy. As an AI/ML engineer and NLP specialist, they also mention leading a team in achieving 98% accuracy for OCR engines for Urdu, Arabic, and Farsi while working as an AI research officer at the Center of Language Engineering in Lahore. They also mention reducing CER (Character Error Rate) from 3.4% to 2.3%. However, there is no mention of the specific models used in these projects.
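
Each query above starts with an empty `chat_history`, so the chain never sees earlier turns and the condense-question step has nothing to rewrite. A minimal sketch of carrying the history forward is shown below. The helper `ask` and the `EchoChain` class are hypothetical: `EchoChain` only stands in for `qa` so that the snippet runs without an API key, and the history is stored as `(question, answer)` tuples, one of the formats the chain accepts.

```python
def ask(chain, question, chat_history):
    """Query the chain and record the turn so follow-ups can refer back to it."""
    result = chain.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]


class EchoChain:
    """Hypothetical stand-in for `qa`; it reports how many prior turns it received."""

    def invoke(self, inputs):
        turn = len(inputs["chat_history"]) + 1
        return {"answer": f"answer #{turn} to: {inputs['question']}"}


history = []
chain = EchoChain()
first = ask(chain, "What are his top skills?", history)
second = ask(chain, "Which of those relate to OCR?", history)
print(first)
print(second)
```

With the real chain, replacing `chain` with `qa` lets a follow-up such as "Which of those relate to OCR?" be rewritten into a standalone question using the earlier turns.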
