#### RAG Pipeline
* load data (data ingestion)
* transformation (data is broken into smaller chunks)
* embedding (chunks are converted into vectors)
* storage (vectors are stored in a database, so we can run a query against it and get the result)

In [1]:
from langchain_community.document_loaders import pdf
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma

In [2]:
# Load the PDF and split it into overlapping chunks
loader = pdf.PyPDFLoader(r"D:\_4.Docs\Dcouments\AsifKhan_Resume.pdf")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(docs)
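
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of a fixed-window splitter. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and sentences). The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

With these settings the window advances 800 characters at a time, so consecutive chunks share 200 characters. The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks.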

In [3]:
# Convert the chunks into vectors (embeddings) and store them in a Chroma database
db = Chroma.from_documents(documents, OllamaEmbeddings())

In [14]:
# Query the vector database
question = "can you tell me the highest qualification?"
result = db.similarity_search(question)
result[0].page_content


'Pakistan Telecommunication (Pvt)limited(PTCL) Karachi,Pakistan\nGIS Technician Aug-2019–May-2020\nThroughout the course duration, I have actively engaged in conducting surveys using Garmin 10x, establishing\nGIS databases, generating unique identifiers for network elements, and analyzing maps to track wireless\noutages and fiber cuts, as well as producing wire-line infrastructure maps.\nKey roles and responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .\n○Mapping and creating unique codes (LICs) of all physical assets and network elements of PTCL, including\nexchanges,OFC repeaters, DSLAM, Towers, BTS, optical fiber cable, etc\n○Mapping and creating rings for transmission systems of Nokia, Huawei and ZTE.\n○Collaborating with the development team to meet project deadlines and laying new footprint of network\ncoverage.'
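
The search above returns the chunk whose vector is closest to the question's vector, whether or not that chunk actually answers the question (here the top hit describes work experience rather than a qualification). The sketch below illustrates the ranking idea with a toy word-count "embedding" and cosine similarity. It is an assumption-laden illustration: a real embedding model returns dense float vectors, and the helper names and sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real model returns dense floats)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks, not taken from the resume
toy_chunks = [
    "worked as a technician on network maps",
    "completed a degree at a university",
    "enjoys hiking and photography",
]
top = similarity_search("which degree was completed", toy_chunks, k=1)
print(top)
```

Raising `k` returns more candidates, which is one way to give the downstream prompt a better chance of containing the relevant chunk.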

In [16]:
# Reuse the store built in cell 3 rather than embedding the same documents a second time
vectorstore = db
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [17]:
retrieved_docs = retriever.invoke("can you tell me the highest qualification?")

In [18]:
retrieved_docs

[Document(page_content='Pakistan Telecommunication (Pvt)limited(PTCL) Karachi,Pakistan\nGIS Technician Aug-2019–May-2020\nThroughout the course duration, I have actively engaged in conducting surveys using Garmin 10x, establishing\nGIS databases, generating unique identifiers for network elements, and analyzing maps to track wireless\noutages and fiber cuts, as well as producing wire-line infrastructure maps.\nKey roles and responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .\n○Mapping and creating unique codes (LICs) of all physical assets and network elements of PTCL, including\nexchanges,OFC repeaters, DSLAM, Towers, BTS, optical fiber cable, etc\n○Mapping and creating rings for transmission systems of Nokia, Huawei and ZTE.\n○Collaborating with the development team to meet project deadlines and laying new footprint of network\ncoverage.', metadat

In [15]:
llm = Ollama(model = "llama3",temperature=0.0)

In [7]:
prompt = """Please extract the following information from the given text and return it as a JSON object.
Total year of experience:
institute name:
This is the body of description
{docs}"""


In [8]:
from langchain.prompts import PromptTemplate
import os
from groq import Groq

client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            # The "{docs}" placeholder is never filled in here, so the model receives it literally.
            "content": prompt,
        }
    ],
    model="llama3-70b-8192",
)

print(chat_completion.choices[0].message.content)

Here is the extracted information in JSON format:

```
{
  "Total year of experience": null,
  "institute name": null,
  "description": "This is the body of description"
}
```

Note that the `Total year of experience` and `institute name` fields are null because there is no specific value provided in the text. If you want to extract the `{docs}` part as well, it would be an empty object `{}` since it's not clear what `{docs}` refers to.
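
The model returns nulls because the prompt still contains the literal `{docs}` placeholder. One possible fix, sketched below, is to join the text of the retrieved chunks and substitute it into the prompt before sending it. The `Doc` class is a hypothetical stand-in for the retrieved documents (anything with a `page_content` attribute), and `build_prompt` is an illustrative helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document with a page_content field."""
    page_content: str


PROMPT = """Please extract the following information from the given text and return it as a JSON object.
Total year of experience:
institute name:
This is the body of description
{docs}"""


def build_prompt(template: str, docs: list[Doc]) -> str:
    """Join the chunk texts and substitute them for the {docs} placeholder."""
    body = "\n\n".join(d.page_content for d in docs)
    return template.format(docs=body)


# Hypothetical chunks; in the notebook these would come from the retriever
filled = build_prompt(PROMPT, [Doc("first chunk"), Doc("second chunk")])
print(filled)
```

In the notebook, the result of `build_prompt(prompt, retrieved_docs)` would be passed as the message `content` in place of the raw `prompt`, assuming the retriever cell has been run first.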


In [9]:
llm.invoke("can you tell me the total number of countries in Asia")

"Asia is a vast and diverse continent, comprising 49 countries. Here's the list:\n\n1. Afghanistan\n2. Armenia\n3. Azerbaijan\n4. Bahrain\n5. Bangladesh\n6. Bhutan\n7. Brunei\n8. Cambodia\n9. China\n10. Cyprus\n11. East Timor (Timor-Leste)\n12. Georgia\n13. India\n14. Indonesia\n15. Iran\n16. Iraq\n17. Israel\n18. Japan\n19. Jordan\n20. Kazakhstan\n21. North Korea\n22. South Korea\n23. Kuwait\n24. Kyrgyzstan\n25. Laos\n26. Lebanon\n27. Malaysia\n28. Maldives\n29. Mongolia\n30. Myanmar (Burma)\n31. Nepal\n32. Oman\n33. Pakistan\n34. Palestine\n35. Philippines\n36. Qatar\n37. Russia (partially in Europe)\n38. Saudi Arabia\n39. Singapore\n40. Sri Lanka\n41. Syria\n42. Taiwan\n43. Tajikistan\n44. Thailand\n45. Turkey\n46. Turkmenistan\n47. United Arab Emirates\n48. Uzbekistan\n49. Vietnam\n\nNote that the number of countries in Asia can vary depending on how some territories are classified (e.g., Taiwan is sometimes considered a part of China, while others consider it a separate country). 

In [11]:
template = PromptTemplate(
    input_variables=["country"],
    template="can you tell me the capital and coordinates of {country}"
)

In [12]:
prompt = template.format(country="Pakistan")
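
For a simple template like this one, formatting amounts to substituting named placeholders, much as Python's `str.format` does. The sketch below mimics that with the standard library only and fails loudly when a variable is missing. It is an illustration of the idea, not the `PromptTemplate` implementation, and `render` is a hypothetical helper name.

```python
from string import Formatter


def render(template: str, **values: str) -> str:
    """Fill {name} placeholders, raising if any required value is missing."""
    needed = {field for _, field, _, _ in Formatter().parse(template) if field}
    missing = needed - set(values)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    return template.format(**values)


text = render("can you tell me the capital and coordinates of {country}", country="Pakistan")
print(text)
```

Checking for missing variables up front avoids silently sending an unfilled placeholder to the model.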