# ShadowFox Task Level (Hard):

Problem Statement: Embark on an AI-driven journey in the realm of
natural language processing (NLP) and machine learning (ML) by
deploying a Language Model (LM) of your choice. In this project, you
are tasked with delving into the intricacies of LM technology, where
the selection of the LM is entirely at your discretion. The
comprehensive process involves not only implementing the chosen LM
but also conducting an in-depth analysis of its performance and
capabilities.

# 1. Install Libraries

In [None]:
!pip -q install langchain
!pip -q install openai


In [None]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPEN_API_KEY')
HUGGINGFACEHUB_API_TOKEN = userdata.get('HUGGINGFACE_TOKEN')

In [None]:
import os

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
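`os.environ` only accepts string values, so a missing secret would fail at the assignment above with a less obvious error. A minimal sketch of a guard follows; the helper name `export_secret` and the placeholder variable `EXAMPLE_API_TOKEN` are hypothetical and not part of the original notebook.

```python
import os

def export_secret(name, value):
    """Copy a secret into os.environ, failing early if it is missing."""
    # Hypothetical helper: rejects None and empty strings before assignment
    if not value:
        raise ValueError(f"Secret {name!r} is empty or was not found")
    os.environ[name] = value

# Placeholder name and value, for illustration only
export_secret("EXAMPLE_API_TOKEN", "dummy-token")
```

In the notebook, the same check could be applied to `OPENAI_API_KEY` and `HUGGINGFACEHUB_API_TOKEN` before they are exported.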

# 2. Loading Data

In [None]:
!wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip
!unzip -q new_articles.zip -d new_articles

In [None]:
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader

In [None]:
loader = DirectoryLoader('./new_articles/', glob="./*.txt", loader_cls=TextLoader)

documents = loader.load()
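Conceptually, the loader walks the directory, reads every file matching the glob, and records where each text came from. The sketch below reproduces that idea with the standard library only. It is an approximation for illustration, not LangChain's implementation, and `load_text_files` and the sample file names are made up.

```python
import tempfile
from pathlib import Path

def load_text_files(directory, pattern="*.txt"):
    """Read every matching file into a dict with its text and source path."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records

# Build a throwaway directory with two .txt files and one non-matching file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first article", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second article", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    sample_docs = load_text_files(tmp)

print(len(sample_docs))  # only the two .txt files are loaded
```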

# 3. Chunking Documents

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

In [None]:
texts[0]

Document(page_content="Generative AI is pretty impressive in terms of its fidelity these days, as viral memes like Balenciaga Pope would suggest. The latest systems can conjure up scenescapes from city skylines to cafes, creating images that appear startlingly realistic — at least on first glance.\n\nBut one of the longstanding weaknesses of text-to-image AI models is, ironically, text. Even the best models struggle to generate images with legible logos, much less text, calligraphy or fonts.\n\nBut that might change.\n\nLast week, DeepFloyd, a research group backed by Stability AI, unveiled DeepFloyd IF, a text-to-image model that can “smartly” integrate text into images. Trained on a dataset of more than a billion images and text, DeepFloyd IF, which requires a GPU with at least 16GB of RAM to run, can create an image from a prompt like “a teddy bear wearing a shirt that reads ‘Deep Floyd'” — optionally in a range of styles.", metadata={'source': 'new_articles/05-05-with-deepfloyd-gen

In [None]:
len(texts)

233
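To see what `chunk_size=1000` and `chunk_overlap=200` mean, here is a simplified fixed-width sliding window. This is an assumption-laden sketch: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunks will generally not match these boundaries. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size chunks where neighbours share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

# 2,500 synthetic characters stand in for an article
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk reappear at the start of the next one, which helps a sentence that straddles a boundary stay retrievable from at least one chunk.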

# 4. Storing docs using Vectorstores + Embedding

In [None]:
!pip install chromadb
!pip install tiktoken

Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl (525 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.110.0-py3-none-any.whl (92 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
  Downloading uvicorn-0.28.0-py3-none-any.whl (60 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings  # Use a Hugging Face embedding model here, since OpenAI embeddings require a paid account

In [None]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.5.1


In [None]:
persist_directory = 'db'

#embedding = OpenAIEmbeddings()
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
print(vectordb._collection.count())

233


# 5. RetrieverQA

### Simple similarity search to test our question against the vector store

In [None]:
question = "What is ai powered supply chain startup?"
docs_similarity_search = vectordb.similarity_search(question, k=3)
for doc in docs_similarity_search:
    print(doc.page_content, f"==> metadata = {doc.metadata}")

Pando also taps algorithms and forms of machine learning to make predictions around supply chain events. For example, the platform attempts to match customer orders with suppliers, customers through the “right” channel (in terms of aspects like cost and carbon footprint) and fulfillment strategy (e.g. mode of freight, carrier, etc.). Beyond this, Pando can detect anomalies among deliveries, orders and freight invoices and anticipate supply chain risk given demand and supply trends.

Pando isn’t the only vendor doing this. Altana, which bagged $100 million in venture capital last October, uses an AI system to connect to and learn from logistics and business-to-business data — creating a shared view of supply chain networks. Everstream, another Pando rival, offers its own dashboards for data analysis, integrated with existing ERP, transportation management and supplier relationship management systems. ==> metadata = {'source': 'new_articles/05-03-ai-powered-supply-chain-startup-pando-lan
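The search above embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that ranking step on toy 2-D vectors using cosine similarity. Cosine is an assumption chosen for illustration; the metric Chroma actually applies depends on how the collection is configured. The names `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
nearest = top_k([1.0, 0.0], toy_vectors, k=3)
print(nearest)  # [0, 1, 3]
```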

# 6. Initialize LLM

In [None]:
from langchain.llms import OpenAI
from langchain.llms import HuggingFaceHub

In [None]:
#llm = OpenAI()    #Use openai with a paid account

In [None]:
llm = HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature":0.6, "max_length":512})

# 7. RetrievalQA chain

In [None]:
from langchain.chains import RetrievalQA

### Base retriever

In [None]:
# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=vectordb.as_retriever(),
                                  return_source_documents=True)
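With `chain_type="stuff"`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question, and the LLM is called once. A rough sketch of that assembly step follows. The template wording and the helper name `build_stuffed_prompt` are illustrative assumptions, not LangChain's actual prompt.

```python
def build_stuffed_prompt(question, chunks):
    """Join all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, shortened for readability
retrieved = [
    "Pando taps algorithms to make predictions around supply chain events.",
    "Pando can detect anomalies among deliveries, orders and freight invoices.",
]
prompt = build_stuffed_prompt("What does Pando do?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits within the model's context window.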

In [None]:
question = "What is ai powered supply chain startup pando do?"
result = qa_chain({"query": question})
print(f"Answer:\n {result['result']}")

Answer:
 Pando provides various tools and apps for accomplishing different tasks across freight procurement, trade and transport management, freight audit and payment and document management, as well as dispatch planning and analytics.
