<a href="https://colab.research.google.com/github/dvdblk/hack4good-oecd/blob/main/langchain_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [2]:
!pip install langchain chromadb openai tiktoken lark faiss-cpu InstructorEmbedding sentence_transformers

Collecting InstructorEmbedding
  Obtaining dependency information for InstructorEmbedding from https://files.pythonhosted.org/packages/6c/fc/64375441f43cc9ddc81f76a1a8f516e6d63f5b6ecb67fffdcddc0445f0d3/InstructorEmbedding-1.0.1-py2.py3-none-any.whl.metadata
  Downloading InstructorEmbedding-1.0.1-py2.py3-none-any.whl.metadata (20 kB)
Collecting sentence_transformers
  Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... [?25ldone
Downloading InstructorEmbedding-1.0.1-py2.py3-none-any.whl (19 kB)
Building wheels for collected packages: sentence_transformers
  Building wheel for sentence_transformers (setup.py) ... [?25ldone
[?25h  Created wheel for sentence_transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=940034410df56e2813c38e9c8801d0a85be2e503534a53c62e68ec59af45e97d
  Stored in directory: /Users/dvdblk/Library/Caches/pip/wheels/ff/27/bf/ffba8b318b02d7f691a57084ee154e26ed24d012b0c7805881
Successfully built

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import re, os
import openai, lark

# Set Up

In [16]:
# Define our text splitter
from langchain.document_loaders import TextLoader

loader = TextLoader("../../data/interim/01-clean/UK_01.txt")
documents = loader.load()

chunk_size = 2000
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
all_splits = text_splitter.split_documents(documents)

In [3]:
from operator import itemgetter

from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough, RunnableLambda
from langchain.vectorstores import FAISS

# Simplest (Free) Implementation

In [17]:
# Use instructor-xl for embeddings instead of OAI
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
vectorstore = FAISS.from_documents(all_splits, embedding=embeddings)
retriever = vectorstore.as_retriever()

load INSTRUCTOR_Transformer
max_seq_length  512


In [18]:
# This requires `ollama serve` to be running
llm = Ollama(base_url="http://localhost:11434", model="llama2")

In [19]:
llm("This is just a test bro")

"\nUnderstood! Is there anything else you'd like to chat about or ask? I'm here to help with any questions you may have."

In [20]:
# query = "Which topics is the document most about: General economy, Semiconductors, Quantum Technology, Skills, AI, Lifelong Learning, Digital, or None Of The Above?"
# query = "Which entity, person or organization wrote this document or was its main sponsor?" # Reply with just the name"
query = "Does the document talk about future needs in skills?"

In [21]:
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
qa_chain.run(query)

'Yes, the document discusses future needs in skills. According to the text, there is a need for more skilled technicians in various fields, particularly in areas related to emerging technologies such as artificial intelligence, robotics, and data science. The document highlights that there is a shortage of skilled technicians in the UK, with evidence suggesting that the share of the workforce with technician or intermediate-level skills is lower in the UK than in major competitors. Additionally, the document notes that graduates often quickly become dissatisfied with their roles due to the mundane nature of much technician work and the relatively low wages they earn, leading to high labour turnover amongst graduates in technician-level roles. To address these issues, the document proposes the development of connections between employers and providers of vocational education and training to keep pace with advances in technology and provide training programs that meet the needs of emergi

# Improvement 1? optimize text for retrieval

In [108]:
# standard text splitter
# loader = TextLoader("./UK_01.txt")
# documents = loader.load()

# chunk_size = 2000
# chunk_overlap = 0
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=chunk_size, chunk_overlap=chunk_overlap
# )

# all_splits = text_splitter.split_documents(documents)

splits2 = []
for idx, splt in enumerate(all_splits):
    metadata = {'source': splt.metadata['source'], 'idx': idx}
    splits2.append(Document(page_content=splt.page_content, metadata=metadata))

all_splits = splits2

In [109]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download('punkt')

nltk.download('stopwords')
stop_engl = stopwords.words('english')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [110]:
def preprocess(line, stopwords=None):

    tokens = word_tokenize(line)

    if stopwords is None:
        return tokens

    stopwords = set(stopwords)
    cleaned = []
    for tok in tokens:
        if tok.lower() in stopwords:
            continue
        cleaned.append(tok)

    string = TreebankWordDetokenizer().detokenize(cleaned)

    return string

In [111]:
preprocess("Oh no, I am an evil sentence. I don't want to be tokenized", stop_engl)

"Oh, evil sentence .n't want tokenized"

In [112]:
from langchain.schema.document import Document

embedding_splits = []
for idx, splt in enumerate(all_splits):
    cleaned = preprocess(splt.page_content, stop_engl)
    embedding_splits.append(Document(page_content=cleaned, metadata=splt.metadata))

In [113]:
clean_vectorstore = vectorstore = FAISS.from_documents(embedding_splits, embedding=OpenAIEmbeddings())
clean_retriever = vectorstore.as_retriever()

In [119]:
query_2 = "Is the document optimistic or pessimistic regarding the future need for techology related skills?"
TOP_K = 10

clean_embedded_docs = clean_vectorstore.similarity_search(query_2,k=TOP_K)

clean_docs = []
clean_ids = []
for sim in clean_embedded_docs:
    clean_docs.append(all_splits[sim.metadata['idx']])
    clean_ids.append(sim.metadata['idx'])


standard = vectorstore.similarity_search(query_2,k=TOP_K)
standard_ids = []
for doc in standard:
    standard_ids.append(doc.metadata['idx'])

if clean_ids == standard_ids:
    print("Same result:")
    for doc in standard:
        print(doc.page_content)

else:
    # common elements in both lists
    common_elements = list(set(clean_ids) & set(standard_ids))

    clean_ids_remainder = list(set(clean_ids) - set(standard_ids))
    standard_ids_remainder = list(set(standard_ids) - set(clean_ids))

    print("QUERY: ", query_2)
    print("-"*100)

    print("BOTH HAVE:")
    for doc in clean_docs:
        if doc.metadata['idx'] in common_elements:
            print(doc.metadata['idx'], doc.page_content)

    print("-"*100)
    print("ONLY IN CLEANED: ")
    for doc in clean_docs:
        if doc.metadata['idx'] in clean_ids_remainder:
            print(doc.metadata['idx'], doc.page_content)
    print("-"*100)
    print("ONLY IN STANDARD EMBEDS: ")
    for doc in standard:
        if doc.metadata['idx'] in standard_ids_remainder:
            print(doc.metadata['idx'], doc.page_content)

Same result:
22organisations involved vocational education training supply adequate numbers skilled technicians innovation system said failed providing firms right quantity, blend, practical theoretical knowledge make best use new technologies . 5.3.2 Systems failures UK: case technician skills emerging technologiesThere evidence coordination problems kind described case technician skills training emerging technologies advanced manufacturing UK . Employers seeking deploy new technologies space industry, advanced therapies, industries make use composite materials, industrial biotechnology report significant difficulties obtaining skilled technician labour need deploy new technologies effectively . example, noted, space firms struggle hire experienced, high-quality manufacturing technicians, employers industrial biotechnology advanced therapies industry (Lewis 2012b: 25-26; 2016a: 34-35). Employers aerospace automotive industries wish make greater use composite parts find hard recruit te