
# What is LangChain?

Broadly speaking, LangChain is a wrapper; more specifically, it is a framework for developing applications built around Large Language Models (LLMs).

The idea behind the framework is to "chain" together different components in order to build more advanced use cases exploiting LLMs.
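
To make the "chaining" idea concrete, here is a minimal sketch in plain Python. It does not use LangChain's API: `make_chain`, `build_prompt`, `fake_llm`, and `parse_output` are hypothetical stand-ins for a prompt template, a model call, and an output parser, and the model reply is hard-coded.

```python
from functools import reduce
from typing import Callable

Step = Callable[[str], str]


def make_chain(*steps: Step) -> Step:
    """Compose steps left to right: the output of one step feeds the next."""
    def run(text: str) -> str:
        return reduce(lambda value, step: step(value), steps, text)
    return run


# Hypothetical stand-ins for a prompt template, a model call, and a parser.
def build_prompt(question: str) -> str:
    return f"Answer briefly: {question}"


def fake_llm(prompt: str) -> str:
    # A real chain would call an LLM here; this stub only echoes the prompt.
    return f"[model reply to '{prompt}']  "


def parse_output(reply: str) -> str:
    return reply.strip()


demo_chain = make_chain(build_prompt, fake_llm, parse_output)
demo_result = demo_chain("What is LangChain?")
print(demo_result)
```

Each step only needs to accept the previous step's output, which is what lets components be swapped without touching the rest of the pipeline.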

# LangChain Components

LangChain is built around seven components:

- Schema
- Models
- Prompts
- Indexes
- Memory
- Chains
- Agents

# LangChain use cases

In [None]:
! pip install langchain==0.0.327

In [None]:
!pip install openai==0.28.1

In [None]:
!pip install tiktoken

In [None]:
!pip install chromadb

In [None]:
! pip install pypdf

In [None]:
!pip install wikipedia

In [None]:
!pip install -U langchain langchain_experimental openai

In [8]:
# set your OpenAI API key
import os
os.environ["OPENAI_API_KEY"] = ""

# Chat with your documents

### Load data

In [9]:
# import libraries
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI


In [10]:
from google.colab import files
uploaded = files.upload()

Saving s13278-021-00776-6.pdf to s13278-021-00776-6.pdf


In [11]:
loader = PyPDFLoader("s13278-021-00776-6.pdf")

In [12]:
data = loader.load()

In [13]:
# Document shape
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your document')

You have 19 document(s) in your data
There are 4266 characters in your document


### Split data up into smaller documents

In [14]:
# Split document
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [15]:
print (f'Now you have {len(texts)} documents')

Now you have 190 documents
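
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `split_text` is a hypothetical helper written only for this illustration.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters
chunks = split_text(sample, chunk_size=12, chunk_overlap=0)
overlapped = split_text(sample, chunk_size=12, chunk_overlap=4)

print([len(c) for c in chunks])
print(len(overlapped))
```

With no overlap the chunks concatenate back to the original text; with an overlap, the tail of one chunk is repeated at the head of the next, which can help when a relevant sentence straddles a chunk boundary.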


### Create embeddings from documents to get ready for semantic search

In [16]:
embeddings = OpenAIEmbeddings()

### Store embeddings in a vector store

In [17]:
# load it into Chroma
docsearch = Chroma.from_documents(texts, embeddings)

### Retrieve documents

In [18]:
query = "What are examples of sentiment analysis?"
docs = docsearch.similarity_search(query)
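
The idea behind a similarity search can be sketched without any external service: turn the query and each document into a vector, then rank documents by cosine similarity. The sketch below uses word counts as a toy "embedding", which is an assumption made for illustration; OpenAI embeddings are dense learned vectors, and Chroma uses its own index. The names `embed`, `cosine`, and `toy_similarity_search` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


documents = [
    "sentiment analysis recognizes polarity in texts",
    "storms bring strong wind and heavy rain",
    "emotion analysis goes beyond polarity",
]
top = toy_similarity_search("examples of sentiment analysis", documents, k=2)
print(top)
```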


In [19]:
# Here's an example of the document that was returned
print(docs[1].page_content[:500])

Sentiment and emotion analysis has a wide range of 
applications and can be done using various methodologies. 
There are three types of sentiment and emotion analysis 
techniques: lexicon based, machine learning based, and deep 
learning based. Each has its own set of benefits and draw -
backs. Despite different sentiment and emotion recognition techniques, researchers face significant challenges, including 
dealing with context, ridicule, statements conveying several


In [20]:
# Here's an example of another document that was returned
print(docs[2].page_content[:500])

processed as rapidly as generated to comprehend human psychology, and it can be accomplished using sentiment analysis, 
which recognizes polarity in texts. It assesses whether the author has a negative, positive, or neutral attitude toward an item, 
administration, individual, or location. In some applications, sentiment analysis is insufficient and hence requires emotion


### Query documents to get your answer back

In [21]:
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')
chain = load_qa_chain(llm, chain_type="map_reduce")
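
The `map_reduce` chain type can be pictured as two phases: the model is asked the question once per retrieved chunk (map), and the partial answers are then combined in one final call (reduce). The sketch below shows that control flow with a stub in place of the model. The prompt wording and the names `map_reduce_qa` and `stub_llm` are assumptions for illustration, not LangChain's actual prompts or API.

```python
from typing import Callable, List


def map_reduce_qa(question: str, documents: List[str], llm: Callable[[str], str]) -> str:
    """Answer a question over several chunks in two phases."""
    # Map: ask the question against each chunk independently.
    partial_answers = [
        llm(f"Context: {doc}\nQuestion: {question}") for doc in documents
    ]
    # Reduce: combine the partial answers with one final call.
    combined = "\n".join(partial_answers)
    return llm(f"Partial answers:\n{combined}\nQuestion: {question}")


calls: List[str] = []


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model: records the prompt and returns a numbered reply.
    calls.append(prompt)
    return f"answer#{len(calls)}"


final = map_reduce_qa(
    "What are the challenges?", ["chunk A", "chunk B", "chunk C"], stub_llm
)
print(len(calls), final)
```

One consequence of this design is cost: a query over n chunks makes n + 1 model calls in this sketch, in exchange for never needing to fit all chunks into a single prompt.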


In [22]:
query1 = "Could you explain the challenges in sentiment analysis and emotion analysis shown in the document?"
docs = docsearch.similarity_search(query1)

In [23]:
chain.run(input_documents=docs, question=query1)


'According to the provided information, the challenges in sentiment analysis and emotion analysis mentioned in the document include:\n\n1. Lack of resources: The document mentions that one of the challenges is the lack of resources, specifically the difficulty in gathering a large annotated dataset for statistical algorithms. Manual labeling of a large dataset is time-consuming and less reliable.\n\n2. Use of emotions: The document suggests that the use of emotions in sentiment analysis and emotion analysis poses a challenge. It does not provide further details on how this challenge manifests.\n\n3. Spreading Web slang: The document mentions that spreading web slang is a challenge in sentiment and emotion analysis. It does not provide further details on how this challenge affects the analysis.\n\n4. Lexical and syntactical ambiguity: The document states that lexical and syntactical ambiguity is a challenge in sentiment and emotion analysis. It does not provide further details on how th

In [24]:
query2 = "Could you describe the feature extraction used in the document?"
docs = docsearch.similarity_search(query2)

In [25]:
chain.run(input_documents=docs, question=query2)

"The feature extraction used in the document is a combination of n-grams with n = 2 and TF-IDF. This involves creating a fixed-length vector where each entry corresponds to a word or combination of two words in a pre-defined dictionary of words. The values in the vector represent the importance or relevance of each word or combination of words in the sentence or document. The document also mentions the use of the 'Bag of Words' (BOW) method, which involves counting the occurrences of each word in the document. Additionally, the document mentions the use of word vectorization or word embedding, where a document is broken down into sentences and further broken down into words, and a feature map or matrix is built based on these words. However, the document does not provide specific details about the feature extraction method used with the proposed lexicon."

In [26]:
query3 = "Could you explain the pre-processing of text shown in the document?"
docs = docsearch.similarity_search(query3)

In [27]:
chain.run(input_documents=docs, question=query3)

'The document mentions that the pre-processing of text involves several steps. These steps include tokenization, stop word removal, and POS tagging. Tokenization refers to breaking down a document, paragraph, or sentence into smaller units called tokens. Stop word removal involves removing common words that do not carry much meaning, such as "the," "and," or "is." POS tagging stands for Part-of-Speech tagging, which involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc. These pre-processing techniques are important for organizing the dataset, but they can also result in the loss of crucial information for sentiment and emotion analysis, which needs to be addressed. However, the specific details of these techniques are not mentioned in the given portion of the document.'

# Chatbot Translator

### Load libraries

In [28]:
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)
from langchain.chat_models import ChatOpenAI

### Ask the chatbot for a translation

In [29]:
chat = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')

In [30]:
chat([HumanMessage(content="Translate this sentence from English to German: I'm interested in developing a start-up business")])

AIMessage(content='Ich interessiere mich für die Entwicklung eines Start-up-Unternehmens.')

In [31]:
chat([HumanMessage(content="Translate this sentence from English to Russian: I like to use LangChain in developing projects")])

AIMessage(content='Мне нравится использовать LangChain при разработке проектов.')

In [32]:
chat([HumanMessage(content="Translate this sentence from English to Japanese: I love to eat sushi one day a week")])

AIMessage(content='私は週に一度寿司を食べるのが大好きです。 (Watashi wa shū ni ichido sushi o taberu no ga daisuki desu.)')

# Chat with Wikipedia

### Load libraries

In [33]:
from langchain.retrievers import WikipediaRetriever
from langchain.chains import ConversationalRetrievalChain
import pprint



### Retrieve Wikipedia documents from your topic

In [34]:
retriever = WikipediaRetriever()

In [None]:
docs = retriever.get_relevant_documents(query="storm")


In [36]:
docs[0].metadata  # meta-information of the Document

{'title': 'Storm',
 'summary': 'A storm is any disturbed state of the natural environment or the atmosphere of an astronomical body. It may be marked by significant disruptions to normal conditions such as strong wind, tornadoes, hail, thunder and lightning (a thunderstorm), heavy precipitation (snowstorm, rainstorm), heavy freezing rain (ice storm), strong winds (tropical cyclone, windstorm), wind transporting some substance through the atmosphere such as in a dust storm, among other forms of severe weather.\nStorms have the potential to harm lives and property via storm surge, heavy rain or snow causing flooding or road impassibility, lightning, wildfires, and vertical and horizontal wind shear. Systems with significant rainfall and duration help alleviate drought in places they move through. Heavy snowfall can allow special recreational activities to take place which would not be possible otherwise, such as skiing and snowmobiling.\nThe English word comes from Proto-Germanic *sturma

In [37]:
docs[0].page_content[:1000]  # the content of the Document

'A storm is any disturbed state of the natural environment or the atmosphere of an astronomical body. It may be marked by significant disruptions to normal conditions such as strong wind, tornadoes, hail, thunder and lightning (a thunderstorm), heavy precipitation (snowstorm, rainstorm), heavy freezing rain (ice storm), strong winds (tropical cyclone, windstorm), wind transporting some substance through the atmosphere such as in a dust storm, among other forms of severe weather.\nStorms have the potential to harm lives and property via storm surge, heavy rain or snow causing flooding or road impassibility, lightning, wildfires, and vertical and horizontal wind shear. Systems with significant rainfall and duration help alleviate drought in places they move through. Heavy snowfall can allow special recreational activities to take place which would not be possible otherwise, such as skiing and snowmobiling.\nThe English word comes from Proto-Germanic *sturmaz meaning "noise, tumult".Storm

### Ask Wikipedia about your topics of interest

In [38]:
model = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')

In [39]:
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
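
Conceptually, a conversational retrieval chain does three things per turn: it rewrites a follow-up question into a standalone one using the chat history, retrieves documents for that question, and answers from them. The sketch below shows this flow with stubs instead of a model and a retriever. All function names here are hypothetical, and skipping the rewrite when the history is empty is an assumption of this sketch.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def conversational_answer(
    question: str,
    chat_history: History,
    condense: Callable[[str, History], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> str:
    """One turn: condense the question, retrieve documents, answer."""
    # With no history there is nothing to condense (assumption of this sketch).
    standalone = condense(question, chat_history) if chat_history else question
    retrieved = retrieve(standalone)
    return answer(standalone, retrieved)


# Stubs standing in for the LLM and the Wikipedia retriever.
def stub_condense(question: str, history: History) -> str:
    last_question, _ = history[-1]
    return f"{question} (follow-up to: {last_question})"


def stub_retrieve(query: str) -> List[str]:
    return [f"doc about {query}"]


def stub_answer(query: str, retrieved: List[str]) -> str:
    return f"answer to '{query}' using {len(retrieved)} doc(s)"


history: History = []
first = conversational_answer(
    "What is a storm?", history, stub_condense, stub_retrieve, stub_answer
)
history.append(("What is a storm?", first))
second = conversational_answer(
    "How do they form?", history, stub_condense, stub_retrieve, stub_answer
)
print(first)
print(second)
```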

In [40]:
questions = [
    "What is storm?",
    "What is Sustainability?",
    "What is climate change?",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is storm? 

**Answer**: A storm is a weather event characterized by strong winds, precipitation (such as rain, snow, or hail), and often thunder and lightning. Storms can vary in intensity and duration, and can occur in various forms, such as thunderstorms, hurricanes, blizzards, or tornadoes. Storms can cause significant damage to property and pose risks to human safety. 

-> **Question**: What is Sustainability? 

**Answer**: Sustainability is a concept that refers to the ability to maintain or support something over the long term. In the context of the environment, sustainability often focuses on countering major environmental problems, such as climate change, loss of biodiversity, and pollution. It involves finding a balance between economic development, environmental protection, and social well-being. Sustainability can guide decisions at the global, national, and individual levels, and it aims to meet the needs of the present generation without compromising 

# Synthetic Data Generator

### Load Libraries

In [41]:
import pandas as pd
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.base import SyntheticDataGenerator
from langchain_experimental.tabular_synthetic_data.openai import (
    create_openai_data_generator,
    OPENAI_TEMPLATE,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_SUFFIX,
    SYNTHETIC_FEW_SHOT_PREFIX,
)
from pydantic import IntEnumError

### Structure of your dataset

In [42]:
class MedicalCharges(BaseModel):
    ID: int
    Age: int
    BMI: float
    Height: int
    Children: int
    Charges: float

### Examples of your data

In [49]:
examples = [
    {
        "example": """ID: 123456, Age: 34, BMI: 27.9,
        Children: 0, Height: 170, Charges: 1884.90"""
    },
    {
        "example": """ID: 253459, Age: 45, BMI: 22.7,
        Children: 2, Height: 167, Charges: 1725.34"""
    },
    {
        "example": """ID: 323758, Age: 23, BMI: 18.9,
        Children: 0, Height: 178, Charges: 3866.60"""
    }
]

### Provide a template

In [50]:
# Note: this rebinds the OPENAI_TEMPLATE name imported above to a pass-through example prompt
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)
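
A few-shot prompt is essentially a prefix, the rendered examples, and a suffix joined into one string. The sketch below assembles one in plain Python to show the shape of the final prompt. The prefix and suffix wording is made up for illustration (the actual `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX` strings are not reproduced here), and `build_few_shot_prompt` is a hypothetical helper.

```python
from typing import Dict, List


def build_few_shot_prompt(
    prefix: str,
    examples: List[Dict[str, str]],
    example_template: str,
    suffix: str,
    separator: str = "\n\n",
    **variables: str,
) -> str:
    """Join prefix, rendered examples, and suffix into one prompt string."""
    rendered = [example_template.format(**example) for example in examples]
    parts = [prefix.format(**variables), *rendered, suffix.format(**variables)]
    return separator.join(parts)


demo_examples = [
    {"example": "ID: 123456, Age: 34, BMI: 27.9, Children: 0, Height: 170, Charges: 1884.90"},
    {"example": "ID: 253459, Age: 45, BMI: 22.7, Children: 2, Height: 167, Charges: 1725.34"},
]

demo_prompt = build_few_shot_prompt(
    prefix="Here are some example rows about {subject}:",  # hypothetical wording
    examples=demo_examples,
    example_template="{example}",
    suffix="Now write one new row, {extra}.",  # hypothetical wording
    subject="Medical_Charges",
    extra="chosen at random",
)
print(demo_prompt)
```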

### Generate synthetic data

In [51]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=MedicalCharges,
    llm=ChatOpenAI(temperature=1, model_name='gpt-3.5-turbo'),
    prompt=prompt_template,
)

In [52]:
# if needed, run this cell a second time
synthetic_results = synthetic_data_generator.generate(
    subject="Medical_Charges",
    extra="chosen at random",
    runs=10)

In [53]:
synthetic_results

[MedicalCharges(ID=987654, Age=37, BMI=25.3, Height=175, Children=1, Charges=2500.8),
 MedicalCharges(ID=876543, Age=30, BMI=28.5, Height=162, Children=3, Charges=4598.75),
 MedicalCharges(ID=123456, Age=45, BMI=27.8, Height=170, Children=2, Charges=5875.12),
 MedicalCharges(ID=897654, Age=42, BMI=23.7, Height=164, Children=4, Charges=3200.45),
 MedicalCharges(ID=876543, Age=30, BMI=28.5, Height=162, Children=3, Charges=4598.75),
 MedicalCharges(ID=987654, Age=37, BMI=25.3, Height=175, Children=1, Charges=3850.67),
 MedicalCharges(ID=165432, Age=28, BMI=21.2, Height=170, Children=2, Charges=2750.89),
 MedicalCharges(ID=765432, Age=45, BMI=32.1, Height=169, Children=2, Charges=5830.94),
 MedicalCharges(ID=987654, Age=37, BMI=25.3, Height=175, Children=1, Charges=3850.67),
 MedicalCharges(ID=582916, Age=32, BMI=23.7, Height=168, Children=3, Charges=4200.35)]

In [56]:
df = pd.DataFrame([x.__dict__ for x in synthetic_results], columns=['ID', 'Age', 'BMI', 'Children', 'Height', 'Charges'])


In [57]:
df

       ID  Age   BMI  Children  Height  Charges
0  987654   37  25.3         1     175  2500.80
1  876543   30  28.5         3     162  4598.75
2  123456   45  27.8         2     170  5875.12
3  897654   42  23.7         4     164  3200.45
4  876543   30  28.5         3     162  4598.75
5  987654   37  25.3         1     175  3850.67
6  165432   28  21.2         2     170  2750.89
7  765432   45  32.1         2     169  5830.94
8  987654   37  25.3         1     175  3850.67
9  582916   32  23.7         3     168  4200.35
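
The generated rows above repeat some IDs (for example, 987654 appears three times), so a quick sanity check before using the data may be worthwhile. The sketch below works on plain tuples copied from the output above, in the order (ID, Age, BMI, Height, Children, Charges). Keeping the first occurrence of each ID is an assumption; other policies, such as regenerating the duplicates, are equally plausible.

```python
from collections import Counter

# Rows copied from the generated output: (ID, Age, BMI, Height, Children, Charges)
rows = [
    (987654, 37, 25.3, 175, 1, 2500.8),
    (876543, 30, 28.5, 162, 3, 4598.75),
    (123456, 45, 27.8, 170, 2, 5875.12),
    (897654, 42, 23.7, 164, 4, 3200.45),
    (876543, 30, 28.5, 162, 3, 4598.75),
    (987654, 37, 25.3, 175, 1, 3850.67),
    (165432, 28, 21.2, 170, 2, 2750.89),
    (765432, 45, 32.1, 169, 2, 5830.94),
    (987654, 37, 25.3, 175, 1, 3850.67),
    (582916, 32, 23.7, 168, 3, 4200.35),
]

# Count how often each ID occurs and list the repeated ones.
id_counts = Counter(row[0] for row in rows)
duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)

# Keep only the first row seen for each ID (assumed policy).
seen = set()
unique_rows = []
for row in rows:
    if row[0] not in seen:
        seen.add(row[0])
        unique_rows.append(row)

print(duplicate_ids)
print(len(unique_rows))
```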
