# CoQA

CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA is pronounced as coca.

CoQA contains 127,000+ questions with answers collected from 8,000+ conversations. Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers. The unique features of CoQA include 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA exhibits many challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

## Steps followed:

1. We will extract the text, question and answer columns from the main JSON dataset and save them as a CSV file.
2. We will use HuggingFace embeddings to embed every row of that CSV.
3. Those embeddings will then be stored in the FAISS vector database.
4. When we ask a question, it will be converted to an embedding and compared with the embeddings stored in the vector database. The closest embeddings will be picked.
5. The picked embeddings will be mapped back to the text they came from.
6. Our question should then be answered based only on the retrieved text/questions.
7. Finally, the retrieved text/questions will be given to the Google PaLM LLM so that it can give a coherent answer to the question.
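The steps above can be sketched without any external library. This is only an illustration of the retrieval idea: the bag-of-words `embed` function is a toy stand-in for the HuggingFace embedding model, the plain Python list stands in for FAISS, and the names `build_index`, `retrieve` and `build_prompt` are hypothetical helpers, not part of LangChain.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a word-count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(texts):
    """Steps 2-3: embed every text and keep (text, vector) pairs (stand-in for FAISS)."""
    return [(text, embed(text)) for text in texts]


def retrieve(index, query, k=2):
    """Steps 4-5: embed the query, rank stored vectors, return the closest texts."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, contexts):
    """Steps 6-7: the prompt that would be sent to the LLM."""
    joined = "\n".join(contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


passages = [
    "the vatican library is a research library",
    "budapest was two cities called buda and pest",
    "spiderman was created by stan lee",
]
index = build_index(passages)
top = retrieve(index, "who created spiderman", k=1)
prompt = build_prompt("who created spiderman", top)
print(prompt)
```

A real embedding model captures meaning rather than word overlap, so it can match a question to a passage even when they share few words; the ranking logic stays the same.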

In [None]:
!pip install langchain
!pip install -q google-generativeai

In [14]:
import pandas as pd

In [15]:
qa = pd.read_json('/content/drive/MyDrive/coqa-train-v1.0.json')
qa

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
...,...,...
7194,1,"{'source': 'gutenberg', 'id': '34j10vatjfyw0ao..."
7195,1,"{'source': 'cnn', 'id': '3vj40nv2qinjocrcy7k4z..."
7196,1,"{'source': 'race', 'id': '3rjsc4xj10uw0to3vq0v..."
7197,1,"{'source': 'wikipedia', 'id': '3gs6s824sqxty8v..."


### Making a subset because the file is too large: the JSON dataset has 100k+ question rows once flattened.

In [16]:
qa = qa.head(100)
qa

Unnamed: 0,version,data
0,1,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,1,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,1,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,1,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
...,...,...
95,1,"{'source': 'race', 'id': '3zwfc4w1uu7c2k1rvfwj..."
96,1,"{'source': 'wikipedia', 'id': '3z3zlgnnsiuha76..."
97,1,"{'source': 'wikipedia', 'id': '3p4rdnwnd56fenk..."
98,1,"{'source': 'wikipedia', 'id': '3ermj6l4dys8qb9..."


### Question/Answer data in json format

For question answering tasks, the input data can be in JSON files or in a Python list of dictionaries in the correct format. The structure of both formats is identical, i.e. the input may be a string pointing to a JSON file containing a list of dictionaries, or the input may be a list of dictionaries itself.

-- Input Structure
The input data should be a single list of dictionaries (or path to a JSON file containing the same). A dictionary represents a single context and its associated questions.

Each such dictionary contains two attributes, the "context" and "qas".

-- context: The paragraph or text from which the question is asked.

-- qas: A list of questions and answers (format below).

Questions and answers are represented as dictionaries. Each dictionary in qas has the following format.

-- id: (string) A unique ID for the question. Should be unique across the entire dataset.

-- question: (string) A question.

-- is_impossible: (bool) Indicates whether the question cannot be answered from the context (True means it is unanswerable).

-- answers: (list) The list of correct answers to the question

In this dataset "story" is the context.

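The "questions" key contains a list of dictionaries, one per question in the conversation.

The "answers" key contains a parallel list of dictionaries, one per answer, in the same order as the questions.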
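To make the mapping between the two layouts concrete, here is a minimal sketch that converts one CoQA-style record into the "context"/"qas" structure described above. The record is a made-up miniature, `coqa_to_qas` is a hypothetical helper, the answers are kept as plain strings for simplicity, and treating the answer text "unknown" as the marker for unanswerable questions is an assumption about the dataset.

```python
def coqa_to_qas(record):
    """Convert one CoQA-style record into a {"context", "qas"} dictionary."""
    qas = []
    # Questions and answers are parallel lists, so pair them by position.
    for question, answer in zip(record["questions"], record["answers"]):
        text = answer["input_text"]
        qas.append({
            "id": f"{record['id']}_{len(qas) + 1}",   # unique per question
            "question": question["input_text"],
            "is_impossible": text == "unknown",       # assumed unanswerable marker
            "answers": [text],
        })
    return {"context": record["story"], "qas": qas}


# Made-up miniature record using only the keys seen in the dataset output.
record = {
    "id": "demo",
    "story": "Buda and Pest became one city in 1872.",
    "questions": [
        {"input_text": "When did they merge?"},
        {"input_text": "Who was mayor?"},
    ],
    "answers": [
        {"input_text": "1872"},
        {"input_text": "unknown"},
    ],
}

converted = coqa_to_qas(record)
print(converted["qas"][0])
```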

In [17]:
qa['data'][0]

{'source': 'wikipedia',
 'id': '3zotghdk5ibi9cex97fepx7jetpso7',
 'filename': 'Vatican_Library.txt',
 'story': 'The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to 

We only need three columns of our dataset.

They are text, question and answer.

Below we are taking them and creating a separate CSV file for them.

In [18]:
#required columns in our dataframe
cols = ["text","question","answer"]
#list of lists to create our dataframe
comp_list = []
for index, row in qa.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols)
#saving the dataframe to csv file for further loading
new_df.to_csv("CoQA_data.csv", index=False)

In [19]:
df = pd.read_csv("CoQA_data.csv")
df.head()

Unnamed: 0,text,question,answer
0,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475
1,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research
2,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law"
3,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology"
4,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1513 entries, 0 to 1512
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      1513 non-null   object
 1   question  1513 non-null   object
 2   answer    1513 non-null   object
dtypes: object(3)
memory usage: 35.6+ KB


----------------------------------------------------------------------------------------------------------------------------------------------------------------

LangChain can gather information from multiple places and track the source of the information that is used to answer a query.

Source in this context refers to the source of information within the CSV file. If it is not specified, the source defaults to the path of the CSV file, but you can also name a column of the CSV file whose value is used as the source for each row. Here we use the question column, so each document's source is its question.
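The effect of a source column can be illustrated with the standard library alone. The sketch below only approximates what a CSV loader does (one document per row, with the row rendered as "column: value" lines and the source kept as metadata); `load_rows` is a hypothetical helper and the dictionary layout is an assumption, not the actual LangChain `Document` class.

```python
import csv
import io


def load_rows(csv_text, source_column=None, default_source="CoQA_data.csv"):
    """Turn each CSV row into a document-like dict with source metadata."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Render the whole row as "column: value" lines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Use the chosen column as the source, else fall back to the file path.
        source = row[source_column] if source_column else default_source
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": i}})
    return docs


sample = "text,question,answer\nBudapest was two cities.,How many cities?,two\n"

docs = load_rows(sample, source_column="question")
print(docs[0]["metadata"])

default_docs = load_rows(sample)
print(default_docs[0]["metadata"])
```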

In [9]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='/content/CoQA_data.csv', source_column = 'question')
data = loader.load()

In [None]:
!pip install instructor
!pip install InstructorEmbedding
!pip install -U sentence-transformers==2.2.2

import instructor
from InstructorEmbedding import INSTRUCTOR

Using HuggingFace embeddings

In [None]:
# embeddings
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
embeddings = HuggingFaceInstructEmbeddings()

In [12]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [13]:
# Using FAISS vector database
from langchain_community.vectorstores import FAISS

vectordb = FAISS.from_documents(documents = data, embedding = embeddings)

In [14]:
retriever = vectordb.as_retriever()
aa = retriever.get_relevant_documents("Who is Spiderman?")
aa

[Document(page_content="text: Spiderman is one of the most famous comic book heroes of all time. He was created by Stan Lee in 1963 and was first introduced to the world in the pages of Marvel Comic Books. Spiderman's story is the story of Peter Parker, a child who lost his parents and lives with his aunt and uncle. Peter is a shy, quiet boy wearing glasses and has few friends. One day, on a high school class trip to a science lab, he gets bitten by a special spider. Soon Peter realizes he has amazing powers: he is as strong and quick as a spider and also has a type of sixth sense. He no longer needs his glasses and he can use his super power to fly through the city streets! Remembering something his Uncle Ben has told him _ ,Peter decides to use his powers to fight against enemies who do cruel things to people. And so, Spiderman is born. Life is not easy for Peter even though he is a superhero. He is in love with Mary Jane but he can't tell her about his amazing powers. Besides, his b

Importing GooglePalm and using the API key.

In [16]:
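import os
from langchain.llms import GooglePalm

# Read the key from the environment instead of hard-coding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; set it before running this cell.
api_key = os.environ["GOOGLE_API_KEY"]

llm = GooglePalm(google_api_key = api_key, temperature = 0.6)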

In [17]:
# Testing
llm("write me an email to help me get a job as a data analyst")



'Dear [Hiring Manager name],\n\nI am writing to express my interest in the Data Analyst position at [Company name]. I have been working as a Data Analyst for the past three years, and I have a proven track record of success in extracting insights from data and using them to drive business decisions.\n\nIn my previous role at [Previous company name], I was responsible for developing and implementing data analysis solutions for a variety of business units. I worked closely with stakeholders to understand their needs and then used my technical skills to deliver results that met their expectations. I have a strong understanding of data analysis techniques, including data mining, data visualization, and predictive modeling. I am also proficient in a variety of data analysis software, including SAS, R, and Python.\n\nI am a highly motivated and results-oriented individual with a strong work ethic. I am also a team player and I am able to work effectively with others to achieve common goals.\

The retriever selects the chunks (documents) most relevant to the query, and those chunks are fed to the LLM.

With chain_type = 'stuff', all the retrieved chunks are combined into a single prompt. Each of our documents contains a full story, so the combined prompt can exceed the model's input token limit and relevant chunks may be missed. Hence, the stuff chain_type is not an appropriate choice here.

Instead we will use chain_type="map_reduce".
Map reduce feeds each retrieved chunk to the LLM separately and then combines the per-chunk answers into a final answer.

We set return_source_documents=True because we want to see the sources the answer was generated from.
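The difference between the two chain types can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: `stuff_chain`, `map_reduce_chain` and the prompt wording are hypothetical, and `CountingLLM` is a fake model that only records the prompts it receives.

```python
def stuff_chain(llm, question, chunks):
    """One LLM call: every chunk goes into a single (possibly very long) prompt."""
    prompt = "Context:\n" + "\n\n".join(chunks) + f"\n\nQuestion: {question}"
    return llm(prompt)


def map_reduce_chain(llm, question, chunks):
    """One LLM call per chunk (map), then one call to combine the results (reduce)."""
    partial = [llm(f"Context:\n{chunk}\n\nQuestion: {question}") for chunk in chunks]
    summary = "\n".join(partial)
    return llm(f"Combine these partial answers:\n{summary}\n\nQuestion: {question}")


class CountingLLM:
    """Fake LLM that records each prompt and returns a numbered placeholder answer."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"answer#{len(self.prompts)}"


chunks = ["chunk one", "chunk two", "chunk three"]

stuff_llm = CountingLLM()
stuff_chain(stuff_llm, "q", chunks)

mr_llm = CountingLLM()
final = map_reduce_chain(mr_llm, "q", chunks)

print(len(stuff_llm.prompts), "call for stuff;", len(mr_llm.prompts), "calls for map_reduce")
```

The trade-off is visible in the call counts: map-reduce keeps each prompt small at the cost of one extra LLM call per retrieved chunk plus a final combining call.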

In [22]:
from langchain.chains import RetrievalQA

chain = RetrievalQA.from_chain_type(llm = llm,
            chain_type="map_reduce",
            retriever=retriever,
            input_key = 'query',
            return_source_documents=True
)


#retrievalQA = RetrievalQA.from_llm(llm, type="stuff", retriever=retriever, input_key = 'query', return_source_documents=True)

In [23]:
chain('How many cities are there in Budapest?')



{'query': 'How many cities are there in Budapest?',
 'result': '2',
 'source_documents': [Document(page_content="text: Have you ever been to some big cities in the world? The information below will be helpful to you. Budapest For many centuries, Budapest was two cities, with Buda on the west side of the river Danube and Pest on the east side. Budapest became one city in 1872, and it has been the capital city of Hungary for about eighty years. The population of Budapest is about three million, and the city is a very popular place for tourists. Visitors like to take boat rides along the Danube. Budapest is also known for its exciting nightlife. The best time to visit is summer since Budapest is very cold in winter. Los Angeles Los Angeles was founded in 1781. With 3.5 million people it is now the biggest city in California and the second largest city in the United States. It is famous for its modern highways, its movie stars, and its smog. When the city is really smoggy, you can't see th