#### Generate Description

In [49]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# LangChain imports
from langchain.agents import create_pandas_dataframe_agent
from langchain.chat_models import ChatOpenAI
from langchain.agents.agent_types import AgentType
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredExcelLoader, UnstructuredFileLoader
from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain import PromptTemplate

# Read the OpenAI API key from the environment; never hard-code a secret in a notebook
import os
api_key = os.environ["OPENAI_API_KEY"]
# Display all columns from dataset
pd.set_option('display.max_columns', None)
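
A shared notebook should not carry the API key in a cell; one option is to read it from the environment and fail early with a clear message when it is missing. A minimal sketch, assuming the variable is named `OPENAI_API_KEY`; `load_api_key` is a hypothetical helper, not part of LangChain or the OpenAI client:

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early so later LLM calls do not fail with a less obvious error
        raise RuntimeError(f"Set the {var_name} environment variable before running this notebook")
    return key
```

The returned value can then be passed wherever the notebook uses `api_key`.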

In [2]:
file_ = "../DGDatasets/FLW Feedback on chilli bot.xlsx"
df = pd.ExcelFile(file_)
print(f"This file has {len(df.sheet_names)} sheet(s)")

This file has 3 sheet(s)


In [3]:
pd.read_excel(file_, df.sheet_names[0]).shape

(155, 6)

In [4]:
pd.read_excel(file_, df.sheet_names[1]).shape

(0, 6)

In [5]:
pd.read_excel(file_, df.sheet_names[2]).shape

(0, 6)

In [6]:
sub_df = pd.read_excel(file_)
sub_df.head(3)

Unnamed: 0,S.No,Name of the FLW,Question asked,Response received,Feedback on the response,Over all Observations
0,1.0,B Balaji Naik,Chilli varieties please,"The high yielding varieties in chilli are G3, ...",Appropriate,Response observations from female FLW:\n1. Var...
1,,,,,,
2,2.0,B Balaji Naik,Which pesticides use for black thrips,"To control the sucking pests (Aphids, thrips, ...",Appropriate,Summary: 1. Most of questions asked by the FLW...


1. Generating description using column names only

In [7]:
prompt = PromptTemplate(
    input_variables = ["columns"],
    template = "Explain this dataset based on {columns}"
)
query = prompt.format(columns = list(sub_df.columns))
print(query)

Explain this dataset based on ['S.No', 'Name of the FLW', 'Question asked ', 'Response received ', 'Feedback on the response ', 'Over all Observations ']
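
The formatted prompt shows that several column names carry trailing spaces (for example `'Question asked '`). A minimal sketch of normalizing the names before they go into the prompt, using plain `str.format` as a stand-in for `PromptTemplate`; `build_column_prompt` is a hypothetical helper:

```python
def build_column_prompt(columns, template="Explain this dataset based on {columns}"):
    """Strip surrounding whitespace from column names and fill the template."""
    cleaned = [str(col).strip() for col in columns]
    return template.format(columns=cleaned)

# Example with three of the column names printed above
print(build_column_prompt(["S.No", "Name of the FLW", "Question asked "]))
```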


In [8]:
llm = OpenAI(openai_api_key=api_key, temperature=0)

In [9]:
description1 = llm(query)

In [11]:
print(description1.strip())

This dataset contains information about a Follow-up (FLW) process. It includes the serial number, name of the FLW, the question asked, the response received, feedback on the response, and overall observations. This data can be used to analyze the effectiveness of the FLW process and to identify areas for improvement.


2. Generating description from column names and row values using LangChain PromptTemplate and an OpenAI GPT model

In [13]:
prompt2 = PromptTemplate(
    input_variables=["columns", "rows"],
    template = "Explain this dataset based on {columns} and {rows} information"
    )
query2 = prompt2.format(columns = list(sub_df.columns), rows = sub_df.head())
description2 = llm(query2)

In [15]:
print(description2.strip())

This dataset contains information about the questions asked by a female farmer (FLW) and the responses received from the FLW. The dataset includes the S.No, Name of the FLW, Question asked, Response received, Feedback on the response, and Over all Observations. 

The S.No and Name of the FLW provide information about the individual who asked the question. The Question asked column contains the questions asked by the FLW. The Response received column contains the response received from the FLW. The Feedback on the response column provides feedback on the response received from the FLW. The Over all Observations column provides an overall observation of the response received from the FLW. 

The dataset shows that the FLW asked questions about chilli varieties and pesticides for black thrips. The responses received from the FLW were appropriate and provided observations from the FLW. The overall observations of the responses received from the FLW were also appropriate.


3. Generating description using langchain pandas dataframe agent

In [19]:
agent = create_pandas_dataframe_agent(OpenAI(temperature=0),
                                      sub_df,
                                      verbose=True
)
question = "Explain this dataset using column names and all information available"
description3 = agent.run(question)



> Entering new AgentExecutor chain...
Thought: I should look at the column names and the data in each row
Action: python_repl_ast
Action Input: print(df.columns)
Observation: Index(['S.No', 'Name of the FLW', 'Question asked ', 'Response received ',
       'Feedback on the response ', 'Over all Observations '],
      dtype='object')

Thought: I should look at the data in each row
Action: python_repl_ast
Action Input: print(df.head())
Observation:    S.No Name of the FLW                                    Question asked   \
0   1.0  B Balaji Naik                             Chilli varieties please   
1   NaN             NaN                                                NaN   
2   2.0  B Balaji Naik               Which pesticides use for black thrips   
3   3.0  B Balaji Naik   Please share some pesticides available in the ...   
4   NaN             NaN                                                NaN   

      

In [17]:
print(description3)

This dataset contains information about questions asked by female farmers and the responses received from a bot. The columns include S.No, Name of the FLW, Question asked, Response received, Feedback on the response, and Over all Observations.


In [20]:
question2 = "Explain this dataset using all information available in the dataset"
description4 = agent.run(question2)



> Entering new AgentExecutor chain...
Thought: I should look at the data and think about what it is telling me
Action: python_repl_ast
Action Input: print(df.info())
Observation: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   S.No                       38 non-null     float64
 1   Name of the FLW            44 non-null     object 
 2   Question asked             61 non-null     object 
 3   Response received          99 non-null     object 
 4   Feedback on the response   54 non-null     object 
 5   Over all Observations      2 non-null      object 
dtypes: float64(1), object(5)
memory usage: 7.4+ KB
None
Thought: I can see that this dataset contains information about questions asked by Female Livelihood Workers (FLWs) and the responses they received.
Actio

In [21]:
print(description4)

This dataset contains information about questions asked by Female Livelihood Workers (FLWs) and the responses they received. It contains 155 entries, 6 columns, and the S.No column has 38 non-null values with a mean of 19.68.


In [22]:
# pass prompt from template as question to the agent
description5 = agent.run(query2)



> Entering new AgentExecutor chain...
Thought: I need to understand the data in the dataframe
Action: python_repl_ast
Action Input: df.info()
Observation: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   S.No                       38 non-null     float64
 1   Name of the FLW            44 non-null     object 
 2   Question asked             61 non-null     object 
 3   Response received          99 non-null     object 
 4   Feedback on the response   54 non-null     object 
 5   Over all Observations      2 non-null      object 
dtypes: float64(1), object(5)
memory usage: 7.4+ KB
Thought: I now understand the data in the dataframe
Final Answer: This dataset contains 155 entries with 6 columns. The columns are 'S.No', 'Name of the FLW', 'Question asked', 'Resp

In [23]:
print(description5)

This dataset contains 155 entries with 6 columns. The columns are 'S.No', 'Name of the FLW', 'Question asked', 'Response received', 'Feedback on the response', and 'Over all Observations'. The data types are float64 for 'S.No' and object for the other columns. There are 38 non-null entries for 'S.No', 44 non-null entries for 'Name of the FLW', 61 non-null entries for 'Question asked', 99 non-null entries for 'Response received', 54 non-null entries for 'Feedback on the response', and 2 non-null entries for 'Over all Observations'.


In [24]:
prompt3 = PromptTemplate(
    input_variables=["columns", "rows"],
    template = "Generate a summary about this dataset based on {columns} and {rows} information"
    )
query3 = prompt3.format(columns = list(sub_df.columns), rows = sub_df.head())

In [25]:
description6 = agent.run(query3)



> Entering new AgentExecutor chain...
Thought: I need to look at the data and think about what it is telling me
Action: python_repl_ast
Action Input: df.describe()
Observation:             S.No
count  38.000000
mean   19.684211
std    11.380650
min     1.000000
25%    10.250000
50%    19.500000
75%    28.750000
max    39.000000
Thought: I need to look at the data and think about what it is telling me
Action: python_repl_ast
Action Input: df.info()
Observation: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   S.No                       38 non-null     float64
 1   Name of the FLW            44 non-null     object 
 2   Question asked             61 non-null     object 
 3   Response received          99 non-null     object 
 4   Feedback on the

4. Generate description using RetrievalQA

In [27]:
# load doc
loader = UnstructuredExcelLoader(file_)
doc = loader.load()
print(f"There is {len(doc)} document(s) with {len(doc[0].page_content)} characters")

There is 1 document(s) with 45428 characters


In [29]:
# split docs 
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=400)
docs = text_splitter.split_documents(doc)
# Get the total number of characters and the average per doc
total_num_char = sum([len(x.page_content) for x in docs])
print(f"Now you have {len(docs)} documents that have an average of {total_num_char / len(docs):,.0f} characters")

Now you have 18 documents that have an average of 2,668 characters
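
The `chunk_size` and `chunk_overlap` settings can be pictured as a sliding character window. The sketch below is a simplified illustration of that idea only; it is not the algorithm `RecursiveCharacterTextSplitter` uses, which also looks for separators when choosing split points. `split_with_overlap` is a hypothetical name:

```python
def split_with_overlap(text, chunk_size=3000, chunk_overlap=400):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# A dummy string with the same length as the loaded document
chunks = split_with_overlap("x" * 45428)
print(len(chunks))
```

With these settings each window advances 2,600 characters, so a 45,428-character string needs 18 windows. The average chunk length differs from the 2,668 reported above, presumably because the real splitter cuts at separators instead of fixed offsets.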


In [30]:
# creating vectors db
# Note: only the first 4 of the 18 chunks are embedded, so retrieval only sees part of the file
embeddings = OpenAIEmbeddings(openai_api_key=api_key)
docsearch = FAISS.from_documents(docs[:4], embeddings)

In [33]:
# Create retrieval QA engine
qa = RetrievalQA.from_chain_type(llm = llm,
                                chain_type = "stuff",
                                retriever = docsearch.as_retriever(),
                                verbose = True)

In [39]:
query4 = "explain this dataset"
desc = qa.run(query4)
print(desc)



> Entering new RetrievalQA chain...

> Finished chain.
 This dataset contains information about conversations between farmers and a bot about chilli cultivation. The conversations include questions about chilli varieties, pests, and cultivation techniques. The dataset also includes observations from female farmers about the bot's responses.


In [40]:
# create method for prompt template
# create method for Retrieval QA
# generate description 3 times then summarize the description, pass the 3 desc and dataset to chatgpt
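
The last note above (generate several descriptions, then summarize them) could start from a prompt that lists the candidate descriptions together with the column names. A minimal sketch using only string formatting; `build_summary_prompt` and its wording are assumptions, not something defined elsewhere in this notebook:

```python
def build_summary_prompt(descriptions, columns):
    """Combine several candidate descriptions into one summarization prompt."""
    # Drop empty candidates and trim stray whitespace from LLM output
    cleaned = [d.strip() for d in descriptions if d and d.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty description is required")
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(cleaned, start=1))
    return (
        "Summarize the following candidate descriptions of one dataset "
        "into a single description.\n"
        f"Columns: {list(columns)}\n"
        f"Candidate descriptions:\n{numbered}"
    )
```

The resulting string can be sent to the model the same way as the other queries, for example `llm(build_summary_prompt([...], sub_df.columns))`.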

### Wrapping the description generations in functions

In [41]:
# function for prompt engineering
def create_prompt(col, row):
    prompt = PromptTemplate(
        input_variables=["columns", "rows"],
        template="Generate a summary about this dataset based on {columns} and {rows} information",
    )
    query = prompt.format(columns=list(col), rows=row)
    return query

In [43]:
# function to generate description with pandas agent
def generate_desc_pd_agent(df, question):
    agent = create_pandas_dataframe_agent(OpenAI(temperature=0),
                                         df,
                                         verbose=True
                                         )
    description = agent.run(question)
    return description

In [108]:
# function to generate description from chatGPT model
def gen_desc_gpt(query):
    description = llm(query)
    return description

In [120]:
# function to generate description with Retrieval QA
def gen_desc_rQA(file, api_k, llm, query):
    # Build the fallback prompt from the column names and the first rows
    df = pd.read_excel(file)
    query_gpt = create_prompt(df.columns, df.head())
    loader = UnstructuredFileLoader(file)
    doc = loader.load()
    
    # split docs
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=400)
    docs = text_splitter.split_documents(doc)
    
    # creating vectors db (only the first 4 chunks are embedded)
    embeddings = OpenAIEmbeddings(openai_api_key=api_k)
    docsearch = FAISS.from_documents(docs[:4], embeddings)
    
    # Create retrieval QA engine
    qa = RetrievalQA.from_chain_type(llm = llm,
                                     chain_type = "stuff",
                                     retriever = docsearch.as_retriever(),
                                     verbose = True)
    
    try:
        description = qa.run(query)
    except Exception:
        # If the retrieval chain fails, fall back to the plain prompt
        print("Error message:")
        description = gen_desc_gpt(query_gpt)
    # Return after the try block; a return inside finally would hide unexpected errors
    return description
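
The fallback in `gen_desc_rQA` prints a fixed label, so the reason for a failure is not visible in the output. A minimal sketch of the same try-then-fall-back pattern that also reports the exception; `run_with_fallback` is a hypothetical helper and takes two zero-argument callables:

```python
def run_with_fallback(primary, fallback):
    """Call primary(); if it raises, report the error and call fallback()."""
    try:
        return primary()
    except Exception as err:
        # Report the exception type and message so the failure is visible
        print(f"Error message: {type(err).__name__}: {err}")
        return fallback()
```

Inside the function above this would read `run_with_fallback(lambda: qa.run(query), lambda: gen_desc_gpt(query_gpt))`.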

##### a. Generate description from the pandas dataframe agent

In [68]:
question = "explain this dataset using all information provided in dataset"
desc_agent = generate_desc_pd_agent(sub_df, question)



> Entering new AgentExecutor chain...
Thought: I need to look at the data and think about what it is telling me
Action: python_repl_ast
Action Input: print(df.head())
Observation:    S.No Name of the FLW                                    Question asked   \
0   1.0  B Balaji Naik                             Chilli varieties please   
1   NaN             NaN                                                NaN   
2   2.0  B Balaji Naik               Which pesticides use for black thrips   
3   3.0  B Balaji Naik   Please share some pesticides available in the ...   
4   NaN             NaN                                                NaN   

                                  Response received   \
0  The high yielding varieties in chilli are G3, ...   
1                                                NaN   
2  To control the sucking pests (Aphids, thrips, ...   
3  Here are some pesticides that can be used to c...   
4                              

In [69]:
print(desc_agent)

This dataset contains information about questions asked by female farmers and the responses they received. It includes the question asked, the response received, the feedback on the response, and overall observations.


##### b. Generate description from the OpenAI GPT model

In [109]:
query_gpt = create_prompt(sub_df.columns, sub_df.head())
desc_gpt = gen_desc_gpt(query_gpt)

In [110]:
print(desc_gpt.strip())

This dataset contains information about the questions asked by a female farmer (FLW) and the responses received from the FLW. The responses received were appropriate and the feedback on the response was also appropriate. Overall, the observations from the female FLW were found to be satisfactory. The questions asked by the FLW included chilli varieties, pesticides for black thrips, and other pesticides available in the market. The responses received provided information about high yielding chilli varieties, pesticides to control sucking pests, and other pesticides available in the market.


##### c. Generate description from retrieval QA

In [123]:
query = "Explain this dataset"
desc_qa = gen_desc_rQA(file_, api_key, llm, query)



> Entering new RetrievalQA chain...

> Finished chain.


In [122]:
print(desc_qa)

 This dataset contains information about the responses of a bot to questions asked by female farmers in India. The dataset includes the questions asked, the responses given by the bot, and whether the response was appropriate or not.


In [85]:
file_2 = "../DGDatasets/Woreda DA Registry Records.xlsx"

In [124]:
query = "Explain this dataset based on all available information"
desc_qa = gen_desc_rQA(file_2, api_key, llm, query)



> Entering new RetrievalQA chain...
Error message:


In [125]:
print(desc_qa.strip())

This dataset contains information about 5 individuals from the Amhara region in Ethiopia. All of the individuals are either male or female and their ages range from 1984 to 1968. Their marital status is all listed as 'ያላገባ' and their education levels range from 'level4' to 'ድግሪ'. All of the individuals are employed in the position of 'እንሰሳትእርባታ' and their employment dates range from 2005 to 1992. The individuals are all from the North Shewa zone, Basona Worena woreda, and their kebeles range from 'Me/Amba' to 'Delila'.


In [126]:
pd.read_excel(file_2).head()

Unnamed: 0,Salutation,Name,Father Name,Grand Father Name,Sex,Birth Month,Birth Year,Maritial Status,Phone No,Alternate Phone No (Optional),Email,Education Level,Specialization,Specialization (Other),Position,Employment Month (Ethiopian Calendar),Employment Year (Ethiopian Calendar),Assignment Month at Kebele (Ethiopian Calendar),Assignment Year at Kebele (Ethiopian Calendar),Pension Number,Region,Zone,Woreda,Kebele,Kebele (Translated),CIAT Equivalent Kebele,Similarity Index (Confidence)
0,አቶ,ይበልጣል,ደሴ,ባያብል,Male,8.0,1984,ያላገባ,941134440.0,,,level4,እን/ሀ/ል/ማ/ኤክስቴንሽን,,እንሰሳትእርባታ,5,2005,5,2005.0,,Amhara,North Shewa,Basona Worena,መ/አምባ,Me/Amba,Amba,90.0
1,ወ/ሪት,የሺአረግ,ዘነበ,ብዙነህ,Female,12.0,1976,ያላገባ,921744060.0,,,ድግሪ,እንሰሳትሳይንስ,,እንሰሳትእርባታ,1,1997,1,1997.0,,Amhara,North Shewa,Basona Worena,ሳሪያ,Sariya,Sariya,100.0
2,አቶ,አበራ,አበበ,ያግቡ,Male,2.0,1968,ያላገባ,912907732.0,,,ድግሪ,እንሰሳትሳይንስ,,እንሰሳትእርባታ,11,1992,11,1992.0,,Amhara,North Shewa,Basona Worena,ባቄሎ,Bakelo,Bakelo,100.0
3,ወ/ሪት,ሰውሃረግ,አለሙ,ሞላ,Male,2.0,1987,ያላገባ,923547939.0,,,level4,ዲያሪፕሮዳከሽንቴክኒክ,,እንሰሳትእርባታ,7,2005,7,2005.0,,Amhara,North Shewa,Basona Worena,ውሻውሽኝ,Wshawshny,Wushawshegn,80.0
4,አቶ,ሞላ,ሲሳይ,ተፈራ,Male,,1984,ያላገባ,922919013.0,,,ድግሪ,,,እንሰሳትእርባታ,7,2010,7,2010.0,,Amhara,North Shewa,Basona Worena,ደሊላ,Delila,Del,90.0


In [127]:
question2 = "explain this dataset using all information provided in dataset"
desc_agent3 = generate_desc_pd_agent(pd.read_excel(file_2), question2)



> Entering new AgentExecutor chain...
Thought: I should look at the column names and the data in each column
Action: python_repl_ast
Action Input: print(df.columns)
Observation: Index(['Salutation', 'Name', 'Father Name', 'Grand Father Name', 'Sex',
       'Birth Month', 'Birth Year', 'Maritial Status', 'Phone No',
       'Alternate Phone No (Optional)', 'Email', 'Education Level',
       'Specialization', 'Specialization (Other)', 'Position',
       'Employment Month (Ethiopian Calendar)',
       'Employment Year (Ethiopian Calendar)',
       'Assignment Month at Kebele (Ethiopian Calendar)',
       'Assignment Year at Kebele (Ethiopian Calendar)', 'Pension Number',
       'Region', 'Zone', 'Woreda', 'Kebele', 'Kebele (Translated)',
       'CIAT Equivalent Kebele', 'Similarity Index (Confidence)'],
      dtype='object')
Thought: I should look at the data in each column
Action: python_repl_ast
Action Input: print(df.head())


In [128]:
print(desc_agent3)

This dataset contains information about individuals such as their salutation, name, father name, grand father name, sex, birth month, birth year, maritial status, phone number, alternate phone number, email, education level, specialization, position, employment month and year, assignment month and year at kebele, pension number, region, zone, woreda, kebele, kebele (translated), CIAT equivalent kebele, and similarity index (confidence).


In [129]:
sub_df2 = pd.read_excel(file_2)
prompt7 = PromptTemplate(
    input_variables = ["columns"],
    template = "Explain this dataset based on {columns}"
)
query7 = prompt7.format(columns = list(sub_df2.columns))
print(query7)

Explain this dataset based on ['Salutation', 'Name', 'Father Name', 'Grand Father Name', 'Sex', 'Birth Month', 'Birth Year', 'Maritial Status', 'Phone No', 'Alternate Phone No (Optional)', 'Email', 'Education Level', 'Specialization', 'Specialization (Other)', 'Position', 'Employment Month (Ethiopian Calendar)', 'Employment Year (Ethiopian Calendar)', 'Assignment Month at Kebele (Ethiopian Calendar)', 'Assignment Year at Kebele (Ethiopian Calendar)', 'Pension Number', 'Region', 'Zone', 'Woreda', 'Kebele', 'Kebele (Translated)', 'CIAT Equivalent Kebele', 'Similarity Index (Confidence)']


In [130]:
desc10 = llm(query7)

In [132]:
print(desc10.strip())

This dataset contains information about individuals in Ethiopia. It includes their salutation, name, father's name, grandfather's name, sex, birth month and year, maritial status, phone number, alternate phone number (optional), email, education level, specialization, specialization (other), position, employment month and year (in the Ethiopian calendar), assignment month and year at Kebele (in the Ethiopian calendar), pension number, region, zone, woreda, kebele, kebele (translated), CIAT equivalent kebele, and similarity index (confidence). This data could be used to track individuals in Ethiopia, as well as to analyze trends in education, employment, and other demographic information.
