## Big Data Analytics with LLM - OpenAI ChatGPT

Install and import the required libraries

In [14]:
!pip install pandas numpy






In [15]:
import pandas as pd
import numpy as np
from random import choices

### 1st Approach
We create a dummy sales dataset with NumPy and Python's `random` module, then work with the resulting DataFrame.

In [16]:
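np.random.seed(10)  # seeds NumPy only; random.choices below is not affected
num_records = 1000  # number of rows in the dummy dataset
num_outlets = 100
num_products = 30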

In [17]:
# Generating date range for the first half of 2023 (Jan-Jun)
dates = pd.date_range(start='2023-01-01', end='2023-06-30')

# Selecting random dates from the generated date range
dates = choices(dates, k=num_records)
len(dates)

1000

In [18]:
outlets = ['Outlet_'+str(i+1) for i in range(num_outlets)]
# Selecting random outlets from the given list
outlets = choices(outlets, k=num_records)
len(outlets)

1000

In [19]:
products = ["Product_"+str(i+1) for i in range(num_products)]
products = choices(products, k=num_records)
len(products)

1000
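The three cells above draw with `random.choices`, which uses Python's built-in `random` module. `np.random.seed(10)` does not seed that module, so the dates, outlets, and products change on every run. A minimal sketch of making those draws repeatable, assuming a call to `random.seed` is acceptable here (the names `outlet_names`, `first_draw`, and `second_draw` are illustrative and not part of the notebook):

```python
import random

num_outlets = 100
outlet_names = ['Outlet_' + str(i + 1) for i in range(num_outlets)]

# Seed Python's own generator; np.random.seed has no effect on it
random.seed(10)
first_draw = random.choices(outlet_names, k=5)

# Re-seeding with the same value reproduces the same selection
random.seed(10)
second_draw = random.choices(outlet_names, k=5)

print(first_draw == second_draw)
```

Calling `random.seed(10)` once, next to `np.random.seed(10)`, would be enough in the notebook itself.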

In [20]:
# Generating random units sold from 1 to 299 (randint's upper bound, 300, is exclusive)
units_sold = np.random.randint(1, 300, num_records)

# Generating random price per unit within the range of 10 to 50
price_per_unit = np.random.uniform(10, 50, num_records)

# Calculating total sales by multiplying units sold with price per unit
total_sales = units_sold * price_per_unit
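`np.random.randint(low, high, size)` excludes `high`, so the call above never produces 300. The sketch below checks this on a larger sample and shows one way to include 300, assuming that is the intent (the names `sample` and `inclusive_sample` are illustrative):

```python
import numpy as np

np.random.seed(10)

# high=300 is exclusive: values fall in 1..299
sample = np.random.randint(1, 300, 100_000)

# Passing high=301 lets 300 appear as well
inclusive_sample = np.random.randint(1, 301, 100_000)

print(sample.min(), sample.max())
print(inclusive_sample.min(), inclusive_sample.max())
```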

In [21]:
df = pd.DataFrame({
    'Date': dates,
    'outlets' : outlets,
    'Products' : products,
    'Unit_sold': units_sold,
    'Price_Per_Unit': price_per_unit,
    'Total_Sales' : total_sales
})
df

| | Date | outlets | Products | Unit_sold | Price_Per_Unit | Total_Sales |
|---|---|---|---|---|---|---|
| 0 | 2023-06-18 | Outlet_56 | Product_1 | 266 | 24.302336 | 6464.421437 |
| 1 | 2023-05-11 | Outlet_75 | Product_29 | 126 | 47.104760 | 5935.199755 |
| 2 | 2023-01-08 | Outlet_96 | Product_11 | 16 | 45.284050 | 724.544792 |
| 3 | 2023-04-24 | Outlet_59 | Product_18 | 124 | 48.305120 | 5989.834844 |
| 4 | 2023-02-05 | Outlet_63 | Product_10 | 157 | 18.847811 | 2959.106399 |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 2023-05-04 | Outlet_73 | Product_18 | 105 | 12.138754 | 1274.569156 |
| 996 | 2023-03-08 | Outlet_23 | Product_30 | 227 | 41.585490 | 9439.906214 |
| 997 | 2023-02-05 | Outlet_19 | Product_24 | 203 | 31.168276 | 6327.159974 |
| 998 | 2023-06-22 | Outlet_59 | Product_5 | 162 | 20.958885 | 3395.339443 |
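The price and sales columns print with six decimal places. If two decimals are preferred, `DataFrame.round` accepts a per-column mapping. A minimal sketch on a two-row frame that copies the first two rows shown above (`sample_df` and `rounded` are illustrative names):

```python
import pandas as pd

# Two rows copied from the output above, for illustration only
sample_df = pd.DataFrame({
    'Products': ['Product_1', 'Product_29'],
    'Unit_sold': [266, 126],
    'Price_Per_Unit': [24.302336, 47.104760],
    'Total_Sales': [6464.421437, 5935.199755],
})

# Round only the monetary columns; other columns are left untouched
rounded = sample_df.round({'Price_Per_Unit': 2, 'Total_Sales': 2})
print(rounded)
```

This returns a new frame, so the full-precision values stay available in the original.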


Start Preparing the LLM Environment

In [22]:
# In order to use OpenAI models, you need to have an OpenAI API key. 
OPEN_AI_KEY = 'your-key'
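Hardcoding a real key in a notebook makes it easy to leak when the notebook is shared. One common alternative is to read it from an environment variable. The sketch below assumes the key is stored under `OPENAI_API_KEY`, a name chosen here for illustration, and falls back to the same placeholder as above:

```python
import os

# Read the key from the environment; the variable name is an assumption.
# Falls back to the placeholder string when the variable is not set.
OPEN_AI_KEY = os.environ.get('OPENAI_API_KEY', 'your-key')

print('key loaded from environment:', 'OPENAI_API_KEY' in os.environ)
```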

- PandasAI supports several large language models (LLMs). It uses the LLM to generate code from a natural language query, then executes that code to produce the result.

In [23]:
# Installing and importing PandasAI library for using LLMs
!pip install pandasai






In [24]:
from pandasai import SmartDataframe
from pandasai.llm.openai import OpenAI

In [25]:
# Instantiating OpenAI LLM object with the provided API key
llm = OpenAI(api_token=OPEN_AI_KEY)

# Converting the DataFrame to a SmartDataframe with LLM configuration
df = SmartDataframe(df, config={"llm": llm})

In [26]:
# Querying the LLM to find the product with the highest total sales
highest_product = df.chat('Which product has the highest total_sales?')
highest_product

'The product with the highest total sales is: Product_25'
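Because the answer comes from LLM-generated code, it is worth cross-checking against plain pandas. The question is also ambiguous: "highest total_sales" could mean the largest single row or the largest sum per product. The sketch below computes both readings on a small hypothetical frame that reuses the notebook's column names (the data and the names `sales`, `totals`, `top_by_sum`, and `top_single_row` are made up for illustration):

```python
import pandas as pd

# Hypothetical data with the same column names as the notebook's DataFrame
sales = pd.DataFrame({
    'Products': ['Product_1', 'Product_2', 'Product_1', 'Product_3'],
    'Total_Sales': [100.0, 250.0, 200.0, 50.0],
})

# Reading 1: product with the largest summed sales across all rows
totals = sales.groupby('Products')['Total_Sales'].sum()
top_by_sum = totals.idxmax()

# Reading 2: product on the single row with the largest Total_Sales
top_single_row = sales.loc[sales['Total_Sales'].idxmax(), 'Products']

print(top_by_sum, top_single_row)
```

In the notebook, the same two expressions would need to run on the original pandas DataFrame, kept under a separate name before it is wrapped in `SmartDataframe`.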

In [None]:
# Asking the LLM to plot a chart of the products by total sales
df.chat("Plot the chart of the products based on total_sales")
df