# Goal:
1. Use prompt engineering to read PDF content and extract it into a JSON file
2. Visualize the result with a simple chart
3. See the full tutorial: [Link](https://learn.deeplearning.ai/chatgpt-prompt-eng/lesson/1/lesson_1)

In [9]:
# set up
import openai
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.getenv('OPENAI_API_KEY')

In [10]:
# Andrew mentioned that the prompt/completion paradigm is preferable for this class
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content

In [11]:
prompt = "Estimate the market value of supervised learning in the US in 2020."
print(get_completion(prompt))

It is challenging to provide an accurate estimate of the market value of supervised learning in the US in 2020 as there is no specific data available for this specific market segment. However, we can provide a rough estimate based on the overall artificial intelligence (AI) market and its growth.

According to a report by Grand View Research, the global AI market size was valued at $39.9 billion in 2019 and is expected to grow at a compound annual growth rate (CAGR) of 42.2% from 2020 to 2027. The US is one of the leading countries in AI adoption and development, accounting for a significant portion of the global market.

Supervised learning is a fundamental component of AI, and it plays a crucial role in various industries such as healthcare, finance, retail, and more. As AI adoption continues to increase across these sectors, the demand for supervised learning solutions is also expected to grow.

Considering the overall AI market growth and the significance of supervised learning wit

In [13]:
prompt = "How many people die due to jaywalking in the world each year?"
print(get_completion(prompt))

It is difficult to provide an exact number of deaths caused by jaywalking worldwide each year, as data collection and reporting methods vary across countries. Additionally, jaywalking-related deaths may be classified differently in different regions. However, it is known that pedestrian fatalities occur due to various factors, including jaywalking, distracted walking, and other unsafe behaviors. According to the World Health Organization (WHO), globally, around 270,000 pedestrians lose their lives in road traffic crashes each year. It is important to note that this figure includes all pedestrian fatalities, not just those specifically caused by jaywalking.


In [None]:
df = pd.read_csv("./data/abstracts-machine learning urban planning-20231203-1932rows.csv")
df['Abstract'] = df['Abstract'].astype(str)
df['Abstract'] = df['Abstract'].str.replace('\n', ' ')
abstracts = df['Abstract'].values
# optionally split the abstracts into 100 roughly equal batches
# abstracts_ls = np.array_split(abstracts, 100)
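
The commented-out `np.array_split` call would batch the abstracts so that several could be sent per request. Here is a minimal sketch of how it divides an array, using placeholder strings instead of the real abstracts. The count of 1932 is taken from the CSV file name and is an assumption about the data.

```python
import numpy as np

# Placeholder data standing in for the real abstracts (hypothetical).
demo_abstracts = np.array([f"abstract {i}" for i in range(1932)])

# array_split takes the number of sections and tolerates uneven division.
batches = np.array_split(demo_abstracts, 100)

batch_sizes = sorted({len(b) for b in batches})
print(len(batches), batch_sizes)  # 100 batches holding 19 or 20 items each
```

Splitting into 100 sections therefore gives batches of about 20 abstracts, not 200; to get batches of roughly 200, the section count would need to be about 10.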

In [None]:
prompt = """
Read the array of paragraphs, delimited by triple 
backticks below, and summarize the specific machine learning methods or models used in a list format. 
The summarized methods should not have duplicates. It cannot be a general term such as 
"machine learning" or "supervised learning" or "unsupervised learning".
The summary should be in the following format:
[
    "method_1",
    "method_2",
    ...
    ]

```
{abs_text}
```
"""

In [None]:
# model="gpt-3.5-turbo"
# messages = [{"role": "user", "content": prompt}]
# response = openai.chat.completions.create(
#         model=model,
#         messages=messages,
#         temperature=0, # this is the degree of randomness of the model's output
#     )
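
The `{abs_text}` placeholder in the prompt template is filled with `str.format`. The sketch below uses a shortened, hypothetical template and a made-up abstract. It shows that formatting returns a new string, so the template stays reusable as long as the result is stored under a different name.

```python
# Shortened stand-in for the prompt template (hypothetical wording and delimiters).
template = "List the methods used in the text between the markers.\n<<<\n{abs_text}\n>>>"

# Made-up abstract, for illustration only.
sample_abstract = "We train a random forest to predict land use change."

# str.format returns a new string; `template` itself is not modified.
filled = template.format(abs_text=sample_abstract)

print(filled)
print("{abs_text}" in template)  # the template still has its placeholder
```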

In [None]:
task_ls = []
for i, temp in enumerate(abstracts[:2]):
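    # keep the template intact: reassigning `prompt` would drop the {abs_text}
    # placeholder, so every later iteration would resend the first abstract
    filled_prompt = prompt.format(abs_text=temp)
    completion = get_completion(filled_prompt)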
    task_ls.append(completion)
    print(f"Completed task {i+1}")
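
Each completion is a plain string, so it has to be parsed before the methods can be counted. The notebook does not show how the completions were aggregated; the sketch below is only one plausible route. It assumes the model returns a valid JSON list as the prompt requests, which it may not always do, hence the fallback. `parse_methods` is a hypothetical helper and `sample_completions` is made-up data.

```python
import json
from collections import Counter


def parse_methods(completion):
    """Parse one completion into a list of lower-cased method names."""
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return []  # skip replies that are not valid JSON
    if not isinstance(parsed, list):
        return []
    # de-duplicate within one reply so each abstract counts a method once
    return sorted({str(m).strip().lower() for m in parsed})


# Made-up completions, for illustration only.
sample_completions = [
    '["Random Forest", "Support Vector Machine"]',
    '["random forest", "LSTM"]',
    "Sorry, no specific method is mentioned.",
]

counts = Counter()
for completion in sample_completions:
    counts.update(parse_methods(completion))

print(counts.most_common())
```

Lower-casing the names before counting merges variants such as "Random Forest" and "random forest"; synonyms and abbreviations would still need manual cleanup.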

In [None]:
# these are the full results
methodsummary = {
  "random forest": 451,
  "neural network": 404,
  "support vector machine": 358,
  "convolutional neural network": 158,
  "gradient boosting": 134,
  "decision tree": 119,
  "clustering": 104,
  "bayesian": 54,
  "lstm": 52,
  "feature selection": 28
}
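
To read the counts as proportions, they can be loaded into a pandas Series. This sketch reuses the numbers above. The shares are relative to the sum of these ten counts, not to the number of abstracts, on the assumption that one abstract may mention several methods.

```python
import pandas as pd

# Same counts as in the cell above.
methodsummary = {
    "random forest": 451,
    "neural network": 404,
    "support vector machine": 358,
    "convolutional neural network": 158,
    "gradient boosting": 134,
    "decision tree": 119,
    "clustering": 104,
    "bayesian": 54,
    "lstm": 52,
    "feature selection": 28,
}

counts = pd.Series(methodsummary).sort_values(ascending=False)

# Percentage of all listed method mentions, rounded to one decimal place.
shares = (counts / counts.sum() * 100).round(1)

print(shares.head(3))
```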

In [None]:
# visualize the result
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(5, 5))
sns.barplot(y=list(methodsummary.keys()), 
            x=list(methodsummary.values()),
            ax=ax,
            palette="Set2")
sns.despine(left=True, bottom=True)
# plt.xticks(rotation=45)