# Introduction to Large Language Models with GPT & LangChain

[ChatGPT](https://chat.openai.com/) is wildly popular, with over a billion visits per month. Although this web interface is great for many non-technical use cases, for programming and automation tasks, it is better to access GPT (the AI that powers ChatGPT) via the OpenAI API.

As well as GPT, you'll also make use of LangChain, a programming framework for working with generative AI.

You'll cover:

- Getting set up with an OpenAI developer account and integration with Workspace.
- Calling the chat functionality in the OpenAI API, with and without LangChain.
- Simple prompt engineering.
- Holding a conversation with GPT.
- Ideas for incorporating GPT into a data analysis or data science workflow.

You'll be using GPT to explore [a dataset](https://catalog.data.gov/dataset/electric-vehicle-population-data) about electric cars in Washington state, USA. 

## Before you begin

You'll need a developer account with OpenAI.

See *getting-started.ipynb* for steps on how to create an API key and store it in Workspace. In particular, you'll need to follow the instructions in the "Getting started with OpenAI" and "Setting up Workspace Integrations" sections.

## Task 0: Setup

We need to install the `langchain` package. This is currently being developed quickly, sometimes with breaking changes, so we fix the version.

The `langchain` package depends on a recent version of `typing_extensions`, so we need to update that package, again fixing the version.

### Instructions

Run the following code to install `langchain` and `typing_extensions`.

In [1]:
# Install the langchain package
# !pip install langchain==0.0.300

In [2]:
# Update the typing_extensions package
# !pip install typing_extensions==4.8.0
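
Because the versions are pinned, it can be worth confirming what is actually installed before going further. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the pinned versions are simply the ones named in the install cells above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Versions pinned in this notebook's install cells
pinned = {"langchain": "0.0.300", "typing_extensions": "4.8.0"}

for name, wanted in pinned.items():
    found = installed_version(name)
    status = "ok" if found == wanted else "differs from the pinned version"
    print(f"{name}: wanted {wanted}, found {found} ({status})")
```

If a package is reported as differing, rerunning the matching install cell is the simplest fix.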

In order to chat with GPT, we first need to load the `openai` and `os` packages and set the API key from the environment variable you just created.

### Instructions

- Import the `os` package.
- Import the `openai` package.
- Set `openai.api_key` to the `OPENAI_API_KEY` environment variable.

In [3]:
# Import the os package
import os

# Import the openai package
import openai

# Set openai.api_key to the OPENAI_API_KEY environment variable
openai.api_key = os.getenv('OPENAI_API_KEY')
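
`os.getenv()` quietly returns `None` when the variable is missing, so a typo in the variable name only shows up later as an authentication error. As a hedged sketch, a small wrapper can fail early with a clearer message; `get_api_key` is a hypothetical helper, not part of the `openai` package.

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, raising a clear error if it is missing."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "Create an API key and store it before running the notebook."
        )
    return key
```

In this notebook it could be used as `openai.api_key = get_api_key()`.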

We need to import the `langchain` package. It has many submodules, so to save typing later, we'll also import some specific functions from those submodules.

### Instructions

- Import the `langchain` package as `lc`.
- From the `langchain.chat_models` module, import `ChatOpenAI`.
- From the `langchain.schema` module, import `AIMessage`, `HumanMessage`, `SystemMessage`.

In [4]:
# Import the langchain package as lc
import langchain as lc

# From the langchain.chat_models module, import ChatOpenAI
from langchain.chat_models import ChatOpenAI

# From the langchain.schema module, import AIMessage, HumanMessage, SystemMessage
from langchain.schema import AIMessage, HumanMessage, SystemMessage

You'll also need to do some light data manipulation with the `pandas` package and data visualization with `plotly.express`. Finally, the `IPython.display` package contains functions for displaying Markdown content nicely.

### Instructions

Import the following packages.

- Import `pandas` using the alias `pd`.
- Import `plotly.express` using the alias `px`.
- From the `IPython.display` package, import `display` and `Markdown`.

In [5]:
# Import pandas using the alias pd
import pandas as pd

# Import plotly.express using the alias px
import plotly.express as px

# From the IPython.display package, import display and Markdown
from IPython.display import display, Markdown



## Task 1: Import the Electric Cars Data

The electric cars data is contained in a CSV file named `electric_cars.csv`.

Each row in the dataset represents the count of the number of cars registered within a city, for a particular model.

The dataset contains the following columns.

- `city` (character): The city in which the registered owner resides.
- `county` (character): The county in which the registered owner resides.
- `model_year` (integer): The [model year](https://en.wikipedia.org/wiki/Model_year#United_States_and_Canada) of the car.
- `make` (character): The manufacturer of the car.
- `model` (character): The model of the car.
- `electric_vehicle_type` (character): Either "Plug-in Hybrid Electric Vehicle (PHEV)" or "Battery Electric Vehicle (BEV)".
- `n_cars` (integer): The count of the number of vehicles registered.

Our first step is to import and print the data.

### Instructions

Import the electric cars data to a pandas dataframe.

- Read the data from `electric_cars.csv`. Assign to `electric_cars`.
- Display a description of the numeric columns of `electric_cars`.
- Display a description of the object columns of `electric_cars`.
- Print the whole dataset. 

In [6]:
# Read the data from electric_cars.csv. Assign to electric_cars.
electric_cars = pd.read_csv("data/electric_cars.csv")

# Display a description of the numeric columns
print("Description of numeric columns\n")
display(electric_cars.describe())

# Display a description of the text (object) columns
print("Description of text columns\n")
display(electric_cars.describe(include="O"))

# Print the whole dataset
print("The electric cars dataset\n")
display(electric_cars)

Description of numeric columns



| | model_year | n_cars |
|---|---|---|
| count | 26813.0 | 26813.0 |
| mean | 2019.375527 | 5.612166 |
| std | 3.286257 | 26.997325 |
| min | 1997.0 | 1.0 |
| 25% | 2017.0 | 1.0 |
| 50% | 2020.0 | 2.0 |
| 75% | 2022.0 | 4.0 |
| max | 2024.0 | 1514.0 |


Description of text columns



| | city | county | make | model | electric_vehicle_type |
|---|---|---|---|---|---|
| count | 26813 | 26813 | 26813 | 26813 | 26813 |
| unique | 683 | 183 | 37 | 127 | 2 |
| top | Bothell | King | TESLA | LEAF | Battery Electric Vehicle (BEV) |
| freq | 479 | 7066 | 5071 | 1889 | 15885 |


The electric cars dataset



| | city | county | model_year | make | model | electric_vehicle_type | n_cars |
|---|---|---|---|---|---|---|---|
| 0 | Seattle | King | 2023 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | 1514 |
| 1 | Seattle | King | 2018 | TESLA | MODEL 3 | Battery Electric Vehicle (BEV) | 1153 |
| 2 | Seattle | King | 2021 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | 1147 |
| 3 | Seattle | King | 2022 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | 1122 |
| 4 | Bellevue | King | 2023 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | 931 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 26808 | Lakewood | Pierce | 2022 | BMW | IX | Battery Electric Vehicle (BEV) | 1 |
| 26809 | Lakewood | Pierce | 2022 | BMW | X5 | Plug-in Hybrid Electric Vehicle (PHEV) | 1 |
| 26810 | Lakewood | Pierce | 2022 | FORD | TRANSIT | Battery Electric Vehicle (BEV) | 1 |
| 26811 | Lakewood | Pierce | 2022 | HYUNDAI | KONA ELECTRIC | Battery Electric Vehicle (BEV) | 1 |


## Task 2: Asking GPT a Question

Let's start by sending a message to GPT and getting a response. For now, we won't worry about including any details about the dataset; it's the equivalent of asking "is this microphone turned on?".

We'll also skip using LangChain for now so you can see more clearly how the `openai` package works.

### Types of Message

There are three types of message, documented in the [Introduction](https://platform.openai.com/docs/guides/chat/introduction) to the Chat documentation. We'll use two of them here.

- `system` messages describe the behavior of the AI assistant. If you don't know what you want, try "You are a helpful assistant".
- `user` messages describe what you want the AI assistant to say. We'll cover examples of this today.

### Instructions

Send a question to GPT and get a response.

- Define the system message as follows and assign to `system_msg_test`.

```
"""You are a helpful assistant who understands data science.
 You write in a clear language that a ten year old can understand.
 You keep your answers brief."""
```
    
- Define the user message as follows and assign to `user_msg_test`.

```
"Tell me some uses of GPT for data analysis."
```

- Create a message list from the system and user messages. Assign to `msgs_test`.
- Send the messages to GPT. Assign to `rsps_test`.

<details>
<summary>Code hints</summary>
<p>
        
The `client.chat.completions.create()` method expects the messages in the form of a list of dictionaries, each with a `role` and `content` element.
        
```
messages = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": user_msg}
]
```

</p>
</details>

In [7]:
# Define the system message. Assign to system_msg_test.
system_msg_test = """You are a helpful assistant who understands data science.
 You write in a clear language that a ten year old can understand.
 You keep your answers brief."""

# Define the user message. Assign to user_msg_test.
user_msg_test = "Tell me some uses of GPT for data analysis."

# Create a message list from the system and user messages. Assign to msgs_test.
msgs_test = [
    {"role": "system", "content": system_msg_test},
    {"role": "user", "content": user_msg_test}
]

# Send the messages to GPT. Assign to rsps_test.
client = openai.OpenAI()
rsps_test = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=msgs_test
)

Now you need to explore the response. The result is a highly nested object. As well as the text response that we want, there's a lot of metadata. You'll print the whole thing so you can see the structure, and extract just the text content.

### Instructions

Print the whole response and just the text content.

- Print the whole response.
- Print just the response's content.

<details>
<summary>Code hints</summary>
<p>
        
Buried within the response variable is the text we asked GPT to generate. Luckily, it's always in the same place.
        
```
response.choices[0].message.content
```

</p>
</details>
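
To get a feel for the nesting without spending an API call, the path to the text can be tried on a stand-in object. The sketch below builds a fake response with `types.SimpleNamespace` that only mimics the attributes used in this notebook (`choices`, `message`, `content`); it is not a real API response, and `first_reply_text` is a hypothetical helper name.

```python
from types import SimpleNamespace

# A stand-in object that mimics the nesting of a chat completion response.
# It is NOT a real API response; the content string is made up for illustration.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            index=0,
            message=SimpleNamespace(role="assistant", content="Hello from the stand-in!"),
        )
    ]
)


def first_reply_text(response):
    """Return the text of the first choice in a chat completion-style response."""
    return response.choices[0].message.content


print(first_reply_text(fake_response))
```

The same function should work on `rsps_test`, since the code cell below reads the text through the same attribute path.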

In [8]:
# Print the whole response
print("The whole response\n")
print(rsps_test)

print("\n\n----\n\n")

# Print just the response's content
print("Just the response's content\n")
rsps_test.choices[0].message.content

The whole response

ChatCompletion(id='chatcmpl-8rx39HfnO8ilXw1FotQgn5EFBmtS6', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="GPT, or Generative Pre-trained Transformer, can be used for data analysis in various ways. It can help with tasks like:\n\n1. Text Generation: GPT can generate realistic text, such as writing stories or creating product descriptions.\n\n2. Sentiment Analysis: GPT can analyze text and determine the sentiment behind it, whether it's positive, negative, or neutral.\n\n3. Language Translation: GPT can translate text from one language to another, making it easier to understand and communicate with people who speak different languages.\n\n4. Question Answering: GPT can answer questions by understanding the text and providing relevant answers.\n\n5. Chatbots: GPT can be used to create chatbots that can have conversational interactions with users, helping to answer questions and provide assistance.\n\nThese are just

"GPT, or Generative Pre-trained Transformer, can be used for data analysis in various ways. It can help with tasks like:\n\n1. Text Generation: GPT can generate realistic text, such as writing stories or creating product descriptions.\n\n2. Sentiment Analysis: GPT can analyze text and determine the sentiment behind it, whether it's positive, negative, or neutral.\n\n3. Language Translation: GPT can translate text from one language to another, making it easier to understand and communicate with people who speak different languages.\n\n4. Question Answering: GPT can answer questions by understanding the text and providing relevant answers.\n\n5. Chatbots: GPT can be used to create chatbots that can have conversational interactions with users, helping to answer questions and provide assistance.\n\nThese are just a few examples of how GPT can be used for data analysis. It has many other applications as well!"

## Task 3: Asking a Question About the Dataset

Now that we know GPT is working, we can start asking questions about data analysis. Because we have details of our dataset, we can pass these in to our prompt to improve the quality of the messages we get back.

Another change that we're going to make is to use the `langchain` package, which provides a convenience layer on top of the `openai` package.

### Why should we use LangChain?

The code in the previous task used complicated nested objects in two places (the list of dictionaries for the message, and the nested object of lists and further objects for the response). This sort of object is common in web application development, but not in data analysis, where rectangular data (pandas DataFrames and SQL tables) is more common.

One of the advantages of LangChain is that it simplifies the code for some tasks, letting you avoid messing about with too many square brackets and curly braces as you navigate these deep objects.

Secondly, if you want to swap GPT for a different model at a later date (as you might in a corporate setting), it can be easier to do so if you use the `langchain` package rather than the `openai` package directly.

### LangChain message types

The LangChain message types are named slightly differently from the OpenAI message types.

- `SystemMessage` is the equivalent of OpenAI's `system` message.
- `HumanMessage` is the equivalent of OpenAI's `user` message.
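
The naming correspondence can be written down as a small lookup table. This is only an illustrative sketch in plain Python: it maps role strings to class *names* rather than importing LangChain, the helper `describe_messages` is hypothetical, and the `assistant` to `AIMessage` pairing is included on the assumption that `assistant` is the third OpenAI role.

```python
# Mapping from OpenAI role names to the LangChain message class names used in this notebook.
ROLE_TO_LANGCHAIN_CLASS = {
    "system": "SystemMessage",
    "user": "HumanMessage",
    "assistant": "AIMessage",  # assumption: the third OpenAI role
}


def describe_messages(messages):
    """Return (class_name, content) pairs for a list of OpenAI-style message dicts."""
    return [(ROLE_TO_LANGCHAIN_CLASS[m["role"]], m["content"]) for m in messages]


openai_style = [
    {"role": "system", "content": "You are a data analysis expert."},
    {"role": "user", "content": "Suggest some questions."},
]
print(describe_messages(openai_style))
```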

### Instructions

Create a prompt that includes dataset details.

- _Read the description of the dataset that is provided._
- Create a task for the AI. Assign to `suggest_questions`.
    - Use the text `"Suggest some data analysis questions that could be answered with this dataset."`.
- Concatenate the dataset description and the request. Assign to `msgs_suggest_questions`.
    - The first message is a system message with the content `"You are a data analysis expert."`.
    - The second message is a human message with `dataset_description` and `suggest_questions` concatenated with two line breaks in between.

In [9]:
# A description of the dataset
dataset_description = """
You have a dataset about electric cars registered in Washington state, USA in 2020. It is available as a pandas DataFrame named `electric_cars`.

Each row in the dataset represents the count of the number of cars registered within a city, for a particular model.

The dataset contains the following columns.

- `city` (character): The city in which the registered owner resides.
- `county` (character): The county in which the registered owner resides.
- `model_year` (integer): The [model year](https://en.wikipedia.org/wiki/Model_year#United_States_and_Canada) of the car.
- `make` (character): The manufacturer of the car.
- `model` (character): The model of the car.
- `electric_vehicle_type` (character): Either "Plug-in Hybrid Electric Vehicle (PHEV)" or "Battery Electric Vehicle (BEV)".
- `n_cars` (integer): The count of the number of vehicles registered.
"""

# Create a task for the AI. Assign to suggest_questions.
suggest_questions = "Suggest some data analysis questions that could be answered with this dataset."

# Concatenate the dataset description and the request. Assign to msgs_suggest_questions.
msgs_suggest_questions = [
    SystemMessage(content="You are a data analysis expert."),
    HumanMessage(content=f"{dataset_description}\n\n{suggest_questions}")
]

### Instructions

- Create a `ChatOpenAI` object. Assign to `chat`.
- Pass your message to GPT. Assign to `rsps_suggest_questions`.
- Print the response object and the contents of the response.
- Print the type of the response.

In [10]:
# Create a ChatOpenAI object. Assign to chat.
chat = ChatOpenAI()

# Pass your message to GPT. Assign to rsps_suggest_questions.
rsps_suggest_questions = chat(msgs_suggest_questions)

# Print the response
print("The whole response\n")
print(rsps_suggest_questions)

print("\n----\n")

# Print just the response's content
print("Just the response's content\n")
print(rsps_suggest_questions.content)

print("\n----\n")

# Print the type of the response
print("The type of the response\n")
print(type(rsps_suggest_questions))



The whole response

content='Certainly! Here are some data analysis questions that could be answered with this dataset:\n\n1. How many electric cars were registered in Washington state in 2020?\n2. What is the distribution of electric car registrations by city and county?\n3. Which cities or counties had the highest number of electric car registrations in 2020?\n4. What is the average number of electric cars registered per city or county?\n5. Which electric vehicle types (PHEV or BEV) were more popular in 2020?\n6. Which manufacturers had the highest number of electric car registrations in 2020?\n7. What is the distribution of electric car registrations by model year?\n8. How many electric cars of each make and model were registered in 2020?\n9. Are there any cities or counties where a specific electric car model is particularly popular?\n10. How does the number of electric car registrations vary across different months or quarters of 2020?\n\nThese questions can provide insights into 

## Task 4: Hold a conversation with GPT

Notice that the response from GPT was a message object rather than plain text. The most useful part of it is the `.content` attribute, which contains the text response to your prompt.

While a single prompt and response can be useful, often you want to have a longer conversation with GPT. In this case, you can pass previous messages so that GPT can "remember" what was said before.

### AI messages

The response from GPT had type `AIMessage`. By distinguishing `AIMessage`s from the `HumanMessage`s, you can tell who said what in a conversation with the AI.
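
The pattern behind a multi-turn conversation is just list building: each call sends the whole history plus the new prompt, and the reply is appended before the next turn. Here is a minimal sketch of that loop using plain dictionaries and a stand-in model, so it runs without an API key; `fake_model` and `continue_conversation` are hypothetical names, and a real call would replace `fake_model`.

```python
def fake_model(messages):
    """Stand-in for a chat model: reports how many messages it received."""
    return {"role": "assistant", "content": f"I can see {len(messages)} earlier messages."}


def continue_conversation(history, user_text, model=fake_model):
    """Return a new history with the user's message and the model's reply appended."""
    messages = history + [{"role": "user", "content": user_text}]
    reply = model(messages)
    return messages + [reply]


history = [{"role": "system", "content": "You are a data analysis expert."}]
history = continue_conversation(history, "Suggest some questions.")
history = continue_conversation(history, "Write code for the first one.")

for message in history:
    print(f"{message['role']}: {message['content']}")
```

Note that the model only "remembers" earlier turns because they are sent again each time; nothing is stored between calls in this sketch.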

### Displaying Markdown content

When GPT generates code as an output, it is often formatted as a Markdown code block inside triple backticks. You can display Markdown output more beautifully in Workspace by swapping `print()` for the `display()` and `Markdown()` functions.

```py
display(Markdown(your_markdown_text))
```

### Instructions

Append another prompt to the conversation and chat with GPT again.

- Append the response and a new message to the previous messages. Assign to `msgs_python_top_models`.
- Pass your message to GPT. Assign to `rsps_python_top_models`.
- Display the response's Markdown content.

In [11]:
# Append the response and a new message to the previous messages. 
# Assign to msgs_python_top_models.
msgs_python_top_models = msgs_suggest_questions + [
    rsps_suggest_questions,
    HumanMessage(content="Write Python code to find the make/model combinations of electric car in Washington state?")
]

# Pass your message to GPT. Assign to rsps_python_top_models.
rsps_python_top_models = chat(msgs_python_top_models)

# Display the response's Markdown content
display(Markdown(rsps_python_top_models.content))

To find the make/model combinations of electric cars in Washington state, you can use the following Python code:

```python
# Filter the dataset for electric cars in Washington state
electric_cars_wa = electric_cars[electric_cars['state'] == 'WA']
# Select the make and model columns
make_model_combinations = electric_cars_wa[['make', 'model']].drop_duplicates()

# Print the make/model combinations
print(make_model_combinations)
```

This code filters the `electric_cars` DataFrame for rows where the state is 'WA' (Washington state). Then, it selects the 'make' and 'model' columns and drops any duplicate combinations. Finally, it prints the resulting make/model combinations.

Please note that you need to adjust the column names in the code (`state`, `make`, `model`) according to the actual column names in your dataset.

## Task 5: Execute the Code Provided by GPT

You just asked GPT to write some code for you. Next you need to see if it worked, and fix it if it didn't. 

This is a standard workflow for interacting with generative AI: the AI acts as a junior data analyst who writes the code, then you act as the boss who reviews the work.
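
Copying code out of a response by hand works, but the fenced blocks can also be pulled out programmatically for review. The following is a rough sketch using a regular expression; `extract_code_blocks` is a hypothetical helper, and it assumes the response wraps code in triple backticks as described earlier.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to avoid nesting fences in this example

# Match an opening fence with an optional language tag, then capture up to the closing fence
CODE_BLOCK_PATTERN = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(markdown_text):
    """Return the contents of every fenced code block in a Markdown string."""
    return [block.strip() for block in CODE_BLOCK_PATTERN.findall(markdown_text)]


sample = (
    "Here is some code:\n\n"
    + FENCE + "python\n"
    + "print('hello')\n"
    + FENCE + "\n\n"
    + "That is all."
)
print(extract_code_blocks(sample))
```

Extracting the code does not replace the review step: you should still read it before running it.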

### Instructions

Review the work of your AI assistant.

- Copy and paste the code generated by GPT into the next code cell and run it.
- _Look at the result. Do you think it is correct?_*
- If the code threw an error or gave a wrong answer, use the Workspace AI Assistant (also powered by GPT!) to fix and explain the code.

*During testing of this code-along project, GPT sometimes wrote incorrect code when using `.sum()`. Double-check those function calls.
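
One quick review step is to check that every column the generated code refers to actually exists. The response shown in the previous task, for example, filters on a `state` column that is not in the dataset description. A minimal sketch of that check, with the hypothetical helper `missing_columns` and the column lists typed in by hand:

```python
def missing_columns(required, available):
    """Return the required column names that are not in the available columns."""
    available = set(available)
    return [name for name in required if name not in available]


# Column names listed in the dataset description
dataset_columns = [
    "city", "county", "model_year", "make", "model",
    "electric_vehicle_type", "n_cars",
]

# Columns referenced by a piece of generated code (typed in by hand here)
columns_in_generated_code = ["state", "make", "model"]

print(missing_columns(columns_in_generated_code, dataset_columns))
```

In the notebook, `electric_cars.columns` could be passed as the `available` argument instead of the hand-typed list.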

In [12]:
# Paste the code generated by GPT and run it
# Filter the dataset to include only electric cars
electric_cars_filtered = electric_cars[electric_cars['electric_vehicle_type'].str.contains('Electric Vehicle')]

# Group the data by make and model, and count the number of occurrences
make_model_combinations = electric_cars_filtered.groupby(['make', 'model']).size().reset_index(name='count')

# Display the make/model combinations
print(make_model_combinations)

                     make   model  count
0              ALFA ROMEO  TONALE     11
1                    AUDI      A3    178
2                    AUDI      A7     11
3                    AUDI    A8 E      3
4                    AUDI  E-TRON    313
..                    ...     ...    ...
122                 VOLVO     V60     23
123                 VOLVO    XC40    225
124                 VOLVO    XC60    369
125                 VOLVO    XC90    427
126  WHEEGO ELECTRIC CARS  WHEEGO      3

[127 rows x 3 columns]


## Task 6: Continue the Conversation to Create a Plot

Doing more analysis with GPT assistance is simply a case of continuing the conversation by appending new `HumanMessage` prompts to the message list, and calling the `chat()` function.

The output from GPT is partly random, which isn't always desirable in data analysis, where you'd like your results to be reproducible. With large language models, the amount of randomness can be controlled with a parameter known as "temperature".

- `temperature` controls the randomness of the response. It ranges from `0` to `2` with zero meaning minimal randomness (to make it easier to reproduce results) and two meaning maximum randomness (often gives weird responses). If you use the OpenAI API directly, the default is `1`, but using LangChain reduces the default value to `0.7`. 
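
To build intuition for what temperature does, here is a conceptual sketch of temperature-scaled softmax, a common textbook way of turning scores into sampling probabilities. It is an illustration only: this notebook does not describe how GPT samples internally, the scores are made up, and treating zero as "always pick the top score" is a simplifying assumption to avoid dividing by zero.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat zero as "always pick the highest score" to avoid dividing by zero
        best = max(range(len(scores)), key=lambda i: scores[i])
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    scaled = [s / temperature for s in scores]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0, 0.7, 1, 2):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring option, while higher temperatures spread it more evenly, which is why high values tend to give less predictable text.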

### Instructions

- Create a new OpenAI chat object with temperature set to zero. Assign to `chat0`.

In [13]:
# Create a new OpenAI chat object with temperature set to zero. Assign to chat0.
chat0 = ChatOpenAI(temperature=0)

### Instructions

- Work through the previous conversation flow again, appending the previous response and a new request for Python code to draw a bar plot of the total count of electric cars by model year, with bars colored by electric vehicle type.
    - The solution asks for Plotly Express code, but you can pick any Python data viz package you prefer.

In [14]:
# Ask GPT for code for a bar plot, as detailed in the instructions
msgs_to_plot = msgs_python_top_models + [
    rsps_python_top_models, 
    HumanMessage(content="Create Python code to show a bar plot of the total count of electric cars by model year, with bars colored by electric vehicle type, using Plotly Express.")]

rsps_to_plot = chat0(msgs_to_plot)

display(Markdown(rsps_to_plot.content))

To create a bar plot of the total count of electric cars by model year, with bars colored by electric vehicle type, using Plotly Express, you can use the following Python code:

```python
import plotly.express as px

# Group the dataset by model year and electric vehicle type, and calculate the total count of electric cars
grouped_data = electric_cars.groupby(['model_year', 'electric_vehicle_type']).sum().reset_index()

# Create the bar plot using Plotly Express
fig = px.bar(grouped_data, x='model_year', y='n_cars', color='electric_vehicle_type',
             labels={'model_year': 'Model Year', 'n_cars': 'Count of Electric Cars'},
             title='Total Count of Electric Cars by Model Year',
             barmode='group')

# Show the plot
fig.show()
```

This code first groups the `electric_cars` DataFrame by model year and electric vehicle type, and calculates the total count of electric cars for each combination. Then, it uses Plotly Express to create a bar plot, specifying the x-axis as the model year, the y-axis as the count of electric cars, and the color as the electric vehicle type. It also sets the labels and title for the plot, and sets the `barmode` parameter to 'group' to display the bars side by side.

Please make sure to have Plotly and its dependencies installed (`pip install plotly`) before running this code.

### Instructions

To see how much variation there is with temperature set to zero, ask GPT for the same thing again.

- Call GPT again with the same message list and display the response.
- _Look at the response content. How close is it to the previous response content?_

In [15]:
# Call GPT again with the same message list and display the response
rsps_python_review = chat0(msgs_to_plot)

display(Markdown(rsps_python_review.content))

To create a bar plot of the total count of electric cars by model year, with bars colored by electric vehicle type, using Plotly Express, you can use the following Python code:

```python
import plotly.express as px

# Group the dataset by model year and electric vehicle type, and calculate the total count of electric cars
grouped_data = electric_cars.groupby(['model_year', 'electric_vehicle_type']).sum().reset_index()

# Create the bar plot using Plotly Express
fig = px.bar(grouped_data, x='model_year', y='n_cars', color='electric_vehicle_type',
             labels={'model_year': 'Model Year', 'n_cars': 'Count of Electric Cars'},
             title='Total Count of Electric Cars by Model Year',
             barmode='group')

# Show the plot
fig.show()
```

This code first groups the `electric_cars` DataFrame by model year and electric vehicle type, and calculates the total count of electric cars for each combination. Then, it uses Plotly Express to create a bar plot, specifying the x-axis as the model year, the y-axis as the count of electric cars, and the color as the electric vehicle type. It also sets appropriate labels and a title for the plot, and sets the `barmode` parameter to 'group' to display the bars side by side.

Please make sure to have Plotly and its dependencies installed (`pip install plotly`) before running this code.

## Task 7: Execute the Code Provided by GPT to See Your Plot

Setting temperature to zero removed most of the randomness: the generated code was identical in both responses, and the explanations differed by only a few words. That makes your workflow more reproducible.
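
Rather than comparing two responses by eye, their similarity can be measured. The sketch below uses `difflib.SequenceMatcher` from the standard library on one sentence taken from each of the two responses in the previous task; `similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, text_a, text_b).ratio()


# One sentence from each of the two responses in the previous task
first = "It also sets the labels and title for the plot"
second = "It also sets appropriate labels and a title for the plot"

print(f"Similarity: {similarity(first, second):.2f}")
print("Identical" if first == second else "Not identical")
```

In the notebook, the full texts could be compared with `similarity(rsps_to_plot.content, rsps_python_review.content)`.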

The final task is to see the plot. As before, remember that GPT is only your assistant and you are the boss. Check the code and its output to make sure that you really have what you want.
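
One thing worth checking in generated aggregation code is whether it counts rows or sums `n_cars`. Because each row of the dataset already holds a count of cars, the two give different answers. The sketch below shows the difference in plain Python on a few made-up rows; in pandas, the corresponding calls on a grouped DataFrame would be `.size()` for rows and `["n_cars"].sum()` for cars.

```python
from collections import Counter, defaultdict

# A few made-up rows in the same shape as the dataset: (model_year, type, n_cars)
rows = [
    (2022, "BEV", 10),
    (2022, "BEV", 5),
    (2022, "PHEV", 2),
    (2023, "BEV", 7),
]

# Counting rows per group (what .size() does in pandas)
row_counts = Counter((year, ev_type) for year, ev_type, _ in rows)

# Summing n_cars per group (the total number of cars)
car_totals = defaultdict(int)
for year, ev_type, n_cars in rows:
    car_totals[(year, ev_type)] += n_cars

for group in sorted(row_counts):
    print(group, "rows:", row_counts[group], "cars:", car_totals[group])
```

If the plot is meant to show the total count of electric cars, the summed version is the one to look for in the generated code.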

### Instructions

Run the code and check that the plot is correct.

- Run the bar plot code generated by GPT.
- _Check that the output is suitable. If not, try changing your prompt in the previous task to improve the output (this is prompt engineering, which you'll see more of in the next code-along project in the series)._

In [16]:
# Paste the code generated by GPT and run it
import nbformat
print(nbformat.__version__)

# Filter the dataset to include only electric cars
electric_cars_filtered = electric_cars[electric_cars['electric_vehicle_type'].str.contains('Electric Vehicle')]

# Group the data by model year and electric vehicle type, and count the number of occurrences
count_by_year_type = electric_cars_filtered.groupby(['model_year', 'electric_vehicle_type']).size().reset_index(name='count')

# Create the bar plot
fig = px.bar(count_by_year_type, x='model_year', y='count', color='electric_vehicle_type',
             title='Total Count of Electric Cars by Model Year',
             labels={'model_year': 'Model Year', 'count': 'Count', 'electric_vehicle_type': 'Electric Vehicle Type'})


# Show the plot
fig.show()

5.9.2




## Summary

You've now seen how to access GPT through the OpenAI API both directly and using LangChain.

You saw how GPT can be used to come up with ideas for analyses to perform and to write code for you.

You also saw how to have an extended conversation and how to control the reproducibility of the responses.