# Actuarial Applications of Natural Language Processing Using Transformers
### A Case Study for Processing Text Features in an Actuarial Context
### Part III – Case Studies on Car Accident Descriptions - Unsupervised Techniques Using ChatGPT

By Andreas Troxler, September 2023

In this Part III of the tutorial, you will learn how ChatGPT can be used to extract features from text when no labels are available.
This is very relevant in practice: text data is often available, but labels are missing or sparse!

We use the car accident reports already familiar from Part I, and try to find out the number of vehicles involved and whether someone was injured.

As a user of ChatGPT, you need to be aware of its limitations, which can lead to incorrect results. Limitations include the following:
* ChatGPT may create wrong answers, due to lack of common sense, lack of detailed and up-to-date information, lack of understanding of the context, biases and prejudices, etc.
* Results obtained by ChatGPT are not reproducible across runs and model versions.
* It is difficult to explain how ChatGPT arrives at the answer.
* ChatGPT is a very complex Large Language Model (LLM). As such, it requires significant computational resources.
* ...

With that in mind, let's get started.

## Notebook Overview

This notebook is divided into three parts:

1. [Introduction.](#intro)<br>
    We begin by explaining the prerequisites. Then we turn to loading and exploring the dataset – ca. 7k records of English and German car accident reports with an average length of about 400 words. This repeats material from Part I of the tutorial.<br><br>

2. [Extract features using ChatGPT.](#chat_gpt)<br>
     We apply ChatGPT to extract some features from this data.<br><br>
    
3. [Conclusion](#conclusion)


<a id='intro'></a>
<a name="intro"></a>
## 1.&nbsp;Introduction

In this section, we discuss the prerequisites, then load and inspect the dataset.

<a id='prerequisites'></a>
<a name='prerequisites'></a>
### 1.1. Prerequisites


#### Computing Power and OpenAI Account

In this notebook, we use the API provided by OpenAI. Therefore, it does not require GPU support.

On the flipside, you need to [set up an OpenAI account](https://platform.openai.com/signup?launch) and generate your personal API authentication key. In the following, we assume that this key is stored in the file `openai-key.txt` in the working directory. Of course, you can use a different file name. Do not share your API key with others, or expose it in the browser or other client-side code.

#### Local files
Make sure the following files are available in the directory of the notebook:
* `tutorial_utils.py` - a collection of utility functions used throughout this notebook
* `NHTSA_NMVCCS_extract.parquet.gzip` - the data
* `openai-key.txt` - a text file containing your OpenAI API authentication key

This notebook will create the following subdirectory:
* `results` - figures and Excel files

#### Getting started with Python and Jupyter Notebook

For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook.

In this section, Jupyter Notebook and Python settings are initialized.
For code in Python, the [PEP8 standard](https://www.python.org/dev/peps/pep-0008/)
("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.


In [None]:
# Notebook settings

# clear the namespace variables
from IPython import get_ipython
get_ipython().run_line_magic("reset", "-sf")

# formatting: cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

#### Importing Required Libraries

If you run this notebook on Google Colab, you will need to install the following libraries:

In [None]:
!pip install openai



In [None]:
!pip install retry



In [None]:
!pip install kaleido



Next, we import the required libraries:

In [None]:
import os
import openai
import pandas as pd
import plotly.express as px
from tqdm import tqdm
from retry import retry
from wordcloud import WordCloud
from tutorial_utils import evaluate_classifier

<a id='dataexploration'></a>
<a name='dataexploration'></a>
### 1.2. Exploring the Data

You can skip this section if you are already familiar with Part I of this tutorial.

The data used throughout this tutorial is derived from data of a vehicle crash causation study performed in the United States from 2005 to 2007. The dataset has almost 7'000 records, each relating to one accident. For each case, a verbal description of the accident is available in English, which summarizes road and weather conditions, vehicles, drivers and passengers involved, preconditions, injury severities, etc. The same information is also encoded in tabular form, so that we can apply supervised learning techniques to train the NLP models and compare the information extracted from the verbal descriptions with the encoded data.

The original data consists of multiple tables. For this tutorial, we have aggregated it into a single dataset and added German translations of the English accident descriptions. The translations were generated using the [DeepL python API](https://pypi.org/project/deepl/).

To explore the data, let's load it into a Pandas DataFrame and examine its shape, columns and data types:

In [None]:
df = pd.read_parquet("NHTSA_NMVCCS_extract.parquet.gzip")
print(f"shape of DataFrame: {df.shape}")
print(*list(zip(df.columns, df.dtypes)), sep="\n")

shape of DataFrame: (6949, 16)
('level_0', dtype('int64'))
('index', dtype('int64'))
('SCASEID', dtype('int64'))
('SUMMARY_EN', dtype('O'))
('SUMMARY_GE', dtype('O'))
('INJSEVA', dtype('int64'))
('NUMTOTV', dtype('int64'))
('WEATHER1', dtype('int64'))
('WEATHER2', dtype('int64'))
('WEATHER3', dtype('int64'))
('WEATHER4', dtype('int64'))
('WEATHER5', dtype('int64'))
('WEATHER6', dtype('int64'))
('WEATHER7', dtype('int64'))
('WEATHER8', dtype('int64'))
('INJSEVB', dtype('int64'))


The column `SCASEID` is a unique case identifier.

The columns `SUMMARY_EN` and `SUMMARY_GE` are strings representing the verbal descriptions of the accident in English and German, respectively.

`NUMTOTV` is the number of vehicles involved in the case. Let's have a look at the distribution of this feature:

In [None]:
fig = px.bar(df["NUMTOTV"].value_counts().sort_index(), width=640)
fig.update_layout(title="number of cases by number of vehicles", xaxis_title="number of vehicles",
                  yaxis_title="number of cases")
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "num_vehicles"}})

Most cases involve two vehicles, and only very few accidents involve more than three vehicles.

Each of the columns `WEATHER1` to `WEATHER8` indicates the presence of a specific weather condition (1: weather condition present, 9999: presence of weather condition unknown, 0 otherwise):

| column | meaning | count |
|---|---|---|
| `WEATHER1` | cloudy | 1112 |
| `WEATHER2` | snow | 114 |
| `WEATHER3` | fog, smog, smoke | 28 |
| `WEATHER4` | rain | 624 |
| `WEATHER5` | sleet, hail (freezing drizzle or rain) | 25 |
| `WEATHER6` | blowing snow | 38 |
| `WEATHER7` | severe crosswinds | 20 |
| `WEATHER8` | other | 25 |


These weather conditions are not mutually exclusive, i.e., more than one condition can be present in a single case. The frequency distribution looks as follows:

In [None]:
fig=px.bar(x=range(1,9), y=[(df["WEATHER"+str(i)]==1).sum() for i in range(1,9)], width=640)
fig.update_layout(title="number of cases by weather condition", xaxis_title="weather condition",
                  yaxis_title="number of cases")
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "weather"}})

The most frequently recorded weather conditions are "cloudy" (`WEATHER1`) and "rain" (`WEATHER4`).
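
Since the weather conditions are not mutually exclusive, it can be useful to count how many conditions are flagged per case. The following is a minimal sketch on a small hypothetical DataFrame: `toy` and its values are invented for illustration, and only three of the eight columns are included. With the real data, the same expression could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data: 1 = present, 0 = absent, 9999 = unknown
toy = pd.DataFrame({
    "WEATHER1": [1, 0, 1, 9999],
    "WEATHER2": [0, 0, 0, 9999],
    "WEATHER4": [1, 0, 0, 9999],
})
weather_cols = [c for c in toy.columns if c.startswith("WEATHER")]

# number of weather conditions flagged as present in each case
# (unknown values, coded 9999, are not counted as present)
n_conditions = (toy[weather_cols] == 1).sum(axis=1)

# distribution: how many cases have 0, 1, 2, ... conditions present
print(n_conditions.value_counts().sort_index())
```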

`INJSEVA` indicates the most serious sustained injury in the accident. For instance, if one person was not injured, and another person suffered a non-incapacitating injury, injury class 2 was assigned to the case.

Information on injury severity has been taken from police accident reports, which are not available in the data. Unfortunately, this information does not necessarily align with the case description: there are many cases for which the case description indicates the presence of an injury but `INJSEVA` does not, and vice versa.

For this reason, we manually created an additional column `INJSEVB` based on the case description, to indicate the presence of a (possible) bodily injury. The table below shows the number of cases by the two variables.

| `INJSEVA` | meaning | `INJSEVB`=0 | `INJSEVB`=1 | total |
|---|---|---|---|---|
|  0 | O - No injury | 1'458 | 96| 1'554 |
|  1 | C - Possible injury | 1'112 | 1'298 | 2'410 |
|  2 | B - Non-incapacitating injury | 729 | 945 | 1'674 |
|  3 | A - Incapacitating injury | 304 | 373 | 677 |
|  4 | K - Killed | 5 | 114 | 119 |
|  5 | U - Injury, severity unknown | 44 | 122 | 166 |
|  6 | Died prior to crash  | 0 | 0| 0 |
|  9 | Unknown if injured  | 51 | 16 | 67 |
| 10 | No person in crash  | 1 | 0| 1 |
| 11 | No PAR (police accident report) obtained | 231 | 50 | 281 |
|**Total**| | **3'935** | **3'014**| **6'949**|
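
A cross-tabulation of this kind can be produced with `pandas.crosstab`. The sketch below uses a small hypothetical DataFrame (`toy`, with invented values) in place of the real data; applied to `df`, the same call should give counts by `INJSEVA` and `INJSEVB` including row and column totals.

```python
import pandas as pd

# Hypothetical toy data standing in for the two injury columns
toy = pd.DataFrame({
    "INJSEVA": [0, 0, 1, 1, 2, 4],
    "INJSEVB": [0, 1, 0, 1, 1, 1],
})

# rows: INJSEVA, columns: INJSEVB, with row and column totals
table = pd.crosstab(toy["INJSEVA"], toy["INJSEVB"],
                    margins=True, margins_name="Total")
print(table)
```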

Now we turn to the verbal accident descriptions. First, we examine the length of the English texts, `SUMMARY_EN`. To this end, we split the texts into words, with blank spaces as separator, and show a box plot of the text length by number of vehicles involved in the accident:

In [None]:
# statistics of summary length
df["words per case summary"] = df["SUMMARY_EN"].str.split().apply(len)
print(f"Overall number of words by case summary: min {df['words per case summary'].min()}, "
      f"average {df['words per case summary'].mean():.0f}, max {df['words per case summary'].max()}")
fig = px.box(df, x="NUMTOTV", y="words per case summary", width=640)
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "text_length"}})

Overall number of words by case summary: min 60, average 419, max 1248


Not surprisingly, the length of the descriptions correlates with the number of vehicles involved.

The average length is above 400 words.

Let's examine one of the English texts and its German translation:

In [None]:
display(HTML(df.loc[0, "SUMMARY_EN"]))

In [None]:
display(HTML(df.loc[0, "SUMMARY_GE"]))

To get an impression of the most frequent words, we generate a simple word cloud from all English case descriptions. By default, the word cloud excludes so-called stop words (such as articles, prepositions, pronouns, conjunctions, etc.), which are the most common words and do not add much information to the text.

In [None]:
text = df["SUMMARY_EN"].str.cat(sep=" ")

# Create and generate a word cloud image:
word_cloud = WordCloud(max_words=100, background_color="white").generate(text)

# Display the generated image:
fig = px.imshow(word_cloud, width=640)
fig.update_layout(xaxis_showticklabels=False, yaxis_showticklabels=False)
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "word_cloud"}})

<a id='chat_gpt'></a>
<a name='chat_gpt'></a>
## 2.&nbsp;Extract Features Using ChatGPT

Imagine the following situation: We are building a model to predict the severity of accidents based on features available in tabular form. We believe that knowing the number of vehicles involved in the accident and whether someone was injured would help improve the model. However, all we have available are accident reports containing the information in unstructured free text form.

If we have sufficient data with labels, we can use supervised techniques such as those examined in Part I of this tutorial.

In this Part III, we learn an unsupervised approach that does not require labels.

More precisely, we will use ChatGPT to extract the following information from the car accident reports:
* Was someone injured or killed?
* How many vehicles were involved?

Let's get started.

<a id='chat-gtp-api'></a>
<a name='chat-gtp-api'></a>
###  2.1 First Steps With the ChatGPT API


The idea is very simple: We specify a number of questions and ask ChatGPT to provide answers based on a given accident report.

The prompt might look as follows:

```
Read the following text, and answer the following:
1. Was someone injured?
2. Was someone killed?
3. How many vehicles were involved?
4. Summarize your last answer by a number.
Text:
V1, a 2000 Pontiac Montana minivan, made a left turn [...]
```

The response might look like:

```
1. Yes, the driver of V2 sustained minor injuries.
2. No, no one was killed.
3. Two vehicles were involved.
4. 2
```

So all we have to do is to extract the desired features from this response!

We begin by writing a short function to call the OpenAI API.


In [None]:
@retry((openai.error.APIError, openai.error.ServiceUnavailableError), tries=10, delay=15)
def call_openai(content):
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": content}],
        temperature=0.2,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

Note that we have used the `retry` decorator to retry 10 times with a waiting time of 15 seconds to handle some of the most common exceptions (you are invited to develop more sophisticated ways to deal with such issues).
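
As one possible refinement, the waiting time can grow with each failed attempt (exponential backoff) instead of staying fixed. The following is only a sketch under assumptions: `call_with_backoff` and `flaky` are hypothetical names, `flaky` stands in for the API call so that the example runs offline, and the default delays are arbitrary.

```python
import random
import time


def call_with_backoff(func, *args, retry_on=(Exception,), tries=5,
                      base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call func(*args); on a listed exception, wait and retry with growing delay."""
    for attempt in range(tries):
        try:
            return func(*args)
        except retry_on:
            if attempt == tries - 1:
                raise  # out of attempts: re-raise the last exception
            # delay doubles with each attempt, capped at max_delay, plus random jitter
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))


# Demonstration with a stand-in function that fails twice, then succeeds
calls = {"n": 0}


def flaky(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return f"answer for: {text}"


# sleep is replaced by a no-op so that the demonstration runs instantly
result = call_with_backoff(flaky, "some report", retry_on=(ConnectionError,),
                           sleep=lambda seconds: None)
print(result, "after", calls["n"], "calls")
```

In the notebook, the tuple passed as `retry_on` would hold the same exception classes as those listed in the `retry` decorator.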

The parameters have the following effects:
* `model`: Specifies the ChatGPT model version.
* `messages`: Specifies the content of the user prompt.
* `temperature`: Controls the randomness of the generated text. The API accepts values in the interval $[0, 2]$. A higher temperature results in more diverse and creative output, while a lower temperature makes the output more deterministic and focused. For our purpose, we require fact-based answers and therefore go for low values of temperature.
* `max_tokens`: Upper limit on the number of tokens generated in the response. Our answers are short, so a limit of 256 tokens is sufficient.
* `top_p`: Instead of considering all possible tokens, the model considers only a subset of tokens (the "nucleus") whose cumulative probability mass adds up to this threshold. With `top_p`=1, we allow all possible tokens.
* `frequency_penalty`: Penalizes repetition of words in the response. For our purpose, we don't mind word repetitions and therefore set this parameter to 0.
* `presence_penalty`: Encourages use of a diverse vocabulary in the response. For our purpose, this aspect is not important and therefore we set this parameter to 0.
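
To build some intuition for `temperature` and `top_p`, the following sketch applies both ideas to three hypothetical token scores. It is a simplified illustration of the general technique (temperature-scaled softmax and nucleus selection), not a description of how OpenAI implements sampling; all names and numbers are invented.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    # this sketch requires temperature > 0 (division by the temperature)
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
p_low = softmax_with_temperature(logits, 0.2)   # concentrated on the top token
p_high = softmax_with_temperature(logits, 1.0)  # more spread out
print([round(p, 3) for p in p_low])
print([round(p, 3) for p in p_high])
print(nucleus(p_high, 0.9))  # tokens kept when sampling with top_p = 0.9
```

With the low temperature, almost all probability mass sits on the highest-scoring token, which is why low values give more deterministic answers.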

You are encouraged to experiment with these parameters.

Please note that the results may not be reproducible between runs and model versions.

Next, we specify the location of the API authentication key:

In [None]:
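# path of the text file holding the API key (see Prerequisites)
openai.api_key_path = "./openai-key.txt"
# API key from the environment variable OPENAI_API_KEY, if it is set
openai.api_key = os.getenv("OPENAI_API_KEY")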

Now we are ready!

We specify the following prompt ...

In [None]:
prompt = """
Read the following text, and answer the following:
1. Was someone injured?
2. Was someone killed?
3. How many vehicles were involved?
4. Summarize your last answer by a number.
Text:
"""
prompt

'\nRead the following text, and answer the following:\n1. Was someone injured?\n2. Was someone killed?\n3. How many vehicles were involved?\n4. Summarize your last answer by a number.\nText:\n'

... and apply it to the first English accident report:

In [None]:
response = call_openai(prompt + df.iloc[0]["SUMMARY_EN"])
response

<OpenAIObject chat.completion id=chatcmpl-7vkj3d4NvHiu3Uy1BUIQyqVRp7FV2 at 0x7f6200275f30> JSON: {
  "id": "chatcmpl-7vkj3d4NvHiu3Uy1BUIQyqVRp7FV2",
  "object": "chat.completion",
  "created": 1693998665,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1. Yes, the driver of V2 sustained minor injuries.\n2. No, no one was killed.\n3. Two vehicles were involved.\n4. 2"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 618,
    "completion_tokens": 33,
    "total_tokens": 651
  }
}

As you can see, we received a chat completion object, from which the response is easy to unpack:

In [None]:
print(response["choices"][0]["message"]["content"])

1. Yes, the driver of V2 sustained minor injuries.
2. No, no one was killed.
3. Two vehicles were involved.
4. 2


Indeed, the text says that "[the driver of V2] sustained minor injuries and was transported to a local trauma facility". There is no mention of a fatality, and there were two vehicles involved, namely V1 and V2:

In [None]:
display(HTML(df.loc[0, "SUMMARY_EN"]))

We can use the same English prompt and apply it to the German version of the accident report. The response is in English:

In [None]:
text = df.iloc[0]["SUMMARY_GE"]
display(HTML(text))
response = call_openai(prompt + text)
print(response["choices"][0]["message"]["content"])

1. Yes, the driver of V2 (Pontiac Grand Am) suffered minor injuries.
2. No, no one was killed.
3. Two vehicles were involved (V1 and V2).
4. 2


Now, we want to examine more examples. We store the results in a list.

In [None]:
# reset results
results = []

It might happen that the following code stops, for instance due to temporary unavailability of the API. In this case, you can simply resume execution after a while.

Feel free to change the upper bound of the loop. In order to run a large number of samples, you may need to switch to a paid scheme.

In [None]:
for i in tqdm(range(len(results), 10)):
    text = df.iloc[i]["SUMMARY_EN"]
    response = call_openai(prompt + text)
    results.append(response["choices"][0]["message"]["content"])

100%|██████████| 10/10 [00:42<00:00,  4.24s/it]


In [None]:
# store the results in a DataFrame and export to a csv file
os.makedirs("./results", exist_ok=True)
pd.DataFrame(results, columns=["response"]).to_csv(f"./results/NHTSA_responses_{i:04d}.csv", index=False)
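
The export above takes place only after the loop has finished, so responses collected before an interruption exist only in memory. One possible refinement, sketched here under assumptions, appends each response to a file as soon as it arrives and resumes from the number of rows already stored. `fake_call` is a stand-in for `call_openai` so that the sketch runs offline, and the file name is hypothetical.

```python
import csv
import os
import tempfile


def fake_call(text):
    """Stand-in for the API call, so that the sketch runs offline."""
    return f"1. No.\n2. No.\n3. One vehicle.\n4. 1 ({len(text)} characters read)"


def run_with_checkpoint(texts, path, call=fake_call):
    """Append one response per text to a CSV file, skipping texts already processed."""
    done = 0
    if os.path.exists(path):
        # csv.reader counts records correctly even if a response spans several lines
        with open(path, newline="", encoding="utf-8") as f:
            done = sum(1 for _ in csv.reader(f))
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for text in texts[done:]:
            writer.writerow([call(text)])
            f.flush()  # hand the row to the operating system before the next call
    return done


path = os.path.join(tempfile.mkdtemp(), "responses_checkpoint.csv")
texts = ["first report", "second report", "third report"]
run_with_checkpoint(texts[:2], path)        # simulates an interrupted first run
skipped = run_with_checkpoint(texts, path)  # second run only processes the third text
print(f"skipped {skipped} texts already processed")
```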

In [None]:
results

['1. Yes, the driver of V2 sustained minor injuries.\n2. No, no one was killed.\n3. Two vehicles were involved.\n4. 2',
 '1. Yes, the driver of V2 was injured and bleeding.\n2. No, no one was killed.\n3. Two vehicles were involved.\n4. 2',
 '1. No one was injured.\n2. No one was killed.\n3. Two vehicles were involved.\n4. The number of vehicles involved is 2.',
 '1. Yes, the driver of vehicle one (V1) was injured.\n2. No, no one was killed.\n3. Only one vehicle (V1) was involved in the crash.\n4. 1',
 '1. Yes, the 17-year-old male driver of Vehicle #1 was transported to a hospital and treated for a complaint of pain.\n2. No, no one was killed in the crash.\n3. Two vehicles were involved in the crash.\n4. 2',
 '1. Yes, the driver of Vehicle #2 and the passenger in Vehicle #2 had minor injuries to the head/neck areas.\n2. No, no one was killed in the crash.\n3. Two vehicles were involved in the crash.\n4. 2',
 '1. Yes, someone was injured. \n2. No, no one was killed. \n3. Two vehicles we

<a id='extraction'></a>
<a name='extraction'></a>
###  2.2 Extracting the Features from the Responses

Next, we need to extract the desired features from the responses. We write a few functions to achieve this.

The first function, `first_matching_expression`, accepts a string, a dictionary and a default value as inputs. The dictionary is supposed to hold a mapping from expressions to values. The function searches for the expression which appears first in the string and returns its corresponding value. If no expression is found, the default value is returned.

In [None]:
def first_matching_expression(string, dictionary, default):
    """ Given a string and a dict of {expression: value}, returns value corresponding to first occurring expression. """
    # put default at end of the string
    positions = [(len(string), default)]
    # append with tuple (position, value) for each (item: value) in the dictionary
    for item, value in dictionary.items():
        position = string.find(item)
        # suppress items which were not found
        if position >= 0:
            positions.append((position, value))
    # return value corresponding to first position
    return sorted(positions)[0][1]
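
Note that `str.find` matches substrings, so the expression "one" is also found inside "someone", and "1" inside "10". If this is a concern, expressions could be matched as whole words instead. The following is a minimal sketch using regular expressions; `first_matching_word` is a hypothetical variant and not part of the original notebook.

```python
import re


def first_matching_word(string, dictionary, default):
    """Like first_matching_expression, but expressions must match as whole words."""
    best_position, best_value = len(string), default
    for item, value in dictionary.items():
        # \b marks a word boundary, so "one" does not match inside "someone"
        match = re.search(r"\b" + re.escape(item) + r"\b", string)
        if match and match.start() < best_position:
            best_position, best_value = match.start(), value
    return best_value


numbers = {"one": 1, "two": 2, "1": 1, "2": 2}
first = first_matching_word("someone hit two vehicles", numbers, 0)
second = first_matching_word("10 vehicles", numbers, 0)
print(first, second)
```

Whether whole-word matching is preferable depends on the responses at hand; substring matching is more forgiving of plural forms and punctuation.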

The next function splits a response into separate substrings representing the answers to each of the four questions. Then it extracts return values by searching for predefined expressions, by means of the function `first_matching_expression`. It returns both the substrings and the extracted information.

This function is highly task-specific.

In [None]:
def extract_responses(string):
    string = string.lower() + " "
    i1 = string.find("1. ")
    i2 = string.find("2. ")
    i3 = string.find("3. ")
    i4 = string.find("4. ")
    s1 = string[i1:i2][3:]
    s2 = string[i2:i3][3:]
    s3 = string[i3:i4][3:]
    s4 = string[i4:][3:]
    d1 = {"yes": 1, "minor": 1}
    r1 = first_matching_expression(s1, d1, 0)
    d2 = {"yes": 1}
    r2 = first_matching_expression(s2, d2, 0)
    d3 = {"1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8,
          "9": 9, "only": 1, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
    r3 = first_matching_expression(s3, d3, 1)
    r4 = first_matching_expression(s4, d3, 1)
    return [s1, s2, s3, s4, r1, r2, r3, r4]
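
One caveat: `string.find("2. ")` returns the first occurrence, which may lie inside the first answer, for instance if that answer ends with "... V2. ". A more defensive split could accept item numbers only at the start of a line. The sketch below assumes that the model puts each answer on its own line, as in the responses shown earlier; `split_numbered_answers` is a hypothetical helper, not part of the original notebook.

```python
import re


def split_numbered_answers(response, n_questions=4):
    """Return the answer texts for items '1.' to 'n.' that start a line ('' if missing)."""
    # (?m) lets ^ match at each line start; (?s) lets . span line breaks
    pattern = r"(?ms)^\s*(\d+)\.\s+(.*?)(?=^\s*\d+\.\s|\Z)"
    answers = {int(number): text.strip()
               for number, text in re.findall(pattern, response)}
    return [answers.get(k, "") for k in range(1, n_questions + 1)]


response = ("1. Yes, the driver of V2. \n2. No, no one was killed.\n"
            "3. Two vehicles were involved.\n4. 2")
parts = split_numbered_answers(response)
print(parts)
```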

Finally, we define a function `add_responses_to_df` that takes a list of responses, applies `extract_responses`, concatenates the information to the original DataFrame and stores the resulting DataFrame.


In [None]:
def add_responses_to_df(responses, df, path_file_result):
    df_results = pd.concat([
        df.iloc[:len(responses)],
        pd.DataFrame(
            [extract_responses(r) for r in responses["response"]],
            columns=["s1", "s2", "s3", "s4", "r1", "r2", "r3", "r4"])],
        axis=1)
    df_results.to_excel(path_file_result)
    return df_results

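We load the responses for the first 1000 samples from a previous run of this notebook. The code below expects the file `NHTSA_responses_0999.csv` in the working directory; if your file was written to `results` by the export cell above, adjust the path accordingly.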

In [None]:
results = pd.read_csv("NHTSA_responses_0999.csv")
results

Unnamed: 0,response
0,"1. Yes, the driver of V2 sustained minor injur..."
1,"1. No, the driver of V2 was not injured in the..."
2,1. No one was injured.\n2. No one was killed.\...
3,"1. Yes, the driver of vehicle one (V1) was inj..."
4,"1. Yes, the 17-year-old male driver of Vehicle..."
...,...
995,"1. Yes, someone was injured.\n2. No, no one wa..."
996,"1. No, no one was injured.\n2. No, no one was ..."
997,1. No one was injured.\n2. No one was killed.\...
998,1. No one was injured.\n2. No one was killed.\...


In [None]:
df_res = add_responses_to_df(results, df, "./results/df_results.xlsx")

<a id='evaluation'></a>
<a name='evaluation'></a>
###  2.3 Performance Evaluation

Next, we want to evaluate the performance of our model.

Let's look at the confusion matrix of the predicted vs true number of involved vehicles:




In [None]:
y_true = df_res["NUMTOTV"]
y_pred = df_res["r4"]
labels = [str(i) for i in sorted(set(y_true).union(set(y_pred)))]
_ = evaluate_classifier(y_true, y_pred, None, labels,
                        "ChatGPT #vehicles", "cm_nv_chat_gpt")

ChatGPT #vehicles
accuracy score = 97.4%,  log loss = nan,  Brier loss = nan
classification report
               precision    recall  f1-score   support

           1       0.99      0.97      0.98       256
           2       0.98      1.00      0.99       625
           3       0.92      0.90      0.91        94
           4       1.00      0.75      0.86        20
           5       0.67      0.50      0.57         4
           8       1.00      1.00      1.00         1

    accuracy                           0.97      1000
   macro avg       0.93      0.85      0.88      1000
weighted avg       0.97      0.97      0.97      1000
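
The helper `evaluate_classifier` comes from `tutorial_utils.py` and is not shown here. As a reminder of how the figures in such a report are defined, the following sketch computes accuracy and per-class precision and recall in plain Python on a small hypothetical example. It is not the implementation used by the helper, and the labels are invented.

```python
from collections import Counter

# Hypothetical true and predicted numbers of vehicles for eight cases
y_true = [1, 1, 2, 2, 2, 3, 3, 4]
y_pred = [1, 2, 2, 2, 2, 3, 2, 4]

# accuracy: share of cases where the prediction equals the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives per class
support = Counter(y_true)    # number of true cases per class
predicted = Counter(y_pred)  # number of predictions per class

for label in sorted(support):
    # precision: correct predictions of this class / all predictions of this class
    precision = correct[label] / predicted[label] if predicted[label] else float("nan")
    # recall: correct predictions of this class / all true cases of this class
    recall = correct[label] / support[label]
    print(f"class {label}: precision {precision:.2f}, recall {recall:.2f}, "
          f"support {support[label]}")
print(f"accuracy {accuracy:.2f}")
```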



For the identification of cases with bodily injury, the performance looks as follows:

In [None]:
y_true = df_res["INJSEVB"]
# a case counts as injury if someone was injured (r1) or killed (r2): logical OR of the two flags
y_pred = 1 - (1 - df_res["r1"]) * (1 - df_res["r2"])
labels = [str(i) for i in sorted(set(y_true).union(set(y_pred)))]
_ = evaluate_classifier(y_true, y_pred, None, labels,
                        "ChatGPT injury", "cm_inj_chat_gpt")

ChatGPT injury
accuracy score = 88.1%,  log loss = nan,  Brier loss = nan
classification report
               precision    recall  f1-score   support

           0       0.87      0.81      0.84       382
           1       0.89      0.92      0.91       618

    accuracy                           0.88      1000
   macro avg       0.88      0.87      0.87      1000
weighted avg       0.88      0.88      0.88      1000
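
The expression `1 - (1 - r1) * (1 - r2)` used for `y_pred` is the logical OR of the two binary answers: a case is flagged if someone was injured or killed. A minimal check on a hypothetical DataFrame covering all four combinations:

```python
import pandas as pd

# all four combinations of the binary answers r1 (injured) and r2 (killed)
toy = pd.DataFrame({"r1": [0, 0, 1, 1], "r2": [0, 1, 0, 1]})

# arithmetic form, as used in the evaluation cell
arithmetic = 1 - (1 - toy["r1"]) * (1 - toy["r2"])
# equivalent boolean form
logical = ((toy["r1"] == 1) | (toy["r2"] == 1)).astype(int)

print(pd.DataFrame({"r1": toy["r1"], "r2": toy["r2"], "y_pred": arithmetic}))
```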



For both tasks, we can compare the accuracy score to the ones achieved by the supervised approaches examined in Part I of this tutorial.
We observe the following:
* The accuracy score is higher than with supervised training of a logistic regression classifier on the DistilBERT-encoded texts.
* The accuracy score is somewhat below the one obtained using task-specific fine-tuning of the DistilBERT model.

Note, however, that here we have not employed any task-specific training!

<a id='conclusion'></a>
<a name='conclusion'></a>
## 3.&nbsp;Conclusion

Congratulations!

In this Part III of the tutorial, you have used ChatGPT to extract features from text in an unsupervised fashion.

Advantages of this approach are certainly that no labels are required, and that it is very simple to implement.

On the other hand, execution time is longer than for the supervised approaches examined in Part I.

In terms of accuracy score, the approach used here performs better than supervised training of a logistic regression classifier on the DistilBERT-encoded texts, but somewhat worse than task-specific supervised fine-tuning of the DistilBERT model.
Note, however, that we haven't performed any fine-tuning in this notebook.

In practice, the unsupervised and supervised techniques could be combined, for instance by using ChatGPT to generate labels for a sufficiently large set of data, which is then used in a supervised setting.

If you have enjoyed this tutorial, feel free to apply any of the approaches - or improved versions, of course - to your own text data, to enrich your structured features available for supervised learning tasks.