![NVIDIA Logo](images/nvidia.png)

# Project: Generate Detailed Synthetic Emails

In this notebook you will use your fine-tuned GPT8B-based `generate_list` LLM function, in conjunction with GPT43B, to generate a collection of synthetic customer emails, each containing a variety of distinct and appropriate details.

---

## Learning Objectives

By the time you complete this notebook you will:
- Generate several collections of synthetic data for use in customer email generation.
- Generate at least 50 unique customer emails, each with distinct and appropriate details, written to a fictitious company of our creation.

---

## Imports

In [1]:
import random
import json

from tqdm.notebook import tqdm

from llm_utils.helpers import edit_list
from llm_utils.nemo_service_models import NemoServiceBaseModel
from llm_utils.models import Models
from llm_utils.llm_functions import (
    make_llm_function,
    generate_list_8B_lora as generate_list, 
    generate_customer_email as example_generate_customer_email
)

---

## Models

In [2]:
Models.list_models()

gpt8b: gpt-8b-000
gpt20b: gpt20b
gpt43b_2: gpt-43b-002
gpt43b: gpt-43b-001
llama70b_chat: llama-2-70b-chat-hf
llama70b: llama-2-70b-hf


---

## Exercise Main Objective

![Customer Emails](images/customer_emails.png)

The main objective of this project is to synthetically generate 50 unique customer emails, roughly 150 words each, addressed to a fictitious company of our creation called StarBikes, using an LLM function we will call `generate_customer_email`. Besides being a coherent, well-formed message, each email should contain text that specifies:
- The customer's distinct mood.
- The sender's name.
- Your company's name.
- The product the email is about, which should be appropriate to your company's industry.
- The location of the store where the product was purchased.
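
One rough way to confirm that an email carries the literal details listed above is a case-insensitive substring check. The sketch below is an assumption on our part and not part of `llm_utils`; `missing_details` is a hypothetical helper name. Mood is left out because it rarely appears verbatim in the text.

```python
def missing_details(email, **details):
    """Return the names of details whose text does not appear in the email.

    Hypothetical helper: a case-insensitive substring check, so it can miss
    details that the model paraphrased.
    """
    lowered = email.lower()
    return [name for name, value in details.items() if value.lower() not in lowered]


email = (
    "Dear NVIDI-OMG,\n"
    "I am very happy with my H100000. I bought it in Santa Clara.\n"
    "Sincerely,\nJosh"
)

# Each keyword argument is one detail the email is expected to mention.
print(missing_details(
    email,
    sender_name="Josh",
    company_name="NVIDI-OMG",
    product="H100000",
    store_location="Santa Clara",
))
```

An empty list means every checked detail was found; any names returned point at details worth inspecting by hand.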

---

## Exercise Example

Below is an example of the kind of email each of your 50 should resemble, generated with an LLM function of our creation, `example_generate_customer_email`.

In [3]:
sender_name = 'Josh'
company_name = 'NVIDI-OMG'
industry = 'tech'
product = 'H100000'
mood = 'happy'
store_location = 'Santa Clara'

In [4]:
example_customer_email = example_generate_customer_email(sender_name, 
                                                         company_name, 
                                                         industry, 
                                                         product, 
                                                         mood, 
                                                         store_location, 
                                                         top_k=4, 
                                                         temperature=0.7)

In [5]:
print(example_customer_email)

Dear NVIDI-OMG,
I am very happy with my H100000 and I love the way it makes me feel. I was in Santa Clara and saw it at a store and bought it right away. I have never been happier with a product. Thank you so much for creating it.
Sincerely,
Josh


---

## Exercise Constraints

In addition to the main objective just discussed, please adhere to the following exercise constraints.

### Create Company

Your customer emails should be addressed to StarBikes, our fictitious company, which is part of the bike industry.

In [6]:
company_name = 'StarBikes'
company_industry = 'bike'

### Synthetic Email Details

Aside from the company's name and industry, each customer email should contain unique instances of the customer's name, mood, product, and store location. To this end, you will need to create synthetically generated data representing instances of each of these categories for use in generating 50 unique customer emails.

For this part of your work, constrain yourself to using the LoRA fine-tuned GPT8B-based `generate_list` LLM function you created earlier. We have imported our solution implementation for you.

In [7]:
generate_list(5, 'affirmations')

['I am worthy', 'I am loved', 'I am beautiful', 'I am strong', 'I am smart']

### Synthetic Email Generation

When it comes time to generate the emails themselves, please use GPT43B.

In [8]:
email_generator_llm = NemoServiceBaseModel(Models.gpt43b.value)

---


# Exercise Walkthrough

## Generate Synthetic Customer Emails

This is the time to put your prompt engineering skills to the test. Begin by creating a prompt template you can use for synthetic email generation. The prompt template should include all of the details needed for an email discussed above.

Start by iteratively developing a prompt with GPT43B, and once you're satisfied with your prompt, capture it in a prompt template function that takes relevant email details and returns a well-formed prompt.

In [9]:
def customer_email_prompt_template(sender_name, company_name, industry, product, mood, store_location):
    return f"""\
Write a 150 word email from a customer named {sender_name} to the fictitious company {company_name} \
that ends in the customer's name.

Context: The mood of the customer {sender_name} is {mood}.

Instructions: Take the following steps in drafting this email from the customer {sender_name} to the {industry} company {company_name}:
1) The customer asks a question or makes a complaint about the following product: {product}.
2) The customer tells a brief story about a relevant experience they had with their {company_name} {product}.
3) The customer describes that they purchased the product at a {company_name} store location in {store_location}.
4) The customer signs off in a way that matches their mood using their name {sender_name}.
5) If the customer hasn't signed off with their name ({sender_name}) they sign off with their name ({sender_name}).
"""

In [10]:
customer_email_prompt = customer_email_prompt_template('Josh', 'NVIDI-OMG', 'tech', 'H100000', 'happy', 'Santa Clara')

In [11]:
email_generator_llm.generate(customer_email_prompt)

" Hi NVIDI-OMG,\nI just wanted to let you know that I love my H100000. I bought it at your store in Santa Clara. I've had it for a few years now and it's still working great. I'm really happy with my purchase.\nThanks,\nJosh"

---

## Create LLM Function

![Email LLM Function](images/email_llm_function.png)

With an appropriate model and prompt template, we can now create an LLM function, which we'll call `generate_customer_email`, to encapsulate the synthetic customer email generation task.

You can use `make_llm_function` along with your `email_generator_llm` model, your `customer_email_prompt_template`, and the `strip` function below as the `postprocessor`.
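
To make the roles of the model, the prompt template, and the postprocessor concrete, here is a minimal sketch of how such a factory could be put together. It is not the `llm_utils` implementation: `make_llm_function_sketch`, `EchoModel`, and `greeting_template` are hypothetical names, and the way keyword arguments are forwarded to `generate` is an assumption.

```python
def make_llm_function_sketch(model, prompt_template, postprocessor=None):
    """Hypothetical stand-in for `make_llm_function`; the real helper may differ."""
    def llm_function(*template_args, **generate_kwargs):
        # Positional arguments fill the prompt template.
        prompt = prompt_template(*template_args)
        # Keyword arguments (e.g. top_k, temperature) are assumed to go to the model.
        response = model.generate(prompt, **generate_kwargs)
        return postprocessor(response) if postprocessor else response
    return llm_function


class EchoModel:
    """Fake model used only so the sketch runs without a model service."""
    def generate(self, prompt, **kwargs):
        return f"  {prompt.upper()}  "


def greeting_template(name):
    return f"hello {name}"


greet = make_llm_function_sketch(EchoModel(), greeting_template, postprocessor=str.strip)
print(greet("Josh", temperature=0.7))
```

The postprocessor runs last, which is why a simple `strip` is enough to remove the leading space seen in the raw model response earlier.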

If you get stuck, feel free to check the *Solution* below.

In [12]:
def strip(response):
    return response.strip()

In [13]:
generate_customer_email = make_llm_function(NemoServiceBaseModel(Models.gpt43b.value), 
                                            customer_email_prompt_template, 
                                            postprocessor=strip)

In [14]:
generate_customer_email('Josh', 'NVIDI-OMG', 'tech', 'H100000', 'happy', 'Santa Clara')

"Hi NVIDI-OMG,\nI just wanted to let you know that I love my H100000. I bought it at your store in Santa Clara. I've had it for a few years now and it's still working great. I'm really happy with my purchase.\nThanks,\nJosh"

---

## Generate Synthetic Data for Prompt Template Parameters

![Email Details](images/email_details.png)

Ultimately, we want to scale the use of `generate_customer_email`, but before we can, we need to generate synthetic data for the following template parameters:

- sender_names
- products
- moods
- store_locations

To do this you will be using your LoRA fine-tuned GPT8B powered `generate_list` function, along with any other code required, to populate these lists.

---

## Generate Sender Names

Use `generate_list` to populate a `sender_names` list with 30 unique names. Don't forget you can pass in named arguments for `top_k` and `temperature` to get more variety out of your generated lists. Also recall from the previous section that `generate_list` with GPT8B performs best with list lengths of 7 or fewer, and that `generate_list` returns an empty list if the underlying model response is malformed.

Use the `edit_list` helper on your generated list to clean any responses you don't want.

If you'd like to see a solution for this section, expand the _Solution_ section below.

In [15]:
sender_name_queues = [
    "male names",
    "female names",
    "unusual names",
    "names that start with the letter V"
]

In [16]:
sender_names = []
while len(sender_names) < 30:
    for sender_name_queue in sender_name_queues:
        sender_name = generate_list(5, sender_name_queue, top_k=8, temperature=1)
        sender_names.extend(sender_name)
        sender_names = list(set(sender_names))
        print(len(sender_names))

5
8
13
18
22
25
30
34


In [17]:
len(sender_names)

34

In [18]:
edit_list(sender_names)

Label(value="Do you want to keep 'Lily' in the list?")

HBox(children=(Button(description='Keep', style=ButtonStyle()), Button(description='Remove', style=ButtonStyle…

Text(value='', placeholder='Enter replacement')

In [19]:
sender_names[:10]

['Lily',
 'Jim',
 'Vashti',
 'Violet',
 'Van Gogh',
 'Vera',
 'Layla',
 'Emma',
 'Simba',
 'Spock']
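
Because `generate_list` can return an empty list, a `while` loop like the one above could in principle spin for a long time. The sketch below shows one way to bound the work and to keep items in the order they were generated. `collect_unique` and `fake_generate_list` are hypothetical names introduced here; the stand-in generator only exists so the example runs without a model.

```python
import itertools


def collect_unique(generate, queries, target, n_per_call=5, max_rounds=20, **kwargs):
    """Cycle through queries until `target` unique items are collected.

    `generate` is any callable shaped like generate_list(n, query, **kwargs).
    Stops after `max_rounds` passes so repeated empty lists cannot loop forever.
    """
    items = []
    seen = set()
    for _ in range(max_rounds):
        for query in queries:
            for item in generate(n_per_call, query, **kwargs):
                if item not in seen:
                    seen.add(item)
                    items.append(item)
            if len(items) >= target:
                return items
    return items  # may be shorter than target if the generator kept failing


counter = itertools.count()


def fake_generate_list(n, query, **kwargs):
    """Stand-in for generate_list; returns an empty list every third call."""
    call = next(counter)
    if call % 3 == 2:
        return []
    return [f"{query} {call}-{i}" for i in range(n)]


names = collect_unique(fake_generate_list, ["male names", "female names"], target=12)
print(len(names), names[:3])
```

In the notebook, the same helper could be called as `collect_unique(generate_list, sender_name_queues, 30, top_k=8, temperature=1)`, assuming `generate_list` accepts those arguments as shown in the cell above.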

---

## Generate Products

Use your LoRA fine-tuned, GPT8B-powered `generate_list` function, along with any other code required, to populate a `products` list with 30 products appropriate to your company and its industry.

Don't forget you can pass in named arguments for `top_k` and `temperature`.

Use the `edit_list` helper on your generated list to clean any responses you don't want.

If you'd like to see a solution for this section, expand the _Solution_ section below.

In [20]:
bike_queues = [
    "parts sold at a bicycle shop",
    "kinds of bike",
    "bike accessories",
    "unusual things I would find at a bike store"
]

In [22]:
products = []
while len(products) < 40: # Overshooting in case some need to be edited out
    for bike_queue in bike_queues:
        bike_product = generate_list(5, bike_queue, top_k=8, temperature=1)
        products.extend(bike_product)
        products = list(set(products))
        #print(len(products))

In [23]:
len(products)

40

In [None]:
edit_list(products)

In [24]:
products[:10]

['mountain bike',
 'tandem',
 'wheel',
 'tires',
 'bike pump',
 'handlebar',
 'tire',
 'bike',
 'saddles',
 'toolkit']
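
The sample above contains both `'tires'` and `'tire'`, which `set` treats as different items. As an optional pre-pass before `edit_list`, the sketch below drops such near-duplicates. `dedupe_normalized` is a hypothetical helper, and its plural handling (dropping one trailing "s") is a deliberately naive assumption.

```python
def dedupe_normalized(items):
    """Keep the first item for each normalized key.

    The key lowercases, trims whitespace, and drops one trailing "s".
    This is a naive singular form, so treat it as a rough heuristic.
    """
    kept = []
    seen = set()
    for item in items:
        key = item.strip().lower()
        if key.endswith("s") and len(key) > 3:
            key = key[:-1]
        if key and key not in seen:
            seen.add(key)
            kept.append(item.strip())
    return kept


products_sample = ['mountain bike', 'tandem', 'wheel', 'tires', 'bike pump',
                   'handlebar', 'tire', 'bike', 'saddles', 'toolkit']
print(dedupe_normalized(products_sample))
```

Since this can shrink a list below 30 items, it is safest to run it before checking list lengths.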

---

## Generate Moods

Use your LoRA fine-tuned GPT8B powered `generate_list` function, along with any other code required, to populate `moods` with 30 moods the customer might be in.

Don't forget you can pass in named arguments for `top_k` and `temperature`.

Use the `edit_list` helper on your generated list to clean any responses you don't want.

If you'd like to see a solution for this section, expand the _Solution_ section below.

In [25]:
mood_queues = [
    "moods a happy customer might be in",
    "moods a disgruntled customer might be in",
    "moods an inquisitive customer might be in"
]

In [26]:
moods = []
while len(moods) < 30:
    for mood_queue in mood_queues:
        mood = generate_list(5, mood_queue, top_k=8, temperature=1)
        moods.extend(mood)
        moods = list(set(moods))
        #print(len(moods))

In [None]:
len(moods)

In [None]:
edit_list(moods)

In [27]:
moods[:10]

['frustrated',
 'contentment',
 'joyful',
 'content',
 'mad',
 'optimistic',
 'confident',
 'disappointed',
 'cheerful',
 'upbeat']

---

## Generate Store Locations

Use your LoRA fine-tuned, GPT8B-powered `generate_list` function, along with any other code required, to populate `store_locations` with 30 physical locations (city names, for example) where the customer might have purchased their product.

Don't forget you can pass in named arguments for `top_k` and `temperature`.

Use the `edit_list` helper on your generated list to clean any responses you don't want.

If you'd like to see a solution for this section, expand the _Solution_ section below.

In [28]:
store_location_queues = [
    "cities in California",
    "cities in Maryland",
    "cities in Alaska"
]

In [29]:
store_locations = []
while len(store_locations) < 30:
    for store_location_queue in store_location_queues:
        store_location = generate_list(5, store_location_queue, top_k=8, temperature=1)
        store_locations.extend(store_location)
        store_locations = list(set(store_locations))
        #print(len(store_locations))

In [30]:
len(store_locations)

30

In [None]:
edit_list(store_locations)

In [31]:
store_locations[:10]

['Alaska',
 'San Francisco',
 'Carroll County',
 'Baltimore',
 'San Jose',
 'Los Angeles',
 'Chesapeake Beach',
 'Anchorage',
 'Maryland',
 'Fairbanks']

---

## Check List Lengths

At this point `sender_names`, `products`, `moods` and `store_locations` should each have at least 30 unique items. Please run the following cell to confirm.

In [32]:
lists = {'sender_names': sender_names, 'products': products, 'moods': moods, 'store_locations': store_locations}
good = True
for k, l in lists.items():
    if len(set(l)) < 30:
        print(f'{k} only has {len(set(l))} items, please correct.')
        good = False

if good:
    print('All your lists have at least 30 items.')

All your lists have at least 30 items.


---

## Generate Synthetic Customer Emails

![Customer Emails](images/customer_emails.png)

Now that you have a `generate_customer_email` LLM function and synthetic data for all the details we would like to include, you're ready to create the synthetic customer emails.

Using `generate_customer_email`, populate an `emails` list with 50 synthetic emails.

Remember that you can set parameters like `top_k` and `temperature` to influence the creativity of your emails.

Here are the lists and values you've created to use in your calls to `generate_customer_email`.

In [33]:
company_name

'StarBikes'

In [34]:
company_industry

'bike'

In [35]:
random.choice(sender_names)

'Victoria'

In [38]:
random.choice(products)

'gloves'

In [37]:
random.choice(moods)

'upbeat'

In [39]:
random.choice(store_locations)

'Baltimore'

And for your reference, here are the arguments `generate_customer_email` expects.

```python
generate_customer_email(sender_name, company_name, industry, product, mood, store_location)
```

Before doing a large generation loop, be sure to try out one or several generations first.
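
For a trial run, it can help to draw one combination of details and look at it before calling the model. The sketch below is one way to do that; `sample_email_details` is a hypothetical helper, and the short lists are placeholders for the lists built earlier in the notebook.

```python
import random


def sample_email_details(sender_names, products, moods, store_locations, rng=random):
    """Pick one value from each list and return them as a dict of details."""
    return {
        "sender_name": rng.choice(sender_names),
        "product": rng.choice(products),
        "mood": rng.choice(moods),
        "store_location": rng.choice(store_locations),
    }


rng = random.Random(0)  # seeded only so the trial run is repeatable
details = sample_email_details(
    ['Lily', 'Jim'],
    ['bike pump', 'helmet'],
    ['upbeat', 'mad'],
    ['Baltimore', 'Anchorage'],
    rng=rng,
)
print(details)
```

The sampled values can then be passed to `generate_customer_email` in the argument order shown above, together with `company_name` and `company_industry`.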

Feel free to check out the *Solution* below if you get stuck.

### Solution

In [40]:
emails = []
progress_bar = tqdm(total=50)
while len(emails) < 50:
    sender_name = random.choice(sender_names)
    product = random.choice(products)
    mood = random.choice(moods)
    store_location = random.choice(store_locations)

    customer_email = generate_customer_email(sender_name, 
                                             company_name, 
                                             company_industry, 
                                             product, 
                                             mood, 
                                             store_location, 
                                             top_k=4, 
                                             temperature=1.0)
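    # Keep only non-empty emails we have not seen yet, so all 50 are unique
    if customer_email and customer_email not in emails:
        emails.append(customer_email)
        progress_bar.update(1)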

progress_bar.close()

  0%|          | 0/50 [00:00<?, ?it/s]

In [41]:
for email in emails[:5]:
    print(email+'\n')

Hello, I am writing to you because I have an issue with my bike pump. I bought the pump about a week ago at your Seward location, and it has already started leaking air. I've only used it once so far, so I don't think it's my fault. Please send a replacement. Thank you.

Hi StarBikes,
I just wanted to tell you that I am very excited to be using my StarBikes light. I purchased it from the Nome store location a couple of weeks ago.
I have always had trouble seeing at night, so I am very happy to have a light that I can easily see with.
Thanks,
Van Gogh

Dear StarBikes,
I am writing to tell you of my experience with your product, helmet. A few months ago, I went on a bike ride with my son and we were hit by a car. The impact caused me to hit my head on the pavement. I was wearing my StarBikes helmet at the time and I believe this saved me from more serious injury. Thank you so much for making such high quality products.
I bought my helmet at a StarBikes store in Washington.
Sincerely,
Joe
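
The objective asks for emails of roughly 150 words, and the samples above look shorter than that. A quick word-count summary can show how far off a batch is. This is a minimal sketch; `word_count_summary` is a hypothetical helper, and the two sample emails are made up for illustration.

```python
from statistics import mean


def word_count_summary(emails):
    """Summarize whitespace-separated word counts across a list of emails."""
    counts = [len(email.split()) for email in emails]
    return {"min": min(counts), "mean": mean(counts), "max": max(counts)}


sample_emails = [
    "Hi StarBikes,\nI love my new helmet.\nThanks,\nJoe",
    "Dear StarBikes,\nMy bike pump leaks air.\nLily",
]
print(word_count_summary(sample_emails))
```

If the mean is well below the target, adjusting the prompt template is the first thing to try.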

---

## Check Synthetic Email List Length

At this point `emails` should have at least 50 unique items. Please run the following cell to confirm.

In [43]:
num_emails = len(set(emails))
if num_emails < 50:
    print(f'You only have {num_emails}.')
else:
    print(f'Good job, you have {num_emails} emails.')

Good job, you have 50 emails.
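
The notebook imports `json` but does not show the emails being saved. If you want to keep them for later work, one option is to write the list to a JSON file. The sketch below assumes a hypothetical file name, `synthetic_emails.json`, and uses a temporary directory only so the example cleans up after itself.

```python
import json
import tempfile
from pathlib import Path


def save_emails(emails, path):
    """Write the list of email strings to `path` as JSON."""
    Path(path).write_text(json.dumps(emails, indent=2), encoding="utf-8")


def load_emails(path):
    """Read a list of email strings back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


sample = ["Hi StarBikes,\nThanks,\nJoe"]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "synthetic_emails.json"  # hypothetical file name
    save_emails(sample, path)
    restored = load_emails(path)

print(restored == sample)
```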
