<a href="https://colab.research.google.com/github/fck1023/python-random-quote/blob/master/Generate_Amazon_Book_Reviews_with_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
# import kagglehub
# cynthiarempel_amazon_us_customer_reviews_dataset_path = kagglehub.dataset_download('cynthiarempel/amazon-us-customer-reviews-dataset')

# print('Data source import complete.')

# --- Step 1: Mount Google Drive ---
from google.colab import drive
drive.mount('/content/drive')

# --- Step 2: Copy the ZIP archive from Drive to local Colab storage ---
# Assumes the original path below is correct
!cp "/content/drive/MyDrive/Colab Notebooks/Generate Amazon Book Review with transfomer/amazon_reviews_us_Books_v1_02.tsv.zip" /content

# --- Step 3: Unzip the local ZIP file (the step that was previously missing) ---
!unzip /content/amazon_reviews_us_Books_v1_02.tsv.zip -d /content/

# --- Step 4: Read the unzipped file with pandas ---
import pandas as pd

# Note: the .tsv extension means the file is tab-separated.
# Reading from the local /content path is much faster than reading from Drive.
file_to_read = '/content/amazon_reviews_us_Books_v1_02.tsv'

# When reading a .tsv file with read_csv, set the separator to '\t'
df = pd.read_csv(file_to_read, sep='\t', on_bad_lines='skip')  # on_bad_lines skips malformed rows

print("Data loaded successfully!")
print(df.head())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Archive:  /content/amazon_reviews_us_Books_v1_02.tsv.zip
  inflating: /content/amazon_reviews_us_Books_v1_02.tsv  
Data loaded successfully!
  marketplace  customer_id       review_id  product_id  product_parent  \
0          US     12076615   RQ58W7SMO911M  0385730586       122662979   
1          US     12703090    RF6IUKMGL8SF  0811828964        56191234   
2          US     12257412  R1DOSHH6AI622S  1844161560       253182049   
3          US     50732546   RATOTLA3OF70O  0373836635       348672532   
4          US     51964897  R1TNWRKIVHVYOV  0262181533       598678717   

                                                    product_title  \
0                      Sisterhood of the Traveling Pants (Book 1)   
1                   The Bad Girl's Guide to Getting What You Want   
2                          Eisenhorn (A Warhammer 40,000 Omnibus)   
3                          

# Generate Fake Amazon Book Reviews with Transformers

![notebook-horizontal-hero.jpg](attachment:4b313cb1-881b-49ce-8458-a8b4f6d0635a.jpg)

# It all started with a phone call..

> "Hey Joel, we've got a problem with fake book reviews at Amazon, and I was hoping you can help."

Sure, Jeff.  What's the problem?

> "Fake reviews are rampant on The Everything Store (you might know this as just Amazon.com), which of course reduces trust on the platform."

Have you considered training a machine learning model to identify and flag fake reviews?

> "Of course, but the problem is, we don't have the training data we need.  We'd need hundreds of thousands.. maybe millions, even, of fake reviews to train a reliable classifier."

What about using Amazon's own Mechanical Turk?

![8ean3d.jpg](attachment:7aebeaba-5a3a-4007-846f-8f278c7fc67b.jpg)

Uh.. haha, just kidding!  Terrible idea, what was I thinking.  If you're anything like other clients I've worked with, we need this "like, yesterday."  So no time for mere meatbags to write that crap.  Plus, why pay people when we can automate this?

> "My thoughts exactly.  What other ideas do you have?  This is why we pay consultants like you the big bucks!"

Umm, I haven't been paid anything ye..

> "Don't worry about that.  Do a great job with this, and maybe we'll even hire you for an AI/ML Engineering role at Amazon"

I'd love that.  Well, we could use a generative language model and fine-tune it on real Amazon book reviews, thus training it to learn to generate real-sounding reviews.  That would create our synthetic dataset we could then further use to train a classifier.

> "And it'd just put out new fake reviews, just like, abracadabra?"

Abracadabra.

> "If you could do that for me, I'd definitely reach out to Jassy and see what I could do about that engineering role, uh, situation."

That'd be great!  So, we'll want to use a Transformer-based model for this, trained as a Causal Language Model.  Is the deliverable for this assignment just the dataset, or do you need me to write that up as prose?  ;)

![8eao9r.jpg](attachment:a225d0e9-0f35-4992-87fa-2877d2895eef.jpg)

Never mind, I'm on it!

>  &lt;Click&gt;

<div class="alert alert-info">
  <strong>This should be obvious to everyone (though you never know these days): this intro is just for fun, and of course I've never actually spoken to Jeff Bezos in my life.  I have read <i>The Everything Store</i> (the biography about Jeff Bezos by Brad Stone) and I actually do respect Bezos.  He's a popular villain to hate on these days, but you don't build one of the world's most valuable companies without being a genius (and making some enemies).</strong>
</div>

<div class="alert alert-warning">
  <strong>Further, as a caveat with any generative model: please don't use this notebook to make fake reviews.  It is just for educational purposes.</strong>
</div>

OK, so to fine-tune our Causal Language Model, we're going to need a dataset.  Thankfully, the Amazon Reviews dataset is here to the rescue.  There are over 50GB (uncompressed) of English-language product reviews across dozens of Amazon categories.

We want to fine-tune a generative model so that it sees the prompt "A &lt;blank&gt;-star review of the book '&lt;book title&gt;':" and generates a review from that.  So we'll put the existing data into that format to train on.

# Load the Book Review Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

In [None]:
pd.set_option('display.max_colwidth', None) # To view full text of reviews

In [None]:
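# Set random seeds so notebook is reproducible
# NOTE: book_pdf is created by the loading cell further down; run that cell first.
# Set the random seed to 42
np.random.seed(42)
# Draw the first sample
print("--- First run, seed=42 ---")
print(book_pdf.sample(5))

# Set the same random seed again
np.random.seed(42)
# Draw the second sample
print("\n--- Second run, seed=42 again ---")
print(book_pdf.sample(5)) # This result is identical to the one above

--- First run, seed=42 ---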
        marketplace  customer_id       review_id  product_id  product_parent  \
902700           US     24286685   RFO1MC91Y3FRB  1558609164       290851787   
1769286          US     42226777   RGQC4H3CQEUNZ  0425174484       499536931   
902908           US     24283162  R124VACIXDMLH9  0743422074       707098124   
1972746          US     53045485  R10V9EIF70WRE7  1570622280       645081767   
1310871          US     47485795  R3ANAJ2PHIT08N  0878332669       464577643   

                                                                                                  product_title  \
902700   Business Intelligence: The Savvy Manager's Guide (The Morgan Kaufmann Series on Business Intelligence)   
1769286                                      Expecting Adam: A True Story of Birth, Rebirth, and Everyday Magic   
902908                                                                   Blink-182: Tales from Beneath Your Mom   
1972746                           

In [None]:
# Randomly sample 5 reviews from the full dataset
# No seed is set here, so the results differ on every run
print("--- First run ---")
print(book_pdf.sample(5))

print("\n--- Second run ---")
print(book_pdf.sample(5))

--- First run ---
        marketplace  customer_id       review_id  product_id  product_parent  \
858208           US     52844621   RVXBOTCCH7V3K  1891231758       773772678   
633267           US     42651320  R36JG1DT4GFNQT  0785261486       222922319   
2607945          US     49942573  R3DRO7WPUBOGX6  0060192119       316982155   
796447           US     38632362   RZBNVQ0RL7KLA  0263163687       347150430   
2651707          US     50138045   R2YQGYNSZ5GCH  0966382064       339000526   

                                                                           product_title  \
858208   Weight Loss Surgery: Finding the Thin Person Hiding Inside You - SECOND EDITION   
633267                        Brainwashed: How Universities Indoctrinate America's Youth   
2607945                             As Nature Made Him: The Boy Who Was Raised as A Girl   
796447                                           Bartaldi's Bride (Mills & Boon Romance)   
2651707                                      

In [None]:
# Book Pandas DataFrame
# We'll have more data than we can even use, so if there are any issues with a row, just skip it.
# The file has already been copied and unzipped into Colab's /content directory.
# This is a local path, so reading from it is very fast.
correct_local_path = '/content/amazon_reviews_us_Books_v1_02.tsv'

# Read from the local path
book_pdf = pd.read_csv(correct_local_path,
                       sep='\t',            # .tsv files are tab-separated
                       on_bad_lines="skip")

print("Successfully read the file from local Colab storage!")
print(book_pdf.head())

Successfully read the file from local Colab storage!
  marketplace  customer_id       review_id  product_id  product_parent  \
0          US     12076615   RQ58W7SMO911M  0385730586       122662979   
1          US     12703090    RF6IUKMGL8SF  0811828964        56191234   
2          US     12257412  R1DOSHH6AI622S  1844161560       253182049   
3          US     50732546   RATOTLA3OF70O  0373836635       348672532   
4          US     51964897  R1TNWRKIVHVYOV  0262181533       598678717   

                                                    product_title  \
0                      Sisterhood of the Traveling Pants (Book 1)   
1                   The Bad Girl's Guide to Getting What You Want   
2                          Eisenhorn (A Warhammer 40,000 Omnibus)   
3                                 Colby Conspiracy (Colby Agency)   
4  The Psychology of Proof: Deductive Reasoning in Human Thinking   

  product_category  star_rating  helpful_votes  total_votes vine  \
0            Books          4.0            2.0       

In [None]:
# Look at the shape of the data and the column headers we have
book_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3105370 entries, 0 to 3105369
Data columns (total 15 columns):
 #   Column             Dtype  
---  ------             -----  
 0   marketplace        object 
 1   customer_id        int64  
 2   review_id          object 
 3   product_id         object 
 4   product_parent     int64  
 5   product_title      object 
 6   product_category   object 
 7   star_rating        float64
 8   helpful_votes      float64
 9   total_votes        float64
 10  vine               object 
 11  verified_purchase  object 
 12  review_headline    object 
 13  review_body        object 
 14  review_date        object 
dtypes: float64(3), int64(2), object(10)
memory usage: 355.4+ MB


As we can see from this summary, there are about 3.1M book review entries loaded.  In the end, we'll only use the `product_title`, `star_rating`, and `review_headline`.  

The `review_headline` is the very brief review summary that you can write when leaving an Amazon review.  We'll use this instead of the full-length `review_body` for two reasons.

1. To save on required memory and runtime.
2. It's already a hard task to write a fake book review from only the title and the star rating.  Think about doing this yourself.. if you know nothing about the book but the title, how are you going to write a very detailed review?  Thus we want these reviews to be shorter and therefore somewhat more generic.

# Analyze & Clean the Data

In [None]:
# Drop any rows that don't have star ratings, product_titles, or review_headlines
book_pdf = book_pdf.dropna(subset=["star_rating", "product_title", "review_headline"])
print("Datapoint entries: ", book_pdf.shape[0])

Datapoint entries:  1902896


Rows with a blank `star_rating`, `product_title`, or `review_headline` were dropped.

Next, we'll drop any reviews that are not verified, so we can be more certain that the "real" reviews we're training on are actually real.

In [None]:
# Let's get rid of any non-verified reviews,
# so we can be more certain that the reviews we think are real, are real.

# Keep only verified purchases, then reset the index so it stays unique
book_pdf = book_pdf[book_pdf["verified_purchase"] == "Y"].reset_index(drop=True)
print("Datapoint entries: ", book_pdf.shape[0])


This *greatly* reduced the size of the dataset, to about a tenth of its original size, at 229K entries.  That's OK for this project, as we will drop more later to make the reviews balanced, and it's still more than we need for fine-tuning.  Because a Large Language Model comes with so much inherent "knowledge" about the world pre-existing in its parameters, it needs much less data to fine-tune it for a task than if you were to start from scratch.  A smaller dataset is also less unwieldy, and we won't need to use generator functions, as we can just store it all in GPU and system RAM.

Let's further reduce the size (and improve the quality of the training data) by removing reviews that don't have at least 3 "helpful" upvotes.  This will serve as a proxy for them being likely genuine, as reviews that are low-quality, self-promotional, or clearly written by a competing Amazon seller will probably be noticed by humans and not marked as helpful.

In [None]:
# We'll also look for something to have at least 3 "helpful" upvotes on
# the reviews as a proxy for them being likely genuine.  Stuff that's too
# spammy/markety probably won't get marked as helpful.

# Keep only rows with at least 3 helpful votes
# (rows with a missing vote count are dropped as well)
book_pdf = book_pdf[book_pdf["helpful_votes"] >= 3]
print("Datapoint entries: ", book_pdf.shape[0])

Datapoint entries:  1902922


We've reduced it by about half again, leaving only reviews that are helpful (and thus higher quality).

We're now finished with the extra columns that we won't directly use for training, so we can drop them.

In [None]:
# Only a few columns we care about:
# product_title, star_rating, and review_headline

# Drop all columns except those to speed up future processing
book_pdf = book_pdf[['product_title', 'star_rating', 'review_headline']]
book_pdf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1902922 entries, 1 to 3105369
Data columns (total 3 columns):
 #   Column           Dtype  
---  ------           -----  
 0   product_title    object 
 1   star_rating      float64
 2   review_headline  object 
dtypes: float64(1), object(2)
memory usage: 58.1+ MB


In [None]:
# book_pdf.head()

Let's clean up some funky stuff in the text so it doesn't get output when we generate later.  We don't want to do too much preprocessing here, as the tokenizer should be able to handle most of it.  Let's just get rid of reviews that contain URLs or HTML tags.
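
To see what a single pattern can catch, here is a small standalone sketch using Python's built-in `re` module.  The sample headlines are made up for illustration; the pattern assumes (as the cell below does) that the opening bracket of a tag arrives escaped as `&lt;` while the closing `>` is left as-is.

```python
import re

# An escaped opening bracket "&lt;", then as few characters as possible,
# up to a literal ">".
html_pattern = re.compile(r"&lt;.*?>")

# Hypothetical headlines, invented for this example
samples = [
    "A moving, beautifully written story",
    'See &lt;a href="http://example.com">my site</a> for more',
    "Loved it &lt;br> would read again",
]

# True where the headline contains a tag (and any link inside it)
flags = [bool(html_pattern.search(s)) for s in samples]
print(flags)
```

Note that a bare URL with no surrounding tag would not match this pattern, so it only catches links that were written as HTML anchors.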

In [None]:
# Let's just delete any reviews that contain html tags or urls.
# My hunch is these aren't common in a legit review, and urls
# seem like they'd be abused by marketers linking out, so this
# can be another cheap form of pruning anything but real reviews.

# Re-index since we've dropped many.  Probably not necessary.
book_pdf = book_pdf.reset_index(drop=True)

# Amazon has already converted the opening tags of HTML into &lt;, but not the closing tag,
# e.g. &lt;a href="blah">, so we can just look for this simple pattern and drop rows that
# contain it.  This will automatically cover urls too.
html_delete = r"&lt;.*?>"
# Use .str.contains() to build a boolean mask; na=False treats missing
# headlines as non-matches, which avoids a ValueError on NaN values.
has_html = book_pdf["review_headline"].str.contains(html_delete, regex=True, na=False)
book_pdf = book_pdf[~has_html]
print("Datapoint entries: ", book_pdf.shape[0])

In [None]:
# Let's take a brief look again at the data
book_pdf.head()

Let's now take a look at the distribution of star ratings to see how balanced our dataset is.

In [None]:
# Libraries for plotting
import seaborn as sns            # For visualizations
import matplotlib.pyplot as plt  # For plotting

In [None]:
# General function to generate column chart for any given attribute
def column_chart(attribute, label):
    plt.figure(figsize=(8, 6))
    sns.countplot(data=book_pdf, x=attribute)
    plt.xlabel(label)
    plt.ylabel('Count')
    plt.title(f'Distribution of {label}')
    plt.show()

column_chart('star_rating', 'Star Ratings')

In [None]:
book_pdf['star_rating'].value_counts()

There are far more 5-star reviews than anything else, by a factor of about 7x.

The easiest way to balance these is to just chop them all down to the minimal bucket, which would be 2-star reviews.  That will still be enough to train on, and again will help with memory usage.

But before we chop off the dataset, I actually want to leave just the shortest reviews to train on, because the longer a generated review is, the more obvious it will be that it's fake.  That's because there isn't enough context just from the title of a book to really say much specific about the book.

So let's sort our dataset on the `review_headline` column, so that when we chop off the datapoints, we leave the shortest ones.  We'll specify a minimum length of 15 characters for the review headline, otherwise we get reviews that are like "!" or "1".
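
As a rough illustration of the filter-then-sort idea, here is a plain-Python sketch on a handful of invented headlines.  The 15-character minimum comes from the text above; the sample strings and the name `MIN_CHARS` are made up for this example.

```python
MIN_CHARS = 15  # minimum headline length, as described above

# Hypothetical headlines, invented for this example
headlines = [
    "!",
    "Great",
    "A wonderful, moving story",
    "Could not put it down",
    "Disappointing ending overall",
]

# Drop headlines that are too short, then order the rest shortest-first
kept = sorted((h for h in headlines if len(h) >= MIN_CHARS), key=len)

for h in kept:
    print(len(h), h)
```

Very short entries like "!" are filtered out before sorting, so they never end up at the front of the list.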

In [None]:
# Sort the dataset by the length of the review_headline entries

# Add a temporary column with the length of each 'review_headline'
book_pdf['headline_length'] = book_pdf['review_headline'].str.len()

# Drop reviews that are shorter than the minimum length
book_pdf = book_pdf[book_pdf['headline_length'] >= 15]

# Sort the DataFrame based on the 'headline_length' column
book_pdf = book_pdf.sort_values(by='headline_length')

# Drop the additional column since we don't need it in the final result
book_pdf = book_pdf.drop(columns=['headline_length'])

# Display the head of the sorted data
print(book_pdf.head())

In [None]:
book_pdf['star_rating'].value_counts()

Finally, we're ready to chop the data down to a balanced number per bucket.  We'll shuffle the data first, then use the smallest number for pruning, which is 9450 reviews in the 2-star category.
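
The balancing idea can be sketched in plain Python on toy data: find the size of the rarest class, then keep at most that many rows per star rating.  The helper `balance_by_rating` and the sample rows are hypothetical, written only for this illustration.

```python
from collections import Counter

def balance_by_rating(rows, per_class):
    """Keep at most `per_class` rows for each star rating, in the order given."""
    kept, seen = [], Counter()
    for rating, headline in rows:
        if seen[rating] < per_class:
            kept.append((rating, headline))
            seen[rating] += 1
    return kept

# Hypothetical (star_rating, headline) rows, invented for this example
rows = [(5, "a"), (5, "b"), (5, "c"), (4, "d"), (4, "e"), (1, "f")]

# Size of the rarest class sets the cap for every class
smallest = min(Counter(rating for rating, _ in rows).values())
balanced = balance_by_rating(rows, smallest)

print(Counter(rating for rating, _ in balanced))
```

Because rows are kept in the order they arrive, whatever ordering (or shuffling) is applied beforehand decides which rows survive.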

In [None]:
# shuffle all the rows first, fraction of 1 (100%)
book_pdf = book_pdf.sample(frac=1)

# How many examples we want from each star-rating class
num_samples = 9450

# Drop everything after the first K samples, for each star rating
for rating in (5.0, 4.0, 3.0, 2.0, 1.0):
    extra_rows = book_pdf[book_pdf.star_rating == rating].index[num_samples:]
    book_pdf = book_pdf.drop(extra_rows)

book_pdf['star_rating'].value_counts()

In [None]:
column_chart('star_rating', 'Star Ratings')

# Refactoring the Text Sequences for Fine-Tuning

We want to fine-tune in such a way that we can generate the fake reviews later.  To generate a fake review, we'll need the title and the star rating we desire for the review.  The generative transformer model should then generate a fake review based solely on that.  So we need to get our dataset into a single column of sentences of that same structure to use as training examples.

For example:
'A 3-star review of the book "The Bad Girl's Guide to Getting What You Want": '

Note that we also could just do something like "&lt;star&gt; &lt;title&gt;:" but that wouldn't take any advantage of the inherent knowledge that the LLM has about the world.  The longer prompt gives it more context: it tells the model that we want a book review that corresponds to a particular star rating.
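
As a minimal sketch of that format, the two helpers below build the prompt prefix and a full training example.  The function names `build_prompt` and `build_training_example` are hypothetical, and the review headline in the usage lines is invented.

```python
def build_prompt(star_rating, title):
    """Return the prompt prefix for a given star rating and book title."""
    # int() turns a float rating such as 3.0 into "3"
    return f'A {int(star_rating)}-star review of the book "{title}": '

def build_training_example(star_rating, title, headline):
    """Return the prompt prefix followed by the real review headline."""
    return build_prompt(star_rating, title) + headline

prompt = build_prompt(3.0, "The Bad Girl's Guide to Getting What You Want")
print(prompt)

# Hypothetical headline, invented for this example
print(build_training_example(5.0, "The Evolution of Useful Things", "Surprisingly fun read"))
```

At generation time only the prefix is supplied, and the model is asked to continue from the trailing colon and space.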

Let's get our dataset into that format now.

In [None]:
# Make a new column containing the concatenation of the other columns,
# with our separator as described.
# Need to escape the "" that are meant to be left in the sequence.
# Also turn the star rating from a float to an int.
# e.g. "A 5-star review of the book \"The Evolution of Useful Things\": "
book_pdf['concat_sequence'] = book_pdf.apply(lambda x:
                                             'A ' + str(int(x['star_rating']))
                                             + '-star review of the book \"'
                                             + x['product_title'] + '\": '
                                             + x['review_headline'],
                                             axis=1)

book_pdf.head()

Let's look at a histogram of the final concatenated sequence lengths (the training phrases).

In [None]:
# Calculate the lengths of strings in the concatenated sequences
book_pdf['String_Length'] = book_pdf['concat_sequence'].apply(len)

# Plot histogram
plt.hist(book_pdf['String_Length'], bins=range(min(book_pdf['String_Length']),
                                               max(book_pdf['String_Length']) + 1),
                                    edgecolor='black')
plt.title('String Length Histogram (Characters)')
plt.xlabel('String Length')
plt.ylabel('Frequency')
plt.show()

And as a word count..

In [None]:
# Calculate the number of words in each string in concatenated sequences
book_pdf['Word_Count'] = book_pdf['concat_sequence'].apply(lambda x: len(x.split()))

# Plot histogram
plt.hist(book_pdf['Word_Count'], bins=range(min(book_pdf['Word_Count']),
                                            max(book_pdf['Word_Count']) + 1),
                                 edgecolor='black')
plt.title('Word Count Histogram')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()

We can see that most reviews have 40 words or fewer, but that there is a long tail of longer sequences.  Note that this is the entire concatenated sequence, so some of these could also be very long book titles.  It also includes our prepended prompt hint, "A X-star review of the book".  As a sanity check, the shortest the prompt part could possibly be (with a single-word book title) is 7 words, which seems right from the histogram.
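
The 7-word figure is easy to double-check by splitting a prompt with a one-word title on whitespace, the same way the word-count column was computed.  The title "Dune" is just a stand-in for any single-word title.

```python
# Prompt prefix with a hypothetical single-word book title
shortest_prompt = 'A 5-star review of the book "Dune": '

# Split on whitespace, as in the word-count histogram
words = shortest_prompt.split()
print(words)
print(len(words))
```

The quoted title and its trailing colon count as one word here, because `split()` only breaks on whitespace.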

# Start of Transformer model

We'll use the DistilGPT2 transformer model from Hugging Face for this project.

<div class="alert alert-info">
  <strong>
      📝 DistilGPT2 is a smaller version of GPT-2 (the model from the original OpenAI paper) with 82M parameters instead of the original 124M, reduced using knowledge distillation.  It ends up being about 2x faster than GPT-2 with nearly the same performance (in terms of next-word prediction accuracy).  For this simple task, I don't need an especially powerful model, so this should suffice.
    </strong>
</div>

In [None]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

In [None]:
# Tensorflow seed for reproducibility
tf.random.set_seed(42)

In [None]:
# Use DistilGPT2, a more efficient version of GPT-2
checkpoint = "distilgpt2"

# Added this padding_side argument so that later when I pad batches,
# it doesn't pad the right side (since decoder-only architecture!)
# Without this argument, we'd later have issues during inference if
# we want to do a batch of prompts all at once.  Since they are different
# lengths, some would need to be padded, so we need to define that when
# we initialize the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, padding_side='left')

# Reproducible results
tokenizer.seed = 42

# Use the Hugging Face model architecture for causal language models (CLM),
# TensorFlow variety.  It has 82M parameters, which seems high for
# fine-tuning (although maybe you fine-tune all of them).
model = TFAutoModelForCausalLM.from_pretrained(checkpoint)

# Seed all random generators
model.config.seed = 42

In [None]:
# Just a double check of the tokenized data for an example sentence

text = "Oh HAI, I'm just a plan 'ol input sentence prompt."

encoded_input = tokenizer(text, return_tensors='tf')
print(encoded_input)

output = model(encoded_input)

On this particular example, this uses 15 tokens.  GPT-2's tokenizer doesn't add start or end tokens by default, so that's 15 tokens for 10 words and 4 punctuation marks.  This isn't a perfect mapping, but maybe it's a little over a token per word for this tokenizer, on average.  Just trying to get a ballpark here.

## Initial Pipeline Results

Let's see how this bad boy does out of the box on an example.  We've given it **no** clues what it should do with this prompt.

In [None]:
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model=checkpoint)

# Make the pipeline's randomness deterministic too
set_seed(42)

generator("A 5-star review of the book \"Three Steps to Yes: The Gentle Art of Getting Your Way\": ",
          max_length=64,
          num_return_sequences=1,
          pad_token_id=tokenizer.eos_token_id)

In [None]:
generator("A 1-star review of the book \"The Evolution of Useful Things\": ",
          max_length=64,
          num_return_sequences=1,
          pad_token_id=tokenizer.eos_token_id)

Not great!

On other runs of this out-of-the-box pipeline, the output was often blank.  Once it gave me a charming anecdote about the reviewer's time in their college dorms.  But alas, nothing resembling book reviews.

<div class="alert alert-info">
  <strong>
    📝 Of course, the model doesn't know what we want to do, but I would have thought that with that very minimal information it could do a decent job based on its built-in knowledge.  But GPT-2 is from 2019, and the distilled version used here is a much smaller model (82M parameters, vs. 1.5B for the largest GPT-2) with worse performance than we're used to these days with ChatGPT and such.  
    </strong>
</div>

Let's now fine-tune it and see if we can do better.

# Prepare the data for fine-tuning

We'll need to truncate and pad the dataset.  

Looking at the histogram, we can see that most of the reviews fit into the 40-word range.  As we saw earlier, it's also not straightforward to determine how many tokens it takes to make the average word, but if we guess that it's a little more than one token per word (using more tokens for less frequent word pieces), we can round this up to a nice power of 2 and pick a `max_length` for a sequence of 64 tokens.  A power of 2 is nice because it helps fit into a matrix and memory more efficiently, though it's not crucial.
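
The back-of-the-envelope arithmetic can be written out as follows.  The 40-word figure comes from the histogram; the ratio of 1.25 tokens per word is an assumed stand-in for "a little more than one", not a measured value, and `next_power_of_two` is a helper written for this sketch.

```python
import math

def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

typical_words = 40       # read off the word-count histogram
tokens_per_word = 1.25   # assumed ratio, "a little more than one"

estimated_tokens = math.ceil(typical_words * tokens_per_word)
max_length_guess = next_power_of_two(estimated_tokens)

print(estimated_tokens, max_length_guess)
```

Any ratio between roughly 0.8 and 1.6 tokens per word lands on the same value of 64 for a 40-word sequence, so the choice is not very sensitive to the exact estimate.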

In [None]:
# Tokenizer key parameters
max_length = 64
batch_size = 32

## Batch encode the data

In [None]:
# Set the padding token to the EOS token.
tokenizer.pad_token = tokenizer.eos_token

# NOTE:
# Couldn't get this to work without using .tolist() method, but this isn't ideal
# because it means the entire dataset needs to fit into RAM!  Had to shrink my
# training data to make this work.

# Next time, look more into 'prepare_tf_dataset()' and 'to_tf_dataset()' methods.
# Tried the 'to_tf_dataset()' method already, but doesn't work with the batch-encoded
# methods.  Possibly instead use 'prepare_tf_dataset()' as shown here:
# https://huggingface.co/docs/transformers/training
# because it looks like this might batch as it goes, fixing the RAM issues.

# Tokenize the entire column using batch_encode_plus
tokenized_data = tokenizer.batch_encode_plus(
    book_pdf['concat_sequence'].tolist(),
    return_tensors='tf',
    padding=True,
    truncation=True,
    max_length=max_length
)

In [None]:
# Print the tokenized data
print("Input IDs:", tokenized_data['input_ids'])
print("Attention Mask:", tokenized_data['attention_mask'])

In [None]:
# # Examine some of the data types I'm dealing with
# print(type(tokenized_data))
# print(type(tokenized_data['input_ids']))
# print(tokenized_data['input_ids'][:1])
# print(tokenized_data.keys())

# The tokenized data only has keys 'input_ids' and 'attention_mask' for
# CLM (Causal Language Modeling)

# Prints:
# <class 'transformers.tokenization_utils_base.BatchEncoding'>
# <class 'tensorflow.python.framework.ops.EagerTensor'>
# tf.Tensor(
# [[50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
#   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
#   50256 50256 50256 50256 50256 50256 50256 50256 50256    32   642    12
#    7364  2423   286   262  1492   366  5756  4021 12167   286  4650 11597
#    1058   383  9745   276  2531 16672   351   968 45465   416   262  6434
#    1298  1052  1605 13745]], shape=(1, 64), dtype=int32)
# dict_keys(['input_ids', 'attention_mask'])

## Convert to Tensorflow Dataset

Prepare the data for a Causal Language Model (where our task is to predict the next word).

In a CLM dataset, the label is just the next word!  Note that the model already handles the shifting where the next word becomes the label for the current word, so we don't want to do this here or words would get doubly shifted.  In other words, for this particular model, the label for each word is just itself, which you can see in the code below.
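
To make the shifting concrete, here is a conceptual sketch in plain Python.  It is not the library's actual code, just an illustration of which token each position is asked to predict once the shift has been applied; the helper `next_token_pairs` and the toy tokens are made up.

```python
def next_token_pairs(token_ids):
    """Pair each position's input token with the token it should predict."""
    inputs = token_ids[:-1]   # every token except the last
    targets = token_ids[1:]   # the same sequence, shifted left by one
    return list(zip(inputs, targets))

# Toy tokens (strings instead of IDs, for readability)
tokens = ["A", "5", "-star", "review"]
pairs = next_token_pairs(tokens)

for seen, predict in pairs:
    print(f"after {seen!r} predict {predict!r}")
```

Since this shift happens inside the model, the dataset itself only needs to supply the unshifted sequence as both input and label.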

In [None]:
# Need to build our own causal dataset, where the labels are the same as the input tokens.
# Couldn't use the datacollatorforlanguagemodeling because the .to_tf_dataset() method
# is for HF dataset type, not BatchEncoding type.

# Convert to TensorFlow Dataset
tfds = tf.data.Dataset.from_tensor_slices((
    {
        'input_ids': tokenized_data['input_ids'],
        'attention_mask': tokenized_data['attention_mask']
    },
    tokenized_data['input_ids']  # this becomes the labels, labels are just the next word
                                 # (shifted internally inside the model)
))

# Batch the dataset (already shuffled earlier)
tfds = tfds.batch(batch_size=batch_size)

In [None]:
# Let's examine the first batched example from the dataset
for input_batch, label_batch in tfds.take(1):
    print("Input IDs:", input_batch['input_ids'])
    print("Attention Mask:", input_batch['attention_mask'])
    print("Label:", label_batch)
    print("=" * 50)

Above, we printed out the tokenized data for a single batch.  Let's break it down and see if it seems correct.

For a single batch, there should be three tensors for the CLM task: the input token IDs and the attention mask (what the transformer should attend to), both inside the input dictionary, plus the labels for which word to predict next.  We do see all three in the printout above.  We've also set a batch size of 32 and a max sequence length of 64, and we can see that each tensor has the correct shape.  We can see that every input ID token is the same as the label token, which as we discussed above, is correct, since the model will shift these internally by one for us.

<div class="alert alert-warning">
  <strong>Now here's something interesting</strong>.. why is every input sequence starting with the same stream of tokens?  Shouldn't there be only one &lt;SOS&gt; token?
</div>

Well, the reason is, we set the tokenizer above to use the &lt;EOS&gt; token as its padding token instead, with the following line:

`tokenizer.pad_token = tokenizer.eos_token`

We also set `padding_side="left"` earlier when we instantiated the tokenizer.  This keeps the real tokens at the end of each sequence, so that when we later generate from a batch of prompts of different lengths, each prompt ends right where generation begins.

As a result, everything is left-padded instead of right, and it's padded with the &lt;EOS&gt; token.  Note that the attention mask is smart and knows not to attend to these tokens (`mask=0`) and the mask is only set to 1 when we want it to attend to the data.
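
Here is a small plain-Python sketch of what left-padding with a matching attention mask looks like.  It mimics the shape of the tokenizer's output rather than calling it; the helper `left_pad` is hypothetical, and 50256 is the padding ID visible in the printed tensors above.

```python
PAD_ID = 50256  # the EOS token ID, reused here as the padding token

def left_pad(token_ids, max_length, pad_id=PAD_ID):
    """Truncate to max_length, then pad on the left and build the mask."""
    token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = [pad_id] * n_pad + token_ids
    attention_mask = [0] * n_pad + [1] * len(token_ids)
    return input_ids, attention_mask

# A short toy sequence padded out to length 6
ids, mask = left_pad([32, 642, 12, 7364], max_length=6)
print(ids)
print(mask)
```

The mask has a 0 for every padding position and a 1 for every real token, so the padding is present in the tensor but ignored by attention.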

Let's drill down into this once more, and look at only a single example (the first) from a single batch, and we'll also print out the decoded values by going back through the tokenizer to decode the IDs:

In [None]:
# Let's examine only the first example from the first batch
for input_batch, label_batch in tfds.take(1):
    print("Input IDs:", input_batch['input_ids'][0])
    print(tokenizer.batch_decode(input_batch['input_ids'][0]))
    print("Attention Mask:", input_batch['attention_mask'][0])
    print("Label:", label_batch[0])
    print("=" * 50)

As we said above, we can now see that the &lt;EOS&gt; token is used to left-pad the sequences.  We can also see that our guess of 1 token to 1 word was a good approximation, with only a few words being broken down into multiple tokens, like `Collect` + `ed` or ` Spe` + `eches`.

## Splitting the Dataset

We'll hold out 10% of the data as a validation set, just to fairly evaluate how training is going.  We won't do much hyperparameter tuning beyond tweaking the learning rate, which needs to be much lower for language modeling (on the order of 100x smaller).
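
To make the split arithmetic concrete, here's a minimal sketch in plain Python.  The numbers (15,000 samples per star rating, batch size 32) are example values only, not the real dataset size, and the helper `split_batches` is hypothetical.  A plain list stands in for the batched dataset, with slicing playing the role of `take` and `skip`.

```python
def split_batches(num_samples_per_rating, batch_size, num_ratings=5, train_fraction=0.9):
    """Return (num_batches, train_batches, val_batches) for a batched dataset."""
    # int() truncates, so a final partial batch is not counted
    num_batches = int(num_samples_per_rating * num_ratings / batch_size)
    train_batches = int(train_fraction * num_batches)
    return num_batches, train_batches, num_batches - train_batches


total, n_train, n_val = split_batches(15_000, 32)  # example numbers only

# A list of batch indices stands in for the batched dataset
batches = list(range(total))
train = batches[:n_train]   # like dataset.take(n_train)
val = batches[n_train:]     # like dataset.skip(n_train)
print(total, len(train), len(val))
```

Note that the split is done on whole batches, not on individual samples, so the two sets never share a batch.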

In [None]:
# Since we now have a TFDS prepared, let's split it,
# to 10% validation set and 90% training

# Example calculation:
# num_samples = 15_000  # per star rating, so multiplied by 5 star ratings.
# (NOTE: this isn't actual num_samples!  Just example to show the math.)
# batch_size = 32
# num_batches = int(num_samples * 5 / batch_size) = int(15,000 * 5 / 32) = int(2343.75) = 2343

num_batches = int(num_samples * 5 / batch_size)
print(f"{num_batches} batches of {batch_size} samples each batch.")

# Calculate the size of the training set (90%)
train_size = int(0.9 * num_batches)

# Split the dataset into training and val sets
train_tfds = tfds.take(train_size)
val_tfds = tfds.skip(train_size)

# Print the sizes (in batches) of the training and validation sets
print(f"Training set size: {train_size}")
print(f"Validation set size: {num_batches - train_size}")

## Model Setup and Compilation

We'll use a Learning Rate scheduler to slowly anneal the learning rate down to 0 at the very end.

In [None]:
# Need to define epochs here, because determines num_training_steps, which
# determines the LR scheduler.
num_epochs = 5
print(f"Epochs: {num_epochs}")

# The number of training steps is the number of total samples in the dataset,
# divided by the batch size, then multiplied by the total number of epochs.
# Note that the TF dataset here is a batched tf.data.Dataset,
# so its len() is already num_samples // batch_size.
num_train_steps = train_size * num_epochs
print(f"Training steps: {num_train_steps}")

# Called "polynomial," but in this basic form with default options, the learning rate will
# actually decay linearly.  As mentioned above, we need a much lower initial learning rate
# for language modeling tasks than for other ML tasks, partly because there are so many
# parameters, but also because the model is already pre-trained and we're just fine-tuning,
# so we want to avoid catastrophic forgetting.
lr_scheduler = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)

# language models need much smaller LR than defaults, and we'll use our schedule object here
opt = tf.keras.optimizers.Adam(learning_rate=lr_scheduler)

# Most HuggingFace models don't SoftMax at output!  Direct from logits
# loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Most HuggingFace models will automatically figure out the loss
# (typically what I've picked here), but can still set it ==> Update: bad idea!
model.compile(optimizer=opt,
              # NOTE:  This was my issue with training!
              # Don't specify my own loss, model now trains, see below text for explanation.
              #loss=loss,
)

## 🐛 Solved a Bug with the Training!

When initially training this model, I was having issues getting it to learn anything.  The weird thing is, the loss was decreasing consistently during training, but then when trying to use the model to generate new sentences, it would only output blank spaces.

I was kinda at a "loss" (ahem) for why my model wasn't learning.  My lucky break was, I tried to fix the code to suppress a warning I was getting (usually a good idea).  The warning was:

> "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input’s attention_mask to obtain reliable results."

As it turns out, that warning is fine and was a red herring.  But digging into where the warning came from led me to read through the source code for the model and realize that **the root cause of my issue was specifying my own loss function**.

I was able to find that this model uses `TFCausalLanguageModelingLoss` as the loss function.  It is indeed based on the Keras loss `SparseCategoricalCrossentropy(from_logits=True)`, which is what I was using, but it does some special masking to avoid taking certain tokens (those with label `-100`) into account when calculating the loss.  The special `-100` label marks positions, such as padding, that the model shouldn't be scored on.  Thus they need to be ignored, or masked out, when calculating the loss.  When I specified my own loss function, I picked the right loss equation, but didn't handle these `-100` labels, which is why my model failed to train!
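
To make the masking concrete, here's a minimal sketch of a masked cross-entropy in plain Python.  It only illustrates the idea and is not the actual `TFCausalLanguageModelingLoss` code; the function name `masked_cross_entropy` and the toy logits and labels are made up.

```python
import math

IGNORE_INDEX = -100  # label value that marks positions to skip


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy (from logits) over positions whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log-sum-exp of the logits
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        losses.append(log_sum_exp - row[label])  # -log softmax(row)[label]
    return sum(losses) / len(losses) if losses else 0.0


# Three positions, vocabulary of three tokens; the first position is masked
toy_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_labels = [IGNORE_INDEX, 1, 2]
print(masked_cross_entropy(toy_logits, toy_labels))  # only the last two positions count
```

A plain cross-entropy has no such skip step, so the masked positions would leak into the average.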

<div class="alert alert-warning">
  <strong>
      ⚠ Though you <i>can</i> set your own loss function with Hugging Face models, based on this experience, I wouldn't recommend it.  The loss for each model is already set to what it should be, and there is a great chance of missing some loss customization like what happened here, with essentially nothing to gain.  I was just trying to "test my understanding" by setting the loss manually, but ended up losing half a day.. though I guess in the end I did increase my understanding! 😁
  </strong>
</div>

In [None]:
# Can read the config of the model, but can also just get from the JSON info online
# by visiting https://huggingface.co/distilgpt2/resolve/main/config.json,
# which is found in the Transformers library source code.

# from transformers import AutoConfig
# config = AutoConfig.from_pretrained(checkpoint)
# # Access the configuration to get information about the loss function used
# print(config)

# Actual loss I want is here in the code:
# https://huggingface.co/transformers/v3.3.1/_modules/transformers/modeling_tf_utils.html#TFCausalLanguageModelingLoss

Let's look at what the model looks like..

In [None]:
model.summary()

About 82M parameters, as expected!  Better make sure we use a GPU.

In [None]:
# Check for GPU availability
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

## Train the Model

We're ready to fit!  Let's just add a custom callback so we can see the learning rate decrease.  We are going for 5 epochs and decreasing linearly from a learning rate of 5e-5 down to 0, so the learning rate at the start of each epoch should drop by 1e-5 each time, which makes it easy to check that it's decreasing correctly.
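
To see what values to expect, here's a minimal sketch of the linear schedule in plain Python.  It assumes the formula Keras documents for `PolynomialDecay` with the default `power=1.0`; the helper `linear_decay_lr` is hypothetical, and the 1,000 steps per epoch is a made-up number (the real value is `train_size`).

```python
def linear_decay_lr(step, decay_steps, initial_lr=5e-5, end_lr=0.0):
    """PolynomialDecay with power=1: a straight line from initial_lr to end_lr."""
    step = min(step, decay_steps)  # the rate stays at end_lr once decay_steps is reached
    return (initial_lr - end_lr) * (1 - step / decay_steps) + end_lr


example_steps_per_epoch = 1000  # hypothetical; stands in for train_size
example_epochs = 5
example_decay_steps = example_steps_per_epoch * example_epochs

for epoch in range(example_epochs):
    lr = linear_decay_lr(epoch * example_steps_per_epoch, example_decay_steps)
    print(f"Epoch {epoch + 1} starts at lr = {lr:.1e}")
# Epoch starts: 5.0e-05, 4.0e-05, 3.0e-05, 2.0e-05, 1.0e-05
```

The rate only reaches 0 at the very last training step, so the final epoch still starts at 1e-5.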

In [None]:
# Custom callback to see the learning rate and make sure it's shrinking
class PrintLearningRateCB(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        lr = float(tf.keras.backend.get_value(self.model.optimizer.learning_rate))
        print(f'Epoch {epoch + 1} - Learning Rate: {lr}')

The next cell takes about 30 minutes on a P100 GPU.

In [None]:
# Fit!
model.fit(train_tfds,
          validation_data=val_tfds,
          epochs=num_epochs,
          callbacks=[PrintLearningRateCB()],
         )

Now let's try generating some fake reviews with our trained model!

# Generating New Text

There are many algorithms we can use to generate new text, and I found [this blog post](https://huggingface.co/blog/how-to-generate) by Hugging Face to be a good refresher.  But don't worry, I'll walk through the various algorithms here as we go.

## Greedy Search

Greedy search is a text generation method where, given an input prompt (or "context"), we literally just pick the most likely next word at each step, given what we've seen so far.  Your phone's keyboard does a rough version of this when you are typing, by suggesting the next most likely word.  That is often based on older (pre-deep-learning) techniques, such as n-grams and other statistical methods.

Though greedy search is simple, it often ends up creating very repetitive text.
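
Here's a toy sketch of why that happens.  The next-word table below is entirely made up and only looks at the previous word (a real model conditions on the whole context), but it shows how always taking the top choice can fall into a loop.

```python
# Hypothetical next-word probabilities, keyed by the previous word only.
GREEDY_TOY_PROBS = {
    "this": {"book": 0.6, "is": 0.3, "was": 0.1},
    "book": {"is": 0.7, "was": 0.3},
    "is": {"a": 0.5, "great": 0.3, "this": 0.2},
    "a": {"great": 0.6, "book": 0.4},
    "great": {"book": 0.8, "read": 0.2},
}


def greedy_decode(prompt, max_new_tokens, table=GREEDY_TOY_PROBS):
    """Repeatedly append the single most likely next word."""
    words = prompt.split()
    for _ in range(max_new_tokens):
        candidates = table.get(words[-1])
        if not candidates:
            break  # no known continuation for this word
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)


print(greedy_decode("this", max_new_tokens=10))
# this book is a great book is a great book is
```

Once the most likely path leads back to a word it has already visited, greedy decoding has no way out of the cycle.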

In [None]:
# We'll use these same prompts throughout evaluation
prompt1 = "A 5-star review of the book \"Words to Comfort, Words to Heal\": "
prompt2 = "A 1-star review of the book \"The Evolution of Useful Things\": "
prompt3 = "A 3-star review of the book \"Crossing the Chasm\": "

In [None]:
# NOTE: Tokenizer takes a list of input prompts, like [prompt1].
# But if generating more than one prompt at a time, there will be an error because
# not all the input strings are the same length!  Tokenizer cannot convert them to tensors.
# To resolve this, use padding and truncation arguments of the tokenizer to make
# the input sequences all the same length.
encodings = tokenizer([prompt1, prompt2, prompt3],
                      return_tensors='tf',
                      padding=True,
                      truncation=True
                     )

# Use the newly trained model to generate new outputs
# Use max_new_tokens=max_length (which is 64 here), so outputs are no longer than the sequences they
# were trained on.
outputs = model.generate(**encodings,
                         max_new_tokens=max_length, # not including the input prompts
                         pad_token_id=tokenizer.eos_token_id,
                        )

Just once, let's look at the output tokens generated by the model:

In [None]:
print(outputs)

As you can see, outputs is a tensor with shape (3, 82); one output corresponding to each input prompt.  The output length also includes the input prompts.  We can see also the outputs are padded with the &lt;EOS&gt; special token at the end (see especially last example).  

But let's have the tokenizer decode these outputs so they are human-readable:

In [None]:
# Decode those generated token ids back to text
# Don't print out <SOS>, <EOS>, padding, etc.
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Easier to read if we unroll the list
for out in decoded:
    print(f"{out}\n")

Note how these examples tend to repeat themselves.  

Also, even if we didn't do any further optimizations (and even though the results aren't great), these results are *MUCH* better than the original response of the non-fine-tuned model.  To jog your memory, it was:

> 'A 1-star review of the book "The Evolution of Useful Things": ______________________________________ / The book "The Evolution of Useful Things": ______________________________________ / The book "The Evolution of Useful Things": ______________________________________ / The book "The Evolution of Useful Things": ______________________________________ / The book "The Evolution of'

## Beam Search

Instead of building the sequence one word at a time, which often doesn't give the best overall result, beam search keeps several candidate sequences ("beams") going in parallel.  As each new word is generated, the beams are ranked by how likely each whole sequence is so far (its cumulative probability, which is closely related to **perplexity**), and only the best ones are kept.  Beam search will generally find a sequence at least as likely as the one greedy search finds, but it isn't guaranteed to find the best overall "most likely output" phrase (that is, the one with lowest perplexity).
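
Here's a toy sketch of the idea in plain Python.  The probability table is made up (it's modeled on the small example in the Hugging Face blog post linked earlier) and only looks at the previous word, and the sketch drops beams that have no continuation, which a real implementation handles more carefully.

```python
import math

# Hypothetical next-word probabilities, keyed by the previous word only.
BEAM_TOY_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(start, num_steps, num_beams, table=BEAM_TOY_PROBS):
    """Return the surviving beams as (words, cumulative log-probability), best first."""
    beams = [([start], 0.0)]
    for _ in range(num_steps):
        candidates = []
        for words, score in beams:
            for word, prob in table.get(words[-1], {}).items():
                candidates.append((words + [word], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the num_beams most likely sequences so far
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy_words, greedy_score = beam_search("The", num_steps=2, num_beams=1)[0]
beam_words, beam_score = beam_search("The", num_steps=2, num_beams=2)[0]
print(" ".join(greedy_words), round(math.exp(greedy_score), 2))  # The nice woman 0.2
print(" ".join(beam_words), round(math.exp(beam_score), 2))      # The dog has 0.36
```

With a single beam this is just greedy search, which commits to "nice" and never sees that "dog has" is the more likely two-word continuation.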

In [None]:
# We don't need to generate encodings again with the tokenizer, since
# we're keeping the prompts the same!

outputs = model.generate(**encodings,
                         max_new_tokens=max_length, # not including prompts
                         num_beams=5,               # beam search on
                         do_sample=True,            # use sample probabilities for next word
                         num_return_sequences=1,    # could return more, just see best
                         pad_token_id=tokenizer.eos_token_id,
                         # Stop generation when all beam hypotheses reach the EOS token.
                         early_stopping=True,
                        )

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Easier to read if we unroll the list
for out in decoded:
    print(f"{out}\n")

Hmm, not so good, it basically didn't generate a review for the first two.

Beam search works well for structured tasks like machine translation or summarization, where the output is constrained by what the input was (you'd also use an Encoder-Decoder transformer architecture for this task, not a Decoder-only architecture, like GPTs).  But when doing open-ended text generation, there is no structure to guide the output, so even though we're searching with many beams, the lengths can vary greatly and results are mixed, as seen here.

## n-gram Penalties

Another strategy is to penalize repeating **n-grams** (groups of *n* tokens).  With `no_repeat_ngram_size=2`, for example, the model will never produce the same 2-token group twice.  So if you wrote an article about the "United States," it could only say "United States" once in the generated text (this ignores occurrences in the prompt itself, and just limits the output), because once it has generated that pair, the next time it predicts "United," it is forbidden from following it with "States" again!  This is quite limiting, but does result in less repetition.
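
Here's a minimal sketch of the bookkeeping behind that rule.  The function `banned_next_tokens` is hypothetical (it is not the library's implementation) and works on words instead of token IDs to keep it readable.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    # The last (ngram_size - 1) tokens are the start of the n-gram we'd be completing
    prefix = tuple(generated[len(generated) - ngram_size + 1:])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


so_far = ["the", "United", "States", "and", "the", "United"]
print(banned_next_tokens(so_far, ngram_size=2))  # {'States'}
```

At each step, the banned tokens would have their probability forced to zero before the next token is chosen.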

In [None]:
# We don't need to generate encodings again with the tokenizer, since
# we're keeping the prompts the same!

outputs = model.generate(**encodings,
                         max_new_tokens=max_length, # not including prompts
                         num_beams=5,               # beam search on
                         do_sample=True,            # use sample probabilities for next word
                         num_return_sequences=1,    # could return more, just see best
                         pad_token_id=tokenizer.eos_token_id,
                         # Stop generation when all beam hypotheses reach the EOS token.
                         early_stopping=True,
                         # Make no 2-gram appear twice, to try to reduce repetitions
                         no_repeat_ngram_size=2,
                        )

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Easier to read if we unroll the list
for out in decoded:
    print(f"{out}\n")

These results seem the best so far.

My favorite review so far:

> This book is a waste of time, money, and time.  If you're not a vegetarian, this is not for you!  Don't waste your time or your money on this book!  DON'T waste any money!

You can't *buy* this review gold on Mechanical Turk! 😎

## Sampling

Beam search (and greedy search, of course) look for high-probability sequences.  But studies have shown that real human language does not consistently pick the most probable next word, and so text that does will sound very robotic.

So we can make it sound more "real" and less predictable by introducing some randomness with **sampling**.  Rather than always picking the word with the highest probability, sampling draws the next word at random according to the probability distribution, so likelier words get picked more often but every word has a chance.  That bit of randomness makes the output less repetitive.

We've actually already turned on sampling earlier when we covered beam search, but we'll refine our sampling now.  In a sense, every word in the output vocabulary is fair game with this next spin, which should lead to quite random looking text.
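
Here's a toy sketch of what "choosing relative to the probability distribution" means.  The three-word distribution is made up, and the helper `sample_next_word` is hypothetical; it just wraps Python's `random.choices`.

```python
import random


def sample_next_word(probs, rng):
    """Draw one word, with each word's chance equal to its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


toy_probs = {"book": 0.6, "story": 0.3, "toaster": 0.1}  # made-up distribution
rng = random.Random(0)  # fixed seed so the run is repeatable

draws = [sample_next_word(toy_probs, rng) for _ in range(1000)]
counts = {word: draws.count(word) for word in toy_probs}
print(counts)  # roughly proportional to the probabilities
```

Greedy search would pick "book" every single time, while sampling still lets "toaster" through about one time in ten.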

In [None]:
# We don't need to generate encodings again with the tokenizer, since
# we're keeping the prompts the same!

outputs = model.generate(**encodings,
                         max_new_tokens=max_length, # not including prompts
                         do_sample=True,            # sample probabilities for next word
                         pad_token_id=tokenizer.eos_token_id,
                        )

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Easier to read if we unroll the list
for out in decoded:
    print(f"{out}\n")

Incoherent gibberish!  The words follow basic English grammar rules, but they don't really make any sense when you start reading them.

## Tweak the Temperature

One way to try to mitigate this is to play with the **temperature**, which is a knob we have to tweak when doing the random sampling.  Temperature defaults to 1 (as in the above case), and when we use a low temperature (T < 1), we sharpen the probability distribution of predicted next words, such that we increase the likelihood of higher probability words and decrease the probability of lower probability words.  Similarly, we could *increase* the temperature (T > 1), which would do the opposite, and make words even more random seeming.  But they're already bad enough!
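
Here's a minimal sketch of what the temperature does to the numbers.  The three logits are made up; the only assumption is the usual definition, where the logits are divided by the temperature before the softmax.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate words
for temp in (2.0, 1.0, 0.6):
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability piles onto the top-scoring word; as it rises, the three words move closer to equally likely.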

Let's try lowering the temperature to 0.6 and see what that change alone does to the generated outputs.

In [None]:
# We don't need to generate encodings again with the tokenizer, since
# we're keeping the prompts the same!

outputs = model.generate(**encodings,
                         max_new_tokens=max_length, # not including prompts
                         do_sample=True,            # sample probabilities for next word
                         pad_token_id=tokenizer.eos_token_id,
                         # Sharpen the probability curve (higher chance for more common words,
                         # lower chance for less common words)
                         temperature=0.6,
                        )

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Easier to read if we unroll the list
for out in decoded:
    print(f"{out}\n")

That definitely looks better than with temperature=1 (which was basically random words with a correct-looking structure), but it's still not great, so let's try some other strategies.

Note that as the temperature approaches 0, sampling becomes equivalent to greedy decoding and has the same problems that greedy decoding has.

## Top-K Sampling

**Top-K Sampling** essentially just takes the top *K* words from the entire probability distribution, and redistributes the probability distribution amongst those.  For example, if you looked at the Top-10 words at each step, then the probability of those words would add up to 1.0 (whereas before the redistribution, the only way to sum the probability distribution to 1.0 would be to include ALL words in the vocab).  This eliminates unlikely next words, but keeps some sampling in effect.
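
Here's a minimal sketch of that redistribution step.  The five-word distribution is made up and `top_k_filter` is a hypothetical helper, not the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most likely words and rescale their probabilities to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {word: prob / total for word, prob in top}


toy_probs = {"book": 0.5, "story": 0.25, "novel": 0.15, "toaster": 0.07, "zebra": 0.03}
top3 = top_k_filter(toy_probs, k=3)
print({word: round(prob, 3) for word, prob in top3.items()})
# {'book': 0.556, 'story': 0.278, 'novel': 0.167}
```

The unlikely words ("toaster", "zebra") can no longer be drawn at all, and the survivors keep their relative ordering.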

In [None]:
# We don't need to generate encodings again with the tokenizer, since
# we're keeping the prompts the same!

outputs = model.generate(**encodings,
                         max_new_tokens=max_length,  # not including prompts
                         do_sample=True,             # sample probabilities for next word
                         pad_token_id=tokenizer.eos_token_id,
                         top_k=50,                   # Top-50 words (or tokens, technically)
                        )

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Easier to read if we unroll the list
for out in decoded:
    print(f"{out}\n")

## Top-P (Nucleus) Sampling

One issue with Top-K is that K is fixed, so whether our next-word distribution is pretty flat (many words are similarly likely) or very sharp (a few words are much more likely), we'd always keep the same number of candidates.  So instead of sampling from the K most likely words, **Top-P Sampling** (or nucleus sampling) fixes a *probability* instead.  It keeps the smallest set of most likely words whose probabilities add up to at least P, so the number of candidates grows or shrinks with the shape of the distribution.  For example, with P = 0.95, we'd only pick from the most likely words that together cover 95% of the probability.

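We can use this by setting `top_p` to a value between 0 and 1.  Note that `top_p` defaults to 1.0 (no filtering), and in practice, most LLMs use a `top_p` somewhere between 0.90 and 0.95.  Note also that `top_k` defaults to 50 in Hugging Face's generation settings, so Top-K filtering stays on alongside Top-P unless we explicitly pass `top_k=0`.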
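
Here's a minimal sketch of the nucleus idea in plain Python.  Both toy distributions are made up and `top_p_filter` is a hypothetical helper, not the library's implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}  # rescaled to sum to 1


sharp = {"book": 0.9, "story": 0.05, "novel": 0.03, "toaster": 0.02}
flat = {"book": 0.25, "story": 0.25, "novel": 0.25, "toaster": 0.25}

print(len(top_p_filter(sharp, p=0.92)))  # 2 words are enough
print(len(top_p_filter(flat, p=0.92)))   # all 4 words are needed
```

With the same `p`, the sharp distribution keeps only two candidates while the flat one keeps all four, which is exactly the flexibility a fixed K doesn't have.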

In [None]:
# We don't need to generate encodings again with the tokenizer, since
# we're keeping the prompts the same!

outputs = model.generate(**encodings,
                         max_new_tokens=max_length, # not including prompts
                         do_sample=True,            # sample probabilities for next word
                         pad_token_id=tokenizer.eos_token_id,
                         top_k=0,                   # turn off Top-K so only Top-P filters
                         top_p=0.92,                # Use Top-P Nucleus sampling instead
                        )

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Easier to read if we unroll the list
for out in decoded:
    print(f"{out}\n")

## Putting It All Together

Results are still rather bizarre.  Also, one thing to notice is that the longer the generated sequences, the more random things the LLM is going to bring up.  Perhaps we can get a more realistic sounding fake review by forcing the model to be more terse?

It also tends to generate all the way or almost all the way up to the max length, but most of the training data was much shorter.  In fact, if you look at the histogram we created earlier, we see a peak of most training examples taking 20 tokens.  This *includes* even the prompt in that generation, so we could go even shorter since `max_new_tokens` doesn't include the prompt.  But let's not overconstrain it; to include both sides of the bell curve, we'll set `max_new_tokens=32` instead.

We can even combine methods, like turning beam search back on, and can even run top_k and top_p at the same time.  The Hugging Face model will perform top-k sampling, and then top-p sampling within that.  We can turn on n-gram avoidance to help cut down repetitions.  We can also set an `early_stopping` flag, which will stop generation when all beam hypotheses reach the &lt;EOS&gt; token.

In [None]:
# We don't need to generate encodings again with the tokenizer, since
# we're keeping the prompts the same!

outputs = model.generate(**encodings,
                         max_new_tokens=32,         # not including prompts
                         do_sample=True,            # sample probabilities for next word
                         pad_token_id=tokenizer.eos_token_id,
                         top_k=250,
                         # Use Top-P sampling.  As approaches 1, the more all words are included!
                         top_p=0.92,
                         no_repeat_ngram_size=3,    # Make no 3-gram appear twice, reduce rep.
                         num_beams=5,               # Beam search on
                         num_return_sequences=1,    # could return more, just see best
                         # Stop generation when all beam hypotheses reach the EOS token
                         early_stopping=True,
                        )

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Easier to read if we unroll the list
for out in decoded:
    print(f"{out}\n")

At the time I ran this notebook, the final output was:

> A 5-star review of the book "Words to Comfort, Words to Heal":  This book is a must-have for any self-medicants, not just for those with ADD or ADHD, but for anyone with ADD/ADHD  
> A 1-star review of the book "The Evolution of Useful Things":  This book is a waste of time, time, and money, and a lot of time and money.  This is not a good book for anyone to read  
> A 3-star review of the book "Crossing the Chasm":  This is a good book, but I'm not sure I'll ever be able to finish it again.    If you're going to read this,  

At another time I ran it, the output was:

> A 5-star review of the book "Words to Comfort, Words to Heal":  It's a wonderful way to help a loved one understand and understand the meaning of words to comfort, words to heal, and words to help you cope with pain  
> A 1-star review of the book "The Evolution of Useful Things":  It's not what I expected, but I didn't get it.  I'm not sure what I was expecting.    I was disappointed.  
> A 3-star review of the book "Crossing the Chasm":  A good read, but it's not what you'd expect for a good book on the subject, and it's a little too much of a read for a  

👍 Not too bad!  You could now write a script to iterate a million times and generate a million sorta-decent fake reviews.  But we won't do that because we're good people, so let's give Jeff a call to show him our results.

# Conclusion

"Hey Jeff, this is Joel.  I've got those results we talked about, and they look pretty good.  I think you could use them to gener..

> Oh yeah, heya buddy.  So, it turns out, I'm not gonna be needing that anymore.. so..

What do you mean?!  I just spent a bunch of time doing all this..

> Yeah, I understand.  So I just got off the phone with Andy and it sounds like Elon's got the same problem with bots and fake tweets over on X, so, I guess it's not really a problem for us, either.

What! You don't have to have everything that Elon..

![8eeddx.jpg](attachment:26236c69-1e5f-409a-bd8f-902b3b6bacb5.jpg)

I'll keep my eyes peeled for that offer letter from Amazon in my inbox.  😉

## One last thing..

<div class="alert alert-info">
    If you’ve enjoyed following along with me or have learned something new, ☝ <b>please consider UPVOTING this notebook!</b> 🔼 It helps others discover my notebook and encourages me to spend my time writing more of these.  And follow me here on Kaggle to read more notebooks like this and embark on a journey together towards AI/ML knowledge! 😊
</div>

💡 **You might also enjoy these other notebooks of mine as well:**

**Natural Language Processing**
* [Generate Amazon Book Reviews with Transformers](https://www.kaggle.com/code/quackaddict7/generate-amazon-book-reviews-with-transformers/)  
* [Write Your Own CliffNotes (Book Text Summarizer)](https://www.kaggle.com/code/quackaddict7/write-your-own-cliffnotes-book-text-summarizer/)  
* [Answering Questions from Product Reviews](https://www.kaggle.com/code/quackaddict7/answering-questions-from-product-reviews/)  
* [Ingredient Standardization via Machine Translation](https://www.kaggle.com/code/quackaddict7/ingredient-standardization-via-machine-translation/)  
* [What Kaggle WON'T Tell You About Your Notebooks](https://www.kaggle.com/code/quackaddict7/what-kaggle-won-t-tell-you-about-your-notebooks/)  

**Computer Vision**
* [Enhance, enhance, enhance! (image upscaling)](https://www.kaggle.com/code/quackaddict7/enhance-enhance-enhance-image-upscaling)  
* [Detecting Vehicles in Traffic (Object Detection)](https://www.kaggle.com/code/quackaddict7/object-detection-detecting-vehicles-in-traffic/)  
* [Legions of Lesions (Detecting Skin Cancer with Computer Vision)](https://www.kaggle.com/code/quackaddict7/legions-of-lesions-detecting-skin-cancer-with-cv/)  
* [Creating Synthetic Wildfire Images with Unreal Engine (Blog Post)](https://joelwigton.com/synthetic-data-for-machine-learning-with-unreal-engine)  
* [How I scored Top 5% on MNIST (without cheating!)](https://www.kaggle.com/code/quackaddict7/how-i-scored-top-5-on-mnist-without-cheating/)  

**Machine Learning**
* [Detecting Android Malware from App Permissions](https://www.kaggle.com/code/quackaddict7/detecting-android-malware-from-app-permissions/)  
