# Language models
ChatGPT has brought language models to the forefront of our HAII conversations! But what is a language model? How does it work? In this assignment, we're going to explore a simplified (compared to GPT) language model to understand some of the basic building blocks of how these work.

A language model is an algorithm that takes a sequence of words and outputs the likely next word in the sequence. Most language models output a list of words, each with its probability of occurrence. For example, if we had a sentence that started `I would like to eat a hot`, then ideally the algorithm would predict that the word `dog` had a much higher chance of being the next word than the word `meeting`. Many of you may be familiar with some aspects of this concept from CS136's WordGen lab or CS134's AutoComplete lab assignment, although in those cases you were likely using characters rather than entire words in your language models.
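
To make the "list of words, each with its probability" idea concrete, here is a minimal sketch of a word-level bigram model, which only looks at the single previous word. It is far simpler than the model used later in this notebook, and the corpus and function names are made up purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    follower_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follower_counts[current_word][next_word] += 1
    return follower_counts

def next_word_probabilities(model, word):
    """Return (next_word, probability) pairs, most likely first."""
    counts = model.get(word.lower(), Counter())
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common()]

# Tiny made-up corpus, purely for illustration.
toy_corpus = [
    "I would like to eat a hot dog",
    "she bought a hot dog",
    "he drank a hot coffee",
]
toy_model = train_bigram_model(toy_corpus)
print(next_word_probabilities(toy_model, "hot"))
```

In this toy corpus, `dog` follows `hot` twice and `coffee` once, so `dog` gets the higher probability. A word the model has never seen simply yields an empty list.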

Language models are a very powerful building block in natural language processing. They are used for classifying text (e.g. is this review positive or negative?), for answering questions based on text (e.g. "what is the capital of Finland?" based on the Wikipedia page on Finland), and language translation (e.g. English to Japanese).

## The intuition behind why language models are so broadly useful
How can this simple-sounding algorithm be that broadly useful? Intuitively, this is because predicting the next word in a sentence requires a lot of information, not just about grammar and syntax, but also about semantics: what things mean in the real world. For instance, we know that `I would like to eat a hot dog` is semantically reasonable, but `I would like to eat a hot cat` is nonsensical.

I trained a simple language model, and asked it to predict the word following `I would like to eat a `. We get: `hamburger`

The rest of this notebook will describe how to set up a language model using Python modules available to us, to make such word predictions as well as classifications. We will use these modules to classify and generate text in an automated chatbot in a live social media system and then reflect on the process, bringing together all our class discussions about conversational agents, natural language technologies, and human-in-the-loop systems!
    

The data is in a JSON file, so we are using the `read_json` method to read-in the data. If you have different data that is CSV, use the `read_csv` method instead. 

We use the `lines=True` argument here because the author formatted each line as a separate JSON object. At least half of your time as a data scientist/AI researcher is spent dealing with other people's data formats!
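
As a rough illustration of what "each line is a separate JSON object" means, the sketch below parses two made-up records with Python's standard `json` module. The field names match the columns used later in this notebook, but the headlines themselves are invented for the example.

```python
import json

# Two made-up records in "JSON Lines" form: one complete JSON object per line.
json_lines_text = (
    '{"is_sarcastic": 1, "headline": "area man wins argument with himself"}\n'
    '{"is_sarcastic": 0, "headline": "city council approves new budget"}\n'
)

# Parse each non-empty line on its own. Calling json.loads on the whole
# string at once would fail, because it is not a single JSON document.
records = [json.loads(line) for line in json_lines_text.splitlines() if line.strip()]

print(len(records))
print(records[0]["headline"])
```

Passing `lines=True` to `read_json` asks pandas to treat the file in this same line-by-line way.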


# Step 0: Set-up for this Assignment

You'll need to install some new modules to complete this assignment. Set-up instructions are always available via the project README in the GitHub repository.

Once you've done so, you should be able to run the following code:

In [None]:
# Follow install instructions in the README, prior to running this notebook!

# If using Google Colab, you can uncomment these two lines:
#from google.colab import drive
#drive.mount('/content/drive')

from fastai.text.all import *

# Step 1: Load all the data 
In this activity, we are going to use a dataset of headlines from the satirical news site, [The Onion](https://www.theonion.com), as well as some non-sarcastic news sources. The dataset is from [Kaggle](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection).

The dataset is available as a JSON file in the included folder, `lib`.


In [1]:
from pathlib import Path
data_path = Path('./lib')
# If using Google Colab, probably:
#data_path = Path('./drive/MyDrive/lib')

import pandas as pd
headlines = pd.read_json(data_path/'Sarcasm_Headlines_Dataset_v2.json', lines=True)

In [None]:
headlines

As you can see, some of this dataset is drawn from _The Onion_; the rest is drawn from places like the Huffington Post, which publishes real news, not satire.

## Step 1a: Examine the data set (5% effort)

Before we go off adventuring, let's first see what this dataset looks like. 

### Q1.1: How large is this dataset? Is it balanced? (1% effort)

**ANSWER:**  _Write your answer here._

In [4]:
# Insert code here to check size of dataset, and how many are positive (is_sarcastic = 1) and how many negative?
# Hint: Your output will look like this:

is_sarcastic
0    14985
1    13634
Name: count, dtype: int64
28619


### Q1.2: How many words on average is each headline? (4% effort)
Longer text = more information. We want to see what the length of the headline is in order to see how much information it may have. 

_Hint:_ See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html (the `\s` regex matches whitespace characters such as spaces).
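
To see how counting whitespace relates to counting words, here is a small standard-library sketch on made-up strings. It is not the pandas solution itself. The helper name `count_words` is hypothetical, and the approach assumes words are separated by runs of whitespace.

```python
import re

def count_words(text):
    """Estimate the word count as (number of whitespace runs) + 1."""
    stripped = text.strip()
    if not stripped:
        return 0
    return len(re.findall(r"\s+", stripped)) + 1

# Made-up strings, purely for illustration.
toy_headlines = ["students protest new policy", "rising seas", "budget"]
word_counts = [count_words(h) for h in toy_headlines]
average_length = sum(word_counts) / len(word_counts)
print(word_counts, average_length)
```

The same idea (count the separators, then add one) carries over to a pandas column via `Series.str.count`.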

**ANSWER:**  _Write your answer here._

In [None]:
# Insert code here to find the average length of headline (in words)

# Step 2: Build a language model that knows how to write news headlines

This is the first step of our project that will be using a machine learning model. 

We are going to use the [fast.ai](https://fast.ai/) library to create this model. If you need help with understanding this section, look at the fast.ai documentation -- it is fantastic! The steps below are modified from the [online tutorial](https://docs.fast.ai/tutorial.text.html#The-ULMFiT-approach).

In [None]:
import fastai
from fastai.text.all import *

*Note: if this import fails for you, make sure you've installed fastai first. This is not super straightforward. Close the notebook and follow the instructions in the README.*

In [None]:
dls = TextDataLoaders.from_df(headlines, path=data_path, text_col='headline', is_lm=True, valid_pct=0)
print("This code takes less than a minute to run on Iris' laptop and Colab.")

First, we tell fastai that we want to work on a list of texts (headlines in our case) that are stored in a dataframe (that's the `TextDataLoaders.from_df` part). We also pass in our data path, so that after we process our data, we can store it at that location. We tell it where to look for the headline in the dataframe (which column to use, `text_col=`). We specify that it is a language model (`is_lm`), so it uses the "next word" as the label for each sequence of words. But what kind of validation data do we need for a language model? Remember that a language model predicts the next word in an input sequence of words. So, we can't just take some of the headlines and set them aside as validation data. Instead, we want to use all the sentences and validate whether we can guess the right next word some fraction of the time. So, we tell it not to split off a validation set (`valid_pct=0`).
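
The "next word as the label" idea can be sketched in a few lines of plain Python. This is only a conceptual illustration, not how fastai builds its batches internally. The function name is hypothetical and the example headline is made up.

```python
def next_word_examples(tokens):
    """For each position, pair the words seen so far with the word that follows."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Made-up headline, split on spaces for simplicity.
tokens = "students protest new tuition policy".split()
for words_so_far, label in next_word_examples(tokens):
    print(words_so_far, "->", label)
```

Every headline therefore yields several training examples, one per word after the first, without anyone having to label the data by hand.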

## Step 2a: Learn the model

Now that we have the data, it's time to train the model.

Now, we *could* learn a language model from scratch. But we're instead going to cheat. We're going to use a pretrained language model, and fine-tune it for our purpose. Specifically, we're going to use a model trained on the `Wikitext-103` corpus.

One way to understand it is to think of our pretrained model as a model that can predict the next word in a Wikipedia article. We want to train it to write headlines instead. Since headlines still have to sound like English (i.e., follow grammar and syntax, be generally plausible, etc.), being able to predict the next word in Wikipedia is super useful. It allows us to start with a model that already knows some English, and then just train it for writing headlines.



In [None]:
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.5)

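`AWD_LSTM` is the model architecture; by default, `language_model_learner` loads weights for it that were pretrained on Wikipedia text (the `Wikitext-103` corpus mentioned above).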

Let's train it.

In [None]:
learn.fit_one_cycle(1, 1e-2)
print("This code takes Iris' laptop less than a minute to run, and up to 20 minutes on Colab.")

Once trained, it's time to write some headlines! We give it a starting sequence `Students protest ` and see what it comes up with. 

_Note:_ Don't worry about a `UserWarning: Your generator is empty.` warning, unless the next two lines of code don't work!

_Note:_ If you were unable to get the model training/building working, it _might_ be possible to download the pre-built `headlines-lm.pkl` model that's available [via Google Drive](https://drive.google.com/drive/folders/1aKO1eWeGJmHZFrslRnMQRV1PVFiPv6eB?usp=sharing) and load it with `load_learner(fname='lib/headlines-lm.pkl')`. If everything is working, don't worry about this comment!

In [None]:
learn.predict("Students protest ", n_words=5, no_unk=True)

Pretty good, huh? 

In [None]:
learn.predict('The Fed is expected to', n_words=3, no_unk=True)

OK, it's not perfect! Let's make it a little better. 

The `unfreeze` below is telling fastai to allow us to change the weights throughout the model. We do this when we want to make the model generate text that's more similar to our headlines (than to Wikipedia). 

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(1, 1e-3)
print("This code takes Iris' laptop less than a minute to run, 20 minutes on Colab.")

In [None]:
learn.predict('New Study', n_words=5)

In [None]:
learn.predict('16 Problems', n_words=5)

OK, now let's save our hard work. We'll use this later. (Pssst: why is it called an encoder? Look at the fastai docs to find out!)

In [None]:
learn.save_encoder('headlines-awd.pkl')

Note that we also want to save the whole model, so we can reuse it in our chatbot. 


In [None]:
learn.export('headlines-lm.pkl')

## Step 2b: See how well the language model works (15% effort)

Try generating a few more headlines. Then, answer the following questions. Wherever possible, show what code you ran, or what predictions you asked it for. 

*Suggestion: Try using punctuation, numbers, texts of different lengths, etc.*

### Q2.1: What is the effect of starting with longer strings? (5% effort)

We could start our headline generation with just one word, e.g. `learn.predict('White', n_words=9)` or with many: `learn.predict('White House Says Whistleblower Did', n_words=5)`. What is the difference you see in the kinds of headlines generated?

**ANSWER:**  _Write your answer here._

In [None]:
print(learn.predict('White', n_words=9))
print(learn.predict('White House Says Whistleblower Did', n_words=5))

### Q2.2: What aspects of the task of generating headlines does our language model do well? (5% effort)
For example, does it get grammar right? Does it know genders of people or objects? etc.

**ANSWER:**  _Write your answer here._

In [None]:
# Optional code block.

### Q2.3: What aspects of the task of generating headlines does our model do poorly? (5% effort)
What does it frequently get wrong? Why might it make these mistakes?

**ANSWER:**  _Write your answer here._

In [None]:
# Optional code block.


# Step 3: Learn a classifier to see which headlines are satire

Remember, our dataset has some stories that are satire (from _The Onion_) and others that are real. Now, we're going to train a classifier to distinguish one from the other. 

In [None]:
dls_clas = TextDataLoaders.from_df(headlines, path=data_path, text_col='headline', label_col='is_sarcastic', valid_pct=0.2, text_vocab=dls.vocab)
print("This code takes Iris' laptop and Google Colab less than a minute to run.")

We're using a similar DataLoaders method as we did for our language model above. Here, we specify the target column with `label_col` and we use `valid_pct=0.2` so we keep some fraction of our dataset as a validation set. There is one other trick: `text_vocab=dls.vocab` ensures that our classifier only uses words that we have in our language model -- so it never deals with words it hasn't encountered before. (Consider: why is this important?)

See if you can work out what the other arguments are. 

In [None]:
dls_clas.show_batch()

Above: what our data looks like after we apply the vocabulary restriction. `xxunk` is an unknown word. `xxpad` is a padding character used to ensure all strings are the same length. 
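
The two special tokens can be illustrated with a toy sketch. This is not fastai's implementation. The helper names and the tiny vocabulary are made up, and padding at the end of each sequence is an assumption made for simplicity.

```python
UNK, PAD = "xxunk", "xxpad"

def restrict_to_vocab(tokens, vocab):
    """Replace any token that is not in the vocabulary with the unknown marker."""
    return [tok if tok in vocab else UNK for tok in tokens]

def pad_batch(token_lists):
    """Pad every sequence to the length of the longest one (here: at the end)."""
    longest = max(len(toks) for toks in token_lists)
    return [toks + [PAD] * (longest - len(toks)) for toks in token_lists]

# Made-up vocabulary and headlines, purely for illustration.
toy_vocab = {"students", "protest", "new", "policy", "rising", "seas"}
toy_batch = [
    restrict_to_vocab("students protest new zoning policy".split(), toy_vocab),
    restrict_to_vocab("rising seas".split(), toy_vocab),
]
for row in pad_batch(toy_batch):
    print(row)
```

Here `zoning` is outside the toy vocabulary, so it becomes `xxunk`, and the shorter headline is filled out with `xxpad` so that both rows have the same length.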

Below: we're creating a classifier:

In [None]:
classify = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

Remember that language model encoder we saved earlier? It's time to load it back!

In [None]:
classify.load_encoder('headlines-awd.pkl')

What's happening here? 

Here's the trick: a language model predicts the next word in a sequence using all the information it has so far (all the previous words). When we train a classifier, we ask it to predict the label (satire or not) instead of the next word. 

The intuition here is that if you can tell what the next word in a sentence is, you can tell if it is satirical. (Similarly, if you can tell what the next word in an email is, you can tell if it is spam, etc.)

In [None]:
classify.fit_one_cycle(1, 1e-2)
print("This line takes Iris' laptop about 1 minute to run, Colab 7 minutes.")

In [None]:
classify.freeze_to(-2)

Above: this is similar to the `unfreeze()` we used before, except that it only allows the last few layers of the model to change. Then we can train again, just as we did after `unfreeze()`.

In [None]:
classify.fit_one_cycle(1, 1e-3)
print("This line takes Iris' laptop about 1 minute to run, Colab 8 minutes.")

Wow! An accuracy of >85%! (_Consider: Where is accuracy reported?_) That sounds great, and for not that much work. 

Now, let's try it on some headlines, to see how well it does. 

# Step 4: Try out the classifier (20% effort)

In [None]:
classify.predict("Despair for Many and Silver Linings for Some in California Wildfires")

Here in the output, the first part of this tuple is the chosen category (`0`, i.e. not satire). The second part of the result is the index of "0" in our data vocabulary, and the last part is the probabilities attributed to each class (98.1% for `0`, leaving about 1.9% for `1`, since the two probabilities sum to 100%). The classifier suggests that the headline (which I got from the [New York Times](https://www.nytimes.com/2019/10/29/us/california-fires-homes.html?action=click&module=Top%20Stories&pgtype=Homepage)) is not satire and it seems pretty confident that's the case (98.1% probability).
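
One way to turn such a tuple into something readable is sketched below. The helper name and the class descriptions are hypothetical, and the example values are made up to mirror the description above. With a real fastai prediction the index and probabilities are tensors, which the `int(...)` and `float(...)` conversions are assumed to handle.

```python
def describe_prediction(label, label_index, probabilities,
                        class_names=("not satire", "satire")):
    """Summarize a (label, index, probabilities) result as a short sentence fragment."""
    index = int(label_index)
    confidence = float(probabilities[index])
    return f"{class_names[index]} ({confidence:.1%} confident)"

# Made-up values shaped like the tuple described above, not real model output.
example_result = ("0", 0, [0.981, 0.019])
print(describe_prediction(*example_result))
```

The confidence reported is simply the probability the classifier assigned to the class it chose.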

## Step 4a: Try out this classifier (10% effort)

Below, try the classifier with some headlines, real or made up (including made up by the language model above). 


### Q4.1: Two headlines that the classifier correctly classifies 

**ANSWER:** 

1. _Write your answer here_
2. _Write your answer here_

### Q4.2: Two headlines that the classifier classifies incorrectly 

**ANSWER:** 

1. _Write your answer here_
2. _Write your answer here_

In [None]:
# Cell Block for showing your work with attempted headlines

# Real Non-Satire Articles

# Real Satire Articles

# Made up Non-Satire Articles

# Made up Satire Articles

# Real headlines with 1 word changed


Now, we want to find two headlines that the classifier is really confident about, but classifies incorrectly. We want the confidence of the prediction to be at least 85%.

One headline is anything you want to write. Another must be a real headline (not satire) that you could trick the classifier into misclassifying by changing only one word. For instance, taking `"Despair for Many and Silver Linings for Some in California Wildfires"`, a real NYTimes headline, you can change it to `"Despair for Many and Silver Linings for Some in Oregon Wildfires"` (note that this particular change does not cause the classifier to misclassify).

### Q4.3: One headline that the classifier classifies incorrectly, with false high confidence. (4% effort)

**ANSWER:**  _Write your headline, classification, and confidence here._

### Q4.4: One real headline, with one word changed, that the classifier classifies as satire, with false high confidence. (4% effort)

**ANSWER:** _Real headline: Write your answer here_

**ANSWER:** _Headline with one word changed that classifies as satire: Write your answer here_

**ANSWER:** _Confidence level: Write your answer here_


## Step 4b: What kinds of headlines are misclassified? (10% effort)

### Q4.5: Write your hypothesis below on what kinds of headlines are misclassified. 
Show your work. (fastai v1 used to have a `TextClassificationInterpretation` utility that doesn't appear to exist anymore, but might've been helpful for this step!). 

**ANSWER:**  _Write your answer here._

In [None]:
# Show your work here

# Step 5: Save your classifier
Now that we've trained the classifier, you're ready for Part 2. You'll use this saved file in your bot later.

In [None]:
classify.export(fname='satire_awd.pkl')

Later, you'll use it like so.

In [None]:
serve_classifier = load_learner(fname=str(data_path)+'/satire_awd.pkl')
serve_lm = load_learner(fname=str(data_path)+'/headlines-lm.pkl')

In [None]:
serve_classifier.predict('How the New Syria Took Shape')

In [None]:
serve_lm.predict('Rising Seas', n_words=7)

# Step 6: Set-up your Discord Bot (20% effort)

"Discord is an instant messaging and VoIP social platform which allows communication through voice calls, video calls, text messaging, and media and files. Communication can be private or take place in virtual communities called "servers"." [src](https://en.wikipedia.org/wiki/Discord)

We're going to use our language model to create a chatbot on a social media platform. For this assignment, we're going to use [Discord](https://discord.com/), but in reality, we could connect our language model to any social network with an API. Recently, lots of companies have been charging money to access API accounts (such as [Twitter](https://fortune.com/2023/02/10/elon-musk-twitter-charging-money-api-research-tool-scientist-explains-damage/) and [Reddit](https://techcrunch.com/2023/04/18/reddit-will-begin-charging-for-access-to-its-api/)), but there are still other accessible options out there, such as Slack or Mastodon...or Discord!

Follow the instructions on Glow for this Assignment for creating the Discord bot. You will need to download the Discord app as well as create an account if you don't already have them! You will also need to create a Discord server for your bot to post messages to. You could consider sharing a Discord server with some classmates; so long as you all have "Manage Server" access, you should be able to invite your bots.

### Q6.1: Reflect on the sign-up process (5% effort) 
As you go through the process of signing up for a Discord developer account, reflect on the questions you are asked, why you are being asked them, and how you think they will serve the intended purpose (or not). 

**ANSWER:**  _Write your answer here._

### Task 6.2: Test whether your script has access to Discord

We're going to store our Bot API Token in a separate file, so that Iris doesn't have access to your Discord account. Create a `credentials.py` file in the same directory as this notebook, with the following format (replace the `XXX` with your bot token):
```
BOT_TOKEN = 'XXX' # API Key
```

In [None]:
# If you need to force install the bot libraries, you can uncomment the following 2 lines:
#!pip install nest_asyncio 
#!pip install discord # uncomment this if you're using Google Colab
from credentials import BOT_TOKEN
import nest_asyncio # allows us to run the bot via Jupyter Notebooks
import discord

The next step is to create event handlers for our bot to use when it's in our Discord server:

In [None]:
##### DISCORD BOT CODE #####
nest_asyncio.apply() # Jupyter/Colab notebooks only
## BOT SETUP ##
intents = discord.Intents.default()
intents.messages = True
client = discord.Client(intents = intents)

## EVENT HANDLERS ##
@client.event
async def on_ready():
    print(f'We have logged in as {client.user}')

## END EVENT HANDLERS ##

client.run(BOT_TOKEN) # runs the bot in a loop
# !!! REMEMBER to stop this cell when you're done with it!

What is happening in the code above? The first 4 lines are required to set up the bot, and the last line runs it. Because that last line runs in an infinite loop, you'll have to stop the cell before moving on to later cells!

The section in between, titled `## EVENT HANDLERS ##`, contains methods with special names that the Discord API calls when certain events happen. This is part of an event-listening paradigm in which the software waits for events to be raised and then uses _event handlers_, such as these specially named methods, to respond to each event. The `on_ready` event handler is called once the bot is ready to start, so you should see the `We have logged in as BOT NAME` message at the bottom of the code output.
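
The event-listening idea can be sketched without Discord at all. The toy class below is not the discord.py implementation: its handlers are plain functions rather than `async` coroutines, and `TinyClient` and `dispatch` are hypothetical names. It only shows how a decorator can register a function under its own name so that it is called when a matching event arrives.

```python
class TinyClient:
    """A toy stand-in for an event-driven client (not discord.py itself)."""

    def __init__(self):
        self._handlers = {}

    def event(self, func):
        """Decorator: register a function under its own name, e.g. 'on_ready'."""
        self._handlers[func.__name__] = func
        return func

    def dispatch(self, event_name, *args):
        """Call the handler registered for this event, if there is one."""
        handler = self._handlers.get(event_name)
        if handler is None:
            return None  # nobody is listening for this event
        return handler(*args)

toy_client = TinyClient()

@toy_client.event
def on_ready():
    return "logged in"

@toy_client.event
def on_message(text):
    return f"received: {text}"

print(toy_client.dispatch("on_ready"))
print(toy_client.dispatch("on_message", "hello"))
print(toy_client.dispatch("on_typing"))  # no handler registered
```

This is why the function names matter: a handler with a misspelled name would be registered under an event that never fires.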

### Task 6.3: Use the `on_message` event handler
There is another event handler method, called `on_message`, which we use in the cell below. Try running this:

In [None]:
##### DISCORD BOT CODE #####
nest_asyncio.apply() # Jupyter/Colab notebooks only
## BOT SETUP ##
intents = discord.Intents.default()
intents.messages = True
client = discord.Client(intents = intents)

## EVENT HANDLERS ##
@client.event
async def on_ready():
    print(f'We have logged in as {client.user}')

@client.event
async def on_message(message):
    if message.author == client.user: # prevents infinite responses to itself
        return

    print("Message received!")

    if client.user.mentioned_in(message):
        print('\t', "username:", str(message.author))
        print('\t', "message:", str(message.content))
        print('\t', "channel:", str(message.channel))
        print('\t', "timestamp:", str(message.created_at))     

## END EVENT HANDLERS ##

client.run(BOT_TOKEN) # runs the bot in a loop
# !!! REMEMBER to stop this cell when you're done with it!

### Q6.4 Let's try some experiments: (3% effort)

1. What happens when you run this code?
1. What happens if you go to your Discord Server and send a message on any channel?
1. What happens if you message your bot on any channel with a message like `@HAIIbot hello!` (replace `@HAIIbot` with your bot's name)?
1. Does it matter if you type-out your bot's name fully, or select it from the drop down menu with the green dot (when you start typing out your bot's name)? ...'cos it does on Iris' machine!

**ANSWER**
1. _Write your answer here_
2. _Write your answer here_
3. _Write your answer here_
4. _Write your answer here_

The `on_message` event handler is called anytime a message is sent to a channel that the bot can read. We can identify messages in which our bot is @-mentioned by checking if `client.user.mentioned_in(message)`.

We can also retrieve the messages sent to our Discord server, and some of the information associated with it. The important details are:

* `message.author` - the Discord username of the user who sent the message
* `message.content` - the text of the returned message
* `message.channel` - channel of the message
* `message.created_at` - time the message was sent

### Task 6.5: Send bot messages

And, of course, you want to be able to send new messages too, which is done with the `.send` method:

_Hint: Remember to STOP any cell blocks that are running an infinite loop before moving to new cell blocks!_

In [None]:
##### DISCORD BOT CODE #####
nest_asyncio.apply() # Jupyter/Colab notebooks only
## BOT SETUP ##
intents = discord.Intents.default()
intents.messages = True
client = discord.Client(intents = intents)

## EVENT HANDLERS ##
@client.event
async def on_ready():
    print(f'We have logged in as {client.user}')

@client.event
async def on_message(message):
    if message.author == client.user: # prevents infinite responses to itself
        return

    print("Message received!")
    await message.channel.send("What'd you say?!")

## END EVENT HANDLERS ##

client.run(BOT_TOKEN) # runs the bot in a loop
# !!! REMEMBER to stop this cell when you're done with it!

### Q6.6: Adapt the code from above to reply in-thread. (2% effort)
You will likely want to not just send a message, but actually reply to the original message. You can do this by adding a keyword parameter to the send method: `    await message.channel.send('this is a reply!', reference=message)`

Place this reply line above in the previous cell to have the bot reply in-thread to messages.

### Q6.7: Adapt the code from above to reply in-thread, using additional message information. (10% effort)

Copy the code from above and modify it to respond to Discord messages referencing your bot (i.e., messages containing `@YourBotName`), and then reply to those messages with something using the message's text such as the number of characters in the original post (or something! Be creative!).

Of course, this is just the tip of the iceberg of what you can do with the Discord API. If you’re interested, there is a lot of documentation online. You might start with the [official discord.py docs](https://discordpy.readthedocs.io/en/latest/index.html).

In [None]:
# Write your code here!

# Step 7: Integrate the bot with the satire classifier (40% effort)
Now that you can do basic replies with your bot, it’s time to make it do something useful! Specifically, our bot should do two things: 

1. When someone sends a headline in a message that @-mentions the bot, it replies with whether the headline is satire. 
2. It also makes up a headline that plays off the original headline, and sends it back in its reply. 

Here’s an example (assuming the bot’s called @HAIIbot):

**User: @HAIIbot Rising Seas Will Erase More Cities by 2050, New Research Shows**

**HAIIbot: Yeah, looks real, not satire. But here’s what I say: “Rising Seas will sound great on national security”**


_Hint_: you want to call `.send(..)` using the outputs of our models (classifier and language models) instead of responding `this is a reply!`

The best assignments will have bots that:

- Respond with prediction of whether the given headline is satire or not (10% effort)
- Respond with a newly generated headline inspired by the original (10% effort)
- Have interactions that are designed to be interpretable, friendly, and non-misleading, and that let users discover what is going on. (10% effort)

In [None]:
# Write your code here!

### Q7.1: Complete a "User Study" and update your bot: What did you change? (7% effort)

Having a friend informally play around with the bot will reveal where some additional pre/post-processing of your chatbot's output is needed to make interactions more streamlined. We can make a much better bot with some string manipulation! Have at least one "user" (a friend, classmate, family member) try interacting with your bot to get feedback to help you improve the interaction experience!
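
As one example of the kind of string manipulation that may help, the sketch below removes @-mentions from a message before its text is handed to a model. It assumes that user mentions appear in the raw message text in the form `<@1234>` or `<@!1234>`. The helper name and the ID in the example are made up.

```python
import re

# Assumed raw form of a user mention inside message text: <@ID> or <@!ID>.
MENTION_PATTERN = re.compile(r"<@!?\d+>")

def strip_mentions(text):
    """Remove user mentions and collapse any leftover extra whitespace."""
    without_mentions = MENTION_PATTERN.sub(" ", text)
    return " ".join(without_mentions.split())

# Made-up mention ID, purely for illustration.
print(strip_mentions("<@123456789012345678> Rising Seas Will Erase More Cities"))
```

Cleaning the input this way keeps the mention itself from being treated as part of the headline.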

What observations did your user make in their interaction with the bot? How did you update your code to address them? Were there any issues you were not able to address?

**ANSWER:**  _Write your answer here._

### Q7.2: Reflect: Would you recommend using our satire-classifier as a good starting point to build a fake-news classifier? (3% effort)
If so, what changes would we need to make to make it useful for this purpose? If not, why not?

**ANSWER:**  _Write your answer here._

# Extra Credit: Test with Users and Iterate (5% additional effort)
In this part, you’ll ask three participants to interact with your bot. You’ll give the user high-level information about what the domain of the bot is, and then see how they interact with it. Ask each of the participants to ask your chatbot at least three different things. Record how they interact with your bot. After this participant input, update your bot to attempt to address how that participant interacted with your chatbot. 

### Q EC1a: How did what your participants input compare to the ones you tested so far? How did participants react when the chatbot didn’t respond correctly, or responded with nonsense? (2.5% additional effort)

**ANSWER:** _Double click this text to write your answer to the question here._

### Q EC1b: What change could you make in response to this feedback? (2.5% additional effort)

**ANSWER:** _Double click this text to write your answer to the question here._

# Assignment Submission

Once you've completed all of the above, you're done with assignment 4! As always, clean up your code and ensure your entire Jupyter Notebook runs before submitting. Iris must be able to run your notebook on her machine.

Once you think everything is set, please change the filename of your notebook to `[yourunixID]_haiiYY[assignmentnumber].ipynb`, e.g., `ikh1_haii26a4.ipynb` and then .zip your Notebook and any additional files you used (this is likely just the Notebook file and your JSON data in the lib/ directory: **DON'T** include your gigantic .pkl language model files, and **DON'T** include your credentials.py!), and submit the `.zip` file on GLOW.