# Week 8 - Conversation and Text Generation
Many natural language activities boil down to text generation, especially the back-and-forth nature of natural conversation and question answering. While some may regard it as a parlour trick due to unpredictability, recent dramatic improvements in text generation suggest that these kinds of models can find themselves being used in more serious social scientific applications, such as in survey design and construction, idiomatic translation, and the normalization of phrase and sentence meanings.

These models can be quite impressive, even uncanny in how human-like they sound. Check out this [cool website](https://transformer.huggingface.co), which allows you to write with a transformer. The website is built by the folks who wrote the package we are using. The code underneath the website can be found in their examples: [run_generation.py](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).

Much 2022 NLP research is on text generation. Most famously, this is the primary use of large language models like GPT-3 (OpenAI), Wu Dao (Beijing Academy of AI), and Gopher (DeepMind).

In [2]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
#Make sure you update it before starting this notebook
import lucem_illud #pip install -U git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git

import sklearn #For generating some matrices
import pandas as pd #For DataFrames
import numpy as np #For arrays
import matplotlib.pyplot as plt #For plotting
import seaborn #Makes the plots look nice
import seaborn as sns
import scipy #Some stats
import nltk #a little language code
from IPython.display import Image #for pics

import pickle #if you want to save layouts
import os
import io
import zipfile

import networkx as nx

%matplotlib inline

import torch # pip install torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig # pip install transformers
from transformers import AdamW, BertForSequenceClassification
from tqdm import tqdm, trange

In [3]:
%matplotlib inline

In [4]:
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'
from keras.preprocessing.sequence import pad_sequences

# ConvoKit
As we alluded to in Week 6 with causal inference, [ConvoKit](https://convokit.cornell.edu/) is an exciting platform for conversational analysis developed by Jonathan Chang, Caleb Chiam, and others, mostly at Cornell. Keep this in mind if you are interested in a final project with conversational data such as Twitter threads or movie scripts. They have an [interactive tutorial](https://colab.research.google.com/github/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/Introduction_to_ConvoKit.ipynb), from which we include some examples below. Most of the following text and code is authored by them.

These ConvoKit corpora can be used for the next exercise in this notebook.

In [5]:
try:
    import convokit
except ModuleNotFoundError:
    !pip install convokit

In [6]:
# for pretty printing of cells within the Colab version of this notebook
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [7]:
import convokit
from convokit import Corpus, download

### Loading a Corpus

A Corpus represents a conversational dataset. We typically begin our analysis by loading a Corpus. A list of existing datasets already in ConvoKit format can be found [here](https://convokit.cornell.edu/documentation/datasets.html). 

A growing list of other conversational datasets covering a variety of conversational settings is available in ConvoKit, such as face-to-face (e.g. the [*Intelligence Squared Debates corpus*](https://convokit.cornell.edu/documentation/iq2.html)), institutional (e.g. the [*Supreme Court Oral Arguments corpus*](https://convokit.cornell.edu/documentation/supreme.html)), fictional (e.g. the [*Cornell Movie Dialog Corpus*](https://convokit.cornell.edu/documentation/movie.html)), or online (e.g. all talkpage conversations on [*Wikipedia Talk Pages*](https://convokit.cornell.edu/documentation/wiki.html) and a full dump of [*Reddit*](https://convokit.cornell.edu/documentation/subreddit.html)).

For this tutorial, we will primarily be using the *r/Cornell* subreddit corpus to demo various ConvoKit functionality, and occasionally the [*Switchboard Dialog Act Corpus*](https://convokit.cornell.edu/documentation/switchboard.html) (a collection of anonymized five-minute telephone conversations) as a contrasting dataset.

In [8]:
corpus = Corpus(download('subreddit-Cornell'))

# You can try a different corpus if you want.
#corpus = Corpus(download('diplomacy-corpus'))
#corpus = Corpus(download('switchboard-corpus'))
#corpus = Corpus(download('reddit-corpus-small'))

Dataset already exists at C:\Users\jacy\.convokit\downloads\subreddit-Cornell


In [9]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


### Corpus components: Conversations, Utterances, Speakers

Every Corpus has three main components: [Conversations](https://convokit.cornell.edu/documentation/conversation.html), [Utterances](https://convokit.cornell.edu/documentation/utterance.html), and [Speakers](https://convokit.cornell.edu/documentation/speaker.html). Just as in real life, in ConvoKit, Conversations are some sequence of Utterances, where each Utterance is made by some Speaker. Let's look at an example of each.

In [10]:
# This is a Reddit thread
corpus.random_conversation().meta

{'title': 'Cornell Running Club?',
 'num_comments': 5,
 'domain': 'self.Cornell',
 'timestamp': 1505250396,
 'subreddit': 'Cornell',
 'gilded': 0,
 'gildings': None,
 'stickied': False,
 'author_flair_text': ''}

In [11]:
# This is a Reddit post or comment.
corpus.random_utterance().meta

{'score': 1,
 'top_level_comment': 'd6z4plq',
 'retrieved_on': 1473641929,
 'gilded': 0,
 'gildings': None,
 'subreddit': 'Cornell',
 'stickied': False,
 'permalink': '',
 'author_flair_text': ''}

In [12]:
# The r/Cornell Corpus does not have speaker metadata.
#corpus.random_speaker().meta

#Speakers do have an 'id', which is their Reddit username, as seen here.
corpus.random_speaker()

Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x000002161AC2D940>, 'id': 'aik2124'})

In [13]:
# We can iterate through these objects as we iterate lists or DataFrames in Python.
for utt in corpus.iter_utterances():
    print(utt.text)
    break 

I was just reading about the Princeton Mic-Check and it's getting [national press](http://www.bloomberg.com/news/2011-12-29/princeton-brews-trouble-for-us-1-percenters-commentary-by-michael-lewis.html).

I want to get a sense of what people felt like around campus. Anything interesting happen? Anything interesting coming up?


Conversations, Utterances, and Speakers are each interesting, but the magic of conversational analysis is connecting them. For example, we can get all the Conversations in which a Speaker has participated and all the Utterances they have made. To make it more interesting, we can find a Speaker to study by navigating from a random Utterance.

In [14]:
# consider this sequence of operations that highlight how to navigate between components
utt = corpus.random_utterance()
convo = utt.get_conversation() # get the Conversation the Utterance belongs to
spkr = utt.speaker # get the Speaker who made the Utterance

spkr_convos = list(spkr.iter_conversations())

# Display up to 3 of their conversations.
spkr_convos[:3]

[Conversation({'obj_type': 'conversation', 'meta': {'title': 'Any feedback on the Cooperative Workshops at the College of Engineering?', 'num_comments': 3, 'domain': 'self.Cornell', 'timestamp': 1464944048, 'subreddit': 'Cornell', 'gilded': 0, 'gildings': None, 'stickied': False, 'author_flair_text': ''}, 'vectors': [], 'tree': None, 'owner': <convokit.model.corpus.Corpus object at 0x000002161AC2D940>, 'id': '4mc08j'}),
 Conversation({'obj_type': 'conversation', 'meta': {'title': 'How Cornell find space for Transfer students?', 'num_comments': 7, 'domain': 'self.Cornell', 'timestamp': 1465405179, 'subreddit': 'Cornell', 'gilded': 0, 'gildings': None, 'stickied': False, 'author_flair_text': ''}, 'vectors': [], 'tree': None, 'owner': <convokit.model.corpus.Corpus object at 0x000002161AC2D940>, 'id': '4n625w'}),
 Conversation({'obj_type': 'conversation', 'meta': {'title': 'How common is to be rejected by the CS dept?', 'num_comments': 3, 'domain': 'self.Cornell', 'timestamp': 1465870556, 

For a more qualitative feel of the data, you can display a Conversation. For Reddit data, this is a single thread.

In [15]:
# We truncate sentences at character 80 to avoid making this notebook too long!
convo.print_conversation_structure(lambda utt: utt.text[:80] + "\n")

Incoming fall CS major, I was thinking of taking the following courses. However,

    Engineering chemisty is chem 2090. If you're a CS major, you should definitely t

        Sorry i double posted. Thanks for the advice! Also, are the engri's for spring s

            I think the ENGRI courses are selected on a semesterly basis, so the ones being 

                This is mostly right, but some ENGRI courses are specifically always fall-only o

        hey, just a small question. I have the option of choosing two classes that are 1

            Depends on how far apart they are. The usual time between classes is 15 minutes,

                Thank you so much!

    I'm in engineering, and I would say go for the ENGRI 1620 over a CS class. It's 

        hey, just a small question. I have the option of choosing two classes that are 1

            Having classes 10 min apart is not an issue at all if they're in the same or adj

                thank you!!!

    I also plan to apply for C

There is a lot more to ConvoKit that we encourage you to explore, especially their [tutorial](https://colab.research.google.com/github/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/Introduction_to_ConvoKit.ipynb), but the ability to seamlessly navigate between the Utterances, Conversations, and Speakers of a Corpus is extremely valuable for social science.

## <font color="red">*Exercise 1*</font>

<font color="red">Construct cells immediately below this that use ConvoKit to analyze a Corpus other than 'subreddit-Cornell', including at least one function you find in the package not used above. You can also generate a ConvoKit Corpus from your own dataset based on [their Corpus from .txt files tutorial](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/converting_movie_corpus.ipynb) or [their Corpus from pandas tutorial](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/corpus_from_pandas.ipynb), but that may be time-consuming for a weekly assignment. (It could be a great idea for your final project!)

## Creating networks of agents from corpora

Now let's return to the Davies corpora (specifically, Soap Operas) to see how we can extract actors and build a network of their relationships in the texts.

We'll use the `lucem_illud.loadDavies()` function to get the dataframe. Make sure to download `SOAP.zip` from DropBox, unzip, and edit the following line with the path to that file. This code may take some time.

In [16]:
corpora_address = "C:/Downloads/SOAP"

In [17]:
soap_texts = lucem_illud.loadDavies(corpora_address, num_files=2000)

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:/Downloads/SOAP/'

We now use the source file to see how the data is stored. Note that this is different from the movies corpus, and that we will need a different aggregation method to store the data. Each dataset needs its own approach, though they are all very similar; the details depend on how the data is stored. Here multiple textIDs (scripts) belong to the same show, so our soap dataframe is structured a little differently.

You can see the first 20 lines of the source file here.

In [None]:
zfile = zipfile.ZipFile(corpora_address + "/soap_sources.zip")
source = []

In [None]:
for file in zfile.namelist():
    with zfile.open(file) as f:
        for line in f:
            source.append(line)

In [None]:
source[0:20]

In [None]:
soap_dict = {}

In [None]:
for soap in source[3:]:
    try:
        textID, year, show, url = soap.decode("utf-8").split("\t")
    except UnicodeDecodeError:
        continue
    if show.strip() not in soap_dict:
        soap_dict[show.strip()] = []
    if show.strip() in soap_dict:
        try:
            soap_dict[show.strip()].append(soap_texts[textID.strip()])
        except KeyError:
            continue

In [None]:
soap_dict.keys()

In [None]:
soap_df = pd.DataFrame(columns=["Soap Name", "Tokenized Texts"])

In [None]:
i = 0

In [None]:
for soap in soap_dict:
    # since there were multiple lists
    print(soap)
    full_script = []
    for part in soap_dict[soap]:
        full_script = full_script + part
    soap_df.loc[i] = [soap, full_script]
    i += 1

In [None]:
soap_df

We now have each soap and its tokenized texts. Let us see what kind of information we can get. These are American soap operas, and are likely to be cheesy and dramatic (an understatement). A fun start would be to make a network of the characters in each of these soaps.

What would be a good way to create a network? Maybe every time someone talks to someone, we add one to the weight? But we wouldn't want to add weights whenever it's a different scene - or maybe we do? Let us look at the text and figure it out.

Note that we didn't add the year here because each soap spans multiple years. If we were doing a different kind of analysis, we would want to add a year column as well.

In my dataframe, Days of Our Lives is the 4th corpus, and I conducted my basic analysis on that.

In [None]:
dool = soap_df['Tokenized Texts'][3]

In [None]:
' '.join(dool[0:1500])

Hmmm... we can't do our normal text processing. But this provides us with an interesting opportunity: every '@!' is followed by some useful information. Let us do a quick check of how many characters exist here, and how many times they speak.

In [None]:
characters = {}

In [None]:
for token in dool:
    if token[0] == '@':
        # all characters or actions start with @, so we add that to character
        if token[2:] not in characters:
            characters[token[2:]] = 0
        if token[2:] in characters:
            characters[token[2:]] += 1


In [None]:
len(characters)

Wow, that's a lot of characters: but we notice a '@!' between certain actions too, such as screaming and sobbing. Let us maybe only look for characters with a high number of appearances?

In [None]:
for character in characters:
    if characters[character] > 2000:
        print(character, characters[character])

Let's check these folks out on the interwebz... an image search of the name + "days of our lives":

In [None]:
Image(filename='../data/dool/dool_john.png') 

In [None]:
Image(filename='../data/dool/dool_brady.jpg') 

In [None]:
# Image(filename='../data/dool/dool_hope.jpeg')

In [None]:
# Image(filename='../data/dool/dool_philip.jpeg')

In [None]:
# Image(filename='../data/dool/dool_marlena.jpg')

In [None]:
# Image(filename='../data/dool/dool_kate.png')

In [None]:
# Image(filename='../data/dool/dool_bo.png')

In [None]:
# Image(filename='../data/dool/dool_chloe.jpg')

In [None]:
# Image(filename='../data/dool/dool_sami.jpg')

In [None]:
# Image(filename='../data/dool/dool_shawn.jpg')

In [None]:
# Image(filename='../data/dool/dool_belle.jpg')

In [None]:
# Image(filename='../data/dool/dool_lucas.jpg')

In [None]:
# Image(filename='../data/dool/dool_nicole.jpg')

These are definitely big, long-time players in the dramatic Days narrative. It would make sense to create a graph where each character who appears over 2000 times is a node, and each time they talk to each other, we add one to their weight. We should also store all the things these characters say: that's useful information.

So we now iterate through the tokens in a manner where we can capture this information.
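
Before running this on the full script, the counting rule can be sketched on a toy example. The following is a minimal sketch, not the notebook's implementation: the token list and the `main_characters` set are invented for illustration, and it uses plain dictionaries rather than networkx. It assumes, as described above, that a tie is counted only when two tracked characters speak in directly adjacent turns.

```python
from collections import Counter, defaultdict

# Toy token stream in the same '@!Name' format as the soap scripts (invented data).
tokens = ['@!Sami', 'I', 'know', '@!Lucas', 'Do', 'you', '@!Sami', 'Yes',
          '@!(Sobbing)', '@!Lucas', 'Fine']
main_characters = {'Sami', 'Lucas'}  # stand-in for the "over 2000 lines" filter

edge_weights = Counter()          # unordered pair of names -> number of adjacent turns
words_spoken = defaultdict(list)  # character -> list of utterances (token lists)

current_speaker = None
for token in tokens:
    if token.startswith('@!'):
        name = token[2:]
        previous_speaker = current_speaker
        # Actions such as '(Sobbing)' are not tracked, so they reset the speaker.
        current_speaker = name if name in main_characters else None
        if previous_speaker and current_speaker and previous_speaker != current_speaker:
            edge_weights[frozenset((previous_speaker, current_speaker))] += 1
        if current_speaker:
            words_spoken[current_speaker].append([])  # start a new utterance
    elif current_speaker:
        words_spoken[current_speaker][-1].append(token)

print(dict(edge_weights))
print(dict(words_spoken))
```

Here Sami and Lucas trade two adjacent turns before the `(Sobbing)` action interrupts, so the pair ends with a weight of 2, and each character's utterances are kept as separate token lists.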

In [None]:
actor_network = nx.Graph()

In [None]:
for character in characters:
    if characters[character] > 2000:
        actor_network.add_node(character, lines_spoken= characters[character], words=[])

In [None]:
len(actor_network.nodes.data())

In [None]:
actor_network.nodes.data()

In [None]:
actor_network.nodes['Sami']['lines_spoken']

In [None]:
i = 0

The following lines of code create the graph of actors and their relationships.

In [None]:
for token in dool:
    i += 1
    if i > len(dool):
        break
    if token[0] == "@":
        if token[2:] in actor_network.nodes():
            j = i
            for token_ in dool[i:]:
                if token_[0] == "@":
                    # if both the characters exist in the graph, add a weight
                    if token_[2:] != token[2:] and token_[2:] in actor_network.nodes():
                        if (token[2:], token_[2:]) not in actor_network.edges():
                            actor_network.add_edge(token[2:], token_[2:], weight=0)
                        if (token[2:], token_[2:]) in actor_network.edges():
                            actor_network.edges[(token[2:], token_[2:])]['weight'] += 1
                    break
                j += 1
            # adding characters sentences
            actor_network.nodes[token[2:]]['words'].append(dool[i:j])

In [None]:
nx.draw(actor_network, with_labels=True, font_weight='bold')

In [None]:
L = []
for node in actor_network.nodes():
    l = []
    for node_ in actor_network.nodes():
        if node == node_:
            l.append(0)
        else:
            l.append(actor_network.get_edge_data(node, node_, default={'weight': 0})['weight'])  # 0 if the pair never interacts
    L.append(l)
M_ = np.array(L)
fig = plt.figure()
div = pd.DataFrame(M_, columns = list(actor_network.nodes()), index = list(actor_network.nodes()))
ax = sns.heatmap(div)
plt.show()

In [None]:
from networkx.algorithms.community import greedy_modularity_communities
c = list(greedy_modularity_communities(actor_network))

In [None]:
c

### Finding structure in networks

We now have a lot of useful information: we have a graph of all the characters, with their relationships with other characters, as well as all the words they've said. We tried finding communities, but it seems like everyone is connected to everyone: each of them forms their own 'community'. Seems like people talk to each other a bunch in soaps.

This is, however, not the best network in which to find meaningful patterns, since everyone is connected to everyone. But as our heatmap shows, not everyone talks to everyone an equal amount. How about we only keep our "important" ties, where people are talking to each other a lot?

In [None]:
smaller_actor_network = nx.Graph()

In [None]:
for actor_1 in actor_network.nodes():
    smaller_actor_network.add_node(actor_1, lines_spoken= actor_network.nodes[actor_1]['lines_spoken'], words=actor_network.nodes[actor_1]['words'])
    for actor_2 in actor_network.nodes():
        if actor_2 != actor_1 and actor_network.has_edge(actor_1, actor_2) and actor_network.edges[(actor_1, actor_2)]['weight'] > 250:
            smaller_actor_network.add_edge(actor_1, actor_2, weight=actor_network.edges[(actor_1, actor_2)]['weight'])


In [None]:
nx.draw(smaller_actor_network, with_labels=True, font_weight='bold')

This is a lot more interesting: while the sets of characters overlap, there are still two distinct communities if you look at characters who regularly talk to each other!

Let us see what our centrality measures look like, as well as communities.

In [None]:
from networkx.algorithms.community import greedy_modularity_communities
c = list(greedy_modularity_communities(smaller_actor_network))

In [None]:
c

In [None]:
dcentralities = nx.degree_centrality(smaller_actor_network)

In [None]:
dcentralities['John'], dcentralities['Philip']

Our two different communities show up as detected by the networkx algorithm, and when we look at centralities, we can see that John is a lot more central than Philip.
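
For reference, degree centrality as reported by `nx.degree_centrality` is a node's number of ties divided by the maximum possible, n - 1. The sketch below computes it by hand on an invented four-node graph; the node names and ties are hypothetical and are not taken from the soap data.

```python
def degree_centrality(adjacency):
    """Degree centrality for an undirected graph with at least two nodes.

    `adjacency` maps each node to the set of its neighbours.
    """
    n = len(adjacency)
    return {node: len(neighbours) / (n - 1) for node, neighbours in adjacency.items()}

# Hypothetical toy graph: A is tied to everyone, B only to A.
toy_graph = {
    'A': {'B', 'C', 'D'},
    'B': {'A'},
    'C': {'A', 'D'},
    'D': {'A', 'C'},
}

centrality = degree_centrality(toy_graph)
for node, value in sorted(centrality.items()):
    print(node, round(value, 3))
```

A node tied to every other node scores 1, so the measure is comparable across networks of different sizes.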

Let us go back to our original graph and see if the weight, or number of shared appearances, matches the text... how do we do this? Well, we already have the graph, and we also have information about who spoke to whom. So we have our framework!

This means we can explore ideas contained in two of the papers you will be reading: “No country for old members: User lifecycle and linguistic change in online communities” and “Fitting In or Standing Out? The Tradeoffs of Structural and Cultural Embeddedness”, both of which you can access on Canvas.

Let us use a simplified version of the papers, and check if a higher number of conversations might lead to a higher similarity between the word distributions for two characters. We can use the same divergences we used in the last notebook. Do you think it will match with the number of times each character was associated with each other?
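
To make the idea of comparing word distributions concrete, here is a minimal sketch of a KL divergence between two token lists. It is an illustration under stated assumptions rather than the helper used in this notebook: the function name `smoothed_kl`, the add-one smoothing (`alpha`), and the toy token lists are all invented here. Smoothing is added so that a word missing from one speaker does not produce an infinite divergence.

```python
import math
from collections import Counter

def smoothed_kl(tokens_p, tokens_q, alpha=1.0):
    """KL(P || Q) over the joint vocabulary, with additive smoothing `alpha`."""
    vocab = set(tokens_p) | set(tokens_q)
    counts_p, counts_q = Counter(tokens_p), Counter(tokens_q)
    total_p = len(tokens_p) + alpha * len(vocab)
    total_q = len(tokens_q) + alpha * len(vocab)
    kl = 0.0
    for word in vocab:
        p = (counts_p[word] + alpha) / total_p
        q = (counts_q[word] + alpha) / total_q
        kl += p * math.log(p / q)
    return kl

# Invented toy "characters"
speaker_a = ['love', 'love', 'hate']
speaker_b = ['love', 'money', 'money']

print(smoothed_kl(speaker_a, speaker_a))  # identical distributions
print(smoothed_kl(speaker_a, speaker_b))  # different distributions
print(smoothed_kl(speaker_b, speaker_a))  # KL is not symmetric in general
```

A value of zero means the two speakers use words in identical proportions; larger values mean the first speaker's word use is poorly predicted by the second's.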

In [None]:
def kl_divergence(X, Y):
    P = X.copy()
    Q = Y.copy()
    P.columns = ['P']
    Q.columns = ['Q']
    df = Q.join(P).fillna(0)
    p = df.iloc[:,1]
    q = df.iloc[:,0]
    D_kl = scipy.stats.entropy(p, q)
    return D_kl

def chi2_divergence(X,Y):
    P = X.copy()
    Q = Y.copy()
    P.columns = ['P']
    Q.columns = ['Q']
    df = Q.join(P).fillna(0)
    p = df.iloc[:,1]
    q = df.iloc[:,0]
    return scipy.stats.chisquare(p, q).statistic

def Divergence(corpus1, corpus2, difference="KL"):
    """Difference parameter can equal "KL", "Chi2", "KS", or "Wasserstein"."""
    freqP = nltk.FreqDist(corpus1)
    P = pd.DataFrame(list(freqP.values()), columns = ['frequency'], index = list(freqP.keys()))
    freqQ = nltk.FreqDist(corpus2)
    Q = pd.DataFrame(list(freqQ.values()), columns = ['frequency'], index = list(freqQ.keys()))
    if difference == "KL":
        return kl_divergence(P, Q)
    elif difference == "Chi2":
        return chi2_divergence(P, Q)
    elif difference == "KS":
        # ks_2samp returns a result object; keep only the test statistic
        return scipy.stats.ks_2samp(P['frequency'], Q['frequency']).statistic
    elif difference == "Wasserstein":
        # wasserstein_distance already returns a plain float
        return scipy.stats.wasserstein_distance(P['frequency'], Q['frequency'])
    else:
        raise ValueError('difference must be "KL", "Chi2", "KS", or "Wasserstein"')

In [None]:
corpora = []
for character in actor_network.nodes():
    character_words = []
    for sentence in actor_network.nodes[character]['words']:
        for word in sentence:
            character_words.append(word)
    corpora.append(lucem_illud.normalizeTokens(character_words))

In [None]:
L = []

In [None]:
for p in corpora:
    l = []
    for q in corpora:
        l.append(Divergence(p,q, difference='KS'))
    L.append(l)
M = np.array(L)

In [None]:
fig = plt.figure()
div = pd.DataFrame(M, columns = list(actor_network.nodes()), index = list(actor_network.nodes()))
ax = sns.heatmap(div)
plt.show()

In [None]:
# np.corrcoef(M_, M)[0]

With our two heatmaps, we can attempt some rudimentary analysis. We can see from our previous plot that Shawn and Belle talk to each other a lot, as do Hope and Bo, Nicole and Brady, and Lucas and Sami. Do they also talk *like* each other?

Kind of, actually: all four of these pairs have a lower distance between them. Now I don't know anything about this particular soap... are these four pairs related? Are they in a relationship, either married or dating, or are they just really good friends?
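
One way to go beyond eyeballing the two heatmaps is to correlate the interaction weights with the linguistic distances, pair by pair. The sketch below does this for the upper triangle of two symmetric matrices. It is only an illustration: the function name `offdiag_pearson` and both toy matrices are invented, and a Pearson correlation over so few pairs should not be over-interpreted.

```python
import math

def offdiag_pearson(A, B):
    """Pearson correlation between the upper-triangle entries of two square matrices."""
    n = len(A)
    xs = [A[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [B[i][j] for i in range(n) for j in range(i + 1, n)]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented 3-character example: interaction counts and linguistic distances
toy_weights = [[0, 10, 2],
               [10, 0, 4],
               [2, 4, 0]]
toy_distances = [[0.0, 0.1, 0.5],
                 [0.1, 0.0, 0.3],
                 [0.5, 0.3, 0.0]]

r = offdiag_pearson(toy_weights, toy_distances)
print(round(r, 3))
```

A negative correlation would indicate that pairs who talk to each other more also talk more like each other. The diagonal is excluded because a character's distance to themselves is zero by construction.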

This lays out the frameworks which you can now use to explore your own networks. 

# Interactional influence

Before we utilize transformers, let's see how to estimate the influence of one speaker on another and thereby build a kind of interpersonal influence network, based on a recent paper by Fangjian Guo, Charles Blundell, Hanna Wallach, and Katherine Heller entitled ["The Bayesian Echo Chamber: Modeling Social Influence via Linguistic Accommodation"](https://arxiv.org/pdf/1411.2674.pdf). This relies on a kind of point process called a Hawkes process, which estimates the influence of one point on another. Specifically, they estimate the degree to which one actor in an interpersonal interaction engaged in "accommodation" behaviors relative to the other, generating a directed edge from the one to the other.
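
For intuition about Hawkes processes in general, the sketch below evaluates the intensity of a textbook univariate Hawkes process with an exponential kernel: each past event temporarily raises the rate of future events, and that boost decays over time. This is not the authors' model, which works through language use between speakers; the function name and the parameter values (`mu`, `alpha`, `beta`) are arbitrary choices for illustration.

```python
import math

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.0):
    """lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)).

    mu is the baseline rate, alpha the jump caused by each event,
    and beta the speed at which that excitation decays.
    """
    excitation = sum(alpha * math.exp(-beta * (t - t_i))
                     for t_i in event_times if t_i < t)
    return mu + excitation

events = [1.0]  # a single invented event at time 1.0
for t in (0.5, 1.1, 2.0, 5.0):
    print(t, round(hawkes_intensity(t, events), 4))
```

Before the event the intensity sits at the baseline; just after it, the intensity jumps and then relaxes back toward the baseline. Influence models of this family estimate how strongly one actor's events excite another's.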

### First let's look at the output of their analysis:

In [None]:
example_name = '12-angry-men'   #example datasets: "12-angry-men" or "USpresident"

In [None]:
result_path = '../data/Bayesian-echo/results/{}/'.format(example_name)
if not os.path.isdir(result_path):
    raise ValueError('Invalid example selected, only "12-angry-men" or "USpresident" are available')

In [None]:
df_meta_info = pd.read_table(result_path + 'meta-info.txt',header=None)
df_log_prob = pd.read_csv(result_path + "SAMPLE-log_prior_and_log_likelihood.txt", sep=r'\s+') #log_prob samples
df_influence = pd.read_csv(result_path + 'SAMPLE-influence.txt', sep=r'\s+') # influence samples
df_participants = pd.read_csv(result_path + 'cast.txt', sep=r'\s+')
person_id = pd.Series(df_participants['agent.num'].values-1,index=df_participants['agent.name']).to_dict()
print()
print ('Person : ID')
person_id

In [None]:
def getDensity(df):
    data = df#_log_prob['log.prior']
    density = scipy.stats.gaussian_kde(data)
    width = np.max(data) - np.min(data)
    xs = np.linspace(np.min(data)-width/5, np.max(data)+width/5,600)
    density.covariance_factor = lambda : .25
    density._compute_covariance()
    return xs, density(xs)

### Plot MCMC (Markov chain Monte Carlo) trace and the density of log-likelihoods

In [None]:
fig = plt.figure(figsize=[12,10])

plt.subplot(4,2,1)
plt.plot(df_log_prob['log.prior'])
plt.xlabel('Iterations')
plt.title('Trace of log.prior')

plt.subplot(4,2,2)
x,y = getDensity(df_log_prob['log.prior'])
plt.plot(x,y)
plt.xlabel('log.prior')
plt.title('Density of log.prior')

plt.subplot(4,2,3)
plt.plot(df_log_prob['log.likelihood'])
plt.title('Trace of log.likelihood')
plt.xlabel('Iterations')
plt.tight_layout()

plt.subplot(4,2,4)
x,y = getDensity(df_log_prob['log.likelihood'])
plt.plot(x,y)
plt.xlabel('log.likelihood')
plt.title('Density of log.likelihood')

plt.subplot(4,2,5)
plt.plot(df_log_prob['log.likelihood.test.set'])
plt.title('Trace of log.likelihood.test.set')
plt.xlabel('Iterations')
plt.tight_layout()

plt.subplot(4,2,6)
x,y = getDensity(df_log_prob['log.likelihood.test.set'])
plt.plot(x,y)
plt.xlabel('log.likelihood.test.set')
plt.title('Density of log.likelihood.test.set')

plt.subplot(4,2,7)
plt.plot(df_log_prob['log.prior']+df_log_prob['log.likelihood'])
plt.title('Trace of log.prob')
plt.xlabel('Iterations')

plt.subplot(4,2,8)
x,y = getDensity(df_log_prob['log.prior']+df_log_prob['log.likelihood'])
plt.plot(x,y)
plt.xlabel('log.prob')
plt.title('Density of log.prob')

plt.tight_layout()

plt.show()

### Plot the influence matrix between participants

In [None]:
A = int(np.sqrt(len(df_influence.columns))) #number of participants
id_person = {}
for p in person_id:
    id_person[person_id[p]]=p

In [None]:
def getmatrix(stacked,A):
    influence_matrix = [[0 for i in range(A)] for j in range(A)]
    for row in stacked.items():
        from_ = int(row[0].split('.')[1])-1
        to_ = int(row[0].split('.')[2])-1
        value = float(row[1])
        influence_matrix[from_][to_]=value
    df_ = pd.DataFrame(influence_matrix) 
    
    df_ =df_.rename(index = id_person)
    df_ =df_.rename(columns = id_person)
    return df_

In [None]:
stacked = df_influence.mean(axis=0)
df_mean = getmatrix(stacked,A)

stacked = df_influence.std(axis=0)
df_std = getmatrix(stacked,A)

In [None]:
df_mean

In [None]:
f, ax = plt.subplots(figsize=(9, 6))
seaborn.heatmap(df_mean, annot=True,  linewidths=.5, ax=ax,cmap="YlGnBu")
print('MEAN of influence matrix (row=from, col=to)')
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(9, 6))
seaborn.heatmap(df_std, annot=True,  linewidths=.5, ax=ax,cmap="YlGnBu")
print('SD of influence matrix (row=from, col=to)')
plt.show()

### Barplot of total influences sent/received

In [None]:
sender_std = {} #sd of total influence sent
reciever_std = {} #sd of total influence received
for i in range(A):
    reciever_std[id_person[i]] = df_influence[df_influence.columns[i::A]].sum(axis=1).std()
    sender_std[id_person[i]] = df_influence[df_influence.columns[i*A:(i+1)*A:]].sum(axis=1).std()

sent = df_mean.sum(axis=1) #mean of total influence sent
recieved = df_mean.sum(axis=0) #mean of total influence received
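
With the row = from, column = to convention used in the heatmaps above, total influence sent is a row sum and total influence received is a column sum. The sketch below spells this out on an invented 3x3 matrix; the names and values are hypothetical and are not taken from the sampled results.

```python
names = ['A', 'B', 'C']

# toy_influence[i][j] is the influence of names[i] on names[j] (invented values)
toy_influence = [
    [0.0, 0.6, 0.1],
    [0.2, 0.0, 0.3],
    [0.4, 0.1, 0.0],
]

# Row sums: how much each participant influences everyone else
total_sent = {names[i]: sum(toy_influence[i]) for i in range(len(names))}

# Column sums: how much each participant is influenced by everyone else
total_received = {names[j]: sum(row[j] for row in toy_influence)
                  for j in range(len(names))}

for name in names:
    print(name, 'sent', round(total_sent[name], 2),
          'received', round(total_received[name], 2))
```

Comparing the two totals for each participant gives a quick read on who mostly leads and who mostly accommodates.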

Total influence:

In [None]:
print ("\t\tTotal linguistic influence sent/received ")
fig = plt.figure(figsize=[np.min([A,20]),6])

plt.grid()
wd=0.45
ii=0
for p in sender_std:
    plt.bar(person_id[p],sent.loc[p],width=wd,color='red',alpha=0.6,label = "Sent" if ii == 0 else "")
    plt.plot([person_id[p]-wd/4,person_id[p]+wd/4],[sent.loc[p]+sender_std[p],sent.loc[p]+sender_std[p]],color='k')
    plt.plot([person_id[p]-wd/4,person_id[p]+wd/4],[sent.loc[p]-sender_std[p],sent.loc[p]-sender_std[p]],color='k')
    plt.plot([person_id[p],person_id[p]],[sent.loc[p]-sender_std[p],sent.loc[p]+sender_std[p]],color='k')
    ii+=1
ii=0
for p in reciever_std:
    plt.bar(person_id[p]+wd,recieved.loc[p],width=wd,color='blue',alpha=0.4,label = "Received" if ii == 0 else "")
    plt.plot([person_id[p]+wd-wd/4,person_id[p]+wd+wd/4],[recieved.loc[p]+reciever_std[p],recieved.loc[p]+reciever_std[p]],color='k')
    plt.plot([person_id[p]+wd-wd/4,person_id[p]+wd+wd/4],[recieved.loc[p]-reciever_std[p],recieved.loc[p]-reciever_std[p]],color='k')
    plt.plot([person_id[p]+wd,person_id[p]+wd],[recieved.loc[p]-reciever_std[p],recieved.loc[p]+reciever_std[p]],color='k')
    ii+=1
plt.legend(loc='center left', bbox_to_anchor=(1, 0.7))
plt.xticks([i+0.25 for i in range(A)],list(zip(*sorted(id_person.items())))[1])
plt.ylabel('value')
plt.xlabel('speaker',fontsize=14)
plt.show()

## Visualize Influence Network!

You can visualize any of the influence matrices above:

Using networkx:

In [None]:
def drawNetwork(df,title):
    fig = plt.figure(figsize=[8,8])
    G = nx.DiGraph()
    for from_ in df.index:
        for to_ in df.columns:
            G.add_edge(from_,to_,weight = df.loc[from_][to_])
            
    pos = nx.spring_layout(G,k=0.55,iterations=20)
    edges,weights = zip(*nx.get_edge_attributes(G,'weight').items())
    weights = np.array(weights)
    #weights = weights*weights
    weights = 6*weights/np.max(weights)
    print(title)
    
    edge_colors=20*(weights/np.max(weights))
    edge_colors = edge_colors.astype(int)
#     nx.draw_networkx_nodes(G,pos,node_size=1200,alpha=0.7,node_color='#99cef7')
#     nx.draw_networkx_edges(G,pos,edge_color=edge_colors)
#     nx.draw_networkx_labels(G,pos,font_weight='bold')
    nx.draw(G,pos,with_labels=True, font_weight='bold',width=weights,\
            edge_color=255-edge_colors,node_color='#99cef7',node_size=1200,\
            alpha=0.75,arrows=True,arrowsize=20)
    return edge_colors

In [None]:
# get quantile influence matrices for 25%, 50%, 75% quantile
stacked = df_influence.quantile(0.25)
df_q25 = getmatrix(stacked,A)

stacked = df_influence.quantile(0.5)
df_q50 = getmatrix(stacked,A)

stacked = df_influence.quantile(0.75)
df_q75 = getmatrix(stacked,A)

In [None]:
G_mean = drawNetwork(df_mean,'Mean Influence Network')

In [None]:
G_q25 = drawNetwork(df_q25,'25 Quantile Influence Network')

In [None]:
G_q75 = drawNetwork(df_q75,'75 Quantile Influence Network')

In [None]:
def fakeEnglish(length):
    listd=['a','b','c','d','e','f','g','s','h','i','j','k','l']
    return ''.join(np.random.choice(listd,length))

Your own dataset should contain 4 columns (with the same column names) as the artificial one below:

- name: name of the participant
- tokens: a list of tokens in one utterance
- start: starting time of utterance (unit doesn't matter, can be 'seconds','minutes','hours'...)
- end: ending time of utterance (same unit as start)

There is no need to sort data for the moment.
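
Before converting your own data, it can help to check that it matches the four-column layout listed above. The following is a minimal sketch, not a `lucem_illud` function: `check_transcript` is a hypothetical helper that works on a list of dictionaries (for a DataFrame, `df_transcript.to_dict('records')` yields one). The rules that `tokens` is a non-empty list and that `end` is not earlier than `start` are assumptions about sensible input, not documented requirements.

```python
REQUIRED_COLUMNS = ("name", "tokens", "start", "end")


def check_transcript(records):
    """Return a list of human-readable problems found in a list of utterance dicts."""
    problems = []
    for i, row in enumerate(records):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing {missing}")
            continue
        # Assumption: every utterance carries at least one token
        if not isinstance(row["tokens"], list) or not row["tokens"]:
            problems.append(f"row {i}: tokens must be a non-empty list")
        # Assumption: an utterance cannot end before it starts
        if row["end"] < row["start"]:
            problems.append(f"row {i}: end precedes start")
    return problems


# Made-up records: the second has no tokens, the third has reversed timestamps
records = [
    {"name": "Obama", "tokens": ["hello", "there"], "start": 0.0, "end": 1.5},
    {"name": "Trump", "tokens": [], "start": 2.0, "end": 3.0},
    {"name": "Bush", "tokens": ["hi"], "start": 5.0, "end": 4.0},
]
problems = check_transcript(records)
print(problems)
```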

Below, we generate a fake collection of data from "Obama", "Trump", "Clinton"...and other recent presidents. You can either create your own simulation OR (better), add real interactional data from an online chat forum, comment chain, or transcribed from a conversation.

In [None]:
script= []
language = 'eng' #parameter, no need to tune if using English, accept:{'eng','chinese'}
role = 'Adult' #parameter, no need to tune 

for i in range(290):
    dt = []
    dt.append(np.random.choice(['Obama','Trump','Clinton','Bush','Reagan','Carter','Ford','Nixon','Kennedy','Roosevelt']))
    faketokens = [fakeEnglish(length = 4) for j in range(30)]
    dt.append(faketokens) #fake utterance
    dt.append(i*2+np.random.random()) # start time
    dt.append(i*2+1+np.random.random()) # end time
    script.append(dt)

df_transcript = pd.DataFrame(script,columns=['name','tokens','start','end']) #"start", "end" are timestamps of utterances, units don't matter
df_transcript[:2]

Transform data into TalkbankXML format:

In [None]:
output_fname = 'USpresident.xml'  #should be .xml
language = 'eng' 
#language = 'chinese'
lucem_illud.make_TalkbankXML(df_transcript, output_fname, language = language )

Run Bayesian Echo Chamber to get estimation.

- It may take a couple of hours. ( About 4-5 hours if Vocab_size=600 and sampling_time =2000)
- Larger "Vocab_size" (see below) will cost more time
- Larger "sampling_time" will also consume more time

In [None]:
Vocab_size = 90 # up to Vocab_size most frequent words will be considered, it should be smaller than the total vocab
sampling_time = 1500  #The times of Gibbs sampling sweeps  (500 burn-in not included)
lucem_illud.bec_run(output_fname, Vocab_size, language, sampling_time)
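
To get a feel for what restricting the analysis to the `Vocab_size` most frequent words means, here is a small sketch. It does not reproduce how `bec_run` builds its vocabulary internally (that is not shown here); `top_vocab` is a hypothetical helper that simply counts tokens across utterances and keeps the most frequent ones.

```python
from collections import Counter


def top_vocab(utterances, vocab_size):
    """Return up to vocab_size of the most frequent tokens across all utterances."""
    counts = Counter(token for tokens in utterances for token in tokens)
    return [word for word, _ in counts.most_common(vocab_size)]


# Made-up utterances, each a list of tokens like the 'tokens' column
utterances = [
    ["we", "like", "data"],
    ["we", "like", "models"],
    ["we", "analyse", "data"],
]
total_vocab = len({t for tokens in utterances for t in tokens})
vocab = top_vocab(utterances, 3)
print(total_vocab, vocab)
```

Running the same count on your own `tokens` column gives the total vocabulary size, which is a quick way to pick a `Vocab_size` that is smaller than it.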

## <font color="red">*Exercise 2*</font>

<font color="red">Construct cells immediately below this that perform a similar social similarity or influence analysis on a dataset relevant to your final project (__or one from ConvoKit__). Create relationships between actors in a network based on your dataset (e.g., person to person or document to document), and perform analyses that interrogate the structure of their interactions, similarity, and/or influence on one another. (For example, if relevant to your final project, you could explore different soap operas, counting how many times a character may have used the word love in conversation with another character, and identify if characters in love speak like each other. Or do opposites attract?) What does that analysis and its output reveal about the relative influence of each actor on others? What does it reveal about the social game being played?

<font color="red">Stretch 1:
Render the social network with weights (e.g., based on the number of scenes in which actors appear together), then calculate the most central actors in the show. Realtime output can be viewed in shell.

<font color="red">Stretch 2:
Implement more complex measures of similarity based on the papers you have read.

## Text Generation using GPT-2 and BERT

We can make use of the transformers we learned about last week to do text generation, where the model takes one or multiple places in a conversation. While some may regard it as a parlour trick due to unpredictability, recent dramatic improvements in text generation suggest that these kinds of models can find themselves being used in more serious social scientific applications, such as in survey design and construction, idiomatic translation, and the normalization of phrase and sentence meanings.

These models can be quite impressive, even uncanny in how human-like they sound. Check out this [cool website](https://transformer.huggingface.co), which allows you to write with a transformer. The website is built by the folks who wrote the package we are using. The code underneath the website can be found in their examples: [run_generation.py](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).

We will be using the built in generate function, but the example file has more detailed code which allows you to set the seed differently.

In [18]:
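# AutoModelWithLMHead is deprecated in newer transformers releases;
# AutoModelForCausalLM is the recommended class for GPT-2 if this import warns or fails.
from transformers import AutoModelWithLMHead, AutoTokenizer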

tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")
model_gpt = AutoModelWithLMHead.from_pretrained("gpt2")

In [19]:
sequence = "Nothing that we like to do more than analyse data all day long and"

# Use a name other than `input` so that the Python builtin is not shadowed
input_ids = tokenizer_gpt.encode(sequence, return_tensors="pt")
generated = model_gpt.generate(input_ids, max_length=50)

resulting_string = tokenizer_gpt.decode(generated.tolist()[0])
print(resulting_string)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Nothing that we like to do more than analyse data all day long and then try to figure out what's going on.

"We're not going to be able to do that. We're not going to be able to do that. We


Wow. A little creepy, and as we can see, far from perfect: GPT doesn't always work out flawlessly, but it sometimes can, and we will try and see if fine-tuning helps. We are going to tune the model on a complete dataset of Trump tweets, as they have a set of distinctive, highly identifiable qualities.
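
The repeated sentence in the output above is typical of greedy decoding, which is what `generate` does when neither sampling nor beam search is requested: always picking the single most likely next token can lock the model into a loop. The toy sketch below illustrates the idea without any neural network. The table `NEXT` is a hypothetical, hand-written stand-in for a language model's next-token probabilities, and the local `generate` function is not the transformers method.

```python
import random

# Hypothetical next-token probabilities, standing in for a language model
NEXT = {
    "we": {"are": 0.6, "can": 0.4},
    "are": {"not": 0.7, "done": 0.3},
    "not": {"going": 0.8, "done": 0.2},
    "going": {"to": 1.0},
    "to": {"do": 0.6, "stop": 0.4},
    "do": {"that": 1.0},
    "that": {"we": 0.9, "<eos>": 0.1},
    "can": {"stop": 1.0},
    "stop": {"<eos>": 1.0},
    "done": {"<eos>": 1.0},
}


def generate(start, max_length, sample=False, rng=None):
    """Extend `start` token by token, greedily or by sampling from NEXT."""
    tokens = [start]
    while len(tokens) < max_length:
        options = NEXT[tokens[-1]]
        if sample:
            # Draw the next token in proportion to its probability
            words = list(options)
            nxt = rng.choices(words, weights=[options[w] for w in words])[0]
        else:
            # Greedy: always take the single most likely token
            nxt = max(options, key=options.get)
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens


greedy = generate("we", 15)
sampled = generate("we", 15, sample=True, rng=random.Random(0))
print(" ".join(greedy))
print(" ".join(sampled))
```

With the real model, the analogous switch is to pass `do_sample=True` (optionally with `top_k` or `top_p`) to `model_gpt.generate`, which trades some coherence for variety.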

### Creating a domain-specific language model

One of the most exciting things about BERT and GPT is being able to retune them the way we want to. We will be training models to perform two tasks: one is to create a BERT with an "accent" by training a model on English news data from the UK, from the US, and from India. We will also train a language generation model on a collection of Trump tweets.

We can train a model on text from a particular domain so that its generated language resembles that domain. The workflow is to run `run_language_modeling.py` and then `run_generation.py`. I've downloaded these files and added them to this directory so we can run them through the notebook. You are encouraged to look at these files to get a rough idea of what is going on.

### Loading Data 

We now want to get our Trump tweets and our English news datasets ready. The data the scripts expect is just a text file with relevant data. We load the Trump tweets and then write them to disk as train and test files containing only the text. I leave the original dataframes in case you would like to use them for your own purposes.

In [20]:
dfs = []

In [21]:
for file in os.listdir("../data/trump_tweets"):
    dfs.append(pd.read_json("../data/trump_tweets/" + file))

In [22]:
df = pd.concat(dfs)

In [23]:
df.head()

| | source | id_str | text | created_at | retweet_count | in_reply_to_user_id_str | favorite_count | is_retweet |
|---|---|---|---|---|---|---|---|---|
| 0 | Twitter for Android | 550441250965708800 | "@ronmeier123: @Macys Your APPAREL is UNPARALL... | 2014-12-31 23:59:55+00:00 | 8 | | 21 | False |
| 1 | Twitter for Android | 550441111513493504 | "@gillule4: @realDonaldTrump incredible experi... | 2014-12-31 23:59:22+00:00 | 5 | | 18 | False |
| 2 | Twitter for Android | 550440752254562304 | "@JobSnarechs: Negotiation tip #1: The worst t... | 2014-12-31 23:57:56+00:00 | 33 | | 44 | False |
| 3 | Twitter for Android | 550440620792492032 | "@joelmch2os: @realDonaldTrump announce your p... | 2014-12-31 23:57:25+00:00 | 8 | | 26 | False |
| 4 | Twitter for Android | 550440523094577152 | "@djspookyshadow: Feeling a deep gratitude for... | 2014-12-31 23:57:02+00:00 | 9 | | 31 | False |


In [24]:
from sklearn.model_selection import train_test_split
train_text, test_text = train_test_split(df['text'], test_size=0.2)

In [25]:
train_text.head()

1897    Ask Sally Yates, under oath, if she knows how ...
2540    “Sally Yates is part of concerns people have r...
1979                In Las Vegas, getting ready to speak!
2435    "@mstanish53: @realDonaldTrump @megynkelly  Th...
3451    RT @realDonaldTrump: “President Trump is not g...
Name: text, dtype: object

In [26]:
# mode='w' overwrites the file, so re-running this cell does not append duplicate tweets
train_text.to_frame().to_csv(r'train_text_trump', header=None, index=None, sep=' ', mode='w')

In [27]:
# mode='w' overwrites the file, so re-running this cell does not append duplicate tweets
test_text.to_frame().to_csv(r'test_text_trump', header=None, index=None, sep=' ', mode='w')
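
Since the training scripts only need a plain text file, an alternative to `to_csv` is to write the text directly, which avoids CSV quoting around tweets that contain spaces. The sketch below is one possible way to do that, not the approach used in this notebook: `write_plain_text` is a hypothetical helper, the output file name is made up, and the second and third example strings are invented.

```python
import os
import tempfile


def write_plain_text(texts, path):
    """Write one whitespace-normalised document per line and return the line count."""
    n = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # Collapse embedded newlines and runs of spaces into single spaces
            flat = " ".join(str(text).split())
            if flat:  # skip documents that are empty after cleaning
                f.write(flat + "\n")
                n += 1
    return n


tweets = ["In Las Vegas,\ngetting ready to speak!", "   ", "Great  crowd tonight"]
path = os.path.join(tempfile.mkdtemp(), "train_text_demo")
n_written = write_plain_text(tweets, path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(n_written, lines)
```

With the real data this would be called as `write_plain_text(train_text, 'train_text_trump')`, since iterating over a pandas Series yields its values.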

I then used the Google Colab GPUs to train the Trump tweet models. We'll be doing the same for our blog posts too.

### GloWbE dataset

We'll now load the GloWbE (Corpus of Global Web-Based English) dataset, which contains texts from different countries. We'll try to draw out texts from only the US, UK and India, and then save these to disk. Note that this is a Davies Corpora dataset: the full download is available from the Dropbox link I sent in an announcement a few weeks ago. The whole download is about 3.5 GB but we only need two files, which are about 250 MB each. The other files might be useful for your research purposes.

In [28]:
address = "C:/Downloads/GloWbE"

In [29]:
# these are the exact name of the files
us = "/text_us_blog_jfy.zip"
gb = "/text_gb_blog_akq.zip"

In [30]:
us_texts = lucem_illud.loadDavies(address, corpus_style="us_blog", num_files=5000)


In [None]:
gb_texts = lucem_illud.loadDavies(address, corpus_style="gb_blog", num_files=5000)

We now have a dictionary with document ids mapping to text. Since we don't need any information but the text, we can just save these to disk.

In [None]:
' '.join(list(us_texts.values())[10])

In [None]:
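def dict_to_texts(texts, file_name):
    """Write an 80/20 train/test split of `texts` to <file_name>_train and <file_name>_test, one document per line."""
    text = []
    for doc in texts.values():
        # Join the tokens and strip the corpus's heading and paragraph markers
        text.append(' '.join(doc).replace("< h >", "").replace("< p >", ""))
    train_text, test_text = train_test_split(text, test_size=0.2)
    # Explicit UTF-8 avoids platform-dependent encoding errors on non-ASCII characters
    with open(file_name + "_train", 'w', encoding='utf-8') as f:
        for item in train_text:
            f.write("%s\n" % item)

    with open(file_name + "_test", 'w', encoding='utf-8') as f:
        for item in test_text:
            f.write("%s\n" % item)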

In [None]:
dict_to_texts(us_texts, "us_blog")

In [None]:
dict_to_texts(gb_texts, "gb_blog")

We now have the training and testing files for both US and GB blogs in English. 

(WARNING - SHIFT TO GOOGLE COLAB OR GPU ENABLED MACHINE)


### Running Scripts

We use the scripts to do language modeling and text generation. The following cells run the code as if you had run it in a terminal.

#### Trump GPT-2

In [None]:
# You might have issues with the memory of your GPU in the following code.
# The default Google Colab GPU should work with batch size of 2,
# but if you get a "CUDA out of memory" error, you can reduce it to 1.

# !python run_language_modeling_gpt.py --per_gpu_train_batch_size=1 --output_dir=output_gpt_trump --model_type=gpt2 --model_name_or_path=gpt2 --do_train --train_data_file=train_text_trump --do_eval --eval_data_file=test_text_trump
!python run_language_modeling_gpt.py --per_gpu_train_batch_size=2 --output_dir=output_gpt_trump --model_type=gpt2 --model_name_or_path=gpt2 --do_train --train_data_file=train_text_trump --do_eval --eval_data_file=test_text_trump
# !python run_language_modeling_gpt.py --output_dir=output_gpt_trump --model_type=gpt2 --model_name_or_path=gpt2 --do_train --train_data_file=train_text_trump --do_eval --eval_data_file=test_text_trump

# If the GPU memory error still occurs even with a batch size of 1,
# you might need to run the code on RCC Midway
# (See https://github.com/UChicago-Computational-Content-Analysis/Frequently-Asked-Questions/issues/18)
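
Lowering the batch size saves memory but increases the number of iterations per epoch. The arithmetic can be sketched as below. This is a rough illustration rather than the script's exact bookkeeping: `training_steps` is a hypothetical helper, the 1000 training blocks are a made-up figure, and `gradient_accumulation_steps` is assumed to behave in the usual way (several small batches are accumulated before each optimizer update).

```python
import math


def training_steps(n_blocks, per_gpu_batch_size, n_gpu=1,
                   gradient_accumulation_steps=1, epochs=1):
    """Return (forward/backward iterations, optimizer updates) for a training run."""
    # Each iteration consumes one batch per GPU
    batches_per_epoch = math.ceil(n_blocks / (per_gpu_batch_size * n_gpu))
    # An optimizer update happens once every `gradient_accumulation_steps` batches
    updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
    return batches_per_epoch * epochs, updates_per_epoch * epochs


# Batch size 2: half as many iterations as there are blocks
print(training_steps(1000, per_gpu_batch_size=2))
# Batch size 1 with accumulation over 2 batches: more iterations, same number of updates
print(training_steps(1000, per_gpu_batch_size=1, gradient_accumulation_steps=2))
```

In other words, dropping to a batch size of 1 while accumulating gradients over 2 batches keeps the number of parameter updates the same as a batch size of 2, at the cost of more (smaller) iterations.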

#### RoBERTa US

In [38]:
!python run_language_modeling_roberta.py --per_gpu_train_batch_size=1 --output_dir=output_roberta_us --model_type=roberta --model_name_or_path=roberta-base --do_train --train_data_file=us_blog_train --do_eval --eval_data_file=us_blog_test --mlm

03/08/2022 11:14:08 - INFO - __main__ -   Training/evaluation parameters Namespace(train_data_file='us_blog_train', output_dir='output_roberta_us', model_type='roberta', eval_data_file='us_blog_test', line_by_line=False, should_continue=False, model_name_or_path='roberta-base', mlm=True, mlm_probability=0.15, config_name=None, tokenizer_name=None, cache_dir=None, block_size=512, do_train=True, do_eval=True, evaluate_during_training=False, per_gpu_train_batch_size=1, per_gpu_eval_batch_size=4, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_steps=500, save_steps=500, save_total_limit=None, eval_all_checkpoints=False, no_cuda=False, overwrite_output_dir=False, overwrite_cache=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, server_ip='', server_port='', n_gpu=1, device=device(type='cuda'))
03/08/2022 11:14:08 - INFO - __main__ -   Creating features from 


Iteration:   4%|3         | 498/13686 [00:58<26:54,  8.17it/s]
03/08/2022 11:15:40 - INFO - __main__ -   Saving model checkpoint to output_roberta_us\checkpoint-500
03/08/2022 11:15:42 - INFO - __main__ -   Saving optimizer and scheduler states to output_roberta_us\checkpoint-500
Iteration:  15%|#4        | 1999/13686 [04:00<22:03,  8.83it/s]
03/08/2022 11:18:42 - INFO - __main__ -   Saving model checkpoint to output_roberta_us\checkpoint-2000
03/08/2022 11:18:43 - INFO - __main__ -   Saving optimizer and scheduler states to
Iteration:  85%|########5 | 11657/13686 [23:27<03:50,  8.81it/s]

#### RoBERTa UK

In [32]:
!python run_language_modeling_roberta.py --output_dir=output_roberta_gb --model_type=roberta --model_name_or_path=roberta-base --do_train --train_data_file=gb_blog_train --do_eval --eval_data_file=gb_blog_test --mlm

03/08/2022 11:07:32 - INFO - __main__ -   Training/evaluation parameters Namespace(train_data_file='gb_blog_train', output_dir='output_roberta_gb', eval_data_file='gb_blog_test', model_type='roberta', model_name_or_path='roberta-base', mlm=True, mlm_probability=0.15, config_name='', tokenizer_name='', cache_dir='', block_size=510, do_train=True, do_eval=True, evaluate_during_training=False, do_lower_case=False, per_gpu_train_batch_size=4, per_gpu_eval_batch_size=1, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, num_warmup_steps=0, logging_steps=50, save_steps=50, save_total_limit=None, eval_all_checkpoints=False, no_cuda=False, overwrite_output_dir=False, overwrite_cache=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, server_ip='', server_port='', n_gpu=1, device=device(type='cuda'))
03/08/2022 11:07:32 - INFO - __main__ -   Creating features from dataset file at 

(If you left to use a GPU machine, COME BACK TO THIS NOTEBOOK to load and work with your trained model.)
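
Both RoBERTa runs above pass `--mlm`, and the logged parameters show `mlm_probability=0.15`. Masked language modeling hides a fraction of the input tokens and trains the model to recover them. The sketch below is a simplified, word-level illustration of that idea: `mask_tokens` is a hypothetical helper, and the real training code works on subword ids and also sometimes substitutes a random token or leaves a selected token unchanged.

```python
import random


def mask_tokens(tokens, mlm_probability=0.15, mask_token="<mask>", seed=0):
    """Return (inputs, labels): masked copy of tokens and the targets to predict."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mlm_probability:
            inputs.append(mask_token)
            labels.append(token)  # the model must recover this token
        else:
            inputs.append(token)
            labels.append(None)   # position ignored by the loss
    return inputs, labels


tokens = "do you have your chips with fish or with salsa".split() * 20
inputs, labels = mask_tokens(tokens)
n_masked = sum(label is not None for label in labels)
print(n_masked, "of", len(tokens), "tokens masked")
```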

### Loading and using models

Let us now load the four models we have and see how we can use them.

And now - let us see what our Trump Tweet Bot looks like!
You can generate text via command line using the command below. You can also load a model once it is saved - I trained my model using Google Colab, downloaded the model, and am loading it again via the command below. Note that you have to download all the files in your folder of the fine-tuned model to use the model.

In [None]:
# !python run_generation.py --model_type=gpt2 --model_name_or_path=output_gpt_trump

In [None]:
tokenizer_trump = AutoTokenizer.from_pretrained("output_gpt_trump")
# If this line does not work, try tokenizer_trump = AutoTokenizer.from_pretrained("gpt2")

model_trump = AutoModelWithLMHead.from_pretrained("output_gpt_trump")

In [None]:
sequence = "Obama is going to"

input_ids = tokenizer_trump.encode(sequence, return_tensors="pt")
# GPT-2 has no padding token of its own, so pad with its end-of-text token
generated = model_trump.generate(input_ids, max_length=50, pad_token_id=tokenizer_trump.eos_token_id)

resulting_string = tokenizer_trump.decode(generated.tolist()[0])
print(resulting_string)

Wow - our Trump bot is nasty, so we know our model trained well. What happens if we try the same sentence with our model that has not been fine-tuned?

In [None]:
sequence = "Obama is going to"

input_ids = tokenizer_gpt.encode(sequence, return_tensors="pt")
# GPT-2 has no padding token of its own, so pad with its end-of-text token
generated = model_gpt.generate(input_ids, max_length=50, pad_token_id=tokenizer_gpt.eos_token_id)

resulting_string = tokenizer_gpt.decode(generated.tolist()[0])
print(resulting_string)

Quite the contrast.

## <font color="red">*Exercise 3*</font>

<font color="red">Construct cells immediately below this that generate a BERT-powered chatbot tuned on text related to your final project. What is interesting about this model, and how does it compare to an untrained model? What does it reveal about the social game involved with your dataset?

Contextual models can also help us visualize how words in a sentence are different from or similar to each other. We will try to construct sentences where words might mean different things in different countries: in the US, people might eat chips with salsa, but in the UK, chips are what Americans call french fries, and people might eat them with fried fish instead.

In [33]:
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer

In [34]:
# Load the fine-tuned weights from the directory written by --output_dir during training
# (adjust the path if you moved or renamed the folder).
roberta_us_model_embedding = RobertaModel.from_pretrained('output_roberta_us')

In [None]:
roberta_us_tokenizer = RobertaTokenizer.from_pretrained('output_roberta_us')

In [None]:
text = "Do you have your chips with fish or with salsa?" 

In [None]:
text1 = "He went out in just his undershirt and pants." #pants are underwear in Britain; maybe closer to an undershirt
text2 = "His braces completed the outfit." #braces are suspenders (in Britain); maybe closer to an outfit
text3 = "Does your pencil have a rubber on it?" #rubber is an eraser (in Britain); maybe closer to a pencil
text4 = "Was the bog closer to the forest or the house?" #bog is a toilet (in Britain); maybe closer to a house
text5 = "Are you taking the trolley or the train to the grocery market" #trolley is a food carriage; possibly closer to a market

In [None]:
def word_vector(text, word_id, model, tokenizer):
    # Let the tokenizer add the model's own special tokens
    # (RoBERTa uses <s> ... </s>, not BERT's [CLS] ... [SEP])
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        word_embeddings = model(**encoded)[0]  # shape: (1, n_tokens, hidden_size)
    # Position 0 is the start token, so shift by one. Note that this indexes
    # subword tokens, which line up with whitespace-separated words only when
    # each word maps to a single token.
    vector = word_embeddings[0][word_id + 1].numpy()
    return vector

In [None]:
from scipy.spatial.distance import cosine  # cosine *distance*, so similarity = 1 - cosine(p, q)

def visualise_diffs(text, model, tokenizer):
    words = text.split()
    # One contextual vector per whitespace-separated word position
    word_vecs = [word_vector(text, i, model, tokenizer) for i in range(len(words))]
    # Pairwise cosine similarity between every pair of positions
    M = np.array([[1 - cosine(p, q) for q in word_vecs] for p in word_vecs])
    fig = plt.figure()
    div = pd.DataFrame(M, columns=words, index=words)
    ax = sns.heatmap(div)
    plt.show()

In [None]:
visualise_diffs(text, roberta_us_model_embedding, roberta_us_tokenizer)
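
Each cell of the heatmap is a cosine similarity between two word vectors. SciPy's `cosine` returns the cosine *distance*, which is why `visualise_diffs` computes `1 - cosine(p, q)`. As a small, dependency-free sketch of the same quantity (the helper names and toy vectors are made up for illustration):

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between vectors p and q (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, in the same layout as the heatmap."""
    return [[cosine_similarity(p, q) for q in vectors] for p in vectors]


# Toy 2-dimensional "embeddings"
vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
M = similarity_matrix(vectors)
for row in M:
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal, and vector length does not matter: only direction does, which is why cosine similarity is a common choice for comparing embeddings.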

In [None]:
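# Load from the directory written by --output_dir during training
# (adjust the path if you moved or renamed the folder).
roberta_gb_model_embedding = RobertaModel.from_pretrained('output_roberta_gb')
roberta_gb_tokenizer = RobertaTokenizer.from_pretrained('output_roberta_gb')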

In [None]:
visualise_diffs(text, roberta_gb_model_embedding, roberta_gb_tokenizer)

## <font color="red">*Exercise 4*</font>

<font color="red">What's a research question relevant to your final project that would benefit from a generative or conversational model like GPT-2 or GPT-3? What task would you have it perform, and how would you evaluate the output?
    
<font color="red">_Stretch:_ Do it!

# Optional: AI Dungeon
A fun way to explore the power of conversation and text generation is [AI Dungeon](https://play.aidungeon.io/main/newGame). If you select _Quick Start_, you can start a text adventure similar to the popular role-playing game Dungeons and Dragons, or the "choose your own adventure" books some of us boomers read as kids. AI Dungeon used to use GPT-2 but now uses GPT-3 to generate text in response to your actions. For example, you may begin with the prompt:

```
You are Jimmy, a wizard living in the kingdom of Larion. You have a staff and a spellbook. You finish your long journey and finally arrive at the ruin you've been looking for. You have come here searching for a mystical spellbook of great power called the book of essence. You look around and see a beautiful woman with raven hair and dark, sparkling eyes. She approaches you and asks you what you seek.

You are now before the one known as Raven.
```

You can respond with any text. I type, `I ask Raven, "What is the meaning of life?"`, to which AI Dungeon says:

```
Raven ponders the question for a moment, and then she answers. Raven says, "It is the pursuit of happiness." What is the pursuit of happiness?
```

I type, `I respond, "Happiness to me would be unearthing the treasures of the Kilarean Empire."`:

```
Raven looks at me and says, "Come with me and live out your one true life."

You go with Raven through many passages and up and down elevators.
```

Raven might be helping me unearth the treasures! Not bad!

There are a lot of tools like AI Dungeon being developed now with state-of-the-art language models. Now that you have some understanding of deep language models, you can better appreciate them—and perhaps design some of your own!