# About

The goal of this tutorial is to build a Trump chatbot for language learning, one that lets you practice simple conversations in a language you care about. To get there, we
will fine-tune the small DialoGPT model: https://huggingface.co/transformers/model_doc/dialogpt.html.

# Dataset

The main bottleneck is the dataset. Ideally we would have a corpus of Trump dialogs. For the first iteration I propose using transcripts from the election debates:
* Debates with Hillary Clinton (first and second): https://github.com/wimlds/election-data-hackathon
* First debate with Joe Biden: https://www.kaggle.com/headsortails/us-election-2020-presidential-debates?select=us_election_2020_1st_presidential_debate.csv
* Second debate with Joe Biden: https://www.kaggle.com/headsortails/us-election-2020-presidential-debates?select=us_election_2020_2nd_presidential_debate.csv
* Trump town hall: https://www.kaggle.com/headsortails/us-election-2020-presidential-debates?select=us_election_2020_trump_town_hall.csv

This data is far from perfect and has several weaknesses. For example, many answers are very long.

How can we get more data? In the next iteration we can use text from Wikipedia, tweets, and Trump's speeches, together with a question-generation model (https://github.com/patil-suraj/question_generation) to generate questions for that text. A small experiment showing how this works can be found in Additional_dataset.

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated
pd.set_option('display.max_rows', 150)
np.seterr(divide='ignore', invalid='ignore')


{'divide': 'warn', 'over': 'warn', 'under': 'ignore', 'invalid': 'warn'}

## Install some libraries

In [20]:
! pip install mosestokenizer
! pip install unidecode
! pip install blingfire
! pip install torch
! pip install git+https://github.com/huggingface/transformers
! pip install tensorboardX

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /private/var/folders/5x/9tykzp5x2hs34ly3kyq8pp8m0000gn/T/pip-req-build-los__npd
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-4.2.0.dev0-py3-none-any.whl size=1527263 sha256=23c05fe175b4d10a17da913eaa6b0fe44e982ca1e5db0c41f54e6ac89646c144
  Stored in directory: /private/var/folders/5x/9tykzp5x2hs34ly3kyq8pp8m0000gn/T/pip-ephem-wheel-cache-9p0xr3d4/wheels/42/68/45/c63edff61c292f2dfd4df4ef6522dcbecc603e7af82813c1d7
Successfully built transformers
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.1.1
    Uninstalling t

## Example of the model without fine-tuning

In [36]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small") #medium
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")


# Let's chat for 6 lines
for step in range(6):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User: ") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response; max_length caps the total sequence (history + reply) at 300 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=300,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=100, 
        top_p=0.7,
        temperature = 0.8
    )
    
    # pretty print the last output tokens from the bot
    print("Trump: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User: Good evening!
Trump: Good morning!
>> User: Are you planning to start a nuclear war with North Korea?
Trump: Good evening!
>> User: Are you planning to continue the economic war with China?
Trump: Greetings! Good evening! Good afternoon!
>> User: Have you been supported by Russia during the election?
Trump: I am an American citizen and I support this decision.
>> User: How do you feel about the Black Life Matters?
Trump: They do.
>> User: How are you planning to overcome Covid-19?
Trump: The war will be won by a large majority of people.
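
The chat loop above does two things with token ids: it appends each new user turn to the running history, and it slices the prompt off the generated sequence so that only the new reply is printed. Here is a minimal sketch of that bookkeeping using plain Python lists instead of tensors. `fake_generate` and the `EOS` value are hypothetical stand-ins for `model.generate` and the real end-of-sequence id, not part of the notebook.

```python
# Toy stand-in for the generate() loop: token ids are plain ints and
# fake_generate is a hypothetical placeholder for model.generate.
EOS = 0  # assumed end-of-sequence id, for illustration only


def fake_generate(input_ids):
    """Return the prompt followed by a canned two-token reply and EOS."""
    return input_ids + [101, 102, EOS]


chat_history = []
replies = []
for user_turn in ([5, 6], [7]):
    # the new user turn (plus EOS) is appended to everything said so far
    bot_input = chat_history + user_turn + [EOS]
    # the returned sequence contains the prompt AND the new reply
    chat_history = fake_generate(bot_input)
    # only the tokens after the prompt belong to the new reply
    replies.append(chat_history[len(bot_input):])

print(replies)
print(len(chat_history))
```

Because the history keeps growing while `max_length` stays fixed, each later turn leaves less room for the reply, which may be one reason long chats degrade.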


#### The answers are not good enough. Let's fine-tune the model and repeat the questions

In [3]:
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""

import glob
import logging
import os
import pickle
import random
import re
import shutil
from typing import Dict, List, Tuple

import pandas as pd
import numpy as np
import torch

from sklearn.model_selection import train_test_split

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm.notebook import tqdm, trange

from pathlib import Path

from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)

from blingfire import *
from mosestokenizer import *
from unidecode import unidecode

try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

# Configs
logger = logging.getLogger(__name__)

MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

Let's define the configuration variables so we don't have a bunch of magic numbers and strings!

In [4]:
# Args to allow for easy conversion of a Python script to a notebook
class Args():
    def __init__(self):
        self.output_dir = 'output'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = True #False
        self.per_gpu_train_batch_size = 4 # 4
        self.per_gpu_eval_batch_size = 4 # 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 3
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = None
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'
        self.path_to_data = "/Users/viktor/Documents/GitHub/Project/Trump_chatbot/data/debate"
        self.n_tokens = 200 # maximum N of tokens that we will leave in dataset

args = Args()

#### Let's read the data 

In [5]:
hillary_trump_debat = pd.read_csv(args.path_to_data + "/hillary/debate.csv")

biden_trump_debat_1 = pd.read_csv(args.path_to_data + "/biden/us_election_2020_1st_presidential_debate.csv")
biden_trump_debat_2 = pd.read_csv(args.path_to_data + "/biden/us_election_2020_2nd_presidential_debate.csv")

trump_town_hall = pd.read_csv(args.path_to_data + "/biden/us_election_2020_trump_town_hall.csv")


In [6]:
biden_trump_debat_2.head()

Unnamed: 0,speaker,minute,text
0,Kristen Welker,00:18,"Good evening, everyone. Good evening. Thank you so much for being here. It is such an honor for me to moderate this debate tonight, the final debate. I want to welcome the first family and the first lady. We’re so glad and thankful that you are feeling better. I want to welcome the Biden family, Dr. Jill Biden. Thank you all for being here tonight. We are so excited. We’re looking forward to a really robust discussion. And the only thing I would reiterate are the CPD guidelines that when the candidates are talking, please hold any applause or any other reactions. Except of course, when they walk out, make sure you cheer and loud and applause so that everyone can hear you. Thank you for having me. This is really the honor of a lifetime. I am going to sit down and just get organized and get settled and the show will start very soon. Thank you for being here. (silence). Good evening from Belmont University in Nashville, Tennessee. I’m Kristen Welker of NBC News. And I welcome you to the final 2020 presidential debate between President Donald J. Trump and former vice president Joe Biden. Tonight’s debate is sponsored by the Commission on Presidential Debates. It is conducted under health and safety protocols designed by the Commission’s health security advisor. The audience here in the hall has promised to remain silent. No cheers, boos, or other interruptions, except right now, as we welcome to the stage, former vice president Joe Biden and President Donald J. Trump."
1,Donald Trump,07:37,How are you doing? How are you?
2,Kristen Welker,07:58,"And I do want to say a very good evening to both of you. This debate will cover six major topics. At the beginning of each section, each candidate will have two minutes, uninterrupted, to answer my first question. The Debate Commission will then turn on their microphone only when it is their turn to answer. And the Commission will turn it off exactly when the two minutes have expired. After that, both microphones will remain on. But on behalf of the voters, I’m going to ask you to please speak one at a time."
3,Kristen Welker,08:27,"The goal is for you to hear each other and for the American people to hear every word of what you both have to say. And so with that, if you’re ready, let’s start. And we will begin with the fight against the coronavirus. President Trump, the first question is for you. The country is heading into a dangerous new phase. More than 40,000 Americans are in the hospital tonight with COVID, including record numbers here in Tennessee. And since the two of you last shared a stage, 16,000 Americans have died from COVID. So please be specific. How would you lead the country during this next stage of the coronavirus crisis? Two minutes, uninterrupted."
4,Kristen Welker,09:03,… during this next stage of the coronavirus crisis. Two minutes uninterrupted.


#### Small preprocessing 

In [7]:
biden_trump_debat_2 = biden_trump_debat_2[biden_trump_debat_2['text'] != '[Crosstalk 00:24:31].']

In [8]:
hillary_trump_debat = hillary_trump_debat[~hillary_trump_debat.Speaker.isin(['CANDIDATES','Audience'])]
hillary_trump_debat = hillary_trump_debat[hillary_trump_debat.Date !='2016-10-04']
hillary_trump_debat.columns = [col.lower() for col in hillary_trump_debat.columns]

In [9]:
hillary_trump_debat1 = hillary_trump_debat[hillary_trump_debat.date =='2016-09-26'].reset_index(drop=True)
hillary_trump_debat2 = hillary_trump_debat[hillary_trump_debat.date =='2016-10-09'].reset_index(drop=True)
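
The filter in cell 7 removes one exact string, `'[Crosstalk 00:24:31].'`. If other rows contain only a crosstalk marker with a different timestamp, a pattern match would catch them too. The sketch below assumes the marker always looks like `[Crosstalk hh:mm:ss]` with an optional trailing period. That format is a guess based on the single example above, and `is_crosstalk` is a hypothetical helper.

```python
import re

# Assumed marker shape: "[Crosstalk hh:mm:ss]" with an optional trailing period.
CROSSTALK_RE = re.compile(r"^\s*\[crosstalk \d{2}:\d{2}:\d{2}\]\.?\s*$", re.IGNORECASE)


def is_crosstalk(text: str) -> bool:
    """True when the whole utterance is nothing but a crosstalk marker."""
    return bool(CROSSTALK_RE.match(text))


rows = [
    "[Crosstalk 00:24:31].",
    "That is not true.",
    "[crosstalk 00:51:02]",
    "He said [crosstalk 00:10:00] no.",  # marker inside real text is kept
]
kept = [t for t in rows if not is_crosstalk(t)]
print(kept)
```

On a dataframe, the same pattern could be applied to the `text` column with `Series.str.match` and a negated boolean mask.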

In [12]:
def convert_dataframe2response_context(df_in: pd.DataFrame, column_w_speaker: str='speaker',
                                       column_w_text: str='text', trump_name_in_df: str='President Donald J. Trump',
                                       context_len: int=5):
    
    df_in = df_in.reset_index(drop=True)
    trump_told = df_in[df_in[column_w_speaker] == trump_name_in_df]

    contexted = []


    for i in list(trump_told.index):

        # subtract 1 so that each row holds Trump's response plus the context_len previous utterances
        prev = max(i - 1 - context_len, 0) 
        row = [df_in.loc[j, column_w_text] for j in range(i, prev, -1)]

        contexted.append(row)  

    columns = ['Trump_said', 'context'] 
    columns = columns + ['context/'+str(i) for i in range(context_len-1)]   

    df_out = pd.DataFrame.from_records(contexted, columns=columns)
    
    return df_out



def text_to_words(s: str):
    """Tokenize text into space-separated words using BlingFire."""

    # get the UTF-8 bytes
    s_bytes = s.encode("utf-8")

    # allocate the output buffer
    o_bytes = create_string_buffer(len(s_bytes) * 3)
    o_bytes_count = len(o_bytes)

    # split the text into words
    o_len = blingfire.TextToWords(c_char_p(s_bytes), c_int(len(s_bytes)), byref(o_bytes), c_int(o_bytes_count))

    # check if no error has happened
    if -1 == o_len or o_len > o_bytes_count:
        return ''

    # compute the unicode string from the UTF-8 bytes
    return o_bytes.value.decode('utf-8')
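
To make the windowing in `convert_dataframe2response_context` concrete, here is a pandas-free sketch of the same idea on a toy transcript. `build_rows` is a hypothetical helper, not the notebook function. It differs in one respect: short windows near the start of the transcript are padded explicitly with `None`, whereas the notebook relies on pandas to fill them and on a later `dropna()` to discard them.

```python
def build_rows(utterances, target_speaker, context_len=5):
    """utterances: list of (speaker, text) tuples in chronological order."""
    rows = []
    for i, (speaker, _) in enumerate(utterances):
        if speaker != target_speaker:
            continue
        start = max(i - context_len, 0)
        # newest first: the response itself, then the utterances before it
        row = [utterances[j][1] for j in range(i, start - 1, -1)]
        # pad so that every row has exactly context_len + 1 entries
        row += [None] * (context_len + 1 - len(row))
        rows.append(row)
    return rows


transcript = [
    ("Moderator", "Q1"),
    ("Trump", "A1"),
    ("Biden", "B1"),
    ("Moderator", "Q2"),
    ("Trump", "A2"),
]
rows = build_rows(transcript, "Trump", context_len=2)
print(rows)
```

The first row is incomplete because only one utterance precedes the first response, so a `dropna()`-style filter would discard it.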

#### Convert each dataset into the format: Trump speech + context
Each input text is also normalized and truncated to at most args.n_tokens tokens

In [13]:
list_of_df = []

detokenize = MosesDetokenizer('en')
for df_to_convert, trump_name in zip([hillary_trump_debat1, hillary_trump_debat2, biden_trump_debat_1, biden_trump_debat_2, trump_town_hall],
                                     ['Trump', 'Trump', 'President Donald J. Trump', 'Donald Trump', 'President Trump']):
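    # normalize the text: transliterate to ASCII, tokenize into words, spell out '&'
    # note: Series.replace('-', ' ') only matches cells equal to '-'; hyphens inside text are kept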
    df_to_convert['norm_text'] = df_to_convert.text.replace('-', ' ').apply(lambda x: text_to_words(unidecode(str(x))).replace('&', 'and'))
    # keep at most args.n_tokens tokens
    df_to_convert['norm_text'] = df_to_convert['norm_text'].apply(lambda x: detokenize(x.split(' ')[:args.n_tokens]))
    
    df_converted = convert_dataframe2response_context(df_to_convert, column_w_speaker='speaker', column_w_text='norm_text',
                                  trump_name_in_df=trump_name, context_len=5)

    list_of_df.append(df_converted)
    

stdbuf was not found; communication with perl may hang due to stdio buffering.
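
The truncation step above keeps at most `args.n_tokens` space-separated tokens per utterance. A rough sketch of its effect follows, using a plain `str.join` in place of `MosesDetokenizer`. Unlike the real detokenizer, this will not reattach punctuation to the preceding word. `truncate_tokens` is a hypothetical helper.

```python
def truncate_tokens(text: str, n_tokens: int) -> str:
    """Keep only the first n_tokens whitespace-separated tokens."""
    tokens = text.split()
    return " ".join(tokens[:n_tokens])


tokenized = "Thank you , Lester . Our jobs are fleeing the country ."
short = truncate_tokens(tokenized, 5)
print(short)
```

A hard cut like this can end an utterance mid-sentence, which is why some long answers in the table below stop abruptly.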


Collect all datasets into one dataframe and remove rows with missing context

In [14]:
df = pd.concat(list_of_df, ignore_index=True, axis=0).dropna()
df.head(5)

Unnamed: 0,Trump_said,context,context/0,context/1,context/2,context/3
0,"Thank you, Lester. Our jobs are fleeing the country. They're going to Mexico. They're going to many other countries. You look at what China is doing to our country in terms of making our product. They're devaluing their currency, and there's nobody in our government to fight them. And we have a very good fight. And we have a winning fight. Because they're using our country as a piggy bank to rebuild China, and many other countries are doing the same thing. So we're losing our good jobs, so many of them. When you look at what's happening in Mexico, a friend of mine who builds plants said it's the eighth wonder of the world. They're building some of the biggest plants anywhere in the world, some of the most sophisticated, some of the best plants. With the United States, as he said, not so much. So Ford is leaving. You see that, their small car division leaving. Thousands of jobs leaving Michigan, leaving Ohio","Secretary Clinton, thank you. Mr. Trump, the same question to you. It's about putting money - - more money into the pockets of American workers. You have up to two minutes.","I also want to see more companies do profit - sharing. If you help create the profits, you should be able to share in them, not just the executives at the top. And I want us to do more to support people who are struggling to balance family and work. I've heard from so many of you about the difficult choices you face and the stresses that you're under. So let's have paid family leave, earned sick days. Let's be sure we have affordable child care and debt - free college. How are we going to do it? We're going to do it by having the wealthy pay their fair share and close the corporate loopholes. Finally, we tonight are on the stage together, Donald Trump and I. Donald, it's good to be with you. We're going to have a debate where we are talking about the important issues facing our country. 
You have to judge us, who can shoulder the immense, awesome responsibilities of the presidency, who can put into action the","Well, thank you, Lester, and thanks to Hofstra for hosting us. The central question in this election is really what kind of country we want to be and what kind of future we'll build together. Today is my granddaughter's second birthday, so I think about this a lot. First, we have to build an economy that works for everyone, not just those at the top. That means we need new jobs, good jobs, with rising incomes. I want us to invest in you. I want us to invest in your future. That means jobs in infrastructure, in advanced manufacturing, innovation and technology, clean, renewable energy, and small business, because most of the new jobs will come from small business. We also have to make the economy fairer. That starts with raising the national minimum wage and also guarantee, finally, equal pay for women's work.","Well, I do n't expect us to cover all the issues of this campaign tonight, but I remind everyone, there are two more presidential debates scheduled. We are going to focus on many of the issues that voters tell us are most important, and we're going to press for specifics. I am honored to have this role, but this evening belongs to the candidates and, just as important, to the American people. Candidates, we look forward to hearing you articulate your policies and your positions, as well as your visions and your values. So, let's begin. We're calling this opening segment ""Achieving Prosperity."" And central to that is jobs. There are two economic realities in America today. There's been a record six straight years of job growth, and new census numbers show incomes have increased at a record rate after years of stagnation. However, income inequality remains significant, and nearly half of Americans are living paycheck to paycheck. Beginning with you, Secretary Clinton, why are you a better choice",Good luck to you.
1,"We can not let it happen. Under my plan, I'll be reducing taxes tremendously, from 35 percent to 15 percent for companies, small and big businesses. That's going to be a job creator like we have n't seen since Ronald Reagan. It's going to be a beautiful thing to watch. Companies will come. They will build. They will expand. New companies will start. And I look very, very much forward to doing it. We have to renegotiate our trade deals, and we have to stop these countries from stealing our companies and our jobs.","Thank you, Lester. Our jobs are fleeing the country. They're going to Mexico. They're going to many other countries. You look at what China is doing to our country in terms of making our product. They're devaluing their currency, and there's nobody in our government to fight them. And we have a very good fight. And we have a winning fight. Because they're using our country as a piggy bank to rebuild China, and many other countries are doing the same thing. So we're losing our good jobs, so many of them. When you look at what's happening in Mexico, a friend of mine who builds plants said it's the eighth wonder of the world. They're building some of the biggest plants anywhere in the world, some of the most sophisticated, some of the best plants. With the United States, as he said, not so much. So Ford is leaving. You see that, their small car division leaving. Thousands of jobs leaving Michigan, leaving Ohio","Secretary Clinton, thank you. Mr. Trump, the same question to you. It's about putting money - - more money into the pockets of American workers. You have up to two minutes.","I also want to see more companies do profit - sharing. If you help create the profits, you should be able to share in them, not just the executives at the top. And I want us to do more to support people who are struggling to balance family and work. I've heard from so many of you about the difficult choices you face and the stresses that you're under. 
So let's have paid family leave, earned sick days. Let's be sure we have affordable child care and debt - free college. How are we going to do it? We're going to do it by having the wealthy pay their fair share and close the corporate loopholes. Finally, we tonight are on the stage together, Donald Trump and I. Donald, it's good to be with you. We're going to have a debate where we are talking about the important issues facing our country. You have to judge us, who can shoulder the immense, awesome responsibilities of the presidency, who can put into action the","Well, thank you, Lester, and thanks to Hofstra for hosting us. The central question in this election is really what kind of country we want to be and what kind of future we'll build together. Today is my granddaughter's second birthday, so I think about this a lot. First, we have to build an economy that works for everyone, not just those at the top. That means we need new jobs, good jobs, with rising incomes. I want us to invest in you. I want us to invest in your future. That means jobs in infrastructure, in advanced manufacturing, innovation and technology, clean, renewable energy, and small business, because most of the new jobs will come from small business. We also have to make the economy fairer. That starts with raising the national minimum wage and also guarantee, finally, equal pay for women's work.","Well, I do n't expect us to cover all the issues of this campaign tonight, but I remind everyone, there are two more presidential debates scheduled. We are going to focus on many of the issues that voters tell us are most important, and we're going to press for specifics. I am honored to have this role, but this evening belongs to the candidates and, just as important, to the American people. Candidates, we look forward to hearing you articulate your policies and your positions, as well as your visions and your values. So, let's begin. 
We're calling this opening segment ""Achieving Prosperity."" And central to that is jobs. There are two economic realities in America today. There's been a record six straight years of job growth, and new census numbers show incomes have increased at a record rate after years of stagnation. However, income inequality remains significant, and nearly half of Americans are living paycheck to paycheck. Beginning with you, Secretary Clinton, why are you a better choice"
2,"Well, for one thing - - and before we start on that - - my father gave me a very small loan in 1975, and I built it into a company that's worth many, many billions of dollars, with some of the greatest assets in the world, and I say that only because that's the kind of thinking that our country needs. Our country's in deep trouble. We do n't know what we're doing when it comes to devaluations and all of these countries all over the world, especially China. They're the best, the best ever at it. What they're doing to us is a very, very sad thing. So we have to do that. We have to renegotiate our trade deals. And, Lester, they're taking our jobs, they're giving incentives, they're doing things that, frankly, we do n't do. Let me give you the example of Mexico. They have a VAT tax. We're on a different system. When we sell into Mexico, there","Let me follow up with Mr. Trump, if you can. You've talked about creating 25 million jobs, and you've promised to bring back millions of jobs for Americans. How are you going to bring back the industries that have left this country for cheaper labor overseas? How, specifically, are you going to tell American manufacturers that you have to come back?","Well, I think that trade is an important issue. Of course, we are 5 percent of the world's population; we have to trade with the other 95 percent. And we need to have smart, fair trade deals. We also, though, need to have a tax system that rewards work and not just financial transactions. And the kind of plan that Donald has put forth would be trickle - down economics all over again. In fact, it would be the most extreme version, the biggest tax cuts for the top percent of the people in this country than we've ever had. I call it trumped - up trickle - down, because that's exactly what it would be. That is not how we grow the economy. 
We just have a different view about what's best for growing the economy, how we make investments that will actually produce jobs and rising incomes. I think we come at it from somewhat different perspectives. I understand that. You know, Donald was very fortunate in his life, and that","Secretary Clinton, would you like to respond?","We can not let it happen. Under my plan, I'll be reducing taxes tremendously, from 35 percent to 15 percent for companies, small and big businesses. That's going to be a job creator like we have n't seen since Ronald Reagan. It's going to be a beautiful thing to watch. Companies will come. They will build. They will expand. New companies will start. And I look very, very much forward to doing it. We have to renegotiate our trade deals, and we have to stop these countries from stealing our companies and our jobs.","Thank you, Lester. Our jobs are fleeing the country. They're going to Mexico. They're going to many other countries. You look at what China is doing to our country in terms of making our product. They're devaluing their currency, and there's nobody in our government to fight them. And we have a very good fight. And we have a winning fight. Because they're using our country as a piggy bank to rebuild China, and many other countries are doing the same thing. So we're losing our good jobs, so many of them. When you look at what's happening in Mexico, a friend of mine who builds plants said it's the eighth wonder of the world. They're building some of the biggest plants anywhere in the world, some of the most sophisticated, some of the best plants. With the United States, as he said, not so much. So Ford is leaving. You see that, their small car division leaving. Thousands of jobs leaving Michigan, leaving Ohio"
3,"Secretary Clinton and others, politicians, should have been doing this for years, not right now, because of the fact that we've created a movement. They should have been doing this for years. What's happened to our jobs and our country and our economy generally is - - look, we owe $20 trillion. We can not do it any longer, Lester.","Let me interrupt just a moment, but...","Well, for one thing - - and before we start on that - - my father gave me a very small loan in 1975, and I built it into a company that's worth many, many billions of dollars, with some of the greatest assets in the world, and I say that only because that's the kind of thinking that our country needs. Our country's in deep trouble. We do n't know what we're doing when it comes to devaluations and all of these countries all over the world, especially China. They're the best, the best ever at it. What they're doing to us is a very, very sad thing. So we have to do that. We have to renegotiate our trade deals. And, Lester, they're taking our jobs, they're giving incentives, they're doing things that, frankly, we do n't do. Let me give you the example of Mexico. They have a VAT tax. We're on a different system. When we sell into Mexico, there","Let me follow up with Mr. Trump, if you can. You've talked about creating 25 million jobs, and you've promised to bring back millions of jobs for Americans. How are you going to bring back the industries that have left this country for cheaper labor overseas? How, specifically, are you going to tell American manufacturers that you have to come back?","Well, I think that trade is an important issue. Of course, we are 5 percent of the world's population; we have to trade with the other 95 percent. And we need to have smart, fair trade deals. We also, though, need to have a tax system that rewards work and not just financial transactions. And the kind of plan that Donald has put forth would be trickle - down economics all over again. 
In fact, it would be the most extreme version, the biggest tax cuts for the top percent of the people in this country than we've ever had. I call it trumped - up trickle - down, because that's exactly what it would be. That is not how we grow the economy. We just have a different view about what's best for growing the economy, how we make investments that will actually produce jobs and rising incomes. I think we come at it from somewhat different perspectives. I understand that. You know, Donald was very fortunate in his life, and that","Secretary Clinton, would you like to respond?"
4,"Well, the first thing you do is do n't let the jobs leave. The companies are leaving. I could name, I mean, there are thousands of them. They're leaving, and they're leaving in bigger numbers than ever. And what you do is you say, fine, you want to go to Mexico or some other country, good luck. We wish you a lot of luck. But if you think you're going to make your air conditioners or your cars or your cookies or whatever you make and bring them into our country without a tax, you're wrong. And once you say you're going to have to tax them coming in, and our politicians never do this, because they have special interests and the special interests want those companies to leave, because in many cases, they own the companies. So what I 'm saying is, we can stop them from leaving. We have to stop them from leaving. And that's a big, big factor.","Back to the question, though. How do you bring back - - specifically bring back jobs, American manufacturers? How do you make them bring the jobs back?","Secretary Clinton and others, politicians, should have been doing this for years, not right now, because of the fact that we've created a movement. They should have been doing this for years. What's happened to our jobs and our country and our economy generally is - - look, we owe $20 trillion. We can not do it any longer, Lester.","Let me interrupt just a moment, but...","Well, for one thing - - and before we start on that - - my father gave me a very small loan in 1975, and I built it into a company that's worth many, many billions of dollars, with some of the greatest assets in the world, and I say that only because that's the kind of thinking that our country needs. Our country's in deep trouble. We do n't know what we're doing when it comes to devaluations and all of these countries all over the world, especially China. They're the best, the best ever at it. What they're doing to us is a very, very sad thing. So we have to do that. We have to renegotiate our trade deals. 
And, Lester, they're taking our jobs, they're giving incentives, they're doing things that, frankly, we do n't do. Let me give you the example of Mexico. They have a VAT tax. We're on a different system. When we sell into Mexico, there","Let me follow up with Mr. Trump, if you can. You've talked about creating 25 million jobs, and you've promised to bring back millions of jobs for Americans. How are you going to bring back the industries that have left this country for cheaper labor overseas? How, specifically, are you going to tell American manufacturers that you have to come back?"


### Let's split the data into train and validation sets

In [15]:
trn_df, val_df = train_test_split(df, test_size = 0.1)
trn_df.head()

Unnamed: 0,Trump_said,context,context/0,context/1,context/2,context/3
789,"No, I do n't know that.",You do n't know that? Okay.,I have no idea. I know nothing about them.,But there's not a Satanic pedophile cult being run by -,... and I agree with it very strongly.,Okay.
478,But why was he given tens of millions of dollars?,"My son like a lot of people at home had a drug problem. He's overtaking it. He's fixed it. He's worked on it. And I 'm proud of him, I' m proud of my son.",He made a fortune and he did n't have a job.,That is not true.,"Once you became vice president he made a fortune in Ukraine, in China, in Moscow and various other places.",None of that is true.
715,That's a big statement.,Here's the deal -,"Oh, I see. Okay.","Because the oil industry pollutes, significantly.",Why would you do that?,Because I would stop.
404,I'll fire them.,"Well, I'll give you the list of the people who -",I'd like to know who they are.,I did it honorably.,"Oh, really?",... testified under oath in his administration said I did my job and I did it very well.
159,It's nice to - - one on three.,Ken Karpowicz has a question.,"No, it has n't. It has n't. And it has n't been finished at all.",We brought up the e - mails.,"I'd like to know, Anderson, why are n't you bringing up the e - mails? I'd like to know. Why are n't you bringing...",We have a question here from Ken Karpowicz. He has a question about health care. Ken?


In [16]:
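def construct_conv(row, tokenizer, eos = True):
    # Encode every turn, append the EOS token, and reverse the order so that
    # the oldest context comes first and the Trump_said reply comes last.
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    # Flatten the list of turns into a single list of token ids.
    return [item for sublist in conv for item in sublist]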

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
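
To see what `construct_conv` produces, here is a minimal sketch with a toy stand-in tokenizer. The `ToyTokenizer` class and its word-level vocabulary are hypothetical, used only so the example runs without downloading a model. It assumes the row is ordered as in the table above: the reply first, then progressively older context.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Ids start at 1 so that 0 stays reserved for EOS.
        return [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]


def construct_conv(row, tokenizer):
    # Same idea as the notebook's function: encode each turn, append EOS,
    # reverse the turn order, then flatten into one list of token ids.
    turns = [tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]
    return [tok for turn in reversed(turns) for tok in turn]


tok = ToyTokenizer()
row = ["fine thanks", "how are you", "hello"]  # reply, latest context, older context
conv = construct_conv(row, tok)
print(conv)  # [6, 0, 3, 4, 5, 0, 1, 2, 0]
```

The oldest turn (`hello`) ends up first and the reply last, with an EOS id closing every turn, which is the layout DialoGPT expects for a multi-turn dialogue.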

In [17]:
# Caching and storing of data/checkpoints

def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
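
One detail worth illustrating: `_sorted_checkpoints` extracts the step number and sorts numerically, because a plain string sort would put `checkpoint-1000` before `checkpoint-500`. A small self-contained sketch of that idea (the directory names and the limit of 2 are made up for the example):

```python
import re

paths = ["output/checkpoint-500", "output/checkpoint-1000", "output/checkpoint-1500"]


def sort_by_step(paths, prefix="checkpoint"):
    # Pair each path with its integer step so the sort is numeric, not lexicographic.
    keyed = []
    for p in paths:
        m = re.match(r".*{}-([0-9]+)".format(prefix), p)
        if m:
            keyed.append((int(m.group(1)), p))
    return [p for _, p in sorted(keyed)]


ordered = sort_by_step(paths)
save_total_limit = 2  # assumed value for illustration
# Everything except the newest `save_total_limit` checkpoints would be deleted.
to_delete = ordered[: max(0, len(ordered) - save_total_limit)]

print(sorted(paths)[0])  # output/checkpoint-1000 (string sort gets the order wrong)
print(ordered[0])        # output/checkpoint-500
print(to_delete)         # ['output/checkpoint-500']
```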

# Training and Evaluating
Now that we have THE DATA, we can finally create our model and start training it! The training and evaluation loops are quite simple. We simply take a batch of examples from our dataloader and use it as both our inputs and our labels. We do this because GPT-2 is an auto-regressive model, meaning it uses some context to predict the next token. This prediction is then added to the original context and fed back in as the new context for generating the next token.
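
Why can the same batch serve as both inputs and labels? In the Hugging Face causal language models the labels are shifted by one position inside the model, so the prediction made at position i is scored against the token at position i+1. A minimal sketch of that pairing in plain Python, with made-up token ids:

```python
tokens = [11, 22, 33, 44]  # hypothetical token ids for one conversation

# For every position: the prefix the model sees, and the next token it must predict.
pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

for context, target in pairs:
    print(context, "->", target)
# [11] -> 22
# [11, 22] -> 33
# [11, 22, 33] -> 44
```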

To evaluate our model, we use perplexity, which is a simple but powerful metric. Perplexity measures how unsure the model is in its choice of the next token: the more unsure the model, the higher its perplexity. Concretely, it is the exponential of the average cross-entropy loss. One fascinating thing about perplexity is that it correlates very well with human judgments of how sensible and specific a conversation is, which was shown in the amazing paper ["Towards a Human-like Open-Domain Chatbot"](https://arxiv.org/abs/2001.09977) by Daniel Adiwardana et al.
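
As a quick illustration of how perplexity relates to the loss, the sketch below averages a few per-batch loss values and exponentiates the result, which is the same calculation the `evaluate` function performs. The loss values themselves are invented for the example.

```python
import math

batch_losses = [1.2, 1.6, 2.0]  # hypothetical per-batch mean cross-entropy values (in nats)
mean_loss = sum(batch_losses) / len(batch_losses)
perplexity = math.exp(mean_loss)
print(round(perplexity, 3))

# Sanity check: a model that spreads its probability evenly over V tokens
# has a loss of log(V) and therefore a perplexity of V.
V = 4
uniform_perplexity = math.exp(math.log(V))
print(round(uniform_perplexity, 6))
```

Read this way, a perplexity of about 5 means the model is, on average, as unsure as if it were choosing uniformly among roughly five tokens.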

In [18]:
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)


    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # set global_step to global_step of last saved checkpoint from model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    # set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
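                        # NOTE: evaluate() as defined below also requires df_trn and df_val, which are not
                        # in scope inside train(); this call fails if evaluate_during_training is enabled.
                        results = evaluate(args, model, tokenizer)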
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step

# Evaluation of some model

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluate on the validation split and write the results to args.output_dir
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
    )

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
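
The training loop pairs AdamW with `get_linear_schedule_with_warmup`. As a rough sketch of the shape of that schedule (a re-derivation in plain Python, not the library code): the learning-rate multiplier rises linearly from 0 to 1 over `warmup_steps`, then decays linearly to 0 at the final step. The base learning rate and the warmup length below are assumed values for illustration, not taken from the notebook's `Args`.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-5       # assumed value
warmup_steps = 100   # assumed value
total_steps = 621    # total optimization steps reported by this notebook's training run

for step in (0, 50, 100, 360, 621):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps, total_steps))
```

With `warmup_steps = 0` the schedule degenerates to a pure linear decay starting at the full learning rate.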

In [19]:
# Main runner

def main(df_trn, df_val):
    args = Args()
    
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup CUDA, GPU & distributed training
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    args.n_gpu = torch.cuda.device_count()
    args.device = device

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    # set_seed(args)

    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    
    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Create output directory if needed
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results

#### Run the code

In [20]:
main(trn_df, val_df)

12/26/2020 10:59:08 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x7fb81da656a0>
12/26/2020 10:59:08 - INFO - __main__ -   Creating features from dataset file at cached
12/26/2020 10:59:09 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
12/26/2020 10:59:09 - INFO - __main__ -   ***** Running training *****
12/26/2020 10:59:09 - INFO - __main__ -     Num examples = 830
12/26/2020 10:59:09 - INFO - __main__ -     Num Epochs = 3
12/26/2020 10:59:09 - INFO - __main__ -     Instantaneous batch size per GPU = 4
12/26/2020 10:59:09 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 4
12/26/2020 10:59:09 - INFO - __main__ -     Gradient Accumulation steps = 1
12/26/2020 10:59:09 - INFO - __main__ -     Total optimization steps = 621



12/26/2020 15:47:47 - INFO - __main__ -    global_step = 621, average loss = 1.9892467788260149
12/26/2020 15:47:47 - INFO - __main__ -   Saving model checkpoint to output
12/26/2020 15:47:53 - INFO - __main__ -   Evaluate the following checkpoints: ['output']
12/26/2020 15:47:56 - INFO - __main__ -   Creating features from dataset file at cached
12/26/2020 15:47:56 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
12/26/2020 15:47:56 - INFO - __main__ -   ***** Running evaluation  *****
12/26/2020 15:47:56 - INFO - __main__ -     Num examples = 93
12/26/2020 15:47:56 - INFO - __main__ -     Batch size = 4



12/26/2020 15:50:32 - INFO - __main__ -   ***** Eval results  *****
12/26/2020 15:50:32 - INFO - __main__ -     perplexity = tensor(4.9978)





{'perplexity_': tensor(4.9978)}
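
The step count in the log can be reproduced by hand. Both dataloaders use `drop_last=True`, so the incomplete final batch is discarded and the number of batches is an integer division. A short check using the values printed in the log:

```python
num_train_examples = 830
num_eval_examples = 93
batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 3

batches_per_epoch = num_train_examples // batch_size  # drop_last=True
t_total = batches_per_epoch // gradient_accumulation_steps * num_train_epochs
eval_batches = num_eval_examples // batch_size        # drop_last=True

print(batches_per_epoch, t_total, eval_batches)  # 207 621 23
```

Note that `drop_last=True` on the evaluation loader means one of the 93 validation examples is never scored. The reported perplexity of 4.9978 corresponds to an average evaluation loss of about 1.61, since perplexity is the exponential of that loss.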

# Chatting with our Model

Now that we have our model trained, let's take it out for a spin and have our first conversation with Donald Trump!

The below code is copied pretty much verbatim from the creators of the DialoGPT model, which you can find [here](https://huggingface.co/microsoft/DialoGPT-small).

We will also compare the results with a few other model variants.

In [41]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained(args.output_dir)

# Let's chat for 6 lines
for step in range(6):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User: ") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total length (history + reply) to 300 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=300,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=100, 
        top_p=0.7,
        temperature = 0.8
    )
    
    # pretty print the last output tokens from the bot
    print("Trump: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User: How are you, Mr. Trump?
Trump: I'm not.
>> User: Are you planning to start a nuclear war with North Korea?
Trump: Okay.
>> User: Are you planning to continue the economic war with China?
Trump: !!!I' m not going to continue that economic war.
>> User: Have you been supported by Russia during the election?
Trump: No, I have not been supported, by Russia.
>> User: How do you feel about the Black Life Matters?
Trump: That's a good question.
>> User: How are you planning to overcome Covid-19?
Trump: Well, let me ask you this.
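
The `generate` call above samples with `temperature`, `top_k` and `top_p`. As a rough illustration of what these knobs do (a simplified sketch in plain Python, not the library's implementation), the function below filters a toy next-token distribution. The tokens, the logits and the choice of `top_k=3` are invented so the effect is visible on a four-word vocabulary.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature scaling, top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax; values < 1 sharpen the distribution.
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = sorted(((math.exp(v) / z, t) for t, v in scaled.items()), reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalise.
    if top_k > 0:
        probs = probs[:top_k]
        total = sum(p for p, _ in probs)
        probs = [(p / total, t) for p, t in probs]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, t in probs:
        kept.append((p, t))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise the survivors so they form a distribution to sample from.
    total = sum(p for p, _ in kept)
    return {t: p / total for p, t in kept}


logits = {"yes": 1.0, "no": 0.8, "maybe": 0.5, "never": -1.0}  # hypothetical values
dist = filter_distribution(logits, temperature=0.8, top_k=3, top_p=0.7)
print(sorted(dist))  # ['no', 'yes']

# Sampling the next token from what is left:
next_token = random.choices(list(dist), weights=list(dist.values()))[0]
```

Lower `top_p` or `temperature` makes the replies more predictable; higher values make them more varied but also more likely to drift off topic.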


In [32]:
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
model = AutoModelWithLMHead.from_pretrained(args.output_dir)

# Let's chat for 6 lines
for step in range(6):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total length (history + reply) to 300 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=300,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=100, 
        top_p=0.7,
        temperature = 0.8
    )
    
    # pretty print the last output tokens from the bot
    print("Trump: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User:Good evening!
Trump: It's been a very tough week.
>> User:Are you plan to start the war with North Korea
Trump: !!!The war with China is going to end in a matter of weeks.
>> User:Are you planning to continue the economic war with China?
Trump: !The economic war is going on with China.
>> User:Have you been supported by Russia during the election?
Trump: I have been supporting China. And you know what? I have been supported for the last two years, and I have supported Russia, for a long time.
>> User:How do you feel about the Black Life Matters?
Trump: And I have to say, you have to be a leader.
>> User:How are you planning to overcome Covid-19?
Trump: All right, gentlemen.


### Let's try another model, trained without text normalization and without a token limit (perplexity = 4.5344)


In [45]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained("output_full_trump")

# Let's chat for 6 lines
for step in range(6):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User: ") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total length (history + reply) to 300 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=300,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=100, 
        top_p=0.7,
        temperature = 0.8
    )
    
    # pretty print the last output tokens from the bot
    print("Trump: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User: Good evening!
Trump: I want to say it again.
>> User: Are you planning to start a nuclear war with North Korea?
Trump: Are they planning to launch a nuclear strike on the United States?
>> User: Are you planning to continue the economic war with China?
Trump: !!!Wrong!!
>> User: Have you been supported by Russia during the election?
Trump: What are you going to do to stop the economic depression that's coming in from China?!!You're going to have a big economy.!!We’re going to be in a big depression.!We are going to make China pay for it.!
>> User: How do you feel about the Black Life Matters?
Trump: That’s not what it’ll be.
>> User: How are you planning to overcome Covid-19?
Trump: I’m not going to give you a chance.


In [None]:
# Good evening!
# Are you planning to start a nuclear war with North Korea?
# Are you planning to continue the economic war with China?
# Have you been supported by Russia during the election?
# How do you feel about the Black Life Matters?
# How are you planning to overcome Covid-19?

### Let's try another model, trained with normalized, lowercased text and a limit of 100 tokens (perplexity = 6.5417)

In [62]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained("output_100_lower")

# Let's chat for 6 lines
for step in range(6):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User: ") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total length (history + reply) to 200 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=200,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=100, 
        top_p=0.7,
        temperature = 0.8
    )
    
    # pretty print the last output tokens from the bot
    print("Trump: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User: Good evening!
Trump: good evening.
>> User: Are you planning to start a nuclear war with North Korea?
Trump: i think i have to ask you a question.
>> User: Are you planning to continue the economic war with China?
Trump: it's a good question. we're going to continue to be in a very good position. we are going to be on the very expensive side. we have to be able to provide for the people who depend on us. we will be able, but we will have to provide some very good support. we can be on our side.
>> User: Have you been supported by Russia during the election?
Trump: yes.
>> User: How do you feel about the Black Life Matters?
Trump: !!!excuse me, sir.
>> User:  How are you planning to overcome Covid-19?
Trump: excuse.



Now, it ain't the best; however, training for longer or using DialoGPT-medium instead of DialoGPT-small does improve results.

# Conclusion and Next Steps


The models are not perfect. How can we improve quality?
* Improve the dataset and fix bugs. Shorten long utterances, for example by summarizing them or by removing noisy sentences and keeping only the important ones and the questions.
* Increase the amount of data: tweets, Trump's speeches, and Wikipedia, plus a question-generation model (https://github.com/patil-suraj/question_generation) to produce questions for the text. Example: Additional_dataset.ipynb.
* Use bigger models: DialoGPT-medium https://huggingface.co/microsoft/DialoGPT-medium or DialoGPT-large https://huggingface.co/microsoft/DialoGPT-large.
* Train for longer and experiment with the training parameters.
* Use a bot to generate conversations and then add them to the training set.
