<a href="https://colab.research.google.com/github/apschlissel/w266-final-project/blob/main/T5_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T5 Translation Model

Summary:
* Replace slang in Reddit posts with plain-English (de-slanged) text
* Manually check the replaced text to make sure the posts still make sense
* Train a T5 model on the checked, de-slanged text

In [1]:
!pip install -q transformers

In [2]:
!pip install simpletransformers



In [9]:
from __future__ import print_function
import ipywidgets as widgets
from transformers import pipeline
from simpletransformers.t5 import T5Model, T5Args
import pandas as pd
import logging
import numpy as np
import torch
from tqdm.notebook import tqdm
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
from transformers import BertForSequenceClassification
import json
import re
import random
import math
from bs4 import BeautifulSoup
# Pull reddit data from reddit api
import requests
pd.options.display.max_colwidth = 1000
pd.set_option('display.max_rows', 100)

## Pull Reddit Data

In [4]:
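import os

# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'token'
# Credentials are read from environment variables instead of being hardcoded in the notebook;
# the variable names below are illustrative and must be set before running this cell.
auth = requests.auth.HTTPBasicAuth(os.environ['REDDIT_CLIENT_ID'], os.environ['REDDIT_SECRET_TOKEN'])

# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': os.environ['REDDIT_USERNAME'],
        'password': os.environ['REDDIT_PASSWORD']}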

# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'MyBot/0.0.1'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# while the token is valid (~2 hours) we just add headers=headers to our requests
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

<Response [200]>

In [7]:
# HTML web scraper, scrape top subreddits, SFW only.
# 
# Source: https://realpython.com/beautiful-soup-web-scraper-python/
# Source: https://stackoverflow.com/questions/40210093/how-do-i-scrape-only-div-class-quotetext-from-a-website-using-python


URL = "http://redditlist.com/sfw/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
# print(soup.prettify())
job_elements = soup.find_all("div", class_="listing-item")

# print(job_elements[0])

# Keep listing items 125-249, the block that holds the top 125 most-subscribed subreddits
job_elements = job_elements[125:250]

master_subreddit_list = []
# Find subreddit names
for job_element in job_elements:
  links = job_element.find_all("a")
  for link in links:
    print(link.text.strip())
    master_subreddit_list.append(link.text.strip())

announcements
funny
AskReddit
gaming
aww
Music
pics
worldnews
movies
science
todayilearned
videos
news
Showerthoughts
Jokes
food
askscience
IAmA
EarthPorn
gifs
nottheonion
books
DIY
explainlikeimfive
Art
LifeProTips
space
sports
mildlyinteresting
Documentaries
gadgets
memes
tifu
photoshopbattles
UpliftingNews
GetMotivated
dataisbeautiful
listentothis
history
philosophy
television
InternetIsBeautiful
Futurology
WritingPrompts
OldSchoolCool
personalfinance
nosleep
creepy
TwoXChromosomes
wallstreetbets
technology
wholesomememes
AdviceAnimals
interestingasfuck
Fitness
politics
WTF
lifehacks
oddlysatisfying
relationship_advice
NatureIsFuckingLit
Minecraft
travel
facepalm
Whatcouldgowrong
nextfuckinglevel
pcmasterrace
leagueoflegends
BlackPeopleTwitter
me_irl
Unexpected
dankmemes
bestof
dadjokes
buildapc
Tinder
PS4
MadeMeSmile
AnimalsBeingBros
Damnthatsinteresting
tattoos
CryptoCurrency
AnimalsBeingJerks
photography
nba
AnimalsBeingDerps
gardening
BikiniBottomTwitter
trippinthroughtime
Watch

In [10]:
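# Pick 5 distinct random subreddits (random.sample draws without replacement,
# whereas random.choices could return the same subreddit twice)
five_random_subreddits = random.sample(master_subreddit_list, k=5)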
print(five_random_subreddits)

['MadeMeSmile', 'PewdiepieSubmissions', 'photography', 'leagueoflegends', 'reactiongifs']
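
When the five subreddits are meant to act as five distinct classes, it matters whether the draw is made with or without replacement. The following is a minimal sketch of that difference; the `subreddits` list is a hypothetical stand-in for `master_subreddit_list`, and the seed is there only to make the sketch repeatable.

```python
import random

# Hypothetical stand-in for master_subreddit_list
subreddits = ["funny", "AskReddit", "gaming", "aww", "Music", "pics"]

rng = random.Random(0)  # seeded only so the sketch is repeatable

# random.choices samples WITH replacement: the same name may appear more than once
with_replacement = rng.choices(subreddits, k=5)

# random.sample samples WITHOUT replacement: all five names are distinct
without_replacement = rng.sample(subreddits, k=5)

print(with_replacement)
print(without_replacement)
```

With only a handful of candidates, duplicates from `random.choices` are fairly likely, which would silently reduce the number of classes.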


In [11]:
# Source: https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c
# Source: https://pynative.com/python-random-choice/
# Pull from 5 classes. 5 classes = 5 subreddits.

my_list_of_dictionaries = []
total = 0
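# Target: 100 validation posts per subreddit. With a 20% validation split that means
# 100 / 0.20 = 500 posts per subreddit, fetched 25 per page, so n = 20 pages.
n = int(math.ceil(100/0.20/25))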

# url_list_check = [f"https://oauth.reddit.com/r/{five_random_subreddits[0]}/new/",
#           f"https://oauth.reddit.com/r/{five_random_subreddits[1]}/new/",
#           f"https://oauth.reddit.com/r/{five_random_subreddits[2]}/new/",
#           f"https://oauth.reddit.com/r/{five_random_subreddits[3]}/new/",
#           f"https://oauth.reddit.com/r/{five_random_subreddits[4]}/new/"
#           ]

url_list_check = [f"https://oauth.reddit.com/r/wallstreetbets/new/",
          f"https://oauth.reddit.com/r/teenagers/new/",
          f"https://oauth.reddit.com/r/copypasta/new/",
          f"https://oauth.reddit.com/r/genz/new/",
          f"https://oauth.reddit.com/r/unpopularopinion/new/",
          # f"https://oauth.reddit.com/r/frat/new/"
          ]

for i in range(len(url_list_check)):
    
  # print(url_list_check[i])
  res_check = requests.get(url_list_check[i],
                    headers=headers,
                    params={"limit": "1"})
  
  # print(res_check)
  # print(json.dumps(res_check.json()["data"]["children"][0]["data"]["name"], indent=4))
  name = res_check.json()["data"]["children"][0]["data"]["name"]
  page_count = 25
  
  for j in range(n):

    # url_list = [f"https://oauth.reddit.com/r/{five_random_subreddits[0]}/new/?count={page_count}&after={name}",
    #         f"https://oauth.reddit.com/r/{five_random_subreddits[1]}/new/?count={page_count}&after={name}",
    #         f"https://oauth.reddit.com/r/{five_random_subreddits[2]}/new/?count={page_count}&after={name}",
    #         f"https://oauth.reddit.com/r/{five_random_subreddits[3]}/new/?count={page_count}&after={name}",
    #         f"https://oauth.reddit.com/r/{five_random_subreddits[4]}/new/?count={page_count}&after={name}"
    #         ]

    url_list = [f"https://oauth.reddit.com/r/wallstreetbets/new/?count={page_count}&after={name}",
            f"https://oauth.reddit.com/r/teenagers/new/?count={page_count}&after={name}",
            f"https://oauth.reddit.com/r/copypasta/new/?count={page_count}&after={name}",
            f"https://oauth.reddit.com/r/genz/new/?count={page_count}&after={name}",
            f"https://oauth.reddit.com/r/unpopularopinion/new/?count={page_count}&after={name}",
            # f"https://oauth.reddit.com/r/frat/new/?count={page_count}&after={name}"
            ]
    
    print("Page Count:", page_count)
    print("Name:", name)
    print("Url:", url_list[i])
    
    res = requests.get(url_list[i],
                    headers=headers)
                    # params={"limit": "100"})

    reddit_dictionary = res.json()

    for k in range(len(reddit_dictionary["data"]["children"])):
      my_dictionary = {}
      my_dictionary["subreddit"] = reddit_dictionary["data"]["children"][k]["data"]["subreddit"]
      my_dictionary["text"] = reddit_dictionary["data"]["children"][k]["data"]["selftext"]
      # If a reddit post has no body text:
      if my_dictionary["text"] == "":
        # Replace with title of reddit post.
        my_dictionary["text"] = reddit_dictionary["data"]["children"][k]["data"]["title"]
      
      print(my_dictionary["text"])
      # my_dictionary["title"] = reddit_dictionary["data"]["children"][k]["data"]["title"]
      # my_dictionary["url"] = reddit_dictionary["data"]["children"][k]["data"]["url"]
      # print(reddit_dictionary["data"]["children"][k]["data"]["subreddit"])
      # print(reddit_dictionary["data"]["children"][k]["data"]["selftext"])
      # print(reddit_dictionary["data"]["children"][k]["data"]["url"])
      my_list_of_dictionaries.append(my_dictionary)
      total += 1
      name = reddit_dictionary["data"]["children"][k]["data"]["name"]
    # print(json.dumps(my_list_of_dictionaries, indent=4, sort_keys=False))
    
    page_count += 25

print("Total gathered:", total)

Streaming output truncated to the last 5000 lines.
One day when I was 16, I was in my math class when I heard a terrible noise, it reminded me of the first time I was beating my meat, but it turns out it was just a gunshot. Kids all around me where shouting that there was a school shooter. I immediately got up and ran to the door, carrying my pack with my hands. Suddenly I heard shots very close to my ears, and saw dead bodies on the ground, so I ran into the closest door which was the janitor's room, closed the door and hid under the table making as little noise as possible. Little did I know, my crush was under the same table, hiding there in fear, when she saw me she almost screamed, but I put my hand over her mouth and told her to be quiet. I heard the shooter opening the door slowly and looking for me. I was completely silent, but then I noticed my crush's incredible bajongas, and I felt my cock starting to throb and expand. She seemed to notice, and I could see that
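
The paging logic in the cell above can be sketched on its own, without network access. This is a simplified illustration, not the notebook's code: `fetch_page` is a hypothetical callable standing in for the `requests.get` call, and `POSTS` is a fake in-memory listing. It shows how the page count is derived from the validation target and how the `after` cursor is carried from one page to the next.

```python
import math

def pages_needed(target_val_per_class, val_fraction, page_size):
    """Pages to request so the validation split alone reaches the target."""
    total_posts = target_val_per_class / val_fraction
    return math.ceil(total_posts / page_size)

def fetch_all(fetch_page, first_cursor, n_pages):
    """Follow `after` cursors for up to n_pages.

    fetch_page is a hypothetical callable returning (items, next_cursor).
    """
    items, cursor = [], first_cursor
    for _ in range(n_pages):
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:  # listing exhausted
            break
    return items

# Fake listing of post names so the sketch runs offline
POSTS = [f"t3_{i:03d}" for i in range(60)]

def fake_fetch(after, page_size=25):
    start = 0 if after is None else POSTS.index(after) + 1
    page = POSTS[start:start + page_size]
    next_cursor = page[-1] if start + page_size < len(POSTS) else None
    return page, next_cursor

n = pages_needed(100, 0.20, 25)
collected = fetch_all(fake_fetch, None, n)
print(n, len(collected))
```

Stopping when the cursor runs out keeps the loop from re-requesting the same page once a subreddit has fewer posts than the target.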

In [13]:
# Source: https://github.com/susanli2016/NLP-with-Python/blob/master/Text_Classification_With_BERT.ipynb
# Convert list of dictionaries into pandas df
df = pd.DataFrame(my_list_of_dictionaries)
df.head()

Unnamed: 0,subreddit,text
0,wallstreetbets,Still can’t get a girlfriend. 🥲
1,wallstreetbets,Like taking candy from an ape
2,wallstreetbets,https://thehill.com/policy/finance/599807-irs-faces-steep-climb-in-clearing-old-tax-returns\n\nThe IRA is up to it's nipples in returns they haven't processed yet. So this is your chance to pocket those gains because the IRS is just trying to shovel work out the door. No one is going to bother to squeeze you for those gains because they don't have the time! \n\nThis isn't Financial advice but it IS an idea.
3,wallstreetbets,jealous of this plate
4,wallstreetbets,Crawling out of the trenches


In [14]:
df['subreddit'].value_counts()

wallstreetbets      500
teenagers           500
copypasta           500
GenZ                500
unpopularopinion    500
Name: subreddit, dtype: int64

In [15]:
possible_labels = df.subreddit.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'wallstreetbets': 0,
 'teenagers': 1,
 'copypasta': 2,
 'GenZ': 3,
 'unpopularopinion': 4}

In [16]:
df['label'] = df.subreddit.replace(label_dict)
df.head()

Unnamed: 0,subreddit,text,label
0,wallstreetbets,Still can’t get a girlfriend. 🥲,0
1,wallstreetbets,Like taking candy from an ape,0
2,wallstreetbets,https://thehill.com/policy/finance/599807-irs-faces-steep-climb-in-clearing-old-tax-returns\n\nThe IRA is up to it's nipples in returns they haven't processed yet. So this is your chance to pocket those gains because the IRS is just trying to shovel work out the door. No one is going to bother to squeeze you for those gains because they don't have the time! \n\nThis isn't Financial advice but it IS an idea.,0
3,wallstreetbets,jealous of this plate,0
4,wallstreetbets,Crawling out of the trenches,0


In [18]:
#combine title & text to make one column
#df['title_and_text'] = df['title'] + ' ' +  df['text']
#df['title_and_text'].head()

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.20, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

In [21]:
df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['subreddit', 'label', 'data_type']).count()

subreddit,label,data_type,text
GenZ,3,train,400
GenZ,3,val,100
copypasta,2,train,400
copypasta,2,val,100
teenagers,1,train,400
teenagers,1,val,100
unpopularopinion,4,train,400
unpopularopinion,4,val,100
wallstreetbets,0,train,400
wallstreetbets,0,val,100
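
To make the 400/100 counts per subreddit concrete, here is a small standard-library sketch of a stratified split. It is not scikit-learn's implementation and will not reproduce the same indices as `train_test_split`; it only illustrates that sampling the validation fraction inside each label group yields equal per-class counts. The function name and the synthetic `labels` list are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction, seed=42):
    """Return (train_idx, val_idx) with val_fraction of each label in val."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Synthetic labels: 5 classes with 500 posts each, as in the value counts above
labels = [label for label in range(5) for _ in range(500)]
train_idx, val_idx = stratified_split(labels, 0.20)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in val_idx))
```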


## Load Slangit Data

Slangit is a lookup table that maps each slang term directly to its plain-English meaning.

In [22]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [23]:
slang = pd.read_csv('/content/gdrive/MyDrive/w266/final_project/slangit.csv')
slang.head(20)

Unnamed: 0,Slang Term,Meaning
0,*$,Starbucks
1,*$$,Starbucks
2,2,Two cents
3,0773H,Hello
4,10m,Ten man
5,10q,Thank you
6,10x,Thanks
7,1174,Meet in person at
8,121,One to one
9,1337,Leet


In [24]:
slangit_dict = slang.set_index('Slang Term').to_dict()
slangit_dict = slangit_dict['Meaning']

In [25]:
keys_values = slangit_dict.items()
slangit_dict = {str(key): str(value) for key, value in keys_values}

In [26]:
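def slang_lookup(text, dictionary):
    """Replace every standalone slang term in `text` with its meaning from `dictionary`."""
    # Build the alternation from the dictionary passed in, not from the global slangit_dict
    pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in dictionary.keys()) + r')(?!\w)')
    result = pattern.sub(lambda x: dictionary[x.group()], text)

    return result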

In [27]:
my_text = 'I watched the UNC game at a bar b/c YOLO, FTW'

print(slang_lookup(my_text, slangit_dict))

I watched the UNC game at a bar Be/See You only live once, For the win
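
The output above turns `b/c` into `Be/See`. Because `/` is not a word character, the `(?<!\w)` and `(?!\w)` lookarounds treat `b` and `c` as standalone terms, and the same thing happens to an `s` that follows an apostrophe. The sketch below reproduces the effect with a hypothetical five-entry dictionary (the real Slangit entries may differ) and shows one possible tightening: try longer keys first and also treat apostrophes and slashes as "inside a word". It is an illustration of the boundary issue, not a drop-in replacement for `slang_lookup`.

```python
import re

# Hypothetical mini-dictionary; the real Slangit entries may differ
toy_dict = {
    "b": "Be",
    "c": "See",
    "s": "Sarcasm",
    "b/c": "because",
    "YOLO": "You only live once",
}

def loose_pattern(keys):
    # Same shape as slang_lookup: keys in dictionary order, \w-only boundaries
    return re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)")

def strict_pattern(keys):
    # Longest keys first; apostrophes and slashes also count as word-internal
    ordered = sorted(keys, key=len, reverse=True)
    inside = r"[\w'’/]"
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(rf"(?<!{inside})({alternation})(?!{inside})")

def replace(text, pattern, dictionary):
    return pattern.sub(lambda m: dictionary[m.group()], text)

text = "at a bar b/c YOLO, it's fine"
loose = replace(text, loose_pattern(toy_dict), toy_dict)
strict = replace(text, strict_pattern(toy_dict), toy_dict)
print(loose)   # at a bar Be/See You only live once, it'Sarcasm fine
print(strict)  # at a bar because You only live once, it's fine
```

The stricter boundaries trade recall for precision: a slang term written directly next to a slash or apostrophe would no longer be replaced.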


## Apply Slangit regex replace to reddit data

In [28]:
df_train = df[df.index.isin(X_train)]
len(df_train)

2000

In [30]:
# Take an explicit copy so that adding columns does not raise SettingWithCopyWarning
df_train = df_train.copy()
df_train['text_deslanged'] = df_train['text'].apply(lambda x: slang_lookup(x, slangit_dict))
df_train['text_deslanged'].head()

0    Still can’t get a girlfriend. 🥲
2    https://thehill.com/policy/finance/599807-irs-faces-steep-climb-in-clearing-old-tax-returns\n\nThe Inherited runs allowed is Underpowered to it'Sarcasm nipples in returns they haven't processed yet. So this is your chance to pocket those gains because the Internal Revenue Service is just trying to shovel work out the door. No one is going to bother to squeeze you for those gains because they don't have the time! \n\nThis isn't Financial advice but it I'm sorry an idea.
3                                         

In [31]:
df_train['text_deslanged'].head(30)

0    Still can’t get a girlfrie

In [32]:
# Flag rows whose text was left unchanged by the slang replacement
df_train['same'] = df_train['text'] == df_train['text_deslanged']
df_train['same'].head()


0     True
2    False
3     True
4     True
5     True
Name: same, dtype: bool

In [33]:
df_train['same'].value_counts()

False    1383
True      617
Name: same, dtype: int64

## Check the regex de-slanging and correct examples that were de-slanged incorrectly

In [34]:
deslanged = df_train.loc[~df_train['same'], ['text', 'text_deslanged']]

In [35]:
deslanged[:100]

Unnamed: 0,text,text_deslanged
2,https://thehill.com/policy/finance/599807-irs-faces-steep-climb-in-clearing-old-tax-returns\n\nThe IRA is up to it's nipples in returns they haven't processed yet. So this is your chance to pocket those gains because the IRS is just trying to shovel work out the door. No one is going to bother to squeeze you for those gains because they don't have the time! \n\nThis isn't Financial advice but it IS an idea.,https://thehill.com/policy/finance/599807-irs-faces-steep-climb-in-clearing-old-tax-returns\n\nThe Inherited runs allowed is Underpowered to it'Sarcasm nipples in returns they haven't processed yet. So this is your chance to pocket those gains because the Internal Revenue Service is just trying to shovel work out the door. No one is going to bother to squeeze you for those gains because they don't have the time! \n\nThis isn't Financial advice but it I'm sorry an idea.
7,The guy who YOLOed his daughter's college fund into TSLA puts,The guy who YOLOed his daughter'Sarcasm college fund into TSLA puts
8,"[🕵️‍♂️ I SPY, GME, TSLA, AMD, and NVDA - 3/28 Scalpers Delight](https://www.reddit.com/r/wallstreetbets/comments/tp4u70/i_spy_gme_tsla_amd_and_nvda_328_scalpers_delight/)\n\n# Economic Calendar - March 30, 2022\n\nhttps://preview.redd.it/ty9wopkuqeq81.png?width=1456&amp;format=png&amp;auto=webp&amp;s=00d7bcff3508338fcd9342e1dcb4f4cfdb0a78ea\n\n# SPY - March 30 - Technical Analysis\n\n* Bullish 🎯: 466.83 - 471.53 (needs to break 462.07)\n* Bearish 🎯: 457.43 - 456.04 - 451.38 (needs to break 460.61)\n\nhttps://preview.redd.it/aod848hyqeq81.png?width=1456&amp;format=png&amp;auto=webp&amp;s=ca82997ef029616d97f95d6325b07e385203453f\n\n* Neutral 440-449. Bullish 450+. Bearish at 439 and below. Want to see us close above 450 next week for a few days to be bullish.\n* Overbought on the 15 and 65 min. RSI so be mindful if we gap up tomorrow. If true, wait for the dip and observe buyers on volume around gap.\n* Key for bulls protect today’s gap 457.43 - 456.04. Like to see buyers at this lev...","[🕵️‍♂️ I SPY, GME, TSLA, AMD, and NVDA - 3/28 Scalpers Delight](https://www.reddit.com/Are/wallstreetbets/comments/tp4u70/i_spy_gme_tsla_amd_and_nvda_328_scalpers_delight/)\n\n# Economic Calendar - March 30, 2022\n\nhttps://preview.redd.it/ty9wopkuqeq81.png?width=1456&To be loud and angry;format=png&To be loud and angry;auto=webp&To be loud and angry;Sarcasm=00d7bcff3508338fcd9342e1dcb4f4cfdb0a78ea\n\n# SPY - March 30 - Technical Analysis\n\n* Positive outlook 🎯: 466.83 - 471.53 (needs to break 462.07)\n* Negative outlook 🎯: 457.43 - 456.04 - 451.38 (needs to break 460.61)\n\nhttps://preview.redd.it/aod848hyqeq81.png?width=1456&To be loud and angry;format=png&To be loud and angry;auto=webp&To be loud and angry;Sarcasm=ca82997ef029616d97f95d6325b07e385203453f\n\n* Neutral 440-449. Positive outlook 450+. Negative outlook at 439 and below. Want to see Ultrasound close above 450 next week for a few days to be bullish.\n* Overbought on the 15 and 65 Minute. 
RSI so be mindful if Whatever..."
14,https://www.bloomberg.com/opinion/articles/2022-03-29/is-a-recession-coming-the-fed-has-made-it-inevitable?sref=uN6cur8D\n\nMay be time to poke the 🌈🐻s awake again.,https://www.bloomberg.com/opinion/articles/2022-03-29/is-a-recession-coming-the-fed-has-made-it-inevitable?sref=uN6cur8D\n\nMay be time to poke the 🌈🐻Sarcasm awake again.
15,BBBY ~25K GAINZ...BEEN A GOOD YEAR,BBBY ~25K GAINZ...BEEN Assists GOOD YEAR
16,Picking up $GME is like investing in a dot com startup in the 90s. Fundamentally growing by the day.. 🐂ish AF still.,Picking Underpowered $GME is like investing in a dot com startup in the 90s. Fundamentally growing by the day.. 🐂ish As f*** still.
17,"When I go to invest in the shares of a company, I think of myself as buying a part of that company. I ask myself ""would I want to own this company?"". Often I say ""hell yeah!"" who wouldn't want to own a company that is printing money? (like oil companies and banks and steel companies) and who wouldn't want to own a company that has truly disruptive technology that is going to change the world? (like Tesla and Amazon and Microsoft and Apple did) But then there are other times when I say ""not in a million years"" Why would I want to own a company that is bleeding cash and really has no prospects of ever turning a profit? One where the only way that I'm going to make money is to sell my shares to some sucker. When I find companies in the first category I buy their stock. When I find companies in the latter category I buy their puts. I am long in 60 different companies. Some of them are old school (value) some of them are new school (growth). I want to talk about the ones that ...","When I go to invest in the shares of a company, I think of myself as buying a part of that company. I ask myself ""would I want to own this company?"". Often I say ""hell yeah!"" who wouldn't want to own a company that is printing money? (like oil companies and banks and steel companies) and who wouldn't want to own a company that has truly disruptive technology that is going to change the world? (like Tesla and Amazon and Microsoft and Apple did) But then there are other times when I say ""not in a million years"" Why would I want to own a company that is bleeding cash and really has Know prospects of ever turning a profit? One where the only way that I'Male going to make money is to sell my shares to some sucker. When I find companies in the first category I buy their stock. When I find companies in the latter category I buy their puts. I am long in 60 different companies. Some of them are old school (value) some of them are new school (growth). I want to talk about the ones ..."
18,Fomoing at 9:30 isn’t feeling too good right now mr. Stark…,Fomoing at Parent in room:30 isn’t feeling too good right now mr. Stark…
22,Just when we though insulting people's hair hit a low point this week.,Just when Whatever though insulting people'Sarcasm hair hit a low point this week.
23,Mar 8th was my 1 year anniversary. Let's see how I did . . . 🤡🎺,Mar 8th was my 1 year anniversary. Let'Sarcasm see how I did . . . 🤡🎺
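
Several of the bad replacements in the table above occur inside URLs (for example `reddit.com/r/` becoming `reddit.com/Are/`). One way to cut down the manual corrections is to apply the replacement only to the non-URL parts of a post. The sketch below assumes a simple `https?://\S+` URL pattern and uses `toy_deslang`, a hypothetical one-entry stand-in for `slang_lookup`; both are illustrative choices rather than part of the original pipeline.

```python
import re

# Simple URL matcher (an assumption: anything from http(s):// up to whitespace)
URL_RE = re.compile(r"https?://\S+")

def apply_outside_urls(text, transform):
    """Apply transform to the non-URL parts of text and copy URLs verbatim."""
    pieces, last = [], 0
    for match in URL_RE.finditer(text):
        pieces.append(transform(text[last:match.start()]))
        pieces.append(match.group())  # URL left untouched
        last = match.end()
    pieces.append(transform(text[last:]))
    return "".join(pieces)

def toy_deslang(segment):
    # Hypothetical stand-in for slang_lookup with the single entry r -> Are
    return re.sub(r"(?<!\w)r(?!\w)", "Are", segment)

post = "r u in? see https://www.reddit.com/r/wallstreetbets/ for details"
unprotected = toy_deslang(post)
protected = apply_outside_urls(post, toy_deslang)
print(unprotected)
print(protected)
```

In the notebook this would be called as `apply_outside_urls(x, lambda s: slang_lookup(s, slangit_dict))`, leaving the remaining non-URL errors for the manual check.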


## T5 Translation

In [26]:
#model example from: https://simpletransformers.ai/docs/t5-data-formats/

import logging

import pandas as pd
from simpletransformers.t5 import T5Model, T5Args

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)


train_data = [
    ["binary classification", "Anakin was Luke's father" , "1"],
    ["binary classification", "Luke was a Sith Lord" , "0"],
    ["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
    ["generate question", "Anakin was Luke's father" , "Who was Luke's father?"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["prefix", "input_text", "target_text"]

eval_data = [
    ["binary classification", "Leia was Luke's sister" , "1"],
    ["binary classification", "Han was a Sith Lord" , "0"],
    ["generate question", "In 2020, the Star Wars franchise's total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.", "What is the total value of the Star Wars franchise?"],
    ["generate question", "Leia was Luke's sister" , "Who was Luke's sister?"],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["prefix", "input_text", "target_text"]

model_args = T5Args()
model_args.num_train_epochs = 200
model_args.no_save = True
model_args.evaluate_generated_text = True
model_args.evaluate_during_training = True
model_args.evaluate_during_training_verbose = True

model = T5Model("t5", "t5-base", args=model_args, use_cuda=False)


def count_matches(labels, preds):
    print(labels)
    print(preds)
    return sum([1 if label == pred else 0 for label, pred in zip(labels, preds)])


model.train_model(train_df, eval_data=eval_df, matches=count_matches)

print(model.eval_model(eval_df, matches=count_matches))

ValueError: ignored

In [37]:
import logging

import pandas as pd
from simpletransformers.t5 import T5Model, T5Args

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)


# Copy each slice so that adding columns does not raise SettingWithCopyWarning
train_data = df_train[['text', 'text_deslanged']].copy()
train_data['prefix'] = 'translate'
train_df = train_data[['prefix', 'text', 'text_deslanged']]
train_df = train_df.rename(columns={'text': 'input_text', 'text_deslanged': 'target_text'})

eval_data = df[df.index.isin(X_val)].copy()
eval_data['text_deslanged'] = eval_data['text'].apply(lambda x: slang_lookup(x, slangit_dict))
eval_data = eval_data[['text', 'text_deslanged']].copy()
eval_data['prefix'] = 'translate'
eval_df = eval_data[['prefix', 'text', 'text_deslanged']]
eval_df = eval_df.rename(columns={'text': 'input_text', 'text_deslanged': 'target_text'})

In [38]:
eval_df.head()

Unnamed: 0,prefix,input_text,target_text
1,translate,Like taking candy from an ape,Like taking candy from an ape
13,translate,ᴛᴏᴍᴏʀʀᴏᴡ ᴡᴇ ʀɪᴅᴇ ᴏᴜᴛ ᴛᴏ ꜱᴍᴀꜱʜ ᴛʜᴇ ʀᴇꜱɪꜱᴛᴀɴᴄᴇ ᴀᴛ ᴛʜᴇ ᴄʀᴀᴄᴋ ᴏꜰ ᴅᴀᴡɴ,ᴛᴏᴍᴏʀʀᴏᴡ ᴡᴇ ʀɪᴅᴇ ᴏᴜᴛ ᴛᴏ ꜱᴍᴀꜱʜ ᴛʜᴇ ʀᴇꜱɪꜱᴛᴀɴᴄᴇ ᴀᴛ ᴛʜᴇ ᴄʀᴀᴄᴋ ᴏꜰ ᴅᴀᴡɴ
20,translate,"I have a 186 cost basis. A few months ago every “expert” in the land was suckling at the PYPL teet. Then In a matter of a few weeks they all turned tail. In 2019 the full year EPS was 2.07 and the stock was around 120ish. Now they’re projecting high 4s EPS and the stock is at 120…wtf? So they lost eBay and the market treated it like it was gonna go BK because of that, who the fuck uses eBay anyway anymore. They gained Amazon with Venmo and the market barely gave 2 shits about that…AMAZON! \n\nI get their growth has slowed down due to less free allowance money from Powell out there, but that isn’t a PayPal story that’s a story every company almost. \n\nSo the stock has been destroyed thrown out to die. Finally starting to get some back but with a 186 cost basis im still down 30+ % \n\nWhat would an ape do here? Add more? Cut and run? Wait it out? \n\nIm not in TO deep but deeper than I wanna be so somewhat hesitant to add more","I have a 186 cost basis. Assists few months ago every “expert” in the land was suckling at the PYPL teet. Then In a matter of a few weeks they all turned tail. In 2019 the full year Earnings per share was To.07 and the stock was around 120ish. Now they’Rematch projecting high 4s Earnings per share and the stock is at 120…wtf? So they lost eBay and the market treated it like it was Going to go Burger King because of that, who the fuck uses eBay anyway anymore. They gained Amazon with Venmo and the market barely gave To shits about that…AMAZON! \n\nI get their growth has slowed down due to less free allowance money from Powell out there, but that isn’t a PayPal story that’Sarcasm a story every company almost. \n\nSo the stock has been destroyed thrown out to die. Finally starting to get some back but with a 186 cost basis im still down 30+ % \n\nWhat would an ape do here? Add more? Cut and run? Wait it out? \n\nIm not in TO deep but deeper than I Want to be so somewhat hesitant to Ad..."
21,translate,"I'm sorry I tanked GME guys, I was sitting on the side lines all last week and finally found the balls to jump in and get some weekly calls at power hour today. Looks like I caused it to tank because literally 5 minutes after I bought, my calls lost 20%.\n\nThese gonna print tomorrow? Or should I go stock up on handy lube for my shift at the Wendy's dumpster tomorrow night?","I'Male sorry I tanked GME guys, I was sitting on the side lines all last week and finally found the balls to jump in and get some weekly calls at power hour today. Looks like I caused it to tank because literally 5 minutes after I bought, my calls lost Location%.\n\nThese Going to print tomorrow? Or should I go stock Underpowered on handy lube for my shift at the Wendy'Sarcasm dumpster tomorrow night?"
34,translate,Got a lot if hates for my GME puts. Should I switch side tomorrow? I am up $600. your thoughts?,Got a lot if hates for my GME puts. Should I switch side tomorrow? I am Underpowered $600. your thoughts?
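
The three columns above are the layout the simpletransformers T5 wrapper expects. As a rough sketch of what each row represents, the snippet below builds rows from (source, target) pairs and shows the string the model would plausibly see. The `prefix: input_text` join is my understanding of how the library combines the two fields and should be treated as an assumption; `to_t5_rows` and `as_model_input` are hypothetical helper names.

```python
def to_t5_rows(pairs, prefix="translate"):
    """Turn (source, target) pairs into prefix / input_text / target_text rows."""
    return [
        {"prefix": prefix, "input_text": src, "target_text": tgt}
        for src, tgt in pairs
    ]

def as_model_input(row):
    # Assumed join of prefix and input text; check the library source before relying on it
    return f"{row['prefix']}: {row['input_text']}"

pairs = [
    ("Like taking candy from an ape", "Like taking candy from an ape"),
    ("I am up $600. your thoughts?", "I am Underpowered $600. your thoughts?"),
]
rows = to_t5_rows(pairs)
for row in rows:
    print(as_model_input(row), "->", row["target_text"])
```

The second pair is taken from the evaluation rows above and shows that uncorrected regex errors become training targets unless they are fixed first.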


In [39]:
model_args = T5Args()
model_args.num_train_epochs = 10
model_args.no_save = True
model_args.evaluate_generated_text = True
model_args.evaluate_during_training = True
model_args.evaluate_during_training_verbose = True
model_args.overwrite_output_dir = True
torch.cuda.memory_summary(device=None, abbreviated=False)
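# NOTE: the training log shows 250 steps per epoch for 2000 rows (batch size 8), so this
# attribute does not appear to take effect; simpletransformers reads `train_batch_size`.
model_args.per_gpu_train_batch_size = 128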

model = T5Model("t5", "t5-base", args=model_args, use_cuda=True)
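
The training log reports 250 iterations per epoch for the 2000 training rows and 63 generation batches for the 500 evaluation rows. A quick arithmetic check, sketched below, shows which batch size those numbers correspond to. The helper name is hypothetical, and the statement that 8 is the library default is my understanding rather than something the notebook confirms.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to cover n_examples once."""
    return math.ceil(n_examples / batch_size)

for batch_size in (8, 128):
    train_steps = steps_per_epoch(2000, batch_size)
    eval_steps = steps_per_epoch(500, batch_size)
    print(batch_size, train_steps, eval_steps)
```

Only a batch size of 8 gives 250 training steps and 63 evaluation batches; a batch size of 128 would give 16 and 4.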


In [40]:
def count_matches(labels, preds):
    print(labels)
    print(preds)
    return sum([1 if label == pred else 0 for label, pred in zip(labels, preds)])

In [41]:
torch.cuda.memory_summary(device=None, abbreviated=False)



In [42]:
torch.cuda.empty_cache()

In [43]:
model.train_model(train_df, eval_data=eval_df, matches=count_matches)

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/2000 [00:00<?, ?it/s]

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and the tokenizer under the `as_target_tokenizer` context manager to prepare
your targets.

Here is a short example:

model_inputs = tokenizer(src_texts, ...)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_1282000
INFO:simpletransformers.t5.t5_model: Training started


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]



Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.14157769571627593, 'matches': 124}


['Like taking candy from an ape', 'ᴛᴏᴍᴏʀʀᴏᴡ ᴡᴇ ʀɪᴅᴇ ᴏᴜᴛ ᴛᴏ ꜱᴍᴀꜱʜ ᴛʜᴇ ʀᴇꜱɪꜱᴛᴀɴᴄᴇ ᴀᴛ ᴛʜᴇ ᴄʀᴀᴄᴋ ᴏꜰ ᴅᴀᴡɴ', 'I have a 186 cost basis. Assists few months ago every “expert” in the land was suckling at the PYPL teet. Then In a matter of a few weeks they all turned tail. In 2019 the full year Earnings per share was To.07 and the stock was around 120ish. Now they’Rematch projecting high 4s Earnings per share and the stock is at 120…wtf? So they lost eBay and the market treated it like it was Going to go Burger King because of that, who the fuck uses eBay anyway anymore. They gained Amazon with Venmo and the market barely gave To shits about that…AMAZON! \n\nI get their growth has slowed down due to less free allowance money from Powell out there, but that isn’t a PayPal story that’Sarcasm a story every company almost. \n\nSo the stock has been destroyed thrown out to die. Finally starting to get some back but with a 186 cost basis im still down 30+ % \n\nWhat would an ape do here? Add more? Cut

Running Epoch 1 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.12361182576961934, 'matches': 126}


Running Epoch 2 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.10910093757630666, 'matches': 130}


Running Epoch 3 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.11364263590861348, 'matches': 127}


Running Epoch 4 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.12629996085126463, 'matches': 125}


Running Epoch 5 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.11748419229762577, 'matches': 128}


Running Epoch 6 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.1286085242833314, 'matches': 123}


Running Epoch 7 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.13629059591778694, 'matches': 124}
INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.13629059591778694, 'matches': 124}


Running Epoch 8 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.141906945330866, 'matches': 129}


Running Epoch 9 of 10:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.1440515995794461, 'matches': 130}
INFO:simpletransformers.t5.t5_model: Training of t5-base model complete. Saved to outputs/.


(2500,
 {'global_step': [250,
   500,
   750,
   1000,
   1250,
   1500,
   1750,
   2000,
   2000,
   2250,
   2500],
  'eval_loss': [0.14157769571627593,
   0.12361182576961934,
   0.10910093757630666,
   0.11364263590861348,
   0.12629996085126463,
   0.11748419229762577,
   0.1286085242833314,
   0.13629059591778694,
   0.13629059591778694,
   0.141906945330866,
   0.1440515995794461],
  'train_loss': [0.09822169691324234,
   0.11330597847700119,
   0.07268226146697998,
   0.16022087633609772,
   0.1051546260714531,
   0.010743957944214344,
   0.010357785038650036,
   0.03229517117142677,
   0.03229517117142677,
   0.003322034142911434,
   0.03331939876079559],
  'matches': [124, 126, 130, 127, 125, 128, 123, 124, 124, 129, 130]})
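
In the history returned above, `eval_loss` reaches its minimum early and then drifts upward while `train_loss` keeps falling, which may indicate overfitting after the first few epochs. The sketch below picks the step with the lowest evaluation loss, using the values from the output above rounded to four decimals. Because `no_save = True`, the weights from that step were presumably not kept.

```python
# Values copied from the train_model output, rounded to four decimals.
history = {
    "global_step": [250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2000, 2250, 2500],
    "eval_loss": [0.1416, 0.1236, 0.1091, 0.1136, 0.1263, 0.1175,
                  0.1286, 0.1363, 0.1363, 0.1419, 0.1441],
    "matches": [124, 126, 130, 127, 125, 128, 123, 124, 124, 129, 130],
}

# Index of the evaluation with the lowest loss.
best = min(range(len(history["eval_loss"])), key=history["eval_loss"].__getitem__)

print(history["global_step"][best])    # 750
print(history["eval_loss"][best])      # 0.1091
print(history["matches"][best] / 500)  # 0.26 exact-match rate on the 500 eval rows
```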

In [44]:
my_t5 = model.eval_model(eval_df, matches=count_matches)
print(my_t5)

INFO:simpletransformers.t5.t5_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_utils: Saving features into cached file cache_dir/t5-base_cached_128500


Running Evaluation:   0%|          | 0/63 [00:00<?, ?it/s]

Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

INFO:simpletransformers.t5.t5_model:{'eval_loss': 0.1440515995794461, 'matches': 130}


In [45]:
print(my_t5)

{'eval_loss': 0.1440515995794461, 'matches': 130}


In [46]:
# Generate predictions for the evaluation inputs so they can be added to eval_df
preds = model.predict(list(eval_df['input_text']))

Generating outputs:   0%|          | 0/63 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/500 [00:00<?, ?it/s]

In [47]:
print(preds)

['Like taking candy from an apepe', '', 'Today they’Rematch projecting high 4s Earnest and the stock is at 120...', "Or should I go stock Underpowered on handy lube for my shift at the Wendy'", 'I am Underpowered $600. your thoughts?', 'Sums Underpowered the day pretty well...', 'what goes Underpowered must come down.', 'jk, Vlad and the rest of Robinhood can get bent. [Bloomberg', 'YOLO 3/29/2022', 'Getting ready for the power hour', 'G M E touched upon $448,950.00 per share according to a transaction', 'Realistically, I think comparatives suggest ToX revenue, so a', 'YOLO HISTORY DOESNT REPEAT Information Technology RHYMES', 'Dios mos, man!', '— The Justice Department Monday endorsed legislation forbidding large digital platforms such Amazon and', '$TSLA $GME', 'I have another Location shares in my brokerage too. Love me some gains boys.', '—&gt; Went Long via shares and shorting April 1 $140 put', 'Is this what my financial advisor meant when he said: “You need to sell some', "I've 
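
Many of the predictions above stop after about twenty tokens, and the inputs were passed to `predict` without the `translate` prefix used in training. The sketch below shows how the inputs might be prepared, assuming simpletransformers joins prefix and text as `prefix: text` during training and expects the caller to do the same at prediction time. `add_prefix` is a hypothetical helper, and the commented-out lines were not run against this model.

```python
def add_prefix(texts, prefix="translate"):
    """Prepend the task prefix in 'prefix: text' form to every input string."""
    return [f"{prefix}: {text}" for text in texts]


inputs = add_prefix(["I am up $600. your thoughts?"])
print(inputs)  # ['translate: I am up $600. your thoughts?']

# With the trained model from this notebook (not run here):
# model.args.max_length = 128  # assumed cap on generated tokens; the default appears to be 20
# preds = model.predict(add_prefix(list(eval_df['input_text'])))
```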

In [48]:
eval_df['t5_prediction'] = preds
eval_df.head()

Unnamed: 0,prefix,input_text,target_text,t5_prediction
1,translate,Like taking candy from an ape,Like taking candy from an ape,Like taking candy from an apepe
13,translate,ᴛᴏᴍᴏʀʀᴏᴡ ᴡᴇ ʀɪᴅᴇ ᴏᴜᴛ ᴛᴏ ꜱᴍᴀꜱʜ ᴛʜᴇ ʀᴇꜱɪꜱᴛᴀɴᴄᴇ ᴀᴛ ᴛʜᴇ ᴄʀᴀᴄᴋ ᴏꜰ ᴅᴀᴡɴ,ᴛᴏᴍᴏʀʀᴏᴡ ᴡᴇ ʀɪᴅᴇ ᴏᴜᴛ ᴛᴏ ꜱᴍᴀꜱʜ ᴛʜᴇ ʀᴇꜱɪꜱᴛᴀɴᴄᴇ ᴀᴛ ᴛʜᴇ ᴄʀᴀᴄᴋ ᴏꜰ ᴅᴀᴡɴ,
20,translate,"I have a 186 cost basis. A few months ago every “expert” in the land was suckling at the PYPL teet. Then In a matter of a few weeks they all turned tail. In 2019 the full year EPS was 2.07 and the stock was around 120ish. Now they’re projecting high 4s EPS and the stock is at 120…wtf? So they lost eBay and the market treated it like it was gonna go BK because of that, who the fuck uses eBay anyway anymore. They gained Amazon with Venmo and the market barely gave 2 shits about that…AMAZON! \n\nI get their growth has slowed down due to less free allowance money from Powell out there, but that isn’t a PayPal story that’s a story every company almost. \n\nSo the stock has been destroyed thrown out to die. Finally starting to get some back but with a 186 cost basis im still down 30+ % \n\nWhat would an ape do here? Add more? Cut and run? Wait it out? \n\nIm not in TO deep but deeper than I wanna be so somewhat hesitant to add more","I have a 186 cost basis. Assists few months ago every “expert” in the land was suckling at the PYPL teet. Then In a matter of a few weeks they all turned tail. In 2019 the full year Earnings per share was To.07 and the stock was around 120ish. Now they’Rematch projecting high 4s Earnings per share and the stock is at 120…wtf? So they lost eBay and the market treated it like it was Going to go Burger King because of that, who the fuck uses eBay anyway anymore. They gained Amazon with Venmo and the market barely gave To shits about that…AMAZON! \n\nI get their growth has slowed down due to less free allowance money from Powell out there, but that isn’t a PayPal story that’Sarcasm a story every company almost. \n\nSo the stock has been destroyed thrown out to die. Finally starting to get some back but with a 186 cost basis im still down 30+ % \n\nWhat would an ape do here? Add more? Cut and run? Wait it out? \n\nIm not in TO deep but deeper than I Want to be so somewhat hesitant to Ad...",Today they’Rematch projecting high 4s Earnest and the stock is at 120...
21,translate,"I'm sorry I tanked GME guys, I was sitting on the side lines all last week and finally found the balls to jump in and get some weekly calls at power hour today. Looks like I caused it to tank because literally 5 minutes after I bought, my calls lost 20%.\n\nThese gonna print tomorrow? Or should I go stock up on handy lube for my shift at the Wendy's dumpster tomorrow night?","I'Male sorry I tanked GME guys, I was sitting on the side lines all last week and finally found the balls to jump in and get some weekly calls at power hour today. Looks like I caused it to tank because literally 5 minutes after I bought, my calls lost Location%.\n\nThese Going to print tomorrow? Or should I go stock Underpowered on handy lube for my shift at the Wendy'Sarcasm dumpster tomorrow night?",Or should I go stock Underpowered on handy lube for my shift at the Wendy'
34,translate,Got a lot if hates for my GME puts. Should I switch side tomorrow? I am up $600. your thoughts?,Got a lot if hates for my GME puts. Should I switch side tomorrow? I am Underpowered $600. your thoughts?,I am Underpowered $600. your thoughts?


In [49]:
eval_df.to_csv('reddit_eval_t5_translated.csv')