## Prepare Referece files using TFIDF for retrieving attributes


In [2]:
#!pip install -U scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-0.22.2.post1-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
[K     |████████████████████████████████| 7.1 MB 3.4 MB/s eta 0:00:01
[?25hCollecting joblib>=0.11
  Downloading joblib-0.14.1-py2.py3-none-any.whl (294 kB)
[K     |████████████████████████████████| 294 kB 80.7 MB/s eta 0:00:01
[?25hCollecting scipy>=0.17.0
  Using cached scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
Installing collected packages: joblib, scipy, scikit-learn
Successfully installed joblib-0.14.1 scikit-learn-0.22.2.post1 scipy-1.4.1


In [1]:
import pandas as pd
from tqdm import tqdm, trange
import numpy as np
import time
import torch
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
def read_file(path):
    with open(path) as fp:
        lines = fp.read().splitlines()
    return lines

In [3]:
def clean_text(text):
    return text.replace("<POS>","").replace("<NEG>","").replace("<CON_START>","").replace("<START>","").replace("<END>","").strip()


In [5]:
!ls data/lipton/sentiment/orig/bert_classifier_training/

dev.csv					  sentiment_test_0.txt	 test.csv
processed_files_with_bert_with_best_head  sentiment_test_1.txt	 train.csv
sentiment_dev_0.txt			  sentiment_train_0.txt
sentiment_dev_1.txt			  sentiment_train_1.txt


In [9]:
# this is the BGST reference data
!head -n1 data/lipton/sentiment/orig/bert_classifier_training/processed_files_with_bert_with_best_head/reference_0.txt

<POS> <CON_START> i did not enjoy this 1945 mystery thriller film about a young woman , nina foch , ( julia ross ) who is out of work and has fallen behind in her rent and is desperate to find work . julia reads an ad in the local london newspaper looking for a secretary and rushes out to try and obtain this position . julia obtains the position and is hired by a mrs . hughes , ( dame may witty ) who requires that she lives with her employer in her home and wants her to have no involvement with men friends and , conveniently , julia tells them she has no family and is free to devote her entire time to this job . george macready , ( ralph hughes ) is the son of mrs . hughes and has some very strange desires for playing around with knives . unfortunately , this was a film and most of the scenes were close ups in order to avoid the expense of a background and costs for scenery . this stereotypical strange family all live in a huge mansion off the cornwall coast of england and there are th

In [10]:
!head -n1 data/lipton/sentiment/new/reference_0.txt

This a bad Disney flick. It is the story of an aging high school baseball coach (Dennis Quaid), who was once on his way to the big leagues as a pitcher, but suffered a career ending injury. Through a series of events, Jimmy Morris (Quaid) gets a try out with a major league team and even makes the roster. This is a bad family film. It is not inspirational and pours it on too thick. It's neither fun nor entertaining. Adults will hate this movie as well as kids. It is based upon a true story, though i'm sure the filmmakers took some liberties in telling the story. Quaid is dull as the title character, very unconvincing. If you're looking for a film the whole family can enjoy, look past this. 2/10


In [11]:
# so copy over reference_0.txt and _1 from new/  to orig/
!cp data/lipton/sentiment/new/reference_0.txt data/lipton/sentiment/orig/bert_classifier_training/
!cp data/lipton/sentiment/new/reference_1.txt data/lipton/sentiment/orig/bert_classifier_training/

In [13]:

#data_dir = "data/yelp/"
data_dir = "data/lipton/sentiment/orig/bert_classifier_training/"
train0_org = read_file(data_dir+"sentiment_train_0.txt") # Training data of negative sentiment
train1_org = read_file(data_dir+"sentiment_train_1.txt") # Training data of positive sentiment
ref0_processed = read_file(data_dir+"processed_files_with_bert_with_best_head/reference_0.txt") # Reference data for delete_generate model
ref1_processed = read_file(data_dir+"processed_files_with_bert_with_best_head/reference_1.txt") # Reference data for delete_generate model
ref0_org = read_file(data_dir+"reference_0.txt") # Original Refrence_0 data
ref1_org = read_file(data_dir+"reference_1.txt") # Original Refrence_1 data
train0_processed = read_file(data_dir+"processed_files_with_bert_with_best_head/delete_retrieve_edit_model/sentiment_train_0_all_attrs.txt") # training data with content and attributes seperation
train1_processed = read_file(data_dir+"processed_files_with_bert_with_best_head/delete_retrieve_edit_model/sentiment_train_1_all_attrs.txt") # training data with content and attributes seperation

In [14]:
# Get the Original Reference Sentence
ref0_org = [x.split("\t")[0] for x in ref0_org]
ref1_org = [x.split("\t")[0] for x in ref1_org]

In [15]:
# Get the Content of the Reference Sentences
ref0_con = [clean_text(x) for x in ref0_processed]
ref1_con = [clean_text(x) for x in ref1_processed]

In [16]:
ref0_org[:4], ref0_con[:4]

(["This a bad Disney flick. It is the story of an aging high school baseball coach (Dennis Quaid), who was once on his way to the big leagues as a pitcher, but suffered a career ending injury. Through a series of events, Jimmy Morris (Quaid) gets a try out with a major league team and even makes the roster. This is a bad family film. It is not inspirational and pours it on too thick. It's neither fun nor entertaining. Adults will hate this movie as well as kids. It is based upon a true story, though i'm sure the filmmakers took some liberties in telling the story. Quaid is dull as the title character, very unconvincing. If you're looking for a film the whole family can enjoy, look past this. 2/10",
  "To all the great people who have done everything from complain about the dialogue, the budget, the this and the that....everyone wants to hear it.  IF you missed the point of this terrible movie, that's not your loss. The rest of us who deeply hate this movie care what you think. I am a t

In [17]:
def get_train_content(text):
    return text.split("<START>")[0].split("<CON_START>")[1].strip()

In [18]:
def get_train_attrs(text):
    return text.split("<CON_START>")[0].replace("<ATTR_WORDS>","").strip().split()

In [19]:
get_train_attrs(train0_processed[0])

['Long,', 'boring,', 'blasphemous.', 'Never', 'I', 'roll.']

In [20]:
train0_processed[:4], train1_processed[:4]

(['<ATTR_WORDS> Long, boring, blasphemous. Never I roll. <CON_START> , , blasphemous . have i been so glad to see ending credits roll . <START> Long, boring, blasphemous. Never have I been so glad to see ending credits roll. <END>',
  '<ATTR_WORDS> Not good! Rent original! Watch then....maybe.<br /><br />It Elvis King. <CON_START> ! or buy the original ! watch this only if someone has a gun to your head and then . . . . maybe . < br / > < br / > it is like claiming an elvis actor is as good as the real king . <START> Not good! Rent or buy the original! Watch this only if someone has a gun to your head and then....maybe.<br /><br />It is like claiming an Elvis actor is as good as the real King. <END>',
  '<ATTR_WORDS> "This bad, all-time ""comedy"": Police Academy 7. No laughs movie. Do worthwhile, really. Just don\'t garbage." <CON_START> " this movie is so , it can only be compared to the all - time worst " " comedy " " : police academy 7 . no throughout the movie . do something worth

In [21]:
# get content
train0_con = [get_train_content(x) for x in train0_processed]
train1_con = [get_train_content(x) for x in train1_processed]

In [22]:
train0_con[:4], train1_con[:4]

([', , blasphemous . have i been so glad to see ending credits roll .',
  '! or buy the original ! watch this only if someone has a gun to your head and then . . . . maybe . < br / > < br / > it is like claiming an elvis actor is as good as the real king .',
  '" this movie is so , it can only be compared to the all - time worst " " comedy " " : police academy 7 . no throughout the movie . do something worthwhile , anything really . just don \' t waste your time on this . "',
  'horrors are bad at all , some are smart with interesting stories , but is not the case of " " second name " " . it is badly directed , badly acted and boring . . . boring . . . boring , a missed chance for an interesting story . "'],
 ['this film might have weak production values , but that is also what makes it so good . the special effects are gross out and done . my part of the movie had to be chrissy played by janelle brady . she is super hot and also has a good nude scene . robert prichard as the leader of

In [23]:
# Fatch attributes from the training data
attrs_neg = [get_train_attrs(x) for x in train0_processed]
attrs_pos = [get_train_attrs(x) for x in train1_processed]

In [24]:
# Get TFIDF vectors for Training and Reference
tfidf = TfidfVectorizer()
conts_vecs = tfidf.fit_transform(train0_con + train1_con)
conts_pos_vecs = conts_vecs[:len(train1_con)]
conts_neg_vecs = conts_vecs[len(train1_con):len(train1_con)+len(train0_con)]
conts_from_pos_ref_vecs = tfidf.transform(ref1_con)
conts_from_neg_ref_vecs = tfidf.transform(ref0_con)

#### AnnoyIndex is used to store the TFIDF vectors of training set and retrieve nearest neighbours of the reference content 

In [20]:
!pip install annoy

Collecting annoy
  Downloading annoy-1.16.3.tar.gz (644 kB)
[K     |████████████████████████████████| 644 kB 3.0 MB/s eta 0:00:01
[?25hBuilding wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... [?25ldone
[?25h  Created wheel for annoy: filename=annoy-1.16.3-cp37-cp37m-linux_x86_64.whl size=275501 sha256=352b4d48653d80445a26736f1ab5ee2218f6b7cc3d6da57324a84a69aaed44ae
  Stored in directory: /home/diego/.cache/pip/wheels/39/36/d4/ee348a7240ca3e8d1fcbf04ebe46d45f2879ccb094a40f5706
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.16.3


In [25]:
from annoy import AnnoyIndex

In [26]:
train0_tree = AnnoyIndex(conts_neg_vecs.shape[-1])
train1_tree = AnnoyIndex(conts_pos_vecs.shape[-1])

  """Entry point for launching an IPython kernel.
  


In [28]:
print(conts_neg_vecs.shape[0])
print(conts_pos_vecs.shape[0])

851
856


In [29]:
#  NO NEED:  We have randomly selected training samples to control the memory usage
neg_idxs = [i for i in range(conts_neg_vecs.shape[0]) ] #np.random.choice(conts_neg_vecs.shape[0], size=50000, replace=False)
pos_idxs = [i for i in range(conts_pos_vecs.shape[0]) ]#np.random.choice(conts_pos_vecs.shape[0], size=50000, replace=False)

In [30]:
#for i in trange(conts_neg_vecs.shape[0]):
for i in trange(len(neg_idxs)):
    np_array = conts_neg_vecs[neg_idxs[i]].toarray()[0]
    train0_tree.add_item(i,np_array)

100%|██████████| 851/851 [00:02<00:00, 369.26it/s]


In [31]:
train0_tree.build(50)
train0_tree.save(data_dir+'tfidf_train0.ann')

True

In [32]:
ref1_con[0:3], " ".join(attrs_neg[neg_idxs[0]])

(["i rated this a 5 . the dubbing was as good as i have seen . the plot - wow . i ' m not sure which made the movie more . jet li is a martial artist , as good as jackie chan .",
  '" only one thing could have redeemed this sketch . a healthy gunfight between the happy couple , the exotic model at the delicatessen , and the old - timer from the motel who was ( it would have turned out ) secretly watching from the woods and had been aging rent - boy to the guys when they \' d shared the rubber house . < br / > < br / > in the process , they could have blown that freezing shack to smithereens , resolved most of the snags ; such as the " " whore bitch " " ode on the windscreen , the reason why the protagonist had " " no friends , " " as well as explaining his coolness under pressure from bloody tampon , incessant phone calls . . . and that crawl - space chic , the green thumb , and his attraction to the simpler life . quite the technician with the human body , though . ex - abortionist ? 

In [33]:
#mkdir tfidf folder in processed_files_with_bert_with_best_head/delete_retrieve_edit_model/
!cd data/lipton/sentiment/orig/bert_classifier_training/processed_files_with_bert_with_best_head/delete_retrieve_edit_model/; mkdir tfidf; ls

sentiment_dev_0_all_attrs.txt	sentiment_test_all_attrs.txt
sentiment_dev_0.txt		sentiment_test.txt
sentiment_dev_1_all_attrs.txt	sentiment_train_0_all_attrs.txt
sentiment_dev_1.txt		sentiment_train_0.txt
sentiment_dev_all_attrs.txt	sentiment_train_1_all_attrs.txt
sentiment_dev.txt		sentiment_train_1.txt
sentiment_test_0_all_attrs.txt	sentiment_train_all_attrs.txt
sentiment_test_0.txt		sentiment_train.txt
sentiment_test_1_all_attrs.txt	tfidf
sentiment_test_1.txt


In [34]:
# HERE

with open(data_dir+"processed_files_with_bert_with_best_head/delete_retrieve_edit_model/tfidf/reference_1.txt", "w") as out_fp:
    for i in range(conts_from_pos_ref_vecs.shape[0]):
        x = conts_from_pos_ref_vecs[i].toarray()[0]
        inx,dis = train0_tree.get_nns_by_vector(x, 1, include_distances=True)
        ref_sen = ref1_con[i]
        #ref_sen = processed_ref0[i].replace("<POS>","").replace("<NEG>","").replace("<CON_START>","").replace("<START>","")
        #print(dis,"\t",ref0_org[i], "\t" ,train1_data[inx[0]], train1_attr[inx[0]])
        out_str = "<ATTR_WORDS> " + " ".join(attrs_neg[neg_idxs[inx[0]]]) + " <CON_START> " + ref_sen.strip() + " <START>" + "\n"
        print(out_str)
        out_fp.write(out_str)

<ATTR_WORDS> "One I ""Gymkata"" on; I it. It hilarious, horrifying, really. Think way--if bad terrible this, last? Not movie. It's must-see, obviously." <CON_START> i rated this a 5 . the dubbing was as good as i have seen . the plot - wow . i ' m not sure which made the movie more . jet li is a martial artist , as good as jackie chan . <START>

<ATTR_WORDS> "I everyone's poorly written.The I it. In I campy performer.I Laura Harris Canadian poorly HBO ""Dead Like Me"" Daisy Adair manner.I Ashley grader.I ""make it"" Laura Harris Nordic allure. If I 'Godfather' 'Beaches' low-budget Laura Harris I ""It's start!""" <CON_START> " only one thing could have redeemed this sketch . a healthy gunfight between the happy couple , the exotic model at the delicatessen , and the old - timer from the motel who was ( it would have turned out ) secretly watching from the woods and had been aging rent - boy to the guys when they ' d shared the rubber house . < br / > < br / > in the process , they could

In [35]:
#for i in trange(conts_neg_vecs.shape[0]):
for i in trange(len(pos_idxs)):
    np_array = conts_pos_vecs[pos_idxs[i]].toarray()[0]
    train1_tree.add_item(i,np_array)

100%|██████████| 856/856 [00:02<00:00, 371.51it/s]


In [36]:
train1_tree.build(50)
train1_tree.save(data_dir+'tfidf_train1.ann')

True

In [37]:
with open(data_dir+"processed_files_with_bert_with_best_head/delete_retrieve_edit_model/tfidf/reference_0.txt", "w") as out_fp:
    for i in range(conts_from_neg_ref_vecs.shape[0]):
        x = conts_from_neg_ref_vecs[i].toarray()[0]
        inx,dis = train1_tree.get_nns_by_vector(x, 1, include_distances=True)
        ref_sen = ref0_con[i]
        out_str = "<ATTR_WORDS> " + " ".join(attrs_pos[pos_idxs[inx[0]]]) + " <CON_START> " + ref_sen.strip() + " <START>" + "\n"
        print(i, out_str)
        out_fp.write(out_str)

0 <ATTR_WORDS> "This interesting. I Glover accomplish. I center, ""think"" control. The Glover ""outrageous"" beautiful potency. I I Glover trilogy. It fine! EVERYTHING IS FINE. See also. People ""thoughtless"" ""pretentious"" boat. This intelligent films. If books, while. The something. You experience!" <CON_START> i did not enjoy this 1945 mystery thriller film about a young woman , nina foch , ( julia ross ) who is out of work and has fallen behind in her rent and is desperate to find work . julia reads an ad in the local london newspaper looking for a secretary and rushes out to try and obtain this position . julia obtains the position and is hired by a mrs . hughes , ( dame may witty ) who requires that she lives with her employer in her home and wants her to have no involvement with men friends and , conveniently , julia tells them she has no family and is free to devote her entire time to this job . george macready , ( ralph hughes ) is the son of mrs . hughes and has some very 

In [None]:
# 1. NOW CHECK HOW THIS DOES MODELING WISE with GGST 
# 2. ALSO PUT UP THE DATA FROM THE ORIGINAL PREDICTIONS FOR B-GST and G-GST ALONG WITH ORIGINAL AND REFS 
#    ( so a spreadsheet of 1000) and eyeball things  <-- this will be useful for improvements made to this model


