# step 1 - scrape data
The later steps expect the data in a csv file with one piece of text (e.g. one paper title) on each line. The file does not actually use commas as separators, and no piece of text should contain a newline.
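As a minimal sketch of that format, the helper below collapses internal whitespace (including newlines) and writes one text per line. `write_one_per_line` is a hypothetical name used only for illustration, and UTF-8 output is an assumption.

```python
from pathlib import Path


def write_one_per_line(texts, path):
    """Write each text on its own line, collapsing internal whitespace.

    Hypothetical helper for illustration; entries that are empty after
    cleaning are skipped. Returns the cleaned list that was written.
    """
    cleaned = [' '.join(t.split()) for t in texts]   # also removes newlines
    cleaned = [t for t in cleaned if t]
    Path(path).write_text(''.join(t + '\n' for t in cleaned), encoding='utf-8')
    return cleaned
```

For example, `write_one_per_line(titles, 'titles_ref.csv')` would produce a file in the layout the finetuning step reads.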

**scrape arxiv for titles**

In [2]:
import arxivscraper as ax
import numpy as np

'''
# scraper for arxiv stat.ml
scraper = ax.Scraper(category='stat', date_from='2017-08-01',
                     date_until='2019-07-01', t=10, 
                     filters={'categories':['stat.ml'],'abstract':['learning']})

# scraper for arxiv q-bio
scraper = ax.Scraper(category='q-bio', date_from='2016-08-01',
                     date_until='2019-07-01', t=10, 
                     filters={'categories':['q-bio.GN', 'q-bio.NC']})
'''

# scraper for arxiv physics
scraper = ax.Scraper(category='physics', date_from='2019-05-01',
                     date_until='2019-07-03', t=10,
                     filters={'categories':['quant-ph']})

output = scraper.scrape()



# cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
titles = [' '.join(o['title'].split()) for o in output]
np.savetxt('titles_ref.csv', np.array(titles), fmt='%s')

http://export.arxiv.org/oai2?verb=ListRecords&from=2019-05-01&until=2019-05-02&metadataPrefix=arXiv&set=physics
fetching up to  1000 records...
fetching up to  2000 records...
Got 503. Retrying after 10 seconds.
fetching up to  2000 records...
fetching is completed in 26.7 seconds.
Total number of records 131
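If a scrape returns the same title more than once, it may be worth dropping the repeats before saving. The sketch below assumes each record is a dict with a `'title'` key, as in the `cols` comment above. `unique_titles` is a hypothetical helper and the records are made up for illustration.

```python
def unique_titles(records):
    """Return whitespace-normalized titles, dropping case-insensitive repeats."""
    seen = set()
    titles = []
    for record in records:
        title = ' '.join(record['title'].split())
        key = title.lower()
        if title and key not in seen:
            seen.add(key)
            titles.append(title)
    return titles


# made-up records, for illustration only
records = [
    {'title': 'quantum  walks on\n graphs'},
    {'title': 'Quantum walks on graphs'},
    {'title': 'a second title'},
]
print(unique_titles(records))
```

The first spelling of each title is kept, so the order of the scrape is preserved.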


**alternatively, scrape something else**

In [3]:
import urllib.request  # plain `import urllib` does not reliably expose urllib.request
import numpy as np

# scrape some interesting quotes
url = 'https://raw.githubusercontent.com/akhiltak/inspirational-quotes/master/Quotes.csv'
response = urllib.request.urlopen(url).read().decode()
quotes = []
lines = response.split('\n')
for line in lines[:-1]:
    quote = line.split(';')[0]  # keep only the first ';'-separated field
    for ch in "'*#%&":          # strip characters we don't want in the text
        quote = quote.replace(ch, '')
    quotes.append(quote)

np.savetxt('titles.csv', np.array(quotes[1:]), fmt='%s')
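Splitting on `';'` by hand breaks if a field itself contains a quoted semicolon. A hedged alternative is Python's standard `csv` module. The sketch below assumes a semicolon-delimited file with a header row, which is what the cell above implies. `first_column` is a hypothetical helper, the sample text is made up, and unlike the cell above it does not strip any characters.

```python
import csv
import io


def first_column(text, delimiter=';', skip_header=True):
    """Return the first field of each non-empty row of delimited text."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    values = [row[0] for row in rows if row]
    return values[1:] if skip_header else values


# made-up sample in the assumed layout, for illustration only
sample = 'QUOTE;AUTHOR;GENRE\nStay hungry;Someone;life\n"Less; but better";Someone else;design\n'
print(first_column(sample))
```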

**or scrape tweets (requires twitter api credentials)**

In [1]:
import tweetscraper
# tweetscraper.get_all_tweets("SICKOFWOLVES") # name of account to scrape
tweetscraper.clean_csv(fname='data/wolves_tweets.csv')

# step 2 - finetune gpt2
this code downloads gpt2 and finetunes it on the file titles_ref.csv, generating samples at intermediate steps

In [None]:
import gpt_2_simple as gpt2

model_name = "117M" # "355M" for larger model (it's 1.4 GB)
gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/117M/

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'titles_ref.csv',
              model_name=model_name,
              steps=1000,        # max number of training steps
              save_every=200,    # checkpoint interval, in steps
              sample_every=25)   # sample-generation interval, in steps

gpt2.generate(sess)

# step 3 - look at the model

### look at some samples
the samples are saved to the 'samples' folder by default

In [None]:
sample_file = 'samples/samples-901'
with open(sample_file, 'r') as f:  # close the file once it has been read
    t = f.read()

for s in ['endoftext', 'startoftext', '<|', '|>']:
    t = t.replace(s, '')
for title in t.title().split('\n')[1:]:
    if not title == '':
        print('- ' + title)
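The cell above hard-codes `samples/samples-901`. As a sketch, the most recent sample file can be located instead. This assumes files are named `samples-<step>` as above and searches subfolders as well, in case the samples sit one level deeper. `latest_sample` is a hypothetical helper, not part of gpt_2_simple.

```python
import re
from pathlib import Path


def latest_sample(folder='samples'):
    """Return the path of the sample file with the highest step number, or None.

    Assumes files are named like 'samples-<step>'; compares steps numerically
    so that 'samples-1001' ranks above 'samples-901'.
    """
    best_step, best_path = -1, None
    for path in Path(folder).rglob('samples-*'):
        match = re.fullmatch(r'samples-(\d+)', path.name)
        if match and path.is_file() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

With this, `sample_file = latest_sample()` could replace the fixed path, after checking that the result is not `None`.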

### generating new samples from the finetuned model

In [1]:
import gpt_2_simple as gpt2
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

Loading checkpoint checkpoint/run1/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1000


**generate one sample**

In [23]:
prefix = 'neural' # None is default
text = gpt2.generate(sess,
              length=40,
              temperature=0.7,
              prefix=prefix,
              nsamples=1,
              batch_size=1,
              return_as_list=True
             )


t = text[0].title()
t = t.replace('<|Startoftext|>', '').replace('\n', '') # remove extraneous stuff
t = t[:t.index('<|Endoftext|>')] # only get one title
print(t)

Neural Source Separation Via Non-Negative Eigenvector Field Variate Operator
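The cleanup above raises a `ValueError` when a sample is cut off before an end token appears. A slightly more defensive sketch is shown below. It assumes the raw tokens are `<|startoftext|>` and `<|endoftext|>` (the lowercase forms of the strings matched above), and `clean_title` is a hypothetical helper. Unlike the cell above, it joins wrapped lines with a space rather than removing the newline outright.

```python
START, END = '<|startoftext|>', '<|endoftext|>'


def clean_title(raw):
    """Return the first title in a generated sample, in title case."""
    first = raw.split(END, 1)[0]       # text before the first end token, if any
    first = first.replace(START, '')   # drop the start token
    return ' '.join(first.split()).title()


print(clean_title('<|startoftext|>neural source\nseparation<|endoftext|>\n<|startoftext|>other'))
```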


**generate a bunch of samples**

In [None]:
text = gpt2.generate(sess,
#               length=40,
              temperature=0.7,
              prefix=None,
              nsamples=100,
              batch_size=1,
              return_as_list=True
             )


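for sample in text:  # print every generated sample, not just the first
    t = sample.title()
    t = t.replace('<|Startoftext|>', '').replace('\n', '') # remove extraneous stuff
    if '<|Endoftext|>' in t:
        t = t[:t.index('<|Endoftext|>')] # only get one title per sample
    print(t)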