Read the review.json file from the Yelp Dataset into memory. The file stores one JSON object per line and contains more than 6 million reviews.

In [0]:
%%time

from google.colab import drive
drive.mount('/content/drive')
root_dir = 'drive/My Drive/Colab Notebooks/AuthorshipAttribution/'

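import json

# review.json holds one JSON object per line, so parse it line by line.
reviews = []
with open(root_dir + "data/review.json") as f:
    for index, line in enumerate(f):
        if index % 1000000 == 0:
            print(index)  # progress marker every million reviews
        reviews.append(json.loads(line))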

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
0
1000000
2000000
3000000
4000000
5000000
6000000
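Keeping every field of 6M+ reviews in memory is costly, and the later cells only read user_id and text. A minimal sketch of a leaner loader, assuming those two fields are all that is needed; iter_reviews is a hypothetical helper and not part of the original notebook:

```python
import json

def iter_reviews(path, fields=("user_id", "text")):
    """Yield one dict per line of a JSON Lines file, keeping only `fields`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing in json.loads
            record = json.loads(line)
            yield {key: record[key] for key in fields}
```

Under these assumptions, the loading cell could become reviews = list(iter_reviews(root_dir + "data/review.json")).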


Print the first review to examine the fields that accompany each review.

In [0]:
print(reviews[0])


We use a Counter to tally the number of reviews per user id, then take the 50 reviewers with the most reviews.

In [0]:
from collections import Counter
prolific_reviewers = Counter([review['user_id'] for review in reviews]).most_common(50)
prolific_reviewers


In [0]:
# Set of the top reviewers' user ids, used for fast membership tests below.
keep_ids = {user_id for user_id, _ in prolific_reviewers}
keep_ids
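To make the counting step concrete, here is a small sketch on made-up data. The toy_reviews list and the top-2 cutoff are illustrative assumptions; the notebook itself uses the full review list and the top 50.

```python
from collections import Counter

# Toy stand-in for the real `reviews` list (hypothetical data).
toy_reviews = [
    {"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"},
    {"user_id": "u3"}, {"user_id": "u1"}, {"user_id": "u2"},
]

# A generator expression avoids building an intermediate list of ids.
counts = Counter(review["user_id"] for review in toy_reviews)

# most_common returns (user_id, review_count) pairs, most frequent first.
top_two = counts.most_common(2)
toy_keep_ids = {user_id for user_id, _ in top_two}

print(top_two)  # [('u1', 3), ('u2', 2)]
```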

We then go through the list of reviews and collect those written by the 50 most prolific authors found in the previous step. All reviews by a given author are concatenated, separated by newlines, into a single string, which is stored as the value in a { author_id: review_text } dictionary.

In [0]:
by_author = {}  # author id -> "review 1\nreview 2\nreview 3"
for review in reviews:
    uid = review['user_id']
    if uid in keep_ids:
        if uid in by_author:
            by_author[uid] += "\n{}".format(review['text'])
        else:
            by_author[uid] = "{}".format(review['text'])

In [0]:
len(by_author)

50
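Repeated string concatenation can get slow when an author has thousands of reviews. As an alternative sketch of the grouping step, the texts can be collected in lists and joined once at the end. The function name group_text_by_author is hypothetical, and the result is intended to match the by_author dictionary built above.

```python
from collections import defaultdict

def group_text_by_author(reviews, keep_ids):
    """Map each kept user id to that user's review texts joined by newlines."""
    parts = defaultdict(list)
    for review in reviews:
        uid = review["user_id"]
        if uid in keep_ids:
            parts[uid].append(review["text"])
    # Join once per author rather than growing a string review by review.
    return {uid: "\n".join(texts) for uid, texts in parts.items()}
```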

We now make sure that each author has at least 200,000 characters of review text in the dictionary created above. To do so, we sort the authors by the length of their combined review text and check that the shortest one is at least 200,000 characters long.

In [0]:
# check that we have at least 200000 characters for each author
sorted([(len(by_author[key]), key) for key in by_author])[:5]

[(218976, 'ic-tyi1jElL_umxZVh8KNA'),
 (237086, 'PcvbBOCOcs6_suRDH7TSTg'),
 (288579, 'PeLGa5vUR8_mcsn-fn42Jg'),
 (467816, 'Lfv4hefW1VbvaC2gatTFWA'),
 (468683, 'iDlkZO2iILS8Jwfdy7DP9A')]
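The same check can be written so that it reports offending authors directly instead of relying on reading the sorted output. A minimal sketch on toy data; authors_below is a hypothetical helper, not part of the original notebook:

```python
def authors_below(by_author, min_chars):
    """Return the ids of authors whose combined text is shorter than min_chars."""
    return [author for author, text in by_author.items() if len(text) < min_chars]

# Toy data; the real call would be authors_below(by_author, 200_000).
toy_by_author = {"a": "x" * 10, "b": "y" * 3}
print(authors_below(toy_by_author, 5))  # ['b']
print(authors_below(toy_by_author, 3))  # []
```

An empty list means every author meets the threshold.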

The following function takes a string and a chunk length (n) as input and returns consecutive chunks of the string, each n characters long. The last chunk is shorter when the string length is not a multiple of n.

In [0]:
def get_chunks(l, n):
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]
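A quick usage example on a short string shows the behaviour, including the shorter final chunk. The function is repeated here only so the snippet runs on its own.

```python
def get_chunks(l, n):
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]

chunks = get_chunks("abcdefghij", 4)
print(chunks)  # ['abcd', 'efgh', 'ij']

# Keep only full-length chunks if a ragged tail is unwanted.
full_chunks = [chunk for chunk in chunks if len(chunk) == 4]
print(full_chunks)  # ['abcd', 'efgh']
```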

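We then split the reviews into training and test sets as specified in the readme file. For each author, the first 50,000 characters form a single training text, and the next 50,000 characters are cut into 1,000-character test texts.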

In [0]:
train_texts = []  # the first 50 000 chars from each author
train_labels = [] # one label per author
test_texts = []   # 50 texts of 1000 characters each (the next 50 000 chars of each author)
test_labels = []  # each author's label * 50

author_int = {author: i for i,author in enumerate(by_author)}
int_author = {author_int[author]: author for author in author_int}

for author in by_author:
    train_text = by_author[author][:50000]
    train_label = author_int[author]
    train_texts.append(train_text)
    train_labels.append(train_label)
    
    short_texts = get_chunks(by_author[author][50000:100000], 1000)
    for text in short_texts:
        test_texts.append(text)
        test_labels.append(author_int[author])
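The per-author split can be sketched as a small function with the sizes exposed as parameters. The name split_author_text and its defaults are assumptions that mirror the slice bounds used in the loop above, not something taken from the readme.

```python
def split_author_text(text, train_chars=50_000, test_chars=50_000, chunk_len=1_000):
    """Return (train_text, test_chunks) for one author's combined text."""
    train = text[:train_chars]
    held_out = text[train_chars:train_chars + test_chars]
    # Cut the held-out span into consecutive chunk_len-character pieces.
    test_chunks = [held_out[i:i + chunk_len] for i in range(0, len(held_out), chunk_len)]
    return train, test_chunks

# Toy text of 120,000 characters standing in for one author's reviews.
train, test_chunks = split_author_text("ab" * 60_000)
print(len(train), len(test_chunks), len(test_chunks[0]))  # 50000 50 1000
```

With 50 authors, this yields 50 training texts and 2,500 test texts in total.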

In [0]:
train_data = {
    "train_texts": train_texts,
    "train_labels": train_labels
}
test_data = {
    "test_texts": test_texts,
    "test_labels": test_labels
}

Lastly, we dump the training and test data into separate pickle files for later use.

In [0]:
import pickle

with open("../data/train_data.pickle", "wb") as f:
    pickle.dump(train_data, f)
with open("../data/test_data.pickle", "wb") as f:
    pickle.dump(test_data, f)
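For later use, the files can be read back with pickle.load. A minimal round-trip sketch on toy data, written to a temporary directory so that it does not touch the real files; toy_train is a hypothetical stand-in for train_data:

```python
import os
import pickle
import tempfile

# Toy stand-in with the same keys as train_data.
toy_train = {"train_texts": ["some review text"], "train_labels": [0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_data.pickle")
    with open(path, "wb") as f:
        pickle.dump(toy_train, f)
    with open(path, "rb") as f:  # pickle files must be opened in binary mode
        restored = pickle.load(f)

print(restored == toy_train)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.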