### Yelp Reviews: Authorship Attribution with Python and scikit-learn

We almost always leave clues behind, and writing short texts seems to be no exception. Here we look at the Yelp reviews dataset and explore whether we can identify the author of a review based on that author’s past reviews. How much does the style of a review reveal about who wrote it?

Often, it’s possible to identify someone using only their unique style of writing. We’ll see how easy it is to identify people using their writing style through machine learning. Specifically, we’ll look at reviewers who have left multiple reviews on Yelp. 

We’ll teach a machine learning system to differentiate between different writing styles, and then see how well it can predict the correct author of a review, looking only at the review text.

### OVERVIEW

We will:

Load about 6 million reviews from the 2018 Yelp Dataset Challenge

Find users who have left at least 500 reviews

Train a support vector machine classifier to identify the writing style of each author

See how well the classifier can identify reviews it hasn’t seen during training

We rely on scikit-learn for the machine learning components.

### Yelp Dataset JSON Download and Documentation

https://www.yelp.com/dataset/download

https://www.yelp.com/dataset/documentation/main

In [1]:
from collections import Counter
import json

#### review.json
Contains the full review text, along with the user_id of the reviewer and the business_id of the business being reviewed.

In [3]:
%%time
json_review_dataset = "yelp_dataset/yelp_academic_dataset_review.json"
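# The file holds one JSON object per line; the with-block closes it once loading is done.
with open(json_review_dataset, encoding="utf-8") as review_file:
    reviews = [json.loads(line) for line in review_file]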

CPU times: user 55.2 s, sys: 8.28 s, total: 1min 3s
Wall time: 1min 4s
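
Holding every parsed review in one list takes a lot of memory. As a minimal sketch, assuming the file really is one JSON object per line (as the loading code above implies), the reviews can also be streamed one at a time. The helper names `iter_reviews` and `count_reviews_per_user` are hypothetical and not part of the Yelp dataset tooling.

```python
import json
from collections import Counter


def iter_reviews(path):
    """Yield one parsed review dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_reviews_per_user(path):
    """Count reviews per user_id without keeping the review texts in memory."""
    return Counter(review["user_id"] for review in iter_reviews(path))
```

A streaming pass like this would be enough to find the most active reviewers; the review texts themselves are only needed for the authors that are kept later.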


Let's take a look at the first review to see how the data is structured.

In [4]:
reviews_cnt = len(reviews)
print(f"Number of reviews: {reviews_cnt}")
reviews[0]  # looking at 1st review

Number of reviews: 5996996


{'review_id': 'x7mDIiDB3jEiPGPHOmDzyw',
 'user_id': 'msQe1u7Z_XuqjGoqhB0J5g',
 'business_id': 'iCQpiavjjPzJ5_3gPD5Ebg',
 'stars': 2,
 'date': '2011-02-25',
 'text': "The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo / Fort Apache. The chef there can make a MUCH better NY style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria in a casino. I dont care what you say...",
 'useful': 0,
 'funny': 0,
 'cool': 0}

#### A quick look at star rating percentages

In [10]:
%%time
star_ratings_cnt = Counter([review['stars'] for review in reviews])
for star in range(5, 0, -1):
    star_cnt = star_ratings_cnt[star]
    print(f"{'*' * star}\t {star_cnt} \t {round(star_cnt / reviews_cnt * 100)}%")

*****	 2641880 	 44%
****	 1335957 	 22%
***	 673206 	 11%
**	 487813 	 8%
*	 858139 	 14%
CPU times: user 1.1 s, sys: 47 ms, total: 1.15 s
Wall time: 1.15 s


### Finding Most Active Reviewers 

In [44]:
%%time
num_prolific_reviewers = 100  # top reviewers by number of reviews to be included
prolific_reviewers = Counter([review['user_id'] for review in reviews]).most_common(num_prolific_reviewers)

CPU times: user 3.52 s, sys: 4.89 s, total: 8.41 s
Wall time: 8.91 s


In [45]:
print("List of (Reviewer ID, Number of Reviews)")
prolific_reviewers

List of (Reviewer ID, Number of Reviews)


[('CxDOIDnH8gp9KXzpBHJYXw', 3739),
 ('bLbSNkLggFnqwNNzzq-Ijw', 2229),
 ('PKEzKWv_FktMm2mGPjwd0Q', 1674),
 ('DK57YibC5ShBmqQl97CKog', 1574),
 ('QJI9OSEn6ujRCtrX06vs1w', 1324),
 ('d_TBs6J3twMy9GChqUEXkg', 1245),
 ('hWDybu_KvYLSdEFzGrniTw', 1220),
 ('ELcQDlf69kb-ihJfxZyL0A', 1204),
 ('cMEtAiW60I5wE_vLfTxoJQ', 1201),
 ('YRcaNlwQ6XXPFDXWtuMGdA', 1195),
 ('U4INQZOPSUaj8hMjLlZ3KA', 1182),
 ('62GNFh5FySkA3MbrQmnqvg', 1126),
 ('UYcmGbelzRa0Q6JqzLoguw', 1104),
 ('dIIKEfOgo0KqUfGQvGikPg', 1049),
 ('iDlkZO2iILS8Jwfdy7DP9A', 990),
 ('n86B7IkbU20AkxlFX_5aew', 956),
 ('Ry1O_KXZHGRI8g5zBR3IcQ', 935),
 ('N3oNEwh0qgPqPP3Em6wJXw', 927),
 ('rCWrxuRC8_pfagpchtHp6A', 925),
 ('WeVkkF5L39888IPPlRhNpg', 886),
 ('0BBUmH7Krcax1RZgbH4fSA', 884),
 ('U5YQX_vMl_xQy8EQDqlNQQ', 883),
 ('fiGqQ7pIGKyZ9G0RqWLMpg', 861),
 ('pMefTWo6gMdx8WhYSA2u3w', 858),
 ('Q9mA60HnY87C1TW5kjAZ6Q', 851),
 ('3nDUQBjKyVor5wV0reJChg', 843),
 ('YMgZqBUAddmFErxLtCfK_w', 822),
 ('ic-tyi1jElL_umxZVh8KNA', 819),
 ('Wc5L6iuvSNF5WGBlqIO8nw', 818),


Now we want to create a balanced dataset, i.e. one with the same number of reviews from each reviewer. We’ll go through all our reviews again and keep only those written by the 100 authors we identified above, and only the first 500 reviews we encounter from each author. Below, keep_ids is a dictionary which we’ll use to keep count of how many reviews we have kept from each author.

In [59]:
%%time
num_reviews = 500  # max number of reviews kept for each reviewer
keep_ids = {pr[0] : 0 for pr in prolific_reviewers}

keep_reviews = []
for review in reviews:
    uid = review['user_id']
    if uid in keep_ids and keep_ids[uid] < num_reviews:
        keep_reviews.append(review)
        keep_ids[uid] += 1

CPU times: user 1.24 s, sys: 2.94 ms, total: 1.24 s
Wall time: 1.24 s


In [60]:
len(keep_reviews)

50000
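
A total of 50000 is consistent with 100 authors times 500 reviews, but the total alone does not show that every author contributed exactly 500. The following is a small sketch of a per-author check; `is_balanced` is a hypothetical helper and the toy data stands in for keep_reviews.

```python
from collections import Counter


def is_balanced(labels, expected_per_label):
    """Return True if every label occurs exactly expected_per_label times."""
    counts = Counter(labels)
    return all(count == expected_per_label for count in counts.values())


# Toy stand-in for the notebook's keep_reviews list (hypothetical data).
toy_reviews = [
    {"user_id": "a", "text": "great"},
    {"user_id": "b", "text": "fine"},
    {"user_id": "a", "text": "bad"},
    {"user_id": "b", "text": "ok"},
]
print(is_balanced([r["user_id"] for r in toy_reviews], expected_per_label=2))  # prints True
```

In the notebook, the corresponding call would be `is_balanced([r['user_id'] for r in keep_reviews], num_reviews)`.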

Now we’ll split the reviews we kept into two lists:

    one for the texts of the reviews,
    another for the author ids.

The two lists are implicitly associated by index (i.e. the first text in our texts list was written by the first author in our authors list).

In machine learning, we refer to these as “instances” and “labels”.

    Instances are the things we learn from.
    Labels are the things we are trying to learn.

In [61]:
%%time
texts = [review['text'] for review in keep_reviews]
authors = [review['user_id'] for review in keep_reviews]

CPU times: user 46.9 ms, sys: 342 ms, total: 389 ms
Wall time: 659 ms


Next, we need to import some things from the scikit-learn library. Specifically, we need a vectorizer (something that transforms our texts into a numerical representation that’s easier to work with) and a classifier (the thing that learns how to discriminate based on labeled examples). 

We’ll be using TfidfVectorizer, which transforms our texts into vectors with tf-idf weighting, and LinearSVC, which is a Support Vector Machine with a linear kernel (a kernel that is often used for text classification tasks).
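
To make the tf-idf weighting concrete, here is a pure-Python sketch on a toy corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what scikit-learn documents as the TfidfVectorizer defaults. Tokenization is simplified to lowercasing and splitting on whitespace, so `tfidf_vectors` is a hypothetical illustration and not a drop-in replacement for the real vectorizer.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text using smoothed idf and L2 norm."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        }
        # Scale each text to unit length so long and short reviews are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors


toy_texts = ["the pizza was okay", "the service was great", "great pizza"]
for vector in tfidf_vectors(toy_texts):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Terms that occur in fewer texts (such as "okay" here) receive a higher weight than terms shared by many texts (such as "the"), which is what lets rarer, author-specific vocabulary stand out.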

We’ll also import a helper function called train_test_split. We’ll use this to split our data into a training set and a test set.

The classifier will learn patterns from the training set, and then we’ll make sure that it actually works by seeing if it can correctly predict the authors in the held-out test set.

In [62]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

CPU times: user 16 µs, sys: 0 ns, total: 16 µs
Wall time: 20 µs


### Transform the reviews / texts into vectors by setting up a vectorizer and giving it the list of our texts

In [63]:
%%time
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)

CPU times: user 6.66 s, sys: 113 ms, total: 6.78 s
Wall time: 6.78 s


In [64]:
print(f"Sparse matrix shape: {vectors.shape} ({num_prolific_reviewers} authors x {num_reviews} reviews = {vectors.shape[0]} rows)")
print(f"Number of Features: {vectors.shape[1]}")

Sparse matrix shape: (50000, 69810) (100 authors x 500 reviews = 50000 rows)
Number of Features: 69810


Our vectors variable contains a sparse matrix with 50000 rows (500 texts times 100 authors, the number of texts we kept) and 69810 columns, which is the number of features.

Our features are single words (unigrams), and each text is represented by a tf-idf weight for each word, based on how often that word appears in that text. The weight is nearly always 0, as we have 69810 unique words in the dataset and any one review uses only a few of them.

Often in machine learning, you’ll see the instances (texts in our case) referred to as Xs and the labels as ys. You can think about machine learning tasks as a function y = f(x). We have x (the review text) and we want to know y (the author’s ID). The SVM attempts to learn a function f that can map the texts to the labels.

We’ll follow this convention, and break our texts into X_train (the texts we’ll show the SVM as learning examples) and X_test (the texts we won’t show to the SVM, so we can see if it’s able to predict the correct authors for these texts based on the patterns it learned from the texts in X_train). We’ll similarly break our labels (the author ids) into two arrays: y_train and y_test.

We can use the train_test_split function provided by scikit-learn to take a random sample of our texts and labels while making sure that the indices still correspond, as follows.
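
The idea of splitting instances and labels while keeping their indices aligned can be sketched in plain Python. This is an illustration of the principle only: `split_aligned` is a hypothetical helper, it works on lists instead of sparse matrices, and it will not reproduce the exact sample that scikit-learn draws for the same seed.

```python
import random


def split_aligned(instances, labels, test_size=0.2, seed=1337):
    """Shuffle indices once and split instances and labels with the same indices."""
    if len(instances) != len(labels):
        raise ValueError("instances and labels must have the same length")
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [instances[i] for i in train_idx]
    x_test = [instances[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical toy data: ten texts written by two alternating authors.
toy_texts = [f"review {i}" for i in range(10)]
toy_authors = [f"author {i % 2}" for i in range(10)]
x_train, x_test, y_train, y_test = split_aligned(toy_texts, toy_authors)
print(len(x_train), len(x_test))  # prints 8 2
```

Because one shuffled index list drives both splits, each text stays paired with its author. scikit-learn's train_test_split also accepts a stratify argument that keeps each label's share the same in both sets; the notebook does not use it.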

In [65]:
# We use a fixed random_state to ensure that the same random sampling is used every time the code is run.
X_train, X_test, y_train, y_test = train_test_split(vectors, authors, test_size=0.2, random_state=1337)

We now have 40000 texts (80% of our data) to train on and 10000 (20%) for testing:

In [66]:
X_train.shape, X_test.shape

((40000, 69810), (10000, 69810))

### TRAINING AND TESTING A CLASSIFIER

We first need to call fit on our classifier and pass in the training texts and labels. Then we’ll call predict to get predictions on the test data, and look at some metrics to see how well it did.

In [67]:
%%time
svm = LinearSVC()
svm.fit(X_train, y_train)

CPU times: user 14.3 s, sys: 62.1 ms, total: 14.4 s
Wall time: 14.4 s


We can now make predictions on the test set (note that the SVM has never seen the labels from the test set that are stored in y_test). The SVM will output whichever user_id it thinks is most likely to be the author of that review, for each review we pass in.

In [68]:
predictions = svm.predict(X_test)

In [69]:
# Count how many predicted author ids match the true author ids.
correct_prediction = 0
for prediction, true_author in zip(predictions, y_test):
    if prediction == true_author:
        correct_prediction += 1

If the classifier did a good job, the predictions should be similar to the test labels (y_test).

In [70]:
correct_prediction/len(y_test)

0.8554

In [71]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

0.8554
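
An accuracy figure is easier to judge next to a baseline. With 100 equally represented authors, guessing at random would be right about 1% of the time. A single overall number can also hide authors who are much harder to identify than others, so a per-author breakdown is a natural follow-up. The sketch below uses hypothetical toy labels in place of y_test and predictions, and `per_label_accuracy` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter


def per_label_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's items predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(truth for truth, pred in zip(y_true, y_pred) if truth == pred)
    return {label: hits[label] / total for label, total in totals.items()}


# Hypothetical toy labels standing in for y_test and predictions.
toy_true = ["a", "a", "b", "b", "c", "c"]
toy_pred = ["a", "b", "b", "b", "c", "a"]
print(per_label_accuracy(toy_true, toy_pred))  # prints {'a': 0.5, 'b': 1.0, 'c': 0.5}

n_labels = len(set(toy_true))
print(f"chance level for {n_labels} balanced labels: {1 / n_labels:.2f}")  # prints 0.33
```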

### IMPROVING OUR SYSTEM

Let’s see if we can tune some parameters and do even better.

The following vectorizer looks at words individually (unigrams) but also looks at pairs of words (bigrams). This makes our feature space much larger, so we’ll need a bit more processing time, both to create the vectors, and to train the SVM using the vectors. To alleviate this issue, we’ll tell the vectorizer to ignore all words and word pairs that don’t appear in at least five different reviews — there are a lot of very rarely used words, and we can’t learn anything from these anyway.
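
What the two settings do can be sketched in plain Python: generate unigrams and bigrams for each text, then keep only those that occur in at least a minimum number of different texts. The helpers `ngram_features` and `build_vocabulary` are hypothetical, tokenization is simplified to whitespace splitting, and the toy example uses a threshold of 2 instead of 5 so that the effect is visible on three texts.

```python
from collections import Counter


def ngram_features(tokens, ngram_range=(1, 2)):
    """Return all word n-grams of the requested lengths as space-joined strings."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(texts, ngram_range=(1, 2), min_df=2):
    """Keep n-grams that occur in at least min_df different texts."""
    doc_freq = Counter()
    for text in texts:
        # set() ensures each n-gram is counted once per text.
        doc_freq.update(set(ngram_features(text.lower().split(), ngram_range)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


toy_texts = ["the pizza was okay", "the pizza was great", "great service"]
print(build_vocabulary(toy_texts))
# prints ['great', 'pizza', 'pizza was', 'the', 'the pizza', 'was']
```

Bigrams such as "was okay" or "great service" appear in only one text here and are dropped, which is how the document-frequency threshold keeps the larger feature space manageable.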

In [72]:
%%time
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=5)
vectors = vectorizer.fit_transform(texts)

CPU times: user 23.8 s, sys: 683 ms, total: 24.5 s
Wall time: 24.5 s


In [73]:
print(f"Now we have {vectors.shape[1]} features.")

Now we have 202611 features.


In [74]:
X_train, X_test, y_train, y_test = train_test_split(vectors, authors, test_size=0.2, random_state=1337)

In [75]:
%%time
svm = LinearSVC()
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)

CPU times: user 45.1 s, sys: 186 ms, total: 45.3 s
Wall time: 45.3 s


In [76]:
print(f"We can now identify the correct author {accuracy_score(y_test, predictions)*100}% of the time.\nConsidering that some of the reviews are only a few sentences long, it is perhaps surprising that the writing styles are distinctive enough.")

We can now identify the correct author 89.75% of the time.
Considering that some of the reviews are only a few sentences long, it is perhaps surprising that the writing styles are distinctive enough.


### CONCLUSION

This could help find people who have more than one Yelp account for the purposes of promoting their own establishment or leaving bad reviews for competitors. Authorship attribution is also used in forensic linguistics, where the true authorship of a document such as a will or suicide note is sometimes questioned, and in studying the authorship of disputed literary works, such as Shakespeare’s plays or books written under pseudonyms.