In [66]:
import numpy as np
from bs4 import BeautifulSoup
import nltk
import random

The goal of this notebook is to implement an article spinner, which replaces some words of a given text with synonyms.

Code partly adapted from: https://deeplearningcourses.com/c/data-science-natural-language-processing-in-python

In [67]:
# represent the given text as trigrams in which the middle word is replaced by its POS tag
# (w1, PosTag(w2), w3) is the key, [ w2 ] are the values
# w1 is the previous word, w2 the current word and w3 the next word

def lang_model(corpus):
    
    ngram = {}
    for review in corpus:
        s = review.text.lower()
        tokens = nltk.tokenize.word_tokenize(s)
        
        for i in range(len(tokens) - 2):
            k = (tokens[i], nltk.pos_tag([tokens[i+1]])[0][1], tokens[i+2])
            if k not in ngram:
                ngram[k] = []
            ngram[k].append(tokens[i+1])
    
    # turn each array of middle-words into a probability vector
    ngram_prob = {}
    for k, words in ngram.items():

        if len(set(words)) > 1:
            d = {}
            n = 0
            for w in words:
                if w not in d:
                    d[w] = 0
                d[w] += 1
                n += 1
            for w, c in d.items():
                d[w] = float(c) / n
            ngram_prob[k] = d
            
    return ngram_prob
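
The counting and normalisation steps of `lang_model` can be tried on a toy corpus without downloading any NLTK data. The sketch below is only an illustration under stated assumptions: `build_context_model`, `toy_tag` and `toy_corpus` are hypothetical names that do not appear in the notebook, and `toy_tag` is a stand-in for `nltk.pos_tag`, not a real tagger.

```python
from collections import Counter, defaultdict


def build_context_model(token_lists, tag):
    """Map (previous word, tag of middle word, next word) to {middle word: probability}."""
    counts = defaultdict(Counter)
    for tokens in token_lists:
        for i in range(len(tokens) - 2):
            key = (tokens[i], tag(tokens[i + 1]), tokens[i + 2])
            counts[key][tokens[i + 1]] += 1

    model = {}
    for key, counter in counts.items():
        # keep only contexts with at least two distinct candidate words,
        # mirroring the len(set(words)) > 1 check in lang_model
        if len(counter) > 1:
            total = sum(counter.values())
            model[key] = {w: c / total for w, c in counter.items()}
    return model


def toy_tag(word):
    # hypothetical stand-in for nltk.pos_tag, for illustration only
    return "ADJ" if word in {"good", "great"} else "OTHER"


toy_corpus = [
    ["a", "good", "camera"],
    ["a", "great", "camera"],
    ["a", "good", "camera"],
]
toy_model = build_context_model(toy_corpus, toy_tag)
print(toy_model)
```

Here the single surviving context is `("a", "ADJ", "camera")`, with "good" seen twice and "great" once, so "good" gets two thirds of the probability mass.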


In [68]:
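# return an element of d_probs, sampled according to the given probabilities
def random_sample(d_probs):
    r = random.random()
    cumulative = 0
    w = None
    for w, p in d_probs.items():
        cumulative += p
        if r < cumulative:
            return w
    # floating-point rounding can leave the cumulative sum slightly below 1;
    # fall back to the last word instead of implicitly returning None
    return w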


def test_spinner(reviews, ngram):
    
    chosen_prob = 1
    s = random.choice(reviews).text.lower()
    
    print("Original Text:", s)
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        if random.random() < chosen_prob:
            k = (tokens[i], nltk.pos_tag([tokens[i+1]])[0][1], tokens[i+2])
            if k in ngram:
                w = random_sample(ngram[k])
                tokens[i+1] = w
    print("Updated Text:")
    print(" ".join(tokens).replace(" .", ".").replace(" '", "'").replace(" ,", ",").replace("$ ", "$").replace(" !", "!"))
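
The cumulative-sum loop in `random_sample` draws a word in proportion to its probability. As a possible alternative, the standard library's `random.choices` accepts weights directly. The sketch below assumes a probability dictionary shaped like the values of `ngram`; `sample_word` and the example probabilities are hypothetical and not part of the notebook.

```python
import random


def sample_word(word_probs, rng=random):
    """Draw one key of word_probs, weighted by its probability."""
    words = list(word_probs)
    weights = list(word_probs.values())
    return rng.choices(words, weights=weights, k=1)[0]


# A seeded generator makes the experiment repeatable.
rng = random.Random(0)
draws = [sample_word({"good": 0.75, "great": 0.25}, rng) for _ in range(10_000)]

# The empirical frequency of "good" should be close to 0.75.
print(draws.count("good") / len(draws))
```

Because `random.choices` normalises the weights itself, this version does not depend on the probabilities summing to exactly 1.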



In [69]:
# get the data
# data courtesy of http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html

# read the file inside a context manager so that it is closed afterwards
with open('electronics/positive.review') as f:
    positive_reviews = BeautifulSoup(f.read(), "lxml")
reviews = positive_reviews.find_all('review_text')


ngram = lang_model(reviews)

test_spinner(reviews, ngram)

Original Text: 
i like it very much, i use it in an outdoor trail camera for wild game and it hold several pictures and has been in the camera in all kinds of weather. i would buy kingston again

Updated Text:
i like them very small, everyday use it in an outdoor trail camera for wild game and it hold several months and has been for the device in all kinds of sink. i'd buy kingston again


As can be observed, the quality of the output is not great: some of the replacements don't make sense in the given context. This is likely due to the Markov assumption made when building the n-gram model, which conditions each replacement only on the immediately preceding and following words. Still, article spinning is a real-world scenario where NLP can make a valid contribution.
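
Two details of `test_spinner` may also contribute, although the notebook does not measure this: `chosen_prob` is 1, so every matching position is replaced, and the replacement is written back into `tokens`, so later lookups use already-replaced words as context. A minimal sketch of a variant that reads contexts from the original tokens and replaces less often is given below. `spin_tokens`, `replace_prob`, `toy_tag` and `toy_model` are hypothetical names, and `toy_tag` is only a stand-in for `nltk.pos_tag`.

```python
import random


def spin_tokens(tokens, model, tag, replace_prob=0.3, rng=random):
    """Return a copy of tokens with some middle words resampled from model."""
    original = list(tokens)  # contexts are always read from the unmodified text
    spun = list(tokens)      # replacements are written here
    for i in range(len(original) - 2):
        key = (original[i], tag(original[i + 1]), original[i + 2])
        candidates = model.get(key)
        if candidates and rng.random() < replace_prob:
            words = list(candidates)
            weights = list(candidates.values())
            spun[i + 1] = rng.choices(words, weights=weights, k=1)[0]
    return spun


def toy_tag(word):
    # hypothetical stand-in for nltk.pos_tag, for illustration only
    return "ADJ" if word in {"good", "great"} else "OTHER"


toy_model = {("a", "ADJ", "camera"): {"good": 0.5, "great": 0.5}}

print(spin_tokens(["a", "good", "camera"], toy_model, toy_tag, replace_prob=1.0))
```

Whether this actually improves the spun text would have to be checked on the review data; the sketch only separates the context lookup from the replacement.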