# Introduction to NLP: Traditional approaches
# Article spinner using n-grams

- **What's an article spinner?**


*Article spinning is a writing technique used in search engine optimization (SEO), and other applications, which creates what appears to be new content from what already exists. Content spinning works by replacing specific words, phrases, sentences, or even entire paragraphs with any number of alternate versions to provide a slightly different variation with each spin - also known as Rogeting. This process can be completely automated or written manually as many times as needed. Early content produced through automated methods often resulted in articles which were hard or even impossible to read. However, as article spinning techniques were refined they became more sophisticated, and can now result in perfectly readable articles which appear original.*


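- **How are we going to solve this problem?**
We build trigrams from a corpus of product reviews and group them by their first and last words. For each (first word, last word) pair we estimate the probability of every middle word seen between them. To spin a text, we then replace some of its middle words with a word sampled from that distribution.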

This sentiment dataset was used in the paper: John Blitzer, Mark Dredze, Fernando Pereira. **Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification**. Association for Computational Linguistics (ACL), 2007.
Link to data source: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html

Link to code: https://github.com/lazyprogrammer/machine_learning_examples/blob/master/nlp_class/article_spinner.py
Link to blog post or additional reading:
- Some links about n-grams


### Importing the libraries

In [30]:
import pandas as pd
import numpy as np
import os
import random

from bs4 import BeautifulSoup
import nltk
# Downloading some nltk resources, just once
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\edumu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

We set the variables for the data location.

In [6]:
# Global parameters
#root folder
root_folder='.'
#data_folder='.'
data_folder_name='.'
data_filename='unlabeled.review'
# Variable for data directory
DATA_PATH = os.path.abspath(os.path.join(root_folder, data_folder_name))

# Both train and test set are in the root data directory
train_path = DATA_PATH
#test_path = DATA_PATH

#Relevant columns
TEXT_COLUMN = 'text'
TARGET_COLUMN = 'target'


### Loading the dataset

In [8]:
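# Load the reviews. os.path.join keeps the path portable across operating systems
with open(os.path.join(train_path, data_filename)) as f:
    # Name the parser explicitly; 'html.parser' ships with Python
    reviews = BeautifulSoup(f.read(), 'html.parser')
# Each review body sits inside a <review_text> tag
reviews = reviews.find_all('review_text')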

In [10]:
type(reviews)

bs4.element.ResultSet

## Define and create N-grams and their probabilities

- **What is an n-gram?** Here we use trigrams: for a word at position x, the trigram is formed by the words at positions x-1, x, and x+1.

### N-grams
An n-gram is a contiguous sequence of n items (where the items can be characters, syllables, or words). A 1-gram is a unigram, a 2-gram is a bigram, and a 3-gram is a trigram.

Here, we are referring to sequences of words. So examples of bigrams include "the dog", "said that", and "can't you".
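
As a minimal sketch of the definition above, the following helper slides a window of size n over a list of tokens. The name `extract_ngrams` and the whitespace tokenization are assumptions made for illustration; the notebook itself tokenizes with `nltk.tokenize.word_tokenize`.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace splitting keeps the example dependency-free
tokens = "the dog said that".split()

bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sequence shorter than n simply yields an empty list, so no special case is needed for short reviews.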

- **What are the probabilities of a trigram?** Markov assumptions: pros and cons

The assumption that the probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks n−1 words into the past).

Thus, the general equation for this n-gram approximation to the conditional probability of the next word in a sequence is:

$P(W_n \mid W_1^{n-1}) \approx P(W_n \mid W_{n-N+1}^{n-1})$
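
The spinner in this notebook uses a slightly different conditional: it predicts the middle word of a trigram from both of its neighbours. As a sketch, with $C(\cdot)$ denoting a count over the corpus (a symbol introduced here for illustration), the quantity computed by `trigram_proba` is the relative frequency of each middle word among all words seen between the same pair of neighbours:

$$P(W_n \mid W_{n-1}, W_{n+1}) = \frac{C(W_{n-1}\, W_n\, W_{n+1})}{\sum_{w} C(W_{n-1}\, w\, W_{n+1})}$$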


In [24]:
# extract trigrams and insert into dictionary
# (w1, w3) is the key, [ w2 ] are the values
def my_trigrams(reviews):
    trigrams = {}
    for review in reviews:
        s = review.text.lower()
        # ANY OTHER DATA PREPROCESSING BEFORE TOKENIZE
        tokens = nltk.tokenize.word_tokenize(s)
        for i in range(len(tokens) - 2):
            k = (tokens[i], tokens[i+2])
            if k not in trigrams:
                trigrams[k] = []
            trigrams[k].append(tokens[i+1])

    return trigrams

# Create the probability vector of the trigrams
def trigram_proba(trigrams):
    trigram_probabilities = {}
    for k, words in trigrams.items():
        # create a dictionary of word -> count
        if len(set(words)) > 1:
            # only do this when there are different possibilities for a middle word
            d = {}
            n = 0
            for w in words:
                if w not in d:
                    d[w] = 0
                d[w] += 1
                n += 1
            for w, c in d.items():
                d[w] = float(c) / n
            trigram_probabilities[k] = d
            
    return trigram_probabilities
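
To make the data structure concrete, here is a small self-contained sketch of the same idea on a toy corpus. The corpus, the whitespace tokenization, and the name `middle_word_probs` are assumptions made for illustration. It uses `collections.Counter` rather than the hand-written counting above, and it keeps every key, including those with a single candidate.

```python
from collections import Counter, defaultdict


def middle_word_probs(sentences):
    """Map (previous word, next word) -> {middle word: relative frequency}."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - 2):
            counts[(tokens[i], tokens[i + 2])][tokens[i + 1]] += 1

    probs = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        probs[key] = {word: c / total for word, c in counter.items()}
    return probs


# Hypothetical toy corpus: three sentences that differ in one word each
toy_corpus = [
    "i went ahead and bought it",
    "i went out and bought it",
    "i went ahead and sold it",
]
probs = middle_word_probs(toy_corpus)
print(probs[("went", "and")])
```

Here the pair `("went", "and")` has two candidate middle words, "ahead" (seen twice) and "out" (seen once), so it is the kind of key the spinner can use to swap one word for another.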

Using the previously defined functions we can extract the trigrams and their probabilities:

In [17]:
tgrams=my_trigrams(reviews)
tgrams_probs = trigram_proba(tgrams)
print('Trigrams:',len(tgrams_probs))

In [28]:
len(tgrams_probs)

103105

In [29]:
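# Return a token based on a random sample
def random_sample(d):
    # choose a random sample from dictionary where values are the probabilities
    r = random.random()
    cumulative = 0
    for w, p in d.items():
        cumulative += p
        if r < cumulative:
            return w
    # floating-point rounding can leave the total just below 1;
    # fall back to the last word instead of returning None
    return w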
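
An alternative sketch of the same sampling step uses `random.choices` from the standard library, which accepts weights directly and so avoids the manual cumulative sum. The name `sample_word` and the example distribution are assumptions made for illustration.

```python
import random


def sample_word(distribution):
    """Draw one key from a {word: probability} dictionary."""
    words = list(distribution)
    weights = list(distribution.values())
    # random.choices returns a list of k items; k defaults to 1
    return random.choices(words, weights=weights)[0]


# Hypothetical distribution over candidate middle words
candidates = {"ahead": 0.5, "out": 0.3, "back": 0.2}

random.seed(0)  # fixed seed only so repeated runs draw the same samples
draws = [sample_word(candidates) for _ in range(1000)]
print({w: draws.count(w) / len(draws) for w in candidates})
```

Over many draws the observed frequencies should approach the probabilities in the dictionary, which is a quick way to sanity-check either sampler.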


In [32]:
# Select a random review
review = random.choice(reviews)
# Transform to lowercase
s = review.text.lower()
print("Original Text:", s)
# Extract the tokens
tokens = nltk.tokenize.word_tokenize(s)
for i in range(len(tokens) - 2):
    if random.random() < 0.2: # 20% chance of replacement
        # Extract the elements in the trigrams
        k = (tokens[i], tokens[i+2])
        if k in tgrams_probs:
            # Get a sample word based on the trigram probabilities
            w = random_sample(tgrams_probs[k])
            tokens[i+1] = w
print("Spun:")
print(" ".join(tokens).replace(" .", ".").replace(" '", "'").replace(" ,", ",").replace("$ ", "$").replace(" !", "!"))


Original Text: 
i purchased the base unit (which includes one handset) and two additional handsets.  they worked okay for about a month or so, then the handset started inexplicably dying.  sometimes the screen went totally blank and you would lose the call or couldn't dial.  othertimes the screen said that i was too far away from the base (even when i was literally right next to the base).  so i called customer service and they said that it sounded like i needed a new battery (after a month!).  i actually read the directions before using the phone and i wasn't even putting the phones on the bases to recharge after every call so that the ni cd batteries wouldn't develop a memory (they don't use the better lithium ion batteries that do not develop a memory).  so this made no sense.  anyway, i went ahead and bought the batteries (for all three handsets) and that didn't solve the problem at all.  i then went ahead and purchased another handset.  again, no problems for a month and then the 

Some replacements read naturally:
- Original: othertimes the screen said that i was too far away
- New: othertimes the screen shows that i was too far away
- Original: i went ahead and bought the batteries
- New: i went out and bought the batteries

But some others are completely wrong:
- Original: i know i wasn't doing anything wrong to cause this.
- New: i know i was n't doing anything negative to cause this set

In the last example the final period was treated as an ordinary token and replaced by a word, so a next step is to **remove punctuation** (or skip it) before spinning.
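
The stray space in "was n't" comes from the tokenizer, which splits contractions into two tokens, while the chain of `replace` calls used when printing only repairs a handful of punctuation marks. A possible sketch of a more thorough rejoining step follows; the name `join_tokens` and the regular expressions are assumptions made for illustration, and they only cover common English cases.

```python
import re


def join_tokens(tokens):
    """Join tokens with spaces, then reattach contractions and punctuation."""
    text = " ".join(tokens)
    # "was n't" -> "wasn't"
    text = re.sub(r" (n't)\b", r"\1", text)
    # "i 'm" -> "i'm", "they 're" -> "they're"
    text = re.sub(r" ('(?:s|m|d|re|ve|ll))\b", r"\1", text)
    # remove the space before closing punctuation
    text = re.sub(r" ([.,!?;:)])", r"\1", text)
    # remove the space after an opening parenthesis or a dollar sign
    text = re.sub(r"([($]) ", r"\1", text)
    return text


tokens = ["i", "know", "i", "was", "n't", "doing", "anything", "wrong", "."]
print(join_tokens(tokens))
```

This only changes how the spun tokens are printed; it does not stop punctuation tokens from being replaced by words, which still needs to be handled when the trigrams are built or applied.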