# SUMZ - Amazon reviews summarization chrome extension

### Data preprocessing

Here we're cleaning up the Amazon reviews dataset from [Kaggle.](https://www.kaggle.com/snap/amazon-fine-food-reviews)

It has 568,454 food reviews on Amazon up to October 2012, with the following columns in Reviews.csv:


| Field        | Description
|:------------- |:-------------
| Id      | ID of review
| ProductId      | unique identifier for the product
| UserId | unqiue identifier for the user
| ProfileName | -- 
| HelpfulnessNumerator | number of users who found the review helpful
| HelpfulnessDenominator | number of users who indicated whether they found the review helpful
| Score | rating between 1 and 5
| Time | timestamp for the review  
| Summary | brief summary of the review  
| Text | text of the review

The only columns that we care about are <b>Text</b> and <b>Summary</b>; our motivation is to use the text-summary pairs to train our sequence-to-sequence model to generate its own summaries given a review text (which we'll be scraping from the Amazon product page).


In [125]:
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
import time
from helpers import text_cleaning
import pickle

### Inspecting the reviews

In [6]:
reviews = pd.read_csv("Reviews.csv")

In [7]:
reviews.head(3)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [8]:
## Drop rows with any NAs,
## Drop columns except 'Summary' and 'Text'
reviews = reviews.dropna()
reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator',
                        'Score','Time'], 1)
reviews = reviews.reset_index(drop=True)

In [9]:
reviews.head()

Unnamed: 0,Summary,Text
0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,"""Delight"" says it all",This is a confection that has been around a fe...
3,Cough Medicine,If you are looking for the secret ingredient i...
4,Great taffy,Great taffy at a great price. There was a wid...


Let's dig into some of the reviews and see their summary pairs

In [17]:
reviews_to_inspect = 5
for i in range(reviews_to_inspect):
    print("### REVIEW TEXT {}:\n{}".format(i+1, reviews.Text[i]))
    print("### SUMMARY {}:{}".format(i+1, reviews.Summary[i]))
    print("")
    

### REVIEW TEXT 1:
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
### SUMMARY 1:Good Quality Dog Food

### REVIEW TEXT 2:
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
### SUMMARY 2:Not as Advertised

### REVIEW TEXT 3:
This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - t

### Cleaning the text
Let's clean up this text a bit to help our network -- we'll do the following:
- Convert text to lowercase
- Replace contractions with proper form
- Remove unwanted characters
- Remove stopwords (in the text but NOT the summary, to make summary sound natural)

In [116]:
def clean_text(text, remove_stopwords=True):

    text = text.lower()
    text = text.split()
    uncontracted_text = []
    
    # Remove contractions
    for word in text:
        if word in text_cleaning.contractions:
            uncontracted_text.append(text_cleaning.contractions[word])
        else:
            uncontracted_text.append(word)
    text = " ".join(uncontracted_text)

    # Remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    # Remove stop words
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))
        text = text.split()
        text = [word for word in text if not word in stop_words]
        text = " ".join(text)
    
    return text

In [123]:
clean_summaries = [clean_text(text, remove_stopwords=False) for text in reviews.Summary]
clean_texts     = [clean_text(text, remove_stopwords=True)  for text in reviews.Text]

In [124]:
reviews_to_inspect = 5
for i in range(reviews_to_inspect):
    print("### REVIEW TEXT {}:\n{}".format(i+1, clean_texts[i]))
    print("### SUMMARY {}:{}".format(i+1, clean_summaries[i]))
    print("")

### REVIEW TEXT 1:
bought several vitality canned dog food products found good quality product looks like stew processed meat smells better labrador finicky appreciates product better
### SUMMARY 1:good quality dog food

### REVIEW TEXT 2:
product arrived labeled jumbo salted peanuts peanuts actually small sized unsalted sure error vendor intended represent product jumbo
### SUMMARY 2:not as advertised

### REVIEW TEXT 3:
confection around centuries light pillowy citrus gelatin nuts case filberts cut tiny squares liberally coated powdered sugar tiny mouthful heaven chewy flavorful highly recommend yummy treat familiar story c lewis lion witch wardrobe treat seduces edmund selling brother sisters witch
### SUMMARY 3: delight  says it all

### REVIEW TEXT 4:
looking secret ingredient robitussin believe found got addition root beer extract ordered good made cherry soda flavor medicinal
### SUMMARY 4:cough medicine

### REVIEW TEXT 5:
great taffy great price wide assortment yummy taffy del

### Checkpoint: saving cleaned texts / summaries

In [126]:
# Dump the cleaned texts to save for later in case we need iti
cleaned_texts_path = './checkpointed_data/cleaned_texts.p'
pickle.dump((clean_texts, clean_summaries), open(cleaned_texts_path, 'wb'))

In [127]:
# Load in cleaned data from checkpoint
cleaned_texts_path = './checkpointed_data/cleaned_texts.p'
clean_texts, clean_summaries = pickle.load(open(cleaned_texts_path, mode='rb'))

### Embedding the words into vectors

We can't feed text directly into the model, and instead of one-hot encoding (which will make massive sparse matrices for each word in which most characters are 0), we'll instead use pre-trained word embeddings.

<img src="images/word2vec_diagrams.png"/>
<i>source: https://deeplearning4j.org/word2vec.html</i>

Instead of word2vec or GloVe, we'll use [ConceptNet Numberbatch](https://github.com/commonsense/conceptnet-numberbatch). This seems to be the best of everything since it has an ensemble of the above-mentioned word embeddings.

<b>Formal attribution:<b>
<i>This data contains semantic vectors from ConceptNet Numberbatch, by
Luminoso Technologies, Inc. You may redistribute or modify the
data under the terms of the CC-By-SA 4.0 license.</i>


First let's remove vocabulary words that are not in the ConceptNet (CN) embeddings; however, if these non-included words are showing up in reviews over a threshold (say 20), we'll still include them by assigning them a vector of random embeddings.

In [134]:
from collections import Counter
def get_word_counts(clean_summaries, clean_texts):
    total_counts = Counter()
    for sentence in (clean_summaries + clean_texts):
        for word in sentence.split():
            if word not in total_counts:
                total_counts[word] = 1
            else:
                total_counts[word] += 1
    return total_counts

In [135]:
word_counts = get_word_counts(clean_summaries, clean_texts)
print("Total size of all vocabulary: {}".format(len(word_counts)))

Total size of all vocabulary: 132884


Now we make a word matrix from the ConceptNet embeddings

In [137]:
embed_index = {}
with open('./numberbatch-en-17.06.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embed_index[word] = embedding
print("Total word embeddings from CN:", len(embed_index))

Total word embeddings from CN: 417195


Let's find words that are more than our threshold but not in CN, so we can make our own embeddings for those words

In [None]:
def find_missing_words(word_counts, embed_index):
    not_in_cn = 0
    word_threshold = 20 # If it appears more than 20 times, lets make our own embedding for it
    missing_words = [word for word, count in word_counts.items() if (count > word_threshold and not word in embed_index)]
    print("Words missing from CN: {}, ({}% of our vocabulary)".format(len(missing_words), round(len(missing_words)/len(word_counts),4)*100))
    return missing_words

In [144]:
def word_dicts(word_counts, embed_index, threshold):
    vocab_to_int = {}
    value = 0
    for word, count in word_counts.items():
        if count >= threshold or word in embed_index:
            vocab_to_int[word] = value
            value += 1
    
    # Special codes to include
    codes = ["<UNK>","<PAD>","<EOS>","<GO>"]  
    for code in codes:
        vocab_to_int[code] = len(vocab_to_int)
    
    # Reverse dictionary
    int_to_vocab = {}
    for word, value in vocab_to_int.items():
        int_to_vocab[value] = word
    
    # Print stats
    usage_ratio = round(len(vocab_to_int) / len(word_counts),4)*100
    print("Total set of possible words:", len(word_counts))
    print("Number of words in our vocab:", len(vocab_to_int))
    print("Percent of words we're using: {}%".format(usage_ratio))
    
    return vocab_to_int, int_to_vocab
    

In [150]:
def make_word_embed_matrix(vocab_to_int, embed_index, embedding_dim=300):
    nb_words = len(vocab_to_int)
    
    # Create initial matrix of shape [nb_words,embedding_dim] with all zeros
    word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype=np.float32)
    for word, idx in vocab_to_int.items():
        if word in embed_index:
            word_embedding_matrix[idx] = embed_index[word]
        else:
            # If it's not in CN, we make a random embedding
            new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
            embed_index[word] = new_embedding
            word_embedding_matrix[idx] = new_embedding
    
    print("Number of words in embedding matrix: ", len(word_embedding_matrix))
    print("Number of words in vocab_to_int    : ", len(vocab_to_int))
    return word_embedding_matrix
            
    
    

In [146]:
vocab_to_int, int_to_vocab = word_dicts(word_counts, embed_index, 20)

Total set of possible words: 132884
Number of words in our vocab: 59595
Percent of words we're using: 44.85%


In [151]:
word_embedding_matrix = make_word_embed_matrix(vocab_to_int, embed_index)

Number of words in embedding matrix:  59595
Number of words in vocab_to_int    :  59595


Now let's actually convert all words in both the text and clean summaries into their word-embedding representations.

This means each input into our model (a review) is actually an array of size N where N = number of words (we'll use padding below so all reviews are the same length). So our total features will be size [M x N] where M is the number of reviews in our data.

In [160]:
'''
We're converting words to integers per vocab_to_int.
We're also replacing words we don't know with UNK's code.
And then adding an EOS token to end of each review.
'''
def convert_text_to_ints(text, vocab_to_int, word_count, unk_count, eos=False):
    all_word_ints = []
    for sentence in text:
        sentence_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentence_ints.append(vocab_to_int[word])
            else:
                sentence_ints.append(vocab_to_int['<UNK>'])
                unk_count += 1
        if eos:
            sentence_ints.append(vocab_to_int['<EOS>'])
        all_word_ints.append(sentence_ints)
    return all_word_ints, word_count, unk_count
    

In [162]:
word_count = 0
unk_count = 0
int_summaries, word_count, unk_count = convert_text_to_ints(clean_summaries, vocab_to_int, word_count, unk_count)

# We are only adding <EOS> to the review (not the summary)
int_texts, word_count, unk_count = convert_text_to_ints(clean_texts, vocab_to_int, word_count, unk_count, eos=True)

unk_perc = round(unk_count / word_count,4)*100
print("Total number of words in reviews (and summaries):", word_count)
print("Total number of UNKs in reviews (and summaries):", unk_count)
print("Percent of words that are UNK: {}%".format(unk_perc))

Total number of words in reviews (and summaries): 25679946
Total number of UNKs in reviews (and summaries): 192245
Percent of words that are UNK: 0.75%


### Checkpoint: saving vocab_to_int, int_to_vocab, word_embedding_matrix

In [180]:
# Dump the data to save for later in case we need iti
word_dicts = './checkpointed_data/word_dicts.p'
pickle.dump((vocab_to_int, int_to_vocab, word_embedding_matrix), open(word_dicts, 'wb'))

In [181]:
# Load in data from checkpoint
word_dicts = './checkpointed_data/word_dicts.p'
vocab_to_int, int_to_vocab, word_embedding_matrix = pickle.load(open(word_dicts, mode='rb'))

In [182]:
len(word_embedding_matrix)

59595

Now all sentences are replaced with the integer values for their respective words.

Let's do the following to filter out the sentences we don't want to includes:
- Only include reviews that are between a predefined min / max sentence length (we don't want super long ones or super short ones)
- Remove reviews with too many UNK words

In [167]:
def unk_counter(text):
    unk_count = 0
    for word in text:
        if word == vocab_to_int['<UNK>']:
            unk_count += 1
    return unk_count

In [169]:
def create_lengths(text):
    '''Create a data frame of the sentence lengths from a text'''
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])

In [170]:
lengths_summaries = create_lengths(int_summaries)
lengths_texts = create_lengths(int_texts)

print("Summaries:")
print(lengths_summaries.describe())
print()
print("Texts:")
print(lengths_texts.describe())

Summaries:
              counts
count  568412.000000
mean        4.181620
std         2.657872
min         0.000000
25%         2.000000
50%         4.000000
75%         5.000000
max        48.000000

Texts:
              counts
count  568412.000000
mean       41.996782
std        42.520854
min         1.000000
25%        18.000000
50%        29.000000
75%        50.000000
max      2085.000000


In [171]:
# Inspect the length of texts
print(np.percentile(lengths_texts.counts, 90))
print(np.percentile(lengths_texts.counts, 95))
print(np.percentile(lengths_texts.counts, 99))

84.0
115.0
207.0


In [172]:
# Inspect the length of summaries
print(np.percentile(lengths_summaries.counts, 90))
print(np.percentile(lengths_summaries.counts, 95))
print(np.percentile(lengths_summaries.counts, 99))

8.0
9.0
13.0


In [175]:
def create_final_data(int_summaries,
                      int_texts, 
                      max_text_length, 
                      max_summary_length, 
                      unk_text_limit, 
                      unk_summary_limit):
    
    '''
    Makes the final sorted summaries and sorted texts for our model to process
    Params:
        int_summaries      : summaries in word-int form
        int_texts          : review texts in word-int form
        max_text_length    : maximum allowed review text size
        max_summary_length : maximum allowed summary size
        unk_text_limit     : max number of UNKs allowed in review text
        unk_summary_limit  : max number of UNKs allowed in summary
    '''
    
    sorted_summaries = []
    sorted_texts = []
#     max_text_length = 84
#     max_summary_length = 13
    min_length = 2
#     unk_text_limit = 1
#     unk_summary_limit = 0

    for length in range(min(lengths_texts.counts), max_text_length): 
        for count, words in enumerate(int_summaries):
            if (len(int_summaries[count]) >= min_length and
                len(int_summaries[count]) <= max_summary_length and
                len(int_texts[count]) >= min_length and
                unk_counter(int_summaries[count]) <= unk_summary_limit and
                unk_counter(int_texts[count]) <= unk_text_limit and
                length == len(int_texts[count])
               ):
                sorted_summaries.append(int_summaries[count])
                sorted_texts.append(int_texts[count])

    # Compare lengths to ensure they match
    print(len(sorted_summaries))
    print(len(sorted_texts))
    return sorted_summaries, sorted_texts

In [177]:
sorted_summaries, sorted_texts = create_final_data(int_summaries,
                                                   int_texts,
                                                   84, 13, 1, 0)

425616
425616


### Checkpoint: saving final data


In [178]:
# Dump the data to save for later
model_input_data_path = './checkpointed_data/model_input_data.p'
pickle.dump((sorted_summaries, sorted_texts), open(model_input_data_path, 'wb'))

In [179]:
# Load in model input data from checkpoint
model_input_data_path = './checkpointed_data/model_input_data.p'
sorted_summaries, sorted_texts = pickle.load(open(model_input_data_path, mode='rb'))
print(len(sorted_summaries), len(sorted_texts))

425616 425616


## Conclusion
Now we've gotten all our Amazon reviews and summaries in the proper form to go forward. We'll be doing further processing (adding PAD tokens, etc) when building the model itself.

Our texts are now integer matrixes of length M x N (M = number of items, N = the word_integer in each item).