## Extractive Model
This completed notebook was provided by Mehul Gupta and can be found here: https://medium.com/data-science-in-your-pocket/text-summarization-using-textrank-in-nlp-4bce52c5b390
The model has been adapted to take an Amazon review as manual input, and the output length has been changed to a single sentence.

This is the extractive model used to produce summaries. It uses the TextRank algorithm, which is based on PageRank. The NetworkX library provides the PageRank implementation, which this model calls through the function nx.pagerank_numpy().

### Import Libraries

In [37]:
import numpy as np
import pandas as pd
import nltk
import re
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from gensim.models import Word2Vec
from scipy import spatial
import networkx as nx

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The review variable holds the original review to be summarised. Update this variable manually to summarise a different review. Below are some example reviews from the Amazon dataset that can be used:

"For me: too sweet. Its like eating candied meat. Not the savory item I was hoping for! Id skip this option and go for original."

"Gave me such a caffeine overdose I had the shakes, a racing heart and an anxiety attack. Plus it tastes unbelievably bad. I'll stick with coffee, tea and soda, thanks."


In [38]:
review = 'Maybe I got the runt of the litter, but when I opened the bag, the foul musty/moldy smell hit me in the face. Even though the label (which was just a printed sticker as opposed to something stamped on the bag itself) said good until 1/2013, there was no doubt that the food had gone bad. Very disappointing.'                                                   

### Tokenise review into a list of sentences

In [39]:
sentences=sent_tokenize(review)

### Preprocess Data
This cleans the review by lowercasing each sentence, removing punctuation marks and special characters, and removing stopwords.

In [40]:
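# Lowercase each sentence and strip everything except word characters and whitespace
sentences_clean=[re.sub(r'[^\w\s]','',sentence.lower()) for sentence in sentences]
stop_words = stopwords.words('english')
# Split on any whitespace (avoids empty tokens) and drop stopwords
sentence_tokens=[[word for word in sentence.split() if word not in stop_words] for sentence in sentences_clean]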
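To make the effect of this cleaning step concrete, here is a minimal, self-contained sketch. It assumes a tiny hand-written stop-word set (`STOP_WORDS`, hypothetical) in place of the NLTK English list, and the helper name `clean_sentence` is illustrative, not part of the notebook.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
STOP_WORDS = {"i", "the", "a", "it", "was", "for", "of", "and", "this"}

def clean_sentence(sentence, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation, and remove stop words from one sentence."""
    no_punct = re.sub(r"[^\w\s]", "", sentence.lower())
    return [word for word in no_punct.split() if word not in stop_words]

print(clean_sentence("Not the savory item I was hoping for!"))
# ['not', 'savory', 'item', 'hoping']
print(clean_sentence("musty/moldy smell"))
# ['mustymoldy', 'smell']
```

The second call shows a side effect of the regular expression: because punctuation is deleted rather than replaced with a space, words joined by a slash are merged into a single token.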

### Calculate sentence embeddings
The gensim library is used to calculate a one-dimensional embedding for each word in the review. The word values in each sentence are then collected into a sentence embedding, which is padded with 0s so that all sentence embeddings have the same length.

In [41]:
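# Train 1-dimensional word vectors (gensim 3.x argument names; gensim 4+ renames size to vector_size and iter to epochs)
w2v=Word2Vec(sentence_tokens,size=1,min_count=1,iter=100)
# A sentence embedding is the sequence of its words' single vector components
sentence_embeddings=[[w2v.wv[word][0] for word in words] for words in sentence_tokens]
max_len=max([len(tokens) for tokens in sentence_tokens])
# Right-pad with zeros so every sentence embedding has length max_len
sentence_embeddings=[np.pad(embedding,(0,max_len-len(embedding)),'constant') for embedding in sentence_embeddings]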
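The padding step can be seen in isolation with a small sketch. The values in `raw_embeddings` are made up for illustration and are not Word2Vec output.

```python
import numpy as np

# Hypothetical one-value-per-word embeddings for three sentences of different lengths.
raw_embeddings = [[0.2, -0.1, 0.4], [0.3], [0.5, 0.1]]

max_len = max(len(e) for e in raw_embeddings)
padded = np.array([
    np.pad(e, (0, max_len - len(e)), "constant")  # append zeros on the right only
    for e in raw_embeddings
])

print(padded.shape)  # (3, 3)
print(padded[1])     # the one-word sentence followed by two zeros
```

After padding, every row has the same length, so any two sentence embeddings can be compared directly.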

  


### Calculate similarity matrix
A similarity matrix is first initialised with 0s. The cosine similarity between each pair of sentence embeddings, computed as 1 minus the cosine distance, is then written into the matrix.

In [42]:
similarity_matrix = np.zeros([len(sentence_tokens), len(sentence_tokens)])
for i,row_embedding in enumerate(sentence_embeddings):
    for j,column_embedding in enumerate(sentence_embeddings):
        similarity_matrix[i][j]=1-spatial.distance.cosine(row_embedding,column_embedding)
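As a cross-check on the loop above, cosine similarity can also be computed for all pairs at once with NumPy. This is a sketch, not the notebook's method: the function name `cosine_similarity_matrix` is illustrative, and it assumes no embedding is all zeros.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    vectors = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # assumes no row is all zeros
    return unit @ unit.T    # dot products of unit vectors are cosine similarities

# Hypothetical embeddings: rows 0 and 1 are orthogonal, row 2 lies between them.
example = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sim = cosine_similarity_matrix(example)
print(np.round(sim, 3))
```

The diagonal is 1 because every sentence is identical to itself, and the matrix is symmetric, so the resulting graph is undirected.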

### Implement PageRank
TextRank is based on the PageRank algorithm, with sentences used in place of webpages. The similarity matrix is converted to a graph, which is then passed to the NetworkX implementation of PageRank.

In [43]:
nx_graph = nx.from_numpy_array(similarity_matrix)
# pagerank_numpy was removed in NetworkX 3.0; on newer versions call nx.pagerank(nx_graph) instead
scores = nx.pagerank_numpy(nx_graph)
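To show what a PageRank score represents, here is a minimal power-iteration sketch in plain NumPy. It is an illustration under stated assumptions, not the NetworkX implementation: the function name `pagerank_power_iteration` is hypothetical, weights are assumed non-negative with every row summing to a positive value, and the damping factor 0.85 matches the default `alpha` in NetworkX.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-9, max_iter=200):
    """Approximate PageRank scores for a non-negative weighted adjacency matrix."""
    weights = np.asarray(weights, dtype=float)
    n = weights.shape[0]
    # Row-normalise so each row holds outgoing transition probabilities.
    transition = weights / weights.sum(axis=1, keepdims=True)
    scores = np.full(n, 1.0 / n)  # start from a uniform distribution
    for _ in range(max_iter):
        # Each node keeps a small uniform share and receives the rest from its neighbours.
        updated = (1 - damping) / n + damping * (transition.T @ scores)
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# Hypothetical similarity matrix for three sentences; sentence 1 is similar to both others.
example_weights = [[1.0, 0.9, 0.1],
                   [0.9, 1.0, 0.8],
                   [0.1, 0.8, 1.0]]
example_scores = pagerank_power_iteration(example_weights)
print(example_scores, example_scores.argmax())
```

The sentence most strongly connected to the others receives the highest score, which is why TextRank treats it as the most representative sentence.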

### Create sentence dictionary

In [44]:
top_sentence={sentence:scores[index] for index,sentence in enumerate(sentences)}
top=dict(sorted(top_sentence.items(), key=lambda x: x[1], reverse=True)[:1]) # Change the slice end (1) to the desired number of sentences in the summary
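Because the dictionary above is keyed on sentence text, two identical sentences in a review would collapse into one entry. One possible alternative, sketched below, ranks by sentence index instead. The helper `top_n_sentences` and the example scores are hypothetical.

```python
def top_n_sentences(sentences, scores, n=1):
    """Return the n highest-scoring sentences in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n])  # restore document order for readability
    return [sentences[i] for i in keep]

example_sentences = ["First point.", "Second point.", "Third point."]
example_ranks = {0: 0.2, 1: 0.5, 2: 0.3}  # hypothetical PageRank output keyed by index

print(top_n_sentences(example_sentences, example_ranks, n=2))
# ['Second point.', 'Third point.']
```

The `scores` argument accepts the index-keyed dictionary that PageRank returns for a graph built from a NumPy array, so no intermediate sentence dictionary is needed.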

### Produce Summary

In [45]:
for sent in sentences:
    if sent in top.keys():
        print(sent)

Maybe I got the runt of the litter, but when I opened the bag, the foul musty/moldy smell hit me in the face.
