## Types of Text Summarization Methods

![Imgur](https://i.imgur.com/J5KyMBJ.png)


# Import Packages 


In [1]:
import numpy as np
import spacy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from IPython.display import display
import base64
import string
import re
from collections import Counter
from time import time
import nltk
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
import heapq
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

sns.set_context('notebook')

# Import Dataset 


In [2]:
reviews = pd.read_csv("../data/wine_reviews.csv", usecols =['points', 'title', 'description', 'variety', 'price'], encoding='utf-8')
reviews = reviews.dropna()
reviews.reset_index(drop=True, inplace=True)
reviews.head(15)

Unnamed: 0,description,points,price,title,variety
0,"This is ripe and fruity, a wine that is smooth...",87,15.0,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red
1,"Tart and snappy, the flavors of lime flesh and...",87,14.0,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris
2,"Pineapple rind, lemon pith and orange blossom ...",87,13.0,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling
3,"Much like the regular bottling from 2012, this...",87,65.0,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir
4,Blackberry and raspberry aromas show a typical...,87,15.0,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot
5,"Here's a bright, informal red that opens with ...",87,16.0,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato
6,This dry and restrained wine offers spice in p...,87,24.0,Trimbach 2012 Gewurztraminer (Alsace),Gewürztraminer
7,Savory dried thyme notes accent sunnier flavor...,87,12.0,Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe...,Gewürztraminer
8,This has great depth of flavor with its fresh ...,87,27.0,Jean-Baptiste Adam 2012 Les Natures Pinot Gris...,Pinot Gris
9,"Soft, supple plum envelopes an oaky structure ...",87,19.0,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon


# Text preprocessing

In [3]:
# !python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
def normalize_text(text):
    tm1 = re.sub('<pre>.*?</pre>', '', text, flags=re.DOTALL)
    # tm2 = re.sub('<code>.*?</code>', '', tm1, flags=re.DOTALL)
    tm3 = re.sub('<[^>]+>©', '', tm1, flags=re.DOTALL)
    return tm3.replace("\n", "")

In [4]:
reviews['description_normalized'] = reviews['description'].apply(normalize_text)

In [5]:
print('Before normalizing text-----\n')
print(reviews['description'][2])
print('\nAfter normalizing text-----\n')
print(reviews['description_normalized'][2])

Before normalizing text-----

Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.

After normalizing text-----

Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.


Ваши методы, подумайте до конца ли вы почистили датасет

In [6]:
punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~©'
# Define function to cleanup text by removing personal pronouns, stopwords, and puncuation
def cleanup_text(docs, logging=False):
    texts = []
    doc = nlp(docs, disable=['parser', 'ner'])
    tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
    tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
    tokens = ' '.join(tokens)
    texts.append(tokens)
    return pd.Series(texts)

In [7]:
reviews['description_cleaned'] = reviews['description_normalized'].apply(lambda x: cleanup_text(x, False))


Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.



In [8]:
print('Reviews description with punctuatin and stopwords---\n')
print(reviews['description_normalized'][0])
print('\nReviews description after removing punctuation and stopwrods---\n')
print(reviews['description_cleaned'][0])

Reviews description with punctuatin and stopwords---

This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's  already drinkable, although it will certainly be better from 2016.

Reviews description after removing punctuation and stopwrods---

ripe fruity wine smooth still structure . firm tannin fill juicy red berry fruit freshen acidity . already drinkable although certainly well 2016 .


# Distribution of Points

# Analyze reviews description

Pipeline for summary using spacy or gensim:

### 1 Convert Paragraphs to Sentences

### 2 Text Preprocessing

### 3 Tokenizing the Sentences

### 4 Find Weighted Frequency of Occurrence

### 5 Replace Words by Weighted Frequency in Original Sentences

### 6 Sort Sentences in Descending Order of Sum

In [9]:
def generate_summary(cleaned_text, text_without_removing_dot=None):
    tfidf_vectorizer = TfidfVectorizer()

    sentences = sent_tokenize(cleaned_text)

    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
    scores = tfidf_matrix.toarray().sum(axis=0)
    ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(sentences)), reverse=True)
    summary_sentences = [sentence for score, sentence in ranked_sentences[:3]]

    original_sentences = [sentences[sentences.index(summary_sentence)] for summary_sentence in summary_sentences]

    summary = ' '.join(original_sentences)
    return summary

    # print("\nOriginal Text:\n")
    # print(text_without_removing_dot)
    # print('\n\nSummarized text:\n')
    # print(summary)

In [10]:
summarization = []
for i in range(reviews.shape[0]):
    summarization.append(generate_summary(reviews['description_cleaned'][i]))

reviews['summary'] = summarization
reviews['summary'] = reviews['summary'].astype(str)
reviews.to_csv("../data/wine_reviews_with_summary.csv", encoding='utf-8')

In [11]:
reviews_with_summary = pd.read_csv("../data/wine_reviews_with_summary.csv", index_col=0, encoding='utf-8')
reviews_with_summary['summary'] = reviews_with_summary['summary'].astype(str)

In [12]:
reviews_with_summary.head(5)

Unnamed: 0,description,points,price,title,variety,description_normalized,description_cleaned,summary
0,"This is ripe and fruity, a wine that is smooth...",87,15.0,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,"This is ripe and fruity, a wine that is smooth...",ripe fruity wine smooth still structure . firm...,ripe fruity wine smooth still structure . alre...
1,"Tart and snappy, the flavors of lime flesh and...",87,14.0,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,"Tart and snappy, the flavors of lime flesh and...",tart snappy flavor lime flesh rind dominate . ...,wine stainless steel ferment . tart snappy fla...
2,"Pineapple rind, lemon pith and orange blossom ...",87,13.0,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,"Pineapple rind, lemon pith and orange blossom ...",pineapple rind lemon pith orange blossom start...,pineapple rind lemon pith orange blossom start...
3,"Much like the regular bottling from 2012, this...",87,65.0,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,"Much like the regular bottling from 2012, this...",much like regular bottling 2012 come across ra...,nonetheless think pleasantly unfussy country w...
4,Blackberry and raspberry aromas show a typical...,87,15.0,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Blackberry and raspberry aromas show a typical...,blackberry raspberry aroma show typical navarr...,blackberry raspberry aroma show typical navarr...


In [17]:
tagged_data = [TaggedDocument(words=word_tokenize(_w.lower()), tags=[str(i)]) for i, _w in enumerate(reviews_with_summary.summary)]

In [22]:
max_epochs = 5
vec_size = 20
alpha = 0.07

model = Doc2Vec(vector_size=vec_size, alpha=alpha, min_alpha=0.00025, min_count=1, dm=1)
  
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    if epoch % 1 == 0:
        print('Iteration {0}'.format(epoch))
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
    # decrease the learning rate
    model.alpha -= 0.002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("../models/d2v.model")
print("Model Saved")

Iteration 0
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Model Saved


In [25]:
model_downloaded = Doc2Vec.load("../models/d2v.model")

#to find the vector of a document which is not in training data
test_data = word_tokenize(generate_summary(cleanup_text(normalize_text('This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. Its  already drinkable, although it will certainly be better from 2016'))[0]))
test_data_vec = model_downloaded.infer_vector(test_data)
sims = model_downloaded.dv.most_similar([test_data_vec], topn=len(model_downloaded.dv))
print('The most suitable wines according to the description:\n')
for i in range(5):
    index = int(sims[i][0])
    acc = float(sims[i][1]) * 100
    print(f"Vine title: {reviews_with_summary['title'][index]}, vine variety: {reviews_with_summary['variety'][index]}"
          f", WineEnthusiast points: {reviews_with_summary['points'][index]:.0f}. Coincidence: {acc:.2f}%.")

The most suitable wines according to the description:

Vine title: Quinta dos Avidagos 2011 Avidagos Red (Douro), vine variety: Portuguese Red, WineEnthusiast points: 87. Coincidence: 97.15%.
Vine title: Maison Malet Roquefort 2011 Léo de la Gaffelière  (Bordeaux), vine variety: Bordeaux-style Red Blend, WineEnthusiast points: 88. Coincidence: 93.79%.
Vine title: Quevedo 2014 Claudia's Red (Douro), vine variety: Portuguese Red, WineEnthusiast points: 88. Coincidence: 92.42%.
Vine title: Château Suau 2015  Bordeaux Blanc, vine variety: Bordeaux-style White Blend, WineEnthusiast points: 84. Coincidence: 91.57%.
Vine title: Casa Santos Lima 2014 Lab Reserva Red (Lisboa), vine variety: Portuguese Red, WineEnthusiast points: 87. Coincidence: 90.53%.
