# Intermediate NLP Homework (Solution)

Using TF-IDF, write a document summarizer for a corpus of your choosing, but summarize using full sentences or paragraphs rather than individual words.

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

In [2]:
# read yelp.csv into a DataFrame using a relative path
import pandas as pd
path = '../data/yelp.csv'
yelp = pd.read_csv(path)

In [3]:
# create a document-term matrix using TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
dtm = vect.fit_transform(yelp.text)
dtm.shape

(10000, 29185)
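
To make the weights stored in `dtm` less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalization of each row). The toy corpus, the whitespace tokenizer, and the `toy_tfidf` helper are assumptions made for illustration. The real vectorizer tokenizes with a regular expression and returns a sparse matrix.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return one {word: weight} dict per document (illustrative helper, not part of scikit-learn)."""
    # assumption: lowercase and split on whitespace instead of using a regex tokenizer
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {w: count * (math.log((1.0 + n_docs) / (1 + df[w])) + 1)
               for w, count in tf.items()}
        # L2-normalize so that every document vector has unit length
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weights.append({w: v / norm for w, v in raw.items()})
    return weights

toy_docs = [
    'the pizza was good',
    'the service was slow',
    'the pizza crust was thin',
]
toy_weights = toy_tfidf(toy_docs)

# print the words of the first document, highest weight first
for word, weight in sorted(toy_weights[0].items(), key=lambda item: -item[1]):
    print(word, round(weight, 3))
```

In the first toy document, `good` (found in one document) receives a higher weight than `pizza` (found in two), which in turn outweighs `the` and `was` (found in all three).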

In [4]:
# create a list of all of the features
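# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call it instead
features = list(vect.get_feature_names_out())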
len(features)

29185
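
The `summarize()` function defined in the next cell searches this list with `features.index(word)`, which scans the list on every call. As an optional speed-up, a dictionary from word to column index gives constant-time lookups. The sketch below uses a tiny stand-in list (`toy_features`) and a hypothetical `feature_index` name, neither of which appears in the notebook itself.

```python
# stand-in for the real `features` list (assumption: a small sorted vocabulary)
toy_features = ['crust', 'good', 'pizza', 'service', 'the', 'was']

# build the word -> column index mapping once
feature_index = {word: col for col, word in enumerate(toy_features)}

# each lookup is now a dictionary access rather than a list scan
print(feature_index['pizza'])
print(feature_index['pizza'] == toy_features.index('pizza'))
```

A fitted vectorizer also exposes this kind of mapping directly through its `vocabulary_` attribute.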

In [5]:
import numpy as np
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer

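def summarize():
    """Print the 3 top-scoring sentences of a random review, 3 random sentences, and the review itself."""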
    
    # choose a random review that has at least 10 sentences
    num_sentences = 0
    while num_sentences < 10:
        review_id = np.random.randint(0, yelp.shape[0])
        review_text = yelp.loc[review_id, 'text']
        review_blob = TextBlob(review_text)
        num_sentences = len(review_blob.sentences)
    
    # create a list of all unique words in the review using CountVectorizer
    vect = CountVectorizer()
    vect.fit([review_text])
    unique_words = list(vect.get_feature_names_out())
    
    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in unique_words:
        word_scores[word] = dtm[review_id, features.index(word)]
    
    # calculate the mean TF-IDF score for each sentence that has at least 6 words
    sentences = review_blob.sentences
    sentence_scores = []
    for sentence in sentences:
        sentence_words = sentence.words.lower()
        if len(sentence_words) >= 6:
            sentence_score = np.mean([word_scores[word] for word in sentence_words if word in unique_words])
            sentence_scores.append((sentence_score, sentence))
    
    # print sentences with the top 3 TF-IDF scores
    print('TOP SCORING SENTENCES:')
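    # sort on the score alone so that tied scores never fall back to comparing sentences
    top_scores = sorted(sentence_scores, key=lambda pair: pair[0], reverse=True)[0:3]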
    for score, sentence in top_scores:
        print(sentence)
    
    # print 3 random sentences (for comparison)
    print('\n' + 'RANDOM SENTENCES:')
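    # sample indices rather than Sentence objects, which NumPy may try to unpack as sequences
    random_ids = np.random.choice(len(sentences), size=3, replace=False)
    random_sentences = [sentences[i] for i in random_ids]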
    for sentence in random_sentences:
        print(sentence)
    
    # print the review
    print('\n' + 'REVIEW:' + '\n' + review_text)
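
The scoring step inside `summarize()` can be hard to follow in the middle of the larger function, so here is a minimal sketch of it in isolation: average the per-word scores of each sufficiently long sentence, then keep the highest-scoring ones. The word scores, the sentences, and the `top_sentences` helper are made up for illustration, and plain lists of lowercase words stand in for TextBlob sentences.

```python
def top_sentences(sentences, word_scores, min_words=6, top_n=3):
    """Rank sentences by mean word score (hypothetical helper for illustration)."""
    scored = []
    for words in sentences:
        # skip short sentences, as summarize() does
        if len(words) < min_words:
            continue
        # only words that have a score contribute to the mean
        known = [word_scores[w] for w in words if w in word_scores]
        # extra guard (not in summarize()): ignore sentences with no scored words
        if known:
            scored.append((sum(known) / float(len(known)), ' '.join(words)))
    # sort on the score alone, highest first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# made-up scores: distinctive words score high, common words score low
toy_scores = {'burrata': 0.9, 'prosciutto': 0.8, 'patio': 0.6,
              'the': 0.1, 'was': 0.1, 'we': 0.1}
toy_sentences = [
    ['we', 'ordered', 'the', 'prosciutto', 'and', 'burrata'],
    ['the', 'patio', 'was', 'the', 'best', 'part'],
    ['we', 'left'],
]
for score, sentence in top_sentences(toy_sentences, toy_scores, top_n=2):
    print(round(score, 3), sentence)
```

Because the score is a mean rather than a sum, a long sentence gains no advantage from length alone. It ranks highly only if its words are distinctive on average.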

In [6]:
summarize()

TOP SCORING SENTENCES:
The pizza was really good (the fresh ingredients really come through), but we weren't the biggest fans of the crust.
Service was great and we met Karen, the owner, before we left and she is a doll.
We both had the Prosciutto Cruda pizza and had wanted to try the burrata, but unfortunately they'd been out of it for the last week so we ordered the bruschetta instead.

RANDOM SENTENCES:
I love love love their patio.
Our waiter also recommended the Valpolicella Ripasso, which was ok, but it was a little light for our tastes.
The heaters keep you warm on a chilly night and it's much quieter than their indoor space.

REVIEW:
So we finally decided to give Cibo another chance and I'm so glad we did!

I love love love their patio. It's very cute and romantic especially with all the lights wrapped around the trees. The heaters keep you warm on a chilly night and it's much quieter than their indoor space. This place is a great date spot. 

We both had the Prosciutto Cru