In [22]:
%logstop
%logstart -rtq ~/.logs/nlp.py append
import seaborn as sns
sns.set()

In [19]:
from static_grader import grader

# NLP Miniproject

## Introduction

The objective of this miniproject is to gain experience with natural language processing and how to use text data to train a machine learning model to make predictions. For the miniproject, we will be working with product review text from Amazon. The reviews are for only products in the "Electronics" category. The objective is to train a model to predict the rating, ranging from 1 to 5 stars.

## Scoring

For most of the questions, you will be asked to submit the `predict` method of your trained model to the grader. The grader will use the passed `predict` method to evaluate how your model performs on a test set with respect to a reference model. The grader uses the [R<sup>2</sup>-score](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score) for model evaluation. If your model performs better than the reference solution, then you can score higher than 1.0. For the last question, you will submit the results of an analysis and your passed answer will be compared directly to the reference solution.

## Downloading and loading the data

The data set is available on Amazon S3 and comes as a compressed file where each line is a JSON object. To load the data set, we will need to use the `gzip` library to open the file and decode each JSON into a Python dictionary. In the end, we have a list of dictionaries, where each dictionary represents an observation.

In [20]:
%%bash
mkdir data
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_electronics_reviews_training.json.gz -nc -P ./data

mkdir: cannot create directory ‘data’: File exists
File ‘./data/amazon_electronics_reviews_training.json.gz’ already there; not retrieving.



In [6]:
import gzip
import simplejson as json

with gzip.open("data/amazon_electronics_reviews_training.json.gz", "r") as f:                                  
    data = [json.loads(line) for line in f]

The ratings are stored in the keyword `"overall"`. You should create an array of the ratings for each review, preferably using list comprehensions.

In [22]:
ratings = [row['overall'] for row in data]

In [23]:
sum(ratings)/len(ratings)

4.226383333333334

**Note**, the test set used by the grader is in the same format as that of `data`, a list of dictionaries. Your trained model needs to accept data in the same format. Thus, you should use `Pipeline` when constructing your model so that all necessary transformation needed are encapsulated into a single estimator object.

## Question 1: Bag of words model

Construct a machine learning model trained on word counts using the bag of words algorithm. Remember, the bag of words is implemented with `CountVectorizer`. Some things you should consider:

* The reference solution uses a linear model and you should as well; use either `Ridge` or `SGDRegressor`.
* The text review is stored in the key `"reviewText"`. You will need to construct a custom transformer to extract out the value of this key. It will be the first step in your pipeline.
* Consider what hyperparameters you will need to tune for your model.
* Subsampling the training data will boost training times, which will be helpful when determining the best hyperparameters to use. Note, your final model will perform best if it is trained on the full data set.
* Including stop words may help with performance.

In [12]:
from sklearn.base import BaseEstimator, TransformerMixin

class KeySelector(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        self.col=col
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [row[self.col] for row in X]



In [25]:
selector=KeySelector("reviewText")
selector.fit_transform(data, ratings)[:5]

["I bought this mouse to use with my laptop because I don't like those little touchpads.  I could not be happier.Since it's USB, I can plug it in with the computer already on and expect it to work automatically.  Since it's optical (the new kind, not to be confused with the old Sun optical mice that required a special checkered mouse pad) it works on most surfaces, including my pant legs, my couch, and random tables that I put my laptop down on.  It's also light and durable, features that help with portability.The wheel is surprisingly useful.  In addition to scrolling, it controls zoom and pan in programs like Autocad and 3D Studio Max.  I can no longer bear using either of these programs without it.One complaint - the software included with the Internet navigation features is useless.  Don't bother installing it if you have a newer Windows version that automatically supports wheel mice.  Just plug it in and use it - it's that easy.",
 'One by one, all of the discs went bad within a 6

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

bag_of_words_model = Pipeline([
    ('selector', selector),
    ('vectorizer', CountVectorizer()),
    ('regressor', Ridge(alpha=200)),
])

In [27]:
#value of alpha ridge has to be edited to 100 and above to get a good score

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, ratings, test_size=0.2, random_state=0)

In [29]:
bag_of_words_model.fit(X_train, y_train)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer', CountVectorizer()),
                ('regressor', Ridge(alpha=200))])

In [30]:
bag_of_words_model.fit(X_test, y_test)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer', CountVectorizer()),
                ('regressor', Ridge(alpha=200))])

In [31]:
(bag_of_words_model[-1].coef_**2).sum()

7.353821031864136

In [32]:
#bag_of_words_model = ...

In [33]:
grader.score.nlp__bag_of_words_model(bag_of_words_model.predict)

Your score: 0.888


## Question 2: Normalized model

Using raw counts will not be as effective compared if we had normalized the counts. There are several ways to normalize raw counts; the `HashingVectorizer` class has the keyword `norm` and there is also the `TfidfTransformer` and `TfidfVectorizer` that perform tf-idf weighting on the counts. Apply normalization to your model to improve performance.

In [34]:
import numpy as np

In [35]:
class CharacterLength(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.array([len(x) for x in X]).reshape(-1, 1)

In [36]:
character_length_model = Pipeline([
    ('selector', KeySelector('reviewText')),
    ('compute_length', CharacterLength()),
    ('regressor', Ridge()),
])

In [37]:
character_length_model.fit(X_train, y_train)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('compute_length', CharacterLength()), ('regressor', Ridge())])

In [38]:
character_length_model.fit(X_test, y_test)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('compute_length', CharacterLength()), ('regressor', Ridge())])

In [39]:
grader.score.nlp__normalized_model(character_length_model.predict)

Your score: 0.013


In [50]:
character_length_model = Pipeline([
    ('selector', selector),
    ('compute_length', CharacterLength()),
    ('regressor', Ridge(alpha=1000)),
])

In [51]:
character_length_model.fit(X_train, y_train)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('compute_length', CharacterLength()),
                ('regressor', Ridge(alpha=1000))])

In [58]:
grader.score.nlp__normalized_model(character_length_model.predict)

Your score: 0.010


In [43]:
# score is 0 because it is a simple model

In [53]:
from sklearn.feature_extraction.text import HashingVectorizer

In [56]:
normalized_model = Pipeline([
    ('selector', KeySelector('reviewText')),
    ('vectorizer', HashingVectorizer()),
    ('regressor', Ridge()),
])

In [None]:
X_t=normalized_model[:-1].fit_transform(X_train, y_train)

In [None]:
X_t

In [57]:
normalized_model.fit(X_train, y_train)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer', HashingVectorizer()), ('regressor', Ridge())])

In [None]:

normalized_model = ...

In [59]:
grader.score.nlp__normalized_model(normalized_model.predict)

Your score: 1.015


In [None]:
#testing with 1st, obviously normalized model is a better model. 
grader.score.nlp__bag_of_words_model(normalized_model.predict)

## Question 3: Bigrams model

The model performance may increase when including additional features generated by counting bigrams. Include bigrams to your model. When using more features, the risk of overfitting increases. Make sure you try to minimize overfitting as much as possible.

In [47]:

bigrams_model = Pipeline([
    ('selector', KeySelector('reviewText')),
    ('vectorizer', HashingVectorizer(ngram_range=(1,2))),
    ('regressor', Ridge()),
])

In [48]:
bigrams_model.fit(X_train, y_train)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer', HashingVectorizer(ngram_range=(1, 2))),
                ('regressor', Ridge())])

In [49]:
grader.score.nlp__bigrams_model(bigrams_model.predict)

Your score: 1.128


## Question 4: Polarity analysis

Let's derive some insight from our analysis. We want to determine the most polarizing words in the corpus of reviews. In other words, we want identify words that strongly signal a review is either positive or negative. For example, we understand a word like "terrible" will mostly appear in negative rather than positive reviews. The naive Bayes model calculates probabilities such as $P(\text{terrible } | \text{ negative})$, the probability the word "terrible" appears in the text, given that the review is negative. Using these probabilities, we can derive a **polarity score** for each counted word,

$$
\text{polarity} =  \log\left(\frac{P(\text{word } | \text{ positive})}{P(\text{word } | \text{ negative})}\right).
$$ 

The polarity analysis is an example where a simpler model offers more explicability than a more complicated model. For this question, you are asked to determine the top thirty words with the largest positive **and** largest negative polarity, for a total of sixty words. For this analysis, you should:

1. Use the naive Bayes model, `MultinomialNB`.
1. Use tf-idf weighting.
1. Remove stop words.

A trained naive Bayes model stores the log of the probabilities in the attribute `feature_log_prob_`. It is a NumPy array of shape (number of classes, the number of features). You will need the mapping between feature index to word. For this problem, you will use a different data set; it has been processed to only include reviews with one and five stars. You can download it below.

In [2]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

from spacy.lang.en import STOP_WORDS

In [3]:
%%bash
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_one_and_five_star_reviews.json.gz -nc -P ./data

File ‘./data/amazon_one_and_five_star_reviews.json.gz’ already there; not retrieving.



In order to avoid memory issues, let's delete the older data.

In [None]:
#del data, ratings

In [7]:
import numpy as np
import gzip
from sklearn.naive_bayes import MultinomialNB

with gzip.open("data/amazon_one_and_five_star_reviews.json.gz", "r") as f:
    data_polarity = [json.loads(line) for line in f]

ratings = [row['overall'] for row in data_polarity]


In [67]:

model = Pipeline([
    ('selector', KeySelector('reviewText')),
    ('vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
    ('classifier', MultinomialNB()),
])

In [68]:
#You can use this to fit: model.fit(data_polarity, ratings)

#model.fit(X_train, y_train)



Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer',
                 TfidfVectorizer(stop_words={"'d", "'ll", "'m", "'re", "'s",
                                             "'ve", 'a', 'about', 'above',
                                             'across', 'after', 'afterwards',
                                             'again', 'against', 'all',
                                             'almost', 'alone', 'along',
                                             'already', 'also', 'although',
                                             'always', 'am', 'among', 'amongst',
                                             'amount', 'an', 'and', 'another',
                                             'any', ...})),
                ('classifier', MultinomialNB())])

In [74]:
model.fit(data_polarity, ratings)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer',
                 TfidfVectorizer(stop_words={"'d", "'ll", "'m", "'re", "'s",
                                             "'ve", 'a', 'about', 'above',
                                             'across', 'after', 'afterwards',
                                             'again', 'against', 'all',
                                             'almost', 'alone', 'along',
                                             'already', 'also', 'although',
                                             'always', 'am', 'among', 'amongst',
                                             'amount', 'an', 'and', 'another',
                                             'any', ...})),
                ('classifier', MultinomialNB())])

In [75]:
polarity=model[-1].feature_log_prob_[0, :] - model[-1].feature_log_prob_[1, :]

polarity

array([ 0.7061141 , -0.33401936, -0.29165675, ...,  0.61954703,
        0.10706008,  0.19475051])

In [76]:
np.sort(polarity)

array([-2.3469003 , -2.33051588, -2.18552209, ...,  2.87344209,
        2.96722437,  3.13553283])

In [77]:
np.argsort(polarity)

array([11288,  3610, 17646, ..., 19021, 24531, 18497])

In [81]:
np.argsort(polarity)[:30]

array([11288,  3610, 17646, 16559, 14790, 17123,  2511, 21610, 13921,
       11980,  8880,  4036, 16917,   524,  5010, 10947,  3303, 17124,
        7961,  6376, 16718, 22331,  4353,  9267, 18541,  8197,  2228,
       16535, 13763, 21903])

In [87]:
polarity[3610]

-2.3305158753579107

In [88]:
ind_most_polar=np.hstack((np.argsort(polarity)[:30], np.argsort(polarity)[-30:]))

In [89]:
ind_most_polar.shape

(60,)

In [93]:
for index in ind_most_polar:
    print(model[1].get_feature_names()[index])

highly
beat
protects
perfect
monopod
portrait
amazing
sturdy
macro
incredible
excellent
bokeh
pleased
200mm
charm
handy
awesome
portraits
dslr
crisp
photography
telephoto
buck
fantastic
regrets
easy
affordable
penny
loves
surround
dead
poorly
send
sent
contacted
refused
threw
disappointing
randomly
stopped
unreliable
horrible
awful
unacceptable
poor
beware
defective
trash
worse
worthless
useless
garbage
returned
terrible
junk
worst
returning
return
waste
refund


In [96]:
top_60 = [model[1].get_feature_names()[index] for index in ind_most_polar]

In [97]:
grader.score.nlp__most_polar(top_60)

Your score: 1.000


## Question 5: Topic modeling [optional]

Topic modeling is the analysis of determining the key topics or themes in a corpus. With respect to machine learning, topic modeling is an unsupervised technique. One way to uncover the main topics in a corpus is to use [non-negative matrix factorization](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html). For this question, use non-negative matrix factorization to determine the top ten words for the first twenty topics. You should submit your answer as a list of lists. What topics exist in the reviews?

In [51]:
from sklearn.decomposition import NMF
 
import numpy as np
import gzip
from sklearn.naive_bayes import MultinomialNB

with gzip.open("data/amazon_one_and_five_star_reviews.json.gz", "r") as f:
    data_polarity = [json.loads(line) for line in f]

ratings = [row['overall'] for row in data_polarity]

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from spacy.lang.en import STOP_WORDS

STOP_WORDS.update({"ll", "ve"})

import spacy

nlp = spacy.load("en", disable=["parser", "ner", "textcat"])

In [52]:
data_polarity[0]['reviewText']

"This worked perfectly for about 8 rewinds.  But once it eats one of your precious tapes you'll probably be as dissapointed as I was.  Buyer beware... these rewinders are mostly the same look; but marketed by different companies.  They all eat tape."

In [54]:
for token in nlp(data_polarity[0]['reviewText']):
    print(token.lemma_)

this
work
perfectly
for
about
8
rewind
.
 
but
once
-PRON-
eat
one
of
-PRON-
precious
tape
-PRON-
will
probably
be
as
dissapointed
as
-PRON-
be
.
 
Buyer
beware
...
these
rewinder
be
mostly
the
same
look
;
but
market
by
different
company
.
 
-PRON-
all
eat
tape
.


In [55]:
def lemmatize(document):
    return [word.lemma_.lower() for word in nlp(document)]

In [57]:
STOP_WORDS_LEMMA = {word.lemma_.lower() for word in nlp(" ".join(STOP_WORDS))}

In [58]:
STOP_WORDS_LEMMA

{"'",
 "'d",
 "'s",
 '-pron-',
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'around',
 'as',
 'at',
 'back',
 'be',
 'because',
 'become',
 'before',
 'beforehand',
 'behind',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'could',
 'd',
 'do',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'front',
 'full',
 'further',
 'get',
 'give',
 'go',
 'have',
 'hence',
 'her',
 'here',
 'hereafter',
 'hereby',
 'herein

In [89]:
model_topic = Pipeline([
    ('selector', KeySelector('reviewText')),
    ('vectorizer', TfidfVectorizer(stop_words=STOP_WORDS_LEMMA, tokenizer=lemmatize)),
    ('dim-reduction', NMF(n_components=20, random_state=0)),
])


In [80]:
model_topic.fit(data_polarity)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer',
                 TfidfVectorizer(stop_words={"'", "'d", "'s", '-pron-', 'a',
                                             'about', 'above', 'across',
                                             'after', 'afterwards', 'again',
                                             'against', 'all', 'almost',
                                             'alone', 'along', 'already',
                                             'also', 'although', 'always', 'am',
                                             'among', 'amongst', 'amount', 'an',
                                             'and', 'another', 'any', 'anyhow',
                                             'anyone', ...},
                                 tokenizer=<function lemmatize at 0x7fa27b59a040>)),
                ('dim-reduction', NMF(n_components=20, random_state=0))])

In [90]:
model_topic.named_steps

{'selector': KeySelector(col='reviewText'),
 'vectorizer': TfidfVectorizer(stop_words={"'", "'d", "'s", '-pron-', 'a', 'about', 'above',
                             'across', 'after', 'afterwards', 'again', 'against',
                             'all', 'almost', 'alone', 'along', 'already',
                             'also', 'although', 'always', 'am', 'among',
                             'amongst', 'amount', 'an', 'and', 'another', 'any',
                             'anyhow', 'anyone', ...},
                 tokenizer=<function lemmatize at 0x7fa27b59a040>),
 'dim-reduction': NMF(n_components=20, random_state=0)}

In [None]:
#1. find the indices that contribute the most to each feature
#2. find the words associated with each feature
#3. print out the words

In [3]:
#def find_ind_biggest(model, N, topic_number):
   # """
    #Returns the indices that most contribute to a feature
     # model: trained ML model
   # N: The number of top words
   # topic_number: new feature index
   # """
    nmf= model.named_steps["dim-reduction"]
    #return np.argsort(-nmf.components_[topic_number, :])[:N]

def find_ind_biggest(vector, num_words_topic):
    return np.argsort(-vector)[:N]


def find_words(model, ind):
    vectorizer= model.named_steps["vectorizer"]

    features= vectorizer.get_feature_names()

    return [features[i] for i in ind]


    
def print_result(words, topic_number):
    print (f" Top words for topic: {topic_number}")
    print("-" * 20)
    
    for word in words:
        print(word)
        

        

def find_words_of_topic(model, vector, topic_number, num_words_topic=20):
    ind= find_ind_biggest(vector, num_words_topic)
    words= find_words(model, ind)
    print_result(words, topic_number)
    
#def find_words_of_topic(model, topic_number, N=20):
    #ind= find_ind_biggest(model, N, topic_number)
   # words= find_words(model, ind)
   # print_result(words)
    
    
def main(model, num_words_topic=20):
    for i, vector in enumerate(model.named_steps["dim-reduction"].components_):
        find_words_of_topic(model, vector, i, num_words_topic=num_words_topic)



IndentationError: unexpected indent (<ipython-input-3-c20b30f9f893>, line 8)

In [96]:
main(model_topic)

AttributeError: 'NMF' object has no attribute 'components_'

In [97]:
NMF?

In [64]:
ind=find_ind_biggest(model_topic, 20, 6)
find_words(model_topic, ind)

['cable',
 '.',
 'connect',
 'tv',
 'need',
 'connector',
 'length',
 'quality',
 'belkin',
 'monitor',
 'long',
 'signal',
 'monster',
 'foot',
 'order',
 'video',
 'hook',
 'short',
 'extension',
 'connection']

In [None]:
#test

In [5]:
from sklearn.decomposition import NMF

import json
import numpy as np
import gzip
from sklearn.naive_bayes import MultinomialNB

with gzip.open("data/amazon_one_and_five_star_reviews.json.gz", "r") as f:
    data_polarity = [json.loads(line) for line in f]

ratings = [row['overall'] for row in data_polarity]

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from spacy.lang.en import STOP_WORDS

STOP_WORDS.update({"ll", "ve"})

import spacy

nlp = spacy.load("en", disable=["parser", "ner", "textcat"])

In [6]:
data_polarity[0]['reviewText']

"This worked perfectly for about 8 rewinds.  But once it eats one of your precious tapes you'll probably be as dissapointed as I was.  Buyer beware... these rewinders are mostly the same look; but marketed by different companies.  They all eat tape."

In [9]:
from sklearn.base import BaseEstimator, TransformerMixin

class KeySelector(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        self.col=col
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [row[self.col] for row in X]

In [10]:
def find_ind_biggest(model, N, topic_number):
    """
       Returns the indices that most contribute to a feature
       model: trained ML model
       N: The number of top words
       topic_number: new feature index
     """
    nmf= model.named_steps["dim-reduction"]
    return np.argsort(-nmf.components_[topic_number, :])[:N]



def find_words(model, ind):
    vectorizer= model.named_steps["vectorizer"]

    features= vectorizer.get_feature_names()

    return [features[i] for i in ind]


    
def print_result(words, topic_number):
    print (f" Top words for topic: {topic_number}")
    print("-" * 20)
    
    for word in words:
        print(word)
        
       
    
def main(model, topic_number, N=20):
    ind= find_ind_biggest(model, N, topic_number)
    words= find_words(model, ind)
    print_result(words)




In [11]:
model_topic = Pipeline([
    ('selector', KeySelector('reviewText')),
    ('vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
    ('dim-reduction', NMF(n_components=20, random_state=0)),
])


In [12]:
model_topic.fit(data_polarity)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer',
                 TfidfVectorizer(stop_words={"'d", "'ll", "'m", "'re", "'s",
                                             "'ve", 'a', 'about', 'above',
                                             'across', 'after', 'afterwards',
                                             'again', 'against', 'all',
                                             'almost', 'alone', 'along',
                                             'already', 'also', 'although',
                                             'always', 'am', 'among', 'amongst',
                                             'amount', 'an', 'and', 'another',
                                             'any', ...})),
                ('dim-reduction', NMF(n_components=20, random_state=0))])

*Copyright &copy; 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*

In [None]:
#2nd test

In [13]:
def find_ind_biggest(model, N, topic_number):
    """
       Returns the indices that most contribute to a feature
       model: trained ML model
       N: The number of top words
       topic_number: new feature index
     """
    nmf= model.named_steps["dim-reduction"]
    return np.argsort(-nmf.components_[topic_number, :])[:N]



def find_words(model, ind):
    vectorizer= model.named_steps["vectorizer"]

    features= vectorizer.get_feature_names()

    return [features[i] for i in ind]


    
def print_result(words, topic_number):
    print (f" Top words for topic: {topic_number}")
    print("-" * 20)
    
    for word in words:
        print(word)
        
       
    
def main(model, topic_number, N=20):
    ind= find_ind_biggest(model, N, topic_number)
    words= find_words(model, ind)
    print_result(words)




In [15]:
import spacy

nlp = spacy.load("en", disable=["parser", "ner", "textcat"])

In [16]:
for token in nlp(data_polarity[0]['reviewText']):
    print(token.lemma_)

this
work
perfectly
for
about
8
rewind
.
 
but
once
-PRON-
eat
one
of
-PRON-
precious
tape
-PRON-
will
probably
be
as
dissapointed
as
-PRON-
be
.
 
Buyer
beware
...
these
rewinder
be
mostly
the
same
look
;
but
market
by
different
company
.
 
-PRON-
all
eat
tape
.


In [17]:
def lemmatize(document):
    return [word.lemma_.lower() for word in nlp(document)]

In [18]:
STOP_WORDS_LEMMA = {word.lemma_.lower() for word in nlp(" ".join(STOP_WORDS))}

In [25]:
model_topic = Pipeline([
    ('selector', KeySelector('reviewText')),
   # ('vectorizer', TfidfVectorizer(stop_words=STOP_WORDS_LEMMA, tokenizer=lemmatize)),
    ('vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
    ('dim-reduction', NMF(n_components=20, random_state=0)),
])


In [26]:
model_topic.fit(data_polarity)

Pipeline(steps=[('selector', KeySelector(col='reviewText')),
                ('vectorizer',
                 TfidfVectorizer(stop_words={"'d", "'ll", "'m", "'re", "'s",
                                             "'ve", 'a', 'about', 'above',
                                             'across', 'after', 'afterwards',
                                             'again', 'against', 'all',
                                             'almost', 'alone', 'along',
                                             'already', 'also', 'although',
                                             'always', 'am', 'among', 'amongst',
                                             'amount', 'an', 'and', 'another',
                                             'any', ...})),
                ('dim-reduction', NMF(n_components=20, random_state=0))])

In [27]:
def find_ind_biggest(vector, num_words_topic):
    
    return np.argsort(-vector)[:num_words_topic]

def find_words(model, ind):
    vectorizer= model.named_steps["vectorizer"]

    features= vectorizer.get_feature_names()

    return [features[i] for i in ind]


    
def print_result(words, topic_number):
    print (f" Top words for topic: {topic_number}")
    print("-" * 20)
    
    for word in words:
        print(word)
        
       

    
def find_words_of_topic(model, vector, topic_number, num_words_topic=20):
    ind= find_ind_biggest(vector, num_words_topic)
    words= find_words(model, ind)
    print_result(words, topic_number)

                      

                      
def main(model, num_words_topic=20):
    for i, vector in enumerate(model.named_steps["dim-reduction"].components_):
        find_words_of_topic(model, vector, i,num_words_topic=num_words_topic)

In [28]:
main(model_topic)

 Top words for topic: 0
--------------------
unit
time
buy
bought
amazon
money
don
months
working
got
worked
item
battery
return
thing
waste
new
bad
warranty
years
 Top words for topic: 1
--------------------
lens
canon
lenses
focus
50mm
hood
sharp
light
nikon
wide
zoom
shots
images
18
macro
low
cap
image
prime
70
 Top words for topic: 2
--------------------
headphones
sound
speakers
ear
bass
pair
ears
volume
music
quality
head
headphone
better
comfortable
hear
noise
sony
like
speaker
set
 Top words for topic: 3
--------------------
cable
tv
monitor
connect
signal
belkin
extension
modem
needed
length
quality
video
foot
connector
printer
picture
long
fine
monster
vga
 Top words for topic: 4
--------------------
camera
pictures
digital
cameras
canon
batteries
battery
flash
picture
nikon
memory
remote
photos
zoom
shoot
kodak
takes
olympus
tripod
shots
 Top words for topic: 5
--------------------
dvd
player
cd
play
sony
disc
discs
dvds
players
unit
mp3
tv
recorder
video
memorex
remote
vcr
