# Opinion mining from the dataset of Amazon product reviews

The goal of this project is to mine user opinions about specific product features from a dataset of user reviews of several Amazon products. The pipeline analyzes the reviews, extracts product features and calculates the sentiment for each feature. All the discussed source code is included in this document and described in the accompanying README.md. 

This document contains 4 cells:

1. Dependencies and downloading corpora
2. Model definition
3. Run on one product and generate report
4. Run on a set of products and generate results in markdown syntax

The data is a collection of customer reviews, extracted from Amazon. Reviews for individual products are grouped in files and each file has been manually labelled with the list of product features, sentiment polarity and sentiment strength. Each file contains reviews for one specific product or domain. 

Symbols used in the annotated reviews (from Customer_review_data/Readme.txt): 
```text
  [t]: the title of the review: Each [t] tag starts a review. 
       We did not use the title information in our papers.
  xxxx[+|-n]: xxxx is a product feature. 
      [+n]: Positive opinion, n is the opinion strength: 3 strongest, 
            and 1 weakest. Note that the strength is quite subjective. 
            You may want ignore it, but only considering + and -
      [-n]: Negative opinion
  ##  : start of each sentence. Each line is a sentence. 
  [u] : feature not appeared in the sentence.
  [p] : feature not appeared in the sentence. Pronoun resolution is needed.
  [s] : suggestion or recommendation.
  [cc]: comparison with a competing product from a different brand.
  [cs]: comparison with a competing product from the same brand.
```
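
To make the annotation scheme concrete, the sketch below parses one annotated line into its labelled features and the sentence text. It assumes the `feature[+n], feature[-n]##sentence` layout described above. The function name `parse_annotated_line`, the regular expression and the sample line are illustrative assumptions and not part of the corpus tooling; the project itself relies on the NLTK review reader for this step.

```python
import re

# Matches "feature name[+2]" or "feature name[-1]". Tags such as
# [u], [p], [s], [cc] and [cs] carry no sign and digit, so they are skipped.
FEATURE_RE = re.compile(r"([^\[\],]+)\[([+-])(\d)\]")

def parse_annotated_line(line):
    """Split an annotated line into (features, sentence).

    features is a list of (name, signed_strength) tuples,
    for example ("battery life", 2) or ("menu", -1).
    """
    labels, _, sentence = line.partition("##")
    features = []
    for name, sign, strength in FEATURE_RE.findall(labels):
        value = int(strength) if sign == "+" else -int(strength)
        features.append((name.strip(), value))
    return features, sentence.strip()

# Hypothetical sample line written in the format described above
sample = "battery life[+2], menu[-1][u]##the battery lasts long but the menus are confusing ."
features, sentence = parse_annotated_line(sample)
print(features)
print(sentence)
```

The sign of the strength is kept as the sign of the integer, which makes it easy to ignore the subjective strength value and compare polarity only.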

In [1]:
# Please run this cell to import deps and download the necessary corpora

import nltk
import numpy as np              # For TFIDF results handling
import string
import time                     # for timing execution
import itertools                # For feature pruning
import operator
from nltk import word_tokenize
from nltk.corpus import product_reviews_1, product_reviews_2
from textblob import TextBlob    # For noun phrase extraction and spell check
from apyori import apriori       # For the Apriori algorithm
from numpy import sign           # For extracting sentiment direction
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('product_reviews_1')
nltk.download('product_reviews_2')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Uncomment and run to see all available Amazon products in the corpora:
# print(product_reviews_1.fileids())
# print(product_reviews_2.fileids())

[nltk_data] Downloading package product_reviews_1 to
[nltk_data]     /Users/siemens/nltk_data...
[nltk_data]   Package product_reviews_1 is already up-to-date!
[nltk_data] Downloading package product_reviews_2 to
[nltk_data]     /Users/siemens/nltk_data...
[nltk_data]   Package product_reviews_2 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/siemens/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/siemens/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

# Storing internal state of the model

The baseline and the upgraded model are built on the same architecture and share the same internal representation; they differ only in individual steps of the preprocessing, feature extraction and sentiment analysis pipeline. The main classes representing the state of the model are based on the NLTK review reader module [2], extended with custom classes that store the internal state. The three core classes are `PSent` (one sentence), `PRev` (one review) and `PReviews` (all reviews for one product):

 ![class diagram](media/class_diagram.png)

In [3]:
class PSent:
    """PSent represents one sentence in a product review"""
    
    def __init__(self, s, f):
        self.raw = s # Sentence in raw format
        self.pp = [] # Sentence after preprocessing
        self.ft = {} # Features / polarity based on model
        self.op = [] # Opinion words
        self.test = f # Test features labelled by human
        self.eval = {} # Evaluation score
        
    def sentiment(self):
        # Detecting sentiment polarity in the sentences                 
        if self.ft:
            # print("\n"+str(s))
            p = TextBlob(str(self)).sentiment.polarity
            if( p > 0 ):
                p = 1                
            elif( p < 0 ):
                p = -1
            # In base model there is no distinction
            # between different features within one
            # sentence:
            for f in self.ft.keys():
                self.ft[f] = p
        
    def __repr__(self):
        return "PSent"
    
    def __str__(self):
        return ' '.join(self.raw)
        
class PRev:
    """PRev represents one review"""
    
    def __init__(self, r):
        self.NltkReview = r # original review object
        self.review = []    # preprocessed review   
        self.eval = {}      # feature / sentiment evaluation score
        
    def preprocess(self, 
                         spelling = False,
                         stemming = False,
                         chunking = True,
                         lemmatization = True):
        """Review preprocessing pipeline
        
        The following POS entities indicate the nouns:
        * NN noun, singular ‘desk’
        * NNS noun plural ‘desks’
        * NNP proper noun, singular ‘Harrison’
        * NNPS proper noun, plural ‘Americans’         
        """
        
        es = nltk.stem.SnowballStemmer('english')
        
        self.review = []
        
        for rl in self.NltkReview.review_lines:
            
            ps = PSent(rl.sent, 
                       rl.features) # Human labelled features
            
            a = ' '.join(rl.sent)
            a = TextBlob(a)

            # 2. spelling correction 
            if(spelling):                    
                a = a.correct() 
            
            # 3. get all nouns
            nouns = []
            for pos in a.tags:
                if (pos[1][:2] == 'NN' 
                   and len(pos[0])>=3):
                    nouns.append(pos[0])

            # 4. Chunking noun phrases
            if(chunking):
                nouns = a.noun_phrases + nouns
                # nouns = get_noun_phrases(' '.join(rl.sent)) + nouns

            # 5. Lemmatization            
            if(lemmatization):
                nouns = [n.lemmatize() for n in nouns]
            
            
            # 6. Stemming the words in the noun phrases
            if(stemming):      
                nouns = [es.stem(n) for n in nouns]
            
            ps.pp=nouns
            self.review.append(ps)
                
            
    def sents(self):
        """Return a list of processed sentences tokenized"""
        
        return [s.pp for s in self.review]
    
    def sents_raw(self):
        """Return a list of non-processed sentences tokenized"""
        
        return [s.raw for s in self.review]
    
    def sents_str(self):
        """Return processed sentences as one string"""
        
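        # Join sentences with a space so that the last noun of one
        # sentence does not fuse with the first noun of the next
        return ' '.join(' '.join(s.pp) for s in self.review)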
            
    def sentiment(self):
        """Detecting sentiment polarity in the sentences
        containing product features
        """
             
        for s in self.review:               
            s.sentiment()
            
    def features(self):
        """Return list of features mentioned in the review"""
        
        self.ft = []
        for s in self.review:
            self.ft.extend(s.ft.keys())
        return list(set(self.ft))
            
            
    def __str__(self):
        """Return original review as one string"""
        
        rstr = ""
        for s in self.NltkReview.sents():
            rstr += ' '.join(s)+'\n'
        return rstr

    
class PReviews:
    """PReviews represents a set of reviews for one product 
    e.g. set of user reviews for camera Canon G3"""
    
    def __init__(self, corpus, name):  
        self.name = name        # Name of the product corpus
        self.NltkCorpus = corpus.reviews(name)
        self.eval = {}          # Evaluation score
        self.report = {}        # Evaluation report
        self.test_report = {}   # Target values (from annotations)
        
        self.revs = []
        for r in self.NltkCorpus:
            self.revs.append(PRev(r))
            
        
    def preprocess(self, 
                         spelling = False,
                         stemming = False,
                         chunking = True,
                         lemmatization = True):
        """Preprocess all reviews"""
        
        start = time.time()

        for r in self.revs:
            r.preprocess(spelling = spelling,
                         stemming = stemming,
                         chunking = chunking,
                         lemmatization=lemmatization)
            
        end = time.time()

        print("Preprocessed {} reviews in {:.2f} seconds (spelling correction={}, stemming={})"
              .format(len(self.revs), end-start, spelling, stemming))
        
    def get_tfidf_top_features(self, 
                               documents, 
                               tfidf_max_df=0.90, 
                               tfidf_min_df=0.05, 
                               tfidf_max_df_ngram_range=(1,3),
                               n_top=30):
        """Calculate TFIDF matrix and return top N terms 
        with the highest TFIDF score
        
        """
        
        tfidf_vectorizer = TfidfVectorizer(max_df=tfidf_max_df, 
                                           min_df=tfidf_min_df, 
                                           stop_words='english', 
                                           ngram_range=tfidf_max_df_ngram_range)
        tfidf = tfidf_vectorizer.fit_transform(documents)
        importance = np.argsort(np.asarray(tfidf.sum(axis=0)).ravel())[::-1]
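        # get_feature_names() was removed in scikit-learn 1.2;
        # get_feature_names_out() is available from version 1.0 onwards
        tfidf_feature_names = np.array(tfidf_vectorizer.get_feature_names_out())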
        
        # Maybe distribute across various n-gram lengths so that unigrams don't dominate?
        #
        # features_by_gram = defaultdict(list)
        # for f, w in zip(tfidf_vectorizer.get_feature_names(), tfidf_vectorizer.idf_):
        #    features_by_gram[len(f.split(' '))].append((f, w))

        # print(tfidf_vectorizer.get_feature_names())

        # for gram, features in features_by_gram.items():
        #    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:n_top]

        #    top_features = [f[0] for f in top_features]
        #    print('{}-gram top:{}'.format(gram, top_features))
    
        return tfidf_feature_names[importance[:n_top]]
    
    
    def features(self,
                 apr = True,
                 prune = True,
                 tfidf_max_df=0.90, 
                 tfidf_min_df=0.05, 
                 tfidf_max_df_ngram_range=(1,3),
                 tfidf_top_n=10,
                 min_support=0.005, 
                 min_confidence=0.2, 
                 min_lift=3, 
                 min_length=3):
        
        """Mine product reviews for all product feature candidates."""
        
        # Preprocess on demand if no review has been preprocessed yet
        if not any(r.review for r in self.revs):
            self.preprocess()
        
        sentence_dict = []       
        for r in self.revs:
            sentence_dict.extend(r.sents()) 
        
        
        #
        # Extracting feature candidates based on the term frequency (TFIDF)
        #
        if len(self.revs) > 1:
            # Normal review file with multiple reviews separated by title
            top_n = list(self.get_tfidf_top_features([r.sents_str() for r in self.revs], 
                                                     tfidf_max_df=tfidf_max_df, 
                                                     tfidf_min_df=tfidf_min_df, 
                                                     tfidf_max_df_ngram_range=tfidf_max_df_ngram_range,
                                                     n_top=tfidf_top_n))
        else:
            # Some review files contain one single title (one review with many lines)
            top_n = list(self.get_tfidf_top_features([' '.join(s) for s in sentence_dict], n_top=30))
            
        
                       
        #
        # Mining frequent items sets using the Apriori algorithm 
        # to find more potential features:        
        #
        ar=None
        if apr:
            self.association_rules = apriori(sentence_dict, 
                                        min_support=min_support, 
                                        min_confidence=min_confidence, 
                                        min_lift=min_lift, 
                                        min_length=min_length)

            ar = list(self.association_rules)
            for f in ar:
                top_n.extend(list(f.items))
        
        # Drop duplicates:
        top_n = list(set(top_n))
        
        #
        #  Pruning the unwanted features:
        #
        if prune:
            top_n = self.feature_prunning(top_n)
        
        # Tag each sentence in the corpus where one of the features is present:
        self._tag_sentences_with_features(top_n, ar)
        
        return top_n
    
    def opinions(self):
        """Extracting opinion words (adjectives and adverbs)
        JJ adjective, JJR adj comparative, JJS adj superlative
        RB adverb, RBR comparative, RBS superlative        
        """
        
        for r in self.revs:
            for s in r.review:               
                for w in TextBlob(str(s)).tags:
                    if (w[1][:2] == 'JJ' or
                       w[1][:2] == 'RB'):
                            # print(w[0])
                            s.op.append(w[0])
    
    def sentiment(self):
        """Detecting sentiment polarity in the sentences
        containing product features        
        """
        
        for r in self.revs:
            r.sentiment()
                        
    def gen_report(self):
        """Generate a feature report per product"""
        
        for r in self.revs:
            for s in r.review:               
                for feat, polarity in s.ft.items():
                    self.report.setdefault(feat, (0,0))
                    if polarity > 0:
                        self.report[feat] = (self.report[feat][0]+1,
                                             self.report[feat][1])
                    elif polarity < 0:                
                        self.report.setdefault(feat, (0,0))
                        self.report[feat] = (self.report[feat][0],
                                             self.report[feat][1]+1)
            
        return self.report
    
    def print_report(self, top_n=5):
        """Print the feature report per product"""
        
        if not self.report:
            self.gen_report()
            
        sorted_dict = dict(sorted(self.report.items(), key=operator.itemgetter(1), reverse=True)[:top_n])
            
        print("\nProduct: ", self.name)
        for feat, score in sorted_dict.items():
            print("\t\nFeature: ", feat)
            print("\t\tPositive: ", sorted_dict[feat][0])
            print("\t\tNegative: ", sorted_dict[feat][1])
    
    def extract_tagged_data(self):
        self.test_report = {}
        for r in self.NltkCorpus:
            # print("\n")
            for f in r.features():
                feat = f[0] # name of feature
                score_sign = f[1][0] #just the plus or minus sign
                self.test_report.setdefault(feat, (0,0))
                if score_sign == '+':
                    self.test_report[feat] = (self.test_report[feat][0]+1,
                                                 self.test_report[feat][1])
                elif score_sign == '-':
                    self.test_report[feat] = (self.test_report[feat][0],
                                                 self.test_report[feat][1]+1)
        return self.test_report
    
    def fscore(self, tp, fn, fp):
        """Calculate F-score"""
        
        recall = 0
        prec = 0
        fscore = 0
        
        if(tp+fn)>0:
            recall = tp/(tp+fn) 
        
        if(tp+fp)>0:
            prec = tp/(tp+fp) 
            
        if(recall+prec)>0:
            fscore = 2*(recall*prec)/(recall+prec)
        
        return recall, prec, fscore
    
    def feat_evaluation(self, mute_output=False):
        """Evaluation of feature extraction success
        Look at features that were picked up by mining and
        those that were missed based on comparison with the
        annotated data in NLTK corpus
        
        set mute_output to just return results without outputting to the console
        """
        
        if not self.test_report:
            self.extract_tagged_data()
            
        if not self.report:
            self.gen_report()
            

        f_tp = []
        f_fp = []
        f_fn = []
        
        for feat in self.report.keys():
            if feat in self.test_report:
                # cfmatch += 1
                f_tp.append(feat)               
            else:                
                f_fp.append(feat)                
                
        for feat in self.test_report.keys():
            if feat not in self.report:                
                f_fn.append(feat)
        
        recall, prec, fscore = self.fscore(len(f_tp),
                                           len(f_fn),
                                           len(f_fp))
        
        results1 = [len(f_tp),len(f_fn),len(f_fp),recall,prec, fscore]
    
        if not mute_output:
            print("Looking at all product features together:")
            print("|\tTP\t|\tFN\t|\tFP\t|\tRecall\t|\tPrecision\t|\tF-score\t|")
            print("|---|---|---|---|---|---|")
            print("|\t{0}\t|\t{1}\t|\t{2}\t|\t{3:.2f}\t|\t{4:.2f}\t|\t{5:.2f}\t|".format(len(f_tp), 
                                                                                         len(f_fn), 
                                                                                         len(f_fp),
                                                                                        recall,
                                                                                        prec,
                                                                                        fscore))
        
            print("\nTP")
            print('%s' % ', '.join(map(str, f_tp)))

            print("\nFN")
            print('%s' % ', '.join(map(str, f_fn)))

            print("\nFP")
            print('%s' % ', '.join(map(str, f_fp)))

        
        # Calculate recall / precision and F1 score per sentence, review and product:
        self.eval["tp"] = 0 # true positive (labelled feature found in mined features)
        self.eval["fn"] = 0 # false negative (labelled feature not found in mined features)
        self.eval["fp"] = 0 # false positive, mined feature that is not present in labelled
        self.eval["recall"] = 0 
        self.eval["prec"] = 0 
        self.eval["fscore"] = 0 
        for r in self.revs:
            r.eval["tp"] = 0 # true positive (labelled feature found in mined features)
            r.eval["fn"] = 0 # false negative (labelled feature not found in mined features)
            r.eval["fp"] = 0 # false positive, mined feature that is not present in labelled
            r.eval["recall"] = 0 
            r.eval["prec"] = 0 
            r.eval["fscore"] = 0 
            for s in r.review: # iterate through each sentence
                s.eval["tp"] = 0 # true positive (labelled feature found in mined features)
                s.eval["fn"] = 0    # false negative (labelled feature not found in mined features)
                s.eval["fp"] = 0    # false positive, mined feature that is not present in labelled
                s.eval["recall"] = 0 
                s.eval["prec"] = 0 
                s.eval["fscore"] = 0 
                for lf in s.test: # iterate through labelled features
                    if lf[0] in s.ft.keys(): # comparing to mined features
                        s.eval["tp"] += 1
                    else:                        
                        s.eval["fn"] += 1
                for mf in s.ft.keys(): # iterate through mined features
                    found = False
                    for lf in s.test:                        
                        if mf == lf[0]:
                            found = True
                            break
                    if not found:
                        s.eval["fp"] += 1                        
                        
                r.eval["tp"] += s.eval["tp"] # true positive (labelled feature found in mined features)
                r.eval["fn"] += s.eval["fn"] # false negative (labelled feature not found in mined features)
                r.eval["fp"] += s.eval["fp"] # false positive, mined feature that is not present in labelled
                
                # Recall / precision / F1 score per sentence:
                s.eval["recall"], s.eval["prec"], s.eval["fscore"] = self.fscore(s.eval["tp"],
                                                                               s.eval["fn"],
                                                                               s.eval["fp"])
                
            self.eval["tp"] += r.eval["tp"] # true positive (labelled feature found in mined features)
            self.eval["fn"] += r.eval["fn"] # false negative (labelled feature not found in mined features)
            self.eval["fp"] += r.eval["fp"] # false positive, mined feature that is not present in labelled
                
            # Recall / precision / F1 score per review:
            r.eval["recall"], r.eval["prec"], r.eval["fscore"] = self.fscore(r.eval["tp"],
                                                                           r.eval["fn"],
                                                                           r.eval["fp"])
            
        # Recall / precision / F1 score per product:
        self.eval["recall"], self.eval["prec"], self.eval["fscore"] = self.fscore(self.eval["tp"],
                                                                                   self.eval["fn"],
                                                                                   self.eval["fp"])
        
        results2 = [self.eval["tp"],
                    self.eval["fn"],
                    self.eval["fp"],
                    self.eval["recall"],
                    self.eval["prec"],
                    self.eval["fscore"]]
        if not mute_output:
            print("\n\nLooking at product features per individual sentence:")
            print("|\tTP\t|\tFN\t|\tFP\t|\tRecall\t|\tPrec\t|\tF score\t|")
            print("|\t{0}\t|\t{1}\t|\t{2}\t|\t{3:.2f}\t|\t{4:.2f}\t|\t{5:.2f}\t|"
                  .format(self.eval["tp"],
                        self.eval["fn"],
                        self.eval["fp"],
                        self.eval["recall"],
                        self.eval["prec"],
                        self.eval["fscore"]))

        return results1, results2
        
    def sent_evaluation(self, mute_output=False):   
        """Evaluation of sentiment analysis success
        
        set mute_output to just return results without outputting to the console
        """
        
        if not self.test_report:
            self.extract_tagged_data()
            
        if not self.report:
            self.gen_report()
            
        # First collect features that were correctly mined
        f_tp = []        
        for feat in self.report.keys():
            if feat in self.test_report:
                f_tp.append(feat)

        # Calculate recall / precision and F1 score per sentence, review and product
        # on those sentences that contain features that were correctly mined:
        self.eval["stp"] = 0 # true positive (positive feature labelled positive)
        self.eval["sfn"] = 0 # false negative (positive feature not labelled positive)
        self.eval["sfp"] = 0 # false positive (negative feature labelled positive)
        self.eval["srecall"] = 0 
        self.eval["sprec"] = 0 
        self.eval["sfscore"] = 0 
        for r in self.revs:
            r.eval["stp"] = 0 # true positive (positive feature labelled positive)
            r.eval["sfn"] = 0 # false negative (positive feature not labelled positive)
            r.eval["sfp"] = 0 # false positive (negative feature labelled positive)
            r.eval["srecall"] = 0 
            r.eval["sprec"] = 0 
            r.eval["sfscore"] = 0 
            for s in r.review: # iterate through each sentence
                s.eval["stp"] = 0 # true positive (positive feature labelled positive)
                s.eval["sfn"] = 0    # false negative (positive feature not labelled positive)
                s.eval["sfp"] = 0    # false positive (negative feature labelled positive)
                s.eval["srecall"] = 0 
                s.eval["sprec"] = 0 
                s.eval["sfscore"] = 0 
                for lf in s.test: # iterate through labelled features
                    if lf[0] in s.ft: # comparing to mined features
                        if sign(int(lf[1])) == s.ft[lf[0]]:
                            s.eval["stp"] += 1
                            # print("TP {}:{}=={}:{}".format(lf[0],sign(int(lf[1])),lf[0], s.ft[lf[0]]))
                        elif sign(int(lf[1])) == 1.0:
                            s.eval["sfn"] += 1
                            # print("Falsely negative:")
                            # print(s)
                        else:
                            s.eval["sfp"] += 1
                            # print("Falsely positive:")
                            # print(s)
                        
                r.eval["stp"] += s.eval["stp"] # true positive (positive feature labelled positive)
                r.eval["sfn"] += s.eval["sfn"] # false negative (positive feature not labelled positive)
                r.eval["sfp"] += s.eval["sfp"] # false positive (negative feature labelled positive)
                
                # Recall / precision / F1 score per sentence:
                s.eval["srecall"], s.eval["sprec"], s.eval["sfscore"] = self.fscore(s.eval["stp"],
                                                                               s.eval["sfn"],
                                                                               s.eval["sfp"])
                
            self.eval["stp"] += r.eval["stp"] # true positive (positive feature labelled positive)
            self.eval["sfn"] += r.eval["sfn"] # false negative (positive feature not labelled positive)
            self.eval["sfp"] += r.eval["sfp"] # false positive (negative feature labelled positive)
                
            # Recall / precision / F1 score per review:
            r.eval["srecall"], r.eval["sprec"], r.eval["sfscore"] = self.fscore(r.eval["stp"],
                                                                           r.eval["sfn"],
                                                                           r.eval["sfp"])
            
        # Recall / precision / F1 score per product:
        self.eval["srecall"], self.eval["sprec"], self.eval["sfscore"] = self.fscore(self.eval["stp"],
                                                                                   self.eval["sfn"],
                                                                                   self.eval["sfp"])
        results = [self.eval["stp"],
                    self.eval["sfn"],
                    self.eval["sfp"],
                    self.eval["srecall"],
                    self.eval["sprec"],
                    self.eval["sfscore"]]
        if not mute_output:
            print("\n\nLooking at sentiment evaluation per individual sentence:")
            print("|\tTP\t|\tFN\t|\tFP\t|\tRecall\t|\tPrec\t|\tF score\t|")
            print("|\t{0}\t|\t{1}\t|\t{2}\t|\t{3:.2f}\t|\t{4:.2f}\t|\t{5:.2f}\t|"
                  .format(self.eval["stp"],
                        self.eval["sfn"],
                        self.eval["sfp"],
                        self.eval["srecall"],
                        self.eval["sprec"],
                        self.eval["sfscore"]))
        return results
    
    @staticmethod
    def d(w1, w2, words):
        """Calculate distance between two words in a sentence
        
        The words might appear many times in a single sentence, so the
        minimum and the average distance over all pairs of occurrences
        are returned. Returns None if either word is missing.
        """
        
        if w1 in words and w2 in words:
            w1_indexes = [index for index, value in enumerate(words) if value == w1]    
            w2_indexes = [index for index, value in enumerate(words) if value == w2]    
            distances = [abs(item[0] - item[1]) for item in itertools.product(w1_indexes, w2_indexes)]
            return {'min': min(distances), 'avg': sum(distances)/float(len(distances))}
    
    def compactness_prunning(self, multiple_word_features):
        """Prune features containing multiple words 
        based on the distance between those words in a sentence
        """
        
        exclude = []
        for (mw, parts) in multiple_word_features:
            compact_count = 0
            if len(parts) == 2:
                for r in self.revs:
                    for s in r.review:
                        if mw in s.ft:
                            # d is a static method, so it is called through self
                            distance = self.d(parts[0],parts[1],s.raw)
                            if distance and distance["min"] <= 3:
                                # print("{} is compact in {}".format(mw,s.raw))
                                compact_count+=1
                    if compact_count >= 2:
                        break
            if compact_count < 2:
                exclude.append(mw)
        return exclude            
    
    def feature_prunning(self, features):
        """Prune features based on compactness pruning and
        redundancy pruning
        """
        
        hierarchy = {}
        multiword = []
        exclude = []
        
        count_start = len(features)

        for f in features:
            parts = f.split(' ')
            if len(parts) == 1:
                hierarchy[f]=[]
            else:
                # check if words in the multi-word feature repeat
                if not len(set(parts)) == len(list(parts)):
                    #print("Exclude (repetition): ", f)
                    exclude.append(f)
                else:
                    multiword.append((f,parts))
                    
        exclude.extend(self.compactness_prunning(multiword))

        for mw, parts in multiword:
            for p in parts:
                if p in hierarchy.keys():
                    # multiword feature is a narrower category of feature p
                    hierarchy[p].append(mw)

        document_presence = {}
        for w, more_specific in hierarchy.items():
            for mw in more_specific:
                both = 0
                general = 0
                for r in self.revs:
                    if mw in r.sents_str():
                        both += 1
                    elif w in r.sents_str():
                        general += 1
                # print("{} vs {}, both {} general {} ratio {}".format(w,mw,both,general,(general+both)/both))
                if both > 0 and general/both == 1:                    
                    exclude.append(w)
        
        result = [f for f in features if f not in exclude]
        count_end = len(result)
        
        print("Pruned {} out of {} features".format(count_start-count_end,count_start))
        
        return result


    def _tag_sentences_with_features(self, features, ar=None):
        for r in self.revs:
            for s in r.review:
                # tag every feature found in the preprocessed sentence (s.pp)
                for word in features:
                    if word in s.pp:
                        s.ft[word]=0

                if ar:
                    for f in ar:
                        feat = list(f.items)

                        # every item of the rule must occur in some phrase of s.pp
                        found = False
                        for word in feat:
                            found = False
                            # named 'phrase' to avoid shadowing the numpy alias np
                            for phrase in s.pp:
                                if word == phrase or word in phrase:
                                    found = True
                                    break
                            if not found:
                                break
                        if found:
                            featstr = feat[1]+' '+feat[0]
                            s.ft[featstr]=0
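
The redundancy check in `feature_prunning` can be tried out in isolation. The sketch below is a simplified, standalone version that works on plain strings instead of the review objects. The names `build_hierarchy` and `count_presence` and the sample data are hypothetical and are not part of the notebook's classes.

```python
def build_hierarchy(features):
    """Map each single-word feature to the multi-word features that contain it."""
    single = [f for f in features if " " not in f]
    multi = [f for f in features if " " in f]
    hierarchy = {f: [] for f in single}
    for mw in multi:
        for part in mw.split(" "):
            if part in hierarchy:
                hierarchy[part].append(mw)
    return hierarchy


def count_presence(word, multiword, reviews):
    """Count reviews mentioning the multi-word feature vs. only the single word."""
    both = general = 0
    for text in reviews:
        if multiword in text:
            both += 1
        elif word in text:
            general += 1
    return both, general


# made-up sample data, for illustration only
sample_features = ["battery", "life", "battery life", "lens", "lens cap", "zoom"]
sample_reviews = [
    "the battery life is great",
    "battery life could be better",
    "the battery died",
]

hierarchy = build_hierarchy(sample_features)
print(hierarchy["battery"])   # ['battery life']
print(count_presence("battery", "battery life", sample_reviews))  # (2, 1)
```

Here "battery" appears on its own in one review and as part of "battery life" in two, so the two counts differ and the notebook's condition `general/both == 1` would not exclude it.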
        
                    
                    
                    
        

# Process one product and generate the feature report

In [4]:
c = PReviews(product_reviews_1, 'Canon_G3.txt')
c.preprocess(chunking=True, lemmatization=True)
c.features(apr=True,
           prune=True,
           tfidf_max_df=0.80,
           tfidf_min_df=0.03,
           tfidf_max_df_ngram_range=(1,3),
           tfidf_top_n=100,
           min_support=0.004,
           min_confidence=0.2,
           min_lift=3,
           min_length=3)
c.opinions()
c.sentiment()
c.feat_evaluation(mute_output=False)
c.sent_evaluation(mute_output=True)
c.print_report()

Preprocessed 45 reviews in 4.33 seconds (spelling correction=False, stemming=False)
Pruned 30 out of 134 features
Looking at all product features together:
|	TP	|	FN	|	FP	|	Recall	|	Precision	|	F-score	|
|---|---|---|---|---|---|
|	32	|	73	|	177	|	0.30	|	0.15	|	0.20	|
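
The Recall, Precision and F-score columns follow from the TP, FN and FP counts by the standard definitions. As a quick check, the snippet below recomputes the row above. The helper name `prf` is hypothetical and is not the notebook's own evaluation code.

```python
def prf(tp, fn, fp):
    """Return (recall, precision, F-score) from raw counts, guarding against zero division."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f_score


recall, precision, f_score = prf(32, 73, 177)
print("{:.2f} {:.2f} {:.2f}".format(recall, precision, f_score))  # 0.30 0.15 0.20
```

With 177 false positives against 32 true positives, precision is the limiting factor for the F-score in this run.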

TP
canon, camera, picture, quality, picture quality, flash, feature, use, option, software, control, lens, image, dial, viewfinder, photo, lcd, design, focus, zoom, battery, shoot, lens cap, price, color, shot, product, lag, lag time, compactflash, performance, strap

FN
canon powershot g3, speed, function, auto setting, canon g3, photo quality, darn diopter adjustment dial, exposure control, metering option, spot metering, 4mp, size, weight, optical zoom, digital zoom, menu, button, lense, auto mode, canera, print, manual mode, feel, four megapixel, night mode, lens cover, zooming lever, white balance, grain, flash photo, noise, g3, depth, external flash hot shoe, raw image, battery life, manual function, service, autom

# Evaluating results on multiple products / markdown generator

In [5]:
# Output the evaluation results of the opinion miner for the products in both corpora,
# formatted as markdown tables. The `count < 5` check limits the run to the first five products.

results1 = {}
results2 = {}
results3 = {}
sum1=[0.0,0.0,0.0,0.0,0.0,0.0]
sum2=[0.0,0.0,0.0,0.0,0.0,0.0]
sum3=[0.0,0.0,0.0,0.0,0.0,0.0]

count = 0
for corpus in [product_reviews_1, product_reviews_2]:
    for product in corpus.fileids():
        if count < 5:            
            if product not in ['README.txt','ipod.txt', 'norton.txt']:            
                c = PReviews(corpus, product)
                c.preprocess(chunking=True, lemmatization=True, spelling = False, stemming = False)
                c.features(apr=True,
                           prune=True,
                           tfidf_max_df=0.80,
                           tfidf_min_df=0.03,
                           tfidf_max_df_ngram_range=(1,3),
                           tfidf_top_n=100,
                           min_support=0.003,
                           min_confidence=0.2,
                           min_lift=3,
                           min_length=3)
                c.opinions()
                c.sentiment()
                r1,r2 = c.feat_evaluation(mute_output=True)
                r3 = c.sent_evaluation(mute_output=True)
                results1[product]=r1
                results2[product]=r2
                results3[product]=r3
                sum1 = [x + y for x, y in zip(sum1, r1)]            
                sum2 = [x + y for x, y in zip(sum2, r2)]
                sum3 = [x + y for x, y in zip(sum3, r3)]
                count+=1
            
sum1=[x / count for x in sum1]
sum2=[x / count for x in sum2]
sum3=[x / count for x in sum3]

HEADER = "|\tName\t|\tTP\t|\tFN\t|\tFP\t|\tRecall\t|\tPrecision\t|\tF-score\t|"
ALIGN = "|:---|:---:|:---:|:---:|:---:|:---:|:---:|"
ROW = "|\t{6}\t|\t{0}\t|\t{1}\t|\t{2}\t|\t{3:.2f}\t|\t{4:.2f}\t|\t{5:.2f}\t|"
AVG_ROW = ("|\t{6}\t|\t**{0:.2f}**\t|\t**{1:.2f}**\t|\t**{2:.2f}**\t|"
           "\t**{3:.2f}**\t|\t**{4:.2f}**\t|\t**{5:.2f}**\t|")

def print_markdown_table(title, results, averages):
    """Print one evaluation table: a row per product plus an averages row."""
    print(title)
    print(HEADER)
    print(ALIGN)
    for product, r in results.items():
        print(ROW.format(*r[:6], product))
    print(AVG_ROW.format(*averages[:6], "Average values"))

print_markdown_table("Looking at features per product:", results1, sum1)
print_markdown_table("Looking at features per sentence:", results2, sum2)
print_markdown_table("Looking at sentiment analysis:", results3, sum3)

Preprocessed 99 reviews in 0.76 seconds (spelling correction=False, stemming=False)
Pruned 29 out of 134 features
Preprocessed 45 reviews in 0.83 seconds (spelling correction=False, stemming=False)
Pruned 78 out of 240 features
Preprocessed 95 reviews in 1.85 seconds (spelling correction=False, stemming=False)
Pruned 14 out of 113 features
Preprocessed 34 reviews in 0.40 seconds (spelling correction=False, stemming=False)
Pruned 44 out of 159 features
Preprocessed 40 reviews in 0.79 seconds (spelling correction=False, stemming=False)
Pruned 45 out of 181 features
Looking at features per product:
|	Name	|	TP	|	FN	|	FP	|	Recall	|	Precision	|	F-score	|
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
|	Apex_AD2600_Progressive_scan_DVD player.txt	|	39	|	76	|	171	|	0.34	|	0.19	|	0.24	|
|	Canon_G3.txt	|	41	|	64	|	554	|	0.39	|	0.07	|	0.12	|
|	Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB.txt	|	47	|	141	|	133	|	0.25	|	0.26	|	0.26	|
|	Nikon_coolpix_4300.txt	|	22	|	53	|	266	|	0.29	|	0.08	|	0.12	|
|	Nokia_

# Manually parsing the corpora


In [19]:
def parse_tagged_reviews(path):
    """Return a list of reviews, each a list of sentence strings."""
    reviews = []
    text = []
    with open(path, 'r') as f:
        for line in f:
            if line.startswith("*"):
                # skip comment
                continue
            elif line.startswith("[t]"):
                # title line: starts a new review; the title itself is not used
                if text: # nothing to store before the first title
                    reviews.append(text)
                text = [] # reset last review
            elif line.startswith("##"): # sentence without annotations
                text.append(line[2:])
            elif "##" in line: # annotations, then "##", then the sentence
                # keep only the sentence; the annotations are discarded
                text.append(line.split("##", 1)[1])
    # append the last review, if any
    if text:
        reviews.append(text)
    return reviews
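
The parser above keeps only the sentence text and discards the annotations that precede `##`. A minimal sketch of how those annotations could be decoded, following the `xxxx[+|-n]` convention quoted from the corpus Readme, is shown below. The function name `parse_annotations`, the regular expression and the sample line are illustrative assumptions, not part of the notebook or the corpus.

```python
import re

# A feature name followed by a signed strength such as [+2] or [-1].
# Treating a bare [+] or [-] as strength 1 is an assumption of this sketch.
ANNOTATION = re.compile(r"([^,\[\]]+)\[([+-])(\d?)\]")


def parse_annotations(prefix):
    """Return (feature, signed strength) pairs from the text before '##'."""
    pairs = []
    for name, sign, strength in ANNOTATION.findall(prefix):
        value = int(strength) if strength else 1
        pairs.append((name.strip(), value if sign == "+" else -value))
    return pairs


# made-up line in the annotated format, for illustration only
line = "picture quality[+2], battery life[-1][u]##the pictures are sharp but the battery drains fast ."
prefix, sentence = line.split("##", 1)
print(parse_annotations(prefix))  # [('picture quality', 2), ('battery life', -1)]
```

Tags without a sign, such as `[u]` or `[p]`, are skipped by this pattern; handling them would need a separate rule.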
