# Step 1: Analyse the data and task

The task involves mining people's opinions of product features within Amazon reviews. I will be using a similar approach to Hu and Liu (2004). Product features can be extracted from the reviews of each product by extracting nouns and noun phrases using PoS tags and chunking. Once the features have been identified for a product, opinions can be mined by looking for review sentences which mention those features and include sentiment-bearing words, such as adjectives. Sentiment analysis using a lexicon can then be used to label each mention of a product feature as positive or negative. Once this process is complete, summaries can be produced to show the number of positive and negative review sentences for each of the product features identified. This process can be carried out for all of the different products in the data.

In [1]:
data_file = "Data/Data/Customer_review_data/Canon G3.txt"
with open(data_file) as f:
    print(f.read())

*****************************************************************************
* Annotated by: Minqing Hu and Bing Liu, 2004.              
*		Department of Computer Sicence
*               University of Illinios at Chicago
*
* Product name: Canon G3
* Review Source: amazon.com
*
* See Readme.txt to find the meaning of each symbol. 
*****************************************************************************

[t]excellent picture quality / color 
canon powershot g3[+3]##i recently purchased the canon powershot g3 and am extremely satisfied with the purchase . 
use[+2]##the camera is very easy to use , in fact on a recent trip this past week i was asked to take a picture of a vacationing elderly group . 
##after i took their picture with their camera , they offered to take a picture of us . 
##i just told them , press halfway , wait for the box to turn green and press the rest of the way . 
picture[+2]##they fired away and the picture turned out quite nicely . ( as all of my pictures ha

Throughout this Jupyter notebook, I go through each step of the opinion mining application using the Canon G3 camera data. At the bottom of the notebook I analyse every product, but it is useful to go through the process with a single example first and see the various outputs along the way. Looking at the camera review file, I can see a header at the top referencing Hu and Liu (2004). This needs to be excluded from the analysis as it is not part of the reviews. The main body of the file is made up of reviews of the camera, which, as the Readme file points out, are split into sentences using '##'. This matters because the application analyses the data at the sentence level, so the review sentences must be separated reliably.

The manually labelled product features are added before the start of the respective sentence. These will be important as they will act as the ground truth product feature labels, so will be used for evaluation. They need to be separated from the raw data so as not to contaminate the data.
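
As a rough illustration of this layout, the sketch below splits one annotated line into its tags and its sentence text using only the standard library. The helper name `split_annotated_line` and the regular expression are assumptions made for illustration, not part of the dataset's tooling, and the second example line is invented rather than taken from the file.

```python
import re

# Matches one tag such as 'picture quality[+2]' or 'battery[-1]':
# a feature name followed by a signed strength in square brackets.
# (Assumed pattern, written for this sketch only.)
TAG_RE = re.compile(r"\s*(?P<feature>[^\[\],]+?)\s*\[(?P<sign>[+-])(?P<strength>\d*)\]")

def split_annotated_line(line):
    """Return (positive_features, negative_features, sentence_text) for one line."""
    tag_part, _, text = line.partition("##")
    positives, negatives = [], []
    for match in TAG_RE.finditer(tag_part):
        target = positives if match.group("sign") == "+" else negatives
        target.append(match.group("feature"))
    return positives, negatives, text.strip()

# A line taken from the Canon G3 file shown above.
real_line = ("canon powershot g3[+3]##i recently purchased the canon powershot g3 "
             "and am extremely satisfied with the purchase . ")
print(split_annotated_line(real_line))

# An invented line with one positive and one negative tag.
toy_line = "camera[+2], battery[-1]##great camera but the battery drains fast ."
print(split_annotated_line(toy_line))
```

Keeping the tags and the sentence text apart in this way means the labels can serve as ground truth without leaking into the text that is analysed.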

In [11]:
import pandas as pd
import os
import re
import spacy
from pprint import pprint
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import statistics as stats

# Step 2: Apply data pre-processing steps

The process function is used to lemmatize the data. I have chosen lemmatizing over stemming because it is less aggressive, so it maintains whole words. This is important as it should help the text retain more meaning and make for more accurate PoS tagging and chunking.

The only other thing the function does is remove punctuation and any literal '\n' sequences left in the text, using a regular expression. Punctuation is not needed for the feature extraction or the lexicon lookup used here, and I will not be analysing it, so removing it cleans the data up.

In [3]:
lem = WordNetLemmatizer()
stoplist = set(stopwords.words('english'))

def process(raw_string):
    # the regex removes backslash escapes such as a literal '\n', and punctuation
    tokens = nltk.word_tokenize(re.sub(r"(\\\w)|([^\w\s])", '', raw_string))
    lemmatized = " ".join([lem.lemmatize(t) for t in tokens])
    return lemmatized
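
To make the effect of the regular expression easier to see without loading NLTK, here is a minimal sketch that applies the same pattern and then splits on whitespace. The function name `strip_punctuation` is hypothetical, and whitespace splitting only stands in for `nltk.word_tokenize`, so the tokens may differ from the real pipeline on some inputs. No lemmatization is applied here.

```python
import re

# Same pattern as in process(): a backslash followed by a word character,
# or any single character that is neither a word character nor whitespace.
PUNCT_RE = re.compile(r"(\\\w)|([^\w\s])")

def strip_punctuation(raw_string):
    """Remove punctuation and backslash escapes, leaving words and whitespace."""
    return PUNCT_RE.sub("", raw_string)

raw = "the camera is very easy to use , in fact it 's great ! "
tokens = strip_punctuation(raw).split()
print(tokens)
```

Because whitespace is left untouched by the pattern, real newline characters are only discarded later, when the text is tokenized.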

The following cell shows the main pre-processing steps. It takes a chosen file as input (here, the camera file) and returns three dictionaries, all of which have the sentence ID as their keys. The values of the sentences dictionary are the processed review sentences. The other two hold the positive and negative sentiment ground truth tags. The product feature tags are removed from the document and listed in these dictionaries so they can be easily called on for evaluation.

When adding the review sentences to the dictionary, I have made sure to skip the lines of the file header, which start with an asterisk. I also skip the review titles, because Hu and Liu (2004) did not annotate them, so there are no labels for them. I use a regex search to find the positive and negative sentiments so that I can add the respective product feature to the correct dictionary. I have chosen to use dictionaries because they are very efficient for querying, and because keys can't be duplicated, which means each sentence in the data can only have one entry in the dictionary.

In [4]:
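def read_in(file):
    sentences = {}  # sentence ID -> processed review sentence
    pos_dic = {}    # sentence ID -> ground truth features tagged positive
    neg_dic = {}    # sentence ID -> ground truth features tagged negative
    sent_id = 0
    with open(file) as f:
        doc = f.read()
    for sent in doc.split('\n'):
        # skip the file header, review titles, blank lines and lines without '##'
        if sent.startswith("*") or sent.startswith("[t]") or sent == "" or "##" not in sent:
            continue
        sentences[sent_id] = process(sent[sent.index("##")+2:])

        pos_dic[sent_id] = []
        neg_dic[sent_id] = []

        # everything before '##' is a comma-separated list of tags, e.g. 'use[+2]';
        # tags after a comma keep their leading space, e.g. ' use'
        tags = sent[:sent.index("##")].split(',')
        for tag in tags:
            pos = re.search(r"\+", tag)
            neg = re.search(r"-", tag)
            if pos:
                # drop the '[+n]' suffix, keeping only the feature name
                pos_dic[sent_id].append(tag[:pos.start()-1])
            elif neg:
                neg_dic[sent_id].append(tag[:neg.start()-1])

        sent_id += 1
    return sentences, pos_dic, neg_dic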

sent, pos, neg = read_in(data_file)

pprint(pos)
pprint(neg)


{0: ['canon powershot g3'],
 1: ['use'],
 2: [],
 3: [],
 4: ['picture'],
 5: ['picture quality'],
 6: ['picture quality'],
 7: [],
 8: ['camera', ' use', ' feature'],
 9: ['picture quality', ' use', ' option'],
 10: [],
 11: [],
 12: ['camera'],
 13: [],
 14: ['picture'],
 15: ['use'],
 16: [],
 17: [],
 18: [],
 19: [],
 20: [],
 21: [],
 22: [],
 23: [],
 24: ['camera'],
 25: [],
 26: [],
 27: [],
 28: ['camera'],
 29: ['speed', 'picture quality', 'function'],
 30: ['auto setting'],
 31: [],
 32: ['camera'],
 33: [],
 34: ['canon g3'],
 35: ['photo quality'],
 36: ['feature'],
 37: [],
 38: [],
 39: [],
 40: ['camera'],
 41: ['camera'],
 42: ['photo quality'],
 43: [],
 44: [],
 45: [],
 46: [],
 47: [],
 48: ['exposure control', 'auto setting'],
 49: ['metering option'],
 50: ['spot metering'],
 51: ['4mp'],
 52: ['zoom'],
 53: [],
 54: ['focus'],
 55: ['focus'],
 56: [],
 57: [],
 58: [],
 59: [],
 60: ['feature'],
 61: ['lcd'],
 62: [],
 63: ['optical zoom', 'digital zoom'],
 64:

In [5]:
pprint(sent)

{0: 'i recently purchased the canon powershot g3 and am extremely satisfied '
    'with the purchase',
 1: 'the camera is very easy to use in fact on a recent trip this past week i '
    'wa asked to take a picture of a vacationing elderly group',
 2: 'after i took their picture with their camera they offered to take a '
    'picture of u',
 3: 'i just told them press halfway wait for the box to turn green and press '
    'the rest of the way',
 4: 'they fired away and the picture turned out quite nicely a all of my '
    'picture have thusfar',
 5: 'a few of my work constituants owned the g2 and highly recommended the '
    'canon for picture quality',
 6: 'i m easily enlarging picture to 8 12 x 11 with no visable loss in picture '
    'quality and not even using the best possible setting a yet super fine',
 7: 'ensure you get a larger flash 128 or 256 some are selling with the larger '
    'flash 32mb will do in a pinch but you ll quickly want a larger flash card '
    'a with any of

# Step 3: Extract relevant information

This step requires product features to be extracted from the review sentences. Here, I am using the spaCy module to make use of its PoS tagger and chunking algorithms. As in the approach by Hu and Liu (2004), I have extracted all the nouns and noun phrases from the corpus to use as potential product features to mine opinions on. I create a dictionary for the nouns and noun phrases (nouns), which has the word or phrase as the key and the number of times it appears in the corpus as the value. I filter this dictionary by keeping only the entries whose count exceeds 1% of the number of review sentences (Hu and Liu, 2004), as it can be expected that people reviewing the same product will talk about the same kinds of features.

I then process the list of frequent features, first removing stopwords, because stopwords inside noun phrases reduce the chance of matching the ground truth product features. Some unigrams in the list were themselves stopwords, so removing them left blank list entries which also needed to be removed. I also removed duplicates from the list, as these could cause errors to propagate in the sentiment analysis stage. As can be seen from the print statements, this processing is beneficial as it reduces the list of frequent features from 114 down to 83.

With the list of product features I have mined from the data, I can now extract opinion sentences about those features in step 4.
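
The filtering described above can be sketched with the standard library alone. The candidate list, the counts, the small stopword set and the function name `filter_frequent` are all invented for illustration; the notebook itself takes its candidates from spaCy and its stopwords from NLTK.

```python
from collections import Counter

STOPWORDS = {"the", "a", "this", "it"}  # hypothetical stand-in for NLTK's list

def filter_frequent(candidates, n_sentences, min_percent=1.0):
    """Keep candidates whose count exceeds min_percent of the sentence count,
    then strip stopwords, drop empty strings and remove duplicates in order."""
    counts = Counter(candidates)
    frequent = [c for c, n in counts.items() if n / n_sentences * 100 > min_percent]
    stripped = (" ".join(w for w in c.split() if w not in STOPWORDS) for c in frequent)
    return list(dict.fromkeys(s for s in stripped if s))

# Toy input: with 200 sentences, a candidate needs more than 2 occurrences.
candidates = (["camera"] * 5 + ["the camera"] * 3 + ["it"] * 4
              + ["tripod"] * 1 + ["picture quality"] * 4)
print(filter_frequent(candidates, n_sentences=200))
```

In this toy run 'the camera' collapses into 'camera' once the stopword is removed, which is why the duplicate removal step is needed after the stopword step rather than before it.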

In [6]:
nlp = spacy.load("en_core_web_sm")
def frequent_features(dic):
    n_sentences = len(dic)
    corpus = ' '.join(dic.values())
    doc = nlp(corpus)
    nouns = {}  # noun or noun phrase -> number of occurrences in the corpus
    frequent = []
    for token in doc:
        if token.pos_ == 'NOUN':
            nouns[token.text] = nouns.get(token.text, 0) + 1
    
    for chunk in doc.noun_chunks:
        nouns[chunk.text] = nouns.get(chunk.text, 0) + 1
        
    for key, value in nouns.items():
        if (value / n_sentences) * 100 > 1: # count exceeds 1% of the number of review sentences
            frequent.append(key)
    
    f_stop = []
    pprint("original")
    pprint(len(frequent))
    pprint(frequent)
    
    # remove stopwords from each candidate feature
    for f in frequent:
        f_tokens = nltk.word_tokenize(f)
        f_stop.append(' '.join([word for word in f_tokens if word not in stoplist]))
    
    pprint("after removing stopwords")
    print(len(f_stop))
    pprint(f_stop)
    print()
    
    # candidates made up entirely of stopwords are now empty strings
    ff = [feat for feat in f_stop if feat != ""]  
    print("after removing blanks")
    pprint(len(ff))
    pprint(ff)
    print()
    
    final_f = list(dict.fromkeys(ff)) # removing duplicates while keeping order
    print("after removing duplicates")
    pprint(len(final_f))
    return final_f

camera_features = frequent_features(sent)

'original'
114
['canon',
 'powershot',
 'camera',
 'fact',
 'week',
 'picture',
 'way',
 'work',
 'quality',
 'flash',
 'mb',
 'card',
 'feature',
 'ability',
 'choice',
 'use',
 'option',
 'software',
 'month',
 'megapixel',
 'control',
 'lens',
 'it',
 'image',
 'cf',
 'thing',
 'viewfinder',
 'lcd',
 'research',
 'review',
 'resolution',
 'speed',
 'setting',
 'mode',
 'photo',
 'shutter',
 'flaw',
 'time',
 'photography',
 'day',
 'screen',
 'doe',
 'adjustment',
 't',
 'focus',
 'slr',
 'exposure',
 'auto',
 'range',
 'zoom',
 'result',
 'film',
 'button',
 'battery',
 'plastic',
 'problem',
 'computer',
 'view',
 'price',
 'point',
 'difference',
 'year',
 'pic',
 'hand',
 'bit',
 'life',
 'compactflash',
 'color',
 'power',
 'amazon',
 'ton',
 'shot',
 'priority',
 'lot',
 'light',
 'digicams',
 'd',
 'photoshop',
 'g2',
 'term',
 'one',
 'moment',
 'g3',
 'a',
 'i',
 'the camera',
 'a picture',
 'they',
 'them',
 'the picture',
 'the canon',
 'you',
 'this camera',
 'anyone',
 

The following code is from Kochmar (2021). I am using an adjective sentiment lexicon because adjectives tend to carry the most sentiment, and reviewers use them to describe their feelings towards product features.

In [7]:
import codecs
def collect_wordlist(input_file):
    word_dict = {}
    # the with-block closes the file automatically
    with codecs.open(input_file, encoding='ISO-8859-1', errors='ignore') as f:
        for a_line in f.readlines():
            cols = a_line.split("\t")
            if len(cols) > 2:  # skip lines with fewer than three tab-separated columns
                word = cols[0].strip()
                score = float(cols[1].strip())
                word_dict[word] = score
    return word_dict

adj_00 = collect_wordlist("2000.tsv")
adj_00.get("cool")

1.19

# Step 4: Apply a relevant algorithm

The following function is partly step 3, as it identifies sentences that contain opinions about the product features, but it is also step 4, because it then uses this information and sentiment analysis to discover which review sentences are positive and negative about different product features.

I initialise a positive and a negative dictionary. These hold the positive and negative product feature tags, linked to their sentence ID keys. This makes it easy to compare the sentiment and product feature tags from my classification to the ones in the ground truth data.

Looping through the review sentences, I look for occurrences of the frequent features I have extracted. If I find one, I use the dependency tags from spaCy to see whether there are any adjectives linked to the feature. If there is an adjective, its sentiment is looked up in the sentiment lexicon. From experimenting with this step, I found that accepting any sentiment score resulted in a lot of sentiment words with scores very close to 0. One example is 'optical', which has a very small negative score but, due to the context, appeared often and would have skewed the results significantly. It also does not seem logical for a word like 'optical' to be either positive or negative. Because of this, I introduced a minimum threshold of 0.2 for both positive and negative sentiments. Additionally, if the adjective has a negation dependency, I multiply its score by -1 to reverse the sentiment polarity.
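
The threshold and the polarity flip can be illustrated in isolation. In this sketch the function name `classify_adjective` and every lexicon score except the one for 'cool' (which appears in the lexicon lookup output above) are made up, so it demonstrates the decision rule rather than real lexicon behaviour.

```python
# Only the score for 'cool' comes from the notebook; the others are invented.
TOY_LEXICON = {"cool": 1.19, "optical": -0.05, "bad": -1.5}

def classify_adjective(adjective, negated=False, lexicon=TOY_LEXICON, threshold=0.2):
    """Return 'positive', 'negative' or None for one adjective."""
    score = lexicon.get(adjective)
    if score is None:
        return None          # adjective not in the lexicon
    if negated:
        score *= -1          # negation reverses the polarity
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return None              # too close to zero to count as sentiment

print(classify_adjective("cool"))
print(classify_adjective("optical"))
print(classify_adjective("bad", negated=True))
```

With a threshold of 0.2, a near-neutral word such as the toy 'optical' entry is ignored in both directions, which is the behaviour the threshold is meant to produce.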

In [8]:
def sentiment_analysis(review_sents, product_features):
    my_pos_dic = {}  # sentence ID -> features judged positive
    my_neg_dic = {}  # sentence ID -> features judged negative
    adjectives = []  # adjectives that were scored (not returned)
    
    for i in range(len(review_sents)):
        my_pos_dic[i] = []
        my_neg_dic[i] = []
        
        sent = review_sents[i]
        for token in nlp(sent):
            if token.text in product_features:
                # look at the words that depend directly on the feature token
                for child in token.children:
                    # the multiplier is reset for every child, so the flip only
                    # applies when this child itself has the 'neg' dependency
                    multiplier = 1
                    if child.dep_ == 'neg':
                        multiplier *= -1
                    if child.pos_ == "ADJ":
                        score = adj_00.get(child.text)
                        if score is None:
                            continue  # adjective not in the lexicon
                        elif score * multiplier > 0.2:
                            my_pos_dic[i].append(token.text)
                        elif score * multiplier < -0.2:
                            my_neg_dic[i].append(token.text)
    
                        adjectives.append(child.text)
    return my_pos_dic, my_neg_dic

my_pos, my_neg = sentiment_analysis(sent, camera_features)

In [9]:
pprint(my_pos)
pprint(pos)

{0: [],
 1: [],
 2: [],
 3: [],
 4: [],
 5: [],
 6: ['setting'],
 7: ['mb', 'card'],
 8: ['feature'],
 9: [],
 10: ['canon'],
 11: [],
 12: [],
 13: [],
 14: ['control'],
 15: [],
 16: [],
 17: ['quality'],
 18: ['work'],
 19: [],
 20: [],
 21: [],
 22: ['viewfinder'],
 23: ['viewfinder'],
 24: [],
 25: ['research'],
 26: ['review'],
 27: [],
 28: [],
 29: [],
 30: ['setting'],
 31: [],
 32: ['quality'],
 33: [],
 34: [],
 35: [],
 36: ['feature', 'camera', 'speed'],
 37: [],
 38: [],
 39: [],
 40: [],
 41: ['day'],
 42: [],
 43: [],
 44: [],
 45: ['adjustment'],
 46: ['review'],
 47: [],
 48: [],
 49: [],
 50: [],
 51: [],
 52: [],
 53: ['thing'],
 54: [],
 55: [],
 56: [],
 57: ['battery'],
 58: [],
 59: [],
 60: [],
 61: ['picture'],
 62: [],
 63: ['zoom', 'work', 'zoom'],
 64: ['zoom', 'picture'],
 65: ['problem'],
 66: [],
 67: [],
 68: [],
 69: ['camera'],
 70: [],
 71: ['control'],
 72: ['setting'],
 73: [],
 74: [],
 75: [],
 76: ['g3'],
 77: ['difference'],
 78: [],
 79: [],
 

# Step 5: Report evaluation results

For the evaluation, I report recall, precision and F1 score, and compare the number of distinct features identified by my classification with the number of distinct ground truth features. To get the counts of TP, FN and FP, I loop through the sentences one by one, comparing the ground truth product feature opinions with those from my classification. This works well because the keys of each dictionary cover every sentence ID, so it is simple to compare the values.

A true positive occurs when a tag in the ground truth also appears for the same sentence, with the same sentiment, in my classification. A false negative is when a ground truth product feature is absent from the respective sentiment dictionary in my classification. A false positive is a product feature included in my classification for a sentence where it is not present in the ground truth. The final recall, precision and F1 values are the averages of the respective per-feature values over every product feature in the ground truth.
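
For reference, the per-feature metrics and their average can be written as a small helper. The names `prf` and `macro_average` and the counts in the example are hypothetical, and returning 0 when a denominator is 0 is one possible convention rather than a standard.

```python
from statistics import mean

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, using 0 when a denominator is 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_average(per_feature_counts):
    """Average each metric over features, giving every feature equal weight."""
    scores = [prf(*counts) for counts in per_feature_counts.values()]
    return tuple(mean(column) for column in zip(*scores))

# Invented (TP, FP, FN) counts for two features.
counts = {"zoom": (1, 3, 1), "battery": (0, 0, 4)}
print(macro_average(counts))
```

Averaging per feature means a feature that is never found contributes a zero for all three metrics, so a long tail of missed ground truth features pulls the averages down sharply.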

In [12]:
def get_distinct_features(dic1, dic2):
    distinct_features = []
    for list_of_features in dic1.values():
        for feature in list_of_features:
            if feature not in distinct_features:
                distinct_features.append(feature)
    
    for list_of_features in dic2.values():
        for feature in list_of_features:
            if feature not in distinct_features:
                distinct_features.append(feature)            
    
    return distinct_features

my_distinct_features = get_distinct_features(my_pos, my_neg)
GT_distinct_features = get_distinct_features(pos, neg)
    

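def evaluate(GT_pos, GT_neg, test_pos, test_neg, GT_distinct_features):
    # the distinct ground truth features are recomputed here, so the value
    # passed in as GT_distinct_features is not used
    GT_distinct_features = get_distinct_features(GT_pos, GT_neg)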
    test_distinct_features = get_distinct_features(test_pos, test_neg)
    num_GT_features = len(GT_distinct_features)
    num_test_features = len(test_distinct_features)
    
    recall_list = []
    precision_list = []
    f1_list = []
    if len(GT_pos) + len(GT_neg) != len(test_pos) + len(test_neg):
        print(len(GT_pos), len(GT_neg), len(test_pos), len(test_neg))
        raise KeyError
    for feature in GT_distinct_features:
        TP = 0
        FP = 0
        FN = 0
        
        for i in range(len(test_pos)):
            if (feature in GT_pos[i] and feature in test_pos[i]) or (feature in GT_neg[i] and feature in test_neg[i]):
                TP+=1
            elif (feature in GT_pos[i] and feature not in test_pos[i]) or (feature in GT_neg[i] and feature not in test_neg[i]):
                FN+=1
            elif (feature not in GT_pos[i] and feature in test_pos[i]) or (feature not in GT_neg[i] and feature in test_neg[i]):
                FP+=1
        if TP+FN == 0:
            recall = 0
        else:
            recall = TP/(TP+FN)
        if TP+FP == 0:
            precision = 0
        else:
            precision=TP/(TP+FP)
        if precision+recall == 0:
            f1=0
        else:
            f1=(2 * precision * recall)/(precision + recall)
        
        recall_list.append(recall)
        precision_list.append(precision)
        f1_list.append(f1)
        
    recall_avg = stats.mean(recall_list)
    precision_avg = stats.mean(precision_list)
    f1_avg = stats.mean(f1_list)
    return num_GT_features, num_test_features, recall_avg, precision_avg, f1_avg

evaluate(pos, neg, my_pos, my_neg, GT_distinct_features)
        
        

(112, 46, 0.02193877551020408, 0.02636904761904762, 0.023074454853166337)

Below is the code which runs the entire application from start to finish for every dataset provided. The results are very poor. The highest score is for the Canon S100, with a precision of 0.047. Recall tends to be even lower than precision, showing that false negatives are more of a problem than false positives (albeit only slightly, as both are very poor). This shows that my classification often failed to pick out the correct product features. The much lower number of distinct product features suggests that my design choice to use only nouns and noun phrases was too limited and simplistic.

In [13]:
def start_to_finish(file):
    sentences, gt_pos, gt_neg = read_in(file)
    ff = frequent_features(sentences)
    test_pos, test_neg = sentiment_analysis(sentences, ff)
    print(file)  # printed after the feature lists produced by frequent_features
    ev = evaluate(gt_pos, gt_neg, test_pos, test_neg, get_distinct_features(gt_pos, gt_neg))
    
    return ev

directories = ["Data/Data/Customer_review_data/", "Data/Data/CustomerReviews-3_domains/", "Data/Data/Reviews-9-products/"]
results = {}
rows = ('No. GT features', 'No. extracted features', 'recall', 'precision', 'f1 score')
for directory in directories:
    files = os.listdir(directory)
    for file in sorted(files):
        if file == "Readme.txt" or file == ".DS_Store":
            continue
        else:
            results[file] = start_to_finish(directory+file)
        
results_table = pd.DataFrame(results, index=rows)
print(results_table)

'original'
85
['apex',
 'dvd',
 'player',
 'video',
 'doe',
 'hour',
 'support',
 'picture',
 'control',
 'button',
 'remote',
 'tv',
 'display',
 'output',
 'screen',
 'problem',
 'year',
 'sound',
 'price',
 'one',
 'way',
 'feature',
 'work',
 'model',
 'format',
 'file',
 'lot',
 'time',
 'thing',
 'unit',
 'wa',
 'movie',
 'month',
 'money',
 'quality',
 'customer',
 'service',
 'product',
 'review',
 'week',
 'cd',
 'play',
 'disc',
 'company',
 'machine',
 'amazon',
 'day',
 'disk',
 'trouble',
 'star',
 'number',
 'return',
 't',
 'gift',
 'brand',
 'mine',
 'menu',
 'it',
 'you',
 'the player',
 'the remote',
 'this',
 'i',
 'the price',
 'what',
 'me',
 'that',
 'them',
 'the unit',
 'they',
 'this dvd player',
 'we',
 'the dvd',
 'this player',
 'the review',
 'christmas',
 'everything',
 'the apex',
 'no problem',
 'something',
 'which',
 'the picture',
 'all',
 'this product',
 'the dvd player']
'after removing stopwords'
85
['apex',
 'dvd',
 'player',
 'video',
 'doe',
 '

Data/Data/Customer_review_data/Creative Labs Nomad Jukebox Zen Xtra 40GB.txt
'original'
112
['camera',
 'picture',
 'macro',
 'day',
 'feature',
 'autofocus',
 'scene',
 'mode',
 'situation',
 'mb',
 'flash',
 'battery',
 'cf',
 'vacation',
 'experience',
 'one',
 'beginner',
 'model',
 'review',
 'auto',
 'point',
 't',
 'lens',
 'subject',
 'life',
 'lcd',
 'time',
 'lexar',
 'card',
 'image',
 'resolution',
 'photography',
 'power',
 'use',
 'option',
 'shutter',
 'speed',
 'people',
 'month',
 'canon',
 'problem',
 'quality',
 'good',
 'pc',
 'way',
 'memory',
 'thing',
 'date',
 'software',
 'print',
 'room',
 'cap',
 'couple',
 'size',
 'movie',
 'clarity',
 'menu',
 'week',
 'part',
 'term',
 'price',
 'photo',
 'hand',
 'lot',
 'manual',
 'course',
 'setting',
 'photograph',
 'mm',
 'money',
 'u',
 'work',
 'adapter',
 'light',
 'nikon',
 'value',
 'zoom',
 'pic',
 'film',
 'coolpix',
 'research',
 'control',
 'doe',
 'drawback',
 'shot',
 'slr',
 'color',
 'range',
 'filter',


Data/Data/CustomerReviews-3_domains/Computer.txt
'original'
103
['price',
 'customer',
 'service',
 'problem',
 'laptop',
 'internet',
 'product',
 'cable',
 'router',
 'replacement',
 'month',
 'connection',
 'speed',
 'network',
 'star',
 'storage',
 'device',
 'computer',
 'way',
 'drive',
 'file',
 'wa',
 'issue',
 'support',
 'firmware',
 'desktop',
 'port',
 'year',
 'number',
 'installation',
 'setup',
 'feature',
 'signal',
 'people',
 'USB',
 'one',
 'review',
 'password',
 'button',
 'time',
 'printer',
 'hour',
 'doe',
 'minute',
 'modem',
 'home',
 'wireless',
 'd',
 'house',
 'model',
 'software',
 'others',
 'unit',
 'user',
 'security',
 'thing',
 'warranty',
 'phone',
 'call',
 'tech',
 'range',
 'day',
 'power',
 'point',
 'box',
 'mbps',
 'case',
 'RRB',
 'line',
 'version',
 'access',
 'wifi',
 'IP',
 'address',
 'switch',
 'video',
 'LRB',
 'band',
 'I',
 'the internet',
 'me',
 'this product',
 'we',
 'it',
 'them',
 'you',
 'this',
 'they',
 'which',
 'that',
 'wh

Data/Data/Reviews-9-products/Canon PowerShot SD500.txt
'original'
118
['camera',
 'reason',
 'people',
 'review',
 'size',
 'pocket',
 'picture',
 'friend',
 'thing',
 'memory',
 'card',
 'room',
 'money',
 'battery',
 'I',
 'lens',
 'quality',
 'price',
 'lithium',
 'place',
 'way',
 'problem',
 'wa',
 'zoom',
 'function',
 'flash',
 'focus',
 'shot',
 'flaw',
 'choice',
 'time',
 'month',
 'advantage',
 'bit',
 'button',
 'body',
 'lack',
 'resolution',
 'pic',
 'mm',
 'snapshot',
 'image',
 'computer',
 'software',
 'LCD',
 'color',
 'light',
 'film',
 'one',
 'exposure',
 'mode',
 'wallet',
 'screen',
 'control',
 'balance',
 'week',
 'mb',
 'viewfinder',
 'panorama',
 'S100',
 'second',
 'star',
 'amazon',
 'point',
 'fun',
 'wife',
 'moment',
 'practice',
 'result',
 'print',
 'd',
 'case',
 'pack',
 'metal',
 'product',
 'capability',
 'year',
 'MB',
 'photography',
 'photo',
 'PC',
 'housing',
 'example',
 'auto',
 'sunlight',
 'this camera',
 'it',
 'what',
 'my pocket',
 'tha

Data/Data/Reviews-9-products/Hitachi router.txt
'original'
110
['router',
 'setup',
 'installation',
 'fact',
 'tech',
 'CD',
 'computer',
 'modem',
 'house',
 'connection',
 'people',
 'review',
 'problem',
 'thing',
 'manual',
 'way',
 'documentation',
 'cable',
 'PC',
 'wireless',
 'step',
 'box',
 'product',
 'website',
 'G',
 'configuration',
 'system',
 'use',
 'hour',
 'person',
 'IT',
 'support',
 'instruction',
 'site',
 'user',
 'star',
 'couple',
 'day',
 'performance',
 'network',
 'others',
 'time',
 'internet',
 'laptop',
 'minute',
 'web',
 'speed',
 'issue',
 'work',
 'point',
 'card',
 'help',
 'encryption',
 'firmware',
 'update',
 'address',
 'device',
 'price',
 'Ethernet',
 'experience',
 'one',
 'security',
 'program',
 'Linksys',
 'd',
 'hardware',
 'lot',
 'year',
 'version',
 'feature',
 'home',
 'access',
 'linksys',
 'range',
 'service',
 'file',
 'desktop',
 'signal',
 'g',
 'software',
 'month',
 'standard',
 'MB',
 'Internet',
 'I',
 'everything',
 'that',

Data/Data/Reviews-9-products/Nokia 6600.txt
'original'
105
['MP3',
 'player',
 'feature',
 'game',
 'unit',
 'review',
 'sound',
 'quality',
 'lot',
 'problem',
 'battery',
 'month',
 'year',
 'charge',
 'thing',
 'scratch',
 'case',
 'version',
 'photo',
 'design',
 'one',
 'software',
 'computer',
 'music',
 'hour',
 'warranty',
 'service',
 'generation',
 'ipod',
 'song',
 'CDs',
 'wheel',
 'menu',
 'screen',
 'ear',
 'user',
 'life',
 'time',
 'drive',
 'week',
 'product',
 'way',
 'apple',
 'advice',
 'USB',
 'device',
 'doe',
 'GB',
 'market',
 'storage',
 'program',
 'people',
 'PC',
 'part',
 'reputation',
 'money',
 'book',
 'capacity',
 'issue',
 'file',
 'fact',
 'minute',
 'ease',
 'use',
 'library',
 'model',
 'CD',
 'car',
 'cable',
 'day',
 'Apples',
 'mine',
 'iPods',
 'track',
 'price',
 'mini',
 'iPod',
 'I',
 'the iPod',
 'it',
 'It',
 'you',
 'This',
 'me',
 'they',
 'something',
 'this',
 'that',
 'a lot',
 'what',
 'Apple',
 'the battery',
 'They',
 'iTunes',
 'wh

In [14]:
pprint(results_table)

                        Apex AD2600 Progressive-scan DVD player.txt  \
No. GT features                                          131.000000   
No. extracted features                                    36.000000   
recall                                                     0.019231   
precision                                                  0.044614   
f1 score                                                   0.019485   

                        Canon G3.txt  \
No. GT features           112.000000   
No. extracted features     46.000000   
recall                      0.021939   
precision                   0.026369   
f1 score                    0.023074   

                        Creative Labs Nomad Jukebox Zen Xtra 40GB.txt  \
No. GT features                                            203.000000   
No. extracted features                                      55.000000   
recall                                                       0.018970   
precision                               

# Summaries

Despite the poor evaluation results, reading the summaries below is encouraging. There appears to be a good example of negation handling in the 'setting' product feature: one of the positive examples is "with the automatic setting i really have nt taken a bad picture yet". As it includes the word 'bad', without any negation handling one would expect it to be categorized as negative. This suggests that the main weakness of my classification approach is that the product features it selects are very different from, and certainly more limited than, the ground truth, while the sentiment analysis itself performs well: manually checking the summaries shows that positive and negative sentences are generally being classified correctly.

In [15]:
for f in my_distinct_features:
    positives = 0
    negatives = 0
    p_ex = []
    n_ex = []
    for i in range(len(sent)):
        
        if f in my_pos[i]:
            positives+=1
            p_ex.append(sent[i])
        if f in my_neg[i]:
            negatives+=1
            n_ex.append(sent[i])

    print(f'Product feature: {f} \n    Positives: {positives}\n        examples: {p_ex} \n\n    Negatives: {negatives}\n        examples: {n_ex}\n\n')


Product feature: setting 
    Positives: 5
        examples: ['i m easily enlarging picture to 8 12 x 11 with no visable loss in picture quality and not even using the best possible setting a yet super fine', 'with the automatic setting i really have nt taken a bad picture yet', 'newbie will find the full auto setting will give them perfect picture right out of the box', 'by cocking the shutter to the halfway position and getting the setting ready to shoot i wa able to produce excellent stopaction photo contrary to what other reviewer experienced', 'you can use this camera right out of the box on the automatic setting or slowly get comfortable with the manual setting and what they can do'] 

    Negatives: 0
        examples: []


Product feature: mb 
    Positives: 1
        examples: ['ensure you get a larger flash 128 or 256 some are selling with the larger flash 32mb will do in a pinch but you ll quickly want a larger flash card a with any of the 4mp camera'] 

    Negatives: 0
   

# References:

Hu, M. and Liu, B., 2004. Mining and Summarizing Customer Reviews. p.10.

Kochmar, E., 2021. Getting Started with Natural Language Processing [Online]. Manning Publications. Available from: https://www.manning.com/books/getting-started-with-natural-language-processing [Accessed 12 November 2021].

Hamilton, W.L., Clark, K., Leskovec, J. and Jurafsky, D., 2016. Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. arXiv preprint arXiv:1606.02820.