# CM500328 Introduction to NLP: Final Assignment

### Introduction

The aim of this assignment is to carry out opinion mining on reviews of a range of products.  

Specifically, given a set of reviews, the task is to identify keyphrases that describe product features and detect the polarity of sentences that discuss these product features. Product features and associated sentiments are identified at the sentence level. These results are aggregated to summarise product features and the senitments associated with them. The data used has been extracted from Amazon and have been manually annotated with product features and opinion polarity. 

In this assignment we trial two different methodologies:  
1. Development of an algorithm to identify features and sentence polarity, similar to the work of Hu and Liu (2004) 
2. A Machine Learning (ML) method to predict sentence level sentiment using tf-idf vectors and a Support Vector Classifier (SVC)
    
The remainder of this assignment is structured as follows:

1. **Analyse the data and the task**
    - In this section we read in the data, take care of preliminary cleansing activities and transform the data into useable data structures


2. **Apply relevant data pre-processing steps**
    - In this section we carry out data pre-processing steps and creation of training and test sets


3. **Extract relevant information**
    - We identify relevant product features and the sentiment pertaining to these features


4. **Apply a relevant algorithm**
    - We produce summary reviews at the feature level for a given product


5. **Report evaluation results**
    - We report the evaluation results of the approaches using precision and recall metrics
    - We produce an NLP pipeline so that we can easily apply our algorithm to multiple products


6. **Machine Learning method**  
    - We implement a supervised learning approach using tf-idf word vectors and a Support Vector Classifier (SVC) to predict sentence polarity
    - Results are compared with our original method  
  
 
7. **Conclusions**
    - A summary of the key findings  
  
  
8. **References**
    - A list of references used in this assignment
    
Throughout this assignment, the code used for each section is presented at the start of the section so that reviewers can easily refer to the code used for a given section.


### Notes
**Packages:** The packages and versions used are detailed in the requirements.txt file, found in the same directory as this notebook  
**Data:** It is assumed that the "Data" folder containing subfolders is stored in the same directory as this notebook


In [1]:
import os, codecs
import numpy as np
from numpy.random import default_rng
import pandas as pd
import spacy
import string
import regex as re
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk, RegexpParser, Tree
from copy import deepcopy
from collections import Counter, defaultdict
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

### 1. Analyse the data and the task

In this step we will read in the data for the reviews and perform some intial data cleansing. Specifically, we ignore several tags specified by Hu and Liu (2004), including titles denoted by [t] and other tags such as [u], [p], [s], [cc] and [cs]. These tags are note deemed to be useful for this assignment.   

After processing the data is separated into the following data structures:

| Name | Type | Description |
|:-----|:-----|:------------|
| labels | dict | Groundtruth labels for each sentence |
| reviews | dict | Review sentences, each sentence in a review is a separate record. Reviews can span multiple sentences. |
| other | dict | Data not classified as tags or text |
| data | str | The entire input string afer initial cleansing |

For the sake of example and explanation, we will work through the detailed steps of the pipeline for a single product. In section 5, we will put all these steps together to create a unified pipeline and compare performance across multiple products.

In [2]:
def read_in(folders=None, single_file=None, exclude=["Readme.txt"], verbose=False):
    """ 
    Read in data and store as a string
    """
    file_string = ""
    
    if folders is not None:
        for folder in folders:
            for a_file in os.listdir(folder):
                if not a_file.startswith(".") and a_file not in exclude:
                    if verbose:
                        print(f"Reading data from: {folder + a_file}")
                    with codecs.open(folder + a_file, 'r', encoding='iso-8859-1', errors="ignore") as f:
                        file_string += f.read()
    elif single_file is not None:
        if verbose:
            print(f"Reading data from: {single_file}")
        with codecs.open(single_file, 'r', encoding='iso-8859-1', errors="ignore") as f:
            file_string += f.read()
    else:
        raise ValueError("Please provide a list of folders or a path to a single file")
    
    return file_string


def create_dicts(data):
    """ 
    Converts the raw string into three dictionaries:
        1. labels: labels for each line
        2. text: The lines for each review
        3. other: text that cannot be identified as labels or text
        
    """
    labels = {}
    reviews = {}
    other = {}
    count = 0
    other_count = 0
    
    data = data.replace("]#", "] #")
    data = data.replace("[p]", "")
    data = data.replace("[u]", "")
    data = data.replace("[s]", "")
    data = data.replace("[cc]", "")
    data = data.replace("[cs]", "")
    
    for line in data.split("\r\n"):
        line = line.strip()
        if len(re.findall('^[\w#]+', line)) == 0:
            other[other_count] = line
            other_count += 1
        else:
            text_split = line.split("##")
            labels[count] = text_split[0]
            if len(text_split) > 1:
                reviews[count] = text_split[1]
            count += 1
            
    return labels, reviews, other, data

def sort_dict(dic, reverse=True):
    """ 
    Helper function that sorts a dictionary by values 
    """
    dic = {k: v for k, v in sorted(dic.items(), key=lambda item: item[1], reverse=reverse)}
    return dic

def print_dict(dic, n=10):
    """ 
    Helper function that prints the first n items from a dictionary 
    """
    c = 0
    for k, v in dic.items():
        if c > n:
            break
        print(f'{k}: {v}')
        c += 1

Let's start with a single product, the Canon G3. As mentioned beforehand, we will apply the entire pipeline to a wider range of products in section 5.

In [3]:
product = 'Canon G3'
file = 'data/Customer_review_data/Canon G3.txt'

# Read data and check length
data = read_in(single_file=file, verbose=True)
labels, reviews, other, data_clean = create_dicts(data)

Reading data from: data/Customer_review_data/Canon G3.txt


Let's observe the labels from the first 10 records of the labels, reviews and other dictionary:

In [4]:
print_dict(labels, 10)

0: canon powershot g3[+3] 
1: use[+2] 
2: 
3: 
4: picture[+2] 
5: picture quality[+1] 
6: picture quality[+1] 
7: 
8: camera[+2], use[+2], feature[+1] 
9: picture quality[+3], use[+1], option[+1] 
10: 


In [5]:
print_dict(reviews, 10)

0: i recently purchased the canon powershot g3 and am extremely satisfied with the purchase .
1: the camera is very easy to use , in fact on a recent trip this past week i was asked to take a picture of a vacationing elderly group .
2: after i took their picture with their camera , they offered to take a picture of us .
3: i just told them , press halfway , wait for the box to turn green and press the rest of the way .
4: they fired away and the picture turned out quite nicely . ( as all of my pictures have thusfar ) .
5: a few of my work constituants owned the g2 and highly recommended the canon for picture quality .
6: i 'm easily enlarging pictures to 8 1/2 x 11 with no visable loss in picture quality and not even using the best possible setting as yet ( super fine ) .
7: ensure you get a larger flash , 128 or 256 , some are selling with the larger flash , 32mb will do in a pinch but you 'll quickly want a larger flash card as with any of the 4mp cameras .
8: bottom line , well made

In [6]:
print_dict(other, 10)

0: *****************************************************************************
1: * Annotated by: Minqing Hu and Bing Liu, 2004.
2: *		Department of Computer Sicence
3: *               University of Illinios at Chicago
4: *
5: * Product name: Canon G3
6: * Review Source: amazon.com
7: *
8: * See Readme.txt to find the meaning of each symbol.
9: *****************************************************************************
10: 


<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we have:
    <li> Read in the data </li>
    <li> Extracted reviews and labels from the raw data </li>   
</font>

### 2. Apply relevant data pre-processing steps

In this section we will apply relevant preprocessing steps to the sentences stored in the text dictionary.

**Specifically we will:**
1. Convert sentences to lemma form and lower case
2. Remove tokens less than the minimum specified token length, we will use a minimum length of 3

I have leveraged the spaCy token.lemma_ functionality to reduce words to their base lemma, which will prove useful in reducing dimensionality when trying to identify product features and opinion words.

I have experimented with stopword removal, which appears to degrade the overall performance of the algorithm, so stopwords have been retained.

In [7]:
def process_text(text, remove_stopwords=False, min_token_length=3):
    """ 
    Apply text preprocessing steps 
    """
    doc = nlp(text, disable=["ner"])
    if remove_stopwords:
        return " ".join([token.lemma_.lower() for token in doc if not token.is_stop 
                         and len(token.text) >= min_token_length])
    else:
        return " ".join([token.lemma_.lower() for token in doc if len(token.text) >= min_token_length])

def clean_text(text_dict, remove_stopwords, verbose=False, min_token_length=3):
    """ 
    Apply test processing to clean the text
    """
    text_clean = {}
    count = 0
    for k, v in text_dict.items():
        if verbose and count%100==0 and count>0:
            print(f"Records processed: {count}")
        text_clean[k] = process_text(v, remove_stopwords, min_token_length)
        count += 1
        
    return text_clean

def create_train_test_indicies(text_dict, test_percent, random_seed=42):
    """ 
    Generate train and test set indicies 
    """
    np.random.seed(random_seed)
    
    train_n = int(len(text_dict) * (1 - test_percent))
    idx_list = list(text_dict.keys())
    np.random.shuffle(idx_list)
    
    train_indicies = idx_list[:train_n]
    test_indicies = idx_list[train_n:]
                      
    return train_indicies, test_indicies
    
def create_train_test_sets(x, y, test_percent, random_seed=42, verbose=False):
    """
    Returns train and test sets given indicies for train and test
    """
    # First get the indidices for splitting
    train_indicies, test_indicies= create_train_test_indicies(x, test_percent, random_seed=random_seed)
    
    # Initialise dictionaries to store output data
    x_train = {}
    y_train = {}
    x_test = {}
    y_test = {}
    
    # Build datsets
    for i in train_indicies:
        x_train[i] = x[i]
        y_train[i] = y[i]
        
    for i in test_indicies:
        x_test[i] = x[i]
        y_test[i] = y[i]    
        
    if verbose:
        print(f'x_train records: {len(x_train)}')
        print(f'y_train records: {len(y_train)}')
        print(f'x_test records: {len(x_test)}')
        print(f'y_test records: {len(y_test)} \n')
          
    return x_train, y_train, x_test, y_test 

Let's apply the preprocessing on the text:

In [8]:
nlp = spacy.load("en_core_web_md")
reviews_clean = clean_text(reviews, remove_stopwords=False, min_token_length=1)
print_dict(reviews_clean, 10)

0: i recently purchase the canon powershot g3 and be extremely satisfied with the purchase .
1: the camera be very easy to use , in fact on a recent trip this past week i be ask to take a picture of a vacation elderly group .
2: after i take their picture with their camera , they offer to take a picture of we .
3: i just tell they , press halfway , wait for the box to turn green and press the rest of the way .
4: they fire away and the picture turn out quite nicely . ( as all of my picture have thusfar ) .
5: a few of my work constituant own the g2 and highly recommend the canon for picture quality .
6: i ' m easily enlarge picture to 8 1/2 x 11 with no visable loss in picture quality and not even use the good possible setting as yet ( super fine ) .
7: ensure you get a large flash , 128 or 256 , some be sell with the large flash , 32 mb will do in a pinch but you 'll quickly want a large flash card as with any of the 4mp camera .
8: bottom line , well make camera , easy to use , very 

Next we will create a train and test split, we will hold aside 25% of the data for the test set.

A test set is not really necessary for our non-ML approach, however we will split the data now in anticipation of the ML approach that we will try later on.

In [9]:
test_percent = 0.25
x_train, y_train, x_test, y_test = create_train_test_sets(reviews_clean, labels, test_percent, 
                                                          random_seed=42, verbose=True)

x_train records: 447
y_train records: 447
x_test records: 150
y_test records: 150 



<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we:
    <li> Applied preprocessing steps to the review data </li>
    <li> Split the data into training and test sets </li>   
</font>

### 3. Extract relevant information

In this step we want to first extract what features customers are commentning on in the reviews.

To extract features I have implemented a custom function which returns all combinations of consecutive nouns within the text.

#### 3.1 Frequent features identification

In [10]:
def get_features(text_dict, verbose=False):
    """
    Get the noun phrases from a document using a method I found on Stack Overflow, this appears superior to
    the spaCy noun_chunks methods
    https://stackoverflow.com/questions/49564176/python-nltk-more-efficient-way-to-extract-noun-phrases
    """    
    features_dict = defaultdict(int)
    for i, text in enumerate(text_dict.values()):
        if verbose and i%100 == 0 and i > 0:
            print(f'Records processed: {i}')
        pos = pos_tag(word_tokenize(text))
        count = 0
        half_chunk = ""
        for word, tag in pos:
            if re.match(r"NN.*", tag):
                count+=1
                if count>=1:
                    half_chunk = half_chunk + word + " "
            else:
                half_chunk = half_chunk+"---"
                count = 0
        half_chunk = re.sub(r"-+","?",half_chunk).split("?")
        half_chunk = [x.strip() for x in half_chunk if x!=""]
        
        for chunk in half_chunk:
            # Increment the dictionary
            features_dict[chunk] += 1
    
    # Sort the dictionary
    features_dict = sort_dict(features_dict, reverse=True)
    
    return features_dict

def feature_pruning(text_dict, features_dict, features_list=[], min_support=3, min_percent=0.01, verbose=False):
    """
    Prune features using p_support and keep features that appear in at least 1% of sentences
    """
    # Apply p-support pruning
    pruned_features_dict = features_dict.copy()
    for k1, v1 in features_dict.items():
        # Find single word features and the support
        if len(k1.split())==1:
            p_support = v1
            if verbose:
                print(f'feature {k1} support={p_support}')
            # Calculate p_support by subtracting the number of times the single noun appears in other
            #   noun_phrases
            for k2, v2 in features_dict.items():
                if len(k2.split())!=1:
                    if k1 in k2:
                        p_support -= v2
                        if verbose:
                            print(f'feature {k1} found in {k2}, updated support={p_support}') 
            # If the final p_support value < min_support, remove this feacture
            if p_support < min_support:
                pruned_features_dict.pop(k1, None)
                if verbose:
                    print(f'feature {k1} removed, support={p_support}')
              
    # Apply frequent feature pruning
    features = []
    for k, v in pruned_features_dict.items():
        if v > int(len(text_dict) * min_percent):
            features.append(k)
            
    return features

def find_docs_with_features(text_dict, features):
    """ 
    Identify which documents contain the features
    """
    docs_containing_features = {}
    doc_features = []
    for k, v in text_dict.items():
        doc_features = []
        for feature in features:
            if feature in v:
                doc_features.append(feature)
        if len(doc_features) > 0:
            docs_containing_features[k] = doc_features
    return docs_containing_features

In [11]:
# Extract the features
features_dict = get_features(x_train, verbose=False)
print_dict(features_dict, n=10)
print(f'Total features: {len(features_dict)}')

i: 135
camera: 116
g3: 37
picture: 36
time: 23
feature: 20
flash: 19
canon: 18
photo: 15
lens: 14
image: 13
Total features: 753


To reduce the number of features, I have implemented the p-support and feature frequency pruning methods described by Hu and Liu (2004).

Specifically, single word features must have a p-support of at least 3 and all features must appear in at least 1% of sentences in order to be retained.

In [12]:
# Prune features
features = feature_pruning(x_train, features_dict, min_support=3, min_percent=0.01)
print(features)
print(f'Remaining features: {len(features)}')

['camera', 'g3', 'picture', 'time', 'feature', 'flash', 'canon', 'lens', 'image', 'use', 'viewfinder', 'point', 'g2', 'thing', 'review', 'lot', 'shot', 'research', 'battery life', 'result', 'moment', 'setting', 'slr', 'control', 'problem', 'range', 'card', 'canon g3', 'picture quality', 'exposure', 'light', 'year', 'something', 'photography', 'purchase', 'hand', 'lense', 'way', 'box', 'people', 'difference', 'ability', 'week', 'bit', 's330', 'color']
Remaining features: 46


In [13]:
docs_containing_features_train = find_docs_with_features(x_train, features)
docs_containing_features_test = find_docs_with_features(x_test, features)
print_dict(docs_containing_features_train, 10)

109: ['camera']
480: ['point', 'thing', 'range', 'something', 'photography']
135: ['use']
77: ['point', 'difference']
396: ['camera']
286: ['camera', 'g3', 'thing', 'lot', 'research', 'purchase', 'hand']
10: ['canon']
589: ['image', 'card']
78: ['lens', 'range', 'lense']
55: ['range']
585: ['camera', 'g3', 'picture', 'canon', 'use', 'g2', 'battery life', 'canon g3', 'picture quality', 'way']


In [14]:
print(f'{len(docs_containing_features_train)/len(x_train):.2%} of sentences contain features')

80.09% of sentences contain features


<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we have:
    <li> Extracted potential features by identifying noun phrases </li>
    <li> Reduced the number of potential features from 753 to 46 by implementing p_support, and keeping only potential features which appear in more than 1% of sentences </li>  
    <li> We have then identified the sentences in the training and test sets that contain the identified features </li>
    <li> We have found that 80% of the sentences in our training set contain features </li>
    
</font>

#### 3.2 Opinion words extraction
Now we will only consider those sentences that we have identified as containing features. As per Hu and Liu (2004), we consider adjectives to be potential opinion words.

To identify candidate opinion words, we will loop through the sentences that contain features, extract the adjectives and store these in a dictionary.

We allocate opinion words to the closest feature. In this way it is possible for features to have multiple opinion words. Additionally, opinion words equidistant from multiple features can be allocated to multiple features, however this appears to be rare.

We set a parameter, max_feature_distance, which defines the maximum allowable distance between an opinion word and a feature. The rationale is that the closer an opinion word is to a feature, the more likely that the opinion word pertains to that feature. Although this might not always be the case and long range dependencies may exist, this seems a reasonable assumption for our simplistic model.

If this threshold cannot be met for any of the identified features, then an opinion word is discarded.

In [15]:
def identify_and_allocate_opinion_words(text_dict, docs_containing_features, max_feature_distance=3, verbose=False):
    """
    Identify opinion words and allocate them to nearby features in a text
    """

    opinion_words_dict = defaultdict(list)
    for k, v in docs_containing_features.items():
        features_dict = defaultdict(list)
        adjectives_list = []
        features_list = []

        text = text_dict[k]
        doc = nlp(text, disable=["ner"]) 

        # Find adjectives and their positions 
        for token in doc:
            if token.pos_=='ADJ':
                adjectives_list.append((token.text, token.i))

        # Search for the features in the text and find the positions
        for feature in v:
            pos = find_feature_pos(text, feature)
            features_list.append((feature, pos))

        # Allocate adjectives to nearest feature
        for adj in adjectives_list:
            min_feature_distance = np.inf
            closest_feature = None
            for feature in features_list:
                distance = abs(feature[1] - adj[1])
                if distance < min_feature_distance and distance <= max_feature_distance and distance != 0:
                    min_feature_distance = distance
                    closest_feature = feature[0]
            if verbose:
                print(f'Closest feature to {adj} is {closest_feature} with a token distance of {min_feature_distance} \n')
            if closest_feature is not None:
                features_dict[closest_feature].append(adj[0])
        if len(features_dict) > 0:
            opinion_words_dict[k] = features_dict

    return opinion_words_dict
    
def find_feature_pos(text, feature):
    """
    Helper function to find feature position in a text
    """
    text_split = text.split()
    feature_split = feature.split()
    len_text_split = len(text_split)
    len_feature_split = len(feature_split)
    
    for i in range(len_text_split - len_feature_split + 1):
        if text_split[i:i+len_feature_split] == feature_split:
            return i
    return -1

In [16]:
opword_dict_train = identify_and_allocate_opinion_words(x_train, docs_containing_features_train, 
                                                        max_feature_distance=5, verbose=False)
opword_dict_test = identify_and_allocate_opinion_words(x_test, docs_containing_features_test, 
                                                        max_feature_distance=5, verbose=False)
print_dict(opword_dict_train, 10)

109: defaultdict(<class 'list'>, {'camera': ['great']})
480: defaultdict(<class 'list'>, {'point': ['advanced'], 'photography': ['serious']})
135: defaultdict(<class 'list'>, {'use': ['used']})
77: defaultdict(<class 'list'>, {'point': ['more'], 'difference': ['moderate']})
396: defaultdict(<class 'list'>, {'camera': ['first', 'digital']})
286: defaultdict(<class 'list'>, {'camera': ['digital']})
10: defaultdict(<class 'list'>, {'canon': ['great']})
78: defaultdict(<class 'list'>, {'lens': ['extended'], 'lense': ['fast']})
290: defaultdict(<class 'list'>, {'camera': ['great']})
30: defaultdict(<class 'list'>, {'setting': ['automatic'], 'picture': ['bad']})
463: defaultdict(<class 'list'>, {'g3': ['unnerving']})


In [17]:
print(f'{len(opword_dict_train)/len(x_train):.2%} of sentences contain features and opinion words')

55.26% of sentences contain features and opinion words


<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we have:
    <li> Identified opinion words as adjectives and allocated these opinion words to features using a simple distance method </li>
    <li> Found that 55% of sentences have mention of features with opinion words </li>   
</font>`

#### 3.3 Orientation Identification for Opinion Words

Hu and Liu (2004) utilise WordNet to determine the orientation of opinion words. I depart from this approach and identify the orientation of opinion words by leveraging the word vectors prodivded by spaCy tokens.

For each opinion word, I compare its vector representation to the vector representations of the words "positive" and "negative".

If the opinion word is closer to "positive" as measured by cosine similarity, then the orientation of the word is deemed to be positive. Conversely, if the opinion word is closer to "negative", then the orientation of the word is deemed to be negative. If the cosine similarities to both positive and negative are similar (less than 10% relative difference, chosen arbitrarily), then the orientation is deemed to be neutral.

As demonstrated in the following cells, this method appears to work very well.

In [18]:
def polarity(text, verbose=False):
    """ 
    Returns polarity of a document by comparing the cosine similarity of a token 
    to "negative" and "positive"
    
    Returns:
        int 1: positive, 0: neutral, -1: negative
    """
    DIFFERENCE_THRESHOLD = 0.1
    MIN_SIMILARITY = 0.1
    
    doc = nlp(text, disable=["ner"])
    if np.all(doc.vector):
        doc_pos = nlp("positive", disable=["ner"])
        doc_neg = nlp("negative", disable=["ner"])

        sim_pos = doc.similarity(doc_pos)
        sim_neg = doc.similarity(doc_neg)
        if sim_pos > 0 and sim_neg > 0:
            relative_diff = sim_pos/sim_neg-1
        else:
            relative_diff = 0

        if relative_diff > DIFFERENCE_THRESHOLD and sim_pos > MIN_SIMILARITY:
            sentiment = 1
        elif relative_diff < -DIFFERENCE_THRESHOLD and sim_neg > MIN_SIMILARITY:
            sentiment = -1
        else: 
            sentiment = 0

        if verbose:
            print(f'{doc} -> {doc_pos} {sim_pos}')
            print(f'{doc} -> {doc_neg} {sim_neg}')
            if sentiment == 1:
                print(f"{doc} classified as POSITIVE")
            elif sentiment == -1:
                print(f"{doc} classified as NEGATIVE")
            else:
                print(f"{doc} classified as NEUTRAL")
    else:
        sentiment = 0
        if verbose:
            print(f"{doc} has no vector, classified as NEUTRAL")
        
    return sentiment

#### Below is a demonstration of the polarity function on a range of words

As expected, words such as "amazing", and "great" are given a positive orientation and words such as "terrible" and "rubbish" are given a negative orientation. Words such as "average" and "ok" are classed as neutral. Non-existent words such as "scoozle", which I made up, do not have a prebuilt spaCy vector representation and are deemed as neutral. 

In [19]:
# Demonstration of the polarity function
words= ["amazing", "great", "terrible", "rubbish", "average", "ok", "scoozle"]

for word in words:
    p = polarity(word, True)
    print()

amazing -> positive 0.29383947128808274
amazing -> negative 0.15883798557452
amazing classified as POSITIVE

great -> positive 0.3911600082252871
great -> negative 0.23636989022567528
great classified as POSITIVE

terrible -> positive 0.28007942392719387
terrible -> negative 0.3913616946409252
terrible classified as NEGATIVE

rubbish -> positive 0.13967908330449005
rubbish -> negative 0.24332614796946453
rubbish classified as NEGATIVE

average -> positive 0.3507661092170427
average -> negative 0.36584300761165206
average classified as NEUTRAL

ok -> positive 0.2533984654352536
ok -> negative 0.270677459132954
ok classified as NEUTRAL

scoozle has no vector, classified as NEUTRAL



#### 3.4 Predicting the Orientations of Opinion Sentences

For each sentence containing opinion words, we will assign a polarity to each opinion word alloacted to each feature using the method described above. We will save these results in a new dictionary.

In [20]:
def calculate_polarity(opword_dict):
    """ 
    Get polarity for opinion words in a sentence
    """
    polarity_dict = defaultdict(list)
    for k1, v1 in opword_dict.items():
        feature_dict = defaultdict(list)
        for k2, v2 in v1.items():
            cum_polarity = 0
            for opword in v2:
                cum_polarity += polarity(opword)
            feature_dict[k2] = cum_polarity
        polarity_dict[k1] = feature_dict
    return polarity_dict

In [21]:
polarity_dict_train = calculate_polarity(opword_dict_train)
polarity_dict_test = calculate_polarity(opword_dict_test)
print_dict(polarity_dict_train, 10)

109: defaultdict(<class 'list'>, {'camera': 1})
480: defaultdict(<class 'list'>, {'point': 1, 'photography': 0})
135: defaultdict(<class 'list'>, {'use': -1})
77: defaultdict(<class 'list'>, {'point': 1, 'difference': 0})
396: defaultdict(<class 'list'>, {'camera': 0})
286: defaultdict(<class 'list'>, {'camera': -1})
10: defaultdict(<class 'list'>, {'canon': 1})
78: defaultdict(<class 'list'>, {'lens': 1, 'lense': 1})
290: defaultdict(<class 'list'>, {'camera': 1})
30: defaultdict(<class 'list'>, {'setting': 0, 'picture': -1})
463: defaultdict(<class 'list'>, {'g3': -1})


<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we have:
    <li> Used word vectors and cosine similarity to determine the polarity of the opinion words allocated to the features in the reviews </li>
</font>`

### 4. Apply relevant algorithms

We will now create summaries for product features by iterating through all the identified features, and summing the polarity of the opinion words for each of those features.

We construct summaries of the form:

- Product Name  
    - Feature Name  
        - Positive: # of positive opinion words
        - Negative: # of negative opinion words


In [22]:
  def create_summary_review(product_name, polarity_dict, max_features=5):
    """ 
    Summarise the review for a product 
    """
    pos_dict = defaultdict(int)
    neg_dict = defaultdict(int)
    counts_dict = defaultdict(int)
    
    for k1, v1 in polarity_dict.items():
        for k2, v2 in v1.items():
            if v2 > 0:
                counts_dict[k2] += 1
                pos_dict[k2] += 1
            elif v2 < 0:
                counts_dict[k2] += 1
                neg_dict[k2] += 1
                
    # Sort the dictionary
    counts_dict = sort_dict(counts_dict, reverse=True)   
    
    print(f'Product: {product_name}')
    for i, k in enumerate(counts_dict.keys()):
        if i > max_features:
            break
        print(f'  Feature: {k}')
        print(f'     Positive: {pos_dict[k]}')
        print(f'     Negative: {neg_dict[k]} \n')    

In [23]:
create_summary_review(product, polarity_dict_train, max_features=5)

Product: Canon G3
  Feature: camera
     Positive: 39
     Negative: 14 

  Feature: use
     Positive: 22
     Negative: 3 

  Feature: picture
     Positive: 13
     Negative: 8 

  Feature: flash
     Positive: 5
     Negative: 6 

  Feature: image
     Positive: 6
     Negative: 5 

  Feature: g3
     Positive: 6
     Negative: 4 



<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we have:
    <li> Generated a feature level summary for a given product, detailing the number of positive and negative opinion words associated with each feature </li>
</font>`

### 5. Evaluation

In this section we will evaluate the performance of our algorithm so far.

We will assess performance in two areas:
1. How well our algorithm identifies product features 
2. How well our algorithm has identifies sentiment, both in terms of detection and orientation

We will use precision, recall (and accuracy for sentiment orientation) as performance metrics.

#### 5.1 Feature identification

Firstly, we want to know how well our algorithm identifies features in the reviews.

As there are many possible features, I have only considered those features that are common between the ground truth labels and those we have discovered.

In [24]:
def create_feature_target_df_from_labels(y_dict, verbose=False):
    """
    Create a pandas dataframe of target variables given target labels sourced from the original data
    """
    features_set = set()
    target_dict = defaultdict(list)

    # Firstly, collect all the features in a set
    for k, v in y_dict.items():
        v_split = v.split(',')
        if len(v_split) > 0:
            for x in v_split:
                k = x[:x.find('[')].strip()
                if k != '':
                    features_set.add(k)
                    
    # Create tabular data
    for k, v in y_dict.items():
        v_split = v.split(',')
        items = [(x[:x.find('[')].strip(), x[x.find('[')+1:x.find(']')].strip()) for x in v_split]
        for i in items:
            if len(i[0]) > 0:
                target_dict[i[0]].append(1)
        for j in [f for f in features_set if f not in [i[0] for i in items]]:
            target_dict[j].append(0)
 
    if verbose:
        print('Target labels and volumes:')
        for k in target_dict.keys():
            print(k, len(target_dict[k]))
                
    return pd.DataFrame(target_dict)

def create_feature_target_df_from_model(x_dict, docs_containing_features, polarity_dict, features, verbose=False):
    """ 
    Create a target df from the features identified in our analysis 
    """
    target_dict = defaultdict(list)
    keys = []
    for k in x_dict.keys():
        if k in docs_containing_features.keys():
            for feature in docs_containing_features[k]:
                target_dict[feature].append(1)
            for feature in [f for f in features if f not in docs_containing_features[k]]:
                target_dict[feature].append(0)
        else:
            for feature in features:
                target_dict[feature].append(0)
                
    if verbose:
        for k in target_dict.keys():
            print(k, len(target_dict[k]))
        
    return pd.DataFrame(target_dict)

def performance_metrics_feat(y_df, y_pred_df, feature_level=False):
    """
    Calculate precision and recall
    """
    performance_dict = defaultdict(list)
    
    # Only keep features present in both datasets
    shared_features = [x for x in y_pred_df.columns.to_list() if x in y_df.columns.to_list()]
    y_df = y_df[shared_features]
    y_pred_df = y_pred_df[shared_features]
    
    if feature_level:
        for feature in shared_features:
            count = y_df[feature].sum()
            precision = precision_score(y_df[feature], y_pred_df[feature])
            recall = recall_score(y_df[feature], y_pred_df[feature])
            performance_dict['feature'].append(feature)
            performance_dict['count'].append(count)
            performance_dict['precision'].append(precision)
            performance_dict['recall'].append(recall)
    else:
        precision = precision_score(y_df, y_pred_df, average='weighted')
        recall = recall_score(y_df, y_pred_df, average='weighted')
        performance_dict['precision'].append(precision)
        performance_dict['recall'].append(recall)
        
    return pd.DataFrame(performance_dict)

def performance_metrics_feat_combined(product, y_train_df, y_train_pred_df, y_test_df, y_test_pred_df):
    """
    Combine train and test features metrics to form one unified dataframe
    """ 
    COLUMNS = ['precision_train', 'recall_train', 'precision_test', 'recall_test']
    train_metrics = performance_metrics_feat(y_train_df, y_train_pred_df, feature_level=False)
    test_metrics = performance_metrics_feat(y_test_df, y_test_pred_df, feature_level=False)
    combined_metrics = pd.concat([train_metrics, test_metrics], axis=1)
    combined_metrics['product'] = product
    combined_metrics.columns = COLUMNS + ['product']
    
    return combined_metrics[['product'] + COLUMNS]

In [25]:
# Create datasets
y_train_feat = create_feature_target_df_from_labels(y_train)
y_train_pred_feat = create_feature_target_df_from_model(x_train, docs_containing_features_train, 
                                                            polarity_dict_train, features)
y_test_feat = create_feature_target_df_from_labels(y_test)
y_test_pred_feat = create_feature_target_df_from_model(x_test, docs_containing_features_test, 
                                                               polarity_dict_test, features)

We find that our algorithm has a precision of about 0.31 and recall of 0.88 on the test set.

This means that on averages it detects features correctly 31% of the time, and it captures 88% of all features.

In [26]:
# Calculate performance
perf_feat = performance_metrics_feat_combined(product, y_train_feat, y_train_pred_feat, 
                                                  y_test_feat, y_test_pred_feat)
perf_feat 

Unnamed: 0,product,precision_train,recall_train,precision_test,recall_test
0,Canon G3,0.299438,0.83,0.309133,0.880952


#### 5.2 Analysis of sentiment

In this section we will evaluate how well our model predicts sentiment at the sentence level.

We measure how well the algorithm captures sentiment bearing sentences through precision and recall, and orientation performance via accuracy.

When considering orientation, a sentence is deemed to have positive sentiment is the sum of the orientation values is positive, negative sentiment if the sum of the orientation values is negative, and neutral if the sum is zero.

In [27]:
def create_sentiment_target_df_from_labels(y_dict):
    """
    Create a pandas dataframe of target variables given target labels sourced from the original data
    """
    # Create dicts to store sentiment bearing sentence binary flags and orientations
    sbs_dict = {}
    orientation_dict = {}
    
    # Create tabular data
    for k, v in y_dict.items():
        v_split = v.split(',')
        items = [x[x.find('[')+1:x.find(']')].strip() for x in v_split]
        cumsum = 0
        sbs_dict[k] = 0
        for i in items:
            if len(i) > 0:
                sbs_dict[k] = 1
                try:
                    if i=="+" or int(i) > 0:
                        cumsum += 1
                    elif i=="-" or int(i) < 0:
                        cumsum -= 1
                except:
                    pass
        if cumsum > 0:
            orientation_dict[k] = 1
        elif cumsum < 0:
            orientation_dict[k] = -1
        else:
            orientation_dict[k] = 0
         
    # Create target df by combining sbs and orientation DataFrames
    sbs_df = pd.DataFrame(sbs_dict, index=[0]).T
    sbs_df.columns = ['sentiment']
    orientation_df = pd.DataFrame(orientation_dict, index=[0]).T
    orientation_df.columns = ['orientation']
    target_df = pd.concat([sbs_df, orientation_df], axis=1)
    
    return target_df

def create_sentiment_target_df_from_model(x_train, polarity_dict):
    """ 
    Create a sentiment df from the features identified in our analysis 
    """
    # Create dicts to store sentiment bearing sentence binary flags and orientations
    sbs_dict = {}
    orientation_dict = {}

    for k in x_train.keys():
        if k in polarity_dict.keys():
            sbs_dict[k] = 1
            cum_sum = 0
            for value in polarity_dict[k].values():
                cum_sum += value
            if cum_sum > 0:
                orientation_dict[k] = 1
            elif cum_sum < 0:
                orientation_dict[k] = -1
            else:
                orientation_dict[k] = 0    
        else:
            orientation_dict[k] = 0
            sbs_dict[k] = 0
            
    # Create target df by combining sbs and orientation DataFrames
    sbs_df = pd.DataFrame(sbs_dict, index=[0]).T
    sbs_df.columns = ['sentiment']
    orientation_df = pd.DataFrame(orientation_dict, index=[0]).T
    orientation_df.columns = ['orientation']
    target_df = pd.concat([sbs_df, orientation_df], axis=1)  
    
    return target_df

def performance_metrics_sentiment(y_df, y_pred_df):
    """
    Calculate precision and recall
    """
    performance_dict = defaultdict(list)
    precision = precision_score(y_df['sentiment'], y_pred_df['sentiment'], average='weighted')
    recall = recall_score(y_df['sentiment'], y_pred_df['sentiment'], average='weighted')
    accuracy = accuracy_score(y_df['orientation'], y_pred_df['orientation'])
    performance_dict['precision'].append(precision)
    performance_dict['recall'].append(recall)
    performance_dict['orientation_accuracy'].append(accuracy)
    return pd.DataFrame(performance_dict)

def performance_metrics_sentiment_combined(product, y_train_df, y_train_pred_df, y_test_df, y_test_pred_df):
    """
    Combine train and test sentiment metrics to form one unified dataframe
    """
    COLUMNS = ['precision_train', 'recall_train', 'accuracy_train', 'precision_test', 'recall_test', 'accuracy_test']
    train_metrics = performance_metrics_sentiment(y_train_df, y_train_pred_df)
    test_metrics = performance_metrics_sentiment(y_test_df, y_test_pred_df)
    combined_metrics = pd.concat([train_metrics, test_metrics], axis=1)
    combined_metrics['product'] = product
    combined_metrics.columns = COLUMNS + ['product']
    
    return combined_metrics[['product'] + COLUMNS]

In [28]:
# Create datasets
y_train_sentiment = create_sentiment_target_df_from_labels(y_train)
y_train_pred_sentiment = create_sentiment_target_df_from_model(x_train, polarity_dict_train)
y_test_sentiment = create_sentiment_target_df_from_labels(y_test)
y_test_pred_sentiment = create_sentiment_target_df_from_model(x_test, polarity_dict_test)

We find that our algorithm has a precision of about 0.54, recall of 0.53 and accuracy of 0.49 on the test set.

This means that on averages it detects senteniment bearing sentences 54% of the time, it captures 53% of all sentiment bearing sentences, and it predicts orientation correctly 49% of the time.

In [29]:
# Calculate performance
perf_sentiment = performance_metrics_sentiment_combined(product, y_train_sentiment, y_train_pred_sentiment, 
                                       y_test_sentiment, y_test_pred_sentiment)
perf_sentiment

Unnamed: 0,product,precision_train,recall_train,accuracy_train,precision_test,recall_test,accuracy_test
0,Canon G3,0.623545,0.574944,0.608501,0.537229,0.533333,0.493333


<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we have:
    <li> Examined the performance of our algorithm in terms of feature identification and sentiment, using precision, recall and accuracy metrics </li>
    <li> Found that the performance of our algorithm leaves a lot to be desired, and is far from human level performance </li>   
</font>`

#### 5.3 Creating an NLP pipeline and analysing more products
In this section we will create a pipeline that orchestrates all of the previous steps in a single function.

We will use this pipeline to replicate the analysis for multiple products. All parameters used in pervious steps will be retain their settings for this analysis for comparability.

In [30]:
def run_pipeline_single_product(file, 
                                product="",
                                test_percent=0.25, 
                                remove_stopwords=False, 
                                min_token_length=1,
                                min_support=3,
                                min_percent=0.01, 
                                max_feature_distance=5,
                                max_features_review=5,
                                random_seed=42,
                                print_review=True,
                                verbose=False):
    """
    Run the whole NLP pipeline for a single product
    """
    data = read_in(single_file=file)
    labels, reviews, other, data_clean = create_dicts(data)
    
    # Data preprocessing
    reviews_clean = clean_text(reviews, remove_stopwords=remove_stopwords, min_token_length=min_token_length)
    
    x_train, y_train, x_test, y_test = create_train_test_sets(reviews_clean, labels, test_percent, 
                                                              random_seed=random_seed, verbose=verbose)
    
    # Extract relevant information
    features_dict = get_features(x_train, verbose=verbose)
    
    features = feature_pruning(x_train, features_dict, min_support=min_support, min_percent=min_percent)
    
    docs_containing_features_train = find_docs_with_features(x_train, features)
    docs_containing_features_test = find_docs_with_features(x_test, features)
    opword_dict_train = identify_and_allocate_opinion_words(x_train, docs_containing_features_train, 
                                                            max_feature_distance=max_feature_distance,
                                                            verbose=verbose) 
    opword_dict_test = identify_and_allocate_opinion_words(x_test, docs_containing_features_test, 
                                                           max_feature_distance=max_feature_distance,
                                                           verbose=verbose) 
        
    polarity_dict_train = calculate_polarity(opword_dict_train)
    polarity_dict_test = calculate_polarity(opword_dict_test)
    
    # Apply the algorithm
    if print_review:
        create_summary_review(product + ' Train', polarity_dict_train, max_features=max_features_review)
        create_summary_review(product + ' Test', polarity_dict_test, max_features=max_features_review)
        
    # Evaluation - Product features
    y_train_feat = create_feature_target_df_from_labels(y_train)
    y_train_pred_feat = create_feature_target_df_from_model(x_train, docs_containing_features_train, 
                                                                polarity_dict_train, features)
    
    y_test_feat = create_feature_target_df_from_labels(y_test)
    y_test_pred_feat = create_feature_target_df_from_model(x_test, docs_containing_features_test, 
                                                                polarity_dict_test, features)
    
    perf_feat = performance_metrics_feat_combined(product, y_train_feat, y_train_pred_feat, 
                                                  y_test_feat, y_test_pred_feat)
    
    # Evaluation - Sentiment
    y_train_sentiment = create_sentiment_target_df_from_labels(y_train)
    y_train_pred_sentiment = create_sentiment_target_df_from_model(x_train, polarity_dict_train)

    y_test_sentiment = create_sentiment_target_df_from_labels(y_test)
    y_test_pred_sentiment = create_sentiment_target_df_from_model(x_test, polarity_dict_test)

    perf_sentiment = performance_metrics_sentiment_combined(product, y_train_sentiment, y_train_pred_sentiment, 
                                                            y_test_sentiment, y_test_pred_sentiment)
    
    return perf_feat, perf_sentiment

def create_mean_perf(perf_df_all, sentiment=False):
    """
    Calculate mean metrics to add as an average row
    """
    mean_dict = {}
    mean_perf = perf_df_all.mean(numeric_only=True, axis=0)
    mean_dict['product'] = 'Average'
    mean_dict['precision_train'] = mean_perf['precision_train']
    mean_dict['recall_train'] = mean_perf['recall_train']
    if sentiment:
        mean_dict['accuracy_train'] = mean_perf['accuracy_train']
    mean_dict['precision_test'] = mean_perf['precision_test']
    mean_dict['recall_test'] = mean_perf['recall_test']
    if sentiment:
        mean_dict['accuracy_test'] = mean_perf['accuracy_test']
    return mean_dict

def run_pipeline_multiple_products(file_list, 
                                   product_list,
                                   test_percent=0.25, 
                                   remove_stopwords=False, 
                                   min_token_length=1,
                                   min_support=3,
                                   min_percent=0.01, 
                                   max_feature_distance=5,
                                   max_features_review=5,
                                   random_seed=42,
                                   print_review=False,
                                   verbose=False):
    """
    Run the NLP pipeline for multiple products
    """
    if len(file_list) != len(product_list):
        raise ValueError("The number of files and product names must match")
        
    perf_feat_all = pd.DataFrame()
    perf_sentiment_all = pd.DataFrame()
    
    for i, file in enumerate(file_list):
        print(f'Running pipeline for {product_list[i]}')
        perf_feat, perf_sentiment = run_pipeline_single_product(file, 
                                                                product=product_list[i],
                                                                test_percent=test_percent, 
                                                                remove_stopwords=remove_stopwords, 
                                                                min_token_length=min_token_length,
                                                                min_support=min_support,
                                                                min_percent=min_percent, 
                                                                max_feature_distance=max_feature_distance,
                                                                max_features_review=max_features_review,
                                                                random_seed=random_seed,
                                                                print_review=print_review,
                                                                verbose=verbose)
        # Append the dataframes
        perf_feat_all = perf_feat_all.append(perf_feat)
        perf_sentiment_all = perf_sentiment_all.append(perf_sentiment)
        
    
    # Add averages
    perf_feat_all = perf_feat_all.append(create_mean_perf(perf_feat_all), ignore_index=True)
    perf_sentiment_all = perf_sentiment_all.append(create_mean_perf(perf_sentiment_all, sentiment=True), 
                                                   ignore_index=True)
    return perf_feat_all, perf_sentiment_all

Now let's run the pipeline for the following 9 products:
- Apex AD2600 Progressive-scan DVD player
- Creative Labs Nomad Jukebox Zen Xtra 40GB
- Hitachi Router
- Nikon coolpix 4300
- Diaper Champ
- Nokia 6600
- iPod
- Linksys Router
- Norton

In [31]:
%%time
file_list = ['data/Customer_review_data/Apex AD2600 Progressive-scan DVD player.txt',
             'data/Customer_review_data/Creative Labs Nomad Jukebox Zen Xtra 40GB.txt',
             'data/Customer_review_data/Nikon coolpix 4300.txt',
             'data/Reviews-9-products/Hitachi router.txt',
             'data/Reviews-9-products/Diaper Champ.txt',
             'data/Reviews-9-products/Nokia 6600.txt',
             'data/Reviews-9-products/ipod.txt',
             'data/Reviews-9-products/Linksys Router.txt',
             'data/Reviews-9-products/norton.txt']

product_list = ['DVD player',
                'MP3 player',
                'Hitachi router',
                'Nikon coolpix',
                'Diaper Champ',
                'Nokia 6600',
                'iPod',
                'Linksys Router',
                'Norton']

perf_feat_all, perf_sentiment_all = run_pipeline_multiple_products(file_list, product_list)

Running pipeline for DVD player
Running pipeline for MP3 player
Running pipeline for Hitachi router
Running pipeline for Nikon coolpix
Running pipeline for Diaper Champ
Running pipeline for Nokia 6600


  _warn_prf(average, modifier, msg_start, len(result))


Running pipeline for iPod
Running pipeline for Linksys Router
Running pipeline for Norton
Wall time: 1min 32s


Observing feature identification - the algorithm achieves an average test set precision of 0.33 and an average test set recall of 0.85.

These results vary widely among the products.

In [32]:
perf_feat_all

Unnamed: 0,product,precision_train,recall_train,precision_test,recall_test
0,DVD player,0.257247,0.971429,0.163228,0.846154
1,MP3 player,0.379275,0.836735,0.335324,0.860759
2,Hitachi router,0.440116,0.876543,0.271288,0.681818
3,Nikon coolpix,0.544922,0.802632,0.388889,0.733333
4,Diaper Champ,0.343283,0.836735,0.310608,0.913043
5,Nokia 6600,0.38041,0.786325,0.319444,0.766667
6,iPod,0.296984,1.0,0.570833,1.0
7,Linksys Router,0.190587,0.837209,0.272619,0.928571
8,Norton,0.251492,0.828571,0.293357,0.909091
9,Average,0.342702,0.86402,0.325066,0.848826


Observing sentiment - the algorithm achieves an average test set precision of 0.63, average test set recall of 0.60, and an average orientation accuracy of 0.57.

Again, these results vary widely among the products.

In [33]:
perf_sentiment_all

Unnamed: 0,product,precision_train,recall_train,accuracy_train,precision_test,recall_test,accuracy_test
0,DVD player,0.61935,0.617329,0.536101,0.551806,0.551351,0.518919
1,MP3 player,0.598848,0.591298,0.550117,0.628917,0.617716,0.592075
2,Hitachi router,0.615766,0.610039,0.590734,0.644181,0.609195,0.551724
3,Nikon coolpix,0.57422,0.568376,0.512821,0.591316,0.564103,0.487179
4,Diaper Champ,0.625641,0.619217,0.544484,0.655113,0.659574,0.648936
5,Nokia 6600,0.640099,0.583133,0.503614,0.61369,0.57554,0.503597
6,iPod,0.615638,0.544081,0.541562,0.676334,0.56391,0.616541
7,Linksys Router,0.631895,0.598592,0.598592,0.689562,0.678322,0.692308
8,Norton,0.607793,0.603509,0.452632,0.646999,0.568421,0.494737
9,Average,0.614361,0.592841,0.536739,0.633102,0.598681,0.567335


<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we have:
    <li> Devloped a comprehensive NLP pipeline to analyse multiple products</li>
    <li> Applied this pipeline to 9 products and examined the performance metrics</li>   
</font>

### 6. Machine learning method
So far our algorithm has operated completely unsupervised!

In this section we will trial a machine learning method for predicting sentiment orientation.

Specifically, we will transform our sentiment bearing sentences into tf-idf weighted word vectors and apply a Support Vector Classifier (SVC) to predict orientation.

We will build our predictions using only the sentences that we have flagged as being sentiment bearing, using our earlier unsupervised method. When creating predictions, sentences that we have flagged as non-sentiment bearing will be given an orientation of 0 for neutral. This will enable us to compare fairly with the ground truth labels.

In [34]:
def train_svm_models(x_train, x_test, docs_containing_features_train, docs_containing_features_test,
                     y_train_sentiment, y_test_sentiment):
    """
    Train SVM models and return predictions
    """
    # Only train on sentences that we think are sentiment bearing
    x_train_list = []
    y_train_list = []
    for k in docs_containing_features_train.keys(): 
        x_train_list.append(x_train[k])
        y_train_list.append(int(y_train_sentiment['orientation'].loc[k]))

    x_test_list = []
    y_test_list = []
    for k in docs_containing_features_test.keys(): 
        x_test_list.append(x_test[k])
        y_test_list.append(int(y_test_sentiment['orientation'].loc[k]))    

    # Create tf-idf vectors and train models
    vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
    vectorizer.fit(x_train_list)
    vectorizer_vocab = vectorizer.get_feature_names()

    x_train_vec = vectorizer.transform(x_train_list).toarray()
    x_test_vec = vectorizer.transform(x_test_list).toarray()

    clf = LinearSVC()
    clf.fit(x_train_vec, y_train_list)

    train_preds = clf.predict(x_train_vec)
    test_preds = clf.predict(x_test_vec)
    
    # Reconstruct entire datasets
    train_preds_dict = {}
    c = 0
    for k in x_train.keys(): 
        if k in docs_containing_features_train.keys():
            train_preds_dict[k] = train_preds[c]
            c += 1
        else:
            train_preds_dict[k] = 0
            
    test_preds_dict = {}
    c = 0
    for k in x_test.keys(): 
        if k in docs_containing_features_test.keys():
            test_preds_dict[k] = test_preds[c]
            c += 1
        else:
            test_preds_dict[k] = 0
            
    y_train_sentiment_ml_pred = [v for v in train_preds_dict.values()]
    y_test_sentiment_ml_pred = [v for v in test_preds_dict.values()]
    
    return y_train_sentiment_ml_pred, y_test_sentiment_ml_pred

def create_mean_perf_ml(perf_df_all):
    """
    Calculate mean metrics to add as an average row
    """
    mean_dict = {}
    mean_perf = perf_df_all.mean(numeric_only=True, axis=0)
    mean_dict['product'] = 'Average'
    mean_dict['accuracy_train'] = mean_perf['accuracy_train']
    mean_dict['accuracy_test'] = mean_perf['accuracy_test']
    return mean_dict

def run_pipeline_single_product_ml(file, 
                                    product="",
                                    test_percent=0.25, 
                                    remove_stopwords=False, 
                                    min_token_length=1,
                                    min_support=3,
                                    min_percent=0.01, 
                                    max_feature_distance=5,
                                    max_features_review=5,
                                    random_seed=42,
                                    print_review=True,
                                    verbose=False):
    """
    Run the whole NLP pipeline for a single product, using an SVM to predict orientation
    """
    data = read_in(single_file=file)
    labels, reviews, other, data_clean = create_dicts(data)
    
    # Data preprocessing
    reviews_clean = clean_text(reviews, remove_stopwords=remove_stopwords, min_token_length=min_token_length)
    
    x_train, y_train, x_test, y_test = create_train_test_sets(reviews_clean, labels, test_percent, 
                                                              random_seed=random_seed, verbose=verbose)
    
    # Extract relevant information
    features_dict = get_features(x_train, verbose=verbose)
    
    features = feature_pruning(x_train, features_dict, min_support=min_support, min_percent=min_percent)
    
    docs_containing_features_train = find_docs_with_features(x_train, features)
    docs_containing_features_test = find_docs_with_features(x_test, features)
    opword_dict_train = identify_and_allocate_opinion_words(x_train, docs_containing_features_train, 
                                                            max_feature_distance=max_feature_distance,
                                                            verbose=verbose) 
    opword_dict_test = identify_and_allocate_opinion_words(x_test, docs_containing_features_test, 
                                                           max_feature_distance=max_feature_distance,
                                                           verbose=verbose) 
        
    polarity_dict_train = calculate_polarity(opword_dict_train)
    polarity_dict_test = calculate_polarity(opword_dict_test)
    
    if print_review:
        create_summary_review(product + ' Train', polarity_dict_train, max_features=max_features_review)
        create_summary_review(product + ' Test', polarity_dict_test, max_features=max_features_review)
        

    # Evaluation - Sentiment
    y_train_sentiment = create_sentiment_target_df_from_labels(y_train)
    y_test_sentiment = create_sentiment_target_df_from_labels(y_test)
    
    y_train_sentiment_pred, y_test_sentiment_pred = train_svm_models(x_train, x_test, 
                                                                     docs_containing_features_train, 
                                                                     docs_containing_features_test, 
                                                                     y_train_sentiment, 
                                                                     y_test_sentiment)
    
    # Calculate performance
    acc_train = accuracy_score(y_train_sentiment['orientation'].to_list(), y_train_sentiment_pred)
    acc_test = accuracy_score(y_test_sentiment['orientation'].to_list(), y_test_sentiment_pred)
    
    perf_sentiment = pd.DataFrame({'product':product, 'accuracy_train':acc_train , 'accuracy_test':acc_test},
                                  index=[0])
 
    return perf_sentiment


def run_pipeline_multiple_products_ml(file_list, 
                                   product_list,
                                   test_percent=0.25, 
                                   remove_stopwords=False, 
                                   min_token_length=1,
                                   min_support=3,
                                   min_percent=0.01, 
                                   max_feature_distance=5,
                                   max_features_review=5,
                                   random_seed=42,
                                   print_review=False,
                                   verbose=False):
    """
    Run the NLP pipeline for multiple products
    """
    if len(file_list) != len(product_list):
        raise ValueError("The number of files and product names must match")
        
    perf_sentiment_all = pd.DataFrame()
    
    for i, file in enumerate(file_list):
        print(f'Running ML pipeline for {product_list[i]}')
        perf_sentiment = run_pipeline_single_product_ml(file, 
                                                        product=product_list[i],
                                                        test_percent=test_percent, 
                                                        remove_stopwords=remove_stopwords, 
                                                        min_token_length=min_token_length,
                                                        min_support=min_support,
                                                        min_percent=min_percent, 
                                                        max_feature_distance=max_feature_distance,
                                                        max_features_review=max_features_review,
                                                        random_seed=random_seed,
                                                        print_review=print_review,
                                                        verbose=verbose)
        # Append the dataframe
        perf_sentiment_all = perf_sentiment_all.append(perf_sentiment)
        
    
    # Add averages
    perf_sentiment_all = perf_sentiment_all.append(create_mean_perf_ml(perf_sentiment_all), 
                                                   ignore_index=True)
    return perf_sentiment_all

In [35]:
%%time
file_list = ['data/Customer_review_data/Apex AD2600 Progressive-scan DVD player.txt',
             'data/Customer_review_data/Creative Labs Nomad Jukebox Zen Xtra 40GB.txt',
             'data/Customer_review_data/Nikon coolpix 4300.txt',
             'data/Reviews-9-products/Hitachi router.txt',
             'data/Reviews-9-products/Diaper Champ.txt',
             'data/Reviews-9-products/Nokia 6600.txt',
             'data/Reviews-9-products/ipod.txt',
             'data/Reviews-9-products/Linksys Router.txt',
             'data/Reviews-9-products/norton.txt']

product_list = ['DVD player',
                'MP3 player',
                'Hitachi router',
                'Nikon coolpix',
                'Diaper Champ',
                'Nokia 6600',
                'iPod',
                'Linksys Router',
                'Norton']

perf_sentiment_all_ml = run_pipeline_multiple_products_ml(file_list, product_list)

Running ML pipeline for DVD player
Running ML pipeline for MP3 player
Running ML pipeline for Hitachi router
Running ML pipeline for Nikon coolpix
Running ML pipeline for Diaper Champ
Running ML pipeline for Nokia 6600
Running ML pipeline for iPod
Running ML pipeline for Linksys Router
Running ML pipeline for Norton
Wall time: 1min 23s


After implementing our ML algorithm, we find test set accuracy increases to 0.63 from 0.57 using our original method. This equates to an increase in accuracy of 10.5%! Although still a long way from human level performance, this is a considerable improvement.

Further improvements could be made through further hyperparameter tuning to the vectorisation settings and the SVC algorithm.

In [36]:
perf_sentiment_all_ml

Unnamed: 0,product,accuracy_train,accuracy_test
0,DVD player,0.972924,0.605405
1,MP3 player,0.934732,0.668998
2,Hitachi router,0.926641,0.689655
3,Nikon coolpix,0.991453,0.564103
4,Diaper Champ,1.0,0.670213
5,Nokia 6600,0.898795,0.57554
6,iPod,0.962217,0.729323
7,Linksys Router,0.941315,0.685315
8,Norton,0.94386,0.515789
9,Average,0.952437,0.633816


<font color='blue' weight='bold'>
    <b>Summary</b><br>
    In this section we have:
    <li> Modified our pipeline to use tf-idf word vectors and a SVC to predict sentiment orientation </li>
    <li> Observed an increase in test set accuracy of 10.5% from using these methods </li>   
</font>

### 7. Conclusions

In this assignment we have carried out opinion mining to extract important product features and understand the sentiment towards these features and for each sentence in a review.

To meet this end we have built a comprehensive NLP pipeline that reads in data, cleans it, identifies relevant features, opinion words and sentence level sentiment. Our pipeline is capable of producing review summaries to understand the key features and their sentiment orientations. We have analysed the performance of our algorithm through performance metrics of precision, recall and accuracy.

As an extension to our algorithm, we have trialled a machine learning model for predicting sentence orientation. We find that the use of tf-idf vectors and SVC increase average orientation accuracy by 10.5%, from 0.57 to 0.63.

Overall, the results of our algorithm are far from the performance we would expect of a human, however performance could be greatly improved through invstigation of other methods such as dependency parsing and utilisation of state of the art techiques such as transformer models.

### 8. References
- Hu, M. and Liu, B., 2004. Mining Opinion Features in Customer Reviews. Proceedings of AAAI.
- Hu, M. and Liu, B., 2004. Mining and summarizing customer reviews. [Online], KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp.168–177. Available from:                 https://doi.org/10.1145/1014052.1014073.
- Wu, Y., Zhang, Q., Huang, X. and Wu, L., 2009. Phrase dependency parsing for opinion mining. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3, USA. EMNLP ’09. USA: Association for Computational Linguistics, pp.1533–1541.


