# Exercise - Final Assignment in Opinion Mining

---

<div class="alert alert-block alert-success">
<b>1) Analyse the data and the task: </b> 
</div>

Below is the structure of the data folder:

<img src="Pictures/Introduction_2.jpg" style="width: 90%;"/>


In this exercise, we will create a data frame for each folder. Below are the data frames

    1. df_CustomerReview
    2. df_Reviews_9_prdoucts
    3. df_CustomerReviews_3_domain

In addition, a global data frame, `df_Global`, will be created to combine the data from the above data frames. 

*Remark (1): Readme.txt files will be ignored during data parsing. 

---

#### We can understand the meaning of those special characters in other txt files through readme files.  A summary table is shown below:

<img src="Pictures/Introduction.jpg" style="width: 90%;"/>
        

In this exercise, [+n] and [-n] will be treated as the ground truth sentiment. In our result evaluation, we will use these ground truths to calculate all related matrics, such as recall and precision.


For those files with [t], we will group the sentence by the [t], i.e. the title. In the dataframe, we would create Title and Title ID  columns to store such information. 


** Remark (2): [t] does not exist in CustomerReviews-3_domains and ipod.txt under Reviews-9-products. 


---


### Below is the pipeline of this exercise, which mirros the apporoach taken by Hu and Liu (2004). 

<img src="Pictures/Pipeline_v4.jpg" style="width: 90%;"/>



#### There are six parts for this exercise:

1) Data Parsing - To load the text files into a panada data frame format as a foundation for future steps.

2) Pre-Processing - To extract relevant information from the text, pre-processing steps would be conducted.

3) Frequent Features Extraction - With the pre-processed data, we would extract frequent features.

4) Opinion Mining - With the frequent features, we would use them to predict the opinion of each review. In this exercise, two algorithms (Vader and TextBlob) would be implemented.

5) Comparison with Machine Learning - Apart from that, we will use Machine Learning as a control experiment against the pipeline taken by Hu and Liu (2004).

6) Print the summaries of the opinion miner - Imagine this exercise is an opinion mining system, these summaries would be the output of the system.


---

<div class="alert alert-block alert-success">
<b>2) Data Parsing </b> 
</div>


### 2.1 Malfromed Data Handling 

During the pre-processing, malformed data was discovered. Due to the low volume, the manual correction in data format will not impact our future steps, for instance, Machine Learning, Features Extraction, and Opinion Mining. We amended four types of errors manually.


### 2.1.1 Manual Adjustment

Below are manual adjustments made before data parsing. We decided to amend the data manually since the volume of the following scenarios is uncommon, which cannot justify an automated solution. 

1) Misplacement of Special Characters 
    Example (1) Line 12 in Nokia:
    size[cs][+2]##3660 had similar features, but that is big in size.

The correct way of the above example is size[+2][cs]##3660 had similar features, but that is big in size. 

---

2)  Missing commas between features 
    
    Example (1) Line 15 in Nokia:
    LCD[+3]camera quality[+3]##Especially the LCD is big and the camera quality is among the very best around.

    Example (2) Line 54 in Nokia:
    processor[-3]ear volume[-2][cs]##Anyhow some demerits of this phone, The phone has a slow processor and the ear volume is lower than 6610 but then the 6610 has the best ear volume among most Nokia variants.

    Example (3) Line 93 in Nokia:
    camera[+3]PDA[+3]##i highly recommend it if you are looking for a phone with great camera quality and/or lots of features you would find in a PDA.

The correct format for sentences with multiple features is xxxx[+|-n], xxxx[+|-n]. The comma has a crucial role in separating different features. Same errors can be found in the Nokia file, where commas are missing in the features. 

---

### 2.1.2 Automated Adjustment

1) Missing the score inside the sentiment

    Example (1)Line 90 Creative Labs Nomad Jukebox Zen Xtra 40GB
    
    player[+], price[+3]##the creative labs zen xtra has all the features the i-pod has and if you get if from amazon your only going to pay $ 300 for this great player. 
    
In this case, the sentiment score is missing for the player. We assume the sentiment score is 1 when only the sign [+] and [-] is available. 

Based on the observation, this error is relatively more common compared to the cases in 2.1.1. Given that it will not impact the algorithm performance significantly, we will include an automated adjustment process in our data parsing algorithm.

Moreover, since the + and - contain the direction of sentiment, it will be useful to keep such data to increase our dataset, especially for Machine Learning.


---

### 2.1.3 Treat the feature as normal text

For other error types, we treated them as text as they would cost tremendous human effort and system performance to handle. In our data parsing data algorithm, an error handling procedure was included. 

1) Using the incorrect bracket to store the sentiment

    Example (1)Line 480 in Nokia 6610: 
    size[+2][u], look{+1]##first let me say that it is much smaller than it looks on the web and it also looks better
    
    Example (2)Line 578 in Apex AD2600 Progressive-scan DVD player:

According to the readme.text, it should be look[+1] instead of look{+1].

For this case, we did not include "look" as a feature. Instead, it would become a part of the sentence in the data frame. 


In [1]:
#Libraies for this assignment 

import os
import pandas as pd
from tqdm import tqdm
import numpy as np
import re
import time
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
import nltk
folder_to_view = 'Data'
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import recall_score, precision_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from sklearn import svm


#please comment the below lines if you already downloaded those files from nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')


  if LooseVersion(mpl.__version__) >= "3.0":
  other = LooseVersion(other)
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Chris\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Chris\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Chris\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Chris\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [2]:
#key functions for data prasing

# This is the function to obatins all the file names for this exercise. 
def get_filename(x):
    
    folder_to_view = 'Data'
    main_directories = os.listdir(folder_to_view)
    all_file_names=[]
    
    for dirpath, _, files in os.walk(os.path.join(folder_to_view,main_directories[x])): 
        files = [os.path.join(dirpath, x) for x in files  if x != 'Readme.txt']
        all_file_names +=files
        
    return all_file_names


## This function is to identify the features from different txt files. 
## The feature format is xxxx[+|-n], xxxx is a product feature.
def get_features(filename):
    with open(filename, "r") as f:
        lines = f.readlines()
        
    features = []
    for line in lines:
        
        if line.startswith ("*") or line.startswith ("#") or line.startswith ("[t]"):
            continue
            
        LHS = line.split("##")[0]
        line_features_scores = LHS.split(',') # xxxx[-1], xxx[+1]
        final_line_features = [x.split("[")[0].strip() for x in line_features_scores]
        features += final_line_features
        
    return list(set(features))


## This function is to validate if the feature is in the correct format, i.e. xxxx [+|-n]. 
## If the future is incorect, we will treat them as text
def validate_features(feature_str):
    
    if "[" in feature_str and "]" in feature_str:
        feature_score = feature_str.split("[")[1].strip()
        score_str=feature_score[:-1]
        
        if "+" in score_str or "-" in score_str:
            return True
        else:
            return False
        
    else:
        return False

In [3]:
# This is the function to turn the txt files into the pandas dataframe. 
def generate_dataframe1(all_file_names_1):
    
    unique_features=[]
    
    for file in all_file_names_1:
        test = get_features(file)
        unique_features += test
        
    unique_test = list(set(unique_features))
    my_columns_1 = ['Dataset', 'Product Name', 'Title', 'Title ID', 'Sentence'] 
    sorted_unique_features = sorted(unique_test)
    my_columns_1 += sorted_unique_features
    df1= pd.DataFrame(columns=my_columns_1)
    all_sentence = []
    sent_feature = []
    all_titles = []

    for file in tqdm(all_file_names_1):
        with open(file, "r") as f:
            lines = f.readlines()

        values = [0]*len(my_columns_1) 
        title=None
        sentence=None

        for line in lines:
            values = [0]*len(my_columns_1)

            if line.startswith ("*"):
                continue

            elif line.startswith ("[t]"):
                title = line[3:]
                all_titles.append(title)

            elif "##" in line: 
                if line.startswith ("##"):
                    sentence = line[2:]
                    values[0] = file.split("\\")[1]
                    values[1] = file.split("\\")[-1][:-4]
                    values[2] = title
                    values[3] = len(all_titles)
                    values[4] = sentence
                    df1.loc[len(df1)] = values
                    all_sentence.append(sentence)

                else:
                    components = line.split("##")
                    sentence = components[1]
                    features_scores_list = components[0].split(',')

                    for x in features_scores_list:
                        if not validate_features(x):
                            sentence = line
                            values[0] = file.split("\\")[1]
                            values[1] = file.split("\\")[-1][:-4]
                            values[2] = title
                            values[3] = len(all_titles)
                            values[4] = sentence
                            df1.loc[len(df1)] = values
                            all_sentence.append(sentence)
                            continue

                        feature = x.split("[")[0].strip()
                        feature_score = x.split("[")[1].strip()
                        score_str=feature_score[:-1]

                        if score_str == "+":
                            score_str = "+1"
                        if score_str == "-":
                            score_str = "-1"

                        score = int(score_str)
                        
                        if score > 0:
                            score = 1
                        elif score < 0:
                            score = -1
                        
                        sent_feature.append((feature,sentence,score))
                        values[my_columns_1.index(feature)] = score
                    
                    values[0] = file.split("\\")[1]
                    values[1] = file.split("\\")[-1][:-4]
                    values[2] = title
                    values[3] = len(all_titles)
                    values[4] = sentence
                    df1.loc[len(df1)] = values
    return df1

---

#### Dataframe of all datasets

In [4]:
files_global=[]
for x in range(0,3):
    files_global += get_filename(x)

df_Global = generate_dataframe1(files_global)
df_Global = df_Global.drop(df_Global.columns[5], axis=1)
df_Global

100%|█████████████████████████████████████████████████████████████████████████████████| 17/17 [38:36<00:00, 136.27s/it]


Unnamed: 0,Dataset,Product Name,Title,Title ID,Sentence,1-800,2004,2004 edition,2004 version,3650,...,zen,zen xtra,zone alarm,zoom,zoom mode,zoom range,zoomed image,zooming lever,zooms,zx
0,CustomerReviews-3_domains,Computer,,0,I purchased this monitor because of budgetary...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CustomerReviews-3_domains,Computer,,0,This item was the most inexpensive 17 inch mo...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CustomerReviews-3_domains,Computer,,0,My overall experience with this monitor was v...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CustomerReviews-3_domains,Computer,,0,When the screen was n't contracting or glitch...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CustomerReviews-3_domains,Computer,,0,I 've viewed numerous different monitor model...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10327,Reviews-9-products,norton,\n,637,"Finally I ran msconfig and went into the ""serv...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10328,Reviews-9-products,norton,\n,637,"This time, it started the uninstallation proce...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10329,Reviews-9-products,norton,\n,637,"I simply hate Symantec. I swear, if I could ha...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10330,Reviews-9-products,norton,\n,638,I have been a loyal Norton Anti-Virus and Inte...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Individual Datasets of sub-folders

In [5]:
# files_customerReview
files_customerReview = get_filename(1)

# Dataframe of Customer_Review_Data
df_CustomerReview = generate_dataframe1(files_customerReview)
df_CustomerReview = df_CustomerReview.drop(df_CustomerReview.columns[5], axis=1)

#Dataframe of Reviews-9-products
files_Reviews_9_prdoucts= get_filename(2)
df_Reviews_9_prdoucts = generate_dataframe1(files_Reviews_9_prdoucts)
df_Reviews_9_prdoucts = df_Reviews_9_prdoucts.drop(df_Reviews_9_prdoucts.columns[5], axis=1)

#Dataframe of CustomerReviews-3_domains
files_CustomerReviews_3_domains = get_filename(0)
df_CustomerReviews_3_domain = generate_dataframe1(files_CustomerReviews_3_domains)

#Show DataFrame (umcomment the line to show)
df_CustomerReview 
# df_Reviews_9_prdoucts
# df_CustomerReviews_3_domain


100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:12<00:00, 26.44s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [05:03<00:00, 33.75s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:21<00:00, 27.26s/it]


Unnamed: 0,Dataset,Product Name,Title,Title ID,Sentence,4mp,4mp camera,4mp resolution,8mb,8mb card,...,wma file,work,xtra,zen,zen xtra,zoom,zoom mode,zoomed image,zooming lever,zx
0,Customer_review_data,Apex AD2600 Progressive-scan DVD player,troubleshooting ad-2500 and ad-2600 no pictur...,1,"repost from january 13 , 2004 with a better fi...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Customer_review_data,Apex AD2600 Progressive-scan DVD player,troubleshooting ad-2500 and ad-2600 no pictur...,1,does your apex dvd player only play dvd audio ...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Customer_review_data,Apex AD2600 Progressive-scan DVD player,troubleshooting ad-2500 and ad-2600 no pictur...,1,or does it play audio and video but scrolling ...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Customer_review_data,Apex AD2600 Progressive-scan DVD player,troubleshooting ad-2500 and ad-2600 no pictur...,1,before you try to return the player or waste h...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Customer_review_data,Apex AD2600 Progressive-scan DVD player,troubleshooting ad-2500 and ad-2600 no pictur...,1,no picture : \n,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3944,Customer_review_data,Nokia 6610,people with bad audio quality have defective p...,313,i have read a lot of the reviews and my phone ...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3945,Customer_review_data,Nokia 6610,people with bad audio quality have defective p...,313,it is crystal clear . \n,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3946,Customer_review_data,Nokia 6610,people with bad audio quality have defective p...,313,this is one of the nicest phones nokia has mad...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3947,Customer_review_data,Nokia 6610,people with bad audio quality have defective p...,313,i do recommend getting the data kit for those ...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<div class="alert alert-block alert-success">
<b>3) Pre-Processing and Extraction Relevant Information </b> 
</div>

### 3.1 Pre-Processing

Before extracting potential features from sentences, data pre-processed would be the first step. A tokenizer called `my_tokenier` will be created to conduct the following steps:

    1. Tokenization - Split the sentence into tokens
    2. Lemmatization - The main purpose is to get turn the words back to their basic form.
    3. POS Tagging - Only select the tokens if they are nouns and proper nouns. 
 
The tokenizer would tokenize each sentence in the global data frame. In this exercise, only nous tokens would be considered since we assume that all features are nouns.

In the NLTK library, the possible POS tags of nouns are NN, NNS, NNP, and NNPS. As mentioned above, only those tokens with one of these POS tags will be included. [9]

In [6]:
def my_tokenizer(doc):
    lemmatizer = WordNetLemmatizer()
    word_list = nltk.tokenize.word_tokenize(doc)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in word_list if 
                          nltk.pos_tag([word])[0][1] in ['NN', "NNS", "NNP", "NNPS"] and 
                         word.isalpha() and len(word) > 1]
    
    return lemmatized_tokens

def find_features(sentences):
    vectorizer = CountVectorizer(strip_accents='ascii', 
                         stop_words='english', ngram_range=(1,3), 
                         max_df=0.3, min_df=5, binary=True, tokenizer=my_tokenizer)

    X = vectorizer.fit_transform(sentences)
    features = vectorizer.get_feature_names_out()
    
    return features

def find_cv(sentences):
    vectorizer = CountVectorizer(strip_accents='ascii', 
                         stop_words='english', ngram_range=(1,3), min_df=2, binary=True, tokenizer=my_tokenizer)

    X = vectorizer.fit_transform(sentences)
    Y = vectorizer.get_feature_names_out()
    Z = np.array(Y)
    Z = Z.reshape((1, len(Y)))
    Vec = X.toarray()
    cv_matrix = np.vstack((Z,Vec))
    cv_result= pd.DataFrame(cv_matrix)
    cv_result.columns = cv_result.iloc[0]
    cv_result = cv_result.reindex(cv_result.index.drop(0)).reset_index(drop=True)
    cv_result.columns.name = None
    
    return cv_result

# print(len(features))
# print(features)
# print(X.shape)

### 3.2 Frequent Features Extraction (Using Apriori Algorithm and Compactness Pruning)

The next step is to identify frequent features across all products with noun tokens. To achieve this target, we would adopt Compactness Pruning and Apriori Algorithm proposed by Hu and Liu (2004).

---

#### 3.2.1 Compactness Pruing 
The Compactness Pruning had been included before we filtered frequent features via Apriori Algorithm. Two important concepts were introduced by Hu and Liu (2004) on Compactness Pruning:

##### Concept 1) Uni-grams, Bi-grams, and Tri-grams. 
"Assuming f is a frequent feature in a sentence. f is compact in s if the length of f is less than 3." (Hu and Liu, 2004). In this exercise, we would adopt Uni-grams, Bi-grams, and Tri-grams for this concept. In Sklearn/CountVectorizer, an in-build hyper-parameter, `ngram_range`, can easily tokenize the sentence in N-grams. The value of ngram_range will be set as (1,3) to generate Uni-grams, Bi-grams, and Tri-grams.

##### Concept 2)  Minimum Frequency of a Noun in the Corpus
Some nouns are not genuine features, such as company names and personal names, which generally start with a capital letter in English. In this assignment, Hu and Liu (2004) recommended that nouns and noun phrases that occur less than two times should be ignored.

In CountVectorizer, the `min_df` hyper-parameter is to determine the minimum frequency when creating a sparse matrix. In this case, the `min_df` would be two.
 
---

#### 3.2.2 Apriori Algorithm

There are two major steps in Apriori Algorithm (Guo, Wang and Li, 2017), In this exercise, only step no.1 is required as the objective is to find out frequent features (i.e. frequent items set in Apriori Algorithm). 

##### Step 1) Identify all the frequent item sets from the data  sources. In this exercise, we define frequent item sets as frequent features.  
    
##### Step 2) Generate associate rules from frequent item sets. However, this step is not needed in this pipeline to find frequent features.

Hu and Liu (2004) proposed to set the minimum support value as 1%, which we would follow the same approach. On the other hand, we will create a binary matrix using `find_cv` to indicate the features per sentence. Since we adopted Countvectorizer, we can utilize the inbuild function, such as n-grams and min_df.

After loading the binary matrix into Apriori Algorithm, frequent feature sets with support levels higher than 0.01 will be generated. 

In [7]:
def apply_apriori(df):
    cv = find_cv(df)
    frequent_items = apriori(cv, min_support =0.01, use_colnames =True, max_len=1)
    return frequent_items

def Process_apriori(df):
    COLUMN_NAMES = ['support', 'itemsets', 'Product Name']
    apriori_result = pd.DataFrame(columns=COLUMN_NAMES) 
    for x in range(0,len(df.iloc[:, 1].unique())):
        df_test = df[df['Product Name'] == df.iloc[:, 1].unique()[x]]
        apriori_interim = apply_apriori(df_test['Sentence'])
        apriori_interim['Product Name'] = df.iloc[:, 1].unique()[x]
        apriori_result = pd.concat([apriori_result, apriori_interim])
    
    return apriori_result

#### Result of Frequent features ( by Compactness Pruning and Apriori Algorithm)

For the frequent features, we listed them out at the product level. The assumption is that each product (i.e. each txt file) contains its frequent features, which are independent of the other products.


In [8]:
Frequent_Features = Process_apriori(df_Global)

itemsets_text = []
for y in range(0,len(Frequent_Features)):
    x = list(Frequent_Features.iloc[y,1])
    x = listToStr = ' '.join(map(str, x))
    itemsets_text.append(x)

Frequent_Features['itemsets_text'] = itemsets_text
Frequent_Features



Unnamed: 0,support,itemsets,Product Name,itemsets_text
0,0.071563,(acer),Computer,acer
1,0.015066,(amazon),Computer,amazon
2,0.015066,(angle),Computer,angle
3,0.020716,(apple),Computer,apple
4,0.015066,(beat),Computer,beat
...,...,...,...,...
103,0.059585,(work),norton,work
104,0.028497,(xp),norton,xp
105,0.025907,(year),norton,year
106,0.010363,(zone),norton,zone


---
### 3.3 - Redundancy Pruning on Frequent Features

In section 3.2, a list of frequent features was identified via Apriori Algorithm and Compactness Pruning. Furthermore, Hu and Lui 2004 mentioned another concept called Redundancy Pruning,


#### Redundancy Pruning
Redundancy Pruning is a method to find out which uni-grams are redundant features. Those uni-grams that exist in Bi-grams and Tri-grams may be meaningless as uni-gram. Below is an example:


1) Sentence 1: This hair dryer is very useful. 

2) Sentence 2: I usually buy the Shampoo from this supermarket to wash my hair. 

In the above case, hair is a redundancy feature as "hair dryer" is the real feature. In this exercise, for those Uni-grams that overlap with Bi-grams and Tri-grams, we would consider those that exist more than twice in the corpus. (i.e. `P_count` >2 )


In [9]:
def contains(sentence, ngrams):
    for ngram in ngrams:
        if ngram in sentence:
            return True
    return False

def p_support(sentences, unigram, ngrams):
    p_count=0
    pos = ['']
    # iterate through sentences
    for sentence in sentences:
        # if unigram in sentence and not any ngrams in sentence
        if nltk.pos_tag([unigram])[0][1]  in ['NN', "NNS", "NNP", "NNPS"] and unigram in sentence and not contains(sentence, ngrams):
            p_count +=1 # check the POS if noun: p_count+=1
    return p_count

def pruned_unigrams(features,sentences):
    # list of unigrams only
    unigrams = []
    for feature in features:
        if len(feature.split()) < 2:
            unigrams.append(feature)

    # list of bigrams-trigrams only
    ngrams = []
    for feature in features:
        if len(feature.split()) > 1:
            ngrams.append(feature)

    # iterate through unigrams ( for unigram in unigrams)
    # generate a new n-gram list that only contains ngrams including the current unigram
    pruned_unigrams=[]
    for unigram in tqdm(unigrams):
        current_ngrams=[]
        for ngram in ngrams:
            if unigram in ngram:
                current_ngrams.append(ngram)
        p_count = p_support(sentences, unigram, current_ngrams)
        if p_count <=3 and len(current_ngrams)>0:
            pruned_unigrams.append(unigram)
            
    return pruned_unigrams


#### 3.3.1 Result - Redundancy Feaures of each Product in the Global Data Frame

For the redundancy features, we would remove them from our frequent feature list. They would not be considered in our pipeline anymore.

The redundancy features are in a dictionary format. We listed out the redundancy features on the product level.


In [10]:
def Redundancy_Pruning_Product(df1,df2):
    result = {}
    
    for x in range(0,len(df1.iloc[:, 1].unique())):
        df_test = df1[df1['Product Name'] == df1.iloc[:, 1].unique()[x]]
        Frequent_Features_test = df2[df2['Product Name'] == df2.iloc[:, 2].unique()[x]]
        text = pruned_unigrams(Frequent_Features_test['itemsets_text'], df_test['Sentence'])
        result[df1.iloc[:, 1].unique()[x]] = text
        
    return result
    
Pruned = Redundancy_Pruning_Product(df_Global, Frequent_Features)
Pruned

100%|██████████████████████████████████████████████████████████████████████████████████| 85/85 [00:16<00:00,  5.28it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 96/96 [00:27<00:00,  3.45it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 83/83 [00:19<00:00,  4.27it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 70/70 [00:23<00:00,  2.95it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 94/94 [00:20<00:00,  4.66it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 78/78 [00:43<00:00,  1.78it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 94/94 [00:12<00:00,  7.53it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 84/84 [00:15<00:00,  5.38it/s]
100%|███████████████████████████████████

{'Computer': ['lcd'],
 'Router': ['primary'],
 'Speaker': ['theater'],
 'Apex AD2600 Progressive-scan DVD player': ['control'],
 'Canon G3': ['life', 'screen'],
 'Creative Labs Nomad Jukebox Zen Xtra 40GB': ['life'],
 'Nikon coolpix 4300': ['cf', 'life', 'memory', 'range', 'reader'],
 'Nokia 6610': ['fm', 'life'],
 'Canon PowerShot SD500': ['clip', 'digital', 'lcd', 'sd', 'zoom'],
 'Canon S100': ['canon', 'cf', 'elph', 'photography'],
 'Diaper Champ': ['cloth', 'dirty', 'garbage', 'genie', 'plastic', 'tall'],
 'Hitachi router': ['adjustment', 'control', 'fine', 'height', 'inch', 'knob'],
 'ipod': [],
 'Linksys Router': ['update'],
 'MicroMP3': ['fm'],
 'Nokia 6600': ['life', 'mmc'],
 'norton': ['alarm', 'norton', 'systemworks', 'zone']}

#### 3.3.2 Revised Frequent Features after Redundancy  Redundancy Pruning

In [11]:
Frequent_Features_copy = Frequent_Features.copy()
Frequent_Features_copy.reset_index(drop=True,inplace=True)

Pruned_lst = []
for k,v in Pruned.items():
    for val in v:
        Pruned_lst.append((k,val))
        
def check(pname,item):
    for indx,row in Frequent_Features_copy.iterrows():
        if row['Product Name']==pname and row['itemsets_text']==item:
            return indx
    return None

indx_drop = []
for pname,item in Pruned_lst:
    index = check(pname,item)
    if index != None:
        indx_drop.append(index)
        
Reduced_Features = Frequent_Features_copy.drop(indx_drop,axis=0)
Reduced_Features.reset_index(drop=True,inplace=True)
Reduced_Features

Unnamed: 0,support,itemsets,Product Name,itemsets_text
0,0.071563,(acer),Computer,acer
1,0.015066,(amazon),Computer,amazon
2,0.015066,(angle),Computer,angle
3,0.020716,(apple),Computer,apple
4,0.015066,(beat),Computer,beat
...,...,...,...,...
1632,0.012953,(window xp),norton,window xp
1633,0.059585,(work),norton,work
1634,0.028497,(xp),norton,xp
1635,0.025907,(year),norton,year


---
### 3.4 Evaluation Framework on Frequent Features

#### 3.4.1 Precision
<img src="Pictures/Precision.jpg" style="width: 25%;"/>

In our case, the precision equals one means there is no false positive feature. The reason is that all ground truths are nouns and we only considered noun and noun phase in our pre-processing step. As a result, we need to look closer at Recall to give us more insight into the accuracy of frequent features extraction.


#### 3.4.2 Recall
<img src="Pictures/Recall.jpg" style="width: 18%;"/>

In this exercise, the recall is also low. One of the potential reasons is that only frequent features are included in the features extraction step. Thus, there are a lot of false negative cases for those non-frequent features. In addition, another possible reason is that we only consider Uni-grams, Bigrams, and Tri-grams. In reality, it is possible to have features with more than three words.

To calculate the matrices, we took references from (precision, Kumar, and Kalyanarangan, 2022). 

In [12]:
def evaluate_features(actual, extracted):
    mlb = MultiLabelBinarizer()

    list_1 = [extracted.tolist()]
    lemmatizer = WordNetLemmatizer()
    lemmatized_actual = [[lemmatizer.lemmatize(x).lower() for x in actual]]

    A_new = mlb.fit(list_1).transform(list_1)
    B_new = mlb.transform(lemmatized_actual)

    precision = precision_score(A_new,B_new,average='samples')
    recall = recall_score(A_new, B_new, average='samples')
    result = [recall,precision]
    
    return result

def check_product_feature(k):
    for z in k:
        if int(z) !=0:
            return True
            break
        else:
            continue

    return False 

def find_ground_truth(df):
    grond_truth = []
    for col_name in df.columns:
        if check_product_feature(df[col_name]):
            grond_truth.append(col_name.lower())
    
    return grond_truth

In [13]:
product_list = df_Global.iloc[:, 1].unique()
final_result = []

for product in product_list:
    x = df_Global.loc[df_Global['Product Name'] == product]
    y= Reduced_Features.loc[Reduced_Features['Product Name'] == product]
    p = find_ground_truth(x.iloc[: , 5:])
    final_result.append([product,evaluate_features(p,y['itemsets_text'])])







#### Evaluation Result [Recall, Precision]

In [14]:
final_result

[['Computer', [0.313953488372093, 1.0]],
 ['Router', [0.2980769230769231, 1.0]],
 ['Speaker', [0.313953488372093, 1.0]],
 ['Apex AD2600 Progressive-scan DVD player', [0.45569620253164556, 1.0]],
 ['Canon G3', [0.3069306930693069, 1.0]],
 ['Creative Labs Nomad Jukebox Zen Xtra 40GB', [0.5125, 1.0]],
 ['Nikon coolpix 4300', [0.23809523809523808, 1.0]],
 ['Nokia 6610', [0.4111111111111111, 1.0]],
 ['Canon PowerShot SD500', [0.20869565217391303, 1.0]],
 ['Canon S100', [0.30327868852459017, 1.0]],
 ['Diaper Champ', [0.24242424242424243, 1.0]],
 ['Hitachi router', [0.3157894736842105, 1.0]],
 ['ipod', [0.28, 1.0]],
 ['Linksys Router', [0.2571428571428571, 1.0]],
 ['MicroMP3', [0.4090909090909091, 1.0]],
 ['Nokia 6600', [0.41237113402061853, 1.0]],
 ['norton', [0.2692307692307692, 1.0]]]

---

### 3.5 Opinion Mining - Using the near adjectives of frequent feautres

After finishing the frequent feature extraction, the next step is to predict the sentiment of each sentence. Hu and Liu (2004) pointed out that opinion words are usually adjectives next to the frequent features.

To mirror the approach of Hui and Liu (2004), following steps would be taken to extract the opinion for each sentence.

1) Indicate the frequent features in each sentence. In section 3.3.2, frequent features were already identified. We would review each sentence to check if it contains frequent features.


2) With frequent features, we will find out the nearest adjectives of each frequent feature. These adjectives are opinion words, highlighted by Hu and Liu. In this section, we would use `find_adj` function to identify adjectives. 


After that, we would predict the sentiment in section 4, using these adjectives/opinion words.

In [15]:
def check_freq(sentence,df_reduce):
    Freq_Feature = []
    Reduced1 = df_reduce['itemsets_text']
    Reduced1.tolist()
    sentence_token = nltk.tokenize.word_tokenize(sentence)
    
    for r in  Reduced1:
         if r in sentence_token:
            Freq_Feature.append(r)
            
    return Freq_Feature
 
def find_frequent_features(df,df_reduce):
    freq = []
    
    for sentence in df['Sentence']:
        in4 = check_freq(sentence,df_reduce)
        freq.append(in4)
        
    df['Frequent Features'] = freq
    
    return df

Global_new3 = pd.DataFrame(columns=df_Global.columns)
for product in df_Global['Product Name'].unique():
    Global = df_Global.loc[df_Global['Product Name'] == product]
    Reduced = Reduced_Features.loc[Reduced_Features['Product Name'] == product]
    Global_new2 = find_frequent_features(Global,Reduced)
    Global_new3 = pd.concat([Global_new3, Global_new2])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Frequent Features'] = freq
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Frequent Features'] = freq
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Frequent Features'] = freq
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = val

#### 3.5.1 List out all frequent features 

In [17]:
Global_new3[['Product Name','Sentence','Frequent Features']]

Unnamed: 0,Product Name,Sentence,Frequent Features
0,Computer,I purchased this monitor because of budgetary...,[monitor]
1,Computer,This item was the most inexpensive 17 inch mo...,"[item, monitor, purchase, time]"
2,Computer,My overall experience with this monitor was v...,"[experience, monitor]"
3,Computer,When the screen was n't contracting or glitch...,"[picture, quality, screen]"
4,Computer,I 've viewed numerous different monitor model...,"[monitor, picture, quality]"
...,...,...,...
10327,norton,"Finally I ran msconfig and went into the ""serv...","[service, uninstall]"
10328,norton,"This time, it started the uninstallation proce...","[left, process, think, time, work]"
10329,norton,"I simply hate Symantec. I swear, if I could ha...","[computer, software]"
10330,norton,I have been a loyal Norton Anti-Virus and Inte...,"[software, user]"


#### 3.5.2 List out nearest adjects for each feature

In [18]:
def find_adj(Freq_index,sent):
    global adj
    global back

#Find the adjectives before and after the features
    for i in range(Freq_index, len(sent)):
        if nltk.pos_tag([sent[i]])[0][1] in ['JJ', 'JJR', 'JJS']:
            front = i
            break
        else:
            front = False
    
    if Freq_index != 0:
        for i in reversed(range(Freq_index)):    
            if nltk.pos_tag([sent[i]])[0][1] in ['JJ', 'JJR', 'JJS']:
                back = i
                break
            else:
                back = False
    else:
        back = False

#Check if the adjective occur before or after the feature 
    if front == False and back == False:
        return False
    
    if front == False and back != False:
        adj = back
        return sent[adj] 

    if front != False and back == False:
        adj = front
        return sent[adj] 
    
    if front != False and back != False:   
        if front - Freq_index > Freq_index - back:
            adj = back
            return sent[adj] 

        if front - Freq_index < Freq_index - back:
            adj = front
            return sent[adj] 
        
        if front - Freq_index == Freq_index - back:
            adj = front
            return sent[adj] 
        
def list_adj(df):
    Near_Adj = []

    for index, row in tqdm(Global_new3.iterrows()):
        adj_list = []
        sent_1 = nltk.tokenize.word_tokenize(row['Sentence'])
        list_1 = []
    
        for x in row["Frequent Features"]:
            Freq_index = sent_1.index(x)
        
            if not find_adj(Freq_index,sent_1):
                continue 
            
            adj = find_adj(Freq_index,sent_1)
            list_1.append(adj)
        
        Near_Adj.append(list_1)

    return Near_Adj

In [19]:
Global_new3['Nearest Adjective'] = list_adj(Global_new3)
Global_new3[['Product Name','Sentence','Frequent Features', 'Nearest Adjective']]

10332it [02:35, 66.56it/s]


Unnamed: 0,Product Name,Sentence,Frequent Features,Nearest Adjective
0,Computer,I purchased this monitor because of budgetary...,[monitor],[budgetary]
1,Computer,This item was the most inexpensive 17 inch mo...,"[item, monitor, purchase, time]","[most, available, available, available]"
2,Computer,My overall experience with this monitor was v...,"[experience, monitor]","[overall, poor]"
3,Computer,When the screen was n't contracting or glitch...,"[picture, quality, screen]","[overall, poor, overall]"
4,Computer,I 've viewed numerous different monitor model...,"[monitor, picture, quality]","[different, poor, poor]"
...,...,...,...,...
10327,norton,"Finally I ran msconfig and went into the ""serv...","[service, uninstall]","[last, last]"
10328,norton,"This time, it started the uninstallation proce...","[left, process, think, time, work]",[]
10329,norton,"I simply hate Symantec. I swear, if I could ha...","[computer, software]",[]
10330,norton,I have been a loyal Norton Anti-Virus and Inte...,"[software, user]",[]


---
<div class="alert alert-block alert-success">
<b>4) Apply a relevant algorithm - Sentiment Analysis </b> 
</div>

#### With the opinion words/adjectives extracted in section 3.5, the next step is to predict the sentiment. 


### 4.1 Sentiement Aanlysis Model 1 - TextBlob
The first algorithm to analyze sentence sentiment is TextBlob, a well-known algorithm built on top of the NLTK library. With TextBlob, it can access the lexical resources in NLTK and compute the polarity score between -1 and 1. (My Absolute Go-To for Sentiment Analysis — TextBlob., 2022). For a positive sentiment, the polarity score would be close to 1 and vice versa.

The major limitation of TextBlob is that it may not effectively detect sentences containing connotation, sarcasm, and irony. (Saura, Palacios-Marqués and Ribeiro-Soriano, 2021) Nevertheless, it should not significantly impact our result since customers usually provide feedback instead of hiding them with sarcasm.


### 4.2 Sentiment Analysis Model 2 - Vader 

The second algorithm for sentiment analysis is Valence Aware Dictionary for Sentiment Reasoning (Vader). It was designed for sentiment ratings from various English phrases from social media platforms. (Lundqvist, Liyanagunawardena and Starkey, 2020). The advantage of Vader is that it does not require training data, which allows us to import a new set of data without ground truth values.


Detailed information for Vader can be found in [1]. In this exercise, the compound score will be considered from the algorithm. 

Vader algorithm would generate three results, positive polarity, negative polarity, neutral polarity, and the compound score. Unlike TextBlob, it is expected that Vader will be more comprehensive as it considers both positive and negative sentiment to compute the final compound score. For  this exercise, we will use compound score from Vader to determine the sentiment. 


---

In [20]:
sid = SentimentIntensityAnalyzer()

#Find Sentiment Score - Vader 
def vader_sentiment(df):
    final_sent = []
    sent_score_vs = []

    for index, row in df.iterrows():
        adjectives = row['Nearest Adjective']
        adjectives = ' '.join(adjectives)
        sentiment = sid.polarity_scores(adjectives)
        sentiment_score = sentiment.get('compound')
        final_sent.append(sentiment)
        sent_score_vs.append(sentiment_score)
        
    return [final_sent, sent_score_vs]

#Find Sentiment Score - TextBlob
def textblob_sentiment(df):
    sent_score_tb = []

    for index, row in df.iterrows():
        adjectives = row['Nearest Adjective']
        adjectives = ' '.join(adjectives)
        analysis = TextBlob(adjectives)
        sentiment_tb = (analysis.sentiment)
        sent_score_tb.append(sentiment_tb.polarity)
        
    return sent_score_tb

---

#### 4.2.1 Result - Vader and TextBlob

The predicted sentiment of Vader can be found on "Predicted_Sentiment_vs". Meanwhile, the predicted sentiment of TextBlob can be found on "Predicted_Sentiment_tb".

We created dummy variables to represent the sentiment. 

1) Positive Sentiment, where x = 1

2) Netural Sentimennt, where x = 0

3) Negative Sentiment, where x = -1

In [21]:
Global_new3['Predicted Sentiment'] = vader_sentiment(Global_new3)[0]
Global_new3['Sentiment Score_vs'] = vader_sentiment(Global_new3)[1]
Global_new3['Sentiment Score_tb'] = textblob_sentiment(Global_new3)

Predicted_Sentiment_vs = []
Predicted_Sentiment_tb = []

#vader 
for index, row in Global_new3.iterrows():
    if row["Sentiment Score_vs"] == 0:
        Predicted_Sentiment_vs.append(int(row["Sentiment Score_vs"]))
        continue

    if row["Sentiment Score_vs"] < 0:
        row["Sentiment Score_vs"] = -1
        Predicted_Sentiment_vs.append(row["Sentiment Score_vs"])
        
    if row["Sentiment Score_vs"] > 0:
        row["Sentiment Score_vs"] = 1
        Predicted_Sentiment_vs.append(row["Sentiment Score_vs"])
        
        
#Text Blob
for index, row in Global_new3.iterrows():
    if row["Sentiment Score_tb"] == 0:
        Predicted_Sentiment_tb.append(int(row["Sentiment Score_tb"]))
        continue

    if row["Sentiment Score_tb"] < 0:
        row["Sentiment Score_tb"] = -1
        Predicted_Sentiment_tb.append(row["Sentiment Score_tb"])

    if row["Sentiment Score_tb"] > 0:
        row["Sentiment Score_tb"] = 1
        Predicted_Sentiment_tb.append(row["Sentiment Score_tb"])

        
Global_new3['Predicted_Sentiment_vs'] = Predicted_Sentiment_vs
Global_new3['Predicted_Sentiment_tb'] = Predicted_Sentiment_tb

Global_new3[['Product Name','Sentence','Frequent Features', 'Nearest Adjective', 'Predicted Sentiment', 
             'Sentiment Score_vs', 'Sentiment Score_tb',
             'Predicted_Sentiment_vs', 'Predicted_Sentiment_tb']]

Unnamed: 0,Product Name,Sentence,Frequent Features,Nearest Adjective,Predicted Sentiment,Sentiment Score_vs,Sentiment Score_tb,Predicted_Sentiment_vs,Predicted_Sentiment_tb
0,Computer,I purchased this monitor because of budgetary...,[monitor],[budgetary],"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0000,0.000000,0,0
1,Computer,This item was the most inexpensive 17 inch mo...,"[item, monitor, purchase, time]","[most, available, available, available]","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0000,0.425000,0,1
2,Computer,My overall experience with this monitor was v...,"[experience, monitor]","[overall, poor]","{'neg': 0.756, 'neu': 0.244, 'pos': 0.0, 'comp...",-0.4767,-0.200000,-1,-1
3,Computer,When the screen was n't contracting or glitch...,"[picture, quality, screen]","[overall, poor, overall]","{'neg': 0.608, 'neu': 0.392, 'pos': 0.0, 'comp...",-0.4767,-0.133333,-1,-1
4,Computer,I 've viewed numerous different monitor model...,"[monitor, picture, quality]","[different, poor, poor]","{'neg': 0.861, 'neu': 0.139, 'pos': 0.0, 'comp...",-0.7351,-0.266667,-1,-1
...,...,...,...,...,...,...,...,...,...
10327,norton,"Finally I ran msconfig and went into the ""serv...","[service, uninstall]","[last, last]","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0000,0.000000,0,0
10328,norton,"This time, it started the uninstallation proce...","[left, process, think, time, work]",[],"{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound...",0.0000,0.000000,0,0
10329,norton,"I simply hate Symantec. I swear, if I could ha...","[computer, software]",[],"{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound...",0.0000,0.000000,0,0
10330,norton,I have been a loyal Norton Anti-Virus and Inte...,"[software, user]",[],"{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound...",0.0000,0.000000,0,0


---
#### 4.2.2 Add Ground Truth as a Comparasion

Similar to the above data frame, we created dummy variables to represent the sentiment.

1) Positive Sentiment, where x = 1

2) Netural Sentimennt, where x = 0

3) Negative Sentiment, where x = -1

In the text files, there are some sentences containing multiple sentiments. For those cases, we will take the average of the sentiment.


In [22]:
def sentence_sentiment(dataframe):
    sentence_sentiment = dataframe.iloc[:,5:]
    sentence_sentiment = sentence_sentiment.replace(0, np.NaN)
    sentence_sentiment = sentence_sentiment.mean(axis=1).round()
    sentence_sentiment = sentence_sentiment.replace( np.NaN,0)
    sentence_sentiment = sentence_sentiment.astype(int)
    return sentence_sentiment

In [23]:
Global_new3['Predicted_Sentiment_Ground Truth'] = sentence_sentiment(Global_new3.iloc[:, 5:-7])
# Global_new3.iloc[:, 5:-7]
selection=['Product Name','Sentence', 'Sentiment Score_vs', 'Sentiment Score_tb',
             'Predicted_Sentiment_vs', 'Predicted_Sentiment_tb','Predicted_Sentiment_Ground Truth']
Global_new3[selection].head()

Unnamed: 0,Product Name,Sentence,Sentiment Score_vs,Sentiment Score_tb,Predicted_Sentiment_vs,Predicted_Sentiment_tb,Predicted_Sentiment_Ground Truth
0,Computer,I purchased this monitor because of budgetary...,0.0,0.0,0,0,0
1,Computer,This item was the most inexpensive 17 inch mo...,0.0,0.425,0,1,1
2,Computer,My overall experience with this monitor was v...,-0.4767,-0.2,-1,-1,-1
3,Computer,When the screen was n't contracting or glitch...,-0.4767,-0.133333,-1,-1,-1
4,Computer,I 've viewed numerous different monitor model...,-0.7351,-0.266667,-1,-1,-1


---
#### Confusion Matrix - TextBlob

When we compute the confusion matrix, we only consider the sentiment-bearing sentences from the ground truth so that we can precisely evaluate the ability to predict sentiment direction. We will also apply the same idea to evaluate Vader. 

For TextBlob, more than 70% of sentiment directions are correctly predicted, which is a promising result.


In [24]:
df_eval = Global_new3.drop(['Product Name', 'Sentence'], axis=1)[selection[2:]]
df_eval_nz=df_eval.loc[~(df_eval['Predicted_Sentiment_Ground Truth']==0)]


df_tb=df_eval_nz[['Predicted_Sentiment_tb', 'Predicted_Sentiment_Ground Truth']]
df_vs=df_eval_nz[['Predicted_Sentiment_vs', 'Predicted_Sentiment_Ground Truth']]

df_tb = df_tb.loc[~(df_tb['Predicted_Sentiment_tb']==0)]
df_vs = df_vs.loc[~(df_vs['Predicted_Sentiment_vs']==0)]

y_true = df_tb['Predicted_Sentiment_Ground Truth']
y_pred = df_tb['Predicted_Sentiment_tb']
cf_matrix_tb = confusion_matrix(y_true, y_pred)


print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

          -1       0.53      0.45      0.49       560
           1       0.79      0.83      0.81      1354

    accuracy                           0.72      1914
   macro avg       0.66      0.64      0.65      1914
weighted avg       0.71      0.72      0.71      1914



---
#### Confusion Matrix - Vader

For Vader, more than 80% of sentiment directions are correctly predicted, which is a promising result. One of the possible reasons is that it considers both negative and positive sentiment in the compound score. Therefore, it is more comprehensive than TextBlob. 


In [25]:
y_true_1 = df_vs['Predicted_Sentiment_Ground Truth']
y_pred_1 = df_vs['Predicted_Sentiment_vs']
cf_matrix_vs = confusion_matrix(y_true_1, y_pred_1)


print(classification_report(y_true_1, y_pred_1))

              precision    recall  f1-score   support

          -1       0.69      0.50      0.58       300
           1       0.86      0.93      0.89       975

    accuracy                           0.83      1275
   macro avg       0.77      0.72      0.74      1275
weighted avg       0.82      0.83      0.82      1275



----

### 4.3 Logistic Regression (with TfidfVectorizer)

As a next step, the comparison experiment is a next step to review the effectiveness of the sentiment analyzer based on TextBlob and Vader. In this exercise, two machine learning methods, Logistic Regression and Support Vector Matrix (SVM) would be implemented as a comparison.

 
Instead of the normal CountVecotrizer, TF-IDF would be adopted in the both Logistic Regression and SVM. (Piryani et al., 2020). The benefit of TF-IDF is that it reduces the TF-IDF values for common words with the IDF denominator. It should increase the accuracy in our machine learning model. 


Remark: Sentitment without ground truth sentitment would be excluded fromt the machine learning, which would reduce the noise during the learning process. 

In [26]:
df_Global['Predicted_Sentiment_Ground Truth'] = sentence_sentiment(df_Global.iloc[:, 5:])
df_Global_nz = df_Global.loc[~(df_Global['Predicted_Sentiment_Ground Truth']==0)]

df = df_Global_nz

sentences = df_Global_nz["Sentence"]
my_dictionary=find_features(sentences)
vectorizer_1 = TfidfVectorizer(strip_accents='ascii',stop_words='english', ngram_range=(1,3), 
                         max_df=0.3, min_df=3,tokenizer=my_tokenizer, vocabulary=my_dictionary)

# create labels y
# create features X

y = df_Global_nz['Predicted_Sentiment_Ground Truth']
X = vectorizer_1.fit_transform(sentences)

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


#fit to alg
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

# test 
print(clf.score(X_test, y_test))

0.7348066298342542


#### Confusion Matrix - Logistic Regression

The accuary is 73% to determine positve and negative sentiment. 

In [27]:
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

          -1       0.69      0.44      0.53       316
           1       0.75      0.89      0.81       589

    accuracy                           0.73       905
   macro avg       0.72      0.67      0.67       905
weighted avg       0.73      0.73      0.72       905



----
### 4.4 Support Vector Machine (with linear kernel)

Apart from Logistic Regression, Support Vector Machine (SVM) would be implemented as a second comparison. Same as section 4.3, TF-IDF vectorize was chosen. (Piryani et al., 2020)
Since our data is binary data (positive and negative sentiment), we would adopt a linear kernel as our kernel function as our classification method.

Same as Logistic Regression, sentiment without ground truth sentiment would be excluded from the machine learning, which would reduce the noise during the learning process.


In [28]:
model_svm = svm.SVC(C=8.0, kernel='linear')
clf1 = model_svm.fit(X_train, y_train)
print(clf1.score(X_test, y_test))

0.7160220994475138


#### Confusion Matrix - SVM

The accuary is 71% to determine positve and negative sentiment.

In [29]:
predictions_SVM = clf1.predict(X_test)
print(classification_report(y_test, predictions_SVM))

              precision    recall  f1-score   support

          -1       0.60      0.54      0.57       316
           1       0.77      0.81      0.79       589

    accuracy                           0.72       905
   macro avg       0.69      0.67      0.68       905
weighted avg       0.71      0.72      0.71       905



<div class="alert alert-block alert-success">
<b>5) Conclusion of This Exercise: </b> 
</div>


Across these four methods, Vader has the best performance. One of the possible reasons is that it considers both negative and positive sentiment before determining the final sentiment. It is more comprehensive than Machine Learning (Logistic Regression and SVM).


As mentioned above, the benefit of TF-IDF is that it reduces the TF-IDF values for common words with the IDF denominator. However, it still cannot outperform the Vader in our case.


Another finding is that the F1-score is higher for positive sentiment acorss four methods. It means that the sentiment analysis performs better in positive sentiment. Unfortunately, we cannot amend the ground truth. Otherwise, we should have balanced ground truth data for this exercise.


<div class="alert alert-block alert-success">
<b>6) Report evaluation results: </b> 
</div>


Below is the summary of our opinion minner. Imagine this exercise is a opinion mining system, the summaries are the output of the system.

### 6.1 Summary of Vader Opinion Minner

In [30]:
Final_Output = Global_new3[['Product Name', 'Frequent Features','Predicted_Sentiment_vs']]
Final_Output_nz=Final_Output.loc[~(Final_Output['Predicted_Sentiment_vs']==0)]

Final_Output_nz = Final_Output_nz.explode('Frequent Features')
Final_Output_nz['Positives'] = Final_Output_nz.apply(lambda row: row.Predicted_Sentiment_vs > 0, axis=1)
Final_Output_nz['Negatives'] =  Final_Output_nz.apply(lambda row: row.Predicted_Sentiment_vs < 0, axis=1)

answer = Final_Output_nz[['Product Name','Frequent Features', 'Positives', 'Negatives']].groupby(
    ['Product Name', 'Frequent Features']).agg(['sum'])

for name, group in answer.iterrows():
    print(name)
    print(group)
    print(" ")
    print('------------------')


('Apex AD2600 Progressive-scan DVD player', 'amazon')
Positives  sum    9
Negatives  sum    1
Name: (Apex AD2600 Progressive-scan DVD player, amazon), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'apex')
Positives  sum    14
Negatives  sum     3
Name: (Apex AD2600 Progressive-scan DVD player, apex), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'audio')
Positives  sum    1
Negatives  sum    0
Name: (Apex AD2600 Progressive-scan DVD player, audio), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'bought')
Positives  sum    3
Negatives  sum    2
Name: (Apex AD2600 Progressive-scan DVD player, bought), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'brand')
Positives  sum    1
Negatives  sum    4
Name: (Apex AD2600 Progressive-scan DVD player, brand), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'button')
Positives  sum    3
Negativ

 
------------------
('Canon G3', 'speed')
Positives  sum    2
Negatives  sum    0
Name: (Canon G3, speed), dtype: int64
 
------------------
('Canon G3', 'store')
Positives  sum    1
Negatives  sum    0
Name: (Canon G3, store), dtype: int64
 
------------------
('Canon G3', 'thing')
Positives  sum    4
Negatives  sum    1
Name: (Canon G3, thing), dtype: int64
 
------------------
('Canon G3', 'think')
Positives  sum    1
Negatives  sum    0
Name: (Canon G3, think), dtype: int64
 
------------------
('Canon G3', 'time')
Positives  sum    3
Negatives  sum    0
Name: (Canon G3, time), dtype: int64
 
------------------
('Canon G3', 'use')
Positives  sum    26
Negatives  sum     0
Name: (Canon G3, use), dtype: int64
 
------------------
('Canon G3', 'view')
Positives  sum    1
Negatives  sum    0
Name: (Canon G3, view), dtype: int64
 
------------------
('Canon G3', 'viewfinder')
Positives  sum    1
Negatives  sum    2
Name: (Canon G3, viewfinder), dtype: int64
 
------------------
('Canon

('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'price')
Positives  sum    20
Negatives  sum     5
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, price), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'problem')
Positives  sum    6
Negatives  sum    1
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, problem), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'product')
Positives  sum    18
Negatives  sum     2
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, product), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'quality')
Positives  sum    32
Negatives  sum     7
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, quality), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'read')
Positives  sum    9
Negatives  sum    2
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, read), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 're

Positives  sum    6
Negatives  sum    0
Name: (MicroMP3, audio), dtype: int64
 
------------------
('MicroMP3', 'battery')
Positives  sum    12
Negatives  sum     2
Name: (MicroMP3, battery), dtype: int64
 
------------------
('MicroMP3', 'bit')
Positives  sum    6
Negatives  sum    1
Name: (MicroMP3, bit), dtype: int64
 
------------------
('MicroMP3', 'blue')
Positives  sum    2
Negatives  sum    1
Name: (MicroMP3, blue), dtype: int64
 
------------------
('MicroMP3', 'bought')
Positives  sum    3
Negatives  sum    0
Name: (MicroMP3, bought), dtype: int64
 
------------------
('MicroMP3', 'button')
Positives  sum    1
Negatives  sum    1
Name: (MicroMP3, button), dtype: int64
 
------------------
('MicroMP3', 'cable')
Positives  sum    2
Negatives  sum    0
Name: (MicroMP3, cable), dtype: int64
 
------------------
('MicroMP3', 'case')
Positives  sum    3
Negatives  sum    0
Name: (MicroMP3, case), dtype: int64
 
------------------
('MicroMP3', 'clip')
Positives  sum    3
Negatives  

('Nokia 6610', 'battery')
Positives  sum    4
Negatives  sum    1
Name: (Nokia 6610, battery), dtype: int64
 
------------------
('Nokia 6610', 'bluetooth')
Positives  sum    1
Negatives  sum    0
Name: (Nokia 6610, bluetooth), dtype: int64
 
------------------
('Nokia 6610', 'bought')
Positives  sum    1
Negatives  sum    0
Name: (Nokia 6610, bought), dtype: int64
 
------------------
('Nokia 6610', 'camera')
Positives  sum    3
Negatives  sum    0
Name: (Nokia 6610, camera), dtype: int64
 
------------------
('Nokia 6610', 'cell')
Positives  sum    3
Negatives  sum    1
Name: (Nokia 6610, cell), dtype: int64
 
------------------
('Nokia 6610', 'color')
Positives  sum    4
Negatives  sum    0
Name: (Nokia 6610, color), dtype: int64
 
------------------
('Nokia 6610', 'cool')
Positives  sum    3
Negatives  sum    0
Name: (Nokia 6610, cool), dtype: int64
 
------------------
('Nokia 6610', 'customer')
Positives  sum    2
Negatives  sum    1
Name: (Nokia 6610, customer), dtype: int64
 
-

------------------
('ipod', 'case')
Positives  sum    3
Negatives  sum    0
Name: (ipod, case), dtype: int64
 
------------------
('ipod', 'change')
Positives  sum    0
Negatives  sum    1
Name: (ipod, change), dtype: int64
 
------------------
('ipod', 'charge')
Positives  sum    1
Negatives  sum    0
Name: (ipod, charge), dtype: int64
 
------------------
('ipod', 'click')
Positives  sum    1
Negatives  sum    0
Name: (ipod, click), dtype: int64
 
------------------
('ipod', 'computer')
Positives  sum    1
Negatives  sum    2
Name: (ipod, computer), dtype: int64
 
------------------
('ipod', 'day')
Positives  sum    2
Negatives  sum    0
Name: (ipod, day), dtype: int64
 
------------------
('ipod', 'design')
Positives  sum    6
Negatives  sum    0
Name: (ipod, design), dtype: int64
 
------------------
('ipod', 'device')
Positives  sum    1
Negatives  sum    1
Name: (ipod, device), dtype: int64
 
------------------
('ipod', 'drive')
Positives  sum    0
Negatives  sum    5
Name: (ipod

### 6.2 Summary of TextBlob Opinion Minner

In [31]:
Final_Output1 = Global_new3[['Product Name', 'Frequent Features','Predicted_Sentiment_tb']]
Final_Output1_nz=Final_Output1.loc[~(Final_Output['Predicted_Sentiment_vs']==0)]

Final_Output1_nz = Final_Output1_nz.explode('Frequent Features')
Final_Output1_nz['Positives'] = Final_Output1_nz.apply(lambda row: row.Predicted_Sentiment_tb > 0, axis=1)
Final_Output1_nz['Negatives'] =  Final_Output1_nz.apply(lambda row: row.Predicted_Sentiment_tb < 0, axis=1)

answer1 = Final_Output1_nz[['Product Name','Frequent Features', 'Positives', 'Negatives']].groupby(
    ['Product Name', 'Frequent Features']).agg(['sum'])

for name, group in answer1.iterrows():
    print(name)
    print(group)
    print(" ")
    print('------------------')


('Apex AD2600 Progressive-scan DVD player', 'amazon')
Positives  sum    9
Negatives  sum    1
Name: (Apex AD2600 Progressive-scan DVD player, amazon), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'apex')
Positives  sum    14
Negatives  sum     3
Name: (Apex AD2600 Progressive-scan DVD player, apex), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'audio')
Positives  sum    1
Negatives  sum    0
Name: (Apex AD2600 Progressive-scan DVD player, audio), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'bought')
Positives  sum    3
Negatives  sum    2
Name: (Apex AD2600 Progressive-scan DVD player, bought), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'brand')
Positives  sum    1
Negatives  sum    2
Name: (Apex AD2600 Progressive-scan DVD player, brand), dtype: int64
 
------------------
('Apex AD2600 Progressive-scan DVD player', 'button')
Positives  sum    2
Negativ

Positives  sum    1
Negatives  sum    0
Name: (Canon S100, body), dtype: int64
 
------------------
('Canon S100', 'bright')
Positives  sum    1
Negatives  sum    0
Name: (Canon S100, bright), dtype: int64
 
------------------
('Canon S100', 'camera')
Positives  sum    20
Negatives  sum     4
Name: (Canon S100, camera), dtype: int64
 
------------------
('Canon S100', 'capture')
Positives  sum    1
Negatives  sum    0
Name: (Canon S100, capture), dtype: int64
 
------------------
('Canon S100', 'card')
Positives  sum    1
Negatives  sum    0
Name: (Canon S100, card), dtype: int64
 
------------------
('Canon S100', 'carry')
Positives  sum    2
Negatives  sum    0
Name: (Canon S100, carry), dtype: int64
 
------------------
('Canon S100', 'case')
Positives  sum    1
Negatives  sum    0
Name: (Canon S100, case), dtype: int64
 
------------------
('Canon S100', 'choice')
Positives  sum    1
Negatives  sum    0
Name: (Canon S100, choice), dtype: int64
 
------------------
('Canon S100', 'c

Positives  sum    0
Negatives  sum    1
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, track), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'transfer')
Positives  sum    9
Negatives  sum    3
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, transfer), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'unit')
Positives  sum    9
Negatives  sum    0
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, unit), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'usb')
Positives  sum    3
Negatives  sum    0
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, usb), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'use')
Positives  sum    37
Negatives  sum     2
Name: (Creative Labs Nomad Jukebox Zen Xtra 40GB, use), dtype: int64
 
------------------
('Creative Labs Nomad Jukebox Zen Xtra 40GB', 'user')
Positives  sum    2
Negatives  sum    2
Name: (Creative Labs Nomad J

Name: (MicroMP3, ability), dtype: int64
 
------------------
('MicroMP3', 'audio')
Positives  sum    6
Negatives  sum    0
Name: (MicroMP3, audio), dtype: int64
 
------------------
('MicroMP3', 'battery')
Positives  sum    12
Negatives  sum     2
Name: (MicroMP3, battery), dtype: int64
 
------------------
('MicroMP3', 'bit')
Positives  sum    6
Negatives  sum    0
Name: (MicroMP3, bit), dtype: int64
 
------------------
('MicroMP3', 'blue')
Positives  sum    2
Negatives  sum    0
Name: (MicroMP3, blue), dtype: int64
 
------------------
('MicroMP3', 'bought')
Positives  sum    3
Negatives  sum    0
Name: (MicroMP3, bought), dtype: int64
 
------------------
('MicroMP3', 'button')
Positives  sum    1
Negatives  sum    0
Name: (MicroMP3, button), dtype: int64
 
------------------
('MicroMP3', 'cable')
Positives  sum    2
Negatives  sum    0
Name: (MicroMP3, cable), dtype: int64
 
------------------
('MicroMP3', 'case')
Positives  sum    3
Negatives  sum    0
Name: (MicroMP3, case), dty

Positives  sum    0
Negatives  sum    2
Name: (Nokia 6610, gsm), dtype: int64
 
------------------
('Nokia 6610', 'headset')
Positives  sum    2
Negatives  sum    0
Name: (Nokia 6610, headset), dtype: int64
 
------------------
('Nokia 6610', 'hear')
Positives  sum    2
Negatives  sum    1
Name: (Nokia 6610, hear), dtype: int64
 
------------------
('Nokia 6610', 'heard')
Positives  sum    1
Negatives  sum    1
Name: (Nokia 6610, heard), dtype: int64
 
------------------
('Nokia 6610', 'internet')
Positives  sum    1
Negatives  sum    1
Name: (Nokia 6610, internet), dtype: int64
 
------------------
('Nokia 6610', 'key')
Positives  sum    1
Negatives  sum    1
Name: (Nokia 6610, key), dtype: int64
 
------------------
('Nokia 6610', 'lack')
Positives  sum    1
Negatives  sum    0
Name: (Nokia 6610, lack), dtype: int64
 
------------------
('Nokia 6610', 'light')
Positives  sum    1
Negatives  sum    0
Name: (Nokia 6610, light), dtype: int64
 
------------------
('Nokia 6610', 'look')
P

Name: (ipod, set), dtype: int64
 
------------------
('ipod', 'software')
Positives  sum    4
Negatives  sum    0
Name: (ipod, software), dtype: int64
 
------------------
('ipod', 'sound')
Positives  sum    6
Negatives  sum    1
Name: (ipod, sound), dtype: int64
 
------------------
('ipod', 'storage')
Positives  sum    3
Negatives  sum    0
Name: (ipod, storage), dtype: int64
 
------------------
('ipod', 'store')
Positives  sum    2
Negatives  sum    1
Name: (ipod, store), dtype: int64
 
------------------
('ipod', 'sure')
Positives  sum    1
Negatives  sum    1
Name: (ipod, sure), dtype: int64
 
------------------
('ipod', 'tell')
Positives  sum    1
Negatives  sum    0
Name: (ipod, tell), dtype: int64
 
------------------
('ipod', 'thing')
Positives  sum    4
Negatives  sum    1
Name: (ipod, thing), dtype: int64
 
------------------
('ipod', 'think')
Positives  sum    4
Negatives  sum    0
Name: (ipod, think), dtype: int64
 
------------------
('ipod', 'time')
Positives  sum    4


<div class="alert alert-block alert-success">
<b>6) Reference </b> 
</div>

[1] Anon, 2022. GitHub - cjhutto/vaderSentiment: VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.. [online] GitHub. Available from: https://github.com/cjhutto/vaderSentiment [Accessed 11 Aug. 2022].

[2] Hu, M. and Liu, B., 2004. Mining Opinion Features in Customer Reviews.

[3] Hu, M. and Liu, B., 2004. Mining and Summarizing Customer Reviews.

[4] Guo, Y., Wang, M. and Li, X., 2017. Application of an improved Apriori algorithm in a mobile e-commerce recommendation system. Industrial Management &amp; Data Systems, 117(2), pp.287-303.

[5] Lundqvist, K., Liyanagunawardena, T. and Starkey, L., 2020. Evaluation of Student Feedback Within a MOOC Using Sentiment Analysis and Target Groups. The International Review of Research in Open and Distributed Learning, 21(3).

[6] Anon, 2022. My Absolute Go-To for Sentiment Analysis — TextBlob.. [online] Medium. Available from: https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524 [Accessed 14 Aug. 2022].

[7] Piryani, R., Piryani, B., Singh, V. and Pinto, D., 2020. Sentiment analysis in Nepali: Exploring machine learning and lexicon-based approaches. Journal of Intelligent &amp; Fuzzy Systems, 39(2), pp.2201-2212.

[8] precision, f., Kumar, V. and Kalyanarangan, V., 2022. precision , recall , f score when y_pred & y_true have different sizes. [online] Stack Overflow. Available from: https://stackoverflow.com/questions/42022498/precision-recall-f-score-when-y-pred-y-true-have-different-sizes [Accessed 21 Aug. 2022].

[9] : Anon, 2022. POS Tagging with NLTK and Chunking in NLP [EXAMPLES]. [online] Guru99. Available from: https://www.guru99.com/pos-tagging-chunking-nltk.html [Accessed 12 Aug. 2022].

[10] Saura, J., Palacios-Marqués, D. and Ribeiro-Soriano, D., 2021. Using data mining techniques to explore security issues in smart living environments in Twitter. Computer Communications, 179, pp.285-295.