# Opinion Mining: Information Extraction for Product Development
#### Ayan Karim

## Introduction 

We already know about the immense amount of data that companies collect every day. But beyond a company's own databases exists an entire corpus of information online that it doesn't have direct access to. Any company that sells a mass-produced product generates a huge amount of public opinion about it: reviews, articles, reactions and overall sentiments are posted online, especially soon after release. As of 2017, 223 million iPhones had been sold in the US, which is huge considering the population was 325 million at the time [1]! This type of reach definitely results in a large volume of online opinion about how consumers feel about the product.

So now when companies see this wealth of information that isn't a part of their own analytics, they want to access it to learn how their products are performing and how they can make them better. The problem is, how do companies access this data from the public media and turn it into something useful? How do we extract all that diverse information from the web and analyze it? This is where Opinion Mining comes in.


### What is Opinion Mining?

"Opinion Mining" is one of the most useful applications of Data Science, in which a pipeline is designed to process and interpret public opinion about various products. This data usually comes from public sources like reviews, articles and social media, so that the company can gain a diverse understanding of how people feel about its products.

My endeavor with this project is to create a pipeline that solves this problem. So, I developed a Data Science product that collects information from the web, processes and models the text data, and finally gives an end-user a summary of sentiments concerning their product, which they can then use for product development. To take it a step further, this pipeline also applies the same process to a competing product. For the sake of my demo, my target product is the iPhone X, and I compare it to Samsung's Galaxy S9.


### What is Aspect-Based Sentiment Analysis?

The type of analysis done in this product is called Aspect-Based Sentiment Analysis. This basically means our model will extract sentiments of a product within context. So, the model will parse the text, sentence by sentence, and extract the sentiment (positive or negative) as well as the aspect that the sentiment is about. For example, in the sentence, "The battery is so unreliable.", our dependency parser will extract a negative sentiment from "unreliable" and extract the fact that it's talking about the "battery".
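
To make the idea concrete, here is a minimal sketch of aspect-sentiment pairing. It does not use a dependency parser: as a simplifying assumption, it pairs each opinion word with the nearest aspect word in the same sentence. The tiny `ASPECTS` and `LEXICON` vocabularies and the function name `extract_aspect_sentiments` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical, hand-picked vocabularies used only for this illustration.
ASPECTS = {"battery", "screen", "camera", "price"}
LEXICON = {"unreliable": "negative", "dim": "negative",
           "great": "positive", "sharp": "positive"}

def extract_aspect_sentiments(text):
    """Pair each opinion word with the nearest aspect word in its sentence."""
    pairs = []
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        aspect_positions = [i for i, tok in enumerate(tokens) if tok in ASPECTS]
        if not aspect_positions:
            continue  # no aspect mentioned, nothing to attach a sentiment to
        for i, tok in enumerate(tokens):
            if tok in LEXICON:
                nearest = min(aspect_positions, key=lambda pos: abs(pos - i))
                pairs.append((tokens[nearest], tok, LEXICON[tok]))
    return pairs

print(extract_aspect_sentiments("The battery is so unreliable. The camera is great!"))
```

A nearest-word rule breaks down on longer sentences with several aspects, which is why the actual pipeline relies on a dependency parser instead.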


### Let’s Pretend We’re Apple!

We want to understand public opinion about the iPhone X and gain actionable insights to make our next product better. The questions that guide our investigation are:

#### What does Public Opinion tell us about the iPhone X?
#### Which aspects of the iPhone X do people dislike?
#### How does our product compare with that of a competing smartphone like the Samsung Galaxy S9?


### Technology Pipeline:

1. Scrapy to scrape information from the web
2. NLTK, spaCy and Gensim's simple_preprocess to process texts
3. Gensim's Latent Dirichlet Allocation to extract and assign topics
4. pyLDAvis for visualization
5. Scikit-learn, Multi-Label Naive Bayes and Support Vector Machines to train text data on Topics
6. scikit-multilearn's LabelPowerset to train on multiple labels
7. Opinion Lexicon by Minqing Hu and Bing Liu
8. spaCy Dependency Parser, CountVectorizer and TF-IDF vectorizer for Aspect-Based Sentiment Analysis
9. Word2Vec pre-trained on Google's News dataset for assigning aspects and sentiments to topics
10. Matplotlib and Seaborn for data visualization

In [1]:
# Import Dependencies and modules
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
from spacy.lang.en import English
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from string import punctuation
from collections import Counter
from io import StringIO
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from skmultilearn.problem_transform import LabelPowerset


import nltk
import glob
import errno
import os
import json

from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim import models, corpora, similarities
from gensim.models import CoherenceModel, TfidfModel
from itertools import chain
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer


# Plotting tools (spacy and matplotlib are already imported above)
import pyLDAvis
import pyLDAvis.gensim  # don't skip this

import pickle


%matplotlib inline

# Understanding the Data

As we've mentioned before, the goal of this project is to make use of data found from public media. So, the first thing I've done is scrape text data from articles and reviews from digitaltrends.com, gizmodo.com and techradar.com. I scraped anything I could find on the following phones:

1. iPhone X
2. Galaxy S9
3. Pixel 3
4. Huawei Mate 20 Pro
5. OnePlus 6T
6. Huawei P20 Pro
7. LG V40 ThinQ
8. Sony Xperia XZ3
9. Essential Phone
10. Razer Phone 2
11. HTC U12+
12. Moto G6 plus

You'll notice that I've listed far more phones than just the iPhone X and Galaxy S9. We need as much data as possible for the sake of topic modeling: the more text we have about phones, the more distinct the topics we can extract.

### Scraping the Data

To scrape the data, I used Scrapy, Python's scraping and web crawling framework. I built 23 custom crawlers to crawl, paginate and scrape content from links in search queries. They produced 285 texts of varying length in JSON format. This text data included any information about phones that comes to mind, such as reviews, opinion pieces, comparisons, prices, etc. The initial data set consisted of the title, author and text of each scraped article.

In [2]:
# Create DataFrame to hold json data
json_data = pd.DataFrame(columns = ['author', 'text', 'title'])

In [3]:
# Define path to the JSON files that contain the scraped articles
path = '/Users/ayankarim/Documents/Thinkful/Bootcamp/Final Capstone Opinion Mining/Opinion Mining/Notebooks/01 Topic Modelling/files/*.json'
files = glob.glob(path)

# Populate a list with the json objects
all_jsons = []

for filenames in files:
    with open(filenames, 'r') as f:
        file = json.load(f)
        all_jsons.append(file)

# Create a DataFrame of all json objects
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
frames = [pd.DataFrame(dicts) for dicts in all_jsons]
json_data = pd.concat([json_data] + frames)

In [4]:
# View initial data set
json_data.head()

Unnamed: 0,author,text,title
0,[\n\t\t\t\t\t\tJulian Chokkattu\t\t\t\t\t],"[<p>Google’s <a href=""https://store.google.com...",\n\t\tGoogle will announce hardware on October...
1,[\n\t\t\t\t\t\tChristian de Looper\t\t\t\t\t],"[<p>Google finally unveiled the new <a href=""h...",\n\t\tHere’s how to buy the new Google Pixel 3...
2,[\n\t\t\t\t\t\tSimon Hill\t\t\t\t\t],[<p>If you plan to buy one of Google’s <a href...,\n\t\tThe best Pixel 3 cases and covers\t
3,[\n\t\t\t\t\t\tSimon Hill\t\t\t\t\t],"[<p>As the developer of Android, Google turns ...",\n\t\tGoogle Pixel 3 vs. Pixel 2 vs. Pixel: Pi...
4,[\n\t\t\t\t\t\tSimon Hill\t\t\t\t\t],[<p>There are plenty of contenders in the <a h...,\n\t\tGoogle Pixel 3 vs. Samsung Galaxy S9: Wh...


# Clean Data

In [5]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Strip whitespace escapes, HTML tags and stray unicode escape
    # sequences left over from scraping.

    text = str(text).replace("\n", "")
    text = str(text).replace("\t", "")
    text = str(text).replace("\\n", "")
    text = str(text).replace("\\t", "")
    text = str(text).replace("\\", "")
    text = str(text).replace("xa0", " ")
    text = str(text).replace("\'", "")
    text = re.sub("<p>", "", str(text))
    text = re.sub("</p>", "", str(text))
    text = re.sub("</a>", "", str(text))
    text = re.sub('<[^>]+>', "", str(text)) 
    text = str(text).replace("\\u2019", "")
    text = str(text).replace("\\u2013", "")
    text = str(text).replace("\\u2018", "")
    text = str(text).replace("\\u00a0", "")
    text = str(text).replace("\\u00a3", "")
    text = str(text).replace("\u2014", "")
    text = str(text).replace("\u201d", "")
    text = str(text).replace("\u201c", "")
    return text


In [6]:
# Define function to clean text
def clean_text(df):
    # Convert lists to strings and remove brackets
    df['text'] = df['text'].astype(str)
    df['author'] = df['author'].astype(str)

    df['text'] = df['text'].map(lambda x: x.strip('[]'))
    df['author'] = df['author'].map(lambda x: x.strip('[]'))

    # Clean text
    df['text'] = df['text'].apply(lambda x: text_cleaner(x))
    df['title'] = df['title'].apply(lambda x: text_cleaner(x))
    df['author'] = df['author'].apply(lambda x: text_cleaner(x))

In [7]:
# Clean Text
clean_text(json_data)

# Reset index
json_data = json_data.reset_index()
json_data = json_data.drop(['index'], axis=1)

In [8]:
# Visualize dataframe
json_data.head(10)

Unnamed: 0,author,text,title
0,Julian Chokkattu,Google’s annual hardware launch event will tak...,"Google will announce hardware on October 9, ne..."
1,Christian de Looper,Google finally unveiled the new Google Pixel 3...,Here’s how to buy the new Google Pixel 3 and G...
2,Simon Hill,If you plan to buy one of Google’s Pixel 3 sma...,The best Pixel 3 cases and covers
3,Simon Hill,"As the developer of Android, Google turns out ...",Google Pixel 3 vs. Pixel 2 vs. Pixel: Picking ...
4,Simon Hill,There are plenty of contenders in the Android ...,Google Pixel 3 vs. Samsung Galaxy S9: Which sm...
5,Julian Chokkattu,Got your hands on a new Pixel 3 or Pixel 3 XL ...,Key settings you need to change on your brand-...
6,Simon Hill,Rarely has a flagship phone been so thoroughly...,Google Pixel 3 and Pixel 3 XL: Everything you ...
7,Lucas Coll,"Mobile hardware is getting better and better, ...","Verizon’s buy one, get one offer is the best d..."
8,Christian de Looper,The Google Pixel 3 and Pixel 3 XL may have sto...,The Google Pixel Stand turns your Android phon...
9,Simon Hill,The Google Pixel 3 and Pixel 3 XL are phones w...,The best Google Pixel 3 tips and tricks
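
The chain of `replace` calls in `text_cleaner` can also be expressed more compactly. The following is only a sketch of an alternative, not the cleaner used in this notebook: the function name `strip_markup` and the sample string are hypothetical, and unlike `text_cleaner` it keeps apostrophes and quotes instead of deleting them.

```python
import html
import re

def strip_markup(raw):
    """Remove HTML tags, decode entities and collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", raw)   # drop tags such as <p> and <a href=...>
    text = html.unescape(text)           # turn entities like &nbsp; into characters
    text = text.replace("\xa0", " ")     # non-breaking space -> regular space
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical scraped fragment, similar in shape to the raw 'text' column.
sample = '\n\t<p>Google&rsquo;s <a href="https://example.com">Pixel 3</a>&nbsp;is here.</p>\t'
print(strip_markup(sample))
```

Decoding entities rather than deleting escape sequences one by one means new characters in future scrapes are handled without extending the function.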


# Pre-Process Data for Topic Modeling

In [9]:
# Process and tokenize the text using Gensim
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc = True))

data = list(json_data['text'])
data_words = list(sent_to_words(data))

In [10]:
# Visualize tokenized documents
print(data_words[:1])

[['google', 'annual', 'hardware', 'launch', 'event', 'will', 'take', 'place', 'on', 'october', 'in', 'new', 'york', 'city', 'the', 'company', 'sent', 'out', 'invites', 'to', 'media', 'including', 'digital', 'trends', 'confirming', 'the', 'date', 'which', 'had', 'leaked', 'in', 'august', 'the', 'date', 'and', 'venue', 'are', 'change', 'of', 'pace', 'considering', 'the', 'past', 'two', 'google', 'october', 'events', 'have', 'taken', 'place', 'in', 'san', 'francisco', 'on', 'october', 'the', 'company', 'is', 'widely', 'expected', 'to', 'launch', 'slew', 'of', 'hardware', 'products', 'ranging', 'from', 'smartphones', 'to', 'smart', 'home', 'devices', 'the', 'highlights', 'will', 'be', 'the', 'pixel', 'and', 'pixel', 'xl', 'successors', 'to', 'last', 'year', 'critically', 'acclaimed', 'pixel', 'and', 'pixel', 'xl', 'smartphones', 'there', 'have', 'been', 'an', 'alarmingly', 'high', 'number', 'of', 'leaks', 'for', 'the', 'pixel', 'series', 'and', 'if', 'true', 'we', 'know', 'quite', 'lot', '

In [11]:
# Create Bigrams and Trigrams

# Build the models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Fast way to get a sentence clubbed as a bigram/trigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
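
To give a feel for what `min_count` and `threshold` control, here is a rough sketch of the kind of score that decides whether two adjacent words are merged. As far as I understand, gensim's default scorer follows the form (count(a, b) - min_count) * V / (count(a) * count(b)), and a pair is joined when its score exceeds `threshold`. The function `bigram_scores` is hypothetical, and using the number of distinct words for V is a simplification of what gensim does internally.

```python
from collections import Counter

def bigram_scores(sentences, min_count=5):
    """Score adjacent word pairs: (count(a,b) - min_count) * V / (count(a) * count(b))."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: distinct words only
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

# Toy corpus: "last year" always appears together, "the" appears everywhere.
sentences = [["last", "year", "the", "phone"],
             ["the", "camera", "last", "year"],
             ["the", "screen"],
             ["the", "battery"]]
scores = bigram_scores(sentences, min_count=1)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Pairs built from very frequent words score low because the denominator grows, which is why a high `threshold` such as 100 keeps only strongly associated phrases.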

In [12]:
# Remove Stopwords, make bigrams and lemmatize
stop_words = stopwords.words('english')
stop_words.extend(['pixel', 'iphone', 'samsung', 'apple', 'essential', 'xs', 'max', 
                  'huawei', 'galaxy', 'note', 'moto', 'oneplus', 'android', 'mate', 'pro', 'lg', 'sony', 'razer', 'phone', 'company', 
                  'smartphone', 'google', 'thinq', 'nokia', 'htc', 'xperia', 'xz', 'xr', 's9'])

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [13]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

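# Initialize spaCy English model, disabling the parser and NER (for efficiency)
# The 'en' shortcut no longer works in spaCy 3, so load the model by its full name
# python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])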

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['annual', 'hardware', 'launch', 'event', 'take', 'place', 'october', 'new', 'york', 'city', 'send', 'invite', 'medium', 'include', 'digital_trend', 'confirm', 'date', 'leak', 'august', 'date', 'venue', 'change', 'pace', 'consider', 'october', 'event', 'take', 'place', 'san', 'francisco', 'october', 'widely', 'expect', 'launch', 'slay', 'hardware', 'product', 'range', 'smartphone', 'smart', 'home', 'device', 'highlight', 'successor', 'last_year', 'critically', 'acclaim', 'smartphone', 'alarmingly', 'high', 'number', 'leak', 'series', 'true', 'know', 'quite', 'lot', 'phone', 'may', 'due', 'carelessness', 'recently', 'someone', 'leave', 'lyft', 'separately', 'group', 'russia', 'claim', 'get', 'hand', 'shipment', 'smartphone', 'even', 'post', 'unbox', 'video', 'show', 'everything', 'get', 'box', 'expect', 'notch', 'design', 'cutout', 'top', 'screen', 'house', 'front_fac', 'camera', 'garner', 'criticism', 'notch', 'look', 'unusually', 'large', 'small', 'may', 'traditional', 'design', 'sli

# Prepare Corpus for Topic Modeling

In [14]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

In [15]:
# gather tfidf scores
tfidf = models.TfidfModel(corpus, id2word = id2word)

# filter out low value words
low_value = 0.025

for i in range(0, len(corpus)):
    bow = corpus[i]
    # ids of the terms whose tf-idf weight falls below the cut-off
    low_value_ids = [term_id for term_id, value in tfidf[bow] if value < low_value]
    new_bow = [b for b in bow if b[0] not in low_value_ids]

    #reassign        
    corpus[i] = new_bow

In [16]:
# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (5, 1), (6, 3), (7, 1), (8, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 2), (33, 2), (34, 2), (35, 2), (36, 3), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (44, 3), (45, 2), (46, 1), (47, 2), (48, 1), (49, 1), (50, 1), (51, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 1), (59, 2), (60, 1), (62, 1), (65, 2), (66, 3), (67, 3), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (75, 1), (78, 1), (81, 2), (82, 1), (83, 4), (85, 1), (87, 1), (88, 2), (89, 1), (90, 1), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 3), (98, 1), (99, 1), (100, 1), (101, 1), (102, 3), (103, 1), (104, 1), (105, 1), (106, 1), (107, 2), (108, 1), (109, 1), (110, 1), (112, 4), (113, 3), (114, 1), (115, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (122, 1), (123, 2), (124, 1), (125, 1), (126, 1), (127, 1),
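
The filtering step above can be sketched without gensim to show what the weights mean. This is an illustration under assumptions, not the notebook's code: it uses raw term counts times log2(N / document frequency) with L2 normalisation, which I believe matches gensim's `TfidfModel` defaults, and the names `tfidf_weights` and `drop_low_value` are hypothetical. One difference worth flagging: as far as I know, gensim omits zero-weight terms from `tfidf[bow]`, so the loop above leaves words that occur in every document in place, whereas this sketch removes them too.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (tf * log2(N/df), L2-normalised)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # avoid dividing by zero
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

def drop_low_value(docs, low_value=0.025):
    """Keep only the tokens whose TF-IDF weight reaches the cut-off."""
    return [[t for t in doc if w[t] >= low_value]
            for doc, w in zip(docs, tfidf_weights(docs))]

# "phone" occurs in every document, so its weight is zero and it is dropped.
docs = [["camera", "lens", "phone"], ["screen", "phone"], ["battery", "phone"]]
print(drop_low_value(docs))
```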

# Build LDA

In [17]:
# Build the lda topic model
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, update_every=1, 
                               chunksize=50, passes=25, random_state=1, alpha='auto', minimum_probability=0)

# Save the model to disk (the with-block closes the file handle)
filename = 'lda_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(lda, f)

In [18]:
# Print the key words in the 3 topics
pprint(lda.print_topics(num_words=15))
doc_lda = lda[corpus]

[(0,
  '0.018*"camera" + 0.013*"phone" + 0.009*"good" + 0.008*"display" + '
  '0.008*"feature" + 0.007*"lens" + 0.007*"megapixel" + 0.006*"offer" + '
  '0.006*"winner" + 0.006*"find" + 0.006*"screen" + 0.005*"get" + '
  '0.005*"device" + 0.005*"come" + 0.005*"update"'),
 (1,
  '0.018*"app" + 0.017*"screen" + 0.014*"case" + 0.011*"tap" + 0.008*"home" + '
  '0.007*"set" + 0.006*"want" + 0.006*"display" + 0.006*"setting" + '
  '0.006*"option" + 0.006*"go" + 0.006*"turn" + 0.006*"button" + 0.005*"use" + '
  '0.005*"time"'),
 (2,
  '0.007*"photo" + 0.006*"even" + 0.006*"camera" + 0.006*"device" + '
  '0.005*"could" + 0.004*"say" + 0.004*"take" + 0.004*"user" + 0.004*"issue" + '
  '0.004*"really" + 0.004*"thing" + 0.004*"leak" + 0.003*"people" + '
  '0.003*"new" + 0.003*"year"')]


### Manually Naming Topics

1. topic 0: reliability
    
2. topic 1: function
    
3. topic 2: design

In [19]:
# Evaluate LDA model by computing Coherence
coherence_model_lda = CoherenceModel(model=lda, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.3891607445835293


### Evaluating LDA Model

One of the biggest problems in developing this pipeline is evaluating the models. This is because almost all of the models are unsupervised. And the one supervised model is trained on labels produced by another unsupervised model! Thankfully there are ways to measure the performance of our models.

For our LDA model we calculate a coherence score to evaluate how successful our topics are. Intuitively, the coherence score measures how "coherent" each topic is, that is, how consistently its top words appear together in the documents. The highest score I've been able to produce is 0.389, so we can see there is room for improvement. To produce better results in the future, I think we would need to collect a lot more data (1000 or 2000 documents rather than just 285), but for the sake of this project, time was a bit sensitive.
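
The `c_v` measure used above combines sliding windows, normalised pointwise mutual information and cosine similarity, which is too much to reproduce here. To illustrate the underlying intuition, the sketch below swaps in a simpler, UMass-style coherence: it rewards topics whose top words tend to occur in the same documents. The function name `umass_coherence` and the toy documents are hypothetical, and the numbers it produces are not comparable to the 0.389 reported above.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average log((co-document count + 1) / document count) over word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # number of documents that contain all of the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    for earlier, later in combinations(top_words, 2):
        # condition on the higher-ranked word; it must occur in the corpus
        scores.append(math.log((doc_count(earlier, later) + 1) / doc_count(earlier)))
    return sum(scores) / len(scores)

docs = [["camera", "lens", "photo"], ["camera", "lens"],
        ["app", "screen", "tap"], ["app", "screen"]]
coherent = umass_coherence(["camera", "lens"], docs)   # words that co-occur
mixed = umass_coherence(["camera", "app"], docs)       # words that never co-occur
print(coherent, mixed)
```

A topic whose top words never share a document scores lower, which is the behaviour any coherence measure is meant to capture.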

In [20]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)
vis



### Visualizing Topics

The best way to evaluate our topics is to actually visualize them. pyLDAvis is a separate package that works with gensim's LDA model to build interactive topic visualizations.

Above, we can see a representation of our three topics on the left as bubbles. The size of each bubble corresponds to the prevalence of the topic, and the separation between the bubbles tells us how distinct the topics are from each other.

pyLDAvis is also an interactive visual. If we hover over a bubble, the right side of the visual lists the most relevant terms for that topic, with red bars showing each term's frequency within the topic and blue bars showing its overall frequency in the documents.

This visualization is actually quite encouraging because our topics are all well separated and significant, and we can clearly see which words are most important to which topics.

# Label Corpus with Topics

In [21]:
# Assigns the topics to the documents in corpus
lda_corpus = lda[corpus]

# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id,score in topic] \
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)
print (threshold)

reliability = [j for i,j in zip(lda_corpus,data) if i[0][1] > threshold]
function = [j for i,j in zip(lda_corpus,data) if i[1][1] > threshold]
design = [j for i,j in zip(lda_corpus,data) if i[2][1] > threshold]

0.33333333399843246
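
The threshold comes out at roughly 1/3 because each document's topic probabilities sum to 1, so the mean over all scores is 1 divided by the number of topics. The sketch below shows the same labelling rule on made-up distributions; `assign_topics` and the numbers in `doc_topics` are hypothetical. Note that a document can exceed the threshold for more than one topic, which is what makes this a multi-label problem.

```python
def assign_topics(doc_topics, names):
    """Label each document with every topic whose probability exceeds the mean."""
    scores = [p for dist in doc_topics for p in dist]
    threshold = sum(scores) / len(scores)   # equals 1 / number of topics
    labels = [[name for name, p in zip(names, dist) if p > threshold]
              for dist in doc_topics]
    return threshold, labels

# Made-up topic distributions for three documents.
doc_topics = [[0.70, 0.20, 0.10],
              [0.45, 0.45, 0.10],
              [0.05, 0.15, 0.80]]
threshold, labels = assign_topics(doc_topics, ["reliability", "function", "design"])
print(round(threshold, 3))
print(labels)
```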


In [22]:
# Create dataframe of document groups for topics
topics_df = pd.DataFrame(reliability, columns=['reliability'])
topics_df['function'] = pd.DataFrame(function)
topics_df['design'] = pd.DataFrame(design)

In [23]:
labelled_df = json_data

# Assign labels as binary topics
labelled_df['reliability'] = ""
labelled_df['function'] = ""
labelled_df['design'] = ""

labelled_df['reliability'] = np.where(labelled_df['text'].isin(reliability), 'reliability', None)
labelled_df['function'] = np.where(labelled_df['text'].isin(function), 'function', None)
labelled_df['design'] = np.where(labelled_df['text'].isin(design), 'design', None)

In [24]:
labelled_df.head()

Unnamed: 0,author,text,title,reliability,function,design
0,Julian Chokkattu,Google’s annual hardware launch event will tak...,"Google will announce hardware on October 9, ne...",,,design
1,Christian de Looper,Google finally unveiled the new Google Pixel 3...,Here’s how to buy the new Google Pixel 3 and G...,reliability,,
2,Simon Hill,If you plan to buy one of Google’s Pixel 3 sma...,The best Pixel 3 cases and covers,,function,
3,Simon Hill,"As the developer of Android, Google turns out ...",Google Pixel 3 vs. Pixel 2 vs. Pixel: Picking ...,reliability,,
4,Simon Hill,There are plenty of contenders in the Android ...,Google Pixel 3 vs. Samsung Galaxy S9: Which sm...,reliability,,


In [25]:
# Combine labels into column
labelled_df['labelled'] = labelled_df[['reliability', 'function', 'design']].values.tolist()
labelled_df['labelled'] = labelled_df['labelled'].apply(lambda x: list(filter(lambda a: a != None, x)))
labelled_df = labelled_df.drop(['reliability', 'function', 'design'], axis=1)

In [26]:
labelled_df.head()

Unnamed: 0,author,text,title,labelled
0,Julian Chokkattu,Google’s annual hardware launch event will tak...,"Google will announce hardware on October 9, ne...",[design]
1,Christian de Looper,Google finally unveiled the new Google Pixel 3...,Here’s how to buy the new Google Pixel 3 and G...,[reliability]
2,Simon Hill,If you plan to buy one of Google’s Pixel 3 sma...,The best Pixel 3 cases and covers,[function]
3,Simon Hill,"As the developer of Android, Google turns out ...",Google Pixel 3 vs. Pixel 2 vs. Pixel: Picking ...,[reliability]
4,Simon Hill,There are plenty of contenders in the Android ...,Google Pixel 3 vs. Samsung Galaxy S9: Which sm...,[reliability]


# Train Multi-Label Classifier Models

We need to train a Multi-Label Classifier so that we can use this model to assign aspect sentiments to topics later on. We're going to compare the performance of ML-Naive Bayes and ML-Support Vector Machines and then decide on which one we'll use.

In [27]:
# Convert the multi-labels into arrays
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labelled_df.labelled)
X = labelled_df.text

# Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Save the fitted binarizer.
# This is important: it records how the multi-labels were binarized, so you need to
# load it in the next notebook in order to map predictions back to the correct labels.
filename = 'mlb.pkl'
with open(filename, 'wb') as f:
    pickle.dump(mlb, f)
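
To show what the binarizer and the label powerset transformation do to the labels, here is a plain-Python sketch. It is not scikit-learn's or scikit-multilearn's implementation: `binarize` and `label_powerset` are hypothetical helpers that mimic the idea. The first turns each label list into a 0/1 row with one column per label in sorted order, and the second maps every distinct row to a single class id so that an ordinary single-label classifier can be trained on it.

```python
def binarize(label_lists):
    """Turn lists of labels into 0/1 rows, one column per sorted label."""
    classes = sorted({label for labels in label_lists for label in labels})
    rows = [[int(c in labels) for c in classes] for labels in label_lists]
    return classes, rows

def label_powerset(rows):
    """Map every distinct 0/1 row to a single class id (first seen = 0)."""
    combo_ids = {}
    encoded = [combo_ids.setdefault(tuple(row), len(combo_ids)) for row in rows]
    return encoded, combo_ids

labels = [["design"], ["reliability"], ["function", "reliability"], ["reliability"]]
classes, rows = binarize(labels)
encoded, combo_ids = label_powerset(rows)
print(classes)   # column order of the binary matrix
print(rows)
print(encoded)   # one class id per document
```

If `MultiLabelBinarizer` sorts its classes the same way, which I believe it does when no explicit class list is given, columns 0, 1 and 2 in the reports below would correspond to design, function and reliability.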

In [28]:
# LabelPowerset allows for multi-label classification
# Build a pipeline for multinomial naive bayes classification
text_clf = Pipeline([('vect', CountVectorizer(stop_words = "english",ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf=False)),
                     ('clf', LabelPowerset(MultinomialNB(alpha=1e-1))),])

text_clf = text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

# Calculate accuracy
np.mean(predicted == y_test)

0.8009259259259259

In [29]:
# Test if SVM performs better
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', LabelPowerset(
                             SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, max_iter=6, random_state=42)))])
text_clf_svm = text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)

#Calculate accuracy
np.mean(predicted_svm == y_test)



0.8564814814814815

In [30]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# View scores for each topic label (precision, recall, f1-score and support)
print(metrics.classification_report(y_test, predicted_svm))

              precision    recall  f1-score   support

           0       0.92      0.65      0.76        37
           1       1.00      0.73      0.84        11
           2       0.81      0.88      0.84        40

   micro avg       0.87      0.76      0.81        88
   macro avg       0.91      0.75      0.82        88
weighted avg       0.88      0.76      0.81        88
 samples avg       0.88      0.81      0.82        88



### Evaluating Multi-Label Classifiers

When comparing the two models, the SVM performs noticeably better than Naive Bayes, with 85.6% accuracy versus 80.1%. The averaged precision scores are promising, ranging from 87% to 91%, and the averaged f1-scores sit at 81% to 82%. The concerning numbers are the recall scores: the averages range from 75% to 81%, and the per-label recall drops as low as 65%. I think we can improve on these results by tuning the hyperparameters more carefully, but having more data to train on would also help.

So, for our Aspect-Based Opinion Mining notebook, we're going to use the SVM model.
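
One caveat about the accuracy figures: `np.mean(predicted == y_test)` appears to compare the label matrices cell by cell rather than document by document. The reported 0.856 is consistent with 185 correct cells out of 216, which would be 72 test documents times 3 labels. The sketch below contrasts that element-wise accuracy with the stricter exact-match accuracy; both helper functions and the toy matrices are hypothetical.

```python
def elementwise_accuracy(y_true, y_pred):
    """Share of individual label cells predicted correctly."""
    cells = [t == p
             for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    return sum(cells) / len(cells)

def exact_match_accuracy(y_true, y_pred):
    """Share of documents whose whole label row is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: two documents fully correct, two with one wrong label each.
y_true = [[1, 0, 0], [0, 0, 1], [0, 1, 1], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1]]
print(elementwise_accuracy(y_true, y_pred))
print(exact_match_accuracy(y_true, y_pred))
```

Element-wise accuracy is always at least as high as exact-match accuracy, so it is worth checking both before comparing against results reported elsewhere.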

In [31]:
# Train the SVM pipeline on the full dataset and save the model
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', LabelPowerset(
                             SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, max_iter=6, random_state=42)))])
text_clf_svm = text_clf_svm.fit(X, y)

# Save the model to disk
filename = 'svm_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(text_clf_svm, f)

# Continue to Second Notebook: 02 Opinion Mining - Aspect Based Sentiment Analysis

# References

1. "Apple iPhone sales 2018." Statista. Statista. 19 Feb. 2019 <https://www.statista.com/statistics/263401/global-apple-iphone-sales-since-3rd-quarter-2007/>.

2. Bansal, Shivam. "Beginners Guide to Topic Modeling in Python." Analytics Vidhya. 11 Jan. 2019. 19 Feb. 2019 <https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/>.

3. Li, Susan. "Topic Modeling and Latent Dirichlet Allocation (LDA) in Python." Towards Data Science. 31 May 2018. Towards Data Science. 19 Feb. 2019 <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>.

4. Li, Susan. "Topic Modelling in Python with NLTK and Gensim – Towards Data Science." Towards Data Science. 30 Mar. 2018. Towards Data Science. 19 Feb. 2019 <https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21>.

5. Min, Peter. "Aspect-Based Opinion Mining (NLP with Python) – Peter Min – Medium." Medium.com. 06 June 2018. Medium. 19 Feb. 2019 <https://medium.com/@pmin91/aspect-based-opinion-mining-nlp-with-python-a53eb4752800>.