# Aspect Detection Exploration

Created: 2019.1.10  
Updated: 2019.1.17

### _An unsupervised aspect detection model for sentiment analysis of reviews_

It looks like they start with a "seed set" of aspects (the seed set is found unsupervised??)

It iteratively bootstraps (and clusters?) to find better final aspect list

Use a generalized version of an FLR method to "rank aspects and select important ones"  
("FLR is a word scoring method that uses internal structures and frequencies of candidates")

### _An Unsupervised Neural Attention Model for Aspect Extraction_
(Note that there is code for this paper at https://github.com/ruidan/Unsupervised-Aspect-Extraction)

Interestingly they note that LDA is frequently used for aspect extraction (but go on to state why this generally doesn't work very well)

Does this work handle multi-word aspects?

## Getting data

In [4]:
import pandas as pd

In [5]:
def load_data():
    article_table = pd.read_csv("../data/raw/kaggle1/articles1.csv")
    return article_table

In [6]:
articles = load_data()
articles

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."
5,5,17288,"Sick With a Cold, Queen Elizabeth Misses New Y...",New York Times,Sewell Chan,2017-01-02,2017.0,1.0,,"LONDON — Queen Elizabeth II, who has been b..."
6,6,17289,Taiwan’s President Accuses China of Renewed In...,New York Times,Javier C. Hernández,2017-01-02,2017.0,1.0,,BEIJING — President Tsai of Taiwan sharpl...
7,7,17290,"After ‘The Biggest Loser,’ Their Bodies Fought...",New York Times,Gina Kolata,2017-02-08,2017.0,2.0,,"Danny Cahill stood, slightly dazed, in a blizz..."
8,8,17291,"First, a Mixtape. Then a Romance. - The New Yo...",New York Times,Katherine Rosman,2016-12-31,2016.0,12.0,,"Just how is Hillary Kerr, the founder of ..."
9,9,17292,Calling on Angels While Enduring the Trials of...,New York Times,Andy Newman,2016-12-31,2016.0,12.0,,Angels are everywhere in the Muñiz family’s ap...


## Sentencifying data

In [5]:
articles.content[0].split(".")

['WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win',
 ' The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues',
 ' But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement',
 ' That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government',
 ' To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative vote

In [7]:
sentences = []
for article in articles.content:
    sentences.extend(article.split("."))

In [5]:
len(sentences)

1853772

## Word2Vec on it

In [9]:
import gensim

In [10]:
iterator = iter(sentences)
model = gensim.models.Word2Vec(sentences, size=200, window=5, min_count=10, workers=4, iter=2)

In [11]:
model.corpus_total_words

190868913

# The Bootstrapping Method

(from _An unsupervised aspect detection model for sentiment analysis of reviews_)

A POS pattern finder, thanks to https://stackoverflow.com/questions/32399299/how-do-i-extract-patterns-from-lists-of-pos-tagged-words-nltk

In [16]:
# pattern should be an array of POS tags
# n for n-gram size
#def find_pos_pattern(pos_sentences, pattern, n):
def find_pos_pattern_ordered(pos_sentences, pattern):
    for sentence in pos_sentences:
        # handle index error at end?
        end = len(sentence) - len(pattern) # NOTE: I think off by one or two somehow?
        
        for index, (a, b) in enumerate(sentence, 1):
            if index == end:
                break
             
            # NOTE: I would use this method if I cared about order
            i = 0
            found = True
            for part in pattern:
                if part != sentence[index+i][1]: 
                    found = False
                    break
                i += 1
            
            if found: yield(sentence[index:index+len(pattern)])

def find_pos_pattern_unordered(pos_sentences, tags, n):
    for sentence in pos_sentences:
        # handle index error at end?
        end = len(sentence) - n # NOTE: I think off by one or two somehow?
        
        for index, (a, b) in enumerate(sentence, 1):
            if index == end:
                break
            
            found = True
            for pos in sentence[index:index+n]:
                if pos[1] not in tags:
                    found = False
                    break
            
            if found: yield(sentence[index:index+n])

In [12]:
import nltk

ordered_patterns = [
    ["NN"],
    
    ["JJ", "NN", "NN", "NN"],
    ["DT", "JJ"], 
    ["DT", "NN", "NNS", "VBG"]
]

# note: assumes POS sentences being passed in
def top_aspects(sentences):
    #print(sentences)
    # extract review sentences

    candidate_aspects = []

    # for each sentence
    #for sentence in sentences:
        # use POS tagging (already handled)
        #words = nltk.word_tokenize(sentence)
        #pos = nltk.pos_tag(words)

        #print(list(find_pos_pattern([pos], ["NN"])))
        
        #Extract POS tag patterns as candidates for aspects
        #pass

    # extract POS tag patterns as candidates for aspects
    
    # combination of nouns
    for i in range(1,5):
        print(list(find_pos_pattern_unordered(sentences, ["NN", "NNS"], i)))
    
    # combination of nouns and adjectives
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NN", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NN", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NNS", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NN", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NNS", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NN", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NNS", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NNS", "NNS"])))
        
    # combination determiners and adjectives
    print(list(find_pos_pattern_ordered(sentences, ["DT", "JJ"])))
    
    # combination of nouns and verb gerunds (present participle)
    print(list(find_pos_pattern_ordered(sentences, ["DT", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "VBG"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "VBG", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "VBG", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "NN", "VBG"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "NNS", "VBG"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "NN", "NN"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "NN", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "NNS", "NNS"])))
    print(list(find_pos_pattern_ordered(sentences, ["DT", "NNS", "NN"])))
        
    #for each candidate aspect
    for candidate in candidate_aspects:
        # use stemming

        # select multiword aspects

        # use a set of heuristic rules
        pass

    # make initial seed for final aspects

    # use iterative bootstrapping for detecting final aspects

    # aspect pruning

    # return top selected aspects

In [17]:
pos_sentences = []

for sentence in sentences[0:100]:
    words = nltk.word_tokenize(sentence)
    pos = nltk.pos_tag(words)
    pos_sentences.append(pos)

testt = pos_sentences[0:2]
#print(testt)
top_aspects(testt)

[[('fear', 'NN')], [('health', 'NN')], [('care', 'NN')], [('lawsuit', 'NN')], [('administration', 'NN')], [('administration', 'NN')], [('executive', 'NN')], [('branch', 'NN')], [('suit', 'NN')], [('administration', 'NN')], [('authority', 'NN')], [('billions', 'NNS')], [('dollars', 'NNS')], [('health', 'NN')], [('insurance', 'NN')], [('subsidies', 'NNS')], [('victory', 'NN')]]
[[('health', 'NN'), ('care', 'NN')], [('care', 'NN'), ('lawsuit', 'NN')], [('executive', 'NN'), ('branch', 'NN')], [('health', 'NN'), ('insurance', 'NN')], [('insurance', 'NN'), ('subsidies', 'NNS')]]
[[('health', 'NN'), ('care', 'NN'), ('lawsuit', 'NN')], [('health', 'NN'), ('insurance', 'NN'), ('subsidies', 'NNS')]]
[]
[[('new', 'JJ'), ('fear', 'NN')], [('big', 'JJ'), ('victory', 'NN')]]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[[('a', 'DT'), ('new', 'JJ')], [('a', 'DT'), ('big', 'JJ')]]
[[('the', 'DT'), ('executive', 'NN')], [('the', 'DT'), ('suit', 'NN')], [('the', 'DT'), ('administration', 'NN')]]
[]
[]
[]
[]
[]
[