# Aspect Detection Exploration

Created: 2019.1.10  
Updated: 2019.1.17

### _An unsupervised aspect detection model for sentiment analysis of reviews_

It looks like they start with a "seed set" of aspects (the seed set is found unsupervised??)

It iteratively bootstraps (and clusters?) to find better final aspect list

Use a generalized version of an FLR method to "rank aspects and select important ones"  
("FLR is a word scoring method that uses internal structures and frequencies of candidates")

### _An Unsupervised Neural Attention Model for Aspect Extraction_
(Note that there is code for this paper at https://github.com/ruidan/Unsupervised-Aspect-Extraction)

Interestingly they note that LDA is frequently used for aspect extraction (but go on to state why this generally doesn't work very well)

Does this work handle multi-word aspects?

## Getting data

In [1]:
import pandas as pd

In [2]:
def load_data():
    article_table = pd.read_csv("../data/raw/kaggle1/articles1.csv")
    return article_table

In [3]:
articles = load_data()
articles

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."
5,5,17288,"Sick With a Cold, Queen Elizabeth Misses New Y...",New York Times,Sewell Chan,2017-01-02,2017.0,1.0,,"LONDON — Queen Elizabeth II, who has been b..."
6,6,17289,Taiwan’s President Accuses China of Renewed In...,New York Times,Javier C. Hernández,2017-01-02,2017.0,1.0,,BEIJING — President Tsai of Taiwan sharpl...
7,7,17290,"After ‘The Biggest Loser,’ Their Bodies Fought...",New York Times,Gina Kolata,2017-02-08,2017.0,2.0,,"Danny Cahill stood, slightly dazed, in a blizz..."
8,8,17291,"First, a Mixtape. Then a Romance. - The New Yo...",New York Times,Katherine Rosman,2016-12-31,2016.0,12.0,,"Just how is Hillary Kerr, the founder of ..."
9,9,17292,Calling on Angels While Enduring the Trials of...,New York Times,Andy Newman,2016-12-31,2016.0,12.0,,Angels are everywhere in the Muñiz family’s ap...


## Sentencifying data

In [5]:
articles.content[0].split(".")

['WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win',
 ' The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues',
 ' But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement',
 ' That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government',
 ' To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative vote

In [4]:
sentences = []
for article in articles.content:
    sentences.extend(article.split("."))

In [5]:
len(sentences)

1853772

## Word2Vec on it

In [9]:
import gensim

In [10]:
iterator = iter(sentences)
model = gensim.models.Word2Vec(sentences, size=200, window=5, min_count=10, workers=4, iter=2)

In [11]:
model.corpus_total_words

190868913

# The Bootstrapping Method

(from _An unsupervised aspect detection model for sentiment analysis of reviews_)

A POS pattern finder, thanks to https://stackoverflow.com/questions/32399299/how-do-i-extract-patterns-from-lists-of-pos-tagged-words-nltk

In [14]:
# pattern should be an array of POS tags
# n for n-gram size
#def find_pos_pattern(pos_sentences, pattern, n):
def find_pos_pattern_ordered(pos_sentences, pattern):
    for sentence in pos_sentences:
        # handle index error at end?
        end = len(sentence) - len(pattern) # NOTE: I think off by one or two somehow?
        
        if len(pattern) >= len(sentence): continue
        
        for index, (a, b) in enumerate(sentence, 1):
            if index == end:
                break
             
            # NOTE: I would use this method if I cared about order
            i = 0
            found = True
            for part in pattern:
                if part != sentence[index+i][1]: 
                    found = False
                    break
                i += 1
            
            if found: yield(sentence[index:index+len(pattern)])

def find_pos_pattern_unordered(pos_sentences, tags, n):
    for sentence in pos_sentences:
        # handle index error at end?
        end = len(sentence) - n # NOTE: I think off by one or two somehow?
        
        if n >= len(sentence): continue
        
        for index, (a, b) in enumerate(sentence, 1):
            if index == end:
                break
            
            found = True
            for pos in sentence[index:index+n]:
                if pos[1] not in tags:
                    found = False
                    break
            
            if found: yield(sentence[index:index+n])

In [34]:
import nltk

ordered_patterns = [
    ["NN"],
    
    ["JJ", "NN", "NN", "NN"],
    ["DT", "JJ"], 
    ["DT", "NN", "NNS", "VBG"]
]

# note: assumes POS sentences being passed in
def top_aspects(sentences):
    #print(sentences)
    # extract review sentences

    #candidate_aspects = []

    # for each sentence
    #for sentence in sentences:
        # use POS tagging (already handled)
        #words = nltk.word_tokenize(sentence)
        #pos = nltk.pos_tag(words)

        #print(list(find_pos_pattern([pos], ["NN"])))
        
        #Extract POS tag patterns as candidates for aspects
        #pass

    # extract POS tag patterns as candidates for aspects
    extracted = []
    
    # combination of nouns
    for i in range(1,5):
        extracted.extend(list(find_pos_pattern_unordered(sentences, ["NN", "NNS"], i)))

    # combination of nouns and adjectives
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NN", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NN", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NNS", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NN", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NNS", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NN", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NN", "NNS", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["JJ", "NNS", "NNS", "NNS"])))
        
    # combination determiners and adjectives
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "JJ"])))
 
    # combination of nouns and verb gerunds (present participle)
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "VBG"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "VBG", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "VBG", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "NN", "VBG"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "NNS", "VBG"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "NN", "NN"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "NN", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "NNS", "NNS"])))
    extracted.extend(list(find_pos_pattern_ordered(sentences, ["DT", "NNS", "NN"])))
        
    #for each candidate aspect
    completed = []
    scores = []
    for candidate in extracted:
        if candidate in completed: continue
        completed.append(candidate)
        
        # use stemming

        # select multiword aspects
        score = flr(candidate, sentences, extracted)
        
        #print(score, candidate)
        scores.append(score)

        # use a set of heuristic rules
        pass

    # make initial seed for final aspects

    # use iterative bootstrapping for detecting final aspects

    # aspect pruning

    # return top selected aspects
    
    
    return completed, scores

In [23]:
import math

lr_counts = {}

# it would be much more efficient to go through all of them once and comprehensively count them all in the same run, maybe?
def lr_count(pos_sentences, aspect):
    
    lr_counts[aspect] = {"l":0, "r":0, "l_seen":[], "r_seen":[]}
    
    # search every word of every sentence
    for i in range(0, len(pos_sentences)):
        for j in range (1, len(pos_sentences[i])-1):
            
            # is this the aspect we're looking for?
            if pos_sentences[i][j] == aspect:
                l_type = pos_sentences[i][j-1][1]
                r_type = pos_sentences[i][j+1][1]
                
                # have we seen the left type before?
                if l_type not in lr_counts[aspect]["l_seen"]:
                    lr_counts[aspect]["l"] += 1
                    lr_counts[aspect]["l_seen"].append(l_type)
                    
                # have we seen the right type before?
                if r_type not in lr_counts[aspect]["r_seen"]:
                    lr_counts[aspect]["r"] += 1
                    lr_counts[aspect]["r_seen"].append(r_type)


def lr_i_calc(pos_sentences, aspect_part):
    if aspect_part not in lr_counts.keys():
        lr_count(pos_sentences, aspect_part)
    
    return math.sqrt(lr_counts[aspect_part]["l"]*lr_counts[aspect_part]["r"])

def lr_calc(pos_sentences, aspect):
    product = 1
    for part in aspect:
        product *= lr_i_calc(pos_sentences, part)
    
    return product ** (1 / len(aspect))

def frequency(aspect, aspects):
    return aspects.count(aspect)
    
def flr(aspect, pos_sentences, aspects):
    return frequency(aspect, aspects) * lr_calc(pos_sentences, aspect)

In [41]:
pos_sentences = []

for sentence in sentences[0:10000]:
    words = nltk.word_tokenize(sentence)
    pos = nltk.pos_tag(words)
    pos_sentences.append(pos)

testt = pos_sentences
#print(testt)
aspects, scores = top_aspects(testt)

combined = {}

i = 0
for aspect in aspects:
    key = ""
    for part in aspect:
        key += part[0] + " "
    combined[key] = scores[i]
    i += 1
    
#print(combined)

sorted_aspects = sorted(combined.items(), key=lambda x: -x[1])
for thing in sorted_aspects:
    print(thing)

('intelligence ', 1243.9151900350764)
('campaign ', 952.0)
('people ', 882.0)
('companies ', 769.3997660514332)
('schools ', 634.064665471906)
('emails ', 620.5739279086739)
('party ', 595.6643350075611)
('weight ', 594.5082001116554)
('time ', 558.0)
('trade ', 529.1502622129182)
('sales ', 521.206293131616)
('agencies ', 479.3182658735217)
('work ', 432.70082042908126)
('evidence ', 416.26914370392626)
('administration ', 389.7114317029974)
('the world ', 379.13626572995594)
('years ', 378.0)
('workers ', 367.65200937843383)
('legislation ', 358.1954773583832)
('the Russian ', 354.9445015450286)
('attacks ', 351.0128202786901)
('the company ', 346.7644849575899)
('year ', 341.2140090910688)
('plan ', 340.0)
('months ', 339.587985653203)
('rules ', 330.0)
('world ', 313.9554108468271)
('government ', 308.0)
('computer ', 307.1237535587243)
('health ', 304.2614007724279)
('the same ', 304.02003483158273)
('death ', 300.0)
('home ', 298.8377486195477)
('countries ', 296.0)
('tax ', 293.

('the education minister ', 6.631509350543594)
('budget enforcement tool ', 6.626587999889208)
('computer networks ', 6.6238072086195725)
('s aides ', 6.619501839293746)
('s advisers ', 6.619501839293746)
('s work ', 6.619501839293746)
('internet firms ', 6.619501839293746)
('ethics lawyer ', 6.619501839293746)
('normal weight ', 6.619501839293746)
('middle school ', 6.619501839293746)
('s meetings ', 6.619501839293746)
('any evidence ', 6.619501839293746)
('no means ', 6.619501839293746)
('the city party ', 6.606717648934963)
('the party headquarters ', 6.606717648934963)
('each raising ', 6.593491505914626)
('the legislative ', 6.593491505914625)
('the nuclear ', 6.593491505914625)
('hedge fund manager ', 6.590195889768262)
('s reality television show ', 6.585810213882006)
('intelligence collection ', 6.577736335972116)
('intelligence leaders ', 6.577736335972116)
('intelligence agents ', 6.577736335972116)
('intelligence briefing ', 6.577736335972116)
('intelligence unit ', 6.577736

('the mountain ', 3.408658099402498)
('the dictionary ', 3.408658099402498)
('the lightness ', 3.408658099402498)
('the sentence ', 3.408658099402498)
('the sprint ', 3.408658099402498)
('the homey ', 3.408658099402498)
('the naming ', 3.408658099402498)
('the rape ', 3.408658099402498)
('the bolívar ', 3.408658099402498)
('the brazenness ', 3.408658099402498)
('the tabloid ', 3.408658099402498)
('the potency ', 3.408658099402498)
('the meal ', 3.408658099402498)
('the lap ', 3.408658099402498)
('the harvesting ', 3.408658099402498)
('the cyberinfrastructure ', 3.408658099402498)
('the tallying ', 3.408658099402498)
('the elbow ', 3.408658099402498)
('an propaganda ', 3.408658099402498)
('the cyberage ', 3.408658099402498)
('the thriving ', 3.408658099402498)
('the leaking ', 3.408658099402498)
('the precise ', 3.408658099402498)
('the broadcaster ', 3.408658099402498)
('the remember ', 3.408658099402498)
('the stifling ', 3.408658099402498)
('the gymnast ', 3.408658099402498)
('the an

('silent void ', 2.0597671439071177)
('single gunman ', 2.0597671439071177)
('northern state ', 2.0597671439071177)
('Hispanic neighborhoods ', 2.0597671439071177)
('only people ', 2.0597671439071177)
('funeral expenses ', 2.0597671439071177)
('expensive technologies ', 2.0597671439071177)
('rare instances ', 2.0597671439071177)
('vast changes ', 2.0597671439071177)
('illicit gifts ', 2.0597671439071177)
('improper gifts ', 2.0597671439071177)
('Palestinian men ', 2.0597671439071177)
('persistent features ', 2.0597671439071177)
('violent people ', 2.0597671439071177)
('inside homes ', 2.0597671439071177)
('congressional races ', 2.0597671439071177)
('urban neighborhoods ', 2.0597671439071177)
('traditional influences ', 2.0597671439071177)
('young people ', 2.0597671439071177)
('some key ', 2.0597671439071177)
('some random ', 2.0597671439071177)
('some troubling ', 2.0597671439071177)
('That spiritual ', 2.0597671439071177)
('some transit ', 2.0597671439071177)
('some sort ', 2.059767

('insight ', 1.0)
('recruits ', 1.0)
('mantra ', 1.0)
('paralysis ', 1.0)
('casualties ', 1.0)
('broken ', 1.0)
('scattering ', 1.0)
('singular ', 1.0)
('imperative ', 1.0)
('literacy ', 1.0)
('warfare ', 1.0)
('emperor ', 1.0)
('clutter ', 1.0)
('decorations ', 1.0)
('closets ', 1.0)
('taking ', 1.0)
('popcorn ', 1.0)
('cookie ', 1.0)
('tins ', 1.0)
('shoeboxes ', 1.0)
('fragile ', 1.0)
('tissue ', 1.0)
('garlands ', 1.0)
('zip ', 1.0)
('tangles ', 1.0)
('balcony ', 1.0)
('bushes ', 1.0)
('wreaths ', 1.0)
('hangers ', 1.0)
('stockpile ', 1.0)
('guide ', 1.0)
('spark ', 1.0)
('blouses ', 1.0)
('jackets ', 1.0)
('cleaning ', 1.0)
('shirts ', 1.0)
('shortest ', 1.0)
('rotation ', 1.0)
('selecting ', 1.0)
('rid ', 1.0)
('pros ', 1.0)
('playthings ', 1.0)
('stuffed ', 1.0)
('Parents ', 1.0)
('doll ', 1.0)
('swears ', 1.0)
('drawers ', 1.0)
('dividers ', 1.0)
('label ', 1.0)
('bookcase ', 1.0)
('paints ', 1.0)
('jigsaw ', 1.0)
('puzzles ', 1.0)
('stickers ', 1.0)
('cleanup ', 1.0)
('counter