# NLP for Digital Markets

***Akash Gupta (2019DMB02)***

# Language Statistics

Before turning to numerical encodings and transformations of textual data, we present ideas that help us **clean** the text and obtain a broad overview of the textual characteristics of the **review data**. These broad characteristics are termed **language statistics** and give us a flavour of the information and insights contained in this data. **Note** that this document explores two datasets: the review dataset for **HP laptops** and the review dataset for **Lenovo laptops**. We look at the following:

- The **count** of all words in each review dataset. This gives a sense of **how much content people wrote** for each laptop. The full list of words, repeats included, is what the code below calls the **vocabulary** of a dataset. 

- The **average** length of a review in each dataset. The **length of a review** is the **number of words used to write that review**. A large average length implies that people have written detailed responses about the product. 

- The **count** of all words in each review dataset, without **stopwords**. Stopwords are common function words such as **is, the, of, at, from** that carry little information about the content of the text. 

- The **most frequently occurring words** in each review dataset. This statistic would give us vital insights into the words that predominantly characterize the review dataset. 

- The **rarest words (hapaxes)** in each review dataset, that is, words that occur exactly once. This gives a sense of which words and characteristics might be stark differentiators between the two datasets. 

- We find **collocations** in each dataset. Collocations are **bigrams** (a **bigram** is a **pair of adjacent words**) that occur together unusually often; NLTK's `Text.collocations()` ranks them by a likelihood-ratio score rather than by raw frequency. The output gives the word pairs that most characteristically describe each product. 

- We find the **concordance** of certain **keywords of interest** in each dataset, that is, **the various contexts in which a particular word occurs**. Here we extract the contexts in which the word **money** has been used. 

- We then find **trigrams**, which are sequences of **three consecutive words**. Here we extract all trigrams that end with the word **money** or **performance** to get an idea of the **sentiment attached to these features**. 

- Finally, we find some of the **nouns** used in each dataset. The nouns indicate **in what context, or about which features,** the product is being discussed. 

&nbsp;

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import random

In [2]:
df = pd.read_csv('/Users/Akashgupta/Desktop/NEWERFIN/disser/MASTER.csv')

In [8]:
df2 = pd.read_csv('/Users/Akashgupta/Desktop/NEWERFIN/disser/basic_DELL_data.csv')

In [128]:
hpdat = df[df['brand'] == 'HP']
#hpdat = hpdat[hpdat['customers'] == 0]
#hpdat = hpdat[hpdat['performance'] == 1]
#hpdat = hpdat[hpdat['appearance_price'] == 1]
# NOTE: despite its name, hpneg holds HP reviews labelled 'pos' with SCORE >= 3
hpneg = hpdat[hpdat['LABELS'] == 'pos']
hpneg = hpneg[hpneg['SCORE'] >= 3]
hpneg = hpneg[['REVIEW', 'SCORE', 'LABELS', 'customers']]
hpneg.head()

| | REVIEW | SCORE | LABELS | customers |
|---|---|---|---|---|
| 388 | writing this review after using the laptop for... | 6 | pos | 1 |
| 389 | computer technician and have purchased and con... | 13 | pos | 1 |
| 390 | amazing have never ever thought the ssd can fa... | 9 | pos | 1 |
| 391 | purchased this laptop 29th october 2019 writin... | 19 | pos | 1 |
| 395 | the processor fast really good for office work... | 3 | pos | 1 |


In [100]:
lendat = df[df['brand'] == 'LEN']
lendat = lendat[lendat['customers'] == 0]
lendat = lendat[lendat['performance'] == 1]
lendat = lendat[lendat['appearance_price'] == 1]
lenneg = lendat[lendat['LABELS'] == 'neg']
lenneg = lenneg[['REVIEW', 'SCORE', 'LABELS', 'customers']]
lenneg.head()

| | REVIEW | SCORE | LABELS | customers |
|---|---|---|---|---|
| 757 | received s145 with warranty months remaining w... | -2 | neg | 0 |
| 758 | telling you this laptop not worth any your mon... | -1 | neg | 0 |
| 762 | never buy this laptop outright pathetic just d... | -6 | neg | 0 |
| 763 | using for last days best for office document b... | 0 | neg | 0 |
| 764 | excellent product this price point only batter... | -2 | neg | 0 |


In [101]:
print("average negative intensity for lenovo customer: ", np.abs(lenneg['SCORE']).mean())
print(len(lenneg['REVIEW']))

average negative intensity for lenovo customer:  1.6779661016949152
59


In [129]:
print("average negative intensity for hp customer: ", np.abs(hpneg['SCORE']).mean())
print(len(hpneg['REVIEW']))

average negative intensity for hp customer:  4.967213114754099
61


&nbsp;

## The implementation program

Below is a custom class, `L_stats`, that encapsulates the statistics described above. It takes a DataFrame and the name of its text column, and outputs the desired language statistics. 

&nbsp;

In [2]:
class L_stats(object):
    def __init__(self, raw, col_name):
        self.raw = raw
        self.col_name = col_name
        
    def tokenized(self):
        tokens = []
        for review in self.raw[self.col_name]:
            tokens.append([w.lower() for w in nltk.word_tokenize(review) if len(w)>2])
        return tokens
    
    def vocab(self):
        voc = []
        for review in self.raw[self.col_name]:
            voc += [w.lower() for w in nltk.word_tokenize(review) if len(w)>2]
        return voc
    
    def tokenized_uni(self):
        tokens = []
        for review in self.raw[self.col_name]:
            tokens.append(list(set([w.lower() for w in nltk.word_tokenize(review) if len(w)>2])))
        return tokens
    
    def vocab_uni(self):
        voc = []
        for review in self.raw[self.col_name]:
            voc += [w.lower() for w in nltk.word_tokenize(review) if len(w)>2]
        return list(set(voc))
    
    def avg_word_count(self):
        avg = 0
        v = self.tokenized()
        for review in v:
            avg += (len(review)/len(v))
        avg = np.round(avg, 2)
        return avg
    
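    def tokenized_no_stops(self):
        # Build the stopword set once; calling stopwords.words() for every token is very slow
        stops = set(stopwords.words('english'))
        DF = []
        for r in self.tokenized():
            # Tokens returned by tokenized() are already lower-cased
            DF.append([w for w in r if w not in stops])
        return DF
    
    def vocab_no_stops(self):
        stops = set(stopwords.words('english'))
        # Tokens returned by vocab() are already lower-cased
        return [w for w in self.vocab() if w not in stops]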
    
    def count_stats(self):
        print("Total number of words in all reviews: ", len(self.vocab()))
        print()
        print("Average number of words in each review: ", self.avg_word_count())
        print()
        print("Total number of words in all reviews without stopwords", len(self.vocab_no_stops()))
         
    def frequencies(self):
        print("Top 60 most frequently occurring words in the dataset, along with their count:")
        print()
        # most_common(60) may return fewer than 60 pairs, so iterate over the result directly
        for pair in nltk.FreqDist(self.vocab()).most_common(60):
            print(pair)
    
    def frequencies_stop(self):
        print("Top 30 most frequently occurring words in the data without stops, along with their count:")
        print()
        for pair in nltk.FreqDist(self.vocab_no_stops()).most_common(30):
            print(pair)
    
    def hapaxes(self):
        print("Top 60 most rare words that occur only once in the review dataset:")
        print()
        # Slice rather than index so that fewer than 60 hapaxes does not raise IndexError
        for word in nltk.FreqDist(self.vocab()).hapaxes()[:60]:
            print(word)
            
    def collocation(self):
        print("Most frequent bigrams (pairs of words): ")
        print()
        t = nltk.Text(self.vocab())
        # Text.collocations() prints its result and returns None,
        # so wrapping it in print() also emits a trailing "None"
        print(t.collocations())
                
    def concord(self, word):
        # Text.concordance() likewise prints the matches itself and returns None
        tt = nltk.Text(self.vocab())
        print(tt.concordance(word))
        
    def trigrams(self, word):
        # Print every three-token sequence whose last token is `word`
        t6 = nltk.Text(self.vocab())
        print(t6.findall(r'<.*> <.*> <{}>'.format(word)))
        
    def nouns(self):
        vvs = self.vocab()
        nouns = [w for (w, n) in nltk.pos_tag(vvs) if n in ['NN', 'NNP']]
        return nouns

In [3]:
z = pd.read_csv('/Users/Akashgupta/Desktop/NEWERFIN/disser/zero.csv')
o = pd.read_csv('/Users/Akashgupta/Desktop/NEWERFIN/disser/one.csv')
t = pd.read_csv('/Users/Akashgupta/Desktop/NEWERFIN/disser/two.csv')

In [4]:
Z = L_stats(z, 'REVIEW')
O = L_stats(o, 'REVIEW')
T = L_stats(t, 'REVIEW')

In [6]:
dfz = Z.tokenized()
dfo = O.tokenized()
dft = T.tokenized()

In [130]:
Lenovo = L_stats(lenneg, 'REVIEW')
HP = L_stats(hpneg, 'REVIEW')

In [131]:
df_l = Lenovo.tokenized()
df_h = HP.tokenized()

In [132]:
V_l = Lenovo.vocab()
V_h = HP.vocab()

&nbsp;

## Count statistics

The count statistics are given below.

- On average, people have written longer reviews for **Lenovo** than for **HP**.
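
The three count statistics can be illustrated on a toy corpus. The sketch below is dependency-free: it splits on whitespace instead of calling `nltk.word_tokenize`, and `TOY_STOPWORDS` is a small hypothetical stand-in for NLTK's English stopword list, so its numbers are not comparable to the outputs in this notebook.

```python
# Hypothetical mini-corpus; the notebook itself reads reviews from CSV files.
reviews = [
    "The battery life is good and the screen is bright",
    "Very slow laptop and not worth the money",
]

# Small stand-in for nltk.corpus.stopwords.words('english') (assumption).
TOY_STOPWORDS = {"the", "is", "and", "not", "very"}


def tokenize(text):
    # Lower-case and keep only tokens longer than two characters, as L_stats does.
    return [w.lower() for w in text.split() if len(w) > 2]


tokenized = [tokenize(r) for r in reviews]

# Total number of words across all reviews.
total_words = sum(len(r) for r in tokenized)

# Average number of words per review.
avg_length = round(total_words / len(tokenized), 2)

# Total number of words once stopwords are removed.
total_no_stops = sum(1 for r in tokenized for w in r if w not in TOY_STOPWORDS)

print(total_words, avg_length, total_no_stops)
```

Note that the length filter already drops two-letter stopwords such as "is" before the stopword list is applied.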

&nbsp;

In [7]:
Z.count_stats()

Total number of words in all reviews:  44671

Average number of words in each review:  60.2

Total number of words in all reviews without stopwords 29907


In [8]:
O.count_stats()

Total number of words in all reviews:  11271

Average number of words in each review:  20.16

Total number of words in all reviews without stopwords 8525


In [9]:
T.count_stats()

Total number of words in all reviews:  1713

Average number of words in each review:  12.98

Total number of words in all reviews without stopwords 1254


&nbsp;

## Frequent words

- Some of the high-frequency characteristic words for **Lenovo** are - **good, battery, slow, price, money, performance, poor, bad, processor**.

- Some of the high-frequency characteristic words for **HP** are - **good, performance, battery, fast, light, weight, display, SSD**. 
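
How such a frequency table is built can be sketched with `collections.Counter` from the standard library; `nltk.FreqDist` is a subclass of `Counter`, so `most_common` behaves the same way. The token list is a hypothetical example.

```python
from collections import Counter

# Hypothetical tokens, already lower-cased and stripped of stopwords.
tokens = ["good", "battery", "good", "slow", "battery", "good", "price"]

freq = Counter(tokens)

# The two most frequent words with their counts, highest count first.
top_two = freq.most_common(2)
print(top_two)
```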

&nbsp;

In [10]:
Z.frequencies_stop()

Top 30 most frequently occurring words in the data without stops, along with their count:

('laptop', 872)
('good', 326)
('product', 315)
("n't", 260)
('apple', 238)
('battery', 226)
('buy', 215)
('amazon', 206)
('one', 198)
('macbook', 171)
('like', 169)
('dell', 166)
('use', 164)
('windows', 153)
('...', 152)
('get', 142)
('mac', 142)
('even', 141)
('performance', 139)
('also', 138)
('time', 137)
('price', 136)
('bought', 135)
('service', 133)
('quality', 132)
('working', 128)
('screen', 123)
('display', 120)
('using', 119)
('better', 117)


In [11]:
O.frequencies_stop()

Top 30 most frequently occurring words in the data without stops, along with their count:

('good', 376)
('battery', 215)
('quality', 179)
('laptop', 176)
('performance', 151)
('product', 121)
('light', 99)
('life', 82)
('...', 80)
('display', 77)
('screen', 72)
('fast', 71)
('also', 66)
('weight', 65)
('slow', 65)
('use', 64)
('backup', 63)
('poor', 63)
('sound', 61)
('best', 58)
('like', 57)
('design', 54)
('great', 50)
('awesome', 48)
('overall', 48)
('price', 46)
('office', 44)
('nice', 40)
('processor', 40)
('money', 39)


In [12]:
T.frequencies_stop()

Top 30 most frequently occurring words in the data without stops, along with their count:

('money', 79)
('price', 72)
('product', 72)
('value', 57)
('good', 49)
('laptop', 42)
('best', 37)
('range', 31)
('nice', 26)
('great', 17)
('waste', 16)
('performance', 15)
('battery', 14)
('excellent', 12)
('...', 12)
('buy', 11)
("n't", 10)
('worth', 9)
('really', 8)
('one', 8)
('also', 7)
('awesome', 7)
('high', 7)
('quality', 6)
('like', 6)
('design', 6)
('weight', 6)
('screen', 5)
('display', 5)
('decent', 5)


&nbsp;

## Common words correlated with NOT

- In **Lenovo** reviews, the words that most often follow **NOT** are - **good, working, buy, worth, recommend, functioning** - indicating negative sentiment about these aspects. 

- In **HP** reviews, the words that most often follow **NOT** are - **working, worth, charging, suitable**.
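
The counts in this section come from adjacency: for every bigram whose first word is "not", the second word is tallied. A minimal stdlib sketch of the same idea, mirroring `nltk.ConditionalFreqDist(nltk.bigrams(tokens))` on a hypothetical token list, could look like this:

```python
from collections import Counter, defaultdict

# Hypothetical token stream; the notebook uses the concatenated tokens of all reviews.
tokens = ["not", "good", "battery", "not", "working", "and", "not", "good"]

# Map each word to a Counter of the words that immediately follow it.
following = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    following[first][second] += 1

# Words that follow "not", most frequent first.
after_not = following["not"].most_common()
print(after_not)
```

Because the reviews are concatenated into one token list before the bigrams are formed, a pair can occasionally span the boundary between two reviews.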

&nbsp;

In [13]:
bigram_L = nltk.bigrams(Z.vocab())
cfd2 = nltk.ConditionalFreqDist(list(bigram_L))
cfd2['not'].most_common(20)

[('working', 31),
 ('for', 20),
 ('buy', 19),
 ('good', 16),
 ('even', 15),
 ('that', 13),
 ('worth', 12),
 ('the', 10),
 ('sure', 10),
 ('recommend', 8),
 ('getting', 8),
 ('find', 8),
 ('work', 7),
 ('able', 7),
 ('recommended', 7),
 ('all', 7),
 ('available', 6),
 ('very', 6),
 ('charging', 6),
 ('installed', 6)]

In [14]:
bigram_H = nltk.bigrams(O.vocab())
cfd3 = nltk.ConditionalFreqDist(list(bigram_H))
cfd3['not'].most_common(20)

[('good', 28),
 ('for', 9),
 ('that', 7),
 ('worth', 7),
 ('recommended', 5),
 ('bad', 5),
 ('upto', 4),
 ('very', 4),
 ('the', 4),
 ('buy', 4),
 ('sure', 3),
 ('functioning', 3),
 ('all', 3),
 ('used', 3),
 ('have', 3),
 ('quite', 3),
 ('even', 3),
 ('enough', 3),
 ('available', 2),
 ('much', 2)]

In [15]:
bigram_H4 = nltk.bigrams(T.vocab())
cfd34 = nltk.ConditionalFreqDist(list(bigram_H4))
cfd34['not'].most_common(20)

[('amazon', 1),
 ('good', 1),
 ('bad', 1),
 ('available', 1),
 ('value', 1),
 ('refundable', 1),
 ('buy', 1),
 ('anti-glare', 1),
 ('excellent', 1),
 ('expected', 1),
 ('made', 1),
 ('for', 1)]

&nbsp;

## Hapaxes

- Some of the rare characteristic words for **Lenovo** are - **noise, novice, frustrated, scrap, trackpad**.

- Some of the rare characteristic words for **HP** are - **surprise, useless, multitasking, bandwidth**.
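
A hapax is simply a word whose count is exactly one. As a small illustration on hypothetical tokens, using only the standard library:

```python
from collections import Counter

# Hypothetical tokens from a review dataset.
tokens = ["slow", "battery", "trackpad", "slow", "noise", "battery"]

counts = Counter(tokens)

# Keep the words that occur exactly once, in order of first appearance.
hapaxes = [word for word, count in counts.items() if count == 1]
print(hapaxes)
```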

&nbsp;

In [16]:
Z.hapaxes()

Top 60 most rare words that occur only once in the review dataset:

4~5
permits
ish
rainy
season
os+dells
utilities
.gb
hynix
3200
latency
nos
reserved
80~90gb
partition
resistance
inserting
fear
breaking
encryption
74000rs
damages
shipping
tasking
maintain
temperature
farcry
assasins
creeed
entry
preffer
limitted
unsealed
surity
evenafter
assurance
havent
speakers-here
vist
bilaspur
c.g
techies
faqs
tensed
fail
occurred
predecessor
3590
mesh
gtx
pcie
lanes
premiere
filter
demanding
megabytes
2.0
mbps
hdmi,1


In [17]:
O.hapaxes()

Top 60 most rare words that occur only once in the review dataset:

sealed
loose
already
covered
65k+
layout
neatness
website
know
why
seven
reduced
paid
64k
change
green
increasing
gave
proper
others
adequately
somewhat
similar
fingerpint
supply
delicate
2hr
sealer
devlopment
sell
digree
celcius
winter
fine.but
website.however
assistance
immediately
cause
solution
ordered
nov
message
immediate
power-off/shutdown
lightning
data
11th
needing
mechanism
accidently
clicked
60hz
refresh
shift
good.graphics
.overall
powered
good🔥
130w
type-c


&nbsp;

## Collocations

- Some of the most common bigrams for **Lenovo** are - **battery life, battery backup, light weight, online classes, waste money, camera quality, price range, build quality**.

- Some of the most common bigrams for **HP** are - **light weight, build quality, watching movies, microsoft account, made china, customer care**.
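
To give an intuition for how a collocation differs from a merely frequent pair, the sketch below scores bigrams by pointwise mutual information (PMI). This is a swapped-in, simpler association measure chosen for illustration; it is not the likelihood-ratio score that `Text.collocations()` uses, and the tokens are hypothetical.

```python
import math
from collections import Counter

# Hypothetical token stream.
tokens = ("battery life good battery life poor light weight "
          "good light weight good screen").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)
n_bigrams = n_tokens - 1


def pmi(pair):
    # log2 of the observed bigram probability over the probability
    # expected if the two words occurred independently.
    w1, w2 = pair
    p_pair = bigrams[pair] / n_bigrams
    p_w1 = unigrams[w1] / n_tokens
    p_w2 = unigrams[w2] / n_tokens
    return math.log2(p_pair / (p_w1 * p_w2))


# Keep bigrams seen at least twice, then rank them by PMI, highest first.
candidates = [pair for pair, count in bigrams.items() if count >= 2]
ranked = sorted(candidates, key=pmi, reverse=True)
print(ranked)
```

Here "weight good" occurs as often as "battery life", but it ranks lower because "good" is common on its own.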

&nbsp;

In [18]:
Z.collocation()

Most frequent bigrams (pairs of words): 

battery life; macbook air; service center; light weight; stopped
working; build quality; battery backup; service centre; logic board;
customer care; macbook pro; android studio; price range; video
editing; 10th gen; finger print; sound quality; mother board;
microsoft office; online classes
None


In [19]:
O.collocation()

Most frequent bigrams (pairs of words): 

light weight; battery life; battery backup; build quality; sound
quality; screen quality; camera quality; macbook air; heating issues;
online classes; worth buying; waste money; hard drive; backlit
keyboard; also good; dont buy; fan sound; built quality; heating
issue; 4gb ram
None


In [20]:
T.collocation()

Most frequent bigrams (pairs of words): 

price range; waste money; light weight; nice product; online classes;
n't buy; battery life; build quality; great product; excellent
product; battery backup; good product; price point; low range; money
best; money ...; definitely value; awesome product; price awesome;
best laptop
None


&nbsp;

## Concordance

- For **Lenovo**, quite a lot of the contexts around **money** seem to involve negative words like **waste** and **not worth**, even though some positive words also appear. 

- For **HP**, a lot of the contexts around **money** seem to involve positive phrases like **value for**.
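
The idea behind a concordance can be sketched without NLTK: find each occurrence of the keyword and show a few tokens on either side. The helper `concordance` and its `width` parameter are hypothetical names for this illustration; NLTK's `Text.concordance` measures its window in characters rather than tokens.

```python
def concordance(tokens, keyword, width=3):
    # Return each occurrence of keyword with up to `width` tokens of context per side.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [keyword] + right))
    return lines


# Hypothetical token stream.
tokens = "total waste money never buy this laptop good value for money overall".split()

hits = concordance(tokens, "money", width=2)
for line in hits:
    print(line)
```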

&nbsp;

In [124]:
Lenovo.concord('screen')

Displaying 15 of 15 matches:
ptop 've ever seen even refresh the screen 'll take more than minutes just ref
ndows best product this price range screen quality good but battery lasts only
essor very slow battery backup poor screen quality not good the processor way 
t that good too slow sometimes also screen quality not great but battery life 
anged not good the battery life and screen the very pathetic very poor laptop 
ay led but the quality not like led screen also installed only msoffice and ad
prints wer clearly visible like the screen size strongly recommend not buying 
eating dead slow processor horrible screen laptop design was very good.but sou
king for good battery life also the screen sooo dim the charger not connected 
harging its very slower performance screen quality very poor battery hardly gi
eend the key attached the photo the screen slowest ever freezing frequently ju
ing and laptop back panel also heat screen resolution very poor showing error 
uality good.processor a

In [139]:
HP.concord('ssd')

Displaying 23 of 23 matches:
he laptop boots just second very fast ssd the shutdown also takes just second t
his laptop super fast because its m.2 ssd clients with 3rd and 4th generation c
ition word excel one note has 250 m.2 ssd storage blazing fast single slot ram 
s amazing have never ever thought the ssd can fast that the windows will boot j
ws office and powerful processor with ssd this segment upto ₹28,000 purchased t
owerpoint and basic editing tasks due ssd works like charm pros ssd makes prett
g tasks due ssd works like charm pros ssd makes pretty fast with 5-7 seconds bo
 save money and put that more ram and ssd also has only single ram slot you wou
mory please remember dual channel ram ssd the key smooth performance ssd alone 
el ram ssd the key smooth performance ssd alone will only make startup and shut
s intel icore5 10gen with ram and 512 ssd the system having high end configurat
 office utility for programming being ssd device quite fast than hdd machines k
l laptop ad

&nbsp;

## Trigrams

- Some common trigrams for **Lenovo** around **money** are - **laptop waste money, value for money, dont waste money, worth the money, buying waste money, awesome value money, hard earn money**.

- Some common trigrams for **Lenovo** around **performance** are - **expect great performance, such poor performance, very slow performance, fantastic best performance**.

- Some common trigrams for **HP** around **money** are - **value for money, spend some money, absolute waste money, worth for money**.

- Some common trigrams for **HP** around **performance** are - **job poor performance, good battery performance, much great performance, slow rubbish performance**.
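
The pattern `<.*> <.*> <word>` used by the `trigrams` method matches any two tokens followed by the keyword. A plain-Python equivalent in spirit, with the hypothetical helper name `trigrams_ending_with` and a hypothetical token list, is:

```python
def trigrams_ending_with(tokens, keyword):
    # Slide a window of three tokens and keep those whose last token is keyword.
    return [
        (a, b, c)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:])
        if c == keyword
    ]


# Hypothetical token stream.
tokens = "good value for money but very slow performance dont waste money".split()

money_trigrams = trigrams_ending_with(tokens, "money")
print(money_trigrams)
```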

&nbsp;

In [93]:
print(Lenovo.trigrams('product'))
print()

highspeed broadband product; worst possible product; this junk
product; via online product; open worst product
None
None



In [141]:
print(HP.trigrams('ssd'))
print()

very fast ssd; its m.2 ssd; 250 m.2 ssd; thought the ssd; processor
with ssd; tasks due ssd; charm pros ssd; ram and ssd; channel ram ssd;
smooth performance ssd; and 512 ssd; programming being ssd; 8gb and
ssd; ram and ssd; and 512 ssd; performance effect ssd; get 512 ssd;
240gb sata ssd; the windows ssd; please for ssd; and 512gb ssd;
purchased the ssd; 10th only ssd
None
None



&nbsp;

## Nouns

- Some nouns associated with **Lenovo** are - **display, processor, RAM, plastic, battery**.

- Some nouns associated with **HP** are - **light, battery, office, booting, SSD**.

&nbsp;

In [97]:
Lenovo.nouns()[:30]

['lenovo',
 'care',
 'action',
 'period',
 'ddr4',
 'care',
 'advice',
 'word',
 'vice',
 'city',
 'purchase',
 'malfunction',
 'start',
 'operation',
 'communication',
 'claim',
 'replacement',
 'warranty',
 'situation',
 'request',
 'visit',
 'september',
 'keypad',
 'replacement',
 'malfunction',
 'issue',
 'service',
 'request',
 'september',
 'response']

In [142]:
HP.nouns()[:30]

['review',
 'laptop',
 'month',
 'office',
 'account',
 'account',
 'shutdown',
 'time',
 'processing',
 'price',
 'work',
 'hour',
 'charge',
 'autocad',
 'performance',
 'one',
 'computer',
 'technician',
 'amazon',
 'setup',
 'super',
 'm.2',
 'ssd',
 'generation',
 'core',
 'difference',
 'year',
 'laptops',
 'cache',
 'excel']

In [9]:
df2.columns

Index(['DELL'], dtype='object')

In [10]:
for i in range(len(df2['DELL'])):
    print(str(i+1) + "--> " + df2['DELL'][i])
    print()

1--> Using it for last 4~5 days. This is my 4th laptop. Main usage is for development and when time permits little bit of gaming.

What I like.
1. Good processor. Provide balanced performance.
2. Though 100% plastic body shell, its overall build quality is good.
3. Mid range gpu enough for all purpose usage except for high end gaming.
4. Okay ish sound nothing to complain about.
5. Screen quality too is acceptable at this price point.
6. Though not used it, the laptop provides finger print scanner.
7. So far no issues with cooling but it's too early to comment as overall ambience is on a cooler side due to rainy season.

What I didn't like.
1. Only 8 GB ram. 4 GB is consumed by os+dells own utilities. In need to upgrade to 16 .GB.
2. Hynix DDR4 3200 ram module. Overall okay with performance point of view but should have put better ram with better Latency nos.
3. Samsung 256 GB ssd. Dell has reserved about 80~90gb for recovery partition. With os, Dell's own software with MS office and o

In [380]:
dfL = pd.read_csv('/Users/Akashgupta/Desktop/NEWERFIN/disser/GOLD_LABELS_len.csv')

In [412]:
df['LENOVO']

0      Very slow processor, very very laggy... This l...
1      It is total waste laptop. In fact its not a la...
2                                          It's awesome 
3      strongly recommend not buying laptops from Ama...
4      i think they are trying to pass out defective ...
                             ...                        
494    worst . very slow , and within six months keyb...
495    Looking very nice but working performance is v...
496                               Processor is very slow
497    Very bad laptop.. too much slow performance. S...
498                        Light weight and good product
Name: LENOVO, Length: 499, dtype: object

In [383]:
df['LABELS'] = dfL['LABELS']

In [384]:
df

| | LENOVO | LABELS |
|---|---|---|
| 0 | Very slow processor, very very laggy... This l... | 0 |
| 1 | It is total waste laptop. In fact its not a la... | 0 |
| 2 | It's awesome | 1 |
| 3 | strongly recommend not buying laptops from Ama... | 0 |
| 4 | i think they are trying to pass out defective ... | 0 |
| ... | ... | ... |
| 494 | worst . very slow , and within six months keyb... | 0 |
| 495 | Looking very nice but working performance is v... | 0 |
| 496 | Processor is very slow | 0 |
| 497 | Very bad laptop.. too much slow performance. S... | 0 |


In [386]:
dd = {0: 'neg', 1: 'pos'}

In [389]:
df['LABELS'] = df['LABELS'].map(dd)

In [403]:
Lenovo = L_stats(df, 'LENOVO')

In [418]:
DFL = Lenovo.tokenized_no_stops()
reviews_all = []
for i in range(len(df['LABELS'])):
    reviews_all.append((DFL[i], df['LABELS'][i]))

In [409]:
random.shuffle(reviews_all)

In [393]:
freq_words = nltk.FreqDist(Lenovo.vocab_no_stops())

In [396]:
word_features = list(freq_words)

In [399]:
def review_features(review, w_feat):
    review_words = set(review)
    features = {}
    for word in w_feat:
        features['contains({})'.format(word)] = (word in review_words)
    return features
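
To see what `review_features` produces, here is the same logic applied to a hypothetical tokenized review and a hypothetical three-word feature list. Each word in the feature list becomes one boolean feature.

```python
def review_features(review, w_feat):
    # One boolean feature per word in w_feat: does the review contain it?
    review_words = set(review)
    return {"contains({})".format(word): (word in review_words) for word in w_feat}


# Hypothetical review tokens and feature words.
example = review_features(["slow", "battery", "slow"], ["battery", "price", "slow"])
print(example)
```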

In [429]:
feature_sets = [(review_features(r, word_features), c) for (r, c) in reviews_all]

In [430]:
train_set, test_set = feature_sets[:350], feature_sets[350:]

In [431]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [432]:
nltk.classify.accuracy(classifier, test_set)

0.8389261744966443

In [435]:
classifier.show_most_informative_features(50)

Most Informative Features
          contains(best) = True              pos : neg    =     24.6 : 1.0
         contains(value) = True              pos : neg    =     15.2 : 1.0
         contains(great) = True              pos : neg    =     14.1 : 1.0
         contains(worst) = True              neg : pos    =      9.6 : 1.0
           contains(bit) = True              pos : neg    =      8.9 : 1.0
          contains(nice) = True              pos : neg    =      8.7 : 1.0
         contains(waste) = True              neg : pos    =      7.9 : 1.0
          contains(ever) = True              neg : pos    =      7.4 : 1.0
         contains(price) = True              pos : neg    =      7.0 : 1.0
        contains(return) = True              neg : pos    =      7.0 : 1.0
       contains(awesome) = True              pos : neg    =      6.8 : 1.0
          contains(slim) = True              pos : neg    =      6.8 : 1.0
        contains(budget) = True              pos : neg    =      6.6 : 1.0
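
Each ratio in this listing compares how likely a feature value is under one label versus the other. The sketch below reproduces that kind of ratio from raw counts. It assumes add-0.5 smoothing over the two feature values (True and False), which is my understanding of NLTK's default estimator for `NaiveBayesClassifier.train`; the counts and the helper name `smoothed_prob` are hypothetical.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    # Probability of a feature value given a label, with add-gamma smoothing
    # over `bins` possible feature values (assumed: True and False).
    return (count + gamma) / (total + gamma * bins)


# Hypothetical counts: the word appears in 9 of 19 positive reviews
# and in 0 of 9 negative reviews.
p_pos = smoothed_prob(9, 19)
p_neg = smoothed_prob(0, 9)

# How much more likely the word is in a positive review than in a negative one.
ratio = p_pos / p_neg
print("pos : neg = {:.1f} : 1.0".format(ratio))
```

Smoothing keeps the ratio finite even when a word never appears under one of the labels.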