# NLTK analyses of all non-excluded beauty memory descriptions across experiments
Useful resource for basic operations: https://www.nltk.org/book/

For other processing steps: https://chrisalbon.com/machine_learning/preprocessing_text/remove_stop_words/

In [11]:
import nltk, re, pprint
from nltk import word_tokenize

Read the text

In [12]:
# Use a context manager so the file is closed after reading
with open('all_descriptions_men.txt', 'r') as f:
    raw = f.read()

Tokenize, convert to text, set to lowercase, grab total vocabulary

In [13]:
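# Tokenize the raw text and wrap the tokens in an nltk.Text for exploration
tokens = word_tokenize(raw)
text = nltk.Text(tokens)

# Split into sentences, tokenize each sentence, then tag parts of speech
sentences = nltk.sent_tokenize(raw)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

# Lowercase all tokens and collect the sorted vocabulary
words = [w.lower() for w in tokens]
vocab = sorted(set(words))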

Clean the text

In [14]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
punctuation = ['.',',','!',';','(',')']
clean_text = nltk.Text([word for word in words if word not in stop_words and word not in punctuation])

## Look for patterns, keywords, bigrams
There is a pre-built package that extracts keywords and key phrases from a text: https://pypi.org/project/rake-nltk/. However, the ranking it produces does not make sense to me; it seems to capture little more than alphabetical order.

For now, I think we can get away with a simple frequency count that discounts stop words.

Let's look at similar words and co-occurring words.

*Note:* It seems we do not have enough data to find similar words once stop words and the like have been stripped.

In [15]:
# Find words that are similar to both "beauty" and "beautiful".
# Logic: keeping only what applies to both filters out random co-occurrences better than a single query does.
# NOTE: .similar() and .similar_words() produce different outputs, so it is useful to look at both.
# However, .similar() only prints its result, so the overlap between the outputs for
# "beauty" and "beautiful" can only be computed automatically with .similar_words().
text.similar('beautiful')
text.similar('beauty')

happy there beauty peaceful the that great pretty good warm calm reds
setting amazing when nice younger bright breathtaking one
beautiful time lake it experience me air moment beach way mountain
grass wind nature stars snow distance city day mountains


Here, I tried to find only the overlapping portion of the similar words, but that seems to fail and is not very interesting.

In [16]:
sim_beautiful = text._word_context_index.similar_words("beautiful")
sim_beauty = text._word_context_index.similar_words("beauty")
#sim_beaut = [word for word in sim_beautiful if word in sim_beauty]
#print(sim_beaut)

In [17]:
import numpy as np
print(np.transpose([[sim_beautiful],[sim_beauty]]))

[[['when' 'top']]

 [['peaceful' 'smell']]

 [['beauty' 'end']]

 [['great' 'experience']]

 [['happy' 'middle']]

 [['amazing' 'sound']]

 [['born' 'sounds']]

 [['dramatic' 'scent']]

 [['transforming' 'edge']]

 [['spiritual' 'bottom']]

 [['warm' 'presence']]

 [['there' 'rest']]

 [['that' 'birth']]

 [['perfect' 'beautiful']]

 [['out' 'mountains']]

 [['calm' 'day']]

 [['blue' 'city']]

 [['clean' 'moment']]

 [['proud' 'mother']]

 [['if' 'side']]]
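
The two columns printed above can be intersected directly. A minimal sketch, with both lists copied by hand from that output rather than recomputed (so no NLTK call is involved), suggests that the overlap is empty for this corpus, which fits the impression that the approach does not yield much here:

```python
# Lists copied from the printed output above:
# left column = similar to "beautiful", right column = similar to "beauty".
sim_beautiful = ['when', 'peaceful', 'beauty', 'great', 'happy', 'amazing',
                 'born', 'dramatic', 'transforming', 'spiritual', 'warm',
                 'there', 'that', 'perfect', 'out', 'calm', 'blue', 'clean',
                 'proud', 'if']
sim_beauty = ['top', 'smell', 'end', 'experience', 'middle', 'sound',
              'sounds', 'scent', 'edge', 'bottom', 'presence', 'rest',
              'birth', 'beautiful', 'mountains', 'day', 'city', 'moment',
              'mother', 'side']

# Keep the order of sim_beautiful while testing membership against a set.
beauty_set = set(sim_beauty)
overlap = [word for word in sim_beautiful if word in beauty_set]
print(overlap)
```

Each word appears in the other word's list ("beauty" on the left, "beautiful" on the right), but no third word is shared between the two.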


In [18]:
from nltk import FreqDist

fdist_words = FreqDist(clean_text)
fdist_words.most_common(15)

[('beautiful', 131),
 ('day', 82),
 ('time', 81),
 ('like', 75),
 ('beauty', 66),
 ('felt', 60),
 ('experience', 58),
 ('one', 58),
 ('see', 56),
 ('saw', 53),
 ("'s", 51),
 ('remember', 49),
 ('life', 49),
 ('could', 48),
 ('went', 45)]
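
The entry `"'s"` in the list above is a tokenizer fragment that survived the cleaning step, since it was caught by neither the stop word list nor the hand-written punctuation list. One possible stricter filter keeps only purely alphabetic tokens. The sketch below runs on a made-up token list with a tiny stand-in stop word set (both are assumptions for illustration, not the corpus or NLTK's list). Note that `str.isalpha()` also drops hyphenated words and numbers, which may or may not be desirable.

```python
from collections import Counter

# Hypothetical sample tokens and a tiny stand-in stop word set (assumptions).
sample_words = ["the", "lake", "'s", "surface", "was", "beautiful", ",",
                "so", "beautiful", "!"]
sample_stop_words = {"the", "was", "so"}

# Drop stop words and any token that is not purely alphabetic.
filtered = [w for w in sample_words
            if w not in sample_stop_words and w.isalpha()]

counts = Counter(filtered)
print(counts.most_common(2))
```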

In [19]:
# Bigram frequencies: use this rather than 'collocations' because it gives a proper frequency count.
bigrm = nltk.bigrams(clean_text)
fdist_bigrams = FreqDist(bigrm)
fdist_bigrams.most_common(10)

[(('first', 'time'), 25),
 (('could', 'see'), 17),
 (('beauty', 'experience'), 14),
 (('felt', 'like'), 12),
 (('beautiful', 'experience'), 11),
 (('ever', 'seen'), 9),
 (('looked', 'like'), 7),
 (('one', 'day'), 7),
 (('years', 'ago'), 7),
 (('sun', 'setting'), 6)]
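
For reference, the bigram count amounts to pairing each token with its successor and counting the pairs. A minimal sketch with the standard library alone, on an invented token list (an assumption for illustration, not the corpus):

```python
from collections import Counter

# Hypothetical, already-cleaned token sequence (assumption).
sample_tokens = ["first", "time", "saw", "ocean",
                 "first", "time", "felt", "small"]

# Pair each token with the one that follows it, then count the pairs.
sample_bigrams = list(zip(sample_tokens, sample_tokens[1:]))
bigram_counts = Counter(sample_bigrams)
print(bigram_counts.most_common(1))
```

Because stop words were removed before pairing, a counted pair can join two words that were not adjacent in the original descriptions.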

## Use the empath package for LIWC-like analyses

In [20]:
from empath import Empath
lexicon = Empath()

lexicon.analyze(raw, normalize=True)

{'help': 0.0010217574227671595,
 'office': 0.0007813439115278278,
 'dance': 0.001322274311816324,
 'money': 0.000661137155908162,
 'wedding': 0.005950234403173459,
 'domestic_work': 0.002584445245822815,
 'sleep': 0.002824858757062147,
 'medical_emergency': 0.0008414472893376608,
 'cold': 0.006010337780983291,
 'hate': 0.001322274311816324,
 'cheerfulness': 0.0023440317345834838,
 'aggression': 0.0003606202668589975,
 'occupation': 0.0007813439115278278,
 'envy': 0.00018031013342949875,
 'anticipation': 0.0008414472893376608,
 'family': 0.005649717514124294,
 'vacation': 0.008354369515566775,
 'crime': 0.0005409304002884963,
 'attractive': 0.013282846495973074,
 'masculine': 0.000721240533717995,
 'prison': 0.00018031013342949875,
 'health': 0.0003606202668589975,
 'pride': 0.0007813439115278278,
 'dispute': 0.0004808270224786633,
 'nervousness': 0.0031854790239211443,
 'government': 0.00018031013342949875,
 'weakness': 0.0004207236446688304,
 'horror': 0.00030051688904916455,
 'sweari

In [21]:
lexicon.analyze(raw, categories=["social"])

{'social': 0.0}

In [22]:
lexicon.analyze(raw, categories=["beauty"])

{'beauty': 263.0}
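
The last two cells return raw counts, whereas the first `analyze` call passed `normalize=True` and so returned proportions. As far as I understand, the normalized score is the category count divided by the total number of tokens. The sketch below illustrates that relationship with a made-up category and sentence; it does not call Empath and does not reproduce its lexicon or tokenizer.

```python
# Hypothetical category lexicon and tokens (assumptions for illustration only).
toy_category = {"beautiful", "gorgeous"}
toy_tokens = "the view was beautiful and the sky was gorgeous".split()

# Raw count: number of tokens that belong to the category.
raw_count = sum(1 for tok in toy_tokens if tok in toy_category)

# Normalized score: raw count divided by the total number of tokens.
normalized = raw_count / len(toy_tokens)
print(raw_count, normalized)
```

To compare categories on the same scale as the first output, the raw counts would need the same `normalize=True` argument.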