# First try at NLTK analyses of the non-excluded beauty descriptions
Useful resource for basic operations: https://www.nltk.org/book/

For other processing steps: https://chrisalbon.com/machine_learning/preprocessing_text/remove_stop_words/

In [1]:
import nltk, re, pprint
from nltk import word_tokenize

Read the text

In [2]:
# Use a context manager so the file handle is closed after reading
with open('beauty_descriptions_US.txt', 'r') as f:
    raw = f.read()

Tokenize, convert to an NLTK Text, POS-tag the sentences, set to lowercase, grab total vocabulary

In [3]:
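# Word-level tokens and an nltk.Text wrapper (used for .similar() below)
tokens = word_tokenize(raw)
text = nltk.Text(tokens)

# Split into sentences, then tokenize and POS-tag each sentence
sentences = nltk.sent_tokenize(raw)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

# Lowercased tokens and the sorted set of unique word forms
words = [w.lower() for w in tokens]
vocab = sorted(set(words))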

Clean the text

In [4]:
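from nltk.corpus import stopwords
# A set makes the membership test below fast
stop_words = set(stopwords.words('english'))
punctuation = ['.', ',', '!', ';', '(', ')']
# Drop stop words and basic punctuation from the lowercased tokens
clean_text = nltk.Text([word for word in words if word not in stop_words and word not in punctuation])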

## Look for patterns, keywords, bigrams
There is a pre-built package, rake-nltk (https://pypi.org/project/rake-nltk/), that extracts keywords and key phrases from a text. However, the ranking it produced did not make sense to me; it seemed to reflect little more than alphabetical order.

For now, I think we can get away with a simple frequency count that ignores stop words.

Let's look at similar words and co-occurring words.

*Note:* It seems that we do not have enough data to find similar words once stop words and the like have been stripped.

In [5]:
# Find words that are similar to both 'beauty' and 'beautiful'.
# Logic: requiring similarity to both filters out random co-occurrences better than a single query would.
# NOTE: .similar() and .similar_words() produce different outputs, so it is useful to look at both.
# However, the overlap between the outputs for 'beauty' and 'beautiful' can only be computed
# automatically with similar_words(), since .similar() just prints its result.
text.similar('beautiful')
text.similar('beauty')

up there beauty younger i the reds trees out setting not perfect
amazing intense mesmerizing original kayak while bright breathtaking
beautiful nature top it experience crunch edge being part epitome
kayak anything while smells point sound grass company clearness bottom


Here I tried to extract only the overlap between the two similar-word lists, but that seems to fail and is not very interesting.

In [6]:
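# similar_words() is reached through the private _word_context_index attribute and returns a list
sim_beautiful = text._word_context_index.similar_words("beautiful")
sim_beauty = text._word_context_index.similar_words("beauty")
# Overlap attempt, left disabled:
#sim_beaut = [word for word in sim_beautiful if word in sim_beauty]
#print(sim_beaut)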

In [7]:
import numpy as np
# Print the two lists side by side: left column 'beautiful', right column 'beauty'
print(np.transpose([[sim_beautiful],[sim_beauty]]))

[[['dramatic' 'top']]

 [['transforming' 'nature']]

 [['peaceful' 'smell']]

 [['out' 'experience']]

 [['there' 'it']]

 [['younger' 'freedom']]

 [['up' 'perfume']]

 [['breathtaking' 'anything']]

 [['perfect' 'magic']]

 [['loaded' 'grass']]

 [['incredible' 'beautiful']]

 [['bright' 'edge']]

 [['warm' 'sound']]

 [['going' 'presence']]

 [['setting' 'birth']]

 [['blue' 'whole']]

 [['amazing' 'crunch']]

 [['beauty' 'part']]

 [['precious' 'epitome']]

 [['encased' 'smells']]]
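
The overlap between the two lists can be checked directly on the output printed above. The sketch below copies both columns by hand (left column for 'beautiful', right column for 'beauty') and takes their set intersection; the hard-coded lists are a transcription of that output, not a fresh NLTK run.

```python
# Transcribed from the printed output: left column = 'beautiful', right column = 'beauty'
sim_beautiful = [
    "dramatic", "transforming", "peaceful", "out", "there", "younger", "up",
    "breathtaking", "perfect", "loaded", "incredible", "bright", "warm",
    "going", "setting", "blue", "amazing", "beauty", "precious", "encased",
]
sim_beauty = [
    "top", "nature", "smell", "experience", "it", "freedom", "perfume",
    "anything", "magic", "grass", "beautiful", "edge", "sound", "presence",
    "birth", "whole", "crunch", "part", "epitome", "smells",
]

# Words that appear in both similar-word lists
overlap = sorted(set(sim_beautiful) & set(sim_beauty))
print(overlap)
```

For these two lists the intersection is empty, which would explain why the overlap did not look informative.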


In [8]:
from nltk import FreqDist

fdist_words = FreqDist(clean_text)
fdist_words.most_common(15)

[('beautiful', 69),
 ('experience', 35),
 ('could', 28),
 ('like', 27),
 ('beauty', 27),
 ('see', 26),
 ('one', 24),
 ('time', 22),
 ('went', 22),
 ('remember', 22),
 ('day', 21),
 ('saw', 20),
 ("'s", 20),
 ('would', 18),
 ("n't", 17)]
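
The counts above still include tokenizer fragments such as `'s` and `n't`, which the short punctuation list does not catch. One possible stricter filter, sketched here on a made-up token list, keeps only purely alphabetic tokens via `str.isalpha()`. The tokens, the tiny stop-word set, and the helper name `keep_token` are illustrative assumptions; `collections.Counter` stands in for `FreqDist`, which subclasses it.

```python
from collections import Counter

# Hypothetical lowercased tokens, similar in shape to word_tokenize output
tokens = ["it", "'s", "beautiful", ",", "i", "could", "n't",
          "stop", "looking", "--", "beautiful", "!"]
stop_words = {"it", "i"}  # hypothetical tiny stop list, for illustration only


def keep_token(tok, stop_words):
    """Keep tokens made only of letters that are not stop words."""
    return tok.isalpha() and tok not in stop_words


clean = [t for t in tokens if keep_token(t, stop_words)]
counts = Counter(clean)
print(counts.most_common(3))
```

Note that `isalpha()` also drops numbers and hyphenated words, so whether this filter is appropriate depends on the descriptions being analysed.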

In [9]:
# Bigram frequencies: use this rather than plain 'collocations' because it gives a proper frequency count.
bigrm = nltk.bigrams(clean_text)
fdist_bigrams = FreqDist(bigrm)
fdist_bigrams.most_common(10)

[(('beauty', 'experience'), 9),
 (('could', 'see'), 8),
 (('first', 'time'), 7),
 (('beautiful', 'experience'), 6),
 (('sun', 'setting'), 5),
 (('ever', 'seen'), 5),
 (('able', 'see'), 5),
 (('felt', 'like'), 5),
 (('made', 'feel'), 5),
 (('top', 'mountain'), 4)]
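
Because the bigrams are built from `clean_text`, a pair such as ('top', 'mountain') may join words that were not adjacent in the original descriptions; they can become neighbours only after stop words are removed. The toy sentence and stop-word set below are assumptions, and `zip` is used as a plain-Python stand-in for `nltk.bigrams`.

```python
from collections import Counter


def bigrams(seq):
    """Return adjacent pairs from a list of tokens."""
    return list(zip(seq, seq[1:]))


sentence = "we stood at the top of the mountain".split()
stop_words = {"we", "at", "the", "of"}  # hypothetical stop list

# Bigrams over the original token sequence
raw_bigrams = bigrams(sentence)

# Bigrams after stop-word removal, as done for clean_text
filtered = [w for w in sentence if w not in stop_words]
filtered_bigrams = bigrams(filtered)

print(Counter(filtered_bigrams))
print(("top", "mountain") in raw_bigrams)
```

If strictly adjacent pairs are wanted, one option is to compute the bigrams first and discard pairs containing a stop word afterwards.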

## Use the empath package for LIWC-like analyses

In [10]:
from empath import Empath
lexicon = Empath()

lexicon.analyze(raw, normalize=True)

ModuleNotFoundError: No module named 'empath'
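
The error above means that the `empath` package is not installed in the environment running this notebook. A small guard like the following, a sketch using only the standard library, checks for a module before the analysis cells are run; the helper name `has_module` is hypothetical.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if has_module("empath"):
    print("empath is available")
else:
    print("empath is missing; install it before running the cells below")
```

The package is typically installed with `pip install empath`, after which the cell above should be re-run.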

In [None]:
lexicon.analyze(raw, categories=["social"])

In [None]:
lexicon.analyze(raw, categories=["beauty"])