# First try at NLTK analyses of the non-excluded beauty descriptions from the UK
Useful resource for basic operations: https://www.nltk.org/book/

For other processing steps: https://chrisalbon.com/machine_learning/preprocessing_text/remove_stop_words/

In [1]:
import nltk, re, pprint
from nltk import word_tokenize

Read the text

In [2]:
# Use a context manager so the file is closed after reading
with open('beauty_descriptions_UK.txt', 'r') as f:
    raw = f.read()

Tokenize, convert to text, set to lowercase, grab total vocabulary

In [3]:
tokens = word_tokenize(raw)
text = nltk.Text(tokens)

sentences = nltk.sent_tokenize(raw)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

words = [w.lower() for w in tokens]
vocab = sorted(set(words))

Clean the text

In [4]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
punctuation = ['.',',','!',';','(',')']
clean_text = nltk.Text([word for word in words if word not in stop_words and word not in punctuation])
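
The hand-written punctuation list above misses tokens such as `?`, ` `` ` and `''`, which show up in the frequency counts further down. A minimal sketch of a more general filter follows, assuming it is acceptable to drop every token that contains no letter or digit. The `clean_tokens` helper and the tiny `STOP_WORDS` set are hypothetical and only for illustration; the notebook itself uses `stopwords.words('english')`.

```python
# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# stopwords.words('english') from NLTK instead.
STOP_WORDS = {"the", "a", "was", "it", "and", "i"}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Lowercase tokens, then drop stop words and tokens without any letter or digit."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stop_words:
            continue
        # Keep a token only if it has at least one alphanumeric character,
        # which removes '.', '?', '``', "''" and similar tokenizer output.
        if not any(ch.isalnum() for ch in tok):
            continue
        cleaned.append(tok)
    return cleaned

sample = ["It", "was", "a", "summers", "day", "``", "Beautiful", "''", "?", "'s"]
print(clean_tokens(sample))
```

Note that a clitic such as `'s` survives this filter because it contains a letter; it would have to be added to the stop-word set if it should be excluded as well.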

## Look for patterns, keywords, bigrams
There is a pre-built package, rake-nltk (https://pypi.org/project/rake-nltk/), that extracts keywords and key phrases from a text. However, the ranking it produced here did not make sense to me; it seemed merely to reflect alphabetical order.

For now, I think we can get away with a simple frequency count that excludes stop words.

Let's look at similar words, and co-occurring words. 

*Note:* It seems we do not have enough data to find similar words once stop words and the like have been stripped, so the similarity calls below run on the unfiltered `text`.

In [5]:
# Find words that are similar to both "beauty" and "beautiful".
# Logic: keeping only what applies to both filters out random co-occurrences better than a single query does.
# NOTE: .similar() and .similar_words() produce different outputs, so it is useful to look at both.
# However, .similar() only prints its result; .similar_words() returns a list, so only that one lets us compute the overlap automatically.
text.similar('joy')

# Similar words for "beauty" (the context index is built by the .similar() call above)
sim_beauty = text._word_context_index.similar_words("beauty")

down its


In [6]:
print(sim_beauty)

['smell', 'top', 'end', 'sound', 'scent', 'sounds', 'sense', 'edge', 'banks', 'ruins', 'side', 'moment', 'silence', 'lapping', 'tang', 'scenery', 'day', 'amount', 'bottom', 'size']
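
The comments in the cell above describe taking the overlap between the similar words for "beauty" and "beautiful", but only the "beauty" list is computed. A minimal sketch of the overlap step follows. The `shared_similar` helper is hypothetical, and `sim_beautiful` is a made-up placeholder rather than real output; in the notebook it would presumably come from a second `similar_words("beautiful")` call.

```python
def shared_similar(first, second):
    """Return the words present in both lists, in the order they appear in `first`."""
    second_set = set(second)
    return [w for w in first if w in second_set]

# First ten entries of the sim_beauty output printed above.
sim_beauty = ['smell', 'top', 'end', 'sound', 'scent',
              'sounds', 'sense', 'edge', 'banks', 'ruins']

# Made-up placeholder, NOT real output: replace with the list returned for "beautiful".
sim_beautiful = ['scent', 'quiet', 'smell', 'warm']

print(shared_similar(sim_beauty, sim_beautiful))
```

Preserving the order of the first list keeps the result ranked the same way as the `similar_words` output for "beauty".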


In [7]:
from nltk import FreqDist

fdist_words = FreqDist(clean_text)
fdist_words.most_common(15)

[('day', 42),
 ('beautiful', 38),
 ('time', 31),
 ('could', 30),
 ("''", 30),
 ('like', 28),
 ('see', 26),
 ("'s", 26),
 ('felt', 23),
 ('smell', 21),
 ('beauty', 20),
 ('sun', 19),
 ('life', 18),
 ('water', 18),
 ('saw', 18)]

In [8]:
# Bigram frequencies: use this rather than plain collocations, because here we get a proper frequency count.
bigrm = nltk.bigrams(clean_text)
fdist_bigrams = FreqDist(bigrm)
fdist_bigrams.most_common(10)

[(('first', 'time'), 11),
 (('could', 'see'), 11),
 (('``', "''"), 5),
 (('felt', 'like'), 4),
 (('could', 'feel'), 4),
 (("''", "''"), 4),
 (('summers', 'day'), 3),
 (('said', '``'), 3),
 (('?', "''"), 3),
 (('could', 'hear'), 3)]
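
Several of the top bigrams above consist of quotation-mark and question-mark tokens. A minimal sketch of counting only word-to-word bigrams follows, using `collections.Counter` instead of `FreqDist` so that it runs without NLTK. The helper names `is_word` and `word_bigram_counts` are hypothetical, and the sample tokens are invented for illustration.

```python
from collections import Counter

def is_word(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

def word_bigram_counts(tokens):
    """Count adjacent token pairs, skipping pairs that include a punctuation-only token."""
    pairs = zip(tokens, tokens[1:])
    return Counter(p for p in pairs if is_word(p[0]) and is_word(p[1]))

# Invented sample; in the notebook this would be something like list(clean_text).
sample = ["said", "``", "could", "see", "?", "''", "could", "see", "first", "time"]
counts = word_bigram_counts(sample)
print(counts.most_common(1))
```

Because pairs are filtered after pairing, two words separated by a punctuation token are not counted as a bigram, which is a design choice: filtering the tokens first would instead join words across the punctuation.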

## Use the empath package for LIWC-like analyses

In [9]:
from empath import Empath
lexicon = Empath()

lexicon.analyze(raw, normalize=True)

ModuleNotFoundError: No module named 'empath'
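
The error above means the `empath` package is not installed in the environment the notebook runs in. A minimal sketch of guarding the call follows, assuming the package can be installed from PyPI under the name `empath`. The `empath_scores` wrapper and the sample sentence are hypothetical; the `Empath().analyze(..., normalize=True)` call is the one used in the cell above.

```python
import importlib.util

def empath_scores(text):
    """Return normalized Empath category scores, or None if empath is not installed."""
    if importlib.util.find_spec("empath") is None:
        # Assumption: the package installs from PyPI as `empath`.
        print("empath is not installed; try `pip install empath` and rerun.")
        return None
    from empath import Empath
    return Empath().analyze(text, normalize=True)

# Invented sample sentence; in the notebook this would be `raw`.
scores = empath_scores("The sun on the water was beautiful.")
```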