#### Overview

Part 1: Basic text analysis

Part 2: Cleaning and normalization

Part 3: Basic text analysis with our clean text

Ideas for complex/advanced applications

### Part 1: Basic Text Analysis 

NLTK (Natural Language Toolkit):

A Python library with functions designed specifically for analyzing natural (human, as opposed to computer) language.

NLTK must be imported in any Python script that uses it.


In [2]:
import nltk
from urllib.request import urlopen


NLTK's built-in sample texts include *Moby Dick* and *Sense and Sensibility*, among others. Note the assumptions this implies about how NLTK will be used: novels, long, canonical...

#### **NLTK functions: Concordance()**

- Called on a Text object and takes a string (a sequence of characters) as an argument

- text_variable.concordance(string_arg)

- Calling concordance("word") displays the words that surround "word" each time it occurs, giving us a glimpse of the contexts in which the word shows up. For example:

In [None]:
# click the icon that looks like a folder with a plus sign in the left menu, create a new folder called textfiles
# add one or more text files
files_path = 'textfiles'


In [3]:
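# to bring in a file from within JupyterHub
# (if word_tokenize raises a LookupError, run nltk.download('punkt'),
#  or nltk.download('punkt_tab') on newer NLTK releases)
with open("Liberator91901.txt", "r") as my_file:   # the file is closed automatically
    file_txt = my_file.read()
txt_tokens = nltk.word_tokenize(file_txt)
txt_prepped = nltk.Text(txt_tokens)

# later cells refer to the text as "liberator", so give it that name too
liberator = txt_prepped

txt_prepped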

<Text: The Liberator Our martyred president , William McKinley...>

In [None]:
# to bring in file from URL
#my_url = "https://raw.githubusercontent.com/ucla/ca-dhri/main/Day2/Liberator91901.txt"

#file = urlopen(my_url)
#liberator_raw = file.read()
#liberator_txt = liberator_raw.decode()
#txt_tokens = nltk.word_tokenize(liberator_txt)
#liberator = nltk.Text(txt_tokens)

#let's make sure we created a text object
#liberator

In [4]:
#call concordance on the word "law"
txt_prepped.concordance("law")

Displaying 25 of 26 matches:
of white men changed their tune Lynch law appears to be on the increase in this
ishment under the solemn forms of the law is an effective deterrent of crime . 
ther , with ever-lessening regard for law and justice . Negro crime can never b
will be punished , if at all , by the law . It was not the uncertainty of the l
w . It was not the uncertainty of the law that impelled the mob , but its certa
led the mob , but its certainty . The law would have singled out the guilty wre
le motive but contempt or defiance of law . If two white men had to be killed t
dy forfeited , but the majesty of the law was worth saving at any cost . Fresno
s one of the worst known to ethics or law . To what can be ascribed the cause o
iased observer that the laxity of the law against lynchers in the South has had
pon his head the violent hands of the law and the hatred of mankind . But the m
re , become popular , and respect for law ceases to be a duty or a virtue . Thi
ious . If t
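
To make the idea behind concordance() more concrete, here is a rough plain-Python sketch of a keyword-in-context view. It is not NLTK's implementation: the function name `kwic`, the `width` parameter, and the sample sentence are all made up for illustration, and the sketch ignores case when matching.

```python
def kwic(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical sample sentence, split on whitespace for simplicity
sample_tokens = "respect for law ceases when the law is not enforced".split()

for line in kwic(sample_tokens, "law"):
    print(line)
```

Each printed line shows one occurrence of the keyword in brackets, with its neighbors on either side.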



#### **NLTK functions: Similar()**

- text.similar(string_arg)

- Like concordance, similar finds the contexts in which the given string appears, but it then looks for other words that appear in those same contexts and lists them

In [None]:
liberator.similar("law")
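
One way to picture "words used in similar contexts" is to treat a context as the pair (previous word, next word) and look for other words that share such pairs with the target. The sketch below is a simplified illustration of that idea in plain Python, not NLTK's actual algorithm; `similar_words` and the sample sentence are hypothetical.

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Rank words by how many (previous, next) contexts they share with target."""
    tokens = [t.lower() for t in tokens]
    target = target.lower()

    # Collect the set of (previous word, next word) pairs seen around each word
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))

    target_contexts = contexts[target]
    shared = {}
    for word, ctxs in contexts.items():
        overlap = len(ctxs & target_contexts)
        if word != target and overlap:
            shared[word] = overlap

    # Most shared contexts first; ties broken alphabetically
    return sorted(shared, key=lambda w: (-shared[w], w))

# Hypothetical sample text
sample_tokens = ("the law is just . the court is just . "
                 "the law was ignored . the mob was ignored").split()

print(similar_words(sample_tokens, "law"))
```

Here "court" and "mob" each appear between the same neighbors as "law" at least once, so they are reported as similar.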

#### **NLTK functions: dispersion plot**

- text.dispersion_plot(list_arg)

- Takes a list of strings as input (not a single string!) and outputs a plot showing where in the text each word appears. To plot a single word, pass the function a one-item list: ["example"]

- Note: dispersion_plot() is helpful for seeing how language changes over time or over a narrative arc. It might be more useful on a large collection of newspapers spanning many years than on a single newspaper issue.


In [None]:
liberator.dispersion_plot(["law"])

In [None]:
liberator.dispersion_plot(["law","furniture","Angeles"])
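
The data behind a dispersion plot is simply the position (offset) of each occurrence of each word in the token list. As a rough sketch of that idea, assuming a hypothetical helper `word_offsets` and a made-up sample sentence, and matching case exactly:

```python
def word_offsets(tokens, words):
    """Map each word to the list of token positions where it appears (case-sensitive)."""
    return {w: [i for i, t in enumerate(tokens) if t == w] for w in words}

# Hypothetical sample text, split on whitespace
sample_tokens = ("law and order ; the law of Los Angeles ; "
                 "furniture for sale in Los Angeles").split()

offsets = word_offsets(sample_tokens, ["law", "furniture", "Angeles"])
for word, positions in offsets.items():
    print(word, positions)
```

A dispersion plot draws one row per word and a mark at each of these positions, so clusters of marks show where in the text a word is concentrated.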

#### **NLTK functions: count()**

- Takes a word as an argument and returns the number of times that word appears in the text.

- It is case sensitive (we will address case in data cleaning)

In [None]:
liberator.count("Angeles")

In [None]:
liberator.count("angeles")

### Practice in Breakout Rooms

Can you think of how we could generate a concordance that would allow us to extract addresses from the text? What might we do with that information?

#### **Python operations on NLTK objects: len(), set()**

So far, we have been using NLTK's built-in methods to analyze our text object. We can also use regular Python functions on it:

**len(text_object)** returns the length of the NLTK object, that is, the number of words in the text. In a text that has not been cleaned yet, this count includes punctuation and metadata.

**set(text_object)** creates a set (an unordered collection without duplicates) of all the unique words

**len(set(text_object))** returns, you guessed it, the length of the set of unique words, i.e., the number of unique words in the text.

In [None]:
len(liberator)

In [None]:
set(liberator)
sorted(set(liberator))[:30]

In [None]:
len(set(liberator))

### Part 2: Cleaning and Normalizing Data

#### **Removing capitalization and punctuation**

Type vs. token: 
- A type is a distinct word form: angeles, Angeles, and ANGELES are three different types
    
- A token is an instance of a type

- text.count("angeles") counts the number of tokens of the type angeles
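
The type/token distinction can be checked on a small example. This is a minimal sketch using only the standard library; the token list is invented for illustration and is not taken from the newspaper text.

```python
from collections import Counter

# Hypothetical token list containing three spellings of the same word
sample_tokens = ["Los", "Angeles", "is", "large", ";",
                 "angeles", "and", "ANGELES", "and", "Angeles"]

type_counts = Counter(sample_tokens)          # one entry per distinct type
print(len(sample_tokens))                     # total number of tokens
print(len(type_counts))                       # number of distinct types
print(type_counts["Angeles"])                 # tokens of the exact type "Angeles"

# After lowercasing, the three spellings collapse into a single type
lowercase_counts = Counter(t.lower() for t in sample_tokens)
print(lowercase_counts["angeles"])
```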

If the distinction between cases isn't important in your analysis, making all values of a text lowercase can be useful.

So we'll start normalizing the text by making all words lowercase:

In [None]:
liberator_lowercase = [word.lower() for word in liberator] 

In [None]:
liberator_lowercase.count("angeles")

#### **Remove all punctuation**

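Let's run this (note that isalpha() drops anything that is not purely letters, so numbers and hyphenated words are removed along with punctuation):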

In [None]:
liberator_lowercase_textonly = [word.lower() for word in liberator if word.isalpha()]

In [None]:
#What did we just do, though?
#The list comprehension in the previous cell is shorthand for this loop:

liberator_lowercase_textonly = []    #define an empty list called liberator_lowercase_textonly

for w in liberator:                  #for each word ("w") in our existing text object
    if w.isalpha():                  #if the word ("w") is made up only of letters (not punctuation or digits)
        liberator_lowercase_textonly.append(w.lower())    #make it lowercase and add it to our new list
    #if the word is not alphabetic, the for loop moves on to the next word
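
To see exactly what the isalpha() filter keeps and drops, it can help to try it on a handful of tokens. The list below is hypothetical and hand-written (it is not real word_tokenize output); it also shows that the list comprehension and the explicit loop give the same result.

```python
# Hypothetical tokens chosen to show the edge cases
sample_tokens = ["Lynch", "law", ",", "ever-lessening", "1901", "it's", "Angeles", "."]

# List-comprehension form
kept_comprehension = [t.lower() for t in sample_tokens if t.isalpha()]

# Equivalent explicit loop
kept_loop = []
for t in sample_tokens:
    if t.isalpha():
        kept_loop.append(t.lower())

# Everything the filter throws away
dropped = [t for t in sample_tokens if not t.isalpha()]

print(kept_comprehension)
print(kept_loop == kept_comprehension)
print(dropped)
```

Note that "ever-lessening" and "1901" are dropped along with the punctuation, which may or may not be what your analysis needs.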


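TIP: choose descriptive variable names, such as liberator_lowercase_textonly, so you can tell at a glance which cleaning steps a list has been through.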

#### **Removing stop words**

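- Stop words are very common words ("the," "an," "a," etc.) that carry little meaning on their own and are often removed before analysis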


In [None]:
#a predefined list of stopwords, how nice!
nltk.download('stopwords')
from nltk.corpus import stopwords

#note, stopwords isn't a list, it's an NLTK corpus reader object
print(type(stopwords))

#if we want a "real" list, we need to call the words() method
#(with no argument it returns the stopwords for every available language)
print(type(stopwords.words()))

#here's what we have for English:
print(stopwords.words('english'))

In [None]:
#Get rid of stopwords in our text

#make sure you import stopwords somewhere in the script before you call it
stops = stopwords.words('english')

#define a new list
liberator_sans_stops = []
for word in liberator_lowercase_textonly:    #for each word in our cleaner list,
    if word not in stops:                    #if that word isn't in the list of stopwords,
        liberator_sans_stops.append(word)    #add it to our new list


In [None]:
#did it work?
liberator_sans_stops.count("angeles")

In [None]:
liberator_sans_stops.count("of")

In [None]:
liberator_sans_stops.count("after")
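
The same filtering step can be tried on a toy example without downloading anything. This sketch assumes a deliberately tiny, made-up stop list (`toy_stops`); NLTK's English list is much longer. Storing the stop words in a set rather than a list is a design choice that makes each membership check faster on long texts.

```python
# Hypothetical mini stop list, stored as a set for fast lookups
toy_stops = {"the", "of", "a", "an", "and", "after", "is"}

# Hypothetical tokens that are already lowercased and punctuation-free
sample_tokens = ["the", "majesty", "of", "the", "law", "is", "worth", "saving"]

content_words = [t for t in sample_tokens if t not in toy_stops]

print(content_words)
print(content_words.count("of"))   # stop words should no longer appear
```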

#### **Lemmatizing words**

- Lemmatization reduces words to their dictionary form (lemma)
    - for example, cats ⭢ cat and walked ⭢ walk
  
- This gets complicated with irregular forms such as men ⭢ man and sang ⭢ sing 

- Lemmatization looks up a word in a reference dictionary and finds the appropriate root (though this is still not entirely accurate, and it can be slow, since each word must be looked up in the reference)

- NLTK comes with pre-built stemmers and lemmatizers. 

In [None]:
#the lemmatizer relies on the WordNet data, which has to be downloaded once
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

#create an instance of the lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

print(wordnet_lemmatizer.lemmatize("children"))

#with no part of speech given, the word is treated as a noun
print(wordnet_lemmatizer.lemmatize("better"))

#for a word like "better," we need to specify the part of speech
print(wordnet_lemmatizer.lemmatize("better", pos='a'))
print(wordnet_lemmatizer.lemmatize("better", pos='n'))

#Parts of speech
#ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"

#lemmatize every word in our cleaned list
liberator_lemma = []
for word in liberator_sans_stops:
    word_lem = wordnet_lemmatizer.lemmatize(word)
    liberator_lemma.append(word_lem)
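
The dictionary-lookup idea behind lemmatization can be illustrated with a toy version. This is only a sketch under loud assumptions: `IRREGULAR` is a made-up three-entry dictionary, and the "strip a final s" fallback is a crude stand-in for the much richer data and rules a real lemmatizer such as WordNetLemmatizer relies on.

```python
# Hypothetical mini-dictionary of irregular forms
IRREGULAR = {"men": "man", "sang": "sing", "children": "child"}

def toy_lemmatize(word):
    """Look the word up in the dictionary first, then fall back to a naive plural rule."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Naive fallback: drop a final "s", but leave words like "glass" alone
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

for w in ["cats", "men", "sang", "children", "glass", "law"]:
    print(w, "->", toy_lemmatize(w))
```

The irregular forms are only handled because they are listed explicitly, which is why a real lemmatizer needs a large reference dictionary.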

### Part 3: Basic text analysis with our clean text

Because we've lowercased our text, removed punctuation and stop words, and lemmatized what was left, the results of functions like concordance, similar, and count will change.

In [None]:
#we have to make our clean text an NLTK object to use the functions
liberator_clean = nltk.Text(liberator_lemma)

liberator_clean.concordance('law')

In [None]:
liberator_clean.concordance("children")

In [None]:
liberator_clean.concordance("child")

In [None]:
liberator_clean.concordance("anarchist")

In [None]:
liberator_clean.concordance("lawless")

In [None]:
liberator_clean.similar("good")

In [None]:
liberator_clean.similar("child")

In [None]:
liberator_clean.dispersion_plot(["law","america","anarchist","assassin",])

In [None]:
liberator.dispersion_plot(["law","America","anarchist","assassin",])