# An introduction to computational text analysis: A more in-depth look at strings

In [None]:
from string import punctuation
punctuation

Strings have some simple but powerful methods that allow us to begin working with text in more complex ways. You saw how to import a .csv as a `datascience` Table in the last notebook, but what happens when we want to import text that is not nicely organized into rows and columns? We can organize the important parts into a nice tabular structure by first identifying parts that we want. 

**NOTE:** We will use hand-typed and plain text (.txt) file examples in this notebook, and since the rest of the class will focus on HathiTrust Research Center resources, **know that .html, .json, and .xml file formats are important to computational text analysis but will not be covered in this class.** 

# Challenge 1

1. Store your first name in a variable named `first`.  
2. Store your last name in a variable named `last`.  
3. Convert `first` to all upper case letters.  
4. Convert `last` to all lower case letters.  
5. Combine these two string variables into a variable named `full`.  
6. Slice out `first` from `full`.  
7. Slice out `last` from `full`.  
8. Slice `full` so that it contains only the last 2 characters of your first name and the first two characters of your last name.  
9. Which string methods did you use in #3 and #4 above? How do you know they are string methods?

In [None]:
## YOUR CODE HERE

#### Bonus
What do the following commands return? Why? 

"cat" > "category"  
"cat " * 5  
"cat" + 2  

In [None]:
## YOUR CODE HERE

# Jorge Luis Borges

![borges](img/borges_1921.jpg)

Below is a string of [Jorge Luis Borges'](https://en.wikipedia.org/wiki/Jorge_Luis_Borges) poem "On His Blindness":

In [None]:
borges = '''In the fullness of the years, like it or not,
a luminous mist surrounds me, unvarying, 
that breaks things down into a single thing,
colorless, formless. Almost into a thought. 
The elemental, vast night and the day
teeming with people have become that fog
of constant, tentative light that does not flag,
and lies in wait at dawn. I longed to see
just once a human face. Unknown to me
the closed encyclopedia, the sweet play
in volumes I can do no more than hold, 
the tiny soaring birds, the moons of gold.
Others have the world, for better or worse; 
I have this half-dark, and the toil of verse.'''

In [None]:
print(borges)

In [None]:
# make an unpreprocessed copy for use below
borges_dirty = borges

# Tokenization
Tokenization is the process of splitting text into words. Each occurrence of a word in the text is called a "token", and each distinct word is called a "type". A single type such as "the" can therefore appear as many tokens of "the" within a text.
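To make the token/type distinction concrete, here is a minimal sketch on a short hand-typed sentence (the sentence and the names `example_tokens` and `example_types` are made up for illustration and are not part of the poem):

```python
# A short hand-typed sentence, used only for illustration
sentence = "the cat saw the dog and the bird"

example_tokens = sentence.split()    # every occurrence of a word is a token
example_types = set(example_tokens)  # each distinct word is counted once

print(len(example_tokens))           # number of tokens: 8
print(len(example_types))            # number of types: 6
print(example_tokens.count("the"))   # tokens of the type "the": 3
```

The type "the" contributes three tokens here, which is why the token count is larger than the type count.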

`.split` allows us to split the text based on some sort of separator. In this case, we want to split on the "whitespace" (the blank spaces between words). 

**NOTE:** remember to use your help files in the form of `help(borges.split)`

Let's just look at the first six words. 

In [None]:
borges.split()[:6]

How many characters are there in `borges`?

In [None]:
print(len(borges))

How many words?

In [None]:
print(len(borges.split()))

How many lines? (hint: a line break is represented as \n)

In [None]:
print(len(borges.split("\n")))

How many stanzas?

In [None]:
print(len(borges.split("\n\n")))

At which index does the string "me" first appear? (`.find` searches for substrings, so it can also match "me" inside a longer word.)

In [None]:
print(borges.find("me")) # .find is "forward search"

At which index does the string "me" last appear? (Careful: `.rfind` also matches substrings, so here it finds the "me" inside "volumes".)

In [None]:
print(borges.rfind("me")) # .rfind starts at the highest index and works in reverse

In [None]:
help(borges.rfind)
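Because `.find` and `.rfind` look for substrings rather than whole words, they can land in the middle of a longer word. A small sketch on a hand-typed string (the string `sample` is made up for illustration) shows the difference between a character index and a word position:

```python
# Hand-typed example: "me" hides inside "some" and "home"
sample = "some volumes remind me of home"

print(sample.find("me"))    # 2: the "me" inside "some"
print(sample.rfind("me"))   # 28: the "me" inside "home"

# To locate the whole word, search the list of tokens instead
sample_words = sample.split()
print(sample_words.index("me"))  # 3: "me" is the fourth token
```

Note that `.find` returns a character index, while `.index` on the split list returns a position in the list of words.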

How many unique words? (hint, use `set`!)

In [None]:
len(set(borges.lower().split()))

In [None]:
print(set(borges.lower().split()))

# Challenge 2
1. Using your intuition, how might you split text on commas?
2. On periods?
3. How do you split _all_ of `borges` on whitespace so that all words are split and printed?

In [None]:
## YOUR CODE HERE

# Some fast notes about for-loops, functions, and conditionals in Python

Custom functions, for-loops, and conditionals are important tools that you will want to eventually explore. Since we will use a loop to remove punctuation below, let's take a minute to talk about these important topics.

### Custom functions

Just like built-in functions, custom functions take some inputs and give you back desired output(s). We can define our own custom functions to use over and over again. 

In the example below, `def` tells Python that we want to define our own function. `sq_and_div` is the name of the function, and it needs two arguments to work: `x` and `y`. 

The colon symbol `:` tells Python that the code to be evaluated comes on the indented line after it.  

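`return` tells Python to send the value of the expression after it back to wherever the function was called. (A notebook displays that value when the function call is the last line of a cell.)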

In [None]:
def sq_and_div(x,y):
    return (x**2)/y

In [None]:
sq_and_div(5,2)
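Returning a value and printing a value are easy to confuse in a notebook, because both show something on screen. The sketch below uses two hypothetical variants of the function above (`sq_and_div_return` and `sq_and_div_print` are names invented for this comparison) to show the difference:

```python
def sq_and_div_return(x, y):
    # hands the result back to the caller, so it can be stored and reused
    return (x ** 2) / y

def sq_and_div_print(x, y):
    # displays the result, but the function itself gives back None
    print((x ** 2) / y)

returned = sq_and_div_return(5, 2)
printed = sq_and_div_print(5, 2)

print(returned)  # 12.5
print(printed)   # None
```

Only the returned value can be saved in a variable and used in later calculations.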

### For loops

For loops are useful when you want to use the same code over a range of values, data, or files. 

`for` tells Python that we want to write a for loop. 

`x` is our loop (placeholder) variable; it takes each value produced by `range` in turn. `range(1, 13)` produces the integers 1 through 12, because the stop value is not included. The colon symbol `:` again tells Python that the code to be evaluated follows.

In [None]:
for x in range(1, 13):
    print("The time is", x, "o'clock")
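Loops are not limited to `range`; they can step through any sequence, including a list of words. A minimal sketch, assuming a hand-typed line of text (the names `line` and `word_lengths` are just for illustration):

```python
line = "a luminous mist surrounds me"

word_lengths = []
for word in line.split():           # word takes each token in turn
    word_lengths.append(len(word))  # record how long the token is
    print(word, len(word))

print(word_lengths)  # [1, 8, 4, 9, 2]
```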

### Conditionals

Conditionals let your code do different things depending on whether a condition is true. In the case below, `if` tells Python that "if some condition is met - do _this_". 

However, `else` tells Python that "if that condition is _not_ met - do something _else!_"

In [None]:
x = int(input("What time is it (PM)?"))
if x < 9:
    print("The time is", x, "o'clock")
else:
    print("It's getting late!")
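When there are more than two cases, `elif` ("else if") adds extra conditions between `if` and `else`. Here is a small sketch that does not need keyboard input; the function `describe_length` and its length cut-offs are arbitrary choices made for illustration:

```python
def describe_length(word):
    # conditions are checked from top to bottom; the first true one wins
    if len(word) < 4:
        return "short"
    elif len(word) < 8:
        return "medium"
    else:
        return "long"

print(describe_length("me"))        # short
print(describe_length("mist"))      # medium
print(describe_length("luminous"))  # long
```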

# Removing punctuation

Remember how we imported that nice string of English punctuation in the first cell of this notebook? We could manually remove all of the punctuation using the `.replace` method, but this would get old fast!

In [None]:
print(type(punctuation))
print(punctuation)

In [None]:
borges_periods = borges.replace(".", " ")

In [None]:
print(borges_periods) # all periods have been replaced with spaces!

But what if you have tons of text and don't know exactly what punctuation is present? A quick for loop can help us remove every character in `punctuation` (that is, ``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``) from `borges`.

In [None]:
for char in punctuation:
    borges = borges.replace(char, "")

In [None]:
print(borges)
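As an alternative to calling `.replace` once per punctuation character, Python strings also have a `.translate` method that removes a whole set of characters in one pass. This is a sketch on a hand-typed fragment (the names `sample_line` and `remove_table` are illustrative), not a replacement for the loop above:

```python
from string import punctuation

sample_line = "colorless, formless. Almost into a thought."

# the third argument to str.maketrans lists the characters to delete
remove_table = str.maketrans("", "", punctuation)
sample_clean = sample_line.translate(remove_table)

print(sample_clean)  # colorless formless Almost into a thought

# caution: hyphens are punctuation too, so compound words get fused
print("half-dark".translate(remove_table))  # halfdark
```

Either approach deletes hyphens and apostrophes along with everything else, so check whether that is what you want for your text.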

# Challenge 3

Describe what is happening in this for loop that removes punctuation:

In [None]:
for char in punctuation:
    borges = borges.replace(char, "")

# Tokenization with the `nltk` library

The [`nltk` (natural language toolkit)](https://nltk.readthedocs.io/en/latest/) library can also help you tokenize your text.

In [None]:
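# word_tokenize needs NLTK's tokenizer data; if you see a LookupError,
# run the nltk.download(...) command that the error message suggests
from nltk.tokenize import word_tokenize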

tokens = word_tokenize(borges)
tokens

# Sentence segmentation

Sentence segmentation deals with identifying sentence boundaries. We can do this by splitting on punctuation:

In [None]:
borges_dirty

In [None]:
borges_dirty.split(".")
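Splitting on periods is a rough approach: the pieces keep their leading whitespace, and a text that ends with a period produces an empty final piece. A minimal sketch of a cleanup step, using a hand-typed fragment (the names `fragment`, `pieces`, and `simple_sentences` are illustrative):

```python
fragment = "I longed to see just once a human face. Unknown to me the closed encyclopedia."

pieces = fragment.split(".")
print(pieces)  # the last piece is an empty string

# strip surrounding whitespace and drop the empty pieces
simple_sentences = [piece.strip() for piece in pieces if piece.strip()]
print(simple_sentences)
```

This still breaks on abbreviations such as "Dr." or on sentences that end with "?" or "!", which is one reason to reach for a dedicated sentence tokenizer.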

# `nltk` can do sentence segmentation also!

In [None]:
from nltk.tokenize import sent_tokenize
sent_tokenize(borges_dirty)

We do all of this in the hope of "normalizing" our text. Normalization can involve many steps, but some common ones include:
- case folding (dealing with upper and lower case letters; generally, we want to make all text lower-case).
- removing URLs, digits, and hashtags
- infrequent word removal
- stop word removal

Regular expressions help out greatly with these! See me during office hours if you have further questions. 
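As a rough sketch of how regular expressions can handle a few of the steps listed above, the example below uses Python's standard `re` module on a hand-typed string. The string, the URL `https://example.com`, and the patterns are assumptions chosen for illustration; real data usually needs more careful patterns.

```python
import re

raw = "Read BORGES at https://example.com in 2024 #poetry #blindness"

lowered = raw.lower()                             # case folding
no_urls = re.sub(r"https?://\S+", " ", lowered)   # remove URLs
no_tags = re.sub(r"#\w+", " ", no_urls)           # remove hashtags
no_digits = re.sub(r"\d+", " ", no_tags)          # remove digits
normalized = " ".join(no_digits.split())          # collapse extra whitespace

print(normalized)  # read borges at in
```

The order of the steps matters: removing digits before URLs, for example, could leave fragments of a URL behind.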

# Count word frequencies

We can use the `Counter` class from Python's standard `collections` module to count words! Let's look at the twelve most frequent:

In [None]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(12)
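To see what a `Counter` holds, here is a minimal sketch on a short hand-typed list of words (the names `example_words` and `example_counts` are illustrative):

```python
from collections import Counter

example_words = "the moons of gold and the birds of the night".split()
example_counts = Counter(example_words)

print(example_counts.most_common(2))  # [('the', 3), ('of', 2)]
print(example_counts["the"])          # 3
print(example_counts["sun"])          # 0: words that never appear count as zero
```

A `Counter` behaves like a dictionary that maps each word to its count, and `.most_common(n)` returns the `n` highest counts as a list of (word, count) pairs.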

# Removing stop words

Yikes! The most common words in `borges` seem to be [stop words](https://en.wikipedia.org/wiki/Stop_words) such as "the", "of", and "a". Let's remove them because they are rarely useful in computational text analysis. 

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words("english")
stop

In [None]:
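stop_set = set(stop)  # build the set once; membership tests on a set are fast
# the NLTK English list is lower-case, so lower-case each token before comparing
no_stops = [word for word in tokens if word.lower() not in stop_set]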
no_stops
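Stop word matching is case-sensitive, which is easy to overlook when a token starts a sentence or a line of verse. The sketch below uses a tiny hand-made stop list (`small_stop` is hypothetical and far shorter than NLTK's list) to show the effect:

```python
small_stop = {"the", "of", "and"}  # hypothetical mini stop list, all lower-case
example_tokens = ["The", "toil", "of", "verse", "and", "the", "fog"]

# comparing tokens as-is misses the capitalized "The"
case_sensitive = [w for w in example_tokens if w not in small_stop]
print(case_sensitive)    # ['The', 'toil', 'verse', 'fog']

# lower-casing each token before the comparison catches it
case_insensitive = [w for w in example_tokens if w.lower() not in small_stop]
print(case_insensitive)  # ['toil', 'verse', 'fog']
```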

# Stemming/lemmatization

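Both techniques reduce a word to a base form. Stemming strips morphological affixes using rules, while lemmatization maps a word to its dictionary form (its "lemma"). 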

If we stem the word "eats" we get "eat". If we stem the word "sleeping" we get "sleep". We stem words because we tend to focus more on the meaning of the core content of the word, rather than its tense. 

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.

Let's try a few! 

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
stemmer.stem("eats")

In [None]:
stemmer.stem("sleeping")

In [None]:
stemmer.stem("flying") # uh oh...

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [None]:
print(snowballer_stemmer.stem("eats"))
print(snowballer_stemmer.stem("sleeping"))

In [None]:
print(lemmatizer.lemmatize("leaves")) # uh-oh...
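To get a feel for why rule-based stemming is described as close to a bunch of regular expressions, here is a deliberately naive sketch. `toy_stem` is a made-up function with three arbitrary suffix rules; it is not the Porter or Snowball algorithm and should not be used for real analysis.

```python
import re

def toy_stem(word):
    """Strip one of a few common English suffixes (toy rules, NOT Porter)."""
    word = word.lower()
    for pattern in (r"ing$", r"ed$", r"s$"):
        stripped = re.sub(pattern, "", word)
        # accept the rule only if it changed the word and left at least 3 letters
        if stripped != word and len(stripped) >= 3:
            return stripped
    return word

print(toy_stem("eats"))      # eat
print(toy_stem("sleeping"))  # sleep
print(toy_stem("sing"))      # sing: stripping "ing" would leave too little
print(toy_stem("running"))   # runn: the toy rules do not undo doubled letters
```

Real stemmers add many more rules and conditions to handle cases like "running", which is where most of their complexity comes from.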

# Part of speech tagging

Part of speech (POS) tagging assigns each token a part of speech (i.e., noun, verb, adjective, etc.). 

Again, there are many different [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK keeps its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand (although you are free to do it afterwards).

In [None]:
borges

Whoops! We forgot to remove our line breaks. Let's do so now:

In [None]:
borges = borges.replace("\n", " ")

In [None]:
borges # looking good! :) 

In [None]:
from nltk import pos_tag
pos_borges = borges
pos_borges

In [None]:
tagged_borges = pos_tag(tokens)
tagged_borges

What might you conclude about Borges' style of writing based on the frequencies of non-stop words and stemmed words? 

# Why is preprocessing important?

Text preprocessing is an essential first step before applying machine learning algorithms to text, because those algorithms need consistent, countable tokens as input. For machine learning portions of this course, we will focus on bag of words models, namely document-term and term frequency-inverse document frequency models from the [sklearn library](http://scikit-learn.org/stable/). 

As previously stated, these instructions can be improved upon using regular expressions. 
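To preview what a document-term model looks like, here is a pure-Python sketch built with `Counter`. The three tiny "documents" are hand-typed, and this is only an illustration of the idea, not how sklearn implements it (sklearn provides this through its `CountVectorizer` class).

```python
from collections import Counter

docs = ["the moons of gold", "the toil of verse", "gold and verse and gold"]
tokenized_docs = [doc.split() for doc in docs]

# the vocabulary is every distinct word across all documents, in sorted order
vocab = sorted(set(word for doc in tokenized_docs for word in doc))

# one row per document, one column per vocabulary word
dtm = []
for doc in tokenized_docs:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

print(vocab)  # ['and', 'gold', 'moons', 'of', 'the', 'toil', 'verse']
for row in dtm:
    print(row)
# [0, 1, 1, 1, 1, 0, 0]
# [0, 0, 0, 1, 1, 1, 1]
# [2, 2, 0, 0, 0, 0, 1]
```

Each row ignores word order entirely and keeps only counts, which is why this family of models is called "bag of words".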

# Challenge 4

We can also open data from files. Let's open up the "poe.txt" file from the materials you downloaded earlier. This contains the poem "A Dream Within a Dream" by Edgar Allan Poe. 

Repeat the instructions in this notebook using Poe's poem. 

In [None]:
with open("./poe.txt", "r") as myfile:
    poe = myfile.read()

In [None]:
print(poe)

In [None]:
len(poe.split("\n"))
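One thing to watch for when counting lines in a file: text files often end with a final line break, and `.split("\n")` then reports one extra, empty line. The sketch below writes a small hand-typed file to a temporary folder to show this (the file name `demo.txt` and its contents are made up, and we do not know whether "poe.txt" ends with a line break):

```python
import os
import tempfile

demo_text = "first line\nsecond line\n"  # note the final line break

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(demo_text)
    with open(path, "r", encoding="utf-8") as infile:
        demo = infile.read()

print(len(demo.split("\n")))   # 3: the last piece is an empty string
print(len(demo.splitlines()))  # 2: .splitlines ignores the final line break
```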

In [None]:
## etc.

## YOUR CODE HERE