## Strings
To work with text in Python it's important to be able to manipulate string variables. Let's create a variable called text_string and print it as output.

In [1]:
text_string = 'Hi there'
print(text_string)

Hi there


We can check what kind of a variable this is by using the built-in type() function in Python.

In [2]:
type(text_string)

str

We can refer to specific characters in the string by indexing or slicing it with square brackets. Positions start at 0, and the end of a slice is not included:

In [3]:
print(text_string[0])

H


In [4]:
text_string[1]

'i'

In [5]:
text_string[0:3]

'Hi '

In [6]:
text_string[3:]

'there'
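Slices can also count from the end of the string or skip characters. Here is a minimal sketch of those two variations; the variable name greeting is just a stand-in so we don't overwrite text_string.

```python
# A stand-in string with the same value as text_string above
greeting = 'Hi there'

last_char = greeting[-1]      # negative indices count back from the end
last_five = greeting[-5:]     # the final five characters
every_other = greeting[::2]   # a step of 2 takes every second character

print(last_char)
print(last_five)
print(every_other)
```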

We can also 'add' or *concatenate* strings together using the plus sign.

In [7]:
text_string + '!'

'Hi there!'

In [8]:
text_string += '!'

In [9]:
text_string + ' ' + 'How are you today?'

'Hi there! How are you today?'

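Note that the plus sign on its own returns a new string and leaves text_string unchanged. Only the += in cell 8 saved the result back to the variable, which is why text_string now ends in '!' and has 9 characters rather than 8.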

In [10]:
len(text_string)

9

### Built-in string methods
Strings also have built-in methods that can operate on them. These include *join*, *find*, *replace*, *lower* and *upper*.

In [14]:
how_string = 'How are you today, Mike?'
how_string.replace('H', 'C')

'Cow are you today, Mike?'

In [16]:
how_string.replace('are you today', 'were you yesterday')

'How were you yesterday, Mike?'

In [17]:
how_string.lower()

'how are you today, mike?'

In [18]:
how_string.upper()

'HOW ARE YOU TODAY, MIKE?'

In [57]:
'x'.join(how_string)

'Hxoxwx xaxrxex xyxoxux xtxoxdxaxyx,x xMxixkxex?'
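The *find* method from the list above isn't shown in the cells, and *join* is more often used on a list of strings than on a single string. Here is a short sketch of both, using stand-in variable names (question, words) that are not part of the original notebook.

```python
question = 'How are you today, Mike?'

# find() returns the index where the first match starts, or -1 if there is none
position = question.find('you')
missing = question.find('Zed')

# join() glues the items of a list together, with the string as the separator
words = ['How', 'are', 'you']
joined = ' '.join(words)

print(position, missing)
print(joined)
```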

## Reading in files

Usually you want to work with text from files though, and not manually create string variables. 

The first step is to read in the files containing the data. Common file types for text data are: 
* `.txt`
* `.csv`
* `.json`
* `.html` 
* `.xml`

Each file format requires specific Python tools or methods to read, but for our case, we'll be working with .txt files.
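For two of the other formats in the list, Python's standard library already includes a reader. The sketch below uses the csv and json modules on small in-memory examples; the data and variable names are made up for illustration, and with real files you would pass an opened file instead of io.StringIO.

```python
import csv
import io
import json

# Made-up, in-memory stand-ins for a .csv and a .json file on disk
csv_data = io.StringIO("id,text\n1,Hi there\n")
json_data = io.StringIO('{"id": 1, "text": "Hi there"}')

# DictReader gives one dictionary per row, keyed by the header line
rows = list(csv.DictReader(csv_data))

# json.load turns the JSON object into a regular Python dictionary
record = json.load(json_data)

print(rows[0]['text'])
print(record['text'])
```

Note that csv reads every value as a string, so the id in rows is '1', while json keeps it as the number 1.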

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

Let's take a look at the first file in our directory (folder) of State of the Union addresses (`/sotu_text`):

In [19]:
# create a new variable called file1 and read ("r") the first file in the sotu_text folder
file1 = open("sotu_text/215.txt","r") 

In [20]:
# but the variable is a file object; its contents aren't yet stored as a string
type(file1)

_io.TextIOWrapper

In [21]:
# to view the text, let's read in the file1 object to a new variable called "text" using .read() and then print out the first 250 characters
text = file1.read()

In [22]:
print(text[0:250])

Mr. President, Mr. Speaker, Members of the 104th Congress, my fellow Americans: Again we are here in the sanctuary of democracy, and once again our democracy has spoken. So let me begin by congratulating all of you here in the 104th Congress and cong
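A file opened with open() stays open until it is closed. A common alternative is a with-block, which closes the file for us. Here is a minimal sketch; it first writes a small throwaway file (sample.txt, a made-up name) so the example runs anywhere, but the same pattern should work for a file in the sotu_text folder.

```python
import os
import tempfile

# Create a small throwaway file so the example doesn't depend on sotu_text/
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("Hi there\nHow are you today?\n")

# The with-block closes the file automatically once the indented code finishes
with open(sample_path, "r", encoding="utf-8") as f:
    sample_text = f.read()

print(f.closed)
print(sample_text[0:8])
```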


## Tokenization

Once we've read in the data, a common next step is to split a longer string into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.

#### Tokenizing by whitespace
Let's save each word to a list variable called 'tokens'

In [23]:
# use the split() method to split the text variable up by whitespace into a tokens list
tokens = text.split()

In [24]:
# what kind of a variable is tokens?
type(tokens)

list
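To make the token/type distinction concrete, here is a small sketch on a made-up sentence. It assumes whitespace tokenization, as above, and uses a Python set to keep only the distinct words.

```python
# A made-up sentence, tokenized by whitespace
sample = "the cat saw the dog and the bird"
sample_tokens = sample.split()

# A set keeps one copy of each distinct word: the word types
sample_types = set(sample_tokens)

print(len(sample_tokens))          # number of tokens
print(len(sample_types))           # number of types
print(sample_tokens.count('the'))  # tokens of the type "the"
```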

### Lists 
You can view each item in a Python list using the same syntax we used above to slice a str variable. The first item in the tokens list is at ```tokens[0]``` and the second is ```tokens[1]```. You can view a range of the first 10 as follows:

In [25]:
tokens[0:10]

['Mr.',
 'President,',
 'Mr.',
 'Speaker,',
 'Members',
 'of',
 'the',
 '104th',
 'Congress,',
 'my']

In [26]:
# the first one
tokens[0]

'Mr.'

In [27]:
# the last ten
tokens[-10:]

['still', 'to', 'come.', 'Thank', 'you,', 'and', 'God', 'bless', 'you', 'all.']

In [28]:
# Note: you can also slice the string variables stored inside of a list
tokens[1][0:5]

'Presi'

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences, and provides a different way to tokenize our text.

#### Sentence segmentation by splitting on punctuation

In [29]:
# instead of the default whitespace for split(), you can identify the character or characters you'd like to split on
sentences = text.split('.')
sentences[0]

'Mr'

We can check how many items are in any list using the len() function.

In [30]:
len(sentences)

467

In [31]:
# note that this method doesn't break out sentences that end with other punctuation, like question marks
sentences[35]

' What are we to do about it? \n\nMore than 60 years ago, at the dawn of another new era, President Roosevelt told our Nation, "New conditions impose new requirements on Government and those who conduct Government'

## Regular Expressions
We could improve on this by using regular expressions. They allow us to split strings using specific characters or patterns that match different *kinds* of characters. Regex is a very powerful tool, but we won't go into it much today. For help figuring out and working with regular expressions we recommend https://regex101.com/

In [32]:
import re

In [33]:
# this pattern matches periods, question marks, or exclamation marks
boundary_pattern = r'[.?!]'
sentences_re = re.split(boundary_pattern, text)

In [34]:
# there are now a few more sentences in our list
len(sentences_re)

474

In [35]:
# and this sentence ends at the question mark
sentences_re[35]

' What are we to do about it'
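Splitting on the punctuation itself throws the punctuation away. One possible variation, sketched below on a made-up string, is to split on the whitespace that *follows* a period, question mark, or exclamation mark, using a regex "lookbehind". This is only one option, and it would still break on abbreviations like "Mr.".

```python
import re

# A made-up example with three different sentence endings
sample = "What are we to do about it? We act. Together!"

# (?<=[.?!]) checks that the previous character is ., ? or !
# \s+ then matches (and removes) the whitespace between sentences
keep_punct_pattern = r'(?<=[.?!])\s+'
parts = re.split(keep_punct_pattern, sample)

print(parts)
```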

## Strip whitespace

This is an extremely common step in text cleaning. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [36]:
string = " Hi there! "
string

' Hi there! '

In [37]:
string.strip()

'Hi there!'

We can also use ```strip()``` to remove "line breaks" from strings. Line breaks are often represented with "escape characters" such as ```\n``` in text files.

In [38]:
text[-25:]

' and God bless you all.\n\n'

In [39]:
# we can remove whitespace at the beginning and end of a string using .strip()
stripped_text = text.strip()
stripped_text[-25:]

'u, and God bless you all.'

You can also run more complex find/replace patterns using regex. Here we use ```re.sub()``` to replace any run of whitespace characters (```\s+```) with a single space.

In [40]:
# we can use regular expressions to remove whitespace throughout the string
# note that we are replacing any of the matching whitespace patterns with a single space ' '.
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text[-25:]

', and God bless you all. '
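Notice that the regex version still leaves one space at the very end. The two steps can be combined, and there are also one-sided versions of strip. A small sketch on a made-up messy string:

```python
import re

# A made-up string with extra spaces, a tab, and line breaks
messy = "  Hi   there!\n\nHow are\tyou?  \n"

# collapse every run of whitespace to one space, then trim both ends
tidy = re.sub(r'\s+', ' ', messy).strip()

# lstrip() and rstrip() only trim the left or the right end
left_only = messy.lstrip()
right_only = messy.rstrip()

print(repr(tidy))
print(repr(left_only))
print(repr(right_only))
```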

## Text normalization
Text normalization can help us clean our text to fit some standard patterns. One common normalization step is to remove case from the text.

If you want to count the frequencies of words, for example, using lower case will ensure you don't count "Death" and "death" as two separate words.

In [42]:
clean_text = clean_text.lower()
clean_text[0:250]

'mr. president, mr. speaker, members of the 104th congress, my fellow americans: again we are here in the sanctuary of democracy, and once again our democracy has spoken. so let me begin by congratulating all of you here in the 104th congress and cong'
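To see why case matters for counting, here is a minimal sketch with a made-up word list. It uses Counter from Python's standard library to count before and after lowercasing.

```python
from collections import Counter

# A made-up list where the same word appears with different capitalization
words = ["Death", "death", "Taxes", "death"]

# without normalization, "Death" and "death" are counted separately
raw_counts = Counter(words)

# after lowercasing, they are counted as one word
lower_counts = Counter(w.lower() for w in words)

print(raw_counts)
print(lower_counts)
```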

Depending on your analysis, you might also want to throw out numerals.

In [43]:
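# an example string containing capitals, punctuation and digits
caps_string = "Hi There! Can you believe it's 2020?"

# remove digits using regex
digits = r'\d+'
re.sub(digits, '', caps_string)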

"Hi There! Can you believe it's ?"

In [44]:
# note that since we didn't assign the changes to the string variable, the changes aren't "saved"
caps_string

"Hi There! Can you believe it's 2020?"

## Removing punctuation

Sometimes you might want to keep only the alphanumeric characters (the letters and numbers) and ditch the punctuation. Here's how we can do that.


In [45]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [46]:
# strip() will remove punctuation from the beginning or end of the string
caps_string.strip(string.punctuation)

"Hi There! Can you believe it's 2020"

The following code looks a little complex, but essentially it will move through each character in our ```caps_string``` variable, and replace any punctuation mark found in ```string.punctuation``` with a blank string, ```''```.

In [47]:
# this code will remove all punctuation from the caps_string variable
''.join(word.strip(string.punctuation) for word in caps_string)

'Hi There Can you believe its 2020'

In [48]:
# let's remove punctuation from our SOTU speech
clean_text = ''.join(word.strip(string.punctuation) for word in clean_text)

### Let's unpack "comprehension"

This is what is called a *comprehension* in Python: a way to iterate or loop over multiple similar items, perform a task, and capture the result of that task for each of the items in a single object. It is very concise, and a powerful way to think about repetitive tasks, like text cleaning.

Comprehensions are easiest to understand by reading them backwards: start from the loop and any conditions, then see what is done to each item.

So first, we're looping over each item in *clean_text*. Because clean_text is a string rather than a list, each item (which we're calling *word*) is actually a single character.
Then, we're running the .strip() method on that item and stripping all punctuation as listed in string.punctuation.

In [53]:
# as a for-loop, this would look like the following, but the output wouldn't be saved
for word in clean_text:
    word.strip(string.punctuation)

In [68]:
# as a list comprehension, we can capture the output into a single list
output = [word.strip(string.punctuation) for word in clean_text]
output[0:10]

['m', 'r', ' ', 'p', 'r', 'e', 's', 'i', 'd', 'e']

Lastly, we want that result as a single string, rather than a list of separate characters, so we're using some Python sleight of hand to make that happen: we're joining, or concatenating, each item of that list with an empty string as the separator.

In [69]:
''.join(output)

'mr president mr speaker members of the 104th congress my fellow americans again we are here in the sanctuary of democracy and once again our democracy has spoken so let me begin by congratulating all of you here in the 104th congress and congratulating you mr speaker if we agree on nothing else tonight we must agree that the american people certainly voted for change in 1992 and in 1994 and as i look out at you i know how some of you must have felt in 1992 laughter i must say that in both years we didnt hear america singing we heard america shouting and now all of us republicans and democrats alike must say we hear you we will work together to earn the jobs you have given us for we are the keepers of a sacred trust and we must be faithful to it in this new and very demanding era over 200 years ago our founders changed the entire course of human history by joining together to create a new country based on a single powerful idea we hold these truths to be selfevident that all men are cr

We're also overwriting clean_text with the punctuation-stripped string, which is why you see that same variable on both the left and the right hand side of the equals sign.

In [70]:
clean_text = ''.join(word.strip(string.punctuation) for word in clean_text)
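Another way to get the same result, shown here only as an alternative sketch, is the string method translate() together with str.maketrans(). The third argument to maketrans lists characters to delete. The variable names sample, remove_punct and no_punct are made up for this example.

```python
import string

# The same example sentence used for caps_string above
sample = "Hi There! Can you believe it's 2020?"

# build a translation table that deletes every punctuation character
remove_punct = str.maketrans('', '', string.punctuation)

# apply the table to the whole string in one step
no_punct = sample.translate(remove_punct)

print(no_punct)
```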

### Remove anything but letters
We can use a regular expression that matches only upper and lower case letters to remove everything else.

In [71]:
# in this case we sub any non-letter characters out with a space, ' '
letters_only = r'[^A-Za-z]+'
re.sub(letters_only, ' ', caps_string)

'Hi There Can you believe it s '

### Tokenizing with the Natural Language Toolkit (NLTK)

We can also use the Natural Language Toolkit (NLTK) to accomplish many of the steps we showed manually above. Or in the case of our clean_text variable, where we've already removed punctuation, we can use the word_tokenize function to break the text up into its constituent tokens:

In [72]:
from nltk.tokenize import word_tokenize

In [73]:
tokens = word_tokenize(clean_text)
tokens[:10]

['mr',
 'president',
 'mr',
 'speaker',
 'members',
 'of',
 'the',
 '104th',
 'congress',
 'my']

In [75]:
from nltk.probability import FreqDist

Now that we have a list of tokens we can count their frequencies in the speech. Let's use NLTK's FreqDist class to look at our most common words.

In [95]:
#apply the FreqDist function to our tokens variable
fdist = FreqDist(tokens)

#fdist is a dictionary-like object of unique words and the number of times they occur
fdist

FreqDist({'the': 463, 'to': 401, 'and': 348, 'of': 231, 'we': 203, 'in': 180, 'a': 179, 'our': 140, 'that': 133, 'for': 113, ...})

In [96]:
#it also includes a handy method to find the most common words 
fdist.most_common(10)

[('the', 463),
 ('to', 401),
 ('and', 348),
 ('of', 231),
 ('we', 203),
 ('in', 180),
 ('a', 179),
 ('our', 140),
 ('that', 133),
 ('for', 113)]
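If NLTK isn't available, Python's standard library has a similar counting tool, collections.Counter, which also has a most_common() method. A minimal sketch on a made-up token list (this is a stand-in for illustration, not what the cells above use):

```python
from collections import Counter

# A made-up list of tokens
sample_tokens = ['the', 'people', 'the', 'work', 'the', 'people']

# Counter maps each distinct token to the number of times it occurs
counts = Counter(sample_tokens)

# most_common(n) returns the n most frequent tokens with their counts
top_two = counts.most_common(2)

print(top_two)
```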

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "the", "to", "and" and "of": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

NLTK includes a stopwords module we can use. Not all stopwords lists are equal though: for your own research you might want to customize a stopwords list, or find one that is best-suited to your domain.

In [88]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

# how many stopwords are on the list?
len(stop)

179

In [89]:
# what are the first ten words on the stopword list?
stop[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Let's create a new list of tokens, removing our stopwords along the way. 

This loop checks each word in our original tokens list, and if it does *not* appear on the stopword list, it adds it to a new list called tokens_clean.

In [90]:
tokens_clean = [] 
  
for w in tokens: 
    if w not in stop: 
        tokens_clean.append(w)
tokens_clean[0:10]

['mr',
 'president',
 'mr',
 'speaker',
 'members',
 '104th',
 'congress',
 'fellow',
 'americans',
 'sanctuary']

In [92]:
# advanced: we can do the same thing quite efficiently with a list comprehension
tokens_clean = [w for w in tokens if w not in stop]
tokens_clean[0:10]

['mr',
 'president',
 'mr',
 'speaker',
 'members',
 '104th',
 'congress',
 'fellow',
 'americans',
 'sanctuary']
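Checking whether a word is "in" a list means scanning the list each time. For longer texts, a common tweak is to convert the stop list to a set first, which makes each lookup faster. The sketch below uses a tiny made-up stop list rather than NLTK's, so the names and words here are assumptions for illustration only.

```python
# A tiny, made-up stop list (NLTK's English list is much longer)
stop_list = ['i', 'me', 'my', 'we', 'our', 'the', 'of']

# a set holds the same words but supports much faster membership checks
stop_set = set(stop_list)

sample_tokens = ['members', 'of', 'the', 'congress', 'my', 'fellow', 'americans']

# keep only the tokens that are not in the stop set
kept = [w for w in sample_tokens if w not in stop_set]

print(kept)
```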

In [109]:
# now we can re-count the most common words after stop words are removed
freq = FreqDist(tokens_clean)
freq.most_common(10)

[('people', 72),
 ('work', 42),
 ('new', 41),
 ('us', 37),
 ('government', 34),
 ('country', 32),
 ('years', 29),
 ('year', 29),
 ('last', 28),
 ('time', 27)]

Hmmm, still not terribly interesting but getting better...

## Stemming/lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running, rather than the fact that it's a third person present tense verb, or progressive participle).

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm.

In [97]:
# import the PorterStemmer and then stem the word "states" as an example
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('states')

'state'

In [98]:
stemmer.stem('united')

'unit'

In [99]:
stemmer.stem('government')

'govern'

In a similar manner as the stopwords loop above, we can create a new list of stemmed tokens:

In [100]:
tokens_stemmed = []
for t in tokens_clean:
    tokens_stemmed.append(stemmer.stem(t))

In [102]:
#or as a comprehension:
tokens_stemmed = [stemmer.stem(t) for t in tokens_clean]

In [103]:
tokens_stemmed[0:10]

['mr',
 'presid',
 'mr',
 'speaker',
 'member',
 '104th',
 'congress',
 'fellow',
 'american',
 'sanctuari']

In [116]:
# now that the words are stemmed, are the most common words any different?
freq_stemmed = FreqDist(tokens_stemmed)
for f in freq_stemmed.most_common(10):
    print(f)

('peopl', 72)
('work', 69)
('year', 58)
('new', 41)
('american', 40)
('govern', 39)
('us', 37)
('countri', 34)
('cut', 32)
('let', 30)


In [117]:
#what about the unstemmed terms?
for f in freq.most_common(10):
    print(f)

('people', 72)
('work', 42)
('new', 41)
('us', 37)
('government', 34)
('country', 32)
('years', 29)
('year', 29)
('last', 28)
('time', 27)
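Comparing the two lists, "year" jumps to 58 after stemming because the separate counts for "years" (29) and "year" (29) are merged. The sketch below illustrates that merging with a deliberately naive, made-up stemmer that only strips a final "s". It is not the Porter algorithm, just a toy to show the idea.

```python
from collections import Counter

def naive_stem(word):
    """Toy stemmer (NOT Porter): strip one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

# three of the unstemmed counts shown above
raw = Counter({'years': 29, 'year': 29, 'people': 72})

# add each word's count to the count of its stem
merged = Counter()
for word, count in raw.items():
    merged[naive_stem(word)] += count

print(merged.most_common())
```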


#### Reading in multiple files

Often, our text data is split across multiple files in a folder. We can find them all using glob, a module in Python's standard library, and then read each one into a single variable.

In [118]:
import glob

In [122]:
# save all of the files that end with .txt in the sotu_text/ folder to a variable called sotu_all
sotu_all = glob.glob("sotu_text/*.txt")

In [123]:
# this just saves the file-paths to a list though
sotu_all[0:10]

['sotu_text/060.txt',
 'sotu_text/074.txt',
 'sotu_text/048.txt',
 'sotu_text/114.txt',
 'sotu_text/100.txt',
 'sotu_text/128.txt',
 'sotu_text/129.txt',
 'sotu_text/101.txt',
 'sotu_text/115.txt',
 'sotu_text/049.txt']

Now that we have a list of all the files, we need to cycle through each one and save the text from the file.

To do that we'll create a new list variable, speeches. For each file in the sotu_all variable we'll open and read the file, and save the text to the speeches list. 

In [127]:
speeches = []
for speech in sotu_all:
    # the with-block closes each file once it has been read
    with open(speech, 'r') as s:
        text = s.read()
    speeches.append(text)

In [128]:
# now we can refer to each speech from the list using the list index
speeches[45][0:250]

'Mr. President, Mr. Speaker, Members of the 104th Congress, my fellow Americans: Again we are here in the sanctuary of democracy, and once again our democracy has spoken. So let me begin by congratulating all of you here in the 104th Congress and cong'

In [130]:
#which file is that?
sotu_all[45]

'sotu_text/215.txt'

Here's a one-line list comprehension that does the same thing as the open/append loop.

In [131]:
speeches = [open(speech, 'r').read() for speech in sotu_all]

In [134]:
len(speeches)

236

In [137]:
speeches[235][0:250]

'\n\n To the Senate and House of Representatives: \n\nSince the convening of Congress one year ago the nation has undergone a prostration in business and industries such as has not been witnessed with us for many years. Speculation as to the causes for th'
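As the file list above shows, glob doesn't return files in any particular order, so we had to look up which index held a given speech. One possible alternative, sketched here with a throwaway folder and made-up file names standing in for sotu_text, is to sort the paths and store each text under its file name in a dictionary.

```python
import tempfile
from pathlib import Path

# A throwaway folder with two made-up files, standing in for sotu_text/
folder = Path(tempfile.mkdtemp())
(folder / "002.txt").write_text("second speech", encoding="utf-8")
(folder / "001.txt").write_text("first speech", encoding="utf-8")

# sorted() puts the paths in a predictable, alphabetical order
paths = sorted(folder.glob("*.txt"))

# key each text by its file name so a speech can be looked up directly
speeches_by_name = {p.name: p.read_text(encoding="utf-8") for p in paths}

print([p.name for p in paths])
print(speeches_by_name["001.txt"])
```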

## Acknowledgements
Some of the code, descriptions, and examples above are taken from:
* UC Berkeley's D-Lab [workshop on Text Analysis Fundamentals](https://dlab.berkeley.edu/training/text-analysis-fundamentals-unsupervised-approaches-10).