## Strings

In [113]:
text_string = 'Hi there'
print(text_string)

Hi there


In [120]:
type(text_string)

str

In [114]:
print(text_string[0])

H


In [116]:
text_string[1]

'i'

In [118]:
text_string[0:3]

'Hi '

In [119]:
text_string[3:]

'there'

## Reading in files

The first step is to read in the files containing the data. Common file types for text data are: 
* `.txt`
* `.csv`
* `.json`
* `.html` 
* `.xml`

Each file format requires its own Python tools or methods to read; here we'll be working with `.txt` files.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

First let's take a look at the first file in our directory of State of the Union addresses (`/sotu_text`):

In [19]:
# open the first file in the sotu_text folder for reading ("r") and store the file object in file1
file1 = open("sotu_text/001.txt", "r")

In [121]:
# file1 is a file object, not yet a text string
type(file1)

_io.TextIOWrapper

In [21]:
# read the contents of file1 into a string called "text" using .read(), then close the file
text = file1.read()
file1.close()

In [123]:
print(text[0:150])



 To the Senate and House of Representatives: 

Since the convening of Congress one year ago the nation has undergone a prostration in business and i
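
A common alternative is to open the file inside a `with` block, which closes it automatically. The sketch below is only illustrative: it writes a small stand-in file (`example.txt`, a hypothetical name) so that it runs without the `sotu_text` folder, and it assumes UTF-8 encoding, which may not match the real corpus.

```python
import os
import tempfile

# Hypothetical stand-in file so the example runs without the sotu_text folder
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example.txt")
with open(example_path, "w", encoding="utf-8") as f:
    f.write("To the Senate and House of Representatives:\n")

# The with block closes the file automatically, even if reading raises an error
with open(example_path, "r", encoding="utf-8") as f:
    example_text = f.read()

print(type(example_text))  # the contents are an ordinary string
print(f.closed)            # the file object is closed once the block ends
```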


## Tokenization

Once we've read in the data, our next step is often to split it into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.
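
To make the token/type distinction concrete, here is a minimal sketch on a made-up sentence (the sentence and variable names are hypothetical, not from the corpus). Tokens are counted with `len()` on the list, types with `len()` on a set.

```python
from collections import Counter

# Hypothetical sample sentence, not taken from the corpus
sample = "the cat sat on the mat and the dog sat too"

sample_tokens = sample.split()        # every occurrence counts as a token
sample_types = set(sample_tokens)     # each distinct word counts once as a type

print(len(sample_tokens))             # number of tokens
print(len(sample_types))              # number of types
print(Counter(sample_tokens)["the"])  # tokens belonging to the type "the"
```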

#### Tokenizing by whitespace

In [76]:
tokens = text.split()
tokens

In [77]:
type(tokens)

list

In [78]:
tokens[0]

'To'

In [80]:
tokens[0:10]

['To',
 'the',
 'Senate',
 'and',
 'House',
 'of',
 'Representatives:',
 'Since',
 'the',
 'convening']

In [83]:
tokens[-10:]

['will',
 'be',
 'forwarded',
 'to',
 'Congress',
 'without',
 'delay.',
 'U.',
 'S.',
 'GRANT']
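
Notice that splitting on whitespace leaves punctuation attached to words (`'Representatives:'`, `'delay.'`). One possible workaround, sketched here on a hypothetical snippet, is to pull out runs of word characters with `re.findall`. This is a rough approach: it also splits contractions such as "it's" into two pieces.

```python
import re

# Hypothetical snippet modelled on the opening of the speech
snippet = "To the Senate and House of Representatives: Since the convening of Congress"

whitespace_tokens = snippet.split()
# \w+ matches runs of letters, digits, and underscores, so punctuation is left out
regex_tokens = re.findall(r"\w+", snippet)

print(whitespace_tokens[6])  # punctuation still attached
print(regex_tokens[6])       # punctuation removed
```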

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences.

#### Sentence segmentation by splitting on punctuation

In [105]:
sentences = text.split('.')
sentences[0]

'\n\n To the Senate and House of Representatives: \n\nSince the convening of Congress one year ago the nation has undergone a prostration in business and industries such as has not been witnessed with us for many years'

In [106]:
sentences[27]

' But admitting that these two classes of citizens are to be benefited by expansion, would it be honest to give it? Would not the general loss be too great to justify such relief? Would it not be just as honest and prudent to authorize each debtor to issue his own legal-tenders to the extent of his liabilities? Than to do this, would it not be safer, for fear of overissues by unscrupulous creditors, to say that all debt obligations are obliterated in the United States, and now we commence anew, each possessing all he has at the time free from incumbrance? These propositions are too absurd to be entertained for a moment by thinking or honest people'

## Regular Expressions
We can improve on this with regular expressions, which let us split on any one of several characters (here `.`, `?`, and `!`) instead of a single one.

In [90]:
import re

In [92]:
boundary_pattern = r'[.?!]'
sentences_re = re.split(boundary_pattern, text)

In [107]:
sentences_re[27]

' But admitting that these two classes of citizens are to be benefited by expansion, would it be honest to give it'
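
Splitting on `[.?!]` throws away the punctuation mark itself. If we want each sentence to keep its ending, one option is to split on the whitespace that follows the punctuation, using a lookbehind. This is a sketch on a hypothetical sample; it would still break on abbreviations such as "U. S.".

```python
import re

# Hypothetical sample text, not taken from the corpus
sample = "Would it be honest to give it? These propositions are absurd. Enough!"

# Split on whitespace that follows ., ?, or !; the lookbehind (?<=...) checks
# for the punctuation without consuming it, so each sentence keeps its ending
keep_punct_pattern = r"(?<=[.?!])\s+"
sample_sentences = re.split(keep_punct_pattern, sample)

for sentence in sample_sentences:
    print(sentence)
```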

## Strip whitespace

This is an extremely common step. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [126]:
padded_string = " Hi there! "
padded_string

' Hi there! '

In [127]:
padded_string.strip()

'Hi there!'

In [131]:
text[0:250]

'\n\n To the Senate and House of Representatives: \n\nSince the convening of Congress one year ago the nation has undergone a prostration in business and industries such as has not been witnessed with us for many years. Speculation as to the causes for th'

In [132]:
# we can remove whitespace at the beginning and end of a string using .strip()
stripped_text = text.strip()
stripped_text[0:250]

'To the Senate and House of Representatives: \n\nSince the convening of Congress one year ago the nation has undergone a prostration in business and industries such as has not been witnessed with us for many years. Speculation as to the causes for this '

In [134]:
# or we can use regular expressions to collapse every run of whitespace into a single space
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text[0:250]

' To the Senate and House of Representatives: Since the convening of Congress one year ago the nation has undergone a prostration in business and industries such as has not been witnessed with us for many years. Speculation as to the causes for this p'
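
The two approaches can be combined: collapse internal whitespace first, then trim the ends. A minimal sketch on made-up survey-style responses (the list below is hypothetical):

```python
import re

# Hypothetical survey-style responses with messy whitespace
responses = ["  Yes ", "No\n", "\tMaybe   later  "]

# .strip() trims whitespace from both ends only
trimmed = [r.strip() for r in responses]

# Collapse internal runs of whitespace, then trim the ends
collapsed = [re.sub(r"\s+", " ", r).strip() for r in responses]

print(trimmed)
print(collapsed)
```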

## Text normalization
Text normalization cleans our text so that it fits a standard pattern. One common normalization step is to remove case distinctions by converting everything to lowercase.

In [136]:
caps_string = "Hi There! Can you believe it's 2020?"
caps_string.lower()

"hi there! can you believe it's 2020?"

In [174]:
clean_text = clean_text.lower()

In [146]:
# remove digits using regex
digits = r'\d+'
re.sub(digits, '', caps_string)

"Hi There! Can you believe it's ?"

In [147]:
# note that since we didn't assign the changes to the string variable, the changes aren't "saved"
caps_string

"Hi There! Can you believe it's 2020?"
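
To keep the changes, assign the result back to a variable. The sketch below bundles the steps shown so far into one helper; `normalize` is a hypothetical name, not a built-in function or part of any library.

```python
import re

def normalize(raw):
    """Lowercase, drop digits, and tidy whitespace (hypothetical helper)."""
    lowered = raw.lower()
    no_digits = re.sub(r"\d+", "", lowered)
    return re.sub(r"\s+", " ", no_digits).strip()

caps_example = "Hi There! Can you believe it's 2020?"
caps_example = normalize(caps_example)   # reassigning keeps the change
print(caps_example)
```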

## Removing punctuation

Sometimes you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. Here's how we can do that.


In [154]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [155]:
# .strip() only removes punctuation from the two ends of the string
caps_string.strip(string.punctuation)

"Hi There! Can you believe it's 2020"

In [158]:
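# iterating over a string yields one character at a time, so each punctuation character is stripped to ''
''.join(char.strip(string.punctuation) for char in caps_string)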

'Hi There Can you believe its 2020'

In [175]:
clean_text = ''.join(char.strip(string.punctuation) for char in clean_text)
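
Another way to do the same thing is `str.translate` with a table built by `str.maketrans`, which deletes every listed character in one pass. A small sketch (the variable names are illustrative):

```python
import string

# Translation table that deletes every ASCII punctuation character
remove_punct = str.maketrans("", "", string.punctuation)

example = "Hi There! Can you believe it's 2020?"
no_punct = example.translate(remove_punct)
print(no_punct)
```

Like the approach above, this only covers the ASCII characters in `string.punctuation`; curly quotes and other non-ASCII punctuation would pass through untouched.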

### Remove anything but letters

In [164]:
letters_only = r'[^A-Za-z]+'
re.sub(letters_only, ' ', caps_string)

'Hi There Can you believe it s '

### Tokenizing with the Natural Language Toolkit (NLTK)

In [85]:
from nltk.tokenize import word_tokenize

In [176]:
tokens = word_tokenize(clean_text)
tokens[:10]

['to',
 'the',
 'senate',
 'and',
 'house',
 'of',
 'representatives',
 'since',
 'the',
 'convening']

In [177]:
## counting tokens
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

[('the', 917),
 ('of', 681),
 ('to', 385),
 ('and', 329),
 ('in', 208),
 ('a', 161),
 ('be', 134),
 ('for', 103),
 ('that', 101),
 ('is', 90)]

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "the", "of", "to", and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

In [178]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [179]:
tokens_clean = [] 
  
for w in tokens: 
    if w not in stop: 
        tokens_clean.append(w)
tokens_clean[0:10]

['senate',
 'house',
 'representatives',
 'since',
 'convening',
 'congress',
 'one',
 'year',
 'ago',
 'nation']
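
The same filter can be written as a list comprehension. Converting the stop list to a set first makes each membership check faster on long texts. The sketch below uses a tiny hypothetical stop list so that it runs without NLTK; in this notebook you would use `set(stop)` instead.

```python
# Hypothetical mini stop list; in the notebook this would be set(stop)
stop_set = {"the", "of", "to", "and", "in", "a"}

example_tokens = ["to", "the", "senate", "and", "house", "of", "representatives"]

# Set membership checks take roughly constant time, unlike scanning a list
kept = [w for w in example_tokens if w not in stop_set]
print(kept)
```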

In [180]:
freq = Counter(tokens_clean)
freq.most_common(10)

[('that', 101),
 ('it', 87),
 ('this', 81),
 ('which', 61),
 ('congress', 52),
 ('government', 48),
 ('states', 42),
 ('united', 32),
 ('these', 30),
 ('would', 29)]

## Stemming/lemmatization

Stemming and lemmatization both reduce a word to a base form by removing morphological affixes. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because we often care more about the core content of the word (i.e. that it has something to do with growth or running) than about the fact that it's a third-person present-tense verb or a progressive participle.

NLTK provides several stemming algorithms. For English, a good baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.
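
To illustrate the "bunch of regular expressions" idea, here is a toy suffix stripper. It is not the Porter algorithm: the rule list and the `toy_stem` name are made up for this sketch, and the real algorithm has many more rules and conditions (for instance, this toy version turns "running" into "runn").

```python
import re

# Hypothetical rule list: (suffix pattern, replacement), tried in order
TOY_RULES = [
    (r"sses$", "ss"),
    (r"ies$", "i"),
    (r"(?<!s)s$", ""),   # drop a final "s" unless the word ends in "ss"
    (r"ing$", ""),
    (r"ed$", ""),
]

def toy_stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for pattern, replacement in TOY_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for w in ["grows", "states", "united", "congress"]:
    print(w, "->", toy_stem(w))
```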

In [181]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('states')

'state'

In [182]:
stemmer.stem('united')

'unit'

In [183]:
stemmer.stem('government')

'govern'

In [184]:
tokens_stemmed = []
for t in tokens_clean:
    tokens_stemmed.append(stemmer.stem(t))

In [185]:
tokens_stemmed[0:10]

['senat',
 'hous',
 'repres',
 'sinc',
 'conven',
 'congress',
 'one',
 'year',
 'ago',
 'nation']

#### Reading in multiple files

Often, our text data is split across multiple files in a folder. We can collect the matching file paths with Python's built-in `glob` module, then read each file into a single list.

In [24]:
import glob

In [59]:
sotu_all = glob.glob("sotu_text/*.txt")

In [67]:
type(sotu_all)

list

### Introduce lists

In [68]:
speeches = []
for speech in sotu_all:
    # the with block closes each file once it has been read
    with open(speech, 'r') as s:
        speeches.append(s.read())

In [73]:
#speeches[3]
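
`glob.glob` returns paths in arbitrary order, so it can help to sort them and to keep track of which text came from which file. The sketch below builds a small stand-in folder (hypothetical file names and contents) so that it runs without `sotu_text`, and it assumes UTF-8 encoding.

```python
import glob
import os
import tempfile

# Hypothetical stand-in folder so the example runs without sotu_text
folder = tempfile.mkdtemp()
for name, body in [("002.txt", "second speech"), ("001.txt", "first speech")]:
    with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
        f.write(body)

# glob makes no promise about order, so sort the paths explicitly
paths = sorted(glob.glob(os.path.join(folder, "*.txt")))

# Map each file name to its text so speeches stay linked to their source
speeches_by_file = {}
for path in paths:
    with open(path, "r", encoding="utf-8") as f:
        speeches_by_file[os.path.basename(path)] = f.read()

print(list(speeches_by_file))
```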