# Preprocessing

Like other data types, text data never comes clean. Moreover, most downstream methods only accept data structured in a particular way. Because of this, before we apply any computational text analysis technique, we always need to do some preprocessing, and text data calls for its own kind. In this notebook, we will cover the core preprocessing methods in preparation for our next two weeks:

- Reading in files
- Tokenization
- Sentence segmentation
- Removing punctuation
- Stripping whitespace
- Text normalization
- Stop words
- Stemming/Lemmatizing
- POS tagging
- DTM/TF-IDF


## Reading in files

The first step is to read in the files containing the data. As we discussed last week, the most common file types for text data are: `.txt`, `.csv`, `.json`, `.html` and `.xml`.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

- What type of object is `raw`?
- How many characters are in `raw`?
- How can we get the first 1000 characters of `raw`?

In [95]:
import os
DATA_DIR = 'data'
fname = 'sowing-and-reaping.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()

In [96]:
raw[:1000]

"The Project Gutenberg eBook, Sowing and Reaping, by Frances Ellen Watkins\nHarper, Edited by Frances Smith Foster\n\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.net\n\n\n\n\n\nTitle: Sowing and Reaping\n\nAuthor: Frances Ellen Watkins Harper\n\nRelease Date: February 10, 2004  [eBook #11022]\n\nLanguage: English\n\nCharacter set encoding: US-ASCII\n\n\n***START OF THE PROJECT GUTENBERG EBOOK SOWING AND REAPING***\n\n\nE-text prepared by Juliet Sutherland, Andrea Ball, and the Project\nGutenberg Online Distributed Proofreading Team\n\n\n\nTranscriber's Note: This document is the text of Sowing and Reaping.\n                    Any bracketed notations such as [Text missing],\n                    [?], and those inserting letters or other comments\n                    are from the origi

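The questions above can be explored with a few built-ins. The following is a minimal sketch, not the notebook's own cell: it writes a short stand-in text to a temporary file (the `sample` string and the file name `sample.txt` are made up for illustration) so that it runs without the `data` folder.

```python
import os
import tempfile

# Hypothetical stand-in text, so the example runs without the data folder.
sample = "The Project Gutenberg eBook, Sowing and Reaping\n\nTitle: Sowing and Reaping\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample)
    # Reading works exactly as in the cell above: the whole file becomes one string.
    with open(sample_path, encoding="utf-8") as f:
        raw_sample = f.read()

raw_type = type(raw_sample)    # the class of the object returned by read()
n_chars = len(raw_sample)      # number of characters, newlines included
first_chars = raw_sample[:11]  # slicing works as on any other string
```

The same three expressions (`type`, `len`, and a slice) apply unchanged to `raw`.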
#### Reading in `.csv`

Python has a built-in module called `csv` for reading in csv files.

- What type is `tweets`?
- How many entries are in `tweets`?
- Which entry is the header row?
- How can we get the text of the first tweet?
- How can we get a list of the texts of all tweets?

In [97]:
import codecs
import csv
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
# errors='ignore' skips bytes that cannot be decoded as UTF-8
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:
    reader = csv.reader(f)
    tweets = list(reader)

In [98]:
tweets[:10]

[['Date',
  'Time',
  'Tweet_Text',
  'Type',
  'Media_Type',
  'Hashtags',
  'Tweet_Id',
  'Tweet_Url',
  'twt_favourites_IS_THIS_LIKE_QUESTION_MARK',
  'Retweets',
  '',
  ''],
 ['16-11-11',
  '15:26:37',
  'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z',
  'text',
  'photo',
  'ThankAVet',
  '7.97E+17',
  'https://twitter.com/realDonaldTrump/status/797098212599496704',
  '127213',
  '41112',
  '',
  ''],
 ['16-11-11',
  '13:33:35',
  'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
  'text',
  '',
  '',
  '7.97E+17',
  'https://twitter.com/realDonaldTrump/status/797069763801387008',
  '141527',
  '28654',
  '',
  ''],
 ['16-11-11',
  '11:14:20',
  'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
  'text',
  '',
  '',
  '7.97E+17',
  '
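With `csv.reader`, the header is simply the first row, and we have to remember column positions ourselves. As a hedged alternative, the standard library's `csv.DictReader` uses the header row as keys. The sketch below runs on a small made-up CSV string (the rows are hypothetical, not taken from the tweets file) read through `io.StringIO`.

```python
import csv
import io

# Hypothetical miniature CSV; the second text contains a comma inside quotes.
csv_text = (
    "Date,Tweet_Text,Retweets\n"
    '16-11-11,"A first example tweet.",10\n'
    '16-11-12,"A second tweet, with a comma.",20\n'
)

# DictReader consumes the header row and yields one dictionary per data row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Pull out a single column by name instead of by position.
texts = [row["Tweet_Text"] for row in rows]
```

Note that every value comes back as a string, including the retweet counts; converting them to numbers is a separate step.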

#### Reading in `.csv` with `pandas`

`pandas` is a third-party library that makes working with tabular data much easier. This is the recommended way to read in a `.csv` file.

- How many tweets are there?
- What happened to the header row?

In [99]:
import os
import pandas as pd
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = pd.read_csv(fname) 

In [100]:
tweets.head(3)

Unnamed: 0,Date,Time,Tweet_Text,Type,Media_Type,Hashtags,Tweet_Id,Tweet_Url,twt_favourites_IS_THIS_LIKE_QUESTION_MARK,Retweets,Unnamed: 10,Unnamed: 11
0,16-11-11,15:26:37,Today we express our deepest gratitude to all ...,text,photo,ThankAVet,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,127213,41112,,
1,16-11-11,13:33:35,Busy day planned in New York. Will soon be mak...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,141527,28654,,
2,16-11-11,11:14:20,Love the fact that the small groups of protest...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,183729,50039,,


In [101]:
tweet_text = list(tweets['Tweet_Text'])
tweet_text[:4]

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!']

#### Reading in `.json` files

Python has built-in support for reading in `.json` files.

- How many fires are there in the dataset?
- What data type is each fire name?
- How can we access the location of the first fire?
- How can we get a list of the names of all fires?

In [102]:
import json
fname = 'fires.json'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    data = json.load(f)

In [103]:
data[:4]

[{'Name': 'Bear Fire ',
  'Final': True,
  'Updated': '2020-02-16T09:24:05Z',
  'Started': '2020-02-15T17:48:27Z',
  'AdminUnit': 'CAL FIRE Humboldt-Del Norte Unit',
  'AdminUnitUrl': None,
  'County': 'Humboldt',
  'Location': 'Bear Creek Road and Anderson Ford Road, Northwest of Highway 36 at Dinsmore',
  'AcresBurned': 15.0,
  'PercentContained': 100.0,
  'ControlStatement': None,
  'AgencyNames': '',
  'Longitude': -123.6378411,
  'Latitude': 40.511092,
  'Type': 'Wildfire',
  'UniqueId': '94f20083-2412-4058-bc16-afa6b45c3935',
  'Url': 'https://www.fire.ca.gov/incidents/2020/2/15/bear-fire/',
  'ExtinguishedDate': '',
  'ExtinguishedDateOnly': '',
  'StartedDateOnly': '2020-02-15',
  'IsActive': False,
  'CalFireIncident': True,
  'NotificationDesired': False},
 {'Name': 'Antelope Fire',
  'Final': True,
  'Updated': '2020-02-24T14:41:27Z',
  'Started': '2020-02-17T15:04:08Z',
  'AdminUnit': 'USFS Tahoe National Forest',
  'AdminUnitUrl': None,
  'County': 'Sierra',
  'Location': 

In [104]:
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Name,Final,Updated,Started,AdminUnit,AdminUnitUrl,County,Location,AcresBurned,PercentContained,...,Latitude,Type,UniqueId,Url,ExtinguishedDate,ExtinguishedDateOnly,StartedDateOnly,IsActive,CalFireIncident,NotificationDesired
0,Bear Fire,True,2020-02-16T09:24:05Z,2020-02-15T17:48:27Z,CAL FIRE Humboldt-Del Norte Unit,,Humboldt,"Bear Creek Road and Anderson Ford Road, Northw...",15.0,100.0,...,40.511092,Wildfire,94f20083-2412-4058-bc16-afa6b45c3935,https://www.fire.ca.gov/incidents/2020/2/15/be...,,,2020-02-15,False,True,False
1,Antelope Fire,True,2020-02-24T14:41:27Z,2020-02-17T15:04:08Z,USFS Tahoe National Forest,,Sierra,"Hwy 49 and Fillippini Road, Sierraville",102.0,100.0,...,39.6923,Wildfire,c9bb59f7-be32-4296-8c11-3f1233116827,https://www.fire.ca.gov/incidents/2020/2/17/an...,2020-02-20T14:40:00Z,2020-02-20,2020-02-17,False,False,False
2,Beegum Fire,True,2020-02-24T14:32:24Z,2020-02-23T08:35:30Z,CAL FIRE Tehama-Glenn Unit,,Tehama,"Off of Highway 36 West and Tedoc Rd, West of R...",75.0,100.0,...,40.335833,Wildfire,f6e7a62d-4796-452d-8e41-66ed1aef6d63,https://www.fire.ca.gov/incidents/2020/2/23/be...,2020-02-24T14:32:00Z,2020-02-24,2020-02-23,False,True,False
3,Wood Fire,True,2020-02-24T14:45:55Z,2020-02-23T17:50:20Z,Bureau of Land Management,,Lassen,"Off Horselake Road and Woodranch Road, East of...",57.0,100.0,...,40.62155,Wildfire,edd2012d-0574-46a1-97d5-9ca42e7c5912,https://www.fire.ca.gov/incidents/2020/2/23/wo...,2020-02-23T14:45:00Z,2020-02-23,2020-02-23,False,False,False
4,Baseball Fire,True,2020-03-09T11:22:07Z,2020-02-25T14:17:21Z,Mendocino National Forest,,"Glenn, Mendocino","Off Atchison Creek, East of Covelo",211.0,100.0,...,39.756389,Wildfire,ad9ab6b9-53ed-417c-a490-6b594974f110,https://www.fire.ca.gov/incidents/2020/2/25/ba...,2020-03-03T11:21:00Z,2020-03-03,2020-02-25,False,False,False
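Once loaded, JSON data is ordinary Python lists and dictionaries, so the questions above come down to indexing and a list comprehension. A small sketch, using a cut-down two-record string in place of `fires.json` (only three of the fields shown above are kept):

```python
import json

# Cut-down stand-in for fires.json: a list of objects with a few of the fields.
json_text = """
[
  {"Name": "Bear Fire ", "County": "Humboldt", "AcresBurned": 15.0},
  {"Name": "Antelope Fire", "County": "Sierra", "AcresBurned": 102.0}
]
"""

fires = json.loads(json_text)  # json.loads parses a string; json.load reads a file

n_fires = len(fires)                 # the outer container is a list
first_county = fires[0]["County"]    # each element is a dict keyed by field name
names = [fire["Name"].strip() for fire in fires]  # strip() drops the trailing space in 'Bear Fire '
```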


#### Reading in `.html` files

The best way to read in `.html` files in Python is with the `BeautifulSoup` package.

In [105]:
import codecs
from bs4 import BeautifulSoup
fname = 'time.html'
fname = os.path.join(DATA_DIR, fname)
# errors='ignore' skips bytes that cannot be decoded as UTF-8
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:
    soup = BeautifulSoup(f, "html")

In [106]:
texts = soup.find_all(string=True)  # find_all is the current name for findAll
texts[:5]

['html', '\n', '\n', '\n', 'Time - Wikipedia']

#### Reading in `.xml` files

We read in `.xml` files using the `BeautifulSoup` package as well. We can think of `.xml` files as trees where each branch has a tag name. We can find all the branches with a certain name as follows:

In [107]:
fname = 'books.xml'
fname = os.path.join(DATA_DIR, fname)
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:
    soup = BeautifulSoup(f, 'lxml')

In [108]:
descriptions = soup.find_all('description')  # find_all is the current name for findAll
text = [x.get_text() for x in descriptions] ## list comprehension
text[:3]

['An in-depth look at creating applications \r\n      with XML.',
 'A former architect battles corporate zombies, \r\n      an evil sorceress, and her own childhood to become queen \r\n      of the world.',
 'After the collapse of a nanotechnology \r\n      society in England, the young survivors lay the \r\n      foundation for a new society.']
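If installing a third-party package is not an option, the standard library's `xml.etree.ElementTree` can do a similar job. The sketch below is an alternative to the `BeautifulSoup` cell, not a replacement for it; the `catalog`/`book` layout of the inline XML is an assumption made for illustration, and only the `description` tag is known from the output above.

```python
from xml.etree import ElementTree as ET

# Assumed layout: a root element holding <book> entries with a <description> each.
xml_text = """<catalog>
  <book>
    <description>An in-depth look at creating applications
      with XML.</description>
  </book>
  <book>
    <description>A former architect battles corporate zombies.</description>
  </book>
</catalog>"""

root = ET.fromstring(xml_text)

# iter() walks the whole tree and yields every element with the given tag.
# split() followed by join() collapses the line break and indentation inside the text.
descriptions = [" ".join(el.text.split()) for el in root.iter("description")]
```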

#### Reading in multiple files

Often, our text data is split across multiple files in a folder. We want to be able to read them all into a single variable.

- What type is `harper`?
- What type is `fnames` after it is first assigned a value?
- What type is `fnames` after it is assigned a second value?
- How 

In [109]:
import glob
fnames = os.path.join(DATA_DIR, 'harper', '*.txt')
fnames = glob.glob(fnames)
harper = ''
for fname in fnames:
    with codecs.open(fname, "r", encoding='utf-8-sig', errors='ignore') as f:
        text = f.read()
        harper += text

In [110]:
harper[:10000]

'Project Gutenberg\'s Minnie\'s Sacrifice, by Frances Ellen Watkins Harper\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Minnie\'s Sacrifice\r\n\r\nAuthor: Frances Ellen Watkins Harper\r\n\r\nRelease Date: February 12, 2004 [EBook #11053]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK MINNIE\'S SACRIFICE ***\r\n\r\n\r\n\r\n\r\nProduced by Juliet Sutherland, Andrea Ball and the Online Distributed\r\nProofreading Team.\r\n\r\n\r\n\r\n\r\n\r\nTranscriber\'s Note: This document is the text of Minnie\'s Sacrifice. Any\r\n                    bracketed notations such as [Text missing], [?], and\r\n                    those inserting letters or other comments are from\r\n                    the original text.\r\n\r\nTranscriber\'s
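A common variation is to collect each file's text in a list and join once at the end, which also keeps the individual documents available. This is a sketch under stated assumptions: it creates two tiny files in a temporary folder (the names `a.txt` and `b.txt` are made up) and sorts the paths, since `glob.glob` makes no promise about ordering.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two hypothetical files to stand in for the 'harper' folder.
    for name, body in [("a.txt", "first file\n"), ("b.txt", "second file\n")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(body)

    # Sort the matches so the files are always read in the same order.
    paths = sorted(glob.glob(os.path.join(tmp_dir, "*.txt")))

    texts = []
    for path in paths:
        # 'utf-8-sig' drops a byte-order mark if the file starts with one.
        with open(path, encoding="utf-8-sig") as f:
            texts.append(f.read())

combined = "".join(texts)  # one string, like `harper` above
```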

### Challenge - SOLUTION

Read in the `.csv` files in the folder `amazon`. Extract only the text column from the first two files and store the texts in a list called `reviews`.

In [17]:
fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
fnames = glob.glob(fnames)
reviews = []
column_names = ['id', 'product_id', 'user_id', 'profile_name', 'helpfulness_num', 'helpfulness_denom',
               'score', 'time', 'summary', 'text']

In [18]:
fnames[:2]

['data\\amazon\\xaa.csv', 'data\\amazon\\xab.csv']

In [19]:
for fname in fnames[:2]:
    df = pd.read_csv(fname, names=column_names)
    text = list(df['text'])
    reviews.extend(text)

reviews

['Text',
 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.',
 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
 'This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.',
 'If you are 

## Tokenization

Once we've read in the data, our next step is often to split it into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.
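To make the token/type distinction concrete, here is a small sketch on a made-up sentence. It splits on whitespace purely for illustration; better tokenizers follow below.

```python
sentence = "the cat sat on the mat because the mat was warm"

tokens = sentence.split()  # every occurrence of a word is a token
types = set(tokens)        # each distinct word is a type

n_tokens = len(tokens)
n_types = len(types)
the_count = tokens.count("the")  # one type, several tokens
```

Here the sentence has 11 tokens but only 8 types, and the type "the" accounts for 3 of the tokens.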

#### Tokenizing by whitespace

- What problems do you notice with tokenizing by whitespace?
- What type is `text`?
- What type is `tokens`?
- What type is each element of `tokens`?

In [20]:
import os
fname = 'example1.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [21]:
text

"In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way. \n"

In [22]:
text.split()[:10]

['In',
 'this',
 'little',
 'example,',
 "we're",
 'going',
 'to',
 'see',
 'some',
 'of']

#### Tokenizing with regular expressions

In [23]:
import re
word_pattern = r'\w+'
tokens = re.findall(word_pattern, text)
tokens[:10]

['In', 'this', 'little', 'example', 'we', 're', 'going', 'to', 'see', 'some']
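The pattern `\w+` breaks contractions apart ("we", "re"). One possible refinement, offered as a sketch rather than a complete tokenizer, is to allow an apostrophe followed by more word characters inside a token. The sample sentence is made up.

```python
import re

sample = "We're sure it's harder than it first appears, don't you think?"

# A word, optionally followed by one or more groups of apostrophe plus word characters.
contraction_pattern = r"\w+(?:'\w+)*"
contraction_tokens = re.findall(contraction_pattern, sample)
```

This keeps "We're" and "don't" whole but still drops punctuation entirely, which may or may not be what a downstream method needs.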

#### Tokenizing with `nltk`

[Just a bunch of regular expressions under the hood](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py)

In [24]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
tokens[:10]

['In', 'this', 'little', 'example', ',', 'we', "'re", 'going', 'to', 'see']

#### Challenge - SOLUTION

A while ago you read a number of Frances Ellen Watkins Harper books into a variable called `harper`. Tokenize it using a method of your choice. Find all the unique word types (you might want the `set` function). Sort the resulting set to create a vocabulary (you might want the `sorted` function).

In [25]:
tokens = word_tokenize(harper)
tokens[0]

'Project'

In [26]:
tokens[:10]

['Project',
 'Gutenberg',
 "'s",
 'Minnie',
 "'s",
 'Sacrifice',
 ',',
 'by',
 'Frances',
 'Ellen']

In [27]:
vocab = sorted(set(tokens))
vocab[1000]

'Legion'

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences.

#### Sentence segmentation by splitting on punctuation

In [28]:
text.split('.')

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 " Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way",
 ' \n']

We can improve on this with regular expressions, which let us split a string on any of several characters at once.

In [29]:
sent_boundary_pattern = r'[.?!]'
re.split(sent_boundary_pattern, text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 ' Why is it so hard',
 " Punctuations, contractions (like don't, won't and would've) get in the way",
 ' \n']
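Splitting on `[.?!]` throws away the punctuation and leaves leading spaces. A hedged variation is to split on the whitespace that follows a sentence-final character, using a lookbehind so the punctuation stays attached. The sample text is made up, and the pattern still has no notion of context, so abbreviations would trip it up just the same.

```python
import re

sample = "Tokenization may seem simple. Why is it so hard? Punctuation gets in the way!"

# (?<=[.?!]) requires a sentence-final character just before the split point;
# \s+ is the whitespace that actually gets consumed by the split.
boundary_with_lookbehind = r"(?<=[.?!])\s+"
sentences = re.split(boundary_with_lookbehind, sample)
```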

### Challenge - SOLUTION

The file `example2.txt` has more punctuation problems. Read it in and see what the problems are. Try your best to modify the code from above to work for as many cases as you can.

In [30]:
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
re.split(sent_boundary_pattern, text) 
# Simply looking for certain characters gives us problems. There's no notion of context in the
# regular expression above.

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 ' Why is it so hard',
 " Punctuations, contractions (like don't, won't, and would've) get in the way",
 " \n\nWe can split text into sentences using punctuation, but unfortunately that's not always going to work",
 ' For example, if I wanted to tell you about Dr',
 ' Bailey, or Ms',
 " Ndegeocello, we'd be in trouble",
 ' What if I wanted to write about U',
 'C',
 ' Berkeley',
 ' When you think about it, URLs like www',
 'google',
 'com are troublesome too',
 ' How would we settle on a price of $10',
 '50',
 ' The main point is that these punctuation characters serve a variety of purposes in writing',
 ' Moreover, the functions they serve change depending on the domain (medical vs forum text) and language',
 '']

#### Sentence segmentation by `nltk`

In [31]:
from nltk.tokenize import sent_tokenize
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
sent_tokenize(text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization.",
 "Tokenization may seem simple, but it's harder than it first appears.",
 'Why is it so hard?',
 "Punctuations, contractions (like don't, won't, and would've) get in the way.",
 "We can split text into sentences using punctuation, but unfortunately that's not always going to work.",
 "For example, if I wanted to tell you about Dr. Bailey, or Ms. Ndegeocello, we'd be in trouble.",
 'What if I wanted to write about U.C.',
 'Berkeley?',
 'When you think about it, URLs like www.google.com are troublesome too.',
 'How would we settle on a price of $10.50?',
 'The main point is that these punctuation characters serve a variety of purposes in writing.',
 'Moreover, the functions they serve change depending on the domain (medical vs forum text) and language.']

## Removing punctuation

Sometimes (although admittedly less frequently than tokenizing and sentence segmentation), you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. Here's how we can do that.

- What type is `punctuation`?

In [32]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [33]:
no_punct = ''.join([ch for ch in text if ch not in punctuation])
no_punct

'In this little example were going to see some of the problems that regularly appear in tokenization Tokenization may seem simple but its harder than it first appears Why is it so hard Punctuations contractions like dont wont and wouldve get in the way \n\nWe can split text into sentences using punctuation but unfortunately thats not always going to work For example if I wanted to tell you about Dr Bailey or Ms Ndegeocello wed be in trouble What if I wanted to write about UC Berkeley When you think about it URLs like wwwgooglecom are troublesome too How would we settle on a price of 1050 The main point is that these punctuation characters serve a variety of purposes in writing Moreover the functions they serve change depending on the domain medical vs forum text and language'
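The same result can be had with `str.translate`, which deletes characters in a single pass. A minimal sketch on a sample sentence:

```python
from string import punctuation

sample = "Punctuations, contractions (like don't, won't and would've) get in the way."

# The third argument to str.maketrans lists characters to delete.
delete_punct = str.maketrans("", "", punctuation)
no_punct_sample = sample.translate(delete_punct)
```

Keep in mind that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would survive either approach.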

## Strip whitespace

This is an extremely common step. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [34]:
string = ' Hello! '
string.strip()

'Hello!'

In [35]:
fname = 'example3.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
print(text)



This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.






In [36]:
stripped_text = text.strip()
print(stripped_text)

This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.


In [37]:
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text.strip()

'This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines. The Python method called "strip" only catches whitespace at the start and end of a string. But it won\'t catch it in the middle, for example, in this sentence. Once again, regular expressions will help us with this.'
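For this particular job there is also a regex-free idiom: `split()` with no arguments splits on any run of whitespace and ignores leading and trailing whitespace, so joining the pieces with a single space gives the same result. A short sketch on a made-up string:

```python
import re

sample = "\n\n  it won't catch it in       the middle,\t\tfor example,\n\nin this sentence.  \n"

# split() with no arguments handles spaces, tabs and newlines alike.
collapsed = " ".join(sample.split())

# The regular-expression version from above, for comparison.
regex_version = re.sub(r"\s+", " ", sample).strip()
```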

## Text normalization

Text normalization means making our text fit some standard patterns. Lots of steps come under this wide umbrella, but the most common are:

- case folding
- removing URLs, digits, hashtags
- OOV (removing infrequent words)

#### Case folding

Case folding means dealing with upper- and lower-case characters. This is usually done by converting all characters to lower case.

In [38]:
fname = 'example4.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
text

'Upper and lower case characters can be annoying. Characters are the individual letters and numbers that we see on the page. Case folding is the generic term we use for dealing with upper and lower case characters. Lower case is often what people want. Title Case refers to a multi-word expression with the first character of every word in upper case. '

In [39]:
text.lower()

'upper and lower case characters can be annoying. characters are the individual letters and numbers that we see on the page. case folding is the generic term we use for dealing with upper and lower case characters. lower case is often what people want. title case refers to a multi-word expression with the first character of every word in upper case. '

### Challenge - SOLUTION

The `lower` method we used above is a string method; that is, it works on strings. But what if you want to lowercase every word in a list (say you've already tokenized the text)? Take the list of tokens below and make each one lower case.

In [40]:
tokens = word_tokenize(text)
lowercase_tokens = []
for token in tokens:
    lowercased_version = token.lower()
    lowercase_tokens.append(lowercased_version)
lowercase_tokens

['upper',
 'and',
 'lower',
 'case',
 'characters',
 'can',
 'be',
 'annoying',
 '.',
 'characters',
 'are',
 'the',
 'individual',
 'letters',
 'and',
 'numbers',
 'that',
 'we',
 'see',
 'on',
 'the',
 'page',
 '.',
 'case',
 'folding',
 'is',
 'the',
 'generic',
 'term',
 'we',
 'use',
 'for',
 'dealing',
 'with',
 'upper',
 'and',
 'lower',
 'case',
 'characters',
 '.',
 'lower',
 'case',
 'is',
 'often',
 'what',
 'people',
 'want',
 '.',
 'title',
 'case',
 'refers',
 'to',
 'a',
 'multi-word',
 'expression',
 'with',
 'the',
 'first',
 'character',
 'of',
 'every',
 'word',
 'in',
 'upper',
 'case',
 '.']
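The loop above can also be written as a list comprehension. The sketch below uses a short hand-made token list, and additionally shows `str.casefold`, a more aggressive variant of `lower` intended for caseless matching across languages.

```python
sample_tokens = ["Title", "Case", "refers", "to", "a", "Multi-Word", "Expression", "."]

# One expression instead of a loop with append.
lowercase_sample = [token.lower() for token in sample_tokens]

# casefold() goes further than lower() for some characters, such as the German sharp s.
folded = "Straße".casefold()
```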

#### Removing URLs, digits and hashtags

We rarely care about the exact URL used in a tweet, or the exact number. We could remove them completely (think about how we'd do that), but it's often informative to know that there is a URL or a digit in the text. So we want to replace individual URLs and digits with a placeholder that preserves the fact that a URL or digit was there. It's standard to just use the strings "URL" and "DIGIT".

How do we do this? Once again, regular expressions save the day.

In [41]:
url_pattern = r'https?:\/\/.*[\r\n]*'
single_tweet = tweet_text[0]
single_tweet

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z'

In [42]:
URL_SIGN = ' URL '
re.sub(url_pattern, URL_SIGN, single_tweet)

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL '

Above we replaced the URL in a single tweet. Now we will replace all the URLs in all tweets in `tweet_text`.

In [43]:
url_pattern = r'https?:\/\/.*[\r\n]*'
URL_SIGN = ' URL '
list_of_url_less_tweets = []
## Using a for loop
for tweet in tweet_text:
    url_less_tweet = re.sub(url_pattern, URL_SIGN, tweet)
    list_of_url_less_tweets.append(url_less_tweet)
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment

In [44]:
## Alternative using list comprehension
list_of_url_less_tweets = [re.sub(url_pattern, URL_SIGN, tweet) for tweet in tweet_text]
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment
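One thing to watch: the URL pattern used here ends in `.*`, so it matches from the start of the URL to the end of the line. That is harmless when the URL comes last, as in these tweets, but it would swallow any text that follows a URL. The sketch below compares it with a narrower pattern that stops at the first whitespace; the sample string and its URL are made up.

```python
import re

greedy_pattern = r'https?:\/\/.*[\r\n]*'  # the pattern from the cells above
narrow_pattern = r'https?://\S+'          # stops at the first whitespace character

sample = "Thank you! https://t.co/abc123 See you tomorrow."

greedy_result = re.sub(greedy_pattern, ' URL ', sample)
narrow_result = re.sub(narrow_pattern, ' URL ', sample)
```

With the greedy pattern, "See you tomorrow." disappears along with the URL; with the narrow one it is kept. Which behavior is wanted depends on the data, so it is worth checking a few examples before cleaning a whole corpus.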

Now let's replace hashtags and digits.

In [45]:
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
HASHTAG_SIGN = ' HASHTAG '
digit_pattern = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged
DIGIT_SIGN = ' DIGIT '

In [46]:
no_hashtags = [re.sub(hashtag_pattern, HASHTAG_SIGN, tweet) for tweet in tweet_text]
no_hashtags

['Today we express our deepest gratitude to all those who have served in our armed forces. HASHTAG  https://t.co/wPk7QWpK8Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz2dhrXzo4',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm. HASHTAG  HASHTAG  https://t.co/HfuJeRZ

In [47]:
no_digit = [re.sub(digit_pattern, DIGIT_SIGN, tweet) for tweet in tweet_text]
no_digit

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk DIGIT QWpK DIGIT Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy  DIGIT st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz DIGIT dhrXzo DIGIT ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at  DIGIT : DIGIT p

#### OOV words

OOV stands for "out of vocabulary". Sometimes it's best to remove infrequent words or replace them with a placeholder (sometimes not!). When we do, it's often for a downstream method (like classification) that is sensitive to rare words.

In [48]:
all_tweets = ' '.join(tweet_text)
clean = re.sub(url_pattern, URL_SIGN, all_tweets)
clean = re.sub(hashtag_pattern, HASHTAG_SIGN, clean)
clean = re.sub(digit_pattern, DIGIT_SIGN, clean)
tokens = word_tokenize(clean)
tokens = [token for token in tokens if token not in punctuation]
tokens[:20]

['Today',
 'we',
 'express',
 'our',
 'deepest',
 'gratitude',
 'to',
 'all',
 'those',
 'who',
 'have',
 'served',
 'in',
 'our',
 'armed',
 'forces',
 'HASHTAG',
 'URL',
 'HASHTAG',
 'HASHTAG']

We can count the frequency of each word type with `Counter` from Python's standard `collections` module. It takes the list of tokens and builds a special dictionary whose keys are the word types (like the `vocab` we built above) and whose values are the number of times each type appears in the list. We can ask that dictionary for the most common words, or for the frequency of individual word types.

In [49]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

[('URL', 932),
 ('HASHTAG', 717),
 ('DIGIT', 258),
 ('the', 87),
 ('in', 76),
 ('to', 72),
 ('of', 61),
 ('you', 57),
 ('I', 56),
 ('is', 54)]

In [50]:
freq['Missouri']

3
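
One detail worth knowing when looking up individual words: a `Counter` returns `0` for a key it has never seen instead of raising a `KeyError`. The tiny token list below is made up purely for illustration.

```python
from collections import Counter

# Made-up tokens, chosen so every count is different
toy_tokens = ['the', 'the', 'the', 'in', 'in', 'to']
toy_freq = Counter(toy_tokens)

top_two = toy_freq.most_common(2)   # list of (word, count) pairs
unseen = toy_freq['Missouri']       # missing keys count as 0, no KeyError

print(top_two)
print(unseen)
```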

In [51]:
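OOV = 'OOV'  # placeholder for out-of-vocabulary (rare) words
new_tokens = []
for token in tokens:
    # Replace any word that occurs only once in the whole token list
    if freq[token] == 1:
        new_tokens.append(OOV)
    else:
        new_tokens.append(token)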

In [52]:
new_tokens[:20]

['OOV',
 'we',
 'OOV',
 'our',
 'OOV',
 'OOV',
 'to',
 'all',
 'those',
 'who',
 'have',
 'OOV',
 'in',
 'our',
 'OOV',
 'OOV',
 'HASHTAG',
 'URL',
 'HASHTAG',
 'HASHTAG']
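
The same idea can be wrapped in a small function with an adjustable threshold. This is only a sketch: the name `replace_rare` and the `min_count` parameter are not part of the notebook, and the sample tokens are made up. With `min_count=2` it applies the same rule as the loop above (words seen once become `OOV`).

```python
from collections import Counter

def replace_rare(tokens, min_count=2, oov_token='OOV'):
    """Replace every token seen fewer than `min_count` times with `oov_token`."""
    freq = Counter(tokens)
    return [t if freq[t] >= min_count else oov_token for t in tokens]

# Made-up sample: 'veterans' and 'troops' each occur once
sample = ['we', 'thank', 'our', 'veterans', 'we', 'thank', 'our', 'troops']
replaced = replace_rare(sample)
print(replaced)
```

Raising `min_count` collapses more of the vocabulary into the placeholder, which is a trade-off to tune for the downstream method rather than a fixed rule.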

### Challenge - SOLUTION

I've read in some Amazon reviews from earlier into a list called `reviews`. Each element of the list is a string, representing the text of a single review. Try to:
- Tokenize each review
- Strip all whitespace
- Make all characters lower case
- Replace any URLs and digits

Then find the most common 50 words.

In [53]:
fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
fnames = glob.glob(fnames)
reviews = []
column_names = ['id', 'product_id', 'user_id', 'profile_name', 'helpfulness_num', 'helpfulness_denom',
               'score', 'time', 'summary', 'text']
for fname in fnames[:2]:
    df = pd.read_csv(fname, names=column_names)
    text = list(df['text'])
    reviews.extend(text)

In [54]:
clean = [re.sub(url_pattern, URL_SIGN, review) for review in reviews]
clean = [re.sub(hashtag_pattern, HASHTAG_SIGN, review) for review in clean]
clean = [re.sub(digit_pattern, DIGIT_SIGN, review) for review in clean]
clean = [''.join([ch for ch in review if ch not in punctuation]) for review in clean]
clean = [review.lower() for review in clean]
clean = [review.strip() for review in clean]

tokens = [word_tokenize(review) for review in clean] 
tokens[1]


['i',
 'have',
 'bought',
 'several',
 'of',
 'the',
 'vitality',
 'canned',
 'dog',
 'food',
 'products',
 'and',
 'have',
 'found',
 'them',
 'all',
 'to',
 'be',
 'of',
 'good',
 'quality',
 'the',
 'product',
 'looks',
 'more',
 'like',
 'a',
 'stew',
 'than',
 'a',
 'processed',
 'meat',
 'and',
 'it',
 'smells',
 'better',
 'my',
 'labrador',
 'is',
 'finicky',
 'and',
 'she',
 'appreciates',
 'this',
 'product',
 'better',
 'than',
 'most']

In [55]:
# Note: this counts words in a single review (tokens[1]), not across all reviews
freq = Counter(tokens[1])
freq.most_common(50)

[('and', 3),
 ('have', 2),
 ('of', 2),
 ('the', 2),
 ('product', 2),
 ('a', 2),
 ('than', 2),
 ('better', 2),
 ('i', 1),
 ('bought', 1),
 ('several', 1),
 ('vitality', 1),
 ('canned', 1),
 ('dog', 1),
 ('food', 1),
 ('products', 1),
 ('found', 1),
 ('them', 1),
 ('all', 1),
 ('to', 1),
 ('be', 1),
 ('good', 1),
 ('quality', 1),
 ('looks', 1),
 ('more', 1),
 ('like', 1),
 ('stew', 1),
 ('processed', 1),
 ('meat', 1),
 ('it', 1),
 ('smells', 1),
 ('my', 1),
 ('labrador', 1),
 ('is', 1),
 ('finicky', 1),
 ('she', 1),
 ('appreciates', 1),
 ('this', 1),
 ('most', 1)]
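
The counts above come from one review. To get the most common words across every review, one option is to feed each tokenized review into a single `Counter` with `update`. The sketch below uses a made-up list of tokenized reviews in place of `tokens`, and the name `corpus_freq` is an assumption.

```python
from collections import Counter

# Made-up stand-in for the list of tokenized reviews
tokenized_reviews = [
    ['good', 'dog', 'food'],
    ['the', 'dog', 'likes', 'the', 'food'],
    ['good', 'quality', 'food'],
]

corpus_freq = Counter()
for review_tokens in tokenized_reviews:
    corpus_freq.update(review_tokens)   # adds this review's counts to the running total

top = corpus_freq.most_common(1)
print(top)
```

With the real data, `corpus_freq.most_common(50)` would give the 50 most frequent words over all reviews.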

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "am", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

- What other stop words do you think there are?

In [56]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### Challenge - SOLUTION

Use the list `stop` of English stopwords to remove stopwords from our tokenized review above.

In [57]:
new_tokens = [token for token in tokens[1] if token not in stop]
new_tokens

['bought',
 'several',
 'vitality',
 'canned',
 'dog',
 'food',
 'products',
 'found',
 'good',
 'quality',
 'product',
 'looks',
 'like',
 'stew',
 'processed',
 'meat',
 'smells',
 'better',
 'labrador',
 'finicky',
 'appreciates',
 'product',
 'better']
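
Two practical points about this filter, sketched below. First, membership tests against a `set` are faster than against a list, which matters on large corpora. Second, the NLTK English stop words are all lower case, so tokens should be lower-cased before (or while) comparing. The small `stop_set` here is a hand-picked stand-in for `stop`, used only so the example runs without downloading NLTK data.

```python
# Hand-picked stand-in for the NLTK list; in the notebook this would be set(stop)
stop_set = {'i', 'have', 'of', 'the', 'and', 'them', 'all', 'to', 'be'}

review_tokens = ['I', 'have', 'bought', 'several', 'of', 'the',
                 'vitality', 'canned', 'dog', 'food', 'products']

# Compare the lower-cased token so that 'I' is caught as well as 'i'
kept = [t for t in review_tokens if t.lower() not in stop_set]
print(kept)
```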

## Stemming/lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because we often care more about the core content of the word (i.e. that it has something to do with growth or running) than about the fact that it's a third person present tense verb or a present participle. The two differ in their output: a stemmer chops off affixes by rule and may return a non-word such as "leav", whereas a lemmatizer maps the word to a dictionary form such as "leaf".

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.

In [58]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [59]:
stemmer.stem('grows')

'grow'

In [60]:
stemmer.stem('running')

'run'

In [61]:
stemmer.stem('leaves')

'leav'

In [62]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [63]:
print(snowballer_stemmer.stem('running'))
print(snowballer_stemmer.stem('leaves'))

run
leav


In [64]:
print(lemmatizer.lemmatize('leaves'))

leaf
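
To make the "bunch of regular expressions" remark concrete, here is a deliberately crude suffix stripper. It is **not** the Porter algorithm: the rules and the name `toy_stem` are invented for illustration, and the first matching rule wins.

```python
import re

# Hypothetical rules, tried in order; each is (pattern, replacement)
SUFFIX_RULES = [
    (re.compile(r"sses$"), "ss"),    # caresses -> caress
    (re.compile(r"ies$"), "i"),      # ponies   -> poni
    (re.compile(r"(?<!s)s$"), ""),   # grows    -> grow (but leave 'glass' alone)
    (re.compile(r"ing$"), ""),       # running  -> runn
    (re.compile(r"ed$"), ""),        # walked   -> walk
]

def toy_stem(word):
    """Lower-case the word and apply the first suffix rule that matches."""
    word = word.lower()
    for pattern, replacement in SUFFIX_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

for w in ['grows', 'caresses', 'glass', 'running']:
    print(w, '->', toy_stem(w))
```

The sketch turns "running" into "runn", whereas the Porter stemmer above returns "run". Handling cases like doubled consonants is part of what the extra rules in a real stemmer are for.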


### Challenge - SOLUTION

Use the Porter stemmer to stem each word in the tweet dataset after having removed stop words.

In [65]:
# Note: this solution stems every token; stop words are not filtered out first
tokenized_tweets = [word_tokenize(tweet) for tweet in tweet_text]
all_stemmed = []
for tweet in tokenized_tweets:
    stemmed = [stemmer.stem(t) for t in tweet]
    all_stemmed.append(stemmed)
all_stemmed

[['today',
  'we',
  'express',
  'our',
  'deepest',
  'gratitud',
  'to',
  'all',
  'those',
  'who',
  'have',
  'serv',
  'in',
  'our',
  'arm',
  'forc',
  '.',
  '#',
  'thankavet',
  'http',
  ':',
  '//t.co/wpk7qwpk8z'],
 ['busi',
  'day',
  'plan',
  'in',
  'new',
  'york',
  '.',
  'will',
  'soon',
  'be',
  'make',
  'some',
  'veri',
  'import',
  'decis',
  'on',
  'the',
  'peopl',
  'who',
  'will',
  'be',
  'run',
  'our',
  'govern',
  '!'],
 ['love',
  'the',
  'fact',
  'that',
  'the',
  'small',
  'group',
  'of',
  'protest',
  'last',
  'night',
  'have',
  'passion',
  'for',
  'our',
  'great',
  'countri',
  '.',
  'we',
  'will',
  'all',
  'come',
  'togeth',
  'and',
  'be',
  'proud',
  '!'],
 ['just',
  'had',
  'a',
  'veri',
  'open',
  'and',
  'success',
  'presidenti',
  'elect',
  '.',
  'now',
  'profession',
  'protest',
  ',',
  'incit',
  'by',
  'the',
  'media',
  ',',
  'are',
  'protest',
  '.',
  'veri',
  'unfair',
  '!'],
 ['a',
  'f
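
The challenge asks for stop words to be removed before stemming. A minimal sketch of that ordering follows. The function name `remove_stop_then_stem`, the small stop set, and the stand-in stemmer are all assumptions made so the example runs without NLTK; in the notebook you would pass `set(stop)` and `stemmer.stem` instead.

```python
def remove_stop_then_stem(tokens, stop_words, stem):
    """Drop stop words (compared in lower case), then stem what remains."""
    return [stem(t) for t in tokens if t.lower() not in stop_words]

# Hypothetical stand-ins, not the notebook's objects
small_stop = {'we', 'our', 'to', 'all', 'those', 'who', 'have', 'in'}
stand_in_stem = str.lower   # placeholder: only lower-cases, does no real stemming

tweet = ['Today', 'we', 'express', 'our', 'deepest', 'gratitude']
result = remove_stop_then_stem(tweet, small_stop, stand_in_stem)
print(result)
```

Filtering first means the stop list is compared against the original word forms, before the stemmer has had a chance to alter them.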

## POS tagging

POS tagging means assigning each token a part of speech (e.g. noun, verb, adjective). Again, there are many [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK makes its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand (although you are free to do so afterwards), because the tagger relies on the surrounding words for context.

In [66]:
from nltk import pos_tag
single_review = reviews[3]
single_review

'This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.'

In [67]:
tokens = word_tokenize(single_review)
tagged_review = pos_tag(tokens)
tagged_review

[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('confection', 'NN'),
 ('that', 'WDT'),
 ('has', 'VBZ'),
 ('been', 'VBN'),
 ('around', 'IN'),
 ('a', 'DT'),
 ('few', 'JJ'),
 ('centuries', 'NNS'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('light', 'JJ'),
 (',', ','),
 ('pillowy', 'JJ'),
 ('citrus', 'NN'),
 ('gelatin', 'NN'),
 ('with', 'IN'),
 ('nuts', 'NNS'),
 ('-', ':'),
 ('in', 'IN'),
 ('this', 'DT'),
 ('case', 'NN'),
 ('Filberts', 'NNP'),
 ('.', '.'),
 ('And', 'CC'),
 ('it', 'PRP'),
 ('is', 'VBZ'),
 ('cut', 'VBN'),
 ('into', 'IN'),
 ('tiny', 'JJ'),
 ('squares', 'NNS'),
 ('and', 'CC'),
 ('then', 'RB'),
 ('liberally', 'RB'),
 ('coated', 'VBN'),
 ('with', 'IN'),
 ('powdered', 'JJ'),
 ('sugar', 'NN'),
 ('.', '.'),
 ('And', 'CC'),
 ('it', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('tiny', 'JJ'),
 ('mouthful', 'NN'),
 ('of', 'IN'),
 ('heaven', 'NN'),
 ('.', '.'),
 ('Not', 'RB'),
 ('too', 'RB'),
 ('chewy', 'JJ'),
 (',', ','),
 ('and', 'CC'),
 ('very', 'RB'),
 ('flavorful', 'J
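
Because the tagger returns `(token, tag)` pairs, picking out one part of speech is a matter of filtering on the second element. The sketch below hard-codes the first few pairs from the output above so that it runs without NLTK; the variable names are assumptions.

```python
from collections import Counter

# First sentence of the tagged review, copied from the output above
tagged_sample = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('confection', 'NN'),
                 ('that', 'WDT'), ('has', 'VBZ'), ('been', 'VBN'), ('around', 'IN'),
                 ('a', 'DT'), ('few', 'JJ'), ('centuries', 'NNS'), ('.', '.')]

# Noun tags in this tag set all start with 'NN' (NN, NNS, NNP, ...)
nouns = [word for word, tag in tagged_sample if tag.startswith('NN')]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in tagged_sample)

print(nouns)
print(tag_counts['DT'])
```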

### Challenge - SOLUTION

Below I've read in the text of Harper's _Sowing and Reaping_ into a variable called `sowing`. Preprocess using the following steps:

- Strip whitespace
- Replace all numbers with '0'
- Tokenize
- Tag each token with a POS tag

Make sure you know:
- What type is the result?
- What type is each element of the result?
- What type are the elements of the elements of the result?

In [90]:
fname = 'sowing-and-reaping.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()
sowing = raw[1114:684814]
sowing



In [88]:
sowing = sowing.strip()
sowing = re.sub(digit_pattern, '0', sowing)
tokenized = word_tokenize(sowing[:1000]) # Just tokenize the first 1000 characters to speed things up
tokenized = [token for token in tokenized if token not in punctuation]

In [89]:
tagged = pos_tag(tokenized)
tagged

[('I', 'PRP'),
 ('hear', 'VBP'),
 ('that', 'IN'),
 ('John', 'NNP'),
 ('Andrews', 'NNP'),
 ('has', 'VBZ'),
 ('given', 'VBN'),
 ('up', 'RP'),
 ('his', 'PRP$'),
 ('saloon', 'NN'),
 ('and', 'CC'),
 ('a', 'DT'),
 ('foolish', 'JJ'),
 ('thing', 'NN'),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('He', 'PRP'),
 ('was', 'VBD'),
 ('doing', 'VBG'),
 ('a', 'DT'),
 ('splendid', 'NN'),
 ('business', 'NN'),
 ('What', 'WP'),
 ('could', 'MD'),
 ('have', 'VB'),
 ('induced', 'VBN'),
 ('him', 'PRP'),
 ("''", "''"),
 ('``', '``'),
 ('They', 'PRP'),
 ('say', 'VBP'),
 ('that', 'IN'),
 ('his', 'PRP$'),
 ('wife', 'NN'),
 ('was', 'VBD'),
 ('bitterly', 'RB'),
 ('opposed', 'VBN'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('business', 'NN'),
 ('I', 'PRP'),
 ("don't", 'VBP'),
 ('know', 'VB'),
 ('but', 'CC'),
 ('I', 'PRP'),
 ('think', 'VBP'),
 ('it', 'PRP'),
 ('quite', 'RB'),
 ('likely', 'JJ'),
 ('She', 'PRP'),
 ('has', 'VBZ'),
 ('never', 'RB'),
 ('seemed', 'VBN'),
 ('happy', 'JJ'),
 ('since', 'IN'),
 ('John', 'NNP'),
 ('has', 'VBZ'),
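
For the three "Make sure you know" questions, the types can be checked directly. The sketch uses the first three pairs from the output above in place of the full `tagged` list.

```python
# First three pairs from the output above
tagged_sample = [('I', 'PRP'), ('hear', 'VBP'), ('that', 'IN')]

result_type = type(tagged_sample)          # the result as a whole
element_type = type(tagged_sample[0])      # one (token, tag) pair
inner_type = type(tagged_sample[0][0])     # the token inside the pair

print(result_type.__name__, element_type.__name__, inner_type.__name__)
```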

## Things we didn't cover

- Named entity recognition
- Syntactic parsing
- Information extraction
- Removing markup from HTML
- Extracting numerical features
- spaCy