# Preprocessing

Like other data types, text data never comes clean. Moreover, most of our downstream methods only accept data structured in a particular way. Because of this, before we apply any computational text analysis technique, we always need to perform some preprocessing, and text data calls for its own kinds of preprocessing. In this notebook, we cover the core preprocessing methods in preparation for the next two weeks:

- Reading in files
- Character encoding
- Tokenization
- Sentence segmentation
- Removing punctuation
- **Stripping whitespace**
- **Text normalization**
- **Stop words**
- **Stemming/Lemmatizing**
- **POS tagging**
- **DTM/TF-IDF**

## Convenience functions for reading in today's data

Here, we define a bunch of functions that simplify the process of reading in data that we'll use throughout today.

In [None]:
import os
import re
import glob
import pandas as pd

DATA_DIR = '../data'

def read_pride():
    fname = os.path.join(DATA_DIR, 'pride-and-prejudice.txt')
    with open(fname) as f:
        return f.read()

def read_trump():
    fname = os.path.join(DATA_DIR, 'trump-tweets.csv')
    df = pd.read_csv(fname)
    return list(df['Tweet_Text'].values)

def read_austen():
    fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
    fnames = glob.glob(fnames)
    austen = ''
    for fname in fnames:
        with open(fname) as f:
            text = f.read()
            austen += text
    return austen

def read_amazon(n=2):   
    fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
    fnames = glob.glob(fnames)
    reviews = []
    column_names = ['id', 'product_id', 'user_id', 'profile_name', 'helpfulness_num', 'helpfulness_denom',
                   'score', 'time', 'summary', 'text']
    for fname in fnames[:n]:
        df = pd.read_csv(fname, names=column_names)
        text = list(df['text'].iloc[1:])
        reviews.extend(text)
    return reviews

def read_dante():
    fname = os.path.join(DATA_DIR, 'dante.txt')
    with open(fname) as f:
        return f.read()

def read_example(n=1):
    fname = os.path.join(DATA_DIR, 'example{}.txt'.format(n))
    with open(fname) as f:
        return f.read()
    
def read_music():
    fname = os.path.join(DATA_DIR, 'music_reviews.csv')
    return list(pd.read_csv(fname, sep='\t')['body'])

## Reading in files

The first step is to read in the files containing the data. As we discussed last week, the most common file types for text data are: `.txt`, `.csv`, `.json`, `.html` and `.xml`.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

- What type of object is `raw`?
- How many characters are in `raw`?
- How can we get the first 1000 characters of `raw`?

In [None]:
import os
DATA_DIR = '../data'
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()

#### Reading in `.csv`

Python has a built-in module called `csv` for reading in csv files.

- What type is `tweets`?
- How many entries are in `tweets`?
- Which entry is the header row?
- How can we get the text of the first tweet?
- How can we get a list of the texts of all tweets?

In [None]:
import csv
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = []
with open(fname, newline='', encoding='utf-8') as f:  # newline='' is what the csv module expects
    reader = csv.reader(f)
    tweets = list(reader)
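
To see what `csv.reader` returns without depending on the data file, here is a minimal sketch that parses a tiny in-memory CSV. The sample rows and the `Date` column are invented for illustration; only the `Tweet_Text` column name comes from this notebook.

```python
import csv
import io

# Hypothetical stand-in for a small CSV file; the rows are invented.
sample_csv = io.StringIO(
    'Date,Tweet_Text\n'
    '2016-01-01,"Happy New Year, everyone!"\n'
    '2016-01-02,Back to work.\n'
)

rows = list(csv.reader(sample_csv))    # a list of lists of strings

header = rows[0]                       # the first row holds the column names
text_col = header.index('Tweet_Text')  # position of the text column
texts = [row[text_col] for row in rows[1:]]

print(type(rows), len(rows))
print(header)
print(texts)
```

Note that the quoted field keeps its comma: the reader handles quoting for us, which a plain `line.split(',')` would not.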

#### Reading in `.csv` with `pandas`

`pandas` is a third-party library that makes working with tabular data much easier. This is the recommended way to read in a `.csv` file.

- How many tweets are there?
- What happened to the header row?

In [None]:
import pandas as pd
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = pd.read_csv(fname)

In [None]:
tweets.head(3)

In [None]:
tweet_text = list(tweets['Tweet_Text'])
tweet_text[:4]

#### Reading in `.json` files

Python has built-in support for reading in `.json` files.

- How many questions are there in the dataset?
- What data type is each question?
- How can we access the question text of the first question?
- How can we get a list of the texts of all questions?

In [None]:
import json
fname = 'jeopardy.json'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    data = json.load(f)

In [None]:
data[:3]

#### Reading in `.html` files

The best way to read in `.html` files in Python is with the `BeautifulSoup` package.

In [None]:
from bs4 import BeautifulSoup
fname = 'time.html'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    html = f.read()
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning

In [None]:
texts = soup.find_all(string=True)  # all text nodes; replaces the deprecated findAll(text=True)
#texts = soup.get_text()
texts[:5]

#### Reading in `.xml` files

We read in `.xml` files using the `ElementTree` module of Python's standard library. We can think of `.xml` files as trees where each branch has a tag name. We can find all the branches with a certain name as follows:

In [None]:
from xml.etree import ElementTree as ET
fname = 'books.xml'
fname = os.path.join(DATA_DIR, fname)
e = ET.parse(fname)
root = e.getroot()

In [None]:
descriptions = root.findall('*/description')
text = [d.text for d in descriptions]
text[:3]
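
The same lookup can be tried on a small inline document. This is only a sketch: the catalog below is invented, and it assumes a layout in which each `description` sits one level below the root's children, as the `'*/description'` path above implies.

```python
from xml.etree import ElementTree as ET

# Hypothetical catalog; the tag names are assumptions made for this example.
sample_xml = """
<catalog>
  <book id="bk101">
    <title>First Book</title>
    <description>A short description.</description>
  </book>
  <book id="bk102">
    <title>Second Book</title>
    <description>Another description.</description>
  </book>
</catalog>
""".strip()

root = ET.fromstring(sample_xml)

# '*/description' means: any child of the root, then its <description> child.
descriptions = [d.text for d in root.findall('*/description')]

# findtext returns the text of the first matching child element.
titles = [book.findtext('title') for book in root.findall('book')]

print(descriptions)
print(titles)
```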

#### Reading in multiple files

Often, our text data is split across multiple files in a folder. We want to be able to read them all into a single variable.

- What type is `austen`?
- What type is `fnames` after it is first assigned a value?
- What type is `fnames` after it is assigned a second value?
- How 

In [None]:
import glob
fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(fnames)
austen = ''
for fname in fnames:
    with open(fname) as f:
        text = f.read()
        austen += text

### Challenge

Read in all the `.csv` files in the folder `amazon`. Extract out only the text column from each file and store them all in a list called `reviews`.

## Character encoding

Character encoding used to cause more trouble, especially in Python 2. With Python 3 and most text files now encoded in `UTF-8`, we rarely need to think about it. Note, though, that `open` falls back on a platform-dependent default encoding when none is given, so if you're getting nonsense (or a `UnicodeDecodeError`) when reading in a file, try passing `encoding='utf-8'` to the `open` function.
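
To make "getting nonsense" concrete, the following sketch encodes a short string as UTF-8 and then decodes the bytes with the wrong codec. The sample string is invented; the point is only that a mismatched codec can fail silently.

```python
original = 'città dolente'  # contains one non-ASCII character

# Encode to bytes as UTF-8, the way the text would be stored on disk.
raw_bytes = original.encode('utf-8')

# Decoding with the wrong codec does not raise here; it silently yields nonsense.
garbled = raw_bytes.decode('latin-1')

# Decoding with the codec that was used to write the bytes restores the text.
restored = raw_bytes.decode('utf-8')

print(len(original), len(raw_bytes))  # the accented letter takes two bytes
print(garbled)
print(restored)
```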

In [None]:
fname = 'dante.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [None]:
text[5000:6000]

In [None]:
fname = 'akutagawa.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [None]:
text[5000:6000]

## Tokenization

Once we've read in the data, our next step is often to split it into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.
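
The token/type distinction is easy to check on a toy sentence (invented for this example), using a plain whitespace split as a stand-in for a real tokenizer:

```python
sentence = 'the cat saw the dog chase the bird'

tokens = sentence.split()    # every occurrence counts as a token
types = sorted(set(tokens))  # each distinct word counts once as a type

print(len(tokens), 'tokens')
print(len(types), 'types:', types)
print(tokens.count('the'), "tokens of the type 'the'")
```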

#### Tokenizing by whitespace

- What problems do you notice with tokenizing by whitespace?
- What type is `text`?
- What type is `tokens`?
- What type is each element of `tokens`?

In [None]:
import os
fname = 'example1.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [None]:
text.split()[:10]

#### Tokenizing with regular expressions

In [None]:
import re
word_pattern = r'\w+'
tokens = re.findall(word_pattern, text)
tokens[:10]
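
The difference between the two approaches shows up clearly on a sentence with punctuation and a contraction. The sentence and the third pattern below are illustrative assumptions, not part of the lesson data:

```python
import re

sentence = "Mr. Darcy didn't smile, did he?"

# Whitespace splitting leaves punctuation glued to the words.
by_space = sentence.split()

# r'\w+' drops punctuation but also breaks the contraction in two.
by_regex = re.findall(r'\w+', sentence)

# One possible refinement: allow apostrophes inside a word.
by_regex_apostrophe = re.findall(r"\w+(?:'\w+)*", sentence)

print(by_space)
print(by_regex)
print(by_regex_apostrophe)
```

None of these is "the" right answer; which behaviour you want depends on the downstream method.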

#### Tokenizing with `nltk`

[Just a bunch of regular expressions under the hood](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py)

In [None]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
tokens[:10]

#### Challenge

A while ago you read in a bunch of Jane Austen books into a variable called `austen`. Tokenize that using a method of your choice. Find all the unique word types (you might want the `set` function). Sort the resulting set object to create a vocabulary (you might want to use the `sorted` function).

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences.

#### Sentence segmentation by splitting on punctuation

In [None]:
text.split('.')

We could improve on this by using regular expressions. They'll allow us to split strings on any one of several characters.

In [None]:
sent_boundary_pattern = r'[.?!]'
re.split(sent_boundary_pattern, text)
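
Splitting on `[.?!]` throws the punctuation away. One possible variant, sketched here on an invented sample, splits on the whitespace that follows a boundary character instead, using a lookbehind so the punctuation stays attached to its sentence:

```python
import re

sample = 'It rained all day. Did you see Mr. Darcy? He left!'

# Splitting on the punctuation itself discards it and leaves an empty last piece.
naive = re.split(r'[.?!]', sample)

# (?<=[.?!]) checks that a boundary character precedes the whitespace
# without consuming it, so each sentence keeps its final punctuation.
kept = re.split(r'(?<=[.?!])\s+', sample)

print(naive)
print(kept)
```

Both versions still break after "Mr.", which is the kind of case that motivates a trained sentence tokenizer.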

### Challenge

The file `example2.txt` has more punctuation problems. Read it in and see what the problems are. Try your best to modify the code from above to work for as many cases as you can.

In [None]:
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

#### Sentence segmentation by `nltk`

In [None]:
from nltk.tokenize import sent_tokenize
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
sent_tokenize(text)

## Removing punctuation

Sometimes (although admittedly less frequently than tokenizing and sentence segmentation), you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. Here's how we can do that.

- What type is `punctuation`?

In [None]:
from string import punctuation
punctuation

In [None]:
no_punct = ''.join([ch for ch in text if ch not in punctuation])
no_punct
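
An alternative worth knowing is `str.translate`, which deletes characters using a lookup table rather than testing them one by one. This sketch uses an invented sample string and checks that it gives the same result as the list comprehension above:

```python
from string import punctuation

sample = "Wait... what?! It's 5 o'clock."

# A translation table whose third argument lists the characters to delete.
table = str.maketrans('', '', punctuation)
translated = sample.translate(table)

# The list-comprehension approach, for comparison.
joined = ''.join([ch for ch in sample if ch not in punctuation])

print(translated)
print(translated == joined)
```

Keep in mind that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters pass through either way.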

## Strip whitespace

This is an extremely common step. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [None]:
fname = 'example3.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [None]:
print(text)

In [None]:
stripped_text = text.strip()
print(stripped_text)

In [None]:
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text

## Revision

I've read in the text of Jane Austen's _Pride and Prejudice_ into a variable called `pride`. Your tasks are to:
- Figure out what type of Python object `pride` is.
- Tokenize the text and store it in a variable called `tokenized_pride`.
- Figure out what type `tokenized_pride` is.
- Remove all punctuation from `pride`.
- Remove all punctuation from `tokenized_pride`.
- Break `pride` up into sentences and store the result as `sents_pride`.

In [None]:
pride = read_pride()

## Text normalization

Text normalization means making our text fit some standard patterns. Lots of steps come under this wide umbrella, but the most common are:

- case folding
- removing URLs, digits, hashtags
- OOV (removing infequent words)

#### Case folding

Case folding means dealing with upper- and lowercase characters. This is usually done by converting all characters to lowercase.

In [None]:
text = read_example(4)

In [None]:
text.lower()
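
For English text `lower` is usually enough, but Python strings also have a `casefold` method, a more aggressive variant intended for caseless comparison. A small sketch with an invented German example shows where the two differ:

```python
a = 'Straße'
b = 'STRASSE'

# lower() keeps the 'ß', so the two spellings still differ.
same_after_lower = a.lower() == b.lower()

# casefold() maps 'ß' to 'ss', so the two spellings compare equal.
same_after_casefold = a.casefold() == b.casefold()

print(a.lower(), b.lower(), same_after_lower)
print(a.casefold(), b.casefold(), same_after_casefold)
```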

### Challenge

The `lower` method we used above is a string method; that is, it works on strings. But what if you want to lowercase every word in a list (say you've already tokenized the text)? Take a list of tokens (for example, the result of tokenizing `text`) and make each one lower case.

#### Removing URLs, digits and hashtags

We rarely care about the exact URL used in a tweet, or the exact number. We could remove them completely (think about how we'd do that), but it's often informative to know that there is a URL or a digit in the text. So we want to replace individual URLs and digits with a symbol that preserves the fact that a URL or digit was there. It's standard to just use the strings "URL" and "DIGIT".

How do we do this? Once again, regular expressions save the day.

In [None]:
tweets = read_trump()
tweets[:5]

In [None]:
url_pattern = r'https?://\S+'  # match a URL up to the next whitespace, not the rest of the line
single_tweet = tweets[0]
single_tweet

In [None]:
URL_SIGN = ' URL '
re.sub(url_pattern, URL_SIGN, single_tweet)

#### Challenge

Above we replaced the URL in a single tweet. Now replace all the URLs in all tweets in `tweet_text`.

#### Challenge

Use the regular expression for hashtags below to replace all hashtags in all tweets in `tweet_text`.

In [None]:
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
HASHTAG_SIGN = ' HASHTAG '
digit_pattern = r'\d+'  # raw string, so the backslash reaches the regex engine untouched
DIGIT_SIGN = ' DIGIT '
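
The order of the replacements matters, which a single toy string can show. The function name `normalize_tweet`, the sample tweets, and the simple URL pattern `https?://\S+` are assumptions made for this sketch; the hashtag and digit patterns are the ones defined above.

```python
import re

url_pattern = r'https?://\S+'              # assumed: a URL runs to the next whitespace
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'  # as defined above
digit_pattern = r'\d+'

def normalize_tweet(tweet):
    """Replace URLs, hashtags and digit runs with placeholder tokens."""
    # URLs go first: otherwise digits inside a URL would be replaced
    # and the URL pattern would no longer match the whole address.
    tweet = re.sub(url_pattern, ' URL ', tweet)
    tweet = re.sub(hashtag_pattern, ' HASHTAG ', tweet)
    tweet = re.sub(digit_pattern, ' DIGIT ', tweet)
    # The placeholders add extra spaces, so collapse whitespace at the end.
    return re.sub(r'\s+', ' ', tweet).strip()

print(normalize_tweet('Big rally at 7pm! #MAGA https://t.co/abc123'))
print(normalize_tweet('Great day #winning 2016 http://t.co/x1'))
```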

#### OOV words

Sometimes it's best for us to remove infrequent words (sometimes not!). When we do remove infrequent words, it's often for a downstream method (like classification) that is sensitive to rare words.

In [None]:
all_tweets = ' '.join(tweets)
clean = re.sub(url_pattern, URL_SIGN, all_tweets)
clean = re.sub(hashtag_pattern, HASHTAG_SIGN, clean)
clean = re.sub(digit_pattern, DIGIT_SIGN, clean)
tokens = word_tokenize(clean)
tokens = [token for token in tokens if token not in punctuation]
tokens[:20]

We can count the frequency of each word type with `Counter` from Python's built-in `collections` module. It takes a list of tokens and builds a special Python dictionary whose keys are the word types (the same set we collected earlier as a `vocabulary`) and whose values are the number of times each type appears in the list. We can ask that dictionary for the most common words, or for the frequency of individual word types.

In [None]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

In [None]:
freq['unleashed']

In [None]:
OOV = 'OOV'
new_tokens = []
for token in tokens:
    if freq[token] == 1:
        new_tokens.append(OOV)
    else:
        new_tokens.append(token)

In [None]:
new_tokens[:20]
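
The loop above treats a word as rare if it occurs exactly once. A slightly more general sketch wraps the same idea in a function with a frequency threshold; the name `replace_rare`, its `min_count` parameter, and the toy sentence are all invented for illustration.

```python
from collections import Counter

def replace_rare(tokens, min_count=2, oov='OOV'):
    """Replace every token seen fewer than min_count times with the oov marker."""
    freq = Counter(tokens)
    return [tok if freq[tok] >= min_count else oov for tok in tokens]

toy_tokens = 'the cat saw the dog and the cat ran'.split()

print(replace_rare(toy_tokens))               # keeps 'the' and 'cat'
print(replace_rare(toy_tokens, min_count=3))  # keeps only 'the'
```

With `min_count=2` this matches the behaviour of the loop above; raising the threshold shrinks the vocabulary further.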

### Challenge

I've read in some Amazon reviews from earlier into a list called `reviews`. Each element of the list is a string, representing the text of a single review. Try to:
- Tokenize each review
- Separate each review into sentences
- Strip all whitespace
- Make all characters lower case
- Replace any URLs and digits

Then find the most common 50 words.

In [None]:
reviews = read_amazon()
reviews[:2]

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "am", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

- What other stop words do you think there are?

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop
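
The filtering step itself is a one-line membership test. The sketch below uses a tiny hand-made stop list (an assumption for illustration, not NLTK's list) so that it runs without any downloads:

```python
# A tiny hand-made stop list; NLTK's English list is much longer.
toy_stop = {'i', 'am', 'the', 'a', 'of', 'and', 'to'}

toy_tokens = 'I am the very model of a modern major general'.split()

# Compare in lower case, since stop lists are typically all lower case.
content_tokens = [tok for tok in toy_tokens if tok.lower() not in toy_stop]

print(content_tokens)
```

Using a `set` rather than a list for the stop words makes each membership test fast, which matters once the corpus is large.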

### Challenge

Use the list `stop` of English stopwords to remove stopwords from our dataset of tweets.

## Stemming/lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running, rather than the fact that it's a third person present tense verb or a progressive participle).

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.
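
To make the "bunch of regular expressions" intuition concrete, here is a toy suffix stripper. It is not the Porter algorithm: the rule list and the name `toy_stem` are invented for this sketch, and it handles only a handful of endings.

```python
import re

# (pattern, replacement) pairs, tried in order; the first match wins.
TOY_RULES = [
    (r'sses$', 'ss'),  # caresses -> caress
    (r'ies$', 'i'),    # ponies   -> poni
    (r'ss$', 'ss'),    # glass    -> glass (protects 'ss' from the last rule)
    (r'ing$', ''),     # running  -> runn
    (r's$', ''),       # grows    -> grow
]

def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for pattern, replacement in TOY_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for word in ['grows', 'caresses', 'ponies', 'glass', 'running']:
    print(word, '->', toy_stem(word))
```

The leftover double consonant in "runn" is exactly the sort of case that a real stemmer handles with additional rules and conditions.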

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
stemmer.stem('grows')

In [None]:
stemmer.stem('running')

In [None]:
stemmer.stem('leaves')

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [None]:
print(snowballer_stemmer.stem('running'))
print(snowballer_stemmer.stem('leaves'))

In [None]:
print(lemmatizer.lemmatize('leaves'))

### Challenge

Use the Porter stemmer to stem each word in the tweet dataset after having removed stop words.

## POS tagging

POS tagging means assigning each token a part of speech (e.g. noun, verb, adjective, etc.). Again, there are many different [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK keeps its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand (although you are free to do it afterwards).

In [None]:
from nltk import pos_tag
single_review = reviews[3]
single_review

In [None]:
tokens = word_tokenize(single_review)
tagged_review = pos_tag(tokens)
tagged_review

### Challenge

Below I've read in the text of Austen's _Pride and Prejudice_ into a variable called `pride`. Preprocess using the following steps:

- Strip whitespace
- Replace all numbers with '0'
- Tokenize
- Tag each token with a POS tag

Make sure you know:
- What type is the result?
- What type is each element of the result?
- What type are the elements of the elements of the result?

In [None]:
pride = read_pride()[679:684814]

## DTM/TF-IDF

A document-term matrix (DTM) and term frequency-inverse document frequency (TF-IDF) weighting are common preprocessing steps for turning tokenized texts into numerical features, ready for supervised machine learning models. Scikit-learn is the standard library for building DTMs and TF-IDF matrices in Python. It has two main classes for this: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer).
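
Before handing the work to scikit-learn, it can help to see what a DTM contains. This sketch builds one by hand for three invented documents, with one row per document and one column per word type; it uses a plain whitespace split and so will not match scikit-learn's own tokenization in general.

```python
from collections import Counter

docs = ['the cat sat', 'the dog sat down', 'the cat saw the dog']
tokenized = [doc.split() for doc in docs]

# Columns: the sorted vocabulary across all documents.
vocab = sorted(set(tok for doc in tokenized for tok in doc))

# Rows: one list of counts per document, in vocabulary order.
toy_dtm = []
for doc in tokenized:
    counts = Counter(doc)  # missing words count as 0
    toy_dtm.append([counts[term] for term in vocab])

print(vocab)
for row in toy_dtm:
    print(row)
```

Even in this tiny example several cells are zero; with a realistic vocabulary almost all of them are, which is why sparse storage is used.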

In [None]:
whitespace_pattern = r'\s+'
clean = [re.sub(url_pattern, URL_SIGN, t) for t in tweets]
clean = [re.sub(hashtag_pattern, HASHTAG_SIGN, t) for t in clean]
clean = [re.sub(digit_pattern, DIGIT_SIGN, t) for t in clean]
clean = [re.sub(whitespace_pattern, ' ', t) for t in clean]
clean[:4]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
count = CountVectorizer()
X = count.fit_transform(clean)
X

In [None]:
X.toarray()[:5,:5]

In [None]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(clean)
X

In [None]:
X.toarray()[:5,:5]

## More on DTM/TF-IDF
We will use Python's scikit-learn package to make a document-term matrix from a `.csv` music reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word-weighting technique called tf-idf (term frequency-inverse document frequency) to identify important and discriminating words within this dataset (using the Pandas package). The illustrating question: **what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?**

In [None]:
music = read_music()
music[:5]

#### Challenge

Remove all the digits from `music`.

In [None]:
def remove_digit(comment):
    return ''.join([ch for ch in comment if not ch.isdigit()])

no_digits = [remove_digit(comment) for comment in music]

### CountVectorizer

Our next step is to turn the text into a document-term matrix using the scikit-learn class `CountVectorizer`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
sparse_dtm = countvec.fit_transform(no_digits)

Great! We made a DTM! Let's look at it.

In [None]:
sparse_dtm

This format is called Compressed Sparse Row (CSR) format. It saves a lot of memory to store the DTM in this format, but it is difficult for a human to read. To illustrate the techniques in this lesson, we will first convert this matrix to a Pandas DataFrame, a format we're more familiar with. For larger datasets, you will have to keep the sparse format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!

In [None]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dtm = pd.DataFrame(sparse_dtm.toarray(), columns=countvec.get_feature_names_out())
dtm.head()

### Challenge

I've read in a bunch of Jane Austen books into a variable called `books`, which is a list of strings and each string is an entire book. Turn them into a DTM. What will be the rows and columns?

In [None]:
AUSTEN_DIR = os.path.join(DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(AUSTEN_DIR)
books = []
for fname in fnames:
    with open(fname) as f:
        text = f.read()
    books.append(text)

### TF-IDF scores

How to find distinctive words in a corpus is a long-standing question in text analysis. Today, we'll learn one simple approach to this: TF-IDF. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is the `tf-idf score`. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.

Traditionally, the inverse document frequency is calculated as:

number_of_documents / number_of_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_of_documents_with_word1)

You can, and often should, normalize the term frequency by the length of the document:

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_of_documents_with_word1)

We can calculate this manually, but scikit-learn has a built-in class to do so. By default it also takes the logarithm of the (smoothed) inverse document frequency and normalizes each document's scores, so the numbers will not correspond exactly to the calculations above. We'll use the [scikit-learn calculation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), but a challenge for you: use Pandas to calculate this manually.
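
The length-normalized formula above can be worked through by hand on a toy corpus. This is a plain-Python sketch under stated assumptions: the three documents are invented, tokenization is a whitespace split, and no logarithm is applied, so the numbers will differ from scikit-learn's output.

```python
from collections import Counter

docs = ['jazz sax solo', 'rap beat flow', 'jazz beat']
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def tfidf_scores(doc):
    """(count / document length) * (number of documents / document frequency)."""
    counts = Counter(doc)
    return {word: (count / len(doc)) * (n_docs / doc_freq[word])
            for word, count in counts.items()}

scores = [tfidf_scores(doc) for doc in tokenized]

for doc, doc_scores in zip(docs, scores):
    print(doc, '->', doc_scores)
```

In the first document, "sax" and "solo" score higher than "jazz" because "jazz" also appears in another document, which is the behaviour the weighting is meant to produce.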

### TfidfVectorizer

To do so, we simply do the same thing we did above with `CountVectorizer`, but instead we use the class `TfidfVectorizer`.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()
sparse_tfidf = tfidfvec.fit_transform(no_digits)
sparse_tfidf

In [None]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf = pd.DataFrame(sparse_tfidf.toarray(), columns=tfidfvec.get_feature_names_out())
tfidf.head()

### Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we add in a column of genre.

In [None]:
fname = os.path.join(DATA_DIR, 'music_reviews.csv')
reviews = pd.read_csv(fname, sep='\t')

tfidf['genre_'] = reviews['genre']
tfidf.head()

In [None]:
rap = tfidf[tfidf['genre_']=='Rap']
indie = tfidf[tfidf['genre_']=='Indie']
jazz = tfidf[tfidf['genre_']=='Jazz']

rap.max(numeric_only=True).sort_values(ascending=False).head()

In [None]:
indie.max(numeric_only=True).sort_values(ascending=False).head()

In [None]:
jazz.max(numeric_only=True).sort_values(ascending=False).head()

There we go! A method of identifying distinctive words. You'll notice there are some proper nouns in there. How might we remove those if we're not interested in them?

## Things we didn't cover

- Named entity recognition
- Syntactic parsing
- Information extraction
- Removing markup from HTML
- Extracting numerical features
- spaCy