# Python Text Analysis: Fundamentals, Part 1

In this workshop series, we'll establish building blocks for performing text analysis in Python. These techniques lie in the domain of *natural language processing*, where we apply computational techniques to language written by humans in order to explain some of the underlying structure.

So, the million dollar question: How exactly do we go about applying computational methods to words?

This is ultimately a question of *representations*. Text naturally is represented as words, which are understandable to humans because we have a grammatical and syntactical structure we use to extract meaning from those words. However, most machine learning and data science techniques utilize numerical methods to extract patterns from large datasets. So, we need to find a way to convert the language into a numerical representation. We'll start with this goal in mind, and demonstrate how involved this process can be.

We'll start this process by first importing text into Python. Then, we'll cover a variety of preprocessing steps you might want to use before proceeding with computational methods. In the next part of this workshop, we'll work with the bag-of-words model, the first numerical representation of text that we'll encounter in this workshop series.
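To make the idea of a numerical representation concrete before we begin, here is a minimal sketch that turns a sentence into word counts using only the standard library. The sentence and the variable names are made up for illustration, and this is only a preview, not the bag-of-words approach covered later in the series.

```python
from collections import Counter

# A made-up sentence for illustration (not from the workshop data)
preview_sentence = "the cat sat on the mat"

# Split on blank space and count how often each word occurs
word_counts = Counter(preview_sentence.split())

# Fix an ordering of the distinct words so each one gets a position
preview_vocabulary = sorted(word_counts)

# The sentence as a list of numbers: one count per word in the ordering
preview_vector = [word_counts[word] for word in preview_vocabulary]

print(preview_vocabulary)
print(preview_vector)
```

Even this tiny example hides decisions (how to split, whether casing matters) that the preprocessing steps below make explicit.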

# Importing Text Files 

Text data we want to analyze will be stored in external files that need to be imported. These files will generally be text files (`.txt`) or comma separated value files (`.csv`).

All the data used in this notebook are stored in a `data` folder that we need to access. We need to adjust our filepaths accordingly:

In [None]:
text_path = '../data/sowing_and_reaping.txt'

## Text Files

We'll first start by importing "Sowing and Reaping" by Frances Harper, which is stored in a text file. Python has built-in functionality for importing text files:

In [None]:
# Open and read the text
with open(text_path, 'r') as file:
    raw_text = file.read()

We've stored the text file in an object called `raw_text`. We'll remove the front and end matter for better preprocessing later:

In [None]:
# Remove the front and end matter
sowing_and_reaping = raw_text[1114:684814]

## Challenge 1: Working with Strings

* What type of object is `sowing_and_reaping`?
* How many characters are in `sowing_and_reaping`?
* How can we get the first 1000 characters of `sowing_and_reaping`?
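If you'd like a hint, the sketch below shows the relevant tools on a short, made-up string (`sample_text` is a hypothetical stand-in, not the novel itself). The same calls should carry over to `sowing_and_reaping`.

```python
# A short, made-up string standing in for a longer text
sample_text = "Sowing and reaping go hand in hand."

# type() reports the kind of object we are working with
print(type(sample_text))

# len() counts characters, including spaces and punctuation
print(len(sample_text))

# Slicing returns a substring; [:6] keeps the first six characters
print(sample_text[:6])
```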

## Comma Separated Value (CSV) Files

Often, we may have data stored in "dataframes" or "tables", which consist of many samples (rows), each containing several features (columns). Among the features is likely a text column which contains the text of interest. These dataframes are often found as Comma Separated Value (CSV) files (and somewhat less frequently as tab separated value (TSV) files). In either case, there is some "delimiter" (i.e., a comma or tab) which helps separate entries from each other.
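To see what a delimiter does, here is a small sketch that parses made-up tab-separated data. It uses the standard library's `csv` module so that it runs without any extra packages. The data and variable names are assumptions for illustration.

```python
import csv
import io

# Made-up tab-separated data: a header row followed by two samples
tsv_data = "id\ttext\n1\tGreat flight, thanks!\n2\tMy bag was lost.\n"

# csv.reader splits each line on the delimiter we pass in
reader = csv.reader(io.StringIO(tsv_data), delimiter='\t')
rows = list(reader)

# The first row holds the column names; the rest are samples
header, samples = rows[0], rows[1:]
print(header)
for sample in samples:
    print(sample)
```

Because the delimiter is a tab, the comma inside the first sample's text is left alone. This is one reason TSV files are sometimes preferred for text data.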

The `pandas` package is the standard tool for working with dataframes in Python, and it comes with its own function for reading CSV files. For example, let's read in a file containing many Tweets about airlines:

In [None]:
# Import pandas
import pandas as pd
# Use pandas to import Tweets
csv_path = '../data/airline_tweets.csv'
tweets = pd.read_csv(csv_path, sep=',')

In [None]:
tweets.head()

Let's take a look at some of the Tweets:

In [None]:
print(tweets['text'].iloc[0])
print(tweets['text'].iloc[1])
print(tweets['text'].iloc[2])

## Challenge 2: Reading in Many Files

The `data` folder contains another folder called `amazon`, which contains many `csv` files of Amazon reviews. Use a `for` loop to read in each dataframe. Do the following:

* We've provided a path to the `amazon` folder, and a list of all the file names within the folder using the `os.listdir()` function.
* Iterate over all these files, and import them using `pd.read_csv()`. You will need to use `os.path.join()` to create the correct path. Additionally, you need to provide `pandas` with the column names since they are not included in the files. We have created the `column_names` variable for you.
* Extract the text column from each dataframe, and add its entries to the `reviews` list.
* How many total reviews do you obtain?

In [None]:
# The os package has useful tools for file manipulation
import os
# Amazon review folder
amazon_path = '../data/amazon'
# List all the files in the amazon folder
files = os.listdir(amazon_path)
# Column names for each file
column_names = ['id',
                'product_id',
                'user_id',
                'profile_name',
                'helpfulness_num',
                'helpfulness_denom',
                'score',
                'time',
                'summary',
                'text']
# Add each review text to this list
reviews = []

In [None]:
for file in files:
    # Check that the file is actually a CSV file
    if os.path.splitext(file)[1] == '.csv':
        # YOUR CODE HERE
        text = ''
        reviews.extend(text)

There are other file types which you may come across: `json`, `xml`, `html`, etc. There are packages you can use to import each of these. The main challenge, in most cases, is dealing with multiple files, and extracting the actual text you want.
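As one illustration, the sketch below pulls the text out of a small piece of JSON with the standard library's `json` module. The records and field names are made up, and real files will have their own structure.

```python
import json

# Made-up JSON content; real files will be structured differently
raw_json = '''
[
    {"id": 1, "text": "Loved the service!"},
    {"id": 2, "text": "Flight was delayed again."}
]
'''

# json.loads parses a JSON string into Python lists and dictionaries
records = json.loads(raw_json)

# Keep only the field that holds the text we want to analyze
json_texts = [record["text"] for record in records]
print(json_texts)
```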

# Preprocessing

Our goal is to convert a text representation to a numerical representation. However, language can be messy. There's a variety of preprocessing steps that we'd like to do before we get to the numerical representation.

We will largely be using a package called Natural Language Toolkit, or `nltk`, to perform these operations. In some cases, we'll use basic Python.

There are a host of natural language processing packages one can use. For example, one newer package is `spaCy`, which is extremely powerful. Our goal here is not to make you an expert in a variety of NLP packages, but to expose you to principles that are shared by all of them. In this way, you'll be better prepared to open up any new NLP package you might have to use.

## Installing `nltk`

If this is your first time using `nltk`, we'll go through a couple steps to get set up. First, install `nltk` if you have not already:

In [None]:
# Run if you do not have nltk installed
!pip install nltk

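Next, we need to download a few resources that `nltk` uses:

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
# word_tokenize (used below) needs the Punkt models; newer nltk releases name them 'punkt_tab'
nltk.download('punkt')
nltk.download('punkt_tab')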

## Text Cleaning

"Text cleaning" is a catch-all term for the process of performing relatively simple tasks in order to normalize our text. Text cleaning can mean a variety of different things depending on your use case.

### A Brief Introduction to Regular Expressions

Before we dive into the specific text cleaning processes, let's briefly introduce regular expressions. We do this here since many text cleaning steps may require regular expressions, and many NLP libraries heavily use them under the hood.

Regular expressions (regexes) are a powerful way of searching for specific string patterns in large corpora. They have an infamously steep learning curve, but are very efficient when you get a handle on them.

Our goal in this workshop is not to provide a deep (or even shallow) dive into regexes; instead, we want to expose you to them so that you're better prepared to do deep dives in the future.

Regex testers are useful tools for both understanding and creating regular expressions. One example is this [website](https://regex101.com).

In [None]:
import re
pattern = 'test'

In [None]:
test_string = 'This is a test.'
# Find tokens
tokens = re.findall(pattern=pattern, string=test_string)
print(tokens)
# Replace tokens
replaced = re.sub(pattern=pattern, repl='not a test', string=test_string) 
print(replaced)

This is nice, but we could have done this somewhat easily with basic Python `string` functions. Let's try something more interesting:

In [None]:
# Word pattern matcher
pattern = r'\w+'
re.findall(pattern, test_string)

What did this do? Use the regex website to confirm your guess!

For now, we won't go much further than this, but there are many resources online to continue learning about regexes.
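If you'd like a few more patterns to experiment with, here is a small sketch. The example string is made up, and the patterns are only a sample of what the syntax can express.

```python
import re

# A made-up string with digits, capital letters, and short words
regex_example = 'Flight 123 left at 9am; flight 456 left at 10pm.'

# \d+ matches one or more digits in a row
digit_runs = re.findall(r'\d+', regex_example)

# [A-Z]\w* matches words that begin with an uppercase letter
capitalized = re.findall(r'[A-Z]\w*', regex_example)

# \b\w{4}\b matches runs of exactly four word characters
# (letters, digits, or underscores) between word boundaries
four_chars = re.findall(r'\b\w{4}\b', regex_example)

print(digit_runs)
print(capitalized)
print(four_chars)
```

Try pasting each pattern into a regex tester to see which characters it captures and why.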

### Lowercasing

While there is often information in the "casing" of words (e.g., whether text is lowercase or uppercase), we often don't work in a regime where we're able to properly leverage this information. So, a common text cleaning step is to lowercase all text, in order to simplify our analysis.

We can easily do this with the built-in string function `lower()`:

In [None]:
sowing_and_reaping_lower = sowing_and_reaping.lower()

In [None]:
print(sowing_and_reaping[:200])
print('------')
print(sowing_and_reaping_lower[:200])

### Removing Punctuation

Sometimes, you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. This becomes less common when we consider more advanced NLP algorithms. In many cases, you may do this step *after* tokenization, which we will discuss in the next section. 

In [None]:
from string import punctuation
print(punctuation)

In [None]:
punctuation_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."
no_punctuation = ''.join([char for char in punctuation_text if char not in punctuation])
print(no_punctuation)
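Two other ways to drop punctuation are sketched below: a translation table applied to the raw string, and a filter applied to a token list (as you might do after tokenization). The sentence and the hand-written token list are made up for illustration.

```python
from string import punctuation

punct_sentence = "Wait, what?! That's #amazing."

# Build a translation table that deletes every punctuation character
delete_table = str.maketrans('', '', punctuation)
translated = punct_sentence.translate(delete_table)
print(translated)

# After tokenization, we can instead drop tokens made up only of punctuation
punct_tokens = ['Wait', ',', 'what', '?', '!', 'That', "'s", '#', 'amazing', '.']
word_tokens = [
    tok for tok in punct_tokens
    if not all(char in punctuation for char in tok)
]
print(word_tokens)
```

Note the difference: the first approach removes the apostrophe inside "That's", while the second keeps the token `'s` because it contains a letter.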

### Stripping Blank Spaces

Removing blank space is a common step, as we might come across text with extraneous blank space. This is particularly common when text is imported from messy places, like webpages.

Python has a built-in function to deal with blank space on the *ends* of strings:

In [None]:
string = ' Hello! '
string.strip()

What about within text? We will need to use a regular expression for this:

In [None]:
example1_path = '../data/example1.txt'

with open(example1_path, 'r') as file:
    example1 = file.read()
    
print(example1)

In [None]:
# Stripping only removes the ends
print(example1.strip())

In [None]:
# A regular expression will handle blank spaces within the text
blankspace_pattern = r'\s+'
blankspace_repl = ' '
clean_text = re.sub(blankspace_pattern, blankspace_repl, example1)
clean_text.strip()

### Removing URLs, Hashtags, and Numbers

Text containing non-alphabetic symbols may have additional meaning beyond simply using punctuation or numbers. For example, text may contain URLs, hashtags, or numbers. Each of these is informative in its own right.

However, we rarely care about the exact URL used in a tweet. Similarly, we might not care about specific hashtags, or the precise number used. While we could remove them completely, it's often informative to know that there *exists* a URL, hashtag, or number.

So, we replace individual URLs, hashtags, and numbers with a "symbol" that preserves the fact these structures exist in the text. It's standard to just use the strings "URL", "HASHTAG", and "DIGIT".

Since these types of text often contain precise structure, they're an apt case for using regular expressions. Let's apply these patterns to the Tweets above.

In [None]:
# Get a Tweet with a URL in it
url_tweet = tweets['text'].iloc[13]
print(url_tweet)

In [None]:
# URL: match the scheme and everything up to the next blank space
url_pattern = r'https?://\S+'
url_repl = ' URL '
re.sub(url_pattern, url_repl, url_tweet)

In [None]:
# Hashtag
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
hashtag_repl = ' HASHTAG '
re.sub(hashtag_pattern, hashtag_repl, url_tweet)

In [None]:
# Digits
digit_tweet = tweets['text'].iloc[32]
print(digit_tweet)
digit_pattern = r'\d+'
digit_repl = ' DIGIT '
re.sub(digit_pattern, digit_repl, digit_tweet)

What other kinds of text strings can you think of that we might want to replace?
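Two possible answers are user mentions and e-mail addresses. The sketch below handles both on a made-up tweet. The placeholder strings "EMAIL" and "MENTION" and the patterns themselves are assumptions for illustration, and the e-mail pattern is deliberately simple rather than a full address validator.

```python
import re

toy_tweet = '@airline thanks! Email me at jane.doe@example.com about flight 42'

# Replace e-mail addresses first, so the @ inside them is not read as a mention
email_pattern = r'\b[\w.+-]+@[\w-]+\.[\w.-]+\b'
no_email = re.sub(email_pattern, ' EMAIL ', toy_tweet)

# Then replace user mentions: an @ followed by word characters
mention_pattern = r'@\w+'
no_mention = re.sub(mention_pattern, ' MENTION ', no_email)

print(no_mention)
```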

## Challenge 3: Text Cleaning with Multiple Steps

In Challenge 2, we imported many Amazon reviews and stored them in a list called `reviews`. Each element of the list is a string, representing the text of a single review. For each review:

* Replace any URLs and digits.
* Make all characters lower case.
* Strip all blankspace.

Keep in mind: the order in which you do these steps matters!
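To see why order matters, consider this small sketch on a made-up review. It applies the same two steps in both orders. The pattern and names here are assumptions for illustration, not the solution to the challenge.

```python
import re

toy_review = 'Visit https://example.com for FREE samples'
toy_url_pattern = r'https?://\S+'

# Order A: replace the URL first, then lowercase.
# The placeholder itself gets lowercased along with the rest of the text.
order_a = re.sub(toy_url_pattern, ' URL ', toy_review).lower()

# Order B: lowercase first, then replace the URL.
# The placeholder keeps its uppercase form and stays easy to spot.
order_b = re.sub(toy_url_pattern, ' URL ', toy_review.lower())

print(order_a)
print(order_b)
```

Both orders leave doubled spaces around the placeholder, which is a good reason to strip blank space as a final step.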

In [None]:
def preprocess(text):
    # YOUR CODE HERE
    return text

## Tokenization

One of the most important steps in text analysis is tokenization. This is the process of breaking down the text into "tokens", which are distinct chunks that we recognize as unique in whatever corpus we're working in.

Let's start by importing an example file:

In [None]:
example2_path = '../data/example2.txt'

with open(example2_path) as file:
    example2 = file.read()
    
print(example2)

Let's try naively tokenizing by splitting up the text according to blankspace, using a basic Python string method:

In [None]:
tokens = example2.split()
# Print the first twenty tokens
tokens[:20]

We can roughly think of this as "word tokenization". However, it's not always clear that simply splitting up by spaces will get what we want. Consider contractions, for example, which really consist of two words connected together. More advanced tokenizations will actually treat these words differently.
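The sketch below contrasts blank-space splitting with a simple regular expression that separates punctuation and the endings of contractions. The sentence and pattern are made up for illustration. This is a toy tokenizer, and real tokenizers make more careful decisions.

```python
import re

toy_sentence = "We don't know if it's Maria's book."

# Naive approach: split on blank space only
naive_tokens = toy_sentence.split()

# Regex approach: a run of word characters, OR an apostrophe followed by
# word characters, OR any single character that is neither
toy_token_pattern = r"\w+|'\w+|[^\w\s]"
regex_tokens = re.findall(toy_token_pattern, toy_sentence)

print(naive_tokens)
print(regex_tokens)
```

The naive split leaves the final period glued to "book." and keeps each contraction whole, while the regex version breaks them apart into separate tokens.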

`nltk` has a function called `word_tokenize` which can tokenize a string for us in an intelligent fashion. Under the hood, it relies largely on regular expressions:

In [None]:
from nltk.tokenize import word_tokenize
nltk_tokens = word_tokenize(example2)

In [None]:
print(nltk_tokens)

Looking at this example, you can see how `nltk` has made certain decisions about where and when to tokenize. Tokenization is critical for downstream processing, and there's a variety of methods for performing the tokenizing. Let's take a look at `spaCy`'s tokenizer.

In [None]:
# Install spaCy if necessary
!pip install spacy
!python -m spacy download en_core_web_sm

In [None]:
# Import spaCy and load the English pipeline
import spacy
nlp = spacy.load("en_core_web_sm")
# Pass the example into the English pipeline
doc = nlp(example2)
spacy_tokens = [token.text for token in doc]

In [None]:
# Compare NLTK to spaCy
print(nltk_tokens)
print(spacy_tokens)

## Challenge 4: Tokenizing a Large Text

Tokenize "Sowing and Reaping", which we imported at the beginning of this workshop. Use a method of your choice.

Once you've tokenized, find all the unique word types (you might want the `set` function). Then, sort the resulting `set` object to create a vocabulary (you might want to use the `sorted` function).
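As a hint, here is how `set` and `sorted` behave on a tiny, made-up token list. The names are hypothetical, and the same idea should apply to the tokens you obtain from the novel.

```python
toy_tokens = ['the', 'seed', 'and', 'the', 'harvest', 'and', 'the', 'rain']

# set() keeps one copy of each distinct token (a "type")
toy_types = set(toy_tokens)

# sorted() returns the types as an alphabetically ordered list
toy_vocabulary = sorted(toy_types)

print(len(toy_tokens), 'tokens,', len(toy_vocabulary), 'types')
print(toy_vocabulary)
```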

## Removing Stop Words

Text often contains words that are very common but carry little information. These words tend to be pronouns or articles, such as "the", "a", "it", "them", etc. We often wish to remove these "stop words" before performing computation.

In practice, this is simple to do - we just filter out any token that appears in a list of stop words. However, we may want to use different "stop word lists", depending on our use case. For example, `nltk` has a stop word list:

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [None]:
# What kinds of words are in the list?
print(stop[:50])

In [None]:
# Remove tokens that are stop words (the stop list is lowercase, so compare in lowercase)
tokens = [token for token in tokens if token.lower() not in stop]
print(tokens)

In [None]:
# Compare to the original text
print(example2)
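If a ready-made list doesn't suit your use case, you can define your own. The sketch below uses a small, hand-picked set of stop words and a made-up token list, both of which are assumptions for illustration.

```python
# A small, hand-picked stop word set (an assumption for this sketch)
custom_stop = {'the', 'a', 'an', 'and', 'of', 'to', 'it'}

sample_tokens = ['The', 'price', 'of', 'a', 'ticket', 'went', 'up', 'and', 'up']

# Lowercase each token before the lookup, because the stop set is lowercase
kept_tokens = [tok for tok in sample_tokens if tok.lower() not in custom_stop]
print(kept_tokens)
```

Storing the stop words in a `set` rather than a list keeps each membership check fast, which matters for long texts.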

## Stemming and Lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. Many words consist of a "core" word with a modified ending that adjusts the word's meaning in a given context. For example, the word "grows" is simply "grow" with an "s" added to mark the third-person singular form of the verb. In many cases, we're interested in the core content of the word, and stemming and lemmatization are two ways of getting at it. This "core" component is often referred to as the *lemma*.

Stemming is a rudimentary approach to obtaining the lemma: it simply removes an ending of a word. So, "grows" would be stemmed to "grow". The word "running" would be stemmed to "run".

Lemmatization is more general: it aims to find the lemma of a word, but can handle cases where stemming may not work. For example, the word "fairies" cannot be stemmed to the lemma, "fairy". So, we need additional rules - provided by lemmatization - that can appropriately turn "fairies" into "fairy".
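To see why stripping endings alone falls short, here is a deliberately naive stemmer. The function `naive_stem` and its suffix list are made up for this sketch. It is not the Snowball algorithm and it ignores many rules a real stemmer applies.

```python
def naive_stem(word):
    """Strip a few common English suffixes. A toy rule set, not Snowball."""
    for suffix in ('ing', 'es', 'ed', 's'):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ['grows', 'coded', 'fairies', 'wolves']:
    print(word, '->', naive_stem(word))
```

Removing an ending works for "grows", but "fairies" and "wolves" end up as fragments rather than "fairy" and "wolf". Recovering those forms takes knowledge of the vocabulary, which is what lemmatization adds.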

`nltk` provides many algorithms for stemming. We'll use the Snowball Stemmer, which we'll import from `nltk`. We'll also look at the Word Net Lemmatizer:

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer

In [None]:
# Instantiate the stemmer and lemmatizer
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [None]:
# Stemming examples
print(stemmer.stem('grows'))
print(stemmer.stem('running'))
print(stemmer.stem('coded'))

In [None]:
# When does stemming not quite work?
print(stemmer.stem('fairies'))
print(stemmer.stem('wolves'))
print(stemmer.stem('abaci'))
print(stemmer.stem('leaves'))
print(stemmer.stem('carried'))

In [None]:
# Let's try lemmatizing these, instead:
print(lemmatizer.lemmatize('fairies'))
print(lemmatizer.lemmatize('wolves'))
print(lemmatizer.lemmatize('abaci'))
print(lemmatizer.lemmatize('leaves'))
print(lemmatizer.lemmatize('carried'))

What happened with that last one? Sometimes we need to provide the lemmatizer a 'part-of-speech' tag to help resolve ambiguous cases. This is another argument in the lemmatizer:

In [None]:
print(lemmatizer.lemmatize('carried', pos='v'))

Try it with "leaves", which has more than one way to lemmatize!

In [None]:
print(lemmatizer.lemmatize('leaves', pos='n'))
print(lemmatizer.lemmatize('leaves', pos='v'))

## Challenge 5: Apply a Lemmatizer to Text

Lemmatize the tokenized `example2` text using `nltk`'s `WordNetLemmatizer`.

## Challenge 6: Putting it All Together

Write a function called `preprocess()` that accepts a string and performs the following preprocessing steps:

* Lowercase text.
* Replace all URLs and numbers with their respective tokens.
* Strip blankspace.
* Tokenize.
* Remove punctuation.
* Remove stop words.
* Lemmatize the tokens.

Apply this function to `sowing_and_reaping`.

In [None]:
def preprocess(text):
    # YOUR CODE HERE
    return text

In [None]:
preprocess(sowing_and_reaping)

## Powerful Features of `spaCy`

We will end this portion of the workshop by examining some of the more powerful features offered by the newer NLP library, `spaCy`. Besides being quite fast, `spaCy` provides very powerful built-in tools in its tokenizer. For example, we automatically get many of the above operations in one fell swoop:

In [None]:
short_example = "We're learning about natural language processing at Berkeley."
doc = nlp(short_example)

for token in doc:
    print(
        f"Token: {token.text}; Lemma: {token.lemma_}; Part-of-speech: {token.pos_}; "
        f"Token shape: {token.shape_}; Alphabetical? {token.is_alpha}; Stop Word? {token.is_stop}"
    )

Tokenization, lemmatization, part-of-speech tagging, stop word detection, and a few other things are provided to us up front when we pass a text into the `nlp` pipeline.

`spaCy` also comes with some pretty shiny visualization tools:

In [None]:
from spacy import displacy
displacy.render(doc, style="dep", options={'compact': True})

For longer texts, we also get the ability to perform a variety of other operations very easily. Here are some cases:

In [None]:
example3_path = '../data/example3.txt'

with open(example3_path, 'r') as file:
    example3 = file.read()
    
doc = nlp(example3)

In [None]:
print(example3)

In [None]:
# Sentence segmentation
print('Sentence Segmentation:')
for sentence in doc.sents:
    print(sentence)

# Entity detection
print('\nEntity Detection:')
for entity in doc.ents:
    print(entity.text, entity.label_)

# Noun chunks
print('\nNoun Chunks:')
for chunk in doc.noun_chunks:
    print(chunk)

There's a whole lot else we can do with it! Check out `spaCy`'s documentation to see more.