# Lecture 2.5: Text and Image Data

In this lecture, we focus on two new types of data: text and images. We'll learn the basic data exploration skills required to get started with two of the most popular fields of Machine Learning: Natural Language Processing (NLP) and Computer Vision (CV).

**Learning goals:**

- string operations
- basic regular expressions
- tokenization
- image manipulation

## 1. Text Data

The internet contains almost all of our civilization's knowledge. Why don't we just explore that data and use it? In a twist of irony similar to that of the [library of Babel](https://en.wikipedia.org/wiki/The_Library_of_Babel), what makes this knowledge so difficult to access is the sheer amount of information. Summarising, visualizing, and exploring text data is difficult because it cannot be numerically aggregated like tabular data (see lectures 2.3 & 2.4). Words are symbols, so we need other ways to get insights into the meaning encoded in their sequences.

One of the simplest tricks is to study the distribution of words. 

Seinfeld is a piece of 90s heritage, and we'd like to revisit the era. The `seinfeld.csv` dataset contains the transcript of all the dialogue of the first episode. Let's take a peek:

In [None]:
with open('seinfeld.csv', 'r') as f:
    lines = f.readlines()
lines[0]

What an introduction! We can see that the lines are formatted as such: `CHARACTER,CONTENT`. However, we wish to explore only the content. Let's cut off the character field by splitting the lines:

In [None]:
def remove_character(string):
    return ','.join(string.split(',')[1:])

dialogue = [remove_character(l) for l in lines]

🧠 The code above uses a [list comprehension](https://realpython.com/list-comprehension-python/). Can you step through what's happening?
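
As a hint, a list comprehension can always be unrolled into an explicit loop. The sketch below does this on two made-up lines in the `CHARACTER,CONTENT` format (the sample lines are hypothetical, not taken from the dataset). It also shows why `remove_character` re-joins the pieces: commas inside the content would otherwise be lost.

```python
# Hypothetical sample lines in the CHARACTER,CONTENT format (not from the dataset).
sample_lines = [
    'JERRY,"Hello, Newman."\n',
    'GEORGE,No commas here\n',
]

def remove_character(string):
    # Split on every comma, drop the first field (the character),
    # then glue the rest back together so commas in the content survive.
    return ','.join(string.split(',')[1:])

# List comprehension version, as in the cell above.
dialogue_comprehension = [remove_character(l) for l in sample_lines]

# Equivalent explicit loop.
dialogue_loop = []
for l in sample_lines:
    dialogue_loop.append(remove_character(l))

print(dialogue_loop)
```

Both versions build the same list; the comprehension is simply more compact.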

In [None]:
dialogue[7]

The character is gone, but the lines end in a newline. Let's remove those as well:

In [None]:
dialogue = [l.strip('\n') for l in dialogue]

In [None]:
dialogue[7]

In [None]:
dialogue[8]

It seems that _some_ of the lines start or end with a `"`. Conveniently, `.strip()` only removes the given characters from the two ends of a string, and leaves the string unchanged when they are absent, so we can safely apply it to every line. Let's use it for the `"`:

In [None]:
dialogue = [l.strip('"') for l in dialogue]
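
To make the behaviour of `.strip()` concrete, here is a small sketch on made-up strings (the sample strings are assumptions for illustration only):

```python
# Hypothetical sample strings, chosen to cover the three interesting cases.
quoted = '"Are you serious?"'
unquoted = 'No quotes here'
inner = 'She said "hi" to me'

stripped_quoted = quoted.strip('"')      # quotes at both ends are removed
stripped_unquoted = unquoted.strip('"')  # nothing to remove: unchanged
stripped_inner = inner.strip('"')        # quotes in the middle are left alone

print(stripped_quoted)
print(stripped_unquoted)
print(stripped_inner)
```

Note that characters inside the string are never touched, which is why quotes in the middle of a line survive this step.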

In [None]:
dialogue[7]

In [None]:
dialogue[490]

The lines are getting cleaner, but now we have to deal with numbers and punctuation. We're interested in _word_ distributions, and `32` doesn't tell us anything about Seinfeld! Let's filter out numbers and punctuation:

In [None]:
import re

def filter_numbers(string):
    return re.sub(r'\d+', '', string)

dialogue = [filter_numbers(l) for l in dialogue]

In [None]:
dialogue[490]

Here, we used a regular expression, a.k.a. [regex](https://regexr.com/). These are strings that define _search patterns_ in other strings. The pattern `\d+` matches one or more consecutive digits, which we replace with an empty string. Regexes are very useful when processing text data, but can get diabolically complicated! Thankfully, the one we need for filtering punctuation is straightforward:

In [None]:
def filter_punct(string):
    return re.sub(r'[^\w\s]', '', string)

dialogue = [filter_punct(l) for l in dialogue]
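
To see what the two patterns do, here is a minimal sketch on a made-up line (the sample string is hypothetical). `[^\w\s]` reads as "any character that is neither a word character nor whitespace".

```python
import re

# Hypothetical sample line, not taken from the dataset.
sample = "It's 32 degrees... really?!"

# \d+ matches runs of one or more digits.
no_digits = re.sub(r'\d+', '', sample)

# [^\w\s] matches anything that is not a word character or whitespace.
no_punct = re.sub(r'[^\w\s]', '', no_digits)

print(repr(no_digits))
print(repr(no_punct))
```

Two side effects are worth noticing: removing the number leaves a double space behind, and removing the apostrophe turns `It's` into `Its`. Whitespace tokenization copes with the first; the second is a limitation of this simple approach.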

In [None]:
dialogue[492]

In [None]:
dialogue[490]

It seems we've ended up with some empty lines... Let's clean those up too:

In [None]:
dialogue = [l for l in dialogue if l]

In [None]:
dialogue[490]

Finally, let's lowercase our lines to normalise all word strings:

In [None]:
dialogue = [l.lower() for l in dialogue]

In [None]:
dialogue[490]

We're close to seeing word distributions! But first, we must decide what a "word" is... Depending on how clean the text is, there are many ways of splitting a string into a sequence of word units. In the field of NLP, these units are called "tokens", and the process is referred to as "tokenization". Let's tokenize our lines by whitespace, meaning we'll split the string wherever there is at least one whitespace character:

In [None]:
def tokenize_whitespace(dialogue):
    tokens = []
    for l in dialogue:
        tokens.extend(l.split())
    return tokens


tokens = tokenize_whitespace(dialogue)
len(tokens)
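
Calling `.split()` with no argument is what gives us "at least one whitespace" behaviour. The sketch below contrasts it with `.split(' ')` on a made-up line (hypothetical, with irregular spacing on purpose):

```python
# Hypothetical line with double spaces, a tab, and a trailing space.
line = 'well  you know\tit is  what it is '

tokens_default = line.split()     # splits on any run of whitespace
tokens_space = line.split(' ')    # splits on every single space character

print(tokens_default)
print(tokens_space)
```

With `.split(' ')`, consecutive spaces produce empty strings and the tab is not treated as a separator, so the no-argument form is usually the safer choice for tokenization.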

We have words! Now we can look at the ten most common tokens using a [Counter](https://pymotw.com/2/collections/counter.html):

In [None]:
from collections import Counter

Counter(tokens).most_common(10)

😐... All the most popular words are common words, like pronouns or determiners. This isn't really giving us insights into the show, or capturing the spirit of Seinfeld! We need more advanced text processing. One common solution is to remove [stopwords](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) from our tokens. But for this, we'll need to use a more powerful NLP library.
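
The idea behind stopword removal can be previewed without any library. This is a minimal sketch, assuming a tiny hand-written stopword set and a made-up sentence (both hypothetical; real stopword lists contain hundreds of entries):

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set.
TINY_STOPWORDS = {'the', 'a', 'i', 'you', 'to', 'and', 'it', 'is'}

# Hypothetical sample sentence, already lowercased and free of punctuation.
sample_tokens = 'i think the button is the worst button you could pick'.split()

# Keep only the tokens that are not stopwords.
content_tokens = [t for t in sample_tokens if t not in TINY_STOPWORDS]

print(Counter(content_tokens).most_common(1))
```

Maintaining such a list by hand gets tedious quickly, which is one reason to lean on a library.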

[spaCy](https://spacy.io/) is one of the leading NLP libraries in Python. Let's try it out! It uses powerful ML models to do some of its text analysis, so we need to download those first:

In [None]:
!python -m spacy download en_core_web_sm

Now we're ready to analyse! Spacy's philosophy is to condense all the text processing one might wish for into a single call: the loaded pipeline object, conventionally named `nlp`, is applied directly to the text (more details in their [tutorial](https://spacy.io/usage/spacy-101)).

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

with open('seinfeld.csv', 'r') as f:
    content = f.read()
doc = nlp(content)

Notice how we don't need to execute any of our manual processing, since `spacy` takes care of everything. All the text metadata is now available through the `doc` object. We can explore the word distribution by removing stopwords and non-alphabetic tokens:

In [None]:
for token in doc:
    if token.is_alpha and not token.is_stop:
        print(token.text) 
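
Printing every token produces a long list. One way to summarise it is to feed the filtered tokens into a `Counter`, as we did earlier. The sketch below assumes spaCy is installed and uses a blank English pipeline on a made-up sentence (both are assumptions made so the example runs without downloading a model). A blank pipeline still tokenizes and knows spaCy's built-in stopword list.

```python
from collections import Counter

import spacy

# Assumption: a blank English pipeline is enough for tokenization and
# stopword flags; it does not need the downloaded model.
nlp_blank = spacy.blank("en")

# Hypothetical sample text, not taken from the dataset.
sample_doc = nlp_blank("The waffles and the ice cream, the waffles!")

# Count lowercased tokens that are alphabetic and not stopwords.
word_counts = Counter(
    token.text.lower()
    for token in sample_doc
    if token.is_alpha and not token.is_stop
)

print(word_counts.most_common(3))
```

In the notebook, the same generator expression can be applied to `doc` instead of `sample_doc`.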

This is feeling more "Seinfeld", but it's not easy to see. Let's turn it into a pretty wordcloud! ☁️

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Use the stopword list bundled with the wordcloud package
stopwords = set(STOPWORDS)

# WordCloud expects a single string, so join the spaCy tokens back together
token_texts = [t.text for t in doc]
big_string = ' '.join(token_texts)
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      min_font_size=10).generate(big_string)

plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis('off')  # pixel axes carry no information for a word cloud

Very cool! But spacy also offers advanced linguistic visualizations out of the box. For example, we can show the dependency parse tree for the line at index 42 of the transcript:

In [None]:
from spacy import displacy

doc = nlp(lines[42])
displacy.render(doc, style="dep")

Or use `spacy` to annotate [entities](https://en.wikipedia.org/wiki/Named-entity_recognition) in the text:

In [None]:
doc2 = nlp(''.join(lines[120:130]))
displacy.render(doc2, style="ent")

So cool! Notice how `New York` is correctly tagged as a geopolitical entity, but George's laugh `ho` is erroneously given the same label... This is because advanced NLP relies on statistical _models_, which are never perfect. This is important to keep in mind, because their errors can affect downstream tasks.

## 2. Image Data

- How to load and display images
- How to convert images to NumPy arrays and back
- How to resize images
- How to flip, rotate, and crop images
- How to save images to file

Which is better, ice cream 🍦 or waffles 🧇?

We have a dataset of 20 ice cream & 20 waffle images to figure it out. We'll use the [pillow](https://python-pillow.org/) library to help us in this delicious venture.

In [None]:
from PIL import Image
import matplotlib.pyplot as plt
image = Image.open('waffles_or_ice_cream/ice_cream/11.jpg')
plt.imshow(image)

Viewing an image is as simple as that! We are using pillow to read the `.jpg` file, and matplotlib to view it.  
Let's see a waffle next:

In [None]:
image = Image.open('waffles_or_ice_cream/waffles/9.jpg')
plt.imshow(image)

The image is distractingly appetizing, but did you notice the axes? They are numerically labeled. That's because images are just arrays of pixels! We can see it first hand by printing `image.size`, which pillow reports as `(width, height)`:

In [None]:
print(image.size)

We know another library that is particularly good at handling array data, NumPy. Let's convert our pillow [`Image`](https://pillow.readthedocs.io/en/stable/handbook/concepts.html) to an `ndarray`:

In [None]:
import numpy as np

data = np.asarray(image)
data.shape

Notice the `3`? This is a 3D array! `RGB` images store 3 numerical values per pixel, one for each color channel. Note also that NumPy orders the axes as `(height, width, channels)`, whereas `image.size` gave us `(width, height)`. This might "feel" weird, since NumPy was originally designed for efficient linear algebra. But numbers are data, and images are matrices! In fact, we can display images directly using `ndarray`s:

In [None]:
plt.imshow(data)
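
The conversion works in both directions. The sketch below uses a tiny synthetic image instead of the waffle photo (an assumption, so that it runs without the dataset) to show the axis order on each side and the round trip back to pillow with `Image.fromarray`:

```python
import numpy as np
from PIL import Image

# Synthetic stand-in for a photo: 4 pixels wide, 2 pixels tall, pure red.
tiny = Image.new("RGB", (4, 2), color=(255, 0, 0))
tiny_data = np.asarray(tiny)

print(tiny.size)         # pillow reports (width, height)
print(tiny_data.shape)   # NumPy reports (height, width, channels)
print(tiny_data.dtype)   # one unsigned byte per channel value

# Convert the array back into a pillow Image.
tiny_again = Image.fromarray(tiny_data)
print(tiny_again.size)
```

Keeping the two axis conventions in mind avoids swapped width and height when moving between the libraries.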

This means that we can manipulate images with our knowledge of NumPy indexing:

In [None]:
data_flip = data[::-1, :, :]
plt.imshow(data_flip)

In [None]:
data_flip_color = data[:, :, ::-1]
plt.imshow(data_flip_color)

In [None]:
data_zoom = data[100:200, 100:200, :]
plt.imshow(data_zoom)

That blue waffle doesn't look quite as appetizing. 🙅‍♂️
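
Why blue? Reversing the last axis swaps the red and blue channels. The sketch below makes this visible on a hand-built 2 x 2 image (a hypothetical array, small enough to read by eye):

```python
import numpy as np

# 2 x 2 RGB image: top row is red, green; bottom row is blue, white.
pixels = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

flipped_vertically = pixels[::-1, :, :]     # reverse the rows
flipped_horizontally = pixels[:, ::-1, :]   # reverse the columns
swapped_channels = pixels[:, :, ::-1]       # RGB becomes BGR

print(flipped_vertically[0, 0])    # top-left pixel is now the old bottom-left
print(flipped_horizontally[0, 0])  # top-left pixel is now the old top-right
print(swapped_channels[0, 0])      # the red pixel has turned blue
```

Each axis of the array has a meaning (rows, columns, channels), so reversing a different axis gives a different visual effect.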

pillow also offers convenient methods to carry out these common transformations:

In [None]:
image_flip = image.transpose(Image.FLIP_LEFT_RIGHT)
plt.imshow(image_flip)

In [None]:
image_crop = image.crop((100, 100, 200, 200))
plt.imshow(image_crop)

In [None]:
image_resize = image.resize((500,200))
plt.imshow(image_resize)

W I D E    W A F F L E 🤤
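
The tuples passed to `crop` and `resize` are easy to mix up. As a sketch on a blank synthetic canvas (an assumption, so no image file is needed): `crop` takes a box `(left, upper, right, lower)` in pixel coordinates, while `resize` takes the target `(width, height)` and does not preserve the aspect ratio, hence the wide waffle.

```python
from PIL import Image

# Blank synthetic canvas: 300 pixels wide, 250 pixels tall.
canvas = Image.new("RGB", (300, 250))

# Box is (left, upper, right, lower); result is 100 wide and 120 tall.
cropped = canvas.crop((100, 100, 200, 220))

# Target is (width, height); the image is stretched to fit.
resized = canvas.resize((500, 200))

print(cropped.size)
print(resized.size)
```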

These operations might seem trivial, but they are important for Machine Learning. Datasets must be cleaned and normalised to be used for training. Also, a popular way to improve model accuracy in the field of [computer vision](https://towardsdatascience.com/everything-you-ever-wanted-to-know-about-computer-vision-heres-a-look-why-it-s-so-awesome-e8a58dfb641e) is to use [data augmentation](https://bair.berkeley.edu/blog/2019/06/07/data_aug/). To augment image data, we commonly have to flip, rotate, fuzz, or change pixel values.
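
As a rough illustration of data augmentation, the sketch below produces a few variants of a single image array using NumPy only. The `augment` helper and the random stand-in image are hypothetical; real pipelines typically apply such transformations at random during training.

```python
import numpy as np

def augment(image_array):
    """Return simple variants of a (height, width, channels) array."""
    return {
        "original": image_array,
        "flip_lr": np.fliplr(image_array),  # mirror left to right
        "flip_ud": np.flipud(image_array),  # mirror top to bottom
        "rot90": np.rot90(image_array),     # rotate the first two axes by 90 degrees
    }

# Random noise stands in for a real photo: 4 rows, 6 columns, 3 channels.
rng = np.random.default_rng(seed=0)
fake_image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

variants = augment(fake_image)
for name, arr in variants.items():
    print(name, arr.shape)
```

One image becomes four training examples, and the rotated variant swaps height and width.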

Let's save our wide waffle masterpiece:

In [None]:
image_resize.save('waffles_or_ice_cream/waffles/wide_waffle.jpg')

Our waffles and ice creams image sizes are all over the place. Let's normalise the dataset by rescaling all of our images. 

💪 Create a new dataset, `waffles_or_ice_cream_norm`, which contains all the `waffles_or_ice_cream` images resized to $100 \times 100$ pixels. Go through the lecture 1.4 notebook if you need a refresher on data pipelines!

## 3. Summary

Today, we learned about text and image processing. We cleaned a transcript from an episode of Seinfeld, first with simple string operations, and then with the spacy library. We also loaded, manipulated, and saved images with pillow.

# Resources

## Core Resources

- [Text processing in python](https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908)
- [Pillow tutorial](https://pillow.readthedocs.io/en/3.0.x/handbook/tutorial.html)
- [Kaggle dataset - seinfeld chronicles](https://www.kaggle.com/thec03u5/seinfeld-chronicles)
- [Kaggle dataset - waffles or ice cream](https://www.kaggle.com/sapal6/waffles-or-icecream)
        
## Additional Resources
        
- [Text data processing walkthrough](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html)
- [Guide to deal with text data](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/)
- [Introduction to Natural Language Processing in python](https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63)
- [Load and manipulate images with pillow](https://machinelearningmastery.com/how-to-load-and-manipulate-images-for-deep-learning-in-python-with-pil-pillow/)
- [Advanced image processing with SciPy](https://scipy-lectures.org/advanced/image_processing/)