In [None]:
# You might need (or want) these text-specific
# libraries if you don't have them:

# ! pip install --user --quiet nltk spacy
# ! python -m spacy download en_core_web_sm

# Working with Text

This notebook contains videos and exercises that will guide you through some basic ideas in working with text.

The goals for this notebook are as follows: 

1. Practice, practice, practice working with text in Python. You will code a lot in these exercises. That's the idea!

2. See the techniques make a difference on a very simple problem. This should give you a feel for what these techniques are doing!

3. Prove (to yourself and to me!) that you understand the techniques by implementing them.

This repo contains two notebooks: 1) videos and exercises (this notebook), and 2) a graded assignment.

After each video in this notebook is a small exercise. The exercise asks you to _implement_ from scratch the ideas explained in the video. I encourage everyone to try to implement the ideas and to work through the exercises. If you are proficient in Python, they should take less than an hour altogether. If you are not proficient in Python, they will be good practice!

That being said, you can find solutions to the first few exercises in the `solutions.py` file included in this folder. Those solutions should help you see the effect of the techniques even if you are not able to implement them yourself.

In [22]:
from IPython.lib.display import YouTubeVideo

YouTubeVideo('0ua6H8jHiBU', width=640, height=360)

In [None]:
# 1)
# Let's say you are inventing search. 
#
# Imagine someone searching for the term "People who see ghosts". 
# Implement a search algorithm that picks the correct document
# from the `docs` array below. It should be easy!
#
# HINT: 
# Look at the Python documentation for string methods that
# you can use to manipulate text: 
# 
# https://docs.python.org/3/library/stdtypes.html#string-methods
#
# Methods such as .strip, .split, .join, .lower are commonly
# used in text mining applications.


docs = ['This is a document about people who see ghosts. Those people end up on TV shows.',
        'This is a document about seeing goats. Those people work on farms.']

def search(docs, query):
    # Your function here!
    pass

search(docs, 'People who see ghosts')

In [23]:
YouTubeVideo('vfdode-FOO8', width=640, height=360)

In [None]:
# You probably needed to lowercase both the
# term and your document in order to make
# your search algorithm work. 
#
# We call this "preprocessing". Let's make
# a preprocessing function that we can build
# on throughout the notebook:

def preprocess(doc):
    return doc.lower()
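
One possible way to plug `preprocess` into the search from exercise 1 is sketched below. It is only an illustration, not the reference solution from `solutions.py`: the name `search_exact` and the variable `ghost_docs` are made up here so that the cell does not overwrite your own `search` function or the notebook's `docs`.

```python
def preprocess(doc):
    # Same preprocessing step as in the cell above: lowercase everything.
    return doc.lower()


def search_exact(docs, query):
    """Return the first document that contains the query as a substring.

    Both the query and each document are preprocessed first, so the
    match ignores case. Returns None if no document matches.
    """
    query = preprocess(query)
    for doc in docs:
        if query in preprocess(doc):
            return doc
    return None


# The two documents from exercise 1, under a hypothetical name.
ghost_docs = [
    'This is a document about people who see ghosts. Those people end up on TV shows.',
    'This is a document about seeing goats. Those people work on farms.',
]

print(search_exact(ghost_docs, 'People who see ghosts'))
```

Without the call to `preprocess`, the capital "P" in the query would prevent any match, which is why lowercasing both sides matters here.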

In [None]:
# 2)
# Now let's try your search algorithm on the following docs.
# Does it return the correct document? 
#
# If we agree the correct document is the second one,
# then we can see that an "exact string match" is not enough:
# it picks the first document.
#
# Can you create a search algorithm
# that selects the second document?

# HINT: You will need to come up with a new
# decision mechanism AND add more preprocessing.
#
# HINT: You might need to use regular expressions
# in the preprocessing. You should find everything
# you need in the regular-expressions.ipynb file.


docs = ['"I dont believe people who see ghosts", said Mannie, before spitting into the wind and riding his bike down the street at top speed. He then went home and ate peanut-butter and jelly sandwiches all day. Mannie really liked peanut-butter and jelly sandwiches. He ate them so much that his poor mother had to purchase a new jar of peanut butter every afternoon.',

        'We have collected a report of people in our community who see ghosts. Each resident was asked "how many ghosts have you seen?", "describe the ghosts you saw", and "tell us about your mother." Afterwards, we compared the reports of ghosts between the different individuals, and assessed whether or not they were actually seeing these apparitions.']



search(docs, 'People who see ghosts')
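
If you get stuck on exercise 2, here is a minimal sketch of one possible decision mechanism: split the text into words with a regular expression, then score each document by how often the query words occur in it. The helper names (`tokenize`, `count_score`, `search_by_count`) and the shortened `toy_docs` are assumptions made for this illustration, not part of the notebook.

```python
import re
from collections import Counter


def tokenize(doc):
    # Lowercase, then keep only runs of letters. This drops quotes,
    # commas, periods and other punctuation attached to words.
    return re.findall(r"[a-z]+", doc.lower())


def count_score(doc, query):
    # Add up how many times each distinct query word occurs in the document.
    counts = Counter(tokenize(doc))
    return sum(counts[term] for term in set(tokenize(query)))


def search_by_count(docs, query):
    # Pick the document with the highest score.
    return max(docs, key=lambda doc: count_score(doc, query))


# Shortened, made-up stand-ins for the documents in this exercise.
toy_docs = [
    '"I dont believe people who see ghosts", said Mannie.',
    'People in our community see ghosts. How many ghosts have you seen? Describe the ghosts.',
]

for doc in toy_docs:
    print(count_score(doc, 'People who see ghosts'), doc)
```

The first toy document contains the exact phrase but scores lower, because the second one mentions "ghosts" three times.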

In [24]:
YouTubeVideo('H1Fw24o4zeA', width=640, height=360)

In [None]:
# 3)
# Now the second document has changed a bit:
# "who see ghosts" to "seeing ghosts",
# "the ghosts you saw" to "the last ghost you saw", and
# "reports of ghosts" to "ghost reports".
#
# Clearly, these don't change the meaning. But they
# mess up your algorithm which relies on counting
# words!
#
# Let's add some steps to our `preprocess` function
# to create a representation of our data that 
# allows our previous word-count function to work.
#
# HINT: You can either lemmatize or stem, but
# it's a bit simpler to stem, so start with that!
# Bonus points for lemmatization :)


docs = ['"I dont believe people who see ghosts", said Mannie, before spitting into the wind and riding his bike down the street at top speed. He then went home and ate peanut-butter and jelly sandwiches all day. Mannie really liked peanut-butter and jelly sandwiches. He ate them so much that his poor mother had to purchase a new jar of peanut butter every afternoon.',

        'We have collected a report of people in our community seeing ghosts. Each resident was asked "how many ghosts have you seen?", "describe the last ghost you saw", and "tell us about your mother." Afterwards, we compared the ghost reports between the different individuals, and assessed whether or not they were actually seeing these apparitions.']

search(docs, 'People who see ghosts')
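
To get a feel for what stemming does to the word counts, here is a deliberately crude sketch. `crude_stem` is a toy suffix stripper invented for this illustration; it is not a real stemming algorithm, and in practice you would more likely reach for something like NLTK's `PorterStemmer`. The shortened `toy_docs` are also made up.

```python
import re
from collections import Counter


def crude_stem(word):
    # Toy stemmer (an assumption for illustration only): strip a trailing
    # "ing" or "s" as long as at least three letters remain.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_stemmed(doc):
    # Lowercase, split into letter-only tokens, then stem each token.
    return [crude_stem(token) for token in re.findall(r"[a-z]+", doc.lower())]


def stemmed_score(doc, query):
    # Count how often each distinct query stem occurs among the document stems.
    counts = Counter(preprocess_stemmed(doc))
    return sum(counts[stem] for stem in set(preprocess_stemmed(query)))


# Shortened, made-up stand-ins for the documents in this exercise.
toy_docs = [
    '"I dont believe people who see ghosts", said Mannie.',
    'People in our community keep seeing ghosts. Describe the last ghost you saw. '
    'We compared the ghost reports.',
]

for doc in toy_docs:
    print(stemmed_score(doc, 'People who see ghosts'), doc)
```

With stemming, "seeing" counts as "see" and both "ghosts" and "ghost" count as "ghost", so the second toy document wins again. Irregular forms such as "saw" and "seen" are untouched by this approach, which is one reason to try lemmatization afterwards.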

In [25]:
YouTubeVideo('x-55EOVM-1M', width=640, height=360)

In [None]:
# 4)
# Now we've added some more documents.
# You'll note that the word count solution from 
# before no longer works as intended. It picks
# the document about people seeing things.
# 
# Let's try to re-weight the words so that
# our search algorithm pays more attention
# to the words we care about, for example,
# to "ghosts". 
#

docs = ['"I dont believe people who see ghosts", said Mannie, before spitting into the wind and riding his bike down the street at top speed. He then went home and ate peanut-butter and jelly sandwiches all day. Mannie really liked peanut-butter and jelly sandwiches. He ate them so much that his poor mother had to purchase a new jar of peanut butter every afternoon.',

        'People see incredible things. One time I saw some people talking about things they were seeing, and those people were so much fun. They saw clouds and they saw airplanes. They saw dirt and they saw worms. Can you believe the amount of seeing done by these people? People are the best.',

        'This is an article about a circus. A Circus is where people go to see other people who perform great things. Circuses also have elephants and tigers, which generally get a big woop from the crowd.',

        'Lots of people have come down with Coronavirus. You can see the latest numbers and follow our updates on the pandemic below. Please, stay safe.',

        'Goats are lovely creatures. Many people love goats. People who love goats love seeing them play in the fields.',

        'We have collected a report of people in our community seeing ghosts. Each resident was asked "how many ghosts have you seen?", "describe the last ghost you saw", and "tell us about your mother." Afterwards, we compared the ghost reports between the different individuals, and assessed whether or not they were actually seeing these apparitions.']



search(docs, 'People who see ghosts')
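
One common way to re-weight words is inverse document frequency (IDF): a word that appears in almost every document gets a weight near zero, while a rare word gets a large weight. The sketch below is one possible version, assuming the weight `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. The function names and the tiny `toy_docs` are made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(doc):
    # Lowercase and keep only runs of letters.
    return re.findall(r"[a-z]+", doc.lower())


def inverse_document_frequency(docs):
    # df = number of documents each word appears in (counted once per document).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(docs)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def weighted_score(doc, query, idf):
    # Term count times IDF weight, summed over the distinct query words.
    # Query words that appear in no document get a weight of 0.
    counts = Counter(tokenize(doc))
    return sum(counts[term] * idf.get(term, 0.0) for term in set(tokenize(query)))


def search_weighted(docs, query):
    idf = inverse_document_frequency(docs)
    return max(docs, key=lambda doc: weighted_score(doc, query, idf))


# Tiny made-up corpus: "people" and "see" appear everywhere, "ghosts" only once.
toy_docs = [
    'People see things. People saw people talking to people.',
    'People go to see the circus.',
    'People see ghosts.',
]

print(search_weighted(toy_docs, 'People who see ghosts'))
```

In this toy corpus a plain word count would pick the first document, because it repeats "people" four times. With IDF weights, "people" and "see" contribute nothing since every document contains them, and the single occurrence of "ghosts" decides the result.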

In [26]:
YouTubeVideo('SRiCPv8isck', width=640, height=360)