# Preparing the data

Today, we're going to work on loading and cleaning the dataset. We'll write a few different functions first, and then combine them at the end.

In [196]:
import nltk
from nltk.corpus import stopwords
import csv
import os
import re

In [214]:
# Make sure this runs without output
train_body_path = "train_bodies.csv"
if not os.path.exists(train_body_path):
    print("Check location for train_bodies")
test_body_path = "test_bodies.csv"
if not os.path.exists(test_body_path):
    print("Check location for test_bodies")
train_stance_path = "train_stances.csv"
if not os.path.exists(train_stance_path):
    print("Check location for train_stances")
test_headline_path = "test_stances_unlabeled.csv"
if not os.path.exists(test_headline_path):
    print("Check location for test_stances_unlabeled")

## Preparing strings

### Cleaning strings

First, let's write a function to clean a string. This means taking in a string (a word or a sentence) and making sure that:
- It is all lowercase
- Every run of spaces is only one space long
- It contains only letters and numbers (this part is trickier)

You will probably find string methods helpful for this task. Take a look at the documentation for Python strings to find some useful methods to accomplish at least the first two tasks:
https://docs.python.org/3/library/stdtypes.html#string-methods


In [147]:
def our_clean(s):
    # TODO: lowercase a string
    
    # TODO: make sure all spaces are only one long
    
    # TO TRY: make sure all characters are alphanumeric --- skip this if you get stuck
    return s
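
If you want something to compare your attempt against, here is a minimal sketch of one way to handle all three steps with string methods only. The name `simple_clean` is made up for this example, and replacing every non-alphanumeric character with a space is just one possible design choice.

```python
def simple_clean(s):
    # Step 1: lowercase the whole string
    s = s.lower()
    # Step 3 (done early): swap every non-alphanumeric character for a space
    s = "".join(ch if ch.isalnum() else " " for ch in s)
    # Step 2: split() with no arguments splits on any run of whitespace,
    # so joining with a single space collapses repeated spaces
    return " ".join(s.split())

print(simple_clean("ThIs   has  $ymbols & 123"))
```

Doing the symbol replacement before collapsing spaces matters: replacing symbols can create new runs of spaces, so the collapse has to come last.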

If you got stuck on making sure the string is alphanumeric, don't worry. The best way to do that is something called a regular expression. Regular expressions are super cool, and you can learn more about them here: https://docs.python.org/3/howto/regex.html#regex-howto

The most important thing to know is that they're a fast way to do pattern matching and replacement on strings. We can use one that looks like this to clean strings. 

In [197]:
def clean(s):
    # The regular expression '\w' matches any word character (letters, digits, and the underscore), and '+' means one or more of them in sequence
    return " ".join(re.findall(r'\w+', s, flags=re.UNICODE)).lower()
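
To see what the pattern is doing on its own, the short sketch below runs `re.findall` on a made-up sample string. One detail worth knowing is that `\w` also counts the underscore as a word character; the second pattern, `[^\W_]+`, is one way to exclude it if that matters for your data.

```python
import re

sample = "Hello,   World! snake_case 42"

# \w+ grabs each run of word characters; note that the underscore survives
word_runs = re.findall(r'\w+', sample)
print(word_runs)

# [^\W_]+ means "word characters except the underscore"
no_underscores = re.findall(r'[^\W_]+', sample)
print(no_underscores)
```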

In [198]:
# Let's test it out!
upper = "ThIs SeNtenCE sHouLD bE AlL loWErCaSE"
clean_upper = clean(upper)
print(clean_upper)

this sentence should be all lowercase


In [199]:
symbols = "@this$sentence&should*have?numbers-123-}but|no+symbols#"
clean_symbols = clean(symbols)
print(clean_symbols)

this sentence should have numbers 123 but no symbols


In [200]:
spaces = "this      sentence should    have only  one  space between words"
clean_spaces = clean(spaces)
print(clean_spaces)

this sentence should have only one space between words


### Tokenizing
How do we tokenize a sentence, or break it down into its component words? We can do it ourselves, but there are libraries that do a more advanced job. Let's try making our own function first using string operations. Take a peek at the documentation first:

https://docs.python.org/3/library/stdtypes.html#string-methods


In [201]:
def our_w_tokenize(s):
    # TODO: write a function to split a sentence into its component words
    # HINT: look at str.split() in the documentation. Think about where this would fail. Can you get fancier?
    return s

In [202]:
ex_str = "This is a sentence with words that're regular and that aren't."
print(our_w_tokenize(ex_str))

This is a sentence with words that're regular and that aren't.
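
If you'd like a reference point, here is a minimal sketch of one possible approach: split on whitespace, then strip punctuation from the ends of each piece. The name `split_tokenize` is made up for this example, and this is only one of many reasonable answers.

```python
import string

def split_tokenize(s):
    tokens = []
    for chunk in s.split():
        # Strip punctuation stuck to either end of the chunk, e.g. "aren't." -> "aren't"
        word = chunk.strip(string.punctuation)
        if word:
            tokens.append(word)
    return tokens

print(split_tokenize("This is a sentence with words that're regular and that aren't."))
```

Notice that this version throws the final period away and keeps contractions such as "aren't" in one piece, because `strip` only touches the ends of each chunk.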


Now, let's make a function that uses nltk to tokenize a sentence into a list of words.

Compare the output of your function with the output of nltk's word tokenizer. What differences do you see? Why might we want to use a more complex tokenizer? What other kinds of words might be tricky?

In [203]:
# A method to tokenize a sentence into words
def w_tokenize(s):
    return nltk.word_tokenize(s)

In [204]:
print(w_tokenize(ex_str))

['This', 'is', 'a', 'sentence', 'with', 'words', 'that', "'re", 'regular', 'and', 'that', 'are', "n't", '.']


We can also tokenize sentences, dividing up a paragraph into sentences. Again, we can write a bunch of rules to do this ourselves, or we can let nltk handle it. Let's try it ourselves, using the same string methods as before. Can you think of examples that might confound your function? How might you approach that?

In [206]:
def our_s_tokenize(p):
    # TODO: come up with your own simple way of splitting a paragraph into sentences. Again, look at string libraries.
    # Think about where it might fail. Can you make it better?
    return p

In [207]:
ex_paragraph = "Here is a multi-sentence string (with some unusual parts). Does it handle question marks correctly? What about names like Mr. Rogers? nltk might use different rules than you do!"
print(our_s_tokenize(ex_paragraph))

Here is a multi-sentence string (with some unusual parts). Does it handle question marks correctly? What about names like Mr. Rogers? nltk might use different rules than you do!
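
As a point of comparison, here is a minimal sketch of a rule-based splitter that breaks on whitespace following `.`, `!`, or `?`. The name `split_sentences` and the sample paragraph are made up for this example. It is deliberately naive, so you can see exactly where a simple rule goes wrong.

```python
import re

def split_sentences(p):
    # Split on whitespace that follows '.', '!' or '?'.
    # The lookbehind (?<=...) keeps the punctuation attached to its sentence.
    return [s for s in re.split(r'(?<=[.!?])\s+', p.strip()) if s]

paragraph = "Is this one sentence? Ask Mr. Rogers. He knows!"
for sentence in split_sentences(paragraph):
    print(sentence)
```

The rule treats the period in "Mr." as the end of a sentence, so "Ask Mr. Rogers." is cut in two. Abbreviations are one of the cases a hand-written rule has to special-case.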


Let's write a function that uses the nltk method to tokenize a paragraph into sentences and compare. What's different? What rules might you have forgotten? Tokenizing is a good example of where rule-based approaches are helpful, but also challenging! It's hard to anticipate every case, but sometimes it's necessary.

In [208]:
# A function to tokenize a paragraph into sentences
def s_tokenize(p):
    return nltk.sent_tokenize(p)

In [209]:
print(s_tokenize(ex_paragraph))

['Here is a multi-sentence string (with some unusual parts).', 'Does it handle question marks correctly?', 'What about names like Mr. Rogers?', 'nltk might use different rules than you do!']


### Lemmatizing
Next, we're going to write a function to lemmatize our words. Lemmatizing words means converting them to their most basic form: singular (for nouns), present tense (for verbs), etc. Lemmatizing words makes it easier to compare them for content, even if the words don't appear in exactly the same form. We're going to use nltk to lemmatize our words. Take a look at how it works below.

In [210]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Can you predict what each line will print?
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("ran"))
print(lemmatizer.lemmatize("ran",'v'))

cat
cactus
goose
rock
python
good
best
ran
run


You may be wondering what the second argument (passed either as pos= or positionally) is. It's an optional argument that specifies the part of speech --- without it, the lemmatizer assumes everything is a noun. You can try a few examples of your own below if you want!

In [160]:
# Try whatever words you want here

Implementing this ourselves would be pretty tricky, so we're going to use the nltk lemmatizer. Let's write a function that does the lemmatization on a set of word tokens. 

In [211]:
# A function to take a list of word tokens and lemmatize each
def lemmatize(word_tokens):
    return [lemmatizer.lemmatize(t) for t in word_tokens]
    
print(lemmatize(w_tokenize("Several people running the marathon were injured.")))

['Several', 'people', 'running', 'the', 'marathon', 'were', 'injured', '.']


You'll notice that verbs aren't lemmatized correctly --- that's because we didn't pass the optional part-of-speech argument, so every token is treated as a noun. There's a way to do this with nltk's part-of-speech tagging, which is included in the challenge section.

### Removing stopwords
Can you think about what kinds of words we might not care about when processing natural language for similarity?

Words that occur very frequently and don't convey very much information in searches and NLP are called "stopwords". You can imagine some examples: the, and, a. We frequently remove these words from search queries and text comparisons to reduce some unnecessary noise. Luckily, nltk has our back!

In [212]:
# Here is a list of english stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'then', 'should', 'doesn', 'as', "didn't", 'this', 'herself', 're', "weren't", 'its', 'd', 'were', 'under', 'will', 'his', 'your', "doesn't", 'which', 'both', 'from', 'yourself', 'itself', 'shouldn', "that'll", 'because', 'what', 'a', "shan't", "hadn't", "mustn't", 'are', 's', 'y', 'shan', 'the', 'same', 'aren', 'mightn', 'have', 'mustn', 'until', 'an', 'down', 'before', 'so', "hasn't", 'ours', "you'd", 'it', 'few', 'himself', 'each', 'them', 't', 'has', 'can', 'than', "you've", 'or', 'nor', 'didn', "you're", 'yourselves', 'did', "should've", 'yours', 'how', 'll', 'haven', 'does', 'any', 'our', 'myself', 'was', 'don', 'wasn', "mightn't", 'very', 'once', 'if', 'for', 'you', 'after', 'who', 'above', "she's", 'and', 'hasn', "won't", 'he', 'now', 'couldn', 'i', 'they', 'of', 'against', 'again', 'theirs', 'why', 'during', 'having', 'here', 'over', 'to', 'ourselves', 'up', 'me', 'had', 'into', 'where', 'is', 'all', "shouldn't", 'we', 'my', 'off', "don't", 'some', "haven't", 'further', 'won'

In [213]:
# A method to remove stopwords from a list of word tokens.
def remove_stopwords(word_tokens):
    # TODO: return ONLY the words in word_tokens that DO NOT appear in stop_words
    return word_tokens

In [164]:
stop_ex = remove_stopwords(nltk.word_tokenize("This sentence has meaningful words and stopwords"))
print(stop_ex)

['This', 'sentence', 'meaningful', 'words', 'stopwords']
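
Here is a minimal sketch of one way to do the filtering, using a tiny stand-in stopword set (an assumption, so the example runs without nltk data). The function names are made up for this example. It also shows why 'This' can survive the filter: the stopword list is lowercase and the membership test is case-sensitive.

```python
# A tiny stand-in stopword set so the example runs without nltk data
SAMPLE_STOP_WORDS = {"this", "has", "and", "the", "a"}

def drop_stopwords(word_tokens, stop_words=SAMPLE_STOP_WORDS):
    # Set membership checks are fast; the comparison here is case-sensitive
    return [t for t in word_tokens if t not in stop_words]

def drop_stopwords_ignore_case(word_tokens, stop_words=SAMPLE_STOP_WORDS):
    # Lowercase only for the lookup, so the original token is kept as-is
    return [t for t in word_tokens if t.lower() not in stop_words]

tokens = ["This", "sentence", "has", "meaningful", "words", "and", "stopwords"]
print(drop_stopwords(tokens))
print(drop_stopwords_ignore_case(tokens))
```

If the text has already been lowercased by an earlier cleaning step, the two versions behave the same.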


### Putting it all together!
Let's put together the cleaning methods that we have to clean, tokenize, lemmatize, and remove stopwords. 

In [165]:
# Performs all our cleaning functions for words
def w_super_clean(s):
    return remove_stopwords(lemmatize(w_tokenize(clean(s))))

# Performs all our cleaning functions for paragraphs
def s_super_clean(p):
    sentences = s_tokenize(p)
    clean_sentences = []
    for s in sentences:
        clean_sentences.append(" ".join(remove_stopwords(lemmatize(w_tokenize(clean(s))))))
    return clean_sentences

In [166]:
# You can create your own "dirty sentence" to put your cleaning function to the test
dirty_s = "HeRE's a CRAzy$    sentence that's GOt lots%&*of ERrors"
clean_w = w_super_clean(dirty_s)
print(clean_w)

['crazy', 'sentence', 'got', 'lot', 'error']


In [167]:
dirty_p = "HeRE's a CRAzy$    sentence that's GOt lots%&*of ERrors. There iS more*** than ONe SenTence."
clean_s = s_super_clean(dirty_p)
print(clean_s)

['crazy sentence got lot error', 'one sentence']


## Loading Data

Now, let's try to load in our data so that we can start working with headlines and articles. 
We're going to load the article bodies into a dictionary, and the headlines and stances into lists of tuples. 

In [168]:
# This function loads a body file and breaks it into words and sentences
def load_body(filename):
    id2body = # TODO: make an empty dict
    id2body_sentences = # TODO: make an empty dict
    
    # These lines open the file and read in each row
    with open(filename, encoding='utf-8', errors='ignore') as fh:
        reader = csv.DictReader(fh)
        data = list(reader)
        for row in data:
            
            # This line gets the Body ID for this row
            body_id = row['Body ID']
            # This line gets the article body
            body = str(row['articleBody'])
            # This line strips leading and trailing spaces from the body
            body = body.strip()
            
            body_words =  # TODO: clean the body words
            
            body_sentences = # TODO: clean the body sentences
            
            # TODO: Add this article body to the id2body dict using its body_id as a key
            
            # TODO: Add the list of cleaned sentences body_sentences to the id2body_sentences dict using its body_id as a key
            
    
    return id2body, id2body_sentences
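
If the `csv.DictReader` pattern is new to you, the sketch below shows the same idea on a made-up two-row file held in memory. The file contents, the name `id2text`, and the lowercase-and-split stand-in for cleaning are all assumptions for illustration. Only the column names come from the function above.

```python
import csv
import io

# Hypothetical two-row file using the same column names as the notebook
fake_bodies = io.StringIO(
    "Body ID,articleBody\n"
    '0,"  A small meteorite crashed.  "\n'
    '1,"Residents heard a boom."\n'
)

id2text = {}
for row in csv.DictReader(fake_bodies):
    # Each row is a dict keyed by the header names; every value is a string
    body_id = row["Body ID"]
    body = row["articleBody"].strip()
    # Stand-in for the notebook's cleaning step
    id2text[body_id] = body.lower().split()

print(id2text["0"])
```

Because every value read from a CSV file is a string, the keys of the dictionary are strings such as `'0'`, not integers.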


In [169]:
# This may take a moment to run
id2body, id2body_sentences = load_body(train_body_path)
test_id2body, test_id2body_sentences = load_body(test_body_path)

# We're going to add the test bodies to our overall body database for ease of access later
id2body.update(test_id2body)
id2body_sentences.update(test_id2body_sentences)

In [170]:
# Let's make sure that our data structure looks about right!
print(len(id2body))
print(id2body['0'])
print(id2body_sentences['0'])

2587
['small', 'meteorite', 'crashed', 'wooded', 'area', 'nicaragua', 'capital', 'managua', 'overnight', 'government', 'said', 'sunday', 'resident', 'reported', 'hearing', 'mysterious', 'boom', 'left', '16', 'foot', 'deep', 'crater', 'near', 'city', 'airport', 'associated', 'press', 'report', 'government', 'spokeswoman', 'rosario', 'murillo', 'said', 'committee', 'formed', 'government', 'study', 'event', 'determined', 'wa', 'relatively', 'small', 'meteorite', 'appears', 'come', 'asteroid', 'wa', 'passing', 'close', 'earth', 'house', 'sized', 'asteroid', '2014', 'rc', 'measured', '60', 'foot', 'diameter', 'skimmed', 'earth', 'weekend', 'abc', 'news', 'report', 'murillo', 'said', 'nicaragua', 'ask', 'international', 'expert', 'help', 'local', 'scientist', 'understanding', 'happened', 'crater', 'left', 'meteorite', 'radius', '39', 'foot', 'depth', '16', 'foot', 'said', 'humberto', 'saballos', 'volcanologist', 'nicaraguan', 'institute', 'territorial', 'study', 'wa', 'committee', 'said', 's

In [188]:
def load_title(filename):
    titles = # TODO: make an empty list
    
    # Open csv and read in rows
    with open(filename, errors='ignore') as fh:
        reader = csv.DictReader(fh)
        raw_data = list(reader)
        for row in raw_data:
            
            body_id = #TODO: get the body id cell
            title = #TODO: get the headline cell 
            title = str(title).strip()
            
            clean_title = #TODO: clean title words
        
            title_id_tuple = (clean_title, body_id)
            # TODO: append title_id_tuple to the titles list
            
            
    return titles


In [189]:
test_data = load_title(test_headline_path)

In [193]:
print(test_data[0])

(['ferguson', 'riot', 'pregnant', 'woman', 'loses', 'eye', 'cop', 'fire', 'bean', 'bag', 'round', 'car', 'window'], '2008')


In [190]:
def load_stance(filename):
    stances = # TODO: make an empty list
    with open(filename, errors='ignore') as fh:
        reader = csv.DictReader(fh)
        raw_data = list(reader)
        for row in raw_data:
            title = # TODO: get headline
            body_id = # TODO: get body id
            stance = # TODO: get stance
            
            stance = stance.strip()
            
            clean_title = # TODO: clean title words
            
            stance_tuple = (clean_title, body_id, stance)
            # TODO: append stance_tuple to stances
            
    return stances
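
The same reading pattern, sketched on a made-up in-memory file, looks like this. The column names "Headline" and "Stance" are assumptions here (check the header row of your own file), and the lowercase-and-split step is only a stand-in for the real cleaning function.

```python
import csv
import io

# Hypothetical file; the column names "Headline" and "Stance" are assumptions
fake_stances = io.StringIO(
    "Headline,Body ID,Stance\n"
    "Meteorite lands near airport,0,agree \n"
    "Aliens blamed for crater,0,disagree\n"
)

records = []
for row in csv.DictReader(fake_stances):
    headline_words = row["Headline"].lower().split()  # stand-in for cleaning
    records.append((headline_words, row["Body ID"], row["Stance"].strip()))

print(len(records))
print(records[0])
```

`csv.DictReader` uses the first line of the file as the field names, so the resulting list contains only data rows.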

In [191]:
train_stances = load_stance(train_stance_path)[1:]

In [192]:
print(train_stances[0])

(['hundred', 'palestinian', 'flee', 'flood', 'gaza', 'israel', 'open', 'dam'], '158', 'agree')


Great! We've gotten the data into the form we need so that we can work with it in the coming days. There's some challenge work related to speeding up and improving our data cleaning process below. 

## Challenge

Here are two challenge problems! You can work on whichever one interests you.

### Challenge 1:
One of the flaws of our cleaning function is that it doesn't lemmatize non-nouns correctly (because nltk requires a part of speech argument to process words as non-nouns). Fortunately, nltk provides a method for part-of-speech tagging. Can you write a new lemmatizing function that tags parts of speech first and uses those tags to do a better job lemmatizing?

In [100]:
from nltk.corpus import wordnet

def better_lem(word_tokens):
    lemmas = []
    word_tags = # TODO: use nltk's pos_tag method to do part-of-speech tagging for the sentence
    
    # word_tags should be a list of (word, tag) tuples
    # this if/elif structure sets pos to a string that can be passed to the lemmatize function
    for word, tag in word_tags:
        if tag.startswith('J'):
            pos = wordnet.ADJ
        elif tag.startswith('V'):
            pos = wordnet.VERB
        elif tag.startswith('N'):
            pos = wordnet.NOUN
        elif tag.startswith('R'):
            pos = wordnet.ADV
        else:
            pos = ''
        # TODO: if pos is not '', add the correct part-of-speech lemma to the lemmas list
        
        # TODO: otherwise, add the noun version (no second argument)
        
    return lemmas
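
The tag-to-part-of-speech mapping can also be written as a small lookup table. The sketch below is one possible version that runs without nltk; the names `TAG_PREFIX_TO_POS` and `penn_to_wordnet` are made up for this example. It relies on the fact that WordNet's part-of-speech constants are the single characters 'a', 'v', 'n', and 'r'.

```python
# WordNet's part-of-speech constants are single characters:
# 'a' (adjective), 'v' (verb), 'n' (noun), 'r' (adverb)
TAG_PREFIX_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    # Penn Treebank tags group by first letter: JJ/JJR/JJS, VB/VBD/VBG, NN/NNS, RB/RBR
    return TAG_PREFIX_TO_POS.get(tag[:1], default)

for tag in ["VBD", "NNS", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Falling back to 'n' for unknown tags mirrors the lemmatizer's own default of treating words as nouns.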
        

In [99]:
test_str = "Verbs like ran run running or be are am is should come out about the same"
print(better_lem(w_tokenize(test_str)))

### Challenge 2: 
Before, we allowed nltk to do all of our lemmatization for us. Using the documentation on the Python package re, can you use a regular expression to do basic lemmatization on nouns by making them singular? Think about what patterns usually characterize plural nouns, and replace those with the singular form. Do as many different cases as you have time for!

In [None]:
def re_lem(w):
    # TODO: write a regular expression that replaces plural endings with singular endings. 
    # Hint: you can run the same word through a regular expression multiple times ---  
    # if no matches are found, it won't be changed
    return w
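
To get you started, here is a minimal sketch of the rule-list idea: try a few suffix patterns in order and stop at the first one that matches. The names `SINGULAR_RULES` and `singularize` are made up for this example, and the three rules are only a starting point, not a complete answer.

```python
import re

# (pattern, replacement) pairs, tried in order; the first match wins
SINGULAR_RULES = [
    (r'ies$', 'y'),               # ponies -> pony
    (r'(x|z|ch|sh)es$', r'\1'),   # boxes -> box, churches -> church
    (r'(?<!s)s$', ''),            # cats -> cat, but leave "glass" alone
]

def singularize(w):
    for pattern, replacement in SINGULAR_RULES:
        # subn returns the new string and how many substitutions were made
        new_word, count = re.subn(pattern, replacement, w)
        if count:
            return new_word
    # No rule matched, so return the word unchanged
    return w

for word in ["horses", "cats", "ponies", "boxes", "fish", "geese"]:
    print(word, "->", singularize(word))
```

Irregular plurals such as "geese" pass through unchanged, and a singular word that ends in a single "s" (for example "bus") would be wrongly shortened. Cases like these are where you would need extra rules or a lookup table.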

In [None]:
# Let's test it out!
print(re_lem("horses"))
print(re_lem("cats"))
print(re_lem("ponies"))
print(re_lem("kitties"))
print(re_lem("cacti"))
print(re_lem("octopi"))
print(re_lem("geese"))
print(re_lem("mooses"))
print(re_lem("fish"))