# Feature Extraction & TF-IDF

Today, we're going to implement our tf-idf counter and sketch out the broad outlines of our feature extraction code. Keep in mind, we want everything we write to be compatible with the cleaning and loading code we wrote yesterday, since that's the data that we'll be extracting features from!

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
import csv
import os
import re
from nltk.stem import WordNetLemmatizer
from math import log

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
train_body_path = "train_bodies.csv"
if not os.path.exists(train_body_path):
    print("Check location for train_bodies")
test_body_path = "test_bodies.csv"
if not os.path.exists(test_body_path):
    print("Check location for test_bodies")
train_stance_path = "train_stances.csv"
if not os.path.exists(train_stance_path):
    print("Check location for train_stances")
test_headline_path = "test_stances_unlabeled.csv"
if not os.path.exists(test_headline_path):
    print("Check location for test_stances_unlabeled")

### Dictionary counting

For our idf function, we're going to want to count the number of documents where a word occurs. A common way to count multiple items in Python is a dictionary where the keys are the items to count and the values are the counts for each item. We're going to practice writing that function first.
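
Once your own function works, one way to cross-check its output is the standard library's `collections.Counter`, which builds the same item-to-count mapping. This is only a sketch on made-up data, not the approach the exercise asks for; the list `words` is a toy example.

```python
from collections import Counter

# Toy data for illustration only
words = ["apple", "kiwi", "apple", "banana", "kiwi", "apple"]

# Counter builds the item -> count mapping in one step
counts = Counter(words)
print(dict(counts))

# update() adds new counts on top of the existing ones, which mirrors
# calling a counting function with a dictionary that already has keys in it
counts.update(["kiwi", "mango"])
print(counts["kiwi"], counts["mango"])
```

Comparing `dict(counts)` with the dictionary your function returns is a quick sanity check.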

In [None]:
# This function counts the items into the dictionary. count_items is a list (in this case, of words)
# and count_dictionary is a dictionary of counts. It's important not to assume anything about count_dictionary:
# it could already contain some or all of the keys, or it could be totally empty.
def dictionary_count(count_items, count_dictionary):
    # TODO: loop through all the items in count_items
    for ___ in _______:        
        # TODO: if the item is in the dictionary, add one to its current value, the count
        
        # TODO: if the item isn't in the dictionary, assign it as a key with the value one
        
    # TODO: return the count dictionary


In [None]:
# Let's test it out

fruit_counts = {}
my_fruit = ["apple", "blueberry", "banana", "orange", "apple", "kiwi", "kiwi", "strawberry", "blueberry", "blueberry"]
fruit_counts = dictionary_count(my_fruit, fruit_counts)
print(fruit_counts)

### Eliminating duplicates

One more thing we have to do for our idf counting. We want a factor that calculates the number of documents in which a word appears. So, we want to count at most one occurrence of a word per document. What will happen if we just count all occurrences of a word that we see?

You can imagine that we'll get a much higher number than we want, since most documents have many repeated words. So, we need to write a function to eliminate duplicate words within a single document (at least, temporarily! We want them in there for frequency counting later). 

The outline of the function below is doing this from scratch. There are a few ways to accomplish this in Python --- feel free to diverge from the structure and use another strategy if you like!
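
As an illustration of two of those other strategies, here is a small sketch on made-up data. Both rely only on built-in types; the list `tokens` is a toy example.

```python
# Toy data for illustration only
tokens = ["the", "cat", "saw", "the", "other", "cat"]

# A set drops duplicates but does not promise any particular order
unique_unordered = set(tokens)

# dict.fromkeys keeps the first occurrence of each item, in order
unique_ordered = list(dict.fromkeys(tokens))

print(sorted(unique_unordered))
print(unique_ordered)
```

For document counting the order of the words doesn't matter, so either strategy would do; the ordered version is just easier to read when printing.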

In [None]:
# This function takes in a list of items and eliminates the duplicates
def elim_dupes(items):
    # TODO: make a new list

    # TODO: loop through all list items

        # TODO: if this list item isn't in the new list, add it

    # TODO: return the new list


In [None]:
# Let's test it out

my_fruit_types = elim_dupes(my_fruit)
print(my_fruit_types)

single_count_d = {}
single_count_d = dictionary_count(my_fruit_types, single_count_d)
print(single_count_d)

## IDF scaling

What we're going to do now is write a function that computes the inverse document frequency (idf) of every word: a score that gets higher the fewer documents a word appears in. We will later use this score to scale individual term counts for each text document.

We're going to structure this function to read from a dictionary of text bodies, since that's the format that our id2body data is in. 

Here is the documentation for the dictionary type. We're going to want a method that lets us loop through the keys and values in a dictionary together --- can you find it?

https://docs.python.org/3/tutorial/datastructures.html#dictionaries
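
The idf score in this notebook is log(number of total documents / number of documents that contain the word). To get a feel for the numbers, here is a sketch of that arithmetic on a hypothetical corpus of 4 documents; the helper name `toy_idf` and the counts are made up for illustration.

```python
from math import log

def toy_idf(total_docs, docs_with_word):
    # log(total documents / documents containing the word)
    return log(total_docs / docs_with_word)

# A word found in all 4 documents carries no information: log(4/4) is 0
print(toy_idf(4, 4))

# The fewer documents a word appears in, the larger its score
print(toy_idf(4, 2))
print(toy_idf(4, 1))
```

Words that show up everywhere score close to zero, while rare words score high.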


In [None]:
# Prepare the idf for a corpus of documents
def prepare_idf(corpus):
    docs_containing = {}
    idf = {}
    
    # TODO: loop through the items in corpus using a dictionary method
    for (body_id, body) in corpus.______:
        
        # TODO: use your function to remove duplicates from body
        
        docs_containing = # TODO: use your function to update docs_containing with counts
    
    for word in docs_containing:
        # TODO: set the value in the idf dict for this word to be:
        # log (number of total documents / the number of docs that contain the word)
        
    return idf

In [None]:
# Here's our cleaning code from yesterday! 
# You don't have to do anything, but read it over and make sure you remember what each function is doing

def clean(s):
    # Cleans a string: Lowercasing, trimming, removing non-alphanumeric
    return " ".join(re.findall(r'\w+', s, flags=re.UNICODE)).lower()

def w_tokenize(s):
    return nltk.word_tokenize(s)

def s_tokenize(p):
    return nltk.sent_tokenize(p)

def lemmatize(word_tokens):
    return [lemmatizer.lemmatize(t) for t in word_tokens]

def remove_stopwords(word_tokens):
    # Returns only the words in word_tokens that do not appear in stop_words
    return [w for w in word_tokens if w not in stop_words]

def w_super_clean(s):
    return remove_stopwords(lemmatize(w_tokenize(clean(s))))

def s_super_clean(p):
    sentences = s_tokenize(p)
    clean_sentences = []
    for s in sentences:
        clean_sentences.append(" ".join(remove_stopwords(lemmatize(w_tokenize(clean(s))))))
    return clean_sentences

In [None]:
# Here's our load body function from before
# Again, you don't need to do anything, but read through and ask if any lines confuse you
def load_body(filename):
    id2body = {} 
    id2body_sentences = {} 
    
    # These lines open the file and read in each row
    with open(filename, encoding='utf-8', errors='ignore') as fh:
        
        reader = csv.DictReader(fh)
        data = list(reader)
        for row in data:
            
            # This line gets the Body ID for this row
            id = row['Body ID']
            # This line gets the article body
            body = str(row['articleBody'])
            # This line strips leading and trailing spaces from the body
            body = body.strip()
            
            # Cleaning words and sentences
            body_words = w_super_clean(body) 
            body_sentences = s_super_clean(body)
            
            # Adding to the two dictionaries
            id2body[id] = body_words
            id2body_sentences[id] = body_sentences
    
    return id2body, id2body_sentences


In [None]:
# Here we're creating the body data that we'll use to train our idf scaler!
id2body, id2body_sentences = load_body(train_body_path)
test_id2body, test_id2body_sentences = load_body(test_body_path)

id2body.update(test_id2body)
id2body_sentences.update(test_id2body_sentences)

In [None]:
# Let's make our idf!
idf = prepare_idf(id2body)

In [None]:
# Let's take a peek at some of the entries. Do these look about right to you?
# (A word that never appears in the corpus will raise a KeyError here.)
print(idf["person"])
print(idf["dog"])
print(idf["goldfish"])
print(idf["zebra"])

In [None]:
# Let's play with some example sentences
ex_s_1 = id2body_sentences['0'][0]
print(ex_s_1)

ex_s_2 = id2body_sentences['1'][0]
print(ex_s_2)

ex_s_3 = id2body_sentences['3'][0]
print(ex_s_3)

In [None]:
def print_sentence_idfs(s):
    for w in w_tokenize(s):
        print(w + ": " + str(idf[w]))

In [None]:
print_sentence_idfs(ex_s_1)

Is this what you would have expected? Why or why not? Try running the same function on another sentence and look at those results. Is it what you guessed?
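
If you try a sentence containing a token that isn't in `idf`, a plain `idf[w]` lookup raises a `KeyError`. One possible way around that, sketched here with made-up scores, is `dict.get`, which returns a default instead; `toy_idf` and `safe_idf` are hypothetical names for illustration.

```python
# Hypothetical idf scores for illustration only
toy_idf = {"police": 1.2, "said": 0.3, "zebra": 6.5}

def safe_idf(word, idf_scores, default=None):
    # dict.get returns the default instead of raising KeyError
    return idf_scores.get(word, default)

for w in ["police", "unicorn"]:
    print(w + ": " + str(safe_idf(w, toy_idf)))
```

What default makes sense for an unseen word is a design choice; `None` just makes the gap visible.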

## Looking Forward

We're not going to lay out all of our code today, but we're going to look at the broad outlines of our final project code. A code skeleton is a broad outline of your code made out of comments. It's good practice to create a code skeleton before you embark on large projects, so that you can see how everything will fit together. 

In [None]:
# def make_predictions():
    
    # Load and clean the body (training and test set)
    # Load and clean the stances (training set)
    # Load and clean the headlines (test set)
    
    # Prepare the idf
    
    # Make a predictor to train and predict with
    
    # For every example in the training set:
        # Extract the features
        # Do one training step for a predictor using those features and the correct label
    
    # For every example in the test set:
        # Extract the features 
        # Use the predictor to make a prediction based on those features
    
    # Check our predicted answers against the real answers
    # Output accuracy measures!
    
# def extract_features():
    # Get idf-scaled lexical overlaps 
    # Get semantic similarity 
    # Return a vector containing both

You'll notice that we've already done several of the first steps! Ask an instructor if you have any questions at all. Getting features is the next big hurdle, and we'll spend a few days doing that. Great work this week!

### Challenge 1: Max and Min

Can you write functions to get the words with the maximum and minimum idf scores in a sentence? Those will be the rarest and most common words, respectively.
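
The hints in the outline walk you through doing this with a loop. If you'd like to compare against a built-in approach afterwards, Python's `max` and `min` accept a `key` function. A small sketch with made-up scores; `toy_scores` and `toy_tokens` are hypothetical.

```python
# Hypothetical idf scores for illustration only
toy_scores = {"the": 0.1, "goldfish": 5.2, "swam": 2.7}
toy_tokens = ["the", "goldfish", "swam"]

# key= tells max/min which value to compare for each token
rarest = max(toy_tokens, key=lambda w: toy_scores[w])
most_common = min(toy_tokens, key=lambda w: toy_scores[w])
print(rarest, most_common)
```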

In [None]:
def get_max_idf(s):
    # Hint: split sentence into word tokens
    
    # Hint: create a variable to hold the maximum idf score and another to hold the word with that score
    
    # Hint: loop through the word tokens and check each against the maximum score!
    return s

def get_min_idf(s):
    # Hint: split sentence into word tokens
    
    # Hint: create a variable to hold the minimum idf score and another to hold the word with that score
    
    # Hint: loop through the word tokens and check each against the minimum score!
    return s


In [None]:
print("Most common: " + get_min_idf(ex_s_1) + "; Least common: " + get_max_idf(ex_s_1))
print("Most common: " + get_min_idf(ex_s_2) + "; Least common: " + get_max_idf(ex_s_2))
print("Most common: " + get_min_idf(ex_s_3) + "; Least common: " + get_max_idf(ex_s_3))

### Challenge 2: Synonyms

As many of you have mentioned, it would be really cool to be able to check if two words are synonyms when comparing them. NLTK's WordNet allows us to find synsets (synonym sets) which we can use to do just that. Let's try to write a function to check whether two words are synonyms. 

In [None]:
def synonym_check(word1, word2):
    # TODO: get all synsets from word 1

        # TODO: get lemmas for this synset

            # TODO: compare the name for this lemma to word 2

                # TODO: return True if the same 

    # TODO: otherwise, return False

print(synonym_check("good", "beneficial"))
print(synonym_check("bad", "negative"))
print(synonym_check("many", "lots"))

You should get True False False. 

Hmm... we can see that this method isn't as robust as we might like. Another way to check synonyms is to compare similarity scores, and then set a threshold for calling two words synonyms.
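
To make the thresholding idea concrete, here is a sketch that works on a plain list of similarity scores rather than on WordNet itself. It assumes that a score can sometimes be missing for a pair (represented here as `None`), and the threshold of 0.8 is an arbitrary value picked for illustration; `best_score` and `passes_threshold` are hypothetical helper names.

```python
def best_score(scores):
    # Skip missing scores (None); fall back to 0 if nothing usable is left
    usable = [s for s in scores if s is not None]
    return max(usable, default=0)

def passes_threshold(scores, threshold):
    # Return the best score along with whether it clears the threshold
    top = best_score(scores)
    return top, top > threshold

print(passes_threshold([0.4, None, 0.9], 0.8))
print(passes_threshold([None, None], 0.8))
```

Where to put the threshold is the hard part: set it too low and unrelated words pass, too high and real synonyms fail.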

In [None]:
def synonym_check_2(word1, word2):
    # the maximum similarity found so far
    max_wup = 0
    # gets all possible synsets for word1
    w1 = wordnet.synsets(word1)
    # gets all possible synsets for word2
    w2 = wordnet.synsets(word2)
    # TODO: for synset in w1

        # TODO: for synset in w2

            # TODO: get wup_similarity between the two

            # TODO: if wup_similarity is greater than the previous maximum, update it

    threshold = # TODO: set threshold
    if max_wup > threshold:
        return max_wup, True
    else:
        return max_wup, False

In [None]:
print(synonym_check_2("good", "beneficial"))
print(synonym_check_2("bad", "negative"))
print(synonym_check_2("horse", "goat"))
print(synonym_check_2("terrible", "horrible"))

This method doesn't work very well either! Can you come up with something better?

In [None]:
def synonym_check_3(word1, word2):
    # TODO: your code here