# Working with Text: Assignment

This assignment consists of three exercises. You will build, from scratch, something akin to the `TfidfVectorizer` in `sklearn`. You may reuse any functions from the lecture notebook, but none of them are required.



In [None]:
# ###################################################################
#                          ASSIGNMENT
# ###################################################################
#
# Create a class called "Vectorizer" (or a function called "vectorizer").
#
# It should take a list of strings (documents) and turn it into a
# 2D NumPy array representing the documents as vectors (one row per
# document). You have flexibility in how exactly they are represented,
# but the representation should be based on a basic term-frequency vector.
#
# The vectors should be normalized so that each row has
# an L2 norm of 1.
#
# NOTE: Implement this in pure Python/NumPy.
#
# EXERCISES:
#
# 1) Show that the Euclidean distance and the cosine distance
#    are monotonically related, so they preserve the relative
#    ordering of distances between all the documents. Remember,
#    this is only the case because the vectors are normalized.
#
# 2) Plot a 2D heatmap (a seaborn heatmap, for example) of the
#    pairwise distances between all the documents. Do they make
#    sense?
#
# 3) Make the "query document" ("People who see ghosts") closer,
#    in Euclidean distance, to the "target document"
#    ("We have collected a report...") than to any other document
#    in the corpus. Report the ratio next_closest/target, which
#    should be > 1. This is a competition! I will report those who
#    get the highest score (without doing silly things).
#
#    Try to use the various optimizations (preprocessing, forms of
#    TF-IDF, word removal, etc.) discussed in the slides to increase
#    the separation and thus the ratio.
#    
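
As a starting point, the requirements above (term-frequency counts, one row per document, each row scaled to an L2 norm of 1) might be sketched as follows. This is only a baseline under stated assumptions, not the expected solution: the regex tokenizer, the sorted vocabulary order, and the `toy_docs` list are illustrative choices that do not come from the assignment.

```python
import re

import numpy as np


def vectorizer(documents):
    """Turn a list of strings into L2-normalized term-frequency row vectors."""
    # Assumed tokenizer: lowercase, then keep runs of letters and apostrophes.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]

    # Vocabulary: one column per distinct token, in sorted order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: col for col, tok in enumerate(vocab)}

    # Raw term counts: rows are documents, columns are vocabulary terms.
    X = np.zeros((len(documents), len(vocab)))
    for row, toks in enumerate(tokenized):
        for tok in toks:
            X[row, index[tok]] += 1

    # Divide each row by its L2 norm; all-zero rows are left as zeros.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms


# Illustrative toy corpus (not the assignment's docs list).
toy_docs = ["People who see ghosts", "People see people", "Goats love goats"]
X = vectorizer(toy_docs)
print(X.shape)
print(np.linalg.norm(X, axis=1))
```

Returning a plain array keeps the result directly usable with a distance function that expects one row per document. Improvements such as stop-word removal or IDF weighting would slot in before the normalization step.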




In [None]:
docs = ['People who see ghosts',

        '"I dont believe people who see ghosts", said Mannie, before spitting into the wind and riding his bike down the street at top speed. He then went home and ate peanut-butter and jelly sandwiches all day. Mannie really liked peanut-butter and jelly sandwiches. He ate them so much that his poor mother had to purchase a new jar of peanut butter every afternoon.',

        'People see incredible things. One time I saw some people talking about things they were seeing, and those people were so much fun. They saw clouds and they saw airplanes. They saw dirt and they saw worms. Can you believe the amount of seeing done by these people? People are the best.',

        'This is an article about a circus. A Circus is where people go to see other people who perform great things. Circuses also have elephants and tigers, which generally get a big woop from the crowd.',

        'Lots of people have come down with Coronavirus. You can see the latest numbers and follow our updates on the pandemic below. Please, stay safe.',

        'Goats are lovely creatures. Many people love goats. People who love goats love seeing them play in the fields.',

        'We have collected a report of people in our community seeing ghosts. Each resident was asked "how many ghosts have you seen?", "describe the last ghost you saw", and "tell us about your mother." Afterwards, we compared the ghost reports between the different individuals, and assessed whether or not they were actually seeing these apparitions.']



In [None]:
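import numpy as np


def pairwise_distance(X):
    """Return the N x N matrix of Euclidean distances between the rows of X."""
    N = X.shape[0]
    dists = np.zeros((N, N))
    for i, a in enumerate(X):
        for j, b in enumerate(X):
            dists[i, j] = np.linalg.norm(a - b)

    return dists


def get_score(vecs):
    """Print the ratio next_closest / target for the query document (row 0).

    Row 0 is the query and the last row is the target; the rows in between compete.
    """
    dists = pairwise_distance(vecs)
    # Closest competitor to the query, excluding the query itself and the target.
    closest_idx = np.argmin(dists[0][1:-1]) + 1
    next_best = np.linalg.norm(vecs[closest_idx] - vecs[0])
    target = np.linalg.norm(vecs[-1] - vecs[0])
    score = next_best / target
    print('SCORE: ', score)

# Use get_score to check your score for exercise 3!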
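
For exercise 1, the link between the two distances can also be checked numerically. The sketch below assumes random unit vectors as a stand-in for real document vectors (the names `rng`, `V`, `cosine_dist`, and `euclid_dist` are illustrative) and tests the identity that, for unit vectors, the squared Euclidean distance equals twice the cosine distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for document vectors: 5 random non-negative rows, L2-normalized.
V = rng.random((5, 8))
V = V / np.linalg.norm(V, axis=1, keepdims=True)

# For unit vectors, cosine distance is 1 minus the dot product.
cosine_dist = 1.0 - V @ V.T

# Euclidean distance between every pair of rows, via broadcasting.
diff = V[:, None, :] - V[None, :, :]
euclid_dist = np.linalg.norm(diff, axis=2)

# Identity for unit vectors: ||a - b||^2 = 2 * (1 - a . b)
identity_holds = np.allclose(euclid_dist ** 2, 2.0 * cosine_dist)

# Both distances should order the neighbours of row 0 the same way.
same_ranking = np.array_equal(
    np.argsort(euclid_dist[0]), np.argsort(cosine_dist[0])
)

print(identity_holds, same_ranking)
```

Because the square root is monotonic, the identity implies that both distances rank document pairs identically, even though one is not a constant multiple of the other.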