# Bag of Words Activity

By: Cecelia Henson

We will start by creating a BOW representation vector without any external libraries to see how easy it is!

In [1]:
def vectorize(tokens, vocabulary):
    ''' This function takes a list of tokens and a vocabulary and returns
    a vector with one entry per vocabulary word: the number of times that
    word occurs in tokens (0 if it is absent).'''
    vector=[]
    for token in vocabulary:
        vector.append(tokens.count(token))
    return vector

def unique(sequence):
    '''This function returns a list that omits duplicates but
    preserves order.'''
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]
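
To see what these two helpers produce before applying them to the course descriptions, here is a minimal sketch on two toy sentences. The sentences and the names `doc_a`, `doc_b`, and `toy_vocabulary` are made up for illustration; the helpers are restated compactly so the block runs on its own.

```python
def vectorize(tokens, vocabulary):
    # One count per vocabulary word; 0 when the word is absent from tokens.
    return [tokens.count(word) for word in vocabulary]

def unique(sequence):
    # Drop duplicates while keeping first-seen order.
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

# Toy documents (illustrative only, not part of the activity's corpus)
doc_a = "the cat sat on the mat".split()
doc_b = "the dog sat".split()

# The vocabulary is built from all documents so every vector has the same length.
toy_vocabulary = unique(doc_a + doc_b)
vector_a = vectorize(doc_a, toy_vocabulary)
vector_b = vectorize(doc_b, toy_vocabulary)

print(toy_vocabulary)  # ['the', 'cat', 'sat', 'on', 'mat', 'dog']
print(vector_a)        # [2, 1, 1, 1, 1, 0]
print(vector_b)        # [1, 0, 1, 0, 0, 1]
```

Because both vectors are indexed by the same vocabulary, position `i` always refers to the same word, which is what makes them comparable later on.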

In [2]:
import re
# corpus
cs2300_desc = "An introduction to the science of computation and data including tools, languages, and methods to support artificial intelligence.  Topics include applying the scientific method for data-driven computational problems, analysis, data preparation, and visualization"
cs3300_desc = "This course provides an introduction to applied data science including data preparation, exploratory data analysis, data visualization, statistical testing, and predictive modeling.  Emphasis will be placed on extracting information from data sets that can be turned into actionable insights or interventions.  Problems and data sets are selected from a broad range of disciplines of interest to students, faculty, and industry partners."
cs3400_desc = "This course provides a broad introduction to machine learning. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions. Topic categories include optimization and both supervised and unsupervised methods. Students will reinforce their learning of machine learning algorithms with hands-on, tutorial-oriented laboratory exercises for development of representative applications."
cs3310_desc = "This course provides students the experience of working in a team on large-scale data analysis projects using extensive data sets provided by industry, academic researchers, and the government. Students are given access to data sets and directed questions, and the students apply the theory and practices from previous courses to propose hypotheses and evaluate those hypotheses. Projects end with teams presenting their results to their client both verbally and in written form. Includes discussions of principles for effective data visualization. "

print('Original ' + cs2300_desc)

# convert to lower case (only cs2300_desc is lowercased here; the other descriptions keep their capitalization)
cs2300_desc = cs2300_desc.lower()
print('Lowercase ' + cs2300_desc)

# removing special characters
cs2300_desc = re.sub('[^A-Za-z0-9 ]+', '', cs2300_desc)
cs3300_desc = re.sub('[^A-Za-z0-9 ]+', '', cs3300_desc)
cs3400_desc = re.sub('[^A-Za-z0-9 ]+', '', cs3400_desc)
cs3310_desc = re.sub('[^A-Za-z0-9 ]+', '', cs3310_desc)

# tokenize string into list
cs2300_tokens = cs2300_desc.split()
cs3300_tokens = cs3300_desc.split()
cs3400_tokens = cs3400_desc.split()
cs3310_tokens = cs3310_desc.split()

print('Tokens ' + str(cs2300_tokens))

#TODO create a vocabulary list from all tokens
vocabulary = unique(cs2300_tokens + cs3300_tokens + cs3400_tokens + cs3310_tokens)
print('Vocabulary ' + str(vocabulary))

#TODO convert all 4 sets of tokens into BOW vectors
cs2300_vector = vectorize(cs2300_tokens, vocabulary)
cs3300_vector = vectorize(cs3300_tokens, vocabulary)
cs3400_vector = vectorize(cs3400_tokens, vocabulary)
cs3310_vector = vectorize(cs3310_tokens, vocabulary)
print(cs2300_vector)


Original An introduction to the science of computation and data including tools, languages, and methods to support artificial intelligence.  Topics include applying the scientific method for data-driven computational problems, analysis, data preparation, and visualization
Lowercase an introduction to the science of computation and data including tools, languages, and methods to support artificial intelligence.  topics include applying the scientific method for data-driven computational problems, analysis, data preparation, and visualization
Tokens ['an', 'introduction', 'to', 'the', 'science', 'of', 'computation', 'and', 'data', 'including', 'tools', 'languages', 'and', 'methods', 'to', 'support', 'artificial', 'intelligence', 'topics', 'include', 'applying', 'the', 'scientific', 'method', 'for', 'datadriven', 'computational', 'problems', 'analysis', 'data', 'preparation', 'and', 'visualization']
Vocabulary ['an', 'introduction', 'to', 'the', 'science', 'of', 'computation', 'and', 'dat

Special characters are often scrubbed/omitted from a BOW model.  Add some code above to remove special characters like punctuation and regenerate the BOW model.  
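
One way to apply the same scrubbing to every description is to wrap the steps in a single helper. The sketch below is only an illustration: `preprocess` is a hypothetical name that does not appear in the notebook, and the sample sentence is made up. It lowercases first, then removes anything that is not a letter, digit, or space, then splits on whitespace.

```python
import re

def preprocess(text):
    """Hypothetical helper: lowercase, strip special characters, tokenize."""
    text = text.lower()
    # Same idea as the regex used in the notebook; after lower() only a-z is needed.
    text = re.sub('[^a-z0-9 ]+', '', text)
    return text.split()

sample = "Topics include data-driven problems, analysis, and Visualization."
sample_tokens = preprocess(sample)
print(sample_tokens)
# ['topics', 'include', 'datadriven', 'problems', 'analysis', 'and', 'visualization']
```

Note that removing the hyphen merges "data-driven" into the single token `datadriven`, which matches the Tokens output shown above. Replacing special characters with a space instead would yield two tokens; which is preferable depends on the application.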

In the next cell we can see how to use sklearn to create a BOW model.  

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
CountVec = CountVectorizer(ngram_range=(1,1), # to use bigrams ngram_range=(2,2)
                           stop_words='english')
Count_data = CountVec.fit_transform([cs3300_desc])
 
#create dataframe
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cv_dataframe=pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names_out())
print(cv_dataframe)

   actionable  analysis  applied  broad  course  data  disciplines  emphasis  \
0           1         1        1      1       1     6            1         1   

   exploratory  extracting  ...  provides  range  science  selected  sets  \
0            1           1  ...         1      1        1         1     2   

   statistical  students  testing  turned  visualization  
0            1         1        1       1              1  

[1 rows x 33 columns]
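
The comment in the `CountVectorizer` call mentions bigrams via `ngram_range=(2,2)`. As a rough illustration of what an n-gram is, the sketch below builds them by hand from a token list. The function name `ngrams` and the sample phrase are assumptions for this example, and the sketch does not remove stop words, so it is not a reimplementation of what sklearn does.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the tokens and join each window into one string.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "data preparation and data visualization".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# ['data preparation', 'preparation and', 'and data', 'data visualization']
```

A bigram vocabulary keeps some word-order information that a unigram BOW model discards, at the cost of a larger and sparser vector.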




## Distance
Next we want to determine how similar two BOW vectors are. One way to do this is with a "distance" measure, and a simple one is the Euclidean distance: the square root of the sum of squared differences between corresponding entries. The next cell gives the "looping" version of Euclidean distance. You should write the "vectorized" version using numpy in the following cell. Benchmark the two methods and test them by comparing the 4 BOW vectors previously calculated. Which two course descriptions are "closest"?

In [4]:
import math
def euclidian_loop(X, Y):
    total = 0
    for i in range(len(X)):
        total += (X[i] - Y[i])**2
    return math.sqrt(total)

mostSimilar = euclidian_loop(cs2300_vector,cs3400_vector)
leastSimilar = euclidian_loop(cs3400_vector,cs3310_vector)
print(mostSimilar)
print(leastSimilar)

10.862780491200215
12.609520212918492
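
A vectorized counterpart to `euclidian_loop` might look like the sketch below. The name `euclidian_vectorized` and the small test vectors are assumptions for illustration, not part of the original notebook; the looping version is restated so the block runs on its own and the two results can be checked against each other.

```python
import math
import numpy as np

def euclidian_loop(X, Y):
    # Looping version from the notebook, repeated here for comparison.
    total = 0
    for i in range(len(X)):
        total += (X[i] - Y[i]) ** 2
    return math.sqrt(total)

def euclidian_vectorized(X, Y):
    # Assumed name: subtract the arrays elementwise, square, sum, take the root.
    diff = np.asarray(X, dtype=float) - np.asarray(Y, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

# Small made-up vectors; differences are [1, -1, 0, 2], so the distance is sqrt(6).
x = [1, 0, 2, 3]
y = [0, 1, 2, 1]
loop_result = euclidian_loop(x, y)
vec_result = euclidian_vectorized(x, y)
print(loop_result, vec_result)
```

Both functions should agree up to floating-point rounding. To benchmark them, the standard-library `timeit` module (or the `%timeit` magic in a notebook) can be applied to each function with the same pair of BOW vectors.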
