# Bag of Words Representation

## How to Generate Bag of Words Representations

To make the term-document matrix, we need to find all unique words in our dataset.

In [None]:
#finding all unique words
all_words = []
for item in df_samples_list:
    new_abs_words = item['words']
    all_words += new_abs_words
all_words_unique = list(set(all_words))

print('There are ' + str(len(all_words_unique)) + ' unique words in our dataset.')

For each document (chunk of text) and each term, we mark 1 if the document contains the term and 0 if it does not.

In [None]:
#making document term matrix
word_matrix = {}
for word in all_words_unique:
    word_vec = []
    for item in df_samples_list:
        if word in item['words']:
            word_vec += [1]
        else:
            word_vec += [0]
    word_matrix[word] = word_vec

dc_df = pd.DataFrame(word_matrix)
dc_df
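To make the shape of this matrix concrete, here is a minimal sketch on a tiny corpus. The three `toy_samples` entries are hypothetical and only mimic the `'words'` field used above. The vocabulary is sorted so that the column order is the same on every run, which `list(set(...))` does not guarantee.

```python
# Hypothetical toy corpus; each dict mimics one entry of df_samples_list.
toy_samples = [
    {"words": ["cat", "sat", "mat"]},
    {"words": ["dog", "sat", "log"]},
    {"words": ["cat", "dog"]},
]

# Sorting the vocabulary keeps the column order stable between runs.
toy_vocab = sorted({word for item in toy_samples for word in item["words"]})

# One row per document, one column per term: 1 if the term occurs, else 0.
toy_rows = []
for item in toy_samples:
    present = set(item["words"])  # set membership avoids rescanning the list
    toy_rows.append([1 if word in present else 0 for word in toy_vocab])

print(toy_vocab)
for row in toy_rows:
    print(row)
```

Each printed row is the binary bag-of-words vector of one toy document, in the same layout as a row of `dc_df`.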

# Vector Space

## Vector and Vector Space

### Introduction to Vectors

The word matrix we built above is a good example of a set of vectors stacked together: each row is the vector that represents one document.

In [None]:
dc_df

Extracting a vector from the whole matrix:

In [None]:
print("The vector for document 0 is: ")
pd.DataFrame(dc_df.loc[0,:]).transpose()

## How to Calculate Distance between Vectors

### How to Calculate Distance/Similarity between Vectors

Code for Cosine Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

cos_matrix = cosine_similarity(dc_df)

pd.DataFrame(cos_matrix)
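To show what a single entry of the cosine matrix represents, the sketch below computes the cosine similarity of two vectors by hand: the dot product divided by the product of the two vector lengths. The `cosine` helper and the two binary vectors are illustrative assumptions, not part of the dataset, and returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_u * norm_v)

# Two hypothetical binary document vectors that share exactly one term.
doc_a = [1, 0, 0, 1, 1]
doc_b = [0, 1, 1, 0, 1]

print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_a))
```

The two toy documents each contain three terms and share one of them, so their similarity is about one third, while a document compared with itself scores about 1.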

Code for Euclidean Distance

In [None]:
#sample code for euclidean distance

euc_matrix = euclidean_distances(dc_df)

pd.DataFrame(euc_matrix)
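For comparison, a single Euclidean distance can be computed by hand as the square root of the summed squared differences. The `euclidean` helper and the vectors below are hypothetical and serve only as an illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two hypothetical binary document vectors that differ in four positions.
doc_a = [1, 0, 0, 1, 1]
doc_b = [0, 1, 1, 0, 1]

print(euclidean(doc_a, doc_b))
print(euclidean(doc_a, doc_a))
```

For binary vectors, the squared distance is simply the number of terms that appear in exactly one of the two documents, so a smaller value means more similar documents.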

Finding the document that has the highest similarity to a selected document. Here we choose document 8 (the document at index 8) as an example.

In [None]:
#specify the index of your chosen document
chosen_doc = 8
# pair each score with its document index, skipping the chosen document itself
candidates = [(score, doc) for doc, score in enumerate(cos_matrix[chosen_doc]) if doc != chosen_doc]
score, result_doc = max(candidates)
# note: when using Euclidean distance, change cos_matrix to euc_matrix and max to min,
#       since the smaller the Euclidean distance is, the more similar the two documents are.

print('The document that is the most similar to document ' + str(chosen_doc) + ' is ' + 'document ' + str(result_doc) + '.')

The result shows that document 7 is the most similar to document 8, which makes sense because they are in the same category. If you inspect the matrix further, you will see that documents in the same category have the highest similarity to one another.
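To inspect more than the single best match, one option is to rank every other document against the chosen one. The sketch below is an assumption-laden illustration: `rank_neighbors` and `toy_sim` are hypothetical names, and the function is meant to work on any square score matrix indexed like `cos_matrix` or `euc_matrix`.

```python
def rank_neighbors(matrix, chosen_doc, higher_is_closer=True):
    """Return (doc_index, score) pairs for every other document, closest first.

    Use higher_is_closer=True for similarities (cosine) and
    higher_is_closer=False for distances (Euclidean).
    """
    row = matrix[chosen_doc]
    # Drop the chosen document itself so it can never be its own best match.
    others = [(doc, score) for doc, score in enumerate(row) if doc != chosen_doc]
    return sorted(others, key=lambda pair: pair[1], reverse=higher_is_closer)

# Hypothetical 3x3 similarity matrix used only for illustration.
toy_sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]

print(rank_neighbors(toy_sim, 0))
```

Excluding the chosen document by index, rather than by position in a sorted list, keeps the result correct even when another document ties with it.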

# Term Weighting

## Code for Different Term Weighting Strategies

Then we compute the term frequency (TF) matrix: for each document, the number of times a term occurs divided by the document's total word count.

In [None]:
#tf
tf_matrix = {}
tf_ranking = {}
for word in all_words_unique:
    word_vec = []
    for item in df_samples_list:
        if word in item['words']:
            word_vec += [item['words'].count(word)/len(item['words'])]
        else:
            word_vec += [0]
    tf_matrix[word] = word_vec

pd.DataFrame(tf_matrix)
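As a small worked example of the same normalisation, the sketch below computes term frequencies for one hypothetical word list with `collections.Counter`. The names `toy_words` and `toy_tf` are illustrative only.

```python
from collections import Counter

# Hypothetical tokenized document; "cat" appears twice out of four words.
toy_words = ["cat", "sat", "cat", "mat"]

# Count each term once, then divide by the document length.
counts = Counter(toy_words)
toy_tf = {word: count / len(toy_words) for word, count in counts.items()}

print(toy_tf)
```

Because every count is divided by the document length, the TF values of a single document always sum to 1, which makes documents of different lengths comparable.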

Showing the term ranking of each document according to the term frequency (TF) score.

In [None]:
def top_terms(matrix_df, n=10): #input should be a pandas dataframe
    output_dict = {}
    for index, series in matrix_df.iterrows():
        doc_num = 'doc' + str(index)
        scores = dict(series)
        scores_sorted = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}
        terms = scores_sorted.keys()
        terms_topn = list(terms)[:n]
        output_dict[doc_num] = terms_topn
    output_df = pd.DataFrame(output_dict)
    return output_df.transpose()

matrix_df = pd.DataFrame(tf_matrix)
tf_ranking = top_terms(matrix_df, 10)
tf_ranking

Computing the inverse document frequency (IDF) score of every term.

In [None]:
import math

#idf
idf_matrix = {}
for word in word_matrix:
    idf_matrix[word] = math.log(len(df_samples_list)/sum(word_matrix[word]))

print("Inverse Document Frequency Matrix successfully computed!")
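The effect of IDF is easiest to see on a tiny corpus. The sketch below uses three hypothetical documents and the same natural-log formula as the cell above; `toy_samples`, `doc_freq`, and `toy_idf` are illustrative names.

```python
import math

# Hypothetical toy corpus; each dict mimics one entry of df_samples_list.
toy_samples = [
    {"words": ["cat", "sat", "mat"]},
    {"words": ["dog", "sat", "log"]},
    {"words": ["cat", "dog"]},
]
n_docs = len(toy_samples)

# Document frequency: in how many documents does each term appear at least once?
doc_freq = {}
for item in toy_samples:
    for word in set(item["words"]):  # set() so repeats in one document count once
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Natural log of (number of documents / document frequency).
toy_idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

for word, value in sorted(toy_idf.items(), key=lambda pair: pair[1], reverse=True):
    print(word, round(value, 3))
```

Terms that occur in only one document ("mat", "log") receive a higher score than terms shared by two documents, and a term that occurs in every document would score 0.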

Showing the term ranking according to the inverse document frequency (IDF) score.

In [None]:
idf_ranking = {k: v for k, v in sorted(idf_matrix.items(), key=lambda item: item[1], reverse=True)}
idf_ranking

In [None]:
#tfidf
tfidf = {}
for word in idf_matrix:
    idf = idf_matrix[word]
    tfidf_vec = tf_matrix[word]
    tfidf[word] = [i * idf for i in tfidf_vec]

pd.DataFrame(tfidf)
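The three weighting steps computed in the cells above can be summarised in formulas. The symbols are notation introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of words in $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}
\qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term therefore scores highly in a document when it is frequent in that document but rare across the collection. Other TF-IDF variants, for example with smoothing, exist and will give different numbers.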

Finding the terms with the highest TF-IDF scores and ranking the terms of each document by TF-IDF.

In [None]:
tfidf_df = pd.DataFrame(tfidf)
top_tfidf_terms = top_terms(tfidf_df)

top_tfidf_terms

This time we find the document most similar to a chosen document using the TF-IDF matrix. We don't expect much improvement here, since the results from the simple binary term-document matrix were already good: the demo dataset is small and diverse.

In [None]:
dc_df = pd.DataFrame(tfidf)
cos_matrix = cosine_similarity(dc_df)

pd.DataFrame(cos_matrix)

Again, we take document 8 as an example and look for the document that is most similar to it.

In [None]:
#specify the index of your chosen document
chosen_doc = 8
# pair each score with its document index, skipping the chosen document itself
candidates = [(score, doc) for doc, score in enumerate(cos_matrix[chosen_doc]) if doc != chosen_doc]
score, result_doc = max(candidates)

print('The document that is the most similar to document ' + str(chosen_doc) + ' is ' + 'document ' + str(result_doc) + '.')