
# Vector Space Model - Wedding Gown

CMPE 256

Cory Randolph

11/7/2021



# Prompt

Learning Objective: Develop Vector Space Model for Wedding Gown

d1: User selected Wedding gown.

d2: User ordered on-line rose flowers.

d3: User searched diamond ring.

d4: User selected white wedding gown, online flowers, 3 carat diamond ring.


# Summary of Analysis

This notebook takes the 4 sample documents, builds a Vector Space Model from their TF-IDF vectors, and uses cosine similarity to determine which documents are most similar to each other.

The end result is this table:

| document_id   | document                                                                |        d1 |        d2 |        d3 |       d4 |
|:--------------|:------------------------------------------------------------------------|----------:|----------:|----------:|---------:|
| d1            | User selected Wedding gown.                                             | 1         | 0.0842048 | 0.11745   | 0.552465 |
| d2            | User ordered on-line rose flowers.                                      | 0.0842048 | 1         | 0.0776134 | 0.152707 |
| d3            | User searched diamond ring.                                             | 0.11745   | 0.0776134 | 1         | 0.361108 |
| d4            | User selected white wedding gown, online flowers, 3 carat diamond ring, | 0.552465  | 0.152707  | 0.361108  | 1        |

This table gives the following results:

*   d1 and d4 are the most similar to each other (cosine similarity of 0.55)
*   d3 is somewhat similar to d4 (cosine similarity of 0.36)
*   d2 is only weakly similar to d4 (cosine similarity of 0.15)

# Imports

In [27]:
import pandas as pd

# Data

Input the data for the documents manually.

In [28]:
data = [
        ['d1', 'User selected Wedding gown.'],
        ['d2', 'User ordered on-line rose flowers.'],
        ['d3', 'User searched diamond ring.'],
        ['d4', 'User selected white wedding gown, online flowers, 3 carat diamond ring,'],
]

columns = ['document_id', 'document']

Convert the data into a pandas DataFrame.

In [29]:
df = pd.DataFrame(data = data, columns = columns)

# Set the index
df.set_index('document_id',inplace = True)

# Display the first few rows
df.head()

| document_id   | document                                          |
|:--------------|:--------------------------------------------------|
| d1            | User selected Wedding gown.                       |
| d2            | User ordered on-line rose flowers.                |
| d3            | User searched diamond ring.                       |
| d4            | User selected white wedding gown, online flowe... |


Normalize the text (lowercase it and replace punctuation with whitespace), then build a bag-of-words representation of each document.

In [30]:
from collections import Counter

bag_of_words = (
    df['document'].
    str.lower().                  # convert all letters to lowercase
    str.replace(r"[^\w\s]", " ", regex=True).  # replace punctuation (non-word, non-space characters) with whitespace; regex=True is required in pandas >= 2.0
    str.split()                   # split on whitespace
).apply(Counter)

bag_of_words

document_id
d1    {'user': 1, 'selected': 1, 'wedding': 1, 'gown...
d2    {'user': 1, 'ordered': 1, 'on': 1, 'line': 1, ...
d3    {'user': 1, 'searched': 1, 'diamond': 1, 'ring...
d4    {'user': 1, 'selected': 1, 'white': 1, 'weddin...
Name: document, dtype: object

Convert the bag of words representation into a term-frequency matrix.

In [31]:
tf = pd.DataFrame(list(bag_of_words))

# Fill the NA's with 0's
tf = tf.fillna(0)

tf

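|    | user | selected | wedding | gown | ordered | on  | line | rose | flowers | searched | diamond | ring | white | online | 3   | carat |
|---:|-----:|---------:|--------:|-----:|--------:|----:|-----:|-----:|--------:|---------:|--------:|-----:|------:|-------:|----:|------:|
|  0 |    1 |      1.0 |     1.0 |  1.0 |     0.0 | 0.0 |  0.0 |  0.0 |     0.0 |      0.0 |     0.0 |  0.0 |   0.0 |    0.0 | 0.0 |   0.0 |
|  1 |    1 |      0.0 |     0.0 |  0.0 |     1.0 | 1.0 |  1.0 |  1.0 |     1.0 |      0.0 |     0.0 |  0.0 |   0.0 |    0.0 | 0.0 |   0.0 |
|  2 |    1 |      0.0 |     0.0 |  0.0 |     0.0 | 0.0 |  0.0 |  0.0 |     0.0 |      1.0 |     1.0 |  1.0 |   0.0 |    0.0 | 0.0 |   0.0 |
|  3 |    1 |      1.0 |     1.0 |  1.0 |     0.0 | 0.0 |  0.0 |  0.0 |     1.0 |      0.0 |     1.0 |  1.0 |   1.0 |    1.0 | 1.0 |   1.0 |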


# Apply Vector Space Model

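Use scikit-learn's `CountVectorizer` to build the term-frequency features (similar to the manual method above). Note that its default tokenizer keeps only tokens of two or more characters, so the term `3` is dropped and the matrix has 15 columns instead of 16.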

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(df['document'])
tf_sparse = vec.transform(df['document'])


Convert the sparse representation to a dense one

In [33]:
tf_dense = tf_sparse.todense()
tf_dense

matrix([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
        [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
        [1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1]])

Visualize the matrix as a dataframe

In [34]:
pd.DataFrame(
    tf_dense,
    columns=vec.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
)

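|    | carat | diamond | flowers | gown | line | on | online | ordered | ring | rose | searched | selected | user | wedding | white |
|---:|------:|--------:|--------:|-----:|-----:|---:|-------:|--------:|-----:|-----:|---------:|---------:|-----:|--------:|------:|
|  0 |     0 |       0 |       0 |    1 |    0 |  0 |      0 |       0 |    0 |    0 |        0 |        1 |    1 |       1 |     0 |
|  1 |     0 |       0 |       1 |    0 |    1 |  1 |      0 |       1 |    0 |    1 |        0 |        0 |    1 |       0 |     0 |
|  2 |     0 |       1 |       0 |    0 |    0 |  0 |      0 |       0 |    1 |    0 |        1 |        0 |    1 |       0 |     0 |
|  3 |     1 |       1 |       1 |    1 |    0 |  0 |      1 |       0 |    1 |    0 |        0 |        1 |    1 |       1 |     1 |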
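
The manual term-frequency matrix has 16 columns, while the `CountVectorizer` matrix has 15. The sketch below illustrates why, using only the standard library: the regular expression `\b\w\w+\b` is an approximation of `CountVectorizer`'s default token pattern (tokens of two or more word characters), and the variable names are illustrative, not part of the notebook.

```python
import re

d4 = "User selected white wedding gown, online flowers, 3 carat diamond ring,"

# Manual approach from the bag-of-words cell: strip punctuation, split on whitespace.
manual_tokens = re.sub(r"[^\w\s]", " ", d4.lower()).split()

# Approximation of CountVectorizer's default tokenization:
# keep only runs of two or more word characters.
sklearn_like_tokens = re.findall(r"\b\w\w+\b", d4.lower())

# Tokens that the manual approach keeps but the vectorizer-style pattern drops.
dropped = [t for t in manual_tokens if t not in sklearn_like_tokens]
print(dropped)  # ['3']
```

If single-character tokens such as `3` should be kept, one option is to pass a custom `token_pattern` (for example `r"(?u)\b\w+\b"`) to `CountVectorizer`; the rest of this notebook uses the default.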


# Apply TF-IDF

Apply TF-IDF weighting to the documents.

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(norm=None) # Do not normalize.
vec.fit(df['document']) # This determines the vocabulary.
tf_idf_sparse = vec.transform(df['document'])
tf_idf_sparse.data

array([1.51082562, 1.        , 1.51082562, 1.51082562, 1.        ,
       1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.51082562,
       1.        , 1.91629073, 1.51082562, 1.51082562, 1.91629073,
       1.51082562, 1.        , 1.51082562, 1.51082562, 1.91629073,
       1.51082562, 1.51082562, 1.51082562, 1.91629073])
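
The three distinct values in this output can be reproduced by hand. This is a minimal sketch, assuming scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The helper name `smoothed_idf` is illustrative only. Because every term occurs at most once per document here and `norm=None` is set, each TF-IDF weight is simply the term's IDF.

```python
import math

N_DOCS = 4  # d1..d4


def smoothed_idf(doc_freq, n_docs=N_DOCS):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# 'user' appears in all 4 documents, 'gown' in 2 (d1, d4), 'rose' in 1 (d2).
for term, doc_freq in [("user", 4), ("gown", 2), ("rose", 1)]:
    print(term, round(smoothed_idf(doc_freq), 8))
# user 1.0
# gown 1.51082562
# rose 1.91629073
```

Terms shared by many documents (such as `user`) receive the lowest weight, so they contribute least to the similarity scores.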

# Apply Cosine Similarity

Apply Cosine Similarity to compare the vectors

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(tf_idf_sparse)

array([[1.        , 0.08420485, 0.1174499 , 0.55246518],
       [0.08420485, 1.        , 0.07761339, 0.15270708],
       [0.1174499 , 0.07761339, 1.        , 0.36110824],
       [0.55246518, 0.15270708, 0.36110824, 1.        ]])
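
As a cross-check, the same similarities can be computed without scikit-learn. The sketch below is an approximation of the pipeline under two assumptions: tokens are runs of two or more word characters (mirroring `CountVectorizer`'s default), and IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1. The names `tokenize`, `weights`, and `cosine` are illustrative.

```python
import math
import re
from collections import Counter

docs = {
    "d1": "User selected Wedding gown.",
    "d2": "User ordered on-line rose flowers.",
    "d3": "User searched diamond ring.",
    "d4": "User selected white wedding gown, online flowers, 3 carat diamond ring,",
}


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for c in counts.values() for term in c)

# Unnormalized TF-IDF weights: tf * (ln((1 + n) / (1 + df)) + 1).
weights = {
    doc_id: {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in c.items()
    }
    for doc_id, c in counts.items()
}


def cosine(a, b):
    # Dot product over shared terms divided by the product of vector lengths.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


print(round(cosine(weights["d1"], weights["d4"]), 6))  # 0.552465
```

Cosine similarity divides by the vector lengths, so the long document d4 is not favored merely for containing more terms.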

Convert this to a dataframe for easy visualization.

In [37]:
df_cos_sim = pd.DataFrame(cosine_similarity(tf_idf_sparse))

# Set the index to get the original document ids
df_cos_sim.set_index(df.index, drop = True, inplace = True)

# Use the document_id as the column labels
df_cos_sim.columns = df.index.to_numpy()
df_cos_sim

| document_id   |       d1 |       d2 |       d3 |       d4 |
|:--------------|---------:|---------:|---------:|---------:|
| d1            | 1.0      | 0.084205 | 0.11745  | 0.552465 |
| d2            | 0.084205 | 1.0      | 0.077613 | 0.152707 |
| d3            | 0.11745  | 0.077613 | 1.0      | 0.361108 |
| d4            | 0.552465 | 0.152707 | 0.361108 | 1.0      |


Join the cosine similarity dataframe with the original one


In [38]:
df_final = df.merge(df_cos_sim, on = 'document_id')
df_final

| document_id   | document                                          |       d1 |       d2 |       d3 |       d4 |
|:--------------|:--------------------------------------------------|---------:|---------:|---------:|---------:|
| d1            | User selected Wedding gown.                       | 1.0      | 0.084205 | 0.11745  | 0.552465 |
| d2            | User ordered on-line rose flowers.                | 0.084205 | 1.0      | 0.077613 | 0.152707 |
| d3            | User searched diamond ring.                       | 0.11745  | 0.077613 | 1.0      | 0.361108 |
| d4            | User selected white wedding gown, online flowe... | 0.552465 | 0.152707 | 0.361108 | 1.0      |


This table gives the following results:

*   d1 and d4 are the most similar to each other (cosine similarity of 0.55)
*   d3 is somewhat similar to d4 (cosine similarity of 0.36)
*   d2 is only weakly similar to d4 (cosine similarity of 0.15)
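
The ranking above can also be read off programmatically rather than by eye. The sketch below copies the similarity values from the table (rounded as printed) into a plain dictionary; the names `sim`, `best_pair`, and `ranked_neighbours` are illustrative and do not appear in the notebook.

```python
# Cosine similarities copied from the table above (rounded as printed).
sim = {
    "d1": {"d1": 1.0, "d2": 0.084205, "d3": 0.11745, "d4": 0.552465},
    "d2": {"d1": 0.084205, "d2": 1.0, "d3": 0.077613, "d4": 0.152707},
    "d3": {"d1": 0.11745, "d2": 0.077613, "d3": 1.0, "d4": 0.361108},
    "d4": {"d1": 0.552465, "d2": 0.152707, "d3": 0.361108, "d4": 1.0},
}

# Most similar pair of distinct documents (a < b skips the diagonal and mirrored pairs).
best_pair = max(
    ((a, b, sim[a][b]) for a in sim for b in sim if a < b),
    key=lambda triple: triple[2],
)
print(best_pair)  # ('d1', 'd4', 0.552465)


def ranked_neighbours(doc_id):
    """Other documents sorted from most to least similar to doc_id."""
    others = [(other, s) for other, s in sim[doc_id].items() if other != doc_id]
    return sorted(others, key=lambda pair: pair[1], reverse=True)


print(ranked_neighbours("d4"))
# [('d1', 0.552465), ('d3', 0.361108), ('d2', 0.152707)]
```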

# Reference

Example of Vector Space Model [reference](https://colab.research.google.com/github/dlsun/pods/blob/master/10-Textual-Data/10.2%20The%20Vector%20Space%20Model.ipynb#scrollTo=2UOASR79b74x)

In [39]:
# Render the dataframe as a Markdown table string (requires the tabulate package):
df_final.to_markdown()

'| document_id   | document                                                                |        d1 |        d2 |        d3 |       d4 |\n|:--------------|:------------------------------------------------------------------------|----------:|----------:|----------:|---------:|\n| d1            | User selected Wedding gown.                                             | 1         | 0.0842048 | 0.11745   | 0.552465 |\n| d2            | User ordered on-line rose flowers.                                      | 0.0842048 | 1         | 0.0776134 | 0.152707 |\n| d3            | User searched diamond ring.                                             | 0.11745   | 0.0776134 | 1         | 0.361108 |\n| d4            | User selected white wedding gown, online flowers, 3 carat diamond ring, | 0.552465  | 0.152707  | 0.361108  | 1        |'
