# End-to-End Pipeline for User 1 Statement Similarity Analysis

This notebook builds an end-to-end pipeline that measures how similar User 1's statement is to the statements of four other users. It uses BERT and Word2Vec embeddings to compute cosine similarity scores and then ranks the other users by those scores.

## Pipeline Overview
1. Import Libraries
2. Load Models
3. User Statements
4. Preprocessing Functions
5. Embedding Functions
6. Cosine Similarity Function
7. Computation of Embeddings for User 1
8. Computation of Similarity Scores
9. Ranking Based on Similarity Scores

**Information used for matching:**

- User 1: I think that fashion trends dictate what we wear and that beauty standards can influence our choices in makeup and skincare.
- User 2: Finding the perfect outfit can be a daunting task but experimenting with different hairstyles can help explore different looks and add flair to your style.
- User 3: Accessories can elevate any ensemble and adding bold colors to your look can be empowering and fun.
- User 4: Fashion shows showcase the latest designs and styles and it’s a great place to invest in quality products.
- User 5: Personal style reflects individuality and creativity, and I think that confidence is the best accessory someone can wear.

**Step 1: Import Libraries**

In [10]:
# ! pip install gensim
# ! pip install transformers
# ! pip install torch
import numpy as np
from transformers import BertTokenizer, BertModel
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
import torch

The cell above imports all the necessary modules:

1. 'numpy' for numerical computation.
2. 'BertTokenizer' and 'BertModel' from Hugging Face's Transformers library for creating the BERT embeddings.
3. 'gensim' (through 'gensim.downloader') for loading the pre-trained Word2Vec embeddings.
4. 'cosine_similarity' from sklearn for calculating the similarity between vectors.
5. 'torch', which the BERT model runs on and which provides 'torch.no_grad()' for inference.



**Step 2: Load Pre-trained Models**

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
word2vec_model = api.load('word2vec-google-news-300')

This step loads the pre-trained BERT and Word2Vec models.

1. 'bert-base-uncased' is the base-size, uncased (lowercased) version of BERT, which suits this general-purpose task.
2. 'word2vec-google-news-300' provides 300-dimensional word vectors pre-trained on a large dataset of Google News articles. It is good at capturing semantic relationships between words.

**Step 3: User Statements**

In [8]:
user_statements = [
    "I think that fashion trends dictate what we wear and that beauty standards can influence our choices in makeup and skincare.",
    "Finding the perfect outfit can be a daunting task but experimenting with different hairstyles can help explore different looks and add flair to your style.",
    "Accessories can elevate any ensemble and adding bold colors to your look can be empowering and fun.",
    "Fashion shows showcase the latest designs and styles and it’s a great place to invest in quality products.",
    "Personal style reflects individuality and creativity, and I think that confidence is the best accessory someone can wear."
]


**Step 4: Preprocessing Functions**

In [9]:
def preprocess_bert(statement):
  inputs = tokenizer(statement, return_tensors="pt", padding=True, truncation=True)
  return inputs

def preprocess_word2vec(statement):
  tokens = statement.lower().split()
  return [word for word in tokens if word in word2vec_model]

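The functions above preprocess a user's statement for the BERT and Word2Vec models.
1. 'preprocess_bert' tokenizes the statement and returns the PyTorch tensors that BERT expects as input.
2. 'preprocess_word2vec' lowercases the statement, splits it on whitespace, and filters out tokens not present in the Word2Vec vocabulary. Because punctuation stays attached to words after a whitespace split, a token such as 'skincare.' is kept only if that exact string is in the vocabulary.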
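
One possible refinement of the Word2Vec preprocessing is to strip punctuation before the vocabulary lookup. The sketch below is not the notebook's function: `tokenize_for_word2vec` and `toy_vocabulary` are hypothetical names, and the small set stands in for `word2vec_model` so the example runs without downloading the vectors.

```python
import re

def tokenize_for_word2vec(statement, vocabulary):
    """Lowercase, extract runs of letters/apostrophes, keep in-vocabulary words."""
    # The pattern drops trailing punctuation, so "skincare." becomes "skincare".
    tokens = re.findall(r"[a-z']+", statement.lower())
    return [word for word in tokens if word in vocabulary]

# Hypothetical stand-in vocabulary; the notebook would pass word2vec_model instead.
toy_vocabulary = {"makeup", "skincare"}

# Whitespace splitting keeps the period attached to the last word.
whitespace_tokens = [w for w in "makeup and skincare.".lower().split()
                     if w in toy_vocabulary]
regex_tokens = tokenize_for_word2vec("Makeup and skincare.", toy_vocabulary)

print(whitespace_tokens)  # only the word without attached punctuation survives
print(regex_tokens)       # both vocabulary words survive
```

Whether this changes the scores on the five statements depends on the real vocabulary, so it is worth checking against `word2vec_model` before adopting it.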

**Step 5: Embedding Functions**

In [11]:
def compute_bert_embeddings(inputs):
  with torch.no_grad():
    outputs = bert_model(**inputs)
  embeddings = outputs.last_hidden_state.mean(dim=1).squeeze()
  return embeddings.numpy()

In [12]:
def compute_word2vec_embeddings(statement):
  return np.mean(word2vec_model[preprocess_word2vec(statement)], axis=0)

The functions above compute BERT and Word2Vec embeddings for an input statement.
1.  The compute_bert_embeddings function averages the last hidden states of the BERT model over all token positions, giving one vector per statement.
2.  The compute_word2vec_embeddings function averages the Word2Vec vectors of the in-vocabulary words in the statement.
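
In this notebook each statement is encoded on its own, so no padding tokens are produced and a plain mean over token positions is fine. If several statements were batched together, padded positions would be averaged in as well. A minimal sketch of mask-aware mean pooling follows, written with plain Python lists so it runs without the model; `masked_mean_pool` is a hypothetical helper, and the vectors are made-up two-dimensional stand-ins for BERT's hidden states.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose attention-mask value is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Component-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by one padding position (mask value 0).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)  # mean of the first two vectors only
```

With real model outputs the same idea would be applied to `outputs.last_hidden_state` using the `attention_mask` returned by the tokenizer.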



**Step 6: Compute Cosine Similarity Function**

In [13]:
def cosine_similarity(vec1, vec2):
  return np.dot(vec1, vec2)/(np.linalg.norm(vec1) * (np.linalg.norm(vec2)))

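This function calculates the cosine similarity between two vectors using NumPy: their dot product divided by the product of their norms. Because it has the same name as the 'cosine_similarity' imported from sklearn in Step 1, it replaces that import for the rest of the notebook.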
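
As a quick sanity check of the formula, the sketch below recomputes cosine similarity with only the standard library on small hand-picked vectors. The function name `cosine` and the zero-vector guard are assumptions added for illustration; the notebook's function has no such guard.

```python
import math

def cosine(vec1, vec2):
    """Cosine similarity: dot(vec1, vec2) / (|vec1| * |vec2|)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm1 * norm2)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
partial = cosine([3.0, 4.0], [4.0, 3.0])          # 24 / (5 * 5)

print(same_direction, orthogonal, partial)
```

Parallel vectors score close to 1 regardless of length, and perpendicular vectors score 0, which is why the measure suits embeddings of statements with different word counts.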

**Step 7: Preprocessing of User 1's statement and computation of embeddings**

In [30]:
# user 1's statement for Bert
user1_bert_inputs = preprocess_bert(user_statements[0])

#computation of bert embeddings for user 1
user1_bert_embeddings = compute_bert_embeddings(user1_bert_inputs)
# print('user1_bert_embeddings', user1_bert_embeddings[0])

#computation of word2vec embeddings for user 1's statement
user1_word2vec_embeddings = compute_word2vec_embeddings(user_statements[0])
# print('user1_word2vec_embeddings', user1_word2vec_embeddings[0])

The step above converts User 1's statement into embeddings with both BERT and Word2Vec.

**Step 8: Computation of similarity scores for User1 with other users**

In [23]:
# initializing list for storing similarity scores
bert_similarities =[]
word2vec_similarities = []

#iterate over the users other than user1
for statement in user_statements[1:]:
  #preprocess for bert
  statement_bert_inputs = preprocess_bert(statement)
  #compute bert embeddings for the other user's statement
  statement_bert_embeddings = compute_bert_embeddings(statement_bert_inputs)

  #compute Word2Vec embeddings for the statement
  statement_word2vec_embeddings = compute_word2vec_embeddings(statement)

  #calculation of cosine similarity between user 1 and the statements of others
  bert_similarity = cosine_similarity(user1_bert_embeddings, statement_bert_embeddings)
  word2vec_similarity = cosine_similarity(user1_word2vec_embeddings, statement_word2vec_embeddings)

  # Append similarity scores to both lists
  bert_similarities.append(bert_similarity)
  word2vec_similarities.append(word2vec_similarity)

The steps above do the following:
1. Preprocess User 1's statement with 'preprocess_bert' to get the input for BERT (Step 7).
2. Compute the BERT and Word2Vec embeddings for User 1 (Step 7).
3. Compute the same embeddings for each of the other users' statements.
4. Calculate the cosine similarity of each statement with User 1's and store the scores in the lists 'bert_similarities' and 'word2vec_similarities'.

**Step 9: Ranking based on similarity scores**

In [24]:
sorted_indices_bert = np.argsort(bert_similarities)[::-1]
sorted_indices_word2vec = np.argsort(word2vec_similarities)[::-1]
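
The indices produced in this step are positions in the similarity lists, which start at User 2 because User 1 is the query; that is why the printing cells add 2 to each index. The sketch below shows the same mapping with the standard library. `rank_users` is a hypothetical helper and the scores are made-up values, not results from the models.

```python
def rank_users(similarities, first_user_number=2):
    """Return (user_number, score) pairs sorted from most to least similar."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i],
                   reverse=True)
    # List position 0 corresponds to User 2, position 1 to User 3, and so on.
    return [(first_user_number + i, similarities[i]) for i in order]

# Illustrative scores for Users 2, 3, 4, and 5 (not model output).
example_scores = [0.2, 0.9, 0.5, 0.7]

ranking = rank_users(example_scores)
for rank, (user, score) in enumerate(ranking, 1):
    print(f"Rank {rank}: User {user} (Similarity Score: {score})")
```

Returning the score together with the user number keeps each ranking tied to the list it was computed from.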

**Ranking output for User 1 using BERT embeddings**

In [28]:
print(f'Rank matches with user1 using Bert embeddings:')
for i, ix in enumerate(sorted_indices_bert, 1):
  print(f'Rank {i}: User {ix +2} (Similarity Score: {bert_similarities[ix]})')

Rank matches with user1 using Bert embeddings:
Rank 1: User 5 (Similarity Score: 0.8226811289787292)
Rank 2: User 4 (Similarity Score: 0.7725504040718079)
Rank 3: User 3 (Similarity Score: 0.7715209722518921)
Rank 4: User 2 (Similarity Score: 0.7578029632568359)


**Ranking output for User 1 using Word2Vec embeddings**

In [29]:
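print(f'Rank matches with user1 using word2vec embeddings:')
for i, ix in enumerate(sorted_indices_word2vec, 1):
  # Index word2vec_similarities here; indexing bert_similarities prints BERT scores.
  print(f'Rank {i}: User {ix + 2} (Similarity Score: {word2vec_similarities[ix]})')

Rank matches with user1 using word2vec embeddings:
Rank 1: User 5 (Similarity Score: 0.8226811289787292)
Rank 2: User 3 (Similarity Score: 0.7715209722518921)
Rank 3: User 2 (Similarity Score: 0.7578029632568359)
Rank 4: User 4 (Similarity Score: 0.7725504040718079)

Note: this output was recorded while the loop indexed 'bert_similarities', so the user order is the Word2Vec ranking but the scores shown are the BERT scores (which is why Rank 4 shows a higher score than Rank 3). Rerunning the cell as written above prints the Word2Vec scores.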


## Reflections and Further Analysis

### How could ranked matching be improved?
Ranked matching could be improved by considering additional factors beyond textual similarity. Some improvements include:
- **Contextual Understanding:** Incorporating techniques to understand the context of statements better, such as analyzing sentiment, topic modeling, or capturing the nuances of language.
- **Alternative Algorithms:** While BERT and Word2Vec embeddings are powerful, simpler representations such as TF-IDF or bag-of-words vectors compared with cosine similarity could provide a useful baseline.
- **Feedback Mechanism:** Implementing a feedback mechanism where users can provide ratings or feedback on matches, which can be used to refine the matching algorithm over time.
- **Enhanced Data Processing:** Applying basic preprocessing techniques such as stop-word removal, stemming, or lemmatization can improve the quality of text-based data, which can lead to more accurate similarity calculations and better-ranked matches.

### Why did you choose the method you used to complete the analysis?
I chose to use BERT and Word2Vec embeddings for computing similarity scores because of their effectiveness in capturing semantic relationships between text. BERT embeddings, being contextualized, provide a good representation of the meaning of sentences, while Word2Vec embeddings capture semantic similarity based on word co-occurrences. By utilizing both methods, I can leverage their respective strengths to enhance the accuracy of the matching process.

### Other than your chosen method, what other methods would you pursue?
In addition to BERT and Word2Vec embeddings, other methods that could be explored for improving the matching process include:
- **TF-IDF:** Using TF-IDF (Term Frequency-Inverse Document Frequency) to represent the importance of words in documents and compute similarity scores based on weighted word frequencies.
- **Doc2Vec:** Using Doc2Vec to generate document-level embeddings, which can capture the overall semantic meaning of a document.
- **Deep Learning Models:** Exploring more advanced deep learning models that are designed for similarity matching tasks, such as transformer-based models tailored for similarity computation.

Each of these methods has its advantages and can contribute to improving the accuracy and effectiveness of the ranked matching process.
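
To make the TF-IDF option concrete, here is a minimal sketch of TF-IDF vectors compared with cosine similarity, using only the standard library. It is an illustration under stated assumptions, not part of the pipeline above: the helper names, the regex tokenizer, the smoothed IDF formula `log((1 + N) / (1 + df)) + 1`, and the three short example documents are all choices made for this example.

```python
import math
import re
from collections import Counter

def tfidf_vectors(documents):
    """Build one TF-IDF vector per document over a shared, sorted vocabulary."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF keeps a nonzero weight for words found in every document.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[w] * idf[w] for w in vocabulary])
    return vectors

def cosine(vec1, vec2):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Made-up example documents; the first is the query.
docs = [
    "fashion trends and beauty",
    "fashion shows and styles",
    "quantum physics lecture",
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

Because TF-IDF only rewards shared words, documents with no words in common score exactly 0, which is the main limitation that embedding-based methods address.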
