<a href="https://colab.research.google.com/github/elhamod/BA820/blob/main/Hands-on/04-text-mining/Text_Analysis_Advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Course: BA820 - Unsupervised and Unstructured ML**

**Notebook created by: Mohannad Elhamod**

## 1. Intuition Behind Word2Vec

To understand how Word2Vec works, we will build a toy model by training it on a small number of sentences.

This is not common practice. In general, we use a *pre-trained* model that was fitted to millions of sentences; such models produce much higher-quality embeddings.
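
To make the idea of "context" concrete, here is a minimal sketch of how (center, context) word pairs can be formed from a sentence with a sliding window. It is a simplified illustration, not gensim's actual training loop: the function name `context_pairs` and the example sentence are hypothetical, and gensim's default Word2Vec mode (CBOW) uses the same kind of window but predicts the center word from its context.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# A short, already-tokenized example sentence (hypothetical).
sentence = ["love", "drink", "root", "beer"]
pairs = context_pairs(sentence, window=1)
print(pairs)
```

Words that keep showing up in each other's windows end up with similar vectors, which is why the window size matters even in a toy model.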




In [None]:
!pip install gensim



In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # Get the set of stop words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Here is some code that cleans up text: it tokenizes, lower-cases, removes punctuation and stop words, and stems.

In [None]:
import string
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def cleanup_text(sentence):
  # First, word tokenize.
  tokenized_sms_messages = word_tokenize(sentence)

  # Lower case
  tokenized_sms_messages = [word.lower() for word in tokenized_sms_messages]

  # Remove punctuation
  tokenized_sms_messages = [word for word in tokenized_sms_messages if word not in string.punctuation]

  # Remove stop words
  stop_words = set(stopwords.words('english'))
  tokenized_sms_messages = [word for word in tokenized_sms_messages if word not in stop_words]

  # Stem
  tokenized_sms_messages = [ps.stem(word) for word in tokenized_sms_messages]

  return tokenized_sms_messages
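
One detail worth knowing about the punctuation step: `word not in string.punctuation` is a substring test against one long string, so multi-character tokens such as `'..'` or `'...'` are not removed. The sketch below contrasts that test with a character-wise check; the helper name `is_punctuation` and the sample tokens are assumptions made for illustration only.

```python
import string

def is_punctuation(token):
    """True when every character of the token is a punctuation mark."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

tokens = ["go", "crazi", "..", "...", ",", "wat"]

# Substring test, as in `word not in string.punctuation`: multi-character runs survive.
kept_substring = [t for t in tokens if t not in string.punctuation]

# Character-wise test: tokens made up only of punctuation are dropped.
kept_charwise = [t for t in tokens if not is_punctuation(t)]

print(kept_substring)
print(kept_charwise)
```

Whether to drop such tokens is a design choice; for averaging embeddings they usually add noise rather than signal.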

In [None]:
corpus = [
    'I love sleeping in my bed',
    'He hates eating at McDonalds every night',
    'I love drinking root beer',
    'He hates studying physics textbooks',
    'I love traveling to Europe every summer',
    'He hates swimming in the big pool',
]

# Tokenize first.
tokenized_corpus = [cleanup_text(sentence) for sentence in corpus]
tokenized_corpus

[['love', 'sleep', 'bed'],
 ['hate', 'eat', 'mcdonald', 'everi', 'night'],
 ['love', 'drink', 'root', 'beer'],
 ['hate', 'studi', 'physic', 'textbook'],
 ['love', 'travel', 'europ', 'everi', 'summer'],
 ['hate', 'swim', 'big', 'pool']]

In [None]:
from gensim.models import Word2Vec
import numpy as np

# We construct and train our own Word2Vec.
model_word2vec = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=3, min_count=1, epochs=10000, workers=4, negative=10)

In [None]:
print("All words captured by the model:", model_word2vec.wv.key_to_index)

word = 'love'
print("The embedding of", word, "is", model_word2vec.wv[word])

# Get the embedding for each word captured by the model.
words = model_word2vec.wv.key_to_index
embeddings = np.array([model_word2vec.wv[word] for word in words])

All words captured by the model: {'love': 0, 'hate': 1, 'everi': 2, 'pool': 3, 'summer': 4, 'swim': 5, 'big': 6, 'europ': 7, 'studi': 8, 'travel': 9, 'textbook': 10, 'physic': 11, 'drink': 12, 'root': 13, 'beer': 14, 'night': 15, 'eat': 16, 'mcdonald': 17, 'bed': 18, 'sleep': 19}
The embedding of love is [-0.02979797  0.11968966  0.02276642  0.19615366 -0.08948124 -0.2029546
  0.22655815  0.39937282 -0.14610253 -0.27000517  0.06303661 -0.15314962
 -0.35671517  0.09393874 -0.18246385  0.05125967  0.3745222   0.1459845
 -0.06354126 -0.4968024   0.36825237  0.12386394  0.17623046  0.00381761
  0.09248206  0.09840233 -0.20145446  0.4410097  -0.30198422  0.06540284
 -0.20277771 -0.00791593  0.27523726 -0.5698595  -0.10206012 -0.3123315
  0.2009343   0.08411122  0.02519415  0.10341717 -0.03776871  0.05130414
 -0.30848256  0.26447415  0.18077211 -0.12226328 -0.29050517  0.1619084
  0.46067485  0.25174743 -0.18457901  0.25135154  0.11144839 -0.29271427
  0.19707069 -0.13519196  0.05093146 -0.1

In [None]:
embeddings.shape

(20, 100)

Twenty words have twenty embeddings. Each word has an n-dimensional embedding, where n is `vector_size` (100 here).

Now, let's draw a 3D PCA plot to see these embeddings.



In [None]:
from sklearn.decomposition import PCA
import plotly.express as px
import pandas as pd

def plot_scatter_3d(model, embeddings):
  dim_red = PCA(n_components=3, random_state=42)

  embeddings_for_visualization = dim_red.fit_transform(embeddings)

  # Convert the reduced embeddings and words into a DataFrame
  df = pd.DataFrame(embeddings_for_visualization, columns=['x', 'y', 'z'])
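  # Labels always come from the toy model's vocabulary (the `model` argument is not used),
  # so `embeddings` must be ordered like model_word2vec.wv.index_to_key.
  df['word'] = list(model_word2vec.wv.index_to_key)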

  # Create a scatter plot using Plotly
  fig = px.scatter_3d(df, x='x', y='y', z='z', text='word', title='Word Embeddings Visualization')
  fig.show()
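
For intuition about what the PCA step does, here is a minimal NumPy sketch that projects a matrix onto its top principal directions using a centered SVD. It is not scikit-learn's implementation (which also fixes the sign of each component), and the random matrix is only a stand-in with the same shape as the embedding matrix.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto the top principal directions (centered SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 100))  # hypothetical stand-in for the (20, 100) matrix

projected = pca_project(toy_embeddings, n_components=3)
print(projected.shape)
```

Each word keeps its row position, so the i-th row of the projection still belongs to the i-th word of the vocabulary.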

In [None]:
model_word2vec.wv.index_to_key

['love',
 'hate',
 'everi',
 'pool',
 'summer',
 'swim',
 'big',
 'europ',
 'studi',
 'travel',
 'textbook',
 'physic',
 'drink',
 'root',
 'beer',
 'night',
 'eat',
 'mcdonald',
 'bed',
 'sleep']

In [None]:
plot_scatter_3d(model_word2vec, embeddings)

Let's see how these words map onto a pre-trained embedding model (GloVe or Word2Vec).

In [None]:
import gensim.downloader as api

# Load the pretrained model
# pretrained_model = api.load('word2vec-google-news-300')
pretrained_model = api.load('glove-wiki-gigaword-200')
# pretrained_model = api.load('glove-twitter-200')


Check whether the model fails to recognize any of the words.

In [None]:
[word for word in words if word not in pretrained_model]

['everi']

Visualize

In [None]:
vector_size = pretrained_model.vector_size

embeddings = np.array([
    pretrained_model[word] if word in pretrained_model else np.zeros(vector_size)  # if the word is not recognized, replace it with a vector of zeros
    for word in words
])

plot_scatter_3d(pretrained_model, embeddings)

## 2. Application: Using Embeddings for Spam Detection

Now that we can represent words using the pre-trained embeddings, let's apply them to our spam detection problem.

In [None]:
url = "https://raw.githubusercontent.com/elhamod/BA820/main/Hands-on/04-text-mining/hamspam.csv"
df_sms = pd.read_csv(url, names = ['type', 'text'], index_col='type')

X = df_sms['text']
y = df_sms.index

df_sms

| type | text |
|---|---|
| ham | Go until jurong point, crazy.. Available only ... |
| ham | Ok lar... Joking wif u oni... |
| spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| ham | U dun say so early hor... U c already then say... |
| ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... |
| spam | This is the 2nd time we have tried 2 contact u... |
| ham | Will ü b going to esplanade fr home? |
| ham | Pity, * was in mood for that. So...any other s... |
| ham | The guy did some bitching but I acted like i'd... |


First, do some pre-processing.

In [None]:
message = df_sms['text'].iloc[0]  # positional access; avoids the pandas deprecation warning
message

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
print("a message:", cleanup_text(message)) # cleanup_text(message), message

# Note: `message` is a raw string here, so get_mean_vector iterates over its characters.
# Pass cleanup_text(message) instead to average over word tokens.
print("Embedding of the entire message:", pretrained_model.get_mean_vector(message))

a message: ['go', 'jurong', 'point', 'crazi', '..', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'got', 'amor', 'wat', '...']
Embedding of the entire message: [ 0.01325061  0.07966316 -0.03237719 -0.01227001  0.00209478 -0.03084263
 -0.03625903 -0.02255045 -0.06530633 -0.02646268 -0.04867887  0.01689149
  0.00997888  0.01493262  0.06039633 -0.02337557 -0.0519876   0.01215835
 -0.02956353 -0.04136657  0.01299553  0.24933894 -0.0466489   0.00504209
  0.07708564 -0.03993603 -0.03453137  0.01513985  0.01505892  0.06015772
 -0.0179209   0.04621959 -0.05437643 -0.05569353 -0.01449729 -0.02392943
 -0.04262347 -0.0417202  -0.01880446  0.0008186  -0.0480609  -0.00255029
 -0.00280102  0.03550826 -0.03902321 -0.01563011  0.07773343 -0.05666838
  0.00923055 -0.02747241  0.04565843  0.00053058  0.0221711   0.0415442
  0.03608014  0.02005434 -0.01788707  0.02194688  0.01501543 -0.03379197
  0.03034617 -0.0343202  -0.04240783 -0.00462082 -0.01671232 -0.08615782
 -0.02023

In [None]:
messages = df_sms['text']
tokenized_messages = [cleanup_text(message) for message in messages]

Now, to calculate sentence embeddings, let's average the word embeddings.

In [None]:
import numpy as np

vector_size = pretrained_model.vector_size  # Get the embedding size

vectorized_messages = [
    pretrained_model.get_mean_vector(sentence) if len(sentence) > 0 else np.zeros(vector_size) # if the message has no tokens left, use a zero vector
    for sentence in tokenized_messages
]
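
The averaging idea can be sketched without gensim. The lookup table below is a hypothetical 3-dimensional stand-in for the pre-trained model, and `mean_embedding` is an illustrative helper, not a gensim function. Note also that gensim's `get_mean_vector` by default scales each word vector to unit length before averaging (`pre_normalize=True`), which this sketch leaves out.

```python
import numpy as np

# Hypothetical 3-dimensional lookup table standing in for the pre-trained model.
toy_vectors = {
    "free":  np.array([1.0, 0.0, 0.0]),
    "entry": np.array([0.0, 1.0, 0.0]),
    "win":   np.array([1.0, 1.0, 0.0]),
}

def mean_embedding(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

with_unknown = mean_embedding(["free", "entry", "wkli"], toy_vectors, 3)  # 'wkli' is skipped
all_unknown = mean_embedding(["wkli"], toy_vectors, 3)
print(with_unknown)
print(all_unknown)
```

Stemmed tokens that are not real words (such as `'wkli'` here) simply drop out of the average, which is one reason heavy stemming can hurt when combined with pre-trained embeddings.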

Now that the embeddings are constructed, we can split the data into train/test sets and use supervised learning.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

def assess_model(df, embeddings):
  # train/test split
  X_train, X_test, y_train, y_test = train_test_split(embeddings, df.index, test_size=0.2, random_state=42)

  # train the model
  classifier = LogisticRegression()
  classifier.fit(X_train, y_train)

  # Predict on the test data
  y_pred = classifier.predict(X_test)

  # Evaluate the model
  accuracy = accuracy_score(y_test, y_pred)
  f1_score = sklearn.metrics.f1_score(y_test, y_pred, pos_label="spam")
  print(f"Accuracy: {accuracy}")
  print(f"f1_score: {f1_score}")
  print(sklearn.metrics.classification_report(y_test,y_pred))
  display(pd.DataFrame(confusion_matrix(y_test, y_pred, normalize='true'), columns=classifier.classes_, index=classifier.classes_ ))




In [None]:
assess_model(df_sms, vectorized_messages)

Accuracy: 0.9273542600896861
f1_score: 0.6639004149377593
              precision    recall  f1-score   support

         ham       0.93      0.99      0.96       966
        spam       0.87      0.54      0.66       149

    accuracy                           0.93      1115
   macro avg       0.90      0.76      0.81      1115
weighted avg       0.92      0.93      0.92      1115



| true \ predicted | ham | spam |
|---|---|---|
| ham | 0.987578 | 0.012422 |
| spam | 0.463087 | 0.536913 |
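
As a sanity check, the reported metrics can be recomputed from the row-normalized confusion matrix and the support column above. This is a small worked example: it only undoes the normalization and applies the standard precision, recall, and F1 formulas with spam as the positive class.

```python
# Values copied from the normalized confusion matrix and the support column above.
support = {"ham": 966, "spam": 149}
row_ham = (0.987578, 0.012422)   # true ham  -> predicted (ham, spam)
row_spam = (0.463087, 0.536913)  # true spam -> predicted (ham, spam)

# Undo the row normalization to recover integer counts.
tn, fp = (round(r * support["ham"]) for r in row_ham)
fn, tp = (round(r * support["spam"]) for r in row_spam)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The accuracy looks high mostly because ham dominates the test set; the spam recall shows that nearly half of the spam messages are missed.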


### 2.1 Misc Functions

Find words that are most similar to a word

In [None]:
word = 'astrology'

pretrained_model.similar_by_word(word) # , topn=5

[('numerology', 0.637926459312439),
 ('horary', 0.6117709875106812),
 ('palmistry', 0.5971240997314453),
 ('astrological', 0.582358717918396),
 ('divination', 0.5800657868385315),
 ('alchemy', 0.5477919578552246),
 ('astrologers', 0.526357114315033),
 ('ayurveda', 0.5182216763496399),
 ('jyotisha', 0.516189694404602),
 ('astronomy', 0.5156511664390564)]

Find word analogies

In [None]:
pretrained_model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.6978678107261658),
 ('princess', 0.6081745028495789),
 ('monarch', 0.5889754891395569),
 ('throne', 0.5775108933448792),
 ('prince', 0.5750998258590698),
 ('elizabeth', 0.5463595986366272),
 ('daughter', 0.5399126410484314),
 ('kingdom', 0.5318052768707275),
 ('mother', 0.5168544054031372),
 ('crown', 0.5164473056793213)]
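
The analogy query is vector arithmetic followed by a nearest-neighbor search. Here is a minimal sketch with hand-made 2-dimensional vectors (all values hypothetical, chosen so that one axis loosely encodes "royalty" and the other "gender"). It follows the common recipe of adding and subtracting unit-length vectors and excluding the query words from the candidates; treat it as an illustration rather than gensim's exact code.

```python
import numpy as np

# Hypothetical toy vectors: (royalty, gender).
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors):
    """Rank the remaining words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    scores = {w: float(unit(vectors[w]) @ target) for w in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(ranking)
```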

Find cosine similarity between two sentences

In [None]:
pretrained_model.n_similarity(word_tokenize('I like it'), word_tokenize('hate it'))

0.8272675
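
A high similarity between two sentences with opposite sentiment may look surprising. Assuming the sentence-level score is the cosine similarity between the averaged word vectors of each token list, the sketch below shows how a shared word pulls the averages together. The 2-dimensional vectors and the helper `set_similarity` are hypothetical.

```python
import numpy as np

# Hypothetical 2-d vectors: 'like' and 'hate' point in nearly opposite directions.
toy = {
    "like": np.array([1.0, 0.2]),
    "hate": np.array([-1.0, 0.2]),
    "it":   np.array([0.0, 1.0]),
}

def set_similarity(tokens_a, tokens_b, vectors):
    """Cosine similarity between the mean vectors of two token lists."""
    mean_a = np.mean([vectors[t] for t in tokens_a], axis=0)
    mean_b = np.mean([vectors[t] for t in tokens_b], axis=0)
    return float(mean_a @ mean_b / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b)))

word_level = set_similarity(["like"], ["hate"], toy)
sentence_level = set_similarity(["like", "it"], ["hate", "it"], toy)
print(word_level, sentence_level)
```

Averaging discards word order and dilutes the one word that carries the sentiment, which is a known limitation of mean-pooled static embeddings.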

**Questions:**

- Would dimensionality reduction help improve the results?
- Would you be able to use clustering to find different types of messages? Do the clusters align with the ham/spam split?
- Visualize the dataset using non-linear methods.

## 3. Using Deep Learning Embeddings

We just saw how embeddings like Word2Vec can help us represent text as vectors to perform downstream tasks, such as classification.

Let's now try more advanced deep learning models that produce more sophisticated embeddings.

We will use `DistilBERT` through [`huggingface`](https://huggingface.co/). `huggingface` is a widely used platform for datasets and deep learning models, especially Transformers.



In [None]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.4.1-py3-none-any.whl (275 kB)
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 3.3.1
    Uninstalling sentence-transformers-3.3.1:
      Successfully uninstalled sentence-transformers-3.3.1
Successfully installed sentence-transformers-3.4.1


In [None]:
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('sentence-transformers/distilbert-base-nli-mean-tokens')

# Pass a plain list of strings; passing the pandas Series triggers a deprecation warning.
embeddings = st_model.encode(messages.tolist())



In [None]:
assess_model(df_sms, embeddings)

Accuracy: 0.9874439461883409
f1_score: 0.9523809523809523
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.97      0.94      0.95       149

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115



| true \ predicted | ham | spam |
|---|---|---|
| ham | 0.994824 | 0.005176 |
| spam | 0.060403 | 0.939597 |
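
The model name used in this section ends in `mean-tokens`, which suggests that the sentence embedding is the average of the Transformer's token embeddings, with padding positions ignored. Below is a minimal NumPy sketch of that masked mean pooling; the arrays are hypothetical and the function `mean_pool` is illustrative, not part of `sentence-transformers`.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions (mask == 0)."""
    mask = attention_mask[..., None].astype(float)   # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# Hypothetical batch: 2 sentences, up to 3 tokens each, 2-dimensional token vectors.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings)
```

Unlike the static word vectors used earlier, the token vectors being averaged here depend on the surrounding words, which is a plausible reason for the large jump in spam recall.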


## 4. Using a Pretrained Model

Instead of extracting embeddings and then training logistic regression, how about we use a pre-trained deep learning model (a Transformer)?

Searching `huggingface` for a suitable ham/spam model turns up candidates such as [Bert_Spam_ham](https://huggingface.co/saadkiet/Fine_Tuned_bert_Spam_ham). The cell below loads one such model, `udit-k/HamSpamBERT`.



In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="udit-k/HamSpamBERT")

Device set to use cuda:0


Let's try it on one sentence.

In [None]:
pipe(df_sms["text"].iloc[0])

[{'label': 'LABEL_0', 'score': 0.9999368190765381}]

Notice that while the model computes the embeddings internally, it gives us the final classification directly, along with a score indicating its confidence. So, we do not need to train a separate classifier.
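
For intuition about where the label and score come from: a classification head typically outputs one raw number (logit) per class, and a softmax turns those into probabilities. The sketch below uses made-up logits and assumes that `LABEL_0` corresponds to ham and `LABEL_1` to spam; both the values and the mapping are assumptions, not taken from the model's configuration.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

id2label = {0: "ham", 1: "spam"}   # assumed mapping for LABEL_0 / LABEL_1
logits = np.array([4.8, -4.9])     # hypothetical raw outputs for one message

probs = softmax(logits)
best = int(np.argmax(probs))
result = {"label": id2label[best], "score": float(probs[best])}
print(result)
```

The score is the probability of the winning class only, so a value close to 1 means the model puts almost no probability on the other class.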

In [None]:
def assess_model_bert(df, model):
  # train/test split
  X_train, X_test, y_train, y_test = train_test_split(df["text"], df.index, test_size=0.2, random_state=42)

  # Predict on the test data
  y_pred = model(X_test.to_list())
  y_pred = [int(x["label"][-1]) for x in y_pred]
  y_pred = ["ham" if x == 0 else "spam" for x in y_pred]

  # Evaluate the model
  accuracy2 = accuracy_score(y_test, y_pred)
  f1_score = sklearn.metrics.f1_score(y_test, y_pred, pos_label="spam")
  print(f"Accuracy: {accuracy2}")
  print(f"f1_score: {f1_score}")
  print(sklearn.metrics.classification_report(y_test,y_pred))
  display(pd.DataFrame(confusion_matrix(y_test, y_pred, normalize='true'), columns=["ham", "spam"], index=["ham", "spam"]))




In [None]:
assess_model_bert(df_sms, pipe)

Accuracy: 0.9991031390134529
f1_score: 0.9966329966329966
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00       966
        spam       1.00      0.99      1.00       149

    accuracy                           1.00      1115
   macro avg       1.00      1.00      1.00      1115
weighted avg       1.00      1.00      1.00      1115



| true \ predicted | ham | spam |
|---|---|---|
| ham | 1.0 | 0.0 |
| spam | 0.006711 | 0.993289 |
