# Lab 8 - Text Analytics CISB5123

## Text Clustering

###### Name: Abdul Hakiim bin Ahmad Rosli - SW01081337

Text clustering groups similar documents together based on their content, allowing you
to discover patterns, trends, and insights within large collections of text data.
Any text clustering approach involves broadly the following steps:
- Text pre-processing: Text can be noisy, hiding information between stop words, inflexions and sparse representations. Pre-processing makes the dataset easier to work with.
- Feature Extraction: One of the commonly used techniques to extract the features from textual data is calculating the frequency of words/tokens in the document/corpus.
- Clustering: We can then cluster different text documents based on the features we have generated.

### Text Clustering Using TF-IDF Vectorizer

In [14]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from tabulate import tabulate
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [15]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [16]:
# Create the Document
dataset = ["I love playing football on the weekends","I enjoy hiking and camping in the mountains","I like to read books and watch movies","I prefer playing video games over sports","I love listening to music and going to concerts"]

In [17]:
# Preprocess the documents
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text.lower())
    # Remove stopwords and lemmatize the tokens
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return ' '.join(cleaned_tokens)

preprocessed_dataset = [preprocess_text(doc) for doc in dataset]

In [18]:
# Vectorize the dataset
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_dataset)

In [19]:
# Perform clustering
k = 2 # Define the number of clusters
km = KMeans(n_clusters=k)
km.fit(X)

# Predict the clusters for each document
y_pred = km.predict(X)

# Display the document and its predicted cluster in a table
table_data = [["Document", "Predicted Cluster"]]
table_data.extend([[doc, cluster] for doc, cluster in zip(dataset, y_pred)])
print(tabulate(table_data, headers="firstrow"))

# Print top terms per cluster
print("\nTop terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

Document                                           Predicted Cluster
-----------------------------------------------  -------------------
I love playing football on the weekends                            1
I enjoy hiking and camping in the mountains                        0
I like to read books and watch movies                              1
I prefer playing video games over sports                           1
I love listening to music and going to concerts                    1

Top terms per cluster:
Cluster 0:
 camping
 enjoy
 hiking
 mountain
 weekend
 listening
 concert
 football
 game
 going

Cluster 1:
 love
 playing
 football
 weekend
 going
 sport
 music
 concert
 video
 game



  super()._check_params_vs_input(X, default_n_init=10)


In [20]:
# Calculate purity
total_samples = len(y_pred)
cluster_label_counts = [Counter(y_pred)]
purity = sum(max(cluster.values()) for cluster in cluster_label_counts) / total_samples
print("Purity:", purity)

Purity: 0.8


### Text Clustering using Word2Vec Vectorizer

In [21]:
import numpy as np
from sklearn.cluster import KMeans
from gensim.models import Word2Vec
from tabulate import tabulate
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [22]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/abdulhakiim/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [33]:
# Create the documents
dataset = ["I love playing football on the weekends","I enjoy hiking and camping in the mountains","I like to read books and watch movies","I prefer playing video games over sports","I love listening to music and going to concerts"]

In [34]:
# Preprocess the documents
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text.lower())
    # Remove stopwords and lemmatize the tokens
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return cleaned_tokens

preprocessed_dataset = [preprocess_text(doc) for doc in dataset]

In [36]:
# Train Word2Vec model
word2vec_model = Word2Vec(sentences=preprocessed_dataset, vector_size=100, window=5, min_count=1, workers=4)

In [37]:
# Create document embeddings
X = np.array([np.mean([word2vec_model.wv[word] for word in doc if word in word2vec_model.wv], axis=0) for doc in preprocessed_dataset])

In [38]:
k = 2 # Define the number of clusters
km = KMeans(n_clusters=k)
km.fit(X)

# Predict the clusters for each document
y_pred = km.predict(X)

# Tabulate the document and predicted cluster
table_data = [["Document", "Predicted Cluster"]]
table_data.extend([[doc, cluster] for doc, cluster in zip(dataset, y_pred)])
print(tabulate(table_data, headers="firstrow"))

Document                                           Predicted Cluster
-----------------------------------------------  -------------------
I love playing football on the weekends                            1
I enjoy hiking and camping in the mountains                        0
I like to read books and watch movies                              0
I prefer playing video games over sports                           1
I love listening to music and going to concerts                    0


  super()._check_params_vs_input(X, default_n_init=10)


In [39]:
# Calculate purity
total_samples = len(y_pred)
cluster_label_counts = [Counter(y_pred)]
purity = sum(max(cluster.values()) for cluster in cluster_label_counts) / total_samples
print("Purity:", purity)

Purity: 0.6


### Exercise

#### 1. Modify the codes for both TF-IDF & Word2Vec vectorizer by adding text preprocessing steps. Do the Purity differ when applying text preprocessing before vectorization?

**Answer:** 
It appears that applying text preprocessing before vectorization has different effects on the clustering results for TF-IDF and Word2Vec vectorizers.

TF-IDF Vectorizer:
- Before applying text preprocessing: Purity = 0.6
- After applying text preprocessing: Purity = 0.8

In the case of the TF-IDF vectorizer, the purity value increased from 0.6 to 0.8 after applying text preprocessing. This suggests that text preprocessing had a positive impact on the clustering results when using TF-IDF.
Text preprocessing steps such as tokenization, lowercasing, removing stopwords, and lemmatization help in reducing noise and normalizing the text data. By removing irrelevant words (stopwords) and transforming words to their base or dictionary form (lemmatization), the preprocessed text focuses more on the meaningful content.
The improvement in purity indicates that after preprocessing, the clusters formed by the TF-IDF vectorizer are more homogeneous and contain documents that are more similar to each other within each cluster. The preprocessing steps likely helped in better capturing the important features and improving the clustering quality.

Word2Vec Vectorizer:
- Before applying text preprocessing: Purity = 0.6
- After applying text preprocessing: Purity = 0.6

In the case of the Word2Vec vectorizer, the purity value remained the same at 0.6 before and after applying text preprocessing. This suggests that text preprocessing did not have a significant impact on the clustering results when using Word2Vec.

Word2Vec is a neural network-based model that learns dense vector representations of words, capturing semantic relationships between them. The model takes into account the context of words and learns to represent words with similar meanings closer together in the vector space.

The fact that the purity value did not change with preprocessing in the Word2Vec case could be due to a few reasons:
1. Word2Vec is already capable of handling some level of noise and variations in the text data, as it learns from the context of words.
2. The preprocessing steps applied may not have significantly altered the semantic relationships captured by Word2Vec.
3. The specific preprocessing techniques used (e.g., removing stopwords, lemmatization) may not have had a substantial impact on the clustering results in this particular case.

It's important to note that the impact of text preprocessing on clustering results can vary depending on the dataset, the specific preprocessing techniques used, and the characteristics of the vectorizer and clustering algorithm.

In summary, based on the provided purity values, applying text preprocessing before vectorization had a positive impact on the clustering results for the TF-IDF vectorizer, increasing the purity from 0.6 to 0.8. However, for the Word2Vec vectorizer, text preprocessing did not lead to a change in purity, suggesting that the preprocessing steps did not significantly affect the clustering quality in that case.

#### 2. Perform text clustering on 'customer_complaints_1.csv' dataset, specifically the Text column.

In [40]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [46]:
df = pd.read_csv('customer_complaints_1.csv')

In [47]:
# Remove any leading/trailing whitespace
df['text'] = df['text'].str.strip()

# Convert text to lowercase
df['text'] = df['text'].str.lower()

# Remove any non-alphanumeric characters
df['text'] = df['text'].str.replace(r'[^a-zA-Z0-9\s]', '')

In [50]:
# Create a TF-IDF vectorizer and transform the text data
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])

In [51]:
# Perform K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

  super()._check_params_vs_input(X, default_n_init=10)


In [52]:
# Add the cluster labels to the DataFrame
df['Cluster'] = kmeans.labels_

In [54]:
print("Top terms per cluster:")
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(3):
    print(f"Cluster {i}:")
    for ind in order_centroids[i, :10]:
        print(f" - {terms[ind]}")
    print()

Top terms per cluster:
Cluster 0:
 - boxes
 - second
 - adding
 - malfunction
 - investigating
 - protocol
 - floor
 - customer
 - possible
 - account

Cluster 1:
 - internet
 - contract
 - comcast
 - service
 - xfinity
 - speed
 - mbps
 - customer
 - months
 - told

Cluster 2:
 - rude
 - service
 - day
 - rep
 - joke
 - local
 - people
 - ignorant
 - tom
 - helpful



In [56]:
for i in range(3):
    print(f"Cluster {i}:")
    cluster_complaints = df[df['Cluster'] == i]['text'].values
    for complaint in cluster_complaints[:5]:
        print(f" - {complaint}")
    print()

Cluster 0:
 - i've had the worst experiences so far since install on 10/4/16. nothing but problems. two no shows on scheduled service appointments, extreme difficulty in adding boxes to the second floor. what is so difficult about adding boxes to an existing account? no thank you, i'm not starting a second account for the second floor of the same house! a separate bundle package? all i wanted was just to add a few boxes. apparently this is not possible. well then, i guess it's not possible to remain a customer!
 - there is a malfunction on the dvr manager which is preventing us from adding more recordings. customer service is fairly certain that the problem is from the signal from their system to ours, but protocol demands that they access our home before investigating that option. since we work, that cannot be done until next saturday. customer service tech agreed that this seems illogical since logic would dictate that one would investigate the most probably malfunction first, but in