# Use of dimensionality reduction for clustering and automatic keywords extraction on document collections 
**Author**:  <br>
**Id**:  <br>
**Description**: Use of dimensionality reduction for clustering and automatic keywords extraction on document collections <br>
**Goal**: the general goal is, given one document, to efficiently identify the documents that are closest to the document under consideration, in terms of cosine similarity computed on the tf-idf representations of the documents. <br>
**Dataset**: as a test case, we will use the relatively large
Amazon Books Reviews dataset, available at https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews.
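
The goal stated above can be illustrated on a toy scale. The following is a minimal sketch, assuming each document is already represented as a sparse vector stored in a `{term: weight}` dictionary; the function names `cosine_similarity` and `closest_documents` and the toy weights are hypothetical and only serve the illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def closest_documents(query_index, vectors, top_n=2):
    """Indices of the top_n documents most similar to vectors[query_index]."""
    scores = [
        (cosine_similarity(vectors[query_index], vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != query_index
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scores[:top_n]]

# Hypothetical tf-idf weights for three tiny "reviews"
toy_vectors = [
    {"great": 0.8, "story": 0.6},
    {"great": 0.6, "plot": 0.8},
    {"boring": 1.0},
]
print(closest_documents(0, toy_vectors, top_n=1))  # → [1]
```

This brute-force scan compares the query against every document, which is what makes the full dataset expensive and motivates the dimensionality reduction and clustering steps.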

## Set-up instructions
To use Kaggle's public API, you must first authenticate with an API token.<br>
If you do not already have the token file (kaggle.json) on your local drive, follow these steps:

1.   From the site header in Kaggle, click on your user profile picture.
2.   Click on “My Account” in the dropdown menu.
3.   This will take you to your account settings at https://www.kaggle.com/account.
4.   Scroll down to the section of the page labelled API and click on the “Create New API Token” button to create a new token.
5.   This will download a fresh authentication token onto your machine as a file named "kaggle.json".
6.   After this, simply run the next cell and upload the "kaggle.json" file when requested.
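
Before uploading, the downloaded file can be sanity-checked locally. This is a minimal sketch, assuming the token is a JSON object with "username" and "key" fields (the usual layout of kaggle.json); the helper name `is_valid_kaggle_token` is hypothetical.

```python
import json
import os

def is_valid_kaggle_token(path="kaggle.json"):
    """Return True if the file exists and holds non-empty 'username' and 'key' fields."""
    if not os.path.isfile(path):
        return False
    try:
        with open(path, encoding="utf-8") as handle:
            token = json.load(handle)
    except (OSError, json.JSONDecodeError):
        return False
    return (
        isinstance(token, dict)
        and bool(token.get("username"))
        and bool(token.get("key"))
    )

print(is_valid_kaggle_token("kaggle.json"))
```

The check only looks at the file layout; whether the key is still accepted by Kaggle can only be verified by calling the API.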

In [1]:
from google.colab import files
import os.path

# Ask for the token only if it is not already in the working directory
if not os.path.isfile("kaggle.json"):
    uploaded = files.upload()
    print("[!] Kaggle configuration file uploaded")
else:
    print("[!] Kaggle configuration file already present")

Saving kaggle.json to kaggle.json
[!] Kaggle configuration file uploaded


Download the dataset from Kaggle, unzip it and remove the zipped file.

In [2]:
!pip install --upgrade --force-reinstall --no-deps kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d mohamedbakhet/amazon-books-reviews

!unzip /content/amazon-books-reviews.zip
!rm /content/amazon-books-reviews.zip

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     |████████████████████████████████| 58 kB 7.0 MB/s 
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73052 sha256=816b18aa2f0b9e616cfd6f2ca27a0e767c195d8d645f994936a5046b3367ef47
  Stored in directory: /root/.cache/pip/wheels/29/da/11/144cc25aebdaeb4931b231e25fd34b394e6a5725cbb2f50106
Successfully built kaggle
Installing collected p

## Preprocessing
Implement all necessary text preprocessing required to transform
each review into a tf-idf vector.

In [3]:
import pandas as pd
data = pd.read_csv("/content/Books_rating.csv")
data.head(5)

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text
0,1882931173,Its Only Art If Its Well Hung!,,AVCGYZL8FQQTD,"Jim of Oz ""jim-of-oz""",7/7,4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,,A30TK6U7DNS82R,Kevin Killian,10/10,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,,A3UH4UZ4RSVO82,John Granger,10/11,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,826414346,Dr. Seuss: American Icon,,A2MVUWT453QH61,"Roy E. Perry ""amateur philosopher""",7/7,4.0,1090713600,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,826414346,Dr. Seuss: American Icon,,A22X4XUPKF66MR,"D. H. Richards ""ninthwavestore""",3/3,4.0,1107993600,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


Clean the dataframe by replacing infinite values with NaN and dropping every row that contains a missing value, since both can cause problems in numerical methods. <br>
Then print summary statistics for the numeric columns.

In [4]:
import numpy as np
df = data.replace([np.inf, -np.inf], np.nan) 
df = df.dropna()
df.describe()

Unnamed: 0,Price,review/score,review/time
count,414548.0,414548.0,414548.0
mean,21.814618,4.240382,1174725000.0
std,26.277351,1.187782,123984800.0
min,1.0,1.0,848275200.0
25%,10.85,4.0,1086566000.0
50%,14.95,5.0,1170288000.0
75%,24.0,5.0,1283990000.0
max,995.0,5.0,1362355000.0


Then, to practise a more realistic preprocessing step, keep only the reviews with a score greater than or equal to 0, put their text into a list, and print its size. Since the minimum score in the statistics above is 1, this filter keeps every review.

In [9]:
pos = df[df['review/score']>= 0]['review/text'].tolist()
print(len(pos))


414548


Convert all the text to lower case.

In [10]:
pos = [doc.lower() for doc in pos]

Expand contractions (for example, "don't" becomes "do not").

In [11]:
import re
contractions_dict = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "I would",
"i'd've": "I would have",
"i'll": "I will",
"i'll've": "I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there had",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'alls": "you alls",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

# Regular expression for finding contractions, compiled once from the dictionary keys.
# Longer keys come first, so that "can't've" is matched before "can't".
contractions_re = re.compile(
    "(%s)" % "|".join(map(re.escape, sorted(contractions_dict, key=len, reverse=True))))

def multiple_replace(mapping, text, regex=contractions_re):
  # For each match, look up the corresponding value in the dictionary
  return regex.sub(lambda mo: mapping[mo.group(0)], text)

# Expanding contractions
pos = [multiple_replace(contractions_dict, doc) for doc in pos]
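
One subtlety of building a regular expression from dictionary keys is that alternation tries the alternatives from left to right, so a short key can shadow a longer one that starts with it. The sketch below shows the effect on a hypothetical three-entry dictionary; the names `toy_contractions`, `build_pattern` and `expand` are illustrative only.

```python
import re

# Hypothetical three-entry subset of a contractions dictionary
toy_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "it's": "it is",
}

def build_pattern(mapping, longest_first):
    # Optionally sort the keys so that longer contractions are tried first
    keys = sorted(mapping, key=len, reverse=True) if longest_first else list(mapping)
    return re.compile("|".join(map(re.escape, keys)))

def expand(text, mapping, pattern):
    # Replace each matched contraction with its expansion
    return pattern.sub(lambda match: mapping[match.group(0)], text)

text = "i can't've missed it"
naive = expand(text, toy_contractions, build_pattern(toy_contractions, longest_first=False))
ordered = expand(text, toy_contractions, build_pattern(toy_contractions, longest_first=True))
print(naive)    # → i cannot've missed it
print(ordered)  # → i cannot have missed it
```

In the naive ordering the shorter key "can't" wins and leaves a stray "'ve" behind; sorting the keys by decreasing length avoids this.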


Remove punctuation and numbers.

In [12]:
# Removing punctuation
import string
table = str.maketrans('', '', string.punctuation)
pos = [doc.translate(table) for doc in pos]

# Removing numbers
pos = [re.sub(r'\d+', " ", doc) for doc in pos]
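
To see what this step does to a single review, here is a small sketch that applies the same two operations to one string. The function name `strip_punctuation_and_numbers` and the sample sentence are hypothetical, and the final whitespace collapse is an extra step that the cell above does not perform.

```python
import re
import string

def strip_punctuation_and_numbers(doc):
    # Delete every ASCII punctuation character listed in string.punctuation
    table = str.maketrans('', '', string.punctuation)
    doc = doc.translate(table)
    # Replace each run of digits with a single space
    doc = re.sub(r'\d+', ' ', doc)
    # Collapse the whitespace left behind (extra step, not in the cell above)
    return ' '.join(doc.split())

print(strip_punctuation_and_numbers("read it in 2 days, loved it!!! (5/5)"))
# → read it in days loved it
print(strip_punctuation_and_numbers("well-written"))
# → wellwritten
```

Because punctuation is deleted rather than replaced by a space, hyphenated words are glued together ("well-written" becomes "wellwritten"). Typographic quotes and dashes are not part of string.punctuation, so they survive this step.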

Apply tokenization, lemmatization and removal of stop words.

In [13]:
import nltk # Natural Language Toolkit
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
from nltk.tokenize import word_tokenize #Used to extract words from documents
from nltk.stem import WordNetLemmatizer #Used to lemmatize words

# Tokenization and lemmatization
lemmatizer = WordNetLemmatizer()

def get_lemmatized(doc):
  # Tokenize the document and lemmatize each token (as a noun, the default)
  return " ".join(lemmatizer.lemmatize(word) for word in word_tokenize(doc))

pos = [get_lemmatized(doc) for doc in pos]

# Remove stop words
stop_words = set(nltk.corpus.stopwords.words('english')) # Set for fast membership tests

def rem_stop(doc):
  # Keep only the tokens that are not English stop words
  return " ".join(word for word in word_tokenize(doc) if word not in stop_words)

pos = [rem_stop(doc) for doc in pos]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Convert the list of reviews to a pandas Series and print its length.

In [14]:
df = pd.Series(pos).astype(str) # Final Series of preprocessed reviews
print(len(df))

414548


## Vectorization
Compute the tf-idf vector of each preprocessed document.
More on tf-idf vectors here: [tf-idf vectorization](https://towardsdatascience.com/text-vectorization-term-frequency-inverse-document-frequency-tfidf-5a3f9604da6d).<br>
The vectors are stored as the rows of a sparse matrix.
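
To make the weighting concrete, the sketch below computes tf-idf vectors by hand on a toy corpus. It assumes the smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, which are the defaults of scikit-learn's TfidfVectorizer; the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # raw term frequencies
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = [["great", "story"], ["great", "plot"], ["boring", "plot"]]
for vec in tfidf_vectors(docs):
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the first document, "story" receives a larger weight than "great" because it occurs in fewer documents. Since every vector has unit length, the cosine similarity of two documents reduces to their dot product.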

In [15]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class AmazonReviews:
  def __init__(self, reviews):
    self.reviews = reviews # pandas Series (or list) of preprocessed reviews
    self.vectorizer = TfidfVectorizer(
        analyzer='word',
        min_df=10,
        stop_words='english')

  def get_tfidf_vectors(self):
    # The reviews were already preprocessed above, so they are vectorized directly.
    # Returns a sparse matrix with one tf-idf row per review.
    return self.vectorizer.fit_transform(self.reviews)


reviews = AmazonReviews(df)
matrix = reviews.get_tfidf_vectors()

Each tf-idf vector has one entry for each word in the vocabulary of the tokenized and lemmatized documents. <br>
Let's print these words.

In [None]:
feature_names = reviews.vectorizer.get_feature_names_out()
print(feature_names)
print(matrix.shape)

Optional: uncomment and run the following cell to use CountVectorizer instead of tf-idf. The class is the same as the one above, except that it uses the CountVectorizer class instead of the TfidfVectorizer class, so the vectors contain raw term counts. To use it, pass the preprocessed reviews to the constructor, as done for the tf-idf version.

In [None]:
# import numpy as np
# from sklearn.feature_extraction.text import CountVectorizer

# class AmazonReviews:
#   def __init__(self, reviews):
#     self.reviews = reviews
#     self.vectorizer = CountVectorizer()

#   def get_tfidf_vectors(self):
#     # The method name is kept for compatibility with the cells above;
#     # it returns raw term-count vectors, not tf-idf vectors
#     return self.vectorizer.fit_transform(self.reviews)

## Clustering help classes
Before clustering, we need to choose the number of clusters k. The helper classes in this section support the two ways of choosing it described below.

To perform clustering, you should decide the value of k, the number of clusters. In this case, you do not know how many topics you are supposed to find, so in part (but only in part) you should proceed by trial-and-error. Principled ways to find a reasonable tentative value for k are the following: 

i) if you are using k-means, you can plot the inertia of the clustering you compute for increasing values of k and apply the elbow method seen in class. The inertia of a clustering is stored in the inertia_ attribute of the fitted sklearn.cluster.KMeans object; <br>

ii) if you are using SVD, you can use the explained_variance_ or explained_variance_ratio_ attribute of the fitted sklearn.decomposition.TruncatedSVD object to access the explained variance values of the k components (singular vectors) you kept. In this second case, you can keep the k largest components that account for 80-90% of the total variance. In normal datasets, this corresponds to a relatively small number of components. If you think this is still too large a number of components, you can set k to some predefined value (e.g., 100 to begin with) and then plot explained variance against the number n of components for n ranging from 1 to the predefined value of k you chose. If you already notice an elbow in this interval, you are done.

These are simple heuristics and it should be clear that it is up to you to proceed in a sensible and feasible way.
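
The 80-90% rule in point ii) amounts to a cumulative sum. This is a minimal sketch, assuming the explained-variance ratios are given largest first; the function name `components_for_variance` and the toy ratios are hypothetical.

```python
def components_for_variance(ratios, target=0.85):
    """Smallest number of leading components whose ratios sum to at least target."""
    cumulative = 0.0
    for count, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count
    return None  # the kept components never reach the target

# Hypothetical explained-variance ratios, largest first
toy_ratios = [0.5, 0.3, 0.1, 0.05, 0.05]
print(components_for_variance(toy_ratios, target=0.85))  # → 3
```

A return value of None signals that more components must be computed before the target can be reached. This is the case where falling back to a predefined k and looking for an elbow makes sense.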

Here is an example of how you could create a class method that plots the inertia of a sklearn.cluster.KMeans object for incremental values of k. This method can help you apply the "Elbow method" to choose the right number of clusters, k, for your data.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

class KMeansKFind:

    def __init__(self, max_k):
        self.max_k = max_k # maximum value of k to use in the elbow method

    def fit(self, data, k):
        # create a KMeans instance with the specified number of clusters
        kmeans = KMeans(n_clusters=k)
        # fit the model to the data and return it, so that inertia_ is accessible
        return kmeans.fit(data)

    def fit_predict(self, data, k):
        # fit the model and return the cluster labels for the data
        return self.fit(data, k).labels_

    def plot_inertia(self, data):
        ks = range(1, self.max_k + 1)
        inertias = []
        for k in ks:
            # fit the model with k clusters and append the inertia to the list
            inertias.append(self.fit(data, k).inertia_)

        # plot the inertia values against the number of clusters
        plt.plot(ks, inertias, '-o')
        plt.xlabel('Number of clusters (k)')
        plt.ylabel('Inertia')
        plt.show()
        return inertias

# This method will fit the clustering algorithm for different values of k and plot the resulting inertia values. 
# The elbow method can then be used to choose the optimal value of k by looking for the "elbow" in the plot,
# where the inertia begins to decrease more slowly.

# The KMeans class from scikit-learn is used to fit the model to the data and make predictions.
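
The elbow can also be located numerically instead of by eye. This is a minimal sketch, assuming the inertias are listed for consecutive values of k; the function name `elbow_k` and the toy values are hypothetical, and the largest-second-difference rule is only one common heuristic, not a method prescribed by this notebook.

```python
def elbow_k(inertias, first_k=1):
    """Pick the k with the largest second difference of the inertia curve."""
    if len(inertias) < 3:
        raise ValueError("need at least three inertia values")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        # Discrete curvature: how sharply the decrease slows down at position i
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = first_k + i, bend
    return best_k

# Hypothetical inertia values for k = 1..6
toy_inertias = [1000, 600, 250, 220, 200, 185]
print(elbow_k(toy_inertias))  # → 3
```

On noisy curves the heuristic can pick a spurious bend, so it is best used together with the plot.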

SVD: a second class uses the explained variance (or explained variance ratio) that sklearn.decomposition.TruncatedSVD reports for each kept singular vector (component). It keeps the k largest components that account for 80-90% of the total variance. If this number is too large, k is set to a predefined value (e.g. 100), the explained variance is plotted for 1 to k components, and the elbow method is applied to find the best k, which is the value the class returns.


The TruncatedSVD class in the sklearn.decomposition module can be used to perform Singular Value Decomposition (SVD) on a matrix. This class can be used to reduce the dimensionality of a matrix by keeping only the largest singular values and corresponding singular vectors. The explained_variance_ or explained_variance_ratio_ attribute of the TruncatedSVD object can be used to determine the amount of variance explained by each of the singular vectors.

A possible implementation proceeds as follows:

1.   Import the TruncatedSVD class from sklearn.decomposition.
2.   Define a class, SVD, that takes a matrix as input in its constructor.
3.   Define a method in the SVD class, explained_variance, that uses the TruncatedSVD class to perform SVD on the input matrix and returns the explained variance ratio of each singular vector.
4.   Define another method in the SVD class, elbow_method, that plots the explained variance ratio of each singular vector, applies the elbow method, and returns the optimal number of singular vectors to keep.

In [None]:
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
import numpy as np

class SVD:
    def __init__(self, matrix, n_components=100):
        self.matrix = matrix
        # Number of singular vectors to compute, capped by the number of features
        # (TruncatedSVD keeps only 2 components by default)
        self.n_components = min(n_components, np.shape(matrix)[1])
    
    def explained_variance(self):
        svd = TruncatedSVD(n_components=self.n_components)
        svd.fit(self.matrix)
        return svd.explained_variance_ratio_
    
    def elbow_method(self, min_drop=0.01):
        # Perform SVD on the matrix and compute the explained variance ratio for each singular vector
        explained_variance = self.explained_variance()
        
        # Plot the explained variance ratio for each singular vector
        plt.plot(explained_variance)
        plt.xlabel('Singular Vector Index')
        plt.ylabel('Explained Variance Ratio')
        plt.show()
        
        # Use the elbow method to find the optimal number of singular vectors to keep:
        # the elbow is the first component whose explained variance ratio is less than
        # min_drop below that of the previous component
        num_singular_vectors = len(explained_variance)
        for index in range(1, len(explained_variance)):
            if explained_variance[index - 1] - explained_variance[index] < min_drop:
                num_singular_vectors = index
                break
        
        # Return the optimal number of singular vectors to keep
        return num_singular_vectors

# Create an instance of the SVD class on a small example matrix
# (named toy_matrix so that the tf-idf matrix computed above is not overwritten)
toy_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
svd = SVD(toy_matrix)

# Apply the elbow method to find the optimal number of singular vectors to keep
k = svd.elbow_method()
print(k)


## Test Unit Creation
Create a class, named TestSample, that returns a random sample of the Amazon Books Reviews dataset; the size of the sample is set through a parameter, as a fraction of the dataset.
Here is one possible implementation of the TestSample class:

In [None]:
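import random

class TestSample:
    def __init__(self, percentage):
        self.percentage = percentage # fraction of the dataset to keep, between 0 and 1
    
    def get_sample(self, data):
        sample_size = int(len(data) * self.percentage)
        # pandas objects provide their own sample(); random.sample() needs a sequence
        if hasattr(data, "sample"):
            return data.sample(n=sample_size)
        return random.sample(list(data), sample_size)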


The TestSample class takes a fraction between 0 and 1 (for example, 0.1 for 10%) as a parameter in its constructor. The get_sample() method takes a dataset as input and returns a random sample whose size is that fraction of the original dataset.

Here is an example of how you can use the TestSample class:



In [None]:
# Initialize the TestSample class with a fraction of 0.1 (10% of the data)
test_sample = TestSample(0.1)

# Get a random sample of the dataset
sample = test_sample.get_sample(df)
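
When experiments on the sample need to be repeatable, the random draw can be tied to a seed. This is a minimal sketch using only the standard library; the function name `sample_fraction`, the `seed` parameter and the toy data are hypothetical and not part of the TestSample class.

```python
import random

def sample_fraction(items, fraction, seed=None):
    """Return a random sample holding the given fraction of items, without replacement."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    items = list(items)  # random.sample needs a sequence
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return rng.sample(items, int(len(items) * fraction))

toy_reviews = [f"review {i}" for i in range(50)]
subset = sample_fraction(toy_reviews, 0.1, seed=42)
print(len(subset))  # → 5
```

Calling the function twice with the same seed returns the same sample, which makes timing and quality comparisons between runs meaningful.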