<a href="https://colab.research.google.com/github/aggelospsiris/Book-ratings-guess-using-kmeans-and-neural-networks/blob/main/Embeddings_with_bert_and_whitening.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers

In [None]:
import transformers as ppb

**Dimension reduction by whitening BERT** (based on the article https://deep-ch.medium.com/dimension-reduction-by-whitening-bert-roberta-5e103093f782)

I have to feed the summaries and the titles into my neural network, so I first have to convert them into vectors. I use **BERT** for this because it is pre-trained on a very large amount of data, accounts for a word's context, and is open source. However, it produces 768-dimensional vectors, which is large enough to increase both the storage cost and the computation or retrieval time. So, following the paper “Whitening Sentence Representations for Better Semantics and Faster Retrieval”, I'll apply whitening, a simple and effective post-processing technique, and keep only the first 128 dimensions.
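As a rough illustration of what whitening does, here is a minimal NumPy sketch on synthetic data. The random matrix `x`, its size, and `k = 4` are assumptions made for the example, not the BERT vectors used in this notebook. It centers the vectors, rotates and rescales them using the SVD of their covariance matrix, and keeps the first `k` output dimensions, whose covariance should then be approximately the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated "embeddings": 1000 vectors of 16 dimensions (stand-in data)
x = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 16))

mu = x.mean(axis=0, keepdims=True)       # mean vector to subtract
cov = np.cov(x.T)                        # 16 x 16 covariance matrix
u, s, _ = np.linalg.svd(cov)             # singular values come sorted, largest first
kernel = u @ np.diag(1.0 / np.sqrt(s))   # whitening matrix U * S^(-1/2)

k = 4                                    # reduced dimension (128 in this notebook)
x_white = (x - mu) @ kernel[:, :k]       # center, whiten, and truncate

cov_white = np.cov(x_white.T)            # expected to be close to the k x k identity
print(np.round(cov_white, 3))
```

Because the singular values are sorted in decreasing order, truncating the kernel to its first `k` columns keeps the directions with the largest variance.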

In [None]:
# Importing all necessary lib

import os
import sys
import torch
import numpy as np
import pandas as pd
from transformers import RobertaModel, RobertaTokenizer

from scipy.spatial import distance
from sklearn.metrics.pairwise import cosine_similarity


def Dim_reduction(sentences, tokenizer, model):
    vecs = []
    with torch.no_grad():

        for sentence in sentences:
            inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True,  max_length=64)
            inputs['input_ids'] = inputs['input_ids'].to(DEVICE)
            inputs['attention_mask'] = inputs['attention_mask'].to(DEVICE)

            hidden_states = model(**inputs, return_dict=True, output_hidden_states=True).hidden_states

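            # Sum the outputs of the first and last transformer layers, then average over tokens
            # (hidden_states[0] is the embedding layer output, so index 1 is the first layer)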
            output_hidden_state = (hidden_states[-1] + hidden_states[1]).mean(dim=1)

            vec = output_hidden_state.cpu().numpy()[0]

            vecs.append(vec)

    # Compute the whitening kernel and bias, keeping the first 128 output dimensions
    kernel, bias = compute_kernel_bias([vecs])
    kernel = kernel[:, :128]
    embeddings = np.vstack(vecs)

    # Whitening transforms the sentence embeddings so that their covariance
    # matrix becomes the identity; the rows are then L2-normalized
    embeddings = transform_and_normalize(embeddings,
                kernel=kernel,
                bias=bias
            )

    return embeddings


def transform_and_normalize(vecs, kernel, bias):
    """
        Apply the whitening transformation, then L2-normalize the result
    """
    if not (kernel is None or bias is None):
        vecs = (vecs + bias).dot(kernel)
    return normalize(vecs)
    
def normalize(vecs):
    """
        L2-normalize each row to unit length
    """
    return vecs / (vecs**2).sum(axis=1, keepdims=True)**0.5
    
def compute_kernel_bias(vecs):
    """
    Calculate the kernel and bias for the final transformation: y = (x + bias).dot(kernel)
    """
    vecs = np.concatenate(vecs, axis=0)
    mu = vecs.mean(axis=0, keepdims=True)
    cov = np.cov(vecs.T)
    u, s, vh = np.linalg.svd(cov)
    W = np.dot(u, np.diag(s**0.5))
    W = np.linalg.inv(W.T)
    return W, -mu

In [None]:
DEVICE = torch.device('cpu')
# Use BERT base (uncased); DistilBERT would be a smaller, faster alternative that needs less memory
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
from google.colab import drive 
drive.mount('/drive')

# Load the ratings, which contain the cluster assigned by k-means
df=pd.read_csv('/drive/My Drive/Colab Notebooks/info_retrival/rating_with_clusters2.csv')
# Keep only the columns that may help the neural network perform better
df = df[['cluster','rating','book_title','book_author','year_of_publication','publisher','summary','category']]
# Strip punctuation from the category; regex=True is required by newer pandas versions
df["category"] = df['category'].str.replace(r'[^\w\s]', '', regex=True)
titles = df["book_title"]
authors = df['book_author']
summaries = df['summary']
categories = df["category"]

Mounted at /drive



Convert the text data into 128-dimensional embeddings. I'll produce different training datasets for my neural network, for example ('cluster', 'summaries', 'rating') or ('cluster', 'summaries', 'title', 'author', 'rating'), to see which gives the best results.

In [None]:
embeddings_summaries = Dim_reduction(summaries, tokenizer, model)
embeddings_authors = Dim_reduction(authors, tokenizer, model)
embeddings_titles = Dim_reduction(titles, tokenizer, model)
embeddings_categories = Dim_reduction(categories, tokenizer, model)

A further dimensionality reduction using PCA, because my training data is large and the training process takes a lot of time.

In [None]:
from sklearn.decomposition import PCA
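# Keep 3 principal components for the summaries
pca = PCA(n_components=3)
embeddings_summaries = pca.fit_transform(embeddings_summaries)
# Keep 1 principal component each for authors, titles and categories;
# fit_transform refits the PCA on every call
pca = PCA(n_components=1)
embeddings_authors = pca.fit_transform(embeddings_authors)
embeddings_titles = pca.fit_transform(embeddings_titles)
embeddings_categories = pca.fit_transform(embeddings_categories)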
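Reducing 128 dimensions to 1 or 3 components can discard a large share of the variance. One way to check how much is kept, sketched here on synthetic stand-in data (the array `emb` is an assumption, not the real embeddings), is to look at the `explained_variance_ratio_` attribute of the fitted PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a matrix of 128-dimensional, L2-normalized embeddings
emb = rng.normal(size=(500, 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

pca = PCA(n_components=3)
reduced = pca.fit_transform(emb)

# Fraction of the total variance captured by the kept components
retained = pca.explained_variance_ratio_.sum()
print(reduced.shape, f"variance retained: {retained:.3f}")
```

Running the same check on the real embeddings would show whether 1 or 3 components are enough, or whether a larger `n_components` is worth the extra training time.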


Save the different training datasets to CSV files.

In [None]:
# Each row: cluster, title embedding, summary embedding, rating
input_data = np.column_stack((df['cluster'].to_numpy(), embeddings_titles, embeddings_summaries, df['rating'].to_numpy()))
# Alternative without the titles:
#input_data = np.column_stack((df['cluster'].to_numpy(), embeddings_summaries, df['rating'].to_numpy()))

df_train = pd.DataFrame(input_data)
df_train.to_csv('/drive/My Drive/Colab Notebooks/info_retrival/train_after_pca_titles_summaries.csv')
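The different feature combinations mentioned earlier could be assembled with a small helper like the one below. This is only a sketch: `build_dataset` and the toy arrays are hypothetical names introduced for illustration, and the real inputs would be the columns and embeddings computed in this notebook.

```python
import numpy as np

def build_dataset(cluster, rating, *feature_blocks):
    """Stack cluster, any number of embedding blocks, and rating into one 2-D array."""
    n = len(cluster)
    # Make every block 2-D so that 1-D and multi-column features stack the same way
    blocks = [np.asarray(b, dtype=float).reshape(n, -1) for b in feature_blocks]
    return np.column_stack(
        [np.asarray(cluster, dtype=float), *blocks, np.asarray(rating, dtype=float)]
    )

# Toy stand-ins for the real columns and PCA-reduced embeddings
cluster = np.array([0, 1, 1, 2])
rating = np.array([5, 7, 3, 9])
emb_titles = np.arange(4, dtype=float).reshape(4, 1)
emb_summaries = np.arange(12, dtype=float).reshape(4, 3)

only_summaries = build_dataset(cluster, rating, emb_summaries)
titles_and_summaries = build_dataset(cluster, rating, emb_titles, emb_summaries)
print(only_summaries.shape, titles_and_summaries.shape)
```

Each result can be wrapped in `pd.DataFrame(...)` and written with `to_csv`, one file per feature combination.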
