#Open Library Analysis - Big Data Computing Project
####Graph's Connected Components vs. k-means Clusters

In this notebook we analyse Open Library's data dumps, which are freely downloadable from their website: https://openlibrary.org/developers/dumps
Book data is provided as JSON records, which we preprocessed to keep only the relevant information in a .csv file.

In the following we will clean the dataset and extract relevant features.

We use the resulting dataset to create a graph representing book affinity (i.e. nodes represent books and an edge connects two nodes only if their similarity is above a certain threshold). We are then interested in finding the connected components of this graph.
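
As a rough illustration of the idea, the following pure-Python sketch builds a thresholded similarity graph over a handful of toy vectors and labels its connected components with a union-find structure. The names `cosine`, `connected_components` and `toy_books` are hypothetical and are not part of the notebook; the notebook itself relies on GraphFrames for this step.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def connected_components(vectors, threshold):
    """Map each node id to a component label (the smallest id in its component).

    `vectors` is a dict {node_id: vector}; an edge joins two nodes when their
    cosine similarity is strictly above `threshold`.
    """
    parent = {node: node for node in vectors}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller id as the root so labels are deterministic.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return {node: find(node) for node in vectors}


# Hypothetical toy data: two pairs of nearly parallel vectors.
toy_books = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
components = connected_components(toy_books, threshold=0.6)
print(components)
```

The all-pairs loop is quadratic in the number of books, which is fine for a toy example but is the reason a distributed engine is used on the real dataset.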

Moreover, we will perform k-means clustering and compare, in terms of Silhouette Coefficient, the resulting clusters with the connected components of the aforementioned graph.

##Libraries

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

from pyspark import SparkContext, SparkConf

# Tokenizer, StopWordsRemover and Word2Vec are used for NLP preprocessing and embeddings
# VectorAssembler joins multiple vectors into a single one
# Normalizer is used to compute cosine similarity
# StandardScaler standardizes the features before PCA
from pyspark.ml.feature import Tokenizer, StopWordsRemover, Word2Vec, VectorAssembler, Normalizer, StandardScaler, PCA
from nltk.stem.snowball import SnowballStemmer
# K-means and K-means evaluation
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.clustering import KMeans
# Graph
from graphframes import *

##Data Acquisition

It is possible to download either the full dataset (22,589,356 entries) or only the first 50,000 entries.

Short dataset: 50,000 entries

In [0]:
%sh wget -P /tmp https://raw.githubusercontent.com/attennig/BDC_datasets/main/books_short.csv

In [0]:
dbutils.fs.mv("file:/tmp/books_short.csv", "dbfs:/bdc-2020-21/datasets/books_short.csv")

Long dataset: 22,589,356 entries

In [0]:
%sh wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1qIhBSrpkDc-RCdbw7e1NVtNhOj_fNi5G' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1qIhBSrpkDc-RCdbw7e1NVtNhOj_fNi5G" -O /tmp/books_long.csv.bz2 && rm -rf /tmp/cookies.txt

In [0]:
dbutils.fs.ls("file:/tmp")

In [0]:
dbutils.fs.mv("file:/tmp/books_long.csv.bz2", "dbfs:/bdc-2020-21/datasets/books_long.csv.bz2")

In [0]:
dbutils.fs.ls("dbfs:/bdc-2020-21/datasets")

###Load data into pyspark DataFrame

In [0]:
# Read dataset file into a Spark Dataframe
books_df = spark.read.load("dbfs:/bdc-2020-21/datasets/books_short.csv", 
                         format="csv", 
                         sep=";", 
                         inferSchema="true", 
                         header="true"
                         )

In [0]:
print("The shape of the dataset is {:d} rows by {:d} columns".format(books_df.count(), len(books_df.columns)))

##Data Cleaning

In [0]:
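# Drop the columns in which more than 70% of the values are null
total_rows = books_df.count()  # computed once instead of once per column
columns_to_drop = []
for c in books_df.columns:
  if books_df.where(col(c).isNull()).count() / total_rows > 0.7:
    columns_to_drop.append(c)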

In [0]:
books_df = books_df.drop(*columns_to_drop)

In [0]:
assert "title" in books_df.columns and "key" in books_df.columns
books_df = books_df.dropna(how="any", subset=["key", "title"])
books_df = books_df.dropDuplicates(['key'])
books_df = books_df.dropDuplicates(['title'])

In [0]:
assert "subjects" in books_df.columns and "authors" in books_df.columns
books_df = books_df.na.fill({'subjects': 'unknown', 'authors': 'unknown'})

In [0]:
# Sort by key in descending order (a single sort column needs a single flag)
books_df = books_df.orderBy("key", ascending=False)

In [0]:
# This will return a new DF with all the columns + id
# The generated ids are unique and increasing, but not necessarily consecutive
books_df = books_df.withColumn("id", monotonically_increasing_id())

In [0]:
# From Document_Clustering.ipynb
def clean_text(df, column_name, perform_stemming=True):
    """ 
    This function takes the raw text in `column_name` and applies a standard NLP preprocessing pipeline consisting of the following steps:
      - Text cleaning
      - Tokenization
      - Stopwords removal
      - Stemming (Snowball stemmer), only if `perform_stemming` is True

    parameters: the input dataframe (must contain an `id` column), the name of the text column, the stemming flag
    returns: a dataframe with `id`, the cleaned text column, `tokens`, `terms` and, if stemming is enabled, `terms_stemmed`
    """
    # Text preprocessing pipeline
    # 1. Text cleaning
    # 1.a Case normalization
    lower_case_df = df.select(["id",lower(col(column_name)).alias(column_name)])
    # 1.b Trimming
    trimmed_df = lower_case_df.select(["id",trim(col(column_name)).alias(column_name)])
    # 1.c Filter out punctuation symbols
    no_punct_df = trimmed_df.select(["id",(regexp_replace(col(column_name), "[^a-zA-Z\\s]", "")).alias(column_name)])
    # 1.d Filter out any internal extra whitespace
    cleaned_df = no_punct_df.select(["id",trim(regexp_replace(col(column_name), " +", " ")).alias(column_name)])
    # 2. Tokenization (split text into tokens)
    tokenizer = Tokenizer(inputCol=column_name, outputCol="tokens")
    tokens_df = tokenizer.transform(cleaned_df).cache()
    # 3. Stopwords removal
    stopwords_remover = StopWordsRemover(inputCol="tokens", outputCol="terms")
    ret_df = stopwords_remover.transform(tokens_df).cache()
    # 4. Stemming (Snowball stemmer)
    if perform_stemming:
      stemmer = SnowballStemmer(language="english")
      stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))
      ret_df = ret_df.withColumn("terms_stemmed", stemmer_udf("terms")).cache()
      
    return ret_df

In [0]:
clean_title_df = clean_text(books_df, "title")

In [0]:
clean_subjects_df = clean_text(books_df, "subjects")

In [0]:
clean_authors_df = clean_text(books_df, "authors", False)

##Feature Engineering

In this section we use NLP techniques to obtain numerical vectors that represent the text-based features.
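
To make the notion of a "numerical vector for a text field" concrete, here is a minimal pure-Python sketch of how a bag of tokens can be turned into a single vector by averaging per-word embeddings, which is, to our understanding, what Spark's `Word2Vec` model does when transforming a document. The helper `document_vector` and the toy embeddings are hypothetical, and the handling of out-of-vocabulary tokens (skipped in the sum, but still counted in the divisor) should be treated as an assumption.

```python
def document_vector(tokens, word_vectors, size):
    """Average the embeddings of `tokens` into one vector of length `size`.

    Tokens missing from `word_vectors` contribute nothing to the sum;
    an empty token list yields the zero vector.
    """
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    # Assumption: divide by the total number of tokens, known or not.
    return [x / len(tokens) for x in total]


# Hypothetical 2-dimensional embeddings for two known words.
toy_vectors = {"graph": [1.0, 0.0], "cluster": [0.0, 1.0]}
doc_vec = document_vector(["graph", "cluster", "rareword"], toy_vectors, size=2)
empty_vec = document_vector([], toy_vectors, size=2)
print(doc_vec, empty_vec)
```

One consequence worth keeping in mind: a record whose tokens are all out of vocabulary (for instance because of a minimum-count filter) ends up as the zero vector.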

In [0]:
# Target layout of the final df
# |id|w2vec(clean_title)|w2vec(clean_subjects)|w2vec(clean_authors)|

In [0]:
RANDOM_SEED = 42 # seed used for Word2Vec and, below, for K-means clustering
EMBEDDING_SIZE = 150 # size of embedding Word2Vec vectors

In [0]:
#Word2Vec from Document_Clustering.ipynb 
def extract_w2v_features(df, column_name, out_col_name):
  word2vec = Word2Vec(vectorSize=EMBEDDING_SIZE, minCount=5, inputCol=column_name, outputCol=out_col_name, seed=RANDOM_SEED)
  model = word2vec.fit(df)
  features = model.transform(df).cache()
  
  return features

In [0]:
w2v_title_features = extract_w2v_features(clean_title_df, "terms_stemmed", "title_vec")

In [0]:
w2v_subjects_features = extract_w2v_features(clean_subjects_df, "terms_stemmed", "subjects_vec")

In [0]:
w2v_authors_features = extract_w2v_features(clean_authors_df, "terms", "authors_vec")

In [0]:
# Final dataframe
w2v_title_features = w2v_title_features.select(["id", "title_vec"])
w2v_subjects_features = w2v_subjects_features.select(["id", "subjects_vec"])
w2v_authors_features = w2v_authors_features.select(["id", "authors_vec"])

In [0]:
engineered_books_df = w2v_title_features
engineered_books_df = engineered_books_df.join(w2v_subjects_features, ["id"])
engineered_books_df = engineered_books_df.join(w2v_authors_features, ["id"])

In [0]:
vec_ass = VectorAssembler(inputCols=["title_vec","subjects_vec","authors_vec"], outputCol="vector_feature", handleInvalid="keep")
engineered_books_df = vec_ass.transform(engineered_books_df).select(["id", "vector_feature"])

In [0]:
engineered_books_df.show() 

##Graph

In [0]:
nodes_df = engineered_books_df.select(["id"])

In [0]:
normalizer = Normalizer(inputCol="vector_feature", outputCol="norm")
vectors_norm = normalizer.transform(engineered_books_df).select("id", "norm")

In [0]:
dot_udf = udf(lambda x,y: float(x.dot(y)), DoubleType())
cosine_sim_df = vectors_norm.alias("src").join(vectors_norm.alias("dst"), col("src.id") != col("dst.id"))\
    .select(
        col("src.id").alias("src"), 
        col("dst.id").alias("dst"), 
        dot_udf("src.norm", "dst.norm").alias("dot"))
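
The cell above relies on the fact that the dot product of two L2-normalised vectors equals their cosine similarity, which is why the `Normalizer` step comes before the self-join. The following pure-Python sketch checks this identity on toy vectors; the helper names are hypothetical.

```python
from math import sqrt


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale `v` to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(u, v):
    """Textbook definition: dot(u, v) / (||u|| * ||v||)."""
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))


u, v = [3.0, 4.0], [4.0, 3.0]
sim_direct = cosine_similarity(u, v)
sim_normalized = dot(l2_normalize(u), l2_normalize(v))
print(sim_direct, sim_normalized)
```

Normalising once per vector and then taking plain dot products avoids recomputing both norms for every pair in the join.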

In [0]:
edges_df = cosine_sim_df.filter(cosine_sim_df.dot>0.6)

In [0]:
books_graph = GraphFrame(nodes_df, edges_df) 

In [0]:
books_graph.vertices.show(5, truncate=False)
books_graph.edges.show(5, truncate=False)

In [0]:
books_graph.degrees.show(10)

In [0]:
!pwd

In [0]:
!mkdir /databricks/driver/checkpoints

In [0]:
spark.sparkContext.setCheckpointDir('/databricks/driver/checkpoints')

In [0]:
# TODO
# figure out what books_CC is and how it should be used for the evaluation
books_CC = books_graph.connectedComponents()

##Clustering

In [0]:
# TODO
# check that everything works

In [0]:
# from Document_Clustering.ipynb and the other clustering notebook
# k = nCC (number of connected components), kmeans
N_CLUSTERS = 10 # number of output clusters (K)
DISTANCE_MEASURE = "cosine" # alternatively, "euclidean"
MAX_ITERATIONS = 100 # maximum number of iterations of K-means EM algorithm
TOLERANCE = 0.000001 # tolerance between consecutive centroid updates (i.e., another stopping criterion)

###PCA

In [0]:
scaler = StandardScaler(inputCol="vector_feature", 
                        outputCol="std_vector_features",
                        withStd=True, withMean=True)

# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(engineered_books_df)

# Standardize each feature to have zero mean and unit standard deviation.
engineered_books_df = scalerModel.transform(engineered_books_df)

In [0]:
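# NOTE: K (the number of principal components to keep) is not defined anywhere above;
# it must be set before running this cell.
pca_model = PCA(k=K, inputCol="std_vector_features", outputCol="pca_features")
pca_features = pca_model.fit(engineered_books_df)
pca_books_df = pca_features.transform(engineered_books_df).cache()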

In [0]:
def k_means(dataset, 
            n_clusters, 
            distance_measure=DISTANCE_MEASURE, 
            max_iter=MAX_ITERATIONS, 
            tol=TOLERANCE,
            features_col="vector_feature", 
            prediction_col="cluster", 
            random_seed=RANDOM_SEED):

  print("""Training K-means clustering using the following parameters: 
  - K (n. of clusters) = {:d}
  - max_iter (max n. of iterations) = {:d}
  - distance measure = {:s}
  - random seed = {:d}
  """.format(n_clusters, max_iter, distance_measure, random_seed))
  # Train a K-means model
  kmeans = KMeans(featuresCol=features_col, 
                   predictionCol=prediction_col, 
                   k=n_clusters, 
                   initMode="k-means||", 
                   initSteps=5, 
                   tol=tol, 
                   maxIter=max_iter, 
                   seed=random_seed, 
                   distanceMeasure=distance_measure)
  model = kmeans.fit(dataset)

  # Make clusters
  clusters_df = model.transform(dataset).cache()

  return model, clusters_df

In [0]:
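# NOTE: k_means defaults to features_col="vector_feature", so the "pca_features" column
# computed above is not used here unless features_col is passed explicitly.
model, clusters_df = k_means(pca_books_df, N_CLUSTERS, max_iter=MAX_ITERATIONS, distance_measure=DISTANCE_MEASURE)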

##Evaluation

###Graph's connected components evaluation

Implementing Silhouette Coefficient for graph's connected components
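
As a starting point, the sketch below computes the mean Silhouette Coefficient with cosine distance for an arbitrary labelling, so the labels could equally be k-means clusters or connected-component ids (for example, the `component` column that GraphFrames is expected to add to `books_CC`). It is a small in-memory illustration, not a distributed implementation: `silhouette_score`, `cosine_distance` and the toy data are hypothetical, and assigning a score of 0 to singleton groups is a common convention that we assume here.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def silhouette_score(vectors, labels):
    """Mean silhouette over all points, given one label per vector."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    if len(groups) < 2:
        raise ValueError("the silhouette needs at least two groups")

    scores = []
    for i, label in enumerate(labels):
        own = groups[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton groups
            continue
        # a: mean distance to the other members of the same group
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other group
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in groups.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: a sensible grouping versus a deliberately poor one.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = silhouette_score(toy_vectors, [0, 0, 2, 2])
bad = silhouette_score(toy_vectors, [0, 2, 0, 2])
print(good, bad)
```

If this sketch is pasted into the notebook itself, note that `from pyspark.sql.functions import *` is likely to shadow built-ins such as `sum`, `min` and `max`, so it is safer to run it in a separate Python session.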

In [0]:
# TODO
# implement

###Clusters evaluation

In [0]:
# TODO
# check that this works
# Evaluate clustering by computing Silhouette score
metric_name="silhouette"
distance_measure="cosine" #"squaredEuclidean"
prediction_col="cluster"
features_col="vector_feature" # must match the column used to fit K-means (the evaluator's default is "features")
evaluator = ClusteringEvaluator(metricName=metric_name,
                                distanceMeasure=distance_measure, 
                                predictionCol=prediction_col,
                                featuresCol=features_col
                                )
evaluator.evaluate(clusters_df)