# Summarizing Music Reviews with Graphical Models
-------------------------------------------------

## Overview
Every day, digital shoppers across the globe generate hundreds of thousands of product reviews, for both new and old products. As a digital retailer or brand, it is critical to understand not just the sentiment of this feedback but also the core concepts that customers write about. However, content is now generated faster than the marketing and merchandising teams at these organizations can read it.


Dynamic content summarization techniques make it possible to programmatically identify core concepts within natural text and use this insight to condense large amounts of text into information-dense summaries. In this talk, we will explore the current state of the art in content summarization by implementing the graph-based keyword extraction algorithm called TopicRank on music text reviews, and then use the extracted concepts to automatically summarize all of the reviews on a given album.

## Notebook Overview
Below is a start-to-finish walkthrough of a method for finding the 10 most relevant sentences in a corpus of music reviews. In particular, we will be summarizing reviews for Pink Floyd's The Dark Side of the Moon.

1. Load raw review data set (music reviews from Amazon - ~1m reviews)
2. Find and parse sentences related to Dark Side of the Moon
3. Prepare word embeddings from music corpus
4. Create a sentence graph
5. Compute PageRank on that graph
6. Look at a few example results

## Supporting Material
1. Slide [presentation](http://slides.com/dataexhaust/dynamic-content-summarization)

In [18]:
/*
 *  Environment Setup
 *  ========================
 *  - Jupyter-Scala (https://github.com/alexarchambault/jupyter-scala)
 */
import $exclude.`org.slf4j:slf4j-log4j12`, $ivy.`org.slf4j:slf4j-nop:1.7.21` // for cleaner logs
import $profile.`hadoop-2.6`
import $ivy.`org.apache.spark::spark-sql:2.1.0` // adjust spark version - spark >= 2.0
import $ivy.`org.apache.hadoop:hadoop-aws:2.6.4`
import $ivy.`org.jupyter-scala::spark:0.4.0` // for JupyterSparkSession (SparkSession aware of the jupyter-scala kernel)

// General spark imports
import org.apache.spark._
import org.apache.spark.sql._
import jupyter.spark.session._

// Create sessions
val spark = JupyterSparkSession.builder() // important - call this rather than SparkSession.builder()
  .jupyter() // this method must be called straightaway after builder()
  // .yarn("/etc/hadoop/conf") // optional, for Spark on YARN - argument is the Hadoop conf directory
  // .emr("2.6.4") // on AWS ElasticMapReduce, this adds aws-related to the spark jar list
  .master("local[*]") // change to "yarn-client" on YARN
  .config("spark.driver.memory", "8g")
  .config("spark.executor.memory", "8g")
  .appName("Graph-based Review Summarization")
  .getOrCreate()

// Access underlying spark context (for backwards compatibility)
val sc = spark.sparkContext
val sqlContext = spark.sqlContext

spark: SparkSession = org.apache.spark.sql.SparkSession@73907edd
sc: SparkContext = org.apache.spark.SparkContext@d3a7cce
sqlContext: SQLContext = org.apache.spark.sql.SQLContext@172150da

In [21]:
// Load special ML / NLP libraries via interop
interp.load.ivy("org.apache.spark" %% "spark-mllib" % "2.1.0") // keep in sync with the spark-sql version above
interp.load.ivy("org.apache.spark" %% "spark-graphx" % "2.1.0")
interp.load.ivy("org.scalanlp" %% "breeze" % "0.13")
interp.load.ivy("edu.stanford.nlp" % "stanford-corenlp" % "3.6.0")
//interp.load.ivy("edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models") -- throws error (https://github.com/alexarchambault/jupyter-scala/issues/128)
interp.load.ivy("com.google.protobuf" % "protobuf-java" % "2.6.1")

// Spark SQL
import sqlContext._
import sqlContext.implicits._

// ML imports
import breeze.linalg._
import org.apache.spark.mllib.linalg.Vectors

// Graph imports
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

// NLP imports
import edu.stanford.nlp.simple.Document


## Load data and parse sentences
We will be working with Amazon review data, made available by [UCSD](http://jmcauley.ucsd.edu/data/amazon/). We need to load it into a Spark DataFrame, find the reviews for our target album (ASIN `B000000IRB`), extract the raw text, and use the Stanford CoreNLP parser to get an enriched version of the sentences.

In [None]:
// Load music reviews
val music_reviews = spark.read.json("file:///home/garrett/dev/data/amazon/music/reviews_CDs_and_Vinyl_5.json")
music_reviews.createOrReplaceTempView("reviews") // registerTempTable is deprecated since Spark 2.0

// Create merged review document for the target album
val document = sqlContext.sql("SELECT reviewText FROM reviews WHERE asin = 'B000000IRB'").map(r => r(0).toString).collect().mkString("\n\n")

// Get sentences
val sentences = new Document(document).sentences() // 8077 sentences

## Parse out and save necessary keywords
We are only interested in the keywords (nouns and proper nouns) in each sentence, so we need to parse out the relevant text and save it in a form that is easier to load from disk later.
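As a rough illustration of this filtering step, the sketch below keeps only the tokens tagged `NN` or `NNP` from a list of (word, tag) pairs. The tagged tokens are written by hand for the example (an assumption, not parser output), and `keepNouns` is a hypothetical helper name; in the notebook the words and tags come from the Stanford parser.

```scala
// Hypothetical helper: keep only tokens whose Penn Treebank tag is a
// singular noun (NN) or singular proper noun (NNP).
val nounTags = Set("NN", "NNP")

def keepNouns(tagged: Seq[(String, String)]): Seq[(String, String)] =
  tagged.filter { case (_, tag) => nounTags.contains(tag) }

// Hand-tagged example sentence (illustrative data, not parser output)
val tagged = Seq(
  ("Pink", "NNP"), ("Floyd", "NNP"), ("made", "VBD"),
  ("a", "DT"), ("timeless", "JJ"), ("album", "NN")
)

val keywords = keepNouns(tagged).map(_._1)
println(keywords.mkString(", "))
```

Note that a filter on exactly these two tags does not match the plural tags `NNS` and `NNPS`, so plural nouns are dropped along with verbs and adjectives.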

Note: The below code will not run yet because of the Ivy model loading issue above.

In [None]:
// Keep only the noun keywords of each sentence; these are the words we will compare
import scala.collection.JavaConverters._ // CoreNLP returns java.util.List
val parsed_sentences = sentences.asScala.map(s => {
    // Extract words and their part of speech
    val words = s.words().asScala.toList
    val tags = s.posTags().asScala.toList

    // Filter and return nouns (NN) and proper nouns (NNP)
    (words zip tags).filter(x => Set("NN", "NNP").contains(x._2))
}).toList

// Zip sentences together to get index
val indexed_sentences = parsed_sentences.zipWithIndex

// Parallelize sentences
val sentences_rdd = sc.parallelize(indexed_sentences)

// Break into individual keywords
case class SentenceKeyword(id: Int, keyword: String, pos: String)
val keywords_by_sentence = sentences_rdd.flatMap(s => s._1.map(x => (s._2, x._1, x._2)))
                                        .map(s => SentenceKeyword(s._1, s._2, s._3))

// Save to disk
keywords_by_sentence.toDF().write.parquet("file:///.../music/parsed_sentences/")

// Load from disk
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
val keywords_by_sentence = spark.read.parquet("file:///.../music/parsed_sentences/*").as[SentenceKeyword]

In [31]:
// Load from disk
case class SentenceKeyword(id: Int, keyword: String, pos: String)
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
val keywords_by_sentence = spark.read.parquet("file:///home/garrett/dev/data/amazon/music/parsed_sentences/*").as[SentenceKeyword]

defined class SentenceKeyword
keywords_by_sentence: Dataset[wrapper.wrapper.SentenceKeyword] = [id: int, keyword: string ... 1 more field]

## Prepare word vectors
In the reference papers for TextRank, the authors use a similarity metric between sentences that is essentially the number of keywords the two sentences share divided by the number of keywords in their union. In our model, we will take a different approach and build a more modern notion of computational similarity among words in the music review corpus.
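For comparison, the keyword-overlap idea can be sketched in a few lines. This follows the description above (shared keywords over the union of keywords) rather than the exact normalization used in the TextRank paper; `overlapSimilarity` and the keyword sets are hypothetical names and data chosen for the example.

```scala
// Keyword-overlap similarity: shared keywords divided by the union of keywords.
// `overlapSimilarity` is a hypothetical helper, not part of any library.
def overlapSimilarity(a: Set[String], b: Set[String]): Double = {
  val union = a union b
  if (union.isEmpty) 0.0 else (a intersect b).size.toDouble / union.size
}

// Illustrative keyword sets for two sentences
val s1 = Set("album", "production", "sound")
val s2 = Set("album", "sound", "lyrics", "theme")

println(overlapSimilarity(s1, s2)) // 2 shared keywords out of 5 distinct ones
```

A score like this is zero whenever two sentences share no exact keyword, even if they discuss closely related concepts, which is the gap an embedding-based similarity is meant to close.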

To do this, we first need to generate word vectors from the entire music review corpus. This gives us a mathematical representation of key concepts grounded in the reviewers' own language.

We will use a separate, offline library called [FastText](https://github.com/facebookresearch/fastText) to generate the word vectors.
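FastText's text output (the `.vec` file) starts with a header line giving the vocabulary size and vector dimension, followed by one line per word containing the word and its vector components separated by spaces. The sketch below parses that layout into an in-memory map. The sample lines are made up and use 3 dimensions instead of 300 to stay readable, and `parseVecLine` is a hypothetical helper name.

```scala
// Sample lines in the fastText .vec text layout: a "<vocab size> <dim>" header,
// then "<word> <v1> <v2> ..." per line. Values are invented for illustration.
val sampleLines = Seq(
  "2 3",
  "album 0.10 -0.20 0.30",
  "guitar 0.05 0.40 -0.10"
)

// Hypothetical helper: split one line into the word and its vector
def parseVecLine(line: String): (String, Array[Float]) = {
  val values = line.trim.split(" ")
  (values.head, values.tail.map(_.toFloat))
}

// Drop the header, then build a word -> vector map
val wordVectors: Map[String, Array[Float]] =
  sampleLines.drop(1).map(parseVecLine).toMap

println(wordVectors("album").mkString(", "))
```

On the full file the same parsing is better done on an RDD, as in the cell below, since the vocabulary of a large corpus may not fit comfortably in a driver-side map.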

Note: The below code will not run yet because of the Ivy model loading issue above.

In [None]:
// Save raw text file of all reviews -- one per line -- for FastText training purposes
music_reviews.select("reviewText").rdd.map(r => r(0).toString)
                                  .coalesce(1).saveAsTextFile("file:///.../music/text")

/*
    Train Word Vectors from Text -- Using Fast Text
    =========
    fasttext skipgram -input part-00000 -output word_vectors -dim 300
*/

// Load word-vectors into memory map
val raw_word_vectors = sc.textFile("file:///.../music/text/word_vectors.vec")
                         .mapPartitionsWithIndex { (idx, iter) => 
                            if (idx == 0) iter.drop(1) else iter }

// Get all keywords from parsed sentences (a Set keeps the lookups below fast)
val keywords = parsed_sentences.flatten.map(_._1).toSet

// Filter word vectors
val filtered_word_vectors = raw_word_vectors.filter(line => keywords.contains(line.split(" ")(0)))
filtered_word_vectors.cache()

// Merge with keywords
val transformed_word_vectors = filtered_word_vectors.map(line => {
    // Split line
    val values = line.split(" ")

    // Add to in-memory word vector map
    (values(0), values.slice(1, values.length).map(_.toFloat))
})
val grouped_keywords_vectors = keywords_by_sentence.map(sk => (sk.keyword, sk))
                                                   .rdd.cogroup(transformed_word_vectors)

In [None]:
// Map keywords to vectors
case class MappedSentenceKeyword(id: Int, keyword: String, vector: Array[Float])
val mapped_sentence_keywords = grouped_keywords_vectors.flatMap {
  case (_, (sentence_keywords, vectors)) =>
    // Use the trained vector if the word exists in the vocabulary,
    // otherwise fall back to a 300-dimensional zero vector
    val word_vector = vectors.headOption.getOrElse(new Array[Float](300))

    // Map each sentence keyword to that vector
    sentence_keywords.map(sk => MappedSentenceKeyword(sk.id, sk.keyword, word_vector))
}

// Save to disk
mapped_sentence_keywords.toDF().write.parquet("file:///.../music/mapped_sentence_keywords/")

// Load from disk
val mapped_sentence_keywords = spark.read.parquet("file:///.../music/mapped_sentence_keywords/*")
                                         .as[MappedSentenceKeyword]


## Create sentence graph (in Spark GraphX)
To use the PageRank implementation that ships with Spark GraphX, we first need to turn the sentences into the graph structure it expects: sentences become vertices and pairwise sentence similarities become edges.
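The edge score used in the next cell is the cosine similarity between the averaged keyword vectors of two sentences. A minimal plain-Scala sketch of that computation is shown here, without Spark or Breeze; the function names and the tiny 3-dimensional vectors are illustrative assumptions.

```scala
// Average a sentence's keyword vectors (assumes at least one vector)
def average(vectors: Seq[Array[Double]]): Array[Double] = {
  val dim = vectors.head.length
  val sum = vectors.foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
    acc.zip(v).map { case (x, y) => x + y }
  }
  sum.map(_ / vectors.length)
}

// Cosine of the angle between two vectors; all-zero vectors get similarity 0
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0.0) 0.0 else dot / norms
}

val sentenceA = Seq(Array(1.0, 0.0, 0.0), Array(0.0, 1.0, 0.0)) // two keywords
val sentenceB = Seq(Array(1.0, 1.0, 0.0))                       // one keyword
val sentenceC = Seq(Array(0.0, 0.0, 1.0))                       // one keyword

println(cosine(average(sentenceA), average(sentenceB))) // same direction, close to 1.0
println(cosine(average(sentenceA), average(sentenceC))) // orthogonal, 0.0
```

Because cosine similarity ignores vector length, a sentence with many keywords and a sentence with one keyword can still score close to 1.0 when their averages point in the same direction.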

In [None]:
// Load sentences
case class IndexedSentence(id: Int, keywords: List[MappedSentenceKeyword])
val indexed_sentences = mapped_sentence_keywords.rdd.map(sk => (sk.id, sk)).
                            groupByKey().map(x => IndexedSentence(x._1, x._2.toList))

// Create sentence pairs
val sentence_pairs = indexed_sentences.cartesian(indexed_sentences)
  // Only keep pairs where the second sentence's ID is greater than the first's,
  // so that each unordered pair appears exactly once
  .filter { case (s1, s2) => s1.id < s2.id }

// Create sentence graph
case class SentenceEdge(id_1: Int, id_2: Int, score: Double)

// Cosine similarity of two Breeze vectors (0.0 if either vector is all zeros)
def cosineSimilarity(a: DenseVector[Double], b: DenseVector[Double]): Double = {
  val denominator = norm(a) * norm(b)
  if (denominator == 0.0) 0.0 else (a dot b) / denominator
}

val sentence_graph = sentence_pairs.map { case (s1, s2) =>
  // Convert each keyword vector to a Breeze DenseVector
  val s1_vectors = s1.keywords.map(k => new DenseVector(k.vector.map(_.toDouble)))
  val s2_vectors = s2.keywords.map(k => new DenseVector(k.vector.map(_.toDouble)))

  // Average the keyword vectors of each sentence
  val avg_s1_vector = s1_vectors.fold(DenseVector.zeros[Double](300))((acc, v) => acc + v) / s1_vectors.length.toDouble
  val avg_s2_vector = s2_vectors.fold(DenseVector.zeros[Double](300))((acc, v) => acc + v) / s2_vectors.length.toDouble

  // Return sentence graph edge
  SentenceEdge(s1.id, s2.id, cosineSimilarity(avg_s1_vector, avg_s2_vector))
}

// Save sentence graph to disk
sentence_graph.toDF().write.parquet("file:///.../music/sentence_graph/v1/")

// Load sentence graph from disk
val sentence_graph = spark.read.parquet("file:///.../music/sentence_graph/v1/*").as[SentenceEdge]

## Compute PageRank
We're now going to derive a PageRank authority for each sentence Vertex in the sentence graph.
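To build some intuition for what the ranking does, here is a small plain-Scala PageRank by power iteration on a four-node graph. It is a generic textbook-style sketch, not GraphX's implementation: the damping factor of 0.85, the fixed iteration count, and the toy edge list are all assumptions made for the example.

```scala
// Toy undirected graph; each edge is stored in both directions
val undirected = Seq((0, 1), (0, 2), (0, 3), (1, 2))
val edges = undirected.flatMap { case (a, b) => Seq((a, b), (b, a)) }
val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct.sorted
val outDegree = edges.groupBy(_._1).map { case (n, es) => n -> es.size }

def pageRank(damping: Double = 0.85, iterations: Int = 50): Map[Int, Double] = {
  val n = nodes.size
  var ranks = nodes.map(_ -> 1.0 / n).toMap
  for (_ <- 1 to iterations) {
    // Each node splits its current rank evenly across its outgoing edges
    val incoming = edges
      .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
      .groupBy(_._1)
      .map { case (dst, contributions) => dst -> contributions.map(_._2).sum }
    // New rank: a small uniform share plus the damped incoming rank
    ranks = nodes.map { v =>
      v -> ((1 - damping) / n + damping * incoming.getOrElse(v, 0.0))
    }.toMap
  }
  ranks
}

val ranks = pageRank()
val ordered = ranks.toSeq.sortBy(-_._2).map(_._1)
println(ordered) // node ids from highest to lowest rank
```

In this toy graph node 0 has the most neighbors and ends up ranked first, while node 3, connected only to node 0, ends up last. The same intuition carries over to sentences: a sentence that is similar to many other sentences accumulates rank.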

In [None]:
// Create vertex RDD from sentences
val sentenceVertices: RDD[(VertexId, String)] = 
    indexed_sentences.map(s => (s.id.toLong, s.id.toString))
val defaultSentence = ("-1")

// Create edges RDD from sentence graph -- only create links if above minimum similarity
val sentenceEdges: RDD[Edge[Double]] = sentence_graph.rdd.filter(se => se.score > 0.75).flatMap(se => {
  // Add the edge in both directions so the graph behaves as undirected
  List(Edge(se.id_1.toLong, se.id_2.toLong, se.score),
       Edge(se.id_2.toLong, se.id_1.toLong, se.score))
})

// Create graph
val graph = Graph(sentenceVertices, sentenceEdges, defaultSentence)
graph.persist() // persist graph (for performance purposes)

// Calculate page rank
val ranks = graph.pageRank(0.0001).vertices

// Find top K sentences by rank
val top_ranks = ranks.sortBy(_._2, ascending=false).take(10)
val ranksAndSentences = ranks.join(sentenceVertices).sortBy(_._2._1, ascending=false).map(_._2._2)

// Get the top 10 results
ranksAndSentences.take(10)

/*
   Results: (1401, 824, 2360, 2717, 4322, 1150, 4363, 2320, 238, 3128)
 */
 

## Get results
Now that we have our ranked sentences, let's join the top-ranked IDs back to the original sentence text and print out the best ones.
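One detail worth keeping in mind is that a SQL `IN` filter returns matching rows without regard to rank order. If the order of the summary matters, a lookup keyed by sentence ID preserves it. The sketch below shows the idea with placeholder IDs and texts (assumptions for the example, not the notebook's data).

```scala
// Placeholder sentence texts keyed by sentence id (illustrative data)
val sentenceText = Map(
  0 -> "First sentence.",
  1 -> "Second sentence.",
  2 -> "Third sentence."
)

// Top-ranked ids, highest rank first (illustrative)
val topIds = Seq(2, 0)

// Look up each id in rank order; ids with no text are skipped
val summary = topIds.flatMap(sentenceText.get)
summary.foreach(println)
```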

In [None]:
// Zip sentences together to get index
import scala.collection.JavaConverters._ // CoreNLP returns java.util.List
case class SentenceRaw(id: Int, text: String)
val indexed_sentences_original = sentences.asScala.toList
                                          .zipWithIndex.map { case (s, i) => SentenceRaw(i, s.text()) }
sc.parallelize(indexed_sentences_original)
  .toDF().createOrReplaceTempView("sentences")

// Show top sentences (a multi-line SQL string needs triple quotes)
sqlContext.sql("""SELECT text
                  FROM sentences
                  WHERE id IN (1401, 824, 2360, 2717, 4322,
                               1150, 4363, 2320, 238, 3128)""")
           .map(r => r(0).toString).collect().foreach(println)

## Results (Sample)

> "This is the best music, the best recording, the best rock album, the best concept album."

> "To make a long review short, you should buy "Dark Side Of The Moon" because: a) it's music, combining the band's sharp songwriting, outstanding musical chemistry, and impressive in-the-studio skills, is fantastic, b) it's timeless theme about all the things in life that can drive us mad---money, mortality, time (or lack of), war, etc., is pure genius, c) the clever lyrics by Roger Waters REALLY hit home, d) it's unsurpassed production & sound effects make it without question THE album to test your new stereo equipment with, and e) although I've never tried it myself, it's widely reputed to be a GREAT soundtrack album for....er, intimate encounters (especially while playing "The Great Gig In The Sky"---it's supposed to be really cool, man)."

> "The Dark Side Of The Moon is a key album into defining the peak of space rock, the revival of psychedelic rock into modern settings, the point were blues is taken into a higher prospective, the point were progressive rock can't be called pretentious but remarkable nor snooze cultural but sincere and direct, and the unique characteristic where music fits with the listener and musicians in the most pure way; not their most complex neither their most cultural one, but the most pure."