# Summarizing Music Reviews with Graphical Models
-------------------------------------------------

## Overview
Every day, digital shoppers across the globe write hundreds of thousands of product reviews, for products both new and old. For a digital retailer or brand, it is critical to understand not just the sentiment of this feedback but also the core concepts that customers write about. However, content is now generated faster than the marketing and merchandising teams at these organizations can read it.


Dynamic content summarization techniques make it possible to programmatically identify core concepts within natural text and to use this insight to condense large amounts of text into information-dense summaries. In this talk, we will explore the current state of the art in content summarization by implementing TopicRank, a graph-based keyword extraction algorithm, on music reviews, and then use the extracted concepts to automatically summarize all of the reviews for a given album.

## Notebook Overview
Below is a start-to-finish walkthrough of a method for finding the 10 most relevant sentences in a corpus of music reviews. In particular, we will summarize the reviews of Pink Floyd's The Dark Side of the Moon.

1. Load the raw review data set (Amazon music reviews, ~1m reviews)
2. Find and parse sentences related to The Dark Side of the Moon
3. Prepare word embeddings from music corpus
4. Create a sentence graph
5. Compute PageRank on that graph
6. Look at a few example results
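
Steps 4 and 5 can be illustrated without Spark. The following is a minimal sketch under stated assumptions, not the notebook's implementation: bag-of-words vectors stand in for the trained word embeddings, and a hand-written power iteration stands in for GraphX PageRank. The names `SentenceRankSketch`, `similarityGraph`, `pageRank`, and `topSentences` are hypothetical and chosen only for this illustration.

```scala
// Toy sketch: bag-of-words vectors replace word embeddings (assumption),
// and a plain power iteration replaces GraphX PageRank (assumption).
object SentenceRankSketch {
  type Vec = Map[String, Double]

  // Lower-case the sentence and count its words to get a term-frequency vector.
  def bagOfWords(sentence: String): Vec =
    sentence.toLowerCase.split("[^a-z]+").filter(_.nonEmpty)
      .groupBy(identity).map { case (w, ws) => w -> ws.length.toDouble }

  // Cosine similarity between two sparse vectors; 0.0 if either is empty.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.iterator.map { case (w, x) => x * b.getOrElse(w, 0.0) }.sum
    val norm = math.sqrt(a.values.map(x => x * x).sum) *
      math.sqrt(b.values.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Dense symmetric weight matrix. The diagonal is zero so that a sentence
  // does not vote for itself.
  def similarityGraph(sentences: IndexedSeq[String]): Array[Array[Double]] = {
    val vecs = sentences.map(bagOfWords)
    Array.tabulate(sentences.length, sentences.length) { (i, j) =>
      if (i == j) 0.0 else cosine(vecs(i), vecs(j))
    }
  }

  // Weighted PageRank by power iteration. Rank held by nodes without any
  // outgoing weight is redistributed uniformly, so the ranks always sum to 1.
  def pageRank(w: Array[Array[Double]],
               damping: Double = 0.85,
               iterations: Int = 50): Array[Double] = {
    val n = w.length
    val outWeight = w.map(_.sum)
    var rank = Array.fill(n)(1.0 / n)
    for (_ <- 0 until iterations) {
      val dangling = (0 until n).filter(outWeight(_) == 0.0).map(rank(_)).sum
      rank = Array.tabulate(n) { j =>
        val incoming = (0 until n).filter(i => outWeight(i) > 0.0)
          .map(i => rank(i) * w(i)(j) / outWeight(i)).sum
        (1 - damping) / n + damping * (incoming + dangling / n)
      }
    }
    rank
  }

  // Rank all sentences and keep the k highest-scoring ones.
  def topSentences(sentences: IndexedSeq[String], k: Int): Seq[String] = {
    val scores = pageRank(similarityGraph(sentences))
    sentences.zip(scores).sortBy(-_._2).take(k).map(_._1)
  }
}
```

Calling `SentenceRankSketch.topSentences(sentences, 10)` on a vector of sentence strings mirrors the "top 10" goal above. A dense matrix is only practical for small inputs; for thousands of sentences a sparse, distributed graph is the more likely choice.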

In [18]:
/*
 *  Environment Setup
 *  ========================
 *  - Jupyter-Scala (https://github.com/alexarchambault/jupyter-scala)
 */
import $exclude.`org.slf4j:slf4j-log4j12`, $ivy.`org.slf4j:slf4j-nop:1.7.21` // for cleaner logs
import $profile.`hadoop-2.6`
import $ivy.`org.apache.spark::spark-sql:2.1.0` // adjust spark version - spark >= 2.0
import $ivy.`org.apache.hadoop:hadoop-aws:2.6.4`
import $ivy.`org.jupyter-scala::spark:0.4.0` // for JupyterSparkSession (SparkSession aware of the jupyter-scala kernel)

// General spark imports
import org.apache.spark._
import org.apache.spark.sql._
import jupyter.spark.session._

// Create sessions
val spark = JupyterSparkSession.builder() // important - call this rather than SparkSession.builder()
  .jupyter() // this method must be called straightaway after builder()
  // .yarn("/etc/hadoop/conf") // optional, for Spark on YARN - argument is the Hadoop conf directory
  // .emr("2.6.4") // on AWS Elastic MapReduce, this adds AWS-related JARs to the Spark jar list
  .master("local[*]") // change to "yarn-client" on YARN
  .config("spark.driver.memory", "8g")
  .config("spark.executor.memory", "8g")
  .appName("Graph-based Review Summarization")
  .getOrCreate()

// Access underlying spark context (for backwards compatibility)
val sc = spark.sparkContext
val sqlContext = spark.sqlContext

[32mimport [39m[36m$exclude.$                        , $ivy.$                            // for cleaner logs
[39m
[32mimport [39m[36m$profile.$           
[39m
[32mimport [39m[36m$ivy.$                                   // adjust spark version - spark >= 2.0
[39m
[32mimport [39m[36m$ivy.$                                   
[39m
[32mimport [39m[36m$ivy.$                                // for JupyterSparkSession (SparkSession aware of the jupyter-scala kernel)

// General spark imports
[39m
[32mimport [39m[36morg.apache.spark._
[39m
[32mimport [39m[36morg.apache.spark.sql._
[39m
[32mimport [39m[36mjupyter.spark.session._

// Create sessions
[39m
[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@73907edd
[36msc[39m: [32mSparkContext[39m = org.apache.spark.SparkContext@d3a7cce
[36msqlContext[39m: [32mSQLContext[39m = org.apache.spark.sql.SQLContext@172150da

In [None]:
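// Load special ML / NLP libraries via the kernel's interp API
interp.load.ivy("org.apache.spark" %% "spark-mllib" % "2.1.0") // keep in sync with the spark-sql version
interp.load.ivy("org.apache.spark" %% "spark-graphx" % "2.1.0") // keep in sync with the spark-sql version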
interp.load.ivy("org.scalanlp" %% "breeze" % "0.13")
interp.load.ivy("edu.stanford.nlp" % "stanford-corenlp" % "3.6.0")
//interp.load.ivy("edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models") -- throws error (https://github.com/alexarchambault/jupyter-scala/issues/128)
interp.load.ivy("com.google.protobuf" % "protobuf-java" % "2.6.1")

// Spark SQL
import sqlContext._
import sqlContext.implicits._

// ML imports
import breeze.linalg._
import org.apache.spark.mllib.linalg.Vectors

// Graph imports
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

// NLP imports
import edu.stanford.nlp.simple.Document

## Load data and parse sentences
We will be working with Amazon review data made available by [UCSD](http://jmcauley.ucsd.edu/data/amazon/). We need to load it into a Spark DataFrame, find the reviews for our target album (ASIN B000000IRB), extract the raw text, and use the Stanford CoreNLP parser to get an enriched version of the sentences.

In [19]:
// Load music reviews into a DataFrame and expose them to Spark SQL.
// Both imports are required here: without spark.implicits._ the .map call fails with
// "Unable to find encoder", and without the CoreNLP import Document is "not found".
import spark.implicits._
import edu.stanford.nlp.simple.Document // stanford-corenlp must already be loaded via interp.load.ivy

val music_reviews = spark.read.json("file:///home/garrett/dev/data/amazon/music/reviews_CDs_and_Vinyl_5.json")
music_reviews.createOrReplaceTempView("reviews")

// Create merged review document for the target album
val document = spark.sql("SELECT reviewText FROM reviews WHERE asin = 'B000000IRB'").map(r => r.getString(0)).collect().mkString("\n\n")

// Get sentences
val sentences = new Document(document).sentences() // 8077 sentences
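
The filter, merge, and split steps described above can also be tried on a handful of in-memory records. This is a rough sketch under stated assumptions rather than the notebook's method: `java.text.BreakIterator` stands in for the Stanford CoreNLP sentence splitter, and the `Review` case class, `mergeReviews`, and `splitSentences` are hypothetical names that keep only the `asin` and `reviewText` fields of the data set.

```scala
import java.text.BreakIterator
import java.util.Locale

// Sketch only: BreakIterator is a stand-in for CoreNLP sentence splitting (assumption).
object ReviewSentencesSketch {
  // Hypothetical record type with just the two fields used in this step.
  final case class Review(asin: String, reviewText: String)

  // Keep the reviews for one product and join them into a single document.
  def mergeReviews(reviews: Seq[Review], asin: String): String =
    reviews.filter(_.asin == asin).map(_.reviewText).mkString("\n\n")

  // Split a document into trimmed, non-empty sentences.
  def splitSentences(text: String): Vector[String] = {
    val it = BreakIterator.getSentenceInstance(Locale.US)
    it.setText(text)
    val out = Vector.newBuilder[String]
    var start = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
      val sentence = text.substring(start, end).trim
      if (sentence.nonEmpty) out += sentence
      start = end
      end = it.next()
    }
    out.result()
  }
}
```

Unlike the CoreNLP `Document`, this yields plain strings with no tokens, lemmas, or part-of-speech tags, so it is only useful for checking the filtering and merging logic.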

In [17]:
spark.sqlContext

[36mres16[39m: [32mSQLContext[39m = org.apache.spark.sql.SQLContext@172150da