# Sentiment Analysis on Yelp Reviews

Sentiment analysis is the process of identifying the emotion expressed in a text, here classified as positive or negative.

In this notebook we use Spark and its MLlib library to build a sentiment analysis model from the dataset published by Yelp, a very popular application for reviewing venues and businesses.


# Initializing Spark

In [1]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

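# Run Spark locally and request 5 GB of driver memory
conf = SparkConf().setAppName("sentiment_analysis").setMaster("local").set("spark.driver.memory", "5g")
# Reuse the active SparkContext if one already exists
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)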

# Loading the dataset into a DataFrame

In [2]:
yelp_df = sqlContext.read.json('./data/yelp_academic_dataset_review.json')

In [3]:
yelp_df.count()

6685900

In [4]:
yelp_df.columns

['business_id',
 'cool',
 'date',
 'funny',
 'review_id',
 'stars',
 'text',
 'useful',
 'user_id']

In [5]:
reviews_df = yelp_df.select(["text", "stars"])

In [6]:
# Keep roughly 1% of the reviews, sampling without replacement
subreviews_df = reviews_df.sample(withReplacement=False, fraction=0.01, seed=0)

In [7]:
subreviews_df.count()

67136
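The count above is close to, but not exactly, 1% of 6,685,900 (which would be 66,859). A minimal sketch of why, assuming that sampling without replacement keeps each row independently with probability `fraction` (Bernoulli sampling). `bernoulli_sample` is a hypothetical plain-Python helper written for illustration, not Spark's implementation:

```python
import random

def bernoulli_sample(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = range(100_000)
sample = bernoulli_sample(rows, 0.01)
# The size lands near 1,000 but changes with the seed; it is not fixed at exactly 1,000.
print(len(sample))
```

Because the fraction is an expected value rather than a quota, the sample size should always be read back with `count()` instead of being computed from the fraction.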

# Text preprocessing

In [8]:
import string

def remove_punct(text):
    return text.translate(str.maketrans('','', string.punctuation))

remove_punct(".... che cacchio dici, Antonio !!!1!")

' che cacchio dici Antonio 1'
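`remove_punct` assumes it always receives a string. If a review's `text` were missing, the function would receive `None` and raise an `AttributeError`. A minimal null-safe sketch follows; `remove_punct_safe` is a hypothetical name, and the presence of missing texts in the dataset is an assumption, not something observed above:

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_safe(text):
    """Strip ASCII punctuation; pass None through unchanged."""
    if text is None:
        return None
    return text.translate(_PUNCT_TABLE)

print(remove_punct_safe("Great food, great service!"))  # Great food great service
print(remove_punct_safe(None))  # None
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes and similar symbols are left in place by either version.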

In [9]:
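from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap remove_punct as a Spark UDF that returns a string column
punct_remove = udf(remove_punct, StringType())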

In [10]:
subreviews_df = subreviews_df.withColumn("text", punct_remove(subreviews_df["text"]))

In [11]:
subreviews_df.show(20, False)


# Tokenization

In [13]:
from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol='text', outputCol="words")
words_df = tokenizer.transform(subreviews_df)

words_df.show()

+--------------------+-----+--------------------+
|                text|stars|               words|
+--------------------+-----+--------------------+
|So many great thi...|  5.0|[so, many, great,...|
|If I could give t...|  1.0|[if, i, could, gi...|
|Lovely food I am ...|  5.0|[lovely, food, i,...|
|Weve always been ...|  4.0|[weve, always, be...|
|After agonizing o...|  4.0|[after, agonizing...|
|Even though a few...|  4.0|[even, though, a,...|
|No stars yes no s...|  1.0|[no, stars, yes, ...|
|Not bad at all  T...|  3.0|[not, bad, at, al...|
|So arrived at 530...|  1.0|[so, arrived, at,...|
|Excellent Excelle...|  5.0|[excellent, excel...|
|We go to Vegas at...|  5.0|[we, go, to, vega...|
|Good flavor fast ...|  5.0|[good, flavor, fa...|
|First time here a...|  5.0|[first, time, her...|
|I was shocked tha...|  1.0|[i, was, shocked,...|
|Large portions Gr...|  5.0|[large, portions,...|
|This is a great l...|  5.0|[this, is, a, gre...|
|Always good  One ...|  5.0|[always, good, , ...|
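The empty token in the last row above (`[always, good, , ...`) comes from the double space left behind when punctuation was removed. A rough plain-Python sketch of that behaviour, assuming `Tokenizer` lowercases the text and splits on every single whitespace character; `simple_tokenize` is a hypothetical helper, not Spark code:

```python
import re

def simple_tokenize(text):
    """Lowercase, then split on each individual whitespace character."""
    return re.split(r"\s", text.lower())

# Two consecutive spaces yield an empty string between them
print(simple_tokenize("Always good  One"))  # ['always', 'good', '', 'one']
```

Splitting on runs of whitespace instead (in plain Python, `text.lower().split()`) would avoid the empty tokens. In Spark, `RegexTokenizer` with a pattern such as `\s+` is one option to consider for the same purpose.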


# Stop-word removal

In [16]:
from pyspark.ml.feature import StopWordsRemover

stopwords = StopWordsRemover(inputCol="words", outputCol="filtered")
words_df = stopwords.transform(words_df)

words_df.show(20)

+--------------------+-----+--------------------+--------------------+
|                text|stars|               words|            filtered|
+--------------------+-----+--------------------+--------------------+
|So many great thi...|  5.0|[so, many, great,...|[many, great, yog...|
|If I could give t...|  1.0|[if, i, could, gi...|[give, place, 0, ...|
|Lovely food I am ...|  5.0|[lovely, food, i,...|[lovely, food, al...|
|Weve always been ...|  4.0|[weve, always, be...|[weve, always, cu...|
|After agonizing o...|  4.0|[after, agonizing...|[agonizing, one, ...|
|Even though a few...|  4.0|[even, though, a,...|[even, though, fl...|
|No stars yes no s...|  1.0|[no, stars, yes, ...|[stars, yes, star...|
|Not bad at all  T...|  3.0|[not, bad, at, al...|[bad, , standard,...|
|So arrived at 530...|  1.0|[so, arrived, at,...|[arrived, 530, sa...|
|Excellent Excelle...|  5.0|[excellent, excel...|[excellent, excel...|
|We go to Vegas at...|  5.0|[we, go, to, vega...|[go, vegas, least...|
|Good 
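In the output above, common words such as `so`, `if`, `i` and `could` are gone from `filtered`, while the empty token survives (`[bad, , standard, ...`). A plain-Python sketch of the filtering step, using a tiny stop-word set chosen for illustration; `STOP_WORDS` and `drop_stop_words` are hypothetical, and Spark's default English list is much longer:

```python
# Tiny illustrative stop-word set; not Spark's default list
STOP_WORDS = {"so", "if", "i", "could", "at", "not", "all", "a", "the"}

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

print(drop_stop_words(["so", "many", "great", "places"]))  # ['many', 'great', 'places']
# The empty string is not a stop word, so it passes through
print(drop_stop_words(["not", "bad", "at", "all", "", "standard"]))  # ['bad', '', 'standard']
```

Empty tokens could be dropped in the same pass by adding `""` to the stop-word set.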

# Building a bag-of-words model
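A bag-of-words model represents each document by how often each vocabulary term occurs in it, ignoring word order. The `features` column in the output below prints each vector in sparse form, for example `(93719,[0,1,3,4,5...`: the vocabulary size followed by the indices of the terms that are present. A minimal plain-Python sketch of the idea, assuming the vocabulary is ordered by descending term frequency; `fit_vocabulary` and `to_sparse` are hypothetical helpers, and the alphabetical tie-breaking is a choice made here for determinism, not a documented Spark behaviour:

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each term to an index, most frequent term first."""
    counts = Counter(term for doc in docs for term in doc)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: index for index, term in enumerate(ordered)}

def to_sparse(doc, vocabulary):
    """Return (vocabulary size, sorted term indices, matching counts)."""
    counts = Counter(vocabulary[term] for term in doc if term in vocabulary)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

docs = [["good", "food", "good", "service"], ["bad", "food"]]
vocabulary = fit_vocabulary(docs)
print(vocabulary)                      # {'food': 0, 'good': 1, 'bad': 2, 'service': 3}
print(to_sparse(docs[0], vocabulary))  # (4, [0, 1, 3], [1.0, 2.0, 1.0])
```

Storing only the non-zero entries matters here because each review uses a tiny fraction of a vocabulary with tens of thousands of terms.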

In [18]:
from pyspark.ml.feature import CountVectorizer

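# Note: inputCol="words" counts the unfiltered tokens; pass "filtered" to use the stop-word-free column
cv = CountVectorizer(inputCol="words", outputCol="features")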
cv_model = cv.fit(words_df)
cv_df = cv_model.transform(words_df)

cv_df.show(20)

Exception ignored in: <function JavaWrapper.__del__ at 0x7fecc95bcc20>
Traceback (most recent call last):
  File "/home/sparkmachine/spark/python/pyspark/ml/wrapper.py", line 40, in __del__
    if SparkContext._active_spark_context and self._java_obj is not None:
AttributeError: 'StopWordsRemover' object has no attribute '_java_obj'


+--------------------+-----+--------------------+--------------------+--------------------+
|                text|stars|               words|            filtered|            features|
+--------------------+-----+--------------------+--------------------+--------------------+
|So many great thi...|  5.0|[so, many, great,...|[many, great, yog...|(93719,[0,1,3,4,5...|
|If I could give t...|  1.0|[if, i, could, gi...|[give, place, 0, ...|(93719,[0,1,3,5,6...|
|Lovely food I am ...|  5.0|[lovely, food, i,...|[lovely, food, al...|(93719,[1,3,4,5,7...|
|Weve always been ...|  4.0|[weve, always, be...|[weve, always, cu...|(93719,[0,1,3,4,5...|
|After agonizing o...|  4.0|[after, agonizing...|[agonizing, one, ...|(93719,[0,1,2,4,5...|
|Even though a few...|  4.0|[even, though, a,...|[even, though, fl...|(93719,[0,1,3,4,5...|
|No stars yes no s...|  1.0|[no, stars, yes, ...|[stars, yes, star...|(93719,[0,1,2,3,4...|
|Not bad at all  T...|  3.0|[not, bad, at, al...|[bad, , standard,...|(93719,[0,