# SIADS 516: Homework 3
Version 1.0.20200303.2
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

This homework assignment builds on the Spark DataFrame material we covered in class.
You will be using a compressed version of the Yelp Academic Dataset.  The data set is provided for you in the data/yelp-academic sub-folder of this notebook's directory and you should not need to download it again if you're working on the Coursera hosted notebook environment.

You might want to refer to the lecture companion notebooks (in workspace-files/resources/lecture_notebooks or equivalently via Coursera as "Ungraded Lab: Spark Core Demo" and "Ungraded Lab: Spark SQL Demo) for hints about libraries to import, how to set up a SparkSession, and how to read data files.

You will notice that there are a **lot** of reviews.  You might want to work off a small sample (i.e. use the sample() function in Spark) to work on a reduced size dataset while you're developing your solution.

**You should take care to document your work, preferably using markdown blocks. In-code commenting is also 
a good idea.**

### <font color="magenta">Q1: How many users have received more than 5000 cool compliments?</font>

In [8]:
#Let's set up a spark session to start off.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('HW 3 Spark Application') \
    .getOrCreate() 

sc = spark.sparkContext

#Now, let's import the yelp json files as a dataframe to begin answering Q1. We'll use the user df.

# yelp_user_df = spark.read.json('../non_auto_assignments/data/yelp_academic/yelp_academic_dataset_user.json.gz')
# yelp_business_df = ('../non_auto_assignments/data/yelp_academic/yelp_academic_dataset_business.json.gz')
# yelp_tips_df = ('../non_auto_assignments/data/yelp_academic/yelp_academic_dataset_tip.json.gz')
# yelp_checkin_df = spark.read.json('../non_auto_assignments/data/yelp_academic/yelp_academic_dataset_checkin.json.gz')
# yelp_review_df = spark.read.json('../non_auto_assignments/data/yelp_academic/yelp_academic_dataset_review.json.gz')

In [6]:
#Let's check out the schema as an FYI.
yelp_user_df = spark.read.json('../non_auto_assignments/data/yelp_academic/yelp_academic_dataset_user.json.gz')
yelp_user_df.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)



In [7]:
#Now, let's retrieve the amount of users that have received > 5000 cool compliments by using the .filter function.
cool_compliments = yelp_user_df.filter(yelp_user_df['compliment_cool'] > 5000).count()

#Let's print the results to the console.
cool_compliments










79

By utilizing the .filter function and boolean masking on the compliment_cool column, we are able to determine that the number of users that have received the compliment 'cool' over five-thousand times is seventy-nine. 

### <font color="magenta">Q2: What are the names of the top 10 most complimented businesses?</font>

In [11]:
#NOTE: We will ignore this question for this assignment.

Insert your interpretation here.

### <font color="magenta">Q3: What are the top 10 most useful positive reviews?</font>

In [9]:
# For This question, we'll use the yelp_review_df again.

yelp_review_df = spark.read.json('../non_auto_assignments/data/yelp_academic/yelp_academic_dataset_review.json.gz')

yelp_review_df.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: double (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)



In [10]:
#Since the meaning of positive is somewhat subjective, for the sake of concreteness, we'll consider 5-star reviews.

yelp_review_df_positive = yelp_review_df.filter(yelp_review_df['stars'] == 5)

#Now, we'll sort by the useful column in descending order, and also display the user_id, business id, stars, and review_id.

yelp_review_df_positive = yelp_review_df_positive.sort('useful', ascending = False)

yelp_review_df_positive.select(['business_id', 'user_id', 'review_id', 'useful', 'stars']).show(10)

+--------------------+--------------------+--------------------+------+-----+
|         business_id|             user_id|           review_id|useful|stars|
+--------------------+--------------------+--------------------+------+-----+
|t-o_Sraneime4DDhW...|aIbYxOV_3dBIPUcnl...|1lGXlyq4MALOMx17v...|   358|  5.0|
|6Suj9mb9565xjAKHM...|d3U8ftbUpjuPQbacW...|gAUkgn4dTO-R2n5LB...|   278|  5.0|
|IapQwLdAwztQYN99p...|UUqGHQFu2tQDGv5r3...|0nr6SQFKpR6JCYl1z...|   241|  5.0|
|Ka00H3EHLLiPNpMj1...|--2vR0DIsmQ6WfcSz...|tTs6vjzf5Mvhs4aOA...|   215|  5.0|
|igHYkXZMLAc9UdV5V...|7W-p-PJlmrzg0mk3p...|PAN7D4F6gzHELLMKs...|   215|  5.0|
|3XsOOHcDC-XP8LPbp...|7W-p-PJlmrzg0mk3p...|vTMgyKHKHpWR0u8kN...|   210|  5.0|
|PyKr1RX29U21B8NZq...|0o8HUzggoNKay9-ZM...|jZ7GeY_viZuYT2dkd...|   208|  5.0|
|t-o_Sraneime4DDhW...|--2vR0DIsmQ6WfcSz...|CC0kHI2mVkdsQWVUx...|   207|  5.0|
|7dHYudt6OOIjiaxkS...|--2vR0DIsmQ6WfcSz...|K8LGQQyUEPYjYuh6H...|   207|  5.0|
|4ONpzAtnKbDig_e_O...|--2vR0DIsmQ6WfcSz...|rqeYJ-F26J87InZbK...|

When setting a minimum threshold for positive reviews at five stars, the most useful review was noted as useful a total of three hundred and fifty-eight times.

### <font color="magenta">Q4: During what hour of the day do most checkins occur?</font>

In [2]:
#In order to begin answering this question, we'll import the checkin.json file as a df. 
yelp_checkin_df = spark.read.json('../non_auto_assignments/data/yelp_academic/yelp_academic_dataset_checkin.json.gz')


#As usual, let's print the schema as an FYI.
yelp_checkin_df.printSchema()







root
 |-- business_id: string (nullable = true)
 |-- date: string (nullable = true)



In [16]:
#Let's check out the date column, so we get a better understanding of how it's formatted.
yelp_checkin_df.select('date').show(5)

+--------------------+
|                date|
+--------------------+
|2016-04-26 19:49:...|
|2011-06-04 18:22:...|
|2014-12-29 19:25:...|
| 2016-07-08 16:43:30|
|2010-06-26 17:39:...|
+--------------------+
only showing top 5 rows



In [36]:


#Lets first split the 'date' column into two seperate ones, one with just the date and one with just hour:minutes:seconds.
yelp_checkin_df_split = yelp_checkin_df.withColumn("just_date", F.split(F.col("date"), " ").getItem(0)).withColumn("just_time", F.split(F.col("date"), " ").getItem(1))

#Now let's remove the extraneous comma from the just time field.
yelp_checkin_df_split = yelp_checkin_df_split.select("just_time", F.regexp_replace(F.col("just_time"), "[\$#,]", "").alias("just_time_no_comma"))

#Now we can try to get the hour of the day from this, using the 24 hour clock.
yelp_checkin_df_split = yelp_checkin_df_split.withColumn('just_hour_only',F.hour('just_time_no_comma'))

#Now, utilizing the just_hour_only column, we can count the occurrences of each hour, and sort in descending order.
yelp_checkin_df_split.groupBy('just_hour_only').count().sort('count', ascending=False).show(10)

+--------------+-----+
|just_hour_only|count|
+--------------+-----+
|            19|13481|
|            23|13207|
|            22|13191|
|            18|13177|
|            21|12960|
|            20|12553|
|            17|12304|
|             0|11577|
|            16|10416|
|             1| 9803|
+--------------+-----+
only showing top 10 rows



Based on the output, the most common time that customers are checking in to businesses is 19:00, or 7:00 PM, followed by 11:00 PM.

### <font color="magenta">Q5: Sentiment analysis</font>

a. List the 50 most common non-stopword words that are unique to *positive* reviews.
b. List the 50 most common non-stopword words that are unique to *negative* reviews.

You can use the stopword list that was introduced in the lecture materials or you can 
find/devise one of your own.

You will need to define what constitutes a positive review and what constitutes a negative review.  We highly recommend that you use the number of stars to figure this out.  Be sure to provide a rationale for your choice

As an example, consider the following two reviews:

* Positive: The meal was great, and the service was the best we ever experienced.
* Negative: The meal was awful.  It was the worst thing we ever experienced.

Assume our stopwords are {'the','was','and','the','was','we','it'}

* Positive unique: {'great', 'service', 'best'}

* Negative unique: {'awful', 'worst', 'thing'}

In this example, each unique word occurs just once, so the concept of "top 50" doesn't make sense.  For your data, you'll need to count the number of times each unique word occurs.

In [4]:
#Since we've already defined a positive review dataframe, let's create a negative review dataframe.
#The negative review dataframe will consist of reviews that were allocated only one star.
yelp_review_df_negative = yelp_review_df.filter(yelp_review_df['stars'] == 1)

In [22]:
#In order to begin answering this question, we'll import the re library.
import re


#Next we'll define the stopwords list
STOPWORDS = ['i',  '-', '&', "I've", "I'm", '2', 'it.' "it's",  'we', 'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']
STOPWORDS_UPPER = [word.title() for word in STOPWORDS]

#Let's create a count dataframe for positive
count_df_positive = yelp_review_df_positive.withColumn('word', F.explode(F.split(F.col('text'), ' ')))\
    .groupBy('word')\
    .count()\
    .sort('count', ascending=False)


#Let's create a count dataframe for negative
count_df_negative = yelp_review_df_negative.withColumn('word', F.explode(F.split(F.col('text'), ' ')))\
    .groupBy('word')\
    .count()\
    .sort('count', ascending=False)




#Now let's filter out the stopwords from the dataframe, while also filtering out any potential uppercase duplicates.
count_df_positive = count_df_positive.filter(~F.col('word').isin(STOPWORDS))
count_df_positive = count_df_positive.filter(~F.col('word').isin(STOPWORDS_UPPER))

count_df_negative = count_df_negative.filter(~F.col('word').isin(STOPWORDS))
count_df_negative = count_df_negative.filter(~F.col('word').isin(STOPWORDS_UPPER))







In [25]:
#Outputting top 51 since the first row just displays a count of null values. We want 50 actual words
count_df_positive.show(51)

+----------+-------+
|      word|  count|
+----------+-------+
|          |4033319|
|     great| 986721|
|     place| 978963|
|      food| 753989|
|       get| 668976|
|      good| 645730|
|      like| 644250|
|      time| 635453|
|       one| 557158|
|    really| 533162|
|        go| 531548|
|    always| 531538|
|   service| 529984|
|      best| 518238|
|     would| 515217|
|      back| 509980|
|      also| 480754|
|      love| 445954|
| recommend| 400349|
|definitely| 390759|
|       got| 388276|
|     staff| 374825|
|        us| 366962|
|      even| 365423|
|      it's| 360287|
|      made| 338603|
|  friendly| 334469|
|      nice| 324736|
|      come| 316480|
|     Great| 314480|
|      make| 308374|
|     first| 303408|
|       try| 292827|
|    little| 285873|
|      came| 284645|
|       new| 282948|
|     don't| 276007|
|     never| 274331|
|   amazing| 271350|
|     going| 265211|
|      went| 263916|
|     every| 260266|
|     could| 258111|
|      much| 252507|
|      ever| 

In [26]:
#Outputting top 51 rows since the first row just displays a count of null values. We want 50 actual words
count_df_negative.show(51)

+--------+-------+
|    word|  count|
+--------+-------+
|        |2069997|
|   would| 500736|
|     get| 484152|
|    like| 385264|
|     one| 381989|
|    told| 353544|
|    back| 350734|
|    said| 342816|
|   place| 339902|
|    time| 339723|
|    even| 332486|
|    food| 325432|
|      us| 302851|
|      go| 301994|
|   never| 297743|
| service| 296364|
|  didn't| 272059|
|     got| 267761|
|   don't| 257549|
|   asked| 253132|
|   could| 250926|
|    went| 215631|
|    came| 211249|
|   going| 192374|
|   order| 190095|
| minutes| 184698|
|  called| 183210|
|  people| 182957|
|    good| 178774|
|customer| 178563|
|    know| 178459|
| another| 173658|
|    give| 167998|
|    take| 167166|
|    come| 165141|
|    took| 163506|
| ordered| 157989|
|    make| 157014|
|   still| 153048|
|    call| 149798|
|    it's| 148363|
|     it.| 147790|
|  really| 146852|
|   first| 146311|
|    want| 145889|
|     car| 143276|
|     two| 139820|
|    ever| 137625|
| manager| 131854|
|     see| 1

For the sake of this analysis, in order to capture the more salient differences between negative and positive reviews, I manually set the minimum standard for 'positive reviews' at an allocation of 5 stars, while I considered all 1 star reviews 'negative'. I chose values on the opposite end of the customer experience spectrum based on my own intuition, since values in between these values(1.5 - 4.5) could be considered as mixed bags, in the sense that reviewers may still compose a largely positive or largely negative review at one of these intermediate values, but still include a few minor points that detracted or added to their experience, resulting in a review of more mixed sentiment, thus giving rise to the increased potentiality of duplciate describers. 



Upon viewsing the output of count_df_positive, several clear positive descriptors stand out. Words like "amazing", "great", "good", "friendly", and, "recommend", and  "nice" are utilized, with 'great' holding the number one spot. The words "staff" and "service" are also called out, likely used in conjunction with the adjectivses above. The words "would", "go", and "back" are also identified as highly utilized, and one can infer that they make up the string "would go back." If a customer has an emphatically positive experience, they would naturally frequent it, and urge other customers to do the same. 


The words listed in the output from the count_df_negative dataframe is much less abrasive than I thought, having anticipated to see adjectives with highly negative connotations, like "awful" or "terrible." In any case, it is possible to infer that the output comes from the negative dataframe. The contractions 'don't' and 'didn't' appear in the list, along with the words "never" and "got," all suggesting an experience that was lacking in some fashion. The word "manager" also appears in the output, whereas it does not appear in the output from the positive dataframe. This may suggest that customer experiences involving the manager of an establishment are, by and large, negative. 
