
# Amazon Reviews Data exploration


In this exercise notebook, you will work with the Amazon reviews dataset hosted on AWS S3. The dataset was already downloaded during deployment of the EMR cluster. Read the data into a Spark SQL DataFrame and explore it.


First create an SQLContext instance; you need one before you can do anything else in this notebook. To create a basic instance, all we need is a SparkContext reference. Since we are running Spark in shell mode (using PySpark), we can use the global context object sc for this purpose.

In [1]:
sqlContext = SQLContext(sc)

# Dataframe in PySpark: Overview

In Apache Spark, a DataFrame is a distributed collection of rows organized under named columns. In simple terms, it is like a table in a relational database or an Excel sheet with column headers. It also shares some characteristics with the RDD:

**Immutable in nature**: Once created, a DataFrame / RDD cannot be changed. Applying a transformation returns a new DataFrame / RDD and leaves the original untouched.


**Lazy evaluation**: A task is not executed until an action is performed.


**Distributed**: Both RDDs and DataFrames are distributed in nature.
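
As a rough sketch of how these properties show up in code, the hypothetical helper below chains two transformations and one action. It assumes `df` is the reviews DataFrame loaded later in this notebook; the function name and the default threshold are illustrative only.

```python
def count_low_scores(df, threshold=3.0):
    """Count reviews scoring below `threshold` (hypothetical helper)."""
    # Transformation: returns a NEW DataFrame; `df` itself is unchanged.
    low = df.where(df['score'] < threshold)
    # Transformation: still nothing has run, Spark has only extended its plan.
    pairs = low.select('userId', 'score')
    # Action: count() triggers the distributed job and returns a Python int.
    return pairs.count()

# Usage inside the notebook, once df exists:
#   count_low_scores(df)
```

Only the final count() call causes any data to be read; the two lines before it merely describe what should happen.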

# Loading From JSON

In [2]:
df = sqlContext.read.json("/user/hadoop/Datasets/reviews.json")

Now let's view the schema of our data.

In [3]:
df.printSchema()

root
 |-- helpfuless_count: long (nullable = true)
 |-- helpfuless_score: long (nullable = true)
 |-- price: string (nullable = true)
 |-- productId: string (nullable = true)
 |-- profileName: string (nullable = true)
 |-- score: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- text: string (nullable = true)
 |-- time: string (nullable = true)
 |-- title: string (nullable = true)
 |-- userId: string (nullable = true)



To get the column names we can use the columns attribute of the DataFrame, just as we would with a pandas DataFrame. Let's print the number of columns and then the column names.

In [4]:
print(len(df.columns))
df.columns

11


['helpfuless_count', 'helpfuless_score', 'price', 'productId', 'profileName', 'score', 'summary', 'text', 'time', 'title', 'userId']

# Exploring the Data

Our data is now loaded into a dataframe that we named df, with all the dtypes inferred. First we'll count the number of rows it found:

In [5]:
df.count()

34686770

Then we look at the column-by-column dtypes the system estimated:

In [6]:
df.dtypes

[('helpfuless_count', 'bigint'), ('helpfuless_score', 'bigint'), ('price', 'string'), ('productId', 'string'), ('profileName', 'string'), ('score', 'string'), ('summary', 'string'), ('text', 'string'), ('time', 'string'), ('title', 'string'), ('userId', 'string')]

For each pairing (a tuple object in Python, denoted by the parentheses), the first entry is the column name and the second is the dtype.
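
Because dtypes is a plain Python list of tuples, ordinary Python tools work on it. The sketch below copies the output shown above into a literal (in the notebook you would use df.dtypes directly) and picks out the columns that were inferred as strings; the variable names are illustrative.

```python
# Literal copy of the df.dtypes output above; in the notebook use df.dtypes.
dtypes = [('helpfuless_count', 'bigint'), ('helpfuless_score', 'bigint'),
          ('price', 'string'), ('productId', 'string'),
          ('profileName', 'string'), ('score', 'string'),
          ('summary', 'string'), ('text', 'string'), ('time', 'string'),
          ('title', 'string'), ('userId', 'string')]

# Turn the (name, dtype) pairs into a lookup table.
type_of = dict(dtypes)

# Unpack each tuple to collect the columns inferred as strings.
string_cols = [name for name, dtype in dtypes if dtype == 'string']

print(type_of['score'])   # string
print(len(string_cols))   # 9
```

Note that score and price came back as strings even though they hold numbers, which matters for the summary statistics and comparisons later on.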

Take a peek at two rows:

In [7]:
df.take(2)

[Row(helpfuless_count=7, helpfuless_score=7, price=u' unknown', productId=u' B000179R3I', profileName=u' Jeanmarie Kabala "JP Kabala"', score=u' 4.0', summary=u' Periwinkle Dartmouth Blazer', text=u' I own the Austin Reed dartmouth blazer in every color in which they make it-- it is a staple of my business wardrobe. Well made, quality fabric, nicely tailored, classic lines, appropriate for a professional woman. (something that can be hard to find at times) It should be noted, however, that the periwinkle and raspberry colors are lovely, but the fabric and buttons are slightly different than the "classic" colors(lighter) and the linings and interfacings are not as substantial as the brown, navy, camel, red and ivory. It\'s still a good value, particularly as these are colors appropriate to warmer seasons and climates, but I was a bit surprised.', time=u' 1182816000', title=u' Amazon.com', userId=u' A3Q0VJTUO4EZ56'), Row(helpfuless_count=0, helpfuless_score=0, price=u' 17.99', productId=

Each row is returned in the format column_name=value. Note that the formatting above is ugly because take doesn't try to make it pretty; it just returns the Row objects themselves. We can use show instead, which attempts to format the data better. With truncate=True (the default), show cuts strings longer than 20 characters, but because there are so many columns in this case the output is still wider than the page and each line may wrap onto the next:

In [8]:
df.show(2, truncate=True)

+----------------+----------------+--------+-----------+--------------------+-----+--------------------+--------------------+-----------+--------------------+---------------+
|helpfuless_count|helpfuless_score|   price|  productId|         profileName|score|             summary|                text|       time|               title|         userId|
+----------------+----------------+--------+-----------+--------------------+-----+--------------------+--------------------+-----------+--------------------+---------------+
|               7|               7| unknown| B000179R3I| Jeanmarie Kabala...|  4.0| Periwinkle Dartm...| I own the Austin...| 1182816000|          Amazon.com| A3Q0VJTUO4EZ56|
|               0|               0|   17.99| B000GKXY34|          M. Gingras|  5.0|          Great fun!| Got these last C...| 1262304000| Nun Chuck, Novel...|  ADX8VLDUOL7BG|
+----------------+----------------+--------+-----------+--------------------+-----+--------------------+--------------------+-----------+--------------------+---------------+

# Describe

Now we'll describe the data. Note that describe returns a new dataframe with the information, and so must have show called after it if our goal is to view it (note the nice formatting in this case). This can be called on one or more specific columns, as we do here, or the entire dataframe by passing no columns to describe:

In [9]:
df_described = df.describe('helpfuless_count', 'helpfuless_score', 'price')
df_described.show()

+-------+-----------------+------------------+------------------+
|summary| helpfuless_count|  helpfuless_score|             price|
+-------+-----------------+------------------+------------------+
|  count|         34686770|          34686770|          34686770|
|   mean|5.451863030198545|3.7175363113948054|25.637601311864604|
| stddev|22.40765220943555| 19.94850094783318| 49.64204197419118|
|    min|                0|                 0|              0.00|
|    max|            48334|             47516|           unknown|
+-------+-----------------+------------------+------------------+
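
The max of price is reported as unknown because price was read as a string column, and strings are compared character by character rather than numerically. The plain-Python sketch below reproduces the effect on three raw values taken from the sample rows in this notebook; `parse_price` is a hypothetical helper, not part of Spark.

```python
# Raw price strings as they appear in the sample rows (note the leading space).
prices = [' unknown', ' 17.99', ' 46.34']

# String comparison is lexicographic, so 'u' sorts after every digit.
print(max(prices))  # prints the string ' unknown'

def parse_price(raw):
    """Return the price as a float, or None if it is not numeric."""
    try:
        return float(raw.strip())
    except ValueError:
        return None

# Drop the non-numeric entries before taking a numeric maximum.
numeric = [p for p in map(parse_price, prices) if p is not None]
print(max(numeric))  # 46.34
```

In Spark, one option would be to cast the column first, for example df['price'].cast('double'), which should turn non-numeric values such as unknown into nulls.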



# Subsetting by Columns

One of the simplest ways to subset is to select just a few of the columns. To do so, we call the select operation on the DataFrame and pass it the column names, separated by commas.

In [10]:
from pyspark.sql.functions import col

df_select = df.select(col('userId'), col('score'))
df_select.show(5)

+---------------+-----+
|         userId|score|
+---------------+-----+
| A3Q0VJTUO4EZ56|  4.0|
|  ADX8VLDUOL7BG|  5.0|
| A3NM6P6BIWTIAE|  3.0|
|  AVCGYZL8FQQTD|  4.0|
|        unknown|  5.0|
+---------------+-----+
only showing top 5 rows

Here select returns a new DataFrame that we keep in df_select, while show only prints rows and returns None, which is why show is called on its own line rather than chained onto the assignment.

Note that show defaults to showing the first 20 rows, but here we've specified only 5. There is also a shortcut for this notation that does the same thing but is a little easier to read. We show both because they both show up frequently in Spark resources:

In [12]:
df_select = df[['userId', 'score']]
df_select.show(5)

+---------------+-----+
|         userId|score|
+---------------+-----+
| A3Q0VJTUO4EZ56|  4.0|
|  ADX8VLDUOL7BG|  5.0|
| A3NM6P6BIWTIAE|  3.0|
|  AVCGYZL8FQQTD|  4.0|
|        unknown|  5.0|
+---------------+-----+
only showing top 5 rows



Or we can do the same thing by dropping, which is convenient if we want to keep more columns than we want to drop:

In [13]:
df_drop = df_select.drop(col('score'))

In [14]:
df_drop.show(5)

+---------------+
|         userId|
+---------------+
| A3Q0VJTUO4EZ56|
|  ADX8VLDUOL7BG|
| A3NM6P6BIWTIAE|
|  AVCGYZL8FQQTD|
|        unknown|
+---------------+
only showing top 5 rows



# Subsetting by Rows

We often want to subset by rows as well, for example by specifying a condition. We first look at the summary statistics of score, then keep only the rows whose score is below the mean. Note that we have to call .show() after .describe(), because .describe() returns a new dataframe with the information.

In [15]:
df.describe('score').show()

+-------+------------------+
|summary|             score|
+-------+------------------+
|  count|          34686770|
|   mean| 4.172705011161316|
| stddev|1.2528687360830613|
|    min|               1.0|
|    max|               5.0|
+-------+------------------+



In [16]:
df_sub = df.where(df['score'] < 4.172)

In [17]:
df_sub.show(5)

+----------------+----------------+--------+-----------+--------------------+-----+--------------------+--------------------+-----------+--------------------+---------------+
|helpfuless_count|helpfuless_score|   price|  productId|         profileName|score|             summary|                text|       time|               title|         userId|
+----------------+----------------+--------+-----------+--------------------+-----+--------------------+--------------------+-----------+--------------------+---------------+
|               7|               7| unknown| B000179R3I| Jeanmarie Kabala...|  4.0| Periwinkle Dartm...| I own the Austin...| 1182816000|          Amazon.com| A3Q0VJTUO4EZ56|
|               1|               0|   17.99| B000GKXY34|     Maria Carpenter|  3.0|  more like funchuck| Gave this to my ...| 1224633600| Nun Chuck, Novel...| A3NM6P6BIWTIAE|
|               7|               7| unknown| 1882931173| Jim of Oz "jim-o...|  4.0| Nice collection ...| This is only for...|

We can repeat the same procedure for multiple conditions and columns using standard logical operators:

In [18]:
df_filter = df.where((df['score'] > 1) & (df['score'] < 4))

In [19]:
df_filter.show(5)

+----------------+----------------+--------+-----------+----------------+-----+--------------------+--------------------+-----------+--------------------+---------------+
|helpfuless_count|helpfuless_score|   price|  productId|     profileName|score|             summary|                text|       time|               title|         userId|
+----------------+----------------+--------+-----------+----------------+-----+--------------------+--------------------+-----------+--------------------+---------------+
|               1|               0|   17.99| B000GKXY34| Maria Carpenter|  3.0|  more like funchuck| Gave this to my ...| 1224633600| Nun Chuck, Novel...| A3NM6P6BIWTIAE|
|               1|               1|   46.34| B000278ADA|     Trina Wehle|  3.0| Delivery was ver...| It took almost 3...| 1352505600| Jobst Ultrasheer...|  A9Q3932GX4FX8|
|               1|               1|   46.34| B000278ADA|          dgodoy|  2.0| sizes recomended...| sizes are much s...| 1287014400| Jobst Ultra
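
The parentheses around each condition are required, not decorative: Python's & binds more tightly than comparison operators such as > and <. The plain-Python sketch below uses an integer in place of a column to show how the unparenthesized form is parsed differently; the value 5 is just an illustrative score.

```python
score = 5  # illustrative value standing in for a single review score

# Intended meaning: score is strictly between 1 and 4.
with_parens = (score > 1) & (score < 4)

# Without parentheses Python evaluates 1 & score first, so this is
# really the chained comparison score > (1 & score) < 4.
without_parens = score > 1 & score < 4

print(with_parens)     # False
print(without_parens)  # True
```

On Spark columns, use | for OR and ~ for NOT in the same way; the Python keywords and, or and not do not work on Column objects.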

# Random Sampling

And finally, you might want to take a random sample of rows. This can be particularly useful, for example, if your data is large enough to require more expensive clusters to be spun up to work with it all, and you want to use a smaller, less expensive cluster to work on a sample. Once your code is complete, you can then spin up the more expensive cluster and simply apply your code to the full dataset.

You can pass three arguments into sample: **the first is a boolean, which is True to sample with replacement, False without**. The second is the **fraction of the dataset** to take, in this case 5%, and the third is an **optional random seed**. If you specify any integer here then someone else performing the same random operation that specifies the same seed will get the same result. If no seed is passed then the exact random sampling can't be duplicated.

In [20]:
df_sample = df.sample(False, 0.05, 99)

In [21]:
df_sample.describe('score').show()

+-------+------------------+
|summary|             score|
+-------+------------------+
|  count|           1732754|
|   mean| 4.171809731791125|
| stddev|1.2531445587944696|
|    min|               1.0|
|    max|               5.0|
+-------+------------------+



If you compare this to our original summary stats on the unfiltered score column, you'll see that the sample does a pretty good job of preserving the mean and stddev using only about 5% of the data.
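
Two details of sample are worth illustrating: the fraction is an expected proportion rather than an exact count, and the seed makes the draw repeatable. The plain-Python sketch below checks the first point against the counts reported above and mimics the second with a hypothetical bernoulli_sample helper that keeps each row independently with probability fraction. It is an analogy under that assumption, not Spark's actual implementation.

```python
import random

total_rows = 34686770     # df.count() reported earlier
sampled_rows = 1732754    # count reported for df_sample
expected = total_rows * 0.05

# Relative gap between the expected and the observed sample size.
relative_gap = abs(sampled_rows - expected) / expected
print(round(relative_gap * 100, 2))  # 0.09 (percent)

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

first = bernoulli_sample(range(10000), 0.05, seed=99)
second = bernoulli_sample(range(10000), 0.05, seed=99)
print(first == second)  # True: same seed, same sample
```

The sample size lands close to 5% of the rows without matching it exactly, which is the same behaviour seen in the count of 1732754 above.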

# Counting the distinct rows
The distinct operation can be used to calculate the number of distinct rows in a DataFrame. Let's apply it to the productId column to count the distinct products in our dataset.

In [22]:
df.select('productId').distinct().count()

2441053

# Group by a column

Here is how you can group the data by a column. We use the "userId" column to group the reviews and count the number of reviews per user.

In [23]:
df.groupBy("userId").count().show()

+---------------+-----+
|         userId|count|
+---------------+-----+
| A16FTKP8BXKGHG|    1|
|  A2H06FLSHCGNX|    2|
| A29Q0CLOF0U8BN|    5|
| A38N9A0UJVYIRI|    1|
| A13ZCL7UXEDF14|  226|
| A1OAFHNE81JHOM|    1|
| A17UU8GM4NSMEJ|    6|
| A2GI02PFIHFYPK|   14|
|  AVRVACFKJ5QMM|  113|
| A290DASWU6TETY|    8|
|  ADIB6IP2IWMT4|   37|
| A2E695LJSYIX98|    4|
| A1YHAR3SQE33UA|    4|
| A1E2LXUERE1QDX|    2|
|  AY4OROA8ZNYH8|    1|
| A2BMM11E76H9WC|   59|
| A1VPXCFO261VWH|  351|
| A2PXFI7VDHQ7BU|   61|
| A3RV69EPSMI2XI|   60|
| A1OXF8GW1MUJI5|   60|
+---------------+-----+
only showing top 20 rows
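
The grouped counts come back in no particular order. A small extension, sketched below under the assumption that df is the reviews DataFrame from this notebook, sorts them to surface the most active reviewers; top_reviewers and its n parameter are hypothetical names.

```python
def top_reviewers(df, n=10):
    """Return a DataFrame with the n userIds that have the most reviews."""
    return (df.groupBy('userId')                 # one group per reviewer
              .count()                           # adds a 'count' column
              .orderBy('count', ascending=False) # largest counts first
              .limit(n))                         # still lazy until an action runs

# Usage inside the notebook:
#   top_reviewers(df, 5).show()
```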



## Running SQL Queries Programmatically

The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame. We use registerTempTable() to register the df DataFrame as a temporary table named "Reviews". The lifetime of this temporary table is tied to the SQLContext that was used to create the DataFrame.

In [24]:
# Registers the df DataFrame as a temporary table named Reviews.
df.registerTempTable("Reviews")

In [25]:
result = sqlContext.sql("SELECT count(*) as count FROM Reviews")

result.show()

+--------+
|   count|
+--------+
|34686770|
+--------+



In [26]:
result = sqlContext.sql("SELECT userId, max(score) FROM Reviews group by userId")
result.show()

+--------------------+----+
|              userId| _c1|
+--------------------+----+
| A00747652TYK4MJF...| 5.0|
| A01069451H5M05XW...| 5.0|
| A0331603K6Z5GSVD...| 5.0|
| A0434254310BU34Z...| 5.0|
| A04465292GUIUWGT...| 5.0|
| A05600701HJDE30L...| 4.0|
| A05616363SN1G36M...| 1.0|
| A0595191WNUKZV8KM3G| 5.0|
| A0707282Z8AJG3NR...| 5.0|
| A07532193EC6QULM...| 5.0|
| A08672373PELMP72...| 4.0|
|      A1000WA98BFTQB| 1.0|
|      A100I83J5W64FH| 5.0|
|       A100QWNGRXC3S| 5.0|
|      A100WTPUCJMZXR| 3.0|
|      A100ZMU8VBXEZ3| 5.0|
|      A10103MJIKKIFE| 5.0|
| A10127132IE1A73I...| 5.0|
|      A101592SF9EIWR| 4.0|
|      A10182LLGKCHDH| 5.0|
+--------------------+----+
only showing top 20 rows
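
Two things in the query above are worth tightening: the result column gets the auto-generated name _c1, and max is applied to score, which is a string column. A hedged variant is sketched below; it assumes the Reviews temp table registered above, and the helper name and the max_score alias are illustrative.

```python
def max_score_per_user(sql_context):
    """Run the per-user max query with a numeric cast and a named column."""
    query = """
        SELECT userId,
               MAX(CAST(score AS DOUBLE)) AS max_score
        FROM Reviews
        GROUP BY userId
    """
    # sql() returns a DataFrame; nothing is printed until show() is called.
    return sql_context.sql(query)

# Usage inside the notebook:
#   max_score_per_user(sqlContext).show(5)
```

For scores between 1.0 and 5.0 the string and numeric maxima happen to agree, but the cast keeps the query correct if values such as 10.0 ever appear.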



# You are done

Please go back to the EMR_Deploy notebook and run the remaining code cells to close the cluster.