# Applied data analysis - Project

## Swiss products on Amazon

The goal of the project is to analyze the Swiss products sold on Amazon, based on product reviews collected over a fixed period of time. The first challenge is to filter the data so that only Swiss products remain, which is not trivial. Secondly, we will analyze the users' comments and the ratings they gave.

The first part is data preparation, so that the later analysis works with clean, usable data.

Description of data: http://jmcauley.ucsd.edu/data/amazon/

### Setup

This part fetches the data and parses it into a form that is convenient to work with.

In [1]:
from pyspark import SparkContext
import pyspark.sql
import pandas as pd
import json
import numpy as np

In [2]:
# `sc` is the SparkContext; it is assumed to be provided by the PySpark notebook kernel.
rdd = sc.textFile("hdfs:///datasets/amazon-reviews")

In [3]:
rdd.count()

152262068

In [4]:
rdd.first()

'{"reviewerID": "A00000262KYZUE4J55XGL", "asin": "B003UYU16G", "reviewerName": "Steven N Elich", "helpful": [0, 0], "reviewText": "It is and does exactly what the description said it would be and would do. Couldn\'t be happier with it.", "overall": 5.0, "summary": "Does what it\'s supposed to do", "unixReviewTime": 1353456000, "reviewTime": "11 21, 2012"}'

In [5]:
def is_json(myjson):
    """Return True if the line parses as JSON, False otherwise."""
    try:
        json.loads(str(myjson))
    except ValueError:
        return False
    return True


In [6]:
# Drop lines that are not valid JSON, then parse the remaining lines into dicts.
rdd = rdd.filter(is_json)
rdd = rdd.map(lambda x: json.loads(str(x)))
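
The filter-then-map step above parses every line twice. A minimal sketch of a single-pass alternative is shown below; `parse_json_line` is a hypothetical helper, not part of the original notebook, and the extra check that the parsed value is a dict is an added assumption. It is demonstrated on a plain Python list so it runs without a cluster.

```python
import json


def parse_json_line(line):
    """Return [record] if the line is a JSON object, else [].

    Returning a list makes the helper usable with a flatMap-style
    operation: invalid lines simply contribute no element.
    """
    try:
        record = json.loads(line)
    except ValueError:
        return []
    # Assumption: only JSON objects (dicts) are useful review records.
    return [record] if isinstance(record, dict) else []


# Local demonstration on made-up lines (two valid, one invalid).
sample_lines = [
    '{"asin": "B003UYU16G", "overall": 5.0}',
    'not json at all',
    '{"asin": "B00EDH41T2", "overall": 4.0}',
]
parsed = [rec for line in sample_lines for rec in parse_json_line(line)]
print(len(parsed))
```

With Spark, the same helper could be applied as `rdd.flatMap(parse_json_line)`, which would replace both the `filter` and the `map` call.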

In [7]:
rdd.first()

{'asin': 'B003UYU16G',
 'helpful': [0, 0],
 'overall': 5.0,
 'reviewText': "It is and does exactly what the description said it would be and would do. Couldn't be happier with it.",
 'reviewTime': '11 21, 2012',
 'reviewerID': 'A00000262KYZUE4J55XGL',
 'reviewerName': 'Steven N Elich',
 'summary': "Does what it's supposed to do",
 'unixReviewTime': 1353456000}

|   field        | description |
|----------------|--------------------|
| reviewerID     | ID of the reviewer |
| asin           | ID of the product |
| reviewerName   | name of the reviewer |
| helpful        | helpfulness rating of the review |
| reviewText     | text of the review|
| overall        | rating of the product|
| summary        | summary of the review|
| unixReviewTime | time of the review (unix time)|
| reviewTime     | time of the review (raw)|
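
Two of these fields usually need a small conversion before analysis. The sketch below uses the sample record shown earlier; `review_date` and `helpful_ratio` are hypothetical helper names, and reading `helpful` as `[helpful votes, total votes]` is an assumption based on the dataset description rather than something verified in this notebook.

```python
from datetime import datetime, timezone


def review_date(record):
    """Convert unixReviewTime (seconds since the epoch) to a UTC date."""
    return datetime.fromtimestamp(record["unixReviewTime"], tz=timezone.utc).date()


def helpful_ratio(record):
    """Return helpful votes / total votes, or None if nobody voted.

    Assumption: `helpful` is a pair [helpful votes, total votes].
    """
    helpful_votes, total_votes = record["helpful"]
    return helpful_votes / total_votes if total_votes else None


sample = {
    "asin": "B003UYU16G",
    "helpful": [0, 0],
    "unixReviewTime": 1353456000,
    "reviewTime": "11 21, 2012",
}
print(review_date(sample))    # 2012-11-21, matching the raw reviewTime field
print(helpful_ratio(sample))  # None, because the review received no votes
```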

For convenience, we also create a separate RDD of `(asin, 1)` pairs, one per review, and then reduce it by key to get the number of reviews for each unique product.

In [8]:
products = rdd.map(lambda x: (x['asin'], 1))
products.cache()

PythonRDD[5] at RDD at PythonRDD.scala:43

In [9]:
products.count()

142831980

In [10]:
uniqueProducts = products.reduceByKey(lambda a, b: a + b)

In [11]:
uniqueProducts.first()

('B00EDH41T2', 2)

In [12]:
uniqueProducts.count()

12650775
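
To make the meaning of these counts concrete, here is a small local equivalent of the `map` + `reduceByKey` pattern, using `collections.Counter` on made-up records. It is only an illustration of what the Spark job computes, not a replacement for it.

```python
from collections import Counter

# Made-up reviews: two for one product, one for another.
reviews = [
    {"asin": "B00EDH41T2"},
    {"asin": "B003UYU16G"},
    {"asin": "B00EDH41T2"},
]

# Equivalent of mapping each review to (asin, 1) and summing by key.
review_counts = Counter(review["asin"] for review in reviews)

print(review_counts["B00EDH41T2"])  # number of reviews for that product
print(len(review_counts))           # number of unique products
```

In the notebook, `products.count()` corresponds to the total number of reviews, and `uniqueProducts.count()` to the number of distinct product IDs.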

### Data filtering

Our first job is to filter out non-Swiss products.

*TODO: insert here the plan for filtering out non-Swiss products.*