
Hello and how's it going everybody! 

It is important for businesses to examine customer feedback. Understanding customers’ likes and dislikes matters because this information may not only help increase sales but may also improve the quality of products, services, and company management.
Almost all businesses collect and publish customer feedback. Examples that can be found with a simple internet search include Google and Amazon reviews. According to marketing research published in Forbes (see the link https://www.forbes.com/sites/serenitygibbons/2018/09/20/why-businesses-need-to-see-customer-feedback-as-make-or-break/#1f40c09b1083), a negative review could cost a business 20% of its existing customers and 70% of its potential future customers.

To explore how Machine Learning models can help with this problem, we will analyze Amazon customer feedback data. The data is fairly large and consists of multiple json files (located here http://jmcauley.ucsd.edu/data/amazon/). From this website, I downloaded four different datasets to my desktop: "Books" (file size around 3GB), "Electronics" (around 476MB), "Sports and Outdoors" (around 65MB), and "Beauty" (around 43MB). I then moved these files to my Google Drive. This is convenient because I can easily mount my Google Drive in Colab, which makes it easier to parse big data.

To link your Google Drive with Colab, either run the code in the cell below or use the Colab interface. On the left side of the browser window, you will see a chevron that expands a vertical menu. Once it is expanded, you should see the “Mount Drive” option. Double click “Mount Drive” and run the code. A link will appear that opens a new browser tab and provides an authorization code. Copy the authorization code and paste it into the cell. Doing so will mount your Google Drive to Colab.





In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive




The file is in json.gz format (gzip-compressed json). We need the “import json” and “import gzip” statements to open and parse the data.
You can get the path of the file from the “Files” menu in the vertical menu discussed previously. Select “drive” and then select “My Drive.” Locate the folder that contains the large json file, right click the file, and select the option “Copy path.” Once you have copied the file path, you can paste it into the cell. For example, I want to use the file titled “reviews_Beauty_5.json.gz” in the “amazon large data” folder located in “My Drive.” My path is: '/content/drive/My Drive/amazon large data/reviews_Beauty_5.json.gz'

Your path should look similar (but not necessarily identical) to the path in the example above.

To read the first line of the data file, run the code below.
The output shows that the code ran successfully and read the data. Each line is a json object, printed here as a raw bytes string (note the b prefix); once parsed with json.loads, it becomes a Python dictionary.

These are the keys for the dictionary, which will help you to do further analyses. 
reviewerID

asin

reviewerName

helpful

reviewText

overall

summary

unixReviewTime

reviewTime



In [0]:
import json
import gzip

with gzip.open('/content/drive/My Drive/amazon large data/reviews_Beauty_5.json.gz', 'rb') as f:
  for line in f:
    print(line)
    break

b'{"reviewerID": "A1YJEY40YUW4SE", "asin": "7806397051", "reviewerName": "Andrea", "helpful": [3, 4], "reviewText": "Very oily and creamy. Not at all what I expected... ordered this to try to highlight and contour and it just looked awful!!! Plus, took FOREVER to arrive.", "overall": 1.0, "summary": "Don\'t waste your money", "unixReviewTime": 1391040000, "reviewTime": "01 30, 2014"}\n'
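Each line read from the gzip file is a bytes object holding one json record. As a minimal sketch (the record below is a shortened, hypothetical stand-in, and it is written to an in-memory gzip stream rather than the real file on Google Drive), `json.loads` turns such a line into a Python dictionary whose keys can then be listed:

```python
import gzip
import io
import json

# Hypothetical, shortened record standing in for one line of the real file.
sample = {"reviewerID": "A1", "asin": "0000000000", "overall": 1.0,
          "summary": "Don't waste your money", "reviewTime": "01 30, 2014"}

# Build a small gzip stream in memory so the sketch runs without Google Drive.
buffer = io.BytesIO()
with gzip.open(buffer, "wb") as f:
    f.write((json.dumps(sample) + "\n").encode("utf-8"))
buffer.seek(0)

# Read the first line back: it arrives as bytes, which json.loads accepts.
with gzip.open(buffer, "rb") as f:
    first_line = f.readline()

first_review = json.loads(first_line)  # now a Python dict
print(type(first_line).__name__, type(first_review).__name__)
print(list(first_review.keys()))
```

With the real file, the same two steps (read a line, then call `json.loads` on it) should apply; only the path changes.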




To randomly sample the data, use the statement “import random.” To build a smaller subset of the json file, we first need to create an empty list to hold the selected reviews.

Using a for loop, we can read each line in the data. We can also limit the data by year using an if statement. In this example, I limited the data to the year 2014. Additionally, we can randomly select 10000 cases from the json data.

To check that the code worked, we print both the length of the data (which is 10000) and the first 5 rows of the data. If you can view this information, the code was successful.
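The year filter relies on the last token of the 'reviewTime' string. The sketch below isolates that step and, as an alternative that is my own assumption rather than part of the original workflow, derives the year from 'unixReviewTime' instead. The helper names are hypothetical, and the two values are copied from the first record printed earlier.

```python
from datetime import datetime, timezone

def year_from_review_time(review):
    """Return the year from a 'reviewTime' string such as '01 30, 2014'."""
    return int(review["reviewTime"].split(" ")[-1])

def year_from_unix_time(review):
    """Return the UTC year from the 'unixReviewTime' integer timestamp."""
    return datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc).year

# Values copied from the first record printed earlier in this notebook.
review = {"unixReviewTime": 1391040000, "reviewTime": "01 30, 2014"}
print(year_from_review_time(review))
print(year_from_unix_time(review))
```

Both approaches should agree for this record; the string version matches what the parsing cell does.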


In [0]:
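import random

data = []
with gzip.open('/content/drive/My Drive/amazon large data/reviews_Beauty_5.json.gz', 'rb') as f:
	for line in f:
		review = json.loads(line)  # parse one json record per line
		# reviewTime looks like '01 30, 2014'; the last token is the year
		year = int(review['reviewTime'].split(' ')[-1])
		if year == 2014:
			data.append(review)

# Draw 10000 reviews at random, without replacement
# (random.sample raises ValueError if data holds fewer than 10000 items)
final_data = random.sample(data, 10000)

print(len(final_data))

print(final_data[0:5])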



10000
[{'reviewerID': 'A2I8KUDXTC9WYI', 'asin': 'B00K83VT5Y', 'reviewerName': 'theresa', 'helpful': [0, 0], 'reviewText': "I love Vitamin C Serum and this serum from Health Royals is excellent. Since I started using it, my skin and complexion looks wonderful.  I have noticed my fine lines and wrinkles starting to fade. I also don't have as much puffiness under my eyes in the morning as I use to have.  Vitamin C Serum is so beneficial for the skin and this company makes a great product.  I also use it on my neck and my hands.  My hands are looking so much better and the age spots on them are starting to fade also.  I highly recommend this product.", 'overall': 5.0, 'summary': 'I love this Vitamin C Serum.', 'unixReviewTime': 1405814400, 'reviewTime': '07 20, 2014'}, {'reviewerID': 'A3OV02UE5E8P5I', 'asin': 'B000O3OZD6', 'reviewerName': 'sutherngal', 'helpful': [0, 0], 'reviewText': 'This case is great for my make-up. It is big enough to put everything. It comes with a black strap.', 'ov
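`random.sample` raises a ValueError when the list holds fewer items than requested, which could happen with a smaller dataset or a different year. A minimal sketch of one way to guard against that, and to make the draw repeatable, is shown below; the `sample_reviews` helper, the `seed` argument, and the toy list are assumptions for illustration, not part of the original notebook.

```python
import random

def sample_reviews(reviews, k, seed=None):
    """Return up to k randomly chosen items without replacement."""
    rng = random.Random(seed)  # local generator; passing a seed makes it repeatable
    k = min(k, len(reviews))   # avoid ValueError when the list is too short
    return rng.sample(reviews, k)

toy = [{"id": i} for i in range(25)]  # hypothetical stand-in for `data`
print(len(sample_reviews(toy, 10, seed=0)))
print(len(sample_reviews(toy, 10000, seed=0)))
```

Capping `k` silently returns fewer reviews than requested, so it is worth printing the resulting length, as the cell above already does.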



Make sure you save the data. I saved the data as Beautysmall.json in my Google Drive.

In [0]:
with open('/content/drive/My Drive/amazon large data/Beautysmall.json', 'w') as f:
	for review in final_data:
		f.write(json.dumps(review)+'\n')
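To confirm that writing one json object per line loses nothing, the save step can be paired with a read-back check. The sketch below uses a temporary file and two hypothetical records instead of Google Drive and `final_data`, so it can run anywhere.

```python
import json
import os
import tempfile

# Hypothetical records standing in for `final_data`.
records = [{"reviewText": "Works well", "overall": 5.0},
           {"reviewText": "Broke quickly", "overall": 2.0}]

# Write one json object per line to a temporary file (not Google Drive).
path = os.path.join(tempfile.mkdtemp(), "small.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line and compare with what was written.
with open(path, encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]

print(restored == records)
```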



Once saved, I can read the first row/line of this smaller json file.

In [0]:
file_name = '/content/drive/My Drive/amazon large data/Beautysmall.json'

with open(file_name) as f:
  for line in f:
    print(line)
    break

{"reviewerID": "A7L6NVT1KZJ1R", "asin": "B00AUFS12O", "reviewerName": "Debi S. \"Debi S.\"", "helpful": [0, 0], "reviewText": "Love this stuff! I get all day wear from my eye shadow with no more creasing, nor fading. Keeps it looking perfect for 8 plus hours. This will be a staple for me now. I have older skin, so need heavier moisturizers which can wreak havoc with eye shadows. Not a problem anymore, even in humidity. I have sensitive skin, and no problems from it, another huge plus.", "overall": 5.0, "summary": "Works perfect!", "unixReviewTime": 1402099200, "reviewTime": "06 7, 2014"}





I can read the values of the following keys: 'reviewText', 'overall' (the rating), and 'summary'. In this example, it is a five-star review describing a very positive customer experience.

In [0]:
with open(file_name) as f:
  for line in f:
    review = json.loads(line)
    print(review['reviewText'])
    print(review['overall'])
    print(review['summary'])
    break

great prize for all these things
5.0
Five Stars




We can also append the data to a list.

In [0]:
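reviews = []
with open(file_name) as f:
  for line in f:
    review = json.loads(line)
    # keep only the review text and its star rating
    reviews.append((review['reviewText'], review['overall']))
reviews[5]    # not displayed: a notebook cell shows only its last expression
len(reviews)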

10000
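Once the reviews are held as (text, rating) tuples, a quick look at the rating distribution is a natural first check before any modeling. The sketch below uses a small hypothetical list named `toy_reviews` with the same shape as `reviews`; the counts it prints say nothing about the real data.

```python
from collections import Counter

# Hypothetical (reviewText, overall) tuples shaped like the `reviews` list.
toy_reviews = [("Love it", 5.0), ("Awful", 1.0), ("Great value", 5.0), ("Okay", 3.0)]

# Count how many reviews received each star rating.
rating_counts = Counter(rating for _, rating in toy_reviews)
for rating, count in sorted(rating_counts.items()):
    print(rating, count)
```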



I repeated the same process for the following files: "Books," "Electronics," and "Sports and Outdoors."
Now that we have parsed the datasets, let’s see how we can apply Machine Learning models to analyze customer feedback data.
For a guide on this, please see the next code snippet.
