### First let's read in our Amazon data into a list

So, in this lecture, we're going to see how we can actually read JSON and CSV files into Python objects. Well, so far, we've seen how we can actually open those files using the csv.reader function or the JSON library. What do we actually do once we've opened the files? We'll also introduce some new libraries, in particular the gzip library, which is going to allow us to manipulate gzip files on the flier. So previously, we covered the basics of reading CSV and JSON files using a few different libraries.

So, the question is what comes next? How do we actually go from just opening those files into reading them into appropriate data structures? So, the first thing we're going to want to do, is just to read one of the files we've have been working with. In this case we'll look again at the Amazon Gift Card data, which is a TSV file. So far, we've been able to read it by doing something like, this importing the CSV library, specifying the path to the file, opening the file, providing that open file to the csv.reader library along with a delimited option. Then we can read the header and all of the lines in that file.

So, the questions we'd like to answer to extend that, are first, how are we going to be able to handle sort of large CSV or JSON files without having to unzip them? So far, we just operated on raw, CSV, TSV or JSON files, but many of the datasets we'll actually look at, would come zipped, can we exploit that to our advantage? Secondly, how do we actually extract relevant parts of the data for performing analysis? Often, we'll be looking at very large datasets, and not all parts of those datasets are relevant. So, how do we filter them or build a relevant subset of the data to work with? Third, what are some convenient data structures that we'll make accessing these types of data more convenient.

First, we'll look at the gzip library. Issue we might want to overcome sometimes is that we'll have very large CSV, TSV or JSON files that are going to be cumbersome to store on disk if we have to extract all of them beforehand. So, datasets like the ones we've been working with, the Amazon dataset actually comes in gziped format. So, is there some way we can work with that file without having to unzip it? That's exactly what the gzip library is going to do. So to read the file in gziped format, we import the gzip library, we specify the paths of the file which now includes the.gz or gz extension. Then we open the file using the gzip library, this looks very similar to opening regular file with a few subtly different options. Just looking at this file, it's a 12 megabyte file when it is compressed, and it's a 39 megabyte file when it's unzipped. So, it's already worthwhile to try and manipulating this dataset in his native gziped format. So, when we open that file using the gzip library we specify the options rt. R is to read, t just specifies that the file is really a text file as opposed to reading gziped file in byte format which would be inconvenient.

Otherwise, once we've opened the file using the gzip library, we can manipulate it pretty much like we would any other regular file. So, we can now pass the open gziped file to CSV reader, rather than passing an open file to csv.reader. Otherwise, it's going to be exactly the same, we can read the header and all of the following lines exactly as we would for a regular unzipped file. That's all we need to know about the gzip library. It essentially allows us to read zipped files in the.gz format without having to unzip them.

In [1]:
import gzip

In [3]:
path = "datasets/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
f = gzip.open(path, 'rt')

In [4]:
import csv
reader = csv.reader(f, delimiter = '\t')

In [5]:
header = next(reader)

In [7]:
header

['marketplace',
 'customer_id',
 'review_id',
 'product_id',
 'product_parent',
 'product_title',
 'product_category',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date']

### Reading and filtering files line by line

The next concept I'd like to introduce is,how can we read and filter out data sets line by line? So, for manipulating a very large file and we have a gzipped, it's not going to help us if we then try to read the entire file into memory all in one go, because we're just going to run out of memory. So, the next concept we would like to introduce is to say, "How can we construct a data structure containing some reduced subset of the file that we'd really like to work with?" So, perhaps, in the case of our Amazon dataset, we'd like to build a subset that ignores the text fields in that dataset, because we'd just like to do some operations on the rating, or the vote, or the user data. That's what we'll do in this example. 

In [9]:
data = []

In [10]:
for line in reader:          #File is read one line at a time
    line = line[:3]          #drop the text fields, in this case, the last 3
    if line[-1] == 'Y':      #discard unverified reviews
        dataset.append(line)

In [11]:
dataset[0]

{'marketplace': 'US',
 'customer_id': '24371595',
 'review_id': 'R27ZP1F1CD0C3Y',
 'product_id': 'B004LLIL5A',
 'product_parent': '346014806',
 'product_title': 'Amazon eGift Card - Celebrate',
 'product_category': 'Gift Card',
 'star_rating': 5,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': False,
 'verified_purchase': True,
 'review_headline': 'Five Stars',
 'review_body': 'Great birthday gift for a young adult.',
 'review_date': '2015-08-31'}

So, rather than reading the whole file and then trying to remove the text fields, which could cause us to run out of memory, what we'll do instead, is read the file line by line, delete the text fields and store the reduced entries of each line inside our appropriate data structure, which in this case is the list. So, all this happening in this code is the following: we read the file one line at a time just by passing our csv.reader object into a for loop. The second thing we do, is we just drop some entries from that file. In this case, we are dropping the last three entries which correspond to the text portion of the review.

The next idea is something I personally do and I find very useful. We just to take our CSV structured data and store it in key-value pairs much like we would have four adjacent object. So, rather than trying to manipulate a CSV file by looking for entry number two, which we remember corresponds to the user ID, and entry number 21 which corresponds to the index of the review field, that could be very cumbersome. Rather than doing that, we might actually like to use something like a dictionary data structure, that will store for us key-value pairs indicating which key corresponds to which entry. So, in this case, we might do that as we read the file by using this dictionary constructor. So, this is going to take the header and the line we're currently reading, and convert that to a dictionary which maps each key in the header, to each value in the line. 

So, it's essentially going to convert that line to a dictionary that we can index by keys from the header. The second thing that might be useful to do, since we're actually reading as file as a string, is to convert some of the numeric fields into Python types, such as integer or boolean types. So, we have fields here like the number of helpful votes or the star rating. As we read the file, they're going to be represented as strings, so it might be more useful to convert them to floats, integers, booleans, whatever type they natively come in. The same thing applies for the verified purchase and viine fields, which in this case are, yes or no, or just the characters Y or N, which we might convert to true or false values, which will make it easier later on to perform logic on those fields. So again, in our other two ideas. First of all, we use the dict operator to make our line into a Python dictionary, which is going to make it much easier for us to index the different fields by keys, rather than by the index of that field. Secondly, we convert strings to numbers and booleans where possible.

So, that's about it for reading files into Python data structures. We did a few things in this lecture. First of all, we introduce the gzip library, which is going to be very convenient when we want to manipulate large files that maybe we don't want to unzip. We also saw some techniques for pre-processing data sets as we read them. So, now on your own, you should be able to work with some of the larger Amazon datasets or the help review data, and compile some simple statistics for them by reading them in a native gziped format. Also you should be able to experiment with his dict operator which you can use to convert CSV or TSV data into dictionary objects mapping keys from the header into fields from each line.

In [12]:
dataset = []
for line in reader:
    d = dict(zip(header, line))
    for field in ['helpful_votes', 'star_rating', 'total_votes']:
        d[field] = int(d[field])
    for field in ['verified_purchase', 'vine']:
        if d[field] == "Y":
            d[field] = True
        else:
            d[field] = False
    dataset.append(d)

### Let's calculate some summary statistics

#### Summary stats

Compute Average Rating, it can be computed with list comprehension

In [44]:
ratings = [d['star_rating'] for d in dataset]

In [45]:
sum(ratings) / len(ratings)

4.731333018677096

#### Rating score distribution

In [46]:
ratingCounts = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}

In [47]:
for d in dataset:
    ratingCounts[d['star_rating']] += 1

In [48]:
ratingCounts

#### Using the default dic function

In [63]:
ratingCounts = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}

In [64]:
from collections import defaultdict

In [65]:
ratingCounts = defaultdict(int)

In [66]:
for d in dataset:
    ratingCounts[d['star_rating']] += 1

In [67]:
ratingCounts

defaultdict(int, {5: 129029, 1: 4766, 4: 9808, 2: 1560, 3: 3147})

#### Calculate 'verified purchases'

In [73]:
verifiedCounts = defaultdict(int)

In [74]:
for d in dataset:
    verifiedCounts[d['verified_purchase']] += 1

In [75]:
verifiedCounts

defaultdict(int, {True: 135289, False: 13021})

##### Most popular products

In [82]:
productCounts = defaultdict(int)

In [83]:
for d in dataset:
    productCounts[d['product_id']] += 1

In [84]:
counts = [(productCounts[p], p) for p in productCounts]

In [85]:
counts.sort()

In [86]:
counts[-10:]

[(2038, 'B004KNWWO0'),
 (2173, 'B0066AZGD4'),
 (2630, 'BT00DDC7CE'),
 (2643, 'B004LLIKY2'),
 (3407, 'BT00DDC7BK'),
 (3440, 'BT00CTOUNS'),
 (4283, 'B00IX1I3G6'),
 (5034, 'BT00DDVMVQ'),
 (6037, 'B00A48G0D4'),
 (28705, 'B004LLIKVU')]

#### Top rated products

Here we need to compute the average rating for each product, which requires that we first construct the list of ratings for each product. This can be done using the defaultdict, with the 'list' subclass.

In [96]:
ratingsPerProduct = defaultdict(list)

In [97]:
for d in dataset:
    ratingsPerProduct[d['product_id']].append(d['star_rating'])

In [98]:
averageRatingPerProduct = {}
for p in ratingsPerProduct:
    averageRatingPerProduct[p] = sum(ratingsPerProduct[p]) / len(ratingsPerProduct[p])

We now have two data structures: one which stores the list of ratings for each product, and one which stores the average rating for each product. 

##### Now we can sort by ratings, and also filter to only include reasonably popular products:

In [101]:
topRated = [(averageRatingPerProduct[p], p) for p in averageRatingPerProduct if len(ratingsPerProduct[p]) > 50]

In [None]:
topRated.sort()

In [102]:
topRated[-10:]

[(4.534246575342466, 'B008BMI0JC'),
 (4.698347107438017, 'B0062ONF64'),
 (4.734375, 'B005EISP96'),
 (4.691056910569106, 'B0062ONETC'),
 (4.634615384615385, 'B00BWDH1YC'),
 (4.931034482758621, 'B00CT77E60'),
 (4.756338028169014, 'B00BWDH77S'),
 (4.2745098039215685, 'B00BWDHYBM'),
 (4.67741935483871, 'B009I1ZRN2'),
 (4.882352941176471, 'B004LLJ6B8')]