## Exploratory Data Analysis

This notebook explores the metadata of books from Amazon. The data is available from the [Amazon Review Data](https://nijianmo.github.io/amazon/index.html) page. Since this is a first analysis, we work with only a sample of the data (the first 10000 records).

In [5]:
import pandas as pd
import json
import gzip 

The following cell opens the gzipped JSON file, which holds one JSON record per line, and loads the records into a Python list.

In [41]:
data = []
with gzip.open('meta_Books.json.gz') as f:
    for i, line in enumerate(f):
        if i == 10000:  # limit to the first 10000 rows
            break
        # each line is a standalone JSON object
        data.append(json.loads(line.strip()))

print(len(data))
print(data[500])

10000
{'category': [], 'tech1': '', 'description': ['Very clean dust jacket. No markings in book. A find example of this book.'], 'fit': '', 'title': 'The memoirs of Princess Alice, Duchess of Gloucester', 'also_buy': ['1909771155', '1910198129', '1546960376', '1910198137'], 'tech2': '', 'brand': 'Alice', 'feature': [], 'rank': '1,575,603 in Books (', 'also_view': ['0312302398', '1524763136'], 'main_cat': 'Books', 'similar_item': '', 'date': '', 'price': '$8.23', 'asin': '0002166461', 'imageURL': [], 'imageURLHighRes': []}


The Books subset contains a very large number of records, so the data here is limited to the first 10000.
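
As an alternative to a manual counter, the row limit can be sketched with `itertools.islice`, which stops reading after a fixed number of lines. This is only an illustration: the helper `read_first_records` and the synthetic file `sample.json.gz` are hypothetical names introduced here so the snippet runs without the real dataset.

```python
import gzip
import itertools
import json
import os
import tempfile


def read_first_records(path, limit):
    """Return at most `limit` parsed records from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops after `limit` lines, so the rest of the file is never read
        return [json.loads(line) for line in itertools.islice(f, limit)]


# Tiny synthetic file (hypothetical) standing in for meta_Books.json.gz
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
    for n in range(5):
        f.write(json.dumps({"asin": f"{n:010d}", "category": []}) + "\n")

sample = read_first_records(tmp_path, 3)
print(len(sample))
```

With the real file, the call would presumably look like `read_first_records('meta_Books.json.gz', 10000)`.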

In [15]:
df = pd.DataFrame(data)  # data is a list of dicts, one per book
df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,[],,[It is a biology book with God&apos;s perspect...,,Biology Gods Living Creation Third Edition 10 ...,"[0669009075, B000K2P5SA, B00MD4G2N0, B000ASIPT...",,Keith Graham,[],"1,349,781 in Books (","[0019777701, B000AUCX7I, B000K2P5SA, B001CK63X...",Books,,,$39.94,0000092878,[],[]
1,"[Books, New, Used & Rental Textbooks, Medicine...",,[],,Mksap 16 Audio Companion: Medical Knowledge Se...,[],,Acp,[],"1,702,625 in Books (","[B01MUCYEV7, B01KUGTY6O]",Books,,,,000047715X,[],[]
2,"[Books, Arts & Photography, Music]",,"[Discography of American Punk, Hardcore, and P...",,"Flex! Discography of North American Punk, Hard...",[],,Burkhard Jarisch,[],"6,291,012 in Books (",[],Books,,,$199.99,0000004545,[],[]
3,"[Books, Arts & Photography, Music]",,[This is a collection of classic gospel hymns ...,,Heavenly Highway Hymns: Shaped-Note Hymnal,[],,Stamps/Baxter,[],"2,384,057 in Books (","[0006180116, 0996092730, B000QFOGY0, B06WWKNDL...",Books,,,,0000013765,[],[]
4,[],,[],,Georgina Goodman Nelson Womens Size 8.5 Purple...,[],,,[],"11,735,726 in Books (",[],Books,,,$164.10,0000000116,[],[]


Converting this dataframe to JSON and saving it. The output file is named `formated_data_books.json`.

In [22]:
# to_json returns None when given a path, so the result is not assigned
df.to_json(r'./formated_data_books.json', orient='records')
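
When a path is passed, `DataFrame.to_json` writes the file and returns `None`. The sketch below shows a round trip on a small stand-in frame. The frame `df_demo` and the file name `demo_books.json` are hypothetical. `dtype=False` is used on reading so that string identifiers such as `asin` are not coerced to numbers.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the books dataframe
df_demo = pd.DataFrame({
    "asin": ["0000092878", "000047715X"],
    "category": [[], ["Books", "Medical Books"]],
})

out_path = os.path.join(tempfile.mkdtemp(), "demo_books.json")
result = df_demo.to_json(out_path, orient="records")  # writes the file, returns None

# dtype=False disables dtype inference, keeping `asin` as strings
restored = pd.read_json(out_path, orient="records", dtype=False)
print(restored)
```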

Removing books that don't have any category

In [30]:
df[df.category.map(len) == 0].index

Int64Index([   0,    4,    9,   10,   11,   13,   34,   36,   38,   45,
            ...
            9936, 9940, 9942, 9944, 9954, 9957, 9959, 9960, 9969, 9977],
           dtype='int64', length=1983)

In [31]:
df = df.drop(df[df.category.map(len) == 0].index)
df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
1,"[Books, New, Used & Rental Textbooks, Medicine...",,[],,Mksap 16 Audio Companion: Medical Knowledge Se...,[],,Acp,[],"1,702,625 in Books (","[B01MUCYEV7, B01KUGTY6O]",Books,,,,000047715X,[],[]
2,"[Books, Arts & Photography, Music]",,"[Discography of American Punk, Hardcore, and P...",,"Flex! Discography of North American Punk, Hard...",[],,Burkhard Jarisch,[],"6,291,012 in Books (",[],Books,,,$199.99,0000004545,[],[]
3,"[Books, Arts & Photography, Music]",,[This is a collection of classic gospel hymns ...,,Heavenly Highway Hymns: Shaped-Note Hymnal,[],,Stamps/Baxter,[],"2,384,057 in Books (","[0006180116, 0996092730, B000QFOGY0, B06WWKNDL...",Books,,,,0000013765,[],[]
5,"[Books, New, Used & Rental Textbooks, Medicine...",,[Brand new; never used.],,Principles of Analgesic Use in the Treatment o...,"[0323056962, 0123979285]",,American Pain Society,[],"2,906,939 in Books (","[0323056962, 0521879272]",Books,,,,0000555010,[],[]
6,"[Books, Medical Books, Medicine]",,[Flash cards used with accompany MKSAP 15 audi...,,MKSAP 15 Audio Companion,[],,ACP,[],"2,236,549 in Books (",[],Books,,,,0000477141,[],[]
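
The same filter can also be written as a boolean mask, which avoids building the index of rows to drop. A minimal sketch on a hypothetical three-row frame (`df_demo` is not part of the notebook):

```python
import pandas as pd

# Hypothetical frame: two books without a category, one with
df_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "category": [[], ["Books", "Music"], []],
})

# True where the category list has at least one entry
has_category = df_demo["category"].map(len) > 0

# Keep only categorized books and renumber the rows from 0
filtered = df_demo[has_category].reset_index(drop=True)
print(filtered)
```

Resetting the index is optional; the notebook keeps the original row labels, which is why its filtered frame starts at index 1.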


Sidebar: We'll probably have to remove other samples as well; this filter is just an example. More suggestions are welcome.

### Exploratory Data Analysis (EDA)

In [36]:
df.shape

(8017, 18)

The dataframe contains 8017 books (rows) and 18 features (columns).

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8017 entries, 1 to 9999
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   category         8017 non-null   object
 1   tech1            8017 non-null   object
 2   description      8017 non-null   object
 3   fit              8017 non-null   object
 4   title            8017 non-null   object
 5   also_buy         8017 non-null   object
 6   tech2            8017 non-null   object
 7   brand            8017 non-null   object
 8   feature          8017 non-null   object
 9   rank             8017 non-null   object
 10  also_view        8017 non-null   object
 11  main_cat         8017 non-null   object
 12  similar_item     8017 non-null   object
 13  date             8017 non-null   object
 14  price            8017 non-null   object
 15  asin             8017 non-null   object
 16  imageURL         8017 non-null   object
 17  imageURLHighRes  8017 non-null   object

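Every column has the `object` dtype, and none of them has null values. Missing information is stored as empty strings or empty lists instead.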
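
Because every column is stored as `object`, fields such as `price` and `rank` would need converting before any numeric analysis. The following is a minimal sketch, assuming the string formats seen in the sample rows above (`$8.23`, `1,575,603 in Books (`). The frame `df_demo` and the new column names `price_usd` and `rank_num` are hypothetical, and other formats may exist in the full dataset.

```python
import pandas as pd

# Hypothetical rows mimicking the formats seen in the sample output
df_demo = pd.DataFrame({
    "price": ["$8.23", "", "$199.99"],
    "rank": ["1,575,603 in Books (", "6,291,012 in Books (", ""],
})

# Strip the currency symbol and thousands separators; empty strings become NaN
df_demo["price_usd"] = pd.to_numeric(
    df_demo["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Keep only the leading digits and commas of the rank string, then drop the commas
rank_digits = (
    df_demo["rank"].str.extract(r"^([\d,]+)")[0].str.replace(",", "", regex=False)
)
df_demo["rank_num"] = pd.to_numeric(rank_digits, errors="coerce")

print(df_demo[["price_usd", "rank_num"]])
```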

In [37]:
df.describe()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
count,8017,8017.0,8017,8017.0,8017,8017,8017.0,8017.0,8017,8017,8017,8017,8017.0,8017.0,8017.0,8017,8017,8017
unique,485,1.0,4625,1.0,7940,4738,1.0,4924.0,1,8017,4614,3,1.0,1.0,2790.0,8017,3,3
top,"[Books, Literature & Fiction, Genre Fiction]",,[],,Ship of Gold,[],,,[],"12,887,728 in Books (",[],Books,,,,7639562,[],[]
freq,453,8017.0,2445,8017.0,2,3265,8017.0,160.0,8017,1,3347,8009,8017.0,8017.0,2021.0,1,8015,8015


Verifying whether any sample has an `imageURL` or `imageURLHighRes`

In [43]:
df[df.imageURL.map(len) != 0]

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
1886,"[Books, Mystery, Thriller &amp; Suspense]",,[],,The Vulgar Boatman,[],,Fontana Press,[],"1,575,289 in Health &amp; Household (",[],Health &amp; Personal Care,,,,6177611,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
6777,"[Books, Romance, Contemporary]",,[],,Midnight Is A Lonely Place / House Of Echoes,[],,HarperCollins,[],"1,150,119 in Health & Household (",[0007280777],Health & Personal Care,,,,7645783,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


In [44]:
df[df.imageURLHighRes.map(len) != 0]

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
1886,"[Books, Mystery, Thriller &amp; Suspense]",,[],,The Vulgar Boatman,[],,Fontana Press,[],"1,575,289 in Health &amp; Household (",[],Health &amp; Personal Care,,,,6177611,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
6777,"[Books, Romance, Contemporary]",,[],,Midnight Is A Lonely Place / House Of Echoes,[],,HarperCollins,[],"1,150,119 in Health & Household (",[0007280777],Health & Personal Care,,,,7645783,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


Among the 8017 categorized books taken from the first 10000 samples, only two have image URLs.
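
The same check can be applied to several list-valued columns at once by counting the rows whose list is non-empty. A small sketch on a hypothetical frame (`df_demo` and the example URLs are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with list-valued columns like those in the metadata
df_demo = pd.DataFrame({
    "imageURL": [[], ["https://example.com/a.jpg"], []],
    "imageURLHighRes": [[], ["https://example.com/a_hi.jpg"], []],
    "also_buy": [["0001"], [], ["0002", "0003"]],
})

list_columns = ["imageURL", "imageURLHighRes", "also_buy"]

# For each column, count the rows whose list has at least one element
non_empty_counts = {
    col: int((df_demo[col].map(len) > 0).sum()) for col in list_columns
}
print(non_empty_counts)
```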