## Exploratory Data Analysis

This notebook explores the book metadata from Amazon. The data is available from [Amazon Review Data](https://nijianmo.github.io/amazon/index.html). Since this is a first analysis, we'll work with only a sample of the data (100000 rows).

In [1]:
import pandas as pd
import json
import gzip 

The following function opens the gzipped JSON file and appends each parsed line to a Python list:

In [2]:
def create_data_samples(size=10000):
    data = []
    i = 0

    with gzip.open('meta_Books.json.gz') as f:
        for l in f:
            if i == size:  # stop once `size` rows have been read
                break
            data.append(json.loads(l.strip()))
            i += 1
    return data


In [3]:
data = create_data_samples(size=100000)
print(len(data))
print(data[500])

100000
{'category': [], 'tech1': '', 'description': ['Very clean dust jacket. No markings in book. A find example of this book.'], 'fit': '', 'title': 'The memoirs of Princess Alice, Duchess of Gloucester', 'also_buy': ['1909771155', '1910198129', '1546960376', '1910198137'], 'tech2': '', 'brand': 'Alice', 'feature': [], 'rank': '1,575,603 in Books (', 'also_view': ['0312302398', '1524763136'], 'main_cat': 'Books', 'similar_item': '', 'date': '', 'price': '$8.23', 'asin': '0002166461', 'imageURL': [], 'imageURLHighRes': []}


The Books subset is very large, so only a sample is loaded here. The `size` parameter of `create_data_samples` (default 10000) sets how many rows are read.
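As a variation on `create_data_samples`, the same row limit can be expressed with `itertools.islice`. This is only a sketch: the helper name `read_json_lines` and the small temporary file are made up for illustration, and it assumes one JSON object per line, as in `meta_Books.json.gz`.

```python
import gzip
import itertools
import json
import os
import tempfile


def read_json_lines(path, size=10000):
    """Return at most `size` parsed records from a gzipped JSON-lines file."""
    # 'rt' opens the archive in text mode, so each line is already a str.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        # islice stops after `size` lines without reading the rest of the file.
        return [json.loads(line) for line in itertools.islice(f, size)]


# Hypothetical toy file standing in for meta_Books.json.gz.
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy_meta.json.gz')
with gzip.open(tmp_path, 'wt', encoding='utf-8') as f:
    for n in range(5):
        f.write(json.dumps({'asin': f'{n:010d}', 'title': f'Book {n}'}) + '\n')

sample = read_json_lines(tmp_path, size=3)
print(len(sample))
print(sample[0])
```

Asking for more rows than the file contains simply returns every row, so no separate end-of-file check is needed.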

In [4]:
df = pd.DataFrame.from_dict(data)
df.shape

(100000, 18)

In [5]:
df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,[],,[It is a biology book with God&apos;s perspect...,,Biology Gods Living Creation Third Edition 10 ...,"[0669009075, B000K2P5SA, B00MD4G2N0, B000ASIPT...",,Keith Graham,[],"1,349,781 in Books (","[0019777701, B000AUCX7I, B000K2P5SA, B001CK63X...",Books,,,$39.94,0000092878,[],[]
1,"[Books, New, Used & Rental Textbooks, Medicine...",,[],,Mksap 16 Audio Companion: Medical Knowledge Se...,[],,Acp,[],"1,702,625 in Books (","[B01MUCYEV7, B01KUGTY6O]",Books,,,,000047715X,[],[]
2,"[Books, Arts & Photography, Music]",,"[Discography of American Punk, Hardcore, and P...",,"Flex! Discography of North American Punk, Hard...",[],,Burkhard Jarisch,[],"6,291,012 in Books (",[],Books,,,$199.99,0000004545,[],[]
3,"[Books, Arts & Photography, Music]",,[This is a collection of classic gospel hymns ...,,Heavenly Highway Hymns: Shaped-Note Hymnal,[],,Stamps/Baxter,[],"2,384,057 in Books (","[0006180116, 0996092730, B000QFOGY0, B06WWKNDL...",Books,,,,0000013765,[],[]
4,[],,[],,Georgina Goodman Nelson Womens Size 8.5 Purple...,[],,,[],"11,735,726 in Books (",[],Books,,,$164.10,0000000116,[],[]


In [21]:
df.head(1).also_buy.values

array([list(['0669009075', 'B000K2P5SA', 'B00MD4G2N0', 'B000ASIPTK', '0130508470', '1892427524', '0321567919', 'B000BJBH20', '0547484631', 'B000HAJTQO', 'B000AUCX7I', '0130365645', 'B000BI1Y2O', '0395976715', '052817729X', '1579246443', 'B001CK63XK', '1591669847', '0395879884', '836585161X', 'B01J2F9BH6', 'B00KYEHR4E', '158008141X', '1857928393', '0927545829', 'B015AR0RA0', 'B000TVHHRE', '0865167990', '1579246052', 'B003NXXVD4', 'B000OH6AX0', '061802087X', 'B000NU2X02', '0743252012'])],
      dtype=object)

## Creating the dataset for the NLP model

In this first approach we'll use just three columns from the metadata:
 - title
 - asin (ID of the product)
 - also_buy (ASINs of products also bought with this one)
 
So we need to remove the other columns:

In [7]:
df = df.drop(columns=['category', 'tech1', 'description', 'fit', 'also_view', 
              'main_cat', 'imageURL', 'imageURLHighRes','tech2', 
              'brand', 'feature', 'rank', 'similar_item', 'date', 'price'])
df.head()

Unnamed: 0,title,also_buy,asin
0,Biology Gods Living Creation Third Edition 10 ...,"[0669009075, B000K2P5SA, B00MD4G2N0, B000ASIPT...",0000092878
1,Mksap 16 Audio Companion: Medical Knowledge Se...,[],000047715X
2,"Flex! Discography of North American Punk, Hard...",[],0000004545
3,Heavenly Highway Hymns: Shaped-Note Hymnal,[],0000013765
4,Georgina Goodman Nelson Womens Size 8.5 Purple...,[],0000000116


In [12]:
df.shape

(100000, 3)
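An equivalent way to end up with the same three columns is to select them rather than list everything to drop, which keeps working if the metadata gains extra fields. A minimal sketch on a made-up two-row frame (`toy`, `keep` and `subset` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy rows with a subset of the metadata columns.
toy = pd.DataFrame({
    'title': ['Book A', 'Book B'],
    'asin': ['0000000001', '0000000002'],
    'also_buy': [['0000000002'], []],
    'brand': ['X', ''],
    'price': ['$1.00', ''],
})

keep = ['title', 'also_buy', 'asin']
# .copy() avoids chained-assignment warnings if the result is modified later.
subset = toy[keep].copy()
print(subset.columns.tolist())
print(subset.shape)
```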

Helper to write a dataframe to a JSON file:

In [17]:
def create_json(dataframe, file_name: str):
    dataframe.to_json(f'./{file_name}.json', orient='records')
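A quick way to check such an export is to write a small frame and read the file back. The sketch below assumes the intent is to interpolate `file_name` into the path with an f-string; `create_json_checked`, its `directory` parameter and the temporary directory are hypothetical and used only here.

```python
import json
import os
import tempfile

import pandas as pd


def create_json_checked(dataframe, file_name: str, directory: str = '.'):
    """Write `dataframe` as a list of records and return the path written."""
    path = os.path.join(directory, f'{file_name}.json')
    dataframe.to_json(path, orient='records')
    return path


toy = pd.DataFrame({
    'title': ['Book A', 'Book B'],
    'also_buy': [['0000000002'], []],
    'asin': ['0000000001', '0000000002'],
})

out_path = create_json_checked(toy, 'toy_metadata', tempfile.mkdtemp())
with open(out_path, encoding='utf-8') as f:
    records = json.load(f)

# Reading with the json module keeps asin as a string, leading zeros included.
print(records[0])
```

If the file is later read back with `pd.read_json`, passing `dtype={'asin': str}` is worth considering, since ASINs such as `0000092878` could otherwise be parsed as numbers.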

Dropping rows for books without a title:

In [18]:
df[df.title == '']

Unnamed: 0,title,also_buy,asin


In [13]:
df[df.title == ''].shape

(5, 3)

In [19]:
df = df.drop(df[df.title == ''].index)
df.shape

(99995, 3)
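The same filter can be written as a boolean mask, which also makes it easy to treat whitespace-only titles as missing. Whether such titles actually occur in the dataset is an assumption; the sketch below uses a made-up frame (`toy`, `no_title` and `cleaned` are illustrative names).

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Book A', '', '   ', 'Book D'],
    'asin': ['0000000001', '0000000002', '0000000003', '0000000004'],
})

# True where the title is empty after trimming surrounding whitespace.
no_title = toy.title.str.strip() == ''
cleaned = toy[~no_title]
print(no_title.sum())
print(cleaned.asin.tolist())
```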

In [None]:
create_json(df, 'metadata_100000')

In [12]:
also_buy = str(df.head(2).also_buy.values)
len(also_buy)
also_buy.split(',')

["[list(['0669009075'",
 " 'B000K2P5SA'",
 " 'B00MD4G2N0'",
 " 'B000ASIPTK'",
 " '0130508470'",
 " '1892427524'",
 " '0321567919'",
 " 'B000BJBH20'",
 " '0547484631'",
 " 'B000HAJTQO'",
 " 'B000AUCX7I'",
 " '0130365645'",
 " 'B000BI1Y2O'",
 " '0395976715'",
 " '052817729X'",
 " '1579246443'",
 " 'B001CK63XK'",
 " '1591669847'",
 " '0395879884'",
 " '836585161X'",
 " 'B01J2F9BH6'",
 " 'B00KYEHR4E'",
 " '158008141X'",
 " '1857928393'",
 " '0927545829'",
 " 'B015AR0RA0'",
 " 'B000TVHHRE'",
 " '0865167990'",
 " '1579246052'",
 " 'B003NXXVD4'",
 " 'B000OH6AX0'",
 " '061802087X'",
 " 'B000NU2X02'",
 " '0743252012'])\n list([])]"]

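Converting the column to a string and splitting on commas leaves fragments such as `"[list(['0669009075'"`. Since `also_buy` already holds Python lists, one alternative is `DataFrame.explode`, which yields one row per related ASIN. A minimal sketch on made-up data (`toy` and `pairs` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    'asin': ['0000000001', '0000000002', '0000000003'],
    'also_buy': [['0000000002', '0000000003'], [], ['0000000001']],
})

# explode gives one row per list element; empty lists become NaN and are dropped.
pairs = (
    toy[['asin', 'also_buy']]
    .explode('also_buy')
    .dropna(subset=['also_buy'])
    .reset_index(drop=True)
)
print(pairs)
```

Each remaining row is an (`asin`, `also_buy`) pair, which may be a more convenient shape for building co-purchase features.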
Now that we have the JSON records in `data`, we can save them as a JSON file with the `df.to_json` method from pandas.

Removing books that don't have any category:

In [21]:
df[df.fit == '']

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
1,"[Books, New, Used & Rental Textbooks, Medicine...",,[],,Mksap 16 Audio Companion: Medical Knowledge Se...,[],,Acp,[],"1,702,625 in Books (","[B01MUCYEV7, B01KUGTY6O]",Books,,,,000047715X,[],[]
2,"[Books, Arts & Photography, Music]",,"[Discography of American Punk, Hardcore, and P...",,"Flex! Discography of North American Punk, Hard...",[],,Burkhard Jarisch,[],"6,291,012 in Books (",[],Books,,,$199.99,0000004545,[],[]
3,"[Books, Arts & Photography, Music]",,[This is a collection of classic gospel hymns ...,,Heavenly Highway Hymns: Shaped-Note Hymnal,[],,Stamps/Baxter,[],"2,384,057 in Books (","[0006180116, 0996092730, B000QFOGY0, B06WWKNDL...",Books,,,,0000013765,[],[]
5,"[Books, New, Used & Rental Textbooks, Medicine...",,[Brand new; never used.],,Principles of Analgesic Use in the Treatment o...,"[0323056962, 0123979285]",,American Pain Society,[],"2,906,939 in Books (","[0323056962, 0521879272]",Books,,,,0000555010,[],[]
6,"[Books, Medical Books, Medicine]",,[Flash cards used with accompany MKSAP 15 audi...,,MKSAP 15 Audio Companion,[],,ACP,[],"2,236,549 in Books (",[],Books,,,,0000477141,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99994,"[Books, Business & Money, Economics]",,[You know the people in this book. You'll reme...,,Chinese Whispers: The True Story Behind Britai...,[],,Visit Amazon's Hsiao-Hung Pai Page,[],"3,861,504 in Books (",[],Books,,,$37.19,0141035684,[],[]
99996,"[Books, History, Military]",,[Colonel Richard Kemp is a former Commanding O...,,Attack State Red,"[1476773084, 0007296649, 009194855X, 000725780...",,Col Richard Kemp,[],"2,293,499 in Books (",[],Books,,,$20.90,0141041633,[],[]
99997,"[Books, Literature & Fiction, Women's Fiction]",,[Julia Llewellyn is the author of The Love Tra...,,Ten Minutes To Fall In Love,[],,Visit Amazon's Julia Llewellyn Page,[],"13,478,874 in Books (",[],Books,,,$31.00,0141048174,[],[]
99998,"[Books, Literature &amp; Fiction]",,"[Judith Summers is the author of four novels, ...",,"Badness Of King George,The: Fostering The Resc...",[0141032235],,Summers Judith,[],"3,168,897 in Books (",[],Books,,,$13.66,0141046473,[],[]


In [12]:
df = df.drop(df[df.category.map(len) == 0].index)
df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
1,"[Books, New, Used & Rental Textbooks, Medicine...",,[],,Mksap 16 Audio Companion: Medical Knowledge Se...,[],,Acp,[],"1,702,625 in Books (","[B01MUCYEV7, B01KUGTY6O]",Books,,,,000047715X,[],[]
2,"[Books, Arts & Photography, Music]",,"[Discography of American Punk, Hardcore, and P...",,"Flex! Discography of North American Punk, Hard...",[],,Burkhard Jarisch,[],"6,291,012 in Books (",[],Books,,,$199.99,0000004545,[],[]
3,"[Books, Arts & Photography, Music]",,[This is a collection of classic gospel hymns ...,,Heavenly Highway Hymns: Shaped-Note Hymnal,[],,Stamps/Baxter,[],"2,384,057 in Books (","[0006180116, 0996092730, B000QFOGY0, B06WWKNDL...",Books,,,,0000013765,[],[]
5,"[Books, New, Used & Rental Textbooks, Medicine...",,[Brand new; never used.],,Principles of Analgesic Use in the Treatment o...,"[0323056962, 0123979285]",,American Pain Society,[],"2,906,939 in Books (","[0323056962, 0521879272]",Books,,,,0000555010,[],[]
6,"[Books, Medical Books, Medicine]",,[Flash cards used with accompany MKSAP 15 audi...,,MKSAP 15 Audio Companion,[],,ACP,[],"2,236,549 in Books (",[],Books,,,,0000477141,[],[]


In [15]:
df[df.brand.map(len) == 0]

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
35,"[Books, Reference, Writing, Research & Publish...",,[],,"Everyday Writer, The - Andrea Lunsford - Other...",[],,,[],"17,494,462 in Books (",[],Books,,,$99.98,0000095699,[],[]
79,"[Books, Arts & Photography, Music]",,[],,Concert Piano Solos,[],,,[],"5,976,031 in Books (",[0006593437],Books,,,,0001543849,[],[]
80,"[Books, Arts & Photography, Music]",,[],,"20 Black Gospel Favorites, Volume 3",[],,,[],"18,893,699 in Books (",[],Books,,,,0001527355,[],[]
87,"[Books, Arts & Photography, Music]",,[],,Sonatas - For Piano,[],,,[],"15,702,609 in Books (",[],Books,,,$73.25,0001148427,[],[]
90,"[Books, Christian Books & Bibles, Bibles]",,[],,Polish New Testament and Psalms,[],,,[],"14,148,608 in Books (",[],Books,,,$35.99,0001469568,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91842,"[Books, Computers &amp; Technology, Operating ...",,"[Book by Christine M. Gianone, , ]",,Using Ms-DOS Kermit: Connecting Your PC to the...,[],,,[],"14,426,096 in Books (",[],Books,,,$18.32,013952276X,[],[]
91953,"[Books, Humor &amp; Entertainment, Puzzles &am...",,[1983 edition. Book shows age. Underlining and...,,Winning Poker,[],,,[],"4,165,504 in Books (",[],Books,,,$5.10,0139610529,[],[]
99343,"[Books, Humor &amp; Entertainment, Puzzles &am...",,"[Book by unknown, , ]",,"The "" Weakest Link "" Quiz Book: Bk. 2 (Quiz Book)",[1842225952],,,[],"5,766,064 in Books (",[],Books,,,$0.77,014100701X,[],[]
99776,"[Books, Biographies & Memoirs]",,[],,In Your Face: One Woman's Encounter with Cance...,[],,,[],"14,263,565 in Books (",[],Books,,,,0141033339,[],[]


Sidebar: We'll probably have to remove other samples; this was just an example. More suggestions are welcome.
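To help decide which other samples to remove, it may be useful to count empty values per column first. The sketch below relies on `len()` working for both strings and lists, and assumes there are no null values (consistent with the `df.info()` output); the frame `toy` and the name `empty_counts` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'category': [['Books', 'Music'], [], ['Books']],
    'brand': ['Acp', '', ''],
    'title': ['Book A', 'Book B', 'Book C'],
})

# len() works for both strings and lists, so one check covers both kinds of column.
empty_counts = toy.apply(lambda col: col.map(len).eq(0).sum())
print(empty_counts.to_dict())
```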

### Exploratory Data Analysis (EDA)

In [36]:
df.shape

(8017, 18)

The dataframe contains 8017 books and 18 columns.

In [35]:
df.info();

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8017 entries, 1 to 9999
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   category         8017 non-null   object
 1   tech1            8017 non-null   object
 2   description      8017 non-null   object
 3   fit              8017 non-null   object
 4   title            8017 non-null   object
 5   also_buy         8017 non-null   object
 6   tech2            8017 non-null   object
 7   brand            8017 non-null   object
 8   feature          8017 non-null   object
 9   rank             8017 non-null   object
 10  also_view        8017 non-null   object
 11  main_cat         8017 non-null   object
 12  similar_item     8017 non-null   object
 13  date             8017 non-null   object
 14  price            8017 non-null   object
 15  asin             8017 non-null   object
 16  imageURL         8017 non-null   object
 17  imageURLHighRes  8017 non-null   object

All columns have the `object` dtype.
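Because every column is stored as `object`, numeric fields such as `price` and `rank` need parsing before they can be summarised. A minimal sketch, assuming the formats seen in the samples above (`'$8.23'` and `'1,575,603 in Books ('`); the frame `toy` and the names `price_num` and `rank_num` are illustrative. Note that `rank` has to be accessed with brackets, because `df.rank` is a DataFrame method.

```python
import pandas as pd

toy = pd.DataFrame({
    'price': ['$8.23', '', '$199.99'],
    'rank': ['1,575,603 in Books (', '6,291,012 in Books (', ''],
})

# Strip '$' and thousands separators; empty strings become NaN.
price_num = pd.to_numeric(
    toy['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False),
    errors='coerce',
)

# Take the leading digit group of the rank text, e.g. '1,575,603'.
rank_digits = toy['rank'].str.extract(r'^([\d,]+)', expand=False)
rank_num = pd.to_numeric(rank_digits.str.replace(',', '', regex=False), errors='coerce')

print(price_num.tolist())
print(rank_num.tolist())
```

Values that do not match the assumed format end up as `NaN` rather than raising an error, so they can be inspected separately.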

In [37]:
df.describe()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
count,8017,8017.0,8017,8017.0,8017,8017,8017.0,8017.0,8017,8017,8017,8017,8017.0,8017.0,8017.0,8017,8017,8017
unique,485,1.0,4625,1.0,7940,4738,1.0,4924.0,1,8017,4614,3,1.0,1.0,2790.0,8017,3,3
top,"[Books, Literature & Fiction, Genre Fiction]",,[],,Ship of Gold,[],,,[],"12,887,728 in Books (",[],Books,,,,7639562,[],[]
freq,453,8017.0,2445,8017.0,2,3265,8017.0,160.0,8017,1,3347,8009,8017.0,8017.0,2021.0,1,8015,8015


Verifying whether any sample has an `imageURL` or `imageURLHighRes`:

In [43]:
df[df.imageURL.map(len) != 0]

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
1886,"[Books, Mystery, Thriller &amp; Suspense]",,[],,The Vulgar Boatman,[],,Fontana Press,[],"1,575,289 in Health &amp; Household (",[],Health &amp; Personal Care,,,,6177611,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
6777,"[Books, Romance, Contemporary]",,[],,Midnight Is A Lonely Place / House Of Echoes,[],,HarperCollins,[],"1,150,119 in Health & Household (",[0007280777],Health & Personal Care,,,,7645783,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


In [44]:
df[df.imageURLHighRes.map(len) != 0]

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
1886,"[Books, Mystery, Thriller &amp; Suspense]",,[],,The Vulgar Boatman,[],,Fontana Press,[],"1,575,289 in Health &amp; Household (",[],Health &amp; Personal Care,,,,6177611,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
6777,"[Books, Romance, Contemporary]",,[],,Midnight Is A Lonely Place / House Of Echoes,[],,HarperCollins,[],"1,150,119 in Health & Household (",[0007280777],Health & Personal Care,,,,7645783,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


In the first 10000 samples, only two books have image URLs.