# Generating ISBNs and Metadata

In [None]:
import re
import pandas as pd
import numpy as np

from CSVtoDF import CSVtoDF

I will use my own module, `CSVtoDF`, which temporarily opens the CSV file inside a `with` statement, loads only the manually picked columns, drops the rest, and closes the file.
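The source of `CSVtoDF` is not shown here, so the following is only a rough sketch of how such a context manager could look. The class name `CsvColumns`, its `columns` parameter, and the `dtype=str` choice are all assumptions made for illustration, not the actual module.

```python
import pandas as pd


class CsvColumns:
    """Hypothetical stand-in for CSVtoDF: load selected CSV columns as a DataFrame."""

    def __init__(self, path, columns=None):
        self.path = path
        self.columns = columns  # None means "load every column"
        self._handle = None

    def __enter__(self):
        # Open the file ourselves so we control exactly when it is closed.
        self._handle = open(self.path, newline="", encoding="utf-8")
        # dtype=str keeps identifiers such as ISBNs from being parsed as numbers.
        return pd.read_csv(self._handle, usecols=self.columns, dtype=str)

    def __exit__(self, exc_type, exc_value, traceback):
        self._handle.close()
        return False  # do not swallow exceptions raised inside the with block
```

It would be used the same way as the real module, for example `with CsvColumns('best_sellers_copy.csv', columns=['title', 'author']) as df:`. The DataFrame stays usable after the block ends because only the file handle is closed.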

In [None]:
with CSVtoDF('best_sellers_copy.csv') as df:
    df['isbn'] = df['primary_isbn10']
    df['isbn13'] = df['primary_isbn13']
    df['title1'] = df['title'].str.lower()
    df['author1'] = df['author'].str.lower()
    
df.head()

We don't actually need the ISBN-10, since the ISBN-13 does the job better.

We drop the `isbn` column and keep the remaining three.

In [None]:
# .copy() gives an independent frame, so later column assignments don't act on a slice
df = df[['isbn13', 'title1', 'author1']].copy()
print()
df.info()

We see that there are some _Null_ values in our dataframe.
At this moment, though, we are only interested in the `isbn13` column, as its values will be used as keys to look up each book's Goodreads page for scraping.

In [None]:
df1 = df[df['isbn13'].isna()]
df1

Fortunately, Summer Secrets by Barbara Freethy is the only book with no ISBN13 information.
I will simply fill the missing value with the identifier `B003K15AKQ`.

In [None]:
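# Assign the result back instead of using inplace=True on a selected column;
# fillna also avoids the np.NaN alias, which NumPy 2.0 removed.
df['isbn13'] = df['isbn13'].fillna('B003K15AKQ')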
print()
print(df.shape)
df.info()

Next we have to look at the duplicate values in our dataframe.

First I combine the author's name, the title, and the ISBN13 number into one column. This way we check for absolute duplicates, meaning books that are exactly the same edition. If the same book is published with a different cover or as a revised version, its ISBN changes, so it will not count as a duplicate.

In [None]:
df['author_title_isbn13'] = df['author1'] + ' ' + df['title1'] + ' ' + df['isbn13']
print()
df['author_title_isbn13'].head()

In [None]:
dups = df.pivot_table(index=['author_title_isbn13'], aggfunc='size')
print()
print(dups.sort_values(ascending=False))
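As a side note, the same count can probably be obtained without building a combined string key, by grouping on the three columns directly. The sketch below uses a made-up toy frame (the rows are placeholders, not real records) and assumes the column names used above.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real dataset.
toy = pd.DataFrame({
    "author1": ["author a", "author a", "author b"],
    "title1": ["title x", "title x", "title y"],
    "isbn13": ["1111111111111", "1111111111111", "2222222222222"],
})

key_columns = ["author1", "title1", "isbn13"]

# Count how many times each (author, title, ISBN) combination occurs.
counts = toy.groupby(key_columns).size().sort_values(ascending=False)
print(counts)

# Number of rows that repeat an earlier row on all three key columns.
n_repeats = int(toy.duplicated(subset=key_columns).sum())
print(n_repeats)
```

Note that `groupby` skips rows with a missing key by default, so rows with a `NaN` author or title would not show up in `counts`.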

Now we can observe that Gone Girl, GOT, and some other books appear several times in our df. That is because, as mentioned earlier, some books have been on the bestseller list for tens of weeks, and their data was recorded again for each week they were featured.

We will drop those duplicates using `pandas` `drop_duplicates`.

In [None]:
df.drop_duplicates('author_title_isbn13', keep='first', ignore_index=True, inplace=True)
df = df[['title1', 'author1', 'isbn13']]
df.shape

We also observed in a previous cell that some books have an ASIN code instead of an ISBN. This can be an issue, as Goodreads can't identify books based on ASIN, so I'll filter them out as well. The pattern used below keeps only values that begin with a digit or the letter `B`.

In [None]:
aisbn = []
i13 = []

for i in list(df['isbn13']):
    if re.search(r'^[\dB]+', i):
        i13.append(i)
    else:
        aisbn.append(i)
        
print(len(aisbn))
print(len(i13))

In [None]:
df_final = df[df['isbn13'].str.contains(r'^[\dB]+')]
df_final.shape
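If a stricter test than the prefix pattern is ever needed, one option is to verify the ISBN-13 check digit: the digits are weighted 1, 3, 1, 3, ... and the weighted sum must be divisible by 10. This is a minimal sketch of that idea and not part of the original workflow; the helper name `is_isbn13` is my own.

```python
import re


def is_isbn13(value):
    """Return True if value is 13 digits with a valid ISBN-13 check digit."""
    if not isinstance(value, str) or not re.fullmatch(r"\d{13}", value):
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits.
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(value))
    return total % 10 == 0


print(is_isbn13("9780306406157"))  # a well-formed ISBN-13
print(is_isbn13("B003K15AKQ"))     # the identifier filled in earlier
```

On the dataframe it could be applied as `df['isbn13'].map(is_isbn13)`. Keep in mind that this would also reject the `B003K15AKQ` value that was filled in manually above.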

We are down to a whopping 2935 entries, but these are unique titles which can be used for further analysis.

We see that there are no more `NaN` values in the `isbn13` column and no more duplicates, so we can extract that column as a list for web scraping.
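A minimal sketch of that extraction, using a toy frame in place of `df_final` (the name `toy_final` and its two values are placeholders for illustration):

```python
import pandas as pd

# Toy stand-in for df_final; the values are placeholders, not the real data.
toy_final = pd.DataFrame({"isbn13": ["9780306406157", "B003K15AKQ"]})

# Series.tolist() returns a plain Python list that a scraping loop can iterate over.
isbn_list = toy_final["isbn13"].tolist()
print(isbn_list)
```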