# Data Wrangling using Pandas

## Dataset Description: Google Books
This data was acquired from the Google Books store using the Google API. Several features were gathered for each book in the data set. The column names are mostly self-explanatory; nevertheless, they are explained below. Note that the CSV file uses slightly different names for some columns (for example `author`, `page_count`, and `published_date`).

- title : the title of the book.
- authors : the names of the book's authors (may include more than one author).
- language : the language of the book.
- generes\categories : the categories associated with the book (by the Google store).
- rating\averageRating : the average rating of each book, out of 5.
- maturityRating : whether the content of the book is for a mature or NOT MATURE audience.
- publisher : the name of the publisher.
- publishedDate : when the book was published.
- pageCount : the number of pages in the book.
- voters : the number of voters who rated the book.
- ISBN : the unique identifier for each book.
- description : a brief introductory description of the book.
- price : the price of the book on the Google Books store.
- currency : the currency of the price on the Google Books store.

### Tasks:
- Load the dataset
  - Load google_books_1299.csv into a pandas DataFrame.
- Analyze genre distribution
  - Display the value counts of the generes column to understand the distribution of book genres.
- Process genres
  - Split the generes column by the comma (,) delimiter.
  - Compute the frequency of each genre.
  - Retain the top 10 most frequent genres, and replace all other genres with Other.
  - If multiple Other genres appear in the same row, retain only one Other.
- Explode genres
  - Transform the generes column so that each row contains exactly one genre.
  - Example: A row with generes of Fiction, Mystery will be split into two rows: one with Fiction and the other with Mystery. Refer to the df.explode() documentation for guidance.
- Hyphenate ISBN numbers
  - Use the isbnid library to hyphenate the ISBN numbers in the ISBN column. Refer to the isbn.ipynb notebook as an example. Skip any records with invalid ISBN numbers.
- Extract ISBN components
  - Create two new columns:
    - registration_group: the second part of the hyphenated ISBN.
    - publisher_code: the third part of the hyphenated ISBN.
    - Example: For ISBN 978-1-61262-686-4, registration_group is 1 and publisher_code is 61262.
- Create a pivot table of ratings
  - Generate a pivot table where:
    - Rows correspond to registration_group.
    - Columns correspond to generes (including Other).
    - Values are the average rating for each combination, rounded to two decimal places.
    - If no ratings exist for a particular combination, fill the value with 0.
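The explode task in the list above can be sketched on a toy frame. The two books below are hypothetical and are not taken from the dataset; only the column name `generes` comes from the source. The sketch splits each cell on commas, strips whitespace, and lets `DataFrame.explode` emit one row per genre.

```python
import pandas as pd

# Hypothetical two-book frame; only the column name `generes` comes from the dataset.
toy = pd.DataFrame(
    {"title": ["Book A", "Book B"], "generes": ["Fiction, Mystery", "Fantasy"]}
)

# Turn each cell into a list of stripped genre names.
toy["generes"] = toy["generes"].apply(lambda cell: [p.strip() for p in cell.split(",")])

# Emit one row per list element; the other columns are repeated.
exploded = toy.explode("generes").reset_index(drop=True)

print(exploded)
```

Resetting the index afterwards is optional; without it, the rows produced from one book share that book's original index label.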

### Setup Code (Please run this first to set up the environment)

If you haven't installed `isbnid` yet, you may use the following command:
```bash
pip install isbnid
```
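Because the package is installed as `isbnid` but imported as `isbn`, it can be handy to check which ISBN modules are importable before running the notebook. The helper below is a hypothetical convenience, not part of the assignment; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

def available_isbn_modules(candidates=("isbn", "isbnlib")):
    """Map each candidate module name to True if it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None for name in candidates}

# Prints a dict such as {"isbn": ..., "isbnlib": ...}; values depend on the environment.
print(available_isbn_modules())
```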

In [8]:
import numpy as np
import pandas as pd
import isbn
from collections import Counter, defaultdict
#%pip install isbnlib
import isbnlib

In [9]:
if __name__ == "__main__":
    CSV_PATH = 'google_books_1299.csv'

### Load Dataset

In [10]:
def load_data(csv_path):
    """
    Load the Google Books dataset from a CSV file into a Pandas DataFrame.
    IN: csv_path, str, path to the CSV file
    OUT: google_books_df, pd.DataFrame
    """
    # Read from the path passed in rather than a hard-coded file name
    google_books_df = pd.read_csv(csv_path)
    return google_books_df

In [11]:
if __name__ == "__main__":
    google_books_df = load_data(CSV_PATH)
    display(google_books_df)

Unnamed: 0.1,Unnamed: 0,title,author,rating,voters,price,currency,description,publisher,page_count,generes,ISBN,language,published_date
0,0,Attack on Titan: Volume 13,Hajime Isayama,4.6,428,43.28,SAR,NO SAFE PLACE LEFT At great cost to the Garris...,Kodansha Comics,192,none,9781612626864,English,"Jul 31, 2014"
1,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,"Fiction , Mystery &amp, Detective , Cozy , Gen...",9780758272799,English,"Jul 1, 2007"
2,2,The Art of Super Mario Odyssey,Nintendo,3.9,9,133.85,SAR,Take a globetrotting journey all over the worl...,Dark Horse Comics,368,"Games &amp, Activities , Video &amp, Electronic",9781506713816,English,"Nov 5, 2019"
3,3,Getting Away Is Deadly: An Ellie Avery Mystery,Sara Rosett,4.0,10,26.15,SAR,"With swollen feet and swelling belly, pregnant...",Kensington Publishing Corp.,320,none,9781617734076,English,"Mar 1, 2009"
4,4,"The Painted Man (The Demon Cycle, Book 1)",Peter V. Brett,4.5,577,28.54,SAR,The stunning debut fantasy novel from author P...,HarperCollins UK,544,"Fiction , Fantasy , Dark Fantasy",9780007287758,English,"Jan 8, 2009"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1294,1294,Twas The Nightshift Before Christmas: Festive ...,Adam Kay,4.7,47,41.82,SAR,A short gift book of festive hospital diaries ...,Pan Macmillan,112,"Medical , Health Care Delivery",9781529018592,English,"Oct 17, 2019"
1295,1295,Why We Sleep: The New Science of Sleep and Dreams,Matthew Walker,4.8,52,46.85,SAR,'Astonishing ... an amazing book ... absolutel...,Penguin UK,368,"Psychology , Cognitive Psychology &amp, Cognition",9780141983776,English,"Sep 28, 2017"
1296,1296,How to Understand Business Finance: Edition 2,Bob Cinnamon,3.5,4,46.85,SAR,The modern marketplace is increasingly unpredi...,Kogan Page Publishers,176,none,9780749460211,English,"Apr 3, 2010"
1297,1297,Spider-Man: Kraven's Last Hunt,J. M. DeMatteis,4.6,74,43.28,SAR,"Collects Web of Spider-Man #31-32, Amazing Spi...",Marvel Entertainment,168,none,9781302377366,English,"Dec 10, 2014"


### Analyze genre distribution

In [12]:
def get_genre_distribution(df):
    """
    Count how often each individual genre appears in the generes column
    (cells are split on commas; 'none' placeholders are ignored)
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: genre_counts, dict, dictionary with genres as keys and their frequencies as values
    """
    gcol = "generes" if "generes" in df.columns else ("genres" if "genres" in df.columns else None)
    if gcol is None:
        raise KeyError("Expected a 'generes' or 'genres' column.")

    # split, explode, clean, and count
    s = (
        df[gcol]
        .astype(str)
        .str.split(",")
        .explode()
        .str.strip()
        .replace({"": None, "none": None, "None": None, "NONE": None})
        .dropna()
    )

    genre_counts = s.value_counts().to_dict()
    return genre_counts

In [13]:
if __name__ == "__main__":
    genre_distribution = get_genre_distribution(google_books_df.copy())
    for genre, count in genre_distribution.items():
        print(f"{genre}: {count}")

Fiction: 397
General: 180
Economics: 116
Business &amp: 112
Fantasy: 104
Detective: 96
Mystery &amp: 96
Thrillers: 75
Adventure: 50
Action &amp: 50
Science Fiction: 48
Comics & Graphic Novels: 44
Self-Help: 39
Graphic Novels: 38
Comics &amp: 38
Superheroes: 35
Epic: 35
Suspense: 35
Crime: 34
Personal Growth: 26
Juvenile Fiction: 21
Amateur Sleuth: 21
Psychology: 17
Women Sleuths: 17
Biography &amp: 16
Autobiography: 16
Motivational: 14
Mystery & Detective: 12
Cozy: 12
Personal Finance: 11
Classics: 10
Military: 10
Horror: 10
Business & Economics: 10
Success: 10
Accounting: 9
Psychological: 9
Social Science: 9
Mythical Creatures: 8
Young Adult Fiction: 8
Private Investigators: 8
Dragons &amp: 7
Computers: 7
International Mystery &amp: 7
Police Procedural: 7
Fantasy &amp: 7
Magic: 7
Dark Fantasy: 7
Humor: 6
Literary: 6
Mentoring & Coaching: 6
Traditional: 6
Personal Success: 5
Philosophy: 5
Body: 5
Mind &amp: 5
Action & Adventure: 5
Spirit: 5
Industries: 4
Self-Esteem: 4
Self-Management:

### Process genres

- Since missing values in the generes column were filled with the string `none`, replace `none` with NaN so that it is not counted as a genre.
- Replace `&amp,` with `&` before splitting; otherwise the comma inside the artifact causes genres to be split incorrectly.
- Ensure that if multiple 'Other' genres appear in the same row, only one 'Other' is retained.
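The notes above can be illustrated on a single cell. This is a minimal sketch using one genre string in the style of the dataset; the `keep` set and the extra `Humor` entry are hypothetical stand-ins for the real top-10 list.

```python
raw = "Fiction , Mystery &amp, Detective , Cozy"

# Naive split: the comma inside '&amp,' breaks one genre into two fragments.
naive = [p.strip() for p in raw.split(",")]

# Repair the artifact first, then split.
fixed = [p.strip() for p in raw.replace("&amp,", "&").split(",")]

# Map anything outside a (hypothetical) keep-set to 'Other', then drop
# repeated labels while preserving order; dict keys keep insertion order.
keep = {"Fiction", "Mystery & Detective"}
mapped = [g if g in keep else "Other" for g in fixed + ["Humor"]]
deduped = list(dict.fromkeys(mapped))

print(naive)    # four fragments, including 'Mystery &amp' and 'Detective'
print(fixed)    # three genres, with 'Mystery & Detective' intact
print(deduped)  # a single 'Other' even though two genres were replaced
```

The naive split also explains entries such as `Mystery &amp` and `Detective` appearing as separate genres in a distribution computed before the artifact is repaired.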

In [14]:
def process_genres(df):
    """
    Process the generes column
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: df, pd.DataFrame, dataframe with processed generes
    """
    if "generes" not in df.columns:
        raise KeyError("Expected a 'generes' column in the dataframe.")

    out = df.copy()

    # Replace the 'none' placeholder (lower, title, or upper case) with NaN
    out["generes"] = out["generes"].replace(
        {"none": np.nan, "None": np.nan, "NONE": np.nan}
    )

    # Replace '&amp,' artifact with '&'
    out["generes"] = out["generes"].astype(str).str.replace("&amp,", "&", regex=False)

    # Split the generes column by comma
    def _split(cell):
        if pd.isna(cell) or str(cell).lower() == "nan":
            return np.nan
        parts = [p.strip() for p in str(cell).split(",") if p.strip() != ""]
        return parts if parts else np.nan

    out["_generes_list"] = out["generes"].map(_split)

    # Compute frequency of each genre (globally)
    freq = out["_generes_list"].explode().dropna().value_counts()

    # Retain the top 10 most frequent genres
    top10 = set(freq.head(10).index)

    # Replace all other genres with 'Other' and ensure at most one 'Other' per row
    def _map_and_dedupe(lst):
        if not isinstance(lst, list):
            return np.nan
        mapped = [g if g in top10 else "Other" for g in lst]
        seen, keep = set(), []
        for g in mapped:
            if g not in seen:
                seen.add(g)
                keep.append(g)
        return keep if keep else np.nan

    out["_generes_list"] = out["_generes_list"].map(_map_and_dedupe)

    # Join back to comma-separated string (leave NaN as NaN)
    out["generes"] = out["_generes_list"].apply(
        lambda x: ", ".join(x) if isinstance(x, list) else np.nan
    )

    return out.drop(columns=["_generes_list"])

In [15]:
if __name__ == "__main__":
    google_books_processed_genres = process_genres(google_books_df.copy())
    display(google_books_processed_genres)

Unnamed: 0.1,Unnamed: 0,title,author,rating,voters,price,currency,description,publisher,page_count,generes,ISBN,language,published_date
0,0,Attack on Titan: Volume 13,Hajime Isayama,4.6,428,43.28,SAR,NO SAFE PLACE LEFT At great cost to the Garris...,Kodansha Comics,192,,9781612626864,English,"Jul 31, 2014"
1,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,"Fiction, Mystery & Detective, Other, General",9780758272799,English,"Jul 1, 2007"
2,2,The Art of Super Mario Odyssey,Nintendo,3.9,9,133.85,SAR,Take a globetrotting journey all over the worl...,Dark Horse Comics,368,Other,9781506713816,English,"Nov 5, 2019"
3,3,Getting Away Is Deadly: An Ellie Avery Mystery,Sara Rosett,4.0,10,26.15,SAR,"With swollen feet and swelling belly, pregnant...",Kensington Publishing Corp.,320,,9781617734076,English,"Mar 1, 2009"
4,4,"The Painted Man (The Demon Cycle, Book 1)",Peter V. Brett,4.5,577,28.54,SAR,The stunning debut fantasy novel from author P...,HarperCollins UK,544,"Fiction, Fantasy, Other",9780007287758,English,"Jan 8, 2009"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1294,1294,Twas The Nightshift Before Christmas: Festive ...,Adam Kay,4.7,47,41.82,SAR,A short gift book of festive hospital diaries ...,Pan Macmillan,112,Other,9781529018592,English,"Oct 17, 2019"
1295,1295,Why We Sleep: The New Science of Sleep and Dreams,Matthew Walker,4.8,52,46.85,SAR,'Astonishing ... an amazing book ... absolutel...,Penguin UK,368,Other,9780141983776,English,"Sep 28, 2017"
1296,1296,How to Understand Business Finance: Edition 2,Bob Cinnamon,3.5,4,46.85,SAR,The modern marketplace is increasingly unpredi...,Kogan Page Publishers,176,,9780749460211,English,"Apr 3, 2010"
1297,1297,Spider-Man: Kraven's Last Hunt,J. M. DeMatteis,4.6,74,43.28,SAR,"Collects Web of Spider-Man #31-32, Amazing Spi...",Marvel Entertainment,168,,9781302377366,English,"Dec 10, 2014"


### Explode genres

In [16]:
def explode_genres(df):
    """
    Transform the generes column so that each row contains exactly one genre
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: df, pd.DataFrame, dataframe with exploded generes
    """
    if "generes" not in df.columns:
        raise KeyError("Expected a 'generes' column in the dataframe.")

    out = df.copy()

    # Ensure strings, fill missing as 'Other', then split into lists
    out["generes"] = out["generes"].fillna("Other").astype(str)
    out["generes"] = out["generes"].apply(
        lambda s: [p.strip() for p in s.split(",")] if s else ["Other"]
    )

    # Explode to one genre per row
    out = out.explode("generes").reset_index(drop=True)

    # Clean empties after explode
    out["generes"] = out["generes"].replace("", "Other")
    out["generes"] = out["generes"].fillna("Other")

    return out

In [17]:
if __name__ == "__main__":
    google_books_exploded = explode_genres(google_books_processed_genres.copy())
    display(google_books_exploded)

Unnamed: 0.1,Unnamed: 0,title,author,rating,voters,price,currency,description,publisher,page_count,generes,ISBN,language,published_date
0,0,Attack on Titan: Volume 13,Hajime Isayama,4.6,428,43.28,SAR,NO SAFE PLACE LEFT At great cost to the Garris...,Kodansha Comics,192,Other,9781612626864,English,"Jul 31, 2014"
1,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Fiction,9780758272799,English,"Jul 1, 2007"
2,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Mystery & Detective,9780758272799,English,"Jul 1, 2007"
3,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Other,9780758272799,English,"Jul 1, 2007"
4,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,General,9780758272799,English,"Jul 1, 2007"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2331,1296,How to Understand Business Finance: Edition 2,Bob Cinnamon,3.5,4,46.85,SAR,The modern marketplace is increasingly unpredi...,Kogan Page Publishers,176,Other,9780749460211,English,"Apr 3, 2010"
2332,1297,Spider-Man: Kraven's Last Hunt,J. M. DeMatteis,4.6,74,43.28,SAR,"Collects Web of Spider-Man #31-32, Amazing Spi...",Marvel Entertainment,168,Other,9781302377366,English,"Dec 10, 2014"
2333,1298,W is for Wasted: A Kinsey Millhone Novel,Sue Grafton,4.3,206,39.34,SAR,Private investigator Kinsey Millhone finds sho...,Penguin,448,Fiction,9781101636459,English,"Sep 10, 2013"
2334,1298,W is for Wasted: A Kinsey Millhone Novel,Sue Grafton,4.3,206,39.34,SAR,Private investigator Kinsey Millhone finds sho...,Penguin,448,Thrillers,9781101636459,English,"Sep 10, 2013"


### Hyphenate ISBN numbers

- Some ISBN values are not valid. Convert invalid ISBN values to NaN.
- Typical usage of the `isbnid` library is as follows:
    ```python
    import isbn # note that the import is 'isbn', not 'isbnid'
    isbn_val = isbn.ISBN('9781612626864')
    isbn_val.hyphen()
    ```
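To see why some values are rejected, the ISBN-13 check-digit rule can be sketched in plain Python. This is an illustrative pre-check only and is not claimed to be how `isbnid` validates internally; a number that passes the checksum may still fail hyphenation if its prefix is not in an assigned range. The function name `is_valid_isbn13` is hypothetical.

```python
def is_valid_isbn13(value):
    """Return True if `value` is 13 digits with a correct ISBN-13 check digit."""
    digits = str(value).replace("-", "").strip()
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits;
    # a valid ISBN-13 has a weighted sum divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(is_valid_isbn13("9781612626864"))  # True: weighted sum is 120
print(is_valid_isbn13("9781612626865"))  # False: last digit changed
print(is_valid_isbn13("not-an-isbn"))    # False: not 13 digits
```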

In [18]:
def hyphenate_isbn(df):
    """
    Hyphenate the ISBN numbers in the ISBN column
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: df, pd.DataFrame, dataframe with hyphenated ISBNs
    """
    if "ISBN" not in df.columns:
        raise KeyError("Expected an 'ISBN' column in the dataframe.")

    out = df.copy()

    def _try_isbnid_hyphen(s):
        # Some installs provide 'isbn' (isbnid) exposing ISBN().hyphen()
        try:
            import isbn
            return isbn.ISBN(s).hyphen()
        except Exception:
            return None

    def _try_isbnlib_hyphen(s):
        try:
            can = isbnlib.canonical(s)
            if not can:
                return None
            if len(can) == 10:
                can = isbnlib.to_isbn13(can)
            return isbnlib.mask(can)  # adds hyphens with proper grouping
        except Exception:
            return None

    def _hyphen(val):
        if pd.isna(val):
            return np.nan
        s = str(val).strip()
        if not s:
            return np.nan
        # first try isbnid (imported as `isbn`)
        h = _try_isbnid_hyphen(s)
        if not h:
            # then try isbnlib
            h = _try_isbnlib_hyphen(s)
        return h if h else np.nan

    out["ISBN_hyphen"] = out["ISBN"].apply(_hyphen)
    return out
    

In [19]:
if __name__ == "__main__":
    google_books_hyphenated = hyphenate_isbn(google_books_exploded.copy())
    display(google_books_hyphenated)

Unnamed: 0.1,Unnamed: 0,title,author,rating,voters,price,currency,description,publisher,page_count,generes,ISBN,language,published_date,ISBN_hyphen
0,0,Attack on Titan: Volume 13,Hajime Isayama,4.6,428,43.28,SAR,NO SAFE PLACE LEFT At great cost to the Garris...,Kodansha Comics,192,Other,9781612626864,English,"Jul 31, 2014",978-1-61262-686-4
1,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Fiction,9780758272799,English,"Jul 1, 2007",978-0-7582-7279-9
2,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Mystery & Detective,9780758272799,English,"Jul 1, 2007",978-0-7582-7279-9
3,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Other,9780758272799,English,"Jul 1, 2007",978-0-7582-7279-9
4,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,General,9780758272799,English,"Jul 1, 2007",978-0-7582-7279-9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2331,1296,How to Understand Business Finance: Edition 2,Bob Cinnamon,3.5,4,46.85,SAR,The modern marketplace is increasingly unpredi...,Kogan Page Publishers,176,Other,9780749460211,English,"Apr 3, 2010",978-0-7494-6021-1
2332,1297,Spider-Man: Kraven's Last Hunt,J. M. DeMatteis,4.6,74,43.28,SAR,"Collects Web of Spider-Man #31-32, Amazing Spi...",Marvel Entertainment,168,Other,9781302377366,English,"Dec 10, 2014",978-1-302-37736-6
2333,1298,W is for Wasted: A Kinsey Millhone Novel,Sue Grafton,4.3,206,39.34,SAR,Private investigator Kinsey Millhone finds sho...,Penguin,448,Fiction,9781101636459,English,"Sep 10, 2013",978-1-101-63645-9
2334,1298,W is for Wasted: A Kinsey Millhone Novel,Sue Grafton,4.3,206,39.34,SAR,Private investigator Kinsey Millhone finds sho...,Penguin,448,Thrillers,9781101636459,English,"Sep 10, 2013",978-1-101-63645-9


### Extract ISBN components

- Make sure to skip rows with invalid ISBN numbers.
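One possible way to skip invalid rows is to split with `expand=True` and drop rows that have no parts. The sketch below uses a hypothetical three-value series in which NaN stands in for an ISBN that failed hyphenation; it is an alternative illustration, not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical hyphenated values; NaN stands in for an ISBN that failed validation.
hyphenated = pd.Series(["978-1-61262-686-4", np.nan, "978-0-7582-7279-9"])

# expand=True yields one column per part; a NaN input stays NaN in every column.
parts = hyphenated.str.split("-", expand=True)

components = pd.DataFrame(
    {"registration_group": parts[1], "publisher_code": parts[2]}
).dropna()  # skip the rows with no valid ISBN

print(components)
```

The components stay as strings here, which preserves any leading zeros in a publisher code.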

In [20]:
def extract_isbn_components(df):
    """
    Create two new columns: registration_group and publisher_code
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: df, pd.DataFrame, dataframe with new ISBN component columns
    """
    if "ISBN_hyphen" in df.columns:
        s = df["ISBN_hyphen"].astype(str)
    elif "ISBN" in df.columns:
        s = df["ISBN"].astype(str)           # fallback if hyphenated column not created yet
    else:
        raise KeyError("Expected 'ISBN_hyphen' or 'ISBN' column.")

    out = df.copy()
    out["registration_group"] = pd.NA
    out["publisher_code"] = pd.NA

    parts = s.str.split("-")
    mask = parts.str.len().ge(5)             # valid hyphenated form has at least 5 parts

    out.loc[mask, "registration_group"] = parts[mask].str[1]
    out.loc[mask, "publisher_code"]     = parts[mask].str[2]

    return out

In [21]:
if __name__ == "__main__":
    google_books_with_isbn_components = extract_isbn_components(google_books_hyphenated.copy())
    display(google_books_with_isbn_components)

Unnamed: 0.1,Unnamed: 0,title,author,rating,voters,price,currency,description,publisher,page_count,generes,ISBN,language,published_date,ISBN_hyphen,registration_group,publisher_code
0,0,Attack on Titan: Volume 13,Hajime Isayama,4.6,428,43.28,SAR,NO SAFE PLACE LEFT At great cost to the Garris...,Kodansha Comics,192,Other,9781612626864,English,"Jul 31, 2014",978-1-61262-686-4,1,61262
1,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Fiction,9780758272799,English,"Jul 1, 2007",978-0-7582-7279-9,0,7582
2,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Mystery & Detective,9780758272799,English,"Jul 1, 2007",978-0-7582-7279-9,0,7582
3,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,Other,9780758272799,English,"Jul 1, 2007",978-0-7582-7279-9,0,7582
4,1,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,General,9780758272799,English,"Jul 1, 2007",978-0-7582-7279-9,0,7582
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2331,1296,How to Understand Business Finance: Edition 2,Bob Cinnamon,3.5,4,46.85,SAR,The modern marketplace is increasingly unpredi...,Kogan Page Publishers,176,Other,9780749460211,English,"Apr 3, 2010",978-0-7494-6021-1,0,7494
2332,1297,Spider-Man: Kraven's Last Hunt,J. M. DeMatteis,4.6,74,43.28,SAR,"Collects Web of Spider-Man #31-32, Amazing Spi...",Marvel Entertainment,168,Other,9781302377366,English,"Dec 10, 2014",978-1-302-37736-6,1,302
2333,1298,W is for Wasted: A Kinsey Millhone Novel,Sue Grafton,4.3,206,39.34,SAR,Private investigator Kinsey Millhone finds sho...,Penguin,448,Fiction,9781101636459,English,"Sep 10, 2013",978-1-101-63645-9,1,101
2334,1298,W is for Wasted: A Kinsey Millhone Novel,Sue Grafton,4.3,206,39.34,SAR,Private investigator Kinsey Millhone finds sho...,Penguin,448,Thrillers,9781101636459,English,"Sep 10, 2013",978-1-101-63645-9,1,101


### Create a pivot table of ratings

- Round the values to two decimal places.
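The rounding and zero-fill requirements can be sketched on a toy frame. The rows below are hypothetical; only the column names mirror the dataset. This variant passes `fill_value=0` to `pivot_table` directly, which is one of several equivalent ways to fill empty combinations.

```python
import pandas as pd

# Hypothetical exploded rows: one genre per row, with rating and registration group.
toy = pd.DataFrame(
    {
        "registration_group": ["0", "0", "1", "1"],
        "generes": ["Fiction", "Fiction", "Fiction", "Other"],
        "rating": [4.0, 4.5, 3.333, 5.0],
    }
)

pivot = toy.pivot_table(
    index="registration_group",
    columns="generes",
    values="rating",
    aggfunc="mean",
    fill_value=0,  # combinations with no ratings become 0 instead of NaN
).round(2)

print(pivot)
```

Group `0` has no `Other` rows, so that cell is filled with 0, while the two `Fiction` ratings in group `0` average to 4.25.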

In [24]:
def create_rating_pivot_table(df):
    """
    Generate a pivot table of ratings
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: pivot_table, pd.DataFrame, pivot table of ratings
    """
    # Detect required columns
    gcol = "generes" if "generes" in df.columns else ("genres" if "genres" in df.columns else None)
    if gcol is None:
        raise KeyError("Expected a 'generes' or 'genres' column.")
    rcol = "rating" if "rating" in df.columns else ("averageRating" if "averageRating" in df.columns else None)
    if rcol is None:
        raise KeyError("Expected a 'rating' or 'averageRating' column.")
    if "registration_group" not in df.columns:
        raise KeyError("Expected 'registration_group' column. Run extract_isbn_components() first.")

    dfn = df.copy()
    # ensure numeric ratings
    dfn[rcol] = pd.to_numeric(dfn[rcol], errors="coerce")

    pivot_table = (
        dfn.pivot_table(index="registration_group", columns=gcol, values=rcol, aggfunc="mean")
           .fillna(0.0)
           .sort_index()
           .sort_index(axis=1)
           .round(2)
    )
    return pivot_table

In [25]:
if __name__ == "__main__":
    rating_pivot_table = create_rating_pivot_table(google_books_with_isbn_components.copy())
    display(rating_pivot_table)

registration_group,Action & Adventure,Business & Economics,Comics & Graphic Novels,Fantasy,Fiction,General,Mystery & Detective,Other,Science Fiction,Self-Help,Thrillers
0,4.54,4.14,4.2,4.49,4.41,4.33,4.3,4.35,4.53,4.39,4.3
1,4.66,4.47,4.4,4.69,4.5,4.45,4.39,4.46,4.44,4.33,4.54
3,0.0,2.5,0.0,0.0,4.9,3.33,0.0,3.53,0.0,0.0,0.0
80,0.0,0.0,0.0,0.0,0.0,4.63,0.0,4.53,0.0,0.0,0.0


### DataFrame Schema

In [26]:
if __name__ == "__main__":
    google_books_processed_genres.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1299 entries, 0 to 1298
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      1299 non-null   int64  
 1   title           1299 non-null   object 
 2   author          1299 non-null   object 
 3   rating          1224 non-null   float64
 4   voters          1224 non-null   object 
 5   price           1299 non-null   float64
 6   currency        1299 non-null   object 
 7   description     1296 non-null   object 
 8   publisher       1299 non-null   object 
 9   page_count      1299 non-null   int64  
 10  generes         772 non-null    object 
 11  ISBN            1299 non-null   object 
 12  language        1299 non-null   object 
 13  published_date  1299 non-null   object 
dtypes: float64(2), int64(2), object(10)
memory usage: 142.2+ KB


In [27]:
if __name__ == "__main__":
    google_books_exploded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2336 entries, 0 to 2335
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      2336 non-null   int64  
 1   title           2336 non-null   object 
 2   author          2336 non-null   object 
 3   rating          2204 non-null   float64
 4   voters          2204 non-null   object 
 5   price           2336 non-null   float64
 6   currency        2336 non-null   object 
 7   description     2333 non-null   object 
 8   publisher       2336 non-null   object 
 9   page_count      2336 non-null   int64  
 10  generes         2336 non-null   object 
 11  ISBN            2336 non-null   object 
 12  language        2336 non-null   object 
 13  published_date  2336 non-null   object 
dtypes: float64(2), int64(2), object(10)
memory usage: 255.6+ KB


In [28]:
if __name__ == "__main__":
    google_books_hyphenated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2336 entries, 0 to 2335
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      2336 non-null   int64  
 1   title           2336 non-null   object 
 2   author          2336 non-null   object 
 3   rating          2204 non-null   float64
 4   voters          2204 non-null   object 
 5   price           2336 non-null   float64
 6   currency        2336 non-null   object 
 7   description     2333 non-null   object 
 8   publisher       2336 non-null   object 
 9   page_count      2336 non-null   int64  
 10  generes         2336 non-null   object 
 11  ISBN            2336 non-null   object 
 12  language        2336 non-null   object 
 13  published_date  2336 non-null   object 
 14  ISBN_hyphen     2189 non-null   object 
dtypes: float64(2), int64(2), object(11)
memory usage: 273.9+ KB


In [29]:
if __name__ == "__main__":
    google_books_with_isbn_components.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2336 entries, 0 to 2335
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          2336 non-null   int64  
 1   title               2336 non-null   object 
 2   author              2336 non-null   object 
 3   rating              2204 non-null   float64
 4   voters              2204 non-null   object 
 5   price               2336 non-null   float64
 6   currency            2336 non-null   object 
 7   description         2333 non-null   object 
 8   publisher           2336 non-null   object 
 9   page_count          2336 non-null   int64  
 10  generes             2336 non-null   object 
 11  ISBN                2336 non-null   object 
 12  language            2336 non-null   object 
 13  published_date      2336 non-null   object 
 14  ISBN_hyphen         2189 non-null   object 
 15  registration_group  2189 non-null   object 
 16  publis

In [30]:
if __name__ == "__main__":
    rating_pivot_table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 80
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Action & Adventure       4 non-null      float64
 1   Business & Economics     4 non-null      float64
 2   Comics & Graphic Novels  4 non-null      float64
 3   Fantasy                  4 non-null      float64
 4   Fiction                  4 non-null      float64
 5   General                  4 non-null      float64
 6   Mystery & Detective      4 non-null      float64
 7   Other                    4 non-null      float64
 8   Science Fiction          4 non-null      float64
 9   Self-Help                4 non-null      float64
 10  Thrillers                4 non-null      float64
dtypes: float64(11)
memory usage: 384.0+ bytes
