# Data Wrangling using Pandas

## Dataset Description: Google Books
This data was acquired from the Google Books store using the Google API. The features listed below were gathered for each book in the dataset. The column names are mostly self-explanatory; nevertheless, each one is explained below.

- title : the title of the book.
- authors : the name(s) of the book's author(s) (may include more than one author).
- language : the language of the book.
- generes\categories : the categories associated with the book (as assigned by the Google Books store).
- rating\averageRating : the average rating of the book, out of 5.
- maturityRating : whether the content of the book is for a MATURE or NOT MATURE audience.
- publisher : the name of the publisher.
- publishedDate : when the book was published.
- pageCount : the number of pages in the book.
- voters : the number of users who rated the book.
- ISBN : the unique identifier for each book.
- description : a brief introductory description of the book.
- price : the price of the book on the Google Books store.
- currency : the currency of the price on the Google Books store.

### Tasks:
- Load the dataset
  - Load google_books_1299.csv into a pandas DataFrame.
- Analyze genre distribution
  - Display the value counts of the generes column to understand the distribution of book genres.
- Process genres
  - Split the generes column by the comma (,) delimiter.
  - Compute the frequency of each genre.
  - Retain the top 10 most frequent genres, and replace all other genres with Other.
  - If multiple Other genres appear in the same row, retain only one Other.
- Explode genres
  - Transform the generes column so that each row contains exactly one genre.
  - Example: A row with generes of Fiction, Mystery will be split into two rows: one with Fiction and the other with Mystery. Refer to the df.explode() documentation for guidance.
- Hyphenate ISBN numbers
  - Use the isbnid library to hyphenate the ISBN numbers in the ISBN column. Refer to the isbn.ipynb notebook as an example. Skip any records with invalid ISBN numbers.
- Extract ISBN components
  - Create two new columns:
    - registration_group: the second part of the hyphenated ISBN.
    - publisher_code: the third part of the hyphenated ISBN.
    - Example: For ISBN 978-1-61262-686-4, registration_group is 1 and publisher_code is 61262.
- Create a pivot table of ratings
  - Generate a pivot table where:
    - Rows correspond to registration_group.
    - Columns correspond to generes (including Other).
    - Values are the average rating for each combination, rounded to two decimal places.
    - If no ratings exist for a particular combination, fill the value with 0.

### Setup Code (Please run this first to set up the environment)

If you haven't installed `isbnid` yet, you can use the following command:
```bash
pip install isbnid
```

In [None]:
import numpy as np
import pandas as pd
import isbn
from collections import Counter, defaultdict

In [None]:
if __name__ == "__main__":
    CSV_PATH = 'google_books_1299.csv'

### Load Dataset

In [None]:
def load_data(csv_path):
    """
    Load the Google Books dataset from a CSV file into a Pandas DataFrame.
    IN: csv_path, str, path to the CSV file
    OUT: google_books_df, pd.DataFrame
    """
    # Your Code Here
    return google_books_df

In [None]:
if __name__ == "__main__":
    google_books_df = load_data(CSV_PATH)
    display(google_books_df)

### Analyze genre distribution
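
As a rough illustration of the idea, the sketch below counts raw cell values on a small made-up DataFrame and converts the result to a dictionary. The toy rows are assumptions for demonstration only and are not taken from `google_books_1299.csv`.

```python
import pandas as pd

# Toy data (hypothetical); the real column holds comma-separated genre strings.
toy = pd.DataFrame({"generes": ["Fiction", "Fiction , Mystery", "none", "Fiction"]})

# value_counts() tallies each distinct cell value, most frequent first;
# to_dict() turns the resulting Series into a {value: count} dictionary.
genre_counts = toy["generes"].value_counts().to_dict()

for genre, count in genre_counts.items():
    print(f"{genre}: {count}")
```

Note that at this stage a multi-genre cell such as `Fiction , Mystery` is counted as a single value, which is why the genres are split in the next step.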

In [None]:
def get_genre_distribution(df):
    """
    Compute the value counts of the generes column
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: genre_counts, dict, dictionary with genres as keys and their frequencies as values
    """
    # Your Code Here
    return genre_counts

In [None]:
if __name__ == "__main__":
    genre_distribution = get_genre_distribution(google_books_df.copy())
    for genre, count in genre_distribution.items():
        print(f"{genre}: {count}")

### Process genres

- Missing values in the generes column were filled with the string `none`. Replace `none` with NaN so that it is not counted as a genre.
- You may want to replace `&amp,` with `&` before splitting. Otherwise the embedded comma causes genres to be split incorrectly.
- If multiple 'Other' genres appear in the same row, retain only one 'Other'.
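
The steps above can be sketched on a small made-up DataFrame. This is one possible approach under stated assumptions, not the reference solution: the toy rows are invented, `TOP_N` is set to 2 instead of 10 to keep the output readable, and `keep_top` is a hypothetical helper name.

```python
from collections import Counter

import pandas as pd

# Toy data (hypothetical), shaped like the generes column.
toy = pd.DataFrame({
    "generes": [
        "Fiction, Mystery",
        "Fiction, Arts &amp, Crafts",
        "none",
        "Poetry, Drama, Mystery",
        "Fiction",
    ]
})
TOP_N = 2  # the assignment asks for 10; 2 keeps the toy example small

# 1. Treat the 'none' placeholder as missing (mask() inserts NaN where True).
genres = toy["generes"].mask(toy["generes"] == "none")

# 2. Repair the escaped ampersand so its comma is not used as a delimiter.
genres = genres.str.replace("&amp,", "&", regex=False)

# 3. Split on commas and strip surrounding whitespace; NaN rows stay NaN.
genres = genres.str.split(",").apply(
    lambda parts: [p.strip() for p in parts] if isinstance(parts, list) else parts
)

# 4. Count every genre across all rows and keep the TOP_N most frequent.
counts = Counter(g for parts in genres.dropna() for g in parts)
top = {g for g, _ in counts.most_common(TOP_N)}


def keep_top(parts):
    """Map non-top genres to 'Other' and drop repeated entries, keeping order."""
    if not isinstance(parts, list):
        return parts
    mapped = [g if g in top else "Other" for g in parts]
    return list(dict.fromkeys(mapped))


toy["generes"] = genres.apply(keep_top)
print(toy)
```

`dict.fromkeys` removes every repeated entry in a row, not only repeated `Other` values; for genre lists that is usually harmless, but it is a design choice worth being aware of.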

In [None]:
def process_genres(df):
    """
    Process the generes column
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: df, pd.DataFrame, dataframe with processed generes
    """
    # Replace 'none' with NaN
    # Your Code Here

    # Replace '&amp,' with '&'
    # Your Code Here

    # Split the generes column by the comma (,) delimiter
    # Your Code Here
    
    # Compute the frequency of each genre
    # Your Code Here
    
    # Retain the top 10 most frequent genres
    # Your Code Here
    
    # Replace all other genres with Other
    # Your Code Here
    return df

In [None]:
if __name__ == "__main__":
    google_books_processed_genres = process_genres(google_books_df.copy())
    display(google_books_processed_genres)

### Explode genres
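
A minimal sketch of `DataFrame.explode` on made-up data, assuming the generes column already holds Python lists (as it would after splitting). The titles and genres here are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical): each cell of 'generes' is a list of genres.
toy = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "generes": [["Fiction", "Mystery"], ["Other"]],
})

# explode() emits one row per list element and repeats the other columns.
# The original index labels are repeated too, so resetting the index is optional.
exploded = toy.explode("generes").reset_index(drop=True)
print(exploded)
```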

In [None]:
def explode_genres(df):
    """
    Transform the generes column so that each row contains exactly one genre
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: df, pd.DataFrame, dataframe with exploded generes
    """
    # Your Code Here
    return df

In [None]:
if __name__ == "__main__":
    google_books_exploded = explode_genres(google_books_processed_genres.copy())
    display(google_books_exploded)

### Hyphenate ISBN numbers

- Note that some ISBN values are not valid. You can convert invalid ISBN values to NaN.
- Typical usage of the `isbnid` library is as follows:
    ```python
    import isbn # note that the import is 'isbn', not 'isbnid'
    isbn_val = isbn.ISBN('9781612626864')
    isbn_val.hyphen()
    ```
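
One way to convert invalid values to NaN is to wrap the hyphenation call in a small helper that catches failures. The sketch below is an assumption-laden illustration: `safe_hyphenate` and `toy_hyphenate` are hypothetical names, and `toy_hyphenate` is only a stand-in so the example runs without `isbnid` installed. Its fixed split positions happen to fit the one example ISBN and are not how real hyphenation works. Because the exception type raised by the library for invalid input is not stated here, the helper catches `Exception` broadly.

```python
import numpy as np
import pandas as pd


def toy_hyphenate(raw):
    """Hypothetical stand-in for isbn.ISBN(raw).hyphen(); not a real hyphenator."""
    if len(raw) != 13 or not raw.isdigit():
        raise ValueError(f"invalid ISBN: {raw!r}")
    return f"{raw[:3]}-{raw[3]}-{raw[4:9]}-{raw[9:12]}-{raw[12]}"


def safe_hyphenate(value, hyphenate=toy_hyphenate):
    """Return the hyphenated ISBN, or NaN if the value is missing or rejected."""
    if pd.isna(value):
        return np.nan
    try:
        return hyphenate(str(value))
    except Exception:  # broad on purpose: the library's error type is assumed unknown
        return np.nan


toy = pd.DataFrame({"ISBN": ["9781612626864", "not-an-isbn", None]})
toy["ISBN"] = toy["ISBN"].apply(safe_hyphenate)
print(toy)
```

With `isbnid` installed, the same helper could be called with `hyphenate=lambda s: isbn.ISBN(s).hyphen()` in place of the stand-in.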

In [None]:
def hyphenate_isbn(df):
    """
    Hyphenate the ISBN numbers in the ISBN column
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: df, pd.DataFrame, dataframe with hyphenated ISBNs
    """
    # Your Code Here
    return df

In [None]:
if __name__ == "__main__":
    google_books_hyphenated = hyphenate_isbn(google_books_exploded.copy())
    display(google_books_hyphenated)

### Extract ISBN components

- Make sure to skip rows with invalid ISBN numbers.
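
A possible sketch of the extraction on made-up data, assuming invalid ISBNs were already converted to NaN in the previous step. It splits only the valid rows on `-` and relies on pandas index alignment so that skipped rows receive NaN in the new columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): hyphenated ISBNs, with NaN marking an invalid one.
toy = pd.DataFrame({"ISBN": ["978-1-61262-686-4", np.nan, "978-0-306-40615-7"]})

# Boolean mask selecting rows that hold a valid (non-missing) ISBN.
valid = toy["ISBN"].notna()

# expand=True returns one column per hyphen-separated part (0, 1, 2, ...).
parts = toy.loc[valid, "ISBN"].str.split("-", expand=True)

# Assignment aligns on the index, so rows absent from 'parts' become NaN.
toy["registration_group"] = parts[1]
toy["publisher_code"] = parts[2]
print(toy)
```

The extracted components are strings, so a registration group such as `0` is kept as `"0"` rather than being converted to a number.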

In [None]:
def extract_isbn_components(df):
    """
    Create two new columns: registration_group and publisher_code
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: df, pd.DataFrame, dataframe with new ISBN component columns
    """
    # Mask to select only rows with valid ISBNs
    # Your Code Here

    # Update registration_group and publisher_code only for valid ISBNs
    # Your Code Here

    return df

In [None]:
if __name__ == "__main__":
    google_books_with_isbn_components = extract_isbn_components(google_books_hyphenated.copy())
    display(google_books_with_isbn_components)

### Create a pivot table of ratings

- Round the values to two decimal places.
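
The pivot described in the task list can be sketched with `pd.pivot_table` on made-up data. The column name `rating` is an assumption (the dataset description lists it as `rating\averageRating`), and the toy values are invented.

```python
import pandas as pd

# Toy data (hypothetical): one row per (book, genre) pair after exploding.
toy = pd.DataFrame({
    "registration_group": ["0", "0", "1", "1"],
    "generes": ["Fiction", "Fiction", "Fiction", "Other"],
    "rating": [4.0, 4.5, 3.333, 5.0],
})

# Mean rating per (registration_group, genre); combinations with no data get 0.
pivot = pd.pivot_table(
    toy,
    index="registration_group",
    columns="generes",
    values="rating",
    aggfunc="mean",
    fill_value=0,
).round(2)
print(pivot)
```

Filling with 0 makes an empty combination indistinguishable from a true average of 0, which is acceptable here because the task explicitly asks for it.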

In [None]:
def create_rating_pivot_table(df):
    """
    Generate a pivot table of ratings
    IN: df, pd.DataFrame, dataframe containing book information
    OUT: pivot_table, pd.DataFrame, pivot table of ratings
    """
    # Your Code Here
    return pivot_table

In [None]:
if __name__ == "__main__":
    rating_pivot_table = create_rating_pivot_table(google_books_with_isbn_components.copy())
    display(rating_pivot_table)

### DataFrame Schema

In [None]:
if __name__ == "__main__":
    google_books_processed_genres.info()

In [None]:
if __name__ == "__main__":
    google_books_exploded.info()

In [None]:
if __name__ == "__main__":
    google_books_hyphenated.info()

In [None]:
if __name__ == "__main__":
    google_books_with_isbn_components.info()

In [None]:
if __name__ == "__main__":
    rating_pivot_table.info()