I tried substituting list comprehensions in the places where I use lambdas, but found that the comprehensions were much slower - like 2-20x slower. We always talk about using comprehensions instead of lambdas.
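The note above does not record how the timings were taken, so here is a minimal sketch of one way to compare the two styles with `timeit`. It runs on a plain Python list with a made-up doubling operation (both are assumptions, not the notebook's actual workload); on a DataFrame column the numbers could look quite different.

```python
import timeit

data = list(range(10_000))  # stand-in data, not the book dataset

def with_lambda():
    # map() calls the lambda once per element
    return list(map(lambda x: x * 2, data))

def with_comprehension():
    # The comprehension evaluates the expression inline
    return [x * 2 for x in data]

# Both approaches must give the same result before the timings mean anything
assert with_lambda() == with_comprehension()

lambda_time = timeit.timeit(with_lambda, number=20)
comp_time = timeit.timeit(with_comprehension, number=20)
print(f"lambda: {lambda_time:.4f}s  comprehension: {comp_time:.4f}s")
```

Timing the real cleaning step the same way (same input, same output check, several repetitions) would show whether the 2-20x gap comes from the comprehension itself or from something else in the rewritten code.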

# Cleaning & Normalization To Do List  ✅

- ✅ Check if 'bookid' is unique - yes, but in the end I think we will drop it in favor of a serial ID
- ✅ Bookid is not unique; figure out what to do - dropped duplicates. The difference was usually in price
- ✅ Pull ISBN from description field into ISBN field, if the ISBN is null or the 9999999999999 placeholder
- NOTE ON ISBN INFO IN DESCRIPTION FIELD: I cannot currently find a good way to eliminate duplicate ISBN information without also eliminating some useful information from the description field
- ✅ Split out extra people from author field (multiple authors)
- ✅ Split out all list columns into separate tables
- ✅ Sort out cases where publish date is earlier than first publish date 
- ✅ Check dates - some mislabeled as 2000 instead of 1900
- ✅ Check for nulls, fill if possible?
- ✅ Strip extra spaces
- ✅ Eliminate duplicate entries
- ✅ Remove all non-numeric chars from numeric fields
- ✅ Standardize as many fields as possible
- ✅ Look up best way to normalize/handle ratings by stars
- ✅ Clean pages field of edition df
- ✅ Clean price field of edition df

- ✅ Create dfs
- ✅ Set dtypes
- ✅ Create tables
- ✅ Create relationships
- ✅ Load data
- Create read user
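
Two of the cleaning items above ("Strip extra spaces" and "Remove all non-numeric chars from numeric fields") can be sketched as small helpers. This is only an illustration, not the code actually used for the cleanup; the function names are hypothetical, and each helper could be applied to a column with something like `Series.map`.

```python
import re

def strip_extra_spaces(value):
    """Collapse runs of whitespace and trim both ends; non-strings pass through."""
    if not isinstance(value, str):
        return value
    return re.sub(r'\s+', ' ', value).strip()

def digits_only(value):
    """Keep only the digits of a string; return None when no digits remain."""
    if not isinstance(value, str):
        return value
    digits = re.sub(r'\D', '', value)
    return digits or None

print(strip_extra_spaces("  The   Hobbit "))
print(digits_only("352 pages"))
```

Note that `digits_only` also removes decimal points, so it suits a field like pages but would need adjusting for a price field.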


Maybe: (?)
- Create views
- Create indexes
- Create triggers
- Create functions
- Create procedures
- Create roles





# Questions to ask stakeholders/users:
- Should we assume that rating is an average of the star ratings? This is not currently precisely true of the data. (There are some rows where there is a rating, but no star ratings)
- Is it safe to assume that the order of star ratings was 5, 4, 3, 2, 1?
- Should we clean up the series_num field? There are several instances where one title contains multiple joined instances from a series of smaller works.
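
The first two questions can be checked against the data before asking. As a rough sketch, assuming the star counts are ordered 5, 4, 3, 2, 1 (exactly the assumption in question), the implied average can be computed and compared with the stored rating. The function name is hypothetical.

```python
def weighted_star_average(star_counts):
    """Average rating implied by star counts ordered 5, 4, 3, 2, 1 (assumed order)."""
    total = sum(star_counts)
    if total == 0:
        return None  # a rating with no star counts cannot be checked
    weights = (5, 4, 3, 2, 1)
    return sum(w * c for w, c in zip(weights, star_counts)) / total

# Made-up counts: 50 five-star, 30 four-star, 10 three-star, 5 two-star, 5 one-star
print(weighted_star_average([50, 30, 10, 5, 5]))
```

If the implied average is close to the stored rating for most rows under this order, but not under the reversed order, that would support the 5-to-1 assumption.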

# Analysis Notes

-  Regarding the "best books": the ratings seem immaterial. Sales are what matter, right? We don't have data on sales. What is a good proxy for sales?
Probably number of ratings, right?

- Number of ratings probably can't be the only metric though, bc then the list would just be the most popular books. Need to combine it with something else. Create a composite metric?

- There are 3-6 entries with duplicate ISBNs (depending on how you count) that are not '9999999999999' placeholders. I have not dropped them from the dfs, bc they have ratings data that might need to be aggregated across them. Instead, I have flagged them with a boolean column `is_duplicate_isbn`.
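
The duplicate-ISBN flag described in the last note could be computed along these lines. This is a sketch on a plain list rather than the actual dfs. The placeholder value is taken from the ISBN extraction code, and the helper name is hypothetical. In pandas, `Series.duplicated(keep=False)` combined with a placeholder/null filter would probably do the same job.

```python
from collections import Counter

PLACEHOLDER = '9999999999999'  # placeholder ISBN checked by move_isbn

def flag_duplicate_isbns(isbns):
    """Return one boolean per ISBN: True when a real (non-placeholder) ISBN repeats."""
    counts = Counter(i for i in isbns if i and i != PLACEHOLDER)
    return [counts.get(i, 0) > 1 for i in isbns]

sample = ['111', '222', '111', None, PLACEHOLDER, PLACEHOLDER]
print(flag_duplicate_isbns(sample))
```

Every member of a duplicate group is flagged, not only the later copies, so the ratings data can still be aggregated across the whole group.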

Side bar: thinking about what to use as the "best" book metric. Is there a correlation between the number of ratings and the average rating?

In [None]:
num_ratings_corr = df['numRatings'].corr(df['rating'])
print(f"Number of ratings and ratings correlation: {num_ratings_corr}")
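
For the composite-metric idea, one common option (not something decided in these notes) is a damped average that pulls a book's rating toward an overall mean when it has few ratings. The function name and both prior parameters are assumptions that would need tuning against the data.

```python
def damped_rating(rating, num_ratings, prior_mean, prior_weight):
    """Shrink a book's average rating toward prior_mean when it has few ratings.

    prior_weight acts like a number of 'virtual' ratings at prior_mean
    and must be greater than zero.
    """
    return (num_ratings * rating + prior_weight * prior_mean) / (num_ratings + prior_weight)

# A 5.0 book with 100 ratings, a prior mean of 4.0 and a prior weight of 100
print(damped_rating(5.0, 100, 4.0, 100))
```

Books with many ratings keep nearly their own average, while books with only a handful land near the prior mean, so popularity and rating both contribute to the ranking.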


# Normalization Notes:

1. First Normal Form (1NF)

- It only has atomic (indivisible) values. In other words, each cell should contain only one value, not a set of values or empty sets.
- Entries in a column are of the same kind. Each column should be of the same type (numeric, text, date, etc.).
- Each column in a table should represent a single attribute of the entity modelled by the table (e.g., a 'car' table might have separate columns for 'make', 'model', 'year', etc.)
- Order in which data is saved does not matter.
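
The atomic-value rule is what motivates splitting list columns into separate tables. A minimal sketch on plain dicts follows; the `genres` column and the sample rows are hypothetical, and in pandas `DataFrame.explode` would likely be the natural tool.

```python
def split_list_column(rows, key, list_column):
    """Return (parent_rows, child_rows).

    Parent rows drop the list column; child rows hold one
    (key, value) pair per element of the original list.
    """
    parents, children = [], []
    for row in rows:
        parents.append({k: v for k, v in row.items() if k != list_column})
        for value in row.get(list_column) or []:
            children.append({key: row[key], list_column: value})
    return parents, children

books = [
    {'bookid': 1, 'title': 'A', 'genres': ['Fantasy', 'Classics']},
    {'bookid': 2, 'title': 'B', 'genres': []},
]
book_rows, genre_rows = split_list_column(books, 'bookid', 'genres')
print(book_rows)
print(genre_rows)
```

Each cell in both results now holds a single value, and a book with an empty list simply has no rows in the child table.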

--

2. Second Normal Form (2NF)

- It is in 1NF.
- All non-key columns are fully dependent on the primary key. A non-key column must be functionally dependent on the entire set of primary key(s). There should be no partial dependency.
- In other words, if a column is dependent on only part of a multi-part primary key, then the table fails 2NF.

--

3. Third Normal Form (3NF)

- It is in 2NF.
- It has no transitive dependencies. A transitive dependency occurs when a non-key column is dependent on another non-key column, which is dependent on the primary key.
- Every non-key attribute must be functionally dependent on the primary key directly and not through some other non-key attributes.
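
Partial and transitive dependencies both come down to one column determining another. The sketch below checks whether such a dependency holds in a sample of rows. The column names are hypothetical, and a dependency that holds in the data is only evidence that it exists, not proof.

```python
def determines(rows, determinant, dependent):
    """True if every value of `determinant` maps to exactly one value of `dependent`."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True

rows = [
    {'publisher': 'P1', 'publisher_city': 'X'},
    {'publisher': 'P1', 'publisher_city': 'X'},
    {'publisher': 'P2', 'publisher_city': 'Y'},
]
print(determines(rows, 'publisher', 'publisher_city'))
```

If a non-key column such as the hypothetical `publisher` determines another non-key column, the pair is a candidate for its own table under 3NF.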



In [None]:
import re
import pandas as pd

def move_isbn(df):
    """
    Extract and move ISBN from the description to the ISBN field if the ISBN is '9999999999999' or null.

    The function applies various regex patterns to identify ISBNs from the description field and 
    moves them to the ISBN field.

    :param df: A DataFrame containing 'isbn' and 'description' columns
    :return: The modified DataFrame
    :raises ValueError: If 'isbn' or 'description' columns are missing
    """

    if 'isbn' not in df.columns or 'description' not in df.columns: # Check if required columns are present
        raise ValueError("'isbn' or 'description' column is missing in the DataFrame.")

    # 13- or 10-digit runs (optionally with one neighbouring non-digit), ASIN-style codes, or hyphenated 978- ISBNs
    isbn_pattern = re.compile(r'((?:\D)?(\d{13})(?:\D)?|(?:\D)?(\d{10})(?:\D)?|B\d{4}[A-Z]{3}\d{1}[A-Z]|978-\d[-\d]{9,13})')

    mask = (df['isbn'] == '9999999999999') | pd.isnull(df['isbn']) # Rows with the placeholder ISBN or a null ISBN
    descriptions = df.loc[mask, 'description'].astype(str)        # Descriptions for those rows only

    extracted_isbns = descriptions.str.extract(isbn_pattern)[0].str.replace('-', '', regex=False) # Full match (group 1) with dashes removed
    is_asin = extracted_isbns.str.match(r'B\d{4}[A-Z]{3}\d[A-Z]$', na=False) # ASIN-style matches must keep their leading and trailing letters
    trimmed = extracted_isbns.str.replace(r'^\D|\D$', '', regex=True) # Drop one stray non-digit from each end; NaN (no match) stays NaN
    extracted_isbns = trimmed.where(~is_asin, extracted_isbns)

    df.loc[mask, 'isbn'] = extracted_isbns.fillna(df.loc[mask, 'isbn']) # Fill in matches; rows without a match keep their original value

    return df  # Return modified DataFrame



import re
import pandas as pd

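def move_isbn(row):
    """
    Row-wise variant of move_isbn, intended for DataFrame.apply(..., axis=1).

    Note: this definition replaces the DataFrame-level move_isbn above when both are run.
    """
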
    if 'isbn' not in row or 'description' not in row:
        raise ValueError("'isbn' or 'description' column is missing in the row.")

    isbn_pattern = re.compile(r'((?:\D)?(\d{13})(?:\D)?|(?:\D)?(\d{10})(?:\D)?|B\d{4}[A-Z]{3}\d{1}[A-Z]|978-\d[-\d]{9,13})')

    if row['isbn'] == '9999999999999' or pd.isnull(row['isbn']):
        if isinstance(row['description'], str):
            isbn_match = re.search(isbn_pattern, row['description'])
            if isbn_match:
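                match = isbn_match.group(1).replace('-', '')
                # Trim one stray non-digit from each end of a 10/13-digit run,
                # but leave ASIN-style codes (B####AAA#A) intact
                if not re.fullmatch(r'B\d{4}[A-Z]{3}\d[A-Z]', match):
                    if not match[0].isdigit():
                        match = match[1:]
                    if not match[-1].isdigit():
                        match = match[:-1]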
                row['isbn'] = match

    return row

# Usage:
try:
    df = df.apply(move_isbn, axis=1)
except ValueError as e:
    print(e)
