# GoodReads ML Recommendations

## About
- _Description of plan/approach_

### Ingest data
[Kaggle data source](https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m)
- The data is downloaded and unzipped with the Kaggle API.
    - All `user_rating_*.csv` files are then removed.

#### All Imports

In [1]:
import re
import os
import glob
import warnings
import spacy.cli
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from kaggle.api.kaggle_api_extended import KaggleApi

# Warning suppression
warnings.filterwarnings('ignore')

# Download the spaCy `en_core_web_lg` model and load it
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


#### Load Kaggle data with Kaggle API
- [Follow these instructions](https://python.plainenglish.io/how-to-use-the-kaggle-api-in-python-4d4c812c39c7) to get a `kaggle.json` API key.
    - If authentication fails, the error message states where the `.kaggle/kaggle.json` file should go (typically `~/.kaggle/kaggle.json`).
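
Before calling `api.authenticate()`, it can help to check that the key file is where the client is expected to look. The sketch below is a minimal check, not part of the original notebook. It assumes the default location `~/.kaggle/kaggle.json` and that the `KAGGLE_CONFIG_DIR` environment variable, when set, overrides the directory. The helper name `kaggle_key_path()` is hypothetical.

```python
import os
from pathlib import Path


def kaggle_key_path() -> Path:
    """Return the path where kaggle.json is assumed to live (hypothetical helper)."""
    # Assumption: KAGGLE_CONFIG_DIR, if set, overrides the default ~/.kaggle directory.
    config_dir = os.environ.get("KAGGLE_CONFIG_DIR")
    base = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return base / "kaggle.json"


key_path = kaggle_key_path()
print(f"{key_path} exists: {key_path.is_file()}")
```

If the file is reported as missing, move the downloaded `kaggle.json` to the printed path and rerun the authentication cell.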

In [2]:
# Kaggle API authentication
api = KaggleApi()
api.authenticate()

# Download and unzip all files
api.dataset_download_files('bahramjannesarr/goodreads-book-datasets-10m',
                           path='./data',
                           unzip=True)

# Remove `user_rating` data files
!rm data/user_rating_*.csv


#### Combine into one large dataset
- Remove rows where `Description` is null

In [None]:
# Concat all files
book_r_0 = pd.concat(map(pd.read_csv, glob.glob('./data/book*.csv')))

In [None]:
# Remove row if `Description` is NaN
book_rating = book_r_0.copy()
book_rating = book_rating.dropna(axis=0, subset=['Description'])

# Save column names to variable
book_rating_col = book_rating.columns

In [None]:
book_rating.head(1)

In [None]:
book_rating.isnull().sum(axis=0)

### Data Cleaning

#### Cleaning functions
1. `clean_ratings()`**:**
Remove the star label from a `RatingDist` value (e.g. '5:10', a 5-star rating with 10 votes, becomes 10). With the `x` option set,
remove 'total:' from a `RatingDistTotal` value instead. The result is cast to int in both cases.
    - **Input**
        - *String*
    - **Options**
        - `x=True` Strip the 'total:' prefix; by default the star label is stripped
    - **Output**
        - *Int*

2. `clean_tags()`**:**
Remove any HTML markup tags from the text
    - **Input**
        - *String*
    - **Output**
        - *String*

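3. `tokenize()`**:**
Lemmatize `Description` with spaCy, dropping stop words and punctuation. This step is applied inline in the Tokenize section rather than defined as a function.
    - **Input**
        - *String*
    - **Output**
        - *List of strings*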

In [None]:
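def clean_ratings(raw_txt, x=None):
    """Return the vote count from a `RatingDist*` string as an int."""
    if x is not None:
        # 'total:1234' -> 1234
        return int(re.sub(r'^total:', '', raw_txt))
    else:
        # '5:10' -> 10 (strip the star label and the colon)
        return int(re.sub(r'^[0-9]:', '', raw_txt))

def clean_tags(raw_txt):
    """Strip HTML markup and return the plain text."""
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(raw_txt, 'html.parser')
    return soup.get_text()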
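
To make the two rating formats concrete, here is a small standalone sketch that parses both in one pass and rejects anything else. It is an illustration under the assumption that values look like '5:10' or 'total:1234', as described above. The function name `parse_rating_count()` is hypothetical and is not the notebook's `clean_ratings()`.

```python
import re

# Assumed formats: '<star 1-5>:<count>' or 'total:<count>'.
_RATING_RE = re.compile(r'(?:[1-5]|total):(\d+)')


def parse_rating_count(raw_txt: str) -> int:
    """Return the vote count from a rating string (hypothetical helper)."""
    match = _RATING_RE.fullmatch(raw_txt.strip())
    if match is None:
        # Fail loudly on unexpected input instead of returning a wrong number.
        raise ValueError(f"unexpected rating format: {raw_txt!r}")
    return int(match.group(1))


print(parse_rating_count('5:10'))        # star label removed
print(parse_rating_count('total:1234'))  # 'total:' prefix removed
```

Validating the whole string, rather than deleting a fixed number of characters, means a malformed value raises an error at cleaning time instead of silently producing a wrong count.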

#### Cleaning actions

In [None]:
# Copy df
book_rating_cpy = book_rating.copy()
# Drop `Id` and the other columns listed here
book_rating_cpy = book_rating_cpy.drop(columns=['Id', 'Count of text reviews',
                                                'pagesNumber', 'PagesNumber',
                                                'Language'])
# Strip the 'total:' prefix and cast to int
book_rating_cpy['RatingDistTotal'] = book_rating_cpy['RatingDistTotal'].apply(lambda s: clean_ratings(s, x=True))

# Strip markup from the text columns
txt_col = ['Name', 'Authors', 'Description']
for col in txt_col:
    book_rating_cpy[col] = book_rating_cpy[col].apply(clean_tags)

# Strip the star label from each per-star count and cast to int
lst_col = ['RatingDist1', 'RatingDist2', 'RatingDist3', 'RatingDist4', 'RatingDist5']
for col in lst_col:
    book_rating_cpy[col] = book_rating_cpy[col].apply(clean_ratings)

#### Tokenize

In [None]:
# Lemmatize each description, dropping stop words and punctuation
book_rating_cpy['Description.Tokens'] = book_rating_cpy['Description'].apply(
    lambda text: [token.lemma_ for token in nlp(text)
                  if not token.is_stop and not token.is_punct])

In [None]:
book_rating_cpy.head(10)