# GoodReads ML Recommendations

### Ingest data
[Kaggle data source](https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m)
- Data was downloaded and unzipped using the Kaggle API.
    - All `user_rating_*.csv` files were then removed.

#### All Imports

In [106]:
import re
import os
import glob
import gensim
import joblib
import warnings
import spacy.cli
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import gensim.corpora as corpora
from sklearn.pipeline import Pipeline
from gensim.models import CoherenceModel
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from kaggle.api.kaggle_api_extended import KaggleApi
from sklearn.feature_extraction.text import TfidfVectorizer


# Warning suppression
warnings.filterwarnings('ignore')

# Download Spacy and initialize
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


#### Load Kaggle data with Kaggle API
- [Follow these instructions](https://python.plainenglish.io/how-to-use-the-kaggle-api-in-python-4d4c812c39c7) to get a `kaggle.json` API key.
    - If authentication fails, read the error message to find where the `.kaggle/kaggle.json` file should go.

In [2]:
# Kaggle API authentication
api = KaggleApi()
api.authenticate()

# Download and unzip all files
api.dataset_download_files('bahramjannesarr/goodreads-book-datasets-10m',
                           path='./data',
                           unzip=True)

# Remove `user_rating` data files
!rm data/user_rating_*.csv

zsh:1: no matches found: data/user_rating_*.csv
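
zsh prints `no matches found` when a glob pattern matches no files, so the `!rm` line fails whenever the `user_rating` files are absent or named differently. A shell-independent alternative is to do the removal from Python with `glob` and `os.remove`. The sketch below runs against a temporary directory with made-up file names (assumptions for illustration only), not the real `./data` folder; `remove_matching` is a hypothetical helper.

```python
import glob
import os
import tempfile


def remove_matching(folder, pattern):
    """Delete files in `folder` that match `pattern`; return how many were removed."""
    removed = 0
    for path in glob.glob(os.path.join(folder, pattern)):
        os.remove(path)
        removed += 1
    return removed


with tempfile.TemporaryDirectory() as data_dir:
    # Made-up file names standing in for the downloaded dataset
    for name in ["book1-100k.csv",
                 "user_rating_0_to_1000.csv",
                 "user_rating_1000_to_2000.csv"]:
        open(os.path.join(data_dir, name), "w").close()

    first_pass = remove_matching(data_dir, "user_rating_*.csv")
    # A second call finds nothing and simply returns 0 instead of raising
    second_pass = remove_matching(data_dir, "user_rating_*.csv")
    remaining = sorted(os.listdir(data_dir))
```

Because `glob.glob` returns an empty list when nothing matches, rerunning the cell is harmless.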


#### Combine into one large dataset
- Remove rows where `Description` is null

In [3]:
# Concat all files
book_r_0 = pd.concat(map(pd.read_csv, glob.glob('./data/book*.csv')))

In [4]:
# Remove row if `Description` is NaN
book_rating = book_r_0.copy()
book_rating = book_rating.dropna(axis=0, subset=['Description'])
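# NOTE: reset_index() returns a new frame; the result is only displayed here,
# so `book_rating` itself keeps the per-file index labels from pd.concat.
book_rating.reset_index(drop=True)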

Unnamed: 0,Id,Name,Authors,ISBN,Rating,PublishYear,PublishMonth,PublishDay,Publisher,RatingDist5,...,RatingDist3,RatingDist2,RatingDist1,RatingDistTotal,CountsOfReview,Language,PagesNumber,Description,pagesNumber,Count of text reviews
0,1900511,Barbarossa,Christopher Ailsby,1840138009,3.00,2007,4,1,New Line Books,5:0,...,3:1,2:0,1:0,total:1,0,,192.0,"On 22 June 1941, Adolf Hitler launched Operati...",,
1,1900514,Images of Barbarossa,Christopher Ailsby,0711028257,3.50,2001,1,25,Ian Allan Ltd,5:0,...,3:2,2:1,1:0,total:8,0,,256.0,"On 22 June 1941, Adolf Hitler launched Operati...",,
2,1900520,Romania After 2000: Five New Romanian Plays,Daniel Charles Gerould,0595436560,4.00,2007,9,1,Martin E. Segal Theatre Center Publications,5:1,...,3:1,2:0,1:0,total:6,0,,226.0,The first anthology of new Romanian Drama publ...,,
3,1900521,Global Foreigners: An Anthology of Plays,Saviana Stănescu,1905422423,4.60,2006,12,7,Seagull Books,5:4,...,3:1,2:0,1:0,total:5,0,,320.0,"In Waxing West, Daniella, newly arrived in the...",,
4,1900525,Diary of a Clone,Saviana Stănescu,092338961X,4.80,2003,1,1,Meeting Eyes Bindery,5:4,...,3:0,2:0,1:0,total:5,0,,66.0,Poetry. Translation. DIARY OF A CLONE is a sma...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1171183,1499980,The O'Brien Book of Irish Fairy Tales & Legends,Una Leavy,0862784824,4.24,1996,9,10,O'Brien Press,5:39,...,3:8,2:4,1:0,total:95,1,,,Irish fairy tales and legends are full of ench...,96.0,1.0
1171184,1499988,Irish Folk and Fairy Tales Omnibus Edition,Michael Scott,0751508861,4.19,1989,24,8,Sphere,5:140,...,3:42,2:13,1:4,total:311,10,,,"Here, collected in one volume, are tales and l...",637.0,10.0
1171185,1499990,Robin Hood: The Shaping of the Legend,Jeffrey L. Singman,0313301018,3.00,1998,23,7,Praeger,5:0,...,3:1,2:0,1:0,total:1,0,,,Among the narrative traditions of the Middle A...,224.0,0.0
1171186,1499992,Competing on Value,Mack Hanan,0814450369,3.50,1991,22,4,Amacom,5:2,...,3:2,2:2,1:0,total:8,1,,,Presents a new approach to selling that emphas...,220.0,1.0


In [5]:
book_rating.head(1)

Unnamed: 0,Id,Name,Authors,ISBN,Rating,PublishYear,PublishMonth,PublishDay,Publisher,RatingDist5,...,RatingDist3,RatingDist2,RatingDist1,RatingDistTotal,CountsOfReview,Language,PagesNumber,Description,pagesNumber,Count of text reviews
0,1900511,Barbarossa,Christopher Ailsby,1840138009,3.0,2007,4,1,New Line Books,5:0,...,3:1,2:0,1:0,total:1,0,,192.0,"On 22 June 1941, Adolf Hitler launched Operati...",,


In [6]:
book_rating.isnull().sum(axis=0)

Id                             0
Name                           0
Authors                        0
ISBN                        3166
Rating                         0
PublishYear                    0
PublishMonth                   0
PublishDay                     0
Publisher                   7862
RatingDist5                    0
RatingDist4                    0
RatingDist3                    0
RatingDist2                    0
RatingDist1                    0
RatingDistTotal                0
CountsOfReview                 0
Language                 1021168
PagesNumber               424052
Description                    0
pagesNumber               747136
Count of text reviews     820755
dtype: int64
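
One detail of the combine step is worth illustrating: `pd.concat` keeps each file's own row labels, and `reset_index` only takes effect when its result is assigned. The sketch below uses two tiny made-up frames (not the GoodReads data) to show both points.

```python
import pandas as pd

# Two toy "files", each with its own 0..n index (made-up values)
part_a = pd.DataFrame({"Name": ["A", "B"], "Description": ["x", None]})
part_b = pd.DataFrame({"Name": ["C", "D"], "Description": ["y", "z"]})

combined = pd.concat([part_a, part_b])
labels_repeat = combined.index.has_duplicates   # labels 0 and 1 appear twice

cleaned = combined.dropna(subset=["Description"])

# Calling reset_index without assignment leaves `cleaned` unchanged
cleaned.reset_index(drop=True)
labels_before = list(cleaned.index)

# Assigning the result gives every row a unique label
cleaned = cleaned.reset_index(drop=True)
labels_after = list(cleaned.index)
```

Unique labels matter later on, because `DataFrame.drop` removes every row that shares a given label. Passing `ignore_index=True` to `pd.concat` is another way to get a fresh index.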

### Data Cleaning

#### Cleaning functions
1. `clean_ratings()`**:**
Remove the star label from a `RatingDist` value (e.g. '5:10', a 5-star rating with 10 votes, becomes 10). With the `x` option set,
remove 'total:' from the value instead. The result is cast to int in both cases.
    - **Input**
        - *String*
    - **Options**
        - `x=True` switches to 'total:' removal; the default is star label removal
    - **Output**
        - *Int*

2. `clean_tags()`**:**
Remove any HTML rendering tags from text
    - **Input**
        - *String*
    - **Output**
        - *String*

3. `tokenize()`**:**
Remove dates in common numeric formats (1/1/2000 & 1-1-2000), escaped unicode and newline sequences, and all non-alphabetic chars (digits included).
Lower the case of the output.
    - **Input**
        - *String*
    - **Output**
        - *String*
        
4. `standard_eng()`**:**
Standardize language codes for trimming by language. Unknown and non-English codes will be labeled 'remove', English
codes will be labeled 'eng'.
    - **Input**
        - *String*
    - **Output**
        - *String*
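
A quick sketch of the input/output behaviour described above, on a few sample strings. The helpers here are simplified stand-ins written for illustration (`strip_label` and `strip_non_letters` are hypothetical names, and `'fre'` is an arbitrary non-English code); they are not the notebook's own implementations.

```python
import re

ENG_TYPES = ['en-US', 'eng', 'en-GB', 'en-CA']


def strip_label(raw_txt):
    # Drop everything up to and including the first colon, then cast to int
    return int(raw_txt.split(':', 1)[1])


def standard_eng(code):
    # Map known English codes to 'eng', everything else to 'remove'
    return 'eng' if code in ENG_TYPES else 'remove'


def strip_non_letters(raw_txt):
    # Keep only ASCII letters and spaces, then lower-case
    return re.sub('[^a-zA-Z ]', '', raw_txt).lower()


star_votes = strip_label('5:10')            # star label removed
total_votes = strip_label('total:311')      # 'total:' removed
english = standard_eng('en-GB')
other = standard_eng('fre')
title = strip_non_letters('Diary of a Clone!')
```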

In [7]:
def clean_ratings(raw_txt, x=None):
    if x is not None:
        # 'total:311' -> 311
        return int(re.sub(r'^total:', '', raw_txt))
    else:
        # '5:140' -> 140 (drop the single-digit star label and the colon)
        return int(re.sub(r'^[0-9]:', '', raw_txt))

def is_english(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return 'NaN'
    else:
        return s

def clean_tags(raw_txt):
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(raw_txt, 'html.parser')
    return soup.get_text()


def tokenize(raw_txt):
    """
    Remove dates, escaped newline/hex sequences and every non-letter
    character from text, then lower-case it.
    :Input: Raw string
    :Output: Cleaned, lower-cased string
    """
    # Dates such as 1/1/2000, 1-1-2000, 01/01/2000 or 01-01-2000
    regx_txt_date = r'\d[/-]\d[/-]\d{4}|\d{2}[/-]\d{2}[/-]\d{4}'
    # Remove a backslash that is followed by a line break
    rm_txt_code = re.sub(r"\\\n", '', raw_txt)
    rm_txt_code = re.sub(regx_txt_date, '', rm_txt_code)
    # Remove escaped newlines and hex codes left as literal text (like \\n, \\x33 and \\xe3)
    rm_txt_code = re.sub(r"\\\\n", '', rm_txt_code)
    rm_txt_code = re.sub(r"\\\\[x][a-zA-Z0-9]{2}", '', rm_txt_code)
    # Keep only ASCII letters and spaces
    rm_txt_code = re.sub('[^a-zA-Z ]', '', rm_txt_code)
    return rm_txt_code.lower()


eng_types = ['en-US', 'eng', 'en-GB', 'en-CA']
def standard_eng(x):
    if x in eng_types:
        return 'eng'
    else:
        return 'remove'


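def word_vect(docs):
    # One document vector per text; the vector width depends on the loaded
    # spaCy model, so no fixed reshape is applied
    return [nlp(doc).vector for doc in docs]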



#### Cleaning actions

In [8]:
# Copy df
book_rating_cpy = book_rating.copy()
book_rating_cpy['Language'] = book_rating_cpy['Language'].fillna('eng')

# Standardize language tags, drop renamed rows (renamed to 'remove')
book_rating_cpy['Language'] = book_rating_cpy['Language'].apply(standard_eng)
# NOTE: drop() removes rows by index label, so any row sharing a label with a 'remove' row is dropped too
book_rating_cpy = book_rating_cpy.drop(book_rating_cpy[book_rating_cpy.Language == 'remove'].index)

# Remove columns that are no longer needed
book_rating_cpy = book_rating_cpy.drop(columns=['Id', 'Count of text reviews',
                                                'pagesNumber', 'PagesNumber',
                                                'Language'])
# Clean columns
book_rating_cpy['RatingDistTotal'] = book_rating_cpy['RatingDistTotal'].apply(lambda x: clean_ratings(x, x=True))

txt_col = ['Name', 'Authors', 'Description']
for col in txt_col:
    book_rating_cpy[col] = book_rating_cpy[col].apply(lambda x: clean_tags(x))
    
for col in txt_col:
    book_rating_cpy[col] = book_rating_cpy[col].apply(lambda x: tokenize(x))

lst_col = ['RatingDist1', 'RatingDist2', 'RatingDist3', 'RatingDist4', 'RatingDist5']
for col in lst_col:
    book_rating_cpy[col] = book_rating_cpy[col].apply(lambda x: clean_ratings(x))

In [9]:
book_rating_cpy.shape

(978089, 16)

In [20]:
book_rating_cpy.head(5)

Unnamed: 0,Name,Authors,ISBN,Rating,PublishYear,PublishMonth,PublishDay,Publisher,RatingDist5,RatingDist4,RatingDist3,RatingDist2,RatingDist1,RatingDistTotal,CountsOfReview,Description,Description.Tokens
0,barbarossa,christopher ailsby,1840138009,3.0,2007,4,1,New Line Books,0,0,1,0,0,1,0,on june adolf hitler launched operation barb...,"[ , june, , adolf, hitler, launch, operation,..."
3,romania after five new romanian plays,daniel charles gerould,0595436560,4.0,2007,9,1,Martin E. Segal Theatre Center Publications,1,4,1,0,0,6,0,the first anthology of new romanian drama publ...,"[anthology, new, romanian, drama, publish, uni..."
4,global foreigners an anthology of plays,saviana stnescu,1905422423,4.6,2006,12,7,Seagull Books,4,0,1,0,0,5,0,in waxing west daniella newly arrived in the u...,"[wax, west, daniella, newly, arrive, romania, ..."
5,diary of a clone,saviana stnescu,092338961X,4.8,2003,1,1,Meeting Eyes Bindery,4,1,0,0,0,5,0,poetry translation diary of a clone is a smart...,"[poetry, translation, diary, clone, smart, mov..."
7,the challenge of carl schmitt,chantal mouffe,1859847048,3.44,1999,9,17,Verso,4,22,19,2,3,50,0,carl schmitts thought serves as a warning agai...,"[carl, schmitt, thought, serve, warning, dange..."


#### Tokenize

In [10]:
# Lemmatize each description, skipping stop words and punctuation
book_rating_cpy['Description.Tokens'] = book_rating_cpy['Description'].apply(
    lambda text: [token.lemma_ for token in nlp(text)
                  if not token.is_stop and not token.is_punct]
)

442888

In [92]:
doc_lim = 100000
id2word = corpora.Dictionary(book_rating_cpy['Description.Tokens'].iloc[:doc_lim])
corpus = [id2word.doc2bow(list_of_token) for list_of_token in book_rating_cpy['Description.Tokens'].iloc[:doc_lim]]

In [93]:
def compute_vals(id2word, corpus, texts, start, lim, step):
    coherence_vals = []
    model_lst = []
    
    for num_topics in range(start, lim, step):
        mc_lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus, 
                                                        id2word=id2word,
                                                        num_topics=num_topics,
                                                        chunksize=200,
                                                        passes=10,
                                                        per_word_topics=True)
        model_lst.append(mc_lda)
        coh_mod = CoherenceModel(model=mc_lda, texts=texts, dictionary=id2word, coherence='c_v')
        coherence_vals.append(coh_mod.get_coherence_per_topic())

    return model_lst, coherence_vals
    

In [94]:
model_list, coherence_vals = compute_vals(id2word, 
                                          corpus, 
                                          texts=book_rating_cpy['Description.Tokens'].iloc[:doc_lim], 
                                          start=5, 
                                          lim=50, 
                                          step=5)

KeyboardInterrupt: 

In [None]:
best_model = model_list[1]
mod_topics = best_model.show_topics(formatted=True)
print(mod_topics)

In [None]:
best_model.print_topics(num_words=100)
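
`compute_vals` returns one list of per-topic coherence scores for each candidate model, so one possible way to choose a model (instead of a fixed list position) is to average each list and take the highest mean. The sketch below assumes that selection rule and uses randomly generated stand-in scores; none of the numbers come from a real run.

```python
import numpy as np

# Same grid as the call above: start=5, lim=50, step=5
topic_counts = list(range(5, 50, 5))

# Stand-in coherence scores (random, for illustration only):
# one list per model, one value per topic
rng = np.random.default_rng(0)
coherence_vals = [rng.uniform(0.3, 0.6, size=k).tolist() for k in topic_counts]

# Average the per-topic scores of each model and pick the best mean
mean_coherence = [float(np.mean(vals)) for vals in coherence_vals]
best_idx = int(np.argmax(mean_coherence))
best_num_topics = topic_counts[best_idx]

# With real results this index would select the model: model_list[best_idx]
```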

### Gridsearch with Vectorizer and Classifier

#### Which author wrote a given description?

In [104]:
X = book_rating_cpy['Description'][:10000]
Y = book_rating_cpy['Authors'][:10000]

In [105]:
# Initialize pipe tools
vectorizer = TfidfVectorizer()
rfc = RandomForestClassifier()

# Pipe object
pipe = Pipeline([
    ('vect', vectorizer),
    ('rfc', rfc)
])

# GS parameter dict
params = {
    'vect__max_features': [100,500],
    'rfc__n_estimators': [10,15],
    'rfc__max_depth': [None,20],
    
}

# Make GS object
gs = GridSearchCV(pipe, 
                  params,
                  cv=2,
                  n_jobs=-1,
                  verbose=1)

gs.fit(X,Y)

Fitting 2 folds for each of 8 candidates, totalling 16 fits


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('vect', TfidfVectorizer()),
                                       ('rfc', RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'rfc__max_depth': [None, 20],
                         'rfc__n_estimators': [10, 15],
                         'vect__max_features': [100, 500]},
             verbose=1)
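
After fitting, the search object exposes the winning parameter combination and its mean cross-validated score. The sketch below repeats the same pipeline layout on a handful of invented sentences with two placeholder author labels (all data here is made up) to show where those values live.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data: four descriptions per placeholder author
texts = [
    "dragons and wizards fight in a magic kingdom",
    "a wizard apprentice learns magic spells",
    "the magic sword and the dragon king",
    "an elf and a wizard cross the dragon mountains",
    "a detective investigates a murder in london",
    "the inspector finds a clue to the murder",
    "a murder mystery with a clever detective",
    "the detective and the inspector chase a killer",
]
labels = ["fantasy_author"] * 4 + ["crime_author"] * 4

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("rfc", RandomForestClassifier(random_state=0)),
])
params = {"vect__max_features": [5, 10], "rfc__n_estimators": [10, 15]}

gs = GridSearchCV(pipe, params, cv=2)
gs.fit(texts, labels)

best_params = gs.best_params_             # dict keyed like `params`
best_score = gs.best_score_               # mean CV accuracy of the best candidate
n_candidates = len(gs.cv_results_["params"])
```

With `cv=2` and stratified folds, every class needs at least two samples, which is worth keeping in mind when the labels are author names that may occur only once.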

In [107]:
# Save model
f_name = 'rfc_10k.sav'
model = gs.best_estimator_

In [108]:
joblib.dump(model, f_name)

['rfc_10k.sav']
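
A saved pipeline can be restored later with `joblib.load` and used for prediction directly on raw text, since the vectorizer is stored inside it. The sketch below shows the round trip on a small invented dataset and a temporary file (`rfc_toy.sav` is a placeholder name), not on the `rfc_10k.sav` model above.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Invented toy data with placeholder author labels
texts = [
    "dragons and wizards fight in a magic kingdom",
    "a wizard apprentice learns magic spells",
    "a detective investigates a murder in london",
    "the inspector finds a clue to the murder",
]
labels = ["fantasy_author", "fantasy_author", "crime_author", "crime_author"]

model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("rfc", RandomForestClassifier(n_estimators=10, random_state=0)),
]).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    f_name = os.path.join(tmp_dir, "rfc_toy.sav")
    joblib.dump(model, f_name)
    restored = joblib.load(f_name)

# The restored pipeline accepts raw strings, just like the original
new_description = ["a detective looks for the killer"]
original_pred = model.predict(new_description)
restored_pred = restored.predict(new_description)
```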