# 1. Introduction

Name: Celine Clarissa

Original Dataset: [Kaggle](https://www.kaggle.com/datasets/athu1105/book-genre-prediction/data).

Deployment: [Hugging Face](https://huggingface.co/spaces/celineclarissa/GC7)

GitHub: [GitHub Link](https://github.com/celineclarissa/Book-Genre-Prediction)

---

## Identifying the Problem

#### `Background`

I am a data scientist at a book distribution company. For such a company, it is important to know the characteristics of books in order to sort them by genre. This information can then be used to design genre-based strategies.

#### `Problem Statement and Objectives (SMART Framework)`
 
As a data scientist at a book distribution company, I need to train, test, tune, and evaluate a model that predicts the genre of a book before the company agrees to distribute it. The company can then base its business strategy on these predictions, for example by choosing which books to distribute according to genre. The work proceeds as follows: after analyzing the characteristics of each genre through EDA, the data scientist performs feature engineering on the data, trains an ANN to predict book genre, and then attempts to improve the model. The best model should reach an accuracy above 90% and be deployed on Hugging Face within 7 working days. The web app where the model is deployed will also include a page for EDA.

---
---

# 2. Import Libraries
The following libraries are used for model inference.

In [8]:
# import libraries
import re
import warnings

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as tf_hub
from tensorflow.keras.models import load_model

# import text preprocessing tools
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download the NLTK resources used by the preprocessing function
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


---
---
# 3. Loading and Defining

The data scientist now defines a function for text preprocessing.

In [9]:
# define stopwords
stopwords_eng = stopwords.words('english')

# define text preprocessing function
def text_preprocessing(text):
  '''
  Preprocess text for the model: lowercase it, remove [UNK] placeholders, numbers, and punctuation,
  expand negative contractions, tokenize, remove stopwords, and lemmatize.
  Returns the cleaned tokens joined into a single string.
  '''
  # change text to lowercase
  text = text.lower()

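  # remove [UNK] placeholders; the text is already lowercase, so only lowercase forms can occur
  # note: keep this step identical to the preprocessing used at training time
  text = text.replace('[unk]', '')
  # match 'unk' only as a whole word so that words such as 'drunk' stay intact
  text = re.sub(r'\bunk\b', '', text)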

  # remove numbers
  text = re.sub(r'\d+', '', text)

  # remove comma
  text = text.replace(',', '')

  # remove period symbol
  text = text.replace('.', '')

  # remove exclamation mark
  text = text.replace('!', '')

  # remove question mark
  text = text.replace('?', '')

  # expand negative contractions before the apostrophes are removed
  text = text.replace("don't", "do not")
  text = text.replace("aren't", "are not")
  text = text.replace("isn't", "is not")
  text = text.replace("didn't", "did not")
  text = text.replace("can't", "cannot")
  text = text.replace("couldn't", "could not")

  # remove quotation mark
  text = text.replace('"', '')
  text = text.replace("'", '')
  text = text.replace('’', '')

  # remove whitespace
  text = text.strip()

  # tokenization
  tokens = word_tokenize(text)

  # remove stopwords
  tokens = [word for word in tokens if word not in stopwords_eng]

  # lemmatization
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens]

  # combine tokens
  text = ' '.join(tokens)

  return text
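
The contraction step above only matches the straight apostrophe (`'`). A summary that uses the curly form (`’`) has the apostrophe stripped later without being expanded. Below is a minimal sketch of one way to cover both forms. The helper `expand_negations` and the `NEGATIONS` mapping are hypothetical names, not part of the original notebook. If something like this were adopted, the same change would need to be made in the training-time preprocessing so that both pipelines stay identical.

```python
import re

# Hypothetical mapping (assumption); keys use the straight apostrophe.
NEGATIONS = {
    "don't": "do not",
    "aren't": "are not",
    "isn't": "is not",
    "didn't": "did not",
    "can't": "cannot",
    "couldn't": "could not",
}

# One compiled pattern for all keys, matched on word boundaries.
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(k) for k in NEGATIONS) + r")\b")

def expand_negations(text):
    """Expand negative contractions in already-lowercased text (hypothetical helper)."""
    # normalize curly apostrophes to the straight form first
    text = text.replace("’", "'")
    # replace each matched contraction with its expanded form
    return _PATTERN.sub(lambda m: NEGATIONS[m.group(1)], text)

print(expand_negations("she didn’t write and he couldn't wait"))
```

The function expects lowercased input, so it would be called after `text.lower()` and before the quotation marks are removed.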

The data scientist uses a pretrained text embedding layer to improve the model.

In [10]:
# get pretrained layer from TensorFlow Hub
url = 'https://tfhub.dev/google/tf2-preview/nnlm-id-dim128-with-normalization/1'
pretrained_layer = tf_hub.KerasLayer(url, output_shape=[128], input_shape=[], dtype=tf.string)

The data scientist loads the trained model, which is used later to predict on the inference data.

In [11]:
# load model
model = load_model('model_2.h5', custom_objects={'KerasLayer': pretrained_layer})




Then, the data scientist defines a dictionary to convert the predicted class index to a class name.

In [12]:
# define class dictionary
dict_class = {0: 'fantasy',
              1: 'science',
              2: 'crime',
              3: 'history',
              4: 'horror',
              5: 'thriller',
              6: 'psychology',
              7: 'romance',
              8: 'sports',
              9: 'travel'}
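
Because the prediction step looks up the `argmax` index in this dictionary, it can help to check that the keys are contiguous and start at 0 before using it to label predictions. The sketch below is illustrative: `class_names` and `class_to_index` are assumed names, not part of the original notebook.

```python
# same mapping as defined in the notebook
dict_class = {0: 'fantasy', 1: 'science', 2: 'crime', 3: 'history', 4: 'horror',
              5: 'thriller', 6: 'psychology', 7: 'romance', 8: 'sports', 9: 'travel'}

# keys should be exactly 0..n-1 so that every argmax index has a label
assert sorted(dict_class) == list(range(len(dict_class)))

# ordered list of class names (illustrative), usable as column labels
class_names = [dict_class[i] for i in range(len(dict_class))]

# reverse lookup from class name to index (illustrative)
class_to_index = {name: idx for idx, name in dict_class.items()}

print(class_names)
print(class_to_index['romance'])
```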

---
---
# 4. Inference Data

## 4.1. Define Inference Data

The data scientist creates inference data that the model has never seen during training.

In [13]:
# create inference data
inf_data = {'index': 4657,
            'title': "The Notebook",
            'summary': "Noah and Allie spend a wonderful summer together, but her family and the socio-economic realities of the time prevent them from being together. Although Noah attempts to keep in contact with Allie after they are forced to separate, his letters go unanswered. Eventually, Noah professes his undying and eternal love in one final letter. Noah travels north to find gainful employment and to escape the ghost of Allie, and eventually he goes off to war. After serving his country, he returns home to restore an old farmhouse. A newspaper article about his endeavor catches Allie's eye, and 14 years after she last saw Noah, Allie returns to him. The only problem is she is engaged to another man. After spending two wonderful reunion days together, Allie must decide between the two men that she loves."}

# put inference data into dataframe
inf_data = pd.DataFrame(inf_data, index=[0])
# show dataframe
inf_data

| | index | title | summary |
|---|---|---|---|
| 0 | 4657 | The Notebook | Noah and Allie spend a wonderful summer togeth... |


## 4.2. Preprocessing

Then, the data scientist applies the `text_preprocessing` function defined above so that the inference data goes through the same cleaning steps before it reaches the model.

In [14]:
# apply text preprocessing to the summary column
inf_data['text_processed'] = inf_data['summary'].apply(text_preprocessing)
inf_data

| | index | title | summary | text_processed |
|---|---|---|---|---|
| 0 | 4657 | The Notebook | Noah and Allie spend a wonderful summer togeth... | noah allie spend wonderful summer together fam... |


## 4.3. Prediction

In [15]:
# make prediction
# calculate probability
y_pred_inf = model.predict(inf_data.text_processed)
# take class with biggest probability
y_pred_inf_class = np.argmax(y_pred_inf, axis=-1)



In [16]:
# show probability for each class
y_pred_df = pd.DataFrame(y_pred_inf, columns=list(dict_class.values()))
y_pred_df

| | fantasy | science | crime | history | horror | thriller | psychology | romance | sports | travel |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.194008 | 0.138966 | 0.104051 | 0.127359 | 0.112602 | 0.234017 | 0.023547 | 0.023245 | 0.019361 | 0.022844 |


In [17]:
# convert probability to class name
# index the single prediction explicitly instead of converting a 1-element array to int
print(f'Book Genre Prediction: {dict_class[int(y_pred_inf_class[0])]}')

Book Genre Prediction: thriller
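
The largest probability in the output above is only about 0.23, so the top class alone hides how close the runners-up are. Below is a hedged sketch of ranking the classes. It uses the probabilities printed above, copied by hand into a plain list. The names `probs`, `class_names`, and `top_k` are illustrative and not part of the original notebook.

```python
# class order matches the columns of the probability table
class_names = ['fantasy', 'science', 'crime', 'history', 'horror',
               'thriller', 'psychology', 'romance', 'sports', 'travel']

# probabilities copied from the notebook output for this one summary
probs = [0.194008, 0.138966, 0.104051, 0.127359, 0.112602,
         0.234017, 0.023547, 0.023245, 0.019361, 0.022844]

def top_k(probs, names, k=3):
    """Return the k (name, probability) pairs with the highest probability."""
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# print the three most probable genres
for name, p in top_k(probs, class_names):
    print(f'{name}: {p:.3f}')
```

For this summary the three highest-ranked genres are thriller, fantasy, and science, while romance receives a probability of about 0.02.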
