<table align="center" width=100%>
    <tr>
        <td width="20%">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=5px>
                    <b> Book Century Identifier - Model Building<br>
                    </b>
                </font>
            </div>
        </td>
         <td width="25%">
        </td>
    </tr>
</table>

<a id="contents"> </a>
## Table of Contents:

1. **[Importing Required Libraries](#import)**
2. **[Cleaning the Data](#clean)**
3. **[Building the Model](#model)**
4. **[Evaluating the Model](#eval)**
5. **[Future Scope of this Project](#futurescope)**

<a id="import"> </a>
## 1. Importing Required Libraries:

[Back to Contents](#contents)

In [49]:
# Importing pandas in order to work with DataFrames:
import pandas as pd
# Importing string for its list of punctuation characters, used to strip punctuation before extracting individual words:
import string

# Importing Train Test Split, and Cross Validation to evaluate the Model:
from sklearn.model_selection import train_test_split, cross_val_score
# Importing LabelEncoder to Encode the Categorical Target Variable:
from sklearn.preprocessing import LabelEncoder
# Importing CountVectorizer to assemble the Document-Term Matrices:
from sklearn.feature_extraction.text import CountVectorizer
# Importing Classification Machine Learning Algorithms, since the Target Variable is Categorical:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
# Importing Classification Report to evaluate the Model built:
from sklearn.metrics import classification_report

# Modifying Jupyter Notebook settings to display the output of every expression in a cell, and not just the last one:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [28]:
# Loading the dataset and dropping the redundant index column that was saved in the CSV:
df = pd.read_csv('books_db.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)

In [29]:
df.head()
df.tail()
df.info()

,Name,Author,Year,Content,Century
0,Harry Potter and the Deathly Hallows,J K Rowling,2014,"ldemort, indicating the seat on his immediate ...",21
1,Harry Potter and the Deathly Hallows,J K Rowling,2014,aze had wandered upward to the body revolving ...,21
2,Harry Potter and the Deathly Hallows,J K Rowling,2014,"rt. “At any rate, it remains unlikely that the...",21
3,Harry Potter and the Deathly Hallows,J K Rowling,2014,"a small man halfway down the table, who had b...",21
4,Harry Potter and the Deathly Hallows,J K Rowling,2014,hiss on even after the cruel mouth had stoppe...,21


,Name,Author,Year,Content,Century
355,Dr Faustus,Christopher Marlowe,1588,"quod tumeraris:[52] per Jehovam, Gehenna...",16
356,Dr Faustus,Christopher Marlowe,1588,om Faustus doth dedicate himself. This wo...,16
357,Dr Faustus,Christopher Marlowe,1588,"And meet me in my study at midnight, And...",16
358,Dr Faustus,Christopher Marlowe,1588,should be full of vermin.[70] WAGNER. S...,16
359,Dr Faustus,Christopher Marlowe,1588,"llow me. CLOWN. But, do you hear? if I s...",16


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Name     360 non-null    object
 1   Author   360 non-null    object
 2   Year     360 non-null    int64 
 3   Content  360 non-null    object
 4   Century  360 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 14.2+ KB


<a id="clean"> </a>
## 2. Cleaning the Data:

[Back to Contents](#contents)

In [30]:
# Creating a user-defined function to remove the punctuation in the text, so we can extract words from it:
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text

df['Content_cleaned'] = df['Content'].apply(remove_punct)
df['Content_cleaned']

0      ldemort indicating the seat on his immediate r...
1      aze had wandered upward to the body revolving ...
2      rt “At any rate it remains unlikely that the M...
3       a small man halfway down the table who had be...
4       hiss on even after the cruel mouth had stoppe...
                             ...                        
355     quod tumeraris52      per Jehovam Gehennam et...
356    om Faustus doth dedicate himself      This wor...
357     And meet me in my study at midnight      And ...
358     should be full of vermin70       WAGNER So th...
359    llow me       CLOWN But do you hear if I shoul...
Name: Content_cleaned, Length: 360, dtype: object
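
The output above still contains curly quotation marks (for example in row 2), because `string.punctuation` covers only ASCII characters. A minimal sketch of one possible extension, assuming that dropping every character in a Unicode punctuation category is acceptable for this data; `remove_punct_unicode` is a hypothetical helper and not part of the original notebook:

```python
import string
import unicodedata

def remove_punct_unicode(text):
    # Drop ASCII punctuation (string.punctuation) as well as any character
    # whose Unicode category starts with "P" (e.g. curly quotes, dashes).
    return "".join(
        char for char in text
        if char not in string.punctuation
        and not unicodedata.category(char).startswith("P")
    )

# Sample fragment modelled on row 2 of the 'Content' column:
sample = "rt. “At any rate, it remains unlikely"
print(remove_punct_unicode(sample))
```

Swapping this in would change the vocabulary, so the results further down would need to be regenerated.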

In [31]:
# Instantiating our Label Encoder:
le = LabelEncoder()
# Checking the Target Variable before:
df['Century'].value_counts()

# Encoding our Categorical Target Variable:
df['Century_Enc'] = le.fit_transform(df['Century'])
# Checking the Target Variable after:
df['Century_Enc'].value_counts()

19    70
21    60
20    60
18    60
17    60
16    50
Name: Century, dtype: int64

3    70
5    60
4    60
2    60
1    60
0    50
Name: Century_Enc, dtype: int64

<a id="model"> </a>
## 3. Building the Model:

[Back to Contents](#contents)

Before we can build our Classification Model, we need to build our Document-Term Matrix (one row per text chunk, one column per word, each cell holding that word's count) and isolate our target variable.

In [44]:
# Building the Document-Term Matrix and isolating the Target Variable:
CV = CountVectorizer(stop_words="english")
X = pd.DataFrame(CV.fit_transform(df['Content_cleaned']).toarray(), columns = CV.get_feature_names_out())
y = df['Century_Enc']

# Taking a look at the Document-Term Matrix:
X.head()

,10,102829,11,1116,119,119105,12,1225,13,13000,...,zakat,zeal,zealants,zeno,zimmerman,zoo,zoology,zootown,æmilianus,æthiopians
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = 1331, stratify=y)

In [47]:
# Since we have a Multiclass Categorical Target Variable and word counts as features,
# let us use the Multinomial Naive Bayes Algorithm to build our model:
NB = MultinomialNB()
NB.fit(X_train, y_train)

<a id="eval"> </a>
## 4. Evaluating the Model:

[Back to Contents](#contents)


In [51]:
# Classification Report for the Train Data:
NB_train_predict = NB.predict(X_train)
print("Classification Report for Train Data - Multinomial Naive Bayes' Model:\n", classification_report(y_train, NB_train_predict), sep = '')

# Classification Report for the Test Data:
NB_test_predict = NB.predict(X_test)
print("Classification Report for Test Data - Multinomial Naive Bayes' Model:\n", classification_report(y_test, NB_test_predict), sep = '')

Classification Report for Train Data - Multinomial Naive Bayes' Model:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        35
           1       1.00      1.00      1.00        42
           2       1.00      1.00      1.00        42
           3       1.00      1.00      1.00        48
           4       1.00      1.00      1.00        42
           5       1.00      1.00      1.00        42

    accuracy                           1.00       251
   macro avg       1.00      1.00      1.00       251
weighted avg       1.00      1.00      1.00       251

Classification Report for Test Data - Multinomial Naive Bayes' Model:
              precision    recall  f1-score   support

           0       1.00      0.93      0.97        15
           1       0.94      0.94      0.94        18
           2       0.94      0.89      0.91        18
           3       0.91      0.91      0.91        22
           4       0.90      1.00      0.95  

Since I want the model to predict every century accurately, I will use the Weighted F1 Score as my metric of choice. It averages the per-class F1 Scores in proportion to each class's support, which accounts for the slight class imbalance (50 to 70 samples per century).
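
To make the metric concrete, here is a small illustration with invented labels (not this project's data) comparing the Macro and Weighted F1 Scores on an imbalanced two-class example. The variable names are hypothetical:

```python
from sklearn.metrics import f1_score

# Invented labels: class 0 has four samples, class 1 has two.
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 1, 0]

# Macro F1: plain average of the per-class F1 Scores.
toy_macro = f1_score(toy_true, toy_pred, average="macro")
# Weighted F1: per-class F1 Scores averaged in proportion to class support.
toy_weighted = f1_score(toy_true, toy_pred, average="weighted")

print("Macro F1:", toy_macro)
print("Weighted F1:", toy_weighted)
```

Here the per-class F1 Scores are 8/9 for class 0 and 2/3 for class 1, so the Macro average is 7/9 while the Weighted average is 22/27, since the larger class counts for more.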

We can see that our Multinomial Naive Bayes Model has fit the train data perfectly, with every score at 1.00, while the scores on the test data are lower. This gap suggests that our model is overfit to some degree.

Let us perform Cross Validation to get a better idea of how well our model performs, and to check that this particular Train-Test Split did not give us a high score by chance.

In [59]:
# Performing 10-fold Cross Validation:
scores = cross_val_score(NB, X, y, scoring = 'f1_weighted', cv=10)
# Displaying the individual Weighted F1 Scores obtained for each fold:
print(scores)
# Displaying the average metric:
print('The Average Weighted F1 Score obtained through cross validation:', scores.mean())

[0.81878307 0.68008072 0.74821259 0.97188552 0.82738095 0.73794261
 0.85872831 1.         0.79259259 0.7257696 ]
The Average Weighted F1 Score obtained through cross validation: 0.8161375969709302


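The average Weighted F1 Score of 81.6% is respectable, although the individual folds range from 0.68 to 1.00, so the estimate depends noticeably on which chunks land in each fold.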
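
One caveat about the Cross Validation cell above: `X` was built by fitting `CountVectorizer` on all 360 chunks, so each fold's vocabulary already includes words from its held-out chunks. A hedged sketch of an alternative, which refits the vectorizer inside every fold by wrapping both steps in a scikit-learn pipeline. The toy texts and labels below are invented for illustration and stand in for `df['Content_cleaned']` and `df['Century_Enc']`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-in data: six "old" and six "modern" sentences.
toy_texts = [
    "thou art fair", "thy sword doth shine", "thou hast my heart",
    "doth thy king ride", "thou shalt not pass", "thy will be done",
    "the phone rang twice", "she sent an email", "the car would not start",
    "he checked his phone", "the email was short", "a car drove past",
]
toy_labels = [0] * 6 + [1] * 6

# The pipeline fits the vectorizer on the training part of each fold only.
toy_pipe = make_pipeline(CountVectorizer(), MultinomialNB())
# Shuffled, stratified folds with a fixed seed for reproducibility.
toy_folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

toy_scores = cross_val_score(
    toy_pipe, toy_texts, toy_labels, scoring="f1_weighted", cv=toy_folds
)
print(toy_scores, toy_scores.mean())
```

On the real data, the same pattern would take the raw text column instead of the precomputed matrix `X`; whether it changes the scores materially is something only a rerun can show.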

<a id="futurescope"> </a>
## 5. Future Scope of this Project:

[Back to Contents](#contents)

While I am satisfied with the outcome of this project, it is undoubtedly a prototype with huge scope for improvement.

First, I can source my data programmatically, say by downloading all of the books available on [Project Gutenberg](https://www.gutenberg.org/).

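Second, I can clean the data more thoroughly (for example, removing the digits and curly quotation marks that survive the current cleaning step), and perform Feature Selection in order to remove words without dictionary meaning.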

Third, when creating the Document-Term Matrix, I can also incorporate N-Grams as features.

Lastly, I can implement a way for the user to input some text data and obtain a predicted Century value.