A classification model that predicts the century in which a book was written, using a Bag-of-Words model (Natural Language Processing)
Sometimes GitHub is unable to preview code blocks in Jupyter Notebooks. If this happens, you can still view my Notebook.
This project was built to test my understanding of the Natural Language Processing Self-Study course offered as part of the PGP-DSE program at Great Learning Hyderabad.
- Install Python:
- Install non-standard Python libraries:
Launch Command Prompt and run this command:
C:\Windows\system32> pip install ipykernel jupyterlab notebook numpy pandas scikit-learn
- Download the dataset and Jupyter Notebook. Ensure that they are in the same folder.
- Launch Jupyter Notebook from the Start Menu, and navigate to the folder containing the dataset and Jupyter Notebook you just downloaded.
- Go to Cell -> Run All.
- Profit!
The final `score.mean()` value tells you how good the model is. The closer this value is to 1, the better the predictive power of the model. In other words, the closer the value is to 1, the more accurately the model can predict the century in which the text was written, based on its sentence structure, vocabulary, and grammar.
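As a minimal sketch of how such a cross-validated score can be computed with scikit-learn (the toy corpus, the Naive Bayes classifier, and all names here are illustrative assumptions, not the project's actual data or model):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: sentences labelled with the century whose style they imitate.
texts = [
    "thou art a knave and a villain",
    "verily I say unto thee",
    "the train departed the station at noon",
    "she posted the letter that morning",
    "the app crashed after the update",
    "streaming the video over wifi",
]
centuries = [17, 17, 19, 19, 21, 21]

# Bag-of-Words features feeding a Multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 2-fold cross-validation scored with the weighted F1 metric;
# score.mean() summarises performance across the folds.
score = cross_val_score(model, texts, centuries, cv=2, scoring="f1_weighted")
print(score.mean())
```

On a corpus this small the number itself is meaningless; the point is only the shape of the computation behind `score.mean()`.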
I got a Weighted F1 Score of 81.6%. This is not bad, but I would like to push this number as high as I can, so there is definitely scope for improvement.
I am satisfied with the outcome of this project. However, it is simplistic and undoubtedly a prototype, with huge scope for improvement:
- I can source my data more programmatically, say by downloading all of the books available on Project Gutenberg.
- I can clean the data more thoroughly and perform Feature Selection to remove words without dictionary meaning.
- When creating the Document-Term Matrix, I can also incorporate N-Grams as features.
- I can implement a way for the user to input some text data and obtain a predicted Century value, making it actually usable and not just an object of curiosity.
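Two of the ideas above, N-gram features and a user-facing prediction helper, could be sketched as follows (again assuming a scikit-learn pipeline on an illustrative toy corpus; `predict_century` is a hypothetical helper, not part of the current Notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same illustrative toy corpus as before.
texts = [
    "thou art a knave and a villain",
    "verily I say unto thee",
    "the train departed the station at noon",
    "she posted the letter that morning",
    "the app crashed after the update",
    "streaming the video over wifi",
]
centuries = [17, 17, 19, 19, 21, 21]

# ngram_range=(1, 2) adds bigrams alongside single words as
# columns of the Document-Term Matrix.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, centuries)

def predict_century(sentence: str) -> int:
    """Return the predicted century for a user-supplied sentence."""
    return int(model.predict([sentence])[0])

print(predict_century("thou art a villain"))  # predicts 17 on this toy corpus
```

Wrapping `predict_century` in an `input()` loop or a small web form would turn the prototype into something a user can actually interact with.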