A classification model that predicts the century in which a book was written, using a Bag-of-Words model (Natural Language Processing)
Sometimes GitHub is unable to preview code blocks in Jupyter Notebooks. If this happens, you can still view my Notebook.
This project was built to test my understanding of the Natural Language Processing Self-Study course offered as part of the PGP-DSE program at Great Learning Hyderabad.
- Install Python:
- Install non-standard Python libraries:
Launch Command Prompt and run this command:
C:\Windows\system32> pip install ipykernel jupyterlab notebook numpy pandas scikit-learn
- Download the dataset and Jupyter Notebook. Ensure that they are in the same folder.
- Launch Jupyter Notebook from the Start Menu, and navigate to the folder containing the dataset and Jupyter Notebook you just downloaded.
- Go to Cell -> Run All.
- Profit!
The final `score.mean()` value tells you how good the model is. The closer this value is to 1, the better the predictive power of the model. In other words, the closer the value is to 1, the more accurately the model can predict the century in which the text was written, based on its sentence structure, vocabulary, and grammar.
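As a minimal sketch of how such a cross-validated score can be computed with scikit-learn (the toy corpus, the Naive Bayes classifier, and all names here are illustrative assumptions, not the project's actual data or model):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: sentences labelled with the century whose style they imitate.
texts = [
    "thou art a knave and a villain",
    "verily I say unto thee",
    "the train departed the station at noon",
    "she posted the letter that morning",
    "the app crashed after the update",
    "streaming the video over wifi",
]
centuries = [17, 17, 19, 19, 21, 21]

# Bag-of-Words features feeding a Multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 2-fold cross-validation scored with the weighted F1 metric;
# score.mean() summarises performance across the folds.
score = cross_val_score(model, texts, centuries, cv=2, scoring="f1_weighted")
print(score.mean())
```

On a corpus this small the number itself is meaningless; the point is only the shape of the computation behind `score.mean()`.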
I got a Weighted F1 Score of 81.6%. This is not bad, but I would like to push this number as high as I can, so there is definitely scope for improvement.
I am satisfied with the outcome of this project. However, it is simplistic and undoubtedly a prototype, with huge scope for improvement:
- I can source my data more programmatically, say by downloading all of the books available on Project Gutenberg.
- I can clean the data more thoroughly and perform Feature Selection to remove words without dictionary meaning.
- When creating the Document-Term Matrix, I can also incorporate N-Grams as features.
- I can implement a way for the user to input some text data and obtain a predicted Century value, making it actually usable and not just an object of curiosity.
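Two of the ideas above, N-gram features and a user-facing prediction helper, could be sketched as follows (again assuming a scikit-learn pipeline on an illustrative toy corpus; `predict_century` is a hypothetical helper, not part of the current Notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same illustrative toy corpus as before.
texts = [
    "thou art a knave and a villain",
    "verily I say unto thee",
    "the train departed the station at noon",
    "she posted the letter that morning",
    "the app crashed after the update",
    "streaming the video over wifi",
]
centuries = [17, 17, 19, 19, 21, 21]

# ngram_range=(1, 2) adds bigrams alongside single words as
# columns of the Document-Term Matrix.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, centuries)

def predict_century(sentence: str) -> int:
    """Return the predicted century for a user-supplied sentence."""
    return int(model.predict([sentence])[0])

print(predict_century("thou art a villain"))  # predicts 17 on this toy corpus
```

Wrapping `predict_century` in an `input()` loop or a small web form would turn the prototype into something a user can actually interact with.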