<table align="center" width=100%>
    <tr>
        <td width="20%">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=5px>
                    <b> Book Century Identifier - Data Collection<br>
                    </b>
                </font>
            </div>
        </td>
         <td width="25%">
        </td>
    </tr>
</table>

<a id="contents"> </a>
## Table of Contents:

1. **[Objective of this Code](#obj)**
2. **[Importing Required Libraries](#import)**
3. **[Reading the Book List csv file and building the skeleton of the DataFrame](#bookscsv)**
4. **[Reading the Book txt files and adding the Content to the DataFrame](#bookstxt)**
5. **[Adding the Target Variable to the DataFrame](#target)**
6. **[Splitting the Content into Rows to get more Documents](#splitdocs)**
7. **[Saving the DataFrame to a csv file](#savecsv)**

<a id="obj"> </a>
## 1. Objective of this Code:

[Back to Contents](#contents)

Before we even think about building an NLP Model, we first need to:
1. Source the Data
2. Clean the Data
3. Build the Target Variable

1. Sourcing the Data:
First things first, I headed over to [Goodreads](https://www.goodreads.com/list/tag/by-century) to get a list of the best books ever written, century-wise.

I selected five or six books for each century that has its own list.

I tried to stick to one entry per franchise and one entry per author. This wasn't always possible, as intact text content from the 16th century isn't always readily available.

Then I obtained these books off the internet, mostly [Project Gutenberg](https://www.gutenberg.org/).

I put together a simple table with the columns 'Name', 'Author', and 'Year' in [Google Sheets](https://docs.google.com/spreadsheets/d/15w_N4IGD0Sek75qLf8_tU8KctR3UteKc3lGS8wfQGPk/edit?usp=sharing).

The next step was to assemble a dataset and import the contents of the books into a DataFrame:

<a id="import"> </a>
## 2. Importing Required Libraries:

[Back to Contents](#contents)

In [1]:
# Importing pandas in order to work with DataFrames:
import pandas as pd

# Importing os to navigate file systems and open files to import the book content:
import os

<a id="bookscsv"> </a>
## 3. Reading the Book List csv file and building the skeleton of the DataFrame:

[Back to Contents](#contents)

In [2]:
# reading the csv file to create the dataframe skeleton
books = pd.read_csv('books_list.csv')
books.set_index('Name',inplace=True)
books

Name,Author,Year
Harry Potter and the Deathly Hallows,J K Rowling,2014
The Hunger Games,Suzanne Collins,2008
The Kite Runner,Khaled Hosseini,2003
Life of Pi,Yann Martel,2001
The Fault in Our Stars,John Green,2012
The Help,Kathryn Stockett,2009
To Kill a Mockingbird,Harper Lee,1960
1984,George Orwell,1949
The Great Gatsby,F Scott Fitzgerald,1925
Harry Potter and the Sorcerer's Stone,J K Rowling,1997


In [3]:
# Confirming that the index of the DataFrame is indeed the book name:
books.index

Index(['Harry Potter and the Deathly Hallows', 'The Hunger Games',
       'The Kite Runner', 'Life of Pi', 'The Fault in Our Stars', 'The Help',
       'To Kill a Mockingbird', '1984', 'The Great Gatsby',
       'Harry Potter and the Sorcerer's Stone', 'The Hobbit', 'Farhenheit 451',
       'Pride and Prejudice', 'The Picture of Dorian Gray',
       'Wuthering Heights', 'Crime and Punishment', 'Frankenstein',
       'Through the Looking Glass', 'Dracula', 'Gulliver's Travels',
       'Robinson Crusoe', 'The US Constitution',
       'The Decline and Fall of the Roman Empire',
       'The Autobiography of Benjamin Franklin', 'The Monk', 'Hamlet',
       'Macbeth', 'Paradise Lost', 'The Pilgrim's Progress', 'Leviathan',
       'An Essay Concerning Human Understanding', 'Romeo and Juliet',
       'A Midsummer Night's Dream', 'Essays', 'Edward II', 'Dr Faustus'],
      dtype='object', name='Name')

<a id="bookstxt"> </a>
## 4. Reading the Book txt files and adding the Content to the DataFrame:

[Back to Contents](#contents)

The books are in a directory named 'books' which is present in the same directory as the Jupyter Notebook.
I will now add the content from the books to the DataFrame.

In [4]:
directory = r'books'
# Looping through every file in the directory 'books':
for filename in os.listdir(directory):
    # Checking if the file is a txt file (ebook):
    if filename.endswith('.txt'):
        # Opening the file in utf-8 encoding:
        with open(os.path.join(directory, filename), encoding='utf-8') as f:
            # Getting the file name without the extension (splitext also handles names that contain dots):
            filename_raw = os.path.splitext(filename)[0]
            # Saving the contents of the text file into the variable book:
            book = f.read()
            # 2. Cleaning the Data:
            # Removing the newline characters:
            book_cleaned = book.replace('\n', ' ')
            # Adding the content to the respective record with the book name as the index, in a column named 'Content':
            books.loc[filename_raw, 'Content'] = book_cleaned[5000:55000]
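
One thing to watch in the loop above: assigning with `books.loc[name, 'Content']` creates a new row when `name` is not already in the index, so a file whose name does not exactly match a title in `books_list.csv` would add a row with no Author or Year. A minimal sketch of a check that could be run before the loop follows. The helper `unmatched_files` and the sample lists are hypothetical stand-ins for `os.listdir('books')` and `books.index`, not part of the notebook.

```python
import os

def unmatched_files(filenames, known_titles):
    """Return the .txt files whose base name is not a known book title."""
    known = set(known_titles)
    missing = []
    for filename in filenames:
        stem, ext = os.path.splitext(filename)
        # Only text files are considered; everything else is ignored.
        if ext.lower() == '.txt' and stem not in known:
            missing.append(filename)
    return missing

# Hypothetical sample data (in the notebook: os.listdir('books') and books.index).
sample_files = ['Dracula.txt', 'Hamlet.txt', 'Fahrenheit 451.txt', 'notes.md']
sample_titles = ['Dracula', 'Hamlet', 'Farhenheit 451']

print(unmatched_files(sample_files, sample_titles))
```

An empty list means every text file maps onto an existing record.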

In [5]:
# Checking that the contents of every book were added to the DataFrame:
books

Name,Author,Year,Content
Harry Potter and the Deathly Hallows,J K Rowling,2014,"ldemort, indicating the seat on his immediate ..."
The Hunger Games,Suzanne Collins,2008,"ven here, even in the middle of nowhere, you w..."
The Kite Runner,Khaled Hosseini,2003,"of yours?"" He'd close the door, leave me to w..."
Life of Pi,Yann Martel,2001,to confess that as a matter of fact it was a ...
The Fault in Our Stars,John Green,2012,"tion of cookies and lemonade, sat down in the ..."
The Help,Kathryn Stockett,2009,up her eyes at me like I done something wrong...
To Kill a Mockingbird,Harper Lee,1960,"ustry, Atticus was related by blood or marriag..."
1984,George Orwell,1949,except a series of bright-lit tableaux occurri...
The Great Gatsby,F Scott Fitzgerald,1925,d. “How do you get to West Egg village?” he a...
Harry Potter and the Sorcerer's Stone,J K Rowling,1997,n't see a single collecting tin. It was on his...


In [6]:
# Checking the length of each value in the 'Content' column:
for item in books['Content']:
    print(len(item))

50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
22125
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
50000
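
The same check can be written more compactly so that only the shorter-than-expected books are listed by name. This is a sketch on a toy DataFrame with made-up titles and lengths, not the notebook's data; in the notebook the Series would be `books['Content']`.

```python
import pandas as pd

# Toy stand-in for the books DataFrame, indexed by book name.
toy = pd.DataFrame(
    {'Content': ['a' * 50, 'b' * 50, 'c' * 22]},
    index=pd.Index(['Book A', 'Book B', 'Book C'], name='Name'),
)

# Character count of every 'Content' value.
lengths = toy['Content'].str.len()

# Keep only the books that are shorter than the longest one.
short = lengths[lengths < lengths.max()]
print(short)
```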


<a id="target"> </a>
## 5. Adding the Target Variable to the DataFrame:

[Back to Contents](#contents)

The target variable I have in mind here is the century in which the book was written/published.
Unfortunately I couldn't simply get this information from Gutenberg, as the Year they have in their
database is simply the year the book was added to their collection.

This is why I had to put together the dataset myself, at least for this prototype project.

I can extract the Century from the 'Year' column I painstakingly put together manually:

In [7]:
# 3. Building the Target Variable:
# Getting the integer quotient of the 'Year' column when divided by 100, then adding 1 to get the 'Century'.
# Since this will be a categorical column, let us convert it to a string (object dtype).
books['Century'] = ((books['Year']//100) + 1)
books['Century'] = books['Century'].astype(str)
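
The century arithmetic deserves a quick check at boundary years. The sketch below compares the quotient-plus-one form with a variant that subtracts 1 first, assuming the convention that a century runs from year 1 to year 100 (so 1900 belongs to the 19th century). The helper name `year_to_century` and the sample years are hypothetical.

```python
def year_to_century(year):
    """Century under the assumed convention that 1901-2000 is the 20th century."""
    return (year - 1) // 100 + 1

# Columns: year, quotient-plus-one, boundary-aware variant.
for year in (1588, 1900, 1901, 2000, 2014):
    print(year, year // 100 + 1, year_to_century(year))
```

The two forms differ only for years divisible by 100, so the choice only matters if such a year appears in the 'Year' column.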

In [8]:
books[['Year','Century']]

Name,Year,Century
Harry Potter and the Deathly Hallows,2014,21
The Hunger Games,2008,21
The Kite Runner,2003,21
Life of Pi,2001,21
The Fault in Our Stars,2012,21
The Help,2009,21
To Kill a Mockingbird,1960,20
1984,1949,20
The Great Gatsby,1925,20
Harry Potter and the Sorcerer's Stone,1997,20


<a id="splitdocs"> </a>
## 6. Splitting the Content into Rows to get more Documents:

[Back to Contents](#contents)

The goal is to capture patterns in sentence structure, vocabulary, and grammar that indicate the century in which a book was written, not to predict information about the book itself, so I am okay with splitting the content into smaller pieces in order to get more documents.

The only danger with this is limiting the vocabulary of each record, and of course, overfitting to the patterns we find in the admittedly limited data I've collected.

In [9]:
# First let us reset the index, since we no longer need the Book Name to identify the records.
# Resetting (rather than dropping) the index preserves the Name data as a regular column:
books.reset_index(inplace=True)

In [10]:
# Iterate through the records of the DataFrame
for i in books.index:
    # The US Constitution is only ~22,000 characters long, so every book is limited to 10 documents of 2000 characters each.
    # If the US Constitution is dropped from the DataFrame, we can create more documents.
    # Using a list comprehension to split the content into documents of 2000 characters each:
    books.at[i,'Content']=[books.at[i,'Content'][2000*j:2000*(j+1)] for j in range(0,10)]
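
The fixed `range(0,10)` ties every book to the length of the shortest text. One possible alternative, sketched here as an assumption rather than the notebook's method, derives the number of chunks from each text's own length and drops any short remainder. The helper `split_into_chunks` and its parameters are hypothetical.

```python
def split_into_chunks(text, chunk_size=2000, max_chunks=None):
    """Split text into consecutive chunks of exactly chunk_size characters.

    A trailing remainder shorter than chunk_size is dropped, and max_chunks
    (if given) caps how many chunks are returned.
    """
    n_full = len(text) // chunk_size
    if max_chunks is not None:
        n_full = min(n_full, max_chunks)
    return [text[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]

# A 22125-character text (the length reported for the shortest book).
short_chunks = split_into_chunks('x' * 22125)
# A 50000-character text, capped at 10 chunks to keep the classes balanced.
capped_chunks = split_into_chunks('y' * 50000, max_chunks=10)

print(len(short_chunks), len(capped_chunks))
```

Leaving `max_chunks` unset gives longer books more documents, which yields more data at the cost of an unbalanced number of rows per book.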

In [11]:
# Splitting the individual elements of the lists into their own separate records:
books = books.explode('Content',ignore_index=True)
books

,Name,Author,Year,Content,Century
0,Harry Potter and the Deathly Hallows,J K Rowling,2014,"ldemort, indicating the seat on his immediate ...",21
1,Harry Potter and the Deathly Hallows,J K Rowling,2014,aze had wandered upward to the body revolving ...,21
2,Harry Potter and the Deathly Hallows,J K Rowling,2014,"rt. “At any rate, it remains unlikely that the...",21
3,Harry Potter and the Deathly Hallows,J K Rowling,2014,"a small man halfway down the table, who had b...",21
4,Harry Potter and the Deathly Hallows,J K Rowling,2014,hiss on even after the cruel mouth had stoppe...,21
...,...,...,...,...,...
355,Dr Faustus,Christopher Marlowe,1588,"quod tumeraris:[52] per Jehovam, Gehenna...",16
356,Dr Faustus,Christopher Marlowe,1588,om Faustus doth dedicate himself. This wo...,16
357,Dr Faustus,Christopher Marlowe,1588,"And meet me in my study at midnight, And...",16
358,Dr Faustus,Christopher Marlowe,1588,should be full of vermin.[70] WAGNER. S...,16
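
To make the effect of `explode` concrete, here is a small sketch on a toy DataFrame (the titles and contents are made up). Each element of a list in 'Content' becomes its own row, the other columns are repeated, and `ignore_index=True` renumbers the rows from 0.

```python
import pandas as pd

# Toy stand-in: each 'Content' cell holds a list of document chunks.
toy = pd.DataFrame({
    'Name': ['Book A', 'Book B'],
    'Century': ['19', '20'],
    'Content': [['a1', 'a2', 'a3'], ['b1', 'b2']],
})

# One row per list element; Name and Century are repeated for each chunk.
exploded = toy.explode('Content', ignore_index=True)
print(exploded)
```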


<a id="savecsv"> </a>
## 7. Saving the DataFrame to a csv file:

[Back to Contents](#contents)

Now that we have converted our raw data into a usable DataFrame, let us save it to a csv file. We can always rerun this code if we want to extract more of the content from the books.

This DataFrame will need to undergo a lot more processing before it is ready for model-building purposes.

In [12]:
# Saving the DataFrame to a csv file:
books.to_csv('books_db.csv')
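
Two details are worth keeping in mind when this file is loaded again. By default `to_csv` writes the index as an unnamed first column, and `read_csv` infers 'Century' as an integer even though it was saved as a string. The sketch below shows one way to handle both, using an in-memory buffer and a toy DataFrame in place of `books_db.csv`; whether to pass `index=False` is a design choice, not something the notebook prescribes.

```python
import io
import pandas as pd

# Toy stand-in for the final DataFrame, with 'Century' stored as a string.
toy = pd.DataFrame({'Name': ['Book A'], 'Year': [1925], 'Century': ['20']})

# Write without the index so no unnamed column appears on reload.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns 'Century' into a number.
buffer.seek(0)
default = pd.read_csv(buffer)

# Passing dtype keeps 'Century' as a categorical-style string.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'Century': str})

print(default['Century'].iloc[0], repr(typed['Century'].iloc[0]))
```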