First, I want to look at a single piece of data, so I open one of the text files. Each text file is an entire book. The header provides metadata, and I assume the other text files follow a similar format (Title, Author, Posting Date, Release Date, First Posted, Last Updated, Language, Character set encoding).

In order to answer the questions in part 1, I want to build a dataframe containing the metadata to work with. I also need to separate the actual book text from the metadata.

In [1]:
import pandas as pd
import re
import os

I decided to use regular expressions to detect the metadata.
All of the patterns are defined here.

In [2]:
#dictionary containing the metadata that I am looking for
metadata_dict = {
    #Using regex to match the pattern, and everything after the pattern up to a '\n'.
    'title': re.compile(r'Title: (?P<title>.*)\n'),
    'author': re.compile(r'Author: (?P<author>.*)\n'),
    'posting_date': re.compile(r'Posting Date: (?P<posting_date>.*)\n'),
    'release_date': re.compile(r'Release Date: (?P<release_date>.*)\n'),
    'first_posted': re.compile(r'First Posted: (?P<first_posted>.*)\n'),
    'last_updated': re.compile(r'Last Updated: (?P<last_updated>.*)\n'),
    'language': re.compile(r'Language: (?P<language>.*)\n'),
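    #Note: this pattern is capitalized differently from the 'Character set encoding' header noted above, so it may never match (no such column appears in the output below).
    'character_set_encoding': re.compile(r'Character Set Encoding: (?P<character_set_encoding>.*)\n'),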
}
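
As a quick check of how these patterns behave, the sketch below runs them against made-up header lines. The sample strings are assumptions for illustration, not lines copied from a specific book, and `strict_rx` and `loose_rx` are hypothetical names. It also shows that matching is case-sensitive, so a header spelled `Character set encoding:` would need either a matching pattern or `re.IGNORECASE`.

```python
import re

title_rx = re.compile(r'Title: (?P<title>.*)\n')

# Hypothetical header line, as returned by f.readline() (newline included)
sample = 'Title: The Oregon Trail\n'
match = title_rx.match(sample)
title = match.group('title')

# match() anchors at the start of the string, so an indented line is not detected
no_match = title_rx.match('  Title: The Oregon Trail\n')

# Case matters: the capitalized pattern misses a lower-case header
strict_rx = re.compile(r'Character Set Encoding: (?P<enc>.*)\n')
loose_rx = re.compile(r'Character set encoding: (?P<enc>.*)\n', re.IGNORECASE)
header = 'Character set encoding: UTF-8\n'
strict_hit = strict_rx.match(header)
loose_hit = loose_rx.match(header)

print(title, no_match, strict_hit, loose_hit.group('enc'))
```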

In [3]:
#line parser that checks for regex matches 
def parse_line(line):
    for key, rx in metadata_dict.items():
        match = rx.match(line)
        if match:
            return key, match
    #if metadata is not found
    return None, None
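
Because each pattern's group name equals its dictionary key, the result of `parse_line` can be turned into a row entry with a single lookup. The sketch below is one possible way to do that. `COLUMN_NAMES`, `parse_header`, and the sample lines are hypothetical names and data introduced only for illustration, and only two of the patterns are repeated so the block runs on its own.

```python
import re

# Two of the notebook's patterns, repeated so the sketch is self-contained
metadata_dict = {
    'title': re.compile(r'Title: (?P<title>.*)\n'),
    'release_date': re.compile(r'Release Date: (?P<release_date>.*)\n'),
}

# Hypothetical mapping from regex key to dataframe column name
COLUMN_NAMES = {'title': 'Title', 'release_date': 'Release Date'}


def parse_line(line):
    for key, rx in metadata_dict.items():
        match = rx.match(line)
        if match:
            return key, match
    return None, None


def parse_header(lines):
    """Collect every recognized metadata field from an iterable of lines."""
    row = {}
    for line in lines:
        key, match = parse_line(line)
        if key is not None:
            # The group name equals the key, so one lookup covers every field
            row[COLUMN_NAMES[key]] = match.group(key)
    return row


# Made-up header lines for illustration
sample_header = [
    'Title: Walking\n',
    '\n',
    'Release Date: August 7, 2008 [EBook #1022]\n',
]
row = parse_header(sample_header)
print(row)
```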

In [4]:
#empty list that will be appended to after parsing each book
data = []
#looping through files that end with '.txt' within the directory 'gutenberg'
for filename in os.listdir('gutenberg'):
    if filename.endswith('.txt'):
        #creating an empty dict for each book, this will eventually be one element in the 'data' list and one row in the df
        row = {}
        #ignoring encoding/decoding errors
        with open('gutenberg/'+filename, encoding='utf-8', errors='ignore') as f:
            #saving the filename as a data point
            row['Filename'] = filename
            #looping through the first 40 lines
            for i in range(40):
                line = f.readline()
                #calling the line parser to detect metadata
                key, match = parse_line(line)

                if key == 'title':
                    row['Title'] = match.group('title')
                if key == 'author':
                    row['Author'] = match.group('author')
                if key == 'posting_date':
                    row['Posting Date'] = match.group('posting_date')
                if key == 'release_date':
                    row['Release Date'] = match.group('release_date')
                if key == 'first_posted':
                    row['First Posted'] = match.group('first_posted')
                if key == 'last_updated':
                    row['Last Updated'] = match.group('last_updated')
                if key == 'language':
                    row['Language'] = match.group('language')
                if key == 'character_set_encoding':
                    row['Character Set Encoding'] = match.group('character_set_encoding')

            #Parsing the actual book text
            #resetting the read pointer
            f.seek(0)
            raw = f.read()
            #Determining start and stop points of the actual book text
            #Start point can be improved, edge case of *** START ... *** taking more than one line.
            start = re.search(r'\*\*\*.*START.*', raw).end()
            #There is a line before this, usually starting with 'End of the Project' but has variations. The ending point can be improved upon.
            stop = re.search(r'\*\*\*.*END.*', raw).start()
            text = raw[start:stop]
            #replacing runs of characters other than letters, digits and periods with a single space, then converting to lower case.
            processed_text = re.sub('[^A-Za-z0-9.]+', ' ', text).lower()
            #Tracking whether the string 'truth' occurs more than twice (the key is only set when it does, so other books end up with NaN in this column)
            if len(re.findall(r'truth', processed_text)) > 2:
                row['Truth appearing more than twice'] = True
            #Number of times closing quotations appears in the text.
            row['Instances of dialogue'] = len(re.findall(r'\”', text))
            #Number of characters in book
            row['Book Length in Characters'] = len(text)
            
        data.append(row)
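
The comments in the cell above note that the start and stop detection can be improved. One possible refinement, sketched here under the assumption that the markers look like `*** START ... ***` and `*** END ... ***`, is to fall back to the file boundaries when a marker is missing and to search for the END marker only after the START marker. `extract_body` and the sample text are hypothetical and not part of the notebook.

```python
import re

START_RX = re.compile(r'\*\*\*.*START.*')
END_RX = re.compile(r'\*\*\*.*END.*')


def extract_body(raw):
    """Return the text between the START and END marker lines.

    Falls back to the start or end of the file when a marker is missing.
    """
    start_match = START_RX.search(raw)
    start = start_match.end() if start_match else 0
    # Only look for the END marker after the START marker
    end_match = END_RX.search(raw, start)
    stop = end_match.start() if end_match else len(raw)
    return raw[start:stop]


# Made-up file contents for illustration
sample = (
    'Title: X\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK X ***\n'
    'Body text.\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK X ***\n'
    'License\n'
)
body = extract_body(sample)
no_markers = extract_body('just text')
print(repr(body), repr(no_markers))
```

With this fallback, a file without markers no longer raises an `AttributeError` from calling `.end()` on `None`; whether keeping the whole file in that case is the right behavior depends on the data.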

Now that I have a list of dictionaries, I can convert this directly to a dataframe.

In [5]:
metadata_df = pd.DataFrame(data)

In [6]:
metadata_df

,Author,Book Length in Characters,Filename,First Posted,Instances of dialogue,Language,Last Updated,Posting Date,Release Date,Title,Truth appearing more than twice
0,Dante Alighieri,561938,1012-0.txt,"September 4, 1997",52,Italian,"December 8, 2014","November 7, 2015 [EBook #1012]","August, 1997",La Divina Commedia di Dante,
1,by (AKA B. M. Sinclair) B. M. Bower,160078,1014-0.txt,,444,English,"October 9, 2016","July 27, 2008 [EBook #1014]","August, 1997",The Lure of the Dim Trails,True
2,"Francis Parkman, Jr.",722277,1015-0.txt,,543,English,"November 18, 2016",,"April 27, 2006 [EBook #1015]",The Oregon Trail,True
3,Oscar Wilde,82115,1017-0.txt,,0,English,,,"September 26, 2014 [eBook #1017]",The Soul of Man,
4,Mark Twain (Samuel Clemens),294746,102-0.txt,,0,,"November 8, 2016",,"August 20, 2006 [EBook #102]",The Tragedy of Pudd'nhead Wilson,True
5,Henry David Thoreau,67788,1022-0.txt,,38,English,"July 22, 2017",,"August 7, 2008 [EBook #1022]",Walking,True
6,Robert Louis Stevenson and Lloyd Osbourne,765417,1024-0.txt,,2859,English,"September 14, 2016",,"February 11, 2006 [EBook #1024]",The Wrecker,True
7,George Grossmith,231288,1026-0.txt,,658,English,,,"August 14, 2011 [eBook #1026]",The Diary of a Nobody,True
8,Zane Grey,527194,1027-0.txt,,1655,English,"October 14, 2016","July 27, 2008 [EBook #1027]",August 1997,The Lone Star Ranger,True
9,(AKA Charlotte Bronte) Currer Bell,501202,1028-0.txt,,1218,English,"November 1, 2016",,"August 6, 2008 [EBook #1028]",The Professor,True


I want to check each column for missing values to decide which columns to use when answering the questions, and I will also compute some general statistics. Note that 'Truth appearing more than twice' is only set when the condition holds, so a missing value in that column means the count was two or fewer.
From a quick glance, the date columns need cleaning as well. Some contain extra text (e.g. August 20, 2006 [EBook #102]) and some use a different date format (e.g. August 1997).

In [7]:
percent_missing = metadata_df.isnull().sum() * 100 / len(metadata_df)
percent_missing.sort_values()

Book Length in Characters           0.000000
Filename                            0.000000
Instances of dialogue               0.000000
Title                               0.000000
Author                              0.131406
Language                            0.131406
Release Date                        2.759527
Truth appearing more than twice    18.134034
Last Updated                       24.572930
Posting Date                       52.036794
First Posted                       98.685940
dtype: float64
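
The non-zero missing percentage for 'Truth appearing more than twice' comes from how the rows were built: the key is only added when the condition is true. A minimal sketch with made-up rows shows the effect and one way to turn the column into a plain boolean. The rows here are assumptions, not data from the corpus.

```python
import pandas as pd

# Hypothetical rows: the flag key is only present when the condition holds
rows = [
    {'Title': 'Book A', 'Truth appearing more than twice': True},
    {'Title': 'Book B'},
]
df = pd.DataFrame(rows)
col = 'Truth appearing more than twice'

# Same calculation as above, restricted to one column
missing_pct = df[col].isnull().sum() * 100 / len(df)

# Comparing with True maps the NaN entries to False
as_bool = df[col].eq(True)
print(missing_pct, as_bool.tolist())
```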

In [8]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 761 entries, 0 to 760
Data columns (total 11 columns):
Author                             760 non-null object
Book Length in Characters          761 non-null int64
Filename                           761 non-null object
First Posted                       10 non-null object
Instances of dialogue              761 non-null int64
Language                           760 non-null object
Last Updated                       574 non-null object
Posting Date                       365 non-null object
Release Date                       740 non-null object
Title                              761 non-null object
Truth appearing more than twice    623 non-null object
dtypes: int64(2), object(9)
memory usage: 65.5+ KB


There are 761 rows in total, which matches the number of files in the folder ending in .txt.

In which books does the word "truth" appear more than twice?

In [9]:
#Selecting book titles that have a value of True in the column 'Truth appearing more than twice'
metadata_df['Title'].loc[metadata_df['Truth appearing more than twice'] == True]

1                             The Lure of the Dim Trails
2                                       The Oregon Trail
4                       The Tragedy of Pudd'nhead Wilson
5                                                Walking
6                                            The Wrecker
7                                  The Diary of a Nobody
8                                   The Lone Star Ranger
9                                          The Professor
10                                        The Night-Born
11             The Cavalier Songs and Ballads of England
13                                             The Pupil
14                              Joe Wilson and His Mates
15                                                 Style
16                                     A Reading of Life
18                                God The Invisible King
19                                   The New Machiavelli
20                                    The Ruling Passion
21                             
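
The count used for this answer matches the substring 'truth', so words such as 'truthful' or 'untruth' are included. If only the whole word should count, a word-boundary pattern is one option. The sketch below compares the two on a made-up sentence; `count_word` is a hypothetical helper, not part of the notebook.

```python
import re


def count_word(text, word='truth'):
    """Count whole-word, case-insensitive occurrences of `word`."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text.lower()))


# Made-up sentence for illustration
sample = 'The truth is truthful; untruth. Truth!'
substring_count = len(re.findall(r'truth', sample.lower()))
whole_word_count = count_word(sample)
print(substring_count, whole_word_count)
```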

Which book has the most dialogue between characters?

In [10]:
#Querying the row with the max value in column 'Instances of dialogue'
metadata_df.loc[metadata_df['Instances of dialogue'].idxmax()]

Author                                      Alexandre Dumas, père
Book Length in Characters                                 2624640
Filename                                               1184-0.txt
First Posted                                                  NaN
Instances of dialogue                                       15194
Language                                                  English
Last Updated                                    February 24, 2017
Posting Date                                                  NaN
Release Date                       November 8, 2008 [EBook #1184]
Title                                   The Count of Monte Cristo
Truth appearing more than twice                              True
Name: 64, dtype: object

In [11]:
#Querying the row with the max value in column 'Book Length in Characters' (the longest book)
metadata_df.loc[metadata_df['Book Length in Characters'].idxmax()]

Author                                            Victor Hugo
Book Length in Characters                             3235163
Filename                                            135-0.txt
First Posted                                              NaN
Instances of dialogue                                       0
Language                                              English
Last Updated                                 January 18, 2016
Posting Date                                              NaN
Release Date                       June 22, 2008 [EBook #135]
Title                                          Les Misérables
Truth appearing more than twice                          True
Name: 148, dtype: object
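
The longest book reports 0 instances of dialogue, which may mean its text uses straight quotes (") rather than the curly closing quote that the counter looks for. A rough sketch that counts both styles follows. It assumes straight quotes come in opening and closing pairs, which will not hold for every text, and `count_dialogue` is a hypothetical helper.

```python
def count_dialogue(text):
    """Estimate quoted passages from curly closing quotes and straight-quote pairs."""
    curly_closing = text.count('\u201d')  # the curly closing quotation mark
    straight_pairs = text.count('"') // 2  # assumes balanced straight quotes
    return curly_closing + straight_pairs


# Made-up text mixing both quote styles
sample = '\u201cHello,\u201d she said. "Goodbye," he replied.'
mixed_count = count_dialogue(sample)
print(mixed_count)
```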

English to non-English

In [12]:
#dropping rows with null values in the Language column
language_df = metadata_df.dropna(subset=['Language'])
language_df['Language'].value_counts(normalize=True)*100

English     99.210526
French       0.526316
Japanese     0.131579
Italian      0.131579
Name: Language, dtype: float64

Most common release date

In [13]:
#Without cleansing
metadata_df['Release Date'].mode()

0    May, 1999
dtype: object

In [14]:
#copying so that the cleaning does not modify metadata_df
mode_df = metadata_df.copy()
#removing [EBook #] and anything after it
mode_df['Release Date'] = mode_df['Release Date'].str.replace(r'\[.*\].*', '', regex=True)
#removing whitespace
mode_df['Release Date'] = mode_df['Release Date'].str.strip()
#regex looking for a word character, followed by a space, followed by four digits. Adding the comma if found (e.g. August 1997 to August, 1997)
mode_df['Release Date'] = mode_df['Release Date'].str.replace(r'(\w)( \d{4})', r'\1,\2', regex=True)

In [15]:
mode_df['Release Date'].mode()

0    August, 1999
dtype: object
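
To see what the cleaning steps do to individual values, the same three transformations can be traced on plain strings. `normalize_release_date` is a hypothetical helper, and the inputs are formats seen in the table earlier.

```python
import re


def normalize_release_date(value):
    """Apply the three cleaning steps to a single date string."""
    # Drop the '[EBook #...]' tag and anything after it
    value = re.sub(r'\[.*\].*', '', value)
    # Trim leftover whitespace
    value = value.strip()
    # Insert a comma between a word character and a following four-digit year
    return re.sub(r'(\w)( \d{4})', r'\1,\2', value)


with_tag = normalize_release_date('August 20, 2006 [EBook #102]')
no_comma = normalize_release_date('August 1997')
already_clean = normalize_release_date('August, 1997')
print(with_tag, '|', no_comma, '|', already_clean)
```

Dates that already have a comma before the year are left unchanged because the character before the space is a comma, not a word character.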

Average Release Date (year only)

In [16]:
#initializing df under a different name, dedicated for this problem
average_df = pd.DataFrame(data)
#found outlier with text after the [EBook #], removing everything after [] including []
average_df['Release Date'] = average_df['Release Date'].str.replace(r'\[.*\].*', '', regex=True)
#stripping any white spaces
average_df['Release Date'] = average_df['Release Date'].str.strip()
#Creating new column 'Release Year' by taking last four chars of 'Release Date'
average_df['Release Year'] = average_df['Release Date'].str[-4:]
#removing rows with null values in the column 'Release Year'
average_df = average_df.dropna(subset=['Release Year']).copy()
#converting to int
average_df['Release Year'] = average_df['Release Year'].astype(int)

In [17]:
avg_year = average_df['Release Year'].mean()
print(avg_year)
print(round(avg_year))

2003.495945945946
2003
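
Slicing the last four characters assumes every cleaned date ends in a year. A slightly more defensive sketch, assuming the same date formats seen in the table earlier, extracts a trailing four-digit group and lets anything else become NaN, which `mean()` skips by default. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample covering the formats seen earlier, plus a missing value
dates = pd.Series([
    'April 27, 2006 [EBook #1015]',
    'August, 1997',
    'August 1997',
    None,
])

# Same tag removal and whitespace stripping as above
cleaned = dates.str.replace(r'\[.*\].*', '', regex=True).str.strip()
# Keep only a four-digit group at the very end; non-matching rows become NaN
years = pd.to_numeric(cleaned.str.extract(r'(\d{4})$', expand=False), errors='coerce')
mean_year = years.mean()
print(mean_year)
```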
