# COGS 108 - EDA Checkpoint

# Names

- Jared Simpauco
- Colin Kavanagh
- Darren Jiang
- Chester Ni
- Jack Howe

# Research Question

Can we create a model to determine an author’s birth year based on excerpts of their written published literature?

## Background and Prior Work

For our project, we plan to create a model that can estimate the time period an author was born in based on the type of text used in their published written literature. We believe this would be an interesting project to attempt as the way text is written and stylized can vary between time periods. While it is not a well-documented or widely explored field, there have been attempts to address similar problems in authorship attribution, age detection from text, and author profiling. However, estimating an author's birth year directly from text is a more specific and nuanced problem. For our report, we plan to pull random passages from Project Gutenberg, which is a non-profit online library of free eBooks. This will act as our dataset and allow us to get a wide variety of published literature to gather data from.

We found a few articles that lend support to our research question and to the feasibility of our project. According to the article "Review of age and gender detection methods based on handwriting analysis" by Fahimeh Alaei & Alireza Alaei, “developing an automated handwriting analysis system to detect a gender or age category from handwriting samples involves two stages, developing and training a model and then testing the trained model” (Alaei). While analysis of handwriting is different from analysis of text, the same two-stage approach of training a model and then testing it is something we plan to use in the creation of our own model. This work also suggests that our goal is plausible: if a model can estimate a writer's age from handwriting, another model may be able to estimate an author's era from the writing style typical of a given time period. Another article that supports our project is "Age Detection in Chat" by Jenny Tam and Craig H. Martell. Using a Naive Bayes classifier (NBC), they tested for differences in text length, emoticon usage, and punctuation to determine a person's age. They found that “As she compared teens against older and older age groups, however, her results monotonically increased until generating an f-score measure of 0.932 for teens against 50 year olds” (Tam & Martell). This research suggests that there is a measurable difference between how younger and older generations write in chat, as the reported classifier was highly accurate. Given that their research compared writers from the same time period, there is a possibility that we can create a model that detects such differences between writers from different time periods.

- Alaei, F., Alaei, A. Review of age and gender detection methods based on handwriting analysis. Neural Comput & Applic 35, 23909–23925 (2023). https://doi.org/10.1007/s00521-023-08996-x
- Tam, J., Martell, C. H. Age Detection in Chat. 2009 IEEE International Conference on Semantic Computing, Berkeley, CA, USA, 2009, pp. 33-39. doi: 10.1109/ICSC.2009.37

# Hypothesis


Based on preliminary research, we hypothesize that the language models we work with will be able to find some amount of correlation between the style of writing within samples of literature and the time period in which they were written. The English language has evolved and continues to evolve over time, and certain trends may manifest within grammatical structures and vocabulary. However, we are aware that many factors could affect our ability to accurately predict the time period associated with the literature. For instance, translated texts may be stylistically more similar to other texts from the period when they were translated than to texts from the period during which they were originally written.

# Data

## Data overview

- Dataset #1
  - Dataset Name: Book ID & Author DOB
  - Link to the dataset: N/A
  - Link to API used: https://github.com/garethbjohnson/gutendex
  - Number of observations: ~3200
  - Number of variables: 2

- Dataset #2
  - Dataset Name: Book ID & Book Excerpts
  - Link to the dataset: N/A
  - Link to API used: https://www.gutenberg.org/cache/epub/
  - Number of observations: ~9600
  - Number of variables: 2

- Dataset #3
  - Dataset Name: Final Dataset
  - Link to the dataset: N/A
  - Number of observations: ~8700
  - Number of variables: 3

### Dataset 1: Book ID & Author DOB
In our first dataset, we store the author's year of birth and a book ID for each one of roughly 3200 books selected from the Project Gutenberg library. We selected books by querying Project Gutenberg’s unofficial API to find roughly 640 books from each century from the year 1500 through 1999. The API is also able to provide metadata about each book, such as the book id, the author’s name, the author’s date of birth, etc. Using this API, we put together our first dataset.

Variables:
- book_id: (int) ID of book
- birth_yr: (int) the birth year of the author
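
As a rough illustration of the per-century query described above, the sketch below only builds the request URLs and does not contact the API. The helper name `century_query_url` is hypothetical, and the base URL is assumed to be the public Gutendex instance; the query parameters `author_year_start`, `author_year_end`, `languages`, and `page` are the ones described in the Gutendex documentation.

```python
from urllib.parse import urlencode

# Assumed base URL of the public Gutendex instance
BASE_URL = "https://gutendex.com/books"

def century_query_url(start_year, page=1, language="en"):
    """Hypothetical helper: URL for books whose author was alive
    at some point in the century beginning at start_year."""
    params = {
        "author_year_start": start_year,
        "author_year_end": start_year + 99,
        "languages": language,
        "page": page,
    }
    return f"{BASE_URL}?{urlencode(params)}"

# One first-page URL per century from the 1500s through the 1900s
urls = [century_query_url(yr) for yr in range(1500, 2000, 100)]
print(urls[0])
```

Gutendex returns up to 32 books per page, so requesting 20 pages per century is consistent with the roughly 640 books per century mentioned above.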

### Dataset 2: Book ID & Book Excerpts
For the second dataset, we randomly sampled three excerpts from each of the ~3200 books, which gives us roughly 9600 observations. Each excerpt represents roughly one paragraph from within the book, and we will be using these paragraphs to estimate the time period associated with the book's author. This dataset was put together by downloading the plain-text files that Project Gutenberg hosts for each book.

Variables:
- book_id: (int) ID of the book 
- book excerpts: (str) the randomized excerpts pulled from the books

### Dataset 3: Final Dataset
We will be combining the two datasets by joining them on the book IDs. From there, some cleaning will be required. During merging, any rows whose book ID does not appear in both datasets will be left out of the final dataset. Any rows that have null/NaN values will also be removed. Additionally, some text samples aren't true excerpts from the text, but rather a list of section names or a table of contents. These rows are normally very short (50 words or fewer) and will be removed as well.

Variables:
- book_id: (int) ID of the book 
- book excerpts: (str) the randomized excerpts pulled from the books
- birth_yr: (int) the birth year of the author
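
The joining and cleaning steps described above can be sketched on toy data as follows. The values are made up for illustration, and the word-count threshold is lowered to 2 so that the tiny example has something left over; the notebook's own cleaning code uses a cutoff of 50 words.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are made up for illustration)
dates = pd.DataFrame({"book_id": [1, 2, 3], "birth_yr": [1564, 1775, 1812]})
excerpts = pd.DataFrame({
    "book_id": [1, 1, 2, 4],
    "text": ["first long excerpt", "second long excerpt", None, "orphan excerpt"],
})

# Inner join keeps only book IDs present in both tables
merged = excerpts.merge(dates, on="book_id", how="inner")
# Drop rows with missing values
merged = merged.dropna().reset_index(drop=True)
# Drop excerpts at or below a word-count threshold (2 here for the toy data)
min_words = 2
word_counts = merged["text"].str.split().str.len()
merged = merged[word_counts > min_words].reset_index(drop=True)
print(merged)
```

In this toy case, book 4 has no birth year and book 3 has no excerpts, so the inner join drops both; the book 2 row is then removed because its text is missing.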

6.2 MB - Compressed data files

## Dataset #1: Book ID & Author DOB

Notes:
- All code that generates the datasets can be found at: "Project_Files/pull_data.ipynb"
- To run the code correctly and to avoid unnecessary generation of datasets, make sure to unzip/uncompress the data.zip file

In [None]:
# Libraries for Dataset Creation
from urllib.request import urlopen 
from urllib.error import HTTPError
import json 
import numpy as np
import csv
import re
import random
import pandas as pd
import os

# Hide Warnings (for Presentation)
import warnings
warnings.filterwarnings('ignore')

# Additional Libraries for EDA
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')
import string

In [None]:
assert os.path.isdir("Project_Files/data"), "Please unzip data.zip file in Project_Files before running this notebook"

In [None]:
def pull_data(min_yr = 1500, max_yr = 2000, num_pgs = 20):
    # Creates a 2d numpy array, where the first column is the book id and 
    # the second column is the year that the author was born
    data = []

    for yr in range(min_yr, max_yr, 100):
        for pg in range(1, num_pgs + 1, 1):
            # Show Progress
            print(f"Processing Year {yr}, Page {pg}/{num_pgs}", end='\r')
            # Create a query that gets all books where the author was alive during the specified century, at the specified page
            url = f"http://gutendex.com/books?author_year_start={yr}&author_year_end={yr + 99}&languages=en&page={pg}"
            # Pull resulting json file
            response = urlopen(url)
            data_json = json.loads(response.read()) 
            # Save book id and author birth year, skipping books with no listed author or birth year
            data_pg = [
                (x['id'], x['authors'][0]['birth_year'])
                for x in data_json['results']
                if x['authors'] and x['authors'][0]['birth_year'] is not None
            ]
            data.extend(data_pg)
            
    return np.array(data)

In [None]:
try:
    # Check if date_data.csv exists
    date_data = pd.read_csv("Project_Files/data/date_data.csv").astype(int)
except FileNotFoundError:
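    # If date_data.csv doesn't exist, pull necessary data using Gutendex API
    date_data = pull_data()
    # Save resulting data in csv file (cast each column to int element-wise)
    data_csv = {'book_id': date_data[:,0].astype(int), 'birth_yr': date_data[:,1].astype(int)}
    # Write to the same path that is read back below
    with open('Project_Files/data/date_data.csv', 'w', newline='') as f: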
        w = csv.writer(f)
        w.writerow(data_csv.keys())
        w.writerows(zip(*data_csv.values()))
    # Pull data as pandas Dataframe for further use
    date_data = pd.read_csv("Project_Files/data/date_data.csv").astype(int)

date_data.head()

## Dataset #2: Book ID & Book Excerpts

In [None]:
def get_text(book_id):
    # Pulls the text file from the Gutenberg Archive of a given book using its book id
    url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    text = urlopen(url).read()
    return text

def get_text_samples(text, num_samples = 3):
    # Get rid of the Gutenberg header and footer (delimited by '***' markers)
    book_text = [x.strip() for x in text.decode("utf-8").split('***')][2]
    # Remove '\r' symbol
    book_text = re.sub(r"[\r]+", "", book_text)
    # split by paragraph breaks
    book_text = re.split(r"\n{2,}", book_text)
    # remove paragraphs shorter than about 8 average-length sentences (~50 characters each)
    book_text = [p for p in book_text if len(p) >= 50 * 8]
    # Randomly sample remaining paragraphs
    paragraphs = random.sample(book_text, min(num_samples, len(book_text)))
    # Replace \n with ' ' and return paragraphs
    return np.array([re.sub(r"\n", " ", p) for p in paragraphs])

In [None]:
def create_excerpt_data(data):
    # Creates a 2d numpy array of the book id and randomly sampled
    # paragraphs within the book
    book_ids = data[:,0].astype(int)
    book_samples = []
    invalid_ids = []

    for i in range(book_ids.shape[0]):
        # For each book try to access the text file
        try:
            text = get_text(book_ids[i])
        except HTTPError as err:
            # If unable to access the text file, display the error code 
            # and save the book_id in invalid_ids for logging purposes
            print(f"HTTP {err.code} Error: book_id = {book_ids[i]}")
            invalid_ids.append(book_ids[i])
            # Skip this book so that a missing or stale `text` is never sampled
            continue
            
        # Clean and randomly sample text samples
        text_samples = get_text_samples(text)
        # Combine text samples with associated book_id
        ids = np.full(len(text_samples), book_ids[i])
        # Save samples and book id into book_samples
        samples = np.array(list(zip(ids, text_samples)))
        book_samples.extend(samples)
        
        # Show Progress
        print(f"Progress: {i/book_ids.shape[0]}", end='\r')

    return np.array(book_samples), invalid_ids

In [None]:
try:
    # Check if excerpts.csv exists
    excerpt_data = pd.read_csv("Project_Files/data/excerpts.csv")
except FileNotFoundError:
    # If excerpts.csv doesn't exist, create necessary data
    book_samples, invalid_ids = create_excerpt_data(date_data.to_numpy())
    # Save resulting excerpts in csv file
    book_data_csv = {'book_id': book_samples[:,0], 'text': book_samples[:,1]}
    # Write to the same path that is read back below
    with open('Project_Files/data/excerpts.csv', 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(book_data_csv.keys())
        w.writerows(zip(*book_data_csv.values()))
    # Pull data as Pandas DataFrame for further use
    excerpt_data = pd.read_csv("Project_Files/data/excerpts.csv")

excerpt_data.head()

## Dataset #3: Final Dataset

In [None]:
try:
    # Check if data.csv exists
    data = pd.read_csv("Project_Files/data/data.csv")
except FileNotFoundError:
    # Merge datasets on book_id (the inner join drops IDs missing from either dataset)
    data = excerpt_data.merge(date_data, how='inner').drop_duplicates().reset_index(drop=True)
    # keep only entries whose text is more than 50 words long
    data = data[data['text'].apply(lambda x: len(re.findall(r"\w+", x)) > 50)]
    # strip any remaining whitespace from the text
    data['text'] = data['text'].str.strip()
    # drop a specific entry that we manually identified as a table of contents
    data = data.drop(4996)
    # Save final dataset to csv file
    data.to_csv("Project_Files/data/data.csv", index=False)

data.head()

# Results

## Exploratory Data Analysis

For our initial EDA, we wanted to see how many books and excerpts we managed to collect after some initial cleaning.

In [None]:
print(f"Number of books in dataset: {data['book_id'].unique().shape[0]}")
print(f"Number of excerpts in dataset: {data.shape[0]}")

### Excerpt Length

When making the dataset, we randomly sampled excerpts from books in Project Gutenberg. One of the first things we wanted to see was the distribution of excerpt lengths across the entire dataset. To do this, we created a new column for EDA purposes:
- text_len: the number of characters in the 'text' column

In [None]:
# Create new column that contains number of characters of each text
text_len = data['text'].str.len()
data['text_len'] = text_len

# Display the distribution of text length 
plt.hist(
    data['text_len'],
    bins = 18
)
plt.xlabel('text length')
plt.ylabel('number of excerpts')
plt.title('Histogram of Excerpt Text Length')

# Print out the dataset with new text_len column
data.head()

The histogram shows that a few very long excerpts make the distribution difficult to read, so we remove outliers by dropping any excerpts whose length is more than 3 standard deviations above the mean:

In [None]:
# Calculate threshold to remove values
t = data['text_len'].mean() + 3 * data['text_len'].std()

# Remove excerpts whose length is at or above the threshold
data = data.loc[data['text_len'] < t]

# Display the new distribution of text length 
plt.hist(
    data['text_len'],
    bins = 18
)
plt.xlabel('text length')
plt.ylabel('number of excerpts')
plt.title('Histogram of Excerpt Text Length with Outliers Removed');

### Author Birth Year

When creating our dataset, we wanted to pull a consistent number of books from each century from the 1500s through the 1900s (that is, a set number of books whose author was born in each century). However, because of how the API works, the closest we could get was pulling a set number of books whose author was alive during each century. Because of this, and because of the data cleaning we did earlier, we wanted to see the actual distribution of author birth years.

In [None]:
plt.hist(
    data['birth_yr']
)
plt.xlabel('year')
plt.ylabel('number of excerpts')
plt.title('Histogram of Author Birth Year');

Similar to text length, there seemed to be an outlier in author birth year, where one or more authors were born before 1400.

In [None]:
data[data['birth_yr'] < 1400]

After some searching, we found that these rows all came from a single book by a single author, "The Retreat of Ten Thousand", a Greek epic that was translated to English and published again in 1897. Because it was translated into English, and because this was the only outlier in terms of author birth year, we removed it from the dataset.

In [None]:
# Remove outlier from data
data = data[data['birth_yr'] >= 1400]

# Display new histogram of author birth years
plt.hist(
    data['birth_yr']
)
plt.xlabel('year')
plt.ylabel('number of excerpts')
plt.title('Histogram of Author Birth Year');

There seems to be a strong skew in the dataset: there are far more books whose author was born between 1800 and 2000 than between 1500 and 1800.

In [None]:
data['birth_yr'].describe()

The summary statistics of the author birth year match this observation, as the 25th, 50th, and 75th percentiles all fall between 1800 and 1900.

In [None]:
# Creating new Dataset to look at distribution of books per century
centuries = data.copy()

# Creating conditions #
centuries_cond = [
    (centuries['birth_yr'] >= 1500) & (centuries['birth_yr'] < 1600),
    (centuries['birth_yr'] >= 1600) & (centuries['birth_yr'] < 1700),
    (centuries['birth_yr'] >= 1700) & (centuries['birth_yr'] < 1800),
    (centuries['birth_yr'] >= 1800) & (centuries['birth_yr'] < 1900),
    (centuries['birth_yr'] >= 1900) & (centuries['birth_yr'] < 2000)
]

# Creating century column #
centuries['Century'] = np.select(centuries_cond, ['16th', '17th', '18th', '19th', '20th'], default='null')

# Removing null values (outliers) #
centuries = centuries[centuries['Century'] != 'null']

# Getting counts #
centuries = (
    centuries
    .groupby('Century')[['text']]
    .count()
    .reset_index()
    .rename(columns={
        'text': 'Count'
        }
    )
)

In [None]:
plt.bar(x='Century', height='Count', data=centuries);
plt.xlabel('Centuries');
plt.ylabel('Count');
plt.title('Distribution of Excerpts Per Century');

### Author Birth Year vs Text Length

Beyond just looking at the distribution of author birth year or text length, we wanted to see if there was any correlation between the birth year and the text length. 

To do so, we created a scatter plot where the x-axis is the author's birth year and the y-axis is the excerpt text length. We also plotted the line of best fit in order to visually see any kind of correlation, if there was any.

In [None]:
# Create line of best fit
a, b = np.polyfit(
    data['birth_yr'].to_numpy(), 
    data['text_len'].to_numpy(), 
    1
)
# Plot scatter plot of birth year and text length
plt.scatter(
    x = data['birth_yr'], 
    y = data['text_len'],
    alpha = 0.5,
    c = 'gray'
)
plt.xlabel('Author Birth Year')
plt.ylabel('Excerpt Text Length')
plt.title('Scatterplot of Author Birth year vs Text Length');
# Plot line of best fit
plt.plot(
    data['birth_yr'].to_numpy(), 
    a * data['birth_yr'].to_numpy() + b,
    color='red'
) ;

From the plot, we can see a weak negative correlation between the excerpt text length and the author's birth year. This may be an artifact of the imbalance in the data, since there are significantly more books in the dataset written during 1750-1950 than during 1550-1750. Beyond the line of best fit, the scatterplot itself shows no obvious indicator of any trend, and creating a more balanced, randomly sampled dataset will likely reduce this weak correlation even further.
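
One way to put a number on "weak" is the Pearson correlation coefficient, which always has the same sign as the slope of the least-squares line. The sketch below uses made-up values rather than the actual dataset, so it only illustrates the calculation; with the real data, the two arrays would be `data['birth_yr']` and `data['text_len']`.

```python
import numpy as np

# Toy values for illustration only; not taken from the dataset
birth_yr = np.array([1550, 1620, 1710, 1805, 1850, 1901, 1930])
text_len = np.array([900, 650, 1100, 700, 820, 600, 750])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(birth_yr, text_len)[0, 1]

# Slope of the least-squares line; it has the same sign as r
slope, intercept = np.polyfit(birth_yr, text_len, 1)

print(f"r = {r:.3f}, slope = {slope:.3f} characters per year")
```

A value of `r` close to 0 would support describing the relationship as weak, while the slope alone depends on the units of both axes and is harder to interpret.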

### Balanced Dataset

From the previous plots, it is clear that there is a significant skew in the distribution of birth years in the dataset. To address this, we want to create a new balanced dataset by randomly sampling a set number of excerpts from each century.

The first thing we did was to create a new column that indicates the century the author was born in.

In [None]:
# Create a new column that indicates century
data['birth_ctry'] = data['birth_yr'] // 100 + 1
data.head()

Once we had that column, we grouped the dataset by the newly created birth century column ('birth_ctry') and randomly sampled 2000 excerpts from each century. We sampled with replacement because several centuries had fewer than 2000 excerpts.

In [None]:
# Randomly sample 2000 entries from each century
data_balanced = (
    data
    .groupby('birth_ctry')
    .apply(
        lambda x: x.sample(2000, replace=True)
    )
    .reset_index(drop=True))
data_balanced.head()

With this new balanced dataset, we wanted to look at the relationship again to see whether the negative correlation from the earlier visualization has been affected in any way.

In [None]:
# Create line of best fit
a, b = np.polyfit(
    data_balanced['birth_yr'].to_numpy(), 
    data_balanced['text_len'].to_numpy(), 
    1
)
# Plot scatter plot of birth year and text length
plt.scatter(
    x = data_balanced['birth_yr'], 
    y = data_balanced['text_len'],
    alpha = 0.5,
    c = 'gray'
)
plt.xlabel('Author Birth Year')
plt.ylabel('Excerpt Text Length')
plt.title('Scatterplot of Author Birth year vs Text Length');
# Plot line of best fit
plt.plot(
    data_balanced['birth_yr'].to_numpy(), 
    a * data_balanced['birth_yr'].to_numpy() + b,
    color='red'
) 
# Show new quartiles and summary statistics
data_balanced['birth_yr'].describe()

From the figure, we can see that the already weak negative correlation is weaker in the new balanced dataset.

### Common Words

So far in our EDA process, we have looked at numerical data such as the author birth year and the length of the excerpt without really looking into the content of the excerpts themselves. Since our project is focused on predicting an author's birth year from the writing style and vocabulary of the associated text excerpt, we wanted to get a better idea of the vocabulary being used.

One of the methods we are likely to use to identify author birth year is TF-IDF. While the result of TF-IDF vectorizer isn't likely to be visually interesting or helpful, it may be useful to see some of the most common words that appear in the excerpts and identify any invalid words that come up. As a result, we decided to do a very brief look into some of the most common words that appear in the dataset.
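
For readers unfamiliar with TF-IDF, here is a minimal sketch of the textbook formula on a made-up three-document corpus, using only the standard library. This is not the exact weighting used by scikit-learn's `TfidfVectorizer`, which by default smooths the IDF term and L2-normalizes each row; the function name `tf_idf` and the toy sentences are assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration
docs = [
    "thou art a noble knight".split(),
    "the knight rode a horse".split(),
    "the train left the station".split(),
]

def tf_idf(docs):
    """Textbook TF-IDF: (term count / doc length) * ln(N / document frequency)."""
    n_docs = len(docs)
    # Number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
# Terms of the first document, ordered from highest to lowest score
print(sorted(scores[0], key=scores[0].get, reverse=True))
```

Words that appear in only one document (such as "thou") score higher than words shared across documents (such as "knight"), which is why TF-IDF may help surface period-specific vocabulary.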

To start off, we created a copy of the dataset in order to remove punctuation and do some text processing.

In [None]:
# Create a copy of the dataset to do text analysis on
data_tfidf = data.copy()

# Process the dataset by removing punctuation for TF-IDF
def preprocess(text):
    text = re.sub(r'[^A-Za-z0-9]+', " ", text)
    text = text.lower()
    return text
data_tfidf["text"] = data.get("text").apply(preprocess)
data_tfidf.head()

From here, we created a TF-IDF Vectorizer to vectorize the excerpts and find some of the most distinctive words used in the dataset.

In [None]:
# Create a standard TF-IDF using arbitrary parameters
tfidf = TfidfVectorizer(
    sublinear_tf=True,
    analyzer='word',
    max_features=2000,
    tokenizer=nltk.tokenize.word_tokenize,
    stop_words=stopwords.words("english")
)

In [None]:
# Vectorize the excerpt text 
tfidf_array = tfidf.fit_transform(data_tfidf["text"]).toarray()
text_tfidf = pd.DataFrame(tfidf_array)
text_tfidf.columns = tfidf.get_feature_names_out()
# For each excerpt, find the word with the highest TF-IDF score (first ten excerpts shown)
most_unique = text_tfidf.idxmax(axis = 1)
most_unique[:10]

From here, we can see the highest-scoring (most distinctive) word for each of the first 10 excerpts in the dataset. Notably, these are all words that would be fairly common in older texts, which could be explained by the skew in author birth years that we saw earlier.
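
Note that `idxmax(axis=1)` returns one word per excerpt rather than a ranking over the whole dataset. The sketch below, on a made-up score matrix, contrasts that per-excerpt view with one possible corpus-wide ranking by mean TF-IDF score; the choice of the mean as the aggregate is an assumption, not something the analysis above commits to.

```python
import pandas as pd

# Toy TF-IDF matrix: rows are excerpts, columns are terms (made-up scores)
scores = pd.DataFrame(
    {
        "thee":  [0.9, 0.0, 0.1],
        "said":  [0.4, 0.5, 0.6],
        "train": [0.0, 0.7, 0.0],
    }
)

# Highest-scoring term within each excerpt (one label per row)
top_per_excerpt = scores.idxmax(axis=1)

# Terms ranked by mean score across all excerpts
top_overall = scores.mean(axis=0).sort_values(ascending=False)

print(top_per_excerpt.tolist())
print(top_overall.index.tolist())
```

In this toy matrix, "said" never wins within the first two excerpts yet ranks first overall, which shows how the two views can disagree.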

In [None]:
# Join all the text in the column into one string
text = " ".join(i for i in data_tfidf.text)

# Create a WordCloud object with some parameters
wordcloud = WordCloud(background_color="white", stopwords=stopwords.words("english"), min_font_size=10).generate(text)

# Plot the word cloud using matplotlib
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In the word cloud above, the words that appear most prominently in our data are "one", "would", and "time", in that order. This is not what we were expecting, as we had hoped to find specific key words that would help identify the time period each of these excerpts came from. Most of the other words are also ones we would use in modern-day English, which suggests that the context in which the words are used, rather than the words themselves, may be the deciding factor.

# Ethics & Privacy

We plan on gathering our data from [Project Gutenberg](https://www.gutenberg.org/), where all books published on the website have expired US copyrights. A potential source of bias in the dataset is the type of books that are collected when scraping the website. To elaborate, books are added based on community input, so only certain books are available and we are limited to those. In addition, since we cannot feasibly scrape every single book from the website, a naive scrape would collect only the most recently added books and leave out books that were added longer ago. While we cannot control what gets onto the website, we can randomly sample in order to include works that were added further back in time rather than only those that were added recently. All books on Project Gutenberg are [free to use however the user sees fit](https://www.gutenberg.org/policy/permission.html) and therefore will not pose a problem with regard to data privacy.

# Team Expectations 

In our project, we expect a few but essential things from all our group members. We expect all group members to equally contribute to the best of their ability to the project. This project is a team effort, and everyone has something to contribute, whether they realize it or not. We also expect all members to attend weekly meetings at the time that we have already agreed and established. We expect everyone to stay in close communication via either Discord or the text chat that we have created. Any problems, concerns, or ideas that come up should be swiftly communicated via either of these channels. We intend to uphold these expectations not only to create an outstanding final product but also to alleviate any stress that can potentially be allocated to other team members. We are all in this together until the end. 

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/30 | 4:30pm | Brainstorm Topic/Data Question ideas | Discuss background experience; determine data science question; begin background research and working of project proposal |
| 11/01 | 4:30pm | Finish project proposal | Confirm progress on project proposal, edit proposal as needed, and submit. |
| 11/06 | 4:30pm | Begin looking at Project Gutenberg API. Cross-reference data in Project Gutenberg to ensure dates are correct. If possible, begin importing and cleaning data. | Discuss any issues that may arise from the dataset. Assign group members to specific parts of work; discuss wrangling and other possible analytical approaches. |
| 11/13 | 4:30pm | Import and clean data from API; (Randomly pull excerpts from each book as well as author birth year); Write report for **Checkpoint 1: Data** | Review dataset and discuss any complications that arose from importing and cleaning data; Turn in **Checkpoint 1: Data** |
| 11/20 | 4:30pm | EDA and Analysis (potentially model training) | Review EDA and analysis |
| 11/28 | 4:30pm | Model Training and Analysis; Finish Report for **Checkpoint 2: EDA** | Review Model Training and Analysis; look at any mistakes or obstacles that occurred and discuss ways of correcting them. Turn in **Checkpoint 2: EDA** |
| 12/04 | 4:30pm | Complete Analysis; Begin drafting results, conclusions, and discussion; If possible, work on Final project Presentation/**Final Project Report** | Review, edit, and finalize project report. Record Project Presentation if possible |
| 12/11 | 4:30pm | Finish project presentation and record project presentation video parts (if needed) | Turn in **Final Project Report**/Final Project and Group Surveys |