# COGS 108 - Data Checkpoint

# Names

- Jared Simpauco
- Colin Kavanagh
- Darren Jiang
- Chester Ni
- Jack Howe

# Research Question

Can we create a model to determine an author’s birth year based on excerpts of their written published literature?

## Background and Prior Work

For our project, we plan to create a model that estimates the time period in which an author was born based on the style of text in their published literature. We believe this is an interesting problem because the way text is written and stylized varies between time periods. Estimating an author's birth year directly from text is not a well-documented or widely explored problem, but there have been attempts to address related problems in authorship attribution, age detection from text, and author profiling. For our report, we plan to pull random passages from Project Gutenberg, a non-profit online library of free eBooks. This will act as our dataset and give us a wide variety of published literature to draw from.

We found two articles that support our research question and the feasibility of our project. According to *Review of age and gender detection methods based on handwriting analysis* by Fahimeh Alaei and Alireza Alaei, “developing an automated handwriting analysis system to detect a gender or age category from handwriting samples involves two stages, developing and training a model and then testing the trained model” (Alaei). While analysis of handwriting is different from analysis of text, this two-stage approach of training and then testing a model is something we plan to use in the creation of our own model. The review also suggests that our model is plausible: if a model can estimate a writer's age from handwriting, another model may be able to estimate when an author lived from the writing style typical of a given time period.

Another article that supports our project is *Age Detection in Chat* by Jenny Tam and Craig H. Martell. Using a Naive Bayes Classifier (NBC), they tested for differences in text length, emoticon usage, and punctuation to determine a person's age. They report that “As she compared teens against older and older age groups, however, her results monotonically increased until generating an f-score measure of 0.932 for teens against 50 year olds” (Tam & Martell). This research suggests that there are measurable differences between how younger and older generations write in chat, since the reported classifier scored highly. Because that research compared people writing during the same time period, it is plausible that a model can also detect differences between writers from different time periods.

Alaei, F., Alaei, A. Review of age and gender detection methods based on handwriting analysis. Neural Comput & Applic 35, 23909–23925 (2023). https://doi.org/10.1007/s00521-023-08996-x

J. Tam and C. H. Martell, "Age Detection in Chat," 2009 IEEE International Conference on Semantic Computing, Berkeley, CA, USA, 2009, pp. 33-39, doi: 10.1109/ICSC.2009.37.

# Hypothesis


Based on preliminary research, we hypothesize that the language models we work with will be able to find some amount of correlation between the style of writing within samples of literature and the time period in which they were written. The English language has evolved and continues to evolve over time, and certain trends may manifest within grammatical structures and vocabulary. However, we are aware that many factors could affect our ability to accurately predict the time period associated with the literature. For instance, translated texts may be stylistically more similar to other texts from the period when they were translated than to texts from the period when they were originally written.

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Book ID & Author DOB
  - Link to the dataset: N/A
  - Link to API used: https://github.com/garethbjohnson/gutendex
  - Number of observations: ~3200
  - Number of variables: 2

- Dataset #2
  - Dataset Name: Book ID & Book Excerpts
  - Link to the dataset: N/A
  - Link to API used: https://www.gutenberg.org/cache/epub/
  - Number of observations: ~9600
  - Number of variables: 2

- Dataset #3
  - Dataset Name: Final Dataset
  - Link to the dataset: N/A
  - Number of observations: ~8700
  - Number of variables: 3

### Dataset 1: Book ID & Author DOB
In our first dataset, we store the author's year of birth and a book ID for each of roughly 3200 books selected from the Project Gutenberg library. We selected books by querying Project Gutenberg’s unofficial API (Gutendex) to find roughly 640 books from each century from the year 1500 through 1999. The API also provides metadata about each book, such as the book ID, the author’s name, and the author’s year of birth. Using this API, we put together our first dataset.

Variables:
- book_id: (int) ID of book
- birth_yr: (int) the birth year of the author
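
As a rough illustration of how one Gutendex query maps onto the two variables above, the sketch below builds a query URL and extracts `(book_id, birth_yr)` pairs from a response-shaped dictionary. The helper names `build_query_url` and `extract_rows` and the sample payload are hypothetical and exist only for this example; the query parameters are the ones the notebook's query is meant to use, and skipping books without a known birth year is an assumption about how such rows should be handled.

```python
from urllib.parse import urlencode

def build_query_url(century_start, page):
    # Hypothetical helper: one page of English books whose author
    # was alive at some point in the given century
    params = {
        "author_year_start": century_start,
        "author_year_end": century_start + 99,
        "languages": "en",
        "page": page,
    }
    return "http://gutendex.com/books?" + urlencode(params)

def extract_rows(payload):
    # Keep (book id, first author's birth year) for books that list
    # at least one author with a known birth year
    rows = []
    for book in payload["results"]:
        authors = book.get("authors") or []
        if authors and authors[0].get("birth_year") is not None:
            rows.append((book["id"], authors[0]["birth_year"]))
    return rows

# Made-up payload shaped like an API response (not real API output)
sample_payload = {
    "results": [
        {"id": 1, "authors": [{"name": "Author A", "birth_year": 1775}]},
        {"id": 2, "authors": []},
        {"id": 3, "authors": [{"name": "Author B", "birth_year": None}]},
    ]
}

url = build_query_url(1700, 1)
rows = extract_rows(sample_payload)
```

Only the first sample book survives here, since the other two have no usable birth year.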

### Dataset 2: Book ID & Book Excerpts
For the second dataset, we randomly sampled three excerpts from each of the ~3200 books, which gives roughly 9600 observations. Each excerpt represents roughly one paragraph from within the book, and we will be using these paragraphs to determine the time period in which the book was written. This dataset was put together using Project Gutenberg’s official texts API.

Variables:
- book_id: (int) ID of the book 
- book excerpts: (str) the randomized excerpts pulled from the books
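
The sampling idea can be tried on an in-memory string without downloading anything. This is a minimal sketch, assuming paragraphs are separated by blank lines and that a paragraph needs at least 400 characters to count as real prose; `sample_paragraphs`, its parameters, and the toy text are hypothetical names and values used only for illustration.

```python
import random
import re

def sample_paragraphs(book_text, num_samples=3, min_chars=400, seed=None):
    # Hypothetical helper mirroring the sampling idea on a plain string
    rng = random.Random(seed)
    # Normalize line endings, then split on blank lines to get paragraphs
    paragraphs = re.split(r"\n{2,}", book_text.replace("\r", ""))
    # Keep only paragraphs long enough to be real prose
    long_enough = [p for p in paragraphs if len(p) >= min_chars]
    # Sample without replacement, never asking for more than are available
    chosen = rng.sample(long_enough, min(num_samples, len(long_enough)))
    # Collapse the hard line wraps inside each paragraph
    return [re.sub(r"\n", " ", p) for p in chosen]

# Toy text: one short heading and two long paragraphs (made-up content)
toy_text = (
    "CHAPTER I\n\n"
    + ("word " * 100).strip()
    + "\n\n"
    + ("text\n" * 100).strip()
)
samples = sample_paragraphs(toy_text, num_samples=3, seed=0)
```

Because only two paragraphs pass the length filter, the call returns two excerpts even though three were requested, which is why some books contribute fewer than three observations.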

### Dataset 3: Final Dataset
We combine the two datasets by joining them on the book ID. Because we use an inner join, any book ID that does not appear in both datasets is left out of the final dataset. Rows that have null/NaN values are also removed. Additionally, some text samples aren't true excerpts from the text, but rather a list of section names or the table of contents. These rows are normally very short (50 words or fewer) and are removed as well.

Variables:
- book_id: (int) ID of the book 
- book excerpts: (str) the randomized excerpts pulled from the books
- birth_yr: (int) the birth year of the author
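
To make the effect of each cleaning step concrete, here is a small sketch on toy data. The two DataFrames and their values are made up, and the length filter is expressed in word tokens to match the regular expression used in the merge cell of the notebook.

```python
import re
import pandas as pd

# Toy stand-ins for the two datasets (made-up values)
toy_dates = pd.DataFrame({"book_id": [1, 2, 3], "birth_yr": [1775, 1812, 1850]})
toy_excerpts = pd.DataFrame({
    "book_id": [1, 1, 2, 4],
    "text": [
        " ".join(["word"] * 60),   # long enough to keep
        "CONTENTS CHAPTER I",      # too short: dropped by the word filter
        None,                      # missing text: dropped by dropna
        " ".join(["word"] * 60),   # book 4 has no birth year: dropped by the join
    ],
})

# Inner join keeps only book IDs present in both tables
merged = toy_excerpts.merge(toy_dates, on="book_id", how="inner")
# Remove rows with missing values, then exact duplicates
merged = merged.dropna().drop_duplicates().reset_index(drop=True)
# Keep excerpts with more than 50 word tokens
word_counts = merged["text"].apply(lambda t: len(re.findall(r"\w+", t)))
final = merged[word_counts > 50].reset_index(drop=True)
```

Of the four toy excerpts, only the first one remains in `final`, with the columns `book_id`, `text`, and `birth_yr`.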

6.2 MB - Compressed data files

## Dataset #1: Book ID & Author DOB

Notes:
- All code that generates the datasets can be found at: "Project_Files/pull_data.ipynb"
- Code written here is copied from Project_Files/pull_data.ipynb and is intended to be run in the Project_Files directory.
- To run the code correctly and to avoid unnecessary generation of datasets, make sure to unzip/uncompress the data.zip file and ensure that the data folder is in the same directory that the code is run in.


In [None]:
from urllib.request import urlopen 
from urllib.error import HTTPError
import json 
import numpy as np
import csv
import re
import random
import pandas as pd

In [None]:
def pull_data(min_yr = 1500, max_yr = 2000, num_pgs = 20):
    # Creates a 2d numpy array, where the first column is the book id and 
    # the second column is the year that the author was born
    data = []

    for yr in range(min_yr, max_yr, 100):
        for pg in range(1, num_pgs + 1, 1):
            # Show Progress
            print(f"Processing Year {yr}, Page {pg}/{num_pgs}", end='\r')
            # Create a query that gets all English books where the author was alive
            # in the specified century (yr through yr + 99) at the specified page
            url = f"http://gutendex.com/books?author_year_start={yr}&author_year_end={yr + 99}&languages=en&page={pg}"
            # Pull resulting json file
            response = urlopen(url)
            data_json = json.loads(response.read())
            # Save book id and first author's birth year, skipping books
            # with no listed author or no known birth year
            data_pg = [(x['id'], x['authors'][0]['birth_year']) for x in data_json['results']
                       if x['authors'] and x['authors'][0]['birth_year'] is not None]
            data.extend(data_pg)
            
    return np.array(data)

In [None]:
try:
    # Check if date_data.csv exists
    date_data = pd.read_csv("data/date_data.csv").astype(int)
except FileNotFoundError:
    # If date_data.csv doesn't exist, pull necessary data using Gutendex API
    date_data = pull_data()
    # Save resulting data in csv file (astype converts each column element-wise;
    # calling int() on a whole array would raise a TypeError)
    data_csv = {'book_id': date_data[:,0].astype(int), 'birth_yr': date_data[:,1].astype(int)}
    with open('data/date_data.csv', 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(data_csv.keys())
        w.writerows(zip(*data_csv.values()))
    # Pull data as pandas Dataframe for further use
    date_data = pd.read_csv("data/date_data.csv").astype(int)

## Dataset #2: Book ID & Book Excerpts

In [None]:
def get_text(book_id):
    # Pulls the text file from the Gutenberg Archive of a given book using its book id
    url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    text = urlopen(url).read()
    return text

def get_text_samples(text, num_samples = 3):
    # Get rid of Gutenberg header and footer (the book body sits between the
    # '*** START ... ***' and '*** END ... ***' marker lines)
    book_text = [x.strip() for x in text.decode("utf-8").split('***')][2]
    # Remove '\r' symbol
    book_text = re.sub(r"[\r]+", "", book_text)
    # Split by paragraph breaks
    book_text = re.split(r"\n{2,}", book_text)
    # Remove paragraphs shorter than about 8 average-length sentences (~50 characters each)
    book_text = [x for x in book_text if len(x) >= (50 * 8)]
    # Randomly sample remaining paragraphs
    paragraphs = random.sample(book_text, min(num_samples, len(book_text)))
    # Replace \n with ' ' and return paragraphs
    return np.array([re.sub(r"\n", " ", p) for p in paragraphs])

In [None]:
def create_excerpt_data(data):
    # Creates a 2d numpy array of the book id and randomly sampled
    # paragraphs within the book
    book_ids = data[:,0].astype(int)
    book_samples = []
    invalid_ids = []

    for i in range(book_ids.shape[0]):
        # For each book try to access the text file
        try:
            text = get_text(book_ids[i])
        except HTTPError as err:
            # If unable to access the text file, display the error code,
            # save the book_id in invalid_ids for logging purposes,
            # and skip to the next book so stale text is never sampled
            print(f"HTTP {err.code} Error: book_id = {book_ids[i]}")
            invalid_ids.append(book_ids[i])
            continue
            
        # Clean and randomly sample text samples
        text_samples = get_text_samples(text)
        # Combine text samples with associated book_id
        ids = np.full(len(text_samples), book_ids[i])
        # Save samples and book id into book_samples
        samples = np.array(list(zip(ids, text_samples)))
        book_samples.extend(samples)
        
        # Show Progress
        print(f"Progress: {i/book_ids.shape[0]}", end='\r')

    return np.array(book_samples), invalid_ids

In [None]:
try:
    # Check if excerpts.csv exists
    excerpt_data = pd.read_csv("data/excerpts.csv")
except FileNotFoundError:
    # If excerpts.csv doesn't exist, create necessary data
    book_samples, invalid_ids = create_excerpt_data(date_data.to_numpy())
    # Save resulting excerpts in csv file
    book_data_csv = {'book_id': book_samples[:,0], 'text': book_samples[:,1]}
    with open('data/excerpts.csv', 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(book_data_csv.keys())
        w.writerows(zip(*book_data_csv.values()))
    # Pull data as Pandas DataFrame for further use
    excerpt_data = pd.read_csv("data/excerpts.csv")

## Dataset #3: Final Dataset

In [None]:
# Merge datasets on book_id (the inner join drops books missing from either dataset),
# then remove rows with null values and duplicate rows
data = excerpt_data.merge(date_data, on='book_id', how='inner').dropna().drop_duplicates().reset_index(drop=True)
# Remove entries where text is 50 words or fewer
data = data[data['text'].apply(lambda x: len(re.findall(r"\w+", x)) > 50)]
# Save final dataset to csv file
data.to_csv("data/data.csv")

# Ethics & Privacy

We plan on gathering our data from [Project Gutenberg](https://www.gutenberg.org/), where the US copyright on all published books has expired. A potential source of bias in the dataset is which books end up being collected when scraping the website. Books are added based on community input, so only certain books are available and we are limited to those. In addition, since we cannot feasibly scrape every book on the website, only the most recently added books will be scraped, leaving out books that were added longer ago. While we cannot control what gets onto the website, we can randomly sample in order to include works that were added further back in time rather than only those that were added recently. All books on Project Gutenberg are [free to use however the user sees fit](https://www.gutenberg.org/policy/permission.html) and therefore should not pose a problem with regard to data privacy.
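
One way the random sampling mentioned above could work is to pick result pages at random instead of always taking the first pages of a query. This is only a sketch under stated assumptions: `sample_pages` is a hypothetical helper, the total number of matching books is assumed to be known in advance, and the page size of 32 is an assumption that should be checked against the API.

```python
import math
import random

def sample_pages(total_books, num_pages=20, page_size=32, seed=None):
    # Hypothetical helper: choose result pages uniformly at random
    # rather than always reading pages 1..num_pages
    last_page = max(1, math.ceil(total_books / page_size))
    rng = random.Random(seed)
    # Never request more pages than exist
    k = min(num_pages, last_page)
    return sorted(rng.sample(range(1, last_page + 1), k))

# Example with a made-up total of 5000 matching books
pages = sample_pages(total_books=5000, num_pages=20, seed=108)
```

Fixing the seed keeps the sampled pages reproducible, so the dataset can be regenerated from the same pages later.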

# Team Expectations 

In our project, we expect a few essential things from all our group members. We expect all group members to contribute equally and to the best of their ability. This project is a team effort, and everyone has something to contribute, whether they realize it or not. We also expect all members to attend weekly meetings at the time we have already agreed on. We expect everyone to stay in close communication via either Discord or the text chat that we have created. Any problems, concerns, or ideas that come up should be communicated promptly through either of these channels. We intend to uphold these expectations not only to create an outstanding final product but also to avoid placing extra stress on other team members. We are all in this together until the end.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/30 | 4:30pm | Brainstorm Topic/Data Question ideas | Discuss background experience; determine data science question; begin background research and work on the project proposal |
| 11/01 | 4:30pm | Finish project proposal | Confirm progress on project proposal, edit proposal as needed, and submit. |
| 11/06 | 4:30pm | Begin looking at Project Gutenberg API. Cross-reference data in Project Gutenberg to ensure dates are correct. If possible, begin importing and cleaning data. | Discuss any issues that may arise from the dataset. Assign group members to specific parts of work; discuss wrangling and other possible analytical approaches. |
| 11/13 | 4:30pm | Import and clean data from API; (Randomly pull excerpts from each book as well as author birth year); Write report for **Checkpoint 1: Data** | Review dataset and discuss any complications that arose from importing and cleaning data; Turn in **Checkpoint 1: Data** |
| 11/20 | 4:30pm | EDA and Analysis (potentially model training) | Review EDA and analysis |
| 11/27 | 4:30pm | Model Training and Analysis; Finish Report for **Checkpoint 2: EDA** | Review Model Training and Analysis; look at any mistakes or obstacles that occurred and discuss ways of correcting them. Turn in **Checkpoint 2: EDA** |
| 12/04 | 4:30pm | Complete Analysis; Begin drafting results, conclusions, and discussion; If possible, work on Final project Presentation/**Final Project Report** | Review, edit, and finalize project report. Record Project Presentation if possible |
| 12/11 | 4:30pm | Finish project presentation and record project presentation video parts (if needed) | Turn in **Final Project Report**/Final Project and Group Surveys |