# Project Proposal

### Basic Info
_____
 
**Title**: Fandom Trends and Peculiarities From AO3 Data
 

**Names**: Rebekah Washburn, Noble Ledbetter, Henry Brunisholz
 

**Emails**: Rebekah - u1310114@utah.edu, Noble - u0967666@utah.edu, Henry - u1276675@utah.edu
 

**UIDs**: Rebekah - u1310114, Noble - u0967666, Henry - u1276675



### Background and Motivation
___

The world of fanfiction has been thriving for years. Many popular authors started in fanfiction, and famous novels like *Fifty Shades of Grey* by E. L. James and *After* by Anna Todd started as fan works. As the world of fanfiction and fandom has become more mainstream, sites like [Archive of Our Own (AO3)](https://archiveofourown.org) have come to host millions of works for people to enjoy. Both Henry and Rebekah are interested in fanfiction and the data surrounding these works, and are excited to analyze AO3 data specifically. Noble, an aspiring author, is interested in seeing trends in the fanfiction universe. We hope to create something meaningful, like Toastystats' fandom statistics, found [here](https://archiveofourown.org/users/destinationtoast/pseuds/toastystats/works?fandom_id=87791).

### Project Objectives
___
Project objectives can be broken into two main groups:
- Questions about the data: How does age rating relate to a fic's popularity? How closely connected are varying definitions of popularity (comment count, hit count, kudos count)? How does the length of a work relate to its popularity? Are there mismatches in "supply and demand", where a certain kind of fic is very popular among fans but not produced very frequently by writers? What about the other way around, where writers write a lot of not-so-popular stories? How complicated/readable is the average fanfiction? How does that vary with measurements of popularity?

- Data analysis skill-development: How do we determine whether a question concerning data is worth studying? How do we handle large data sets? How do we create a smaller, random portion of the data to use when testing our code? How do we clean and organize data "from the wild", so to speak, as opposed to a class example? What is the best way to display our data, through charts or other tools, to convey the significance of our findings, or the lack of findings?

### Data Description and Acquisition
___
The data we are analyzing was collected in 2020 by Reddit user theCodeCat, who scraped non-user-restricted fan works from Archive of Our Own (AO3). The data is available for download [here](https://www.reddit.com/r/datasets/comments/i254cw/archiveofourown_dataset/), and is stored in a SQLite database. None of us has worked with a SQLite database before, so we will need to explore what is necessary to get it ready for Python.
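
As a first exploratory step, the layout of an unfamiliar SQLite file can be inspected with Python's built-in `sqlite3` module. The sketch below is only illustrative: it runs against a small in-memory stand-in, and the table and column names (`works`, `tags`, `word_count`, and so on) are hypothetical, since we have not yet seen the real schema.

```python
import sqlite3


def describe_database(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for every table."""
    table_names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in table_names:
        # PRAGMA table_info yields one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


# In-memory stand-in with hypothetical tables; the real file would instead be
# opened with sqlite3.connect(<path to the downloaded database>).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT, word_count INTEGER)"
)
conn.execute("CREATE TABLE tags (work_id INTEGER, tag TEXT)")

schema = describe_database(conn)
for table, columns in schema.items():
    print(table, columns)
```

Listing the tables and columns this way should tell us which fields exist before we commit to any particular loading strategy.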

The dataset is 77GB when compressed and 502GB uncompressed, and contains data from approximately 6 million different works. The information stored for each work includes its:
- ID
- Age Rating
- Completion Status
- Title
- Description
- Current number of chapters
- Planned number of chapters
- Language
- Word count
- Hit count
- Comment count (not, however, the comments themselves)
- Bookmark count
- Date published
- Authors
- Users/authors the work is dedicated to
- If relevant, what series the work is a part of 
- Tags (content warnings, fandoms, relationships, characters, relationship types, generic)
- Chapter text

While we will have access to all of this information, there are certain things that we will likely ignore, both due to a lack of relevance, and for ethical considerations. This includes authors' names, and the people stories are dedicated to. 

### Ethical Considerations
___
Ethical considerations for the use and analysis of this data are fortunately limited, but not non-existent.

First, there is the fact that many fanfic writers do not want their fanfics used for data analysis (or machine learning!) projects, as emphasized by the general uproar on the internet following the revelation that LLMs like ChatGPT were trained on data that included fanfiction published to the internet. Using the dataset collected by theCodeCat mostly avoids this ethical quandary, because theCodeCat was the one who collected the unrestricted data and distributed it to the world. It will not change any fanfic writer's experience for us to privately analyze that data. We are also not using the data for machine learning, so no writer's work will be used to create "new" works.

Second, there could be personal, upsetting, or sensitive information contained in this data set, such as people's real names or addresses, personal and private experiences, or content not suitable for professional environments, like graphic sex scenes. This risk is magnified because the dataset contains the actual chapter text of each fanfiction. However, any analysis we produce that deals with a work's text will be completed using algorithms that assess the reading level of the text or the balance of words with positive and negative connotations within it. We will not be reading the chapter texts ourselves, so we should have minimal contact with sensitive content.

Third, there is a popular view of fanfiction as derivative, illegal, and pornographic writing of low quality. If our data analysis finds that, for example, fanfic tends to have shorter words or sentences than published books, that could contribute to this public stigma. If the most popular works tend to have more adult subject material, people might protest against the existence of sites like AO3 because of the content available. However, this ethical risk is reduced because our project relies on data that is already publicly available, meaning any conclusions we draw from it could have been drawn already. Additionally, it is debatable whether this is an ethical risk at all. Something in our data may or may not be interpreted in a negative light by fandom's detractors, but, as the saying goes, haters gonna hate.

### Data Cleaning and Processing
___
We will certainly have to do substantial data extraction and cleanup due to the size of the data we have. Thankfully, similar tags in our data (such as words, kudos, and hits) will allow us to simplify some of this extraction.

We expect to pull many quantities from our data. First, we plan to create correlation matrices between different notions of popularity, like hits, kudos, and comments, to see how closely they are connected - all those quantities have to be gathered. Second, we want to see how well notions of popularity correlate with other metadata, like the length of the work, its number of chapters, its rating, and whether it is completed - all of these quantities must also be collected. Further, we want to do a complexity/readability analysis, where the average word length and sentence length are collated for analysis and potential comparison with the other quantities collected. We also want to perform a rudimentary "sentiment analysis" in which individual words are classified as emotionally positive or negative; these counts will be collected and compared against previously collected quantities like length or popularity.
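
To make the rudimentary sentiment analysis concrete, here is a minimal sketch of the word-bucket idea. The two word sets are a tiny hypothetical lexicon invented for illustration, and the `valence_score` formula is one possible choice rather than a settled method; a real analysis would need a much larger, vetted word list.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"love", "happy", "joy", "smile", "hope"}
NEGATIVE = {"hate", "sad", "fear", "cry", "pain"}


def sentiment_counts(text):
    """Count how many words fall into the positive, negative, and neutral buckets."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        if word in POSITIVE:
            counts["positive"] += 1
        elif word in NEGATIVE:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts


def valence_score(text):
    """Balance of positive vs. negative words, from -1.0 (all negative) to 1.0 (all positive)."""
    counts = sentiment_counts(text)
    scored = counts["positive"] + counts["negative"]
    if scored == 0:
        return 0.0  # no emotionally loaded words found
    return (counts["positive"] - counts["negative"]) / scored


sample = "She felt joy and hope, but the pain made her cry and cry."
print(sentiment_counts(sample))
print(valence_score(sample))
```

Because only counts and a single score leave the function, this kind of analysis can run over chapter text without anyone on the team reading it.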

Ideally, we will process the data as we have in class, using pandas. Because the dataset is stored in an SQLite database, this may not be directly possible; more research into how to handle SQLite databases is needed.
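
One route worth testing is `pandas.read_sql_query`, which accepts a `sqlite3` connection and can return results in chunks, so the full table never has to sit in memory at once. The sketch below is a hedged illustration against an in-memory stand-in database; the `works` table and its columns are hypothetical names, not the real schema.

```python
import sqlite3

import pandas as pd

# Stand-in database with a hypothetical `works` table (the real schema may differ).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE works ("
    "id INTEGER PRIMARY KEY, word_count INTEGER, hits INTEGER, kudos INTEGER)"
)
conn.executemany(
    "INSERT INTO works (word_count, hits, kudos) VALUES (?, ?, ?)",
    [(1000 * i, 50 * i, 5 * i) for i in range(1, 11)],
)
conn.commit()

# Chunked read: with chunksize set, read_sql_query yields DataFrames one at a time.
total_rows = 0
total_words = 0
for chunk in pd.read_sql_query(
    "SELECT id, word_count FROM works", conn, chunksize=4
):
    total_rows += len(chunk)
    total_words += int(chunk["word_count"].sum())

# Small random sample for testing code before running it on everything.
# ORDER BY RANDOM() visits every row, so on the full data it may be slow.
sample = pd.read_sql_query(
    "SELECT id, word_count, hits, kudos FROM works ORDER BY RANDOM() LIMIT ?",
    conn,
    params=(3,),
)

print(total_rows, total_words)
print(sample)
```

Selecting only the metadata columns, and leaving the chapter text out of the query, would presumably keep the working set far smaller than the full uncompressed database.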

We are also interested in determining whether there is a most popular type of fanfiction. To look at this, we will have to sort through custom tags ("freeforms" in the HTML). These tags are created for a work by its author, and as such some works with similar tones or topics may have similar, but not identical, tags. However, since most of our focus is on standardized data (word count, bookmarks, kudos), our project will not center on sorting through tags and determining whether they should be grouped with another unique tag. We have not yet been able to look through the data, but once we see how it is organized, we may be able to count the uses of common tags, such as the genre of a work.

### Exploratory Analysis
___
We will primarily use scatterplots and bar charts to identify trends in our data, like how kudos relate to the completeness of a fanfic or whether longer fanfics tend to carry higher age ratings. We will also look at correlation heatmaps to see whether different measures of popularity - kudos (likes), bookmarks, hits, and comments - are correlated with one another. If they are not, this could mean that certain fandoms show appreciation or interest for works in different ways. Using the Altair or seaborn libraries, we can also encode a third variable (for example, through color or size) to visualize other trends within our data.
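
The matrix behind such a heatmap can be computed directly in pandas. The sketch below uses made-up numbers and assumes Spearman (rank) correlation, which is our choice for illustration because count data like hits tends to be heavily skewed; Pearson correlation is the pandas default and could be swapped in.

```python
import pandas as pd

# Made-up popularity counts for six hypothetical works.
works = pd.DataFrame(
    {
        "hits": [10, 200, 35, 4000, 90, 1500],
        "kudos": [1, 30, 4, 350, 12, 100],
        "comments": [0, 5, 2, 40, 1, 60],
        "bookmarks": [0, 8, 1, 90, 3, 20],
    }
)

# Pairwise rank correlations between every pair of popularity measures.
popularity_corr = works.corr(method="spearman")
print(popularity_corr.round(2))
```

The resulting square DataFrame could then be passed to a plotting function such as seaborn's `heatmap` to produce the visual.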

### Analysis Methodology
___
The specific questions we want to answer are as follows:
- How closely connected are the four candidate notions of popularity (hits, kudos, comments, bookmarks)? Is there a sensible way to combine the four metrics to get a more holistic view of a fic's popularity?
- How do various quantitative measurements, like length, chapter count, completion status, rating, complexity/readability, and emotional valence, correlate with a fic's popularity?
- Do any of the answers to the previous questions change based on the fandom being considered? 
- How do large fandoms with many works in the dataset compare to smaller fandoms with fewer works? Are there differences in the mean popularity, length, rating, completion status, chapter counts, readability, or emotional valence?
- Based on the above, if one were trying to create the most statistically popular fanfiction, what would it look like?

The techniques we plan to use are, tentatively, as follows:
- Ordinary Least Squares Regression, to estimate the strength of the relationship between two of our collected quantities while controlling for potential confounding variables.
- Tests of statistical significance, like the t-test and the z-test, to find if mean differences between quantities are statistically significant or not.
- Sentiment analysis, based on a simple categorization of words into positive, negative, and neutral buckets.
- Readability analysis, inspired by popular equations for finding the readability of texts like the Flesch-Kincaid scale, but without counting syllables (unless we find a way for the computer to count syllables).
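
As one illustration of a syllable-free readability measure, the sketch below computes the Automated Readability Index (ARI), which uses only characters per word and words per sentence. ARI is a stand-in we picked for this example, not a method the project has committed to, and the sentence splitting here is deliberately naive (abbreviations and ellipses would be miscounted).

```python
import re


def text_stats(text):
    """Return (character_count, word_count, sentence_count) using simple regex rules."""
    # Naive sentence split on runs of ., !, or ?; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Count only letters and digits, so apostrophes do not inflate word length.
    characters = sum(ch.isalnum() for word in words for ch in word)
    return characters, len(words), len(sentences)


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    characters, words, sentences = text_stats(text)
    if words == 0 or sentences == 0:
        return None  # not enough text to score
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


sample = "The cat sat. The dog ran away!"
print(text_stats(sample))
print(automated_readability_index(sample))
```

Higher scores indicate longer words and sentences; very simple text like the sample can score below zero, so the value is best read comparatively across works rather than as a literal grade level.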

### Project Schedule
___

The project is due April 19th, while the first milestone is due April 3rd. Our goal is to have a rudimentary version of the project done by the third. This means that the cleaning, data analysis, and visualization components will hopefully be at least started by the third, especially the cleaning portion.

- Weeks 1-2 (March 17th - March 30th): Data cleaning and sorting, break the large data file into more easily used chunks.
- Week 3 (March 31st - April 6th): Complete some analysis tasks, perhaps simple correlations between easily found quantities like notions of popularity, fic length, rating, and completion status. Turn in the first milestone on the third.
- Week 4 (April 7th - April 13th): Complete more analysis tasks, perhaps the sentiment analysis or readability analysis.
- Week 5 (April 14th - April 19th): Finish analysis tasks, begin on data visualization. Complete visualization and presentation components of project.