# COGS 108 - Project Proposal

## Authors

- Amelia Fletcher: Conceptualization, Background research, Writing - original draft, review & editing
- Beren Gao: Conceptualization, Methodology, Experimental Investigation, Analysis, Writing - original draft, review & editing
- Neil Pakinggan: Software, Visualization, Conceptualization, Data curation, Background research
- Monica Sandoval: Conceptualization, Writing - original draft, review & editing
- Julia Zhang: Analysis, Visualization, Conceptualization, Writing - original draft, review & editing

## Research Question

To what extent can bot-generated and human-generated Reddit comments be distinguished using predefined linguistic features (e.g. wording, syntax) and behavioral features (e.g. posting frequency, response latency)?

## Background and Prior Work

Since the rise of AI, online discussion platforms (e.g., Reddit, Twitter) have seen an increase in bots that mimic human social interaction, particularly by participating in conversations on a wide range of topics in the comment sections of threads. Such mimicry blurs the line between human-generated and bot-generated discussion, and between what is real and what is fake. This raises the question: To what extent can bot-generated and human-generated Reddit comments be distinguished using predefined linguistic features (e.g., wording, syntax) and behavioral features (e.g., posting frequency, response latency)?

Prior research has examined this topic. For example, the world’s largest Turing test study, run by the AI lab AI21 with 1.5 million human participants, asked people to judge whether the user they were chatting with was a human or an AI. Participants correctly guessed that they were interacting with an AI in “60% of conversations,” a statistic that researchers claimed was “not much higher than chance.”<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Other research has noted that human posts in subreddits contained distinct features such as grammatical discrepancies, internet jargon, and erroneous capitalization, whereas ChatGPT-4-generated texts contained impeccable grammar, complex syntactic structure, and overused emojis.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This suggests a possible method for distinguishing AI from human users: measuring differences in “lexical richness” and “logical soundness” between LLM-authored and human-authored posts.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)


1. <a name="cite_note-1"></a> [^](#cite_ref-1) Zhang, M. (9 Jun 2023) In Largest-Ever Turing Test, 1.5 Million Humans Guess Little Better Than Chance. *Artisana*. https://www.artisana.ai/articles/in-largest-ever-turing-test-1-5-million-humans-guess-little-better-than
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Arcenal, E. & Capistrano, L. & Guzman, M. & Forrosuelo, M. & Miranda, J. (Sep 2024) Comparative Analysis of Reddit Posts and ChatGPT-Generated Texts’ Linguistic Features: A Short Report on Artificial Intelligence’s Imitative Capabilities. *International Journal of Multidisciplinary: Applied Business and Education*.
https://doi.org/10.11594/ijmaber.05.09.063
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Dönmez, E. & Maurer, M. & Lapesa, G. & Falenska, A. (Nov 2025) AI Argues Differently: Distinct Argumentative and Linguistic Patterns of LLMs in Persuasive Contexts. *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*.
https://aclanthology.org/2025.emnlp-main.1755.pdf

## Hypothesis


We hypothesize that __bot-generated Reddit comments are distinguishable from human-generated comments__ as they exhibit systematic differences in *linguistic and behavioral patterns*. We assume that these differences are measurable using predefined quantities, with bots tending to exhibit more repetitive syntax, distinct wording patterns, higher posting frequency, and shorter response latency compared to humans. 

This prediction is based on prior research suggesting that artificial intelligence (AI) often displays systematic linguistic features and temporal behaviors that differ from those of humans, who tend to produce more variable language (including grammatical discrepancies and slang) and to post at irregular intervals.

As a result, we predict a strong correlation between the type of author (bot vs human) and their commenting habits on Reddit.
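
One way such an association could be quantified is a point-biserial correlation between the binary author label and a single numeric feature. The sketch below is only an illustration under stated assumptions, not our chosen method: the function name `point_biserial`, the label coding (1 = bot, 0 = human), and the toy reply delays are all hypothetical.

```python
import math

def point_biserial(labels, values):
    """Pearson correlation between a 0/1 label and a numeric feature."""
    n = len(labels)
    if n != len(values) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_l = sum(labels) / n
    mean_v = sum(values) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((l - mean_l) * (v - mean_v) for l, v in zip(labels, values))
    var_l = sum((l - mean_l) ** 2 for l in labels)
    var_v = sum((v - mean_v) ** 2 for v in values)
    if var_l == 0 or var_v == 0:
        raise ValueError("correlation is undefined when either input is constant")
    return cov / math.sqrt(var_l * var_v)

# Hypothetical toy data: 1 = bot, 0 = human; values are reply delays in seconds.
labels = [1, 1, 1, 0, 0, 0]
delays = [2.0, 3.0, 4.0, 40.0, 55.0, 70.0]
r = point_biserial(labels, delays)
print(f"point-biserial r = {r:.3f}")
```

With this coding, a negative value would mean that comments labeled as bot-generated tend to have shorter reply delays, which is the direction the hypothesis predicts.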

## Data

### Ideal Datasets
To effectively answer our research question, an ideal dataset would consist mainly of three types of variables.

1. **Contextual features**

    This includes subreddit, user id and user karma, account age, and comments labeled as either bot-generated or human-generated. The subreddit's context could further include categorical grouping of subreddit types (political, hobbyist, sports, questioning, etc.)

2.  **Linguistic features**

    This includes the comment text, common word counts, average word length, various syntax complexity measures (word complexity, sentence construction, etc.), emotional sentiment, categorical grouping of the text, and emoji counts.

3. **Behavioral features**

    This includes timestamps for each comment/reply, comment counts per user, and common reply chain lengths per user.
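
As a rough illustration of how a few of the linguistic and behavioral features listed above might be derived from raw comments, here is a minimal sketch using only the Python standard library. The function names, the emoji character ranges, and the use of type-token ratio as a stand-in for lexical richness are our assumptions for demonstration, not a fixed part of the analysis plan.

```python
import re
from collections import Counter
from datetime import datetime, timezone

WORD_RE = re.compile(r"[A-Za-z']+")
# Rough emoji ranges (assumption): common pictographs only, not every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def linguistic_features(text):
    """Compute simple per-comment linguistic features."""
    words = WORD_RE.findall(text.lower())
    n = len(words)
    return {
        "word_count": n,
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        # Unique words divided by total words (a crude lexical-richness proxy).
        "type_token_ratio": len(set(words)) / n if n else 0.0,
        "emoji_count": len(EMOJI_RE.findall(text)),
    }

def reply_latency_seconds(parent_ts, reply_ts):
    """Seconds elapsed between a parent comment and its reply."""
    return (reply_ts - parent_ts).total_seconds()

def comments_per_user(user_ids):
    """Count how many comments each user id contributed."""
    return Counter(user_ids)

# Hypothetical example values.
feats = linguistic_features("Great point! Great post 😀")
parent = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
reply = datetime(2024, 1, 1, 12, 0, 45, tzinfo=timezone.utc)
latency = reply_latency_seconds(parent, reply)
counts = comments_per_user(["u1", "u2", "u1"])
```

Syntax complexity and sentiment would need additional tooling and are left out of this sketch.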

Ideally, the dataset should contain hundreds of thousands to millions of individual comments on Reddit to encapsulate the diversity and variability of comments across different contexts.

Each row should represent one comment with all associated features. The data should be stored in a structured format such as JSON or CSV with clearly labeled columns.
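
To make the one-row-per-comment layout concrete, the sketch below defines a record type and writes it out as CSV. Every field name here is a hypothetical placeholder chosen for illustration; an actual dataset may use different columns.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class CommentRow:
    """One comment with its contextual, linguistic, and behavioral features."""
    comment_id: str
    user_id: str
    subreddit: str
    created_utc: int            # Unix timestamp of the comment
    text: str
    avg_word_length: float
    reply_delay_seconds: float
    is_bot: bool                # label: bot-generated or human-generated

def rows_to_csv(rows):
    """Serialize CommentRow objects to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(CommentRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()

# Hypothetical example row.
example = CommentRow("c1", "u1", "r/example", 1704110400, "Nice post", 4.0, 12.5, False)
csv_text = rows_to_csv([example])
print(csv_text)
```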

Data would be collected through Reddit's official data API that provides essential features including comment text, user ids/stats, and timestamps. These data could also be collected through online archives that allow access to open datasets, or by manually/automatically scraping random posts on Reddit.

   
### Real Datasets
#### Dataset 1: The "Dead Internet" Theory: Reddit Bot vs. Human
The dataset is located on Kaggle at https://www.kaggle.com/datasets/nudratabbas/the-dead-internet-theory-reddit-bot-vs-human
The dataset can be previewed publicly on Kaggle, but downloading it requires a registered Kaggle account. 

This dataset includes relevant linguistic and behavioral features, including average word length, sentiment score, reply delays in seconds, and the content of URLs. It also has sufficient user and comment information, for example, user ids, account age, and user karma. Most importantly, the dataset indicates whether the comment is generated by a bot or a human. 
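
A first look at a labeled dataset like this one might compare class sizes and a per-class feature mean. The sketch below runs on a small in-memory sample rather than the Kaggle file. The column names `user_id`, `reply_delay_seconds`, and `is_bot` and all values are assumptions for illustration, since we have not confirmed the dataset's actual schema here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the real file; columns are assumed.
SAMPLE_CSV = """user_id,reply_delay_seconds,is_bot
u1,3.5,1
u2,120.0,0
u3,2.5,1
u4,300.0,0
u5,60.0,0
"""

def summarize_by_label(csv_file, label_col="is_bot", value_col="reply_delay_seconds"):
    """Return the row count and mean of value_col for each label value."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        label = row[label_col]
        counts[label] += 1
        totals[label] += float(row[value_col])
    return {
        label: {"n": counts[label], "mean": totals[label] / counts[label]}
        for label in counts
    }

summary = summarize_by_label(io.StringIO(SAMPLE_CSV))
for label, stats in summary.items():
    print(label, stats)
```

Unequal class counts in a summary like this would be an early signal of class imbalance, which matters for the dataset-bias item in the ethics checklist.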

## Ethics 


[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Reddit's TOS states that Reddit has the "right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit." While none of our datasets themselves seem to come from Reddit-sponsored or partnered sources, the fact remains that Reddit is a public forum, and most content posted there is public.

 - [ ] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> As we did not collect the data ourselves, we cannot confirm or deny this. The groups ('subreddits') these data were collected from may be biased in which comments are allowed or visible, based on group culture or rules.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> The data is completely anonymous, with no personally identifiable information present.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> A group's population commonly does not match that of Reddit overall. In that case, we can test within specific groups and compare results relative to each group's population rather than the population of all data.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> As the data has already been collected and any identifiable information removed, this is impossible.
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> We currently have no plans to do this, though we have not ruled it out. One team member would like to ask a Reddit engineer or moderator about their experience, but it is unclear whether we can reach one.
 - [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We have not currently taken a proper, detailed look at any potential factors of bias in the data.
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> We have not yet produced any visualizations, summary statistics, or reports.
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> Not yet, but we intend for it to be.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

**COMMUNICATION**
* We can expect to communicate regularly through our Instagram group chat and check in with each other regarding progress and deadlines
* If confusion arises, we clear any misunderstandings by asking each other questions as soon as possible

**TONE**
* Our tone should be firm yet reasonable, with well-meaning criticism, and we should recognize that we are all here for the same goal in this project

**DECISION-MAKING**
* We will take everyone’s opinions into consideration and try to reach a consensus when making decisions.
* This can be similar to a group voting system
* We could also have people add to decisions with their ideas instead of it being set in stone

**TASKS AND EXPECTATIONS**
* We will communicate clearly and explicitly with each other about our division of work and each member's role, so as not to cause any confusion.
* Each member will have a specific role to fulfill throughout the project, as listed at the top of this project proposal.

**CIRCUMSTANCES**
* We will respect each member's differing schedules and workloads as students.
* If a member cannot keep up with set deadlines or their part in the project, they should ask for help or notify the team as early as possible (e.g., if they have midterms to focus on at the end of the week)
* We can always have someone help out when needed to lighten the workload 

## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/7  |  5 PM | Gather dataset(s) that will help us answer our research question  | Explain how gathered datasets contribute to the project and answering our research question | 
| 2/11  |  4 PM | Begin collecting background research on our topic | Discuss collected research and begin narrowing down dataset(s) and background research | 
| 2/15  | 5 PM  | Write and edit data descriptions; begin editing data visualization  | Communicate data purposes in order to formulate proper descriptions; discuss data cleanliness and what needs to be completed before submission   |
| 2/18  | 6 PM  | Finalize datasets and descriptions; Submit Data Checkpoint | Discuss finalization of datasets being used and ensure that the written descriptions provide ample information for their purpose    |
| 2/22  | 4 PM  | Begin constructing visualizations of AI patterns; visualize our findings | Discuss types of visualization that will be most effective for our project  |
| 3/1  | 5 PM  | Continue constructing EDA section; finalize written portion of EDA | Explain written descriptions and context of EDA; clean up information |
| 3/4  | 5 PM  | Complete EDA; Submit EDA Checkpoint | Communicate any final changes necessary for EDA portion of project; begin discussion on finalizing project |
| 3/8  | 4 PM  | Write Abstract, Discussion, and Conclusion | Finalize changes on Abstract, Discussion, and Conclusion; discuss repeated areas and analysis further in the context of project overall, ensure everything still makes sense and finalize; discuss video filming and roles for filming  |
| 3/11  | 5 PM  | Film video | Discuss any necessary editing; finalize video and discuss end of project  |
| 3/18  | Before 11:59 PM  | Finalize everything | Turn in Final Project, Video, Team Eval Survey, and Post Course Survey |