## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before the deadline. This cell is for your reference only and is not needed in your report.

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint, they will earn these points back.


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

- Amelia Fletcher: Conceptualization, Background research, Writing - original draft, review & editing
- Beren Gao: Conceptualization, Methodology, Experimental Investigation, Analysis, Writing - original draft, review & editing
- Neil Pakinggan: Software, Visualization, Conceptualization, Data curation, Background research
- Monica Sandoval: Conceptualization, Writing - original draft, review & editing
- Julia Zhang: Analysis, Visualization, Conceptualization, Writing - original draft, review & editing

## Research Question

To what extent can bot-generated and human-generated Reddit comments be distinguished using predefined linguistic features (e.g. wording, syntax) and behavioral features (e.g. posting frequency, response latency)?

## Background and Prior Work

Since the rise of AI, online discussion forums (e.g., Reddit, Twitter) have seen an increase in bots that mimic human social interaction, particularly by joining conversations on a wide range of topics in the comment sections of threads. Such mimicry blurs the line between human-generated and bot-generated discussion, and between what is real and what is fake. This raises the question: To what extent can bot-generated and human-generated Reddit comments be distinguished using predefined linguistic features (e.g., wording, syntax) and behavioral features (e.g., posting frequency, response latency)?

Prior research has examined this topic. For example, the world’s largest Turing test study, run by the AI lab AI21 with 1.5 million human participants, asked people to decide whether the user they were chatting with was a human or an AI. Participants correctly guessed that they were interacting with an AI in “60% of conversations,” a rate the researchers described as “not much higher than chance.”<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Other work has found that posts by human users in subreddits contain distinctive features such as grammatical discrepancies, internet jargon, and erroneous capitalization, whereas ChatGPT-4-generated texts contain impeccable grammar, complex syntactic structure, and overused emojis.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) These findings suggest one possible way to distinguish AI from human users: comparing the “lexical richness” and “logical soundness” of LLM-authored posts with those of human-authored ones.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)


1. <a name="cite_note-1"></a> [^](#cite_ref-1) Zhang, M. (9 Jun 2023) In Largest-Ever Turing Test, 1.5 Million Humans Guess Little Better Than Chance. *Artisana*. https://www.artisana.ai/articles/in-largest-ever-turing-test-1-5-million-humans-guess-little-better-than
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Arcenal, E. & Capistrano, L. & Guzman, M. & Forrosuelo, M. & Miranda, J. (Sep 2024) Comparative Analysis of Reddit Posts and ChatGPT-Generated Texts’ Linguistic Features: A Short Report on Artificial Intelligence’s Imitative Capabilities. *International Journal of Multidisciplinary: Applied Business and Education*. https://doi.org/10.11594/ijmaber.05.09.063
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Dönmez, E. & Maurer, M. & Lapesa, G. & Falenska, A. (Nov 2025) AI Argues Differently: Distinct Argumentative and Linguistic Patterns of LLMs in Persuasive Contexts. *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*. https://aclanthology.org/2025.emnlp-main.1755.pdf

## Hypothesis


We hypothesize that __bot-generated Reddit comments are distinguishable from human-generated comments__ as they exhibit systematic differences in *linguistic and behavioral patterns*. We assume that these differences are measurable using predefined quantities, with bots tending to exhibit more repetitive syntax, distinct wording patterns, higher posting frequency, and shorter response latency compared to humans. 

This prediction is based on prior research suggesting that AI-generated text shows systematic linguistic features and temporal behaviors that differ from those of humans, who tend to produce more variable language (including grammatical discrepancies and slang) and to post at irregular intervals.

As a result, we predict a strong correlation between the type of author (bot vs human) and their commenting habits on Reddit.
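
As a rough illustration of what "predefined quantities" could look like, the sketch below computes two linguistic measures (type-token ratio as a stand-in for lexical richness, and the share of repeated bigrams as a stand-in for repetitive wording) and one behavioral measure (mean gap between comment timestamps). The function names, the simple tokenizer, and the choice of measures are assumptions made for illustration; they are not the final feature set for this project.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text):
    """Unique tokens divided by total tokens; higher values mean more varied wording."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def repeated_bigram_share(text):
    """Fraction of adjacent word pairs that occur more than once in the text."""
    tokens = tokenize(text)
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    counts = Counter(bigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(bigrams)


def mean_gap_seconds(timestamps):
    """Average time between consecutive comments, given timestamps in seconds."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else None
```

Under the hypothesis, bot comments would tend toward a higher repeated-bigram share and a smaller mean gap than human comments, although whether these particular measures separate the two groups is exactly what the analysis has to test.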

## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://drive.google.com/uc?export=download&id=1SXcJvoOo5uEGYC9gJBDOOC_fSFXXb4ht', 'filename':'reddit_dead_internet_analysis_2026.csv'},
    { 'url': 'https://drive.google.com/uc?export=download&id=1yKcxExZjCLTvs0_lJ8QoF9Ej9e4Oee1l', 'filename':'balanced_ai_human_prompt.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and whether it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
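
As a minimal sketch of steps 4 and 5 above, the helpers below summarize missingness per column and flag outliers with the common 1.5 × IQR rule. The toy DataFrame and its column names (`comment_text`, `reply_delay_s`) are hypothetical placeholders, not the actual columns of our datasets, and the IQR rule is only one reasonable choice of outlier criterion.

```python
import pandas as pd


def missing_summary(df):
    """Count and fraction of missing values per column, most-missing first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    return summary.sort_values("n_missing", ascending=False)


def flag_iqr_outliers(series, k=1.5):
    """Boolean mask marking values more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data with hypothetical column names, for illustration only
toy = pd.DataFrame({
    "comment_text": ["hello", None, "nice post", "ok", "thanks"],
    "reply_delay_s": [30.0, 45.0, 40.0, 35.0, 5000.0],
})

print(missing_summary(toy))
print(toy[flag_iqr_outliers(toy["reply_delay_s"])])
```

Flagged rows are worth inspecting before dropping them: an extreme reply delay might be a data error, or it might be genuine behavior that is relevant to the bot-versus-human question.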


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
import pandas as pd
import numpy as np

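# load the raw Reddit file that the setup cell downloads into data/00-raw/
reddit_dataset = pd.read_csv('data/00-raw/reddit_dead_internet_analysis_2026.csv')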

### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if the datasets you turn in for the checkpoint aren't sufficient.  The goal here is to demonstrate that you can obtain and wrangle data.  You are not tied down to using only what you turn in right now.
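
One possible way to combine the two datasets, sketched below, is to reduce each one to a shared schema and stack the rows while recording where each row came from. This assumes both datasets can be mapped to a text column and a bot/human label; the column names `text` and `is_bot`, the `source` labels, and the helper `stack_datasets` are hypothetical and would need to be matched to the real files.

```python
import pandas as pd


def stack_datasets(frames):
    """Stack DataFrames that share 'text' and 'is_bot' columns.

    frames: dict mapping a source label to a DataFrame (hypothetical schema).
    Returns one DataFrame with an added 'source' column.
    """
    shared = ["text", "is_bot"]
    parts = []
    for source, df in frames.items():
        missing = [col for col in shared if col not in df.columns]
        if missing:
            raise KeyError(f"{source} is missing columns: {missing}")
        parts.append(df[shared].assign(source=source))
    return pd.concat(parts, ignore_index=True)


# Toy frames standing in for the two cleaned datasets
reddit_toy = pd.DataFrame({"text": ["lol same", "Great point!"], "is_bot": [False, True], "score": [3, 1]})
prompt_toy = pd.DataFrame({"text": ["In conclusion, ..."], "is_bot": [True]})

combined = stack_datasets({"reddit": reddit_toy, "ai_human_prompt": prompt_toy})
print(combined)
```

Keeping the `source` column makes it possible to check later whether a classifier is picking up on differences between the datasets rather than differences between bots and humans.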

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

**COMMUNICATION**
* We will communicate regularly through our Instagram group chat and check in with each other about progress and deadlines
* If confusion arises, we will clear up any misunderstandings by asking each other questions as soon as possible

**TONE**
* Our feedback will be firm yet reasonable and well-meaning, and we will keep in mind that we are all working toward the same goal in this project

**DECISION-MAKING**
* We will take everyone’s opinions into consideration and try to reach a consensus when making decisions.
* This can be similar to a group voting system
* We could also have people add to decisions with their ideas instead of it being set in stone

**TASKS AND EXPECTATIONS**
* We will communicate precisely and transparently about our division of work and the role each member plays, so as not to cause any confusion.
* Each member will fulfill a specific role throughout the project, as listed at the top of this document.

**CIRCUMSTANCES**
* We will respect each member's differing schedules and workloads as students.
* If a member cannot keep up with a set deadline or their part of the project, they should ask for help or notify the group as early as possible (e.g., if they have midterms to focus on at the end of the week)
* We can always have someone help out when needed to lighten the workload 

## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/7  |  5 PM | Gather dataset(s) that will help us answer our research question  | Explain how gathered datasets contribute to the project and answering our research question | 
| 2/11  |  4 PM | Begin collecting background research on our topic | Discuss collected research and begin narrowing down dataset(s) and background research | 
| 2/15  | 5 PM  | Write and edit data descriptions; begin editing data visualization  | Communicate data purposes in order to formulate proper descriptions; discuss data cleanliness and what needs to be completed before submission   |
| 2/18  | 6 PM  | Finalize datasets and descriptions; Submit Data Checkpoint | Discuss finalization of datasets being used and ensure that the written descriptions provide ample information for their purpose    |
| 2/22  | 4 PM  | Begin constructing visualizations of AI patterns; visualize our findings | Discuss types of visualization that will be most effective for our project  |
| 3/1  | 5 PM  | Continue constructing EDA section; finalize written portion of EDA | Explain written descriptions and context of EDA; clean up information |
| 3/4  | 5 PM  | Complete EDA; Submit EDA Checkpoint | Communicate any final changes necessary for EDA portion of project; begin discussion on finalizing project |
| 3/8  | 4 PM  | Write Abstract, Discussion, and Conclusion | Finalize changes on Abstract, Discussion, and Conclusion; discuss repeated areas and analysis further in the context of project overall, ensure everything still makes sense and finalize; discuss video filming and roles for filming  |
| 3/11  | 5 PM  | Film video | Discuss any necessary editing; finalize video and discuss end of project  |
| 3/18  | Before 11:59 PM  | Finalize everything | Turn in Final Project, Video, Team Eval Survey, and Post Course Survey |