Reddit Clusters

Understanding hateful subreddits through text clustering.

Directory Structure

.
├── clustering
│   ├── results
│   │   ├── CringeAnarchy.txt
│   │   ├── The_Donald.txt
│   │   └── TheRedPill.txt
│   ├── run_nmfs.sh
│   └── tfidf_nmf.py
├── data
│   ├── bigquery
│   │   └── 2017
│   │       └── 11-12
│   │           └── download_data.sh
│   └── stoplist.txt
├── LICENSE
├── README.md
└── wordclouds
    ├── blackdisc.jpeg
    ├── images
    │   ├── CringeAnarchy
    │   │   ├── 0.06%.png
    │   │   ├── 0.12%.png
    │   │   ├── ...
│   │   └── ...
    │   ├── The_Donald
    │   │   ├── 0.36%.png
    │   │   ├── 0.51%.png
    │   │   ├── ...
│   │   └── ...
    │   └── TheRedPill
    │       ├── 0.56%.png
    │       ├── 0.91%.png
    │       ├── ...
    │       └── ...
    ├── make_wordclouds.ipynb
    └── OCR-A-Std-Regular.ttf

The clustering directory contains all code used to vectorize and cluster the subreddits. tfidf_nmf.py is the main program, and run_nmfs.sh is a driver script. The results subdirectory contains the log files produced by running tfidf_nmf.py on /r/TheRedPill, /r/The_Donald and /r/CringeAnarchy.
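As the name tfidf_nmf.py suggests, the pipeline is TF-IDF vectorization followed by non-negative matrix factorization (NMF). The sketch below is a minimal, dependency-free illustration of that general technique and is not the code in tfidf_nmf.py: the function names, the whitespace tokenizer, the smoothed IDF formula, the multiplicative-update solver and the toy documents are all assumptions made for illustration. In practice a library such as scikit-learn would typically be used instead.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs, stopwords=frozenset()):
    """Return (rows, vocab): one TF-IDF row per document (hypothetical helper)."""
    tokenized = [[w for w in d.lower().split() if w not in stopwords] for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[w] * idf[w] for w in vocab])
    return rows, vocab


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor v (docs x words) into w (docs x k) and h (k x words), both non-negative.

    Uses multiplicative updates for the squared-error objective.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def top_terms(h, vocab, n=3):
    """For each cluster (row of h), list the n highest-weighted words."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]


# Toy documents, invented for illustration only.
docs = [
    "cats purr and cats sleep",
    "kittens purr and sleep",
    "dogs bark and dogs run",
    "puppies bark and run",
]
v, vocab = tfidf_matrix(docs, stopwords={"and"})
w, h = nmf(v, k=2)
for cluster, terms in enumerate(top_terms(h, vocab)):
    print(cluster, terms)
```

Each row of `h` can be read as one cluster's word weights, and each row of `w` as one document's loading on the clusters.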

The data directory contains download_data.sh (under bigquery/2017/11-12, which downloads the Reddit data from my Google Cloud Storage) and stoplist.txt (which lists Reddit-specific words such as "moderator" and "karma").
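A stoplist like this is usually applied by dropping listed words before vectorization. The following is a small sketch of that idea, assuming stoplist.txt holds one word per line (the actual file format is not documented here); `parse_stoplist` and `remove_stopwords` are hypothetical helper names.

```python
def parse_stoplist(text):
    """Parse stoplist contents, assuming one word per line (an assumption)."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stoplist):
    """Drop any token that appears in the stoplist (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stoplist]


# Example using the two words mentioned above.
stoplist = parse_stoplist("moderator\nkarma\n")
print(remove_stopwords(["the", "moderator", "removed", "karma"], stoplist))
```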

The wordclouds directory contains the images subdirectory (png files of the wordclouds themselves, grouped by subreddit) and make_wordclouds.ipynb (which generates the wordclouds). Note that 1) blackdisc.jpeg and OCR-A-Std-Regular.ttf are helper files used to create the wordclouds, and 2) each png file is named after its cluster's importance (e.g. 0.50%.png is the wordcloud of a cluster with an importance of 0.50%).
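Because the importance is encoded in the file name, it can be recovered without opening the image. A minimal sketch, assuming every file follows the `<number>%.png` pattern shown in the tree; `importance_from_filename` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def importance_from_filename(name):
    """Return the importance (in percent) encoded in a name like '0.50%.png'."""
    stem = Path(name).stem  # '0.50%.png' -> '0.50%'
    return float(stem.rstrip("%"))


# Rank a few of the file names listed above from most to least important.
names = ["0.06%.png", "0.91%.png", "0.12%.png"]
ranked = sorted(names, key=importance_from_filename, reverse=True)
print(ranked)
```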

eigenfoo/reddit-clusters

Folders and files

Latest commit

History

Repository files navigation

Reddit Clusters

Directory Structure

About

Topics

Resources

License

Stars

Watchers

Forks

Languages