
Reddit Clusters

Understanding hateful subreddits through text clustering.

Directory Structure

.
├── clustering
│   ├── results
│   │   ├── CringeAnarchy.txt
│   │   ├── The_Donald.txt
│   │   └── TheRedPill.txt
│   ├── run_nmfs.sh
│   └── tfidf_nmf.py
├── data
│   ├── bigquery
│   │   └── 2017
│   │       └── 11-12
│   │           └── download_data.sh
│   └── stoplist.txt
├── LICENSE
├── README.md
└── wordclouds
    ├── blackdisc.jpeg
    ├── images
    │   ├── CringeAnarchy
    │   │   ├── 0.06%.png
    │   │   ├── 0.12%.png
    │   │   ├── ...
    │   │   ├── ...
    │   ├── The_Donald
    │   │   ├── 0.36%.png
    │   │   ├── 0.51%.png
    │   │   ├── ...
    │   │   ├── ...
    │   └── TheRedPill
    │       ├── 0.56%.png
    │       ├── 0.91%.png
    │       ├── ...
    │       └── ...
    ├── make_wordclouds.ipynb
    └── OCR-A-Std-Regular.ttf

The clustering directory contains all the code used to vectorize and cluster the subreddits. tfidf_nmf.py is the main program, and run_nmfs.sh is a driver script that invokes it. The results subdirectory contains the log files produced by running tfidf_nmf.py on /r/TheRedPill, /r/The_Donald and /r/CringeAnarchy.
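
For readers unfamiliar with the technique the file name tfidf_nmf.py refers to, here is a minimal, standard-library-only sketch of TF-IDF vectorization followed by non-negative matrix factorization (NMF). It is not the code in this repository. The function names, the whitespace tokenizer, the smoothed IDF formula, the multiplicative-update solver and the toy documents are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs, stopwords=frozenset()):
    """Return (vocab, rows), where rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    # Assumption: lowercase whitespace tokenization; real text needs a proper tokenizer.
    tokenized = [[w for w in d.lower().split() if w not in stopwords] for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    n_docs = len(docs)
    rows = []
    for toks in tokenized:
        term_freq = Counter(toks)
        # Smoothed IDF (one common variant); it is always positive.
        rows.append([
            term_freq[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
            for w in vocab
        ])
    return vocab, rows


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Approximate v (docs x terms) as w (docs x k) times h (k x terms), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Multiplicative update: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # Multiplicative update: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


# Toy documents standing in for subreddit comments (hypothetical data).
docs = ["cats chase mice", "mice fear cats", "stocks rise today", "stocks fall today"]
vocab, v = tfidf_matrix(docs)
w, h = nmf(v, k=2)

# Show the three highest-weighted terms of each cluster (row of h).
for topic, weights in enumerate(h):
    top = sorted(zip(weights, vocab), reverse=True)[:3]
    print(topic, [term for _, term in top])
```

Each row of `h` can be read as one cluster of terms, and each row of `w` gives a document's weight on every cluster. On real data a numerical library would be far faster than these pure-Python loops.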

The data directory contains download_data.sh (which downloads the Reddit data from my Google Cloud Storage) and stoplist.txt (a stop word list that includes Reddit-specific words such as "moderator" and "karma").
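
A stop list like this is typically applied by dropping its words before vectorizing. The sketch below shows one way to do that, assuming (this is not confirmed by the repository) that the file holds one word per line. `load_stoplist` and `remove_stopwords` are hypothetical helper names, and an in-memory string stands in for the real file.

```python
import io


def load_stoplist(lines):
    """Build a set of stop words from an iterable of lines (assumed format: one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(text, stopwords):
    """Drop stop words from whitespace-tokenized, lowercased text."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


# Stand-in for open("data/stoplist.txt"); only the two words quoted in this README are used.
fake_file = io.StringIO("moderator\nkarma\n")
stopwords = load_stoplist(fake_file)

tokens = remove_stopwords("The moderator removed my post for karma farming", stopwords)
print(tokens)
```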

The wordclouds directory contains the images subdirectory (which holds the png files of the wordclouds themselves) and make_wordclouds.ipynb (the notebook that generates the wordclouds). Note that 1) blackdisc.jpeg and OCR-A-Std-Regular.ttf are just helper files used to create the wordclouds, and 2) each png file is named after its cluster's importance (e.g. 0.50%.png is the wordcloud for a cluster with an importance of 0.50%).
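
Because the importance is encoded in the file name, the images can be ranked without opening them. The following is a small sketch of that idea; `importance_from_filename` is a hypothetical helper, not something defined in this repository, and the list of names is typed in by hand rather than read from disk.

```python
from pathlib import Path


def importance_from_filename(path):
    """Parse a name like '0.50%.png' into the float 0.5 (importance in percent)."""
    stem = Path(path).stem  # '0.50%.png' -> '0.50%'
    if not stem.endswith("%"):
        raise ValueError(f"expected a name like '0.50%.png', got {path!r}")
    return float(stem[:-1])


# File names following the naming scheme described above.
names = ["0.12%.png", "0.06%.png", "0.50%.png"]

# Most important cluster first.
ranked = sorted(names, key=importance_from_filename, reverse=True)
print(ranked)
```

To rank the real files, the hand-written list could be replaced with something like `Path("wordclouds/images/TheRedPill").glob("*.png")`.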