Skip to content

eigenfoo/reddit-clusters

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 

Reddit Clusters

Understanding hateful subreddits through text clustering.

Directory Structure

.
├── clustering
│   ├── results
│   │   ├── CringeAnarchy.txt
│   │   ├── The_Donald.txt
│   │   └── TheRedPill.txt
│   ├── run_nmfs.sh
│   └── tfidf_nmf.py
├── data
│   ├── bigquery
│   │   └── 2017
│   │       └── 11-12
│   │           └── download_data.sh
│   └── stoplist.txt
├── LICENSE
├── README.md
└── wordclouds
    ├── blackdisc.jpeg
    ├── images
    │   ├── CringeAnarchy
    │   │   ├── 0.06%.png
    │   │   ├── 0.12%.png
    │   │   ├── ...
    │   │   ├── ...
    │   ├── The_Donald
    │   │   ├── 0.36%.png
    │   │   ├── 0.51%.png
    │   │   ├── ...
    │   │   ├── ...
    │   └── TheRedPill
    │       ├── 0.56%.png
    │       ├── 0.91%.png
    │       ├── ...
    │       └── ...
    ├── make_wordclouds.ipynb
    └── OCR-A-Std-Regular.ttf

The clustering directory contains all code used to vectorize and cluster the subreddits. tfidf_nmf.py is the main program, and run_nmfs.sh is simply a driver script. The results subdirectory contains the log files when running tfidf_nmf.py on /r/TheRedPill, /r/The_Donald and /r/CringeAnarchy.

The data directory contains download_data.sh (which downloads the Reddit data from my Google Cloud Storage) and stoplist.txt (which includes Reddit-specific words such as "moderator", "karma", etc.).

The wordclouds directory contains images (which contains png files of the wordclouds themselves) and make_wordclouds.ipynb (which generates the wordclouds). Note that 1) blackdisc.jpeg and OCR-A-Std-Regular.ttf are just helper files to create the wordclouds, and 2) the png files are named after their cluster importance (e.g. 0.50%.png is a wordcloud whose cluster has an importance of 0.50%).

About

Understanding hateful subreddits through text clustering

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published