Project to scrape the EA Forum and Slate Star Codex. An article describing this project is available here
- Overall statistics - look at the most commented posts etc.
- Wordclouds
- Fine-tune GPT2 and generate text
- Spiders used for scraping are `src/scrapy/ea_forum/ea_forum/spiders/forum_scraper.py` and `src/scrapy/ssc/ssc/spiders/ssc_scraper.py`.
- Scraped and cleaned data are in `data/ea_forum/cleaned_data_eaforum.csv` and `data/ssc/cleaned_data_ssc.csv`.
- Code used for cleaning data, exploratory_data_analysis and wordcloud generation is under `/src/eda`.
- Code used for GPT2 training is under `/src/gpt2/`.
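
As a rough illustration of the "most commented posts" statistic, the sketch below ranks rows of a CSV by comment count using only the standard library. It is not the code under `/src/eda`. The column names `title` and `n_comments` and the helper `top_commented` are assumptions made for this example, so check the header of the cleaned CSV files before reusing it.

```python
import csv
import io


def top_commented(rows, title_col="title", comments_col="n_comments", k=5):
    """Return the k (title, comment_count) pairs with the most comments.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The column names are hypothetical; pass the real ones from the CSV header.
    """
    counted = []
    for row in rows:
        raw = (row.get(comments_col) or "").strip()
        if not raw:
            continue  # skip rows with a missing comment count
        try:
            count = int(float(raw))  # tolerate values such as "12.0"
        except ValueError:
            continue  # skip rows whose count is not numeric
        counted.append((row.get(title_col, ""), count))
    # Sort by comment count, largest first, and keep the top k.
    counted.sort(key=lambda pair: pair[1], reverse=True)
    return counted[:k]


# Made-up rows standing in for the real file, so the sketch runs on its own.
sample = io.StringIO(
    "title,n_comments\n"
    "Post A,12\n"
    "Post B,40\n"
    "Post C,\n"
    "Post D,7\n"
)
top = top_commented(csv.DictReader(sample), k=2)
print(top)
```

To try it on the scraped data, replace the in-memory sample with `open("data/ea_forum/cleaned_data_eaforum.csv", newline="", encoding="utf-8")` and pass the actual column names.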