GA Data Science Immersive Project 3: Subreddit classification

gbkgwyneth/GA-DSI-project-03

Project 3: Web APIs & Classification

Description

For project 3, your goal is two-fold:

  1. Using Reddit's API, you'll collect posts from two subreddits of your choosing.
  2. You'll then use NLP to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem.

About the API

Reddit's API is fairly straightforward. For example, if I want the posts from /r/boardgames, all I have to do is add .json to the end of the URL: https://www.reddit.com/r/boardgames.json
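As a minimal sketch of that request using the requests library: the listing layout (`data` > `children` > `data`) is how Reddit's JSON listings are structured, while the function names and the User-Agent string are hypothetical choices for illustration. A descriptive User-Agent is assumed to help avoid being rate-limited; adjust it to your own project.

```python
def extract_posts(listing):
    """Flatten a parsed Reddit listing into a list of post dicts."""
    return [child["data"] for child in listing["data"]["children"]]


def fetch_listing(subreddit, user_agent="dsi-project-3-demo (hypothetical)"):
    """Request one page of posts from a subreddit and return the parsed JSON."""
    # Third-party import kept local so extract_posts can be used offline.
    import requests

    url = f"https://www.reddit.com/r/{subreddit}.json"
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()
```

Calling `extract_posts(fetch_listing("boardgames"))` should give one dict per post, with keys such as `title`, `selftext`, and `subreddit`.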

To help you get started, we have a primer video on how to use Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk


Requirements

  • Scrape and prepare your data using the requests library.
  • Create and compare two models. One of these must be a random forest; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
  • A Jupyter Notebook with your analysis for a peer audience of data scientists.
  • An executive summary of the results you found.
  • A short presentation outlining your process and findings for a semi-technical audience.
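The model comparison could be sketched roughly as follows, using scikit-learn. The random forest is the required model; the TF-IDF features, the logistic regression as the second model, the hyperparameters, and the `compare_models` helper are all assumptions made for illustration, not part of the assignment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def compare_models(texts, labels, test_size=0.25, random_state=42):
    """Fit two text classifiers on the same split and report their accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=random_state
    )
    # Each pipeline vectorizes the raw text, then fits the classifier.
    models = {
        "random_forest": make_pipeline(
            TfidfVectorizer(stop_words="english"),
            RandomForestClassifier(n_estimators=200, random_state=random_state),
        ),
        "logistic_regression": make_pipeline(
            TfidfVectorizer(stop_words="english"),
            LogisticRegression(max_iter=1000),
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = {
            "train": model.score(X_train, y_train),  # accuracy on seen data
            "test": model.score(X_test, y_test),     # accuracy on held-out data
        }
    return scores
```

Here `texts` would be the post titles (or titles plus body text) and `labels` the subreddit names. A large gap between the train and test scores suggests overfitting, which is worth discussing in the notebook.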

Pro tip 1: Reddit returns 25 posts per request by default. To get enough data, you'll need to hit Reddit's API repeatedly (most likely in a for loop). Be sure to call time.sleep() at the end of each iteration to pause between requests. THIS IS CRUCIAL.

Pro tip 2: The API will cap you at 1,000 posts for each subreddit (assuming the subreddit has that many posts).

Pro tip 3: At the end of each loop iteration, be sure to save the results of your scrape so far as a CSV: JSON from Reddit > Pandas DataFrame > CSV. That way, if something goes wrong in your loop, you won't lose all your data.
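The three tips can be combined into one loop, sketched below. Pagination relies on the `after` token that each listing returns and the `after` query parameter that accepts it. The function names, the selected columns, the 2-second pause, and the User-Agent string are hypothetical. The default of 40 requests simply reflects 40 × 25 = 1,000 posts, the cap mentioned above.

```python
import time


def fetch_page(subreddit, after=None, user_agent="dsi-project-3-demo (hypothetical)"):
    """Request one page of posts; `after` is the token from the previous page."""
    import requests  # local import so the loop logic can be exercised offline

    response = requests.get(
        f"https://www.reddit.com/r/{subreddit}.json",
        params={"after": after},  # requests omits parameters whose value is None
        headers={"User-Agent": user_agent},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


def save_checkpoint(posts, path):
    """Write everything collected so far to a CSV, overwriting the last checkpoint."""
    import pandas as pd

    pd.DataFrame(posts).to_csv(path, index=False)


def scrape(subreddit, path, max_requests=40, pause=2.0,
           fetch=fetch_page, save=save_checkpoint):
    """Collect posts page by page, saving after every request."""
    posts = []
    after = None
    for _ in range(max_requests):
        listing = fetch(subreddit, after)
        for child in listing["data"]["children"]:
            post = child["data"]
            posts.append({
                "name": post["name"],
                "subreddit": post["subreddit"],
                "title": post["title"],
                "selftext": post.get("selftext", ""),
            })
        save(posts, path)  # checkpoint before anything else can fail
        after = listing["data"]["after"]
        if after is None:  # no further pages available
            break
        time.sleep(pause)  # pause between requests
    return posts
```

For example, `scrape("boardgames", "boardgames.csv")` would run up to 40 requests and leave the most recent checkpoint in `boardgames.csv` even if a later request fails.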
