Executive Summary

For teams interested in eventually launching targeted recommendations or ads to users of various subreddits, this project provides multiple pre-trained models for early differentiation. The models are offered as a streamlined "launching point" for future data science efforts.


About

Problem Statement

In the competitive and dynamic world of big data, data science teams are eager to leverage the internet's free data for insight.

This project aims to "pre-train" several NLP classification models and then provide an executive summary of the results to an existing data science client. This team wants to accurately differentiate between two specific subreddits (AskReddit and AskScience) as a first step in developing targeted ads/recommendations.

Success of these pre-trained models is measured by balanced accuracy score because a "false positive" is no more problematic than a "false negative" in this business context. The scope of the project is limited to the data scraped within 3 weeks on said subreddits. Model choices were limited by local compute power. The executive summary provides "future considerations" for the existing data science client, including notes on score choice, model choice, and scope choice.
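For readers unfamiliar with the metric, balanced accuracy is the average of the recall obtained on each class, so errors on either subreddit count equally. The following is a minimal pure-Python sketch, not the project's own code; the label encoding (1 = AskScience, 0 = AskReddit) and the toy labels are assumptions made for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for a binary problem with labels 0 and 1."""
    recalls = []
    for label in (0, 1):
        # Positions of the examples whose true class is `label`.
        idx = [i for i, y in enumerate(y_true) if y == label]
        if not idx:
            raise ValueError(f"no true examples for class {label}")
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 1 = AskScience, 0 = AskReddit (assumed encoding).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Recall on class 1 is 3/4 and on class 0 is 1/2, so the mean is 0.625.
score = balanced_accuracy(y_true, y_pred)
print(score)
```

In practice, `sklearn.metrics.balanced_accuracy_score` computes the same quantity.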


Built With


Process

Data Collection and Cleaning

Data was collected with the Pushshift.io API on the following subreddits:

AskReddit
AskScience

Each dataset contains around 12.5k posts. Given the nature of the project (an executive summary plus a sale to a data science team), the data is included in the repo.
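The collection script itself is not reproduced here, so the following is only a sketch of how a paged Pushshift query could be built. It assumes the historical submission-search endpoint and its `subreddit`, `size`, and `before` parameters; access to Pushshift has changed over time, so the URL may no longer respond. The helper name `build_query` is hypothetical, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Historical Pushshift submission-search endpoint (assumed; access rules
# have changed over time, so treat this as illustrative only).
BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_query(subreddit, before=None, size=100):
    """Return a request URL for one page of submissions from `subreddit`."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        # `before` is a Unix timestamp; paging backwards would pass the
        # oldest `created_utc` seen so far.
        params["before"] = before
    return f"{BASE_URL}?{urlencode(params)}"


# Hypothetical timestamp, chosen only to show the resulting URL shape.
url = build_query("AskScience", before=1650000000)
print(url)
```

Repeating such a query with a decreasing `before` value is one way to accumulate roughly 12.5k posts per subreddit.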

Provided Datasets

askreddit_data.csv: AskReddit Raw Data
askscience_data.csv: AskScience Raw Data
clean_ask_data.csv: Cleaned, Combined Data from Both Subreddits

Preprocessing included extracting stems/lemmas, removing non-English posts, fixing typos, and removing duplicate posts (reposts).
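As an illustration of the repost-removal step only, here is a minimal sketch that treats two posts as duplicates when their text matches after lowercasing and stripping punctuation and extra whitespace. The matching rule and the helper names `normalize` and `drop_reposts` are assumptions, not the project's actual logic; stemming/lemmatization, language filtering, and typo fixing are not shown.

```python
import re


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def drop_reposts(posts):
    """Keep the first occurrence of each post, comparing normalized text.

    Posts that normalize to an empty string are dropped as well.
    """
    seen = set()
    unique = []
    for post in posts:
        key = normalize(post)
        if key and key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


# Hypothetical posts; the second is a repost of the first.
posts = [
    "Why is the sky blue?",
    "why is the sky   blue",
    "What is your favorite movie?",
]
unique_posts = drop_reposts(posts)
print(unique_posts)
```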

Likewise, prior to modeling, I applied CountVectorizer and TfidfVectorizer, plus standardization, to the training corpus.
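A minimal sketch of this step with scikit-learn is shown here. The toy documents, the `stop_words="english"` setting, and the choice to scale the TF-IDF matrix with `StandardScaler(with_mean=False)` are assumptions; the original does not state the vectorizer settings or which matrix was standardized. Each transformer is fitted on the training documents only and then applied to the held-out documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy corpus standing in for the cleaned posts.
train_docs = [
    "why is the sky blue during the day",
    "what is your favorite movie and why",
    "how do vaccines train the immune system",
    "what food do you secretly hate",
]
test_docs = ["why do stars twinkle in the night sky"]

# Raw term counts: fit the vocabulary on training data only.
cvec = CountVectorizer(stop_words="english")
X_train_counts = cvec.fit_transform(train_docs)
X_test_counts = cvec.transform(test_docs)

# TF-IDF weights: same pattern, fitted on training data only.
tfidf = TfidfVectorizer(stop_words="english")
X_train_tfidf = tfidf.fit_transform(train_docs)
X_test_tfidf = tfidf.transform(test_docs)

# with_mean=False scales by standard deviation without centering,
# which keeps the sparse matrix sparse.
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train_tfidf)
X_test_scaled = scaler.transform(X_test_tfidf)

print(X_train_counts.shape, X_train_scaled.shape)
```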


Modeling / Analysis

I applied logistic regression, random forest, and a stacked model (with a decision tree as the meta-learner) to both vectorized datasets, for a total of 6 model comparisons.
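The comparison grid can be sketched with scikit-learn as three models crossed with two vectorizers. This is an illustration under assumptions, not the project's code: the toy posts, label encoding (1 = AskScience, 0 = AskReddit), hyperparameters, and the `cv=3` setting are all hypothetical, and the scores are computed on the training data only to keep the sketch short.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy posts; 1 = AskScience, 0 = AskReddit (assumed encoding).
science = [
    "why does ice float on water",
    "how do vaccines train the immune system",
    "what causes the northern lights",
    "why is the sky blue",
    "how do black holes form",
    "what makes metals conduct electricity",
]
reddit = [
    "what is your favorite movie of all time",
    "what is the worst date you have been on",
    "what food do you secretly hate",
    "who is your favorite teacher and why",
    "what is your most embarrassing story",
    "what job would you never do",
]
docs = science + reddit
labels = [1] * len(science) + [0] * len(reddit)


def build_models():
    """Return fresh, unfitted copies of the three model types."""
    base = [
        ("logreg", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ]
    return {
        "logreg": LogisticRegression(max_iter=1000),
        "rf": RandomForestClassifier(n_estimators=50, random_state=42),
        # Decision tree as the meta-learner over the two base models.
        "stack": StackingClassifier(
            estimators=base,
            final_estimator=DecisionTreeClassifier(random_state=42),
            cv=3,
        ),
    }


vectorizers = {"cvec": CountVectorizer(), "tfidf": TfidfVectorizer()}

scores = {}
for vec_name, vec in vectorizers.items():
    X = vec.fit_transform(docs)
    for model_name, model in build_models().items():
        model.fit(X, labels)
        # Training-set score, for illustration; a real comparison would
        # also score a held-out test set to check for overfitting.
        scores[(vec_name, model_name)] = balanced_accuracy_score(
            labels, model.predict(X)
        )

for key, value in scores.items():
    print(key, round(value, 3))
```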


Results

Selected Screenshots (EDA)


Conclusion

Across the model results, logistic regression performed best on both the CountVectorizer and TF-IDF data.

Random forest was only slightly overfit, but it was very weak at predicting the negative class, as shown by its near-perfect recall and poor precision.

Logistic regression was much more overfit, but its true positive and true negative rates were roughly equal, so it performed about equally well on both classes.

Because the random forest performed poorly, the stacked model suffered as well.
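The random forest pattern described above (near-perfect recall with poor precision) typically arises when a model labels almost every post as the positive class. The sketch below shows the arithmetic with made-up confusion-matrix counts; these numbers are hypothetical and are not the project's results.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical counts for a model that predicts "positive" for most posts.
tp, fp, fn, tn = 98, 80, 2, 20

precision, recall = precision_recall(tp, fp, fn)

# Specificity is the recall of the negative class (true negative rate).
specificity = tn / (tn + fp)

# Balanced accuracy averages the two per-class recalls, so the weak
# negative-class performance drags it down despite the high recall.
balanced_acc = (recall + specificity) / 2

print(round(precision, 3), recall, specificity, balanced_acc)
```

With these counts, recall is 0.98 but specificity is only 0.2, which gives a balanced accuracy of 0.59.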

The final model recommendations:

Logistic regression if you want to prioritize balanced accuracy
Random forest if you want to prioritize recall


Contact

If you wish to contact me, Christopher Denq, please reach out via LinkedIn.

If you're curious about more projects, check out my website or GitHub.

