# Measuring Risk Tolerance - Crypto or Stocks?

**Disclaimer:** I am not a financial advisor, and I am not offering or providing any financial advice to anyone. This is simply a data science project that analyzes keywords unique to the posts in the r/wallstreetbets and r/SatoshiStreetBets subreddits to understand whether an individual might identify more closely with stocks or cryptocurrency.

### Technology & Skills
**Technical Skills:** Binary classification, data collection, web API scraping, data cleaning, EDA, data visualization, machine learning, bias-variance tradeoff, sentiment analysis, natural language processing (NLP), count vectorizer, TF-IDF vectorizer, data pre-processing, modeling, confusion matrix, misclassification, precision, recall, F1 score, ROC AUC, pipeline, grid search, word clouds, pickling

**Technology:** Streamlit, Tableau, Heroku, Python, Jupyter Notebook, GitHub, Git

**Python Libraries:** NLTK, requests, time, pandas, numpy, matplotlib, seaborn, scikit-learn, pickle, streamlit, PIL

**Models:** Logistic regression, decision tree classifier, bagging classifier, multinomial naive Bayes, AdaBoost classifier, random forest classifier, support vector classifier

### Overview
The code notebooks have been organized into 5 sections:
1. Introduction
2. Data Collection
3. Data Cleaning, EDA & Data Visualization
4. Pre-Processing & Modeling
5. Conclusion

### Problem Statement

In the past year, retail investors have flocked to the stock and cryptocurrency markets in the hope of netting a handsome return on their investments. Although retail investors were active long before the COVID-19 pandemic, their participation in and impact on the markets have grown in recent months. From playing an active role in the short squeeze of GameStop (GME) stock to creating hype around Dogecoin, retail investors have engaged in many types of trading activity across a wide range of risk levels.

Investments and trades in both the stock and crypto markets carry some level of risk. Given the highly volatile nature of cryptocurrency, I treat cryptocurrencies as having a higher risk profile than stocks for the purpose of this project.

The r/wallstreetbets subreddit is a community of 9.4 million members who seek to make money by investing and trading in the stock market. The r/SatoshiStreetBets subreddit is the cryptocurrency equivalent of r/wallstreetbets, with a smaller community of 347K members. While r/wallstreetbets focuses mostly on the stock market and r/SatoshiStreetBets mainly on the cryptocurrency market, conversations in the two subreddits occasionally overlap.

For this project, my goal is twofold: (1) to build a classification model that can predict whether a post came from r/wallstreetbets or r/SatoshiStreetBets with an accuracy of at least 80%, and (2) to identify words unique to each subreddit so that I can use them to determine whether an individual retail investor's risk profile leans more toward stocks or cryptocurrency.

As a data scientist consulting for Reddit on cautionary warnings for its r/wallstreetbets and r/SatoshiStreetBets subreddits, I hope to determine which investment type (i.e., stock or crypto) may be more suitable for an individual retail investor based on the keywords with which they identify.
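
To make the second goal more concrete, here is a minimal sketch of one way to find words that appear in one subreddit's posts but never in the other's. It is not the code from the notebooks: the function names `tokenize` and `unique_words` are hypothetical, the example titles are made up, and the project itself may identify keywords differently (for instance, from vectorizer counts or model coefficients).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unique_words(posts_a, posts_b, top_n=5):
    """Return the most frequent words in posts_a that never occur in posts_b."""
    counts_a = Counter(word for post in posts_a for word in tokenize(post))
    vocab_b = {word for post in posts_b for word in tokenize(post)}
    # Keep only the words that are absent from the other corpus.
    only_a = Counter({w: c for w, c in counts_a.items() if w not in vocab_b})
    return [word for word, _ in only_a.most_common(top_n)]


# Made-up example titles, not real Reddit data.
stock_posts = ["GME calls printing today", "Buying more GME shares"]
crypto_posts = ["Dogecoin to the moon today", "Buying dogecoin on the dip"]

print(unique_words(stock_posts, crypto_posts))
print(unique_words(crypto_posts, stock_posts))
```

Words shared by both sets of posts (here "buying" and "today") are dropped, which reflects the occasional overlap between the two subreddits described above.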

### Data Sources
The data was collected through Pushshift's API. The endpoints used for each subreddit are linked below.

- [r/wallstreetbets](https://api.pushshift.io/reddit/search/submission?subreddit=wallstreetbets): The Pushshift submission search endpoint for the r/wallstreetbets subreddit.
- [r/SatoshiStreetBets](https://api.pushshift.io/reddit/search/submission?subreddit=SatoshiStreetBets): The Pushshift submission search endpoint for the r/SatoshiStreetBets subreddit.
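
As a rough illustration of how posts could be pulled from these endpoints, the sketch below pages backward through a subreddit with `requests`. It is not the collection code from the notebooks. The helper names `build_params` and `fetch_posts`, the batch size, and the pause length are assumptions, as is the expectation that results arrive newest first inside a `data` list. Access to the public Pushshift endpoint may also have changed since the data was collected.

```python
import time

BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_params(subreddit, size=100, before=None):
    """Build the query parameters for one request (hypothetical helper)."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        # Only return posts created before this Unix timestamp.
        params["before"] = before
    return params


def fetch_posts(subreddit, n_batches=3, pause=1.0):
    """Collect up to n_batches pages of posts, walking backward in time."""
    # Imported here so build_params can be used without the dependency.
    import requests

    posts, before = [], None
    for _ in range(n_batches):
        response = requests.get(
            BASE_URL, params=build_params(subreddit, before=before), timeout=30
        )
        response.raise_for_status()
        batch = response.json().get("data", [])
        if not batch:
            break
        posts.extend(batch)
        # Assumes newest-first ordering, so the last post is the oldest one.
        before = batch[-1]["created_utc"]
        time.sleep(pause)  # be polite to the API between requests
    return posts
```

Calling `fetch_posts("wallstreetbets")` and `fetch_posts("SatoshiStreetBets")` would then return a list of post dictionaries for each subreddit, provided the endpoint is reachable.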

### Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|subreddit|object|cleaned_posts|The name of the subreddit|
|title|object|cleaned_posts|The title text of the post|
|post_char_length|int|cleaned_posts|The character length of the post|
|post_word_count|int|cleaned_posts|The word count of the post|
|sentiment_compound|float|cleaned_posts|The compound score from sentiment analysis|
|sentiment_negative|float|cleaned_posts|The negativity score from sentiment analysis|
|sentiment_neutral|float|cleaned_posts|The neutrality score from sentiment analysis|
|sentiment_positive|float|cleaned_posts|The positivity score from sentiment analysis|
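
The derived columns in this table could be computed along the lines of the sketch below. It is an illustration, not the notebook code: the helper names are hypothetical, the length features are assumed to be measured on `title`, and the four sentiment columns are assumed to come from NLTK's VADER analyzer, whose `compound`, `neg`, `neu`, and `pos` scores match the column names but which the source does not name explicitly.

```python
def add_length_features(post):
    """Add character and word counts for a post's title (assumed text field)."""
    title = post["title"]
    return {
        **post,
        "post_char_length": len(title),
        "post_word_count": len(title.split()),
    }


def add_sentiment_features(post):
    """Add sentiment scores, assuming NLTK's VADER analyzer is the source.

    Requires nltk plus a one-time nltk.download("vader_lexicon").
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    scores = SentimentIntensityAnalyzer().polarity_scores(post["title"])
    return {
        **post,
        "sentiment_compound": scores["compound"],
        "sentiment_negative": scores["neg"],
        "sentiment_neutral": scores["neu"],
        "sentiment_positive": scores["pos"],
    }


# Made-up example row, not real Reddit data.
row = add_length_features({"subreddit": "wallstreetbets", "title": "GME to the moon"})
print(row["post_char_length"], row["post_word_count"])  # 15 4
```

Applying both helpers to every collected post would yield rows with the same columns as `cleaned_posts`.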