# WallStreetBets MDP Hackathon

***
## Warning

Please note, participation in this hackathon is entirely voluntary. Since you will be scraping through Reddit, a social news aggregation, web content rating, and discussion website, you will almost certainly encounter inappropriate or foul content and language that does not adhere to Morningstar’s professional guidelines. If you do not feel comfortable with this, please remember it is **absolutely not mandatory** to participate. There will be other hackathons and opportunities to learn these skills in the future. 

***

## Objective

Morningstar is interested in adding Social Media to our analytics as a predictive indicator of market volatility.  Increasingly, individuals are sharing "stock tips" through social media outlets such as /r/WallStreetBets as a way to build momentum.  In many ways it's the same ol' "pump and dump", but for the Social Media age.

As you work your way through this Hackathon, you will help Morningstar understand successful techniques for extracting data from /r/WallStreetBets and from similar subreddits or sites.  Upon completion of this challenge, our Data Science Team will review your matches, your approach, and your results.  From there, they will select the most interesting submissions and ask you to present your work to the team during a 30-minute review.  The winners will be selected from those top submissions.

Morningstar is working with Argonne Labs to build a disinformation indicator that measures how sensitive a company is to disinformation shared on Social Media.  For example, how likely is Dominion Voting Systems to be impacted by Social Media posts shared leading up to and beyond the US Election?  Led by our CDO, Alex Golbin, Morningstar is working to publish this "VIX" for Social Media Disruption and even to measure how sensitive a particular company or sector is to disinformation.

In this Hackathon, you will attempt to build a program that scrapes /r/WallStreetBets and extracts likely company ticker references.  You will then use those matches to populate the schema defined in the sample_extract_structure.xlsx file with your matches plus the raw data that you used to derive each match.

**Although the assignment is intended to be done in Python (a common language for Data Science) and all examples are in Python, prior experience in Python is not required.  It is a fairly simple language to learn and use.**


In [1]:
import dnalab
import pandas as pd

## DNA Lab

DNA Lab is Morningstar's newest addition to Morningstar Direct, our flagship software product.  It allows our users to interact with Morningstar's data in new and exciting ways, giving our customers, for the first time, access to our data very similar to that of our own Researchers.  It is built upon Jupyter Lab & Jupyter Notebooks, and we've worked to keep the Jupyter experience intact, focusing instead on adding Morningstar functionality as plug & play components.

If you are already familiar with Jupyter, you should feel at home.  If you are not, we recommend watching some training videos on using Jupyter Notebooks through whichever training resource you prefer and is available in your region.

Here is a video where Jinyoung Kim demos the capabilities of this system:  https://web.microsoftstream.com/video/8fb1bc88-eea3-4add-ad04-d732b53338e4

You will also find more videos here:  https://web.microsoftstream.com/channel/05659af5-92cf-4439-933f-b6c646f41e19?referrer=https:%2F%2Fmswiki.morningstar.com%2F


## Rules

1. You are welcome to use DNA Lab as your IDE but that is not required.  
2. You are invited to download the data that you need and use that cache to improve performance.
3. Only use /r/wallstreetbets as your source material and only match comments from 2021-03-01T00:00:00Z until 2021-03-31T23:59:59Z.  No other sources are to be used for this contest.
4. Winners will publish their source code as part of the follow up so we can all learn from their glorious achievements.
5. All Python libraries/APIs are permitted as long as they only help retrieve threads/comments from /r/wallstreetbets or help extract text for matching purposes.
6. You may use any technique for matching companies, including Company Names, Tickers, etc.
7. Individuals or pairs are permitted.  If you are not able to find a partner and want one, please reach out to @Brad Boemmel.
8. Reddit enforces rate limits, so you may decide to shard the datasets and share the results across pairs to increase coverage if you so choose.
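
As a rough illustration of rules 3 and 8, the sketch below checks whether a comment falls inside the contest window and splits the window into per-day shards that teammates could divide between them. It assumes each comment carries a Unix timestamp in seconds (UTC), as Reddit's `created_utc` field does; the function names `in_contest_window` and `daily_shards` are hypothetical and not part of any library.

```python
from datetime import datetime, timedelta, timezone

# Contest window from rule 3 (inclusive on both ends).
WINDOW_START = datetime(2021, 3, 1, 0, 0, 0, tzinfo=timezone.utc)
WINDOW_END = datetime(2021, 3, 31, 23, 59, 59, tzinfo=timezone.utc)


def in_contest_window(created_utc):
    """Return True if a Unix timestamp (seconds, UTC) falls inside the window."""
    created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return WINDOW_START <= created <= WINDOW_END


def daily_shards(n_workers):
    """Split the days of the window round-robin across n_workers (hypothetical helper)."""
    days = []
    day = WINDOW_START
    while day <= WINDOW_END:
        days.append(day.date().isoformat())
        day += timedelta(days=1)
    # Worker i takes every n-th day, starting from day i.
    return [days[i::n_workers] for i in range(n_workers)]
```

With two people, `daily_shards(2)` gives one person the odd-numbered days of March and the other the even-numbered days. The raw dumps can then be merged before matching.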

## Security Reference Data

Here is a sample query which shows the universe of public companies that Morningstar's Equity team tracks today.  You are not restricted to using only this index, but it should help you match tickers to company_ids, cusips, and isins in order to create your output file.  You can use any data set that is available on the Lake by using the Data Explorer.

It's often difficult to find things at the moment, so below you will find a sample query with some key data that you'll want to try to match against.  In your program, make sure that you pull this table in its entirety and then write it to your local directory as a pickle, CSV, etc. for performance reasons.  On the last day, you may want to re-run this query and refresh your copy for your final run, just in case something new has arrived.

You will need a few columns from this table in order to populate the sample_extract_structure including cusip & isin.  Look for the Notebook called "4." to see the extract structure explained.
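
One possible way to keep a local copy of the reference data is a small "load or fetch" helper like the sketch below. It is only an illustration: `load_or_fetch` and the cache path are hypothetical names, and `pickle` is just one of the formats mentioned above.

```python
import pickle
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return the cached object if the file exists; otherwise call fetch() and cache it."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    data = fetch()  # the slow call, e.g. a query against the Lake
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(data, fh)
    return data
```

In a notebook this might be called as `load_or_fetch("cache/companyinfo.pkl", lambda: pd.DataFrame(dnalab.query(query_string)))`. Deleting the cache file forces a fresh pull, which is one way to handle the last-day refresh.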

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [14]:
query_string = """
SELECT
standardname, companyid, companystatus, operationstatus, shortname
FROM
"equity-atlas_v3"."companyinfo" LIMIT 300
"""

df = dnalab.query(query_string)
df_pd = pd.DataFrame(df)
df_pd.head(300)

| | standardname | companyid | companystatus | operationstatus | shortname |
|---|---|---|---|---|---|
| 0 | Bank of Commerce Holdings Inc | 0C000006U4 | U | N | Bank of Commerce Hldgs |
| 1 | Baxter International Inc | 0C000006V2 | U | N | Baxter Intl |
| 2 | BCE Inc | 0C000006VC | U | N | BCE |
| 3 | Baker Hughes Inc | 0C000006TA | V | M | Baker Hughes |
| 4 | AZZ Inc | 0C000006T0 | U | N | AZZ |
| 5 | Ballistic Recovery Systems Inc | 0C000006TJ | U | N | Ballistic Recovery |
| 6 | Best Buy Co Inc | 0C000006WE | U | N | Best Buy Co |
| 7 | Bunge Ltd | 0C00000726 | U | N | Bunge |
| 8 | Luxbright AB | 0C0000BZIL | U | N | Luxbright |
| 9 | Brown-Forman Corp | 0C0000071J | U | N | Brown-Forman |


In [56]:
query_string = """
SELECT
companyid, cusip, symbol
FROM
"equity-atlas_v3"."shareclassinfo"
WHERE "equity-atlas_v3"."shareclassinfo"."symbol" = 'AAPL'
LIMIT 1
"""

df = dnalab.query(query_string)
df_pd = pd.DataFrame(df)
df_pd.head(300)

| | companyid | cusip | symbol |
|---|---|---|---|
| 0 | 0C00000ADA | 37833100 | AAPL |
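
Note that the `cusip` value in the output above shows eight digits, whereas a CUSIP is nine characters long. The leading zero appears to have been dropped, possibly because the column was parsed as a number (that cause is an assumption). A minimal sketch of a normalizer, using the hypothetical helper name `normalize_cusip`:

```python
def normalize_cusip(value):
    """Return a CUSIP as a 9-character string, restoring lost leading zeros."""
    text = str(value).strip().upper()
    if text.endswith(".0"):  # value was read as a float, e.g. 37833100.0
        text = text[:-2]
    return text.zfill(9)
```

With pandas, this could be applied to the whole column with `df_pd["cusip"] = df_pd["cusip"].map(normalize_cusip)` before writing your extract file.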


## Extraction Strategies

You are rate limited, so you will need to be creative in order to extract the most important parts.
- Tickers are generally short: NYSE symbols are typically one to three letters, while Nasdaq symbols are typically four (e.g., GME vs. MORN).
- Are new companies typically listed in the Post Title or in the comments?
- Are heavily upvoted comments generally more useful?
- Are you likely to find important matches deeper in the comment tree or near the root?
- Can you extract data from images?  Is that useful?
- Should you ignore posts which prominently feature tickers you've already extracted?
- Do the tags help you in determining whether a post is likely useful for mining?  (Eg.  DD posts)
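
As a starting point for the first bullet, candidate tickers can be pulled out with a regular expression. The sketch below counts upper-case, ticker-shaped tokens, with or without a leading `$`; the one-to-five letter range and the name `candidate_tickers` are assumptions for illustration, and the candidates still need to be validated against the reference data.

```python
import re
from collections import Counter

# Optional "$" cashtag, then 1-5 capital letters standing alone as a token.
TICKER_PATTERN = re.compile(r"(?<![A-Za-z0-9$])\$?([A-Z]{1,5})(?![A-Za-z0-9])")


def candidate_tickers(text):
    """Count upper-case, ticker-shaped tokens in a post title or comment."""
    return Counter(TICKER_PATTERN.findall(text))
```

For example, `candidate_tickers("GME to the moon, also $MORN and GME again")` counts GME twice and MORN once. The pattern is deliberately loose: the pronoun "I" or a shouted "YOLO" would also be counted, so treat the output as candidates rather than matches.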

## Mismatching
- Since tickers are short letter codes, one of the challenges is avoiding mismatches. For example, the sentence "It is important to be happy" could match NYSE:IT (Gartner) and NYSE:BE (Bloom Energy).  Context is important and part of the challenge.  In this case, for example, it helps that most tickers are written in ALL UPPER CASE.
- With that in mind, you'll want to conduct some Quality Review of your matches and check the raw data to determine whether each match is valid.  You can then improve your extraction process in order to avoid mismatching.
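
One way to act on both points is sketched below: match case-sensitively, keep only symbols found in a known ticker set, require a `$` prefix for symbols that double as common words, and keep a snippet of surrounding text for manual review. The function name `find_mentions`, the `ambiguous` set, and the 30-character context width are all assumptions made for illustration.

```python
import re

# Group 1 captures an optional "$" cashtag, group 2 the 1-5 letter symbol.
TOKEN = re.compile(r"(?<![A-Za-z0-9$])(\$?)([A-Z]{1,5})(?![A-Za-z0-9])")


def find_mentions(text, known_tickers, ambiguous, context=30):
    """Return (ticker, snippet) pairs for quality review.

    known_tickers: set of valid symbols (e.g. from the reference data).
    ambiguous: symbols that are also common words; these only count
    when written as a cashtag such as $IT.
    """
    mentions = []
    for match in TOKEN.finditer(text):
        cashtag, symbol = match.group(1), match.group(2)
        if symbol not in known_tickers:
            continue
        if symbol in ambiguous and not cashtag:
            continue
        # Keep some surrounding text so a reviewer can judge the match.
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        mentions.append((symbol, text[start:end]))
    return mentions
```

With `known_tickers = {"GME", "IT", "BE", "MORN"}` and `ambiguous = {"IT", "BE"}`, the sentence "It is important to be happy" yields no mentions, even when typed in all capitals, while "Loading up on $IT and GME today" yields IT and GME. Reviewing the snippets is where you would discover which symbols belong in the ambiguous set.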

## Local Caching
- Since you are rate limited and you may want to re-process comments/posts as you refine your matching techniques, it's strongly recommended that you split your code in two. The first process downloads the raw data and writes it to disk.  The second process reads the raw data, refines it, and creates the submission file.  This is a common practice in Data Science as it allows you to rapidly make adjustments and quickly see results.  You should also do this for the Security Reference data above, so that your second notebook runs very quickly.

Feel free to swap or share raw data dumps with competing teams.  The challenge is not necessarily in downloading, the challenge is in matching and the data processing and using the tools.
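
A minimal sketch of this two-stage split, assuming the raw posts and comments have already been converted to plain dictionaries. The field names used in the usage note are hypothetical, and JSON Lines is just one convenient on-disk format.

```python
import json
from pathlib import Path


def save_raw(records, path):
    """Stage 1: append raw posts/comments to a JSON Lines file, one record per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def load_raw(path):
    """Stage 2: stream the cached records back without touching Reddit."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

The download notebook would call something like `save_raw([{"id": "abc", "body": "GME to the moon", "created_utc": 1614556800}], "raw/comments.jsonl")`, and the matching notebook would iterate over `load_raw("raw/comments.jsonl")` as often as needed. Because the file is opened in append mode, several download runs or shared dumps can be accumulated into the same file.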




## Getting started

If you make changes to this folder, all of your changes are visible to others.  So you need to create your own copy of this repository in your own private directory.

1.  Select File from the menu and choose "New Terminal", and you will be greeted with a bash terminal.
2.  Type `git clone https://msstash.morningstar.com/scm/~tgilber/wallstreetbets_hackathon.git` into the terminal.  This will create a local version of the instructions within your own private directory so that you can explore DNA Lab.
3. Type `exit`.
4. Close this page and navigate to your own copy of this repository.  This will allow you to edit the notebooks directly without impacting other users.

You will now have a copy of the project in your own directory.  Although it's a git repo, please don't push any changes to it; you are welcome to commit locally.

If you are working with a partner, then you'll need to either create a branch or set up your own repo to keep your code secret.  Unfortunately, there is no easy way to do this for everyone, so you'll need to handle this yourself by setting up a shared git repo that you can push to, or by using another method to work together.

## How to submit your work

See the "4. Extract Structure Explained" notebook for more detailed information.
See the "5. Submitting your work" notebook for the final instructions.

## Winners

**Updated 6-APR-2021: I added a notebook "6. Scoring..." to add more clarity around how scoring will work**

In true Data Science fashion, there is both a Qualitative and a Quantitative aspect to determining the winners!

Our Data Science team will initially review your extract files (defined in "4. Extract Structure Explained") and attempt to load your files into the general pool.  They will then compare the submissions against each other and against other sources that they have gathered.  Higher match rates, higher volumes of matches, and fewer false matches will improve your chances of going into the next round.

A small number of teams will be selected for the Qualitative round, where they will be invited to present their work to the Mumbai Data Science Team and talk through the challenges, how they overcame them, what they learned, and what unique solutions they employed.

From those presentations, our Mumbai Data Science team will choose a winner and a runner-up, who will then be announced.

Finally, the MDPs will be invited to select a representative group to present this Hackathon as a Morningstar Tech Tuesday session.

## Getting Help

* Brad will set up a Team Chat with key representatives so that you can help answer questions among yourselves.
* Tag @Brad Boemmel or @Tim Gilbert if you get stuck and can't find the solution yourself.
* This is a global contest, so you may not receive quick answers: our MDPs are spread over the globe, our Data Scientists are in Mumbai, and Brad & Tim work during North American hours.


## Etcetera

Here are a bunch of random sources and ideas to help you get started.

- https://praw.readthedocs.io/en/latest/
- https://towardsdatascience.com/scraping-reddit-data-1c0af3040768
- https://www.memebergterminal.com/
- https://swaggystocks.com/dashboard/wallstreetbets/ticker-sentiment
- https://gist.github.com/pangyuteng/849d2696d59b1457616c8cc2ccd205fd
- https://www.kaggle.com/gpreda/reddit-wallstreetsbets-posts
