# Cleaning




The columns are the same for each dataset, so one script can clean them all: test it on one dataset, then loop over the others.

### Imports and setup

In [87]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import os
import seaborn as sns
import sys
import demoji
import nltk 
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')


sys.path.append('../')



pd.set_option('display.max_rows', 250)
pd.set_option('display.max_columns', 250)



data_path = os.path.join('combined_files.csv')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\paganinik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Preprocessing the CSV files before loading

Before the data can be read into a dataframe, the CSV files themselves need some preprocessing. This is done in `clean.py`, which takes the broken CSV files and repairs them into correct records. The text of the posts contained commas, which was breaking the CSV files.
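The underlying problem is unquoted commas inside a text field. As a hedged illustration only (this is not the contents of `clean.py`, which is not shown here, and the record below is made up), Python's `csv` module quotes such fields on write so that they survive a round trip:

```python
import csv
import io

# Hypothetical record whose title contains commas.
row = ["ko124i", "3k - 170k since March, also buy LIT, now", "34"]

# QUOTE_MINIMAL wraps only the fields that need it (here, the title).
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)

# Reading the line back yields the original three fields.
buffer.seek(0)
parsed = next(csv.reader(buffer))
```

Without the quoting, a naive split on commas would turn this record into five fields instead of three.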


## More data cleaning is still needed

In [88]:
df = pd.read_csv(data_path)

In [89]:
df.head(1)

Unnamed: 0,created,id,author,retrieved,edited,pinned,archived,locked,removed,deleted,is_self,is_video,is_original_content,title,link_flair_text,upvote_ratio,score,gilded,total_awards_received,num_comments,num_crossposts,selftext,thumbnail,shortlink
0,2021-01-01 00:02:06,ko124i,[deleted],2021-02-02 21:52:13,1970-01-01 00:00:00,0,0,0,1,1,1,0,0,3k - 170k since March (Also buy LIT!!),Gain,1.0,34,0,1,14,0,[deleted],default,https://redd.it/ko124i


In [90]:
df.columns[17:20]

Index(['gilded', 'total_awards_received', 'num_comments'], dtype='object')

### Converting types

In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1483575 entries, 0 to 1483574
Data columns (total 24 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   created                1483575 non-null  object 
 1   id                     1483575 non-null  object 
 2   author                 1483575 non-null  object 
 3   retrieved              1483575 non-null  object 
 4   edited                 1483575 non-null  object 
 5   pinned                 1483575 non-null  int64  
 6   archived               1483575 non-null  int64  
 7   locked                 1483575 non-null  int64  
 8   removed                1483575 non-null  int64  
 9   deleted                1483575 non-null  int64  
 10  is_self                1483575 non-null  int64  
 11  is_video               1483575 non-null  int64  
 12  is_original_content    1483575 non-null  int64  
 13  title                  1483573 non-null  object 
 14  link_flair_text   

In [92]:


# date time columns
for col in ['created', 'retrieved', 'edited']:
    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d %H:%M:%S.%f')

# boolean / categorical variables
bool_cols = ['pinned', 'archived', 'locked', 'removed', 'deleted',
             'is_self', 'is_video', 'is_original_content']
df[bool_cols] = df[bool_cols].astype('bool')

# int types
int_cols = ['score', 'gilded', 'total_awards_received', 'num_comments', 'num_crossposts']
df[int_cols] = df[int_cols].astype('int')



Columns:    
| Index | Feature               | Type     | Description                                                    | 
|-------|-----------------------|----------|----------------------------------------------------------------|
| 0     | created               | datetime | Time the submission was created                                |
| 1     | id                    | string   | The id of the submission                                       |
| 2     | author                | string   | The redditor's username                                        |
| 3     | retrieved             | datetime | Time the submission was retrieved                              |
| 4     | edited                | datetime | Time the submission was modified                               |
| 5     | pinned                | boolean  | Whether or not the submission is pinned                        |
| 6     | archived              | boolean  | Whether or not the submission is archived                      |
| 7     | locked                | boolean  | Whether or not the submission is locked                        |
| 8     | removed               | boolean  | Whether or not the submission is removed                       |
| 9     | deleted               | boolean  | Whether or not the submission is user deleted                  |
| 10    | is_self               | boolean  | Whether or not the submission is a text                        |
| 11    | is_video              | boolean  | Whether or not the submission is a video                       |
| 12    | is_original_content   | boolean  | Whether or not the submission has been set as original content |
| 13    | title                 | string   | Title of the submission                                        |
| 14    | link_flair_text       | string   | Submission link flair's text content                           |
| 15    | upvote_ratio          | double   | Percentage of upvotes from all votes on submission             |
| 16    | score                 | integer  | number of upvotes                                              |
| 17    | gilded                | integer  | number of gilded awards                                        |
| 18    | total_awards_received | integer  | number of awards on the submission                             |
| 19    | num_comments          | integer  | number of comments on the submission                           |
| 20    | num_crossposts        | integer  | number of crossposts on the submission                         |
| 21    | selftext              | string   | submission selftext on text posts                              |
| 22    | thumbnail             | string   | submission thumbnail on image posts                            |
| 23    | shortlink             | string   | submission short url                                           |    

### Cleaning functions for Title


Things that need to be cleaned from "Title":
- New lines
- Emojis (convert or remove?)
- Spam messages (possibly only take posts that have a certain number of upvotes)
- links (need to remove entire record if link is only thing)
- videos (same as link)
- A lot of records do not talk about a specific stock. (Remove them?)

Columns that can be removed for sure:
- id
- shortlink  
- thumbnail
- retrieved
- edited 
- pinned  
- archived  
- locked  
- removed (if removed is true should we discard the record?)
- deleted (same as removed)
- is_self   
- is_video (use as flag to remove records?)
- gilded   

Maybe keep (general stats about the post):
- score
- upvote_ratio
- num_comments

Keep:
- created
- title + selftext

Can we combine these stats into a single score? It could be a weighted average of upvote ratio, number of comments, etc.
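One possible way to build such a score, sketched under assumptions: each feature is min-max scaled to [0, 1] and the scaled values are combined with weights. The function name `engagement_score`, the weights, and the log transform on the count columns are all illustrative choices, not something settled in this notebook.

```python
import numpy as np
import pandas as pd

def engagement_score(frame, weights=None):
    """Weighted average of min-max scaled post statistics (illustrative weights)."""
    if weights is None:
        weights = {'upvote_ratio': 0.5, 'score': 0.3, 'num_comments': 0.2}
    total = sum(weights.values())
    result = pd.Series(0.0, index=frame.index)
    for col, w in weights.items():
        values = frame[col].astype(float)
        if col != 'upvote_ratio':
            # Counts are heavy-tailed, so compress them before scaling.
            values = np.log1p(values.clip(lower=0))
        span = values.max() - values.min()
        # A constant column carries no information and contributes zero.
        scaled = (values - values.min()) / span if span > 0 else values * 0.0
        result += (w / total) * scaled
    return result

# Toy rows, not real data.
toy = pd.DataFrame({'upvote_ratio': [0.4, 1.0], 'score': [0, 34], 'num_comments': [0, 14]})
toy_scores = engagement_score(toy)
```

Because the scaling is relative to the frame passed in, scores are only comparable within one call; scaling per day versus over the whole year is a separate decision.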

In [93]:
df['selftext'].value_counts()[0:4]

[removed]

Many self-text posts have had their text removed or deleted after posting, leaving `[removed]` or `[deleted]` in `selftext`. Some other recurring messages seem to come from bots and could probably be removed as well. I suggest replacing `[deleted]` and `[removed]` with an empty string. I think we should also combine this column with the title column so there is only one column with text.
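A minimal sketch of that suggestion, assuming the only placeholders are the literal strings `[removed]` and `[deleted]`. The helper name `combine_text` and the toy rows are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

PLACEHOLDERS = ['[removed]', '[deleted]']

def combine_text(frame):
    """Return title plus selftext, treating placeholders and NaN as empty."""
    body = frame['selftext'].fillna('')
    # Keep the body only where it is not a placeholder string.
    body = body.where(~body.isin(PLACEHOLDERS), '')
    return (frame['title'].fillna('') + ' ' + body).str.strip()

toy = pd.DataFrame({
    'title': ['What would make GME shorts win?', 'Hedging your portfolio'],
    'selftext': ['[removed]', 'Just out of curiosity'],
})
toy_text = combine_text(toy)
```

With this approach a removed post still contributes its title, rather than being dropped entirely.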

In [94]:
df[(df['is_self'] == 1) & (df['removed'] == 1)]

Unnamed: 0,created,id,author,retrieved,edited,pinned,archived,locked,removed,deleted,is_self,is_video,is_original_content,title,link_flair_text,upvote_ratio,score,gilded,total_awards_received,num_comments,num_crossposts,selftext,thumbnail,shortlink
0,2021-01-01 00:02:06,ko124i,[deleted],2021-02-02 21:52:13,1970-01-01,False,False,False,True,True,True,False,False,3k - 170k since March (Also buy LIT!!),Gain,1.00,34,0,1,14,0,[deleted],default,https://redd.it/ko124i
9,2021-01-01 00:13:41,ko190a,[deleted],2021-02-03 21:12:56,1970-01-01,False,False,False,True,True,True,False,False,TSXV ROVR OTCQB ROVMF could be getting ready t...,General Discussion,1.00,1,0,0,0,0,[deleted],default,https://redd.it/ko190a
11,2021-01-01 00:18:03,ko1bnp,dluther93,2021-02-02 21:52:13,1970-01-01,False,False,False,True,False,True,False,False,What would make GME shorts win?,Discussion,1.00,1,0,0,0,0,[removed],default,https://redd.it/ko1bnp
15,2021-01-01 00:18:57,ko1c6o,[deleted],2021-02-03 21:17:46,1970-01-01,False,False,False,True,True,True,False,False,Stocks for beginners: How do you know which st...,,0.55,1,0,0,14,0,[deleted],default,https://redd.it/ko1c6o
16,2021-01-01 00:18:57,ko1c6o,[deleted],2021-02-03 21:17:46,1970-01-01,False,False,False,True,True,True,False,False,Stocks for beginners: How do you know which st...,,0.55,1,0,0,14,0,[deleted],default,https://redd.it/ko1c6o
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1483564,2021-12-31 23:46:45,rt6fuz,[deleted],2022-01-01 03:29:07,1970-01-01,False,False,False,True,False,True,False,False,I'm about to start making $120K salary. Should...,Auto,1.00,1,0,0,1,0,[removed],default,https://redd.it/rt6fuz
1483565,2021-12-31 23:48:27,rt6gxc,peachezandsteam,2022-01-01 03:29:52,1970-01-01,False,False,False,True,False,True,False,False,Is the notable congresswoman's GOOG calls a gu...,Discussion,0.66,1,0,0,0,0,[removed],https://b.thumbs.redditmedia.com/pp0YjoMYmhccq...,https://redd.it/rt6gxc
1483571,2021-12-31 23:55:49,rt6lul,coyote_of_the_month,2022-01-01 03:29:07,1970-01-01,False,False,False,True,False,True,False,False,Company was unable to process additional 403(b...,R5: Legal,1.00,1,0,0,3,0,[removed],self,https://redd.it/rt6lul
1483572,2021-12-31 23:55:51,rt6lv6,[deleted],2022-01-01 03:56:58,1970-01-01,False,False,False,True,False,True,False,False,Winner or loser? Only time will tell. 2021 end...,Discussion,1.00,1,0,0,1,0,[removed],default,https://redd.it/rt6lv6


In [95]:
df['thumbnail'].value_counts()

default                                                                             849993
self                                                                                431161
image                                                                               113541
nsfw                                                                                  3317
spoiler                                                                               1925
                                                                                     ...  
https://a.thumbs.redditmedia.com/m8QV6nOfMondOzYlaoRpWQ2qjYjz0SB5DezQixtPfM8.jpg         1
https://a.thumbs.redditmedia.com/UvYbtr6hGZ3WFrk32-3oT-hJ18H50aStwX8RF08zKn8.jpg         1
https://a.thumbs.redditmedia.com/wL3ECjUia5J3zuQ3dVSnqtdMUag0o2sie4tMQjXXRu8.jpg         1
https://b.thumbs.redditmedia.com/3b2alO76OukxzyuDfWZzxld-1X7UqPsuvjqslCtvfwU.jpg         1
https://b.thumbs.redditmedia.com/TtUVXN1XpoXXuzY85bJMNo1451L4fTOqYqKailX9M-c.jpg         1

In [96]:
df.columns

Index(['created', 'id', 'author', 'retrieved', 'edited', 'pinned', 'archived',
       'locked', 'removed', 'deleted', 'is_self', 'is_video',
       'is_original_content', 'title', 'link_flair_text', 'upvote_ratio',
       'score', 'gilded', 'total_awards_received', 'num_comments',
       'num_crossposts', 'selftext', 'thumbnail', 'shortlink'],
      dtype='object')

### We just want the text for the most part

Experimenting with upvote_ratio and score as well.


In [97]:
df = df[['created','removed', 'deleted', 'is_self','title', 'upvote_ratio', 'score', 'selftext']]

pseudocode:   
```
if is_self and not (removed or deleted) and selftext is a string:
    take date, title + selftext
else:
    take date, title
```

The next cell implements only the first branch: posts that fall into the `else` case are dropped rather than kept with just their title.

In [98]:
# Keep self posts that were neither removed nor deleted and whose selftext is a string.
has_text = df['selftext'].apply(lambda x: isinstance(x, str))
# .copy() makes df_extracted an independent frame, so adding a column
# does not trigger pandas' SettingWithCopyWarning.
df_extracted = df.loc[df['is_self'] & ~(df['removed'] | df['deleted']) & has_text].copy()
df_extracted['text'] = df_extracted['title'] + ' ' + df_extracted['selftext']
df_extracted = df_extracted[['created', 'text', 'upvote_ratio', 'score']]


In [99]:
df_extracted.head()

Unnamed: 0,created,text,upvote_ratio,score
3,2021-01-01 00:05:17,Advice for someone who's never dealt with stoc...,0.4,0
4,2021-01-01 00:05:17,Advice for someone who's never dealt with stoc...,0.4,0
7,2021-01-01 00:13:13,So /r/stocks what was your 2020 investment rat...,0.63,4
8,2021-01-01 00:13:13,So /r/stocks what was your 2020 investment rat...,0.63,4
10,2021-01-01 00:15:38,WSBVoteBot Log for Jan 01 2021 Every time a ne...,0.5,0


In [100]:
df_extracted.shape

(306368, 4)

In [101]:

df_extracted['text'].head(20)


3     Advice for someone who's never dealt with stoc...
4     Advice for someone who's never dealt with stoc...
7     So /r/stocks what was your 2020 investment rat...
8     So /r/stocks what was your 2020 investment rat...
10    WSBVoteBot Log for Jan 01 2021 Every time a ne...
14    Hedging your portfolio Just out of curiosity  ...
20    BNGO Bear Case (Serious) I'm actually quite sk...
21    Daily Executions- December 31 2020 Hi Everyone...
35    Built two Google Sheets templates with automat...
36    Built two Google Sheets templates with automat...
41    $GAXY Youtuber London Investor will interview ...
47    Thoughts on Old School Value's stock tracking ...
57    GME is the Rockets 🚀🚀🚀🚀 Gamestop colors: Red  ...
58    Western Digital (WDC) rose 11.83% today. Anybo...
59    ARK invest selling $TSLA Ark invest ETF’s ARKW...
60    ARK invest selling $TSLA Ark invest ETF’s ARKW...
66    Recent IPO Chindata (CD). Looks promising. Wha...
67    Recent IPO Chindata (CD). Looks promising.

## Cleaning the text of the remaining posts

### Removing emojis

In [102]:
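# Dropping every non-ASCII character removes emojis, but also other non-ASCII text such as curly quotes.
df_extracted['text'] = df_extracted['text'].str.encode('ascii', 'ignore').str.decode('ascii')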
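If only emojis should go, a narrower filter is possible. The sketch below uses the standard `re` module with a hand-picked set of Unicode ranges; the ranges are an assumption and are not exhaustive, and `strip_emoji` is a hypothetical helper name. (`demoji` is imported at the top but not used in the cells shown here.)

```python
import re

# Assumed emoji ranges; this list is illustrative and not exhaustive.
EMOJI_PATTERN = re.compile(
    '['
    '\U0001F300-\U0001FAFF'  # pictographs, emoticons, transport, supplemental symbols
    '\U0001F1E6-\U0001F1FF'  # regional indicator (flag) letters
    '\u2600-\u27BF'          # miscellaneous symbols and dingbats
    '\uFE0F'                 # variation selector used with emoji
    ']+'
)

def strip_emoji(text):
    """Remove emoji but keep other non-ASCII characters such as curly quotes."""
    return EMOJI_PATTERN.sub('', text)

sample = 'GME is the Rockets \U0001F680\U0001F680 ETF\u2019s'
cleaned = strip_emoji(sample)
```

It could be applied with `df_extracted['text'].map(strip_emoji)` in place of the ASCII round trip.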

### Removing links

In [103]:
df_extracted['text'][10]

'WSBVoteBot Log for Jan 01 2021 Every time a new submission is posted to wallstreetbets /u/wsbvotebot posts a comment that allows you to click and vote to remove that submission. This is the log of volunteer moderators actions which you can vote to reverse. [Check the leaderboard](https://www.reddit.com/r/wallstreetbets/wiki/leaderboard) to see who is doing the most to keep /r/wallstreetbets great.User commentary as replies to the messages below are encouraged. Report bugs to /u/zjz.'

In [104]:
import re

# URL regex suggested by ChatGPT; it seems to work on the examples checked.
url_pattern = re.compile(r"(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})")
# regex=True is needed for a compiled pattern in pandas 2.x, where str.replace defaults to literal matching.
df_extracted['text'] = df_extracted['text'].str.replace(url_pattern, '', regex=True)

### Remove reddit mentions (Maybe)

In [105]:
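# regex=True is needed for a compiled pattern in pandas 2.x, where str.replace defaults to literal matching.
df_extracted['text'] = df_extracted['text'].str.replace(re.compile(r"(\/u\/[a-zA-Z0-9]+|\/r\/[a-zA-Z0-9]+)"), '', regex=True)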

### Remove Duplicate texts (likely from bots)

In [106]:
before = df_extracted.shape[0]
print(f'Shape before drop duplicates: {df_extracted.shape}')
df_extracted = df_extracted.drop_duplicates(subset=["text"], keep=False)
print(f'Shape after drop duplicates: {df_extracted.shape}')
print(f'Records lost: {before - df_extracted.shape[0]}')

Shape before drop duplicates: (306368, 4)
Shape after drop duplicates: (302003, 4)
Records lost: 4365
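Note that `keep=False` drops every copy of a duplicated text, including the first one, so a genuine post that appears twice in the data disappears completely. A toy comparison with `keep='first'` (the frame below is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'text': ['bot message', 'bot message', 'unique post']})

# keep=False: all rows that share a text are removed.
dropped_all = toy.drop_duplicates(subset=['text'], keep=False)

# keep='first': one copy of each text survives.
kept_first = toy.drop_duplicates(subset=['text'], keep='first')
```

Which behaviour is wanted depends on whether repeated texts are mostly bot spam or mostly duplicated records of real posts.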


In [107]:
df_extracted['text']

10         WSBVoteBot Log for Jan 01 2021 Every time a ne...
14         Hedging your portfolio Just out of curiosity  ...
20         BNGO Bear Case (Serious) I'm actually quite sk...
21         Daily Executions- December 31 2020 Hi Everyone...
41         $GAXY Youtuber London Investor will interview ...
                                 ...                        
1483539    Why do people open up multiple positions of th...
1483546    Five penny stocks to put on your watchlist in ...
1483548    Any suggestion on what to do with an employer ...
1483566    Best Keeper credit cards? What are the best ke...
1483570    im a teen and i want to start investing in ind...
Name: text, Length: 302003, dtype: object

### Extracting what each post is about (ticker information)


Need to add more tickers

In [108]:
import json

# stonks.json maps each ticker to a list of names to search for
with open('stonks.json', 'r') as f:
    ticker_dict = json.load(f)


def find_ticker(text):
    """Return a space-separated string of mentioned tickers, or NaN if none."""
    mentioned = []
    for ticker, names in ticker_dict.items():
        # plain substring match; one hit per ticker is enough
        if any(name in text for name in names):
            mentioned.append(ticker)
    return " ".join(mentioned) if mentioned else np.nan

df_extracted['mentioned'] = df_extracted['text'].apply(find_ticker)
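Because `name in text` is a substring test, a short name can match inside a longer word (for example `BB` inside `BBBY`). A hedged alternative is to match whole words only. The sketch below assumes the same ticker-to-names shape as `ticker_dict`; `toy_tickers`, `compile_aliases`, and `find_ticker_words` are hypothetical names, and it returns `None` instead of `np.nan` when nothing matches.

```python
import re

# Hypothetical alias table in the same shape as ticker_dict: ticker -> list of names.
toy_tickers = {'BB': ['BB', 'BlackBerry'], 'F': ['Ford'], 'GME': ['GME', 'Gamestop']}

def compile_aliases(ticker_dict):
    """Build one word-bounded regex per ticker."""
    return {
        ticker: re.compile(r'\b(?:' + '|'.join(re.escape(n) for n in names) + r')\b')
        for ticker, names in ticker_dict.items()
    }

def find_ticker_words(text, patterns):
    """Return tickers whose aliases appear as whole words, or None."""
    hits = [ticker for ticker, pattern in patterns.items() if pattern.search(text)]
    return ' '.join(hits) if hits else None

patterns = compile_aliases(toy_tickers)
```

Compiling the patterns once up front avoids rebuilding them for each of the roughly 300,000 posts.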


In [109]:
df_extracted['mentioned'].value_counts()
df_extracted[df_extracted['mentioned'] == 'MSFT TSLA AAPL GOOGL NVDA WFC']['text'][912309]

"Was just wondering what to do if you were me I have a IRA with Wells Fargo. My positions are. 1 Amazon. 2 Google. 5 Microsoft. 10 Apple.I'm not interested in doing anything with Google  Microsoft  or Apple. I want to keep all these. My question is what should I do with my 1 Amazon. Obviously I can just keep it. After it fell $50 today and has fallen past days I'm not to far off my average cost again.I'm interested in Nvidia so I could just sell Amazon and grab 5 nvidia to replace it with. I'm also interested in Tesla  so same thing. Could sell Amazon and grab 5 Tesla or something.Like I said though I could just do nothing. I'm assuming Amazon has the possibility to hit $3500 to $4000 by the end of the year. It hasn't done much movement in the past 9 months though.I'm not a newbie to this stock trading thing just wondering about opinions. 1 Amazon  5 Tesla  5 Nvidia  X Shares of something else.Looking for advice. If you're me what you doing tomorrow with my 1 share of amazon?"

In [110]:
df_extracted['mentioned'].info()

<class 'pandas.core.series.Series'>
Int64Index: 302003 entries, 10 to 1483570
Series name: mentioned
Non-Null Count  Dtype 
--------------  ----- 
80118 non-null  object
dtypes: object(1)
memory usage: 4.6+ MB


In [111]:
df_extracted.to_csv('just_dates_and_text.csv', index=False)

In [112]:
df_extracted.head(100)

Unnamed: 0,created,text,upvote_ratio,score,mentioned
10,2021-01-01 00:15:38,WSBVoteBot Log for Jan 01 2021 Every time a ne...,0.5,0,GME
14,2021-01-01 00:18:40,Hedging your portfolio Just out of curiosity ...,0.6,2,
20,2021-01-01 00:24:04,BNGO Bear Case (Serious) I'm actually quite sk...,0.74,42,
21,2021-01-01 00:24:14,Daily Executions- December 31 2020 Hi Everyone...,0.91,9,
41,2021-01-01 00:42:36,$GAXY Youtuber London Investor will interview ...,0.97,31,
47,2021-01-01 00:47:19,Thoughts on Old School Value's stock tracking ...,0.5,0,
57,2021-01-01 00:56:35,GME is the Rockets Gamestop colors: Red Whit...,0.82,57,GME
58,2021-01-01 00:57:44,Western Digital (WDC) rose 11.83% today. Anybo...,0.64,6,
70,2021-01-01 01:08:50,AMC will be back. AMC had a rough year just li...,0.56,2,GME AMC
79,2021-01-01 01:18:17,PLTR - Public Service Announcment Listen up my...,0.9,123,


### Removing punctuation

In [113]:
# raw string so that \w and \s reach the regex engine without an invalid-escape warning
df_extracted['text'] = df_extracted['text'].replace(r'[^\w\s]', '', regex=True)


### Removing capitalization

In [114]:
df_extracted['text'] = df_extracted['text'].str.lower()



### Stop word removal

In [115]:
stop_word_list = set(stopwords.words('english'))


df_extracted['text'] = df_extracted['text'].map(lambda x : " ".join(w for w in x.split() if w not in stop_word_list))

### Lemmatization (instead of stemming)

The following cell takes a while to run (under 5 minutes).

In [116]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize




nltk.download('wordnet')
nltk.download('punkt')
lemmatizer = WordNetLemmatizer()
df_extracted['text'] = df_extracted['text'].apply(lambda x: " ".join(lemmatizer.lemmatize(word) for word in word_tokenize(x)))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\paganinik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\paganinik\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [117]:
df_extracted['text']

10         wsbvotebot log jan 01 2021 every time new subm...
14         hedging portfolio curiosity anyone make move h...
20         bngo bear case serious im actually quite skept...
21         daily execution december 31 2020 hi everyone i...
41         gaxy youtuber london investor interview coo ga...
                                 ...                        
1483539    people open multiple position pair whats benef...
1483546    five penny stock put watchlist 2022 llnw limel...
1483548    suggestion employer pay hi young man working s...
1483566    best keeper credit card best keeper card know ...
1483570    im teen want start investing index stock sp500...
Name: text, Length: 302003, dtype: object

In [118]:
df_extracted.head(10)

Unnamed: 0,created,text,upvote_ratio,score,mentioned
10,2021-01-01 00:15:38,wsbvotebot log jan 01 2021 every time new subm...,0.5,0,GME
14,2021-01-01 00:18:40,hedging portfolio curiosity anyone make move h...,0.6,2,
20,2021-01-01 00:24:04,bngo bear case serious im actually quite skept...,0.74,42,
21,2021-01-01 00:24:14,daily execution december 31 2020 hi everyone i...,0.91,9,
41,2021-01-01 00:42:36,gaxy youtuber london investor interview coo ga...,0.97,31,
47,2021-01-01 00:47:19,thought old school value stock tracking spread...,0.5,0,
57,2021-01-01 00:56:35,gme rocket gamestop color red white blackhoust...,0.82,57,GME
58,2021-01-01 00:57:44,western digital wdc rose 1183 today anybody kn...,0.64,6,
70,2021-01-01 01:08:50,amc back amc rough year like everyone else exc...,0.56,2,GME AMC
79,2021-01-01 01:18:17,pltr public service announcment listen fellow ...,0.9,123,


spectral analysis

### Combining datasets

To simplify, frame the problem as a binary classifier (buy = 1, sell = 0).

In [119]:
ticker_df = pd.read_csv('ticker_data.csv')
ticker_df.head()

# Check the mentioned column
# if its empty do nothing and 

Unnamed: 0,Date,MSFT,TSLA,GME,AMC,BB,NOK,BABA,AAPL,GOOGL,DIS,SNAP,SPOT,NVDA,F,BA,META,MCD,V,WMT,JNJ,JPM,T,VZ,PG,MRK,KO,PFE,XOM,GE,WFC,CSCO,INTC,CMCSA,PEP
0,2021-01-04 00:00:00-05:00,-0.02175,0.01433,-0.092105,-0.086364,-0.01791,-0.025063,0.00596,-0.030782,-0.019244,-0.025129,-0.016852,-0.020226,0.000706,-0.032917,-0.034667,-0.021253,-0.019908,-0.011305,0.015454,-0.004706,-0.012784,0.001701,-0.001866,-0.013175,-0.012683,-0.027824,-0.001627,0.001206,-0.038568,-0.020449,-0.007899,-0.00441,-0.033856,-0.018638
1,2021-01-05 00:00:00-05:00,0.002946,0.015822,0.001153,-0.005025,0.022659,0.012531,0.049552,0.016448,0.008672,0.012713,0.019453,0.0137,0.023283,0.021251,0.033652,0.009989,0.006185,-0.008321,-0.005798,0.013376,0.00528,-0.008136,-0.004757,0.005145,0.00533,-0.002866,0.012799,0.039675,0.032598,0.014623,0.006868,0.023458,-0.003589,0.004373
2,2021-01-06 00:00:00-05:00,0.000377,-0.003309,0.058824,-0.009852,0.0,0.007481,-0.031241,-0.008769,0.013304,0.004486,0.014457,0.01973,-0.045982,0.005688,0.003853,0.005,-0.00142,-0.005519,0.010821,0.016472,0.012858,0.014281,0.008162,0.010017,0.022338,-0.027901,0.001086,0.004504,0.048938,0.024765,0.006118,0.013085,0.029316,0.008894
3,2021-01-07 00:00:00-05:00,0.019856,0.049394,-0.021115,-0.014423,0.045926,-0.004988,-0.01051,0.019944,0.027555,-0.005125,0.041509,0.043851,0.029034,0.013423,-0.003187,0.010681,-0.005816,0.007587,-0.006571,0.008616,0.001326,-0.003664,0.004462,-0.00644,0.017807,-0.002595,0.00054,-0.000889,-0.025086,-0.004467,0.017195,0.013595,0.009163,-0.002381
4,2021-01-08 00:00:00-05:00,0.004299,0.028061,-0.026953,0.023924,0.047091,-0.002538,0.036467,-0.002869,0.011631,-0.000671,-0.013101,0.05406,-0.006417,-0.010989,-0.017368,-0.002758,0.01395,0.005977,-0.001702,-0.002928,0.000368,-0.004118,-0.007385,0.002311,-0.015882,0.020987,-0.000807,0.005752,0.004429,-0.012496,0.01304,-0.015253,0.023431,0.012002
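For the buy/sell classifier mentioned above, one way to derive labels is to threshold these values at zero. This is a sketch under two assumptions: that the columns of `ticker_df` are daily returns, and that a positive value should map to "buy". The toy frame copies a few values from the first rows shown above, with the dates shortened.

```python
import pandas as pd

# Toy frame shaped like ticker_df; values copied from its first two rows.
toy_returns = pd.DataFrame({
    'Date': ['2021-01-04', '2021-01-05'],
    'MSFT': [-0.02175, 0.002946],
    'GME': [-0.092105, 0.001153],
})

# Assumed rule: a positive value means "buy" (1), anything else "sell" (0).
labels = toy_returns.set_index('Date').gt(0).astype(int)
```

A dead zone around zero (for example, ignoring very small moves) would be a natural variation, but no such threshold is defined in this notebook.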


Pseudocode to combine datasets
```
split the dataframe by date into one dataframe per day, where a day runs from 4 pm the previous day until 4 pm that day
get rid of weekend posts

for each day:
    for each record in day:
        for each company in stocks:
            if company in mentioned column:
                add post to that company and that day
remove (company, day) groups that have fewer than ten posts
```

### Splitting into 365 dataframes (one per day)


In [120]:
day_frames = []
# Shifting by 8 hours moves 4 pm to midnight, so each calendar date of new_date
# covers 4 pm the previous day until 4 pm that day.
df_extracted['new_date'] = df_extracted['created'] + pd.Timedelta(hours=8)
groups = df_extracted.groupby(df_extracted['new_date'].dt.date)

for name, group in groups:
    day_frames.append(group[['created', 'new_date', 'text', 'upvote_ratio', 'score', 'mentioned']])
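A small check of the 8-hour shift on made-up timestamps, together with one possible weekend rule. The rule used here (drop rows whose shifted date falls on Saturday or Sunday) is an assumption; rolling Friday-evening posts forward to Monday would be another option.

```python
import pandas as pd

toy = pd.DataFrame({'created': pd.to_datetime([
    '2021-01-04 15:59:00',  # Monday, before 4 pm -> stays on Monday
    '2021-01-04 16:00:00',  # Monday, 4 pm -> counts toward Tuesday
    '2021-01-08 17:30:00',  # Friday evening -> lands on Saturday
])})

toy['new_date'] = toy['created'] + pd.Timedelta(hours=8)
toy['bucket'] = toy['new_date'].dt.date

# Assumed weekend rule: keep only buckets that fall on Monday (0) to Friday (4).
weekdays_only = toy[toy['new_date'].dt.dayofweek < 5]
```

Whether the 4 pm cutoff lines up with the market close depends on the timezone of `created`, which is not established in this notebook.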


In [121]:
day_frames[0].head()

Unnamed: 0,created,new_date,text,upvote_ratio,score,mentioned
10,2021-01-01 00:15:38,2021-01-01 08:15:38,wsbvotebot log jan 01 2021 every time new subm...,0.5,0,GME
14,2021-01-01 00:18:40,2021-01-01 08:18:40,hedging portfolio curiosity anyone make move h...,0.6,2,
20,2021-01-01 00:24:04,2021-01-01 08:24:04,bngo bear case serious im actually quite skept...,0.74,42,
21,2021-01-01 00:24:14,2021-01-01 08:24:14,daily execution december 31 2020 hi everyone i...,0.91,9,
41,2021-01-01 00:42:36,2021-01-01 08:42:36,gaxy youtuber london investor interview coo ga...,0.97,31,


In [122]:
# dropna() removes rows with a missing value in any column, including posts with no ticker in 'mentioned'
day_frames_no_nan = [day.dropna() for day in day_frames]

In [123]:
day_frames_no_nan[0].head()

Unnamed: 0,created,new_date,text,upvote_ratio,score,mentioned
10,2021-01-01 00:15:38,2021-01-01 08:15:38,wsbvotebot log jan 01 2021 every time new subm...,0.5,0,GME
57,2021-01-01 00:56:35,2021-01-01 08:56:35,gme rocket gamestop color red white blackhoust...,0.82,57,GME
70,2021-01-01 01:08:50,2021-01-01 09:08:50,amc back amc rough year like everyone else exc...,0.56,2,GME AMC
83,2021-01-01 01:26:38,2021-01-01 09:26:38,still painful void greenwich ray dalio open lo...,0.64,4,VZ
86,2021-01-01 01:30:43,2021-01-01 09:30:43,gme weird option price action x200b bugging fu...,0.79,22,GME


In [124]:
day_frames_split = []

# the loop variable is named `day` so it does not overwrite the global `df`
for day in day_frames_no_nan:
    has_multiple = day['mentioned'].str.contains(' ')
    # rows that mention several tickers: split the string into a list, then one row per ticker
    split_df = day[has_multiple].copy()
    split_df['mentioned'] = split_df['mentioned'].str.split(' ')
    split_df = split_df.explode('mentioned')
    # rows that mention exactly one ticker stay as they are
    single_df = day[~has_multiple]
    combined = pd.concat([single_df, split_df], sort=False)
    combined = combined.sort_values(by='created').reset_index(drop=True)
    day_frames_split.append(combined)
    



In [130]:
day_frames_split[1]

Unnamed: 0,created,new_date,text,upvote_ratio,score,mentioned
0,2021-01-01 16:32:50,2021-01-02 00:32:50,would list 10 stock portfolio 2 kid aged 10 7 ...,0.7,13,DIS
1,2021-01-01 16:32:50,2021-01-02 00:32:50,would list 10 stock portfolio 2 kid aged 10 7 ...,0.7,13,MSFT
2,2021-01-01 16:32:50,2021-01-02 00:32:50,would list 10 stock portfolio 2 kid aged 10 7 ...,0.7,13,NVDA
3,2021-01-01 16:32:50,2021-01-02 00:32:50,would list 10 stock portfolio 2 kid aged 10 7 ...,0.7,13,MCD
4,2021-01-01 16:32:50,2021-01-02 00:32:50,would list 10 stock portfolio 2 kid aged 10 7 ...,0.7,13,AAPL
5,2021-01-01 16:39:33,2021-01-02 00:39:33,1 year 100 roi challenge january thread first ...,0.92,198,GME
6,2021-01-01 17:21:00,2021-01-02 01:21:00,gme reality fantasy prediction gme going big w...,0.94,241,GME
7,2021-01-01 19:11:41,2021-01-02 03:11:41,watch list early 2021 dd good response last po...,0.99,61,T
8,2021-01-01 19:41:04,2021-01-02 03:41:04,nvda blow im tired one talking heard right im ...,0.59,22,GME
9,2021-01-01 19:41:04,2021-01-02 03:41:04,nvda blow im tired one talking heard right im ...,0.59,22,NVDA


In [128]:
concat_days = []
for day in day_frames_split:
    # one row per ticker, with all of that day's posts joined into a single string
    ret = day.groupby('mentioned')['text'].apply(' '.join).reset_index()
    concat_days.append(ret)
    

In [131]:
concat_days[1]

Unnamed: 0,mentioned,text
0,AAPL,would list 10 stock portfolio 2 kid aged 10 7 ...
1,BABA,prosus nv stock euronext amsterdam prx almost ...
2,BB,bbby short squeeze plan take bbby 6 day q3 era...
3,CMCSA,wsecdr cd projekt red analysis undervalued 37 ...
4,DIS,would list 10 stock portfolio 2 kid aged 10 7 ...
5,F,update chinese delisting chu chl ceoit also af...
6,GME,1 year 100 roi challenge january thread first ...
7,GOOGL,something randomly positive bngo necessarily r...
8,INTC,palantir going moon guide paperhanded dummy al...
9,JPM,starlinkpsth timeline posting oldest throw awa...


Resources on how to do word embeddings:

- https://www.youtube.com/watch?v=ZogxNcyqVqE&ab_channel=TheAIUniversity
- https://www.guru99.com/word-embedding-word2vec.html
- http://web.stanford.edu/class/cs224n/
- https://www.youtube.com/playlist?list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ

### Word Embeddings - Word2Vec




Training the word2vec model

### Figuring out which companies we can predict for

Since it would be unfair to ask a model to predict the price of a stock that is not mentioned in the data it is given, we need to restrict which companies we predict for.

If we ask the model to predict Tesla's price one hour ahead based on the Reddit posts from the previous 5 hours, Tesla needs to be mentioned in those 5 hours.

I think a reasonable threshold is at least 5 posts in the last 5 hours for a window to be included as a training example.

To decrease the likelihood of the target company not being mentioned, we can increase the time window.
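The threshold could be checked per ticker and prediction time roughly as follows. This is a sketch, not a settled design: `has_enough_mentions` is a hypothetical helper, the defaults mirror the 5 posts in 5 hours suggested above, and it expects one ticker per row in `mentioned` (as in the exploded day frames).

```python
import pandas as pd

def has_enough_mentions(posts, ticker, target_time, window_hours=5, min_posts=5):
    """True if `ticker` appears in at least `min_posts` posts in the
    window of `window_hours` ending just before `target_time`."""
    start = target_time - pd.Timedelta(hours=window_hours)
    in_window = (posts['created'] >= start) & (posts['created'] < target_time)
    return int((in_window & (posts['mentioned'] == ticker)).sum()) >= min_posts

# Toy data: five posts at 10:00 and one at 03:00, all mentioning TSLA.
toy = pd.DataFrame({
    'created': pd.to_datetime(['2021-01-04 10:00'] * 5 + ['2021-01-04 03:00']),
    'mentioned': ['TSLA'] * 6,
})
target = pd.Timestamp('2021-01-04 14:00')
```

Widening `window_hours` makes more examples pass the check, at the cost of using older posts for each prediction.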