# Exploration of Kaggle Data
This notebook builds time-series data from a Kaggle dataset of Reddit posts. The 'Settings' section offers several options for how the time series is constructed. The notebook has been tested on the following datasets:
- [Kaggle Reddit Dataset: dataisbeautiful](https://www.kaggle.com/datasets/unanimad/dataisbeautiful)
- [Kaggle Wall Street Bets Dataset](https://www.kaggle.com/datasets/gpreda/reddit-wallstreetsbets-posts)
#### Settings

In [77]:
# Select Dataset for Exploration (options: 'wsb_df', 'all_df')
dataset = "wsb_df"
# Select path to the data set
path_to_data = "../../Data/reddit_wsb.csv"

In [78]:
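# Select the bin method ('seconds' for fixed-width bins of bin_size seconds, 'day' for one bin per calendar day)
bin_method = "day"

# Select the bin size in seconds (e.g. 604800 for a week); only used when bin_method is 'seconds'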
bin_size = 604800
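
The two bin methods can be compared on a handful of timestamps. The sketch below is illustrative only: the timestamps are made up, and the day bins are computed in UTC, which is an assumption (the notebook itself converts with the machine's local time zone).

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds: a start point, 3 days later, 8 days later
start = 1_600_000_000
created = pd.Series([start, start + 3 * 86_400, start + 8 * 86_400])

bin_size = 604800  # one week: 7 * 24 * 60 * 60 seconds

# 'seconds' method: integer bin index counted from the earliest timestamp
week_bins = (created - created.min()) // bin_size

# 'day' method: one bin per calendar day (UTC assumed in this sketch)
day_bins = pd.to_datetime(created, unit="s", utc=True).dt.date

print(week_bins.tolist())
print(day_bins.tolist())
```

With a one-week bin size, the first two posts share a bin and the third falls into the next one, while the day method puts each post into its own bin.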

In [79]:
# Select text processing method ('tfidf' or 'target_words')
proc_method = 'target_words'

# If 'target_words' was selected, list the words to count
target_words = ['GME', 'GameStop']

#### Load Data

In [80]:
#imports
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Load the posts into a DataFrame
df = pd.read_csv(path_to_data, low_memory=False)
if dataset == "all_df":
    # Rename the timestamp column so both datasets expose 'created'
    df = df.rename({'created_utc': 'created'}, axis=1)
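
Since the later cells rely on the columns `title`, `score`, and `created`, it can help to check for them right after loading. The helper below is a hypothetical sketch, not part of the original notebook; `normalise_columns` and `REQUIRED_COLUMNS` are names introduced here for illustration.

```python
import io

import pandas as pd

# Columns the later cells rely on
REQUIRED_COLUMNS = ["title", "score", "created"]


def normalise_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Rename 'created_utc' to 'created' if needed and verify required columns."""
    if "created" not in frame.columns and "created_utc" in frame.columns:
        frame = frame.rename(columns={"created_utc": "created"})
    missing = [name for name in REQUIRED_COLUMNS if name not in frame.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return frame


# Usage with a tiny in-memory CSV (made-up row)
raw = pd.read_csv(io.StringIO("title,score,created_utc\nhello,1,1600000000\n"))
checked = normalise_columns(raw)
print(list(checked.columns))
```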


In [82]:
df.head()

Unnamed: 0.1,Unnamed: 0,author,author_created_utc,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,...,gildings.gid_2,gildings.gid_3,author_cakeday,distinguished,edited,gilded,is_submitter,locked,all_awardings,total_awards_received
0,0,Nm0369,1493507000.0,,,[],,,,text,...,0.0,0.0,,,,,,,,
1,1,achennupati,1487615000.0,,,[],,,,text,...,0.0,0.0,,,,,,,,
2,2,grissomza,1443582000.0,,,[],,,,text,...,0.0,0.0,,,,,,,,
3,3,ktdfintech,1528505000.0,,,[],,,,text,...,0.0,0.0,,,,,,,,
4,4,Lester_Diamond23,1503953000.0,,,[],,,,text,...,0.0,0.0,,,,,,,,


#### Small Data Analysis

In [68]:
start_date = datetime.fromtimestamp(df.created.min())
end_date = datetime.fromtimestamp(df.created.max())

In [69]:
print(f"The dataset is from {start_date} to {end_date} and has {df.shape[0]} datapoints.")

The dataset is from 2020-09-29 02:46:56 to 2021-08-16 08:26:20 and has 53187 datapoints.
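
Called without a `tz` argument, `datetime.fromtimestamp` returns the machine's local time, so the printed range can shift by a few hours from one machine to another. A minimal sketch of a time-zone-aware alternative, assuming `created` holds Unix timestamps in seconds and using a made-up timestamp:

```python
from datetime import datetime, timezone

timestamp = 1_600_000_000  # hypothetical value, not taken from the dataset

# Naive datetime in the local time zone of the machine running the code
local_naive = datetime.fromtimestamp(timestamp)

# Time-zone-aware datetime in UTC, identical on every machine
utc_aware = datetime.fromtimestamp(timestamp, tz=timezone.utc)

print(utc_aware.isoformat())
```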


#### Data To Timeseries

In [70]:
# Drop all irrelevant columns (copy so later column assignments do not trigger SettingWithCopyWarning)
df = df[['title','score','created']].copy()

In [71]:
# Bin the posts according to bin_method
if bin_method == 'seconds':
    # Integer bin index counted from the earliest post
    earliest = df['created'].min()
    df['bins'] = df['created'].map(lambda timestamp: int((timestamp - earliest) // bin_size))
elif bin_method == 'day':
    # One bin per calendar day (in the local time zone of the machine)
    df['bins'] = df['created'].map(lambda timestamp: datetime.fromtimestamp(timestamp).date())
else:
    raise ValueError(f"Unknown bin_method: {bin_method!r}")

In [72]:
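# Join all titles in a bin into one string; the space keeps adjacent titles from fusing into one token
aggregation_dict = {'title': ' '.join}
df = df.groupby('bins').aggregate(aggregation_dict)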

#### Text Processing

In [73]:
if proc_method=='tfidf':
    # Creates a tfidf vectorizer 
    tfidf = TfidfVectorizer(
        analyzer='word',
        lowercase=True,
        stop_words='english',
        max_features=1000
    )
    # Creates tfidf matrix
    features = tfidf.fit_transform(df['title']).toarray()
    index = df.index
    # Creates a new DataFrame from the tfidf matrix
    df = pd.DataFrame(
        data=features,
        columns=range(len(features[0])),
        index=index,
    )

In [74]:
if proc_method == 'target_words':
    # Count case-sensitive substring occurrences of each target word per bin
    for word in target_words:
        df[word] = df['title'].map(lambda title: title.count(word))
    # Total count across all target words
    df['aggregat'] = df[target_words].sum(axis=1)
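
`str.count` matches case-sensitive substrings, so `gme` is missed and a longer ticker such as `GMEX` would be counted as `GME`. If that is not wanted, one possible alternative is a case-insensitive whole-word match. The sketch below uses made-up titles; the `total` column name is introduced here for illustration.

```python
import re

import pandas as pd

# Hypothetical per-bin title strings
titles = pd.Series(["GME gme GameStop", "gamestop and GMEX"], index=["d1", "d2"])
target_words = ["GME", "GameStop"]

counts = pd.DataFrame(index=titles.index)
for word in target_words:
    # \b anchors the match at word boundaries; re.escape guards special characters
    pattern = rf"\b{re.escape(word)}\b"
    counts[word] = titles.str.count(pattern, flags=re.IGNORECASE)

# Total across all target words
counts["total"] = counts[target_words].sum(axis=1)
print(counts)
```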

In [75]:
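# Keep only the count columns (they exist only for the 'target_words' method)
if proc_method == 'target_words':
    df = df[target_words + ['aggregat']]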

In [76]:
df.head(3)

bins,GME,GameStop,aggregat
2020-09-29,0,0,0
2021-01-28,240,17,257
2021-01-29,2943,124,3067


#### Save Data

In [135]:
#df.to_csv('wsb_time_series.csv')
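
When the saved file is read back, the `bins` index arrives as plain strings unless it is parsed explicitly. A minimal round-trip sketch, writing to an in-memory buffer instead of a file and reusing two rows from the output above:

```python
import io

import pandas as pd

# Two rows copied from the df.head(3) output above
series_df = pd.DataFrame(
    {"GME": [0, 240], "GameStop": [0, 17], "aggregat": [0, 257]},
    index=pd.Index(["2020-09-29", "2021-01-28"], name="bins"),
)

# Write to an in-memory buffer (stands in for a CSV file on disk)
buffer = io.StringIO()
series_df.to_csv(buffer)
buffer.seek(0)

# Restore the index and parse it as dates
restored = pd.read_csv(buffer, index_col="bins", parse_dates=True)
print(restored.index)
```

This applies to the `'day'` bin method; with `'seconds'` the index holds integer bin numbers and `parse_dates` should be left out.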