# 01 - Data Collection with PRAW/Reddit API

## r/AskAstronomy vs. r/AskAstrologers

__Problem statement__:

Basic stats:

- [r/AskAstronomy](https://www.reddit.com/r/askastronomy/): 54,000 users (top 2%)
- [r/AskAstrologers](https://www.reddit.com/r/askastrologers/): 147,000 users (top 1%)



Outline

1. Define functions that retrieve posts from subreddit listings through the Reddit API (via PRAW) and combine them into a dataframe.

2. Collect data from the past ~24 hours from each subreddit and merge into a single `DataFrame`.

3. Save the raw dataframe as a CSV file.

    


In [1]:
import praw
import pandas as pd


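Create a `praw.Reddit` instance. With no arguments, PRAW looks up the credentials (client ID, client secret, user agent) in a `praw.ini` file or in `praw_`-prefixed environment variables.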

In [2]:
reddit = praw.Reddit()
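
If no `praw.ini` is available, the credentials can also be passed to `praw.Reddit` as keyword arguments. The following is a minimal sketch, assuming the credentials live in environment variables; the helper `reddit_kwargs_from_env` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are hypothetical and not part of this notebook.

```python
import os

def reddit_kwargs_from_env(env=os.environ):
    """Collect keyword arguments for praw.Reddit from environment variables.

    The variable names are assumptions made for this sketch.
    """
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a readable message if anything is unset or empty
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}

# Usage (needs praw installed and real credentials):
# reddit = praw.Reddit(**reddit_kwargs_from_env())
```

Keeping the secrets out of the notebook this way means the file can be shared without exposing them.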

## 1. Functions for scraping subreddit data

In [3]:
# retrieve the newest posts from a subreddit, or from several joined with "+"
def get_posts(subreddit, limit=5000, kind='new', last_post=None):

    # only the "new" listing is supported; fail clearly for anything else
    if kind != 'new':
        raise ValueError(f"unsupported listing kind: {kind!r}")

    # resume after the post whose fullname is `last_post`, if one is given
    listing_kwargs = {'limit': limit}
    if last_post is not None:
        listing_kwargs['params'] = {'after': last_post}
    posts = reddit.subreddit(subreddit).new(**listing_kwargs)

    # loop through the posts and build a list of dictionaries
    posts_list = []
    for post in posts:
        posts_list.append(
            {
                'title': post.title,
                'selftext': post.selftext,
                'subreddit': post.subreddit,  # Subreddit object; converted to str later
                'created_utc': post.created_utc,
                'name': post.name
            }
        )

    # return a dataframe with one row per post
    return pd.DataFrame(posts_list)


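import datetime
# convert the epoch-seconds column to a pandas DatetimeIndex (naive UTC) and sort newest first
def format_time_col(df):

    # utcfromtimestamp is deprecated since Python 3.12: convert with an explicit
    # UTC timezone, then drop the tzinfo to keep naive UTC datetimes
    utc_created = [
        datetime.datetime.fromtimestamp(int(ts), tz=datetime.timezone.utc).replace(tzinfo=None)
        for ts in df['created_utc']
    ]
    df['created_utc'] = pd.DatetimeIndex(utc_created)
    df.sort_values(by='created_utc', ascending=False, inplace=True)
    return df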


# create merged dataframe of the listed subreddits to save
def get_subreddit_data(subreddit_list, limit=5000, last_post=None):
    # retrieve the posts; "a+b" requests a combined listing of subreddits a and b
    df = get_posts("+".join(map(str, subreddit_list)), limit=limit, last_post=last_post)

    # convert the Subreddit objects to lowercase name strings
    df['subreddit'] = df['subreddit'].map(lambda x: str(x).lower())

    # format time column
    df = format_time_col(df)

    # return dataframe with a fixed column order
    return df[['name', 'created_utc', 'title', 'selftext', 'subreddit']]
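
The timestamp handling in `format_time_col` can be tried in isolation. The sketch below uses only the standard library and timezone-aware datetimes; the helper names `epoch_to_utc` and `sort_newest_first` and the sample posts are hypothetical, made up for illustration.

```python
from datetime import datetime, timezone

def epoch_to_utc(ts):
    """Convert epoch seconds (as in `created_utc`) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc)

def sort_newest_first(posts):
    """Return the posts (dicts with a 'created_utc' key) sorted newest first."""
    return sorted(posts, key=lambda p: p["created_utc"], reverse=True)

# Hypothetical sample data, not taken from either subreddit
sample = [
    {"name": "t3_a", "created_utc": 1706385967.0},
    {"name": "t3_b", "created_utc": 1706388901.0},
]
newest = sort_newest_first(sample)[0]
print(newest["name"], epoch_to_utc(newest["created_utc"]).isoformat())
```

Sorting newest first matters later on: the first row then carries the most recent timestamp in the collection.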



## 2. Get the r/askastrologers and r/askastronomy dataframes

In [4]:
subreddit_list = ['askastrologers','askastronomy']

df_merged = get_subreddit_data(subreddit_list,limit=10_000)
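
The call above keeps every post the listing returns. If a strict time window is wanted, for example the ~24 hours mentioned in the outline, one option is to filter on the raw epoch timestamps. This is a sketch only: `within_last_hours` is a hypothetical helper, and it works on plain dictionaries rather than on the dataframe.

```python
from datetime import datetime, timedelta, timezone

def within_last_hours(posts, hours=24, now=None):
    """Keep posts whose 'created_utc' (epoch seconds) falls in the last `hours` hours."""
    # `now` can be injected to make the result reproducible
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(hours=hours)).timestamp()
    return [p for p in posts if p["created_utc"] >= cutoff]

# Hypothetical example with a fixed reference time
reference = datetime(2024, 1, 28, 12, 0, 0, tzinfo=timezone.utc)
posts = [
    {"name": "t3_old", "created_utc": reference.timestamp() - 30 * 3600},
    {"name": "t3_recent", "created_utc": reference.timestamp() - 2 * 3600},
]
recent = within_last_hours(posts, hours=24, now=reference)
print([p["name"] for p in recent])
```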

### Check shape and some example posts

In [5]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1985 entries, 0 to 1984
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   name         1985 non-null   object        
 1   created_utc  1985 non-null   datetime64[ns]
 2   title        1985 non-null   object        
 3   selftext     1985 non-null   object        
 4   subreddit    1985 non-null   object        
dtypes: datetime64[ns](1), object(4)
memory usage: 93.0+ KB


In [6]:
df_merged['subreddit'].value_counts()

askastrologers    997
askastronomy      988
Name: subreddit, dtype: int64

__The two subreddits are nearly balanced: 997 posts from r/AskAstrologers and 988 from r/AskAstronomy.__
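
To put a number on the balance, the counts can be turned into shares. The sketch below rebuilds the label column from the counts printed above; `class_shares` is a hypothetical helper, and the standard library's `Counter` stands in for `value_counts(normalize=True)`.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Labels rebuilt from the value_counts() output above
labels = ["askastrologers"] * 997 + ["askastronomy"] * 988
shares = class_shares(labels)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

With these counts the larger class makes up roughly 50.2% of the rows.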

## 3. Save merged data as csv file

In [7]:
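# the rows are sorted newest first, so the first row holds the most recent post time
timestamp = df_merged['created_utc'].iloc[0].strftime('%Y-%m-%d_%H-%M-%S')
df_merged.to_csv(f"../data/data_raw_{'_'.join(subreddit_list)}_{timestamp}.csv", index=False)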
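
The file name encodes the subreddit names and the timestamp of the newest post. A small helper makes that convention easier to reuse and to check; `raw_csv_path` is a hypothetical name, and the example timestamp is chosen for illustration.

```python
from datetime import datetime
from pathlib import Path

def raw_csv_path(subreddits, newest, data_dir="../data"):
    """Build the output path from the subreddit names and the newest post time."""
    stamp = newest.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(data_dir) / f"data_raw_{'_'.join(subreddits)}_{stamp}.csv"

path = raw_csv_path(["askastrologers", "askastronomy"], datetime(2024, 1, 27, 20, 6, 7))
print(path.name)
```

A pandas `Timestamp` such as `df_merged['created_utc'].iloc[0]` also has `strftime`, so it could be passed as `newest` directly.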

In [19]:
# import os,sys

# flist = [str(f) for f in list(os.listdir('../data')) if 'raw_askastrologers' in f]
# print(flist)
# df_list = [pd.read_csv(f'../data/{f}',index_col='name',
#                        parse_dates=['created_utc']) for f in flist]

# flist = [flist[i] for i in range(len(df_list)) if 'post' in df_list[i].columns]


['data_raw_askastrologers_askastronomy_2024-01-27_20-06-07.csv', 'data_raw_askastrologers_askastronomy_2024-01-27_20-55-01.csv', 'data_raw_askastrologers_askastronomy_2024-01-29_14-49-04.csv', 'data_raw_askastrologers_askastronomy_2024-01-29_22-51-36.csv']


In [20]:
# df_list = [pd.read_csv(f'../data/{f}',index_col='name',
#                        parse_dates=['created_utc']) for f in flist]

In [21]:
# df_list[0].columns

Index(['Unnamed: 0', 'title', 'selftext', 'subreddit', 'created_utc', 'post'], dtype='object')

In [22]:
# [df.drop(columns=['post'],inplace=True) for df in df_list]

[None, None]

In [23]:
# [df_list[i].to_csv(flist[i]) for i in range(len(flist))]

[None, None]