# Data Collection

## Contents:
-  [Digital Nomad Data Collection](#Digital-Nomad-Data-Collection)
-  [Solo Travel Data Collection](#Solo-Travel-Data-Collection)
-  [Save as JSON](#Save-as-JSON)

In [1]:
!pip install psaw



In [2]:
from psaw import PushshiftAPI
import datetime as dt
import pandas as pd
import json
import time

Instantiate the `PushshiftAPI` client.

In [3]:
api = PushshiftAPI()

## Digital Nomad Data Collection

We collect posts from our first subreddit: r/digitalnomad.
-  To retrieve recent posts, we set `after` to January 1, 2018, so only posts submitted after that date are collected.
-  We request only the `title`, `selftext` and `url` of each post by passing them to `filter`.
-  To avoid class imbalance, we cap the collection at 6,000 posts with `limit`.

In [4]:
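# Query Pushshift for r/digitalnomad submissions created after January 1, 2018.
# The naive datetime is interpreted in the local timezone of the machine.
dn_posts = list(api.search_submissions(after=int(dt.datetime(2018, 1, 1).timestamp()),
                                       subreddit='digitalnomad',
                                       filter=['title', 'selftext', 'url'],
                                       limit=6000))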

A total of 5,705 posts were collected from January 1, 2018 to December 19, 2018. This is below the 6,000-post limit, so the cap did not truncate the results.
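The `after` value above comes from a naive `datetime`, whose `.timestamp()` depends on the local timezone of the machine running the notebook. A minimal sketch, assuming a UTC boundary is wanted, that produces the same epoch value on any machine. The helper name `utc_epoch` is hypothetical and not part of psaw.

```python
import datetime as dt


def utc_epoch(year, month, day):
    """Return the Unix timestamp (whole seconds) for midnight UTC on a date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


# Start date used for the r/digitalnomad query, pinned to UTC.
dn_after = utc_epoch(2018, 1, 1)
print(dn_after)
```

The resulting integer could be passed as `after` in place of the inline `int(dt.datetime(...).timestamp())` expression.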

## Solo Travel Data Collection

We collect posts from our second subreddit: r/solotravel.
-  We specify August 10, 2018 for `after` because we want the most recent posts. r/solotravel is much more active than r/digitalnomad, and as a result, has more posts in a shorter time frame.
-  Again, we request only the `title`, `selftext` and `url` of each post by passing them to `filter`.
-  Knowing that r/solotravel has more posts, we set a maximum of 6,000 posts to undersample our majority class and prevent class imbalance.

In [5]:
# Query Pushshift for r/solotravel submissions created after August 10, 2018.
st_posts = list(api.search_submissions(after=int(dt.datetime(2018, 8, 10).timestamp()),
                                       subreddit='solotravel',
                                       filter=['title', 'selftext', 'url'],
                                       limit=6000))

In [6]:
len(st_posts)

6000

We collected 6,000 posts from August 10, 2018 to December 19, 2018. Because the count equals the `limit` exactly, more posts appear to be available if needed.
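To see how balanced the two classes end up, here is a minimal sketch using the post counts reported in this notebook. The helper `class_shares` and the variable names are illustrative and not part of the original code; in the notebook itself the counts would come from `len(dn_posts)` and `len(st_posts)`.

```python
def class_shares(counts):
    """Map each class label to its fraction of all collected posts."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


limit = 6000
# Post counts reported in this notebook for each subreddit.
counts = {'digitalnomad': 5705, 'solotravel': 6000}

shares = class_shares(counts)
# A count equal to the limit suggests the query was cut off by the cap.
capped = {label: n >= limit for label, n in counts.items()}

for label in counts:
    print(f'{label}: {shares[label]:.1%} of posts, capped={capped[label]}')
```

With these counts the two classes are close to an even split, which is the purpose of capping both queries at the same `limit`.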

## Save as JSON

Before saving the data as JSON files, we record when the data was collected. Both subreddits are continuously updated, so including the collection time in each file name helps us determine how up to date the data is and how relevant our findings are.

In [7]:
now = time.time()

In [8]:
with open(f'../data/dn_posts_{now:.0f}.json', 'w+') as f:
    json.dump(dn_posts, f)

In [9]:
with open(f'../data/st_posts_{now:.0f}.json', 'w+') as f:
    json.dump(st_posts, f)
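Because the collection time is embedded in each file name as a Unix timestamp, it can be recovered later without opening the file. A minimal sketch, assuming the `<name>_<timestamp>.json` pattern used above; the helper `collected_at` and the example path are hypothetical.

```python
import datetime as dt
import re


def collected_at(path):
    """Parse the trailing Unix timestamp in a file name into a UTC datetime."""
    match = re.search(r'_(\d+)\.json$', path)
    if match is None:
        raise ValueError(f'no timestamp found in {path!r}')
    return dt.datetime.fromtimestamp(int(match.group(1)), tz=dt.timezone.utc)


# Hypothetical file name following the pattern used when saving.
example_path = '../data/dn_posts_1545264000.json'
print(collected_at(example_path).isoformat())
```

Returning a timezone-aware `datetime` avoids ambiguity when the files are read on a machine in a different timezone from the one that collected the data.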