# POC: Extract data from the 4chan API

This notebook is a proof of concept to extract data from the 4chan API and store it as Parquet files.

We need to produce two types of files:

- one file listing the threads, named `threads_{timestamp}.parquet`
- one file per thread containing its posts, named `posts_{thread_id}_{number_of_posts}.parquet`

Once we have those files, we can use the scripts developed here to build our data pipelines.
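
The two naming patterns above can be captured in a few small helpers. This is a minimal sketch, not part of the notebook: the function names `threads_filename`, `posts_filename` and `parse_posts_filename` are hypothetical, and the `YYYYMMDD_HHMMSS` timestamp format is the one used in Step 1.

```python
import re
from datetime import datetime
from typing import Tuple


def threads_filename(now: datetime) -> str:
    """Return 'threads_{timestamp}.parquet' with a YYYYMMDD_HHMMSS timestamp."""
    return f"threads_{now.strftime('%Y%m%d_%H%M%S')}.parquet"


def posts_filename(thread_id: int, number_of_posts: int) -> str:
    """Return 'posts_{thread_id}_{number_of_posts}.parquet'."""
    return f"posts_{thread_id}_{number_of_posts}.parquet"


# Both fields are plain integers, so the name can be parsed back unambiguously.
_POSTS_RE = re.compile(r"^posts_(\d+)_(\d+)\.parquet$")


def parse_posts_filename(name: str) -> Tuple[int, int]:
    """Recover (thread_id, number_of_posts) from a posts file name."""
    match = _POSTS_RE.match(name)
    if match is None:
        raise ValueError(f"not a posts file name: {name!r}")
    return int(match.group(1)), int(match.group(2))
```

Parsing the name back gives a pipeline a cheap way to read the thread id and post count of an existing file without opening it.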


[Documentation of the library used to get new data](https://basc-py4chan.readthedocs.io/en/latest/index.html)

## Step 1: Get the list of all threads on /pol/


In [None]:
import pandas as pd
import basc_py4chan

# First, we need to create a board object. This is the object that will be used to access the board.
board = basc_py4chan.Board('pol')

# Now we can retrieve all the threads on the board.
threads = board.get_all_threads(expand=False)
threads_ids = board.get_all_thread_ids()
print('There are', len(threads), 'active threads on /pol/')

In [None]:
# Build one row per thread, then create the dataframe in a single call
# (much cheaper than concatenating a one-row dataframe per iteration).
# thread.id is used rather than threads_ids[i] because the two lists come
# from separate API calls and may not line up if the board changes in between.
thread_rows = []
for thread in threads:
    thread_rows.append({'thread_id': thread.id,
                        'is_sticky': thread.sticky,
                        'is_closed': thread.closed,
                        'topic': thread.topic.text_comment,
                        'number_of_posts': len(thread.all_posts),
                        'url': thread.url})
threads_df = pd.DataFrame(thread_rows)

In [None]:
threads_df.head()

In [None]:
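# Create the exported parquet file named 'threads_{timestamp}.parquet'
from pathlib import Path

Path('data').mkdir(exist_ok=True)  # to_parquet does not create missing directories
timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
threads_df.to_parquet(f'data/threads_{timestamp}.parquet')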

In [None]:
# Load the parquet file written above and print the first 5 rows
threads_df_parquet = pd.read_parquet(f'data/threads_{timestamp}.parquet')
threads_df_parquet.head()

![Success](https://media.giphy.com/media/a0h7sAqON67nO/giphy.gif)

## Step 2: Get the list of all posts for each thread

In [None]:
# Create and export one parquet file per thread, containing every post of that thread

for thread in threads:
    post_rows = []
    for post in thread.all_posts:
        post_dict = {'thread_id': thread.id,
                     'post_id': post.post_id,
                     'poster_id': post.poster_id,
                     'poster_name': post.name,
                     'is_op': post.is_op,
                     'tripcode': post.tripcode,
                     'email': post.email,
                     'subject': post.subject,
                     'comment': post.text_comment,
                     'has_file': post.has_file,
                     'post_datetime': post.datetime,
                     'url': post.url,
                     # File metadata is only available when the post has an attachment
                     'file_name': post.file.filename_original if post.has_file else None,
                     'file_extension': post.file.file_extension if post.has_file else None}
        post_rows.append(post_dict)

    # Build the dataframe once per thread instead of concatenating row by row
    thread_df = pd.DataFrame(post_rows)
    thread_df.to_parquet(f'data/posts_{thread.id}_{len(post_rows)}.parquet')
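
The row-building logic in this step can also be pulled out into a function, so it can be exercised without calling the API. The sketch below relies only on the post attributes already used in this notebook. `post_to_row` is a hypothetical helper name, and the stub objects built with `SimpleNamespace` are made-up stand-ins for real `basc_py4chan` posts.

```python
from types import SimpleNamespace


def post_to_row(thread_id, post):
    """Flatten one post object into a dict suitable for a dataframe row."""
    has_file = bool(post.has_file)
    return {
        'thread_id': thread_id,
        'post_id': post.post_id,
        'poster_id': post.poster_id,
        'poster_name': post.name,
        'is_op': post.is_op,
        'tripcode': post.tripcode,
        'email': post.email,
        'subject': post.subject,
        'comment': post.text_comment,
        'has_file': has_file,
        'post_datetime': post.datetime,
        'url': post.url,
        # post.file is only read when an attachment exists
        'file_name': post.file.filename_original if has_file else None,
        'file_extension': post.file.file_extension if has_file else None,
    }


# Made-up stubs standing in for real posts (values are illustrative only)
stub_file = SimpleNamespace(filename_original='cat.png', file_extension='.png')
post_with_file = SimpleNamespace(
    post_id=2, poster_id='abc123', name='Anonymous', is_op=False,
    tripcode=None, email=None, subject=None, text_comment='hello',
    has_file=True, datetime=None, url='https://example.invalid/thread#p2',
    file=stub_file)
post_without_file = SimpleNamespace(
    post_id=3, poster_id='def456', name='Anonymous', is_op=False,
    tripcode=None, email=None, subject=None, text_comment='world',
    has_file=False, datetime=None, url='https://example.invalid/thread#p3')

rows = [post_to_row(1, post_with_file), post_to_row(1, post_without_file)]
```

A list of such rows can be passed directly to `pd.DataFrame(rows)`.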

In [None]:
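# Load a parquet file and print the first 5 rows.
# File names end with the number of posts, which varies, so match it with a wildcard.
from pathlib import Path

posts_path = next(Path('data').glob('posts_420846947_*.parquet'))
thread_df_parquet = pd.read_parquet(posts_path)
thread_df_parquet.head()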

![Success](https://media.giphy.com/media/Od0QRnzwRBYmDU3eEO/giphy.gif)

We now have proof that we can extract data from the 4chan API and store it as Parquet files.
We can now use this approach to build our data pipelines.

## Test the output of the first pipeline: API to GCP buckets

In [None]:
threads = pd.read_parquet('threads_20230324_13.parquet')
posts = pd.read_parquet('posts_20230324_13.parquet')

In [None]:
threads.head()

In [None]:
posts.head()

## Test the dbt Cloud API

In [None]:
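import os
import requests

# Request the dbt Cloud API to get the list of jobs

account_id = '148403'
# Read the API token from the environment instead of hard-coding it in the notebook.
# DBT_CLOUD_API_TOKEN is a variable name chosen here; use whatever name you export.
auth_token = os.environ['DBT_CLOUD_API_TOKEN']

url = f'https://cloud.getdbt.com/api/v2/accounts/{account_id}/jobs/'

# dbt Cloud expects the token in the Authorization header, prefixed with "Token"

headers = {'Authorization': f'Token {auth_token}', 'Content-Type': 'application/json'}

# Make the request and print the response status

response = requests.get(url, headers=headers)

print(response.status_code)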
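
A status code alone does not show whether the job list came back as expected. As a hedged sketch, and assuming the v2 jobs endpoint wraps its results in a top-level `data` list whose items carry `id` and `name` fields (worth verifying against the dbt Cloud API reference), the payload could be summarised as below. `summarize_jobs` and the sample payload are made up for illustration.

```python
def summarize_jobs(payload):
    """Return (id, name) pairs from a decoded jobs response."""
    # Assumption: jobs are listed under the top-level 'data' key
    jobs = payload.get('data') or []
    return [(job.get('id'), job.get('name')) for job in jobs]


# Made-up payload in the assumed shape; a real call would pass response.json()
sample_payload = {'data': [{'id': 1, 'name': 'daily run'},
                           {'id': 2, 'name': 'hourly run'}]}
print(summarize_jobs(sample_payload))
```

With a live response, the equivalent call would be `summarize_jobs(response.json())`, ideally after `response.raise_for_status()` so that HTTP errors surface immediately.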
