# POC : Extract data from 4chan API

This notebook is a proof of concept to extract data from the 4chan API and store it as Parquet files.

We need to extract 2 types of files:

- 1 file for threads named `threads_{timestamp}.parquet`
- 1 file per thread named `posts_{thread_id}_{number_of_posts}.parquet`

Once we have those files, we can use the scripts developed here to build our data pipelines.


[Documentation of the library used to get new data](https://basc-py4chan.readthedocs.io/en/latest/index.html)

## Step 1 : Get the list of all threads on /pol/


In [1]:
import pandas as pd
import basc_py4chan

# First, we need to create a board object. This is the object that will be used to access the board.
board = basc_py4chan.Board('pol')

# Now we can retrieve all the threads on the board.
threads = board.get_all_threads(expand=False)
threads_ids = board.get_all_thread_ids()
print('There are', len(threads), 'active threads on /pol/')

There are 202 active threads on /pol/


In [2]:
# For every thread, we can populate a dataframe with the thread's information.
threads_df = pd.DataFrame()
for i, thread in enumerate(threads):
    thread_dict = {'thread_id': threads_ids[i],
                   'is_sticky': thread.sticky,
                   'is_closed': thread.closed,
                   'topic': thread.topic.text_comment,
                   'number_of_posts': len(thread.all_posts),
                   'url': thread.url}
    new_row = pd.DataFrame(thread_dict, index=[0])
    threads_df = pd.concat([threads_df, new_row], axis=0)

In [3]:
threads_df.head()

Unnamed: 0,thread_id,is_sticky,is_closed,topic,number_of_posts,url
0,124205675,True,True,"This board is for the discussion of news, worl...",1,http://boards.4chan.org/pol/thread/124205675
0,259848258,True,True,Check the catalog before posting a new thread!...,1,http://boards.4chan.org/pol/thread/259848258
0,420846947,False,False,Would Hitler really redeem the white race like...,163,http://boards.4chan.org/pol/thread/420835411
0,420847465,False,False,How come no one asks who this racist n(ger is ...,5,http://boards.4chan.org/pol/thread/420848571
0,420846799,False,False,Regardless of how retarded the Vietnam war was...,2,http://boards.4chan.org/pol/thread/420848774


In [4]:
# Create the exported parquet file named 'threads_{timestamp}.parquet'
timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
threads_df.to_parquet(f'data/threads_{timestamp}.parquet')

In [5]:
# Load the parquet file and print the first 5 rows
threads_df_parquet = pd.read_parquet(f'data/threads_{timestamp}.parquet')
threads_df_parquet.head()

Unnamed: 0,thread_id,is_sticky,is_closed,topic,number_of_posts,url
0,124205675,True,True,"This board is for the discussion of news, worl...",1,http://boards.4chan.org/pol/thread/124205675
0,259848258,True,True,Check the catalog before posting a new thread!...,1,http://boards.4chan.org/pol/thread/259848258
0,420846947,False,False,Would Hitler really redeem the white race like...,163,http://boards.4chan.org/pol/thread/420835411
0,420847465,False,False,How come no one asks who this racist n(ger is ...,5,http://boards.4chan.org/pol/thread/420848571
0,420846799,False,False,Regardless of how retarded the Vietnam war was...,2,http://boards.4chan.org/pol/thread/420848774


![Success](https://media.giphy.com/media/a0h7sAqON67nO/giphy.gif)

## Step 2 : Get the list of all posts for each thread

In [6]:
# Create and export a parquet file with every post for every thread

for i, thread in enumerate(threads):
    thread_df = pd.DataFrame()
    for post in thread.all_posts:
        post_dict = {'thread_id': threads_ids[i],
                     'post_id': post.post_id,
                     'poster_id': post.poster_id,
                     'poster_name': post.name,
                     'is_op': post.is_op,
                     'tripcode': post.tripcode,
                     'email': post.email,
                     'subject': post.subject,
                     'comment': post.text_comment,
                     'has_file': post.has_file,
                     'post_datetime': post.datetime,
                     'url': post.url}
        if post_dict['has_file']:
            post_dict['file_name'] = post.file.filename_original
            post_dict['file_extension'] = post.file.file_extension
        else:
            post_dict['file_name'] = None
            post_dict['file_extension'] = None

        new_row = pd.DataFrame(post_dict, index=[0])
        thread_df = pd.concat([thread_df, new_row], axis=0)
    thread_df.to_parquet(f'data/posts_{threads_ids[i]}_{thread_df.shape[0]}.parquet')

In [8]:
# Load a parquet file and print the first 5 rows

thread_df_parquet = pd.read_parquet('data\posts_420846947.parquet')
thread_df_parquet.head()

Unnamed: 0,thread_id,post_id,poster_id,poster_name,is_op,tripcode,email,subject,comment,has_file,post_datetime,url,file_name,file_extension
0,420846947,420835411,1km+O+yD,Anonymous,True,,,Did Hitler send German women to India to fuck ...,Would Hitler really redeem the white race like...,True,2023-03-24 05:45:36,http://boards.4chan.org/pol/thread/420835411#p...,1660707546684947.png,.png
0,420846947,420835438,1km+O+yD,Anonymous,False,,,,Germanbros...,True,2023-03-24 05:46:05,http://boards.4chan.org/pol/thread/420835411#p...,1667965092478052.webm,.webm
0,420846947,420835956,cD6UCZuW,Anonymous,False,,,,>>420835411\nin their dreams,False,2023-03-24 05:52:47,http://boards.4chan.org/pol/thread/420835411#p...,,
0,420846947,420836073,XiHt38Lw,Anonymous,False,,,,>>420835411\nIndians are some of the biggest n...,False,2023-03-24 05:54:41,http://boards.4chan.org/pol/thread/420835411#p...,,
0,420846947,420836220,vrGFSYxf,Anonymous,False,,,,>>420835438\nHe really said this? But /pol/ to...,False,2023-03-24 05:56:43,http://boards.4chan.org/pol/thread/420835411#p...,,


![Success](https://media.giphy.com/media/Od0QRnzwRBYmDU3eEO/giphy.gif)

We now have proof that we can extract data from the 4chan API and store it as Parquet files.
We can now use this approach to build our data pipelines.