# POC : Extract data from 4chan API

This notebook is a proof of concept to extract data from the 4chan API and store it as Parquet files.

We need to extract 2 types of files:

- 1 file for threads named `threads_{timestamp}.parquet`, containing those columns : 
  - thread_id
  - is_sticky
  - is_closed
  - topic
  - url
- 1 file per thread named `posts_{thread_id}_{number_of_posts}.parquet`, containing those columns :
  - thread_id
  - post_id
  - poster_id
  - poster_name
  - subject
  - text_comment
  - is_op
  - post_datetime
  - has_file

Once we have those files, we can use the scripts developed here to build our data pipelines.


[Documentation of the library used to get new data](https://basc-py4chan.readthedocs.io/en/latest/index.html)

## Step 1 : Get the list of all threads on /pol/


In [18]:
import pandas as pd
import basc_py4chan

# First, we need to create a board object. This is the object that will be used to access the board.
board = basc_py4chan.Board('pol')

# Now we can retrieve all the threads on the board.
threads = board.get_all_threads(expand=False)
threads_ids = board.get_all_thread_ids()
print('There are', len(threads), 'active threads on /pol/')

There are 202 active threads on /pol/


In [22]:
# For every thread, we can populate a dataframe with the thread's information.
threads_df = pd.DataFrame()
for i, thread in enumerate(threads):
    thread_dict = {'thread_id': threads_ids[i],
                   'is_sticky': thread.sticky,
                   'is_closed': thread.closed,
                   'topic': thread.topic.text_comment,
                   'url': thread.url}
    new_row = pd.DataFrame(thread_dict, index=[0])
    threads_df = pd.concat([threads_df, new_row], axis=0)

In [23]:
threads_df.head()

Unnamed: 0,thread_id,is_sticky,is_closed,topic,url
0,124205675,True,True,"This board is for the discussion of news, worl...",http://boards.4chan.org/pol/thread/124205675
0,259848258,True,True,Check the catalog before posting a new thread!...,http://boards.4chan.org/pol/thread/259848258
0,420744772,False,False,How did Adam and Eve make a choice before gain...,http://boards.4chan.org/pol/thread/420744740
0,420744004,False,False,Previous: >>420740563\nTimeline /tug/: https:/...,http://boards.4chan.org/pol/thread/420748375
0,420750274,False,False,,http://boards.4chan.org/pol/thread/420744772


In [27]:
# Create the exported parquet file named 'threads_{timestamp}.parquet'
timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
threads_df.to_parquet(f'threads_{timestamp}.parquet')

In [28]:
# Load the parquet file and print the first 5 rows
threads_df_parquet = pd.read_parquet(f'threads_{timestamp}.parquet')
threads_df_parquet.head()

Unnamed: 0,thread_id,is_sticky,is_closed,topic,url
0,124205675,True,True,"This board is for the discussion of news, worl...",http://boards.4chan.org/pol/thread/124205675
0,259848258,True,True,Check the catalog before posting a new thread!...,http://boards.4chan.org/pol/thread/259848258
0,420744772,False,False,How did Adam and Eve make a choice before gain...,http://boards.4chan.org/pol/thread/420744740
0,420744004,False,False,Previous: >>420740563\nTimeline /tug/: https:/...,http://boards.4chan.org/pol/thread/420748375
0,420750274,False,False,,http://boards.4chan.org/pol/thread/420744772


![Success](https://media.giphy.com/media/a0h7sAqON67nO/giphy.gif)