## 4chan /biz/ Scraper ##

This notebook will scrape the /biz/ board on 4chan at the moment, and download all images associated with the 200 active threads on the board.

The intent behind this scraping is to feed the reply/subject text data into a machine learning model and attempt to recrate posts. I would also like to do basic analysis on the text data.

The image data is collected and stored in /imgs/, I'd like for the images to be the input for both a trained classification neural network and a trained generative neural network.

In [24]:
import time
import pandas as pd
from urllib.request import urlopen
import json
from bs4 import BeautifulSoup
from urllib.error import HTTPError
import requests

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)


def get_numbers(df, i):  
    postno = df['Post Number'][i]
    replies = df['Replies'][i]
    return(postno, replies)

dataframe = pd.DataFrame(columns=['Subject', 'Comment', 'Post Number', 'Replies', 'reply_list', 'tim_list', 'ext_list', 'combined'])
for i in range(10):
    i=i+1
    url = (("https://a.4cdn.org/biz/") + str(i) + '.json')
    data = get_jsonparsed_data(url)

    list = len(data['threads'])
   
    for i in range(0, list): 
      try:
          comment = data['threads'][i]['posts'][0]['com']
          subject = data['threads'][i]['posts'][0]['sub']
          postno = data['threads'][i]['posts'][0]['no']
          replies = data['threads'][i]['posts'][0]['replies']
      except KeyError:
          subject = 'No subject'
          postno = data['threads'][i]['posts'][0]['no']
          replies = data['threads'][i]['posts'][0]['replies']
      except KeyError:
          comment = data['threads'][i]['posts'][0]['sub']
      dataframe = dataframe.append({'Subject':subject, 'Comment':comment, 'Post Number':postno, 'Replies':replies}, ignore_index=True)
    time.sleep(1)
    i=i+1

dataframe = dataframe.fillna(0)

Above code produces dataframe seen below

In [13]:
dataframe.head(5)

Unnamed: 0,Subject,Comment,Post Number,Replies,reply_list,tim_list,ext_list,combined
0,NO BEGGING,"<span style=""font-weight:600;font-size:150%;li...",4884770,0,0,0,0,0
1,Welcome to /biz/ - Business &amp; Finance,This board is for the discussion of topics rel...,21374000,1,[This board is for the discussion of topics re...,[1597354727695],[.png],[1597354727695.png]
2,Metaverse,What metaverse tokens are you bullish on? I fe...,44565418,1,[What metaverse tokens are you bullish on? I f...,[1639677988777],[.png],[1639677988777.png]
3,/MTV/ MultiVAC Tech General - tranny and schiz...,MTV is a lowcap L1 project featuring never bef...,44554528,59,[MTV is a lowcap L1 project featuring never be...,"[1639652244478, 1639652716862, 1639656243913, ...","[.jpg, .png, .png, .png, .png, .jpg, .png, .pn...","[1639652244478.jpg, 1639652716862.png, 1639656..."
4,I lifted a rock even Bybon couldnt lift,"Gods smile upon me, can you say the same?",44565490,0,0,0,0,0


### Grab reply text, image md5 hash (name), and img extension #

This function goes in and grabs all text and image data from replies in a given thread. It then drops the replies into a list in "reply_list", and drops the image name (it's md5 hash) and extension in the repsective columns
called "tim_list" and "ext_list". The "combined" column is the result of pairing the img name (md5) with the correlating extension.


In [15]:
for i in range(0,len(dataframe)):

  try:
    postno, replies = get_numbers(dataframe, i) 
    url = (("https://a.4cdn.org/biz/thread/") + str(postno) + '.json')
    data = get_jsonparsed_data(url)
    replies_text = []
    extensions = []
    images = []
    combined = []
    for j in range(0,replies):
      try:
        reply = data['posts'][j]['com']
        img = data['posts'][j]['tim']
        ext = data['posts'][j]['ext']
        replies_text.append(reply)
        images.append(str(img))
        extensions.append(str(ext))
        combined.append(str(img)+str(ext))
        dataframe['reply_list'][i] = replies_text
        dataframe['tim_list'][i] = images
        dataframe['ext_list'][i] = extensions
        dataframe['combined'][i] = combined
      except KeyError:
        pass
  except HTTPError:
    dataframe['reply_list'][i] = '404'
    dataframe['tim_list'][i] = '404'
    dataframe['ext_list'][i] = '404'
  else:
    pass
time.sleep(1)


The above will format the dataframe into the format below:

In [16]:
dataframe

Unnamed: 0,Subject,Comment,Post Number,Replies,reply_list,tim_list,ext_list,combined
0,NO BEGGING,"<span style=""font-weight:600;font-size:150%;li...",4884770,0,0,0,0,0
1,Welcome to /biz/ - Business &amp; Finance,This board is for the discussion of topics rel...,21374000,1,[This board is for the discussion of topics re...,[1597354727695],[.png],[1597354727695.png]
2,Metaverse,What metaverse tokens are you bullish on? I fe...,44565418,1,[What metaverse tokens are you bullish on? I f...,[1639677988777],[.png],[1639677988777.png]
3,/MTV/ MultiVAC Tech General - tranny and schiz...,MTV is a lowcap L1 project featuring never bef...,44554528,59,[MTV is a lowcap L1 project featuring never be...,"[1639652244478, 1639652716862, 1639656243913, ...","[.jpg, .png, .png, .png, .png, .jpg, .png, .pn...","[1639652244478.jpg, 1639652716862.png, 1639656..."
4,I lifted a rock even Bybon couldnt lift,"Gods smile upon me, can you say the same?",44565490,0,0,0,0,0
...,...,...,...,...,...,...,...,...
195,No subject,When exactly does the bear market begin?,44563980,1,0,0,0,0
196,No subject,bought xec at 0.00034 for ~450$ now I have ~12...,44562967,2,[bought xec at 0.00034 for ~450$ now I have ~1...,[1639673262714],[.jpg],[1639673262714.jpg]
197,No subject,I need to fucking gift my children and wife gi...,44557126,87,[I need to fucking gift my children and wife g...,"[1639661083757, 1639661257543, 1639661344068, ...","[.jpg, .jpg, .jpg, .jpg, .jpg, .jpg, .jpg, .jp...","[1639661083757.jpg, 1639661257543.jpg, 1639661..."
198,No subject,it’s dumping,44562928,7,"[it’s dumping, Decembear, the final chapter, <...","[1639673192059, 1639673833610, 1639674612556]","[.jpg, .png, .jpg]","[1639673192059.jpg, 1639673833610.png, 1639674..."


## Download all images on the /biz/ board ##

The function below will download all images on the /biz/ board and put them in the /imgs/ folder. This typically takes around 8-10 minutes and will result in around 600MB of data being downloaded.

In [23]:
# DOWNLOAD ALL IMAGES FOR THREAD/IN THREADS

links = []
newlinks = []
dataframe['combined'] = dataframe['combined'].fillna(0)
combined = dataframe['combined']

for i in range(0, len(combined)):
    if combined[i] == 0:
        links.append('No image')
    elif combined[i] != 0:
        links = combined[i]
        for i in links:
            newlinks.append(i)
            imgURL = ("https://i.4cdn.org/biz/") + str(i)
            name = str(i)
            print(imgURL)
            r = requests.get(imgURL)
            with open('imgs/'+name, 'wb') as f:
                f.write(r.content)
            time.sleep(1)
print(name)

https://i.4cdn.org/biz/1597354727695.png
1597354727695.png
