## 4chan /biz/ Scraper ##

This notebook will scrape the /biz/ board on 4chan at the moment, and download all images associated with the 200 active threads on the board.

The intent behind this scraping is to feed the reply/subject text data into a machine learning model and attempt to recrate posts. I would also like to do basic analysis on the text data.

The image data is collected and stored in /imgs/, I'd like for the images to be the input for both a Fazle Rabbitrained classification neural network and a trained generative neural network.

In [126]:
import time
import pandas as pd
from urllib.request import urlopen
import json
from bs4 import BeautifulSoup
from urllib.error import HTTPError
import requests
from PIL import Image



def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)


def get_numbers(df, i):  
    postno = df['Post Number'][i]
    replies = df['Replies'][i]
    return(postno, replies)

dataframe = pd.DataFrame(columns=['Subject', 'Comment', 'Post Number', 'Replies', 'reply_list', 'tim_list', 'ext_list', 'combined'])
for i in range(10):
    i=i+1
    url = (("https://a.4cdn.org/biz/") + str(i) + '.json')
    data = get_jsonparsed_data(url)

    list = len(data['threads'])
   
    for i in range(0, list): 
      try:
          comment = data['threads'][i]['posts'][0]['com']
          subject = data['threads'][i]['posts'][0]['sub']
          postno = data['threads'][i]['posts'][0]['no']
          replies = data['threads'][i]['posts'][0]['replies']
      except KeyError:
          subject = 'No subject'
          postno = data['threads'][i]['posts'][0]['no']
          replies = data['threads'][i]['posts'][0]['replies']
      except KeyError:
          comment = data['threads'][i]['posts'][0]['sub']
      dataframe = dataframe.append({'Subject':subject, 'Comment':comment, 'Post Number':postno, 'Replies':replies}, ignore_index=True)
    time.sleep(1)
    i=i+1

dataframe = dataframe.fillna(0)

Above code produces dataframe seen below

In [127]:
dataframe.head(5)

Unnamed: 0,Subject,Comment,Post Number,Replies,reply_list,tim_list,ext_list,combined
0,NO BEGGING,"<span style=""font-weight:600;font-size:150%;li...",4884770,0,0,0,0,0
1,Welcome to /biz/ - Business &amp; Finance,This board is for the discussion of topics rel...,21374000,1,0,0,0,0
2,No subject,We will never make it will we,44574618,10,0,0,0,0
3,No subject,We will never make it will we,44574334,22,0,0,0,0
4,No subject,"Even if geist did dump, it’s not like anyone l...",44575166,3,0,0,0,0


### Grab reply text, image md5 hash (name), and img extension #

This function goes in and grabs all text and image data from replies in a given thread. It then drops the replies into a list in "reply_list", and drops the image name (it's md5 hash) and extension in the repsective columns
called "tim_list" and "ext_list". The "combined" column is the result of pairing the img name (md5) with the correlating extension.


In [128]:
for i in range(0,len(dataframe)):

  try:
    postno, replies = get_numbers(dataframe, i) 
    url = (("https://a.4cdn.org/biz/thread/") + str(postno) + '.json')
    data = get_jsonparsed_data(url)
    replies_text = []
    extensions = []
    images = []
    combined = []
    for j in range(0,replies):
      try:
        reply = data['posts'][j]['com']
        img = data['posts'][j]['tim']
        ext = data['posts'][j]['ext']
        replies_text.append(reply)
        images.append(str(img))
        extensions.append(str(ext))
        combined.append(str(img)+str(ext))
        dataframe['reply_list'][i] = replies_text
        dataframe['tim_list'][i] = images
        dataframe['ext_list'][i] = extensions
        dataframe['combined'][i] = combined
      except KeyError:
        pass
  except HTTPError:
    dataframe['reply_list'][i] = '404'
    dataframe['tim_list'][i] = '404'
    dataframe['ext_list'][i] = '404'
  else:
    pass
time.sleep(1)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe['reply_list'][i] = replies_text
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe['tim_list'][i] = images
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe['ext_list'][i] = extensions
A val

The above will format the dataframe into the format below:

In [129]:
dataframe

Unnamed: 0,Subject,Comment,Post Number,Replies,reply_list,tim_list,ext_list,combined
0,NO BEGGING,"<span style=""font-weight:600;font-size:150%;li...",4884770,0,0,0,0,0
1,Welcome to /biz/ - Business &amp; Finance,This board is for the discussion of topics rel...,21374000,1,[This board is for the discussion of topics re...,[1597354727695],[.png],[1597354727695.png]
2,No subject,We will never make it will we,44574618,10,"[We will never make it will we, <a href=""#p445...","[1639695644833, 1639696526505, 1639696579937, ...","[.jpg, .jpg, .jpg, .jpg, .jpg]","[1639695644833.jpg, 1639696526505.jpg, 1639696..."
3,No subject,We will never make it will we,44574334,22,"[<a href=""#p44574334"" class=""quotelink"">&gt;&g...","[1639695205203, 1639695244628, 1639695324334, ...","[.png, .png, .png, .jpg, .jpg, .png, .jpg, .jp...","[1639695205203.png, 1639695244628.png, 1639695..."
4,No subject,"Even if geist did dump, it’s not like anyone l...",44575166,3,"[Even if geist did dump, it’s not like anyone ...",[1639696837532],[.jpg],[1639696837532.jpg]
...,...,...,...,...,...,...,...,...
194,No subject,GUYS CHECK THE PRICE AH FUCK WTF IS HAPPENING ...,44573668,2,[GUYS CHECK THE PRICE AH FUCK WTF IS HAPPENING...,"[1639693722292, 1639694343921]","[.jpg, .png]","[1639693722292.jpg, 1639694343921.png]"
195,No subject,Convince me this isn&#039;t the same shit.,44570922,45,"[Convince me this isn&#039;t the same shit., <...","[1639688004544, 1639688536731, 1639690437871, ...","[.png, .png, .jpg, .jpg, .jpg]","[1639688004544.png, 1639688536731.png, 1639690..."
196,No subject,Remember uou were all put on house arrest for ...,44573737,9,[Remember uou were all put on house arrest for...,[1639693874813],[.jpg],[1639693874813.jpg]
197,No subject,What else is there but to wage?,44571681,10,"[What else is there but to wage?, Don&#039;t b...","[1639689661806, 1639692661851, 1639694124045]","[.png, .jpg, .jpg]","[1639689661806.png, 1639692661851.jpg, 1639694..."


In [130]:
# DOWNLOAD ALL IMAGES FOR THREAD/IN THREADS

links = []
newlinks = []
dataframe['combined'] = dataframe['combined'].fillna(0)
combined = dataframe['combined']

for i in range(0, len(combined)):
    if combined[i] == 0:
        links.append('No image')
    elif combined[i] != 0:
        links = combined[i]
        for i in links:
            newlinks.append(i)
            imgURL = ("https://i.4cdn.org/biz/") + str(i)
            name = str(i)
            print(imgURL)
            r = requests.get(imgURL)
            with open('imgs/'+name, 'wb') as f:
                f.write(r.content)
            time.sleep(0)
print(name)

https://i.4cdn.org/biz/1597354727695.png
https://i.4cdn.org/biz/1639695644833.jpg
https://i.4cdn.org/biz/1639696526505.jpg
https://i.4cdn.org/biz/1639696579937.jpg
https://i.4cdn.org/biz/1639696670415.jpg
https://i.4cdn.org/biz/1639696728469.jpg
https://i.4cdn.org/biz/1639695205203.png
https://i.4cdn.org/biz/1639695244628.png
https://i.4cdn.org/biz/1639695324334.png
https://i.4cdn.org/biz/1639695794642.jpg
https://i.4cdn.org/biz/1639696143382.jpg
https://i.4cdn.org/biz/1639696188230.png
https://i.4cdn.org/biz/1639696435192.jpg
https://i.4cdn.org/biz/1639696509025.jpg
https://i.4cdn.org/biz/1639696633538.jpg
https://i.4cdn.org/biz/1639696957050.png
https://i.4cdn.org/biz/1639696837532.jpg
https://i.4cdn.org/biz/1639690077190.jpg
https://i.4cdn.org/biz/1639694425766.gif
https://i.4cdn.org/biz/1639694411652.png
https://i.4cdn.org/biz/1639694478245.jpg
https://i.4cdn.org/biz/1639696138671.jpg
https://i.4cdn.org/biz/1639696493747.jpg
https://i.4cdn.org/biz/1639696987225.png
https://i.4cdn.o

In [131]:
directory = "imgs/"
files_in_directory = os.listdir(directory)
filtered_files = [file for file in files_in_directory if file.endswith(".webm")]

for file in filtered_files:
	path_to_file = os.path.join(directory, file)
	os.remove(path_to_file)

In [132]:
path = "imgs/"
dirs = os.listdir( path )
final_size = 440;

def resize_aspect_fit():
    for item in dirs:
         if item == '.DS_Store':
             continue
         if os.path.isfile(path+item):
             im = Image.open(path+item)
             f, e = os.path.splitext(path+item)
             size = im.size
             ratio = float(final_size) / max(size)
             new_image_size = tuple([int(x*ratio) for x in size])
             im = im.resize(new_image_size, Image.ANTIALIAS)
             new_im = Image.new("RGB", (final_size, final_size))
             new_im.paste(im, ((final_size-new_image_size[0])//2, (final_size-new_image_size[1])//2))
             new_im.save(f + '_resized.jpg', 'JPEG', quality=90)
resize_aspect_fit()



In [133]:
directory = "imgs/"
files_in_directory = os.listdir(directory)
filtered_files = [file for file in files_in_directory if not file.endswith("_resized.jpg")]

for file in filtered_files:
	path_to_file = os.path.join(directory, file)
	os.remove(path_to_file)