## 4chan /biz/ Scraper ##

This notebook will scrape the /biz/ board on 4chan at the moment, and download all images associated with the 200 active threads on the board.

The intent behind this scraping is to feed the reply/subject text data into a machine learning model and attempt to recrate posts. I would also like to do basic analysis on the text data.

The image data is collected and stored in /imgs/, I'd like for the images to be the input for both a Fazle Rabbitrained classification neural network and a trained generative neural network.

In [325]:
import time
import pandas as pd
from urllib.request import urlopen
import json
from bs4 import BeautifulSoup
from urllib.error import HTTPError
import requests
from PIL import Image



def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)


def get_numbers(df, i):  
    postno = df['Post Number'][i]
    replies = df['Replies'][i]
    return(postno, replies)

dataframe = pd.DataFrame(columns=['Subject', 'Comment', 'Post Number', 'Replies', 'reply_list', 'tim_list', 'ext_list', 'combined'])
for i in range(10):
    i=i+1
    url = (("https://a.4cdn.org/biz/") + str(i) + '.json')
    data = get_jsonparsed_data(url)

    list = len(data['threads'])
   
    for i in range(0, list): 
      try:
          comment = data['threads'][i]['posts'][0]['com']
          subject = data['threads'][i]['posts'][0]['sub']
          postno = data['threads'][i]['posts'][0]['no']
          replies = data['threads'][i]['posts'][0]['replies']
      except KeyError:
          subject = 'No subject'
          postno = data['threads'][i]['posts'][0]['no']
          replies = data['threads'][i]['posts'][0]['replies']
      except KeyError:
          comment = data['threads'][i]['posts'][0]['sub']
      dataframe = dataframe.append({'Subject':subject, 'Comment':comment, 'Post Number':postno, 'Replies':replies}, ignore_index=True)
    time.sleep(1)
    i=i+1

dataframe = dataframe.fillna(0)

Above code produces dataframe seen below

In [None]:
dataframe.head(5)

### Grab reply text, image md5 hash (name), and img extension #

This function goes in and grabs all text and image data from replies in a given thread. It then drops the replies into a list in "reply_list", and drops the image name (it's md5 hash) and extension in the repsective columns
called "tim_list" and "ext_list". The "combined" column is the result of pairing the img name (md5) with the correlating extension.


In [326]:
for i in range(0,len(dataframe)):

  try:
    postno, replies = get_numbers(dataframe, i) 
    url = (("https://a.4cdn.org/biz/thread/") + str(postno) + '.json')
    data = get_jsonparsed_data(url)
    replies_text = []
    extensions = []
    images = []
    combined = []
    for j in range(0,replies):
      try:
        reply = data['posts'][j]['com']
        img = data['posts'][j]['tim']
        ext = data['posts'][j]['ext']
        replies_text.append(reply)
        images.append(str(img))
        extensions.append(str(ext))
        combined.append(str(img)+str(ext))
        dataframe['reply_list'][i] = replies_text
        dataframe['tim_list'][i] = images
        dataframe['ext_list'][i] = extensions
        dataframe['combined'][i] = combined
      except KeyError:
        pass
  except HTTPError:
    dataframe['reply_list'][i] = '404'
    dataframe['tim_list'][i] = '404'
    dataframe['ext_list'][i] = '404'
  else:
    pass
time.sleep(1)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe['reply_list'][i] = replies_text
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe['tim_list'][i] = images
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe['ext_list'][i] = extensions
A val

The above will format the dataframe into the format below:

In [330]:
dataframe

Unnamed: 0,Subject,Comment,Post Number,Replies,reply_list,tim_list,ext_list,combined
0,NO BEGGING,"<span style=""font-weight:600;font-size:150%;li...",4884770,0,0,0,0,0
1,Welcome to /biz/ - Business &amp; Finance,This board is for the discussion of topics rel...,21374000,1,[This board is for the discussion of topics re...,[1597354727695],[.png],[1597354727695.png]
2,No subject,Tri-daily reminder to check the accumulation v...,44778582,30,[Tri-daily reminder to check the accumulation ...,"[1640259117372, 1640259338790, 1640259431455, ...","[.jpg, .jpg, .jpg, .jpg]","[1640259117372.jpg, 1640259338790.jpg, 1640259..."
3,USD/TRY,Why is there no discussion on this? What comes...,44778485,12,[Why is there no discussion on this? What come...,[1640258781899],[.jpg],[1640258781899.jpg]
4,No subject,"<span class=""quote"">&gt;131 IQ</span><br><span...",44775328,46,"[<span class=""quote"">&gt;131 IQ</span><br><spa...","[1640245305718, 1640248452342, 1640249647104, ...","[.jpg, .jpg, .jpg, .jpg, .png, .jpg, .jpg, .jp...","[1640245305718.jpg, 1640248452342.jpg, 1640249..."
...,...,...,...,...,...,...,...,...
180,No subject,"<span class=""quote"">&gt;FFTB advertisement bef...",44776229,2,"[<span class=""quote"">&gt;FFTB advertisement be...",[1640249677394],[.jpg],[1640249677394.jpg]
181,I&#039;m addicted to staking rose,If I were to stop I wouldn&#039;t receive 100 ...,44774994,24,[If I were to stop I wouldn&#039;t receive 100...,"[1640243855407, 1640244934329, 1640245476199, ...","[.jpg, .png, .jpg, .jpg, .png, .jpg, .png]","[1640243855407.jpg, 1640244934329.png, 1640245..."
182,I am officially fomoing into,BilboBagginsPutinCharmander9000Inu ticker: BIN...,44771912,27,[BilboBagginsPutinCharmander9000Inu ticker: BI...,"[1640233049704, 1640250773164, 1640251343016]","[.jpg, .jpg, .jpg]","[1640233049704.jpg, 1640250773164.jpg, 1640251..."
183,No subject,"I just staked almost 200,000 RBC. I was one of...",44776584,9,"[I just staked almost 200,000 RBC. I was one o...","[1640251314635, 1640251373934]","[.jpg, .jpg]","[1640251314635.jpg, 1640251373934.jpg]"


In [514]:
# DOWNLOAD ALL IMAGES FOR THREAD/IN THREADS

links = []
newlinks = []
dataframe['combined'] = dataframe['combined'].fillna(0)
combined = dataframe['combined']

for i in range(0, len(combined)):
    if combined[i] == 0:
        links.append('No image')
    elif combined[i] != 0:
        links = combined[i]
        for i in links:
            newlinks.append(i)
            imgURL = ("https://i.4cdn.org/biz/") + str(i)
            name = str(i)
            print(imgURL)
            r = requests.get(imgURL)
            with open('imgs/'+name, 'wb') as f:
                f.write(r.content)
            time.sleep(0)
print(name)

https://i.4cdn.org/biz/1597354727695.png


FileNotFoundError: [Errno 2] No such file or directory: 'imgs/1597354727695.png'

In [None]:
directory = "imgs/"
files_in_directory = os.listdir(directory)
filtered_files = [file for file in files_in_directory if file.endswith(".webm")]

for file in filtered_files:
	path_to_file = os.path.join(directory, file)
	os.remove(path_to_file)

In [None]:
path = "imgs/"
dirs = os.listdir( path )
final_size = 440;

def resize_aspect_fit():
    for item in dirs:
         if item == '.DS_Store':
             continue
         if os.path.isfile(path+item):
             im = Image.open(path+item)
             f, e = os.path.splitext(path+item)
             size = im.size
             ratio = float(final_size) / max(size)
             new_image_size = tuple([int(x*ratio) for x in size])
             im = im.resize(new_image_size, Image.ANTIALIAS)
             new_im = Image.new("RGB", (final_size, final_size))
             new_im.paste(im, ((final_size-new_image_size[0])//2, (final_size-new_image_size[1])//2))
             new_im.save(f + '_resized.jpg', 'JPEG', quality=90)
resize_aspect_fit()

In [None]:
directory = "imgs/"
files_in_directory = os.listdir(directory)
filtered_files = [file for file in files_in_directory if not file.endswith("_resized.jpg")]

for file in filtered_files:
	path_to_file = os.path.join(directory, file)
	os.remove(path_to_file)

In [None]:
dataframe2 = pd.DataFrame({'Subject': dataframe['Subject'], 'Comment': dataframe['Comment'], 'Replies': dataframe['reply_list']})

dataframe2

In [None]:
df = dataframe2.explode('Replies')
replies = []
for i in range(0, 200):
    reply = df['Replies'][i]
    replies.append(reply)
replies

len(replies)


In [None]:
dataframe['reply_list'][3]

In [None]:
dataframe['reply_list']

In [328]:
### THIS WILL CLEAN ALL DATA IN REPLY_TEXT ###

import re

exploded_df = dataframe['reply_list'].explode().reset_index()

all_reply_test = exploded_df['reply_list']


all_replies = []
for i in range(0, len(all_reply_test)):
    result = re.sub("<(.*)>.*?|<(.*) />", " ", str(all_reply_test[i]))
    result = re.sub('&#039;', "'", result)
    result = re.sub('&quot;', '"', result)
    all_replies.append(result)

In [331]:
## Append all subject and comment text to lists ##


subjects = []
comments = []
for i in range(0,len(dataframe)):
    subject_result =  re.sub("<(.*)>.*?|<(.*) />", " ", str(dataframe['Subject'][i]))
    subject_result = re.sub('&#039;', "'", subject_result)
    subject_result = re.sub('&quot;', '"', subject_result)
    subjects.append(subject_result)
    comment_result =  re.sub("<(.*)>.*?|<(.*) />", " ", str(dataframe['Comment'][i]))
    comment_result = re.sub('&#039;', "'", comment_result)
    comment_result = re.sub('&quot;', '"', comment_result)
    comments.append(comment_result)


In [356]:
## ALL TEXT ON THE BOARD ##

import numpy as np

all_text_on_board = all_replies + subjects + comments

textdf = pd.DataFrame({'All text': all_text_on_board})

textdf = textdf.replace('0', np.nan)
textdf = textdf.replace(' ', np.nan)
textdf = textdf.dropna(how='all', axis=0).reset_index(drop=True)

textdf

Unnamed: 0,All text
0,This board is for the discussion of topics rel...
1,Tri-daily reminder to check the accumulation v...
2,Do you always seek for high volumes in early-...
3,Are you guys insisting on making me commit sui...
4,"they are just mean, you are a great trader ba..."
...,...
1683,It's over isn't it ?
1684,If I were to stop I wouldn't receive 100 rose ...
1685,BilboBagginsPutinCharmander9000Inu ticker: BIN...
1686,"I just staked almost 200,000 RBC. I was one of..."


In [512]:
alltext_list = textdf['All text']


   
literally_all_text =[]
for i in range(0,len(alltext_list)):
    text = alltext_list[i]
    literally_all_text.append(text)

corp_string = ' '.join(map(str, literally_all_text))

corp_string = re.sub("\d+", "", corp_string)
#corp_string = re.sub(r'[\t\n ]+', ' ', corp_string)
#corp_string = re.sub(r"\b([^ ]|\d+)\b","",corp_string)
corp_string = re.sub("\\s+"," ", corp_string)
corp_string = re.sub(" ", " ", corp_string)

corp_string.strip()
corp_string.lower()





In [513]:
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
doc = nlp(corp_string)
# all tokens that arent stop words or punctuations
words = [token.text
         for token in doc
         if not token.is_stop and not token.is_punct and len(token) >= 4]

# noun tokens that arent stop words or punctuations
nouns = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "NOUN")]

adjectives = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "ADJ")]

verbs = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "VERB")]

propn = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "PROPN")]
             
adposition = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "ADP")]

# most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(100)

# most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(100)

# most common adjective tokens
adj_freq = Counter(adjectives)
common_adjectives = adj_freq.most_common(100)

# most common verb tokens
verb_freq = Counter(verbs)
common_verbs = verb_freq.most_common(100)

# most common propositions
prop_freq = Counter(propn)
common_props = prop_freq.most_common(100)

# most common adpostions
adposition_freq = Counter(propn)
common_adpositions = adposition_freq.most_common(100)



In [453]:
common_verbs


[('going', 68),
 ('think', 60),
 ('know', 52),
 ('buy', 49),
 ('want', 44),
 ('got', 31),
 ('need', 25),
 ('bought', 22),
 ('sell', 21),
 ('gon', 19),
 ('coming', 18),
 ('based', 17),
 ('look', 17),
 ('said', 17),
 ('tell', 17),
 ('staking', 15),
 ('use', 14),
 ('Imagine', 14),
 ('looks', 13),
 ('looking', 13),
 ('getting', 13),
 ('understand', 13),
 ('happen', 13),
 ('buying', 13),
 ('related', 12)]

In [454]:
common_nouns

[('subject', 74),
 ('shit', 56),
 ('time', 53),
 ('people', 45),
 ('money', 36),
 ('day', 34),
 ('year', 34),
 ('coin', 32),
 ('coins', 31),
 ('crypto', 27),
 ('price', 24),
 ('thread', 23),
 ('market', 23),
 ('guys', 21),
 ('years', 20),
 ('today', 20),
 ('lot', 18),
 ('way', 18),
 ('point', 16),
 ('week', 16),
 ('guy', 15),
 ('fuck', 15),
 ('chart', 14),
 ('post', 14),
 ('world', 14)]

In [504]:
common_props

[('anon', 31),
 ('k', 31),
 ('ICP', 27),
 ('General', 15),
 ('APY', 12),
 ('XMR', 12),
 ('CRO', 11),
 ('BTC', 11),
 ('Monero', 11),
 ('Anon', 10),
 ('christmas', 10),
 ('Bitcoin', 10),
 ('WAGMI', 10),
 ('Dev', 9),
 ('kek', 9),
 ('amp', 9),
 ('Christmas', 8),
 ('ETH', 8),
 ('FUD', 8),
 ('Fantom', 8),
 ('AI', 7),
 ('FTM', 7),
 ('Algorand', 6),
 ('STBL', 6),
 ('schizos', 6),
 ('God', 6),
 ('DeFi', 6),
 ('s', 6),
 ('Binance', 6),
 ('M', 6),
 ('DB', 5),
 ('fren', 5),
 ('XRP', 5),
 ('lmao', 5),
 ('Fuck', 5),
 ('ba', 5),
 ('Kek', 5),
 ('Chainlink', 5),
 ('chan', 5),
 ('TICKER', 5),
 ('Welcome', 5),
 ('xmr', 5),
 ('China', 5),
 ('ROSE', 5),
 ('Findora', 5),
 ('Edition', 5),
 ('IQ', 4),
 ('schizo', 4),
 ('bros', 4),
 ('USDC', 4),
 ('bro', 4),
 ('Ethereum', 4),
 ('KDA', 4),
 ('Jesus', 4),
 ('fud', 4),
 ('ID', 4),
 ('bum', 4),
 ('Rubic', 4),
 ('Coinbase', 4),
 ('D', 4),
 ('LUNA', 4),
 ('FUCKING', 4),
 ('ASS', 4),
 ('santa', 4),
 ('LITEINU', 4),
 ('BlockBank', 4),
 ('BNB', 4),
 ('NFTs', 4),
 ('CRV

In [511]:
import random

# Sample sentences

for i in range(0,10):
    randnum = random.randrange(0, 99, 1)
    randnum2 = random.randrange(0, 99, 1)
    randnum3 = random.randrange(0, 15, 1)
    sentence = common_props[randnum3][0] + " " + common_verbs[randnum][0] + " " + common_verbs[randnum2][0] + " " +  common_adjectives[randnum][0] + " " + common_nouns[randnum][0]
    print(sentence)


BTC look related real market
Monero sold reading cheap weeks
ICP gon holding nice crypto
ICP got seen financial day
Anon said having high guys
Monero happen goes right chart
anon pump gone true man
General pump trying true man
christmas bought left ready coin
APY read s nigger place


In [None]:
from bs4 import BeautifulSoup
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt