# Problem Statement

I was hired by a ***greeting card company*** to classify whether or not a joke is a 'dad-joke'. Father's Day is approaching and the company wants to run a dad-joke-based ad campaign, but no one can agree on what constitutes a *dad-joke*. The goal of this project is therefore to classify jokes as either 'dad-jokes' or 'standard jokes'.

In [1]:
import requests as r
import time
import pandas as pd
import numpy as np
from datetime import datetime as dt

I used the pushshift.io API to pull Reddit posts.

In [4]:
url = 'https://api.pushshift.io/reddit/search/submission/?'

In [35]:
params = {
    'subreddit': 'Jokes',
    'size': '1'
}
req = r.get(url, params=params)

In [36]:
req.raise_for_status()

In [40]:
df1 = pd.DataFrame.from_dict(req.json()['data'])

In [59]:
df2 = pd.DataFrame.from_dict(req.json()['data'])

In [61]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
df1 = pd.concat([df1, df2])

In [69]:
df1.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_received',
       'treatment_tags', 'upvote_ratio',

The function below queries pushshift.io for a given subreddit in batches of 100 until it has collected at least 7,500 posts with unique titles, dropping duplicates along the way. It returns the title, body, and subreddit of each post in a DataFrame. I ran it on both r/Jokes and r/dadjokes.

In [21]:
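def reddit_query(sub):
    """Pull posts from one subreddit until at least 7,500 unique titles are collected."""
    start = dt.now()
    URL = 'https://api.pushshift.io/reddit/search/submission/?'
    SUBFIELDS = ['title', 'selftext', 'subreddit']
    day = 1
    # Seed the DataFrame with a single post so the columns exist.
    r1 = r.get(URL, params={'subreddit': sub, 'size': '1'})
    r1.raise_for_status()
    df = pd.DataFrame.from_dict(r1.json()['data'])
    while df.shape[0] < 7_500:
        params = {
            'subreddit': sub,
            'size': '100',
            'after': f'{3*day}d'  # the 'after' bound moves back 3 more days each iteration
            }
        req = r.get(URL, params=params)
        if req.status_code != 200:
            # Pause before retrying the same window instead of hammering the API.
            time.sleep(1)
            continue
        df2 = pd.DataFrame.from_dict(req.json()['data'])
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
        df = pd.concat([df, df2], ignore_index=True)
        df.drop_duplicates(subset=['title'], inplace=True, ignore_index=True)
        print(df.shape[0])       # running count of unique posts
        print(dt.now() - start)  # elapsed time
        day += 1
        time.sleep(1)
    df = df[SUBFIELDS]
    return df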
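
Because a failed request simply retries the same window, a long outage would leave the loop running indefinitely. Below is a minimal sketch of a bounded retry. It assumes a hypothetical helper, `fetch_with_retries`, that wraps any zero-argument callable returning a `requests`-style response (an object with `status_code` and `json()`). The attempt limit and pause are illustrative defaults, not values from this notebook.

```python
import time


def fetch_with_retries(fetch, max_attempts=3, pause=1.0):
    """Call fetch() until it returns HTTP 200 or max_attempts is exhausted.

    fetch is a hypothetical zero-argument callable returning a response-like
    object with a status_code attribute and a json() method.
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        if response.status_code == 200:
            return response.json()
        if attempt < max_attempts:
            time.sleep(pause)  # brief wait before the next attempt
    raise RuntimeError(f'request failed after {max_attempts} attempts')
```

Inside `reddit_query`, the call might look like `fetch_with_retries(lambda: r.get(URL, params=params))`, with the caller deciding what to do when the `RuntimeError` is raised.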
    

In [22]:
jokes_df = reddit_query('Jokes')

101
0:00:08.630163
200
0:00:10.620751
300
0:00:12.584684
400
0:00:14.345839
500
0:00:23.894352
525
0:00:27.658965
623
0:00:31.278576
722
0:00:40.707326
822
0:00:42.585346
922
0:00:44.412299
1021
0:00:47.554255
1021
0:00:56.987683
1021
0:00:58.769829
1116
0:01:00.971347
1211
0:01:04.602776
1308
0:01:13.685902
1407
0:01:31.453327
1503
0:01:45.900829
1602
0:01:48.117334
1701
0:01:50.540781
1801
0:01:52.857705
1899
0:02:01.388774
1996
0:02:03.248406
2092
0:02:06.557230
2191
0:02:09.741728
2289
0:02:11.757793
2388
0:02:20.860905
2481
0:02:23.198332
2579
0:02:25.262170
2675
0:02:27.467730
2772
0:02:29.153788
2869
0:02:39.668136
2969
0:02:41.435588
3065
0:02:43.996291
3164
0:02:53.204322
3263
0:02:57.161498
3361
0:03:03.714380
3461
0:03:14.962932
3557
0:03:19.187822
3657
0:03:26.373225
3756
0:03:28.642124
3855
0:03:30.433854
3952
0:03:32.304662
4051
0:03:34.501246
4150
0:03:52.258158
4250
0:04:10.777298
4346
0:04:20.680246
4443
0:04:23.128464
4539
0:04:25.703240
4631
0:04:35.134098
4731
0:04:

In [23]:
jokes_df.head()

| | title | selftext | subreddit |
|---|---|---|---|
| 0 | I got Covid in November and lost my sense of t... | [removed] | Jokes |
| 1 | A well-known professor of language, was caught... | The university decided to take action, seeing ... | Jokes |
| 2 | I'm quite a normal person, I'm very good frien... | I don't know why... | Jokes |
| 3 | How does a gypsy soccer match end? | Without goals. | Jokes |
| 4 | What is Forrest Gump’s password? | [removed] | Jokes |


In [24]:
jokes_df = jokes_df[jokes_df['selftext'] != '[removed]']

In [25]:
jokes_df.shape

(6377, 3)

In [26]:
jokes_df = jokes_df[jokes_df['title'] != '[removed]']

In [27]:
jokes_df.shape

(6377, 3)
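
The two filters above can be combined into a single step. This is a minimal sketch, assuming a hypothetical helper named `drop_removed` and assuming that removed posts always carry the literal marker `[removed]`, as in the output shown earlier. The demo rows are made up for illustration.

```python
import pandas as pd


def drop_removed(df, columns=('title', 'selftext'), marker='[removed]'):
    """Return the rows of df where none of the given columns equals marker."""
    keep = ~df[list(columns)].eq(marker).any(axis=1)
    return df[keep]


# Tiny illustrative frame (made-up rows, not data from the notebook).
demo = pd.DataFrame({
    'title': ['joke a', 'joke b', '[removed]'],
    'selftext': ['punchline', '[removed]', 'punchline'],
    'subreddit': ['Jokes', 'Jokes', 'Jokes'],
})
cleaned = drop_removed(demo)
```

The same helper would apply unchanged to the r/dadjokes DataFrame.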

In [29]:
jokes_df.to_csv('./data/jokes.csv', index=False)

In [30]:
dadjokes_df = reddit_query('dadjokes')

97
0:00:04.746822
193
0:00:15.280348
291
0:00:17.803885
383
0:00:29.140241
480
0:00:32.274060
492
0:00:34.658433
588
0:00:37.315482
683
0:00:50.676926
776
0:01:06.529948
872
0:01:08.433935
968
0:01:11.639440
968
0:01:19.275211
968
0:01:21.102076
1066
0:01:23.715954
1164
0:01:25.691803
1259
0:01:28.159414
1359
0:01:37.358294
1457
0:01:41.907752
1555
0:01:45.137416
1651
0:01:54.261188
1750
0:01:57.530377
1848
0:01:59.426284
1943
0:02:02.369687
2041
0:02:11.765530
2136
0:02:13.797170
2233
0:02:15.704619
2308
0:02:18.460439
2404
0:02:27.184202
2502
0:02:31.649277
2593
0:02:35.884246
2684
0:02:46.384785
2782
0:02:48.800434
2878
0:02:50.946498
2972
0:02:53.260333
3070
0:03:02.367743
3169
0:03:04.216499
3263
0:03:06.683915
3359
0:03:09.941636
3456
0:03:19.988993
3553
0:03:21.955366
3646
0:03:23.968352
3741
0:03:26.002500
3838
0:03:28.427483
3934
0:03:37.804249
4030
0:03:41.112875
4126
0:03:43.459901
4223
0:03:53.088105
4319
0:03:55.156604
4415
0:03:57.090172
4509
0:03:59.209655
4604
0:04:01.3

In [31]:
dadjokes_df = dadjokes_df[dadjokes_df['selftext'] != '[removed]']

In [32]:
dadjokes_df.shape

(6121, 3)

In [33]:
dadjokes_df = dadjokes_df[dadjokes_df['title'] != '[removed]']

In [34]:
dadjokes_df.shape

(6121, 3)

After removing posts marked "[removed]", which were essentially NaN values, I combined the two DataFrames into one and concatenated the post title and post body into a single column. I then replaced the subreddit column with a binary label indicating which subreddit a row belongs to (1 for r/Jokes, 0 for r/dadjokes) and saved the result as a .csv.
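
The steps just described can be sketched end to end as follows. This is an illustrative outline rather than the notebook's exact code: `build_dataset` is a hypothetical helper, and filling a missing body with an empty string (then stripping the trailing space) is an added assumption so that a title-only post does not produce a NaN in `post`. The demo rows are made up.

```python
import pandas as pd


def build_dataset(jokes_df, dadjokes_df):
    """Combine both subreddits into one frame with columns post and is_jokes."""
    full = pd.concat([jokes_df, dadjokes_df], ignore_index=True)
    # Assumption: treat a missing body as empty text rather than NaN.
    body = full['selftext'].fillna('')
    post = (full['title'] + ' ' + body).str.strip()
    # Label is 1 for r/Jokes and 0 for anything else (here, r/dadjokes).
    is_jokes = (full['subreddit'] == 'Jokes').astype(int)
    return pd.DataFrame({'post': post, 'is_jokes': is_jokes})


# Made-up rows for illustration only.
jokes_demo = pd.DataFrame({'title': ['Setup one'], 'selftext': ['Punchline one'],
                           'subreddit': ['Jokes']})
dad_demo = pd.DataFrame({'title': ['Setup two'], 'selftext': [None],
                         'subreddit': ['dadjokes']})
combined = build_dataset(jokes_demo, dad_demo)
```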


In [35]:
dadjokes_df.to_csv('./data/dadjokes.csv', index=False)

In [41]:
dadjokes_df.head()

| | title | selftext | subreddit |
|---|---|---|---|
| 0 | You can never run through a camp site! | You can only ran...cause it's past tents | dadjokes |
| 1 | I invested in a fertility clinic.... | Heard the business was expanding | dadjokes |
| 2 | Did you know ancient Egyptian houses didn’t ha... | Instead they had a horn and a sign saying “Too... | dadjokes |
| 3 | I just planted a new tree in my back yard. It’... | You might say it’s starting to branch out. | dadjokes |
| 4 | What’s J-Lo’s favorite race? | Iditarod \n\n(I did A-Rod) | dadjokes |


In [42]:
full = pd.concat([jokes_df, dadjokes_df])

In [44]:
full.tail()

| | title | selftext | subreddit |
|---|---|---|---|
| 7554 | Have you guys heard about Cole's law? | It is thinly sliced cabbage | dadjokes |
| 7555 | What do you call a walking toilet? | A Portabloo | dadjokes |
| 7556 | How do you attract a squrrel | Climb like a tree and act like a nut | dadjokes |
| 7559 | I used to like tractors... | Now I prefer air conditioning. \n\nYou could s... | dadjokes |
| 7561 | Hydrogen: Helium, how do I become like you? | Helium: Be noble. | dadjokes |


In [45]:
full.shape

(12498, 3)

In [47]:
full['post'] = full.title + ' ' + full.selftext

In [49]:
full = full[['post', 'subreddit']]

In [51]:
# Encode the label as 1 for r/Jokes and 0 for r/dadjokes. Taking an explicit
# copy first avoids pandas' SettingWithCopyWarning on the sliced DataFrame.
full = full.copy()
full['subreddit'] = full['subreddit'].map(lambda x: 1 if x == 'Jokes' else 0)
full = full.rename(columns={'subreddit': 'is_jokes'})


In [52]:
full.head()

| | post | is_jokes |
|---|---|---|
| 1 | A well-known professor of language, was caught... | 1 |
| 2 | I'm quite a normal person, I'm very good frien... | 1 |
| 3 | How does a gypsy soccer match end? Without goals. | 1 |
| 5 | A granddaughters questions The first time our ... | 1 |
| 7 | What did the loaf of sourdough bread day to th... | 1 |


In [53]:
full.to_csv('./data/full.csv', index=False)