# Problem Statement 

Reddit is a social media website focused on social news aggregation, web content rating, and discussion. Members contribute content to subreddits (community-managed pages), where other members vote it up or down. Posts are scored by an algorithm that weighs votes against, among other things, how long the post has been up, which ensures that new content keeps floating to the top. Submissions with a high enough score make it to the front page of Reddit. Subreddits have fostered interesting niche communities that have flourished and might not have survived elsewhere on the internet.

r/AmItheAsshole and r/relationships are subreddits whose communities are built around helping people through the collective opinion of others. r/AmItheAsshole focuses on helping people gain perspective on a conflict, while r/relationships focuses on helping people with their interpersonal relationships. Although the subreddits seem far apart, conflict and interpersonal relationships are often intertwined. By building models that classify posts between the two subreddits and then examining the features with the most influential coefficients, this project aims to understand how people use language when describing general conflict compared with interpersonal conflict.

This notebook retrieves data from Reddit by looping over requests to each subreddit's .json listing.

## r/relationships & r/AmItheAsshole
Pull post data from the two subreddits.

In [11]:
import requests
import pandas as pd
import time
import random

# Get r/AmItheAsshole data

In [9]:
url = 'https://www.reddit.com/r/AmItheAsshole.json'

In [10]:
headers = {'User-agent': 'Alex_bot_0.1'}

In [11]:
res = requests.get(url, headers=headers)

In [12]:
res.status_code

200

In [13]:
posts = []
after = None

# request up to 40 pages, passing the previous page's 'after' token each time
for _ in range(40):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers=headers)

    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2, 5)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/AmItheAsshole.json
4
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k0nuyg
2
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k0ykuq
2
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k0ldcf
5
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k10bwi
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k0eysc
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k1acx4
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k11dhw
4
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k19kok
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k1bnlu
4
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k18kgj
5
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k16a3s
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k10ex2
4
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k17fe1
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k0wbnc
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k0y3fv
2
https://www.reddit.com/r
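
The request loop above is repeated for every pull in this notebook, so it could be wrapped in a helper. The following is a minimal sketch, not the code that produced the outputs shown here: `fetch_listing`, its `client` parameter, and the `pause` tuple are hypothetical names introduced for illustration. It assumes that Reddit returns `after` as `null` once a listing is exhausted; stopping at that point keeps the loop from starting over at the first page, which is one possible source of duplicate posts.

```python
import random
import time


def fetch_listing(url, headers, pages, client, pause=(2, 5)):
    """Collect post dicts from a paginated Reddit .json listing.

    `client` is any object with a requests-style
    get(url, headers=..., params=...) method (hypothetical parameter).
    """
    posts = []
    after = None
    for _ in range(pages):
        # the first request has no token; later ones pass the previous 'after'
        params = {'after': after} if after is not None else {}
        res = client.get(url, headers=headers, params=params)

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        listing = res.json()['data']
        posts.extend(child['data'] for child in listing['children'])
        after = listing['after']

        # assumption: a missing 'after' token means there are no more pages
        if after is None:
            break

        # random pause between requests, as in the loop above
        time.sleep(random.randint(*pause))
    return posts
```

Passing the `requests` module itself as `client`, for example `fetch_listing(url, headers, 40, requests)`, should behave like the loop above apart from the early stop.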

In [14]:
len(posts)

981

In [15]:
aita = pd.DataFrame(posts)
aita.tail()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
971,,AmItheAsshole,"So. I'm in XXclass, and waa waiting in line to...",t2_3ydo629l,False,,0,False,AITA for reporting one sixthgrade touching my ...,[],...,/r/AmItheAsshole/comments/k1124j/aita_for_repo...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606339000.0,0,,False,
972,,AmItheAsshole,My mom works as a house cleaner. She makes gre...,t2_6hjpialj,False,,0,False,AITA for not going to work with my mom?,[],...,/r/AmItheAsshole/comments/k0tym2/aita_for_not_...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606317000.0,0,,False,
973,,AmItheAsshole,"So, this happened a while ago. I've already a...",t2_5nnpsji8,False,,0,False,AITA for being rude to my mom's boyfriend?,[],...,/r/AmItheAsshole/comments/k1a3cy/aita_for_bein...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606372000.0,0,,False,
974,,AmItheAsshole,"So I often go to my cousins house, as I only h...",t2_87g1iu7h,False,,0,False,AITA for yelling at my parents?,[],...,/r/AmItheAsshole/comments/k14rv7/aita_for_yell...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606351000.0,0,,False,
975,,AmItheAsshole,I (24F) and my partner (25M) are unemployed an...,t2_91p94biw,False,,0,False,WIBTA for asking grandpa to leave his dog at h...,[],...,/r/AmItheAsshole/comments/k1bq6u/wibta_for_ask...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606380000.0,0,,False,
976,,AmItheAsshole,My boyfriend has been talking about getting a ...,t2_91l87fhb,False,,0,False,AITA for trying to prevent my bf from getting ...,[],...,/r/AmItheAsshole/comments/k16sbr/aita_for_tryi...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606359000.0,0,,False,
977,,AmItheAsshole,So I (23f) am a caregiver for my great grandmo...,t2_m7w2o,False,,0,False,WIBTA if I told these people to stop getting m...,[],...,/r/AmItheAsshole/comments/k0wqe5/wibta_if_i_to...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606326000.0,0,,False,
978,,AmItheAsshole,"Firstly, apologies if the format goes wrong I’...",t2_91glek3d,False,,0,False,WIBTA if I told my crush to stop venting about...,[],...,/r/AmItheAsshole/comments/k10tu1/wibta_if_i_to...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606338000.0,0,,False,
979,,AmItheAsshole,"I (25f) deal with parents, mainly my dad, who ...",t2_90i5z,False,,1,False,AITA for removing my stalker app after turning...,[],...,/r/AmItheAsshole/comments/k05ly2/aita_for_remo...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606227000.0,0,,False,
980,,AmItheAsshole,This happened in April and it's a long story \...,t2_3h29slkv,False,,0,False,AITA for telling a friend I'm here for them?,[],...,/r/AmItheAsshole/comments/k1bnlu/aita_for_tell...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2429506,1606380000.0,0,,False,


In [16]:
aita.shape

(981, 104)

In [17]:
aita.to_csv('aita.csv', index=False)

# Get r/relationships data

In [18]:
url = 'https://www.reddit.com/r/relationships/.json'

In [19]:
headers = {'User-agent': 'Alex_bot_0.1'}

In [20]:
res = requests.get(url, headers=headers)

In [21]:
res.status_code

200

In [22]:
posts = []
after = None

# request up to 40 pages, passing the previous page's 'after' token each time
for _ in range(40):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers=headers)

    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2, 5)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/relationships/.json
5
https://www.reddit.com/r/relationships/.json?after=t3_k140cv
4
https://www.reddit.com/r/relationships/.json?after=t3_k1b8vl
3
https://www.reddit.com/r/relationships/.json?after=t3_k15srz
4
https://www.reddit.com/r/relationships/.json?after=t3_k0v4hr
2
https://www.reddit.com/r/relationships/.json?after=t3_k0ywwg
3
https://www.reddit.com/r/relationships/.json?after=t3_k0z7r2
3
https://www.reddit.com/r/relationships/.json?after=t3_k0vgaf
5
https://www.reddit.com/r/relationships/.json?after=t3_k0l3x4
3
https://www.reddit.com/r/relationships/.json?after=t3_k0cv6w
4
https://www.reddit.com/r/relationships/.json?after=t3_k0l6g3
4
https://www.reddit.com/r/relationships/.json?after=t3_k0f7s9
4
https://www.reddit.com/r/relationships/.json?after=t3_jzsl7u
3
https://www.reddit.com/r/relationships/.json?after=t3_k0ag4u
2
https://www.reddit.com/r/relationships/.json?after=t3_k02sxc
3
https://www.reddit.com/r/relationships/.json?after=t3_jzmfq6
4
https://

In [26]:
rs = pd.DataFrame(posts)
rs.tail()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
812,,relationships,I have been teaching my MIL’s middle school cl...,t2_72h7mkf,False,,0,False,Am I (M29) in the right for wanting to ask my ...,[],...,/r/relationships/comments/k69w8g/am_i_m29_in_t...,all_ads,False,https://www.reddit.com/r/relationships/comment...,3005560,1607041000.0,0,,False,
813,,relationships,\nMy husband lies to me about almost everythin...,t2_5buyht28,False,,0,False,Is my husband falling out of love with me?,[],...,/r/relationships/comments/k69ug5/is_my_husband...,all_ads,False,https://www.reddit.com/r/relationships/comment...,3005560,1607041000.0,0,,False,
814,,relationships,Sorry if it comes out as dumb but I'm just con...,t2_441rqhjd,False,,0,False,A guy (25M) asked me (23F) to go skate next we...,[],...,/r/relationships/comments/k63jrw/a_guy_25m_ask...,all_ads,False,https://www.reddit.com/r/relationships/comment...,3005560,1607022000.0,0,,False,
815,,relationships,Clarification: As with all Asian family outsid...,t2_6f2sotlp,False,,0,False,Son stuck in grandma-mom feud. (Part I),[],...,/r/relationships/comments/k6bruf/son_stuck_in_...,all_ads,False,https://www.reddit.com/r/relationships/comment...,3005560,1607049000.0,0,,False,
816,,relationships,TLDR: I want to have copious amounts of sex bu...,t2_12i65vwy,False,,0,False,I'm (24F) am addicted to secret sex and affair...,[],...,/r/relationships/comments/k6q203/im_24f_am_add...,all_ads,False,https://www.reddit.com/r/relationships/comment...,3005560,1607105000.0,0,,False,


In [24]:
rs.shape

(994, 103)

# Dropping duplicate posts

In [66]:
aita = pd.read_csv('datasets/aita.csv')
aita.drop_duplicates(subset=['selftext', 'title'], keep='first', inplace=True)
aita.shape

(754, 104)

In [67]:
rs = pd.read_csv('datasets/rs.csv')
rs.drop_duplicates(subset=['selftext', 'title'], keep='first', inplace=True)
rs.shape

(470, 103)
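
The share of duplicates can also be measured directly instead of comparing shapes by hand. A minimal sketch, assuming a DataFrame with the `selftext` and `title` columns used above; `duplicate_share` is a hypothetical helper name and the toy frame is made-up data for illustration only:

```python
import pandas as pd


def duplicate_share(df, keys=('selftext', 'title')):
    """Fraction of rows that repeat an earlier row on the key columns."""
    return df.duplicated(subset=list(keys), keep='first').mean()


# toy data: the third row repeats the first
toy = pd.DataFrame({
    'title': ['t1', 't2', 't1', 't3'],
    'selftext': ['a', 'b', 'a', 'c'],
})
share = duplicate_share(toy)
print(share)
```

Calling `duplicate_share` on a frame before `drop_duplicates` is applied would give the same kind of ratio that is computed by hand in the next cells.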

#### After dropping duplicate posts, the dataset shrank considerably, so a second pull is made to collect more data.

In [68]:
# Share of r/AmItheAsshole posts that were dropped as duplicates
(981 - 754) / 981

0.23139653414882771

In [69]:
# Posts to pull: shortfall from 1,000 unique posts, inflated by the drop rate
(1000 - 754) * 1.23139653414882771

302.92354740061165

In [70]:
# Share of r/relationships posts that were dropped as duplicates
(994 - 468) / 994

0.5291750503018109

In [71]:
# Posts to pull: shortfall from 1,000 unique posts, inflated by the drop rate
(1000 - 468) * 1.5291750503018109

813.5211267605634

In [72]:
# Shortfall of unique r/relationships posts
1000 - 468

532
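
The arithmetic above can be folded into one function. This is a sketch under two assumptions that the notebook does not state: that each request returns about 25 posts (Reddit's default listing size), and that the number of requests for the second pull is obtained by rounding the estimate to the nearest page. `posts_to_pull` and `pages_to_pull` are hypothetical names.

```python
def posts_to_pull(target, unique_kept, pulled):
    """Shortfall from `target`, inflated by the share of `pulled` posts dropped."""
    drop_rate = (pulled - unique_kept) / pulled
    return (target - unique_kept) * (1 + drop_rate)


def pages_to_pull(estimated_posts, per_page=25):
    """Number of listing requests, rounded to the nearest page (assumed size 25)."""
    return round(estimated_posts / per_page)


# same inputs as the cells above
aita_estimate = posts_to_pull(target=1000, unique_kept=754, pulled=981)
rs_estimate = posts_to_pull(target=1000, unique_kept=468, pulled=994)
print(aita_estimate, pages_to_pull(aita_estimate))
print(rs_estimate, pages_to_pull(rs_estimate))
```

Under these assumptions the estimates come to roughly 303 and 814 posts, or about 12 and 33 pages. Multiplying by one plus the drop rate is the notebook's approximation; dividing the shortfall by one minus the drop rate would be a stricter estimate if a later pull lost posts at the same rate.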

# Second pull for r/AmItheAsshole

In [73]:
url = 'https://www.reddit.com/r/AmItheAsshole.json'

In [74]:
headers = {'User-agent': 'Alex_bot_0.1'}

In [75]:
res = requests.get(url, headers=headers)

In [76]:
res.status_code

200

In [77]:
posts = []
after = None

# request up to 12 pages, passing the previous page's 'after' token each time
for _ in range(12):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers=headers)

    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2, 5)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/AmItheAsshole.json
5
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7glbb
2
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7jey5
4
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7kxou
2
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7lqwx
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7n4rb
3
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k6pwvi
4
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7g6x7
5
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7k9l0
5
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7k47n
4
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k7ent6
4
https://www.reddit.com/r/AmItheAsshole.json?after=t3_k6xpiw
3


In [78]:
aita2 = pd.DataFrame(posts)
aita2.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
0,,AmItheAsshole,[This recent vice article](https://www.vice.co...,t2_7ojjg,False,,1,False,New Resources for Anyone Looking to Help Those...,[],...,/r/AmItheAsshole/comments/jbswil/new_resources...,all_ads,True,https://www.reddit.com/r/AmItheAsshole/comment...,2446895,1602785000.0,2,,False,
1,,AmItheAsshole,Welcome to the monthly open forum! This is the...,t2_2yspbtwq,False,,0,False,Monthly Open Forum December 2020,[],...,/r/AmItheAsshole/comments/k4owfz/monthly_open_...,all_ads,True,https://www.reddit.com/r/AmItheAsshole/comment...,2446895,1606842000.0,1,,False,
2,,AmItheAsshole,Husband (28m) and I (28f) live in Colorado. We...,t2_96l4cm06,False,,1,False,"AITA for ""dramatically"" running away from an i...",[],...,/r/AmItheAsshole/comments/k7evv4/aita_for_dram...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2446895,1607199000.0,1,,False,
3,,AmItheAsshole,I'm a freshman in college and before college I...,t2_96hrapyj,False,,0,False,AITA for giving my family the silent treatment...,[],...,/r/AmItheAsshole/comments/k7bpjd/aita_for_givi...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2446895,1607190000.0,1,,False,
4,,AmItheAsshole,I’m getting married next Saturday in a cocktai...,t2_54tn8n83,False,,0,False,AITA for demanding my mother leave with the we...,[],...,/r/AmItheAsshole/comments/k7dsd9/aita_for_dema...,all_ads,False,https://www.reddit.com/r/AmItheAsshole/comment...,2446895,1607196000.0,0,,False,


# Second pull for r/relationships

In [79]:
url = 'https://www.reddit.com/r/relationships/.json'

In [80]:
headers = {'User-agent': 'Alex_bot_0.1'}

In [81]:
res = requests.get(url, headers=headers)

In [82]:
res.status_code

200

In [83]:
posts = []
after = None

# request up to 33 pages, passing the previous page's 'after' token each time
for _ in range(33):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers=headers)

    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2, 5)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/relationships/.json
3
https://www.reddit.com/r/relationships/.json?after=t3_k7k700
2
https://www.reddit.com/r/relationships/.json?after=t3_k7n35q
3
https://www.reddit.com/r/relationships/.json?after=t3_k79avi
4
https://www.reddit.com/r/relationships/.json?after=t3_k7jikv
4
https://www.reddit.com/r/relationships/.json?after=t3_k7gwbm
3
https://www.reddit.com/r/relationships/.json?after=t3_k6kazl
2
https://www.reddit.com/r/relationships/.json?after=t3_k76bhk
3
https://www.reddit.com/r/relationships/.json?after=t3_k72vvm
4
https://www.reddit.com/r/relationships/.json?after=t3_k70e65
5
https://www.reddit.com/r/relationships/.json?after=t3_k6wfal
2
https://www.reddit.com/r/relationships/.json?after=t3_k6sk6t
5
https://www.reddit.com/r/relationships/.json?after=t3_k6pirw
4
https://www.reddit.com/r/relationships/.json?after=t3_k6989i
2
https://www.reddit.com/r/relationships/.json?after=t3_k69w8g
2
https://www.reddit.com/r/relationships/.json?after=t3_k5w5cc
2
https://

Checking that the ratio of posts between the two subreddits isn't too skewed this time.

In [84]:
rs2 = pd.DataFrame(posts)
rs2.shape

(818, 104)

In [85]:
aita2.drop_duplicates(subset=['selftext', 'title'], keep='first', inplace=True)
aita2.shape

(302, 104)

In [86]:
rs2.drop_duplicates(subset=['selftext', 'title'], keep='first', inplace=True)
rs2.shape

(466, 104)

In [87]:
aita3 = pd.concat([aita, aita2], axis=0, ignore_index=True)
aita3.shape

(1056, 104)

In [88]:
rs3 = pd.concat([rs,rs2], axis=0, ignore_index=True)
rs3.shape

(936, 104)

The ratio of r/AmItheAsshole to r/relationships posts is now down to 1056:936.
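
Each pull was deduplicated on its own before the concatenation, so a post captured in both pulls would still appear twice in the combined frame. A minimal sketch of removing such overlap, with `combine_unique` as a hypothetical helper and toy frames standing in for the real data:

```python
import pandas as pd


def combine_unique(first, second, keys=('selftext', 'title')):
    """Stack two pulls and keep the first copy of each post on the key columns."""
    combined = pd.concat([first, second], axis=0, ignore_index=True)
    deduped = combined.drop_duplicates(subset=list(keys), keep='first')
    return deduped.reset_index(drop=True)


# toy data: post 't2' appears in both pulls
pull_one = pd.DataFrame({'title': ['t1', 't2'], 'selftext': ['a', 'b']})
pull_two = pd.DataFrame({'title': ['t2', 't3'], 'selftext': ['b', 'c']})
merged = combine_unique(pull_one, pull_two)
print(merged.shape)
```

Whether the real pulls overlap is not shown in the outputs above; comparing `len(combine_unique(aita, aita2))` with 1056 would tell.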

# Exporting to CSV 

In [89]:
aita3.to_csv('datasets/aita3.csv', index=False)

In [90]:
rs3.to_csv('datasets/rs3.csv', index=False)