# Session 10

Objective: analyze the latest Reddit posts using their JSON data.

In [1]:
URL = "https://www.reddit.com/r/all.json"
URL

'https://www.reddit.com/r/all.json'

In [2]:
import requests  # Not part of the stdlib

In [3]:
response = requests.get(URL)
response

<Response [429]>

<div class="alert alert-warning">If the HTTP status is <code>429</code>, download the data from your browser or try again later.</div>

Common HTTP status codes include:

- `200 OK`: Everything went okay
- `404 NOT FOUND`: The URL was not found (the best-known status, because browsers show it directly to users)
- `429 TOO MANY REQUESTS`: The resource is rate limited
- `500 INTERNAL SERVER ERROR`: The server broke while handling the request
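The standard library knows these codes too. As a small sketch (the helper name `describe_status` is made up for illustration), `http.HTTPStatus` maps a numeric code to its reason phrase:

```python
from http import HTTPStatus


def describe_status(code):
    """Return a string such as '429 Too Many Requests' for a numeric code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    return f"{status.value} {status.phrase}"


# The four codes listed above
for code in (200, 404, 429, 500):
    print(describe_status(code))
```

This needs no network access, so it is a convenient way to look up what an unfamiliar `response.status_code` means.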

In [7]:
response.status_code

200

The `.json()` method of the response parses the JSON body and returns the equivalent Python object:

In [8]:
data = response.json()
type(data)

dict

In [11]:
data.keys()

dict_keys(['kind', 'data'])

In [13]:
type(data["data"]["children"])

list

To write that JSON to disk, you need the `json` module:

In [2]:
import json  # belongs to the standard library (stdlib)

In [10]:
# "with" closes the file automatically
with open("reddit_all.json", mode="w") as fh:  # fh = file handle
    json.dump(data, fh)

To read it:

In [3]:
with open("reddit_all.json") as fh:
    data = json.load(fh)
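To convince yourself that writing and reading give back the same object, here is a minimal round-trip sketch. The `sample` dictionary is a made-up stand-in for the real response, and the file lives in a temporary directory so nothing is left on disk:

```python
import json
import tempfile
from pathlib import Path

# Made-up stand-in for the real data (same nesting, a single post)
sample = {
    "kind": "Listing",
    "data": {"children": [{"kind": "t3", "data": {"id": "abc123"}}]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.json"

    with open(path, mode="w") as fh:
        json.dump(sample, fh)

    with open(path) as fh:
        restored = json.load(fh)

# True when the object survived the round trip unchanged
print(restored == sample)
```

The round trip is exact here because the sample only contains dictionaries with string keys, lists, and strings. Tuples, for instance, would come back as lists.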

Now you can proceed to analyze the posts. How many of them are there?

In [4]:
posts = data["data"]["children"]
len(posts)

25

How many different "kinds"?

In [5]:
kinds = {post["kind"] for post in posts}
kinds

{'t3'}

If there is only one kind, we can ignore it: it does not distinguish any post from another.

Now, are the IDs unique?

In [6]:
ids = {post["data"]["id"] for post in posts}
len(ids) == len(posts)

True

The answer is yes!
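Had the answer been no, the set comparison would not tell us which IDs are repeated. One possible way to find them, sketched here with made-up IDs, is `collections.Counter`:

```python
from collections import Counter

# Made-up IDs; "b2" appears twice on purpose
sample_ids = ["a1", "b2", "c3", "b2"]

counts = Counter(sample_ids)

# Keep only the IDs seen more than once
duplicates = [post_id for post_id, n in counts.items() if n > 1]
print(duplicates)
```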

Let's extract the different subreddits:

In [7]:
subreddits = {}
for post in posts:
    post_id = post["data"]["id"]
    subreddits[post_id] = post["data"]["subreddit"]

subreddits

{'yvlc6y': 'pics',
 'yvjde7': 'politics',
 'yvj89t': 'aww',
 'yvjf3r': 'news',
 'yvje97': 'WhitePeopleTwitter',
 'yvjq82': 'nextfuckinglevel',
 'yvj8du': 'LeopardsAteMyFace',
 'yvkn7w': 'StarWars',
 'yvjnuu': 'lego',
 'yvjwa9': 'coolguides',
 'yvi52x': 'Damnthatsinteresting',
 'yvgxs9': 'antiwork',
 'yvj2ic': 'nba',
 'yvkts9': 'technicallythetruth',
 'yvm23u': 'worldnews',
 'yvhs31': 'wholesomememes',
 'yvl55a': 'AnimalsBeingBros',
 'yviu2n': 'TikTokCringe',
 'yvh7wv': 'FunnyAnimals',
 'yvh3zx': 'Gamingcirclejerk',
 'yvk7an': 'pics',
 'yveu7r': 'ProgrammerHumor',
 'yvfxt4': 'pcmasterrace',
 'yvf9ti': 'wallstreetbets',
 'yvmju0': 'meirl'}
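Note that `pics` shows up twice in the output above. To count posts per subreddit, a `Counter` over the dictionary values is enough. The sketch below uses three of the entries shown above rather than the full dictionary:

```python
from collections import Counter

# Three entries copied from the output above
some_subreddits = {
    "yvlc6y": "pics",
    "yvjde7": "politics",
    "yvk7an": "pics",
}

posts_per_subreddit = Counter(some_subreddits.values())

# most_common(1) returns a list with the top (name, count) pair
print(posts_per_subreddit.most_common(1))
```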

We want a trimmed-down version of each post to make the analysis easier. Let's add an `is_memes` key that is `True` if the name of the post's subreddit contains the substring "meme" (ignoring case).

In [8]:
def trim_post(post):
    is_memes = "meme" in post["data"]["subreddit"].lower()
    return {
        "id": post["data"]["id"],
        "subreddit": post["data"]["subreddit"],
        "upvote_ratio": post["data"]["upvote_ratio"],
        "is_memes": is_memes,
    }

In [9]:
trimmed_posts = [trim_post(post) for post in posts]
trimmed_posts[:3]

[{'id': 'yvlc6y',
  'subreddit': 'pics',
  'upvote_ratio': 0.95,
  'is_memes': False},
 {'id': 'yvjde7',
  'subreddit': 'politics',
  'upvote_ratio': 0.9,
  'is_memes': False},
 {'id': 'yvj89t', 'subreddit': 'aww', 'upvote_ratio': 0.96, 'is_memes': False}]

And finally, let's see if there's a difference in upvote ratio depending on whether the post belongs to a memes subreddit or not:

In [10]:
scores = {
    True: [],
    False: [],
}

for post in trimmed_posts:
    category = post["is_memes"]
    upvote_ratio = post["upvote_ratio"]

    scores[category].append(upvote_ratio)

scores

{True: [0.97],
 False: [0.95,
  0.9,
  0.96,
  0.85,
  0.96,
  0.96,
  0.96,
  0.93,
  0.96,
  0.87,
  0.87,
  0.95,
  0.97,
  0.98,
  0.94,
  0.97,
  0.92,
  0.97,
  0.94,
  0.84,
  0.93,
  0.92,
  0.94,
  0.98]}
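The grouping pattern above works because both keys (`True` and `False`) are known in advance. When the categories are not known beforehand, for example when grouping by subreddit, `collections.defaultdict` can create the empty lists on demand. A minimal sketch with made-up records:

```python
from collections import defaultdict

# Made-up records with the same keys as the trimmed posts
sample_posts = [
    {"subreddit": "pics", "upvote_ratio": 0.95},
    {"subreddit": "aww", "upvote_ratio": 0.96},
    {"subreddit": "pics", "upvote_ratio": 0.90},
]

# Missing keys start out as an empty list
by_subreddit = defaultdict(list)

for post in sample_posts:
    by_subreddit[post["subreddit"]].append(post["upvote_ratio"])

print(dict(by_subreddit))
```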

In [11]:
from statistics import mean

{
    category: mean(score_list)
    for category, score_list
    in scores.items()
}

{True: 0.97, False: 0.9341666666666667}

Or, alternatively, using pandas:

In [12]:
import pandas as pd

In [13]:
df = pd.DataFrame.from_records(trimmed_posts)
df.head()

| | id | subreddit | upvote_ratio | is_memes |
|---|---|---|---|---|
| 0 | yvlc6y | pics | 0.95 | False |
| 1 | yvjde7 | politics | 0.9 | False |
| 2 | yvj89t | aww | 0.96 | False |
| 3 | yvjf3r | news | 0.85 | False |
| 4 | yvje97 | WhitePeopleTwitter | 0.96 | False |


In [15]:
df.groupby("is_memes")["upvote_ratio"].describe().loc[:, "count":"std"]

| is_memes | count | mean | std |
|---|---|---|---|
| False | 24.0 | 0.934167 | 0.040531 |
| True | 1.0 | 0.97 | NaN |
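The `True` group has no standard deviation because it contains a single post, and the sample standard deviation needs at least two values. The `statistics` module is stricter than pandas here: instead of returning a missing value, it raises an exception, as this short sketch shows:

```python
from statistics import StatisticsError, mean, stdev

meme_scores = [0.97]  # the only value in the True group

# The mean of a single value is that value
print(mean(meme_scores))

try:
    stdev(meme_scores)
except StatisticsError as error:
    # Raised because one data point is not enough
    print("stdev is undefined:", error)
```

With only one meme post among 25, the difference between the two means should not be read as evidence of anything.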


### Appendix: `**kwargs`

The special `**kwargs` syntax is often found in function definitions. The important part is not the name of the variable (although most people use `kwargs`), but the double star `**`: in a function definition, it collects all the keyword arguments that do not match another parameter into a dictionary. At a call site, `**` does the opposite and **unpacks** a dictionary into keyword arguments.

In [16]:
def accept_parameters(**kwargs):
    return kwargs

In [17]:
accept_parameters(a=1, b=[2, 3, 4])

{'a': 1, 'b': [2, 3, 4]}
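The double star also works in the other direction. As a small sketch (the `options` dictionary is made up for illustration), writing `**` in front of a dictionary in a call passes its items as keyword arguments:

```python
def accept_parameters(**kwargs):
    return kwargs


options = {"a": 1, "b": [2, 3, 4]}

# Equivalent to accept_parameters(a=1, b=[2, 3, 4])
result = accept_parameters(**options)

print(result == options)  # same keys and values
print(result is options)  # but a new dictionary object
```

The function receives a fresh dictionary, so adding or removing keys in `kwargs` inside the function does not affect the caller's `options`.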