### Reddit dataset

From inspection of the two data tables `submissions` and `comments` and their attributes, it appears that the dataset is fairly denormalized.
For instance, most entries in both `submissions` and `comments` carry the attributes `subreddit`, `subreddit_id`, `subreddit_name_prefixed`, and `subreddit_type`. `submissions` also contains `subreddit_subscribers`, the number of subscribers to the subreddit. `subreddit_name_prefixed` is the subreddit name with its prefix, which can be constructed from `subreddit` as `r/[subreddit]`; it is therefore redundant and can be excluded.
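
To make the redundancy concrete, here is a minimal sketch of how the prefixed name could be rebuilt from `subreddit`. The helper names `prefixed_name` and `is_redundant` are hypothetical, and the sketch assumes every row follows the `r/[subreddit]` pattern, which would be worth checking against the data before dropping the column.

```python
def prefixed_name(subreddit: str) -> str:
    """Build the prefixed form `r/<subreddit>` from a bare subreddit name."""
    return f"r/{subreddit}"


def is_redundant(subreddit: str, subreddit_name_prefixed: str) -> bool:
    """Return True when the stored prefixed name adds nothing beyond `subreddit`."""
    return subreddit_name_prefixed == prefixed_name(subreddit)
```

Applying `is_redundant` to each `(subreddit, subreddit_name_prefixed)` pair would flag any rows that break the assumed pattern.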


Given these observations, `subreddit` can be split out into its own entity set, which makes it easier to retrieve information about subreddits. This is reasonable because our dataset contains posts and comments from only 13 different subreddits.
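
As a rough sketch of what this normalization could look like, the snippet below uses an in-memory SQLite database as a stand-in for the PostgreSQL instance, with a few made-up rows. The table name `subreddits`, the sample values, and the column selection are assumptions for illustration only, not the schema of the actual dataset.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (assumption for illustration).
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# A toy denormalized `submissions` table; all rows below are made up.
demo_cur.execute(
    """
    CREATE TABLE submissions (
        id TEXT PRIMARY KEY,
        subreddit TEXT,
        subreddit_id TEXT,
        subreddit_type TEXT
    )
    """
)
demo_cur.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?, ?)",
    [
        ("s1", "python", "t5_aaa", "public"),
        ("s2", "python", "t5_aaa", "public"),
        ("s3", "datascience", "t5_bbb", "public"),
    ],
)

# Pull the repeated subreddit attributes out into their own table (hypothetical name).
demo_cur.execute(
    """
    CREATE TABLE subreddits AS
    SELECT DISTINCT subreddit_id, subreddit, subreddit_type
    FROM submissions
    """
)

# Each subreddit now appears once, however many submissions reference it.
demo_cur.execute(
    "SELECT subreddit_id, subreddit, subreddit_type FROM subreddits ORDER BY subreddit_id"
)
subreddit_rows = demo_cur.fetchall()
print(subreddit_rows)

demo_conn.close()
```

After this step, `submissions` would only need to keep `subreddit_id` as a reference to the new table.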

In the code below, we further explore the subreddit information stored in both `submissions` and `comments` and create a separate table for `subreddit`.

In [9]:
import os
from dotenv import load_dotenv

import psycopg2

#### Connect to database using `psycopg2`

In [16]:
load_dotenv()

conn = psycopg2.connect(
    dbname='reddit',
    user=os.getenv('DB_USERNAME'),
    password=os.getenv('DB_PASSWORD'),
    host=os.getenv('DB_HOST'),
    port=os.getenv('DB_PORT')
)

cur = conn.cursor()

#### Execute a query on `submissions`

In [17]:
# Execute a query
cur.execute("SELECT id FROM submissions ORDER BY RANDOM() LIMIT 5")

# Retrieve query results
records = cur.fetchall()
records

[('15pw0eg',), ('7ransz',), ('evo0t6',), ('147w7g2',), ('83h1ka',)]

#### Execute a query on `comments`

In [18]:
# Execute a query
cur.execute("SELECT id FROM comments ORDER BY RANDOM() LIMIT 5")

# Retrieve query results
records = cur.fetchall()
records

[('k7gwts2',), ('jh3h9v7',), ('fbg6hxw',), ('k2u403c',), ('f3staxv',)]
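
The same query pattern could be used to list the distinct subreddits in each table. The helper below is a sketch, not part of the original notebook: `distinct_subreddits` and `ALLOWED_TABLES` are hypothetical names, and it assumes both tables expose `subreddit_id` and `subreddit` columns as described above. It only relies on the cursor's `execute` and `fetchall` methods, so it should work with the `psycopg2` cursor created earlier.

```python
# Table names cannot be passed as query parameters, so restrict them to a known set.
ALLOWED_TABLES = {"submissions", "comments"}


def distinct_subreddits(cur, table):
    """Return the distinct (subreddit_id, subreddit) pairs found in `table`."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    cur.execute(
        f"SELECT DISTINCT subreddit_id, subreddit FROM {table} ORDER BY subreddit"
    )
    return cur.fetchall()
```

For example, `len(distinct_subreddits(cur, "submissions"))` would give the number of distinct subreddits among the submissions.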