# Introduction to the pushshift dataset

***This notebook works best an EMR 6.1.0 (Haddop 3.2.1, Spark 3.0.0) cluster with 1 master-node (m5.xlarge) and 3 core-nodes (m5.xlarge)***.

---

The pushshift dataset is available in the `s3://bigdatateaching` under the `/reddit/parquet/` prefix. This notebook provides code to read this data so that you can identify the subreddits that are of interest to you and start working with the data.

In this notebook we do the following:

1. Setup a SparkSession.

1. Read the `submissions` and `comments` data into Pyspark dataframes and do so basic exploratory analysis.

1. Filter the subreddits of interest into new PySpark dataframes, save them into an S3 bucket in your account.

That's it, you are ready to work with reddit data for your project!

## Setup SparkSession

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.functions import isnan, when, count, col, lit
spark = SparkSession.builder.appName("reddit-bigdata-project").getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/22 23:32:47 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
23/03/22 23:32:56 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!


In [2]:
spark

## Read the `submissions` and `comments` data

In [3]:
comments = spark.read.parquet(
    "s3a://bigdatateaching/reddit/parquet/comments/yyyy=*/mm=*/*comments*.parquet",
    header=True
)
comments.show()

[Stage 2:>                                                          (0 + 1) / 1]

+-------------------+----------------------+--------------------+--------------------+----------------+-----------+-------------+------+------+-------+---------+----------+------------+-----+--------+--------------------+------------+
|             author|author_flair_css_class|   author_flair_text|                body|controversiality|created_utc|distinguished|edited|gilded|     id|  link_id| parent_id|retrieved_on|score|stickied|           subreddit|subreddit_id|
+-------------------+----------------------+--------------------+--------------------+----------------+-----------+-------------+------+------+-------+---------+----------+------------+-----+--------+--------------------+------------+
|          y26404986|                  null|                null|Mmmmm ... is this...|               0| 1641098735|         null|  null|     0|hqwcail|t3_rtzn1c| t3_rtzn1c|  1645567209|    1|   false|          RedditSets|   t5_3psukr|
|            iTwango|                  null|                

                                                                                

In [4]:
submissions = spark.read.parquet(
    "s3a://bigdatateaching/reddit/parquet/submissions/yyyy=*/mm=*/*submissions*.parquet",
    header=True
)
submissions.show()

[Stage 5:>                                                          (0 + 1) / 1]

+------------------+----------------------+--------------------+-----------+-------------+--------------------+-------------+------+-------+------+------------+-------+----------+------------+-----+--------------------+--------+--------------------+------------+--------------------+--------------------+
|            author|author_flair_css_class|   author_flair_text|created_utc|distinguished|              domain|       edited|    id|is_self|locked|num_comments|over_18|quarantine|retrieved_on|score|            selftext|stickied|           subreddit|subreddit_id|               title|                 url|
+------------------+----------------------+--------------------+-----------+-------------+--------------------+-------------+------+-------+------+------------+-------+----------+------------+-----+--------------------+--------+--------------------+------------+--------------------+--------------------+
|         [deleted]|                  null|                null| 1640997058|         

                                                                                

In [5]:
%%time
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}") 



shape of the submissions dataframe is 313,972,841x21
CPU times: user 11.3 ms, sys: 12.4 ms, total: 23.7 ms
Wall time: 9.16 s


                                                                                

In [6]:
%%time
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}") 



shape of the comments dataframe is 3,076,907,105x17
CPU times: user 130 ms, sys: 18.6 ms, total: 149 ms
Wall time: 57.4 s


                                                                                

## Exploratory data analysis

Let us find the number of submissions and comments per subreddit. We will use SparkSQL for this.

In [7]:
submissions.createOrReplaceTempView("submissions")
comments.createOrReplaceTempView("comments")

In [8]:
%%time

sql_str="select subreddit, count(id) as count from submissions group by subreddit order by count desc"
submissions_subreddit_count = spark.sql(sql_str)
submissions_subreddit_count.show()



+-------------------+-------+
|          subreddit|  count|
+-------------------+-------+
|          AskReddit|2820312|
|       dirtykikpals|1797324|
|           dirtyr4r|1645658|
|        GaySnapchat|1294169|
|          jerkbudss|1226116|
|          teenagers| 992427|
|      DirtySnapchat| 924343|
|      FreeKarma4You| 740495|
|relationship_advice| 714845|
|      AutoNewspaper| 701442|
|        MassiveCock| 694869|
|            FemBoys| 653319|
|               cock| 642364|
|              memes| 631451|
|     wifepictrading| 621277|
|          Eldenring| 616378|
|           gonewild| 607159|
|         needysluts| 584244|
|     PokemonGoRaids| 575693|
|      Roleplaybuddy| 555313|
+-------------------+-------+
only showing top 20 rows

CPU times: user 56.7 ms, sys: 15.5 ms, total: 72.2 ms
Wall time: 35.2 s


                                                                                

Save it to a file

In [9]:
%%time
submissions_subreddit_count.toPandas().to_csv("submissions_per_subreddit.csv", index=None)

                                                                                

CPU times: user 5.05 s, sys: 522 ms, total: 5.58 s
Wall time: 56.3 s


In [10]:
%%time

sql_str="select subreddit, count(id) as count from comments group by subreddit order by count desc"
comment_subreddit_count = spark.sql(sql_str)
comment_subreddit_count.show()



+-------------------+--------+
|          subreddit|   count|
+-------------------+--------+
|          AskReddit|77932752|
|      AmItheAsshole|27773841|
|          teenagers|23794403|
|          worldnews|18548934|
|      FreeKarma4You|16380561|
|        FreeKarma4U|16204497|
|     wallstreetbets|15137889|
|           politics|13371145|
|                nfl|13305635|
|              memes|12549874|
|                nba|12302911|
|             soccer|12286831|
|            TrueFMK|12103484|
|           antiwork|11165889|
|     PublicFreakout|10774101|
|relationship_advice|10238985|
|          Eldenring|10238395|
|           facepalm|10060271|
|               news|10013591|
|  interestingasfuck| 9479187|
+-------------------+--------+
only showing top 20 rows

CPU times: user 428 ms, sys: 92.6 ms, total: 520 ms
Wall time: 3min 29s


                                                                                

Save it to a file

In [11]:
%%time
comment_subreddit_count.toPandas().to_csv("comments_per_subreddit.csv", index=None)

                                                                                

CPU times: user 2.61 s, sys: 363 ms, total: 2.98 s
Wall time: 5min 51s


## Filter for a subreddit of interest

In [11]:
subreddit = 'worldnews' # replace this with something that is of interest to you

In [12]:
%%time

sql_str=f"select * from submissions where subreddit='{subreddit}'"
submissions_filtered = spark.sql(sql_str)
submissions_filtered.show()

[Stage 26:>                                                         (0 + 1) / 1]

+------------------+----------------------+-----------------+-----------+-------------+--------------------+------+------+-------+------+------------+-------+----------+------------+-----+---------+--------+---------+------------+--------------------+--------------------+
|            author|author_flair_css_class|author_flair_text|created_utc|distinguished|              domain|edited|    id|is_self|locked|num_comments|over_18|quarantine|retrieved_on|score| selftext|stickied|subreddit|subreddit_id|               title|                 url|
+------------------+----------------------+-----------------+-----------+-------------+--------------------+------+------+-------+------+------------+-------+----------+------------+-----+---------+--------+---------+------------+--------------------+--------------------+
|           legmeta|                  null|             null| 1640995268|         null|uk.finance.yahoo.com|  null|rt6pjn|  false| false|          25|  false|     false|  1654199630

                                                                                

In [13]:
%%time
sql_str=f"select * from comments where subreddit='{subreddit}'"
comments_filtered = spark.sql(sql_str)
comments_filtered.show()


[Stage 27:>                                                         (0 + 1) / 1]

+--------------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-------------+------+-------+---------+----------+------------+-----+--------+---------+------------+
|              author|author_flair_css_class|author_flair_text|                body|controversiality|created_utc|distinguished|       edited|gilded|     id|  link_id| parent_id|retrieved_on|score|stickied|subreddit|subreddit_id|
+--------------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+-------------+------+-------+---------+----------+------------+-----+--------+---------+------------+
|           [deleted]|                  null|             null|           [removed]|               0| 1641098736|         null|         null|     0|hqwcan2|t3_rtv67k| t3_rtv67k|  1645567209|    2|   false|worldnews|    t5_2qh13|
|      Dark_Vulture83|                  null|             null|Did Russia learn ...|

                                                                                

## Save the dataset to your own S3 bucket

In [14]:
target_bucket = "ss4608-ppol-567"

In [15]:
%%time
comments_filtered.write.mode("overwrite").parquet(f"s3a://{target_bucket}/{subreddit}/comments")

                                                                                

CPU times: user 1.6 s, sys: 323 ms, total: 1.92 s
Wall time: 23min 54s


In [16]:
%%time
submissions_filtered.write.mode("overwrite").parquet(f"s3a://{target_bucket}/{subreddit}/submissions")

                                                                                

CPU times: user 160 ms, sys: 27.2 ms, total: 187 ms
Wall time: 2min 29s


#### Confirm that we can read what we just wrote

In [17]:
%%time
submissions = spark.read.parquet(
    f"s3a://{target_bucket}/{subreddit}/submissions",
    header=True
)
submissions.show()

[Stage 31:>                                                         (0 + 1) / 1]

+------------------+----------------------+-----------------+-----------+-------------+--------------------+------+------+-------+------+------------+-------+----------+------------+-----+---------+--------+---------+------------+--------------------+--------------------+
|            author|author_flair_css_class|author_flair_text|created_utc|distinguished|              domain|edited|    id|is_self|locked|num_comments|over_18|quarantine|retrieved_on|score| selftext|stickied|subreddit|subreddit_id|               title|                 url|
+------------------+----------------------+-----------------+-----------+-------------+--------------------+------+------+-------+------+------------+-------+----------+------------+-----+---------+--------+---------+------------+--------------------+--------------------+
|         Individ27|                  null|             null| 1645942501|         null|         twitter.com|  null|t2ho9l|  false| false|           1|  false|     false|  1654093542

                                                                                

In [22]:
%%time
submissions.count()



CPU times: user 4.16 ms, sys: 3.11 ms, total: 7.27 ms
Wall time: 3.07 s


                                                                                

170144

In [20]:
%%time
comments = spark.read.parquet(
    f"s3a://{target_bucket}/{subreddit}/comments",
    header=True
)
comments.show()

+-------------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+------+------+-------+---------+----------+------------+-----+--------+----------+------------+
|             author|author_flair_css_class|author_flair_text|                body|controversiality|created_utc|distinguished|edited|gilded|     id|  link_id| parent_id|retrieved_on|score|stickied| subreddit|subreddit_id|
+-------------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+------+------+-------+---------+----------+------------+-----+--------+----------+------------+
|  a_simple_creature|                  null|             null|He wouldn’t be el...|               0| 1650888877|         null|  null|     0|i64d10c|t3_ub4gtq|t1_i63c4et|  1655550748|    1|   false|technology|    t5_2qh16|
|          cutestain|                  null|             null|I don't know. I w...|               0| 1650888880|

In [18]:
%time
comments.count()

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


                                                                                

3076907105