# Reddit: 'Front Page of the Internet'

![](./images/reddit_search_result.png)

## What does it take for a post to get there?
![](./images/reddit_top.png)

# BigQuery June 2017 Data for 'default' subreddits

![](./images/bg_table_desc.png)

# Getting Going with Spark

In [1]:
import findspark
# my local spark install
findspark.init('/Users/dreyco676/spark-2.2.0-bin-hadoop2.7/')

In [2]:
import pyspark
from pyspark.sql import SQLContext

# create spark contexts
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)

### Here we can import a CSV into a Spark DataFrame

In [53]:
df = sqlContext.read.csv('data/reddit_defaults_june17.csv', header=True, inferSchema=True)

In [4]:
df.count()

1087311

In [5]:
df.describe()

DataFrame[summary: string, created_utc: string, subreddit: string, author: string, domain: string, url: string, num_comments: string, score: string, ups: string, downs: string, title: string, selftext: string, saved: string, id: string, from_kind: string, gilded: string, from: string, stickied: string, retrieved_on: string, over_18: string, thumbnail: string, subreddit_id: string, hide_score: string, link_flair_css_class: string, author_flair_css_class: string, archived: string, is_self: string, from_id: string, permalink: string, name: string, author_flair_text: string, quarantine: string, link_flair_text: string, distinguished: string]

# ಠ_ಠ
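### Every column is reported as a string, but that is what `describe()` always returns
Its summary statistics come back as string columns regardless of the underlying types, so this output does not show what `inferSchema` actually chose. Use `df.printSchema()` or `df.dtypes` to check. The lists below group the columns by the type they should end up with.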

In [52]:
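# columns grouped by intended type, following the Reddit JSON field descriptions
# gilded is a count of gildings, so it belongs with the integers
int_cols = ['num_comments', 'score', 'ups', 'downs', 'gilded']
# over_18 is a true/false NSFW flag, so it belongs with the booleans
bool_cols = ['saved', 'stickied', 'hide_score', 'archived', 'is_self', 'quarantine', 'over_18']
date_cols = ['created_utc', 'retrieved_on']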
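The integer and boolean lists above are not used in any later cell, so here is a minimal sketch of how they might be applied. `build_cast_plan` and `apply_cast_plan` are hypothetical helper names, not part of PySpark; the only Spark calls assumed are `DataFrame.withColumn` and `Column.cast` with a type-name string.

```python
def build_cast_plan(int_cols, bool_cols):
    """Map each column name to the Spark SQL type name it should be cast to."""
    plan = {name: "int" for name in int_cols}
    plan.update({name: "boolean" for name in bool_cols})
    return plan


def apply_cast_plan(df, plan):
    """Cast every column named in the plan; columns not in the plan are untouched."""
    for name, type_name in plan.items():
        # withColumn with an existing name replaces that column in place
        df = df.withColumn(name, df[name].cast(type_name))
    return df
```

With the notebook's DataFrame this would be called as `df = apply_cast_plan(df, build_cast_plan(int_cols, bool_cols))`. Values that cannot be parsed as the target type become null after the cast, so it is worth checking the null counts afterwards.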

In [44]:
df[['created_utc']].show()

+-----------+
|created_utc|
+-----------+
| 1497045651|
| 1498078666|
| 1498318066|
| 1497534495|
| 1498002534|
| 1498821879|
| 1498490141|
| 1497556307|
| 1498388671|
| 1498143572|
| 1498444669|
| 1497528723|
| 1496679510|
| 1497188637|
| 1497751261|
| 1496930590|
| 1497231917|
| 1498746229|
| 1496829959|
| 1497894195|
+-----------+
only showing top 20 rows



## Our dates are in Unix Epoch time 
(seconds since 1970-01-01)

In [54]:
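from pyspark.sql.functions import from_unixtime

# convert epoch seconds to 'yyyy-MM-dd HH:mm:ss' strings
# (col_name avoids shadowing pyspark.sql.functions.col if it is imported later)
for col_name in date_cols:
    df = df.withColumn(col_name, from_unixtime(df[col_name]))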

In [55]:
df[['created_utc']].show()

+-------------------+
|        created_utc|
+-------------------+
|2017-06-09 17:00:51|
|2017-06-21 15:57:46|
|2017-06-24 10:27:46|
|2017-06-15 08:48:15|
|2017-06-20 18:48:54|
|2017-06-30 06:24:39|
|2017-06-26 10:15:41|
|2017-06-15 14:51:47|
|2017-06-25 06:04:31|
|2017-06-22 09:59:32|
|2017-06-25 21:37:49|
|2017-06-15 07:12:03|
|2017-06-05 11:18:30|
|2017-06-11 08:43:57|
|2017-06-17 21:01:01|
|2017-06-08 09:03:10|
|2017-06-11 20:45:17|
|2017-06-29 09:23:49|
|2017-06-07 05:05:59|
|2017-06-19 12:43:15|
+-------------------+
only showing top 20 rows
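`from_unixtime` renders timestamps in the Spark session's time zone, not in UTC. As a quick cross-check, the sketch below converts the first epoch value from the earlier output using only the standard library. `epoch_to_utc` is a hypothetical helper name used for illustration.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


first = epoch_to_utc(1497045651)
print(first.isoformat())  # 2017-06-09T22:00:51+00:00
```

The Spark output for the same row reads 17:00:51, five hours behind UTC. That would be consistent with the notebook having run on a machine set to US Central Daylight Time, though that is an inference from the offset and not something the notebook states.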



In [None]:
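# Use the langid module to classify each string's language, so that English-specific cleanup is only applied to English text
# https://github.com/saffsd/langid.py
from langid.langid import LanguageIdentifier, model

# norm_probs=True makes the score a probability in [0, 1];
# plain langid.classify returns an unnormalized score, so a .9 threshold would not mean 90%
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def check_lang(data_str):
    # null column values reach a UDF as None
    if data_str is None:
        return 'NA'
    language, prob = identifier.classify(data_str)
    # keep the predicted language code only when the classifier is at least 90% confident
    return language if prob >= .9 else 'NA'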

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap check_lang (defined in the cell above) as a Spark UDF that returns a string column
check_lang_udf = udf(check_lang, StringType())

In [None]:
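# predict the language of each post, then keep only rows classified as English with at least 90% confidence
# 'title' is an assumption: the schema above has no 'text' column, so swap in 'selftext' if that is the intended field
lang_df = df.withColumn("lang", check_lang_udf(df["title"]))
en_df = lang_df.filter(lang_df["lang"] == "en")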

Field reference: https://github.com/reddit/reddit/wiki/JSON