# Machine Learning

In [20]:
import bz2
import json
from pyspark.ml.feature import *
# classes of interest: CountVectorizer, Tokenizer, RegexTokenizer

## Prepare Data

### Load data from file
The data is stored as a bz2-compressed file, so we can use Python's bz2 library to iterate through it without decompressing it to disk first.

In [5]:
# load compressed file
reddit_comments = bz2.BZ2File('RC_2015-01.bz2', "r")
# Read first line and load as json
json.loads(reddit_comments.readline())

{u'archived': False,
 u'author': u'YoungModern',
 u'author_flair_css_class': None,
 u'author_flair_text': None,
 u'body': u'Most of us have some family members like this. *Most* of my family is like this. ',
 u'controversiality': 0,
 u'created_utc': u'1420070400',
 u'distinguished': None,
 u'downs': 0,
 u'edited': False,
 u'gilded': 0,
 u'id': u'cnas8zv',
 u'link_id': u't3_2qyr1a',
 u'name': u't1_cnas8zv',
 u'parent_id': u't3_2qyr1a',
 u'retrieved_on': 1425124282,
 u'score': 14,
 u'score_hidden': False,
 u'subreddit': u'exmormon',
 u'subreddit_id': u't5_2r0gj',
 u'ups': 14}

### Create DataFrame object

Load the first 1000 comments and create a Spark DataFrame. This DataFrame will be used to test the machine learning methods below.

In [33]:
reddit_comments = bz2.BZ2File('RC_2015-01.bz2', "r")

# total number of lines to read
N = 1000
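json_lines = []
# this approach is not memory efficient: every line is held in a Python list,
# so reading the full dataset this way would require a very large list
for _ in range(N):
    json_lines.append(reddit_comments.readline())

# create an RDD from the raw JSON lines and parse it into a DataFrame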
commentRDD = sc.parallelize(json_lines)
commentDF = spark.read.json(commentRDD)

# remove comments whose body text has been deleted
commentDF = commentDF.filter("body != '[deleted]'")

print("Total number of comments retrieved: %s" % len(json_lines))
print("Total number of comments with body not deleted: %s" % commentDF.count())

Total number of comments retrieved: 1000
Total number of comments with body not deleted: 935
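
The loop above calls readline a fixed number of times, which appends empty strings if the file has fewer than N lines. As a minimal sketch of an alternative, assuming Python 3 (where bz2.open supports text mode), itertools.islice can take at most N lines and stop early at end of file. The helper name read_first_lines and the small sample file are hypothetical and only there to make the example self-contained.

```python
import bz2
import json
import os
import tempfile
from itertools import islice


def read_first_lines(path, n):
    """Return up to n decoded lines from a bz2 file without reading the rest."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        # islice stops after n lines, or earlier if the file is shorter
        return [line.rstrip("\n") for line in islice(handle, n)]


# Build a tiny throwaway bz2 file so the sketch runs without the Reddit dump.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as handle:
    for i in range(5):
        handle.write(json.dumps({"id": i, "body": "comment %d" % i}) + "\n")

first_three = read_first_lines(sample_path, 3)
print(len(first_three))
print(json.loads(first_three[0])["body"])
```

The returned list of strings could be passed to sc.parallelize in the same way as json_lines.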


In [14]:
print("Column names of comment DataFrame:")
print(commentDF.columns)

Column names of comment DataFrame:
['archived', 'author', 'author_flair_css_class', 'author_flair_text', 'body', 'controversiality', 'created_utc', 'distinguished', 'downs', 'edited', 'gilded', 'id', 'link_id', 'name', 'parent_id', 'retrieved_on', 'score', 'score_hidden', 'subreddit', 'subreddit_id', 'ups']


Create a subset of the comment DataFrame containing only the id, upvotes, and body columns.

In [23]:
sentenceDF = commentDF.select('id','ups','body')
sentenceDF.show()

+-------+---+--------------------+
|     id|ups|                body|
+-------+---+--------------------+
|cnas8zv| 14|Most of us have s...|
|cnas8zw|  3|But Mill's career...|
|cnas8zx|  1|Mine uses a strai...|
|cnas8zy|  1|           [deleted]|
|cnas8zz|  2|Very fast, thank ...|
|cnas900|  6|The guy is a prof...|
|cnas901|  1|This is a great q...|
|cnas902|  1|Is the IE-Shiv-Gh...|
|cnas903|  1|                 :D.|
|cnas905|  2|I don't know how ...|
|cnas906|  2|       says you my g|
|cnas907| 10|/r/Im14andthisisf...|
|cnas908|  1|  i love this music!|
|cnas909|  2|You mean the vill...|
|cnas90a|  2|I always forget h...|
|cnas90b|  1|           [deleted]|
|cnas90c|  1|If you enjoy deep...|
|cnas90d|  1|           [deleted]|
|cnas90e|  1|Haha awesome man ...|
|cnas90f|  3|I completely agre...|
+-------+---+--------------------+
only showing top 20 rows



Use the Tokenizer to split each comment body into an array of lowercase words.

In [24]:
# use the PySpark Tokenizer to split each body into an array of words
tokenizer = Tokenizer(inputCol="body", outputCol="words")
wordsDF = tokenizer.transform(sentenceDF)
wordsDF.show()

+-------+---+--------------------+--------------------+
|     id|ups|                body|               words|
+-------+---+--------------------+--------------------+
|cnas8zv| 14|Most of us have s...|[most, of, us, ha...|
|cnas8zw|  3|But Mill's career...|[but, mill's, car...|
|cnas8zx|  1|Mine uses a strai...|[mine, uses, a, s...|
|cnas8zy|  1|           [deleted]|         [[deleted]]|
|cnas8zz|  2|Very fast, thank ...|[very, fast,, tha...|
|cnas900|  6|The guy is a prof...|[the, guy, is, a,...|
|cnas901|  1|This is a great q...|[this, is, a, gre...|
|cnas902|  1|Is the IE-Shiv-Gh...|[is, the, ie-shiv...|
|cnas903|  1|                 :D.|               [:d.]|
|cnas905|  2|I don't know how ...|[i, don't, know, ...|
|cnas906|  2|       says you my g|  [says, you, my, g]|
|cnas907| 10|/r/Im14andthisisf...|[/r/im14andthisis...|
|cnas908|  1|  i love this music!|[i, love, this, m...|
|cnas909|  2|You mean the vill...|[you, mean, the, ...|
|cnas90a|  2|I always forget h...|[i, always, fo
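
The tokens in the output above keep their punctuation (for example "fast," and ":d."), because Tokenizer lowercases the text and splits only on whitespace. The plain-Python sketch below illustrates the difference between that behavior and splitting on non-word characters, which is roughly what RegexTokenizer does when given pattern="\\W". The function names and the sample sentence are hypothetical, and this is an approximation for illustration, not Spark's implementation.

```python
import re


def whitespace_tokens(text):
    """Lowercase and split on whitespace, so punctuation stays attached."""
    return text.lower().split()


def word_tokens(text):
    """Lowercase and split on runs of non-word characters, dropping empty pieces."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


sample = "Very fast, thanks!"  # hypothetical sentence for illustration
print(whitespace_tokens(sample))  # ['very', 'fast,', 'thanks!']
print(word_tokens(sample))        # ['very', 'fast', 'thanks']
```

Note that splitting on non-word characters also breaks contractions apart, so "don't" becomes "don" and "t".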

Remove stopwords from the words column.
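
No code accompanies this step, so here is a minimal sketch under stated assumptions. The first function shows the operation itself in plain Python. The second wraps StopWordsRemover from pyspark.ml.feature, which filters an array-of-strings column against a stop list (a default English list unless stopWords is supplied). The function names, the output column name "filtered", and the small stop list are hypothetical choices for illustration.

```python
def filter_stopwords(tokens, stop_words):
    """Drop tokens found in stop_words, comparing case-insensitively."""
    lowered = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in lowered]


def remove_stopwords(words_df, input_col="words", output_col="filtered"):
    """Apply Spark's StopWordsRemover to a DataFrame with an array-of-strings column."""
    # Imported here so this sketch can be loaded without a Spark installation.
    from pyspark.ml.feature import StopWordsRemover

    remover = StopWordsRemover(inputCol=input_col, outputCol=output_col)
    return remover.transform(words_df)


# Hypothetical stop list for illustration only.
tokens = ["most", "of", "us", "have", "some", "family"]
print(filter_stopwords(tokens, {"of", "us", "some"}))  # ['most', 'have', 'family']
```

With the DataFrame from the tokenizer step, a call such as remove_stopwords(wordsDF) would add a "filtered" column alongside "words".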

Use Spark's MLlib library to create a bag-of-words representation of the text.

In [None]:
# Input data: each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
result = model.transform(df)
result.show()
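
To make the vocabSize and minDF parameters concrete, the plain-Python sketch below builds the same kind of count vectors by hand for the two example rows. It is an illustration of the bag-of-words idea, not Spark's implementation: the function names are hypothetical, the vectors are dense lists rather than Spark's sparse vectors, and ties in term frequency are broken alphabetically, which is an assumption of this sketch.

```python
from collections import Counter


def build_vocabulary(docs, vocab_size, min_df):
    """Pick up to vocab_size terms that appear in at least min_df documents."""
    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing each term
    for words in docs:
        term_freq.update(words)
        doc_freq.update(set(words))
    eligible = [term for term in term_freq if doc_freq[term] >= min_df]
    # Most frequent terms first; ties broken alphabetically (assumption of this sketch).
    eligible.sort(key=lambda term: (-term_freq[term], term))
    return eligible[:vocab_size]


def count_vector(words, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(words)
    return [counts[term] for term in vocabulary]


docs = ["a b c".split(" "), "a b b c a".split(" ")]
vocabulary = build_vocabulary(docs, vocab_size=3, min_df=2)
vectors = [count_vector(words, vocabulary) for words in docs]
print(vocabulary)  # ['a', 'b', 'c']
print(vectors)     # [[1, 1, 1], [2, 2, 1]]
```

All three terms appear in both documents, so minDF=2 keeps every term here; a term occurring in only one row would be dropped from the vocabulary.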

