Modify the classpath... JVM <3

In [1]:
classpath.add(
    "org.apache.spark" %% "spark-core" % "1.6.0",
    "org.apache.spark" %% "spark-sql" % "1.6.0",
    "org.apache.spark" %% "spark-mllib" % "1.6.0",
    "org.xerial" % "sqlite-jdbc" % "3.8.6"
)

Adding 159 artifact(s)




Some imports...

In [2]:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.SparkConf

import org.apache.spark.mllib.linalg._

import org.apache.spark.ml
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification._

import org.sqlite.JDBC

import java.sql.DriverManager


In [3]:
val uri = "jdbc:sqlite:database.sqlite"

val seriousCount = 30100
val sarcasmCount = 30100
val numFeatures = 5000

[36muri[0m: [32mString[0m = [32m"jdbc:sqlite:database.sqlite"[0m
[36mseriousCount[0m: [32mInt[0m = [32m30100[0m
[36msarcasmCount[0m: [32mInt[0m = [32m30100[0m
[36mnumFeatures[0m: [32mInt[0m = [32m5000[0m

Create the Spark context and start Spark

In [4]:
val conf = new SparkConf()
    .setAppName("Solution")
    .setMaster("local[4]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/02/12 18:12:49 INFO SparkContext: Running Spark version 1.6.0
16/02/12 18:12:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/12 18:12:50 INFO SecurityManager: Changing view acls to: amharc
16/02/12 18:12:50 INFO SecurityManager: Changing modify acls to: amharc
16/02/12 18:12:50 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amharc); users with modify permissions: Set(amharc)
16/02/12 18:12:50 INFO Utils: Successfully started service 'sparkDriver' on port 36901.
16/02/12 18:12:51 INFO Slf4jLogger: Slf4jLogger started
16/02/12 18:12:51 INFO Remoting: Starting remoting
16/02/12 18:12:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.102:42560]
16/02/12 18:12:51 INFO Utils: Successfully started service 

[36mconf[0m: [32mSparkConf[0m = org.apache.spark.SparkConf@29df6bd0
[36msc[0m: [32mSparkContext[0m = org.apache.spark.SparkContext@3c5d6655
[36msqlContext[0m: [32mSQLContext[0m = org.apache.spark.sql.SQLContext@65fd157e

Load the SQLite JDBC driver, so that the JVM registers it and DriverManager can find it

In [5]:
Class.forName("org.sqlite.JDBC")

[36mres4[0m: [32mClass[0m[[32m?0[0m] = class org.sqlite.JDBC

Fix the database schema.

The schema of the provided database.sqlite file lacks declared types on some columns, so SQLite reports them as typeless. The SQLite JDBC driver maps such columns to "type 0" (implicit conversion), JDBC interprets that as a NULL type, and Spark SQL throws an exception.

Because user-provided schemas are not available with Spark SQL's JDBC DefaultSource, manually overriding the declared types in SQLite seems to be the least invasive way to make Spark understand the database.

The cell below forcibly rewrites the schema of the 'May2015' table, declaring each previously undeclared column explicitly as BLOB.
This keeps the column affinity, so no data loss occurs, but it makes the 'table_info' PRAGMA return usable types.

The only affinity that changes is that of "body", from BLOB to TEXT, which should be harmless.
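
The affinity argument above can be checked against SQLite's documented rules for deriving a column's affinity from its declared type. The following is a minimal plain-Scala sketch of those rules; `affinityOf` and the example values are hypothetical names introduced here for illustration and are not part of SQLite, the JDBC driver, or Spark.

```scala
// Sketch of SQLite's column-affinity rules, applied in this order.
// `affinityOf` is a hypothetical helper, not a library function.
def affinityOf(declaredType: String): String = {
  val t = declaredType.trim.toUpperCase
  if (t.contains("INT")) "INTEGER"
  else if (Seq("CHAR", "CLOB", "TEXT").exists(s => t.contains(s))) "TEXT"
  else if (t.isEmpty || t.contains("BLOB")) "BLOB"
  else if (Seq("REAL", "FLOA", "DOUB").exists(s => t.contains(s))) "REAL"
  else "NUMERIC"
}

// A column with no declared type and a column declared BLOB share one affinity,
// which is why re-declaring the untyped columns as BLOB does not alter the data.
val untypedAffinity = affinityOf("")
val blobAffinity = affinityOf("BLOB")

// Declaring "body" as TEXT is the one change of affinity.
val bodyAffinity = affinityOf("TEXT")

// "int" and "INTEGER" both resolve to INTEGER affinity.
val gildedAffinity = affinityOf("int")
```

Under these rules the mixed spelling of `INTEGER` and `int` in the rewritten schema makes no difference to how values are stored.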


In [6]:
{
    val conn = DriverManager.getConnection(uri)
    val stmt = conn.createStatement()
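    // writable_schema allows editing sqlite_master directly; only the stored
    // CREATE TABLE text changes, the rows themselves are not rewritten
    stmt.execute("PRAGMA writable_schema = true")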
    stmt.execute("UPDATE sqlite_master SET sql = '" +
      "CREATE TABLE May2015(" +
        "created_utc INTEGER," +
        "ups INTEGER," + 
        "subreddit_id BLOB," +
        "link_id BLOB," +
        "name BLOB," +
        "score_hidden BLOB," +
        "author_flair_css_class BLOB," +
        "author_flair_text BLOB," +
        "subreddit BLOB," +
        "id BLOB," +
        "removal_reason BLOB," +
        "gilded int," +
        "downs int," +
        "archived BLOB," +
        "author BLOB," +
        "score int," +
        "retrieved_on int," +
        "body TEXT," +
        "distinguished BLOB," +
        "edited BLOB," +
        "controversiality int," +
        "parent_id BLOB)' " +
      "WHERE name='May2015'")
    stmt.execute("PRAGMA writable_schema = false")
}

[36mconn[0m: [32mjava[0m.[32msql[0m.[32mConnection[0m = org.sqlite.SQLiteConnection@5bc1bf57
[36mstmt[0m: [32mjava[0m.[32msql[0m.[32mStatement[0m = org.sqlite.jdbc4.JDBC4Statement@389aaa18
[36mres5_2[0m: [32mBoolean[0m = [32mfalse[0m
[36mres5_3[0m: [32mBoolean[0m = [32mfalse[0m
[36mres5_4[0m: [32mBoolean[0m = [32mfalse[0m

In [7]:
val df = sqlContext.read.format("jdbc")
      .options(Map(
          "url" -> uri,
          "dbtable" -> "May2015"
        ))
      .load()

[36mdf[0m: [32mDataFrame[0m = [created_utc: bigint, ups: bigint, subreddit_id: binary, link_id: binary, name: binary, score_hidden: binary, author_flair_css_class: binary, author_flair_text: binary, subreddit: binary, id: binary, removal_reason: binary, gilded: bigint, downs: bigint, archived: binary, author: binary, score: bigint, retrieved_on: bigint, body: string, distinguished: binary, edited: binary, controversiality: bigint, parent_id: binary]

In [8]:
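// Positive class: comments that end with the " /s" sarcasm marker
val sarcasm = df
    .filter(df("body").endsWith(" /s"))
    .withColumn("label", lit(1.0))
    .limit(sarcasmCount)
    .cache()

// Negative class: comments that do not contain "/s" anywhere
val serious = df
    .filter(!df("body").contains("/s"))
    .withColumn("label", lit(0.0))
    .limit(seriousCount)
    .cache()

// Both classes combined into one labelled data set
val together = serious.unionAll(sarcasm)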

[36msarcasm[0m: [32mDataFrame[0m = [created_utc: bigint, ups: bigint, subreddit_id: binary, link_id: binary, name: binary, score_hidden: binary, author_flair_css_class: binary, author_flair_text: binary, subreddit: binary, id: binary, removal_reason: binary, gilded: bigint, downs: bigint, archived: binary, author: binary, score: bigint, retrieved_on: bigint, body: string, distinguished: binary, edited: binary, controversiality: bigint, parent_id: binary, label: double]
[36mserious[0m: [32mDataFrame[0m = [created_utc: bigint, ups: bigint, subreddit_id: binary, link_id: binary, name: binary, score_hidden: binary, author_flair_css_class: binary, author_flair_text: binary, subreddit: binary, id: binary, removal_reason: binary, gilded: bigint, downs: bigint, archived: binary, author: binary, score: bigint, retrieved_on: bigint, body: string, distinguished: binary, edited: binary, controversiality: bigint, parent_id: binary, label: double]
[36mtogether[0m: [32mDataFrame[0m = [cre

In [9]:
sarcasm
    .select("body")
    .limit(20)
    .collect()
    .foreach(println)

[Having sex with my girlfriend at least 5 times a day is my main escape. I am FA because I would like a second girlfriend for regular threesomes but she doesn't want to. /s]
[Awesome case, plus those blue LEDs make your framerate higher. /s]
[I don't know man. [This](http://www.reddit.com/r/WTF/comments/34f7fx/went_fishing_didnt_catch_a_fish/) guy caught a rusty gun while fishing. I mean, are you  trying to tell me that this guy with maggots chomping on his flesh is **more** wtf than that? /s]
[because he is famous

Edit: oh yeah /s]
[&gt; My deputies did their job to the fullest extent of their abilities....In the sense that we kept these drugs from reaching our streets, this operation was a success...

Yup, I lived in BH at the time and remember how hard it was to find weed after that. /s]
[Because what better way to woo affection from the ladies than mentioning their bodies in a love letter? /s]
[Slightly over half.. last I checked the world was 51% male.. which of course means majo



In [10]:
serious
    .select("body")
    .limit(20)
    .collect()
    .foreach(println)

[くそ
読みたいが買ったら負けな気がする
図書館に出ねーかな]
[gg this one's over. off to watch the NFL draft I guess]
[Are you really implying we return to those times or anywhere near that political environment?  If so, you won't have much luck selling the American people on that governance concept without ushering in American Revolution 2.0.]
[No one has a European accent either  because it doesn't exist. There are accents from Europe but not a European accent.]
[That the kid "..reminds me of Kevin."   so sad :-(]
[Haha, i was getting nauseous from it, if that was your ingame experience that would have given a whole new level of Bloodborne ^^ ]
[After reading this, I wholeheartedly believe you should let her go. 

You and her simply aren't compatible. She's looking for a committment and you're bent on avoiding it. You should figure out your committment issues before getting into a committed relationship.  ]
[Let's do this. See you guys on the other side.]
[You can buy a mystery sampler from small batch and reque



In [11]:
val Array(trainData, holdoutData) = together.randomSplit(Array(0.67, 0.33))

[36mtrainData[0m: [32mDataFrame[0m = [created_utc: bigint, ups: bigint, subreddit_id: binary, link_id: binary, name: binary, score_hidden: binary, author_flair_css_class: binary, author_flair_text: binary, subreddit: binary, id: binary, removal_reason: binary, gilded: bigint, downs: bigint, archived: binary, author: binary, score: bigint, retrieved_on: bigint, body: string, distinguished: binary, edited: binary, controversiality: bigint, parent_id: binary, label: double]
[36mholdoutData[0m: [32mDataFrame[0m = [created_utc: bigint, ups: bigint, subreddit_id: binary, link_id: binary, name: binary, score_hidden: binary, author_flair_css_class: binary, author_flair_text: binary, subreddit: binary, id: binary, removal_reason: binary, gilded: bigint, downs: bigint, archived: binary, author: binary, score: bigint, retrieved_on: bigint, body: string, distinguished: binary, edited: binary, controversiality: bigint, parent_id: binary, label: double]

In [12]:
val tokenizer = new Tokenizer()
      .setInputCol("body")
      .setOutputCol("words")

val remover = new StopWordsRemover()
      .setInputCol("words")
      .setOutputCol("filteredWords")
      .setStopWords(Array("/s"))

[36mtokenizer[0m: [32mTokenizer[0m = tok_e3f02b093cf9
[36mremover[0m: [32mStopWordsRemover[0m = stopWords_61249fdd07b3

In [13]:
val hashingTF = new HashingTF()
      .setInputCol("filteredWords")
      .setOutputCol("rawFeatures")
      .setNumFeatures(numFeatures)

val idf = new IDF()
      .setInputCol("rawFeatures")
      .setOutputCol("features")

[36mhashingTF[0m: [32mHashingTF[0m = hashingTF_0d8fb741cd26
[36midf[0m: [32mIDF[0m = idf_2e22c6847277

In [14]:
val lr = new LogisticRegression()
      .setMaxIter(100)
      .setRegParam(1 / 1.25) // regularization strength 0.8

[36mlr[0m: [32mLogisticRegression[0m = logreg_0ff01226d396

In [15]:
val pipeline = new Pipeline()
      .setStages(Array(tokenizer, remover, hashingTF, idf, lr))

[36mpipeline[0m: [32mPipeline[0m = pipeline_bf1cd6bb8213

In [16]:
val model = pipeline.fit(trainData)

[36mmodel[0m: [32mml[0m.[32mPipelineModel[0m = pipeline_bf1cd6bb8213

In [17]:
val predictions = model.transform(holdoutData).cache()

[36mpredictions[0m: [32mDataFrame[0m = [created_utc: bigint, ups: bigint, subreddit_id: binary, link_id: binary, name: binary, score_hidden: binary, author_flair_css_class: binary, author_flair_text: binary, subreddit: binary, id: binary, removal_reason: binary, gilded: bigint, downs: bigint, archived: binary, author: binary, score: bigint, retrieved_on: bigint, body: string, distinguished: binary, edited: binary, controversiality: bigint, parent_id: binary, label: double, words: array<string>, filteredWords: array<string>, rawFeatures: vector, features: vector, rawPrediction: vector, probability: vector, prediction: double]

Show some comments that the model predicts to be sarcastic

In [18]:
predictions
    .filter(expr("prediction = 1.0"))
    .limit(20)
    .select("body")
    .collect()
    .foreach(println)

[くそ
読みたいが買ったら負けな気がする
図書館に出ねーかな]
[and what about all of the players who hit vr 14 after 1.6 but before the removal of them?]
[NSFL]
[In other words, never make a decision, and you will remain forever free]
[As the great Alex James said, "A man with a Barbour jacket and a bottle of champagne is invincible" - get the Barbour! It compliments your build the best and it's just an awesome jacket]
[no money, just ID, driver license, credit cards and a subway stampcard]
[so fucking inspirational ]
[ErickCachorroZL157]
[&gt;maybe jews

not maybe]
[But that isn't what is going on. You keep saying throwing bricks or rocks. That is the least of Baltimore s problem. The violence isn't even against the police.  It is against businesses and people, predominantly white people. That is racism. None of it is justified. Just like the LA riots weren't

I am white &amp; I am not a racist. I don't support racism.]
[I can't understand why they are so expensive, especially with the  console versions going for 



In [19]:
predictions
    .filter(expr("prediction = 0.0"))
    .limit(20)
    .select("body")
    .collect()
    .foreach(println)

[Let's do this. See you guys on the other side.]
[You can buy a mystery sampler from small batch and request them]
[I can't answer better than Acquittal. 

I just want you to know that I think you made the right decision. Good luck ! ]
[99.99% of the power is filtered on the motherboard before it reaches any core components. hence no crashing, really shitty power supply will cause it to crash if its really bad, I.E. bad caps in the PSU. good grammar and punctuation on the Internets dose not mean you're smart, douche. ]
[[deleted]]
[Me too. Same hammock fabric, too.]
[I would like to use them for training ]
[The fire escape is there. You hear wood splintering, and look to see that a raptor has managed to break a hole in the top of the door, just above the dresser. Its head pokes through, then disappears. There's another thud, and the dresser moves forward a few inches.]
[Definitely test a few.  I was told as a beginner I would be happy with a 10 ft sit-on. But I have some boating experi



In [20]:
val vals = predictions
    .select("probability", "prediction", "body")
    .limit(100000)
    .collect()

[36mvals[0m: [32mArray[0m[[32mRow[0m] = [33mArray[0m(
  [[0.4600140536213604,0.5399859463786397],1.0,くそ
読みたいが買ったら負けな気がする
図書館に出ねーかな],
  [[0.5107539447420855,0.48924605525791454],0.0,Let's do this. See you guys on the other side.],
  [[0.5054309045780849,0.4945690954219151],0.0,You can buy a mystery sampler from small batch and request them],
  [[0.45381039081448055,0.5461896091855194],1.0,and what about all of the players who hit vr 14 after 1.6 but before the removal of them?],
  [[0.5468139451863717,0.45318605481362834],0.0,I can't answer better than Acquittal. 

I just want you to know that I think you made the right decision. Good luck ! ],
  [[0.48498279714390263,0.5150172028560973],1.0,NSFL],
  [[0.6272024268826191,0.3727975731173809],0.0,99.99% of the power is filtered on the motherboard before it reaches any core components. hence no crashing, really shitty power supply will cause it to crash if its really bad, I.E. bad caps in the PSU. good grammar and punctuation on th

In [25]:
vals.filter(x => x(0).asInstanceOf[DenseVector](1) > 0.8).foreach(println)

[[0.07342791391436534,0.9265720860856347],1.0,Here's a small and very incomplete list of why nobody cares what the Bible says.

Issue | Bible | Modern
-----:|:-----:|:-----:
Lying to your children about if a fruit is deadly or not. | ✔ | ✘
Punishing your children when they catch you in your lie. | ✔ | ✘
Murdering your son because a voice told you to. | ✔ | ✘
Murdering your son because everyone else disobeyed you. | ✔ | ✘
Drowning everyone who disagrees with you. | ✔ | ✘
Eating lobster. | ✘ | ✔
Selling your daughter into sexual slavery. | ✔ | ✘
Freedom of religion. | ✘ | ✔
Letting people with disabilities go to church. | ✘ | ✔
Murdering disrespectful children. | ✔ | ✘
Slavery. | ✔ | ✘
Different slavery in some special context. | ✔ | ✘
Beating your slaves so badly they take three days to recover. | ✔ | ✘
Forcing a woman to marry her rapist. | ✔ | ✘
Allowing women to speak in church. | ✘ | ✔
Male on male sex. | ✘ | ✔
Adult on prepubescent sex. | ✔ | ✘
Offering your daughters to a rape mob



In [22]:
predictions.agg(sum(abs(expr("prediction - label"))), count(col("label"))).show()

+------------------------------+------------+
|sum(abs((prediction - label)))|count(label)|
+------------------------------+------------+
|                        6559.0|       19925|
+------------------------------+------------+
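
Because both label and prediction are either 0.0 or 1.0, the sum of |prediction - label| is simply the number of misclassified holdout comments. A short worked calculation, using only the two figures from the table above (the value names are introduced here for illustration):

```scala
// Figures copied from the aggregation output above
val misclassified = 6559.0
val holdoutSize = 19925.0

// Share of holdout comments the model got wrong, and its complement
val errorRate = misclassified / holdoutSize // roughly 0.329
val accuracy = 1.0 - errorRate              // roughly 0.671
```

Since the two classes were sampled in equal numbers, a constant guess would be expected to land near 50% on the holdout set, so roughly 67% accuracy is a modest improvement over that baseline.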



