# Spark ML task

Consider a binary classification problem with the following two classes:
- 1 for 'US_Republican_Party_politicians'
- 0 for 'US_Democratic_Party_politicians'

Instead of a dictionary (a word-to-index mapping), features can be indexed by hashing.

The task is to check how the model behaves with hashing and to answer the following questions:
1. **What roc_auc_score is obtained on the test set when using a dictionary?**
2. **What roc_auc_score is obtained on the test set after switching from the dictionary to hashing?**

Details:
1. Split the data into training and test sets by the parity of the article `id`: even ids go to training, odd ids to test. Compute gradients on the training set only!
2. To compute roc_auc_score, collect the predictions and the true labels for the test examples. All (prediction, label) pairs fit into memory, so take advantage of that!
3. Use `murmurhash3_32(x) % 2**14` as the hash function.
4. Fix the random seed before initializing the weights: `np.random.seed(0); weights = np.random.random(...)`
5. Train for 100 iterations. After each iteration, call `weights_broadcast.destroy()` to release the broadcast variable so that you do not run out of memory.
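
The hash function in item 3 maps every word to one of 2**14 = 16384 feature slots, so no vocabulary has to be built or shipped to the executors. A minimal sketch of that mapping, assuming `zlib.crc32` from the standard library as a stand-in for `murmurhash3_32` (the two produce different values; only the bucketing idea is the same) and a hypothetical helper name `hashed_index`:

```python
import zlib

N_BUCKETS = 2 ** 14  # 16384 feature slots, as in the task statement


def hashed_index(word, n_buckets=N_BUCKETS):
    """Map a word to a feature slot without any dictionary lookup.

    zlib.crc32 is only a stand-in; the task itself asks for murmurhash3_32.
    """
    return zlib.crc32(word.encode("utf-8")) % n_buckets


words = ["senator", "governor", "election", "senator"]
indices = [hashed_index(w) for w in words]
```

The same word always lands in the same slot, while two different words may share one; such collisions are the price of a fixed-size feature space.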


Save the answers to the `result.json` file.
Example file content:

```json
{
    "q1": 0.123,
    "q2": 0.456
}
```

In [1]:
from sklearn.utils import murmurhash3_32
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName='jupyter')

from pyspark.sql import SparkSession, Row
se = SparkSession(sc)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2023-12-27 23:21:07,241 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [2]:
from sklearn.metrics import roc_auc_score

# y_true - real classes
# y_score - class 1 probabilities
# https://en.wikipedia.org/wiki/Receiver_operating_characteristic
roc_auc_score(y_true=[1, 1, 0, 0], y_score=[0.8, 0.7, 0.3, 0.2])

1.0
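
ROC AUC depends only on how the scores rank the examples: it is the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. A small pure-Python sketch of that definition (the helper name `pairwise_auc` is illustrative, not part of scikit-learn):

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of correctly ordered (positive, negative) pairs."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # a tie counts as half a correct pair
    return correct / (len(pos) * len(neg))


perfect = pairwise_auc([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.2])
mixed = pairwise_auc([1, 1, 0, 0], [0.8, 0.3, 0.7, 0.2])
```

Because only the ordering matters, the score should be computed from the raw probabilities rather than from thresholded 0/1 predictions.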

In [4]:
! ls -lh wiki

total 205M
-rw-rw-r-- 1 jovyan root  61M Mar 16  2023 categories.jsonl
-rw-rw-r-- 1 jovyan root  387 Oct 13  2022 README.txt
-rw-rw-r-- 1 jovyan root 144M Mar 16  2023 wiki.jsonl


In [5]:
! head -n 1 wiki/wiki.jsonl

{"title": "April", "text": "April\n\nApril is the fourth month of the year, and comes between March and May. It is one of four months to have 30 days.\n\nApril always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.\n\nApril's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.\n\nApril comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.\n\nApril begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.\n\nIn common years, April starts on the same day of the week as October of the previous year, and in leap years, May 

In [6]:
! hadoop fs -copyFromLocal wiki /

copyFromLocal: `/wiki/categories.jsonl': File exists
copyFromLocal: `/wiki/wiki.jsonl': File exists
copyFromLocal: `/wiki/README.txt': File exists
copyFromLocal: `/wiki/.ipynb_checkpoints/wiki-checkpoint.jsonl': File exists
copyFromLocal: `/wiki/.ipynb_checkpoints/categories-checkpoint.jsonl': File exists
copyFromLocal: `/wiki/.ipynb_checkpoints/README-checkpoint.txt': File exists


In [7]:
! hadoop fs -ls -h /wiki

Found 4 items
drwxr-xr-x   - jovyan supergroup          0 2023-12-27 20:08 /wiki/.ipynb_checkpoints
-rw-r--r--   1 jovyan supergroup        387 2023-12-27 20:08 /wiki/README.txt
-rw-r--r--   1 jovyan supergroup     60.9 M 2023-12-27 20:08 /wiki/categories.jsonl
-rw-r--r--   1 jovyan supergroup    143.4 M 2023-12-27 20:08 /wiki/wiki.jsonl


In [8]:
import re
import string

def tokenize(text):
    text = re.sub(f'[^{re.escape(string.printable)}]', ' ', text)  # replace unprintable characters with a space
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)  # and punctuation
    words = text.lower().split()
    return words
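
A quick check of what `tokenize` produces may help when reading the word lists later on. The sketch below repeats the function so that it runs on its own; the sample sentence is made up for illustration:

```python
import re
import string


def tokenize(text):
    # same logic as the cell above, repeated so the snippet is self-contained
    text = re.sub(f'[^{re.escape(string.printable)}]', ' ', text)
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    return text.lower().split()


# \u00a0 is a non-breaking space, which is not in string.printable
tokens = tokenize("Harry S. Truman (May 8, 1884) wasn't\u00a0idle.")
```

Note that digits are kept as tokens and that contractions are split at the apostrophe (`wasn't` becomes `wasn` and `t`).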

In [9]:
import json

def mapper(line):
    text = json.loads(line)['text']
    words = tokenize(text)
    return [(word, 1) for word in set(words)]

In [10]:
%%time
word_counts = (
    sc.textFile("hdfs:///wiki/wiki.jsonl")
    .flatMap(mapper)
    .reduceByKey(lambda a, b: a + b)
    .collect()
)

CPU times: user 839 ms, sys: 573 ms, total: 1.41 s
Wall time: 44.4 s


In [11]:
top_word_counts = sorted(word_counts, key=lambda x: -x[1])[:50000]

In [12]:
# indexes are needed for vectorization of texts
word_to_index = {word: index for index, (word, count) in enumerate(top_word_counts)}

In [13]:
list(word_to_index.items())[:5]

[('the', 0), ('in', 1), ('a', 2), ('of', 3), ('is', 4)]
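
The cells above count, for each word, the number of articles that contain it (the `set(words)` in `mapper` makes each article contribute at most once), keep the 50000 most frequent words, and number them. The same steps on a toy in-memory corpus, as a sketch: `Counter` plays the role of `flatMap` plus `reduceByKey`, and the alphabetical tie-break is an addition that makes the result deterministic.

```python
from collections import Counter

docs = [
    "the senator met the governor",
    "the governor signed the bill",
    "a bill about the budget",
]

# document frequency: each document contributes a word at most once
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

# keep the top_k most frequent words; ties are broken alphabetically
top_k = 3
top_words = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
toy_word_to_index = {word: index for index, (word, _) in enumerate(top_words)}
```

Words outside the top list get no index and are dropped by the dictionary-based mapper.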

In [14]:
from collections import Counter
Counter(["a", "a", "b"])

Counter({'a': 2, 'b': 1})

In [15]:
# second option: broadcast variable
word_to_index_broadcast = sc.broadcast(word_to_index)

Broadcast variables are useful when the same data must be available on every executor, for example:
- the dictionary in an ML algorithm
- the vector of weights in an ML algorithm

Executors have **read-only** access to this data.

The data is sent to each executor once and can then be reused by many tasks.

In [16]:
def mapper(line):
    j = json.loads(line)
    text = j['text']
    words = tokenize(text)
    indices = []
    values = []
    for word, count in Counter(words).items():
        if word in word_to_index_broadcast.value:
            index = word_to_index_broadcast.value[word]
            indices.append(index)
            tf = count / float(len(words))
            values.append(tf)
    return np.array(indices), np.array(values)

In [17]:
%%time
(
    sc.textFile("hdfs:///wiki/wiki.jsonl")
    .map(mapper)
    .take(1)
)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 44) (a8cd43e76584 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/hadoop-jovyan/nm-local-dir/usercache/jovyan/appcache/application_1703689302506_0003/container_1703689302506_0003_01_000002/pyspark.zip/pyspark/worker.py", line 619, in main
    process()
  File "/tmp/hadoop-jovyan/nm-local-dir/usercache/jovyan/appcache/application_1703689302506_0003/container_1703689302506_0003_01_000002/pyspark.zip/pyspark/worker.py", line 611, in process
    serializer.dump_stream(out_iter, outfile)
  File "/tmp/hadoop-jovyan/nm-local-dir/usercache/jovyan/appcache/application_1703689302506_0003/container_1703689302506_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/pyspark/rdd.py", line 1563, in takeUpToNumLeft
    except StopIteration:
  File "/tmp/hadoop-jovyan/nm-local-dir/usercache/jovyan/appcache/application_1703689302506_0003/container_1703689302506_0003_01_000002/pyspark.zip/pyspark/util.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/tmp/ipykernel_39426/786089338.py", line -1, in mapper
NameError: name 'np' is not defined


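The job fails with `NameError: name 'np' is not defined` because `mapper` uses `np`, while `import numpy as np` only appears later, in cell In [20]. Defining the function raised nothing: Python looks up global names when the function body runs, which here happens inside the Spark task. A stdlib-only sketch of the same behaviour, with the `statistics` module standing in for NumPy (the function name `average` is illustrative):

```python
def average(values):
    # `statistics` is looked up only when this line runs,
    # not when the function is defined
    return statistics.mean(values)


try:
    average([1.0, 3.0])
    failed = False
    message = ""
except NameError as err:
    failed = True
    message = str(err)

import statistics  # binding the name afterwards is enough

result = average([1.0, 3.0])
```

In the notebook, importing NumPy before the job is submitted (for example at the top of the notebook) avoids the failure.
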

# File with article categories

In [18]:
wiki = se.read.json("hdfs:///wiki/wiki.jsonl")
wiki.createOrReplaceTempView("wiki")  # registerTempTable is deprecated since Spark 2.0
wiki.limit(2).toPandas()

Unnamed: 0,id,text,title,url
0,1,April\n\nApril is the fourth month of the year...,April,https://simple.wikipedia.org/wiki?curid=1
1,2,August\n\nAugust (Aug.) is the eighth month of...,August,https://simple.wikipedia.org/wiki?curid=2


In [19]:
categories = se.read.json("hdfs:///wiki/categories.jsonl")
categories.createOrReplaceTempView("categories")  # registerTempTable is deprecated since Spark 2.0
categories.limit(2).toPandas()

Unnamed: 0,category,page_id
0,Months,1
1,Months,2


# Set up logistic regression

In [20]:
import numpy as np

In [21]:
joined_train = se.sql("""
select
    wiki.text,
    cast(categories.category == 'US_Republican_Party_politicians' as int) as target
from
    wiki 
join
    categories
on wiki.id == categories.page_id
where
    categories.category in ('US_Republican_Party_politicians', 'US_Democratic_Party_politicians')
    and wiki.id % 2 = 0
""")
joined_train.limit(2).toPandas()

Unnamed: 0,text,target
0,"Harry S. Truman\n\nHarry S. Truman (May 8, 188...",0
1,Ronald Reagan\n\nRonald Wilson Reagan (; Febru...,1


In [22]:
joined_test = se.sql("""
select
    wiki.text,
    cast(categories.category == 'US_Republican_Party_politicians' as int) as target
from
    wiki 
join
    categories
on wiki.id == categories.page_id
where
    categories.category in ('US_Republican_Party_politicians', 'US_Democratic_Party_politicians')
    and wiki.id % 2 = 1
""")
joined_test.limit(2).toPandas()

Unnamed: 0,text,target
0,George W. Bush\n\nGeorge Walker Bush (born Jul...,1
1,Richard Nixon\n\nRichard Milhous Nixon (Januar...,1


In [59]:
def mapper(row):
    words = tokenize(row.text)
    indices = []
    values = []
    for word, count in Counter(words).items():
        if word in word_to_index:
            # dictionary variant: index = word_to_index[word]
            # hashing variant: the index comes from murmurhash
            index = murmurhash3_32(word) % 2**14
            indices.append(index)
            tf = count / float(len(words))
            values.append(tf)
    return np.array(indices), np.array(values), row.target

In [60]:
dataset = joined_train.rdd.map(mapper)
dataset.cache()  # cache dataset in RAM
dataset.count()

1127

In [61]:
dataset.take(1)

[(array([13452,  2801,   567, 12213,    23,  8511, 15687, 10651,  6757,
          5009,  8034,  1883, 15502,  4396,   430,  3998, 12180, 11662,
         10331,  4046, 10499,  8782,  3136, 13728,  5235,  4687,  4611,
          9489,  1475,  2588,  4753, 14727,  4338, 14668, 14660,  9441,
          1797,  5514,   513, 14793,  2541,  2565, 15611, 10540, 12740,
          5109, 15346, 12119,  3802,  3810,  1149,  8742, 16083,   324,
         15072,  3140, 16351, 13616,  1097,  7702,  4846,  1008,  2172,
          9838,   336,  8307, 13490,   361,  6965, 12053,  6600, 10674,
         15279, 14326,  1037,  5050, 15381,  3896,  9372,  5734, 14352,
          9185,  9724,  7209,  6712, 13946,  9603,  3573, 14660,  9045,
          7757,    23, 12331, 12720,  2638, 11558, 13080,  5681,  3665,
          5757,  6168,  1687,  2643,  2978, 15797, 10098, 14072,  6606,
         11110,  5983, 13942,  2820,  1216, 10855, 10369,  3621, 12320,
          2885,  7964,  5618,  8558,  3446, 14190,   464, 13438,
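
In the output above some slots occur more than once (14660 and 23, for example): with hashing, different words of the same article can collide. NumPy's `gradient[indices] += ...`, as used in `compute_gradient`, applies only one update per repeated index, so it may be worth merging duplicates before building the arrays. A hedged pure-Python sketch (the helper `merge_duplicates` is hypothetical, not part of the notebook):

```python
from collections import defaultdict


def merge_duplicates(indices, values):
    """Sum the values that share an index; return parallel, index-sorted lists."""
    totals = defaultdict(float)
    for index, value in zip(indices, values):
        totals[index] += value
    merged_indices = sorted(totals)
    merged_values = [totals[i] for i in merged_indices]
    return merged_indices, merged_values


# slot 23 appears twice, as it would after a hash collision
idx, val = merge_duplicates([23, 14660, 23, 567], [0.25, 0.125, 0.5, 0.125])
```

After merging, every slot appears once per example, so the prediction and the gradient update see the same feature vector.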

In [62]:
def sigmoid(x):
    if x >= 0:
        return 1. / (1. + np.exp(-x))
    else:
        return np.exp(x) / (1. + np.exp(x))
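
The two branches keep the argument of `exp` non-positive, so the exponential can underflow to zero but never overflow. A sketch of the difference using `math.exp`, which raises `OverflowError` where NumPy would return `inf` with a warning (`naive_sigmoid` and `stable_sigmoid` are illustrative names):

```python
import math


def naive_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def stable_sigmoid(x):
    # only ever exponentiate a non-positive number
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


try:
    naive_sigmoid(-1000.0)  # needs exp(1000), which does not fit in a float
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

low = stable_sigmoid(-1000.0)  # exp(-1000) underflows to 0.0
high = stable_sigmoid(1000.0)
mid = stable_sigmoid(0.0)
```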

In [63]:
def compute_gradient(weights_broadcast, loss, examples):
    # here we accumulate the contribution to the gradient
    gradient = np.zeros(len(weights_broadcast.value))
    
    for example in examples:
        indices, values, target = example

        # make a prediction with the current weights
        p = sigmoid(values.dot(weights_broadcast.value[indices]))

        # add to gradient accumulator
        gradient[indices] += values * (p - target)

        # count losses
        p = np.clip(p, 1e-15, 1-1e-15)
        loss.add(-(target * np.log(p) + (1 - target) * np.log(1 - p)))
    
    yield gradient
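
For reference, the quantities that `compute_gradient` accumulates, written out for one example with sparse features $x$, label $y \in \{0, 1\}$ and weights $w$ (the notation is introduced here and is not taken from the notebook):

```latex
\begin{aligned}
p &= \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}, \\
\ell(w) &= -\bigl[\, y \log p + (1 - y) \log(1 - p) \,\bigr], \\
\frac{\partial \ell}{\partial w_j} &= (p - y)\, x_j .
\end{aligned}
```

Only the coordinates listed in `indices` have nonzero $x_j$, which is why the code touches `gradient[indices]` alone. The per-example gradients are summed, not averaged, before they are used in the weight update.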

In [64]:
# number of examples
N = dataset.count()

In [65]:
from functools import partial
import numpy as np


# random weights
weights = np.random.random(len(word_to_index))

# Gradient Descent Epoch
for i in range(100):
    weights_broadcast = sc.broadcast(weights)
    loss = sc.accumulator(0.0)
    
    # calculate the gradient
    gradient = (
        dataset
        .coalesce(2)  # merge 200 cached partitions into 2
        .mapPartitions(partial(compute_gradient, weights_broadcast, loss))
        .reduce(lambda a, b: a + b)
    )

    # update the weights
    weights -= 0.05 * gradient
    
    weights_broadcast.destroy()
    
    print("epoch:", i, "loss:", loss.value / N)

epoch: 0 loss: 0.744848799507529
epoch: 1 loss: 0.7280021693727766
epoch: 2 loss: 0.7153134008119761
epoch: 3 loss: 0.7057434068147871
epoch: 4 loss: 0.6984886193851738
epoch: 5 loss: 0.6929406099502017
epoch: 6 loss: 0.6886453092920539
epoch: 7 loss: 0.6852670998741577
epoch: 8 loss: 0.6825594624226966
epoch: 9 loss: 0.6803420251993822
epoch: 10 loss: 0.6784831128513916
epoch: 11 loss: 0.6768867106338571
epoch: 12 loss: 0.6754828363408625
epoch: 13 loss: 0.6742204792992382
epoch: 14 loss: 0.6730624440921974
epoch: 15 loss: 0.6719815942850746
epoch: 16 loss: 0.6709581192631996
epoch: 17 loss: 0.6699775462611347
epoch: 18 loss: 0.6690292942113688
epoch: 19 loss: 0.6681056212670812
epoch: 20 loss: 0.6672008583512752
epoch: 21 loss: 0.6663108506010403
epoch: 22 loss: 0.6654325500141802
epoch: 23 loss: 0.6645637181498515
epoch: 24 loss: 0.6637027089987201
epoch: 25 loss: 0.6628483103011342
epoch: 26 loss: 0.6619996275110136
epoch: 27 loss: 0.6611559988987321
epoch: 28 loss: 0.6603169334067054
epoch: 29 loss: 0.6594820651403259
epoch: 30 loss: 0.658651120028529
epoch: 31 loss: 0.6578238913915031
epoch: 32 loss: 0.6570002220305995
epoch: 33 loss: 0.6561799910959705
epoch: 34 loss: 0.6553631044552275
epoch: 35 loss: 0.6545494876283696
epoch: 36 loss: 0.6537390806042834
epoch: 37 loss: 0.6529318340371119
epoch: 38 loss: 0.652127706454767
epoch: 39 loss: 0.6513266622099706
epoch: 40 loss: 0.6505286699761121
epoch: 41 loss: 0.6497

epoch: 77 loss: 0.6229624782525618
epoch: 78 loss: 0.6222661982997812
epoch: 79 loss: 0.6215722846781265
epoch: 80 loss: 0.6208807233634019
epoch: 81 loss: 0.6201915004511555
epoch: 82 loss: 0.6195046021554798
epoch: 83 loss: 0.6188200148078095
epoch: 84 loss: 0.6181377248557289
epoch: 85 loss: 0.6174577188618019
epoch: 86 loss: 0.6167799835024038
epoch: 87 loss: 0.6161045055665683
epoch: 88 loss: 0.615431271954845
epoch: 89 loss: 0.6147602696781671
epoch: 90 loss: 0.6140914858567288
epoch: 91 loss: 0.613424907718879
epoch: 92 loss: 0.6127605226000195
epoch: 93 loss: 0.6120983179415184
epoch: 94 loss: 0.6114382812896322
epoch: 95 loss: 0.6107804002944369
epoch: 96 loss: 0.6101246627087763
epoch: 97 loss: 0.6094710563872124
epoch: 98 loss: 0.6088195692849958
epoch: 99 loss: 0.6081701894570347
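
Each iteration above computes one partial gradient per partition (`mapPartitions`) and adds the partial results together (`reduce`); because the gradient is a sum over examples, the split into partitions does not change the result. A local, Spark-free sketch of that scheme on toy data, with plain lists in place of RDD partitions (all names and numbers below are illustrative):

```python
import math


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def partial_gradient(weights, examples):
    """Gradient and loss contribution of one partition (a list of examples)."""
    grad = [0.0] * len(weights)
    loss = 0.0
    for indices, values, target in examples:
        p = sigmoid(sum(v * weights[i] for i, v in zip(indices, values)))
        for i, v in zip(indices, values):
            grad[i] += v * (p - target)
        p = min(max(p, 1e-15), 1 - 1e-15)
        loss -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return grad, loss


def step(weights, partitions, lr=0.05):
    """One gradient-descent step: per-partition work, then a sum."""
    parts = [partial_gradient(weights, part) for part in partitions]
    total = [sum(g[j] for g, _ in parts) for j in range(len(weights))]
    loss = sum(part_loss for _, part_loss in parts)
    return [w - lr * g for w, g in zip(weights, total)], loss


# toy sparse examples: (indices, values, target)
data = [([0, 1], [0.5, 0.5], 1), ([1, 2], [0.5, 0.5], 0),
        ([0, 2], [0.25, 0.75], 0), ([0], [1.0], 1)]
partitions = [data[:2], data[2:]]

weights = [0.5, 0.5, 0.5]
losses = []
for _ in range(100):
    weights, loss = step(weights, partitions)
    losses.append(loss / len(data))  # mean loss before the update
```

As in the notebook, the reported loss is the mean over examples while the update uses the summed gradient.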


In [66]:
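# important words for US_Republican_Party_politicians class
# caveat: after switching to hashing, `weights` is indexed by hash slot, so looking it up
# by dictionary index (as below) no longer pairs each word with its own weight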
sorted([(weights[index], word) for word, index in word_to_index.items()])[-1:-15:-1]

[(13.311416803140952, 'gross'),
 (2.7450311746217384, 'vaucluse'),
 (2.3575921201440018, 'pole'),
 (2.3344671874054455, 'head'),
 (2.074710786844532, 'mare'),
 (1.9659612558063253, 'anymore'),
 (1.9639927949021942, 'matthews'),
 (1.7790552583003472, 'october'),
 (1.7426233344496518, 'personality'),
 (1.6467433781679168, 'subscription'),
 (1.635357246944483, 'pray'),
 (1.621877086780188, 'decade'),
 (1.5951524315786365, 'justices'),
 (1.5234701460967726, 'norrbotten')]

In [67]:
# important words for US_Democratic_Party_politicians class
sorted([(weights[index], word) for word, index in word_to_index.items()])[:15]

[(-9.835122656167245, 'bayer'),
 (-4.717791084003386, 'chips'),
 (-3.2799845679931834, 'monuments'),
 (-2.8840784448482513, 'seemed'),
 (-2.4517059184940106, 'crew'),
 (-2.3763644597787272, 'vascular'),
 (-2.349988642131422, 'penn'),
 (-2.0467935144873026, 'vitamins'),
 (-1.814102358785159, 'histories'),
 (-1.698263292168816, 'nutrition'),
 (-1.5779257267630975, 'outer'),
 (-1.5732582390865713, 'conqueror'),
 (-1.5460234537813435, 'ivan'),
 (-0.9389934866307138, 'pbs'),
 (-0.9209082831959386, 'joachim')]

In [68]:
weights

array([1.0362651 , 0.58009835, 0.56988984, ..., 0.86778891, 0.56893277,
       0.17996736])

In [38]:
def predict(weights, examples):
    predictions=[]
    
    for example in examples:
        indices, values, target = example

        # make a prediction with the current weights
        p = sigmoid(values.dot(weights[indices]))
        predictions.append(p)
    return predictions
        

In [69]:
dataset_test = joined_test.rdd.map(mapper)
dataset_test.cache()  # cache dataset in RAM
dataset_test.count()

1177

In [70]:
y_hat = predict(weights, dataset_test.collect())

In [43]:
y_hat_target = [int(p > 0.5) for p in y_hat]

In [45]:
y_hat_target[:20]

[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

In [71]:
y_target = [i[2] for i in dataset_test.collect()]
y_target[:20]

[1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1]

In [47]:
roc_auc_score(y_target, y_hat)

0.8360652173913043

In [72]:
# re-run after switching the mapper to hashing
roc_auc_score(y_target, y_hat)

0.8435521739130435

In [73]:
q1 = 0.8360652173913043
q2 = 0.8435521739130435

In [74]:
result = {
    'q1': q1,
    'q2': q2
}

In [75]:
result = json.dumps(result)
print(result)

{"q1": 0.8360652173913043, "q2": 0.8435521739130435}


In [76]:
# the context manager closes the file even if the write fails
with open("result.json", "w") as f:
    f.write(result)

In [None]:
# stop Spark (and YARN application)
sc.stop()