![title](http://ocausal.imbv.net/wp-content/uploads/2017/02/banner-autocop-3.jpg)
[AutoCop](http://ocausal.imbv.net/proyecto-autocop-es/), Proof of Concept of the Observatorio de Contenidos Audiovisuales ([OCA](http://ocausal.imbv.net/proyecto-autocop-es/)), funded by the University of Salamanca Foundation [Plan TCUE 2015-2017 Fase 2].
Principal Investigator: Carlos Arcila Calderón. Researchers: Félix Ortega, Javier Amores, Sofía Trullenque, Miguel Vicente, Mateo Álvarez, Javier Ramírez

# AutoCop on Spark for English text

# Training the models

## Import libraries and start Spark Context

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize

import findspark
findspark.init()
import pyspark
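# Reuse the active SparkContext if there is one; otherwise create it with this app name
sc = pyspark.SparkContext.getOrCreate(pyspark.SparkConf().setAppName("myAppName"))
spark = pyspark.sql.SparkSession(sc)
# List the effective configuration through the public accessor
sc.getConf().getAll()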

[('spark.app.name', 'myAppName'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.driver.host', '192.168.1.207'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.port', '58034'),
 ('spark.app.id', 'local-1491946048678')]

## Model training:

### Data preparation
First, we prepare the set of instances used for training. The steps are as follows:

Select which kinds of words to use (adjectives, adverbs, verbs, etc.). Here only adjectives are kept, which the Penn Treebank tagset used by `nltk.pos_tag` labels `JJ`.

In [2]:
allowed_word_types = ["JJ"]

Load the labeled datasets used for training. `sc.textFile` yields one RDD element per line of each file.

In [3]:
rdd_short_pos = sc.textFile("../short_reviews/positive.txt")
rdd_short_neg = sc.textFile("../short_reviews/negative.txt")

Build the vocabulary of all selected words in the training set. It is used later to turn each training instance into a feature vector.

In [4]:
# Tokenize and POS-tag every review; label positive reviews 1 and negative reviews 0
rdd_all_tokenized_words = rdd_short_pos.map(lambda review: (nltk.pos_tag(word_tokenize(review)), 1)) \
    .union(rdd_short_neg.map(lambda review: (nltk.pos_tag(word_tokenize(review)), 0)))

In [5]:
# Keep only the words whose POS tag is allowed, together with the instance label
rdd_selected_words = rdd_all_tokenized_words.map(lambda review: \
    ([word for word, tag in review[0] if tag in allowed_word_types], review[1]))

In [6]:
rdd_selected_words.take(3)

[(['21st', 'new', 'conan', 'jean-claud', 'steven'], 1),
 (['elaborate', 'huge', 'expanded'], 1),
 (['effective', 'too-tepid'], 1)]
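
The tag-and-filter step above can be illustrated without Spark or NLTK on a hand-tagged sentence. This is only a sketch: the `tagged_review` pairs are made-up stand-ins for what `nltk.pos_tag` would return, and `select_words` is a hypothetical helper name that does not appear in the notebook.

```python
# Hypothetical stand-in for the output of nltk.pos_tag(word_tokenize(text)):
# a list of (word, Penn Treebank tag) pairs, written by hand for illustration.
tagged_review = [
    ("an", "DT"), ("elaborate", "JJ"), ("and", "CC"),
    ("huge", "JJ"), ("production", "NN"), ("runs", "VBZ"), ("slowly", "RB"),
]


def select_words(tagged, allowed_tags):
    """Keep only the words whose tag is in allowed_tags, preserving order."""
    return [word for word, tag in tagged if tag in allowed_tags]


adjectives = select_words(tagged_review, ["JJ"])
adjectives_and_adverbs = select_words(tagged_review, ["JJ", "RB"])
print(adjectives)
print(adjectives_and_adverbs)
```

Adding tags such as `RB` to `allowed_word_types` would widen the vocabulary in the same way.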

In [16]:
rdd_all_words = rdd_selected_words.flatMap(lambda words: words[0]).distinct()

The word list must be broadcast so that it is available on every worker.

In [17]:
rdd_all_broadcast_words = sc.broadcast(rdd_all_words.collect())

In [8]:
rdd_featured_instances = rdd_selected_words.map(lambda instance: (find_features(instance[0]), instance[1]))

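The `find_features` function used in the previous cell turns each list of words into a binary feature vector with one position per vocabulary word. Defining it after the `map` call works because Spark transformations are lazy: nothing is evaluated until an action runs.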

In [9]:
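def find_features(instance):
    # Build a binary vector: 1 if the vocabulary word occurs in the instance, else 0
    instance_words = set(instance)  # a set gives constant-time membership tests
    return [1 if word in instance_words else 0
            for word in rdd_all_broadcast_words.value]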
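
The same encoding can be tried outside Spark on a toy vocabulary. In this sketch the `vocabulary` list stands in for `rdd_all_broadcast_words.value`, and both it and the `encode` name are assumptions made for illustration.

```python
def encode(instance, vocabulary):
    """Return a binary vector with one position per vocabulary word."""
    present = set(instance)
    return [1 if word in present else 0 for word in vocabulary]


# Toy vocabulary standing in for rdd_all_broadcast_words.value (assumed data).
vocabulary = ["effective", "huge", "new", "too-tepid"]

# A repeated word still produces a single 1: the encoding records presence, not counts.
vector = encode(["huge", "new", "huge"], vocabulary)
print(vector)
```

Each position is tied to the order of the vocabulary, so the same word list in the same order is needed whenever new text is encoded for a trained model.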

### Save words to file

In [20]:
rdd_all_words.coalesce(1, True).saveAsTextFile("all_words")

## Import Spark labeled point
Spark's MLlib library trains its models on an RDD of `LabeledPoint` objects. Each `LabeledPoint` pairs a label with the vector of features for one instance.

In [11]:
from pyspark.mllib.regression import LabeledPoint

In [12]:
rdd_training_set = rdd_featured_instances.map(lambda instance: LabeledPoint(label=instance[1], features=instance[0]))
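
Because each review contains only a few of the vocabulary words, the dense feature lists are mostly zeros. The sketch below shows how the non-zero positions could be extracted to build a sparse representation instead. The helper name `to_sparse_parts` and the toy vector are assumptions, and the PySpark usage is shown only as a comment because it is not part of the original notebook.

```python
def to_sparse_parts(dense):
    """Return (size, indices, values) for the non-zero entries of a dense list."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values


dense_features = [0, 1, 0, 0, 1, 0, 0, 0]  # toy vector, not real notebook data
size, indices, values = to_sparse_parts(dense_features)
print(size, indices, values)

# With PySpark available, these parts could be passed on like this
# (an assumption about how one might adapt the notebook; not run here):
#   from pyspark.mllib.linalg import SparseVector
#   LabeledPoint(label, SparseVector(size, indices, values))
```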

In [13]:
rdd_training_set.count()

10662

## Naive Bayes classifier

In [14]:
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils

In [15]:
NB_model = NaiveBayes.train(rdd_training_set, 1.0)

In [16]:
NB_model.theta

array([[-9.79238838, -9.0992412 , -9.79238838, ..., -9.79238838,
        -8.69377609, -9.79238838],
       [-9.1400252 , -9.83317238, -9.1400252 , ..., -9.1400252 ,
        -9.83317238, -9.1400252 ]])
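
Each row of `theta` corresponds to one class and each column to one vocabulary word. The following plain-Python sketch shows how such values arise, assuming a multinomial model with additive smoothing (the `1.0` passed to `train`). The function `log_theta` and the toy data are hypothetical and only meant to mirror the idea, not MLlib's implementation.

```python
import math


def log_theta(rows, labels, smoothing=1.0):
    """Per-class log conditional probabilities for multinomial naive Bayes
    with additive smoothing (assumed to mirror what MLlib stores in theta)."""
    num_features = len(rows[0])
    theta = {}
    for cls in sorted(set(labels)):
        # Sum each feature column over the rows that belong to this class
        sums = [0.0] * num_features
        for row, label in zip(rows, labels):
            if label == cls:
                for j, value in enumerate(row):
                    sums[j] += value
        denom = sum(sums) + smoothing * num_features
        theta[cls] = [math.log((s + smoothing) / denom) for s in sums]
    return theta


# Toy binary feature vectors (assumed data): 3 features, 2 classes
rows = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = [1, 1, 0, 0]
theta = log_theta(rows, labels)
for cls, values in theta.items():
    print(cls, [round(v, 3) for v in values])
```

Under this assumption, all words with the same count in a class share the same value, which would be consistent with the repeated entries in the output above.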

In [15]:
NB_all_predictions = rdd_featured_instances.map(lambda instance: NB_model.predict(instance[0]))

In [18]:
NB_all_predictions.take(10)

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
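
These predictions are made on the same instances the model was trained on, so comparing them with the labels gives training accuracy only, not an estimate of performance on unseen text. A minimal sketch of that comparison in plain Python follows. The `accuracy` helper is hypothetical and the labels are made up for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute the accuracy of an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Predictions in the same format as the output above; labels are assumed values.
predictions = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(accuracy(predictions, labels))

# In Spark, (prediction, label) pairs could be produced with something like
# (not run here):
#   rdd_training_set.map(lambda p: (NB_model.predict(p.features), p.label))
```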

In [19]:
NB_model.save(sc, path="NB_model")

## Logistic Regression

In [16]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel

In [21]:
LR_model = LogisticRegressionWithLBFGS.train(rdd_training_set)

In [22]:
LR_all_predictions = rdd_featured_instances.map(lambda instance: LR_model.predict(instance[0]))

In [23]:
LR_all_predictions.take(10)

[1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
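
The model returns class labels rather than probabilities because it applies a threshold to the logistic score (0.5 by default in MLlib). The sketch below reproduces that decision rule in plain Python. The weights are hypothetical and are not taken from `LR_model`.

```python
import math


def predict_logistic(features, weights, intercept=0.0, threshold=0.5):
    """Return (label, score), where score is the logistic of the weighted sum."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    score = 1.0 / (1.0 + math.exp(-margin))
    return (1 if score > threshold else 0), score


# Hypothetical weights for a 3-word vocabulary (not taken from LR_model)
weights = [1.5, -2.0, 0.25]
label_a, score_a = predict_logistic([1, 0, 1], weights)
label_b, score_b = predict_logistic([0, 1, 0], weights)
print(label_a, round(score_a, 3))
print(label_b, round(score_b, 3))
```

On the MLlib model, calling `clearThreshold()` makes `predict` return the raw score instead of a label.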

In [24]:
LR_model.save(sc, path="LR_model")

## SVM

In [17]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel

In [26]:
SVM_model = SVMWithSGD.train(rdd_training_set)

In [27]:
SVM_all_predictions = rdd_featured_instances.map(lambda instance: SVM_model.predict(instance[0]))

In [28]:
SVM_all_predictions.take(10)

[1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

In [29]:
SVM_model.save(sc, path="SVM_model")

## Import saved models
Reload each model to check that it was saved correctly.

In [20]:
Saved_NB = NaiveBayesModel.load(sc, "NB_model")

In [21]:
Saved_LR = LogisticRegressionModel.load(sc, "LR_model")

In [22]:
Saved_SVM = SVMModel.load(sc, "SVM_model")