![title](http://ocausal.imbv.net/wp-content/uploads/2017/02/banner-autocop-3.jpg)
[AutoCop](http://ocausal.imbv.net/proyecto-autocop-es/), Proof of Concept of the  Observatorio de Contenidos Audiovisuales ([OCA](http://ocausal.imbv.net/proyecto-autocop-es/)), funded by the University of Salamanca Foundation [Plan TCUE 2015-2017 Fase 2]. 
Principal Investigator: Carlos Arcila Calderón. Researchers: Félix Ortega, Javier Amores, Sofía Trullenque, Miguel Vicente, Mateo Álvarez, Javier Ramírez

# AutoCop to run in Spark in English

# Twitter API to kafka producer

This app connects to the Twitter API and forwards the stream to a Kafka producer, so that the data is available from the Kafka broker.

This project uses Spark to distribute the classification of tweets from the Twitter stream as they are published. To do so, the Twitter API is connected to a Kafka server so that the tweets are available through a broker, with each hashtag written to its own Kafka topic.

Each tweet is classified by tokenizing it and comparing its words to those from the training set.
For both training and classification, the first step is to convert each tweet into an array of features, where the features are a list of selected adjectives gathered from the training set. The result of the transformation is an array of 0s and 1s indicating the absence or presence of each listed word in the tweet.
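
The transformation described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `tokenize` and `to_feature_vector` helpers and the four-word vocabulary are hypothetical and are not taken from the project's notebooks, where the word list comes from the training set.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def to_feature_vector(tweet, vocabulary):
    """Return one 0/1 flag per vocabulary word, in vocabulary order."""
    tokens = set(tokenize(tweet))
    return [1 if word in tokens else 0 for word in vocabulary]

# Hypothetical vocabulary; the project gathers its word list from the training set.
vocabulary = ["good", "bad", "great", "terrible"]

features = to_feature_vector("Great talk, really GOOD examples!", vocabulary)
print(features)  # [1, 0, 1, 0]
```

Every tweet maps to a vector of the same length as the vocabulary, which is what lets a classifier trained on one set of tweets be applied to new ones.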

## Prerequisites

## Setup Kafka Server

## Run Kafka on local machine with single instance of Zookeeper and Kafka

### Download kafka

The first step is to download Apache Kafka. This script uses Kafka 0.10.2.0 built for Scala 2.11 (kafka_2.11-0.10.2.0.tgz), available at: http://mirror.nexcess.net/apache/kafka/0.10.2.0/kafka_2.11-0.10.2.0.tgz

Once the archive is downloaded and extracted to the desired location, set up the configuration for the Zookeeper and Kafka services.

### Zookeeper and Kafka configuration

The configuration files are stored in kafka_folder/config/
    Zookeeper: zookeeper.properties
    Kafka: server.properties
    
Here are some example files:

zookeeper.properties
server.properties

### Start Zookeeper service

Running the Kafka server requires a Zookeeper instance. You can use an existing one or the one included in the Kafka folder. To start Zookeeper: kafka_folder/bin/zookeeper-server-start.sh config/zookeeper.properties

### Start Kafka Server

Once Zookeeper is running, start the Kafka server with this command: kafka_folder/bin/kafka-server-start.sh config/server.properties

Zookeeper and a Kafka server are now running. To start receiving and processing the Twitter stream, start a Kafka producer fed by the Twitter stream, then read the stream from the broker through Spark Streaming using the Kafka Stream class.

### Start Kafka producer

To start the Kafka producer, execute the following cells.
They use the tweepy library to connect to the Twitter API, so credentials for the API must be provided.

### First import the necessary libraries

In [1]:
import json
from kafka import SimpleProducer, KafkaClient
import tweepy
import configparser

The notebook defines a class that extends the tweepy stream listener, connects to the Twitter API, and sends the data to a Kafka producer on a specified topic.

## Parameters

Some parameters have to be configured.

First come the Twitter API parameters, which consist of the auth parameters and a hashtag. Then come the Kafka parameters.
Kafka requires the location of the Kafka server, the update frequency of the producer (how often tweets gathered from the API are sent), and the topic to write to. The topic is taken from the hashtag: one Kafka topic is created per hashtag and named after it.
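
As a rough sketch of the hashtag-to-topic mapping, the snippet below strips the leading `#` from each hashtag to obtain a topic name. The `hashtag_to_topic` helper and the second hashtag are hypothetical and only serve as illustration. Kafka topic names are limited to ASCII letters, digits, `.`, `_` and `-`, so hashtags with other characters would need extra handling that is not shown here.

```python
def hashtag_to_topic(hashtag):
    """Strip the leading '#' so the hashtag text can be used as a topic name."""
    return hashtag[1:] if hashtag.startswith("#") else hashtag

# "#Spark" is a made-up second hashtag, added only to show the one-topic-per-hashtag idea.
hashtags = ["#BigData", "#Spark"]
topics = {tag: hashtag_to_topic(tag) for tag in hashtags}
print(topics)  # {'#BigData': 'BigData', '#Spark': 'Spark'}
```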

### Twitter API parameters

#### Auth parameters

In [2]:
# Replace the placeholders with your own Twitter API credentials.
# Never publish real keys in a shared notebook or repository.
twitter_credentials = {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_key": "YOUR_ACCESS_KEY",
    "access_secret": "YOUR_ACCESS_SECRET"
}
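
Since `configparser` is already imported, one option is to keep the credentials in a separate INI file instead of in the notebook. The sketch below is an assumption about how that could look, not the project's actual setup: the `[twitter]` section name and the `load_twitter_credentials` helper are hypothetical. It parses an inline string so that it runs on its own.

```python
import configparser

CREDENTIAL_KEYS = ("consumer_key", "consumer_secret", "access_key", "access_secret")

def load_twitter_credentials(config_text):
    """Parse INI-formatted text and return the four Twitter credentials as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["twitter"]  # hypothetical section name
    return {key: section[key] for key in CREDENTIAL_KEYS}

# Inline stand-in for the contents of a private config file.
example_config = """
[twitter]
consumer_key = YOUR_CONSUMER_KEY
consumer_secret = YOUR_CONSUMER_SECRET
access_key = YOUR_ACCESS_KEY
access_secret = YOUR_ACCESS_SECRET
"""

credentials_from_config = load_twitter_credentials(example_config)
print(sorted(credentials_from_config))
```

With a real file, `parser.read(path)` would replace `parser.read_string(config_text)`, and the file itself would stay out of version control.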

#### Hashtag parameters

In [3]:
twitter_parameters = {
    "hashtag": ["#BigData"]
}

### Kafka producer parameters
These parameters control the workflow of the Kafka producer.

In [4]:
kafka_producer_parameters = {
    "batch_send_freq_t": 1000,
    "batch_send_freq_n": 10,
    "topic": twitter_parameters["hashtag"][0][1:],
    "connection_string": "localhost:9092"
}
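
To illustrate how a message-count limit and a time limit can interact in a batching producer, here is a simplified, self-contained sketch. It is not kafka-python's implementation: the `BatchBuffer` class is hypothetical, and it only checks the elapsed time when a new message arrives, whereas a real asynchronous producer typically flushes from a background thread.

```python
import time

class BatchBuffer:
    """Collect messages and flush once n messages or t seconds have accumulated."""

    def __init__(self, every_n, every_t, clock=time.monotonic):
        self.every_n = every_n      # flush after this many pending messages
        self.every_t = every_t      # flush after this many seconds since the last flush
        self.clock = clock
        self.pending = []
        self.flushed = []           # stands in for batches sent to the broker
        self.last_flush = clock()

    def add(self, msg):
        self.pending.append(msg)
        too_many = len(self.pending) >= self.every_n
        too_old = self.clock() - self.last_flush >= self.every_t
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self.pending:
            self.flushed.append(self.pending)
            self.pending = []
        self.last_flush = self.clock()

batch_buffer = BatchBuffer(every_n=3, every_t=1000)
for i in range(7):
    batch_buffer.add("tweet %d" % i)

# Two full batches of three were flushed; the seventh tweet is still waiting.
print(len(batch_buffer.flushed), len(batch_buffer.pending))  # 2 1
```

A small count limit keeps latency low on a busy hashtag, while the time limit bounds how long a tweet can wait on a quiet one.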

## TwitterStreamingListener

The tweepy library uses a StreamListener Class to connect to the Twitter API. We will create a class extending from the previous that will send the twitter streaming to a Kafka producer and post each tweet on the specified topic.

In [5]:
class TwitterStreamingListener(tweepy.StreamListener):

    def __init__(self, api, kafka_producer_parameters):
        self.api = api
        self.kafka_producer_parameters = kafka_producer_parameters
        # Call the parent initializer through this class, not through StreamListener itself
        super(TwitterStreamingListener, self).__init__(api)
        client = KafkaClient(kafka_producer_parameters["connection_string"])
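        # Recent kafka-python 1.x releases name this argument async_send; older ones
        # call it async, which is a reserved word from Python 3.7 onward.
        # n = number of messages per batch, t = maximum seconds between batches.
        self.producer = SimpleProducer(client, async_send = True,
                          batch_send_every_n = kafka_producer_parameters["batch_send_freq_n"],
                          batch_send_every_t = kafka_producer_parameters["batch_send_freq_t"])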
        
    def on_status(self, status):
        """Called whenever new data arrives from the live stream.
        The tweet text is pushed asynchronously to the Kafka queue."""
        msg =  status.text.encode('utf-8')
        try:
            # Use the parameters stored on the instance rather than the global dict
            self.producer.send_messages(self.kafka_producer_parameters["topic"], msg)
        except Exception as e:
            print(e)
            return False
        return True

    def on_error(self, status):
        # Error reported by the Twitter streaming API; print it and keep the stream alive
        print(status)
        return True
    
    def on_timeout(self):
        print("Timeout on twitter API")
        return True # Don't kill the stream

## Start Application
Create the auth object and start application

In [6]:
twitter_parameters["hashtag"]

['#BigData']

In [None]:
# Create Auth object
auth = tweepy.OAuthHandler(twitter_credentials["consumer_key"], twitter_credentials["consumer_secret"])
auth.set_access_token(twitter_credentials["access_key"], twitter_credentials["access_secret"])
api = tweepy.API(auth)

# Create stream and bind the listener to it
stream = tweepy.Stream(auth, listener = TwitterStreamingListener(api, kafka_producer_parameters))

#Custom Filter rules pull all traffic for those filters in real time.
stream.filter(track=twitter_parameters["hashtag"])
#stream.filter(track = twitter_parameters["hashtag"], languages = ['es'])

## Follow the logs
After launching the Kafka application, the tweets can be viewed by running this command in a terminal from the Kafka folder: ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic BigData --from-beginning --zookeeper 192.168.0.11:2181 where BigData is the topic to be read.

# Kafka consumer
Once the previous infrastructure is running, we have a Kafka producer connected to the Twitter API. To analyze the tweets with Spark, launch the twitter-kafka-consumer-spark-streaming.ipynb notebook. The models and the word list must have been generated beforehand with the twitter-spark-model-training.ipynb notebook.