# Project - Apache Spark & Elastichsearch

##### Students:
* Lilia IZRI      (DS)
* Yacine MOKHTARI (DS)
* Alexandre COMBEAU (DS)

##### Report
[PENSER A METTRE UN LIEN ICI]


In [1]:
# !pip install textblob

In [2]:
# import necessary packages
import pyspark
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from textblob import TextBlob

# For ML
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.clustering import StreamingKMeans

In [3]:
# Initiate the SparkContext and StreamingContext with 10 second batch interval
sc = SparkContext()
ssc = StreamingContext(sc, 10)
ssc.checkpoint("file:///tmp/spark")    # Checkpoint for backups (useful for operations by window)

## I. Process the input data (tweets)
### 1. Create our Dstream that receives data

In [4]:
# initiate streaming text from a TCP (socket) source (Our tweets received)
socket_stream = ssc.socketTextStream("127.0.0.1", 5552)

### 2. Process data and tag with sentiment 

In [16]:
# We create a function that analysis text with textblob
def sentiment(text):
    """ Function that returns -1 if a tweet is more likely negative (polarity<0)
                               0 if it's neutral  (polarity==0)
                               1 if it's positive (polarity>0)
    """
    polarity = TextBlob(text).polarity
    return 1 if polarity > 0 else -1 if polarity < 0 else 0

-------------------------------------------
Time: 2022-05-03 22:59:50
-------------------------------------------
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
...

-------------------------------------------
Time: 2022-05-03 22:59:50
-------------------------------------------
(2, 119)

-------------------------------------------
Time: 2022-05-03 23:00:00
-------------------------------------------
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
...

-------------------------------------------
Time: 2022-05-03 23:00:00
-------------------------------------------
(2, 124)

-------------------------------------------
Time: 2022-05-03 23:00:10
-------------------------------------------
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
...

-------------------------------------------
Time: 2022-05-03 23:00:10
-------------------------------------------
(2, 124)

-

Here, we just took into account the polarity and choosed to ignore the subjectivity !  ;)

In [6]:
# We split the fields of the tweet received and we add tag the data with the sentiment of the tweet
#   so the rdd below 'tweets' will be of the form (user, text, date, location, hashtags, sentiment)

def mapSplit(tweet):
    """
    A function that takes a tweet  (the one we sent from the other iPython file),
    splits it into its different fields and adds the sentiment field {-1, 0, 1}
    """
    tmp = tweet.split(' ###:field:### ')
    return (tweet[1], tweet[2], tweet[3], tweet[4], tweet[5], sentiment(tweet[2]))
             #user     #text    #date    #location  #hashtags  #sentiment(= {-1,0,1})
    
# tweets = socket_stream.map(lambda tweet: tweet.split(' ###:field:### '))\
#                       .map(lambda tweet: (tweet[1], tweet[2], tweet[3], tweet[4], tweet[5], sentiment(tweet[2])))
#                                            #user     #text    #date    #location  #hashtags  #sentiment(= {-1,0,1})

tweets = socket_stream.map(mapSplit)

### 3. ML : Cluster tweets according to sentiments and their location
⚠Remarque⚠ J'entraine et je prédis sur les mêmes données là ! Je sais pas si c'est ce qu'il faut :/

In [7]:
# We create a training set and test set 
training_data =  tweets.map(lambda tweet: Vectors.dense([float(tweet[5])]))
testing_data  =  tweets.map(lambda tweet: LabeledPoint(float(tweet[5]), Vectors.dense([float(tweet[5])])))


# We create a model with random clusters and specify the number of clusters to find
k = 3
dimension = 1
weights = 0.0
seed = 21

# init
model = StreamingKMeans(k=3, decayFactor=1.0).setRandomCenters(1, 0.0, 21)

# Train the model
model.trainOn(training_data)  

# Predict
result = model.predictOnValues(testing_data.map(lambda lp: (lp.label, lp.features)))
result.pprint()

In [8]:
# We keep the predictions of each tweet (the index of the cluster), and we create (indexCluster, 1) pairs
predictions   = result.map(lambda x: (x[1], 1))

# We reduce by key and window to get the number of elements assigned to each cluster
size_clusters = predictions.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
size_clusters.pprint()

In [9]:
# start streaming and wait couple of minutes to get enought tweets
ssc.start()

In [14]:
print("Clusters coordinates: " + str(model.latestModel().centers))

Clusters coordinates: [[ 1.0000000e-14]
 [-1.0000000e-14]
 [ 1.0417968e+00]]
-------------------------------------------
Time: 2022-05-03 22:57:00
-------------------------------------------
(0, 12)



In [15]:
print("Clusters coordinates: " + str(model.latestModel().centers))
### On voit que si on entraine le model qu'avec les sentiments (pas la localistion) donc dimension = 1
### et avec K = 3
### On voit bien que les sentiment = 0 sont assignés au cluster 0 (dont la coordonnée est presque 0)
###                                = -1 ........................1 (dont la coordonnée est -0.96 presque -1)
###                                = 1 .........................2 (dont la coordonnée est 1)


Clusters coordinates: [[ 1.0000000e-14]
 [-1.0000000e-14]
 [ 1.0417968e+00]]
-------------------------------------------
Time: 2022-05-03 22:57:10
-------------------------------------------
(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
(0.0, 0)
...

-------------------------------------------
Time: 2022-05-03 22:57:10
-------------------------------------------
(0, 53)

-------------------------------------------
Time: 2022-05-03 22:57:20
-------------------------------------------
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
...

-------------------------------------------
Time: 2022-05-03 22:57:20
-------------------------------------------
(0, 53)
(2, 26)

-------------------------------------------
Time: 2022-05-03 22:57:30
-------------------------------------------
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
(0.0, 2)
...

------------------------------------------

In [12]:
### C'est l'output d'une ancienne cellule 
### J'entrainais le model qu'avec les sentiments (pas la localistion) donc dimension = 1
### et avec K = 3
### On voit bien que les sentiment = 0 sont assignés au cluster 0 (dont la coordonnée est 0)
###                                = -1 ........................1 (dont la coordonnée est -1)
###                                = 1 .........................2 (dont la coordonnée est 1)

print("Clusters coordinates: " + str(model.latestModel().clusterCenters))

Clusters coordinates: [[-0.05196425]
 [-0.11119605]
 [ 1.0417968 ]]
-------------------------------------------
Time: 2022-05-02 14:54:30
-------------------------------------------
(0.0, 0)
(-1.0, 1)
(0.0, 0)
(0.0, 0)
(1.0, 2)
(0.0, 0)
(0.0, 0)
(1.0, 2)
(0.0, 0)
(1.0, 2)
...

-------------------------------------------
Time: 2022-05-02 14:54:30
-------------------------------------------
(0, 11)
(1, 3)
(2, 4)

-------------------------------------------
Time: 2022-05-02 14:54:40
-------------------------------------------
(0.0, 0)
(1.0, 2)
(0.0, 0)
(1.0, 2)
(0.0, 0)
(1.0, 2)
(0.0, 0)
(0.0, 0)
(1.0, 2)
(1.0, 2)
...

-------------------------------------------
Time: 2022-05-02 14:54:40
-------------------------------------------
(0, 34)
(1, 7)
(2, 13)

-------------------------------------------
Time: 2022-05-02 14:54:50
-------------------------------------------
(0.0, 0)
(0.0, 0)
(1.0, 2)
(0.0, 0)
(0.0, 0)
(1.0, 2)
(1.0, 2)
(0.0, 0)
(0.0, 0)
(0.0, 0)
...

-----------------------------