# Chapter 10: Spark Streaming
## Ex1: Pre-processing Data from Tweets
### Requirements:
- Read the tweet data from a file.
- Pre-process the data.
- Save the pre-processed data to a new file.

In [None]:
# pip install textblob
import csv
from textblob import TextBlob

### SENTIMENT ANALYSIS
- Sentiment polarity describes the orientation of the sentiment expressed in a text: whether the author's opinion of the entity under consideration is positive, negative, or neutral.
- Polarity is a float in the range [-1, 1], where 1 is a positive statement and -1 is a negative statement.
- Subjectivity is a float in the range [0, 1]. Subjective sentences generally express personal opinion, emotion, or judgment, whereas objective sentences convey factual information; 0 is fully objective and 1 is fully subjective.

- https://planspace.org/20150607-textblob_sentiment/
- https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/#:~:text=Polarity%20is%20float%20which%20lies,of%20%5B0%2C1%5D.
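
As a rough illustration of how the polarity range can be read, the sketch below maps a score to a coarse label. The function name `label_polarity` and the 0.05 neutral band are assumptions made for this example only; TextBlob itself returns just the numeric scores.

```python
def label_polarity(polarity, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    neutral_band is a hypothetical cut-off chosen for illustration only.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Sample polarity scores
for score in (0.875, -0.35, 0.0):
    print(score, label_polarity(score))
```

A wider or narrower neutral band changes how many near-zero texts count as neutral, so the cut-off should be chosen to suit the data.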

In [None]:
TextBlob("Today is a good day!").sentiment

Sentiment(polarity=0.875, subjectivity=0.6000000000000001)

In [None]:
TextBlob("Today is not a good day and he doesn't want to go to school.").sentiment

Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)

In [None]:
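# Translation and Language Detection
# Note: detect_language() and translate() were deprecated in TextBlob 0.16.0
# and removed in later releases, so the cells below need an older version.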
blob = TextBlob("Sáng nay trời u ám quá")

In [None]:
blob.detect_language()

'vi'

In [None]:
blob.translate(from_lang='vi', to='en')

TextBlob("This morning it was too dark")

### PRE-PROCESSING TEXT DATA

In [None]:
import pandas as pd

In [None]:
tweetdata = 'tweets_COVID19.txt'
sentences = []
sentiment_polarity = []
sentiment_subjectivity = []

In [None]:
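# csv.reader splits each line on commas, so row[0] keeps only the text
# before the first comma.
with open(tweetdata, 'r') as csvfile:
    rows = csv.reader(csvfile)
    for row in rows:
        if not row:  # skip blank lines, which csv.reader yields as empty lists
            continue
        sentence = row[0]
        # Skip error messages written into the file
        if "Error on_data" not in sentence:
            blob = TextBlob(sentence)
            print(sentence)
            print(blob.sentiment.polarity, blob.sentiment.subjectivity)
            sentences.append(sentence)
            sentiment_polarity.append(blob.sentiment.polarity)
            sentiment_subjectivity.append(blob.sentiment.subjectivity)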

Listening on port: 5555
0.0 0.0
Received request from: ('127.0.0.1'
-0.75 1.0
b'RT @aclararc_: boa noite\nmeu grupo da faculdade t\xc3\xa1 fazendo uma pesquisa sobre os "IMPACTOS DA COVID-19 NA SA\xc3\x9aDE MENTAL DOS INDIV\xc3\x8dDUOS"\nse v\xe2\x80\xa6'
-0.1 0.2
b'RT @WPeriyasamy: Friendly reminder that "Race in America" isn\'t a separate topic. Black Lives Matter is a racial justice issue. Immigration\xe2\x80\xa6'
0.10416666666666667 0.4666666666666667
b'RT @lximenezfyvie: El presidente de M\xc3\xa9xico declar\xc3\xb3 que no hay rebrote. Estoy de acuerdo con \xc3\xa9l. Aqu\xc3\xad seguimos en el primer
0.0 0.0
b'RT @doctormacias: La FDA aprueba el primer tratamiento para COVID-19: Veklury (remdesivir) para uso en pacientes que requieran hospitalizac\xe2\x80\xa6'
0.0 0.0
b'RT @fkelly25: \xe2\x80\x98No escape.\xe2\x80\x99 With COVID-19 in the shadows
0.0 0.0
b'Niveles de patria:\n\xe2\x9c\x85 Comprar las cosas para hacer perros calientes y no poder porque se fue la electricidad\n\xe2\

0.0 0.0
b'RT @LorenzonItalo: Olha a\xc3\xad
0.0 0.0
b'RT @apathetic_NY: \xe2\x80\x9cYou guys did NOTHING!\xe2\x80\x9d - Trump\n\nWhere\xe2\x80\x99s the COVID-19 contact tracing program you salty little bunker bitch??\n\nWhere\xe2\x80\x99s th\xe2\x80\xa6'
-0.1875 0.5
b'RT @BioethicsforPPL: Would you take a COVID-19 vaccine? \nWould you be first in line to take one? \n\n#bioethics #healthlaw #healthhum #medhum\xe2\x80\xa6'
0.25 0.3333333333333333
b'RT @Lhuchin: Los marinos desinfectaron los buques despu\xc3\xa9s que pas\xc3\xb3 Desbordes; no fue por el corona virus: fue clasismo. Desbordes es paco
0.0 0.0
b'RT @drsimonegold: WOW: Dr. Fauci\xe2\x80\x99s NIH recommends against all forms of treatment in COVID-19 patients if they do not require supplemental o\xe2\x80\xa6'
0.1 1.0
b'RT @LorenzonItalo: Olha a\xc3\xad
0.0 0.0
b'RT @Paniiit: \xe2\x80\x9c\xe0\xb8\x81\xe0\xb8\xb2\xe0\xb8\xa3\xe0\xb8\x97\xe0\xb8\xb5\xe0\xb9\x88\xe0\xb8\xab\xe0\xb8\xa5\xe0\xb8\xb2\xe0\xb8\xa2\xe0\xb9\x86\xe0\xb8\x84

0.0 0.0
b'RT @statnews: On a day the U.S. reported more than 72
0.5 0.5
b'RT @ABC: Kristen Welker on people in need of funds from stalled COVID-19 relief bill: "Mr. President
0.0 0.0
b"RT @swingleft: The federal minimum wage hasn't been raised in 11 years. 67% of Americans favor raising the wage. Trump and his billionaire\xe2\x80\xa6"
0.0 0.0
b'RT @Mistakobz: Here is evidence giving to the oba of Lagos for the COVID 19 palliative !!!! \nBefore  the Gonvermant will come up with there\xe2\x80\xa6'
0.0 0.0
b'RT @AdrienneLaw: Trump throwing shade on socialized medicine like he didn\xe2\x80\x99t just get out of a hospital providing him with free healthcare t\xe2\x80\xa6'
0.4 0.8
b'RT @QuaveEthnobot: A COVID-19 Vaccine Could Rely on Rare Trees in Chile - The Atlantic https://t.co/7APUqjX5ZC'
0.3 0.9
b'Tidak Percaya Covid-19
0.0 0.0
b'RT @cnni: Trump claimed during the debate that Covid-19 is going away.\n\nFacts First: This is false. The US coronavirus situation \xe2\x80\x94 as measu\xe2\x80

b'Mit der zweiten Coronawelle drohen neue Verbote. Das wirkt sich nicht nur auf den erhofften Aufschwung aus
0.0 0.0
b'RT @latimes: \xe2\x80\x9c2.2 million people
0.0 0.0
b'So\xc3\xb1\xc3\xa9 que recib\xc3\xada aguinaldo.\n\n*Es m\xc3\xa1s f\xc3\xa1cil que haya vacuna contra el Covid-19 en diciembre que yo reciba algo :('
-0.75 1.0
b'RT @ABC: Kristen Welker on people in need of funds from stalled COVID-19 relief bill: "Mr. President
0.0 0.0
b'RT @govkristinoem: If folks want to wear a mask
0.0 0.0
b'RT @KontraInfo: Estudios sostienen que el covid-19 reduce en 50% la calidad de semen en hombres. \xc2\xbfQu\xc3\xa9 rol tendr\xc3\xadan las vacunas? La enzima c\xe2\x80\xa6'
0.0 0.0
b'RT @Piousbae012: Proper news: Good Nigerians assists government in disbursing COVID-19 palliatives to the hungry masses. Retweet aggressive\xe2\x80\xa6'
0.35 0.35000000000000003
b'RT @LebanonTown: In terms of relative comparisons
0.0 0.0
b'RT @RichardGrenell: . @joebiden won\xe2\x80\x99t blame China for Covid 

b"RT @sjworld: COVID-19 Solidarity Response Fund for @SJofficial's 15th anniversary!\n\nD-15 to 11/06\nGoal: $15
0.0 0.0
b'@matnantess Virar uma grande gostosa \nTer um sal\xc3\xa1rio mensal de
0.0 0.0
b'\xf0\x9f\x98\x89 told ya'
0.0 0.0
b'@north_mcd'
0.0 0.0
b'Radar Bogor : Vaksinasi Covid-19 Bulan Depan Terancam Batal
0.0 0.0
b'RT @HowieHawkins: Since March we have put forward an urgent agenda to address COVID-19 and the economic collapse. The health and economic c\xe2\x80\xa6'
0.2 0.2
b'RT @BRI_SL: Pfizer Sets Up Its \xe2\x80\x98Biggest Ever\xe2\x80\x99 Vaccination Distribution Campaign\n\nThe U.S. pharmaceutical giant is preparing to ship billion\xe2\x80\xa6'
0.0 1.0
b"RT @prachatai: \xe0\xb8\xa3\xe0\xb8\xb2\xe0\xb8\xa2\xe0\xb8\x87\xe0\xb8\xb2\xe0\xb8\x99 World Economic Forum \xe0\xb8\x8a\xe0\xb8\xb5\xe0\xb9\x89\xe0\xb8\x95\xe0\xb8\xa5\xe0\xb8\xb2\xe0\xb8\x94\xe0\xb9\x81\xe0\xb8\xa3\xe0\xb8\x87\xe0\xb8\x87\xe0\xb8\xb2\xe0\xb8\x99\xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb9\x88\xe0\

0.8 0.75
b'RT @forkookv_: Jungkook conta que todos eles ficaram muito felizes por "Dynamite" ter chegado ao topo do Hot 100 na Billboard. Mas saber qu\xe2\x80\xa6'
0.25 0.8500000000000001
b'RT @ABC: FACT CHECK: Pres. Trump misleads when comparing COVID-19 pandemic to H1N1 and the Obama administration response. https://t.co/oqCu\xe2\x80\xa6'
0.0 0.0
b'RT @ABC: Kristen Welker on people in need of funds from stalled COVID-19 relief bill: "Mr. President
0.0 0.0
b'RT @j_dubrof: BREAKING: Due to several positive COVID-19 cases in the Ray football program
0.03409090909090909 0.3068181818181818
b'RT @Pinkmashh: Nigerians are now finding hidden warehouse filled with MONTHS of COVID relief packages of food today smh . The government is\xe2\x80\xa6'
0.11666666666666668 0.6166666666666667
b'RT @reallifeElchapo: If you are awake. Please spread the news. THERE IS A MASSACRE GOING ON IN OYIGBO RIVERS STATE. THEY ARE MOVING FROM HO\xe2\x80\xa6'
0.0 0.0
b"RT @miketatarski: I spoke with @rtenews about V

-0.05 0.7
b'RT @jigguksi: o diretor geral da OMS elogiou o jimin falando sobre o que ele disse em live referente ao covid-19 dizendo que a mensagem del\xe2\x80\xa6'
0.09318181818181817 0.75
b"RT @nowthisnews: 'Anyone who's responsible for that many deaths should not remain President of the United States of America' \xe2\x80\x94 Joe Biden ad\xe2\x80\xa6"
0.35 0.525
b'\xd0\xa7\xd1\x83\xd1\x80\xd0\xb0\xd0\xbf\xd1\x87\xd0\xb8\xd0\xbd\xd1\x81\xd0\xba\xd0\xb0\xd1\x8f \xd0\xa6\xd0\xa0\xd0\x91 \xd0\xbf\xd1\x80\xd0\xb8\xd0\xbe\xd0\xb1\xd1\x80\xd0\xb5\xd0\xbb\xd0\xb0 \xd0\xbe\xd0\xb1\xd0\xbe\xd1\x80\xd1\x83\xd0\xb4\xd0\xbe\xd0\xb2\xd0\xb0\xd0\xbd\xd0\xb8\xd0\xb5
0.0 0.0
b'Recovered COVID-19 patients\xe2\x80\x99 plasma of little use in\xc2\xa0treatment https://t.co/ukXYHoLCrC'
-0.1875 0.5
b'RT @costel_aurelian: Financial markets live https://t.co/zEcBPhhj50 via @TeleTrader'
0.06818181818181818 0.25
b'RT @ABC: FACT CHECK: Pres. Trump misleads when comparing COVID-19 pandemic to H1N1 and the Obama 

b'RT @BernardKerik: Everything @JoeBiden said he would do about COVID-19
0.0 0.0
b'RT @dionjordan00: HE SAID HES THE LEAST RACIST PERSON IN THE WORLD BUT CALLS COVID 19 \xe2\x80\x9cKung flu\xe2\x80\x9d or the \xe2\x80\x9cChinese virus\xe2\x80\x9d?? Get the fuck out\xe2\x80\xa6'
-0.35 0.5
b'US Presidential Debate 2020 LIVE Updates: Anyone who\xe2\x80\x99s responsible for that many deaths should not remain president\xe2\x80\xa6 https://t.co/qArbzzEkGJ'
0.2787878787878788 0.5166666666666667
b'Ok families who has lost a love one to COVID-19 please enlighten me how you live with it? 200000 plus\xe2\x80\xa6 https://t.co/FBC7mVeLuH'
0.37878787878787873 0.5333333333333333
b'RT @ayolex_official: Buhari wan leave him post start to dey sell rice and indomie.(low budget Dangote)\n\nSee the viral video of  Covid-19 pa\xe2\x80\xa6'
-0.2 0.15000000000000002
b'RT @JLozanoA: 479 fallecimientos por Covid-19 en las \xc3\xbaltimas horas
0.0 0.0
b'RT @ElNacionalWeb: #ServicioP\xc3\xbablico Se solicita con 

b'Thank you for a commonsense approach
0.0 0.0
b'Tiga juta vaksin Sinovac asal China akhir tahun ini ditargetkan ada yang masuk ke Indonesia. Sebelum diberikan kepa\xe2\x80\xa6 https://t.co/0deQMqgcsb'
0.0 0.0
b'RT @common_rick: #OpenUpAmerica https://t.co/EnXtmLbWVb'
0.0 0.0
b'\xe2\x80\x9c@realDonaldTrump\xe2\x80\x99s policies demonstrate his operationalization of what Henry Giroux calls  ...\xe2\x80\x99an imagined communi\xe2\x80\xa6 https://t.co/ulg0djV6GG'
0.0 0.0
b"College town mayors 'humbly request' Big Ten help combat spread of COVID-19 https://t.co/x0cdoaqsDI"
-0.06666666666666667 0.16666666666666666
b'RT @tchnolcvr: 1. you called us so many names such as chink
0.25 0.5
b'RT @sydneemcelroy: Almost everything the president of the United States has said about COVID-19 is a lie. We are in a pandemic and he is ly\xe2\x80\xa6'
0.0 0.0
b'@JaroGiesbrecht CPC 
0.0 0.0
b'RT @arfn_amn: Menjunjung kasih Tuanku. Adalah baiknya dilokapkan semuanya sementara wabak ini berlaku bagi mengurang

0.0 0.0
b'CORONAVIRUS: Brasil registrou 5.323.630 casos confirmados e 155.900 mortes por covid-19; foram 497 mortes em 24 hor\xe2\x80\xa6 https://t.co/Zz2QWUs726'
0.0 0.0
b'RT @MotherJones: How will we be able to tell if a COVID-19 vaccine is safe\xe2\x80\x94or a Trump-influenced rush job? Scientists will tell us: \xe2\x80\x9cThe sc\xe2\x80\xa6'
0.5 0.625
b"RT @PRGuy17: @theage @DanielAndrewsMP Over the next two years we'll see the full impacts unfold in France
0.175 0.275
b'RT @rapplerdotcom: Trump says he takes "full responsibility" on COVID-19
0.35 0.55
b'RT @ABC: FACT CHECK: Pres. Trump misleads when comparing COVID-19 pandemic to H1N1 and the Obama administration response. https://t.co/oqCu\xe2\x80\xa6'
0.0 0.0
b'Finally something that makes sense.'
0.0 1.0
b'Mob Set Fire To Warehouse Containing Undistributed COVID-19 Palliatives In Ondo
0.0 0.0
b'RT @UniteThePoor: 87 million people are either uninsured or underinsured. During the COVID-19 pandemic
0.0 0.0
b'RT @latimes: \xe2\x80\

0.0 0.0
b'No se qu\xc3\xa9 es peor para la econom\xc3\xada
0.0 0.0
b"RT @zerohedge: FDA Approves Gilead's Remdesivir To Treat COVID-19 Despite Data Showing Drug Doesn't Work https://t.co/cXuzUbcipE"
0.0 0.0
b'RT @latimes: \xe2\x80\x9c2.2 million people
0.0 0.0
b'RT @Pinkmashh: Nigerians are now finding hidden warehouse filled with MONTHS of COVID relief packages of food today smh . The government is\xe2\x80\xa6'
0.11666666666666668 0.6166666666666667
b'RT @ABC: Kristen Welker on people in need of funds from stalled COVID-19 relief bill: "Mr. President
0.0 0.0
b'RT @mpleejones: https://t.co/NBpOwcpOlV'
0.0 0.0
b'RT @pagina_siete: #CoronavirusMundo #P7Informa \nTransfusi\xc3\xb3n de plasma de pacientes recuperados de covid-19 tiene eficacia limitada https://\xe2\x80\xa6'
0.0 0.0
b'RT @JustinWolfers: If cooks want to wash their hands after taking a dump
0.0 0.0
b'RT @CoriBush: Imagine if Republicans in the Senate worked as fast to pass COVID-19 relief as they are working to confirm Amy Co
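
The sentences printed above are still Python bytes literals (`b'...'`) with escaped UTF-8 bytes, and many are cut at the first comma because `csv.reader` treats commas as field separators. The sketch below shows one possible alternative, assuming each line of the file holds one complete bytes literal; the helper name `decode_tweet_line` and the sample lines are made up for illustration.

```python
import ast


def decode_tweet_line(line):
    """Return the decoded tweet text for one raw line, or None if the line
    is not a complete bytes literal (log lines, truncated lines)."""
    line = line.strip()
    if not (line.startswith("b'") or line.startswith('b"')):
        return None
    try:
        raw = ast.literal_eval(line)  # safely parse the b'...' literal
    except (SyntaxError, ValueError):
        return None
    if not isinstance(raw, bytes):
        return None
    return raw.decode("utf-8", errors="replace")


# Illustrative sample lines, not taken from the data file
sample_lines = [
    "Listening on port: 5555",
    r"b'caf\xc3\xa9, por favor'",
    r"b'RT @latimes: \xe2\x80",
]
for sample in sample_lines:
    print(repr(decode_tweet_line(sample)))
```

Reading the file line by line and passing each line through such a helper would keep commas inside a tweet and yield readable text for the sentiment step, at the cost of skipping lines that are not well-formed literals.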

In [None]:
data = pd.DataFrame({"sentence": sentences, 
                     "sentiment_polarity":sentiment_polarity,
                     "sentiment_subjectivity":sentiment_subjectivity
                    })

In [None]:
# Drop the first two rows, which hold server log lines rather than tweets
data = data.drop([0, 1])

In [None]:
# Strip the b' prefix left over from the bytes representation
data["sentence"] = data["sentence"].str.replace("b'", "", regex=False)

In [None]:
data.head()

Unnamed: 0,sentence,sentiment_polarity,sentiment_subjectivity
2,RT @aclararc_: boa noite\nmeu grupo da faculda...,-0.1,0.2
3,"RT @WPeriyasamy: Friendly reminder that ""Race ...",0.104167,0.466667
4,RT @lximenezfyvie: El presidente de M\xc3\xa9x...,0.0,0.0
5,RT @doctormacias: La FDA aprueba el primer tra...,0.0,0.0
6,RT @fkelly25: \xe2\x80\x98No escape.\xe2\x80\x...,0.0,0.0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3440 entries, 2 to 3441
Data columns (total 3 columns):
sentence                  3440 non-null object
sentiment_polarity        3440 non-null float64
sentiment_subjectivity    3440 non-null float64
dtypes: float64(2), object(1)
memory usage: 107.5+ KB


In [None]:
# data.to_csv("tweets_COVID19.csv")

### Another solution: Build a function that reads the txt file and converts it to a csv file

In [None]:
def read_and_pre_pro(file_in, file_out):
    """Read raw tweets from file_in, score their sentiment, and write a CSV to file_out."""
    sentences = []
    sentiment_polarity = []
    sentiment_subjectivity = []
    with open(file_in, 'r') as csvfile:
        rows = csv.reader(csvfile)
        for row in rows:
            if not row:  # skip blank lines
                continue
            sentence = row[0]
            # Skip error messages written into the file
            if "Error on_data" not in sentence:
                blob = TextBlob(sentence)
                sentences.append(sentence)
                sentiment_polarity.append(blob.sentiment.polarity)
                sentiment_subjectivity.append(blob.sentiment.subjectivity)
    data = pd.DataFrame({"sentence": sentences,
                         "sentiment_polarity": sentiment_polarity,
                         "sentiment_subjectivity": sentiment_subjectivity})
    # Strip the b' prefix left over from the bytes representation
    data["sentence"] = data["sentence"].str.replace("b'", "", regex=False)
    data.to_csv(file_out)

In [None]:
file_in = "tweets_COVID19.txt"
file_out = "tweets_COVID19.csv"
read_and_pre_pro(file_in, file_out)

In [None]:
df = pd.read_csv("tweets_COVID19.csv", index_col=0)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3442 entries, 0 to 3441
Data columns (total 3 columns):
sentence                  3442 non-null object
sentiment_polarity        3442 non-null float64
sentiment_subjectivity    3442 non-null float64
dtypes: float64(2), object(1)
memory usage: 107.6+ KB


In [None]:
df.head()

Unnamed: 0,sentence,sentiment_polarity,sentiment_subjectivity
0,Listening on port: 5555,0.0,0.0
1,Received request from: ('127.0.0.1',-0.75,1.0
2,RT @aclararc_: boa noite\nmeu grupo da faculda...,-0.1,0.2
3,"RT @WPeriyasamy: Friendly reminder that ""Race ...",0.104167,0.466667
4,RT @lximenezfyvie: El presidente de M\xc3\xa9x...,0.0,0.0


In [None]:
# Find the rows that hold the "Listening on port" server log line
indexNames = df[df['sentence'].str.contains("Listening on port")].index
# Drop those rows from the DataFrame
df = df.drop(indexNames)

In [None]:
# Find the rows that hold the "Received request from" server log line
indexNames = df[df['sentence'].str.contains("Received request from")].index
# Drop those rows from the DataFrame
df = df.drop(indexNames)
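
The two drop steps above can also be written as a single boolean mask. This is a minimal sketch rather than the notebook's own method; `drop_log_rows` and `LOG_MARKERS` are hypothetical names, and the small DataFrame stands in for the real data.

```python
import re

import pandas as pd

# Substrings that identify server log lines (taken from the cells above)
LOG_MARKERS = ["Listening on port", "Received request from"]


def drop_log_rows(df, column="sentence", markers=LOG_MARKERS):
    """Return a copy of df without rows whose text contains any marker."""
    # Escape each marker so it is matched literally, then OR them together
    pattern = "|".join(re.escape(m) for m in markers)
    mask = df[column].str.contains(pattern, regex=True, na=False)
    return df[~mask].copy()


# Stand-in data for illustration
demo = pd.DataFrame({"sentence": [
    "Listening on port: 5555",
    "Received request from: ('127.0.0.1'",
    "RT @a: hello",
    "Finally something that makes sense.",
]})
print(drop_log_rows(demo))
```

Filtering by content keeps the original index labels on the remaining rows, just as `drop` does, and does not depend on the log lines sitting at fixed positions.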

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3440 entries, 2 to 3441
Data columns (total 3 columns):
sentence                  3440 non-null object
sentiment_polarity        3440 non-null float64
sentiment_subjectivity    3440 non-null float64
dtypes: float64(2), object(1)
memory usage: 107.5+ KB


In [None]:
df.head()

Unnamed: 0,sentence,sentiment_polarity,sentiment_subjectivity
2,RT @aclararc_: boa noite\nmeu grupo da faculda...,-0.1,0.2
3,"RT @WPeriyasamy: Friendly reminder that ""Race ...",0.104167,0.466667
4,RT @lximenezfyvie: El presidente de M\xc3\xa9x...,0.0,0.0
5,RT @doctormacias: La FDA aprueba el primer tra...,0.0,0.0
6,RT @fkelly25: \xe2\x80\x98No escape.\xe2\x80\x...,0.0,0.0
