

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb)




# **Find sentiment in text - Total File**



## 1. Colab Setup

In [1]:
# Install Java
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!java -version

# Install PySpark
!pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
!pip install --ignore-installed spark-nlp

openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
  Building wheel for pyspark (setup.py) ... done
Collecting spark-nlp
  Downloading https://files.pythonhosted.org/packages/1b/d9/44fd438e15fa9a02c0e3b3ca9eaffc509fc626592f7a03ce05d8f156d448/spark_nlp-2.7.5-py2.py3-none-any.whl (139kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-2.7.5


In [2]:
import pandas as pd
import numpy as np
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
import json
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

In [3]:
from google.colab import files
uploaded = files.upload()

Saving COVID19Tweets_HA_final_v2.csv to COVID19Tweets_HA_final_v2.csv


In [4]:
import io
tweet_df = pd.read_csv(io.BytesIO(uploaded['COVID19Tweets_HA_final_v2.csv']))

In [5]:
tweet_df = tweet_df.iloc[:,1:]

In [6]:
text_list = tweet_df['text']
text_list.shape

(34081,)

In [7]:
text_list2 = list(set(text_list))
len(text_list2)

34081
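
`list(set(...))` removes duplicates but does not keep the original tweet order. Both counts above are 34081, which suggests the column holds no exact duplicates here. If order matters for other datasets, a minimal order-preserving sketch could look like this (the helper name `unique_in_order` is hypothetical and not part of the notebook):

```python
def unique_in_order(texts):
    """Return the distinct items of `texts`, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(texts))


sample = ["stay safe", "wash hands", "stay safe"]
print(unique_in_order(sample))
```

The helper accepts any iterable of hashable items, so it should also work on a pandas Series such as `text_list`.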

## 2. Start Spark Session

In [8]:
spark = sparknlp.start()

## 3. Select the DL model and re-run cells below

In [9]:
MODEL_NAME='sentimentdl_use_twitter'

## 4. Define Spark NLP pipeline

In [10]:
# Wrap the raw "text" column into Spark NLP document annotations
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Encode each document with the Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Classify the sentence embeddings with the selected sentiment model
sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        use,
        sentimentdl
    ])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[OK!]


## 5. Run the pipeline

In [11]:
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

df = spark.createDataFrame(pd.DataFrame({"text":text_list}))
result = pipelineModel.transform(df)

## 6. Visualize results

In [12]:

result.select(F.explode(F.arrays_zip('document.result', 'sentiment.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("document"),
        F.expr("cols['1']").alias("sentiment")).show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------+---------+
|document                                                                                                                          |sentiment|
+----------------------------------------------------------------------------------------------------------------------------------+---------+
| This is an excellent resource for health professionals  researchers  and community leaders  alike    anyone who wants to         |positive |
|Obese COVID19 patients are more likely to experience worse outcomes according to a study published in    co k5e41kM2kJ            |negative |
|Member Snapshot  Every day  InterAction Members are working all over the world to combat poverty and alleviate su   co GcEPMvCrj0 |positive |
|    We have released a new COVID19 guidance document to assist districts in preparing for traditional spring activities lik       |positive |
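
`show()` prints only the first 20 rows by default, so the table above says little about the overall label distribution. One way to summarize it is to tally the labels with `collections.Counter`. The sketch below works on a plain Python list of labels; how the labels are collected from `result` is left open, and `label_shares` is a hypothetical helper name.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to a (count, share of all labels) pair."""
    counts = Counter(labels)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs
    return {label: (n, n / total) for label, n in counts.items()}


demo = ["positive", "negative", "positive", "positive"]
for label, (n, share) in label_shares(demo).items():
    print(f"{label}: {n} ({share:.0%})")
```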

In [13]:
result

DataFrame[text: string, document: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentence_embeddings: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentiment: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>]

In [None]:
result_df = result.toPandas()

In [None]:
from google.colab import files
result_df.to_csv('twitter_sentiment.csv') 
files.download('twitter_sentiment.csv')
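
`result_df` still carries the nested `document`, `sentence_embeddings`, and `sentiment` annotation columns, so the CSV written above will likely contain their string representations, embeddings included. A minimal sketch of pulling just the predicted label out of one cell follows. It assumes each cell is a list of annotation records that can be indexed with `["result"]` (Spark `Row` objects allow this; plain dicts stand in for them here), and `first_label` is a hypothetical helper name.

```python
def first_label(annotations, default=None):
    """Return the `result` field of the first annotation, or `default`."""
    if annotations is None or len(annotations) == 0:
        return default
    return annotations[0]["result"]


# Plain dicts stand in for the Row objects expected in result_df["sentiment"]
cell = [{"annotatorType": "category", "result": "positive"}]
print(first_label(cell))
```

In the notebook this might be applied as `result_df["sentiment"].apply(first_label)` before calling `to_csv`, keeping only the `text` column and the extracted label.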

# **Find sentiment in text - International Women's Day**



In [15]:
from google.colab import files
uploaded = files.upload()

Saving Topic3_IntlWomensDay_Tweets.csv to Topic3_IntlWomensDay_Tweets.csv


In [16]:
topic_df = pd.read_csv(io.BytesIO(uploaded['Topic3_IntlWomensDay_Tweets.csv']))

In [17]:
topic_list = topic_df['text']
topic_list.shape

(2025,)

## 5. Run the pipeline

In [18]:
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

df_topic = spark.createDataFrame(pd.DataFrame({"text":topic_list}))
result_topic = pipelineModel.transform(df_topic)
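
Every stage in `nlpPipeline` is either a transformer or a pretrained model, so fitting it again on an empty DataFrame should not change anything, and the `pipelineModel` from the first section can probably be reused as-is. A sketch of a small wrapper that builds the input DataFrame and applies an already fitted model (the name `score_texts` is hypothetical):

```python
def score_texts(spark, pipeline_model, texts):
    """Run an already fitted pipeline over an iterable of strings."""
    # One-column DataFrame named "text", the column DocumentAssembler reads
    rows = [(t,) for t in texts]
    df = spark.createDataFrame(rows, ["text"])
    return pipeline_model.transform(df)


# Possible use in this notebook (assumes `spark` and `pipelineModel` exist):
# result_topic = score_texts(spark, pipelineModel, topic_list)
```

Note that `createDataFrame` cannot infer a schema from an empty list, so the sketch assumes `texts` holds at least one string.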

## 6. Visualize results

In [19]:
result_topic.select(F.explode(F.arrays_zip('document.result', 'sentiment.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("document"),
        F.expr("cols['1']").alias("sentiment")).show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------+---------+
|document                                                                                                                          |sentiment|
+----------------------------------------------------------------------------------------------------------------------------------+---------+
| Inclusion in  s list of women sharing lessons learnt during the pandemic went above   beyond my expectations  Taiwan is          |negative |
| During COVID19  progress on gender equality has regressed  We ve seen 1  Appalling increases in violence against women2  R       |negative |
| Happy International WomensDay This year we celebrate the tremendous efforts by women and girls around the world in shapi         |positive |
| On InternationalWomensDay  join us in paying tribute to all the women standing at the front lines of the COVID19 crisis Whe      |positive |

In [20]:
result_topic_df = result_topic.toPandas()

In [21]:
from google.colab import files
result_topic_df.to_csv('twitter_sentiment_topic.csv') 
files.download('twitter_sentiment_topic.csv')
