# Use case 1: Sentiment Analysis on News Articles about Vendors

## Dealing with Semi-Structured Data Stored in HDFS on SQL 2019

-   This notebook shows how to process, transform, and prepare data from a JSON file for model scoring, and how to score each news item with a sentiment label through requests to an external REST API.
-   The model endpoint is a REST API developed outside SQL 2019 BDC and hosted in Azure; it serves both batch and live scoring.

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession\
        .builder\
        .appName("Spark_Ingestion_Job")\
        .config("spark.executor.memory", "20g")\
        .config("spark.executor.instances", "3")\
        .config("spark.master", "yarn")\
        .config("spark.submit.deployMode", "client")\
        .config("spark.driver.memory", "30g")\
        .enableHiveSupport()\
        .getOrCreate()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
311,application_1607925552807_0429,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


### Loading Data

In [3]:
# Read the file as an RDD of text lines and parse each line as JSON (assumes one JSON document per line)
import json
news_data_rdd = sc.textFile('/COE/news_data/contify_insights_new.json').map(json.loads)
news_data_rdd.take(1)

[{'results': [{'source': {'id': '1829', 'name': 'Domain-b', 'rank': 789288}, 'previews': [], 'url': 'https://www.domain-b.com/companies/companies_s/Siemens/20200824_acquisition.html', 'attachments': [], 'duplicates': [], 'content_types': [{'id': 3, 'name': 'News Articles'}], 'language': {'id': 'en', 'name': 'English'}, 'channel': 'News and Other Websites', 'summary': 'Siemens gets CCI nod for proposed acquisition of C&S Electric The Competition Commission of India (CCI) has approved the proposed acquisition of C&S Electric Limited by Siemens Limited. The combination envisages acquisition of 100 per cent acquisition of the share capital of C&S Electric Limited by Siemens India. At the time of closing of the proposed combination, the scope of business of C&S shall include low-voltage (LV) switchgear components and panels, LV and medium voltage (MV) power busbars as well as protection and metering devices of C&S. Certain other businesses of C&S, such as MV switchgear and package sub-stati

In [4]:
type(news_data_rdd)

<class 'pyspark.rdd.PipelinedRDD'>

In [4]:
from pyspark.sql.types import Row
import pyspark.sql.functions as sf 
import requests

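def spliter(lines):
    # Flatten one parsed JSON record into a dict with id, title, summary and search_company.
    # Note: when 'results' holds several articles, each pass of the loop overwrites the
    # previous values, so only the last article is kept.
    data = {}
    line = lines['results']
    if line:
        for d in line:
            data['id'] = d['id']
            data['title'] = d['title']
            data['summary'] = d['summary']
    else:
        # No articles for this record: emit empty placeholders
        data['id'] = ''
        data['title'] = ''
        data['summary'] = ''
    data['search_company'] = lines['search_company']
    return data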

rdd_df = news_data_rdd.map(lambda x: Row(**spliter(x)))
rdd_df.collect()

[Row(id=20082425521615, search_company='C&S ELECTRIC LIMITED ', summary='Siemens gets CCI nod for proposed acquisition of C&S Electric The Competition Commission of India (CCI) has approved the proposed acquisition of C&S Electric Limited by Siemens Limited. The combination envisages acquisition of 100 per cent acquisition of the share capital of C&S Electric Limited by Siemens India. At the time of closing of the proposed combination, the scope of business of C&S shall include low-voltage (LV) switchgear components and panels, LV and medium voltage (MV) power busbars as well as protection and metering devices of C&S. Certain other businesses of C&S, such as MV switchgear and package sub-station, lighting, diesel generating sets, engineering, procurement and construction business and the “Etacom” busbars business will be retained by the existing promoters of C&S. Siemens India focuses on the areas of power generation and distribution, intelligent infrastructure for buildings and distri
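
`spliter` returns one dictionary per input record, so a record whose `results` list holds several articles contributes only its last article. If one row per article is wanted instead, a variant along the following lines could be used with `news_data_rdd.flatMap(...)` in place of `map(...)`. This is only a sketch: the name `split_results` and the sample record are hypothetical, and the field names are assumed from what `spliter` reads. Unlike `spliter`, it emits no row at all for a record with empty `results`.

```python
def split_results(record):
    """Return one flat dict per article in record['results'].

    Hypothetical variant of spliter(); assumes each article carries
    'id', 'title' and 'summary', the fields spliter() reads.
    """
    company = record.get('search_company', '')
    rows = []
    for article in record.get('results') or []:
        rows.append({
            'id': article.get('id'),
            'title': article.get('title', ''),
            'summary': article.get('summary', ''),
            'search_company': company,
        })
    return rows


# Made-up sample record with two articles, for illustration only
sample = {
    'search_company': 'ACME LTD',
    'results': [
        {'id': 1, 'title': 'First', 'summary': 'First summary'},
        {'id': 2, 'title': 'Second', 'summary': 'Second summary'},
    ],
}
rows = split_results(sample)
print(len(rows))
```

With an RDD, the call would look like `news_data_rdd.flatMap(split_results).map(lambda d: Row(**d))`, since `flatMap` flattens the per-record lists into individual rows.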

In [5]:
df = rdd_df.toDF()
df.printSchema()

root
 |-- id: long (nullable = true)
 |-- search_company: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- title: string (nullable = true)

In [7]:
type(df)

<class 'pyspark.sql.dataframe.DataFrame'>

In [6]:
# convert the PySpark dataframe to a pandas dataframe,
# drop rows with an empty summary, and renumber the index
pd_df = df.toPandas()
pd_df = pd_df[pd_df['summary'] != ''].reset_index(drop=True)

In [8]:
pd_df.columns


Index(['id', 'search_company', 'summary', 'title'], dtype='object')

In [9]:
# model scoring 
def sentiment_scores(text_input):
    response = requests.post("http://52.187.124.32:80/api/v1/service/absa-sentiment-predictor-v2/score", text_input, headers = {'Content-Type' : 'application/json', 'Authorization': 'Bearer 1Q7d5p2SqViNlQbhe6gtHBAiZ5MB58rU'})
    response.raise_for_status()  # fail fast on an HTTP error instead of parsing an error body
    result = response.json()
    polarity = result['_doc_polarity']
    scores = result['scores']
    return polarity, scores

# attach model results to dataframe
def model_scores(dataframe):
    # Score each row's 'scoring_text' and store the polarity plus one column per label.
    # Each score column holds a stringified list, e.g. '[0.625]', or '[]' when the label is absent or zero.
    for index, row in dataframe.iterrows():
        pol, scores = sentiment_scores(row['scoring_text'].encode('utf-8'))
        dataframe.loc[index, 'polarity'] = pol
        for label in ('Positive', 'Neutral', 'Negative'):
            dataframe.loc[index, label.lower()] = str([v for k, v in scores.items() if k == label and v])
    return dataframe
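
`model_scores` stores each label's score as a stringified list such as `[0.625]` or `[]`, which is awkward to filter or aggregate later. If numeric columns are preferred, the extraction could be sketched as a small helper like the one below. The name `split_scores` is hypothetical, and the shape of `scores` (a dict keyed by `Positive`, `Neutral` and `Negative`) is an assumption inferred from the code above, not taken from the service's documentation.

```python
def split_scores(scores):
    """Map a label -> score dict onto fixed, lower-case column names.

    Assumes 'scores' is keyed by 'Positive', 'Neutral' and 'Negative'
    (inferred from model_scores). Labels that are missing or scored 0
    become None, mirroring the empty list '[]' in the original columns.
    """
    columns = {'Positive': 'positive', 'Neutral': 'neutral', 'Negative': 'negative'}
    return {col: (scores.get(label) or None) for label, col in columns.items()}


# Made-up response fragment, for illustration only
example = split_scores({'Positive': 0.625, 'Neutral': 0.094})
print(example)
```

Inside the scoring loop, the returned dictionary could be written to the row one key at a time, giving float columns with missing values instead of strings.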

In [10]:
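# text pre-processing: build the JSON request body for each row
# json.dumps escapes quotes and backslashes in the text, which plain string concatenation would not
pd_df['scoring_text'] = [
    json.dumps({'news': news, 'name': name}, separators=(',', ':'), ensure_ascii=False)
    for news, name in zip(pd_df['summary'], pd_df['search_company'])
]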

# application of model scoring 
#
model_score_df = model_scores(pd_df)
model_score_df[['polarity', 'positive', 'negative', 'neutral']].head() # print top 5 results

   polarity positive negative  neutral
0  Positive  [0.625]       []  [0.094]
1  Negative       []    [1.5]       []
2  Negative       []    [0.5]  [0.167]
3  Positive  [0.333]       []  [0.167]
4  Positive    [1.0]       []       []

In [11]:
# convert pandas dataframe to Pyspark dataframe
#
model_scores_spark_df = spark.createDataFrame(model_score_df)
print(type(model_scores_spark_df))
model_scores_spark_df.printSchema()  # prints the schema itself and returns None
model_scores_spark_df.show(5)        # prints the first 5 rows itself and returns None

<class 'pyspark.sql.dataframe.DataFrame'>
root
 |-- id: double (nullable = true)
 |-- search_company: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- title: string (nullable = true)
 |-- scoring_text: string (nullable = true)
 |-- polarity: string (nullable = true)
 |-- positive: string (nullable = true)
 |-- neutral: string (nullable = true)
 |-- negative: string (nullable = true)

+------------------+--------------------+--------------------+--------------------+--------------------+--------+--------+-------+--------+
|                id|      search_company|             summary|               title|        scoring_text|polarity|positive|neutral|negative|
+------------------+--------------------+--------------------+--------------------+--------------------+--------+--------+-------+--------+
|2.0082425521615E13|C&S ELECTRIC LIMI...|Siemens gets CCI ...|Siemens gets CCI ...|{"news":"Siemens ...|Positive| [0.625]|[0.094]|      []|
|2.0082425521616E13|      ES

In [9]:
# save the Spark dataframe to HDFS as CSV; Spark writes a directory of part files at this path
#
model_scores_spark_df.write.format('csv').mode('overwrite').option('header', True).save('/COE/news_data/news_rdd/sentiment_scores.csv')

An error was encountered:
name 'model_scores_spark_df' is not defined
Traceback (most recent call last):
NameError: name 'model_scores_spark_df' is not defined

