We will use the text in the `description` column to predict whether the sentiment of each news article is positive or negative.

## Load Clean data to df

In [1]:
df = spark.sql("SELECT * FROM bing_lake_db.tbl_latest_news")
display(df.limit(5))
df.count()


178

## Import Synapse ML libraries

In [2]:
import synapse.ml.core
from synapse.ml.services import AnalyzeText


## Configure Model

- Import the model and configure the input and output columns
```
model = (AnalyzeText()
.setTextCol("Choose the Text Column here")
.setKind("SentimentAnalysis")
.setOutputCol("response")
.setErrorCol("error"))
```

In [3]:
# Create the AnalyzeText model and configure the input and output columns

model = (AnalyzeText()
.setTextCol("description")
.setKind("SentimentAnalysis")  # Kind of text analysis to run
.setOutputCol("response")
.setErrorCol("error"))


## Apply the model to Dataframe

In [4]:
result = model.transform(df)


In [5]:
display(result.limit(5))


As the output shows, the `response` column holds a JSON-like structure that contains the sentiment value for each row's `description`. We will extract that value from the `response` column into a column of its own.

## Create Sentiment Column

In [6]:
from pyspark.sql.functions import col

sentiment_df = result.withColumn("sentiment", col("response.documents.sentiment"))
display(sentiment_df.limit(5))
sentiment_df.count()


178
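
To make the path `response.documents.sentiment` concrete, here is a minimal sketch in plain Python that walks the same kind of dotted path through a nested dictionary. The shape and values of `sample_response` and the `get_nested` helper are assumptions made for illustration; they are not copied from the actual service output.

```python
# Hypothetical response for a single row. The field names follow the path used
# in the cell above (response.documents.sentiment); the values are made up.
sample_response = {
    "documents": {
        "id": "0",
        "sentiment": "positive",
        "confidenceScores": {"positive": 0.95, "neutral": 0.04, "negative": 0.01},
    }
}


def get_nested(record, dotted_path, default=None):
    """Walk a dotted path such as 'documents.sentiment' through nested dicts."""
    current = record
    for key in dotted_path.split("."):
        # Stop early if the path leads through a missing or non-dict value
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


sentiment = get_nested(sample_response, "documents.sentiment")
print(sentiment)
```

In the notebook, Spark resolves the nested path itself through `col(...)`; the helper only illustrates what that lookup amounts to for one row.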

## Drop Error and Response columns
We no longer need the `error` and `response` columns, so we drop both.

In [7]:
sentiment_df_final = sentiment_df.drop("error","response")


In [8]:
display(sentiment_df_final.limit(5))


## Convert publishedAt column to date

In [14]:
from pyspark.sql.functions import col, to_date

sentiment_df_final = sentiment_df_final.withColumn("publishedAt", to_date(col("publishedAt"), "dd-MM-yyyy"))
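
The pattern `dd-MM-yyyy` reads a two-digit day, a two-digit month, and a four-digit year, separated by hyphens. As a rough illustration of what that means for individual values, the sketch below uses plain Python's `strptime` with the corresponding pattern `%d-%m-%Y`. The helper name `parse_published_at` and the sample strings are hypothetical. Returning `None` for a mismatch is an assumption of this sketch: in Spark, whether an unparseable value becomes null or raises an error depends on the session settings.

```python
from datetime import date, datetime
from typing import Optional


def parse_published_at(raw: Optional[str]) -> Optional[date]:
    """Parse text in day-month-year order; return None when it does not match."""
    if raw is None:
        return None
    try:
        # %d-%m-%Y is the strptime counterpart of the Spark pattern dd-MM-yyyy
        return datetime.strptime(raw, "%d-%m-%Y").date()
    except ValueError:
        return None


# Made-up sample values: one matching the pattern, one in ISO order, one missing
samples = ["07-03-2024", "2024-03-07", None]
for raw in samples:
    print(raw, "->", parse_published_at(raw))
```

A check like this on a handful of raw `publishedAt` values is a cheap way to confirm that the source data really is in day-month-year order before converting the whole column.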


## Write the dataframe to Lakehouse db as a table with Incremental Load Type 1

In [9]:
from pyspark.sql.utils import AnalysisException

table_name = "bing_lake_db.tbl_sentiment_analysis"

try:
    # First run: create the Delta table from the DataFrame
    sentiment_df_final.write.format("delta").saveAsTable(table_name)

except AnalysisException:
    # With the default save mode, saveAsTable raises when the table already exists,
    # so fall back to merging the new rows into it
    print("Table Already Exists.")

    sentiment_df_final.createOrReplaceTempView("vw_sentiment_df_final")

    # Type 1 load: overwrite rows whose values changed, insert new rows, keep no history
    spark.sql(f"""  MERGE INTO {table_name} target_table
                    USING vw_sentiment_df_final source_view

                    ON source_view.url = target_table.url

                    WHEN MATCHED AND (
                        source_view.title <> target_table.title OR
                        source_view.description <> target_table.description OR
                        source_view.content <> target_table.content OR
                        source_view.author <> target_table.author OR
                        source_view.urlToImage <> target_table.urlToImage OR
                        source_view.sourceName <> target_table.sourceName OR
                        source_view.publishedAt <> target_table.publishedAt OR
                        source_view.sentiment <> target_table.sentiment
                    )
                    THEN UPDATE SET *

                    WHEN NOT MATCHED THEN INSERT *
    """)


Table Already Exists.
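
The Type 1 logic of the `MERGE` statement can be sketched in plain Python: rows are matched on `url`, changed rows are overwritten in place, new rows are inserted, and no history is kept. The function `merge_type1`, the reduced set of columns, and the sample rows are all hypothetical and only illustrate the idea; this is not how Delta Lake implements the merge.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_type1(target: Dict[str, Row], source: List[Row], key: str = "url") -> Dict[str, int]:
    """Upsert source rows into target (keyed by `key`) in place and count the outcomes."""
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for row in source:
        existing = target.get(row[key])
        if existing is None:
            # Corresponds to: WHEN NOT MATCHED THEN INSERT *
            target[row[key]] = dict(row)
            stats["inserted"] += 1
        elif existing != row:
            # Corresponds to: WHEN MATCHED AND <any column differs> THEN UPDATE SET *
            target[row[key]] = dict(row)
            stats["updated"] += 1
        else:
            # Matched and identical: left untouched
            stats["unchanged"] += 1
    return stats


# Made-up rows with only three columns to keep the example short
target = {
    "https://example.com/a": {"url": "https://example.com/a", "title": "Old title", "sentiment": "neutral"},
    "https://example.com/b": {"url": "https://example.com/b", "title": "Same", "sentiment": "positive"},
}
source = [
    {"url": "https://example.com/a", "title": "New title", "sentiment": "negative"},
    {"url": "https://example.com/b", "title": "Same", "sentiment": "positive"},
    {"url": "https://example.com/c", "title": "Brand new", "sentiment": "positive"},
]

stats = merge_type1(target, source)
print(stats)
```

One difference is worth keeping in mind: Python's `!=` treats `None` as an ordinary value, whereas in SQL `a <> b` evaluates to NULL when either side is NULL. In the `MERGE` above, a column that changes to or from NULL would therefore not satisfy the `WHEN MATCHED` condition on its own.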
