
**Import modules and create Spark session**

In [19]:
#import modules
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, StopWordsRemover
# Import SparkSession
from pyspark.sql import SparkSession

#create Spark session
appName = "Sentiment Analysis in Spark"
spark = SparkSession \
    .builder \
    .appName(appName) \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

**Read data file into Spark DataFrame**

In [27]:
#read CSV file into a DataFrame with an automatically inferred schema
amazon_csv = spark.read.csv('/content/data/Reviews.csv', inferSchema=True, header=True)
amazon_csv.show(truncate=False, n=5)

+--+---------+------+-----------+--------------------+----------------------+-----+----+-------+----+
|Id|ProductId|UserId|ProfileName|HelpfulnessNumerator|HelpfulnessDenominator|Score|Time|Summary|Text|
+--+---------+------+-----------+--------------------+----------------------+-----+----+-------+----+
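
The tokenized rows shown later in this notebook include blank `Text` values with `NULL` labels and a fragment ending in `"""`. That pattern is consistent with review text containing quoted line breaks that a line-by-line reader splits into separate records, although this has not been verified against `Reviews.csv`. The sketch below uses Python's standard `csv` module on a made-up two-column sample (hypothetical data, not taken from the file) to show the difference between splitting on newlines and parsing quoted fields. In Spark, `spark.read.csv` accepts `multiLine`, `quote`, and `escape` options for this situation; whether they fix this particular file is an assumption that would need checking.

```python
import csv
import io

# Hypothetical sample: the second record's text contains a quoted line break
# and doubled quotes (""), the usual CSV way to escape a quote character.
sample = 'Id,Text\n1,"Good tea"\n2,"Only 10% sugar,\n but it says ""natural"""\n'

# Naive approach: treat every physical line as one record.
naive_rows = [line.split(",") for line in sample.splitlines()]

# Quote-aware approach: csv.reader keeps a quoted field together
# even when it spans several physical lines.
parsed_rows = list(csv.reader(io.StringIO(sample)))

print(len(naive_rows))    # number of physical lines, header included
print(len(parsed_rows))   # number of logical records, header included
print(repr(parsed_rows[2][1]))
```

The naive reader produces one extra "record" whose first field is a leftover text fragment, which is the kind of row that would end up with a missing label.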

**Select the related data**

In [40]:
#select the "Text" column and cast "HelpfulnessNumerator"
#to integer, exposed under the alias "label"
data = amazon_csv.select("Text", col("HelpfulnessNumerator").cast("int").alias("label"))
data.show(truncate=False, n=5)

+----+-----+
|Text|label|
+----+-----+
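
With ANSI mode off, a Spark cast from string to integer returns `NULL` for values it cannot parse instead of raising an error, so malformed rows can pass through this step silently. The following is a minimal pure-Python sketch of that lenient behavior and of filtering out unusable rows afterwards. The helper `to_int_or_none` and the sample rows are hypothetical and only illustrate the idea; in Spark, `DataFrame.dropna(subset=["label"])` is one way to drop rows with a missing label.

```python
from typing import Optional


def to_int_or_none(value: Optional[str]) -> Optional[int]:
    """Mimic a lenient cast: unparseable or missing input becomes None."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


# Hypothetical raw rows as (Text, HelpfulnessNumerator) string pairs.
raw_rows = [
    ("Great coffee", "3"),
    ("", None),               # fragment of a broken record
    ("Too sweet for me", "0"),
    ("see above", "n/a"),     # non-numeric value
]

labelled = [(text, to_int_or_none(num)) for text, num in raw_rows]

# Keep only rows that have both non-blank text and a usable label.
# "label is not None" is used so that a legitimate label of 0 is kept.
clean = [(text, label) for text, label in labelled
         if text.strip() and label is not None]
print(clean)
```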

**Divide data into training and testing data**

In [41]:
#divide data: 70% for training, 30% for testing
dividedData = data.randomSplit([0.7, 0.3])
trainingData = dividedData[0] #index 0 = training data
testingData = dividedData[1] #index 1 = testing data
train_rows = trainingData.count()
test_rows = testingData.count()
print("Training data rows:", train_rows, "; Testing data rows:", test_rows)

Training data rows: 398272 ; Testing data rows: 170182
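
The counts above give 398272 / 568454, or about 70.06% for training rather than exactly 70%. That is expected, because `randomSplit` assigns rows by random draws, so the weights are targets and not exact proportions. The sketch below is a rough pure-Python illustration of per-row random assignment, not Spark's actual implementation; the function `random_split` is hypothetical. `randomSplit` also accepts an optional `seed` argument, which is what the `seed` parameter here imitates.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one bucket using an independent random draw."""
    total = sum(weights)
    # Cumulative upper bounds of each bucket after normalizing the weights.
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                buckets[i].append(row)
                break
    return buckets


rows = list(range(10_000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Running the split twice with the same seed returns the same buckets, while omitting the seed gives a different partition on each run.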


**Prepare training data**

In [45]:
# Now proceed with tokenization
tokenizer = Tokenizer(inputCol="Text", outputCol="SentimentWords")
tokenizedTrain = tokenizer.transform(trainingData)
tokenizedTrain.show(truncate=False, n=5)

+------------------------+-----+-------------------------------+
|Text                    |label|SentimentWords                 |
+------------------------+-----+-------------------------------+
|                        |NULL |[]                             |
|                        |NULL |[]                             |
|                        |NULL |[]                             |
|                        |NULL |[]                             |
| 0% of what you don't"""|1    |[, 0%, of, what, you, don't"""]|
+------------------------+-----+-------------------------------+
only showing top 5 rows
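
Spark's `Tokenizer` lowercases the input and splits it on whitespace, which is why the last row above starts with an empty token (the text begins with a space) and keeps punctuation attached to words. The following pure-Python function only approximates that behavior for illustration; `simple_tokenize` is a hypothetical helper, and edge cases such as unusual Unicode whitespace may differ from Spark.

```python
import re


def simple_tokenize(text):
    """Lowercase, split on each whitespace character, drop trailing empties."""
    tokens = re.split(r"\s", text.lower())
    # Assumption: trailing empty strings are dropped, leading ones are kept.
    while tokens and tokens[-1] == "":
        tokens.pop()
    return tokens


print(simple_tokenize(' 0% of what you don\'t"""'))
print(simple_tokenize("Great  Coffee"))
```

Because each whitespace character is its own separator, consecutive spaces also produce empty tokens. Empty tokens and stray quote characters are noise for the later feature steps, so filtering them out or using a pattern-based tokenizer such as `RegexTokenizer` may be worth considering.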

