# DATA603 Big Data Processing Project 
Group 3: Pooja Kangokar Pranesh, Yun-Zih Chen, Elizabeth Cardosa

The goal of this project is to leverage big data technologies to train a model on the UCI ML Drug Review dataset that predicts the star rating of a drug from the sentiment of its review. The model will then perform streaming inference on ‘real-time’ reviews as they arrive. The resulting application can help potential customers understand the overall sentiment towards a drug and whether it might be useful for them.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
working_folder = "/content/drive/My Drive/UMBC Fall 2022/DATA603 Big Data Processing/Project/Data/"

# Install Libraries and Dependencies

In [None]:
"""!pip install -qq pyspark 
!pip install -qq spark-nlp 
!pip install -qq findspark """

'!pip install -qq pyspark \n!pip install -qq spark-nlp \n!pip install -qq findspark '

In [None]:
# Install PySpark and Spark NLP
! pip install -qq pyspark==3.2.1 spark-nlp findspark #pyspark==3.1.2 spark-nlp findspark



In [None]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-11-27 21:23:33--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-11-27 21:23:33--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-11-27 21:23:33--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:44

In [None]:
import pyspark.pandas as ps
import pandas as pd



In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext

In [None]:
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

In [None]:
"""# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark"""

'# Import SparkSession\nfrom pyspark.sql import SparkSession\n# Create a Spark Session\nspark = SparkSession.builder.master("local[*]").getOrCreate()\n# Check Spark Session Information\nspark'

In [None]:
spark = sparknlp.start()

print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 4.2.3
Apache Spark version: 3.2.1


In [None]:
sc = SparkContext.getOrCreate()

# Read-in Dataset


## Dataset: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29


The dataset provides patient reviews on specific drugs along with related conditions and a 10 star patient rating reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. The intention was to study

- sentiment analysis of drug experience over multiple facets, i.e. sentiments learned on specific aspects such as effectiveness and side effects,
- the transferability of models among domains, i.e. conditions, and
- the transferability of models among different data sources (see 'Drug Review Dataset (Druglib.com)').

The data is split into a train (75%) and a test (25%) partition (see publication) and stored in two .tsv (tab-separated-values) files, respectively.

Attribute Information:

1. drugName (categorical): name of drug
2. condition (categorical): name of condition
3. review (text): patient review
4. rating (numerical): 10 star patient rating
5. date (date): date of review entry
6. usefulCount (numerical): number of users who found review useful
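
As a rough illustration of the record layout above, the sketch below parses one tab-separated row with Python's standard library and decodes HTML entities such as `&#039;`, which show up in the raw review text. The `DrugReview` class, the `parse_reviews` helper, and the sample row are hypothetical and exist only for this example.

```python
import csv
import html
import io
from dataclasses import dataclass


@dataclass
class DrugReview:
    drug_name: str
    condition: str
    review: str
    rating: float
    date: str
    useful_count: int


def parse_reviews(tsv_text):
    """Parse tab-separated review rows into DrugReview records."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        records.append(
            DrugReview(
                drug_name=row["drugName"],
                condition=row["condition"],
                # Decode HTML entities and drop any leftover wrapping quotes.
                review=html.unescape(row["review"]).strip('"'),
                rating=float(row["rating"]),
                date=row["date"],
                useful_count=int(row["usefulCount"]),
            )
        )
    return records


# Hypothetical sample row, not taken from the dataset.
SAMPLE_TSV = (
    "drugName\tcondition\treview\trating\tdate\tusefulCount\n"
    "ExampleDrug\tHeadache\t\"I&#039;ve had no side effects\"\t9.0\tMay 20, 2012\t27\n"
)

reviews = parse_reviews(SAMPLE_TSV)
print(reviews[0].review)
```

Keeping `date` as a string mirrors the schema used later in this notebook; converting it to a real date type would be a separate step.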


Important notes:

When using this dataset, you agree that you
1. only use the data for research purposes
2. don't use the data for any commercial purposes
3. don't distribute the data to anyone else
4. cite us

Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125. DOI: [Web Link] 

## Load in Test Data

In [None]:
# Define the schema shared by the review CSV files
customschema = StructType([
  StructField("UniqueID", IntegerType(), True)
  ,StructField("drugName", StringType(), True)
  ,StructField("condition", StringType(), True)
  ,StructField("review", StringType(), True)
  ,StructField("rating", DoubleType(), True)
  ,StructField("date", StringType(), True)
  ,StructField("usefulCount", IntegerType(), True)
  ,StructField("sentiment", DoubleType(), True)
  ])

In [None]:
df_test = spark.read.format("csv")\
           .option("delimiter", ",")\
           .option("header", "true")\
           .option("quote", "\"")\
           .option("escape", "\"")\
           .option("multiLine","true")\
           .option("quoteMode","ALL")\
           .option("mode","PERMISSIVE")\
           .option("ignoreLeadingWhiteSpace","true")\
           .option("ignoreTrailingWhiteSpace","true")\
           .option("parserLib","UNIVOCITY")\
           .schema(customschema)\
           .load(working_folder + "drug_reviews_with_sentiment_test.csv")
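
The `quote`, `escape`, and `multiLine` options matter because a single review can contain commas, line breaks, and doubled quote characters inside one quoted field. The sketch below shows the same quoting convention with Python's built-in `csv` module on a made-up sample; it only illustrates the parsing rule and is not how Spark reads the file.

```python
import csv
import io

# Hypothetical sample: the review in row 1 spans two lines and
# contains an escaped quote, written as "" inside the quoted field.
RAW = (
    'UniqueID,review,rating\n'
    '1,"Worked well, but made me ""drowsy""\nfor the first week",8.0\n'
    '2,"No issues",10.0\n'
)

# doublequote=True treats "" inside a quoted field as one literal quote,
# which mirrors setting both quote and escape to '"' in the Spark reader.
rows = list(csv.reader(io.StringIO(RAW), quotechar='"', doublequote=True))

header, records = rows[0], rows[1:]
print(len(records))
print(records[0][1])
```

Without multi-line handling, the embedded line break in the first review would be read as the start of a new, malformed record.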

In [None]:
df_test.count()

53766

In [None]:
df_test.show(5)

+--------+---------------+--------------------+--------------------+------+------------------+-----------+
|UniqueID|       drugName|           condition|              review|rating|              date|usefulCount|
+--------+---------------+--------------------+--------------------+------+------------------+-----------+
|  163740|    Mirtazapine|          Depression|"I&#039;ve tried ...|  10.0| February 28, 2012|         22|
|  206473|     Mesalamine|Crohn's Disease, ...|"My son has Crohn...|   8.0|      May 17, 2009|         17|
|  159672|        Bactrim|Urinary Tract Inf...|"Quick reduction ...|   9.0|September 29, 2017|          3|
|   39293|       Contrave|         Weight Loss|"Contrave combine...|   9.0|     March 5, 2017|         35|
|   97768|Cyclafem 1 / 35|       Birth Control|"I have been on t...|   9.0|  October 22, 2015|          4|
+--------+---------------+--------------------+--------------------+------+------------------+-----------+
only showing top 5 rows



## Load in and Explore Training Data

In [None]:
# Read in training data file
customschema = StructType([
  StructField("UniqueID", IntegerType(), True)
  ,StructField("drugName", StringType(), True)
  ,StructField("condition", StringType(), True)
  ,StructField("review", StringType(), True)
  ,StructField("rating", DoubleType(), True)
  ,StructField("date", StringType(), True)
  ,StructField("usefulCount", IntegerType(), True)
  ,StructField("sentiment", DoubleType(), True)
  ])

df = spark.read.format("csv")\
           .option("delimiter", ",")\
           .option("header", "true")\
           .option("quote", "\"")\
           .option("escape", "\"")\
           .option("multiLine","true")\
           .option("quoteMode","ALL")\
           .option("mode","PERMISSIVE")\
           .option("ignoreLeadingWhiteSpace","true")\
           .option("ignoreTrailingWhiteSpace","true")\
           .option("parserLib","UNIVOCITY")\
           .schema(customschema)\
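           .load(working_folder + "drug_reviews_with_sentiment_train.csv")  # assumed training file name, mirroring the test file

# Later cells refer to the training data as df_train
df_train = df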

In [None]:
df.count()

161297

In [None]:
df.show(5)

+--------+--------------------+--------------------+--------------------+------+-----------------+-----------+
|UniqueID|            drugName|           condition|              review|rating|             date|usefulCount|
+--------+--------------------+--------------------+--------------------+------+-----------------+-----------+
|  206461|           Valsartan|Left Ventricular ...|"It has no side e...|   9.0|     May 20, 2012|         27|
|   95260|          Guanfacine|                ADHD|"My son is halfwa...|   8.0|   April 27, 2010|        192|
|   92703|              Lybrel|       Birth Control|"I used to take a...|   5.0|December 14, 2009|         17|
|  138000|          Ortho Evra|       Birth Control|"This is my first...|   8.0| November 3, 2015|         10|
|   35696|Buprenorphine / n...|   Opiate Dependence|"Suboxone has com...|   9.0|November 27, 2016|         37|
+--------+--------------------+--------------------+--------------------+------+-----------------+-----------+
only showing top 5 rows

In [None]:
df_train.select('sentiment').groupBy('sentiment').count().show()

In [None]:
#pd_df = df.toPandas()

## TODO: Train a model to predict the star rating from 'condition', 'usefulCount', and 'sentiment', with 'rating' as the target

In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

In [None]:
df_train = df_train.drop('date', 'document', 'token', 'class')

In [None]:
df_train.show()

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.7/dist-packages/py4j/clientserver.py", line 475, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: ignored

In [None]:
target = 'rating'
numeric_cols = ['usefulCount','sentiment']
categorical_cols = ['condition']

In [None]:
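# Use StringIndexer to convert categorical values to a numeric index.
# handleInvalid="keep" puts null or previously unseen conditions in an extra bucket instead of raising an error.
stringIndex = StringIndexer(inputCols=categorical_cols, outputCols=[x + "_idx" for x in categorical_cols], handleInvalid="keep")
stringIndex_model = stringIndex.fit(df_train)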

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.7/dist-packages/py4j/clientserver.py", line 475, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: ignored

In [None]:
train_df = stringIndex_model.transform(df_train).drop(*categorical_cols)

In [None]:
train_df.show(5)

In [None]:
# Assemble the inputs into the format needed for the model
assemblerInputs = [x + "_idx" for x in categorical_cols] + numeric_cols
vectorAssembler = VectorAssembler(inputCols= assemblerInputs, outputCol="features")
train_df = vectorAssembler.transform(train_df).select('features', target)
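
For intuition, the indexing and assembling steps can be mimicked in plain Python. This is a simplified sketch, not Spark's implementation: it assumes the default `StringIndexer` ordering (most frequent label gets index 0, ties broken alphabetically), and the helper names and sample rows are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def assemble(row, index, categorical_col, numeric_cols):
    """Build a flat feature list: indexed category first, then numeric columns."""
    return [index[row[categorical_col]]] + [float(row[c]) for c in numeric_cols]


# Hypothetical rows that reuse the notebook's column names.
rows = [
    {"condition": "Birth Control", "usefulCount": 17, "sentiment": 0.0, "rating": 5.0},
    {"condition": "ADHD", "usefulCount": 192, "sentiment": 1.0, "rating": 8.0},
    {"condition": "Birth Control", "usefulCount": 10, "sentiment": 1.0, "rating": 8.0},
]

index = fit_string_index(r["condition"] for r in rows)
features = [assemble(r, index, "condition", ["usefulCount", "sentiment"]) for r in rows]
print(index)
print(features[1])
```

The index is only a label for the category; the numeric value itself carries no ordering, which is why tree models treat it as a categorical feature.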

In [None]:
train_df.show(5)

In [None]:
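# maxBins must be at least the number of distinct 'condition' values; the default of 32 is too small for a high-cardinality categorical feature
rf = RandomForestClassifier(labelCol=target, numTrees=100, maxDepth=3, maxBins=max(32, df_train.select('condition').distinct().count() + 1))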

In [None]:
pipeline_rf = Pipeline(stages=[stringIndex, vectorAssembler, rf])

In [None]:
# Fit Random Forest Model with pipeline
rf_pipelineModel = pipeline_rf.fit(df_train)

In [None]:
train_preds = rf_pipelineModel.transform(df_train)

In [None]:
# Get training accuracy
evaluator = MulticlassClassificationEvaluator(labelCol=target, metricName='accuracy')
evaluator.evaluate(train_preds)
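
The `accuracy` metric is the fraction of rows whose predicted label exactly equals the true label. The plain-Python sketch below shows that calculation on made-up ratings; the `accuracy` helper is hypothetical and is not the evaluator's own code.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that exactly match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical star ratings and model outputs.
true_ratings = [10.0, 8.0, 9.0, 1.0]
predicted = [10.0, 10.0, 9.0, 10.0]
print(accuracy(true_ratings, predicted))
```

Because only exact matches count, a prediction of 9 for a true rating of 10 is scored the same as a prediction of 1.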

In [None]:
# Drop unimportant columns for model 
df_test = df_test.drop('date', 'document', 'token', 'class')
# Drop rows with missing values
df_test = df_test.dropna()

In [None]:
## Drop rows where condition contains irrelevant strings
df_test = df_test.where(~df_test.condition.contains("</span>"))
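
The two cleaning steps above (dropping rows with missing values, then dropping rows whose condition holds leftover HTML) can be pictured with a small plain-Python filter. The `clean_rows` helper and the sample rows are hypothetical; the polluted condition string is only an example of what such residue might look like.

```python
def clean_rows(rows):
    """Drop rows with missing fields or a 'condition' polluted by HTML residue."""
    cleaned = []
    for row in rows:
        # Check for missing values first so the substring test never sees None.
        if any(value is None for value in row.values()):
            continue
        if "</span>" in row["condition"]:
            continue
        cleaned.append(row)
    return cleaned


# Hypothetical rows: one valid, one with HTML residue, one with a missing value.
rows = [
    {"drugName": "Mirtazapine", "condition": "Depression", "rating": 10.0},
    {"drugName": "ExampleDrug", "condition": "3</span> users found this comment helpful.", "rating": 4.0},
    {"drugName": "OtherDrug", "condition": None, "rating": 7.0},
]

kept = clean_rows(rows)
print(len(kept))
```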

In [None]:
df_test.count()

In [None]:
df_test.show(5)

In [None]:
test_preds = rf_pipelineModel.transform(df_test)

In [None]:
# Test Accuracy for the Model
evaluator.evaluate(test_preds)

## TODO: Obtain the average rating for each drug and demo updating the drug rating when a batch of new reviews comes in
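
One possible way to approach this, sketched in plain Python rather than Spark, is to keep a running sum and count per drug so that each new batch updates the averages without rescanning earlier reviews. The `DrugRatingTracker` class is hypothetical, and the batches below are made-up ratings attached to drug names from the dataset.

```python
from collections import defaultdict


class DrugRatingTracker:
    """Keeps a running (sum, count) per drug so averages update incrementally."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, batch):
        """Fold a batch of (drug_name, rating) pairs into the running totals."""
        for drug_name, rating in batch:
            self._sum[drug_name] += rating
            self._count[drug_name] += 1

    def average(self, drug_name):
        """Return the current mean rating, or None if the drug has no reviews."""
        count = self._count.get(drug_name, 0)
        return self._sum[drug_name] / count if count else None


tracker = DrugRatingTracker()
# Initial reviews, then a later batch (all ratings are illustrative).
tracker.update([("Valsartan", 9.0), ("Lybrel", 5.0)])
tracker.update([("Valsartan", 7.0), ("Lybrel", 8.0), ("Lybrel", 2.0)])

print(tracker.average("Valsartan"))
print(tracker.average("Lybrel"))
```

Storing the sum and count instead of the average itself keeps the update exact regardless of batch size or arrival order.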

In [None]:
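# Get the average rating and review count for each drug in the training set
import pyspark.sql.functions as F
df_train.groupBy("drugName").agg(F.avg("rating").alias("avg_rating"), F.count("*").alias("num_reviews")).orderBy("num_reviews", ascending=False).show()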

In [None]:
## TODO: Write Logic to Update Drug Rating when given a new review without a rating
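
# A minimal sketch of this step, assuming the trained model is available as a
# callable that maps a review record to a predicted rating. The names
# `update_average`, `handle_unrated_review`, and `toy_predictor` are
# hypothetical; `toy_predictor` is a stand-in and not the trained pipeline.

```python
def update_average(current_avg, current_count, new_rating):
    """Return the new (average, count) after adding one rating."""
    new_count = current_count + 1
    new_avg = current_avg + (new_rating - current_avg) / new_count
    return new_avg, new_count


def handle_unrated_review(review, predict_rating, averages):
    """Predict a rating for an unrated review and update that drug's average.

    `predict_rating` is a placeholder for the trained model;
    `averages` maps drug name -> (average, count).
    """
    predicted = predict_rating(review)
    avg, count = averages.get(review["drugName"], (0.0, 0))
    averages[review["drugName"]] = update_average(avg, count, predicted)
    return predicted


def toy_predictor(review):
    """Stand-in model (hypothetical): positive sentiment -> 9 stars, otherwise 3."""
    return 9.0 if review["sentiment"] > 0 else 3.0


# Illustrative state: two earlier reviews averaging 6.0 stars.
averages = {"Contrave": (6.0, 2)}
new_review = {"drugName": "Contrave", "condition": "Weight Loss",
              "usefulCount": 0, "sentiment": 1.0}

predicted = handle_unrated_review(new_review, toy_predictor, averages)
print(predicted, averages["Contrave"])
```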