![title](yelp2.jpg)
# Smash the Snorkel Team Project: 
# A Weakly Supervised Learning Study for Yelp Restaurant Review Classification 
# Part 1

## 1. Introduction
## 1.1 Business Problem Identification

Competition between restaurants is fiercer than ever. Restaurant owners are under constant pressure 
to improve their business or set themselves apart from the rest. But how? Although some restaurants still rely 
on asking their customers directly, more and more restaurants leverage reviews on apps like Yelp 
to understand what their customers need. Reviews containing complaints or suggestions are invaluable here, as they 
explicitly point out possible ways to improve customer satisfaction.

For restaurants receiving numerous reviews, however, it is hard for the business owners to read every review and figure 
out whether it contains useful suggestions or complaints. Therefore, an automatic process is needed to filter out irrelevant 
information quickly.

## 1.2 Project Objective
In this project we seek to automate the classification of reviews, using the **Yelp** review dataset, to determine whether a review mentions complaints or suggestions. One extra challenge is that we don't have any data point with a "golden" label indicating this piece of information. 
Consequently, we aim to solve the classification problem without any existing labels.

**Objective**: Find a solution for a classification problem when there are no pre-existing labels.

## 1.3 Key Results
This project has four major results:
- a) Build a database hosting the data source to mimic industry practice. (Part 1)
- b) Create labeling functions using Snorkel DryBell to provide weakly supervised learning at industrial scale.
- c) Augment the dataset to create more training data points.
- d) Train a classifier that labels an unseen review automatically. 

## 1.4 Project Structure
Below you can see a diagram that represents the workflow of the project. It shows the flow of the original files from start to end, and the actions taken in each step of the process. <br><br>
We acquired flat files in JSON format from the official Yelp website. We then uploaded one of the files to the MongoDB database that we created on the Cosmos DB platform of Microsoft Azure. Due to the size of the files and the resource restrictions, we were not able to upload all of them. The flat files were then parsed into csv format so that we could manipulate them as Spark dataframes, and we used the reviews dataframe as the basis of our analysis. This was split into 3 sets: <br>
- The <b>development set</b> (500 rows) that was manually labeled and used to perform data understanding, analysis, aggregations and feature engineering. This process also involved fetching the file from the MongoDB database to extract information. With this set we then created and evaluated the labeling functions. The whole process was iterative, as with each iteration we improved the construction of the labeling functions.<br>
-  The <b>train set</b>, containing a vast amount of data points, was sampled so that the previously created labeling functions could be applied efficiently. This resulted in a weakly labeled train set that was then augmented using data augmentation techniques such as transformation functions. The output was the train set on which we fit our classifiers in order to come to a final prediction.<br>
- The <b>test set</b> (500 rows) that was manually labeled and used to validate and tune the classifiers.
![Diagram.PNG](Diagram.PNG)


## 2. Initialize our environment
Before we move to the main part of our project (i.e. create labels for our existing datapoints), we are going to:
- Give an overview of the packages that we are going to use
- Load our dataset and have a brief look at it
- Host a file of that dataset in a cloud-based database


### 2.1 Load the required libraries


In [1]:
# Display PySpark DataFrames without wrapping long rows
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))
# Show the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
#Pyspark
import pyspark
from pyspark import SparkContext,SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains
from pyspark.ml import Pipeline, PipelineModel

from pyspark.sql import SQLContext

### 2.2  Local Json Files 
For this project we work extensively with the review.json file, which contains the reviews that users have written for different businesses.

In addition, we host the business.json file in a cloud-based database (section 2.3). Because the database resources are limited, we import only the results of our database query, and we save a .csv file with the business_ids that refer to restaurants.

The JSON files are in a non-relational data storage format, so before we can analyse our data we need to parse it into a more understandable format. 
Besides, some of the JSON files are quite large, so loading them with pandas would be slow and could easily make the local machine 
run out of memory. Accordingly, we use PySpark to speed up the computation.
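
To illustrate why line-by-line parsing keeps memory use low, here is a minimal sketch in plain Python. It assumes that each line of the file holds one JSON object, which is the layout `read.json` in Spark expects by default. The two sample records and their field names are made up for the example and only stand in for lines of review.json.

```python
import io
import json

def iter_json_lines(lines):
    """Yield one parsed record per non-empty line of a JSON-lines source."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two made-up records standing in for lines of review.json (field names assumed).
sample = io.StringIO(
    '{"review_id": "r1", "business_id": "b1", "stars": 4.0, "text": "Great tacos"}\n'
    '{"review_id": "r2", "business_id": "b2", "stars": 1.0, "text": "Slow service"}\n'
)

# Only one record is parsed and held at a time while we scan the source.
low_rated = [r["review_id"] for r in iter_json_lines(sample) if r["stars"] <= 2]
print(low_rated)
```

With a real file, `sample` would be replaced by an open file handle. Spark applies the same idea, but spreads the lines over several workers.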


### 2.3 Host business.json file in cloud-based NoSQL DataBase

In this step we demonstrate how a file from our dataset can be hosted in a cloud-based DataBase. We do this, as we would like to show how pyspark (package for distributed computing with Apache Spark) can operate with data extracted from a DataBase. 


### 2.3.1 Selecting provider and type of DataBase
For the selection of our cloud provider and of the type of our database we considered the following aspects (in no strict order):
- The type of our files: all of them are .json files, a semi-structured data format, so we looked at the commercial databases that work efficiently with .json files
- The cloud providers that offer free credits for students
- The documentation of each provider for every database solution that it offers
- Easy scalability, as we did not know in advance what our real needs would be (especially in terms of throughput) for the successful completion of our project

Considering all these aspects, we decided to proceed with Cosmos DB from Microsoft Azure.

According to Azure's documentation:
>Azure Cosmos DB is Microsoft's globally distributed, multi-model database service. With a click of a button, Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure regions worldwide. You can elastically scale throughput and storage, and take advantage of fast, single-digit-millisecond data access using your favorite API including: SQL, MongoDB, Cassandra, Tables, or Gremlin. 

For our project, we used Cosmos DB because:
- It has extensive documentation for users without any prior knowledge of setting up a database.
- It is flexible not only in terms of resources, but also in the data formats that it can store. 
- It provides a MongoDB API (MongoDB being one of the most common NoSQL databases), so we can operate it the same way we would use a MongoDB database. This last point also means access to more documentation in the wild.


### 2.3.2 Deploying the Database in Azure
Below we demonstrate how we created our database and how we imported the business.json file from our dataset collection.

#### 2.3.2.1 Create Azure Cosmos DB
In the Azure portal we request the creation of a new Azure Cosmos DB. In the settings before deployment we select MongoDB as our API.


![azure1.jpeg](./azure_screenshots/azure1.jpeg)

#### 2.3.2.2 Access the Database from GUI (optional step)
Below we demonstrate how we can access the database we created with Studio 3T.
>Studio 3T is a GUI and IDE for developers and data engineers who work with MongoDB. Data management features such as in-place editing and easy database connections are matched with polyglot query code generation, advanced shell with auto-completion, SQL import/export and enterprise level authentication with LDAP and Kerberos.

Studio 3T gives us access to the database from our local machine. With Studio 3T we can remotely manage all aspects of the database. For example, we can create new collections in the database (a collection stores a set of different files), query the database and see the traffic in our database. In addition, it allows us to upload files to the database.

To achieve this, we first copy the primary connection string from the properties of the database in Azure:

![azure10.jpeg](./azure_screenshots/azure10.jpeg)

Then we create a new connection in the Studio 3T and we add as uri, the primary connection string from Azure:

![3T studio.png](./azure_screenshots/3TStudio.png)

Now we can fully manage our database from our local machine.

#### 2.3.2.3 Importing the .json file to our database
Normally, we could import our .json file with Studio 3T. When we tried to do this, however, we received error 16500 (TooManyRequests) from Azure. Our first approach was to increase the throughput to a high level (50000 RU/s), but this did not resolve our problem.

After some research we realized that this error occurs when we attempt a bulk import (importing a big file into the database). To tackle this problem, we had to upload our file to an Azure storage account, and then transfer it from there to our database.

Below we show how we have created a storage account in Azure where we uploaded our .json file:

![azure2.jpeg](./azure_screenshots/azure2.jpeg)

Finally, we have used the Azure Data Factory to create a temporary copy pipeline, which copied the file to the database:

![azure3.jpeg](./azure_screenshots/azure3.jpeg)

Azure Data Factory offers a visual interface to do this:

![azure4.jpeg](./azure_screenshots/azure4.jpeg)

For our problem, we chose the "copy data" option.
We then selected the file from our storage account:

![azure5.jpeg](./azure_screenshots/azure5.jpeg)

And we used as destination our Cosmos DB:

![azure6.jpeg](./azure_screenshots/azure6.jpeg)

Below you can see the execution of the pipeline:

![azure8.jpeg](./azure_screenshots/azure8.jpeg)

After the successful completion of the pipeline, we had finally managed to perform a bulk import into our database.


#### 2.3.2.4 Accessing our database through PySpark
To execute queries without errors, we first had to increase the throughput of our Cosmos database.
Below we increased the throughput from 400 RU/s (request units per second) to 10000 RU/s. Note that a lower throughput leads to the Too Many Requests error.

![azure9.jpeg](./azure_screenshots/azure9.jpeg)
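
Besides raising the provisioned throughput, a client can react to the Too Many Requests error by waiting and retrying with a growing delay. The sketch below illustrates that general pattern and is not part of the project code. The exception class `TooManyRequests`, the helper `call_with_backoff` and the fake operation are all hypothetical names introduced for this example.

```python
import time

class TooManyRequests(Exception):
    """Hypothetical stand-in for the error a driver raises on code 16500."""

def call_with_backoff(operation, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying with exponentially growing pauses when throttled."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TooManyRequests:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)

# Usage with a fake operation that is throttled twice before it succeeds.
calls = {"n": 0}

def flaky_insert():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TooManyRequests()
    return "inserted"

delays = []  # record the pauses instead of actually sleeping
result = call_with_backoff(flaky_insert, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run. In real use the default `time.sleep` would pause between attempts.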

The database with our business.json file is now ready to use. We now move on to chapter 3 (Dataset Preprocessing).

## 3. Dataset Preprocessing
In this part we are going to connect to our cloud-based DataBase to extract information about the different businesses.
Based on this information we will make different subsets of our data collection.

### 3.1 Connecting to the Cosmos DB (MongoDB API) with PySpark
Below we demonstrate how we can connect to our database with PySpark. In the .config("spark.mongodb.input.uri") and .config("spark.mongodb.output.uri") settings of the Spark session we used the primary connection string that Azure generated for the database. 

In [3]:
import os

# The primary connection string from Azure contains the account key, so it is read
# from an environment variable instead of being written into the notebook.
# COSMOS_MONGO_URI is a variable name chosen here for illustration.
mongo_uri = os.environ["COSMOS_MONGO_URI"]

conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1")
sc = SparkContext(conf=conf)

spark = SparkSession.builder.appName("myApp") \
    .config("spark.mongodb.input.uri", mongo_uri) \
    .config("spark.mongodb.input.readPreference.name", "secondaryPreferred") \
    .config("spark.mongodb.output.uri", mongo_uri) \
    .getOrCreate()

sqlContext = SQLContext(sc)

# Load the "business" collection of the "yelp" database as a Spark DataFrame
df01 = spark.read \
    .format("mongo") \
    .option("database", "yelp") \
    .option("collection", "business") \
    .load()
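
The connection string follows the usual MongoDB URI layout: user name, key, host, port and options. As a minimal sketch, it could be assembled from its parts as shown below. The helper `build_mongo_uri` and the account name and key are invented for illustration, and only two of the options are included. The credentials are percent-encoded because characters such as `/` have a special meaning in a URI.

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, key, host, port=10255):
    """Assemble a MongoDB connection URI, percent-encoding the credentials."""
    return (
        "mongodb://{}:{}@{}:{}/?ssl=true&replicaSet=globaldb"
        .format(quote_plus(user), quote_plus(key), host, port)
    )

# Made-up account name and key, for illustration only.
uri = build_mongo_uri("myaccount", "abc/def==", "myaccount.mongo.cosmos.azure.com")
print(uri)
```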

In the next step we perform a query in the database, where we request to get all business_ids, their average stars and their total reviews for the restaurant businesses:

In [4]:
from pyspark.sql.functions import col 

# Register the DataFrame as a temporary view so that it can be queried with SQL
df01.createOrReplaceTempView("collection")
biz_res = sqlContext.sql("SELECT business_id, stars, review_count FROM collection WHERE categories LIKE '%Restaurants%'")
biz_res.createOrReplaceTempView('biz_res_table')
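
To show what this query keeps and what it drops, here is a small sketch that runs the same SQL on a toy table. It uses Python's built-in `sqlite3` module in place of Spark SQL, and the four rows are invented for the example. Note that a business without categories (a NULL value) never matches the `LIKE` pattern, so it is dropped as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collection "
    "(business_id TEXT, stars REAL, review_count INTEGER, categories TEXT)"
)
# Made-up rows standing in for documents of the business collection.
conn.executemany(
    "INSERT INTO collection VALUES (?, ?, ?, ?)",
    [
        ("b1", 4.5, 120, "Mexican, Restaurants"),
        ("b2", 3.0, 15, "Hair Salons, Beauty & Spas"),
        ("b3", 4.0, 8, None),
        ("b4", 2.5, 40, "Restaurants, Pizza"),
    ],
)

rows = conn.execute(
    "SELECT business_id, stars, review_count FROM collection "
    "WHERE categories LIKE '%Restaurants%' ORDER BY business_id"
).fetchall()
print(rows)
```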

In [5]:
# For future reference in part 2
biz_res.toPandas().to_csv('filtered_data/biz_res.csv', header=True)

### 3.1 Data Filtering
As our classification task focuses on restaurant reviews, we first filter the 'business' table to 
keep only the restaurants. Based on this table, we then filter out the records in the other tables that are irrelevant to restaurants.
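
A filter of this kind keeps a row of a child table only if its key also appears in the parent table. The sketch below, which again uses `sqlite3` as a stand-in for Spark SQL with made-up rows, compares two ways of writing it. A `LEFT JOIN ... WHERE ... IS NOT NULL` returns one row per match, so a user with two reviews appears twice, whereas an `EXISTS` subquery returns each user once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_table (user_id TEXT, name TEXT);
    CREATE TABLE review_res_table (review_id TEXT, user_id TEXT);
    INSERT INTO user_table VALUES ('u1', 'Ann'), ('u2', 'Bob'), ('u3', 'Cy');
    INSERT INTO review_res_table VALUES ('r1', 'u1'), ('r2', 'u1'), ('r3', 'u2');
""")

# Join-based filter: one output row per matching review.
left_join = conn.execute(
    "SELECT c.user_id FROM user_table c LEFT JOIN review_res_table b "
    "ON c.user_id = b.user_id WHERE b.user_id IS NOT NULL ORDER BY c.user_id"
).fetchall()

# EXISTS-based filter: each matching user is returned exactly once.
exists = conn.execute(
    "SELECT c.user_id FROM user_table c WHERE EXISTS "
    "(SELECT 1 FROM review_res_table b WHERE b.user_id = c.user_id) "
    "ORDER BY c.user_id"
).fetchall()

print(left_join)
print(exists)
```

When the parent table has a unique key, as is the case for business_id in the business table, both forms return the same rows.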


In [6]:
# The commented-out lines below show how the filtering would start from the local
# business.json file. In this notebook biz_res comes from MongoDB instead.
# biz_df = sqlContext.read.json("original_data/business.json")
# biz_df.createOrReplaceTempView('biz_table')
# # only keep restaurants
# biz_res = sqlContext.sql("SELECT * FROM biz_table WHERE categories LIKE '%Restaurants%'")
# biz_res.createOrReplaceTempView('biz_res_table')

# Keep only the records of another table whose key also appears in the parent table
def filter_res(parent, filename, key):
    df = sqlContext.read.json("original_data/" + filename)
    table_name = filename[:-5] + '_table'  # strip the ".json" suffix
    df.createOrReplaceTempView(table_name)
    # Note: if the parent table holds several rows per key (e.g. several reviews
    # per user), this join returns one row per match, so rows are duplicated.
    sqlquery = 'SELECT c.* FROM {} c LEFT JOIN {} b ON c.{} = b.{} WHERE b.{} IS NOT NULL'.format(table_name, parent, key, key, key)
    print(sqlquery)
    df_res = sqlContext.sql(sqlquery)
    return df_res

# Keep only the relevant records
checkin_res = filter_res('biz_res_table', 'checkin.json', 'business_id')
review_res = filter_res('biz_res_table', 'review.json', 'business_id')
review_res.createOrReplaceTempView('review_res_table')
#tip_res = filter_res('review_res_table','tip.json','user_id') #since tip is on individual level, match them with users
user_res = filter_res('review_res_table', 'user.json', 'user_id')

# For future reference (to_csv returns None, so its result is not assigned)
checkin_res.toPandas().to_csv('filtered_data/checkin_res.csv', header=True)
#review_res.toPandas().to_csv('filtered_data/review_res.csv', header=True) # big file to execute
user_res.toPandas().to_csv('filtered_data/user_res.csv', header=True)

SELECT c.* FROM checkin_table c LEFT JOIN biz_res_table b ON c.business_id = b.business_id WHERE b.business_id IS NOT NULL
SELECT c.* FROM review_table c LEFT JOIN biz_res_table b ON c.business_id = b.business_id WHERE b.business_id IS NOT NULL
SELECT c.* FROM user_table c LEFT JOIN review_res_table b ON c.user_id = b.user_id WHERE b.user_id IS NOT NULL


Py4JJavaError: An error occurred while calling o89.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 12.0 failed 1 times, most recent failure: Lost task 13.0 in stage 12.0 (TID 543, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
	...
Driver stacktrace:
	...
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3260)
	...
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3236)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:456)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
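
The error above is raised inside `collectToPython`, that is, while `toPandas()` collects the whole result and runs out of Java heap space. One way to avoid holding everything in memory at once is to write the rows out one at a time. The sketch below shows the idea with the standard `csv` module. A generator stands in for a large result set, such as the row iterator that PySpark's `DataFrame.toLocalIterator()` returns, and the helper name and column names are made up. We have not run this approach on the project data.

```python
import csv
import io

def write_rows_to_csv(rows, header, out):
    """Write an iterable of row tuples to `out` one row at a time."""
    writer = csv.writer(out)
    writer.writerow(header)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count

# A generator stands in for a large result set; rows are produced lazily.
rows = (("u{}".format(i), i % 5) for i in range(1000))
buffer = io.StringIO()  # a real run would use an open file instead
n = write_rows_to_csv(rows, ["user_id", "review_count"], buffer)
print(n)
```

Within Spark itself, `DataFrame.write.csv()` serves the same purpose, because it writes the output without collecting it in the notebook process.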


### 3.2 Data Splitting
To prepare for the development of the labeling functions and the training of the classifier, we split the filtered review data
into three parts, namely a training set, a testing set and a development set. We first randomly sample 1000 reviews out of all
restaurant reviews, and put the first 500 into the development set and the rest into the testing set. As we don't have labels at all,
we manually label the reviews in the development and testing sets, with 1 standing for suggestions/complaints and 0 for the rest.

In addition, given the computing resources we have, it is not feasible for us to apply the labeling functions to all of the remaining
6 million reviews. As a result, we sample 2500 reviews out of the "complete" training set, focusing on demonstrating how the whole process may work 
when more computing resources are available.
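
The splitting scheme described above can be sketched in plain Python as follows. The function `split_reviews` and the use of integer ids are assumptions made for this illustration. The notebook itself performs the split with Spark SQL, so the sampled rows differ.

```python
import random

def split_reviews(review_ids, n_labeled=1000, n_train=2500, seed=42):
    """Split ids into dev/test halves of a labeled sample plus a train sample."""
    rng = random.Random(seed)
    shuffled = list(review_ids)
    rng.shuffle(shuffled)
    # The first n_labeled shuffled ids are labeled by hand, half dev and half test.
    labeled, rest = shuffled[:n_labeled], shuffled[n_labeled:]
    dev = labeled[: n_labeled // 2]
    test = labeled[n_labeled // 2:]
    # The train sample is drawn only from ids outside the labeled sample.
    train = rng.sample(rest, n_train)
    return dev, test, train

dev, test, train = split_reviews(range(10000))
print(len(dev), len(test), len(train))
```

Because the train sample is drawn from the remainder only, the three sets cannot overlap.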

In [None]:
# Sample the reviews into a training set, a development set and a test set
import pyspark.sql.functions as f

review_res = review_res.withColumn('index_1', f.monotonically_increasing_id())
review_res.createOrReplaceTempView('review_res_table')
# Draw 1000 random reviews that will be labeled manually
sqlquery = 'SELECT * FROM review_res_table ORDER BY RAND(42) LIMIT {}'.format(1000)
review_sample = sqlContext.sql(sqlquery)
review_sample.createOrReplaceTempView('review_sample_table')

# Every review that was not sampled belongs to the "complete" training set
sqlquery = 'SELECT a.* FROM review_res_table a LEFT JOIN review_sample_table b ON a.index_1 = b.index_1 WHERE b.index_1 IS NULL '
review_train_all = sqlContext.sql(sqlquery).cache()

# Put 500 of the sampled reviews into dev_set and the other 500 into test_set.
# Both halves are sorted on index_2 explicitly, because limit() without a sort
# does not guarantee which rows are returned.
review_sample = review_sample.withColumn('index_2', f.monotonically_increasing_id()).cache()
review_dev = review_sample.sort(f.asc("index_2")).limit(500).cache()
review_test = review_sample.sort(f.desc("index_2")).limit(500).cache()

In [None]:
# Sample from the complete training set, since it is too large for our single machine to apply the LFs to
review_train_all.createOrReplaceTempView('review_train_all_table')
sqlquery = 'SELECT * FROM review_train_all_table ORDER BY RAND(42) LIMIT {}'.format(2500)
review_train = sqlContext.sql(sqlquery).cache()

In [None]:
# For future reference
review_dev.toPandas().to_csv('filtered_data/review_dev.csv', header=True)
review_test.toPandas().to_csv('filtered_data/review_test.csv', header=True)
review_train.toPandas().to_csv('filtered_data/review_train.csv', header=True)

At this stage we have created three data frames. The development split (review_dev) will be used during the development phase of the labelling functions (LFs). The test split (review_test) is our validation split. It will be used to evaluate the accuracy of our labelling functions and will help us choose between the methods of assigning label probabilities. We will then apply the chosen LFs to the train set (review_train), which will in turn be used to train our classifier. 
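
To make the role of the manually labeled splits concrete, here is a minimal sketch of how a single labelling function could be scored against gold labels. The keyword list, the example reviews, their labels and the helper names are all invented for illustration. The labelling functions actually used are developed in part 2.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_keyword_complaint(text):
    """Vote POSITIVE when a (made-up) complaint/suggestion keyword occurs."""
    keywords = ("should", "disappointed", "too slow")
    return POSITIVE if any(k in text.lower() for k in keywords) else ABSTAIN

def coverage_and_accuracy(lf, texts, gold):
    """Share of texts the LF votes on, and its accuracy on those votes."""
    votes = [lf(t) for t in texts]
    voted = [(v, g) for v, g in zip(votes, gold) if v != ABSTAIN]
    coverage = len(voted) / len(texts)
    accuracy = sum(v == g for v, g in voted) / len(voted) if voted else 0.0
    return coverage, accuracy

# Invented reviews with hand-assigned labels (1 = suggestion/complaint).
texts = [
    "They should add more vegan options.",
    "Loved the pasta, will come back!",
    "I was disappointed by the cold fries.",
    "You should try the brunch, it is amazing.",
]
gold = [1, 0, 1, 0]

coverage, accuracy = coverage_and_accuracy(lf_keyword_complaint, texts, gold)
print(coverage, accuracy)
```

The last review shows why the evaluation matters: the word "should" triggers the function even though the review is a recommendation rather than a complaint.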

### 3.3 Manual Labeling
Our goal is to train a classifier over the Yelp data that can predict whether a review text is positive (containing suggestions or complaints) or negative. We have access to a large amount of unlabelled data in the form of Yelp reviews with some metadata. After filtering the dataset, we split the filtered review data into a training set, a testing set and a development set. Because the dataset originally doesn't contain any labels, our team members manually labelled the reviews in the development and testing sets (500 labels each) with "1" for positive and "0" for negative. The results of our label model can therefore be tested in a reasonable way later.

### 3.4 A connection with part 2
At this stage we have finished the pre-processing of our subsets. These subsets are going to be used in part 2 of our analysis.
We decided to split our work into two notebooks for three reasons:

1. To keep all the analysis regarding the creation and the use of the database with PySpark separate from the process of label creation.
2. To be able to run the rest of our analysis regardless of the available cloud resources; the database has been created for demonstration purposes and its use is limited by the available credits. We have kept the analysis of label creation in part 2, as we might want to reproduce our results in the future.
3. To deal with a problem that did not allow us to use different .jar files in faculty. Using org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 to query the database prevented us from using the sparknlp package (with its underlying .jar files) in the same notebook, and vice versa: we couldn't use the .jar files that connect to the database when we used the sparknlp package. This is an issue that could be investigated further.
