<a id="load"></a>
## 2. Load and explore data

In [3]:
!pip install wget  

Collecting wget
Installing collected packages: wget
Successfully installed wget-3.2


In [4]:
import wget
from pprint import PrettyPrinter

pp = PrettyPrinter(indent=2, depth=3).pprint

In [5]:
link_to_data = 'https://github.com/pmservice/wml-sample-models/raw/master/spark/sentiment-prediction/data/trainingTweets.csv'
filename = wget.download(link_to_data)

print(filename)

trainingTweets.csv


The CSV file trainingTweets.csv is now available on the notebook's file system (GPFS). Load the file into an Apache® Spark DataFrame by using the code below.

In [6]:
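from pyspark.sql import SparkSession

# Reuse the running Spark session, or create one if none exists yet
spark = SparkSession.builder.getOrCreate()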

df_data = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load(filename)

In [44]:
spark

Explore the loaded data by using Apache® Spark DataFrame methods:
-  print schema
-  print first ten records
-  count all records

In [7]:
df_data.printSchema()

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)



As you can see, the data contains three fields. The ``label`` field is the target: the sentiment class that the model will learn to predict for each tweet.

In [8]:
df_data.show(n=10)

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  1|realdonaldtrump s...|    1|
|  2|cnnvideo hillaryc...|    1|
|  3|realdonaldtrump t...|    1|
|  4|sensanders the am...|    1|
|  5|billburton realdo...|    0|
|  6|reince hillarycli...|    0|
|  7|bentechpro realdo...|    1|
|  8|dahbigj hopeobama...|    0|
|  9|theosmelek thuddy...|    0|
| 10|realdonaldtrump r...|    0|
+---+--------------------+-----+
only showing top 10 rows



In [9]:
print("Total number of records: {count}".format(count=df_data.count()))

Total number of records: 5987


The data set contains 5987 records.

<a id="model"></a>
## 3. Create an Apache® Spark machine learning model

### 3.1: Prepare data

In this subsection you will split your data into three sets:
-  The train data set, which is the largest group, is used for training.
-  The test data set is used for model evaluation, to test the assumptions of the model.
-  The predict data set is used for prediction.

In [10]:
splitted_data = df_data.randomSplit([0.8, 0.18, 0.02], 24)
train_data = splitted_data[0]
test_data = splitted_data[1]
predict_data = splitted_data[2]

print("Number of training records: {count}".format(count=train_data.count()))
print("Number of testing records: {count}".format(count=test_data.count()))
print("Number of prediction records: {count}".format(count=predict_data.count()))

Number of training records: 4783
Number of testing records: 1076
Number of prediction records: 128
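
The counts above do not match the 80/18/2 weights exactly. `randomSplit` assigns each row to a split at random, so the split sizes are only approximately proportional to the weights. The plain-Python sketch below (no Spark required) compares the counts printed above with the sizes the weights would give in expectation. The variable names are illustrative only.

```python
# Counts copied from the notebook output above.
total_records = 5987
weights = {"train": 0.8, "test": 0.18, "predict": 0.02}
observed = {"train": 4783, "test": 1076, "predict": 128}

# The three splits partition the data, so the counts must add up to the total.
assert sum(observed.values()) == total_records

for name, weight in weights.items():
    expected = weight * total_records          # size in expectation
    share = observed[name] / total_records     # share actually obtained
    print("{:8s} expected ~{:7.1f}  observed {:5d}  ({:.1%} of all rows)".format(
        name, expected, observed[name], share))
```

Passing a fixed seed (24 in the cell above) is what makes the split repeatable between runs on the same data.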


### 3.2: Create pipeline and train a model

In this section you will create an Apache® Spark machine learning pipeline and then train the model.

Import the Apache® Spark machine learning packages that will be needed in the subsequent steps.

In [11]:
from pyspark.ml.feature import Tokenizer, OneHotEncoder, StringIndexer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml import Pipeline, Model

In the data preprocessing step, convert the string field ``text`` into a numeric feature vector by using the **Tokenizer** and then the **HashingTF** transformer.

In [12]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
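
To illustrate what these two stages do, here is a minimal plain-Python sketch of tokenization followed by the hashing trick. It is an approximation under stated assumptions, not Spark's implementation: Spark's `HashingTF` uses MurmurHash3, while this sketch substitutes `zlib.crc32`, so the bucket indices will differ from Spark's. The vector size 262144 (2^18) is the `HashingTF` default and is the size that appears in the `features` column later in this notebook. The function names are hypothetical.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # HashingTF default (2**18)


def tokenize(text):
    """Lowercase the text and split it on whitespace, roughly like Tokenizer."""
    return text.lower().split()


def hashing_tf(words, num_features=NUM_FEATURES):
    """Map each word to a bucket index and count occurrences per bucket.

    zlib.crc32 is a stand-in hash; Spark uses MurmurHash3 instead.
    Returns a sparse vector as {bucket_index: term_count}.
    """
    counts = Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)
    return dict(sorted(counts.items()))


words = tokenize("Keep the promise keep the faith")
sparse_vector = hashing_tf(words)
print(words)
print(sparse_vector)
```

Because words are mapped to indices by hashing, no vocabulary has to be built or stored, at the cost of occasional collisions between different words.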

Next, define the estimator you want to use for classification. Logistic Regression is used in the following example.

In [17]:
lr = LogisticRegression(maxIter=10, regParam=0.01)

Let's build the pipeline now. A pipeline consists of transformers and an estimator.

In [18]:
pipeline_lr = Pipeline(stages=[tokenizer, hashingTF, lr])

Now, you can train your Logistic Regression model by using the previously defined **pipeline** and **train data**.

In [19]:
model_lr = pipeline_lr.fit(train_data)

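You can evaluate the model on the test data. Area under the ROC curve is used as the evaluation metric. Note that the cell below prints this value under the label `Accuracy`, although it is area under ROC rather than classification accuracy.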

In [20]:
predictions = model_lr.transform(test_data)
evaluatorRF = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")
accuracy = evaluatorRF.evaluate(predictions)

print("Accuracy = {acc:4.3f}".format(acc=accuracy))

Accuracy = 0.762
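
To make the metric concrete: area under the ROC curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counting half. The sketch below computes it that way in plain Python on made-up toy data. It illustrates the metric and is not Spark's implementation. It also shows that when the scores are hard 0/1 predictions, as with `rawPredictionCol="prediction"` in the cell above, the value reduces to the average of the true-positive and true-negative rates.

```python
def area_under_roc(labels, scores):
    """Pairwise (rank-based) area under ROC; ties between scores count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]   # continuous scores
toy_hard_predictions = [1, 1, 0, 0, 0, 1]      # the same scores thresholded at 0.5

auc_from_scores = area_under_roc(toy_labels, toy_scores)
auc_from_hard = area_under_roc(toy_labels, toy_hard_predictions)

# With hard predictions the ROC curve has a single operating point.
pairs = list(zip(toy_labels, toy_hard_predictions))
tpr = sum(1 for y, p in pairs if y == 1 and p == 1) / toy_labels.count(1)
tnr = sum(1 for y, p in pairs if y == 0 and p == 0) / toy_labels.count(0)

print("AUC from continuous scores: {:.3f}".format(auc_from_scores))
print("AUC from hard predictions:  {:.3f}".format(auc_from_hard))
print("(TPR + TNR) / 2:            {:.3f}".format((tpr + tnr) / 2))
```

Evaluating on a continuous score column such as `rawPrediction` uses the full ranking of the examples, so it generally gives a different value than evaluating on thresholded predictions.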


Now you can tune your model to achieve a better score. To keep this notebook simple, the tuning section is omitted.

<a id="persistence"></a>
## 4. Persist model

In this section you will store your pipeline and model in the Watson Machine Learning repository by using the Python client libraries.

First, you must install and import the Watson Machine Learning client libraries.

In [21]:
!rm -rf $PIP_BUILD/watson-machine-learning-client

In [22]:
!pip install watson-machine-learning-client --upgrade

Collecting watson-machine-learning-client
  Downloading https://files.pythonhosted.org/packages/ac/5c/be0d3efe27704bbd43481b7de364ade8c686e867617caba8654989e0864b/watson_machine_learning_client-1.0.375-py3-none-any.whl (536kB)
    100% |################################| 542kB 3.3MB/s eta 0:00:01
Collecting requests (from watson-machine-learning-client)
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
    100% |################################| 61kB 2.8MB/s eta 0:00:01
Collecting certifi (from watson-machine-learning-client)
  Downloading https://files.pythonhosted.org/packages/18/b0/8146a4f8dd402f60744fa380bc73ca47303cccf8b9190fd16a827281eac2/certifi-2019.9.11-py2.py3-none-any.whl (154kB)
    100% |################################| 163kB 4.3MB/s eta 0:00:01
Collecting ibm-cos-sdk (from watson-machine-learning-client)
  Do

In [23]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

In [24]:
wml_credentials={
  "apikey": "P6eeZO4QmQ-kePxXCDimOwxQnlD6Ogg7BOfj_9VM7Ema",
  "iam_apikey_description": "Auto-generated for key 8fcfcdf5-caff-4a5f-a6e4-00686c02a810",
  "iam_apikey_name": "ml-credential",
  "iam_role_crn": "crn:v1:bluemix:public:iam::::serviceRole:Writer",
  "iam_serviceid_crn": "crn:v1:bluemix:public:iam-identity::a/09783aaca4f14173b87365a111c8b5d0::serviceid:ServiceId-e9bbf235-1d10-4b2a-8af2-03ded34f0e22",
  "instance_id": "25d8bad4-1b67-4087-a745-3bcf89cfee49",
  "url": "https://us-south.ml.cloud.ibm.com"
}
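
Credentials pasted into a notebook are easy to leak when the notebook is shared. As a hedged alternative, the sketch below builds the same kind of dictionary from environment variables. The variable names `WML_APIKEY`, `WML_INSTANCE_ID`, and `WML_URL` and the helper `load_wml_credentials` are hypothetical, and the assumption that `apikey`, `instance_id`, and `url` are the fields the client needs should be checked against the client documentation for your version.

```python
import os


def load_wml_credentials(env=os.environ):
    """Build a credentials dict from environment variables (hypothetical names)."""
    return {
        "apikey": env["WML_APIKEY"],            # raises KeyError if not set
        "instance_id": env["WML_INSTANCE_ID"],  # raises KeyError if not set
        # Fall back to the endpoint used in this notebook.
        "url": env.get("WML_URL", "https://us-south.ml.cloud.ibm.com"),
    }


# Demonstration with a stand-in mapping instead of the real environment.
example_env = {"WML_APIKEY": "<your-api-key>", "WML_INSTANCE_ID": "<your-instance-id>"}
example_credentials = load_wml_credentials(example_env)
print(sorted(example_credentials))
```

In a real session you would call `load_wml_credentials()` with no argument and pass the result to the client constructor.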

Create the WML client and authorize it.

In [25]:
client = WatsonMachineLearningAPIClient(wml_credentials)

In [26]:
client.version

'1.0.375'

### 4.1: Save pipeline and model

In [27]:
saved_model_details = client.repository.store_model(model=model_lr, meta_props={"name": "Sentiment Prediction Model"}, training_data=train_data, pipeline=pipeline_lr)

model_uid = client.repository.get_model_uid(saved_model_details)
print(model_uid)

3ddfbde8-f1bb-45af-93ff-401e9824f60f


Check model details:

In [28]:
print(saved_model_details)

{'metadata': {'guid': '3ddfbde8-f1bb-45af-93ff-401e9824f60f', 'url': 'https://us-south.ml.cloud.ibm.com/v3/wml_instances/25d8bad4-1b67-4087-a745-3bcf89cfee49/published_models/3ddfbde8-f1bb-45af-93ff-401e9824f60f', 'created_at': '2019-09-14T21:31:47.043Z', 'modified_at': '2019-09-14T21:31:47.103Z'}, 'entity': {'runtime_environment': 'spark-2.3', 'learning_configuration_url': 'https://us-south.ml.cloud.ibm.com/v3/wml_instances/25d8bad4-1b67-4087-a745-3bcf89cfee49/published_models/3ddfbde8-f1bb-45af-93ff-401e9824f60f/learning_configuration', 'name': 'Sentiment Prediction Model', 'label_col': 'label', 'learning_iterations_url': 'https://us-south.ml.cloud.ibm.com/v3/wml_instances/25d8bad4-1b67-4087-a745-3bcf89cfee49/published_models/3ddfbde8-f1bb-45af-93ff-401e9824f60f/learning_iterations', 'training_data_schema': {'fields': [{'metadata': {}, 'name': 'id', 'nullable': True, 'type': 'integer'}, {'metadata': {}, 'name': 'text', 'nullable': True, 'type': 'string'}, {'metadata': {'modeling_role

### 4.2: Load model

In [29]:
loaded_model = client.repository.load(model_uid)

You can check the type of the loaded model. Because it is the same model you saved, you can use it for local scoring.

In [30]:
print(type(loaded_model))

<class 'pyspark.ml.pipeline.PipelineModel'>


<a id="visualization"></a>
## 5. Predict locally and visualize

### 5.1: Make local prediction using previously loaded model and predict data

In this subsection you will score the ``predict_data`` data set.

In [31]:
predictions = loaded_model.transform(predict_data)

Preview the results by calling the *show()* method on the predictions DataFrame.

In [32]:
predictions.show(5)

+---+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
| id|                text|label|               words|            features|       rawPrediction|         probability|prediction|
+---+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|254|realdonaldtrump t...|    1|[realdonaldtrump,...|(262144,[4312,961...|[-1.3150612511415...|[0.21164114610918...|       1.0|
|256|realdonaldtrump j...|    0|[realdonaldtrump,...|(262144,[14,13396...|[5.48207446382384...|[0.99585655006955...|       0.0|
|296|realdonaldtrump t...|    1|[realdonaldtrump,...|(262144,[15889,21...|[-1.3175186267989...|[0.21123142555360...|       1.0|
|312|sensanders keep t...|    0|[sensanders, keep...|(262144,[32890,91...|[3.29411163264101...|[0.96422625174639...|       0.0|
|362|katiedaviscourt i...|    0|[katiedaviscourt,...|(262144,[16332,21...|[3.83815328742004...|[0.978920
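
The `rawPrediction` and `probability` columns are related by the logistic (sigmoid) function: for binary logistic regression, each probability is the sigmoid of the corresponding raw score. The sketch below checks this on the first elements shown in the first four rows of the table above. The values are copied from the truncated display, so they are compared with a small tolerance. It also assumes the default decision threshold of 0.5 when deriving the predicted class.

```python
import math


def sigmoid(x):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# (id, first element of rawPrediction, first element of probability, prediction)
# copied from the displayed (truncated) output above.
rows = [
    (254, -1.3150612511415, 0.21164114610918, 1.0),
    (256, 5.48207446382384, 0.99585655006955, 0.0),
    (296, -1.3175186267989, 0.21123142555360, 1.0),
    (312, 3.29411163264101, 0.96422625174639, 0.0),
]

recomputed = []
for row_id, raw_0, prob_0, prediction in rows:
    p0 = sigmoid(raw_0)                      # probability of class 0
    predicted = 0.0 if p0 >= 0.5 else 1.0    # assumes the default 0.5 threshold
    recomputed.append((row_id, p0, predicted))
    print("id={} sigmoid(raw)={:.6f} shown={:.6f} prediction={}".format(
        row_id, p0, prob_0, predicted))
```

The first element of each vector refers to class 0, which is why a low first probability goes together with a prediction of 1.0.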

By tabulating a count, you can see the split by sentiment.

In [33]:
predictions.select("label").groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|   49|
|    0|   79|
+-----+-----+
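
The counts above can be turned into shares with a few lines of plain Python. Note that the cell groups by the `label` column, so these are the shares of the labels in `predict_data`, not of the model's predictions. The variable names are illustrative only.

```python
# Counts copied from the output above.
label_counts = {1: 49, 0: 79}

total = sum(label_counts.values())  # equals the size of predict_data
shares = {label: count / total for label, count in label_counts.items()}

for label, share in sorted(shares.items()):
    print("label {}: {} of {} rows ({:.1%})".format(
        label, label_counts[label], total, share))
```

Grouping by `prediction` instead of `label` in the Spark cell would give the corresponding split of the model's output.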



### 5.2: Sample visualization of data with Plotly package

In this subsection you will explore prediction results with Plotly, which is an online analytics and data visualization tool.

**Tip**: First, you need to install required packages. You can do it by running the following code. Run it only one time.

In [None]:
!pip install plotly
!pip install chart-studio
!pip install cufflinks

In [47]:
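import os
import sys
import pandas
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

# Enable Plotly rendering inside the notebook
init_notebook_mode(connected=True)
sys.path.append("".join([os.environ["HOME"]]))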

You have to convert the Apache Spark DataFrame to a Pandas DataFrame so that it can be used by the plotting function.

In [41]:
predictions_pdf = predictions.select("prediction", "label", "text").toPandas()
cumulative_stats = predictions_pdf.groupby(['label']).count()
labels_data_plot = cumulative_stats.index
values_data_plot = cumulative_stats['text']

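Plot a pie chart that shows the split by sentiment. The counts come from grouping on the `label` column; group on `prediction` instead if you want the chart to show the predicted sentiment.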

In [48]:
product_data = [go.Pie(
            labels=labels_data_plot,
            values=values_data_plot,
    )]

product_layout = go.Layout(
    title='Sentiment',
)

fig = go.Figure(data=product_data, layout=product_layout)
iplot(fig)