# Tutorial 1: From Scikit-Learn to NimbusML


## Goals:
* Learn to write scripts with NimbusML components
* Learn to boost your existing Scikit-Learn scripts with NimbusML components

## Why use NimbusML?
* Used ML.NET before?
* Used Scikit-Learn before?

### Would you want this?
<img align="middle" src="https://notebooks.azure.com/ganik/libraries/test111/raw/data%2Fgoals.png" height=700 />


### NimbusML:
<img align="middle" src="https://notebooks.azure.com/ganik/libraries/test111/raw/data%2Fspeed.png" width=550 height=550 />


## Let's start ...
### Let's do all the imports:

In [16]:
# Cell 1

# imports
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier

# NimbusML imports
from nimbusml import Pipeline as NimbusPipeline, FileDataStream
from nimbusml.linear_model import FastLinearBinaryClassifier
from nimbusml.feature_extraction.text import NGramFeaturizer

### Set up train and test data:

In [17]:
# Cell 2

np.random.seed(0)

# Prepare train and test data
# Twitter sentiment prediction
# Subset of Kaggle Twitter positive/negative sentiment prediction https://www.kaggle.com/c/twitter-analysis  

train_file = 'data/train.tsv'
test_file = 'data/test.tsv'
data_train = pd.read_csv(train_file, header=0, sep='\t', encoding='latin-1') 
data_test = pd.read_csv(test_file, header=0, sep='\t', encoding='latin-1')
print(data_train[:10])

label_column = 'Sentiment'
feature_column = 'SentimentText'
train_X = data_train[feature_column].values.astype('U')
train_y = data_train[label_column]

k = 5000 # use at most the first k rows of the test file
test_X = data_test[feature_column][:k].values.astype('U')
test_y = data_test[label_column][:k]

   Sentiment                                      SentimentText
0          0           is so sad for my APL friend.............
1          0                   I missed the New Moon trailer...
2          1                            omg its already 7:30 :O
3          0  .. Omgaga. Im sooo  im gunna CRy. I've been at...
4          0       i think mi bf is cheating on me!!!       T_T
5          0                          or i just worry too much?
6          1                 Juuuuuuuuuuuuuuuuussssst Chillin!!
7          0  Sunny Again        Work Tomorrow  :-|       TV...
8          1    handed in my uniform today . i miss you already
9          1           hmmmm.... i wonder how she my number @-)


### Scikit-Learn script:

In [18]:
# Cell 3

# Define pipeline, add transforms and classifier
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=10)),
])

# Train pipeline
pipe.fit(train_X, train_y)

# Get predictions
test_pred = pipe.predict(test_X)

print(test_pred[:10])
print("acc: %s" % accuracy_score(test_y, test_pred))

[0 0 0 0 0 0 0 0 0 0]
acc: 0.778
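
The `acc` value printed above is the fraction of test rows whose predicted label equals the true label. A minimal plain-Python sketch of that calculation follows; the `accuracy` helper and the labels are made up for illustration and are not taken from the dataset or from scikit-learn:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 0, 1]
y_pred = [0, 0, 0, 0, 1]

score = accuracy(y_true, y_pred)
print("acc: %s" % score)
```

scikit-learn documents the arguments of `accuracy_score` as `(y_true, y_pred)`. For plain accuracy the definition is symmetric, so swapping the two arguments does not change the value.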


### Replace with NimbusML learner:

In [19]:
# Cell 4

# Define pipeline, add transforms and classifier
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', FastLinearBinaryClassifier()),
])

# Train pipeline
pipe.fit(train_X, train_y)

# Get predictions
test_pred = pipe.predict(test_X)

#print(test_pred[:10])
print("acc: %s" % accuracy_score(test_y, test_pred))

Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Using 1 thread to train.
Automatically choosing a check frequency of 1.
Auto-tuning parameters: maxIterations = 285.
Auto-tuning parameters: L2 = 2.679191E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 1.
Using best model from iteration 19.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:01.6238504
acc: 0.792
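
The log above reports that a MinMax normalization transform was added automatically before training. As a rough illustration of the idea only, textbook min-max scaling maps each feature column onto the range [0, 1]. The `min_max_scale` helper below is hypothetical, not a NimbusML function, and the built-in normalizer may differ in details such as how it treats zeros:

```python
import numpy as np

def min_max_scale(column):
    """Textbook min-max scaling: (x - min) / (max - min), mapped onto [0, 1]."""
    column = np.asarray(column, dtype=np.float64)
    lo, hi = column.min(), column.max()
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

# Hypothetical word counts for one feature column
counts = [0.0, 2.0, 4.0, 10.0]
scaled = min_max_scale(counts)
print(scaled)
```

Linear learners tend to converge more reliably when features share a similar scale, which is presumably why the transform is added by default.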


### For details on FastLinearBinaryClassifier, see the documentation site:
https://docs.microsoft.com/en-us/NimbusML
Additional TLC support alias: tlcsupp@microsoft.com


### High level architecture

* Memory is passed in by reference
* Memory is passed back by copy

<img align="middle" src="https://notebooks.azure.com/ganik/libraries/test111/raw/data/architecture.png" width=600 height=400 />
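
The difference between passing memory by reference and by copy can be sketched with plain NumPy. This is only an analogy for the two bullets above, not NimbusML's actual interop code: a view reuses the caller's buffer, while a copy allocates a new one.

```python
import numpy as np

features = np.arange(6, dtype=np.float32)

# "By reference": a view reuses the caller's buffer, so no data is duplicated
by_ref = features.view()

# "By copy": a new buffer is allocated and the values are duplicated into it
by_copy = features.copy()

print(np.shares_memory(features, by_ref))
print(np.shares_memory(features, by_copy))

# A later change to the original is visible through the view but not the copy
features[0] = 42.0
print(by_ref[0], by_copy[0])
```

Passing inputs by reference avoids duplicating a large training set, while returning results by copy means the caller owns the predictions independently of the native side.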



### Optimized NimbusML script:

In [20]:
# Cell 5

schema = 'sep=tab col=Label:R4:0 col=SentimentText:TX:1 header=+'
trainDs = FileDataStream(train_file, schema)
testDs = FileDataStream(test_file, schema)

pipe = NimbusPipeline([
  NGramFeaturizer() << {'Features':'SentimentText'},
  FastLinearBinaryClassifier()])

# Train pipeline
pipe.fit(trainDs)

# Get predictions
test_pred = pipe.predict(testDs)

#print(test_pred[:10])
print("acc: %s" % accuracy_score(test_y, test_pred['PredictedLabel'][:k]))

Not adding a normalizer.
Using 1 thread to train.
Automatically choosing a check frequency of 1.
Auto-tuning parameters: maxIterations = 285.
Auto-tuning parameters: L2 = 2.679191E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 1.
Using best model from iteration 17.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:06.9979151
acc: 0.806
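
The schema string in Cell 5 names each column, its type, and its position in the file. Assuming the ML.NET text-loader convention `col=Name:Type:Index`, with `R4` meaning a 32-bit float and `TX` meaning text, a rough pandas equivalent looks like the sketch below. It reads a two-row in-memory sample, copied from the training rows printed earlier, instead of the real file:

```python
import io

import numpy as np
import pandas as pd

# Two sample rows in the same tab-separated layout as data/train.tsv
sample_tsv = (
    "Sentiment\tSentimentText\n"
    "0\tor i just worry too much?\n"
    "1\tomg its already 7:30 :O\n"
)

df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",                          # sep=tab
    header=0,                          # header=+ : the first line is a header row
    names=["Label", "SentimentText"],  # column 0 -> Label, column 1 -> SentimentText
    dtype={"Label": np.float32, "SentimentText": str},  # R4 and TX
)
print(df.dtypes)
```

The sketch only illustrates what the schema describes. With `FileDataStream`, the file is handed to the NimbusML pipeline directly rather than being loaded into a DataFrame first.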


## Recap:
* Created a simple Scikit-Learn script
* Used a NimbusML learner in a Scikit-Learn pipeline
* Used NimbusML transformers and a learner in a NimbusML pipeline

And if we had run on the whole dataset:
<img align="middle" src="https://notebooks.azure.com/ganik/libraries/test111/raw/data/scikit2NimbusML.png"/>