# Tutorial 1: How Much is Your Wine?

In this tutorial, we will create an end-to-end regression model with the wine review dataset. The API of NimbusML is compatible with sklearn, so users who are already familiar with scikit-learn can get started right away. There are also some "advanced" techniques which can be helpful for optimal performance:

1. NimbusML pipelines
2. FileDataStream
3. Column operations and roles

This tutorial is organized as follows:

1. Quick Start
2. Wine Review Example
    - 2.1 Data Preprocessing - Stream Data from Files
    - 2.2 Model Development
    - 2.3 Model Evaluation
3. Recap

*Let's get started!!*

Note that it is useful to keep this page open as a class reference:

https://docs.microsoft.com/en-us/nimbusml

## 1. Quick Start

The modeling data can come from several different sources. Most array-like structures are supported (e.g. lists, numpy arrays, dataframes, series). Let’s look at a simple example.

In [None]:
from nimbusml.linear_model import FastLinearClassifier
X = [[1,2,3],[2,3,4],[-1.2,-1,-7]]
Y = [0,0,1]

model = FastLinearClassifier()
model.fit(X,Y)

model.predict(X)

We can also use Pipeline to include more than one operator in the model, just like sklearn.

In [None]:
from nimbusml import Pipeline
from nimbusml.preprocessing.missing_values import Handler as Missingval_Handler

model = Pipeline([
                    Missingval_Handler(), # expects float input; integer columns can cause issues
                    FastLinearClassifier()
                 ])
model.fit(X,Y)

metrics, scores = model.test(X, Y, output_scores=True)
metrics

## 2. Wine Review Example

In this section, we develop a model that predicts the price of a wine from its review text and other information about the wine. We will use NimbusML's text featurizer to extract numeric features from the review corpus using **pre-trained** language models.

The dataset contains a mix of numeric, categorical and text features. This section demonstrates how to build a pipeline of transforms and trainers that does the following:

- Process data directly from files
- Filter records
- Apply transforms to just the columns of interest (new)
- Encode the categorical features with OneHotVectorizer
- Convert text to numeric embeddings with NGramFeaturizer and the WordEmbedding transform (a pre-trained DNN model)
- Select features with CountSelector
- Fit a regression model
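Two of the steps listed above, one-hot encoding and count-based feature selection, can be sketched in plain Python. This is only an illustration of the idea and not NimbusML's implementation: the helpers one_hot and select_by_count and the toy country list are hypothetical, and the sketch assumes that a count selector keeps a slot when at least "count" rows have a non-zero value in it.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


def select_by_count(categories, rows, count):
    """Keep only the slots that are non-zero in at least `count` rows."""
    totals = [sum(1 for row in rows if row[i] != 0)
              for i in range(len(categories))]
    keep = [i for i, total in enumerate(totals) if total >= count]
    kept_categories = [categories[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_categories, kept_rows


# Hypothetical toy data, not taken from the wine dataset.
countries = ["US", "France", "US", "Italy", "France"]

categories, encoded = one_hot(countries)
kept, reduced = select_by_count(categories, encoded, count=2)

print(categories)  # all one-hot slots
print(kept)        # slots that appear in at least two rows
```

Dropping rare slots this way keeps the feature vector small when a categorical column has many values that occur only once.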

### 2.1 Data Preprocessing - Stream Data from Files

In [None]:
from nimbusml import FileDataStream

# we don't use pandas DataFrame, but FileDataStream to improve performance
ds_train = FileDataStream.read_csv("wine_train.csv")
ds_test = FileDataStream.read_csv("wine_test.csv")
ds_train.head(3)

In [None]:
ds_train.schema
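The general idea of streaming rows from a file, instead of loading everything into memory first, can be illustrated with the standard library. This sketch makes no claim about how FileDataStream works internally; the stream_rows helper and the inline sample rows are hypothetical.

```python
import csv
import io
from itertools import islice


def stream_rows(handle):
    """Yield one parsed row at a time instead of reading the whole file."""
    for row in csv.DictReader(handle):
        yield row


# Hypothetical sample standing in for a file such as wine_train.csv.
sample = io.StringIO(
    "country,points,price\n"
    "US,96,235\n"
    "Spain,96,110\n"
    "France,95,66\n"
    "Italy,94,\n"
)

# Only the first three rows are parsed, similar in spirit to head(3).
first_three = list(islice(stream_rows(sample), 3))
for row in first_three:
    print(row["country"], row["points"], row["price"])
```

Because the generator is lazy, the fourth row is never parsed here, which is why peeking at a large file this way stays cheap.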

### 2.2 Model Development

Based on the data types, we want to develop a pipeline that applies different operators to different columns. Note that this pipeline can definitely be improved to achieve better accuracy.

In [None]:
from IPython.display import Image
Image(filename='1.png')

In [None]:
from nimbusml.preprocessing.missing_values import Filter as Missingval_Filter
from nimbusml.feature_extraction.categorical import OneHotVectorizer
from nimbusml.feature_selection import CountSelector
from nimbusml.feature_extraction.text import NGramFeaturizer
from nimbusml.feature_extraction.text import WordEmbedding
from nimbusml.ensemble import LightGbmRegressor
from nimbusml import Role

# tk = TakeFilter(count = 100) #Always suggested to start with a TakeFilter to quickly examine the pipeline

ft = Missingval_Filter()                   << ['price']
# ft = Missingval_Filter(columns = ['price']) #Equivalent

onv = OneHotVectorizer()                   << ['country', 'province', 'region_1', 'variety']
cs = CountSelector(count = 2)              << ['country', 'province', 'region_1', 'variety']

ng = NGramFeaturizer(output_tokens_column_name = 'description_TransformedText') << ['description']
we = WordEmbedding(model_kind = 'SentimentSpecificWordEmbedding')    << ['description_TransformedText']
lgm = LightGbmRegressor()                  << {'Feature': ['country', 'province', 'region_1', 'variety', 
                                               'description_TransformedText', 'points'],
                                               'Label': 'price'}

# lgm = LightGbmRegressor(feature = ['country', 'province', 'region_1', 'variety', 
#                                                'description_TransformedText', 'points'],
#                         label = 'price') #Equivalent

model = Pipeline([ft, onv, cs, ng, we, lgm])
model.fit(ds_train)

Users can specify the input columns for the transform using:

            OneHotVectorizer(columns = ['country', 'province', 'region_1', 'variety'])
or

            OneHotVectorizer() << ['country', 'province', 'region_1', 'variety']
By default, the output column names are the same as the input names, so the input columns are overwritten. Users can also specify new output column names, in which case both the input and output columns are preserved.

            OneHotVectorizer(columns = {'country_out': 'country', 'variety_out': 'variety'})
or

            OneHotVectorizer() << {'country_out': 'country', 'variety_out': 'variety'}
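The difference between overwriting a column and writing to a new output column can be shown with a small stand-alone sketch. The apply_to_columns helper below is hypothetical and only mimics the naming convention described above (a list overwrites in place, a dictionary maps an output name to an input name); it does not call NimbusML.

```python
def apply_to_columns(rows, columns, fn):
    """Apply fn to the selected columns of each row (a dict).

    columns is either a list of names (results overwrite the inputs)
    or a dict of {output_name: input_name} (inputs are preserved).
    """
    if isinstance(columns, dict):
        mapping = columns
    else:
        mapping = {name: name for name in columns}

    result = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's data is untouched
        for output_name, input_name in mapping.items():
            new_row[output_name] = fn(row[input_name])
        result.append(new_row)
    return result


# Hypothetical toy row, not taken from the wine dataset.
rows = [{"country": "US", "variety": "Merlot"}]

overwritten = apply_to_columns(rows, ["country"], str.lower)
preserved = apply_to_columns(rows, {"country_out": "country"}, str.lower)

print(overwritten)  # 'country' is replaced by the transformed value
print(preserved)    # 'country' is kept and 'country_out' is added
```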

For learners, users need to specify the roles for the columns by using:

            FastForestRegressor(feature = ['country', 'province'], label = 'price')

The feature and label are the "roles" users need to specify. Equivalently, the shift operator can be used:

            FastForestRegressor() << {Role.Feature: ['country', 'province'], Role.Label: 'price'}
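In Python, the shift operator can be given a custom meaning by defining the __lshift__ method. The sketch below shows one way such an operator could store its right-hand side in the same place as the keyword arguments. SketchTransform and SketchLearner are hypothetical stand-ins written for this illustration; they are not NimbusML classes and do not show how NimbusML implements the operator.

```python
class SketchTransform:
    """Hypothetical stand-in for a transform that takes input columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def __lshift__(self, columns):
        # `obj << cols` stores the columns and returns the object,
        # so the expression can be assigned or placed in a list.
        self.columns = columns
        return self


class SketchLearner:
    """Hypothetical stand-in for a learner that takes column roles."""

    def __init__(self, feature=None, label=None):
        self.feature = feature
        self.label = label

    def __lshift__(self, roles):
        # Accept a dict keyed by role name; missing roles keep their value.
        self.feature = roles.get("Feature", self.feature)
        self.label = roles.get("Label", self.label)
        return self


by_keyword = SketchTransform(columns=["country", "variety"])
by_shift = SketchTransform() << ["country", "variety"]
print(by_keyword.columns == by_shift.columns)

learner = SketchLearner() << {"Feature": ["country", "province"], "Label": "price"}
print(learner.feature, learner.label)
```

Returning self from __lshift__ is what lets an expression such as SketchTransform() << [...] be used directly inside a list of pipeline steps.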

Some column names are well known. For example, a column named “Features” is treated as the training features, and a column named “Label” is treated as the label by default. These names are most likely case sensitive.

We can also plot the pipeline using the img_export_pipeline function.

In [None]:
from nimbusml.utils.exports import img_export_pipeline
fig = img_export_pipeline(model,ds_train) 
fig
# fig.render("ppl1.png") # save this image to files

### 2.3 Model Evaluation

In [None]:
metrics, scores = model.test(ds_test, output_scores=True)
metrics
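Regression evaluation typically reports absolute error, squared error, root mean squared error and R squared. As a rough guide to reading such numbers, here is how they can be computed by hand. The regression_metrics function and the toy prices are hypothetical, and the column names in the metrics table returned by model.test may differ from the dictionary keys used here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # mean absolute error (L1)
    mse = sum(e * e for e in errors) / n    # mean squared error (L2)
    rmse = math.sqrt(mse)                   # root mean squared error
    mean_true = sum(y_true) / n
    total_var = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / total_var        # coefficient of determination
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical toy values, not results from the wine model.
true_prices = [10.0, 20.0, 30.0, 40.0]
predicted_prices = [12.0, 18.0, 33.0, 37.0]

result = regression_metrics(true_prices, predicted_prices)
print(result)
```

An R squared close to 1 means the predictions explain most of the variance in price, while the absolute and squared errors are expressed in price units and squared price units respectively.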

In [None]:
Image(filename='2.png')

## 3. Recap

In this tutorial, we presented an example that shows how to:

1. Use a NimbusML pipeline
2. Train the model with FileDataStream
3. Specify columns for transforms and learners:

        For Transforms, always use "columns = " (or "<<" is equivalent)
        For learners, specify roles by using "feature = ", "label = " (or "<< {'Feature': , 'Label': }")
 
For more details about the package, please refer to:

https://docs.microsoft.com/en-us/nimbusml