<a href="https://colab.research.google.com/github/adbreind/accelerate-rapids/blob/master/01_RAPIDS_cuDF_cuML_cuGraph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4.

In [1]:
!nvidia-smi

Mon Sep 30 22:13:01 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [2]:
# install miniconda
!wget -c https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
!chmod +x Miniconda3-4.5.4-Linux-x86_64.sh
!bash ./Miniconda3-4.5.4-Linux-x86_64.sh -b -f -p /usr/local

# install RAPIDS packages
!conda install -q -y --prefix /usr/local -c nvidia -c rapidsai \
  -c numba -c conda-forge -c pytorch -c defaults \
  cudf=0.9 cuml=0.9 cugraph=0.9 python=3.6 cudatoolkit=10.0

# set environment vars
import sys, os, shutil
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

# copy .so files to current working dir
for fn in ['libcudf.so', 'librmm.so']:
  shutil.copy('/usr/local/lib/'+fn, os.getcwd())

--2019-09-30 22:13:02--  https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.201.79, 104.18.200.79, 2606:4700::6812:c84f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.201.79|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58468498 (56M) [application/x-sh]
Saving to: ‘Miniconda3-4.5.4-Linux-x86_64.sh’


2019-09-30 22:13:03 (61.4 MB/s) - ‘Miniconda3-4.5.4-Linux-x86_64.sh’ saved [58468498/58468498]

PREFIX=/usr/local
installing: python-3.6.5-hc3d631a_2 ...
Python 3.6.5 :: Anaconda, Inc.
installing: ca-certificates-2018.03.07-0 ...
installing: conda-env-2.6.0-h36134e3_1 ...
installing: libgcc-ng-7.2.0-hdf63c60_3 ...
installing: libstdcxx-ng-7.2.0-hdf63c60_3 ...
installing: libffi-3.2.1-hd88cf55_4 ...
installing: ncurses-6.1-hf484d3e_0 ...
installing: openssl-1.0.2o-h20670df_0 ...
installing: tk-8.6.7-hc745277_3 ...
installing: xz-5.2.4-h14c3975_4 ...
installing: yaml-0.1.7

# cuDF and cuML Smoke Test 

_Note_: You must import nvstrings and nvcategory before cudf, else you'll get errors.

In [8]:
import nvstrings, nvcategory, cudf

gdf = cudf.DataFrame({'test':[1,2,3]})
print(gdf)
print(gdf.describe())

   test
0     1
1     2
2     3
       test
count   3.0
mean    2.0
std     1.0
min     1.0
25%     1.5
50%     2.0
75%     2.5
max     3.0


In [29]:
import cuml

df_float = cudf.DataFrame()
df_float['0'] = [1.0, 2.0, 5.0]
df_float['1'] = [4.0, 2.0, 1.0]

dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(df_float)

print(dbscan_float.labels_)

0    0
1    1
2    2
dtype: int32


In [75]:
import cugraph

G = cugraph.Graph()
G.add_edge_list(cudf.Series([0, 1, 2, 2], dtype='int32'),
                cudf.Series([1, 2, 0, 3], dtype='int32'))
cugraph.strongly_connected_components(G)

Unnamed: 0,labels,vertices
0,0,0
1,0,1
2,0,2
3,3,3


In [30]:
print("INSTALL SUCCESS")

INSTALL SUCCESS


In [2]:
print 'stopping here on purpose' # post-install

SyntaxError: Missing parentheses in call to 'print'. Did you mean print('stopping here on purpose' # post-install)? (<ipython-input-2-f82bf4183d56>, line 1)

# Add GPU to Pandas-Style Analytics with cuDF

## Beer Review Data Analysis

In this lab, we'll practice Pandas-style analysis on the GPU with cuDF by exploring a dataset of beer reviews.

First we'll retrieve a small slice of the data. The full beer review dataset is surprisingly large ... or maybe not that surprising, since it seems like the kind of job that would be hard to give up so long as one more beer was out there :)

We'll read the data with cuDF (imported in the smoke test above), whose `read_csv` mirrors the Pandas function of the same name:

In [None]:
df = cudf.read_csv('beer_small.csv')

df

How many reviews are there?

In [None]:
len(df)

How can we tell if there are missing values?

In [None]:
df.count()
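
To see why `count()` answers this question: it reports the number of non-null entries per column, so any column whose count falls below `len(df)` has gaps. The sketch below uses Pandas on a made-up three-row frame as a CPU stand-in (cuDF aims to mirror this part of the Pandas API); the values are hypothetical and only the column names echo the beer data.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
toy = pd.DataFrame({
    "beer_name": ["A", "B", None],
    "beer_abv": [5.0, None, 0.5],
    "review_overall": [4.0, 3.5, 2.0],
})

non_null = toy.count()          # non-null entries per column
missing = len(toy) - non_null   # missing entries per column
print(missing)

# dropna() keeps only the rows with no missing values at all.
complete = toy.dropna()
print(len(complete))
```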

Since most reviews have data for most fields, let's drop the records with incomplete data.

In [None]:
df2 = df.dropna()

In [None]:
df2.count()

Let's get summary statistics for the numeric columns ... things like review score and ABV

In [None]:
df2.describe()

There are some really low-alcohol beers in there ... maybe even bogus data.

Find all entries with ABV less than 1%

In [None]:
low_abv = df2[df2.beer_abv < 1]

low_abv

How many of these reviews are there?

In [None]:
len(low_abv)

Some of these are multiple reviews for the same beer, which is allowed (and even encouraged). Let's group by beer and count.

In [None]:
grouping = low_abv.groupby('beer_name')
grouping.size()
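
As a small illustration of what group-and-count produces, here is the same pattern on a made-up frame, written with Pandas as a CPU stand-in for cuDF. The rows are hypothetical; `size()` returns one entry per distinct `beer_name`, holding the number of rows in that group.

```python
import pandas as pd

# Hypothetical reviews: two for one beer, one for another.
toy = pd.DataFrame({
    "beer_name": ["O'Doul's", "O'Doul's", "Other NA Beer"],
    "review_overall": [2.0, 3.0, 4.0],
})

reviews_per_beer = toy.groupby("beer_name").size()
print(reviews_per_beer)
```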

How consistent are the O'Doul's overall scores?

In [None]:
scores = low_abv[low_abv.beer_name=="O'Doul's"]['review_overall']
scores

Let's plot a histogram

In [None]:
scores.to_pandas().hist()  # plotting lives in Pandas, so copy the Series to the host first

What are the mean and standard deviation of the O'Doul's overall scores?

In [None]:
scores.mean(), scores.std()

In the full dataset, can we count beers by brewery, and then by style within that brewery?

In [None]:
df2.groupby(['brewery_name', 'beer_style']).size()

### Now we'll try and build up a slightly more complex report

Step 1: Find all rows corresponding to reviews where the beer style starts with "American"

In [None]:
all_american = df2[df2.beer_style.str.startswith('American')]
all_american

Next, make a dataframe with just the `beer_style` and `review_overall` fields for those rows.

In [None]:
narrowed = all_american[['beer_style', 'review_overall']]
narrowed

Now we'll make a boxplot to capture the range and variance of the ratings. Pandas will do all the work if we call the built-in API. Look for it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

In [None]:
narrowed.to_pandas().boxplot(by='beer_style', vert=False, figsize=(12,10))  # boxplot is a Pandas API, so convert first

# Add GPU to Scikit-Learn-Style Modeling with cuML

## Dataset: Diamonds

This dataset of diamond sales (http://ggplot2.tidyverse.org/reference/diamonds.html) is of moderate size (~55,000 records) and resembles data records that occur in many business scenarios.

For each of the diamond sales records, we have the following properties:
* price: price in US dollars ($326-$18,823)
* carat: weight of the diamond (0.2-5.01)
* cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
* color: diamond colour, from J (worst) to D (best)
* clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
* x: length in mm (0-10.74)
* y: width in mm (0-58.9)
* z: depth in mm (0-31.8)
* depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
* table: width of top of diamond relative to widest point (43-95)
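
The depth formula in the list above can be checked with a few lines of arithmetic. This is a minimal sketch on a hypothetical stone, not a row from the dataset; the factor of 100 is an assumption inferred from the stated 43-79 range, since the formula as written yields a ratio.

```python
def depth_percentage(x, y, z):
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)

# Hypothetical stone: 4.0 mm long, 4.0 mm wide, 2.4 mm deep.
depth = depth_percentage(4.0, 4.0, 2.4)
print(round(depth, 1))
```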

In [None]:
df = cudf.read_csv('data/diamonds.csv')

df.head(5)

The "unnamed" column is a row number in the dataset. It turns out that this row number -- which sounds like it should be meaningless -- actually leaks key data about the diamonds. 

Can you think of why this might be?

In [None]:
df.iloc[:,1]

In [None]:
df['price']

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plt.plot(df.iloc[:,0], df['price'], ',') # , means just a pixel marker
ax.set(xlabel='record #', ylabel='$')

Let's get rid of the row number:

In [None]:
df2 = df.drop(df.columns[0], axis=1)

df2[:3]

#### Categorical Features

Now ... computers are good with numbers, but what about those words? ("Premium", "Ideal", etc.)

It turns out that not only do we need to convert them to numbers, but we often want to do that in a way that treats them as totally separate properties.

That is, we consider the "Ideal"-ness of a diamond totally separately from the "Premium"-ness of that diamond, etc., and of course each diamond only has one of those properties. This is called "one-hot encoding" (or sometimes "dummy variable encoding" or "one of k encoding").

Why do we do this? Wouldn't it make more sense to measure the goodness-of-cut along a numeric scale, almost like the carat weight?

In theory, yes -- and in some cases your team may want to do that. But without putting in a lot of work (or having the business domain knowledge) to get that right, we can approximate with this encoding that is, in essence, just a math trick.

In [None]:
import pandas as pd  # pd is used from here on but was not imported earlier
pd.value_counts(df2.cut)

In [None]:
df2.dtypes

In [None]:
pd.Categorical(df2.cut)

In [None]:
# In many cases, Pandas can do these steps for us (although the Categorical type is useful to know about & use)
df3 = pd.get_dummies(df2)

df3.iloc[:3, 7:18]
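
To make the one-hot idea concrete, here is a toy example with a single categorical column. The data is made up; the point is that each category becomes its own 0/1 column, and each row has exactly one of them set. The `astype(float)` call is only there so the output looks the same across Pandas versions, which differ in the dtype `get_dummies` returns.

```python
import pandas as pd

# Hypothetical mini-dataset with one numeric and one categorical column.
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1],
    "cut": ["Ideal", "Premium", "Ideal"],
})

# Numeric columns pass through; "cut" expands to one column per category.
encoded = pd.get_dummies(toy).astype(float)
print(encoded)
```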

Now we'll split out a "test set" -- remember we want to be able to evaluate the model on records that it hasn't seen before.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

y = df3.price

X = df3.drop(columns='price')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
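
Conceptually, the split just shuffles the records and holds a fraction of them back. The following is a simplified sketch of that idea in plain Python, not scikit-learn's implementation; `split_rows` and its `seed` parameter are hypothetical names used only for illustration.

```python
import random

def split_rows(rows, test_size=0.3, seed=0):
    """Shuffle a copy of rows and hold out a test_size fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Ten hypothetical record ids: 7 go to training, 3 are held out.
train_rows, test_rows = split_rows(range(10), test_size=0.3)
print(len(train_rows), len(test_rows))
```

Every record lands in exactly one of the two sets, so the model never trains on a record it is later evaluated on.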

#### Baselines

In this case we'll use the mean price of the diamonds as a (constant) baseline model:

In [None]:
y.mean()

So our first "baseline" model just says for any diamond we might look at, its price is about $3900. Obviously this is usually going to be wrong, and often by a lot. But it's better than nothing. Later we'll see how to compare a "real" model against this one.

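Next, we'll set up the model: a k-nearest-neighbors (kNN) regressor, which predicts a diamond's price by averaging the prices of the most similar diamonds in the training set. kNN is very simple ... but even complex models are easy to set up with this code library: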

In [None]:
from sklearn.neighbors import KNeighborsRegressor

neigh = KNeighborsRegressor(n_neighbors=5)
model = neigh.fit(X_train, y_train) 
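
To show how little is going on inside a kNN regressor, here is a hand-rolled sketch in plain Python. It assumes Euclidean distance and a simple unweighted average of the neighbors' targets; `knn_predict` and the data are hypothetical, and a real library adds efficient neighbor search on top of this idea.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    distances = [
        (math.dist(row, query), target)
        for row, target in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])  # closest first
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical training data: one feature (carat) and a price in dollars.
train_X = [(0.3,), (0.4,), (0.5,), (1.0,), (1.5,)]
train_y = [400.0, 600.0, 800.0, 5000.0, 9000.0]

# The two nearest neighbors of 0.45 carat are the 0.4 and 0.5 carat stones.
prediction = knn_predict(train_X, train_y, query=(0.45,), k=2)
print(prediction)
```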

Ok, how did we do?

For regression problems like this, we'll measure the accuracy of our predictions using RMSE (root mean squared error). This is a measure of "how wrong" we typically are in our predictions, measured in the units we are predicting (i.e., in this case, dollars).

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)) )
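
Spelled out, RMSE is the square root of the average squared difference between true and predicted values. A minimal plain-Python version on made-up numbers looks like this (the helper name `rmse` is ours, not a library function):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the targets."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices: the errors are 10, 10, and 30 dollars.
error = rmse([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(round(error, 2))
```

Because the errors are squared before averaging, one large miss (the 30 here) pulls the result up more than several small ones.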

So is that actually any good?

One way to get an idea is to compare it to the mean and standard deviation of the data:

In [None]:
print(y_test.mean(), y_test.std())
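
The reason the standard deviation is a useful yardstick: the RMSE of the constant "always predict the mean" baseline is exactly the population standard deviation of the prices. The sketch below checks this on hypothetical numbers. Note that Pandas' `.std()` uses the sample formula (dividing by n - 1), so it comes out slightly larger, though the difference is negligible for tens of thousands of rows.

```python
import math
import statistics

prices = [500.0, 1500.0, 4000.0, 10000.0]  # hypothetical prices

# Constant baseline: predict the mean for every diamond.
baseline = statistics.mean(prices)
baseline_rmse = math.sqrt(
    sum((p - baseline) ** 2 for p in prices) / len(prices)
)

print(baseline_rmse, statistics.pstdev(prices))
```

So a model whose RMSE is well below the standard deviation is doing meaningfully better than the baseline.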

## Build a Parametric Model: Linear Regression

The canonical example of a parametric model is a linear regression model. Linear regression -- which you might have done by hand on a small amount of data in high school or a college stats class -- is simple, fast, robust, and performs reasonably well for many kinds of real-world data.

In fact, linear regression is one of the two or three most widely used algorithms in the world for data modeling.

Here's a simple version with one predictor and one response plotted against each other, along with a regression line:

<img src="https://materials.s3.amazonaws.com/i/gyP3KGA.png">

How does the computer (or the student) figure out where to draw that regression line? The goal is to minimize the __error__.

What is the error? The difference (or distance) between the true value and the value predicted by the regression line:

<img src="https://materials.s3.amazonaws.com/i/cgvGCMg.jpg" width=600>

That might be getting into too much detail for this class, so let's just say we want to calculate the mathematically best-fit line.

You can also notice that if the data itself does not embody a linear relationship, this approach may not work very well. Surprisingly, a lot of phenomena do have a large enough linear component that this algorithm often works. One thing that will help it fit complex data -- like your business records or our diamond sales -- is using more dimensions. That is, unlike the pictures here which just have one predictor (to make the pictures simple), we can use the same approach to calculate a response as a linear function of many dimensions. 
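
For the one-predictor case in the pictures, the line that minimizes the sum of squared errors has a simple closed form. The following is a sketch on made-up points, with hypothetical helper names `fit_line` and `sum_squared_error`; the many-predictor case follows the same principle but is solved with linear algebra by the library.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical, roughly linear data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)
best = sum_squared_error(xs, ys, slope, intercept)
print(slope, intercept, best)
```

Nudging the slope or intercept away from the fitted values can only increase the squared error, which is what "best fit" means here.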

Let's fit a linear regression model to our diamonds dataset:

In [None]:
from sklearn import linear_model

lr = linear_model.LinearRegression()
linear = lr.fit(X_train, y_train)

y_pred = linear.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)) )

This model didn't fit quite as well as the kNN model (the RMSE here is larger, indicating our predictions are off by a few hundred more dollars). However, this model is very compact, since it is completely defined by about 27 parameters (one coefficient per feature column, plus an intercept):

In [None]:
print("Coefficients: %s" % linear.coef_)

print("Intercept: %s" % linear.intercept_)

And making a prediction requires just multiplying and then adding 26 pairs of numbers, so it is lightning fast, even on the tiniest embedded IoT device. Alternatively, if we want to make billions of predictions, we could do that in a second with a higher-end server.
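
Here is what that multiply-and-add looks like in code. This is a sketch with three hypothetical coefficients rather than the fitted ones; with the real model, `coefficients` would be `linear.coef_` and `intercept` would be `linear.intercept_`.

```python
def predict(coefficients, intercept, features):
    """Linear model prediction: sum of coefficient * feature, plus intercept."""
    return sum(c * f for c, f in zip(coefficients, features)) + intercept

# Hypothetical three-feature model.
coefficients = [2.0, -1.0, 0.5]
intercept = 10.0

# 2.0*3.0 + (-1.0)*4.0 + 0.5*2.0 + 10.0
price = predict(coefficients, intercept, [3.0, 4.0, 2.0])
print(price)
```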

### Lab: Powerplant Output 

https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant

About the business problem: peaker plant operation

What is in this dataset? Just under 10,000 observations of:

* Temperature (AT) in the range 1.81°C to 37.11°C
* Ambient Pressure (AP) in the range 992.89-1033.30 millibar
* Relative Humidity (RH) in the range 25.56% to 100.16%
* Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
* Net hourly electrical energy output (PE) in the range 420.26-495.76 MW

What is the goal? To model output (PE) based on the other measurements.

In [None]:
df = pd.read_csv('data/powerplant.csv')

df

First, think about your intuition, experience, or "domain knowledge" that might apply -- even if you don't know about power generation, you may have some ideas about atmospheric pressure and temperature, and how they might affect a combustion-based power output.

Test those ideas by building some plots. With just 4 predictors, you can make plots with all of them. Notice anything interesting?

Try to build a linear regression model for power output. (Hint: you can cut/paste a lot of the code we've already used in this notebook!)