<a href="https://colab.research.google.com/github/datascientistpur/gpu/blob/master/CUDF_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
'''Check whether a GPU is attached. If not, go to
Runtime -> Change runtime type -> Hardware accelerator -> GPU -> Save.
If the assigned GPU is a Tesla K80, change the runtime again until a different GPU is assigned, since the K80 doesn't support CUDA 10, which is our base dependency.
'''
!nvidia-smi 

Thu Apr  9 07:43:11 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
'''Install CuDF (the CUDA 10.0 build), then restart the runtime.
'''
!pip install cudf-cuda100

Collecting cudf-cuda100
  Downloading https://files.pythonhosted.org/packages/39/a5/a40e0e0290c332cb2c27dd824c3e8f242d56af27cfdb4da92e5ebe0cf076/cudf_cuda100-0.6.1-cp36-cp36m-manylinux1_x86_64.whl (17.2MB)
     |████████████████████████████████| 17.2MB 200kB/s 
Collecting nvstrings-cuda100
  Downloading https://files.pythonhosted.org/packages/e5/88/5cddf81ffc06908d1cba1dca357e3eb1dab050f46881752fdb4084eb1484/nvstrings_cuda100-0.3.0.post1-cp36-cp36m-manylinux1_x86_64.whl (9.1MB)
     |████████████████████████████████| 9.1MB 25.1MB/s 
Collecting numba<0.42,>=0.40.0
  Downloading https://files.pythonhosted.org/packages/31/55/938f0023a4f37fe24460d46846670aba8170a6b736f1693293e710d4a6d0/numba-0.41.0-cp36-cp36m-manylinux1_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 40.7MB/s 
Collecting pyarrow==0.12.1
  Downloading https://files.pythonhosted.org/packages/13/37/eb9aefcd6a041dffb4db6729daea2a91a01a1bf9815e02a3d35281348a85/pyarrow-0.12

In [0]:
# Copy the RMM shared library (librmm.so) into the current working directory.
!cp /usr/local/lib/python3.6/dist-packages/librmm.so .

In [0]:
# Point numba at the NVVM library and libdevice that ship with CUDA 10.0.
import os
os.environ['NUMBAPRO_NVVM']='/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE']='/usr/local/cuda-10.0/nvvm/libdevice'
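
The two variables above only help if the paths actually exist on the runtime image. A minimal sketch, assuming a hypothetical helper `missing_paths` (not part of numba or cudf), that checks the same paths before exporting them:

```python
import os

def missing_paths(env_to_path):
    """Return the variable names whose target path does not exist (hypothetical helper)."""
    return [name for name, path in env_to_path.items() if not os.path.exists(path)]

# Same values as in the cell above.
nvvm_settings = {
    'NUMBAPRO_NVVM': '/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so',
    'NUMBAPRO_LIBDEVICE': '/usr/local/cuda-10.0/nvvm/libdevice',
}

missing = missing_paths(nvvm_settings)
if missing:
    # Report the problem early instead of failing later during JIT compilation.
    print("Paths not found for:", missing)
else:
    os.environ.update(nvvm_settings)
```

If a path is reported missing, the CUDA toolkit is probably installed under a different version directory on that runtime.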

In [0]:
# Libraries (torch is imported only to report the CUDA version)
import cudf
import pandas as pd
import numpy as np
from numba import cuda
import torch
import os

In [0]:
print("pandas version:",pd.__version__)
print("cudf version:",cudf.__version__)
print("numpy version:",np.__version__)
print("cuda version:",torch.version.cuda)

pandas version: 1.0.3
cudf version: 0+unknown
numpy version: 1.18.2
cuda version: 10.1


In [0]:
# Get a handle on the current CUDA context and record the GPU memory baseline before any data is loaded.
context=cuda.current_context()
cudf_mem_space=context.get_memory_info()

In [0]:
# The dataset records purchases at an online retail store. It has 8 columns; the metadata is as follows:
#    Column        Description              Type
# 1. Invoice       Invoice number           string
# 2. StockCode     Stock code               string
# 3. Description   Item name                string
# 4. Quantity      Quantity bought          int
# 5. InvoiceDate   Date of the invoice      datetime (yyyy-mm-dd hh:mm:ss)
# 6. Price         Unit price               float
# 7. Customer ID   ID of the customer       string
# 8. Country       Origin of the customer   string

In [0]:
from google.colab import files
uploaded = files.upload()

Saving online_retail.csv to online_retail (1).csv


In [0]:
'''Read files
cdf_file is the cudf DataFrame, held in GPU memory
pd_file is the pandas DataFrame, held in CPU memory
Note: cudf doesn't support reading from or writing to Excel or pickle files.'''
import io
%time cdf_file=cudf.read_csv(io.BytesIO(uploaded["online_retail.csv"]))
%time pd_file=pd.read_csv(io.BytesIO(uploaded["online_retail.csv"]))

CPU times: user 82.7 ms, sys: 37.3 ms, total: 120 ms
Wall time: 123 ms
CPU times: user 582 ms, sys: 29.8 ms, total: 612 ms
Wall time: 617 ms


In [0]:
'''Memory consumption for both cudf and pandas.
The cudf figure is the drop in free GPU memory since the baseline recorded before loading.'''
#context=cuda.current_context()
cudf_mem_space_post_load=context.get_memory_info()
print("Memory consumed by the CuDF:",(cudf_mem_space.free-cudf_mem_space_post_load.free)/1e9,"GB")
print("Memory consumed by the pandas frame:",(sum(pd_file.memory_usage()))/1e9,"GB")

Memory consumed by the CuDF: 0.109051904 GB
Memory consumed by the pandas frame: 0.033629632 GB
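
One caveat on the pandas figure: `memory_usage()` without `deep=True` counts object (string) columns by their pointers only, so the number above likely understates the CPU footprint. A small sketch on a made-up frame (the `demo` data is an assumption, not the retail dataset):

```python
import pandas as pd

# Small made-up frame standing in for pd_file; dtype=object mimics pandas 1.x string columns.
demo = pd.DataFrame({
    'Country': pd.Series(['United Kingdom', 'France', 'Germany'] * 1000, dtype=object),
    'Quantity': [1, 2, 3] * 1000,
})

shallow_bytes = demo.memory_usage().sum()        # object column counted as pointers only
deep_bytes = demo.memory_usage(deep=True).sum()  # also counts the Python string objects

print("shallow:", shallow_bytes / 1e9, "GB")
print("deep:   ", deep_bytes / 1e9, "GB")
```

For a fairer comparison with the GPU figure, `pd_file.memory_usage(deep=True).sum()` would be the number to report.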


In [0]:
'''Subsetting data, part 1
Label-based slicing with loc'''
cdf_file_subset=cdf_file.loc[1:1000]
pd_file_subset=pd_file.loc[1:1000]

In [0]:
'''Subsetting data, part 2
Position-based slicing with iloc'''
cdf_file_subset=cdf_file.iloc[1:5]
pd_file_subset=pd_file.iloc[1:5]
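
The two slicing styles differ in how they treat the end of the range. The sketch below shows the pandas behaviour on a made-up frame; whether cudf 0.6.1 matches it exactly is not verified here.

```python
import pandas as pd

demo = pd.DataFrame({'Quantity': range(10)})  # default RangeIndex 0..9

by_label = demo.loc[1:5]      # label slice: both endpoints are included
by_position = demo.iloc[1:5]  # positional slice: the end is excluded

print(len(by_label), len(by_position))
```

So `loc[1:1000]` selects 1000 rows on a default index, while `iloc[1:5]` selects 4.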

In [0]:
# Row filtering with a boolean mask. Note: the cudf query runs on the latest cudf, i.e. 0.6.1 (commented out here).
#cdf_file_query=cdf_file[cdf_file.Price>10]
pd_file_query=pd_file[pd_file.Price>10]

In [0]:
'''Frequency counts
1. Using value_counts
2. Using groupby as a substitute for the value_counts method

CuDF doesn't support value_counts on a string column.
Similarly, other standard pandas methods for string data types are unsupported unless we make use of the nvstrings package.
CuDF documentation: https://rapidsai.github.io/projects/cudf/en/latest/api.html
'''
%time pd_count = pd_file['Country'].value_counts()
%time pd_count1 = pd_file.groupby(['Country'])['StockCode'].count() 
#%time cudf_count = cdf_file['Country'].value_counts()
####Note the cudf querying runs on latest cudf i.e. 0.6.1.
%time cudf_count = cdf_file.groupby(['Country'])['StockCode'].count()

CPU times: user 49.5 ms, sys: 829 µs, total: 50.3 ms
Wall time: 49.9 ms
CPU times: user 54.9 ms, sys: 2.77 ms, total: 57.7 ms
Wall time: 58.3 ms


NotImplementedError: ignored
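
To see why a group-by can stand in for `value_counts`, here is a pandas-only sketch on made-up data. Note that `count()` skips missing values in the counted column, so `size()` is the closer equivalent when `StockCode` may contain nulls.

```python
import pandas as pd

demo = pd.DataFrame({
    'Country': ['France', 'Germany', 'France', 'France', 'Germany', 'Spain'],
    'StockCode': ['A1', 'B2', 'C3', 'A1', 'B2', 'D4'],
})

counts_vc = demo['Country'].value_counts()
counts_gb = demo.groupby('Country')['StockCode'].count()

# value_counts orders by frequency and groupby orders by key, so compare as dicts.
same = counts_vc.to_dict() == counts_gb.to_dict()
print(same)
print(counts_gb.sort_values(ascending=False))
```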

In [0]:
'''Sorting data'''
cdf_file=cdf_file.sort_values(by=['Invoice','StockCode'],ascending=True)
pd_file=pd_file.sort_values(by=['Invoice','StockCode'],ascending=True)

In [0]:
'''Extending frames'''
cdf_file1=cudf.concat([cdf_file,cdf_file],ignore_index=True)
pd_file1=pd.concat([pd_file,pd_file],ignore_index=True)

In [0]:
'''Merging frames'''
# Note: to merge in cudf, the key columns need to have the same names in both frames, so `on` is used instead of left_on/right_on.
#cdf_file1=cdf_file.merge(cdf_file[['Invoice','StockCode','Quantity']].rename(columns={'Quantity':'Qty_y','Invoice':'inv','StockCode':"stk"}),left_on=['Invoice','StockCode'],right_on=['inv','stk'],how="inner")
cdf_file1=cdf_file.merge(cdf_file[['Invoice','StockCode','Quantity']].rename(columns={'Quantity':'Qty_y'}),on=['Invoice','StockCode'],how="inner")
pd_file1=pd_file.merge(pd_file[['Invoice','StockCode','Quantity']].rename(columns={'Quantity':'Qty_y','Invoice':'inv','StockCode':"stk"}),left_on=['Invoice','StockCode'],right_on=['inv','stk'],how="inner")
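
A self-merge like the one above can return more rows than the input when the key pair repeats. A pandas sketch on made-up data (the values are assumptions chosen to show the effect):

```python
import pandas as pd

left = pd.DataFrame({
    'Invoice': ['1', '1', '2'],
    'StockCode': ['A', 'A', 'B'],
    'Quantity': [1, 2, 3],
})
right = left[['Invoice', 'StockCode', 'Quantity']].rename(columns={'Quantity': 'Qty_y'})

# Keys share the same names on both sides, so `on` is enough.
merged = left.merge(right, on=['Invoice', 'StockCode'], how='inner')
print(merged)
```

The key `('1', 'A')` appears twice on each side, so it contributes 2 x 2 = 4 rows, giving 5 rows in total from 3 input rows.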

In [0]:
'''Compute the net price per invoice item
Approach 1: vectorized
'''
%time cdf_file['Net_Price']=cdf_file.Quantity*cdf_file.Price
%time pd_file['Net_Price']=pd_file.Quantity*pd_file.Price

CPU times: user 322 ms, sys: 12.4 ms, total: 334 ms
Wall time: 526 ms
CPU times: user 17 ms, sys: 976 µs, total: 18 ms
Wall time: 22.7 ms


In [0]:
'''Compute the net price per invoice item
Approach 2: row-wise
apply_chunks takes incols (the columns required as input) and outcols (the outputs generated by the function).
Note: incols and outcols must be int/float/datetime arrays. chunks is the number of rows allotted to each block, and tpb is the number of threads per block.
CUDA parallelism is expressed in threads rather than cores. The serial-looking loop below is compiled into a CUDA kernel
and run in parallel across the chunks.
'''
def set_net_item_price_cudf(Quantity, Price, out):
    for i, (x, y) in enumerate(zip(Quantity,Price)):
        out[i] = x * y
def set_net_item_price_pd(Quantity, Price):
    return(Quantity*Price)
%time outdf_cudf=cdf_file.apply_chunks(set_net_item_price_cudf,incols=['Quantity', 'Price'],outcols=dict(out=np.float64),kwargs=dict(),chunks=16,tpb=10)
%time outdf_pandas=pd_file.apply(lambda x: set_net_item_price_pd(Quantity=x['Quantity'],Price=x['Price']),axis=1)

CPU times: user 480 ms, sys: 9.84 ms, total: 490 ms
Wall time: 490 ms
CPU times: user 16.2 s, sys: 33.8 ms, total: 16.2 s
Wall time: 16.3 s
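
To make the `chunks` and `tpb` parameters more concrete, here is a pure-Python simulation of one common CUDA pattern, the thread-strided loop, in which each thread of a block handles every `tpb`-th row of its chunk. This is only an illustration: `simulate_chunk` is a hypothetical helper, and it is not a description of what cudf does internally.

```python
def simulate_chunk(quantity, price, tpb):
    """CPU simulation of one chunk processed by `tpb` threads (illustration only)."""
    out = [0.0] * len(quantity)
    rows_per_thread = []
    for tid in range(tpb):                           # stands in for cuda.threadIdx.x
        rows = list(range(tid, len(quantity), tpb))  # tpb stands in for cuda.blockDim.x
        for i in rows:
            out[i] = quantity[i] * price[i]
        rows_per_thread.append(rows)
    return out, rows_per_thread

out, rows_per_thread = simulate_chunk([1, 2, 3, 4, 5], [2.0, 0.5, 1.0, 1.5, 3.0], tpb=2)
print(out)              # net price per row
print(rows_per_thread)  # which rows each simulated thread handled
```

On a real GPU the threads run concurrently rather than one after another, which is where the speed-up over the row-wise pandas `apply` comes from.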


In [0]:
'''Describe doesn't work with cudf if string columns are present in 0.6.1.
In older versions describe works on non-string columns'''
#print(cdf_file[['Price','Quantity','Net_Price']].describe())
print(pd_file.describe())

AttributeError: ignored
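
Since string columns are the obstacle, one workaround is to summarise only the numeric columns. The sketch below shows the idea in pandas on made-up data; applying the same column selection to a cudf frame is an untested assumption.

```python
import pandas as pd

demo = pd.DataFrame({
    'Country': ['France', 'Germany'],
    'Price': [1.5, 2.5],
    'Quantity': [2, 4],
})

# Keep only numeric columns before describing.
numeric_summary = demo.select_dtypes(include='number').describe()
print(numeric_summary)
```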

In [0]:
'''Group-By on frames'''
cdf_group_by=cdf_file.groupby('Country',as_index=False).agg({'Price':['sum','min','max'],'Quantity' : ['sum', 'max','min'],'Net_Price':['sum','max','min']})
print(cdf_group_by.head())
pd_group_by=pd_file.groupby('Country').agg({'Price' : ['sum', 'max','min'], 'Quantity' : ['sum', 'max','min'],'Net_Price':['sum','max','min']})
pd_group_by.unstack(level=0)

     Country           sum_Price            min_Price           max_Price  sum_Quantity  max_Quantity  min_Quantity ...        min_Net_Price
0  Australia  4056.3199999999965  0.29000000000000004              662.25         20053           480           -24 ...              -662.25
1    Austria              2482.8  0.12000000000000001               130.0          6479           120           -36 ...               -130.0
2    Bahrain  352.91999999999985  0.42000000000000004  14.950000000000001          1015            96           -10 ...                -42.5
3    Belgium   7226.749999999973                  0.0  1508.6499999999999         11980           120           -30 ...  -1508.6499999999999
4    Bermuda                84.7  0.21000000000000002               12.75          2798          1152             2 ...   10.200000000000001
[2 more columns]


                Country             
Price      sum  Australia                4056.32
                Austria                  2482.80
                Bahrain                   352.92
                Belgium                  7226.75
                Bermuda                    84.70
                                          ...   
Net_Price  min  USA                       -25.50
                United Arab Emirates     -503.90
                United Kingdom         -53594.36
                Unspecified             -1189.94
                West Indies                 0.65
Length: 360, dtype: float64
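
The two outputs above are laid out differently: cudf returns flat column names such as `sum_Price`, while pandas returns a two-level column index. A sketch, on made-up data, of flattening the pandas result into the same naming scheme so the frames are easier to compare:

```python
import pandas as pd

demo = pd.DataFrame({
    'Country': ['France', 'France', 'Germany'],
    'Price': [1.5, 2.5, 4.0],
    'Quantity': [2, 1, 3],
})

grouped = demo.groupby('Country').agg({'Price': ['sum', 'min', 'max'],
                                       'Quantity': ['sum', 'max', 'min']})

# Turn ('Price', 'sum') into 'sum_Price', mirroring the cudf column names.
grouped.columns = [f"{func}_{col}" for col, func in grouped.columns]
grouped = grouped.reset_index()
print(grouped)
```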

In [0]:
'''CuDF doesn't support categories natively.
Category values are implicitly converted to strings when they are represented.'''
pd_file2=pd_file.copy()
pd_file2['Country_Cat']=pd_file2.Country
pd_file2['Country_Cat']=pd_file2.Country_Cat.astype("category")
cdf_file2=cudf.DataFrame.from_pandas(pd_file2.copy())
type(cdf_file2.Country_Cat[1])

str

In [0]:
'''String functionality
CuDF supports string manipulation only through nvstrings.
The string functionality is quite similar to manipulation with Python's re module.
Simple regex block
Detailed documentation: https://rapids.readthedocs.io/projects/nvstrings/en/latest/api.html
'''
#%time string_filter_cudf=cdf_file[cdf_file.Country.str.lower().str.contains('^un',regex=True)]
%time string_filter_pandas=pd_file[pd_file.Country.str.lower().str.contains('^un',regex=True)]

AssertionError: ignored

In [0]:
'''String functionality
CuDF supports string manipulation only through nvstrings. The string functionality is quite similar to manipulation
with Python's re module.
Relatively complex regex block with the same search base
A major performance boost is expected only when the search space or the computation is the bottleneck
'''
#%time string_filter_cudf1=cdf_file[cdf_file.Country.str.lower().str.contains('^un|and[a-z]+$',regex=True)]
%time string_filter_pandas1=pd_file[pd_file.Country.str.lower().str.contains('^un|and[a-z]+$',regex=True)]

CPU times: user 488 ms, sys: 31.9 ms, total: 520 ms
Wall time: 521 ms
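
To check what the second pattern actually selects, here is a small sketch using Python's `re` module, which pandas `str.contains` relies on for regex matching. The country list is a made-up sample, not the full set of values in the dataset.

```python
import re

# Alternation binds loosest: the pattern reads as (^un) OR (and[a-z]+$).
pattern = re.compile(r'^un|and[a-z]+$')
countries = ['United Kingdom', 'Unspecified', 'Netherlands',
             'Channel Islands', 'Finland', 'USA']

matches = [c for c in countries if pattern.search(c.lower())]
print(matches)
```

`Finland` is not matched because `and[a-z]+$` requires at least one letter after `and`, whereas `Netherlands` and `Channel Islands` end in `ands`.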
