
In [0]:
'''Check whether a GPU is active. If not, enable one via
Runtime -> Change runtime type -> Hardware accelerator -> GPU -> Save.
If the assigned GPU is a Tesla K80, change the runtime again until a newer GPU is assigned: cuDF (built here against CUDA 10, our base dependency) requires a Pascal-or-newer GPU, and the K80 is a Kepler card.
'''
!nvidia-smi 

Sat Apr 11 05:11:03 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
'''Install cuDF, then restart the runtime. After the restart, re-run the cell above and continue directly with the RMM step.
'''
!pip install cudf-cuda100

Collecting cudf-cuda100
  Downloading https://files.pythonhosted.org/packages/39/a5/a40e0e0290c332cb2c27dd824c3e8f242d56af27cfdb4da92e5ebe0cf076/cudf_cuda100-0.6.1-cp36-cp36m-manylinux1_x86_64.whl (17.2MB)
     |████████████████████████████████| 17.2MB 202kB/s 
Collecting pycparser==2.19
  Downloading https://files.pythonhosted.org/packages/68/9e/49196946aee219aead1290e00d1e7fdeab8567783e83e1b9ab5585e6206a/pycparser-2.19.tar.gz (158kB)
     |████████████████████████████████| 163kB 56.8MB/s 
Collecting numba<0.42,>=0.40.0
  Downloading https://files.pythonhosted.org/packages/31/55/938f0023a4f37fe24460d46846670aba8170a6b736f1693293e710d4a6d0/numba-0.41.0-cp36-cp36m-manylinux1_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 51.5MB/s 
Collecting nvstrings-cuda100
  Downloading https://files.pythonhosted.org/packages/e5/88/5cddf81ffc06908d1cba1dca357e3eb1dab050f46881752fdb4084eb1484/nvstrings_cuda100-0.3.0.post1-cp36-cp36m-manylinux1_x86_64.

In [0]:
#### Copy the RMM shared library (librmm.so) into the current working directory.
!cp /usr/local/lib/python3.6/dist-packages/librmm.so .

In [0]:
#### NVVM and libdevice paths required by Numba
import os  
os.environ['NUMBAPRO_NVVM']='/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so'  
os.environ['NUMBAPRO_LIBDEVICE']='/usr/local/cuda-10.0/nvvm/libdevice'

In [0]:
######Libraries
import cudf
import pandas as pd
import numpy as np
from numba import cuda
import torch
import os

In [0]:
print("pandas version:",pd.__version__)
print("cudf version:",cudf.__version__)
print("numpy version:",np.__version__)
print("cuda version:",torch.version.cuda)

pandas version: 1.0.3
cudf version: 0+unknown
numpy version: 1.18.2
cuda version: 10.1


In [0]:
######Get handle of the current CUDA context to be able to compute the memory level stats
context=cuda.current_context()
cudf_mem_space=context.get_memory_info()

In [0]:
# The dataset contains purchases at an online retail store. It has 8 columns; the metadata is as follows:
# 1. Invoice         Invoice number          string
# 2. StockCode       Stock code              string
# 3. Description     Item name               string
# 4. Quantity        Quantity bought         int
# 5. InvoiceDate     Date of the invoice     date-time (yyyy-mm-dd hh:mm:ss)
# 6. Price           Unit price              float
# 7. Customer ID     ID of the customer      string
# 8. Country         Origin of the customer  string

In [0]:
from google.colab import files
uploaded = files.upload()

Saving online_retail.csv to online_retail.csv


In [0]:
'''Read files
cdf_file is the cuDF frame, i.e. held on the GPU.
pd_file is the pandas frame, i.e. held on the CPU.
Note: cuDF doesn't support reading from or writing to Excel or pickle files.'''
import io
%time cdf_file=cudf.read_csv(io.BytesIO(uploaded["online_retail.csv"]))
%time pd_file=pd.read_csv(io.BytesIO(uploaded["online_retail.csv"]))

CPU times: user 122 ms, sys: 42 ms, total: 164 ms
Wall time: 173 ms
CPU times: user 585 ms, sys: 11 ms, total: 596 ms
Wall time: 600 ms
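
The in-memory read pattern used above can be tried without an upload. A minimal sketch, assuming pandas is installed; the CSV bytes below are made up for illustration and only stand in for `uploaded["online_retail.csv"]`:

```python
import io
import pandas as pd

# Made-up stand-in for the uploaded file's bytes (not the real dataset).
raw_bytes = (b"Invoice,StockCode,Quantity,Price\n"
             b"489434,85048,12,6.95\n"
             b"489434,79323P,12,6.75\n")

# Wrapping the bytes in BytesIO gives read_csv a file-like object,
# so nothing has to be written to disk first.
sample_df = pd.read_csv(io.BytesIO(raw_bytes))
print(sample_df.dtypes)
```

The upload dictionary maps file names to raw bytes, which is why both readers above are handed `io.BytesIO(...)` rather than a path.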


In [0]:
'''Memory consumption for both cudf and pandas'''
cudf_mem_space_post_load=context.get_memory_info()
print("Memory consumed by the CuDF:",(cudf_mem_space.free-cudf_mem_space_post_load.free)/1e9,"GB")
print("Memory consumed by the pandas frame:",(sum(pd_file.memory_usage()))/1e9,"GB")

Memory consumed by the CuDF: 0.109051904 GB
Memory consumed by the pandas frame: 0.033629632 GB
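
One caveat on the pandas figure: by default `memory_usage()` counts only the pointers held in object (string) columns, not the strings themselves, so the number above probably understates the CPU footprint. A small sketch on a made-up frame, assuming pandas is installed (the column values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up frame with one object (string) column and one integer column.
demo = pd.DataFrame({
    "Country": pd.Series(["United Kingdom", "France", "Germany"] * 1000, dtype=object),
    "Quantity": range(3000),
})

shallow_bytes = demo.memory_usage().sum()        # object column counted as pointers only
deep_bytes = demo.memory_usage(deep=True).sum()  # also counts the Python string objects

print("shallow:", shallow_bytes / 1e9, "GB")
print("deep:   ", deep_bytes / 1e9, "GB")
```

Using `deep=True` on `pd_file` would give a fairer comparison against the GPU-side number.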


In [0]:
'''Sub-setting Data-1
Using loc'''
cdf_file_subset=cdf_file.loc[1:1000]
pd_file_subset=pd_file.loc[1:1000]

In [0]:
'''Sub-setting Data-2
Using iloc'''
cdf_file_subset=cdf_file.iloc[1:5]
pd_file_subset=pd_file.iloc[1:5]
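
The two sub-setting cells above slice differently: in pandas, `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. A minimal pandas-only sketch on made-up values (cuDF is expected to mirror this behaviour, but that is not verified here):

```python
import pandas as pd

# Made-up values with the default 0..5 integer index.
demo = pd.DataFrame({"Quantity": [12, 6, 24, 48, 2, 10]})

by_label = demo.loc[1:4]      # labels 1, 2, 3, 4 (end label included)
by_position = demo.iloc[1:4]  # positions 1, 2, 3 (end position excluded)

print(len(by_label), len(by_position))
```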

In [0]:
#### Note: cudf querying runs on the latest cudf, i.e. 0.6.1.
#cdf_file_query=cdf_file[cdf_file.Price>10]
pd_file_query=pd_file[pd_file.Price>10]

In [0]:
'''Frequency counts
1. Using value_counts
2. Using groupby as a substitute for the value_counts method

cuDF doesn't support value_counts on a string column.
Similarly, other standard pandas methods for string data types are unsupported unless we make use of the nvstrings package.
cuDF documentation: https://rapidsai.github.io/projects/cudf/en/latest/api.html
'''
%time pd_count = pd_file['Country'].value_counts()
%time pd_count1 = pd_file.groupby(['Country'])['StockCode'].count() 
#%time cudf_count = cdf_file['Country'].value_counts()
#### Note: cudf querying runs on the latest cudf, i.e. 0.6.1.
#%time cudf_count = cdf_file.groupby(['Country'])['StockCode'].count()

CPU times: user 55.9 ms, sys: 0 ns, total: 55.9 ms
Wall time: 56.3 ms
CPU times: user 58.5 ms, sys: 1.64 ms, total: 60.2 ms
Wall time: 60.1 ms
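
To see why a group-by can stand in for `value_counts`, here is a small pandas-only sketch on made-up rows. Note that `count()` skips missing values in the counted column, so `size()` is the closer equivalent when `StockCode` may contain nulls:

```python
import pandas as pd

# Made-up rows for illustration.
demo = pd.DataFrame({
    "Country": ["France", "Germany", "France", "France", "Germany"],
    "StockCode": ["A1", "B2", "C3", "A1", "D4"],
})

via_value_counts = demo["Country"].value_counts()
via_groupby = demo.groupby("Country")["StockCode"].count()

# Same counts; value_counts orders by frequency, groupby orders by key.
print(via_value_counts.to_dict())
print(via_groupby.to_dict())
```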


In [0]:
'''Sorting data'''
cdf_file=cdf_file.sort_values(by=['Invoice','StockCode'],ascending=True)
pd_file=pd_file.sort_values(by=['Invoice','StockCode'],ascending=True)

In [0]:
'''Extending frames'''
cdf_file1=cudf.concat([cdf_file,cdf_file],ignore_index=True)
pd_file1=pd.concat([pd_file,pd_file],ignore_index=True)

In [0]:
'''Merging frames'''
######## Note: to join cudf frames, the key columns must have the same names in both frames.
#cdf_file1=cdf_file.merge(cdf_file[['Invoice','StockCode','Quantity']].rename(columns={'Quantity':'Qty_y','Invoice':'inv','StockCode':"stk"}),left_on=['Invoice','StockCode'],right_on=['inv','stk'],how="inner")
cdf_file1=cdf_file.merge(cdf_file[['Invoice','StockCode','Quantity']].rename(columns={'Quantity':'Qty_y'}),on=['Invoice','StockCode'],how="inner")
pd_file1=pd_file.merge(pd_file[['Invoice','StockCode','Quantity']].rename(columns={'Quantity':'Qty_y','Invoice':'inv','StockCode':"stk"}),left_on=['Invoice','StockCode'],right_on=['inv','stk'],how="inner")
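
The two merge styles used above differ only in how the key columns are named. A pandas-only sketch on a made-up two-row frame shows the effect on the result columns:

```python
import pandas as pd

# Made-up rows for illustration.
left = pd.DataFrame({"Invoice": ["489434", "489435"],
                     "StockCode": ["85048", "22350"],
                     "Quantity": [12, 6]})
right = left.rename(columns={"Quantity": "Qty_y"})

# Option 1: shared key names, joined with on=
same_names = left.merge(right, on=["Invoice", "StockCode"], how="inner")

# Option 2: differently named keys, joined with left_on= / right_on=
renamed = right.rename(columns={"Invoice": "inv", "StockCode": "stk"})
diff_names = left.merge(renamed, left_on=["Invoice", "StockCode"],
                        right_on=["inv", "stk"], how="inner")

print(list(same_names.columns))
print(list(diff_names.columns))
```

With `left_on`/`right_on`, pandas keeps both sets of key columns in the result, so the shared-name form gives a narrower frame.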

In [0]:
'''Computation of invoice,item level net price
Approach-1:Vectorized approach
'''
%time cdf_file['Net_Price']=cdf_file.Quantity*cdf_file.Price
%time pd_file['Net_Price']=pd_file.Quantity*pd_file.Price

CPU times: user 292 ms, sys: 8.92 ms, total: 301 ms
Wall time: 484 ms
CPU times: user 11.5 ms, sys: 185 µs, total: 11.7 ms
Wall time: 16.3 ms


In [0]:
'''Computation of invoice,item level net price
Approach-2: Row-wise
apply_chunks: incols are the columns required as input; outcols are the output columns generated by the processing.
Note: incols and outcols must be int/float/datetime arrays. chunks is the number of rows allotted to each block and tpb is the number of threads per block.
CUDA schedules work as threads rather than cores. The loop over each chunk is automatically unrolled into its parallel
variant by the compiler.
'''
def set_net_item_price_cudf(Quantity, Price, out):
    for i, (x, y) in enumerate(zip(Quantity,Price)):
        out[i] = x * y
def set_net_item_price_pd(Quantity, Price):
    return(Quantity*Price)
%time outdf_cudf=cdf_file.apply_chunks(set_net_item_price_cudf,incols=['Quantity', 'Price'],outcols=dict(out=np.float64),kwargs=dict(),chunks=16,tpb=10)
outdf_pandas=pd_file
%time outdf_pandas['out']=outdf_pandas.apply(lambda x: set_net_item_price_pd(Quantity=x['Quantity'],Price=x['Price']),axis=1)

CPU times: user 280 ms, sys: 8.02 ms, total: 288 ms
Wall time: 290 ms
CPU times: user 11.9 s, sys: 21.7 ms, total: 11.9 s
Wall time: 11.9 s
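
The chunking idea can be emulated on the CPU with plain Python lists. This is only a sketch of the concept, not how cuDF implements `apply_chunks`; the helper `apply_in_chunks` and the sample values are hypothetical:

```python
def set_net_item_price(quantity, price, out):
    # Same loop body as the kernel above, run on plain Python lists.
    for i, (x, y) in enumerate(zip(quantity, price)):
        out[i] = x * y

def apply_in_chunks(func, quantity, price, chunk_size):
    """Hypothetical CPU stand-in: call func once per consecutive chunk of rows."""
    result = [0.0] * len(quantity)
    for start in range(0, len(quantity), chunk_size):
        stop = min(start + chunk_size, len(quantity))
        out = [0.0] * (stop - start)
        func(quantity[start:stop], price[start:stop], out)
        result[start:stop] = out
    return result

quantities = [12, 6, 24, 48, 2]        # made-up values
prices = [0.5, 1.25, 2.0, 0.25, 4.0]   # made-up values
net = apply_in_chunks(set_net_item_price, quantities, prices, chunk_size=2)
print(net)
```

On the GPU the chunks are processed in parallel rather than one after another, which is where the speed-up over the pandas `apply` comes from.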


In [0]:
###### The current cudf version doesn't support drop_duplicates, so the code below is not executable. It should run on cudf versions > 0.5.
'''%%time
def create_master(series,name_to_refer,name_to_id):
    master=cudf.DataFrame()
    master[name_to_refer]=series
    master=master.drop_duplicates()
    master[name_to_id]=np.arange(0,(len(master)))
    return(master)
invoice_master=create_master(cdf_file.Invoice,name_to_refer="Invoice",name_to_id="Invoice_ID")
stock_master=create_master(cdf_file.StockCode,name_to_refer="StockCode",name_to_id="StockCode_ID")
cdf_file1=cdf_file[['Invoice','StockCode','Quantity','Price']]
cdf_file1=cdf_file1.merge(invoice_master,how="left")
cdf_file1=cdf_file1.merge(stock_master,how="left")'''

AttributeError: ignored

In [0]:
###### The current cudf version doesn't support drop_duplicates, so the code below is not executable. It should run on cudf versions > 0.5.
'''Computation of invoice,item level net price
Approach-3: Group-wise
apply_grouped: incols are the columns required as input; outcols are the output columns generated by the processing.
Note: incols and outcols must be int/float/datetime arrays. In principle, the chunk size is now variable, determined by the number of rows in each group. Useful for row-wise operations on groups.
'''
'''%%time
def grouped_summary(StockCode_ID,Price,Quantity,Net_price):
    for i in range(cuda.threadIdx.x, len(StockCode_ID), cuda.blockDim.x):
        Net_price[i] = Price[i] * Quantity[i]
cdf_file2=cdf_file1[['StockCode_ID','Invoice_ID','Quantity','Price']].groupby(['StockCode_ID'], method='cudf').apply_grouped(grouped_summary,incols=['StockCode_ID','Price','Quantity'],outcols={'Net_price': np.float64},tpb=600)
cdf_file2=cdf_file2.merge(stock_master,how="left")
cdf_file2=cdf_file2.merge(invoice_master,how="left")'''
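
The loop `range(cuda.threadIdx.x, len(...), cuda.blockDim.x)` in the kernel above is a strided loop: each thread starts at its own index and jumps ahead by the block size. A pure-Python sketch that simulates the threads one at a time (the function name `covered_indices` and the sizes are hypothetical) shows that together they touch every row exactly once:

```python
def covered_indices(n_rows, block_dim):
    """Rows visited by each simulated thread of a single block."""
    return {thread_idx: list(range(thread_idx, n_rows, block_dim))
            for thread_idx in range(block_dim)}

plan = covered_indices(n_rows=10, block_dim=4)
for thread_idx, rows in plan.items():
    print(thread_idx, rows)

# Flatten and sort to confirm full, non-overlapping coverage.
all_rows = sorted(i for rows in plan.values() for i in rows)
```

This is why the kernel works for any group size, including groups larger than `tpb`.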

In [0]:
print("apply_chunks:cudf")
print(outdf_cudf[['Invoice','StockCode','out']].sort_values(by=["StockCode","Invoice"]))
#print("apply_grouped:cudf")
#print(cdf_file2[cdf_file2.Invoice=="489434"][['Invoice','StockCode','Net_price']].sort_values(by=["StockCode"]))
print("apply:pandas")
print(outdf_pandas[['Invoice','StockCode','out']].sort_values(by=["StockCode","Invoice"]))

apply_chunks:cudf
    Invoice  StockCode                 out
41   489437      10002  10.200000000000001
7653   490063      10002  0.8500000000000001
7668   490063      10002  0.8500000000000001
9382   490136      10002  0.8500000000000001
9492   490140      10002  3.4000000000000004
9610   490144      10002  10.200000000000001
10614   490229      10002  10.200000000000001
11192   490295      10002  40.800000000000004
12036   490362      10002  0.8500000000000001
12808   490458      10002  40.800000000000004
[525451 more rows]
apply:pandas
       Invoice     StockCode    out
41      489437         10002  10.20
7653    490063         10002   0.85
7668    490063         10002   0.85
9382    490136         10002   0.85
9492    490140         10002   3.40
...        ...           ...    ...
298839  518487  gift_0001_90   0.00
96608   498492             m   2.55
96609   498492             m   3.40
157226  504396             m   4.00
228780  511509             m   2.55

[525461 rows x 3 colum
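
The two printouts above differ in the last digits (for example `10.200000000000001` against `10.20`). pandas rounds values for display, so the printed difference alone does not show whether the stored values differ. When comparing GPU and CPU results, a tolerance-based check is safer than exact equality. A small standard-library sketch using the two printed values:

```python
import math

cudf_value = 10.200000000000001   # as printed in the apply_chunks output
pandas_value = 10.20              # as printed in the pandas output

exactly_equal = cudf_value == pandas_value             # bit-for-bit comparison
close_enough = math.isclose(cudf_value, pandas_value)  # relative tolerance of 1e-09

print(exactly_equal, close_enough)
print(f"{cudf_value:.2f}")  # two-decimal display hides the trailing digit
```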

In [0]:
'''Describe doesn't work with cudf if string columns are present in 0.6.1.
In older versions describe works on non-string columns'''
#print(cdf_file[['Price','Quantity','Net_Price']].describe())
print(pd_file.describe())

            Quantity          Price  ...      Net_Price            out
count  525461.000000  525461.000000  ...  525461.000000  525461.000000
mean       10.337667       4.688834  ...      18.154506      18.154506
std       107.424110     146.126914  ...     160.333083     160.333083
min     -9600.000000  -53594.360000  ...  -53594.360000  -53594.360000
25%         1.000000       1.250000  ...       3.750000       3.750000
50%         3.000000       2.100000  ...       9.950000       9.950000
75%        10.000000       4.210000  ...      17.700000      17.700000
max     19152.000000   25111.090000  ...   25111.090000   25111.090000

[8 rows x 5 columns]


In [0]:
'''Group-By on frames'''
cdf_group_by=cdf_file.groupby('Country',as_index=False).agg({'Price':['sum','min','max'],'Quantity' : ['sum', 'max','min'],'Net_Price':['sum','max','min']})
print(cdf_group_by.head())
pd_group_by=pd_file.groupby('Country').agg({'Price' : ['sum', 'max','min'], 'Quantity' : ['sum', 'max','min'],'Net_Price':['sum','max','min']})
pd_group_by.unstack(level=0)

     Country          sum_Price            min_Price           max_Price  sum_Quantity  max_Quantity  min_Quantity ...        min_Net_Price
0  Australia  4056.319999999996  0.29000000000000004              662.25         20053           480           -24 ...              -662.25
1    Austria  2482.800000000004  0.12000000000000001               130.0          6479           120           -36 ...               -130.0
2    Bahrain  352.9199999999998  0.42000000000000004  14.950000000000001          1015            96           -10 ...                -42.5
3    Belgium  7226.749999999972                  0.0  1508.6499999999999         11980           120           -30 ...  -1508.6499999999999
4    Bermuda               84.7  0.21000000000000002               12.75          2798          1152             2 ...   10.200000000000001
[2 more columns]


                Country             
Price      sum  Australia                4056.32
                Austria                  2482.80
                Bahrain                   352.92
                Belgium                  7226.75
                Bermuda                    84.70
                                          ...   
Net_Price  min  USA                       -25.50
                United Arab Emirates     -503.90
                United Kingdom         -53594.36
                Unspecified             -1189.94
                West Indies                 0.65
Length: 360, dtype: float64
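
The cudf result above has flat column names such as `sum_Price`, while pandas returns a two-level column index. To compare the two frames side by side, the pandas columns can be flattened to the same naming. A pandas-only sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration.
demo = pd.DataFrame({
    "Country": ["France", "France", "Germany"],
    "Price": [1.0, 3.0, 2.5],
    "Quantity": [10, 2, 4],
})

grouped = demo.groupby("Country").agg({"Price": ["sum", "min", "max"],
                                       "Quantity": ["sum", "max", "min"]})

# Flatten ('Price', 'sum') -> 'sum_Price' to mirror the cudf column names.
grouped.columns = [f"{func}_{col}" for col, func in grouped.columns]
flat = grouped.reset_index()
print(flat)
```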

In [0]:
'''Categorical columns are carried over from pandas via from_pandas'''
pd_file2=pd_file.copy()
pd_file2['Country_Cat']=pd_file2.Country.copy()
pd_file2['Country_Cat']=pd_file2.Country_Cat.astype("category")
#print(pd_file2.columns)

cdf_file2=cudf.DataFrame.from_pandas(pd_file2.copy())

type(cdf_file2.Country_Cat.cat)
print("Categorical Labels")
print(cdf_file2.Country_Cat.cat.categories)
print(cdf_file2.Country_Cat.cat.codes)

Categorical Labels
('Australia', 'Austria', 'Bahrain', 'Belgium', 'Bermuda', 'Brazil', 'Canada', 'Channel Islands', 'Cyprus', 'Denmark', 'EIRE', 'Finland', 'France', 'Germany', 'Greece', 'Hong Kong', 'Iceland', 'Israel', 'Italy', 'Japan', 'Korea', 'Lebanon', 'Lithuania', 'Malta', 'Netherlands', 'Nigeria', 'Norway', 'Poland', 'Portugal', 'RSA', 'Singapore', 'Spain', 'Sweden', 'Switzerland', 'Thailand', 'USA', 'United Arab Emirates', 'United Kingdom', 'Unspecified', 'West Indies')
0    37
1    37
2    37
3    37
4    37
5    37
6    37
7    37
8    37
9    37
[525451 more rows]
dtype: int8


In [0]:
'''cuDF doesn't support categories natively.
Categorical values are implicitly converted to strings when they are represented.'''
pd_file2=pd_file.copy()
pd_file2['Country_Cat']=pd_file2.Country
pd_file2['Country_Cat']=pd_file2.Country_Cat.astype("category")
cdf_file2=cudf.DataFrame.from_pandas(pd_file2.copy())
type(cdf_file2.Country_Cat[1])

str
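
For reference, this is what the labels and codes printed earlier mean on the pandas side. A pandas-only sketch on a made-up series: each row stores a small integer code that indexes into the sorted list of unique labels.

```python
import pandas as pd

# Made-up values for illustration.
countries = pd.Series(["United Kingdom", "France", "United Kingdom", "Australia"])
as_category = countries.astype("category")

labels = list(as_category.cat.categories)   # unique labels, sorted
codes = list(as_category.cat.codes)         # one integer code per row
code_dtype = str(as_category.cat.codes.dtype)

print(labels)
print(codes)
print(code_dtype)
```

Storing one small integer per row instead of a full string is what makes categorical columns compact.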

In [0]:
'''String functionality
cuDF supports string manipulation only through nvstrings.
The string functionality is quite similar to manipulation with Python's re module.
Simple regex block
Detailed documentation: https://rapids.readthedocs.io/projects/nvstrings/en/latest/api.html
'''
#%time string_filter_cudf=cdf_file[cdf_file.Country.str.lower().str.contains('^un',regex=True)]
%time string_filter_pandas=pd_file[pd_file.Country.str.lower().str.contains('^un',regex=True)]

CPU times: user 455 ms, sys: 10.9 ms, total: 466 ms
Wall time: 469 ms


In [0]:
'''String functionality
cuDF supports string manipulation only through nvstrings. The string functionality is quite similar to manipulation
with Python's re module.
Relatively complex regex block with the same search base
A major performance boost appears only when the search space/computation space is the bottleneck
'''
#%time string_filter_cudf1=cdf_file[cdf_file.Country.str.lower().str.contains('^un|and[a-z]+$',regex=True)]
%time string_filter_pandas1=pd_file[pd_file.Country.str.lower().str.contains('^un|and[a-z]+$',regex=True)]

CPU times: user 425 ms, sys: 22 ms, total: 447 ms
Wall time: 447 ms
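
To make the second pattern concrete: alternation binds loosest, so `'^un|and[a-z]+$'` reads as "starts with `un`" or "ends with `and` followed by at least one more letter". A standard-library sketch on a few lowercase country names taken from the category labels printed earlier (pandas `str.contains` uses the same search semantics):

```python
import re

pattern = re.compile(r"^un|and[a-z]+$")

names = ["united kingdom", "unspecified", "netherlands", "channel islands",
         "thailand", "poland", "usa"]

# search() succeeds if the pattern matches anywhere in the string.
matched = [name for name in names if pattern.search(name)]
print(matched)
```

Names ending exactly in `and`, such as `thailand`, are not matched because `[a-z]+` needs at least one letter after `and`.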


In [0]:
%time cdf_file['pattern_extract']=cdf_file.Description.str.extract("(\\s+\\d{2,})")
%time pd_file['pattern_extract']=pd_file.Description.str.extract("(\\s+\\d{2,})")
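
The extraction pattern `(\s+\d{2,})` captures the first run of whitespace followed by two or more digits. A standard-library sketch of the same pattern; the descriptions are made up in the style of the Description column, and `None` here plays the role of the missing value that pandas returns when nothing matches:

```python
import re

pattern = re.compile(r"(\s+\d{2,})")

# Made-up descriptions for illustration.
descriptions = ["15CM CHRISTMAS GLASS BALL 20 LIGHTS",
                "PINK CHERRY LIGHTS",
                "SET OF 6 MUGS"]

extracted = []
for text in descriptions:
    match = pattern.search(text)          # first match only, like str.extract
    extracted.append(match.group(1) if match else None)

print(extracted)
```

The captured group keeps its leading whitespace, and a leading number such as `15CM` is skipped because no whitespace precedes it.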