# Introduction to RAPIDS

The RAPIDS data science framework is a GPU-accelerated collection of libraries for executing end-to-end data science pipelines entirely on the GPU. It is designed to make effective use of the computational capabilities of GPUs through optimized NVIDIA CUDA® primitives and high-bandwidth GPU memory. The primary objective of RAPIDS is to accelerate the individual parts of a typical data science workflow, and thereby the complete end-to-end workflow across data preparation and machine learning.

Read [this Medium article](https://medium.com/future-vision/what-is-rapids-ai-7e552d80a1d2) to understand how RAPIDS works.
<br><br>
If you have worked with pandas and NumPy before, most of this tutorial will seem very familiar to you. If you haven't, do not worry. This is a great place to start!

## SETUP

**Note that pandas is a data analysis and manipulation tool built on top of the Python programming language for tasks such as loading, joining, aggregating, and filtering data. cuDF is a GPU DataFrame library that offers similar functionality, accelerated on the GPU.**

In [None]:
import cudf
import pandas as pd

Before we dive in, please make sure to check out the official documentation [here](https://docs.rapids.ai/api) to get an overall idea. Additionally, refer to the [cheatsheet](https://rapids.ai/assets/files/cheatsheet.pdf) for a crisp and clear representation of the functionalities provided by RAPIDS.

## SECTION 1: CUDF BASICS
### cuDF DataFrame
First, we will look at how to create dataframes in cuDF. You can build a dataframe in multiple ways, as shown in the official documentation. Let us first initialize an empty dataframe object.

In [None]:
gdf = cudf.DataFrame()

Now that we have a cudf.DataFrame object, we will fill it with values. Let us add values by defining them column by column.

In [None]:
#creates a column named 'index' with the values 0, 1, 2, 3, 4
gdf['index'] = [0, 1, 2, 3, 4]

#creates a column named 'value' with the values 10, 20, 30, 40, 50
gdf['value'] = [10, 20, 30, 40, 50]

#displays the current cudf dataframe
gdf

We can also build the dataframe from a list of rows, with each row given as a tuple.

In [None]:
#the first parameter is the data and the second parameter is the name of the columns
df = cudf.DataFrame([
    (5, 60),
    (6, 70),
    (7, 80),
],
columns = ['index', 'value'])
df
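
A third construction route, sketched here as an illustration, is to pass a dictionary that maps column names to lists of values. The alias `xdf` and the sample values are assumptions made for this sketch; the import falls back to pandas when cuDF is not installed, since both libraries accept this form of constructor.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch runs without cuDF

# Each dictionary key becomes a column name; each list holds that column's values.
frame = xdf.DataFrame({
    'index': [0, 1, 2],
    'value': [10, 20, 30],
})

print(frame.shape)          # (number of rows, number of columns)
print(list(frame.columns))  # column names in insertion order
```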

## SECTION 2: CUDF using Netflix Movie Dataset
Now that we have a basic understanding of how to work with a cuDF DataFrame, let us create one from a dataset. We will use the dataset from [here](https://www.kaggle.com/shivamb/netflix-shows) to get hands-on with cuDF.

### Reading a CSV file
Import the netflix_titles.csv dataset into a cuDF dataframe.

In [None]:
gdf = cudf.read_csv('/shared-data/Apr20/data/Module3/netflix_titles.csv')
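
After loading a file it is usually worth checking what arrived. The sketch below reads a tiny in-memory CSV instead of the real file so that it runs anywhere; the two columns and their values are made up for illustration, `xdf` is an assumed alias, and the import falls back to pandas when cuDF is not installed.

```python
from io import StringIO

try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch runs without cuDF

# Made-up CSV content standing in for a file on disk.
csv_text = "title,release_year\nShow A,2011\nShow B,2019\n"

sample = xdf.read_csv(StringIO(csv_text))

print(sample.head())   # first rows of the frame
print(sample.dtypes)   # inferred type of each column
```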

### Converting a Pandas DataFrame
Alternatively, you can read the data using pandas and convert the resulting dataframe to a cuDF dataframe.

In [None]:
#creates a pandas dataframe
pdf = pd.read_csv('/shared-data/Apr20/data/Module3/netflix_titles.csv')

#creates cudf dataframe from pandas dataframe
gdf = cudf.DataFrame.from_pandas(pdf)

#display dataframe
gdf

Let us now delve into some questions on the dataset itself!

### 1. Dropping columns
__<span style="color:red">Exercise 1:</span>__ This dataset has many missing values, primarily in the director and cast columns. Therefore, we will drop these two columns from our dataframe.
Display gdf after dropping to verify that the columns have been removed.
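
As a hint rather than a solution, here is a minimal sketch of dropping a column on a small made-up frame. The frame `toy` and the alias `xdf` are assumptions for illustration; the import falls back to pandas when cuDF is not installed, and `drop(columns=...)` is available in both libraries.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch runs without cuDF

# Made-up frame with one sparsely filled column.
toy = xdf.DataFrame({
    'title': ['A', 'B'],
    'director': [None, 'X'],
    'year': [2011, 2019],
})

# drop returns a new frame; the original is left unchanged.
trimmed = toy.drop(columns=['director'])

print(list(trimmed.columns))
```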

### 2. Missing values
__<span style="color:red">Exercise 2:</span>__ The dataset needs to be cleaned first. It contains several NA values that carry no information, so we can choose to drop the records that contain them. Create a clean dataframe with no NA values.
Display gdf after dropping to verify that the NA values have been removed.
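
The idea can be sketched on a small made-up frame: count the missing entries per column, then keep only the complete rows. The frame `toy` and the alias `xdf` are assumptions for illustration, and the import falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch runs without cuDF

# Made-up frame with one missing value.
toy = xdf.DataFrame({
    'title': ['A', 'B', 'C'],
    'country': ['US', None, 'IN'],
})

# Number of missing entries in each column.
print(toy.isna().sum())

# dropna removes every row that contains at least one missing value.
clean = toy.dropna()

print(len(toy), len(clean))
```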

### 3. Querying DataFrame
__<span style="color:red">Exercise 3:</span>__ Find the shows that were released in the year 2011.
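
One common way to filter rows is a boolean mask: compare a column against a value, then index the frame with the result. The sketch below uses a made-up frame with a hypothetical `year` column, so the real dataset's column names still need to be looked up; the import falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch runs without cuDF

# Made-up frame; 'year' is a hypothetical column name.
toy = xdf.DataFrame({
    'title': ['A', 'B', 'C'],
    'year': [2011, 2019, 2011],
})

# The comparison yields one True/False value per row.
mask = toy['year'] == 2011

# Indexing with the mask keeps only the rows marked True.
hits = toy[mask]

print(hits)
```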

### 4. Sort values
__<span style="color:red">Exercise 4:</span>__ Sort the dataframe by the year each record was released (latest first). Refer to the sort_values function, which takes the target column name and the sort order.
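
A minimal sketch of a descending sort on a made-up frame follows. The `ascending` keyword controls the sort order; the frame `toy`, its hypothetical `year` column, and the alias `xdf` are assumptions for illustration, and the import falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch runs without cuDF

# Made-up frame; 'year' is a hypothetical column name.
toy = xdf.DataFrame({
    'title': ['A', 'B', 'C'],
    'year': [2011, 2019, 2015],
})

# ascending=False places the largest values first.
newest_first = toy.sort_values('year', ascending=False)

print(newest_first)
```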

### 5. GroupBy
__<span style="color:red">Exercise 5:</span>__ Find the number of movies and shows using a GroupBy and size.
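
To illustrate the pattern without giving away the answer, the sketch below groups a made-up frame by a hypothetical `kind` column and counts the rows in each group. The frame `toy` and the alias `xdf` are assumptions for illustration, and the import falls back to pandas when cuDF is not installed.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch runs without cuDF

# Made-up frame; 'kind' is a hypothetical category column.
toy = xdf.DataFrame({
    'kind': ['film', 'series', 'film'],
    'title': ['A', 'B', 'C'],
})

# groupby collects rows that share a 'kind'; size counts the rows per group.
counts = toy.groupby('kind').size()

print(counts)
```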