# Introduction to RAPIDS

The RAPIDS data science framework is a GPU-accelerated collection of libraries for running end-to-end data science pipelines entirely on the GPU. It is designed to make effective use of GPU compute through optimized NVIDIA CUDA® primitives and high-bandwidth GPU memory. The primary objective of RAPIDS is to accelerate the individual parts of a typical data science workflow, and thereby the complete end-to-end workflow, from data preparation to machine learning.

Read through [this](https://medium.com/future-vision/what-is-rapids-ai-7e552d80a1d2) Medium article to understand how RAPIDS works.
<br><br>
If you have worked with pandas and NumPy before, most of this tutorial will feel familiar. If you haven't, do not worry. This is a great place to start!

## SETUP

**Note that pandas is a data analysis and manipulation tool built on top of the Python programming language for tasks such as loading, joining, aggregating, and filtering data. cuDF is a GPU DataFrame library that offers similar functionality and runs these operations on the GPU, often much faster.**

In [1]:
import cudf
import pandas as pd

Before we dive in, please check out the official documentation [here](https://docs.rapids.ai/api) to get an overall idea. Additionally, refer to the [cheatsheet](https://rapids.ai/assets/files/cheatsheet.pdf) for a concise overview of the functionality provided by RAPIDS.

## SECTION 1: CUDF BASICS
### cuDF DataFrame
First, we will look at how to create dataframes in cuDF. You can build a dataframe in several ways, as shown in the official documentation. Let us start by initializing an empty dataframe object.

In [2]:
gdf = cudf.DataFrame()

Now that we have a `cudf.DataFrame` object, we can fill it with values. Let us add data by defining it column by column.

In [3]:
#creates a column named 'index' with the values 0, 1, 2, 3, 4
gdf['index'] = [0, 1, 2, 3, 4]

#creates a column named 'value' with the values 10, 20, 30, 40, 50
gdf['value'] = [10, 20, 30, 40, 50]

#displays the current cudf dataframe
gdf

Unnamed: 0,index,value
0,0,10
1,1,20
2,2,30
3,3,40
4,4,50
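
A dictionary that maps column names to lists is another common way to build the same frame in a single step. The following is a minimal sketch rather than part of the original notebook: the `xdf` alias, the `gdf_from_dict` name, and the fallback to pandas are assumptions, added so that the snippet also runs on a machine without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this tutorial
except Exception:
    # Assumption: fall back to pandas when cuDF cannot be imported (e.g. no GPU).
    import pandas as xdf

# Each dictionary key becomes a column name, each list the values of that column.
gdf_from_dict = xdf.DataFrame({
    'index': [0, 1, 2, 3, 4],
    'value': [10, 20, 30, 40, 50],
})

print(gdf_from_dict.shape)          # (number of rows, number of columns)
print(list(gdf_from_dict.columns))  # column names in insertion order
```

Building the frame in one call avoids creating an empty object first, which can be convenient when all columns are known up front.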


We can also build the dataframe from a list of rows, with each row given as a tuple.

In [4]:
#the first argument is the row data; the columns keyword gives the column names
df = cudf.DataFrame([
    (5, 60),
    (6, 70),
    (7, 80),
],
columns = ['index', 'value'])
df

Unnamed: 0,index,value
0,5,60
1,6,70
2,7,80


## SECTION 2: CUDF using Netflix Movie Dataset
Now that we have a basic understanding of how to work with a cuDF DataFrame, let us create one from a dataset. We will be using the dataset from [here](https://www.kaggle.com/shivamb/netflix-shows) to get hands-on with cuDF.

### Reading a CSV file
Import the netflix_titles.csv dataset into a cuDF dataframe.

In [5]:
gdf = cudf.read_csv('/shared-data/Apr20/data/Module3/netflix_titles.csv')

### Converting a Pandas DataFrame
Alternatively, you can read the data with pandas and then convert the resulting dataframe to a cuDF dataframe.

In [6]:
#creates a pandas dataframe
pdf = pd.read_csv('/shared-data/Apr20/data/Module3/netflix_titles.csv')

#creates cudf dataframe from pandas dataframe
gdf = cudf.DataFrame.from_pandas(pdf)

#display dataframe
gdf

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
...,...,...,...,...,...,...,...,...,...,...,...,...
7782,s7783,Movie,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Movie,Zulu Man in Japan,,Nasty C,,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,s7786,TV Show,Zumbo's Just Desserts,,"Adriano Zumbo, Rachel Khoo",Australia,"October 31, 2020",2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...


Let us now delve into some questions on the dataset itself!

### 1. Dropping columns
__<span style="color:red">Exercise 1:</span>__ This dataset has a lot of missing values, primarily in the `director` and `cast` columns. Therefore, we will drop these two columns from our dataframe.
Display `gdf` after dropping to verify that the columns have been removed.

In [7]:
gdf.drop(columns=['director','cast'],inplace=True)
gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
...,...,...,...,...,...,...,...,...,...,...
7782,s7783,Movie,Zozo,"Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Movie,Zulu Man in Japan,,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,s7786,TV Show,Zumbo's Just Desserts,Australia,"October 31, 2020",2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...
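
The cell above modifies `gdf` in place. As a hedged illustration of the alternative, the sketch below calls `drop` without `inplace=True`, which returns a new dataframe and leaves the original untouched. The small `shows` frame is made-up toy data, and the `xdf` alias with its pandas fallback is an assumption so the snippet can run without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this tutorial
except Exception:
    # Assumption: fall back to pandas when cuDF cannot be imported (e.g. no GPU).
    import pandas as xdf

# Hypothetical toy data standing in for the Netflix dataset.
shows = xdf.DataFrame({
    'title': ['A', 'B', 'C'],
    'director': [None, 'X', None],
    'release_year': [2011, 2015, 2020],
})

# Without inplace=True, drop returns a new frame; 'shows' keeps all its columns.
trimmed = shows.drop(columns=['director'])

print(list(shows.columns))
print(list(trimmed.columns))
```

Returning a new frame makes it easy to keep the original data around for comparison, at the cost of holding two frames in memory.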


### 2. Missing values
__<span style="color:red">Exercise 2:</span>__ The dataset needs to be cleaned first. There are several NA values in the data that add no value, so we can choose to drop these records. Create a clean dataframe with no NA values.
Display `gdf` after dropping to verify that the NA values have been removed.

In [8]:
gdf=gdf.dropna()
gdf

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
...,...,...,...,...,...,...,...,...,...,...
7781,s7782,Movie,Zoom,United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
7782,s7783,Movie,Zozo,"Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7785,s7786,TV Show,Zumbo's Just Desserts,Australia,"October 31, 2020",2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...
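
Before dropping records, it can help to see how many values are missing in each column, and to drop rows based on selected columns only. The sketch below shows one way to do this with `isna`, `sum`, and the `subset` argument of `dropna`. The `shows` frame is made-up toy data, and the `xdf` alias with its pandas fallback is an assumption so the snippet can run without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this tutorial
except Exception:
    # Assumption: fall back to pandas when cuDF cannot be imported (e.g. no GPU).
    import pandas as xdf

# Hypothetical toy data with a few missing entries.
shows = xdf.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'country': ['Brazil', None, 'India', None],
    'rating': ['TV-MA', 'R', None, 'PG'],
})

# Number of missing entries in each column.
missing_per_column = shows.isna().sum()
print(missing_per_column)

# Drop only the rows where 'country' is missing; other NA values are kept.
has_country = shows.dropna(subset=['country'])
print(len(has_country))

# Drop every row that contains at least one missing value.
complete = shows.dropna()
print(len(complete))
```

Restricting `dropna` to a subset of columns keeps more records when only some columns matter for the analysis.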


### 3. Querying DataFrame
__<span style="color:red">Exercise 3:</span>__ Find the shows that were released in the year 2011.

In [9]:
expr = "release_year==2011"
gdf.query(expr)

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
2,s3,Movie,23:59,Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
104,s105,Movie,30 Minutes or Less,United States,"January 1, 2021",2011,R,83 min,"Action & Adventure, Comedies",Two crooks planning a bank heist wind up abduc...
122,s123,Movie,50/50,United States,"June 1, 2019",2011,R,100 min,"Comedies, Dramas, Independent Movies",An otherwise healthy twentysomething has a com...
136,s137,Movie,7 Khoon Maaf,India,"August 2, 2018",2011,TV-MA,148 min,"Dramas, International Movies, Thrillers","Spiced liberally with black comedy, this Bolly..."
145,s146,Movie,A 2nd Chance,Australia,"July 1, 2017",2011,PG,95 min,"Children & Family Movies, Dramas, Sports Movies",A gymnast lacks the confidence she needs to re...
...,...,...,...,...,...,...,...,...,...,...
7683,s7684,Movie,X Large,Egypt,"June 2, 2020",2011,TV-14,135 min,"Comedies, International Movies, Romantic Movies",After he is rejected by the woman he loves and...
7695,s7696,Movie,Yaara O Dildaara,India,"November 1, 2017",2011,TV-14,132 min,"Dramas, International Movies, Music & Musicals",The patriarch of a wealthy family with one ind...
7736,s7737,Movie,Young Adult,United States,"November 20, 2019",2011,R,94 min,"Comedies, Dramas, Independent Movies",When a divorced writer gets a letter from an o...
7769,s7770,Movie,Zindagi Na Milegi Dobara,India,"December 15, 2019",2011,TV-14,154 min,"Comedies, Dramas, International Movies",Three friends on an adventurous road trip/bach...
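
The same filter can also be written as a boolean mask instead of a query string. The sketch below compares the two forms on made-up toy data; the `shows` frame is hypothetical, and the `xdf` alias with its pandas fallback is an assumption so the snippet can run without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this tutorial
except Exception:
    # Assumption: fall back to pandas when cuDF cannot be imported (e.g. no GPU).
    import pandas as xdf

# Hypothetical toy data standing in for the Netflix dataset.
shows = xdf.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'release_year': [2011, 2015, 2011, 2020],
})

# query() takes the condition as a string expression.
by_query = shows.query('release_year == 2011')

# A boolean mask expresses the same filter with ordinary comparison operators.
by_mask = shows[shows['release_year'] == 2011]

print(len(by_query), len(by_mask))
```

Both forms select the same rows here; the mask form is often easier to compose with other Python expressions, while the query string can be more compact.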


### 4. Sort values
__<span style="color:red">Exercise 4:</span>__ Sort the dataframe by the year each title was released (latest first). Refer to the `sort_values` function, which takes the target column name and the sort order.

In [10]:
gdf.sort_values('release_year',ascending=False)

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,description
1222,s1223,TV Show,Carmen Sandiego,United States,"January 15, 2021",2021,TV-Y7,4 Seasons,"Kids' TV, TV Thrillers","A master thief who uses her skills for good, C..."
1285,s1286,Movie,Charming,"Canada, United States, Cayman Islands","January 8, 2021",2021,TV-Y7,85 min,"Children & Family Movies, Comedies","On the eve of his 21st birthday, an adored pri..."
1440,s1441,TV Show,Cobra Kai,United States,"January 1, 2021",2021,TV-14,3 Seasons,"TV Action & Adventure, TV Dramas",Decades after the tournament that changed thei...
1514,s1515,Movie,"Crack: Cocaine, Corruption & Conspiracy",United States,"January 11, 2021",2021,TV-MA,90 min,Documentaries,"A cheap, powerful drug emerges during a recess..."
1780,s1781,TV Show,Disenchantment,United States,"January 15, 2021",2021,TV-14,3 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ...","Princess duties call, but she'd rather be drin..."
...,...,...,...,...,...,...,...,...,...,...
7342,s7343,Movie,Undercover: How to Operate Behind Enemy Lines,United States,"March 31, 2017",1943,TV-PG,61 min,"Classic Movies, Documentaries",This World War II-era training film dramatizes...
7616,s7617,Movie,Why We Fight: The Battle of Russia,United States,"March 31, 2017",1943,TV-PG,82 min,Documentaries,This installment of Frank Capra's acclaimed do...
7679,s7680,Movie,WWII: Report from the Aleutians,United States,"March 31, 2017",1943,TV-PG,45 min,Documentaries,Filmmaker John Huston narrates this Oscar-nomi...
4960,s4961,Movie,Prelude to War,United States,"March 31, 2017",1942,TV-14,52 min,"Classic Movies, Documentaries",Frank Capra's documentary chronicles the rise ...
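
`sort_values` also accepts a list of columns with one sort direction per column, which is useful for breaking ties. The following is a minimal sketch on made-up toy data; the `shows` frame is hypothetical, and the `xdf` alias with its pandas fallback is an assumption so the snippet can run without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this tutorial
except Exception:
    # Assumption: fall back to pandas when cuDF cannot be imported (e.g. no GPU).
    import pandas as xdf

# Hypothetical toy data standing in for the Netflix dataset.
shows = xdf.DataFrame({
    'title': ['B', 'A', 'D', 'C'],
    'release_year': [2015, 2020, 2015, 2011],
})

# Newest releases first; ties on release_year are broken alphabetically by title.
ordered = shows.sort_values(['release_year', 'title'], ascending=[False, True])

# sort_values returns a new frame, so 'shows' itself keeps its original order.
print(ordered)
```

Note that the call in the exercise likewise returns a sorted copy; assign the result to a variable if the sorted order is needed later.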


### 5. GroupBy
__<span style="color:red">Exercise 5:</span>__ Find the number of movies and TV shows by grouping on `type` with a GroupBy and then applying `count` (`size` can be used as well).

In [11]:
gdf.groupby('type').count()

type,show_id,title,country,date_added,release_year,rating,duration,listed_in,description
Movie,5143,5143,5143,5143,5143,5143,5143,5143,5143
TV Show,2122,2122,2122,2122,2122,2122,2122,2122,2122
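
`count` and `size` answer slightly different questions: `count` reports the non-missing values per column within each group, while `size` reports the number of rows per group. The sketch below illustrates the difference on made-up toy data that still contains a missing value. The `shows` frame is hypothetical, and the `xdf` alias with its pandas fallback is an assumption so the snippet can run without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this tutorial
except Exception:
    # Assumption: fall back to pandas when cuDF cannot be imported (e.g. no GPU).
    import pandas as xdf

# Hypothetical toy data with one missing country.
shows = xdf.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show'],
    'country': ['India', None, 'Brazil'],
})

# count() reports non-missing values per column within each group.
counts = shows.groupby('type').count()

# size() reports the number of rows per group, missing values included.
sizes = shows.groupby('type').size()

print(counts)
print(sizes)
```

In the exercise, the rows with NA values were already dropped from `gdf`, so every column shows the same count for a given type.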
