<a href="https://colab.research.google.com/github/drshahizan/Python-big-data/blob/main/Assignment%202a/SIX/Assignment_alternatives_to_Pandas_file1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Group Members**

1. LEE MING QI

2. NUR IRDINA ALIAH BINTI ABDUL WAHAB

3. SINGTHAI SRISOI

4. AMIRAH RAIHANAH BINTI ABDUL RAHIM


# **Alternatives to Pandas for Processing Large Dataset - cuDF**

&emsp; cuDF is a library for working with DataFrames on the GPU. It is part of the RAPIDS project, a suite of tools for performing data science tasks on GPUs. cuDF lets you manipulate data through an API modeled on the familiar Pandas API, with the added speed of running on a GPU. It can be used to accelerate a wide range of data processing tasks, including data cleaning, transformation, and feature engineering.

# **Dataset**

&emsp; The dataset we are using is Rate.csv, which is accessible on Kaggle through this link: https://www.kaggle.com/datasets/hhs/health-insurance-marketplace?select=Rate.csv

&emsp; The dataset contains information on the rates for each health insurance plan offered through the Health Insurance Marketplace in the United States. The Marketplace is a platform that allows individuals and small businesses to shop for and compare health insurance plans, and is a key part of the Affordable Care Act (ACA).

The Rate.csv file includes the following columns:

1. **BusinessYear:** The year for which the rate information applies.

2. **StateCode**: The two-letter code for the state in which the health insurance plan is offered.

3. **IssuerId**: A unique identifier for the insurer offering the health insurance plan.

4. **SourceName**: The source of the rate information (e.g. the insurer, the state insurance department).

5. **VersionNum**: A version number for the rate information.

6. **ImportDate**: The date on which the rate information was imported into the Marketplace database.

7. **PlanId**: A unique identifier for the health insurance plan.

8. **StandardComponentId**: A unique identifier for the standard component of the health insurance plan.

9. **RatingAreaId**: A unique identifier for the rating area (geographic region) in which the health insurance plan is offered.

10. **Tobacco**: A flag indicating whether the rate information applies to tobacco users (1) or non-tobacco users (0).

11. **Age**: The age of the insured person for which the rate information applies.

12. **IndividualRate**: The monthly premium (cost) for the health insurance plan for an individual.

13. **IndividualTobaccoRate**: The monthly premium for the health insurance plan for an individual tobacco user.

14. **Couple**: The monthly premium for the health insurance plan for a couple.

15. **CoupleAndOneDependent**: The monthly premium for the health insurance plan for a couple and one dependent.

16. **CoupleAndTwoDependents**: The monthly premium for the health insurance plan for a couple and two dependents.

17. **CoupleAndThreeOrMoreDependents**: The monthly premium for the health insurance plan for a couple and three or more dependents.

&emsp; This file can be useful for researchers and policymakers interested in studying trends in the health insurance market, as well as for individuals and small businesses looking for information on the health insurance plans available to them through the Marketplace.

# **Import Dataset**


In [None]:
!pip install -q kaggle

In [None]:
from google.colab import files
from os import environ

# upload kaggle API key
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [None]:
# define kaggle config folder
! mkdir "./kaggle" && mv "./kaggle.json" "./kaggle/kaggle.json"
environ['KAGGLE_CONFIG_DIR'] = './kaggle'

# hide kaggle API key for other users
! chmod 600 ./kaggle/kaggle.json

In [None]:
# fetch kaggle dataset
!kaggle datasets download -d hhs/health-insurance-marketplace -f rate.csv

Downloading Rate.csv.zip to /content
 89% 93.0M/105M [00:00<00:00, 152MB/s]
100% 105M/105M [00:00<00:00, 140MB/s] 


In [None]:
!unzip Rate.csv.zip && rm Rate.csv.zip

Archive:  Rate.csv.zip
  inflating: Rate.csv                


# **Basic Concept of cuDF**

&emsp; To use cuDF, you need a machine with an NVIDIA GPU and the necessary drivers and libraries installed. You can then use the cuDF API to load data into GPU memory and perform operations such as filtering, aggregation, and transformation. The resulting DataFrame is used much like a regular Pandas DataFrame, which makes it easy to fit cuDF into a wide range of data processing pipelines.
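&emsp; The three kinds of operation named above can be sketched on a tiny made-up table. The sample values, the `xdf` alias, and the fallback to Pandas when cuDF is not installed are assumptions made so that the sketch runs anywhere; on a GPU runtime the same calls go through cuDF.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

# Small made-up sample; the values are illustrative only
rates = xdf.DataFrame({
    "StateCode": ["AK", "AK", "WV", "WV"],
    "IndividualRate": [300.0, 400.0, 250.0, 350.0],
})

# Filtering: keep only the rows for one state
ak = rates[rates["StateCode"] == "AK"]

# Aggregation: average monthly rate over the filtered rows
ak_mean = float(ak["IndividualRate"].mean())

# Transformation: derive a yearly figure from the monthly rate
rates["YearlyRate"] = rates["IndividualRate"] * 12

print(ak_mean)
print(rates)
```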

## Setup the environment


1.   Click Runtime in the top toolbar
2.   Click Change runtime type
3.   Select GPU for Hardware accelerator
4.   Check the output of !nvidia-smi to ensure that the allocated GPU is a Tesla T4, P4, or P100.


In [None]:
!nvidia-smi

Fri Jan  6 13:54:53 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

&emsp; The Tesla T4, P4, and P100 are all NVIDIA GPUs that are specifically designed for use in data centers and are well-suited for running cuDF and other GPU-accelerated software. These GPUs have a large number of compute cores, which allows them to perform many computations in parallel, making them ideal for tasks such as data manipulation and analysis.
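&emsp; The same check can be done from Python, which is handy when a notebook should stop early on a CPU-only runtime. This is a minimal sketch: the helper name `list_gpus` is hypothetical, and it simply calls the `nvidia-smi` tool with its `--query-gpu` and `--format` options when the tool is present.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible NVIDIA GPU (empty if none)."""
    if shutil.which("nvidia-smi") is None:
        return []  # tool not on PATH, so no NVIDIA driver is visible
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # driver present but the query failed
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU detected; select a GPU runtime first.")
```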

## Install the cuDF packages

&emsp; The command below installs the cudf-cu11 package from the NVIDIA GPU Cloud (NGC) package repository. cudf-cu11 is the build of cuDF, a GPU-accelerated DataFrame library, for CUDA 11.

&emsp; The --extra-index-url flag specifies the URL of the package repository to use. In this case, the repository is the NGC package repository, which contains a wide range of GPU-accelerated software and libraries, including cudf-cu11 and dask-cudf-cu11.

In [None]:
!pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://pypi.ngc.nvidia.com
Collecting cudf-cu11
  Downloading https://developer.download.nvidia.com/compute/redist/cudf-cu11/cudf_cu11-22.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (442.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 442.8/442.8 MB 3.7 MB/s eta 0:00:00
Collecting cuda-python<12.0,>=11.7.1
  Downloading cuda_python-11.8.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.2/16.2 MB 52.4 MB/s eta 0:00:00
Collecting protobuf<3.21.0a0,>=3.20.1
  Downloading protobuf-3.20.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 50.4 MB/s eta 0:00:00
Collecting rmm-cu11
  Downloading https://develo

## Import cuDF 

In [None]:
import cudf

In [None]:
#check the version of cuDF
cudf.__version__

'22.12.0'

## Read Dataset

&emsp; When you read a CSV file into a cuDF DataFrame using the read_csv() function, the data is automatically transferred to the GPU, where it can be processed and analyzed using the parallel computing power of the GPU.

&emsp; This can significantly improve the performance of tasks such as data manipulation and analysis, especially when working with large datasets.

In [None]:
df = cudf.read_csv('Rate.csv')

df

Unnamed: 0,BusinessYear,StateCode,IssuerId,SourceName,VersionNum,ImportDate,IssuerId2,FederalTIN,RateEffectiveDate,RateExpirationDate,...,IndividualRate,IndividualTobaccoRate,Couple,PrimarySubscriberAndOneDependent,PrimarySubscriberAndTwoDependents,PrimarySubscriberAndThreeOrMoreDependents,CoupleAndOneDependent,CoupleAndTwoDependents,CoupleAndThreeOrMoreDependents,RowNumber
0,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,29.00,,,,,,,,,14
1,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,36.95,,73.9,107.61,107.61,107.61,144.56,144.56,144.56,14
2,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,36.95,,73.9,107.61,107.61,107.61,144.56,144.56,144.56,15
3,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,32.00,,,,,,,,,15
4,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,32.00,,,,,,,,,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12694440,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2033
12694441,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2034
12694442,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2035
12694443,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2036
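&emsp; Since GPU memory is limited, it can help to read only the columns an analysis needs and to choose compact dtypes. The sketch below assumes a tiny stand-in file (`rate_sample.csv`, a hypothetical name) in place of Rate.csv and uses the `usecols` and `dtype` parameters of read_csv(); the `xdf` alias and the Pandas fallback are assumptions so that it also runs without a GPU.

```python
import os
import tempfile

try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

# Tiny stand-in for Rate.csv so the sketch is self-contained
csv_text = (
    "BusinessYear,StateCode,IssuerId,IndividualRate\n"
    "2014,AK,21989,29.00\n"
    "2014,AK,21989,36.95\n"
    "2016,WV,96480,14.05\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rate_sample.csv")
    with open(path, "w") as fh:
        fh.write(csv_text)

    # Read three of the four columns and request smaller numeric types
    subset = xdf.read_csv(
        path,
        usecols=["BusinessYear", "StateCode", "IndividualRate"],
        dtype={"BusinessYear": "int16", "IndividualRate": "float32"},
    )

print(subset.dtypes)
```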


## View Dataset

&emsp; To view the contents of a cuDF DataFrame, you can use the head() function, which returns the first n rows of the DataFrame, while the tail() function returns the last n rows.

In [None]:
#view first 5 rows of the dataframe
df.head(5)

Unnamed: 0,BusinessYear,StateCode,IssuerId,SourceName,VersionNum,ImportDate,IssuerId2,FederalTIN,RateEffectiveDate,RateExpirationDate,...,IndividualRate,IndividualTobaccoRate,Couple,PrimarySubscriberAndOneDependent,PrimarySubscriberAndTwoDependents,PrimarySubscriberAndThreeOrMoreDependents,CoupleAndOneDependent,CoupleAndTwoDependents,CoupleAndThreeOrMoreDependents,RowNumber
0,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,29.0,,,,,,,,,14
1,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,36.95,,73.9,107.61,107.61,107.61,144.56,144.56,144.56,14
2,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,36.95,,73.9,107.61,107.61,107.61,144.56,144.56,144.56,15
3,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,32.0,,,,,,,,,15
4,2014,AK,21989,HIOS,6,2014-03-19 07:06:49,21989,93-0438772,2014-01-01,2014-12-31,...,32.0,,,,,,,,,16


In [None]:
#view last 5 rows of the dataframe
df.tail(5)

Unnamed: 0,BusinessYear,StateCode,IssuerId,SourceName,VersionNum,ImportDate,IssuerId2,FederalTIN,RateEffectiveDate,RateExpirationDate,...,IndividualRate,IndividualTobaccoRate,Couple,PrimarySubscriberAndOneDependent,PrimarySubscriberAndTwoDependents,PrimarySubscriberAndThreeOrMoreDependents,CoupleAndOneDependent,CoupleAndTwoDependents,CoupleAndThreeOrMoreDependents,RowNumber
12694440,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2033
12694441,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2034
12694442,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2035
12694443,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2036
12694444,2016,WV,96480,SERFF,2,2015-08-20 12:28:36,96480,13-5123390,2016-01-01,2016-12-31,...,14.05,,,,,,,,,2037


## Sorting Data

&emsp; We can sort the data by one or more columns using sort_values(). Setting ascending=False places the largest values at the top of the DataFrame.


In [None]:
df.sort_values('IndividualRate', ascending=False)

Unnamed: 0,BusinessYear,StateCode,IssuerId,SourceName,VersionNum,ImportDate,IssuerId2,FederalTIN,RateEffectiveDate,RateExpirationDate,...,IndividualRate,IndividualTobaccoRate,Couple,PrimarySubscriberAndOneDependent,PrimarySubscriberAndTwoDependents,PrimarySubscriberAndThreeOrMoreDependents,CoupleAndOneDependent,CoupleAndTwoDependents,CoupleAndThreeOrMoreDependents,RowNumber
9698,2014,AK,74819,HIOS,7,2014-01-21 08:29:49,74819,95-6042390,2014-01-01,2014-12-31,...,999999.0,,,,,,,,,15
9699,2014,AK,74819,HIOS,7,2014-01-21 08:29:49,74819,95-6042390,2014-01-01,2014-12-31,...,999999.0,,,,,,,,,15
9700,2014,AK,74819,HIOS,7,2014-01-21 08:29:49,74819,95-6042390,2014-01-01,2014-12-31,...,999999.0,,,,,,,,,15
9701,2014,AK,74819,HIOS,7,2014-01-21 08:29:49,74819,95-6042390,2014-01-01,2014-12-31,...,999999.0,,,,,,,,,15
9702,2014,AK,74819,HIOS,7,2014-01-21 08:29:49,74819,95-6042390,2014-01-01,2014-12-31,...,999999.0,,,,,,,,,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12683814,2016,WV,67072,SERFF,3,2015-08-20 12:28:36,67072,93-0242990,2016-01-01,2016-12-31,...,0.0,,,,,,,,,2033
12683815,2016,WV,67072,SERFF,3,2015-08-20 12:28:36,67072,93-0242990,2016-01-01,2016-12-31,...,0.0,,,,,,,,,2034
12683816,2016,WV,67072,SERFF,3,2015-08-20 12:28:36,67072,93-0242990,2016-01-01,2016-12-31,...,0.0,,,,,,,,,2035
12683817,2016,WV,67072,SERFF,3,2015-08-20 12:28:36,67072,93-0242990,2016-01-01,2016-12-31,...,0.0,,,,,,,,,2036
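&emsp; Sorting can also use several keys with a different direction for each, and nlargest() is a shortcut when only the top few rows are needed. The following is a small sketch on made-up values; the `xdf` alias and the Pandas fallback are assumptions so that it runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

# Small made-up sample; the values are illustrative only
rates = xdf.DataFrame({
    "StateCode": ["WV", "AK", "AK", "WV"],
    "IndividualRate": [14.05, 29.0, 36.95, 20.0],
})

# States in alphabetical order, highest rate first within each state
by_state = rates.sort_values(
    ["StateCode", "IndividualRate"], ascending=[True, False]
)

# The two rows with the highest rate, without sorting the whole table by hand
top2 = rates.nlargest(2, "IndividualRate")

print(by_state)
print(top2)
```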


## Missing Data

&emsp; We can use isna() to flag the missing values in the DataFrame; chaining sum() then counts them per column.

In [None]:
df.isna().sum()

BusinessYear                                        0
StateCode                                           0
IssuerId                                            0
SourceName                                          0
VersionNum                                          0
ImportDate                                          0
IssuerId2                                           0
FederalTIN                                          0
RateEffectiveDate                                   0
RateExpirationDate                                  0
PlanId                                              0
RatingAreaId                                        0
Tobacco                                             0
Age                                                 0
IndividualRate                                      0
IndividualTobaccoRate                         7762096
Couple                                       12653504
PrimarySubscriberAndOneDependent             12653504
PrimarySubscriberAndTwoDepen

&emsp; We can use the fillna() function to replace missing values with a chosen value. For example, we can replace the null values in "IndividualTobaccoRate" with the mean of the column. Setting inplace=True modifies the column directly instead of returning a new one.

In [None]:
df['IndividualTobaccoRate'].fillna(df.IndividualTobaccoRate.mean(), inplace=True)

In [None]:
df.IndividualTobaccoRate.isna().sum()

0
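&emsp; An alternative that does not rely on inplace=True is to assign the filled column back to the DataFrame. The sketch below uses made-up values; the `xdf` alias and the Pandas fallback are assumptions so that it runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

# Made-up column with two missing values
rates = xdf.DataFrame({"IndividualTobaccoRate": [100.0, None, 300.0, None]})

# mean() skips nulls by default, so only 100.0 and 300.0 are averaged
col_mean = float(rates["IndividualTobaccoRate"].mean())

# Assign the filled column back instead of modifying it in place
rates["IndividualTobaccoRate"] = rates["IndividualTobaccoRate"].fillna(col_mean)

remaining = int(rates["IndividualTobaccoRate"].isna().sum())
print(col_mean, remaining)
```

&emsp; Mean imputation is simple but it flattens the distribution: in this dataset 7,762,096 of the roughly 12.69 million values in the column were missing, which is why the 25%, 50% and 75% quartiles of IndividualTobaccoRate in the describe() output further down all equal the mean.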

&emsp; We can also use the dropna() function to drop rows or columns that contain null values. For example, we will drop every column in which more than 90% of the values are null. The thresh parameter is the minimum number of non-null values a column must have to be kept, so we set it to 10% of the number of rows.

In [None]:
# thresh is a count of non-null values, so convert 10% of the rows to an integer
df.dropna(axis=1, thresh=int(len(df) * 0.1), inplace=True)
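&emsp; The effect of thresh is easier to see on a small table. In this sketch, with 20 made-up rows, 10% of the rows is 2, so a column needs at least two non-null values to survive; the `xdf` alias and the Pandas fallback are assumptions so that it runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

n_rows = 20
frame = xdf.DataFrame({
    "IndividualRate": [float(i) for i in range(n_rows)],  # 20 non-null values
    "IndividualTobaccoRate": [1.0] * 5 + [None] * 15,     # 5 non-null values
    "Couple": [73.9] + [None] * 19,                       # 1 non-null value
})

# Minimum number of non-null values a column needs in order to be kept
min_non_null = int(n_rows * 0.1)

# "Couple" has fewer than min_non_null values and is dropped
trimmed = frame.dropna(axis=1, thresh=min_non_null)
print(list(trimmed.columns))
```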

## Statistical Analysis

&emsp; We can use describe() to compute summary statistics for each numerical column. Individual statistics are also available through functions such as mean(), std(), var(), min() and max().


In [None]:
df.describe()

Unnamed: 0,BusinessYear,IssuerId,VersionNum,IssuerId2,IndividualRate,IndividualTobaccoRate,RowNumber
count,12694440.0,12694440.0,12694440.0,12694440.0,12694440.0,12694440.0,12694440.0
mean,2015.034,52485.92,6.865558,52485.92,4098.026,543.6911,6348.572
std,0.794052,26412.63,3.85718,26412.63,61222.71,183.6286,9011.435
min,2014.0,10046.0,1.0,10046.0,0.0,41.73,14.0
25%,2014.0,30219.0,4.0,30219.0,29.33,543.6911,873.0
50%,2015.0,49532.0,6.0,49532.0,291.6,543.6911,2728.0
75%,2016.0,76526.0,9.0,76526.0,478.98,543.6911,7577.0
max,2016.0,99969.0,24.0,99969.0,999999.0,6604.61,63493.0
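&emsp; In the output above, the mean of IndividualRate (4098.026) is far above its median (291.6), which suggests that the 999999.0 maximum is pulling the average up. The sketch below shows the effect on made-up values. Treating 999999 as a placeholder rather than a real premium is an assumption, not something the dataset description confirms, and the `xdf` alias and Pandas fallback are likewise assumptions.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

# Made-up rates with one extreme value
rates = xdf.DataFrame({"IndividualRate": [29.0, 36.95, 32.0, 14.05, 999999.0]})
col = rates["IndividualRate"]

mean_all = float(col.mean())      # dominated by the extreme value
median_all = float(col.median())  # barely affected by it

# Assumption: values of 999999 are placeholders and can be excluded
plausible = col[col < 999999.0]
mean_plausible = float(plausible.mean())

print(mean_all, median_all, mean_plausible)
```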


## Grouping Data

&emsp; The groupby() function in cuDF allows you to group a cuDF DataFrame by one or more columns and apply a function to each group. This can be useful for a variety of purposes, such as calculating statistics for each group, filtering groups based on certain criteria, or transforming the data in each group.

In [None]:
df.groupby(['BusinessYear']).count()

Unnamed: 0_level_0,StateCode,IssuerId,SourceName,VersionNum,ImportDate,IssuerId2,FederalTIN,RateEffectiveDate,RateExpirationDate,PlanId,RatingAreaId,Tobacco,Age,IndividualRate,IndividualTobaccoRate,RowNumber
BusinessYear,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2015,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092,4676092
2014,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388,3796388
2016,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965,4221965


In [None]:
df.groupby(['Age']).agg({'IndividualRate': ['mean','min','max']})

Unnamed: 0_level_0,IndividualRate,IndividualRate,IndividualRate
Unnamed: 0_level_1,mean,min,max
Age,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
36,4123.018255,0.0,999999.0
26,4087.140988,0.0,999999.0
48,4194.17786,0.0,999999.0
30,4106.364008,0.0,999999.0
41,4135.77554,0.0,999999.0
21,4082.839038,0.0,999999.0
55,4298.124468,0.0,999999.0
43,4145.44615,0.0,999999.0
27,4091.294937,0.0,999999.0
59,4363.204309,0.0,999999.0
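&emsp; Note that the groups in the outputs above are not in key order (2015, 2014, 2016). If ordered output matters, one option is to sort the result by its index, as in this sketch on made-up values; the `xdf` alias and the Pandas fallback are assumptions so that it runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

# Small made-up sample; the values are illustrative only
rates = xdf.DataFrame({
    "BusinessYear": [2015, 2014, 2016, 2014, 2015],
    "IndividualRate": [10.0, 20.0, 30.0, 60.0, 50.0],
})

# Several statistics per group, then order the groups by year
per_year = (
    rates.groupby("BusinessYear")["IndividualRate"]
    .agg(["mean", "min", "max"])
    .sort_index()
)
print(per_year)
```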


## Iteration

&emsp; cuDF DataFrames are designed to be processed in parallel on the GPU, and iterating over the rows of a DataFrame requires the data to be transferred back to the CPU one row at a time. This can be inefficient compared to other ways of processing the data, such as using vectorized operations or the apply_rows() function.

&emsp; To avoid poor performance, cuDF does not support iteration over a cuDF Series, DataFrame, or Index. Instead, you should try to use an existing function or method to accomplish the task you need to perform. If that is not possible, you can use the to_arrow() or to_pandas() function to copy the data from the GPU to the CPU, and then use from_arrow() or from_pandas() to copy the data back to the GPU.
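&emsp; As a rough sketch of both options: a vectorized expression replaces the loop entirely, and when a Python loop really is needed, the column is copied to the CPU once and iterated there. The sample values and the 400 cut-off are made up, and the `xdf` alias and Pandas fallback (where no copy is needed) are assumptions so that it runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

rates = xdf.DataFrame({"IndividualRate": [29.0, 36.95, 32.0]})

# Preferred: one vectorized expression processes the whole column at once
rates["YearlyRate"] = rates["IndividualRate"] * 12

# If a Python loop is unavoidable, copy the column to the CPU once, then loop
col = rates["YearlyRate"]
host_col = col.to_pandas() if hasattr(col, "to_pandas") else col
labels = ["high" if value > 400 else "low" for value in host_col]

print(labels)
```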

## Convert cuDF DataFrame to Pandas DataFrame

&emsp; To convert a cuDF DataFrame to a Pandas DataFrame, you can use the to_pandas() function. This function will transfer the data from the GPU to the CPU and return a Pandas DataFrame.

In [None]:
pandas_df = df.to_pandas()

In [None]:
type(df)

cudf.core.dataframe.DataFrame

In [None]:
type(pandas_df)

pandas.core.frame.DataFrame

&emsp; To convert a Pandas DataFrame to a cuDF DataFrame, you can use the from_pandas() function. This function will transfer the data from the CPU to the GPU and return a cuDF DataFrame.

In [None]:
cudf_df = cudf.from_pandas(pandas_df)

In [None]:
type(cudf_df)

cudf.core.dataframe.DataFrame

&emsp; Keep in mind that converting a cuDF DataFrame to a Pandas DataFrame can be slow if the cuDF DataFrame is large, as it requires the data to be transferred from the GPU to the CPU. You should try to minimize the number of times you need to do this conversion if you are working with large datasets.
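&emsp; One way to keep transfers small is to do the heavy work on the GPU and convert only the reduced result. The sketch below aggregates first and then moves a two-row summary to the CPU; the sample values are made up, and the `xdf` alias and Pandas fallback (where no copy is needed) are assumptions so that it runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch still runs

# Small made-up sample; the values are illustrative only
rates = xdf.DataFrame({
    "StateCode": ["AK", "AK", "WV"],
    "IndividualRate": [29.0, 36.95, 14.05],
})

# Aggregate first: the result has one row per state, not one per rate
summary = rates.groupby("StateCode")["IndividualRate"].mean().sort_index()

# Convert only the small summary, not the full table
summary_host = summary.to_pandas() if hasattr(summary, "to_pandas") else summary
result = {state: float(value) for state, value in summary_host.items()}

print(result)
```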

# **Conclusion**

&emsp; cuDF is a powerful library for data manipulation and analysis that is designed to work with data stored on the GPU. It provides many of the same features as Pandas, including the ability to read and write data in various formats, perform data cleaning and transformation tasks, and perform statistical analysis.

&emsp; One of the main benefits of cuDF is its ability to process large datasets in parallel on the GPU, which can be significantly faster than Pandas and other CPU-based data manipulation libraries. This makes it a strong choice for datasets that are slow to process on the CPU, provided the data fits in GPU memory (about 15 GB on the Tesla T4 used here).

&emsp; Overall, cuDF is a valuable tool for data scientists and analysts who need to work with large datasets and need fast, efficient data manipulation and analysis capabilities. It is particularly well-suited for tasks such as data wrangling, data preparation, and feature engineering, and can be an important part of a data-driven workflow.