<a href="https://colab.research.google.com/github/drshahizan/Python-big-data/blob/main/assignment/ass6/hpdp/2Big2Handle/big_data_2Big2Handle_md.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spotify Charts (Assignment 6: Mastering Big Data Handling)

Source credit: https://www.kaggle.com/datasets/dhruvildave/spotify-charts

## Group: 2Big2Handle

### Group Members

| Name | Matrix Number | Task |
| :--- | :---: | --- |
| MIKHAIL BIN YASSIN | A21EC0054 | |
| IKMAL BIN KHAIRULEZUAN | A21EC0186 | |


# 1. About the Dataset

The Spotify Charts dataset is a complete collection of all the "Top 200" and "Viral 50" charts published globally by Spotify. Spotify publishes a new chart every 2-3 days, and the dataset covers every chart since January 1, 2017. It consists of 9 columns: title, rank, date, artist, url, region, chart, trend, and streams. The dataset is 3.48 GB in size. We download it directly from Kaggle into the Colab runtime instead of downloading it to a laptop and re-uploading it to Google Drive, which would consume local storage.

In [None]:
from  google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"mikhaily","key":"<redacted>"}'}

In [None]:
# Create a kaggle folder
! mkdir -p ~/.kaggle

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle datasets download dhruvildave/spotify-charts

Downloading spotify-charts.zip to /content
 98% 929M/945M [00:10<00:00, 156MB/s]
100% 945M/945M [00:10<00:00, 90.6MB/s]


In [None]:
! unzip spotify-charts.zip

Archive:  spotify-charts.zip
  inflating: charts.csv              


# Note:

## Functions used to calculate the Computational Metrics

We will be using `time()`, `psutil.cpu_percent()` and `memory_usage()` as the main functions to calculate the computational metrics in this task.

- `time()`: The `time()` function from Python's `time` module returns the current time in seconds. Calling it before and after a block of code and subtracting the two values gives the execution time of that block.

- `psutil.cpu_percent()`: The `cpu_percent()` function in the `psutil` library is used to monitor CPU usage in Python. It returns the current system-wide CPU utilization as a percentage. Because CPU usage is calculated over a period of time, the function takes an `interval` parameter in seconds: with `interval=1` it blocks for one second and reports the usage over that second, while with `interval=None` it returns immediately and reports the usage since the previous call.

- `memory_usage()`: The `memory_usage()` method in pandas returns the memory usage of each column of a dataframe in bytes; summing the result gives the total. By default it includes the index (`index=True`), but for `object` columns it counts only the references, not the strings they point to. Setting the `deep` parameter to `True` inspects the actual objects and gives a more accurate, usually higher, figure.
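As a minimal sketch of how the first two functions can be combined, the hypothetical helper `measure()` below (not part of the original notebook) primes `psutil.cpu_percent()` before the work starts and reads it again afterwards, so the reported percentage covers the measured block itself. The workload is a stand-in for a real data load.

```python
import time

import psutil


def measure(task):
    """Run task() and return (result, elapsed_seconds, cpu_percent)."""
    # Prime the counter; the value returned by this first call is discarded.
    psutil.cpu_percent(interval=None)
    start = time.time()
    result = task()
    elapsed = time.time() - start
    # System-wide CPU utilization averaged since the priming call.
    cpu = psutil.cpu_percent(interval=None)
    return result, elapsed, cpu


# Stand-in workload (an assumption for illustration only).
result, elapsed, cpu = measure(lambda: sum(i * i for i in range(1_000_000)))
print(f"{elapsed:.2f} s, {cpu:.1f}% CPU")
```

For very short tasks the CPU figure can be unreliable, since `psutil` needs some time to pass between the two calls.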

## `memory_usage()` VS `info()`

The `memory_usage()` method and the `info()` method in pandas can both be used to check the memory usage of a dataframe, but they provide different levels of detail.

The `memory_usage()` method returns the memory usage of each column in bytes, including the index by default. With `deep=False` (the default), `object` columns are measured shallowly; with `deep=True`, pandas inspects the stored objects and reports their real size.

The `info()` method prints a summary of the dataframe: the index range, the number of columns, the data type of each column (plus non-null counts for smaller dataframes), and a single memory usage figure.

By default, the figure printed by `info()` is the shallow total, the same value as `memory_usage(deep=False).sum()`. When `object` columns are present it is shown with a trailing `+` (for example `179.7+ MB`) to signal that the real usage may be higher. `memory_usage(deep=True)`, or `info(memory_usage='deep')`, measures the string contents as well, so it generally reports a higher number.
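The sketch below compares the shallow and deep figures on a tiny, made-up dataframe (the column values are illustrative, and the `title` column is forced to `object` to mirror the dtypes in this notebook). It also captures the text of `info()` in a buffer so it can be inspected.

```python
import io

import pandas as pd

# Small illustrative dataframe; 'title' is stored as Python objects.
df_demo = pd.DataFrame({
    "rank": [1, 2, 3],
    "title": pd.Series(["Safari", "Shaky Shaky", "BYE"], dtype=object),
})

shallow = df_demo.memory_usage(index=True, deep=False)  # bytes per column
deep = df_demo.memory_usage(index=True, deep=True)      # includes string contents

shallow_total = int(shallow.sum())
deep_total = int(deep.sum())

# info() writes its summary to a buffer when one is supplied.
buffer = io.StringIO()
df_demo.info(buf=buffer)
info_text = buffer.getvalue()

print(f"shallow: {shallow_total} bytes, deep: {deep_total} bytes")
print(info_text)
```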

# Setup:

In [None]:
# import pandas
import pandas as pd

# Import time module
import time

# Import psutil
import psutil

In [None]:
# Initialize a dictionary to store the results
results = {'Tradisional Way':[], 'Load Less Data': [], 'Use Chunking': [], 'Optimize Data Types': [], 'Sampling': [], 'Dask': []}

Calling `df.memory_usage(deep=True).sum()` gives the total memory usage of a dataframe in bytes. To convert it to a more human-readable format, we use the following function:

In [None]:
def format_memory_usage(num, suffix='B'):
    for unit in ['', 'K', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return f"{num:.1f} {unit}{suffix}"
        num /= 1024.0
    return f"{num:.1f} Y{suffix}"

In [None]:
def format_elapsed_time(elapsed_time):
    minutes, seconds = divmod(elapsed_time, 60)
    return f'{int(minutes):02d}:{int(seconds):02d}'

In [None]:
def format_cpu_usage(cpu_usage):
    return f'{cpu_usage:.2f}%'
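As a quick check of the three helpers, the sketch below repeats their definitions so it runs on its own and applies them to values that appear in this notebook: 188449044 bytes is the sum of the per-column figures printed for the last chunk in the chunking section, and 107 seconds is the traditional load time.

```python
def format_memory_usage(num, suffix="B"):
    # Divide by 1024 until the value fits the current unit.
    for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
        if abs(num) < 1024.0:
            return f"{num:.1f} {unit}{suffix}"
        num /= 1024.0
    return f"{num:.1f} Y{suffix}"


def format_elapsed_time(elapsed_time):
    minutes, seconds = divmod(elapsed_time, 60)
    return f"{int(minutes):02d}:{int(seconds):02d}"


def format_cpu_usage(cpu_usage):
    return f"{cpu_usage:.2f}%"


print(format_memory_usage(188449044))  # 179.7 MB
print(format_elapsed_time(107))        # 01:47
print(format_cpu_usage(2))             # 2.00%
```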

# 2. Traditional Way of Reading Big Data:

When dealing with large amounts of data, we must be cautious about how we use memory. Memory shortage is a common problem with large datasets: if all RAM is consumed, the program crashes with a `MemoryError`, which can be difficult to handle. Limiting memory usage therefore becomes critical. We first read the dataset the traditional way as a baseline, and then present five strategies for handling large datasets effectively.

In [None]:
# Record start time
start_time = time.time()

# Reading the data
df = pd.read_csv("charts.csv")

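# Sample the system-wide CPU usage over the 1 second after the load
# (cpu_percent(interval=1) blocks for that second, so it does not cover the read itself)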
cpu_usage = format_cpu_usage(psutil.cpu_percent(interval=1))

# Calculate the memory usage
memory_usage = format_memory_usage(df.memory_usage(deep=True).sum())

# Calculate the elapsed time
elapsed_time = format_elapsed_time(time.time() - start_time)

# Add the result to the dictionary
results['Tradisional Way'].append((memory_usage, cpu_usage, elapsed_time))

# Print the recorded (memory usage, CPU usage, elapsed time)
print(results['Tradisional Way'])

[('13.1 GB', '2.00%', '01:47')]


In [None]:
df

Unnamed: 0,title,rank,date,artist,url,region,chart,trend,streams
0,Chantaje (feat. Maluma),1,2017-01-01,Shakira,https://open.spotify.com/track/6mICuAdrwEjh6Y6...,Argentina,top200,SAME_POSITION,253019.0
1,Vente Pa' Ca (feat. Maluma),2,2017-01-01,Ricky Martin,https://open.spotify.com/track/7DM4BPaS7uofFul...,Argentina,top200,MOVE_UP,223988.0
2,Reggaetón Lento (Bailemos),3,2017-01-01,CNCO,https://open.spotify.com/track/3AEZUABDXNtecAO...,Argentina,top200,MOVE_DOWN,210943.0
3,Safari,4,2017-01-01,"J Balvin, Pharrell Williams, BIA, Sky",https://open.spotify.com/track/6rQSrBHf7HlZjtc...,Argentina,top200,SAME_POSITION,173865.0
4,Shaky Shaky,5,2017-01-01,Daddy Yankee,https://open.spotify.com/track/58IL315gMSTD37D...,Argentina,top200,MOVE_UP,153956.0
...,...,...,...,...,...,...,...,...,...
26173509,BYE,46,2021-07-31,Jaden,https://open.spotify.com/track/3OUyyDN7EZrL7i0...,Vietnam,viral50,MOVE_UP,
26173510,Pillars,47,2021-07-31,My Anh,https://open.spotify.com/track/6eky30oFiQbHUAT...,Vietnam,viral50,NEW_ENTRY,
26173511,Gái Độc Thân,48,2021-07-31,Tlinh,https://open.spotify.com/track/2klsSb2iTfgDh95...,Vietnam,viral50,MOVE_DOWN,
26173512,Renegade (feat. Taylor Swift),49,2021-07-31,Big Red Machine,https://open.spotify.com/track/1aU1wpYBSpP0M6I...,Vietnam,viral50,MOVE_DOWN,


Reading the CSV the traditional way took 1 minute and 47 seconds. On some attempts to load the Spotify charts, the session crashed after using all available RAM in Google Colab. One way around this is a high-RAM runtime, but that requires purchasing Colab Pro at USD 9.99 per month, which is quite expensive. Instead, we apply strategies for handling big data.

# 3. Strategies for Big Datasets with Steps for Using These Strategies



### a. Use Chunking

Chunking is the process of dividing a large dataset into smaller pieces of a size we choose. It is useful when a dataset may not fit entirely in memory, where loading it in one go can crash the session with a memory error. Processing the data chunk by chunk lets us run computations on one subset at a time, which supports incremental and scalable analysis, and the saved chunks can later be handed to parallel or distributed workers. Chunking is also a fallback when the RAM cannot sustain a long-running computation, or when `dask.array` or `dask.dataframe` cannot manage the dataset. In this example, we split the dataset into smaller chunks.
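To illustrate the incremental style of computation that chunking allows, the sketch below sums `streams` per `region` one chunk at a time and then combines the partial results. The CSV text is a tiny made-up stand-in for `charts.csv`, and the chunk size of 2 is chosen only to force several chunks.

```python
import io

import pandas as pd

# Made-up sample standing in for charts.csv (one missing 'streams' value).
csv_text = (
    "region,streams\n"
    "Argentina,10\n"
    "Vietnam,5\n"
    "Argentina,7\n"
    "Vietnam,\n"
    "Brazil,3\n"
)

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Aggregate within the chunk; only this small result is kept in memory.
    partials.append(chunk.groupby("region")["streams"].sum())

# Combine the per-chunk sums into one total per region.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

Only one chunk is held in memory at any time, so the peak memory use depends on the chunk size and not on the size of the file.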

In [None]:
# Record start time
start_time = time.time()

# Start the CPU usage monitor
cpu_usage = psutil.cpu_percent(interval=1)

#----------------------------------
file_path = "charts.csv"  # Provide the correct file path

# Chunk size
chunk_size = 2617352  # Adjust this value based on your preferences

# Read the CSV file in chunks
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):
    # Process each chunk as needed
    # For example, you can print the shape of each chunk
    print(f"Chunk {i+1} Shape: {chunk.shape}")

    # Perform additional processing on each chunk if necessary

    # Save the chunk to a separate CSV file
    chunk.to_csv(f"{file_path}_chunk_{i+1}.csv", index=False)

    # Get the shallow memory usage of the current chunk (overwritten on each
    # iteration, so only the last chunk's value is kept)
    memory_usage = chunk.memory_usage(index=True, deep=False)

    # Add the result to the dictionary or do something with memory_usage

#----------------------------------

# Stop the CPU usage monitor
cpu_usage = format_cpu_usage(psutil.cpu_percent(interval=None))

# Calculate the elapsed time
elapsed_time = format_elapsed_time(time.time() - start_time)

# Note: memory_usage holds the per-column bytes of the last chunk only

# Add the result to the dictionary
results['Use Chunking'].append((memory_usage, cpu_usage, elapsed_time))

# Print result
print(results['Use Chunking'])


Chunk 1 Shape: (2617352, 9)
Chunk 2 Shape: (2617352, 9)
Chunk 3 Shape: (2617352, 9)
Chunk 4 Shape: (2617352, 9)
Chunk 5 Shape: (2617352, 9)
Chunk 6 Shape: (2617352, 9)
Chunk 7 Shape: (2617352, 9)
Chunk 8 Shape: (2617352, 9)
Chunk 9 Shape: (2617352, 9)
Chunk 10 Shape: (2617346, 9)
[(Index           132
title      20938768
rank       20938768
date       20938768
artist     20938768
url        20938768
region     20938768
chart      20938768
trend      20938768
streams    20938768
dtype: int64, '66.60%', '04:17')]


In [None]:
# Information about the last chunk read in the loop above.
# DataFrame.info() returns None, so its text is captured in a buffer.
import io

buffer = io.StringIO()
chunk.info(buf=buffer)
chunk_info = f"Chunk {i+1} Info:\n{buffer.getvalue()}"

# Print or log the chunk information
print(chunk_info)

Chunk 10 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2617346 entries, 23556168 to 26173513
Data columns (total 9 columns):
 #   Column   Dtype  
---  ------   -----  
 0   title    object 
 1   rank     int64  
 2   date     object 
 3   artist   object 
 4   url      object 
 5   region   object 
 6   chart    object 
 7   trend    object 
 8   streams  float64
dtypes: float64(1), int64(1), object(7)
memory usage: 179.7+ MB



In [None]:
# Read the CSV file in chunks
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):

    print(f"Chunk {i+1}:")

    # Display basic statistics for numerical columns
    print(chunk.describe())

    # Display the first few rows of the chunk
    print(chunk.head())

    # Display the data types of columns
    print(chunk.dtypes)

Chunk 1:
               rank       streams
count  2.617352e+06  2.450701e+06
mean   9.060635e+01  5.282585e+04
std    5.822597e+01  2.030654e+05
min    1.000000e+00  1.001000e+03
25%    3.900000e+01  3.299000e+03
50%    8.500000e+01  9.182000e+03
75%    1.400000e+02  3.131100e+04
max    2.000000e+02  8.291491e+06
                         title  rank        date  \
0      Chantaje (feat. Maluma)     1  2017-01-01   
1  Vente Pa' Ca (feat. Maluma)     2  2017-01-01   
2   Reggaetón Lento (Bailemos)     3  2017-01-01   
3                       Safari     4  2017-01-01   
4                  Shaky Shaky     5  2017-01-01   

                                  artist  \
0                                Shakira   
1                           Ricky Martin   
2                                   CNCO   
3  J Balvin, Pharrell Williams, BIA, Sky   
4                           Daddy Yankee   

                                                 url     region   chart  \
0  https://open.spotify.com/trac

Using the chunking method, we can divide the original dataset into several chunks based on a specified chunk size. This method enables us to manage large datasets efficiently. To determine an appropriate chunk size, we can use the formula: number of rows / desired number of chunks. This formula ensures that each chunk is of manageable size while covering the entire dataset. The code example provided reads the CSV file in chunks, processes each chunk as needed, and saves them as separate CSV files. Adjusting the chunk size provides flexibility based on the user's preferences and system capabilities.
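The formula can be checked against the numbers in this notebook. The sketch below assumes the 26,173,514 rows shown in the dataframe output (index 0 to 26173513) and 10 desired chunks, rounding up so that no rows are left over.

```python
import math

total_rows = 26_173_514   # rows in charts.csv, taken from the dataframe index above
desired_chunks = 10

# Round up so the final, smaller chunk still covers the remaining rows.
chunk_size = math.ceil(total_rows / desired_chunks)
last_chunk_rows = total_rows - chunk_size * (desired_chunks - 1)

print(chunk_size)       # 2617352
print(last_chunk_rows)  # 2617346
```

These values match the `chunk_size` used in the code and the shape reported for chunk 10.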

### b. Parallelize with Dask

Most big data analytics in Python rely on Pandas and NumPy, which support a wide range of computations. However, these packages do not scale when the dataset does not fit in memory. This is where Dask comes in: when a dataset does not "fit in memory", Dask extends the workflow to data that "fits on disk". Depending on the size of the dataset, Dask lets us scale out to a cluster or scale down to a single machine. The code below shows how Dask works on our dataset.

In [None]:
# Install Dask library
!pip install dask

In [None]:
# Record start time
start_time = time.time()

# Start the CPU usage monitor
cpu_usage = psutil.cpu_percent(interval=1)

#----------------------------------
# Import dask
import dask
import dask.dataframe as dd

df_strategies = dd.read_csv("charts.csv")
print(df_strategies)
#----------------------------------

# Stop the CPU usage monitor
cpu_usage = format_cpu_usage(psutil.cpu_percent(interval=None))

# Calculate the elapsed time
elapsed_time = format_elapsed_time(time.time() - start_time)

# Build the memory-usage computation (lazy: this returns a Dask Series,
# and the numbers are only produced by calling .compute())
memory_usage = df_strategies.memory_usage(index=True, deep=False)

# Add the result to the dictionary
results['Dask'].append((memory_usage, cpu_usage, elapsed_time))

#print result
print(results['Dask'])

Dask DataFrame Structure:
                 title   rank    date  artist     url  region   chart   trend streams
npartitions=54                                                                       
                object  int64  object  object  object  object  object  object   int64
                   ...    ...     ...     ...     ...     ...     ...     ...     ...
...                ...    ...     ...     ...     ...     ...     ...     ...     ...
                   ...    ...     ...     ...     ...     ...     ...     ...     ...
                   ...    ...     ...     ...     ...     ...     ...     ...     ...
Dask Name: read-csv, 1 graph layer
[(Dask Series Structure:
npartitions=1
    int64
      ...
dtype: int64
Dask Name: series-groupby-sum-agg, 5 graph layers, '39.80%', '00:02')]


In [None]:
# The result is still lazy; .compute() is needed to get the actual numbers
memory_usage = df_strategies.memory_usage(index=True, deep=False)
memory_usage

Dask Series Structure:
npartitions=1
    int64
      ...
dtype: int64
Dask Name: series-groupby-sum-agg, 27 graph layers

### c. Load Less Data

With this method, we load only the essential portions of the dataset to optimize memory usage.

In [None]:
# Record start time
start_time = time.time()

# Start the CPU usage monitor
cpu_usage = psutil.cpu_percent(interval=1)

#----------------------------------
# remove unwanted columns in our dataset
df_strategies = df_strategies.drop(['url', 'trend'], axis=1)
#----------------------------------

# Stop the CPU usage monitor
cpu_usage = format_cpu_usage(psutil.cpu_percent(interval=None))

# Calculate the elapsed time
elapsed_time = format_elapsed_time(time.time() - start_time)

# Get full memory_usage after loading less data
memory_usage = df_strategies.memory_usage(index=True, deep=False)

# Add the result to the dictionary
results['Load Less Data'].append((memory_usage, cpu_usage, elapsed_time))

#print result
print(results['Load Less Data'])

[(Dask Series Structure:
npartitions=1
    int64
      ...
dtype: int64
Dask Name: series-groupby-sum-agg, 6 graph layers, '100.00%', '00:01')]


In [None]:
# print the dtype of the dataframe
print(df_strategies.dtypes)

title      object
rank        int64
date       object
artist     object
region     object
chart      object
streams     int64
dtype: object


In [None]:
# Print the info of the dataframe (info() prints directly and returns None)
df_strategies.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 7 entries, title to streams
dtypes: object(5), int64(2)


It is especially beneficial to implement the "Load Less Data" method because it allows the analyst to selectively load and process only the essential portions of the dataset, omitting sections that may not contribute meaningfully to the analysis. We not only save memory resources by avoiding the unnecessary loading of irrelevant data, but we also streamline the analytical workflow by focusing computational efforts on the most relevant information. This method is a successful strategy for optimizing memory usage and increasing the efficiency of data analysis tasks.
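Columns can also be skipped at read time, so they never enter memory. The sketch below uses the `usecols` parameter of `pandas.read_csv` on a two-row made-up sample with the same column names as the dataset; the URLs are placeholders.

```python
import io

import pandas as pd

# Made-up sample with the dataset's column names; URLs are placeholders.
csv_text = (
    "title,rank,date,artist,url,region,chart,trend,streams\n"
    "Safari,4,2017-01-01,J Balvin,https://example.invalid/a,Argentina,top200,SAME_POSITION,173865\n"
    "BYE,46,2021-07-31,Jaden,https://example.invalid/b,Vietnam,viral50,MOVE_UP,\n"
)

# Keep everything except 'url' and 'trend'.
wanted = ["title", "rank", "date", "artist", "region", "chart", "streams"]
df_less = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(df_less.columns.tolist())
```

Dropping the columns after loading, as in the cell above, gives the same final columns, but with pandas the unwanted data would first have to be read.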

### d. Optimize Data Types

The Optimizing Data Types method is the process of selecting the most efficient and suitable data types to represent the values in a dataset. This method aims to reduce memory consumption while increasing computational efficiency. By choosing data types that match the range and precision of the actual data, we can reduce storage requirements while improving the performance of data processing tasks. To reduce memory usage in Google Colab, the code below changes the data type of each column to a more appropriate one.

In [None]:
df_strategies.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 7 entries, title to streams
dtypes: object(5), int64(2)

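To check the size of a dataframe, you can use the `memory_usage()` method. Because `df_strategies` is a Dask dataframe, the call below returns a lazy scalar and not a number; calling `.compute()` on it would evaluate it. Here's an example: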

In [None]:
# Check the memory usage of the dataframe
print(df_strategies.memory_usage(deep=True).sum())


dd.Scalar<series-..., dtype=int64>


Next, we convert each column to a more suitable data type using `astype()` and `dd.to_datetime()`. For example, the text columns with many repeated values (title, artist, region, chart) become `category`, and the rank column, which ranges from 1 to 200, becomes an unsigned 8-bit integer (`int8` would overflow above 127).

In [None]:
# Record start time
start_time = time.time()

# Start the CPU usage monitor
cpu_usage = psutil.cpu_percent(interval=1)

#----------------------------------
# Convert 'title' to category
df_strategies['title'] = df_strategies['title'].astype('category')

# Convert 'rank' to an unsigned 8-bit integer (ranks go up to 200, which exceeds the int8 maximum of 127)
#df_strategies['rank'] = dd.to_numeric(df_strategies['rank'], errors='coerce', downcast='integer')
df_strategies['rank'] = df_strategies['rank'].astype('uint8')

# Convert 'date' to datetime
df_strategies['date'] = dd.to_datetime(df_strategies['date'])

# Convert 'artist' to category
df_strategies['artist'] = df_strategies['artist'].astype('category')

# Convert 'region' to category
df_strategies['region'] = df_strategies['region'].astype('category')

# Convert 'chart' to category
df_strategies['chart'] = df_strategies['chart'].astype('category')

# Convert 'streams' to float (the column contains missing values)
#df_strategies['streams'] = pd.to_numeric(df_strategies['streams'], errors='coerce', downcast='float')
df_strategies['streams'] = df_strategies['streams'].astype('float')

#----------------------------------

# Stop the CPU usage monitor
cpu_usage = format_cpu_usage(psutil.cpu_percent(interval=None))

# Calculate the elapsed time
elapsed_time = format_elapsed_time(time.time() - start_time)

# Memory usage of the original pandas `df` (not of the lazy `df_strategies`)
memory_usage = format_memory_usage(df.memory_usage(deep=True).sum())

# Add the result to the dictionary
results['Optimize Data Types'].append((memory_usage, cpu_usage, elapsed_time))

#print result
print(results['Optimize Data Types'])

[('13.1 GB', '41.20%', '00:01')]
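To make the effect of the conversions visible, the sketch below applies the same ideas to a small made-up pandas dataframe and compares the deep memory usage before and after. The values are illustrative; `pd.to_numeric` with `downcast` is used here as one way to pick the smallest numeric type.

```python
import pandas as pd

# Made-up data with the same kinds of columns as the dataset.
raw = pd.DataFrame({
    "rank": [1, 50, 200] * 1000,
    "region": ["Argentina", "Vietnam", "Brazil"] * 1000,
    "streams": [253019.0, None, 1001.0] * 1000,
})

optimized = raw.copy()
# Ranks are between 1 and 200, so an unsigned 8-bit integer is enough.
optimized["rank"] = pd.to_numeric(optimized["rank"], downcast="unsigned")
# Few distinct values repeated many times: store as category.
optimized["region"] = optimized["region"].astype("category")
# float32 halves the storage; check that the precision is acceptable first.
optimized["streams"] = pd.to_numeric(optimized["streams"], downcast="float")

before = int(raw.memory_usage(deep=True).sum())
after = int(optimized.memory_usage(deep=True).sum())
print(optimized.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```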


### e. Sampling

Sampling is a useful method for dealing with big data when analyzing the entire dataset is impractical due to its size. Rather than analyzing the entire dataset, sampling involves selecting a representative subset for analysis. Here’s an example of how to use the Sampling strategy in dask:

In [None]:
# Record start time
start_time = time.time()

# Start the CPU usage monitor
cpu_usage = psutil.cpu_percent(interval=1)

#----------------------------------
# Sample 10% of the dataset
sampled_df = df_strategies.sample(frac=0.1)
#----------------------------------

# Stop the CPU usage monitor
cpu_usage = format_cpu_usage(psutil.cpu_percent(interval=None))

# Calculate the elapsed time
elapsed_time = format_elapsed_time(time.time() - start_time)

# Memory usage of the original pandas `df` (not of `sampled_df`)
memory_usage = format_memory_usage(df.memory_usage(deep=True).sum())

# Add the result to the dictionary
results['Sampling'].append((memory_usage, cpu_usage, elapsed_time))

#print result
print(results['Sampling'])

[('13.1 GB', '0.00%', '00:01')]


Here we use the sample() method to randomly select 10% of the rows from the dataset.

The frac parameter specifies the fraction of rows to return, which can be a float between 0 and 1. For example, frac=0.1 returns 10% of the rows.
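The same parameter exists in pandas. The sketch below samples 10% of a made-up 1000-row dataframe; `random_state` is an optional argument, added here as an assumption so that the sample is reproducible.

```python
import pandas as pd

# Made-up dataframe with 1000 rows.
df_demo = pd.DataFrame({"rank": range(1, 1001)})

# frac=0.1 keeps 10% of the rows; random_state fixes the random selection.
sampled = df_demo.sample(frac=0.1, random_state=42)

print(len(sampled))
print(sampled.head())
```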

# 4. Comparative Analysis



### Traditional way of Reading Big data

| Strategy | CPU Usage (%) | Time Taken (seconds) | Memory Usage |
|---|---|---|---|
| Traditional Method | 2.00 | 107 | 13.1 GB |


### Strategies for Big Datasets (Dask Integration)

| Strategy | CPU Usage (%) | Time Taken (seconds) | Memory Usage |
|---|---|---|---|
| Chunking | 66.60 | 257 | 179.7 MB (last chunk) |
| Dask | 39.80 | 2 | N/A |
| Load Less Data | 100.00 | 1 | N/A |
| Optimize Data Types | 41.20 | 1 | 13.1 GB |
| Sampling | 0.00 | 1 | 13.1 GB |
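A comparison table like this can be assembled from the `results` dictionary. The sketch below is one possible way to do it; the dictionary holds three entries copied from the outputs recorded in this notebook, and the column names are chosen for illustration.

```python
import pandas as pd

# Entries copied from the outputs printed earlier in this notebook.
results = {
    "Tradisional Way": [("13.1 GB", "2.00%", "01:47")],
    "Optimize Data Types": [("13.1 GB", "41.20%", "00:01")],
    "Sampling": [("13.1 GB", "0.00%", "00:01")],
}

# One row per recorded measurement: (strategy, memory, cpu, time).
rows = [
    (strategy, *entry)
    for strategy, entries in results.items()
    for entry in entries
]
summary = pd.DataFrame(
    rows, columns=["Strategy", "Memory Usage", "CPU Usage", "Time Taken"]
)
print(summary.to_string(index=False))
```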

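### Analysis
- **Chunking** took the longest (257 seconds) and used a lot of CPU (66.60%), partly because each chunk was also written back to disk as a separate CSV file.

- **Dask** recorded the shortest times. Because Dask evaluates lazily, these timings mostly reflect building the task graph and not reading the data, and memory usage was not computed.
- `Load Less Data with Dask` was the fastest but showed 100% CPU usage, which may not be sustainable.
- `Optimizing Data Types with Dask` is expected to reduce memory usage, but the 13.1 GB recorded here was measured on the original pandas `df`, so the reduction is not visible in the table. It did not significantly affect the time taken.
- `Sampling with Dask` shows no improvement in the recorded figures, for the same reason.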

# 5. Conclusion



1. Dask:

  -   Efficiency: Dask shines when dealing with large datasets. Its parallel and distributed computing capabilities allow it to efficiently process data across multiple cores or even clusters. By breaking down computations into smaller tasks, Dask minimizes bottlenecks and maximizes resource utilization.
  - Memory Optimization: Dask intelligently manages memory usage. Unlike traditional methods that load the entire dataset into memory, Dask operates on smaller chunks, reducing the risk of memory exhaustion. It dynamically spills data to disk when needed, ensuring smooth execution even with limited RAM.

2. Load Less Data with Dask:
  - Speed: This strategy prioritizes speed by loading only the necessary data. However, it comes at the cost of high CPU usage.
  - Use Case: When you need rapid insights from a massive dataset and can tolerate short bursts of high CPU load, loading less data with Dask is a pragmatic choice.
3. Optimize Data Types with Dask:
  - Memory Reduction: Dask allows you to optimize data types (e.g., using int32 instead of int64, or using categorical data). By reducing memory footprint, you gain efficiency without compromising accuracy. This strategy is especially valuable when memory constraints are critical.
  - Trade-Off: While it won’t significantly impact processing time, it pays off in memory savings. Consider it a low-hanging fruit for memory optimization.

4. Sampling with Dask:
  - Quick Insights: Sampling provides a glimpse into the dataset without processing the entire thing. It’s useful for exploratory analysis, hypothesis testing, or initial model building. However, it doesn’t fundamentally change resource usage.
  - Limitations: Sampling may miss rare events or outliers, so use it judiciously. For statistical confidence, consider larger samples or other strategies.

In summary, Dask's flexibility, memory management, and scalability make it a powerful tool for big data. Choosing the best strategy takes time and some trial and error to find which strategies suit the dataset and how to apply them.

## References
- https://www.geeksforgeeks.org/how-to-check-the-execution-time-of-python-script/
- https://www.geeksforgeeks.org/introduction-to-dask-in-python/
- https://www.coiled.io/blog/dask-dtype-astype