<a href="https://colab.research.google.com/github/drshahizan/Python-big-data/blob/main/assignment/ass6/bdm%20/F2/Assignment6_F2(Big_Data_Handling).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Brewery Operations and Market Analysis Dataset

This dataset presents an extensive collection of data from a craft beer brewery, spanning from January 2020 to January 2024. It encapsulates a rich blend of brewing parameters, sales data, and quality assessments, providing a holistic view of the brewing process and its market implications.

The dataset is 1.06 GB when compressed (zip) and 4.25 GB after unzipping.

[Kaggle Link](https://www.kaggle.com/datasets/ankurnapa/brewery-operations-and-market-analysis-dataset)

This assignment demonstrates five strategies for handling large datasets effectively:

- Parallelize with Dask
- Use Chunking
- Use Sampling
- Load Less Data
- Optimize Data Types

**Assignment Group Members**
1. Thong Yee Moon MCS231001
2. Lye Kah Hooi MCS231010

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Upload the kaggle Token
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (2).json


{'kaggle (2).json': b'{"username":"lyekahhooi","key":"<redacted>"}'}

## Upload Dataset and Import Library

In [3]:
! pip install kaggle --quiet
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets list

ref                                                                 title                                             size  lastUpdated          downloadCount  voteCount  usabilityRating  
------------------------------------------------------------------  -----------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
thedrcat/daigt-v2-train-dataset                                     DAIGT V2 Train Dataset                            29MB  2023-11-16 01:38:36           1991        192  1.0              
muhammadbinimran/housing-price-prediction-data                      Housing Price Prediction Data                    763KB  2023-11-21 17:56:32           8950        154  1.0              
thedevastator/books-sales-and-ratings                               Books Sales and Ratings                           53KB  2023-12-06 04:54:33           2145         30  1.0          

In [4]:
!kaggle datasets download -d ankurnapa/brewery-operations-and-market-analysis-dataset


brewery-operations-and-market-analysis-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [5]:
!unzip -u "/content/brewery-operations-and-market-analysis-dataset.zip"

Archive:  /content/brewery-operations-and-market-analysis-dataset.zip


In [6]:
!pip install ipython-autotime

# Load the autotime extension to display cell execution time
%load_ext autotime

time: 440 µs (started: 2023-12-17 04:32:57 +00:00)


In [7]:
import pandas as pd
import numpy as np
import random
from sys import getsizeof
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import time

time: 2.6 s (started: 2023-12-17 04:33:00 +00:00)


In [8]:
%%time
# Cell magic (must be the first line of the cell) that measures the execution time of the code cell.

df = pd.read_csv("/content/brewery_data_complete_extended.csv")

CPU times: user 48.9 s, sys: 10.3 s, total: 59.1 s
Wall time: 1min 13s
time: 1min 13s (started: 2023-12-17 04:33:05 +00:00)


In [9]:
df.head(2)

Unnamed: 0,Batch_ID,Brew_Date,Beer_Style,SKU,Location,Fermentation_Time,Temperature,pH_Level,Gravity,Alcohol_Content,Bitterness,Color,Ingredient_Ratio,Volume_Produced,Total_Sales,Quality_Score,Brewhouse_Efficiency,Loss_During_Brewing,Loss_During_Fermentation,Loss_During_Bottling_Kegging
0,7870796,2020-01-01 00:00:19,Wheat Beer,Kegs,Whitefield,16,24.204251,5.289845,1.039504,5.370842,20,5,1:0.32:0.16,4666,2664.759345,8.577016,89.195882,4.104988,3.235485,4.663204
1,9810411,2020-01-01 00:00:31,Sour,Kegs,Whitefield,13,18.086763,5.275643,1.059819,5.096053,36,14,1:0.39:0.24,832,9758.801062,7.420541,72.480915,2.676528,4.246129,2.044358


time: 29.8 ms (started: 2023-12-17 04:34:23 +00:00)


In [10]:
df.shape

(10000000, 20)

time: 4.03 ms (started: 2023-12-17 04:34:26 +00:00)


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 20 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   Batch_ID                      int64  
 1   Brew_Date                     object 
 2   Beer_Style                    object 
 3   SKU                           object 
 4   Location                      object 
 5   Fermentation_Time             int64  
 6   Temperature                   float64
 7   pH_Level                      float64
 8   Gravity                       float64
 9   Alcohol_Content               float64
 10  Bitterness                    int64  
 11  Color                         int64  
 12  Ingredient_Ratio              object 
 13  Volume_Produced               int64  
 14  Total_Sales                   float64
 15  Quality_Score                 float64
 16  Brewhouse_Efficiency          float64
 17  Loss_During_Brewing           float64
 18  Loss_During_Fermentat

## Initial Dataframe Size

In [12]:
# getsizeof(df) reports the DataFrame's memory footprint in bytes; convert to GB
initial_size = getsizeof(df)/(1024.0**3)
print('Initial Dataframe size: %2.2f GB'%initial_size)

# OR
# initial_size = df.memory_usage(deep=True).sum()/(1024.0**3)
# print('Initial Dataframe size: %2.2f GB'%initial_size)

Initial Dataframe size: 4.25 GB
time: 5.47 s (started: 2023-12-17 04:34:31 +00:00)


## Method 1 - Parallelize with Dask

This method exploits the fact that the machine has more than one core. We use Dask, an open-source Python project that parallelizes NumPy and Pandas. Under the hood, a Dask DataFrame consists of many Pandas DataFrames that are manipulated in parallel. Because most of the Pandas API is implemented, Dask has a very similar look and feel, which makes it easy to use for anyone who knows Pandas.

In [13]:
cluster = LocalCluster()
client = Client(cluster)
client

INFO:distributed.http.proxy:To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:distributed.scheduler:State start
INFO:distributed.diskutils:Found stale lock file and directory '/tmp/dask-scratch-space/scheduler-r6o1jy6r', purging
INFO:distributed.scheduler:  Scheduler at:     tcp://127.0.0.1:44375
INFO:distributed.scheduler:  dashboard at:  http://127.0.0.1:8787/status
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:38077'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:45695'
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:39257', name: 1, status: init, memory: 0, processing: 0>
INFO:distributed.scheduler:Starting worker compute stream, tcp://127.0.0.1:39257
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:32940
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:33215', name: 0, status: init, memory: 0, proc

Client:
- Connection method: Cluster object
- Cluster type: distributed.LocalCluster
- Dashboard: http://127.0.0.1:8787/status

Cluster:
- Workers: 2
- Total threads: 2
- Total memory: 12.67 GiB
- Status: running
- Using processes: True

Scheduler:
- Comm: tcp://127.0.0.1:44375
- Dashboard: http://127.0.0.1:8787/status
- Started: Just now

Worker (Comm: tcp://127.0.0.1:33215):
- Total threads: 1
- Memory: 6.34 GiB
- Dashboard: http://127.0.0.1:34345/status
- Nanny: tcp://127.0.0.1:38077
- Local directory: /tmp/dask-scratch-space/worker-zxnyoxjm

Worker (Comm: tcp://127.0.0.1:39257):
- Total threads: 1
- Memory: 6.34 GiB
- Dashboard: http://127.0.0.1:34235/status
- Nanny: tcp://127.0.0.1:45695
- Local directory: /tmp/dask-scratch-space/worker-zfo4a71w


time: 3.91 s (started: 2023-12-17 04:34:43 +00:00)


In [14]:
# Define the file path
file_path = "/content/brewery_data_complete_extended.csv"

time: 593 µs (started: 2023-12-17 04:34:50 +00:00)


In [15]:
# Pandas
start_time_pandas = time.time()
df_pandas = pd.read_csv(file_path)
end_time_pandas = time.time()
time_pandas = end_time_pandas - start_time_pandas
memory_pandas = df_pandas.memory_usage(deep=True).sum() / (1024.0 ** 3)
size_pandas = getsizeof(df_pandas)

# Display results
print("Pandas:")
print(f"  Time: {time_pandas:.4f} seconds")
print(f"  Memory Usage: {memory_pandas:.2f} GB")
print(f"  Dataframe Size: {size_pandas / (1024.0 ** 3):.2f} GB")

INFO:distributed.core:Event loop was unresponsive in Scheduler for 3.88s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

Pandas:
  Time: 65.5799 seconds
  Memory Usage: 4.25 GB
  Dataframe Size: 4.25 GB
time: 1min 22s (started: 2023-12-17 04:34:55 +00:00)




In [16]:
# Dask
start_time_dask = time.time()
ddf_dask = dd.read_csv(file_path)
end_time_dask = time.time()
time_dask = end_time_dask - start_time_dask
memory_dask = ddf_dask.memory_usage(deep=True).sum() / (1024.0 ** 3)
size_dask = getsizeof(ddf_dask)

# Display results
print("\nDask:")
print(f"  Time: {time_dask:.4f} seconds")
print(f"  Memory Usage: {memory_dask.compute():.2f} GB")
print(f"  Dataframe Size: {size_dask / (1024.0 ** 3):.2f} GB")


Dask:
  Time: 0.0210 seconds
  Memory Usage: 4.25 GB
  Dataframe Size: 0.00 GB
time: 1min 13s (started: 2023-12-17 04:36:34 +00:00)


In [17]:
# Calculate speedup factors (Pandas vs Dask)
speedup_factor_time = time_pandas / time_dask
speedup_factor_memory = memory_pandas / memory_dask
speedup_factor_size = size_pandas / size_dask

# Display results
print("\nDask is approximately {:.2f} times faster than Pandas in terms of time.".format(speedup_factor_time))
print("Dask uses approximately {:.2f} times less memory than Pandas.".format(speedup_factor_memory))
print("Dask's dataframe size is {:.2f} times smaller than Pandas.".format(speedup_factor_size))


Dask is approximately 3116.75 times faster than Pandas in terms of time.
Dask uses approximately 1.00 times less memory than Pandas.
Dask's dataframe size is 95057191.65 times smaller than Pandas.
time: 1min 6s (started: 2023-12-17 04:38:10 +00:00)


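Measured this way, Dask appears **3116.75 times faster** than Pandas, but the comparison is not like-for-like. `dd.read_csv` is lazy: the 0.0210 seconds covers only building the task graph, and `getsizeof(ddf_dask)` measures that small graph object rather than the data. The rows are read only when a result is requested with `.compute()`, which is why cell 16 as a whole still took 1min 13s to report the same 4.25 GB of memory usage.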



## Method 2 - Chunking

Reading a large dataset in its entirety can be time-consuming. By loading data in smaller chunks, we can start processing and analyzing the data sooner, rather than waiting for the entire dataset to be loaded.

In [18]:
# Chunking
start_time_chunking = time.time()

# Set the chunk size for reading the CSV file
chunk_size = 10000

# Read the CSV file in chunks
df_chunks = pd.read_csv(file_path, chunksize=chunk_size)

# Initialize variables to store memory usage and size
total_memory_usage = 0
total_size = 0

# Iterate through the chunks to perform data processing operations
for chunk in df_chunks:
    # Your data processing operations go here

    # Calculate memory usage for the current chunk
    chunk_memory_usage = chunk.memory_usage(deep=True).sum()

    # Sum up memory usage and size for each chunk
    total_memory_usage += chunk_memory_usage
    total_size += getsizeof(chunk)

end_time_chunking = time.time()

# Calculate metrics for chunking
time_chunking = end_time_chunking - start_time_chunking
memory_chunking = total_memory_usage / (1024.0 ** 3)
size_chunking = total_size

# Display results for chunking
print("\nChunking:")
print(f"  Time: {time_chunking:.4f} seconds")
print(f"  Total Memory Usage: {memory_chunking:.2f} GB")
print(f"  Total Dataframe Size: {size_chunking / (1024.0 ** 3):.2f} GB")



Chunking:
  Time: 76.7790 seconds
  Total Memory Usage: 4.25 GB
  Total Dataframe Size: 4.25 GB
time: 1min 16s (started: 2023-12-17 04:39:30 +00:00)


In [19]:
# Calculate speedup factors (Pandas_before chunking vs Pandas_after chunking)
speedup_factor_time = time_pandas / time_chunking
speedup_factor_memory = memory_pandas / memory_chunking
speedup_factor_size = size_pandas / size_chunking

# Display results
print("\nChunking is approximately {:.2f} times faster in terms of time.".format(speedup_factor_time))
print("Chunking uses approximately {:.2f} times less memory.".format(speedup_factor_memory))
print("Chunking's dataframe size is {:.2f} times smaller.".format(speedup_factor_size))


Chunking is approximately 0.72 times faster in terms of time.
Chunking uses approximately 1.00 times less memory.
Chunking's dataframe size is 1.00 times smaller.
time: 4.17 ms (started: 2023-12-17 03:48:17 +00:00)


In this run, reading the file in chunks took 76.8 seconds, compared with 65.6 seconds for a single `pd.read_csv` call, so chunking was slower. Each chunk carries its own overhead: reading that portion from disk, building a DataFrame for it, performing any specified operations on it, and then combining or aggregating the results. Note also that the 4.25 GB reported above is the sum over all chunks; only one chunk of 10,000 rows is held in memory at a time, which is the actual benefit of chunking.

In contrast, loading the entire dataset at once without chunking might be more efficient for certain operations, especially if the dataset is not significantly larger than the available memory. In this case, the entire dataset can be read and processed in a more sequential and streamlined manner.

It's important to note that the benefits of chunking are often more pronounced when dealing with datasets that are significantly larger than the available memory, as chunking allows us to work with data in a memory-efficient manner. In scenarios where the entire dataset easily fits into memory, chunking might not provide a substantial performance improvement and can even introduce some additional overhead.
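The memory benefit of chunking comes from aggregating as you go instead of keeping every chunk. A minimal sketch, assuming a small synthetic CSV held in memory (hypothetical values; only the column names come from the dataset), computes a mean of `Total_Sales` from running totals and tracks the largest single chunk:

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV (hypothetical values).
rows = 50_000
demo = pd.DataFrame({
    "Beer_Style": np.tile(["Ale", "Stout", "Sour", "Wheat Beer"], rows // 4),
    "Total_Sales": np.arange(rows, dtype="float64"),
})
csv_text = demo.to_csv(index=False)

running_sum = 0.0
running_count = 0
peak_chunk_bytes = 0

# Each iteration holds only one 10,000-row chunk; earlier chunks are discarded.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10_000):
    running_sum += chunk["Total_Sales"].sum()
    running_count += len(chunk)
    chunk_bytes = int(chunk.memory_usage(deep=True).sum())
    peak_chunk_bytes = max(peak_chunk_bytes, chunk_bytes)

chunked_mean = running_sum / running_count
full_bytes = int(demo.memory_usage(deep=True).sum())

print(f"Mean Total_Sales: {chunked_mean:.2f}")
print(f"Largest chunk: {peak_chunk_bytes} bytes, full frame: {full_bytes} bytes")
```

Reporting the largest chunk rather than the sum over chunks gives a figure closer to the peak memory the loop needs.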

## Method 3 - Sampling

In [19]:
# Sample 70% of the dataset
sampling_percentage = 0.7
start_time_sampling = time.time()
df_sampled = df_pandas.sample(frac=sampling_percentage, random_state=42)
end_time_sampling = time.time()
time_sampling = end_time_sampling - start_time_sampling
memory_sampling = df_sampled.memory_usage(deep=True).sum() / (1024.0 ** 3)
size_sampling = getsizeof(df_sampled)




time: 28.5 s (started: 2023-12-17 04:41:08 +00:00)




In [20]:
# Calculate speedup factors (Pandas_before sampling vs Pandas_after sampling)
speedup_factor_time_sampling = time_pandas / time_sampling
speedup_factor_memory_sampling = memory_pandas / memory_sampling
speedup_factor_size_sampling = size_pandas/ size_sampling

# Display results
print("\nComparison before and after sampling using Pandas:")
print("Before Sampling:")
print(f"  Time: {time_pandas:.4f} seconds")
print(f"  Memory Usage: {memory_pandas:.2f} GB")
print(f"  Dataframe Size: {size_pandas / (1024.0 ** 3):.2f} GB")

print(f"\nAfter Sampling ({sampling_percentage * 100}%):")
print(f"  Time: {time_sampling:.4f} seconds")
print(f"  Memory Usage: {memory_sampling:.2f} GB")
print(f"  Dataframe Size: {size_sampling / (1024.0 ** 3):.2f} GB")

print("\nSpeedup Factors:")
print(f"  Time: {speedup_factor_time_sampling:.2f} times faster")
print(f"  Memory Usage: {speedup_factor_memory_sampling:.2f} times less")
print(f"  Dataframe Size: {speedup_factor_size_sampling:.2f} times smaller")


Comparison before and after sampling using Pandas:
Before Sampling:
  Time: 65.5799 seconds
  Memory Usage: 4.25 GB
  Dataframe Size: 4.25 GB

After Sampling (70.0%):
  Time: 12.8338 seconds
  Memory Usage: 3.03 GB
  Dataframe Size: 3.03 GB

Speedup Factors:
  Time: 5.11 times faster
  Memory Usage: 1.40 times less
  Dataframe Size: 1.40 times smaller
time: 10.6 ms (started: 2023-12-17 04:41:56 +00:00)


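Sampling 70% of the rows reduces memory usage from 4.25 GB to 3.03 GB (1.40 times less). The **5 times faster** figure compares the 12.8 seconds needed to sample the already-loaded DataFrame with the 65.6 seconds needed to read the CSV, so it does not include the cost of loading the full dataset first.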
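If the goal is to avoid loading the full file at all, rows can be sampled while reading. The sketch below is an alternative to `df.sample`, not what the notebook does: it passes a callable to the `skiprows` parameter of `pd.read_csv`, so each data row is kept with probability 0.7. The CSV content and the helper name `skip_row` are hypothetical, and the resulting sample size is only approximately 70%.

```python
import io
import random

import pandas as pd

# Synthetic stand-in for the CSV (hypothetical values).
rows = 10_000
csv_text = "Batch_ID,Total_Sales\n" + "\n".join(
    f"{i},{i * 1.5}" for i in range(rows)
)

sampling_fraction = 0.7
rng = random.Random(42)  # seeded so the run is repeatable


def skip_row(row_index):
    # Row 0 is the header and is always kept; other rows are skipped
    # with probability 1 - sampling_fraction.
    return row_index > 0 and rng.random() > sampling_fraction


df_sampled = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row)
print(f"Kept {len(df_sampled)} of {rows} rows")
```

The file still has to be scanned line by line, but skipped rows are never parsed into the DataFrame, so memory use stays close to that of the sample.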

## Method 4 - Load Less Data

Only the selected columns are loaded, so the resulting DataFrame is smaller than the original.

In [21]:
df.head()

Unnamed: 0,Batch_ID,Brew_Date,Beer_Style,SKU,Location,Fermentation_Time,Temperature,pH_Level,Gravity,Alcohol_Content,Bitterness,Color,Ingredient_Ratio,Volume_Produced,Total_Sales,Quality_Score,Brewhouse_Efficiency,Loss_During_Brewing,Loss_During_Fermentation,Loss_During_Bottling_Kegging
0,7870796,2020-01-01 00:00:19,Wheat Beer,Kegs,Whitefield,16,24.204251,5.289845,1.039504,5.370842,20,5,1:0.32:0.16,4666,2664.759345,8.577016,89.195882,4.104988,3.235485,4.663204
1,9810411,2020-01-01 00:00:31,Sour,Kegs,Whitefield,13,18.086763,5.275643,1.059819,5.096053,36,14,1:0.39:0.24,832,9758.801062,7.420541,72.480915,2.676528,4.246129,2.044358
2,2623342,2020-01-01 00:00:40,Wheat Beer,Kegs,Malleswaram,12,15.539333,4.778016,1.037476,4.824737,30,10,1:0.35:0.16,2115,11721.087016,8.451365,86.322144,3.299894,3.10944,3.03388
3,8114651,2020-01-01 00:01:37,Ale,Kegs,Rajajinagar,17,16.418489,5.345261,1.052431,5.509243,48,18,1:0.35:0.15,3173,12050.177463,9.671859,83.09494,2.136055,4.634254,1.489889
4,4579587,2020-01-01 00:01:43,Stout,Cans,Marathahalli,18,19.144908,4.861854,1.054296,5.133625,57,13,1:0.46:0.11,4449,5515.077465,7.895334,88.625833,4.491724,2.183389,2.99063


time: 21.4 ms (started: 2023-12-17 04:42:12 +00:00)


In [22]:
# DROP 4 columns
# Loss_During_Brewing, Loss_During_Fermentation, Loss_During_Bottling_Kegging, Ingredient_Ratio

col_list = ["Batch_ID", "Brew_Date", "Beer_Style","SKU","Location","Fermentation_Time","Temperature","pH_Level","Gravity","Alcohol_Content",\
            "Bitterness","Color","Volume_Produced","Total_Sales","Quality_Score","Brewhouse_Efficiency"]

time: 730 µs (started: 2023-12-17 04:42:21 +00:00)


In [23]:
# Load Less Data
start_time_loadless = time.time()
df_loadless = pd.read_csv(file_path, usecols=col_list)
end_time_loadless = time.time()
time_loadless = end_time_loadless - start_time_loadless
memory_loadless = df_loadless.memory_usage(deep=True).sum() / (1024.0 ** 3)
size_loadless = getsizeof(df_loadless)




time: 1min 9s (started: 2023-12-17 04:42:28 +00:00)




In [24]:
# Calculate speedup factors (Pandas_before load less data vs Pandas_after load less data)
speedup_factor_time_loadless = time_pandas / time_loadless
speedup_factor_memory_loadless = memory_pandas / memory_loadless
speedup_factor_size_loadless = size_pandas/ size_loadless

# Display results
print("\nComparison before and after Load Less Data(Less 4 columns) using Pandas:")
print("Before Load Less Data:")
print(f"  Time: {time_pandas:.4f} seconds")
print(f"  Memory Usage: {memory_pandas:.2f} GB")
print(f"  Dataframe Size: {size_pandas / (1024.0 ** 3):.2f} GB")

print(f"\nAfter Load Less Data (Less 4 columns):")
print(f"  Time: {time_loadless:.4f} seconds")
print(f"  Memory Usage: {memory_loadless:.2f} GB")
print(f"  Dataframe Size: {size_loadless / (1024.0 ** 3):.2f} GB")

print("\nSpeedup Factors:")
print(f"  Time: {speedup_factor_time_loadless:.2f} times faster")
print(f"  Memory Usage: {speedup_factor_memory_loadless:.2f} times less")
print(f"  Dataframe Size: {speedup_factor_size_loadless:.2f} times smaller")


Comparison before and after Load Less Data(Less 4 columns) using Pandas:
Before Load Less Data:
  Time: 65.5799 seconds
  Memory Usage: 4.25 GB
  Dataframe Size: 4.25 GB

After Load Less Data (Less 4 columns):
  Time: 54.6219 seconds
  Memory Usage: 3.39 GB
  Dataframe Size: 3.39 GB

Speedup Factors:
  Time: 1.20 times faster
  Memory Usage: 1.25 times less
  Dataframe Size: 1.25 times smaller
time: 5.22 ms (started: 2023-12-17 04:43:47 +00:00)


In [25]:
df_loadless.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 16 columns):
 #   Column                Dtype  
---  ------                -----  
 0   Batch_ID              int64  
 1   Brew_Date             object 
 2   Beer_Style            object 
 3   SKU                   object 
 4   Location              object 
 5   Fermentation_Time     int64  
 6   Temperature           float64
 7   pH_Level              float64
 8   Gravity               float64
 9   Alcohol_Content       float64
 10  Bitterness            int64  
 11  Color                 int64  
 12  Volume_Produced       int64  
 13  Total_Sales           float64
 14  Quality_Score         float64
 15  Brewhouse_Efficiency  float64
dtypes: float64(7), int64(5), object(4)
memory usage: 1.2+ GB
time: 27.2 ms (started: 2023-12-17 04:43:57 +00:00)


In [26]:
required_size = getsizeof(df_loadless)/(1024.0**3)
memory_reduced = initial_size - required_size

print('Initial Dataframe size: %2.2f GB'%initial_size)
print('Dataframe size after reduced column: %2.2f GB'%required_size)
print ("Memory Reduced : %2.2f GB" %memory_reduced)
print ("Decreased by : {:.1f}%".format(100*(memory_reduced/initial_size)))

Initial Dataframe size: 4.25 GB
Dataframe size after reduced column: 3.39 GB
Memory Reduced : 0.86 GB
Decreased by : 20.2%
time: 5.52 s (started: 2023-12-17 04:44:16 +00:00)


Loading only the columns of interest reduces memory utilization:
- Dropping 4 columns saves 0.86 GB: the DataFrame occupies approximately 3.39 GB instead of 4.25 GB.
- This is a reduction of about 20% from the original memory.
- Loading fewer columns reduces memory utilization further.


## Method 5 - Optimize Data Types

Shrink numerical columns with smaller dtypes

Integer
- int8 can store integers from -128 to 127.
- int16 can store integers from -32768 to 32767.
- int32 can store integers from -2147483648 to 2147483647.
- int64 can store integers from -9223372036854775808 to 9223372036854775807.

Float
- float16 --> 16bit: 0.1235 (about 3 significant decimal digits)
- float32 --> 32bit: 0.12345679 (about 7 significant decimal digits)
- float64 --> 64bit: 0.12345678912121212 (about 15 to 16 significant decimal digits)

Object
- Columns with few distinct string values can be stored as category.
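The ranges and precisions listed above can be read directly from NumPy, and a quick check shows why range alone does not decide whether a smaller float type is safe. This is a small illustrative sketch; the `Total_Sales` value is taken from the `df.head()` output earlier in the notebook.

```python
import numpy as np

# Integer types: the representable range.
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    print(f"{int_type.__name__}: {info.min} to {info.max}")

# Float types: the largest value and the approximate decimal precision.
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    print(f"{float_type.__name__}: max {info.max}, ~{info.precision} decimal digits")

# A value can fit the float16 range and still lose most of its digits.
total_sales = 9758.801062
as_float16 = float(np.float16(total_sales))
as_float32 = float(np.float32(total_sales))
print(f"original: {total_sales}, float16: {as_float16}, float32: {as_float32}")
```

Here the float16 value is rounded to the nearest multiple of 8, so a downcast that checks only minimum and maximum can change sales figures noticeably; float32 keeps the value to within about a thousandth.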

### Object / String

- Convert an 'object' data type to "category" type
- Convert string to datetime
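Why the category type saves memory can be seen on a small example: each distinct string is stored once and the column holds small integer codes. The sketch below uses synthetic data (the beer style names appear in the dataset, the row count is hypothetical) and compares deep memory usage before and after conversion.

```python
import pandas as pd

# 100,000 rows but only 4 distinct values, stored as Python strings.
styles = pd.Series(["Ale", "Stout", "Sour", "Wheat Beer"] * 25_000, dtype="object")
as_category = styles.astype("category")

object_bytes = styles.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
print(f"categories: {list(as_category.cat.categories)}")
```

The saving depends on the number of distinct values being small relative to the number of rows. For a column where nearly every value is unique, such as an identifier, the category type typically saves little or nothing, so it is worth measuring each column separately.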

In [27]:
# Convert string to datetime64[ns]
df['Brew_Date'] = pd.to_datetime(df['Brew_Date'])

time_size = getsizeof(df)/(1024.0**3)
memory_reduced2 = initial_size-time_size

print ("Initial Dataframe Size : %2.2f GB" %initial_size)
print ("Memory after optimize datetime : %2.2f GB" %time_size)
print ("Memory Reduced : %2.2f GB" %memory_reduced2)
print ("Decreased by : {:.1f}%".format(100*(memory_reduced2/initial_size)))

Initial Dataframe Size : 4.25 GB
Memory after optimize datetime : 3.62 GB
Memory Reduced : 0.63 GB
Decreased by : 14.9%
time: 10.5 s (started: 2023-12-17 04:44:32 +00:00)




In [28]:
# Convert the object columns Beer_Style, SKU and Location to category
# (Batch_ID is int64, not object, but is converted here as well)
df['Batch_ID'] = df['Batch_ID'].astype('category')
df['Beer_Style'] = df['Beer_Style'].astype('category')
df['SKU'] = df['SKU'].astype('category')
df['Location'] = df['Location'].astype('category')

object_size=getsizeof(df)/(1024.0**3)
memory_reduced3=time_size-object_size

print ("Memory after optimize object %2.2f GB" %object_size)
print ("Memory Reduced : %2.2f GB" %memory_reduced3)
print ("Decreased by : {:.1f}%".format(100*(memory_reduced3/time_size)))

Memory after optimize object 2.14 GB
Memory Reduced : 1.47 GB
Decreased by : 40.7%
time: 17.2 s (started: 2023-12-17 04:45:03 +00:00)


In [29]:
df.dtypes

Batch_ID                              category
Brew_Date                       datetime64[ns]
Beer_Style                            category
SKU                                   category
Location                              category
Fermentation_Time                        int64
Temperature                            float64
pH_Level                               float64
Gravity                                float64
Alcohol_Content                        float64
Bitterness                               int64
Color                                    int64
Ingredient_Ratio                        object
Volume_Produced                          int64
Total_Sales                            float64
Quality_Score                          float64
Brewhouse_Efficiency                   float64
Loss_During_Brewing                    float64
Loss_During_Fermentation               float64
Loss_During_Bottling_Kegging           float64
dtype: object

time: 6.33 ms (started: 2023-12-17 04:45:30 +00:00)


In [30]:
# Reduction relative to the initial size, so it includes the datetime conversion as well
Total_object_reduction = initial_size-object_size

print ("Total memory reduced by datetime and category conversion : %2.2f GB" %Total_object_reduction)
print ("Decreased by : {:.1f}%".format(100*(Total_object_reduction/initial_size)))

Total memory reduced by datetime and category conversion : 2.11 GB
Decreased by : 49.6%
time: 2.69 ms (started: 2023-12-17 04:51:41 +00:00)


### Integer

In [31]:
for col in df.columns:
  col_type=df[col].dtype
  # Only plain integer columns are downcast; category and datetime columns are skipped
  if str(col_type)[:3] == 'int':
    col_min=df[col].min()
    col_max=df[col].max()
    # Pick the narrowest signed type whose range covers [col_min, col_max], bounds included
    if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
      df[col]=df[col].astype(np.int8)
    elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
      df[col]=df[col].astype(np.int16)
    elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
      df[col]=df[col].astype(np.int32)
    # Otherwise the column already needs int64 and is left unchanged

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 20 columns):
 #   Column                        Dtype         
---  ------                        -----         
 0   Batch_ID                      category      
 1   Brew_Date                     datetime64[ns]
 2   Beer_Style                    category      
 3   SKU                           category      
 4   Location                      category      
 5   Fermentation_Time             int8          
 6   Temperature                   float64       
 7   pH_Level                      float64       
 8   Gravity                       float64       
 9   Alcohol_Content               float64       
 10  Bitterness                    int8          
 11  Color                         int8          
 12  Ingredient_Ratio              object        
 13  Volume_Produced               int16         
 14  Total_Sales                   float64       
 15  Quality_Score                 f

In [32]:
int_size=getsizeof(df)/(1024.0**3)
memory_reduced4=object_size-int_size
print ("Memory after optimize integer %2.2f GB" %int_size)
print ("Memory Reduced : %2.2f GB" %memory_reduced4)
print ("Decreased by : {:.1f}%".format(100*(memory_reduced4/object_size)))

Memory after optimize integer 1.89 GB
Memory Reduced : 0.25 GB
Decreased by : 11.7%
time: 1.2 s (started: 2023-12-17 04:52:13 +00:00)
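Pandas also offers a built-in way to choose the narrowest numeric type. The sketch below is an alternative to the manual loop, not the notebook's method: it uses the `downcast` parameter of `pd.to_numeric` on a few values copied from the `df.head()` output. Note that `downcast="float"` stops at float32 and never produces float16.

```python
import pandas as pd

# A few rows copied from the df.head() output shown earlier.
demo = pd.DataFrame({
    "Fermentation_Time": [16, 13, 12],
    "Volume_Produced": [4666, 832, 2115],
    "Temperature": [24.204251, 18.086763, 15.539333],
})

# Integers are downcast to the smallest signed type that holds every value.
for col in ("Fermentation_Time", "Volume_Produced"):
    demo[col] = pd.to_numeric(demo[col], downcast="integer")

# Floats are downcast to float32 at most.
demo["Temperature"] = pd.to_numeric(demo["Temperature"], downcast="float")

print(demo.dtypes)
```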


### Float

In [33]:
for col in df.columns:
  col_type=df[col].dtype
  if str(col_type)[:5] == 'float':
    col_min=df[col].min()
    col_max=df[col].max()
    # NOTE: this checks only the value range, not the precision.
    # float16 keeps roughly 3 significant digits, so columns such as
    # Total_Sales are rounded when they are downcast here.
    if col_min > np.finfo(np.float16).min and col_max < np.finfo(np.float16).max:
      df[col]=df[col].astype(np.float16)
    elif col_min > np.finfo(np.float32).min and col_max < np.finfo(np.float32).max:
      df[col]=df[col].astype(np.float32)
    else:
      df[col]=df[col].astype(np.float64)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 20 columns):
 #   Column                        Dtype         
---  ------                        -----         
 0   Batch_ID                      category      
 1   Brew_Date                     datetime64[ns]
 2   Beer_Style                    category      
 3   SKU                           category      
 4   Location                      category      
 5   Fermentation_Time             int8          
 6   Temperature                   float16       
 7   pH_Level                      float16       
 8   Gravity                       float16       
 9   Alcohol_Content               float16       
 10  Bitterness                    int8          
 11  Color                         int8          
 12  Ingredient_Ratio              object        
 13  Volume_Produced               int16         
 14  Total_Sales                   float16       
 15  Quality_Score                 f

In [34]:
float_size=getsizeof(df)/(1024.0**3)
memory_reduced5=int_size-float_size

print ("Memory after optimize float %2.2f GB" %float_size)
print ("Memory Reduced : %2.2f GB" %memory_reduced5)
print ("Decreased by : {:.1f}%".format(100*(memory_reduced5/int_size)))

Memory after optimize float 1.33 GB
Memory Reduced : 0.56 GB
Decreased by : 29.5%
time: 1.22 s (started: 2023-12-17 04:52:34 +00:00)
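
The float loop above selects `float16` whenever a column's minimum and maximum fit within its range. Range is not the only constraint: `float16` stores about three significant decimal digits, so larger values get rounded. The sketch below shows one possible safeguard, which is to downcast only when the round-trip error stays within a tolerance. The helper `safe_downcast_float`, the tolerance `rtol=1e-4`, and the sample values are all assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def safe_downcast_float(series, rtol=1e-4):
    """Return the narrowest float version of `series` that stays within `rtol`.

    Hypothetical helper; `rtol` is an assumed relative tolerance.
    """
    original = series.to_numpy(dtype=np.float64)
    for dtype in (np.float16, np.float32):
        # Skip types whose range cannot hold the values at all.
        if np.abs(original).max() > np.finfo(dtype).max:
            continue
        converted = series.astype(dtype)
        # Accept the type only if converting back reproduces the originals.
        if np.allclose(converted.to_numpy(dtype=np.float64), original,
                       rtol=rtol, atol=0.0):
            return converted
    return series

# Hypothetical sample values.
ph = pd.Series([4.5, 5.25, 5.0])               # exactly representable in float16
sales = pd.Series([12345.67, 1999.99, 250.5])  # too many digits for float16

print(safe_downcast_float(ph).dtype)
print(safe_downcast_float(sales).dtype)
print(np.float16(12345.67))  # rounded to the nearest float16 value
```

With these sample values the pH-like column is kept as `float16`, while the sales-like column falls back to `float32` because `float16` would round 12345.67 to 12344.0. The trade-off is a smaller memory saving in exchange for preserved values.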


### Total Memory Utilization Reduction

In [35]:
final_size=float_size
memory_optimized = initial_size - final_size

print ("Final memory : %2.2f GB" %final_size)
print ("Total memory reduced after optimize datatype : %2.2f GB" %memory_optimized)
print ("Decreased by : {:.1f}%".format(100*(memory_optimized/initial_size)))

Final memory : 1.33 GB
Total memory reduced after optimize datatype : 2.92 GB
Decreased by : 68.6%
time: 4.36 ms (started: 2023-12-17 04:53:01 +00:00)


## Conclusion & Summary



By employing these strategies, we can handle and analyze big data effectively despite limits on memory and computational resources. These approaches make it possible to extract meaningful insights from large datasets without exhausting the available resources.

The first table below summarizes the five strategies, and the second compares each of them against a traditional Pandas DataFrame (the control).



| No. | Action: Read File    | Pandas (Control) | Parallelize with Dask            | Chunking | Sampling (70% of dataset) | Load Less Data (4 fewer columns) | Optimize Data Type |
| --- | -------------------- | ---------------- | -------------------------------- | -------- | ------------------------- | -------------------------------- | ------------------ |
| 1   | Computation Time (s) | 65.5799          | 0.021                            | 76.779   | 12.8338                   | 54.6219                          | NA                 |
| 2   | Memory Usage (GB)    | 4.25             | 4.25                             | 4.25     | 3.03                      | 3.39                             | 1.33               |
| 3   | Dataframe Size (GB)  | 4.25             | 0 (due to Dask's lazy evaluation) | 4.25     | 3.03                      | 3.39                             | 1.33               |


| No. | Performance Metrics                    | Pandas (Control) | Parallelize with Dask | Chunking | Sampling (70% of dataset) | Load Less Data (4 fewer columns) | Optimize Data Type |
| --- | -------------------------------------- | ---------------- | --------------------- | -------- | ------------------------- | -------------------------------- | ------------------ |
| 1   | Speed-up relative to Pandas (times)    | -                | 3116.75               | 0.72     | 5.11                      | 1.2                              | NA                 |
| 2a  | Total memory usage reduced (GB)        | -                | 0                     | 0        | 1.22                      | 0.86                             | 2.92               |
| 2b  | Percentage of memory usage reduced (%) | -                | 0%                    | 0%       | 29%                       | 20%                              | 69%                |
| 3   | Total size reduced (GB)                | -                | NA                    | 0        | 1.22                      | 0.86                             | 2.92               |


In conclusion, the Dask read call returned much faster than Pandas (0.021 s versus 65.58 s). Both this timing and the reported Dask DataFrame size of "0.00 GB" are likely due to the lazy evaluation nature of Dask. Dask handles larger-than-memory datasets by breaking them into smaller partitions and deferring the operations on those partitions. It does not compute a result or load the entire dataset into memory until an action is explicitly triggered, so the cost of actually reading the data is paid later, when a computation is requested.

The total runtime with chunking is higher than without it. The benefits of chunking are most pronounced when a dataset is significantly larger than the available memory, because chunking keeps only one piece of the data in memory at a time. When the entire dataset fits into memory, chunking may not improve performance and can even add overhead, since each chunk has to be parsed and combined separately.
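
To make the memory argument concrete, here is a minimal sketch of chunked aggregation, assuming a small in-memory CSV in place of the real file. The contents of `csv_text` and the chunk size of 2 are hypothetical; only `Beer_Style` and `Volume_Produced` are column names taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the 4.25 GB file.
csv_text = (
    "Beer_Style,Volume_Produced\n"
    "IPA,100\n"
    "Stout,200\n"
    "IPA,300\n"
    "Lager,400\n"
    "Stout,500\n"
)

totals = {}
# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partial = chunk.groupby("Beer_Style")["Volume_Produced"].sum()
    # Merge the per-chunk sums into the running totals.
    for style, volume in partial.items():
        totals[style] = totals.get(style, 0) + int(volume)

print(totals)
```

Each chunk is aggregated and then discarded, which is why peak memory stays low. The extra loop and merge step is also the overhead that can make chunking slower when the data already fits in memory.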

Both Sampling and Load Less Data reduce runtime and memory usage, whereas Optimize Data Type gives the largest memory saving: a 69% reduction compared to the original dataset.
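
For reference, Load Less Data and Sampling can be combined in a few lines. The sketch below is illustrative only: the in-memory CSV, the choice of columns to keep, the `Site_A` value, and `random_state=42` are assumptions, while the 70% fraction mirrors the sampling ratio used in this assignment.

```python
import io
import pandas as pd

# Hypothetical 10-row CSV standing in for the full dataset.
rows = ["Batch_ID,Beer_Style,Location,Total_Sales"]
rows += [f"{i},IPA,Site_A,{100.0 + i}" for i in range(10)]
csv_text = "\n".join(rows)

# Load Less Data: parse only the columns needed for the analysis.
df_less = pd.read_csv(io.StringIO(csv_text),
                      usecols=["Beer_Style", "Total_Sales"])

# Sampling: keep a reproducible 70% random sample of the rows.
df_sample = df_less.sample(frac=0.7, random_state=42)

print(df_less.shape, df_sample.shape)
```

`usecols` avoids parsing the unused columns at all, while `sample` still requires the data to be loaded first. That difference may help explain why the two strategies save different amounts of time and memory.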