<a href="https://colab.research.google.com/github/drshahizan/Python-big-data/blob/main/assignment/ass6/hpdp/coconut/Assignment_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 6: Mastering Big Data Handling**

### **Group name: coconut**

**Group members:**

| Name         | Matric Number |
| :----------- | :-----------: |
| NG SUANG JOO | A21EC0102 |
| LING WAN YIN | A21EC0047 |


## **Pick a Big Dataset**

The chosen dataset supports an in-depth analysis of brewing variables, market sales patterns, and quality metrics in craft beer manufacturing. It covers January 2020 to January 2024 and originates from a craft beer brewery. By combining brewing variables, sales data, and quality assessments, it gives insight into the brewing procedures and their impact on the market. The dataset is around **2 GB** in size and has 20 columns.


**Dataset URL:**
[Brewery Operations and Market Analysis Dataset](https://www.kaggle.com/datasets/ankurnapa/brewery-operations-and-market-analysis-dataset)


## **Loading the Dataset**

1. Import the necessary library for file upload.

In [None]:
from google.colab import files

2. Upload the Kaggle API token file ('kaggle.json') using the file upload widget. The file 'kaggle.json' can be found under your account settings in Kaggle.

In [None]:
# Upload kaggle.json
uploaded = files.upload()

Saving kaggle.json to kaggle.json


3. Utilise the Kaggle API Token to extract the dataset.

In [None]:
# Move Kaggle API Token to the Correct Directory
!mkdir -p /root/.kaggle
!cp kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json

# Download the Kaggle dataset
!kaggle datasets download -d ankurnapa/brewery-operations-and-market-analysis-dataset

# Unzip the downloaded dataset
!unzip brewery-operations-and-market-analysis-dataset

Downloading brewery-operations-and-market-analysis-dataset.zip to /content
 99% 1.05G/1.06G [00:12<00:00, 150MB/s]
100% 1.06G/1.06G [00:12<00:00, 90.5MB/s]
Archive:  brewery-operations-and-market-analysis-dataset.zip
  inflating: brewery_data_complete_extended.csv  


4. Now the dataset file `'brewery_data_complete_extended.csv'` is ready to be used.

## **Strategies for Big Datasets**

### Load Less Data: Strategically load only the essential portions of the dataset to optimize memory usage.

Select only the necessary columns to reduce loading time and memory usage. In this case, only **'Batch_ID', 'Brew_Date', 'Beer_Style', 'Location', 'Fermentation_Time', 'Temperature', 'pH_Level', 'Gravity', 'Alcohol_Content'** are loaded.

In [None]:
import pandas as pd

file_path = 'brewery_data_complete_extended.csv'

# Load only necessary columns
columns_to_load = ['Batch_ID', 'Brew_Date', 'Beer_Style', 'Location', 'Fermentation_Time', 'Temperature', 'pH_Level', 'Gravity', 'Alcohol_Content']

# Load the data with the specified columns only
df = pd.read_csv(file_path, usecols=columns_to_load)

print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 9 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Batch_ID           int64  
 1   Brew_Date          object 
 2   Beer_Style         object 
 3   Location           object 
 4   Fermentation_Time  int64  
 5   Temperature        float64
 6   pH_Level           float64
 7   Gravity            float64
 8   Alcohol_Content    float64
dtypes: float64(4), int64(2), object(3)
memory usage: 686.6+ MB
None
   Batch_ID            Brew_Date  Beer_Style      Location  Fermentation_Time  \
0   7870796  2020-01-01 00:00:19  Wheat Beer    Whitefield                 16   
1   9810411  2020-01-01 00:00:31        Sour    Whitefield                 13   
2   2623342  2020-01-01 00:00:40  Wheat Beer   Malleswaram                 12   
3   8114651  2020-01-01 00:01:37         Ale   Rajajinagar                 17   
4   4579587  2020-01-01 00:01:43       Stout  Marathahalli   

### Use Chunking: Process the data in smaller pieces to avoid memory issues.

To prevent memory problems, process the data in smaller chunks. Here the chunk size is set to 10,000, so each iteration reads and processes 10,000 rows.

In [None]:
# Set the chunk size
chunk_size = 10000

# Create an iterator to read the dataset in chunks
chunk_iter = pd.read_csv(file_path, chunksize=chunk_size)

def process_chunk(chunk):
    summary_stats = chunk.describe()
    print(summary_stats)

# Process each chunk separately
for chunk in chunk_iter:
    # Perform operations on each chunk
    process_chunk(chunk)
    print("Memory usage:", chunk.memory_usage().sum() / (1024 ** 2), "MB")

Streaming output truncated to the last 5000 lines.
           Batch_ID  Fermentation_Time   Temperature      pH_Level  \
count  1.000000e+04       10000.000000  10000.000000  10000.000000   
mean   4.982953e+06          14.527500     20.039707      5.001536   
std    2.890618e+06           2.834475      2.887852      0.288532   
min    9.140000e+02          10.000000     15.000048      4.500071   
25%    2.449098e+06          12.000000     17.563204      4.751664   
50%    4.970160e+06          15.000000     20.043647      5.006650   
75%    7.510254e+06          17.000000     22.571473      5.251809   
max    9.999446e+06          19.000000     24.999833      5.499836   

            Gravity  Alcohol_Content    Bitterness         Color  \
count  10000.000000     10000.000000  10000.000000  10000.000000   
mean       1.055159         5.251425     39.699000     11.946700   
std        0.014447         0.430339     11.517155      4.365108   
min        1.030003         4.50
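
Printing `describe()` for every chunk yields per-chunk statistics only. The sketch below shows one way to combine chunks into a whole-dataset result by keeping running sums and counts. It assumes a tiny synthetic CSV held in memory (the column names mirror the dataset, the values are made up), and `mean_by_group_chunked` is a hypothetical helper name, not part of pandas.

```python
import io
import pandas as pd

def mean_by_group_chunked(csv_source, group_col, value_col, chunk_size):
    """Compute per-group means without holding the whole file in memory."""
    totals = None  # running per-group sum and count across chunks
    reader = pd.read_csv(csv_source, usecols=[group_col, value_col], chunksize=chunk_size)
    for chunk in reader:
        part = chunk.groupby(group_col)[value_col].agg(["sum", "count"])
        # Align on the group label; groups missing on one side count as 0
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals["sum"] / totals["count"]

# Synthetic stand-in for the brewery CSV (illustrative values only)
csv_text = "Beer_Style,Alcohol_Content\nAle,5.0\nStout,6.0\nAle,5.5\nStout,6.5\nIPA,7.0\n"
chunked_means = mean_by_group_chunked(
    io.StringIO(csv_text), "Beer_Style", "Alcohol_Content", chunk_size=2
)
print(chunked_means)
```

Only the running totals and one chunk are held at a time, so memory stays bounded regardless of file size. For the real file, `file_path` would replace the `io.StringIO` object.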

### Optimize Data Types: Fine-tune data types to maximize efficiency and minimize memory consumption.

In [None]:
df_opt = pd.read_csv(file_path)

# Downcast integer columns to the smallest integer type that fits their values
df_opt['Batch_ID'] = pd.to_numeric(df_opt['Batch_ID'], downcast='integer')
df_opt['Fermentation_Time'] = pd.to_numeric(df_opt['Fermentation_Time'], downcast='integer')
df_opt['Bitterness'] = pd.to_numeric(df_opt['Bitterness'], downcast='integer')
df_opt['Color'] = pd.to_numeric(df_opt['Color'], downcast='integer')
df_opt['Volume_Produced'] = pd.to_numeric(df_opt['Volume_Produced'], downcast='integer')

# Convert Brew_Date to datetime (assigned to df_opt, not to the earlier df)
df_opt['Brew_Date'] = pd.to_datetime(df_opt['Brew_Date'])

# Downcast float columns to the smallest float type (float32)
df_opt['Temperature'] = pd.to_numeric(df_opt['Temperature'], downcast='float')
df_opt['pH_Level'] = pd.to_numeric(df_opt['pH_Level'], downcast='float')
df_opt['Gravity'] = pd.to_numeric(df_opt['Gravity'], downcast='float')
df_opt['Alcohol_Content'] = pd.to_numeric(df_opt['Alcohol_Content'], downcast='float')
df_opt['Total_Sales'] = pd.to_numeric(df_opt['Total_Sales'], downcast='float')
df_opt['Quality_Score'] = pd.to_numeric(df_opt['Quality_Score'], downcast='float')
df_opt['Brewhouse_Efficiency'] = pd.to_numeric(df_opt['Brewhouse_Efficiency'], downcast='float')
df_opt['Loss_During_Brewing'] = pd.to_numeric(df_opt['Loss_During_Brewing'], downcast='float')
df_opt['Loss_During_Fermentation'] = pd.to_numeric(df_opt['Loss_During_Fermentation'], downcast='float')
df_opt['Loss_During_Bottling_Kegging'] = pd.to_numeric(df_opt['Loss_During_Bottling_Kegging'], downcast='float')


- To save memory, `pd.to_numeric(..., downcast='integer')` stores 'Batch_ID' in the smallest integer type that can hold its values.
- 'Volume_Produced', 'Color', 'Bitterness' and 'Fermentation_Time' are downcast to smaller integer types in the same way, which reduces their memory footprint.
- `pd.to_datetime()` converts the 'Brew_Date' field from strings to a datetime type, which represents date and time data compactly.
- 'Temperature', 'pH_Level', 'Gravity', 'Alcohol_Content', 'Total_Sales', 'Quality_Score', 'Brewhouse_Efficiency', 'Loss_During_Brewing', 'Loss_During_Fermentation', and 'Loss_During_Bottling_Kegging' are downcast with `downcast='float'` to the smallest float type, float32, so each value takes 4 bytes instead of 8.

- Integer downcasting preserves values exactly. float32 keeps roughly seven significant digits, so columns that need higher precision should stay float64.
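
As a rough illustration of the effect described above, the sketch below applies the same downcasting calls to a small synthetic frame and compares memory before and after. The data are made up, and converting 'Beer_Style' to `category` is an extra step that the notebook itself does not perform; it is shown only as one further option for low-cardinality text columns.

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for a few brewery columns (values are made up)
demo = pd.DataFrame({
    "Fermentation_Time": np.tile(np.arange(10, 20, dtype="int64"), 100),
    "Temperature": np.linspace(15.0, 25.0, 1000),
    "Beer_Style": ["Ale", "Stout"] * 500,
})

before = demo.memory_usage(deep=True).sum()

optimized = demo.copy()
# Values 10..19 fit in int8, so the column shrinks from 8 bytes to 1 byte per row
optimized["Fermentation_Time"] = pd.to_numeric(optimized["Fermentation_Time"], downcast="integer")
# float64 -> float32 halves the storage at the cost of some precision
optimized["Temperature"] = pd.to_numeric(optimized["Temperature"], downcast="float")
# Assumption: not in the original notebook; stores each distinct string once
optimized["Beer_Style"] = optimized["Beer_Style"].astype("category")

after = optimized.memory_usage(deep=True).sum()
print(optimized.dtypes)
print(f"{before} bytes -> {after} bytes")
```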

### Sampling: Implement sampling methodologies to extract meaningful insights from a subset of the dataset.

Working on a random sample is generally faster and lighter than working with the entire dataset. Here the sample size is 10,000, meaning that 10,000 rows are selected at random for processing.

In [None]:
# Note: the full dataset is read into memory before sampling
df_sampled = pd.read_csv(file_path)

# Randomly sample 10,000 rows (0.1% of the 10,000,000 rows)
sample_size = 10000
df_sampled = df_sampled.sample(sample_size)
print(df_sampled.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 9728485 to 4695067
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Batch_ID                      10000 non-null  int64  
 1   Brew_Date                     10000 non-null  object 
 2   Beer_Style                    10000 non-null  object 
 3   SKU                           10000 non-null  object 
 4   Location                      10000 non-null  object 
 5   Fermentation_Time             10000 non-null  int64  
 6   Temperature                   10000 non-null  float64
 7   pH_Level                      10000 non-null  float64
 8   Gravity                       10000 non-null  float64
 9   Alcohol_Content               10000 non-null  float64
 10  Bitterness                    10000 non-null  int64  
 11  Color                         10000 non-null  int64  
 12  Ingredient_Ratio              10000 non-null  object

### Parallelize with Dask: Dask is a powerful library that extends pandas to enable parallel and distributed computing. It's particularly useful for handling larger-than-memory datasets.

Dask handles larger-than-memory datasets by splitting them into partitions and evaluating operations lazily: building an expression only records the tasks, and the work runs, in parallel where possible, when `.compute()` is called.

In [None]:
# Import the Dask dataframe module as dd
import dask.dataframe as dd

# Read the dataset using Dask and create a Dask DataFrame (ddf)
ddf = dd.read_csv(file_path)

# Perform operations in parallel using Dask
result = ddf.groupby('Beer_Style').mean().compute()
print(result)



                Batch_ID  Fermentation_Time  Temperature  pH_Level   Gravity  \
Beer_Style                                                                     
Ale         5.000288e+06          14.501019    19.999627  4.999969  1.055006   
IPA         5.000012e+06          14.499067    19.996046  4.999830  1.054991   
Lager       5.002388e+06          14.501280    20.001902  5.000051  1.054986   
Pilsner     4.999372e+06          14.501048    19.997734  5.000042  1.055004   
Porter      4.999643e+06          14.497252    19.999347  4.999764  1.055005   
Sour        4.995978e+06          14.500461    20.003314  4.999869  1.055031   
Stout       5.000885e+06          14.503157    19.998394  4.999911  1.055006   
Wheat Beer  5.001434e+06          14.503903    20.002825  5.000089  1.054996   

            Alcohol_Content  Bitterness      Color  Volume_Produced  \
Beer_Style                                                            
Ale                5.249216   39.498093  11.999430      2

The output shows the mean of each numerical column for every 'Beer_Style' group: each row is one beer style, and each column holds the mean of one attribute within that style. This kind of analysis gives insight into the typical traits of each beer style in the dataset.

## **Comparative Analysis**
This section compares the memory usage and computation time of the traditional method with the five advanced strategies: loading less data, using chunking, optimizing data types, sampling, and parallelizing with Dask.



### Traditional Methods

In [None]:
df_ori = pd.read_csv(file_path)
print("Memory usage:", df_ori.memory_usage().sum() / (1024 ** 2), "MB")

Memory usage: 1525.8790283203125 MB


In [None]:
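import timeit

# Note: the lambda only returns the already-loaded DataFrame,
# so this times an object lookup, not the CSV load itself
time = timeit.timeit(lambda: df_ori, number=1)
print("Computation Time:", time, "seconds")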

Computation Time: 2.5809999897319358e-06 seconds


- **Memory usage:** 1525.88 MB

- **Computation Time:** 2.58e-06 seconds
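
A timing in the microsecond range indicates that only the lookup of an existing object was measured. The sketch below shows one way to time the load itself so that strategies can be compared on equal terms. It uses a synthetic in-memory CSV rather than the real file, and `time_call` is a hypothetical helper introduced here for illustration.

```python
import io
import time
import pandas as pd

def time_call(func):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

# Synthetic CSV text standing in for the real file (values are made up)
rows = "\n".join(f"{i},{i % 7},{i * 0.5}" for i in range(50_000))
csv_text = "Batch_ID,Fermentation_Time,Temperature\n" + rows

# Times the actual parsing work
loaded, load_seconds = time_call(lambda: pd.read_csv(io.StringIO(csv_text)))
# Times only returning an object that is already in memory
_, lookup_seconds = time_call(lambda: loaded)

print(f"load:   {load_seconds:.6f} s")
print(f"lookup: {lookup_seconds:.6f} s")
```

For the real dataset, wrapping `pd.read_csv(file_path, ...)` with the arguments of each strategy inside the timed call would give figures that reflect loading cost.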

### Advanced Strategies

#### 1. Load Less Data

In [None]:
print("Memory usage:", df.memory_usage().sum() / (1024 ** 2), "MB")

Memory usage: 686.6456298828125 MB


In [None]:
time_load_less_data = timeit.timeit(lambda:df, number=1)
print("Computation Time:", time_load_less_data, "seconds")

Computation Time: 1.1700001323333709e-06 seconds


- **Memory Usage:** 686.65 MB

- **Computation Time:** 1.17e-06 seconds





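>**Insights:**
Loading only the nine required columns cuts memory use from 1525.88 MB to 686.65 MB, a reduction of about 55%. The recorded times for both approaches are in the microsecond range because the timed call only returns an already-loaded DataFrame, so they do not show a meaningful difference in computation time.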
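
Column selection can also be combined with explicit types at read time, so that the smaller representation is used from the start. The following is a minimal sketch on a synthetic three-row CSV; the chosen dtypes (`int32`, `float32`, `category`) are assumptions for illustration and are not taken from the notebook.

```python
import io
import pandas as pd

# Synthetic stand-in for the brewery CSV (values are made up)
csv_text = (
    "Batch_ID,Brew_Date,Beer_Style,Temperature,Total_Sales\n"
    "1,2020-01-01 00:00:19,Wheat Beer,16.5,1000.0\n"
    "2,2020-01-01 00:00:31,Sour,13.2,2000.0\n"
    "3,2020-01-01 00:00:40,Wheat Beer,12.9,1500.0\n"
)

slim = pd.read_csv(
    io.StringIO(csv_text),
    # Columns not listed here (Total_Sales) are never loaded
    usecols=["Batch_ID", "Brew_Date", "Beer_Style", "Temperature"],
    # Assumed dtypes: pick types that fit the real value ranges
    dtype={"Batch_ID": "int32", "Beer_Style": "category", "Temperature": "float32"},
    parse_dates=["Brew_Date"],
)
print(slim.dtypes)
```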

#### 2. Use Chunking

In [None]:
# Note: this times returning `df` (the column-subset DataFrame), not the chunk loop
time = timeit.timeit(lambda: df, number=1)
print("Computation Time:", time, "seconds")

Computation Time: 2.2819999685452785e-06 seconds


- **Memory Usage:** about 1.53 MB per chunk (refer to the execution result under 'Strategies for Big Datasets' > 'Use Chunking')

- **Computation Time:** 2.28e-06 seconds

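>**Insights:** Breaking the data into 10,000-row chunks keeps only about 1.53 MB in memory at a time. The recorded time comes from returning the existing `df` rather than from re-running the chunk loop, so it does not measure the cost of chunked processing.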


#### 3. Optimize Data Types

In [None]:
print("Memory usage:", df_opt.memory_usage().sum() / (1024 ** 2), "MB")

Memory usage: 886.917236328125 MB


In [None]:
time_optimized = timeit.timeit(lambda: df_opt, number=1)
print("Computation Time:", time_optimized, "seconds")

Computation Time: 1.784999994924874e-06 seconds


- **Memory Usage:** 886.92 MB

- **Computation Time:** 1.78e-06 seconds

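>**Insights:** Downcasting reduces memory use from 1525.88 MB to 886.92 MB, about 42% less than the traditional method. The recorded time only reflects returning the already-loaded DataFrame, so it says nothing about the cost of the conversion.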

#### 4. Sampling


In [None]:
print("Memory usage:", df_sampled.memory_usage().sum() / (1024 ** 2), "MB")

Memory usage: 1.6021728515625 MB


In [None]:
time_sample = timeit.timeit(lambda: df_sampled, number=1)
print("Computation Time:", time_sample, "seconds")

Computation Time: 1.5960001746861963e-06 seconds


- **Memory Usage:** 1.60 MB

- **Computation Time:** 1.60e-06 seconds

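>**Insights:** The 10,000-row sample occupies only 1.60 MB, so later processing on it is fast and memory-efficient. The code reads the full dataset before sampling, however, so peak memory during loading is unchanged. Sampling works well for drawing conclusions from a portion of the data when the sample is representative.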
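
To keep memory low during loading as well, rows can be sampled while the file is being read. The sketch below uses the callable form of `skiprows` in `pd.read_csv` on a synthetic CSV. `read_random_sample` is a hypothetical helper, and the number of rows kept is only approximately `fraction` times the row count, not an exact sample size.

```python
import io
import random
import pandas as pd

def read_random_sample(csv_source, fraction, seed=0):
    """Keep each data row with probability `fraction` while reading."""
    rng = random.Random(seed)
    # Row 0 is the header and is always kept; other rows are skipped at random
    return pd.read_csv(
        csv_source,
        skiprows=lambda i: i > 0 and rng.random() > fraction,
    )

# Synthetic stand-in for the brewery CSV (values are made up)
rows = "\n".join(f"{i},{10 + i % 10}" for i in range(10_000))
csv_text = "Batch_ID,Fermentation_Time\n" + rows

sampled = read_random_sample(io.StringIO(csv_text), fraction=0.01)
print(len(sampled), "rows kept out of 10000")
```

Because skipped rows are never turned into a DataFrame, the full dataset does not need to fit in memory. The trade-off is that the file is still scanned from start to end.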

#### 5. Parallelize with Dask

In [None]:
print("Memory usage:", ddf.memory_usage(deep=True).sum().compute() / (1024 ** 2), "MB")

Memory usage: 4351.377654075623 MB


In [None]:
time_dask = timeit.timeit(lambda: ddf.compute(), number=1)
print("Computation Time:", time_dask, "seconds")

Computation Time: 55.063718380999944 seconds


- **Memory Usage:** 4351.38 MB

- **Computation Time:** 55.06 seconds

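>**Insights:** The reported memory usage is much higher, but it was measured with `deep=True`, which also counts the contents of string columns, whereas the other figures are shallow; the values are therefore not directly comparable. The 55.06 seconds covers reading the entire CSV and materializing it through `ddf.compute()`, so this setup shows no improvement in computation time.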
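
The Dask figure was obtained with `memory_usage(deep=True)`, while the other measurements use the default shallow mode. The sketch below, on a small synthetic frame, illustrates how much the two modes can differ for text columns, which is why memory figures are best compared using the same mode.

```python
import pandas as pd

# Synthetic frame with one numeric and one text column (values are made up)
demo = pd.DataFrame({
    "Batch_ID": range(1000),
    "Beer_Style": pd.Series(["Wheat Beer", "Stout"] * 500, dtype="object"),
})

# Shallow: object columns are counted as 8-byte references only
shallow_bytes = demo.memory_usage().sum()
# Deep: the Python string objects behind those references are counted too
deep_bytes = demo.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes / 1024 ** 2:.3f} MB")
print(f"deep:    {deep_bytes / 1024 ** 2:.3f} MB")
```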

## **Conclusion**
Comparing the conventional full load with the five strategies leads to the following conclusions:

- **Memory Usage:**

  "Use Chunking" (about 1.53 MB per 10,000-row chunk) and "Load Less Data" (686.65 MB) lowered memory use substantially against the 1525.88 MB full load, which makes them effective options when memory is the main constraint.
  "Optimize Data Types" also reduced the footprint, to 886.92 MB. "Sampling" left only 1.60 MB in the sampled DataFrame, although the full file was read before sampling, so peak memory during loading was not reduced.
  "Parallelize with Dask" reported the largest figure (4351.38 MB), but that value was measured with `deep=True` after materializing the whole dataset, so it is not directly comparable with the shallow figures of the other methods.

- **Computation Time:**

  The times recorded for the traditional method, "Load Less Data", "Use Chunking", "Optimize Data Types", and "Sampling" are all in the microsecond range because the timed call only returns a DataFrame that is already in memory. They do not capture loading or processing cost and should not be used to rank the methods. The Dask figure (55.06 seconds) is the only one that includes reading the CSV, since `ddf.compute()` loads and materializes the entire dataset; it reflects the cost of a full load rather than a cost of parallelization itself.

- **File Size:**

  "Load Less Data" and "Sampling" produce the smallest processed datasets, which is an advantage for storage and downstream analysis. "Sampling" in particular allows insights to be drawn from a small portion of the data, provided the sample is representative.

In conclusion, the chosen strategies offer a range of options for managing large data, so the approach can be tailored to the dataset and the available computing power. A sensible mix of techniques can yield notable gains in memory efficiency and makes it easier to analyze big datasets with limited resources. For example, combining chunking with loading fewer columns keeps both the number of rows and the number of columns in memory small. The final choice depends on balancing scalability, processing speed, and memory efficiency.