<a href="https://colab.research.google.com/github/drshahizan/Python-big-data/blob/main/assignment/ass6/hpdp/YW/ASSIGNMENT6_YW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 6: Mastering Big Data Handling
## Group Members
1. Yasmeen Natasha Binti Hafiz Shahrel A21EC0147
2. Sarah Wardina Binti Rafidin A21EC0128

# Introduction
In this assignment, you will explore the management of big data processing in data science. Big data processing involves the systematic handling and analysis of vast and complex datasets that exceed the capabilities of traditional data processing methods. It encompasses the storage, retrieval, and manipulation of massive volumes of information to extract valuable insights.

### **Step 1: Pick a Big Dataset**
Start by choosing a suitable dataset from a reputable source such as Kaggle, the UCI Machine Learning Repository, or another relevant dataset repository. Make sure it is large: over 700 MB.

Dataset: [Flight Status Prediction](https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022?select=Combined_Flights_2021.csv)

**About the Dataset**

The Flight Status Prediction dataset is a comprehensive repository of flight information, with details on cancellations, delays, and other data categorized by airline. The full dataset contains records dating back to January 2018, offering a rich historical perspective for analysis and prediction. This notebook uses the Combined_Flights_2021 subset, which covers flights from 2021.
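
The 700 MB requirement can be checked before anything is loaded. The following is a minimal sketch, assuming the CSV has already been downloaded; the helper name `file_size_mb` and the constant `MIN_SIZE_MB` are illustrative and not part of the original notebook. It runs on a small temporary file so that it is self-contained.

```python
import os
import tempfile

MIN_SIZE_MB = 700  # threshold taken from the assignment brief


def file_size_mb(path: str) -> float:
    """Return the size of the file at `path` in megabytes (1 MB = 1024**2 bytes)."""
    return os.path.getsize(path) / 1024**2


# Demonstration on a small temporary file; in the notebook, pass the CSV path instead.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"x" * (2 * 1024**2))  # write exactly 2 MB
    demo_path = tmp.name

demo_size = file_size_mb(demo_path)
print(f"{demo_size:.1f} MB, large enough: {demo_size >= MIN_SIZE_MB}")
os.remove(demo_path)
```

In the notebook itself, the same helper could be called with the path passed to `pd.read_csv`.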

### **Step 2: Loading the Dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("drive/MyDrive/Colab Notebooks/Dataset/Combined_Flights_2021.csv")
df

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,...,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
0,2021-03-03,SkyWest Airlines Inc.,SGU,PHX,False,False,724,714.0,0.0,-10.0,...,724.0,813.0,5.0,843,-25.0,0.0,-2.0,0800-0859,2,0.0
1,2021-03-03,SkyWest Airlines Inc.,PHX,SGU,False,False,922,917.0,0.0,-5.0,...,940.0,1028.0,3.0,1040,-9.0,0.0,-1.0,1000-1059,2,0.0
2,2021-03-03,SkyWest Airlines Inc.,MHT,ORD,False,False,1330,1321.0,0.0,-9.0,...,1336.0,1445.0,16.0,1530,-29.0,0.0,-2.0,1500-1559,4,0.0
3,2021-03-03,SkyWest Airlines Inc.,DFW,TRI,False,False,1645,1636.0,0.0,-9.0,...,1703.0,1955.0,7.0,2010,-8.0,0.0,-1.0,2000-2059,4,0.0
4,2021-03-03,SkyWest Airlines Inc.,PHX,BFL,False,False,1844,1838.0,0.0,-6.0,...,1851.0,1900.0,3.0,1925,-22.0,0.0,-2.0,1900-1959,2,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6311866,2021-06-01,Southwest Airlines Co.,BNA,MDW,False,False,1255,1301.0,6.0,6.0,...,1310.0,1416.0,5.0,1430,-9.0,0.0,-1.0,1400-1459,2,0.0
6311867,2021-06-01,Southwest Airlines Co.,BNA,MDW,False,False,730,727.0,0.0,-3.0,...,740.0,842.0,3.0,900,-15.0,0.0,-1.0,0900-0959,2,0.0
6311868,2021-06-01,Southwest Airlines Co.,BNA,MIA,False,False,800,757.0,0.0,-3.0,...,811.0,1056.0,5.0,1110,-9.0,0.0,-1.0,1100-1159,4,0.0
6311869,2021-06-01,Southwest Airlines Co.,BNA,MIA,False,False,1300,1252.0,0.0,-8.0,...,1300.0,1554.0,5.0,1620,-21.0,0.0,-2.0,1600-1659,4,0.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6311871 entries, 0 to 6311870
Data columns (total 61 columns):
 #   Column                                   Dtype  
---  ------                                   -----  
 0   FlightDate                               object 
 1   Airline                                  object 
 2   Origin                                   object 
 3   Dest                                     object 
 4   Cancelled                                bool   
 5   Diverted                                 bool   
 6   CRSDepTime                               int64  
 7   DepTime                                  float64
 8   DepDelayMinutes                          float64
 9   DepDelay                                 float64
 10  ArrTime                                  float64
 11  ArrDelayMinutes                          float64
 12  AirTime                                  float64
 13  CRSElapsedTime                           float64
 14  ActualElapsedTime 

### **Step 3: Strategies for Big Datasets**
Apply five smart strategies to handle large datasets effectively:


*   Load Less Data: Strategically load only the essential portions of the dataset to optimize memory usage.
*   Use Chunking: Process the data in smaller pieces to avoid memory issues.
*   Optimize Data Types: Fine-tune data types to maximize efficiency and minimize memory consumption.
*   Sampling: Implement sampling methodologies to extract meaningful insights from a subset of the dataset.
*   Parallelize with Dask: Dask is a powerful library that extends pandas to enable parallel and distributed computing. It's particularly useful for handling larger-than-memory datasets.


**1. Load Less Data: Strategically load only the essential portions of the dataset to optimize memory usage.**

In [None]:
columns_to_read = ['Year', 'Airline', 'Origin', 'Dest', 'Cancelled', 'DepDelayMinutes', 'ArrDelayMinutes']

dtypes = {'Year': 'int64', 'Airline' : 'object', 'Origin' : 'object', 'Dest' : 'object', 'Cancelled' : 'bool', 'DepDelayMinutes' : 'float64', 'ArrDelayMinutes' : 'float64'}

df = pd.read_csv('drive/MyDrive/Colab Notebooks/Dataset/Combined_Flights_2021.csv', usecols = columns_to_read, dtype = dtypes)


In [None]:
print(df.head())

                 Airline Origin Dest  Cancelled  DepDelayMinutes  \
0  SkyWest Airlines Inc.    SGU  PHX      False              0.0   
1  SkyWest Airlines Inc.    PHX  SGU      False              0.0   
2  SkyWest Airlines Inc.    MHT  ORD      False              0.0   
3  SkyWest Airlines Inc.    DFW  TRI      False              0.0   
4  SkyWest Airlines Inc.    PHX  BFL      False              0.0   

   ArrDelayMinutes  Year  
0              0.0  2021  
1              0.0  2021  
2              0.0  2021  
3              0.0  2021  
4              0.0  2021  
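
The effect of `usecols` on memory can be measured directly. The sketch below is not taken from the notebook: it builds a small synthetic CSV (hypothetical values, column names borrowed from the flight data) so that it runs without the real file, then compares a full load against a column subset.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for the flight CSV (hypothetical values, real column names).
rng = np.random.default_rng(0)
n_rows = 1_000
full = pd.DataFrame({
    "Year": 2021,
    "Airline": rng.choice(["SkyWest Airlines Inc.", "Southwest Airlines Co."], n_rows),
    "Origin": rng.choice(["PHX", "DFW", "BNA"], n_rows),
    "Dest": rng.choice(["SGU", "ORD", "MIA"], n_rows),
    "DepDelayMinutes": rng.integers(0, 120, n_rows).astype("float64"),
    "ArrDelayMinutes": rng.integers(0, 120, n_rows).astype("float64"),
    "TaxiIn": rng.integers(1, 30, n_rows).astype("float64"),
    "WheelsOff": rng.integers(0, 2400, n_rows).astype("float64"),
})
csv_text = full.to_csv(index=False)


def load(usecols=None):
    """Read the synthetic CSV, optionally restricted to `usecols`."""
    return pd.read_csv(io.StringIO(csv_text), usecols=usecols)


all_cols = load()
subset = load(usecols=["Airline", "Origin", "DepDelayMinutes"])

# deep=True also counts the Python string objects held by text columns.
all_mb = all_cols.memory_usage(deep=True).sum() / 1024**2
subset_mb = subset.memory_usage(deep=True).sum() / 1024**2
print(f"all columns: {all_mb:.3f} MB, subset: {subset_mb:.3f} MB")
```

The row count is unchanged; only the columns named in `usecols` are parsed and stored.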


**2. Use Chunking: Process the data in smaller pieces to avoid memory issues.**

In [None]:
import time

chunk_size = 10000
file_path = 'drive/MyDrive/Colab Notebooks/Dataset/Combined_Flights_2021.csv'

start_time = time.time()

chunks = pd.read_csv(file_path, chunksize = chunk_size)

for chunk in chunks:
  print(chunk.head())

end_time = time.time()

elapsed_time = end_time - start_time
print(f"Total elapsed time: {elapsed_time} seconds")

Streaming output truncated to the last 5000 lines.
4590003     859.0    14.0         910       3.0       0.0                 0.0   
4590004    1824.0     3.0        1835      -8.0       0.0                -1.0   

         ArrTimeBlk  DistanceGroup  DivAirportLandings  
4590000   1400-1459              2                 0.0  
4590001   1600-1659              4                 0.0  
4590002   1900-1959              3                 0.0  
4590003   0900-0959              3                 0.0  
4590004   1800-1859              3                 0.0  

[5 rows x 61 columns]
         FlightDate                 Airline Origin Dest  Cancelled  Diverted  \
4600000  2021-05-30  Southwest Airlines Co.    ECP  ATL      False     False   
4600001  2021-05-30  Southwest Airlines Co.    ECP  ATL      False     False   
4600002  2021-05-30  Southwest Airlines Co.    ECP  BNA      False     False   
4600003  2021-05-30  Southwest Airlines Co.    ECP  BNA      False     False   
4600004
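
Printing every chunk floods the output, as the truncation notice above shows. A more typical use of chunking is to keep a small running aggregate and discard each chunk after it has been processed. The sketch below illustrates this on a tiny synthetic CSV (hypothetical values); the choice of mean departure delay per airline is an assumption made for illustration, not an analysis from the notebook.

```python
import io

import pandas as pd

# Tiny synthetic CSV standing in for the flight file (hypothetical values).
csv_text = """Airline,DepDelayMinutes
SkyWest Airlines Inc.,0
SkyWest Airlines Inc.,10
Southwest Airlines Co.,6
Southwest Airlines Co.,0
SkyWest Airlines Inc.,20
Southwest Airlines Co.,12
"""

totals = {}  # running sum of delay minutes per airline
counts = {}  # running number of flights per airline

# Only one chunk (here: 2 rows) is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    grouped = chunk.groupby("Airline")["DepDelayMinutes"].agg(["sum", "count"])
    for airline, row in grouped.iterrows():
        totals[airline] = totals.get(airline, 0.0) + float(row["sum"])
        counts[airline] = counts.get(airline, 0) + int(row["count"])

# Combine the per-chunk partial results into the final mean.
mean_delay = {airline: totals[airline] / counts[airline] for airline in totals}
print(mean_delay)
```

Sums and counts are accumulated separately because a mean of per-chunk means would be wrong whenever chunks contain different numbers of rows per airline.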

**3. Optimize Data Types: Fine-tune data types to maximize efficiency and minimize memory consumption.**

In [None]:
# Original Dataframe
columns_to_read = ['Year', 'Airline', 'Origin', 'Dest', 'Cancelled', 'DepDelayMinutes', 'ArrDelayMinutes']
df = pd.read_csv('drive/MyDrive/Colab Notebooks/Dataset/Combined_Flights_2021.csv', usecols = columns_to_read)

print("Original DataFrame Info:")
print(df.info())

Original DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6311871 entries, 0 to 6311870
Data columns (total 7 columns):
 #   Column           Dtype  
---  ------           -----  
 0   Airline          object 
 1   Origin           object 
 2   Dest             object 
 3   Cancelled        bool   
 4   DepDelayMinutes  float64
 5   ArrDelayMinutes  float64
 6   Year             int64  
dtypes: bool(1), float64(2), int64(1), object(3)
memory usage: 295.0+ MB
None


In [None]:
# Convert numeric columns to smaller data types
df['Year'] = df['Year'].astype('Int16')
df['DepDelayMinutes'] = df['DepDelayMinutes'].astype('float32')
df['ArrDelayMinutes'] = df['ArrDelayMinutes'].astype('float32')

# Convert the repetitive object columns (and the boolean 'Cancelled' flag) to categorical
categorical_cols = ['Airline', 'Origin', 'Dest', 'Cancelled']
df[categorical_cols] = df[categorical_cols].astype('category')

# Display optimized DataFrame
print("\nDataFrame After Optimization:")
print(df.info())


DataFrame After Optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6311871 entries, 0 to 6311870
Data columns (total 7 columns):
 #   Column           Dtype   
---  ------           -----   
 0   Airline          category
 1   Origin           category
 2   Dest             category
 3   Cancelled        category
 4   DepDelayMinutes  float32 
 5   ArrDelayMinutes  float32 
 6   Year             Int16   
dtypes: Int16(1), category(4), float32(2)
memory usage: 102.4 MB
None


In [None]:
print(df.head())

                 Airline Origin Dest Cancelled  DepDelayMinutes  \
0  SkyWest Airlines Inc.    SGU  PHX     False              0.0   
1  SkyWest Airlines Inc.    PHX  SGU     False              0.0   
2  SkyWest Airlines Inc.    MHT  ORD     False              0.0   
3  SkyWest Airlines Inc.    DFW  TRI     False              0.0   
4  SkyWest Airlines Inc.    PHX  BFL     False              0.0   

   ArrDelayMinutes  Year  
0              0.0  2021  
1              0.0  2021  
2              0.0  2021  
3              0.0  2021  
4              0.0  2021  
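
The `+` in `295.0+ MB` means that `df.info()` did not count the strings held by the object columns, so the real starting figure is higher. A hedged sketch of a stricter measurement follows, using `memory_usage(deep=True)` on synthetic data (hypothetical values); the helper name `memory_mb` is illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic data with repetitive text columns, similar in shape to the flight data.
rng = np.random.default_rng(0)
n_rows = 10_000
df_demo = pd.DataFrame({
    "Airline": rng.choice(["SkyWest Airlines Inc.", "Southwest Airlines Co."], n_rows),
    "Origin": rng.choice(["PHX", "DFW", "BNA", "SGU"], n_rows),
    "DepDelayMinutes": rng.integers(0, 300, n_rows).astype("float64"),
})


def memory_mb(frame: pd.DataFrame) -> float:
    """Total memory in MB, including the contents of text columns."""
    return frame.memory_usage(deep=True).sum() / 1024**2


before_mb = memory_mb(df_demo)

optimized = df_demo.copy()
for col in ["Airline", "Origin"]:
    # Each distinct string is stored once; rows hold small integer codes.
    optimized[col] = optimized[col].astype("category")
# downcast="float" picks the smallest float type that can hold the values.
optimized["DepDelayMinutes"] = pd.to_numeric(optimized["DepDelayMinutes"], downcast="float")

after_mb = memory_mb(optimized)
print(f"before: {before_mb:.2f} MB, after: {after_mb:.2f} MB")
```

Categories pay off when a column has few distinct values relative to its length, which is the case for airline names and airport codes.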


**4. Sampling: Implement sampling methodologies to extract meaningful insights from a subset of the dataset.**

In [None]:
import time

file_path = ('drive/MyDrive/Colab Notebooks/Dataset/Combined_Flights_2021.csv')

sampling_fraction = 0.1

start_time = time.time()

sample_df = pd.read_csv(file_path, skiprows=lambda x: x % (1/sampling_fraction) != 0, header=0)
print(sample_df.head())

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Total elapsed time: {elapsed_time} seconds")

   FlightDate                Airline Origin Dest  Cancelled  Diverted  \
0  2021-03-03  SkyWest Airlines Inc.    DFW  DRO      False     False   
1  2021-03-03  SkyWest Airlines Inc.    DFW  JLN      False     False   
2  2021-03-03  SkyWest Airlines Inc.    SLC  PHX      False     False   
3  2021-03-03  SkyWest Airlines Inc.    DFW  HRL      False     False   
4  2021-03-03  SkyWest Airlines Inc.    DFW  DRO      False     False   

   CRSDepTime  DepTime  DepDelayMinutes  DepDelay  ...  WheelsOff  WheelsOn  \
0        2045   2040.0              0.0      -5.0  ...     2105.0    2142.0   
1        2030   2020.0              0.0     -10.0  ...     2051.0    2141.0   
2         733    730.0              0.0      -3.0  ...      741.0     856.0   
3        2025   2022.0              0.0      -3.0  ...     2041.0    2144.0   
4        1448   1442.0              0.0      -6.0  ...     1503.0    1539.0   

   TaxiIn  CRSArrTime  ArrDelay  ArrDel15  ArrivalDelayGroups  ArrTimeBlk  \
0     2.0
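
The cell above keeps every tenth line, which is systematic rather than random sampling. If the row order is related to the quantity under study, a random sample may be preferable. The sketch below shows one way to do this with a seeded generator; the function name `skip_row` and the seed are assumptions for illustration, and the CSV is synthetic.

```python
import io
import random

import pandas as pd

# Synthetic CSV with 1,000 data rows (hypothetical values).
csv_text = "row_id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1_000))

sampling_fraction = 0.1
rng = random.Random(42)  # fixed seed so the sample is reproducible


def skip_row(line_index: int) -> bool:
    """Return True to skip a line; line 0 is the header and is always kept."""
    if line_index == 0:
        return False
    # Keep each data row with probability `sampling_fraction`.
    return rng.random() >= sampling_fraction


sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row)
print(len(sample_df), "of 1000 rows kept")
```

Unlike the systematic version, the number of rows kept varies around the target fraction rather than matching it exactly.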

**5. Parallelize with Dask: Dask is a powerful library that extends pandas to enable parallel and distributed computing. It's particularly useful for handling larger-than-memory datasets.**

In [None]:
!pip install dask



In [None]:
start_time = time.time()

In [None]:
import dask.dataframe as dd
dask_df = dd.read_csv('drive/MyDrive/Colab Notebooks/Dataset/Combined_Flights_2021.csv')

In [None]:
memory_gb = dask_df.memory_usage(deep=True).sum().compute() / (1024**3)

In [None]:
import numpy as np
import dask.dataframe as dd

def reduce_mem_usage_ddf(dask_df):
    """Downcast numeric columns of a Dask DataFrame and report memory before and after.

    Note: the column assignments below modify `dask_df` in place.
    """
    # Display initial memory usage (this triggers a full pass over the CSV)
    start_mem = dask_df.memory_usage(deep=True).sum().compute() / 1024**3
    print('Memory usage of Dask Dataframe: {:.2f} GB'.format(start_mem))

    for col in dask_df.columns:
        col_type = dask_df[col].dtype

        if col_type.name == 'category':
            # Marks the categories as ordered; this does not reduce memory usage
            dask_df[col] = dask_df[col].cat.as_ordered()
        elif col_type.name.startswith('float'):
            dask_df[col] = dask_df[col].astype(np.float32)
        elif col_type.name.startswith('int'):
            # Caution: int16 holds only -32768..32767; values outside that range are not preserved
            dask_df[col] = dask_df[col].astype(np.int16)

    # Display memory usage after optimization (another full pass over the CSV)
    end_mem = dask_df.memory_usage(deep=True).sum().compute() / 1024**3
    print('Memory usage after optimization: {:.2f} GB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return dask_df

In [None]:
ddf_new = reduce_mem_usage_ddf(dask_df)

Memory usage of Dask Dataframe: 7.03 GB
Memory usage after optimization: 5.98 GB
Decreased by 15.0%


In [None]:
dask_df.info(memory_usage="deep")

<class 'dask.dataframe.core.DataFrame'>
Columns: 61 entries, FlightDate to DivAirportLandings
dtypes: category(4), object(14), float32(21), int16(22)
memory usage: 1.4 GB


In [None]:
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Total elapsed time for memory optimization: {elapsed_time} seconds")

Total elapsed time for memory optimization: 4158.85693025589 seconds
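
The function above casts every integer column to `int16`, which is only safe when all values lie between -32768 and 32767. A more cautious approach is to check each column's range and pick the smallest type that fits. The sketch below does this in plain pandas rather than Dask; the helper names and the `BigId` column are hypothetical and not part of the notebook or the dataset.

```python
import numpy as np
import pandas as pd

INT_CANDIDATES = (np.int8, np.int16, np.int32, np.int64)


def smallest_int_dtype(series: pd.Series):
    """Return the narrowest signed integer dtype that can hold every value."""
    lo, hi = series.min(), series.max()
    for candidate in INT_CANDIDATES:
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return candidate
    return np.int64


def downcast_ints(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with each integer column cast to its narrowest safe dtype."""
    result = frame.copy()
    for col in result.columns:
        if pd.api.types.is_integer_dtype(result[col].dtype):
            result[col] = result[col].astype(smallest_int_dtype(result[col]))
    return result


demo = pd.DataFrame({
    "CRSDepTime": [724, 922, 1330],            # fits in int16
    "DistanceGroup": [2, 2, 4],                # fits in int8
    "BigId": [1_039_707, 1_410_702, 40_000],   # hypothetical; too large for int16
})
safe = downcast_ints(demo)
print(safe.dtypes)

# What a blanket int16 cast does to an out-of-range value in NumPy:
wrapped = np.array([40_000], dtype=np.int64).astype(np.int16)
print(wrapped)  # the value wraps around instead of raising an error
```

With Dask, the column minima and maxima would have to be computed first, which costs an extra pass over the data; whether that is worthwhile depends on how often the result is reused.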


### **Step 4: Comparative Analysis**
Conduct a comprehensive comparative analysis between traditional methods and advanced strategies. Evaluate aspects such as memory usage, computation time, and file size. Provide meaningful insights into the advantages gained through the adoption of advanced strategies.

1. **Memory Usage**

  Traditional Methods:

  Loading a complete dataset into memory often leads to high memory utilisation, particularly for huge datasets. When memory is limited, this can degrade performance or even crash the program.

  Advanced Strategies:

*   **Load Less Data**: Loading only the necessary columns greatly reduces memory utilisation. This is especially helpful when working with datasets larger than the available RAM.
*   **Chunking**: Processing data in smaller pieces reduces memory requirements further, since only a portion of the dataset must be in memory at any given moment.
*   **Optimize Data Types**: Downcasting numeric columns and converting repetitive string columns to categories reduces the amount of memory used, provided the smaller types can still represent every value. In this notebook, the seven-column DataFrame shrank from at least 295.0 MB to 102.4 MB, a reduction of roughly 65%.

2. **Computation Time**

  Traditional Methods:

  Loading and processing a large dataset in its entirety can take a long time, which slows down every subsequent computation.

  Advanced Strategies:

*   **Chunking**: Processing the data piece by piece lets work begin before the whole file has been read, and independent chunks can be handed to parallel workers. Chunking on its own does not shorten the total read time.
*   **Parallelize with Dask**: Dask can split operations across cores, which can significantly speed up tasks that parallelize well. The benefit depends on the workload: in this notebook, the Dask memory-optimization run took about 4,159 seconds, likely in part because each `.compute()` call re-reads the CSV file.

3. **File Size**

  Traditional Methods:

  Storing and processing huge datasets in their entirety produces very large files, which makes data transfer and storage more costly.

  Advanced Strategies:

*   **Optimize Data Types**: Smaller data types reduce file size when the data is saved in a format that stores typed binary values. A plain CSV file is text, so its size does not change with the in-memory type.
*   **Chunking**: Storing data in smaller portions produces smaller individual files, which may be easier to handle and more scalable.

  **Meaningful Insights**

  **Sampling** makes it possible to draw insights from a portion of the dataset without processing all of it. The sample in this notebook keeps every tenth row, about 10% of the data. It gives a representative picture as long as the row order is unrelated to the quantities being studied.
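
A comparison like the one in this step is easier to defend when every strategy is measured the same way. The sketch below shows one possible harness; the function name `measure` and the synthetic CSV are assumptions for illustration, and the timings it prints will differ from machine to machine.

```python
import io
import time

import pandas as pd


def measure(label, loader):
    """Run `loader`, then report elapsed seconds and deep memory usage in MB."""
    start = time.perf_counter()
    frame = loader()
    elapsed = time.perf_counter() - start
    memory_mb = frame.memory_usage(deep=True).sum() / 1024**2
    return {"strategy": label, "seconds": elapsed, "memory_mb": memory_mb, "rows": len(frame)}


# Synthetic CSV with 5,000 rows (hypothetical values).
csv_text = "Airline,Origin,DepDelayMinutes,TaxiIn\n" + "\n".join(
    f"Airline {i % 3},APT{i % 5},{float(i % 60)},{float(i % 20)}" for i in range(5_000)
)

compact_dtypes = {
    "Airline": "category",
    "Origin": "category",
    "DepDelayMinutes": "float32",
    "TaxiIn": "float32",
}

results = pd.DataFrame([
    measure("full load", lambda: pd.read_csv(io.StringIO(csv_text))),
    measure("load less data",
            lambda: pd.read_csv(io.StringIO(csv_text), usecols=["Airline", "DepDelayMinutes"])),
    measure("optimized dtypes",
            lambda: pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)),
])
print(results)
```

Passing each strategy as a callable keeps the timing and memory logic in one place, so new strategies can be added as extra rows.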

### **Step 5: Conclusion**
Summarize your findings. Explain why you chose these strategies and how they make a difference in handling big data.

Compared with the traditional approach of loading the entire file at once, the advanced strategies offer noticeable benefits in memory usage, computation time, and file size. The strategies applied were loading less data, chunking, optimising data types, sampling, and parallelizing with Dask. Together they improve efficiency and scalability, and they make it possible to derive significant insights from vast and complicated datasets.