<a href="https://colab.research.google.com/github/drshahizan/Python-big-data/blob/main/assignment/ass6/bdm/Kicap/big_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Assignment 6: Mastering Big Data Handling**



**TEAM MEMBERS:**
```
NABILA HUSNA BINTI ROSLI (MCS231009)
NUR AZIMAH BINTI MOHD SALLEH (MCS231011)
```

##**Pick a Big Dataset**

###**Dataset :** `Restaurant reviews`


###**About**
This dataset, labeled "Restaurant reviews," is a collection of restaurant information and customer feedback about those restaurants.

The dataset is divided into two parts: reviews written before COVID-19 (pre-covid) and reviews written after it (post-covid).

##**Loading the Dataset**

In [None]:
# Upload kaggle.json API token, and download / unzip the restaurant-reviews zip file dataset

# Install and upload the kaggle.json file
!pip install kaggle

from google.colab import files
files.upload()

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json



Saving kaggle.json to kaggle.json


In [None]:
!kaggle datasets download -d fahadsyed97/restaurant-reviews

Downloading restaurant-reviews.zip to /content
 99% 2.02G/2.04G [00:29<00:00, 76.2MB/s]
100% 2.04G/2.04G [00:30<00:00, 72.8MB/s]


In [None]:
!unzip restaurant-reviews.zip

Archive:  restaurant-reviews.zip
  inflating: postcovid_reviews.csv   
  inflating: precovid_reviews.csv    


In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## **Combine all CSV files into one CSV file**

In [None]:
# 2 files that need to be merged - both having the same columns
file_paths = [
    '/content/postcovid_reviews.csv',
    '/content/precovid_reviews.csv'
    ]

# Create an empty list to store DataFrames
dataframes = []

In [None]:
# Read each CSV file into a DataFrame and append to the list
for file_path in file_paths:
    d_frame = pd.read_csv(file_path)
    dataframes.append(d_frame)

In [None]:
# Merge all DataFrames into one
combined_df = pd.concat(dataframes, ignore_index=True)

In [None]:
# Save the combined DataFrame to a new CSV file
combined_df.to_csv('restaurant-reviews.csv', index=False)

In [None]:
combined_df.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572493 entries, 0 to 5572492
Columns: 21 entries, business_id to date_
dtypes: float64(3), int64(6), object(12)
memory usage: 8.8 GB
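
Because the combined DataFrame occupies 8.8 GB, building it with `pd.concat` needs both source files in memory at once. As an alternative, the following is a minimal sketch that appends each file to the output chunk by chunk. The function name `concat_csv_streaming`, the tiny stand-in files, and the chunk size are hypothetical, and the sketch assumes all input files share the same columns in the same order.

```python
import os
import tempfile

import pandas as pd


def concat_csv_streaming(input_paths, output_path, chunksize=100_000):
    """Append each input CSV to output_path chunk by chunk; return rows written."""
    rows_written = 0
    first_chunk = True
    for path in input_paths:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            chunk.to_csv(
                output_path,
                mode="w" if first_chunk else "a",  # create the file once, then append
                header=first_chunk,                # write the header only once
                index=False,
            )
            first_chunk = False
            rows_written += len(chunk)
    return rows_written


# Tiny stand-in files (hypothetical data) to demonstrate the function
tmp_dir = tempfile.mkdtemp()
part_a = os.path.join(tmp_dir, "part_a.csv")
part_b = os.path.join(tmp_dir, "part_b.csv")
merged = os.path.join(tmp_dir, "merged.csv")
pd.DataFrame({"city": ["Austin", "Atlanta"], "stars": [4.5, 3.0]}).to_csv(part_a, index=False)
pd.DataFrame({"city": ["Avon"], "stars": [2.5]}).to_csv(part_b, index=False)

total_rows = concat_csv_streaming([part_a, part_b], merged, chunksize=1)
merged_df = pd.read_csv(merged)
print(total_rows, merged_df.shape)
```

With this approach, peak memory is bounded by one chunk rather than by the size of the merged dataset.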


##**Strategies for Big Datasets**



*   Using merged dataset ('`restaurant-reviews.csv`')



### **Load Less Data**

By loading only the first 1000 rows (nrows=1000), we've created a DataFrame (df) that represents a subset of the original data.

In [None]:
df = pd.read_csv('restaurant-reviews.csv', nrows=1000)
df.dtypes

business_id        object
name               object
address            object
state_             object
city               object
postal_code        object
latitude          float64
longitude         float64
stars             float64
review_count        int64
is_open             int64
categories         object
hours              object
review_id          object
user_id            object
customer_stars      int64
useful              int64
funny               int64
cool                int64
text_              object
date_              object
dtype: object

The data types of the columns have been automatically inferred by pandas during the reading process. For example, columns like 'business_id' and 'name' are stored as object (text) types, while numerical columns like 'latitude', 'longitude', 'stars', 'review_count', and others are stored as float64 or int64.

In [None]:
df.memory_usage().sum()/(1024*1024*1024)

0.00015658140182495117

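The memory usage of this subset DataFrame is small, approximately 0.00016 GB (about 168,000 bytes). This is because only 1,000 of the 5,572,493 rows were loaded, which is useful for quick exploration and analysis when the entire dataset is not needed. Note that `memory_usage()` without `deep=True` does not count the string contents of the object columns, so the real footprint is larger than this figure.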
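
Besides `nrows`, `read_csv` can also skip unneeded columns and assign compact data types while loading. The following is a minimal sketch on a hypothetical three-row stand-in for `restaurant-reviews.csv`; the chosen columns and dtypes are illustrative assumptions, not a recommendation for the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for restaurant-reviews.csv
csv_text = io.StringIO(
    "business_id,city,stars,customer_stars,text_\n"
    "b1,Austin,4.5,5,Great tacos\n"
    "b2,Atlanta,3.0,2,Slow service\n"
    "b3,Austin,4.0,4,Solid brunch\n"
)

subset = pd.read_csv(
    csv_text,
    usecols=["city", "stars", "customer_stars"],  # skip columns that are not needed
    dtype={"city": "category", "stars": "float32", "customer_stars": "int8"},
    nrows=2,  # read only the first two data rows
)

print(subset.dtypes)
print(subset.memory_usage(deep=True).sum(), "bytes")
```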



---



### **Use Chunking**

This code reads the entire CSV file into a Pandas DataFrame without using chunks and measures the time it takes to complete.

In [None]:
%%time
# Reading the entire DataFrame without using chunks (%%time must be the first line of the cell)
df = pd.read_csv('restaurant-reviews.csv')
len(df)

CPU times: user 56.3 s, sys: 4.93 s, total: 1min 1s
Wall time: 1min 4s


5572493

The output shows that reading the entire file took approximately 1 minute and 4 seconds, and the DataFrame has 5,572,493 entries.

This code reads the CSV file using chunks of size 1000 rows and measures the time it takes to initialize the chunked reader.

In [None]:
%%time
# Reading the CSV file using chunks (%%time must be the first line of the cell)
chunks = pd.read_csv('restaurant-reviews.csv', iterator=True, chunksize=1000)

CPU times: user 2.22 ms, sys: 1.07 ms, total: 3.29 ms
Wall time: 5.06 ms


The output shows that setting up the chunked reader took only about 5 ms, because the rows are not parsed until the reader is iterated.

This code iterates through each chunk of the file and calculates the total length by summing the lengths of the individual chunks.

In [None]:
# Iterating Through Chunks and Calculating Total Length
length = 0
for chunk in chunks:
    length += len(chunk)
length

5572493

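The resulting length, 5,572,493, matches the total number of entries in the DataFrame. In our run, iterating through the chunks took approximately 1 minute and 19 seconds, slightly longer than the 1 minute and 4 seconds needed to read the file in a single call. The benefit of chunking is lower peak memory usage, not speed.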
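
Chunking is most useful when each chunk is reduced to a small summary, so that only running totals stay in memory. The following is a minimal sketch under that assumption, using a hypothetical five-row stand-in for the CSV file and a deliberately tiny chunk size.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for restaurant-reviews.csv
csv_text = io.StringIO(
    "city,customer_stars\n"
    "Austin,5\nAustin,4\nAtlanta,2\nAvon,3\nAtlanta,4\n"
)

total_stars = 0
total_rows = 0
city_counts = Counter()

for chunk in pd.read_csv(csv_text, chunksize=2):
    # Keep only running totals; each chunk is discarded after this iteration
    total_stars += chunk["customer_stars"].sum()
    total_rows += len(chunk)
    city_counts.update(chunk["city"].value_counts().to_dict())

mean_stars = total_stars / total_rows
print("Mean customer_stars:", mean_stars)
print("Reviews per city:", dict(city_counts))
```

The same pattern extends to any statistic that can be combined from partial results, such as sums, counts, minima, and maxima.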



---



### **Optimize Data Types**

####The code below optimizes the memory usage of a pandas DataFrame by adjusting the data types of its columns (fine-tuning data types).

This prints information about the DataFrame, showing the data types of each column and the memory usage.

In [None]:
# To have the overview of the initial dataframe information.
df.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572493 entries, 0 to 5572492
Columns: 21 entries, business_id to date_
dtypes: float64(3), int64(6), object(12)
memory usage: 8.8 GB


In this case, the DataFrame has 21 columns with data types: float64, int64, and object. The DataFrame has 5,572,493 entries and initially consumes 8.8 GB of memory.

In [None]:
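# To see the initial memory usage (shallow: string contents of object columns are not counted)
start_mem = df.memory_usage().sum() / 1024**3
print('Memory usage of dataframe is {:.2f} GB'.format(start_mem))

Memory usage of dataframe is 0.87 GB


The initial shallow memory usage before any optimization is 0.87 GB. The byte count is divided by 1024**3, so the unit is GB, not MB. The figure is far below the 8.8 GB reported by `df.info(memory_usage='deep')` because `memory_usage()` without `deep=True` counts only the 8-byte references held by object columns, not the strings they point to.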
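
The gap between this figure and the 8.8 GB reported by `df.info(memory_usage='deep')` comes from how object columns are measured. The following is a minimal sketch of the difference on a hypothetical three-row DataFrame (`df_demo` and its values are made up for illustration).

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "review_count": [10, 250, 31],
        # dtype=object mirrors how read_csv stored the text columns in this notebook
        "text_": pd.Series(
            ["Great tacos", "Slow service but friendly staff", "Solid brunch"],
            dtype=object,
        ),
    }
)

shallow_bytes = df_demo.memory_usage().sum()        # object columns: references only
deep_bytes = df_demo.memory_usage(deep=True).sum()  # also counts the string objects

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
print(f"shallow in GB: {shallow_bytes / 1024**3:.10f}")
```

For a DataFrame dominated by text columns, the deep figure is the one that reflects actual RAM consumption.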

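This loop iterates through each column of the DataFrame and checks its data type: float64 columns are converted to float16, int64 columns to int16, and object columns to a categorical data type. These conversions are aggressive. float16 keeps only about three significant decimal digits, which is too coarse for 'latitude' and 'longitude', and int16 cannot hold values above 32,767, so the converted columns should be checked before they are used for analysis.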

In [None]:
# Data Type Optimization Loop
for col in df.columns:
    if df[col].dtype == 'float64':
        df[col] = df[col].astype('float16')
    if df[col].dtype == 'int64':
        df[col] = df[col].astype('int16')
    if df[col].dtype == 'object':
        df[col] = df[col].astype('category')

This calculates and prints the memory usage of the DataFrame after the data type optimization and the percentage decrease in memory usage.

In [None]:
end_mem = df.memory_usage().sum() / 1024**3
print('Memory usage after optimization is: {:.2f} GB'.format(end_mem))
print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

Memory usage after optimization is: 0.84 GB
Decreased by 4.0%


After the optimization, the shallow memory usage fell from 0.87 GB to 0.84 GB, a 4.0% decrease. The deep memory usage reported by `df.info(memory_usage='deep')` fell from 8.8 GB to 4.9 GB, a decrease of about 44%.

In [None]:
# To see the final dataframe information
df.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572493 entries, 0 to 5572492
Columns: 21 entries, business_id to date_
dtypes: category(12), float16(3), int16(6)
memory usage: 4.9 GB


The final DataFrame has the same number of entries but uses less memory, with optimized data types: category(12), float16(3), and int16(6).

Deep memory usage was reduced from 8.8 GB to 4.9 GB after optimizing the data types.

Summary of data types optimization: -

* The initial DataFrame had a deep memory usage of 8.8 GB.
* After optimizing the data types (using float16, int16, and category), the deep memory usage decreased to 4.9 GB.
* The optimization resulted in a reduction of about 44% in deep memory usage (4.0% when measured without `deep=True`).
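
The loop above converts every column unconditionally. A more conservative variant lets pandas pick the smallest numeric type that still holds the values, and converts text columns to category only when their values repeat. The following is a minimal sketch of that idea; the helper `downcast_frame`, its `max_category_ratio` threshold, and the demo data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def downcast_frame(frame, max_category_ratio=0.5):
    """Return a copy with numeric columns downcast and repetitive text columns as category."""
    result = frame.copy()
    for col in result.columns:
        col_dtype = result[col].dtype
        if pd.api.types.is_integer_dtype(col_dtype):
            # Picks the smallest signed integer type that fits every value
            result[col] = pd.to_numeric(result[col], downcast="integer")
        elif pd.api.types.is_float_dtype(col_dtype):
            # float32 is the smallest float type pd.to_numeric will downcast to
            result[col] = pd.to_numeric(result[col], downcast="float")
        elif col_dtype == object:
            # Category only helps when values repeat; free text stays as object
            if result[col].nunique() / max(len(result), 1) <= max_category_ratio:
                result[col] = result[col].astype("category")
    return result


demo = pd.DataFrame(
    {
        "city": pd.Series(["Austin", "Austin", "Atlanta", "Austin"], dtype=object),
        "text_": pd.Series(["a", "b", "c", "d"], dtype=object),
        "latitude": [30.25, 30.25, 33.75, 30.25],
        "review_count": [12, 40000, 7, 300],  # 40000 does not fit in int16 (max 32767)
    }
)

small = downcast_frame(demo)
print(small.dtypes)
```

In this sketch `review_count` becomes int32 rather than int16, because one value exceeds the int16 range, and the all-unique `text_` column is left as object.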



---



### **Sampling**

In [None]:
# Original dataframe information (5M+ rows, 21 columns)
df.shape

(5572493, 21)

This code snippet prints the shape of the original DataFrame (df), indicating that it has 5,572,493 rows and 21 columns.

Here, we define the desired sample size as 1000. This is the number of rows we want to randomly select from the original DataFrame.

In [None]:
# Define the desired sample size
sample_size = 1000

Using the sample method, you randomly select 1000 rows from the original DataFrame (df). The random_state=42 parameter ensures reproducibility, meaning the same random rows will be selected if you run the code again with the same random state.

In [None]:
# Randomly sample data
sampled_df = df.sample(n=sample_size, random_state=42)

This prints information about the sampled DataFrame, including data types and memory usage, and displays a preview of a few rows from the sampled DataFrame.

In [None]:
# Display information about the sampled DataFrame
print("Info about Sampled DataFrame:")
sampled_df.info()  # info() prints directly and returns None, so it is not wrapped in print()

# Display a few rows of the sampled DataFrame
print("\nSampled DataFrame Preview:")
print(sampled_df.head())

This code snippet prints the shape of the sampled DataFrame, confirming that it now has 1000 rows and 21 columns.

In [None]:
sampled_df.shape

(1000, 21)

Summary using sampling: -


*   The original DataFrame (df) has 5,572,493 rows and 21 columns.
*   We randomly sampled 1000 rows from the original DataFrame to create a smaller subset, resulting in a new DataFrame (sampled_df) with the same number of columns but only 1000 rows.
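
`df.sample` requires the full DataFrame to be in memory first. When that is not possible, a fixed fraction can be drawn from each chunk while the file is read. The following is a minimal sketch of that approach on hypothetical stand-in data of 1,000 rows; the chunk size, fraction, and seeds are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in data: 1,000 rows
rows = "\n".join(f"{i},{i % 5 + 1}" for i in range(1000))
csv_text = io.StringIO("review_no,customer_stars\n" + rows + "\n")

sample_fraction = 0.1
sampled_parts = []
for i, chunk in enumerate(pd.read_csv(csv_text, chunksize=200)):
    # Vary the seed per chunk so the chunks do not all pick the same positions
    sampled_parts.append(chunk.sample(frac=sample_fraction, random_state=42 + i))

sampled_df = pd.concat(sampled_parts, ignore_index=True)
print(sampled_df.shape)
```

Each chunk contributes the same fraction of its rows, so the result is spread evenly over the file while only one chunk is held in memory at a time.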





---



## **Parallelize with Dask**

Dask is a powerful library for parallel and distributed computing. It allows us to scale our computations by parallelizing them across multiple cores or even distributed computing clusters.

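`ddf` is the Dask DataFrame created from the pandas DataFrame `df`. The `npartitions` parameter sets the number of partitions the DataFrame is split into for parallel processing. The recorded outputs below cover 1,000 rows (the star-rating counts sum to 1,000), so they were evidently produced from a 1,000-row `df` rather than the full 5,572,493-row dataset.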

In [None]:
# Import Dask library
import dask.dataframe as dd

# Convert the pandas DataFrame to a Dask DataFrame
ddf = dd.from_pandas(df, npartitions=4)

The mean of the 'stars' column is calculated using Dask's lazy evaluation: building the expression does no work, and the actual computation is triggered by calling compute().

The cells below perform the operations in parallel.

In [None]:
# Perform some operations in parallel
# Example 1: Calculate the mean of the 'stars' column
mean_stars = ddf['stars'].mean()

# Compute the result
result = mean_stars.compute()

# Display the result
print("Mean Stars:", result)

Mean Stars: 4.044


In [None]:
# Example 2: Filtering data in parallel
filtered_data = ddf[ddf['stars'] > 4]

# Compute the results
filtered_data_result = filtered_data.compute()

# Display the results
print("Filtered Data:")
print(filtered_data_result.head())

In [None]:
# Example 3: Groupby and compute mean in parallel
mean_stars_by_city = ddf.groupby('city')['stars'].mean()

# Compute the results
mean_stars_by_city_result = mean_stars_by_city.compute()

# Display the results
print("\nMean Stars by City:")
print(mean_stars_by_city_result.head())


Mean Stars by City:
city
Allston      4.041667
Atlanta      3.915966
Austin       4.269355
Avon         2.500000
Beaverton    4.321429
Name: stars, dtype: float64


In [None]:
# Example 4: Count the number of reviews for each value of the 'stars' column in parallel
review_counts_by_stars = ddf['stars'].value_counts()

# Compute the results
review_counts_by_stars_result = review_counts_by_stars.compute()

# Display the results
print("\nReview Counts by Stars:")
print(review_counts_by_stars_result)


Review Counts by Stars:
4.0    399
4.5    387
3.5    106
3.0     56
5.0     19
2.5     18
2.0     10
1.5      5
Name: stars, dtype: int64


Summary for parallelize using Dask: -
* Filtering, groupby, and value counts are performed on the Dask DataFrame (ddf), and Dask can execute these operations in parallel across partitions.
* The compute() method is used to trigger the actual computation and obtain the results.
* A Dask Client is optional and is not set up in this notebook. In a distributed computing environment, a Dask Client can be used to manage computations across multiple workers.



---



##**Comparative Analysis**

Pros and cons for both traditional method and advanced strategies used in Assignment 6: -

###**Pandas**
* Pros
1. Simplicity and ease of use.
2. Suitable for smaller datasets that fit into memory.
3. Operations are performed sequentially on a single node.
4. File size is determined by the size of the dataset and the data types used.

* Cons
1. Loads the entire dataset into memory, which can be limiting for large datasets.
2. May raise a MemoryError for very large datasets.
3. Computation time grows with the size of the dataset, since most operations run on a single core.
4. Limited parallelization.


###**Dask**
* Pros
1. Operates on larger-than-memory datasets by dividing them into smaller partitions.
2. More scalable and adaptive memory usage.
3. Lazy evaluation avoids loading data until a result is requested, so the entire dataset does not need to be in memory at once.
4. Parallelizes operations by dividing the dataset into partitions.
5. Can significantly reduce computation time for parallelizable operations on large datasets, although this notebook does not benchmark it.
6. Efficiently handles larger datasets without necessarily increasing file size.

* Cons
1. Requires careful consideration of the partitioning strategy.
2. Overheads associated with task scheduling and communication.



Below is a comparison between pandas (traditional methods) and Dask (advanced strategies) in terms of memory usage, computation time, and file size: -

* ### Memory usage
  
  Advanced strategies (Dask) have a clear advantage for handling larger-than-memory datasets with more efficient memory usage.

* ### Computation time
  Advanced strategies (Dask) can show a significant advantage when operations can be parallelized, although task-scheduling overhead can make Dask slower than pandas on small datasets.

* ### File size
  Advanced strategies (Dask) are advantageous for efficiently handling larger datasets without a proportional increase in file size.



---



##**Conclusion**

**Loading less data** is a common strategy when working with large datasets, allowing you to save memory and speed up initial data exploration tasks. However, keep in mind that working with a subset may not represent the full dataset, and decisions based on this subset should be made with caution.

The primary benefit of **optimizing data types** is to reduce memory usage, making it more efficient for handling and processing large datasets. It's important to choose the appropriate data types for columns to balance memory efficiency with the precision needed for analysis.

Using **chunks** is beneficial when dealing with large datasets that may not fit into memory. Instead of loading the entire dataset at once, it is read in smaller chunks, which keeps peak memory usage low. In the example, iterating over the dataset in chunks took slightly longer (about 1 minute and 19 seconds) than reading it in a single call (1 minute and 4 seconds). The benefit becomes apparent when the dataset is too large to fit into memory, as chunks allow you to process and analyze portions of the data at a time.

**Sampling** is a useful technique for working with large datasets when you want to get a representative subset for exploratory data analysis or model building. The random_state parameter is set for reproducibility, ensuring that the same random rows are selected if the code is run again with the same random state.

### **After using Dask:**
* Traditional methods (pandas) are suitable for smaller datasets but may face limitations in terms of memory and computation time for larger datasets.
* Advanced strategies (Dask) offer scalability, efficient memory usage, and parallelization, making them highly advantageous for big data scenarios.
* The choice between traditional methods (pandas) and advanced strategies (Dask) depends on the size and complexity of the data, available computing resources, and the specific requirements of the analysis.
