---
# Algoritmos para Big Data

**Handout 5 - Apache Kafka as a messaging system, and data profiling**

**2024/25**

This lab class introduces Apache Kafka as a messaging system that can play an important role in data streaming. It also covers profiling the data to be used, based on reports generated by a dedicated tool, YData Profiling.

This is one of the notebooks that should contain the implementation of the tasks presented in the handout. It covers cleaning and preparing the provided dataset, and uses the YData Profiling tool to guide that work.

**Dataset**

The data file to work with can be downloaded from the zip archive located at:

https://bigdata.iscte-iul.eu/datasets/books-amazon.zip

---
# Initial setup

In [1]:
# Basic imports

import json
from pathlib import Path

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [2]:
# Build SparkSession
spark = SparkSession.builder.appName("DataPreparation").getOrCreate()

---
# Task A - Data ingestion

**Reading and checking data**

In [3]:
# Data to read
data_dir = '../Datasets/BooksAmazon/'
data_file = data_dir + 'books-amazon.csv'

! head $data_file

"ASIN","GROUP","FORMAT","TITLE","AUTHOR","PUBLISHER"
"1250150183","book","hardcover","The Swamp: Washington's Murky Pool of Corruption and Cronyism and How Trump Can Drain It","Eric Bolling","St. Martin's Press"
"0778319997","book","hardcover","Rise and Shine, Benedict Stone: A Novel","Phaedra Patrick","Park Row Books"
"1608322564","book","hardcover","Sell or Be Sold: How to Get Your Way in Business and in Life","Grant Cardone","Greenleaf Book Group Press"
"0310325331","book","hardcover","Christian Apologetics: An Anthology of Primary Sources","Khaldoun A. Sweis, Chad V. Meister","Zondervan"
"0312616295","book","hardcover","Gravity: How the Weakest Force in the Universe Shaped Our Lives","Brian Clegg","St. Martin's Press"
"1250066190","book","hardcover","Glass Houses: A Novel (Chief Inspector Gamache Novel)","Louise Penny","Minotaur Books"
"1592643124","book","hardcover","Reference Guide to the Talmud","Rabbi Adin Steinsaltz","The Toby Press"
"1849962839","book","hardcover","Induction 

In [4]:
# Reading data
df = spark.read.csv(
        # data_dir,
        data_file, 
        header=True, sep=',', inferSchema=True, 
        #recursiveFileLookup=True
    )
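
The sample printed by `head` contains commas inside quoted fields (for instance the AUTHOR value on the Zondervan line), so the reader has to honour the double-quote character; Spark's CSV reader uses `"` as its default quote. As a quick sanity check that does not depend on Spark, the sketch below parses that line with Python's standard `csv` module and confirms that it yields six fields. The variable names are illustrative.

```python
import csv
import io

# Header and one data line copied from the `head` output above.
sample = (
    '"ASIN","GROUP","FORMAT","TITLE","AUTHOR","PUBLISHER"\n'
    '"0310325331","book","hardcover",'
    '"Christian Apologetics: An Anthology of Primary Sources",'
    '"Khaldoun A. Sweis, Chad V. Meister","Zondervan"\n'
)

# csv.reader treats '"' as the quote character by default,
# so the comma inside the AUTHOR value does not split the field.
header, row = list(csv.reader(io.StringIO(sample)))
record = dict(zip(header, row))

print(len(row))
print(record["AUTHOR"])
```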

In [5]:
# Checking data that has been read
print(f'df - number of rows: {df.count()}')
df.printSchema()
df.show(10, truncate=False)

df - number of rows: 63755
root
 |-- ASIN: string (nullable = true)
 |-- GROUP: string (nullable = true)
 |-- FORMAT: string (nullable = true)
 |-- TITLE: string (nullable = true)
 |-- AUTHOR: string (nullable = true)
 |-- PUBLISHER: string (nullable = true)

+----------+-----+---------+----------------------------------------------------------------------------------------------------+------------------------------------------------------+---------------------------+
|ASIN      |GROUP|FORMAT   |TITLE                                                                                               |AUTHOR                                                |PUBLISHER                  |
+----------+-----+---------+----------------------------------------------------------------------------------------------------+------------------------------------------------------+---------------------------+
|1250150183|book |hardcover|The Swamp: Washington's Murky Pool of Corruption and Cronyism and How Tru

---
# Task B - Data profiling

**Checking duplicates**

In [10]:
print(f'df - number of rows is {df.count()}; after dropDuplicates() applied would be {df.dropDuplicates().count()}.')

df - number of rows is 63755; after dropDuplicates() applied would be 63750.


**Checking NULLs**

In [11]:
print(f'''df - number of rows after dropna(how='any') applied would be {df.dropna(how='any').count()}.''')

df - number of rows after dropna(how='any') applied would be 57196.


In [12]:
# If it is not a huge DataFrame...
print('Checking nulls at each column of df...')
dict_nulls_df = {col: df.filter(df[col].isNull()).count() for col in df.columns}
dict_nulls_df

Checking nulls at each column of df...


{'ASIN': 0,
 'GROUP': 4,
 'FORMAT': 5,
 'TITLE': 8,
 'AUTHOR': 83,
 'PUBLISHER': 6492}
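
The figures reported so far can be cross-checked with plain arithmetic. `dropna(how='any')` would remove 63755 - 57196 = 6559 rows, whereas the per-column counts add up to 6592 null cells. The 33 extra cells imply that some rows have a null in more than one column, and PUBLISHER alone accounts for roughly 10% of the rows. The sketch below only reuses numbers printed above; the variable names are illustrative.

```python
# Figures copied from the outputs above.
total_rows = 63755
rows_without_nulls = 57196
nulls_per_column = {
    "ASIN": 0, "GROUP": 4, "FORMAT": 5,
    "TITLE": 8, "AUTHOR": 83, "PUBLISHER": 6492,
}

# Rows that dropna(how='any') would discard.
rows_with_nulls = total_rows - rows_without_nulls
# Null cells summed over all columns.
null_cells = sum(nulls_per_column.values())
# A positive difference means some rows hold several nulls.
extra_cells = null_cells - rows_with_nulls

# Share of rows affected, column by column.
for col, n in nulls_per_column.items():
    print(f"{col:<10} {n:>5}  {n / total_rows:.2%}")
print(rows_with_nulls, null_cells, extra_cells)
```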

**Data profiling with YData Profiling**

For large datasets, YData Profiling gives a more adequate data profile than the manual checks above:

https://docs.profiling.ydata.ai/latest/

It remains usable on datasets with a large number of rows because it supports both pandas DataFrames and Spark DataFrames.

See https://docs.profiling.ydata.ai/latest/integrations/pyspark/


In [13]:
from ydata_profiling import ProfileReport

profile_title = 'books-amazon.csv'

profile_report = ProfileReport(
    df,
    title=profile_title,
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
    },
)

In [14]:
# Export the profile report as html
# profile_report_html = profile_report.to_html()

# Export the profile report as json
# profile_report_json = profile_report.to_json()

In [15]:
profile_report_file = data_dir + 'profile-' + profile_title + '.html'
profile_report.to_file(Path(profile_report_file))
profile_report_file

'../Datasets/BooksAmazon/profile-books-amazon.csv.html'

**Check the profile report**

---
# Task C - Data cleaning

- Apply cleaning operations to the raw data
- Store the outcome (the cleaned data) in a file

**First, check the profile report again**

If needed or advisable, carry out cleaning operations on the DataFrame and store the outcome as a new data file. It can be stored as, for example:

- a Parquet file, the recommended option at this stage because it stores the schema with the data;
- or, if one prefers, a CSV file.

Ultimately, make sure that the final data file is ready to be properly used later on.

In [8]:
# TO DO AS APPROPRIATE

# e.g. duplicates, missing data, etc.

# df_clean = ...


In [9]:
# PS. Why repartition(1)? 
# And what about the directory/file to be created?

# sep=';'
# out_file = data_dir + 'books-amazon-clean.csv'  # illustrative name for the cleaned data
# ( df_clean.repartition(1).write.mode('overwrite')
#   .options(header=True, delimiter=sep)
#   .csv(out_file)
# )