---
# Algoritmos para Big Data

**Handout 2 - Data joining, windowing, and Spark SQL**

**2024/25**

This lab class provides hands-on experience with three data-processing topics: data joining, data windowing, and Spark SQL.

This notebook contains the implementation of the tasks presented in the handout, so the handout and the notebook should be read together.

---
# Task A - Data ingestion

**Dataset**

The file can be downloaded from

https://bigdata.iscte-iul.eu/datasets/retail-data.csv

**Spark setup**

In [3]:
# Basic imports
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [4]:
# Build SparkSession
spark = SparkSession.builder.appName("JoinWindowingSQL").getOrCreate()

**Reading and checking data**

In [6]:
# Reading data
data_dir = '../../Datasets/'
file_transactions = data_dir + 'credit-cards-transactions.csv'

! head $file_transactions
df_transactions = spark.read.csv(file_transactions, header=True, sep=',', inferSchema=True)


User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
0,0,2002,9,1,06:21,$134.09,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No
0,0,2002,9,1,06:42,$38.48,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
0,0,2002,9,2,06:22,$120.34,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
0,0,2002,9,2,17:45,$128.95,Swipe Transaction,3414527459579106770,Monterey Park,CA,91754.0,5651,,No
0,0,2002,9,3,06:23,$104.71,Swipe Transaction,5817218446178736267,La Verne,CA,91750.0,5912,,No
0,0,2002,9,3,13:53,$86.19,Swipe Transaction,-7146670748125200898,Monterey Park,CA,91755.0,5970,,No
0,0,2002,9,4,05:51,$93.84,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
0,0,2002,9,4,06:09,$123.50,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
0,0,2002,9,5,06:14,$61.72,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No


In [7]:
df_transactions.show(10)
print(f'df_transactions - number of rows is {df_transactions.count()}.')
df_transactions.printSchema()

+----+----+----+-----+---+-------------------+-------+-----------------+--------------------+-------------+--------------+-------+----+-------+---------+
|User|Card|Year|Month|Day|               Time| Amount|         Use Chip|       Merchant Name|Merchant City|Merchant State|    Zip| MCC|Errors?|Is Fraud?|
+----+----+----+-----+---+-------------------+-------+-----------------+--------------------+-------------+--------------+-------+----+-------+---------+
|   0|   0|2002|    9|  1|2025-04-03 06:21:00|$134.09|Swipe Transaction| 3527213246127876953|     La Verne|            CA|91750.0|5300|   NULL|       No|
|   0|   0|2002|    9|  1|2025-04-03 06:42:00| $38.48|Swipe Transaction| -727612092139916043|Monterey Park|            CA|91754.0|5411|   NULL|       No|
|   0|   0|2002|    9|  2|2025-04-03 06:22:00|$120.34|Swipe Transaction| -727612092139916043|Monterey Park|            CA|91754.0|5411|   NULL|       No|
|   0|   0|2002|    9|  2|2025-04-03 17:45:00|$128.95|Swipe Transaction| 341

In [None]:
print(f'file_transactions - number of rows is {df_transactions.count()}; after dropDuplicates() applied would be {df_transactions.dropDuplicates().count()}.')

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/conda/envs/vscode_pyspark/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3549, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_933/3073232393.py", line 1, in <module>
    print(f'file_transactions - number of rows is {df_transactions.count()  }; after dropDuplicates() applied would be {df_transactions.dropDuplicates().count()   }.')
                                                                                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/vscode_pyspark/lib/python3.11/site-packages/pyspark/sql/dataframe.py", line 1240, in count
    return int(self._jdf.count())
               ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/vscode_pyspark/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                

ConnectionRefusedError: [Errno 111] Connection refused

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/conda/envs/vscode_pyspark/lib/python3.11/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/vscode_pyspark/lib/python3.11/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/vscode_pyspark/lib/python3.11/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving


In [None]:
df_transactions.dropDuplicates()

DataFrame[User: int, Card: int, Year: int, Month: int, Day: int, Time: timestamp, Amount: string, Use Chip: string, Merchant Name: bigint, Merchant City: string, MCC: int, Is Fraud?: string]

In [None]:
print(f'''file_transactions - number of rows after dropna(how='any') would be {df_transactions.dropna(how='any').count()}.''')

file_transactions - number of rows after dropna(how='any') would be 324890.


In [None]:
print('Checking nulls at each column of df_transactions')
# One filter-and-count job per column
dict_nulls_transactions = {col: df_transactions.filter(df_transactions[col].isNull()).count() for col in df_transactions.columns}
dict_nulls_transactions

Checking nulls at each column of df_transactions


{'User': 0,
 'Card': 0,
 'Year': 0,
 'Month': 0,
 'Day': 0,
 'Time': 0,
 'Amount': 0,
 'Use Chip': 0,
 'Merchant Name': 0,
 'Merchant City': 0,
 'Merchant State': 2720821,
 'Zip': 2878135,
 'MCC': 0,
 'Errors?': 23998469,
 'Is Fraud?': 0}

In [None]:
# Drop the specified columns from df_transactions
columns_to_drop = ['Merchant State', 'Zip', 'Errors?']
df_transactions = df_transactions.drop(*columns_to_drop)

# Verify the remaining columns
df_transactions.printSchema()

root
 |-- User: integer (nullable = true)
 |-- Card: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- Day: integer (nullable = true)
 |-- Time: timestamp (nullable = true)
 |-- Amount: string (nullable = true)
 |-- Use Chip: string (nullable = true)
 |-- Merchant Name: long (nullable = true)
 |-- Merchant City: string (nullable = true)
 |-- MCC: integer (nullable = true)
 |-- Is Fraud?: string (nullable = true)



After dropping the three columns that contained nulls, none of the remaining columns has missing values.

In [None]:
# Call describe() on df_transactions and show the summary statistics
df_transactions.describe().show()

+-------+------------------+------------------+------------------+------------------+------------------+--------+-----------------+--------------------+-------------+-----------------+---------+
|summary|              User|              Card|              Year|             Month|               Day|  Amount|         Use Chip|       Merchant Name|Merchant City|              MCC|Is Fraud?|
+-------+------------------+------------------+------------------+------------------+------------------+--------+-----------------+--------------------+-------------+-----------------+---------+
|  count|          24386900|          24386900|          24386900|          24386900|          24386900|24386900|         24386900|            24386900|     24386900|         24386900| 24386900|
|   mean|1001.0193350938414| 1.351366184303868|2011.9551699067943|  6.52506357921671|15.718122721625134|    NULL|             NULL|-4.76922962773083...|         NULL|5561.171253336833|     NULL|
| stddev|  569.4611570323

In [None]:
# numeric columns
input_cols_num = ['User', 'Card', 'Year', 'Month', 'Day', 'Merchant Name', 'MCC']
# string columns
input_cols_str = ['Amount', 'Use Chip', 'Merchant City']
# all interest columns together
input_cols_all = input_cols_num + input_cols_str

In [None]:
print('\nUniqueness of values:')
number_records = df_transactions.count()
cols_interest = df_transactions.columns
for cl in cols_interest:
    k = df_transactions.select(cl).distinct().count()
    print(f'Column {cl} has {k} unique values out of {number_records} records.')


Uniqueness of values:
Column User has 2000 unique values out of 24386900 records.
Column Card has 9 unique values out of 24386900 records.
