### PySpark Bank Data Mining - Implementation Requirements

#### 📋 Project Overview

Write a Spark job that reads banking data from `accounts.csv` and `transactions.csv`, extracts information from both files, and performs the data manipulation tasks described below.

---
#### Data
`accounts.csv` schema: `accountNumber`, `balance`

`transactions.csv` schema: `fromAccountNumber`, `toAccountNumber`, `transferAmount`

#### Accounts-Transactions Relationship
One account can have multiple transactions. A transaction is valid only if it originates from an account number that exists in `accounts.csv`.


### 🎯 Implementation Tasks

#### Task 1: `init_spark_session(self)` → `SparkSession`

**Requirements:**
- Create a Spark session with master `local` and name `Banking Data Mining`
- Return the SparkSession object



In [1]:
from pyspark.sql import SparkSession
# Import only what the cells below use; importing sum, min, max and round
# from pyspark.sql.functions would shadow the Python built-ins of the same name.
from pyspark.sql.functions import col

# App name and master follow the Task 1 requirements
spark = SparkSession.builder \
                    .appName("Banking Data Mining") \
                    .master("local") \
                    .getOrCreate()




#### Task 2: `extract_valid_transactions(self, accounts: DataFrame, transactions: DataFrame)` → `DataFrame`

**Requirements:**
- A transaction is valid if:
  1. `transferAmount` is less than or equal to the `balance` of the sending account (`fromAccountNumber`)
  2. The `toAccountNumber` exists in `accountsDf`
- Return the filtered `transactionDf`
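
The two rules above can be checked by hand on a tiny sample. The following is a minimal Spark-free sketch in plain Python, not the Spark implementation itself; the sample rows and the helper `is_valid` are hypothetical and exist only for illustration. It assumes that `balance` in rule 1 refers to the sending account.

```python
# Hypothetical sample data, not taken from the real CSV files.
accounts = {"a1": 100.0, "a2": 50.0}  # accountNumber -> balance

# Each row: (fromAccountNumber, toAccountNumber, transferAmount)
transactions = [
    ("a1", "a2", 80.0),  # valid: sender can cover it and receiver exists
    ("a2", "a1", 75.0),  # invalid: 75 exceeds the sender's balance of 50
    ("a1", "a9", 10.0),  # invalid: receiver a9 is not in accounts
    ("a7", "a1", 5.0),   # invalid: sender a7 is not in accounts
]


def is_valid(txn, accounts):
    """Return True if the transaction satisfies both validity rules."""
    from_acct, to_acct, amount = txn
    # Both endpoints must be known accounts (assumed reading of the rules).
    if from_acct not in accounts or to_acct not in accounts:
        return False
    # Rule 1: the amount must not exceed the sender's balance.
    return amount <= accounts[from_acct]


valid_transactions = [t for t in transactions if is_valid(t, accounts)]
print(valid_transactions)  # [('a1', 'a2', 80.0)]
```

A reference like this can serve as an oracle when testing the Spark job on a small input: the job's output rows should match `valid_transactions`.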


In [2]:
transactionDF = spark.read.csv("source_data/transactions.csv", header=True, inferSchema=True)
accountsDF = spark.read.csv("source_data/accounts.csv", header=True, inferSchema=True)

In [3]:
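# Attach each sender's balance; the inner join also drops transactions whose sender is not in accounts
joined = transactionDF.join(accountsDF, transactionDF["fromAccountNumber"] == accountsDF["accountNumber"], how="inner")
# Rule 1: the transfer must not exceed the sender's balance (rule 2, the toAccountNumber check, is not applied in this cell)
valid_transactions_df = joined.filter(col("transferAmount") <= col("balance"))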



#### Task 3: `distinct_transactions(self, transactions: DataFrame)` → `int`

**Requirements:**
- Return the number of distinct `fromAccountNumber` values among the transactions
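
To make the intended count concrete, here is a small plain-Python illustration on hypothetical rows. It assumes the requirement means "how many different sending accounts appear", which is what `select("fromAccountNumber").distinct().count()` computes in Spark; the helper name `distinct_senders` is made up for this sketch.

```python
# Hypothetical rows: (fromAccountNumber, toAccountNumber, transferAmount)
sample_transactions = [
    ("a1", "a2", 80.0),
    ("a1", "a3", 10.0),  # same sender as the first row
    ("a2", "a3", 5.0),
]


def distinct_senders(transactions):
    """Count the distinct fromAccountNumber values."""
    return len({from_acct for from_acct, _, _ in transactions})


print(distinct_senders(sample_transactions))  # 2
```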


In [4]:
valid_transactions_df.select("fromAccountNumber").distinct().count()

37

#### Task 4: `transactions_per_account(self, transactions: DataFrame)` → `dict`

**Requirements:**
- Find the count of transactions per `fromAccountNumber`
- Return the top 10 `fromAccountNumber` values with their transaction counts as a dictionary
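
The expected shape of the result can be illustrated with a plain-Python sketch built on `collections.Counter`. The sample rows and the `top_n` parameter are hypothetical; the Spark version would use `groupBy`, `count`, `orderBy`, and `limit` instead.

```python
from collections import Counter

# Hypothetical rows: (fromAccountNumber, toAccountNumber, transferAmount)
sample_transactions = [
    ("a1", "a2", 10.0),
    ("a1", "a3", 20.0),
    ("a1", "a2", 5.0),
    ("a2", "a1", 7.0),
    ("a2", "a3", 3.0),
    ("a3", "a1", 1.0),
]


def transactions_per_account(transactions, top_n=10):
    """Map each of the top_n senders to its transaction count."""
    counts = Counter(from_acct for from_acct, _, _ in transactions)
    # most_common sorts by count, highest first
    return dict(counts.most_common(top_n))


print(transactions_per_account(sample_transactions, top_n=2))  # {'a1': 3, 'a2': 2}
```

When several accounts share the same count, which of them land in the top 10 depends on tie ordering. In Spark, sorting on the count alone does not fix the order of ties, so adding `fromAccountNumber` as a secondary sort key is one way to make the result reproducible.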

In [6]:
result = valid_transactions_df.groupBy("fromAccountNumber").count().orderBy(col("count").desc()).limit(10).rdd.collectAsMap()
result

{'a226': 2,
 'a92': 2,
 'a278': 2,
 'a452': 2,
 'a949': 2,
 'a688': 2,
 'a627': 2,
 'a994': 2,
 'a948': 2,
 'a575': 2}

In [7]:
spark.stop()