# Year Credential ETL

## PySpark

This notebook builds an ETL job that extracts data from `year_credential`, applies some transformations, and loads the result into the Data Lake.

### Import libraries

In [7]:
import sys
from platform import python_version

import findspark
findspark.init()

In [8]:
import pyspark
import pyspark.sql.functions as f

from pyspark.sql           import SparkSession
from pyspark.sql.functions import col, explode, regexp_replace, udf
from pyspark.sql.types     import Row, ArrayType, IntegerType, LongType, StringType

### Create Spark session

In [9]:
spark = (SparkSession
         .builder
         .appName( 'House_credentials_hpay' )
         .getOrCreate()
        )

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/06 11:50:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/03/06 11:51:00 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/03/06 11:51:00 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/03/06 11:51:00 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


### Load data

In [10]:
in_path  = '/home/art/data/hpay/in/year_credential.csv'
out_path = '/home/art/data/hpay/out/year_credential'

In [11]:
df = (spark
      .read
      .options( header = True, inferSchema = True, delimiter = ',' )
      .csv( in_path )
     )
df.show()

+--------+----+---+---+---+---+---+---+---+---+---+---+---+---+
|house_id|year|m01|m02|m03|m04|m05|m06|m07|m08|m09|m10|m11|m12|
+--------+----+---+---+---+---+---+---+---+---+---+---+---+---+
| 100pino|2024|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 101pino|2024|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 102pino|2024|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 200caob|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 201caob|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 202caob|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 100abed|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 101abed|2024|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 102abed|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
+--------+----+---+---+---+---+---+---+---+---+---+---+---+---+



In [12]:
df.printSchema()

root
 |-- house_id: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- m01: integer (nullable = true)
 |-- m02: integer (nullable = true)
 |-- m03: integer (nullable = true)
 |-- m04: integer (nullable = true)
 |-- m05: integer (nullable = true)
 |-- m06: integer (nullable = true)
 |-- m07: integer (nullable = true)
 |-- m08: integer (nullable = true)
 |-- m09: integer (nullable = true)
 |-- m10: integer (nullable = true)
 |-- m11: integer (nullable = true)
 |-- m12: integer (nullable = true)
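
With `inferSchema = True`, Spark makes an extra pass over the file to guess the column types. As a minimal sketch, assuming the column names and types printed above stay fixed, the schema could instead be declared up front as a DDL string, which `DataFrameReader.schema` accepts. The names `MONTH_COLS`, `SCHEMA_DDL`, and `read_year_credential` are hypothetical and not part of the original notebook.

```python
# Hypothetical helper names; column names and types are copied from the
# printSchema() output above.
MONTH_COLS = [f'm{month:02d}' for month in range(1, 13)]   # m01 ... m12

# DDL-formatted schema string: "house_id STRING, year INT, m01 INT, ..."
SCHEMA_DDL = ', '.join(
    ['house_id STRING', 'year INT'] + [f'{c} INT' for c in MONTH_COLS]
)

def read_year_credential(spark, path):
    """Read the CSV with an explicit schema instead of inferSchema."""
    return (spark
            .read
            .options(header=True, delimiter=',')
            .schema(SCHEMA_DDL)
            .csv(path)
           )
```

It would be called as `df = read_year_credential( spark, in_path )`. Declaring the types also means a malformed value shows up as a null rather than silently turning a whole column into strings.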



### Transform

* Convert the `house_id` column to upper case

In [13]:
df.createOrReplaceTempView( 'year_credential' )

In [14]:
query = '''
SELECT
  upper( house_id ) as house_id,
  year,
  m01,
  m02,
  m03,
  m04,
  m05,
  m06,
  m07,
  m08,
  m09,
  m10,
  m11,
  m12

FROM year_credential
'''
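
Listing the twelve month columns by hand is easy to get wrong. Here is a minimal sketch of building an equivalent query string programmatically, assuming the columns stay named `m01` to `m12`. `build_query` is a hypothetical helper, not part of the original notebook.

```python
def build_query(table='year_credential', n_months=12):
    """Build the SELECT that upper-cases house_id and keeps the other columns."""
    month_cols = [f'm{month:02d}' for month in range(1, n_months + 1)]
    columns = ['upper( house_id ) as house_id', 'year'] + month_cols
    select_list = ',\n  '.join(columns)
    return f'SELECT\n  {select_list}\nFROM {table}\n'

query = build_query()
```

The same transformation can also be written without SQL, using the functions already imported above: `df.withColumn( 'house_id', f.upper( col( 'house_id' ) ) )`.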

In [16]:
df = spark.sql( query )
df.show()

+--------+----+---+---+---+---+---+---+---+---+---+---+---+---+
|house_id|year|m01|m02|m03|m04|m05|m06|m07|m08|m09|m10|m11|m12|
+--------+----+---+---+---+---+---+---+---+---+---+---+---+---+
| 100PINO|2024|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 101PINO|2024|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 102PINO|2024|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 200CAOB|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 201CAOB|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 202CAOB|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 100ABED|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 101ABED|2024|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
| 102ABED|2024|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
+--------+----+---+---+---+---+---+---+---+---+---+---+---+---+



### Load clean data into Data Lake

In [17]:
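# Note: the default save mode is 'errorifexists', so this raises an error
# if out_path already exists (for example, when the notebook is rerun).
(df
 .write
 .option( 'header', True )
 .csv( out_path )
)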

In [18]:
print( 'check your clean data in: {}'.format( out_path ) )

check your clean data in: /home/art/data/hpay/out/year_credential
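
Note that Spark writes `out_path` as a directory containing one or more `part-*.csv` files (plus a `_SUCCESS` marker), not as a single CSV file. As a rough sketch for checking the result without Spark, the part files could be read back with the standard library. `read_spark_csv_dir` is a hypothetical helper, and it assumes uncompressed output written with the header option enabled, as above.

```python
import csv
import glob
import os

def read_spark_csv_dir(path):
    """Return (header, rows) collected from every part-*.csv file under path."""
    header, rows = None, []
    for part in sorted(glob.glob(os.path.join(path, 'part-*.csv'))):
        with open(part, newline='') as handle:
            reader = csv.reader(handle)
            part_header = next(reader, None)   # each part file repeats the header
            if part_header is None:            # skip empty part files
                continue
            header = header or part_header
            rows.extend(reader)
    return header, rows
```

It would be called as `header, rows = read_spark_csv_dir( out_path )`. All values come back as strings, since the `csv` module does no type conversion.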
