
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your aws configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session                                                                                                  |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, s3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X                                                                            |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (eg: %glue_version 2.0)                                |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %sql                        |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |   Changes the session type to Glue ETL.                                                                                                                   |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer                       |

In [None]:
# I set my own magics (config for the job)
%number_of_workers 20

## built in AWS Spark & Glue libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.35 
Previous number of workers: 5
Setting new number of workers to: 20
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::741993363917:role/AWSGlueServiceRole
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 20
Session ID: 4c2c2e64-5a3e-4641-a572-8bc844132190
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session 4c2c2e64-5a3e-4641-a572-8bc844132190 to get into ready status...
Session 4c2c2e64-5a3e-4641-a572-8bc844132190 has been created




In [2]:
### my import libraries
from io import StringIO
import numpy as np
import pandas as pd
# import awswrangler as wr
import glob
import re
#import boto3
regex_schema = "/*.csv"
from string import printable
st = set(printable)
from datetime import datetime

import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.functions import when

from pyspark.sql.window import Window
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType, IntegerType

pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

## Setting run date & save path
import datetime
a = datetime.datetime.now()
# date_key = a.strftime('%Y%m%d')
date_key = '20221125' # temporary
print(date_key)

# UAMS_PySpark_save_path = 's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/' # old Qubole Zepp path
UAMS_PySpark_save_path = 's3://astro-groupdata-prod-pipeline/address_standardization/spark_uams_generation/' 

20221125


In [None]:
## assign variable names and paths at this step to allow for easier changing
# phase 1
TM_P1MDU = 's3://astro-groupdata-prod-pipeline/address_standardization/tm_uams_mdu/UAMS_Format_stndrd_TM_P1_MDU20221014.csv'
TM_P1SDU = 's3://astro-groupdata-prod-pipeline/address_standardization/tm_uams_sdu/UAMS_Format_stndrd_TM_P1_SDU20221014.csv'
TM_P2SDU = 's3://astro-groupdata-prod-pipeline/address_standardization/tm_uams_sdu/UAMS_Format_stndrd_TM_P2_SDU20221014.csv'

ALLO_P1MDU = 's3://astro-groupdata-prod-pipeline/address_standardization/allo_uams_mdu/UAMS_Format_stndrd_Allo_P1_MDU.csv'
ALLO_P1SDU = 's3://astro-groupdata-prod-pipeline/address_standardization/allo_uams_sdu/UAMS_Format_stndrd_Allo_P1_SDU.csv'
ALLO_P2SDU = 's3://astro-groupdata-prod-pipeline/address_standardization/allo_uams_sdu/UAMS_Format_stndrd_Allo_P2_SDU.csv'

MAXIS_P1MDU = 's3://astro-groupdata-prod-pipeline/address_standardization/maxis_uams_mdu/UAMS_Format_stndrd_maxis_P1_MDU.csv'
MAXIS_P1SDU = 's3://astro-groupdata-prod-pipeline/address_standardization/maxis_uams_sdu/UAMS_Format_stndrd_maxis_P1_SDU.csv'
MAXIS_P2SDU = 's3://astro-groupdata-prod-pipeline/address_standardization/maxis_uams_sdu/UAMS_Format_stndrd_maxis_P2_SDU.csv'

CTS_P1MDU = 's3://astro-groupdata-prod-pipeline/address_standardization/cts_uams_mdu/UAMS_Format_stndrd_CTS_P1_MDU.csv'
CTS_P1SDU = 's3://astro-groupdata-prod-pipeline/address_standardization/cts_uams_sdu/UAMS_Format_stndrd_CTS_P1_SDU.csv'
CTS_P2SDU = 's3://astro-groupdata-prod-pipeline/address_standardization/cts_uams_sdu/UAMS_Format_stndrd_CTS_P2_SDU.csv'

# phase 2
astro_new_std_path = "s3://astro-groupdata-prod-pipeline/address_standardization/astro_new_standardized/historical_folder/astro_new_standardized_20221025.csv"
tm_new_std_path = "s3://astro-groupdata-prod-pipeline/address_standardization/tm_new_standardized/historical_folder/TM_New_Standardised_20221013.csv.gz"
allo_new_std_path = "s3://astro-groupdata-prod-pipeline/address_standardization/allo_new_standardized/Allo_New_Standardised.csv"
cts_new_std_path = "s3://astro-groupdata-prod-pipeline/address_standardization/cts_new_standardized/CTS_New_Standardised_202209_Reformatted-SarahLocal.csv"
maxis_new_std_path = "s3://astro-groupdata-prod-pipeline/address_standardization/maxis_new_standardized/Maxis_New_Standardised.csv"



# PHASE 1
- Read and format P1/P2 mapped addresses for all the ISPs and append them

----
========================================= THIS IS THE START OF PHASE 1 ============================================
- taken from Glue Job: address_standardization-prod-uams_generation_final_1 (Pipeline 1)
- originally converted to PySpark in Zepp Qubole notebook: https://us.qubole.com/notebooks#home?id=141583&type=my-notebooks&view=home
Combine all of P1P2 Mapped Addresses from each ISP

<!-- Old file paths (from Zeppelin Qubole notebook):
TM_P1MDU = 's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_TM_P1_MDU20221014.csv'
TM_P1SDU = 's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_TM_P1_SDU20221014.csv'
TM_P2SDU = 's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_TM_P2_SDU20221014.csv' -->
<!-- 's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_ALLO_P1_MDU-20220812.csv'
's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_ALLO_P1_SDU-20220812.csv'
's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_ALLO_P2_SDU-20220812.csv' -->
<!-- 's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_MAXIS_P1_MDU-20220812.csv'
's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_MAXIS_P1_SDU-20220812.csv'
's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_MAXIS_P2_SDU-20220812.csv' -->
<!-- 's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_CTS_P1_MDU-20220320.csv'
's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_CTS_P1_SDU-20220320.csv'
's3://astro-datalake-prod-sandbox/amzar/BB/AddrStd/testing/20221028_UAMS_Step4_PySpark/uploaded/UAMS_Format_stndrd_CTS_P2_SDU-20220320.csv' -->

In [5]:
ISP_Name = 'TM'

### P1 MDU
UAMS_MDU_P1_TM = spark.read.csv(TM_P1MDU, header=True)
# print(UAMS_MDU_P1_TM.columns)

UAMS_MDU_P1_TM = UAMS_MDU_P1_TM.withColumn('Address_Type', f.lit('MDU')) # decided to label all as MDU instead of doing an if/else or where condition
UAMS_MDU_P1_TM = UAMS_MDU_P1_TM.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_MDU_P1_TM = UAMS_MDU_P1_TM.withColumn('P_Flag', f.lit('P1'))
print('UAMS MDU P1 TM is: ', UAMS_MDU_P1_TM.select('Account_No').count()) # 179663
# print(UAMS_MDU_P1_TM.columns)
# z.show(UAMS_MDU_P1_TM.limit(5))

### P1 SDU
UAMS_SDU_P1_TM = spark.read.csv(TM_P1SDU, header=True)
# print(UAMS_SDU_P1_TM.columns)

UAMS_SDU_P1_TM = UAMS_SDU_P1_TM.withColumn('Address_Type', f.lit('SDU'))
UAMS_SDU_P1_TM = UAMS_SDU_P1_TM.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_SDU_P1_TM = UAMS_SDU_P1_TM.withColumn('P_Flag', f.lit('P1'))
print('UAMS SDU P1 TM is: ', UAMS_SDU_P1_TM.select('Account_No').count()) # 1639756
# print(UAMS_SDU_P1_TM.columns)
# z.show(UAMS_SDU_P1_TM.limit(5))

### P2 SDU
UAMS_SDU_P2_TM = spark.read.csv(TM_P2SDU, header=True)
# print(UAMS_SDU_P2_TM.columns)

UAMS_SDU_P2_TM = UAMS_SDU_P2_TM.withColumn('Address_Type', f.lit('SDU'))
UAMS_SDU_P2_TM = UAMS_SDU_P2_TM.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_SDU_P2_TM = UAMS_SDU_P2_TM.withColumn('P_Flag', f.lit('P2'))
print('UAMS SDU P2 TM is: ', UAMS_SDU_P2_TM.select('Account_No').count()) # 1760063
print(UAMS_SDU_P2_TM.columns)
# z.show(UAMS_SDU_P2_TM.limit(5))

### append P1/P2 files and keep first in drop duplicate (TM)

## first ensure acc_no string & no .0 in them
UAMS_MDU_P1_TM = UAMS_MDU_P1_TM.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_SDU_P1_TM = UAMS_SDU_P1_TM.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_SDU_P2_TM = UAMS_SDU_P2_TM.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
# print(UAMS_MDU_P1_TM.select('Account_No').count(), UAMS_SDU_P1_TM.select('Account_No').count(), UAMS_SDU_P2_TM.select('Account_No').count()) # 179663 1639756 1760063

## union all UAMS files together
UAMS_P1P2_TM = UAMS_MDU_P1_TM.union(UAMS_SDU_P1_TM).union(UAMS_SDU_P2_TM)

## create a sequential index to make the order of importance: P1 MDU > P1 SDU > P2 SDU, but to do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
UAMS_P1P2_TM = UAMS_P1P2_TM.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )
print('UAMS P1P2 TM {} before de-dupe:'.format(str(ISP_Name)), UAMS_P1P2_TM.select('Account_No').count()) # 3579482

## de-dupe on Account_No & keep first based on index. To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy(['Account_No']).orderBy(f.col("index").asc())
UAMS_P1P2_TM = UAMS_P1P2_TM.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row').drop('index')
print('UAMS P1P2 TM {} after de-dupe:'.format(str(ISP_Name)), UAMS_P1P2_TM.select('Account_No').count(), UAMS_P1P2_TM.select(f.countDistinct('Account_No')).show()) # 3579332 

# z.show(UAMS_P1P2_TM.head(5))

['_c0', 'Account_No', 'OBJID', 'House_No', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'POSTCODE', 'ASTRO_STATE', 'Standard_Building_Name', 'ServiceType', 'Servicable', 'HNUM_STRT_TM', 'Address_Type', 'P_Flag']


In [6]:
## assign variable names and paths at this step to allow for easier changing
ISP_Name = 'ALLO'

### P1 MDU
UAMS_MDU_P1_ALLO = spark.read.csv(ALLO_P1MDU, header=True)
# print(UAMS_MDU_P1_ALLO.columns)

UAMS_MDU_P1_ALLO = UAMS_MDU_P1_ALLO.withColumn('Address_Type', f.lit('MDU')) # decided to label all as MDU instead of doing an if/else or where condition
UAMS_MDU_P1_ALLO = UAMS_MDU_P1_ALLO.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_MDU_P1_ALLO = UAMS_MDU_P1_ALLO.withColumn('P_Flag', f.lit('P1'))
print('UAMS MDU P1 ALLO is: ', UAMS_MDU_P1_ALLO.select('Account_No').count()) # 552
# print(UAMS_MDU_P1_ALLO.columns)
# z.show(UAMS_MDU_P1_ALLO.limit(5))

### P1 SDU
UAMS_SDU_P1_ALLO = spark.read.csv(ALLO_P1SDU, header=True)
# print(UAMS_SDU_P1_ALLO.columns)

UAMS_SDU_P1_ALLO = UAMS_SDU_P1_ALLO.withColumn('Address_Type', f.lit('SDU'))
UAMS_SDU_P1_ALLO = UAMS_SDU_P1_ALLO.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_SDU_P1_ALLO = UAMS_SDU_P1_ALLO.withColumn('P_Flag', f.lit('P1'))
print('UAMS SDU P1 ALLO is: ', UAMS_SDU_P1_ALLO.select('Account_No').count()) # 46575
# print(UAMS_SDU_P1_ALLO.columns)
# z.show(UAMS_SDU_P1_ALLO.limit(5))

### P2 SDU
UAMS_SDU_P2_ALLO = spark.read.csv(ALLO_P2SDU, header=True)
# print(UAMS_SDU_P2_ALLO.columns)

UAMS_SDU_P2_ALLO = UAMS_SDU_P2_ALLO.withColumn('Address_Type', f.lit('SDU'))
UAMS_SDU_P2_ALLO = UAMS_SDU_P2_ALLO.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_SDU_P2_ALLO = UAMS_SDU_P2_ALLO.withColumn('P_Flag', f.lit('P2'))
print('UAMS SDU P2 ALLO is: ', UAMS_SDU_P2_ALLO.select('Account_No').count()) # 48990
print(UAMS_SDU_P2_ALLO.columns)
# z.show(UAMS_SDU_P2_ALLO.limit(5))

### append P1/P2 files and keep first in drop duplicate (ALLO)

## first ensure acc_no string & no .0 in them
UAMS_MDU_P1_ALLO = UAMS_MDU_P1_ALLO.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_SDU_P1_ALLO = UAMS_SDU_P1_ALLO.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_SDU_P2_ALLO = UAMS_SDU_P2_ALLO.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )

print(UAMS_MDU_P1_ALLO.select('Account_No').count(), UAMS_SDU_P1_ALLO.select('Account_No').count(), UAMS_SDU_P2_ALLO.select('Account_No').count())

## union all UAMS files together
UAMS_P1P2_ALLO = UAMS_MDU_P1_ALLO.union(UAMS_SDU_P1_ALLO).union(UAMS_SDU_P2_ALLO)

# create a sequential index to make the order of importance: P1 MDU > P1 SDU > P2 SDU, but to do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
UAMS_P1P2_ALLO = UAMS_P1P2_ALLO.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )
print('UAMS P1P2 ALLO {} before de-dupe:'.format(str(ISP_Name)), UAMS_P1P2_ALLO.select('Account_No').count()) # 96117

# de-dupe on Account_No & keep first based on index. To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy(['Account_No']).orderBy(f.col("index").asc())
UAMS_P1P2_ALLO = UAMS_P1P2_ALLO.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row').drop('index')
print('UAMS P1P2 ALLO {} after de-dupe:'.format(str(ISP_Name)), UAMS_P1P2_ALLO.select('Account_No').count(), UAMS_P1P2_ALLO.select(f.countDistinct('Account_No')).show()) # 96117

# z.show(UAMS_P1P2_ALLO.head(5))

UAMS MDU P1 ALLO is:  552
UAMS SDU P1 ALLO is:  46575
UAMS SDU P2 ALLO is:  48990
['_c0', 'Account_No', 'OBJID', 'House_No', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'POSTCODE', 'ASTRO_STATE', 'Standard_Building_Name', 'ServiceType', 'Servicable', 'HNUM_STRT_TM', 'Address_Type', 'P_Flag']
552 46575 48990
UAMS P1P2 ALLO ALLO before de-dupe: 96117
+--------------------------+
|count(DISTINCT Account_No)|
+--------------------------+
|                     96117|
+--------------------------+

UAMS P1P2 ALLO ALLO after de-dupe: 96117 None


In [7]:
## assign variable names and paths at this step to allow for easier changing
ISP_Name = 'MAXIS'

### P1 MDU
UAMS_MDU_P1_MAXIS = spark.read.csv(MAXIS_P1MDU, header=True)
# print(UAMS_MDU_P1_MAXIS.columns)

UAMS_MDU_P1_MAXIS = UAMS_MDU_P1_MAXIS.withColumn('Address_Type', f.lit('MDU')) # decided to label all as MDU instead of doing an if/else or where condition
UAMS_MDU_P1_MAXIS = UAMS_MDU_P1_MAXIS.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_MDU_P1_MAXIS = UAMS_MDU_P1_MAXIS.withColumn('P_Flag', f.lit('P1'))
print('UAMS MDU P1 MAXIS is: ', UAMS_MDU_P1_MAXIS.select('Account_No').count()) # 9279
# print(UAMS_MDU_P1_MAXIS.columns)
# z.show(UAMS_MDU_P1_MAXIS.limit(5))

### P1 SDU
UAMS_SDU_P1_MAXIS = spark.read.csv(MAXIS_P1SDU, header=True)
# print(UAMS_SDU_P1_MAXIS.columns)

UAMS_SDU_P1_MAXIS = UAMS_SDU_P1_MAXIS.withColumn('Address_Type', f.lit('SDU'))
UAMS_SDU_P1_MAXIS = UAMS_SDU_P1_MAXIS.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_SDU_P1_MAXIS = UAMS_SDU_P1_MAXIS.withColumn('P_Flag', f.lit('P1'))
print('UAMS SDU P1 MAXIS is: ', UAMS_SDU_P1_MAXIS.select('Account_No').count()) # 12406
# print(UAMS_SDU_P1_MAXIS.columns)
# z.show(UAMS_SDU_P1_MAXIS.limit(5))

### P2 SDU
UAMS_SDU_P2_MAXIS = spark.read.csv(MAXIS_P2SDU, header=True)
# print(UAMS_SDU_P2_MAXIS.columns)

UAMS_SDU_P2_MAXIS = UAMS_SDU_P2_MAXIS.withColumn('Address_Type', f.lit('SDU'))
UAMS_SDU_P2_MAXIS = UAMS_SDU_P2_MAXIS.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_SDU_P2_MAXIS = UAMS_SDU_P2_MAXIS.withColumn('P_Flag', f.lit('P2'))
print('UAMS SDU P2 MAXIS is: ', UAMS_SDU_P2_MAXIS.select('Account_No').count()) # 39735
print(UAMS_SDU_P2_MAXIS.columns)
# z.show(UAMS_SDU_P2_MAXIS.limit(5))

### append P1/P2 files and keep first in drop duplicate (MAXIS)

## first ensure acc_no string & no .0 in them
UAMS_MDU_P1_MAXIS = UAMS_MDU_P1_MAXIS.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_SDU_P1_MAXIS = UAMS_SDU_P1_MAXIS.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_SDU_P2_MAXIS = UAMS_SDU_P2_MAXIS.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )

print(UAMS_MDU_P1_MAXIS.select('Account_No').count(), UAMS_SDU_P1_MAXIS.select('Account_No').count(), UAMS_SDU_P2_MAXIS.select('Account_No').count())

## union all UAMS files together
UAMS_P1P2_MAXIS = UAMS_MDU_P1_MAXIS.union(UAMS_SDU_P1_MAXIS).union(UAMS_SDU_P2_MAXIS)

# create a sequential index to make the order of importance: P1 MDU > P1 SDU > P2 SDU, but to do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
UAMS_P1P2_MAXIS = UAMS_P1P2_MAXIS.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )
print('UAMS P1P2 MAXIS {} before de-dupe:'.format(str(ISP_Name)), UAMS_P1P2_MAXIS.select('Account_No').count()) # 61420

# de-dupe on Account_No & keep first based on index. To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy(['Account_No']).orderBy(f.col("index").asc())
UAMS_P1P2_MAXIS = UAMS_P1P2_MAXIS.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row').drop('index')
print('UAMS P1P2 MAXIS {} after de-dupe:'.format(str(ISP_Name)), UAMS_P1P2_MAXIS.select('Account_No').count(), UAMS_P1P2_MAXIS.select(f.countDistinct('Account_No')).show()) # 61417

# z.show(UAMS_P1P2_MAXIS.head(5))

UAMS MDU P1 MAXIS is:  9279
UAMS SDU P1 MAXIS is:  12406
UAMS SDU P2 MAXIS is:  39735
['_c0', 'Account_No', 'OBJID', 'House_No', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'POSTCODE', 'ASTRO_STATE', 'Standard_Building_Name', 'ServiceType', 'Servicable', 'HNUM_STRT_TM', 'Address_Type', 'P_Flag']
9279 12406 39735
UAMS P1P2 MAXIS MAXIS before de-dupe: 61420
+--------------------------+
|count(DISTINCT Account_No)|
+--------------------------+
|                     61417|
+--------------------------+

UAMS P1P2 MAXIS MAXIS after de-dupe: 61418 None


In [8]:
## assign variable names and paths at this step to allow for easier changing
ISP_Name = 'CTS'

### P1 MDU
UAMS_MDU_P1_CTS = spark.read.csv(CTS_P1MDU, header=True)
# print(UAMS_MDU_P1_CTS.columns)

UAMS_MDU_P1_CTS = UAMS_MDU_P1_CTS.withColumn('Address_Type', f.lit('MDU')) # decided to label all as MDU instead of doing an if/else or where condition
UAMS_MDU_P1_CTS = UAMS_MDU_P1_CTS.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_MDU_P1_CTS = UAMS_MDU_P1_CTS.withColumn('P_Flag', f.lit('P1'))
print('UAMS MDU P1 CTS is: ', UAMS_MDU_P1_CTS.select('Account_No').count()) # 0
# print(UAMS_MDU_P1_CTS.columns)
# z.show(UAMS_MDU_P1_CTS.limit(5))

### P1 SDU
UAMS_SDU_P1_CTS = spark.read.csv(CTS_P1SDU, header=True)
# print(UAMS_SDU_P1_CTS.columns)

UAMS_SDU_P1_CTS = UAMS_SDU_P1_CTS.withColumn('Address_Type', f.lit('SDU'))
UAMS_SDU_P1_CTS = UAMS_SDU_P1_CTS.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_SDU_P1_CTS = UAMS_SDU_P1_CTS.withColumn('P_Flag', f.lit('P1'))
print('UAMS SDU P1 CTS is: ', UAMS_SDU_P1_CTS.select('Account_No').count()) # 7707
# print(UAMS_SDU_P1_CTS.columns)
# z.show(UAMS_SDU_P1_CTS.limit(5))

### P2 SDU
UAMS_SDU_P2_CTS = spark.read.csv(CTS_P2SDU, header=True)
# print(UAMS_SDU_P2_CTS.columns)

UAMS_SDU_P2_CTS = UAMS_SDU_P2_CTS.withColumn('Address_Type', f.lit('SDU'))
UAMS_SDU_P2_CTS = UAMS_SDU_P2_CTS.withColumnRenamed('service_add_objid', 'OBJID')
UAMS_SDU_P2_CTS = UAMS_SDU_P2_CTS.withColumn('P_Flag', f.lit('P2'))
print('UAMS SDU P2 CTS is: ', UAMS_SDU_P2_CTS.select('Account_No').count()) # 37821
print(UAMS_SDU_P2_CTS.columns)
# z.show(UAMS_SDU_P2_CTS.limit(5))

### append P1/P2 files and keep first in drop duplicate (CTS)

## first ensure acc_no string & no .0 in them
UAMS_MDU_P1_CTS = UAMS_MDU_P1_CTS.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_SDU_P1_CTS = UAMS_SDU_P1_CTS.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_SDU_P2_CTS = UAMS_SDU_P2_CTS.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )

print(UAMS_MDU_P1_CTS.select('Account_No').count(), UAMS_SDU_P1_CTS.select('Account_No').count(), UAMS_SDU_P2_CTS.select('Account_No').count())

## union all UAMS files together
UAMS_P1P2_CTS = UAMS_MDU_P1_CTS.union(UAMS_SDU_P1_CTS).union(UAMS_SDU_P2_CTS)

# create a sequential index to make the order of importance: P1 MDU > P1 SDU > P2 SDU, but to do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
UAMS_P1P2_CTS = UAMS_P1P2_CTS.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )
print('UAMS P1P2 CTS {} before de-dupe:'.format(str(ISP_Name)), UAMS_P1P2_CTS.select('Account_No').count()) # 45528

# de-dupe on Account_No & keep first based on index. To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy(['Account_No']).orderBy(f.col("index").asc())
UAMS_P1P2_CTS = UAMS_P1P2_CTS.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row').drop('index')
print('UAMS P1P2 CTS {} after de-dupe:'.format(str(ISP_Name)), UAMS_P1P2_CTS.select('Account_No').count(), UAMS_P1P2_CTS.select(f.countDistinct('Account_No')).show()) # 45528

# z.show(UAMS_P1P2_CTS.head(5))

## 31/10/2022: brief workaround because CTS file does not have HNUM_STRT_TM column so it wouldn't union properly with the other files
UAMS_P1P2_CTS = UAMS_P1P2_CTS.withColumn('HNUM_STRT_TM', f.lit('CTS_file'))

UAMS MDU P1 CTS is:  0
UAMS SDU P1 CTS is:  7707
UAMS SDU P2 CTS is:  37821
['_c0', 'Account_No', 'OBJID', 'House_No', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'POSTCODE', 'ASTRO_STATE', 'Standard_Building_Name', 'ServiceType', 'Servicable', 'Address_Type', 'P_Flag']
0 7707 37821
UAMS P1P2 CTS CTS before de-dupe: 45528
+--------------------------+
|count(DISTINCT Account_No)|
+--------------------------+
|                     45527|
+--------------------------+

UAMS P1P2 CTS CTS after de-dupe: 45528 None


### Append all ISP's P1/P2 mapped addresses

In [10]:
## Revision - fakhrul - 28/6/22 adding these 3 lines of code below because some of them are lower case
UAMS_P1P2_TM = UAMS_P1P2_TM.withColumn('Servicable', f.upper(f.col('Servicable').cast('string')) )
UAMS_P1P2_ALLO = UAMS_P1P2_ALLO.withColumn('Servicable', f.upper(f.col('Servicable').cast('string')) )
UAMS_P1P2_CTS = UAMS_P1P2_CTS.withColumn('Servicable', f.upper(f.col('Servicable').cast('string')) )

## checking servicable of each ISPs they have to be capital except for maxis
print('checking servicable column (TM) :', UAMS_P1P2_TM.select('Servicable').distinct().show()) # null, TM
print('checking servicable column (ALLO) :', UAMS_P1P2_ALLO.select('Servicable').distinct().show()) # ALLO
print('checking servicable column (Maxis) :', UAMS_P1P2_MAXIS.select('Servicable').distinct().show()) # null, maxis
print('checking servicable column (CTS) :', UAMS_P1P2_CTS.select('Servicable').distinct().show()) # null, CTS

## revision - fakhrul - 8/9/22 saving these files to test them locally
UAMS_P1P2_TM.coalesce(1).write.csv(UAMS_PySpark_save_path+'localP1P2/historical_folder/TM_LOCAL_{}.csv'.format(str(date_key)), header=True, mode='overwrite')
UAMS_P1P2_ALLO.coalesce(1).write.csv(UAMS_PySpark_save_path+'localP1P2/historical_folder/ALLO_LOCAL_{}.csv'.format(str(date_key)), header=True, mode='overwrite')
UAMS_P1P2_MAXIS.coalesce(1).write.csv(UAMS_PySpark_save_path+'localP1P2/historical_folder/MAXIS_LOCAL_{}.csv'.format(str(date_key)), header=True, mode='overwrite')
UAMS_P1P2_CTS.coalesce(1).write.csv(UAMS_PySpark_save_path+'localP1P2/historical_folder/CTS_LOCAL_{}.csv'.format(str(date_key)), header=True, mode='overwrite')

## union the 4 ISP DF's together then clean some columns
UAMS_P1P2 = UAMS_P1P2_TM.union(UAMS_P1P2_MAXIS).union(UAMS_P1P2_ALLO).union(UAMS_P1P2_CTS)
UAMS_P1P2 = UAMS_P1P2.withColumn("Account_No", f.regexp_replace(f.col('Account_No').cast('string'), '\.0', '') )
UAMS_P1P2 = UAMS_P1P2.withColumn("OBJID", f.regexp_replace(f.col('OBJID').cast('string'), '\.0', '') )

## after unioning, delete the files to free up RAM
del UAMS_P1P2_TM
del UAMS_P1P2_ALLO
del UAMS_P1P2_MAXIS
del UAMS_P1P2_CTS

## create a sequential index to make the index: TM > MAXIS > ALLO > CTS, but to do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
UAMS_P1P2 = UAMS_P1P2.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )
print('UAMS P1P2 all ISPs before remove ISP|Z (now ISP|null):', UAMS_P1P2.select('Account_No').count()) # 3782395
print('Unique acc_no in UAMS P1P2 all ISPs before remove ISP|Z (now ISP|null):', UAMS_P1P2.dropDuplicates(subset=['Account_No']).select('Account_No').count()) # 3600655

# z.show(UAMS_P1P2.head(5))

## Fill '' for nans & filter out cases without ServiceType (after upper case & trim)
UAMS_P1P2 = UAMS_P1P2.fillna('')
UAMS_P1P2 = UAMS_P1P2.withColumn('ServiceType', f.upper(f.trim(f.col('ServiceType'))) ).filter(f.col('ServiceType') != '').filter(f.col('ServiceType').isNotNull()).filter(f.col('ServiceType') != 'NAN')

## create the Serviceable column
UAMS_P1P2 = UAMS_P1P2.withColumn('Serviceable', f.concat_ws('|', f.col("Servicable"), f.col("ServiceType")) )
print('checking UAMS P1P2 count after checking unique serviceable : ', UAMS_P1P2.select('Account_No').count()) # 3780562
print('checking unique acc_no in UAMS P1P2 count after checking unique serviceable : ', UAMS_P1P2.dropDuplicates(subset=['Account_No']).select('Account_No').count()) # 3600103
print('checking all unique serviceable here!!!', UAMS_P1P2.dropDuplicates(subset=['Account_No']).groupBy('Servicable').count().orderBy('count', ascending=False).show()) # 3600103

## clean OBJID column (again)
UAMS_P1P2 = UAMS_P1P2.withColumn('OBJID',  f.regexp_replace(f.col('OBJID').cast('string'), '\.0', '') )

+----------+
|Servicable|
+----------+
|      null|
|        TM|
+----------+

checking servicable column (TM) : None
+----------+
|Servicable|
+----------+
|      ALLO|
+----------+

checking servicable column (ALLO) : None
+----------+
|Servicable|
+----------+
|      null|
|     maxis|
+----------+

checking servicable column (Maxis) : None
+----------+
|Servicable|
+----------+
|      null|
|       CTS|
+----------+

checking servicable column (CTS) : None
UAMS P1P2 all ISPs before remove ISP|Z (now ISP|null): 3782395
Unique acc_no in UAMS P1P2 all ISPs before remove ISP|Z (now ISP|null): 3600655
checking UAMS P1P2 count after checking unique serviceable :  3780562
checking unique acc_no in UAMS P1P2 count after checking unique serviceable :  3600103
+----------+-------+
|Servicable|  count|
+----------+-------+
|        TM|3557269|
|     maxis|  19157|
|      ALLO|  17451|
|       CTS|   6226|
+----------+-------+

checking all unique serviceable here!!! None


### Combine all Serviceable values for each Acc_No

In [11]:
## GroupBy Acc_No & get ALL Serviceable values for that Acc_No
UAMS_P1P2_1 = UAMS_P1P2.groupBy('Account_No').agg(f.collect_set('Serviceable').alias('Serviceable_New'))
UAMS_P1P2_1 = UAMS_P1P2_1.withColumn("Serviceable_New", f.concat_ws(',', f.col('Serviceable_New')) )
# UAMS_P1P2_1.select('Serviceable_New').distinct().show() ## this code looks correct

print('checking uams p1p2_1 count: ', UAMS_P1P2_1.count()) # 3600103
print('checking uams p1p2 count: ', UAMS_P1P2.count()) # 3780562
# z.show(UAMS_P1P2.select('Account_No').head(10))

## JOIN UAMS_P1P2_1 to UAMS_P1P2
UAMS_P1P2_Merg = UAMS_P1P2_1.join(UAMS_P1P2, on='Account_No', how='left')
print('Count of UAMS P1P2 Merg before drop duplicates: ', UAMS_P1P2_Merg.select('Account_No').count() ) # 3780562
# print('checking head uams p1p2 merg before drop duplicates: ', UAMS_P1P2_Merg[['Serviceable_New']].show(10))

## de-dupe on Acc_No & Serviceable_New
UAMS_P1P2_Merg = UAMS_P1P2_Merg.dropDuplicates(subset=['Account_No','Serviceable_New'])
print('Count of UAMS P1P2 Merg after drop duplicates: ', UAMS_P1P2_Merg.select('Account_No').count() ) # 3600103
# print('checking head uams p1p2 merg after drop duplicates: ', UAMS_P1P2_Merg[['Serviceable_New']].show(10))

checking uams p1p2_1 count:  3600103
checking uams p1p2 count:  3780562
Count of UAMS P1P2 Merg before drop duplicates:  3780562
Count of UAMS P1P2 Merg after drop duplicates:  3600103


### Cleaning and standardizing State, then cleaning the address columns

In [12]:
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.upper(f.trim(f.col('ASTRO_STATE').cast('string'))) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'FEDERAL TERRITORY OF KUALA LUMPUR','WIL'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WILAYAH PERSEKUTUAN KUALA LUMPUR','WIL'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WIL KL','WIL'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'KL','WIL'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'LKG','KEDAH'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'SELANGOR','SEL'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'SEL','SELANGOR'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'JOHOR','JOH'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'JOH','JOHOR'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'MELAKA','MEL'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'MEL','MELAKA'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PULAU PINANG','PNG'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PENANG','PNG'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PINANG','PNG'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PNG','PULAU PINANG'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PERAK','PRK'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PRK','PERAK'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PERLIS','PLS'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PLS','PERLIS'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'SABAH','SAB'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'SAB','SABAH'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'SARAWAK','SAR'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'SAR','SARAWAK'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'TERENGGANU','TRG'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'TRG','TERENGGANU'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PAHANG','PHG'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'PHG','PAHANG'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'KEDAH','KED'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'KED','KEDAH'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'NEGERI SEMBILAN','NEG'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'NEG','NEGERI SEMBILAN'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'KELANTAN','KEL'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'KEL','KELANTAN'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WILAYAH PERSEKUTUAN PUTRAJAYA','PUTRAJAYA'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WIL PUTRAJAYA','PUTRAJAYA'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WILAYAH PERSEKUTUAN PUTRAJAYA','PUTRAJAYA'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WILAYAH PERSEKUTUAN LABUAN','LAB'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WILAYAH LABUAN','LAB'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WIL LABUAN','LAB'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'LABUAN','LAB'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'LAB','LABUAN'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WILAYAH PERSEKUTUAN','WIL'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'WIL','WILAYAH PERSEKUTUAN KUALA LUMPUR'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'LGK','KEDAH'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.regexp_replace('ASTRO_STATE', 'SIN','SINGAPORE'))

## SAVE an INTERMEDIATE TABLE ###
# mainly coz the above step (cleaning ASTRO_STATE) took 32 mins & I'm worried the notebook might crash
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate1.orc".format(date_key) , mode='overwrite') # 20/11/22: renamed the savepath




In [14]:
# del UAMS_P1P2_Merg # -- DELETE previous df if required
# read back in the intermediate table
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate1.orc".format(date_key))

# _list = ['#',',',"'" ]
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("House_No", f.regexp_replace("House_No", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("House_No", f.regexp_replace("House_No", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("House_No", f.regexp_replace("House_No", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("House_No", f.regexp_replace("House_No", "'",''))

# _list = ['#',',', '/','-', 'No Name', '\.', '\*', '=', ':','\)', '\(', '`', '_']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '/',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '-',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", 'No Name',''))
# UAMS_P1P2_Merg["Combined_Building"]= np.where( UAMS_P1P2_Merg["Combined_Building"]=='0', '', UAMS_P1P2_Merg["Combined_Building"])
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '\.',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '\*',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '=',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", ':',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '\)',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '\(',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '`',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '_',''))
#revision - zohreh - 5/8/22 - uams complain about carrot
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '\^',''))
#UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '  ',''))
#UAMS_P1P2_Merg["Combined_Building"] = UAMS_P1P2_Merg["Combined_Building"].str.strip()
#UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '(^|\s)($|\s)','', case = False, regex = True)

## assigning SDU, MDU to Address_Type
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Address_Type', when( ((f.col('Combined_Building').isNull()) | (f.col('Combined_Building') == '')), 'SDU').otherwise('MDU') )

# _list = ['#',',', '/','-', 'No Name']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", '/',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", '-',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", 'No Name',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", 'JLN','JALAN'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", 'LRG','LORONG'))

# _list = ['#',',', '/','-', 'No Name']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", '/',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", '-',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", 'No Name',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", 'JLN','JALAN'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", 'LRG','LORONG'))

# _list = ['#',',', 'No Name']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace("Street_1_New", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace("Street_1_New", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace("Street_1_New", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace("Street_1_New", 'No Name',''))

# _list = ['#',',', 'No Name']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace("Street_2_New",'|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace("Street_2_New", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace("Street_2_New", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace("Street_2_New", 'No Name',''))

# _list = ['#',',', '/','-', 'No Name']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.regexp_replace("AREA", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.regexp_replace("AREA", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.regexp_replace("AREA", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.regexp_replace("AREA", '/',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.regexp_replace("AREA", '-',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.regexp_replace("AREA", 'No Name',''))

# _list = ['#',',', '/','-', '=', ':','\)', '\(', 'No Name','\[\]']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '/',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '-',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '=',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", ':',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '\)',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '\(',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", 'No Name',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '\[\]',''))

# _list = ['#',',', '/','-', 'No Name']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", f.regexp_replace("ASTRO_STATE", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", f.regexp_replace("ASTRO_STATE", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", f.regexp_replace("ASTRO_STATE", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", f.regexp_replace("ASTRO_STATE", '/',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", f.regexp_replace("ASTRO_STATE", '-',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", f.regexp_replace("ASTRO_STATE", 'No Name',''))

# _list = ['#',',', '/','-', 'No Name']
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.regexp_replace("POSTCODE", '|'.join(_list), ''))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.regexp_replace("POSTCODE", '#',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.regexp_replace("POSTCODE", ',',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.regexp_replace("POSTCODE", '/',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.regexp_replace("POSTCODE", '-',''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.regexp_replace("POSTCODE", 'No Name',''))




In [None]:
## trim the columns
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Account_No", f.trim(f.col("Account_No")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("OBJID", f.trim(f.col("OBJID")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("House_No", f.trim(f.col("House_No")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.trim(f.col("Combined_Building")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.trim(f.col("Street_1_New")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.trim(f.col("Street_2_New")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.trim(f.col("Street_Type_1")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.trim(f.col("Street_Type_2")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.trim(f.col("POSTCODE")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.trim(f.col("AREA")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.trim(f.col("STD_CITY")) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", f.trim(f.col("ASTRO_STATE")) )

## don't know how to do below ones in pyspark
# UAMS_P1P2_Merg["Account_No"] = UAMS_P1P2_Merg["Account_No"].str.strip('\n\t')
# UAMS_P1P2_Merg["OBJID"] = UAMS_P1P2_Merg["OBJID"].str.strip('\n\t')
# UAMS_P1P2_Merg["House_No"] = UAMS_P1P2_Merg["House_No"].str.strip('\n\t')
# UAMS_P1P2_Merg["Combined_Building"] = UAMS_P1P2_Merg["Combined_Building"].str.strip('\n\t')
# UAMS_P1P2_Merg["Street_1_New"] = UAMS_P1P2_Merg["Street_1_New"].str.strip('\n\t')
# UAMS_P1P2_Merg["Street_2_New"] = UAMS_P1P2_Merg["Street_2_New"].str.strip('\n\t')
# UAMS_P1P2_Merg["Street_Type_1"] = UAMS_P1P2_Merg["Street_Type_1"].str.strip('\n\t')
# UAMS_P1P2_Merg["Street_Type_2"] = UAMS_P1P2_Merg["Street_Type_2"].str.strip('\n\t')
# UAMS_P1P2_Merg["POSTCODE"] = UAMS_P1P2_Merg["POSTCODE"].str.strip('\n\t')
# UAMS_P1P2_Merg["AREA"] = UAMS_P1P2_Merg["AREA"].str.strip('\n\t')
# UAMS_P1P2_Merg["STD_CITY"] = UAMS_P1P2_Merg["STD_CITY"].str.strip('\n\t')
# UAMS_P1P2_Merg["ASTRO_STATE"] = UAMS_P1P2_Merg["ASTRO_STATE"].str.strip('\n\t')

## fill any blanks
UAMS_P1P2_Merg = UAMS_P1P2_Merg.fillna('')

## keep only those with non-null & non-blank postcode, city, street type & street name
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter( (f.col('POSTCODE').isNotNull()) & (f.col('POSTCODE') != ''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter( (f.col('STD_CITY').isNotNull()) & (f.col('STD_CITY') != ''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter( (f.col('Street_Type_1').isNotNull()) & (f.col('Street_Type_1') != ''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter( (f.col('Street_1_New').isNotNull()) & (f.col('Street_1_New') != ''))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter( (f.col('ASTRO_STATE').isNotNull()) & (f.col('ASTRO_STATE') != ''))

## return blanks for any AREA value that starts with 'AA' (null by GAPI)
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.when(f.col('AREA').startswith("AA"), '').otherwise(f.col('AREA')) )

## filter out cases where it starts with 'AA' (null by GAPI) in important columns: city, state, postcode and if there are non-digits in postcode --> should we consider instead of filtering out at this stage, maybe coalesce to get the non-GAPI address component at an earlier Step (maybe Step 2.1 & 2.2)
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(~f.col('STD_CITY').cast('string').startswith("AA"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(~f.col('ASTRO_STATE').cast('string').startswith("AA"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(~f.col('POSTCODE').cast('string').startswith("AA"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(~f.col('POSTCODE').cast('string').rlike("[a-zA-Z]"))

## filter out any '\[\]' coz sometimes Excel handles blanks in cells by returning this
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(f.col('POSTCODE') != '\[\]')
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(f.col('STD_CITY') != '\[\]')
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(f.col('Street_Type_1') != '\[\]')
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(f.col('Street_1_New') != '\[\]')
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(f.col('ASTRO_STATE') != '\[\]')

## ensure POSTCODE is only 5 digits long
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.regexp_replace(f.col('POSTCODE').cast('string'), '\.0', '') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.substring(f.col('POSTCODE'), 1, 5) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Post_length", f.length(f.col('POSTCODE')) )
print(UAMS_P1P2_Merg.select("Post_length").distinct().show())
UAMS_P1P2_Merg = UAMS_P1P2_Merg.drop(*['Post_length'])
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('POSTCODE', f.lpad(f.col('POSTCODE').cast('string'), 5, '0') )

## encode then decode each column into ascii
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("House_No", f.decode(f.encode(f.col('House_No'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Combined_Building", f.decode(f.encode(f.col('Combined_Building'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.decode(f.encode(f.col('Street_1_New'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.decode(f.encode(f.col('Street_2_New'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_2", f.decode(f.encode(f.col('Street_Type_2'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_Type_1", f.decode(f.encode(f.col('Street_Type_1'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", f.decode(f.encode(f.col('POSTCODE'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.decode(f.encode(f.col('STD_CITY'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("AREA", f.decode(f.encode(f.col('AREA'), 'ascii'), 'ascii') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", f.decode(f.encode(f.col('ASTRO_STATE'), 'ascii'), 'ascii') )

### Save an intermediate table
# coz above steps (from cleaning ASTRO_STATE up to encode/decode) took 30ish mins total & I'm worried the notebook might crash
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate2.orc".format(date_key) , mode='overwrite')

+-----------+
|Post_length|
+-----------+
|          1|
|          4|
|          5|
+-----------+

None


In [None]:
### Not sure what code below does but I think it's to replace anything in string printable (from string import printable) with ''. This cell was mainly to study the codes

## Code in question:
# UAMS_P1P2_Merg["House_No"] = UAMS_P1P2_Merg["House_No"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["Combined_Building"] = UAMS_P1P2_Merg["Combined_Building"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["Street_1_New"] = UAMS_P1P2_Merg["Street_1_New"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["Street_2_New"] = UAMS_P1P2_Merg["Street_2_New"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["Street_Type_1"] = UAMS_P1P2_Merg["Street_Type_1"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["Street_Type_2"] = UAMS_P1P2_Merg["Street_Type_2"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["POSTCODE"] = UAMS_P1P2_Merg["POSTCODE"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["AREA"] = UAMS_P1P2_Merg["AREA"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["STD_CITY"] = UAMS_P1P2_Merg["STD_CITY"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# UAMS_P1P2_Merg["ASTRO_STATE"] = UAMS_P1P2_Merg["ASTRO_STATE"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))

## some experimentation I did to try & figure out what is happening - doesn't look like anything really changes...
# UAMS_P1P2_Merg_pd_test = UAMS_P1P2_Merg.select('Account_No', 'Street_1_New').sample(0.005).toPandas()
# UAMS_P1P2_Merg_pd_test["Street_1_New_1"] = UAMS_P1P2_Merg_pd_test["Street_1_New"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# print(st)
# UAMS_P1P2_Merg_pd_test.head(5)

## attempt to convert to PySpark:
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("House_No", f.concat_ws("", f.array_f.col('House_No')) )

## some codes taken from the MDU enhancement notebook that may be useful:
# astro_kv_clean_2 = astro_kv_clean_2.withColumn('sorted_set', f.array_sort(f.array_distinct(f.split(f.col('new_block_building_name'), pattern=' '))) )
#  astro_kv_clean_2 = astro_kv_clean_2.withColumn('joined_set', f.concat_ws(" ", col("sorted_set")) )

In [17]:
# del UAMS_P1P2_Merg # -- DELETE previous df if required
# read back in the intermediate table
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate2.orc".format(date_key))

## remove NANs in Combined_Building
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Combined_Building', f.regexp_replace(f.upper(f.col('Combined_Building')), 'NAN', '') )

## below step was already done above, looks like it's a repeat
# UAMS_P1P2_Merg["Address_Type"] = np.where(UAMS_P1P2_Merg["Combined_Building"].isnull(), 'SDU', 'MDU')
# UAMS_P1P2_Merg["Address_Type"]= np.where(UAMS_P1P2_Merg["Combined_Building"]=='', 'SDU', 'MDU')

## return blank if OBJID is not length 8
# print(UAMS_P1P2_Merg['OBJID'].map(str).map(len).unique())
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('OBJID', when( (f.length(f.col('OBJID').cast('string')) == 8), f.col('OBJID') ).otherwise(''))
# UAMS_P1P2_Merg['OBJID'].map(str).map(len).unique()

# UAMS_P1P2_Merg[UAMS_P1P2_Merg["STD_CITY"].astype(str).str.startswith("AA")]

## keep only those with account_no length of 8 or 10
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Account_len', f.length(f.col('Account_No').cast('string')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter( (f.col('Account_len') == 8) | (f.col('Account_len') == 10) )
# print(UAMS_P1P2_Merg['Account_len'].unique())
UAMS_P1P2_Merg = UAMS_P1P2_Merg.drop(*['Account_len'])

# create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

## Cleaning up STD_CITY, POSTCODE and ASTRO_STATE MANUALLY (code originally from Maryam/Zohreh)
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", f.regexp_replace(f.col("STD_CITY"), 'KUALA LUMPUR','KL') )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("STD_CITY", when(f.col('STD_CITY') == 'KL', 'KUALA LUMPUR')
                                .when(f.col("STD_CITY") == 'WILAYAH PERSEKUTUANERSEKUTUAN',  'WILAYAH PERSEKUTUAN')
                                .when(f.col("STD_CITY") == 'MENGGATALKOTA KINABALU',  'MENGGATAL')
                                .when(f.col("STD_CITY") == 'TUARANKOTA KINABALU',  'TUARAN')
                                .when(f.col("ASTRO_STATE") == 'KUANTAN', 'KUANTAN')
                                .otherwise(f.col('STD_CITY')) )

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("POSTCODE", when(f.col("POSTCODE") == '96000', 'SARAWAK').otherwise(f.col("POSTCODE")) )

## save intermediate table
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate3.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del UAMS_P1P2_Merg # -- if required
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate3.orc".format(date_key)) ## read in ORC version

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", when(f.col("ASTRO_STATE") == 'WILAYAH PERSEKUTUAN KUALA LUMPURANG', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'W\.P\.', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'PHI', 'PULAU PINANG')
                                .when(f.col("ASTRO_STATE") == 'SEREMBAN', 'NEGERI SEMBILAN')
                                .when(f.col("ASTRO_STATE") == 'KUANTAN', 'PAHANG')
                                .when(f.col("ASTRO_STATE") == 'NEGERI SEMBILANERISEMBILAN', 'NEGERI SEMBILAN')
                                .when(f.col("ASTRO_STATE") == 'SENAWANG', 'NEGERI SEMBILAN')
                                .when(f.col("ASTRO_STATE") == 'SUNGAI PETANI', 'KEDAH')
                                .when(f.col("ASTRO_STATE") == 'PORT DICKSON', 'NEGERI SEMBILAN')
                                .when(f.col("ASTRO_STATE") == 'PENNSYLVANIA', 'PULAU PINANG')
                                .when(f.col("ASTRO_STATE") == 'CHERAS', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .otherwise(f.col("ASTRO_STATE")) )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate4.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del UAMS_P1P2_Merg # -- if required
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate4.orc".format(date_key)) ## read in ORC version

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", when(f.col("ASTRO_STATE") == 'GEORGETOWN', 'PULAU PINANG')
                                .when(f.col("ASTRO_STATE") == 'WP KUALA LUMPUR', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'KUALA TERENGGANU', 'TERENGGANU')
                                .when(f.col("ASTRO_STATE") == 'IPOH', 'PERAK')
                                .when(f.col("ASTRO_STATE") == 'LABUAN FEDERAL TERRITORY', 'LABUAN')
                                .when(f.col("ASTRO_STATE") == 'PASIR PUTEH', 'KELANTAN')
                                .when(f.col("ASTRO_STATE") == 'ALOR SETAR', 'KEDAH')
                                .when(f.col("ASTRO_STATE") == 'BATU CAVES', 'SELANGOR')
                                .when(f.col("ASTRO_STATE") == 'PETALING JAYA', 'SELANGOR')
                                .when(f.col("ASTRO_STATE") == 'BANTING', 'SELANGOR')
                                .when(f.col("ASTRO_STATE") == 'PEKAN NANAS', 'JOHOR')
                                .when(f.col("ASTRO_STATE") == 'KUALA KANGSARAWAK', 'PERAK')
                                .when(f.col("ASTRO_STATE") == 'WP', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'BANGSARAWAK', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .otherwise(f.col("ASTRO_STATE")) )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate5.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del UAMS_P1P2_Merg # -- if required
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate5.orc".format(date_key)) ## read in ORC version

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("ASTRO_STATE", when(f.col("ASTRO_STATE") == 'SRI HARTAMAS', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'SENTUL', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'SEGAMBUT', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'KUALA LUMPUR', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'BATU CAVES', 'SELANGOR')
                                .when(f.col("ASTRO_STATE") ==  'WILAYAH PERSEKUTUAN KUALA LUMPUR WILAYAH PERSEKUTUAN KUALA LUMPURAYAHL 0 PERSEKUTUAN KUALA LUMPUR', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") ==  'SEPANGOR', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") ==  'SELANGORAGOR', 'SELANGOR')
                                .when(f.col("ASTRO_STATE") ==  '  SHAH ALAM', 'SELANGOR')
                                .when(f.col("ASTRO_STATE") == 'W\.P\.', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == '   ', '')
                                .when(f.col("ASTRO_STATE") == '[]', '')
                                .when(f.col("ASTRO_STATE") == 'MALACCA', 'MELAKA')
                                .when(f.col("ASTRO_STATE") == 'WILAYAH PERSEKUTUAN KUALA LUMPURANG', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("ASTRO_STATE") == 'W.P.', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .otherwise(f.col("ASTRO_STATE")) )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate6.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del UAMS_P1P2_Merg # -- if required
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate6.orc".format(date_key)) ## read in ORC version

# ## filter out Postcode_length != 5
# print('before filtering out Post_length != 5', UAMS_P1P2_Merg.select('POSTCODE').count()) # 
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Post_length", f.length(f.col('POSTCODE')) )
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.filter(f.col("Post_length") == 5 )
# print('after filtering out Post_length != 5', UAMS_P1P2_Merg.select('POSTCODE').count()) # 

## fill nulls again
UAMS_P1P2_Merg = UAMS_P1P2_Merg.fillna('')

## make all columns upper case
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('House_No', f.upper(f.col('House_No')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Combined_Building', f.upper(f.col('Combined_Building')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Street_Type_1', f.upper(f.col('Street_Type_1')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Street_1_New', f.upper(f.col('Street_1_New')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Street_Type_2', f.upper(f.col('Street_Type_2')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Street_2_New', f.upper(f.col('Street_2_New')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('AREA', f.upper(f.col('AREA')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('STD_CITY', f.upper(f.col('STD_CITY')) )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('ASTRO_STATE', f.upper(f.col('ASTRO_STATE')) )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate7.orc".format(date_key), mode='overwrite', compression='snappy')




### Clean HouseNo & Street Names that were converted to Dates

In [18]:
del UAMS_P1P2_Merg # -- if required
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate7.orc".format(date_key)) ## read in ORC version

## Fix HouseNo that are converted to date. Pyspark code taken from P2 MDU Mapping Test Qubole Zepp notebooks
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.col('House_No'))

# UAMS_P1P2_Merg['HouseNo'] = UAMS_P1P2_Merg['HouseNo'].str.replace({'HouseNo': { 
#     "JAN-":"01-","-JAN":"-01", "FEB-":"02-", "-FEB":'-02',"MAR-":'03-',"-MAR":"-03",
#     "APR-":"04-","-APR":"-04", "MAY-":"05-","-MAY":"-05", "JUN-":"06-", "-JUN":"-06",
#     "JUL-":"07-", "-JUL":"-07","AUG-":'08-', "-AUG":"-08", "SEP-":"09-", "-SEP":"-09",
#    "OCT-":"10-", "-OCT":"-10",  "NOV-":"11-","-NOV":"-11", "DEC-":"12-","-DEC":"-12"  }}, case = False)
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JAN-","01-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JAN","-01"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "FEB-","02-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-FEB",'-02'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "MAR-",'03-'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-MAR","-03"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "APR-","04-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-APR","-04"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "MAY-","05-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-MAY","-05"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JUN-","06-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JUN","-06"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JUL-","07-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JUL","-07"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "AUG-",'08-'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-AUG","-08"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "SEP-","09-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-SEP","-09"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "OCT-","10-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-OCT","-10"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "NOV-","11-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-NOV","-11"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "DEC-","12-"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-DEC","-12"))

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate8.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del UAMS_P1P2_Merg # -- if required
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate8.orc".format(date_key)) ## read in ORC version

## Fix HouseNo that are converted to date (DD/MM/YYYY format). Pyspark code taken from P2 MDU Mapping Test Qubole Zepp notebooks
# Filter date HouseNo
date_house = UAMS_P1P2_Merg.filter(f.regexp_extract('HouseNo', '^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)\d{4}$', 0) != '' ) 
# Spliting the HouseNo
date_house = date_house.withColumn('block_date',  f.substring(date_house.HouseNo, 1, 2))
date_house = date_house.withColumn('floor',  f.substring(date_house.HouseNo, 4, 2))
date_house = date_house.withColumn('unit',  f.substring(date_house.HouseNo, 9, 2))
# Combine the split HouseNo with dashes: '-'
date_house = date_house.withColumn('HOUSE_NO_ASTRO', f.concat_ws('-', date_house.block_date, date_house.floor, date_house.unit))
# Remove additional column created to combine HouseNo
date_house = date_house.drop(*['block_date','floor','unit'])
# print('date_house:', date_house.select('ACCOUNT_NO').count(), date_house.select(f.countDistinct('ACCOUNT_NO')).show()) #  rows,  unique acc_no

# Filter not date HouseNo
not_date_house = UAMS_P1P2_Merg.filter( f.regexp_extract('HouseNo', '^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)\d{4}$', 0) == '' )
not_date_house = not_date_house.withColumn('HOUSE_NO_ASTRO', f.col('HouseNo'))
# print('not_date_house:', not_date_house.select('ACCOUNT_NO').count(), not_date_house.select(f.countDistinct('ACCOUNT_NO')).show()) #  rows,  unique acc_no

# Append the 2 dfs (date_house, not_date_house)
UAMS_P1P2_Merg = date_house.union(not_date_house)
print('after reunioning the 2:', UAMS_P1P2_Merg.select('ACCOUNT_NO').count(), UAMS_P1P2_Merg.select(f.countDistinct('ACCOUNT_NO')).show()) #  rows,  unique acc_no
# create a sequential index as Zohreh did a pandas reset_index at this step again. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

## rename final house_no column & drop extra house no columns
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('House_No', f.lpad(f.col('HOUSE_NO_ASTRO').cast('string'), 10, ' ') )
UAMS_P1P2_Merg= UAMS_P1P2_Merg.drop(*['HOUSE_NO_ASTRO', 'HouseNo'])

### Save an intermediate table
## coz above steps (from cleaning ASTRO_STATE up to encode/decode) took 30ish mins total & I'm worried the notebook might crash
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate9.orc".format(date_key) , mode='overwrite')

# ------------------------------------------------------------------------------------------------------------

del UAMS_P1P2_Merg # -- DELETE previous df if required
# read back in the intermediate table
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate9.orc".format(date_key))

## Fix Street_1_New that got converted to date format then pad with spaces
# UAMS_P1P2_Merg["Street_1_New"] = UAMS_P1P2_Merg["Street_1_New"].str.replace({'Street_1_New': 
#           {"JAN-":"1/", "-JAN":"/1", "FEB-":"2/", "-FEB":'/2',"MAR-":'3/',"-MAR":"/3", "APR-":"4/","-APR":"/4", "MAY-":"5/","-MAY":"/5", "JUN-":"6/", "-JUN":"/6",
#              "JUL-":"7/", "-JUL":"/7","AUG-":'8/', "-AUG":"/8", "SEP-":"9/", "-SEP":"/9", "OCT-":"10/", "-OCT":"/10",  "NOV-":"11/","-NOV":"/11", "DEC-":"12/","-DEC":"/12"  }}, case = False)
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "JAN-","1/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-JAN","/1"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), 'FEB-','2/'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), '-FEB','/2'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "MAR-",'3/'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-MAR","/3"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "APR-","4/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-APR","/4"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "MAY-","5/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-MAY","/5"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "JUN-","6/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-JUN","/6"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "JUL-","7/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-JUL","/7"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "AUG-",'8/'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-AUG","/8"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "SEP-","9/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-SEP","/9"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "OCT-","10/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-OCT","/10"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "NOV-","11/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-NOV","/11"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "DEC-","12/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-DEC","/12"))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Street_1_New', f.lpad(f.col('Street_1_New').cast('string'), 10, ' ') )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate10.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del UAMS_P1P2_Merg # -- if required
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate10.orc".format(date_key)) ## read in ORC version

## Fix Street_2_New that got converted to date format then pad with spaces
# UAMS_P1P2_Merg["Street_2_New"] = UAMS_P1P2_Merg["Street_2_New"].str.replace({'Street_2_New': 
#           {"JAN-":"1/", "-JAN":"/1", "FEB-":"2/", "-FEB":'/2',"MAR-":'3/',"-MAR":"/3", "APR-":"4/","-APR":"/4", "MAY-":"5/","-MAY":"/5", "JUN-":"6/", "-JUN":"/6",
#              "JUL-":"7/", "-JUL":"/7","AUG-":'8/', "-AUG":"/8", "SEP-":"9/", "-SEP":"/9", "OCT-":"10/", "-OCT":"/10",  "NOV-":"11/","-NOV":"/11", "DEC-":"12/","-DEC":"/12" }}, case = False)
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "JAN-","1/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-JAN","/1"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), 'FEB-','2/'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), '-FEB','/2'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "MAR-",'3/'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-MAR","/3"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "APR-","4/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-APR","/4"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "MAY-","5/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-MAY","/5"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "JUN-","6/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-JUN","/6"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "JUL-","7/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-JUL","/7"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "AUG-",'8/'))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-AUG","/8"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "SEP-","9/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-SEP","/9"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "OCT-","10/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-OCT","/10"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "NOV-","11/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-NOV","/11"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "DEC-","12/"))
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-DEC","/12"))

UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn('Street_2_New', f.lpad(f.col('Street_2_New').cast('string'), 10, ' ') )

## these steps seems to be duplicate steps as above...
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.fillna('')
# UAMS_P1P2_Merg['Combined_Building'] = UAMS_P1P2_Merg['Combined_Building'].str.upper()
# UAMS_P1P2_Merg['Street_Type_1'] = UAMS_P1P2_Merg['Street_Type_1'].str.upper()
# UAMS_P1P2_Merg['Street_Type_2'] = UAMS_P1P2_Merg['Street_Type_2'].str.upper()
# UAMS_P1P2_Merg['Street_1_New'] = UAMS_P1P2_Merg['Street_1_New'].str.upper()
# UAMS_P1P2_Merg['Street_2_New'] = UAMS_P1P2_Merg['Street_2_New'].str.upper()
# UAMS_P1P2_Merg['AREA'] = UAMS_P1P2_Merg['AREA'].str.upper()
# UAMS_P1P2_Merg['STD_CITY'] = UAMS_P1P2_Merg['STD_CITY'].str.upper()
# UAMS_P1P2_Merg['ASTRO_STATE'] = UAMS_P1P2_Merg['ASTRO_STATE'].str.upper()

## rename state column
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumnRenamed('ASTRO_STATE','STATE')

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate11.orc".format(date_key), mode='overwrite', compression='snappy')

+--------------------------+
|count(DISTINCT ACCOUNT_NO)|
+--------------------------+
|                         0|
+--------------------------+

date_house: 0 None
+--------------------------+
|count(DISTINCT ACCOUNT_NO)|
+--------------------------+
|                   3328852|
+--------------------------+

not_date_house: 3328852 None
+--------------------------+
|count(DISTINCT ACCOUNT_NO)|
+--------------------------+
|                   3328852|
+--------------------------+

after reunioning the 2: 3328852 None


In [19]:
del UAMS_P1P2_Merg # -- if required
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-intermediate11.orc".format(date_key)) ## read in ORC version

## Create the "Key" column (is this the unique identifier in UAMS? Or just the full address?). Then upper case & replace spaces with blanks
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Key", f.concat_ws(" ,", "House_No", "Combined_Building", "Street_Type_1", "Street_1_New", "Street_Type_2", "Street_2_New", "STD_CITY", "AREA", "POSTCODE", "STATE") )
UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Key", f.regexp_replace(f.upper(f.col("Key")), " ", "") )

## Creating Address_ID based on address_key by converting the 'categorical' values to numerical values. In PySpark: https://stackoverflow.com/questions/45507803/pyspark-dataframe-how-to-convert-one-column-from-categorical-values-to-int
# UAMS_P1P2_Merg = UAMS_P1P2_Merg.withColumn("Key_cat", f.col('Key').cast('category'))
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='Key', outputCol='Address_ID')
UAMS_P1P2_Merg_index = indexer.fit(UAMS_P1P2_Merg).transform(UAMS_P1P2_Merg)
UAMS_P1P2_Merg_index = UAMS_P1P2_Merg_index.withColumn('Address_ID', f.col('Address_ID') + 1)
# UAMS_P1P2_Merg['Address_ID'] += 1


## TEMPORARY line (this should've been run earlier in this notebook but coz I forgot to add this line earlier, I am adding it here so I can save to CSV. When re-running, can remove this bottom line
UAMS_P1P2_Merg_index = UAMS_P1P2_Merg_index.withColumn("Serviceable_New", f.concat_ws(',', f.col('Serviceable_New')) )

print('end of pipeline, uams p1p2 merged columns & count is:')
print(UAMS_P1P2_Merg_index.columns)
print(UAMS_P1P2_Merg_index.count()) # 3328852

## SAVE files!
UAMS_P1P2_Merg_index.write.orc(UAMS_PySpark_save_path+'phase_1/{}/UAMS_P1P2_Merg-final.orc'.format(str(date_key)), mode='overwrite') ## also save an orc version just in case
UAMS_P1P2_Merg_index.coalesce(1).write.csv(UAMS_PySpark_save_path+'phase_1/{}/UAMS_P1P2_Merg-final.csv.gz'.format(str(date_key)), header=True, mode='overwrite', compression='gzip')

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megaytes):')
# print(usage)

end of pipeline, uams p1p2 merged columns & count is:
['Account_No', 'Serviceable_New', '_c0', 'OBJID', 'House_No', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'POSTCODE', 'STATE', 'Standard_Building_Name', 'ServiceType', 'Servicable', 'HNUM_STRT_TM', 'Address_Type', 'P_Flag', 'index', 'Serviceable', 'Key', 'Address_ID']


# PHASE 2
- Preparing UAMS format for all standardised base - Astro and ISPs
- to automate!

---
 ===================================== THIS IS THE START OF PIPELINE 2 - BACKUP  ===================================
- taken from Glue Job: address_standardization-prod-uams_generation_final_2_backup
- according to Fakhrul, the order for pipeline 2 is final_2_backup -> final_2_backup_2 -> final_2
- Original Zepp Qub notebook: https://us.qubole.com/notebooks#recent?id=141792&type=my-notebooks&view=recent

In [2]:
# ### Preparing UAMS format for all standardised base - Astro and ISPs

### ASTRO

## read in the data & relevant columns
# Astro_Standard = spark.read.csv("s3://astro-datalake-prod-sandbox/amzar/BB/other_adhoc/uploaded/address_checking/astro_new_standardized_20221025.csv.gz", header=True)
Astro_Standard = spark.read.csv(astro_new_std_path, header=True)  # to automate
Astro_Standard = Astro_Standard.select(['service_add_objid', 'ACCOUNT_NO0','HOUSE_NO', 'AREA', 'STD_CITY', 'ASTRO_STATE', 'POSTCODE', 'Combined_Building','Street_1', 'Street_2','Standard_Building_Name','match'])
## rename columns that got changed at some point to fit the script that Fakhrul has on Glue job
Astro_Standard = Astro_Standard.withColumnRenamed('ACCOUNT_NO0', 'ACCOUNT_NO')

print('this is astro new std :', Astro_Standard.select('ACCOUNT_NO').count()) # 4733318

## create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
Astro_Standard = Astro_Standard.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

## turn 'AA' in Street_1 into blanks and same for Street_2 cases where match == 'Match'
Astro_Standard = Astro_Standard.withColumn('Street_1', when(f.col('Street_1').cast('string').startswith('AA'), '').otherwise(f.col('Street_1')))
Astro_Standard = Astro_Standard.withColumn('Street_2', when(f.col('match').cast('string') == 'Match', '').otherwise(f.col('Street_2')))

# STRT_P1.head()

## uppercase
Astro_Standard = Astro_Standard.withColumn('Street_1', f.upper(f.col('Street_1')))
Astro_Standard = Astro_Standard.withColumn('Street_2', f.upper(f.col('Street_2')))

## PySpark equivalent of defined extract_street function
r1 = "JALAN|JLN|LORONG|LRG|CHANGKAT|LAMAN|LAHAT|LEBUH|LEBUHRAYA|LENGKOK|LINGKARAN|PERSIARAN"
Astro_Standard = Astro_Standard.withColumn('Street_Type_1', f.regexp_extract(f.col('Street_1'), r1, 0) )
Astro_Standard = Astro_Standard.withColumn('Street_Type_2', f.regexp_extract(f.col('Street_2'), r1, 0) )

# STRT_P1.head()

## extract the Street Name without any of the words in the Street Type List
street_type_list = ['JLN','JALAN','LRG', 'LORONG','CHANGKAT', 'LAMAN', 'LAHAT', 'LEBUH', 'LEBUHRAYA', 'LENGKOK ','LINGKARAN', 'PERSIARAN' ]
Astro_Standard = Astro_Standard.withColumn("Street_1_New", f.regexp_replace(f.col("Street_1"), '|'.join(street_type_list), '') )
Astro_Standard = Astro_Standard.withColumn("Street_2_New", f.regexp_replace(f.col("Street_2"), '|'.join(street_type_list), '') )

print('this is astro new std :', Astro_Standard.count()) # 4733318

## select relevant columns & do some renaming
Astro_Standard = Astro_Standard.select(['service_add_objid', 'ACCOUNT_NO','HOUSE_NO', 'AREA', 'STD_CITY', 'ASTRO_STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1','Street_1_New','Street_Type_2',
                                         'Street_2_New','Standard_Building_Name'])
Astro_Standard = Astro_Standard.withColumnRenamed('service_add_objid', 'OBJID').withColumnRenamed('ASTRO_STATE','STATE').withColumnRenamed('HOUSE_NO','HouseNo')

## ensure string type & no .0
Astro_Standard = Astro_Standard.withColumn('OBJID', f.regexp_replace(f.col('OBJID').cast('string'), '\.0', '') )
Astro_Standard = Astro_Standard.withColumn('ACCOUNT_NO', f.regexp_replace(f.col('ACCOUNT_NO').cast('string'), '\.0', '') )

# print(Astro_Standard.info())
print(Astro_Standard.columns)

## cast all columns as string
Astro_Standard = Astro_Standard.select([f.col(column).cast('string') for column in Astro_Standard.columns])

## save to path 
Astro_Standard.write.orc(UAMS_PySpark_save_path+'phase_2/{}/astro_temp_standard.orc'.format(date_key), mode='overwrite', compression='snappy')
# wr.s3.to_parquet(df = Astro_Standard,compression ='snappy', schema_evolution = False, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/final_2_astro_standard.snappy.parquet')

del Astro_Standard # delete to free up running memory 

['OBJID', 'ACCOUNT_NO', 'HouseNo', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'Standard_Building_Name']


In [3]:
### TM
# -- did a check in the 20221013 file: no values where STD_CITY or CITY_coalesced are blank. BUT CITY_coalesced had 10668091 nulls whereas STD_CITY only had 121887 nulls. So on 10/11/2022, switched from CITY_coalesced to STD_CITY
# ----> this led to 6,523,217 STD_CITY nulls/blanks in the All file created in the "Concatenate into All and further cleaning" step
# -- found that ServiceType has around 145 records which aren't FTTH or VDSL. Most are ERROR, a few are null or something else like TUARAN, SHAH ALAM, PASARAYA GIANT etc

## read in the data & relevant columns
# TM_Standard = spark.read.csv("s3://astro-datalake-prod-sandbox/amzar/BB/other_adhoc/uploaded/address_checking/TM_New_Standardised_20221013.csv.gz", header=True)
TM_Standard = spark.read.csv(tm_new_std_path, header=True) # to automate
TM_Standard = TM_Standard.select(['Combined_Building','HouseNo', 'Street_1','Street_2','AREA','STD_CITY','STATE10', 'POSTCODE12', 'ServiceType','Standard_Building_Name','match'])
## rename columns that got changed at some point (either due to how Spark handles duplicate column names or me coalescing addr components) to fit the script that Fakhrul has on Glue job
TM_Standard = TM_Standard.withColumnRenamed('STATE10', 'STATE').withColumnRenamed('POSTCODE12', 'POSTCODE') 
# .withColumnRenamed('CITY_coalesced', 'STD_CITY') --> originally used CITY_coalesced but apparently it had 10668091 nulls whereas STD_CITY only had 121887 nulls so changed to STD_CITY on 10/11/2022
# .withColumnRenamed('Street_coalesced', 'Street_1') --> originally used Street_coalesced but apparently it had 10668085 nulls whereas Street_1 only had 81969 nulls so changed to STD_CITY on 10/11/2022

print('this is TM new std :', TM_Standard.count()) # 11050724

## create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
TM_Standard = TM_Standard.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

## turn 'AA' in Street_1 into blanks and same for Street_2 cases where match == 'Match'
TM_Standard = TM_Standard.withColumn('Street_1', when(f.col('Street_1').cast('string').startswith('AA'), '').otherwise(f.col('Street_1')))
TM_Standard = TM_Standard.withColumn('Street_2', when(f.col('match').cast('string') == 'Match', '').otherwise(f.col('Street_2')))

# STRT_P1.head()

## uppercase
TM_Standard = TM_Standard.withColumn('Street_1', f.upper(f.col('Street_1'))).withColumn('Street_2', f.upper(f.col('Street_2')))

## PySpark equivalent of defined extract_street function
r1 = "JALAN|JLN|LORONG|LRG|CHANGKAT|LAMAN|LAHAT|LEBUH|LEBUHRAYA|LENGKOK|LINGKARAN|PERSIARAN"
TM_Standard = TM_Standard.withColumn('Street_Type_1', f.regexp_extract(f.col('Street_1'), r1, 0) )
TM_Standard = TM_Standard.withColumn('Street_Type_2', f.regexp_extract(f.col('Street_2'), r1, 0) )

# STRT_P1.head()

## extract the Street Name without any of the words in the Street Type List
street_type_list = ['JLN','JALAN','LRG', 'LORONG','CHANGKAT', 'LAMAN', 'LAHAT', 'LEBUH', 'LEBUHRAYA', 'LENGKOK ','LINGKARAN', 'PERSIARAN' ]
TM_Standard = TM_Standard.withColumn("Street_1_New", f.regexp_replace(f.col("Street_1"), '|'.join(street_type_list), '') )
TM_Standard = TM_Standard.withColumn("Street_2_New", f.regexp_replace(f.col("Street_2"), '|'.join(street_type_list), '') )

# print('this is TM new std :', TM_Standard.count()) # 11050724

## select relevant columns & create "Serviceable"
TM_Standard = TM_Standard.select(['Combined_Building','HouseNo','Street_Type_1', 'Street_1_New','Street_Type_2','Street_2_New','AREA','STD_CITY','STATE', 'POSTCODE', 'ServiceType','Standard_Building_Name'])
TM_Standard = TM_Standard.withColumn("Servicable", f.lit('TM'))

## --> Revision - fakhrul - zohreh - 1/7/22 - remove nan and error records
print('TM_Standard before removing nan & error records :', TM_Standard.select('HouseNo').count()) # 11050724

# order ServiceType values to keep 'FTTH' over 'VDSL' when de-duping (if it exists). To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy(['Combined_Building','HouseNo','Street_Type_1', 'Street_1_New','Street_Type_2','Street_2_New','AREA','STD_CITY','STATE','POSTCODE','Standard_Building_Name']).orderBy(f.col("ServiceType").asc())
TM_Standard = TM_Standard.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row')
# this is the pandas de-dupe code: TM_Standard = TM_Standard.sort_values(['ServiceType']).drop_duplicates( keep = 'first')

TM_Standard = TM_Standard.filter(f.col("ServiceType") != "")
TM_Standard = TM_Standard.filter(f.col("ServiceType").isNotNull())
TM_Standard = TM_Standard.filter(f.col("ServiceType") != "NAN")
TM_Standard = TM_Standard.filter(f.col("ServiceType") != "ERROR")
print('TM_Standard after removing nan & error records :', TM_Standard.select('HouseNo').count()) # 9641239

## convert Serviceable to "TM|FTTH" or "TM|VDSL"
TM_Standard = TM_Standard.withColumn('Serviceable', f.concat_ws("|", f.col("Servicable"), f.col("ServiceType")) )

# print(TM_Standard.info())
print(TM_Standard.columns)
## cast all columns as string
TM_Standard = TM_Standard.select([f.col(column).cast('string') for column in TM_Standard.columns])
print(TM_Standard.columns)

## save to path
TM_Standard.write.orc(UAMS_PySpark_save_path+'phase_2/{}/tm_temp_standard.orc'.format(date_key), mode='overwrite', compression='snappy')
# wr.s3.to_parquet(df = TM_Standard,compression ='snappy', schema_evolution = False, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/final_2_tm_standard.snappy.parquet')

del TM_Standard

TM_Standard after removing nan & error records : 9641239
['Combined_Building', 'HouseNo', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'ServiceType', 'Standard_Building_Name', 'Servicable', 'Serviceable']
['Combined_Building', 'HouseNo', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'ServiceType', 'Standard_Building_Name', 'Servicable', 'Serviceable']


In [4]:
### Allo

## read in the data & relevant columns
# Allo_Standard = spark.read.csv("s3://astro-datalake-prod-sandbox/amzar/BB/other_adhoc/uploaded/address_checking/Allo_New_Standardised_20220707.csv", header=True)
Allo_Standard = spark.read.csv(allo_new_std_path, header=True) # to automate
Allo_Standard = Allo_Standard.select(['Combined_Building','HouseNo', 'Street_1','Street_2','AREA','STD_CITY','STATE12', 'POSTCODE14', 'ServiceType','Standard_Building_Name','match'])
## rename columns that got changed at some point (either due to how Spark handles duplicate column names or me coalescing addr components) to fit the script that Fakhrul has on Glue job
Allo_Standard = Allo_Standard.withColumnRenamed('STATE12', 'STATE').withColumnRenamed('POSTCODE14', 'POSTCODE')

# print('this is Allo new std :', Allo_Standard.count()) # 371708

## create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
Allo_Standard = Allo_Standard.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

## turn 'AA' in Street_1 into blanks and same for Street_2 cases where match == 'Match'
Allo_Standard = Allo_Standard.withColumn('Street_1', when(f.col('Street_1').cast('string').startswith('AA'), '').otherwise(f.col('Street_1')))
Allo_Standard = Allo_Standard.withColumn('Street_2', when(f.col('match').cast('string') == 'Match', '').otherwise(f.col('Street_2')))

# STRT_P1.head()

## uppercase
Allo_Standard = Allo_Standard.withColumn('Street_1', f.upper(f.col('Street_1'))).withColumn('Street_2', f.upper(f.col('Street_2')))

## PySpark equivalent of defined extract_street function
r1 = "JALAN|JLN|LORONG|LRG|CHANGKAT|LAMAN|LAHAT|LEBUH|LEBUHRAYA|LENGKOK|LINGKARAN|PERSIARAN"
Allo_Standard = Allo_Standard.withColumn('Street_Type_1', f.regexp_extract(f.col('Street_1'), r1, 0) )
Allo_Standard = Allo_Standard.withColumn('Street_Type_2', f.regexp_extract(f.col('Street_2'), r1, 0) )

# STRT_P1.head()

## extract the Street Name without any of the words in the Street Type List
street_type_list = ['JLN','JALAN','LRG', 'LORONG','CHANGKAT', 'LAMAN', 'LAHAT', 'LEBUH', 'LEBUHRAYA', 'LENGKOK ','LINGKARAN', 'PERSIARAN' ]
Allo_Standard = Allo_Standard.withColumn("Street_1_New", f.regexp_replace(f.col("Street_1"), '|'.join(street_type_list), '') )
Allo_Standard = Allo_Standard.withColumn("Street_2_New", f.regexp_replace(f.col("Street_2"), '|'.join(street_type_list), '') )

print('this is Allo new std :', Allo_Standard.count()) # 371708

## select relevant columns & create "Serviceable"
Allo_Standard = Allo_Standard.select(['Combined_Building','HouseNo','Street_Type_1', 'Street_1_New','Street_Type_2','Street_2_New','AREA','STD_CITY','STATE', 'POSTCODE', 'ServiceType','Standard_Building_Name'])
Allo_Standard = Allo_Standard.withColumn("Servicable", f.lit('ALLO'))

## --> #revision - fakhrul - 1/7/22 - remove allo nan
print('Allo_Standard before removing nan & error records :', Allo_Standard.select('HouseNo').count()) # 371708

# order ServiceType values to keep 'FTTH' over 'VDSL' when de-duping (if it exists). To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy(['Combined_Building','HouseNo','Street_Type_1', 'Street_1_New','Street_Type_2','Street_2_New','AREA','STD_CITY','STATE','POSTCODE','Standard_Building_Name']).orderBy(f.col("ServiceType").asc())
Allo_Standard = Allo_Standard.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row')
# this is the pandas de-dupe code: Allo_Standard = Allo_Standard.sort_values(['ServiceType']).drop_duplicates( keep = 'first')
Allo_Standard = Allo_Standard.filter(f.col("ServiceType") != "")
Allo_Standard = Allo_Standard.filter(f.col("ServiceType").isNotNull())
Allo_Standard = Allo_Standard.filter(f.col("ServiceType") != "NAN")
Allo_Standard = Allo_Standard.filter(f.col("ServiceType") != "ERROR")
print('Allo_Standard after removing nan & error records :', Allo_Standard.select('HouseNo').count()) # 157457

## convert Serviceable to "Allo|FTTH" or "Allo|VDSL"
Allo_Standard = Allo_Standard.withColumn('Serviceable', f.concat_ws("|", f.col("Servicable"), f.col("ServiceType")) )

# print(Allo_Standard.info())
print(Allo_Standard.columns)
## cast all columns as string
Allo_Standard = Allo_Standard.select([f.col(column).cast('string') for column in Allo_Standard.columns])

## save to path
Allo_Standard.write.orc(UAMS_PySpark_save_path+'phase_2/{}/allo_temp_standard.orc'.format(date_key), mode='overwrite', compression='snappy')
# wr.s3.to_parquet(df = Allo_Standard,compression ='snappy', schema_evolution = False, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/final_2_allo_standard.snappy.parquet')

del Allo_Standard

this is Allo new std : 371708
Allo_Standard before removing nan & error records : 371708
Allo_Standard after removing nan & error records : 157457
['Combined_Building', 'HouseNo', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'ServiceType', 'Standard_Building_Name', 'Servicable', 'Serviceable']


In [5]:
### CTS

## read in the data & relevant columns
# CTS_Standard = spark.read.csv("s3://astro-datalake-prod-sandbox/amzar/BB/other_adhoc/uploaded/address_checking/CTS_New_Standardised_202209_Reformatted-SarahLocal.csv", header=True)
CTS_Standard = spark.read.csv(cts_new_std_path, header=True) # to automate
# ORI Glue job code had this argument: dtype = {'Combined_Building':object})
CTS_Standard = CTS_Standard.select(['Combined_Building','HouseNo', 'Street_1','Street_2','AREA','STD_CITY','STATE11', 'POSTCODE20', 'ServiceType','Standard_Building_Name','match'])
## rename columns that got changed at some point (either due to how Spark handles duplicate column names or me coalescing addr components) to fit the script that Fakhrul has on Glue job
CTS_Standard = CTS_Standard.withColumnRenamed('STATE11', 'STATE').withColumnRenamed('POSTCODE20', 'POSTCODE')

# print('this is CTS new std :', CTS_Standard.count()) # 79329

## create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
CTS_Standard = CTS_Standard.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

## turn 'AA' in Street_1 into blanks and same for Street_2 cases where match == 'Match'
CTS_Standard = CTS_Standard.withColumn('Street_1', when(f.col('Street_1').cast('string').startswith('AA'), '').otherwise(f.col('Street_1')))
CTS_Standard = CTS_Standard.withColumn('Street_2', when(f.col('match').cast('string') == 'Match', '').otherwise(f.col('Street_2')))

# STRT_P1.head()

## uppercase
CTS_Standard = CTS_Standard.withColumn('Street_1', f.upper(f.col('Street_1'))).withColumn('Street_2', f.upper(f.col('Street_2')))

## PySpark equivalent of defined extract_street function
r1 = "JALAN|JLN|LORONG|LRG|CHANGKAT|LAMAN|LAHAT|LEBUH|LEBUHRAYA|LENGKOK|LINGKARAN|PERSIARAN"
CTS_Standard = CTS_Standard.withColumn('Street_Type_1', f.regexp_extract(f.col('Street_1'), r1, 0) )
CTS_Standard = CTS_Standard.withColumn('Street_Type_2', f.regexp_extract(f.col('Street_2'), r1, 0) )

# STRT_P1.head()

## extract the Street Name without any of the words in the Street Type List
street_type_list = ['JLN','JALAN','LRG', 'LORONG','CHANGKAT', 'LAMAN', 'LAHAT', 'LEBUH', 'LEBUHRAYA', 'LENGKOK ','LINGKARAN', 'PERSIARAN' ]
CTS_Standard = CTS_Standard.withColumn("Street_1_New", f.regexp_replace(f.col("Street_1"), '|'.join(street_type_list), '') )
CTS_Standard = CTS_Standard.withColumn("Street_2_New", f.regexp_replace(f.col("Street_2"), '|'.join(street_type_list), '') )

print('this is CTS new std :', CTS_Standard.count()) # 79329

## select relevant columns & create "Serviceable"
CTS_Standard = CTS_Standard.select(['Combined_Building','HouseNo','Street_Type_1', 'Street_1_New','Street_Type_2','Street_2_New','AREA','STD_CITY','STATE', 'POSTCODE', 'ServiceType','Standard_Building_Name'])
CTS_Standard = CTS_Standard.withColumn("Servicable", f.lit('CTS'))

## convert Serviceable to "CTS|FTTH" or "CTS|VDSL"
CTS_Standard = CTS_Standard.withColumn('Serviceable', f.concat_ws("|", f.col("Servicable"), f.col("ServiceType")) )

# print(CTS_Standard.info())
print(CTS_Standard.columns)
## cast all columns as string
CTS_Standard = CTS_Standard.select([f.col(column).cast('string') for column in CTS_Standard.columns])

## save to path
CTS_Standard.write.orc(UAMS_PySpark_save_path+'phase_2/{}/cts_temp_standard.orc'.format(date_key), mode='overwrite', compression='snappy')
# wr.s3.to_parquet(df = CTS_Standard,compression ='snappy', schema_evolution = False, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/final_2_cts_standard.snappy.parquet')

del CTS_Standard

this is CTS new std : 79329
this is CTS new std : 79329
['Combined_Building', 'HouseNo', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'ServiceType', 'Standard_Building_Name', 'Servicable', 'Serviceable']


In [6]:
### Maxis NGBB

## read in the data & relevant columns
# Maxis_Standard = spark.read.csv("s3://astro-datalake-prod-sandbox/amzar/BB/other_adhoc/uploaded/address_checking/Maxis_New_Standardised_20220321.csv", header=True)
Maxis_Standard = spark.read.csv(maxis_new_std_path, header=True) # to automate
# ORI Glue job code had this argument: dtype = {'Combined_Building':object})
Maxis_Standard = Maxis_Standard.select(['Combined_Building','HouseNo', 'Street_1','Street_2','AREA','STD_CITY','STATE27', 'POSTCODE15', 'ServiceType','Standard_Building_Name','match'])
## rename columns that got changed at some point (either due to how Spark handles duplicate column names or me coalescing addr components) to fit the script that Fakhrul has on Glue job
Maxis_Standard = Maxis_Standard.withColumnRenamed('STATE27', 'STATE').withColumnRenamed('POSTCODE15', 'POSTCODE')

# print('this is Maxis new std :', Maxis_Standard.count()) # 130435

## create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
Maxis_Standard = Maxis_Standard.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

## turn 'AA' in Street_1 into blanks and same for Street_2 cases where match == 'Match'
Maxis_Standard = Maxis_Standard.withColumn('Street_1', when(f.col('Street_1').cast('string').startswith('AA'), '').otherwise(f.col('Street_1')))
Maxis_Standard = Maxis_Standard.withColumn('Street_2', when(f.col('match').cast('string') == 'Match', '').otherwise(f.col('Street_2')))

# STRT_P1.head()

## uppercase
Maxis_Standard = Maxis_Standard.withColumn('Street_1', f.upper(f.col('Street_1'))).withColumn('Street_2', f.upper(f.col('Street_2')))

## PySpark equivalent of defined extract_street function
r1 = "JALAN|JLN|LORONG|LRG|CHANGKAT|LAMAN|LAHAT|LEBUH|LEBUHRAYA|LENGKOK|LINGKARAN|PERSIARAN"
Maxis_Standard = Maxis_Standard.withColumn('Street_Type_1', f.regexp_extract(f.col('Street_1'), r1, 0) )
Maxis_Standard = Maxis_Standard.withColumn('Street_Type_2', f.regexp_extract(f.col('Street_2'), r1, 0) )

# STRT_P1.head()

## extract the Street Name without any of the words in the Street Type List
street_type_list = ['JLN','JALAN','LRG', 'LORONG','CHANGKAT', 'LAMAN', 'LAHAT', 'LEBUH', 'LEBUHRAYA', 'LENGKOK ','LINGKARAN', 'PERSIARAN' ]
Maxis_Standard = Maxis_Standard.withColumn("Street_1_New", f.regexp_replace(f.col("Street_1"), '|'.join(street_type_list), '') )
Maxis_Standard = Maxis_Standard.withColumn("Street_2_New", f.regexp_replace(f.col("Street_2"), '|'.join(street_type_list), '') )

print('this is Maxis new std :', Maxis_Standard.count()) # 130435

## select relevant columns & create "Serviceable"
Maxis_Standard = Maxis_Standard.select(['Combined_Building','HouseNo','Street_Type_1', 'Street_1_New','Street_Type_2','Street_2_New','AREA','STD_CITY','STATE', 'POSTCODE', 'ServiceType','Standard_Building_Name'])
Maxis_Standard = Maxis_Standard.withColumn("Servicable", f.lit('Maxis'))

## convert Serviceable to "Maxis|FTTH" or "Maxis|VDSL"
Maxis_Standard = Maxis_Standard.withColumn('Serviceable', f.concat_ws("|", f.col("Servicable"), f.col("ServiceType")) )

##just to see if serviceable is actually there for maxis
print('checking maxis serviceable: ', Maxis_Standard.select("Serviceable").head(5))

# print(Maxis_Standard.info())
print(Maxis_Standard.columns)
## cast all columns as string
Maxis_Standard = Maxis_Standard.select([f.col(column).cast('string') for column in Maxis_Standard.columns])

## save to path
Maxis_Standard.write.orc(UAMS_PySpark_save_path+'phase_2/{}/maxis_temp_standard.orc'.format(date_key), mode='overwrite', compression='snappy')
# wr.s3.to_parquet(df = Maxis_Standard,compression ='snappy', schema_evolution = False, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/final_2_maxis_standard.snappy.parquet')

del Maxis_Standard

this is Maxis new std : 130435
this is Maxis new std : 130435
checking maxis serviceable:  [Row(Serviceable='Maxis|FTTH'), Row(Serviceable='Maxis|FTTH'), Row(Serviceable='Maxis|FTTH'), Row(Serviceable='Maxis|FTTH'), Row(Serviceable='Maxis|FTTH')]
['Combined_Building', 'HouseNo', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'ServiceType', 'Standard_Building_Name', 'Servicable', 'Serviceable']


### Concatenate into All and further cleaning

In [7]:
## Read in the saved Standard files:
Astro_Standard = spark.read.orc(UAMS_PySpark_save_path+'phase_2/{}/astro_temp_standard.orc'.format(date_key))
TM_Standard = spark.read.orc(UAMS_PySpark_save_path+'phase_2/{}/tm_temp_standard.orc'.format(date_key)) 
Maxis_Standard = spark.read.orc(UAMS_PySpark_save_path+'phase_2/{}/maxis_temp_standard.orc'.format(date_key))
Allo_Standard = spark.read.orc(UAMS_PySpark_save_path+'phase_2/{}/allo_temp_standard.orc'.format(date_key))
CTS_Standard = spark.read.orc(UAMS_PySpark_save_path+'phase_2/{}/cts_temp_standard.orc'.format(date_key))

## create "F" column
Astro_Standard = Astro_Standard.withColumn('F', f.lit('A_MB'))
TM_Standard = TM_Standard.withColumn('F', f.lit('TM_MB'))
Maxis_Standard = Maxis_Standard.withColumn('F', f.lit('MAXIS_MB'))
Allo_Standard = Allo_Standard.withColumn('F', f.lit('ALLO_MB'))
CTS_Standard = CTS_Standard.withColumn('F', f.lit('CTS_MB'))

print('checking unique tm serviceable', TM_Standard.select('Serviceable').distinct().show())

## checking the info/columns
print('checking astro info: ', Astro_Standard.columns)
print('checking tm info: ', TM_Standard.columns)
print('checking maxis info: ', Maxis_Standard.columns)
print('checking allo info: ', Allo_Standard.columns)
print('checking cts info: ', CTS_Standard.columns)

## just to see if serviceable is actually there for maxis and tm before becoming frame
print('checking maxis serviceable: ', Maxis_Standard.select('Serviceable').head(5))
print('checking tm serviceable: ', TM_Standard.select('Serviceable').head(5))

## checking the MB here
print('checking tm mb here: ', TM_Standard.select('F').head(5))
print('checking astro mb here: ', Astro_Standard.select('F').head(5))
print('checking maxis mb here: ', Maxis_Standard.select('F').head(5))

## Union/concat the ISP DFs (have to do the below method as Astro & ISP Std Bases don't have exactly the same columns) --> priority is important ## KIV for TIME ISP
All_ISP = TM_Standard.union(Maxis_Standard).union(Allo_Standard).union(CTS_Standard) ## concat ISPs first as they have the same column names
# print(All_ISP.select('STD_CITY').filter(f.col('STD_CITY') == '').count()) # 0 blank STD_CITY

## generating columns that ISPs have which Astro Std Base does not have & vice versa
for column in [column for column in All_ISP.columns if column not in Astro_Standard.columns]:
    Astro_Standard = Astro_Standard.withColumn(column, f.lit(None))
for column in [column for column in Astro_Standard.columns if column not in All_ISP.columns]:
    All_ISP = All_ISP.withColumn(column, f.lit(None))

## rearranging columns to enable smooth Union-ing
Astro_Standard = Astro_Standard.select(['OBJID', 'ACCOUNT_NO', 'HouseNo', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'Standard_Building_Name', 'F', 'ServiceType', 'Servicable', 'Serviceable'])
All_ISP = All_ISP.select(['OBJID', 'ACCOUNT_NO', 'HouseNo', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'Standard_Building_Name', 'F', 'ServiceType', 'Servicable', 'Serviceable'])
All = Astro_Standard.union(All_ISP)
print('All count after UNION-ing all Std Bases:', All.select('HouseNo').count())

## validation to show that the above UNION step works ok
# count of blank/null STD_CITY --> 
print('no of rows with null std_city', All.filter(f.col('STD_CITY').isNull()).select('STATE').count()) # 88692
print('no of rows with blank std_city', All.filter(f.col('STD_CITY') == '').select('STATE').count()) # 0
# count of blank AREA --> print(All.filter(f.col('AREA') == '').count()) # 0
# count of blank STATE --> print(All.filter(f.col('STATE') == '').count()) # 0
# count of blank POSTCODE --> print(All.filter(f.col('POSTCODE') == '').count()) # 0
# count of blank Street_Type_1 --> print(All.filter(f.col('Street_Type_1') == '').count()) # 608635
# print(All_ISP.filter(f.col('Street_Type_1') == '').select('STATE').count()) # 40196
# print(Astro_Standard.filter(f.col('Street_Type_1') == '').select('STATE').count()) # 568439

# OLD CODE: All = Astro_Standard.unionByName(All_ISP) --> found this code leads to a lot of blanks in many columns. So switched to new method where I rearrange the columns using select() then use normal union()
# print(All.select('STD_CITY').filter(f.col('STD_CITY') == '').count()) # 6523217 blank STD_CITY --> means something is wrong with the unionByName function

# print('count of All after union/concat', All.select('F').count()) ## 11,937,295
# print('checking weird housenumbers here : ', All.filter(f.col('HouseNo') == '*').head(5))
# print('All serviceable unique here :', All.select('Serviceable').distinct().show())
# print('Checking All here after concat: ', All.select('Serviceable').head(5))

## create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
All = All.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )
print('Checking All df after reset index: ', All.select('Serviceable').count()) ## looks ok
print('checking all info here: ', All.columns)

# print(All.shape)

# print('checking serviceable values here for All before code below: ', All.select('Serviceable').head(10))

## cleaning any weird values like '\\|'
All = All.withColumn('Serviceable', when(f.col('Serviceable').cast('string') == '\\|', '').otherwise(f.col('Serviceable')))
print('checking serviceable values here for All: ', All.select('Serviceable').distinct().show())
print('no of rows with blank std_city', All.filter(f.col('STD_CITY') == '').select('STATE').count()) # 0

## clean POSTCODE & fill all nulls
All = All.withColumn('POSTCODE', f.regexp_replace(f.col('POSTCODE').cast('string'), '\.0', '') )
All = All.fillna('')
print('no of rows with blank std_city', All.filter(f.col('STD_CITY') == '').select('STATE').count()) # 88692

## Cleaning STATE values
All = All.withColumn('STATE', f.upper(f.trim(f.col('STATE').cast('string'))))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'FEDERAL TERRITORY OF KUALA LUMPUR','WIL'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WILAYAH PERSEKUTUAN KUALA LUMPUR','WIL'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WIL KL','WIL'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'KL','WIL'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'LKG','KEDAH'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'SELANGOR','SEL'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'SEL','SELANGOR'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'JOHOR','JOH'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'JOH','JOHOR'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'MELAKA','MEL'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'MEL','MELAKA'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PULAU PINANG','PNG'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PENANG','PNG'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PINANG','PNG'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PNG','PULAU PINANG'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PERAK','PRK'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PRK','PERAK'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PERLIS','PLS'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PLS','PERLIS'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'SABAH','SAB'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'SAB','SABAH'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'SARAWAK','SAR'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'SAR','SARAWAK'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'TERENGGANU','TRG'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'TRG','TERENGGANU'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PAHANG','PHG'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PHG','PAHANG'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'KEDAH','KED'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'KED','KEDAH'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'NEGERI SEMBILAN','NEG'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'NEG','NEGERI SEMBILAN'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'KELANTAN','KEL'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'KEL','KELANTAN'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WILAYAH PERSEKUTUAN PUTRAJAYA','PUTRAJAYA'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WIL PUTRAJAYA','PUTRAJAYA'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WILAYAH PERSEKUTUAN PUTRAJAYA','PUTRAJAYA'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WILAYAH PERSEKUTUAN LABUAN','LAB'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WILAYAH LABUAN','LAB'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WIL LABUAN','LAB'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'LABUAN','LAB'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'LAB','LABUAN'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WILAYAH PERSEKUTUAN','WIL'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'WIL','WILAYAH PERSEKUTUAN KUALA LUMPUR'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'LGK','KEDAH'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'SIN','SINGAPORE'))
All = All.withColumn('STATE', f.regexp_replace(f.col('STATE'), 'PJY','PUTRAJAYA'))

print('no of rows with blank std_city', All.filter(f.col('STD_CITY') == '').select('STATE').count()) # 88692

## save intermediate table: mainly coz the above step (cleaning STATE) took 2.5 mins & I'm worried the notebook might crash
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate1.orc".format(date_key) , mode='overwrite', compression='snappy')
print(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate1.orc".format(date_key))

print('no of rows with blank std_city', All.filter(f.col('STD_CITY') == '').select('STATE').count()) # 88692
print('All count after some filtering + cleaning STATE column', All.select('STATE').count()) # 


### ------------- more codes (mainly for checking numbers) --------------
## checking how many nulls are in some columns of different DFs
# print(Astro_Standard.filter(f.col('Street_Type_1').isNull()).select('STATE').count()) # 38946
# print(All_ISP.filter(f.col('Street_Type_1').isNull()).select('STATE').count()) # 6552534
# print(TM_Standard.filter(f.col('Street_Type_1').isNull()).select('STATE').count()) # 6545104
## looking at no of STATE values with 'AA'
# All.select('STATE').distinct().show(100) ## --> a lot of values that start with 'AA'. Also some non-STATE values e.g MERLIMAU, IPOH, NILAI
# print(All.filter(f.col("STATE").startswith('AA')).select('STATE').count()) # 21140
## using old unionByName function:
# count of blank AREA --> print(All.filter(f.col('AREA') == '').count()) # 566840
# count of blank STATE --> print(All.filter(f.col('STATE') == '').count()) # 1797568
# count of blank POSTCODE --> print(All.filter(f.col('POSTCODE') == '').count()) # 68948
# count of blank Street_Type_1 --> print(All.filter(f.col('Street_Type_1') == '').count()) # 7147449

## seeing table where STD_CITY is blank
# z.show(All.filter(f.col('STD_CITY') == '').head(100))

+--------------------+
|         Serviceable|
+--------------------+
|TM|1-G-8 JALAN PU...|
|        TM|SHAH ALAM|
| TM|BLOK KOMERSIAL A|
|TM|5425 JALAN LUB...|
|TM|LOT 411 JALAN ...|
|           TM|TUARAN|
|     TM|KUALA LUMPUR|
|    TM|KUBOR PANJANG|
|             TM|VDSL|
|             TM|FTTH|
+--------------------+

checking unique tm serviceable None
checking astro info:  ['OBJID', 'ACCOUNT_NO', 'HouseNo', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'Standard_Building_Name', 'F']
checking tm info:  ['Combined_Building', 'HouseNo', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'ServiceType', 'Standard_Building_Name', 'Servicable', 'Serviceable', 'F']
checking maxis info:  ['Combined_Building', 'HouseNo', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'ServiceType', 'Standard_Buil

In [10]:
del All # -- DELETE previous df if required
# read back in the intermediate table
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate1.orc".format(date_key))

## HouseNo
All = All.withColumn("HouseNo", f.regexp_replace("HouseNo", "#|,|'",'')) ## this seems to cover multiple cases & runs faster than having multiple lines for each symbol to regexp_replace

## Combined_Building
_list = ['#', ',', '/', '-', 'No Name', '\.', '\*', '=', ':','\)', '\(', '`', '_', '\^'] #revision - zohreh - 5/8/22 - uams complain about carrot
All = All.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '|'.join(_list), '')) ## this seems to cover multiple cases & runs faster than having multiple lines for each symbol to regexp_replace

## extra code kept here just in case we need it again
# All["Combined_Building"]= np.where( All["Combined_Building"]=='0', '', All["Combined_Building"])
#All = All.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '  ',''))
#All["Combined_Building"] = All["Combined_Building"].str.strip()
#All = All.withColumn("Combined_Building", f.regexp_replace("Combined_Building", '(^|\s)($|\s)','', case = False, regex = True)

## assigning SDU, MDU to Address_Type
All = All.withColumn('Address_Type', when( ((f.col('Combined_Building').isNull()) | (f.col('Combined_Building') == '')), 'SDU').otherwise('MDU') )

## Street_Type_1
_list = ['#', ',', '/', '-', 'No Name']
All = All.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", '|'.join(_list), ''))
All = All.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", 'JLN','JALAN'))
All = All.withColumn("Street_Type_1", f.regexp_replace("Street_Type_1", 'LRG','LORONG'))

## Street_Type_2
_list = ['#', ',', '/', '-', 'No Name']
All = All.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", '|'.join(_list), ''))
All = All.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", 'JLN','JALAN'))
All = All.withColumn("Street_Type_2", f.regexp_replace("Street_Type_2", 'LRG','LORONG'))

## Street_1_New
_list = ['#', ',', 'No Name']
All = All.withColumn("Street_1_New", f.regexp_replace("Street_1_New", '|'.join(_list), ''))

## Street_2_New
_list = ['#',',', 'No Name']
All = All.withColumn("Street_2_New", f.regexp_replace("Street_2_New",'|'.join(_list), ''))

## AREA
_list = ['#',',', '/','-', 'No Name']
All = All.withColumn("AREA", f.regexp_replace("AREA", '|'.join(_list), ''))

## STD_CITY -- there seemed to be a lot of nulls or blanks after this step -- to check before & after
print('Distinct count of STD_CITY before converting to blanks: ', All.select(f.countDistinct('STD_CITY')).show()) # 10914
# print(All.select('STD_CITY').filter(f.col('STD_CITY').isNull()).count()) # 0
print('No of rows where STD CITY is blank from All df', All.select('STD_CITY').filter(f.col('STD_CITY') == '').count()) # 6523217 --> lots of blanks to begin with... with this number, it's probably due to TM (6mil) --> FIXED
# print(All.select('STD_CITY').filter(f.col('STD_CITY') == '\[\]').count()) # 0

_list = ['#',',', '/','-', '=', ':','\)', '\(', 'No Name','\[','\]']
All = All.withColumn("STD_CITY", f.regexp_replace("STD_CITY", '|'.join(_list), ''))

print('Distinct count of STD_CITY AFTER converting to blanks: ', All.select(f.countDistinct('STD_CITY')).show()) # 10898 --> only reduced by 16 from before this cleaning step
# print(All.select('STD_CITY').filter(f.col('STD_CITY').isNull()).count()) # 0
print(All.select('STD_CITY').filter(f.col('STD_CITY') == '').count()) # 6523228
# print(All.select('STD_CITY').filter(f.col('STD_CITY') == '\[\]').count()) # 0

## STATE
_list = ['#',',', '/','-', 'No Name']
All = All.withColumn("STATE", f.regexp_replace("STATE", '|'.join(_list), ''))

## POSTCODE
_list = ['#',',', '/','-', 'No Name']
All = All.withColumn("POSTCODE", f.regexp_replace("POSTCODE", '|'.join(_list), ''))

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes):')
# print(usage)

print('All count at end of pipeline 2 backup 1:', All.select('STATE').count()) # 11937295

print(All.columns)
## cast all columns as string
All = All.select([f.col(column).cast('string') for column in All.columns])

## Save before next step of Pipeline 2
# Fakhrul: will probably split here for pipeline 2 backup
All.coalesce(1).write.csv(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate2.csv.gz".format(date_key), mode='overwrite', header=True, compression='gzip')
# wr.s3.to_csv(df = All, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/all_temp_final_2_backup_before_2.csv')

All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate2.orc".format(date_key), mode='overwrite', compression='snappy')
# wr.s3.to_parquet(df = All,compression ='snappy', schema_evolution = False, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/all_from_final_2_backup_target_1.snappy.parquet')


### --- further codes for checking outputs -----
## 7/11/2022: found that STD_CITY has a LOT of blanks (around 6 mil). To analyze it a bit:
# All.filter(f.col('STD_CITY') =='').select('Serviceable').distinct().show(100) ## seems like these many blank STD_CITY come from TM, Maxis & Allo file only
# All.filter(f.col('STD_CITY') =='').select('Servicable').distinct().show(100)
## see how many cases from each ISP
# print(All.filter(f.col('STD_CITY') =='').filter(f.col('Servicable') == 'TM').select('Serviceable').count()) # 6492440 --> NEED TO FIX this one especially!!
# print(All.filter(f.col('STD_CITY') =='').filter(f.col('Servicable') == 'ALLO').select('Serviceable').count()) # 62
# print(All.filter(f.col('STD_CITY') =='').filter(f.col('Servicable') == 'Maxis').select('Serviceable').count()) # 486

+------------------------+
|count(DISTINCT STD_CITY)|
+------------------------+
|                   11111|
+------------------------+

Distinct count of STD_CITY before converting to blanks:  None
135692
+------------------------+
|count(DISTINCT STD_CITY)|
+------------------------+
|                   11090|
+------------------------+

Distinct count of STD_CITY AFTER converting to blanks:  None
136742
14741778
['OBJID', 'ACCOUNT_NO', 'HouseNo', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'Standard_Building_Name', 'F', 'ServiceType', 'Servicable', 'Serviceable', 'index', 'Address_Type']


In [2]:
## ============================================================ THIS IS THE START OF PIPELINE 2 - BACKUP 2 ============================================================
# taken from Glue Job: address_standardization-prod-uams_generation_final_2_backup_2
### Preparing UAMS format for all standardised base - TM and ISPs

All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate2.orc".format(date_key))
# All = spark.read.csv(UAMS_PySpark_save_path+"all_temp_final_2_backup_before_2_{}.csv".format(date_key), header=True)
# All = wr.s3.read_csv(path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/all_temp_final_2_backup_before_2.csv')

## trim the columns
All = All.withColumn("HouseNo", f.trim(f.col("HouseNo")) )
All = All.withColumn("Combined_Building", f.trim(f.col("Combined_Building")) )
All = All.withColumn("Street_1_New", f.trim(f.col("Street_1_New")) )
All = All.withColumn("Street_2_New", f.trim(f.col("Street_2_New")) )
All = All.withColumn("Street_Type_1", f.trim(f.col("Street_Type_1")) )
All = All.withColumn("Street_Type_2", f.trim(f.col("Street_Type_2")) )
All = All.withColumn("POSTCODE", f.trim(f.col("POSTCODE")) )
All = All.withColumn("AREA", f.trim(f.col("AREA")) )
All = All.withColumn("STD_CITY", f.trim(f.col("STD_CITY")) )
All = All.withColumn("STATE", f.trim(f.col("STATE")) )

print('All count at end of Pipeline 2 Backup 2:', All.select('STATE').count()) # 11937295

All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate3.orc".format(date_key), mode='overwrite', compression='snappy')
# All.coalesce(1).write.csv(UAMS_PySpark_save_path+"all_temp_final_2_backup_before_2_2_{}.csv".format(date_key), mode='overwrite', header=True)
# wr.s3.to_csv(df = All, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/all_temp_final_2_backup_before_2_2.csv')

14741778


In [3]:
## ============================================================ THIS IS THE START OF PIPELINE 2 - FINAL ============================================================
# taken from Glue Job: address_standardization-prod-uams_generation_final_2
### Preparing UAMS format for all standardised base - TM and ISPs

del All # -- if required
# All = spark.read.csv(UAMS_PySpark_save_path+"all_temp_final_2_backup_before_2_2_{}.csv".format(date_key), header=True)
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate3.orc".format(date_key))

### Not sure what code below does but I think it's to replace anything in string printable (from string import printable) with ''
## Code in question:
# All["HouseNo"] = All["HouseNo"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# All["Combined_Building"] = All["Combined_Building"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# All["Street_1_New"] = All["Street_1_New"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# All["Street_Type_1"] = All["Street_Type_1"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# All["Street_2_New"] = All["Street_2_New"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# All["Street_Type_2"] = All["Street_Type_2"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))

# All["POSTCODE"] = All["POSTCODE"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# All["AREA"] = All["AREA"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# All["STD_CITY"] = All["STD_CITY"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
# All["STATE"] = All["STATE"].map(str).apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))

# wr.s3.to_csv(df = All, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/all_checking_after_encrypt_test.csv')
# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes):')
# print(usage)

## fill nulls
All = All.fillna('')

# print('All shape here check')
# print(All.shape)

# test = All [All['Area']=='AAEFF']
# test

print('All count before filtering out blanks : ', All.count()) # 11937295

## Filter out null values in POSTCODE, STD_CITY, Street_Type_1, Street_1_New, STATE
All = All.filter((f.col('POSTCODE').isNotNull()) & (f.col('POSTCODE') != '') & (f.col('POSTCODE') != '\[\]'))
print('All count AFTER filtering out blank postcode : ', All.select('POSTCODE').count()) #  11868347
All = All.filter((f.col('STD_CITY').isNotNull()) & (f.col('STD_CITY') != '') & (f.col('STD_CITY') != '\[\]'))
print('All count AFTER filtering out blank city : ', All.select('POSTCODE').count()) # 5372327
All = All.filter((f.col('Street_Type_1').isNotNull()) & (f.col('Street_Type_1') != '') & (f.col('Street_Type_1') != '\[\]'))
print('All count AFTER filtering out blank Street Type 1 : ', All.select('POSTCODE').count()) # 4738152
All = All.filter((f.col('Street_1_New').isNotNull()) & (f.col('Street_1_New') != '') & (f.col('Street_1_New') != '\[\]'))
print('All count AFTER filtering out blank Street 1 New : ', All.select('POSTCODE').count()) # 4710896
All = All.filter((f.col('STATE').isNotNull()) & (f.col('STATE') != '') & (f.col('STATE') != '\[\]'))

print('All count AFTER filtering out all blanks : ', All.count()) # 4615662 

## replace 'AA' AREA with blanks
All = All.withColumn('AREA', when(f.col('AREA').cast('string').startswith('AA'), '').otherwise(f.col('AREA')))

## filter out 'AA' STD_CITY, STATE and POSTCODE with alphabets
All = All.filter(~f.col('STD_CITY').startswith('AA')).filter(~f.col('STATE').startswith('AA')).filter(~f.col('POSTCODE').rlike('[a-zA-Z]'))

print('All count AFTER filtering out "AA" values : ', All.count()) # 4475248

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes): before ending 2nd')
# print(usage)

# print(All.shape)

## ensure POSTCODE is only 5 digits long
All = All.withColumn("POSTCODE", f.regexp_replace(f.col('POSTCODE').cast('string'), '\.0', '') )
All = All.withColumn("POSTCODE", f.substring(f.col('POSTCODE'), 1, 5) )
All = All.withColumn('POSTCODE', f.lpad(f.col('POSTCODE').cast('string'), 5, '0') )

# print(All.shape)

## check that POSTCODE is only len = 5
print(All.select(f.length(f.col('POSTCODE'))).distinct().show())

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes):')
# print(usage)

print(All.columns)
print('All count at end of Pipeline 2 Final: ', All.select('POSTCODE').count()) # 4475248

## save as csv
All.coalesce(1).write.csv(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate4.orc".format(date_key), mode='overwrite', header=True, compression='gzip')
# All.coalesce(1).write.csv(UAMS_PySpark_save_path+"phase_2/{}/All_temp_final_2_{}.csv".format(date_key), mode='overwrite', header=True)
# wr.s3.to_csv(df = All, path = all_temp_path + 'all_temp_final_2.csv')

## cast all columns as string
All = All.select([f.col(column).cast('string') for column in All.columns])

All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate4.orc".format(date_key), mode='overwrite', compression='snappy')
# All.write.orc(UAMS_PySpark_save_path+"all_temp_2_final_target_{}.orc".format(date_key), mode='overwrite', compression='snappy')
# wr.s3.to_parquet(df = All,compression ='snappy', schema_evolution = False, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/all_temp_2_final_target.snappy.parquet')

All count before filtering out blanks :  14741778
All count AFTER filtering out blank postcode :  14660249
All count AFTER filtering out blank city :  14547295
All count AFTER filtering out blank Street Type 1 :  13574972
All count AFTER filtering out blank Street 1 New :  13540186
All count AFTER filtering out all blanks :  10815879
All count AFTER filtering out "AA" values :  10574543
+----------------+
|length(POSTCODE)|
+----------------+
|               5|
+----------------+

None
['OBJID', 'ACCOUNT_NO', 'HouseNo', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'Standard_Building_Name', 'F', 'ServiceType', 'Servicable', 'Serviceable', 'index', 'Address_Type']


## Pipeline 3

In [4]:
## ============================================================ THIS IS THE START OF PIPELINE 3 - BACKUP 1 ============================================================
# taken from Glue Job: address_standardization-prod-uams_generation_final_3_backup
### Preparing UAMS format for all standardised base - Astro and ISPs (still doing this). Eventhough we're in Pipeline 3, we are still in PHASE 2 according to Zohreh's documentation) 

# all_final_2_temp_path = args['all_final_2_temp_path'] = s3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/all_temp_final_2.csv

#Revision - 29/6/22 - fakhrul - remove dtype for OBJID as object
# All = wr.s3.read_csv(path = all_final_2_temp_path, usecols = ['Address_Type','Standard_Building_Name',"OBJID","ACCOUNT_NO","HouseNo","Combined_Building","Street_2_New","Street_Type_2",'Street_Type_1','Street_1_New','POSTCODE','STD_CITY','AREA','STATE',"Combined_Building",'Serviceable','Servicable','ServiceType'], dtype = {'OBJID':object, 'ACCOUNT_NO':object}, engine = 'c')
#wr.s3.to_parquet(df = All,compression ='snappy', path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/all_temp_3_step1_source.snappy.parquet')

del All
# All = spark.read.csv(UAMS_PySpark_save_path+"all_temp_final_2_{}.csv".format(date_key), header=True)
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate4.orc".format(date_key))

#REVISION -21/6/22 - fakhrul testing this one out to see if it works
#All['HouseNo'] = All['HouseNo'].str.upper()
#All["HouseNo"] = All["HouseNo"].str.replace("JAN-","1/", case = False)
# ... can copy this code from other addr std notebooks

#revision - 20/6/22 - fakhrul - postcode might be affected so have to str pad again
All = All.withColumn("POSTCODE", f.regexp_replace(f.col('POSTCODE').cast('string'), '\.0', '') )
All = All.withColumn("POSTCODE", f.substring(f.col('POSTCODE'), 1, 5) )
All = All.withColumn('POSTCODE', f.lpad(f.col('POSTCODE').cast('string'), 5, '0') )
## check that POSTCODE is only len = 5
# print(All.select(f.length(f.col('POSTCODE'))).distinct().show())

## encode then decode each column into ascii
All = All.withColumn("HouseNo", f.decode(f.encode(f.col('HouseNo'), 'ascii'), 'ascii') )
All = All.withColumn("Combined_Building", f.decode(f.encode(f.col('Combined_Building'), 'ascii'), 'ascii') )
All = All.withColumn("Street_1_New", f.decode(f.encode(f.col('Street_1_New'), 'ascii'), 'ascii') )
All = All.withColumn("Street_2_New", f.decode(f.encode(f.col('Street_2_New'), 'ascii'), 'ascii') )
All = All.withColumn("Street_Type_2", f.decode(f.encode(f.col('Street_Type_2'), 'ascii'), 'ascii') )
All = All.withColumn("Street_Type_1", f.decode(f.encode(f.col('Street_Type_1'), 'ascii'), 'ascii') )
All = All.withColumn("POSTCODE", f.decode(f.encode(f.col('POSTCODE').cast('string'), 'ascii'), 'ascii') )
All = All.withColumn("STD_CITY", f.decode(f.encode(f.col('STD_CITY'), 'ascii'), 'ascii') )
All = All.withColumn("AREA", f.decode(f.encode(f.col('AREA'), 'ascii'), 'ascii') )
All = All.withColumn("STATE", f.decode(f.encode(f.col('STATE'), 'ascii'), 'ascii') )

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes):')
# print(usage)

## save as csv
All.coalesce(1).write.csv(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate5.csv.gz".format(date_key), mode='overwrite', header=True, compression='gzip')
# wr.s3.to_csv(df = All, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/all_temp_3_backup_before_3.csv')

All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate5.orc".format(date_key), mode='overwrite', compression='snappy')
# wr.s3.to_parquet(df = All,compression ='snappy', path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_parquet/all_temp_3_step1_target.snappy.parquet')




In [5]:
## ============================================================ THIS IS THE START OF PIPELINE 3 - BACKUP 2 ============================================================
# taken from Glue Job: address_standardization-prod-uams_generation_final_3_backup_2
### Preparing UAMS format for all standardised base - TM and ISPs (still doing this). Eventhough we're in pipeline 3, it's still part of PHASE 2 according to Zohreh's documentation

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate5.orc".format(date_key)) ## read in ORC version
# All = spark.read.csv(UAMS_PySpark_save_path+"all_temp_3_backup_before_3_{}.csv".format(date_key), header=True) ## read in CSV version
# All = wr.s3.read_csv(path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/all_temp_3_backup_before_3.csv')

#revision - fakhrul - 1/7/22 - add these codes so that its proper now 
All = All.withColumn("POSTCODE", f.regexp_replace(f.col('POSTCODE').cast('string'), '\.0', '') )
All = All.withColumn("POSTCODE", f.substring(f.col('POSTCODE'), 1, 5) )
All = All.withColumn('POSTCODE', f.lpad(f.col('POSTCODE').cast('string'), 5, '0') )

## change nan to blanks
All = All.withColumn('Combined_Building', f.regexp_replace(f.upper(f.trim(f.col('Combined_Building'))), 'NAN', ''))

## assigning SDU, MDU to Address_Type
All = All.withColumn('Address_Type', when( ((f.col('Combined_Building').isNull()) | (f.col('Combined_Building') == '')), 'SDU').otherwise('MDU') )

## create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
All = All.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

# print('checking unique postcode values: ', All.POSTCODE.value_counts())

## Cleaning up STD_CITY, POSTCODE and STATE MANUALLY (code originally from Maryam/Zohreh)
All = All.withColumn("STD_CITY", f.regexp_replace(f.col("STD_CITY"), 'KUALA LUMPUR','KL') )
All = All.withColumn("STD_CITY", when(f.col('STD_CITY') == 'KL', 'KUALA LUMPUR')
                                .when(f.col("STD_CITY") == 'WILAYAH PERSEKUTUANERSEKUTUAN',  'WILAYAH PERSEKUTUAN')
                                .when(f.col("STD_CITY") == 'MENGGATALKOTA KINABALU',  'MENGGATAL')
                                .when(f.col("STD_CITY") == 'TUARANKOTA KINABALU',  'TUARAN')
                                .when(f.col("STATE") == 'KUANTAN', 'KUANTAN')
                                .otherwise(f.col('STD_CITY')) )

All = All.withColumn("POSTCODE", when(f.col("POSTCODE") == '96000', 'SARAWAK').otherwise(f.col("POSTCODE")) )

## save intermediate table
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate6.orc".format(date_key), mode='overwrite', compression='snappy')




In [6]:
del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate6.orc".format(date_key)) ## read in ORC version

All = All.withColumn("STATE", when(f.col("STATE") == 'WILAYAH PERSEKUTUAN KUALA LUMPURANG', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'W\.P\.', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'PHI', 'PULAU PINANG')
                                .when(f.col("STATE") == 'SEREMBAN', 'NEGERI SEMBILAN')
                                .when(f.col("STATE") == 'KUANTAN', 'PAHANG')
                                .when(f.col("STATE") == 'NEGERI SEMBILANERISEMBILAN', 'NEGERI SEMBILAN')
                                .when(f.col("STATE") == 'SENAWANG', 'NEGERI SEMBILAN')
                                .when(f.col("STATE") == 'SUNGAI PETANI', 'KEDAH')
                                .when(f.col("STATE") == 'PORT DICKSON', 'NEGERI SEMBILAN')
                                .when(f.col("STATE") == 'PENNSYLVANIA', 'PULAU PINANG')
                                .when(f.col("STATE") == 'CHERAS', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .otherwise(f.col("STATE")) )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate7.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate7.orc".format(date_key)) ## read in ORC version

All = All.withColumn("STATE", when(f.col("STATE") == 'GEORGETOWN', 'PULAU PINANG')
                                .when(f.col("STATE") == 'WP KUALA LUMPUR', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'KUALA TERENGGANU', 'TERENGGANU')
                                .when(f.col("STATE") == 'IPOH', 'PERAK')
                                .when(f.col("STATE") == 'LABUAN FEDERAL TERRITORY', 'LABUAN')
                                .when(f.col("STATE") == 'PASIR PUTEH', 'KELANTAN')
                                .when(f.col("STATE") == 'ALOR SETAR', 'KEDAH')
                                .when(f.col("STATE") == 'BATU CAVES', 'SELANGOR')
                                .when(f.col("STATE") == 'PETALING JAYA', 'SELANGOR')
                                .when(f.col("STATE") == 'BANTING', 'SELANGOR')
                                .when(f.col("STATE") == 'PEKAN NANAS', 'JOHOR')
                                .when(f.col("STATE") == 'KUALA KANGSARAWAK', 'PERAK')
                                .when(f.col("STATE") == 'WP', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'BANGSARAWAK', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .otherwise(f.col("STATE")) )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate8.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate8.orc".format(date_key)) ## read in ORC version

All = All.withColumn("STATE", when(f.col("STATE") == 'SRI HARTAMAS', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'SENTUL', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'SEGAMBUT', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'KUALA LUMPUR', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'BATU CAVES', 'SELANGOR')
                                .when(f.col("STATE") ==  'WILAYAH PERSEKUTUAN KUALA LUMPUR WILAYAH PERSEKUTUAN KUALA LUMPURAYAHL 0 PERSEKUTUAN KUALA LUMPUR', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") ==  'SEPANGOR', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") ==  'SELANGORAGOR', 'SELANGOR')
                                .when(f.col("STATE") ==  '  SHAH ALAM', 'SELANGOR')
                                .when(f.col("STATE") == 'W\.P\.', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == '   ', '')
                                .when(f.col("STATE") == '[]', '')
                                .when(f.col("STATE") == 'MALACCA', 'MELAKA')
                                .when(f.col("STATE") == 'WILAYAH PERSEKUTUAN KUALA LUMPURANG', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .when(f.col("STATE") == 'W.P.', 'WILAYAH PERSEKUTUAN KUALA LUMPUR')
                                .otherwise(f.col("STATE")) )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate9.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate9.orc".format(date_key)) ## read in ORC version

## filter out Postcode_length != 5
print('before filtering out Post_length != 5', All.select('POSTCODE').count()) # 4475248
All = All.withColumn("Post_length", f.length(f.col('POSTCODE')) )
All = All.filter(f.col("Post_length") == 5 )
print('after filtering out Post_length != 5', All.select('POSTCODE').count()) # 4439727

## fill nulls again
All = All.fillna('')

## make all columns upper case
All = All.withColumn('HouseNo', f.upper(f.col('HouseNo')) )
All = All.withColumn('Combined_Building', f.upper(f.col('Combined_Building')) )
All = All.withColumn('Street_Type_1', f.upper(f.col('Street_Type_1')) )
All = All.withColumn('Street_1_New', f.upper(f.col('Street_1_New')) )
All = All.withColumn('Street_Type_2', f.upper(f.col('Street_Type_2')) )
All = All.withColumn('Street_2_New', f.upper(f.col('Street_2_New')) )
All = All.withColumn('AREA', f.upper(f.col('AREA')) )
All = All.withColumn('STD_CITY', f.upper(f.col('STD_CITY')) )
All = All.withColumn('STATE', f.upper(f.col('STATE')) )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate10.orc".format(date_key), mode='overwrite', compression='snappy')

before filtering out Post_length != 5 10574543
after filtering out Post_length != 5 10489664


In [7]:
del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate10.orc".format(date_key)) ## read in ORC version

## Fix HouseNo that are converted to date. Pyspark code taken from P2 MDU Mapping Test Qubole Zepp notebooks
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JAN-","01-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JAN","-01"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "FEB-","02-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-FEB",'-02'))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "MAR-",'03-'))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-MAR","-03"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "APR-","04-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-APR","-04"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "MAY-","05-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-MAY","-05"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JUN-","06-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JUN","-06"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JUL-","07-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JUL","-07"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "AUG-",'08-'))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-AUG","-08"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "SEP-","09-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-SEP","-09"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "OCT-","10-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-OCT","-10"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "NOV-","11-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-NOV","-11"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "DEC-","12-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-DEC","-12"))

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate11.orc".format(date_key), mode='overwrite', compression='snappy')




In [8]:
del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate11.orc".format(date_key)) ## read in ORC version
print('Total count of All before splitting to date_house & not_date_house:', All.select('HouseNo').count()) # 4439727

## Fix HouseNo that are converted to date (DD/MM/YYYY format). Pyspark code taken from P2 MDU Mapping Test Qubole Zepp notebooks
# Filter date HouseNo
date_house = All.filter(f.regexp_extract('HouseNo', '^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)\d{4}$', 0) != '' ) 
# Spliting the HouseNo
date_house = date_house.withColumn('block_date',  f.substring(date_house.HouseNo, 1, 2))
date_house = date_house.withColumn('floor',  f.substring(date_house.HouseNo, 4, 2))
date_house = date_house.withColumn('unit',  f.substring(date_house.HouseNo, 9, 2))
# Combine the split HouseNo with dashes: '-'
date_house = date_house.withColumn('HOUSE_NO_ASTRO', f.concat_ws('-', date_house.block_date, date_house.floor, date_house.unit))
# Remove additional column created to combine HouseNo
date_house = date_house.drop(*['block_date','floor','unit'])
print('date_house count:', date_house.select('HouseNo').count()) # 8644
    
# Filter not date HouseNo
not_date_house = All.filter( f.regexp_extract('HouseNo', '^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)\d{4}$', 0) == '' )
not_date_house = not_date_house.withColumn('HOUSE_NO_ASTRO', f.col('HouseNo'))
print('not_date_house count:', not_date_house.select('ACCOUNT_NO').count(), 'not_date_house unique acc_no:',  not_date_house.select(f.countDistinct('ACCOUNT_NO')).show()) #  rows,  unique acc_no
# print('not_date_house count:', not_date_house.select('HouseNo').count()) # 4431083

# Append the 2 dfs (date_house, not_date_house) --> originally this was in 'final_3' but I've moved it here to consolidate all the parts of this HouseNo date cleaning step in 1 phase
All = date_house.union(not_date_house)
print('Total count of All after re-appending date_house & not_date_house (end of Pipeline 2 Final):', All.select('HouseNo').count()) # 4439727
# create a sequential index as Zohreh did a pandas reset_index at this step again. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
All = All.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes) before concat frame below :')
# print(usage)

## save as ORC & csv
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate12.orc".format(date_key), mode='overwrite', compression='snappy')

All.coalesce(1).write.csv(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate12.csv.gz".format(date_key), mode='overwrite', header=True, compression='gzip')
# wr.s3.to_csv(df = date_house, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/date_house_3_backup_before_3_2.csv')
# wr.s3.to_csv(df = not_date_house, path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/not_date_house_3_backup_before_3_2.csv')

### ------ more codes to check outputs -------
### checking size of the All dataframe before saving. Refer: https://stackoverflow.com/questions/46228138/how-to-find-pyspark-dataframe-memory-usage
# Need to cache the table (and force the cache to happen)
# sample = All.sample(fraction=0.01)
# pdf = sample.toPandas() # convert to pd DF
# pdf.info() # 1% is 54 kB, so 100% is around 54 x 100 = 5400 kb = 5.4MB

Total count of All before splitting to date_house & not_date_house: 10489664
date_house count: 8644
+--------------------------+
|count(DISTINCT ACCOUNT_NO)|
+--------------------------+
|                   3521310|
+--------------------------+

not_date_house count: 10481020 not_date_house unique acc_no: None
Total count of All after re-appending date_house & not_date_house: 10489664


In [2]:
## ============================================================ THIS IS THE START OF PIPELINE 3 - FINAL ============================================================
# taken from Glue Job: address_standardization-prod-uams_generation_final_3
### Preparing UAMS format for all standardised base - TM and ISPs (still doing this). Eventhough this is pipeline 3, it's still part of Phase 2 according to Zohreh's documentation

# all_temp_path = args['all_temp_path'] = s3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/
                          
# date_house = wr.s3.read_csv(path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/date_house_3_backup_before_3_2.csv', usecols = ['OBJID','ACCOUNT_NO','HouseNo','AREA','STD_CITY','STATE','POSTCODE','Combined_Building','Street_Type_1','Street_1_New','Street_Type_2','Street_2_New','Standard_Building_Name','ServiceType','Servicable','Serviceable','Address_Type','Post_length','block','floor','unit','HOUSE_NO_ASTRO'], dtype = {'ACCOUNT_NO':object, 'OBJID':object})
# not_date_house = wr.s3.read_csv(path = 's3://astro-groupdata-prod-pipeline/address_standardization/uams_temp_final/not_date_house_3_backup_before_3_2.csv', usecols = ['OBJID','ACCOUNT_NO','HouseNo','AREA','STD_CITY','STATE','POSTCODE','Combined_Building','Street_Type_1','Street_1_New','Street_Type_2','Street_2_New','Standard_Building_Name','ServiceType','Servicable','Serviceable','Address_Type','Post_length','HOUSE_NO_ASTRO'], dtype = {'ACCOUNT_NO':object, 'OBJID':object})

All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate12.orc".format(date_key)) ## ORC version
# All = spark.read.orc(UAMS_PySpark_save_path+"all_date_house_3_backup_before_3_2_{}.orc".format(date_key)) ## ORC version
# All = spark.read.csv(UAMS_PySpark_save_path+"all_date_house_3_backup_before_3_2_{}.csv".format(date_key), header=True) ## CSV version

#revision - 20/6/22 - fakhrul - postcode might be affected so have to str pad again
All = All.withColumn("POSTCODE", f.regexp_replace(f.col('POSTCODE').cast('string'), '\.0', '') )
All = All.withColumn("POSTCODE", f.substring(f.col('POSTCODE'), 1, 5) )
All = All.withColumn('POSTCODE', f.lpad(f.col('POSTCODE').cast('string'), 5, '0') )

## Fix Street_1_New that got converted to date format then pad with spaces
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "JAN-","1/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-JAN","/1"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), 'FEB-','2/'))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), '-FEB','/2'))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "MAR-",'3/'))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-MAR","/3"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "APR-","4/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-APR","/4"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "MAY-","5/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-MAY","/5"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "JUN-","6/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-JUN","/6"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "JUL-","7/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-JUL","/7"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "AUG-",'8/'))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-AUG","/8"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "SEP-","9/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-SEP","/9"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "OCT-","10/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-OCT","/10"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "NOV-","11/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-NOV","/11"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "DEC-","12/"))
All = All.withColumn("Street_1_New", f.regexp_replace(f.col('Street_1_New'), "-DEC","/12"))

All = All.withColumn('Street_1_New', f.lpad(f.col('Street_1_New').cast('string'), 10, ' ') )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate13.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate13.orc".format(date_key)) ## read in ORC version

## Fix Street_2_New that got converted to date format then pad with spaces
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "JAN-","1/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-JAN","/1"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), 'FEB-','2/'))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), '-FEB','/2'))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "MAR-",'3/'))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-MAR","/3"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "APR-","4/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-APR","/4"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "MAY-","5/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-MAY","/5"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "JUN-","6/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-JUN","/6"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "JUL-","7/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-JUL","/7"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "AUG-",'8/'))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-AUG","/8"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "SEP-","9/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-SEP","/9"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "OCT-","10/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-OCT","/10"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "NOV-","11/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-NOV","/11"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "DEC-","12/"))
All = All.withColumn("Street_2_New", f.regexp_replace(f.col('Street_2_New'), "-DEC","/12"))

All = All.withColumn('Street_2_New', f.lpad(f.col('Street_2_New').cast('string'), 10, ' ') )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate14.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-intermediate14.orc".format(date_key)) ## read in ORC version

All = All.withColumn('House_No', f.lpad(f.col('HOUSE_NO_ASTRO').cast('string'), 10, ' ') )
All = All.drop(*['HOUSE_NO_ASTRO', 'HouseNo'])

## these steps seems to be duplicate steps as above...
# All = All.fillna('')
# All['Combined_Building'] = All['Combined_Building'].str.upper()
# All['Street_Type_1'] = All['Street_Type_1'].str.upper()
# All['Street_Type_2'] = All['Street_Type_2'].str.upper()
# All['Street_1_New'] = All['Street_1_New'].str.upper()
# All['Street_2_New'] = All['Street_2_New'].str.upper()
# All['AREA'] = All['AREA'].str.upper()
# All['STD_CITY'] = All['STD_CITY'].str.upper()
# All['ASTRO_STATE'] = All['ASTRO_STATE'].str.upper()

# All[All['POSTCODE'].str.contains("AA")]

All = All.withColumn("Key", f.concat_ws(" ,", "House_No", "Combined_Building", "Street_Type_1", "Street_1_New", "Street_Type_2", "Street_2_New", "STD_CITY", "AREA", "POSTCODE", "STATE") )
All = All.withColumn("Key", f.regexp_replace(f.upper(f.col("Key")), " ", "") )

print('checking on keys: ', All.select('Key').head(10))

## check that POSTCODE is only len = 5
print('Checking postcode length of All here :', All.select(f.length(f.col('POSTCODE'))).distinct().show())

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes): after all processing')
# print(usage)

# create a sequential index as Zohreh did a pandas reset_index at this step again. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
All = All.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

print('checking all account no if we have one :', All.select('ACCOUNT_NO').head(10))

## de-dupe on Key & Serviceable, & keep first based on index. To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
print('All count before de-dupe on Key & Serviceable', All.count()) # 10493617
window = Window.partitionBy(['Key','Serviceable']).orderBy(f.col("index").asc())
All = All.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row')
print('All count AFTER de-dupe on Key & Serviceable', All.count()) # 8586359

# print('all shape')
# print(All.shape)

#revision - fakhrul - 4/7/22 - changing serviceable and key to str to avoid float error
All = All.withColumn('Key', f.col('Key').cast('string')).withColumn('Serviceable', f.col('Serviceable').cast('string'))

## groupby to create All_1 --> Combine all Serviceable values for each Key
All_1 = All.groupBy('Key').agg(f.collect_set('Serviceable').alias('Serviceable_New'))
All_1 = All_1.withColumn("Serviceable_New", f.concat_ws(',', f.col('Serviceable_New')) )
# create a sequential index as Zohreh did a pandas reset_index at this step again. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
All_1 = All_1.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )
print('All size at end of Pipeline 3', All.count()) # 8586359
print('All_1 size at end of Pipeline 3:', All_1.select('Serviceable_New').count()) # 7705263
All_1.select('Serviceable_New').distinct().show() ## this code looks correct

# print('Checking all 1 head : ', All_1.head())
# print('Checking all info :', All.info())
# print('Checking all head :', All.head())
# print('Checking all shape:', All.shape)
# print('all 1 info')
# print(All_1.info())

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes):')
# print(usage)

## save to ORC & CSV
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All-final.orc".format(date_key), mode='overwrite', compression='snappy')
All_1.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_1-final.orc".format(date_key), mode='overwrite', compression='snappy')

All.coalesce(1).write.csv(UAMS_PySpark_save_path+"phase_2/{}/All-final.csv.gz".format(date_key), mode='overwrite', header=True, compression='gzip')
All_1.coalesce(1).write.csv(UAMS_PySpark_save_path+"phase_2/{}/All_1-final.csv.gz".format(date_key), mode='overwrite', header=True, compression='gzip')

# All.write.orc(UAMS_PySpark_save_path+"all_temp_final_3_{}.orc".format(date_key), mode='overwrite', compression='snappy')
# All_1.write.orc(UAMS_PySpark_save_path+"all_1_temp_final_3_{}.orc".format(date_key), mode='overwrite', compression='snappy')
# All.coalesce(1).write.csv(UAMS_PySpark_save_path+"all_temp_final_3_{}.csv".format(date_key), mode='overwrite', header=True)
# All_1.coalesce(1).write.csv(UAMS_PySpark_save_path+"all_1_temp_final_3_{}.csv".format(date_key), mode='overwrite', header=True)

# z.show(All_1.head(100))

checking on keys:  [Row(Key='02-03-11,PANGSAPURISUCI,JALAN,PUCHONG,,,KUALALUMPUR,TAMANSRIJATI,58200,WILAYAHPERSEKUTUANKUALALUMPUR'), Row(Key='02-08-03,PANGSAPURISUCI,JALAN,PUCHONG,,,KUALALUMPUR,TAMANSRIJATI,58200,WILAYAHPERSEKUTUANKUALALUMPUR'), Row(Key='02-10-15,PANGSAPURISUCI,JALAN,PUCHONG,,,KUALALUMPUR,TAMANSRIJATI,58200,WILAYAHPERSEKUTUANKUALALUMPUR'), Row(Key='21-01-02,KELUMPUKCAMAR,JALAN,AU2/8,,,KUALALUMPUR,TAMANSEPAKAT,54200,WILAYAHPERSEKUTUANKUALALUMPUR'), Row(Key='02-01-12,VISTAANGKASAAPARTMENT,JALAN,KERINCHI,,,KUALALUMPUR,PANTAIDALAM,59200,WILAYAHPERSEKUTUANKUALALUMPUR'), Row(Key='10-02-05,VISTAANGKASAAPARTMENT,JALAN,KERINCHI,,,KUALALUMPUR,PANTAIDALAM,59200,WILAYAHPERSEKUTUANKUALALUMPUR'), Row(Key='08-03-05,MENARAMERAKKAYANGAN,JALAN,6/56,,,KUALALUMPUR,TAMANKERAMAT,54200,WILAYAHPERSEKUTUANKUALALUMPUR'), Row(Key='26-03-02,FLATSRIMELAKA,JALAN,SIAKAP,,,KUALALUMPUR,TAMANIKHSAN,?5600,WILAYAHPERSEKUTUANKUALALUMPUR'), Row(Key='08-04-02,FLATSRIMELAKA,JALAN,SIAKAP,,,KUALALUMPUR,TAMANIK

Extra codes that Fakhrul commented out (before Nov 22) from the PIPELINE 3 - FINAL Job (open this cell to view)
<!-- ##Astro_Standard['F']= 'A_MB'
##TM_Standard['F']='TM_MB'
##Maxis_Standard['F']='MAXIS_MB'
##Allo_Standard['F']='ALLO_MB'
##CTS_Standard['F']='CTS_MB'


## ------------------------- Note - fakhrul 15/6/22 - these codes below does not seem efficient at all. if it dies i will drop the columns from the get go -------------------------
#Edit - ok i removed it from the usecols in the beginning - 15/6/22
#A_MB = All[(All['F']=='A_MB')]

#wr.s3.to_csv(df = A_MB, path = 's3://astro-groupdata-prod-target/address_standardization/a_mb.csv')

#A_MB = A_MB.drop(['F'], axis = 1)
#print('a_mb is')
#print(A_MB.shape)
#A_MB.reset_index(inplace=True, drop=True)
#A_MB = A_MB.drop_duplicates(subset=['Key'], keep='first')
#print(A_MB.shape)
#
#
#TM_MB = All[(All['F']=='TM_MB')]

#wr.s3.to_csv(df = TM_MB, path = 's3://astro-groupdata-prod-target/address_standardization/tm_mb.csv')

#TM_MB = TM_MB.drop(['F'], axis = 1)
#print('tm_mb is')
#print(TM_MB.shape)
#TM_MB.reset_index(inplace=True, drop=True)
#TM_MB = TM_MB.drop_duplicates(subset=['Key'], keep='first')
#print(TM_MB.shape)
#
#MAXIS_MB = All[(All['F']=='MAXIS_MB')]
#MAXIS_MB = MAXIS_MB.drop(['F'], axis = 1)
#print('maxis_mb is')
#print(MAXIS_MB.shape)
#MAXIS_MB.reset_index(inplace=True, drop=True)
#MAXIS_MB = MAXIS_MB.drop_duplicates(subset=['Key'], keep='first')
#print(MAXIS_MB.shape)
#
#ALLO_MB = All[(All['F']=='ALLO_MB')]
#ALLO_MB =ALLO_MB.drop(['F'], axis = 1)
#print('allo_mb is')
#print(ALLO_MB.shape)
#ALLO_MB.reset_index(inplace=True, drop=True)
#ALLO_MB = ALLO_MB.drop_duplicates(subset=['Key'], keep='first')
#print(ALLO_MB.shape)
## 
#CTS_MB = All[(All['F']=='CTS_MB')]
#CTS_MB = CTS_MB.drop(['F'], axis = 1)
#print('cts_mb is')
#print(CTS_MB.shape)
#CTS_MB.reset_index(inplace=True, drop=True)
#CTS_MB = CTS_MB.drop_duplicates(subset=['Key'], keep='first')
#print(CTS_MB.shape)
#
#
#Frame = [TM_MB, MAXIS_MB, A_MB, ALLO_MB, CTS_MB]
#
#All = pd.concat(Frame) -->

## Pipeline 4
- Put multiple ISP in one line
- Original Zepp Qubole Notebook: Converting Step 4.0_UAMS Generation to PySpark_Part3 (https://us.qub)ole.com/notebooks#recent?id=141821&type=my-notebooks&view=home

In [4]:
## =========================================================== THIS IS THE START OF Pipeline 4 - BACKUP 1 ============================================================
# taken from Glue Job: address_standardization-prod-uams_generation_final_4_backup
# according to Fakhrul, the order for pipeline 4 is final_4_backup -> final_4
 
## read in the files
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All-final.orc".format(date_key))
All_1 = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_1-final.orc".format(date_key))
# All = spark.read.csv(UAMS_PySpark_save_path+"all_temp_final_3_{}.csv".format(date_key), header=True)
# All_1 = spark.read.csv(UAMS_PySpark_save_path+"all_1_temp_final_3_{}.csv".format(date_key), header=True)

#revision - 20/6/22 - fakhrul - postcode might have problems reading as it would read 08000 as 8000 instead here
All = All.withColumn("POSTCODE", f.regexp_replace(f.col('POSTCODE').cast('string'), '\.0', '') )
All = All.withColumn("POSTCODE", f.substring(f.col('POSTCODE'), 1, 5) )
All = All.withColumn('POSTCODE', f.lpad(f.col('POSTCODE').cast('string'), 5, '0') )
## check that POSTCODE is only len = 5
print(All.select(f.length(f.col('POSTCODE'))).distinct().show())

# ### Put multiple ISP in one line

#All = All.drop_duplicates(subset=['Key','Serviceable'], keep='first')

#this new line below is an added code to prevent float error
#All['Key'] = All['Key'].astype('object')
#All['Serviceable'] = All['Serviceable'].astype('str')

#All_1 = All.groupby(['Key'])['Serviceable'].agg(','.join).reset_index()
#All_1 = All_1.rename({'Serviceable':'Serviceable_New' }, axis=1)


All_1 = All_1.withColumn("Serviceable_New", f.regexp_replace(f.col("Serviceable_New"), ',,' , ',') )

## 10/11/2022: Amzar added to avoid duplicate 'index' column when joining later on
All = All.withColumnRenamed('index', 'index_nongrouped')

## save intermediate files before the big JOIN below:
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate1.orc".format(date_key), mode='overwrite', compression='snappy')
All_1.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_1_pipeline4-intermediate1.orc".format(date_key), mode='overwrite', compression='snappy')

# ----------------------------------------------------------------------------------------------------------------------------------------

## delete files to clear up memory
del All
del All_1

## read intermediate file back in
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate1.orc".format(date_key))
All_1 = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_1_pipeline4-intermediate1.orc".format(date_key))

## join All_1 to All on "Key"
All_Final_Merg =  All_1.join(All, on ='Key', how = 'left')
# print('chceking all final merge :', All_Final_Merg.info())
print('checking all final merge :', All_Final_Merg.columns)
print('checking all final merge count:', All_Final_Merg.select('ACCOUNT_NO').count()) # 8586359
print('checking all final merge account no:', All_Final_Merg.select('ACCOUNT_NO').head(10))

## 11/11/2022 Amzar: save intermediate file
All_Final_Merg.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate2.orc".format(date_key), mode='overwrite', compression='snappy')

# ----------------------------------------------------------------------------------------------------------------------------------------

## delete files to clear up memory
del All
del All_1
del All_Final_Merg


+----------------+
|length(POSTCODE)|
+----------------+
|               5|
+----------------+

None
checking all final merge : ['Key', 'Serviceable_New', 'index', 'OBJID', 'ACCOUNT_NO', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'Standard_Building_Name', 'F', 'ServiceType', 'Servicable', 'Serviceable', 'index_nongrouped', 'Address_Type', 'Post_length', 'House_No']
checking all final merge count: 8582404
checking all final merge account no: [Row(ACCOUNT_NO='92244949'), Row(ACCOUNT_NO='91658904'), Row(ACCOUNT_NO='91368965'), Row(ACCOUNT_NO='91321254'), Row(ACCOUNT_NO='91066518'), Row(ACCOUNT_NO='91356100'), Row(ACCOUNT_NO='98478545'), Row(ACCOUNT_NO='97768024'), Row(ACCOUNT_NO='96075820'), Row(ACCOUNT_NO='95707097')]


In [5]:
## read intermediate file back in
All_Final_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate2.orc".format(date_key))

## de-dupe on Key & Serviceable, & keep first based on index. To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy(['Key','Serviceable']).orderBy(f.col("index").asc())
All_Final_Merg = All_Final_Merg.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row')
# print('All_Final_Merg count AFTER de-dupe on Key & Serviceable:', All_Final_Merg.select('Key').count()) # 8586359

All = All_Final_Merg

## de-dupe on Key, & keep first based on index. To do in Spark: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy(['Key']).orderBy(f.col("index").asc())
All = All.withColumn('row', f.row_number().over(window)).filter(col('row') == 1).drop('row')
# print('All count AFTER de-dupe on Key:', All.select('Key').count()) # 7705263

## 12/11/2022 Amzar: save intermediate file
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate3.orc".format(date_key), mode='overwrite', compression='snappy')

# ----------------------------------------------------------------------------------------------------------------------------------------

# ## delete files to clear up memory
del All

## read intermediate file back in
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate3.orc".format(date_key)) 
## moved the reading of UAMS_P1P2_Merg from start of file to down here to save on memory
UAMS_P1P2_Merg = spark.read.orc(UAMS_PySpark_save_path+"phase_1/{}/UAMS_P1P2_Merg-final.orc".format(date_key))
# UAMS_P1P2_Merg = spark.read.csv(UAMS_PySpark_save_path+"uams_p1p2_merg_temp-20221031.csv.gz", header=True)

## work on UAMS_P1P2_Merg now, i.e remove any "Key" existing in UAMS_P1P2_Merg from All (use left anti-join)
# print(UAMS_P1P2_Merg.select('Key').count()) # 3328852
# print('count of unique p1p2 Keys from UAMS_P1P2_Merg:', UAMS_P1P2_Merg.select(f.countDistinct('Key')).show()) # 2427513

P1_P2_removed = All.join(UAMS_P1P2_Merg, All.Key == UAMS_P1P2_Merg.Key, 'leftanti')
print('P1_P2_removed count (not in p1p2_list):', P1_P2_removed.select('Key').count()) # 5621584

## OLD --> work on UAMS_P1P2_Merg now # 12/11/2022 --> Amzar commented below codes out as it was inefficient (filter based on list). Created new 'join' codes to speed up this filtering process
# P1_P2_removed = pd.merge(All,UAMS_P1P2_Merg, on = ['Key'], indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1).reset_index(drop=True)
# p1p2_list = UAMS_P1P2_Merg.agg(f.collect_list(f.col("Key"))).collect()[0][0] 
# P1_P2_removed = All.filter(~f.col('Key').isin(p1p2_list))

## filter out nulls
P1_P2_removed = P1_P2_removed.filter(f.col('POSTCODE').isNotNull())
P1_P2_removed = P1_P2_removed.filter(f.col('STD_CITY').isNotNull())
P1_P2_removed = P1_P2_removed.filter(f.col('Street_Type_1').isNotNull())
P1_P2_removed = P1_P2_removed.filter(f.col('Street_1_New').isNotNull())
print('P1_P2_removed count after removing nulls:', P1_P2_removed.select('Key').count()) # 5621584

## get the current MAX value of Address_ID then reset_index & make Address_ID equal to the new index + j
# i = Naresh_P1_P2_Base_Final_Merg['Address_ID'].max()
i = UAMS_P1P2_Merg.select(f.max(f.col('Address_ID').cast('float'))).first()[0]
print(i) # 2427515.0
j = int(i)+2
# create a sequential index as Zohreh did a pandas reset_index at this step. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
P1_P2_removed = P1_P2_removed.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )
P1_P2_removed = P1_P2_removed.withColumn('Address_ID', f.col('index') + j).drop(*['index'])
print(P1_P2_removed.columns)
print(UAMS_P1P2_Merg.columns)

## drop some columns in P1_P2_removed (Amzar added on 12/11/2022) & UAMS_P1P2_Merg, then rename some other columns
P1_P2_removed = P1_P2_removed.drop(*['Serviceable', 'Servicable', 'ServiceType']).withColumnRenamed('Serviceable_New', 'Serviceable')
UAMS_P1P2_Merg = UAMS_P1P2_Merg.drop(*['Serviceable', 'Servicable', 'ServiceType']).withColumnRenamed('Serviceable_New', 'Serviceable')
# print('UAMS_P1P2_Merg count:', UAMS_P1P2_Merg.count(), 'P1_P2_removed count:', P1_P2_removed.count()) # UAMS_P1P2_Merg count: 3328852 P1_P2_removed count: 5621584

## save intermediate files before the UNION below:
P1_P2_removed.write.orc(UAMS_PySpark_save_path+"phase_2/{}/P1P2_removed_pipeline4.orc".format(date_key), mode='overwrite', compression='snappy')
UAMS_P1P2_Merg.write.orc(UAMS_PySpark_save_path+"phase_2/{}/UAMS_P1P2_Merg_pipeline4.orc".format(date_key), mode='overwrite', compression='snappy')

# ----------------------------------------------------------------------------------------------------------------------------------------

## delete files to clear up memory
del P1_P2_removed
del UAMS_P1P2_Merg


P1_P2_removed count (not in p1p2_list): 5590572
P1_P2_removed count after removing nulls: 5590572
2427515.0
['Key', 'Serviceable_New', 'OBJID', 'ACCOUNT_NO', 'AREA', 'STD_CITY', 'STATE', 'POSTCODE', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'Standard_Building_Name', 'F', 'ServiceType', 'Servicable', 'Serviceable', 'index_nongrouped', 'Address_Type', 'Post_length', 'House_No', 'Address_ID']
['Account_No', 'Serviceable_New', '_c0', 'OBJID', 'House_No', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'POSTCODE', 'STATE', 'Standard_Building_Name', 'ServiceType', 'Servicable', 'HNUM_STRT_TM', 'Address_Type', 'P_Flag', 'index', 'Serviceable', 'Key', 'Address_ID']


In [8]:
## read ORC files back in
P1_P2_removed = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/P1P2_removed_pipeline4.orc".format(date_key))
UAMS_P1P2_Merg= spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/UAMS_P1P2_Merg_pipeline4.orc".format(date_key))

## rename columns, create blank columns & rearrange column order for smooth UNION-ing
P1_P2_removed = P1_P2_removed.withColumnRenamed('ACCOUNT_NO', 'Account_No').withColumnRenamed('index_nongrouped', 'index')
P1_P2_removed = P1_P2_removed.withColumn('HNUM_STRT_TM', f.lit('')).withColumn('P_Flag', f.lit(None))
P1_P2_removed = P1_P2_removed.select(['Account_No', 'Serviceable', 'OBJID', 'House_No', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'POSTCODE', 'STATE', 'Standard_Building_Name', 'HNUM_STRT_TM', 'Address_Type', 'P_Flag', 'index', 'Key', 'Address_ID'])
UAMS_P1P2_Merg = UAMS_P1P2_Merg.select(['Account_No', 'Serviceable', 'OBJID', 'House_No', 'Combined_Building', 'Street_Type_1', 'Street_1_New', 'Street_Type_2', 'Street_2_New', 'AREA', 'STD_CITY', 'POSTCODE', 'STATE', 'Standard_Building_Name', 'HNUM_STRT_TM', 'Address_Type', 'P_Flag', 'index', 'Key', 'Address_ID'])

## Union the 2 DFs then reset index
All = UAMS_P1P2_Merg.union(P1_P2_removed)
print('All (after unioning UAMS_P1P2_Merge and P1_P2_removed) count', All.select("Key").count()) # 8950436
All = All.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

## Change column name to UAMS format name
All = All.withColumnRenamed('Combined_Building', 'Building_Name').withColumnRenamed('Street_Type_1', 'Street_Type').withColumnRenamed('Street_1_New', 'Street_Name').withColumnRenamed('AREA', 'Area').withColumnRenamed('STD_CITY', 'City').withColumnRenamed('STATE', 'State').withColumnRenamed('POSTCODE', 'Postcode')
print("All DF columns:", All.columns)

## save intermediate file
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate4.orc".format(date_key), mode='overwrite', compression='snappy')
# OLD save: All.coalesce(1).write.csv(UAMS_PySpark_save_path+"all_temp_4_backup3_before_4_{}.csv.gz".format(date_key), header=True, mode='overwrite', compression='gzip')

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Bytes):')
# print(usage)

## delete files to clear up memory
del P1_P2_removed
del UAMS_P1P2_Merg
del All

All (after unioning UAMS_P1P2_Merge and P1_P2_removed) count 8919424
All DF columns: ['Account_No', 'Serviceable', 'OBJID', 'House_No', 'Building_Name', 'Street_Type', 'Street_Name', 'Street_Type_2', 'Street_2_New', 'Area', 'City', 'Postcode', 'State', 'Standard_Building_Name', 'HNUM_STRT_TM', 'Address_Type', 'P_Flag', 'index', 'Key', 'Address_ID']


In [9]:
## ============================================================ THIS IS THE START OF Pipeline 4 - FINAL ============================================================
# taken from Glue Job: address_standardization-prod-uams_generation_final_4
# according to Fakhrul, the order for pipeline 4 is final_4_backup -> final_4
### Put multiple ISP in one line

#revision - fakhrul-3/7/22 - added dtype here 
# schema = StructType().add("Account_No",StringType(),True).add("OBJID",StringType(),True)
# All = spark.read.csv(UAMS_PySpark_save_path+"all_temp_4_backup3_before_4_{}.csv.gz".format(date_key), header=True, schema = schema)
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate4.orc".format(date_key))
All = All.select('Key','Address_ID', 'Account_No', 'OBJID', 'House_No','Building_Name',
          'Standard_Building_Name', 'Street_Type','Street_Name', 'Area', 'City',
          'Postcode', 'State', 'Address_Type', 'Serviceable', 'P_Flag')

#revision - fakhrul - 2/7/22 - commenting this out because we use usecols instead above
#All= All[[ 'Key','Address_ID', 'Account_No', 'OBJID', 'House_No','Building_Name',
          #'Standard_Building_Name', 'Street_Type','Street_Name', 'Area', 'City',
          #'Postcode', 'State', 'Address_Type', 'Serviceable', 'P_Flag']]
          
print('checking account no :', All.select('Account_No').distinct().show())

## make columns upper case
All = All.withColumn('Building_Name', f.upper(f.col('Building_Name')) )
All = All.withColumn('Street_Type', f.upper(f.col('Street_Type')) )
All = All.withColumn('Street_Name', f.upper(f.col('Street_Name')) )
All = All.withColumn('Area', f.upper(f.col('Area')) )
All = All.withColumn('City', f.upper(f.col('City')) )
All = All.withColumn('State', f.upper(f.col('State')) )

## ensure postcode is 5 digit only
All = All.withColumn("Postcode", f.regexp_replace(f.col('Postcode').cast('string'), '\.0', '') )
All = All.withColumn("Postcode", f.substring(f.col('Postcode'), 1, 5) )
All = All.withColumn('Postcode', f.lpad(f.col('Postcode').cast('string'), 5, '0') )

#All['Account_No'].str.len().unique()

## fix weird serviceable values
All = All.withColumn('Serviceable', when(f.col('Serviceable') == '\\|', '').otherwise(f.col('Serviceable')) ) 

print(All.select('Serviceable').distinct().show(30))

## 11/11/2022 Amzar: save intermediate file
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate5.orc".format(date_key), mode='overwrite', compression='snappy')

# ----------------------------------------------------------------------------------------------------------------------------------------

## delete files to clear up memory
del All

## read intermediate file back in
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate5.orc".format(date_key))

## Fix HouseNo that are converted to date. Pyspark code taken from P2 MDU Mapping Test Qubole Zepp notebooks
All = All.withColumn('HouseNo', f.upper(f.trim(f.col('House_No').cast('string'))))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JAN-","01-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JAN","-01"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "FEB-","02-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-FEB",'-02'))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "MAR-",'03-'))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-MAR","-03"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "APR-","04-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-APR","-04"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "MAY-","05-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-MAY","-05"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JUN-","06-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JUN","-06"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "JUL-","07-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-JUL","-07"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "AUG-",'08-'))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-AUG","-08"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "SEP-","09-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-SEP","-09"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "OCT-","10-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-OCT","-10"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "NOV-","11-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-NOV","-11"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "DEC-","12-"))
All = All.withColumn('HouseNo', f.regexp_replace(f.col('HouseNo'), "-DEC","-12"))

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate6.orc".format(date_key), mode='overwrite', compression='snappy')

# ----------------------------------------------------------------------------------------------------------------------------------------

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate6.orc".format(date_key))
print('Total count of All before splitting to date_house & not_date_house:', All.select('HouseNo').count()) # 8950436

## Fix HouseNo that are converted to date (DD/MM/YYYY format). Pyspark code taken from P2 MDU Mapping Test Qubole Zepp notebooks
# Filter date HouseNo
date_house = All.filter(f.regexp_extract('HouseNo', '^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)\d{4}$', 0) != '' ) 
# Spliting the HouseNo
date_house = date_house.withColumn('block_date',  f.substring(date_house.HouseNo, 1, 2))
date_house = date_house.withColumn('floor',  f.substring(date_house.HouseNo, 4, 2))
date_house = date_house.withColumn('unit',  f.substring(date_house.HouseNo, 9, 2))
# Combine the split HouseNo with dashes: '-'
date_house = date_house.withColumn('HOUSE_NO_ASTRO', f.concat_ws('-', date_house.block_date, date_house.floor, date_house.unit))
# Remove additional column created to combine HouseNo
date_house = date_house.drop(*['block_date','floor','unit'])
print('date_house count:', date_house.select('HouseNo').count()) # 6
    
# Filter not date HouseNo
not_date_house = All.filter( f.regexp_extract('HouseNo', '^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)\d{4}$', 0) == '' )
not_date_house = not_date_house.withColumn('HOUSE_NO_ASTRO', f.col('HouseNo').cast('string') )
# print(not_date_house.select('ACCOUNT_NO').count(), not_date_house.select(f.countDistinct('ACCOUNT_NO')).show()) #  rows,  unique acc_no
print('not_date_house count:', not_date_house.select('HouseNo').count()) # 8950430

# Append the 2 dfs (date_house, not_date_house) --> originally this was in 'final_3' but I've moved it here to consolidate all the parts of this HouseNo date cleaning step in 1 phase
All = date_house.union(not_date_house)
print('Total count of All after re-appending date_house & not_date_house:', All.select('HouseNo').count()) # 8950436
# create a sequential index as Zohreh did a pandas reset_index at this step again. To do it in Spark: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
All = All.withColumn("index", row_number().over(Window.orderBy(monotonically_increasing_id())) )

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes) before concat frame below :')
# print(usage)

## save intermediate table
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate7.orc".format(date_key), mode='overwrite', compression='snappy')

+----------+
|Account_No|
+----------+
|  80000127|
|  80000165|
|  80000231|
|  80000322|
|  80000461|
|  80000750|
|  80000882|
|  80000904|
|  80000931|
|  80001358|
|  80001717|
|  80001758|
|  80001771|
|  80001901|
|  80001922|
|  80001973|
|  80002376|
|  80002403|
|  80002529|
|  80002794|
+----------+
only showing top 20 rows

checking account no : None
+--------------------+
|         Serviceable|
+--------------------+
|TM|VDSL,maxis|FTT...|
|  Maxis|FTTH,TM|FTTH|
|          Maxis|FTTH|
|          maxis|VDSL|
|          ALLO|FTTH,|
|TM|VDSL,ALLO|FTTH...|
|  TM|VDSL,maxis|FTTH|
|     TM|VDSL,TM|FTTH|
|  maxis|FTTH,TM|FTTH|
|          maxis|FTTH|
|  TM|VDSL,maxis|VDSL|
|                    |
|  maxis|VDSL,TM|FTTH|
|           CTS|FTTH,|
|TM|1-G-8 JALAN PU...|
|TM|VDSL,maxis|FTT...|
|            ,TM|FTTH|
|    CTS|FTTH,TM|FTTH|
|          Maxis|VDSL|
|            CTS|FTTH|
|            TM|VDSL,|
|TM|LOT 411 JALAN ...|
|maxis|FTTH,ALLO|FTTH|
|           ALLO|FTTH|
|   TM|VDSL,AL

In [2]:
# del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate7.orc".format(date_key))

## Fix Street_Name that got converted to date format then pad with spaces
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "JAN-","1/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-JAN","/1"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), 'FEB-','2/'))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), '-FEB','/2'))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "MAR-",'3/'))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-MAR","/3"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "APR-","4/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-APR","/4"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "MAY-","5/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-MAY","/5"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "JUN-","6/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-JUN","/6"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "JUL-","7/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-JUL","/7"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "AUG-",'8/'))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-AUG","/8"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "SEP-","9/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-SEP","/9"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "OCT-","10/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-OCT","/10"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "NOV-","11/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-NOV","/11"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "DEC-","12/"))
All = All.withColumn("Street_Name", f.regexp_replace(f.col('Street_Name'), "-DEC","/12"))

All = All.withColumn('Street_Name', f.lpad(f.col('Street_Name').cast('string'), 10, ' ') )

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate8.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate8.orc".format(date_key)) ## read in ORC version

All = All.withColumn('House_No', f.lpad(f.col('HOUSE_NO_ASTRO').cast('string'), 10, ' ') )
Final = All.drop(*['HOUSE_NO_ASTRO', 'HouseNo'])

## House_No
Final = Final.withColumn("HouseNo", f.regexp_replace("House_No", "#|,|'",'')) ## this seems to cover multiple cases & runs faster than having multiple lines for each symbol to regexp_replace

## Building_Name
# _list = ['#', ',', '/', '-', '!', 'No Name', '\.', '\*', '=', ':','\)', '\(', '`', '_', '\^'] # copied from Converting to PySpark Part2
_list = ['#', ',', '/', '-', '!', 'No Name'] 
Final = Final.withColumn("Building_Name", f.regexp_replace("Building_Name", '|'.join(_list), '')) ## this seems to cover multiple cases & runs faster than having multiple lines for each symbol to regexp_replace

# Final["Building_Name"]= np.where(Final["Building_Name"]=='0', '',Final["Building_Name"] )

## assigning SDU, MDU to Address_Type
Final = Final.withColumn('Address_Type', when( ((f.col('Building_Name').isNull()) | (f.col('Building_Name') == '')), 'SDU').otherwise('MDU') )

## Street_Type
_list = ['#', ',', '/', '-', 'No Name']
Final = Final.withColumn("Street_Type", f.regexp_replace("Street_Type", '|'.join(_list), ''))
# Final = Final.withColumn("Street_Type", f.regexp_replace("Street_Type", 'JLN','JALAN')) # copied from Converting to PySpark Part2
# Final = Final.withColumn("Street_Type", f.regexp_replace("Street_Type", 'LRG','LORONG')) # copied from Converting to PySpark Part2

## Street_Name
_list = ['#', ',', 'No Name']
Final = Final.withColumn("Street_Name", f.regexp_replace("Street_Name", '|'.join(_list), ''))

## Area
_list = ['#',',', '/','-', 'No Name']
Final = Final.withColumn("Area", f.regexp_replace("Area", '|'.join(_list), ''))

## City
# _list = ['#',',', '/','-', '=', ':','\)', '\(', 'No Name','\[','\]'] # copied from Converting to PySpark Part2
_list = ['#',',', '/','-', 'No Name']
Final = Final.withColumn("City", f.regexp_replace("City", '|'.join(_list), ''))

## State
_list = ['#',',', '/','-', 'No Name']
Final = Final.withColumn("State", f.regexp_replace("State", '|'.join(_list), ''))

## Postcode
_list = ['#',',', '/','-', 'No Name']
Final = Final.withColumn("Postcode", f.regexp_replace("Postcode", '|'.join(_list), ''))

## ensuring only 5 digit postcode
All = All.withColumn('Postcode', f.lpad(f.col('Postcode').cast('string'), 5, '0') )
All = All.withColumn("Postcode", f.substring(f.col('Postcode'), 1, 5) )

# Final['Postcode']  = Final['Postcode'] .replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
# from string import printable 
# st = set(printable) 
# Final['Postcode'] = Final['Postcode'].apply(lambda x: ''.join([" " if  i not in  st else i for i in x])) ## not sure how to do this part yet

## this below part seems redundant...
Final = Final.withColumn('Address_Type', when(f.col('Building_Name').isNull(), 'SDU').when(f.col('Building_Name') == '', 'SDU').otherwise('MDU'))

## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
All.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate9.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del All # -- if required
All = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate9.orc".format(date_key)) ## read in ORC version

## remove nulls
print('count before removing nulls:', Final.select('Postcode').count()) # 8950436
Final = Final.filter(f.col('Postcode').isNotNull())
print('count after removing null Postcode:',Final.select('Postcode').count()) # 8950436
Final = Final.filter(f.col('City').isNotNull())
print('count after removing null City:',Final.select('Postcode').count()) # 8950436
Final = Final.filter(f.col('Street_Type').isNotNull())
print('count after removing null Street_Type:',Final.select('Postcode').count()) # 8950436
Final = Final.filter(f.col('Street_Name').isNotNull())
print('count after removing null Street_Name:',Final.select('Postcode').count()) # 8950436

## check OBJID which is not length 8, and make blank any which aren't 8
print(Final.select(f.length(f.col('OBJID'))).distinct().show()) # variety of lengths
Final = Final.withColumn('OBJID', when(f.length(f.col('OBJID')) == 8, f.col('OBJID')).otherwise(''))
Final = Final.withColumn("OBJID", f.substring(f.col('OBJID'), 1, 8) )
print(Final.select(f.length(f.col('OBJID'))).distinct().show()) # 8 or 0

## check Account_No which is not length 8, and make blank any which aren't 8
Final = Final.withColumn('Account_No', when(f.length(f.col('Account_No')) == 8, f.col('Account_No')).otherwise(''))
print(Final.select(f.length(f.col('Account_No'))).distinct().show()) # 8 or 0

## clean the Serviceable column
Final = Final.withColumn("Serviceable", f.regexp_replace(f.col('Serviceable').cast('string'), 'MAXIS', 'Maxis') )
Final = Final.withColumn("Serviceable", f.regexp_replace(f.col('Serviceable').cast('string'), ',Maxis\\|Z', '') )
Final = Final.withColumn("Serviceable", f.regexp_replace(f.col('Serviceable').cast('string'), 'Maxis\\|Z', '') )
# Final = Final.withColumn("Serviceable", f.regexp_replace(f.col('Serviceable').cast('string'), 'Maxis|Z', '') )

Final = Final.withColumn("Serviceable", f.regexp_replace(f.col('Serviceable').cast('string'), ',TM\\|Z', '') )
Final = Final.withColumn("Serviceable", f.regexp_replace(f.col('Serviceable').cast('string'), 'TM\\|Z', '') )
# Final = Final.withColumn("Serviceable", f.regexp_replace(f.col('Serviceable').cast('string'), 'TM|Z', '') )


## save intermediate table --> have to break it up coz it seems this code causes pyspark to take forever (more than 3 hours) to finish running an Action cell
Final.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate10.orc".format(date_key), mode='overwrite', compression='snappy')

# ------------------------------------------------------------------------------------------------------------

del Final # -- if required
Final = spark.read.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-intermediate10.orc".format(date_key)) ## read in ORC version

## further clean Serviceable column: #code below is added with na parameter to false
Final_1 = Final.filter(~f.col("Serviceable").cast('string').contains("Maxis\\|Z")) # .str.contains("Maxis\\|Z", na = False)]
Final_1 = Final_1.filter(~f.col("Serviceable").cast('string').contains("TM\\|Z")) # .str.contains("TM\\|Z", na = False)]

## to replace all nan values with ''
Final_1= Final_1.fillna('')

## ensure columns are string
Final_1 = Final_1.withColumn('Account_No', f.regexp_replace(f.upper(f.trim(f.col('Account_No').cast('string'))), 'NAN', '') )
Final_1 = Final_1.withColumn('OBJID', f.regexp_replace(f.upper(f.trim(f.col('OBJID').cast('string'))), 'NAN', '') )

## filter out nulls
print('Before filtering out nulls', Final_1.select("Account_No").count()) # 8214216
Final_1 = Final_1.filter((f.col('Postcode').isNotNull()) & (f.col('Postcode') != ''))
Final_1 = Final_1.filter((f.col('City').isNotNull()) & (f.col('City') != ''))
Final_1 = Final_1.filter((f.col('Street_Type').isNotNull()) & (f.col('Street_Type') != ''))
Final_1 = Final_1.filter((f.col('Street_Name').isNotNull()) & (f.col('Street_Name') != ''))
Final_1 = Final_1.filter((f.col('State').isNotNull()) & (f.col('State') != ''))

print('After filtering out nulls', Final_1.select("Account_No").count()) # 8214216
print('Columns of Final_1', Final_1.columns)
print('Unique Account_No after filtering out nulls', Final_1.select(f.countDistinct("Account_No")).show())
print('final_1 account no check :', Final_1.select('Account_No').head(100))

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes):') 
# print(usage)

print('Checking unique postcode length', Final_1.select(f.length(f.col('Postcode'))).distinct().show()) # only 5

## Save to ORC & CSV
Final_1.write.orc(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-final.orc".format(date_key), mode='overwrite', compression='snappy')
Final_1.coalesce(1).write.csv(UAMS_PySpark_save_path+"phase_2/{}/All_pipeline4-final.csv.gz".format(date_key), header=True, mode='overwrite', compression='gzip')
# wr.s3.to_csv(df = Final_1, path = final_1_temp_path + 'final_1_temp.csv')

count before removing nulls: 8919424
count after removing null Postcode: 8919424
count after removing null City: 8919424
count after removing null Street_Type: 8919424
count after removing null Street_Name: 8919424
+-------------+
|length(OBJID)|
+-------------+
|            1|
|            6|
|            4|
|            7|
|            0|
|            8|
|            5|
+-------------+

None
+-------------+
|length(OBJID)|
+-------------+
|            0|
|            8|
+-------------+

None
+------------------+
|length(Account_No)|
+------------------+
|                 0|
|                 8|
+------------------+

None
Before filtering out nulls 8919424
After filtering out nulls 8919424
Columns of Final_1 ['Key', 'Address_ID', 'Account_No', 'OBJID', 'House_No', 'Building_Name', 'Standard_Building_Name', 'Street_Type', 'Street_Name', 'Area', 'City', 'Postcode', 'State', 'Address_Type', 'Serviceable', 'P_Flag', 'index', 'HouseNo']
+--------------------------+
|count(DISTINCT Account_