
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your aws configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session                                                                                                  |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, s3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X                                                                            |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (eg: %glue_version 2.0)                                |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %sql                        |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |   Changes the session type to Glue ETL.                                                                                                                   |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer                       |

In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.35 
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::741993363917:role/AWSGlueServiceRole
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 0cf836e0-7fcf-43f0-ac3b-87bd40b14084
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session 0cf836e0-7fcf-43f0-ac3b-87bd40b14084 to get into ready status...
Session 0cf836e0-7fcf-43f0-ac3b-87bd40b14084 has been created




In [19]:
import pandas as pd
import numpy as np
import pyspark.sql.functions as f
from pyspark.sql.functions import when
from pyspark.sql.window import Window
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType, IntegerType
from datetime import datetime
# now = datetime.now()

## Amzar 18/11/2022 -> added a datetime key for tracking file creation
# curr_date = str(datetime.today().strftime('%Y%m%d'))
curr_date = '20221118' # temporary assignment
print(curr_date)

20221118


In [7]:
# =========================
ISP_Name = 'TM'

# distinct_mdu_p1_read_path = 's3://astro-groupdata-prod-source/rpa/Distinct_Fields_P1_MDU_old_code_20221013.csv' #18/11/22: commented out & added lines to read from 'automated' files (plus removed the old_code part for distinct file) 
# mdu_p1_read_path = 's3://astro-groupdata-prod-pipeline/address_standardization/tm_uams_mdu/UAMS_Format_stndrd_TM_P1_MDU20221014.csv' # 18/11/22: commented out & changed to latest address UAMS Format file

distinct_mdu_p1_read_path = 's3://astro-groupdata-prod-target/rpa/Distinct_Fields_P1_MDU.csv.gz' # 22/11/22: added .gz to read_path
mdu_p1_read_path = 's3://astro-groupdata-prod-pipeline/address_standardization/tm_uams_mdu/UAMS_Format_stndrd_TM_P1_MDU_20221118.csv' # need to automate

distinct_sdu_p1_read_path = 's3://astro-groupdata-prod-target/rpa/Distinct_Fields_P1_SDU.csv.gz' # 22/11/22: added .gz to read_path
sdu_p1_read_path = 's3://astro-groupdata-prod-pipeline/address_standardization/tm_uams_sdu/UAMS_Format_stndrd_TM_P1_SDU_20221118.csv'  # need to automate

distinct_sdu_p2_read_path = 's3://astro-groupdata-prod-target/rpa/Distinct_Fields_P2_SDU.csv.gz' # 22/11/22: added .gz to read_path
sdu_p2_read_path = 's3://astro-groupdata-prod-pipeline/address_standardization/tm_uams_sdu/UAMS_Format_stndrd_TM_P2_SDU_20221118.csv'  # need to automate
# new_p1_save_path = args['new_p1_save_path']
# new_p2_save_path = args['new_p2_save_path']

# print(distinct_mdu_p1_read_path)




In [26]:
#Distinct mdu p1 read

distinct_mdu_p1 = glueContext.create_dynamic_frame_from_options(

    connection_type = 's3',
    connection_options = {'paths' : [distinct_mdu_p1_read_path]},
    format = 'csv',
    format_options = {'withHeader':True}

)




In [27]:
#Distinct sdu p1 read 
distinct_sdu_p1 = glueContext.create_dynamic_frame_from_options(

    connection_type = 's3',
    connection_options = {'paths' : [distinct_sdu_p1_read_path]},
    format = 'csv',
    format_options = {'withHeader':True}

)




In [28]:
#Distinct sdu p2 read
distinct_sdu_p2 = glueContext.create_dynamic_frame_from_options(

    connection_type = 's3',
    connection_options = {'paths' : [distinct_sdu_p2_read_path]},
    format = 'csv',
    format_options = {'withHeader':True}

)




In [29]:
#MDU P1
P1_MDU = glueContext.create_dynamic_frame_from_options(

    connection_type = 's3',
    connection_options = {'paths' : [mdu_p1_read_path]},
    format = 'csv',
    format_options = {'withHeader':True}

)




In [30]:
#SDU P1
P1_SDU  = glueContext.create_dynamic_frame_from_options(

    connection_type = 's3',
    connection_options = {'paths' : [sdu_p1_read_path]},
    format = 'csv',
    format_options = {'withHeader':True}

)




In [31]:
#SDU P2
P2_SDU  = glueContext.create_dynamic_frame_from_options(

    connection_type = 's3',
    connection_options = {'paths' : [sdu_p2_read_path]},
    format = 'csv',
    format_options = {'withHeader':True}

)




In [32]:
distinct_mdu_p1 = distinct_mdu_p1.toDF()
distinct_sdu_p1 = distinct_sdu_p1.toDF()
distinct_sdu_p2 = distinct_sdu_p2.toDF()
P1_MDU = P1_MDU.toDF()
P1_SDU = P1_SDU.toDF()
P2_SDU = P2_SDU.toDF()




In [15]:
#======================================================distinct mdu p1 ==================================
print('-----first do P1 MDU-----')

#Fakhrul - 18/10/22 - same as above, helping amzar out here, removed mix key bcs we wanna run without mix key
## Select columns to reduce computational time
distinct_mdu_p1 = distinct_mdu_p1.select(['BuildingName','StreetType','StreetName','Section','City',
                                   'State','Postcode','ServiceType','HNUM_STRT_TM'])
                                   
P1_MDU = P1_MDU.withColumn("Mix_key", f.concat_ws(" ,", f.col("Street_1_New") + f.col("HNUM_STRT_TM")) )
    
P1_MDU = P1_MDU.select(['Account_No', 'service_add_objid', 'House_No','AREA', 'HNUM_STRT_TM', 'Mix_key'])

P1_MDU = P1_MDU.withColumn('AREA', f.regexp_replace('AREA', 'AA.{3}', '') )
# P1_MDU["AREA"]= np.where(P1_MDU["AREA"].astype(str).str.startswith("AA"), '', P1_MDU["AREA"])

P1_MDU = P1_MDU.withColumn( 'Account_No',  f.col('Account_No').cast('string') ).withColumn( 'Account_No',  f.regexp_replace('Account_No', '\.0', '') )

ISP_UNIQUE_WS = distinct_mdu_p1
STRT_P1 = P1_MDU

print(ISP_UNIQUE_WS.count(), STRT_P1.count()) # 3195117 179663

#revision - 24/8/22 zohreh amzar see if it works
#KL_New_fields1 = STRT_P1.merge(ISP_UNIQUE_WS,on='Mix_key', how = 'left')
# KL_New_fields1 = STRT_P1.merge(ISP_UNIQUE_WS,on='HNUM_STRT_TM', how = 'left')
KL_New_fields1 = STRT_P1.join(ISP_UNIQUE_WS, on='HNUM_STRT_TM', how = 'left')

print('Shape after merging (kl new fields 1) :', KL_New_fields1.count()) # 70821640
print('this is kl new fields cols :', KL_New_fields1.columns)\

# Creating new location flag for postcodes with both urban and rural

KL_New_fields1 = KL_New_fields1.withColumn('address_type', when(f.col('BuildingName').isNull(), 'SDU').when(f.col('BuildingName').isNotNull(), 'MDU').otherwise('none') )

## Add ASTRO_BLOCK column for RPA
KL_New_fields1 = KL_New_fields1.withColumn('ASTRO_BLOCK', f.lit(''))
KL_New_fields1 = KL_New_fields1.withColumn('ASTRO_CONDO_NAME', f.col('BuildingName'))

### Selecting and renaming columns
#revision - 24/8/22 zohreh amzar disabling this below to try the below of below
#revision - 18/10/22 fakhrul amzar disabling this to try below
#KL_New_fields2 = KL_New_fields1[['Account_No','service_add_objid','address_type','House_No','ASTRO_CONDO_NAME',
                                #'ASTRO_BLOCK','AREA','BuildingName','StreetType','StreetName',
                                #'Section','City','State','Postcode','ServiceType','HNUM_STRT_TM_y']]

KL_New_fields2 = KL_New_fields1.select(['Account_No','service_add_objid','address_type','House_No','ASTRO_CONDO_NAME',
                                'ASTRO_BLOCK','AREA','BuildingName','StreetType','StreetName',
                                'Section','City','State','Postcode','ServiceType','HNUM_STRT_TM'])

## De-dupe on Account_No, keep first (doesn't seem to be a clear reason why we keep first, so I'm gonna ignore that for now
# KL_New_fields3 = KL_New_fields2.drop_duplicates(subset= 'Account_No', keep = 'first')
KL_New_fields3 = KL_New_fields2.dropDuplicates(subset=['Account_No'])
print('KL_New_fields3 after dedupe on Acc No :', KL_New_fields3.filter(f.col('ServiceType').isNotNull()).select('HNUM_STRT_TM').count()) # 

#revision - 24/8/22 zohreh amzar disabling this below to try the below of below
#revision - 18/10/22 fakhrul amzar disabling this below
#KL_New_fields3 = KL_New_fields3[KL_New_fields3['HNUM_STRT_TM_y'].notnull()]
KL_New_fields3 = KL_New_fields3.filter(f.col('HNUM_STRT_TM').isNotNull())
print('KL_New_fields3 after removing null HNUM_STRT_TM :', KL_New_fields3.filter(f.col('ServiceType').isNotNull()).select('HNUM_STRT_TM').count()) 

## rename columns using pyspark method
KL_New_fields3 = KL_New_fields3.toDF(*['ACCOUNT_ID', 'CRM_OBJID', 'DTYPE', 'ASTRO_HOUSE_NO', 'ASTRO_CONDO_NAME', 'ASTRO_BLOCK', 'ASTRO_AREA', 'TM_CONDO', 
                                        'TM_STREET_TYPE', 'TM_STREET_NAME', 'TM_AREA', 'TM_CITY', 'TM_STATE', 'TM_POSTCODE', 'ISP_INDICATOR', 'HNUM_STRT_TM'])

print('unique acc_no before de-dupe (KL New fields 2)', KL_New_fields2.select(f.countDistinct('Account_No')).show()) # 179659
print('unique acc_no after de-dupe (KL New fields 3)', KL_New_fields3.select(f.countDistinct('ACCOUNT_ID')).show()) # 179296

# Rearranging column and pad house number
KL_New_fields3 = KL_New_fields3.select('ACCOUNT_ID','CRM_OBJID', 'DTYPE', 'ASTRO_HOUSE_NO', 'ASTRO_CONDO_NAME','ASTRO_BLOCK',
                                 'ASTRO_AREA','TM_CONDO','TM_STREET_TYPE','TM_STREET_NAME',
                                 'TM_AREA', 'TM_CITY', 'TM_STATE', 'TM_POSTCODE','ISP_INDICATOR' )

KL_New_fields3 = KL_New_fields3.withColumn('ASTRO_HOUSE_NO', f.lpad(f.col('ASTRO_HOUSE_NO'), 10, ' ') )
KL_New_fields3 = KL_New_fields3.filter(f.col('ISP_INDICATOR') != '')
KL_New_fields3 = KL_New_fields3.filter(f.col('ISP_INDICATOR').isNotNull())

print('ISP INDICATOR not null:', KL_New_fields3.count()) # 1639699
KL_New_fields3 = KL_New_fields3.filter(f.col('TM_STATE').isNotNull())

print('TM STATE not null:',KL_New_fields3.count()) # 1629700
KL_New_fields3 = KL_New_fields3.filter(f.col('TM_POSTCODE').isNotNull())

print('Final count:', KL_New_fields3.count()) # 1629700

KL_New_fields3 = KL_New_fields3.withColumn('TM_POSTCODE', f.col('TM_POSTCODE').cast('string')).withColumn('TM_POSTCODE', f.regexp_replace('TM_POSTCODE', '\.0', '') )
KL_New_fields3 = KL_New_fields3.withColumn( 'TM_POSTCODE', f.substring(f.col('TM_POSTCODE'), 1, 5) )
KL_New_fields3 = KL_New_fields3.withColumn('TM_POSTCODE', f.lpad(f.col('TM_POSTCODE'), 5, '0') )
# KL_New_fields3['TM_POSTCODE']= KL_New_fields3['TM_POSTCODE'].astype(str).replace('00000', '', regex=True)

rpa_p1_mdu = KL_New_fields3

## ----------------------------------------------------------


-----first do P1 MDU-----
3281168 181324
Shape after merging (kl new fields 1) : 71167945
this is kl new fields cols : ['HNUM_STRT_TM', 'Account_No', 'service_add_objid', 'House_No', 'AREA', 'Mix_key', 'BuildingName', 'StreetType', 'StreetName', 'Section', 'City', 'State', 'Postcode', 'ServiceType']
+--------------------------+
|count(DISTINCT Account_No)|
+--------------------------+
|                    181324|
+--------------------------+

unique acc_no before de-dupe (KL New fields 2) None
+--------------------------+
|count(DISTINCT ACCOUNT_ID)|
+--------------------------+
|                    181324|
+--------------------------+

unique acc_no after de-dupe (KL New fields 3) None
180916
180916
180916


In [21]:
# print(distinct_sdu_p1.filter(f.col('ServiceType').isNull()).count()) # 0
print(distinct_sdu_p1.filter(f.col('ServiceType') == '').count())
# filter(f.col('ISP_INDICATOR') != '')

1


In [33]:
## ======================# P1 SDU =================================
print('-----now for P1 SDU-----')
print('checking p1 sdu: ', P1_SDU.count()) # 1,639,756 rows

## Select columns to reduce computational time
#distinct_sdu_p1 = distinct_sdu_p1[['BuildingName','StreetType','StreetName','Section','City',
                                   #'State','Postcode','ServiceType','HNUM_STRT_TM','Mix_key']]
                                   
#Fakhrul - 18/10/22 - same as above, helping amzar out here, removed mix key bcs we wanna run without mix key
distinct_sdu_p1 = distinct_sdu_p1.select(['BuildingName','StreetType','StreetName','Section','City',
                                   'State','Postcode','ServiceType','HNUM_STRT_TM'])
                                   
P1_SDU = P1_SDU.withColumn("Mix_key", f.concat_ws(" ,", f.col("Combined_Building") + f.col("HNUM_STRT_TM")) )
    
P1_SDU = P1_SDU.select(['Account_No', 'service_add_objid', 'House_No','AREA', 'HNUM_STRT_TM', 'Mix_key'])

P1_SDU = P1_SDU.withColumn('AREA', f.regexp_replace('AREA', 'AA.{3}', '') )
# P1_SDU["AREA"]= np.where(P1_SDU["AREA"].astype(str).str.startswith("AA"), '', P1_SDU["AREA"])

P1_SDU = P1_SDU.withColumn( 'Account_No',  f.col('Account_No').cast('string') ).withColumn( 'Account_No',  f.regexp_replace('Account_No', '\.0', '') )

ISP_UNIQUE_WS = distinct_sdu_p1
STRT_P1 = P1_SDU

print(ISP_UNIQUE_WS.count(), STRT_P1.count()) # 9162197 1639756

#revision - 24/8/22 zohreh amzar see if it works
#KL_New_fields1 = STRT_P1.merge(ISP_UNIQUE_WS,on='Mix_key', how = 'left')
# KL_New_fields1 = STRT_P1.merge(ISP_UNIQUE_WS,on='HNUM_STRT_TM', how = 'left')
KL_New_fields1 = STRT_P1.join(ISP_UNIQUE_WS, on='HNUM_STRT_TM', how = 'left')

print('Shape after merging (kl new fields 1) :', KL_New_fields1.count()) # 2440211
print('count of non-null left join :', KL_New_fields1.filter(f.col('ServiceType').isNotNull()).select('HNUM_STRT_TM').count()) # 
print('this is kl new fields cols :', KL_New_fields1.columns)

# Creating new location flag for postcodes with both urban and rural
from pyspark.sql.functions import when
KL_New_fields1 = KL_New_fields1.withColumn('address_type', when(f.col('BuildingName').isNull(), 'SDU').when(f.col('BuildingName').isNotNull(), 'MDU').otherwise('none') )

## Add ASTRO_BLOCK column for RPA
KL_New_fields1 = KL_New_fields1.withColumn('ASTRO_BLOCK', f.lit(''))
KL_New_fields1 = KL_New_fields1.withColumn('ASTRO_CONDO_NAME', f.col('BuildingName'))

### Selecting and renaming columns
#revision - 24/8/22 zohreh amzar disabling this below to try the below of below
#revision - 18/10/22 fakhrul amzar disabling this to try below
#KL_New_fields2 = KL_New_fields1[['Account_No','service_add_objid','address_type','House_No','ASTRO_CONDO_NAME',
                                #'ASTRO_BLOCK','AREA','BuildingName','StreetType','StreetName',
                                #'Section','City','State','Postcode','ServiceType','HNUM_STRT_TM_y']]

KL_New_fields2 = KL_New_fields1.select(['Account_No','service_add_objid','address_type','House_No','ASTRO_CONDO_NAME',
                                'ASTRO_BLOCK','AREA','BuildingName','StreetType','StreetName',
                                'Section','City','State','Postcode','ServiceType','HNUM_STRT_TM'])

## De-dupe on Account_No, keep first (doesn't seem to be a clear reason why we keep first, so I'm gonna ignore that for now
# KL_New_fields3 = KL_New_fields2.drop_duplicates(subset= 'Account_No', keep = 'first')
KL_New_fields3 = KL_New_fields2.dropDuplicates(subset=['Account_No'])
print('KL_New_fields3 after dedupe on Acc No :', KL_New_fields3.filter(f.col('ServiceType').isNotNull()).select('HNUM_STRT_TM').count()) # 

#revision - 24/8/22 zohreh amzar disabling this below to try the below of below
#revision - 18/10/22 fakhrul amzar disabling this below
#KL_New_fields3 = KL_New_fields3[KL_New_fields3['HNUM_STRT_TM_y'].notnull()]
KL_New_fields3 = KL_New_fields3.filter(f.col('HNUM_STRT_TM').isNotNull())
print('KL_New_fields3 after removing null HNUM_STRT_TM :', KL_New_fields3.filter(f.col('ServiceType').isNotNull()).select('HNUM_STRT_TM').count()) # 

## rename columns using pyspark method
KL_New_fields3 = KL_New_fields3.toDF(*['ACCOUNT_ID', 'CRM_OBJID', 'DTYPE', 'ASTRO_HOUSE_NO', 'ASTRO_CONDO_NAME', 'ASTRO_BLOCK', 'ASTRO_AREA', 'TM_CONDO', 
                                        'TM_STREET_TYPE', 'TM_STREET_NAME', 'TM_AREA', 'TM_CITY', 'TM_STATE', 'TM_POSTCODE', 'ISP_INDICATOR', 'HNUM_STRT_TM'])

print('unique acc_no before de-dupe (KL New fields 2)', KL_New_fields2.select(f.countDistinct('Account_No')).show()) # 1639729
print('unique acc_no after de-dupe & remove nulls (KL New fields 3)', KL_New_fields3.select(f.countDistinct('ACCOUNT_ID')).show()) # 1639699

# Rearranging column and pad house number
KL_New_fields3 = KL_New_fields3.select('ACCOUNT_ID','CRM_OBJID', 'DTYPE', 'ASTRO_HOUSE_NO', 'ASTRO_CONDO_NAME','ASTRO_BLOCK',
                                 'ASTRO_AREA','TM_CONDO','TM_STREET_TYPE','TM_STREET_NAME',
                                 'TM_AREA', 'TM_CITY', 'TM_STATE', 'TM_POSTCODE','ISP_INDICATOR' )

KL_New_fields3 = KL_New_fields3.withColumn('ASTRO_HOUSE_NO', f.lpad(f.col('ASTRO_HOUSE_NO'), 10, ' ') )
KL_New_fields3 = KL_New_fields3.filter(f.col('ISP_INDICATOR') != '')
KL_New_fields3 = KL_New_fields3.filter(f.col('ISP_INDICATOR').isNotNull())

print('ISP INDICATOR not null:', KL_New_fields3.count()) # 1639699
KL_New_fields3 = KL_New_fields3.filter(f.col('TM_STATE').isNotNull())

print('TM STATE not null:',KL_New_fields3.count()) # 1629700
KL_New_fields3 = KL_New_fields3.filter(f.col('TM_POSTCODE').isNotNull())

print('Final count:', KL_New_fields3.count()) # 1629700

KL_New_fields3 = KL_New_fields3.withColumn('TM_POSTCODE', f.col('TM_POSTCODE').cast('string')).withColumn('TM_POSTCODE', f.regexp_replace('TM_POSTCODE', '\.0', '') )
KL_New_fields3 = KL_New_fields3.withColumn( 'TM_POSTCODE', f.substring(f.col('TM_POSTCODE'), 1, 5) )
KL_New_fields3 = KL_New_fields3.withColumn('TM_POSTCODE', f.lpad(f.col('TM_POSTCODE'), 5, '0') )
# KL_New_fields3['TM_POSTCODE']= KL_New_fields3['TM_POSTCODE'].astype(str).replace('00000', '', regex=True)

rpa_p1_sdu = KL_New_fields3

-----now for P1 SDU-----
checking p1 sdu:  1671222
3281168 1671222
Shape after merging (kl new fields 1) : 1671290
count of non-null left join : 72
this is kl new fields cols : ['HNUM_STRT_TM', 'Account_No', 'service_add_objid', 'House_No', 'AREA', 'Mix_key', 'BuildingName', 'StreetType', 'StreetName', 'Section', 'City', 'State', 'Postcode', 'ServiceType']
KL_New_fields3 after dedupe on Acc No : 4
KL_New_fields3 after removing null HNUM_STRT_TM : 4
+--------------------------+
|count(DISTINCT Account_No)|
+--------------------------+
|                   1671222|
+--------------------------+

unique acc_no before de-dupe (KL New fields 2) None
+--------------------------+
|count(DISTINCT ACCOUNT_ID)|
+--------------------------+
|                   1671222|
+--------------------------+

unique acc_no after de-dupe & remove nulls (KL New fields 3) None
ISP INDICATOR not null: 4
TM STATE not null: 4
Final count: 4


In [17]:
## ======================# P2 SDU =================================

print('-----now for P2 SDU-----')
## Select columns to reduce computational time
#distinct_sdu_p2 = distinct_sdu_p2[['BuildingName','StreetType','StreetName','Section','City',
                                   #'State','Postcode','ServiceType','HNUM_STRT_TM','Mix_key']]
                                   
#Fakhrul - 18/10/22 - same as above, helping amzar out here, removed mix key bcs we wanna run without mix key
distinct_sdu_p2 = distinct_sdu_p2.select(['BuildingName','StreetType','StreetName','Section','City',
                                   'State','Postcode','ServiceType','HNUM_STRT_TM'])
                                   
P2_SDU = P2_SDU.withColumn("Mix_key", f.concat_ws(" ,", f.col("Combined_Building") + f.col("HNUM_STRT_TM")) )
    
P2_SDU = P2_SDU.select(['Account_No', 'service_add_objid', 'House_No','AREA', 'HNUM_STRT_TM', 'Mix_key'])

P2_SDU = P2_SDU.withColumn('AREA', f.regexp_replace('AREA', 'AA.{3}', '') )
# P2_SDU["AREA"]= np.where(P2_SDU["AREA"].astype(str).str.startswith("AA"), '', P2_SDU["AREA"])

P2_SDU = P2_SDU.withColumn( 'Account_No',  f.col('Account_No').cast('string') ).withColumn( 'Account_No',  f.regexp_replace('Account_No', '\.0', '') )

# ## this step is important for P2 SDU especially since got many duplicates. Amzar 18/10/2022 -> added dedupe to the variable name
P2_SDU_dedupe = P2_SDU.drop_duplicates() # 1759949
distinct_sdu_p2_dedupe = distinct_sdu_p2.drop_duplicates() # 725325 (if on HNUM_STRT_TM, only 201932 are left)


ISP_UNIQUE_WS = distinct_sdu_p2_dedupe # Amzar 18/10/2022 -> added dedupe to the variable name
STRT_P1 = P2_SDU_dedupe # Amzar 18/10/2022 -> added dedupe to the variable name
print(ISP_UNIQUE_WS.count(), STRT_P1.count()) # 725325 1759949

#revision - 24/8/22 zohreh amzar see if it works
#KL_New_fields1 = STRT_P1.merge(ISP_UNIQUE_WS,on='Mix_key', how = 'left')
# KL_New_fields1 = STRT_P1.merge(ISP_UNIQUE_WS,on='HNUM_STRT_TM', how = 'left')
KL_New_fields1 = STRT_P1.join(ISP_UNIQUE_WS, on='HNUM_STRT_TM', how = 'left')

print('Shape after merging (kl new fields 1) :', KL_New_fields1.count()) # 60,132,254 rows
print('this is kl new fields cols :', KL_New_fields1.columns)

# Creating new location flag for postcodes with both urban and rural
KL_New_fields1 = KL_New_fields1.withColumn('address_type', when(f.col('BuildingName').isNull(), 'SDU').when(f.col('BuildingName').isNotNull(), 'MDU').otherwise('none') )

## Add ASTRO_BLOCK column for RPA
KL_New_fields1 = KL_New_fields1.withColumn('ASTRO_BLOCK', f.lit(''))
KL_New_fields1 = KL_New_fields1.withColumn('ASTRO_CONDO_NAME', f.col('BuildingName'))

### Selecting and renaming columns
#revision - 24/8/22 zohreh amzar disabling this below to try the below of below
#revision - 18/10/22 fakhrul amzar disabling this to try below
#KL_New_fields2 = KL_New_fields1[['Account_No','service_add_objid','address_type','House_No','ASTRO_CONDO_NAME',
                                #'ASTRO_BLOCK','AREA','BuildingName','StreetType','StreetName',
                                #'Section','City','State','Postcode','ServiceType','HNUM_STRT_TM_y']]

KL_New_fields2 = KL_New_fields1.select(['Account_No','service_add_objid','address_type','House_No','ASTRO_CONDO_NAME',
                                'ASTRO_BLOCK','AREA','BuildingName','StreetType','StreetName',
                                'Section','City','State','Postcode','ServiceType','HNUM_STRT_TM'])

## De-dupe on Account_No, keep first (doesn't seem to be a clear reason why we keep first, so I'm gonna ignore that for now
# KL_New_fields3 = KL_New_fields2.drop_duplicates(subset= 'Account_No', keep = 'first')
KL_New_fields3 = KL_New_fields2.dropDuplicates(subset=['Account_No'])
print('KL_New_fields3 after dedupe on Acc No :', KL_New_fields3.filter(f.col('ServiceType').isNotNull()).select('HNUM_STRT_TM').count()) # 

#revision - 24/8/22 zohreh amzar disabling this below to try the below of below
#revision - 18/10/22 fakhrul amzar disabling this below
#KL_New_fields3 = KL_New_fields3[KL_New_fields3['HNUM_STRT_TM_y'].notnull()]
KL_New_fields3 = KL_New_fields3.filter(f.col('HNUM_STRT_TM').isNotNull())
print('KL_New_fields3 after removing null HNUM_STRT_TM :', KL_New_fields3.filter(f.col('ServiceType').isNotNull()).select('HNUM_STRT_TM').count()) # 

## rename columns using pyspark method
KL_New_fields3 = KL_New_fields3.toDF(*['ACCOUNT_ID', 'CRM_OBJID', 'DTYPE', 'ASTRO_HOUSE_NO', 'ASTRO_CONDO_NAME', 'ASTRO_BLOCK', 'ASTRO_AREA', 'TM_CONDO', 
                                        'TM_STREET_TYPE', 'TM_STREET_NAME', 'TM_AREA', 'TM_CITY', 'TM_STATE', 'TM_POSTCODE', 'ISP_INDICATOR', 'HNUM_STRT_TM'])

print('unique acc_no before de-dupe (KL New fields 2)', KL_New_fields2.select(f.countDistinct('Account_No')).show()) # 1,759,944
print('unique acc_no after de-dupe (KL New fields 3)', KL_New_fields3.select(f.countDistinct('ACCOUNT_ID')).show()) # 1,759,816

# Rearranging column and pad house number
KL_New_fields3 = KL_New_fields3.select('ACCOUNT_ID','CRM_OBJID', 'DTYPE', 'ASTRO_HOUSE_NO', 'ASTRO_CONDO_NAME','ASTRO_BLOCK',
                                 'ASTRO_AREA','TM_CONDO','TM_STREET_TYPE','TM_STREET_NAME',
                                 'TM_AREA', 'TM_CITY', 'TM_STATE', 'TM_POSTCODE','ISP_INDICATOR' )

KL_New_fields3 = KL_New_fields3.withColumn('ASTRO_HOUSE_NO', f.lpad(f.col('ASTRO_HOUSE_NO'), 10, ' ') )
KL_New_fields3 = KL_New_fields3.filter(f.col('ISP_INDICATOR') != '')
KL_New_fields3 = KL_New_fields3.filter(f.col('ISP_INDICATOR').isNotNull())

print('ISP INDICATOR not null:', KL_New_fields3.count()) # 
KL_New_fields3 = KL_New_fields3.filter(f.col('TM_STATE').isNotNull())

print('TM STATE not null:',KL_New_fields3.count()) # 
KL_New_fields3 = KL_New_fields3.filter(f.col('TM_POSTCODE').isNotNull())

print('Final count:', KL_New_fields3.count()) # 

KL_New_fields3 = KL_New_fields3.withColumn('TM_POSTCODE', f.col('TM_POSTCODE').cast('string')).withColumn('TM_POSTCODE', f.regexp_replace('TM_POSTCODE', '\.0', '') )
KL_New_fields3 = KL_New_fields3.withColumn( 'TM_POSTCODE', f.substring(f.col('TM_POSTCODE'), 1, 5) )
KL_New_fields3 = KL_New_fields3.withColumn('TM_POSTCODE', f.lpad(f.col('TM_POSTCODE'), 5, '0') )
# KL_New_fields3['TM_POSTCODE']= KL_New_fields3['TM_POSTCODE'].astype(str).replace('00000', '', regex=True)

rpa_p2_sdu = KL_New_fields3

-----now for P2 SDU-----
756968 1766822
Shape after merging (kl new fields 1) : 62332947
this is kl new fields cols : ['HNUM_STRT_TM', 'Account_No', 'service_add_objid', 'House_No', 'AREA', 'Mix_key', 'BuildingName', 'StreetType', 'StreetName', 'Section', 'City', 'State', 'Postcode', 'ServiceType']
+--------------------------+
|count(DISTINCT Account_No)|
+--------------------------+
|                   1766822|
+--------------------------+

unique acc_no before de-dupe (KL New fields 2) None
+--------------------------+
|count(DISTINCT ACCOUNT_ID)|
+--------------------------+
|                   1766822|
+--------------------------+

unique acc_no after de-dupe (KL New fields 3) None
1766817
1766817
1766817


In [18]:
## ----------------------------------------------------------

# ### Combining RPA format file to P1 and P2

#======================================================new tm p1 ==================================
new_p1 = rpa_p1_mdu.union(rpa_p1_sdu)
print('P1 (after union MDU & SDU):',new_p1.count()) # 1805286

## use this pyspark method to create a 'row number' that increases post-union: https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe
new_p1 = new_p1.withColumn("row_idx", f.row_number().over(Window.orderBy(f.monotonically_increasing_id())))
## then use this pyspark method to de-dupe ACCOUNT_ID, keep first: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first
window = Window.partitionBy('ACCOUNT_ID').orderBy(f.col("row_idx").asc())
new_p1 = new_p1.withColumn('row', f.row_number().over(window)).filter(f.col('row') == 1).drop('row')

print('P1 (after dedupe on ACC_ID, keep MDU first):',new_p1.count()) # 1805286

#revision - 29/8/22 fakhrul adding cleaning code here as postcode is messed up due to decimal
# print('checking new p1 before correcting: ', new_p1.info())
# print(new_p1[['TM_POSTCODE']].head())

new_p1 = new_p1.withColumn('TM_POSTCODE', f.col('TM_POSTCODE').cast('string') ).withColumn( 'TM_POSTCODE',  f.regexp_replace('TM_POSTCODE', '\.0', '') )
new_p1 = new_p1.withColumn( 'TM_POSTCODE', f.substring(f.col('TM_POSTCODE'), 1, 5) )
new_p1 = new_p1.withColumn('TM_POSTCODE', f.lpad(f.col('TM_POSTCODE'), 5, '0') )

print('checking new p1 after correcting: ', new_p1.count()) # 1805286
# print(new_p1[['TM_POSTCODE']].head())

#=====================================================new tm p2=========================

new_p2 = rpa_p2_sdu ## Amzar 18/10/2022 -> changed the position of this line of code

print('P2:',new_p2.count()) # 1747675 ## Amzar 18/10/2022 -> changed the position of this line of code 

# print('checking new p1 before correcting: ', new_p2.info())
# print(new_p2[['TM_POSTCODE']].head())

new_p2 = new_p2.withColumn('TM_POSTCODE', f.col('TM_POSTCODE').cast('string') ).withColumn( 'TM_POSTCODE',  f.regexp_replace('TM_POSTCODE', '\.0', '') )
new_p2 = new_p2.withColumn( 'TM_POSTCODE', f.substring(f.col('TM_POSTCODE'), 1, 5) )
new_p2 = new_p2.withColumn('TM_POSTCODE', f.lpad(f.col('TM_POSTCODE'), 5, '0') )

print('checking new p2 after correcting: ', new_p2.count()) # 1745117
# print(new_p2[['TM_POSTCODE']].head())

P1 (after union MDU & SDU): 180920
P1 (after dedupe on ACC_ID, keep MDU first): 180920
checking new p1 after correcting:  180920
P2: 1766817
checking new p2 after correcting:  1766817


In [92]:
## ----------------------------------------------------------

## Save in source bucket, RPA folder - s3://astro-groupdata-prod-source/sftp/rpa/ 

# 18/11/22: added line to create standard filename for easy automation + edited existing line to store historic files
new_p1.coalesce(1).write.csv('s3://astro-groupdata-prod-source/rpa/New_P1_TM_Format_glue_spark.csv.gz', mode='overwrite', header=True, compression='gzip') 
new_p1.write.csv('s3://astro-groupdata-prod-source/rpa/historical_folder/new_p1p2/New_P1_TM_Format_glue_spark_'+str(curr_date)+'.csv.gz', mode='overwrite', header=True, compression='gzip')

# 18/11/22: added line to create standard filename for easy automation + edited existing line to store historic files
new_p2.coalesce(1).write.csv('s3://astro-groupdata-prod-source/rpa/New_P2_TM_Format_glue_spark.csv.gz', mode='overwrite', header=True, compression='gzip') 
new_p2.write.csv('s3://astro-groupdata-prod-source/rpa/historical_folder/new_p1p2/New_P2_TM_Format_glue_spark_'+str(curr_date)+'.csv.gz', mode='overwrite', header=True, compression='gzip')

#new_p1.to_csv('New_P1_TM_Format.csv',index=False)
# wr.s3.to_csv(df = new_p1, path = new_p1_save_path + 'New_P1_TM_Format_20221014.csv', index = False) ## --> 14/10/2022 Amzar: added "_20221014" to filename
# wr.s3.to_csv(df = new_p1, path = 's3://astro-groupdata-prod-source/sftp/rpa/' + 'New_P1_TM_Format_20221018.csv', index = False) ## --> 14/10/2022 Amzar: added "_20221014" to filename
# 18/11/22: commented out --> new_p1.coalesce(1).write.csv('s3://astro-groupdata-prod-source/rpa/historical_folder/New_P1_TM_Format_old_code_20221018_glue_spark.csv.gz', mode='overwrite', header=True, compression='gzip')

#new_p2.to_csv('New_P2_TM_Format.csv',index=False)
# wr.s3.to_csv(df = new_p2, path = new_p2_save_path + 'New_P2_TM_Format_20221014.csv', index = False) ## --> 14/10/2022 Amzar: added "_20221014" to filename
# wr.s3.to_csv(df = new_p2, path = 's3://astro-groupdata-prod-source/sftp/rpa/' + 'New_P2_TM_Format_20221018.csv', index = False) ## --> 14/10/2022 Amzar: added "_20221014" to filename
# 18/11/22: commented out --> new_p2.coalesce(1).write.csv('s3://astro-groupdata-prod-source/rpa/New_P2_TM_Format_20221018_glue_spark.csv.gz', mode='overwrite', header=True, compression='gzip')

# usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# print('[debug] memory usage is (Megabytes):')
# print(usage)


