The ETL process is as follows:
1. Import the database tables as dataframes
2. Use pandas dataframe operations to clean the dataframe
3. Extract dimension values
4. Load the cleaned dataframe into the data warehouse

Standard protocols will be used to clean the data:
1. Check for incorrect data types
2. Check for duplicate values
3. Check for multiple representations
4. Check for missing and default values
5. Check for inconsistent format
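
As a rough illustration of the five checks listed above, the sketch below runs each of them on a small invented dataframe. The column names follow the px table used later in this notebook; the data and the `report` dictionary are assumptions made for the example, not part of the original pipeline.

```python
import pandas as pd

# Toy data standing in for the px table; the values are invented for illustration.
toy_df = pd.DataFrame({
    "pxid": ["A" * 32, "A" * 32, "B" * 32, "C" * 32, "xyz"],
    "age": pd.array([30, 30, -5, None, 41], dtype="Int64"),
    "gender": ["MALE", "MALE", "FEMALE", "female", "MALE"],
})

report = {
    # 1. data types: compare the actual dtypes against what is expected
    "dtypes": toy_df.dtypes.astype(str).to_dict(),
    # 2. duplicates: rows that repeat an earlier row in full
    "duplicate_rows": int(toy_df.duplicated().sum()),
    # 3. multiple representations: distinct spellings of a categorical value
    "gender_values": sorted(toy_df["gender"].unique()),
    # 4. missing and default values: nulls and out-of-range placeholders
    "missing_age": int(toy_df["age"].isna().sum()),
    "negative_age": int((toy_df["age"] < 0).sum()),
    # 5. inconsistent format: IDs that are not 32 hexadecimal characters
    "bad_pxid": int((~toy_df["pxid"].str.match(r"^[a-fA-F0-9]{32}$")).sum()),
}
print(report)
```

Each entry only reports a problem; deciding whether to drop, repair, or keep the affected rows is left to the later cleaning steps.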

In [2]:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import URL
import numpy as np
from enum import Enum
import sqlalchemy



In [4]:
# creating the engine for the source database (MySQL)
url_object = URL.create(
    drivername='mysql',
    username='root',
    password='Data_101',
    host='localhost',
    port=3306,
    database='seriousmd'
)
MySqlEngine = create_engine(url_object)

In [3]:
# creating engine for data warehouse
url_object = URL.create(
    drivername='mysql',
    username='root',
    password='Data_101',
    host='localhost',
    port=3306,
    database='mdwarehouse'
)
MySqlEngineDW = create_engine(url_object)

In [5]:
# importing the px (patients) table into a dataframe
seriousmd_conn = MySqlEngine.connect()

patients_df = pd.read_sql(
    sql="""SELECT * FROM px""",
    con=seriousmd_conn,
    dtype={
        'age': pd.Int64Dtype()
    }
)
patients_df.info()

seriousmd_conn.close()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6507812 entries, 0 to 6507811
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxid    object
 1   age     Int64 
 2   gender  object
dtypes: Int64(1), object(2)
memory usage: 155.2+ MB
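
The `dtype={'age': pd.Int64Dtype()}` argument in the cell above requests pandas' nullable integer type. A minimal sketch of why that can matter, using invented values: with the default inference, an integer column containing a missing value is stored as float64, while Int64 keeps the integers and marks the gap as `<NA>`.

```python
import pandas as pd

ages = [29, None, 54]  # invented values with one missing entry

# Default inference: the missing value forces the column to float64 (NaN).
as_default = pd.Series(ages)

# Nullable integer dtype: values stay integers and the gap becomes <NA>.
as_nullable = pd.Series(ages, dtype=pd.Int64Dtype())

print(as_default.dtype, as_nullable.dtype)
```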


Checking for duplicate values

In [6]:
patients_df[patients_df.duplicated()]

Unnamed: 0,pxid,age,gender
813543,6287BCA4D9AA7E05113840FBD3F6423E,18,MALE
814629,A83A266DA32247121459EF1509A81C84,17,FEMALE
815606,159AB6E288586E00E0BE7646818ACBD0,46,FEMALE
816082,4E3E2A2420F9A247722E1B03F4F183B4,34,FEMALE
816670,F2E7BDC6FCA876802CCABCF007F0FB8E,60,FEMALE
...,...,...,...
2042802,4304FCA1FA5ABD90F65F31B2E0CD5847,3,FEMALE
2042912,5A15C630C9BBE8EA616E64B65B46E627,42,FEMALE
2042970,562BDF5428C27D644BF36362A8DB03EA,5,MALE
2043095,D97C2E0C69D86772583E585A03523813,79,FEMALE


In [7]:
duplicated_index = patients_df[patients_df.duplicated()].index

In [8]:
patients_df.drop(index=duplicated_index, inplace=True)
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5512483 entries, 0 to 6507811
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxid    object
 1   age     Int64 
 2   gender  object
dtypes: Int64(1), object(2)
memory usage: 173.5+ MB


In [9]:
patients_df.reset_index(drop=True, inplace=True)
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5512483 entries, 0 to 5512482
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxid    object
 1   age     Int64 
 2   gender  object
dtypes: Int64(1), object(2)
memory usage: 131.4+ MB
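
The three steps in cells In [7] to In [9] (collect the duplicate index, drop it, reset the index) can also be written as a single call. The sketch below compares both approaches on an invented dataframe; it is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "pxid": ["a1", "b2", "a1", "c3"],
    "age": [18, 46, 18, 60],
    "gender": ["MALE", "FEMALE", "MALE", "FEMALE"],
})

# Approach used in the notebook: find duplicate rows, drop them, rebuild the index.
manual = toy_df.drop(index=toy_df[toy_df.duplicated()].index).reset_index(drop=True)

# Single-call alternative: keep the first occurrence and renumber the rows.
deduped = toy_df.drop_duplicates(ignore_index=True)

print(deduped)
```

Both versions keep the first occurrence of each repeated row, so the resulting dataframes are identical.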


In [10]:
patients_df[patients_df['pxid'].duplicated()]

Unnamed: 0,pxid,age,gender
3010847,FBA46EA3EF7CCD4F3551C22272FE865F,4,MALE


In [11]:
patients_df[patients_df['pxid'] == 'FBA46EA3EF7CCD4F3551C22272FE865F']

Unnamed: 0,pxid,age,gender
3000660,FBA46EA3EF7CCD4F3551C22272FE865F,42,MALE
3010847,FBA46EA3EF7CCD4F3551C22272FE865F,4,MALE


In [12]:
patients_df.drop(index=3010847, inplace=True)
patients_df.reset_index(drop=True, inplace=True)
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5512482 entries, 0 to 5512481
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxid    object
 1   age     Int64 
 2   gender  object
dtypes: Int64(1), object(2)
memory usage: 131.4+ MB
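
The cell above removes the conflicting record by its hard-coded index. A sketch of a more general approach, on invented data: `duplicated(keep=False)` lists every row that shares a pxid so the conflict can be reviewed, and `drop_duplicates(subset=...)` applies a rule. Keeping the first record mirrors the choice made above, but whether that is the correct record is an assumption that the data alone cannot confirm.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "pxid": ["a1", "b2", "c3", "b2"],
    "age": [30, 42, 55, 4],
    "gender": ["FEMALE", "MALE", "FEMALE", "MALE"],
})

# keep=False flags every row that shares a pxid, not only the later repeats,
# so the conflicting records can be reviewed side by side.
conflicts = toy_df[toy_df["pxid"].duplicated(keep=False)].sort_values("pxid")
print(conflicts)

# One possible rule (an assumption): keep the first record seen for each pxid.
resolved = toy_df.drop_duplicates(subset="pxid", keep="first", ignore_index=True)
print(resolved)
```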


In [13]:
patients_df[patients_df['pxid'].duplicated()]

Unnamed: 0,pxid,age,gender


Checking for multiple representations

In [14]:
patients_df['gender'].unique()

array(['FEMALE', 'MALE'], dtype=object)
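
Only two spellings appear here, so the gender column needs no normalization. Had the check surfaced variants, a cleanup along the following lines would be one option. The sample values and the abbreviation mapping are invented for illustration.

```python
import pandas as pd

# Invented values; the real gender column already held only 'FEMALE' and 'MALE'.
raw = pd.Series(["FEMALE", "male", " Male ", "F", "M"], name="gender")

# Trim whitespace and unify case, then map hypothetical abbreviations.
normalized = raw.str.strip().str.upper().replace({"F": "FEMALE", "M": "MALE"})

print(normalized.unique())
```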

In [15]:
patients_df['age'].value_counts()

age
29      125362
54      125306
28      121702
31      120597
32      120355
         ...  
165          1
162          1
-40          1
-81          1
-996         1
Name: count, Length: 368, dtype: Int64

Checking for missing and default values

In [16]:
patients_df[patients_df['age'] < 0]

Unnamed: 0,pxid,age,gender
5347,EC7168F4DF42E718CA4A70F52E57A99B,-182,MALE
10471,8CC44C76FDAAC6C6F63BCFFA7D6D035B,-24,FEMALE
12949,C1CA856AD536A5271D627B1C2D3035E5,-962,FEMALE
15316,4C6D650B3DF986431FB3E8E73B25E71B,-9,MALE
19070,0D6764878E57FF8C4A3F665FD187F829,-964,MALE
...,...,...,...
5451953,C53E03E6794972AB45656853357AC65A,-5,MALE
5497595,AB033EC5325213D763D23F08DFBCAE2D,-996,FEMALE
5499919,E4A2C6A059BD27024843AEA8924ACA01,-3,MALE
5500512,91A193DBF9891D001A11C6ED9093F2F6,-1,FEMALE


In [17]:
invalid_age_patients = patients_df[patients_df['age'] < 0].index

In [18]:
patients_df.drop(index=invalid_age_patients, inplace=True)
patients_df.reset_index(drop=True, inplace=True)
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5511479 entries, 0 to 5511478
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxid    object
 1   age     Int64 
 2   gender  object
dtypes: Int64(1), object(2)
memory usage: 131.4+ MB
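
The `age < 0` filter removes the negative values, but the value counts above also list ages such as 162 and 165, which this filter leaves in place. If those should be treated as invalid too, a range check is one option. The sketch below uses invented rows, and the upper bound of 120 is a hypothetical cutoff, not a value taken from this notebook.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "pxid": ["a1", "b2", "c3", "d4", "e5"],
    "age": pd.array([29, -996, 165, None, 0], dtype="Int64"),
})

MAX_PLAUSIBLE_AGE = 120  # hypothetical cutoff, not taken from the notebook

# between() is inclusive on both ends; missing ages evaluate to <NA>,
# so fill them with True to leave null handling to a separate step.
in_range = toy_df["age"].between(0, MAX_PLAUSIBLE_AGE).fillna(True)
filtered = toy_df[in_range].reset_index(drop=True)
print(filtered)
```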


In [19]:
patients_df[patients_df['pxid'].isnull()]

Unnamed: 0,pxid,age,gender


In [20]:
na_age_patients = patients_df[(patients_df['age'].isnull())].index

In [21]:
patients_df.drop(index=na_age_patients, inplace=True)
patients_df.reset_index(drop=True, inplace=True)
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5501480 entries, 0 to 5501479
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxid    object
 1   age     Int64 
 2   gender  object
dtypes: Int64(1), object(2)
memory usage: 131.2+ MB


In [22]:
patients_df[patients_df['age'].isnull()]

Unnamed: 0,pxid,age,gender


In [23]:
patients_df['pxid'].astype(str).str.match(r'^[a-fA-F0-9]{32}$').unique()

array([ True])
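
The check above confirms that every pxid is a 32-character hexadecimal string. A variant of the same check, sketched on sample values: `str.fullmatch` makes the anchors unnecessary, `.all()` gives a single boolean, and negating the mask lists the offending values. Only the first ID is taken from the output earlier in this notebook; the other three are invented to show failures.

```python
import pandas as pd

ids = pd.Series([
    "6287BCA4D9AA7E05113840FBD3F6423E",  # taken from the duplicate-rows output
    "6287bca4d9aa7e05113840fbd3f6423e",  # lowercase hex also passes
    "6287BCA4D9AA7E05113840FBD3F6423",   # 31 characters: too short
    "Z287BCA4D9AA7E05113840FBD3F6423E",  # 'Z' is not a hex digit
])

# fullmatch requires the whole string to match, so no ^ or $ anchors are needed.
is_valid = ids.str.fullmatch(r"[a-fA-F0-9]{32}")

print(is_valid.all())   # single boolean answer
print(ids[~is_valid])   # the offending values, if any
```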

Renaming of columns

In [24]:
patients_df.rename(
    columns={
        'pxid': 'pxID',
    },
    copy=False,
    inplace=True,
    errors='raise'
)

**Loading the data into the data warehouse**

In [25]:
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5501480 entries, 0 to 5501479
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxID    object
 1   age     Int64 
 2   gender  object
dtypes: Int64(1), object(2)
memory usage: 131.2+ MB


In [26]:
# Preparing the ENUM data type for the gender column
class GenderEnum(Enum):
    MALE = 'MALE'
    FEMALE = 'FEMALE'

In [28]:
# sending dataframe to sql table
mdwarehouse_conn = MySqlEngineDW.connect()

rows_affected = patients_df.to_sql(
    name='dim_px',
    con=mdwarehouse_conn,
    if_exists='append',
    index=False,
    dtype={
        'pxID': sqlalchemy.types.String(32),
        'age': sqlalchemy.types.Integer,
        'gender': sqlalchemy.types.Enum(GenderEnum)
    },
    chunksize=5000,
    method='multi'
)

print("rows affected:" + str(rows_affected))

mdwarehouse_conn.close()

rows affected:5501480
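
The reported row count matches the length of the cleaned dataframe. One way to confirm a load independently of the value returned by `to_sql` is to count the rows in the target table afterwards. The sketch below does this against an in-memory SQLite database, which stands in for the MySQL warehouse purely so the example runs without a server; the rows are invented.

```python
import sqlite3
import pandas as pd

toy_df = pd.DataFrame({
    "pxID": ["a" * 32, "b" * 32, "c" * 32],
    "age": [29, 54, 28],
    "gender": ["MALE", "FEMALE", "FEMALE"],
})

# In-memory SQLite stands in for the MySQL warehouse so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
try:
    toy_df.to_sql(name="dim_px", con=conn, if_exists="append", index=False, chunksize=2)
    # Verify the load independently of to_sql's return value.
    loaded = pd.read_sql("SELECT COUNT(*) AS n FROM dim_px", conn)["n"].iloc[0]
finally:
    conn.close()

print(loaded == len(toy_df))
```

Because `if_exists='append'` adds to whatever the table already holds, rerunning a load cell inserts the same rows again, so a count check of this kind is most meaningful on a table whose prior size is known.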


In [None]:
MySqlEngine.dispose()
MySqlEngineDW.dispose()
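
The cells in this notebook open and close connections by hand. A minimal sketch of the context-manager form, which closes the connection even when a query raises: the in-memory SQLite URL is an assumption used only so the example runs without a MySQL server.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite is used only so the sketch runs without a MySQL server.
engine = create_engine("sqlite://")

# The connection is closed when the block exits, even if a query raises.
with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()

# dispose() releases the engine's pooled connections once all work is done.
engine.dispose()
print(value)
```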