The ETL process is as follows:
1. Import the database tables as dataframes
2. Use pandas dataframe operations to clean the dataframe
3. Extract dimension values
4. Load it to the data warehouse

The following standard checks are used to clean the data:
1. Check for incorrect data types
2. Check for duplicate values
3. Check for multiple representations
4. Check for missing and default values
5. Check for inconsistent format
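
As a rough illustration of how the five checks map onto pandas calls, the sketch below runs them on a small made-up frame. The column names mirror the doctors table loaded later in this notebook, but the rows and the helper name `quality_report` are hypothetical.

```python
import pandas as pd

# Toy rows (made up) shaped like the doctors table used later in this notebook.
toy = pd.DataFrame({
    "doctorid": ["A1", "B2", "B2", "C3"],
    "mainspecialty": ["Internal Medicine", "internal medicine ", "internal medicine ", None],
    "age": pd.array([34, 1048, 1048, None], dtype="Int64"),
})


def quality_report(df: pd.DataFrame) -> dict:
    """Collect one quick indicator per cleaning check (hypothetical helper)."""
    specialty = df["mainspecialty"].dropna()
    normalized = specialty.str.strip().str.casefold()
    return {
        # 1. data types as pandas sees them
        "dtypes": df.dtypes.astype(str).to_dict(),
        # 2. fully duplicated rows
        "duplicate_rows": int(df.duplicated().sum()),
        # 3. labels that collapse together once case and outer whitespace are ignored
        "extra_representations": int(specialty.nunique() - normalized.nunique()),
        # 4. missing values per column
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
        # 5. values carrying leading or trailing whitespace
        "untrimmed": int((specialty != specialty.str.strip()).sum()),
    }


report = quality_report(toy)
print(report)
```

Each entry is only a first signal; the cells below inspect the real table in more detail.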

In [2]:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import URL
import numpy as np
from enum import Enum
import sqlalchemy

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [3]:
# creating engine. MySQL will be used
url_object = URL.create(
    drivername='mysql',
    username='root',
    password='Data_101',
    host='localhost',
    port=3306,
    database='seriousmd'
)
MySqlEngine = create_engine(url_object)

In [4]:
# creating engine for data warehouse
url_object = URL.create(
    drivername='mysql',
    username='root',
    password='Data_101',
    host='localhost',
    port=3306,
    database='mdwarehouse'
)
MySqlEngineDW = create_engine(url_object)
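
The two cells above hard-code the database password. A possible alternative, sketched here, reads the connection settings from environment variables. The variable names (`SERIOUSMD_DB_USER` and so on) and the helper `build_url` are assumptions made for this example, not part of the original setup.

```python
import os

from sqlalchemy.engine import URL


def build_url(database: str) -> URL:
    """Build a MySQL URL from environment variables (variable names are assumed)."""
    return URL.create(
        drivername="mysql",
        username=os.environ.get("SERIOUSMD_DB_USER", "root"),
        password=os.environ.get("SERIOUSMD_DB_PASSWORD"),
        host=os.environ.get("SERIOUSMD_DB_HOST", "localhost"),
        port=int(os.environ.get("SERIOUSMD_DB_PORT", "3306")),
        database=database,
    )


source_url = build_url("seriousmd")
warehouse_url = build_url("mdwarehouse")

# hide_password=True masks the password when the URL is printed or logged.
print(source_url.render_as_string(hide_password=True))
```

Either URL object can then be passed to `create_engine` exactly as in the cells above.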

In [106]:
# importing the doctors table into a dataframe
seriousmd_conn = MySqlEngine.connect()

doctors_df = pd.read_sql(
    sql="""SELECT * FROM doctors""",
    con=seriousmd_conn,
    dtype={
        'age': pd.Int64Dtype()
    }
)
doctors_df.info()

seriousmd_conn.close()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60024 entries, 0 to 60023
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       60024 non-null  object
 1   mainspecialty  27315 non-null  object
 2   age            20028 non-null  Int64 
dtypes: Int64(1), object(2)
memory usage: 1.4+ MB


**Cleaning Data**

In [107]:
doctors_df[doctors_df.duplicated()]

Unnamed: 0,doctorid,mainspecialty,age


In [108]:
doctors_df[doctors_df['doctorid'].duplicated()]

Unnamed: 0,doctorid,mainspecialty,age


Checking for multiple representations

In [109]:
doctors_df['age'].value_counts(dropna=False)

age
<NA>    39996
34       1215
35       1082
33        973
37        971
        ...  
4           1
91          1
92          1
1048        1
160         1
Name: count, Length: 82, dtype: Int64
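
The counts above include ages such as 4, 160 and 1048, which are unlikely for practising doctors. One option, sketched below on made-up values, is to treat out-of-range ages as missing. The bounds of 18 and 100 are assumptions for illustration; the notebook itself does not define a valid range.

```python
import pandas as pd

MIN_AGE, MAX_AGE = 18, 100  # assumed bounds, adjust to the real domain rules

ages = pd.Series(pd.array([34, 4, 91, 160, 1048, None], dtype="Int64"), name="age")

# between() yields <NA> for missing ages; fillna(False) treats those as not plausible.
plausible = ages.between(MIN_AGE, MAX_AGE).fillna(False).astype(bool)

# where() keeps values where the condition holds and sets the rest to <NA>.
cleaned = ages.where(plausible)
print(cleaned)
```

The nullable `Int64` dtype is preserved, so the cleaned column can still be loaded as an integer column.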

In [134]:
doctors_df['mainspecialty'].value_counts(dropna=False)

mainspecialty
None                                           32830
Internal Medicine                               3957
General Medicine                                2378
Pediatrics                                      1762
General Physician                                933
                                               ...  
Adult Hematology                                   1
E.G                                                1
General Dentistry, Orthodontics, & Implants        1
Pediatric -Adolescent Medicine                     1
Breast -General Surgery                            1
Name: count, Length: 3784, dtype: int64

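Plan of attack:
1. Create a junction table
2. Compile all specialties
2a. While compiling, work on a separate copy of the doctors table
2b. Query the copy for every row that matches the specialty
2c. Once the matches are added to the junction table, delete those rows from the copy.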
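
The plan above can be sketched as one reusable step. The helper name `split_matches` and the sample rows are hypothetical. Unlike the cells that follow, which start each specialty from a fresh copy of the table, this version keeps shrinking the same copy across patterns, so each doctor lands in at most one specialty.

```python
import pandas as pd


def split_matches(remaining: pd.DataFrame, pattern: str, label: str):
    """Return (rows that did not match, junction rows for the rows that did)."""
    mask = remaining["mainspecialty"].str.contains(pattern, case=False, na=False, regex=True)
    matched = remaining.loc[mask, ["doctorid"]].assign(mainspecialty=label)
    return remaining.loc[~mask], matched


# Made-up rows for illustration only.
doctors = pd.DataFrame({
    "doctorid": ["A1", "B2", "C3", "D4"],
    "mainspecialty": ["Internal Medicine", "INTERNAL MEDICINE", "Pediatrics", None],
})

pieces = []
remaining = doctors.copy()
for pattern, label in [(r"^internal medicine$", "Internal Medicine"), (r"pedia", "Pediatrics")]:
    remaining, matched = split_matches(remaining, pattern, label)
    pieces.append(matched)

# Concatenating once at the end avoids repeatedly growing a dataframe.
junction = pd.concat(pieces, ignore_index=True)
print(junction)
print(remaining)
```

Collecting the pieces in a list and concatenating once is a design choice for the sketch; the notebook's incremental `pd.concat` produces the same rows.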

In [123]:
# creating the junction table
junc_doctor_specialty = pd.DataFrame(columns=['doctorid', 'mainspecialty'])

In [112]:
# creating the specialty table
specialty_df = []

In [113]:
# replacing None
doctors_df['mainspecialty'] = doctors_df['mainspecialty'].replace(to_replace='None', value=None)


In [183]:
doctors_df['mainspecialty'].value_counts()

mainspecialty
Internal Medicine                              3957
General Medicine                               2378
Pediatrics                                     1762
General Physician                               933
Family Medicine                                 919
                                               ... 
Adult Hematology                                  1
E.G                                               1
General Dentistry, Orthodontics, & Implants       1
Pediatric -Adolescent Medicine                    1
Breast -General Surgery                           1
Name: count, Length: 3783, dtype: int64

In [114]:
doctors_df_cpy = doctors_df.copy()

In [115]:
# let's tackle internal medicine first
specialty_df.append('Internal Medicine')

In [118]:
regex_val = r'^Internal Medicine$'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
Internal Medicine    3957
Internal medicine     346
internal medicine      97
INTERNAL MEDICINE      96
internal Medicine      13
Internal MEdicine       2
Internal MEDICINE       1
Name: count, dtype: int64
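
The output above shows one specialty written in several letter cases. A minimal sketch of a comparison key that ignores case and stray whitespace follows; the sample labels are made up, and the key is meant for grouping only, not as the stored value.

```python
import pandas as pd

labels = pd.Series([
    "Internal Medicine", "Internal medicine", "INTERNAL MEDICINE",
    "internal  Medicine ", "Pediatrics", None,
], name="mainspecialty")

# Build a comparison key: trim, collapse runs of whitespace, ignore case.
key = (
    labels.str.strip()
          .str.replace(r"\s+", " ", regex=True)
          .str.casefold()
)

print(key.value_counts())
```

Grouping on such a key would also catch values like the double-spaced "GENERAL  MEDICINE" that appears later in this notebook.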

In [121]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value='Internal Medicine')
temp_df

Unnamed: 0,doctorid,mainspecialty
4,A7AEED74714116F3B292A982238F83D2,Internal Medicine
10,B7B16ECF8CA53723593894116071700C,Internal Medicine
15,A1D33D0DFEC820B41B54430B50E96B5C,Internal Medicine
24,6CFE0E6127FA25DF2A0EF2AE1067D915,Internal Medicine
26,5D44EE6F2C3F71B73125876103C8F6C4,Internal Medicine
...,...,...
59913,191C62D342811D1A0D3D0528EC35CD2D,Internal Medicine
59941,5F268DFB0FBEF44DE0F668A022707B86,Internal Medicine
59945,B8C7803CAE4625F5A77592749E5D6FFF,Internal Medicine
59972,95A976F4986CACE28BC02AA631F7B88F,Internal Medicine


In [124]:
junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)
junc_doctor_specialty

Unnamed: 0,doctorid,mainspecialty
0,A7AEED74714116F3B292A982238F83D2,Internal Medicine
1,B7B16ECF8CA53723593894116071700C,Internal Medicine
2,A1D33D0DFEC820B41B54430B50E96B5C,Internal Medicine
3,6CFE0E6127FA25DF2A0EF2AE1067D915,Internal Medicine
4,5D44EE6F2C3F71B73125876103C8F6C4,Internal Medicine
...,...,...
4507,191C62D342811D1A0D3D0528EC35CD2D,Internal Medicine
4508,5F268DFB0FBEF44DE0F668A022707B86,Internal Medicine
4509,B8C7803CAE4625F5A77592749E5D6FFF,Internal Medicine
4510,95A976F4986CACE28BC02AA631F7B88F,Internal Medicine


In [129]:
drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)
doctors_df_cpy.info()

<bound method DataFrame.info of                                doctorid              mainspecialty   age
0      CCB1D45FB76F7C5A0BF619F979C6CF36                    Surgery    71
1      6C29793A140A811D0C45CE03C1C93A28                       None  <NA>
2      CF67355A3333E6E143439161ADC2D82E            Family Medicine    67
3      5B69B9CB83065D403869739AE7F0995E  Obstetrics and Gynecology    43
5      851DDF5058CF22DF63D3344AD89919CF              Ophthalmology    46
...                                 ...                        ...   ...
60019  B2592B47D3A46F07D90C3E5A9CF3ACC3                       None  <NA>
60020  C136BB626E2697D2D1BBA2BD447277A1                       None  <NA>
60021  7B49655747396EBE9689CE931D04F84C          GENERAL PHYSICIAN  <NA>
60022  82048394A4FEC8B2F82CB19FAD17D292                       None  <NA>
60023  D0F56CC15CD90A7568BD6E0319A0DDBB                       None  <NA>

[55512 rows x 3 columns]>

start of function 'Internal Medicine'

In [165]:
regex_val = r'Internal'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

Series([], Name: count, dtype: int64)

In [161]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value='Internal Medicine')

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)
doctors_df_cpy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 54524 entries, 0 to 60023
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       54524 non-null  object
 1   mainspecialty  21694 non-null  object
 2   age            15889 non-null  Int64 
dtypes: Int64(1), object(2)
memory usage: 1.7+ MB


General Medicine

In [167]:
doctors_df_cpy = doctors_df.copy()

In [168]:
specialty_df.append('General Medicine')

In [181]:
regex_val = r'General  m'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
GENERAL  MEDICINE    1
Name: count, dtype: int64

In [182]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value='General Medicine')

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11035 entries, 0 to 11034
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       11035 non-null  object
 1   mainspecialty  11035 non-null  object
dtypes: object(2)
memory usage: 172.5+ KB


Pediatrics

In [184]:
doctors_df_cpy = doctors_df.copy()

In [185]:
specialty_str = 'Pediatrics'
specialty_df.append(specialty_str)

In [197]:
regex_val = r'Pedia'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

Series([], Name: count, dtype: int64)

In [196]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13717 entries, 0 to 13716
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       13717 non-null  object
 1   mainspecialty  13717 non-null  object
dtypes: object(2)
memory usage: 214.5+ KB


Family Medicine

In [198]:
doctors_df_cpy = doctors_df.copy()

In [199]:
specialty_str = 'Family Medicine'
specialty_df.append(specialty_str)

In [236]:
regex_val = r'family do'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
Family Doctor    4
Family doctor    1
Name: count, dtype: int64

In [237]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15081 entries, 0 to 15080
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       15081 non-null  object
 1   mainspecialty  15081 non-null  object
dtypes: object(2)
memory usage: 235.8+ KB


Obstetrics

In [238]:
doctors_df_cpy = doctors_df.copy()

In [239]:
specialty_str = 'Obstetrics'
specialty_df.append(specialty_str)

In [270]:
regex_val = r'^Oby$'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
Oby    4
Name: count, dtype: int64

In [271]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17162 entries, 0 to 17161
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       17162 non-null  object
 1   mainspecialty  17162 non-null  object
dtypes: object(2)
memory usage: 268.3+ KB


Gynecology

In [272]:
doctors_df_cpy = doctors_df.copy()

In [273]:
specialty_str = 'Gynecology'
specialty_df.append(specialty_str)

In [277]:
regex_val = r'Gyn'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
OB-GYN                        128
OBGYN                          97
OB GYN                         66
OB-GYNE                        31
OB GYNE                        28
                             ... 
on-gyn                          1
Ob gyn/ GP                      1
obstetric-Gynecologist          1
GYNECOLOGIC ONCOLOGY            1
Obstetrics and Gynelcology      1
Name: count, Length: 143, dtype: int64

In [278]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19172 entries, 0 to 19171
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       19172 non-null  object
 1   mainspecialty  19172 non-null  object
dtypes: object(2)
memory usage: 299.7+ KB


Dermatology

In [279]:
doctors_df_cpy = doctors_df.copy()

In [280]:
specialty_str = 'Dermatology'
specialty_df.append(specialty_str)

In [283]:
regex_val = r'Derm'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
Dermatology                                                      676
Dermatologist                                                     37
Derma                                                             19
DERMATOLOGY                                                       18
Aesthetic Dermatology                                             13
                                                                ... 
Dermatology (Skin, Hair, Nails)                                    1
DERMATOLOGY - Fellow, Philippine Dermatological Society, Inc.      1
Dermatology ( Medical)                                             1
Dermatology and Advanced Wound Care                                1
Skin Health and Aesthetic Dermatology                              1
Name: count, Length: 78, dtype: int64

In [284]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20030 entries, 0 to 20029
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       20030 non-null  object
 1   mainspecialty  20030 non-null  object
dtypes: object(2)
memory usage: 313.1+ KB


Ophthalmology

In [285]:
doctors_df_cpy = doctors_df.copy()

In [286]:
specialty_str = 'Ophthalmology'
specialty_df.append(specialty_str)

In [291]:
regex_val = r'Opt[^i]'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
Optometry                                          24
Opthalmology                                       12
Optometrist                                        11
Opthamology                                         4
Opta                                                3
Opthalmologist                                      2
Optha                                               2
Clinical Optometry                                  2
Optal                                               1
OPTHALMOLOGIST                                      1
Pediatric Optometry                                 1
General Medicine & Optometrist                      1
CLINICAL OPTOMETRY                                  1
Opthalmologists                                     1
Vision Therapy/ Behavioral Optometry                1
Doctor Of Optometry                                 1
opthalmologist                                      1
OPTOMETRY                                           1
Internal medic

In [292]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20744 entries, 0 to 20743
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       20744 non-null  object
 1   mainspecialty  20744 non-null  object
dtypes: object(2)
memory usage: 324.2+ KB
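
The `Opt[^i]` pattern used in this Ophthalmology pass also matches optometry labels, and taken on its own it does not match the correctly spelled "Ophthalmology" (there is no "opt" sequence in that word). The sketch below compares it with a tighter pattern using the standard `re` module. The alternative pattern is an assumption, and whether optometry should be grouped under Ophthalmology remains a modelling decision.

```python
import re

samples = [
    "Ophthalmology", "Opthalmology", "Opthamology", "Optha", "Opta",
    "Optometry", "Clinical Optometry", "Optometrist",
]

broad = re.compile(r"Opt[^i]", re.IGNORECASE)    # pattern used in the notebook
narrow = re.compile(r"oph?th?a", re.IGNORECASE)  # assumed tighter alternative

for text in samples:
    print(f"{text:20} broad={bool(broad.search(text))} narrow={bool(narrow.search(text))}")
```

The optional `h` characters let the narrow pattern accept the common misspellings while the required trailing `a` excludes "Opto..." values.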


Surgery

In [293]:
doctors_df_cpy = doctors_df.copy()

In [294]:
specialty_str = 'Surgery'
specialty_df.append(specialty_str)

In [295]:
regex_val = r'surg'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
Surgery                                               526
General Surgery                                       525
Orthopedic Surgery                                    161
Neurosurgery                                           66
Orthopaedic Surgery                                    52
                                                     ... 
general  and cancer surgery                             1
Otorhinolaryngology-Head and Neck Surgery(ENT-HNS)      1
Pediatrics, General Surgery                             1
Orthopaedics and Spine Surgery                          1
Breast -General Surgery                                 1
Name: count, Length: 519, dtype: int64

In [296]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23089 entries, 0 to 23088
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       23089 non-null  object
 1   mainspecialty  23089 non-null  object
dtypes: object(2)
memory usage: 360.9+ KB


Psychiatry

In [297]:
doctors_df_cpy = doctors_df.copy()

In [298]:
specialty_str = 'Psychiatry'
specialty_df.append(specialty_str)

In [299]:
regex_val = r'psy'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
Psychiatry                                                 212
Psychology                                                  31
Clinical Psychology                                         19
Psychiatrist                                                15
General Adult Psychiatry                                    12
                                                          ... 
Psychology assessment                                        1
PSYCHOLOGY                                                   1
Telepsychology                                               1
Addiction Psychiatry, Psychotherapy, Lifestyle Medicine      1
psychiatrist                                                 1
Name: count, Length: 87, dtype: int64

In [300]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23512 entries, 0 to 23511
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       23512 non-null  object
 1   mainspecialty  23512 non-null  object
dtypes: object(2)
memory usage: 367.5+ KB


Dentistry

In [301]:
doctors_df_cpy = doctors_df.copy()

In [302]:
specialty_str = 'Dentistry'
specialty_df.append(specialty_str)

In [307]:
regex_val = r'([^u] | [^i])dent'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()


mainspecialty
General Dentistry                                    176
General Dentist                                       30
General dentistry                                     14
General Dentistry, Orthodontics                       12
Pediatric Dentistry                                   10
                                                    ... 
General Dentistry, Orthodontics, & Implants            1
Orthodontic, Implantology, Esthetic Dentistry          1
Oral Surgery, Prosthodontics, General Dentistry        1
Pediatric dentistry                                    1
General Dentistry, Oral Surgery, TMJ-Orthodontics      1
Name: count, Length: 103, dtype: int64

In [308]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23882 entries, 0 to 23881
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       23882 non-null  object
 1   mainspecialty  23882 non-null  object
dtypes: object(2)
memory usage: 373.3+ KB
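
In the Dentistry pattern `([^u] | [^i])dent`, the spaces around `|` are literal characters and the parentheses form a capturing group, which is what makes pandas warn about match groups in `str.contains`. As written, the pattern only matches "dent" when a space comes shortly before it, so a value that begins with "Dentist" is skipped. The sketch below assumes the intent was to exclude "student" and "resident"; that intent is a guess, and the sample strings are made up.

```python
import re

samples = [
    "General Dentistry", "Dentist", "Pediatric dentistry",
    "Medical Student", "Resident Physician",
]

# Pattern from the notebook: literal spaces and a capturing group.
original = re.compile(r"([^u] | [^i])dent", re.IGNORECASE)

# Assumed intent: match "dent" unless it is part of "student" or "resident".
# Negative lookbehinds check the preceding text without consuming it.
alternative = re.compile(r"(?<!stu)(?<!resi)dent", re.IGNORECASE)

for text in samples:
    print(f"{text:22} original={bool(original.search(text))} alternative={bool(alternative.search(text))}")
```

Because the alternative has no capturing group, passing it to `str.contains` should not trigger the match-group warning.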



The same steps were repeated for urology, neurology, orthopedics, cardiology, radiology, gastroenterology, endocrinology, and pulmonology. Only the Pulmonology pass is shown below.

In [369]:
doctors_df_cpy = doctors_df.copy()

In [370]:
specialty_str = 'Pulmonology'
specialty_df.append(specialty_str)

In [372]:
regex_val = r'Pulm'
doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['mainspecialty'].value_counts()

mainspecialty
Pulmonology                                        30
Internal Medicine - Pulmonology                    18
Pulmonologist                                      17
Pulmonary Medicine                                 16
Internal Medicine - Pulmonary Medicine             11
                                                   ..
Pulmologist                                         1
Internal Medicine, Pulmonary and Sleep Medicine     1
Pulmunology                                         1
internal Medicine, Pulmonology                      1
Adult Pulmonary Medicine & Internal Medicine        1
Name: count, Length: 83, dtype: int64

In [373]:
temp_df = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)]['doctorid'].copy()
temp_df = pd.DataFrame(temp_df)
temp_df.insert(loc=1, column='mainspecialty', value=specialty_str)

junc_doctor_specialty = pd.concat([junc_doctor_specialty, temp_df], ignore_index=True)

drop_rows = doctors_df_cpy[doctors_df_cpy['mainspecialty'].str.contains(pat=regex_val, case=False, na=False)].index
doctors_df_cpy = doctors_df_cpy.drop(index=drop_rows)

junc_doctor_specialty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26202 entries, 0 to 26201
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doctorid       26202 non-null  object
 1   mainspecialty  26202 non-null  object
dtypes: object(2)
memory usage: 409.5+ KB


-------

all have consistent format

The following standard checks are used to clean the data:
1. Check for incorrect data types
2. Check for duplicate values
3. Check for multiple representations
4. Check for missing and default values
5. Check for inconsistent format

Creating the dataframe for specialty

In [378]:
official_specialty_df = pd.DataFrame(specialty_df, columns=['specialty'], copy=True)
official_specialty_df.index += 1
official_specialty_df

Unnamed: 0,specialty
1,Internal Medicine
2,General Medicine
3,Pediatrics
4,Family Medicine
5,Obstetrics
6,Gynecology
7,Dermatology
8,Ophthalmology
9,Surgery
10,Psychiatry


creating the dataframe for junction table with code instead of string

In [381]:
official_junc_doctor_specialty = junc_doctor_specialty.copy()
official_specialty_df_cpy = official_specialty_df.copy()
official_specialty_df_cpy['index'] = official_specialty_df_cpy.index

official_junc_doctor_specialty

Unnamed: 0,doctorid,mainspecialty
0,A7AEED74714116F3B292A982238F83D2,Internal Medicine
1,B7B16ECF8CA53723593894116071700C,Internal Medicine
2,A1D33D0DFEC820B41B54430B50E96B5C,Internal Medicine
3,6CFE0E6127FA25DF2A0EF2AE1067D915,Internal Medicine
4,5D44EE6F2C3F71B73125876103C8F6C4,Internal Medicine
...,...,...
26197,9C8C4241D264F883F4587529316B042A,Pulmonology
26198,EA9BE6EA49C5C752ABB11953955C90E4,Pulmonology
26199,6E69EBBFAD976D4637BB4B39DE261BF7,Pulmonology
26200,45C68484C6FC509CB25BDFCA881E5CD8,Pulmonology


In [382]:
official_specialty_df_cpy

Unnamed: 0,specialty,index
1,Internal Medicine,1
2,General Medicine,2
3,Pediatrics,3
4,Family Medicine,4
5,Obstetrics,5
6,Gynecology,6
7,Dermatology,7
8,Ophthalmology,8
9,Surgery,9
10,Psychiatry,10


In [383]:
official_junc_doctor_specialty = pd.merge(
    left=official_junc_doctor_specialty,
    right=official_specialty_df_cpy,
    how='inner',
    left_on='mainspecialty',
    right_on='specialty'
)

official_junc_doctor_specialty

Unnamed: 0,doctorid,mainspecialty,specialty,index
0,A7AEED74714116F3B292A982238F83D2,Internal Medicine,Internal Medicine,1
1,B7B16ECF8CA53723593894116071700C,Internal Medicine,Internal Medicine,1
2,A1D33D0DFEC820B41B54430B50E96B5C,Internal Medicine,Internal Medicine,1
3,6CFE0E6127FA25DF2A0EF2AE1067D915,Internal Medicine,Internal Medicine,1
4,5D44EE6F2C3F71B73125876103C8F6C4,Internal Medicine,Internal Medicine,1
...,...,...,...,...
26197,9C8C4241D264F883F4587529316B042A,Pulmonology,Pulmonology,18
26198,EA9BE6EA49C5C752ABB11953955C90E4,Pulmonology,Pulmonology,18
26199,6E69EBBFAD976D4637BB4B39DE261BF7,Pulmonology,Pulmonology,18
26200,45C68484C6FC509CB25BDFCA881E5CD8,Pulmonology,Pulmonology,18


In [388]:
official_junc_doctor_specialty = official_junc_doctor_specialty[['doctorid', 'index']]
official_junc_doctor_specialty

Unnamed: 0,doctorid,index
0,A7AEED74714116F3B292A982238F83D2,1
1,B7B16ECF8CA53723593894116071700C,1
2,A1D33D0DFEC820B41B54430B50E96B5C,1
3,6CFE0E6127FA25DF2A0EF2AE1067D915,1
4,5D44EE6F2C3F71B73125876103C8F6C4,1
...,...,...
26197,9C8C4241D264F883F4587529316B042A,18
26198,EA9BE6EA49C5C752ABB11953955C90E4,18
26199,6E69EBBFAD976D4637BB4B39DE261BF7,18
26200,45C68484C6FC509CB25BDFCA881E5CD8,18
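
The label-to-key replacement above can also be written so that the merge reports problems on its own. This is a minimal sketch on made-up rows: `validate` and `indicator` are standard `merge` options, while the frame contents are hypothetical.

```python
import pandas as pd

specialty = pd.DataFrame({"specialty": ["Internal Medicine", "Pediatrics"]})
specialty.index += 1                  # 1-based surrogate key, as in the notebook
specialty.index.name = "specialtyID"

junction = pd.DataFrame({
    "doctorid": ["A1", "B2", "C3"],
    "mainspecialty": ["Internal Medicine", "Pediatrics", "Internal Medicine"],
})

coded = junction.merge(
    specialty.reset_index(),
    how="left",
    left_on="mainspecialty",
    right_on="specialty",
    validate="many_to_one",  # raises if a label appears twice in the dimension
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows whose label has no entry in the specialty dimension.
unmatched = coded[coded["_merge"] != "both"]
coded = coded[["doctorid", "specialtyID"]]
print(coded)
```

A left join with an indicator keeps unmatched rows visible, whereas the inner join used above drops them silently.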


**LOADING INTO SQL**

In [389]:
official_specialty_df

Unnamed: 0,specialty
1,Internal Medicine
2,General Medicine
3,Pediatrics
4,Family Medicine
5,Obstetrics
6,Gynecology
7,Dermatology
8,Ophthalmology
9,Surgery
10,Psychiatry


In [390]:
# sending dataframe to sql table
# dim_specialty
mdwarehouse_conn = MySqlEngineDW.connect()

rows_affected = official_specialty_df.to_sql(
    name='dim_specialty',
    con=mdwarehouse_conn,
    if_exists='append',
    index=True,
    index_label='specialtyID',
    dtype={
        'specialtyID': sqlalchemy.types.Integer,
        'specialty': sqlalchemy.types.String(100),
    },
    chunksize=5000,
    method='multi'
)

print("rows affected:" + str(rows_affected))

mdwarehouse_conn.close()

rows affected:18
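
If `to_sql` raises in the cell above, `mdwarehouse_conn.close()` is never reached. The sketch below shows the same kind of load with cleanup guaranteed by `try`/`finally`. It uses an in-memory SQLite database as a stand-in for the MySQL warehouse, and the rows are made up. With SQLAlchemy, a `with MySqlEngineDW.begin() as conn:` block would be the analogous pattern.

```python
import sqlite3

import pandas as pd

dim = pd.DataFrame({"specialty": ["Internal Medicine", "Pediatrics"]})
dim.index += 1  # 1-based key, written out as the specialtyID column

# In-memory SQLite stands in for the MySQL warehouse in this sketch.
conn = sqlite3.connect(":memory:")
try:
    dim.to_sql(
        name="dim_specialty",
        con=conn,
        if_exists="append",
        index=True,
        index_label="specialtyID",
    )
    conn.commit()
    loaded = pd.read_sql(
        "SELECT specialtyID, specialty FROM dim_specialty ORDER BY specialtyID",
        conn,
    )
finally:
    # Runs even if to_sql raises, so the connection is never left open.
    conn.close()

print(loaded)
```

Reading the table back is a cheap way to confirm the load, independent of the row count returned by `to_sql`.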


In [392]:
official_doctors_df = doctors_df[['doctorid', 'age']]
official_doctors_df

Unnamed: 0,doctorid,age
0,CCB1D45FB76F7C5A0BF619F979C6CF36,71
1,6C29793A140A811D0C45CE03C1C93A28,
2,CF67355A3333E6E143439161ADC2D82E,67
3,5B69B9CB83065D403869739AE7F0995E,43
4,A7AEED74714116F3B292A982238F83D2,43
...,...,...
60019,B2592B47D3A46F07D90C3E5A9CF3ACC3,
60020,C136BB626E2697D2D1BBA2BD447277A1,
60021,7B49655747396EBE9689CE931D04F84C,
60022,82048394A4FEC8B2F82CB19FAD17D292,


In [393]:
official_doctors_df = official_doctors_df.rename(
    columns={
        'doctorid': 'doctorID',
    },
    errors='raise'
)
official_doctors_df


Unnamed: 0,doctorID,age
0,CCB1D45FB76F7C5A0BF619F979C6CF36,71
1,6C29793A140A811D0C45CE03C1C93A28,
2,CF67355A3333E6E143439161ADC2D82E,67
3,5B69B9CB83065D403869739AE7F0995E,43
4,A7AEED74714116F3B292A982238F83D2,43
...,...,...
60019,B2592B47D3A46F07D90C3E5A9CF3ACC3,
60020,C136BB626E2697D2D1BBA2BD447277A1,
60021,7B49655747396EBE9689CE931D04F84C,
60022,82048394A4FEC8B2F82CB19FAD17D292,


In [394]:
# sending dataframe to sql table
# dim_doctor
mdwarehouse_conn = MySqlEngineDW.connect()

rows_affected = official_doctors_df.to_sql(
    name='dim_doctor',
    con=mdwarehouse_conn,
    if_exists='append',
    index=False,
    dtype={
        'doctorID': sqlalchemy.types.String(32),
        'age': sqlalchemy.types.Integer,
    },
    chunksize=5000,
    method='multi'
)

print("rows affected:" + str(rows_affected))

mdwarehouse_conn.close()

rows affected:60024


In [395]:
official_junc_doctor_specialty

Unnamed: 0,doctorid,index
0,A7AEED74714116F3B292A982238F83D2,1
1,B7B16ECF8CA53723593894116071700C,1
2,A1D33D0DFEC820B41B54430B50E96B5C,1
3,6CFE0E6127FA25DF2A0EF2AE1067D915,1
4,5D44EE6F2C3F71B73125876103C8F6C4,1
...,...,...
26197,9C8C4241D264F883F4587529316B042A,18
26198,EA9BE6EA49C5C752ABB11953955C90E4,18
26199,6E69EBBFAD976D4637BB4B39DE261BF7,18
26200,45C68484C6FC509CB25BDFCA881E5CD8,18


In [396]:
official_junc_doctor_specialty = official_junc_doctor_specialty.rename(
    columns={
        'index': 'specialtyID',
    },
    errors='raise'
)
official_junc_doctor_specialty


Unnamed: 0,doctorid,specialtyID
0,A7AEED74714116F3B292A982238F83D2,1
1,B7B16ECF8CA53723593894116071700C,1
2,A1D33D0DFEC820B41B54430B50E96B5C,1
3,6CFE0E6127FA25DF2A0EF2AE1067D915,1
4,5D44EE6F2C3F71B73125876103C8F6C4,1
...,...,...
26197,9C8C4241D264F883F4587529316B042A,18
26198,EA9BE6EA49C5C752ABB11953955C90E4,18
26199,6E69EBBFAD976D4637BB4B39DE261BF7,18
26200,45C68484C6FC509CB25BDFCA881E5CD8,18


In [398]:
# sending dataframe to sql table
# junc_doctor_specialty
mdwarehouse_conn = MySqlEngineDW.connect()

rows_affected = official_junc_doctor_specialty.to_sql(
    name='junc_doctor_specialty',
    con=mdwarehouse_conn,
    if_exists='append',
    index=False,
    dtype={
        'doctorid': sqlalchemy.types.String(32),  # key must match the dataframe column name
        'specialtyID': sqlalchemy.types.Integer,
    },
    chunksize=5000,
    method='multi'
)

print("rows affected:" + str(rows_affected))

mdwarehouse_conn.close()

rows affected:26202
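
Because several regex passes feed the junction table, the same doctor and specialty pair could in principle be added more than once. A hedged sketch of two checks that could run before the load follows; the frames are made up and only the column names follow the notebook.

```python
import pandas as pd

dim_doctor = pd.DataFrame({"doctorID": ["A1", "B2", "C3"]})
junction = pd.DataFrame({
    "doctorid": ["A1", "A1", "B2", "Z9"],
    "specialtyID": [1, 1, 2, 3],
})

# Pairs that appear more than once (for example, a doctor matched by two
# regex passes that map to the same specialty).
duplicate_pairs = junction[junction.duplicated(subset=["doctorid", "specialtyID"])]

# Junction rows whose doctor is missing from the doctor dimension.
orphans = junction[~junction["doctorid"].isin(dim_doctor["doctorID"])]

# Keep one row per pair and only rows that reference a known doctor.
clean = (
    junction.drop_duplicates(subset=["doctorid", "specialtyID"])
            .loc[lambda df: df["doctorid"].isin(dim_doctor["doctorID"])]
            .reset_index(drop=True)
)
print(clean)
```

If the warehouse tables declare primary and foreign keys, the database would reject such rows anyway; checking in pandas first simply makes the cause easier to see.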


------

In [None]:
MySqlEngine.dispose()
MySqlEngineDW.dispose()