## Data Processing Notebook

In this section, we process and normalize the data before importing it into the database, preserving the relationships between the tables. The resulting tables are named CardioTrainNormalize, GlucoseTypes, CholesterolTypes, and CauseOfDeaths.

Make sure you already have your own .env file that defines your environment variables, including WORK_DIR, which the first cell reads.

In [1]:
import sys
import os
from dotenv import load_dotenv

load_dotenv()
work_dir = os.getenv('WORK_DIR')
sys.path.append(work_dir)
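
If WORK_DIR is missing from the .env file, `os.getenv` returns `None` and the imports in the next cell fail with a less obvious error. A minimal sketch of a guard, assuming a hypothetical helper name `add_work_dir_to_path` that is not part of the project:

```python
import os
import sys


def add_work_dir_to_path(env_var: str = "WORK_DIR") -> str:
    """Append the directory stored in `env_var` to sys.path.

    Hypothetical helper: raises early if the variable is unset or empty,
    instead of silently appending None to sys.path.
    """
    work_dir = os.getenv(env_var)
    if not work_dir:
        raise RuntimeError(f"{env_var} is not set; check your .env file")
    if work_dir not in sys.path:
        sys.path.append(work_dir)
    return work_dir
```

In the notebook, this could be called right after `load_dotenv()` in place of the last two lines of the cell above.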

## Libraries & Data Loading

In [2]:
from src.model.models import CardioTrainNormalize, GlucoseTypes, CholesterolTypes, CauseOfDeaths
from src.database.dbconnection import getconnection
from sqlalchemy import inspect
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import SQLAlchemyError
from transform.TransformData import DataTransform, DataTransformCauseOfDeaths

Using the SQLAlchemy library, connect to the database. If you encounter any issues, check that your .env file contains the correct environment variables and try again.

In [3]:
engine = getconnection()
Session = sessionmaker(bind=engine)
session = Session()

Conected successfully to database airflow_project!


Create the category tables (GlucoseTypes and CholesterolTypes) first, because the CardioTrainNormalize table references them through foreign keys; creating them in this order avoids errors. Also make sure no tables with the same names already exist: if they do, drop them before creating the new ones.

In [10]:
try:
    if inspect(engine).has_table('CauseOfDeaths'):
        CauseOfDeaths.__table__.drop(engine, checkfirst=True)
    CauseOfDeaths.__table__.create(engine)
    print("Table created successfully.")
except SQLAlchemyError as e:
    print(f"Error creating table: {e}")
finally:
    engine.dispose()

Table created successfully.


In [11]:
try:
    if inspect(engine).has_table('GlucoseTypes'):
        if inspect(engine).has_table('CardioTrainNormalize'):
            CardioTrainNormalize.__table__.drop(engine)
        GlucoseTypes.__table__.drop(engine, checkfirst=True)
    GlucoseTypes.__table__.create(engine)
    print("Table created successfully.")
except SQLAlchemyError as e:
    print(f"Error creating table: {e}")
finally:
    engine.dispose()

Table created successfully.


In [12]:
try:
    if inspect(engine).has_table('CholesterolTypes'):
        if inspect(engine).has_table('CardioTrainNormalize'):
            CardioTrainNormalize.__table__.drop(engine)
        CholesterolTypes.__table__.drop(engine, checkfirst=True)
    CholesterolTypes.__table__.create(engine)
    print("Table created successfully.")
except SQLAlchemyError as e:
    print(f"Error creating table: {e}")
finally:
    engine.dispose()

Table created successfully.


In [13]:
try:
    if inspect(engine).has_table('CardioTrainNormalize'):
        CardioTrainNormalize.__table__.drop(engine)
    CardioTrainNormalize.__table__.create(engine)
    print("Table created successfully.")
except SQLAlchemyError as e:
    print(f"Error creating table: {e}")
finally:
    engine.dispose()

Table created successfully.
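
The drop and create ordering handled by hand in the cells above can also be delegated to SQLAlchemy, which sorts tables by their foreign key dependencies. The sketch below uses an in-memory SQLite database and simplified, hypothetical column definitions (the real models live in `src/model/models.py` and are not shown here); with declarative models, the equivalent object would presumably be the base class's `metadata`.

```python
from sqlalchemy import (
    Column, ForeignKey, Integer, MetaData, String, Table, create_engine, inspect,
)

metadata = MetaData()

# Hypothetical, simplified schema. The child table is declared first on purpose:
# the creation order is derived from the foreign keys, not from declaration order.
Table(
    "CardioTrainNormalize", metadata,
    Column("id", Integer, primary_key=True),
    Column("gluc", Integer, ForeignKey("GlucoseTypes.id")),
    Column("cholesterol", Integer, ForeignKey("CholesterolTypes.id")),
)
Table(
    "GlucoseTypes", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)
Table(
    "CholesterolTypes", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)

demo_engine = create_engine("sqlite:///:memory:")

metadata.drop_all(demo_engine)    # drops dependent tables first, skips missing ones
metadata.create_all(demo_engine)  # creates referenced tables before their dependents

creation_order = [table.name for table in metadata.sorted_tables]
created_tables = sorted(inspect(demo_engine).get_table_names())
```

Here `creation_order` lists the two category tables before CardioTrainNormalize, which is the same order the notebook enforces manually.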


## Transformations on cardio_train.csv:

1. Gender Categorization: The gender column was mapped from numeric values to categorical labels ('Female' and 'Male').

2. Cholesterol Categorization: The cholesterol column was converted from numeric levels to categories like 'normal', 'above normal', and 'well above normal'.

3. Glucose Categorization: Similar to cholesterol, the gluc column was categorized into 'normal', 'above normal', and 'well above normal'.

4. BMI Calculation: The Body Mass Index (BMI) was calculated using the formula: BMI = weight / (height in meters)^2. The result was stored in a new bmi column.

5. Age Calculation: The age column, initially recorded in days, was converted to years by dividing by 365.3 and taking the floor value.

6. Blood Pressure Standardization:

- Absolute values of ap_hi (systolic) and ap_lo (diastolic) were taken.
- Records with systolic blood pressure below 80 or above 250, and diastolic below 50 or above 150, were removed.
- Rows where systolic pressure equaled diastolic pressure were also removed.

7. BMI Categorization (CategorizeBMI): The bmi column was categorized into different classes based on predefined BMI ranges, and the results were stored in a new bmi_class column.

8. Blood Pressure Categorization (categorize_blood_pressure): Blood pressure readings were categorized into different levels based on standard hypertension guidelines, and the results were stored in a bp_cat column.

9. Pulse Pressure Calculation (CalculatePulsePressure): A new column pulse_press was created to store the difference between systolic (ap_hi) and diastolic (ap_lo) blood pressure, known as pulse pressure.
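
The numeric steps in the list above (BMI, age, blood pressure filtering, and pulse pressure) can be sketched with pandas as follows. This is an illustration on made-up rows, not the code of `DataTransform`; it assumes that height is stored in centimeters and weight in kilograms.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse the column names mentioned above.
df = pd.DataFrame({
    "age": [18393, 20228, 17474],   # days
    "height": [168, 156, 170],      # assumed to be centimeters
    "weight": [62.0, 85.0, 70.0],   # assumed to be kilograms
    "ap_hi": [110, -140, 120],      # systolic
    "ap_lo": [80, 90, 120],         # diastolic
})

# BMI = weight / (height in meters)^2
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Age in whole years: divide by 365.3 and take the floor.
df["age"] = np.floor(df["age"] / 365.3).astype(int)

# Blood pressure: absolute values, then drop out-of-range and equal readings.
df["ap_hi"] = df["ap_hi"].abs()
df["ap_lo"] = df["ap_lo"].abs()
in_range = df["ap_hi"].between(80, 250) & df["ap_lo"].between(50, 150)
df = df[in_range & (df["ap_hi"] != df["ap_lo"])].copy()

# Pulse pressure = systolic - diastolic
df["pulse_press"] = df["ap_hi"] - df["ap_lo"]
```

The third row is removed because its systolic and diastolic values are equal, and the negative systolic value in the second row is kept after taking its absolute value.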



## Normalization of Glucose and Cholesterol Levels:

1. Glucose Normalization (nomalize_gluc): Unique glucose levels were extracted and mapped to new IDs. The original gluc column was replaced with these IDs, and the glucose types were stored in a separate table called GlucoseTypes.
2. Cholesterol Normalization (normalize_cholesterol): Similar to glucose, cholesterol levels were normalized and stored in a CholesterolTypes table, with the original cholesterol column being replaced by corresponding IDs.

## Transformations on cause_of_deaths.csv:

1. ID Insertion: A new id column was created to uniquely identify each row.

2. Column Dropping: The Code column was removed from the dataset.

3. Total Deaths Calculation: A TotalDeaths column was added, summing up all death counts across the specified columns. The dataset was then reorganized to display this new column alongside specific causes of death, like Cardiovascular.
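
As a rough illustration of the cause_of_deaths.csv steps, the sketch below applies them to a tiny made-up frame. Only the Code, Cardiovascular, TotalDeaths, and id column names come from the description above; the other columns and all values are invented for the example, and this is not the code of `DataTransformCauseOfDeaths`.

```python
import pandas as pd

# Made-up data; "Country" and "Malaria" are hypothetical column names.
deaths = pd.DataFrame({
    "Country": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "Cardiovascular": [100, 250],
    "Malaria": [10, 0],
})

# Column dropping: remove the Code column.
deaths = deaths.drop(columns=["Code"])

# Total deaths: row-wise sum over the cause-of-death columns.
cause_columns = ["Cardiovascular", "Malaria"]
deaths["TotalDeaths"] = deaths[cause_columns].sum(axis=1)

# ID insertion: a 1-based identifier placed as the first column.
deaths.insert(0, "id", range(1, len(deaths) + 1))
```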

These transformations were designed to clean, standardize, and normalize the data for further analysis and storage in an SQL database.
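
The glucose and cholesterol normalization described earlier amounts to building a lookup table and replacing each label with its ID. A minimal sketch, assuming a hypothetical helper `normalize_column` and hypothetical `id`/`name` column names for the lookup table (the actual methods and table schemas are not shown in this notebook):

```python
import pandas as pd


def normalize_column(df: pd.DataFrame, column: str):
    """Replace a categorical column with integer IDs.

    Returns the modified copy of `df` and a lookup table with one row
    per distinct category, numbered from 1 in order of first appearance.
    """
    categories = df[column].drop_duplicates().reset_index(drop=True)
    lookup = pd.DataFrame({"id": range(1, len(categories) + 1), "name": categories})
    mapping = dict(zip(lookup["name"], lookup["id"]))
    out = df.copy()
    out[column] = out[column].map(mapping)
    return out, lookup


sample = pd.DataFrame({"gluc": ["normal", "above normal", "normal", "well above normal"]})
normalized, glucose_types = normalize_column(sample, "gluc")
```

The lookup table would be written to the category table first, and the frame with IDs to the main table afterwards, so that the foreign keys always point at existing rows.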

In [14]:
try:
    #Cardio train transform
    file = DataTransform('../../data/cardio_train.csv')
    file.gender_by_category()
    file.cholesterol_by_category()
    file.gluc_by_category()
    file.bmi()
    file.days_to_age()
    file.StandardizeBloodPressure()
    file.CategorizeBMI()
    file.categorize_blood_pressure()
    file.CalculatePulsePressure()

    file.df.to_sql('CardioTrainNormalize', con=engine, if_exists='append', index=False)

    #Cause Of Deaths transform
    file2 = DataTransformCauseOfDeaths('../../data/cause_of_deaths.csv')
    file2.total_deaths()
    file2.insert_id()

    file2.df.to_sql('CauseOfDeaths', con=engine, if_exists='append', index=False)


    print("Data uploaded")

except SQLAlchemyError as e:
    print(f"Database error: {e}")

except Exception as e:
    print(f"Error: {e}")

finally:
    if hasattr(engine, 'dispose'):
        engine.dispose()
    if 'session' in locals():
        session.close()

Error: Choicelist and default value do not have a common dtype: The DType <class 'numpy.dtypes._PyLongDType'> could not be promoted by <class 'numpy.dtypes.StrDType'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtypes.StrDType'>, <class 'numpy.dtypes.StrDType'>, <class 'numpy.dtypes.StrDType'>, <class 'numpy.dtypes.StrDType'>, <class 'numpy.dtypes.StrDType'>, <class 'numpy.dtypes.StrDType'>, <class 'numpy.dtypes.StrDType'>, <class 'numpy.dtypes._PyLongDType'>)
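
The error above is the message `numpy.select` produces when the choices are strings but the `default` value is an integer (the default is `0` unless one is passed). Recent NumPy versions refuse to promote strings and integers to a common dtype. The categorization methods in `TransformData` are not shown here, so the following is only a sketch of the likely fix, with hypothetical BMI thresholds and labels: pass a string `default`.

```python
import numpy as np

bmi = np.array([17.0, 22.5, 27.0, 31.0, np.nan])

# Hypothetical ranges and labels, for illustration only.
conditions = [bmi < 18.5, bmi < 25, bmi < 30, bmi >= 30]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# A string default shares a dtype with the string choices, so NumPy can
# build a single result array. Rows matching no condition (here the NaN)
# receive the default.
bmi_class = np.select(conditions, labels, default="Unknown")
```

If the categorization methods are updated along these lines, the cell should be rerun after recreating the tables so that the upload starts from an empty schema.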


Once the transformed health data is stored in the database, the next phase is to extract new data from the API, which will provide additional information to enrich the analysis.

Now you can proceed with the next notebook: [004_API.ipynb](../004_API.ipynb)
