# Preprocessing our file

## Steps

- Remove duplicates.
- Check datetimes.
- Handle missing (NA) values.
- Check and standardize category labels (e.g., unify “mm” and “Millimeter” in precipitation units).
- Ensure all columns meet expectations (e.g., numerical data should not contain text).
- Save the cleaned dataset as cleaned_Synthetic_Wind_Power.csv.
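Two of the steps above (label standardization and datetime checking) are not shown explicitly in the cells that follow, so here is a minimal sketch of what they could look like in plain pandas. The toy data, the `UNIT_ALIASES` mapping, and the `standardize_units` helper are assumptions made for illustration; they are not part of the `tomodachi_core` API.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
raw = pd.DataFrame({
    "Timestamp": ["2020-01-01 00:00:00", "2020-01-01 01:00:00", "not a date"],
    "Precipitation": [8.5, 0.6, 2.2],
    "Precipitation_Unit": ["mm", "Millimeter", " MM "],
})

# Assumed mapping of label variants to the canonical unit.
UNIT_ALIASES = {"mm": "mm", "millimeter": "mm", "millimeters": "mm"}


def standardize_units(units: pd.Series) -> pd.Series:
    """Trim, lowercase, and map known variants to the canonical label."""
    normalized = units.str.strip().str.lower()
    # Unknown labels are kept (normalized) so they can be inspected later.
    return normalized.map(UNIT_ALIASES).fillna(normalized)


raw["Precipitation_Unit"] = standardize_units(raw["Precipitation_Unit"])

# errors="coerce" turns unparseable timestamps into NaT so they can be counted.
raw["Timestamp"] = pd.to_datetime(raw["Timestamp"], errors="coerce")
bad_timestamps = int(raw["Timestamp"].isna().sum())
```

Keeping unknown labels instead of dropping them is a deliberate choice in this sketch: it makes unexpected unit spellings visible rather than silently turning them into missing values.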

## How do we handle it?

It turns out that all we have to do is import our API and use it; nothing more is needed.
Currently we can work with the datetime format, handle NA values, and remove duplicates. The last
piece of the puzzle is handling types.

Import all the packages

In [1]:
# import all package functions
from tomodachi_core.tomodachi.services import PandasService, DatetimeService
from tomodachi_core.tomodachi.services.preprocess import PreprocessData
from tomodachi_core.tomodachi.utils.check_data import validate
from tomodachi_core.tomodachi.utils import SUBSETS
from config_loader import load_config

# prepare to import the file
import os
import pathlib

# get curr dir
current_dir = os.getcwd()

# find the project root using pathlib
root_dir = pathlib.Path(current_dir).parents[0].resolve()

# path to the config
config_path = (root_dir / "tomodachi_core" / "config_development" / "config.py").resolve()

# load the config
config = load_config(config_path)

# Grab the CSV_PATH
DIRTY_PATH = config.CSV_PATH

# Finally, we have to combine the path
DIRTY_PATH = (root_dir / DIRTY_PATH).resolve()

Path C:\Users\Lenovo\Desktop\python_app\tuuleenergia_tomodachi exists.
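The implementation of `load_config` is not shown in this notebook. As a rough sketch of how a Python file can be loaded as a configuration module and combined with a root directory, the example below uses only the standard library. The name `load_config_sketch`, the temporary `config.py`, and its `CSV_PATH` value are made up for the demonstration and may differ from what the project does.

```python
import importlib.util
import pathlib
import tempfile
from types import ModuleType


def load_config_sketch(path: pathlib.Path) -> ModuleType:
    """Import a Python file as a module so its constants can be read as attributes."""
    if not path.exists():
        raise FileNotFoundError(f"Config file not found: {path}")
    spec = importlib.util.spec_from_file_location("config_sketch", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load config from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway config file; the CSV_PATH value here is invented.
with tempfile.TemporaryDirectory() as tmp:
    root_dir = pathlib.Path(tmp)
    config_file = root_dir / "config.py"
    config_file.write_text('CSV_PATH = "data/raw/example.csv"\n')

    config = load_config_sketch(config_file)
    # Same idea as in the cell above: join the relative path onto the root.
    dirty_path = (root_dir / config.CSV_PATH).resolve()
    relative_part = dirty_path.relative_to(root_dir.resolve()).as_posix()
```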


In [2]:
# Initialize services
pandas_service = PandasService(str(DIRTY_PATH))
loaded_data = pandas_service.load_csv_data()

if pandas_service.is_dataframe(loaded_data).is_ok():
    df = loaded_data
else:
    print("Could not load the data: ", loaded_data.unwrap())



In [3]:
df

Unnamed: 0,Timestamp,Wind_Speed,Wind_Gust,Wind_Direction,Temperature,Humidity,Precipitation,Pressure,Cloud_Cover,Solar_Radiation,Hour_of_Day,Day_of_Week,Month,Wind_Speed_Squared,Wind_Speed_Cubed,Power_Output,Precipitation_Unit
0,2020-01-01 00:00:00,0.265077,4.876651,274,2.152885,88.345267,8.519070,997.887007,18.112024,359.985040,0,2,1,0.070266,0.018626,6.001146,mm
1,2020-01-01 01:00:00,10.727089,13.030088,232,-9.783598,70.172549,0.604355,1021.258081,76.901148,414.169416,1,2,1,115.070448,1234.370982,128.699686,mm
2,2020-01-01 02:00:00,16.280163,17.651853,175,2.125048,73.305597,2.239670,983.310126,49.615723,150.986509,2,2,1,265.043714,4314.954915,370.222084,mm
3,2020-01-01 03:00:00,5.110434,9.387124,235,-0.174307,60.924091,1.314233,1032.161577,48.975636,879.761958,3,2,1,26.116535,133.466829,-30.746944,mm
4,2020-01-01 04:00:00,16.444265,20.349305,1,1.881483,78.893660,9.442267,951.217727,48.506623,752.151006,4,2,1,270.413853,4446.757073,495.256687,mm
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30295,2023-03-09 00:00:00,19.155799,21.753286,48,,71.432872,8.045355,970.138601,19.993591,981.661646,0,3,3,366.944632,7029.117569,697.938093,mm
30296,2021-06-20 04:00:00,4.255135,6.760641,143,29.918565,68.584030,3.477158,962.814249,22.214235,437.033112,4,6,6,18.106171,77.044199,28.517725,mm
30297,2021-03-26 06:00:00,14.315677,15.489399,47,3.875054,50.875164,1.915244,1014.743275,98.855762,162.440760,6,4,3,204.938595,2933.834635,362.628620,mm
30298,2020-03-27 13:00:00,18.209074,23.172418,71,7.155587,64.865156,5.935101,1035.863647,20.872151,953.018941,13,4,3,331.570365,6037.589202,607.445627,mm


In [4]:
df.dtypes

Timestamp             datetime64[ns]
Wind_Speed                   float64
Wind_Gust                    float64
Wind_Direction                 int64
Temperature                  float64
Humidity                     float64
Precipitation                float64
Pressure                     float64
Cloud_Cover                  float64
Solar_Radiation              float64
Hour_of_Day                    int64
Day_of_Week                    int64
Month                          int64
Wind_Speed_Squared           float64
Wind_Speed_Cubed             float64
Power_Output                 float64
Precipitation_Unit            object
dtype: object

## Here we can show that:

- Pandas and NumPy automatically parse all timestamps into the `datetime64[ns]` dtype.
- There is no `Unnamed: 0` column.
- No `millimeters` labels appear in place of `mm`.
- Missing values (NaN) are still present in several numeric columns at this stage, as the counts in cell 9 show; they are imputed later. Type safety is enforced with Pandas and SciPy.

In [5]:
pandas_service.get_df()

Ok(                Timestamp  Wind_Speed  Wind_Gust  Wind_Direction  Temperature  \
0     2020-01-01 00:00:00    0.265077   4.876651             274     2.152885   
1     2020-01-01 01:00:00   10.727089  13.030088             232    -9.783598   
2     2020-01-01 02:00:00   16.280163  17.651853             175     2.125048   
3     2020-01-01 03:00:00    5.110434   9.387124             235    -0.174307   
4     2020-01-01 04:00:00   16.444265  20.349305               1     1.881483   
...                   ...         ...        ...             ...          ...   
30295 2023-03-09 00:00:00   19.155799  21.753286              48          NaN   
30296 2021-06-20 04:00:00    4.255135   6.760641             143    29.918565   
30297 2021-03-26 06:00:00   14.315677  15.489399              47     3.875054   
30298 2020-03-27 13:00:00   18.209074  23.172418              71     7.155587   
30299 2022-12-03 09:00:00    2.660925   5.066079             191     1.515736   

        Humidity  Precip

In [6]:
# MORE EXPLICITLY:
precipitation_units_df = pandas_service.get_df().unwrap()['Precipitation_Unit'] # We removed millimeters? Hm. But we did not call a single function.

In [7]:
# `in` on a Series checks index labels, so test the values with isin() instead
print("We have millimeters at all?", bool(precipitation_units_df.isin(["mL", "mM", "milimeter"]).any()))

We have millimeters at all? False


In [8]:
"Mm" in precipitation_units_df.values  # check the values, not the index labels

False
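A point worth keeping in mind for membership checks like the ones above: the `in` operator on a pandas Series tests the index labels, not the stored values. The small sketch below, using a made-up Series, shows the difference and two value-based alternatives.

```python
import pandas as pd

# Made-up unit labels for illustration only.
units = pd.Series(["mm", "Mm", "millimeters", "mm"], name="Precipitation_Unit")

# `in` on a Series looks at the index labels (0, 1, 2, 3), not the values.
label_hit = "Mm" in units   # False, even though "Mm" is one of the values
index_hit = 1 in units      # True, because 1 is an index label

# Value-based checks:
value_hit = "Mm" in units.values
any_wrong = bool(units.isin(["Mm", "millimeters"]).any())

# Listing every label that is not the canonical one is often more informative.
non_canonical = sorted(units[units != "mm"].unique())
```

Inspecting `non_canonical` (or `value_counts()`) avoids having to guess the misspellings in advance.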

In [9]:
our_loaded_df = pandas_service.get_df().expect("DataFrame is not empty")
print("How many NaNs we got there? Your guess?") # spoiler: not 0
print("Actual: ", our_loaded_df.isna().sum()) # Not zero yet: five columns still hold roughly 1,500 NaNs each

How many NaNs we got there? Your guess?
Actual:  Timestamp                0
Wind_Speed            1520
Wind_Gust                0
Wind_Direction           0
Temperature           1522
Humidity              1514
Precipitation            0
Pressure                 0
Cloud_Cover              0
Solar_Radiation       1517
Hour_of_Day              0
Day_of_Week              0
Month                    0
Wind_Speed_Squared       0
Wind_Speed_Cubed         0
Power_Output          1511
Precipitation_Unit       0
dtype: int64


#### At this point, we handle duplicate values


Notice that we store the current column subsets for our models in `SUBSETS`. The function is generic: it accepts the columns as input rather than hard-coding them.

We match on the result: if the removal succeeded, we get an `Ok` value and then update the DataFrame. If we encounter an error, we have two options:

- Shut down the application or our analysis
- Propagate the error from lower layer to higher and then study what happened -> fix.

In [10]:
SUBSETS

{'subsets': ['Timestamp', 'Wind_Speed', 'Wind_Gust', 'Wind_Direction']}

In [11]:
pandas_service.get_df().unwrap().duplicated(subset=SUBSETS.get("subsets", ["Timestamp"])).sum()

np.int64(300)
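To make the role of `subset` concrete, here is a small sketch in plain pandas with made-up rows. It shows that counting duplicates on a subset of columns can give a different number than counting fully identical rows. How `remove_duplicates` is implemented inside `PandasService` is not shown in this notebook, so the `drop_duplicates` call below is only an assumption about the general idea.

```python
import pandas as pd

# Made-up rows: row 2 repeats row 0 entirely, row 3 repeats row 1 only on the subset.
frame = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 01:00",
        "2020-01-01 00:00", "2020-01-01 01:00",
    ]),
    "Wind_Speed": [0.27, 10.73, 0.27, 10.73],
    "Temperature": [2.15, -9.78, 2.15, 5.00],
})
subset = ["Timestamp", "Wind_Speed"]

# duplicated() marks every repeat after the first occurrence.
dup_on_subset = int(frame.duplicated(subset=subset).sum())
dup_on_all = int(frame.duplicated().sum())

# Keep the first occurrence and renumber the index from zero.
deduped = frame.drop_duplicates(subset=subset, keep="first").reset_index(drop=True)
```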

In [12]:
response = pandas_service.remove_duplicates(subset=SUBSETS.get("subsets", ["Timestamp"]))

In [13]:
response

Ok(                Timestamp  Wind_Speed  Wind_Gust  Wind_Direction  Temperature  \
0     2020-01-01 00:00:00    0.265077   4.876651             274     2.152885   
1     2020-01-01 01:00:00   10.727089  13.030088             232    -9.783598   
2     2020-01-01 02:00:00   16.280163  17.651853             175     2.125048   
3     2020-01-01 03:00:00    5.110434   9.387124             235    -0.174307   
4     2020-01-01 04:00:00   16.444265  20.349305               1     1.881483   
...                   ...         ...        ...             ...          ...   
29995 2023-06-03 19:00:00   14.038603  15.448871              33    17.966169   
29996 2023-06-03 20:00:00         NaN  20.936323             113    20.851254   
29997 2023-06-03 21:00:00    7.267274   9.518739             294    26.768642   
29998 2023-06-03 22:00:00   17.993004  20.718309             332    28.712751   
29999 2023-06-03 23:00:00   18.313681  18.322069              69    12.556596   

        Humidity  Precip

In [14]:
cleaned_df = response.unwrap()

In [15]:
cleaned_df.dtypes

Timestamp             datetime64[ns]
Wind_Speed                   float64
Wind_Gust                    float64
Wind_Direction                 int64
Temperature                  float64
Humidity                     float64
Precipitation                float64
Pressure                     float64
Cloud_Cover                  float64
Solar_Radiation              float64
Hour_of_Day                    int64
Day_of_Week                    int64
Month                          int64
Wind_Speed_Squared           float64
Wind_Speed_Cubed             float64
Power_Output                 float64
Precipitation_Unit            object
dtype: object

At this point we simply check types and finish off the preprocessing

In [16]:
is_validated = validate(cleaned_df) # we can create Getter
if not is_validated:
    print("Failure! Some types are invalid!")
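The internals of `validate` are not shown here. As a rough idea of what a dtype check can look like, the sketch below compares a few columns against expected dtype families using `pandas.api.types`. The `EXPECTED_KINDS` mapping and the `find_dtype_problems` helper are hypothetical and cover only a handful of the dataset's columns.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical expectations for a few columns of the dataset.
EXPECTED_KINDS = {
    "Timestamp": ptypes.is_datetime64_any_dtype,
    "Wind_Speed": ptypes.is_float_dtype,
    "Wind_Direction": ptypes.is_integer_dtype,
    "Precipitation_Unit": ptypes.is_string_dtype,
}


def find_dtype_problems(df: pd.DataFrame, expected=EXPECTED_KINDS) -> list[str]:
    """Return one message per column that is missing or has an unexpected dtype."""
    problems = []
    for column, check in expected.items():
        if column not in df.columns:
            problems.append(f"{column}: missing")
        elif not check(df[column]):
            problems.append(f"{column}: unexpected dtype {df[column].dtype}")
    return problems


good = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2020-01-01 00:00:00"]),
    "Wind_Speed": [0.265077],
    "Wind_Direction": [274],
    "Precipitation_Unit": ["mm"],
})
# Text in a numeric column should be reported.
bad = good.assign(Wind_Speed=["fast"])

good_problems = find_dtype_problems(good)
bad_problems = find_dtype_problems(bad)
```

Returning a list of messages, instead of a single boolean, makes it easier to see which column failed.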

In [17]:
cleaned_df.isna().sum()

Timestamp                0
Wind_Speed            1500
Wind_Gust                0
Wind_Direction           0
Temperature           1500
Humidity              1500
Precipitation            0
Pressure                 0
Cloud_Cover              0
Solar_Radiation       1500
Hour_of_Day              0
Day_of_Week              0
Month                    0
Wind_Speed_Squared       0
Wind_Speed_Cubed         0
Power_Output          1500
Precipitation_Unit       0
dtype: int64

In [18]:
# We can use Imputer
from tomodachi_core.common_types.option import Some
from tomodachi_core.common_types.result import Ok, Err
imputed_df = None

# print(pandas_service.preprocess_df(cleaned_df, "mean", None))

result = pandas_service.preprocess_df(cleaned_df, "mean", None)

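match result:
    # Check for None first: the guard below would fail on a None value
    case None:
        print("We could not impute the dataframe")
    case some_df if some_df.is_some():
        imputed_df = some_df
    case _:
        print("We could not impute the dataframe")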



In [19]:
imputed_df

Some(                Timestamp  Wind_Speed  Wind_Gust  Wind_Direction  Temperature  \
0     2020-01-01 00:00:00    0.265077   4.876651           274.0     2.152885   
1     2020-01-01 01:00:00   10.727089  13.030088           232.0    -9.783598   
2     2020-01-01 02:00:00   16.280163  17.651853           175.0     2.125048   
3     2020-01-01 03:00:00    5.110434   9.387124           235.0    -0.174307   
4     2020-01-01 04:00:00   16.444265  20.349305             1.0     1.881483   
...                   ...         ...        ...             ...          ...   
29995 2023-06-03 19:00:00   14.038603  15.448871            33.0    17.966169   
29996 2023-06-03 20:00:00   11.994570  20.936323           113.0    20.851254   
29997 2023-06-03 21:00:00    7.267274   9.518739           294.0    26.768642   
29998 2023-06-03 22:00:00   17.993004  20.718309           332.0    28.712751   
29999 2023-06-03 23:00:00   18.313681  18.322069            69.0    12.556596   

        Humidity  Prec

In [20]:
imputed_df.unwrap().isna().sum()

Timestamp             0
Wind_Speed            0
Wind_Gust             0
Wind_Direction        0
Temperature           0
Humidity              0
Precipitation         0
Pressure              0
Cloud_Cover           0
Solar_Radiation       0
Hour_of_Day           0
Day_of_Week           0
Month                 0
Wind_Speed_Squared    0
Wind_Speed_Cubed      0
Power_Output          0
Precipitation_Unit    0
dtype: int64
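In the `imputed_df` output above, `Wind_Direction` is printed as `274.0` where it was `274` before, which suggests that the imputation step returns integer columns as floats. If the original integer dtypes matter, they can be restored afterwards. The sketch below simulates this with made-up data in plain pandas; it assumes the integer columns had no missing values to begin with, as is the case for `Wind_Direction` here.

```python
import pandas as pd

# Made-up data: one integer column without gaps, one float column with a gap.
before = pd.DataFrame({"Wind_Direction": [274, 232], "Wind_Speed": [0.27, None]})

# Simulate an imputer that returns every numeric column as float64.
after = before.astype("float64").fillna({"Wind_Speed": before["Wind_Speed"].mean()})

# Remember which columns were integers and cast them back.
int_columns = [c for c in before.columns if pd.api.types.is_integer_dtype(before[c])]
restored = after.astype({c: before[c].dtype for c in int_columns})
```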

#### What is going on? We showcase that different columns can be imputed with different strategies.

In [21]:
# We can also choose different strategies: 
Temperatures = cleaned_df[["Temperature"]]
Wind_Speed = cleaned_df[["Wind_Speed"]]
Humidity = cleaned_df[["Humidity"]]

imputed_temperatures = pandas_service.preprocess_df(Temperatures, "median", None).unwrap()
imputed_wind_speed = pandas_service.preprocess_df(Wind_Speed, "mean", None).unwrap()
imputed_humidity = pandas_service.preprocess_df(Humidity, "most_frequent", None).unwrap() # Some(Humidity_def)
new_imputed_df = cleaned_df.copy()

In [22]:
columns_to_impute = [
    ("Temperature", imputed_temperatures),
    ("Wind_Speed", imputed_wind_speed),
    ("Humidity", imputed_humidity),
]

for col, result in columns_to_impute:
    new_imputed_df[col] = result

In [23]:
new_imputed_df.isna().sum()

Timestamp                0
Wind_Speed               0
Wind_Gust                0
Wind_Direction           0
Temperature              0
Humidity                 0
Precipitation            0
Pressure                 0
Cloud_Cover              0
Solar_Radiation       1500
Hour_of_Day              0
Day_of_Week              0
Month                    0
Wind_Speed_Squared       0
Wind_Speed_Cubed         0
Power_Output          1500
Precipitation_Unit       0
dtype: int64
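The counts above still show 1500 missing values in `Solar_Radiation` and `Power_Output`, because only three columns were imputed in this demonstration. One way to keep track of which column gets which strategy is a single mapping that is applied in a loop. The sketch below uses plain pandas on made-up data; the `STRATEGIES` mapping and the `fill_value` helper are illustrative and are not the project's `preprocess_df`.

```python
import pandas as pd

# Made-up data with one gap per column.
toy = pd.DataFrame({
    "Temperature": [1.0, None, 3.0, 100.0],
    "Wind_Speed": [2.0, 4.0, None, 6.0],
    "Humidity": [50.0, 50.0, 70.0, None],
    "Solar_Radiation": [None, 300.0, 500.0, 400.0],
})

# Hypothetical per-column strategies, named like the ones used above.
STRATEGIES = {
    "Temperature": "median",
    "Wind_Speed": "mean",
    "Humidity": "most_frequent",
    "Solar_Radiation": "mean",
}


def fill_value(column: pd.Series, strategy: str) -> float:
    """Compute the replacement value for one column (NaNs are ignored)."""
    if strategy == "mean":
        return float(column.mean())
    if strategy == "median":
        return float(column.median())
    if strategy == "most_frequent":
        return float(column.mode().iloc[0])
    raise ValueError(f"unknown strategy: {strategy}")


imputed = toy.copy()
for name, strategy in STRATEGIES.items():
    imputed[name] = toy[name].fillna(fill_value(toy[name], strategy))

remaining = int(imputed.isna().sum().sum())
```

Listing every column with gaps in the mapping makes it harder to forget one, which is what the remaining counts above illustrate.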

### The end: saving the file

The save call is commented out because we do not want to write the file on every run; it only needs to be saved once.

In [25]:
# Update the df
pandas_service.update_dataframe(imputed_df.unwrap())

INFO:root:DataFrame updated successfully.


<tomodachi_core.tomodachi.services.pandas_service.PandasService at 0x1d3963df770>

In [26]:
pandas_service.get_df().unwrap().isna().sum()

Timestamp             0
Wind_Speed            0
Wind_Gust             0
Wind_Direction        0
Temperature           0
Humidity              0
Precipitation         0
Pressure              0
Cloud_Cover           0
Solar_Radiation       0
Hour_of_Day           0
Day_of_Week           0
Month                 0
Wind_Speed_Squared    0
Wind_Speed_Cubed      0
Power_Output          0
Precipitation_Unit    0
dtype: int64

In [None]:
# At this point our job is done.
# We keep this commented out because we do NOT save the file every time
#path_to_save = (root_dir / "shared" / "data" / "processed" / "cleaned_Synthetic_Wind_Power.csv").resolve()
#pandas_service.save_dataframe(path_to_save)

INFO:root:DataFrame saved successfully to C:\Users\Lenovo\Desktop\python_app\tuuleenergia_tomodachi\shared\data\processed\cleaned_Synthetic_Wind_Power.csv.


<tomodachi_core.tomodachi.services.pandas_service.PandasService at 0x1d3963df770>