 <img width="60px" style="float: right;" src="https://xmks.s3.amazonaws.com/2020/X-Blue.png">
 
 # 🥇 Golden Batch - Production Quality: Generate Data
 
 ---
 **By Jaun van Heerden**

In [1]:
import pandas as pd

In [2]:
# Load the process data (X) and the quality data (Y) from local CSV files
df_X = pd.read_csv("data/data_X.csv")
df_X['date_time'] = pd.to_datetime(df_X['date_time'])
df_X = df_X.set_index('date_time')

df_Y = pd.read_csv("data/data_Y.csv")
df_Y['date_time'] = pd.to_datetime(df_Y['date_time'])
df_Y = df_Y.set_index('date_time')

# Display the first 150 rows of the quality data to get an initial overview
df_Y.head(150)

date_time,quality
2015-01-04 00:05:00,392
2015-01-04 01:05:00,384
2015-01-04 02:05:00,393
2015-01-04 03:05:00,399
2015-01-04 04:05:00,400
...,...
2015-01-10 01:05:00,444
2015-01-10 02:05:00,435
2015-01-10 03:05:00,426
2015-01-10 04:05:00,418
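
The load, parse, and index steps above can be folded into a single `read_csv` call. The following is a minimal sketch, assuming the files contain a `date_time` column as shown in the output; the helper name `load_indexed_csv` and the in-memory sample are illustrative only and not part of the original notebook.

```python
import io

import pandas as pd


def load_indexed_csv(source):
    """Read a CSV and use its parsed 'date_time' column as the index.

    Hypothetical helper: the notebook itself calls read_csv, to_datetime
    and set_index as three separate steps.
    """
    return pd.read_csv(source, parse_dates=["date_time"], index_col="date_time")


# Tiny in-memory stand-in for data/data_Y.csv (rows copied from the output above)
sample_csv = io.StringIO(
    "date_time,quality\n"
    "2015-01-04 00:05:00,392\n"
    "2015-01-04 01:05:00,384\n"
)

df_sample = load_indexed_csv(sample_csv)
print(df_sample)
```

With the real files, the same helper would be called as `load_indexed_csv("data/data_Y.csv")`.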


In [3]:
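df = pd.DataFrame()

# Average the three temperature sensors in each of the five chambers
for i in range(1, 5 + 1):
    df[f'Avg_Temp_Chamber_{i}'] = df_X[[f'T_data_{i}_1', f'T_data_{i}_2', f'T_data_{i}_3']].mean(axis=1)

# Carry over the H_data and AH_data columns unchanged
df[["H_data", "AH_data"]] = df_X[["H_data", "AH_data"]]

# Attach to each process row the next quality measurement at or after its timestamp
merged_df = pd.merge_asof(df.sort_index(), df_Y.sort_index(), left_index=True, right_index=True, direction='forward')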

# Drop rows without a matched quality value; copy so that the column
# assignments below do not raise SettingWithCopyWarning
df_cleaned = merged_df.dropna(how='any').copy()

# Flag rows where AH_data differs from the previous row
df_cleaned['batch_change'] = df_cleaned['AH_data'].diff().fillna(0).ne(0)

# Number the batches by counting the changes seen so far
df_cleaned['batch_id'] = df_cleaned['batch_change'].cumsum()

# Row position within each batch; equals elapsed minutes assuming one sample per minute
df_cleaned['time_in_batch'] = df_cleaned.groupby('batch_id').cumcount()

df_cleaned

date_time,Avg_Temp_Chamber_1,Avg_Temp_Chamber_2,Avg_Temp_Chamber_3,Avg_Temp_Chamber_4,Avg_Temp_Chamber_5,H_data,AH_data,quality,batch_change,batch_id,time_in_batch
2015-01-01 00:00:00,211.000000,349.000000,476.000000,349.666667,241.666667,167.85,9.22,392.0,False,0,0
2015-01-01 00:01:00,211.333333,348.000000,476.333333,350.666667,241.666667,162.51,9.22,392.0,False,0,1
2015-01-01 00:02:00,211.333333,347.666667,476.666667,352.000000,241.666667,164.99,9.22,392.0,False,0,2
2015-01-01 00:03:00,211.666667,347.000000,477.000000,353.000000,241.666667,167.34,9.22,392.0,False,0,3
2015-01-01 00:04:00,211.666667,346.333333,477.666667,354.000000,242.000000,163.04,9.22,392.0,False,0,4
...,...,...,...,...,...,...,...,...,...,...,...
2018-05-03 23:01:00,255.666667,351.666667,447.000000,341.666667,256.000000,155.69,6.39,454.0,False,29190,1
2018-05-03 23:02:00,255.666667,351.666667,447.333333,341.333333,255.666667,155.33,6.39,454.0,False,29190,2
2018-05-03 23:03:00,255.666667,351.333333,447.333333,341.000000,255.333333,155.53,6.39,454.0,False,29190,3
2018-05-03 23:04:00,255.666667,351.333333,447.666667,340.333333,255.333333,153.74,6.39,454.0,False,29190,4
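
To make the `merge_asof(..., direction='forward')` step easier to follow, here is a small sketch on toy data. The sensor values are hypothetical; the point is that each minute-level row picks up the first quality measurement at or after its own timestamp, and rows later than the last measurement get `NaN`, which is presumably why the notebook follows the merge with `dropna`.

```python
import pandas as pd

# Toy minute-level sensor readings (hypothetical values)
sensors = pd.DataFrame(
    {"H_data": [167.85, 162.51, 164.99, 163.04]},
    index=pd.to_datetime(
        [
            "2015-01-04 00:04:00",
            "2015-01-04 00:05:00",
            "2015-01-04 00:06:00",
            "2015-01-04 01:06:00",
        ]
    ),
)

# Toy hourly quality measurements
quality = pd.DataFrame(
    {"quality": [392, 384]},
    index=pd.to_datetime(["2015-01-04 00:05:00", "2015-01-04 01:05:00"]),
)

# Forward match: take the first quality row whose timestamp is >= the sensor timestamp
merged = pd.merge_asof(
    sensors, quality, left_index=True, right_index=True, direction="forward"
)
print(merged)

# Rows after the last quality measurement have no forward match and are dropped
cleaned = merged.dropna(how="any")
```

In this sketch the 00:04 and 00:05 rows both receive the 00:05 measurement, the 00:06 row receives the 01:05 measurement, and the 01:06 row is left unmatched.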

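The batch labelling used above (a new batch starts whenever `AH_data` changes) can be checked on a toy series. This is a sketch under the same assumption the notebook makes, namely one sample per minute; the `elapsed_min` column is an illustrative alternative that derives the time from the timestamps instead of the row count, and is not part of the original notebook.

```python
import pandas as pd

# Toy AH_data series sampled once per minute (hypothetical values)
toy = pd.DataFrame(
    {"AH_data": [9.22, 9.22, 9.22, 6.39, 6.39, 7.10]},
    index=pd.date_range("2015-01-01 00:00", periods=6, freq="min"),
)

# True wherever the value differs from the previous row;
# the first row has no predecessor, so its diff is filled with 0
changed = toy["AH_data"].diff().fillna(0).ne(0)

# The running count of changes gives one label per batch
toy["batch_id"] = changed.cumsum()

# Row position inside each batch (minutes only if sampling is exactly 1/min)
toy["time_in_batch"] = toy.groupby("batch_id").cumcount()

# Alternative: elapsed minutes measured from each batch's first timestamp
ts = toy.index.to_series()
batch_start = ts.groupby(toy["batch_id"]).transform("min")
toy["elapsed_min"] = (ts - batch_start).dt.total_seconds() / 60

print(toy)
```

If the data had gaps in the minute-level index, `time_in_batch` and `elapsed_min` would diverge, which makes the pair a quick check on the sampling assumption.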

## Save cleaned data

In [4]:
df_cleaned.to_parquet("data/clean.parquet")

## Save cleaned simulation data

In [5]:
# Identify unique batch IDs and corresponding quality
batch_quality_map = df_cleaned.drop_duplicates(subset=['batch_id']).set_index('batch_id')['quality']

# Sort by quality
sorted_batches = batch_quality_map.sort_values()

# Select 3 "bad" batches with lowest quality
bad_batches = sorted_batches.head(3).index.tolist()

# Select 1 "good" batch with highest quality
good_batch = sorted_batches.tail(1).index.tolist()

# Filter the original DataFrame to capture the entire process of these batches
df_bad_batches = df_cleaned[df_cleaned['batch_id'].isin(bad_batches)]
df_good_batches = df_cleaned[df_cleaned['batch_id'].isin(good_batch)]

df_save = pd.concat([df_bad_batches, df_good_batches])

round_cols = [col for col in df_save.columns if col not in ["batch_id", "time_in_batch"]]

df_save[round_cols] = df_save[round_cols].round(2)

df_save.to_csv("data/roaster.csv")
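
The selection of the three lowest-quality batches and the single highest-quality batch can be sketched on toy data as follows. The values are hypothetical, and `nsmallest`/`nlargest` are used here as an alternative to sorting and taking `head`/`tail`; the notebook itself takes the quality of the first row in each batch, which `groupby(...).first()` mirrors when no values are missing.

```python
import pandas as pd

# Toy per-row data: five batches, two rows each (hypothetical values)
toy = pd.DataFrame(
    {
        "batch_id": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
        "quality": [392, 392, 384, 384, 454, 454, 400, 400, 410, 410],
    }
)

# One quality value per batch, taken from the batch's first row
per_batch = toy.groupby("batch_id")["quality"].first()

# Three "bad" batches (lowest quality) and one "good" batch (highest quality)
bad_ids = per_batch.nsmallest(3).index.tolist()
good_ids = per_batch.nlargest(1).index.tolist()

# Keep every row belonging to the selected batches
selected = toy[toy["batch_id"].isin(bad_ids + good_ids)]
print(bad_ids, good_ids)
```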

