## Data Assessment of Waterflow Historical Data

**Metadata Summary**  
- 📅 **Date of Retrieval:** JULY 1, 2025  
- 🌐 **Source of Data:** LGU San Jacinto Treasury Records
- 📄 **License/Permission:**  
- 🧑‍💼 **Prepared by:** MARK JUNE E. ALMOJUELA

This notebook splits compiled records that cover more than one month in a single file into separate record files, one per month.

In [65]:
# Initialization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 

Split MAR_APR2020 record to create MAR2020 and APR2020

In [66]:
# Initialize df as None at the start
df = None

# Define the file path
file_path = os.path.normpath("../../dataset/raw/2020/MAR_APR2020.csv")

# Print the full path for verification
print(f"Attempting to load file from: {os.path.abspath(file_path)}")

try:
    if not os.path.exists(file_path):
        print("Error: File not found at the specified location.")
        dir_path = os.path.dirname(file_path)
        if not os.path.exists(dir_path):
            print(f"Error: Directory not found: {os.path.abspath(dir_path)}")
        else:
            print("Files in directory:")
            print(os.listdir(dir_path))
    else:
        # Try UTF-8 encoding first
        try:
            df = pd.read_csv(file_path)
            print("File loaded successfully with UTF-8 encoding!")
        except UnicodeDecodeError:
            print("Trying with 'latin1' encoding...")
            df = pd.read_csv(file_path, encoding='latin1')
            print("File loaded successfully with 'latin1' encoding!")
        
        # Display info if df was loaded
        if df is not None:
            print(f"\nNumber of rows: {len(df)}")
            print("\nFirst few rows:")
            print(df.head())
            print("\nColumns in the dataset:")
            print(df.columns.tolist())
            
except Exception as e:
    print(f"An error occurred: {e}")

# The df variable is now available for use in subsequent cells

Attempting to load file from: c:\Users\Mark June Almojuela\OneDrive - Bicol University\WaterFlow\AI\Model Training\dataset\raw\2020\MAR_APR2020.csv
Trying with 'latin1' encoding...
File loaded successfully with 'latin1' encoding!

Number of rows: 1633

First few rows:
   Control Number      Consumer's Name       Address Water Meter Serial #  \
0        501549.0       Albaño, Lilane  Alicante St.                  NaN   
1        500750.0  Aljecera, Marcelino  Alicante St.                  NaN   
2        500990.0       Almiñana, Irus  Alicante St.                  NaN   
3        500505.0       Almiñe, Edison  Alicante St.             95022096   
4        501542.0       Almiñe, Filben  Alicante St.                  NaN   

  Previous Present  Cons.    Amount  
0      218     247   29.0    87.00   
1     3030    3051   21.0    63.00   
2      471     537   66.0   198.00   
3        2      63   61.0   183.00   
4     3271    3314   43.0   129.00   

Columns in the dataset:
['Control Numbe
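
The UTF-8 then `latin1` fallback used above can be wrapped in a small helper so later cells can reuse it. This is a minimal sketch, not the notebook's own code: `read_csv_with_fallback` and its `encodings` parameter are hypothetical names. Note that `latin1` can decode any byte sequence, so it never fails; it is a last resort and only yields correct characters (such as `ñ`) when the file really is Latin-1 encoded.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in order and return (DataFrame, encoding_used).

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding
            last_error = err
    raise last_error
```

Returning the encoding alongside the frame makes it easy to write the split files back out with the same encoding that was used to read the source.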

In [67]:
# Count of null/NaN values in each column
null_counts = df.isnull().sum()
print("Count of null/NaN values per column:")
print(null_counts[null_counts > 0])  # Only show columns with null values

# Count of rows with any null/NaN values
rows_with_nulls = df[df.isnull().any(axis=1)]
print(f"\nNumber of rows with any null/NaN values: {len(rows_with_nulls)}")

Count of null/NaN values per column:
Control Number            1
Water Meter Serial #    698
Previous                201
Present                 403
Cons.                   548
Amount                  416
dtype: int64

Number of rows with any null/NaN values: 1026
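
The null counts above only cover cells that are empty. If the reading columns were loaded as text, some cells may also hold values that are present but not numeric. The sketch below separates the two cases with `pd.to_numeric(..., errors="coerce")`. It assumes the `Previous` and `Present` column names from this file; `reading_quality` is a hypothetical helper name.

```python
import pandas as pd


def reading_quality(frame, columns=("Previous", "Present")):
    """Count missing vs. non-numeric entries in each reading column.

    Hypothetical helper: column names follow the MAR_APR2020 file.
    """
    rows = {}
    for col in columns:
        raw = frame[col]
        numeric = pd.to_numeric(raw, errors="coerce")
        missing = raw.isna()
        rows[col] = {
            "missing": int(missing.sum()),
            # Present in the file but not parseable as a number
            "non_numeric": int((numeric.isna() & ~missing).sum()),
        }
    # One row per column, with "missing" and "non_numeric" as columns
    return pd.DataFrame(rows).T
```

Calling `reading_quality(df)` on the loaded frame would show how many of the missing readings are truly blank and how many are unparseable text.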


Creating MAR2020 and APR2020 records

In [68]:
# Logic test for MAR_APR2020.csv record split
for index, row in df.iterrows():
    try:
        control_number = row["Control Number"]
        consumer_name = row["Consumer's Name"]
        address = row["Address"]
        serial_number = row["Water Meter Serial #"]
        try:
            previous_reading = int(row["Previous"])
        except ValueError:
            previous_reading = 0
        
        try:
            present_reading = int(row["Present"])
        except ValueError:
            if previous_reading > 0:
                present_reading = previous_reading
            else:
                present_reading = 0
        
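        # Estimate the end-of-March reading as the midpoint of the two-month span
        current_reading = present_reading - ((present_reading - previous_reading) / 2)
        
        # Two-month totals. The flat rate of 10 per unit is this notebook's assumption;
        # the source rows shown above bill 3 per unit (e.g. 29 -> 87.00).
        total_consumption = present_reading - previous_reading
        total_amount = total_consumption * 10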

        print(control_number, consumer_name, address, serial_number, 
              previous_reading, current_reading, total_consumption, total_amount)
              
    except Exception as e:
        print(f"Error processing row {index}: {e}")

501549.0 Albaño, Lilane Alicante St. nan 218 232.5 29 290
500750.0 Aljecera, Marcelino Alicante St. nan 3030 3040.5 21 210
500990.0 Almiñana, Irus Alicante St. nan 471 504.0 66 660
500505.0 Almiñe, Edison Alicante St. 95022096 2 32.5 61 610
501542.0 Almiñe, Filben Alicante St. nan 3271 3292.5 43 430
500431.0 Almiñe, Franchie Alicante St. 121006093 0 0.0 0 0
500263.0 Almodal, Arna Alicante St. 9588526 5228 5240.5 25 250
501240.0 Almocera, Owen Alicante St. nan 67 102.5 71 710
500484.0 Almodal, Erlinda Alicante St. 028086-02 0 0.0 0 0
500739.0 Almodal, Jolly Alicante St. 017902-02 1795 1861.5 133 1330
500544.0 Almodal, Noe Alicante St. nan 2418 2418.0 0 0
500187.0 Almodiel, Arles Alicante St. 9074313 3210 3210.0 0 0
501447.0 Almodiel, Mary Grace Alicante St. nan 238 240.5 5 50
501453.0 Alcantara, Hilda Alicante St. nan 183 189.5 13 130
501317.0 Almoete, Ike Alicante St. nan 595 603.0 16 160
501280.0 Almojuela, Arlic Alicante St. nan 424 448.0 48 480
500248.0 Almojuela, Rogelio Alicante S
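
The fallback rules in the logic test (an invalid `Previous` becomes 0, an invalid `Present` becomes the previous reading) can be isolated so they are easy to check on their own. The following is a minimal sketch under those same rules; `parse_reading` and `readings_with_fallback` are hypothetical names and are not part of the notebook.

```python
import math


def parse_reading(value):
    """Return a meter reading as int, or None if blank, NaN, or non-numeric."""
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    if math.isnan(number) or math.isinf(number):
        return None
    return int(number)


def readings_with_fallback(previous_raw, present_raw):
    """Apply the notebook's defaults: previous -> 0, present -> previous."""
    previous = parse_reading(previous_raw)
    present = parse_reading(present_raw)
    previous = 0 if previous is None else previous
    present = previous if present is None else present
    return previous, present
```

Returning `None` from the parser keeps "no reading" distinct from a genuine reading of 0 until the fallback is applied, which makes it possible to count how many rows relied on a default.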

In [None]:
import csv

# Create the output directory if it doesn't exist (both files share it)
output_dir = "../../dataset/raw/2020"
os.makedirs(output_dir, exist_ok=True)

# Column header shared by both monthly files
header = [
    "Control Number", "Consumer's Name", "Address", 
    "Water Meter Serial #", "Previous", "Present", 
    "Cons.", "Amount"
]

with open(f"{output_dir}/MAR2020.csv", "w", newline="", encoding='latin-1') as mar_file, \
     open(f"{output_dir}/APR2020.csv", "w", newline="", encoding='latin-1') as apr_file:
    mar_csv_writer = csv.writer(mar_file)
    apr_csv_writer = csv.writer(apr_file)
    # Write header
    mar_csv_writer.writerow(header)
    apr_csv_writer.writerow(header)

    for index, row in df.iterrows():
        try:
            control_number = row["Control Number"]
            consumer_name = row["Consumer's Name"]
            address = row["Address"]
            serial_number = row["Water Meter Serial #"]
            
            # Handle Previous Reading
            try:
                mar_previous_reading = int(float(str(row["Previous"]).strip() or 0))
            except (ValueError, TypeError):
                mar_previous_reading = 0
            
            # Handle Present Reading
            try:
                mar_present_reading = int(float(str(row["Present"]).strip() or 0))
            except (ValueError, TypeError):
                mar_present_reading = mar_previous_reading  # Default to previous reading if present is invalid
            
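            # Estimate the end-of-March reading as the midpoint of the two-month span.
            # round() rounds exact halves to the nearest even integer (14.5 -> 14, 21.5 -> 22).
            mar_current_reading = mar_previous_reading + round((mar_present_reading - mar_previous_reading) / 2)
            mar_total_consumption = mar_current_reading - mar_previous_reading
            mar_total_amount = mar_total_consumption * 10  # flat rate of 10 per unit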

            new_record_mar = [
                control_number, consumer_name, address, serial_number,
                mar_previous_reading, round(mar_current_reading),
                mar_total_consumption, mar_total_amount
            ]

            # April starts where the estimated March reading ends
            apr_previous_reading = mar_current_reading
            
            # April ends at the recorded Present reading; if it is invalid, assume no consumption
            try:
                apr_current_reading = int(float(str(row["Present"]).strip() or 0))
            except (ValueError, TypeError):
                apr_current_reading = apr_previous_reading
            
            # Calculate consumption and amount for April
            apr_total_consumption = apr_current_reading - apr_previous_reading
            apr_total_amount = apr_total_consumption * 10  # flat rate of 10 per unit

            new_record_apr = [
                control_number, consumer_name, address, serial_number,
                apr_previous_reading, round(apr_current_reading),
                apr_total_consumption, apr_total_amount
            ]            
            # Print Record
            print(f"Processed MAR {index} rows: {new_record_mar}")
            print(f"Processed APR {index} rows: {new_record_apr}")
            
            # Write row
            mar_csv_writer.writerow(new_record_mar)
            apr_csv_writer.writerow(new_record_apr)                
        except Exception as e:
            print(f"Error processing row {index}: {e}")
            continue

print("Processing complete!")

Processed MAR 0 rows: [501549.0, 'Albaño, Lilane', 'Alicante St.', nan, 218, 232, 14, 140]
Processed MAR 1 rows: [500750.0, 'Aljecera, Marcelino', 'Alicante St.', nan, 3030, 3040, 10, 100]
Processed MAR 2 rows: [500990.0, 'Almiñana, Irus', 'Alicante St.', nan, 471, 504, 33, 330]
Processed MAR 3 rows: [500505.0, 'Almiñe, Edison', 'Alicante St.', '95022096', 2, 32, 30, 300]
Processed MAR 4 rows: [501542.0, 'Almiñe, Filben', 'Alicante St.', nan, 3271, 3293, 22, 220]
Processed MAR 5 rows: [500431.0, 'Almiñe, Franchie', 'Alicante St.', '121006093', 0, 0, 0, 0]
Processed MAR 6 rows: [500263.0, 'Almodal, Arna', 'Alicante St.', '9588526', 5228, 5240, 12, 120]
Processed MAR 7 rows: [501240.0, 'Almocera, Owen', 'Alicante St.', nan, 67, 103, 36, 360]
Processed MAR 8 rows: [500484.0, 'Almodal, Erlinda', 'Alicante St.', '028086-02', 0, 0, 0, 0]
Processed MAR 9 rows: [500739.0, 'Almodal, Jolly', 'Alicante St.', '017902-02', 1795, 1861, 66, 660]
Processed MAR 10 rows: [500544.0, 'Almodal, Noe', 'Alic
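
The split performed in the cell above can be summarized as a single pure function, which makes the arithmetic easy to verify. This is a sketch of the same midpoint rule, not a replacement for the cell: `split_reading` and its `rate` parameter are hypothetical names, and `rate=10` mirrors the multiplier used in this notebook (the sample source rows appear to bill 3 per unit, so the rate may need review).

```python
def split_reading(previous, present, rate=10):
    """Split one two-month reading into (march, april) records at the midpoint.

    Each record is (previous, current, consumption, amount).
    Hypothetical helper mirroring the notebook's midpoint rule.
    """
    # Python's round() sends exact halves to the nearest even integer
    half = round((present - previous) / 2)
    midpoint = previous + half
    march = (previous, midpoint, midpoint - previous, (midpoint - previous) * rate)
    april = (midpoint, present, present - midpoint, (present - midpoint) * rate)
    return march, april


march, april = split_reading(218, 247)
print(march)  # (218, 232, 14, 140)
print(april)  # (232, 247, 15, 150)
```

Because April starts exactly at the March midpoint, the two monthly consumptions always add up to the original two-month consumption, even when the difference is odd.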

In [71]:
# Read the data with explicit dtypes (keys match the header written to APR2020.csv)
dtypes = {
    'Control Number': 'str',
    "Consumer's Name": 'str',
    'Address': 'str',
    'Water Meter Serial #': 'str',
    'Previous': 'float64',
    'Present': 'float64',
    'Cons.': 'float64',
    'Amount': 'float64'
}

# Read the CSV
new_df = pd.read_csv("../../dataset/raw/2020/APR2020.csv", 
                    encoding='latin-1',
                    dtype=dtypes)

# Check for negative consumption
print("=== Negative Consumption Summary ===")
neg_consumption = new_df[new_df['Cons.'] < 0]
print(f"Total rows with negative consumption: {len(neg_consumption)}")
if not neg_consumption.empty:
    print("\nSample of rows with negative consumption:")
    print(neg_consumption[['Control Number', 'Previous', 'Present', 'Cons.']].head())

# Check for negative amount
print("\n=== Negative Amount Summary ===")
neg_amount = new_df[new_df['Amount'] < 0]
print(f"Total rows with negative amount: {len(neg_amount)}")
if not neg_amount.empty:
    print("\nSample of rows with negative amount:")
    print(neg_amount[['Control Number', 'Cons.', 'Amount']].head())

# Additional checks
print("\n=== Additional Data Quality Checks ===")
print(f"Total rows: {len(new_df)}")
print(f"Rows with zero consumption: {len(new_df[new_df['Cons.'] == 0])}")
print(f"Rows with missing values: {new_df.isnull().any(axis=1).sum()}")

=== Negative Consumption Summary ===
Total rows with negative consumption: 1

Sample of rows with negative consumption:
    Control Number  Previous  Present  Cons.
171       500741.0      58.0     14.0  -44.0

=== Negative Amount Summary ===
Total rows with negative amount: 1

Sample of rows with negative amount:
    Control Number  Cons.  Amount
171       500741.0  -44.0  -440.0

=== Additional Data Quality Checks ===
Total rows: 1633
Rows with zero consumption: 544
Rows with missing values: 698
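
The negative consumption row comes from a record whose `Present` reading is lower than its `Previous` reading, so the midpoint split produces a negative difference. One option is to flag such rows in the combined file before splitting, so they can be reviewed by hand. The sketch below assumes the `Previous` and `Present` column names used in this notebook; `find_reading_regressions` is a hypothetical helper, and the cause of such rows (for example a meter replacement or a data-entry error) is not known from the data alone.

```python
import pandas as pd


def find_reading_regressions(frame):
    """Return rows whose Present reading is lower than the Previous reading.

    Hypothetical helper: non-numeric or missing readings are coerced to NaN
    and are never flagged, because comparisons with NaN are False.
    """
    previous = pd.to_numeric(frame["Previous"], errors="coerce")
    present = pd.to_numeric(frame["Present"], errors="coerce")
    return frame[present < previous]
```

Running `find_reading_regressions(df)` on the original `MAR_APR2020` frame would list the affected accounts before any monthly files are written.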
