Here I explore the target displacement data from EGMS. The guideline I follow is detailed in this book:
- https://otexts.com/fpp3/intro.html

Did you compute descriptive statistics to check whether the distributions of the validation and training sets are similar or different?
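
One way to approach this question is to put summary statistics of the two splits side by side. The sketch below uses synthetic values only; `train` and `valid` are placeholder names standing in for the real training and validation displacement values, and the units are assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real splits (assumption: values in mm)
train = pd.Series(rng.normal(loc=-2.0, scale=1.0, size=500), name="train")
valid = pd.Series(rng.normal(loc=-2.2, scale=1.2, size=200), name="valid")

# Count, mean, std, min, quartiles and max for each split, side by side
summary = pd.concat([train.describe(), valid.describe()], axis=1)
print(summary)

# Differences in location and spread give a first hint of a distribution shift
print("mean difference:", valid.mean() - train.mean())
print("std ratio:", valid.std() / train.std())
```

Large gaps in the mean, spread, or quartiles between the two columns would suggest that the validation set is not representative of the training set.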

# 1.1 What can be forecast?
The predictability of an event or a quantity depends on several factors including:

- how well we understand the factors that contribute to it;
- how much data is available;
- how similar the future is to the past;
- whether the forecasts can affect the thing we are trying to forecast.

- Often in forecasting, a key step is knowing when something can be forecast accurately, and when forecasts will be no better than tossing a coin. 
- Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again.

- What is normally assumed is that the way in which the environment is changing will continue into the future.

- Forecasting situations vary widely in their
    - time horizons,
    - factors determining actual outcomes,
    - types of data patterns,
    - and many other aspects.
    
Forecasting methods can be
- simple, such as using the most recent observation as a forecast (which is called the naïve method), or
- highly complex, such as neural nets and econometric systems of simultaneous equations.
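
A minimal sketch of the naïve method mentioned above. The displacement values are made up for illustration, and `naive_forecast` is a hypothetical helper name, not something from EGMS or the book.

```python
import numpy as np


def naive_forecast(series, horizon):
    """Repeat the most recent observation for every future step."""
    series = np.asarray(series, dtype=float)
    if series.size == 0:
        raise ValueError("series must contain at least one observation")
    return np.full(horizon, series[-1])


# Made-up displacement values (mm), for illustration only
displacement = [-1.0, -1.4, -1.9, -2.1, -2.6]
forecast = naive_forecast(displacement, horizon=3)
print(forecast)
```

Because it needs no fitting, the naïve forecast is a useful baseline: a more complex model should at least beat it.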
    
Sometimes, there will be no data available at all. In situations like this, we use judgmental forecasting. 

The choice of method depends on what data are available and the predictability of the quantity to be forecast.


# 1.2 Forecasting, goals and planning

- Forecasting
    - is about predicting the future as accurately as possible
- Goals
    - are what you would like to have happen.
- Planning
    - is a response to forecasts and goals.

- Short-term forecasts
    - are needed for the scheduling of personnel, production and transportation.
- Medium-term forecasts
    - are needed to determine future resource requirements, in order to purchase raw materials, hire personnel, or buy machinery and equipment.
- Long-term forecasts
    - are used in strategic planning. Such decisions must take account of market opportunities, environmental factors and internal resources.

# 1.3 Determining what to forecast
* Decisions need to be made about what should be forecast.
* It is also necessary to consider the forecasting horizon.
* How frequently are forecasts required?
* It is worth spending time talking to the people who will use the forecasts to ensure that you understand their needs, and how the forecasts are to be used, before embarking on extensive work in producing the forecasts.
* It is then necessary to find or collect the data on which the forecasts will be based.

# 1.4 Forecasting data and methods
* If there are no data available, or if the data available are not relevant to the forecasts, then qualitative forecasting methods must be used.
* Quantitative forecasting can be applied when two conditions are satisfied:

- numerical information about the past is available;
- it is reasonable to assume that some aspects of the past patterns will continue into the future.

* Most quantitative prediction problems use either 
    - time series data (collected at regular intervals over time) or 
    - cross-sectional data (collected at a single point in time)

The simplest time series forecasting methods use only information on the variable to be forecast, and make no attempt to discover the factors that affect its behaviour. Therefore they will extrapolate trend and seasonal patterns, but they ignore all other information such as marketing initiatives, competitor activity, changes in economic conditions, and so on.

Decomposition methods are helpful for studying the trend and seasonal patterns in a time series.
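
As a rough illustration of decomposition, the sketch below splits a synthetic monthly series into trend, seasonal and remainder parts using a classical additive approach with pandas only. The series, the 12-step period and all variable names are assumptions made for the example; this is not the specific method used in the book.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + yearly seasonality + noise
rng = np.random.default_rng(1)
period = 12
t = np.arange(72)
dates = pd.date_range("2018-01-01", periods=t.size, freq="MS")
y = pd.Series(
    -0.1 * t + 1.5 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.2, t.size),
    index=dates,
)

# Trend: centred 2x12 moving average (a 12-term mean followed by a 2-term mean)
trend = y.rolling(period).mean().rolling(2).mean().shift(-(period // 2))

# Seasonal: average detrended value for each calendar month, centred on zero
detrended = y - trend
seasonal_means = detrended.groupby(detrended.index.month).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[y.index.month].to_numpy(), index=y.index)

# Remainder: what is left after removing trend and seasonality
remainder = y - trend - seasonal
print(remainder.dropna().describe())
```

The moving average leaves the first and last six months of the trend undefined, so the remainder is only available for the middle of the series.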

## Predictor variables and time series forecasting

A model with predictor variables might be of the form
* predictive_model = f(dynamic_variables, static_variables, error)

- The “error” term on the right allows for random variation and the effects of relevant variables that are not included in the model.
- We call this an explanatory model because it helps explain what causes the variation in the target variable (electricity demand in the book's example).

- We can also use only the time series of the target variable for forecasting. In this case, the time series forecasting equation is of the form

TD_future = f(TD_present, TD_past, TD_pastpast, ..., error)

- Here, the prediction of the future is based on past values of the variable, not on external variables that may affect the system.
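
To make the time series form concrete, here is a minimal sketch that takes `f` to be linear and fits it by ordinary least squares on two lagged values. The linear choice, the number of lags and the synthetic series are all assumptions for illustration; the notes do not prescribe a particular `f`.

```python
import numpy as np

# Synthetic displacement-like series (made-up values): slow subsidence plus noise
rng = np.random.default_rng(2)
td = np.cumsum(rng.normal(-0.1, 0.05, 200))

# Design matrix of lagged values: predict TD[t+1] from TD[t] and TD[t-1]
n_lags = 2
X = np.column_stack([td[n_lags - 1 - k : len(td) - 1 - k] for k in range(n_lags)])
X = np.column_stack([np.ones(len(X)), X])  # intercept column
y = td[n_lags:]

# Ordinary least squares fit; the residuals play the role of the "error" term
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

# One-step-ahead forecast from the last two observations
td_future = coef[0] + coef[1] * td[-1] + coef[2] * td[-2]
print(td_future)
```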


- There are also mixed models, which combine the features of the above two models
    - TD_future = f(TD_present, dynamic, static, error). 
- They are known as dynamic regression models, panel data models, longitudinal models, transfer function models, and linear system models.


Why use a time series model?
- First, the system may not be understood, and even if it is, the relationships that govern it may be difficult to measure.
- Second, it is necessary to know or forecast the future values of the various predictors in order to forecast the variable of interest, and this may be too difficult.
- Third, the main concern may be only to predict what will happen, not to know why it happens.
- Finally, the time series model may give more accurate forecasts than an explanatory or mixed model.

- The model to be used in forecasting depends on 
    - the resources and data available, 
    - the accuracy of the competing models, and 
    - the way in which the forecasting model is to be used.

# 1.6 The basic steps in a forecasting task
A forecasting task usually involves five basic steps.

- Step 1: Problem definition.
- Step 2: Gathering information.
    - (a) statistical data, and 
    - (b) the accumulated expertise of the people who collect the data and use the forecasts. 
- Step 3: Preliminary (exploratory) analysis.
- Step 4: Choosing and fitting models.
- Step 5: Using and evaluating a forecasting model.

* Often, a forecast is accompanied by a prediction interval giving a range of values the random variable could take with relatively high probability.
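
A minimal sketch of a 95% prediction interval for naïve forecasts. It assumes normally distributed, uncorrelated residuals, in which case the standard deviation of the h-step naïve forecast is the one-step residual standard deviation times √h. The series is made up.

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(0.0, 0.5, 150))  # made-up random-walk series

# One-step naive residuals are the first differences of the series
residuals = np.diff(y)
sigma = np.sqrt(np.mean(residuals**2))  # residual standard deviation

horizons = np.arange(1, 6)
point = np.full(horizons.size, y[-1])          # naive point forecasts
half_width = 1.96 * sigma * np.sqrt(horizons)  # widens with the horizon
lower, upper = point - half_width, point + half_width

for h, lo, hi in zip(horizons, lower, upper):
    print(f"h={h}: [{lo:.2f}, {hi:.2f}]")
```

The interval grows with the horizon, reflecting that forecasts further into the future are less certain.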

# Chapter 2 Time series graphics
- The first thing to do in any data analysis task is to plot the data. 
- Graphs enable many features of the data to be visualised, including patterns, unusual observations, changes over time, and relationships between variables. 
- The features that are seen in plots of the data must then be incorporated, as much as possible, into the forecasting methods to be used.

- Trend: A trend exists when there is a long-term increase or decrease in the data.

- Seasonal: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known period.

- Cyclic: A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency.
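
One simple numerical check for a fixed seasonal period, complementary to plotting, is the sample autocorrelation at the suspected period. The sketch below is an illustration on a made-up series with a 12-step period; `autocorr` is a hypothetical helper written for this example.

```python
import numpy as np


def autocorr(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))


# Made-up series with a fixed 12-step seasonal period and no trend
t = np.arange(120)
seasonal_series = np.sin(2 * np.pi * t / 12)

print(autocorr(seasonal_series, 12))  # strongly positive: peaks line up
print(autocorr(seasonal_series, 6))   # strongly negative: peaks meet troughs
```

A cyclic pattern without a fixed period would not show such a sharp, repeating peak at one particular lag.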

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
import geopandas as gpd
import pandas as pd

# Scatter plots

## Displacement


In [None]:


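# Settings and state used while scanning the files

expected_columns = 313  # end of the displacement-column slice used below
chunk_size = 1000  # rows read per chunk
point_in_file = False
header_written = False  # set to True once the output header has been written
output_space = "mean_vgm.csv"  # assumed output path (not defined in the original); adjust as needed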


data_path = r"C:\vgd_italy\data\csv"

# Read and reproject the shapefile of the AOI to the CRS of the EGMS data
aoi_shp = "../regression/aoi/gadm41_ITA_1.shp"
aoi_gdf = gpd.read_file(aoi_shp)
aoi_gdf = aoi_gdf.to_crs("EPSG:3035")

files = os.listdir(data_path)

iii = 0
# Loop through each file
for file in files:
    print(f"Reading file {iii}")
    point_in_file = False
    with open(os.path.join(data_path, file), "r") as datafile:
        for df in pd.read_csv(datafile, chunksize=chunk_size):
            pos_columns = df.columns[1:3]
            disp_columns = df.columns[11:expected_columns]

            # Convert to GeoDataFrame
            data_gdf = gpd.GeoDataFrame(
                df,
                geometry=gpd.points_from_xy(df.easting, df.northing),
                crs="EPSG:3035",
            )

            # Spatial join to find points within the AOI
            points_in_aoi = gpd.sjoin(
                data_gdf, aoi_gdf, how="inner", predicate="within"
            )
            if points_in_aoi.shape[0] != 0:
                if not point_in_file:
                    point_in_file = True  # This file has at least one MP
                
                # Filter relevant columns
                points_in_aoi = points_in_aoi[
                    pos_columns.tolist() + disp_columns.tolist()
                ]
                displacement_values = points_in_aoi[disp_columns]

                # Compute mean over time for each MP (row-wise)
                mean_vgm = displacement_values.mean(axis=1)

                mean_df = pd.DataFrame(
                    {
                        "easting": points_in_aoi[pos_columns[0]],
                        "northing": points_in_aoi[pos_columns[1]],
                        "mean_vgm": mean_vgm,
                    }
                )

                mean_df.to_csv(
                    output_space, mode="a", header=not header_written, index=False
                )

                header_written = True

             
    iii += 1


In [None]:


# --- 1. Define your CSV file path ---
csv_file_path = 'your_data.csv' # <--- IMPORTANT: Change this to your actual CSV file path

# --- Create a dummy CSV file for demonstration if you don't have one ---
# You can skip this block if you already have your CSV file
dummy_data = {
    'MP_ID': ['MP_A', 'MP_B', 'MP_C', 'MP_D'],
    '2023-01-01': [1.2, 0.5, 2.0, 3.0],
    '2023-01-02': [1.5, 0.7, 2.2, 3.1],
    '2023-01-03': [1.8, 0.6, 1.9, 3.5],
    '2023-01-04': [2.1, 0.9, 1.7, 3.2],
    '2023-01-05': [2.5, 1.1, 1.5, 3.8],
    '2023-01-06': [2.3, 1.3, 1.4, 3.7]
}
dummy_df = pd.DataFrame(dummy_data)
dummy_df.to_csv(csv_file_path, index=False)
print(f"Dummy CSV created at: {csv_file_path}\n")
# --- End of dummy CSV creation ---


# --- 2. Read the CSV file ---
# Assuming the first column is the MP identifier and subsequent columns are time steps
try:
    df = pd.read_csv(csv_file_path)
    print("Original DataFrame head:")
    print(df.head())
    print("\n")
except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found.")
    print("Please make sure the path is correct or create the dummy CSV.")
    raise  # re-raise so the cell stops here (exit() is meant for scripts, not notebooks)

# --- 3. Prepare the DataFrame for plotting ---

# Set the MP_ID column as the index
df = df.set_index('MP_ID')

# Convert column headers (time steps) to datetime objects
# This is crucial for proper time-series plotting
df.columns = pd.to_datetime(df.columns)

print("DataFrame after setting index and converting columns to datetime:")
print(df.head())
print("\n")


# --- 4. Plotting Options ---

### Option A: Create a separate scatter plot for each MP ###
print("Generating separate scatter plots for each MP...")
for mp_id, displacements in df.iterrows():
    plt.figure(figsize=(10, 5))
    plt.scatter(displacements.index, displacements.values,
                s=30, alpha=0.8, color='blue')

    plt.title(f'Displacement Over Time for {mp_id}')
    plt.xlabel('Time Step')
    plt.ylabel('Displacement Value')
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

print("\nFinished generating separate plots.\n")


### Option B: Create a single scatter plot with all MPs (distinguished by color) ###
print("Generating a single scatter plot for all MPs...")
plt.figure(figsize=(14, 8))

# Get a color map for different MPs
# plt.cm.get_cmap was removed in recent Matplotlib releases; plt.get_cmap still works
colors = plt.get_cmap('tab10', len(df)) # 'tab10' is a good categorical colormap

for i, (mp_id, displacements) in enumerate(df.iterrows()):
    plt.scatter(displacements.index, displacements.values,
                s=30, alpha=0.7, label=mp_id, color=colors(i))

plt.title('Displacement Over Time for All Measurement Points')
plt.xlabel('Time Step')
plt.ylabel('Displacement Value')
plt.grid(True, linestyle='--', alpha=0.6)
plt.xticks(rotation=45)
plt.legend(title='Measurement Point', bbox_to_anchor=(1.05, 1), loc='upper left') # Legend outside the plot
plt.tight_layout()
plt.show()

print("\nFinished generating combined plot.")