In [108]:
import pandas as pd

In [114]:
path_train = "data/archive/train_timeseries/train_timeseries.csv"
path_validation = "data/archive/validation_timeseries/validation_timeseries.csv"
path_test = "data/archive/test_timeseries/test_timeseries.csv"

## 1. Introduction to the Dataset
What is the dataset about? What are the features and the target variable? What is the application domain?

### 1.1. Context and Introduction  
The **US Drought & Meteorological Data** dataset, available on [Kaggle](https://www.kaggle.com/datasets/cdminix/us-drought-meteorological-data/data), combines meteorological indicators with drought severity levels from the **U.S. Drought Monitor**. It was created to support data-driven drought prediction by linking weather conditions with expert-assessed drought categories (D0–D4), plus a no-drought class. Each observation corresponds to one day in a specific U.S. county, identified by its FIPS code, and includes daily climatic features such as temperature, precipitation, humidity, wind speed, and surface pressure, which are described in Section 1.2.

The dataset integrates information from the **National Drought Mitigation Center (NDMC)**, **NOAA**, and **USDA**, which produce weekly drought classifications based on multiple climate variables such as precipitation, temperature, and soil moisture.  
Meteorological data were aggregated and aligned with these drought severity labels to enable **machine learning applications in climate and environmental monitoring**.

The dataset is already divided into three subsets:
- **Training Set:** Contains historical data from 2000 to 2016, used to train machine learning models (around 81% of all rows).  
- **Validation Set:** Contains data from 2017 to 2018, used for hyperparameter tuning and model selection (around 9.5% of all rows).  
- **Test Set:** Contains data from 2019 to 2020, used to assess the predictive accuracy of the final model (around 9.5% of all rows).

All three subsets share the same feature structure and data organization. As explained later, this project keeps this predefined split.
  
**Application Domain:** This dataset belongs to the domain of **climate science and environmental modeling**, with key applications in:
- Drought risk assessment and early warning systems  
- Agricultural and water resource management  
- Climate variability analysis  

### 1.2. Features Overview 
Each subset contains **18 meteorological indicators**, a **date column**, a **county identifier (`fips`)**, and a **target variable (`score`)** representing drought severity.  

- **Total entries:** 23,841,468 (training: 19,300,680; validation: 2,268,840; test: 2,271,948)
- **Total columns:** 21  
- **Temporal coverage:** Daily observations across multiple U.S. counties  
- **Target variable:** `score`, representing drought severity (ordinal numeric scale from 0 to 5, missing on days without a drought label)  
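
As a quick sanity check, the share of each subset can be derived from the row counts listed above. This is only arithmetic on the reported numbers; the dictionary name `rows` is introduced here for illustration.

```python
# Row counts reported above for the three provided subsets.
rows = {"train": 19_300_680, "validation": 2_268_840, "test": 2_271_948}

total = sum(rows.values())

# Share of each subset, in percent of all rows, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in rows.items()}

print(total)
print(shares)
```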

In [124]:
# drought_train = pd.read_csv(path_train)
# drought_train.info()

drought_validation = pd.read_csv(path_validation)   # Explore the validation set first:
drought_validation.info()                           # it is much smaller than the training set

# drought_test = pd.read_csv(path_test)
# drought_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2268840 entries, 0 to 2268839
Data columns (total 21 columns):
 #   Column       Dtype  
---  ------       -----  
 0   fips         int64  
 1   date         object 
 2   PRECTOT      float64
 3   PS           float64
 4   QV2M         float64
 5   T2M          float64
 6   T2MDEW       float64
 7   T2MWET       float64
 8   T2M_MAX      float64
 9   T2M_MIN      float64
 10  T2M_RANGE    float64
 11  TS           float64
 12  WS10M        float64
 13  WS10M_MAX    float64
 14  WS10M_MIN    float64
 15  WS10M_RANGE  float64
 16  WS50M        float64
 17  WS50M_MAX    float64
 18  WS50M_MIN    float64
 19  WS50M_RANGE  float64
 20  score        float64
dtypes: float64(19), int64(1), object(1)
memory usage: 363.5+ MB


| Feature | Description | Type | Unit |
|----------|--------------|------|------|
| `fips` | FIPS code identifying the U.S. county | int64 | – |
| `date` | Observation date | object | YYYY-MM-DD |
| `PRECTOT` | Total Precipitation | float64 | mm/day |
| `PS` | Surface Pressure | float64 | kPa |
| `QV2M` | Specific Humidity at 2 Meters | float64 | g/kg |
| `T2M` | Air Temperature at 2 Meters | float64 | °C |
| `T2MDEW` | Dew/Frost Point Temperature at 2 Meters | float64 | °C |
| `T2MWET` | Wet Bulb Temperature at 2 Meters | float64 | °C |
| `T2M_MAX` | Maximum Temperature at 2 Meters | float64 | °C |
| `T2M_MIN` | Minimum Temperature at 2 Meters | float64 | °C |
| `T2M_RANGE` | Temperature Range at 2 Meters | float64 | °C |
| `TS` | Earth Skin Temperature | float64 | °C |
| `WS10M` | Wind Speed at 10 Meters | float64 | m/s |
| `WS10M_MAX` | Maximum Wind Speed at 10 Meters | float64 | m/s |
| `WS10M_MIN` | Minimum Wind Speed at 10 Meters | float64 | m/s |
| `WS10M_RANGE` | Wind Speed Range at 10 Meters | float64 | m/s |
| `WS50M` | Wind Speed at 50 Meters | float64 | m/s |
| `WS50M_MAX` | Maximum Wind Speed at 50 Meters | float64 | m/s |
| `WS50M_MIN` | Minimum Wind Speed at 50 Meters | float64 | m/s |
| `WS50M_RANGE` | Wind Speed Range at 50 Meters | float64 | m/s |
| `score` | Drought severity indicator (target variable) | float64  | – |

### 1.3. Output Variable  

- **Variable:** `score`  
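- **Type:** Ordinal categorical variable with six levels: no drought (`0`) and the drought categories `D0`–`D4` (values 1–5 after rounding). The raw column is a float and is `NaN` on days without a drought label.
- **Meaning:** Represents drought severity, from no drought, through Abnormally Dry (D0), up to Exceptional Drought (D4).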

In [125]:
round(drought_validation['score'], 0).unique()

array([nan,  2.,  1.,  0.,  3.,  4.,  5.])
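
The rounded values above can be turned into readable class labels. The sketch below assumes that `0` means no drought and that `1`–`5` correspond to `D0`–`D4`, which is inferred from the unique values shown rather than confirmed by this notebook. The `SCORE_LABELS` dictionary and the toy `scores` series are illustrative stand-ins for `drought_validation['score']`.

```python
import numpy as np
import pandas as pd

# Assumed mapping: 0 = no drought, 1..5 = D0..D4 (inferred, not confirmed by the source).
SCORE_LABELS = {0: "None", 1: "D0", 2: "D1", 3: "D2", 4: "D3", 5: "D4"}

# Toy scores standing in for drought_validation['score'].
scores = pd.Series([np.nan, 0.2, 1.6, 2.0, np.nan, 4.7])

# Keep only labeled days, then round to the nearest class.
classes = scores.dropna().round().astype(int)
labels = classes.map(SCORE_LABELS)

print(labels.tolist())
```

Dropping the `NaN` rows first avoids the error that a direct cast to `int` would raise on missing values.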

### 1.4. Learning Task Definition  
The objective of this study is to **predict drought severity levels** using meteorological and climatic indicators.  
This task is formulated as an **ordinal classification problem**, since the drought classes (`D0`–`D4`) and no drought class `0` represent an ordered sequence of severity levels.

## 2. Data Preprocessing
Before conducting exploratory analysis, the dataset will be **cleaned and preprocessed** to make it suitable for machine learning.  
The original dataset contains multiple counties, potential missing values, and temporal dependencies that require careful handling.


All features are numeric except for `date`, which is stored as text and will be converted to datetime format. `fips` is an integer identifier rather than a measurement; since we focus on a single county, and therefore a single `fips` value, this column will be dropped.

### 2.1. Data Subsetting  

For this project, only data from **one selected county** will be used, allowing consistent temporal modeling and reducing data volume for computational efficiency.

Because the drought scores are imbalanced and the full dataset is very large, we chose to subset it and model a single county, using the training, validation, and test data already provided for that county.

It was important to choose a county that has experienced periods of no drought as well as every drought category, since the more severe categories are the least represented classes. Because the dataset is imbalanced, the selected county must have all drought classes represented.

Later, this model will be specialized to predict drought in counties similar to the one we chose.

To simplify the study and ensure computational efficiency:
- Only **one county** will be used (selected based on data completeness and representativeness).  
- This subset will serve as the basis for **training**, **validation**, and **testing** — ensuring consistency while avoiding spatial leakage.  
- All other counties will be **removed** from the dataset.
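
The selection criterion above can be sketched as a group-by over `fips` that counts how many distinct rounded drought classes each county has. The toy frame and its values are made up for illustration, and the threshold of six classes assumes the levels 0 to 5 described in Section 1.3.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one of the provided subsets (values are made up).
df = pd.DataFrame({
    "fips": [1001] * 4 + [5119] * 6 + [56043] * 3,
    "score": [0, 0, 1, np.nan, 0, 1, 2, 3, 4, 5, 0, np.nan, 2],
})

# Number of distinct rounded drought classes observed per county.
n_classes = (
    df.dropna(subset=["score"])
      .assign(score=lambda d: d["score"].round().astype(int))
      .groupby("fips")["score"]
      .nunique()
)

# Counties in which all six levels (0..5) occur.
candidates = n_classes[n_classes == 6].index.tolist()

# Keep one candidate county and drop the now-constant identifier.
county = df[df["fips"] == candidates[0]].drop(columns="fips")
```

On the real data, the same filter would be applied to the training, validation, and test files with the same `fips` value.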

### 2.2. Handling Missing Values and Outliers  
- Missing data will be handled using imputation (e.g., median or time-based interpolation).  
- Outliers will be detected and managed using the interquartile range (IQR) or z-score method.  
- The target variable `score` will be checked for consistency and coverage over time.
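
A minimal sketch of these two steps on a toy daily series, assuming time-based interpolation for gaps and the 1.5 × IQR rule for outliers. Clipping to the IQR fences is only one possible treatment, and the column name `T2M_clipped` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy daily temperature series with one gap and one implausible spike.
dates = pd.date_range("2017-01-01", periods=8, freq="D")
county = pd.DataFrame(
    {"T2M": [10.0, np.nan, 12.0, 13.0, 60.0, 12.5, 11.0, 11.5]}, index=dates
)

# Time-based linear interpolation fills the gap from neighbouring days.
county["T2M"] = county["T2M"].interpolate(method="time")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = county["T2M"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ~county["T2M"].between(lower, upper)

# One possible treatment: clip flagged values to the fences.
county["T2M_clipped"] = county["T2M"].clip(lower, upper)
```

To avoid leaking information, the quartiles would normally be estimated on the training period only and then reused for validation and test data.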

### 2.3. Temporal Aggregation and Feature Engineering  
To capture drought dynamics and temporal dependencies:
- New features will be created by computing **rolling averages or sums** over the **past 180 days** for selected meteorological variables (e.g., precipitation, temperature, humidity).  
- These aggregated statistics will help smooth short-term fluctuations and highlight medium-term climate trends relevant for drought prediction.
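
The rolling features could be computed with pandas along the following lines. This sketch assumes one row per day with no gaps, so that a window of 180 rows equals 180 days; the derived column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy gap-free daily series for a single county.
dates = pd.date_range("2016-01-01", periods=400, freq="D")
county = pd.DataFrame(
    {"PRECTOT": np.ones(400), "T2M": np.arange(400, dtype=float)}, index=dates
)

window = 180  # rows, i.e. days under the gap-free assumption

# Each value summarizes the current day and the 179 days before it,
# so no future information enters the feature.
county["PRECTOT_sum_180d"] = county["PRECTOT"].rolling(window).sum()
county["T2M_mean_180d"] = county["T2M"].rolling(window).mean()
```

The first 179 rows have no complete window and stay `NaN`; they can be dropped or handled separately before modeling.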

### 2.4. Encoding and Normalization  
- The features are all numeric, so no categorical encoding is expected; should a categorical variable (such as a state or county name) be added later, it will be label-encoded.  
- Numerical features will be scaled using **StandardScaler** or **MinMaxScaler** to improve model performance.  
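
A short sketch of the scaling step with scikit-learn's `StandardScaler`, on toy arrays. The point it illustrates is that the scaler is fitted on the training data only and then reused unchanged for later periods.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices: rows are days, columns are two features.
X_train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# Mean and standard deviation are estimated from the training data only.
scaler = StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics
```

`MinMaxScaler` can be swapped in with the same `fit` / `transform` pattern.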

### 2.5. Final Preprocessed Dataset  
After preprocessing:
- The dataset will contain only one county’s time series.  
- Each record will include meteorological and drought indices, along with the engineered rolling-window features.  
- The resulting dataset will then be analyzed in the following section.

## 3. Exploratory Data Analysis (EDA)
A descriptive statistical analysis of the preprocessed dataset will be conducted to understand the distributions, relationships and temporal patterns of the features and the target variable.

### 3.1. Data Overview  
- Display the shape of the preprocessed dataset (number of records and features).  
- Present data types, completeness, and the structure of the time series.  
- Verify that the temporal resolution and coverage are consistent.

### 3.2. Descriptive Statistics  
- Use `.describe()` for numerical columns to summarize mean, standard deviation, min, max, and quartiles.  
- For the target variable `score`, use `value_counts()` on the rounded values to examine class balance.
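
For illustration, the two summaries could look like this on a toy frame. The values are made up, and unlabeled days are dropped before counting classes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the preprocessed single-county frame.
county = pd.DataFrame({
    "T2M": [5.0, 7.5, 9.0, 12.0, 15.5, 14.0],
    "PRECTOT": [0.0, 3.2, 0.1, 12.4, 0.0, 1.1],
    "score": [0.0, np.nan, 1.2, 0.8, np.nan, 2.0],
})

# Count, mean, std, min, quartiles, and max for the numeric features.
feature_summary = county[["T2M", "PRECTOT"]].describe()

# Class balance of the rounded target, ignoring unlabeled days.
class_counts = (
    county["score"].dropna().round().astype(int).value_counts().sort_index()
)
class_share = class_counts / class_counts.sum()
```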

### 3.3. Visualization (max. 4 figures)  
Recommended figures:
1. **Histogram of drought classes** — show the frequency of each drought category.  
2. **Time series plot** — visualize drought class evolution over time for the selected county.  
3. **Correlation heatmap** — display relationships among key meteorological and drought indices.  
4. **Boxplot or line plot** — show how temperature or precipitation vary across drought categories.

Each figure should include:
- A caption (e.g., “Temporal distribution of drought severity in County X”), and  
- A short comment interpreting the observed trend.

## 5. Evaluation Protocol
Describe how the models will be evaluated, including metrics and validation strategies, hyperparameter tuning and splitting strategies.

### 5.1. Data Splitting Strategy  
- Use **temporal splitting** to preserve the chronological order of observations.  
- Keep the dataset's predefined **training**, **validation**, and **test** subsets (Section 1.1), restricted to the selected county, rather than re-splitting the data.  
- Ensure no temporal leakage by splitting along time boundaries (future data never used for training).  
- Set a fixed `random_state` for any model with stochastic components, for reproducibility (the temporal split itself involves no shuffling).
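
A temporal split amounts to cutting the series at date boundaries. The sketch below uses a toy daily frame and hypothetical cut-off dates (`val_start`, `test_start`); these should be adjusted to the boundaries actually used in the project.

```python
import pandas as pd

# Toy daily frame for one county.
dates = pd.date_range("2000-01-01", "2020-12-31", freq="D")
county = pd.DataFrame({"date": dates, "score": 0.0})

# Hypothetical cut-off dates; adjust to the boundaries actually used.
val_start = pd.Timestamp("2017-01-01")
test_start = pd.Timestamp("2019-01-01")

train = county[county["date"] < val_start]
val = county[(county["date"] >= val_start) & (county["date"] < test_start)]
test = county[county["date"] >= test_start]

# No overlap and no reordering: each subset ends before the next begins.
assert train["date"].max() < val["date"].min()
assert val["date"].max() < test["date"].min()
```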

### 5.2. Cross-Validation and Hyperparameter Tuning  
- Apply **TimeSeriesSplit** or a similar time-aware cross-validation method for model tuning.  
- Use **GridSearchCV** to optimize key hyperparameters.  
- Validation results will guide model selection before final testing.
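
As a sketch of how these pieces fit together, `TimeSeriesSplit` can be passed as the `cv` argument of `GridSearchCV`. The logistic regression model, the parameter grid, and the random toy data are placeholders chosen for illustration, not the models selected for this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Toy data; rows are assumed to be in chronological order.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] > 0).astype(int)

# Each fold trains on an initial segment and validates on the block after it.
cv = TimeSeriesSplit(n_splits=4)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),      # placeholder model
    param_grid={"C": [0.1, 1.0, 10.0]},     # placeholder grid
    cv=cv,
    scoring="f1_macro",
)
search.fit(X, y)

print(search.best_params_)
```

Because every validation block lies after its training segment, the tuning never uses future observations to predict past ones.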

### 5.3. Evaluation Metrics  
Metrics suitable for ordinal classification:

| Metric | Description | Justification |
|--------|--------------|----------------|
| **Accuracy** | Ratio of correctly predicted drought categories | Baseline metric |
| **Cohen’s Kappa (weighted)** | Measures agreement adjusted for chance; with linear or quadratic weights, larger ordinal errors are penalized more | Captures ordinal consistency |
| **F1-score (macro)** | Harmonic mean of precision and recall across classes | Accounts for class imbalance |
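
The three metrics are available in scikit-learn. The sketch below uses made-up labels, and the quadratic weighting for kappa is one possible choice for making the metric sensitive to ordinal distance.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Made-up true and predicted drought classes.
y_true = [0, 0, 1, 1, 2, 2, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3]

acc = accuracy_score(y_true, y_pred)

# Quadratic weights penalize predictions that are further from the true class.
kappa_qw = cohen_kappa_score(y_true, y_pred, weights="quadratic")

# Macro averaging gives every class the same weight, regardless of frequency.
f1_macro = f1_score(y_true, y_pred, average="macro")
```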

## 6. Summary and Next Steps  
- The dataset has been filtered to one county to simplify the problem and ensure data quality.  
- Temporal aggregation and preprocessing steps allow the model to capture meaningful drought trends.  
- The next milestone will focus on implementing and evaluating multiple machine learning models to predict drought severity levels.

**References**  
- U.S. Drought Monitor Dataset: [https://www.kaggle.com/datasets/cdminix/us-drought-meteorological-data](https://www.kaggle.com/datasets/cdminix/us-drought-meteorological-data)  
- Copernicus Climate Service: [https://www.copernicus.eu/en/copernicus-services](https://www.copernicus.eu/en/copernicus-services)

# Import Libraries

In [96]:
import pandas as pd

# Discovering the dataset

In [97]:
path = "archive/validation_timeseries/validation_timeseries.csv"
drought_df = pd.read_csv(path)
drought_df.head()

Unnamed: 0,fips,date,PRECTOT,PS,QV2M,T2M,T2MDEW,T2MWET,T2M_MAX,T2M_MIN,...,TS,WS10M,WS10M_MAX,WS10M_MIN,WS10M_RANGE,WS50M,WS50M_MAX,WS50M_MIN,WS50M_RANGE,score
0,1001,2017-01-01,32.5,100.02,10.47,14.69,14.47,14.47,17.68,10.53,...,14.63,2.14,2.71,1.52,1.19,4.4,5.96,2.25,3.71,
1,1001,2017-01-02,63.52,100.04,12.75,17.96,17.75,17.75,20.3,16.14,...,17.85,2.75,4.31,1.6,2.71,5.5,8.16,4.05,4.11,
2,1001,2017-01-03,18.82,99.69,9.74,14.24,13.44,13.44,18.48,9.29,...,14.06,2.25,3.73,1.64,2.09,4.8,7.27,2.54,4.72,2.0
3,1001,2017-01-04,0.01,100.02,5.21,8.1,3.86,3.88,11.74,2.12,...,8.08,2.63,3.95,1.34,2.6,4.98,6.16,3.36,2.8,
4,1001,2017-01-05,0.01,99.89,4.54,5.91,2.2,2.22,13.07,-0.18,...,5.85,1.76,2.76,0.47,2.28,3.43,4.7,0.66,4.04,


In [98]:
drought_df

Unnamed: 0,fips,date,PRECTOT,PS,QV2M,T2M,T2MDEW,T2MWET,T2M_MAX,T2M_MIN,...,TS,WS10M,WS10M_MAX,WS10M_MIN,WS10M_RANGE,WS50M,WS50M_MAX,WS50M_MIN,WS50M_RANGE,score
0,1001,2017-01-01,32.50,100.02,10.47,14.69,14.47,14.47,17.68,10.53,...,14.63,2.14,2.71,1.52,1.19,4.40,5.96,2.25,3.71,
1,1001,2017-01-02,63.52,100.04,12.75,17.96,17.75,17.75,20.30,16.14,...,17.85,2.75,4.31,1.60,2.71,5.50,8.16,4.05,4.11,
2,1001,2017-01-03,18.82,99.69,9.74,14.24,13.44,13.44,18.48,9.29,...,14.06,2.25,3.73,1.64,2.09,4.80,7.27,2.54,4.72,2.0
3,1001,2017-01-04,0.01,100.02,5.21,8.10,3.86,3.88,11.74,2.12,...,8.08,2.63,3.95,1.34,2.60,4.98,6.16,3.36,2.80,
4,1001,2017-01-05,0.01,99.89,4.54,5.91,2.20,2.22,13.07,-0.18,...,5.85,1.76,2.76,0.47,2.28,3.43,4.70,0.66,4.04,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2268835,56043,2018-12-27,0.14,82.71,1.54,-9.57,-14.20,-13.09,-6.23,-12.06,...,-10.10,2.01,3.56,0.23,3.33,2.67,4.70,0.28,4.42,
2268836,56043,2018-12-28,0.02,83.14,1.32,-11.25,-15.98,-14.57,-7.03,-14.33,...,-12.36,1.66,3.12,0.09,3.04,2.40,5.58,0.08,5.50,
2268837,56043,2018-12-29,0.34,82.78,1.75,-7.17,-12.62,-12.17,0.64,-14.85,...,-8.09,3.64,4.71,2.47,2.24,5.58,7.61,4.15,3.46,
2268838,56043,2018-12-30,3.17,81.97,2.72,-2.84,-7.14,-6.95,2.14,-8.49,...,-3.63,5.26,11.19,1.70,9.49,7.56,13.59,3.11,10.49,


In [99]:
drought_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2268840 entries, 0 to 2268839
Data columns (total 21 columns):
 #   Column       Dtype  
---  ------       -----  
 0   fips         int64  
 1   date         object 
 2   PRECTOT      float64
 3   PS           float64
 4   QV2M         float64
 5   T2M          float64
 6   T2MDEW       float64
 7   T2MWET       float64
 8   T2M_MAX      float64
 9   T2M_MIN      float64
 10  T2M_RANGE    float64
 11  TS           float64
 12  WS10M        float64
 13  WS10M_MAX    float64
 14  WS10M_MIN    float64
 15  WS10M_RANGE  float64
 16  WS50M        float64
 17  WS50M_MAX    float64
 18  WS50M_MIN    float64
 19  WS50M_RANGE  float64
 20  score        float64
dtypes: float64(19), int64(1), object(1)
memory usage: 363.5+ MB


In [100]:
drought_df.shape

(2268840, 21)

In [None]:
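# `score` contains NaN, so astype(int) would raise; round and use the nullable Int64 dtype instead
score_int = drought_df['score'].round().astype('Int64')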

In [107]:
round(drought_df['score'], 0).unique()

array([nan,  2.,  1.,  0.,  3.,  4.,  5.])

In [102]:
# fips is already read as int64; this cast is a no-op kept as a safeguard
drought_df['fips'] = drought_df['fips'].astype(int)

# Count rows where fips == 5119 (leading zero is dropped in integer)
count = (drought_df['fips'] == 5119).value_counts()
count

fips
False    2268110
True         730
Name: count, dtype: int64

FIPS codes are numbers that uniquely identify geographic areas. The number of digits varies with the level of geography: state-level FIPS codes have two digits, and county-level FIPS codes have five digits, of which the first two are the FIPS code of the state to which the county belongs. When looking up a county in a FIPS reference list, it is advisable to first look up the code of its state, since there are over 3,000 counties and county-equivalents (e.g. independent cities, parishes, boroughs) in the United States. Because `fips` is stored as an integer in this dataset, leading zeros are dropped (for example, `05119` appears as `5119`).
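
Since the integer column drops leading zeros, a small helper can restore the five-digit form and extract the state prefix. The function names `format_fips` and `state_fips` are hypothetical helpers, not part of the dataset or of pandas.

```python
def format_fips(code: int) -> str:
    """Return a county FIPS code as a zero-padded five-digit string."""
    return f"{code:05d}"


def state_fips(code: int) -> str:
    """Return the two-digit state prefix of a county FIPS code."""
    return format_fips(code)[:2]


print(format_fips(5119), state_fips(5119))
print(format_fips(56043), state_fips(56043))
```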