# Machine Learning Analysis in the Earth Systems Sciences


In this module, you are tasked with planning, implementing, and evaluating a machine learning solution for a real-world scenario. Given pre-configured code blocks and prepared data, you will create a problem statement, explore the data, experiment with model development, and ultimately make a recommendation on the utility of machine learning for your scenario.
To get started, first run the cell below to prepare this notebook. While that process runs, watch the following video to learn more about this scenario.

# Missing weather station in western North Carolina

Play the video below to learn about the situation.

`<video>`

`link to transcript`

## Part 1: Problem Framing

Based on the information provided in the video, which type of machine learning analysis is most appropriate for this scenario?

TK Narrative scaffolding to this:

1. Does a simpler solution exist?
2. Can machine learning requirements be met?
3. Which scientific question should be answered?

<div class="alert alert-success" role="alert">
<p class="admonition-title" style="font-weight:bold">Exercise</p>
    <p>In your <b>Machine Learning Model Handbook</b>, type the scientific question to be answered for this situation.</p>
    <p>GENERAL RUBRIC TBD</p>
</div>

## Part 2: Data Handling

You will use data from other stations in the <a href="https://econet.climate.ncsu.edu/" target="_blank">NC ECONet</a> for this project. Your colleague, a <a href="https://www.mongodb.com/resources/basics/data-engineering#what-is-data-engineering" target="_blank">data engineer</a>, has done much of the data preparation for you. They have prepared the following document describing the dataset they are providing for your model-building work.

### Metadata Document for Western North Carolina Weather Station Data

#### General Information

Dataset Name: Western NC Weather Station Time-Series Data

Description: This dataset contains tabular time-series data collected from multiple weather stations in Western North Carolina. The data includes atmospheric and environmental variables recorded at hourly intervals.

Date Range: January 1, 2017, to December 16, 2024

Geographic Coverage: Western North Carolina 

Data Frequency: Hourly

Last Updated: Jan 1, 2025

#### Data Structure

File Format: .parquet

Number of Records: 69,760 per station per feature

Columns (Features) per Station (XXXX):

- observation_datetime_station_XXXX: Date and time of observation in <UTC?>
- airtemp_degF_XXXX_station_XXXX (°F): Air temperature measured at 2 meters above ground level
- windspeed_avg_mph_XXXX_station_XXXX (mph): Average wind speed during the hour at <2? 6? 10?> meters above ground level
- winddgust_mph_XXXX_station_XXXX (mph): <Peak wind gust during the hour at <2?> meters above ground level>
- rhavg_percent_XXXX_station_XXXX (%): Relative humidity
- precip_in_XXXX_station_XXXX (in): Total precipitation accumulated in the hour at <1? 2?> meters above ground level <need snow equivalent info>
- date_station_XXXX: <>
- day_index: <>
- hour_index: <>

Stations:

- BEAR (Bearwallow Mountain)
- BURN (Burnsville Tower)
- FRYI (Frying Pan Mountain)
- JEFF (Mount Jefferson Tower)
- MITC (Mount Mitchell State Park)
- NCAT (North Carolina A&T University Research Farm)
- SALI (Piedmont Research Station)
- SASS (Sassafras Mountain)
- UNCA (University of North Carolina - Asheville Weather Tower)
- WINE (Wayah Bald Mountain)
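
The naming pattern in the column list above can be turned into a small helper that builds the expected column names for any station code. This is only a sketch: `FEATURE_TEMPLATES` and `station_columns` are hypothetical names introduced here, not part of the dataset or of any library.

```python
# Feature name templates copied from the column list above; "{s}" is the station code.
FEATURE_TEMPLATES = [
    "observation_datetime_station_{s}",
    "airtemp_degF_{s}_station_{s}",
    "windspeed_avg_mph_{s}_station_{s}",
    "winddgust_mph_{s}_station_{s}",  # spelling follows the dataset's column name
    "rhavg_percent_{s}_station_{s}",
    "precip_in_{s}_station_{s}",
    "date_station_{s}",
]


def station_columns(station: str) -> list[str]:
    """Return the expected per-station column names (hypothetical helper)."""
    return [template.format(s=station) for template in FEATURE_TEMPLATES]


burn_cols = station_columns("BURN")
print(burn_cols)
```

Once the dataset is loaded into a DataFrame, a call such as `df[station_columns("BURN")]` should select one station's columns, assuming every station follows this pattern. `day_index` and `hour_index` carry no station suffix, so they are left out of the templates.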

#### Data Quality

Missing Data: Timestamps with no recorded data are marked as <>. <Other info about handling missing data>

Outlier Handling: <outside range handling>
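
Until the missing-data marker and outlier rules are documented, a quick check along the following lines may help you see what the data actually contains. This sketch uses an invented toy table rather than the real dataset, and it assumes that gaps are stored as `NaN`, which the metadata does not yet confirm. The 0 to 100 % bound for relative humidity is a physical sanity check, not the data engineer's documented rule.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one station's columns; the values are invented for illustration.
toy = pd.DataFrame({
    "airtemp_degF_BURN_station_BURN": [41.2, np.nan, 39.8, 38.5],
    "rhavg_percent_BURN_station_BURN": [86.3, 99.5, 104.0, np.nan],
})

# Count missing values per column (assumption: gaps are stored as NaN).
missing_counts = toy.isna().sum()

# Flag relative humidity readings outside the 0-100 % range for review.
rh = toy["rhavg_percent_BURN_station_BURN"]
rh_out_of_range = (rh < 0) | (rh > 100)

print(missing_counts)
print("RH readings flagged:", int(rh_out_of_range.sum()))
```

Comparisons against `NaN` evaluate to `False`, so missing readings are counted by `isna()` but are not flagged as out of range.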

#### Data Provenance

Source: North Carolina State Climate Office ECONet, <a href="https://econet.climate.ncsu.edu/about/" target="_blank">https://econet.climate.ncsu.edu/about/</a>

#### Data Transformations

Time Normalization: <?>

Unit Conversion: <?>

Aggregations: <?>

In [19]:
import pandas as pd

file_path = "processed_data/NC_processed_data_12_31.parquet"
df = pd.read_parquet(file_path) 

In [20]:
df.columns

Index(['observation_datetime_station_BURN', 'airtemp_degF_BURN_station_BURN',
       'windspeed_avg_mph_BURN_station_BURN',
       'winddgust_mph_BURN_station_BURN', 'rhavg_percent_BURN_station_BURN',
       'precip_in_BURN_station_BURN', 'date_station_BURN', 'day_index',
       'hour_index', 'observation_datetime_station_NCAT',
       'airtemp_degF_NCAT_station_NCAT', 'windspeed_avg_mph_NCAT_station_NCAT',
       'winddgust_mph_NCAT_station_NCAT', 'rhavg_percent_NCAT_station_NCAT',
       'precip_in_NCAT_station_NCAT', 'date_station_NCAT',
       'observation_datetime_station_SALI', 'airtemp_degF_SALI_station_SALI',
       'windspeed_avg_mph_SALI_station_SALI',
       'winddgust_mph_SALI_station_SALI', 'rhavg_percent_SALI_station_SALI',
       'precip_in_SALI_station_SALI', 'date_station_SALI',
       'observation_datetime_station_MITC', 'airtemp_degF_MITC_station_MITC',
       'windspeed_avg_mph_MITC_station_MITC',
       'winddgust_mph_MITC_station_MITC', 'rhavg_percent_MITC_station

In [21]:
df["rhavg_percent_MORG_station_MORG"]

0         86.3
1         86.3
2         86.3
3         86.3
4         86.3
         ...  
69755     99.5
69756    100.0
69757    100.0
69758    100.0
69759    100.0
Name: rhavg_percent_MORG_station_MORG, Length: 69760, dtype: float64

In [13]:
df["observation_datetime_station_NCAT"]

0        2017-01-01 00:00:00
1        2017-01-01 01:00:00
2        2017-01-01 02:00:00
3        2017-01-01 03:00:00
4        2017-01-01 04:00:00
                ...         
69755    2024-12-16 19:00:00
69756    2024-12-16 20:00:00
69757    2024-12-16 21:00:00
69758    2024-12-16 22:00:00
69759    2024-12-16 23:00:00
Name: observation_datetime_station_NCAT, Length: 69760, dtype: object
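
The output above reports `dtype: object`, which suggests the timestamps are stored as strings rather than as datetimes. A minimal sketch of converting them and checking for gaps in the hourly sequence follows. It runs on a small invented table, not the real dataset, and the variable names `toy`, `times`, `steps`, and `gaps` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the timestamp column; the real column holds strings like these.
toy = pd.DataFrame({
    "observation_datetime_station_NCAT": [
        "2017-01-01 00:00:00",
        "2017-01-01 01:00:00",
        "2017-01-01 03:00:00",  # 02:00 is deliberately missing
        "2017-01-01 04:00:00",
    ]
})

# Convert the strings to datetimes so that time arithmetic works.
times = pd.to_datetime(toy["observation_datetime_station_NCAT"])

# Difference between consecutive timestamps; a step longer than 1 hour marks a gap.
steps = times.diff()
gaps = times[steps > pd.Timedelta(hours=1)]

print(times.dtype)
print(gaps)
```

Each entry in `gaps` is the first timestamp after a break in the hourly sequence. The same pattern could be applied to a real station column to see whether the record count matches a complete hourly series.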