Introduction
===============

PRISM provides gridded climate data, including precipitation, for a given location, going back as far as 1981.
This dataset was generated with the parameters:
* Location:  Lat: 35.5515   Lon: -97.4072   Elev: 1171ft
* Climate variables: ppt,tmin,tmean,tmax,tdmean,vpdmin,vpdmax
* Spatial resolution: 4km
* Period: 1981-01-01 - 2022-12-31
* Dataset: AN91d
* PRISM day definition: 24 hours ending at 1200 UTC on the day shown
* Grid Cell Interpolation: Off
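
To make the PRISM day definition concrete, the sketch below converts the 1200 UTC boundary to local time. It assumes the location falls in the `America/Chicago` time zone (central Oklahoma), and `prism_day_window` is a hypothetical helper name; neither comes from the PRISM metadata.

```python
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: the grid cell lies in the US Central time zone.
LOCAL_TZ = ZoneInfo("America/Chicago")


def prism_day_window(day: date):
    """Return the (start, end) of a PRISM day, expressed in local time.

    A PRISM day is the 24 hours ending at 1200 UTC on the date shown.
    """
    end_utc = datetime(day.year, day.month, day.day, 12, tzinfo=timezone.utc)
    start_utc = end_utc - timedelta(hours=24)
    return start_utc.astimezone(LOCAL_TZ), end_utc.astimezone(LOCAL_TZ)


start, end = prism_day_window(date(2022, 7, 1))
print(start.isoformat(), "->", end.isoformat())
```

Under this assumption, each PRISM day closes in the early morning local time (6 or 7 AM depending on daylight saving), so the value recorded for a date mostly reflects weather from the previous local day.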

This notebook demonstrates the data analysis and data modelling process as part of a portfolio. The ultimate goal is to predict total daily precipitation (rain + melted snow).

Feature Explanation
--------------------

| Name   | Description                                                                                                           | Units |
| ------ | --------------------------------------------------------------------------------------------------------------------- | ----- |
| ppt    | Total precipitation (rain+melted snow) for the day                                                                    | in    |
| tmin   | Minimum temperature for the day                                                                                       | F     |
| tmean  | Mean temperature for the day (tmax+tmin)/2                                                                            | F     |
| tmax   | Maximum temperature for the day                                                                                       | F     |
| tdmean | Mean dewpoint temperature (analogous to humidity and comfort)                                                         | F     |
| vpdmin | Minimum vapor pressure deficit: gap between the air's actual moisture and the amount it could hold at saturation      | hPa   |
| vpdmax | Maximum vapor pressure deficit: gap between the air's actual moisture and the amount it could hold at saturation      | hPa   |
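
As a quick sanity check on the `tmean` definition in the table, the sketch below tests whether the mean temperature matches (tmax+tmin)/2. The five rows are copied from the first rows of the dataset as printed later in this notebook, using the renamed columns; the 0.1 °F tolerance is an assumption to absorb rounding in the published values.

```python
import pandas as pd

# First five days of 1981, copied from the printed dataset (values in °F).
sample = pd.DataFrame({
    "min_temp": [36.8, 28.0, 26.2, 20.0, 20.2],
    "mean_temp": [49.6, 41.6, 40.7, 33.0, 31.0],
    "max_temp": [62.5, 55.2, 55.1, 45.9, 41.9],
})

# Recompute the mean from the daily extremes.
midpoint = (sample["max_temp"] + sample["min_temp"]) / 2
gap = (sample["mean_temp"] - midpoint).abs()

# Assumed tolerance: values are published to one decimal place.
TOLERANCE = 0.1
print(gap.round(2).tolist())
print("all within tolerance:", bool((gap <= TOLERANCE).all()))
```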

In [14]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from matplotlib import pyplot as plt
import utils

In [12]:
"""
First, read in the dataframe, clean up the column names,
and put the columns in the order we want. We also parse
the date and extract its components.
"""
df = pd.read_csv("prism_ok_country_precip_data.csv")
col_names = ["date","precip","min_temp","mean_temp","max_temp","dewpoint","vpd_min","vpd_max"]
df.columns = col_names
# Parse the date column once and reuse it for every derived field.
dates = pd.to_datetime(df["date"])
df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day"] = dates.dt.day
df["jul_day"] = dates.dt.dayofyear  # ordinal day of the year (1-366)
# Drop the raw date and move the target (precip) to the last column.
col_order = ["year","month","day","jul_day","min_temp","mean_temp","max_temp","dewpoint","vpd_min","vpd_max","precip"]
df = df.loc[:, col_order]
print(df)
df.info()
df.to_csv("rain.csv", index=False)

       year  month  day  jul_day  min_temp  mean_temp  max_temp  dewpoint  \
0      1981      1    1        1      36.8       49.6      62.5      30.2   
1      1981      1    2        2      28.0       41.6      55.2      24.4   
2      1981      1    3        3      26.2       40.7      55.1      27.5   
3      1981      1    4        4      20.0       33.0      45.9      22.1   
4      1981      1    5        5      20.2       31.0      41.9      14.8   
...     ...    ...  ...      ...       ...        ...       ...       ...   
15335  2022     12   27      361      16.2       31.2      46.1      16.3   
15336  2022     12   28      362      18.1       34.5      50.9      21.2   
15337  2022     12   29      363      43.8       55.8      67.9      39.0   
15338  2022     12   30      364      34.3       50.3      66.3      32.1   
15339  2022     12   31      365      31.1       44.0      56.8      33.3   

       vpd_min  vpd_max  precip  
0         2.10    13.32     0.0  
1      
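
The printed frame ends at index 15339, which means 15,340 rows. A minimal sketch for confirming that this corresponds to exactly one row per calendar day over the stated period might look like the following. The `expected` range is built from the period given in the introduction, and `has_full_coverage` is a hypothetical helper, not part of `utils`.

```python
import pandas as pd

# Every calendar day in the period stated in the introduction.
expected = pd.date_range("1981-01-01", "2022-12-31", freq="D")
print(len(expected))


def has_full_coverage(frame: pd.DataFrame) -> bool:
    """Check that the year/month/day columns match `expected` day for day."""
    observed = pd.to_datetime(frame[["year", "month", "day"]])
    if len(observed) != len(expected):
        return False
    return bool((observed.values == expected.values).all())


# Demo on a synthetic frame built from the expected dates themselves.
demo = pd.DataFrame(
    {"year": expected.year, "month": expected.month, "day": expected.day}
)
print(has_full_coverage(demo))                  # complete calendar
print(has_full_coverage(demo.drop(index=100)))  # one day removed
```

On the real data, the same helper could be called as `has_full_coverage(df)` once the date columns have been extracted.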

In [13]:
"""
Now we can start to look at the condition of the
data, as well as some descriptive stats.
"""
df = pd.read_csv("rain.csv")
print(df.describe())

               year         month           day       jul_day      min_temp  \
count  15340.000000  15340.000000  15340.000000  15340.000000  15340.000000   
mean    2001.500326      6.523077     15.729205    183.119296     50.128038   
std       12.120919      3.448775      8.800196    105.438628     17.919648   
min     1981.000000      1.000000      1.000000      1.000000    -10.800000   
25%     1991.000000      4.000000      8.000000     92.000000     35.500000   
50%     2001.500000      7.000000     16.000000    183.000000     50.900000   
75%     2012.000000     10.000000     23.000000    274.000000     66.400000   
max     2022.000000     12.000000     31.000000    366.000000     85.900000   

          mean_temp      max_temp      dewpoint       vpd_min       vpd_max  \
count  15340.000000  15340.000000  15340.000000  15340.000000  15340.000000   
mean      60.747106     71.371076     47.666173      2.192583     17.431450   
std       17.958043     18.687281     17.427147    
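
Summary statistics like these can also hint at skew: a mean well above the median suggests a right-skewed column. The sketch below illustrates the idea on synthetic data (not the PRISM values), using a made-up `precip` series that is zero on most days; the 25% wet-day share and the exponential scale are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for daily precipitation: mostly dry days, occasional rain.
n_days = 5000
wet = rng.random(n_days) < 0.25
amounts = rng.exponential(0.4, size=n_days)
precip = pd.Series(np.where(wet, amounts, 0.0), name="precip")

stats = precip.describe()
print(stats[["mean", "50%", "75%", "max"]])

# Positive sample skewness and mean > median both point to a right skew.
print("skewness:", round(precip.skew(), 2))
print("mean above median:", bool(stats["mean"] > stats["50%"]))
```

For a target this skewed, one option worth considering is a split that keeps a similar share of wet and dry days in each partition.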

In [None]:
"""
The temperatures appear to fall within reasonable ranges.
The data covers Oklahoma County, and as a resident I can
say from local knowledge that these values are plausible.
If any outliers are present, they are genuine data points
rather than errors.

I have no domain knowledge of dewpoint or VPD min/max, so
we will take a closer look at their spreads.

We can also see from the quantiles that precipitation is
right-skewed and may contain some high outliers. We are not
concerned with fixing the skew, since this is our target
variable, but we need to make conscious choices when splitting
the data.
"""
cols = ["dewpoint","vpd_min","vpd_max","precip"]
