In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('Heatvars_County_2000-2020_v1.2.csv')

Input Data: 
* [Source: Spangler, Liang, Wellenius via figshare.com](https://figshare.com/articles/dataset/Daily_County-Level_Wet-Bulb_Globe_Temperature_Universal_Thermal_Climate_Index_and_Other_Heat_Metrics_for_the_Contiguous_United_States_2000-2020/19419836)
* [Data Dictionary: Nature.com](https://www.nature.com/articles/s41597-022-01405-3/tables/4)
* citation: Spangler, Keith (2022). Daily, County-Level Wet-Bulb Globe Temperature, Universal Thermal Climate Index, and Other Heat Metrics for the Contiguous United States, 2000-2020. figshare. Dataset. https://doi.org/10.6084/m9.figshare.19419836.v2

Preprocessing note: The data was downloaded from the source above in .rds format. The R code below, run in RStudio, converted the file to .csv for use outside R:
```
rds_file <- "~/Downloads/Heatvars_County_2000-2020_v1.2.Rds"
data <- readRDS(rds_file)
csv_file <- "~/DSI-508/Projects/group-project/Heatvars_County_2000-2020_v1.2.csv"
write.csv(data, file = csv_file, row.names = FALSE)
```

---
***Original Data Dictionary***
|Feature| Variable Name (Long) | Description / Format                                     | Units |
|----------------------|---------------------|---------------------------------------------------------|-------|
| StCoFIPS             | State-county Federal Information Processing Standard (FIPS) Identifier | Unique county identifier: concatenation of two-digit state identifier and three-digit county identifier | N/A   |
| Date                 | Date                | Local day in the format YYYYMMDD                         | N/A   |
| Tmin_C               | Daily Minimum Ambient Temperature | Lowest 2-meter ambient temperature observed from hourly data from 00 LST to 23 LST | °C    |
| Tmax_C               | Daily Maximum Ambient Temperature | Highest 2-meter ambient temperature observed from hourly data from 00 LST to 23 LST | °C    |
| Tmean_C              | Daily Mean Ambient Temperature | 2-meter ambient temperature averaged over hourly observations from 00 LST to 23 LST | °C    |
| TDmin_C              | Daily Minimum Dew Point Temperature | Lowest dew point temperature observed from hourly data from 00 LST to 23 LST | °C    |
| TDmax_C              | Daily Maximum Dew Point Temperature | Highest dew point temperature observed from hourly data from 00 LST to 23 LST | °C    |
| TDmean_C             | Daily Mean Dew Point Temperature | Dew point temperature averaged over hourly observations from 00 LST to 23 LST | °C    |
| NETmin_C             | Daily Minimum Net Effective Temperature | Lowest net effective temperature observed from hourly data from 00 LST to 23 LST | °C    |
| NETmax_C             | Daily Maximum Net Effective Temperature | Highest net effective temperature observed from hourly data from 00 LST to 23 LST | °C    |
| NETmean_C            | Daily Mean Net Effective Temperature | Net effective temperature averaged over hourly observations from 00 LST to 23 LST | °C    |
| HImin_C              | Daily Minimum Heat Index | Lowest heat index observed from hourly data from 00 LST to 23 LST | °C    |
| HImax_C              | Daily Maximum Heat Index | Highest heat index observed from hourly data from 00 LST to 23 LST | °C    |
| HImean_C             | Daily Mean Heat Index | Heat index averaged over hourly observations from 00 LST to 23 LST | °C    |
| HXmin_C              | Daily Minimum Humidex | Lowest humidex observed from hourly data from 00 LST to 23 LST | °C    |
| HXmax_C              | Daily Maximum Humidex | Highest humidex observed from hourly data from 00 LST to 23 LST | °C    |
| HXmean_C             | Daily Mean Humidex | Humidex averaged over hourly observations from 00 LST to 23 LST | °C    |
| WBGTmin_C            | Daily Minimum Wet-Bulb Globe Temperature | Lowest wet-bulb globe temperature (WBGT) observed from hourly data from 00 LST to 23 LST | °C    |
| WBGTmax_C            | Daily Maximum Wet-Bulb Globe Temperature | Highest wet-bulb globe temperature (WBGT) observed from hourly data from 00 LST to 23 LST | °C    |
| WBGTmean_C           | Daily Mean Wet-Bulb Globe Temperature | Wet-bulb globe temperature (WBGT) averaged over hourly observations from 00 LST to 23 LST | °C    |
| UTCImin_C            | Daily Minimum Universal Thermal Climate Index | Lowest Universal Thermal Climate Index (UTCI) observed from hourly data from 00 LST to 23 LST | °C    |
| UTCImax_C            | Daily Maximum Universal Thermal Climate Index | Highest Universal Thermal Climate Index (UTCI) observed from hourly data from 00 LST to 23 LST | °C    |
| UTCImean_C           | Daily Mean Universal Thermal Climate Index | Universal Thermal Climate Index (UTCI) averaged over hourly data from 00 LST to 23 LST | °C    |
| Flag_T               | Ambient temperature flag | Indicator of the percent of county population represented by the county-day ambient temperature estimate. 0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) | N/A   |
| Flag_TD              | Dew point temperature flag | Indicator of the percent of county population represented by the county-day dew point temperature estimate. 0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) | N/A   |
| Flag_NET             | Net effective temperature flag | Indicator of the percent of county population represented by the county-day net effective temperature estimate. 0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) | N/A   |
| Flag_HI              | Heat index flag | Indicator of the percent of county population represented by the county-day heat index estimate. 0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) | N/A   |
| Flag_HX              | Humidex flag | Indicator of the percent of county population represented by the county-day humidex estimate. 0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) | N/A   |
| Flag_WBGT            | Wet-bulb globe temperature flag | Indicator of the percent of county population represented by the county-day WBGT estimate. 0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) | N/A   |
| Flag_UTCI            | Universal Thermal Climate Index flag | Indicator of the percent of county population represented by the county-day UTCI estimate. 0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) | N/A   |

---
***Model Usage Data Dictionary***
|Feature| Variable Name (Long) | Description / Format                                     | Units |
|----------------------|---------------------|---------------------------------------------------------|-------|
| Year                 | Calendar Year | Year taken from the daily Date (2000–2020); one row per county per year | N/A   |
| StCoFIPS             | State-county Federal Information Processing Standard (FIPS) Identifier | Unique county identifier: concatenation of two-digit state identifier and three-digit county identifier (stored as an integer, so leading zeros are not shown) | N/A   |
| Tmean_C              | Annual Mean Ambient Temperature | Mean over the year of the daily mean 2-meter ambient temperature, rounded to two decimals | °C    |
| TDmean_C             | Annual Mean Dew Point Temperature | Mean over the year of the daily mean dew point temperature, rounded to two decimals | °C    |
| NETmean_C            | Annual Mean Net Effective Temperature | Mean over the year of the daily mean net effective temperature, rounded to two decimals | °C    |
| HImean_C             | Annual Mean Heat Index | Mean over the year of the daily mean heat index, rounded to two decimals | °C    |
| HXmean_C             | Annual Mean Humidex | Mean over the year of the daily mean humidex, rounded to two decimals | °C    |
| WBGTmean_C           | Annual Mean Wet-Bulb Globe Temperature | Mean over the year of the daily mean wet-bulb globe temperature (WBGT), rounded to two decimals | °C    |
| Flag_T               | Ambient temperature flag (annual mean) | Mean over the year of the daily flag codes for the ambient temperature estimate (0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) of county population represented), rounded to two decimals | N/A   |
| Flag_TD              | Dew point temperature flag (annual mean) | Mean over the year of the daily flag codes for the dew point temperature estimate (0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) of county population represented), rounded to two decimals | N/A   |
| Flag_NET             | Net effective temperature flag (annual mean) | Mean over the year of the daily flag codes for the net effective temperature estimate (0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) of county population represented), rounded to two decimals | N/A   |
| Flag_HI              | Heat index flag (annual mean) | Mean over the year of the daily flag codes for the heat index estimate (0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) of county population represented), rounded to two decimals | N/A   |
| Flag_HX              | Humidex flag (annual mean) | Mean over the year of the daily flag codes for the humidex estimate (0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) of county population represented), rounded to two decimals | N/A   |
| Flag_WBGT            | Wet-bulb globe temperature flag (annual mean) | Mean over the year of the daily flag codes for the WBGT estimate (0: ≥50%, 1: 10–49%, 2: <10%, 3: 0% (NA) of county population represented), rounded to two decimals | N/A   |




In [8]:
df.shape

(23835252, 30)

In [15]:
print(f"The Original DataFrame is {round(df.size / 1_000_000,2)} million cells")

The Original DataFrame is 715.06 million cells
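At roughly 715 million cells, the full frame may be heavy on memory. One option is to load only the columns that are kept later and to request smaller dtypes up front. The sketch below uses a tiny in-memory stand-in for the CSV (two rows copied from the `df.head()` output) so it runs on its own; the column subset and the `int32`/`float32`/`int8` choices are assumptions, and the integer dtypes assume those columns contain no missing values.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV so the sketch is self-contained.
sample_csv = io.StringIO(
    "StCoFIPS,Date,Tmin_C,Tmax_C,Tmean_C,Flag_T\n"
    "1001,2000-01-02,14.73,21.53,17.27,0\n"
    "1003,2000-01-02,16.67,21.72,18.74,0\n"
)

# Read only the needed columns and ask for compact dtypes (assumed choices).
keep_cols = ["StCoFIPS", "Date", "Tmean_C", "Flag_T"]
small_df = pd.read_csv(
    sample_csv,
    usecols=keep_cols,
    dtype={"StCoFIPS": "int32", "Tmean_C": "float32", "Flag_T": "int8"},
    parse_dates=["Date"],
)
print(small_df.dtypes)
```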


In [18]:
df.head()

Unnamed: 0,StCoFIPS,Date,Tmin_C,Tmax_C,Tmean_C,TDmin_C,TDmax_C,TDmean_C,NETmin_C,NETmax_C,...,UTCImin_C,UTCImax_C,UTCImean_C,Flag_T,Flag_TD,Flag_NET,Flag_HI,Flag_HX,Flag_WBGT,Flag_UTCI
0,1001,2000-01-02,14.73,21.53,17.27,14.52,16.3,15.38,8.19,15.48,...,12.18,24.5,16.38,0,0,0,0,0,0,0
1,1003,2000-01-02,16.67,21.72,18.74,16.14,18.41,17.43,9.75,14.75,...,11.95,24.35,16.66,0,0,0,0,0,0,0
2,1005,2000-01-02,13.88,20.08,16.21,13.8,16.25,14.92,7.23,15.01,...,11.45,22.8,15.67,0,0,0,0,0,0,0
3,1007,2000-01-02,13.55,20.95,16.58,13.44,16.69,15.02,6.46,15.0,...,9.62,24.41,15.45,0,0,0,0,0,0,0
4,1009,2000-01-02,13.58,18.49,15.32,13.46,15.8,14.56,5.96,12.11,...,9.98,21.16,13.61,0,0,0,0,0,0,0
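The data dictionary describes `StCoFIPS` as a two-digit state identifier followed by a three-digit county identifier, but the values above were read as integers, so the leading zero is lost (`1001` rather than `01001`). If a later join needs the five-character form, a minimal sketch of restoring it follows; the sample values are illustrative and the variable names are hypothetical.

```python
import pandas as pd

# Illustrative FIPS values as they appear after read_csv (integers).
fips = pd.Series([1001, 1003, 48201], name="StCoFIPS")

# Zero-pad to five characters, then split into state and county parts.
fips_str = fips.astype(str).str.zfill(5)
state_part = fips_str.str[:2]
county_part = fips_str.str[2:]

print(fips_str.tolist())
```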


In [20]:
# Reduce columns: drop the daily minimum and maximum of each metric and keep only the daily means.
# UTCI is dropped entirely (UTCImin_C, UTCImax_C, UTCImean_C, and Flag_UTCI).
# This dataset is daily, but all other sources are annual, so it is aggregated to annual means below.
df.drop(columns = [
 'Tmin_C', 'Tmax_C',
 'TDmin_C', 'TDmax_C',
 'NETmin_C', 'NETmax_C',
 'HImin_C', 'HImax_C',
 'HXmin_C', 'HXmax_C',
 'WBGTmin_C', 'WBGTmax_C',
 'UTCImin_C', 'UTCImax_C', 'UTCImean_C',
 'Flag_UTCI'
], inplace = True)
# Kept: StCoFIPS, Date, the six remaining *mean_C columns, and their six Flag_* columns.

In [22]:
df.Date = pd.to_datetime(df.Date, format = '%Y-%m-%d')  # %m is the month; %M would read the month as minutes

In [36]:
df.head()

Unnamed: 0,StCoFIPS,Date,Tmean_C,TDmean_C,NETmean_C,HImean_C,HXmean_C,WBGTmean_C,Flag_T,Flag_TD,Flag_NET,Flag_HI,Flag_HX,Flag_WBGT
0,1001,2000-01-02,17.27,15.38,10.78,17.39,21.41,16.71,0,0,0,0,0,0
1,1003,2000-01-02,18.74,17.43,11.65,19.09,24.24,18.75,0,0,0,0,0,0
2,1005,2000-01-02,16.21,14.92,10.19,16.31,20.08,15.85,0,0,0,0,0,0
3,1007,2000-01-02,16.58,15.02,10.03,16.67,20.51,16.11,0,0,0,0,0,0
4,1009,2000-01-02,15.32,14.56,8.23,15.4,18.96,15.12,0,0,0,0,0,0
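Format directives passed to `pd.to_datetime` are case-sensitive: `%m` is the month, while `%M` is the minute. A minimal sketch of how the two read the same string, using a made-up date, since the difference is easy to miss when only the year is used afterwards:

```python
import pandas as pd

raw = pd.Series(["2000-07-04"])  # made-up date: 4 July 2000

# Second field read as the month.
as_month = pd.to_datetime(raw, format="%Y-%m-%d")
# Second field read as minutes; the month falls back to January.
as_minute = pd.to_datetime(raw, format="%Y-%M-%d")

print(as_month.iloc[0])   # 2000-07-04 00:00:00
print(as_minute.iloc[0])  # 2000-01-04 00:07:00
```

The year comes out the same either way, so a yearly aggregate is unaffected, but anything that depends on the month or on exact dates is not.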


### Create an Annual Average Dataset

In [84]:
# Average the numeric columns per year and county.
# numeric_only=True keeps the datetime Date column out of the mean, so it does not need to be dropped afterwards.
annual_df = df.groupby([df.Date.dt.year, df.StCoFIPS]).mean(numeric_only=True).round(2)

In [85]:
annual_df.head()

Date,StCoFIPS,Tmean_C,TDmean_C,NETmean_C,HImean_C,HXmean_C,WBGTmean_C,Flag_T,Flag_TD,Flag_NET,Flag_HI,Flag_HX,Flag_WBGT
2000,1001,18.45,11.49,12.37,18.48,21.36,16.73,0.0,0.0,0.0,0.0,0.0,0.0
2000,1003,20.07,14.69,13.78,20.7,24.86,18.9,0.0,0.0,0.0,0.0,0.0,0.0
2000,1005,18.63,12.14,12.92,18.72,21.87,17.17,0.0,0.0,0.0,0.0,0.0,0.0
2000,1007,18.01,11.13,12.26,18.04,20.81,16.42,0.0,0.0,0.0,0.0,0.0,0.0
2000,1009,16.65,10.19,10.74,16.57,19.02,15.25,0.0,0.0,0.0,0.0,0.0,0.0


In [86]:
df.loc[(df['StCoFIPS']==1001) & (df['Date'].dt.year == 2000), 'Tmean_C'].mean() # confirmation: rounds to the 18.45 shown above

18.451506849315066

In [88]:
annual_df.reset_index(inplace=True)

In [89]:
annual_df.rename(columns = {'Date': 'Year'}, inplace = True)

In [90]:
annual_df.shape

(65268, 14)
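This shape is consistent with one row per county per year: if all 21 years from 2000 through 2020 are present for every county, 65,268 rows correspond to 3,108 counties. A minimal sketch of that arithmetic, plus a balance check run on a small hypothetical frame (`toy`) rather than on `annual_df` itself:

```python
import pandas as pd

# Arithmetic on the reported shape, assuming 21 calendar years (2000 through 2020).
n_rows, n_years = 65268, 21
assert n_rows % n_years == 0
n_counties = n_rows // n_years  # 3108

# The same idea as a reusable check, shown on a hypothetical toy frame.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001],
    "StCoFIPS": [1001, 1003, 1001, 1003],
})
is_balanced = len(toy) == toy["Year"].nunique() * toy["StCoFIPS"].nunique()
has_no_duplicates = not toy.duplicated(["Year", "StCoFIPS"]).any()
print(n_counties, is_balanced, has_no_duplicates)
```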

In [92]:
annual_df.head()

Unnamed: 0,Year,StCoFIPS,Tmean_C,TDmean_C,NETmean_C,HImean_C,HXmean_C,WBGTmean_C,Flag_T,Flag_TD,Flag_NET,Flag_HI,Flag_HX,Flag_WBGT
0,2000,1001,18.45,11.49,12.37,18.48,21.36,16.73,0.0,0.0,0.0,0.0,0.0,0.0
1,2000,1003,20.07,14.69,13.78,20.7,24.86,18.9,0.0,0.0,0.0,0.0,0.0,0.0
2,2000,1005,18.63,12.14,12.92,18.72,21.87,17.17,0.0,0.0,0.0,0.0,0.0,0.0
3,2000,1007,18.01,11.13,12.26,18.04,20.81,16.42,0.0,0.0,0.0,0.0,0.0,0.0
4,2000,1009,16.65,10.19,10.74,16.57,19.02,15.25,0.0,0.0,0.0,0.0,0.0,0.0
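Because the flag columns are averaged along with the temperatures, an annual flag value is the mean of the daily codes 0 to 3, not a code itself. A value of 0.0 suggests that nearly every day in that county-year fell in the best coverage category (≥50% of the county population); since values are rounded to two decimals, a very small nonzero mean would also display as 0.0. A minimal sketch of keeping only county-years where every flag is 0, assuming that threshold suits the analysis; the second row and its FIPS code are made up:

```python
import pandas as pd

flag_cols = ["Flag_T", "Flag_TD", "Flag_NET", "Flag_HI", "Flag_HX", "Flag_WBGT"]

# Hypothetical annual rows: the first mirrors the output above, the second is invented.
toy_annual = pd.DataFrame({
    "Year": [2000, 2000],
    "StCoFIPS": [1001, 99999],
    "Tmean_C": [18.45, 10.00],
    **{col: [0.0, 0.4] for col in flag_cols},
})

# Keep county-years whose rounded annual flag means are all zero.
all_flags_zero = (toy_annual[flag_cols] == 0).all(axis=1)
clean = toy_annual[all_flags_zero]
print(clean[["Year", "StCoFIPS", "Tmean_C"]])
```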


In [96]:
annual_df.to_csv('annual_temperature_2000-2020_FIPS.csv', index = False)
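
Since the other project sources are described as annual, the exported file would presumably be joined to them on year and county. A minimal sketch of such a join follows; `other_df` and its `population` column are hypothetical stand-ins, and the values in it are invented.

```python
import pandas as pd

# Two rows in the shape of the exported annual file.
annual = pd.DataFrame({
    "Year": [2000, 2000],
    "StCoFIPS": [1001, 1003],
    "Tmean_C": [18.45, 20.07],
})

# Hypothetical second annual, county-level source.
other_df = pd.DataFrame({
    "Year": [2000, 2000],
    "StCoFIPS": [1001, 1003],
    "population": [100, 200],
})

# validate raises if either side has duplicate (Year, StCoFIPS) keys.
merged = annual.merge(other_df, on=["Year", "StCoFIPS"], how="left", validate="one_to_one")
print(merged)
```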