### MAC Season 6 Data Cleaning
#### Traits
- aboveground dry biomass
- canopy height (time series)

#### Season Dates
- Planting: 2018-04-25
- Last Day of Harvest: 2018-08-01

This notebook contains the code Emily Cain used to clean and curate sorghum data from MAC Season 6. The input csv files were queried from betydb version 1 in March 2020 and from the MAC weather station website. To run the entire notebook and write the output csv files for the traits above, select `Run` and then `Run All Cells` from the notebook menu. For now, the notebook must be run in the CyVerse Discovery Environment, where the input data are stored. Once all cells have executed, the output csv files appear in the file panel on the left; right-click a file to download it. If you experience any problems or have questions, please e-mail ejcain@arizona.edu.

#### Custom Functions Used

In [1]:
def check_for_subplots(df):
    
    """
    Function takes a dataframe as argument and checks for sitename subplots ending in ' E' or ' W'
    Will return rows with subplots, if any.
    """
    return df.loc[(df.sitename.str.endswith(' E')) | (df.sitename.str.endswith(' W'))]
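
As a quick illustration of what the helper returns, here is a minimal sketch on toy data. The `toy` dataframe and its `value` column are made up for the example and are not rows from the Season 6 file; the function is repeated so the block runs on its own.

```python
import pandas as pd

def check_for_subplots(df):
    """Return rows whose sitename ends in ' E' or ' W' (subplots), if any."""
    return df.loc[df.sitename.str.endswith(' E') | df.sitename.str.endswith(' W')]

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    'sitename': [
        'MAC Field Scanner Season 6 Range 2 Column 2',
        'MAC Field Scanner Season 6 Range 40 Column 6 W',
        'MAC Field Scanner Season 6 Range 40 Column 6 E',
    ],
    'value': [1.0, 2.0, 3.0],
})

subplot_rows = check_for_subplots(toy)
print(subplot_rows)  # only the ' W' and ' E' rows remain
```

An empty result (header only) means the dataframe contains no E or W subplots.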

#### I. Import python packages

In [2]:
import datetime
import numpy as np
import pandas as pd
import sqlalchemy
import sqlite3

#### II. Read in dataset
Future: query betydb directly with public API key for most recent data

In [3]:
df_0 = pd.read_csv('mac_season_six_2020-03-09.csv', low_memory=False)
print(df_0.shape)
df_0.head(3)

(881869, 39)


Unnamed: 0.1,Unnamed: 0,checked,result_type,id,citation_id,site_id,treatment_id,sitename,city,lat,...,n,statname,stat,notes,access_level,cultivar,entity,method_name,view_url,edit_url
0,1,0,traits,6002883000.0,,6000014922,6000000032,MAC Field Scanner Season 6 Range 12 Column 12,Maricopa,33.074943,...,,,,,2,PI656065,,Mean temperature from infrared images,https://terraref.ncsa.illinois.edu/bety/traits...,https://terraref.ncsa.illinois.edu/bety/traits...
1,2,0,traits,6002883000.0,,6000014928,6000000032,MAC Field Scanner Season 6 Range 12 Column 13,Maricopa,33.074943,...,,,,,2,PI569422,,Mean temperature from infrared images,https://terraref.ncsa.illinois.edu/bety/traits...,https://terraref.ncsa.illinois.edu/bety/traits...
2,3,0,traits,6002883000.0,,6000014929,6000000032,MAC Field Scanner Season 6 Range 12 Column 15,Maricopa,33.074943,...,,,,,2,PI329614,,Mean temperature from infrared images,https://terraref.ncsa.illinois.edu/bety/traits...,https://terraref.ncsa.illinois.edu/bety/traits...


#### III. Drop Columns

In [4]:
df_0.columns

Index(['Unnamed: 0', 'checked', 'result_type', 'id', 'citation_id', 'site_id',
       'treatment_id', 'sitename', 'city', 'lat', 'lon', 'scientificname',
       'commonname', 'genus', 'species_id', 'cultivar_id', 'author',
       'citation_year', 'treatment', 'date', 'time', 'raw_date', 'month',
       'year', 'dateloc', 'trait', 'trait_description', 'mean', 'units', 'n',
       'statname', 'stat', 'notes', 'access_level', 'cultivar', 'entity',
       'method_name', 'view_url', 'edit_url'],
      dtype='object')

In [5]:
cols_to_drop = ['Unnamed: 0', 'checked', 'result_type', 'id', 'citation_id', 'site_id', 'treatment_id', 'city', 
                'scientificname', 'commonname', 'genus', 'species_id', 'cultivar_id', 'author',
                'citation_year', 'time', 'raw_date', 'month', 'year', 'dateloc', 'n', 'statname', 'stat', 'notes', 
                'access_level', 'entity', 'view_url', 'edit_url']

In [6]:
df_1 = df_0.drop(labels=cols_to_drop, axis=1)
print(df_1.shape)
df_1.head(3)

(881869, 11)


Unnamed: 0,sitename,lat,lon,treatment,date,trait,trait_description,mean,units,cultivar,method_name
0,MAC Field Scanner Season 6 Range 12 Column 12,33.074943,-111.974868,MAC Season 6: Sorghum,2018 Jul 19,surface_temperature,Surface temperature,39.207788,C,PI656065,Mean temperature from infrared images
1,MAC Field Scanner Season 6 Range 12 Column 13,33.074943,-111.974851,MAC Season 6: Sorghum,2018 Jul 19,surface_temperature,Surface temperature,39.6737,C,PI569422,Mean temperature from infrared images
2,MAC Field Scanner Season 6 Range 12 Column 15,33.074943,-111.974819,MAC Season 6: Sorghum,2018 Jul 19,surface_temperature,Surface temperature,40.171472,C,PI329614,Mean temperature from infrared images


In [7]:
for col in df_1.columns:
    print(f'{col}: {df_1[col].nunique()}')

sitename: 2509
lat: 2509
lon: 2509
treatment: 1
date: 128
trait: 17
trait_description: 10
mean: 304721
units: 9
cultivar: 326
method_name: 14


In [8]:
# check for drought treatment

df_1.treatment.unique()

array(['MAC Season 6: Sorghum'], dtype=object)

#### IV. Change `date` format

In [9]:
new_dates = []

for d in df_1.date.values:
    
    # strip '(America/Phoenix)' string from date
    if 'Phoenix' in d:
        new_name = d[:-18]
        new_dates.append(new_name)
    
    else:
        new_name = d
        new_dates.append(new_name)
        

# check that length of new dates matches number of rows
print(len(new_dates))
print(df_1.shape[0])

881869
881869


Convert string dates to datetime

In [10]:
iso_format_dates = pd.to_datetime(new_dates)
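
The same cleaning could probably be done without an explicit loop. The sketch below is an alternative, not the code that produced the outputs in this notebook. It assumes the suffix is exactly `' (America/Phoenix)'` and that the remaining strings look like `2018 Jul 19`; the `raw` series is toy data.

```python
import pandas as pd

# Toy date strings (hypothetical); the second carries the timezone suffix.
raw = pd.Series(['2018 Jul 19', '2018 Jul 8 (America/Phoenix)'])

# Remove the literal suffix, then parse with an explicit format.
cleaned = raw.str.replace(' (America/Phoenix)', '', regex=False)
parsed = pd.to_datetime(cleaned, format='%Y %b %d')

print(parsed)
```

Passing `format` explicitly avoids relying on pandas guessing the layout of the strings.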

Add new column with datetime values

In [11]:
# copy df to avoid SettingWithCopyWarning
df_2 = df_1.copy()
df_2['date_1'] = iso_format_dates

print(df_2.shape)
df_2.head(3)

(881869, 12)


Unnamed: 0,sitename,lat,lon,treatment,date,trait,trait_description,mean,units,cultivar,method_name,date_1
0,MAC Field Scanner Season 6 Range 12 Column 12,33.074943,-111.974868,MAC Season 6: Sorghum,2018 Jul 19,surface_temperature,Surface temperature,39.207788,C,PI656065,Mean temperature from infrared images,2018-07-19
1,MAC Field Scanner Season 6 Range 12 Column 13,33.074943,-111.974851,MAC Season 6: Sorghum,2018 Jul 19,surface_temperature,Surface temperature,39.6737,C,PI569422,Mean temperature from infrared images,2018-07-19
2,MAC Field Scanner Season 6 Range 12 Column 15,33.074943,-111.974819,MAC Season 6: Sorghum,2018 Jul 19,surface_temperature,Surface temperature,40.171472,C,PI329614,Mean temperature from infrared images,2018-07-19


#### V. Extract Range & Column Values for Location

In [12]:
df_3 = df_2.copy()

# raw strings keep '\d' from being treated as an invalid escape sequence
df_3['range'] = df_3['sitename'].str.extract(r"Range (\d+)").astype(int)
df_3['column'] = df_3['sitename'].str.extract(r"Column (\d+)").astype(int)

df_3.sample(n=3)

Unnamed: 0,sitename,lat,lon,treatment,date,trait,trait_description,mean,units,cultivar,method_name,date_1,range,column
643312,MAC Field Scanner Season 6 Range 2 Column 2,33.074584,-111.975031,MAC Season 6: Sorghum,2018 Jul 8,surface_temperature,Surface temperature,35.077081,C,SP1516,Mean temperature from infrared images,2018-07-08,2,2
466836,MAC Field Scanner Season 6 Range 40 Column 6 W,33.075949,-111.97497,MAC Season 6: Sorghum,2018 Jul 2,surface_temperature,Surface temperature,31.069391,C,PI63715,Mean temperature from infrared images,2018-07-02,40,6
773008,MAC Field Scanner Season 6 Range 46 Column 6,33.076165,-111.974966,MAC Season 6: Sorghum,2018 Jul 7,canopy_cover,Fraction of ground covered by plant,84.559625,%,PI152651,Green Canopy Cover Estimation from Field Scann...,2018-07-07,46,6
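
Both values can also be pulled out in one pass with named capture groups. This is a sketch on toy sitenames, assuming every sitename contains `Range <n> Column <m>` in that order; it is not the code used above.

```python
import pandas as pd

# Toy sitenames in the same pattern as the Season 6 data.
sitenames = pd.Series([
    'MAC Field Scanner Season 6 Range 2 Column 2',
    'MAC Field Scanner Season 6 Range 40 Column 6 W',
])

# Named groups become the column names of the returned dataframe.
parts = sitenames.str.extract(r'Range (?P<range>\d+) Column (?P<column>\d+)').astype(int)
print(parts)
```

The trailing ` W` is ignored because the pattern only captures the digits after `Range` and `Column`.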


#### VI. Drop, Rename, & Reorder Columns

In [13]:
# drop string date column

df_4 = df_3.drop(labels=['date'], axis=1)

In [14]:
df_5 = df_4.rename({'date_1': 'date', 'mean': 'value'}, axis=1)

In [15]:
new_col_order = ['sitename', 'range', 'column', 'lat', 'lon', 'date', 'treatment', 'trait', 'trait_description', 'method_name', 'cultivar', 'value', 'units']

df_6 = pd.DataFrame(data=df_5, columns=new_col_order).reset_index(drop=True)
print(df_6.shape)
df_6.head(3)

(881869, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
0,MAC Field Scanner Season 6 Range 12 Column 12,12,12,33.074943,-111.974868,2018-07-19,MAC Season 6: Sorghum,surface_temperature,Surface temperature,Mean temperature from infrared images,PI656065,39.207788,C
1,MAC Field Scanner Season 6 Range 12 Column 13,12,13,33.074943,-111.974851,2018-07-19,MAC Season 6: Sorghum,surface_temperature,Surface temperature,Mean temperature from infrared images,PI569422,39.6737,C
2,MAC Field Scanner Season 6 Range 12 Column 15,12,15,33.074943,-111.974819,2018-07-19,MAC Season 6: Sorghum,surface_temperature,Surface temperature,Mean temperature from infrared images,PI329614,40.171472,C


In [16]:
df_6.trait.unique()

array(['surface_temperature', 'canopy_height', 'canopy_cover',
       'leaf_angle_mean', 'leaf_angle_alpha', 'leaf_angle_beta',
       'leaf_angle_chi', 'aboveground_biomass_moisture',
       'aboveground_fresh_biomass', 'leaf_width', 'leaf_length',
       'aboveground_dry_biomass', 'panicle_count', 'panicle_volume',
       'panicle_surface_area', 'stalk_diameter_fixed_height',
       'emergence_count'], dtype=object)

#### VII. Select for specific traits
- aboveground dry biomass
- canopy height - time series

### A. Aboveground Dry Biomass

In [17]:
adb_df = df_6.loc[df_6.trait == 'aboveground_dry_biomass']
print(adb_df.shape)
adb_df.tail(3)

(808, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
849427,MAC Field Scanner Season 6 Range 51 Column 5,51,5,33.076345,-111.974983,2018-08-01,MAC Season 6: Sorghum,aboveground_dry_biomass,Aboveground Dry Biomass,Whole above ground biomass at harvest,PI563022,8270.0,kg / ha
849428,MAC Field Scanner Season 6 Range 52 Column 14,52,14,33.076381,-111.974836,2018-08-01,MAC Season 6: Sorghum,aboveground_dry_biomass,Aboveground Dry Biomass,Whole above ground biomass at harvest,SP1516,11800.0,kg / ha
849429,MAC Field Scanner Season 6 Range 52 Column 15,52,15,33.076381,-111.974819,2018-08-01,MAC Season 6: Sorghum,aboveground_dry_biomass,Aboveground Dry Biomass,Whole above ground biomass at harvest,SP1516,6910.0,kg / ha


##### Check for E and W subplots

In [18]:
# returns an empty dataframe (header row only) if there are no subplots

check_for_subplots(adb_df)

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units


In [19]:
# check date ranges - should only include two dates of harvesting

print(adb_df.date.min())
print(adb_df.date.max())

2018-07-31 00:00:00
2018-08-01 00:00:00


#### Write dataframe to csv file with timestamp

In [20]:
timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
output_filename = f'aboveground_dry_biomass_season_6_{timestamp}.csv'.replace(':', '')

adb_df.to_csv(output_filename, index=False)
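
To show what the resulting filename looks like, here is a small sketch that uses a fixed, arbitrary datetime instead of `datetime.datetime.now()`. The value of `fixed` is made up for the example.

```python
import datetime

# Arbitrary fixed moment (hypothetical) so the result is reproducible.
fixed = datetime.datetime(2020, 3, 9, 14, 30, 5, 123456)

# Drop microseconds, format as ISO 8601, then strip the colons,
# which are not allowed in filenames on some systems.
timestamp = fixed.replace(microsecond=0).isoformat()
output_filename = f'aboveground_dry_biomass_season_6_{timestamp}.csv'.replace(':', '')

print(output_filename)  # aboveground_dry_biomass_season_6_2020-03-09T143005.csv
```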

### B. Canopy Height - Time Series

In [22]:
ch_0 = df_6.loc[df_6.trait == 'canopy_height']
print(ch_0.shape)
ch_0.head(3)

(42271, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
12081,MAC Field Scanner Season 6 Range 23 Column 11,23,11,33.075338,-111.974884,2018-06-03,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI63715,122.0,cm
12082,MAC Field Scanner Season 6 Range 9 Column 15,9,15,33.074835,-111.974818,2018-06-13,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI329618,159.0,cm
12083,MAC Field Scanner Season 6 Range 5 Column 8,5,8,33.074691,-111.974933,2018-06-14,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI569420,160.0,cm


In [23]:
subplots = check_for_subplots(ch_0)
subplots.shape

(0, 13)

#### Check for canopy height values taken on the same day in the same plot

In [25]:
ch_0.duplicated(subset=['sitename', 'date']).value_counts()

False    30552
True     11719
dtype: int64
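
For readers unfamiliar with `duplicated`, the sketch below shows on toy data how the `keep` argument changes which rows are flagged. The plot names and values are hypothetical.

```python
import pandas as pd

# Toy data: three measurements for one plot/date, one for another.
toy = pd.DataFrame({
    'sitename': ['plot A', 'plot A', 'plot A', 'plot B'],
    'date': pd.to_datetime(['2018-05-28'] * 3 + ['2018-05-29']),
    'value': [96.0, 96.0, 96.0, 94.0],
})

# Default keep='first': the first row of each group is NOT flagged.
later_copies = toy.duplicated(subset=['sitename', 'date'])

# keep=False: every row that belongs to a repeated group is flagged.
all_copies = toy.duplicated(subset=['sitename', 'date'], keep=False)

print(later_copies.tolist())  # [False, True, True, False]
print(all_copies.tolist())    # [True, True, True, False]
```

With the default `keep='first'`, the number of `False` values equals the number of distinct `sitename` and `date` pairs.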

In [33]:
duplicates = ch_0.loc[ch_0.duplicated(subset=['sitename', 'date'], keep='first')]
duplicates.head(3)

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
189107,MAC Field Scanner Season 6 Range 19 Column 15,19,15,33.075195,-111.974819,2018-04-19,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",3D scanner to 98th quantile height,PI154844,11.0,cm
200325,MAC Field Scanner Season 6 Range 4 Column 6,4,6,33.074655,-111.974966,2018-05-28,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI585452,96.0,cm
200344,MAC Field Scanner Season 6 Range 4 Column 7,4,7,33.074655,-111.974949,2018-05-28,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI570400,94.0,cm


In [34]:
# check for hand measurements
duplicates.method_name.unique()

array(['3D scanner to 98th quantile height',
       'Scanner 3d ply data to height'], dtype=object)

In [36]:
# check one of the duplicates

duplicates.loc[(duplicates.sitename == 'MAC Field Scanner Season 6 Range 4 Column 7') & (duplicates.date == '2018-05-28')]

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
200344,MAC Field Scanner Season 6 Range 4 Column 7,4,7,33.074655,-111.974949,2018-05-28,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI570400,94.0,cm
861116,MAC Field Scanner Season 6 Range 4 Column 7,4,7,33.074655,-111.974949,2018-05-28,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI570400,94.0,cm


#### Use sqlite database to group by `sitename` and `date`

In [40]:
conn = sqlite3.connect('canopy_heights_season_6.sqlite')
cursor = conn.cursor()
print("Opened database successfully")

Opened database successfully


In [41]:
# comment out the next line if the table already exists in the database;
# by default (if_exists='fail') to_sql raises a ValueError in that case
ch_0.to_sql('canopy_heights_season_6.sqlite', conn)

In [42]:
ch_1 = pd.read_sql_query("""
                            SELECT sitename, range, column, lat, lon, date, treatment, 
                            trait, trait_description, method_name, cultivar, 
                            ROUND(AVG(value), 2) AS avg_canopy_height, units 
                            FROM 'canopy_heights_season_6.sqlite'
                            GROUP BY sitename, date, cultivar
                            ORDER BY date ASC;
                            """, conn)

print(ch_1.shape)
ch_1.head(3)

(30552, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,avg_canopy_height,units
0,MAC Field Scanner Season 6 Range 10 Column 11,10,11,33.074871,-111.974884,2018-04-19 00:00:00,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",3D scanner to 98th quantile height,PI569419,12.0,cm
1,MAC Field Scanner Season 6 Range 10 Column 12,10,12,33.074871,-111.974868,2018-04-19 00:00:00,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",3D scanner to 98th quantile height,PI156330,11.0,cm
2,MAC Field Scanner Season 6 Range 10 Column 13,10,13,33.074871,-111.974851,2018-04-19 00:00:00,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",3D scanner to 98th quantile height,PI562998,11.0,cm
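
The same averaging could likely be done in pandas without the sqlite round trip. The sketch below is an assumption about an equivalent approach, shown on toy data with hypothetical plot names and values; it is not the code that produced `ch_1`.

```python
import pandas as pd

# Toy canopy heights: two readings for plot A on one day, one for plot B.
toy = pd.DataFrame({
    'sitename': ['plot A', 'plot A', 'plot B'],
    'date': pd.to_datetime(['2018-05-28', '2018-05-28', '2018-05-28']),
    'cultivar': ['PI585452', 'PI585452', 'PI570400'],
    'value': [96.0, 97.0, 94.0],
    'units': ['cm', 'cm', 'cm'],
})

# Average within each sitename/date/cultivar group, like the GROUP BY query.
averaged = (
    toy.groupby(['sitename', 'date', 'cultivar'], as_index=False)
       .agg(avg_canopy_height=('value', 'mean'), units=('units', 'first'))
       .round({'avg_canopy_height': 2})
       .sort_values(['date', 'sitename'])
       .reset_index(drop=True)
)
print(averaged)
```

One difference worth noting: this keeps `date` as a datetime column, whereas the values read back from sqlite above are strings such as `2018-04-19 00:00:00`.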


In [44]:
# Sanity Check

sample_with_duplicate = ch_0.loc[(ch_0.range == 4) & (ch_0.column == 6) & (ch_0.date == '2018-05-28')]
sample_with_duplicate

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
12990,MAC Field Scanner Season 6 Range 4 Column 6,4,6,33.074655,-111.974966,2018-05-28,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI585452,96.0,cm
200325,MAC Field Scanner Season 6 Range 4 Column 6,4,6,33.074655,-111.974966,2018-05-28,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI585452,96.0,cm
861097,MAC Field Scanner Season 6 Range 4 Column 6,4,6,33.074655,-111.974966,2018-05-28,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI585452,96.0,cm


In [45]:
# Sanity Check - should have only one row for the above group

sample_without_duplicate = ch_1.loc[(ch_1.range == 4) & (ch_1.column == 6) & (ch_1.date == '2018-05-28 00:00:00')]
sample_without_duplicate

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,avg_canopy_height,units
13420,MAC Field Scanner Season 6 Range 4 Column 6,4,6,33.074655,-111.974966,2018-05-28 00:00:00,MAC Season 6: Sorghum,canopy_height,"top of the general canopy of the plant, discou...",Scanner 3d ply data to height,PI585452,96.0,cm


#### Write dataframe to csv file with timestamp

In [46]:
timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
output_filename = f'canopy_height_time_series_season_6_{timestamp}.csv'.replace(':', '')

ch_1.to_csv(output_filename, index=False)

### C. Growing Degree Days / Weather Data for Season Six
Season 6 has no `days_to_flowering` or `days_to_flag_leaf_emergence` data, but GDD values are added for the season in case they are needed for other traits.

#### Read in Season Six Weather Data from MAC Weather Station

In [48]:
weather_df_0 = pd.read_csv('mac_weather_station_raw_daily_2018.csv')
print(weather_df_0.shape)
weather_df_0.head(3)

(365, 28)


Unnamed: 0,year,day_of_year,station_num,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_speed_mean,wind_vec_mag,wind_vec_dir,wind_dir_std,max_wind_speed,heat_units,eto,eto_pm,vapor_pressure_mean,dewpoint_mean
0,2018,1,6,23.6,0.4,10.4,75.9,11.7,41.5,0.94,...,0.9,0.1,159,74,3.8,3.4,2.3,1.8,0.46,-4.0
1,2018,2,6,23.0,3.6,11.6,70.4,11.7,37.7,1.01,...,0.8,0.3,302,67,3.8,3.4,2.5,1.7,0.47,-3.7
2,2018,3,6,24.2,1.8,12.9,73.9,11.8,35.7,1.17,...,2.0,1.2,50,50,8.1,3.7,2.8,3.1,0.47,-3.6


#### Slice dataframe for season dates only and add date column
* Planting Date: 2018-04-25, Day 115
* Last Day of Harvest: 2018-08-01, Day 213
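
The day-of-year numbers above can be checked with the standard library. This is a small verification sketch; the variable names are chosen for the example.

```python
import datetime

planting = datetime.date(2018, 4, 25)
last_harvest = datetime.date(2018, 8, 1)

# tm_yday is the 1-based day of the year.
planting_doy = planting.timetuple().tm_yday
harvest_doy = last_harvest.timetuple().tm_yday

# Number of days in the season, counting both endpoints.
season_length = harvest_doy - planting_doy + 1

print(planting_doy, harvest_doy, season_length)  # 115 213 99
```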

In [49]:
weather_df_1 = weather_df_0.loc[(weather_df_0.day_of_year >= 115) & (weather_df_0.day_of_year <= 213)]
print(weather_df_1.shape)
weather_df_1.head(3)

(99, 28)


Unnamed: 0,year,day_of_year,station_num,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_speed_mean,wind_vec_mag,wind_vec_dir,wind_dir_std,max_wind_speed,heat_units,eto,eto_pm,vapor_pressure_mean,dewpoint_mean
114,2018,115,6,36.6,17.1,27.2,31.0,6.5,16.8,3.26,...,1.6,0.8,185,57,5.6,12.4,7.7,6.7,0.56,-1.4
115,2018,116,6,34.5,18.2,26.8,33.3,9.9,18.7,3.04,...,2.4,1.3,208,53,5.9,12.6,6.1,6.7,0.63,0.3
116,2018,117,6,36.4,17.6,28.6,48.7,9.6,20.5,3.35,...,2.0,0.9,215,59,7.2,12.6,7.9,7.1,0.74,2.5


In [50]:
season_6_date_range = pd.date_range(start='2018-04-25', end='2018-08-01')

In [51]:
weather_df_2 = weather_df_1.copy()
weather_df_2['date'] = season_6_date_range
print(weather_df_2.shape)
weather_df_2.tail(3)

(99, 29)


Unnamed: 0,year,day_of_year,station_num,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_vec_mag,wind_vec_dir,wind_dir_std,max_wind_speed,heat_units,eto,eto_pm,vapor_pressure_mean,dewpoint_mean,date
210,2018,211,6,41.9,27.2,34.2,64.9,22.3,41.4,3.44,...,1.0,215,66,12.5,16.7,8.7,9.1,2.12,18.3,2018-07-30
211,2018,212,6,40.9,26.0,33.7,79.8,18.8,42.1,3.35,...,0.6,154,65,8.4,16.3,7.6,7.4,2.03,17.6,2018-07-31
212,2018,213,6,42.9,28.9,36.0,64.7,14.9,36.3,4.05,...,1.5,154,52,14.7,17.1,8.6,8.9,2.02,17.4,2018-08-01


#### Add Growing Degree Days
- Future: add LaTeX equation
- Future: add info about min and max daily values
- 10 degrees Celsius is base temp for sorghum
- Daily gdd value = ((max temp + min temp) / 2) - 10 (base temp)
- Growing Degree Days = cumulative sum of daily gdd values
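
The formula in the list above can be sketched on the first three days of the season, using the `air_temp_max` and `air_temp_min` values displayed for days 115 to 117. `BASE_TEMP_C` is a name introduced here for illustration. The sketch follows the formula as written and does not clip negative daily values.

```python
import numpy as np
import pandas as pd

BASE_TEMP_C = 10.0  # base temperature for sorghum, in degrees Celsius

# Max/min air temperatures for days 115-117, as displayed in this notebook.
weather = pd.DataFrame({
    'air_temp_max': [36.6, 34.5, 36.4],
    'air_temp_min': [17.1, 18.2, 17.6],
})

# Daily value: mean of max and min, minus the base temperature.
weather['daily_gdd'] = (weather['air_temp_max'] + weather['air_temp_min']) / 2 - BASE_TEMP_C

# Growing degree days: running total, rounded to the nearest integer.
weather['gdd'] = np.rint(weather['daily_gdd'].cumsum())

print(weather)
```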

In [52]:
weather_df_3 = weather_df_2.copy()
weather_df_3['daily_gdd'] = (((weather_df_3['air_temp_max'] + weather_df_3['air_temp_min'])) / 2) - 10
print(weather_df_3.shape)
weather_df_3.head(3)

(99, 30)


Unnamed: 0,year,day_of_year,station_num,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_vec_dir,wind_dir_std,max_wind_speed,heat_units,eto,eto_pm,vapor_pressure_mean,dewpoint_mean,date,daily_gdd
114,2018,115,6,36.6,17.1,27.2,31.0,6.5,16.8,3.26,...,185,57,5.6,12.4,7.7,6.7,0.56,-1.4,2018-04-25,16.85
115,2018,116,6,34.5,18.2,26.8,33.3,9.9,18.7,3.04,...,208,53,5.9,12.6,6.1,6.7,0.63,0.3,2018-04-26,16.35
116,2018,117,6,36.4,17.6,28.6,48.7,9.6,20.5,3.35,...,215,59,7.2,12.6,7.9,7.1,0.74,2.5,2018-04-27,17.0


In [53]:
weather_df_4 = weather_df_3.copy()

# round to the nearest integer
weather_df_4['gdd'] = np.rint(np.cumsum(weather_df_4['daily_gdd']))
print(weather_df_4.shape)
weather_df_4.tail(3)

(99, 31)


Unnamed: 0,year,day_of_year,station_num,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_dir_std,max_wind_speed,heat_units,eto,eto_pm,vapor_pressure_mean,dewpoint_mean,date,daily_gdd,gdd
210,2018,211,6,41.9,27.2,34.2,64.9,22.3,41.4,3.44,...,66,12.5,16.7,8.7,9.1,2.12,18.3,2018-07-30,24.55,1923.0
211,2018,212,6,40.9,26.0,33.7,79.8,18.8,42.1,3.35,...,65,8.4,16.3,7.6,7.4,2.03,17.6,2018-07-31,23.45,1947.0
212,2018,213,6,42.9,28.9,36.0,64.7,14.9,36.3,4.05,...,52,14.7,17.1,8.6,8.9,2.02,17.4,2018-08-01,25.9,1973.0


#### Drop `daily_gdd` and reorder columns
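
A minimal sketch of the step this heading names, on a toy dataframe. The column order chosen here (`date`, `day_of_year`, `gdd` first) is an assumption for illustration; the notebook does not specify the final order.

```python
import pandas as pd

# Toy subset of the weather columns (values as displayed earlier in the notebook).
toy_weather = pd.DataFrame({
    'day_of_year': [115, 116],
    'air_temp_max': [36.6, 34.5],
    'date': pd.to_datetime(['2018-04-25', '2018-04-26']),
    'daily_gdd': [16.85, 16.35],
    'gdd': [17.0, 33.0],
})

# Drop the intermediate daily values, keeping the cumulative gdd.
trimmed = toy_weather.drop(columns=['daily_gdd'])

# Hypothetical ordering: put the leading columns first, keep the rest as they were.
leading = ['date', 'day_of_year', 'gdd']
new_order = leading + [c for c in trimmed.columns if c not in leading]
reordered = trimmed[new_order]

print(reordered.columns.tolist())
```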

In [None]:
day_of_planting = datetime.date(2017,4,20)
fle_1 = fle_0.copy()

fle_1['date_of_planting'] = day_of_planting
print(fle_1.shape)
fle_1.head(3)

#### Create timedelta using days to flag leaf emergence

In [None]:
timedelta_values = fle_1['value'].values
dates_of_flag_leaf_emergence = []

for val in timedelta_values:
    
    # cast to a built-in int; datetime.timedelta does not accept numpy integer types
    date_of_flag_leaf_emergence = day_of_planting + datetime.timedelta(days=int(val))
    dates_of_flag_leaf_emergence.append(date_of_flag_leaf_emergence)
    
print(fle_1.shape[0])
print(len(dates_of_flag_leaf_emergence))

In [None]:
fle_2 = fle_1.copy()
fle_2['date_of_flag_leaf_emergence'] = dates_of_flag_leaf_emergence
print(fle_2.shape)
fle_2.head(3)

#### Add GDD to flag leaf emergence

In [None]:
# slice df for date and cumulative gdd values only

season_4_gdd = weather_df_4[['date', 'gdd']]
print(season_4_gdd.shape)
season_4_gdd.head(3)

In [None]:
fle_2.dtypes

In [None]:
fle_3 = fle_2.copy()
fle_3.date_of_flag_leaf_emergence = pd.to_datetime(fle_3.date_of_flag_leaf_emergence)
fle_3.dtypes

In [None]:
fle_4 = fle_3.merge(season_4_gdd, how='left', left_on='date_of_flag_leaf_emergence', right_on='date')
print(fle_4.shape)
fle_4.head(3)

#### Drop all date columns except `date_of_flag_leaf_emergence`

In [None]:
date_cols_to_drop = ['date_x', 'date_of_planting', 'date_y']
fle_5 = fle_4.drop(labels=date_cols_to_drop, axis=1)
print(fle_5.shape)
fle_5.tail(3)

#### Check for duplicates

In [None]:
fle_5.duplicated().value_counts()

In [None]:
# keep duplicates for now?

#### Write dataframe to csv file with timestamp

In [None]:
timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
output_filename = f'days_gdd_to_flag_leaf_emergence_season_4_{timestamp}.csv'.replace(':', '')

fle_5.to_csv(output_filename, index=False)