### MAC Season 4 Data Cleaning
#### Traits
- aboveground dry biomass
- days & growing degree days (GDD) to flowering
- days & GDD to flag leaf emergence
- canopy height (time series)

This notebook contains the code Emily Cain used to clean and curate sorghum data from MAC Season 4. The input csv file was queried from betydb version 1 in February 2020. For now, the notebook runs in the CyVerse Discovery Environment, where the input data are stored. To run the entire notebook and write the output csv files (for example, aboveground dry biomass at harvest), select `Run` and then `Run All Cells` from the notebook menu above. Once all cells have been executed, the output csv files will appear in the file panel on the left. Right-click a file to download it. If you experience any problems or have questions, please e-mail ejcain@email.arizona.edu.

#### Custom Functions Used

In [1]:
def check_for_subplots(df):
    
    """
    Function takes a dataframe as argument and checks for sitename subplots ending in ' E' or ' W'
    Will return rows with subplots, if any.
    """
    return df.loc[(df.sitename.str.endswith(' E')) | (df.sitename.str.endswith(' W'))]
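
A quick way to see what `check_for_subplots` returns is to run it on a small toy dataframe. The sketch below repeats the function so it runs on its own; the sitenames ending in ` E` and ` W` are hypothetical examples, not rows from the Season 4 data.

```python
import pandas as pd


def check_for_subplots(df):
    """Return rows whose sitename ends in ' E' or ' W' (subplot suffixes)."""
    return df.loc[(df.sitename.str.endswith(' E')) | (df.sitename.str.endswith(' W'))]


# Toy data: the ' E' and ' W' sitenames are made up for illustration
toy = pd.DataFrame({
    'sitename': [
        'MAC Field Scanner Season 4 Range 5 Column 14',
        'MAC Field Scanner Season 4 Range 5 Column 14 E',
        'MAC Field Scanner Season 4 Range 5 Column 14 W',
    ],
    'value': [1.0, 2.0, 3.0],
})

subplots = check_for_subplots(toy)
print(subplots)

# A dataframe without subplots gives back an empty dataframe
no_subplots = check_for_subplots(toy.iloc[[0]])
print(no_subplots.empty)
```

When no sitename carries a subplot suffix, the result is an empty dataframe, so a notebook cell displays only the column header.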

#### I. Import Python packages

In [2]:
import datetime
import numpy as np
import pandas as pd
import sqlalchemy
import sqlite3

#### II. Read in dataset
Future: query betydb directly with public API key for most recent data

In [3]:
df_0 = pd.read_csv('mac_season_four_2020-02-26.csv', low_memory=False)
print(df_0.shape)
df_0.head(3)

(397879, 39)


Unnamed: 0.1,Unnamed: 0,checked,result_type,id,citation_id,site_id,treatment_id,sitename,city,lat,...,n,statname,stat,notes,access_level,cultivar,entity,method_name,view_url,edit_url
0,1,0,traits,6002274258,,6000005457,,MAC Field Scanner Season 4 Range 5 Column 14,Maricopa,33.074691,...,,,,,2,PI653617,,Mean temperature from infrared images,https://terraref.ncsa.illinois.edu/bety/traits...,https://terraref.ncsa.illinois.edu/bety/traits...
1,2,0,traits,6002274259,,6000005645,,MAC Field Scanner Season 4 Range 17 Column 10,Maricopa,33.075123,...,,,,,2,PI569420,,Mean temperature from infrared images,https://terraref.ncsa.illinois.edu/bety/traits...,https://terraref.ncsa.illinois.edu/bety/traits...
2,3,0,traits,6002274265,,6000005871,,MAC Field Scanner Season 4 Range 40 Column 6,Maricopa,33.075949,...,,,,,2,PI329841,,Mean temperature from infrared images,https://terraref.ncsa.illinois.edu/bety/traits...,https://terraref.ncsa.illinois.edu/bety/traits...


#### III. Drop Columns

In [4]:
df_0.columns

Index(['Unnamed: 0', 'checked', 'result_type', 'id', 'citation_id', 'site_id',
       'treatment_id', 'sitename', 'city', 'lat', 'lon', 'scientificname',
       'commonname', 'genus', 'species_id', 'cultivar_id', 'author',
       'citation_year', 'treatment', 'date', 'time', 'raw_date', 'month',
       'year', 'dateloc', 'trait', 'trait_description', 'mean', 'units', 'n',
       'statname', 'stat', 'notes', 'access_level', 'cultivar', 'entity',
       'method_name', 'view_url', 'edit_url'],
      dtype='object')

In [5]:
cols_to_drop = ['Unnamed: 0', 'checked', 'result_type', 'id', 'citation_id', 'site_id', 'treatment_id', 'city', 
                'scientificname', 'commonname', 'genus', 'species_id', 'cultivar_id', 'author',
                'citation_year', 'time', 'raw_date', 'month', 'year', 'dateloc', 'n', 'statname', 'stat', 'notes', 
                'access_level', 'entity', 'view_url', 'edit_url']

In [6]:
df_1 = df_0.drop(labels=cols_to_drop, axis=1)
print(df_1.shape)
df_1.head(3)

(397879, 11)


Unnamed: 0,sitename,lat,lon,treatment,date,trait,trait_description,mean,units,cultivar,method_name
0,MAC Field Scanner Season 4 Range 5 Column 14,33.074691,-111.974835,,2017 Jul 9,surface_temperature,Surface temperature,38.090662,C,PI653617,Mean temperature from infrared images
1,MAC Field Scanner Season 4 Range 17 Column 10,33.075123,-111.9749,,2017 Jul 9,surface_temperature,Surface temperature,37.715112,C,PI569420,Mean temperature from infrared images
2,MAC Field Scanner Season 4 Range 40 Column 6,33.075949,-111.974966,,2017 Jul 9,surface_temperature,Surface temperature,37.458246,C,PI329841,Mean temperature from infrared images


In [7]:
# for col in df_1.columns:
#     print(f'{col}: {df_1[col].nunique()}')

#### IV. Change `date` format

In [8]:
new_dates = []

for d in df_1.date.values:
    # strip the trailing ' (America/Phoenix)' suffix (18 characters) when present
    new_dates.append(d[:-18] if 'Phoenix' in d else d)
        

# check that length of new dates matches number of rows
print(len(new_dates))
print(df_1.shape[0])

397879
397879


Convert string dates to datetime

In [9]:
iso_format_dates = pd.to_datetime(new_dates)
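
The strip-and-convert steps above can also be written without a Python loop. This is a minimal sketch on toy strings, assuming the timezone suffix always appears as ` (America/Phoenix)` at the end of the string and that the dates follow the `2017 Jul 9` pattern shown in the output; `raw_dates` is a hypothetical stand-in for `df_1.date`.

```python
import pandas as pd

# Toy stand-in for df_1.date; the second value assumes the suffix format
raw_dates = pd.Series(['2017 Jul 9', '2017 Aug 30 (America/Phoenix)'])

# Remove the literal suffix (regex=False treats the parentheses as plain text)
cleaned = raw_dates.str.replace(' (America/Phoenix)', '', regex=False)

# Parse with an explicit format so that unexpected strings raise an error
parsed = pd.to_datetime(cleaned, format='%Y %b %d')
print(parsed)
```

Passing an explicit `format` is a design choice: it is stricter than letting pandas infer the format, which helps catch malformed dates early.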

Add new column with datetime values

In [10]:
# copy df to avoid SettingWithCopyWarning
df_2 = df_1.copy()
df_2['date_1'] = iso_format_dates

print(df_2.shape)
df_2.head(3)

(397879, 12)


Unnamed: 0,sitename,lat,lon,treatment,date,trait,trait_description,mean,units,cultivar,method_name,date_1
0,MAC Field Scanner Season 4 Range 5 Column 14,33.074691,-111.974835,,2017 Jul 9,surface_temperature,Surface temperature,38.090662,C,PI653617,Mean temperature from infrared images,2017-07-09
1,MAC Field Scanner Season 4 Range 17 Column 10,33.075123,-111.9749,,2017 Jul 9,surface_temperature,Surface temperature,37.715112,C,PI569420,Mean temperature from infrared images,2017-07-09
2,MAC Field Scanner Season 4 Range 40 Column 6,33.075949,-111.974966,,2017 Jul 9,surface_temperature,Surface temperature,37.458246,C,PI329841,Mean temperature from infrared images,2017-07-09


#### V. Extract Range & Column Values for Location

In [11]:
df_3 = df_2.copy()

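# raw strings avoid the invalid-escape warning for \d; expand=False returns a Series
df_3['range'] = df_3['sitename'].str.extract(r"Range (\d+)", expand=False).astype(int)
df_3['column'] = df_3['sitename'].str.extract(r"Column (\d+)", expand=False).astype(int)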

df_3.sample(n=3)

Unnamed: 0,sitename,lat,lon,treatment,date,trait,trait_description,mean,units,cultivar,method_name,date_1,range,column
174435,MAC Field Scanner Season 4 Range 3 Column 6,33.07462,-111.974966,,2017 Jun 29,surface_temperature,Surface temperature,34.86532,C,PI330181,Mean temperature from infrared images,2017-06-29,3,6
315440,MAC Field Scanner Season 4 Range 21 Column 9,33.075266,-111.974917,,2017 Aug 30,canopy_height,"top of the general canopy of the plant, discou...",366.0,cm,PI569453,3D scanner to 98th quantile height,2017-08-30,21,9
216380,MAC Field Scanner Season 4 Range 7 Column 16,33.074763,-111.974802,,2017 May 19,surface_temperature,Surface temperature,32.643213,C,SP1615,Mean temperature from infrared images,2017-05-19,7,16
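
As an alternative sketch, both location values can be pulled out in a single pass with named capture groups. This assumes every sitename contains `Range <n> Column <m>` in that order, as in the rows shown above; `sitenames` and `location` are illustrative names.

```python
import pandas as pd

sitenames = pd.Series([
    'MAC Field Scanner Season 4 Range 5 Column 14',
    'MAC Field Scanner Season 4 Range 17 Column 10',
])

# Named groups become the column names of the extracted dataframe
location = sitenames.str.extract(r'Range (?P<range>\d+) Column (?P<column>\d+)').astype(int)
print(location)
```

If a sitename did not match the pattern, the extracted values would be missing and `astype(int)` would raise, which makes a silent mismatch unlikely.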


#### VI. Drop, Rename, & Reorder Columns

In [12]:
# drop string date column

df_4 = df_3.drop(labels=['date'], axis=1)

In [13]:
df_5 = df_4.rename({'date_1': 'date', 'mean': 'value'}, axis=1)

In [14]:
new_col_order = ['sitename', 'range', 'column', 'lat', 'lon', 'date', 'treatment', 'trait', 'trait_description', 'method_name', 'cultivar', 'value', 'units']

df_6 = pd.DataFrame(data=df_5, columns=new_col_order).reset_index(drop=True)
print(df_6.shape)
df_6.head(3)

(397879, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
0,MAC Field Scanner Season 4 Range 5 Column 14,5,14,33.074691,-111.974835,2017-07-09,,surface_temperature,Surface temperature,Mean temperature from infrared images,PI653617,38.090662,C
1,MAC Field Scanner Season 4 Range 17 Column 10,17,10,33.075123,-111.9749,2017-07-09,,surface_temperature,Surface temperature,Mean temperature from infrared images,PI569420,37.715112,C
2,MAC Field Scanner Season 4 Range 40 Column 6,40,6,33.075949,-111.974966,2017-07-09,,surface_temperature,Surface temperature,Mean temperature from infrared images,PI329841,37.458246,C


#### VII. Select for specific traits
- aboveground dry biomass
- days & GDD to flowering
- days & GDD to flag leaf emergence
- canopy height - time series

#### A. Aboveground Dry Biomass

In [15]:
adb_df = df_6.loc[df_6.trait == 'aboveground_dry_biomass']
print(adb_df.shape)
adb_df.tail(3)

(447, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
365551,MAC Field Scanner Season 4 Range 53 Column 2,53,2,33.076417,-111.975032,2017-09-10,"BAP 2017, water-deficit stress Aug 1-14",aboveground_dry_biomass,Aboveground Dry Biomass,Whole above ground biomass at harvest,Big_Kahuna,16040.0,kg / ha
365552,MAC Field Scanner Season 4 Range 53 Column 12,53,12,33.076417,-111.974868,2017-09-15,"BAP 2017, water-deficit stress Aug 1-14",aboveground_dry_biomass,Aboveground Dry Biomass,Whole above ground biomass at harvest,Big_Kahuna,15130.0,kg / ha
366123,MAC Field Scanner Season 4 Range 31 Column 2,31,2,33.075626,-111.975032,2017-09-10,"BAP 2017, water-deficit stress Aug 1-14",aboveground_dry_biomass,Aboveground Dry Biomass,Whole above ground biomass at harvest,PI195754,17820.0,kg / ha


##### Check for E and W subplots

In [16]:
# returns an empty dataframe (column header only) if there are no subplots

check_for_subplots(adb_df)

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units


#### Write dataframe to csv file with timestamp

In [17]:
timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
output_filename = f'aboveground_dry_biomass_season_4_{timestamp}.csv'.replace(':', '')

adb_df.to_csv(output_filename, index=False)
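
Because the same timestamp pattern is reused for each output file, it could be wrapped in a small helper. The function name `timestamped_filename` and its `now` parameter are hypothetical additions for illustration; the fixed datetime below only makes the example reproducible.

```python
import datetime


def timestamped_filename(prefix, now=None):
    """Build '<prefix>_<ISO timestamp>.csv' with colons removed."""
    if now is None:
        now = datetime.datetime.now()
    timestamp = now.replace(microsecond=0).isoformat()
    # Colons are removed because they are not allowed in filenames on some systems
    return f'{prefix}_{timestamp}.csv'.replace(':', '')


example = timestamped_filename(
    'aboveground_dry_biomass_season_4',
    now=datetime.datetime(2020, 2, 26, 14, 30, 5),
)
print(example)
```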

#### B. Days & Growing Degree Days (GDD) to Flowering

In [18]:
# df_5.trait.unique()

In [19]:
flower_df_0 = df_6.loc[df_6.trait == 'flowering_time']
print(flower_df_0.shape)
flower_df_0.head(3)

(136, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
23569,MAC Field Scanner Season 4 Range 7 Column 8,7,8,33.074763,-111.974933,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI152694,64.0,days
23570,MAC Field Scanner Season 4 Range 18 Column 4,18,4,33.075159,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI540518,72.0,days
23571,MAC Field Scanner Season 4 Range 27 Column 4,27,4,33.075482,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI641835,90.0,days


##### Check for E and W subplots

In [20]:
# returns an empty dataframe (column header only) if there are no subplots

check_for_subplots(flower_df_0)

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units


#### Read in Season Four Weather Data from MAC Weather Station

In [21]:
weather_df_0 = pd.read_csv('mac_weather_station_raw_daily_2017.csv')
print(weather_df_0.shape)
weather_df_0.head(3)

(365, 28)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_speed_mean,wind_vector_magnitude,wind_vector_direction,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean
0,2017,1,6,13.6,9.3,11.8,92.7,69.2,83.5,0.23,...,3.5,2.6,188,43,10.9,0.2,1.0,1.2,1.16,9.0
1,2017,2,6,14.9,7.2,10.5,87.7,44.7,71.4,0.39,...,2.2,1.5,129,44,5.8,0.5,1.0,1.6,0.89,5.3
2,2017,3,6,13.9,3.2,9.0,97.0,60.6,81.9,0.24,...,1.0,0.1,349,78,3.3,0.2,0.6,0.9,0.93,5.8


#### Slice dataframe for season dates only and add date column
* Planting Date: 2017-04-20, Day 110
* Last Day of Harvest: 2017-09-16, Day 259

In [22]:
weather_df_1 = weather_df_0.loc[(weather_df_0.day_of_year >= 110) & (weather_df_0.day_of_year <= 259)]
print(weather_df_1.shape)
weather_df_1.head(3)

(150, 28)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_speed_mean,wind_vector_magnitude,wind_vector_direction,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean
109,2017,110,6,33.3,14.1,23.5,45.0,5.1,18.2,2.63,...,1.9,0.8,233,60,8.2,10.3,8.0,6.8,0.47,-3.7
110,2017,111,6,34.4,11.1,24.0,46.5,5.5,17.2,2.82,...,2.2,1.3,274,52,8.5,9.4,8.5,7.4,0.43,-4.9
111,2017,112,6,35.5,14.5,25.0,32.5,6.4,15.6,2.95,...,1.6,0.5,178,66,5.2,11.0,8.0,6.7,0.45,-4.2


In [23]:
season_4_date_range = pd.date_range(start='2017-04-20', end='2017-09-16')

In [24]:
weather_df_2 = weather_df_1.copy()
weather_df_2['date'] = season_4_date_range
print(weather_df_2.shape)
weather_df_2.tail(3)

(150, 29)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_vector_magnitude,wind_vector_direction,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean,date
256,2017,257,6,39.5,22.8,31.4,50.6,17.8,32.9,3.29,...,3.6,203,34,13.6,15.1,8.6,9.2,1.45,12.5,2017-09-14
257,2017,258,6,36.2,21.4,28.5,63.7,14.2,33.7,2.82,...,2.1,192,42,9.9,14.2,7.7,7.4,1.2,9.3,2017-09-15
258,2017,259,6,36.3,18.2,27.6,51.4,16.7,29.9,2.8,...,1.4,168,47,8.0,12.8,7.0,6.5,1.07,7.8,2017-09-16
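
Assigning a date range by position works here because the sliced dataframe has exactly one row per day. As a cross-check, the date can also be derived from each row's own `year` and `day_of_year` values. This is a sketch on toy rows, not the notebook's method; `jan_first` is an illustrative name.

```python
import pandas as pd

# Toy rows using the same column names as the weather station file
weather = pd.DataFrame({'year': [2017, 2017, 2017], 'day_of_year': [110, 111, 259]})

# January 1 of each row's year, then add (day_of_year - 1) days
jan_first = pd.to_datetime(weather['year'].astype(str) + '-01-01')
weather['date'] = jan_first + pd.to_timedelta(weather['day_of_year'] - 1, unit='D')
print(weather)
```

Day 110 maps to 2017-04-20 and day 259 to 2017-09-16, which agrees with the planting and harvest dates listed above.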


#### Add Growing Degree Days
- Future: add LaTeX equation
- Future: add info about min and max daily values
- 10 degrees Celsius is base temp for sorghum
- Daily gdd value = ((max temp + min temp) / 2) - 10 (base temp)
- Growing Degree Days = cumulative sum of daily gdd values
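
The formula above can be checked by hand on the first three days of the season. The sketch below uses the maximum and minimum temperatures shown in the weather output; `BASE_TEMP_C` is an illustrative constant name. Some GDD definitions also set negative daily values to zero; the formula here does not, and these three days are all well above the base temperature.

```python
import numpy as np
import pandas as pd

BASE_TEMP_C = 10  # base temperature for sorghum, in degrees Celsius

weather = pd.DataFrame({
    'date': pd.date_range('2017-04-20', periods=3),
    'air_temp_max': [33.3, 34.4, 35.5],
    'air_temp_min': [14.1, 11.1, 14.5],
})

# Daily value: mean of max and min temperature, minus the base temperature
weather['daily_gdd'] = (weather['air_temp_max'] + weather['air_temp_min']) / 2 - BASE_TEMP_C

# Growing degree days: running total of daily values, rounded to the nearest integer
weather['gdd'] = np.rint(weather['daily_gdd'].cumsum())
print(weather)
```

The daily values come out to roughly 13.7, 12.75, and 15.0, and the rounded cumulative values to 14, 26, and 41.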

In [25]:
weather_df_3 = weather_df_2.copy()
weather_df_3['daily_gdd'] = (((weather_df_3['air_temp_max'] + weather_df_3['air_temp_min'])) / 2) - 10
print(weather_df_3.shape)
weather_df_3.head(3)

(150, 30)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_vector_direction,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean,date,daily_gdd
109,2017,110,6,33.3,14.1,23.5,45.0,5.1,18.2,2.63,...,233,60,8.2,10.3,8.0,6.8,0.47,-3.7,2017-04-20,13.7
110,2017,111,6,34.4,11.1,24.0,46.5,5.5,17.2,2.82,...,274,52,8.5,9.4,8.5,7.4,0.43,-4.9,2017-04-21,12.75
111,2017,112,6,35.5,14.5,25.0,32.5,6.4,15.6,2.95,...,178,66,5.2,11.0,8.0,6.7,0.45,-4.2,2017-04-22,15.0


In [26]:
weather_df_4 = weather_df_3.copy()
weather_df_4['gdd'] = np.rint(np.cumsum(weather_df_4['daily_gdd']))
print(weather_df_4.shape)
weather_df_4.tail(3)

(150, 31)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean,date,daily_gdd,gdd
256,2017,257,6,39.5,22.8,31.4,50.6,17.8,32.9,3.29,...,34,13.6,15.1,8.6,9.2,1.45,12.5,2017-09-14,21.15,2999.0
257,2017,258,6,36.2,21.4,28.5,63.7,14.2,33.7,2.82,...,42,9.9,14.2,7.7,7.4,1.2,9.3,2017-09-15,18.8,3018.0
258,2017,259,6,36.3,18.2,27.6,51.4,16.7,29.9,2.8,...,47,8.0,12.8,7.0,6.5,1.07,7.8,2017-09-16,17.25,3035.0


#### Add planting date 2017-04-20

In [27]:
day_of_planting = datetime.date(2017,4,20)
flower_df_1 = flower_df_0.copy()

flower_df_1['date_of_planting'] = day_of_planting
print(flower_df_1.shape)
flower_df_1.head(3)

(136, 14)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units,date_of_planting
23569,MAC Field Scanner Season 4 Range 7 Column 8,7,8,33.074763,-111.974933,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI152694,64.0,days,2017-04-20
23570,MAC Field Scanner Season 4 Range 18 Column 4,18,4,33.075159,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI540518,72.0,days,2017-04-20
23571,MAC Field Scanner Season 4 Range 27 Column 4,27,4,33.075482,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI641835,90.0,days,2017-04-20


#### Create timedelta using days to flowering

In [28]:
timedelta_values = flower_df_1['value'].values
dates_of_flowering = []

for val in timedelta_values:
    
    date_of_flowering = day_of_planting + datetime.timedelta(days=val)
    dates_of_flowering.append(date_of_flowering)
    
print(flower_df_1.shape[0])
print(len(dates_of_flowering))

136
136
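
The same date arithmetic can be done without a loop by converting the day counts to timedeltas. This is a sketch on three of the values shown above; `planting` and `days_to_flowering` are illustrative names. It also returns `datetime64` values directly, so no later conversion with `pd.to_datetime` would be needed.

```python
import pandas as pd

planting = pd.Timestamp('2017-04-20')
days_to_flowering = pd.Series([64.0, 72.0, 90.0])

# Convert day counts to timedeltas, then add the planting date to each
dates_of_flowering = pd.to_timedelta(days_to_flowering, unit='D') + planting
print(dates_of_flowering)
```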


In [29]:
flower_df_2 = flower_df_1.copy()
flower_df_2['date_of_flowering'] = dates_of_flowering
print(flower_df_2.shape)
flower_df_2.head(3)

(136, 15)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units,date_of_planting,date_of_flowering
23569,MAC Field Scanner Season 4 Range 7 Column 8,7,8,33.074763,-111.974933,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI152694,64.0,days,2017-04-20,2017-06-23
23570,MAC Field Scanner Season 4 Range 18 Column 4,18,4,33.075159,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI540518,72.0,days,2017-04-20,2017-07-01
23571,MAC Field Scanner Season 4 Range 27 Column 4,27,4,33.075482,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI641835,90.0,days,2017-04-20,2017-07-19


#### Add GDD to flowering dataframe

In [30]:
# slice df for date and cumulative gdd values only

season_4_gdd = weather_df_4[['date', 'gdd']]
print(season_4_gdd.shape)
season_4_gdd.head(3)

(150, 2)


Unnamed: 0,date,gdd
109,2017-04-20,14.0
110,2017-04-21,26.0
111,2017-04-22,41.0


In [31]:
flower_df_2.dtypes

sitename                     object
range                         int64
column                        int64
lat                         float64
lon                         float64
date                 datetime64[ns]
treatment                    object
trait                        object
trait_description            object
method_name                  object
cultivar                     object
value                       float64
units                        object
date_of_planting             object
date_of_flowering            object
dtype: object

In [32]:
flower_df_3 = flower_df_2.copy()
flower_df_3.date_of_flowering = pd.to_datetime(flower_df_3.date_of_flowering)
flower_df_3.dtypes

sitename                     object
range                         int64
column                        int64
lat                         float64
lon                         float64
date                 datetime64[ns]
treatment                    object
trait                        object
trait_description            object
method_name                  object
cultivar                     object
value                       float64
units                        object
date_of_planting             object
date_of_flowering    datetime64[ns]
dtype: object

In [33]:
flower_df_4 = flower_df_3.merge(season_4_gdd, how='left', left_on='date_of_flowering', right_on='date')
print(flower_df_4.shape)
flower_df_4.head(3)

(136, 17)


Unnamed: 0,sitename,range,column,lat,lon,date_x,treatment,trait,trait_description,method_name,cultivar,value,units,date_of_planting,date_of_flowering,date_y,gdd
0,MAC Field Scanner Season 4 Range 7 Column 8,7,8,33.074763,-111.974933,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI152694,64.0,days,2017-04-20,2017-06-23,2017-06-23,1100.0
1,MAC Field Scanner Season 4 Range 18 Column 4,18,4,33.075159,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI540518,72.0,days,2017-04-20,2017-07-01,2017-07-01,1296.0
2,MAC Field Scanner Season 4 Range 27 Column 4,27,4,33.075482,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI641835,90.0,days,2017-04-20,2017-07-19,2017-07-19,1734.0
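
A left merge keeps every trait row even when no weather row matches its date, in which case `gdd` is missing. The sketch below, on toy data, shows two optional safeguards from `DataFrame.merge`: `validate='many_to_one'` raises if a date appears more than once in the GDD table, and `indicator=True` adds a `_merge` column that flags unmatched rows. The cultivar `hypothetical_late` and its out-of-season date are made up for illustration.

```python
import pandas as pd

events = pd.DataFrame({
    'cultivar': ['PI152694', 'PI540518', 'hypothetical_late'],
    'date_of_flowering': pd.to_datetime(['2017-06-23', '2017-07-01', '2017-10-01']),
})
gdd = pd.DataFrame({
    'date': pd.to_datetime(['2017-06-23', '2017-07-01']),
    'gdd': [1100.0, 1296.0],
})

# Renaming the key first avoids the duplicated date_x / date_y columns
merged = events.merge(
    gdd.rename(columns={'date': 'date_of_flowering'}),
    how='left',
    on='date_of_flowering',
    validate='many_to_one',
    indicator=True,
)

# Rows whose date had no match in the GDD table
unmatched = merged.loc[merged['_merge'] == 'left_only']
print(unmatched)
```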


#### Drop all date columns except `date_of_flowering`

In [34]:
date_cols_to_drop = ['date_x', 'date_of_planting', 'date_y']
flower_df_5 = flower_df_4.drop(labels=date_cols_to_drop, axis=1)
print(flower_df_5.shape)
flower_df_5.tail(3)

(136, 14)


Unnamed: 0,sitename,range,column,lat,lon,treatment,trait,trait_description,method_name,cultivar,value,units,date_of_flowering,gdd
133,MAC Field Scanner Season 4 Range 18 Column 11,18,11,33.075159,-111.974884,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI641862,64.0,days,2017-06-23,1100.0
134,MAC Field Scanner Season 4 Range 18 Column 13,18,13,33.075159,-111.974851,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI641817,64.0,days,2017-06-23,1100.0
135,MAC Field Scanner Season 4 Range 24 Column 3,24,3,33.075374,-111.975015,"BAP 2017, water-deficit stress Aug 1-14",flowering_time,Number of days from sowing to the date when 50...,Visual classification of sorghum growth stages...,PI63715,89.0,days,2017-07-18,1711.0


#### Check for duplicates

In [35]:
flower_df_5.duplicated().value_counts()

False    97
True     39
dtype: int64
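
The count above shows that some rows are exact copies of earlier rows. One way to inspect them before deciding what to do is sketched below on toy data; whether duplicates should be dropped from the Season 4 output is a curation decision this sketch does not make.

```python
import pandas as pd

records = pd.DataFrame({
    'sitename': ['plot A', 'plot A', 'plot B'],
    'value': [64.0, 64.0, 72.0],
})

# keep=False marks every copy of a duplicated row, not only the repeats
all_copies = records.loc[records.duplicated(keep=False)]
print(all_copies)

# drop_duplicates keeps the first occurrence of each row
deduplicated = records.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```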

#### Write dataframe to csv file with timestamp

In [36]:
timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
output_filename = f'days_gdd_to_flowering_season_4_{timestamp}.csv'.replace(':', '')

flower_df_5.to_csv(output_filename, index=False)

#### C. Days & GDD to Flag Leaf Emergence

In [37]:
fle_0 = df_6.loc[df_6.trait == 'flag_leaf_emergence_time']
print(fle_0.shape)
fle_0.head(3)

(154, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
23550,MAC Field Scanner Season 4 Range 7 Column 8,7,8,33.074763,-111.974933,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI152694,62.0,days
23551,MAC Field Scanner Season 4 Range 18 Column 4,18,4,33.075159,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI540518,70.0,days
23552,MAC Field Scanner Season 4 Range 27 Column 7,27,7,33.075482,-111.97495,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI196586,77.0,days


##### Check for E and W subplots

In [38]:
# returns an empty dataframe (column header only) if there are no subplots

check_for_subplots(fle_0)

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units


#### Read in Season Four Weather Data from MAC Weather Station

In [39]:
# weather_df_0 = pd.read_csv('mac_weather_station_raw_daily_2017.csv')
print(weather_df_0.shape)
weather_df_0.head(3)

(365, 28)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_speed_mean,wind_vector_magnitude,wind_vector_direction,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean
0,2017,1,6,13.6,9.3,11.8,92.7,69.2,83.5,0.23,...,3.5,2.6,188,43,10.9,0.2,1.0,1.2,1.16,9.0
1,2017,2,6,14.9,7.2,10.5,87.7,44.7,71.4,0.39,...,2.2,1.5,129,44,5.8,0.5,1.0,1.6,0.89,5.3
2,2017,3,6,13.9,3.2,9.0,97.0,60.6,81.9,0.24,...,1.0,0.1,349,78,3.3,0.2,0.6,0.9,0.93,5.8


#### Slice dataframe for season dates only and add date column
* Planting Date: 2017-04-20, Day 110
* Last Day of Harvest: 2017-09-16, Day 259

In [40]:
weather_df_1 = weather_df_0.loc[(weather_df_0.day_of_year >= 110) & (weather_df_0.day_of_year <= 259)]
print(weather_df_1.shape)
weather_df_1.head(3)

(150, 28)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_speed_mean,wind_vector_magnitude,wind_vector_direction,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean
109,2017,110,6,33.3,14.1,23.5,45.0,5.1,18.2,2.63,...,1.9,0.8,233,60,8.2,10.3,8.0,6.8,0.47,-3.7
110,2017,111,6,34.4,11.1,24.0,46.5,5.5,17.2,2.82,...,2.2,1.3,274,52,8.5,9.4,8.5,7.4,0.43,-4.9
111,2017,112,6,35.5,14.5,25.0,32.5,6.4,15.6,2.95,...,1.6,0.5,178,66,5.2,11.0,8.0,6.7,0.45,-4.2


In [41]:
season_4_date_range = pd.date_range(start='2017-04-20', end='2017-09-16')

In [42]:
weather_df_2 = weather_df_1.copy()
weather_df_2['date'] = season_4_date_range
print(weather_df_2.shape)
weather_df_2.tail(3)

(150, 29)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_vector_magnitude,wind_vector_direction,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean,date
256,2017,257,6,39.5,22.8,31.4,50.6,17.8,32.9,3.29,...,3.6,203,34,13.6,15.1,8.6,9.2,1.45,12.5,2017-09-14
257,2017,258,6,36.2,21.4,28.5,63.7,14.2,33.7,2.82,...,2.1,192,42,9.9,14.2,7.7,7.4,1.2,9.3,2017-09-15
258,2017,259,6,36.3,18.2,27.6,51.4,16.7,29.9,2.8,...,1.4,168,47,8.0,12.8,7.0,6.5,1.07,7.8,2017-09-16


#### Add Growing Degree Days
- Future: add LaTeX equation
- Future: add info about min and max daily values
- 10 degrees Celsius is base temp for sorghum
- Daily gdd value = ((max temp + min temp) / 2) - 10 (base temp)
- Growing Degree Days = cumulative sum of daily gdd values

In [43]:
weather_df_3 = weather_df_2.copy()
weather_df_3['daily_gdd'] = (((weather_df_3['air_temp_max'] + weather_df_3['air_temp_min'])) / 2) - 10
print(weather_df_3.shape)
weather_df_3.head(3)

(150, 30)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_vector_direction,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean,date,daily_gdd
109,2017,110,6,33.3,14.1,23.5,45.0,5.1,18.2,2.63,...,233,60,8.2,10.3,8.0,6.8,0.47,-3.7,2017-04-20,13.7
110,2017,111,6,34.4,11.1,24.0,46.5,5.5,17.2,2.82,...,274,52,8.5,9.4,8.5,7.4,0.43,-4.9,2017-04-21,12.75
111,2017,112,6,35.5,14.5,25.0,32.5,6.4,15.6,2.95,...,178,66,5.2,11.0,8.0,6.7,0.45,-4.2,2017-04-22,15.0


In [44]:
weather_df_4 = weather_df_3.copy()

# round to the nearest integer
weather_df_4['gdd'] = np.rint(np.cumsum(weather_df_4['daily_gdd']))
print(weather_df_4.shape)
weather_df_4.tail(3)

(150, 31)


Unnamed: 0,year,day_of_year,station_number,air_temp_max,air_temp_min,air_temp_mean,rh_max,rh_min,rh_mean,vpd_mean,...,wind_direction_std,max_wind_speed,heat_units,eto_azmet,eto_p_m,vapor_pressure_mean,dewpoint_mean,date,daily_gdd,gdd
256,2017,257,6,39.5,22.8,31.4,50.6,17.8,32.9,3.29,...,34,13.6,15.1,8.6,9.2,1.45,12.5,2017-09-14,21.15,2999.0
257,2017,258,6,36.2,21.4,28.5,63.7,14.2,33.7,2.82,...,42,9.9,14.2,7.7,7.4,1.2,9.3,2017-09-15,18.8,3018.0
258,2017,259,6,36.3,18.2,27.6,51.4,16.7,29.9,2.8,...,47,8.0,12.8,7.0,6.5,1.07,7.8,2017-09-16,17.25,3035.0


#### Add planting date 2017-04-20

In [45]:
day_of_planting = datetime.date(2017,4,20)
fle_1 = fle_0.copy()

fle_1['date_of_planting'] = day_of_planting
print(fle_1.shape)
fle_1.head(3)

(154, 14)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units,date_of_planting
23550,MAC Field Scanner Season 4 Range 7 Column 8,7,8,33.074763,-111.974933,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI152694,62.0,days,2017-04-20
23551,MAC Field Scanner Season 4 Range 18 Column 4,18,4,33.075159,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI540518,70.0,days,2017-04-20
23552,MAC Field Scanner Season 4 Range 27 Column 7,27,7,33.075482,-111.97495,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI196586,77.0,days,2017-04-20


#### Create timedelta using days to flag leaf emergence

In [46]:
timedelta_values = fle_1['value'].values
dates_of_flag_leaf_emergence = []

for val in timedelta_values:
    
    date_of_flag_leaf_emergence = day_of_planting + datetime.timedelta(days=val)
    dates_of_flag_leaf_emergence.append(date_of_flag_leaf_emergence)
    
print(fle_1.shape[0])
print(len(dates_of_flag_leaf_emergence))

154
154


In [47]:
fle_2 = fle_1.copy()
fle_2['date_of_flag_leaf_emergence'] = dates_of_flag_leaf_emergence
print(fle_2.shape)
fle_2.head(3)

(154, 15)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units,date_of_planting,date_of_flag_leaf_emergence
23550,MAC Field Scanner Season 4 Range 7 Column 8,7,8,33.074763,-111.974933,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI152694,62.0,days,2017-04-20,2017-06-21
23551,MAC Field Scanner Season 4 Range 18 Column 4,18,4,33.075159,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI540518,70.0,days,2017-04-20,2017-06-29
23552,MAC Field Scanner Season 4 Range 27 Column 7,27,7,33.075482,-111.97495,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI196586,77.0,days,2017-04-20,2017-07-06


#### Add GDD to flag leaf emergence

In [48]:
# slice df for date and cumulative gdd values only

season_4_gdd = weather_df_4[['date', 'gdd']]
print(season_4_gdd.shape)
season_4_gdd.head(3)

(150, 2)


Unnamed: 0,date,gdd
109,2017-04-20,14.0
110,2017-04-21,26.0
111,2017-04-22,41.0


In [49]:
fle_2.dtypes

sitename                               object
range                                   int64
column                                  int64
lat                                   float64
lon                                   float64
date                           datetime64[ns]
treatment                              object
trait                                  object
trait_description                      object
method_name                            object
cultivar                               object
value                                 float64
units                                  object
date_of_planting                       object
date_of_flag_leaf_emergence            object
dtype: object

In [50]:
fle_3 = fle_2.copy()
fle_3.date_of_flag_leaf_emergence = pd.to_datetime(fle_3.date_of_flag_leaf_emergence)
fle_3.dtypes

sitename                               object
range                                   int64
column                                  int64
lat                                   float64
lon                                   float64
date                           datetime64[ns]
treatment                              object
trait                                  object
trait_description                      object
method_name                            object
cultivar                               object
value                                 float64
units                                  object
date_of_planting                       object
date_of_flag_leaf_emergence    datetime64[ns]
dtype: object
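
The conversion above matters because the merge in the next cell matches `date_of_flag_leaf_emergence` against the weather table's `date` column, and the two keys need compatible dtypes. A minimal sketch of the same conversion on a toy frame (the name `toy` is a stand-in for illustration, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for fle_2: dates stored as strings (object dtype)
toy = pd.DataFrame({'date_of_flag_leaf_emergence': ['2017-06-21', '2017-06-29']})

# errors='raise' (the default) stops on any date string that cannot be parsed
toy['date_of_flag_leaf_emergence'] = pd.to_datetime(
    toy['date_of_flag_leaf_emergence'], errors='raise'
)

# Confirm the column is now a datetime type before using it as a merge key
is_datetime = pd.api.types.is_datetime64_any_dtype(toy['date_of_flag_leaf_emergence'])
print(is_datetime)
```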

In [51]:
fle_4 = fle_3.merge(season_4_gdd, how='left', left_on='date_of_flag_leaf_emergence', right_on='date')
print(fle_4.shape)
fle_4.head(3)

(154, 17)


Unnamed: 0,sitename,range,column,lat,lon,date_x,treatment,trait,trait_description,method_name,cultivar,value,units,date_of_planting,date_of_flag_leaf_emergence,date_y,gdd
0,MAC Field Scanner Season 4 Range 7 Column 8,7,8,33.074763,-111.974933,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI152694,62.0,days,2017-04-20,2017-06-21,2017-06-21,1050.0
1,MAC Field Scanner Season 4 Range 18 Column 4,18,4,33.075159,-111.974999,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI540518,70.0,days,2017-04-20,2017-06-29,2017-06-29,1250.0
2,MAC Field Scanner Season 4 Range 27 Column 7,27,7,33.075482,-111.97495,2017-07-20,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI196586,77.0,days,2017-04-20,2017-07-06,2017-07-06,1420.0
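
The left merge above attaches the cumulative GDD recorded on each plot's flag leaf emergence date. The sketch below repeats it on two rows copied from the output above; `toy_fle` and `toy_gdd` are hypothetical stand-ins for `fle_3` and `season_4_gdd`. It assumes the weather table has one row per date, which `validate='many_to_one'` checks, and it counts rows left without a GDD value.

```python
import pandas as pd

# Toy stand-ins for fle_3 and season_4_gdd (values copied from the outputs above)
toy_fle = pd.DataFrame({
    'cultivar': ['PI152694', 'PI540518'],
    'date_of_flag_leaf_emergence': pd.to_datetime(['2017-06-21', '2017-06-29']),
})
toy_gdd = pd.DataFrame({
    'date': pd.to_datetime(['2017-06-21', '2017-06-29']),
    'gdd': [1050.0, 1250.0],
})

merged = toy_fle.merge(
    toy_gdd,
    how='left',
    left_on='date_of_flag_leaf_emergence',
    right_on='date',
    validate='many_to_one',  # raises MergeError if a date repeats in toy_gdd
)

# A left merge leaves gdd as NaN for any date missing from the weather table
n_missing = int(merged['gdd'].isna().sum())
print(merged.shape, n_missing)
```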


#### Drop all date columns except `date_of_flag_leaf_emergence`

In [52]:
date_cols_to_drop = ['date_x', 'date_of_planting', 'date_y']
fle_5 = fle_4.drop(labels=date_cols_to_drop, axis=1)
print(fle_5.shape)
fle_5.tail(3)

(154, 14)


Unnamed: 0,sitename,range,column,lat,lon,treatment,trait,trait_description,method_name,cultivar,value,units,date_of_flag_leaf_emergence,gdd
151,MAC Field Scanner Season 4 Range 18 Column 13,18,13,33.075159,-111.974851,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI641817,56.0,days,2017-06-15,904.0
152,MAC Field Scanner Season 4 Range 24 Column 3,24,3,33.075374,-111.975015,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI63715,77.0,days,2017-07-06,1420.0
153,MAC Field Scanner Season 4 Range 24 Column 9,24,9,33.075374,-111.974917,"BAP 2017, water-deficit stress Aug 1-14",flag_leaf_emergence_time,Number days from sowing to the date when 50% o...,Visual classification of sorghum growth stages...,PI641830,46.0,days,2017-06-05,712.0


#### Check for duplicates

In [53]:
fle_5.duplicated().value_counts()

False    108
True      46
dtype: int64

In [54]:
# keep duplicates for now?
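
Before deciding whether to keep the 46 duplicated rows, it may help to look at them next to their originals. A minimal sketch on toy data (the shortened sitenames and the `toy` frame are made up for illustration):

```python
import pandas as pd

# Toy frame with one fully repeated row
toy = pd.DataFrame({
    'sitename': ['Range 7 Column 8', 'Range 7 Column 8', 'Range 18 Column 4'],
    'cultivar': ['PI152694', 'PI152694', 'PI540518'],
    'value': [62.0, 62.0, 70.0],
})

# keep=False marks every member of a duplicated group, not only the repeats
all_copies = toy.loc[toy.duplicated(keep=False)]

# drop_duplicates keeps the first copy of each fully identical row
deduplicated = toy.drop_duplicates()
print(len(all_copies), len(deduplicated))
```

If the repeats turn out to be re-entered records rather than independent measurements, `drop_duplicates()` would be one way to remove them.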

#### Write dataframe to csv file with timestamp

In [55]:
timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
output_filename = f'days_gdd_to_flag_leaf_emergence_season_4_{timestamp}.csv'.replace(':', '')

# write the flag leaf emergence dataframe built in this section
fle_5.to_csv(output_filename, index=False)
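
The timestamped filename pattern is repeated for each trait in this notebook. One possible way to factor it out is sketched below; `timestamped_filename` is a hypothetical helper and `example_trait_season_4` is a placeholder prefix, neither of which appears in the original code.

```python
import datetime

def timestamped_filename(prefix, now=None):
    """Hypothetical helper: build '<prefix>_<ISO timestamp>.csv' with colons removed."""
    now = now or datetime.datetime.now()
    timestamp = now.replace(microsecond=0).isoformat()
    # Colons are stripped because they are not allowed in filenames on some systems
    return f'{prefix}_{timestamp}.csv'.replace(':', '')

# A fixed datetime is passed here only to make the example reproducible
example = timestamped_filename(
    'example_trait_season_4', datetime.datetime(2020, 2, 26, 10, 15, 30)
)
print(example)  # example_trait_season_4_2020-02-26T101530.csv
```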

### D. Canopy Height - Time Series

In [56]:
ch_0 = df_6.loc[df_6.trait == 'canopy_height']
print(ch_0.shape)
ch_0.head(3)

(52154, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units
21410,MAC Field Scanner Season 4 Range 5 Column 7 E,5,7,33.074691,-111.974945,2017-07-11,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI329300,310.0,cm
22171,MAC Field Scanner Season 4 Range 9 Column 6 E,9,6,33.074835,-111.974962,2017-05-29,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI329351,44.0,cm
22172,MAC Field Scanner Season 4 Range 11 Column 3 W,11,3,33.074907,-111.975019,2017-05-29,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI655978,58.0,cm


In [57]:
subplots = check_for_subplots(ch_0)
subplots.shape

(4430, 13)

#### Take average canopy height values for subplots on same day
- Strip ` E` and ` W` subplot designations
- Group by rows with the same sitename and date and take the average value

In [58]:
sitename_values = ch_0.sitename.values
no_e_w_names = []

for name in sitename_values:

    # strip the trailing ' E' or ' W' subplot designation, if present
    if name.endswith((' W', ' E')):
        name = name[:-2]

    no_e_w_names.append(name)
        
print(len(no_e_w_names))

52154
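
The loop above could also be written as a single vectorized string operation. A minimal sketch, assuming the only subplot suffixes are a trailing ` E` or ` W` (`toy_sitenames` is a stand-in for `ch_0.sitename`):

```python
import pandas as pd

toy_sitenames = pd.Series([
    'MAC Field Scanner Season 4 Range 5 Column 7 E',
    'MAC Field Scanner Season 4 Range 11 Column 3 W',
    'MAC Field Scanner Season 4 Range 10 Column 10',
])

# Remove a trailing ' E' or ' W'; names without a subplot suffix are unchanged
stripped = toy_sitenames.str.replace(r' [EW]$', '', regex=True)
print(stripped.tolist())
```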


#### Add new sitename column with no subplots

In [59]:
ch_1 = ch_0.copy()
ch_1['sitename_1'] = no_e_w_names
print(ch_1.shape)
ch_1.head(3)

(52154, 14)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units,sitename_1
21410,MAC Field Scanner Season 4 Range 5 Column 7 E,5,7,33.074691,-111.974945,2017-07-11,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI329300,310.0,cm,MAC Field Scanner Season 4 Range 5 Column 7
22171,MAC Field Scanner Season 4 Range 9 Column 6 E,9,6,33.074835,-111.974962,2017-05-29,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI329351,44.0,cm,MAC Field Scanner Season 4 Range 9 Column 6
22172,MAC Field Scanner Season 4 Range 11 Column 3 W,11,3,33.074907,-111.975019,2017-05-29,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI655978,58.0,cm,MAC Field Scanner Season 4 Range 11 Column 3


#### Use sqlite database to group by `sitename_1`, `date`, and `cultivar`

In [60]:
conn = sqlite3.connect('canopy_heights.sqlite')
cursor = conn.cursor()
print("Opened database successfully")

Opened database successfully


In [62]:
# if_exists='replace' rebuilds the table, so this cell can be re-run without commenting it out
ch_1.to_sql('canopy_heights.sqlite', conn, if_exists='replace')

In [63]:
ch_2 = pd.read_sql_query("""
                            SELECT sitename_1 AS sitename, range, column, lat, lon, date, treatment, 
                            trait, trait_description, method_name, cultivar, 
                            ROUND(AVG(value), 2) AS avg_canopy_height, units 
                            FROM 'canopy_heights.sqlite'
                            GROUP BY sitename_1, date, cultivar
                            ORDER BY date ASC;
                            """, conn)

print(ch_2.shape)
ch_2.head(3)

(34632, 13)


Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,avg_canopy_height,units
0,MAC Field Scanner Season 4 Range 10 Column 10,10,10,33.074871,-111.9749,2017-05-01 00:00:00,,canopy_height,"top of the general canopy of the plant, discou...",3D scanner to 98th quantile height,PI152816,12.0,cm
1,MAC Field Scanner Season 4 Range 10 Column 11,10,11,33.074871,-111.974884,2017-05-01 00:00:00,,canopy_height,"top of the general canopy of the plant, discou...",3D scanner to 98th quantile height,PI195754,12.0,cm
2,MAC Field Scanner Season 4 Range 10 Column 12,10,12,33.074871,-111.974868,2017-05-01 00:00:00,,canopy_height,"top of the general canopy of the plant, discou...",3D scanner to 98th quantile height,PI329501,12.0,cm
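
The same averaging could be done in pandas without the sqlite round trip. The sketch below is one possible equivalent on toy data, not the method this notebook uses. The shortened sitenames are made up, and only the grouping keys and the averaged value are carried along. Note that in the SQL query the other selected columns (`range`, `lat`, `method_name`, and so on) are not part of the `GROUP BY`, so SQLite fills them from one row in each group.

```python
import pandas as pd

# Toy stand-in for ch_1 with two subplot rows for the same plot and date
toy = pd.DataFrame({
    'sitename_1': ['Range 5 Column 7', 'Range 5 Column 7', 'Range 9 Column 6'],
    'date': pd.to_datetime(['2017-07-11', '2017-07-11', '2017-05-29']),
    'cultivar': ['PI329300', 'PI329300', 'PI329351'],
    'value': [310.0, 318.0, 44.0],
})

avg = (
    toy.groupby(['sitename_1', 'date', 'cultivar'], as_index=False)
       .agg(avg_canopy_height=('value', 'mean'))  # one mean per plot, date, cultivar
       .round({'avg_canopy_height': 2})
       .sort_values('date')                       # mirrors ORDER BY date ASC
       .reset_index(drop=True)
)
print(avg)
```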


In [64]:
# Sanity Check

sample_with_subplot = ch_1.loc[(ch_1.range == 5) & (ch_1.column == 7) & (ch_1.date == '2017-07-11')]
sample_with_subplot

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,value,units,sitename_1
21410,MAC Field Scanner Season 4 Range 5 Column 7 E,5,7,33.074691,-111.974945,2017-07-11,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI329300,310.0,cm,MAC Field Scanner Season 4 Range 5 Column 7
25752,MAC Field Scanner Season 4 Range 5 Column 7 W,5,7,33.074691,-111.974953,2017-07-11,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI329300,318.0,cm,MAC Field Scanner Season 4 Range 5 Column 7
45448,MAC Field Scanner Season 4 Range 5 Column 7 E,5,7,33.074691,-111.974945,2017-07-11,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI329300,310.0,cm,MAC Field Scanner Season 4 Range 5 Column 7


In [65]:
# Sanity Check - should have only one row for the above group

sample_without_subplot = ch_2.loc[(ch_2.range == 5) & (ch_2.column == 7) & (ch_2.date == '2017-07-11 00:00:00')]
sample_without_subplot

Unnamed: 0,sitename,range,column,lat,lon,date,treatment,trait,trait_description,method_name,cultivar,avg_canopy_height,units
26726,MAC Field Scanner Season 4 Range 5 Column 7,5,7,33.074691,-111.974945,2017-07-11 00:00:00,"BAP 2017, water-deficit stress Aug 1-14",canopy_height,"top of the general canopy of the plant, discou...",Manual canopy height,PI329300,312.67,cm
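
The first sanity check returns three rows because the E subplot measurement appears twice (rows 21410 and 45448), so the average of 312.67 counts the E value twice. The sketch below reproduces that arithmetic and shows the mean if the repeated row were dropped first. Whether the repeat should be dropped is an open question in this notebook, and the shortened sitenames are for illustration only.

```python
import pandas as pd

# The three values from the first sanity check above (E, W, and a repeated E row)
sample = pd.DataFrame({
    'sitename': ['Range 5 Column 7 E', 'Range 5 Column 7 W', 'Range 5 Column 7 E'],
    'value': [310.0, 318.0, 310.0],
})

# Mean over all three rows, as computed by the SQL query
mean_with_repeat = round(float(sample['value'].mean()), 2)

# Mean after removing the fully identical repeated row
mean_without_repeat = round(float(sample.drop_duplicates()['value'].mean()), 2)
print(mean_with_repeat, mean_without_repeat)
```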


#### Write dataframe to csv file with timestamp

In [66]:
timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
output_filename = f'canopy_height_time_series_season_4_{timestamp}.csv'.replace(':', '')

ch_2.to_csv(output_filename, index=False)

## Run All Cells _above_ this one. The end-of-season canopy height section below is not working yet.

### E. Canopy Height - End of Season

In [67]:
# eos_ch_0 = df_6.loc[df_6.trait == 'canopy_height']
# print(eos_ch_0.shape)
# eos_ch_0.head(3)

In [68]:
# subplots = check_for_subplots(eos_ch_0)
# subplots.shape

#### Take average canopy height values for subplots on same day
- Strip ` E` and ` W` subplot designations
- Group by rows with the same sitename and date and take the average value

In [69]:
# sitename_values = eos_ch_0.sitename.values
# no_e_w_names = []

# for name in sitename_values:

#     if name.endswith((' W', ' E')):
#         name = name[:-2]

#     no_e_w_names.append(name)
        
# print(len(no_e_w_names))

#### Add new sitename column with no subplots

In [70]:
# eos_ch_1 = eos_ch_0.copy()
# eos_ch_1['sitename_1'] = no_e_w_names
# print(eos_ch_1.shape)
# eos_ch_1.head(3)

#### Use sqlite database to group by `sitename_1`, `date`, and `cultivar`
- Order by `date` descending to keep the first duplicate (latest date) for `sitename` and `cultivar`

In [71]:
# conn = sqlite3.connect('end_of_season_canopy_heights.sqlite')
# cursor = conn.cursor()
# print("Opened database successfully")

In [107]:
# eos_ch_1.to_sql('end_of_season_canopy_heights.sqlite', conn)

In [72]:
# eos_ch_2 = pd.read_sql_query("""
#                             SELECT sitename_1 AS sitename, range, column, lat, lon, date, treatment, 
#                             trait, trait_description, method_name, cultivar, 
#                             ROUND(AVG(value), 2) AS avg_canopy_height, units 
#                             FROM 'end_of_season_canopy_heights.sqlite'
#                             GROUP BY sitename_1, date, cultivar
#                             ORDER BY date DESC;
#                             """, conn)

# print(eos_ch_2.shape)
# eos_ch_2.head(3)

#### Drop Duplicates

In [73]:
# eos_ch_3 = eos_ch_2.drop_duplicates(subset=['sitename', 'range', 'column', 'treatment', 'cultivar'], keep='first')
# print(eos_ch_3.shape)
# eos_ch_3.head(3)
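
The idea behind this step is to sort by date descending and then keep the first row for each plot, which leaves that plot's latest measurement. A minimal pandas sketch on toy data follows; the shortened sitenames and the 40.0 value are made up, and this is not a tested replacement for the commented-out cell.

```python
import pandas as pd

# Toy stand-in for eos_ch_2: one plot measured on two dates, another on one
toy = pd.DataFrame({
    'sitename': ['Range 5 Column 7', 'Range 5 Column 7', 'Range 9 Column 6'],
    'cultivar': ['PI329300', 'PI329300', 'PI329351'],
    'date': pd.to_datetime(['2017-05-29', '2017-07-11', '2017-05-29']),
    'avg_canopy_height': [40.0, 312.67, 44.0],
})

latest = (
    toy.sort_values('date', ascending=False)                            # newest first
       .drop_duplicates(subset=['sitename', 'cultivar'], keep='first')  # keep newest row per plot
       .sort_values('sitename')
       .reset_index(drop=True)
)
print(latest)
```

A plot's latest date is only a proxy for end of season: if measurements for a plot stopped early, its latest row may come from well before harvest.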

In [74]:
# check date ranges
# print(eos_ch_3.date.min())
# print(eos_ch_3.date.max())

In [75]:
# find row with date.min()

# eos_ch_3.loc[eos_ch_3.date == '2017-05-29 00:00:00']

In [76]:
# check earlier df to ensure no measurements were taken after 2017-05-29

# eos_ch_2.loc[eos_ch_2.sitename == 'MAC Field Scanner Season 4 Range 52 Column 6']

#### Write dataframe to csv file with timestamp

In [77]:
# timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
# output_filename = f'end_of_season_canopy_height_season_4_{timestamp}.csv'.replace(':', '')

# eos_ch_3.to_csv(output_filename, index=False)