### Kansas Sorghum Experiments Data Cleaning Notebook
#### Data from Kansas State University Sorghum Experiments
- goal: to gather more cultivar data in addition to MAC Sorghum Seasons 4 & 6
- please contact Emily Cain at ejcain@arizona.edu with any questions or feedback

In [None]:
import datetime
import numpy as np
import pandas as pd

#### A. Read in data queried from betydb in `R` using this code:
```
library(traits)

options(betydb_url = "https://terraref.ncsa.illinois.edu/bety/",
        betydb_api_version = 'v1',
        betydb_key = 'secret_api_key_123456_abcde')
        
kansas <- betydb_query(experiment  = "~KSU",
                         limit     =  "none")
                      
write.csv(kansas, file = "kansas_experiments_2020-03-24.csv")
```

In [None]:
df_0 = pd.read_csv('data/ksu_experiments_2020-03-24.csv', low_memory=False)
print(df_0.shape)
df_0.head(3)

#### B. Find sitenames that do **not** start with `MAC`
- Slice dataframe to only include sitenames that include `KSU`
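
The two filters in this step can be tried on a toy table first. This is a minimal sketch that assumes hypothetical sitenames; the real values in `df_0` may be formatted differently.

```python
import pandas as pd

# Hypothetical sitenames, for illustration only
toy = pd.DataFrame({
    'sitename': ['MAC Field Scanner Season 4 Range 1 Column 1',
                 'KSU 2016 Range 2 Pass 3',
                 'Other Site Plot 7'],
})

# Rows whose sitename does NOT start with 'MAC'
toy_non_mac = toy[~toy['sitename'].str.startswith('MAC')]

# Of those, rows whose sitename contains 'KSU'
toy_ksu = toy_non_mac[toy_non_mac['sitename'].str.contains('KSU')]

print(toy_non_mac.shape)  # (2, 1)
print(toy_ksu.shape)      # (1, 1)
```

Filtering in two passes makes it easy to inspect any non-MAC sites that are also not KSU sites before they are discarded.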

In [None]:
non_mac_sites = df_0[~df_0.sitename.str.startswith('MAC')]
print(non_mac_sites.shape)
# non_mac_sites.head(3)

In [None]:
print(non_mac_sites.raw_date.min())
print(non_mac_sites.raw_date.max())

In [None]:
ksu_0 = non_mac_sites[non_mac_sites.sitename.str.contains('KSU')]
print(ksu_0.shape)
# ksu_0.tail(3)

#### C. Slice for selected traits
- canopy height
- days & GDD to flowering
- may use other traits as needed for future models
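
If more traits are added later, `Series.isin` may be easier to extend than chaining `==` comparisons with `|`. The sketch below uses made-up rows, and `leaf_length` is a hypothetical trait name included only to show a row being filtered out.

```python
import pandas as pd

# Made-up trait rows, for illustration only
toy_traits = pd.DataFrame({
    'trait': ['flowering_time', 'canopy_height', 'leaf_length', 'canopy_height'],
    'mean': [62.0, 110.0, 45.5, 125.0],
})

# Keep only rows whose trait is in the selected list
selected_traits = ['flowering_time', 'canopy_height']
toy_selected = toy_traits.loc[toy_traits['trait'].isin(selected_traits)]

print(toy_selected.shape)  # (3, 2)
```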

In [None]:
ksu_0.trait.unique()

In [None]:
ksu_1 = ksu_0.loc[(ksu_0.trait == 'flowering_time') | (ksu_0.trait == 'canopy_height')]
print(ksu_1.shape)
# ksu_1.head(3)

#### D. Drop & Rename Columns
- rename `mean` to `value`
- convert `raw_date` to new datetime object
- new datetime object will be in `date` column

In [None]:
# ksu_1.columns

In [None]:
# Can drop most columns with only one value

# for col in ksu_1.columns:
    
#     if ksu_1[col].nunique() < 5:
#         print(f'Unique values for {col}: {ksu_1[col].unique()}')

In [None]:
cols_to_drop = ['Unnamed: 0', 'checked', 'result_type', 'id', 'citation_id', 'site_id', 'treatment_id', 
                'city', 'scientificname', 'commonname', 'genus', 'species_id', 'cultivar_id', 'author', 
                'citation_year', 'time', 'month', 'year', 'n', 'statname', 'stat', 'notes', 'access_level', 
                'entity', 'view_url', 'edit_url', 'treatment', 'date', 'dateloc']

ksu_2 = ksu_1.drop(labels=cols_to_drop, axis=1)
print(ksu_2.shape)
# ksu_2.tail(3)

#### Convert `raw_date` to datetime object

In [None]:
ksu_2.dtypes

In [None]:
new_dates = pd.to_datetime(ksu_2.raw_date)

ksu_3 = ksu_2.copy()
ksu_3['date'] = new_dates

print(ksu_2.shape[0])
print(ksu_3.shape[0])

# ksu_3.head(3)

In [None]:
ksu_4 = ksu_3.rename({'mean': 'value'}, axis=1)
print(ksu_4.shape)
# ksu_4.tail(3)

#### E. Extract `Range` and `Pass` values
- still need to determine how the field is structured
- is `Pass` similar to `Column` in the MAC experiments?
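
The extraction can be checked on a couple of example strings before running it on the full table. This sketch assumes sitenames that contain `Range <n>` and `Pass <n>`; the example strings are hypothetical.

```python
import pandas as pd

# Hypothetical sitenames; the real KSU sitenames may differ in format
toy_sites = pd.DataFrame({'sitename': ['KSU 2016 Range 2 Pass 13',
                                       'KSU 2016 Range 10 Pass 4']})

# Raw strings keep `\d` from being read as a string escape.
# expand=False returns a Series when the pattern has one capture group.
toy_sites['range'] = toy_sites['sitename'].str.extract(r'Range (\d+)', expand=False).astype(int)
toy_sites['pass'] = toy_sites['sitename'].str.extract(r'Pass (\d+)', expand=False).astype(int)

print(toy_sites[['range', 'pass']].values.tolist())  # [[2, 13], [10, 4]]
```

Note that `.astype(int)` raises an error if any sitename fails to match, which is a useful signal that the naming pattern is not uniform.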

In [None]:
ksu_5 = ksu_4.copy()

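# Raw strings avoid invalid-escape warnings for `\d`; expand=False returns a Series
ksu_5['range'] = ksu_5['sitename'].str.extract(r"Range (\d+)", expand=False).astype(int)
ksu_5['pass'] = ksu_5['sitename'].str.extract(r"Pass (\d+)", expand=False).astype(int)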

# ksu_5.sample(n=5)

#### F. Growing Degree Days (GDD) to Flowering
- Weather data taken from [KSU Weather Station](http://mesonet.k-state.edu/weather/historical/) in Manhattan
- planting date: 2016-06-17
- harvest date: 2016-10-21

In [None]:
manhattan_weather_0 = pd.read_csv('data/manhattan_weather_2016_daily.csv')
print(manhattan_weather_0.shape)
manhattan_weather_0.head(5)

#### Change column names and drop first two rows
- Add datetime column

In [None]:
manhattan_weather_1 = manhattan_weather_0.copy()

datetimes = pd.to_datetime(manhattan_weather_1['Timestamp'])
manhattan_weather_1['date'] = datetimes

print(manhattan_weather_1.shape)
# manhattan_weather_1.tail()

In [None]:
# manhattan_weather_1.columns

In [None]:
# Drop first 2 rows

manhattan_weather_2 = manhattan_weather_1.iloc[2:]
print(manhattan_weather_2.shape)
# manhattan_weather_2.head()

In [None]:
# Drop `Timestamp` column

manhattan_weather_3 = manhattan_weather_2.drop(labels=['Timestamp'], axis=1)
print(manhattan_weather_3.shape)
# manhattan_weather_3.head()

In [None]:
manhattan_weather_4 = manhattan_weather_3.rename({'Station': 'station', 'AirTemperature': 'air_temp_max_F', 
                                                  'AirTemperature.1': 'air_temp_min_F', 'RelativeHumidity': 'avg_rh',
                                                  'Precipitation': 'precip_total', 'WindSpeed2m': 'avg_wind_speed', 
                                                  'WindSpeed2m.1': 'max_wind_speed', 'SoilTemperature5cm': 'soil_temp_5cm_max',
                                                  'SoilTemperature5cm.1': 'soil_temp_5cm_min', 
                                                  'SoilTemperature10cm': 'soil_temp_10cm_max', 
                                                  'SoilTemperature10cm.1': 'soil_temp_10cm_min', 'SolarRadiation': 'solar_rad',
                                                  'ETo': 'eto_grass', 'ETo.1': 'eto_alfalfa'}, axis=1)
print(manhattan_weather_4.shape)
# manhattan_weather_4.head()


#### Add Day-of-year (DOY) to Weather Dataframe
- slice dataframe to only include season dates from planting to harvest
- change `date` to index, but keep `date` column
- use the pandas `DatetimeIndex.dayofyear` attribute
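
A minimal sketch of these three steps on a stand-in table with one row per day of 2016. The table is generated here for illustration and is not the weather data.

```python
import pandas as pd

# One row per day for all of 2016, standing in for the weather table
toy_weather = pd.DataFrame({'date': pd.date_range('2016-01-01', '2016-12-31', freq='D')})

# Keep planting through harvest (both ends inclusive)
in_season = (toy_weather['date'] >= '2016-06-17') & (toy_weather['date'] <= '2016-10-21')
toy_season = toy_weather.loc[in_season].set_index('date', drop=False)

# `dayofyear` is an attribute of DatetimeIndex, not a method
toy_season['day_of_year'] = toy_season.index.dayofyear

print(toy_season['day_of_year'].iloc[0], toy_season['day_of_year'].iloc[-1])  # 169 295
```

Because 2016 is a leap year, the planting date falls on day 169 and the harvest date on day 295.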

In [None]:
manhattan_weather_5 = manhattan_weather_4.loc[(manhattan_weather_4['date'] >= '2016-06-17') & (manhattan_weather_4['date'] <= '2016-10-21')]

In [None]:
manhattan_weather_6 = manhattan_weather_5.set_index(keys=['date'], drop=False)
print(manhattan_weather_6.shape)
# manhattan_weather_6.tail(3)

In [None]:
manhattan_weather_7 = manhattan_weather_6.copy()

manhattan_weather_7['day_of_year'] = manhattan_weather_7.index.dayofyear

In [None]:
# manhattan_weather_7.tail(3)

#### Add Growing Degree Days (GDD)
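- convert all numeric columns from string using `pd.to_numeric`
- add air temps in C
- conversion: C = (F - 32) x 5/9 (5/9 ≈ 0.5556)
- daily gdd equation = ((max_air_temp_C + min_air_temp_C) / 2) - 10, with a base temperature of 10 C
- negative daily gdd values are set to 0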
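
The arithmetic can be worked through in plain Python before applying it to the weather columns. This is a sketch with made-up temperatures; unlike the notebook cells, it does not round the Celsius values to one decimal before averaging, and the function names are illustrative.

```python
from itertools import accumulate

BASE_TEMP_C = 10  # base temperature used in the daily GDD equation


def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def daily_gdd(max_f, min_f):
    """Daily GDD from max/min air temperature in F, floored at 0."""
    mean_c = (fahrenheit_to_celsius(max_f) + fahrenheit_to_celsius(min_f)) / 2
    return max(mean_c - BASE_TEMP_C, 0.0)


# Made-up (max_F, min_F) pairs for three days
toy_temps = [(86, 68), (50, 32), (77, 59)]
toy_daily = [daily_gdd(max_f, min_f) for max_f, min_f in toy_temps]
toy_cumulative = list(accumulate(toy_daily))

print(toy_daily)       # [15.0, 0.0, 10.0]
print(toy_cumulative)  # [15.0, 15.0, 25.0]
```

The second day has a mean temperature of 5 C, which would give -5 GDD, so it is floored at 0 and does not reduce the cumulative total.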

In [None]:
cols_to_convert = ['air_temp_max_F', 'air_temp_min_F', 'avg_rh', 'precip_total', 'avg_wind_speed', 'max_wind_speed', 
                   'soil_temp_5cm_max', 'soil_temp_5cm_min', 'soil_temp_10cm_max', 'soil_temp_10cm_min', 'solar_rad', 
                   'eto_grass', 'eto_alfalfa']

In [None]:
manhattan_weather_7[cols_to_convert] = manhattan_weather_7[cols_to_convert].apply(pd.to_numeric)

In [None]:
manhattan_weather_7.dtypes

In [None]:
manhattan_weather_8 = manhattan_weather_7.copy()

manhattan_weather_8['air_temp_max_C'] = round(((manhattan_weather_8['air_temp_max_F'] - 32) * 5 / 9), 1)
print(manhattan_weather_8.shape)
# manhattan_weather_8.tail(3)

In [None]:
manhattan_weather_9 = manhattan_weather_8.copy()

manhattan_weather_9['air_temp_min_C'] = round(((manhattan_weather_9['air_temp_min_F'] - 32) * 5 / 9), 1)
print(manhattan_weather_9.shape)
# manhattan_weather_9.head(3)

In [None]:
manhattan_weather_10 = manhattan_weather_9.copy()

manhattan_weather_10['daily_gdd'] = (((manhattan_weather_10['air_temp_max_C'] + manhattan_weather_10['air_temp_min_C'])) / 2) - 10

print(manhattan_weather_10.shape)
# manhattan_weather_10.sample(n=3)

In [None]:
# Check for any negative daily GDD values (if any, need to be converted to 0)

manhattan_weather_10.loc[manhattan_weather_10.daily_gdd < 0]

In [None]:
# Assign negative daily gdd values to 0

manhattan_weather_11 = manhattan_weather_10.copy()

In [None]:
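# Use `.loc` with a boolean mask instead of chained indexing; this avoids
# SettingWithCopyWarning and sets every negative value
# (2016-10-12 and 2016-10-13 in this dataset) to 0

manhattan_weather_11.loc[manhattan_weather_11['daily_gdd'] < 0, 'daily_gdd'] = 0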

In [None]:
# Check to see that negative values were successfully converted to 0

manhattan_weather_11.loc[manhattan_weather_11.daily_gdd <= 0]

In [None]:
# should now return an empty df

manhattan_weather_11.loc[manhattan_weather_11.daily_gdd < 0]

In [None]:
# Add cumulative GDD, round to nearest integer

manhattan_weather_12 = manhattan_weather_11.copy()

manhattan_weather_12['gdd'] = np.rint(np.cumsum(manhattan_weather_12['daily_gdd']))
print(manhattan_weather_12.shape)
# manhattan_weather_12.tail()

Drop `daily_gdd`

In [None]:
manhattan_weather_13 = manhattan_weather_12.drop(labels=['daily_gdd'], axis=1)
print(manhattan_weather_13.shape)
# manhattan_weather_13.head()

#### Write Manhattan Weather Data to `.csv`

In [None]:
manhattan_weather_13.to_csv('data/processed/ksu_weather_2016_daily.csv')

#### Add Day of Year & GDD to Days to Flowering DataFrame
- slice trait data to only include `flowering_time`
- merge DataFrames on `date_of_flowering`
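
A left merge keeps every flowering row and looks up the weather values for its date. The sketch below uses made-up values and hypothetical cultivar names, and it merges on a regular `date` column rather than on the index.

```python
import pandas as pd

# Made-up flowering dates for two cultivars (hypothetical names)
toy_flowering = pd.DataFrame({
    'cultivar': ['cultivar_a', 'cultivar_b'],
    'date_of_flowering': pd.to_datetime(['2016-08-15', '2016-08-20']),
})

# Made-up weather lookup: one row per date
toy_gdd = pd.DataFrame({
    'date': pd.to_datetime(['2016-08-15', '2016-08-16', '2016-08-20']),
    'day_of_year': [228, 229, 233],
    'gdd': [900.0, 915.0, 980.0],
})

# how='left' keeps all flowering rows, even if a date has no weather record
toy_merged = toy_flowering.merge(toy_gdd, how='left',
                                 left_on='date_of_flowering', right_on='date')

print(toy_merged.shape)  # (2, 5)
```

Both key columns must have the same datetime dtype, which is why the flowering dates are converted with `pd.to_datetime` before merging.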

In [None]:
ksu_5.trait.unique()

In [None]:
flowering_df_0 = ksu_5.loc[ksu_5.trait == 'flowering_time']
print(flowering_df_0.shape)
# flowering_df_0.head(3)

In [None]:
flowering_df_0.cultivar.nunique()

#### Add `date_of_planting`
- 2016-06-17

In [None]:
day_of_planting = datetime.date(2016,6,17)
flowering_df_1 = flowering_df_0.copy()

flowering_df_1['date_of_planting'] = day_of_planting
print(flowering_df_1.shape)
# flowering_df_1.head(3)

#### Create timedelta using `flowering_time` values
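
As an alternative to building the dates one at a time, the same calculation can be vectorized with `pd.to_timedelta`. This is a sketch with made-up days-to-flowering values, not the notebook's data.

```python
import pandas as pd

planting = pd.Timestamp('2016-06-17')

# Made-up days-to-flowering values
toy_days = pd.Series([60.0, 65.0, 72.0])

# Adding a timedelta Series to a Timestamp gives a datetime Series
toy_flowering_dates = planting + pd.to_timedelta(toy_days, unit='D')

print(toy_flowering_dates.dt.strftime('%Y-%m-%d').tolist())
# ['2016-08-16', '2016-08-21', '2016-08-28']
```

The result is already a `datetime64` Series, so no later `pd.to_datetime` conversion would be needed with this approach.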

In [None]:
timedelta_values = flowering_df_1['value'].values
dates_of_flowering = []

for val in timedelta_values:

    # Cast to a built-in float: `timedelta` rejects NumPy integer types
    date_of_flowering = day_of_planting + datetime.timedelta(days=float(val))
    dates_of_flowering.append(date_of_flowering)
    
print(flowering_df_1.shape[0])
print(len(dates_of_flowering))

In [None]:
flowering_df_2 = flowering_df_1.copy()
flowering_df_2['date_of_flowering'] = dates_of_flowering
print(flowering_df_2.shape)
# flowering_df_2.sample(n = 3)

#### Add GDD and day_of_year to flowering DataFrame

In [None]:
ksu_gdd = manhattan_weather_13[['date', 'day_of_year', 'gdd']]

In [None]:
ksu_gdd.head()

In [None]:
flowering_df_3 = flowering_df_2.copy()
flowering_df_3.date_of_flowering = pd.to_datetime(flowering_df_3.date_of_flowering)
flowering_df_3.dtypes

In [None]:
flowering_df_4 = flowering_df_3.merge(ksu_gdd, how='left', left_on='date_of_flowering', right_on=ksu_gdd.index)
print(flowering_df_4.shape)
# flowering_df_4.head(3)

#### Drop all date columns except `date_of_flowering`

In [None]:
date_cols_to_drop = ['date_x', 'raw_date', 'date_of_planting', 'date_y']
flowering_df_5 = flowering_df_4.drop(labels=date_cols_to_drop, axis=1)
print(flowering_df_5.shape)
# flowering_df_5.tail(3)

#### Check for duplicates

In [None]:
flowering_df_5.duplicated().value_counts()

#### Sort flowering dataframe by `date_of_flowering`

In [None]:
flowering_df_6 = flowering_df_5.sort_values(by='date_of_flowering', ascending=True).reset_index(drop=True)
# flowering_df_6.head()

#### Write flowering dataframe to `.csv`

In [None]:
flowering_df_6.to_csv('data/processed/ksu_flowering_2020-04-06.csv', index=False)

#### Canopy Height DataFrame

In [None]:
ksu_5.trait.value_counts()

In [None]:
canopy_0 = ksu_5.loc[ksu_5.trait == 'canopy_height']
print(canopy_0.shape)
# canopy_0.head(3)

#### Drop `raw_date`

In [None]:
canopy_1 = canopy_0.drop(labels=['raw_date'], axis=1)
print(canopy_1.shape)
# canopy_1.head(3)

#### Sort by Date

In [None]:
canopy_2 = canopy_1.copy()

canopy_2['date'] = canopy_2['date'].astype('datetime64[ns]')
canopy_2.dtypes

In [None]:
canopy_3 = canopy_2.set_index(keys=['date'], drop=True)
print(canopy_3.shape)
# canopy_3.head()

In [None]:
canopy_4 = canopy_3.sort_index()
print(canopy_4.shape)
# canopy_4.head()

#### Write to `.csv`

In [None]:
canopy_4.to_csv('data/processed/ksu_canopy_heights_2020-04-07.csv')