### Clemson Sorghum Experiments Data Cleaning Notebook
#### Data from Clemson University Pee Dee Research and Education Center 2014
- goal: gather additional cultivar data to supplement the MAC Sorghum Seasons 4 & 6 and KSU experiments
- please contact Emily Cain at ejcain@arizona.edu with any questions or feedback

In [None]:
import datetime
import numpy as np
import pandas as pd

#### Read in data queried from BETYdb in `R` with the following code:
```
library(traits)

options(betydb_url = "https://terraref.ncsa.illinois.edu/bety/",
        betydb_api_version = 'v1',
        betydb_key = 'secret_api_key_123456_abcde')
        
clemson <- betydb_query(experiment  = "~Clemson",
                         limit     =  "none")
                      
write.csv(clemson, file = "clemson_data_2020-06-01.csv")
```

In [None]:
df_0 = pd.read_csv('data/clemson_data_2020-06-01.csv')
print(df_0.shape)
# df_0.head(3)

In [None]:
# print(df_0.raw_date.min())
# print(df_0.raw_date.max())

#### Slice for selected traits
- plant height
- days and growing degree days (GDD) to flowering
- aboveground dry biomass
- may use other traits as needed for future models
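
Before slicing, it can help to confirm that every trait of interest actually appears in the export. The sketch below assumes a `trait` column as used in this notebook; the toy rows and the helper name `missing_traits` are hypothetical and only illustrate the check.

```python
import pandas as pd

# Hypothetical toy rows standing in for the BETYdb export
toy = pd.DataFrame({
    "trait": ["plant_height", "flowering_time", "canopy_cover", "plant_height"],
    "mean": [120.0, 65.0, 0.8, 131.0],
})

# Trait names as used in this notebook's filter cell
wanted = ["flowering_time", "plant_height", "aboveground_dry_biomass"]


def missing_traits(df, names):
    """Return the requested trait names that never occur in df['trait']."""
    present = set(df["trait"].unique())
    return [name for name in names if name not in present]


absent = missing_traits(toy, wanted)
print(absent)
```

An empty list means every requested trait has at least one row; any name it returns would be silently absent from the slice.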

In [None]:
df_0.trait.unique()

In [None]:
# Keep only rows for the traits of interest
selected_traits = ['flowering_time', 'plant_height', 'aboveground_dry_biomass']
df_1 = df_0.loc[df_0.trait.isin(selected_traits)]
print(df_1.shape)
# df_1.tail()

#### Drop & Rename Columns
- rename `mean` to `value`
- convert `raw_date` to new datetime object
- new datetime object will be in `date` column
- drop `raw_date` column
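
The steps listed above can also be written as one chained expression. This is a minimal sketch on made-up rows, not the notebook's actual data; it assumes `raw_date` holds ISO-style strings that `pd.to_datetime` can parse without a format hint.

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook, values are made up
toy = pd.DataFrame({
    "trait": ["plant_height", "flowering_time"],
    "mean": [120.0, 65.0],
    "raw_date": ["2014-08-01T00:00:00", "2014-07-15T00:00:00"],
})

tidy = (
    toy
    .rename(columns={"mean": "value"})                      # mean -> value
    .assign(date=lambda d: pd.to_datetime(d["raw_date"]))   # parsed copy in `date`
    .drop(columns=["raw_date"])                             # remove the string column
)

print(tidy.dtypes)
```

Chaining leaves the input frame untouched, which matches the notebook's habit of creating a new `df_N` at each step.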

In [None]:
# df_1.columns

In [None]:
# Most columns with only one unique value can be dropped;
# list every column with fewer than 5 unique values to find them

# for col in df_1.columns:
    
#     if df_1[col].nunique() < 5:
#         print(f'Unique values for {col}: {df_1[col].unique()}')
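
A stricter variant of the check above returns only the columns that hold a single value. The helper name `constant_columns` and the toy rows are hypothetical; the sketch relies on `DataFrame.nunique`, which counts distinct non-null values per column.

```python
import pandas as pd

# Hypothetical rows; only the column names echo the notebook
toy = pd.DataFrame({
    "genus": ["Sorghum", "Sorghum", "Sorghum"],
    "checked": [0, 0, 0],
    "cultivar": ["PI1", "PI2", "PI3"],
    "mean": [120.0, 131.0, 120.0],
})


def constant_columns(df):
    """Return names of columns holding at most one distinct non-null value."""
    counts = df.nunique()  # distinct non-null values per column
    return counts[counts <= 1].index.tolist()


print(constant_columns(toy))
```

The result could seed a drop list, though columns such as identifiers may still need to be reviewed by hand.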

In [None]:
cols_to_drop = ['Unnamed: 0', 'checked', 'result_type', 'id', 'citation_id', 'site_id', 'treatment_id', 
                'commonname', 'genus', 'species_id', 'cultivar_id', 'month', 'year', 'dateloc', 'n', 'statname', 
                'stat', 'notes', 'access_level', 'entity', 'view_url', 'edit_url', 'date', 'time', 'method_name']

df_2 = df_1.drop(labels=cols_to_drop, axis=1)
print(df_2.shape)
# df_2.head()

#### Convert `raw_date` to datetime object

In [None]:
# df_2.dtypes

In [None]:
new_dates = pd.to_datetime(df_2.raw_date)

df_3 = df_2.copy()
df_3['date'] = new_dates

print(df_2.shape)
print(df_3.shape)

# df_3.head(3)

In [None]:
# df_3.dtypes

In [None]:
df_4 = df_3.rename(columns={'mean': 'value'})
print(df_4.shape)
# df_4.tail(3)

In [None]:
df_5 = df_4.drop(labels=['raw_date'], axis=1)
print(df_5.shape)
# df_5.head()

#### Write to `.csv`
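
The output file name in this section embeds an ISO timestamp with the colons removed, since colons are not allowed in file names on Windows. The sketch below isolates that logic in a hypothetical helper, `timestamped_name`, and uses a fixed datetime so the result is reproducible.

```python
import datetime


def timestamped_name(prefix, now):
    """Build '<prefix>_<ISO timestamp>.csv' with colons stripped."""
    stamp = now.replace(microsecond=0).isoformat()
    return f"{prefix}_{stamp}.csv".replace(":", "")


# Fixed datetime for illustration; the notebook uses datetime.datetime.now()
fixed = datetime.datetime(2020, 6, 1, 14, 30, 5, 123456)
name = timestamped_name("clemson_2014_tall_traits", fixed)
print(name)  # clemson_2014_tall_traits_2020-06-01T143005.csv
```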

In [None]:
timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
output_filename = f'data/processed/clemson_2014_tall_traits_{timestamp}.csv'.replace(':', '')

df_5.to_csv(output_filename, index=False)