## Phenotype Table
### MAC Season 4
* Season 4 trait data queried from BETYdb in R using the TERRA-REF [tutorial](https://terraref.github.io/tutorials/accessing-trait-data-in-r.html)

#### Will also include:
* Environmental data
* Fertilizer inputs

In [None]:
# notes = delete exploratory / commented lines and cells before final push to GitHub

In [1]:
import datetime
import pandas as pd
import numpy as np

In [2]:
# %cd '/Users/ejcain/UA-AG/phenotypes/terraref-datasets'

/Users/ejcain/UA-AG/phenotypes/terraref-datasets


In [3]:
df = pd.read_csv('data/raw/mac_season_4.csv', low_memory=False)
# df.head()

### I. Explore and Drop Columns as Needed
#### A. Null values or duplicate information
* `Unnamed: 0` 
* `checked` - `0` in every row
* `result_type` - `trait` in every row
* `commonname` - `sorghum` in every row
* `genus` - `Sorghum` in every row
* `n` - no values in this column
* `statname` - no values in this column
* `month` - already included in other date columns
* `year` - already included in other date columns

#### B. Columns with only 1 or 2 non-null values that will be included in data dictionary
* `citation_id`
* `city` - `Maricopa` in every row
* `scientificname` - `Sorghum bicolor` in every row
* `dateloc` - `5` in every row (need clarification on this value)
* `author` - only 2 non-null values of `Newcomb, Maria` and `Zongyang, Li`
* `species_id` - `2588` in every row
* `citation_year` - 2 non-null values of `2016` and `2017`

#### C. URL columns that can be added back if needed
* `edit_url`
* `view_url`

#### D. Columns that may be dropped later after feedback
* `access_level` - need clarification on these values of `2` or `4`
* `treatment_id` - need clarification if this will be needed for the initial dataset
* `treatment` - need clarification if water deficit stress treatments will be needed for initial dataset
* `notes` - may be able to be added to data dictionary

#### Find Columns that can be dropped
* Check for number of unique values
* For columns with fewer than 5 unique values, check to see if these columns are needed in the dataset.
* Columns with only one unique value (city, for example) can just be included in the data dictionary and metadata.

In [None]:
for col in df.columns:
    
    print(f'{df[col].nunique()} unique value(s) in the {col} column.')

In [None]:
for col in df.columns:
    
    if df[col].nunique() < 5:
        print(f'Unique value(s) for {col} column: {df[col].unique()}')
        print(' ')
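
The same check can be done without a loop, since `DataFrame.nunique()` returns one count per column. The sketch below runs on a small made-up table (the values are illustrative, not taken from the Season 4 data) and separates all-null columns from constant ones.

```python
import pandas as pd

# Toy stand-in for the raw trait table; values are illustrative only.
toy = pd.DataFrame({
    "city": ["Maricopa", "Maricopa", "Maricopa"],
    "n": [None, None, None],
    "access_level": [2, 4, 2],
    "mean": [0.0, 1.5, 2.5],
})

# nunique() ignores missing values by default, so an all-null column reports 0.
unique_counts = toy.nunique()

empty_cols = unique_counts[unique_counts == 0].index.tolist()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
low_cardinality_cols = unique_counts[unique_counts < 5].index.tolist()

print(f"All-null columns: {empty_cols}")
print(f"Single-value columns: {constant_cols}")
print(f"Columns with fewer than 5 unique values: {low_cardinality_cols}")
```

Constant columns are candidates for the data dictionary, while all-null columns can simply be dropped.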

In [4]:
cols_to_drop = ['Unnamed: 0', 'checked', 'result_type', 'city', 'scientificname', 'commonname', 'genus', 'species_id', 
                'citation_year', 'citation_id', 'month', 'author', 'year', 'dateloc', 'n', 'statname', 'view_url', 'edit_url']

In [5]:
df_1 = df.drop(labels=cols_to_drop, axis=1)
df_1.head()

Unnamed: 0,id,site_id,treatment_id,sitename,lat,lon,cultivar_id,treatment,date,time,...,trait,trait_description,mean,units,stat,notes,access_level,cultivar,entity,method_name
0,6001958927,6000005673,6000000000.0,MAC Field Scanner Season 4 Range 11 Column 5,33.074907,-111.974982,6000000730,"BAP 2017, water-deficit stress Aug 1-14",2017 Jun 14 (America/Phoenix),[time unspecified or unknown],...,leaf_desiccation_present,Presence or absence of leaves showing desiccat...,0.0,,,,2,PI181083,,Visual assessment of leaf dessication
1,6001958928,6000005676,6000000000.0,MAC Field Scanner Season 4 Range 11 Column 6,33.074907,-111.974966,6000000231,"BAP 2017, water-deficit stress Aug 1-14",2017 Jun 14 (America/Phoenix),[time unspecified or unknown],...,leaf_desiccation_present,Presence or absence of leaves showing desiccat...,0.0,,,,2,PI564163,,Visual assessment of leaf dessication
2,6001958931,6000005685,6000000000.0,MAC Field Scanner Season 4 Range 11 Column 9,33.074907,-111.974917,6000000860,"BAP 2017, water-deficit stress Aug 1-14",2017 Jun 14 (America/Phoenix),[time unspecified or unknown],...,leaf_desiccation_present,Presence or absence of leaves showing desiccat...,0.0,,,,2,PI52606,,Visual assessment of leaf dessication
3,6001958933,6000005691,6000000000.0,MAC Field Scanner Season 4 Range 11 Column 11,33.074907,-111.974884,6000000863,"BAP 2017, water-deficit stress Aug 1-14",2017 Jun 14 (America/Phoenix),[time unspecified or unknown],...,leaf_desiccation_present,Presence or absence of leaves showing desiccat...,0.0,,,,2,PI533792,,Visual assessment of leaf dessication
4,6001958936,6000005700,6000000000.0,MAC Field Scanner Season 4 Range 11 Column 14,33.074907,-111.974835,6000000869,"BAP 2017, water-deficit stress Aug 1-14",2017 Jun 14 (America/Phoenix),[time unspecified or unknown],...,leaf_desiccation_present,Presence or absence of leaves showing desiccat...,0.0,,,,2,PI535794,,Visual assessment of leaf dessication


In [6]:
remaining_cols = len(df.columns) - len(cols_to_drop)

print(f'New dataset with dropped columns should now contain {remaining_cols} columns.')
print(f'New dataset contains {len(df_1.columns)}.')

New dataset with dropped columns should now contain 21 columns.
New dataset contains 21.


### A. Extract range and column values to determine number of unique sitenames / plots
* This will ignore any `E` or `W` subplots in the calculation
* Confirm with terra-ref docs

In [7]:
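# raw strings avoid invalid-escape warnings for \d; expand=False returns a Series
df_1['range'] = df_1['sitename'].str.extract(r"Range (\d+)", expand=False).astype(int)
df_1['column'] = df_1['sitename'].str.extract(r"Column (\d+)", expand=False).astype(int)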

In [None]:
# df_1.sample(n=10)

In [13]:
# need to confirm that there is a plot for every range / column combination
# some EW subplots may not be with the rest of the non EW subplots

num_unique_ranges = df_1.range.nunique()
num_unique_columns = df_1.column.nunique()
num_plots = num_unique_ranges * num_unique_columns

print(f'Number of unique sitenames / plots: {num_plots}')

Number of unique sitenames / plots: 848
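
Multiplying the two unique counts assumes that every range / column combination exists. One way to test that assumption, sketched here on a made-up table rather than on `df_1`, is to compare the full grid against the pairs that actually occur.

```python
import pandas as pd

# Toy data: the pair (range=3, column=2) is deliberately absent.
toy = pd.DataFrame({
    "range": [2, 2, 3, 3],
    "column": [1, 2, 1, 1],
})

# Every combination implied by the unique range and column values.
full_grid = pd.MultiIndex.from_product(
    [sorted(toy["range"].unique()), sorted(toy["column"].unique())],
    names=["range", "column"],
)

# The combinations that are actually present in the data.
observed = pd.MultiIndex.from_frame(toy[["range", "column"]].drop_duplicates())

missing = full_grid.difference(observed)

print(f"Grid size: {len(full_grid)}")
print(f"Observed combinations: {len(observed)}")
print(f"Missing combinations: {list(missing)}")
```

If `missing` is non-empty on the real data, the product of the unique counts overstates the number of plots.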


### B. Determine number of days in season from planting to harvest dates according to the Season 4 [metadata](https://terraref.ncsa.illinois.edu/bety/api/v1/managements)
* Planting date 2017-04-20
* Fourth and final harvest date 2017-09-16

#### Confirm these dates in the Season 4 trait dataset.

In [9]:
print(f'Earliest date for Season 4 trait dataset: {df.date.min()}')
print(f'Latest date for Season 4 trait dataset: {df.date.max()}')

Earliest date for Season 4 trait dataset: 2017 Apr 25 (America/Phoenix)
Latest date for Season 4 trait dataset: 2017 Sep 15 (America/Phoenix)
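
Note that `date` holds strings here, so `min()` and `max()` compare them alphabetically. Within this season the alphabetical and chronological extremes happen to fall in the same months (Apr and Sep), but that is not guaranteed in general. A minimal sketch of parsing the strings first, assuming every value follows the `YYYY Mon DD (timezone)` pattern seen in the output above:

```python
import pandas as pd

# Two illustrative values in the same format as the `date` column.
dates = pd.Series([
    "2017 May 02 (America/Phoenix)",
    "2017 Jun 14 (America/Phoenix)",
])

# String comparison ranks "May" after "Jun" because "M" sorts after "J".
string_max = dates.max()

# Drop the trailing "(timezone)" part, then parse with an explicit format.
stripped = dates.str.replace(r"\s*\(.*\)$", "", regex=True)
parsed = pd.to_datetime(stripped, format="%Y %b %d")
parsed_max = parsed.max()

print(f"Latest by string comparison: {string_max}")
print(f"Latest after parsing: {parsed_max.date()}")
```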


In [10]:
# df.loc[df.trait == 'aboveground_dry_biomass'].date.max()

'2017 Sep 15 (America/Phoenix)'

#### Calculate number of days

In [14]:
# datetime arithmetic does not include both dates in the calculation for timedelta, so 1 day is added

planting_date = datetime.date(2017, 4, 20)
harvest_date = datetime.date(2017, 9, 16)

time_delta = harvest_date - planting_date
total_season_days = time_delta.days + 1
print(f'Total number of days in season: {total_season_days}')

Total number of days in season: 150
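
The same inclusive arithmetic can map any observation date to its day number in the season, which is what the per-day columns later in this notebook need. `day_of_season` is a hypothetical helper name, not something defined elsewhere in the project.

```python
import datetime

PLANTING_DATE = datetime.date(2017, 4, 20)
HARVEST_DATE = datetime.date(2017, 9, 16)


def day_of_season(observation_date, planting_date=PLANTING_DATE):
    """Return the 1-based day number, counting the planting date as day 1."""
    return (observation_date - planting_date).days + 1


print(day_of_season(PLANTING_DATE))
print(day_of_season(datetime.date(2017, 6, 14)))
print(day_of_season(HARVEST_DATE))
```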


In [None]:
# confirm arithmetic with datetime and number of days

# test_date_1 = datetime.date(2019, 12, 1)
# test_date_2 = datetime.date(2019, 12, 3)

# delt = test_date_2 - test_date_1
# delt.days

#### Number of columns TBD

## III. Transform dataset to wide format
* Each trait, environmental factor, and treatment will be a single column
* For daily values, each day / value will represent a single column. For example:
    * Day 1 temp min
    * Day 1 temp max
    * Day 2 temp min
    * Day 2 temp max
* Each row will represent a plot
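
As a rough illustration of the target layout, the sketch below pivots a small long-format table into one row per plot with one column per variable per day. The plot names and values are made up, and `DataFrame.pivot` is only one possible approach; the column naming follows the `<variable>_day_<n>` pattern used later in this notebook.

```python
import pandas as pd

# Long format: one row per plot per day (illustrative values only).
long_df = pd.DataFrame({
    "sitename": ["Plot A", "Plot A", "Plot B", "Plot B"],
    "day": [1, 2, 1, 2],
    "temp_min": [20.1, 21.3, 19.8, 20.5],
    "temp_max": [38.2, 39.0, 37.9, 38.4],
})

# Wide format: one row per plot, columns keyed by (variable, day).
wide = long_df.pivot(index="sitename", columns="day", values=["temp_min", "temp_max"])

# Flatten the two-level column index into names like "temp_min_day_1".
wide.columns = [f"{variable}_day_{day}" for variable, day in wide.columns]

print(wide)
```

`pivot` raises an error if a plot has two rows for the same day, so duplicates would need to be aggregated first (for example with `pivot_table`).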

### A. For index, use unique sorted sitenames

In [17]:
no_e_w_sites = df_1.loc[~((df_1.sitename.str.endswith(' E')) | df_1.sitename.str.endswith(' W'))]
no_e_w_sites.shape

(362903, 23)

In [18]:
# Confirm calculation

e_w_sites = df_1.loc[(df_1.sitename.str.endswith(' E')) | df_1.sitename.str.endswith(' W')]
# e_w_sites.head()

In [None]:
e_w_sites.columns

In [None]:
# e_w_sites.method_name.unique()

In [None]:
# no_e_w_sites.method_name.unique()

In [19]:
print(f'Total number of rows in raw dataset: {df.shape[0]}')

print(f'Total number of rows with E and W subplot sitenames: {e_w_sites.shape[0]}')
print(f'Total number of rows with no E or W subplot sitenames: {no_e_w_sites.shape[0]}')

print(f'Future test or function should show this will be the same number as original number of rows: {e_w_sites.shape[0] + no_e_w_sites.shape[0]}')

Total number of rows in raw dataset: 372363
Total number of rows with E and W subplot sitenames: 9460
Total number of rows with no E or W subplot sitenames: 362903
Future test or function should show this will be the same number as original number of rows: 372363
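
The "future test or function" mentioned above could look something like the following. `assert_partition` is a hypothetical name, and the sketch runs on a four-row toy table rather than on `df_1`.

```python
import pandas as pd


def assert_partition(full, parts):
    """Check that `parts` split `full` with no rows lost or repeated."""
    total_rows = sum(len(part) for part in parts)
    assert total_rows == len(full), f"{total_rows} rows in parts, {len(full)} in full"
    combined_index = pd.concat(parts).index
    assert not combined_index.duplicated().any(), "a row appears in more than one part"


toy = pd.DataFrame({"sitename": [
    "MAC Field Scanner Season 4 Range 11 Column 5",
    "MAC Field Scanner Season 4 Range 11 Column 5 E",
    "MAC Field Scanner Season 4 Range 11 Column 5 W",
    "MAC Field Scanner Season 4 Range 11 Column 6",
]})

# Same E / W test as above, computed once and reused for both halves.
is_subplot = toy.sitename.str.endswith(" E") | toy.sitename.str.endswith(" W")
subplots = toy.loc[is_subplot]
whole_plots = toy.loc[~is_subplot]

assert_partition(toy, [subplots, whole_plots])
print(f"{len(subplots)} subplot rows, {len(whole_plots)} whole-plot rows")
```

The duplicate-index check assumes the original table has a unique index, which holds for the default integer index produced by `read_csv`.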


In [20]:
no_e_w_sites.sitename.nunique()

847

In [None]:
# could be off from earlier calculation of 848 due to range values starting at 2 and not 1
# or other issues like the EW subplots
# need clarification

In [22]:
# sitename_list must be defined: it is used as the index of the empty DataFrame below
sitename_list = no_e_w_sites.sitename.unique()
# sitename_list

In [23]:
print(f'Min range value: {df_1.range.min()}')
print(f'Max range value: {df_1.range.max()}')
print(' ')
print(f'Min column value: {df_1.column.min()}')
print(f'Max column value: {df_1.column.max()}')

Min range value: 2
Max range value: 54
 
Min column value: 1
Max column value: 16


In [24]:
df_1.columns

Index(['id', 'site_id', 'treatment_id', 'sitename', 'lat', 'lon',
       'cultivar_id', 'treatment', 'date', 'time', 'raw_date', 'trait',
       'trait_description', 'mean', 'units', 'stat', 'notes', 'access_level',
       'cultivar', 'entity', 'method_name', 'range', 'column'],
      dtype='object')

In [42]:
id_cols = []
trait_cols = []
total_season_days = 0
daily_values = []
daily_value_cols = []
all_cols = []

In [43]:
id_cols = ['id', 'site_id', 'range', 'column', 'lat', 'lon', 'time', 'cultivar_id', 
           'cultivar']

In [44]:
trait_cols = ['days_to_emergence', 'days_to_flowering', 'max_canopy_height', 'end_season_canopy_height',
             'aboveground_dry_biomass', 'cumulative_gdd']

In [45]:
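# day numbers 1 through 150 (range excludes its end value); reuses the name of the earlier day count
total_season_days = range(1, 151)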

daily_values = ['date', 'raw_date', 'temp_min', 'temp_max', 'temp_mean', 'humidity_min', 'humidity_max', 'humidity_mean', 'canopy_height',
               'soil_moisture_min', 'soil_moisture_max', 'soil_moisture_mean', 'wind_speed_min', 'wind_speed_max',
               'wind_speed_mean']

In [46]:
daily_value_cols = []

for val in daily_values:
    for i in total_season_days:
        
        day_col = val + '_day_' + str(i)
        daily_value_cols.append(day_col)

In [47]:
len(daily_value_cols)

2250

In [48]:
len(daily_values) * 150

2250

In [50]:
all_cols = id_cols + trait_cols + daily_value_cols

In [51]:
empty_df = pd.DataFrame(data=np.nan, index=sitename_list, columns=all_cols)
print(empty_df.shape)
empty_df.head()

(847, 2265)


Unnamed: 0,id,site_id,range,column,lat,lon,time,cultivar_id,cultivar,days_to_emergence,...,wind_speed_mean_day_141,wind_speed_mean_day_142,wind_speed_mean_day_143,wind_speed_mean_day_144,wind_speed_mean_day_145,wind_speed_mean_day_146,wind_speed_mean_day_147,wind_speed_mean_day_148,wind_speed_mean_day_149,wind_speed_mean_day_150
MAC Field Scanner Season 4 Range 11 Column 5,,,,,,,,,,,...,,,,,,,,,,
MAC Field Scanner Season 4 Range 11 Column 6,,,,,,,,,,,...,,,,,,,,,,
MAC Field Scanner Season 4 Range 11 Column 9,,,,,,,,,,,...,,,,,,,,,,
MAC Field Scanner Season 4 Range 11 Column 11,,,,,,,,,,,...,,,,,,,,,,
MAC Field Scanner Season 4 Range 11 Column 14,,,,,,,,,,,...,,,,,,,,,,


In [52]:
empty_df.to_csv('data/interim/empty_df_2019-12-09.csv')
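
Because the sitenames are stored as an unnamed index, `to_csv` writes them as a first column with an empty header. A small round-trip sketch, using an in-memory buffer and a two-row stand-in instead of the real file, shows how `index_col=0` restores the index on reload and avoids an `Unnamed: 0` column.

```python
import io

import numpy as np
import pandas as pd

# Two-row stand-in for the empty wide table.
frame = pd.DataFrame(np.nan, index=["Plot A", "Plot B"], columns=["id", "temp_min_day_1"])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)

# Without index_col, the saved index comes back as a column named "Unnamed: 0".
buffer.seek(0)
naive = pd.read_csv(buffer)

# With index_col=0, the first column is restored as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.index))
```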