## Tall Format Trait Table Season 6
### Columns
* Date
* Sitename
* Range
* Column
* Cultivar
* Trait
* Trait Value
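
As a rough illustration of what "tall format" means for these columns, the sketch below builds a few rows (one row per date, plot, and trait) and pivots them to the wide equivalent. The values are made up for illustration and the lowercase column names are an assumption, not the final schema.

```python
import pandas as pd

# Hypothetical rows that mimic the target columns; values are illustrative only.
tall = pd.DataFrame({
    'date': pd.to_datetime(['2018-06-01', '2018-06-01', '2018-08-01']),
    'sitename': ['MAC Field Scanner Season 6 Range 8 Column 2'] * 3,
    'range': [8, 8, 8],
    'column': [2, 2, 2],
    'cultivar': ['PI527045'] * 3,
    'trait': ['canopy_height', 'leaf_width', 'canopy_height'],
    'value': [89.0, 22.9, 250.0],
})

# Tall format: one row per (date, plot, trait) observation.
# Pivoting gives the wide equivalent, with one column per trait.
wide = tall.pivot_table(index=['date', 'sitename'], columns='trait', values='value')
print(wide)
```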

### Season Dates
* Planting: 2018-04-25
    * 700 experimental 2-row plots
* Days of Harvest: 2018-07-31 & 2018-08-01

### Notes and ToDos for this table
* Add trait id - where is this located?
* Can wait for feedback on the best way to incorporate TO Mappings
* Add GDD?
* Is there other Season 6 trait data located elsewhere? 

In [1]:
import datetime
import numpy as np
import pandas as pd

In [4]:
df_0 = pd.read_csv('data/raw/season_6_traits.csv', low_memory=False)
df_0.shape

(925563, 38)

In [5]:
df_0.columns

Index(['checked', 'result_type', 'id', 'citation_id', 'site_id',
       'treatment_id', 'sitename', 'city', 'lat', 'lon', 'scientificname',
       'commonname', 'genus', 'species_id', 'cultivar_id', 'author',
       'citation_year', 'treatment', 'date', 'time', 'raw_date', 'month',
       'year', 'dateloc', 'trait', 'trait_description', 'mean', 'units', 'n',
       'statname', 'stat', 'notes', 'access_level', 'cultivar', 'entity',
       'method_name', 'view_url', 'edit_url'],
      dtype='object')

In [17]:
cols_to_drop = ['checked', 'result_type', 'id', 'citation_id', 'site_id', 'treatment_id', 'city', 
                'scientificname', 'commonname', 'genus', 'species_id', 'cultivar_id', 'author', 'citation_year',
                'treatment', 'time', 'dateloc', 'trait_description', 'units', 'n', 'statname',
                'stat', 'notes', 'access_level', 'entity', 'method_name', 'view_url', 'edit_url']

In [18]:
df_1 = df_0.drop(labels=cols_to_drop, axis=1)
# df_1.head()

Unnamed: 0,sitename,lat,lon,date,raw_date,month,year,trait,mean,cultivar
0,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 May 30 (America/Phoenix),2018-05-30 02:00:00 -0500,5,2018,leaf_width,20.7,PI563022
1,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 May 31 (America/Phoenix),2018-05-31 02:00:00 -0500,5,2018,leaf_width,22.1,PI563022
2,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 Jun 1 (America/Phoenix),2018-06-01 02:00:00 -0500,6,2018,leaf_width,22.9,PI563022
3,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 Jun 2 (America/Phoenix),2018-06-02 02:00:00 -0500,6,2018,leaf_width,23.7,PI563022
4,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 Jun 3 (America/Phoenix),2018-06-03 02:00:00 -0500,6,2018,leaf_width,24.5,PI563022


### Season Dates

In [19]:
print(f'Earliest raw date: {df_1.raw_date.min()}')
print(f'Latest raw date: {df_1.raw_date.max()}')
print(' ')
print(f'Earliest AZ date: {df_1.date.min()}')
print(f'Latest AZ date: {df_1.date.max()}')

Earliest raw date: 2017-07-05 14:00:00 -0500
Latest raw date: 2018-08-22 14:00:00 -0500
 
Earliest AZ date: 2017 Jul 5
Latest AZ date: 2018 May 9


In [20]:
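# Those dates are not correct: planting was 2018-04-25, so a 2017 date is unexpected,
# and `date` holds strings, so min()/max() compare lexicographically, not chronologically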
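
One likely reason the AZ date range looks wrong is that the `date` column holds strings, so `min()` and `max()` compare character by character. A minimal sketch with sample strings in the same style (the values are chosen for illustration):

```python
import pandas as pd

# Date strings in the same style as the `date` column (illustrative values).
dates = pd.Series(['2018 May 9', '2018 Jun 2', '2018 Aug 1', '2017 Jul 5'])

# String comparison is lexicographic, so 'May' sorts after 'Jun' and 'Aug'.
print(dates.max())                  # 2018 May 9

# Parsing first gives a chronological comparison.
parsed = pd.to_datetime(dates, format='%Y %b %d')
print(parsed.max().date())          # 2018-08-01
```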

In [21]:
df_1.year.unique()

array([2018, 2017])

In [27]:
year_2017 = df_1.loc[df_1.year == 2017]
year_2017.shape

(368, 10)

In [28]:
# year_2017.head()

for col in year_2017.columns:
    print(f'Number of unique values for {col}: {year_2017[col].nunique()}')
    
    if year_2017[col].nunique() < 5:
        print(f'Unique values for {col}: {year_2017[col].unique()}')

Number of unique values for sitename: 185
Number of unique values for lat: 185
Number of unique values for lon: 185
Number of unique values for date: 1
Unique values for date: ['2017 Jul 5']
Number of unique values for raw_date: 1
Unique values for raw_date: ['2017-07-05 14:00:00 -0500']
Number of unique values for month: 1
Unique values for month: [7]
Number of unique values for year: 1
Unique values for year: [2017]
Number of unique values for trait: 1
Unique values for trait: ['surface_temperature']
Number of unique values for mean: 185
Number of unique values for cultivar: 121
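
The per-column check above could also be wrapped in a small helper that returns a table instead of printing. This is only a sketch: `summarize_uniques` and its `max_show` parameter are hypothetical names, and the toy frame is made up.

```python
import pandas as pd

def summarize_uniques(df, max_show=5):
    """Per-column unique counts, plus the values when there are fewer than max_show."""
    rows = []
    for col in df.columns:
        n = df[col].nunique()
        values = df[col].unique().tolist() if n < max_show else None
        rows.append({'column': col, 'n_unique': n, 'values': values})
    return pd.DataFrame(rows).set_index('column')

# Toy data for illustration only.
toy = pd.DataFrame({'year': [2017, 2017, 2017], 'mean': [1.0, 2.0, 3.0]})
summary = summarize_uniques(toy, max_show=2)
print(summary)
```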


### I. Slice the dataframe to only include 2018 dates
* The 368 `surface_temperature` rows dated 2017-07-05 predate planting (2018-04-25) and do not belong - follow up on this

In [39]:
df_2 = df_1.loc[df_1.year == 2018]
df_2.shape

(925195, 10)

In [33]:
# print(df_2.raw_date.nunique())
# print(df_2.date.nunique())

164
164


### I. Change AZ date values to iso date format and strip `America/Phoenix` from string dates
* date(s) as index?

In [40]:
new_dates = []

# Some values end with ' (America/Phoenix)' (18 characters); strip it where present
for d in df_2.date.values:
    if 'Phoenix' in d:
        new_dates.append(d[:-18])
    else:
        new_dates.append(d)

print(len(new_dates))

925195
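
The same clean-up can be sketched without an explicit loop, using pandas string methods. The sample values below mirror the two styles seen in the `date` column; the explicit `format` string is an assumption about how all values are written.

```python
import pandas as pd

# Sample values in the two styles seen in the `date` column.
raw = pd.Series(['2018 May 30 (America/Phoenix)', '2018 Jun 1'])

# Remove the literal time-zone suffix where present, then parse.
# regex=False treats the parentheses as plain characters.
cleaned = raw.str.replace(' (America/Phoenix)', '', regex=False)
parsed = pd.to_datetime(cleaned, format='%Y %b %d')
print(parsed.dt.strftime('%Y-%m-%d').tolist())   # ['2018-05-30', '2018-06-01']
```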


In [41]:
iso_format_dates = pd.to_datetime(new_dates)

In [42]:
df_3 = df_2.copy()

df_3['date_1'] = iso_format_dates
# df_3.head()

Unnamed: 0,sitename,lat,lon,date,raw_date,month,year,trait,mean,cultivar,date_1
0,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 May 30 (America/Phoenix),2018-05-30 02:00:00 -0500,5,2018,leaf_width,20.7,PI563022,2018-05-30
1,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 May 31 (America/Phoenix),2018-05-31 02:00:00 -0500,5,2018,leaf_width,22.1,PI563022,2018-05-31
2,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 Jun 1 (America/Phoenix),2018-06-01 02:00:00 -0500,6,2018,leaf_width,22.9,PI563022,2018-06-01
3,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 Jun 2 (America/Phoenix),2018-06-02 02:00:00 -0500,6,2018,leaf_width,23.7,PI563022,2018-06-02
4,MAC Field Scanner Season 6 Range 16 Column 14 W,33.075087,-111.974839,2018 Jun 3 (America/Phoenix),2018-06-03 02:00:00 -0500,6,2018,leaf_width,24.5,PI563022,2018-06-03


In [43]:
df_3.dtypes

sitename            object
lat                float64
lon                float64
date                object
raw_date            object
month                int64
year                 int64
trait               object
mean               float64
cultivar            object
date_1      datetime64[ns]
dtype: object

#### Drop other date columns

In [44]:
other_date_cols = ['date', 'raw_date', 'month', 'year']
df_4 = df_3.drop(other_date_cols, axis=1)
df_4.shape

(925195, 7)

In [46]:
df_4.trait.unique()

array(['leaf_width', 'surface_temperature', 'canopy_cover', 'leaf_length',
       'canopy_height', 'panicle_count', 'panicle_volume',
       'panicle_surface_area', 'aboveground_dry_biomass',
       'aboveground_fresh_biomass', 'stalk_diameter_fixed_height',
       'aboveground_biomass_moisture', 'leaf_angle_mean',
       'leaf_angle_alpha', 'leaf_angle_beta', 'leaf_angle_chi'],
      dtype=object)

### II. Subset traits
Needed now:
* `aboveground_dry_biomass`
* `canopy_height`

In [60]:
df_5 = df_4.loc[(df_4.trait == 'aboveground_dry_biomass') | (df_4.trait == 'canopy_height')]
df_5.shape

(48663, 7)
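
If more traits are needed later, `Series.isin` may scale better than chaining `|` conditions. A minimal sketch on a made-up frame (`toy` and `traits_needed` are illustrative names):

```python
import pandas as pd

# Toy frame with a `trait` column; rows are made up for illustration.
toy = pd.DataFrame({
    'trait': ['leaf_width', 'canopy_height', 'aboveground_dry_biomass', 'panicle_count'],
    'mean': [20.7, 190.0, 8180.0, 3.0],
})

# Keep only the rows whose trait is in the list.
traits_needed = ['aboveground_dry_biomass', 'canopy_height']
subset = toy.loc[toy['trait'].isin(traits_needed)]
print(subset['trait'].value_counts())
```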

In [61]:
df_5.date_1.nunique()

83

In [52]:
# df_5.loc[df_5.trait == 'aboveground_dry_biomass'].date_1.nunique()

2

In [58]:
# df_5.loc[df_5.trait == 'canopy_height'].date_1.nunique()

81

### III. Extract Range and Column Values

In [62]:
df_6 = df_5.copy()

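# Raw strings avoid the invalid-escape warning for \d; expand=False returns a Series
df_6['range'] = df_6['sitename'].str.extract(r"Range (\d+)", expand=False).astype(int)
df_6['column'] = df_6['sitename'].str.extract(r"Column (\d+)", expand=False).astype(int)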

# df_6.sample(n=7)

Unnamed: 0,sitename,lat,lon,trait,mean,cultivar,date_1,range,column
857460,MAC Field Scanner Season 6 Range 28 Column 7,33.075518,-111.97495,canopy_height,190.0,PI521152,2018-06-22,28,7
856357,MAC Field Scanner Season 6 Range 22 Column 4,33.075302,-111.974999,canopy_height,167.0,PI337689,2018-06-17,22,4
856292,MAC Field Scanner Season 6 Range 11 Column 9,33.074907,-111.974917,canopy_height,134.0,PI273465,2018-06-17,11,9
504926,MAC Field Scanner Season 6 Range 53 Column 7,33.076417,-111.97495,canopy_height,284.0,SP1516,2018-07-08,53,7
826462,MAC Field Scanner Season 6 Range 27 Column 3,33.075482,-111.975015,canopy_height,224.0,PI585954,2018-07-25,27,3
499049,MAC Field Scanner Season 6 Range 36 Column 7,33.075806,-111.97495,canopy_height,22.0,PI22913,2018-05-19,36,7
858014,MAC Field Scanner Season 6 Range 4 Column 15,33.074655,-111.974818,canopy_height,195.0,PI329394,2018-06-24,4,15
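
Both numbers could also be pulled out in a single pass with named capture groups, which `str.extract` turns into column names. A sketch assuming every sitename follows the `Range <n> Column <n>` pattern:

```python
import pandas as pd

sitenames = pd.Series([
    'MAC Field Scanner Season 6 Range 28 Column 7',
    'MAC Field Scanner Season 6 Range 16 Column 14 W',
])

# Named groups become column names; each group captures one run of digits.
parts = sitenames.str.extract(r'Range (?P<range>\d+) Column (?P<column>\d+)').astype(int)
print(parts)
```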


#### Check for E W subplots

In [63]:
df_6.loc[(df_6.sitename.str.endswith(' E')) | (df_6.sitename.str.endswith(' W'))]

Unnamed: 0,sitename,lat,lon,trait,mean,cultivar,date_1,range,column
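
An equivalent check can be written as one regular expression. The sketch below uses sample sitenames (two with a subplot suffix, one without) to show what the mask looks like when E/W subplots are present:

```python
import pandas as pd

sitenames = pd.Series([
    'MAC Field Scanner Season 6 Range 16 Column 14 W',
    'MAC Field Scanner Season 6 Range 16 Column 14 E',
    'MAC Field Scanner Season 6 Range 28 Column 7',
])

# True where the name ends in a space followed by E or W.
is_subplot = sitenames.str.contains(r' [EW]$')
print(is_subplot.sum())   # 2
```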


### IV. Reorder & Rename Columns
* Set date column as index

In [67]:
df_7 = df_6.rename({'date_1': 'date', 'mean': 'value'}, axis=1)
# df_7.head()

Unnamed: 0,sitename,lat,lon,trait,value,cultivar,date,range,column
20292,MAC Field Scanner Season 6 Range 7 Column 11,33.074763,-111.974884,canopy_height,88.0,PI452619,2018-05-28,7,11
20341,MAC Field Scanner Season 6 Range 8 Column 12,33.074799,-111.974868,canopy_height,86.0,PI646266,2018-05-28,8,12
20369,MAC Field Scanner Season 6 Range 8 Column 2,33.074799,-111.975031,canopy_height,89.0,PI527045,2018-05-28,8,2
20629,MAC Field Scanner Season 6 Range 8 Column 16,33.074799,-111.974802,canopy_height,99.0,SP1516,2018-05-28,8,16
21431,MAC Field Scanner Season 6 Range 8 Column 7,33.074799,-111.974949,canopy_height,83.0,PI524475,2018-05-28,8,7


In [68]:
df_8 = df_7.set_index(keys='date')
print(df_7.shape)
print(df_8.shape)

(48663, 9)
(48663, 8)


In [71]:
col_reorder = ['sitename', 'range', 'column', 'lat', 'lon', 'cultivar', 'trait', 'value']
df_9 = df_8[col_reorder]
# df_9.head()

date,sitename,range,column,lat,lon,cultivar,trait,value
2018-05-28,MAC Field Scanner Season 6 Range 7 Column 11,7,11,33.074763,-111.974884,PI452619,canopy_height,88.0
2018-05-28,MAC Field Scanner Season 6 Range 8 Column 12,8,12,33.074799,-111.974868,PI646266,canopy_height,86.0
2018-05-28,MAC Field Scanner Season 6 Range 8 Column 2,8,2,33.074799,-111.975031,PI527045,canopy_height,89.0
2018-05-28,MAC Field Scanner Season 6 Range 8 Column 16,8,16,33.074799,-111.974802,SP1516,canopy_height,99.0
2018-05-28,MAC Field Scanner Season 6 Range 8 Column 7,8,7,33.074799,-111.974949,PI524475,canopy_height,83.0


In [73]:
df_10 = df_9.sort_index()

In [74]:
# df_10.tail()

date,sitename,range,column,lat,lon,cultivar,trait,value
2018-08-01,MAC Field Scanner Season 6 Range 32 Column 2,32,2,33.075662,-111.975032,PI329632,aboveground_dry_biomass,8180.0
2018-08-01,MAC Field Scanner Season 6 Range 31 Column 12,31,12,33.075626,-111.974868,PI330184,aboveground_dry_biomass,16100.0
2018-08-01,MAC Field Scanner Season 6 Range 27 Column 11,27,11,33.075482,-111.974884,PI152727,aboveground_dry_biomass,13100.0
2018-08-01,MAC Field Scanner Season 6 Range 26 Column 12,26,12,33.075446,-111.974868,PI569458,aboveground_dry_biomass,7230.0
2018-08-01,MAC Field Scanner Season 6 Range 20 Column 15,20,15,33.075231,-111.974819,PI569422,aboveground_dry_biomass,8150.0
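
The rename, index, reorder, and sort steps in this section could be combined into one method chain. This is a sketch on a toy frame with the same column names as `df_6`; the rows are illustrative, not taken from the full table.

```python
import pandas as pd

# Toy frame with the same column names as df_6; values are illustrative.
toy = pd.DataFrame({
    'sitename': ['MAC Field Scanner Season 6 Range 8 Column 2',
                 'MAC Field Scanner Season 6 Range 7 Column 11'],
    'lat': [33.074799, 33.074763],
    'lon': [-111.975031, -111.974884],
    'trait': ['canopy_height', 'canopy_height'],
    'mean': [89.0, 88.0],
    'cultivar': ['PI527045', 'PI452619'],
    'date_1': pd.to_datetime(['2018-05-29', '2018-05-28']),
    'range': [8, 7],
    'column': [2, 11],
})

col_order = ['sitename', 'range', 'column', 'lat', 'lon', 'cultivar', 'trait', 'value']

tidy = (
    toy.rename(columns={'date_1': 'date', 'mean': 'value'})
       .set_index('date')
       .loc[:, col_order]     # select and reorder in one step
       .sort_index()
)
print(tidy)
```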


### Add TO Mappings

#### Final Steps
* Create `.csv`

In [77]:
need_to_create_csv = False

if need_to_create_csv:

    timestamp = datetime.datetime.now().replace(microsecond=0).isoformat()
    output_filename = f'tall_format_traits_season_6_{timestamp}.csv'.replace(':', '')
    df_10.to_csv(f'data/processed/{output_filename}')
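
The filename could also be built with `strftime`, which avoids having to strip colons afterwards. In this sketch, `timestamped_csv_path` is a hypothetical helper, and the underscore before the timestamp is an assumption about the intended naming.

```python
import datetime
from pathlib import Path

def timestamped_csv_path(stem, out_dir='data/processed', now=None):
    """Build '<out_dir>/<stem>_<YYYY-MM-DDTHHMMSS>.csv' (hypothetical helper)."""
    now = now or datetime.datetime.now()
    stamp = now.strftime('%Y-%m-%dT%H%M%S')
    return Path(out_dir) / f'{stem}_{stamp}.csv'

# A fixed datetime is passed here so the result is reproducible.
path = timestamped_csv_path('tall_format_traits_season_6',
                            now=datetime.datetime(2018, 8, 1, 9, 30, 0))
print(path.name)   # tall_format_traits_season_6_2018-08-01T093000.csv
```

The resulting path can be passed straight to `DataFrame.to_csv`.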