Join V2 data from the TERRA program, received directly from Roman

In [2]:
import pandas as pd

In [3]:
heights = pd.read_csv("/Volumes/Curt-MacPro-Backup/D3M/terra/Raw-data/V2-Roman-1108/s4.csv")
genids = pd.read_csv("/Volumes/Curt-MacPro-Backup/D3M/terra/Raw-data/V2-Roman-1108/s4_genotypes.csv")
gennames = pd.read_csv("/Volumes/Curt-MacPro-Backup/D3M/terra/Raw-data/V2-Roman-1108/genotype_names.csv")

In [4]:
heights.head()

Unnamed: 0,day_number,range,column,sensor,height(cm)
0,118,3,2,1,5.89
1,118,3,2,2,6.39
2,118,3,3,1,5.84
3,118,3,3,2,6.44
4,118,3,4,1,5.65


In [5]:
def returnUniqueCounts(dframe):
    return pd.DataFrame.from_records([(col, dframe[col].nunique()) for col in dframe.columns],
                          columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])

In [6]:
returnUniqueCounts(heights)

Unnamed: 0,Column_Name,Num_Unique
3,sensor,2
2,column,14
1,range,50
0,day_number,71
4,height(cm),29344


In [7]:
returnUniqueCounts(genids)

Unnamed: 0,Column_Name,Num_Unique
1,column,14
0,range,50
2,genotype_id,349


In [8]:
returnUniqueCounts(gennames)

Unnamed: 0,Column_Name,Num_Unique
0,genotype_id,350
1,genotype_string,350


Now do a join to first add the genotype ID to each height measurement. This is an inner join, so the result keeps only rows whose (range, column) pair appears in both dataframes. We don't want to append rows to the left dataframe, only add a column to it. Matching on range and column adds the genotype_id (cultivar ID) to the height data.

Info on joins in Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
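
A minimal sketch of the same kind of join on toy data, assuming each (range, column) plot appears only once in the genotype table. The `validate` argument asks pandas to raise an error if that assumption fails, and comparing row counts shows whether the inner join dropped any height rows. The toy frames and their values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the heights and genids frames (values are invented).
toy_heights = pd.DataFrame({
    "day_number": [118, 118, 118, 118],
    "range": [3, 3, 3, 4],
    "column": [2, 2, 3, 9],
    "sensor": [1, 2, 1, 1],
    "height(cm)": [5.89, 6.39, 5.84, 5.10],
})
toy_genids = pd.DataFrame({
    "range": [3, 3],
    "column": [2, 3],
    "genotype_id": [1, 2],
})

# many_to_one: many height rows per plot, exactly one genotype row per plot.
toy_join = pd.merge(
    left=toy_heights,
    right=toy_genids,
    on=["range", "column"],
    how="inner",
    validate="many_to_one",
)

# An inner join silently drops height rows whose plot has no genotype entry.
dropped = len(toy_heights) - len(toy_join)
print(f"rows kept: {len(toy_join)}, rows dropped: {dropped}")
```

Here the plot at range 4, column 9 has no genotype entry, so its height row does not survive the join.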

In [9]:
join1 = pd.merge(left=heights,right=genids,on=['range','column'],how='inner')

In [10]:
returnUniqueCounts(join1)

Unnamed: 0,Column_Name,Num_Unique
3,sensor,2
2,column,14
1,range,50
0,day_number,71
5,genotype_id,349
4,height(cm),29344


In [11]:
join1.tail()

Unnamed: 0,day_number,range,column,sensor,height(cm),genotype_id
77328,237,52,15,1,324.91,273
77329,237,52,15,2,307.6,273
77330,238,52,15,1,338.76,273
77331,238,52,15,2,304.74,273
77332,241,52,15,2,317.17,273


Now we add the genotype (cultivar) name by joining on the genotype_id column.

In [12]:
join2 = pd.merge(left=join1,right=gennames,on='genotype_id',how='inner')
returnUniqueCounts(join2)

Unnamed: 0,Column_Name,Num_Unique
3,sensor,2
2,column,14
1,range,50
0,day_number,71
5,genotype_id,349
6,genotype_string,349
4,height(cm),29344
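
The name table lists 350 genotypes while the joined data contains 349, so one name has no matching plot. A hedged sketch of how such leftovers could be listed, using a left join with `indicator=True` on toy data (the IDs and names below are placeholders, not real cultivars):

```python
import pandas as pd

# Toy stand-ins (invented values): three named genotypes, two planted.
toy_gennames = pd.DataFrame({
    "genotype_id": [1, 2, 3],
    "genotype_string": ["PI000001", "PI000002", "PI000003"],
})
toy_planted = pd.DataFrame({"genotype_id": [1, 1, 2]})

# Left-join the names against the distinct planted IDs and keep the
# rows that found no partner ("left_only").
flagged = toy_gennames.merge(
    toy_planted.drop_duplicates(),
    on="genotype_id",
    how="left",
    indicator=True,
)
unused = flagged.loc[flagged["_merge"] == "left_only", "genotype_string"]
print(unused.tolist())
```

On the real data the same pattern would take `gennames` on the left and `join1[['genotype_id']]` on the right.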


In [13]:
join2.tail()

Unnamed: 0,day_number,range,column,sensor,height(cm),genotype_id,genotype_string
77328,237,38,9,2,305.23,349,PI156463
77329,238,38,9,1,303.24,349,PI156463
77330,238,38,9,2,306.01,349,PI156463
77331,241,38,9,1,304.59,349,PI156463
77332,241,38,9,2,306.38,349,PI156463


In [14]:
join2.head()

Unnamed: 0,day_number,range,column,sensor,height(cm),genotype_id,genotype_string
0,118,3,2,1,5.89,1,PI329465
1,118,3,2,2,6.39,1,PI329465
2,121,3,2,1,6.15,1,PI329465
3,121,3,2,2,6.85,1,PI329465
4,123,3,2,1,5.98,1,PI329465


Now let's see if we can add the leaf values we received with the first dataset by joining on range, column, and day. First, we will have to offset the day in the leaf dataset, because the V2 dataset starts at day=118 while the first dataset starts at day=12. Let's review one of the first dataset's leaf features:

In [23]:
import pyreadr
result = pyreadr.read_r('/Volumes/Curt-MacPro-Backup/D3M/terra/processing/terra-explore/output/s4_data.rds')

# an RDS file holds a single unnamed object, so pyreadr stores it under the key None
print(result.keys()) # let's check what objects we got
ryan_s4_df_raw = result[None] # extract the pandas data frame
ryan_s4_df_raw.head()


odict_keys([None])


Unnamed: 0,cultivar,range,column,day,canopy_height,canopy_height_n,leaf_angle_alpha,leaf_angle_alpha_n,leaf_angle_beta,leaf_angle_beta_n,leaf_angle_chi,leaf_angle_chi_n,leaf_angle_mean,leaf_angle_mean_n
0,PI145619,27,11,12.0,10.0,3,,,,,,,,
1,PI145619,27,11,13.0,10.0,2,,,,,,,,
2,PI145619,27,11,14.0,10.0,2,,,,,,,,
3,PI145619,27,11,15.0,10.0,2,,,,,,,,
4,PI145619,27,11,16.0,10.0,2,,,,,,,,


In [26]:
# drop NAs
ryan_clean = ryan_s4_df_raw.dropna(how="any")
ryan_clean.head()

Unnamed: 0,cultivar,range,column,day,canopy_height,canopy_height_n,leaf_angle_alpha,leaf_angle_alpha_n,leaf_angle_beta,leaf_angle_beta_n,leaf_angle_chi,leaf_angle_chi_n,leaf_angle_mean,leaf_angle_mean_n
33,PI145619,27,11,77.0,289.0,1,3.501112,1,2.250611,1,1.822136,1,0.429758,1
34,PI145619,27,11,78.0,288.0,1,3.936643,1,2.386776,1,1.892378,1,0.409214,1
35,PI145619,27,11,80.0,285.0,1,2.948751,1,1.930459,1,1.870285,1,0.42065,1
36,PI145619,27,11,83.0,280.0,1,2.966654,1,2.013383,1,1.807524,1,0.432626,1
37,PI145619,27,11,84.0,291.0,1,3.135967,2,1.992641,2,1.892911,2,0.41379,2


In [38]:
ryan_clean.shape

(8355, 14)

Now let's look for an empirical match: select rows by range, column, and cultivar, and print out the height values to compare.

In [33]:
# select only rows matching feature values: df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
# & (join2['genotype_string'] =='PI145619')
join2.loc[(join2['range'] == 27) & (join2['column'] == 11) ].head()


Unnamed: 0,day_number,range,column,sensor,height(cm),genotype_id,genotype_string
76154,118,27,11,1,5.5,345,PI641810
76155,118,27,11,2,6.38,345,PI641810
76156,120,27,11,1,5.51,345,PI641810
76157,120,27,11,2,6.72,345,PI641810
76158,121,27,11,1,5.5,345,PI641810


In [36]:
join2.loc[(join2['genotype_string'] == 'PI145619') ].head()

Unnamed: 0,day_number,range,column,sensor,height(cm),genotype_id,genotype_string
74962,118,27,6,1,4.78,340,PI145619
74963,118,27,6,2,5.33,340,PI145619
74964,120,27,6,1,4.9,340,PI145619
74965,120,27,6,2,5.62,340,PI145619
74966,121,27,6,1,4.87,340,PI145619


Oh! That is weird. The same cultivar does not sit at the same range and column in the two versions of the dataset. Let's see if we can find the day offset by matching on cultivar name instead.

In [37]:
join2.loc[(join2['genotype_string'] == 'PI145619') ].describe()

Unnamed: 0,day_number,range,column,sensor,height(cm),genotype_id
count,242.0,242.0,242.0,242.0,242.0,242.0
mean,172.53719,30.024793,7.008264,1.5,162.83657,340.0
std,36.509998,3.006115,1.002038,0.501036,122.2134,0.0
min,118.0,27.0,6.0,1.0,4.78,340.0
25%,139.0,27.0,6.0,1.0,15.075,340.0
50%,171.0,33.0,8.0,1.5,181.12,340.0
75%,192.0,33.0,8.0,2.0,273.25,340.0
max,242.0,33.0,8.0,2.0,334.79,340.0
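
The plot-versus-cultivar mismatch noted above was found by inspecting a single plot. A rough sketch of how it could be checked for every plot at once, on toy layouts (the positions follow the example above, but the `PI00000x` names are placeholders):

```python
import pandas as pd

# Toy stand-ins (partly invented values) for the plot layout in each dataset.
v1_layout = pd.DataFrame({
    "range": [27, 27, 33],
    "column": [11, 6, 8],
    "cultivar": ["PI145619", "PI000001", "PI000002"],
})
v2_layout = pd.DataFrame({
    "range": [27, 27, 33],
    "column": [11, 6, 8],
    "genotype_string": ["PI641810", "PI145619", "PI000002"],
})

# One row per plot in each layout, then line the two layouts up by position.
both = pd.merge(
    v1_layout.drop_duplicates(),
    v2_layout.drop_duplicates(),
    on=["range", "column"],
    how="inner",
)
both["same_cultivar"] = both["cultivar"] == both["genotype_string"]
print(both)
print("plots that agree:", int(both["same_cultivar"].sum()), "of", len(both))
```

With the real frames, the inputs would presumably be `ryan_s4_df_raw[['range', 'column', 'cultivar']]` and `join2[['range', 'column', 'genotype_string']]`, assuming the range and column values are stored with compatible types in both.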


In [15]:
leafAlpha = pd.read_csv("/Volumes/Curt-MacPro-Backup/D3M/terra/Raw-data/V1-from-d3m/raw_data/terra_tabular/s4_leaf_angle_alpha_formatted.csv")
leafAlpha.head()

Unnamed: 0,cultivar,sitename,day,leaf_angle_alpha
0,PI569423,MAC Field Scanner Season 4 Range 49 Column 13,83,3.600738
1,PI527045,MAC Field Scanner Season 4 Range 50 Column 10,83,2.641265
2,PI655972,MAC Field Scanner Season 4 Range 50 Column 11,83,2.842065
3,PI535795,MAC Field Scanner Season 4 Range 51 Column 4,83,2.060581
4,PI641825,MAC Field Scanner Season 4 Range 51 Column 9,83,1.535933


In [40]:
returnUniqueCounts(leafAlpha)

Unnamed: 0,Column_Name,Num_Unique
6,season,2
5,column,16
4,range,53
2,day,66
0,cultivar,270
1,sitename,777
3,leaf_angle_alpha,19759


In [42]:
leafAlpha.describe()

Unnamed: 0,day,leaf_angle_alpha,range,column,season
count,34961.0,34961.0,34961.0,34961.0,34961.0
mean,83.877263,2.613388,0.929521,0.289008,0.134207
std,32.165619,1.036215,5.758057,1.710396,0.720299
min,24.0,0.558077,0.0,0.0,0.0
25%,57.0,1.819175,0.0,0.0,0.0
50%,78.0,2.494087,0.0,0.0,0.0
75%,118.0,3.264155,0.0,0.0,0.0
max,133.0,9.439741,54.0,16.0,4.0


We can see from the unique counts above that the leaf dataset has more range and column values than V2 (53 and 16 versus 50 and 14) and that many of the cultivars are missing (270 versus 349). Since the "left" dataframe in the join will be our new height dataframe, let's use just column, range, and day to merge in the additional attribute. We need to adjust the day. These days run from 24 to 133, a span of 133-24 = 109 days.
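
The `range`, `column`, and `season` columns summarized above are not in the raw CSV, so they were presumably parsed from `sitename` in a cell that is not shown. A minimal sketch of one way to do that parsing with a regular expression. The first two rows copy the `leafAlpha.head()` output; the third, malformed row is invented to show how an unparsed value is handled.

```python
import pandas as pd

# Two rows copied from leafAlpha.head(), plus one invented malformed row.
toy_leaf = pd.DataFrame({
    "cultivar": ["PI569423", "PI527045", "PI000000"],
    "sitename": [
        "MAC Field Scanner Season 4 Range 49 Column 13",
        "MAC Field Scanner Season 4 Range 50 Column 10",
        "unparseable site name",
    ],
    "day": [83, 83, 83],
    "leaf_angle_alpha": [3.600738, 2.641265, 2.0],
})

# Named groups become column names; rows that do not match yield NaN.
pattern = r"Season (?P<season>\d+) Range (?P<range>\d+) Column (?P<column>\d+)"
parts = toy_leaf["sitename"].str.extract(pattern)

# Nullable integers keep an unparsed row as <NA> instead of forcing it to 0.
parts = parts.apply(pd.to_numeric).astype("Int64")
toy_leaf = pd.concat([toy_leaf, parts], axis=1)
print(toy_leaf[["cultivar", "season", "range", "column"]])
```

Keeping unparsed rows as missing matters for the join keys: a default of 0 would look like a real range or column value.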

In [43]:
join2.describe()

Unnamed: 0,day_number,range,column,sensor,height(cm),genotype_id
count,77333.0,77333.0,77333.0,77333.0,77333.0,77333.0
mean,173.843883,28.54264,8.535308,1.497278,164.589503,176.740628
std,36.432193,14.355509,3.881779,0.499996,118.326372,100.707996
min,118.0,3.0,2.0,1.0,2.58,1.0
25%,140.0,16.0,5.0,1.0,22.13,90.0
50%,172.0,29.0,9.0,1.0,181.21,178.0
75%,193.0,41.0,12.0,2.0,263.67,264.0
max,242.0,52.0,15.0,2.0,393.9,349.0


In the new V2 dataset, the day number runs from 118 to 242, a span of 242-118 = 124 days. It is not obvious how the two day scales line up. Let's look at the original canopy height dataset, because we might be able to compare the heights of some plots and find the day offset empirically.
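
For reference, a sketch of what such an empirical offset search could look like. It assumes both datasets report the height of the same plants in the same units, and it tries every candidate offset, keeping the one with the smallest mean absolute height difference on shared days. The function name `best_day_offset`, the growth curve, and the offset of 100 are all invented for this demo. None of it is real TERRA data.

```python
import numpy as np
import pandas as pd

def best_day_offset(old, new, offsets):
    """Return the offset (added to the old days) that minimizes the mean
    absolute height difference on the days the two series share."""
    scores = {}
    for off in offsets:
        shifted = old.copy()
        shifted.index = shifted.index + off
        common = shifted.index.intersection(new.index)
        if len(common) >= 5:  # require some overlap before scoring
            scores[off] = (shifted.loc[common] - new.loc[common]).abs().mean()
    return min(scores, key=scores.get), scores

# Synthetic logistic growth curve, NOT real TERRA data.
def growth(day):
    return 300.0 / (1.0 + np.exp(-(day - 170) / 12.0))

new_days = np.arange(118, 243)
new = pd.Series(growth(new_days), index=new_days)

true_offset = 100  # invented for this demo
old_days = np.arange(24, 134)
old = pd.Series(growth(old_days + true_offset), index=old_days)

offset, scores = best_day_offset(old, new, range(60, 141))
print("recovered offset:", offset)
```

On real data, each series would first have to be reduced to one height per day for a single cultivar (for example by averaging sensors and plots), and a flat stretch of the growth curve can make several offsets score almost equally.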

#### Abandoning the attempt to match up the first and second TERRA datasets. We will show Ryan's merge from the first one with all variables, and we will show plots and model fits on the V2 data (canopy height only).

Therefore, the join2 dataframe contains the new canopy height data.  Save it out:

In [40]:
join2.to_csv('/Volumes/Curt-MacPro-Backup/D3M/terra/processing/V2/s4_height.csv')

In [42]:
# let's just pick one of the sensors first to simplify.  It will be better to average the sensors, but that will take a while.
sensor1 = join2.loc[(join2['sensor'] == 1)]
print(join2.shape)
print(sensor1.shape)

(77333, 7)
(38877, 7)
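
Averaging the two sensors, as the comment above suggests, could be sketched with a grouped mean. This is a minimal example on toy rows copied from `join2.head()`. The output column names `height_mean` and `n_sensors` are made up here, and the named-aggregation syntax assumes pandas 0.25 or later.

```python
import pandas as pd

# Toy rows in the shape of join2 (values copied from join2.head() above).
toy = pd.DataFrame({
    "day_number": [118, 118, 121, 121],
    "range": [3, 3, 3, 3],
    "column": [2, 2, 2, 2],
    "sensor": [1, 2, 1, 2],
    "height(cm)": [5.89, 6.39, 6.15, 6.85],
    "genotype_id": [1, 1, 1, 1],
    "genotype_string": ["PI329465"] * 4,
})

# Group on everything that identifies a plot on a given day, then
# average the height across whichever sensors reported.
keys = ["day_number", "range", "column", "genotype_id", "genotype_string"]
averaged = (
    toy.groupby(keys, as_index=False)
       .agg(height_mean=("height(cm)", "mean"), n_sensors=("sensor", "nunique"))
)
print(averaged)
```

Keeping `n_sensors` alongside the mean makes it easy to see plot-days where only one sensor reported, which the shapes above suggest does happen (38877 sensor-1 rows out of 77333).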


In [44]:
sensor1.to_csv('/Volumes/Curt-MacPro-Backup/D3M/terra/processing/V2/s4_height_s1.csv')

Let's add a date field to the S4 height information to further facilitate time-sequence modeling. The conversion takes a long time to run, so the loop below prints its progress every 5000 rows.

In [52]:
import arrow
import itertools

count = 0
startdate = arrow.get("2019-01-01T00:29:00.655800-05:00")

def convertDayToDate(startdate,dayOffset):
    return startdate.shift(days=int(dayOffset))

sensor1['date'] = startdate
    
for i in range(len(sensor1)):
    sensor1['date'][i] = convertDayToDate(startdate,sensor1['day_number'][i])
    count += 1
    if (count % 5000) == 0:
        print(count)


sensor1.to_csv('/Volumes/Curt-MacPro-Backup/D3M/terra/processing/V2/s4_height_s1_date.csv')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]


KeyError: 1
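
The `KeyError: 1` most likely arises because `sensor1` keeps the row labels it had in `join2` (0, 2, 4, ... as `sensor1.head()` shows), so `sensor1['day_number'][1]` looks up a label that does not exist. The warnings come from assigning into a slice of `join2`. A hedged sketch of an alternative that avoids both the loop and the warnings, swapping `arrow` for pandas' own datetime types and reusing the notebook's start date:

```python
import pandas as pd

# Toy frame whose sensor-1 rows get non-contiguous labels, like sensor1.
toy_join = pd.DataFrame({
    "day_number": [118, 118, 121, 121, 123],
    "sensor": [1, 2, 1, 2, 1],
})

# .copy() makes an independent frame, so adding a column does not warn.
toy_sensor1 = toy_join.loc[toy_join["sensor"] == 1].copy()

# Start date taken from the cell above; whether day 0 really falls on
# 2019-01-01 is an assumption carried over from the notebook.
start = pd.Timestamp("2019-01-01")

# Vectorized: convert the whole column to day offsets and add them at once.
toy_sensor1["date"] = start + pd.to_timedelta(toy_sensor1["day_number"], unit="D")
print(toy_sensor1)
```

Because the arithmetic is done on whole columns, it never indexes by row label and should take well under a second even on the full 38877-row frame.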

In [46]:
arrow.now()

<Arrow [2019-11-09T00:29:00.655800-05:00]>

In [48]:
sensor1['day_number'][4]

123

In [51]:
sensor1.head()

Unnamed: 0,day_number,range,column,sensor,height(cm),genotype_id,genotype_string,date
0,118,3,2,1,5.89,1,PI329465,2019-01-01T00:29:00.655800-05:00
2,121,3,2,1,6.15,1,PI329465,2019-01-01T00:29:00.655800-05:00
4,123,3,2,1,5.98,1,PI329465,2019-01-01T00:29:00.655800-05:00
6,124,3,2,1,5.99,1,PI329465,2019-01-01T00:29:00.655800-05:00
8,125,3,2,1,6.02,1,PI329465,2019-01-01T00:29:00.655800-05:00
