#### Season 7 (2018) quick exploration

In [2]:
import pandas as pd
import altair as alt

In [3]:
# we will use this function later in the notebook
def returnUniqueCounts(dframe):
    return pd.DataFrame.from_records([(col, dframe[col].nunique()) for col in dframe.columns],
                          columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])

Read in the CSV file that was created using an R procedure.  TERRA-REF documents how to query its data through an R interface, so I used that process to generate the CSV.  The CSV is read in below and processing continues in Python with pandas.  The file is in R "long" format: each measurement is in its own row, with the measurement name under the heading "trait" and its value in the corresponding "mean" column.

In [4]:
s_df = pd.read_csv('/Users/curtislisle/Dropbox/ipython-notebooks/D3M/TERRA/terraref_r/season7-2018-date.csv')

In [5]:
s_df.head()['sitename']

0    MAC Field Scanner Season 7 Range 18 Column 13 E
1     MAC Field Scanner Season 7 Range 19 Column 3 E
2    MAC Field Scanner Season 7 Range 19 Column 14 W
3    MAC Field Scanner Season 7 Range 19 Column 14 W
4    MAC Field Scanner Season 7 Range 19 Column 15 W
Name: sitename, dtype: object

It first seemed that there were no East or West measurements, but the sitename does carry a trailing E or W.  Select the rows with an E in the sitename:

In [6]:
eastern = s_df.loc[s_df.sitename.str.contains('E')]
print(eastern.shape)
eastern.head()['sitename']


(7599, 40)


0     MAC Field Scanner Season 7 Range 18 Column 13 E
1      MAC Field Scanner Season 7 Range 19 Column 3 E
6     MAC Field Scanner Season 7 Range 26 Column 13 E
8     MAC Field Scanner Season 7 Range 32 Column 12 E
11    MAC Field Scanner Season 7 Range 35 Column 13 E
Name: sitename, dtype: object
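The sitename strings shown above appear to follow a fixed pattern (season, range, column, then an optional E or W). A minimal sketch of parsing them with a regular expression follows. The `parse_sitename` helper and the pattern are assumptions inferred from the sample rows in this notebook, not something TERRA-REF publishes, and other seasons may use a different layout.

```python
import re

# Pattern inferred from the sample sitenames in this notebook (an assumption).
SITENAME_RE = re.compile(r'Season (\d+) Range (\d+) Column (\d+)(?: ([EW]))?$')

def parse_sitename(sitename):
    """Split a sitename into season, range, column, and an optional E/W side."""
    match = SITENAME_RE.search(sitename)
    if match is None:
        raise ValueError(f'unexpected sitename format: {sitename!r}')
    season, rng, column, side = match.groups()
    return {'season': int(season), 'range': int(rng),
            'column': int(column), 'side': side}

parsed = parse_sitename('MAC Field Scanner Season 7 Range 18 Column 13 E')
print(parsed)
```

Matching on the trailing letter is a little stricter than `str.contains('E')`, which would also match an uppercase E anywhere else in the name.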

In [7]:
s_df[['trans_date','sitename','trait','mean']].head(8)

Unnamed: 0,trans_date,sitename,trait,mean
0,2018-09-07 00:00:00,MAC Field Scanner Season 7 Range 18 Column 13 E,plant_basal_tiller_number,0.0
1,2018-09-07 00:00:00,MAC Field Scanner Season 7 Range 19 Column 3 E,plant_basal_tiller_number,0.0
2,2018-09-07 00:00:00,MAC Field Scanner Season 7 Range 19 Column 14 W,plant_basal_tiller_number,0.0
3,2018-09-07 00:00:00,MAC Field Scanner Season 7 Range 19 Column 14 W,plant_basal_tiller_number,0.0
4,2018-09-07 00:00:00,MAC Field Scanner Season 7 Range 19 Column 15 W,plant_basal_tiller_number,0.0
5,2018-09-07 00:00:00,MAC Field Scanner Season 7 Range 19 Column 15 W,plant_basal_tiller_number,0.0
6,2018-09-07 00:00:00,MAC Field Scanner Season 7 Range 26 Column 13 E,plant_basal_tiller_number,0.0
7,2018-09-07 00:00:00,MAC Field Scanner Season 7 Range 32 Column 12 W,plant_basal_tiller_number,0.0


In [8]:
selected = ['id','cultivar','cultivar_id','date','trans_date','sitename','trait','mean','units']
s_sel = s_df[selected]

If every measurement were recorded on the same schedule, a mechanical long-to-wide rollup using pandas' pivot would work.  However, some measurements started later and others ended earlier.  Some measurements are daily and some are hourly (just in August), so we would really need to split this dataset into major subsets (daily and hourly) and then try to pivot each one, or worse, convert the entries by hand.  I elected to just write a custom algorithm that gathers all the measurements together, indexed by date.
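For comparison, here is a rough sketch of what the mechanical pivot could look like on a tiny long-format table. The rows are made-up values that only reuse the column names from the CSV. Note that `pivot_table` averages duplicate (date, cultivar, trait) entries, which is not the same as what the hand-written loop in this notebook does (it keeps the last value seen).

```python
import pandas as pd

# Toy long-format rows (made-up values, same column names as the CSV)
long_df = pd.DataFrame({
    'trans_date': ['2018-09-07', '2018-09-07', '2018-09-07', '2018-09-13'],
    'cultivar_id': [1, 1, 2, 1],
    'trait': ['canopy_height', 'leaf_length', 'canopy_height', 'canopy_height'],
    'mean': [10.0, 3.0, 12.0, 15.0],
})

# One row per (date, cultivar), one column per trait; missing traits become NaN
wide_df = long_df.pivot_table(index=['trans_date', 'cultivar_id'],
                              columns='trait', values='mean',
                              aggfunc='mean').reset_index()
print(wide_df)
```

Traits that were not measured for a given date and cultivar simply show up as NaN, which is the same sparsity the hand-built table ends up with.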

Write a routine that pivots (rolls up) the data by hand by creating a dictionary keyed by trans_date, so that measurements can be added one at a time.  This takes a few minutes to run on a circa-2020 CPU.  A progress line is printed every 5k entries processed.  As of when this was released, there were xxxk entries total.

In [9]:
s_hand = {}
count = 0
for i in range(len(s_sel)):
    # pull out the fields for this row once instead of re-indexing the dataframe on every use
    trans_date = s_sel['trans_date'][i]
    cultivar_id = s_sel['cultivar_id'][i]
    site_parts = s_sel['sitename'][i].split(' ')

    # if we have never seen this date before, start a new dictionary at this date
    if trans_date not in s_hand:
        s_hand[trans_date] = {}

    # if we have not seen this cultivar before on this date, then add a dictionary for this cultivar.
    # Note: rows that share a date, cultivar, and trait overwrite one another below, so only the
    # last value survives and records can be lost here.
    if cultivar_id not in s_hand[trans_date]:
        s_hand[trans_date][cultivar_id] = {}
    record = s_hand[trans_date][cultivar_id]

    # add this feature to the dictionary for the correct cultivar on this date.  We add a dictionary entry named
    # from the contents in the 'trait' attribute and pull the value from the 'mean' attribute.  This is the heart
    # of the long to wide format conversion.
    record[s_sel['trait'][i]] = s_sel['mean'][i]

    # add the cultivar and the location (split out from the sitename text).  This is repeated for every row,
    # so it is redundant processing, but it tags the measurements with their cultivar and location
    record['cultivar_id'] = cultivar_id
    record['cultivar'] = s_sel['cultivar'][i]
    record['season'] = int(site_parts[4])
    record['range'] = int(site_parts[6])
    record['column'] = int(site_parts[8])
    count += 1
    if (count % 5000) == 0:
        print('in process:',count, 'records ingested so far')
print('entered ',count, 'measurements')


in process: 5000 records ingested so far
in process: 10000 records ingested so far
in process: 15000 records ingested so far
in process: 20000 records ingested so far
in process: 25000 records ingested so far
entered  26598 measurements
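Because the roll-up stores one value per date, cultivar, and trait, any rows that share all three keys overwrite each other. A small sketch for counting such collisions before relying on the rolled-up values is below. The `find_collisions` helper and the toy rows are illustrative assumptions, not part of the original processing.

```python
from collections import Counter

def find_collisions(rows):
    """Return {(trans_date, cultivar_id, trait): n} for keys that occur more than once."""
    counts = Counter((r['trans_date'], r['cultivar_id'], r['trait']) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Toy rows (made-up values): the first two share date, cultivar, and trait
rows = [
    {'trans_date': '2018-09-07 00:00:00', 'cultivar_id': 1, 'trait': 'canopy_height', 'mean': 10.0},
    {'trans_date': '2018-09-07 00:00:00', 'cultivar_id': 1, 'trait': 'canopy_height', 'mean': 11.0},
    {'trans_date': '2018-09-07 00:00:00', 'cultivar_id': 2, 'trait': 'canopy_height', 'mean': 12.0},
]
collisions = find_collisions(rows)
print(collisions)
```

On the real data, something like `find_collisions(s_sel.to_dict('records'))` should report how many measurements the roll-up collapses.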


In [11]:
# how many different measurement days
print(len(s_hand.keys()))
s_hand.keys()

55


dict_keys(['2018-09-07 00:00:00', '2018-09-13 00:00:00', '2018-09-19 00:00:00', '2018-09-24 00:00:00', '2018-09-28 00:00:00', '2018-10-26 00:00:00', '2018-10-08 00:00:00', '2018-10-31 00:00:00', '2018-10-10 00:00:00', '2018-10-11 00:00:00', '2018-10-30 00:00:00', '2018-09-10 00:00:00', '2018-10-17 00:00:00', '2018-10-22 00:00:00', '2018-09-17 00:00:00', '2018-10-10 12:00:00', '2018-10-11 12:00:00', '2018-10-15 12:00:00', '2018-10-16 12:00:00', '2018-10-17 12:00:00', '2018-10-18 12:00:00', '2018-10-21 12:00:00', '2018-10-24 12:00:00', '2018-10-25 12:00:00', '2018-09-20 12:00:00', '2018-09-23 12:00:00', '2018-09-24 12:00:00', '2018-10-14 12:00:00', '2018-10-20 12:00:00', '2018-09-27 12:00:00', '2018-09-30 12:00:00', '2018-10-03 12:00:00', '2018-10-04 12:00:00', '2018-10-22 12:00:00', '2018-09-25 12:00:00', '2018-10-23 12:00:00', '2018-09-29 12:00:00', '2018-10-05 12:00:00', '2018-10-12 12:00:00', '2018-09-22 12:00:00', '2018-09-26 12:00:00', '2018-09-28 12:00:00', '2018-10-09 12:00:00', 

In [13]:
print(s_hand['2018-09-07 00:00:00'])
print('one cultivar:',s_hand['2018-09-07 00:00:00'][6000000413])

{6000000413: {'plant_basal_tiller_number': 0.0, 'cultivar_id': 6000000413, 'cultivar': 'SC56-14', 'season': 7, 'range': 32, 'column': 15, 'stem_elongated_internodes_number': 3.0}, 6000000264: {'plant_basal_tiller_number': 0.0, 'cultivar_id': 6000000264, 'cultivar': 'RIL_36_(SC56*TX7000)-F10', 'season': 7, 'range': 19, 'column': 3, 'stem_elongated_internodes_number': 3.0}, 6000000388: {'plant_basal_tiller_number': 0.0, 'cultivar_id': 6000000388, 'cultivar': 'RIL_173_(SC56*TX7000)-F10', 'season': 7, 'range': 34, 'column': 3, 'stem_elongated_internodes_number': 3.0}, 6000000381: {'plant_basal_tiller_number': 0.0, 'cultivar_id': 6000000381, 'cultivar': 'RIL_165_(SC56*TX7000)-F10', 'season': 7, 'range': 26, 'column': 13, 'stem_elongated_internodes_number': 3.0}, 6000000390: {'plant_basal_tiller_number': 0.0, 'cultivar_id': 6000000390, 'cultivar': 'RIL_176_(SC56*TX7000)-F10', 'season': 7, 'range': 48, 'column': 12, 'stem_elongated_internodes_number': 3.0}, 6000000329: {'plant_basal_tiller_nu

Sometimes the numeric cultivar index is used as the cultivar_id, and sometimes the character name is used instead (e.g. 'cultivar_id': 6000000962, 'cultivar': 'PI570254').  The next cell builds a cross reference between the two.

In [14]:
cultivar_xref = {}
for key in s_hand.keys():
    # look in all the records on this date and record any names we find
    for thiscultivar in s_hand[key]:
        # now we are looking at the dictionary of measurements for a single cultivar. If the _id and cultivar name don't match, we can
        # learn from this record, so make an entry in the cross reference dictionary
        if s_hand[key][thiscultivar]['cultivar'] != s_hand[key][thiscultivar]['cultivar_id']:
            cultivar_xref[s_hand[key][thiscultivar]['cultivar']] = s_hand[key][thiscultivar]['cultivar_id']
            cultivar_xref[s_hand[key][thiscultivar]['cultivar_id']] = s_hand[key][thiscultivar]['cultivar']
    

In [15]:
# let's see what our bi-directional index looks like.  There are a few odd names, but they don't show unless
# we print the whole dictionary.  Let's just sample the dictionary for reasonableness
count = 0
for key in cultivar_xref:
    print(key, cultivar_xref[key])
    count += 1
    if count>15:
        break

SC56-14 6000000413
6000000413 SC56-14
RIL_36_(SC56*TX7000)-F10 6000000264
6000000264 RIL_36_(SC56*TX7000)-F10
RIL_173_(SC56*TX7000)-F10 6000000388
6000000388 RIL_173_(SC56*TX7000)-F10
RIL_165_(SC56*TX7000)-F10 6000000381
6000000381 RIL_165_(SC56*TX7000)-F10
RIL_176_(SC56*TX7000)-F10 6000000390
6000000390 RIL_176_(SC56*TX7000)-F10
RIL_109_(SC56*TX7000)-F10 6000000329
6000000329 RIL_109_(SC56*TX7000)-F10
TX7000 6000000416
6000000416 TX7000
BTX623 6000001430
6000001430 BTX623
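The cross reference above stores names and numeric ids as keys of the same dictionary. An alternative sketch, assuming records shaped like the ones in `s_hand`, keeps one dictionary per direction so the two key types never mix. The `build_xref` helper is hypothetical, and the sample records reuse two name/id pairs from the output above.

```python
def build_xref(records):
    """Build separate name->id and id->name lookups from cultivar records."""
    name_to_id, id_to_name = {}, {}
    for rec in records:
        name_to_id[rec['cultivar']] = rec['cultivar_id']
        id_to_name[rec['cultivar_id']] = rec['cultivar']
    return name_to_id, id_to_name

# Two name/id pairs taken from the sample output in this notebook
records = [
    {'cultivar': 'SC56-14', 'cultivar_id': 6000000413},
    {'cultivar': 'TX7000', 'cultivar_id': 6000000416},
]
name_to_id, id_to_name = build_xref(records)
print(name_to_id['TX7000'], id_to_name[6000000413])
```

The single mixed dictionary is more convenient for ad hoc lookups. The split version makes it easier to count distinct cultivars, since each one appears once per dictionary.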


In [16]:
widths = []
for key in s_hand.keys():
    # how many measurements are on this datetime. accumulate in a histogram dictionary
    cultivars = s_hand[key]
    # check just the first entry.  We are assuming they are all the same width.  This is probably naive, but it will get us something
    for cultivar in cultivars:
        measurement_width = len(s_hand[key][cultivar].keys())
        widths.append({'width': measurement_width})
        break
print('we found the tuple width of',len(widths), 'different measurements')
width_df = pd.DataFrame.from_records(widths)

we found the tuple width of 55 different measurements


In [17]:
alt.Chart(width_df,title="Histogram of tuple widths").mark_bar().encode(
    alt.X("width:Q", bin=True),
    y='count()',
)

Looking at the above histogram, there are a lot of short tuples (these are probably the height and leaf information recorded automatically), but some of the records are long, containing many fields.  We have seen some of those records above already.  Next, try to look at the widths against when they were captured.  This might not be a correct rendering:

In [19]:
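# 'index' is the position of each date in s_hand (0-54), not a timestamp, so encode it as quantitative
alt.Chart(width_df.reset_index(),title="show the size of the tuples in date order").mark_line().encode(
    alt.Y("width:Q"),
    alt.X('index:Q')
)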
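The chart above plots widths against the row index of `width_df` rather than the measurement date, and the keys of `s_hand` are in insertion order, not date order. One possible way to get a true date ordering is to keep the date key with each width and sort on the parsed timestamp. The sketch below assumes the `'%Y-%m-%d %H:%M:%S'` key format seen in this notebook. It also takes the widest record per date, whereas the cell above samples only the first cultivar. The `widths_by_date` helper and the toy dictionary are illustrative.

```python
from datetime import datetime

def widths_by_date(s_hand):
    """Return (datetime, width) pairs sorted by date.

    width is the largest number of fields in any cultivar record on that date.
    """
    rows = []
    for key, cultivars in s_hand.items():
        width = max(len(rec) for rec in cultivars.values())
        rows.append((datetime.strptime(key, '%Y-%m-%d %H:%M:%S'), width))
    return sorted(rows)

# Toy stand-in for s_hand (made-up values), deliberately out of date order
toy_hand = {
    '2018-10-26 00:00:00': {1: {'a': 1.0, 'b': 2.0, 'c': 3.0}},
    '2018-09-07 00:00:00': {1: {'a': 1.0}, 2: {'a': 1.0, 'b': 2.0}},
}
ordered = widths_by_date(toy_hand)
print(ordered)
```

The resulting pairs could then be loaded into a DataFrame with a real date column and charted with a temporal x encoding.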

In [20]:
firstList = []
dateList = []
for key in s_hand.keys():
    #print('date: ',key)
    cultivar_keys = s_hand[key].keys()
    #print('cultivar-keys: ',cultivar_keys)
    for k in cultivar_keys:
        record = s_hand[key][k]
        record['cultivar_id'] = k
        # look up the textual name of the cultivar so we can match against the tree
        record['cultivar'] = cultivar_xref[k]
        record['date'] = key    
        firstList.append(record)
        dateList.append(key)
        #break
print(len(firstList))

1147


In [21]:
firstList[300]

{'aboveground_dry_biomass': 2742.0,
 'cultivar_id': 6000000379,
 'cultivar': 'RIL_163_(SC56*TX7000)-F10',
 'season': 7,
 'range': 36,
 'column': 14,
 'aboveground_biomass_moisture': 69.3,
 'aboveground_fresh_biomass': 8931.0,
 'date': '2018-10-31 00:00:00'}

In [22]:
import pandas as pd
full_df = pd.DataFrame(firstList,index=dateList)
full_df.head()

Unnamed: 0,NBI_nitrogen_balance_index,aboveground_biomass_moisture,aboveground_dry_biomass,aboveground_fresh_biomass,adf,canopy_height,chlorophyll_index,column,crude_protein,cultivar,...,plant_basal_tiller_number,range,relative_feed_quality,relative_forage_quality,season,stalk_diameter,stem dry weight per plant,stem fresh weight per plant,stem_elongated_internodes_number,tdn
2018-09-07 00:00:00,,,,,,,,15,,SC56-14,...,0.0,32,,,7,,,,3.0,
2018-09-07 00:00:00,,,,,,,,3,,RIL_36_(SC56*TX7000)-F10,...,0.0,19,,,7,,,,3.0,
2018-09-07 00:00:00,,,,,,,,3,,RIL_173_(SC56*TX7000)-F10,...,0.0,34,,,7,,,,3.0,
2018-09-07 00:00:00,,,,,,,,13,,RIL_165_(SC56*TX7000)-F10,...,0.0,26,,,7,,,,3.0,
2018-09-07 00:00:00,,,,,,,,12,,RIL_176_(SC56*TX7000)-F10,...,0.0,48,,,7,,,,3.0,


In [23]:
returnUniqueCounts(full_df)

Unnamed: 0,Column_Name,Num_Unique
42,season,1
38,plant_basal_tiller_number,1
31,maturity_stage_at_harvest,4
43,stalk_diameter,4
15,flowering_time,4
23,leaf_fat,6
30,leaf_width,6
27,leaf_phosphorus,7
44,stem dry weight per plant,8
24,leaf_length,8


In [24]:
full_df.to_csv('s7_height_biomass.csv',index=False)
