## NDSU Capstone Project
#### Contact: Drew Sandberg

Objective: predict future glucose (g/L) values of stored potatoes given the following information:
- stem-end sucrose & glucose measurements 
- storage bin conditions; fan speed, temperatures, carbon dioxide levels, refrigeration levels
- planted crop metadata: planting, emergence, harvest, and vine kill dates
- weather metrics; potato days, daily solar radiation values
- crop canopy development
- crop inspection report data: dirt, soft rot, gravity, bruising
- in-season crop observations for presence of diseases

The data is currently limited to the 2021 and 2022 crop years. Environmental control system data prior to 2021 is currently unavailable.

The following short videos may provide additional context about potato inspections, processing, and storage.

**Storage Bins**
- https://youtu.be/QLxPNdfKCzU

**Potato Inspections**
- https://youtu.be/T9wAGDT7OfU?si=ai_PJMbQFiM69OCw

**Processing Potatoes**
- https://youtu.be/yoDDAa8_XMM?si=Qd69ufvocRmuyIMp

### Planted Crops
These datasets track activities (e.g., planting and harvest dates, canopy measurements, crop observations, crop development milestones) and crop inputs (e.g., seed, fertilizer, chemicals) at the "planted crop" level. One or more planted crops can be planted on a land parcel (i.e., a "field"), and planted crops can be planted concurrently on the same field. A field with multiple planted crops is generally called a "split" field. A field is typically split to partition it for different varieties or for research purposes.

We have included the Minnesota county in which the crop was grown as a reasonable proxy for the farm location.

**Variety**

The quantity and rate of sucrose-to-glucose conversion vary by variety. Variety must be a factor when building models to predict glucose values.

**Emergence Dates**

Emergence is defined as the date on which new plant foliage ("canopy") is visible above the soil for at least 50% of the planted crop. Emergence generally occurs 20 to 30 days after planting. If an emergence date is null, you can fill it in using either of the following approaches:
- 25 days after planting
- the average emergence date for all crops planted on that date within the specific county
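
The two fill approaches above can be combined, as in the minimal sketch below. It uses a few made-up rows rather than the real file, and the column name `EmergenceDateFilled` is an assumption introduced here. The county and planting-date average is tried first, with 25 days after planting as the fallback.

```python
import pandas as pd

# Toy rows for illustration only; real rows come from factPlantedCrop.csv.
df_toy_crop = pd.DataFrame({
    "County": ["HUBBARD", "HUBBARD", "HUBBARD"],
    "PlantingCompleteDate": pd.to_datetime(["2022-05-06", "2022-05-06", "2022-05-10"]),
    "EmergenceDate": pd.to_datetime(["2022-05-28", None, None]),
})

# Days from planting to emergence (NaN where the emergence date is null).
days_to_emerge = (
    df_toy_crop["EmergenceDate"] - df_toy_crop["PlantingCompleteDate"]
).dt.days

# Average days-to-emergence for crops planted on the same date in the same
# county. Within such a group, averaging days equals averaging the dates.
group_mean_days = days_to_emerge.groupby(
    [df_toy_crop["County"], df_toy_crop["PlantingCompleteDate"]]
).transform("mean")

# Fall back to 25 days after planting when the group has no known dates.
fill_days = group_mean_days.fillna(25).round()

# EmergenceDateFilled is a hypothetical column name used for this sketch.
df_toy_crop["EmergenceDateFilled"] = df_toy_crop["EmergenceDate"].fillna(
    df_toy_crop["PlantingCompleteDate"] + pd.to_timedelta(fill_days, unit="D")
)
```

Writing to a separate column keeps the original nulls visible, which may be useful if imputed dates should be down-weighted later.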

**Vine Kill Date**

Vine kill dates reflect the date on which a desiccant was applied to any remaining "green" vegetation (i.e., plant material). Applying a desiccant stops photosynthesis and begins the process of thickening the tuber skin. A thicker skin helps protect the tuber from damage (scuffing, small cuts, abrasions, skin/peel loss) that can happen during harvest.

Not all fields are vine killed. Some planted crops may mature and senesce naturally; some planted crops are intentionally harvested with "green vines" still present in the field. Optimally, a planted crop is harvested no earlier than 14 days after vine kill, which gives the tubers time to set a thicker skin. A *NULL* vine kill date should be left *NULL*. At this time, this dataset does not indicate whether a planted crop's vegetation senesced naturally, died early due to pathogen(s), or was harvested "green".

**Harvest Start and Completion Dates**

Harvesting a planted crop is typically done over a day or two, depending on the acres planted. In several cases, the number of days between the start and completion of harvest is longer, which can indicate weather delays or that a small portion of the crop was harvested early.
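
The date columns lend themselves to simple interval features. The sketch below uses made-up rows, and the derived column names and the 2-day threshold are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy rows for illustration only; real rows come from factPlantedCrop.csv.
df_dates = pd.DataFrame({
    "VineKillDate": pd.to_datetime(["2022-09-07", None, "2022-09-10"]),
    "HarvestStartDate": pd.to_datetime(["2022-09-19", "2022-09-20", "2022-09-24"]),
    "HarvestCompleteDate": pd.to_datetime(["2022-09-21", "2022-09-20", "2022-10-03"]),
})

# Days from the first to the last day of harvest (0 means a single-day harvest).
df_dates["HarvestSpanDays"] = (
    df_dates["HarvestCompleteDate"] - df_dates["HarvestStartDate"]
).dt.days

# Days from vine kill to harvest start; NaN where the vine kill date is NULL.
df_dates["DaysVineKillToHarvest"] = (
    df_dates["HarvestStartDate"] - df_dates["VineKillDate"]
).dt.days

# ASSUMPTION: spans longer than 2 days are worth a closer look for weather
# delays or an early partial harvest. The threshold is illustrative.
df_dates["LongHarvestSpan"] = df_dates["HarvestSpanDays"] > 2
```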

In [2]:
import pandas as pd
df_crop = pd.read_csv(
    "factPlantedCrop.csv"
    , parse_dates=[
        'PlantingCompleteDate'
        ,'EmergenceDate'
        ,'VineKillDate'
        ,'HarvestStartDate'
        ,'HarvestCompleteDate'
    ]
).set_index('PlantedCropID')

df_crop.head(1)

| PlantedCropID | WeatherLocationID | CropYear | Variety | PlantedCropName | County | PlantingCompleteDate | EmergenceDate | VineKillDate | HarvestStartDate | HarvestCompleteDate |
|---|---|---|---|---|---|---|---|---|---|---|
| 00DB364A-1462-48FE-8386-71F19B040A72 | A4924DF7-8154-4619-8F6A-CA76A039A012 | 2022 | Burbank | NE HIGHWAY 02 | HUBBARD | 2022-05-06 | 2022-05-31 | 2022-09-07 | 2022-09-19 | 2022-09-21 |


### Storage Bin Environmental Control System Sensor Data

Sensor values are read approximately every 30 minutes and generally won't change much within a given day, assuming the storage bin is not opened and no intentional adjustments of environmental controls have been made.

Changes in plenum temperatures are typically made by adjusting environmental controls. "Plenum air temperature" is essentially the temperature of the air supplied to the underside of the potato pile. In one design, the air is supplied through perforated steel culverts (roughly 30" to 36" in diameter) spaced every 6 to 8 feet and placed perpendicular to the building's longest axis. Here's an image which may help: https://idahopotato.com/uploads/foodserviceblog/2015/03/potaotes-going-into-storage-shed.jpg

Alternatively, "plenum air" can be forced upward through grooved channels in the storage bin's floor. See this image for an example: https://th.bing.com/th/id/R.6af899a66af9d6598ecd772d1cb9196f?rik=6zLUGAdQD9iwXA&riu=http%3a%2f%2fwww.suberizer.com%2fwp-content%2fuploads%2f2012%2f12%2fpotato-storage.jpg&ehk=UqbkKvbSN4BqS1vlDA5lvogbsG2U1IfDXxmo1Er01kI%3d&risl=&pid=ImgRaw&r=0&sres=1&sresct=1


**Return Air**
Return air temperature reflects the temperature of the air above the potato pile. When the bin doors are closed, return air temperature is a reasonable proxy for the potato pile's average temperature. When bin doors are open, return air temperatures can fluctuate. As a general approach, the goal is to keep the return air approximately 1 degree Fahrenheit warmer than the plenum air.

**Sensor Values**

Sensor values should be evaluated within the context of their type. "Fan Speed Percent" will have a value between 0.0 and 1.0 and should be read as a fraction, so 0.45 is interpreted as 45% fan speed. Similarly, carbon dioxide (CO2) levels are measured in parts per million (PPM), with values ranging from several hundred to several thousand. CO2 levels in fresh air range from 400 to 600 PPM. When CO2 values approach 2,000 PPM, the environmental control system automatically adjusts door openings to bring in more fresh air, because high CO2 values over extended periods can harm stored potatoes.
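
Because the sensor table is in long format (one row per system, sensor type, and date), it can help to pivot it to one column per sensor before applying the interpretations above. The sketch below uses made-up readings. The `SENSOR_NAMES` mapping is hypothetical, since the real `SensorTypeID` codes are not documented here, and the 1,800 PPM margin is likewise an assumption.

```python
import pandas as pd

# HYPOTHETICAL mapping: the real SensorTypeID codes are not documented here.
SENSOR_NAMES = {1: "PlenumTempF", 2: "ReturnTempF", 3: "FanSpeedPercent", 4: "CO2ppm"}

# Toy daily readings for one system (illustration only).
df_long = pd.DataFrame({
    "EnvironmentalControlSystemID": ["bin-A"] * 8,
    "SensorTypeID": [1, 2, 3, 4] * 2,
    "ObservationDate": pd.to_datetime(["2022-10-16"] * 4 + ["2022-10-17"] * 4),
    "SensorValue_Avg": [50.4, 51.5, 0.45, 650.0, 50.1, 51.0, 0.80, 1950.0],
})

# One row per system and date, one column per sensor.
df_wide = df_long.assign(
    Sensor=df_long["SensorTypeID"].map(SENSOR_NAMES)
).pivot_table(
    index=["EnvironmentalControlSystemID", "ObservationDate"],
    columns="Sensor",
    values="SensorValue_Avg",
)

# Fan speed is stored as a fraction; multiply by 100 to read it as a percent.
df_wide["FanSpeedPct"] = df_wide["FanSpeedPercent"] * 100

# Return air is expected to run about 1 degree F warmer than plenum air.
df_wide["ReturnMinusPlenumF"] = df_wide["ReturnTempF"] - df_wide["PlenumTempF"]

# ASSUMPTION: 1,800 PPM is used as an illustrative margin below 2,000 PPM.
df_wide["CO2NearLimit"] = df_wide["CO2ppm"] >= 1800
```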

In [2]:
import pandas as pd
df_sensors = pd.read_csv(
    "factEnviroSensorObservation.csv"
    , parse_dates=['ObservationDate']
    , dtype={
        'EnvironmentalControlSystemID':'object'
        ,'SensorTypeID':'int'
        ,'SensorValue_Avg':'float'
        ,'SensorValue_StDev':'float'
    }
).set_index(['EnvironmentalControlSystemID', 'SensorTypeID', 'ObservationDate'])

df_sensors.head(1)


| EnvironmentalControlSystemID | SensorTypeID | ObservationDate | SensorValue_Avg | SensorValue_StDev |
|---|---|---|---|---|
| E18A8B53-7D2F-4623-B6FA-B7867CE313AD | 1 | 2022-10-16 | 50.4 | 1.3 |


### Sucrose & Glucose Testing

For sucrose and glucose testing, a sample of 20 to 30 potatoes is used. Each fresh potato is sliced along its longest axis into 3/8" square strips. The four strips from the center of the sliced potato are retained and used for sugar testing. From each of those strips, only the stem-end 1/2" is retained. The stem end of the potato is where the tuber is attached to the plant's stolons; the greatest concentration of sugars is found in the tuber's stem end. Sugar concentrations become more dilute farther from the stem end.

**Reducing Sugars: Sucrose converts to Glucose**
Think of sucrose levels as the overall "gas tank" of potential glucose. Potatoes "burn" glucose during respiration, and it is the presence and quantity of glucose that defines the fry color. As a reference point, for the Burbank variety, a sugar test returning sucrose values less than 0.5 grams per liter and glucose values less than 0.1 grams per liter will result in a near-perfect fry color score of 0.

The higher the glucose value, the higher the resulting fry color score is likely to be.

If you plot the data, you may observe glucose values holding steady while sucrose values decrease; this should be interpreted as enzyme activity converting sucrose to glucose. Pay close attention to the storage bin temperatures when this occurs, as they indicate an optimal range for inducing that enzyme conversion. The issue is that we might want to drive glucose levels lower, but that may take longer to achieve until sucrose levels get closer to their minimum values.
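
One way to locate those periods without plotting is to compare consecutive samples from the same bin. The sketch below uses made-up values. The 0.05 g/L tolerance for "holding steady", the use of `SampleHarvestDate` for ordering, and the `ConversionLikely` column name are all assumptions.

```python
import pandas as pd

# Toy in-storage samples for one bin (illustration only).
df_bin = pd.DataFrame({
    "StorageBinID": ["bin-A"] * 4,
    "SampleHarvestDate": pd.to_datetime(
        ["2022-10-01", "2022-10-15", "2022-10-29", "2022-11-12"]
    ),
    "StemEndSucroseGramPerLiter": [2.0, 1.6, 1.2, 1.1],
    "StemEndGlucoseGramPerLiter": [0.30, 0.31, 0.32, 0.50],
}).sort_values(["StorageBinID", "SampleHarvestDate"])

# Change since the previous sample taken from the same bin.
by_bin = df_bin.groupby("StorageBinID")
d_sucrose = by_bin["StemEndSucroseGramPerLiter"].diff()
d_glucose = by_bin["StemEndGlucoseGramPerLiter"].diff()

# ASSUMPTION: glucose counts as "steady" if it moved by at most 0.05 g/L.
GLUCOSE_TOLERANCE = 0.05
df_bin["ConversionLikely"] = (d_sucrose < 0) & (d_glucose.abs() <= GLUCOSE_TOLERANCE)
```

Rows flagged this way could then be joined to the sensor data by bin and date to inspect the temperatures in effect at the time.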

**Sample Collection Type**
"Field samples" are taken before the crop has been harvested. Consequently, there will be no StorageBinID. "Harvest line samples" are taken while the harvested field is being loaded into a storage bin; the sample is collected from potatoes flowing off the truck, along conveyor belts, and into the storage bin. "In-Storage samples" are taken from the potato pile in the storage bin.

Generally, in-field samples are taken to assess the maturity of the crop. The lower the sucrose number, generally, the more mature the crop is perceived to be. Samples taken off the harvest line may show spikes in glucose, which might be induced by the stress of harvest.

In [None]:
import pandas as pd
df_sugars = pd.read_csv(
    "factSugars.csv"
    , parse_dates=['SampleHarvestDate', 'ProcessedDate']
    , dtype={
        'LabSampleID':'object'
        ,'PlantedCropID':'object'
        ,'StorageBinID':'object'
        ,'SpecificGravity':'float'
        ,'StemEndSucroseGramPerLiter':'float'
        ,'StemEndGlucoseGramPerLiter':'float'
        ,'SampleCollectionMethodID':'int'
    }
)
df_sugars.head(1)

| | LabSampleID | PlantedCropID | StorageBinID | SampleHarvestDate | ProcessedDate | SpecificGravity | StemEndSucroseGramPerLiter | StemEndGlucoseGramPerLiter | SampleCollectionMethodID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 599BEEFA-ECD6-4F6C-8DA1-0031CF1CDEA5 | 28412FEC-CD1E-47EE-86C1-B7DAED90B6DC | 0A50E1C1-5B4C-4291-AE01-362E074446F9 | 8/21/2023 | 2023-08-21 | NaN | 2.59 | 1.34 | 1 |


### USDA Crop Inspections

A USDA Crop Inspection report is typically tied to the harvested yield from a single field and on a single calendar day. A small sample (5 to 7 pounds) is taken from each truckload of harvested yield. The small sample is placed into a larger container; the container holds only samples from one field source.

**StorageBinID** 

When blank or *NULL*, this means the harvested yield was processed immediately and not stored.

**isEstimated**

1 indicates that the sampled tubers were not inspected and that, consequently, a different inspection report was used. The inspection facility and processor may agree to this if the sample was left outside in the heat or rain too long, the sample size was too small, or there was insufficient staffing to inspect the raw product. When a different report is applied, it is always from the same planted crop, but it may be from a different bin on the same day or, in rare cases, from a different day.

0 indicates the sampled tubers were inspected and that the reported numbers are tied to the actual sampled contents.

**PercentOfHarvestedYield**

This number should range from 0 to 1 and should be interpreted as the fraction of the planted crop's total harvested yield to which this inspection report applies (e.g., 0.16 is 16%). This information is used when calculating weighted averages of crop metrics such as 6-ounce percent and hollow heart percent.
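
A minimal sketch of such a weighted average follows, using made-up inspection rows. How to treat reports with a missing metric is an assumption: here they are left out and the remaining weights are renormalized.

```python
import pandas as pd

# Toy inspection rows (illustration only); real rows come from factCropInspection.csv.
df_reports = pd.DataFrame({
    "PlantedCropID": ["crop-A", "crop-A", "crop-B", "crop-B"],
    "PercentOfHarvestedYield": [0.25, 0.75, 0.50, 0.50],
    "PercentSixOz": [0.40, 0.60, 0.20, None],
})

metric = "PercentSixOz"
weight = "PercentOfHarvestedYield"

# ASSUMPTION: reports with a missing metric are excluded, and the weights of
# the remaining reports are renormalized so they still sum to 1 per crop.
valid = df_reports[metric].notna()
crop_key = df_reports["PlantedCropID"]

weighted_sum = (df_reports[metric] * df_reports[weight]).where(valid).groupby(crop_key).sum()
weight_sum = df_reports[weight].where(valid).groupby(crop_key).sum()

# Yield-weighted average of the metric for each planted crop.
weighted_avg = weighted_sum / weight_sum
```

For crop-A this gives 0.40 * 0.25 + 0.60 * 0.75 = 0.55, compared with an unweighted mean of 0.50.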

In [5]:
import pandas as pd
df_insp = pd.read_csv(
    "factCropInspection.csv",
).set_index(['PlantedCropID', 'USDAInspectionNumber'])

df_insp.head(1)

| PlantedCropID | USDAInspectionNumber | StorageBinID | isEstimated | PercentOfHarvestedYield | PercentBruiseFree | PercentSixOz | PercentSoftRot | PercentTenOz | PercentUndersize | PercentUnusable | PercentUSDA1 | SpecificGravity | PercentHollow | PercentInternalDiscolored | PercentMechanicalDamaged | PercentPinkEye | PercentPitScab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F0B8092D-043F-47D9-8971-69B44F8B246E | 4028 | 3C27C1E1-AF68-4DC7-87E1-06FCE480A4F1 | 1 | 0.160509 | NaN | 0.57 | 0.009 | 0.121 | 0.035 | 0.017 | NaN | 1.0822 | 0.0 | 0.0 | 0.003 | 0.0 | 0.003 |


### Canopy

Canopy measurements reflect the percentage of ground covered by the crop's vegetation. The idea is that growing mature potatoes is really a process of harvesting as much of the available solar radiation as possible. The closer to the summer solstice the crop achieves maximum ground cover, the more "harvested sunshine" it collects. The longer the crop can maintain maximum ground cover, the bigger and more mature the crop will become.

The sum of a planted crop's intercepted solar radiation may correlate with sucrose values.

In [7]:
import pandas as pd
df_canopy = pd.read_csv(
    "factPlantedCropCanopy.csv"
    , parse_dates=['CanopyDate']
)

df_canopy.head(1)

| | PlantedCropID | CanopyDate | CanopyPercent |
|---|---|---|---|
| 0 | 1521CCCE-AF40-49DC-82FF-0A1A6EFFC9D2 | 2023-06-15 | 0.58 |


### Weather

Weather information provided here has been retrieved from physical weather stations scattered around central Minnesota. Each planted crop is no more than 30 miles from a known physical weather station.

**Potato Days**
In this dataset, Potato Days is a measure of the quality of growing conditions for potatoes. Potatoes like it neither too cool nor too hot. When it is too warm or too cool, the tubers will not bulk (increase in size). The optimum temperature for bulking is between 48 and 74 degrees Fahrenheit. Each day is graded on a scale of 0 to 10, where 10 indicates the most favorable growing/bulking conditions.

We hypothesize that a mature potato crop will have total Potato Days in the 950 to 1000 range, as measured from planting complete to the earliest of vine kill date, harvest complete date, or September 10th.
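
The accumulation window described above can be sketched as follows. The rows are made up and cover only a few days, so the totals are far below the hypothesized range. Treating both the planting date and the end date as inclusive is an assumption, as is the `EndDate` column name.

```python
import pandas as pd

# Toy planted crops (illustration only); real rows come from factPlantedCrop.csv.
df_plantings = pd.DataFrame({
    "PlantedCropID": ["crop-1", "crop-2"],
    "WeatherLocationID": ["wx-1", "wx-1"],
    "CropYear": [2022, 2022],
    "PlantingCompleteDate": pd.to_datetime(["2022-05-01", "2022-05-03"]),
    "VineKillDate": pd.to_datetime(["2022-05-04", None]),
    "HarvestCompleteDate": pd.to_datetime(["2022-09-21", "2022-05-05"]),
})

# Toy weather in the long format of factWeather.csv: a constant 8.0 Potato Days.
df_weather = pd.DataFrame({
    "WeatherLocationID": "wx-1",
    "Date": pd.date_range("2022-05-01", "2022-05-10"),
    "ObservationType": "PotatoDays",
    "Value": 8.0,
})

# End of window: earliest of vine kill, harvest complete, and September 10th.
# Row-wise min skips NULL (NaT) dates.
sept_10 = pd.to_datetime(df_plantings["CropYear"].astype(str) + "-09-10")
df_plantings["EndDate"] = pd.concat(
    [df_plantings["VineKillDate"], df_plantings["HarvestCompleteDate"], sept_10],
    axis=1,
).min(axis=1)

# Attach each crop's station readings and keep the days inside its window.
potato_days = df_weather[df_weather["ObservationType"] == "PotatoDays"]
merged = df_plantings.merge(potato_days, on="WeatherLocationID")
# ASSUMPTION: both the planting date and the end date are counted.
in_window = merged["Date"].between(merged["PlantingCompleteDate"], merged["EndDate"])

total_potato_days = merged[in_window].groupby("PlantedCropID")["Value"].sum()
```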


**Reference ET (Penman-Monteith Model)**

The "ReferenceETPenmanMonteith" data point estimates the inches of water used per day by the crop through either evaporation or transpiration. The Reference ET is referenced to a crop of grass, which has a crop coefficient of 1.0. Potatoes have an evapotranspiration crop coefficient of 1.15, which means potato crops will consume roughly 15% more water through evapotranspiration than grass.

Evapotranspiration numbers can be summed for the entire growing season, but must be factored by canopy growth. So, if on June 20th a planted crop has an average canopy ground coverage of 60% and the Reference ET value is 0.25, then the adjusted evapotranspiration for that day would be: 0.25 * 0.6 * 1.15 = 0.1725" of water.
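
The same arithmetic can be applied to a whole season. Canopy is measured less often than weather, so the sketch below interpolates canopy linearly between measurement dates. That interpolation is an assumption, and the values are made up. The function name `adjusted_et` is introduced here for illustration.

```python
import pandas as pd

KC_POTATO = 1.15  # evapotranspiration crop coefficient for potatoes, from the text


def adjusted_et(reference_et_inches, canopy_fraction, crop_coefficient=KC_POTATO):
    """Reference ET scaled by canopy ground cover and the crop coefficient."""
    return reference_et_inches * canopy_fraction * crop_coefficient


# Worked example from the text: 0.25 * 0.6 * 1.15 inches of water.
single_day = adjusted_et(0.25, 0.60)

# Toy daily Reference ET for five days (illustration only).
ref_et = pd.Series(0.25, index=pd.date_range("2022-06-16", "2022-06-20"))

# Toy canopy readings on the first and last day only.
canopy_obs = pd.Series([0.20, 0.60], index=ref_et.index[[0, -1]])

# ASSUMPTION: canopy grows linearly between measurement dates.
daily_canopy = canopy_obs.reindex(ref_et.index).interpolate(method="time")

daily_adjusted_et = adjusted_et(ref_et, daily_canopy)
period_total_inches = daily_adjusted_et.sum()
```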

In theory, for a properly watered potato crop, the sum of all irrigation water applied plus measured rainfall (in inches) equals the summed adjusted evapotranspiration.

**Solar Radiation**

The "MegaJoulesPerMeterSquarePerDay" data point quantifies the total downward solar radiation energy available to a planted crop. As with ET values, we like to calculate the sum of a planted crop's intercepted solar radiation. Each day's available solar radiation must be factored by the canopy ground cover percentage. Generally speaking, more intercepted solar radiation will lead to larger yields as well as improved overall crop maturity.
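
A minimal sketch of the intercepted-radiation sum follows, using made-up values. It carries the most recent canopy reading forward to each day with `pandas.merge_asof`, which is one possible choice; interpolating between readings would be another. The toy solar frame assumes the weather rows have already been joined to planted crops through `WeatherLocationID`, and `InterceptedMJ` is a hypothetical column name.

```python
import pandas as pd

# Toy canopy readings (illustration only); real rows come from factPlantedCropCanopy.csv.
df_canopy_obs = pd.DataFrame({
    "PlantedCropID": ["crop-1", "crop-1"],
    "CanopyDate": pd.to_datetime(["2022-06-15", "2022-06-17"]),
    "CanopyPercent": [0.50, 1.00],
})

# Toy daily solar radiation, ASSUMED to be already joined to the planted crop.
df_solar = pd.DataFrame({
    "PlantedCropID": ["crop-1"] * 4,
    "Date": pd.to_datetime(["2022-06-15", "2022-06-16", "2022-06-17", "2022-06-18"]),
    "Value": [20.0, 20.0, 20.0, 20.0],
})

# For each day, use the latest canopy reading taken on or before that day.
daily = pd.merge_asof(
    df_solar.sort_values("Date"),
    df_canopy_obs.sort_values("CanopyDate"),
    left_on="Date",
    right_on="CanopyDate",
    by="PlantedCropID",
    direction="backward",
)

# Radiation intercepted by the canopy on each day.
daily["InterceptedMJ"] = daily["Value"] * daily["CanopyPercent"]

total_intercepted = daily.groupby("PlantedCropID")["InterceptedMJ"].sum()
```

Days before a crop's first canopy reading get no match and contribute nothing to the sum, so a starting canopy value of 0 at emergence may be worth adding in practice.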

In [14]:
import pandas as pd
df_wx = pd.read_csv(
    "factWeather.csv",
    parse_dates=['Date']
)\
 .drop(columns=['Idx'])\
 .sort_values(by=['WeatherLocationID', 'Date', 'ObservationType'])

df_wx.head(7)

| | WeatherLocationID | Date | ObservationType | Value |
|---|---|---|---|---|
| 35900 | 09BB28C6-0770-4169-B9B6-DE29B4BD4D9A | 2021-04-01 | AvgSurfaceDewPoint | 13.23 |
| 14360 | 09BB28C6-0770-4169-B9B6-DE29B4BD4D9A | 2021-04-01 | AvgSurfaceTempF | 29.839 |
| 0 | 09BB28C6-0770-4169-B9B6-DE29B4BD4D9A | 2021-04-01 | MaxSurfaceTempF | 45.984 |
| 21540 | 09BB28C6-0770-4169-B9B6-DE29B4BD4D9A | 2021-04-01 | MegaJoulesPerMeterSquarePerDay | 512.568 |
| 7180 | 09BB28C6-0770-4169-B9B6-DE29B4BD4D9A | 2021-04-01 | MinSurfaceTempF | 13.694 |
| 43080 | 09BB28C6-0770-4169-B9B6-DE29B4BD4D9A | 2021-04-01 | PotatoDays | 0.096 |
| 28720 | 09BB28C6-0770-4169-B9B6-DE29B4BD4D9A | 2021-04-01 | ReferenceETPenmanMonteith | 0.176 |
