# Data-driven approach to predict Biomass

To follow this notebook, first review the explore_imagen_analysis notebook. Here, machine learning models implemented in PySpark are used to predict the final biomass.

## Loading PySpark session

The following code starts a Spark session on the local machine. It also defines helper functions that extract data from MongoDB into PySpark DataFrames, compute correlations, and join DataFrames.

In [20]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import datetime as dt
import os

#%matplotlib 
import matplotlib.pyplot as plt

from pyspark.mllib.stat import Statistics
import pandas as pd


## initialize the Spark session and load the MongoDB connector
MONGO_URI="mongodb://localhost:27017/iot_db" 
my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .master('local[*]')\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")\
    .config("spark.mongodb.input.uri", MONGO_URI+".phis_experiments") \
    .getOrCreate()

def get_collection_mongodb(collection, pipeline=None) :
    '''Get one collection from the MongoDB database; the pipeline parameter is optional'''
    reader = my_spark.read.format("com.mongodb.spark.sql.DefaultSource")\
                                            .option("database","iot_db")\
                                            .option("collection", collection)
    if pipeline is not None: 
        reader = reader.option("pipeline", pipeline) # keep the returned reader instead of relying on in-place mutation
    return reader.load()

def pipeline_angle(angle):
    '''Pipeline sent to MongoDB to keep only the documents of the given angle'''
    return  "{'$match': {'angle':'%s' }}"%(angle)

   
#stats functions
def cal_correlation(df):
    '''Stats function that calculates the correlation between the columns, assuming that all of them are numeric'''
    col_names = df.columns
    features = df.rdd.map(lambda row: row[0:])

    corr_mat=Statistics.corr(features, method="pearson")
    corr_df = pd.DataFrame(corr_mat)
    corr_df.index, corr_df.columns = col_names, col_names
    return corr_df

def join_dataframes(df1,df2, f1_column, f2_column): 
    '''Apply join function to two spark dataframes'''
    ta = df1.alias('ta')
    tb = df2.alias('tb')
    
    if f1_column == f2_column: ## avoid repeat column in join result
        df_join = ta.join(tb,[f1_column])
    else:   
        df_join = ta.join(tb, ta[f1_column] == tb[f2_column])
    return df_join
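
The `pipeline_angle` helper above builds the `$match` stage by string formatting. As a minimal sketch, and assuming the connector accepts standard JSON for its pipeline option, the same stage could be produced with `json.dumps`, which handles quoting and escaping. The name `build_match_pipeline` is hypothetical and not part of the notebook.

```python
import json

def build_match_pipeline(angle):
    """Return a MongoDB `$match` stage, serialized as JSON, that filters by angle.

    Hypothetical helper for illustration; the notebook itself uses `pipeline_angle`.
    """
    stage = {"$match": {"angle": str(angle)}}
    return json.dumps(stage)

print(build_match_pipeline("240"))  # {"$match": {"angle": "240"}}
```

The angle is converted to a string because the `ANGLES` list in this notebook stores the angles as strings.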


## Configuring a scenario to run this notebook

A scenario (called a scenery in the code) is a set of configuration values that defines how the notebook predicts the biomass: for example, the camera angle to explore, the feature selection technique to use, and the number of slots.

The comments in the code below explain each setting line by line.

In [21]:

# After running the data exploration notebook, some plants turned out to have many null rows (> 23); they are listed here.
# These plants potentially had problems during the growth season, possibly because of camera errors, algorithm errors, or human errors.
potencial_nulls= [
    'http://www.phenome-fppn.fr/m3p/arch/2017/c17000795'
,'http://www.phenome-fppn.fr/m3p/arch/2017/c17000469'
,'http://www.phenome-fppn.fr/m3p/arch/2017/c17001536']


## image analysis can be disaggregated by camera angle
# all the angles captured in PhenoArch
# 0 is removed because it has no height over pot
ANGLES= ["30","60","90","120","150","180","210","240","270","300","330","AVG"] 

# ANGLE identifies the camera angle used in the rest of the notebook; it must be an element of ANGLES
ANGLE =  ANGLES[11] # "AVG"; use ANGLES[7] for "240"

LABEL_ML= "label" # dataset column holding the output variable to predict, label=biomass
METRIC_PERFORMANCE_ML="rmse" # the metric used to select models across the cross-validation iterations
ENTITY_URI_COL= "plantURI"# the entity that joins the biomass DataFrame with the image analysis DataFrame 
DATE_COL = "dayOfYear" # the column that identifies the date value

TIME_SERIES_COLLECTION= "phis_imagen_analysis_explicit_angle" # name of the time-series image analysis collection
SUMMARY_DATA = "phis_biomass" # biomass collection name

# feature selection options: "all" uses every feature, "pca" applies PCA,
# "naive" is a two-step implementation: (1) get the features most relevant to biomass, (2) filter out the features correlated with the selected ones
FEATURE_SELECTION_ALL = "all" 
FEATURE_SELECTION_PCA = "pca"
FEATURE_SELECTION_NAIVE = "naive"
FEATURE_SELECTION_METHOD= FEATURE_SELECTION_ALL # configure the feature option in the notebook
PCA_OPTION ="4" #["2","3","4"]  the pca number of components

if FEATURE_SELECTION_METHOD==FEATURE_SELECTION_PCA: # when PCA is used as method some columns names change in the pipeline
    FEATURE_ML = 'features_pca'
    LAST_FOLDER = FEATURE_SELECTION_PCA+"_"+PCA_OPTION  
else :
    FEATURE_ML = 'features_scaled'
    LAST_FOLDER = FEATURE_SELECTION_METHOD # when PCA is not used, the folder is named after the method

## how many slots to use  
SLOTS= "4"
until_ndays=30 # keep the first n days of the time-series data
split_ndays=8 # size, in days, of each group (slot)
groups= [1,2,3,4] # group labels; passing them to pivot explicitly improves performance

ACTIVATE_SAMPLING = True # this line is to select 10% of dataset when the notebook is tested

MODEL_TYPE = "lr" # first  run linear regression models to avoid kernel errors then run rf models


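FOLDER = "results/angle_%s/slots_%s/%s"%(ANGLE,SLOTS,LAST_FOLDER) # a folder to store the results of this scenario
os.makedirs(FOLDER, exist_ok=True) # create the folder if it is missing, so later writes do not fail
isDirectory = os.path.isdir(FOLDER)
print(isDirectory,FOLDER) ## confirm that the folder exists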

# result_notebook is a dictionary to save the final result in a File
result_notebook= {'angle':ANGLE}
result_notebook['folder']= FOLDER
result_notebook['start']=str(dt.datetime.now())

True results/angle_AVG/slots_4/all


## Obtaining time-series data from MongoDB
<i>Loading collection imagenanalysesangle_aggregated_by_day</i>

At this point the function ``get_collection_mongodb`` loads a MongoDB collection and maps it to a DataFrame. Next, the relevant fields are projected. Finally, the ``variableCode`` values are reshaped into columns by invoking the ``pivot`` function.

Since ``variableCode`` holds an arbitrary list of variable names, this pivot step keeps the pre-processing flexible and allows other predictors to be added, for example new information about the plant. The schema is designed to facilitate this.
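
To illustrate the reshape independently of Spark, here is a minimal pure-Python sketch of a pivot with a mean aggregation. The records and the function name `pivot_mean` are hypothetical and serve only as an example; the notebook itself relies on ``groupBy(...).pivot(...).mean(...)``.

```python
from collections import defaultdict

# Hypothetical long-format records: (plantURI, dayOfYear, variableCode, value)
records = [
    ("plant_a", 103, "height", 120.0),
    ("plant_a", 103, "height", 130.0),
    ("plant_a", 103, "width", 80.0),
    ("plant_a", 107, "height", 230.0),
]

def pivot_mean(rows):
    """Group by (plant, day) and turn each variable code into a column holding its mean."""
    values = defaultdict(lambda: defaultdict(list))
    for plant, day, code, value in rows:
        values[(plant, day)][code].append(value)
    return {
        key: {code: sum(vals) / len(vals) for code, vals in by_code.items()}
        for key, by_code in values.items()
    }

wide = pivot_mean(records)
print(wide[("plant_a", 103)])  # {'height': 125.0, 'width': 80.0}
```

In this sketch a variable that was never measured for a plant and day is simply absent from the dictionary; in the Spark pivot the corresponding cell is null.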

In [25]:
# hour-to-day transformation
def get_by_angle(angle):
    
    if angle =="AVG": # use the parameter, not the global ANGLE; no filter means all angles are averaged by the pivot
        pipeline = None
    else :
        pipeline = pipeline_angle(angle)
    
    df_images_anaysis = get_collection_mongodb( TIME_SERIES_COLLECTION ,pipeline)
    
    
    df_images_anaysis = df_images_anaysis.filter(~df_images_anaysis[ENTITY_URI_COL].isin(potencial_nulls))
    df_images_anaysis = df_images_anaysis.select(ENTITY_URI_COL, DATE_COL, "variableCode", "value")
    df_images_anaysis = df_images_anaysis.groupBy(ENTITY_URI_COL, DATE_COL).pivot("variableCode").mean("value")
    #df_images_anaysis = df_images_anaysis.orderBy(ENTITY_URI_COL,DATE_COL)
    
    result_notebook['columns_images']=df_images_anaysis.columns
    
    #display(df_images_anaysis.limit(20).toPandas())
    #df_images_anaysis.show(200)
    return df_images_anaysis
    
df_select_angle=get_by_angle(ANGLE)





None


| | plantURI | dayOfYear | Silk_area | convex_hull_area | convex_hull_perimeter | height | height_over_pot | height_under_pot | number_of_objects | object_sum_area | width |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 103 | | 4697.875 | 332.055457 | 127.875 | 129.636364 | 0.0 | 2.208333 | 1361.875 | 83.875 |
| 1 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 107 | | 21515.6 | 628.237317 | 230.208333 | 231.090909 | 0.0 | 1.375 | 3937.791667 | 154.791667 |
| 2 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 110 | | 54496.29 | 1001.990808 | 371.333333 | 309.545455 | 68.363636 | 3.666667 | 6594.958333 | 257.583333 |
| 3 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 122 | | 432269.0 | 2676.527085 | 844.375 | 809.545455 | 0.0 | 3.416667 | 58799.333333 | 863.916667 |
| 4 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 124 | | 382504.1 | 2403.782312 | 798.291667 | 786.0 | 0.0 | 5.583333 | 64758.083333 | 726.875 |
| 5 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 130 | | 723557.6 | 3292.860667 | 1133.75 | 1087.818182 | 28.727273 | 6.875 | 121217.708333 | 905.625 |
| 6 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 136 | | 820786.4 | 3546.773774 | 1356.020833 | 1264.565217 | 85.608696 | 5.0 | 181940.083333 | 903.3125 |
| 7 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 138 | | 1273222.0 | 4590.556989 | 1784.708333 | 1733.818182 | 59.636364 | 9.375 | 239447.583333 | 1083.791667 |
| 8 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 139 | | 756541.4 | 3310.599642 | 1196.833333 | 1201.363636 | 0.0 | 7.833333 | 188696.875 | 932.083333 |
| 9 | http://www.phenome-fppn.fr/m3p/arch/2017/c1700... | 141 | | 867406.1 | 3564.382594 | 1271.833333 | 1250.636364 | 1.909091 | 5.833333 | 217354.083333 | 1018.25 |


## Summarizing time-series data 

<i>Apply pyspark transformations/actions to aggregate time-series data</i>

Create a window partitioned by ``plantURI`` and ordered by ``dayOfYear``; each window represents the growth season of one plant.
``row_number`` numbers the observed days inside the window.
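
The numbering performed by the window can be sketched without Spark. The following pure-Python example, with hypothetical observations and a hypothetical function name `add_row_number`, sorts the rows by plant and day and restarts the counter at 1 for every plant.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (plantURI, dayOfYear) observations, deliberately unsorted
observations = [("plant_b", 110), ("plant_a", 107), ("plant_a", 103), ("plant_b", 104)]

def add_row_number(rows):
    """Number the rows of each plant by ascending day, starting at 1 for every plant."""
    numbered = []
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        for position, (plant, day) in enumerate(group, start=1):
            numbered.append((plant, day, position))
    return numbered

numbered = add_row_number(observations)
print(numbered)
```

Note that the counter follows observed days rather than calendar days: for `plant_a`, day 107 gets row number 2 even though it comes four calendar days after day 103.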

In [26]:
from pyspark.sql.window import Window
import pyspark.sql.functions as F 

# portion of the dataset over which the sequence is computed; with two windows, position 1 repeats in each window
windowSpec = Window.partitionBy(ENTITY_URI_COL).orderBy(DATE_COL)
# definition of the transformation
df_grp_img_analysis_row_number= df_select_angle.withColumn("row_number", F.row_number().over(windowSpec))

df_grp_img_analysis_row_number_biomass=df_grp_img_analysis_row_number.select(ENTITY_URI_COL,'row_number',
                                              'convex_hull_area','height_over_pot','width','object_sum_area')


## Defining the number of slots and filter time-series data

Given the number of slots, a new column is first defined to identify the group (slot) of each row; this column is later used for grouping.
Next, the time-series data is filtered to keep only the first ``until_ndays`` rows of each plant.
Finally, the ``pivot`` function aggregates the time series, turning each variable into n variables, one per slot.
The result is a DataFrame with the same granularity as the summary data: one row per plant.
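
As a small worked example of the slot assignment, the sketch below applies the same ceiling division to row numbers 1 to 30 with the values of `until_ndays` and `split_ndays` from the configuration cell. The function name `slot_of` is hypothetical.

```python
import math

until_ndays = 30  # same values as in the configuration cell
split_ndays = 8

def slot_of(row_number, split):
    """Slot label for a 1-based row number, mirroring F.ceil(row_number / split_ndays)."""
    return math.ceil(row_number / split)

slots = {}
for row_number in range(1, until_ndays + 1):
    slots.setdefault(slot_of(row_number, split_ndays), []).append(row_number)

sizes = {slot: len(rows) for slot, rows in slots.items()}
print(sizes)  # {1: 8, 2: 8, 3: 8, 4: 6}
```

With these settings the first three slots each cover eight rows, while the fourth covers only rows 25 to 30, so its averages are computed over at most six observations.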


In [27]:
result_notebook['slots']={'until':until_ndays,'split':split_ndays}
df_grp_img_analysis_row_number_biomass= df_grp_img_analysis_row_number_biomass\
                                                                    .withColumn("split_ndays_group",\
                                                                      F.ceil(F.col("row_number") /split_ndays))

## filter to get first days 
df_grp_img_analysis_row_number_biomass_first_ndays=df_grp_img_analysis_row_number_biomass[F.col("row_number")<=until_ndays]

# apply the aggregation only to selected columns
METHOD_AGG= 'avg'
exprsPivoted = {"convex_hull_area":METHOD_AGG,'height_over_pot':METHOD_AGG,'width':METHOD_AGG ,"object_sum_area":METHOD_AGG}

df_grp_img_analysis_row_number_biomass_split= df_grp_img_analysis_row_number_biomass_first_ndays\
                            .groupBy('plantURI').pivot('split_ndays_group',groups).agg(exprsPivoted)

## Joining biomass and time-series imagen analysis data

In this step the biomass DataFrame is created, again by invoking ``get_collection_mongodb``, and the relevant fields are projected. The dataset is then built by joining the biomass DataFrame with the aggregated time-series image analysis data.

In [32]:
excel_biomasa= get_collection_mongodb(SUMMARY_DATA)
excel_biomasa= excel_biomasa.select('plantURI','Treatment','Biomass(gramos_pesofresco)')
df_dataset = join_dataframes(df_grp_img_analysis_row_number_biomass_split,excel_biomasa,ENTITY_URI_COL,ENTITY_URI_COL)

## Pre-processing the dataset

First, all records where the biomass field is null are deleted. The biomass field is then cast to double so that it can be handled as a number, as regression requires. Finally, the flag ``ACTIVATE_SAMPLING`` is checked to decide whether the dataset must be reduced.

In [33]:
# delete rows with no biomass value
df_dataset = df_dataset.na.drop(subset=['Biomass(gramos_pesofresco)'])

df_dataset=df_dataset.withColumn("biomass", df_dataset["Biomass(gramos_pesofresco)"].cast("double"))

df_dataset= df_dataset.drop('Biomass(gramos_pesofresco)')

if ACTIVATE_SAMPLING:
    df_dataset= df_dataset.sample(0.1, 2018) # keep roughly 10% of the rows (seed 2018) for test runs
    

## Instantiating Machine Learning Pipeline Stages: Indexer, Encoder, Scaler



In [34]:
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler, MinMaxScaler, StandardScaler
from pyspark.ml.feature import PCA as PCAml
from pyspark.sql.functions import log, col

print(df_dataset.printSchema())
df_dataset_final= df_dataset.drop("plantURI")#,"1_avg(object_sum_area)", "1_avg(object_sum_area)")

target_y = 'biomass'
target_renamed_original= "label"

df_dataset_final=df_dataset_final.withColumnRenamed( target_y , target_renamed_original)

df_dataset_final=df_dataset_final.na.drop()

df_dataset_final.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
#df_dataset_final.cache()  # keep in memory for performance

# df_pd_sampling[df_pd_sampling['Treatment']=='WW'][target_renamed_original].hist()
# df_pd_sampling[df_pd_sampling['Treatment']=='WD'][target_renamed_original].hist(alpha=0.5)
# #df_dataset_final=df_dataset_final.drop('biomass')
# plt.savefig(FOLDER +'/target_original_histogram.png')

def enconder_indexer_stages(df_interno):
    
    array_labels= [ target_renamed_original ]
    #df_interno.cache() # keep in memory for performance
    
    df_interno = df_interno

    numeric_features = [t[0] for t in df_interno.dtypes if t[1] == 'double' and t[0] not in array_labels]

    categorical_features = [t[0] for t in df_interno.dtypes if t[1] == 'string' and t[0] not in array_labels]
    
    print(numeric_features)
    print(categorical_features)
    stages = []
    for categoricalCol in categorical_features:
        stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
        stages += [stringIndexer, encoder]


    assemblerInputs = [c + "classVec" for c in categorical_features] + numeric_features

    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages+=[assembler]
    
    scaler = StandardScaler(inputCol="features", outputCol="features_scaled")
    stages+=[scaler]
    
    if FEATURE_SELECTION_METHOD == FEATURE_SELECTION_PCA:
        pca = PCAml(k=int(PCA_OPTION), inputCol='features_scaled', outputCol="features_pca")
        stages+=[pca]
    
    return stages



# df_pd_sampling[df_pd_sampling['Treatment']=='WW'][target_renamed_log].hist()
# df_pd_sampling[df_pd_sampling['Treatment']=='WD'][target_renamed_log].hist(alpha=0.5)

# plt.savefig(FOLDER +'/target_log_histogram.png')

#df_dataset_final.describe().toPandas()

root
 |-- plantURI: string (nullable = true)
 |-- 1_avg(convex_hull_area): double (nullable = true)
 |-- 1_avg(object_sum_area): double (nullable = true)
 |-- 1_avg(width): double (nullable = true)
 |-- 1_avg(height_over_pot): double (nullable = true)
 |-- 2_avg(convex_hull_area): double (nullable = true)
 |-- 2_avg(object_sum_area): double (nullable = true)
 |-- 2_avg(width): double (nullable = true)
 |-- 2_avg(height_over_pot): double (nullable = true)
 |-- 3_avg(convex_hull_area): double (nullable = true)
 |-- 3_avg(object_sum_area): double (nullable = true)
 |-- 3_avg(width): double (nullable = true)
 |-- 3_avg(height_over_pot): double (nullable = true)
 |-- 4_avg(convex_hull_area): double (nullable = true)
 |-- 4_avg(object_sum_area): double (nullable = true)
 |-- 4_avg(width): double (nullable = true)
 |-- 4_avg(height_over_pot): double (nullable = true)
 |-- Treatment: string (nullable = true)
 |-- biomass: double (nullable = true)

None


## Executors for Linear Regression

In [35]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

def cross_validation(pipeline, param_grid, evaluator):
    
    K_FOLDS=5
    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=param_grid,
                              numFolds=K_FOLDS,
                              evaluator= evaluator
                             ) 
    return crossval
# split the dataset into training and test sets
def split_dataframe(df_interno, training_rate= 0.8):
    
    TRAINING=training_rate
    TEST= 1.0 - training_rate 
    train, test = df_interno.randomSplit([TRAINING, TEST], seed = 2018)
    #print("Training Dataset Count: " + str(train.count()))
    #print("Test Dataset Count: " + str(test.count()))
    
    return train, test


def ridge_regression_executor(df_interno):
    
    print("ridge regression /////////")
    lr = LinearRegression( maxIter=100, featuresCol=FEATURE_ML, labelCol=LABEL_ML , solver="l-bfgs")
                          #tol=1e-6, fitIntercept=True, standardization=True, solver="auto",weightCol=None, aggregationDepth=2)
    print(lr.solver)
    stages_lr= enconder_indexer_stages(df_interno)
    stages_lr = stages_lr+[lr] # append linear regression models as steps

    pipeline_lr = Pipeline(stages = stages_lr)
    
    # elasticNetParam: 0.0 gives ridge regression (L2 penalty), 1.0 gives lasso (L1 penalty)
    param_lr = ParamGridBuilder() \
        .addGrid(lr.regParam, [0.5,0.9]) \
        .addGrid(lr.elasticNetParam, [0.0]) \
        .build()
    
    
    model_crossval = cross_validation(pipeline_lr, param_lr, RegressionEvaluator(metricName=METRIC_PERFORMANCE_ML))
    
    model_list_lr = model_crossval.fit(df_interno)
    
    return model_list_lr

def linear_regression_executor(df_interno):
    
    print("linear regression /////////")
    lr = LinearRegression( maxIter=100, featuresCol=FEATURE_ML, labelCol=LABEL_ML, solver="l-bfgs")
    print(lr.solver)
    stages_lr= enconder_indexer_stages(df_interno)
    stages_lr = stages_lr+[lr] # append linear regression models as steps

    pipeline_lr = Pipeline(stages = stages_lr)
    
    param_lr = ParamGridBuilder() \
        .addGrid(lr.regParam, [0.0])\
        .build()
    model_crossval = cross_validation(pipeline_lr, param_lr, RegressionEvaluator(metricName=METRIC_PERFORMANCE_ML))
    
    model_list_lr = model_crossval.fit(df_interno)
    
    return model_list_lr
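
For reference, the two executors above differ only in the regularization grid. The Spark documentation describes the `LinearRegression` objective as a squared-error loss plus an elastic-net penalty. Writing $\lambda$ for `regParam` and $\alpha$ for `elasticNetParam`, and leaving the intercept out for brevity (a simplification made here), it can be sketched as:

$$
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2 \;+\; \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
$$

With $\alpha = 0$, as in `ridge_regression_executor`, only the $L_2$ term remains (ridge regression). With $\lambda = 0$, as in `linear_regression_executor`, the penalty vanishes and the model reduces to ordinary least squares.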




## Executors for Random Forest Regressor

In [36]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.sql.functions import mean, col, lit

def random_forest_executor(df_interno):
    

    rf = RandomForestRegressor(featuresCol = FEATURE_ML, labelCol = LABEL_ML , maxDepth=6 ,seed=42)
    
    stages_rf= enconder_indexer_stages(df_interno)
    stages_rf = stages_rf+[rf] # append the random forest regressor as the last stage

    pipeline_rf = Pipeline(stages = stages_rf)
    
    param_rf = ParamGridBuilder() \
        .addGrid(rf.numTrees, [10,100,1000]) \
        .build()
    
    
    model_crossval = cross_validation(pipeline_rf, param_rf, RegressionEvaluator(metricName=METRIC_PERFORMANCE_ML))
    
    model_list_rf = model_crossval.fit(df_interno)
    
    return model_list_rf

class DummieModel:
    
    def __init__(self,df_interno):
        self.mean_label = df_interno.select( mean(col(LABEL_ML)).alias('mean')).collect()[0]['mean']
        print(self.mean_label)
    
    def transform(self,df_interno):

        return  df_interno.withColumn('prediction', lit(self.mean_label))
    def save(self, folder):
        return None

## Methods for Organizing and Evaluating Models Results

In [37]:
from pyspark.ml.evaluation import RegressionEvaluator
from itertools import cycle
cycol = cycle('bgrcmk')

def calc_rmse(label_col,df_predictions):
    rf_evaluator = RegressionEvaluator(
    labelCol=label_col, predictionCol="prediction", metricName="rmse")
    rmse = rf_evaluator.evaluate(df_predictions)
    return rmse

def calc_rsquare(label_col, df_predictions):
    rf_evaluator = RegressionEvaluator(
    labelCol=label_col, predictionCol="prediction", metricName="r2")
    r2 = rf_evaluator.evaluate(df_predictions)
    return r2
    
def rf_output(model_crossv_rf):
    result={'model':"rf"}
    # note: avgMetrics holds the metric set in METRIC_PERFORMANCE_ML (RMSE here), despite the key name
    result['r2_crossval']=str(model_crossv_rf.avgMetrics)
    result['importance_list']=[]
    #importance= model_crossv_rf.bestModel.stages[-1].featureImportances 
    #input_cols_rf = model_crossv_rf.bestModel.stages[-2].getInputCols()
    #for x in range(len(input_cols_rf)):
        #result['importance_list'].append({'feature':input_cols_rf[x],'importance':importance[x]})
    return result

def lr_output(model_crossv_lr):
    result={'model':"lr"}
    # note: avgMetrics holds the metric set in METRIC_PERFORMANCE_ML (RMSE here), despite the key name
    result['r2_crossval']=str(model_crossv_lr.avgMetrics)
    lr_model= model_crossv_lr.bestModel.stages[-1] # extract linear regression model
#     lr_summary = lr_model.summary
#     result['rmse']=lr_summary.rootMeanSquaredError
#     result['r2_summary']=lr_summary.r2
    result['coefficients']= str(lr_model.coefficients)
    result['intercept']= str(lr_model.intercept)
    return result
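
The two metrics computed by `calc_rmse` and `calc_rsquare` can be written out in plain Python. This is a minimal sketch using the usual textbook definitions, which `RegressionEvaluator` is assumed to follow; the label values are hypothetical. It also shows why a mean predictor such as `DummieModel` is a natural baseline: on the data used to compute the mean, its R² is exactly 0.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Hypothetical biomass values, for illustration only
labels = [10.0, 20.0, 30.0, 40.0]
mean_prediction = [sum(labels) / len(labels)] * len(labels)  # what a mean baseline predicts

print(rmse(labels, mean_prediction))
print(r_squared(labels, mean_prediction))  # 0.0
```

Any model worth keeping should therefore reach an R² above 0 and an RMSE below that of the mean baseline on the test set.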



## Running the scenario

In [38]:

if FEATURE_SELECTION_METHOD== FEATURE_SELECTION_PCA  or FEATURE_SELECTION_METHOD== FEATURE_SELECTION_ALL:
    sceneries=[
        df_dataset_final.columns
    ]
elif FEATURE_SELECTION_METHOD== FEATURE_SELECTION_NAIVE:
    sceneries=[
       selected_items_columns + [LABEL_ML]
    ]
    
def run_sceneries(df_training, df_test):
    
    
    lr_model_config = {'label':'lr', "executor":linear_regression_executor, "output":lr_output}
    lr_model_ridge_config = {'label':'lr_ridge', "executor":ridge_regression_executor, "output":lr_output}
    rf_model_ridge_config = {'label':'rf', "executor":random_forest_executor, "output":rf_output}
    
    #baseline_model_config = {'label':'baseline', "executor":DummieModel, "output":rf_output}
    
#     if MODEL_TYPE == "lr":
#         models = [baseline_model_config, lr_model_config, lr_model_ridge_config ]
#     elif MODEL_TYPE == "rf":         
#         models = [ rf_model_ridge_config]
    models = [ lr_model_config, lr_model_ridge_config , rf_model_ridge_config]
    
    models_final_result = {}
    
    plot_sampling_test = []
        
    SAMPLING_SIZE= 200  
    
    for local_model in models:
        result = {'start':str(dt.datetime.now())}
        model_result = local_model["executor"](df_training)
        
        
        result_model_training = model_result.transform(df_training)
        result_model_test = model_result.transform(df_test)
        

        df_pd_result=result_model_test.select(LABEL_ML,"prediction").limit(SAMPLING_SIZE).toPandas()
        df_pd_result.to_csv(FOLDER+'/%s_test_sampling.csv'%(local_model['label']), index = False, header=True)
        
        print("Calculating errors")
        result['r2_training']= calc_rsquare(LABEL_ML,result_model_training)
        result['rmse_training']= calc_rmse(LABEL_ML,result_model_training)
        result['r2_test']= calc_rsquare(LABEL_ML,result_model_test)
        result['rmse_test']= calc_rmse(LABEL_ML,result_model_test)
        
        result['end']= str(dt.datetime.now())
        
        explore_result = local_model["output"](model_result)
        result["model_output"]= explore_result
        
        models_final_result[local_model['label']]= result
        
        
    #plot_errors_models_label(plot_sampling_test)
    
    
    return models_final_result 

def run_all():
    for sc in sceneries:

        info_run=run_sceneries(training.select(sc), test.select(sc) )

        result_local={'scenery':sc,'models_info':info_run}

        result_notebook['sceneries'].append(result_local)

        
result_notebook['sceneries']=[]
training, test = split_dataframe(df_dataset_final)

run_all()

result_notebook['end']=str(dt.datetime.now())



linear regression /////////
LinearRegression_49a1ae06a4a0674c43f6__solver
['1_avg(convex_hull_area)', '1_avg(object_sum_area)', '1_avg(width)', '1_avg(height_over_pot)', '2_avg(convex_hull_area)', '2_avg(object_sum_area)', '2_avg(width)', '2_avg(height_over_pot)', '3_avg(convex_hull_area)', '3_avg(object_sum_area)', '3_avg(width)', '3_avg(height_over_pot)', '4_avg(convex_hull_area)', '4_avg(object_sum_area)', '4_avg(width)', '4_avg(height_over_pot)']
['Treatment']
Calculating errors
ridge regression /////////
LinearRegression_4eae89c2e7c7317e76fa__solver
['1_avg(convex_hull_area)', '1_avg(object_sum_area)', '1_avg(width)', '1_avg(height_over_pot)', '2_avg(convex_hull_area)', '2_avg(object_sum_area)', '2_avg(width)', '2_avg(height_over_pot)', '3_avg(convex_hull_area)', '3_avg(object_sum_area)', '3_avg(width)', '3_avg(height_over_pot)', '4_avg(convex_hull_area)', '4_avg(object_sum_area)', '4_avg(width)', '4_avg(height_over_pot)']
['Treatment']
Calculating errors
['1_avg(convex_hull_area)

## Saving results in a JSON File

In [39]:
import json

#result_notebook['step_wise_naive']=str(result_notebook['step_wise_naive'])
print(json.dumps(result_notebook, indent=4,  sort_keys=True))

with open(FOLDER+'/models_info_%s.json'%(MODEL_TYPE), 'w') as outfile:
    json.dump(result_notebook, outfile, indent=4,  sort_keys=True)

{
    "angle": "AVG",
    "columns_images": [
        "plantURI",
        "dayOfYear",
        "Silk_area",
        "convex_hull_area",
        "convex_hull_perimeter",
        "height",
        "height_over_pot",
        "height_under_pot",
        "number_of_objects",
        "object_sum_area",
        "width"
    ],
    "end": "2020-03-25 21:51:22.426236",
    "folder": "results/angle_AVG/slots_4/all",
    "sceneries": [
        {
            "models_info": {
                "lr": {
                    "end": "2020-03-25 21:38:34.978736",
                    "model_output": {
                        "coefficients": "[1.3403921777125678,-29.107091384057327,-53.10424339749276,9.02060814977152,77.60437764565454,39.9518928097993,33.5917276013052,-38.73530020633246,-74.84994393989513,34.21081861907875,-59.57724864440271,8.53232077168436,30.07704573165963,-102.42163752743275,175.50810343438235,21.438228570111814,11.557652648847508]",
                        "intercept": "-8.08389172714464

In [40]:
#result_local
df_dataset_final.columns

['1_avg(convex_hull_area)',
 '1_avg(object_sum_area)',
 '1_avg(width)',
 '1_avg(height_over_pot)',
 '2_avg(convex_hull_area)',
 '2_avg(object_sum_area)',
 '2_avg(width)',
 '2_avg(height_over_pot)',
 '3_avg(convex_hull_area)',
 '3_avg(object_sum_area)',
 '3_avg(width)',
 '3_avg(height_over_pot)',
 '4_avg(convex_hull_area)',
 '4_avg(object_sum_area)',
 '4_avg(width)',
 '4_avg(height_over_pot)',
 'Treatment',
 'label']

In [None]:
df_dataset_final.limit(20).toPandas()