<br>

## Final project progress ~ milestone (hw9)
+ Project progress (and a reminder of the idea/task + background)
+ Resources/libraries/walkthroughs you've found
+ One or more datasets/libraries (perhaps)
+ _How explorations so far have gone_  (This is perhaps the most important!)
  + include, in Python, as much progress as you might have 
  + or, at least feel free to do so, if you wish
  + if you're using plain-python, submit that and any other small-data (<1mb files) you have
+ How you plan (or would like) to go _beyond_ the resources available... 
+ Also, include a link to a <i>draft of</i> a Google-slides presentation to be presented on April 18th or April 25th.



### Draft Presentation Link
https://docs.google.com/presentation/d/1d2228b2q-UljA39Itvh4ws0sSibjv9MnSXYmR_UPqBE/edit?usp=sharing

<br>

#### Here is the [project-progress overview page](https://docs.google.com/document/d/1XMVH0yHaBRVneMc4R0daA5qYgSgzl9ppDkI1qBPJ77s/edit) with more details.


I want to test a variety of ML models on my CSA dataset. Once I decide on a model, I will build a Flask app that takes input from an HTML form and uses my model to suggest a CSA farm for a consumer based on their inputs to the form. My goal is to accomplish all of these tasks to some extent and then refine them if time allows.

### Cleaning the CSA data

Here are two libraries I will use throughout

In [28]:
# libraries!
import numpy as np      # numpy is Python's "array" library
import pandas as pd     # Pandas is Python's "data" library ("dataframe" == spreadsheet)

In [29]:
# let's read in our CSA data...
# 
# for read_csv, use header=0 when row 0 is a header row
# 
filename = 'CSA data.csv'
df = pd.read_csv(filename)        # encoding="utf-8" et al.
print(f"{filename} : file read into a pandas dataframe.")

CSA data.csv : file read into a pandas dataframe.


In [30]:
#
# a dataframe is a "spreadsheet in Python"   (seems to have an extra column!)
#
# let's view it!
df

Unnamed: 0,Name,price,size,season,frequency,home delivery,SNAP,pickup location,vegetables,meat,eggs,fish,fruit,milk,Farm
0,Elisabeth Harmon,675,small,main,weekly,0,0,NE Portland,1,0,0,0,0,0,Full Cellar Farm
1,Londyn Houston,500,small,main,weekly,0,1,SW Portland,1,0,0,0,0,0,Good Rain Farm
2,Sylvia Barnes,675,full,main,weekly,0,1,SE Portland,1,0,0,0,0,0,Lil Starts Farm
3,Juliet Schroeder,600,full,main,weekly,0,1,SE Portland,1,1,0,0,0,0,Three Goats Farm
4,Stephen Mckinney,1000,full,main,weekly,0,0,NE Portland,1,0,0,0,1,0,The Side Yard Farm
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,Nadine Wagner,700,full,main,weekly,0,0,NW Portland,1,0,0,0,0,0,Good Rain Farm
86,Anderson Clayton,360,small,main,biweekly,0,0,NE Portland,1,0,0,0,0,0,Kasama Farm
87,Jessie Poole,500,small,main,weekly,0,0,NW Portland,1,0,0,0,0,0,Good Rain Farm
88,Cathy Chan,630,small,main,monthly,1,1,NE Portland,0,1,1,0,0,0,Totum Farm


In [31]:
#
# let's look at the dataframe's "info":
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Name             90 non-null     object
 1   price            90 non-null     int64 
 2   size             90 non-null     object
 3   season           90 non-null     object
 4   frequency        90 non-null     object
 5   home delivery    90 non-null     int64 
 6   SNAP             90 non-null     int64 
 7   pickup location  90 non-null     object
 8   vegetables       90 non-null     int64 
 9   meat             90 non-null     int64 
 10  eggs             90 non-null     int64 
 11  fish             90 non-null     int64 
 12  fruit            90 non-null     int64 
 13  milk             90 non-null     int64 
 14  Farm             90 non-null     object
dtypes: int64(9), object(6)
memory usage: 10.7+ KB


In [32]:
# Let's look at the dataframe's columns:
df.columns

Index(['Name', 'price', 'size', 'season', 'frequency', 'home delivery', 'SNAP',
       'pickup location', 'vegetables', 'meat', 'eggs', 'fish', 'fruit',
       'milk', 'Farm'],
      dtype='object')

In [33]:
# we can drop a series of data (a row or a column)
# they're indicated by numeric value, row~0, col~1, but let's use readable names instead:
ROW = 0
COLUMN = 1

df_clean1 = df.drop('Name', axis=COLUMN)
df_clean1

Unnamed: 0,price,size,season,frequency,home delivery,SNAP,pickup location,vegetables,meat,eggs,fish,fruit,milk,Farm
0,675,small,main,weekly,0,0,NE Portland,1,0,0,0,0,0,Full Cellar Farm
1,500,small,main,weekly,0,1,SW Portland,1,0,0,0,0,0,Good Rain Farm
2,675,full,main,weekly,0,1,SE Portland,1,0,0,0,0,0,Lil Starts Farm
3,600,full,main,weekly,0,1,SE Portland,1,1,0,0,0,0,Three Goats Farm
4,1000,full,main,weekly,0,0,NE Portland,1,0,0,0,1,0,The Side Yard Farm
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,700,full,main,weekly,0,0,NW Portland,1,0,0,0,0,0,Good Rain Farm
86,360,small,main,biweekly,0,0,NE Portland,1,0,0,0,0,0,Kasama Farm
87,500,small,main,weekly,0,0,NW Portland,1,0,0,0,0,0,Good Rain Farm
88,630,small,main,monthly,1,1,NE Portland,0,1,1,0,0,0,Totum Farm


In [34]:
#
# let's drop _all_ rows with data that is missing/NaN (not-a-number)
df_clean2 = df_clean1.dropna()
df_clean2.info()  # print the info, and
# let's see the whole table, as well:
df_clean2

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   price            90 non-null     int64 
 1   size             90 non-null     object
 2   season           90 non-null     object
 3   frequency        90 non-null     object
 4   home delivery    90 non-null     int64 
 5   SNAP             90 non-null     int64 
 6   pickup location  90 non-null     object
 7   vegetables       90 non-null     int64 
 8   meat             90 non-null     int64 
 9   eggs             90 non-null     int64 
 10  fish             90 non-null     int64 
 11  fruit            90 non-null     int64 
 12  milk             90 non-null     int64 
 13  Farm             90 non-null     object
dtypes: int64(9), object(5)
memory usage: 10.0+ KB


Unnamed: 0,price,size,season,frequency,home delivery,SNAP,pickup location,vegetables,meat,eggs,fish,fruit,milk,Farm
0,675,small,main,weekly,0,0,NE Portland,1,0,0,0,0,0,Full Cellar Farm
1,500,small,main,weekly,0,1,SW Portland,1,0,0,0,0,0,Good Rain Farm
2,675,full,main,weekly,0,1,SE Portland,1,0,0,0,0,0,Lil Starts Farm
3,600,full,main,weekly,0,1,SE Portland,1,1,0,0,0,0,Three Goats Farm
4,1000,full,main,weekly,0,0,NE Portland,1,0,0,0,1,0,The Side Yard Farm
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,700,full,main,weekly,0,0,NW Portland,1,0,0,0,0,0,Good Rain Farm
86,360,small,main,biweekly,0,0,NE Portland,1,0,0,0,0,0,Kasama Farm
87,500,small,main,weekly,0,0,NW Portland,1,0,0,0,0,0,Good Rain Farm
88,630,small,main,monthly,1,1,NE Portland,0,1,1,0,0,0,Totum Farm


In [35]:
# all of scikit-learn's ML routines need numbers, not strings

SIZE = ['small','full']   # int to str
SIZE_INDEX = {'small':0,'full':1}  # str to int

SEASON = ['main','fall']
SEASON_INDEX = {'main':0, 'fall':1}

FREQUENCY = ['biweekly', 'weekly', 'monthly']
FREQUENCY_INDEX = {'biweekly':0, 'weekly':1, 'monthly':2}

LOCATION = ['SE Portland', 'NE Portland', 'SW Portland', 'NW Portland']
LOCATION_INDEX = {'SE Portland':0, 'NE Portland':1, 'SW Portland':2, 'NW Portland':3}

FARM = ['Full Cellar Farm', 'Good Rain Farm', 'Lil Starts Farm', 'Three Goats Farm', 'The Side Yard Farm',
        'Sun Love Farm', 'Kasama Farm', 'PK Pastures', 'Totum Farm', 'Stoneboat Farm']
FARM_INDEX = {'Full Cellar Farm':0, 'Good Rain Farm':1, 'Lil Starts Farm':2, 'Three Goats Farm':3, 'The Side Yard Farm':4,
            'Sun Love Farm':5, 'Kasama Farm':6, 'PK Pastures':7, 'Totum Farm':8, 'Stoneboat Farm':9}

def convert_size(size):
    """ return the size index (a unique integer/category) """
    return SIZE_INDEX[size]

def convert_season(season):
    """ return the season index (a unique integer/category) """
    return SEASON_INDEX[season]

def convert_frequency(frequency):
    """ return the frequency index (a unique integer/category) """
    return FREQUENCY_INDEX[frequency]

def convert_location(location):
    """ return the pickup-location index (a unique integer/category) """
    return LOCATION_INDEX[location]

def convert_farm(farm):
    """ return the farm index (a unique integer/category) """
    return FARM_INDEX[farm]
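
The lookup tables above all follow one pattern: a list for int-to-str and a dict for str-to-int. As a minimal sketch (the helper name `make_index` is hypothetical and not part of the notebook), the dict could be derived from the list so the two directions cannot drift apart:

```python
def make_index(categories):
    """Build a str -> int lookup from an ordered list of category names."""
    return {name: i for i, name in enumerate(categories)}

# Same category order as the FREQUENCY list in the notebook
FREQUENCY = ['biweekly', 'weekly', 'monthly']
FREQUENCY_INDEX = make_index(FREQUENCY)

code = FREQUENCY_INDEX['weekly']   # str -> int
label = FREQUENCY[code]            # int -> str (round trip back to the name)
print(code, label)
```

The same helper would work for SIZE, SEASON, LOCATION, and FARM, as long as the list order stays fixed once a model has been trained on the resulting integers.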

In [36]:
df_clean3 = df_clean2.copy()

df_clean3['sizenum'] = df_clean2['size'].apply(convert_size)
df_clean3['seasonnum'] = df_clean2['season'].apply(convert_season)
df_clean3['frequencynum'] = df_clean2['frequency'].apply(convert_frequency)
df_clean3['locationnum'] = df_clean2['pickup location'].apply(convert_location)
df_clean3['farmnum'] = df_clean2['Farm'].apply(convert_farm)

df_clean3

Unnamed: 0,price,size,season,frequency,home delivery,SNAP,pickup location,vegetables,meat,eggs,fish,fruit,milk,Farm,sizenum,seasonnum,frequencynum,locationnum,farmnum
0,675,small,main,weekly,0,0,NE Portland,1,0,0,0,0,0,Full Cellar Farm,0,0,1,1,0
1,500,small,main,weekly,0,1,SW Portland,1,0,0,0,0,0,Good Rain Farm,0,0,1,2,1
2,675,full,main,weekly,0,1,SE Portland,1,0,0,0,0,0,Lil Starts Farm,1,0,1,0,2
3,600,full,main,weekly,0,1,SE Portland,1,1,0,0,0,0,Three Goats Farm,1,0,1,0,3
4,1000,full,main,weekly,0,0,NE Portland,1,0,0,0,1,0,The Side Yard Farm,1,0,1,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,700,full,main,weekly,0,0,NW Portland,1,0,0,0,0,0,Good Rain Farm,1,0,1,3,1
86,360,small,main,biweekly,0,0,NE Portland,1,0,0,0,0,0,Kasama Farm,0,0,0,1,6
87,500,small,main,weekly,0,0,NW Portland,1,0,0,0,0,0,Good Rain Farm,0,0,1,3,1
88,630,small,main,monthly,1,1,NE Portland,0,1,1,0,0,0,Totum Farm,0,0,2,1,8


In [37]:
# All of the columns need to be numeric
df_clean4 = df_clean3.drop( 'size', axis=COLUMN )
df_clean4 = df_clean4.drop( 'season', axis=COLUMN )
df_clean4 = df_clean4.drop( 'frequency', axis=COLUMN )
df_clean4 = df_clean4.drop( 'pickup location', axis=COLUMN )
df_clean4 = df_clean4.drop( 'Farm', axis=COLUMN )

#
# let's call it df_tidy 
#
df_tidy = df_clean4

In [38]:
df_tidy

Unnamed: 0,price,home delivery,SNAP,vegetables,meat,eggs,fish,fruit,milk,sizenum,seasonnum,frequencynum,locationnum,farmnum
0,675,0,0,1,0,0,0,0,0,0,0,1,1,0
1,500,0,1,1,0,0,0,0,0,0,0,1,2,1
2,675,0,1,1,0,0,0,0,0,1,0,1,0,2
3,600,0,1,1,1,0,0,0,0,1,0,1,0,3
4,1000,0,0,1,0,0,0,1,0,1,0,1,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,700,0,0,1,0,0,0,0,0,1,0,1,3,1
86,360,0,0,1,0,0,0,0,0,0,0,0,1,6
87,500,0,0,1,0,0,0,0,0,0,0,1,3,1
88,630,1,1,0,1,1,0,0,0,0,0,2,1,8


In [39]:
# We'll construct the new filename:
old_basename = filename[:-4]                      # remove the ".csv"
cleaned_filename = old_basename + "_cleaned.csv"  # name-creating
print(f"cleaned_filename is {cleaned_filename}")

# Now, save
df_tidy.to_csv(cleaned_filename, index_label=False)  # no "index" column...

cleaned_filename is CSA data_cleaned.csv


I had great success cleaning my data and getting it into a helpful form. No issues here!

### Let's begin creating models

In [40]:
# let's read in our cleaned CSV data...
# 
# for read_csv, use header=0 when row 0 is a header row
# 
cleaned_filename = "CSA data_cleaned.csv"
df_model1= pd.read_csv(cleaned_filename)   # encoding="utf-8" et al.
print(f"{cleaned_filename} : file read into a pandas dataframe.")
df_model1

CSA data_cleaned.csv : file read into a pandas dataframe.


Unnamed: 0,price,home delivery,SNAP,vegetables,meat,eggs,fish,fruit,milk,sizenum,seasonnum,frequencynum,locationnum,farmnum
0,675,0,0,1,0,0,0,0,0,0,0,1,1,0
1,500,0,1,1,0,0,0,0,0,0,0,1,2,1
2,675,0,1,1,0,0,0,0,0,1,0,1,0,2
3,600,0,1,1,1,0,0,0,0,1,0,1,0,3
4,1000,0,0,1,0,0,0,1,0,1,0,1,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,700,0,0,1,0,0,0,0,0,1,0,1,3,1
86,360,0,0,1,0,0,0,0,0,0,0,0,1,6
87,500,0,0,1,0,0,0,0,0,0,0,1,3,1
88,630,1,1,0,1,1,0,0,0,0,0,2,1,8


In [41]:
#
# let's keep our column names in variables, for reference
#
COLUMNS = df_model1.columns            # "list" of columns
print(f"COLUMNS is {COLUMNS}\n")  

COLUMNS is Index(['price', 'home delivery', 'SNAP', 'vegetables', 'meat', 'eggs', 'fish',
       'fruit', 'milk', 'sizenum', 'seasonnum', 'frequencynum', 'locationnum',
       'farmnum'],
      dtype='object')



In [42]:
# #
# # Reweighting our columns...
# # price and location of pickup is important to people
# # 
# df_model1['price'] *= 20
# df_model1['locationnum'] *= 20
# df_model1
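
Because KNN compares rows by distance (Euclidean by default in scikit-learn's `KNeighborsClassifier`), the raw scale of each column already acts as a weight. The sketch below uses two-feature rows that are made up for illustration, and the price range of 360 to 1000 is an assumption taken from the rows visible in the preview. It suggests that price in dollars may already outweigh location before any reweighting, so rescaling first might be worth trying:

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical rows: (price, locationnum)
row_a = (500, 0)
row_b = (525, 0)   # $25 more, same pickup location
row_c = (500, 3)   # same price, different pickup location

print(euclidean(row_a, row_b))  # gap caused by price alone
print(euclidean(row_a, row_c))  # gap caused by location alone

# Assumed ranges, for illustration only
PRICE_MIN, PRICE_MAX = 360, 1000
LOC_MIN, LOC_MAX = 0, 3

def min_max(value, lo, hi):
    """Rescale value to the 0..1 range."""
    return (value - lo) / (hi - lo)

def scale(row):
    price, loc = row
    return (min_max(price, PRICE_MIN, PRICE_MAX), min_max(loc, LOC_MIN, LOC_MAX))

print(euclidean(scale(row_a), scale(row_b)))  # price gap after rescaling
print(euclidean(scale(row_a), scale(row_c)))  # location gap after rescaling
```

Once every column is on a comparable scale, multiplying a column by a constant becomes a deliberate weight rather than a side effect of its units.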

In [43]:
def numpy_it(dframe):
    Arr = dframe.to_numpy()    # .values gets the numpy array

    #
    # let's make sure it's all floating-point, so we can multiply and divide
    #
    Arr = Arr.astype('float64')  

    #
    # Also, nice to have NUM_ROWS and NUM_COLS around
    #
    NUM_ROWS, NUM_COLS = Arr.shape

    return Arr

In [44]:
def data_definitions(num_features, Arr):
    # num_features: how many leading columns are features (avoids shadowing the built-in range)
    print("+++ Start of data definitions +++\n")

    #
    # we could do this at the data-frame level, too!
    #
    X_all = Arr[:,0:num_features]  # X (features) ... is all rows, the first num_features columns
    y_all = Arr[:,num_features]    # y (labels) ... is all rows, the column right after the features

    print(f"y_all (just the actual farms)   are \n {y_all}")
    print(f"X_all (just the features) are \n {X_all}")

    #
    # we scramble the data, to remove (potential) dependence on its ordering: 
    # 
    indices = np.random.permutation(len(y_all))  # indices is a permutation-list

    # we scramble both X and y, necessarily with the same permutation
    X_labeled = X_all[indices]              # we apply the _same_ permutation to each!
    y_labeled = y_all[indices]              # again...
    print(f"The scrambled actual farms are \n {y_labeled}")
    print(f"The corresponding data rows are \n {X_labeled}")

    # the scrambled copies are only printed; the unscrambled arrays are returned
    return X_all, y_all

In [45]:
def seperate_data(X_all, y_all):
    #
    # We next separate into test data and training data ... 
    #    + We will train on the training data...
    #    + We will _not_ look at the testing data to build the model
    #
    # Then, afterward, we will test on the testing data -- and see how well we do!
    #

    #
    # a common convention:  train on 80%, test on 20%    Let's define the TEST_PERCENT
    #

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

    print(f"training with {len(y_train)} rows;  testing with {len(y_test)} rows\n" )

    print(f"Held-out data... (testing data: {len(y_test)})")
    print(f"y_test: {y_test}")
    print(f"X_test (a few rows): {X_test[0:5,:]}")  # 5 rows
    print()
    print(f"Data used for modeling... (training data: {len(y_train)})")
    print(f"y_train: {y_train}")
    print(f"X_train (a few rows): {X_train[0:5,:]}")  # 5 rows

    return X_train, X_test, y_train, y_test
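
To make the 80/20 idea concrete, here is a rough pure-Python sketch of what a seeded shuffle-and-split does conceptually. It is not scikit-learn's implementation, and `split_indices` is a hypothetical helper:

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out test_fraction of them for testing."""
    rng = random.Random(seed)          # fixed seed -> the same split on every run
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    test = indices[:n_test]            # held-out rows, never used for training
    train = indices[n_test:]           # rows used to build the model
    return train, test

train_idx, test_idx = split_indices(90)   # the CSA data has 90 rows
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state=42` above: the held-out rows stay the same between runs, so accuracy changes reflect the model rather than the split.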

### The first model we will try is a KNN model

In [46]:
#
# to do this, we use "cross validation"
#

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def find_k(X_train, X_test, y_train, y_test):
    #
    # cross-validation splits the training set into two pieces:
    #   + model-building and model-validation. We'll use "build" and "validate"
    #
    best_k = 1           # placeholder; the loop below finds the real best k
    best_accuracy = 0.0  # running max of the average cv accuracy

    # Note that we are cross-validating using only our TRAINING data (X_test and y_test are unused here)
    for k in range(1,12):
        knn_cv_model = KNeighborsClassifier(n_neighbors=k)   # build knn_model for every k!
        cv_scores = cross_val_score( knn_cv_model, X_train, y_train, cv=5 )  # cv=5 means 80/20
        # print(cv_scores)  # just to see the five scores... 
        average_cv_accuracy = cv_scores.mean()  # mean() is numpy's built-in average function 
        print(f"k: {k:2d}  cv accuracy: {average_cv_accuracy:7.4f}")

        if average_cv_accuracy > best_accuracy:
            best_accuracy = average_cv_accuracy
            best_k = k

        
    # best_k now holds the k with the highest average cv accuracy, found by the loop above

    print(f"best_k = {best_k}   yields the highest average cv accuracy of {best_accuracy}")  # print the best one

    return best_k
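
As a sketch of what `cv=5` means, the hypothetical helper below partitions row indices into five contiguous folds and pairs each validation fold with the remaining "build" rows. This is the plain, unstratified version; as far as I know, scikit-learn uses stratified folds for classifiers when `cv` is an integer, so its actual folds would differ.

```python
def k_fold_indices(n_rows, n_folds=5):
    """Yield (build, validate) index lists, one pair per fold."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validate = list(range(start, start + size))
        build = [i for i in range(n_rows) if i < start or i >= start + size]
        yield build, validate
        start += size

# 72 training rows, as in the 80/20 split of the 90-row dataset
for build, validate in k_fold_indices(72):
    print(len(build), len(validate))
```

Each row is validated exactly once, and the five accuracies are then averaged, which is what `cv_scores.mean()` does in `find_k`.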

In [47]:
def train(X_train, y_train, best_k):
    #
    # With the best k, we build and train a new model:
    #
    # Now, we use best_k instead of the original, randomly-guessed value    
    #
    from sklearn.neighbors import KNeighborsClassifier
    knn_model_tuned = KNeighborsClassifier(n_neighbors=best_k)   # here, we use the best_k!

    # we train the model (one line!)
    knn_model_tuned.fit(X_train, y_train)                              # yay!  trained!
    print(f"Created + trained a knn classifier, now tuned with a (best) k of {best_k}")  
    return knn_model_tuned   # hand the trained model back so the caller can use it

In [48]:
def train_final(best_k, X_all, y_all):
    #
    # Ok!  We have tuned knn to use the "best" value of k...
    #
    # And, we should now use ALL available data to train our final predictive model:
    #

    knn_model_final = KNeighborsClassifier(n_neighbors=best_k)   # here, we use the best_k
    knn_model_final.fit(X_all, y_all)                              # here we use ALL the data!
    print(f"Created + trained a 'final' knn classifier, with a (best) k of {best_k}") 

    return knn_model_final

In [49]:
#
# final predictive model (k-nearest-neighbor), with tuned k + ALL data incorporated
#
def final_predict(knn_model_final, bound):
    def predictive_model( Features, Model ):
        """ input: a list of feature values for one CSA share (in training-column order)
            output: the predicted farm index (an integer)
        """
        our_features = np.asarray([Features])                 # extra brackets needed

        # The model's prediction!
        predicted_farms = Model.predict(our_features)
        
        # a bit awkward
        predicted_farm = int(round(predicted_farms[0]))  # unpack one element
        return predicted_farm


    # LoD = [[0,0,0,8,14,0,0,0,0,0,5,16,11,0,0,0,0,1,15,14,1,6,0,0,0,7,16,5,3,16,8,0,0,8,16,8,14,16,2,0,0,0,6,14,16,11,0,0,0,0,0,6,16,4,0,0,0,0,0,10,15,0,0,0],
    # [0,0,0,5,14,12,2,0,0,0,7,15,8,14,4,0,0,0,6,2,3,13,1,0,0,0,0,1,13,4,0,0,0,0,1,11,9,0,0,0,0,8,16,13,0,0,0,0,0,5,14,16,11,2,0,0,0,0,0,6,12,13,3,0],
    # [0,0,0,3,16,3,0,0,0,0,0,12,16,2,0,0,0,0,8,16,16,4,0,0,0,7,16,15,16,12,11,0,0,8,16,16,16,13,3,0,0,0,0,7,14,1,0,0,0,0,0,6,16,0,0,0,0,0,0,4,14,0,0,0],
    # [0,0,0,3,15,10,1,0,0,0,0,11,10,16,4,0,0,0,0,12,1,15,6,0,0,0,0,3,4,15,4,0,0,0,0,6,15,6,0,0,0,4,15,16,9,0,0,0,0,0,13,16,15,9,3,0,0,0,0,4,9,14,7,0],
    # [0,0,0,3,16,3,0,0,0,0,0,10,16,11,0,0,0,0,4,16,16,8,0,0,0,2,14,12,16,5,0,0,0,10,16,14,16,16,11,0,0,5,12,13,16,8,3,0,0,0,0,2,15,3,0,0,0,0,0,4,12,0,0,0],
    # [0,0,7,15,15,4,0,0,0,8,16,16,16,4,0,0,0,8,15,8,16,4,0,0,0,0,0,10,15,0,0,0,0,0,1,15,9,0,0,0,0,0,6,16,2,0,0,0,0,0,8,16,8,11,9,0,0,0,9,16,16,12,3,0]]

    # # run on each one:
    # for Features in LoD:
    #     predicted_farms = predictive_model( Features[:bound], knn_model_final )  # pass in the model, too!
    #     print(f"I predict {predicted_farms} from the features {Features}")
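
For the planned Flask app, form input would need to become one feature row in the same column order the model was trained on, and the predicted integer would need to be mapped back to a farm name. Here is a rough sketch, assuming the form field names match the original column names (the form does not exist yet, so this is a guess); `form_to_features` and `farm_name` are hypothetical helpers:

```python
# Training-column order, copied from COLUMNS (without the 'farmnum' label)
FEATURE_ORDER = ['price', 'home delivery', 'SNAP', 'vegetables', 'meat', 'eggs',
                 'fish', 'fruit', 'milk', 'sizenum', 'seasonnum',
                 'frequencynum', 'locationnum']

SIZE_INDEX = {'small': 0, 'full': 1}
SEASON_INDEX = {'main': 0, 'fall': 1}
FREQUENCY_INDEX = {'biweekly': 0, 'weekly': 1, 'monthly': 2}
LOCATION_INDEX = {'SE Portland': 0, 'NE Portland': 1, 'SW Portland': 2, 'NW Portland': 3}
FARM = ['Full Cellar Farm', 'Good Rain Farm', 'Lil Starts Farm', 'Three Goats Farm',
        'The Side Yard Farm', 'Sun Love Farm', 'Kasama Farm', 'PK Pastures',
        'Totum Farm', 'Stoneboat Farm']

def form_to_features(form):
    """Turn a dict of form values into a numeric row in training-column order."""
    row = dict(form)
    row['sizenum'] = SIZE_INDEX[form['size']]
    row['seasonnum'] = SEASON_INDEX[form['season']]
    row['frequencynum'] = FREQUENCY_INDEX[form['frequency']]
    row['locationnum'] = LOCATION_INDEX[form['pickup location']]
    return [float(row[name]) for name in FEATURE_ORDER]

def farm_name(predicted_index):
    """Map a predicted class (possibly a float such as 8.0) back to the farm's name."""
    return FARM[int(round(predicted_index))]

# Same values as row 0 of the dataset
example_form = {'price': 675, 'size': 'small', 'season': 'main', 'frequency': 'weekly',
                'home delivery': 0, 'SNAP': 0, 'pickup location': 'NE Portland',
                'vegetables': 1, 'meat': 0, 'eggs': 0, 'fish': 0, 'fruit': 0, 'milk': 0}
features = form_to_features(example_form)
print(features)
print(farm_name(8.0))
```

The resulting list could then be passed to `predictive_model` along with the trained model.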

In [50]:
# col_vars(df_model1)
nparr = numpy_it(df_model1)
X_all, y_all = data_definitions(13, nparr)
X_train, X_test, y_train, y_test = seperate_data(X_all, y_all)
best_k = find_k(X_train, X_test, y_train, y_test)
train(X_train, y_train, best_k)
knn_model_final = train_final(best_k, X_all, y_all)
# final_predict(knn_model_final, 56)

+++ Start of data definitions +++

y_all (just the actual farms)   are 
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 1. 6. 1. 8. 8. 0. 3. 4. 2. 0. 5. 3. 6. 8. 0.
 6. 8. 4. 2. 7. 1. 9. 5. 5. 7. 4. 9. 6. 3. 1. 9. 2. 5. 9. 5. 4. 7. 9. 7.
 6. 5. 6. 8. 4. 2. 7. 1. 9. 5. 5. 7. 4. 9. 6. 3. 1. 9. 2. 5. 9. 5. 4. 7.
 9. 7. 6. 5. 0. 1. 2. 3. 4. 5. 6. 7. 2. 1. 6. 1. 8. 8.]
X_all (just the features) are 
 [[675.   0.   0. ...   0.   1.   1.]
 [500.   0.   1. ...   0.   1.   2.]
 [675.   0.   1. ...   0.   1.   0.]
 ...
 [500.   0.   0. ...   0.   1.   3.]
 [630.   1.   1. ...   0.   2.   1.]
 [630.   1.   1. ...   0.   2.   3.]]
The scrambled actual farms are 
 [4. 8. 6. 5. 0. 6. 4. 2. 0. 3. 8. 5. 3. 8. 9. 7. 7. 3. 6. 9. 1. 2. 3. 1.
 1. 2. 3. 7. 7. 4. 9. 5. 9. 6. 9. 6. 9. 6. 0. 8. 6. 6. 1. 1. 7. 8. 7. 5.
 4. 2. 8. 2. 0. 4. 1. 5. 4. 5. 5. 9. 7. 5. 7. 7. 5. 5. 6. 7. 2. 4. 9. 2.
 4. 3. 8. 9. 4. 9. 2. 1. 5. 6. 0. 1. 1. 8. 1. 5. 5. 6.]
The corresponding data rows are 
 [[5.000e+02 0.000e+00 0.000e+00 ... 0.000e+00 1.00



k:  5  cv accuracy:  0.6114
k:  6  cv accuracy:  0.5267
k:  7  cv accuracy:  0.5267
k:  8  cv accuracy:  0.4295
k:  9  cv accuracy:  0.4295
k: 10  cv accuracy:  0.3752
k: 11  cv accuracy:  0.3752
best_k = 1   yields the highest average cv accuracy of 0.9304761904761903
Created + trained a knn classifier, now tuned with a (best) k of 1




Created + trained a 'final' knn classifier, with a (best) k of 1


### For KNN the best k is 1 with an accuracy of 0.85
From the warnings I got above, it seems that I would benefit from more data. Unfortunately, I do not have easy access to more data, so I will work with what I have for now and see if I can get more later. In the meantime, I think 85% accuracy is good enough to make solid progress with.
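
One possible cause of such warnings (an assumption, since the warning text is not shown here) is that some farms have only a handful of rows, so each cross-validation fold sees very few examples per farm. A quick way to check is to count rows per farm index, here using the `y_all` labels printed above:

```python
from collections import Counter

# Labels copied from the printed y_all output (farm indices, 90 rows)
y_all = [0, 1, 2, 3, 4, 5, 6, 7, 8, 1, 6, 1, 8, 8, 0, 3, 4, 2, 0, 5, 3, 6, 8, 0,
         6, 8, 4, 2, 7, 1, 9, 5, 5, 7, 4, 9, 6, 3, 1, 9, 2, 5, 9, 5, 4, 7, 9, 7,
         6, 5, 6, 8, 4, 2, 7, 1, 9, 5, 5, 7, 4, 9, 6, 3, 1, 9, 2, 5, 9, 5, 4, 7,
         9, 7, 6, 5, 0, 1, 2, 3, 4, 5, 6, 7, 2, 1, 6, 1, 8, 8]

counts = Counter(y_all)
for farm_index in sorted(counts):
    print(f"farm {farm_index}: {counts[farm_index]} rows")
```

Farms with the fewest rows are the ones where a few more examples would likely help the most.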

### Now let's try a decision tree

In [51]:
nparr = numpy_it(df_model1)
X_all, y_all = data_definitions(13, nparr)
X_train, X_test, y_train, y_test = seperate_data(X_all, y_all)

+++ Start of data definitions +++

y_all (just the actual farms)   are 
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 1. 6. 1. 8. 8. 0. 3. 4. 2. 0. 5. 3. 6. 8. 0.
 6. 8. 4. 2. 7. 1. 9. 5. 5. 7. 4. 9. 6. 3. 1. 9. 2. 5. 9. 5. 4. 7. 9. 7.
 6. 5. 6. 8. 4. 2. 7. 1. 9. 5. 5. 7. 4. 9. 6. 3. 1. 9. 2. 5. 9. 5. 4. 7.
 9. 7. 6. 5. 0. 1. 2. 3. 4. 5. 6. 7. 2. 1. 6. 1. 8. 8.]
X_all (just the features) are 
 [[675.   0.   0. ...   0.   1.   1.]
 [500.   0.   1. ...   0.   1.   2.]
 [675.   0.   1. ...   0.   1.   0.]
 ...
 [500.   0.   0. ...   0.   1.   3.]
 [630.   1.   1. ...   0.   2.   1.]
 [630.   1.   1. ...   0.   2.   3.]]
The scrambled actual farms are 
 [2. 1. 1. 8. 0. 3. 5. 4. 6. 2. 0. 9. 3. 9. 1. 7. 5. 2. 6. 5. 2. 4. 9. 8.
 4. 1. 3. 3. 5. 5. 6. 8. 8. 7. 5. 5. 5. 1. 4. 6. 0. 1. 7. 2. 4. 2. 1. 1.
 5. 9. 1. 7. 8. 6. 9. 7. 9. 5. 6. 4. 9. 3. 9. 7. 4. 9. 0. 0. 6. 5. 2. 7.
 7. 1. 6. 7. 3. 5. 7. 2. 4. 8. 8. 4. 6. 8. 9. 5. 6. 6.]
The corresponding data rows are 
 [[6.750e+02 0.000e+00 1.000e+00 ... 0.000e+00 1.00

In [52]:
#
# To compare different tree-depths, we use cross validation
#

from sklearn.model_selection import cross_val_score
from sklearn import tree      # for decision trees

#
# cross-validation splits the training set into two pieces:
#   + model-building and model-validation. We'll use "build" and "validate"
#

best_d = 1
best_accuracy = 0.0

for d in range(1,10):
    cv_model = tree.DecisionTreeClassifier(max_depth=d)   # for each depth, d
    cv_scores = cross_val_score( cv_model, X_train, y_train, cv=5 ) # 5 means 80/20 split
    # print(cv_scores)  # we usually don't want to see the five individual scores 
    average_cv_accuracy = cv_scores.mean()  # more likely, only their average
    print(f"depth: {d:2d}  cv accuracy: {average_cv_accuracy:7.4f}")
    
    if average_cv_accuracy > best_accuracy:
        best_accuracy = average_cv_accuracy
        best_d = d

    
    
# assign best value of d to best_depth
best_depth = best_d   # may have to hand-tune this, depending on what happens...
print()
print(f"best_depth = {best_depth} is our choice for an underfitting/overfitting balance with accuracy {best_accuracy}")  



depth:  1  cv accuracy:  0.2771
depth:  2  cv accuracy:  0.4448
depth:  3  cv accuracy:  0.5829
depth:  4  cv accuracy:  0.6943
depth:  5  cv accuracy:  0.7371
depth:  6  cv accuracy:  0.7924
depth:  7  cv accuracy:  0.8476
depth:  8  cv accuracy:  0.8324
depth:  9  cv accuracy:  0.8457

best_depth = 7 is our choice for an underfitting/overfitting balance with accuracy 0.8476190476190476
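
Since the comment in the cell mentions possibly hand-tuning the depth, one option is to prefer the shallowest tree whose accuracy is within a small tolerance of the best. This is only a sketch: `shallowest_within` and the tolerance values are illustrative choices, and the accuracies are copied from the output above.

```python
# depth -> average cv accuracy, as printed above
cv_accuracy = {1: 0.2771, 2: 0.4448, 3: 0.5829, 4: 0.6943, 5: 0.7371,
               6: 0.7924, 7: 0.8476, 8: 0.8324, 9: 0.8457}

def shallowest_within(scores, tolerance=0.01):
    """Return the smallest depth whose accuracy is within tolerance of the best."""
    best = max(scores.values())
    return min(d for d, acc in scores.items() if acc >= best - tolerance)

print(shallowest_within(cv_accuracy))         # strict tolerance
print(shallowest_within(cv_accuracy, 0.06))   # looser tolerance allows a shallower tree
```

A shallower tree is simpler to explain and less prone to overfitting, at the cost of a little cross-validation accuracy.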




In [53]:
#
# Now, we re-create and re-run the  "Model-building and -training Cell"
#
# this time, with the best depth, best_d, found by cross-validation model tuning:
#
from sklearn import tree      # for decision trees

# we should have best_depth from our cv exploration
dtree_model_tuned = tree.DecisionTreeClassifier(max_depth=best_depth)

# we train the model (it's one line!)
dtree_model_tuned.fit(X_train, y_train)                              # yay!  trained!
print("Created and trained a DT classifier with max depth =", best_depth) 

Created and trained a DT classifier with max depth = 7


In [54]:
#
# Ok!  We have tuned our DT to use the "best" depth...
#
# Now, we use ALL available data to train our final predictive model:
#

from sklearn import tree      # for decision trees

# we should have best_depth from our cv exploration
dtree_model_final = tree.DecisionTreeClassifier(max_depth=best_depth)

# we train the model (it's one line!)
dtree_model_final.fit(X_all, y_all)                              # yay!  trained!
print("Created and trained a 'final' DT classifier with max depth =", best_depth) 

Created and trained a 'final' DT classifier with max depth = 7


### This DT model has an accuracy of 0.75 which is not as good as our KNN model

### Now trying a random forest

In [55]:
#
# So, to compare different parameters, let's use cv
#

from sklearn.model_selection import cross_val_score
from sklearn import tree      # for decision trees
from sklearn import ensemble  # for random forests, an ensemble classifier

#
# cross-validation splits the training set into two pieces:
#   + model-building and model-validation. We'll use "build" and "validate"
#

#
# nested loops: try every (depth, number-of-trees) combination
#

best_d = 1
best_ntrees = 50   # range(50,300,100)
best_score = 0

for d in range(1,10):
    for ntrees in range(50,300,100):
        rforest_model = ensemble.RandomForestClassifier(max_depth=d, 
                                                        n_estimators=ntrees,
                                                        max_samples=0.5)
        cv_scores = cross_val_score( rforest_model, X_train, y_train, cv=5 ) # 5 means 80/20 split
        average_cv_accuracy = cv_scores.mean()  # more likely, only their average
        print(f"depth: {d:2d} ntrees: {ntrees:3d} cv accuracy: {average_cv_accuracy:7.4f}")

        if best_score < average_cv_accuracy:
            best_d = d
            best_ntrees = ntrees
            best_score = average_cv_accuracy

# 
# the "running max" above has found the best values; give them clearer names
#

best_depth = best_d   
best_num_trees = best_ntrees

print()
print(f"best_depth: {best_depth} and best_num_trees: {best_num_trees} are our choices with accuracy {best_score}.")  




depth:  1 ntrees:  50 cv accuracy:  0.4276
depth:  1 ntrees: 150 cv accuracy:  0.4010
depth:  1 ntrees: 250 cv accuracy:  0.3743
depth:  2 ntrees:  50 cv accuracy:  0.6800
depth:  2 ntrees: 150 cv accuracy:  0.6933
depth:  2 ntrees: 250 cv accuracy:  0.7352
depth:  3 ntrees:  50 cv accuracy:  0.7752
depth:  3 ntrees: 150 cv accuracy:  0.7905
depth:  3 ntrees: 250 cv accuracy:  0.8333
depth:  4 ntrees:  50 cv accuracy:  0.7914
depth:  4 ntrees: 150 cv accuracy:  0.8048
depth:  4 ntrees: 250 cv accuracy:  0.8324
depth:  5 ntrees:  50 cv accuracy:  0.8048
depth:  5 ntrees: 150 cv accuracy:  0.8333
depth:  5 ntrees: 250 cv accuracy:  0.8190
depth:  6 ntrees:  50 cv accuracy:  0.8457
depth:  6 ntrees: 150 cv accuracy:  0.8476
depth:  6 ntrees: 250 cv accuracy:  0.8619
depth:  7 ntrees:  50 cv accuracy:  0.8467
depth:  7 ntrees: 150 cv accuracy:  0.8895
depth:  7 ntrees: 250 cv accuracy:  0.8752
depth:  8 ntrees:  50 cv accuracy:  0.8600
depth:  8 ntrees: 150 cv accuracy:  0.8886
depth:  8 ntrees: 250 cv accuracy:  0.8752
depth:  9 ntrees:  50 cv accuracy:  0.8457
depth:  9 ntrees: 150 cv accuracy:  0.8610
depth:  9 ntrees: 250 cv accuracy:  0.8752

best_depth: 7 and best_num_trees: 150 are our choices with accuracy 0.8895238095238096.
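
The nested loop with a running max can also be flattened with `itertools.product`. In the sketch below, `grid_search` is a hypothetical helper and the scorer is a stand-in that looks up a few of the accuracies printed above rather than training anything:

```python
from itertools import product

def grid_search(score_fn, depths, tree_counts):
    """Try every (depth, ntrees) pair and keep the best-scoring one."""
    best_params, best_score = None, float('-inf')
    for d, ntrees in product(depths, tree_counts):
        score = score_fn(d, ntrees)
        if score > best_score:             # strict '>' keeps the first pair on ties
            best_params, best_score = (d, ntrees), score
    return best_params, best_score

# Stand-in scorer: a few cv accuracies copied from the output above, 0.0 elsewhere
printed = {(6, 250): 0.8619, (7, 150): 0.8895, (8, 150): 0.8886}
params, score = grid_search(lambda d, n: printed.get((d, n), 0.0),
                            range(1, 10), range(50, 300, 100))
print(params, score)
```

In the notebook, `score_fn` would wrap the `cross_val_score(...).mean()` call from the cell above.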


In [56]:
# Ok!  We have tuned our RF to use the "best" parameters
#
# Now, we use ALL available data to train our final predictive model:
#
from sklearn import tree      # for decision trees
from sklearn import ensemble  # for random forests

# we should have best_depth and best_num_trees
rforest_model_final = ensemble.RandomForestClassifier(max_depth=best_depth, 
                                                      n_estimators=best_num_trees,
                                                      max_samples=0.5)

# we train the model (it's one line!)
rforest_model_final.fit(X_all, y_all)              # yay!  trained!
print(f"Built an RF classifier with depth={best_depth} and ntrees={best_num_trees}") 

Built an RF classifier with depth=7 and ntrees=150


### The accuracy of our random forest model is 0.8 which is better than the DT but still not quite as good as KNN
I don't think I'll use trees as my final model. I just wanted to play with them to see if I could get any better results.


### Here are some feature importances of our random forest. They match up fairly well with what I'd generally expect to be important CSA decision-making features

In [57]:
#
# feature importances are often even more "important" than predictions...
#
#    Random forests can provide a much "smoother" measure of feature importance, since
#                   they integrate over so many individual models (each tree)
#
#    That is, it's much less likely that a feature will have 0% importance, unless it never varies
#

print(rforest_model_final.feature_importances_)
print()

# let's see them with each feature name:
IMPs = rforest_model_final.feature_importances_

# enumerate is great when you want indices _and_ elements!
for i, importance in enumerate(IMPs):
    perc = importance*100
    print(f"Feature {COLUMNS[i]:>12s} has {perc:>7.2f}% of the decision-making importance.")

[0.21250023 0.0533863  0.0673768  0.04521514 0.08734369 0.03054599
 0.01506936 0.09457764 0.0344614  0.06067773 0.01206868 0.11913924
 0.1676378 ]

Feature        price has   21.25% of the decision-making importance.
Feature home delivery has    5.34% of the decision-making importance.
Feature         SNAP has    6.74% of the decision-making importance.
Feature   vegetables has    4.52% of the decision-making importance.
Feature         meat has    8.73% of the decision-making importance.
Feature         eggs has    3.05% of the decision-making importance.
Feature         fish has    1.51% of the decision-making importance.
Feature        fruit has    9.46% of the decision-making importance.
Feature         milk has    3.45% of the decision-making importance.
Feature      sizenum has    6.07% of the decision-making importance.
Feature    seasonnum has    1.21% of the decision-making importance.
Feature frequencynum has   11.91% of the decision-making importance.
Feature  locationnum ha
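
To make the ranking easier to read, the importances can be paired with their feature names and sorted. This sketch uses the values from the printed array above, so it runs without the trained model:

```python
names = ['price', 'home delivery', 'SNAP', 'vegetables', 'meat', 'eggs', 'fish',
         'fruit', 'milk', 'sizenum', 'seasonnum', 'frequencynum', 'locationnum']

# Random-forest importances copied from the printed array
importances = [0.21250023, 0.0533863, 0.0673768, 0.04521514, 0.08734369, 0.03054599,
               0.01506936, 0.09457764, 0.0344614, 0.06067773, 0.01206868, 0.11913924,
               0.1676378]

# Pair each name with its importance, then sort from most to least important
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, imp in ranked:
    print(f"{name:>13s} {imp * 100:6.2f}%")
```

With the live model, `importances` would simply be `rforest_model_final.feature_importances_`.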

### Here are our DT feature importances. Again they seem to align with what I'd expect

In [58]:
#
# feature importances are often even more "important" than predictions...
#
#    Random forests can provide a much "smoother" measure of feature importance, since
#                   they integrate over so many individual models (each tree)
#
#    That is, it's much less likely that a feature will have 0% importance, unless it never varies
#

print(dtree_model_final.feature_importances_)
print()

# let's see them with each feature name:
IMPs = dtree_model_final.feature_importances_

# enumerate is great when you want indices _and_ elements!
for i, importance in enumerate(IMPs):
    perc = importance*100
    print(f"Feature {COLUMNS[i]:>12s} has {perc:>7.2f}% of the decision-making importance.")

[0.25128145 0.         0.10979295 0.         0.08893777 0.
 0.         0.11882297 0.         0.05007984 0.0243867  0.17843994
 0.17825838]

Feature        price has   25.13% of the decision-making importance.
Feature home delivery has    0.00% of the decision-making importance.
Feature         SNAP has   10.98% of the decision-making importance.
Feature   vegetables has    0.00% of the decision-making importance.
Feature         meat has    8.89% of the decision-making importance.
Feature         eggs has    0.00% of the decision-making importance.
Feature         fish has    0.00% of the decision-making importance.
Feature        fruit has   11.88% of the decision-making importance.
Feature         milk has    0.00% of the decision-making importance.
Feature      sizenum has    5.01% of the decision-making importance.
Feature    seasonnum has    2.44% of the decision-making importance.
Feature frequencynum has   17.84% of the decision-making importance.
Feature  locationnum has   17.8
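
To compare the features at a glance, the printed decision-tree importances can be ranked and the unused features listed. This is only a sketch: the `COLUMNS` list is retyped from the printed feature names, and the numbers are copied from the raw array above (the last value, for `locationnum`, comes from that array because its formatted line is cut off).

```python
# Names retyped from the printed output above (assumed to match COLUMNS in the notebook)
COLUMNS = ["price", "home delivery", "SNAP", "vegetables", "meat", "eggs", "fish",
           "fruit", "milk", "sizenum", "seasonnum", "frequencynum", "locationnum"]

# Decision-tree importances copied from the raw array printed above
dt_importances = [0.25128145, 0.0, 0.10979295, 0.0, 0.08893777, 0.0, 0.0,
                  0.11882297, 0.0, 0.05007984, 0.0243867, 0.17843994, 0.17825838]

# Pair each name with its importance and sort from most to least important
ranked = sorted(zip(COLUMNS, dt_importances), key=lambda pair: pair[1], reverse=True)

for name, importance in ranked:
    print(f"{name:>14s}  {importance * 100:6.2f}%")

# Features the single tree never split on
unused = [name for name, importance in ranked if importance == 0.0]
print("never used by the tree:", unused)
```

A single tree ignoring five features does not mean those features are useless. The random forest above gave every one of them a nonzero share.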

### Now it's time to play around with a deep learning model

In [59]:
nparr = numpy_it(df_model1)
X_all, y_all = data_definitions(13, nparr)
X_train, X_test, y_train, y_test = seperate_data(X_all, y_all)

+++ Start of data definitions +++

y_all (just the actual farms)   are 
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 1. 6. 1. 8. 8. 0. 3. 4. 2. 0. 5. 3. 6. 8. 0.
 6. 8. 4. 2. 7. 1. 9. 5. 5. 7. 4. 9. 6. 3. 1. 9. 2. 5. 9. 5. 4. 7. 9. 7.
 6. 5. 6. 8. 4. 2. 7. 1. 9. 5. 5. 7. 4. 9. 6. 3. 1. 9. 2. 5. 9. 5. 4. 7.
 9. 7. 6. 5. 0. 1. 2. 3. 4. 5. 6. 7. 2. 1. 6. 1. 8. 8.]
X_all (just the features) are 
 [[675.   0.   0. ...   0.   1.   1.]
 [500.   0.   1. ...   0.   1.   2.]
 [675.   0.   1. ...   0.   1.   0.]
 ...
 [500.   0.   0. ...   0.   1.   3.]
 [630.   1.   1. ...   0.   2.   1.]
 [630.   1.   1. ...   0.   2.   3.]]
The scrambled actual farms are 
 [1. 3. 2. 7. 4. 6. 5. 8. 5. 4. 5. 5. 9. 4. 6. 6. 7. 6. 2. 3. 9. 7. 3. 1.
 4. 1. 7. 5. 9. 2. 4. 9. 3. 0. 9. 2. 1. 9. 0. 4. 5. 2. 5. 8. 6. 0. 9. 8.
 1. 7. 6. 5. 6. 1. 5. 5. 4. 7. 2. 8. 1. 9. 8. 3. 7. 7. 6. 5. 6. 8. 2. 1.
 9. 5. 0. 1. 8. 4. 3. 7. 1. 4. 9. 6. 6. 2. 5. 7. 0. 8.]
The corresponding data rows are 
 [[200.   0.   0. ...   1.   1.   0.]
 [600.   0.  

In [60]:
#
# for NNets, it's important to keep the feature values near 0, say -1. to 1. or so
#    This is done through the "StandardScaler" in scikit-learn
# 
USE_SCALER = True   # this variable is important! It tracks if we need to use the scaler...

# we "train the scaler"  (computes the mean and standard deviation)
from sklearn.preprocessing import StandardScaler   # imported up front so both branches can use it

if USE_SCALER:
    scaler = StandardScaler()
    scaler.fit(X_train)  # Scale with the training data! ave becomes 0; stdev becomes 1
else:
    # this one does no scaling!  We still create it to be consistent:
    scaler = StandardScaler(copy=True, with_mean=False, with_std=False) # no scaling
    scaler.fit(X_train)  # still need to fit, though it does not change...

scaler   # is now defined and ready to use...

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++

# Here are our scaled training and testing sets:

X_train_scaled = scaler.transform(X_train) # scale!
X_test_scaled = scaler.transform(X_test) # scale!

y_train_scaled = y_train  # the predicted/desired labels are not scaled
y_test_scaled = y_test  # not using the scaler

def ascii_table(X,y):
    """ print a table of binary inputs and outputs """
    print(f"{'input ':>58s} -> {'pred':<5s} {'des.':<5s}") 
    for i in range(len(y)):
        print(f"{X[i,:]!s:>58s} -> {'?':<5s} {y[i]:<5.0f}")   # !s is str ...
    
ascii_table(X_train_scaled[0:5,:],y_train_scaled[0:5])

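#
# Note that each column now has mean 0 and stdev 1, so a 0 flag becomes
# a negative value and a 1 flag becomes a positive value; the exact
# values depend on how common the flag is in the training data
#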

                                                    input  -> pred  des. 
[-0.34791591 -0.66332496  1.02817453  0.51298918 -0.64168895 -0.32816506
 -0.20851441 -0.30151134 -0.27317918  0.86953871 -0.11867817 -1.50317161
 -1.03839072] -> ?     5    
[-1.04756085 -0.66332496 -0.97259753  0.51298918 -0.64168895 -0.32816506
 -0.20851441 -0.30151134 -0.27317918 -1.15003506 -0.11867817 -1.50317161
  1.6968824 ] -> ?     6    
[ 0.28087891  1.50755672 -0.97259753 -1.94935887  1.55838744  3.047247
 -0.20851441 -0.30151134 -0.27317918  0.86953871 -0.11867817  1.54551447
 -0.12663301] -> ?     7    
[ 1.94585673 -0.66332496 -0.97259753  0.51298918 -0.64168895 -0.32816506
 -0.20851441 -0.30151134 -0.27317918  0.86953871 -0.11867817  0.02117143
 -1.03839072] -> ?     5    
[ 0.34730089 -0.66332496 -0.97259753  0.51298918 -0.64168895 -0.32816506
 -0.20851441 -0.30151134 -0.27317918 -1.15003506 -0.11867817  0.02117143
 -0.12663301] -> ?     0    
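
To see what `StandardScaler` does to a single 0/1 column, here is a small sketch on a made-up flag column (not the CSA data) in which 2 of 10 rows are 1. The column mean is 0.2 and its standard deviation is 0.4, so each value is shifted by 0.2 and divided by 0.4.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up yes/no column: 2 ones out of 10 rows (NOT taken from the CSA data)
flag = np.array([[0.], [0.], [0.], [1.], [0.], [1.], [0.], [0.], [0.], [0.]])

scaled = StandardScaler().fit_transform(flag)

# The two distinct values after scaling: (0 - 0.2) / 0.4 and (1 - 0.2) / 0.4
distinct = np.unique(scaled.round(6))
print(distinct)                        # -0.5 for the zeros, 2.0 for the ones
print(scaled.mean(), scaled.std())     # approximately 0 and 1
```

The rarer a flag is, the larger the scaled value of its 1s, which is why the columns in the table above do not share the same pair of values.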


In [61]:
#
# MLPRegressor predicts _floating-point_ outputs
#

from sklearn.neural_network import MLPRegressor

nn_regressor = MLPRegressor(hidden_layer_sizes=(6,7), 
                    max_iter=500,          # how many training epochs
                    activation="tanh",     # the activation function
                    solver='sgd',          # the optimizer
                    verbose=True,          # do we want to watch as it trains?
                    shuffle=True,          # shuffle each epoch?
                    random_state=None,     # use for reproducibility
                    learning_rate_init=.1, # how much of each error to back-propagate
                    learning_rate = 'adaptive')  # how to handle the learning_rate

print("\n\n++++++++++  TRAINING:  begin  +++++++++++++++\n\n")
nn_regressor.fit(X_train_scaled, y_train_scaled)
print("++++++++++  TRAINING:   end  +++++++++++++++")

print(f"The (squared) prediction error (the loss) is {nn_regressor.loss_}")
print(f"And, its square root: {nn_regressor.loss_ ** 0.5}")




++++++++++  TRAINING:  begin  +++++++++++++++


Iteration 1, loss = 18.25988923
Iteration 2, loss = 8.17823935
Iteration 3, loss = 3.30166136
Iteration 4, loss = 3.42356039
Iteration 5, loss = 2.69901624
Iteration 6, loss = 2.05874451
Iteration 7, loss = 1.88228913
Iteration 8, loss = 1.58136247
Iteration 9, loss = 1.39568200
Iteration 10, loss = 1.30287873
Iteration 11, loss = 1.20113885
Iteration 12, loss = 1.13339395
Iteration 13, loss = 1.07328313
Iteration 14, loss = 1.02149046
Iteration 15, loss = 0.97243366
Iteration 16, loss = 0.92135920
Iteration 17, loss = 0.86776191
Iteration 18, loss = 0.81573343
Iteration 19, loss = 0.77109261
Iteration 20, loss = 0.73941211
Iteration 21, loss = 0.72075700
Iteration 22, loss = 0.70969223
Iteration 23, loss = 0.70036512
Iteration 24, loss = 0.68891385
Iteration 25, loss = 0.67437988
Iteration 26, loss = 0.65778719
Iteration 27, loss = 0.64053317
Iteration 28, loss = 0.62340445
Iteration 29, loss = 0.60633342
Iteration 30, loss = 0.5885772



### Error tracker
+ hidden layers: 6,7    loss: 0.05179
+ hidden layers: 6,7,8  loss: 0.15812
+ hidden layers: 5,6    loss: 0.15775
+ hidden layers: 7,8    loss: 0.15584
+ hidden layers: 7      loss: 0.19848
+ hidden layers: 6,7,2  loss: 0.18585
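
The tracker above was filled in by hand, one run at a time. A loop can collect the same numbers automatically. The sketch below is an illustration only: it uses synthetic stand-in data (`X_demo`, `y_demo`, both hypothetical) rather than the CSA training set, fixes `random_state` so the runs are comparable, and uses a smaller `learning_rate_init` than the cell above to keep the toy run stable.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the same shape idea: 13 features, labels 0..9
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 13))
y_demo = rng.integers(0, 10, size=90).astype(float)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# The same hidden-layer shapes that appear in the error tracker
architectures = [(6, 7), (6, 7, 8), (5, 6), (7, 8), (7,), (6, 7, 2)]

losses = {}
for hidden in architectures:
    nn = MLPRegressor(hidden_layer_sizes=hidden,
                      max_iter=300,
                      activation="tanh",
                      solver="sgd",
                      learning_rate_init=0.01,   # assumption: smaller than the 0.1 used above
                      learning_rate="adaptive",
                      random_state=0)            # fixed seed so runs can be compared
    nn.fit(X_demo_scaled, y_demo)
    losses[hidden] = nn.loss_                    # final training loss

# Print the tracker, best (lowest) training loss first
for hidden, loss in sorted(losses.items(), key=lambda kv: kv[1]):
    print(f"hidden layers: {str(hidden):<12s} loss: {loss:.5f}")
```

Because `random_state=None` in the original cell, two runs of the same architecture can give different losses, so part of the spread in the tracker may be due to the random starting weights rather than the layer sizes. These are also training losses; comparing on held-out data would be a fairer test.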


In [62]:
#
# how did it do? now we're making progress (by regressing)
#

def ascii_table_for_regressor(Xsc,y,nn,scaler):
    """ a table including predictions using nn.predict """
    predictions = nn.predict(Xsc) # all predictions
    Xpr = scaler.inverse_transform(Xsc)  # Xpr is the "X to print": unscaled data!
    # measure error
    error = 0.0
    # printing
    print(f"{'input ':>28s} ->  {'pred':^6s}  {'des.':^6s}  {'absdiff':^10s}") 
    for i in range(len(y)):
        pred = predictions[i]
        desired = y[i]
        result = abs(desired - pred)
        error += result
        # Xpr = Xsc   # if you'd like to see the scaled values
        print(f"{Xpr[i,:]!s:>28s} ->  {pred:<+6.3f}  {desired:<+6.3f}  {result:^10.3f}") 

    print("\n" + "+++++   +++++      +++++   +++++   ")
    print(f"average abs error: {error/len(y)}")
    print("+++++   +++++      +++++   +++++   ")
    
#
# let's see how it did on the test data (also the training data!)
#
ascii_table_for_regressor(X_test_scaled,
                          y_test_scaled,
                          nn_regressor,
                          scaler)   # this is our own f'n, above
#
# other things...
#
if False:  # do we want to see these details?
    nn = nn_regressor  # less to type?
    print("\n\n+++++ parameters, weights, etc. +++++\n")
    print(f"\nweights/coefficients:\n")
    for wts in nn.coefs_:
        print(wts)
    print(f"\nintercepts: {nn.intercepts_}")
    print(f"\nall parameters: {nn.get_params()}")

                      input  ->   pred    des.    absdiff  
[3.40000000e+02 0.00000000e+00 0.00000000e+00 1.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.11022302e-16
 1.00000000e+00] ->  +2.239  +2.000    0.239   
[930.   1.   1.   0.   1.   1.   0.   0.   0.   1.   0.   2.   2.] ->  +8.968  +8.000    0.968   
[200.   0.   0.   1.   0.   0.   0.   0.   0.   0.   1.   1.   0.] ->  +1.056  +1.000    0.056   
[500.   0.   0.   1.   0.   0.   0.   1.   0.   0.   0.   1.   1.] ->  +5.164  +4.000    1.164   
[675.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   1.   1.] ->  +0.063  +0.000    0.063   
[1000.    0.    0.    1.    0.    0.    0.    1.    0.    1.    0.    1.
    1.] ->  +4.056  +4.000    0.056   
[475.   0.   1.   1.   0.   1.   0.   0.   0.   0.   0.   1.   1.] ->  +6.046  +9.000    2.954   
[475.   0.   1.   1.   0.   1.   0.   0.   0.   0.   0.   1.   1.] ->  +6.046  +9.000    2.954   
[3.600000

### This avg abs error of 2.558 is not ideal. I think I just do not have enough data to get great results
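
One way to compare the regressor with a classifier like KNN is to round each floating-point prediction to the nearest farm index and count exact matches. The sketch below uses only the eight rows that are visible in the truncated table above, so its numbers will not match the 2.558 average over the whole test set.

```python
import numpy as np

# Predictions and desired labels copied from the eight visible rows of the table above
pred = np.array([2.239, 8.968, 1.056, 5.164, 0.063, 4.056, 6.046, 6.046])
desired = np.array([2.0, 8.0, 1.0, 4.0, 0.0, 4.0, 9.0, 9.0])

# Average absolute error, as in ascii_table_for_regressor
mean_abs_error = np.mean(np.abs(pred - desired))

# Round to the nearest farm index and check which rows are exactly right
rounded = np.rint(pred)
accuracy = np.mean(rounded == desired)

print(f"mean abs error on these rows: {mean_abs_error:.3f}")
print(f"accuracy after rounding:      {accuracy:.2f}")
```

The farm numbers are labels rather than quantities, so being "off by 1" is not really closer than being off by 3. That is one possible reason a classifier suits this task better than a regressor, independent of the amount of data.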

### Out of the models that I've tried, I think KNN does best with my limited data. However, I want to see if I can get more data and then play around more with my neural net model. I don't think I will continue exploring decision trees.
If I ever get a lot of data, I think a deep learning model would be the way to go, but with my current dataset I will stick with a KNN model for my Flask app.

### Let's escape a little from the bounds of this course and dip into TensorFlow to try another model

In [63]:
import tensorflow as tf
from tensorflow.keras import layers

### Here is a non-normalized model

In [64]:
CSA_model = tf.keras.Sequential([
  layers.Dense(64),
  layers.Dense(1)
])

CSA_model.compile(loss = tf.keras.losses.MeanSquaredError(),
                      optimizer = tf.optimizers.Adam())

In [65]:
CSA_model.fit(X_train, y_train, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x1e18891ce20>

### Oof that was bad. Let's normalize

In [66]:
normalize = layers.Normalization()

normalize.adapt(X_train)

In [67]:
norm_CSA_model = tf.keras.Sequential([
  normalize,
  layers.Dense(64),
  layers.Dense(1)
])

norm_CSA_model.compile(loss = tf.losses.MeanSquaredError(),
                           optimizer = tf.optimizers.Adam())

norm_CSA_model.fit(X_train, y_train, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x1e189bbaca0>


### Okay this model still seems bad. I think I'll stick to my KNN model or my NN regressor
I just wanted to experiment with a new model that we did not use in class, and maybe I did not experiment fully enough, but without a better data set I will not be using this model.

I looked back at my data to see where I could beef up my dataset. It seems like Full Cellar Farm, Lil Starts Farm, and Three Goats Farm are underrepresented, but all of the farms would benefit from more data points.

### Let's save our KNN model. I will be using pickle! What a fun library name

In [68]:
import pickle

# Saving our model (the with-block closes the file for us)
with open('model.pkl', 'wb') as f:
    pickle.dump(knn_model_final, f)

Look, my pickled model works!

In [69]:
with open('model.pkl', 'rb') as f:
    pickled_model = pickle.load(f)
print(pickled_model.predict(X_test))
print(y_test)

[2. 8. 1. 4. 0. 4. 9. 9. 6. 4. 5. 9. 9. 6. 8. 4. 0. 7.]
[2. 8. 1. 4. 0. 4. 9. 9. 6. 4. 5. 9. 9. 6. 8. 4. 0. 7.]
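
The same save-and-reload check can be written as a self-contained round trip. This is a sketch under assumptions: it trains a `KNeighborsClassifier` on synthetic stand-in data (the notebook's `knn_model_final` is assumed to be a model of this kind) and writes to a temporary folder instead of `model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 13 features, farm labels 0..9 (NOT the CSA data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 13))
y_demo = rng.integers(0, 10, size=40)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:      # save
        pickle.dump(knn, f)
    with open(path, "rb") as f:      # reload
        restored = pickle.load(f)

# The reloaded model should predict exactly what the original predicts
same = np.array_equal(knn.predict(X_demo), restored.predict(X_demo))
print("round trip preserved predictions:", same)
```

Unpickling runs code stored in the file, so a Flask app should only ever load a `model.pkl` that it created itself.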


### Now that I have some sample models, let's move on to Flask
I'll be following this as a reference: https://www.geeksforgeeks.org/deploy-machine-learning-model-using-flask/
My Flask explorations and code will be in separate files, not in this notebook.
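
Since the Flask code is not in this notebook, here is a minimal sketch of what a prediction route could look like. It is not the actual app: the `/predict` route, the `FEATURES` list, and the helper names `get_model` and `form_to_features` are all hypothetical, and it assumes the HTML form sends one numeric field per feature, already encoded the way the model was trained.

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request

# Assumed feature order, retyped from the feature names printed earlier
FEATURES = ["price", "home delivery", "SNAP", "vegetables", "meat", "eggs", "fish",
            "fruit", "milk", "sizenum", "seasonnum", "frequencynum", "locationnum"]

app = Flask(__name__)
app.config["MODEL_PATH"] = "model.pkl"   # the file written by pickle.dump
app.config["MODEL"] = None               # filled in on the first request


def get_model():
    """Load the pickled model once and reuse it for later requests."""
    if app.config["MODEL"] is None:
        with open(app.config["MODEL_PATH"], "rb") as f:
            app.config["MODEL"] = pickle.load(f)
    return app.config["MODEL"]


def form_to_features(form):
    """Turn form fields into a (1, 13) float array; missing or empty fields count as 0."""
    return np.array([[float(form.get(name) or 0) for name in FEATURES]])


@app.route("/predict", methods=["POST"])
def predict():
    features = form_to_features(request.form)
    farm_index = int(get_model().predict(features)[0])
    return jsonify({"farm_index": farm_index})
```

If this were saved as `app.py`, it could be started with `flask --app app run`. Mapping `farm_index` back to a farm name would need the same encoding that was used when the `Farm` column was converted to numbers.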

### Current capabilities
I have 5 working ML models trained on my CSA data set. This dataset could use more data to improve these models, but for the scope of this project I think 85% accuracy with my KNN model is acceptable. I also have a Flask app. It is very ugly but it works and will predict a farm based on inputs!

Surprisingly, my plan went quite smoothly. I had a few issues figuring out TensorFlow and Keras, but after a bit of troubleshooting I got things working. Most of my difficulties were with Flask. I ended up half following the online tutorial I found, and when things didn't work I referenced the Flask app we worked with during week 2.

Honestly, I am pretty happy with how this week has gone. I did not expect to get this much accomplished. I have pretty much done what I initially set out to do. Of course, everything is rough, but all of the building blocks are there and what's left is refinement.

### Future Improvements
In the future, one big improvement would be data, data, data! The first step could be adding more rows to the CSV data. The second would be using the inputs from users of the Flask app to further train the model. For example, every 10 user entries could be used to retrain the model. The issue is that the model's own predictions would become the training labels, so any current inaccuracies would be amplified. To avoid that, a human could assign a farm to each saved input before that data is added to the training set.
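
The human-in-the-loop idea could be sketched with a small log file: each user entry is saved with the model's guess and an empty column for a person to fill in, and only the confirmed rows are ever handed back for retraining. The file layout, column names, and function names here are hypothetical.

```python
import csv
from pathlib import Path

# Assumed feature order, retyped from the feature names printed earlier
FEATURES = ["price", "home delivery", "SNAP", "vegetables", "meat", "eggs", "fish",
            "fruit", "milk", "sizenum", "seasonnum", "frequencynum", "locationnum"]


def log_entry(path, features, predicted_farm):
    """Append one user submission; 'confirmed_farm' stays blank until a human reviews it."""
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(FEATURES + ["predicted_farm", "confirmed_farm"])
        writer.writerow(list(features) + [predicted_farm, ""])


def load_confirmed(path):
    """Return only the rows a human has labeled, ready to add to the training set."""
    with Path(path).open(newline="") as f:
        return [row for row in csv.DictReader(f) if row["confirmed_farm"] != ""]
```

Keeping `predicted_farm` next to `confirmed_farm` would also show how often the model and the human disagree, which gives a running accuracy estimate on real users.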

Another improvement could be made to the Flask app GUI. Right now my Flask app is very ugly. It turns out I'm a backend girlie and don't want to mess around with HTML.

In the coming week I think I will force myself to make the GUI less bad. I will also collect a few supplementary data points to test my model with.