# 👋 Introduction

### Who are we? 

We are Berkeley undergraduates working with Viviendas Leon, a nonprofit dedicated to eliminating rural poverty in Nicaragua and Guatemala. 

`Team lead`: Elda Pere

`Team members`: Lauren Faulds, Chase Elements, Barnett (Adam) Yang, Kathryn (Katie) Byers, Eva Sidlo, Kelly Trinh 


### Questions to address: 
Given location, soil, and weather data, which crops should a farmer plant that would be most resilient to disease?

### Dataset description: 
`Data source`: Viviendas Leon 

The dataset contains information on crop disease percentage, crop conditions, and any recommendations made. We also scraped weather data (dew point, temperature, precipitation, and more) and appended it to our dataset.
 
### Objectives: 
1. Clean the data: account for missing values, fix inconsistent names, translate Spanish to English, and scrape weather data to supplement the dataset.

2. Perform exploratory data analysis to find trends among crop type, effectiveness of recommendations, geographical area, and disease percentage.

3. Build a predictor with the following parameters:
- Input: soil, temperature, and weather condition
- Output: the top 3 specific crops and the best general crop type for the given conditions. This model works with 4 general crop types: `fruits`, `vegetables`, `legumes & seeds`, and `grasses`. Examples of specific crops are papayas, tomatoes, and onions.

We will build a scoring system and a machine learning model. **Note: both the scoring system and the machine learning model will rank the specific crops and assign each crop a score based on how suitable it is for the given weather conditions. From this rank and score, we will extract the top 3 specific crops and the general type of crop.** Our goal is to combine the predictions of the scoring system and the machine learning model.
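As a rough illustration of the ranking step described above, the sketch below takes a crop-to-score mapping and pulls out the top 3 crops and the best-scoring general type. The scores and the helper name `top_n` are made up for this example; the notebook's own scoring system and model produce the real values later on.

```python
# Hypothetical suitability scores (higher is better); not real project output.
crop_scores = {"Papaya": 0.31, "Tomate": 0.27, "Cebolla": 0.22, "Frijol": 0.12, "Zacate": 0.08}
type_scores = {"fruits": 0.35, "vegetables": 0.40, "legumes & seeds": 0.15, "grasses": 0.10}

def top_n(scores, n=3):
    """Return the n keys with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

top_crops = top_n(crop_scores)          # three best specific crops
best_type = top_n(type_scores, n=1)[0]  # single best general type
print(top_crops, best_type)
```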

Viviendas Leon will use this model by inputting their own real-time data, and the predictor will output the above predictions. These predictions will help the organization make better recommendations to the farmers it works with.

### Outline of this notebook:
1. Brief data cleaning  
2. Scoring system 
3. Machine learning model
4. Predictor function that combines the scoring system and the machine learning model to produce the prediction described in objective 3.
5. Appendix 

  (sections are not placed in the order they are executed)

  5.1 Machine learning model selection 

  5.2 Hyperparameter tuning

  5.3 Data processing 





### Import Libraries

In [None]:
import pandas as pd
import numpy as np
from google.colab import (drive, files)
from datetime import datetime, timedelta
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, accuracy_score, confusion_matrix, f1_score, precision_score, recall_score, make_scorer
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RepeatedStratifiedKFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.utils import compute_class_weight, compute_sample_weight
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler 
from collections import defaultdict
from sklearn import metrics




### 👀 Read Data

`VL_farm_geo_w.csv` is the dataset that went through initial cleaning. 

Each row represents a visit from Viviendas Leon to a family. The row has information on the crop (condition, percent disease, type of crop) and weather conditions such as dew point and heat index.


The data spans from 2017 to 2021. 



In [None]:
# connect to the drive
drive.mount('/content/gdrive')
path = "/content/gdrive"
chases_path = path + "/MyDrive"
kellys_path = path + "/MyDrive/VL_Data/Farming_Data"
os.chdir(chases_path)


In [None]:
#df = pd.read_csv('/content/gdrive/MyDrive/VL_farm.csv') # read Kelly and Lauren's updated data

df= pd.read_csv('VL_farm_geo_w.csv', index_col=0)


# map the month of the visit to a season
def season(month):
    if month == 12 or month == 1 or month == 2:
        return "Winter"
    elif month == 3 or month == 4 or month == 5:
        return "Spring"
    elif month == 6 or month == 7 or month == 8:
        return "Summer"
    elif month == 9 or month == 10 or month == 11:
        return "Fall"
df["Season visited"] = df["Month visited"].apply(season)

# 'Bueno'== good, '0'== null, 'Promedio'==average, 'Excelente'==excellent, 'Pobre'==poor, 'Excel', 'crisopa'
df['Condition'] = df['Condition'].replace(['Excel'], 'Excelente')
df['Condition'] = df['Condition'].replace(['Excelente'], "excellent_cond")
df['Condition'] = df['Condition'].replace(['Promedio'], 'average_cond')
df['Condition'] = df['Condition'].replace(['Bueno'], 'good_cond')
df['Condition'] = df['Condition'].replace(['Pobre'], 'poor_cond')
df['Condition'] = df['Condition'].replace(['crisopa'], 'bad?_cond')

df['Condition'] = df['Condition'].replace([0], 'N/A_cond')
df['Seedling_or_transplanted'].unique()
df['Seedling_or_transplanted'] = df['Seedling_or_transplanted'].replace(['Almácigo'], 'seedling')
df['Seedling_or_transplanted'] = df['Seedling_or_transplanted'].replace(['Transplantado'], 'transplanted')
df['Seedling_or_transplanted'] = df['Seedling_or_transplanted'].replace(['Sin germinar'], 'transplanted')
df['Seedling_or_transplanted'] = df['Seedling_or_transplanted'].replace(['Fructificacion'], 'fruitification')
df['Seedling_or_transplanted'] = df['Seedling_or_transplanted'].replace(['Produccion'], 'production')
df

# drop rows with unusual values
df = df[df.Condition != 'bad?_cond']
df = df[df['Seedling_or_transplanted'] != 'fruitification']
df = df[df['Seedling_or_transplanted'] != 'production']

# Correct some crop spellings
df['Crop'] = df['Crop'].replace(['Calabasa'], 'Calabaza')
df['Crop'] = df['Crop'].replace(['Caña'], 'Caña de azucar')
df['Crop'] = df['Crop'].replace(['Verngena', 'Verenjena', 'verenjena', 'verengena'], 'Verengena')
df['Crop'] = df['Crop'].replace(['Rabano'], 'Rábano')
df['Crop'] = df['Crop'].replace(['zanahoria'], 'Zanahoria')

# preview of the data
print('Data Shape', df.shape)
hide_location = (df.columns != "Region") & (df.columns != "Community") & (df.columns != "location") & (df.columns != "longitude") & (df.columns !="latitude")
df.loc[0:5, hide_location]

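The chain of `replace` calls in the cell above could also be written as one dictionary per column, which keeps every translation in a single place. This is only a sketch on toy rows: the `toy` frame and the names `CONDITION_MAP` and `SEASON_MAP` are assumptions for illustration, and the month grouping simply copies the `season` function above.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Month visited": [1, 4, 7, 10],
    "Condition": ["Excel", "Bueno", "Promedio", "Pobre"],
})

# Spanish labels (and the truncated 'Excel') mapped to the English labels used above.
CONDITION_MAP = {
    "Excel": "excellent_cond",
    "Excelente": "excellent_cond",
    "Bueno": "good_cond",
    "Promedio": "average_cond",
    "Pobre": "poor_cond",
}
# Same month-to-season grouping as the season() function.
SEASON_MAP = {12: "Winter", 1: "Winter", 2: "Winter",
              3: "Spring", 4: "Spring", 5: "Spring",
              6: "Summer", 7: "Summer", 8: "Summer",
              9: "Fall", 10: "Fall", 11: "Fall"}

toy["Condition"] = toy["Condition"].replace(CONDITION_MAP)
toy["Season visited"] = toy["Month visited"].map(SEASON_MAP)
print(toy)
```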

### Data Preview

Looking at ordinal variable Crop Condition and % Illness

In [None]:
# 0 = No entry 
df['Condition'].value_counts().plot(kind='bar')
plt.title("Overview of condition of all the crops")
plt.ylabel("Count")
plt.xlabel("Condition");

In [None]:
plt.figure(figsize=(10,5))

dsh = sns.lineplot(x="Month visited", y="% Disease"
             ,data=df)
plt.title("% Disease of all the crops in each month of a year");

In [None]:
plt.title("Count of types of crops")
plt.xlabel("Crop type")
plt.ylabel("Count")
df['Type'].value_counts().plot(kind='bar');

In [None]:
plt.figure(figsize=(15, 5))
plt.title("Count of each crop type")
plt.xlabel("Specific crop")
plt.ylabel("Count")
ax = df['Crop'].value_counts().plot(kind='bar');
#ax.set_xticks(df['Crop'].value_counts().values)
print('Crops to predict: ' , df['Crop'].nunique())

# ⚙️ Feature Engineering

This step determines which variables are not relevant for the model (using correlations, interviews with staff, etc.).
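One way to do the correlation part of this screening is to rank numeric features by their absolute correlation with `% Disease`. The sketch below uses a small invented frame; the column names match the dataset, but the numbers are invented and the choice of `% Disease` as the reference column is an assumption.

```python
import pandas as pd

# Invented values, only to show the mechanics.
toy = pd.DataFrame({
    "% Disease": [5, 10, 20, 40, 50],
    "DewPointC": [18, 19, 21, 24, 25],
    "sunHour": [11, 10, 11, 10, 11],
})

# Absolute Pearson correlation of every other column with % Disease, strongest first.
corr_with_disease = (
    toy.corr()["% Disease"].drop("% Disease").abs().sort_values(ascending=False)
)
print(corr_with_disease)
```

Features near the bottom of such a list are candidates for removal, although a low linear correlation alone does not prove that a feature is irrelevant.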




### Add Additional Variables

In [None]:
# Combined Wellness metric using % Illness and Condition

def condition_percentage (row):
   if row['Condition'] == 'excellent_cond' :
      return 1
   if row['Condition'] == 'good_cond' :
      return .90
   if row['Condition'] == 'average_cond':
      return .80
   if row['Condition']  == 'poor_cond':
      return .70
   if row['Condition'] == '0':
      return 1
   return .80

def condition_wellness_columns(df):
    df['Percent_Condition'] = df.apply(lambda row: condition_percentage(row), axis=1)
    df['Percent_wellness'] = 100 - df['% Disease']
    df['Wellness_Condition'] = df['Percent_wellness'] * df['Percent_Condition']
    return df
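To make the combined metric concrete, here is the same arithmetic on single values rather than a dataframe. The multipliers are copied from `condition_percentage`; the function name `wellness_condition` and the dictionary are only for this sketch.

```python
# Multipliers copied from condition_percentage; any other label falls back to 0.80.
CONDITION_MULTIPLIER = {
    "excellent_cond": 1.0,
    "good_cond": 0.90,
    "average_cond": 0.80,
    "poor_cond": 0.70,
    "0": 1.0,
}

def wellness_condition(percent_disease, condition):
    """(100 - % Disease) scaled by the condition multiplier."""
    percent_wellness = 100 - percent_disease
    return percent_wellness * CONDITION_MULTIPLIER.get(condition, 0.80)

# A crop with 10% disease in good condition: 90 * 0.90
print(wellness_condition(10, "good_cond"))
# A disease-free crop in excellent condition keeps the full score: 100 * 1.0
print(wellness_condition(0, "excellent_cond"))
```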

In [None]:
df = condition_wellness_columns(df)


# One hot encoding
df = df.join(pd.get_dummies(df["Season visited"], prefix="Season"), how = 'outer')
df = df.join(pd.get_dummies(df["Condition"], prefix="Condition"), how = 'outer')
df = df.join(pd.get_dummies(df["Region"], prefix="Region"), how= 'outer')
df = df.join(pd.get_dummies(df["location"], prefix="Location"), how = "outer")
df = df.join(pd.get_dummies(df["Seedling_or_transplanted"], prefix="Trans_or_seed"), how = "outer")

# Drop now one-hot encoded columns
# df = df.drop(["Season visited", "Condition", "Region", "Plague", "location", "Seedling_or_transplanted", "Organic recommendation", "Chemical recommendation"], axis = 1)

In [None]:
# For each crop, 
df1 = df[['Region', 'Crop']]
df1.groupby(['Crop']).count()

In [None]:
# drop crops with only 1 count
df1 = df1.drop(df1.index[[4173, 5141, 2249, 6539, 4317, 3817, 6403, 5476, 311, 6697, 3002, 2908, 4494]])
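The hard-coded positions above only work for one exact version of the data. A sketch of an alternative that finds single-observation crops from the counts is shown below, on a toy frame; the variable names are assumptions, and the idea is the same one `remove_low_crops` uses later with a threshold of 10.

```python
import pandas as pd

# Toy stand-in for df1 (Region and Crop columns only).
toy = pd.DataFrame({
    "Region": ["Goyena", "Troilo", "Goyena", "Troilo"],
    "Crop": ["Tomate", "Tomate", "Papaya", "Cebolla"],
})

# Crops that appear only once in the data.
counts = toy["Crop"].value_counts()
rare_crops = counts[counts <= 1].index

# Keep the rows whose crop is not in the rare list.
filtered = toy[~toy["Crop"].isin(rare_crops)]
print(filtered)
```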

In [None]:
df.to_csv('vl_geow_f.csv')

# 💯 Scoring System

Required files:
- `VL_farm_geo_w.csv`: A cleaned version of the raw Excel datasets

Steps:
- Initialize a new scoring function with `score = init_score(df)`, where df is the cleaned pandas dataframe (e.g. loaded from `VL_farm_geo_w.csv`)
    - You can also set df to be any dataframe with the same column names, but you may have to change the default values for crops, regions, and communities when you call the score function.
- To create a new score object, call `result = score(region, community)`, where region and community are optional parameters
- To get rankings of n crops, call `result.get_best_composite(n)` to get list of crops with best composite scores (includes percent diseased, condition, region, and community scores (if applicable)), `result.get_best_region(n)` to get list of crops with best region scores (if region was specified in the above step), and `result.get_best_community(n)` to get list of crops with best community scores (if community was specified in the above step).
    - By default, n is the number of unique crops in the dataset
    - Highest scoring crops are listed first in the returned list
    
Function Descriptions:
- `get_best_composite`: Ranks crops using composite scores based on condition, percent diseased, region (if applicable), and community (if applicable). The ranking of each crop corresponds to its order in the returned array (i.e. best to worst order). Uses the dictionary `comp_scores` to inform its rankings (the higher the score, the better the crop's rank).
- `get_best_region`: Ranks crops using composite scores based on region (if applicable). The ranking of each crop corresponds to its order in the returned array. Uses the dictionaries `reg_cond_scores` (for conditions) and `reg_dis_scores` (for percent diseased) to inform its rankings.
- `get_best_community`: Ranks crops using composite scores based on community (if applicable). The ranking of each crop corresponds to its order in the returned array. Uses the dictionaries `com_cond_scores` (for conditions) and `com_dis_scores` (for percent diseased) to inform its rankings.
- `get_best_type_composite`: Ranks crop types (e.g. "Veg", "Grains", etc.) using composite scores based on condition, percent diseased, region (if applicable), and community (if applicable). The ranking of each crop type corresponds to its order in the returned array. Uses the dictionary `type_comp_scores` to inform its rankings (the higher the score, the better the crop type's rank).
- `get_best_type_region`: Ranks crop types using composite scores based on region (if applicable). The ranking of each crop type corresponds to its order in the returned array. Uses the dictionaries `type_reg_cond_scores` (for conditions) and `type_reg_dis_scores` (for percent diseased) to inform its rankings.
- `get_best_type_community`: Ranks crop types using composite scores based on community (if applicable). The ranking of each crop type corresponds to its order in the returned array. Uses the dictionaries `type_com_cond_scores` (for conditions) and `type_com_dis_scores` (for percent diseased) to inform its rankings.
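The composite ranking described above boils down to normalizing each component so it sums to 1, adding the components per crop, and sorting. The sketch below shows that on three crops with invented raw scores; it re-implements the idea in isolation and is not the notebook's `init_score`.

```python
def normalize(scores, target=1.0):
    """Rescale the values so they sum to `target`."""
    total = sum(scores.values())
    return {key: value * target / total for key, value in scores.items()}

# Invented raw component scores for three crops.
cond_scores = normalize({"Tomate": 1.5, "Papaya": 0.5, "Cebolla": 2.0})
dis_scores = normalize({"Tomate": 90.0, "Papaya": 60.0, "Cebolla": 50.0})

# Composite = sum of the normalized components, normalized again.
comp_scores = normalize({crop: cond_scores[crop] + dis_scores[crop] for crop in cond_scores})

# Highest composite score first, as in get_best_composite.
ranking = sorted(comp_scores, key=lambda crop: -comp_scores[crop])
print(ranking)
```

Because every component is rescaled to the same total, condition and percent diseased contribute equally to the composite regardless of their original units.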

In [None]:
class ScoreResult:
    def __init__(
        self, 
        comp_scores, 
        cond_scores, 
        per_dis_scores, 
        reg_cond_scores, 
        reg_per_dis_scores, 
        com_cond_scores, 
        com_per_dis_scores,
        type_comp_scores,
        type_cond_scores,
        type_per_dis_scores,
        type_reg_cond_scores,
        type_reg_per_dis_scores,
        type_com_cond_scores,
        type_com_per_dis_scores,
        crops,
        types
    ):
        self.comp_scores = comp_scores
        self.cond_scores = cond_scores
        self.per_dis_scores = per_dis_scores
        self.reg_cond_scores = reg_cond_scores
        self.reg_dis_scores = reg_per_dis_scores
        self.com_cond_scores = com_cond_scores
        self.com_dis_scores = com_per_dis_scores
        self.type_comp_scores = type_comp_scores
        self.type_cond_scores = type_cond_scores
        self.type_per_dis_scores = type_per_dis_scores
        self.type_reg_cond_scores = type_reg_cond_scores
        self.type_reg_dis_scores = type_reg_per_dis_scores
        self.type_com_cond_scores = type_com_cond_scores
        self.type_com_dis_scores = type_com_per_dis_scores
        self.crops = crops
        self.types = types
        
    def get_best_composite(self, n=None):
        if n == None:
            n = len(self.crops)
        crops = self.crops.copy()
        crops.sort(key=lambda x: -self.comp_scores[x])
        return crops[:n]
    
    def get_best_region(self, n=None):
        if n == None:
            n = len(self.crops)
        crops = self.crops.copy()
        crops.sort(key=lambda x: -(self.reg_cond_scores[x] + self.reg_dis_scores[x]))
        return crops[:n]
    
    def get_best_community(self, n=None):
        if n == None:
            n = len(self.crops)
        crops = self.crops.copy()
        crops.sort(key=lambda x: -(self.com_cond_scores[x] + self.com_dis_scores[x]))
        return crops[:n]
    
    def get_best_type_composite(self, n=None):
        if n == None:
            n = len(self.types)
        types = self.types.copy()
        types.sort(key=lambda x: -self.type_comp_scores[x])
        return types[:n]
    
    def get_best_type_region(self, n=None):
        if n == None:
            n = len(self.types)
        types = self.types.copy()
        types.sort(key=lambda x: -(self.type_reg_cond_scores[x] + self.type_reg_dis_scores[x]))
        return types[:n]
    
    def get_best_type_community(self, n=None):
        if n == None:
            n = len(self.types)
        types = self.types.copy()
        types.sort(key=lambda x: -(self.type_com_cond_scores[x] + self.type_com_dis_scores[x]))
        return types[:n]
        
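def normalize(d, target=1.0):
    raw = sum(d.values())
    # Nothing to rescale when the scores sum to zero (e.g. no observations); avoid dividing by zero
    if raw == 0:
        return dict(d)
    factor = target/raw
    return {key:value*factor for key,value in d.items()}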

def init_score(df,  
               crops=None, 
               types=None, 
               regions=None, 
               communities=None, 
               conds = None,
               cond_weights={
                   "good_cond": 1, 
                   "Bueno" : 1,
                   "excellent_cond": 2, 
                   "Excelente": 2,
                   "Excel": 2,
                   "average_cond": -1, 
                   "Promedio": -1,
                   "poor_cond": -2, 
                   "Pobre": -2,
                   "crisopa": -2,
                   0: 0, 
                   '0': 0}):
    if crops == None:
        crops = list(df["Crop"].unique())
    if regions == None:
        regions = list(df["Region"].unique())
    if types == None:
        types = list(df["Type"].unique())
    if communities == None:
        communities = list(df["Community"].unique())
    if conds == None:
        conds = list(df["Condition"].unique())
        
        
    def score(region=None, community=None):
        if region != None and region not in regions:
            raise ValueError(f"region is not valid, valid inputs include: {', '.join(regions)}")
        if community != None and community not in communities:
            raise ValueError(f"community is not valid, valid inputs include: {', '.join(communities)}")
        
        comp_scores = dict.fromkeys(crops, 0)
        cond_scores = dict.fromkeys(crops, 0)
        per_dis_scores = dict.fromkeys(crops, 0)
        
        reg_cond_scores = dict.fromkeys(crops, 0)
        reg_per_dis_scores = dict.fromkeys(crops, 0)
        
        com_cond_scores = dict.fromkeys(crops, 0)
        com_per_dis_scores = dict.fromkeys(crops, 0)
        
        for crop in crops:
            cond_total = 0
            n = 0
            crop_df = df[df["Crop"] == crop]
            cond_counts = crop_df["Condition"].value_counts().to_dict()
            for cond in conds:
                if cond in cond_counts:
                    cond_total += cond_counts[cond] * cond_weights[cond]
                    n += cond_counts[cond]
            if n == 0:
                cond_scores[crop] = 0
            else:
                cond_scores[crop] = cond_total / n
            per_dis_scores[crop] = 100 - crop_df["% Disease"].mean()
        cond_scores = normalize(cond_scores)
        per_dis_scores = normalize(per_dis_scores)
        
        if region != None:
            region_df = df[df["Region"] == region]
            for crop in crops:
                cond_total = 0
                n = 0
                crop_df = region_df[region_df["Crop"] == crop]
                cond_counts = crop_df["Condition"].value_counts().to_dict()
                for cond in conds:
                    if cond in cond_counts:
                        cond_total += cond_counts[cond] * cond_weights[cond]
                        n += cond_counts[cond]
                if n == 0:
                    reg_cond_scores[crop] = 0
                else:
                    reg_cond_scores[crop] = cond_total / n
                if isinstance(100 - crop_df["% Disease"].mean(), np.float64):
                    reg_per_dis_scores[crop] = 100 - crop_df["% Disease"].mean()
                else:
                    reg_per_dis_scores[crop] = 0
            reg_cond_scores = normalize(reg_cond_scores)
            reg_per_dis_scores = normalize(reg_per_dis_scores)
                
        if community != None:
            com_df = df[df["Community"] == community]
            for crop in crops:
                cond_total = 0
                n = 0
                crop_df = com_df[com_df["Crop"] == crop]
                cond_counts = crop_df["Condition"].value_counts().to_dict()
                for cond in conds:
                    if cond in cond_counts:
                        cond_total += cond_counts[cond] * cond_weights[cond]
                        n += cond_counts[cond]
                if n == 0:
                    com_cond_scores[crop] = 0
                else:
                    com_cond_scores[crop] = cond_total / n
                if isinstance(100 - crop_df["% Disease"].mean(), np.float64):
                    com_per_dis_scores[crop] = 100 - crop_df["% Disease"].mean()
                else:
                    com_per_dis_scores[crop] = 0
            com_cond_scores = normalize(com_cond_scores)
            com_per_dis_scores = normalize(com_per_dis_scores)
        
        for crop in crops:
            comp_scores[crop] += cond_scores[crop] + per_dis_scores[crop]
            if region != None:
                comp_scores[crop] += reg_cond_scores[crop] + reg_per_dis_scores[crop]
            if community != None:
                comp_scores[crop] += com_cond_scores[crop] + com_per_dis_scores[crop]
        comp_scores = normalize(comp_scores)
        
        
        type_comp_scores = dict.fromkeys(types, 0)
        type_cond_scores = dict.fromkeys(types, 0)
        type_per_dis_scores = dict.fromkeys(types, 0)
        
        type_reg_cond_scores = dict.fromkeys(types, 0)
        type_reg_per_dis_scores = dict.fromkeys(types, 0)
        
        type_com_cond_scores = dict.fromkeys(types, 0)
        type_com_per_dis_scores = dict.fromkeys(types, 0)
        
        for _type in types:
            cond_total = 0
            n = 0
            type_df = df[df["Type"] == _type]
            cond_counts = type_df["Condition"].value_counts().to_dict()
            for cond in conds:
                if cond in cond_counts:
                    cond_total += cond_counts[cond] * cond_weights[cond]
                    n += cond_counts[cond]
            if n == 0:
                type_cond_scores[_type] = 0
            else:
                type_cond_scores[_type] = cond_total / n
            type_per_dis_scores[_type] = 100 - type_df["% Disease"].mean()
        type_cond_scores = normalize(type_cond_scores)
        type_per_dis_scores = normalize(type_per_dis_scores)
        
        if region != None:
            region_df = df[df["Region"] == region]
            for _type in types:
                cond_total = 0
                n = 0
                type_df = region_df[region_df["Type"] == _type]
                cond_counts = type_df["Condition"].value_counts().to_dict()
                for cond in conds:
                    if cond in cond_counts:
                        cond_total += cond_counts[cond] * cond_weights[cond]
                        n += cond_counts[cond]
                if n == 0:
                    type_reg_cond_scores[_type] = 0
                else:
                    type_reg_cond_scores[_type] = cond_total / n
                if isinstance(100 - type_df["% Disease"].mean(), np.float64):
                    type_reg_per_dis_scores[_type] = 100 - type_df["% Disease"].mean()
                else:
                    type_reg_per_dis_scores[_type] = 0
            type_reg_cond_scores = normalize(type_reg_cond_scores)
            type_reg_per_dis_scores = normalize(type_reg_per_dis_scores)
                
        if community != None:
            com_df = df[df["Community"] == community]
            for _type in types:
                cond_total = 0
                n = 0
                type_df = com_df[com_df["Type"] == _type]
                cond_counts = type_df["Condition"].value_counts().to_dict()
                for cond in conds:
                    if cond in cond_counts:
                        cond_total += cond_counts[cond] * cond_weights[cond]
                        n += cond_counts[cond]
                if n == 0:
                    type_com_cond_scores[_type] = 0
                else:
                    type_com_cond_scores[_type] = cond_total / n
                if isinstance(100 - type_df["% Disease"].mean(), np.float64):
                    type_com_per_dis_scores[_type] = 100 - type_df["% Disease"].mean()
                else:
                    type_com_per_dis_scores[_type] = 0
            type_com_cond_scores = normalize(type_com_cond_scores)
            type_com_per_dis_scores = normalize(type_com_per_dis_scores)
        
        for _type in types:
            type_comp_scores[_type] += type_cond_scores[_type] + type_per_dis_scores[_type]
            if region != None:
                type_comp_scores[_type] += type_reg_cond_scores[_type] + type_reg_per_dis_scores[_type]
            if community != None:
                type_comp_scores[_type] += type_com_cond_scores[_type] + type_com_per_dis_scores[_type]
        type_comp_scores = normalize(type_comp_scores)
        
        return ScoreResult(
            comp_scores, 
            cond_scores, 
            per_dis_scores, 
            reg_cond_scores, 
            reg_per_dis_scores, 
            com_cond_scores, 
            com_per_dis_scores,
            type_comp_scores,
            type_cond_scores,
            type_per_dis_scores,
            type_reg_cond_scores,
            type_reg_per_dis_scores,
            type_com_cond_scores,
            type_com_per_dis_scores,
            crops,
            types
        )
    
    return score
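The condition-scoring loop appears six times in `score` with only the input dataframe changing. A possible refactor is a small helper like the one sketched below. The name `mean_condition_score` is hypothetical, and the helper assumes every condition label of interest has an entry in `cond_weights`; rows with other labels are skipped.

```python
import pandas as pd

def mean_condition_score(frame, cond_weights):
    """Average condition weight over the rows of `frame`; 0 when no row has a known condition."""
    weights = frame["Condition"].map(cond_weights).dropna()
    return float(weights.mean()) if len(weights) else 0.0

# Toy data and a subset of the weights used by init_score.
cond_weights = {"good_cond": 1, "excellent_cond": 2, "average_cond": -1, "poor_cond": -2}
toy = pd.DataFrame({
    "Crop": ["Tomate", "Tomate", "Tomate", "Papaya"],
    "Condition": ["good_cond", "excellent_cond", "poor_cond", "average_cond"],
})

tomate_score = mean_condition_score(toy[toy["Crop"] == "Tomate"], cond_weights)
missing_score = mean_condition_score(toy[toy["Crop"] == "Cebolla"], cond_weights)
print(tomate_score, missing_score)
```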

# 🔥 Modeling

We are building a model that is predicting which general crop type should be planted. There are four general crop types: fruits, vegetables, legumes and seeds, and grasses. We call this the general crop model.



**Dataset:**
* Features = percent disease, wellness condition (i.e a combination of percent disease and crop condition), weather conditions, and location. 
* Y = ranking for each specific crop and a score for each specific crop.  
* Addressing imbalanced classes: the `legumes & seeds` and `grasses` classes are underrepresented in the dataset. Realistically, we want farmers to plant a variety of crops, so we want to avoid a model that favors one type of crop over the others. Therefore, we implemented balanced class weights.
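For intuition about the class balance weights, the sketch below computes them by hand with the formula scikit-learn documents for `class_weight='balanced'`: `n_samples / (n_classes * class_count)`. The label counts are invented and the helper name is only for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Invented, deliberately imbalanced labels.
labels = ["vegetables"] * 6 + ["fruits"] * 3 + ["grasses"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive larger weights, so mistakes on them cost more during training.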


**Model selection and evaluation**
* We are implementing a multi-class classification model. We will be choosing between XGBoost, random forest, one-vs-rest, logistic regression, k-nearest neighbors, and support vector machines. 
* Criteria for a good model: We want farmers to plant a variety of crops, so we will choose the model that has the highest accuracy and also recommends a good mix of crops.
* We used AUC, precision, recall, F1 score, the confusion matrix, and 5-fold cross validation accuracy score to evaluate each of the models.
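As a reminder of how precision, recall, and F1 are computed per class in a multi-class setting, here is a small hand-rolled version on invented labels. In the notebook these numbers come from scikit-learn; the function below is only an illustration.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented true and predicted crop types.
y_true = ["veg", "veg", "fruit", "fruit", "grass"]
y_pred = ["veg", "fruit", "fruit", "fruit", "veg"]

for label in ["veg", "fruit", "grass"]:
    print(label, per_class_prf(y_true, y_pred, label))
```

A class the model never predicts, like `grass` here, gets zero precision and recall even when overall accuracy looks acceptable, which is why accuracy alone is not enough for imbalanced classes.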

**Hyperparameter tuning**

We will use grid search to tune the parameters of the best model selected.

Below are the code and results of the model selection and evaluation.


**Multiclass Classification**

Predict the crops with the highest probability of success given the features. Success is defined by `Wellness_Condition`. The highest likelihood of success is derived from the model's output probabilities.

**Feature Columns Used:**
  - Weather: `DewPointC`, `HeatIndexC`, `WindChillC`, `sunHour`
  - Season: `Season_Fall`, `Season_Spring`, `Season_Summer`, `Season_Winter`, `Month Visited`(?)
  - Location Based: `Region_Goyena`, `Region_Troilo`

**Metrics**
- Weight on the features for either the quality or the illness
- Make list from outputs, then aggregate lists

- Composite score: 
  - Quality & percent disease
    - When modeling can get rid of crops with illness, bad quality 
  - Model that maps crops to expected percent disease, and maps other conditions with season and location (average percent disease in different locations)




## 😊 Final model chosen

After the model selection process and tuning the hyperparameters, we've chosen the XGBoost model with default parameters to be the final machine learning model. 

In [None]:
def remove_low_crops(df):
  '''
  Outputs a dataframe for crop strain modeling without crops with low representation
  '''
  # Getting counts of crops in dataframe
  crop_counts = df.groupby(['Crop']).size().sort_values(ascending=True)
  # Selecting index crop names with less than 10 counts
  low_crops = crop_counts[crop_counts < 10].index.tolist()
  # Keep only the rows whose crop is not in the low-count list
  df_without = df[~df['Crop'].isin(low_crops)]
  return df_without

def training(predictors, target):
  '''
  Uses predictors and target to train the model. No normaliser needed for xgboost
  '''
  # 'balanced' weights give rows from rare classes more influence during fitting.
  # scale_pos_weight only applies to binary targets, so per-row weights are used instead.
  sample_weight = compute_sample_weight('balanced', target)

  xgboost_model = XGBClassifier()
  xgboost_model.fit(predictors, target, sample_weight=sample_weight)
  return xgboost_model

def get_preds(model, conditions):
  '''
  Given model and the user conditions obtain the top class predictions from the model
  Conditions: '% Disease' (set to 0), 'Wellness_Condition' (set to 100), 
  'HeatIndexC' (avg 30.74), 'DewPointC' (avg 20.66), 'WindChillC' (avg 28.10), 'sunHour' (avg 10.95), 
  'Season_Fall', 'Season_Spring', 'Season_Summer', 'Season_Winter', 
  'Region_Goyena', 'Region_Troilo'
  '''
  # if certain conditions aren't given , then default values 
  target_prediction = model.predict(conditions)
  class_probas = model.predict_proba(conditions)[0].tolist()
  model_classes = model.classes_
  class_probabilities = list(zip(model_classes, class_probas))
  class_probabilities.sort(reverse=True, key=lambda x:x[1])
  top_classes = [every[0] for every in class_probabilities[:3]]
  return top_classes

# May want to do this with the whole dataset for maximum representation of imbalanced classes
# Add argument for Crop_model=True or Type_model=True to get more specific for accuracies desired
def class_assessment(model, predictors, target):
    '''
    Assess the one-vs-rest roc auc score for each class. Keeps the classes whose
    average roc auc across folds is above 0.5
    Uses: Model, Training set of X & y
    -- default dict module needed from collections package
    '''
    crop_scores = defaultdict(list)
    X_train, X_test, y_train, y_test = train_test_split(predictors.values, target.values, test_size=0.2, shuffle=True)
    # random_state is only accepted by KFold when shuffle=True
    kf = KFold(n_splits=3, shuffle=True, random_state=42)
    for train_ind, val_ind in kf.split(X_train, y_train):
        # Split train into training and validation folds
        X_tr, y_tr = X_train[train_ind], y_train[train_ind]
        X_val, y_val = X_train[val_ind], y_train[val_ind]
        # Fit once per fold, then score each class against its own probability column
        val_probas = model.fit(X_tr, y_tr).predict_proba(X_val)
        for i, each in enumerate(model.classes_):
            fpr, tpr, thresholds = roc_curve(y_val, val_probas[:, i], pos_label=each)
            auc = round(metrics.auc(fpr, tpr), 2)
            crop_scores[each].append(auc)

    crop_auc = pd.DataFrame.from_dict(crop_scores, orient='index')
    crop_auc['avg'] = crop_auc.mean(axis=1)

    # Keep only the average column for classes that beat chance, best first
    crop_auc2 = crop_auc.loc[crop_auc['avg'] > 0.5, ['avg']].sort_values(by='avg', ascending=False)
    return [crop_auc2, model.classes_]

def cherry_pick(func_predictions, model_predictions, well_classified_crops):
    '''
    Use Adam's functions to supplement model predictions
    '''
    safe_predictions = [x for x in model_predictions if x in well_classified_crops]
    safe_predictions.extend(func_predictions)
    return safe_predictions[:3]

'''
Function for ensembling results
'''
# def avg_preds(model_predictions, func_predictions):
#   '''
#   Ensembling results of model and function
#   '''
#   model_predsdf = pd.DataFrame.from_dict(model_preds, orient='index').sort_values(by=[0], ascending=False).reset_index().reset_index()
#   model_predsdf.columns = ['rank', 'crop', 'rating']
#   func_predsdf = pd.DataFrame.from_dict(function_results, orient='index').sort_values(by=[0], ascending=False).reset_index().reset_index()
#   func_predsdf.columns = ['rank', 'crop', 'rating']
#   comb_predsdf = func_predsdf.merge(model_predsdf, left_on='crop', right_on='crop')
#   comb_predsdf['averaged_rank'] = (comb_predsdf['rank_x'] + comb_predsdf['rank_y']) / 2 
#   comb_predsdf.sort_values(by=['averaged_rank'])
#   return comb_predsdf
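The commented-out `avg_preds` above averages each crop's rank across the model and the scoring function. A minimal pure-Python sketch of that idea follows. The function name `average_ranks` and the example rankings are assumptions, and crops missing from either list are dropped, mirroring the inner merge in the commented code.

```python
def average_ranks(model_ranking, func_ranking):
    """Order crops by their mean position (0 = best) across two rankings."""
    model_pos = {crop: i for i, crop in enumerate(model_ranking)}
    func_pos = {crop: i for i, crop in enumerate(func_ranking)}
    # Only crops ranked by both sources can be averaged.
    shared = [crop for crop in model_ranking if crop in func_pos]
    averaged = {crop: (model_pos[crop] + func_pos[crop]) / 2 for crop in shared}
    return sorted(shared, key=lambda crop: averaged[crop])

# Invented rankings, best crop first.
model_ranking = ["Tomate", "Papaya", "Cebolla"]
func_ranking = ["Papaya", "Cebolla", "Tomate", "Frijol"]
combined = average_ranks(model_ranking, func_ranking)
print(combined)
```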

## Training, cross validation and testing


In [None]:
# Training
df = pd.read_csv('vl_geow_f.csv', index_col=0)  # the file was saved with its index as the first column

In [None]:
# 1) Remove the low crops
df_less = remove_low_crops(df)

# 2) Train the model
feature_cols = ['% Disease', 'Wellness_Condition', 'HeatIndexC', 'DewPointC',
                'WindChillC', 'sunHour', 'Season_Fall', 'Season_Spring',
                'Season_Summer', 'Season_Winter', 'Region_Goyena', 'Region_Troilo']

predictorsc = df_less[feature_cols]
targetc = df_less['Crop']

predictorst = df[feature_cols]
targett = df['Type']

crop_model = training(predictorsc, targetc)
type_model = training(predictorst, targett)

In [None]:
crop_model.predict(testc)

## ➡️ Test set predictions

In [None]:
# Load the conditions into a dataframe
some_conditions = [0, 100, 30.74, 20.66, 28.10, 10.95, 0, 0, 0, 0, 0, 0]
columns_dict = {0: '% Disease', 1: 'Wellness_Condition', 2: 'HeatIndexC',
                  3: 'DewPointC', 4: 'WindChillC', 5: 'sunHour',
                  6: 'Season_Fall', 7: 'Season_Spring', 8: 'Season_Summer',
                  9: 'Season_Winter', 10: 'Region_Goyena', 11: 'Region_Troilo'}
some_conditions_df = pd.DataFrame(some_conditions).T.rename(columns=columns_dict)

# Get the predictions from the model
crop_preds = get_preds(crop_model, some_conditions_df)
type_preds = get_preds(type_model, some_conditions_df)

# class_assessment returns [AUC dataframe, classes]; unpack the dataframe
well_classified_categories, type_classes = class_assessment(type_model, predictorst, targett)

In [None]:
well_classified_categories.reset_index().rename(columns={'index': 'Crop Type'})

# 🎉 Final combined function

This function combines the scoring system and the machine learning model.

**How it works:**

We first use the machine learning model to output ranks and scores for the specific crops.

We then examine the AUC score for the recommended general crop type. If the score is below a certain threshold, we use the ranks and scores from the scoring system instead of those the machine learning model outputs. If the score is above the threshold, we run a function that combines the ranks and scores predicted by the machine learning model with those predicted by the scoring system.
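
The decision rule described above can be sketched as a small standalone function. This is only an illustration under stated assumptions: the name `choose_recommendations`, the list-of-crop-names inputs, the default threshold of 0.5, and the position-averaging rule are hypothetical and are not taken from the final function in this notebook.

```python
def choose_recommendations(model_ranking, scoring_ranking, type_auc,
                           threshold=0.5, top_n=3):
    """Pick crops from the scoring system alone, or from both sources combined.

    model_ranking / scoring_ranking: lists of crop names, best first.
    type_auc: the model's AUC for the recommended general crop type.
    All names and the default threshold are illustrative assumptions.
    """
    if type_auc < threshold:
        # The model is unreliable for this crop type: use the scoring system only.
        return scoring_ranking[:top_n]

    def avg_position(crop):
        # A crop missing from one ranking is placed just past the end of it.
        m = model_ranking.index(crop) if crop in model_ranking else len(model_ranking)
        s = scoring_ranking.index(crop) if crop in scoring_ranking else len(scoring_ranking)
        return (m + s) / 2

    # Otherwise combine: order crops by their average position in both rankings,
    # breaking ties alphabetically so the result is deterministic.
    crops = set(model_ranking) | set(scoring_ranking)
    return sorted(crops, key=lambda c: (avg_position(c), c))[:top_n]


# Made-up rankings, for demonstration only.
model_ranking = ['papaya', 'tomato', 'onion']
scoring_ranking = ['tomato', 'corn', 'papaya']
low_auc_result = choose_recommendations(model_ranking, scoring_ranking, type_auc=0.4)
high_auc_result = choose_recommendations(model_ranking, scoring_ranking, type_auc=0.8)
print(low_auc_result)
print(high_auc_result)
```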


In [None]:
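import pickle
from sklearn.utils.class_weight import compute_sample_weight

predictorsc = df_less[['% Disease', 'Wellness_Condition', 'HeatIndexC', 'DewPointC', 'WindChillC', 'sunHour', 'Season_Fall', 'Season_Spring', 'Season_Summer', 'Season_Winter', 'Region_Goyena', 'Region_Troilo']]
targetc = df_less['Crop']

classes = np.unique(targetc)
# Weight each row inversely to its crop's frequency so rare crops still count.
# scale_pos_weight only applies to binary problems, so per-row sample weights
# are used here instead.
sample_weight = compute_sample_weight(class_weight='balanced', y=targetc)

xgboost_model = XGBClassifier()
xgboost_model.fit(predictorsc, targetc, sample_weight=sample_weight)

# Save the fitted model to disk
filename = 'finalized_model33.pkl'
with open(filename, 'wb') as f:
    pickle.dump(xgboost_model, f)
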

In [None]:
def final_function(Percent_Disease=0, Wellness_Condition=100, HeatIndexC=30.74,
                   DewPointC=20.66, WindChillC=28.10, sunHour=10.95,
                   Season_Fall=0, Season_Spring=0,
                   Season_Summer=0, Season_Winter=0,
                   Region_Goyena=0, Region_Troilo=0):
  """
      NOTE: This function assumes that df is defined above!

      The parameters are set to default values as specified by the model. Feel
      free to pass in as many or as few of these parameters as necessary. This
      function returns a list of two dataframes: the recommended crops (up to
      3) and the recommended crop type.
  """
  feature_cols = ['% Disease', 'Wellness_Condition', 'HeatIndexC', 'DewPointC',
                  'WindChillC', 'sunHour', 'Season_Fall', 'Season_Spring',
                  'Season_Summer', 'Season_Winter', 'Region_Goyena', 'Region_Troilo']

  # 1) Remove the low crops
  df_less = remove_low_crops(df)

  # 2) Train the models
  predictorsc = df_less[feature_cols]
  targetc = df_less['Crop']

  predictorst = df[feature_cols]
  targett = df['Type']

  crop_model = training(predictorsc, targetc)
  type_model = training(predictorst, targett)

  # 3) Put the passed in conditions into a one-row dataframe
  conditions = [Percent_Disease, Wellness_Condition, HeatIndexC, DewPointC,
                WindChillC, sunHour, Season_Fall, Season_Spring, Season_Summer,
                Season_Winter, Region_Goyena, Region_Troilo]
  conditions_df = pd.DataFrame([conditions], columns=feature_cols)

  # 4) Call the model
  crop_preds = get_preds(crop_model, conditions_df)
  type_preds = get_preds(type_model, conditions_df)
  # class_assessment returns [AUC dataframe, classes]; keep only the dataframe
  well_classified_crops = class_assessment(crop_model, predictorsc, targetc)[0]
  well_classified_crops = well_classified_crops.reset_index().rename(columns={'index': 'Crop'}).head(3)
  well_classified_categories = class_assessment(type_model, predictorst, targett)[0]
  well_classified_categories = well_classified_categories.reset_index().rename(columns={'index': 'Crop Type'}).head(1)

  # 5) Call the scoring system
  if Region_Goyena == 1:
    region = 'Goyena'
  elif Region_Troilo == 1:
    region = 'Troilo'
  else:
    region = None
  score_func = init_score(df)
  result = score_func(region)
  high_score_crops = result.get_best_composite(n=3)
  high_score_categories = result.get_best_type_composite(n=1)
  if high_score_categories[0] == 'Veg':
    high_score_categories[0] = 'Vegetable'

  # 6) Add the results of the crop scoring to get the crop DF up to 3
  crops_length = len(well_classified_crops)
  if crops_length < 3:
    high_score_crops_2D = []
    for i in range(min(len(high_score_crops), 3 - crops_length)):
      high_score_crops_2D.append([high_score_crops[i], None])
    high_score_df = pd.DataFrame(high_score_crops_2D, columns=['Crop', 'avg'])
    well_classified_crops = pd.concat([well_classified_crops, high_score_df], ignore_index=True)

  # 7) Add the results of the crop type scoring if the category DF is 0 in len
  if len(well_classified_categories) < 1:
    high_score_types_2D = [[high_score_categories[0], None]]
    well_classified_categories = pd.DataFrame(high_score_types_2D, columns=['Crop Type', 'avg'])

  # 8) Return the crop and category recommendations dataframes in a list
  return [well_classified_crops, well_classified_categories]

final_function()

In [None]:

well_classified_crops = class_assessment(crop_model, predictorsc, targetc)
well_classified_crops

In [None]:
well_classified_crops[1]

# 📚 Appendix

### Model Comparison

In [None]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, shuffle=True)

# data normalization using MinMaxScaler (standardscaler decreases accuracy due to nongaussian distribution)
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

# random_state only has an effect (and is only accepted) when shuffle=True
kf = KFold(n_splits=5, shuffle=True, random_state=42)

classes = np.unique(target)


In [None]:
# Checking for America's Next Top Models
# cv_results = {}
# result_table = pd.DataFrame(columns=['classifiers', 'accuracy'])

# probability=True lets the SVC-based models expose predict_proba for the ROC curves
models = {'xgboost': XGBClassifier(random_state=42),
        'logistic regression': LogisticRegression(solver="lbfgs", random_state=42, multi_class="multinomial"),
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'decision tree': DecisionTreeClassifier(random_state=42),
        'random forest': RandomForestClassifier(random_state=42, n_estimators=100),
        'SVC': svm.SVC(random_state=42, probability=True),
        'one vs rest': OneVsRestClassifier(svm.SVC(random_state=42, probability=True))}

for name, model in models.items():
  for train_ind, val_ind in kf.split(X_train_norm, y_train):
    X_tr, y_tr = X_train_norm[train_ind], y_train.iloc[train_ind]
    X_val, y_val = X_train_norm[val_ind], y_train.iloc[val_ind]
    # fit the model on the training folds
    model.fit(X_tr, y_tr)
    # predict on the validation fold
    y_pred = model.predict(X_val)

    # testing metrics (the values from the last fold are the ones reported below)
    precision = precision_score(y_val, y_pred, average='weighted')
    accuracy = accuracy_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred, average='weighted')

    # cv_results[name] = (accuracy)
    cm = confusion_matrix(y_val, y_pred)

  # one-vs-rest ROC AUC per class on the last validation fold,
  # using the probability column that belongs to each class
  y_proba = model.predict_proba(X_val)
  for i, each in enumerate(model.classes_):
      fpr, tpr, thresholds = roc_curve(y_val, y_proba[:, i], pos_label=each)

      auroc = round(metrics.auc(fpr, tpr), 2)
      print(each, '--AUC--->', auroc)

  print(name, '\n', 'accuracy score:', accuracy, '\n', 'f1 score: ', f1, '\n precision: ', precision)
  ax = plt.subplot()
  sns.heatmap(cm, annot=True, fmt='g', ax=ax, cmap='Blues')  # annot=True to annotate cells, fmt='g' to disable scientific notation
  # labels, title and ticks
  ax.set_xlabel('Predicted Labels')
  ax.set_ylabel('True Labels')
  ax.set_title('Confusion Matrix')
  ax.xaxis.set_ticklabels(['Fruit', 'Grains', 'Legumes', 'Veg'])
  ax.yaxis.set_ticklabels(['Fruit', 'Grains', 'Legumes', 'Veg'])
  plt.show()
  print('\n')


### Hyperparameter Tuning

In [None]:
# XGBoost with oversampled data
# over sampled train set
oversample = RandomOverSampler(random_state=42)
X_train_over, y_train_over = oversample.fit_resample(X_train_norm, y_train)


# learn_rate and sample_rate are not XGBoost parameters (learning_rate and
# subsample are), so they are dropped; parameters left at their defaults are omitted.
xgboost_model_o = XGBClassifier(booster='gbtree', gamma=0, learning_rate=0.1,
                    max_depth=3, min_child_weight=1, n_estimators=100, n_jobs=1,
                    objective='multi:softprob', random_state=42,
                    reg_alpha=0, reg_lambda=1, subsample=1, verbosity=1)
xgboost_model_o.fit(X_train_over, y_train_over)

y_pred = xgboost_model_o.predict(X_test_norm)
print(accuracy_score(y_test, y_pred))

print(classification_report(y_test, y_pred))

In [None]:
# XGBoost with tuned parameters
# {'gamma': 1, 'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.9}
# (the XGBoost parameter is learning_rate; learn_rate is not recognized)
tuned_xgboost_model = XGBClassifier(gamma=1, learning_rate=0.1, max_depth=3, subsample=0.9)
tuned_xgboost_model.fit(X_train, y_train.values.ravel())

y_pred = tuned_xgboost_model.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
# recent scikit-learn versions require keyword arguments here
class_weight = compute_class_weight(class_weight='balanced', classes=classes, y=target)
print(class_weight)

In [None]:
# Finding the best parameters for XGBoost
def parameter_tune(clf, hyper_params):
    # Type of scoring used to compare parameter combinations
    acc_scorer = make_scorer(accuracy_score)

    # Run the grid search
    grid_obj = GridSearchCV(clf, hyper_params, scoring=acc_scorer)
    grid_obj.fit(X_train, y_train)
    return grid_obj

# Parameter combinations that the grid search will try
hyper_params = {'max_depth': [3, 4, 5],
                'learning_rate': [0.1, 0.09, 0.08, 0.07],
                'subsample': [0.8, 0.9, 1],
                'gamma': [0, 1, 5]}
# Find the best parameters with GridSearchCV
#grid_search_obj = parameter_tune(XGBClassifier(), hyper_params)
#grid_search_obj.cv_results_

In [None]:
# Make the results into a dataframe
#grid_search_df = pd.DataFrame(grid_search_obj.cv_results_)
#grid_search_df

In [None]:
# Print out the best parameters
#grid_search_obj.best_params_

In [None]:
# Get the best estimator
#BEST_XG_CLF = grid_search_obj.best_estimator_
#BEST_XG_CLF

In [None]:
# Print the classification_report of the best estimator
#BEST_XG_CLF.fit(X_train, y_train)

#y_pred = BEST_XG_CLF.predict(X_test)
#accuracy_score(y_test, y_pred)

#print(classification_report(y_test, y_pred))

In [None]:
predictors = df[['% Disease', 'Wellness_Condition', 'HeatIndexC', 'DewPointC', 'WindChillC', 'sunHour', 'Season_Fall', 'Season_Spring', 'Season_Summer', 'Season_Winter', 'Region_Goyena', 'Region_Troilo']]
targetc = df['Crop']

X_trainc, X_testc, y_trainc, y_testc = train_test_split(predictors, targetc, test_size=0.2, shuffle=True)

In [None]:
# Crops data
from sklearn.utils.class_weight import compute_sample_weight

# class_weight above was computed from `target`, not from the crop labels, and
# scale_pos_weight only applies to binary problems, so the crop classes are
# balanced with per-row sample weights instead.
crop_sample_weight = compute_sample_weight(class_weight='balanced', y=y_trainc)

xgboost_model = XGBClassifier(booster='gbtree', gamma=0, learning_rate=0.1,
                    max_depth=3, min_child_weight=1, n_estimators=100, n_jobs=1,
                    objective='multi:softprob', random_state=42,
                    reg_alpha=0, reg_lambda=1, subsample=1, verbosity=1)
xgboost_model.fit(X_trainc, y_trainc, sample_weight=crop_sample_weight)

y_pred = xgboost_model.predict(X_testc)
print(accuracy_score(y_testc, y_pred))

print(classification_report(y_testc, y_pred))


## ✏️ Data Preparation


We produced a cleaned version of the data titled `VL_farm_geo_w.csv`.
Below is a description of the initial data cleaning steps we took.

The data we received from Viviendas Leon (VL) originally consisted of these files:

1) corrected names of the farmers VL worked with

2) farming data for those farmers, 2017 - 2021

3) coordinates of the families

We also scraped weather information to add to our analysis. We then:

1) replaced all the family names with the corrected names

2) translated Spanish (the original language of the data) into English

3) merged the farming data with the weather and geocoordinate data

Below are the functions we used for data cleaning, followed by the code that runs them.
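
Before the full cleaning code, the three steps can be illustrated on a toy example. This is a minimal sketch, not the actual pipeline: the two small dataframes, the `corrections` and `crop_translation` dictionaries, and all values in them are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the farming and coordinate tables (all values are made up).
farming = pd.DataFrame({
    'Family visited': ['Freddy  Lanza ', 'fredy lanzas', 'Juan Sandobal'],
    'Crop': ['Papaya', 'Tomate', 'Cebolla'],
})
coordinates = pd.DataFrame({
    'Name': ['freddy lanza', 'juan sandoval'],
    'latitude': [12.46, 12.47],
    'longitude': [-86.96, -86.95],
})

# 1) Normalize case and whitespace, then map known misspellings to one spelling.
corrections = {'fredy lanzas': 'freddy lanza', 'juan sandobal': 'juan sandoval'}
names = (farming['Family visited']
         .str.strip()
         .str.lower()
         .str.replace(r'\s{2,}', ' ', regex=True))
farming['Family visited'] = names.replace(corrections)

# 2) Translate Spanish values into English (hypothetical mapping).
crop_translation = {'Tomate': 'Tomato', 'Cebolla': 'Onion'}
farming['Crop'] = farming['Crop'].replace(crop_translation)

# 3) Left-merge the coordinates onto the farming rows by family name.
merged = farming.merge(coordinates, how='left',
                       left_on='Family visited', right_on='Name').drop(columns=['Name'])
print(merged)
```

A left merge keeps every farming observation even when a family has no coordinates, which is why missing locations have to be imputed afterwards.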



In [None]:
# Load farming data, weather data, geo data

# dropped last three rows (just empty)
correct_names = pd.read_excel("210304_Full Participant List Farming Program 2017-2021.xlsx").drop([51,52,53], axis=0)
correct_names.columns = correct_names.loc[0,:]
correct_names = correct_names.drop([0], axis=0)

# load data
data_17_19 = pd.read_excel("VL Huertos Familiares- Hoja de Datos (2017-2019).xlsx")
data_19_20 = pd.read_excel("VL Huertos Familiares- Hoja de Datos (2019-2020).xlsx")

# Lauren
historic_w = pd.read_csv('w_historic.csv')
families_coordinates = pd.read_csv('family_coordinates_api.csv')

In [None]:
def combined_farming(dataframe1, dataframe2):
  '''Translate the Spanish column names into English and stack the two
  farming spreadsheets into one dataframe.'''
  # Kelly
  # drop empty column
  dataframe1 = dataframe1.drop("Unnamed: 10", axis=1)


  translated_cols = ['Date visited','Auditor','Region','Community','Family visited','Present?',
                      'Fruit','Fruit_Condition (seedling or transplanted)', 'Fruit_% Disease','Fruit_Condition',
                      'Fruit_Plague','Fruit_Organic recommendation','Fruit_Chemical recommendation',
                      'Vegetables','Veg_Condition (seedling or transplanted)','Veg_% Disease','Veg_Condition',
                      'Veg_Plague','Veg_Organic recommendation','Veg_Chemical recommendation',
                      'Legumes and seeds','LnS_Condition (seedling or transplanted)','LnS_% Disease',
                      'LnS_Condition','LnS_Plague','LnS_Organic recommendation','LnS_Chemical recommendation',
                      'Grasses','Grasses_Condition (seedling or transplanted)','Grasses_% Disease',
                      'Grasses_Condition','Grasses_Plague','Grasses_Organic recommendation',
                      'Grasses_Chemical recommendation',
                      'Commentaries, additional remarks','Response, commentary follow up']
  # rename columns 
  dataframe1.columns = translated_cols
  dataframe2.columns = translated_cols

  # drop first 3 rows (headings of table names)
  dataframe1 = dataframe1.drop([0, 1, 2], axis=0) 
  dataframe2 = dataframe2.drop([0, 1, 2], axis=0)

  # appending 2019-2020 to the bottom of 2017-2019 data
  # (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
  combined_data = pd.concat([dataframe1, dataframe2])

  return combined_data

def clean_farming(combined_data):

    '''Preliminary cleaning for consistent data entry, correct names, and structuring
    data so that each row is one observation for one crop.'''

    # fill NaN with 0s
    combined_data = combined_data.fillna(0)


    # further cleaning
    combined_data["Month visited"] = combined_data["Date visited"].dt.month
    combined_data["Year visited"] = combined_data["Date visited"].dt.year
    combined_data["Veg_% Disease"] = combined_data["Veg_% Disease"].replace(['35 %', ';4', ' '], [35, 4, 0])
    combined_data["Fruit"] = combined_data["Fruit"].replace(['Piña', 'piña'], 'Pina')
    combined_data["Legumes and seeds"] = combined_data["Legumes and seeds"].replace("Pipián", "Pipian")
    combined_data["Legumes and seeds"] = combined_data["Legumes and seeds"].replace("I", "None")
    combined_data["Legumes and seeds"] = combined_data["Legumes and seeds"].replace(["Frijol Rojo", "Frijoles rojo"], "Frijol rojo")
    combined_data["Legumes and seeds"] = combined_data["Legumes and seeds"].replace("Frijoles blanco", "Frijol blanco")

    # replace with corrected names
    combined_data = combined_data.replace(correct_names.iloc[:,3].values, correct_names.iloc[:,4].values)

    # structure data so that one row is one observation 
    overall_info = combined_data.iloc[:,:6]
    # copy each slice so that adding the 'Type' column does not write to a view
    fruit = combined_data.iloc[:,6:13].copy()
    veg = combined_data.iloc[:,13:20].copy()
    lns = combined_data.iloc[:,20:27].copy()
    grasses = combined_data.iloc[:,27:34].copy()

    # keeping a column for crop type
    fruit['Type'] = 'Fruit'
    veg['Type'] = 'Veg'
    lns['Type'] = 'Legumes Seeds'
    grasses['Type'] = 'Grains'

    crops = [fruit, veg, lns, grasses]
    crops_0 = []
    new_col_names = ['Date visited','Auditor','Region','Community','Family visited','Present?',
                          'Crop','Seedling_or_transplanted', '% Disease','Condition',
                          'Plague','Organic recommendation','Chemical recommendation', 'Type']

    for table in crops:
      temp = pd.concat([overall_info, table], axis=1)
      temp.columns = new_col_names
      crops_0.append(temp)

    final = crops_0[0]
    for table in crops_0[1:]:
      final = pd.concat([final, table], axis=0)

    # Lauren
    # Removing empties
    final = final[final['Crop']!=0]

    # Cleaning names column

    final['Family visited'] = final['Family visited'].str.lower()

    final = final.replace(
        ['arelis', 'arelis  solis', 'arelis solis',
          'arelis soliz', 'areliz solis', 'arlelis solis'], 'arelis solis')

    final = final.replace(
        ['freddy', 'freddy lanza', 'freddy lanzas',
          'freddy lasza', 'freddys campo', 'fredi ', 'fredis ',
          'fredis lanza', 'fredy lanzas', 'fredys'], 'freddy lanza')

    final = final.replace(
        ['helen espinoza', 'hellen', 'hellen espinoza', ], 'hellen espinoza')

    final = final.replace(
        ['johana', 'johana  salgado','johana salgado', 'johanna salgado', 'yohana salgado'], 'johana salgado')

    final = final.replace(
        ['juan sandobal', 'juan sandoval'], 'juan sandoval')

    final = final.replace(
        ['maria jose', 'maria jose roque',
          'maria jose roque ', 'mariajose roque', ',maria jose roque', ], 'maria jose roque')

    final = final.replace(
        ['marvin toval', 'marvin toval padilla'], 'marvin toval padilla')

    final = final.replace(
        ['naideling', 'naideling vargas', 'naidelyn', 'naidelyn vargas',
          'naidelyng', 'naidelyng ', 'naidelyng vargas', 'naydelin',
          'naydelin varga', 'naydeling', 'naydeling varga', 'naydeling vargas',
          'nayeling varga', 'neilyng', 'ávila vargas'], 'naydeling vargas')

    final = final.replace(
        ['nayeli roque','nayelis  roque', 'nayelis roque', 'nayelis roqur', 'nerlyn roque',
    ], 'nayelis roque')

    final = final.replace(
        ['nerligh hernandez', 'nerling henandez', 'nerling hernandez',
          'nerlyn hernandez', 'nerlynh hernandez',], 'nerling hernandez')

    final = final.replace(
        ['rayson membreño', 'reison membreño', 'reison membreńo',
          'reysom membreño', 'reyson membrecho', 'reyson membreño',
          'reyson membreńo', 'reyson menbreño'], 'reysom membreño')

    final = final.replace(
        ['yader  morales', 'yader morales', 'yader morales ',
          'yadermorales', ], 'yader morales')

    final["Month visited"] = final["Date visited"].dt.month
    final["Year visited"] = final["Date visited"].dt.year      
    return final


def clean_gps(families_coordinates):
    '''Making families_coordinates names identical to VL farming names'''

    families_coordinates['Name'] = families_coordinates['Name'].str[8:].str.lower()

    # collapse runs of whitespace; regex=True is required in pandas 2.0+
    families_coordinates['Name'] = families_coordinates['Name'].str.replace(r'\s{2,}', ' ', regex=True)
    families_coordinates = families_coordinates.replace(
        ['rebeca sequeira'], 'rebeca carolina sequeira morales')
    families_coordinates = families_coordinates.replace(
        ['fátima castillo'], 'maría de fátima castillo')
    families_coordinates = families_coordinates.replace(
        ['yojhana cristina flores'], 'johana cristina altamirano flores')
    families_coordinates = families_coordinates.replace(
        ['karla galeano'], 'karla galiano martínez')
    families_coordinates = families_coordinates.replace(
        ['rita arevalo'], 'rita arévalo mora')
    families_coordinates = families_coordinates.replace(
        ['cristina alvares'], 'cristina alvares solís')
    families_coordinates = families_coordinates.replace(
        ['claudia arevalo'], 'claudia flavia arévalo')
    families_coordinates = families_coordinates.replace(
        ['silvia elena  moran'], 'silvia elena moran')
    families_coordinates = families_coordinates.replace(
        ['cristina avendaño'], 'maria cristina avendaño')
    families_coordinates = families_coordinates.replace(
        ['melania jacaba quiroz'], 'melania jocoba quiroz')
    families_coordinates = families_coordinates.replace(
        ['daisy ramirez'], 'maria deisy ramirez')
    families_coordinates = families_coordinates.replace(
        ['maria eugenia morales'], 'maría eugenia morales')
    families_coordinates = families_coordinates.replace(
        ['ana catalina millón'], 'ana catalina garcía millón')
    families_coordinates = families_coordinates.replace(
        ['oralia ramimez'], 'oralia ramirez')
    families_coordinates = families_coordinates.replace(
        ['roosvelt donaire'], 'roosevelt donaire')

    return families_coordinates

def merged_unified(farming, families_coordinates, historic_w):
    '''
    takes farming dataset, family_coordinates, historic weather
    '''

    # Merge geolocation on family names
    geo_farm = pd.merge(farming, families_coordinates[['Name', 'apienter', 'latitude', 'longitude']], 
                                  how="left", left_on="Family visited", 
                                  right_on="Name").drop(columns=['Name'])

    # For missing locations, the average latitude (12.46) and average longitude (-86.96) are imputed
    geo_farm['apienter'] = geo_farm['apienter'].fillna('12.46%-86.96')

    
    historic_w['date_time'] = pd.to_datetime(historic_w.date_time)

    # Merge weather on geolocation
    final = pd.merge(geo_farm, historic_w,
                        how="left", left_on=["Date visited", "apienter"],
                        right_on=["date_time", "location"])
    
    # Last column addition
    final["Month visited"] = final["Date visited"].dt.month
    final["Year visited"] = final["Date visited"].dt.year
    
    return final

In [None]:
# this cell runs all the defined functions to clean the data

combined_data = combined_farming(data_17_19,data_19_20)
combined_data = clean_farming(combined_data)
families_coordinates = clean_gps(families_coordinates)

# Check Farming Dataset names against Correct Name List
# extra_names = [name for name in final["Family visited"].unique() if name not in correct_names.iloc[:,3].unique()]
# extra_names.sort()

# Check Family Coordinates names against Farming Dataset names
# [name for name in families_coordinates["Name"].unique() if name not in final["Family visited"].unique()]

final = merged_unified(combined_data, families_coordinates, historic_w)
# Write to CSV
# final.to_csv('VL_farm_geo_w.csv')
# final.to_csv("/content/drive/MyDrive/VL_farm.csv")




In [None]:
import pickle

# save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(xgboost_model_o, f)

# some time later...

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_train_over, y_train_over)
print(result)

In [None]:
# Requires Flask (`pip install flask`), plus `loaded_model` and the fitted
# scaler `norm` from the cells above.
import numpy as np
from flask import Flask, request, jsonify

# Initialize the Flask app
app = Flask('crop_prediction')
model = loaded_model

@app.route('/api', methods=['POST'])
def predict():
    # Get the data from the POST request.
    # Assumed payload shape: {"features": [...]} holding the 12 feature values
    # in the same order as the training columns.
    data = request.get_json(force=True)
    features = np.array(data['features'], dtype=float).reshape(1, -1)
    # The model was trained on MinMax-scaled features, so scale the input too.
    features = norm.transform(features)
    # Make prediction using model loaded from disk as per the data.
    prediction = model.predict(features)
    # Take the first value of prediction and convert NumPy types for JSON
    output = prediction[0]
    return jsonify(output.item() if hasattr(output, 'item') else output)

if __name__ == '__main__':
    app.run(port=53300, debug=True)