In [None]:
# References
# https://realpython.com/build-recommendation-engine-collaborative-filtering/
# https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/
# https://www.datacamp.com/community/tutorials/recommender-systems-python

# Objective



In [None]:
# PROBLEM SETTING
Commercial corn is processed into multiple food and industrial products. It is widely known as one of the world’s most important crops. Each year, plant breeders create new corn products, known as experimental hybrids, by crossing two “parents” together. The parents are known as inbreds and the development of the inbreds takes up the bulk of a corn breeding program. Most of that effort is spent evaluating the inbreds by crossing to another inbred, called a “tester.” 

It is a plant breeder’s job to identify the best parent combinations by creating experimental hybrids and “testing” them in multiple environments to find the hybrids that perform best. Historically, this has been done by trial and error: breeders test their experimental hybrids in a diverse set of locations, measure their performance, and select the highest-yielding hybrids. Selecting the right parent combinations and testing the experimental hybrids can take many years and is inefficient, simply because of the number of potential parent combinations to create and test.

RESEARCH QUESTION
Given historical hybrid (inbred by tester) performance data across years and locations, how can we create a model to predict/impute the performance of the crossing of any two inbred and tester parents? 

For example, given 5,000 inbreds (parents), the number of potential crosses is 12,497,500, far more than can be created or tested. Due to limited testing resources, breeders can select only a small subset of all possible inbred combinations, which can lead to lost opportunities.
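The count quoted above is the number of unordered pairs that can be drawn from 5,000 parents, i.e. C(5000, 2). A quick check using only the standard library (the variable names are illustrative, not from the challenge material):

```python
import math

n_inbreds = 5000

# Unordered pairs of distinct parents: n * (n - 1) / 2
n_crosses = math.comb(n_inbreds, 2)
print(n_crosses)  # 12497500
```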

This issue is the basis for the 2020 Syngenta Crop Challenge in Analytics. Can an accurate model be constructed to predict the performance of crossing any two inbreds? Such a model would allow breeders to focus on the best possible combinations. 

In simpler terms, can we use hybrid data collected from crossing inbreds and testers together to predict the result of cross combinations that have not yet been created and tested? Namely, are we able to construct a recommender system to propose new parent combinations based on the hybrid performance from other parent combinations and attributes they have in common? 

Table 1 illustrates the challenge. Each “X” is the set of observed performance data points of hybrids from the corresponding inbred by tester combination. With the information from the table, how can a model be built to predict/impute the mean yield of each missing combination (“?”)?


OBJECTIVE
The objective is to estimate yield performance of the cross between inbred and tester combinations in a given holdout set. Specifically, we are asking for the mean yield performance of each inbred by tester combination in the holdout set. 

Notes
Each response in the holdout must be completed
Many approaches can be used such as statistical approaches, machine learning and collaborative filtering
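As one possible starting point among the approaches listed above, here is a minimal sketch of an additive baseline: global mean plus a per-inbred offset plus a per-tester offset. This is an assumption for illustration, not a method prescribed by the challenge or used later in this notebook. The toy data and the `predict` helper are made up.

```python
import pandas as pd

# Toy observations in long format (values are made up)
toy = pd.DataFrame({
    "INBRED": ["A", "A", "B", "B", "C"],
    "TESTER": ["X", "Y", "X", "Y", "X"],
    "YIELD":  [1.10, 1.00, 0.90, 0.80, 1.00],
})

global_mean = toy["YIELD"].mean()
# Offset of each parent's average yield from the global mean
inbred_effect = toy.groupby("INBRED")["YIELD"].mean() - global_mean
tester_effect = toy.groupby("TESTER")["YIELD"].mean() - global_mean

def predict(inbred, tester):
    """Global mean plus parent offsets; unseen parents contribute 0."""
    return (global_mean
            + inbred_effect.get(inbred, 0.0)
            + tester_effect.get(tester, 0.0))

# The cross C x Y was never observed in the toy data
print(round(predict("C", "Y"), 3))  # 0.94
```

A baseline like this gives a reference RMSE that any more elaborate model (matrix factorization, gradient boosting) should beat.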

Deliverables
Predicted yield values of the cross between inbred and tester combinations in the test set.
Additionally, observing the standards for academic publication, entries should include a written report with the following:
Quantitative results to justify your modeling and classification techniques
A clear description of the methodology and theory used
References or citations as appropriate

Evaluation
The entries will be evaluated based on:
Accuracy of the predicted values in the test set based on root mean squared error
Simplicity and intuitiveness of the solution
Clarity in the explanation
The quality and clarity of the finalist’s presentation at the 2020 INFORMS Conference on Business Analytics and Operations Research
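Since accuracy is scored by root mean squared error, a small helper for computing it on a validation split is useful. A minimal sketch with NumPy (the function name `rmse` and the sample values are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up observed yields versus a constant prediction of 1.0
print(rmse([1.0, 1.1, 0.9], [1.0, 1.0, 1.0]))
```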

Dataset
Training Dataset: This dataset contains the observed yield (consistently scaled to an internal benchmark) for a large set of corn hybrids tested across multiple environments between 2016 and 2018. These hybrids are created through the crossing of 593 unique inbreds and 496 unique testers. Creating a two-way table of means with inbreds as rows and testers as columns results in a data table with approximately 96% missing values. Each row contains the year and location ID of the observation. Additionally, each row includes a cluster value for each inbred and tester. This represents the genetic grouping of the inbreds and testers and has been determined using internal methods. Inbreds and testers are not treated any differently when clustering, so a shared cluster value indicates genetic similarity regardless of whether a parent is defined as an inbred or a tester. Contestants may (or may not) find these columns useful.
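The two-way table of means described above can be built with a pivot. The sketch below uses a made-up toy frame with the same column names as the training file, so the numbers are illustrative only:

```python
import pandas as pd

# Toy long-format data in the same layout as the training file (values are made up)
toy = pd.DataFrame({
    "INBRED": ["Inbred_1", "Inbred_1", "Inbred_2", "Inbred_3"],
    "TESTER": ["Tester_1", "Tester_1", "Tester_2", "Tester_1"],
    "YIELD":  [1.00, 1.10, 0.95, 1.02],
})

# Two-way table of mean yield: inbreds as rows, testers as columns
two_way = toy.pivot_table(index="INBRED", columns="TESTER",
                          values="YIELD", aggfunc="mean")

# Share of inbred-by-tester cells with no observation
missing_share = two_way.isna().to_numpy().mean()

print(two_way)
print(f"missing share: {missing_share:.2f}")
```

Applied to the real training data, the same pivot would give a 593 × 496 table (294,128 cells), of which roughly 96% are empty according to the description above.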

Test Dataset: This dataset contains a set of inbred and tester combinations that need to be predicted as part of the challenge. The mean yield is to be predicted for each listed combination of inbred by tester.


Training Dataset

| Column | Description |
| --- | --- |
| YEAR | Year grown |
| LOCATION | ID for each location |
| INBRED | ID for inbred |
| INBRED_CLUSTER | Cluster association for each inbred, which denotes genetic grouping |
| TESTER | ID for tester |
| TESTER_CLUSTER | Cluster association for each tester, which denotes genetic grouping |
| YIELD | The performance of the inbred and tester combination |

Test Dataset

| Column | Description |
| --- | --- |
| INBRED | ID for inbred |
| INBRED_CLUSTER | Cluster association for each inbred, which denotes genetic grouping |
| TESTER | ID for tester |
| TESTER_CLUSTER | Cluster association for each tester, which denotes genetic grouping |
| YIELD | The performance of the inbred and tester combination (to be predicted) |

Timeline: January 15, 2020



# Load libraries

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc, recall_score, precision_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder  
from sklearn.preprocessing import StandardScaler  
from scipy import sparse


# Load Data

In [5]:
train = pd.read_csv('CC2020_train_final.csv')
test = pd.read_csv('CC2020_test_final.csv')

In [4]:
train.shape

(199476, 7)

In [5]:
train.head()

Unnamed: 0,YEAR,LOCATION,INBRED,INBRED_CLUSTER,TESTER,TESTER_CLUSTER,YIELD
0,18,Loc 5608,Inbred_142,Cluster11,Tester_740,Cluster10,1.135462
1,18,Loc 4533,Inbred_142,Cluster11,Tester_740,Cluster10,1.139813
2,18,Loc 5620,Inbred_142,Cluster11,Tester_740,Cluster10,1.117778
3,18,Loc 4732,Inbred_142,Cluster11,Tester_740,Cluster10,1.171366
4,18,Loc 5500,Inbred_142,Cluster11,Tester_740,Cluster10,1.059364


In [6]:
train.describe()

Unnamed: 0,YEAR,YIELD
count,199476.0,199476.0
mean,17.160551,1.001731
std,0.741779,0.104722
min,16.0,0.047236
25%,17.0,0.94187
50%,17.0,1.003277
75%,18.0,1.064073
max,18.0,1.800083


In [7]:
test.head()

Unnamed: 0,INBRED,INBRED_CLUSTER,TESTER,TESTER_CLUSTER,YIELD
0,Inbred_1071,Cluster8,Tester_5450,Cluster5,
1,Inbred_122,Cluster12,Tester_4336,Cluster6,
2,Inbred_1337,Cluster17,Tester_2652,Cluster1,
3,Inbred_1337,Cluster17,Tester_4373,Cluster3,
4,Inbred_1339,Cluster17,Tester_4238,Cluster11,


In [8]:
test.describe()

Unnamed: 0,YIELD
count,0.0
mean,
std,
min,
25%,
50%,
75%,
max,


# EDA

In [9]:
train.dtypes

YEAR                int64
LOCATION           object
INBRED             object
INBRED_CLUSTER     object
TESTER             object
TESTER_CLUSTER     object
YIELD             float64
dtype: object

In [11]:
#check year column
train.YEAR.unique()

array([18, 17, 16])

In [12]:
#check location
train.LOCATION.unique()

array(['Loc 5608', 'Loc 4533', 'Loc 5620', 'Loc 4732', 'Loc 5500',
       'Loc 5514', 'Loc 4742', 'Loc 4625', 'Loc 4620', 'Loc 4524',
       'Loc 4442', 'Loc 4400', 'Loc 5420', 'Loc 4621', 'Loc 4601',
       'Loc 4532', 'Loc 4424', 'Loc 4341', 'Loc 4515', 'Loc 6609',
       'Loc 5424', 'Loc 5320', 'Loc 6415', 'Loc 4439', 'Loc 6730',
       'Loc 5330', 'Loc 5610', 'Loc 6700', 'Loc 6532', 'Loc 5711',
       'Loc 6634', 'Loc 6511', 'Loc 4623', 'Loc 4523', 'Loc 4Z23',
       'Loc 3631', 'Loc 3601', 'Loc 6418', 'Loc 6601', 'Loc 6614',
       'Loc 5511', 'Loc 4326', 'Loc 5324', 'Loc 4420', 'Loc 3437',
       'Loc 4401', 'Loc 3439', 'Loc 5336', 'Loc 7345', 'Loc 5340',
       'Loc 7319', 'Loc 6310', 'Loc 6421', 'Loc 7420', 'Loc 5240',
       'Loc 6320', 'Loc 8405', 'Loc 8316', 'Loc 7311', 'Loc 7612',
       'Loc 7332', 'Loc 7303', 'Loc 6734', 'Loc 7632', 'Loc 8320',
       'Loc 7243', 'Loc 8234', 'Loc 7440', 'Loc 7520', 'Loc 6334',
       'Loc 7727', 'Loc 7528', 'Loc 6719', 'Loc 7D06', 'Loc 73

In [13]:
#check INBRED
train.INBRED.unique()

array(['Inbred_142', 'Inbred_740', 'Inbred_743', 'Inbred_19',
       'Inbred_755', 'Inbred_737', 'Inbred_733', 'Inbred_739',
       'Inbred_754', 'Inbred_748', 'Inbred_750', 'Inbred_727',
       'Inbred_753', 'Inbred_586', 'Inbred_752', 'Inbred_736',
       'Inbred_749', 'Inbred_725', 'Inbred_761', 'Inbred_756',
       'Inbred_760', 'Inbred_757', 'Inbred_768', 'Inbred_765',
       'Inbred_764', 'Inbred_769', 'Inbred_759', 'Inbred_741',
       'Inbred_747', 'Inbred_745', 'Inbred_731', 'Inbred_773',
       'Inbred_738', 'Inbred_145', 'Inbred_777', 'Inbred_804',
       'Inbred_724', 'Inbred_122', 'Inbred_799', 'Inbred_803',
       'Inbred_732', 'Inbred_801', 'Inbred_800', 'Inbred_770',
       'Inbred_805', 'Inbred_751', 'Inbred_771', 'Inbred_744',
       'Inbred_790', 'Inbred_746', 'Inbred_1339', 'Inbred_789',
       'Inbred_1071', 'Inbred_1342', 'Inbred_1345', 'Inbred_1341',
       'Inbred_1346', 'Inbred_1340', 'Inbred_1349', 'Inbred_1360',
       'Inbred_1358', 'Inbred_1354', 'Inbred_13

In [14]:
#Inbred cluster
train.INBRED_CLUSTER.unique()

array(['Cluster11', 'Cluster10', 'Cluster12', 'Cluster8', 'Cluster17',
       'Cluster5', 'Cluster1', 'Cluster4', 'Cluster7', 'Cluster3',
       'Cluster14', 'Cluster6', 'Cluster9', 'Cluster2'], dtype=object)

In [15]:
#Tester information
train.TESTER.unique()

array(['Tester_740', 'Tester_743', 'Tester_757', 'Tester_761',
       'Tester_767', 'Tester_775', 'Tester_776', 'Tester_779',
       'Tester_789', 'Tester_793', 'Tester_813', 'Tester_819',
       'Tester_821', 'Tester_828', 'Tester_829', 'Tester_1345',
       'Tester_1349', 'Tester_1397', 'Tester_2636', 'Tester_2652',
       'Tester_2683', 'Tester_2689', 'Tester_2690', 'Tester_2721',
       'Tester_2724', 'Tester_2736', 'Tester_2747', 'Tester_2773',
       'Tester_3404', 'Tester_3440', 'Tester_3484', 'Tester_3485',
       'Tester_3504', 'Tester_3507', 'Tester_3521', 'Tester_3565',
       'Tester_3567', 'Tester_3573', 'Tester_3577', 'Tester_3582',
       'Tester_3791', 'Tester_3796', 'Tester_4025', 'Tester_4048',
       'Tester_4051', 'Tester_4059', 'Tester_4062', 'Tester_4063',
       'Tester_4065', 'Tester_4067', 'Tester_4072', 'Tester_4082',
       'Tester_4083', 'Tester_4097', 'Tester_4099', 'Tester_4102',
       'Tester_4115', 'Tester_4119', 'Tester_4131', 'Tester_4135',
       'Te

In [16]:
#TESTER_CLUSTER
train.TESTER_CLUSTER.unique()

array(['Cluster10', 'Cluster5', 'Cluster4', 'Cluster8', 'Cluster17',
       'Cluster11', 'Cluster1', 'Cluster3', 'Cluster7', 'Cluster12',
       'Cluster6', 'Cluster14', 'Cluster2'], dtype=object)

In [17]:
#Yield information
train.YIELD.unique()

array([1.13546205, 1.13981255, 1.1177782 , ..., 0.97258388, 0.90014022,
       0.95307504])

In [7]:
test.dtypes

INBRED             object
INBRED_CLUSTER     object
TESTER             object
TESTER_CLUSTER     object
YIELD             float64
dtype: object

In [8]:
print(test.dtypes)
print(train.dtypes)

INBRED             object
INBRED_CLUSTER     object
TESTER             object
TESTER_CLUSTER     object
YIELD             float64
dtype: object
YEAR                int64
LOCATION           object
INBRED             object
INBRED_CLUSTER     object
TESTER             object
TESTER_CLUSTER     object
YIELD             float64
dtype: object
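The dtypes printed above show that the test file has no YEAR or LOCATION column, so a model that must score the test file can only rely on the four parent columns at prediction time. It is also worth checking whether every test parent appears in training. A sketch of both checks on made-up toy frames (the `*_toy` names and values are illustrative):

```python
import pandas as pd

train_toy = pd.DataFrame({
    "YEAR": [18, 17],
    "LOCATION": ["Loc_a", "Loc_b"],
    "INBRED": ["Inbred_1", "Inbred_2"],
    "TESTER": ["Tester_1", "Tester_2"],
    "YIELD": [1.0, 0.9],
})
test_toy = pd.DataFrame({
    "INBRED": ["Inbred_1", "Inbred_3"],
    "TESTER": ["Tester_2", "Tester_2"],
    "YIELD": [float("nan"), float("nan")],
})

# Columns present in training but absent from the test file
train_only = [c for c in train_toy.columns if c not in test_toy.columns]

# Test parents never observed in training (a cold-start case for any model)
unseen_inbreds = set(test_toy["INBRED"]) - set(train_toy["INBRED"])
unseen_testers = set(test_toy["TESTER"]) - set(train_toy["TESTER"])

print(train_only)
print(unseen_inbreds, unseen_testers)
```

Running the same checks with `train` and `test` in place of the toy frames would show how much of the holdout set involves parents with no training history.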


# CatBoost Model

In [3]:
from catboost import CatBoostRegressor

In [6]:
#check for missing data
train.isnull().sum()

YEAR              0
LOCATION          0
INBRED            0
INBRED_CLUSTER    0
TESTER            0
TESTER_CLUSTER    0
YIELD             0
dtype: int64

In [None]:
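#create model (assumption: the split was not defined above; this 80/20 split uses only the columns shared with the test set)
features = ['INBRED', 'INBRED_CLUSTER', 'TESTER', 'TESTER_CLUSTER']
X_train, X_validation, y_train, y_validation = train_test_split(train[features], train['YIELD'], test_size=0.2, random_state=42)
categorical_features_indices = list(range(len(features)))  #all four features are categorical
model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation), plot=True)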

In [None]:
#import lightgbm as lgb
#import xgboost as xgb

In [None]:
#evaluation - root mean squared error

In [None]:
#other libraries
#lightfm, recsys, turicreate, scipy, surprise, seaborn

In [None]:
#metrics
#accuracy, recall, mean reciprocal rank, mean average precision at cutoff k, mean square error

In [None]:
#data cleaning

In [None]:
#feature engineering

In [None]:
#modeling