It is a sample notebook for illustration purposes only. We recommend including the below cell with important candidate instructions.
You may need to update the OS and package versions based on the current environment.

### Environment
Ubuntu 22.04 LTS which includes **Python 3.9.12** and utilities *curl*, *git*, *vim*, *unzip*, *wget*, and *zip*. There is no *GPU* support.

The IPython Kernel allows you to execute Python code in the Notebook cell and Python console.

### Installing packages
- Run `!mamba list "package_name"` command to check the package installation status. For example,

```python
!mamba list numpy
"""
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
numpy                     1.21.6           py39h18676bf_0    conda-forge
"""
```

    You can also try importing the package.

- Run the `!mamba install "package_name"` to install a package

### Excluding large files
HackerRank rejects any submission larger than **20MB**. Therefore, you must exclude any large files by adding these to the *.gitignore* file.
You can **Submit** code to validate the status of your submission.

## Introduction

The Occupational Employment and Wage Statistics (OEWS) program produces employment and wage estimates annually for nearly 800 occupations. These estimates are available for the nation as a whole, for individual states, and for metropolitan and nonmetropolitan areas; national occupational estimates for specific industries are also available.

## Problem

The data used in this problem is a subset of the OEWS data, which include the 10-th percentile, 25-th percentile, 50-th percentile (a.k.a median), 75-th percentile, and 90-th percentile of the annual salary of a given combination of states, industries, and occupations.

One needs to use the data in _train.csv_ to train a machine learning model to predict the 10-th, 25-th, 50-th, 75-th and 90-th percentiles of the given combinations in _submission.csv_.

## Data

### Independent Variables

There are three independent variable columns:
- PRIM_STATE
- NAICS_TITLE
- OCC_TITLE

indicating the state, industry, and occupation.

NOTE:
- In the _PRIM_STATE_ variable, each category indicates a state postal abbreviation (like "_CA_", "_TX_", etc.) or "_U.S_" as the whole United States. When _PRIM_STATE_ is "_U.S_", it means the percentiles are aggregated across all the states.
- In thes _NAICS_TITLE_, each category indicates an industry sector name (like "_Retail Trade_", "_Manufacturing_") or "_Cross-industry_". When _NAICS_TITLE_ is "_Cross-industry_", it means the percentiles are aggregated across all the industries.

### Target Variables

There are 5 dependent (target) variable columns:
- A_PCT10
- A_PCT25
- A_MEDIAN
- A_PCT75
- A_PCT90

indicating the 10-th percentile, 25-th percentile, median, 75-th percentile, 90-th percentile of the annual base salary given the state, industry, and occupation information.

**IMPORTANT**: the percentiles should follow an increasing order. Namely, the 10-th percentile is less than (<) the 25-th percentile, the 25-th percentile is less than (<) the 50-th percentile, etc.

## Deliverables

### Submit a Well commented Jupyter Notebook

Explore the data, make visualizations, and generate new features if required. Make appropriate plots, annotate the notebook with markdowns and explain necessary inferences. A person should be able to read the notebook and understand the steps taken as well as the reasoning behind them. The solution will be graded on the basis of the usage of effective visualizations to convey the analysis and the modeling process.


### Submit _submission.csv_

In the given _submission.csv_, values in the "A_PCT10", "A_PCT25", "A_MEDIAN", "A_PCT75", and "A_PCT90" columns are constants, and you need to replace them with your model predictions.

**IMPORTANT**:
- please do not change the header given in _submission.csv_, or your predictions may not be evaluated correctly.
- Your Jupyter Notebook should be able to generate your submitted predictions.



## Evaluation Metric

The model performance is evaluated by the mean normalized weighted absolute error (MNWAE) defined as the following:
$$ MNWAE = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \in \{10, 25, 50, 75, 90\}} w_j \times \frac{|y_{i,j}-z_{i,j}|}{z_{i,j}}$$
where $y_{i,j}$ and $z_{i,j}$ are the model estimation and the ground truth of the $i$-th row and $j$-th percentile, and
$$ w_{10} = w_{90} = 0.1, $$
$$ w_{25} = w_{75} = 0.2, $$
$$ w_{50} = 0.4 $$

For example, if

actual percentiles = [10000, 30000, 60000, 80000, 100000],

predicted percentiles = [11000, 33000, 54000, 88000, 120000],

normalized weighted absolute error = 0.1*|11000-10000|/10000+0.2*|33000-30000|/30000+0.4*|54000-60000|/60000+0.2*|88000-80000|/80000+0.1*|120000-100000|/100000 = 0.11

**IMPORTANT**: if the predicted percentiles in any row do not follow an increasing order, all the predictions will be considered as invalid.

## Solution ..

In [1]:
# Import the `pandas` library to load the dataset
import pandas as pd

In [2]:
df_train = pd.read_csv('train.csv')

In [3]:
df_train.head()

Unnamed: 0,PRIM_STATE,NAICS_TITLE,OCC_TITLE,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90
0,US,"Arts, Entertainment, and Recreation",Supervisors of Transportation and Material Mov...,32350.0,40200.0,50790.0,62560.0,78520.0
1,US,"Mining, Quarrying, and Oil and Gas Extraction","Sales Representatives, Wholesale and Manufactu...",47860.0,61600.0,87810.0,107460.0,153600.0
2,US,Finance and Insurance,Physical Scientists,59240.0,63050.0,89740.0,126320.0,149070.0
3,US,Administrative and Support and Waste Managemen...,"Architects, Surveyors, and Cartographers",37320.0,47630.0,60550.0,77450.0,98990.0
4,US,Manufacturing,Supervisors of Protective Service Workers,50130.0,63840.0,81770.0,104530.0,133180.0


## Pre-process the data

In [4]:
import pandas as pd

In [5]:
print(df_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2297 entries, 0 to 2296
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PRIM_STATE   2297 non-null   object 
 1   NAICS_TITLE  2297 non-null   object 
 2   OCC_TITLE    2297 non-null   object 
 3   A_PCT10      2297 non-null   float64
 4   A_PCT25      2297 non-null   float64
 5   A_MEDIAN     2297 non-null   float64
 6   A_PCT75      2267 non-null   float64
 7   A_PCT90      2158 non-null   float64
dtypes: float64(5), object(3)
memory usage: 143.7+ KB
None


In [6]:
df_train["PRIM_STATE"].value_counts()

US    1498
FL      20
WV      19
DC      19
WA      19
SD      18
CO      18
NM      18
ME      18
NC      18
NH      18
DE      18
MI      18
AR      17
KY      17
GA      17
CA      17
MA      17
MN      17
ND      17
MO      17
UT      17
TX      16
AL      16
AZ      16
IL      16
ID      16
LA      16
MT      16
NJ      15
NV      15
NY      15
MS      15
OH      15
MD      15
HI      15
WI      14
IA      14
IN      14
TN      14
RI      14
VT      13
OK      13
CT      13
VA      13
SC      13
WY      13
PA      13
KS      13
AK      12
NE      12
OR      10
Name: PRIM_STATE, dtype: int64

In [7]:
df_train["NAICS_TITLE"].value_counts()

Cross-industry                                                                                                                         885
Federal, State, and Local Government, excluding state and local schools and hospitals and the U.S. Postal Service (OES Designation)     86
Other Services (except Public Administration)                                                                                           85
Health Care and Social Assistance                                                                                                       84
Management of Companies and Enterprises                                                                                                 84
Educational Services                                                                                                                    79
Administrative and Support and Waste Management and Remediation Services                                                                79
Manufacturing              

In [8]:
df_train.nunique()

PRIM_STATE       52
NAICS_TITLE      21
OCC_TITLE       116
A_PCT10        1402
A_PCT25        1514
A_MEDIAN       1634
A_PCT75        1721
A_PCT90        1747
dtype: int64

In [9]:
df_train.shape

(2297, 8)

In [10]:
df_train.dropna(inplace=True)

In [11]:
df_train.shape

(2158, 8)

In [12]:
df_Xtrain=pd.get_dummies(df_train,columns=['PRIM_STATE','NAICS_TITLE','OCC_TITLE'])

In [13]:
df1=df_Xtrain.copy()

In [14]:
df1.isnull().sum()

A_PCT10                                                                        0
A_PCT25                                                                        0
A_MEDIAN                                                                       0
A_PCT75                                                                        0
A_PCT90                                                                        0
                                                                              ..
OCC_TITLE_Tour and Travel Guides                                               0
OCC_TITLE_Transportation and Material Moving Occupations                       0
OCC_TITLE_Vehicle and Mobile Equipment Mechanics, Installers, and Repairers    0
OCC_TITLE_Water Transportation Workers                                         0
OCC_TITLE_Woodworkers                                                          0
Length: 194, dtype: int64

In [15]:
df1.drop(['A_PCT10','A_PCT25','A_MEDIAN','A_PCT75','A_PCT90'],axis=1,inplace=True)

In [16]:
X=df1
X.head()

Unnamed: 0,PRIM_STATE_AK,PRIM_STATE_AL,PRIM_STATE_AR,PRIM_STATE_AZ,PRIM_STATE_CA,PRIM_STATE_CO,PRIM_STATE_CT,PRIM_STATE_DC,PRIM_STATE_DE,PRIM_STATE_FL,...,OCC_TITLE_Supervisors of Protective Service Workers,OCC_TITLE_Supervisors of Sales Workers,OCC_TITLE_Supervisors of Transportation and Material Moving Workers,"OCC_TITLE_Textile, Apparel, and Furnishings Workers",OCC_TITLE_Top Executives,OCC_TITLE_Tour and Travel Guides,OCC_TITLE_Transportation and Material Moving Occupations,"OCC_TITLE_Vehicle and Mobile Equipment Mechanics, Installers, and Repairers",OCC_TITLE_Water Transportation Workers,OCC_TITLE_Woodworkers
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [17]:
y=df_train[['A_PCT10','A_PCT25','A_MEDIAN','A_PCT75','A_PCT90']]

In [18]:
y.head()

Unnamed: 0,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90
0,32350.0,40200.0,50790.0,62560.0,78520.0
1,47860.0,61600.0,87810.0,107460.0,153600.0
2,59240.0,63050.0,89740.0,126320.0,149070.0
3,37320.0,47630.0,60550.0,77450.0,98990.0
4,50130.0,63840.0,81770.0,104530.0,133180.0


In [19]:
type(y)

pandas.core.frame.DataFrame

In [20]:
y.isnull().sum()

A_PCT10     0
A_PCT25     0
A_MEDIAN    0
A_PCT75     0
A_PCT90     0
dtype: int64

In [21]:
X.shape, y.shape

((2158, 189), (2158, 5))

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

In [24]:
from sklearn.neighbors import KNeighborsRegressor
# define model
knn_model = KNeighborsRegressor()
# fit model
knn_model.fit(X_train,y_train)


KNeighborsRegressor()

In [25]:
y_pred= knn_model.predict(X_test)

In [26]:
knn_model.score(X_test,y_test)

0.34452482683254865

In [27]:
knn_model.score(X_train,y_train)

0.5396795701039122

In [28]:
from sklearn.tree import DecisionTreeRegressor
dt_model=DecisionTreeRegressor()


In [29]:
dt_model.fit(X_train,y_train)

DecisionTreeRegressor()

In [30]:
dt_model.score(X_test,y_test)

0.601676413443436

In [31]:
dt_model.score(X_train,y_train)

1.0

## LinearRegression

In [32]:
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()

In [33]:
lr_model.fit(X_train,y_train)

LinearRegression()

In [34]:
lr_model.score(X_test,y_test)

-9.983663838872562e+21

In [35]:
lr_model.score(X_train,y_train)

0.8380475590084091

In [36]:
y_predicted=lr_model.predict(X_test)

In [37]:
y_predicted

array([[ 17867.5  ,  20544.   ,  23552.   ,  29056.   ,  36992.   ],
       [ 22011.   ,  26288.   ,  35168.   ,  53568.   ,  82688.   ],
       [ 41222.75 ,  57056.   ,  80704.   , 106368.   , 135168.   ],
       ...,
       [ 21044.75 ,  26240.   ,  33952.   ,  43776.   ,  59008.   ],
       [ 26618.125,  31200.   ,  43008.   ,  60416.   ,  88832.   ],
       [ 43773.75 ,  58320.   ,  80640.   , 105600.   , 132352.   ]])

# Random Forest:


In [38]:
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(random_state=0)
regr.fit(X_train, y_train)


RandomForestRegressor(random_state=0)

In [39]:
regr.score(X_test,y_test)

0.6842870841698085

In [40]:
regr.score(X_train,y_train)

0.9552580898676428

In [41]:
y_predicted=regr.predict(X_test)

In [42]:
y_predicted

array([[ 21551. ,  25471.6,  30639.3,  36963.7,  47454.2],
       [ 19551.1,  23248.9,  29582.1,  47875.4,  78898.8],
       [ 43218.8,  58513.1,  79002.1, 102118.6, 129682. ],
       ...,
       [ 23376.3,  29084.9,  36854.8,  47362.4,  60563.2],
       [ 24003.5,  28951.9,  33509.7,  47849.1,  65451.1],
       [ 47795.5,  63094.8,  88283.1, 117052.4, 148768.2]])

# Random Forest With Cross Validation:

In [43]:
from sklearn.ensemble import RandomForestRegressor
test_scores=[]
train_scores=[]
for i in [50,100,150,200,250,300,350,400,450,500]:
  regr = RandomForestRegressor(n_estimators=i, random_state=0)
  regr.fit(X_train, y_train)
  test_scores.append(regr.score(X_test,y_test))
  train_scores.append(regr.score(X_train,y_train))

In [44]:
test_scores

[0.686804806414474,
 0.6842870841698085,
 0.6867394672430615,
 0.6857528425527659,
 0.6874319517204187,
 0.6883199825859517,
 0.6893802816085014,
 0.6899180839445751,
 0.6901822503572121,
 0.6899380176518345]

In [45]:
train_scores

[0.9528277419226143,
 0.9552580898676428,
 0.9558513114634619,
 0.9560470467472928,
 0.9565350959532462,
 0.9569008112135226,
 0.9570448213107344,
 0.9569551088169559,
 0.9570802294977978,
 0.9570438964450145]

In [46]:
import numpy as np

In [47]:
A=np.array([10000,30000,60000,80000,100000])
B=np.array([11000,33000,54000,88000,120000])
W=np.array([0.1,0.2,0.4,0.2,0.1])
NWAE=W[0]*abs(A[0]-B[0])/A[0]+W[1]*abs(A[1]-B[1])/A[1]+W[2]*abs(A[2]-B[2])/A[2]+W[3]*abs(A[3]-B[3])/A[3]+W[4]*abs(A[4]-B[4])/A[4]
NWAE

0.11000000000000001

In [48]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [49]:
rf = RandomForestRegressor(n_jobs=-1)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42)
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=   9.4s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=   1.0s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=   0.8s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=2000; total time=   2.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=2000; total time=   2.3s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=2000; total time=   2.3s
[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=1200;

RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(n_jobs=-1), n_iter=100,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42, verbose=2)

In [50]:
rf_random.score(X_test,y_test)

0.7474222366099188

In [51]:
rf_random.score(X_train,y_train)

0.8641482869828607

In [52]:
best_random = rf_random.best_estimator_

In [53]:
best_random.score(X_test,y_test)

0.7474222366099188

In [54]:
def np_to_df(n1):
    df=pd.DataFrame(n1, columns=[["A_PCT10_pred","A_PCT25_pred","A_MEDIAN_pred","A_PCT75_pred","A_PCT90_pred"]])
    return df

In [55]:
def concat_pred_true(df1,df2):
    df_concat=pd.concat([df1,df2],axis = 1)
    return df_concat

In [56]:
def MNWAE(df):
    df["10_"]=df.apply(lambda x :  0.1*(abs(x[5] - x[0])/x[0]) ,axis=1)
    df["25_"]=df.apply(lambda x :  0.2*(abs(x[6] - x[1])/x[1]) ,axis=1)
    df["50_"]=df.apply(lambda x :  0.4*(abs(x[7] - x[2])/x[2]) ,axis=1)
    df["75_"]=df.apply(lambda x :  0.2*(abs(x[8] - x[3])/x[3]) ,axis=1)
    df["90_"]=df.apply(lambda x :  0.1*(abs(x[9] - x[4])/x[4]) ,axis=1)
    mnwae=(df[[ '10_','25_', '50_', '75_', '90_']].sum().sum())/ len(df)
    return mnwae

In [57]:
y_pred=best_random.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.05282081075704844

# MultiOutputRegressor

In [58]:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
mul_out_model=MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
mul_out_model.fit(X_train, y_train)

MultiOutputRegressor(estimator=GradientBoostingRegressor(random_state=0))

In [59]:
mul_out_model.score(X_test,y_test)

0.5444217667262677

In [60]:
mul_out_model.score(X_train,y_train)

0.6097095989774154

In [61]:
y_pred=mul_out_model.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.044688901939194677

In [62]:
test_scores=[]
train_scores=[]
for i in range(800,1400,100):
  regr = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=i,random_state=0))
  regr.fit(X_train, y_train)
  test_scores.append(regr.score(X_test,y_test))
  train_scores.append(regr.score(X_train,y_train))

In [63]:
y_pred=regr.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.0550764007178113

In [64]:
test_scores

[0.7818819779337493,
 0.7851083118756667,
 0.7873706114207099,
 0.7887184751331441,
 0.7897706450947085,
 0.7905603127685528]

In [65]:
train_scores

[0.9296950434276032,
 0.9369330698161562,
 0.9432755911702853,
 0.9482345755410293,
 0.9524962310185238,
 0.9562029195556019]

# MultiOutputRegressor With Cross Validation: RandomizedSearchCV

In [66]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in range(1000,2000,100)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in range(1,31,1)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = list(np.linspace(0.1, 1.0, 10, endpoint=True))
# Minimum number of samples required at each leaf node
min_samples_leaf = list(np.linspace(0.1, 0.5, 5, endpoint=True))
# Create the random grid
random_grid = {'estimator__n_estimators': n_estimators,
               'estimator__max_features': max_features,
               'estimator__max_depth': max_depth,
               'estimator__min_samples_split': min_samples_split,
               'estimator__min_samples_leaf': min_samples_leaf
               }

In [67]:
multi_gb_Boost = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
rf_random_ = RandomizedSearchCV(estimator = multi_gb_Boost, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42)
# Fit the random search model
rf_random_.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END estimator__max_depth=24, estimator__max_features=sqrt, estimator__min_samples_leaf=0.2, estimator__min_samples_split=0.6, estimator__n_estimators=1400; total time=   1.9s
[CV] END estimator__max_depth=24, estimator__max_features=sqrt, estimator__min_samples_leaf=0.2, estimator__min_samples_split=0.6, estimator__n_estimators=1400; total time=   2.0s
[CV] END estimator__max_depth=24, estimator__max_features=sqrt, estimator__min_samples_leaf=0.2, estimator__min_samples_split=0.6, estimator__n_estimators=1400; total time=   1.9s
[CV] END estimator__max_depth=16, estimator__max_features=sqrt, estimator__min_samples_leaf=0.30000000000000004, estimator__min_samples_split=1.0, estimator__n_estimators=1500; total time=   2.1s
[CV] END estimator__max_depth=16, estimator__max_features=sqrt, estimator__min_samples_leaf=0.30000000000000004, estimator__min_samples_split=1.0, estimator__n_estimators=1500; total time=   2.1s
[CV] 

RandomizedSearchCV(cv=3,
                   estimator=MultiOutputRegressor(estimator=GradientBoostingRegressor(random_state=0)),
                   n_iter=100,
                   param_distributions={'estimator__max_depth': [1, 2, 3, 4, 5,
                                                                 6, 7, 8, 9, 10,
                                                                 11, 12, 13, 14,
                                                                 15, 16, 17, 18,
                                                                 19, 20, 21, 22,
                                                                 23, 24, 25, 26,
                                                                 27, 28, 29, 30, ...],
                                        'estimator__max_features': ['auto',
                                                                    'sqrt'],
                                        'estimator__min_samples_leaf': [0.1,
                                      

In [68]:
y_pred=rf_random_.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.038056897422428154

In [69]:
rf_random_.best_estimator_

MultiOutputRegressor(estimator=GradientBoostingRegressor(max_depth=None,
                                                         max_features='sqrt',
                                                         min_samples_leaf=0.30000000000000004,
                                                         min_samples_split=0.5,
                                                         n_estimators=1000,
                                                         random_state=0))

In [70]:
rf_random_.cv_results_

{'mean_fit_time': array([ 2.00549499,  2.09285323,  1.35849603,  5.55786029,  1.62354612,
         2.07079617,  1.57722608,  8.31202396, 15.45543019,  8.55247537,
         1.42353559,  1.24065844,  4.01951106,  1.32369383,  3.18383773,
         9.61441088,  7.0988756 ,  3.74614302, 13.0379173 ,  6.57985838,
         2.07691916,  2.57704234,  1.92870657,  9.69113739,  7.38403002,
         3.17621271,  6.12213198,  9.12526242,  2.6357801 ,  2.485461  ,
         2.52537314,  4.21847526,  2.51404524,  8.75747418,  4.82234581,
         2.68357038,  2.20688105,  2.57849964,  1.3630774 , 10.80802663,
         2.80696384,  2.40053503,  1.59304579,  6.63895726, 11.21573265,
        14.84314664, 14.50524513, 10.00739185,  2.54538663, 14.76765911,
        10.05846397,  1.41405884,  7.93248884,  2.24128723,  1.39252178,
         2.42326824,  1.43153079,  3.49260577,  1.70448565,  2.32661247,
         2.54486601,  1.67628336,  2.10416611, 10.87329388,  4.01850549,
        11.78851962, 15.67533882, 

In [71]:
rf_random_.score(X_test,y_test)

0.0002473768603242554

In [74]:
rf_random_.score(X_train,y_train)

0.005170803897183096

In [76]:
test_scores=[]
train_scores=[]
for i in range(800,1400,100):
    multi_gb_Boost = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=i, random_state=0))
    multi_gb_Boost.fit(X_train,y_train)
    test_scores.append(multi_gb_Boost.score(X_test,y_test))
    train_scores.append(multi_gb_Boost.score(X_train,y_train)) 

In [None]:
print("Train Scores: \n",train_scores)
print("Test Scores: \n",test_scores)

In [None]:
test_scores=[]
train_scores=[]
custom_metric = []
for i in range(1,31,1):
    multi_gb_Boost = MultiOutputRegressor(GradientBoostingRegressor(max_depth=i, random_state=0))
    multi_gb_Boost.fit(X_train,y_train)
    y_pred = multi_gb_Boost.predict(X_test)
    df_pred=np_to_df(y_pred)
    pd_concat=concat_pred_true(y_test,df_pred)
    custom_metric.append(MNWAE(pd_concat))
    test_scores.append(multi_gb_Boost.score(X_test,y_test))
    train_scores.append(multi_gb_Boost.score(X_train,y_train)) 

In [None]:
print("Train Scores: \n",train_scores)
print("Test Scores: \n",test_scores)
print("Custom Metric Scores: \n",custom_metric)

In [None]:
test_scores=[]
train_scores=[]
custom_metric = []
for i in list(np.linspace(0.1, 1.0, 10, endpoint=True)):
    multi_gb_Boost = MultiOutputRegressor(GradientBoostingRegressor(min_samples_split=i, random_state=0))
    multi_gb_Boost.fit(X_train,y_train)
    y_pred = multi_gb_Boost.predict(X_test)
    df_pred=np_to_df(y_pred)
    pd_concat=concat_pred_true(y_test,df_pred)
    custom_metric.append(MNWAE(pd_concat))
    test_scores.append(multi_gb_Boost.score(X_test,y_test))
    train_scores.append(multi_gb_Boost.score(X_train,y_train)) 

In [None]:
print("Train Scores: \n",train_scores)
print("Test Scores: \n",test_scores)
print("Custom Metric Scores: \n",custom_metric)

In [None]:
multi_gb_Boost

In [None]:
y_pred=multi_gb_Boost.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

In [None]:
import xgboost as xgb
clf = MultiOutputRegressor(xgb.XGBRegressor(random_state=0))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

In [None]:
test_scores=[]
train_scores=[]
custom_metric = []
for i in range(50,1000,50):
    xgb_ = MultiOutputRegressor(xgb.XGBRegressor(n_estimators=i, random_state=0))
    xgb_.fit(X_train,y_train)
    y_pred = xgb_.predict(X_test)
    df_pred=np_to_df(y_pred)
    pd_concat=concat_pred_true(y_test,df_pred)
    custom_metric.append(MNWAE(pd_concat))
    test_scores.append(xgb_.score(X_test,y_test))
    train_scores.append(xgb_.score(X_train,y_train))


In [None]:
print("Train Scores: \n",train_scores)
print("Test Scores: \n",test_scores)
print("Custom Metric Scores: \n",custom_metric)

In [None]:
import catboost as cb
cat_ = MultiOutputRegressor(cb.CatBoostRegressor(random_state=0))
cat_.fit(X_train, y_train)
y_pred = cat_.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)


In [None]:
test_scores=[]
train_scores=[]
custom_metric = []
for i in range(50,1000,50):
    CB_ = MultiOutputRegressor(cb.CatBoostRegressor(n_estimators=i, random_state=0))
    CB_.fit(X_train,y_train)
    y_pred = CB_.predict(X_test)
    df_pred=np_to_df(y_pred)
    pd_concat=concat_pred_true(y_test,df_pred)
    custom_metric.append(MNWAE(pd_concat))
    test_scores.append(CB_.score(X_test,y_test))
    train_scores.append(CB_.score(X_train,y_train))

In [None]:
print("Train Scores: \n",train_scores)
print("Test Scores: \n",test_scores)
print("Custom Metric Scores: \n",custom_metric)