It is a sample notebook for illustration purposes only. We recommend including the below cell with important candidate instructions.
You may need to update the OS and package versions based on the current environment.

### Environment
Ubuntu 22.04 LTS which includes **Python 3.9.12** and utilities *curl*, *git*, *vim*, *unzip*, *wget*, and *zip*. There is no *GPU* support.

The IPython Kernel allows you to execute Python code in the Notebook cell and Python console.

### Installing packages
- Run `!mamba list "package_name"` command to check the package installation status. For example,

```python
!mamba list numpy
"""
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
numpy                     1.21.6           py39h18676bf_0    conda-forge
"""
```

    You can also try importing the package.

- Run the `!mamba install "package_name"` to install a package

### Excluding large files
HackerRank rejects any submission larger than **20MB**. Therefore, you must exclude any large files by adding these to the *.gitignore* file.
You can **Submit** code to validate the status of your submission.

## Introduction

The Occupational Employment and Wage Statistics (OEWS) program produces employment and wage estimates annually for nearly 800 occupations. These estimates are available for the nation as a whole, for individual states, and for metropolitan and nonmetropolitan areas; national occupational estimates for specific industries are also available.

## Problem

The data used in this problem is a subset of the OEWS data, which include the 10-th percentile, 25-th percentile, 50-th percentile (a.k.a median), 75-th percentile, and 90-th percentile of the annual salary of a given combination of states, industries, and occupations.

One needs to use the data in _train.csv_ to train a machine learning model to predict the 10-th, 25-th, 50-th, 75-th and 90-th percentiles of the given combinations in _submission.csv_.

## Data

### Independent Variables

There are three independent variable columns:
- PRIM_STATE
- NAICS_TITLE
- OCC_TITLE

indicating the state, industry, and occupation.

NOTE:
- In the _PRIM_STATE_ variable, each category indicates a state postal abbreviation (like "_CA_", "_TX_", etc.) or "_U.S_" as the whole United States. When _PRIM_STATE_ is "_U.S_", it means the percentiles are aggregated across all the states.
- In thes _NAICS_TITLE_, each category indicates an industry sector name (like "_Retail Trade_", "_Manufacturing_") or "_Cross-industry_". When _NAICS_TITLE_ is "_Cross-industry_", it means the percentiles are aggregated across all the industries.

### Target Variables

There are 5 dependent (target) variable columns:
- A_PCT10
- A_PCT25
- A_MEDIAN
- A_PCT75
- A_PCT90

indicating the 10-th percentile, 25-th percentile, median, 75-th percentile, 90-th percentile of the annual base salary given the state, industry, and occupation information.

**IMPORTANT**: the percentiles should follow an increasing order. Namely, the 10-th percentile is less than (<) the 25-th percentile, the 25-th percentile is less than (<) the 50-th percentile, etc.

## Deliverables

### Submit a Well commented Jupyter Notebook

Explore the data, make visualizations, and generate new features if required. Make appropriate plots, annotate the notebook with markdowns and explain necessary inferences. A person should be able to read the notebook and understand the steps taken as well as the reasoning behind them. The solution will be graded on the basis of the usage of effective visualizations to convey the analysis and the modeling process.


### Submit _submission.csv_

In the given _submission.csv_, values in the "A_PCT10", "A_PCT25", "A_MEDIAN", "A_PCT75", and "A_PCT90" columns are constants, and you need to replace them with your model predictions.

**IMPORTANT**:
- please do not change the header given in _submission.csv_, or your predictions may not be evaluated correctly.
- Your Jupyter Notebook should be able to generate your submitted predictions.



## Evaluation Metric

The model performance is evaluated by the mean normalized weighted absolute error (MNWAE) defined as the following:
$$ MNWAE = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \in \{10, 25, 50, 75, 90\}} w_j \times \frac{|y_{i,j}-z_{i,j}|}{z_{i,j}}$$
where $y_{i,j}$ and $z_{i,j}$ are the model estimation and the ground truth of the $i$-th row and $j$-th percentile, and
$$ w_{10} = w_{90} = 0.1, $$
$$ w_{25} = w_{75} = 0.2, $$
$$ w_{50} = 0.4 $$

For example, if

actual percentiles = [10000, 30000, 60000, 80000, 100000],

predicted percentiles = [11000, 33000, 54000, 88000, 120000],

normalized weighted absolute error = 0.1*|11000-10000|/10000+0.2*|33000-30000|/30000+0.4*|54000-60000|/60000+0.2*|88000-80000|/80000+0.1*|120000-100000|/100000 = 0.11

**IMPORTANT**: if the predicted percentiles in any row do not follow an increasing order, all the predictions will be considered as invalid.

## Solution ..

# Importing Required Files:

In [1]:
# Import the `pandas` library to load the dataset
import pandas as pd
import numpy as np
df_train = pd.read_csv('train.csv')
display("Top 5 rows: ",df_train.head())

'Top 5 rows: '

Unnamed: 0,PRIM_STATE,NAICS_TITLE,OCC_TITLE,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90
0,US,"Arts, Entertainment, and Recreation",Supervisors of Transportation and Material Mov...,32350.0,40200.0,50790.0,62560.0,78520.0
1,US,"Mining, Quarrying, and Oil and Gas Extraction","Sales Representatives, Wholesale and Manufactu...",47860.0,61600.0,87810.0,107460.0,153600.0
2,US,Finance and Insurance,Physical Scientists,59240.0,63050.0,89740.0,126320.0,149070.0
3,US,Administrative and Support and Waste Managemen...,"Architects, Surveyors, and Cartographers",37320.0,47630.0,60550.0,77450.0,98990.0
4,US,Manufacturing,Supervisors of Protective Service Workers,50130.0,63840.0,81770.0,104530.0,133180.0


# Pre-process the data

In [2]:
print(df_train.info())
print(df_train["PRIM_STATE"].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2297 entries, 0 to 2296
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PRIM_STATE   2297 non-null   object 
 1   NAICS_TITLE  2297 non-null   object 
 2   OCC_TITLE    2297 non-null   object 
 3   A_PCT10      2297 non-null   float64
 4   A_PCT25      2297 non-null   float64
 5   A_MEDIAN     2297 non-null   float64
 6   A_PCT75      2267 non-null   float64
 7   A_PCT90      2158 non-null   float64
dtypes: float64(5), object(3)
memory usage: 143.7+ KB
None
US    1498
FL      20
WV      19
DC      19
WA      19
SD      18
CO      18
NM      18
ME      18
NC      18
NH      18
DE      18
MI      18
AR      17
KY      17
GA      17
CA      17
MA      17
MN      17
ND      17
MO      17
UT      17
TX      16
AL      16
AZ      16
IL      16
ID      16
LA      16
MT      16
NJ      15
NV      15
NY      15
MS      15
OH      15
MD      15
HI      15
WI      14
IA      

In [3]:
df_train["NAICS_TITLE"].value_counts()

Cross-industry                                                                                                                         885
Federal, State, and Local Government, excluding state and local schools and hospitals and the U.S. Postal Service (OES Designation)     86
Other Services (except Public Administration)                                                                                           85
Health Care and Social Assistance                                                                                                       84
Management of Companies and Enterprises                                                                                                 84
Educational Services                                                                                                                    79
Administrative and Support and Waste Management and Remediation Services                                                                79
Manufacturing              

In [4]:
print(df_train.nunique())
print(df_train.shape)

PRIM_STATE       52
NAICS_TITLE      21
OCC_TITLE       116
A_PCT10        1402
A_PCT25        1514
A_MEDIAN       1634
A_PCT75        1721
A_PCT90        1747
dtype: int64
(2297, 8)


In [5]:
df_train.dropna(inplace=True)
print(df_train.shape)

(2158, 8)


In [6]:
df_Xtrain=pd.get_dummies(df_train,columns=['PRIM_STATE','NAICS_TITLE','OCC_TITLE'])
df1=df_Xtrain.copy()
print(df1.isnull().sum())

A_PCT10                                                                        0
A_PCT25                                                                        0
A_MEDIAN                                                                       0
A_PCT75                                                                        0
A_PCT90                                                                        0
                                                                              ..
OCC_TITLE_Tour and Travel Guides                                               0
OCC_TITLE_Transportation and Material Moving Occupations                       0
OCC_TITLE_Vehicle and Mobile Equipment Mechanics, Installers, and Repairers    0
OCC_TITLE_Water Transportation Workers                                         0
OCC_TITLE_Woodworkers                                                          0
Length: 194, dtype: int64


In [7]:
df1.drop(['A_PCT10','A_PCT25','A_MEDIAN','A_PCT75','A_PCT90'],axis=1,inplace=True)
X=df1
y=df_train[['A_PCT10','A_PCT25','A_MEDIAN','A_PCT75','A_PCT90']]
X.head()

Unnamed: 0,PRIM_STATE_AK,PRIM_STATE_AL,PRIM_STATE_AR,PRIM_STATE_AZ,PRIM_STATE_CA,PRIM_STATE_CO,PRIM_STATE_CT,PRIM_STATE_DC,PRIM_STATE_DE,PRIM_STATE_FL,...,OCC_TITLE_Supervisors of Protective Service Workers,OCC_TITLE_Supervisors of Sales Workers,OCC_TITLE_Supervisors of Transportation and Material Moving Workers,"OCC_TITLE_Textile, Apparel, and Furnishings Workers",OCC_TITLE_Top Executives,OCC_TITLE_Tour and Travel Guides,OCC_TITLE_Transportation and Material Moving Occupations,"OCC_TITLE_Vehicle and Mobile Equipment Mechanics, Installers, and Repairers",OCC_TITLE_Water Transportation Workers,OCC_TITLE_Woodworkers
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [8]:
y.head()

Unnamed: 0,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90
0,32350.0,40200.0,50790.0,62560.0,78520.0
1,47860.0,61600.0,87810.0,107460.0,153600.0
2,59240.0,63050.0,89740.0,126320.0,149070.0
3,37320.0,47630.0,60550.0,77450.0,98990.0
4,50130.0,63840.0,81770.0,104530.0,133180.0


In [9]:
y.isnull().sum()

A_PCT10     0
A_PCT25     0
A_MEDIAN    0
A_PCT75     0
A_PCT90     0
dtype: int64

In [10]:
X.shape, y.shape

((2158, 189), (2158, 5))

# Splitting The Dataset:

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)

## Handler Functions:

In [12]:
def np_to_df(n1):
    df=pd.DataFrame(n1, columns=[["A_PCT10_pred","A_PCT25_pred","A_MEDIAN_pred","A_PCT75_pred","A_PCT90_pred"]])
    return df

In [13]:
def concat_pred_true(df1,df2):
    df_concat=pd.concat([df1,df2],axis = 1)
    return df_concat

## Mean normalized weighted absolute error (MNWAE)



In [14]:
def MNWAE(df):
    df["10_"]=df.apply(lambda x :  0.1*(abs(x[5] - x[0])/x[0]) ,axis=1)
    df["25_"]=df.apply(lambda x :  0.2*(abs(x[6] - x[1])/x[1]) ,axis=1)
    df["50_"]=df.apply(lambda x :  0.4*(abs(x[7] - x[2])/x[2]) ,axis=1)
    df["75_"]=df.apply(lambda x :  0.2*(abs(x[8] - x[3])/x[3]) ,axis=1)
    df["90_"]=df.apply(lambda x :  0.1*(abs(x[9] - x[4])/x[4]) ,axis=1)
    mnwae=(df[[ '10_','25_', '50_', '75_', '90_']].sum().sum())/ len(df)
    return mnwae

# Different Regressor Algorithms:

## KNN Regressor:

In [15]:
from sklearn.neighbors import KNeighborsRegressor
# define model
knn_model = KNeighborsRegressor()
# fit model
knn_model.fit(X_train,y_train)

KNeighborsRegressor()

In [16]:
y_pred= knn_model.predict(X_test)
knn_model.score(X_test,y_test), knn_model.score(X_train,y_train)

(0.34452482683254865, 0.5396795701039122)

In [17]:
y_pred=knn_model.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.046287707978833796

## Decision Tree Regressor:

In [18]:
from sklearn.tree import DecisionTreeRegressor
dt_model=DecisionTreeRegressor()
dt_model.fit(X_train,y_train)
dt_model.score(X_train,y_train), dt_model.score(X_test,y_test)


(1.0, 0.5938759833122635)

In [19]:
y_pred=dt_model.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.05532476302179066

# Random Forest:


In [20]:
from sklearn.ensemble import RandomForestRegressor
rf_def = RandomForestRegressor(random_state=0)
rf_def.fit(X_train, y_train)
rf_def.score(X_train,y_train), rf_def.score(X_test,y_test)


(0.9552580898676428, 0.6842870841698085)

In [21]:
y_pred=rf_def.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.05336389699396331

# Random Forest With Cross Validation:

In [22]:
from sklearn.ensemble import RandomForestRegressor
test_scores=[]
train_scores=[]
for i in range(50,550,50):
  regr = RandomForestRegressor(n_estimators=i, random_state=0)
  regr.fit(X_train, y_train)
  test_scores.append(regr.score(X_test,y_test))
  train_scores.append(regr.score(X_train,y_train))
print("Train Scores:",train_scores)
print("Test Scores:",test_scores)


Train Scores: [0.9528277419226143, 0.9552580898676428, 0.9558513114634619, 0.9560470467472928, 0.9565350959532462, 0.9569008112135226, 0.9570448213107344, 0.9569551088169559, 0.9570802294977978, 0.9570438964450145]
Test Scores: [0.686804806414474, 0.6842870841698085, 0.6867394672430615, 0.6857528425527659, 0.6874319517204187, 0.6883199825859517, 0.6893802816085014, 0.6899180839445751, 0.6901822503572121, 0.6899380176518345]


In [23]:
rf_450 = RandomForestRegressor(n_estimators=450, random_state=0)
rf_450.fit(X_train, y_train)
test_score = rf_450.score(X_test,y_test)
train_score = rf_450.score(X_train,y_train)
train_score, test_score

(0.9570802294977978, 0.6901822503572121)

In [24]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomForestRegressor(n_jobs=-1)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.score(X_train,y_train), rf_random.score(X_test,y_test)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=  10.6s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=   0.9s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=400; total time=   0.9s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=2000; total time=   3.3s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=2000; total time=   3.0s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=2000; total time=   2.8s
[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=1200;

(0.8621082317174362, 0.7481537195622828)

In [25]:
y_pred=rf_random.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.0527651955460226

# MultiOutputRegressor

In [26]:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
mul_out_model_def=MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
mul_out_model_def.fit(X_train, y_train)
mul_out_model_def.score(X_train,y_train) , mul_out_model_def.score(X_test,y_test)

(0.6097095989774154, 0.5444217667262677)

In [27]:
y_pred=mul_out_model_def.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.044688901939194677

In [28]:
test_scores=[]
train_scores=[]
for i in range(800,1400,100):
  regr = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=i,random_state=0))
  regr.fit(X_train, y_train)
  test_scores.append(regr.score(X_test,y_test))
  train_scores.append(regr.score(X_train,y_train))
print("Train Scores:",train_scores)
print("Test Scores:",test_scores)

Train Scores: [0.9296950434276032, 0.9369330698161562, 0.9432755911702853, 0.9482345755410293, 0.9524962310185238, 0.9562029195556019]
Test Scores: [0.7818819779337493, 0.7851083118756667, 0.7873706114207099, 0.7887184751331441, 0.7897706450947085, 0.7905603127685528]


In [29]:
mul_regr_1300 = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=1300,random_state=0))
mul_regr_1300.fit(X_train, y_train)
y_pred=mul_regr_1300.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.0550764007178113

# MultiOutputRegressor With Gradient Boosting & Cross Validation: RandomizedSearchCV

In [30]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in range(1000,2000,100)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in range(1,31,1)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = list(np.linspace(0.1, 1.0, 10, endpoint=True))
# Minimum number of samples required at each leaf node
min_samples_leaf = list(np.linspace(0.1, 0.5, 5, endpoint=True))
# Create the random grid
random_grid = {'estimator__n_estimators': n_estimators,
               'estimator__max_features': max_features,
               'estimator__max_depth': max_depth,
               'estimator__min_samples_split': min_samples_split,
               'estimator__min_samples_leaf': min_samples_leaf
               }
multi_gb_Boost = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
rf_random_mult = RandomizedSearchCV(estimator = multi_gb_Boost, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42)
# Fit the random search model
rf_random_mult.fit(X_train, y_train)
rf_random_mult.score(X_train,y_train), rf_random_mult.score(X_test,y_test)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END estimator__max_depth=24, estimator__max_features=sqrt, estimator__min_samples_leaf=0.2, estimator__min_samples_split=0.6, estimator__n_estimators=1400; total time=   2.1s
[CV] END estimator__max_depth=24, estimator__max_features=sqrt, estimator__min_samples_leaf=0.2, estimator__min_samples_split=0.6, estimator__n_estimators=1400; total time=   2.4s
[CV] END estimator__max_depth=24, estimator__max_features=sqrt, estimator__min_samples_leaf=0.2, estimator__min_samples_split=0.6, estimator__n_estimators=1400; total time=   2.1s
[CV] END estimator__max_depth=16, estimator__max_features=sqrt, estimator__min_samples_leaf=0.30000000000000004, estimator__min_samples_split=1.0, estimator__n_estimators=1500; total time=   2.1s
[CV] END estimator__max_depth=16, estimator__max_features=sqrt, estimator__min_samples_leaf=0.30000000000000004, estimator__min_samples_split=1.0, estimator__n_estimators=1500; total time=   2.2s
[CV] 

(0.005170803897183096, 0.0002473768603242554)

In [31]:
y_pred=rf_random_mult.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.038056897422428154

## XGBOOST:

In [32]:
import xgboost as xgb
xgb_def = MultiOutputRegressor(xgb.XGBRegressor(random_state=0))
xgb_def.fit(X_train, y_train)
y_pred = xgb_def.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)

0.054916473740117974

## XGBOOST WITH CROSS VALIDATION:

In [33]:
test_scores=[]
train_scores=[]
custom_metric = []
for i in range(50,1000,50):
    xgb_rand = MultiOutputRegressor(xgb.XGBRegressor(n_estimators=i, random_state=0))
    xgb_rand.fit(X_train,y_train)
    y_pred = xgb_rand.predict(X_test)
    df_pred=np_to_df(y_pred)
    pd_concat=concat_pred_true(y_test,df_pred)
    custom_metric.append(MNWAE(pd_concat))
    test_scores.append(xgb_rand.score(X_test,y_test))
    train_scores.append(xgb_rand.score(X_train,y_train))
print("Train Scores: \n",train_scores)
print("Test Scores: \n",test_scores)
print("Custom Metric Scores: \n",custom_metric)


Train Scores: 
 [0.8300322147643451, 0.9179004181001279, 0.9482773413255018, 0.964980338011987, 0.9746011650626338, 0.980042028743261, 0.9850028963200776, 0.9886630709840336, 0.9922699882724153, 0.9944244286635712, 0.9959528903608724, 0.9970764692399927, 0.9979765471462538, 0.9985477977034675, 0.9989359448301995, 0.9992414620641339, 0.9994624521583771, 0.9996162295182389, 0.99971877547652]
Test Scores: 
 [0.7135147780324617, 0.7635090392720414, 0.7701904907095999, 0.7701056109080728, 0.769650410837474, 0.769355578930876, 0.769080822684411, 0.769247231605011, 0.769188547745953, 0.769113197791043, 0.7692784262799413, 0.769449836619349, 0.7693552275266714, 0.7690774971385077, 0.769095474595224, 0.7690208645020405, 0.7688769950470317, 0.768852640726245, 0.7688561132420506]
Custom Metric Scores: 
 [0.051512579273910133, 0.054916473740117974, 0.0563209354308981, 0.05701932763863262, 0.05733258658142079, 0.057444631406605044, 0.057433054141481754, 0.05743093474215707, 0.05735628532087683, 0.0

## CATBOOST:


In [34]:
import catboost as cb
cat_def = MultiOutputRegressor(cb.CatBoostRegressor(random_state=0))
cat_def.fit(X_train, y_train)
y_pred = cat_def.predict(X_test)
df_pred=np_to_df(y_pred)
pd_concat=concat_pred_true(y_test,df_pred)
MNWAE(pd_concat)


Learning rate set to 0.044631
0:	learn: 10642.7412640	total: 155ms	remaining: 2m 34s
1:	learn: 10578.4318311	total: 164ms	remaining: 1m 21s
2:	learn: 10532.1151528	total: 174ms	remaining: 57.9s
3:	learn: 10449.2173911	total: 183ms	remaining: 45.5s
4:	learn: 10370.2669788	total: 194ms	remaining: 38.5s
5:	learn: 10318.6917243	total: 206ms	remaining: 34.1s
6:	learn: 10272.4164578	total: 215ms	remaining: 30.5s
7:	learn: 10213.6557869	total: 224ms	remaining: 27.7s
8:	learn: 10160.7790739	total: 233ms	remaining: 25.6s
9:	learn: 10103.1999505	total: 242ms	remaining: 23.9s
10:	learn: 10059.3040742	total: 251ms	remaining: 22.6s
11:	learn: 10024.6114513	total: 259ms	remaining: 21.4s
12:	learn: 9979.2019209	total: 270ms	remaining: 20.5s
13:	learn: 9949.2247143	total: 278ms	remaining: 19.6s
14:	learn: 9916.2869159	total: 287ms	remaining: 18.8s
15:	learn: 9875.8898805	total: 300ms	remaining: 18.5s
16:	learn: 9838.2192499	total: 314ms	remaining: 18.1s
17:	learn: 9812.8751908	total: 319ms	remaining: 

0.05321870648983786

## CATBOOST WITH CROSS VALIDATION:

In [35]:
test_scores=[]
train_scores=[]
custom_metric = []
for i in range(50,1000,50):
    CB_rand = MultiOutputRegressor(cb.CatBoostRegressor(n_estimators=i, random_state=0))
    CB_rand.fit(X_train,y_train)
    y_pred = CB_rand.predict(X_test)
    df_pred=np_to_df(y_pred)
    pd_concat=concat_pred_true(y_test,df_pred)
    custom_metric.append(MNWAE(pd_concat))
    test_scores.append(CB_rand.score(X_test,y_test))
    train_scores.append(CB_rand.score(X_train,y_train))
print("Train Scores:",train_scores)
print("Test Scores:",test_scores)
print("Custom Metric Scores:",custom_metric)

Learning rate set to 0.5
0:	learn: 10121.8794880	total: 4.47ms	remaining: 219ms
1:	learn: 9604.3883440	total: 9.83ms	remaining: 236ms
2:	learn: 9432.5511112	total: 14.5ms	remaining: 228ms
3:	learn: 9138.5282091	total: 19.6ms	remaining: 225ms
4:	learn: 8833.7141962	total: 24.6ms	remaining: 221ms
5:	learn: 8542.8077656	total: 29.4ms	remaining: 215ms
6:	learn: 8395.4583557	total: 34.3ms	remaining: 210ms
7:	learn: 8229.8001321	total: 39.9ms	remaining: 210ms
8:	learn: 8057.4413645	total: 44.4ms	remaining: 202ms
9:	learn: 7912.5490769	total: 48.9ms	remaining: 195ms
10:	learn: 7715.9683897	total: 54.2ms	remaining: 192ms
11:	learn: 7574.9861558	total: 59.3ms	remaining: 188ms
12:	learn: 7429.4561484	total: 63.7ms	remaining: 181ms
13:	learn: 7288.5089450	total: 68.7ms	remaining: 177ms
14:	learn: 7136.0233997	total: 73.4ms	remaining: 171ms
15:	learn: 7006.6109338	total: 78.3ms	remaining: 166ms
16:	learn: 6882.2272870	total: 84.4ms	remaining: 164ms
17:	learn: 6754.9689346	total: 88.7ms	remaining: 

## Implementation Of Models On Submission Dataset:

In [36]:
df_sub=pd.read_csv('submission.csv')
df_sub1=df_sub.copy()
df_sub1 = df_sub1[['PRIM_STATE', 'NAICS_TITLE', 'OCC_TITLE']]
display(df_sub1.head())
df_sub1.nunique()

Unnamed: 0,PRIM_STATE,NAICS_TITLE,OCC_TITLE
0,US,Accommodation and Food Services,Other Production Occupations
1,NE,Cross-industry,"Arts, Design, Entertainment, Sports, and Media..."
2,US,Manufacturing,Construction and Extraction Occupations
3,US,Wholesale Trade,Material Moving Workers
4,US,Other Services (except Public Administration),Supervisors of Building and Grounds Cleaning a...


PRIM_STATE      52
NAICS_TITLE     21
OCC_TITLE      115
dtype: int64

In [37]:
df_=pd.read_csv('train.csv')
df_ = df_[['PRIM_STATE', 'NAICS_TITLE', 'OCC_TITLE']]
print(df_.nunique())
print("The extra category in OCC_TITLE:- ", set(list(df_['OCC_TITLE'])) - set(list(df_sub1['OCC_TITLE'])) )
df_=pd.get_dummies(df_)
display(df_.head())

PRIM_STATE      52
NAICS_TITLE     21
OCC_TITLE      116
dtype: int64
The extra category in OCC_TITLE:-  {'Lawyers, Judges, and Related Workers'}


Unnamed: 0,PRIM_STATE_AK,PRIM_STATE_AL,PRIM_STATE_AR,PRIM_STATE_AZ,PRIM_STATE_CA,PRIM_STATE_CO,PRIM_STATE_CT,PRIM_STATE_DC,PRIM_STATE_DE,PRIM_STATE_FL,...,OCC_TITLE_Supervisors of Protective Service Workers,OCC_TITLE_Supervisors of Sales Workers,OCC_TITLE_Supervisors of Transportation and Material Moving Workers,"OCC_TITLE_Textile, Apparel, and Furnishings Workers",OCC_TITLE_Top Executives,OCC_TITLE_Tour and Travel Guides,OCC_TITLE_Transportation and Material Moving Occupations,"OCC_TITLE_Vehicle and Mobile Equipment Mechanics, Installers, and Repairers",OCC_TITLE_Water Transportation Workers,OCC_TITLE_Woodworkers
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [38]:
df_sub1=pd.get_dummies(df_sub1)
# Get missing columns in the training test
missing_cols = set( df_.columns ) - set( df_sub1.columns )
# Add a missing column in test set with default value equal to 0
for c in missing_cols:
    df_sub1[c] = 0
# Ensure the order of column in the test set is in the same order than in train set
df_sub1 = df_sub1[df_.columns]
(df_sub1.columns == df_.columns).all()

True

In [41]:
models = [knn_model, dt_model, rf_def, rf_450, rf_random, mul_out_model_def,
mul_regr_1300, rf_random_mult, xgb_def, xgb_rand, cat_def, CB_rand ]
scores = []
for i in models:
    y_pred = i.predict(X_test)
    df_pred=np_to_df(y_pred)
    pd_concat=concat_pred_true(y_test,df_pred)
    scores.append(MNWAE(pd_concat))


In [49]:
best_model = models[scores.index(min(scores))]

In [47]:
scores

[0.046287707978833796,
 0.05532476302179066,
 0.05336389699396331,
 0.05342246623413527,
 0.0527651955460226,
 0.044688901939194677,
 0.0550764007178113,
 0.038056897422428154,
 0.054916473740117974,
 0.057122076123850375,
 0.05321870648983786,
 0.05322111751349414]

In [50]:
 best_model_out = (best_model.predict(df_sub1))
print(best_model_out.shape)
print(best_model.score(X_test,y_test))

original = pd.read_csv('submission.csv')
submission_file = original.copy()
submission_file[['A_PCT10', 'A_PCT25','A_MEDIAN', 'A_PCT75', 'A_PCT90']] = best_model_out
submission_file.head()

(926, 5)
0.0002473768603242554


Unnamed: 0,PRIM_STATE,NAICS_TITLE,OCC_TITLE,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90
0,US,Accommodation and Food Services,Other Production Occupations,33155.445784,41193.237177,53007.537401,68353.482546,87245.654596
1,NE,Cross-industry,"Arts, Design, Entertainment, Sports, and Media...",30615.643051,38325.52082,50487.831667,67513.671473,89480.191129
2,US,Manufacturing,Construction and Extraction Occupations,33155.445784,41193.237177,53007.537401,68353.482546,87245.654596
3,US,Wholesale Trade,Material Moving Workers,33155.445784,41193.237177,53007.537401,68353.482546,87245.654596
4,US,Other Services (except Public Administration),Supervisors of Building and Grounds Cleaning a...,33155.445784,41193.237177,53007.537401,68353.482546,87245.654596
