# Machine Learning
# Assignment 2. Task 4
## Oleh Lukianykhin, Yevhen Pozdniakov

# Indoor localization

An indoor positioning system (IPS) locates objects or people inside a building using radio waves, magnetic fields, acoustic signals, or other sensory information collected by mobile devices. There are several commercial systems on the market, but there is no standard for IPS.

IPSes use different technologies, including distance measurement to nearby anchor nodes (nodes with known positions, e.g., WiFi access points), magnetic positioning, and dead reckoning. They either actively locate mobile devices and tags or provide ambient location or environmental context for devices to be sensed.

According to the [report](https://www.marketsandmarkets.com/Market-Reports/indoor-positioning-navigation-ipin-market-989.html), the global indoor location market size is expected to grow from USD 7.11 Billion in 2017 to USD 40.99 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 42.0% during the forecast period. Hassle-free navigation, improved decision-making, and increased adoption of connected devices are boosting the growth of the indoor location market across the globe.

In this problem, you are going to use signals from seven different WiFi access points to determine which room the user is in.

In [3]:
import pandas
import numpy as np
import xgboost as xgb

In [4]:
random_seed = 17
np.random.seed(random_seed)

Loading the training and cross-validation sets and separating the features from the labels.

In [5]:
train_set = pandas.read_csv('train_set.csv')
cv_set = pandas.read_csv('cv_set.csv')

train_data = train_set[['wifi'+str(i) for i in range(1, len(train_set.columns) - 1)]]
train_labels = train_set['room']
cv_data = cv_set[['wifi'+str(i) for i in range(1, len(cv_set.columns) - 1)]]
cv_labels = cv_set['room']

In [6]:
print(train_data[:10])
print(train_labels[:10])


   wifi1  wifi2  wifi3  wifi4  wifi5  wifi6  wifi7
0    -68    -57    -61    -65    -71    -85    -85
1    -63    -60    -60    -67    -76    -85    -84
2    -61    -60    -68    -62    -77    -90    -80
3    -65    -61    -65    -67    -69    -87    -84
4    -61    -63    -58    -66    -74    -87    -82
5    -62    -60    -66    -68    -80    -86    -91
6    -65    -59    -61    -67    -72    -86    -81
7    -63    -57    -61    -65    -73    -84    -84
8    -66    -60    -65    -62    -70    -85    -83
9    -67    -60    -59    -61    -71    -86    -91
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: room, dtype: int64


In [7]:
print(cv_data[:10])
print(cv_labels[:10])

   wifi1  wifi2  wifi3  wifi4  wifi5  wifi6  wifi7
0    -64    -56    -61    -66    -71    -82    -81
1    -63    -65    -60    -63    -77    -81    -87
2    -64    -55    -63    -66    -76    -88    -83
3    -65    -60    -59    -63    -76    -86    -82
4    -67    -61    -62    -67    -77    -83    -91
5    -61    -59    -65    -63    -74    -89    -87
6    -63    -56    -63    -65    -72    -82    -89
7    -66    -59    -64    -68    -68    -97    -83
8    -67    -57    -64    -71    -75    -89    -87
9    -63    -57    -59    -67    -71    -82    -93
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: room, dtype: int64


### Training an XGBoost classifier

### First, let's try the default parameters for gradient boosting:

In [8]:
xgb_train = xgb.DMatrix(train_data, train_labels, feature_names=train_data.columns)
xgb_test = xgb.DMatrix(cv_data, cv_labels, feature_names=cv_data.columns)
parameters = {
    "seed": random_seed,  # XGBoost's parameter name is `seed`
    'num_class':7,
    'silent': 1,
}

In [9]:
%%time
model = xgb.train(parameters, xgb_train)

Wall time: 907 ms


In [10]:
pred = model.predict(xgb_test)

error_rate = np.sum(pred != cv_labels) / cv_labels.shape[0]
print('Test error = {}'.format(error_rate))

Test error = 0.022670025188916875


### We obtained pretty good results: 2.27% prediction errors.
However, we would like to improve the prediction quality, so we will try parameters other than the default ones.
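The error rate reported throughout this notebook is the fraction of samples whose predicted room differs from the true one. A minimal sketch of that computation follows; the helper name `error_rate` and the toy arrays are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def error_rate(pred, labels):
    """Fraction of samples where the predicted class differs from the true class."""
    pred = np.asarray(pred)
    labels = np.asarray(labels)
    return float(np.mean(pred != labels))

# Illustrative values only. Predictions are often returned as floats (e.g. 2.0);
# they still compare equal to integer labels.
toy_pred = np.array([1.0, 2.0, 2.0, 4.0, 3.0])
toy_labels = np.array([1, 2, 3, 4, 3])
toy_error = error_rate(toy_pred, toy_labels)
print("Toy error rate = {}".format(toy_error))
```

Using `np.mean` on the boolean mask is equivalent to dividing the number of mismatches by the number of samples, as done in the cells of this notebook.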

### Tuning hyperparameters

### First, we just try other parameters manually: the learning rate `eta` and the maximum depth of the grown trees `max_depth`
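For intuition about `eta`: in gradient boosting every new tree's contribution is scaled by the learning rate before it is added to the current prediction. The sketch below is a toy version of that idea for squared error, with a constant in place of a tree; the function name and the targets are assumptions made for illustration, and this is not how XGBoost is implemented internally.

```python
import numpy as np

def boost_constant_learner(y, eta, n_rounds):
    """Toy gradient boosting for squared error with a constant base learner.

    Each round fits the mean of the current residuals and adds it to the
    prediction, scaled by the learning rate eta.
    """
    prediction = np.zeros_like(y, dtype=float)
    for _ in range(n_rounds):
        residual = y - prediction
        step = residual.mean()      # best constant fit to the residuals
        prediction += eta * step    # shrink the update by eta
    return prediction

y = np.array([2.0, 4.0, 6.0])       # illustrative targets, mean = 4
slow = boost_constant_learner(y, eta=0.1, n_rounds=10)
fast = boost_constant_learner(y, eta=0.5, n_rounds=10)
print(slow[0], fast[0])
```

With only a few rounds, a small `eta` leaves the prediction further from the target mean than a large one. This matters here because `xgb.train` runs 10 boosting rounds unless `num_boost_round` is passed.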

In [14]:
%%time
parameters = {
    'objective': 'multi:softmax',
    'eta': 0.1,
    "nthread": 3,
    'silent': 1,
    "seed": 1,
    'num_class':7,
    'max_depth':5 
}
model = xgb.train(parameters, xgb_train)
pred = model.predict(xgb_test)

error_rate = np.sum(pred != cv_labels) / cv_labels.shape[0]
print('Test error = {}'.format(error_rate))

Test error = 0.027707808564231738
Wall time: 51.9 ms


In [15]:
%%time
parameters = {
    'objective': 'multi:softmax',
    'eta': 0.5,
    "nthread": 3,
    'silent': 1,
    "seed": 1,
    'num_class':7,
    'max_depth':3 
}
model = xgb.train(parameters, xgb_train)
pred = model.predict(xgb_test)

error_rate = np.sum(pred != cv_labels) / cv_labels.shape[0]
print('Test error = {}'.format(error_rate))

Test error = 0.015113350125944584
Wall time: 42.8 ms


### We see that in the first case we got a worse result on the test data (2.77% errors), while in the second case the prediction quality became noticeably better (1.51% errors).

#### What conclusion can we draw? Obviously, just trying certain combinations of parameters manually is not a good approach.

## Thus we will try to find proper parameters using cross-validation
### First, we tried different values of `eta`:

In [16]:
%%time

for eta in np.arange(0,1,0.01):
    parameters = {
        #default
        'objective': 'multi:softmax',
        'silent': 1,
        "nthread": 3,
        "seed": 1,
        "eval_metric": 'mlogloss',
        'num_class':7,
        'eta':eta
    }
    results = xgb.cv(parameters, xgb_train, num_boost_round = 100, early_stopping_rounds=10,
                     nfold=5, stratified=True, seed=random_seed, show_stdv=False, 
                     verbose_eval=False)
    print("eta={}, test logloss ={}".format(eta, results.iloc[-1]['test-mlogloss-mean']))

eta=0.0, test logloss =1.94591
eta=0.01, test logloss =0.6111525999999999
eta=0.02, test logloss =0.2655008
eta=0.03, test logloss =0.14116900000000002
eta=0.04, test logloss =0.0940986
eta=0.05, test logloss =0.0763152
eta=0.06, test logloss =0.06847120000000001
eta=0.07, test logloss =0.06524640000000001
eta=0.08, test logloss =0.06503080000000001
eta=0.09, test logloss =0.0643112
eta=0.1, test logloss =0.0644088
eta=0.11, test logloss =0.0641914
eta=0.12, test logloss =0.0639374
eta=0.13, test logloss =0.063549
eta=0.14, test logloss =0.06429119999999999
eta=0.15, test logloss =0.0644992
eta=0.16, test logloss =0.0639346
eta=0.17, test logloss =0.0632468
eta=0.18, test logloss =0.0631302
eta=0.19, test logloss =0.06440939999999999
eta=0.2, test logloss =0.063712
eta=0.21, test logloss =0.0640918
eta=0.22, test logloss =0.06472460000000001
eta=0.23, test logloss =0.064371
eta=0.24, test logloss =0.06482180000000001
eta=0.25, test logloss =0.06462380000000001
eta=0.26, test logloss =0

### We examined several local minima and decided to take `eta=0.62` for further consideration; the corresponding mean test mlogloss from cross-validation is ~0.0619.
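The loop above passes `early_stopping_rounds=10` and reads the last row of the returned history. As far as we understand `xgb.cv`, when early stopping triggers the history is cut at the best round, so the last row holds the best mean score. The stopping rule itself can be sketched in plain Python; the function name and the loss values below are made up for illustration.

```python
def best_round_with_early_stopping(losses, patience):
    """Return (best_index, rounds_run) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience` rounds in a row.
    """
    best_index = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_index]:
            best_index = i                  # new best round
        elif i - best_index >= patience:
            return best_index, i + 1        # stop: no improvement for `patience` rounds
    return best_index, len(losses)

# Illustrative validation losses, one per boosting round.
toy_losses = [0.90, 0.40, 0.20, 0.10, 0.08, 0.09, 0.085, 0.09, 0.10]
best_index, rounds_run = best_round_with_early_stopping(toy_losses, patience=3)
print(best_index, rounds_run)
```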

In [119]:
parameters = {
   'objective': 'multi:softmax',
    'eta': 0.62,
    'silent': 1,
    "nthread": 3,
    "seed": 1,
    'num_class':7,
    
}
model = xgb.train(parameters, xgb_train)
pred = model.predict(xgb_test)

error_rate = np.sum(pred != cv_labels) / cv_labels.shape[0]
print('Test error = {}'.format(error_rate))

Test error = 0.015113350125944584


### Corresponding error on the test data - 1.51%
### Next, we decided to try different values of `max_depth`:

In [17]:
%%time

for max_depth in np.arange(1,10):
    parameters = {
        #default
        'objective': 'multi:softmax',
        'silent': 1,
        "nthread": 3,
        "seed": 1,
        "eval_metric": 'mlogloss',
        'num_class':7,
        'eta': 0.62,
        'max_depth':max_depth
    }
    results = xgb.cv(parameters, xgb_train, num_boost_round = 100, early_stopping_rounds=10,
                     nfold=5, stratified=True, seed=random_seed, show_stdv=False, 
                     verbose_eval=False)
    print("max_depth={}, test logloss ={}".format(max_depth, results.iloc[-1]['test-mlogloss-mean']))

max_depth=1, test logloss =0.055717
max_depth=2, test logloss =0.057705799999999995
max_depth=3, test logloss =0.059850799999999996
max_depth=4, test logloss =0.0635156
max_depth=5, test logloss =0.06320980000000001
max_depth=6, test logloss =0.061946000000000015
max_depth=7, test logloss =0.06206339999999999
max_depth=8, test logloss =0.06233780000000001
max_depth=9, test logloss =0.062733
Wall time: 6.06 s


### We tried several values and decided to use `max_depth=3`, even though the lowest mean mlogloss in the output above is at `max_depth=1`.
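Instead of reading the printed lines by eye, the cross-validation scores can be collected and ranked. The sketch below uses the values from the output above, rounded to seven decimal places; the variable names are our own.

```python
# Mean test mlogloss per max_depth, copied (rounded) from the output above.
cv_mlogloss = {
    1: 0.055717,
    2: 0.0577058,
    3: 0.0598508,
    4: 0.0635156,
    5: 0.0632098,
    6: 0.061946,
    7: 0.0620634,
    8: 0.0623378,
    9: 0.062733,
}

# Rank the candidates from lowest to highest loss.
ranked = sorted(cv_mlogloss.items(), key=lambda item: item[1])
best_depth, best_loss = ranked[0]
print("best max_depth by CV = {} (mlogloss {:.4f})".format(best_depth, best_loss))
```

In a loop over parameter values, the same dictionary could be filled at each iteration instead of printing, which makes the final choice reproducible.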

In [18]:
parameters = {
   'objective': 'multi:softmax',
    'eta': 0.62,
    'silent': 1,
    "nthread": 3,
    "seed": 1,
    'num_class':7,
    'max_depth':3
}
model = xgb.train(parameters, xgb_train)
pred = model.predict(xgb_test)

error_rate = np.sum(pred != cv_labels) / cv_labels.shape[0]
print('Test error = {}'.format(error_rate))

Test error = 0.012594458438287154


### Corresponding error on the test data ~1.26%
### Next, we decided to try different values of `lambda` - coefficient for L2-regularization:

In [19]:
%%time

for lambd in np.arange(0,3,0.05):
    parameters = {
        #default
        'objective': 'multi:softmax',
        'silent': 1,
        "nthread": 3,
        "seed": 1,
        "eval_metric": 'mlogloss',
        'num_class':7,
        'eta': 0.62,
        'max_depth':3,
        'lambda':lambd
        
    }
    results = xgb.cv(parameters, xgb_train, num_boost_round = 100, early_stopping_rounds=10,
                     nfold=5, stratified=True, seed=random_seed, show_stdv=False, 
                     verbose_eval=False)
    
    print("lambda={}, test logloss ={}".format(lambd, results.iloc[-1]['test-mlogloss-mean']))

lambda=0.0, test logloss =0.0604512
lambda=0.05, test logloss =0.05868359999999999
lambda=0.1, test logloss =0.057423800000000004
lambda=0.15000000000000002, test logloss =0.060823800000000004
lambda=0.2, test logloss =0.0599474
lambda=0.25, test logloss =0.058122799999999995
lambda=0.30000000000000004, test logloss =0.0585228
lambda=0.35000000000000003, test logloss =0.057076800000000004
lambda=0.4, test logloss =0.056756400000000005
lambda=0.45, test logloss =0.059345999999999996
lambda=0.5, test logloss =0.059927400000000006
lambda=0.55, test logloss =0.0580032
lambda=0.6000000000000001, test logloss =0.057970999999999995
lambda=0.65, test logloss =0.0568608
lambda=0.7000000000000001, test logloss =0.060494400000000004
lambda=0.75, test logloss =0.05949800000000001
lambda=0.8, test logloss =0.060073999999999995
lambda=0.8500000000000001, test logloss =0.059630800000000005
lambda=0.9, test logloss =0.061152200000000004
lambda=0.9500000000000001, test logloss =0.060014
lambda=1.0, tes

### We tried several values on our test data and decided to use `lambda=2`; the corresponding mean test mlogloss from cross-validation is ~0.06002.
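For intuition about what `lambda` controls: in the regularised objective used by gradient-boosted trees, the optimal weight of a leaf is $w^* = -G / (H + \lambda)$, where $G$ and $H$ are the sums of first and second derivatives of the loss over the samples in that leaf. A larger `lambda` therefore shrinks leaf weights towards zero. The sketch below evaluates this formula for made-up gradient statistics; it ignores `alpha` and `gamma`.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf weight of a regularised boosted tree: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)

# Illustrative gradient statistics for one leaf (not taken from the dataset).
G, H = -6.0, 4.0
weights = {lam: leaf_weight(G, H, lam) for lam in (0.0, 1.0, 2.0)}
for lam, w in weights.items():
    print("lambda={}: leaf weight={:.3f}".format(lam, w))
```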

In [150]:
parameters = {
   'objective': 'multi:softmax',
    'eta': 0.62,
    'silent': 1,
    "nthread": 3,
    "seed": 1,
    'num_class':7,
    'max_depth':3,
    'lambda': 2
}
model = xgb.train(parameters, xgb_train)
pred = model.predict(xgb_test)

error_rate = np.sum(pred != cv_labels) / cv_labels.shape[0]
print('Test error = {}'.format(error_rate))

Test error = 0.010075566750629723


### Corresponding error on the test data ~1.01%. Much better than our initial result of 2.27%.

## However, we should note that the final decisions were made to minimize the error on the test data.
## This is not the best approach when we want to solve a real-world problem or when the competition leaderboard is private. In these cases we should rely on cross-validation results, not on the public leaderboard or a single test dataset.
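Relying on cross-validation means that every training sample is used for validation exactly once, so the score does not depend on one particular held-out set. A minimal sketch of k-fold index generation follows. It is a plain (non-stratified) variant written for illustration; the function name is hypothetical and this is not the splitter used elsewhere in this notebook.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, valid_idx) pairs so that each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, valid_idx in splits:
    print(sorted(valid_idx), len(train_idx))
```

The model is trained on `train_idx` and scored on `valid_idx` for each pair, and the scores are averaged.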

## There is a more general approach to choosing model hyperparameters with cross-validation: we iterate over all combinations of the parameters we want to consider and choose the best one according to the cross-validation results.

## This is a brute-force approach and cannot easily be applied to every task. Thus one should:
### a) Decrease the number of parameter values being considered. For instance, knowledge of the subject field or of the model's properties can be used for this purpose.
### b) Try random search: instead of iterating over all combinations, we randomly sample one of them from the corresponding space at each step.
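The idea of random search can be sketched without any machine learning library. In the sketch below, `toy_loss` is a hypothetical stand-in for a cross-validated loss, and the grid values are illustrative. Note that this version samples with replacement, which is a simplification.

```python
import random

def random_search(param_grid, score_fn, n_iter, seed=0):
    """Sample n_iter random combinations from param_grid and return the best one.

    score_fn is a stand-in for a cross-validated loss (lower is better).
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        # Draw one value per parameter, independently and with replacement.
        candidate = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(candidate)
        if score < best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

# Hypothetical loss surface with its minimum at eta=0.6, max_depth=3.
def toy_loss(params):
    return (params["eta"] - 0.6) ** 2 + 0.01 * (params["max_depth"] - 3) ** 2

toy_grid = {"eta": [0.1, 0.35, 0.6, 0.85], "max_depth": [1, 2, 3, 4, 5]}
best_params, best_score = random_search(toy_grid, toy_loss, n_iter=200)
print(best_params, best_score)
```

The cost is controlled by `n_iter` rather than by the size of the grid, which is what makes random search practical when the number of combinations is large.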

## So, we decided to try random search:

In [159]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

In [180]:
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [1, 2, 3, 4, 5],
        'eta':[0.35, 0.62, 0.7],
        'n_estimators':[20, 50, 100, 200, 600],
        'lambda':[0,0.5,1.5,2,5],
        'alpha ':[0,0.5,1.5,2,5]
        }

In [182]:
xgb = XGBClassifier(objective='multi:softmax',
                    silent=True, nthread=1, random_seed=random_seed)

In [185]:
folds = 3
param_comb = 1000

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, 
                                    n_jobs=4, cv=skf.split(train_data,train_labels),
                                   verbose=3, random_state=random_seed )


In [186]:
%%time
random_search.fit(train_data,train_labels)

Fitting 3 folds for each of 1000 candidates, totalling 3000 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   10.3s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:   29.8s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  1.1min
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.8min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  2.9min
[Parallel(n_jobs=4)]: Done 1144 tasks      | elapsed:  4.1min
[Parallel(n_jobs=4)]: Done 1560 tasks      | elapsed:  5.8min
[Parallel(n_jobs=4)]: Done 2040 tasks      | elapsed:  7.8min
[Parallel(n_jobs=4)]: Done 2584 tasks      | elapsed: 10.1min
[Parallel(n_jobs=4)]: Done 3000 out of 3000 | elapsed: 11.8min finished


Wall time: 11min 51s


RandomizedSearchCV(cv=<generator object _BaseKFold.split at 0x000002573A03BC50>,
          error_score='raise-deprecating',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=1, objective='multi:softmax', random_seed=17,
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1),
          fit_params=None, iid='warn', n_iter=1000, n_jobs=4,
          param_distributions={'min_child_weight': [1, 5, 10], 'gamma': [0.5, 1, 1.5, 2, 5], 'subsample': [0.6, 0.8, 1.0], 'colsample_bytree': [0.6, 0.8, 1.0], 'max_depth': [1, 2, 3, 4, 5], 'eta': [0.35, 0.62, 0.7], 'n_estimators': [20, 50, 100, 200, 600], 'lambda': [0, 0.5, 1.5, 2, 5], 'alpha ': [0, 0.5, 1.5, 2, 5]},
          pre_dispatch='2*n_jobs', random_state=17, refit=True,
          

In [177]:
pred = random_search.predict(cv_data)
error_rate = np.sum(pred != cv_labels) / cv_labels.shape[0]
print('Test error = {}'.format(error_rate))

Test error = 0.017632241813602016


In [187]:
pred = random_search.predict(cv_data)
error_rate = np.sum(pred != cv_labels) / cv_labels.shape[0]
print('Test error = {}'.format(error_rate))

Test error = 0.022670025188916875


### Unfortunately, even with a lot of time spent, the results on the test data are much worse than those of our previous options.

### Most likely this is because we made too few iterations (too small a number of samples from the parameter space) and thus did not find better hyperparameter values. However, it may also be because the high accuracy in the previous case was caused by overfitting.
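To put the number of iterations in perspective, here is a back-of-the-envelope count of the search space defined above. It assumes that all nine listed parameters are actually applied by the estimator, which we have not verified.

```python
from math import prod

# Number of candidate values per hyperparameter in the search space above.
n_values = {
    "min_child_weight": 3,
    "gamma": 5,
    "subsample": 3,
    "colsample_bytree": 3,
    "max_depth": 5,
    "eta": 3,
    "n_estimators": 5,
    "lambda": 5,
    "alpha": 5,
}

total_combinations = prod(n_values.values())
sampled = 1000  # n_iter used in the random search
coverage = sampled / total_combinations
print("{} combinations, {:.2%} sampled".format(total_combinations, coverage))
```

Under that assumption, 1000 samples cover well under 1% of the possible combinations, which is consistent with the first explanation above.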