Hackathon: Urban Air Pollution Challenge
=============================================
Edimer David Jaramillo   
April 2020

## Installing and loading libraries

In [1]:
install.packages("h2o", dependencies = TRUE)
library(tidyverse)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘httpuv’, ‘xtable’, ‘sourcetools’, ‘fastmap’, ‘miniUI’, ‘webshot’, ‘misc3d’, ‘bitops’, ‘bit’, ‘shiny’, ‘manipulateWidget’, ‘plot3D’, ‘RCurl’, ‘mlbench’, ‘slam’, ‘bit64’, ‘data.table’, ‘rgl’, ‘plot3Drgl’


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.3
✔ tibble  3.0.0     ✔ dplyr   0.8.5
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Importing data

In [0]:
load("/content/nuevo1_test.Rdata")
load("/content/nuevo1_train.Rdata")
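
The rest of the notebook relies on objects (`df_train2`, `df_test2`) that `load()` restores from the `.Rdata` files, so it can help to check what a file actually contains. A minimal sketch of that check, using a hypothetical toy data frame `toy_df` and a temporary file rather than the competition data:

```r
# Toy stand-in for the notebook's data; `toy_df` and the temp file are hypothetical.
toy_df <- data.frame(id = 1:3, target = c(10.5, 42.0, 7.25))
path <- tempfile(fileext = ".Rdata")
save(toy_df, file = path)

# Load into a separate environment so nothing in the workspace is overwritten.
# load() returns the names of the objects it restored.
env <- new.env()
restored <- load(path, envir = env)
print(restored)

# Inspect the first restored object.
str(env[[restored[1]]])
```

Applied to the real files, the same pattern would list the object names stored in each one before they are used.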

## XGBoost model

In [16]:
# Load and start h2o
library(h2o)
h2o.init(nthreads = -1, port = 54321, max_mem_size = "10g")

# Training data: drop the first three columns and keep rows with target <= 500
df_train <- df_train2 %>% 
  select(-c(1, 2, 3)) %>%
  filter(target <= 500)

# Convert to an h2o frame and split it into train / validation / test partitions
datos_h2o <- as.h2o(x = df_train, destination_frame = "datos_h2o")

particiones <- h2o.splitFrame(data = datos_h2o, ratios = c(0.7, 0.20),
                              seed = 123)
datos_train_h2o <- h2o.assign(data = particiones[[1]], key = "datos_train_H2O")
datos_val_h2o   <- h2o.assign(data = particiones[[2]], key = "datos_val_H2O")
datos_test_h2o  <- h2o.assign(data = particiones[[3]], key = "datos_test_H2O")

# Define the response variable and the predictors.
var_respuesta <- "target"

# This model uses all available predictors.
predictores   <- setdiff(h2o.colnames(datos_train_h2o), var_respuesta)

# h2o XGBoost model
xgb <- h2o.xgboost(x = predictores
                  ,y = var_respuesta
                  ,training_frame = datos_train_h2o
                  ,validation_frame = datos_val_h2o
                  ,model_id = "xgb_model_1"
                  ,stopping_rounds = 3
                  ,stopping_metric = "RMSE"
                  ,distribution = "gaussian"
                  ,score_tree_interval = 1
                  ,learn_rate=0.05
                  ,ntrees=500
                  ,subsample = 0.75
                  ,colsample_bytree = 0.75
                  ,tree_method = "hist"
                  ,grow_policy = "lossguide"
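                  ,booster = "gblinear"  # linear booster: the tree-specific settings in this call (tree_method, grow_policy, gamma, ...) likely have no effect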
                  ,gamma = 0.0
                  ,max_runtime_secs = 3600
                  ,nfolds = 10
                  ,seed = 123
                  ,categorical_encoding = "OneHotExplicit")

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         23 minutes 38 seconds 
    H2O cluster timezone:       Etc/UTC 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.30.0.1 
    H2O cluster version age:    6 days  
    H2O cluster name:           H2O_started_from_R_root_ybq511 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   9.88 GB 
    H2O cluster total cores:    2 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 3.6.3 (2020-02-29) 



“Dropping bad and constant columns: [day_week, month_date, Weekend].”
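
`h2o.splitFrame()` is called with `ratios = c(0.7, 0.20)`, so it returns three frames: roughly 70% for training, 20% for validation, and the remaining 10% as a held-out test set. The H2O documentation describes the split as probabilistic, so the sizes are approximate rather than exact. The base R sketch below only mimics that idea on a hypothetical row count `n`; it is not H2O's implementation and will not reproduce the same row assignment.

```r
set.seed(123)
n <- 10000                         # hypothetical number of rows
ratios <- c(0.7, 0.20)             # same ratios as in the notebook
breaks <- c(0, cumsum(ratios), 1)  # bin edges; the last bin is the remainder

# Draw one uniform number per row and assign the row to the bin it falls in.
u <- runif(n)
part <- cut(u, breaks = breaks, labels = c("train", "val", "test"),
            include.lowest = TRUE)

# Observed shares are close to, but not exactly, 0.7 / 0.2 / 0.1.
props <- prop.table(table(part))
print(round(props, 3))
```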




In [17]:
xgb

Model Details:

H2ORegressionModel: xgboost
Model ID:  xgb_model_1 
Model Summary: 
  number_of_trees
1              52


H2ORegressionMetrics: xgboost
** Reported on training data. **

MSE:  2199.96
RMSE:  46.90373
MAE:  34.57763
RMSLE:  0.7914494
Mean Residual Deviance :  2199.96


H2ORegressionMetrics: xgboost
** Reported on validation data. **

MSE:  2081.29
RMSE:  45.62116
MAE:  33.51825
RMSLE:  0.7804832
Mean Residual Deviance :  2081.29


H2ORegressionMetrics: xgboost
** Reported on cross-validation data. **
** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  2199.612
RMSE:  46.90002
MAE:  34.57685
RMSLE:  0.7914168
Mean Residual Deviance :  2199.612


Cross-Validation Metrics Summary: 
                               mean          sd   cv_1_valid   cv_2_valid
mae                       34.580215  0.48616752    34.508533     34.82499
mean_residual_deviance    2200.3948   107.04642    2269.7646    2177.4104
mse                 
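
The summary reports 52 trees although the call allowed `ntrees = 500`, which most likely means early stopping (`stopping_rounds = 3`, `stopping_metric = "RMSE"`, `score_tree_interval = 1`) ended training. The sketch below illustrates the general idea with a simplified patience rule on a hypothetical metric series. H2O's documented rule compares moving averages of the metric and applies a tolerance, so this is an illustration of the concept, not H2O's implementation; `stop_round` and `val_rmse` are made-up names.

```r
# Simplified patience rule: stop once the metric has failed to improve on its
# best value for `patience` consecutive scoring rounds.
stop_round <- function(metric, patience = 3, tol = 0) {
  best <- Inf
  best_round <- 0
  since_best <- 0
  i <- 0
  for (i in seq_along(metric)) {
    if (metric[i] < best - tol) {
      # New best value: remember it and reset the counter.
      best <- metric[i]
      best_round <- i
      since_best <- 0
    } else {
      since_best <- since_best + 1
      if (since_best >= patience) break
    }
  }
  list(stopped_at = i, best_round = best_round, best = best)
}

# Hypothetical validation RMSE per scoring round: improves, then plateaus.
val_rmse <- c(60, 52, 48, 46, 45.7, 45.7, 45.8, 45.9, 45.6, 45.5)
res <- stop_round(val_rmse, patience = 3)
str(res)
```

With this toy series the rule stops before the later, slightly better values are reached, which is the usual trade-off of a short patience window.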

## Predictions

In [24]:
# ============================= Predictions ====================================== #

# Train
predichos_train <- h2o.predict(xgb, datos_train_h2o) %>%
  as.data.frame() %>% pull(predict)

# Test (held-out partition of the training data)
predichos_test <- h2o.predict(xgb, datos_test_h2o) %>%
  as.data.frame() %>% pull(predict)

# Test (Submission)
predichos_subm <- h2o.predict(xgb, as.h2o(df_test2)) %>%
  as.data.frame() %>% pull(predict)
subm3 <- df_test2 %>% 
  select(`Place_ID X Date`) %>% 
  mutate(target = predichos_subm)

# Export predictions
write.csv(subm3, file = "subm3.csv", row.names = FALSE)

# Root mean squared error between predictions and observations
RMSE <- function(pred, obs, na.rm = FALSE){
  sqrt(mean((pred - obs)^2, na.rm = na.rm))
}

data.frame(
  data = c("Train", "Test"),
  RMSE = c(RMSE(pred = predichos_train, obs = as.vector(datos_train_h2o$target)),
           RMSE(pred = predichos_test, obs = as.vector(datos_test_h2o$target)))
)



data   RMSE
<fct>  <dbl>
Train  46.90373
Test   47.45303
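
A short usage sketch for the `RMSE()` helper, on hypothetical toy vectors `pred` and `obs`, shows how `na.rm` changes the result. The last line checks that the training RMSE in the table agrees with the MSE printed in the model summary (2199.96), since RMSE is the square root of MSE.

```r
# Same helper as in the notebook.
RMSE <- function(pred, obs, na.rm = FALSE){
  sqrt(mean((pred - obs)^2, na.rm = na.rm))
}

# Hypothetical toy vectors; the last prediction is missing.
pred <- c(1, 2, 3, NA)
obs  <- c(1, 2, 5, 4)

RMSE(pred, obs)                # NA: the missing prediction propagates
RMSE(pred, obs, na.rm = TRUE)  # sqrt(4 / 3): the incomplete pair is dropped

# Consistency check against the printed model summary: RMSE = sqrt(MSE).
all.equal(sqrt(2199.96), 46.90373, tolerance = 1e-5)
```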
