# <img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4594/logos/front_page.png"/><span style="color:blue;text-align:center;">v1 Simple GLM</span>

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!
<img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4594/media/rossmann_banner2.png"/>

In [1]:
using DataFrames
using MLBase
using GLM
using Gadfly
using Iterators


## Load Data

In [2]:
train = readtable("data/train.csv")
test = readtable("data/test.csv")
store = readtable("data/store.csv")
train = join(train, store, on=:Store)
test = join(test, store, on=:Store);

## Sample Data Visualization

#### Train Data

In [39]:
showcols(train)

1017209x22 DataFrames.DataFrame
| Col # | Name                      | Eltype     | Missing |
|-------|---------------------------|------------|---------|
| 1     | Store                     | Int64      | 0       |
| 2     | DayOfWeek                 | Int64      | 0       |
| 3     | Date                      | UTF8String | 0       |
| 4     | Sales                     | Int64      | 0       |
| 5     | Customers                 | Int64      | 0       |
| 6     | Open                      | Int64      | 0       |
| 7     | Promo                     | Int64      | 0       |
| 8     | StateHoliday              | UTF8String | 0       |
| 9     | SchoolHoliday             | Int64      | 0       |
| 10    | StoreType                 | UTF8String | 0       |
| 11    | Assortment                | UTF8String | 0       |
| 12    | CompetitionDistance       | Int64      | 2642    |
| 13    | CompetitionOpenSinceMonth | Int64      | 323348  |
| 14    | CompetitionOpenSinceYear  | Int64      | 32

#### Test Data

In [18]:
showcols(test)

41088x21 DataFrames.DataFrame
| Col # | Name                      | Eltype     | Missing |
|-------|---------------------------|------------|---------|
| 1     | Id                        | Int64      | 0       |
| 2     | Store                     | Int64      | 0       |
| 3     | DayOfWeek                 | Int64      | 0       |
| 4     | Date                      | UTF8String | 0       |
| 5     | Open                      | Int64      | 0       |
| 6     | Promo                     | Int64      | 0       |
| 7     | StateHoliday              | UTF8String | 0       |
| 8     | SchoolHoliday             | Int64      | 0       |
| 9     | StoreType                 | UTF8String | 0       |
| 10    | Assortment                | UTF8String | 0       |
| 11    | CompetitionDistance       | Int64      | 96      |
| 12    | CompetitionOpenSinceMonth | Int64      | 15216   |
| 13    | CompetitionOpenSinceYear  | Int64      | 15216   |
| 14    | Promo2                    | Int64      | 0       |

## Initial Preprocessing

### Select Initial Features and Label

In [24]:
features = [:DayOfWeek, :Open, :Promo, :StateHolidayEnc, :SchoolHoliday, 
            :StoreTypeEnc, :AssortmentEnc, :Promo2, :PromoIntervalEnc]
label = :Sales;

### Encode Categorical Fields

In [4]:
label_state_holiday = labelmap(vcat(train[:StateHoliday], test[:StateHoliday]))
label_store_type = labelmap(vcat(train[:StoreType], test[:StoreType]))
label_assortment = labelmap(vcat(train[:Assortment], test[:Assortment]))
label_promo_interval = labelmap(vcat(train[:PromoInterval], test[:PromoInterval]));

In [5]:
train[:StateHolidayEnc] = labelencode(label_state_holiday, train[:StateHoliday])
train[:StoreTypeEnc] = labelencode(label_store_type, train[:StoreType])
train[:AssortmentEnc] = labelencode(label_assortment, train[:Assortment])
train[:PromoIntervalEnc] = labelencode(label_promo_interval, train[:PromoInterval])

test[:StateHolidayEnc] = labelencode(label_state_holiday, test[:StateHoliday])
test[:StoreTypeEnc] = labelencode(label_store_type, test[:StoreType])
test[:AssortmentEnc] = labelencode(label_assortment, test[:Assortment])
test[:PromoIntervalEnc] = labelencode(label_promo_interval, test[:PromoInterval]);
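
The encoding step above can be illustrated with a minimal sketch in plain Julia. The helpers `build_codes` and `encode` are hypothetical names introduced here for illustration; they are not part of MLBase and only mimic the idea of mapping each distinct value to an integer code. The toy columns are made up.

```julia
# Hypothetical helper: map each distinct value to an integer code,
# numbered in order of first appearance.
function build_codes(values)
    codes = Dict{eltype(values),Int}()
    for v in values
        if !haskey(codes, v)
            codes[v] = length(codes) + 1
        end
    end
    return codes
end

# Hypothetical helper: encode a column with a previously built mapping.
encode(codes, values) = [codes[v] for v in values]

# Toy columns (assumed data, not taken from the competition files).
train_col = ["a", "c", "a", "d"]
test_col  = ["b", "a"]

# Build the mapping over train and test together so that every value
# seen at prediction time already has a code.
codes = build_codes(vcat(train_col, test_col))
train_enc = encode(codes, train_col)
test_enc  = encode(codes, test_col)
```

Because the model later treats these codes as ordinary numbers, the encoding imposes an arbitrary order on the categories. One-hot (dummy) encoding is a common alternative when that order is not meaningful.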

In [6]:
# Treat stores whose Open flag is missing in the test set as open.
test[isna(test[:Open]), :Open] = 1;

## Train Data

#### Prepare Formulas

In [27]:
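# Build one formula per non-empty subset of the features; [2:end] drops the first subset, which is empty.
formulas = map(f -> eval(parse("Sales ~ " * join(f, " + "))), collect(subsets(features))[2:end]);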
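
To make the size of this search space concrete: a set of n features has 2^n - 1 non-empty subsets, so the 9 features selected earlier give 511 candidate formulas. The sketch below enumerates subsets with bitmasks in plain Julia. `nonempty_subsets` is a hypothetical helper, not the `subsets` iterator from Iterators, and its ordering is not guaranteed to match that package.

```julia
# Hypothetical helper: enumerate all non-empty subsets of `items`.
# Bit i-1 of `mask` decides whether items[i] is included.
function nonempty_subsets(items)
    n = length(items)
    result = Vector{Vector{eltype(items)}}()
    for mask in 1:(2^n - 1)
        push!(result, [items[i] for i in 1:n if (mask >> (i - 1)) & 1 == 1])
    end
    return result
end

# A shortened feature list keeps the output small.
feats = [:DayOfWeek, :Open, :Promo]
subs = nonempty_subsets(feats)

# Formula text in the same "Sales ~ a + b" shape used in this notebook.
formula_strings = ["Sales ~ " * join(s, " + ") for s in subs]
```

With three features this gives 7 formula strings. The count doubles with each added feature, which is one reason to fit only a few of them at a time.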

#### Train Feasible Linear Model

In [8]:
function training(df, formulas)
    lms = []
    accepted_formulas = []
    for formula in formulas
        try
            push!(lms, glm(formula, df, Normal(), IdentityLink()))
            push!(accepted_formulas, formula)
        catch
            # Skip any formula whose fit throws an error.
            continue
        end
    end
    return accepted_formulas, lms
end

training (generic function with 1 method)

In [28]:
accepted_formulas, lms = training(train, formulas[end-1:end]);

## Evaluate Data

In [11]:
rmspe(yreal, ypred) = sqrt(sum(((yreal - ypred) ./ yreal) .^2)/length(yreal))

rmspe (generic function with 1 method)
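
One caveat: the relative error is undefined when the true value is zero, and the training data contains zero-sales rows (for example, days on which a store is closed). The competition's scoring is understood to ignore such rows. The sketch below shows one way to do the same locally. `rmspe_nonzero` is a hypothetical name, and the sample vectors are made up.

```julia
# Hypothetical variant of rmspe that skips rows whose true value is zero,
# where the relative error (yreal - ypred) / yreal is undefined.
function rmspe_nonzero(yreal, ypred)
    mask = yreal .!= 0
    rel_err = (yreal[mask] .- ypred[mask]) ./ yreal[mask]
    return sqrt(sum(rel_err .^ 2) / length(rel_err))
end

# Toy data: the second row has zero true sales and is ignored.
y_true = [100.0, 0.0, 200.0]
y_hat  = [110.0, 5.0, 180.0]
score = rmspe_nonzero(y_true, y_hat)
```

Here both remaining rows are off by 10%, so the score is 0.1. Without the mask, the zero row would make the result infinite.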

In [40]:
train_predicted = Array[predict(lm, train) for lm in lms];

In [41]:
rmspe_data = map(p -> rmspe(train[:Sales], p), train_predicted);

## Predict

In [36]:
preds = predict(lms[end], test[:, features])
preds = map(v -> v > 0 ? round(Int, v) : 0, preds);
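
The clamping step can be extended, as an optional assumption rather than part of this notebook's pipeline, to force a prediction of zero whenever a store is closed, since a linear model will not necessarily output zero for those rows. `postprocess` is a hypothetical helper and the inputs are made up.

```julia
# Hypothetical helper: turn a raw prediction into a non-negative integer,
# forcing zero when the store is closed (open flag == 0).
postprocess(pred, open_flag) = open_flag == 0 ? 0 : max(0, round(Int, pred))

# Toy inputs: a normal day, a negative raw prediction, and a closed store.
raw        = [1234.6, -50.2, 800.4]
open_flags = [1, 1, 0]

sales = map(postprocess, raw, open_flags)
```

The three rows become 1235, 0 and 0: rounding, clamping a negative value, and overriding a closed store.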

## Create Submission

In [38]:
submission = DataFrame(Id=test[:Id], Sales=preds)
writetable("data/submission_v1_all_features.csv", submission, separator=',');

### Kaggle Public Result

v1 All Features, used only to set an initial score (RMSPE, lower is better): **0.42591**  
This beats the baseline benchmark of all zeros: **1.0000**