# Logistic regression with categorical data

We aim to fit a logistic regression model to the [shelter animal data](https://www.kaggle.com/c/shelter-animal-outcomes) from [Kaggle](https://www.kaggle.com/competitions) using the Ruby gems `daru` and `statsample-glm`.

In [1]:
require 'daru'
shelter_data = Daru::DataFrame.from_csv 'train.csv'
shelter_data.head(3)

Daru::DataFrame (rows: 3, cols: 10)

| | AgeuponOutcome | AnimalID | AnimalType | Breed | Color | DateTime | Name | OutcomeSubtype | OutcomeType | SexuponOutcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 year | A671945 | Dog | Shetland Sheepdog Mix | Brown/White | 2014-02-12 18:22:00 | Hambone | | Return_to_owner | Neutered Male |
| 1 | 1 year | A656520 | Cat | Domestic Shorthair Mix | Cream Tabby | 2013-10-13 12:44:00 | Emily | Suffering | Euthanasia | Spayed Female |
| 2 | 2 years | A686464 | Dog | Pit Bull Mix | Blue/White | 2015-01-31 12:28:00 | Pearce | Foster | Adoption | Neutered Male |


The first thing we notice is that none of the variables are numeric, and `statsample-glm` cannot handle non-numeric data. __This should change after this year's GSoC__, but for now we have to do a lot of data transformation by hand.

## Data preprocessing

For simplicity, and since this data analysis is just for demonstration purposes, we will keep only the variables `AgeuponOutcome`, `AnimalType`, and `SexuponOutcome` as the predictors in the model.

The response variable will be a 0-1-valued indicator vector of whether the animal was adopted. Alternatively, the response could be a 0-1-valued indicator for whether the animal was euthanized, was reunited with its previous owner, died of natural causes, or was transferred. That is, we could fit five different one-vs-all logistic regression models, one per level of `OutcomeType`. However, we cannot fit a unified multinomial model, because `statsample-glm` does not support that.
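
The one-vs-all idea can be sketched in plain Ruby, without `daru`: each outcome label gets its own 0-1 indicator vector, and each vector could serve as the response of a separate logistic regression. The helper name `one_vs_all_indicators` and the small `SAMPLE_OUTCOMES` list are illustrative assumptions, not part of the notebook or of either gem.

```ruby
# Hypothetical helper (not part of daru or statsample-glm):
# builds one 0-1 indicator vector per distinct label.
def one_vs_all_indicators(labels)
  labels.uniq.each_with_object({}) do |level, indicators|
    indicators[level] = labels.map { |label| label == level ? 1.0 : 0.0 }
  end
end

# Made-up sample of outcome labels, using the level names from the data.
SAMPLE_OUTCOMES = %w[Return_to_owner Euthanasia Adoption Transfer Adoption Died].freeze

one_vs_all_indicators(SAMPLE_OUTCOMES).each do |level, indicator|
  puts "#{level}: #{indicator.inspect}"
end
```

Every row has exactly one indicator equal to 1.0, which is why the five models are one-vs-all views of the same outcome.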

We remove the variables that we are not going to use from the dataset.

In [2]:
shelter_data.delete_vector("AnimalID")
shelter_data.delete_vector("DateTime")
shelter_data.delete_vector("Name")
shelter_data.delete_vector("OutcomeSubtype")
shelter_data.delete_vector("Breed")
shelter_data.delete_vector("Color")
# remaining vectors:
shelter_data.head(3)

Daru::DataFrame (rows: 3, cols: 4)

| | AgeuponOutcome | AnimalType | OutcomeType | SexuponOutcome |
|---|---|---|---|---|
| 0 | 1 year | Dog | Return_to_owner | Neutered Male |
| 1 | 1 year | Cat | Euthanasia | Spayed Female |
| 2 | 2 years | Dog | Adoption | Neutered Male |


When we check the values that the remaining variables attain, we see that there are missing values denoted either as `nil` or as `"Unknown"`.

In [3]:
shelter_data.each { |vec| puts vec.name + " levels: " + vec.to_a.uniq.join(",") }
puts "------------------------"
puts "Are there any nil values? #{shelter_data.has_missing_data?}"
puts "------------------------"
puts "There are #{shelter_data.shape[0]} rows total, including those with missing data."

AgeuponOutcome levels: 1 year,2 years,3 weeks,1 month,5 months,4 years,3 months,2 weeks,2 months,10 months,6 months,5 years,7 years,3 years,4 months,12 years,9 years,6 years,1 weeks,11 years,4 weeks,7 months,8 years,11 months,4 days,9 months,8 months,15 years,10 years,1 week,0 years,14 years,3 days,6 days,5 days,5 weeks,2 days,16 years,1 day,13 years,,17 years,18 years,19 years,20 years
AnimalType levels: Dog,Cat
OutcomeType levels: Return_to_owner,Euthanasia,Adoption,Transfer,Died
SexuponOutcome levels: Neutered Male,Spayed Female,Intact Male,Intact Female,Unknown,
------------------------
Are there any nil values? true
------------------------
There are 26729 rows total, including those with missing data.


We delete all rows that contain missing values. `DataFrame#filter_rows` would in general be the natural way to pick the rows without missing values, but for some reason it is really slow here. So we do the following instead.

In [4]:
# Drop rows with a missing age.
shelter_data["AgeuponOutcome"].each_with_index do |entry, ind|
  shelter_data.delete_row ind if entry.nil?
end

# Drop rows whose sex is missing or recorded as "Unknown".
shelter_data["SexuponOutcome"].each_with_index do |entry, ind|
  shelter_data.delete_row ind if entry.nil? || entry == "Unknown"
end

puts "There are #{shelter_data.shape[0]} rows left"
puts "Are there any nil values left? #{shelter_data.has_missing_data?}"

There are 25621 rows left
Are there any nil values left? false
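
The same filtering step can be illustrated without `daru` by treating each row as a hash and building a new array instead of deleting in place. This is only a plain-Ruby sketch: `complete_row?`, `MISSING_MARKERS`, and the sample rows are hypothetical, and unlike the cell above it treats `"Unknown"` as missing in every column.

```ruby
# Values treated as missing in this sketch: nil and the string "Unknown".
MISSING_MARKERS = [nil, "Unknown"].freeze

# Hypothetical helper: true if no value in the row is a missing marker.
def complete_row?(row)
  row.values.none? { |value| MISSING_MARKERS.include?(value) }
end

# Made-up rows mimicking two of the columns kept above.
SAMPLE_ROWS = [
  { "AgeuponOutcome" => "1 year",  "SexuponOutcome" => "Neutered Male" },
  { "AgeuponOutcome" => nil,       "SexuponOutcome" => "Intact Female" },
  { "AgeuponOutcome" => "3 weeks", "SexuponOutcome" => "Unknown" },
  { "AgeuponOutcome" => "2 years", "SexuponOutcome" => "Spayed Female" }
].freeze

# select returns a new array, so nothing is removed while iterating.
complete_rows = SAMPLE_ROWS.select { |row| complete_row?(row) }
puts "Kept #{complete_rows.size} of #{SAMPLE_ROWS.size} rows"
```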


We convert `AgeuponOutcome` to a numeric variable measured in weeks.

In [5]:
result = []
shelter_data["AgeuponOutcome"].each do |age|
  # each entry looks like "2 years": a number followed by a time unit
  num, unit = age.split
  case unit
  when "year", "years"
    result.push(52.0 * num.to_f)
  when "month", "months"
    # rough approximation; 52.0 / 12 (about 4.33) would match the year factor
    result.push(4.5 * num.to_f)
  when "week", "weeks"
    result.push(num.to_f)
  when "day", "days"
    result.push(num.to_f / 7.0)
  else
    raise "Unknown AgeuponOutcome unit: #{unit.inspect}"
  end
end
shelter_data["AgeuponOutcome"] = result
shelter_data.head(3)

Daru::DataFrame (rows: 3, cols: 4)

| | AgeuponOutcome | AnimalType | OutcomeType | SexuponOutcome |
|---|---|---|---|---|
| 0 | 52.0 | Dog | Return_to_owner | Neutered Male |
| 1 | 52.0 | Cat | Euthanasia | Spayed Female |
| 2 | 104.0 | Dog | Adoption | Neutered Male |
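
The conversion can also be written as a standalone function, which makes it easy to check on a few values. The sketch below is plain Ruby; `age_in_weeks` and `WEEKS_PER_UNIT` are hypothetical names, and the month factor of 4.5 is copied from the cell above so that the results agree with it.

```ruby
# Weeks per time unit; 4.5 weeks per month matches the notebook cell above
# (52.0 / 12 would be a closer approximation).
WEEKS_PER_UNIT = {
  "year" => 52.0,
  "month" => 4.5,
  "week" => 1.0,
  "day" => 1.0 / 7.0
}.freeze

# Hypothetical helper: converts a string such as "2 years" to weeks.
def age_in_weeks(age)
  num, unit = age.split
  # strip a trailing plural "s" so "years" and "year" share one entry
  factor = WEEKS_PER_UNIT.fetch(unit.to_s.chomp("s")) do
    raise ArgumentError, "Unknown AgeuponOutcome unit: #{unit.inspect}"
  end
  num.to_f * factor
end

["1 year", "2 years", "3 weeks", "5 months", "4 days"].each do |age|
  puts "#{age} -> #{age_in_weeks(age)} weeks"
end
```

Keeping the factors in one table makes the inconsistency between the year and month factors easy to spot and to change in one place.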


Then we transform `AnimalType` and `SexuponOutcome` into sets of 0-1-valued dummy vectors.

In [6]:
module Daru
  class DataFrame

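    # Adds one 0-1-valued dummy vector per level of the given vector.
    # The original vector is kept; delete it separately if needed.
    def replace_with_dummy_vectors(vector_name)
      vector = self[vector_name]
      levels = vector.to_a.uniq
      # drop the last level, which becomes the reference category;
      # keeping all levels would be redundant in a regression with an intercept
      levels.pop

      levels.each do |l|
        new_name = "#{l}_#{vector_name}"
        new_vector = Array.new
        vector.each { |e| new_vector.push(e == l ? 1.0 : 0.0) }
        self[new_name] = new_vector
      end
    end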

  end
end

shelter_data.replace_with_dummy_vectors("AnimalType")
shelter_data.delete_vector("AnimalType")
shelter_data.replace_with_dummy_vectors("SexuponOutcome")
shelter_data.delete_vector("SexuponOutcome")

shelter_data.head(3)

Daru::DataFrame (rows: 3, cols: 6)

| | AgeuponOutcome | OutcomeType | Dog_AnimalType | Neutered Male_SexuponOutcome | Spayed Female_SexuponOutcome | Intact Male_SexuponOutcome |
|---|---|---|---|---|---|---|
| 0 | 52.0 | Return_to_owner | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 52.0 | Euthanasia | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 104.0 | Adoption | 1.0 | 1.0 | 0.0 | 0.0 |
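
To see why one level is dropped, the plain-Ruby sketch below builds dummy columns for all levels and for all but the last. With every level included, the dummy values of each row sum to 1, which duplicates an intercept column; dropping one level removes that redundancy and makes the dropped level the reference category. `dummy_columns` and `SAMPLE_SEXES` are hypothetical and independent of `daru`.

```ruby
# Hypothetical helper: maps each level to a 0-1 column.
# With drop_last: true, the last level (in order of first appearance)
# is omitted and acts as the reference category.
def dummy_columns(values, drop_last: true)
  levels = values.uniq
  levels = levels[0...-1] if drop_last
  levels.to_h { |level| [level, values.map { |v| v == level ? 1.0 : 0.0 }] }
end

# Made-up sample using the level names from the data.
SAMPLE_SEXES = ["Neutered Male", "Spayed Female", "Intact Male",
                "Intact Female", "Intact Male"].freeze

full    = dummy_columns(SAMPLE_SEXES, drop_last: false)
reduced = dummy_columns(SAMPLE_SEXES)

# Row sums over all levels (each row belongs to exactly one level).
puts full.values.transpose.map(&:sum).inspect
# Row sums after dropping the last level, "Intact Female".
puts reduced.values.transpose.map(&:sum).inspect
```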


We create a 0-1-valued indicator for whether the animal got adopted.

In [7]:
result = []
shelter_data["OutcomeType"].each do |entry|
  result.push(entry == "Adoption" ? 1.0 : 0.0)
end

shelter_data["Adoption"] = result
shelter_data.delete_vector("OutcomeType")
shelter_data.head(3)

Daru::DataFrame (rows: 3, cols: 6)

| | AgeuponOutcome | Dog_AnimalType | Neutered Male_SexuponOutcome | Spayed Female_SexuponOutcome | Intact Male_SexuponOutcome | Adoption |
|---|---|---|---|---|---|---|
| 0 | 52.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 52.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 104.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |


## Model fit

Now that the data is in an appropriate form, we can fit the logistic regression model with `statsample-glm`.

__Unfortunately, `statsample-glm` cannot deal with this data size.__ My computer runs out of memory (12 GB) when I fit a logistic regression model on the full data. Thus, I take a small subset of 1000 rows (rows 1 through 1000).

In [8]:
x = shelter_data.row[1..1000]
x.head(3)

Daru::DataFrame (rows: 3, cols: 6)

| | AgeuponOutcome | Dog_AnimalType | Neutered Male_SexuponOutcome | Spayed Female_SexuponOutcome | Intact Male_SexuponOutcome | Adoption |
|---|---|---|---|---|---|---|
| 1 | 52.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 104.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
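
Taking a contiguous block of rows is simple, but the result depends on the order of the file. One alternative is a seeded random sample of row positions, sketched below in plain Ruby. `sample_row_positions` is a hypothetical helper and the seed 42 is arbitrary; how the chosen positions are then used to select rows depends on the `daru` version and is left out here.

```ruby
# Hypothetical helper: draws `size` distinct row positions from 0...n_rows,
# reproducibly for a given seed, and returns them in ascending order.
def sample_row_positions(n_rows, size, seed: 42)
  (0...n_rows).to_a.sample(size, random: Random.new(seed)).sort
end

# 25621 is the number of complete rows reported above.
positions = sample_row_positions(25_621, 1000)
puts positions.size
puts positions.first(5).inspect
```

Fixing the seed keeps the subset, and therefore the fitted coefficients, reproducible between runs.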


In [9]:
require 'statsample-glm'

glm_adoption = Statsample::GLM.compute(x, "Adoption", :logistic)
glm_adoption.coefficients :hash

{:AgeuponOutcome=>-0.006082852830214141, :Dog_AnimalType=>-0.7771170056503076, :"Neutered Male_SexuponOutcome"=>1.3284196185041677, :"Spayed Female_SexuponOutcome"=>1.7102012793438772, :"Intact Male_SexuponOutcome"=>-2.7451150407291425}

## Possible next steps

1. Interpret the logistic regression coefficients.
2. Fit logistic regression models with euthanasia, death, etc. as response variable.
3. Predict adoption, euthanasia, death, etc. on test data.
4. Submit prediction results to Kaggle, and fail against random forest models.
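
As a starting point for step 1, the coefficients printed above can be turned into odds ratios by exponentiating them: holding the other predictors fixed, a one-unit increase in a predictor multiplies the odds of adoption by `exp(coefficient)`. The sketch below is plain Ruby with the coefficient values copied from the output above and rounded to four decimals; `odds_ratios` and `ADOPTION_COEFFICIENTS` are hypothetical names. The estimates come from only 1000 rows, so any interpretation is tentative.

```ruby
# Coefficients copied from the fitted model's output above, rounded.
ADOPTION_COEFFICIENTS = {
  "AgeuponOutcome" => -0.0061,
  "Dog_AnimalType" => -0.7771,
  "Neutered Male_SexuponOutcome" => 1.3284,
  "Spayed Female_SexuponOutcome" => 1.7102,
  "Intact Male_SexuponOutcome" => -2.7451
}.freeze

# Hypothetical helper: exp(coefficient) is the multiplicative change in the
# odds of the response for a one-unit increase in the predictor.
def odds_ratios(coefficients)
  coefficients.transform_values { |beta| Math.exp(beta) }
end

odds_ratios(ADOPTION_COEFFICIENTS).each do |name, ratio|
  puts format("%-30s %.3f", name, ratio)
end
```

Ratios below 1 correspond to negative coefficients, that is, lower odds of adoption. For the dummy variables the comparison is against the dropped reference levels, which are `Cat` for `AnimalType` and `Intact Female` for `SexuponOutcome`.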