# Logistic regression with categorical data

We aim to fit a logistic regression model to the [shelter animal data](https://www.kaggle.com/c/shelter-animal-outcomes) from [Kaggle](https://www.kaggle.com/competitions) using the Ruby gems `daru` and `statsample-glm`.

In [15]:
require 'daru'
shelter_data = Daru::DataFrame.from_csv 'data/animal_shelter_train.csv'
shelter_data.head(3)

Daru::DataFrame:47280018323020 rows: 3 cols: 10

|   | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | AgeuponOutcome | Breed | Color |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner |  | Dog | Neutered Male | 1 year | Shetland Sheepdog Mix | Brown/White |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | 1 year | Domestic Shorthair Mix | Cream Tabby |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | 2 years | Pit Bull Mix | Blue/White |


The first thing we notice is that none of the variables is numeric, and `statsample-glm` cannot handle non-numeric data. __This should change after this year's GSoC__, but for now we have to do much of the data transformation by hand.

## What is the model, what are we predicting, and what are the predictors?

We want to predict the outcome (variable `OutcomeType`) for each animal. There are five possible outcomes:

In [16]:
shelter_data["OutcomeType"].to_a.uniq

["Return_to_owner", "Euthanasia", "Adoption", "Transfer", "Died"]

Because multinomial logistic regression is not supported by `statsample-glm`, we fit five different one-vs-all logistic regression models instead.
That is, one model uses as its response a 0-1-valued indicator vector of whether the animal was adopted. The next model uses a 0-1-valued indicator of whether the animal was euthanized. Likewise, for the remaining three models, the response variables indicate whether the animal was reunited with its previous owner, died of natural causes, or was transferred.

For simplicity, and since this data analysis is just for demonstration purposes, we keep only the variables `AgeuponOutcome`, `AnimalType`, and `SexuponOutcome` as the predictors in the model.

At the end, given an animal's age, type (cat or dog) and sex (neutered male, spayed female, intact male, etc.), we will be able to assign it a "score" for each of the five outcomes, using each of the five models respectively. Then the outcome that gets the largest score assigned is our predicted outcome for that animal.
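The scoring step described above can be sketched in plain Ruby. This is only an illustration: the constant `HYPOTHETICAL_MODELS`, its coefficient values, and the helper names `logistic`, `score` and `predict_outcome` are made up for this example and are not fitted to the shelter data or part of `statsample-glm`. To keep it short, the sketch uses only age and a dog indicator as predictors.

```ruby
# Hypothetical coefficients for illustration only; NOT fitted to the shelter data.
HYPOTHETICAL_MODELS = {
  "Adoption"        => { constant: -0.5, age_weeks: -0.004, dog:  0.3 },
  "Return_to_owner" => { constant: -1.5, age_weeks:  0.003, dog:  1.0 },
  "Euthanasia"      => { constant: -3.0, age_weeks:  0.002, dog:  0.1 },
  "Transfer"        => { constant: -1.0, age_weeks: -0.002, dog: -0.5 },
  "Died"            => { constant: -5.0, age_weeks:  0.0,   dog: -0.2 }
}.freeze

# Logistic (sigmoid) function: maps a linear predictor to a value in (0, 1).
def logistic(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Score that one one-vs-all model assigns to one animal.
def score(coefs, animal)
  z = coefs[:constant] +
      coefs[:age_weeks] * animal[:age_weeks] +
      coefs[:dog] * animal[:dog]
  logistic(z)
end

# The predicted outcome is the one whose model assigns the largest score.
def predict_outcome(models, animal)
  models.max_by { |_outcome, coefs| score(coefs, animal) }.first
end

animal = { age_weeks: 52.0, dog: 1.0 }
HYPOTHETICAL_MODELS.each do |outcome, coefs|
  puts "#{outcome}: #{score(coefs, animal).round(3)}"
end
puts "Predicted outcome: #{predict_outcome(HYPOTHETICAL_MODELS, animal)}"
```

Note that the five scores come from separately fitted models, so they need not sum to one; only their ranking is used for the prediction.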

## Data preprocessing

First, we remove the variables that we are not going to use from the dataset (__after this year's GSoC, it will become unnecessary to remove any variables, because we will be able to specify what variables to use in the regression model with an R-like formula language__).

In [17]:
shelter_data.delete_vectors *%W[AnimalID DateTime Name OutcomeSubtype Breed Color]
# remaining vectors:
shelter_data.head(3)

Daru::DataFrame:47280018564880 rows: 3 cols: 4

|   | OutcomeType | AnimalType | SexuponOutcome | AgeuponOutcome |
|---|---|---|---|---|
| 0 | Return_to_owner | Dog | Neutered Male | 1 year |
| 1 | Euthanasia | Cat | Spayed Female | 1 year |
| 2 | Adoption | Dog | Neutered Male | 2 years |


When we check the values that the remaining variables attain, we see that there are missing values denoted either as `nil` or as `"Unknown"`.

In [18]:
shelter_data.each { |vec| puts vec.name + " levels: " + vec.to_a.uniq.join(",") }
puts "------------------------"
puts "Are there any nil values? #{shelter_data.has_missing_data?}"
puts "------------------------"
puts "There are #{shelter_data.shape[0]} rows total, including those with missing data."

OutcomeType levels: Return_to_owner,Euthanasia,Adoption,Transfer,Died
AnimalType levels: Dog,Cat
SexuponOutcome levels: Neutered Male,Spayed Female,Intact Male,Intact Female,Unknown,
AgeuponOutcome levels: 1 year,2 years,3 weeks,1 month,5 months,4 years,3 months,2 weeks,2 months,10 months,6 months,5 years,7 years,3 years,4 months,12 years,9 years,6 years,1 weeks,11 years,4 weeks,7 months,8 years,11 months,4 days,9 months,8 months,15 years,10 years,1 week,0 years,14 years,3 days,6 days,5 days,5 weeks,2 days,16 years,1 day,13 years,,17 years,18 years,19 years,20 years
------------------------
Are there any nil values? true
------------------------
There are 26729 rows total, including those with missing data.


We delete all rows that contain missing values, treating `"Unknown"` in `SexuponOutcome` as missing.

In [19]:
shelter_data = shelter_data.filter_rows { |row| !(row.has_missing_data? or row['SexuponOutcome'] == "Unknown") }

puts "There are #{shelter_data.shape[0]} rows left"
puts "Are there any nil values left? #{shelter_data.has_missing_data?}"

There are 25621 rows left
Are there any nil values left? false


We convert `AgeuponOutcome` to a numeric variable measured in weeks, counting a year as 52 weeks and a month as 4.5 weeks.

In [20]:
shelter_data['AgeuponOutcome'].map! do |age|
  num, unit = age.split
  num = num.to_f
  case unit
  when "year", "years"
    52.0 * num
  when "month", "months"
    4.5 * num
  when "week", "weeks"
    num
  when "day", "days"
    num / 7.0
  else
    raise "Unknown AgeuponOutcome unit!"
  end  
end
shelter_data.head(3)

Daru::DataFrame:47280019601480 rows: 3 cols: 4

|   | OutcomeType | AnimalType | SexuponOutcome | AgeuponOutcome |
|---|---|---|---|---|
| 0 | Return_to_owner | Dog | Neutered Male | 52.0 |
| 1 | Euthanasia | Cat | Spayed Female | 52.0 |
| 2 | Adoption | Dog | Neutered Male | 104.0 |
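The unit conversion can be checked on a few values without loading the data. The following is a stand-alone, plain-Ruby restatement of the same factors; the names `WEEKS_PER_UNIT` and `age_in_weeks` are introduced here for illustration only. It also makes one property of the chosen factors visible: with 4.5 weeks per month, "12 months" maps to 54 weeks, while "1 year" maps to 52.

```ruby
# Stand-alone restatement of the conversion factors used above
# (hypothetical helper names; plain Ruby, no daru needed).
WEEKS_PER_UNIT = {
  "year"  => 52.0,
  "month" => 4.5,
  "week"  => 1.0,
  "day"   => 1.0 / 7.0
}.freeze

def age_in_weeks(age)
  num, unit = age.split
  # Strip a trailing plural "s" so that "years" and "year" share one entry.
  factor = WEEKS_PER_UNIT.fetch(unit.chomp("s")) do
    raise ArgumentError, "Unknown AgeuponOutcome unit: #{unit}"
  end
  num.to_f * factor
end

puts age_in_weeks("1 year")     # 52.0
puts age_in_weeks("12 months")  # 54.0
puts age_in_weeks("3 weeks")    # 3.0
```

A factor of `52.0 / 12` for months would make the two scales agree, but using it would change the fitted coefficients reported later, so the analysis here keeps 4.5.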


Then we transform `AnimalType`, `SexuponOutcome` and `OutcomeType` into sets of 0-1-valued dummy vectors.

__Note: After GSoC 2016 this step will become unnecessary. Instead, we will be able to just do:__

```
shelter_data.to_category %W[AnimalType OutcomeType SexuponOutcome]
```

In [21]:
module Daru
  class DataFrame

    def replace_with_dummy_vectors(vector_name, drop_last=true)
      vector = self[vector_name]
      levels = vector.to_a.uniq
      # drop the last level to avoid redundancy in regression
      levels.pop if drop_last

      levels.each do |l|
        new_name = "#{l}_#{vector_name}"
        new_vector = Array.new
        vector.each { |e| e==l ? new_vector.push(1.0) : new_vector.push(0.0) }
        self[new_name] = new_vector 
      end
    end

  end
end

shelter_data.replace_with_dummy_vectors("AnimalType")
shelter_data.replace_with_dummy_vectors("SexuponOutcome")
shelter_data.replace_with_dummy_vectors("OutcomeType", false)
shelter_data.delete_vectors("AnimalType", "SexuponOutcome", "OutcomeType")

shelter_data.head(3)

Daru::DataFrame:47274794741760 rows: 3 cols: 10

|   | AgeuponOutcome | Dog_AnimalType | Neutered Male_SexuponOutcome | Spayed Female_SexuponOutcome | Intact Male_SexuponOutcome | Return_to_owner_OutcomeType | Euthanasia_OutcomeType | Adoption_OutcomeType | Transfer_OutcomeType | Died_OutcomeType |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 52.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 104.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
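Why the last level is dropped for the predictors can be seen on a toy example. The sketch below is plain Ruby on an ordinary array rather than a `Daru::Vector`, and the helper name `dummy_columns` is made up for this illustration. When every level keeps its own dummy column, the columns add up to 1.0 in each row, which duplicates the constant term of the regression.

```ruby
# Plain-Ruby sketch of dummy coding (hypothetical helper, not part of daru).
# Returns a hash that maps each kept level to its 0-1-valued column.
def dummy_columns(values, drop_last: true)
  levels = values.uniq
  levels.pop if drop_last
  levels.map { |level| [level, values.map { |v| v == level ? 1.0 : 0.0 }] }.to_h
end

animal_type = %w[Dog Cat Dog]

# Keeping all levels: the dummy columns sum to 1.0 in every row,
# exactly like the constant (intercept) column.
full = dummy_columns(animal_type, drop_last: false)
row_sums = full.values.transpose.map(&:sum)
puts row_sums.inspect      # [1.0, 1.0, 1.0]

# Dropping the last level removes that redundancy; "Cat" becomes the baseline.
reduced = dummy_columns(animal_type)
puts reduced.keys.inspect  # ["Dog"]
puts reduced["Dog"].inspect # [1.0, 0.0, 1.0]
```

For the response `OutcomeType`, all levels are kept, because each dummy column serves as the response of its own one-vs-all model rather than as a predictor.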


The column `Adoption_OutcomeType` is a 0-1-valued indicator for whether the animal got adopted; it serves as the response variable of the first model.

## Model fit

Now that the data are in an appropriate form, we can fit the logistic regression model with `statsample-glm`.

In [8]:
x = shelter_data.clone
x.delete_vectors *%W[Return_to_owner_OutcomeType Euthanasia_OutcomeType Transfer_OutcomeType Died_OutcomeType]
x.shape

[25621, 6]

__Note:__ As mentioned before, deleting vectors will become unnecessary after this year's GSoC.

In [9]:
require 'statsample-glm'

glm_adoption = Statsample::GLM.compute(x, "Adoption_OutcomeType", :logistic, constant: 1, method: :irls)
glm_adoption.coefficients :hash

{:AgeuponOutcome=>-0.004452060293208014, :Dog_AnimalType=>-0.4419095589782122, :"Neutered Male_SexuponOutcome"=>3.431548411851544, :"Spayed Female_SexuponOutcome"=>3.652769402043723, :"Intact Male_SexuponOutcome"=>-0.25835528345818276, :constant=>-2.418271221948218}

__Unfortunately, `statsample-glm` is extremely slow (hours) and memory hungry (uses >10 GB RAM) with this data size.__ I will look into how it can be improved.
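As a first step towards reading the coefficients printed above, they can be plugged into the logistic function by hand. The sketch below is plain Ruby and does not call `statsample-glm`; the constant `ADOPTION_COEFS` simply copies the printed values, and the helper `adoption_probability` is a hypothetical name introduced here. Because the last level of each categorical predictor was dropped, the baseline animal is assumed to be an intact female cat, and predictors that are not listed count as zero.

```ruby
# Coefficients copied from the fitted adoption model above.
ADOPTION_COEFS = {
  "AgeuponOutcome"               => -0.004452060293208014,
  "Dog_AnimalType"               => -0.4419095589782122,
  "Neutered Male_SexuponOutcome" => 3.431548411851544,
  "Spayed Female_SexuponOutcome" => 3.652769402043723,
  "Intact Male_SexuponOutcome"   => -0.25835528345818276,
  "constant"                     => -2.418271221948218
}.freeze

# Predicted probability of adoption for one animal (hypothetical helper).
# Predictors missing from the hash are treated as 0.
def adoption_probability(predictors, coefs = ADOPTION_COEFS)
  z = coefs["constant"] +
      predictors.sum { |name, value| coefs.fetch(name) * value }
  1.0 / (1.0 + Math.exp(-z))
end

# A neutered male dog aged one year (52 weeks).
dog = {
  "AgeuponOutcome"               => 52.0,
  "Dog_AnimalType"               => 1.0,
  "Neutered Male_SexuponOutcome" => 1.0
}
puts "P(adoption) = #{adoption_probability(dog).round(3)}"

# exp(coefficient) is the factor by which the odds of adoption change
# for a one-unit increase in the predictor, holding the others fixed.
ADOPTION_COEFS.each do |name, coef|
  next if name == "constant"
  puts "#{name}: odds ratio #{Math.exp(coef).round(3)}"
end
```

Read this way, the positive coefficients for neutered males and spayed females mean much higher odds of adoption than for the baseline, while the negative age coefficient means that the odds shrink slightly with every additional week of age.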

## Possible next steps

1. Interpret the logistic regression coefficients.
2. Fit logistic regression models with euthanasia, death, etc. as response variable.
3. Predict adoption, euthanasia, death, etc. on test data.
4. Submit prediction results to Kaggle, and fail against random forest models.