# CS6140 Assignments

**Instructions**
1. In each assignment cell, look for the block:
 ```
  #BEGIN YOUR CODE
  raise NotImplementedError.new()
  #END YOUR CODE
 ```
1. Replace this block with your solution.
1. Test your solution by running the cells following your block (indicated by ##TEST##)
1. Click the "Validate" button above to validate the work.

**Notes**
* You may add other cells and functions as needed
* Keep all code in the same notebook
* In order to receive credit, code must "Validate" on the JupyterHub server

---

# Final Project: Part 2 - Feature Extraction


In any practical machine learning problem, the data preparation and feature extraction stages are the most important and time-consuming. The final project exposes you to a real-world dataset. In this part of the final project, you will implement various feature extraction and transformation methods that will be useful in the next part. 

There is an accompanying notebook, [part-2-a](../part-2-a.ipynb), which illustrates the feature extraction methods in R.

In [1]:
require './assignment_lib'

#Initializes the database used for this assignment
dir = "/home/dataset"
$dev_db = SQLite3::Database.new "#{dir}/credit_risk_data_dev.db", results_as_hash: true, readonly: true

#<SQLite3::Database:0x000000000276ce68 @tracefunc=nil, @authorizer=nil, @encoding=nil, @busy_handler=nil, @collations={}, @functions={}, @results_as_hash=true, @type_translation=nil, @type_translator=#<Proc:0x000000000267c7d8@/usr/local/rvm/gems/ruby-2.5.1/gems/sqlite3-1.4.2/lib/sqlite3/database.rb:722 (lambda)>, @readonly=true>

### Create Dataset

The function ```create_dataset``` has been implemented for you in [assignment_lib.rb](../assignment_lib.rb). Given an SQL query, this function constructs the examples for a dataset like those we have used in this course. 

Each example gets an ```id``` field equal to ```SK_ID_CURR```, and ```TARGET``` is renamed to ```label```. Both columns **must** appear in the query. All feature names from the SQL query are lowercased in the features field. 


If the query is:
```sql
select sk_id_curr, target, ext_source_1 from application_train  where ext_source_1 <> '' order by sk_id_curr;
```

then the result is:

```json
{
    "features" : ["ext_source_1"],
    "data" : [
        {"label":1,"id":100002,"features":{"ext_source_1":0.08303696739132256}},
        {"label":0,"id":100015,"features":{"ext_source_1":0.7220444501416448}}
    ]
}
```
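To make the structure concrete, here is a minimal sketch that walks over a hand-written hash with the same shape as the result above. The values are copied from that example and no database is involved; the helper names `positive_ids` and `feature_column` are hypothetical and are not part of `assignment_lib.rb`.

```ruby
# Hypothetical helper: ids of all examples with label 1.
def positive_ids(dataset)
  dataset["data"].select { |ex| ex["label"] == 1 }.map { |ex| ex["id"] }
end

# Hypothetical helper: the values of one feature across all examples.
def feature_column(dataset, name)
  dataset["data"].map { |ex| ex["features"][name] }
end

# A hand-written dataset shaped like the result shown above.
def toy_dataset
  {
    "features" => ["ext_source_1"],
    "data" => [
      {"label" => 1, "id" => 100002, "features" => {"ext_source_1" => 0.08303696739132256}},
      {"label" => 0, "id" => 100015, "features" => {"ext_source_1" => 0.7220444501416448}}
    ]
  }
end

puts positive_ids(toy_dataset).inspect
puts feature_column(toy_dataset, "ext_source_1").inspect
```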

In [28]:
sql = <<SQL
select sk_id_curr, target, ext_source_1 
from application_train 
where ext_source_1 <> '' 
order by sk_id_curr limit 1
SQL
  
dataset = create_dataset $dev_db, sql
examples = dataset["data"]  
puts dataset

{"features"=>["ext_source_1"], "data"=>[{"label"=>1, "id"=>100002, "features"=>{"ext_source_1"=>0.08303696739132256}}]}


### Sample dataset
Here is a sample dataset we will use in this part, which illustrates some basic feature extraction. Note that this is just an example and you **should not** restrict your final project to these features. 

In [29]:
## Create a sample dataset and store as a separate file
def create_sample_dataset
  sql = <<SQL
  select target
, sk_id_curr
, ext_source_1
, ext_source_2
, ext_source_3
, amt_income_total
, amt_credit
, commonarea_avg
, flag_own_car 
, flag_mobil
, days_birth
, organization_type
, code_gender
, flag_own_realty
, flag_emp_phone
, name_education_type
, name_income_type
, name_family_status
, name_housing_type
, own_car_age
from application_train
where ext_source_1 <> ''
order by sk_id_curr
SQL

  sample_dataset = create_dataset $dev_db, sql
end

:create_sample_dataset

### Export dataset

The ```export_to_tsv``` function below exports a dataset to a TSV file that can be read by R or Excel. If you so choose, you can use this to export any features you have for use in R; see the related notebook. We will export the sample dataset to ```part-2-random-sample.tsv``` in this directory if you want to use it.

In [30]:
def export_to_tsv dataset, file_name
  File.open file_name, 'w' do |out|
    features = dataset["features"].sort
    out.puts [["id", "label"], features].join("\t")
    examples = dataset["data"]
    
    examples.each do |example|
      id = example["id"]
      label = example["label"]
      values = features.collect {|k| example["features"].fetch(k, nil)}
      out.puts [id, label, values].join("\t")
    end.size
  end
end

:export_to_tsv

In [31]:
def export_sample_dataset
  sample_dataset = create_sample_dataset()
  export_to_tsv sample_dataset, "./part-2-random-sample.tsv"
end

## The line below should return 6641 examples
export_sample_dataset()

6641

## Question 1.1 (4 points)

Given an array of doubles, ```x```, implement mean and the sample standard deviation (not the population standard deviation). 
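For reference, with $n$ values $x_1, \dots, x_n$, the two statistics are usually defined as follows (the sample standard deviation divides by $n-1$, the population version by $n$):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

For example, for $x = [3, 4, 5]$ we get $\bar{x} = 4$ and $s = \sqrt{\frac{(3-4)^2 + (4-4)^2 + (5-4)^2}{2}} = 1$.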

In [32]:
def mean x
  sum=0
  x.each do |item|
    sum+=item.to_f
  end
  
  return sum.to_f/(x.size)
end

:mean

In [33]:
### TEST ###
def test_11_0()
  test_1 = [3.0, 4.0, 5.0]
  assert_equal(4.0, mean(test_1))
end
test_11_0()

In [34]:
def stdev x
  sum=0
  mean1=mean(x)
  #puts mean1
  x.each do |item|
    sum+=(item-mean1)**2
  end
  
  #puts sum
  
  return (sum.to_f/(x.size-1))**0.5
end

:stdev

In [35]:
### TEST ###
def test_12_1()
  test_1 = [3.0, 4.0, 5.0]
  assert_equal(1.0, stdev(test_1))
end
test_12_1()

## Question 2

We will adopt a __Pipeline__ software development pattern for feature extraction in which a ```FeatureTransformer``` is first trained on the dataset and then applied to a batch of examples. 

```ruby
class FeatureTransformer
    def train dataset
        ## Calculate any statistics
    end
    
    def apply example_batch
        ## Apply transform to a batch of examples
    end
end
```

You will create multiple ```FeatureTransformers``` which will alter a dataset and are designed to be used sequentially. Each transform will be trained with data to collect any statistics and then will transform a batch of examples.
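As a rough illustration of the sequential use described above, the sketch below chains two toy transformers. The classes `AddOne` and `TimesTwo` and the helper `run_pipeline` are hypothetical and exist only for this example. The sketch also assumes that each stage is trained on the output of the previous stage, which is one reasonable reading of "used sequentially".

```ruby
# Two toy transformers that follow the train/apply interface.
# Both classes are hypothetical and only serve this illustration.
class AddOne
  def train(dataset); end

  def apply(example_batch)
    example_batch.each { |ex| ex["features"].transform_values! { |v| v + 1 } }
  end
end

class TimesTwo
  def train(dataset); end

  def apply(example_batch)
    example_batch.each { |ex| ex["features"].transform_values! { |v| v * 2 } }
  end
end

# Hypothetical helper: train each transformer, then apply it, in order.
def run_pipeline(transformers, dataset)
  transformers.each do |t|
    t.train(dataset)
    t.apply(dataset["data"])
  end
  dataset
end

def pipeline_demo
  dataset = {"features" => ["x"], "data" => [{"features" => {"x" => 1}}]}
  run_pipeline([AddOne.new, TimesTwo.new], dataset)
  dataset["data"][0]["features"]["x"]
end

puts pipeline_demo
```

Because the order matters, swapping the two stages would give a different value for the same input.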

## Question 2.1 (5 points)

Implement the Z-Score Transformer. We have implemented z-score normalization before, but this is a refactor as a FeatureTransformer.

In the ```train``` method, calculate the ```means``` and ```stdevs``` member hashes containing the mean and standard deviation for the list of features provided in ```@whitelist```. 

Example:
```
Means
{"ext_source_1"=>0.4977204432348777, "ext_source_2"=>0.5264289029370367}

Stdevs
{"ext_source_1"=>0.21195635191237827, "ext_source_2"=>0.18333559581081152}
```

1. Do not alter the data in the ```train``` method.
1. If a feature is missing, i.e., ```nil```, do not accumulate it. Don't assume it has a value; we will deal with missing values later.
1. Expect that this method will be called with string-valued features; do not accumulate them either.

In [36]:
class ZScoreTransformer
  attr_reader :means, :stdevs
  
  def initialize feature_names
    @means = Hash.new {|h,k| h[k] = 0}
    @stdevs = Hash.new {|h,k| h[k] = 0}
    @feature_names = feature_names    
  end
  
  def train dataset
    data=dataset["data"]
    
    @feature_names.each do |feature|
      # Keep only numeric values; nil and string values are not accumulated
      values = data.collect {|item| item["features"][feature]}.select {|v| v.is_a?(Numeric)}
      next if values.empty?
      
      @means[feature] = mean(values)
      # The sample standard deviation needs at least two values
      @stdevs[feature] = values.size > 1 ? stdev(values) : 0.0
    end
  end
end

:train

In [37]:
def test_21_1()
  sample_dataset = create_sample_dataset()
  whitelist = %w(ext_source_1 ext_source_2)
  
  zscore = ZScoreTransformer.new whitelist
  zscore.train sample_dataset
  z_means = zscore.means
  puts "Means", z_means
  
  assert_equal whitelist.size, z_means.size
  
  assert_in_delta(0.4977204432348777, zscore.means["ext_source_1"], 1e-2, "Mean for ext_source_1")
  assert_in_delta(0.5258740162753052, zscore.means["ext_source_2"], 1e-2, "Mean for ext_source_2")
end

test_21_1()

Means
{"ext_source_1"=>0.4977204432348777, "ext_source_2"=>0.5264289029370367}


In [38]:
def test_21_2()
  sample_dataset = create_sample_dataset()
  whitelist = %w(ext_source_1 ext_source_2)
  
  zscore = ZScoreTransformer.new whitelist
  zscore.train sample_dataset
  
  z_stdevs = zscore.stdevs
  puts "Stdevs", z_stdevs
  
  assert_equal whitelist.size, z_stdevs.size
  
  assert_in_delta(0.21195635191237827, zscore.stdevs["ext_source_1"], 1e-2, "Stdev for ext_source_1")
  assert_in_delta(0.18403355900410537, zscore.stdevs["ext_source_2"], 1e-2, "Stdev for ext_source_2")
end

test_21_2()

Stdevs
{"ext_source_1"=>0.21195635191237827, "ext_source_2"=>0.18333559581081152}


In [39]:
def test_21_3()
  sample_dataset = create_sample_dataset()
  
  ## Added a string-valued feature, check that it does not cause any problems
  whitelist = %w(ext_source_1 ext_source_2 code_gender)
  
  zscore = ZScoreTransformer.new whitelist
  zscore.train sample_dataset
  
  assert_false(zscore.means.has_key?("ext_source_3"), "Only apply to whitelisted features")
end

test_21_3()

Next, implement the ```apply``` method, which takes a batch of examples and applies the z-normalization (aka standardization). Examples are altered in place without copying. Note that any feature which is missing or has zero standard deviation should not be altered. 

Transform should alter features in place, for example:
```
Before transform:
[{"features"=>{"ext_source_1"=>0.08303696739132256}}]
After transform:
[{"features"=>{"ext_source_1"=>-1.956456940790259}}]
```

Note:
1. Skip any missing features. We could assume they are zero, but we will not do this here.


In [40]:
class ZScoreTransformer  
  def apply example_batch
    
    example_batch.each do |item|
      features = item["features"]
      @feature_names.each do |key|
        value = features[key]
        # Skip missing and non-numeric values, and features with zero stdev
        next unless value.is_a?(Numeric)
        next if @stdevs[key]==0
        features[key]=(value-@means[key])/(@stdevs[key])
      end
    end

    return example_batch
  end
end
  
  

:apply

In [41]:
### TEST ###
def test_21_4()
  sample_dataset = create_sample_dataset()
  whitelist = %w(ext_source_1 ext_source_2 code_gender)
  
  zscore = ZScoreTransformer.new whitelist
  example1 = {"features" => {"ext_source_1" => 0.08303696739132256}}
  example_batch = [example1]

  puts "Before transform:", example_batch
  zscore.train sample_dataset
  assert_in_delta(0.08303696739132256, example1["features"]["ext_source_1"], 1e-3, "Creating transform does not alter dataset")
  
  zscore.apply(example_batch)
  puts "After transform:", example_batch

  assert_in_delta(-1.956456940790259, example1["features"]["ext_source_1"], 1e-3, "Applies to ext_source_1")
end

test_21_4()

Before transform:
[{"features"=>{"ext_source_1"=>0.08303696739132256}}]
After transform:
[{"features"=>{"ext_source_1"=>-1.956456940790259}}]


In [42]:
### TEST ###
# Handles string-valued features

def test_21_5()
  sample_dataset = create_sample_dataset()
  whitelist = %w(code_gender)
  zscore = ZScoreTransformer.new whitelist
  zscore.train sample_dataset
  
  example1 = {"features" => {"name_education_type" => "Secondary / secondary special", "code_gender" => "M"}}
  example_batch = [example1]
  
  puts "Before transform:", example_batch
  zscore.apply(example_batch)
  puts "After transform:", example_batch

  assert_equal("Secondary / secondary special", example1["features"]["name_education_type"], "Skips features not in whitelist")
  assert_equal("M", example1["features"]["code_gender"], "Does not apply to features with zero stdev")  
end

test_21_5()

Before transform:
[{"features"=>{"name_education_type"=>"Secondary / secondary special", "code_gender"=>"M"}}]
After transform:
[{"features"=>{"name_education_type"=>"Secondary / secondary special", "code_gender"=>"M"}}]


## Question 2.2 (10 points)

Implement the mean imputation transform: any example with a missing feature has that feature replaced with the mean of the non-missing feature values. Note that this only makes sense for numeric features. The transformer takes an array of feature names as a whitelist. 

In the ```train``` method, calculate the mean values for each whitelisted feature. Store these means in the ```means``` member variable.

In [46]:
class MeanImputation
  attr_reader :means
  
  def initialize feature_names
    @means = Hash.new {|h,k| h[k] = 0}
    @miss = Hash.new {|h,k| h[k] = 0}
    @feature_names = feature_names 
  end
  
  def train dataset    
    data=dataset["data"]

    @feature_names.each do |feature|
      # Gather the non-missing numeric values for this feature
      values=[]
      data.each do |item|
        next if item["features"][feature].nil?
        next if !(item["features"][feature].is_a? Numeric)
        values << item["features"][feature]           
      end
      @means[feature] = mean(values)
    end
    
  end
end

:train

In [47]:
### TEST ###
def test_22_1()
  sample_dataset = create_sample_dataset()
  whitelist = %w(ext_source_2)
  transform = MeanImputation.new whitelist
  transform.train sample_dataset
  z_means = transform.means
  puts "Means", z_means
  
  assert_equal whitelist.size, z_means.size  
  assert_false(z_means.has_key?("ext_source_3"), "Only apply to whitelisted features")  
  assert_in_delta(0.5258740162753052, z_means["ext_source_2"], 1e-2, "Mean for ext_source_2")
end

test_22_1()

Means
{"ext_source_2"=>0.5264289029370367}


Next, implement the ```apply``` method in which we replace missing values with the mean available in the ```@means``` member variable. Examples are altered in place.

Example:
```
Before imputation
[{"features"=>{"ext_source_2"=>nil}}]
After imputation
[{"features"=>{"ext_source_2"=>0.5264289029370367}}]
```
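A minimal standalone sketch of the same idea on a toy batch follows. The helper `impute!` and the means table passed to it are hypothetical. Note that an absent key reads as ```nil``` in a Ruby hash, so this sketch fills absent features as well as explicit ```nil``` values.

```ruby
# Hypothetical helper: fill nil or absent features from a table of means.
def impute!(example_batch, means)
  example_batch.each do |ex|
    means.each do |name, mean|
      ex["features"][name] = mean if ex["features"][name].nil?
    end
  end
end

def impute_demo
  batch = [
    {"features" => {"a" => nil}},   # explicit missing value
    {"features" => {"a" => 2.0}},   # present, must stay unchanged
    {"features" => {"b" => 1.0}}    # feature "a" absent altogether
  ]
  impute!(batch, {"a" => 0.5})
  batch.map { |ex| ex["features"]["a"] }
end

puts impute_demo.inspect
```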



In [48]:
class MeanImputation  
  def apply(example_batch)
        
    example_batch.each do |item|
      @feature_names.each do |feature|
        if item["features"][feature]==nil and @means[feature].is_a? (Numeric)
          item["features"][feature]=@means[feature]
        end
      end
    end  
    return example_batch
    
  end
end

:apply

In [49]:
### TEST ###
# Verifies that transformer calculates means for non-missing values
def test_22_2()
  sample_dataset = create_sample_dataset()
  whitelist = %w(ext_source_2)
  transform = MeanImputation.new whitelist
  
  transform.train sample_dataset
  z_means = transform.means
  puts "Means", z_means
  
  example1 = {"features" => {"ext_source_2" => nil}}
  example2 = {"features" => {"ext_source_1" => 0.12345}}
  
  batch = [example1, example2]
  
  puts "Before imputation", batch
  transform.apply batch
  puts "After imputation", batch
  
  assert_in_delta(0.5264289029370367, example1["features"]["ext_source_2"], 1e-3, "Fills in example 1")
  assert_in_delta(0.5264289029370367, example2["features"]["ext_source_2"], 1e-3, "Fills in example 2")
end

test_22_2()

Means
{"ext_source_2"=>0.5264289029370367}
Before imputation
[{"features"=>{"ext_source_2"=>nil}}, {"features"=>{"ext_source_1"=>0.12345}}]
After imputation
[{"features"=>{"ext_source_2"=>0.5264289029370367}}, {"features"=>{"ext_source_1"=>0.12345, "ext_source_2"=>0.5264289029370367}}]


In [839]:
### TEST ###
# Verifies that transformer calculates means for non-missing values
def test_22_3()
  sample_dataset = create_sample_dataset()
  whitelist = %w(ext_source_2)
  transform = MeanImputation.new whitelist  
  transform.train sample_dataset
  
  example2 = {"features" => {"ext_source_1" => 0.12345}}
  example3 = {"features" => {"ext_source_2" => 0.4567}}
  
  batch = [example2, example3]
  
  puts "Before imputation", batch
  transform.apply batch
  puts "After imputation", batch
  
  assert_in_delta(0.12345, example2["features"]["ext_source_1"], 1e-3, "Does not alter other features")
  assert_in_delta(0.4567, example3["features"]["ext_source_2"], 1e-3, "Does not alter non-missing values")
end

test_22_3()

Before imputation
[{"features"=>{"ext_source_1"=>0.12345}}, {"features"=>{"ext_source_2"=>0.4567}}]
After imputation
[{"features"=>{"ext_source_1"=>0.12345, "ext_source_2"=>0.5264289029370367}}, {"features"=>{"ext_source_2"=>0.4567}}]


## Question 2.3 (10 points)

To demonstrate Binning, we will create a custom transformer for the ```days_birth``` feature in the dataset. Transform the feature value from negative days to positive years in 5-year increments, with a maximum age of 100 and minimum age of zero. Features are created as one-hot encoded values as follows:

```ruby
new_feature_name = pattern % binned_age
```

where the ```%``` operator for strings applies string formatting like ```printf```. To keep everyone on the same page, define a bin $b$ given days $x$ as follows:

$ b(x) = 5 \times \left\lfloor \frac{-x}{365 \times 5} \right \rfloor$

Note that your implementation should further clip the bin to the range $\left [0,100 \right ]$.

Example: 
```
[{"features"=>{"days_birth"=>-13505}}]
After binning
[{"features"=>{"age_range_35"=>1}}]
```

Notes:
1. Skip any missing values
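The bin definition can be checked in isolation with a short sketch. The helper `age_bin` is hypothetical; it evaluates $b(x)$ and then clips the result to $[0, 100]$.

```ruby
# Hypothetical helper: compute b(x) for a days_birth value, then clip to [0, 100].
def age_bin(days_birth)
  years = 5 * (-days_birth / (365.0 * 5)).floor
  years.clamp(0, 100)
end

# -13505 days is 7.4 five-year blocks, which floors to 7, giving bin 35.
puts "age_range_%d" % age_bin(-13505)
```

Very old or positive inputs fall outside the range before clipping, for example $-40000$ days gives 105 and $1000$ days gives $-5$, so the clipping step maps them to 100 and 0.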

In [840]:
class AgeRangeAsVector
  def initialize; end
  def train dataset; end
  def apply(example_batch)
    min_age = 0
    max_age = 100
    feature_name = "days_birth"
    pattern = "age_range_%d"
    example_batch.each do |item|
      next if !(item["features"][feature_name])
      age=5*((-item["features"][feature_name])/(365.0*5)).floor
      
      # Clip the bin to [min_age, max_age]
      age=age.clamp(min_age, max_age)
      item["features"][pattern % [age]]=1
      item["features"].delete(feature_name)
    end
    
    return example_batch
  end
end

:apply

In [841]:
### TEST ###
# Verifies that binning returns a vector
def test_23_1()
  sample_dataset = create_sample_dataset()
  binner = AgeRangeAsVector.new
  
  example1 = {"features" => {"days_birth" => -37 * 365}}
  
  batch = [example1]
  
  puts "Before binning", batch
  binner.apply batch
  puts "After binning", batch
  
  
  assert_equal(1, example1["features"]["age_range_35"], "Bins example 1")
  assert_false(example1["features"].has_key?("days_birth"), "Removes feature after transform")
  assert_equal(nil, example1["features"]["age_range_30"], "Bins example 1, in the 35 bin")
end

test_23_1()

Before binning
[{"features"=>{"days_birth"=>-13505}}]
After binning
[{"features"=>{"age_range_35"=>1}}]


In [842]:
### TEST ###
# Check that bins are clipped to min and max bins
def test_23_2()
  sample_dataset = create_sample_dataset()
  binner = AgeRangeAsVector.new
  
  example2 = {"features" => {"days_birth" => -40000}}
  example3 = {"features" => {"days_birth" => 1000}}
  
  batch = [example2, example3]
  
  puts "Before binning", batch
  binner.apply batch
  puts "After binning", batch
  
  
  assert_equal(1, example2["features"]["age_range_100"], "Bins example 2, to max value")
  assert_equal(1, example3["features"]["age_range_0"], "Bins example 3, to min value")
end

test_23_2()

Before binning
[{"features"=>{"days_birth"=>-40000}}, {"features"=>{"days_birth"=>1000}}]
After binning
[{"features"=>{"age_range_100"=>1}}, {"features"=>{"age_range_0"=>1}}]


## Question 2.4 (10 points)

Implement target averaging where a categorical feature is replaced with a numerical feature whose values are the average value of the target label for all examples with that feature. In this dataset, we are treating the target labels as either 0 or 1, so the average for the target label is an estimate of the probability of the class given the example has the feature value. 

In the ```train``` method, calculate the means for each possible feature value in the provided whitelist. For a feature named ```abc```, we will create a new feature called ```avg_abc```. 

The ```@means``` member variable is meant to be a two-dimensional hash defined as follows:
```ruby
{"name_family_status"=>{
    "Single / not married"=>0.09486931268151017, 
    "Married"=>0.06931390977443609, 
    "Separated"=>0.07191011235955057, 
    "Civil marriage"=>0.093841642228739, 
    "Widow"=>0.06666666666666667
    }, 
"code_gender"=>{
    "M"=>0.09155261915998113, 
    "F"=>0.06855373728438743
    }
}
```
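To see where such numbers come from, the sketch below computes target averages for a single feature on a four-example toy dataset. The helper `target_means` is hypothetical, and it uses a group-then-average approach, which is one of several equivalent ways to compute these values.

```ruby
# Hypothetical helper: average label for each value of one categorical feature.
def target_means(examples, feature)
  groups = examples.group_by { |ex| ex["features"][feature] }
  groups.delete(nil)  # ignore examples where the feature is missing
  groups.transform_values do |group|
    group.sum { |ex| ex["label"] }.to_f / group.size
  end
end

def target_means_demo
  examples = [
    {"label" => 1, "features" => {"code_gender" => "M"}},
    {"label" => 0, "features" => {"code_gender" => "M"}},
    {"label" => 0, "features" => {"code_gender" => "F"}},
    {"label" => 0, "features" => {}}
  ]
  target_means(examples, "code_gender")
end

puts target_means_demo.inspect
```

Here "M" occurs twice with one positive label, so its average is 0.5, while "F" never occurs with label 1 and gets 0.0.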



In [843]:
class TargetAveraging
  attr_reader :means
  
  def initialize feature_names
    @means = Hash.new {|h,k| h[k] = Hash.new {|h,k| h[k] = 0}}
    @feature_names = feature_names
    @pattern = "avg_%s"
    @total=Hash.new {|h,k| h[k] = Hash.new {|h,k| h[k] = 0}}
  end
  
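  def train dataset
    
    # First pass: count how often each value of a whitelisted feature occurs
    dataset["data"].each do |item|
      item["features"].each do |key,value|
        next unless value.is_a?(String) and @feature_names.include?(key)
        @total[key][value]+=1.0
      end
    end
    
    # Second pass: each positive example adds 1/count to the mean of its value
    dataset["data"].each do |item|
      item["features"].each do |key,value|
        next unless value.is_a?(String) and item["label"]==1 and @feature_names.include?(key)
        @means[key][value]+=1.0/(@total[key][value])
      end
    end
    
    # Values that never occur with label 1 average to zero; record them as well
    @total.each do |key,counts|
      counts.each_key {|value| @means[key][value]+=0.0}
    end
    
  end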
end

:train

In [844]:
### TEST ###
# Verifies that transformer calculates means for non-missing values
def test_24_1()
  sample_dataset = create_sample_dataset()
  lookup = TargetAveraging.new %w(name_family_status code_gender)
  lookup.train sample_dataset
  means = lookup.means
  puts "Means", means
  
  nfs_means = means["name_family_status"]
  assert_equal 5, nfs_means.size  
  assert_in_delta(0.09384164, nfs_means["Civil marriage"], 1e-2, "Average for civil marriage")
  assert_in_delta(0.07191011, nfs_means["Separated"], 1e-2, "Average for Separated")
  
  cg_means = means["code_gender"]
  assert_in_delta(0.09155261915998113, cg_means["M"], 1e-2, "Average for code_gender=M")
  assert_in_delta(0.06855373728438743, cg_means["F"], 1e-2, "Average for code_gender=F")
  
end

test_24_1()

Means
{"code_gender"=>{"M"=>0.09155261915998138, "F"=>0.06855373728438752}, "name_family_status"=>{"Single / not married"=>0.0948693126815103, "Civil marriage"=>0.09384164222873895, "Married"=>0.06931390977443627, "Separated"=>0.07191011235955051, "Widow"=>0.06666666666666667}}


Next, implement the ```apply``` method which removes the original categorical feature from the example and replaces it with the new feature name and its average.

Example:

```
Before target averaging
[{"features"=>{"name_family_status"=>"Civil marriage"}}]
After target averaging
[{"features"=>{"avg_name_family_status"=>0.093841642228739}}]
```

Notes:
1. Skip any missing values
1. Skip any feature value not present in the means table.

In [845]:
class TargetAveraging  
  def apply(example_batch)
    
    example_batch.each do |item|
      features = item["features"]
      @feature_names.each do |feature|
        value = features[feature]
        # Skip missing values and values not present in the means table
        next unless value.is_a?(String)
        next unless @means.has_key?(feature) and @means[feature].has_key?(value)
        features[@pattern % feature] = @means[feature][value]
        features.delete(feature)
      end
    end
    
    return example_batch
  end
end

:apply

In [846]:
def test_24_2()
  sample_dataset = create_sample_dataset()
  transform = TargetAveraging.new %w(name_family_status)
  transform.train sample_dataset
  
  example1 = {"features" => {"name_family_status" => "Civil marriage"}}
  batch = [example1]
  
  puts "Before target averaging", batch
  transform.apply batch
  puts "After target averaging", batch
  
  assert_in_delta(0.09384164, example1["features"]["avg_name_family_status"], 1e-3, "Fills in example 1")
  assert_false(example1["features"].has_key?("name_family_status"), "Removes previous feature name")
end

test_24_2()

Before target averaging
[{"features"=>{"name_family_status"=>"Civil marriage"}}]
After target averaging
[{"features"=>{"avg_name_family_status"=>0.09384164222873895}}]


In [847]:
def test_24_3()
  sample_dataset = create_sample_dataset()
  transform = TargetAveraging.new %w(name_family_status)
  transform.train sample_dataset
  
  example2 = {"features" => {"name_family_status" => "Separated", "ext_source_1" => 0.212}}
  
  batch = [example2]
  
  puts "Before target averaging", batch
  transform.apply batch
  puts "After target averaging", batch
  
  assert_in_delta(0.07191011, example2["features"]["avg_name_family_status"], 1e-3, "Fills in example 2")
  assert_in_delta(0.212, example2["features"]["ext_source_1"], 1e-3, "Does not alter other features")
end

test_24_3()

Before target averaging
[{"features"=>{"name_family_status"=>"Separated", "ext_source_1"=>0.212}}]
After target averaging
[{"features"=>{"ext_source_1"=>0.212, "avg_name_family_status"=>0.07191011235955051}}]


## Question 2.5 (10 points)

Implement one-hot encoding. Given an array of categorical feature names, introduce new features for each possible value. Each new feature should have a value of 1. Do not add any features for missing features or for values which are not present in the dataset. There is no separate ```train``` step.

Example:

```
Before one hot encoding
[{"features"=>{"name_family_status"=>"Civil marriage"}}]
After one hot encoding
[{"features"=>{"name_family_status=Civil marriage"=>1.0}}]
```

Notes:
1. Examples are altered in place and do not change anything outside the provided list
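As a small standalone illustration of the naming scheme, the sketch below encodes one feature of a single feature hash. The helper `one_hot!` is hypothetical and handles one feature at a time; it leaves the hash untouched when the value is missing or not a string.

```ruby
# Hypothetical helper: one-hot encode a single categorical feature in place.
def one_hot!(features, name)
  value = features[name]
  return features unless value.is_a?(String)  # skip missing or non-categorical values
  features.delete(name)
  features["%s=%s" % [name, value]] = 1.0
  features
end

puts one_hot!({"name_family_status" => "Widow", "ext_source_1" => 0.2}, "name_family_status").inspect
```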

In [25]:
class OneHotEncoding
  def initialize feature_names
    @feature_names = feature_names
    @pattern = "%s=%s"
  end
  
  def train dataset; end
  
  def apply(example_batch)
    

    example_batch.clone.each do |item|
      @feature_names.each do |feature|
        if (item["features"][feature].is_a? (String))
          new_key=feature+"="+item["features"][feature]
          item["features"][new_key] = 1.0
          item["features"].delete(feature)
        end
      end
    end
    
    return example_batch
  end
end

:apply

In [26]:
def test_25_1()
  sample_dataset = create_sample_dataset()
  lookup = OneHotEncoding.new %w(name_family_status)
  
  example1 = {"features" => {"name_family_status" => "Civil marriage"}}
  
  batch = [example1]
  
  puts "Before one hot encoding", batch
  lookup.apply batch
  puts "After one hot encoding", batch
  
  assert_in_delta(1.0, example1["features"]["name_family_status=Civil marriage"], 1e-3, "Fills in example 1")
  assert_false(example1["features"].has_key?("name_family_status"), "Removes previous feature name")
  assert_false(example1["features"].has_key?("name_family_status=Separated"), "Encodes only one value")
end

test_25_1()

Before one hot encoding
[{"features"=>{"name_family_status"=>"Civil marriage"}}]
After one hot encoding
[{"features"=>{"name_family_status=Civil marriage"=>1.0}}]


In [850]:
def test_25_2()
  sample_dataset = create_sample_dataset()
  lookup = OneHotEncoding.new %w(name_family_status)
  
  example2 = {"features" => {"name_family_status" => "Separated", "ext_source_1" => 0.212}}
  
  batch = [example2]
  
  puts "Before one hot encoding", batch
  lookup.apply batch
  puts "After one hot encoding", batch
  
  assert_in_delta(1.0, example2["features"]["name_family_status=Separated"], 1e-3, "Fills in example 2")
  assert_in_delta(0.212, example2["features"]["ext_source_1"], 1e-3, "Does not alter other features")
end

test_25_2()

Before one hot encoding
[{"features"=>{"name_family_status"=>"Separated", "ext_source_1"=>0.212}}]
After one hot encoding
[{"features"=>{"ext_source_1"=>0.212, "name_family_status=Separated"=>1.0}}]


## Question 2.6 (10 points)

Implement the logarithm transform for a numeric feature, one of the most common transforms. There are two edge cases: the feature value should be neither zero nor negative. Here we apply ```Math.log```, the natural logarithm, and the ```apply``` method is not required to handle zero or negative values. Skip any example with a missing value. The transformation adds a new feature named ```log_%s``` to the example and removes the old feature.

Example:
```
Before log transform
[{"features"=>{"abc"=>1000.0}}]
After log transform
[{"features"=>{"log_abc"=>6.907755278982137}}]
```

Notes: 
1. Examples are changed in place and do not alter any feature not in the list.

In [851]:
class LogTransform
  def initialize feature_names
    @feature_names = feature_names
    @pattern = "log_%s"
  end
  
  def train dataset; end
  
  def apply(example_batch)
    
    example_batch.clone.each do |item|
      @feature_names.each do |feature|

        if (item["features"][feature].is_a? (Numeric)) and item["features"][feature]>0
          new_key="log_"+feature
          item["features"][new_key] = Math.log(item["features"][feature])
          item["features"].delete(feature)
        end
      end
    end
    return example_batch
  end
end

:apply

In [852]:
def test_26_1()
  sample_dataset = create_sample_dataset()
  transform = LogTransform.new %w(abc)
  
  example1 = {"features" => {"abc" => 1000.0}}  
  batch = [example1]
  
  puts "Before log transform", batch
  transform.apply batch
  puts "After log transform", batch
  
  assert_in_delta(6.9077, example1["features"]["log_abc"], 1e-3, "Fills in example 1")
  assert_false(example1["features"].has_key?("abc"), "Removes previous feature name")
end

test_26_1()

Before log transform
[{"features"=>{"abc"=>1000.0}}]
After log transform
[{"features"=>{"log_abc"=>6.907755278982137}}]


In [853]:
def test_26_2()
  sample_dataset = create_sample_dataset()
  transform = LogTransform.new %w(abc)
  
  example2 = {"features" => {"abc" => Math.exp(-1), "ext_source_1" => 0.212}}  
  batch = [example2]
  
  puts "Before log transform", batch
  transform.apply batch
  puts "After log transform", batch
  
  assert_in_delta(-1, example2["features"]["log_abc"], 1e-3, "Fills in example 2")
  assert_in_delta(0.212, example2["features"]["ext_source_1"], 1e-3, "Does not alter other features")
end

test_26_2()

Before log transform
[{"features"=>{"abc"=>0.36787944117144233, "ext_source_1"=>0.212}}]
After log transform
[{"features"=>{"ext_source_1"=>0.212, "log_abc"=>-1.0}}]


## Question 2.7 (10 points)

Implement $L_2$ normalization, which transforms all the numeric features in an example into a unit vector. First, we will reuse our ```dot``` and ```norm``` methods from previous assignments. 

In [854]:
def dot x, w
  # Sum the products over the keys the two sparse vectors share
  sum=0
  x.each do |key, value|
    sum+=value*w[key] if w.has_key?(key)
  end
  return sum
end

def norm w
  return Math.sqrt(dot(w,w))
end

:norm

In [855]:
### Hidden test ###


Next, we will implement the ```apply``` method. Skip any feature which is not numeric i.e., ```not x.is_a? Numeric``` in ruby.

Example:

```
Before transform
[{"features"=>{"abc"=>1.0, "bcd"=>-1.0}}]
After transform
[{"features"=>{"abc"=>0.7071067811865475, "bcd"=>-0.7071067811865475}}]
```
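The arithmetic behind this example can be sketched as follows. This is only an illustration: the helper name `l2_unit_vector` is hypothetical and not part of the assignment library, and unlike ```apply``` it returns a new hash instead of altering the example in place. The norm is computed over the numeric values only ($\sqrt{1^2 + (-1)^2} = \sqrt{2}$ here), and each numeric value is divided by it.

```ruby
# Hypothetical helper: returns a new feature hash scaled to unit L2 norm.
# Non-numeric values are left untouched and excluded from the norm.
def l2_unit_vector(features)
  numeric = features.select { |_, v| v.is_a? Numeric }
  length = Math.sqrt(numeric.values.inject(0.0) { |s, v| s + v * v })
  return features.dup if length.zero? # assumption: leave all-zero vectors alone

  features.map { |k, v| [k, v.is_a?(Numeric) ? v / length : v] }.to_h
end

unit = l2_unit_vector({"abc" => 1.0, "bcd" => -1.0, "name" => "STRING"})
puts unit
```

Returning early for a zero-length vector is a design choice of this sketch; it avoids a division by zero, which the assignment tests do not exercise.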

Notes:


In [856]:
class L2Normalize
  def train dataset; end
  def apply(example_batch)
    example_batch.each do |example|
      features = example["features"]
      # Compute the L2 norm over the numeric features only
      numeric = features.select {|key, value| value.is_a? Numeric}
      length = Math.sqrt(numeric.values.inject(0.0) {|sum, value| sum + value**2})
      # Rescale in place; non-numeric features are left untouched
      numeric.each_key {|key| features[key] = features[key].to_f / length}
    end

    return example_batch
  end
end

:apply

In [857]:
def test_27_2()
  sample_dataset = create_sample_dataset()
  transform = L2Normalize.new
  
  example1 = {"features" => {"abc" => 1.0, "bcd" => -1.0}}
  batch = [example1]
  
  puts "Before transform", batch
  transform.apply batch
  puts "After transform", batch
  
  assert_in_delta(0.707, example1["features"]["abc"], 1e-3, "Fills in example 1")
  assert_in_delta(-0.707, example1["features"]["bcd"], 1e-3, "Fills in example 1 bcd")
end

test_27_2()

Before transform
[{"features"=>{"abc"=>1.0, "bcd"=>-1.0}}]
After transform
[{"features"=>{"abc"=>0.7071067811865475, "bcd"=>-0.7071067811865475}}]


In [858]:
def test_27_3()
  sample_dataset = create_sample_dataset()
  transform = L2Normalize.new
  
  example1 = {"features" => {"abc" => 1.0, "bcd" => -1.0}}
  example2 = {"features" => {"cdef" => -3.0, "efg" => 4.0}}
  
  batch = [example1, example2]
  
  puts "Before transform", batch
  transform.apply batch
  puts "After transform", batch
  
  assert_in_delta(0.707, example1["features"]["abc"], 1e-3, "Fills in example 1")
  assert_in_delta(-0.707, example1["features"]["bcd"], 1e-3, "Fills in example 1 bcd")
  assert_in_delta(-0.6, example2["features"]["cdef"], 1e-3, "Fills in example 2")
  assert_in_delta(0.8, example2["features"]["efg"], 1e-3, "Fills in example 2 efg")
end

test_27_3()

Before transform
[{"features"=>{"abc"=>1.0, "bcd"=>-1.0}}, {"features"=>{"cdef"=>-3.0, "efg"=>4.0}}]
After transform
[{"features"=>{"abc"=>0.7071067811865475, "bcd"=>-0.7071067811865475}}, {"features"=>{"cdef"=>-0.6, "efg"=>0.8}}]


In [859]:
def test_27_4()
  sample_dataset = create_sample_dataset()
  transform = L2Normalize.new
  
  example1 = {"features" => {"abc" => 1.0, "bcd" => -1.0, "string_feature" => "STRING"}}
  example2 = {"features" => {"cdef" => -3.0, "efg" => 4.0}}
  
  batch = [example1, example2]
  
  puts "Before transform", batch
  transform.apply batch
  puts "After transform", batch
  
  assert_equal("STRING", example1["features"]["string_feature"], "Ignores string features")
end

test_27_4()

Before transform
[{"features"=>{"abc"=>1.0, "bcd"=>-1.0, "string_feature"=>"STRING"}}, {"features"=>{"cdef"=>-3.0, "efg"=>4.0}}]
After transform
[{"features"=>{"abc"=>0.7071067811865475, "bcd"=>-0.7071067811865475, "string_feature"=>"STRING"}}, {"features"=>{"cdef"=>-0.6, "efg"=>0.8}}]


## Question 2.8 (10 points)

Implement downsampling, where we will filter out examples belonging to the negative class (```label <= 0```) according to a provided probability. Rather than calculating a precise sampling rate as we do below, we commonly provide a nice round number like 10%. Therefore, we will not implement the ```train``` method. Instead, we will update the ```@sampling_rate``` parameter in a separate function, ```update_sampling_rate```.

Notes:
1. The sampling rate is **not** the class prior. It is the class ratio.
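
To make the distinction in the note concrete, here is a minimal sketch on made-up labels (the data and variable names are assumptions for illustration only). The class prior divides the positives by all examples, while the class ratio divides the positives by the negatives.

```ruby
# Toy labels (made up for illustration): 2 positives, 8 negatives.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

pos = labels.count { |l| l > 0 }
neg = labels.size - pos

class_prior = pos.to_f / labels.size # P(label > 0) = 2 / 10
class_ratio = pos.to_f / neg         # positives per negative = 2 / 8

puts "prior = #{class_prior}, ratio = #{class_ratio}"
# prints "prior = 0.2, ratio = 0.25"
```

Keeping each negative with probability equal to the class ratio leaves, in expectation, about as many negatives as positives (8 × 0.25 = 2 in this toy case).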

In [860]:
class DownsampleNegatives
  attr_reader :sampling_rate
  def initialize sampling_rate
    @sampling_rate = sampling_rate
  end
  
  def train dataset; end
  
  def update_sampling_rate dataset
    
    pos=0
    neg=0
    dataset["data"].each do |item|
      if item["label"]>0 
        pos+=1
      else
        neg+=1
      end
    end

    @sampling_rate=pos.to_f/neg.to_f
    
  end
end

:update_sampling_rate

In [861]:
def test_28_1()
  sample_dataset = create_sample_dataset()
  transform = DownsampleNegatives.new 0.0   
  assert_equal 0.0, transform.sampling_rate
  
  transform.update_sampling_rate sample_dataset  
  assert_in_delta(0.08212481668567705, transform.sampling_rate, 1e-3, "Calculate the class ratio, not the class prior")
end

test_28_1()

Next, we will implement the ```can_keep?``` method, which returns a Boolean value indicating whether we can keep the example. To maintain consistent filtering for all students on this dataset, we will use deterministic sampling. Because every example in the dataset has a unique ID which does not depend on the data, we can use this ID to calculate the probability of keeping the example. Although less random, this is a very common practice in industry.


Notes:
1. Filters examples in place using the ```select!``` method in ruby.
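
The idea of deterministic sampling can be sketched as follows. The helper name `id_to_unit_interval` and the salt value are hypothetical, chosen for this illustration only: hashing the ID together with a fixed salt gives a repeatable number in $[0, 1)$, which is then compared with the sampling rate.

```ruby
require 'digest'

# Hypothetical helper: maps an ID to a repeatable pseudo-random number in [0, 1).
# The salt is an arbitrary illustrative value, not the one used in the assignment.
def id_to_unit_interval(id, salt = "example-salt")
  digest = Digest::MD5.hexdigest(id.to_s + salt)
  (digest.to_i(16) % 100_000) / 100_000.0
end

rate = 0.1
ids = (1..1000).to_a
kept = ids.select { |id| id_to_unit_interval(id) <= rate }

# The same IDs are kept on every run, and roughly rate * ids.size survive.
puts kept.size
```

Because the decision depends only on the ID and the salt, rerunning the filter, or running it on another machine, selects exactly the same examples.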

In [889]:
require 'digest'

class DownsampleNegatives
  def hashprob id
    salt = "eifjcchdivlbreckvgndlvkgdtdjnbcnjldelrgefcgt"
    (Digest::MD5.hexdigest(id.to_s + salt).to_i(16) % 100000).abs / 100000.0
  end
  
  def can_keep? example
    can_keep = true

    if example["label"]>0
      return true
    end
    
    if hashprob(example["id"])>@sampling_rate 
      return false
    end
    return can_keep
  end

  def apply(example_batch)
    # select! returns nil when nothing is removed, so return the batch explicitly
    example_batch.select! {|example| can_keep? example}
    return example_batch
  end
end

:apply

In [890]:
def test_28_2()
  sample_dataset = create_sample_dataset()
  transform = DownsampleNegatives.new 0.0
  
  transform.update_sampling_rate sample_dataset
  
  example_to_keep = {"id" => 3, "label" => 0, "features" => {"abc" => 1.0}}
  example_to_filter = {"id" => 0, "label" => 0, "features" => {"abc" => 1.0}}
  example_pos = {"id" => 1, "label" => 1, "features" => {"abc" => 1.0}}  
  
  assert_true transform.can_keep?(example_to_keep), "Keep example based on ID and label"
  assert_false transform.can_keep?(example_to_filter), "Filter out example based on ID and label"
  assert_true transform.can_keep?(example_pos), "Keep all positive examples"
end

test_28_2()

## Question 2.9 (10 points)

Because we implemented feature transforms using a pipeline pattern, we can chain feature transforms together into a pipeline. The ```FeatureTransformPipeline``` is itself a ```FeatureTransformer```, which supports ```train``` and ```apply```. In the ```train``` method, we will simply call ```train``` and then ```apply``` on each transformer, in order, on the dataset. 

Because each transformer is trained and then applied before the next one is trained, notice that the features will change in the middle of the pipeline. So, we can apply multiple transforms and expect each one to build on the output of those before it. Because we are altering examples in place, the transforms we add should otherwise affect disjoint feature spaces.

Example:

```
Before transform
[{"features"=>{"ext_source_1"=>0.7, "ext_source_2"=>0.2, "amt_income_total"=>1000.0}}]
After transform
[{"features"=>{"ext_source_1"=>0.1326016448876392, "ext_source_2"=>-0.24739172298070694, "log_amt_income_total"=>0.9597990097795109}}]

```
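
The train-then-apply chaining described above can be sketched with two toy transformers. `ToyLog` and `ToyCenter` are hypothetical classes written for this illustration; they are not the transformers from the assignment library, and the dataset is made up.

```ruby
# Hypothetical toy transformers; not the ones from the assignment library.
class ToyLog
  def initialize(name); @name = name; end
  def train(dataset); end # nothing to learn
  def apply(batch)
    batch.each do |ex|
      f = ex["features"]
      f["log_#{@name}"] = Math.log(f.delete(@name)) if f.key?(@name)
    end
    batch
  end
end

class ToyCenter
  def initialize(name); @name = name; @mean = 0.0; end
  def train(dataset)
    values = dataset["data"].map { |ex| ex["features"][@name] }.compact
    @mean = values.inject(0.0) { |s, v| s + v } / values.size unless values.empty?
  end
  def apply(batch)
    batch.each do |ex|
      f = ex["features"]
      f[@name] -= @mean if f.key?(@name)
    end
    batch
  end
end

dataset = {"data" => [{"features" => {"x" => 1.0}},
                      {"features" => {"x" => Math.exp(2)}}]}

# Train and apply each stage in order, so ToyCenter sees "log_x", not "x".
stages = [ToyLog.new("x"), ToyCenter.new("log_x")]
stages.each do |stage|
  stage.train dataset
  stage.apply dataset["data"]
end

puts dataset["data"]
```

If `ToyCenter` were trained before `ToyLog` had been applied, the feature `log_x` would not exist yet and no mean would be learned, which is why each stage is applied to the training data before the next stage is trained.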

Notes:


class FeatureTransformPipeline
  def initialize *transformers
    @transformers = transformers
  end
  
  def train dataset
    puts "beginnong of pipelline"
    @transformers.each do |transform|
      # Train each stage on the output of the stages before it,
      # then apply it so that later stages see the transformed features
      transform.train dataset
      transform.apply dataset["data"]
    end
    puts "end of pipelline"
  end
  
  def apply example_batch
    # Transforms alter the batch in place, so run them in order and
    # return the batch itself rather than relying on each return value
    @transformers.each {|transform| transform.apply example_batch}
    return example_batch
  end
end

In [892]:
def test_29_1()
  sample_dataset = create_sample_dataset()
  transform = FeatureTransformPipeline.new(
    ZScoreTransformer.new(%w(ext_source_1 ext_source_2 ext_source_3)),
    LogTransform.new(%w(amt_income_total)),
    L2Normalize.new
  )
  
  transform.train sample_dataset
  
  example1 = {"features" => {"ext_source_1" => 0.7, "ext_source_2" => 0.2, "amt_income_total" => 1000.0}}
  example2 = {"features" => {"ext_source_1" => 0.3, "amt_income_total" => 100000.0}}
  
  batch = [example1, example2]
  
  puts "Before transform", batch
  transform.apply batch
  puts "After transform", batch
  
  assert_in_delta(0.1326016448876392, example1["features"]["ext_source_1"], 1e-3, "Fills in example 1")
  assert_in_delta(0.9597990097795109, example1["features"]["log_amt_income_total"], 1e-3, "Fills in example 1 log_amt_income_total")
  assert_in_delta(-0.08076041093856547, example2["features"]["ext_source_1"], 1e-3, "Fills in example 2")
end

test_29_1()



beginnong of pipelline
end of pipelline
Before transform
[{"features"=>{"ext_source_1"=>0.7, "ext_source_2"=>0.2, "amt_income_total"=>1000.0}}, {"features"=>{"ext_source_1"=>0.3, "amt_income_total"=>100000.0}}]
After transform
[{"features"=>{"ext_source_1"=>0.1326016448876392, "ext_source_2"=>-0.24739172298070694, "log_amt_income_total"=>0.9597990097795109}}, {"features"=>{"ext_source_1"=>-0.08076041093856547, "log_amt_income_total"=>0.9967335431423154}}]


Now, we are ready to build a real ML data processing pipeline. The sample ```feature_transform_pipeline_29``` applies basic feature transforms. Other than testing all the code written here, little thought was put into the composition of the pipeline, so you should design your own in the next part of the final project. The result of this pipeline is a fully numeric example that can be used directly in models.  

Note that we are not adding ```DownsampleNegatives``` in this pipeline. Special care should be used to apply downsampling only in training. 

In [893]:
def feature_transform_pipeline_29
  FeatureTransformPipeline.new(
    #ext_source
    ZScoreTransformer.new(%w(ext_source_1 ext_source_2 ext_source_3)),
    MeanImputation.new(%w(ext_source_2 ext_source_3)),
    
    #Treat amt_income_total and amt_credit as log normal
    LogTransform.new(%w(amt_income_total amt_credit)),
    ZScoreTransformer.new(%w(log_amt_income_total log_amt_credit)),
      
    #Imputation for commonarea_avg
    MeanImputation.new(%w(commonarea_avg)),
      
    #One-hot encoded features
    AgeRangeAsVector.new,      
    OneHotEncoding.new(%w(name_family_status code_gender)),
      
    #Target Averages
    TargetAveraging.new(%w(name_income_type flag_own_car flag_own_realty
      name_family_status organization_type name_housing_type name_education_type)),      
    L2Normalize.new
  )
end

:feature_transform_pipeline_29

In [894]:
def test_29_2()
  sample_dataset = create_sample_dataset()
  transform = feature_transform_pipeline_29()
  transform.train sample_dataset
  
  
  example1 = {"features" => {"ext_source_1" => 0.7, "ext_source_2" => 0.2, "amt_income_total" => 1000.0}}
  example2 = {"features" => {"ext_source_1" => 0.3, "amt_income_total" => 100000.0}}
  
  batch = [example1, example2]
  
  puts "Before transform", batch
  transform.apply batch
  puts "After transform", batch
  
  assert_in_delta(0.08875317278315571, example1["features"]["ext_source_1"], 1e-3, "Fills in example 1")
  assert_in_delta(-0.9821854258866999, example1["features"]["log_amt_income_total"], 1e-3, "Fills in example 1 log_amt_income_total")
  assert_in_delta(0.004047066576571951, example1["features"]["commonarea_avg"], 1e-3, "Fills in example 1 commonarea_avg")
  assert_in_delta(-0.7081029266295138, example2["features"]["ext_source_1"], 1e-3, "Fills in example 2")
end

test_29_2()



beginnong of pipelline
means in apply
{"ext_source_2"=>-6.0906627612613726e-15, "ext_source_3"=>-6.192312568763135e-15}
means in apply
{"commonarea_avg"=>0.04351730769230769}
end of pipelline
Before transform
[{"features"=>{"ext_source_1"=>0.7, "ext_source_2"=>0.2, "amt_income_total"=>1000.0}}, {"features"=>{"ext_source_1"=>0.3, "amt_income_total"=>100000.0}}]
means in apply
{"ext_source_2"=>-6.0906627612613726e-15, "ext_source_3"=>-6.192312568763135e-15}
means in apply
{"commonarea_avg"=>0.04351730769230769}
After transform
[{"features"=>{"ext_source_1"=>0.08875317278315571, "ext_source_2"=>-0.16558467546488206, "ext_source_3"=>-5.758789446700427e-16, "log_amt_income_total"=>-0.9821854258866999, "commonarea_avg"=>0.004047066576571951}}, {"features"=>{"ext_source_1"=>-0.7081029266295138, "ext_source_2"=>-4.623339689395529e-15, "ext_source_3"=>-4.700500682847904e-15, "log_amt_income_total"=>-0.7053361183296555, "commonarea_avg"=>0.03303339943711086}}]


Now we will apply downsampling, which is used only during training. Notice the pattern we are using here:

1. Create a ```feature_pipeline``` containing the feature transformations (no sampling) you want to use.
1. Train this pipeline on your dataset
1. Create a new dataset, which is going to be your real training dataset.
1. Create a ```training_pipeline``` containing the pre-trained ```feature_pipeline``` and the sampling.
1. Now only ```apply``` the training pipeline to your dataset.

This pattern is useful in practice because we may use a small sample to test out features but want to apply all the transforms at the same time as running learning algorithms on a new dataset.
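
The five steps above can be sketched with toy stand-ins. `ToyScale` and `ToyDropNegatives` are hypothetical classes for illustration only, the data is made up, and the even-ID rule merely stands in for the hash-based sampling.

```ruby
# Hypothetical stand-ins for a feature pipeline and a sampler.
class ToyScale
  def train(dataset)
    @max = dataset["data"].map { |ex| ex["features"]["x"].abs }.max
  end
  def apply(batch)
    batch.each { |ex| ex["features"]["x"] /= @max }
    batch
  end
end

class ToyDropNegatives
  def train(dataset); end
  def apply(batch)
    # Toy rule: keep positives, and negatives with an even ID.
    batch.select! { |ex| ex["label"] > 0 || ex["id"].even? }
    batch
  end
end

sample = {"data" => [{"id" => 1, "label" => 0, "features" => {"x" => 4.0}}]}
features = ToyScale.new
features.train sample             # steps 1-2: train on the small sample

training = [                      # step 3: a fresh training dataset
  {"id" => 2, "label" => 0, "features" => {"x" => 2.0}},
  {"id" => 3, "label" => 0, "features" => {"x" => 1.0}},
  {"id" => 5, "label" => 1, "features" => {"x" => 8.0}}
]

# Steps 4-5: only apply; nothing is re-trained on the new data.
[features, ToyDropNegatives.new].each { |stage| stage.apply training }
puts training
```

The scale learned from the sample (4.0) is reused unchanged on the new data, and the sampler runs last so that it only ever touches training batches.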

In [896]:
def test_28_3()
  sample_dataset = create_sample_dataset()  
  feature_pipeline = FeatureTransformPipeline.new(feature_transform_pipeline_29())
  feature_pipeline.train sample_dataset
  
  downsample = DownsampleNegatives.new(0.0)
  downsample.update_sampling_rate sample_dataset

  training_dataset = create_sample_dataset()
  training_pipeline = FeatureTransformPipeline.new(feature_pipeline, downsample)
  batch = training_dataset["data"][0..199]
  
  puts "Before transform", batch[0]
  sampled_batch = training_pipeline.apply batch
  puts "After transform", sampled_batch[0]
  
  assert_equal 33, sampled_batch.size, "Downsampling should remove several examples"
  assert_true(sampled_batch.all? {|e| e["features"].size > 5}, "At least 5 features")
  assert_true(sampled_batch[0]["features"].values.all? {|v| v.is_a? Numeric}, "All features should be numeric")
  assert_in_delta(1.0, norm(sampled_batch[0]["features"]), 1e-3, "Examples should be normalized to norm = 1")
end

test_28_3()

beginnong of pipelline
beginnong of pipelline
means in apply
{"ext_source_2"=>-6.0906627612613726e-15, "ext_source_3"=>-6.192312568763135e-15}
means in apply
{"commonarea_avg"=>0.04351730769230769}
end of pipelline
means in apply
{"ext_source_2"=>-6.0906627612613726e-15, "ext_source_3"=>-6.192312568763135e-15}
means in apply
{"commonarea_avg"=>0.04351730769230769}
end of pipelline
Before transform
{"label"=>1, "id"=>100002, "features"=>{"ext_source_1"=>0.08303696739132256, "ext_source_2"=>0.2629485927471776, "ext_source_3"=>0.13937578009978951, "amt_income_total"=>202500, "amt_credit"=>406597.5, "commonarea_avg"=>0.0143, "flag_own_car"=>"N", "flag_mobil"=>1, "days_birth"=>-9461, "organization_type"=>"Business Entity Type 3", "code_gender"=>"M", "flag_own_realty"=>"Y", "flag_emp_phone"=>1, "name_education_type"=>"Secondary / secondary special", "name_income_type"=>"Working", "name_family_status"=>"Single / not married", "name_housing_type"=>"House / apartment"}}
means in apply
{"ext_sou

## Question 3.1 (2 points)

Use information gain to calculate the value of some features. You are free to explore how feature transforms affect information gain. 

First, copy all **your** implementations of these functions from previous assignments. We are only applying them to the dataset here. We will skip any missing features in the tests. 
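
As a reminder of what entropy measures, here is a minimal sketch. The name `toy_entropy` is hypothetical, chosen to avoid clashing with your own ```entropy```, and the sketch assumes the natural logarithm, which is what the ```entropy``` cell in this notebook uses.

```ruby
# Hypothetical helper: entropy (natural log) of a hash of class => probability.
# Zero-probability classes contribute nothing and are skipped.
def toy_entropy(dist)
  dist.values.reject { |prob| prob.zero? }
      .inject(0.0) { |h, prob| h - prob * Math.log(prob) }
end

puts toy_entropy({0 => 0.5, 1 => 0.5}) # ln 2, about 0.693
puts toy_entropy({0 => 1.0, 1 => 0.0}) # a pure node has zero entropy
```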

In [897]:
def class_distribution dataset
  classes = Hash.new {|h,k| h[k] = 0}
  dataset.each do |key,array|
    classes[key["label"]]=1+classes[key["label"]]
  end
  
  result={}
  classes.each do |key,array|
    result[key]=array.to_f/dataset.size.to_f
  end
  
  return result
end

:class_distribution

In [918]:
def entropy dist
  ent=0
  dist.each do |key,array|
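    # A zero-probability class contributes nothing, so skip it rather than
    # returning early and discarding the other classes' terms
    next if array==0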
    ent+=-array*Math.log(array)
  end
  
  ent
end

:entropy

In [919]:
def test_31_1()
  # Check the entropy of the class distribution in the sample dataset
  dataset = create_sample_dataset()
  dist = class_distribution dataset["data"]
  h0 = entropy dist
  assert_in_delta(0.2686201883261589, h0, 1e-3)
end

test_31_1()

## Question 3.2 (2 points)

Reuse **your** implementation of information gain for categorical features and apply it to some of the features in this dataset.
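
For intuition, information gain is the parent entropy minus the size-weighted entropy of the splits. The sketch below uses hypothetical `toy_` helpers and made-up rows so that it does not replace your own functions.

```ruby
# Hypothetical helpers with toy_ prefixes so they do not replace your own code.
def toy_entropy_of(examples)
  counts = examples.group_by { |ex| ex["label"] }.values.map(&:size)
  counts.inject(0.0) do |h, c|
    prob = c.to_f / examples.size
    h - prob * Math.log(prob)
  end
end

def toy_information_gain(examples, fname)
  splits = examples.group_by { |ex| ex["features"][fname] }
  remainder = splits.values.inject(0.0) do |sum, part|
    sum + part.size.to_f / examples.size * toy_entropy_of(part)
  end
  toy_entropy_of(examples) - remainder
end

# Made-up rows: "car" separates the labels perfectly, "pet" not at all.
rows = [
  {"label" => 1, "features" => {"car" => "Y", "pet" => "Y"}},
  {"label" => 1, "features" => {"car" => "Y", "pet" => "N"}},
  {"label" => 0, "features" => {"car" => "N", "pet" => "Y"}},
  {"label" => 0, "features" => {"car" => "N", "pet" => "N"}}
]

puts toy_information_gain(rows, "car") # equals the parent entropy, ln 2
puts toy_information_gain(rows, "pet") # zero: the split tells us nothing
```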

In [920]:
def information_gain h0, splits
  size = Hash.new {|h,k| h[k] = 0}
  sum = Hash.new {|h,k| h[k] = 0}
  total=0
  
  splits.each do |key, array|
    total+=array.size
  end
  
  splits.each do |key, array|
    sum[key]+=entropy(class_distribution(array))
    size[key]=array.size
  end
  
  result=0
  
  size.each do |key,array|
    size[key]=size[key].to_f/total
  end
  
  splits.each do |key, array|
    result+=sum[key]*size[key]
  end

  return h0-result
end

:information_gain

In [921]:
def test_32_1()
  # Check the information gain from splitting on flag_own_car
  dataset = create_dataset $dev_db, "select target, sk_id_curr, ext_source_1, flag_own_car from application_train where ext_source_1 <> ''"
  examples = dataset["data"]
  dist = class_distribution examples
  h0 = entropy dist
  
  splits = examples.group_by {|row| row["features"]["flag_own_car"]}
  ig = information_gain h0, splits
  assert_in_delta(0.0002206258541794237, ig, 1e-4)
end

test_32_1()

## Question 3.3 (2 points)

Reuse **your** implementation to find the best split point by information gain. Depending on how you implemented it, you may want to make it faster. This should finish within 30s. 
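
One common way to make the search faster is to sort the examples once and sweep through them while maintaining running label counts, so every candidate threshold is evaluated without re-splitting the data. The sketch below is one possible approach under stated assumptions, not the reference solution: the `toy_` helpers are hypothetical, the split convention is "value < threshold" versus "value >= threshold", examples missing the feature are skipped, and the natural logarithm is used.

```ruby
# Hypothetical helper: entropy (natural log) from a hash of label => count.
def toy_count_entropy(counts, total)
  counts.values.inject(0.0) do |h, c|
    c.zero? ? h : h - (c.to_f / total) * Math.log(c.to_f / total)
  end
end

# Hypothetical sketch: returns [threshold, information_gain].
def toy_best_split(examples, fname)
  rows = examples.select { |ex| ex["features"][fname].is_a? Numeric }
                 .sort_by { |ex| ex["features"][fname] }
  right = Hash.new(0)
  rows.each { |ex| right[ex["label"]] += 1 }
  left = Hash.new(0)
  n = rows.size
  h0 = toy_count_entropy(right, n)

  best = [nil, 0.0]
  rows.each_with_index do |ex, i|
    value = ex["features"][fname]
    # A boundary exists only where the sorted value changes.
    if i > 0 && value != rows[i - 1]["features"][fname]
      remainder = (i.to_f / n) * toy_count_entropy(left, i) +
                  ((n - i).to_f / n) * toy_count_entropy(right, n - i)
      best = [value, h0 - remainder] if h0 - remainder > best[1]
    end
    # Move the current example from the right side to the left side.
    left[ex["label"]] += 1
    right[ex["label"]] -= 1
  end
  best
end

rows = [[0.1, 0], [0.2, 0], [0.3, 0], [0.8, 1], [0.9, 1]].map do |v, label|
  {"label" => label, "features" => {"x" => v}}
end
threshold, gain = toy_best_split(rows, "x")
puts "threshold = #{threshold}, gain = #{gain}"
```

After the initial sort, each candidate costs time proportional to the number of classes rather than the number of examples. Whether the thresholds and tie handling match the values expected by the tests depends on the convention used in your earlier assignment.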

In [935]:
def find_split_point_numeric x, h0, fname
  
  x.each do |item|
    if !item["features"][fname]
      item["features"][fname]=0
    end
  end
  
  sorted_values = x.collect {|r| r["features"][fname]}.uniq.sort
  min_sl = sorted_values.first
  max_sl = sorted_values.last
  
  ig_max=-1
  t_max=-1
  
  threshold = []
  iG = []
  h0 = entropy(class_distribution(x))
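  # Only every 20th candidate threshold is evaluated to save time, so the best
  # split can be missed; the exclusive range keeps the index within bounds
  (0...sorted_values.size).step(20) do |t|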
    threshold << sorted_values[t]
    iG << information_gain(h0, split_on_numeric_value(x, fname, sorted_values[t]))
    if iG.last>ig_max
      ig_max=iG.last
      t_max=threshold.last
    end
  end
  puts [t_max, ig_max]
  return [t_max, ig_max]
end


def split_on_numeric_value x, k, v
  splits={"less than"=>[], "more than"=>[]}

  x.each do |item|
    # Treat a missing feature as 0, as find_split_point_numeric does
    y = item["features"][k] || 0
    # Append in place; building a new array with += for every item is slow
    splits["less than"] << item if y<v
    splits["more than"] << item if y>=v
  end
  
  splits
end


:split_on_numeric_value

In [None]:
def test_33_1()
  # Check the information gain at the best split point for ext_source_1
  dataset = create_sample_dataset()
  examples = dataset["data"]
  dist = class_distribution examples
  h0 = entropy dist
  
  t, ig = find_split_point_numeric examples, h0, "ext_source_1"
  assert_in_delta(0.009751743140812785, ig, 1e-4)
end

test_33_1()

[0.4532045762368005, 0.009648282450843593]


Test::Unit::AssertionFailedError: <0.009751743140812785> -/+ <0.0001> was expected to include
<0.009648282450843593>.

Relation:
<<0.009648282450843593> < <0.009751743140812785>-<0.0001>[0.009651743140812786] <= <0.009751743140812785>+<0.0001>[0.009851743140812785]>