# Lecture 7: Animal encoding example from slides

## First, define prob_category()

### **prob_category()** is a general-purpose function that estimates, for one specific label, the probability that the target equals a given value.

-   ### Input:

    -   ### category = a specific label to encode

    -   ### data = a data frame

    -   ### encode_col = column name to be encoded in data

    -   ### target_col = column name of the target in data

    -   ### target_value = the target value to use for encoding

-   ### Returns: the number of rows that contain the category and whose target column entry equals target_value, divided by the total number of rows that have the category label.
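
The returned value can be written as a ratio of two counts. The symbols below are shorthand introduced here for illustration, not notation from the slides: $c$ stands for `category`, $v$ for `target_value`, $n_{c,v}$ for the number of rows whose `encode_col` entry is $c$ and whose `target_col` entry is $v$, and $n_c$ for the number of rows whose `encode_col` entry is $c$.

$$
\hat{p}_{c} = \frac{n_{c,v}}{n_{c}}
$$

When the target is coded as 0/1 and $v = 1$, this ratio is the same number as the mean of the target over the rows with label $c$.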



In [14]:
# define a prob_category() function to calculate the probability
# estimate for one specific label.
# Input: category= a specific label to encode
#        data= a dataframe
#        encode_col= column name to be encoded in data
#        target_col= column name of the target in data
#        target_value= the target value to use for encoding
# Returns: number of rows that contain category where target column
# is equal to target_value divided by the total number of rows that
# have the category label
prob_category <- function(category, data, encode_col,
                                    target_col, target_value){
   # cross-tabulate the column to encode against the target column,
   # using the `data` argument rather than a global data frame
   counts <- table(data[,c(encode_col,target_col)])

   # index by name: as.character() makes a numeric target_value such
   # as 1 select the column labelled "1", not the first column
   n_category_at_target_value <-
     counts[as.character(category),as.character(target_value)]

   n_category <- sum(counts[as.character(category),])

   return( n_category_at_target_value / n_category )
}


## Second, define the sample 'Animal' data

### Build the data frame for encoding the feature 'Animal' with 'Target'


In [15]:
# build the data frame

# define Animal data
Animal <- 
     as.factor(c("cat","hamster","cat","cat","dog","hamster","cat","dog","cat","dog"))

# define target data
Target <- c(1,0,0,1,1,1,0,1,0,0)

# build the sample data frame
df <- data.frame(Animal,Target)

# inspect it
print(df)


    Animal Target
1      cat      1
2  hamster      0
3      cat      0
4      cat      1
5      dog      1
6  hamster      1
7      cat      0
8      dog      1
9      cat      0
10     dog      0


In [16]:
str(df)

'data.frame':	10 obs. of  2 variables:
 $ Animal: Factor w/ 3 levels "cat","dog","hamster": 1 3 1 1 2 3 1 2 1 2
 $ Target: num  1 0 0 1 1 1 0 1 0 0
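
With the sample data in place, prob_category() can be tried on each level of 'Animal'. The block below is a minimal, self-contained sketch: it rebuilds the same data frame and uses a variant of prob_category() that reads from its `data` argument and indexes the contingency table by name. The names `p_cat` and `animal_probs` are introduced here for illustration only.

```r
# Sample data copied from the notebook
Animal <- as.factor(c("cat","hamster","cat","cat","dog","hamster","cat","dog","cat","dog"))
Target <- c(1,0,0,1,1,1,0,1,0,0)
df <- data.frame(Animal, Target)

# Variant of prob_category() (assumption: labels and target values are
# matched by their character representation)
prob_category <- function(category, data, encode_col, target_col, target_value) {
  counts <- table(data[, c(encode_col, target_col)])
  n_at_target <- counts[as.character(category), as.character(target_value)]
  n_category  <- sum(counts[as.character(category), ])
  n_at_target / n_category
}

# Probability estimate of Target == 1 for the label "cat" (2 of 5 rows)
p_cat <- prob_category("cat", df, "Animal", "Target", 1)
print(p_cat)

# The same estimate for every level of Animal, as a named vector
animal_probs <- sapply(levels(df$Animal), prob_category,
                       data = df, encode_col = "Animal",
                       target_col = "Target", target_value = 1)
print(animal_probs)
```

Because Target is coded as 0/1, these estimates should coincide with the per-category mean of Target.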


In [17]:
install.packages("dataPreparation")  # only needed once per R library
library(dataPreparation)

Installing package into ‘/home/jupyter-sabdu070/R/x86_64-pc-linux-gnu-library/4.3’
(as ‘lib’ is unspecified)



## Third, build the encoding table

### Use build_target_encoding() from dataPreparation to compute the mean and the sum of 'Target' for each 'Animal' category



In [18]:
target_encoding <- dataPreparation::build_target_encoding(df, cols_to_encode = "Animal",
                                         target_col = "Target", functions = c("mean", "sum"))

print(target_encoding)

[1] "build_target_encoding: Start to compute encoding for target_encoding according to col: Target."
$Animal
    Animal Target_mean_by_Animal Target_sum_by_Animal
1:     cat             0.4000000                    2
2: hamster             0.5000000                    1
3:     dog             0.6666667                    2
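
As a cross-check, the same mean and sum per category can be computed in base R. This is a sketch that does not use dataPreparation at all; the data frame `check` is a name introduced here, and its column names are chosen only to mirror the printed table. Rows come out in factor-level order (cat, dog, hamster), which differs from the order printed above.

```r
# Sample data copied from the notebook
Animal <- as.factor(c("cat","hamster","cat","cat","dog","hamster","cat","dog","cat","dog"))
Target <- c(1,0,0,1,1,1,0,1,0,0)
df <- data.frame(Animal, Target)

# Mean and sum of Target within each Animal level
means <- tapply(df$Target, df$Animal, mean)
sums  <- tapply(df$Target, df$Animal, sum)

# Assemble a small table with the same columns as the encoding table
check <- data.frame(Animal = levels(df$Animal),
                    Target_mean_by_Animal = as.numeric(means),
                    Target_sum_by_Animal  = as.numeric(sums))
print(check)
```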



## Finally, add the new encoding of the Animal feature

### Use target_encode() to assign, in each row of df, the encoded values that match the row's 'Animal' category


In [19]:
# add the encoded columns to the data frame by looking up, for each
# Animal, the values stored in target_encoding
Encoded_Animals <- dataPreparation::target_encode(df, target_encoding = target_encoding)

# inspect it
print(Encoded_Animals)


[1] "target_encode: Start to encode columns according to target."
     Animal Target Target_mean_by_Animal Target_sum_by_Animal
 1:     cat      1             0.4000000                    2
 2: hamster      0             0.5000000                    1
 3:     cat      0             0.4000000                    2
 4:     cat      1             0.4000000                    2
 5:     dog      1             0.6666667                    2
 6: hamster      1             0.5000000                    1
 7:     cat      0             0.4000000                    2
 8:     dog      1             0.6666667                    2
 9:     cat      0             0.4000000                    2
10:     dog      0             0.6666667                    2
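
The row-by-row look-up can also be sketched in base R with sapply(), without dataPreparation. The names `encoding`, `lookup_encoding` and `Encoded_Animal` are hypothetical and introduced here only for illustration.

```r
# Sample data copied from the notebook
Animal <- as.factor(c("cat","hamster","cat","cat","dog","hamster","cat","dog","cat","dog"))
Target <- c(1,0,0,1,1,1,0,1,0,0)
df <- data.frame(Animal, Target)

# Named vector: one mean of Target per Animal level
encoding <- sapply(split(df$Target, df$Animal), mean)

# Look-up function: return the encoded value for one animal label
lookup_encoding <- function(animal) encoding[[animal]]

# Apply the look-up to every row and store the result as a new column
df$Encoded_Animal <- sapply(as.character(df$Animal), lookup_encoding,
                            USE.NAMES = FALSE)
print(df)
```

Indexing the named vector directly, `encoding[as.character(df$Animal)]`, would give the same values in one vectorized step; sapply() is used here only to make the per-row look-up explicit.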
