---
title:  Tensorflow tf.feature_column
tags: 小书匠,feature_column,Tensorflow1,tensorflow,embedding
grammar_cjkRuby: true
#renderNumberedHeading: true
---

[toc!]

# Tensorflow tf.feature_column

```python
import tensorflow as tf

print(tf.__version__)
```

```
1.15.0
```


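## Background

Feature preprocessing is a step that almost every machine learning model needs. Common techniques include bucketizing continuous variables, one-hot encoding discrete variables, and embedding discrete features.

TensorFlow provides a powerful feature-processing module, `tf.feature_column`. It describes how raw features are transformed before they are fed into the network, and the resulting columns are handed to an Estimator for training.

Feature data falls mainly into two kinds, categorical and dense. The `feature_column` API in TensorFlow is used to define how each kind is processed.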

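As the figure below shows, there are nine functions in total:
1. 5 categorical-column functions
2. 3 dense-column functions
3. `bucketized_column`, which can be used as either kind

Conceptually, `categorical_column_with_identity` (a categorical column) and `indicator_column` (a dense column) both describe a one-hot representation of a categorical feature, but they belong to different column types: the former is categorical and the latter is dense. Networks built with Estimators differ in which of the two types they accept, so in practice you need to pay attention to converting between them.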

![](https://pic3.zhimg.com/v2-72716a0ab2971fea8cadb98cbceeddfa_b.jpg)

Note that the TensorFlow version used here is 1.15.0. `tf.feature_column` in TensorFlow 2 is used somewhat differently.

Generally, a model accepts dense columns, so any categorical column has to be converted to a dense column first.

## input_layer

Before covering the other APIs, we first introduce `tf.feature_column.input_layer`. Its signature is:

```
tf.compat.v1.feature_column.input_layer(
    features, feature_columns, weight_collections=None, trainable=True,
    cols_to_vars=None, cols_to_output_tensors=None
)
```

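**It converts the data specified in `features` into the form specified by `feature_columns` and returns the result as a single Tensor. This feels somewhat like `feed_dict`: `features` is the feature dict, and `feature_columns` holds something similar to placeholders.**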

>A Tensor which represents input layer of a model. Its shape is (batch_size, first_layer_dimension) and its dtype is float32. first_layer_dimension is determined based on given feature_columns.

It returns a `tf.float32` Tensor of shape `[batch_size, first_layer_dimension]`, where `first_layer_dimension` is determined by `feature_columns`.

For details, see [2] and the example below.

```python
import tensorflow as tf

sess = tf.Session()

# Feature data
features = {'birthplace': [[1], [1], [3], [4]]}

# Feature column
birthplace = tf.feature_column.categorical_column_with_identity(
    "birthplace", num_buckets=3, default_value=0)
birthplace = tf.feature_column.indicator_column(birthplace)
# Collect the feature columns
columns = [birthplace]

# Input layer (data, feature columns)
inputs = tf.feature_column.input_layer(features, columns)

# Initialize and run
init = tf.global_variables_initializer()
sess.run(tf.tables_initializer())
sess.run(init)
v = sess.run(inputs)
print(v)
```

```
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
[[0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
```


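Here the data is `{'birthplace': [[1], [1], [3], [4]]}` and the "placeholder" is `columns = [birthplace]`.

Note that `features` is a dict while `columns` is a list. How do we know which entry of `features` each element of `columns` corresponds to? In fact, every element of `columns` carries a key, and that key is exactly a key in `features`. Seen this way, although `columns` is a list, it can also be regarded as a collection of key-value pairs in which the key is stored as an attribute of each column object.

In this example, the relevant statement is `tf.feature_column.categorical_column_with_identity("birthplace", num_buckets=3, default_value=0)`, where `birthplace` is the key. It matches the `birthplace` key in `{'birthplace': [[1], [1], [3], [4]]}`. The API documentation describes the `features` argument as follows: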

>A mapping from key to tensors. _FeatureColumns look up via these keys. For example numeric_column('price') will look at 'price' key in this dict. 
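
The key-based lookup described above can be pictured with a small pure-Python sketch. This is not TensorFlow's implementation: `ToyColumn`, `toy_input_layer`, and the sample `price` values are hypothetical and exist only for illustration. The sketch assumes that each column stores a `key`, turns one raw value into a list of floats, and that the per-column outputs are concatenated along the last axis. Columns are processed in sorted key order here so that the layout does not depend on the order of the list.

```python
from typing import Callable, Dict, List, Sequence


class ToyColumn:
    """Hypothetical stand-in for a feature column: a key plus a transform."""

    def __init__(self, key: str, transform: Callable[[object], List[float]]):
        self.key = key              # name used to look the data up in `features`
        self.transform = transform  # maps one raw value to a list of floats


def toy_input_layer(features: Dict[str, Sequence],
                    columns: List[ToyColumn]) -> List[List[float]]:
    """Look up each column by its key and concatenate the per-column outputs."""
    batch_size = len(next(iter(features.values())))
    rows: List[List[float]] = [[] for _ in range(batch_size)]
    # Sort by key so the output layout does not depend on the list order.
    for column in sorted(columns, key=lambda c: c.key):
        values = features[column.key]
        for row, value in zip(rows, values):
            row.extend(column.transform(value))
    return rows


def one_hot(index: int, depth: int) -> List[float]:
    """Return a one-hot list of length `depth` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


# Made-up sample data: one categorical feature and one numeric feature.
features = {"birthplace": [1, 1, 0, 2], "price": [9.5, 3.0, 7.25, 1.0]}
columns = [
    ToyColumn("price", lambda v: [float(v)]),
    ToyColumn("birthplace", lambda v: one_hot(v, 3)),
]

dense = toy_input_layer(features, columns)
for row in dense:
    print(row)
```

Each output row has width 4: three one-hot slots for `birthplace` followed by one slot for `price`, which is the role `first_layer_dimension` plays in the real API.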

## categorical column

### categorical_column_with_identity

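*   categorical_column_with_identity: treats an integer feature value directly as the category ID, which becomes a *one-hot encoding* once it is wrapped in `indicator_column`.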

![](https://pic4.zhimg.com/v2-3145e7209e6120e2485b462dafe6627b_b.jpg)

*   It only applies to categorical variables whose values are integers. The actual output is as follows:

```python
import tensorflow as tf

sess = tf.Session()

# Feature data
features = {'birthplace': [[1], [1], [3], [4]]}

# Feature column
birthplace = tf.feature_column.categorical_column_with_identity(
    "birthplace", num_buckets=3, default_value=0)
birthplace = tf.feature_column.indicator_column(birthplace)
# Collect the feature columns
columns = [birthplace]

# Input layer (data, feature columns)
inputs = tf.feature_column.input_layer(features, columns)

# Initialize and run
init = tf.global_variables_initializer()
sess.run(tf.tables_initializer())
sess.run(init)
v = sess.run(inputs)
print(v)
```

```
[[0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
```


In this example, `num_buckets=3` means the feature takes values in [0, 1, 2]. Any value outside this range is replaced by `default_value`.
Here the values 3 and 4 are both converted to [1, 0, 0].
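
The out-of-range behaviour can be reproduced with a few lines of plain Python. This is only a sketch of the rule stated above, not TensorFlow code. The helper name `identity_one_hot` is hypothetical, and the input is flattened to a simple list of integers for brevity.

```python
from typing import List


def identity_one_hot(values: List[int], num_buckets: int,
                     default_value: int) -> List[List[float]]:
    """One-hot encode integer IDs, sending out-of-range IDs to default_value."""
    rows = []
    for value in values:
        # Valid IDs lie in [0, num_buckets); anything else uses the default.
        index = value if 0 <= value < num_buckets else default_value
        rows.append([1.0 if i == index else 0.0 for i in range(num_buckets)])
    return rows


encoded = identity_one_hot([1, 1, 3, 4], num_buckets=3, default_value=0)
for row in encoded:
    print(row)
```

With the same inputs as the example above, the last two rows come out as `[1.0, 0.0, 0.0]` because 3 and 4 fall back to the default ID 0.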

### categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_file

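*   categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_file: maps each word to its index in the vocabulary, following the order of the vocabulary, and that index is then turned into a one-hot encoding.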

![](https://pic2.zhimg.com/v2-700a75b4daebec26db1f3ab4ae9add5d_b.jpg)

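*   These are mainly used for categorical features that are not integers. The two functions differ in how many categories they are meant to handle. The former suits a small number of categories, where all possible categories can be passed in directly. The latter suits a large number of categories, where all possible categories are stored in a file. The actual output is as follows: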

```python
import tensorflow as tf

sess = tf.Session()

# Feature data
features = {
    'sex': ['male', 'male', 'female', 'female'],
}

# Feature column
sex_column = tf.feature_column.categorical_column_with_vocabulary_list('sex', ['male', 'female'])
sex_column = tf.feature_column.indicator_column(sex_column)
# Collect the feature columns
columns = [
    sex_column
]

# Input layer (data, feature columns)
inputs = tf.feature_column.input_layer(features, columns)

# Initialize and run
init = tf.global_variables_initializer()
sess.run(tf.tables_initializer())
sess.run(init)

v = sess.run(inputs)
print(v)
```

```
[[1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]
```


*   As shown, the output is the one-hot encoding of `sex`. The list passed after the key defines all categories of this variable.
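
The vocabulary lookup can be sketched in plain Python as follows. The helper name `vocabulary_one_hot` and the extra `'unknown'` input are hypothetical. The sketch assumes that a word missing from the vocabulary gets the index -1 and therefore an all-zero row, which mirrors the behaviour I would expect from the default settings, but treat that as an assumption rather than a statement about the library.

```python
from typing import List


def vocabulary_one_hot(values: List[str],
                       vocabulary: List[str]) -> List[List[float]]:
    """Map each word to its position in `vocabulary`, then one-hot encode it."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for value in values:
        # Assumption: out-of-vocabulary words get index -1 (no slot is set).
        index = index_of.get(value, -1)
        rows.append([1.0 if i == index else 0.0 for i in range(len(vocabulary))])
    return rows


encoded = vocabulary_one_hot(
    ['male', 'male', 'female', 'female', 'unknown'],
    ['male', 'female'],
)
for row in encoded:
    print(row)
```

The position of a word in the vocabulary list fixes which slot is set, so reordering the list changes the encoding.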

### categorical_column_with_hash_bucket

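*   categorical_column_with_hash_bucket: when a feature has a large number of string or numeric categories, hashing can be used. This builds the lookup mapping quickly, at the cost of possible hash collisions.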

![](https://pic4.zhimg.com/v2-b8f3f29807e2c98e53549015ed58aaf7_b.jpg)

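*   `hash_bucket_size` is usually set to 2-5 times the total number of categories. This function suits categorical variables whose full set of categories cannot be determined in advance. The actual output is as follows: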

```python
import tensorflow as tf

# Feature data
features = {
    'department': ['sport', 'sport', 'drawing', 'gardening', 'travelling'],
}

# Feature column
department = tf.feature_column.categorical_column_with_hash_bucket('department', 4, dtype=tf.string)
department = tf.feature_column.indicator_column(department)

# Collect the feature columns
columns = [
    department
]

inputs = tf.feature_column.input_layer(features, columns)

with tf.Session() as sess:
    # Initialize and run
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    v = sess.run(inputs)
    print(v)
```

```
[[0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]]
```


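*   As shown above, the output is the one-hot encoding of `department`, and hash collisions occur between different departments: `sport`, `drawing`, and `travelling` all land in bucket 1.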
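
Why collisions appear can be seen from a minimal sketch of hash bucketing. It uses SHA-256 from Python's standard library purely for illustration, so the bucket IDs will not match the TensorFlow output above, which relies on its own hash function. The helper name `hash_bucket` and the extra category `'cooking'` are hypothetical.

```python
import hashlib


def hash_bucket(value: str, hash_bucket_size: int) -> int:
    """Map a string to a bucket ID in [0, hash_bucket_size)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


hash_bucket_size = 4
categories = ["sport", "drawing", "gardening", "travelling", "cooking"]
buckets = {name: hash_bucket(name, hash_bucket_size) for name in categories}
for name, bucket in buckets.items():
    print(name, "->", bucket)

# Five distinct categories share four buckets, so at least two must collide.
collision_exists = len(set(buckets.values())) < len(categories)
print("collision:", collision_exists)
```

No vocabulary has to be stored, which is the appeal of this approach. Raising `hash_bucket_size` makes collisions less likely but never rules them out.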

### crossed_column

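*   crossed_column: feature crossing. In some cases, a cross of several features carries different information than each feature encoded on its own.
*   This function cannot cross features that have already been hash-mapped (hashed categorical columns). The actual output is as follows: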

```python
import tensorflow as tf

sess = tf.Session()

# Feature data
features = {
    'sex': [1, 2, 1, 1, 2],
    'department': ['sport', 'sport', 'drawing', 'gardening', 'travelling'],
}

# Feature columns
department = tf.feature_column.categorical_column_with_vocabulary_list('department', ['sport','drawing','gardening','travelling'], dtype=tf.string)
sex = tf.feature_column.categorical_column_with_identity('sex', num_buckets=2, default_value=0)
sex_department = tf.feature_column.crossed_column([department,sex], 16)
sex_department = tf.feature_column.indicator_column(sex_department)
# Collect the feature columns
columns = [
    sex_department
]

# Input layer (data, feature columns)
inputs = tf.feature_column.input_layer(features, columns)

# Initialize and run
init = tf.global_variables_initializer()
sess.run(tf.tables_initializer())
sess.run(init)

v = sess.run(inputs)
print(v)
```

```
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]
```


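*   As shown above, the output is the one-hot encoding of the cross. `hash_bucket_size` (16 here) is the dimension of the one-hot crossed vector. Note that `sex` is declared with `num_buckets=2`, so the input value 2 is out of range and is mapped to `default_value=0` before crossing.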
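
The idea behind a crossed column, combining the values of several features and hashing the combination into a fixed number of buckets, can be sketched as below. This is an illustration under stated assumptions rather than TensorFlow's algorithm: the helper names `cross_bucket` and `crossed_one_hot`, the `"_X_"` separator, and the use of SHA-256 are all choices made here, and the `sex` values are changed to 0/1 for the sketch.

```python
import hashlib
from typing import List, Sequence


def cross_bucket(values: Sequence[object], hash_bucket_size: int) -> int:
    """Hash one combination of feature values into [0, hash_bucket_size)."""
    joined = "_X_".join(str(v) for v in values)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def crossed_one_hot(feature_lists: List[Sequence[object]],
                    hash_bucket_size: int) -> List[List[float]]:
    """One-hot encode the hashed cross of several aligned feature lists."""
    rows = []
    for combo in zip(*feature_lists):
        index = cross_bucket(combo, hash_bucket_size)
        rows.append([1.0 if i == index else 0.0
                     for i in range(hash_bucket_size)])
    return rows


department = ['sport', 'sport', 'drawing', 'gardening', 'travelling']
sex = [1, 0, 1, 1, 0]  # illustrative values, already within [0, 2)
crossed = crossed_one_hot([department, sex], hash_bucket_size=16)
for row in crossed:
    print(row)
```

Each row has exactly one active slot out of 16, and identical (department, sex) pairs always map to the same slot.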

## Dense column

### numeric_column

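*   numeric_column: mainly used for continuous variables, which can be of float or int type. It reads the column for the given key from the feature table and converts it to the specified dtype. In practice: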

```python
import tensorflow as tf

sess = tf.Session()

# Feature data
features = {
    'sale': [1.2, 2.3, 1.2, 1.5, 2.2]
}

# Feature column
sale = tf.feature_column.numeric_column("sale", default_value=0.0)
# Collect the feature columns
columns = [
    sale
]

# Input layer (data, feature columns)
inputs = tf.feature_column.input_layer(features, columns)

# Initialize and run
init = tf.global_variables_initializer()
sess.run(tf.tables_initializer())
sess.run(init)

v = sess.run(inputs)
print(v)
```

```
[[1.2]
 [2.3]
 [1.2]
 [1.5]
 [2.2]]
```


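### indicator_column

`indicator_column` has already been demonstrated above. It is mainly used to turn a categorical column created by `tf.feature_column.categorical_column_*` into a dense column so that it can be fed to a neural network.

This effectively corresponds to one-hot encoding (multi-hot when a feature holds several values per example).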

### bucketized_column

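*   bucketized_column: discretizes a continuous variable into buckets and outputs a one-hot result, which makes it easy to build crossed features between continuous values and categorical variables.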

![](https://pic3.zhimg.com/v2-3ba0d5bcecd86642c39598b848ec534e_b.jpg)

*   In practice:

```python
import numpy as np
import tensorflow as tf

sess = tf.Session()

# Feature data
features = {'sale': [0.1, 0.2, 0.5, 1.0, 0.2]}

# Feature column: boundaries [0.0, 0.5] give three buckets
step_val = 0.5
boundaries = list(np.arange(0, 1, step_val))
sale = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('sale', default_value=0.0),
                                           boundaries=boundaries)
# Collect the feature columns
columns = [sale]

# Input layer (data, feature columns)
inputs = tf.feature_column.input_layer(features, columns)

# Initialize and run
init = tf.global_variables_initializer()
sess.run(tf.tables_initializer())
sess.run(init)

v = sess.run(inputs)
print(v)
```

```
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]]
```
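
The bucketing in this output can be reproduced with the standard-library `bisect` module. The sketch below is an illustration, not the library's code, and the helper name `bucketize` is hypothetical. It assumes that boundaries `[0.0, 0.5]` define the buckets (-inf, 0.0), [0.0, 0.5), and [0.5, +inf), which is consistent with the output shown above.

```python
from bisect import bisect_right
from typing import List


def bucketize(values: List[float],
              boundaries: List[float]) -> List[List[float]]:
    """One-hot encode each value by the bucket its boundaries place it in."""
    num_buckets = len(boundaries) + 1
    rows = []
    for value in values:
        # bisect_right counts how many boundaries are <= value,
        # so a value equal to a boundary goes into the bucket to its right.
        index = bisect_right(boundaries, value)
        rows.append([1.0 if i == index else 0.0 for i in range(num_buckets)])
    return rows


bucketed = bucketize([0.1, 0.2, 0.5, 1.0, 0.2], boundaries=[0.0, 0.5])
for row in bucketed:
    print(row)
```

Two boundaries yield three buckets, which is why every row has width 3. The first slot would only be set for negative values.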


### embedding_column

* embedding_column: further converts a categorical column into an embedding.

```python
import tensorflow as tf

# Feature data
features = {
    'department': ['sport', 'sport', 'drawing', 'gardening', 'travelling'],
}

# Feature column
vocab_list = ['sport','drawing','gardening','travelling']
department = tf.feature_column.categorical_column_with_vocabulary_list('department', vocab_list, dtype=tf.string)
department_emb = tf.feature_column.embedding_column(department, 5)

# Collect the feature columns
columns = [
    department_emb
]

inputs = tf.feature_column.input_layer(features, columns)

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    sess.run(tf.global_variables_initializer())

    v = sess.run(inputs)
    print(v)
```

```
[[-0.0143425   0.1741002  -0.09912673 -0.22460788  0.10394533]
 [-0.0143425   0.1741002  -0.09912673 -0.22460788  0.10394533]
 [ 0.22888376 -0.12529817 -0.38291758 -0.2713956   0.3131153 ]
 [ 0.27156228  0.13630466 -0.46924916 -0.03021192 -0.8221253 ]
 [ 0.03409626  0.08630039 -0.30304956 -0.00866169  0.19067748]]
```
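
In the output above, the first two rows are identical because both inputs are `sport`. An embedding is essentially a table lookup, which the following pure-Python sketch illustrates. The helper name `build_embedding_table` is hypothetical, and the table is filled with plain Gaussian noise of standard deviation 1/sqrt(dimension) as a simplification. The numbers therefore differ from the TensorFlow output, and in a real model the table would be learned during training.

```python
import random
from typing import List


def build_embedding_table(vocab_size: int, dimension: int,
                          seed: int = 0) -> List[List[float]]:
    """Create one random vector of length `dimension` per vocabulary entry."""
    rng = random.Random(seed)
    stddev = 1.0 / dimension ** 0.5
    return [[rng.gauss(0.0, stddev) for _ in range(dimension)]
            for _ in range(vocab_size)]


vocab_list = ['sport', 'drawing', 'gardening', 'travelling']
index_of = {word: i for i, word in enumerate(vocab_list)}
table = build_embedding_table(len(vocab_list), dimension=5)

departments = ['sport', 'sport', 'drawing', 'gardening', 'travelling']
# Embedding lookup: replace each word by the table row for its index.
embedded = [table[index_of[word]] for word in departments]
for row in embedded:
    print([round(x, 4) for x in row])
```

Unlike the one-hot output of `indicator_column`, the width here is the chosen embedding dimension (5), independent of the vocabulary size.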


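## Summary

This article shows how `tf.feature_column` processes features by looking at concrete data output, to make the API easier to understand. TensorFlow, one of the most widely used deep learning frameworks, offers many high-level APIs that can make an algorithm engineer's work much easier. `tf.estimator` works well with these feature columns and also integrates train, evaluate, and predict in one place, so it is worth using regularly.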

# References
1. [tf.feature_column的特征处理探究 - 知乎](https://zhuanlan.zhihu.com/p/73701872)

2. [tf.compat.v1.feature_column.input_layer  |  TensorFlow Core v2.4.1](https://tensorflow.google.cn/api_docs/python/tf/compat/v1/feature_column/input_layer?hl=en)

3. [tf.feature_column.indicator_column  |  TensorFlow Core v1.15.0](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/feature_column/indicator_column)