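# A walkthrough of the prepare_data.py script
prepare_data.py converts the CSV files into TFRecords files.

The script is laid out in clear steps, the way a complete engineering project would be written. Besides working through the API changes between TensorFlow versions, this notebook annotates every step in full.  

For actual use, simply run prepare_data.py itself.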

## Step 1: Load the libraries

In [2]:
# -*- coding:utf-8 -*-
import os
import csv
import itertools
import functools
import tensorflow as tf
import numpy as np
import pandas as pd

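Python's built-in **itertools** module provides very useful functions for working with iterable objects.
<a href='https://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001415616001996f6b32d80b6454caca3d33c965a07611f000'>See this link for usage</a>

The **functools** module supports **higher-order functions**: functions that act on or return other functions. As far as the module is concerned, any callable object can be treated as a "function".
<a href='https://www.cnblogs.com/Security-Darren/p/4168310.html'>See this link for usage</a>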
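
As a quick illustration of the two modules, the sketch below uses `functools.partial` to pre-fill one argument of a function and `itertools.islice` to take only the first items of a lazy generator. The helper `split_on` and the sample sentences are made up for this example and do not appear in prepare_data.py.

```python
import functools
import itertools


def split_on(text, sep):
    """Split text on the given separator (hypothetical helper for this demo)."""
    return text.split(sep)


# functools.partial returns a new callable with sep fixed to a single space.
split_on_space = functools.partial(split_on, sep=" ")

# A lazy generator: nothing is tokenized until items are requested.
lines = ["hello world", "how are you", "fine thanks"]
tokenized = (split_on_space(line) for line in lines)

# itertools.islice pulls only the first two items from the generator.
first_two = list(itertools.islice(tokenized, 2))
print(first_two)
```

The third sentence is never split, because `islice` stops requesting items after the second one.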


## Step 2: Define optional command-line arguments with tf.flags.DEFINE_xxx()

In [3]:
# tf.flags.DEFINE_xxx
# FLAGS = tf.flags.FLAGS
# Add optional command-line arguments
tf.flags.DEFINE_integer(
    'min_word_frequency', 5, 'Minimum frequency of words in the vocabulary'
)

tf.flags.DEFINE_integer(
    'max_sentence_len', 160, "Maximum Sentence Length"
)

tf.flags.DEFINE_string(
    'input_dir', os.path.abspath('../../ubuntu_dataset'),
    "Input directory containing original CSV data files"
)

tf.flags.DEFINE_string(
    'output_dir', os.path.abspath('../../ubuntu_dataset'),
    'Output directory for TFRecord files'
)

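# Needed only in Jupyter: the kernel is launched with a -f argument, and
# flag parsing raises an error unless a flag named 'f' is defined.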
tf.app.flags.DEFINE_string('f', '', 'kernel')

FLAGS = tf.flags.FLAGS

For background on tf.flags, see <a href='../../../../../TF/教程/小知识.ipynb#flags'>this link</a>.
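
In TensorFlow 2.x, `tf.flags` is no longer available at the top level (it lives under `tf.compat.v1`). If the script has to run without it, the same four options could be declared with the standard-library `argparse` module. This is a sketch of an alternative, not what prepare_data.py does, and the function name `parse_flags` is an assumption. `parse_known_args` tolerates arguments it does not recognize, such as the `-f` that Jupyter passes, so no dummy flag is needed.

```python
import argparse
import os


def parse_flags(argv=None):
    """Parse the four options with argparse (hypothetical stand-in for tf.flags)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--min_word_frequency", type=int, default=5,
                        help="Minimum frequency of words in the vocabulary")
    parser.add_argument("--max_sentence_len", type=int, default=160,
                        help="Maximum sentence length")
    parser.add_argument("--input_dir",
                        default=os.path.abspath("../../ubuntu_dataset"),
                        help="Input directory containing original CSV data files")
    parser.add_argument("--output_dir",
                        default=os.path.abspath("../../ubuntu_dataset"),
                        help="Output directory for TFRecord files")
    # parse_known_args returns (namespace, leftover_args) and does not fail
    # on arguments it does not know, e.g. Jupyter's "-f <kernel file>".
    flags, _unknown = parser.parse_known_args(argv)
    return flags


flags = parse_flags(["--max_sentence_len", "80", "-f", "kernel.json"])
print(flags.min_word_frequency, flags.max_sentence_len)
```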

## Step 3: Build the dataset paths from the optional input_dir argument

In [4]:
TRAIN_PATH = os.path.join(FLAGS.input_dir, 'train.csv')
VALIDATION_PATH = os.path.join(FLAGS.input_dir, 'valid.csv')
TEST_PATH = os.path.join(FLAGS.input_dir, 'test.csv')

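## Step 4: Define a custom tokenizer function
This version simply splits on the space character " ". For Chinese, or any text that cannot be tokenized this way, replace it with a suitable preprocessing and tokenization step.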

In [5]:
def tokenizer_fn(iterator):
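    '''
    Splits each item of an iterable on single spaces (adequate for English).
    Returns a generator that yields one list of tokens per item.
    '''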
    return (x.split(" ") for x in iterator)
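
A quick check of what `tokenizer_fn` yields, using two made-up sentences. The return value is a generator, so it can be consumed only once, and each item it yields is a list of tokens.

```python
def tokenizer_fn(iterator):
    """Lazily split each string in the iterable on single spaces."""
    return (x.split(" ") for x in iterator)


tokens = tokenizer_fn(["nice thanks! __eou__", "ok __eou__"])
result = list(tokens)
print(result)

# A generator is exhausted after one pass, so a second pass yields nothing.
second_pass = list(tokens)
print(second_pass)
```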

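## Step 5: Turn the CSV file into an iterator
The file can then be read one row at a time, and each row can be preprocessed (tokenized, for example) as it is read.
This avoids loading the whole file into memory.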

In [6]:
def create_csv_iter(filename):
    '''
    Turns the contents of a CSV file into a generator (an iterable).
    Returns an iterator over a CSV file. Skips the header.
    '''
    with open(filename) as csvfile:
        reader = csv.reader(csvfile)

        # Skip the header
        next(reader)

        for row in reader:
            yield row
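
The behaviour can be tried without the Ubuntu dataset. The sketch below writes a tiny CSV file with made-up rows to a temporary location and reads it back through the same generator. Every value comes back as a string, including the label.

```python
import csv
import os
import tempfile


def create_csv_iter(filename):
    """Yield the rows of a CSV file one at a time, skipping the header."""
    with open(filename, newline="") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header row
        for row in reader:
            yield row


# Write a tiny CSV file (made-up data) to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["Context", "Utterance", "Label"])
    writer.writerow(["hi there __eou__", "hello __eou__", "1"])
    writer.writerow(["any idea? __eou__", "no __eou__", "0"])
    tmp_path = tmp.name

rows = list(create_csv_iter(tmp_path))
os.remove(tmp_path)
print(rows)
```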

After reading test.csv, the first row (the header on the first line of the file has already been skipped) looks like this:

['anyone knows why my stock oneiric exports env var \'USERNAME\'?  I mean what is that used for?  I know of $USER but not $USERNAME .  My precise install doesn\'t export USERNAME __eou__ __eot__ looks like it used to be exported by lightdm, but the line had the comment "// FIXME: Is this required?" so I guess it isn\'t surprising it is gone __eou__ __eot__ thanks!  How the heck did you figure that out? __eou__ __eot__ https://bugs.launchpad.net/lightdm/+bug/864109/comments/3 __eou__ __eot__ ', 'nice thanks! __eou__', 'wrong channel for it, but check efnet.org, unofficial page. __eou__', 'every time the kernel changes, you will lose video __eou__ yep __eou__', 'ok __eou__', "!nomodeset > acer __eou__ I'm assuming it is a driver issue. __eou__ !pm > acer __eou__ i DON'T pm. ;) __eou__ OOPS SORRY FOR THE CAPS __eou__", 'http://www.ubuntu.com/project/about-ubuntu/derivatives  (some call them derivatives, others call them flavors, same difference) __eou__', "thx __eou__ unfortunately the program isn't installed from the repositories __eou__", 'how can I check? By doing a recovery for testing? __eou__', 'my humble apologies __eou__', '#ubuntu-offtopic __eou__']

Next, read test.csv with pd.read_csv and compare the first row with what the CSV iterator above returned.

In [7]:
test_df = pd.read_csv('../../ubuntu_dataset/test.csv')
test_df.iloc[0]

Context                   anyone knows why my stock oneiric exports env ...
Ground Truth Utterance                                 nice thanks! __eou__
Distractor_0              wrong channel for it, but check efnet.org, uno...
Distractor_1              every time the kernel changes, you will lose v...
Distractor_2                                                     ok __eou__
Distractor_3              !nomodeset > acer __eou__ I'm assuming it is a...
Distractor_4              http://www.ubuntu.com/project/about-ubuntu/der...
Distractor_5              thx __eou__ unfortunately the program isn't in...
Distractor_6              how can I check? By doing a recovery for testi...
Distractor_7                                    my humble apologies __eou__
Distractor_8                                       #ubuntu-offtopic __eou__
Name: 0, dtype: object

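## Step 6: Create the vocabulary
create_vocab takes two parameters, input_iter and min_frequency.

input_iter is the text content.
min_frequency is the minimum word frequency: only words that occur more often than min_frequency are added to the vocabulary, which keeps the vocabulary small.

The vocabulary-building method used in create_vocab is deprecated and will soon be removed, so the function needs to be updated to a newer approach.
The original implementation, based on tf.contrib.learn.preprocessing.VocabularyProcessor, is described first.  
The tensorflow/transform or tf.data approach comes after it.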

In [8]:
def create_vocab(input_iter, min_frequency):
    '''
    Creates and returns a VocabularyProcessor object with the vocabulary
    for the input iterator. The minimum word frequency is set by
    min_frequency.
    '''
    vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(
        FLAGS.max_sentence_len,
        min_frequency=min_frequency,
        tokenizer_fn=tokenizer_fn
    )
    vocab_processor.fit(input_iter)
    return vocab_processor

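- tf.contrib.learn.preprocessing.VocabularyProcessor (max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

    - Parameters:

        - max_document_length: maximum document length. Longer documents are truncated; shorter ones are padded with 0.
        - min_frequency: minimum word frequency. Words that occur fewer times than this are not added to the vocabulary.
        - vocabulary: a CategoricalVocabulary object.
        - tokenizer_fn: the tokenizer function.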
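
The effect of `max_document_length` can be pictured with a few lines of plain Python. This is only an illustration of the truncate-or-pad behaviour described above, not the library's code, and `pad_or_truncate` is a hypothetical name.

```python
def pad_or_truncate(word_ids, max_document_length):
    """Return exactly max_document_length ids: cut extras, pad with 0."""
    clipped = word_ids[:max_document_length]
    return clipped + [0] * (max_document_length - len(clipped))


short = pad_or_truncate([1, 2, 3], 4)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], 4)
print(short, long)
```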

**Important: tokenizer (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.**

In [9]:
import tensorflow as tf
import numpy as np

max_document_length = 4
x_text = ['I love you', 'me too']

vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(
    max_document_length)
vocab_processor.fit(x_text)


Instructions for updating:
Please use tensorflow/transform or tf.data.


<tensorflow.contrib.learn.python.learn.preprocessing.text.VocabularyProcessor at 0xb1c72f8d0>

In [10]:
for i in vocab_processor.transform([' i me too']):
    print(i)

[0 4 5 0]


In [17]:
train_df = pd.read_csv('../../ubuntu_dataset/train.csv')

In [21]:
train_df.columns

Index(['Context', 'Utterance', 'Label'], dtype='object')

In [22]:
print("Creating vocabulary...")
input_iter = create_csv_iter(TRAIN_PATH)
# The columns of train.csv are ['Context', 'Utterance', 'Label']
input_iter = (x[0] + " " + x[1] for x in input_iter)
vocab = create_vocab(input_iter, min_frequency=FLAGS.min_word_frequency)
print(vocab)

Creating vocabulary...
<tensorflow.contrib.learn.python.learn.preprocessing.text.VocabularyProcessor object at 0x1c2e57add8>


In [15]:
print('Start of the tf.data API section'.center(70, '*'))

*******************Start of the tf.data API section*******************


### No suitable example found yet; to be written later
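
Until a tf.data or tensorflow/transform version is written, the counting part of the vocabulary can be sketched without TensorFlow. The code below is a stand-in under stated assumptions, not a tf.data implementation and not the exact behaviour of `VocabularyProcessor`: id 0 is reserved for unknown words, ids are assigned by descending frequency, and a token is kept when its count is at least `min_frequency` (the deprecated class may draw that boundary differently). The names `build_vocab` and `encode` are hypothetical.

```python
from collections import Counter


def build_vocab(sentences, min_frequency):
    """Map each sufficiently frequent token to an integer id (0 = unknown)."""
    counts = Counter(token for s in sentences for token in s.split(" "))
    vocab = {}
    # most_common() sorts by descending count.
    for token, count in counts.most_common():
        # Assumption: keep tokens seen at least min_frequency times.
        if count >= min_frequency:
            vocab[token] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Replace every token by its id; unknown tokens become 0."""
    return [vocab.get(token, 0) for token in sentence.split(" ")]


sentences = ["ok thanks", "ok bye", "thanks ok"]
vocab = build_vocab(sentences, min_frequency=2)
encoded = encode("ok bye thanks", vocab)
print(vocab)
print(encoded)
```

Here "bye" occurs only once, so it is left out of the vocabulary and encoded as 0.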

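## Step 7: Define a sentence vectorization function
Sentences are passed in one at a time, and the vocab processor built above encodes them through its vocab.transform method; this is how the valid and test data are encoded.    
vocab.transform returns a generator, so the result cannot be fetched directly with toarray() the way it can with sklearn's TfidfVectorizer and similar classes. Use next(generator).tolist() to obtain a list instead.   
To get an ndarray, convert the result with numpy.  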

In [32]:
def transform_sentence(sequence, vocab_processor):
    '''
    Maps a single sentence into integer vocabulary. 
    Returns a Python list.
    '''
    return next(vocab_processor.transform([sequence])).tolist()
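
To see why both `next(...)` and `.tolist()` are needed, the sketch below runs `transform_sentence` against a stand-in object. `FakeVocabProcessor` is hypothetical and mimics only the one property that matters here: its `transform` method is a generator that yields one NumPy array per input document.

```python
import numpy as np


class FakeVocabProcessor:
    """Hypothetical stand-in that mimics the generator returned by transform."""

    def __init__(self, mapping, max_len):
        self.mapping = mapping
        self.max_len = max_len

    def transform(self, documents):
        for doc in documents:
            ids = np.zeros(self.max_len, dtype=np.int64)
            for i, token in enumerate(doc.split(" ")[: self.max_len]):
                ids[i] = self.mapping.get(token, 0)
            yield ids


def transform_sentence(sequence, vocab_processor):
    """Map a single sentence to a Python list of word ids."""
    return next(vocab_processor.transform([sequence])).tolist()


fake = FakeVocabProcessor({"me": 4, "too": 5}, max_len=4)
ids = transform_sentence("i me too", fake)
print(ids)
```

`next` pulls the single array out of the generator, and `.tolist()` converts it to plain Python integers.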

## Step 8: Write a sentence into a FeatureList

In [37]:
def create_text_sequence_feature(fl, sentence, sentence_len, vocab):
    '''
    Writes a sentence to FeatureList protocol buffer.
    '''
    sentence_transformed = transform_sentence(sentence, vocab)
    for word_id in sentence_transformed:
        fl.feature.add().int64_list.value.extend([word_id])
    return fl

[7, 1029, 827, 8849, 9337, 0, 0, ..., 0]

The list has 160 entries, matching the default max_sentence_len: 5 word ids followed by 155 zeros of padding.