<a href="https://colab.research.google.com/github/dk-wei/ml-algo-implementation/blob/main/CountVectorizer%E8%AE%B2%E8%A7%A3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

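`CountVectorizer` converts each `document` into a vector of token counts (a bag-of-words encoding; with `binary=True` it becomes a one-hot-style presence vector). It is useful both for NLP problems and for features that hold several categorical values at once.

This notebook covers two things:
- The main parameters of `CountVectorizer`
- A custom `tokenizer` that splits each document without breaking up compound words, and strips punctuation

We compare the results under different parameter settings.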

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import string

In [196]:
# Build our text
corpus = [
     'This is the first document.',
     'This document is the second-document.',
     'And this is the third-one.',
     'Is this the first_document?',
 ]

## Default `CountVectorizer`

In [200]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

In [201]:
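# The default tokenizer drops punctuation, which also splits hyphenated words such as 'second-document'
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())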

['and', 'document', 'first', 'first_document', 'is', 'one', 'second', 'the', 'third', 'this']
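The feature names above follow from the default `token_pattern`, which scikit-learn documents as `(?u)\b\w\w+\b`: a token is a run of two or more word characters, and everything else acts as a separator. The sketch below applies that pattern with the standard `re` module to show why `first_document` stays whole (the underscore counts as a word character) while `second-document` is split. The helper name `default_like_tokenize` is hypothetical, and this is an approximation of the default analyzer, not the library's own code.

```python
import re

# Documented default value of CountVectorizer's token_pattern parameter
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_like_tokenize(text):
    """Approximate the default analyzer: lowercase, then keep regex matches."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

# The hyphen is not a word character, so the compound is split in two
print(default_like_tokenize("This document is the second-document."))
# ['this', 'document', 'is', 'the', 'second', 'document']

# The underscore is a word character, so the compound survives
print(default_like_tokenize("Is this the first_document?"))
# ['is', 'this', 'the', 'first_document']

# Tokens shorter than two characters are dropped
print(default_like_tokenize("a b cd"))
# ['cd']
```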


In [202]:
print(X.toarray())

[[0 1 1 0 1 0 0 1 0 1]
 [0 2 0 0 1 0 1 1 0 1]
 [1 0 0 0 1 1 0 1 1 1]
 [0 0 0 1 1 0 0 1 0 1]]
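Note the `2` in the second row: `document` appears twice there, so the entries are counts rather than 0/1 flags. The sketch below rebuilds the same matrix in plain Python to make the construction explicit (sorted vocabulary, one column per token), and shows what a `binary=True`-style encoding would look like. The function name `count_matrix` is hypothetical, and this is an illustration under the default tokenization, not scikit-learn's implementation.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second-document.',
    'And this is the third-one.',
    'Is this the first_document?',
]

def tokenize(text):
    """Lowercase and keep runs of two or more word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def count_matrix(docs, binary=False):
    """Return (sorted vocabulary, list of rows) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)               # missing tokens count as 0
        row = [counts[tok] for tok in vocab]
        if binary:
            row = [min(c, 1) for c in row]   # presence/absence only
        rows.append(row)
    return vocab, rows

vocab, counts = count_matrix(corpus)
_, flags = count_matrix(corpus, binary=True)
print(vocab)
print(counts[1])  # [0, 2, 0, 0, 1, 0, 1, 1, 0, 1]
print(flags[1])   # [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
```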


## N-gram `CountVectorizer`

In [203]:
vectorizer2 = CountVectorizer(analyzer='word', 
                              ngram_range=(2, 2)
                              )

X2 = vectorizer2.fit_transform(corpus)

In [205]:
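# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer2.get_feature_names_out().tolist())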

['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the first_document', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']


In [206]:
print(X2.toarray())

[[0 0 1 1 0 0 1 0 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 0 1 1 0 1 0]
 [0 0 0 0 1 0 0 1 0 0 0 0 0 1]]
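With `ngram_range=(2, 2)` every feature is a pair of adjacent tokens joined by a space. A minimal sketch of that sliding window is shown below; the helper names `word_ngrams` and `word_ngrams_range` are hypothetical and the code only illustrates the idea, it is not taken from scikit-learn.

```python
def word_ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def word_ngrams_range(tokens, min_n, max_n):
    """Collect n-grams for every n in [min_n, max_n], like ngram_range."""
    result = []
    for n in range(min_n, max_n + 1):
        result.extend(word_ngrams(tokens, n))
    return result

# Tokens of the fourth document under the default tokenization
tokens = ["is", "this", "the", "first_document"]

print(word_ngrams(tokens, 2))
# ['is this', 'this the', 'the first_document']

print(word_ngrams_range(tokens, 1, 2))
# ['is', 'this', 'the', 'first_document', 'is this', 'this the', 'the first_document']
```

A document with four tokens yields three bigrams, which matches the three non-zero entries in the last row of the matrix above.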


## `CountVectorizer` with new `tokenizer`

In [192]:
def tokenizer_splitter(s):
    """Split on whitespace, then strip punctuation from both ends of each token."""
    tokens = (t.strip(string.punctuation) for t in s.split())
    return [t for t in tokens if t]  # drop tokens that were pure punctuation

vectorizer3 = CountVectorizer(analyzer='word',
                              ngram_range=(1, 1),
                              stop_words=['is'],
                              binary=False,
                              lowercase=True,
                              # tokenizer=lambda x: x.split(" "),  # naive alternative: keeps the punctuation
                              tokenizer=tokenizer_splitter,
                              )

In [193]:
X3 = vectorizer3.fit_transform(corpus)

In [207]:
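# Compound words are no longer split
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer3.get_feature_names_out().tolist())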

['and', 'document', 'first', 'first_document', 'second-document', 'the', 'third-one', 'this']
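To see where these features come from, it helps to trace one document through the steps by hand: the text is lowercased first, then passed to the custom tokenizer, and finally the stop words are filtered out. The sketch below imitates that order in plain Python. The names `split_and_strip` and `analyze` are hypothetical stand-ins, and the code is an approximation of the pipeline rather than scikit-learn's own.

```python
import string

def split_and_strip(text):
    """Split on whitespace and strip punctuation from both ends of each token."""
    tokens = (t.strip(string.punctuation) for t in text.split())
    return [t for t in tokens if t]  # drop tokens that were pure punctuation

def analyze(text, stop_words=frozenset({"is"})):
    """Lowercase, tokenize, then remove stop words (in that order)."""
    tokens = split_and_strip(text.lower())
    return [t for t in tokens if t not in stop_words]

print(analyze("This document is the second-document."))
# ['this', 'document', 'the', 'second-document']

# 'Is' is lowercased before the stop-word check, so it is removed too
print(analyze("Is this the first_document?"))
# ['this', 'the', 'first_document']
```

Because only the ends of each token are stripped, the hyphen and underscore inside a compound word are left alone.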


In [195]:
print(X3.toarray())

[[0 1 1 0 0 1 0 1]
 [0 1 0 0 1 1 0 1]
 [1 0 0 0 0 1 1 1]
 [0 0 0 1 0 1 0 1]]


In [211]:
print(vectorizer3.vocabulary_)   # vocabulary_ maps each token to its column index in the encoded matrix

{'this': 7, 'the': 5, 'first': 2, 'document': 1, 'second-document': 4, 'and': 0, 'third-one': 6, 'first_document': 3}
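One practical use of this mapping is decoding a row of the matrix back into tokens. The sketch below inverts the dictionary printed above and reads off the tokens present in the second row of `X3`; the variable names are illustrative, and the values are copied from the outputs shown in this notebook.

```python
# Mapping copied from the vocabulary_ output above
vocabulary = {'this': 7, 'the': 5, 'first': 2, 'document': 1,
              'second-document': 4, 'and': 0, 'third-one': 6,
              'first_document': 3}

# Invert it: column index -> token
index_to_token = {idx: tok for tok, idx in vocabulary.items()}

# Second row of X3.toarray(), copied from the output above
row = [0, 1, 0, 0, 1, 1, 0, 1]

present = [index_to_token[i] for i, count in enumerate(row) if count > 0]
print(present)
# ['document', 'second-document', 'the', 'this']
```

The decoded tokens come back in column (alphabetical) order, not in their original order in the sentence, because a bag-of-words encoding discards word positions.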


In [212]:
corpus

['This is the first document.',
 'This document is the second-document.',
 'And this is the third-one.',
 'Is this the first_document?']