<a href="https://colab.research.google.com/github/dk-wei/ml-algo-implementation/blob/main/CountVectorizer%2BLogistic_Regression%2BELI5%E8%AE%B2%E8%A7%A3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

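`CountVectorizer` turns each `document` into a vector of token counts (a bag-of-words encoding; with `binary=True` it becomes a one-hot-style presence indicator). It is an important tool both for NLP problems and for datasets with many categorical features.

This notebook covers two main topics:
- The parameters of `CountVectorizer`
- Using a custom `tokenizer` to split each document (without breaking up compound words) and to strip punctuation

We compare the results under different parameter settings.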

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import string
import pandas as pd

In [None]:
# Build our text
corpus = [
     'This is the first document.',
     'This document is the second-document.',
     'And this is the third-one.',
     'Is this the first_document?',
 ]

## Default `CountVectorizer`

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

In [None]:
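# The default tokenizer strips the punctuation around each token,
# but it also splits hyphenated compounds such as 'second-document'.
# Note: in scikit-learn >= 1.0, use get_feature_names_out() instead.
print(vectorizer.get_feature_names())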

['and', 'document', 'first', 'first_document', 'is', 'one', 'second', 'the', 'third', 'this']
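
The split of `second-document` into `second` and `document`, while `first_document` stays whole, follows from the default `token_pattern`, which scikit-learn documents as `(?u)\b\w\w+\b`: a hyphen is not a word character, an underscore is. A minimal sketch with the standard `re` module, assuming that pattern plus lowercasing (the helper name `default_like_tokenize` is made up for illustration and is not part of scikit-learn):

```python
import re

# Assumption: this mirrors CountVectorizer's documented default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokenize(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

# The hyphen acts as a boundary, so the compound is split in two.
print(default_like_tokenize('This document is the second-document.'))
# The underscore counts as a word character, so the compound survives.
print(default_like_tokenize('Is this the first_document?'))
```

Single-character tokens are dropped by this pattern as well, because it requires at least two word characters.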


In [None]:
print(X.toarray())

[[0 1 1 0 1 0 0 1 0 1]
 [0 2 0 0 1 0 1 1 0 1]
 [1 0 0 0 1 1 0 1 1 1]
 [0 0 0 1 1 0 0 1 0 1]]
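
To make the matrix above less of a black box, here is a plain-Python sketch that rebuilds the same counts by hand. It assumes the default token pattern `(?u)\b\w\w+\b` and an alphabetically sorted vocabulary; it illustrates the idea and is not scikit-learn's actual implementation.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second-document.',
    'And this is the third-one.',
    'Is this the first_document?',
]

# Assumption: same regex as CountVectorizer's documented default token_pattern.
token_re = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_re.findall(doc.lower()) for doc in corpus]

# Vocabulary: every distinct token, sorted, so each token gets a fixed column.
vocab = sorted({tok for doc in tokenized for tok in doc})

# One row per document: how many times each vocabulary token occurs in it.
rows = []
for doc in tokenized:
    counts = Counter(doc)
    rows.append([counts[tok] for tok in vocab])

print(vocab)
for row in rows:
    print(row)
```

The `2` in the second row is the token `document`, which appears once on its own and once as the second half of `second-document`.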


## N-gram `CountVectorizer`

In [None]:
vectorizer2 = CountVectorizer(analyzer='word', 
                              ngram_range=(2, 2)
                              )

X2 = vectorizer2.fit_transform(corpus)

In [None]:
print(vectorizer2.get_feature_names())

['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the first_document', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']
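
With `ngram_range=(2, 2)` every feature is a pair of adjacent tokens. The sketch below shows how such pairs can be formed, assuming the default token pattern is applied first (the function name `word_bigrams` is hypothetical):

```python
import re

# Assumption: same regex as CountVectorizer's documented default token_pattern.
token_re = re.compile(r"(?u)\b\w\w+\b")

def word_bigrams(text):
    """Return space-joined pairs of adjacent tokens."""
    tokens = token_re.findall(text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(word_bigrams('This document is the second-document.'))
```

Because the hyphen was already split away during tokenization, `second document` shows up as an ordinary bigram.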


In [None]:
print(X2.toarray())

[[0 0 1 1 0 0 1 0 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 0 1 1 0 1 0]
 [0 0 0 0 1 0 0 1 0 0 0 0 0 1]]


## `CountVectorizer` with new `tokenizer`

In [None]:
def tokenizer_splitter(s):
  '''
  Split on spaces, then strip punctuation from both ends of each token.
  '''
  return [i.strip(string.punctuation) for i in s.split(' ')]
   
vectorizer3 = CountVectorizer(analyzer='word', 
                              ngram_range=(1, 1),
                              stop_words = ['is'],
                              binary = False,
                              lowercase = True,
                              #tokenizer = lambda x: x.split(" "),
                              tokenizer = tokenizer_splitter
                              )

In [None]:
X3 = vectorizer3.fit_transform(corpus)

In [None]:
# Compound words are no longer split, and the stop word 'is' is gone
print(vectorizer3.get_feature_names())

['and', 'document', 'first', 'first_document', 'second-document', 'the', 'third-one', 'this']
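
One caveat when splitting on a single space: consecutive spaces, or a token made only of punctuation, leave an empty string after `strip`, so the list returned by the tokenizer then contains `''`. A possible variant, offered as a sketch (the name `tokenizer_splitter_nonempty` is hypothetical), splits on any whitespace and drops empty results:

```python
import string

def tokenizer_splitter_nonempty(s):
    """Split on whitespace, strip surrounding punctuation, drop empty tokens."""
    stripped = (tok.strip(string.punctuation) for tok in s.split())
    return [tok for tok in stripped if tok]

# Double space and a punctuation-only token no longer produce '' entries.
print(tokenizer_splitter_nonempty('Hello,  world -- second-document!'))
```

It could be passed as `tokenizer=tokenizer_splitter_nonempty` in the same way as the function above.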


In [None]:
print(X3.toarray())

[[0 1 1 0 0 1 0 1]
 [0 1 0 0 1 1 0 1]
 [1 0 0 0 0 1 1 1]
 [0 0 0 1 0 1 0 1]]


In [None]:
print(vectorizer3.vocabulary_)   # vocabulary_ maps each token to its column index in the encoded matrix; useful for interpreting columns

{'this': 7, 'the': 5, 'first': 2, 'document': 1, 'second-document': 4, 'and': 0, 'third-one': 6, 'first_document': 3}


In [None]:
corpus

['This is the first document.',
 'This document is the second-document.',
 'And this is the third-one.',
 'Is this the first_document?']

In [None]:
X3.todense()

matrix([[0, 1, 1, 0, 0, 1, 0, 1],
        [0, 1, 0, 0, 1, 1, 0, 1],
        [1, 0, 0, 0, 0, 1, 1, 1],
        [0, 0, 0, 1, 0, 1, 0, 1]])

## Inside `CountVectorizer`

The indices stored in `vectorizer3.vocabulary_` match the positions of the tokens in the list returned by `vectorizer3.get_feature_names()`.

In [None]:
df_cvec = pd.DataFrame(X3.todense(), columns=vectorizer3.get_feature_names())
print(df_cvec.shape)
df_cvec.head()

(4, 8)


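| | and | document | first | first_document | second-document | the | third-one | this |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |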


In [None]:
vectorizer3.vocabulary_

{'and': 0,
 'document': 1,
 'first': 2,
 'first_document': 3,
 'second-document': 4,
 'the': 5,
 'third-one': 6,
 'this': 7}

In [None]:
vectorizer3.get_feature_names()

['and',
 'document',
 'first',
 'first_document',
 'second-document',
 'the',
 'third-one',
 'this']
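
The correspondence between the two outputs can be checked directly. The sketch below copies the `vocabulary_` dictionary printed earlier and sorts its tokens by index, which should reproduce the feature-name order; it uses only built-in Python and does not call scikit-learn.

```python
# Copied from the vocabulary_ output shown earlier in this notebook.
vocabulary = {'this': 7, 'the': 5, 'first': 2, 'document': 1,
              'second-document': 4, 'and': 0, 'third-one': 6, 'first_document': 3}

# Sort tokens by their column index to recover the feature-name order.
feature_names = [tok for tok, idx in sorted(vocabulary.items(), key=lambda kv: kv[1])]

# Every token's index is exactly its position in the list.
assert all(feature_names[idx] == tok for tok, idx in vocabulary.items())
print(feature_names)
```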

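## Combining with Logistic Regression

Code source: https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/nyt-takata-airbags/notebooks/Airbag%20classifier%20search%20(CountVectorizer).ipynb#scrollTo=jrT5yB7CG7dv

Combining `CountVectorizer` with Logistic Regression lets us read each token's learned weight (coefficient) as a measure of its importance: a large positive weight pushes a complaint toward the suspicious class, and a large negative weight pushes it away.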

In [None]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/nyt-takata-airbags/data/sampled-labeled.csv -P data

File ‘data/sampled-labeled.csv’ already there; not retrieving.



In [None]:
import pandas as pd

# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

In [None]:
labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()

| | is_suspicious | CDESCR |
|---|---|---|
| 0 | 0.0 | ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T... |
| 1 | 0.0 | CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I... |
| 2 | 0.0 | DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT |
| 3 | 0.0 | TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI... |
| 4 | 0.0 | THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK |


In [None]:
labeled = labeled.dropna()

In [None]:
labeled.is_suspicious.value_counts()

0.0    150
1.0     15
Name: is_suspicious, dtype: int64

In [None]:
train_df = pd.DataFrame({
    'is_suspicious': labeled.is_suspicious,
    'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()

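| | is_suspicious | airbag | air bag | failed | did not deploy | violent | explode | shrapnel |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |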


In [None]:
import string
from sklearn.feature_extraction.text import CountVectorizer

def tokenizer_splitter(s):
  '''
  Split on spaces, then strip punctuation from both ends of each token.
  '''
  return [i.strip(string.punctuation) for i in s.split(' ')]

vectorizer = CountVectorizer(binary=True,
                             tokenizer = tokenizer_splitter)

vectors = vectorizer.fit_transform(labeled.CDESCR)
vectors

<165x2403 sparse matrix of type '<class 'numpy.int64'>'
	with 9382 stored elements in Compressed Sparse Row format>
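
With `binary=True` the matrix records only whether a token occurs in a document, not how often. As a small illustration, the sketch below takes the count row of the second toy document from the start of this notebook and clips it to presence indicators by hand:

```python
# Count row for 'This document is the second-document.' under the default vectorizer.
count_row = [0, 2, 0, 0, 1, 0, 1, 1, 0, 1]

# binary=True keeps only presence: any non-zero count becomes 1.
binary_row = [1 if c > 0 else 0 for c in count_row]

print(binary_row)
```

The only entry that changes is the `2` for `document`, which becomes `1`.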

In [None]:
vectors.toarray()
#vectors.todense()

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

In [None]:
vectorizer = CountVectorizer(binary=True)

vectors = vectorizer.fit_transform(labeled.CDESCR)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()

Unnamed: 0,00,000,01,01v347000,02,02v105000,02v146000,03,03v455000,04,05,05v395000,06,07,08,08v303000,09,10,1000,10017,11,12,128,12th,13,136,13v136000,14,1420,15,150,15pm,16,160lbs,17,180,1996,1997,1998,1999,1st,20,2000,2001,2002,2003,2004,2005,2006,2007,...,window,windows,windshield,wiper,wipers,wires,wiring,wished,with,within,without,withstand,witnesses,won,wonder,woosh,word,work,working,works,worn,worse,worsened,worst,worth,would,wouldn,wrangler,wreck,wrecks,wrist,write,writes,writing,written,wrong,xterra,xxx,yards,yc,year,years,yes,yet,yield,york,you,your,zero,zone
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**Big secret:** The "fit" part of .fit_transform means "learn the words." The "transform" part means "count them."
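
The split between the two steps can be sketched with a toy class. `TinyCountVectorizer` is a hypothetical stand-in written for illustration only; it assumes the default token pattern and is not how scikit-learn implements `CountVectorizer`.

```python
import re
from collections import Counter

class TinyCountVectorizer:
    """Toy stand-in for illustration only; not scikit-learn's implementation."""

    _token_re = re.compile(r"(?u)\b\w\w+\b")

    def _tokenize(self, text):
        return self._token_re.findall(text.lower())

    def fit(self, docs):
        # "Learn the words": assign each distinct token a column index.
        tokens = sorted({tok for doc in docs for tok in self._tokenize(doc)})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, docs):
        # "Count them": count only tokens already in the learned vocabulary.
        rows = []
        for doc in docs:
            counts = Counter(self._tokenize(doc))
            row = [0] * len(self.vocabulary_)
            for tok, n in counts.items():
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] = n
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

vec = TinyCountVectorizer()
train_rows = vec.fit_transform(["the airbag deployed", "the brakes failed"])
new_rows = vec.transform(["the airbag exploded"])  # 'exploded' was never learned
print(vec.vocabulary_)
print(new_rows)
```

Calling `transform` on new text reuses the learned columns, so words that were not seen during `fit` are simply ignored.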

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [None]:
X = words_df
y = labeled.is_suspicious

#clf = RandomForestClassifier(n_estimators=100)
clf = LogisticRegression(C=1e9, solver='lbfgs')

clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
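# Note: these predictions are made on the same data the model was trained on,
# so the perfect score below says nothing about generalization.
y_true = y
y_pred = clf.predict(X)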

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

| | Predicted not suspicious | Predicted suspicious |
|---|---|---|
| Is not suspicious | 150 | 0 |
| Is suspicious | 0 | 15 |


In [None]:
clf.coef_[0]

array([-0.49907654, -0.34293928, -0.0205122 , ..., -0.0670116 ,
       -0.14687349,  1.12410877])

In [None]:
# Invert vocabulary_ (token -> column index) into a column index -> token lookup
token_dict = {i[1]: i[0] for i in vectorizer.vocabulary_.items()}

In [None]:
global_keywords = {}

for i in range(len(clf.coef_[0])):
    global_keywords[token_dict[i]] = clf.coef_[0][i]

In [None]:
# Top 10 tokens with the largest positive coefficients
sorted(global_keywords.items(), key=lambda x: x[1], reverse=True)[:10]

[('deployed', 3.8441308751839194),
 ('face', 2.693390867658786),
 ('passenger', 2.5769032726993832),
 ('driver', 2.511533050105581),
 ('problem', 2.2668444791717164),
 ('degree', 2.230653547347673),
 ('both', 1.9928532394265508),
 ('1st', 1.8729197559403907),
 ('hands', 1.8705195988041483),
 ('burns', 1.7500262508153124)]
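
A logistic regression coefficient is a change in log-odds, so for a binary feature like these, `exp(coefficient)` is the factor by which the odds of the positive class are multiplied when the token is present, all other features held fixed. A short sketch using two of the values printed above:

```python
import math

# Coefficients copied from the top-10 list above.
coefficients = {'deployed': 3.8441308751839194, 'burns': 1.7500262508153124}

odds_ratios = {token: math.exp(coef) for token, coef in coefficients.items()}

for token, ratio in odds_ratios.items():
    print(f"{token}: odds multiplied by about {ratio:.1f}")
```

These numbers should be read with caution: with `C=1e9` the model is almost unregularized, and there are far more tokens than documents, so individual weights can be unstable.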

In [None]:
# Uncomment to install eli5 if it is not available:
# !pip install eli5

In [None]:
import eli5
feature_names = list(X.columns)

# Alternative: show all output fields, including eli5's notes on how to interpret these weights
# eli5.show_weights(clf, feature_names=feature_names, show=eli5.formatters.fields.ALL)
eli5.show_weights(clf, feature_names=feature_names)



| Weight? | Feature |
|---|---|
| +3.844 | deployed |
| +2.693 | face |
| +2.577 | passenger |
| +2.512 | driver |
| +2.267 | problem |
| +2.231 | degree |
| +1.993 | both |
| +1.873 | 1st |
| +1.871 | hands |
| +1.750 | burns |


The eli5 output agrees with the logistic regression results: both show the model's weights.

### `RandomForestClassifier`

In [None]:
from sklearn.model_selection import train_test_split

X = words_df
y = labeled.is_suspicious

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y)
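
`stratify=y` asks the split to keep the class proportions roughly the same in the training and test sets, which matters here because only 15 of the 165 rows are suspicious. The sketch below imitates the idea in plain Python. The function `stratified_test_indices`, its rounding rule, and the seed are all assumptions made for illustration; scikit-learn's own allocation logic differs in detail. The 25% test share matches the default of `train_test_split` when no size is given.

```python
import random

def stratified_test_indices(labels, test_fraction=0.25, seed=0):
    """Pick test indices so each class keeps roughly its overall share."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    test = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = int(len(indices) * test_fraction + 0.5)  # round half up
        test.extend(indices[:n_test])
    return sorted(test)

# Same class counts as labeled.is_suspicious.value_counts() above.
labels = [0.0] * 150 + [1.0] * 15
test_idx = stratified_test_indices(labels)
n_pos = sum(labels[i] == 1.0 for i in test_idx)
print(len(test_idx), n_pos)
```

Without stratification, a random 25% sample could easily contain only one or two suspicious rows, or none at all.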

In [None]:
clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [None]:
y_true = y_test
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

| | Predicted not suspicious | Predicted suspicious |
|---|---|---|
| Is not suspicious | 38 | 0 |
| Is suspicious | 4 | 0 |
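
This confusion matrix is a reminder of why accuracy alone misleads on imbalanced data: the model never predicts "suspicious", yet most of its predictions are still correct. A short sketch of the arithmetic, using the four numbers from the table above:

```python
# Rows are actual classes, columns are predicted classes.
matrix = [[38, 0],   # actual not suspicious
          [4, 0]]    # actual suspicious

tn, fp = matrix[0]
fn, tp = matrix[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of truly suspicious rows that were caught

print(f"accuracy = {accuracy:.3f}, recall on 'suspicious' = {recall:.3f}")
```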


In [None]:
eli5.show_weights(clf, feature_names=feature_names)

| Weight | Feature |
|---|---|
| 0.0230 ± 0.1616 | aware |
| 0.0148 ± 0.1036 | burning |
| 0.0134 ± 0.1048 | apart |
| 0.0131 ± 0.0966 | month |
| 0.0127 ± 0.0934 | pulling |
| 0.0122 ± 0.0886 | chin |
| 0.0118 ± 0.0822 | nerves |
| 0.0110 ± 0.0753 | prior |
| 0.0109 ± 0.1085 | lot |
| 0.0104 ± 0.0717 | turned |


In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs')

clf.fit(X_train, y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
y_true = y_test
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

| | Predicted not suspicious | Predicted suspicious |
|---|---|---|
| Is not suspicious | 38 | 0 |
| Is suspicious | 4 | 0 |


In [None]:
eli5.show_weights(clf, feature_names=feature_names, target_names=['not suspicious', 'suspicious'])

| Weight? | Feature |
|---|---|
| +2.408 | deployed |
| +2.236 | was |
| +1.914 | problem |
| +1.806 | aware |
| +1.696 | broke |
| +1.689 | pulling |
| +1.668 | lot |
| +1.630 | fire |
| +1.629 | inside |
| +1.628 | both |
