# Label encoding in Python
Goals:
- scikit-learn style `fit`/`transform` methods that encode the labels of the categorical features of `X`
- should handle labels that were not seen during fitting
- should be faster than running a label encoder manually for each fold and checking by hand whether each label was already seen in the training data (see https://stackoverflow.com/questions/45727934/pandas-categories-new-levels?noredirect=1#comment78424496_45727934 which links to https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce)
- only some columns are categorical, and only these should be converted
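
As a rough illustration of these goals, here is a minimal pandas-only sketch. The class name `CategoryCodeEncoder` and the choice to map unseen labels to `-1` are assumptions made for this example, not an existing scikit-learn or pandas API.

```python
import pandas as pd


class CategoryCodeEncoder:
    """Hypothetical helper: integer-encode only the categorical columns."""

    def fit(self, X):
        # Remember the categories seen during training, per categorical column.
        self.categories_ = {
            col: X[col].cat.categories
            for col in X.select_dtypes(include="category").columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, cats in self.categories_.items():
            # Look each label up in the training categories;
            # Index.get_indexer returns -1 for labels that were never seen.
            X[col] = cats.get_indexer(X[col].astype(object))
        return X


train = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "a", "a"]})
train["B"] = train["B"].astype("category")
test = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "a", "c"]})
test["B"] = test["B"].astype("category")

enc = CategoryCodeEncoder().fit(train)
encoded = enc.transform(test)
print(encoded)
```

The numeric column `A` is left untouched, and the label `c`, which only occurs in the second frame, is encoded as `-1` instead of raising an error.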

In [1]:
import pandas as pd

In [15]:
df = pd.DataFrame({'A':[1,2,3,4],
                  'B':['a', 'b','a','a']}) 
df['B'] = df['B'].astype('category')
df

|   | A | B |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | a |
| 3 | 4 | a |


In [17]:
df2 = pd.DataFrame({'A':[1,2,3,4],
                  'B':['a', 'b','a','c']}) 
df2['B'] = df2['B'].astype('category')
df2

|   | A | B |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | a |
| 3 | 4 | c |


### Label binarizer
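> FAILS when fitted on the whole DataFrame: it expects a single 1-D column of labels (string labels themselves are accepted)

In [7]:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
# lb.fit(df)
# FAILS: LabelBinarizer takes one 1-D label column (strings or integers), not a multi-column DataFrame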
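
For comparison, a minimal sketch suggesting that `LabelBinarizer` does work on string labels when it is given a single column rather than the whole frame. The variable names `labels` and `onehot` are chosen for this example only, and the values mirror column `B` of `df2`.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Same values as column B of df2 above.
df2 = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "a", "c"]})

# Pass one column of labels as a plain list, not the DataFrame itself.
labels = df2["B"].tolist()

lb = LabelBinarizer()
onehot = lb.fit_transform(labels)

print(lb.classes_)  # the distinct labels found during fit
print(onehot)       # one indicator column per label
```

This still has to be applied column by column, so on its own it does not meet the goal of converting only the categorical columns in one step.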

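### DictVectorizer
> Does not immediately throw an error, but:
  - it processes every column it is given, so the numeric columns are passed through along with the categorical ones
  - it does not take a pandas DataFrame directly, only a list of key-value dicts (one per row)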
  
How could this be integrated into pandas?

In [28]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
df_dict = df.to_dict(orient='records')
print(df_dict)  # the numeric column df['A'] is included too, although it needs no encoding
dv.fit(df_dict)
dv.transform(df_dict)

[{'A': 1, 'B': 'a'}, {'A': 2, 'B': 'b'}, {'A': 3, 'B': 'a'}, {'A': 4, 'B': 'a'}]


array([[ 1.,  1.,  0.],
       [ 2.,  0.,  1.],
       [ 3.,  1.,  0.],
       [ 4.,  1.,  0.]])

In [18]:
dv.transform(df2.to_dict(orient='records'))

array([[ 1.,  1.,  0.],
       [ 2.,  0.,  1.],
       [ 3.,  1.,  0.],
       [ 4.,  0.,  0.]])
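
One possible way to integrate this into pandas, sketched under the assumption that the categorical column names are known up front (`cat_cols` and the helper `encode` are hypothetical names for this example): fit the `DictVectorizer` on the categorical columns only, then join the result back onto the remaining columns.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

train = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "a", "a"]})
test = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a", "b", "a", "c"]})

cat_cols = ["B"]  # assumption: the categorical columns are listed by hand

dv = DictVectorizer(sparse=False)
# Fit on the categorical columns only, one dict per row.
dv.fit(train[cat_cols].astype(object).to_dict(orient="records"))


def encode(frame):
    """Hypothetical helper: one-hot encode cat_cols, keep other columns as-is."""
    records = frame[cat_cols].astype(object).to_dict(orient="records")
    onehot = pd.DataFrame(
        dv.transform(records),
        columns=dv.feature_names_,
        index=frame.index,
    )
    return pd.concat([frame.drop(columns=cat_cols), onehot], axis=1)


encoded = encode(test)
print(encoded)
```

As in the array output above, a label that was not present during `fit` (here `c`) ends up with zeros in all indicator columns rather than raising an error.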

### CountVectorizer
> FAILS: expects an iterable of raw text documents (strings) from which it builds a token vocabulary, not a DataFrame

In [32]:
#from sklearn.feature_extraction.text import CountVectorizer
#cv = CountVectorizer()
#cv.fit(df)