<a id=0></a>
# 7. Categorical Features
Handling categorical features (variables)

---
### [1. LabelEncoder()](#1)
### [2. get_dummies()](#2)
### [3. OneHotEncoder()](#3)
### [4. Differences between pd.get_dummies() and OneHotEncoder()](#4)
### [5. Using the Series str accessor](#5)

---

The dataset used throughout is sample1_without_index.csv.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [29]:
df = pd.read_csv('./sample1_without_index.csv')
df.head()

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Score,Difference,Color,Shape
0,1997-07-05,2291,25,2.94665,5.305868,45.8933,52.762659,0.276266,green,triangle
1,1997-07-06,506,16,1.915208,0.679004,50.611735,31.453719,-1.854628,blue,
2,1997-07-07,9629,32,7.869855,6.563335,43.830416,56.239011,0.623901,blue,square
3,1997-07-08,6161,67,6.375209,5.756029,41.358007,61.453113,1.145311,green,square
4,,8570,55,0.390629,3.578136,55.739709,,1.03719,red,square


In [30]:
df = df[['Color', 'Shape']]

In [31]:
df.head()

Unnamed: 0,Color,Shape
0,green,triangle
1,blue,
2,blue,square
3,green,square
4,red,square


In [32]:
df.isnull().sum()

Color    4
Shape    5
dtype: int64

In [33]:
df[df['Color'].isnull()].index

Int64Index([19, 37, 40, 73], dtype='int64')

---
<a id=1></a>
[Back to top](#0)

---
## 1. LabelEncoder()  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html  
Note: replaces each label with an integer (0, 1, 2, ...)

In [34]:
from sklearn import preprocessing

In [35]:
df_ce = df.copy()
le = preprocessing.LabelEncoder()
df_ce['Color_encoded'] = le.fit_transform(df['Color'])


In [36]:
le = preprocessing.LabelEncoder()
df_ce['Shape_encoded'] = le.fit_transform(df['Shape'])

In [37]:
df_ce

Unnamed: 0,Color,Shape,Color_encoded,Shape_encoded
0,green,triangle,1,2
1,blue,,0,3
2,blue,square,0,1
3,green,square,1,1
4,red,square,2,1
...,...,...,...,...
95,blue,circle,0,0
96,red,triangle,2,2
97,blue,square,0,1
98,blue,circle,0,0
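
As a small illustration of the mapping, the sketch below fits one LabelEncoder per column on a toy DataFrame (made up here, not read from the CSV). It relies on classes being numbered in sorted order, which is consistent with the output above (blue=0, green=1, red=2). The notebook reuses the name `le` for both columns, so the Color mapping is overwritten; the `encoders` dict is a hypothetical way to keep each mapping so that `inverse_transform` stays available. Note also that the scikit-learn documentation describes LabelEncoder as intended for target labels, and OrdinalEncoder is the usual alternative for input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (not the notebook's CSV).
toy = pd.DataFrame({
    'Color': ['green', 'blue', 'blue', 'green', 'red'],
    'Shape': ['triangle', 'circle', 'square', 'square', 'square'],
})

encoders = {}                      # hypothetical: one fitted encoder per column
encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col + '_encoded'] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the labels in the order of their integer codes.
color_classes = list(encoders['Color'].classes_)

# Each stored encoder can map its codes back to the original labels.
restored = list(encoders['Color'].inverse_transform(encoded['Color_encoded']))

print(color_classes)
print(encoded)
```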


---
<a id=2></a>
[Back to top](#0)

---
## 2. get_dummies()  
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html  
Note: converts a categorical variable into dummy variables (0 or 1)

* Creating dummy variables
* What drop_first=True does
* What happens to np.nan
---

Creating dummy variables

In [38]:
pd.get_dummies(df['Color']).head()

Unnamed: 0,blue,green,red
0,0,1,0
1,1,0,0
2,1,0,0
3,0,1,0
4,0,0,1


What drop_first=True does  

In [39]:
pd.get_dummies(df['Color'], drop_first=True).head()

Unnamed: 0,green,red
0,1,0
1,0,0
2,0,0
3,1,0
4,0,1
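
The comparison above can be reproduced on a toy Series (made up for illustration). The sketch shows that `drop_first=True` removes the column of the first category in sorted order, here `blue`, so that category is represented by a row of all zeros. The `.astype(int)` call is only there because, depending on the pandas version, `get_dummies` may return boolean rather than 0/1 columns.

```python
import pandas as pd

# Toy data for illustration only.
s = pd.Series(['green', 'blue', 'blue', 'green', 'red'], name='Color')

full = pd.get_dummies(s).astype(int)                      # one column per category
reduced = pd.get_dummies(s, drop_first=True).astype(int)  # first category dropped

print(full)
print(reduced)

# In the full encoding every row has exactly one 1.
# In the reduced encoding the dropped category ('blue') is the all-zero row.
blue_row = reduced.iloc[1]
```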


In [40]:
df_cd = pd.get_dummies(df['Color'], drop_first=True)

In [41]:
pd.get_dummies(df, columns=['Color', 'Shape'], drop_first=True)

Unnamed: 0,Color_green,Color_red,Shape_square,Shape_triangle
0,1,0,0,1
1,0,0,0,0
2,0,0,1,0
3,1,0,1,0
4,0,1,1,0
...,...,...,...,...
95,0,0,0,0
96,0,1,0,1
97,0,0,1,0
98,0,0,0,0


What happens to np.nan

In [42]:
df_cd.isnull().sum()

green    0
red      0
dtype: int64

In [44]:
df_cd = pd.get_dummies(df, columns=['Color'], drop_first=True, dummy_na=True)
df_cd.loc[36:42]

Unnamed: 0,Shape,Color_green,Color_red,Color_nan
36,square,1,0,0
37,circle,0,0,1
38,square,0,0,0
39,square,0,0,0
40,triangle,0,0,1
41,,0,1,0
42,square,1,0,0
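
A minimal sketch of the same behaviour on a toy Series (made up for illustration): without `dummy_na`, a missing value becomes a row of all zeros; with `dummy_na=True`, it gets its own indicator column. One consequence worth keeping in mind is that combining `drop_first=True` with the default `dummy_na=False` makes a missing value indistinguishable from the dropped baseline category.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; index 1 is missing.
s = pd.Series(['green', np.nan, 'blue', 'red'], name='Color')

without_na = pd.get_dummies(s).astype(int)
with_na = pd.get_dummies(s, dummy_na=True).astype(int)

print(without_na)   # the NaN row is all zeros
print(with_na)      # the NaN row has a 1 in the extra NaN column
```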


---
<a id=3></a>
[Back to top](#0)

---
## 3. OneHotEncoder()  
Note: one-hot means exactly one element is 1 and all others are 0  
Note: creates dummy variables using features that pd.get_dummies() does not have

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Default keyword arguments: drop=None, handle_unknown='error'

* Trying out OneHotEncoder()
* Transforming multiple features
---

Trying out OneHotEncoder()

In [45]:
from sklearn.preprocessing import OneHotEncoder

In [48]:
enc = OneHotEncoder()
enc.fit(df[['Color']])

OneHotEncoder()

In [49]:
enc.categories_

[array(['blue', 'green', 'red', nan], dtype=object)]

In [50]:
enc.transform(df[['Color']])

<100x4 sparse matrix of type '<class 'numpy.float64'>'
	with 100 stored elements in Compressed Sparse Row format>

In [51]:
enc.transform(df[['Color']]).toarray()[:5]

array([[0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.]])

Transforming multiple features

In [52]:
enc = OneHotEncoder()
enc.fit(df)

OneHotEncoder()

In [53]:
enc.categories_

[array(['blue', 'green', 'red', nan], dtype=object),
 array(['circle', 'square', 'triangle', nan], dtype=object)]

In [54]:
enc.transform(df).toarray()[:5]

array([[0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0.]])

In [56]:
enc.inverse_transform([[0, 1, 0, 0, 0, 1, 0, 0]])

array([['green', 'square']], dtype=object)

---
<a id=4></a>
[Back to top](#0)

---
## 4. Differences between pd.get_dummies() and OneHotEncoder()

* With get_dummies(), the training and test sets can end up with different columns
* The case of OneHotEncoder(handle_unknown='error', drop='first')
* The case of OneHotEncoder(handle_unknown='ignore')
---

With get_dummies(), the training and test sets can end up with different columns

In [57]:
np.random.seed(1)
s = pd.Series(np.random.choice([0, 1], len(df)), name='target')
s

0     1
1     1
2     0
3     0
4     1
     ..
95    0
96    1
97    1
98    0
99    1
Name: target, Length: 100, dtype: int32

In [58]:
df_new = pd.concat([df, s], axis=1)
df_new.head()

Unnamed: 0,Color,Shape,target
0,green,triangle,1
1,blue,,1
2,blue,square,0
3,green,square,0
4,red,square,1


In [59]:
from sklearn.model_selection import train_test_split

In [60]:
y = df_new.pop('target')
X = df_new

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, stratify=y, random_state=42)

In [65]:
X_test

Unnamed: 0,Color,Shape
5,green,triangle
45,blue,triangle
60,red,square
28,green,circle
82,blue,square


In [67]:
pd.get_dummies(X_train, drop_first=True, dummy_na=True).head()

Unnamed: 0,Color_green,Color_red,Color_nan,Shape_square,Shape_triangle,Shape_nan
44,0,0,0,0,1,0
2,0,0,0,1,0,0
91,0,1,0,0,0,0
94,0,1,0,1,0,0
29,1,0,0,0,1,0
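
The mismatch itself is easy to reproduce on toy data (made up for illustration): `get_dummies` only creates columns for the categories present in the frame it is given, so a test set that lacks a category yields fewer columns than the training set. Reindexing the test columns against the training columns, as sketched below, is one possible workaround and not something the notebook itself does.

```python
import pandas as pd

# Toy data for illustration only; 'red' never appears in the test set.
train = pd.DataFrame({'Color': ['green', 'blue', 'red', 'blue']})
test = pd.DataFrame({'Color': ['blue', 'green']})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

print(list(train_d.columns))   # three columns
print(list(test_d.columns))    # only two columns: shapes no longer match

# Possible workaround: force the test frame onto the training columns,
# filling columns that were absent in the test data with 0.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
```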


In [68]:
enc = OneHotEncoder(drop='first')
enc.fit_transform(X_train).toarray()[:5]

array([[0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.]])

In [69]:
enc.transform(X_test).toarray()

array([[1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.]])

The case of OneHotEncoder(handle_unknown='error', drop='first')

In [70]:
enc = OneHotEncoder(handle_unknown='error', drop='first')
enc.fit_transform(X_train).toarray()[:5]

array([[0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.]])

In [71]:
enc.transform(X_test).toarray()

array([[1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.]])

In [72]:
X_test_new = X_test.copy()
X_test_new.loc[16, 'Color'] = 'purple'
X_test_new

Unnamed: 0,Color,Shape
5,green,triangle
45,blue,triangle
60,red,square
28,green,circle
82,blue,square
16,purple,
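
The notebook builds `X_test_new` but does not show what the 'error' encoder does with it. As a sketch on toy data (made up for illustration), transforming a category that was not seen during `fit` is expected to raise a `ValueError` when `handle_unknown='error'`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only; 'purple' is absent from the training data.
train = pd.DataFrame({'Color': ['blue', 'green', 'red', 'blue']})
test_new = pd.DataFrame({'Color': ['green', 'purple']})

enc = OneHotEncoder(handle_unknown='error', drop='first')
enc.fit(train)

try:
    enc.transform(test_new)
    raised = False
except ValueError as exc:
    # The unknown category stops the transform instead of being encoded.
    raised = True
    print('transform failed:', exc)
```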


The case of OneHotEncoder(handle_unknown='ignore')

In [73]:
encoder_ignore = OneHotEncoder(handle_unknown='ignore')

In [74]:
encoder_ignore.fit(X_train)

OneHotEncoder(handle_unknown='ignore')

In [75]:
encoder_ignore.transform(X_test).toarray()

array([[0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0., 0.]])

In [76]:
encoder_ignore.transform(X_test_new).toarray()

array([[0., 1., 0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.]])
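
In the last row above, the four Color columns are all zero: that is how the unknown value 'purple' is represented. The sketch below repeats this on toy data (made up for illustration) and also shows that `inverse_transform` returns `None` for such an all-zero block, so the original value cannot be recovered.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only; 'purple' is absent from the training data.
train = pd.DataFrame({'Color': ['blue', 'green', 'red']})
test_new = pd.DataFrame({'Color': ['green', 'purple']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

out = enc.transform(test_new).toarray()
print(out)            # the 'purple' row is all zeros

# An all-zero row maps back to None rather than to a known category.
restored = enc.inverse_transform(out)
print(restored)
```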

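#### Choosing between them by situation (example)
* Few distinct categories, many records  
    => every category is expected to appear in the test data => get_dummies, OneHotEncoder(drop='first')
* Few distinct categories, few records  
    => the test data may lack some categories => OneHotEncoder(handle_unknown='error', drop='first')
* Many distinct categories, few records  
    => the test data will almost certainly contain values absent from the training data => OneHotEncoder(handle_unknown='ignore')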
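
To decide which of these situations applies, it can help to check directly whether the test data contains categories the training data lacks. The helper `unseen_categories` below is hypothetical (it is not part of pandas or scikit-learn) and is only a sketch on toy data made up for illustration.

```python
import pandas as pd


def unseen_categories(train, test):
    """Hypothetical helper: per column, list test values absent from train."""
    result = {}
    for col in train.columns:
        known = set(train[col].dropna())
        result[col] = sorted(set(test[col].dropna()) - known)
    return result


# Toy data for illustration only.
train = pd.DataFrame({'Color': ['blue', 'green', 'red'],
                      'Shape': ['circle', 'square', 'triangle']})
test = pd.DataFrame({'Color': ['green', 'purple'],
                     'Shape': ['square', 'circle']})

report = unseen_categories(train, test)
print(report)
```

A non-empty list for any column suggests that `handle_unknown='ignore'` (or some other explicit policy for new values) is needed for that column.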

---
<a id=5></a>
[Back to top](#0)

---
## 5. Using the Series str accessor

* What Series.str is
* A look at the methods
* Frequently used operations: replace, extract, split
---

What Series.str is

In [77]:
df = pd.DataFrame()
df['ID'] = ['A-123', 'B-456', 'A-789', 'B-123']
df['Color'] = ['py/white black', 'red green blue', 'py/yellow', 'purple white']
df

Unnamed: 0,ID,Color
0,A-123,py/white black
1,B-456,red green blue
2,A-789,py/yellow
3,B-123,purple white


In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      4 non-null      object
 1   Color   4 non-null      object
dtypes: object(2)
memory usage: 192.0+ bytes


In [79]:
df['ID'].str

<pandas.core.strings.accessor.StringMethods at 0x262c4caf4f0>

In [80]:
df['ID'].str[:3]

0    A-1
1    B-4
2    A-7
3    B-1
Name: ID, dtype: object

A look at the methods

In [81]:
df['ID'].str.lower()

0    a-123
1    b-456
2    a-789
3    b-123
Name: ID, dtype: object

In [82]:
df['ID'].str.startswith('B')

0    False
1     True
2    False
3     True
Name: ID, dtype: bool

In [83]:
df['Color'].str.contains('white')


0     True
1    False
2    False
3     True
Name: Color, dtype: bool

In [84]:
df['Color'].str.contains('ye|pu')


0    False
1    False
2     True
3     True
Name: Color, dtype: bool
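
The pattern `'ye|pu'` works as "ye or pu" because `str.contains` treats its argument as a regular expression by default. The sketch below contrasts this with `regex=False`, which searches for the literal text instead; the Series repeats the notebook's Color values.

```python
import pandas as pd

s = pd.Series(['py/white black', 'red green blue', 'py/yellow', 'purple white'])

# Default: the pattern is a regular expression, so '|' means "or".
regex_hits = s.str.contains('ye|pu')

# regex=False: look for the literal five-character string 'ye|pu'.
literal_hits = s.str.contains('ye|pu', regex=False)

print(regex_hits.tolist())
print(literal_hits.tolist())
```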

Frequently used operations: replace, extract, split

In [85]:
df['Color'].str.replace('black', 'gold')

0     py/white gold
1    red green blue
2         py/yellow
3      purple white
Name: Color, dtype: object

In [86]:
df['ID'].str.split('-')

0    [A, 123]
1    [B, 456]
2    [A, 789]
3    [B, 123]
Name: ID, dtype: object

In [87]:
df[['ID_a', 'ID_n']] = df['ID'].str.split('-', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n
0,A-123,py/white black,A,123
1,B-456,red green blue,B,456
2,A-789,py/yellow,A,789
3,B-123,purple white,B,123


In [88]:
df[['Color_1', 'Color_2', 'Color_3']] = df['Color'].str.split(' ', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3
0,A-123,py/white black,A,123,py/white,black,
1,B-456,red green blue,B,456,red,green,blue
2,A-789,py/yellow,A,789,py/yellow,,
3,B-123,purple white,B,123,purple,white,
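
With `expand=True`, rows that split into fewer pieces are padded with missing values, and the number of target columns on the left-hand side has to match the widest row. When the maximum number of pieces is not known in advance, limiting the number of splits with `n` fixes the width. A sketch using the notebook's Color values:

```python
import pandas as pd

s = pd.Series(['py/white black', 'red green blue', 'py/yellow', 'purple white'])

# n=1: split at the first space only, so the result always has two columns.
parts = s.str.split(' ', n=1, expand=True)
print(parts)

# 'red green blue' keeps the remainder together in the second column,
# and 'py/yellow' has no second piece, so that cell is missing.
second_of_row_1 = parts.iloc[1, 1]
missing_in_row_2 = pd.isna(parts.iloc[2, 1])
```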


In [89]:
df['Color_1'].str.extract('(py/)', expand=True)


Unnamed: 0,0
0,py/
1,
2,py/
3,


In [90]:
df['py'] = df['Color_1'].str.extract('(py/)', expand=True)
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3,py
0,A-123,py/white black,A,123,py/white,black,,py/
1,B-456,red green blue,B,456,red,green,blue,
2,A-789,py/yellow,A,789,py/yellow,,,py/
3,B-123,purple white,B,123,purple,white,,


In [91]:
df['Color_1'] = df['Color_1'].str.replace('py/', '')
df

Unnamed: 0,ID,Color,ID_a,ID_n,Color_1,Color_2,Color_3,py
0,A-123,py/white black,A,123,white,black,,py/
1,B-456,red green blue,B,456,red,green,blue,
2,A-789,py/yellow,A,789,yellow,,,py/
3,B-123,purple white,B,123,purple,white,,
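
As a variation on the extract-then-replace steps above, the prefix can also be recorded as a boolean flag with `str.startswith` and removed with a literal (non-regex) replacement. Passing `regex=False` explicitly is a defensive choice here, since the default for `regex` in `str.replace` has differed between pandas versions. The values mirror the notebook's `Color_1` column.

```python
import pandas as pd

color_1 = pd.Series(['py/white', 'red', 'py/yellow', 'purple'])

# Boolean flag instead of a column holding 'py/' or a missing value.
has_py = color_1.str.startswith('py/')

# Remove the literal prefix text; no regular expression is involved.
cleaned = color_1.str.replace('py/', '', regex=False)

print(has_py.tolist())
print(cleaned.tolist())
```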


---
[Back to top](#0)

---
## End
    
---