In [12]:
import pandas as pd

pd.set_option("display.max_columns", None)

# [Feature Engineering](https://pycaret.gitbook.io/docs/get-started/preprocessing/feature-engineering)

## Feature Interaction <a id="feature_interaction"></a>

In machine learning experiments, two features combined through an **arithmetic operation** often explain more of the variance in the data than the same two features do separately. Creating a new feature from the interaction of existing features is known as **feature interaction**. In PyCaret, this is done with the `feature_interaction` and `feature_ratio` parameters of `setup`. Feature interaction creates new features by multiplying two variables (a * b), while feature ratio creates new features by dividing one existing feature by another (a / b).
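The two operations can be sketched by hand with pandas. This is only an illustration of a * b and a / b on a toy frame, not PyCaret's internal implementation; the frame is hypothetical, and the column names simply mirror the ones that appear in the example output further down.

```python
import pandas as pd

# Hypothetical toy frame; values taken from the first rows of the insurance data.
df = pd.DataFrame({"age": [19, 18, 28], "bmi": [27.9, 33.77, 33.0]})

# Feature interaction: multiply two variables (a * b).
df["bmi_multiply_age"] = df["bmi"] * df["age"]

# Feature ratio: divide one variable by another (a / b).
df["bmi_divide_age"] = df["bmi"] / df["age"]

print(df)
```

Every pair of numeric columns yields candidate features in this way, which is why the approach does not scale well to a large feature space.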

### PARAMETERS

- **feature_interaction**: bool, default = False
  - When set to `True`, new features are created by multiplying (a * b) the numeric variables in the dataset, including polynomial and trigonometric features if those were created. This feature is not scalable and may not work as expected on datasets with a large feature space.
- **feature_ratio**: bool, default = False
  - When set to `True`, new features are created by calculating the ratios (a / b) of all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with a large feature space.
- **interaction_threshold**: float, default = 0.01
  - Similar to `polynomial_threshold`, it is used to compress the sparse matrix of features newly created through interaction. Features whose importance, based on a combination of Random Forest, AdaBoost and linear correlation, falls within the percentile of the defined threshold are kept in the dataset. The remaining features are dropped before further processing.

### Example

In [13]:
# load dataset
from pycaret.datasets import get_data

insurance = get_data('insurance')

# init setup
from pycaret.regression import *

reg1 = setup(data=insurance, target='charges', feature_interaction=True, feature_ratio=True)

Unnamed: 0,Description,Value
0,session_id,2528
1,Target,charges
2,Original Data,"(1338, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(936, 18)"


In [14]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


Unnamed: 0,age,bmi,sex_female,children_0,children_1,children_2,children_3,children_4,children_5,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest,bmi_multiply_smoker_yes,smoker_yes_multiply_bmi,bmi_divide_age,bmi_multiply_age
0,19.0,27.900000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,27.90,27.90,1.468421,530.099976
1,18.0,33.770000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.00,0.00,1.876111,607.859985
2,28.0,33.000000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.00,0.00,1.178571,924.000000
3,33.0,22.705000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.00,0.00,0.688030,749.265015
4,32.0,28.879999,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.00,0.00,0.902500,924.159973
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50.0,30.969999,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.00,0.00,0.619400,1548.500000
1334,18.0,31.920000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.00,0.00,1.773333,574.559998
1335,18.0,36.849998,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.00,0.00,2.047222,663.299988
1336,21.0,25.799999,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.00,0.00,1.228571,541.799988


## Polynomial Features <a id="polynomial_feature"></a>

In machine learning experiments, the relationship between the dependent and independent variables is often assumed to be linear, but this is not always the case; sometimes the relationship is more complex. Creating new polynomial features can help capture a relationship that might otherwise go unnoticed. PyCaret can create polynomial features from existing features using the `polynomial_features` parameter of `setup`.

### PARAMETERS

- **polynomial_features**: bool, default = False
  - When set to `True`, new features are created from all polynomial combinations of the numeric features in the dataset, up to the degree defined in the `polynomial_degree` parameter.
- **polynomial_degree**: int, default = 2
  - Degree of the polynomial features. For example, if an input sample is two-dimensional and of the form [a, b], the polynomial features with degree = 2 are [1, a, b, a^2, ab, b^2].
- **polynomial_threshold**: float, default = 0.1
  - Used to compress the sparse matrix of polynomial and trigonometric features. Polynomial and trigonometric features whose importance, based on a combination of Random Forest, AdaBoost and linear correlation, falls within the percentile of the defined threshold are kept in the dataset. The remaining features are dropped before further processing.
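The degree-2 expansion of [a, b] described above can be reproduced in a few lines of plain Python. This is a minimal sketch of the expansion itself; `polynomial_features` here is a hypothetical helper, not a PyCaret function, and PyCaret's own implementation may differ.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(sample, degree=2):
    """Return every monomial of the inputs up to `degree`, constant term first."""
    features = []
    for d in range(degree + 1):
        # Each combination of d indices (with repetition) is one monomial.
        for combo in combinations_with_replacement(range(len(sample)), d):
            features.append(prod(sample[i] for i in combo))
    return features


# For [a, b] = [2, 3] this yields [1, a, b, a^2, ab, b^2].
print(polynomial_features([2, 3]))  # [1, 2, 3, 4, 6, 9]
```

The number of generated features grows quickly with the degree and the number of inputs, which is what `polynomial_threshold` is meant to keep in check.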

### Example

In [15]:
# load dataset
from pycaret.datasets import get_data

juice = get_data('juice')

# init setup
from pycaret.classification import *

clf1 = setup(data=juice, target='Purchase', polynomial_features=True)

Unnamed: 0,Description,Value
0,session_id,4121
1,Target,Purchase
2,Target Type,Binary
3,Label Encoded,"CH: 0, MM: 1"
4,Original Data,"(1070, 19)"
5,Missing Values,False
6,Numeric Features,13
7,Categorical Features,5
8,Ordinal Features,False
9,High Cardinality Features,False


In [16]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,Id,Purchase,WeekofPurchase,StoreID,PriceCH,PriceMM,DiscCH,DiscMM,SpecialCH,SpecialMM,LoyalCH,SalePriceMM,SalePriceCH,PriceDiff,Store7,PctDiscMM,PctDiscCH,ListPriceDiff,STORE
0,1,CH,237,1,1.75,1.99,0.00,0.00,0,0,0.500000,1.99,1.75,0.24,No,0.000000,0.000000,0.24,1
1,2,CH,239,1,1.75,1.99,0.00,0.30,0,1,0.600000,1.69,1.75,-0.06,No,0.150754,0.000000,0.24,1
2,3,CH,245,1,1.86,2.09,0.17,0.00,0,0,0.680000,2.09,1.69,0.40,No,0.000000,0.091398,0.23,1
3,4,MM,227,1,1.69,1.69,0.00,0.00,0,0,0.400000,1.69,1.69,0.00,No,0.000000,0.000000,0.00,1
4,5,CH,228,7,1.69,1.69,0.00,0.00,0,0,0.956535,1.69,1.69,0.00,Yes,0.000000,0.000000,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1065,1066,CH,252,7,1.86,2.09,0.10,0.00,0,0,0.587822,2.09,1.76,0.33,Yes,0.000000,0.053763,0.23,0
1066,1067,CH,256,7,1.86,2.18,0.00,0.00,0,0,0.670258,2.18,1.86,0.32,Yes,0.000000,0.000000,0.32,0
1067,1068,MM,257,7,1.86,2.18,0.00,0.00,0,0,0.736206,2.18,1.86,0.32,Yes,0.000000,0.000000,0.32,0
1068,1069,CH,261,7,1.86,2.13,0.00,0.24,0,0,0.588965,1.89,1.86,0.03,Yes,0.112676,0.000000,0.27,0


Unnamed: 0,PriceCH,PriceMM,DiscMM,LoyalCH,SalePriceMM,SalePriceCH,PriceDiff,PctDiscCH,ListPriceDiff,LoyalCH_Power2,WeekofPurchase_Power2,StoreID_3,StoreID_4,SpecialCH_1,SpecialMM_0,Store7_Yes,STORE_1,STORE_2
0,1.75,1.99,0.00,0.500000,1.99,1.75,0.24,0.000000,0.24,0.250000,56169.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.75,1.99,0.30,0.600000,1.69,1.75,-0.06,0.000000,0.24,0.360000,57121.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.86,2.09,0.00,0.680000,2.09,1.69,0.40,0.091398,0.23,0.462400,60025.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,1.69,1.69,0.00,0.400000,1.69,1.69,0.00,0.000000,0.00,0.160000,51529.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.69,1.69,0.00,0.956535,1.69,1.69,0.00,0.000000,0.00,0.914959,51984.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1065,1.86,2.09,0.00,0.587822,2.09,1.76,0.33,0.053763,0.23,0.345535,63504.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
1066,1.86,2.18,0.00,0.670258,2.18,1.86,0.32,0.000000,0.32,0.449246,65536.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
1067,1.86,2.18,0.00,0.736206,2.18,1.86,0.32,0.000000,0.32,0.541999,66049.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
1068,1.86,2.13,0.24,0.588965,1.89,1.86,0.03,0.000000,0.27,0.346880,68121.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0


Notice that new features were created from the existing feature space (for example, `LoyalCH_Power2` and `WeekofPurchase_Power2` in the output above). To expand or compress the polynomial feature space, use the `polynomial_threshold` parameter, which filters out unimportant polynomial features using feature importance based on a combination of Random Forest, AdaBoost and linear correlation. `polynomial_degree` defines the degree of the polynomial features to create.

## Trigonometry Features <a id="trigonometry_features"></a>

Similar to polynomial features, PyCaret can also create new **trigonometry features** from existing features, using the `trigonometry_features` parameter of `setup`.

### PARAMETERS

- **trigonometry_features**: bool, default = False
  - When set to `True`, new features are created from all trigonometric combinations of the numeric features in the dataset, to the degree defined in the `polynomial_degree` parameter.
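As a rough illustration of what a trigonometric feature is, the sketch below applies sine, cosine and tangent to each numeric column with NumPy. The choice of these three functions and the `sin(age)`-style column names are assumptions made for this example; the text above does not specify which functions PyCaret applies or how it names the results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two numeric columns.
df = pd.DataFrame({"age": [19.0, 18.0, 28.0], "bmi": [27.9, 33.77, 33.0]})

# Assumed set of trigonometric functions (not confirmed by the source).
functions = (("sin", np.sin), ("cos", np.cos), ("tan", np.tan))

trig = {}
for col in ["age", "bmi"]:
    for name, func in functions:
        trig[f"{name}({col})"] = func(df[col])

# Append the new columns to the original frame.
df_trig = pd.concat([df, pd.DataFrame(trig)], axis=1)
print(df_trig.columns.tolist())
```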

### Example

In [17]:
# load dataset
from pycaret.datasets import get_data

insurance = get_data('insurance')


# init setup
from pycaret.regression import *

reg1 = setup(data=insurance, target='charges', trigonometry_features=True)

Unnamed: 0,Description,Value
0,session_id,5669
1,Target,charges
2,Original Data,"(1338, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(936, 14)"


In [18]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


Unnamed: 0,age,bmi,sex_female,children_0,children_1,children_2,children_3,children_4,children_5,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19.0,27.900000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,18.0,33.770000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,28.0,33.000000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,33.0,22.705000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,32.0,28.879999,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50.0,30.969999,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1334,18.0,31.920000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1335,18.0,36.849998,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1336,21.0,25.799999,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


## Group Features <a id="group_features"></a>

When a dataset contains features that are related to each other in some way (for example, features recorded at fixed time intervals), new statistical features such as the **mean**, **median**, **variance** and **standard deviation** of a group of such features can be created using the `group_features` parameter of `setup`.

### PARAMETERS

- **group_features**: list or list of list, default = None
  - When a dataset contains features that have related characteristics, the `group_features` parameter can be used for statistical feature extraction. For example, if a dataset has numeric features that are related to each other (e.g. `Col1`, `Col2`, `Col3`), a list containing the column names can be passed to `group_features` to extract statistical information such as the mean, median, mode and standard deviation.
- **group_names**: list, default = None
  - When `group_features` is passed, the name of each group can be passed to the `group_names` parameter as a list of strings. The length of the `group_names` list must equal the length of `group_features`. When the lengths do not match or no name is passed, the new features are named sequentially: group_1, group_2, and so on.
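Row-wise statistics over a group of related columns can be sketched directly in pandas. The values below are the first two rows of `BILL_AMT1` to `BILL_AMT3` from the credit example; the group name `bill` and the resulting column names are hypothetical and not the names PyCaret generates.

```python
import pandas as pd

df = pd.DataFrame({
    "BILL_AMT1": [3913.0, 29239.0],
    "BILL_AMT2": [3102.0, 14027.0],
    "BILL_AMT3": [689.0, 13559.0],
})

group = ["BILL_AMT1", "BILL_AMT2", "BILL_AMT3"]
group_name = "bill"  # hypothetical name, playing the role of `group_names`

# Compute each statistic across the grouped columns, one value per row.
values = df[group]
df[f"{group_name}_mean"] = values.mean(axis=1)
df[f"{group_name}_median"] = values.median(axis=1)
df[f"{group_name}_var"] = values.var(axis=1)
df[f"{group_name}_std"] = values.std(axis=1)

print(df.filter(like=group_name))
```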

### Example

In [19]:
# load dataset
from pycaret.datasets import get_data

credit = get_data('credit')

# init setup
from pycaret.classification import *

clf1 = setup(
    data=credit,
    target='default',
    group_features=['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
)

Unnamed: 0,Description,Value
0,session_id,2853
1,Target,default
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(24000, 24)"
5,Missing Values,False
6,Numeric Features,14
7,Categorical Features,9
8,Ordinal Features,False
9,High Cardinality Features,False


In [20]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,90000,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
2,50000,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
3,50000,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
4,50000,1,1,2,37,0,0,0,0,0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23995,80000,1,2,2,34,2,2,2,2,2,...,77519.0,82607.0,81158.0,7000.0,3500.0,0.0,7000.0,0.0,4000.0,1
23996,150000,1,3,2,43,-1,-1,-1,-1,0,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0
23997,30000,1,2,2,37,4,3,2,-1,0,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1
23998,80000,1,3,1,41,1,-1,0,0,0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1


Unnamed: 0,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,...,PAY_6_-1,PAY_6_-2,PAY_6_0,PAY_6_2,PAY_6_3,PAY_6_4,PAY_6_5,PAY_6_6,PAY_6_7,PAY_6_8
0,20000.0,24.0,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,90000.0,34.0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,50000.0,37.0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,50000.0,57.0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,50000.0,37.0,64400.0,57069.0,57608.0,19394.0,19619.0,20024.0,2500.0,1815.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23995,80000.0,34.0,72557.0,77708.0,79384.0,77519.0,82607.0,81158.0,7000.0,3500.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
23996,150000.0,43.0,1683.0,1828.0,3502.0,8979.0,5190.0,0.0,1837.0,3526.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23997,30000.0,37.0,3565.0,3356.0,2758.0,20878.0,20582.0,19357.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23998,80000.0,41.0,-1645.0,78379.0,76304.0,52774.0,11855.0,48944.0,85900.0,3409.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Bin Numeric Features <a id="bin_numeric_features"></a>

Feature binning turns continuous variables into categorical values using a pre-defined number of bins. It is effective when a continuous feature has too many unique values or a few extreme values outside the expected range. Such extreme values influence the trained model and thereby affect its prediction accuracy. In PyCaret, continuous numeric features can be binned into intervals using the `bin_numeric_features` parameter of `setup`. PyCaret uses the *'sturges'* rule to determine the number of bins and K-Means clustering to convert continuous numeric features into categorical features.

### PARAMETERS

- **bin_numeric_features**: list, default = None
  - When a list of numeric features is passed, they are transformed into categorical features using K-Means, so that the values in each bin share the same nearest center of a 1D k-means cluster. The number of clusters is determined by the 'sturges' method, which is only optimal for Gaussian data and underestimates the number of bins for large non-Gaussian datasets.
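The two ingredients named above, Sturges' rule and 1D k-means, can be sketched with NumPy. This is a simplified illustration under stated assumptions: `sturges_bins` uses the textbook formula ceil(log2(n)) + 1, `kmeans_1d` is a plain Lloyd's algorithm written for this example, and the ages are made up. PyCaret's actual implementation may differ in its details.

```python
import math

import numpy as np


def sturges_bins(n_samples):
    """Textbook Sturges' rule: ceil(log2(n)) + 1 bins."""
    return math.ceil(math.log2(n_samples)) + 1


def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's algorithm in one dimension; returns (labels, centers)."""
    values = np.asarray(values, dtype=float)
    # Start from evenly spaced quantiles so the result is deterministic.
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers.
    labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
    return labels, centers


# Sturges' rule for the 32,561 rows of the income dataset.
print(sturges_bins(32561))  # 16

# Made-up ages; k is fixed by hand here to keep the toy example readable.
ages = [18, 19, 21, 22, 38, 39, 40, 41, 58, 60, 61, 63]
labels, centers = kmeans_1d(ages, k=3)
print(labels, centers)
```

Each label plays the role of a bin: values with the same label share the same nearest cluster center, and the label can then be encoded like any other categorical feature.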

### Example

In [28]:
# load dataset
from pycaret.datasets import get_data

income = get_data('income')

# init setup
from pycaret.classification import *

clf1 = setup(data=income, target='income >50K', bin_numeric_features=['age'])

Unnamed: 0,Description,Value
0,session_id,926
1,Target,income >50K
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(32561, 14)"
5,Missing Values,True
6,Numeric Features,4
7,Categorical Features,9
8,Ordinal Features,False
9,High Cardinality Features,False


In [29]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


Unnamed: 0,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,...,age_14.0,age_15.0,age_2.0,age_3.0,age_4.0,age_5.0,age_6.0,age_7.0,age_8.0,age_9.0
0,2174.0,0.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0.0,0.0,38.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32557,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
32558,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
32559,0.0,0.0,20.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Combine Rare Levels <a id="combine_rare_levels"></a>

Sometimes a dataset has one or more categorical features with a very large number of levels (i.e. high-cardinality features). If such features are encoded into numeric values, the resulting matrix is a **sparse matrix**. This not only slows the experiment down, because the number of features and hence the size of the dataset increase many times over, but also introduces noise into the experiment. A sparse matrix can be avoided by combining the rare levels of the high-cardinality features. In PyCaret, this is done with the `combine_rare_levels` parameter of `setup`.

### PARAMETERS

- **combine_rare_levels**: bool, default = False
  - When set to `True`, all levels in categorical features that fall below the threshold defined in the `rare_level_threshold` parameter are combined into a single level. At least two levels must fall below the threshold for this to take effect. `rare_level_threshold` represents the percentile distribution of level frequency. This technique is generally applied to limit the sparse matrix caused by a high number of levels in categorical features.
- **rare_level_threshold**: float, default = 0.1
  - Percentile distribution below which rare categories are combined. Only takes effect when `combine_rare_levels` is set to `True`.
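The idea can be sketched in pandas as follows. This is a simplified illustration: `combine_rare_levels` here is a hypothetical helper, and treating the threshold as a level's share of all rows is an assumption made for the example. The replacement level name `others_infrequent` is borrowed from the `native-country_others_infrequent` column in the example output below.

```python
import pandas as pd


def combine_rare_levels(series, threshold=0.1, new_level="others_infrequent"):
    """Merge levels whose relative frequency is below `threshold` into one level."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    # Combining only has an effect when at least two levels are rare.
    if len(rare) < 2:
        return series
    # Keep frequent levels as they are and replace the rare ones.
    return series.where(~series.isin(rare), new_level)


# Made-up column: one dominant level and three rare ones (5% each).
countries = pd.Series(["United-States"] * 17 + ["Cuba", "Taiwan", "Vietnam"])
combined = combine_rare_levels(countries)
print(combined.value_counts())
```

After one-hot encoding, the three rare countries produce a single column instead of three, which is how the sparse matrix is kept smaller.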

### Example

In [30]:
# load dataset
from pycaret.datasets import get_data

income = get_data('income')

# init setup
from pycaret.classification import *

clf1 = setup(data=income, target='income >50K', combine_rare_levels=True)

Unnamed: 0,Description,Value
0,session_id,5204
1,Target,income >50K
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(32561, 14)"
5,Missing Values,True
6,Numeric Features,4
7,Categorical Features,9
8,Ordinal Features,False
9,High Cardinality Features,False


In [31]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,...,native-country_Portugal,native-country_Puerto-Rico,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_not_available,native-country_others_infrequent
0,39.0,2174.0,0.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,50.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,38.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,53.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,28.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27.0,0.0,0.0,38.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
32557,40.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
32558,58.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
32559,22.0,0.0,0.0,20.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


#### Effect of combining rare levels

![Effect of combining rare levels](./images/Effect_of_combining_rare_levels.png)

## Create Clusters <a id="create_clusters"></a>

**Creating clusters** from the existing features in the data is an unsupervised ML technique for engineering new features. It uses an iterative approach to determine the number of clusters, based on a combination of the Calinski-Harabasz and Silhouette criteria. Each data point, with its original features, is assigned to a cluster, and the assigned cluster label is then used as a new feature in predicting the target variable. In PyCaret, this is done with the `create_clusters` parameter of `setup`.

### PARAMETERS

- **create_clusters**: bool, default = False
  - When set to `True`, an additional feature is created in which each instance is assigned to a cluster. The number of clusters is determined using a combination of the Calinski-Harabasz and Silhouette criteria.
- **cluster_iter**: int, default = 20
  - Number of iterations used to create the clusters; each iteration represents a cluster size. Only takes effect when the `create_clusters` parameter is set to `True`.
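The general pattern of trying several cluster counts, scoring each one and appending the winning labels as a feature can be sketched with scikit-learn. This sketch makes two simplifying assumptions: it selects the cluster count by the silhouette score alone, since how PyCaret combines it with the Calinski-Harabasz criterion is not described here, and it uses synthetic data with three well-separated blobs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2D blobs of 30 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

# Try each candidate cluster count and keep the best-scoring one.
best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

# Append the cluster label as an additional feature column.
X_with_cluster = np.column_stack([X, best_labels])
print(best_k, X_with_cluster.shape)
```

The range of candidate counts plays a role similar to `cluster_iter`: a wider range explores more cluster sizes at a higher computational cost.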

### Example

In [32]:
# load dataset
from pycaret.datasets import get_data

insurance = get_data('insurance')

# init setup
from pycaret.regression import *

reg1 = setup(data=insurance, target='charges', create_clusters=True)

Unnamed: 0,Description,Value
0,session_id,7650
1,Target,charges
2,Original Data,"(1338, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(936, 15)"


In [33]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


Unnamed: 0,age,bmi,sex_female,children_0,children_1,children_2,children_3,children_4,children_5,smoker_no,region_northeast,region_northwest,region_southeast,region_southwest,data_cluster_1
0,19.0,27.900000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,18.0,33.770000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,28.0,33.000000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,33.0,22.705000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4,32.0,28.879999,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50.0,30.969999,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
1334,18.0,31.920000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
1335,18.0,36.849998,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1336,21.0,25.799999,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
