In [1]:
import pandas as pd
pd.set_option("display.max_columns", None)

# [Data Preparation](https://pycaret.gitbook.io/docs/get-started/preprocessing/data-preparation)

## Missing Values <a id="missing_values"></a>

Datasets may, for various reasons, have missing values or empty records, often encoded as blanks or `NaN`. Most machine learning algorithms cannot handle missing or blank values. Removing samples with missing values is a basic strategy that is sometimes used, but it comes at the cost of losing potentially valuable data and the information or patterns associated with it. A better strategy is to impute the missing values. By default, PyCaret imputes missing values with the `mean` for numeric features and a `constant` for categorical features. To change the imputation method, use the `numeric_imputation` and `categorical_imputation` parameters in the setup function.
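
To make the default behaviour concrete, here is a minimal pure-Python sketch of simple imputation: the column mean for numeric features and a constant for categorical ones. It is not PyCaret's implementation; the function name `impute_simple` and the three sample rows are made up for illustration, and only the `not_available` constant is taken from the text.

```python
from statistics import mean


def impute_simple(rows, numeric_cols, categorical_cols, fill_value="not_available"):
    """Return copies of `rows` with None replaced by a mean or a constant."""
    # Learn one mean per numeric column from the observed (non-missing) values.
    means = {
        col: mean(r[col] for r in rows if r[col] is not None)
        for col in numeric_cols
    }
    imputed = []
    for r in rows:
        filled = dict(r)  # copy so the original rows stay untouched
        for col in numeric_cols:
            if filled[col] is None:
                filled[col] = means[col]
        for col in categorical_cols:
            if filled[col] is None:
                filled[col] = fill_value
        imputed.append(filled)
    return imputed


# Hypothetical sample rows; None marks a missing value.
rows = [
    {"PROTIME": 80.0, "STEROID": "1.0"},
    {"PROTIME": None, "STEROID": None},
    {"PROTIME": 50.0, "STEROID": "2.0"},
]
imputed = impute_simple(rows, numeric_cols=["PROTIME"], categorical_cols=["STEROID"])
print(imputed[1])
```

In a real pipeline the means would be learned on the training split only and then reused for new data, which is what "the `mean` value of the feature in the training dataset" refers to.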

### PARAMETERS

- **imputation_type**: string, default = 'simple'
  - The type of imputation to use.
  - It can be either `simple` or `iterative`.
- **numeric_imputation**: string, default = 'mean'
  - Missing values in numeric features are imputed with the `mean` value of the feature in the training dataset.
  - The other available options are `median` and `zero`.
- **categorical_imputation**: string, default = 'constant'
  - Missing values in categorical features are imputed with a constant `not_available` value.
  - The other available option is `mode`.
- **iterative_imputation_iters**: int, default = 5
  - The number of iterations. Ignored when `imputation_type` is not `iterative`.
- **numeric_iterative_imputer**: Union[str, Any], default = 'lightgbm'
  - Estimator for iterative imputation of missing values in numeric features.
  - Ignored when `imputation_type` is set to `simple`.
- **categorical_iterative_imputer**: Union[str, Any], default = 'lightgbm'
  - Estimator for iterative imputation of missing values in categorical features.
  - Ignored when `imputation_type` is not `iterative`.


### Example

In [2]:
# load dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

# init setup
from pycaret.classification import *
clf1 = setup(
    data=hepatitis,
    target='Class'
)

Unnamed: 0,Description,Value
0,session_id,6563
1,Target,Class
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(154, 20)"
5,Missing Values,True
6,Numeric Features,6
7,Categorical Features,13
8,Ordinal Features,False
9,High Cardinality Features,False


In [3]:
# original data
display(hepatitis)
# transformed data
get_config("X")

Unnamed: 0,Class,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,SPLEEN PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY
0,0,30,2,1.0,2,2,2,2,1.0,2.0,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,0,50,1,1.0,2,1,2,2,1.0,2.0,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,0,78,1,2.0,2,1,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,0,31,1,,1,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,0,34,1,2.0,2,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,1,46,1,2.0,2,1,1,1,2.0,2.0,2.0,1.0,1.0,1.0,7.6,,242.0,3.3,50.0,2
150,0,44,1,2.0,2,1,2,2,2.0,1.0,2.0,2.0,2.0,2.0,0.9,126.0,142.0,4.3,,2
151,0,61,1,1.0,2,1,1,2,1.0,1.0,2.0,1.0,2.0,2.0,0.8,75.0,20.0,4.1,,2
152,0,53,2,1.0,2,1,2,2,2.0,2.0,1.0,1.0,2.0,1.0,1.5,81.0,19.0,4.1,48.0,2


Unnamed: 0,AGE,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,SEX_1,STEROID_1.0,ANTIVIRALS_2,FATIGUE_2,...,LIVER FIRM_not_available,SPLEEN PALPABLE_1.0,SPLEEN PALPABLE_2.0,SPIDERS_1.0,SPIDERS_2.0,ASCITES_1.0,ASCITES_2.0,VARICES_1.0,VARICES_2.0,HISTOLOGY_2
0,30.0,1.0,85.000000,18.0,4.0,59.64706,0.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1,50.0,0.9,135.000000,42.0,3.5,59.64706,1.0,1.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,78.0,0.7,96.000000,32.0,4.0,59.64706,1.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,31.0,0.7,46.000000,52.0,4.0,80.00000,1.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
4,34.0,1.0,104.155556,200.0,4.0,59.64706,1.0,0.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,46.0,7.6,104.155556,242.0,3.3,50.00000,1.0,0.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
150,44.0,0.9,126.000000,142.0,4.3,59.64706,1.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
151,61.0,0.8,75.000000,20.0,4.1,59.64706,1.0,1.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0
152,53.0,1.5,81.000000,19.0,4.1,48.00000,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0


### Comparison of Simple Imputer vs. Iterative Imputer

- [Iterative Imputation in PyCaret 2.2](https://www.linkedin.com/pulse/iterative-imputation-pycaret-22-antoni-baum/)
  - [this notebook on Google Colab](https://drive.google.com/file/d/17K1itQ0Z621c0SQApmIG-3oH8A4TQ-aO/view?usp=sharing)
- [Iterative Imputation in PyCaret 2.2 part 2 - real life data](https://www.linkedin.com/pulse/iterative-imputation-pycaret-22-part-2-real-life-data-antoni-baum?lipi=urn%3Ali%3Apage%3Ad_flagship3_pulse_read%3BeiXPrtSYTR%2Bu3CmutQ4r%2BA%3D%3D)
  - [this notebook on Google Colab](https://drive.google.com/file/d/1Z8-N5H1MDOmxC21V39v8YBepElyuPyUV/view?usp=sharing)


> NOTE: No explicit imputation parameters are required in the setup function, because PyCaret imputes missing values by default.

## Data Types <a id="data_types"></a>

Each feature in the dataset has an associated data type such as numeric, categorical, or datetime. PyCaret's inference algorithm automatically detects the data type of each feature. However, the inferred data types are sometimes incorrect. Ensuring that data types are correct is important because several downstream processes depend on them. For example, [Missing Values](#missing_values) in numeric and categorical features are imputed differently. To overwrite the inferred data types, use the `numeric_features`, `categorical_features`, and `date_features` parameters in the setup function. You can also use `ignore_features` to exclude certain features from model training.

### PARAMETERS

- **numeric_features**: list of string, default = None
  - If the inferred data types are not correct, `numeric_features` can be used to overwrite the inferred data types.
- **categorical_features**: list of string, default = None
  - If the inferred data types are not correct, `categorical_features` can be used to overwrite the inferred data types.
- **date_features**: list of string, default = None
  - If the data has a `Datetime` column that is not automatically inferred when running the setup, `date_features` can be used to force the data type. It works with multiple date columns. The original `Datetime` columns are not used directly in modeling. Instead, features are extracted from them and the original columns are ignored during model training. If the `Datetime` column includes a timestamp, time-related features are also extracted.
- **ignore_features**: list of string, default = None
  - `ignore_features` can be used to ignore features during model training. It takes a list of strings with the names of the columns to ignore.
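
The feature extraction described for `date_features` can be pictured with a small standard-library sketch. The text does not say which features PyCaret derives, so the set used here (year, month, day, weekday, and hour when a timestamp is present) is an assumption, and `extract_date_features` is a hypothetical helper, not a PyCaret function.

```python
from datetime import datetime


def extract_date_features(values, with_time=False):
    """Turn ISO-formatted date strings into simple numeric features."""
    features = []
    for raw in values:
        ts = datetime.fromisoformat(raw)
        row = {
            "year": ts.year,
            "month": ts.month,
            "day": ts.day,
            "weekday": ts.weekday(),  # Monday == 0
        }
        if with_time:
            # Only meaningful when the column carries a timestamp.
            row["hour"] = ts.hour
        features.append(row)
    return features


features = extract_date_features(["2021-03-15 14:30:00"], with_time=True)
print(features[0])
```

The point of the sketch is that the raw datetime value is replaced by derived numeric columns, and the model is then trained on those columns.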

### Example 1 - Categorical Features

In [None]:
# load dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

# init setup
from pycaret.classification import *
clf1 = setup(
    data=hepatitis,
    target='Class',
    categorical_features=['AGE']
)

Unnamed: 0,Description,Value
0,session_id,8043
1,Target,Class
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(154, 20)"
5,Missing Values,True
6,Numeric Features,5
7,Categorical Features,14
8,Ordinal Features,False
9,High Cardinality Features,False


In [None]:
display(hepatitis)
get_config("X")

Unnamed: 0,Class,AGE,SEX,STEROID,ANTIVIRALS,FATIGUE,MALAISE,ANOREXIA,LIVER BIG,LIVER FIRM,SPLEEN PALPABLE,SPIDERS,ASCITES,VARICES,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,HISTOLOGY
0,0,30,2,1.0,2,2,2,2,1.0,2.0,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,0,50,1,1.0,2,1,2,2,1.0,2.0,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,0,78,1,2.0,2,1,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,0,31,1,,1,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,0,34,1,2.0,2,2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,1,46,1,2.0,2,1,1,1,2.0,2.0,2.0,1.0,1.0,1.0,7.6,,242.0,3.3,50.0,2
150,0,44,1,2.0,2,1,2,2,2.0,1.0,2.0,2.0,2.0,2.0,0.9,126.0,142.0,4.3,,2
151,0,61,1,1.0,2,1,1,2,1.0,1.0,2.0,1.0,2.0,2.0,0.8,75.0,20.0,4.1,,2
152,0,53,2,1.0,2,1,2,2,2.0,2.0,1.0,1.0,2.0,1.0,1.5,81.0,19.0,4.1,48.0,2


Unnamed: 0,BILIRUBIN,ALK PHOSPHATE,SGOT,ALBUMIN,PROTIME,AGE_20.0,AGE_22.0,AGE_23.0,AGE_25.0,AGE_26.0,AGE_27.0,AGE_28.0,AGE_30.0,AGE_31.0,AGE_32.0,AGE_33.0,AGE_34.0,AGE_35.0,AGE_36.0,AGE_37.0,AGE_38.0,AGE_39.0,AGE_40.0,AGE_42.0,AGE_43.0,AGE_44.0,AGE_45.0,AGE_46.0,AGE_47.0,AGE_48.0,AGE_49.0,AGE_50.0,AGE_51.0,AGE_52.0,AGE_53.0,AGE_54.0,AGE_56.0,AGE_57.0,AGE_58.0,AGE_59.0,AGE_61.0,AGE_66.0,AGE_67.0,AGE_69.0,AGE_70.0,AGE_72.0,AGE_78.0,SEX_1,STEROID_1.0,STEROID_2.0,STEROID_not_available,ANTIVIRALS_2,FATIGUE_1,MALAISE_2,ANOREXIA_2,LIVER BIG_1.0,LIVER BIG_2.0,LIVER FIRM_1.0,LIVER FIRM_2.0,LIVER FIRM_not_available,SPLEEN PALPABLE_1.0,SPLEEN PALPABLE_2.0,SPIDERS_1.0,SPIDERS_2.0,ASCITES_1.0,ASCITES_2.0,VARICES_1.0,VARICES_2.0,VARICES_not_available,HISTOLOGY_2
0,1.0,85.000000,18.0,4.0,58.923077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
1,0.9,135.000000,42.0,3.5,58.923077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
2,0.7,96.000000,32.0,4.0,58.923077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
3,0.7,46.000000,52.0,4.0,80.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
4,1.0,108.689651,200.0,4.0,58.923077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,7.6,108.689651,242.0,3.3,50.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
150,0.9,126.000000,142.0,4.3,58.923077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
151,0.8,75.000000,20.0,4.1,58.923077,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
152,1.5,81.000000,19.0,4.1,48.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0


### Example 2 - Ignore Features

In [None]:
# load dataset
from pycaret.datasets import get_data
pokemon = get_data('pokemon')

# init setup
from pycaret.classification import *
clf1 = setup(
    data=pokemon,
    target='Legendary',
    ignore_features=['#', 'Name']
)

Unnamed: 0,Description,Value
0,session_id,8518
1,Target,Legendary
2,Target Type,Binary
3,Label Encoded,"False: 0, True: 1"
4,Original Data,"(800, 13)"
5,Missing Values,True
6,Numeric Features,7
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [None]:
display(pokemon)
get_config("X")

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


Unnamed: 0,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Type 1_Bug,Type 1_Dark,Type 1_Dragon,...,Type 2_Rock,Type 2_Steel,Type 2_Water,Type 2_not_available,Generation_1,Generation_2,Generation_3,Generation_4,Generation_5,Generation_6
0,318.0,45.0,49.0,49.0,65.0,65.0,45.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,405.0,60.0,62.0,63.0,80.0,80.0,60.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,525.0,80.0,82.0,83.0,100.0,100.0,80.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,625.0,80.0,100.0,123.0,122.0,120.0,80.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,309.0,39.0,52.0,43.0,60.0,50.0,65.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,600.0,50.0,100.0,150.0,100.0,150.0,50.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
796,700.0,50.0,160.0,110.0,160.0,110.0,110.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
797,600.0,80.0,110.0,60.0,150.0,130.0,70.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
798,680.0,80.0,160.0,60.0,170.0,130.0,80.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


## One-Hot Encoding <a id="one_hot_encoding"></a>

Categorical features contain label values (ordinal or nominal) rather than continuous numbers. Most machine learning algorithms cannot work with categorical features directly, so they must be transformed into numeric values before a model is trained. The most common type of categorical encoding is one-hot encoding (also known as dummy encoding), where each categorical level becomes a separate feature containing binary values (1 or 0).

Because this step is required to run an ML experiment, PyCaret transforms all categorical features in the dataset using one-hot encoding. This is ideal for features with nominal categorical data, i.e. data that cannot be ordered. In other scenarios, other encoding methods must be used. For example, when the data is ordinal, i.e. it has intrinsic levels, [Ordinal Encoding](#ordinal_encoding) must be used. One-hot encoding is applied to all features that are either inferred as categorical or forced to be categorical using `categorical_features` in the setup function.
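
As a rough illustration of the idea, the sketch below one-hot encodes a single column in pure Python. It is a conceptual example only: the helper name `one_hot_encode` and the sample values are made up, and PyCaret's own encoder may differ in details such as how it treats two-level features.

```python
def one_hot_encode(values, prefix):
    """Return (column names, rows) with one binary column per distinct level."""
    levels = sorted(set(values))
    columns = [f"{prefix}_{level}" for level in levels]
    rows = [
        {col: 1.0 if value == level else 0.0 for col, level in zip(columns, levels)}
        for value in values
    ]
    return columns, rows


columns, rows = one_hot_encode(["Grass", "Fire", "Grass", "Rock"], prefix="Type 1")
print(columns)
print(rows[0])
```

Each row ends up with exactly one `1.0`, in the column that matches its original level. This is why a feature with many levels produces many new columns.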

### Example

In [1]:
# load dataset
from pycaret.datasets import get_data
pokemon = get_data('pokemon')

# init setup
from pycaret.classification import *
clf1 = setup(
    data=pokemon,
    target='Legendary'
)

Unnamed: 0,Description,Value
0,session_id,686
1,Target,Legendary
2,Target Type,Binary
3,Label Encoded,"False: 0, True: 1"
4,Original Data,"(800, 13)"
5,Missing Values,True
6,Numeric Features,8
7,Categorical Features,4
8,Ordinal Features,False
9,High Cardinality Features,False


In [2]:
display(pokemon)
get_config("X")

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Name_Abomasnow,Name_AbomasnowMega Abomasnow,...,Type 2_Rock,Type 2_Steel,Type 2_Water,Type 2_not_available,Generation_1,Generation_2,Generation_3,Generation_4,Generation_5,Generation_6
0,1.0,318.0,45.0,49.0,49.0,65.0,65.0,45.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,2.0,405.0,60.0,62.0,63.0,80.0,80.0,60.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,3.0,525.0,80.0,82.0,83.0,100.0,100.0,80.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,3.0,625.0,80.0,100.0,123.0,122.0,120.0,80.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,4.0,309.0,39.0,52.0,43.0,60.0,50.0,65.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719.0,600.0,50.0,100.0,150.0,100.0,150.0,50.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
796,719.0,700.0,50.0,160.0,110.0,160.0,110.0,110.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
797,720.0,600.0,80.0,110.0,60.0,150.0,130.0,70.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
798,720.0,680.0,80.0,160.0,60.0,170.0,130.0,80.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


> NOTE: No additional parameter needs to be passed to the setup function for one-hot encoding. By default, it is applied to all `categorical_features` unless you explicitly define `ordinal_features` or `high_cardinality_features`.

## Ordinal Encoding <a id="ordinal_encoding"></a>

When categorical features contain variables with an intrinsic natural order, such as Low, Medium, and High, they must be encoded differently from nominal variables, which have no intrinsic order (e.g. Male or Female). This can be achieved with the `ordinal_features` parameter in the setup function, which accepts a dictionary mapping each feature name to its levels in increasing order from lowest to highest.

### PARAMETERS

- **ordinal_features**: dictionary, default = None
  - When the data contains ordinal features, they must be encoded differently using `ordinal_features`. If the data has a categorical variable with values `low`, `medium`, and `high`, and it is known that low < medium < high, it can be passed as `ordinal_features = {'column_name': ['low', 'medium', 'high']}`. The list must be in increasing order from lowest to highest.
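
The mapping implied by an ordered level list can be sketched in a few lines. This is an illustrative helper (`ordinal_encode` is a hypothetical name), not PyCaret's encoder. It assumes each level is replaced by its position in the list, starting at 0.

```python
def ordinal_encode(values, levels):
    """Replace each value by its rank in `levels` (lowest level -> 0)."""
    mapping = {level: rank for rank, level in enumerate(levels)}
    # A KeyError here means a value is missing from the declared levels.
    return [mapping[value] for value in values]


salary = ["low", "medium", "medium", "high"]
encoded = ordinal_encode(salary, levels=["low", "medium", "high"])
print(encoded)
```

Because the integers follow the order of the list, passing the levels in the wrong order would silently encode a wrong ranking, which is why the list order matters.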

### Example

In [3]:
# load dataset
from pycaret.datasets import get_data
employee = get_data('employee')

# init setup
from pycaret.classification import *
clf1 = setup(
    data=employee,
    target='left',
    ordinal_features={'salary' : ['low', 'medium', 'high']}
)

Unnamed: 0,Description,Value
0,session_id,4853
1,Target,left
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(14999, 10)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,6
8,Ordinal Features,True
9,High Cardinality Features,False


In [4]:
display(employee)
get_config("X")

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,promotion_last_5years,department,salary,left
0,0.38,0.53,2,157,3,0,0,sales,low,1
1,0.80,0.86,5,262,6,0,0,sales,medium,1
2,0.11,0.88,7,272,4,0,0,sales,medium,1
3,0.72,0.87,5,223,5,0,0,sales,low,1
4,0.37,0.52,2,159,3,0,0,sales,low,1
...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,0,support,low,1
14995,0.37,0.48,2,160,3,0,0,support,low,1
14996,0.37,0.53,2,143,3,0,0,support,low,1
14997,0.11,0.96,6,280,4,0,0,support,low,1


Unnamed: 0,satisfaction_level,last_evaluation,average_montly_hours,salary,number_project_2,number_project_3,number_project_4,number_project_5,number_project_6,number_project_7,...,department_IT,department_RandD,department_accounting,department_hr,department_management,department_marketing,department_product_mng,department_sales,department_support,department_technical
0,0.38,0.53,157.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.80,0.86,262.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.11,0.88,272.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.72,0.87,223.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.37,0.52,159.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,151.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
14995,0.37,0.48,160.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
14996,0.37,0.53,143.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
14997,0.11,0.96,280.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


## Cardinal Encoding <a id="cardinal_encoding"></a>

When categorical features contain variables with many levels (also known as high cardinality features), typical one-hot encoding creates a very large number of new features, which slows the experiment down. Features with high cardinality can be handled using `high_cardinality_features` in the setup function. Two methods of cardinal encoding are supported: (1) frequency and (2) clustering. The method is selected in the setup function.

### PARAMETERS

- **high_cardinality_features**: list of string, default = None
  - When the data contains features with high cardinality, they can be compressed into fewer levels by passing their column names as a list. Features are compressed using the method defined in the `high_cardinality_method` parameter.
- **high_cardinality_method**: string, default = 'frequency'
  - When the method is set to `frequency`, the original value of the feature is replaced with its frequency distribution and the feature is converted to numeric. The other available method is `clustering`, which clusters the statistical attributes of the data and replaces the original value of the feature with the cluster label. The number of clusters is determined using a combination of the Calinski-Harabasz and Silhouette criteria.
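
The `frequency` method can be illustrated with a short sketch that replaces each level by how often it occurs. This is a conceptual example, not PyCaret's code: `frequency_encode` is a hypothetical helper, and the use of raw counts is an assumption based on the count-like `native-country` values in this section's example output.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each level by the number of times it occurs in `values`."""
    counts = Counter(values)
    return [float(counts[value]) for value in values], counts


# Hypothetical sample of a high-cardinality column.
native_country = ["United-States", "United-States", "Cuba", "United-States"]
encoded, counts = frequency_encode(native_country)
print(encoded)
```

The feature stays a single numeric column however many levels it has, which is the point of using this method instead of one-hot encoding.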

### Example

In [5]:
# load dataset
from pycaret.datasets import get_data
income = get_data('income')

# init setup
from pycaret.classification import *
clf1 = setup(
    data=income,
    target='income >50K',
    high_cardinality_features=['native-country']
)

Unnamed: 0,Description,Value
0,session_id,2702
1,Target,income >50K
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(32561, 14)"
5,Missing Values,True
6,Numeric Features,4
7,Categorical Features,9
8,Ordinal Features,False
9,High Cardinality Features,True


In [6]:
display(income)
get_config("X")

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,native-country,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,...,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male
0,39.0,2174.0,0.0,40.0,20414.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,50.0,0.0,0.0,13.0,20414.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,38.0,0.0,0.0,40.0,20414.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,53.0,0.0,0.0,40.0,20414.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,28.0,0.0,0.0,40.0,71.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27.0,0.0,0.0,38.0,20414.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
32557,40.0,0.0,0.0,40.0,20414.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
32558,58.0,0.0,0.0,40.0,20414.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
32559,22.0,0.0,0.0,20.0,20414.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0


## Handle Unknown Levels <a id="handle_unknown_levels"></a>

When categorical features contain levels at prediction time that were not present in the training data, the trained model may run into problems because it never saw those levels. One way to deal with this is to reassign such levels. In PyCaret this is done with the `handle_unknown_categorical` and `unknown_categorical_method` parameters in the setup function.

### PARAMETERS

- **handle_unknown_categorical**: bool, default = True
  - When set to `True`, unknown categorical levels are replaced by the most or least frequent level as learned from the training dataset.
- **unknown_categorical_method**: string, default = 'least_frequent'
  - This can be set to `least_frequent` or `most_frequent`.
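
The reassignment of unseen levels can be sketched as follows. This is a minimal illustration under assumed behaviour, not PyCaret's implementation: `fit_unknown_handler` is a hypothetical helper that learns level frequencies on training values and maps any level it has not seen to the most or least frequent training level.

```python
from collections import Counter


def fit_unknown_handler(train_values, method="least_frequent"):
    """Learn the fallback level on training data and return a transform function."""
    if method not in ("least_frequent", "most_frequent"):
        raise ValueError("method must be 'least_frequent' or 'most_frequent'")
    ordered = Counter(train_values).most_common()  # highest count first
    fallback = ordered[0][0] if method == "most_frequent" else ordered[-1][0]
    known = {level for level, _ in ordered}

    def transform(values):
        # Known levels pass through; unseen levels become the fallback.
        return [value if value in known else fallback for value in values]

    return transform


train_region = ["southeast"] * 3 + ["northwest"] * 2 + ["southwest"]
to_most = fit_unknown_handler(train_region, method="most_frequent")
to_least = fit_unknown_handler(train_region, method="least_frequent")
print(to_most(["northwest", "central"]))
print(to_least(["northwest", "central"]))
```

Here `central` never appears in the training values, so it is the only value that gets replaced.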

### Example

In [7]:
# load dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# init setup
from pycaret.regression import *
reg1 = setup(
    data=insurance,
    target='charges',
    handle_unknown_categorical=True,
    unknown_categorical_method='most_frequent'
)

Unnamed: 0,Description,Value
0,session_id,5336
1,Target,charges
2,Original Data,"(1338, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(936, 14)"


In [8]:
display(insurance)
get_config("X")

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


Unnamed: 0,age,bmi,sex_female,children_0,children_1,children_2,children_3,children_4,children_5,smoker_no,region_northeast,region_northwest,region_southeast,region_southwest
0,19.0,27.900000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,18.0,33.770000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,28.0,33.000000,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,33.0,22.705000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,32.0,28.879999,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50.0,30.969999,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1334,18.0,31.920000,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1335,18.0,36.849998,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1336,21.0,25.799999,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


## Target Imbalance<a id="target_imbalance"></a>

When the target classes in the training dataset are unevenly distributed, the imbalance can be corrected with the `fix_imbalance` parameter of `setup`. When it is set to `True`, SMOTE (Synthetic Minority Over-sampling Technique) is used as the default resampling method. A different resampling method can be supplied through the `fix_imbalance_method` parameter.

### PARAMETERS

- **fix_imbalance**: bool, default = False
  - When set to `True`, the training dataset is resampled using the algorithm defined in `fix_imbalance_method`.
- **fix_imbalance_method**: obj, default = None
  - Accepts any algorithm from [imblearn](https://imbalanced-learn.org/stable/) that supports the `fit_resample` method. When `None`, SMOTE is used.
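
To make the `fit_resample` contract concrete, here is a minimal sketch of an object exposing that method: it takes features and labels and returns resampled copies of both. The class `RandomDuplicateOverSampler` and the toy data are hypothetical and written in plain Python. Whether PyCaret accepts an object that does not come from imblearn is not confirmed here; in practice an imblearn resampler such as `imblearn.over_sampling.RandomOverSampler` would be passed as `fix_imbalance_method`.

```python
import random
from collections import Counter


class RandomDuplicateOverSampler:
    """Hypothetical resampler: duplicates minority rows until classes are balanced."""

    def __init__(self, random_state=None):
        self._rng = random.Random(random_state)

    def fit_resample(self, X, y):
        counts = Counter(y)
        target = max(counts.values())  # size of the largest class
        X_out, y_out = list(X), list(y)
        for label, n in counts.items():
            rows = [row for row, lab in zip(X, y) if lab == label]
            # Append randomly chosen copies until this class reaches the target size.
            for _ in range(target - n):
                X_out.append(self._rng.choice(rows))
                y_out.append(label)
        return X_out, y_out


# Toy data: three rows of class 0, two rows of class 1
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [8.0, 80.0], [9.0, 90.0]]
y = [0, 0, 0, 1, 1]

X_res, y_res = RandomDuplicateOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```

The original rows are kept in place and only extra minority rows are appended, which mirrors how over-sampling methods grow the training set rather than shrink it.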

### Example

In [9]:
# load dataset
from pycaret.datasets import get_data
credit = get_data('credit')

# init setup
from pycaret.classification import *
clf1 = setup(
    data=credit,
    target='default',
    fix_imbalance=True
)

Unnamed: 0,Description,Value
0,session_id,3068
1,Target,default
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(24000, 24)"
5,Missing Values,False
6,Numeric Features,14
7,Categorical Features,9
8,Ordinal Features,False
9,High Cardinality Features,False


In [12]:
display(credit)
display(get_config("X"))
display(get_config("y"))

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,90000,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
2,50000,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
3,50000,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
4,50000,1,1,2,37,0,0,0,0,0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23995,80000,1,2,2,34,2,2,2,2,2,...,77519.0,82607.0,81158.0,7000.0,3500.0,0.0,7000.0,0.0,4000.0,1
23996,150000,1,3,2,43,-1,-1,-1,-1,0,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0
23997,30000,1,2,2,37,4,3,2,-1,0,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1
23998,80000,1,3,1,41,1,-1,0,0,0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1


Unnamed: 0,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,...,PAY_6_-1,PAY_6_-2,PAY_6_0,PAY_6_2,PAY_6_3,PAY_6_4,PAY_6_5,PAY_6_6,PAY_6_7,PAY_6_8
0,20000.0,24.0,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,90000.0,34.0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,50000.0,37.0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,50000.0,57.0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,50000.0,37.0,64400.0,57069.0,57608.0,19394.0,19619.0,20024.0,2500.0,1815.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23995,80000.0,34.0,72557.0,77708.0,79384.0,77519.0,82607.0,81158.0,7000.0,3500.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
23996,150000.0,43.0,1683.0,1828.0,3502.0,8979.0,5190.0,0.0,1837.0,3526.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23997,30000.0,37.0,3565.0,3356.0,2758.0,20878.0,20582.0,19357.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23998,80000.0,41.0,-1645.0,78379.0,76304.0,52774.0,11855.0,48944.0,85900.0,3409.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


0        1
1        0
2        0
3        0
4        0
        ..
23995    1
23996    0
23997    1
23998    1
23999    1
Name: default, Length: 24000, dtype: int64

#### Before and After SMOTE 
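
The effect of SMOTE on the class counts can be sketched without any library. The function below is a simplified illustration of the core idea (interpolating between a minority row and one of its nearest minority neighbours), not the imblearn implementation. The name `smote_sketch`, the parameter `k`, and the toy data are assumptions, and the sketch requires at least two minority rows.

```python
import math
import random
from collections import Counter


def smote_sketch(X, y, minority_label, k=2, random_state=0):
    """Add synthetic minority rows until the minority matches the largest class."""
    rng = random.Random(random_state)
    minority = [row for row, lab in zip(X, y) if lab == minority_label]
    n_needed = max(Counter(y).values()) - len(minority)
    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding the row itself)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        X_out.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
        y_out.append(minority_label)
    return X_out, y_out


# Toy data: six rows of class 0 and three rows of class 1
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1],
     [2.0, 2.0], [2.2, 2.1], [2.1, 2.4]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_res, y_res = smote_sketch(X, y, minority_label=1)
print("before:", dict(Counter(y)))      # before: {0: 6, 1: 3}
print("after: ", dict(Counter(y_res)))  # after:  {0: 6, 1: 6}
```

Unlike plain duplication, each synthetic row is a new point lying between two existing minority rows, so the minority class is filled in rather than merely repeated.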

## Remove Outliers<a id="remove_outliers"></a>

The `remove_outliers` parameter in PyCaret identifies and removes outliers from the training data before the model is trained. Outliers are identified through PCA linear dimensionality reduction using the Singular Value Decomposition technique. Enable it by passing `remove_outliers=True` to `setup`. The proportion of outliers is controlled by the `outliers_threshold` parameter.

### PARAMETERS

- **remove_outliers**: bool, default = False
  - When set to `True`, outliers are removed from the training data using PCA linear dimensionality reduction with the Singular Value Decomposition technique.
- **outliers_threshold**: float, default = 0.05
  - The proportion of outliers in the dataset. With the default of 0.05, a fraction of 0.025 of the values on each side of the distribution's tail is dropped from the training data.

### Example

In [13]:
# load dataset
from pycaret.datasets import get_data
insurance = get_data('insurance')

# init setup
from pycaret.regression import *
reg1 = setup(
    data=insurance,
    target='charges',
    remove_outliers=True
)

Unnamed: 0,Description,Value
0,session_id,5330
1,Target,charges
2,Original Data,"(1338, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(889, 12)"


#### Before and After removing outliers
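
One way to check the effect is to compare the "Transformed Train Set" row counts reported by `setup`. The earlier insurance example without outlier removal reported `(936, 14)`, while the run above reports `(889, 12)`. The short calculation below compares the difference with the default `outliers_threshold`. It assumes both runs used the same default train/test split size, and the exact rounding PyCaret applies is not confirmed here.

```python
# Train-set row counts reported by setup() in the two insurance examples
rows_without_removal = 936   # remove_outliers not set
rows_with_removal = 889      # remove_outliers=True
outliers_threshold = 0.05    # default value

removed = rows_without_removal - rows_with_removal
share_removed = removed / rows_without_removal

print(f"rows removed: {removed}")
print(f"share of training rows removed: {share_removed:.3f}")
print(f"expected from threshold: {rows_without_removal * outliers_threshold:.1f} rows")
```

The 47 removed rows correspond to roughly 5% of the 936 training rows, which is consistent with the default threshold of 0.05.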