In [2]:
import pandas as pd

pd.set_option("display.max_columns", None)

# [Feature Selection][def]

[def]: https://pycaret.gitbook.io/docs/get-started/preprocessing/feature-selection

## Feature Selection <a id="feature_selection"></a>

Feature selection is the process of choosing the features in the dataset that contribute most to predicting the target variable. Working with a selected subset instead of all features reduces the risk of over-fitting, can improve accuracy, and decreases training time. In PyCaret, this is enabled with the `feature_selection` parameter. It uses a combination of several supervised feature selection techniques to select the subset of features that are most important for modeling. The size of the subset can be controlled using the `feature_selection_threshold` parameter within `setup`.

### PARAMETERS

- **feature_selection**: bool, default = False
  - When set to `True`, a subset of features is selected using a combination of various permutation importance techniques, including Random Forest, Adaboost, and linear correlation with the target variable. The size of the subset depends on the `feature_selection_threshold` param. Generally, this is used to constrain the feature space in order to improve modeling efficiency. When `polynomial_features` and `feature_interaction` are used, it is highly recommended to define the `feature_selection_threshold` param with a lower value.
- **feature_selection_threshold**: float, default = 0.8
  - Threshold used for feature selection (including newly created polynomial features). A higher value results in a larger feature space. It is recommended to run multiple trials with different values of `feature_selection_threshold`, especially when `polynomial_features` and `feature_interaction` are used. Setting a very low value may be efficient but could result in under-fitting.
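
To make the idea of a size-controlling threshold concrete, here is a minimal NumPy sketch of a much simpler filter: rank features by absolute correlation with the target and keep a fixed fraction. This is not PyCaret's internal algorithm. The function name and the `keep_fraction` argument are hypothetical, and `keep_fraction` is only loosely analogous to `feature_selection_threshold` (a higher value keeps more features).

```python
import numpy as np


def select_by_target_correlation(X, y, names, keep_fraction=0.8):
    """Keep the fraction of features most correlated (in absolute value) with y.

    Illustrative filter only; hypothetical helper, not part of PyCaret.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Absolute Pearson correlation of each column with the target.
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    n_keep = max(1, int(round(keep_fraction * X.shape[1])))
    # Indices of the n_keep highest scores, restored to original column order.
    top = np.sort(np.argsort(scores)[::-1][:n_keep])
    return [names[j] for j in top]


# Synthetic data: one informative column and three pure-noise columns.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
noise = rng.normal(size=(200, 3))
X = np.column_stack([signal, noise])
y = 3.0 * signal + rng.normal(scale=0.1, size=200)
names = ["signal", "noise_a", "noise_b", "noise_c"]

selected = select_by_target_correlation(X, y, names, keep_fraction=0.5)
print(selected)
```

Lowering `keep_fraction` shrinks the retained set, which mirrors the trade-off described above: a smaller feature space is cheaper to model but may discard useful information.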

### Example

In [1]:
# load dataset
from pycaret.datasets import get_data

diabetes = get_data('diabetes')

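# init setup
# Note: 'Class variable' is a binary label, so pycaret.classification would be the
# natural module; the output below was produced with pycaret.regression as written.
from pycaret.regression import *

clf1 = setup(data=diabetes, target='Class variable', feature_selection=True)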

Unnamed: 0,Description,Value
0,session_id,8268
1,Target,Class variable
2,Original Data,"(768, 9)"
3,Missing Values,False
4,Numeric Features,7
5,Categorical Features,1
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(537, 23)"


In [3]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


Unnamed: 0,Number of times pregnant_1,Number of times pregnant_3,Diabetes pedigree function,Age (years),Diastolic blood pressure (mm Hg),Number of times pregnant_13,Number of times pregnant_14,Number of times pregnant_12,Number of times pregnant_11,2-Hour serum insulin (mu U/ml),Number of times pregnant_6,Number of times pregnant_9,Number of times pregnant_4,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Number of times pregnant_5,Number of times pregnant_2,Body mass index (weight in kg/(height in m)^2),Number of times pregnant_0,Triceps skin fold thickness (mm),Number of times pregnant_7,Number of times pregnant_17,Number of times pregnant_8,Number of times pregnant_10
0,0.0,0.0,0.627,50.0,72.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,148.0,0.0,0.0,33.599998,0.0,35.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.351,31.0,66.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,85.0,0.0,0.0,26.600000,0.0,29.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.672,32.0,64.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,183.0,0.0,0.0,23.299999,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.167,21.0,66.0,0.0,0.0,0.0,0.0,94.0,0.0,0.0,0.0,89.0,0.0,0.0,28.100000,0.0,23.0,0.0,0.0,0.0,0.0
4,0.0,0.0,2.288,33.0,40.0,0.0,0.0,0.0,0.0,168.0,0.0,0.0,0.0,137.0,0.0,0.0,43.099998,1.0,35.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,0.0,0.0,0.171,63.0,76.0,0.0,0.0,0.0,0.0,180.0,0.0,0.0,0.0,101.0,0.0,0.0,32.900002,0.0,48.0,0.0,0.0,0.0,1.0
764,0.0,0.0,0.340,27.0,70.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,122.0,0.0,1.0,36.799999,0.0,27.0,0.0,0.0,0.0,0.0
765,0.0,0.0,0.245,30.0,72.0,0.0,0.0,0.0,0.0,112.0,0.0,0.0,0.0,121.0,1.0,0.0,26.200001,0.0,23.0,0.0,0.0,0.0,0.0
766,1.0,0.0,0.349,47.0,60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126.0,0.0,0.0,30.100000,0.0,0.0,0.0,0.0,0.0,0.0


> In the output above, 23 transformed columns remain and "Triceps skin fold thickness (mm)" is still among them, so it was not dropped in this run. Which features `feature_selection` removes may differ between runs (for example with a different `session_id`), so compare the columns of the two frames to see what was actually dropped.

## Remove Multicollinearity <a id="remove_multicollinearity"></a>

**Multicollinearity** (also called collinearity) is a phenomenon in which one feature variable in the dataset is highly linearly correlated with another feature variable in the same dataset. Multicollinearity increases the variance of the coefficients, making them unstable and noisy for linear models. One way to deal with multicollinearity is to drop one of the two features that are highly correlated with each other. This can be achieved in PyCaret using the `remove_multicollinearity` parameter within `setup`.

### PARAMETERS

- **remove_multicollinearity**: bool, default = False
  - When set to `True`, variables with inter-correlations higher than the threshold defined under the `multicollinearity_threshold` param are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.
- **multicollinearity_threshold**: float, default = 0.9
  - Threshold used for dropping the correlated features. Only takes effect when `remove_multicollinearity` is set to `True`.
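
The rule described above (for each highly correlated pair, drop the feature that is less correlated with the target) can be sketched in a few lines of NumPy. This is an illustration of the rule as stated, not PyCaret's actual implementation; `drop_collinear` is a hypothetical helper and the data is synthetic.

```python
import numpy as np


def drop_collinear(X, y, names, threshold=0.9):
    """Return the names kept after removing one feature from each correlated pair.

    Hypothetical helper illustrating the rule in the text, not PyCaret code.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[1]
    # Absolute pairwise correlations between features (columns).
    feat_corr = np.abs(np.corrcoef(X, rowvar=False))
    # Absolute correlation of each feature with the target.
    target_corr = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n)]
    )
    dropped = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i in dropped or j in dropped:
                continue
            if feat_corr[i, j] > threshold:
                # Drop whichever feature of the pair is less correlated with y.
                dropped.add(i if target_corr[i] < target_corr[j] else j)
    return [names[k] for k in range(n) if k not in dropped]


# Synthetic data: b is a noisy copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = a + 0.4 * rng.normal(size=2000)
c = rng.normal(size=2000)
y = 2.0 * a + c

kept = drop_collinear(np.column_stack([a, b, c]), y, ["a", "b", "c"], threshold=0.9)
print(kept)
```

Here `a` and `b` exceed the threshold, and `b` is removed because it tracks the target less closely than `a`. A lower threshold, such as the 0.6 used in the example below, removes more features.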

### Example

In [4]:
# load dataset
from pycaret.datasets import get_data

concrete = get_data('concrete')

# init setup
from pycaret.regression import *

reg1 = setup(
    data=concrete,
    target='strength',
    remove_multicollinearity=True,
    multicollinearity_threshold=0.6
)

Unnamed: 0,Description,Value
0,session_id,6102
1,Target,strength
2,Original Data,"(1030, 9)"
3,Missing Values,False
4,Numeric Features,7
5,Categorical Features,1
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(720, 20)"


In [5]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.30
...,...,...,...,...,...,...,...,...,...
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.28
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28,31.18
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28,23.70
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28,32.77


Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day)_1,Age (day)_100,Age (day)_120,Age (day)_14,Age (day)_180,Age (day)_270,Age (day)_28,Age (day)_3,Age (day)_360,Age (day)_365,Age (day)_56,Age (day)_7,Age (day)_90,Age (day)_91
0,540.000000,0.000000,0.000000,2.5,1040.000000,676.000000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,540.000000,0.000000,0.000000,2.5,1055.000000,676.000000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,332.500000,142.500000,0.000000,0.0,932.000000,594.000000,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,332.500000,142.500000,0.000000,0.0,932.000000,594.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,198.600006,132.399994,0.000000,0.0,978.400024,825.500000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1025,276.399994,116.000000,90.300003,8.9,870.099976,768.299988,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1026,322.200012,0.000000,115.599998,10.4,817.900024,813.400024,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1027,148.500000,139.399994,108.599998,6.1,892.400024,780.000000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1028,159.100006,186.699997,0.000000,11.3,989.599976,788.900024,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


> Notice how the `Water (component 4)(kg in a m^3 mixture)` column is dropped from the dataset after using the `remove_multicollinearity` parameter in `setup`.
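
Spotting a missing column by eye is error-prone once one-hot encoding has renamed some features. The sketch below compares the two sets of column names programmatically. `dropped_columns` is a hypothetical helper, and it assumes encoded columns are named `<original>_<level>`, as in the output above. The column names are copied from that output, with most `Age (day)_*` levels omitted for brevity.

```python
def dropped_columns(before, after, target):
    """Original feature columns with no counterpart among the transformed columns.

    Hypothetical helper; assumes one-hot columns are named '<original>_<level>'.
    """
    after = list(after)
    missing = []
    for col in before:
        if col == target:
            continue
        # A column survives if it appears verbatim or as a one-hot prefix.
        kept = any(a == col or a.startswith(col + "_") for a in after)
        if not kept:
            missing.append(col)
    return missing


before = [
    "Cement (component 1)(kg in a m^3 mixture)",
    "Blast Furnace Slag (component 2)(kg in a m^3 mixture)",
    "Fly Ash (component 3)(kg in a m^3 mixture)",
    "Water (component 4)(kg in a m^3 mixture)",
    "Superplasticizer (component 5)(kg in a m^3 mixture)",
    "Coarse Aggregate (component 6)(kg in a m^3 mixture)",
    "Fine Aggregate (component 7)(kg in a m^3 mixture)",
    "Age (day)",
    "strength",
]
after = [
    "Cement (component 1)(kg in a m^3 mixture)",
    "Blast Furnace Slag (component 2)(kg in a m^3 mixture)",
    "Fly Ash (component 3)(kg in a m^3 mixture)",
    "Superplasticizer (component 5)(kg in a m^3 mixture)",
    "Coarse Aggregate (component 6)(kg in a m^3 mixture)",
    "Fine Aggregate (component 7)(kg in a m^3 mixture)",
    "Age (day)_1",
    "Age (day)_28",
    "Age (day)_365",  # remaining Age (day)_* levels omitted
]

missing = dropped_columns(before, after, target="strength")
print(missing)
```

In a notebook, the same helper could be called with `get_config("data_before_preprocess").columns` and `get_config("X").columns` after `setup` has run.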

## Principal Component Analysis <a id="principal_component_analysis"></a>

Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of data. It compresses the feature space by identifying a subspace that captures most of the information in the complete feature matrix, then projects the original features onto that lower-dimensional subspace. This can be achieved in PyCaret using the `pca` parameter within `setup`.

### PARAMETERS

- **pca**: bool, default = False
  - When set to `True`, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in the `pca_method` param. In supervised learning, PCA is generally performed when dealing with a high-dimensional feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different values of `pca_method` to evaluate the impact.
- **pca_method**: string, default = 'linear'
  - The 'linear' method performs linear dimensionality reduction using Singular Value Decomposition. The other available options are:
    - **kernel**: dimensionality reduction through the use of an RBF kernel.
    - **incremental**: replacement for 'linear' PCA when the dataset to be decomposed is too large to fit in memory.
- **pca_components**: int/float, default = 0.99
  - Number of components to keep. If `pca_components` is a float, it is treated as a target percentage for information retention. When `pca_components` is an integer, it is treated as the number of features to be kept. `pca_components` must be strictly less than the original number of features in the dataset.
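
As a rough illustration of what the 'linear' method and the two forms of `pca_components` mean, the following NumPy sketch performs PCA via Singular Value Decomposition and accepts either a float (target share of variance to retain) or an integer (number of components). It is a simplified sketch under those assumptions, not the code PyCaret runs; `linear_pca` is a hypothetical name.

```python
import numpy as np


def linear_pca(X, components=0.99):
    """Project X onto its leading principal components using SVD.

    components: float -> keep the fewest components whose cumulative explained
                variance reaches that share; int -> keep exactly that many.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # PCA requires centered data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # variance share per component
    if isinstance(components, float):
        # First position where the cumulative share reaches the target.
        k = int(np.searchsorted(np.cumsum(explained), components)) + 1
        k = min(k, len(S))
    else:
        k = int(components)
    return Xc @ Vt[:k].T, explained[:k]


# Synthetic data: five observed features driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0, 1.0, 0.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(300, 5))

scores, ratios = linear_pca(X, 0.99)   # float: retain 99% of the variance
scores_int, _ = linear_pca(X, 3)       # int: keep exactly three components
print(scores.shape, scores_int.shape)
```

Because the synthetic data has only two underlying factors, the float form settles on two components, while the integer form returns exactly the number requested.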

### Example

In [6]:
# load dataset
from pycaret.datasets import get_data

income = get_data('income')

# init setup
from pycaret.classification import *

clf1 = setup(
    data=income,
    target='income >50K',
    pca=True,
    pca_components=10
)

Unnamed: 0,Description,Value
0,session_id,7178
1,Target,income >50K
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(32561, 14)"
5,Missing Values,True
6,Numeric Features,4
7,Categorical Features,9
8,Ordinal Features,False
9,High Cardinality Features,False


In [7]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income >50K
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


Unnamed: 0,Component_1,Component_2,Component_3,Component_4,Component_5,Component_6,Component_7,Component_8,Component_9,Component_10
0,1065.773804,-84.827164,0.354766,-0.524713,0.796512,-0.726226,0.774422,-0.644547,-0.295538,0.027417
1,-1108.225220,-88.514641,3.501306,-29.358192,-1.258501,-0.787691,0.596025,-0.136047,-0.608446,-0.071002
2,-1108.223511,-88.491600,-0.200842,-0.054634,0.595231,0.855925,0.118824,-0.451460,-0.234098,-0.598582
3,-1108.221313,-88.460480,14.167626,-4.394745,-0.605601,0.110577,-0.398486,0.149176,0.131468,1.166092
4,-1108.224854,-88.512032,-9.762453,2.838964,-0.164990,-0.362061,-0.391917,0.968325,-1.301376,1.162937
...,...,...,...,...,...,...,...,...,...,...
32556,-1108.225342,-88.517700,-11.298690,1.213544,-0.334073,-0.025569,-0.478066,0.615813,-0.628601,-0.338749
32557,-1108.223145,-88.487267,1.736079,-0.624499,-0.990635,0.816939,-0.039876,-0.110774,-0.078815,-0.094960
32558,-1108.220581,-88.450470,18.928707,-5.861992,1.040147,0.892496,-0.313320,0.601368,-0.000986,-0.190212
32559,-1108.228271,-88.560326,-21.330334,-14.564683,-0.096841,0.683126,0.286220,-0.326368,0.325341,0.019620


## Ignore Low Variance <a id="ignore_low_variance"></a>

Sometimes a dataset has a categorical feature with multiple levels whose distribution is skewed, so that one level dominates the others. Such a feature carries little variation in the information it provides, so it may add little to an ML model and can be ignored for modeling. This can be achieved in PyCaret using the `ignore_low_variance` parameter within `setup`. Both conditions below must be met for a feature to be considered a low variance feature:

- Count of unique values in a feature / sample size < 10%
- Count of most common value / count of second most common value > 20

### PARAMETERS

- **ignore_low_variance**: bool, default = False
  - When set to `True`, all categorical features with statistically insignificant variances are removed from the dataset. The variance is assessed using the ratio of unique values to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value.
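
The two conditions can be checked directly on a column of values. The sketch below uses only the standard library. `is_low_variance` and its argument names are hypothetical, and treating a column with a single level as fully dominated is an assumption made here so that a constant column qualifies, which matches the behavior seen in the example that follows.

```python
from collections import Counter


def is_low_variance(values, unique_ratio_max=0.10, dominance_min=20):
    """Return True when both low-variance conditions from the text hold.

    Hypothetical helper; not PyCaret's implementation.
    """
    n = len(values)
    if n == 0:
        return False
    counts = Counter(values).most_common()
    # Condition 1: unique values / sample size must be below 10%.
    unique_ratio = len(counts) / n
    if len(counts) < 2:
        # Assumption: with a single level there is no runner-up to compare
        # against, so the most common level is treated as fully dominant.
        dominance = float("inf")
    else:
        # Condition 2: most common count / second most common count > 20.
        dominance = counts[0][1] / counts[1][1]
    return unique_ratio < unique_ratio_max and dominance > dominance_min


constant = ["Control"] * 1080              # a single level
balanced = ["a"] * 500 + ["b"] * 500       # two equally frequent levels
skewed = ["a"] * 1050 + ["b"] * 30         # dominant level is 35x the runner-up

print(is_low_variance(constant), is_low_variance(balanced), is_low_variance(skewed))
```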

### Example

In [15]:
# load dataset
from pycaret.datasets import get_data

mice = get_data('mice')

# overwrite 'Genotype' with a single constant level so that it becomes a low variance feature
mice['Genotype'] = 'Control'

# init setup
from pycaret.classification import *

clf1 = setup(
    data=mice,
    target='class',
    ignore_low_variance=True
)

Unnamed: 0,Description,Value
0,session_id,7070
1,Target,class
2,Target Type,Multiclass
3,Label Encoded,"c-CS-m: 0, c-CS-s: 1, c-SC-m: 2, c-SC-s: 3, t-CS-m: 4, t-CS-s: 5, t-SC-m: 6, t-SC-s: 7"
4,Original Data,"(1080, 82)"
5,Missing Values,True
6,Numeric Features,77
7,Categorical Features,4
8,Ordinal Features,False
9,High Cardinality Features,False


In [11]:
# reload the original dataset, where 'Genotype' has not been overwritten
mice = get_data('mice')

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class
0,309_1,0.503644,0.747193,0.430175,2.816329,5.990152,0.21883,0.177565,2.373744,0.232224,...,0.108336,0.427099,0.114783,0.13179,0.128186,1.675652,Control,Memantine,C/S,c-CS-m
1,309_2,0.514617,0.689064,0.41177,2.789514,5.685038,0.211636,0.172817,2.29215,0.226972,...,0.104315,0.441581,0.111974,0.135103,0.131119,1.74361,Control,Memantine,C/S,c-CS-m
2,309_3,0.509183,0.730247,0.418309,2.687201,5.622059,0.209011,0.175722,2.283337,0.230247,...,0.106219,0.435777,0.111883,0.133362,0.127431,1.926427,Control,Memantine,C/S,c-CS-m
3,309_4,0.442107,0.617076,0.358626,2.466947,4.979503,0.222886,0.176463,2.152301,0.207004,...,0.111262,0.391691,0.130405,0.147444,0.146901,1.700563,Control,Memantine,C/S,c-CS-m
4,309_5,0.43494,0.61743,0.358802,2.365785,4.718679,0.213106,0.173627,2.134014,0.192158,...,0.110694,0.434154,0.118481,0.140314,0.14838,1.83973,Control,Memantine,C/S,c-CS-m


In [16]:
display(get_config("data_before_preprocess"))

display(get_config("X"))

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class
0,309_1,0.503644,0.747193,0.430175,2.816329,5.990152,0.218830,0.177565,2.373744,0.232224,...,0.108336,0.427099,0.114783,0.131790,0.128186,1.675652,Control,Memantine,C/S,c-CS-m
1,309_2,0.514617,0.689064,0.411770,2.789514,5.685038,0.211636,0.172817,2.292150,0.226972,...,0.104315,0.441581,0.111974,0.135103,0.131119,1.743610,Control,Memantine,C/S,c-CS-m
2,309_3,0.509183,0.730247,0.418309,2.687201,5.622059,0.209011,0.175722,2.283337,0.230247,...,0.106219,0.435777,0.111883,0.133362,0.127431,1.926427,Control,Memantine,C/S,c-CS-m
3,309_4,0.442107,0.617076,0.358626,2.466947,4.979503,0.222886,0.176463,2.152301,0.207004,...,0.111262,0.391691,0.130405,0.147444,0.146901,1.700563,Control,Memantine,C/S,c-CS-m
4,309_5,0.434940,0.617430,0.358802,2.365785,4.718679,0.213106,0.173627,2.134014,0.192158,...,0.110694,0.434154,0.118481,0.140314,0.148380,1.839730,Control,Memantine,C/S,c-CS-m
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1075,J3295_11,0.254860,0.463591,0.254860,2.092082,2.600035,0.211736,0.171262,2.483740,0.207317,...,0.183324,0.374088,0.318782,0.204660,0.328327,1.364823,Control,Saline,S/C,t-SC-s
1076,J3295_12,0.272198,0.474163,0.251638,2.161390,2.801492,0.251274,0.182496,2.512737,0.216339,...,0.175674,0.375259,0.325639,0.200415,0.293435,1.364478,Control,Saline,S/C,t-SC-s
1077,J3295_13,0.228700,0.395179,0.234118,1.733184,2.220852,0.220665,0.161435,1.989723,0.185164,...,0.158296,0.422121,0.321306,0.229193,0.355213,1.430825,Control,Saline,S/C,t-SC-s
1078,J3295_14,0.221242,0.412894,0.243974,1.876347,2.384088,0.208897,0.173623,2.086028,0.192044,...,0.196296,0.397676,0.335936,0.251317,0.365353,1.404031,Control,Saline,S/C,t-SC-s


Unnamed: 0,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,MouseID_J3295_12,MouseID_J3295_13,MouseID_J3295_14,MouseID_J3295_15,MouseID_J3295_2,MouseID_J3295_6,MouseID_J3295_8,MouseID_J3295_9,Treatment_Memantine,Behavior_C/S
0,0.503644,0.747193,0.430175,2.816329,5.990152,0.218830,0.177565,2.373744,0.232224,1.750936,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,0.514617,0.689064,0.411770,2.789514,5.685038,0.211636,0.172817,2.292150,0.226972,1.596377,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,0.509183,0.730247,0.418309,2.687201,5.622058,0.209011,0.175722,2.283337,0.230247,1.561316,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,0.442107,0.617076,0.358626,2.466947,4.979503,0.222886,0.176463,2.152301,0.207004,1.595086,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
4,0.434940,0.617430,0.358802,2.365785,4.718678,0.213106,0.173627,2.134014,0.192158,1.504230,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1075,0.254860,0.463591,0.254860,2.092082,2.600035,0.211736,0.171262,2.483740,0.207317,1.057971,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1076,0.272198,0.474163,0.251638,2.161390,2.801492,0.251274,0.182496,2.512737,0.216339,1.081150,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1077,0.228700,0.395179,0.234118,1.733184,2.220852,0.220665,0.161435,1.989723,0.185164,0.884342,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1078,0.221242,0.412894,0.243974,1.876347,2.384088,0.208897,0.173623,2.086028,0.192044,0.922595,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


> Notice that the "Genotype" column has been dropped after using the `ignore_low_variance` parameter in `setup`.