## Dimensionality Reduction

### Dimensionality Reduction Algorithms
- Linear Algebra Methods
    - Matrix factorization methods
        - https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
    - Principal Components Analysis
    - Singular Value Decomposition
    - Non-Negative Matrix Factorization
        - https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
- Manifold Learning Methods: seek a lower-dimensional projection of high-dimensional input that captures the salient properties of the input data.
    - Isomap Embedding
    - Locally Linear Embedding
    - Multidimensional Scaling
        - https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
    - Spectral Embedding
        - https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html
    - t-distributed Stochastic Neighbor Embedding
        - https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
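
Multidimensional Scaling, Spectral Embedding, and t-SNE appear in the list above but are not exercised later in this notebook. The following is a minimal sketch of how they might be called. The data set size, `n_components=2`, and the variable names `X`, `y`, and `embeddings` are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import MDS, SpectralEmbedding, TSNE

# Small synthetic data set (sizes are illustrative, chosen to keep the sketch fast).
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# Each estimator maps the 20-dimensional input to a 2-dimensional embedding.
embeddings = {
    "MDS": MDS(n_components=2, random_state=0).fit_transform(X),
    "SpectralEmbedding": SpectralEmbedding(n_components=2, random_state=0).fit_transform(X),
    # perplexity must be smaller than the number of samples.
    "TSNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
}

for name, Z in embeddings.items():
    print(name, Z.shape)
```

These three estimators expose `fit_transform` but no separate `transform` for unseen data, so they do not drop into a cross-validated `Pipeline` the way PCA or Isomap do.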

In [None]:
# !pip install scikit-learn

In [1]:
import sklearn
print(sklearn.__version__)
# requires version 0.23.0 or later

1.2.2


## Generating Simulated Data: Classification

sklearn.datasets.***

<br>

- make_regression:
    - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html
- make_classification:
    - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
- make_multilabel_classification:
    - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html

<br>

Options
- n_samples = 100
- n_features = 20
- n_informative = 2
- random_state
- make_regression
    - n_targets = 1
    - noise = 0.0
- make_classification
    - n_redundant = 2: random linear combinations of the informative features
- make_multilabel_classification
    - n_classes = 5
    - n_labels = 2: the number of labels per sample is drawn from a Poisson distribution with n_labels as its expected value, but samples are bounded (using rejection sampling) by n_classes, and must be nonzero if allow_unlabeled is False
    - length = 50: the sum of the features (number of words if documents) is drawn from a Poisson distribution with this expected value.
    - For each sample, the generative process is:
        - pick the number of labels: n ~ Poisson(n_labels)
        - n times, choose a class c: c ~ Multinomial(theta)
        - pick the document length: k ~ Poisson(length)
        - k times, choose a word: w ~ Multinomial(theta_c)
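
The options above can be tried out directly on the generators that this notebook does not call later. This is a minimal sketch with the listed values passed explicitly; the variable names (`X_reg`, `y_reg`, `X_ml`, `Y_ml`) are hypothetical.

```python
from sklearn.datasets import make_multilabel_classification, make_regression

# Regression data: 100 samples, 20 features, 2 informative, one noiseless target.
X_reg, y_reg = make_regression(n_samples=100, n_features=20, n_informative=2,
                               n_targets=1, noise=0.0, random_state=0)
print(X_reg.shape, y_reg.shape)

# Multilabel data: Y_ml is a dense 0/1 indicator matrix with one column per class.
X_ml, Y_ml = make_multilabel_classification(n_samples=100, n_features=20,
                                            n_classes=5, n_labels=2, length=50,
                                            allow_unlabeled=False, random_state=0)
print(X_ml.shape, Y_ml.shape)

# Number of labels per sample: at least 1 (allow_unlabeled=False), at most n_classes.
labels_per_sample = Y_ml.sum(axis=1)
print(labels_per_sample.min(), labels_per_sample.max())
```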

In [3]:
from sklearn.datasets import make_classification
설명변수, 반응변수 = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=0)

In [4]:
import statsmodels.api as sm
# with statsmodels
설명변수 = sm.add_constant(설명변수) # adding a constant
print(sm.Logit(반응변수, 설명변수).fit().summary())

Optimization terminated successfully.
         Current function value: 0.229467
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      994
Method:                           MLE   Df Model:                            5
Date:                Mon, 17 Apr 2023   Pseudo R-squ.:                  0.6689
Time:                        16:09:42   Log-Likelihood:                -229.47
converged:                       True   LL-Null:                       -693.15
Covariance Type:            nonrobust   LLR p-value:                3.185e-198
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8778      0.225      3.898      0.000       0.436       1.319
x1             0.3045      0.

In [5]:
# sample size=1000, number of features=20, informative features=5, redundant features=10
설명변수, 반응변수 = make_classification(n_samples=1000, n_features=20, n_informative=5, n_redundant=10, random_state=0)
설명변수 = sm.add_constant(설명변수) # adding a constant
print(sm.Logit(반응변수, 설명변수).fit().summary())

         Current function value: 0.243352
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      989
Method:                           MLE   Df Model:                           10
Date:                Mon, 17 Apr 2023   Pseudo R-squ.:                  0.6489
Time:                        16:09:42   Log-Likelihood:                -243.35
converged:                      False   LL-Null:                       -693.14
Covariance Type:            nonrobust   LLR p-value:                7.860e-187
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8979      0.217      4.141      0.000       0.473       1.323
x1             0.1679      0.122      1.376      0.169      -0.07
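
The summary above reports `Df Model: 10` for 20 features and `converged: False`. One plausible explanation is that the redundant features are exact linear combinations of the informative ones, which would leave the design matrix rank-deficient. The sketch below checks the rank on freshly generated data with the same settings; `X`, `y`, and `rank` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the cell above: 5 informative, 10 redundant, 5 noise features.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# If the redundant columns add no new directions, the rank is below the column count.
rank = np.linalg.matrix_rank(X)
print('columns:', X.shape[1], 'rank:', rank)
```

A rank below the number of columns means the coefficients are not uniquely identified, which may explain why the optimizer stops at the iteration limit.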



### Logistic Regression on Dimensionality-Reduced Data
Performance is compared using repeated stratified 10-fold cross-validation.

In [6]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

In [8]:
# Model setup
분석방법 = LogisticRegression()

# Evaluate on the original data (10 folds x 3 repeats = 30 scores)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
점수 = cross_val_score(분석방법, 설명변수, 반응변수, scoring='accuracy', cv = cv, n_jobs = -1)
print('Accuracy: %.3f (%.3f)' % (mean(점수), std(점수)))

Accuracy: 0.902 (0.026)
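
To see what the repeated stratified scheme used above produces, the splitter can be inspected on its own. This is a sketch on regenerated data, with hypothetical names `X`, `y`, `test_sizes`, and `test_positive_rates`. It counts the splits and compares the class balance of each test fold against the full data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Total number of train/test splits: n_splits * n_repeats.
n_splits_total = cv.get_n_splits(X, y)

test_sizes = []
test_positive_rates = []
for train_idx, test_idx in cv.split(X, y):
    test_sizes.append(len(test_idx))
    # Share of class 1 in this test fold; stratification keeps it near y.mean().
    test_positive_rates.append(y[test_idx].mean())

print('splits:', n_splits_total)
print('test fold sizes:', sorted(set(test_sizes)))
print('overall positive rate: %.3f' % y.mean())
print('fold positive rates: %.3f to %.3f' % (min(test_positive_rates), max(test_positive_rates)))
```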


In [9]:
# Principal Components Analysis(PCA)
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

PCA(n_components=5).fit_transform(설명변수)

array([[-3.49116061, -3.18790272,  1.12960961,  0.54013697,  0.02654012],
       [-0.10276923, -3.50457665, -0.72794009,  0.05346297, -0.91530089],
       [-3.0979493 ,  3.53395139, -2.76254129, -4.2934772 , -0.83307225],
       ...,
       [-2.14847104,  6.97758979,  0.91435403, -0.2051838 , -0.81127541],
       [ 2.44834879, -0.22556633,  4.41572754,  0.97962709,  1.22745587],
       [-3.67345152, -3.69101242, -1.22137635, -2.54358137, -0.43784689]])

In [10]:
과정 = [('pca', PCA(n_components=5)), ('m', LogisticRegression())]
모형 = Pipeline(steps=과정)
모형

In [11]:
점수 = cross_val_score(모형, 설명변수, 반응변수, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(점수), std(점수)))

Accuracy: 0.906 (0.028)
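
The pipeline above fixes `n_components=5`. As a sketch of how that choice might be examined, the loop below repeats the same evaluation for a few candidate values. The candidates (2, 5, 10) and the names `X`, `y`, `model`, and `results` are assumptions for illustration, and the data is regenerated without the added constant column.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

results = {}
for k in (2, 5, 10):
    # PCA is refit inside every training fold, so the test fold never leaks into it.
    model = Pipeline(steps=[('pca', PCA(n_components=k)), ('m', LogisticRegression())])
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    results[k] = (mean(scores), std(scores))
    print('n_components=%d  Accuracy: %.3f (%.3f)' % (k, *results[k]))
```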


In [13]:
# Singular Value Decomposition(SVD)
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
from sklearn.decomposition import TruncatedSVD
TruncatedSVD(n_components=5).fit_transform(설명변수)

array([[-3.70343229, -3.98650213, -1.37265822, -0.63471761,  1.03247114],
       [-0.32235725, -4.55947547, -0.71336881,  0.67398658, -0.48779025],
       [-3.28594421, -0.12982111,  5.53813371, -1.45593703, -3.56446996],
       ...,
       [-2.23701309,  4.47584669,  4.24237147, -1.62495065,  0.73184243],
       [ 2.28298104, -1.0570723 , -1.26324642, -3.26202823,  3.29992018],
       [-3.92271469, -5.39950294,  0.2761539 , -0.79135582, -2.22434667]])

In [14]:
과정 = [('svd', TruncatedSVD(n_components=5)), ('m', LogisticRegression())]
모형 = Pipeline(steps=과정)
점수 = cross_val_score(모형, 설명변수, 반응변수, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(점수), std(점수)))

Accuracy: 0.906 (0.027)


In [15]:
# Linear Discriminant Analysis(LDA)
# https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

LinearDiscriminantAnalysis(n_components=1).fit_transform(설명변수, 반응변수)
# n_components <= min(n_classes-1, n_features)

array([[-1.31747760e+00],
       [-1.34985350e+00],
       [ 3.25719223e-01],
       [ 1.41043834e+00],
       [-6.42487620e-02],
       [-8.98172353e-01],
       [ 6.32568776e-01],
       [-2.01219248e-01],
       [ 1.71324103e+00],
       [-7.63549109e-01],
       [ 2.74741311e+00],
       [ 9.73468800e-01],
       [ 1.15054845e+00],
       [-6.40329012e-01],
       [ 1.87612955e+00],
       [ 2.18406795e+00],
       [ 2.51852426e+00],
       [ 1.02624956e+00],
       [-9.33981598e-01],
       [ 1.90397275e+00],
       [-8.17790252e-01],
       [ 1.52270597e+00],
       [-1.22929989e+00],
       [ 5.67358233e-01],
       [-1.48637735e+00],
       [-9.30350437e-01],
       [ 4.56079458e-01],
       [ 3.08761806e-01],
       [ 2.24330984e+00],
       [-1.70411415e+00],
       [-1.25439532e+00],
       [-1.93248063e+00],
       [-2.29770054e+00],
       [-1.11434193e+00],
       [-9.77901055e-01],
       [-1.34675755e+00],
       [ 1.29388540e+00],
       [ 2.16106628e+00],
       [-1.8

In [16]:
과정 = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())]

모형 = Pipeline(steps=과정)
점수 = cross_val_score(모형, 설명변수, 반응변수, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(점수), std(점수)))

Accuracy: 0.904 (0.027)
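
The constraint `n_components <= min(n_classes-1, n_features)` noted in the LDA cell means a two-class problem allows only one discriminant component. The sketch below illustrates this on regenerated data (hypothetical names `X`, `y`, `Z`, `raised`), assuming a recent scikit-learn release in which exceeding the limit raises a `ValueError`.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# Two classes -> at most min(2 - 1, 20) = 1 component.
Z = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print(Z.shape)

# Asking for 2 components exceeds the limit.
raised = False
try:
    LinearDiscriminantAnalysis(n_components=2).fit(X, y)
except ValueError as err:
    raised = True
    print('ValueError:', err)
```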


In [17]:
# Isomap Embedding
# https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html
from sklearn.manifold import Isomap

Isomap(n_components=5).fit_transform(설명변수)

array([[-7.76254864, -0.66304379,  1.80835807,  0.62572071, -0.45239671],
       [-2.7732265 , -5.77186012,  0.65549934, -0.73085225, -2.05136113],
       [-1.32547897, 12.58390657, -5.94061377, -0.55662981, -2.72985668],
       ...,
       [ 9.62942053, 15.28561819, -0.58394826,  3.80816428,  1.22223658],
       [ 6.0879797 , -0.23694135, 11.87014937, -3.25689988,  3.55412476],
       [-9.50504788, -4.15974532,  0.46826841, -3.20340085, -4.754981  ]])

In [22]:
과정 = [('iso', Isomap(n_components=5)), ('m', LogisticRegression())]

모형 = Pipeline(steps=과정)
점수 = cross_val_score(모형, 설명변수, 반응변수, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(점수), std(점수)))

Accuracy: 0.928 (0.023)


In [19]:
# Locally Linear Embedding(LLE)
# https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html
from sklearn.manifold import LocallyLinearEmbedding

LocallyLinearEmbedding(n_components=5).fit_transform(설명변수)

array([[ 0.02761939, -0.00421555, -0.0208986 , -0.00688121, -0.04862449],
       [ 0.01966833,  0.00273742,  0.01956708, -0.03550667,  0.02456504],
       [-0.00465232, -0.02210735, -0.01575333,  0.01748055, -0.01811169],
       ...,
       [-0.03971646, -0.02712612, -0.0061399 ,  0.01108519, -0.01335942],
       [-0.01024942,  0.00884455, -0.00236471, -0.00549745, -0.01690963],
       [ 0.02385041, -0.00079897,  0.00029863, -0.0214865 ,  0.01566282]])

In [23]:
과정 = [('lle', LocallyLinearEmbedding(n_components=5)), ('m', LogisticRegression())]

모형 = Pipeline(steps=과정)
점수 = cross_val_score(모형, 설명변수, 반응변수, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(점수), std(점수)))

Accuracy: 0.875 (0.030)
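
The individual results above can be gathered into one side-by-side comparison. This is a rough sketch, not a reproduction of the numbers above: it regenerates the data without the constant column and, to keep the run short, uses 5 folds with a single repeat. The names `reducers` and `summary` are hypothetical.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import Isomap, LocallyLinearEmbedding
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
# Lighter scheme than the 10x3 used above (an assumption made for speed).
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)

# None stands for the baseline: logistic regression on the original features.
reducers = {
    'none': None,
    'pca': PCA(n_components=5),
    'svd': TruncatedSVD(n_components=5),
    'lda': LinearDiscriminantAnalysis(n_components=1),
    'iso': Isomap(n_components=5),
    'lle': LocallyLinearEmbedding(n_components=5),
}

summary = {}
for name, reducer in reducers.items():
    steps = [('m', LogisticRegression())]
    if reducer is not None:
        steps.insert(0, (name, reducer))
    scores = cross_val_score(Pipeline(steps=steps), X, y, scoring='accuracy', cv=cv)
    summary[name] = (mean(scores), std(scores))
    print('%-5s Accuracy: %.3f (%.3f)' % (name, *summary[name]))
```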


### Additional Information
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition

In [21]:
설명축소 = Isomap(n_components=5).fit_transform(설명변수)
설명축소 = sm.add_constant(설명축소) # adding a constant
print(sm.Logit(반응변수, 설명축소).fit().summary())

Optimization terminated successfully.
         Current function value: 0.185880
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      994
Method:                           MLE   Df Model:                            5
Date:                Mon, 17 Apr 2023   Pseudo R-squ.:                  0.7318
Time:                        16:10:39   Log-Likelihood:                -185.88
converged:                       True   LL-Null:                       -693.14
Covariance Type:            nonrobust   LLR p-value:                4.322e-217
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.9161      0.212      4.314      0.000       0.500       1.332
x1             0.4608      0.