 Scaling

① Scale: the magnitude (range) of the units in which data is measured
   Scaling: the process of adjusting the range of the data

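When using a distance-based model, scaling is essential: it puts variables with different units and ranges onto a comparable scale, so that no single variable dominates the distance.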
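As a rough illustration of why scale matters for distances, the sketch below uses a hypothetical four-row array (age in years, salary in dollars; these values are made up, not taken from the salary data). The helper name `euclidean` is an assumption for this example only.

```python
import numpy as np

# Hypothetical toy data: columns are [age in years, salary in dollars].
X = np.array([
    [25.0, 50_000.0],
    [60.0, 52_000.0],
    [27.0, 54_000.0],
    [45.0, 90_000.0],
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On the raw data the salary column dominates the distance.
raw_d01 = euclidean(X[0], X[1])  # 35 years apart, 2,000 dollars apart
raw_d02 = euclidean(X[0], X[2])  # 2 years apart, 4,000 dollars apart

# Standardize each column (z-score), then measure again.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_d01 = euclidean(Z[0], Z[1])
std_d02 = euclidean(Z[0], Z[2])

print("raw:         ", raw_d01, raw_d02)
print("standardized:", std_d01, std_d02)
```

On the raw values row 1 looks closer to row 0 than row 2 does, purely because of the salary units. After standardization the 35-year age gap counts, and the ordering reverses.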

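② Main scaling methods:

Standardization : subtracts the mean from each data point and divides the result by the standard deviation. The scaled data has a mean of 0 and a standard deviation of 1.
Robust Scaling : scales the data using the median and the interquartile range (IQR). It is similar to standardization but is less affected by outliers.
MinMaxScaler : rescales the data to the range from 0 to 1, using the minimum and maximum values. In scikit-learn it is implemented by the MinMaxScaler class.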
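The three definitions above can be written out by hand and compared with the scikit-learn classes. This is a minimal sketch on a made-up one-column array `x`. It assumes the default settings of each scaler (population standard deviation for StandardScaler, the 25th to 75th percentile range for RobustScaler, and the range 0 to 1 for MinMaxScaler).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single-column data.
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Standardization: (x - mean) / standard deviation (ddof=0).
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

# Robust scaling: (x - median) / IQR, where IQR = Q3 - Q1.
q1, median, q3 = np.percentile(x, [25, 50, 75], axis=0)
robust = (x - median) / (q3 - q1)

# Min-max scaling: (x - min) / (max - min).
minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Compare each hand-written formula with the library result.
matches = {
    "standard": np.allclose(standardized, StandardScaler().fit_transform(x)),
    "robust": np.allclose(robust, RobustScaler().fit_transform(x)),
    "minmax": np.allclose(minmax, MinMaxScaler().fit_transform(x)),
}
print(matches)
```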
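③ Suitable situations and characteristics of each scaling method:

Standardization : suitable when there are no outliers and the variable is approximately normally distributed.
Robust Scaling : suitable for data with many outliers.
Characteristic : works well for data that does not follow a normal distribution.
MinMaxScaler : suitable when the data should be confined to the range 0 to 1.
Commonly used with deep learning models.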
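To see the outlier sensitivity described above, the following sketch scales a made-up column that contains one extreme value. It then measures how much spread the remaining "normal" values keep after scaling. The array `values` and the dictionary `inlier_spread` are names chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical data: five ordinary values and one outlier (the last row).
values = np.array([[30.0], [32.0], [35.0], [38.0], [40.0], [500.0]])

scalers = {
    "StandardScaler": StandardScaler(),
    "RobustScaler": RobustScaler(),
    "MinMaxScaler": MinMaxScaler(),
}

# For each scaler, record the range covered by the non-outlier values.
inlier_spread = {}
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(values).ravel()
    inlier_spread[name] = scaled[:-1].max() - scaled[:-1].min()
    print(f"{name}: inlier spread = {inlier_spread[name]:.3f}")
```

With this input the mean, standard deviation, and maximum are all pulled by the outlier, so StandardScaler and MinMaxScaler squeeze the ordinary values into a narrow band. The median and IQR used by RobustScaler barely move.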


In [None]:
import pandas as pd
import numpy as np

In [6]:
salary_1 = pd.read_csv('~/data/salary_1.csv')

In [7]:
salary_2 = pd.read_csv('~/data/salary_2.csv')

In [8]:
salary_df = pd.concat([salary_1, salary_2])

In [9]:
salary_df

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Country,Race,Senior
0,32.0,Male,1,Software Engineer,5.0,90000,UK,White,0
1,28.0,Female,2,Data Analyst,3.0,65000,USA,Hispanic,0
2,45.0,Male,3,Manager,15.0,150000,Canada,White,1
3,36.0,Female,1,Sales Associate,7.0,60000,USA,Hispanic,0
4,52.0,Male,2,Director,20.0,200000,USA,Asian,0
...,...,...,...,...,...,...,...,...,...
2680,49.0,Female,3,Director of Marketing,20.0,200000,UK,Mixed,0
2681,32.0,Male,0,Sales Associate,3.0,50000,Australia,Australian,0
2682,30.0,Female,1,Financial Manager,4.0,55000,China,Chinese,0
2683,46.0,Male,2,Marketing Manager,14.0,140000,China,Korean,0


In [10]:
salary_df = salary_df.dropna()

In [11]:
gender_salary = salary_df.groupby('Gender')['Salary'].mean()

In [12]:
gender_salary

Gender
Female    107888.555814
Male      121393.353134
Name: Salary, dtype: float64

In [13]:
gender_salary = gender_salary.reset_index()

In [14]:
gender_salary

Unnamed: 0,Gender,Salary
0,Female,107888.555814
1,Male,121393.353134


In [15]:
salary_df = salary_df.merge(gender_salary, on = 'Gender', how = 'left')

In [16]:
salary_df

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary_x,Country,Race,Senior,Salary_y
0,32.0,Male,1,Software Engineer,5.0,90000,UK,White,0,121393.353134
1,28.0,Female,2,Data Analyst,3.0,65000,USA,Hispanic,0,107888.555814
2,45.0,Male,3,Manager,15.0,150000,Canada,White,1,121393.353134
3,36.0,Female,1,Sales Associate,7.0,60000,USA,Hispanic,0,107888.555814
4,52.0,Male,2,Director,20.0,200000,USA,Asian,0,121393.353134
...,...,...,...,...,...,...,...,...,...,...
6675,49.0,Female,3,Director of Marketing,20.0,200000,UK,Mixed,0,107888.555814
6676,32.0,Male,0,Sales Associate,3.0,50000,Australia,Australian,0,121393.353134
6677,30.0,Female,1,Financial Manager,4.0,55000,China,Chinese,0,107888.555814
6678,46.0,Male,2,Marketing Manager,14.0,140000,China,Korean,0,121393.353134


In [17]:
salary_df = salary_df.rename({'Salary_x': 'Salary', 'Salary_y': 'Gender_salary'}, axis = 1)

In [26]:
salary_df

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary,Country,Race,Senior,Gender_salary
0,32.0,Male,1,Software Engineer,5.0,90000,UK,White,0,121393.353134
1,28.0,Female,2,Data Analyst,3.0,65000,USA,Hispanic,0,107888.555814
2,45.0,Male,3,Manager,15.0,150000,Canada,White,1,121393.353134
3,36.0,Female,1,Sales Associate,7.0,60000,USA,Hispanic,0,107888.555814
4,52.0,Male,2,Director,20.0,200000,USA,Asian,0,121393.353134
...,...,...,...,...,...,...,...,...,...,...
6675,49.0,Female,3,Director of Marketing,20.0,200000,UK,Mixed,0,107888.555814
6676,32.0,Male,0,Sales Associate,3.0,50000,Australia,Australian,0,121393.353134
6677,30.0,Female,1,Financial Manager,4.0,55000,China,Chinese,0,107888.555814
6678,46.0,Male,2,Marketing Manager,14.0,140000,China,Korean,0,121393.353134
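The group-mean column built above with groupby, merge, and rename can also be produced in one step with `groupby(...).transform('mean')`, which returns a value for every row of the original frame. The sketch below shows this alternative on a hypothetical four-row frame (`toy_df`), not on the salary data itself.

```python
import pandas as pd

# Hypothetical miniature version of the salary table.
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Salary": [90000, 65000, 150000, 60000],
})

# transform('mean') broadcasts each group's mean back to its rows,
# so no merge or rename is needed.
toy_df["Gender_salary"] = toy_df.groupby("Gender")["Salary"].transform("mean")
print(toy_df)
```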


In [32]:
salary_df.drop(['Gender','Job Title', 'Country', 'Race'], axis = 1, inplace = True)

In [33]:
# Import the RobustScaler class
from sklearn.preprocessing import RobustScaler

In [34]:
# Create a RobustScaler instance named rs
rs = RobustScaler()

In [35]:
# Fit rs on salary_df (learns each column's median and IQR)
rs.fit(salary_df)

RobustScaler()

In [37]:
rs_df = rs.transform(salary_df)

In [38]:
rs_df = pd.DataFrame(rs_df, columns = salary_df.columns)

In [39]:
rs_df

Unnamed: 0,Age,Education Level,Years of Experience,Salary,Senior,Gender_salary
0,0.0,0.0,-0.222222,-0.277778,0.0,0.0
1,-0.4,1.0,-0.444444,-0.555556,0.0,-1.0
2,1.3,2.0,0.888889,0.388889,1.0,0.0
3,0.4,0.0,0.000000,-0.611111,0.0,-1.0
4,2.0,1.0,1.444444,0.944444,0.0,0.0
...,...,...,...,...,...,...
6675,1.7,2.0,1.444444,0.944444,0.0,-1.0
6676,0.0,-1.0,-0.444444,-0.722222,0.0,0.0
6677,-0.2,0.0,-0.333333,-0.666667,0.0,-1.0
6678,1.4,1.0,0.777778,0.277778,0.0,0.0
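After `fit`, a RobustScaler stores the per-column medians in `center_` and the per-column IQRs in `scale_`, and `inverse_transform` maps scaled values back to the original units. The sketch below inspects these on a hypothetical five-row frame (`toy_df`), since the salary CSV files are not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical data with two numeric columns.
toy_df = pd.DataFrame({
    "Age": [28.0, 32.0, 36.0, 45.0, 52.0],
    "Salary": [65000.0, 90000.0, 60000.0, 150000.0, 200000.0],
})

rs_toy = RobustScaler().fit(toy_df)

# Statistics learned during fit, labelled with the column names.
medians = pd.Series(rs_toy.center_, index=toy_df.columns)
iqrs = pd.Series(rs_toy.scale_, index=toy_df.columns)
print(medians)
print(iqrs)

# Keep both the column names and the original index on the scaled frame.
scaled = pd.DataFrame(rs_toy.transform(toy_df),
                      columns=toy_df.columns, index=toy_df.index)

# inverse_transform undoes the scaling.
restored = rs_toy.inverse_transform(scaled)
```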


In [40]:
# Import the MinMaxScaler class
from sklearn.preprocessing import MinMaxScaler

In [42]:
# Create a MinMaxScaler instance named mm
mm = MinMaxScaler()

In [43]:
# Fit mm on salary_df and transform it in a single call, storing the result as mm_df
mm_df = mm.fit_transform(salary_df)

In [44]:
# Convert mm_df to a pandas DataFrame, reusing the original column names
mm_df = pd.DataFrame(mm_df, columns = salary_df.columns)
mm_df

Unnamed: 0,Age,Education Level,Years of Experience,Salary,Senior,Gender_salary
0,0.268293,0.333333,0.072289,0.359103,0.0,1.0
1,0.170732,0.666667,0.048193,0.258963,0.0,0.0
2,0.585366,1.000000,0.192771,0.599439,1.0,1.0
3,0.365854,0.333333,0.096386,0.238935,0.0,0.0
4,0.756098,0.666667,0.253012,0.799720,0.0,1.0
...,...,...,...,...,...,...
6675,0.682927,1.000000,0.253012,0.799720,0.0,0.0
6676,0.268293,0.000000,0.048193,0.198878,0.0,1.0
6677,0.219512,0.333333,0.060241,0.218906,0.0,0.0
6678,0.609756,0.666667,0.180723,0.559383,0.0,1.0
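`fit_transform` combines the two steps that the RobustScaler cells ran separately: `fit` learns the minimum and maximum, and `transform` applies them. Keeping the steps separate matters when the same fitted scaler is applied to other data. The sketch below uses hypothetical arrays (`train`, `new_data`) and assumes the default MinMaxScaler settings, under which values outside the fitted range are not clipped.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the scaler is fitted on `train` only.
train = np.array([[10.0], [20.0], [30.0]])
new_data = np.array([[5.0], [40.0]])

mm_toy = MinMaxScaler()
mm_toy.fit(train)                        # learns min=10, max=30
train_scaled = mm_toy.transform(train)   # 0.0, 0.5, 1.0
new_scaled = mm_toy.transform(new_data)  # -0.25 and 1.5: outside [0, 1]

# fit followed by transform gives the same result as fit_transform.
same_as_one_call = np.allclose(train_scaled,
                               MinMaxScaler().fit_transform(train))
print(train_scaled.ravel(), new_scaled.ravel(), same_as_one_call)
```

The 0 to 1 guarantee therefore holds only for the data the scaler was fitted on.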


PCA

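① PCA (Principal Component Analysis) : a technique that reduces the dimensionality (number of variables) of the data.

② explained_variance_ratio_ : an attribute of a fitted PCA object that gives the fraction of the total variance explained by each principal component.

Uses: data visualization, handling large datasets, dealing with multicollinearity
Drawback: the principal components are hard to interpret, because each one is a combination of the original variables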
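A small sketch of how `explained_variance_ratio_` behaves, using synthetic data generated for this example (three columns, two of them strongly correlated). The variable names are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: column 1 is almost a multiple of column 0,
# column 2 is independent noise.
rng = np.random.default_rng(42)
base = rng.normal(size=200)
X = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Keep all components to see the full variance breakdown.
pca_full = PCA().fit(X)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print("per component:", ratios)
print("cumulative:   ", cumulative)

# Each row of components_ holds the weights of the original
# variables in one principal component.
print(pca_full.components_)
```

The ratios are sorted from largest to smallest and sum to 1 when every component is kept. The cumulative sum is a common way to choose how many components to retain. The weights in `components_` are the main handle available for interpreting a component.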

In [45]:
# Import the PCA class
from sklearn.decomposition import PCA

In [46]:
# Create a PCA instance named pca, configured to extract 2 principal components
pca = PCA(n_components=2)

In [47]:
# Fit pca on salary_df and transform it, storing the result as pca_df
pca_df = pca.fit_transform(salary_df)
pca_df

array([[-25204.56023482,  -6500.8214399 ],
       [-50423.3737748 ,   6590.81760373],
       [ 34787.31969187,  -5513.62893911],
       ...,
       [-60422.01975293,   6426.28542161],
       [ 24788.67342243,  -5678.16110808],
       [-80419.31299916,   6097.22124521]])

In [48]:
# Convert pca_df to a pandas DataFrame with columns named PC1 and PC2
pca_df = pd.DataFrame(pca_df, columns = ['PC1', 'PC2'])
pca_df

Unnamed: 0,PC1,PC2
0,-25204.560235,-6500.821440
1,-50423.373775,6590.817604
2,34787.319692,-5513.628939
3,-55422.695712,6508.551345
4,84780.552223,-4690.968463
...,...,...
6675,84558.354646,8812.000886
6676,-65199.145508,-7158.949972
6677,-60422.019753,6426.285422
6678,24788.673422,-5678.161108


In [49]:
# Check how much of the original data's variance the two extracted principal components explain
pca.explained_variance_ratio_.sum()

0.9999999855619667
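The ratio above was computed on the unscaled salary_df, so it is likely dominated by the columns with the largest raw variance (Salary in particular) rather than reflecting all variables equally. A common remedy is to standardize before PCA. The sketch below illustrates the difference on a synthetic frame (`toy_df`, with made-up independent columns that only borrow the notebook's column names). The numbers it prints say nothing about the real salary data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, independent columns on very different scales.
rng = np.random.default_rng(0)
n = 500
toy_df = pd.DataFrame({
    "Age": rng.normal(35, 8, n),
    "Education Level": rng.integers(0, 4, n).astype(float),
    "Years of Experience": rng.normal(8, 5, n),
    "Salary": rng.normal(110_000, 50_000, n),
})

# PCA on raw values: the large-variance Salary column dominates.
raw_ratio = PCA(n_components=2).fit(toy_df).explained_variance_ratio_.sum()

# PCA after standardization: every column contributes on equal terms.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(toy_df)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_.sum()

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

On the raw frame two components appear to explain almost everything. After standardization the same two components explain much less, because these synthetic columns are unrelated to one another.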