---
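# 05. Feature Engineering - Working with Speed Dating Data

## 5-1. Introduction

### What you will learn

- How to use feature engineering to extract additional information from a given dataset, taking an idea through to a Python implementation.

### Learning objectives

1. Process data with a method suited to the situation.
2. Derive additional information from the given data.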
---

---
## 5-2. Renaming Columns and Handling Missing Values
### Renaming columns: relabel the importance and score columns so that my ratings use the prefix **`i`** and the partner's ratings use **`o`**.

   - If a column name starts with 'pref_o', replace 'pref_o' with 'o_important'.
        - Example: 'pref_o_hobby' → 'o_important_hobby'
   - If a column name ends with '_o', remove the '_o' suffix and prepend 'o_score_'.
        - Example: 'hobby_o' → 'o_score_hobby'
   - If a column name ends with '_important', remove the '_important' suffix and prepend 'i_important_'.
        - Example: 'hobby_important' → 'i_important_hobby'
   - If a column name ends with '_partner', remove the '_partner' suffix and prepend 'i_score_'.
        - Example: 'hobby_partner' → 'i_score_hobby'
   - Store the renamed columns in the `new_cols` list.

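### Checking missing values: **`dating_df.isna().mean()`**

   - A dependent variable is a variable whose value is determined or influenced by other (independent) variables.
   - Here `match` is the dependent variable, so it plays the role of the "answer" to be predicted. If the dependent variable has missing values, filling them with the mean or an arbitrary value is not appropriate.
   - Replacing a missing target with an arbitrary value gets in the way of accurate prediction and can lead to wrong results. When the dependent variable is missing, it is better to drop the row.

### Dropping missing values: **`dating_df = dating_df.dropna(subset = drop_cols)`**

   - Removes every row of `dating_df` that has a missing value in at least one of the columns listed in `drop_cols`.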
---

In [942]:
import pandas as pd
import numpy as np
import seaborn as sns

In [943]:
dating_df = pd.read_csv('/aiffel/data/dating.csv')

In [944]:
pd.set_option('display.max_columns', 50)

In [945]:
dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,pref_o_attractive,pref_o_sincere,pref_o_intelligence,pref_o_funny,pref_o_ambitious,pref_o_shared_interests,attractive_o,sincere_o,intelligence_o,funny_o,ambitous_o,shared_interests_o,attractive_important,sincere_important,intellicence_important,funny_important,ambtition_important,shared_interests_important,attractive_partner,sincere_partner,intelligence_partner,funny_partner,ambition_partner,shared_interests_partner,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,0.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,0.0,0.0,40.0,0.0,0.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,,1
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1


In [946]:
dating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Data columns (total 37 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gender                         8378 non-null   object 
 1   age                            8283 non-null   float64
 2   age_o                          8274 non-null   float64
 3   race                           8315 non-null   object 
 4   race_o                         8305 non-null   object 
 5   importance_same_race           8299 non-null   float64
 6   importance_same_religion       8299 non-null   float64
 7   pref_o_attractive              8289 non-null   float64
 8   pref_o_sincere                 8289 non-null   float64
 9   pref_o_intelligence            8289 non-null   float64
 10  pref_o_funny                   8280 non-null   float64
 11  pref_o_ambitious               8271 non-null   float64
 12  pref_o_shared_interests        8249 non-null   f

In [947]:
dating_df.describe()

Unnamed: 0,age,age_o,importance_same_race,importance_same_religion,pref_o_attractive,pref_o_sincere,pref_o_intelligence,pref_o_funny,pref_o_ambitious,pref_o_shared_interests,attractive_o,sincere_o,intelligence_o,funny_o,ambitous_o,shared_interests_o,attractive_important,sincere_important,intellicence_important,funny_important,ambtition_important,shared_interests_important,attractive_partner,sincere_partner,intelligence_partner,funny_partner,ambition_partner,shared_interests_partner,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match
count,8283.0,8274.0,8299.0,8299.0,8289.0,8289.0,8289.0,8280.0,8271.0,8249.0,8166.0,8091.0,8072.0,8018.0,7656.0,7302.0,8299.0,8299.0,8299.0,8289.0,8279.0,8257.0,8176.0,8101.0,8082.0,8028.0,7666.0,7311.0,8220.0,8277.0,1800.0,8138.0,8069.0,8378.0
mean,26.358928,26.364999,3.784793,3.651645,22.495347,17.396867,20.270759,17.459714,10.685375,11.84593,6.190411,7.175256,7.369301,6.400599,6.778409,5.47487,22.514632,17.396389,20.265613,17.457043,10.682539,11.845111,6.189995,7.175164,7.368597,6.400598,6.777524,5.474559,0.19601,5.534131,5.570556,6.134087,5.207523,0.164717
std,3.566763,3.563648,2.845708,2.805237,12.569802,7.044003,6.782895,6.085526,6.126544,6.362746,1.950305,1.740575,1.550501,1.954078,1.79408,2.156163,12.587674,7.0467,6.783003,6.085239,6.124888,6.362154,1.950169,1.740315,1.550453,1.953702,1.794055,2.156363,0.303539,1.734059,4.762569,1.841285,2.129565,0.370947
min,18.0,18.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.83,1.0,0.0,0.0,0.0,0.0
25%,24.0,24.0,1.0,1.0,15.0,15.0,17.39,15.0,5.0,9.52,5.0,6.0,6.0,5.0,6.0,4.0,15.0,15.0,17.39,15.0,5.0,9.52,5.0,6.0,6.0,5.0,6.0,4.0,-0.02,5.0,2.0,5.0,4.0,0.0
50%,26.0,26.0,3.0,3.0,20.0,18.37,20.0,18.0,10.0,10.64,6.0,7.0,7.0,7.0,7.0,6.0,20.0,18.18,20.0,18.0,10.0,10.64,6.0,7.0,7.0,7.0,7.0,6.0,0.21,6.0,4.0,6.0,5.0,0.0
75%,28.0,28.0,6.0,6.0,25.0,20.0,23.81,20.0,15.0,16.0,8.0,8.0,8.0,8.0,8.0,7.0,25.0,20.0,23.81,20.0,15.0,16.0,8.0,8.0,8.0,8.0,8.0,7.0,0.43,7.0,8.0,7.0,7.0,0.0
max,55.0,55.0,10.0,10.0,100.0,60.0,50.0,50.0,53.0,30.0,10.5,10.0,10.0,11.0,10.0,10.0,100.0,60.0,50.0,50.0,53.0,30.0,10.0,10.0,10.0,10.0,10.0,10.0,0.91,10.0,20.0,10.0,10.0,1.0


### Renaming columns

- Relabel the importance and score columns so that my ratings use the prefix **`i`** and the partner's ratings use **`o`**.

    - If a column name starts with 'pref_o', replace 'pref_o' with 'o_important'.
        - Example: 'pref_o_hobby' → 'o_important_hobby'
    - If a column name ends with '_o', remove the '_o' suffix and prepend 'o_score_'.
        - Example: 'hobby_o' → 'o_score_hobby'
    - If a column name ends with '_important', remove the '_important' suffix and prepend 'i_important_'.
        - Example: 'hobby_important' → 'i_important_hobby'
    - If a column name ends with '_partner', remove the '_partner' suffix and prepend 'i_score_'.
        - Example: 'hobby_partner' → 'i_score_hobby'
    - Store the renamed columns in the `new_cols` list.


In [948]:
dating_df.columns

Index(['gender', 'age', 'age_o', 'race', 'race_o', 'importance_same_race',
       'importance_same_religion', 'pref_o_attractive', 'pref_o_sincere',
       'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious',
       'pref_o_shared_interests', 'attractive_o', 'sincere_o',
       'intelligence_o', 'funny_o', 'ambitous_o', 'shared_interests_o',
       'attractive_important', 'sincere_important', 'intellicence_important',
       'funny_important', 'ambtition_important', 'shared_interests_important',
       'attractive_partner', 'sincere_partner', 'intelligence_partner',
       'funny_partner', 'ambition_partner', 'shared_interests_partner',
       'interests_correlate', 'expected_happy_with_sd_people',
       'expected_num_interested_in_me', 'like', 'guess_prob_liked', 'match'],
      dtype='object')

In [949]:
'string'.startswith('b')

False

In [950]:
new_cols = []

for col in dating_df.columns:
    if col.startswith('pref_o'):
        # 'pref_o_x' -> 'o_important_x'
        col = col.replace('pref_o', 'o_important', 1)
    elif col.endswith('_o'):
        # 'x_o' -> 'o_score_x'; slice off only the trailing '_o'
        # (this rule also catches 'age_o' and 'race_o')
        col = 'o_score_' + col[:-len('_o')]
    elif col.endswith('_important'):
        col = 'i_important_' + col[:-len('_important')]
    elif col.endswith('_partner'):
        col = 'i_score_' + col[:-len('_partner')]
    new_cols.append(col)

new_cols

['gender',
 'age',
 'o_score_age',
 'race',
 'o_score_race',
 'importance_same_race',
 'importance_same_religion',
 'o_important_attractive',
 'o_important_sincere',
 'o_important_intelligence',
 'o_important_funny',
 'o_important_ambitious',
 'o_important_shared_interests',
 'o_score_attractive',
 'o_score_sincere',
 'o_score_intelligence',
 'o_score_funny',
 'o_score_ambitous',
 'o_score_shared_interests',
 'i_important_attractive',
 'i_important_sincere',
 'i_important_intellicence',
 'i_important_funny',
 'i_important_ambtition',
 'i_important_shared_interests',
 'i_score_attractive',
 'i_score_sincere',
 'i_score_intelligence',
 'i_score_funny',
 'i_score_ambition',
 'i_score_shared_interests',
 'interests_correlate',
 'expected_happy_with_sd_people',
 'expected_num_interested_in_me',
 'like',
 'guess_prob_liked',
 'match']
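
The renaming rules can also be written as a small function, which makes them easy to check against the 'hobby' examples. This is a minimal sketch, not the notebook's own code: `rename_col` and its `skip` parameter are hypothetical names introduced here, and `skip` simply leaves the listed columns untouched.

```python
def rename_col(name, skip=()):
    """Apply the renaming rules to a single column name.

    `skip` is a hypothetical parameter: names listed there are returned unchanged.
    """
    if name in skip:
        return name
    if name.startswith('pref_o_'):
        return 'o_important_' + name[len('pref_o_'):]
    if name.endswith('_o'):
        return 'o_score_' + name[:-len('_o')]
    if name.endswith('_important'):
        return 'i_important_' + name[:-len('_important')]
    if name.endswith('_partner'):
        return 'i_score_' + name[:-len('_partner')]
    return name


examples = ['pref_o_hobby', 'hobby_o', 'hobby_important',
            'hobby_partner', 'age_o', 'match']
renamed = [rename_col(c, skip=('age_o', 'race_o')) for c in examples]
print(renamed)
# ['o_important_hobby', 'o_score_hobby', 'i_important_hobby',
#  'i_score_hobby', 'age_o', 'match']
```

Without `skip`, `rename_col('age_o')` returns `'o_score_age'`, because `age_o` also matches the `_o` rule.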

In [951]:
dating_df.columns=new_cols

In [952]:
dating_df.head()

Unnamed: 0,gender,age,o_score_age,race,o_score_race,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,0.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,0.0,0.0,40.0,0.0,0.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,,1
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1


In [953]:
# 'age_o' and 'race_o' were caught by the '_o' rule, so restore their original names
dating_df = dating_df.rename({'o_score_race':'race_o', 'o_score_age':'age_o'}, axis = 1)

In [954]:
dating_df

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,0.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,0.0,0.0,40.0,0.0,0.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,,1
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8373,male,25.0,26.0,European/Caucasian-American,Latino/HispanicAmerican,1.0,1.0,10.0,10.0,30.0,20.0,10.0,15.0,10.0,5.0,3.0,2.0,6.0,5.0,70.0,0.0,15.0,15.0,0.0,0.0,3.0,5.0,5.0,5.0,,,0.64,10.0,,2.0,5.0,0
8374,male,25.0,24.0,European/Caucasian-American,Other,1.0,1.0,50.0,20.0,10.0,5.0,10.0,5.0,6.0,3.0,7.0,3.0,7.0,2.0,70.0,0.0,15.0,15.0,0.0,0.0,4.0,6.0,8.0,4.0,4.0,,0.71,10.0,,4.0,4.0,0
8375,male,25.0,29.0,European/Caucasian-American,Latino/HispanicAmerican,1.0,1.0,40.0,10.0,30.0,10.0,10.0,,2.0,1.0,2.0,2.0,2.0,1.0,70.0,0.0,15.0,15.0,0.0,0.0,4.0,7.0,8.0,8.0,8.0,,-0.46,10.0,,6.0,5.0,0
8376,male,25.0,22.0,European/Caucasian-American,Asian/PacificIslander/Asian-American,1.0,1.0,10.0,25.0,25.0,10.0,10.0,20.0,5.0,7.0,5.0,5.0,3.0,6.0,70.0,0.0,15.0,15.0,0.0,0.0,4.0,6.0,5.0,4.0,,5.0,0.62,10.0,,5.0,5.0,0


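### Handling missing values
#### Checking missing values: **`dating_df.isna().mean()`**

   - A dependent variable is a variable whose value is determined or influenced by other (independent) variables. Here `match` is the dependent variable, so it plays the role of the "answer" to be predicted.
   - If the dependent variable has missing values, filling them with the mean or an arbitrary value is not appropriate: it gets in the way of accurate prediction and can lead to wrong results. When the dependent variable is missing, it is better to drop the row.

#### Dropping missing values: **`dating_df = dating_df.dropna(subset = drop_cols)`**

   - Removes every row of `dating_df` that has a missing value in at least one of the columns listed in `drop_cols`. In this notebook, `drop_cols` holds the `o_important_*` and `i_important_*` columns.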
---
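
To see what these two calls do, here is a minimal sketch on a toy DataFrame. The column names `a` and `b` and all values are made up for illustration. `isna().mean()` gives the fraction of missing values per column, and `dropna(subset=...)` checks only the listed columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, 2.0, 3.0, 4.0],
    'match': [0, 1, 0, 1],
})

# Fraction of missing values in each column
missing_ratio = toy.isna().mean()
print(missing_ratio)

# Drop rows where 'a' is missing; NaN in other columns is left alone
cleaned = toy.dropna(subset=['a'])
print(cleaned)
```

Row 1 is removed because `a` is missing there, while row 0 stays even though `b` is missing, since `b` is not in `subset`.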

In [955]:
# Rows with a missing dependent variable must always be dropped
# The dependent variable cannot be imputed with the median or mean (this can seriously flaw the predictive model)
dating_df.isna().mean()

gender                           0.000000
age                              0.011339
age_o                            0.012413
race                             0.007520
race_o                           0.008713
importance_same_race             0.009429
importance_same_religion         0.009429
o_important_attractive           0.010623
o_important_sincere              0.010623
o_important_intelligence         0.010623
o_important_funny                0.011697
o_important_ambitious            0.012772
o_important_shared_interests     0.015397
o_score_attractive               0.025304
o_score_sincere                  0.034256
o_score_intelligence             0.036524
o_score_funny                    0.042970
o_score_ambitous                 0.086178
o_score_shared_interests         0.128432
i_important_attractive           0.009429
i_important_sincere              0.009429
i_important_intellicence         0.009429
i_important_funny                0.010623
i_important_ambtition            0

In [956]:
# Inspect the rows where o_important_attractive is missing
dating_df[dating_df['o_important_attractive'].isna()]

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match
511,male,25.0,26.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,9.0,6.0,,,,,,,3.0,6.0,6.0,6.0,5.0,2.0,25.0,20.0,25.0,20.0,10.0,0.0,8.0,8.0,7.0,8.0,7.0,3.0,,3.0,2.0,6.0,2.0,0
530,male,30.0,26.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,1.0,1.0,,,,,,,2.0,6.0,5.0,4.0,5.0,2.0,30.0,20.0,10.0,30.0,0.0,10.0,8.0,8.0,8.0,8.0,8.0,8.0,,7.0,3.0,5.0,,0
549,male,23.0,26.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,1.0,6.0,,,,,,,3.0,5.0,8.0,5.0,8.0,3.0,20.0,25.0,20.0,15.0,10.0,10.0,7.0,10.0,9.0,8.0,9.0,,,6.0,3.0,8.0,6.0,0
568,male,24.0,26.0,European/Caucasian-American,European/Caucasian-American,1.0,1.0,,,,,,,4.0,6.0,5.0,4.0,4.0,3.0,25.0,35.0,30.0,10.0,0.0,0.0,5.0,6.0,6.0,4.0,7.0,6.0,,3.0,1.0,5.0,2.0,0
587,male,24.0,26.0,European/Caucasian-American,European/Caucasian-American,3.0,3.0,,,,,,,4.0,6.0,7.0,4.0,5.0,2.0,25.0,15.0,25.0,25.0,15.0,15.0,5.0,5.0,5.0,5.0,6.0,5.0,,7.0,5.0,5.0,5.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5649,male,30.0,,Asian/PacificIslander/Asian-American,,6.0,6.0,,,,,,,6.0,7.0,8.0,6.0,7.0,4.0,50.0,10.0,10.0,10.0,5.0,15.0,5.0,5.0,6.0,5.0,5.0,5.0,,7.0,,6.0,6.0,0
5669,male,27.0,,European/Caucasian-American,,3.0,2.0,,,,,,,8.0,8.0,9.0,7.0,7.0,8.0,30.0,20.0,15.0,15.0,8.0,12.0,5.0,6.0,7.0,6.0,7.0,7.0,,5.0,,6.0,6.0,0
5689,male,25.0,,Asian/PacificIslander/Asian-American,,10.0,7.0,,,,,,,,,,,,,30.0,10.0,15.0,20.0,15.0,10.0,8.0,9.0,7.0,7.0,7.0,8.0,,5.0,,7.0,6.0,0
5709,male,26.0,,Asian/PacificIslander/Asian-American,,3.0,2.0,,,,,,,6.0,6.0,5.0,4.0,4.0,3.0,25.0,20.0,10.0,20.0,15.0,10.0,9.0,9.0,9.0,9.0,7.0,7.0,,6.0,,8.0,7.0,0


In [957]:
drop_cols = []

for i in dating_df.columns:
    if i.startswith('o_important'):
        drop_cols.append(i)
    elif i.startswith('i_important'):
        drop_cols.append(i)
        
drop_cols

['o_important_attractive',
 'o_important_sincere',
 'o_important_intelligence',
 'o_important_funny',
 'o_important_ambitious',
 'o_important_shared_interests',
 'i_important_attractive',
 'i_important_sincere',
 'i_important_intellicence',
 'i_important_funny',
 'i_important_ambtition',
 'i_important_shared_interests']
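
The same selection can be written more compactly, because `str.startswith` accepts a tuple of prefixes. This is a sketch on a hypothetical, shortened column list rather than the full `dating_df.columns`.

```python
# Hypothetical subset of column names, for illustration only
columns = ['gender', 'o_important_funny', 'o_score_funny',
           'i_important_funny', 'i_score_funny', 'match']

# startswith() with a tuple matches any of the given prefixes
prefixes = ('o_important', 'i_important')
drop_cols = [c for c in columns if c.startswith(prefixes)]
print(drop_cols)  # ['o_important_funny', 'i_important_funny']
```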

In [958]:
dating_df = dating_df.dropna(subset=drop_cols)

In [959]:
dating_df.isna().mean()

gender                           0.000000
age                              0.002706
age_o                            0.002706
race                             0.000000
race_o                           0.000000
importance_same_race             0.000000
importance_same_religion         0.000000
o_important_attractive           0.000000
o_important_sincere              0.000000
o_important_intelligence         0.000000
o_important_funny                0.000000
o_important_ambitious            0.000000
o_important_shared_interests     0.000000
o_score_attractive               0.022755
o_score_sincere                  0.031488
o_score_intelligence             0.034071
o_score_funny                    0.040590
o_score_ambitous                 0.084625
o_score_shared_interests         0.127798
i_important_attractive           0.000000
i_important_sincere              0.000000
i_important_intellicence         0.000000
i_important_funny                0.000000
i_important_ambtition            0

In [960]:
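# A non-response is treated as meaningful, so flag it with a dedicated number
# Caution: machine learning algorithms use the numeric value itself, and linear models in particular are sensitive to the magnitude of the data
# Tree-based models may cope with this (consider the algorithm's characteristics when choosing one)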
dating_df = dating_df.fillna(-99)
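
The caution about numeric magnitude can be illustrated with a toy Series. The values are hypothetical. The `score_missing` indicator column is one common alternative shown for comparison; it is an assumption here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy scores with one missing answer (hypothetical values)
scores = pd.Series([6.0, 8.0, np.nan, 7.0], name='score')

filled = scores.fillna(-99)
print(scores.mean())  # 7.0   (NaN is ignored)
print(filled.mean())  # -19.5 (the -99 flag drags the mean down)

# Alternative sketch: keep the "was missing" information in its own column
toy = pd.DataFrame({'score': scores})
toy['score_missing'] = toy['score'].isna().astype(int)
toy['score'] = toy['score'].fillna(-99)
print(toy)
```

Summary statistics and any model that uses the raw magnitude will see -99 as a real score, which is why tree-based models tend to tolerate this encoding better.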

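## 5-3. Handling Outliers, Age Gap, and Race

### Age gap: `def age_func(x):~`

- Because missing values were filled with -99, a plain subtraction would produce misleading values. The function is therefore defined so that the age gap is also -99 whenever either age is the -99 flag.

### Race: `dating_df['same_race_point'].head(30)`

- Rather than encoding "same race or not" as 0 and 1, it is better to add detail during feature engineering, for example 2 points when the races match and -2 points when they differ, so that more of the information in the data is brought out.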
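
One way to get such signed points is to multiply a +1/-1 "same race" sign by `importance_same_race`. This is a minimal sketch under that assumption, on toy data with placeholder race labels; the formula is not confirmed by the text above, and only the column name `same_race_point` comes from it.

```python
import pandas as pd

# Toy data with placeholder race labels (hypothetical values)
toy = pd.DataFrame({
    'race': ['A', 'A', 'B'],
    'race_o': ['A', 'B', 'B'],
    'importance_same_race': [2.0, 2.0, 5.0],
})

# +1 when both races match, -1 otherwise (assumed encoding)
sign = (toy['race'] == toy['race_o']).map({True: 1, False: -1})

# Weight the sign by how much the person cares about sharing a race
toy['same_race_point'] = sign * toy['importance_same_race']
print(toy['same_race_point'].tolist())  # [2.0, -2.0, 5.0]
```

With an importance of 2, a matching pair scores 2 and a non-matching pair scores -2, so the feature carries both the match and its weight.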

In [961]:
dating_df.describe()

Unnamed: 0,age,age_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match
count,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0,8130.0
mean,26.034932,26.034932,3.776753,3.658303,22.320451,17.433629,20.266113,17.446224,10.705183,11.872082,3.803555,3.841882,3.75738,2.128106,-2.159533,-7.869065,22.320451,17.433629,20.266113,17.446224,10.705183,11.872082,3.803493,3.841882,3.75738,2.127983,-2.159533,-7.869065,0.198866,5.244649,-76.87909,3.270812,1.537761,0.164822
std,7.418562,7.418562,2.844297,2.807314,12.39847,7.017378,6.770061,6.068068,6.101937,6.351582,15.804941,18.622449,19.358834,20.888796,29.495408,34.942787,12.39847,7.017378,6.770061,6.068068,6.101937,6.351582,15.804916,18.622449,19.358834,20.888746,29.495408,34.942787,0.302338,5.697854,42.761973,17.230563,19.345369,0.371042
min,-99.0,-99.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-99.0,-99.0,-99.0,-99.0,-99.0,-99.0,0.0,0.0,0.0,0.0,0.0,0.0,-99.0,-99.0,-99.0,-99.0,-99.0,-99.0,-0.83,-99.0,-99.0,-99.0,-99.0,0.0
25%,24.0,24.0,1.0,1.0,15.0,15.0,17.39,15.0,5.0,9.52,5.0,6.0,6.0,5.0,5.0,3.0,15.0,15.0,17.39,15.0,5.0,9.52,5.0,6.0,6.0,5.0,5.0,3.0,-0.01,5.0,-99.0,5.0,4.0,0.0
50%,26.0,26.0,3.0,3.0,20.0,18.37,20.0,18.0,10.0,10.87,6.0,7.0,7.0,6.0,7.0,5.0,20.0,18.37,20.0,18.0,10.0,10.87,6.0,7.0,7.0,6.0,7.0,5.0,0.22,6.0,-99.0,6.0,5.0,0.0
75%,28.0,28.0,6.0,6.0,25.0,20.0,23.26,20.0,15.0,16.0,8.0,8.0,8.0,8.0,8.0,7.0,25.0,20.0,23.26,20.0,15.0,16.0,8.0,8.0,8.0,8.0,8.0,7.0,0.43,7.0,-99.0,7.0,7.0,0.0
max,55.0,55.0,10.0,10.0,100.0,60.0,50.0,50.0,53.0,30.0,10.5,10.0,10.0,11.0,10.0,10.0,100.0,60.0,50.0,50.0,53.0,30.0,10.0,10.0,10.0,10.0,10.0,10.0,0.91,10.0,20.0,10.0,10.0,1.0


In [962]:
dating_df['o_score_attractive'].sort_values()

5499   -99.0
5419   -99.0
8144   -99.0
7499   -99.0
7497   -99.0
        ... 
4898    10.0
4894    10.0
1102    10.0
869     10.0
6216    10.5
Name: o_score_attractive, Length: 8130, dtype: float64

In [963]:
dating_df['o_score_funny'].sort_values()

8235   -99.0
2110   -99.0
8059   -99.0
7508   -99.0
5643   -99.0
        ... 
516     10.0
1495    10.0
478     10.0
4938    10.0
6608    11.0
Name: o_score_funny, Length: 8130, dtype: float64

In [964]:
# Cap out-of-range scores (the maximum here is 10.5) at 10
dating_df['o_score_attractive'] = dating_df['o_score_attractive'].apply(lambda x:10 if x > 10 else x )

In [965]:
dating_df['o_score_attractive'].max()

10.0

In [966]:
# Same cap for o_score_funny (the maximum here is 11.0)
dating_df['o_score_funny'] = dating_df['o_score_funny'].apply(lambda x:10 if x > 10 else x )

In [967]:
dating_df['o_score_funny'].max()

10.0
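
The same capping can be done without `apply`, using `Series.clip`. This is a sketch on a toy Series with hypothetical values. Passing only `upper=10` leaves the -99 missing-value flag untouched.

```python
import pandas as pd

# Toy scores including the -99 flag and two out-of-range values
s = pd.Series([-99.0, 6.0, 10.0, 10.5, 11.0])

# Values above 10 become 10; everything else is unchanged
clipped = s.clip(upper=10)
print(clipped.tolist())  # [-99.0, 6.0, 10.0, 10.0, 10.0]
```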

In [968]:
o_imp = []

for i in dating_df.columns:
    if i.startswith('o_important'):
        o_imp.append(i)

o_imp

['o_important_attractive',
 'o_important_sincere',
 'o_important_intelligence',
 'o_important_funny',
 'o_important_ambitious',
 'o_important_shared_interests']

In [969]:
dating_df['o_imp_sum'] = dating_df[o_imp].sum(axis=1)

In [970]:
i_imp = []

for i in dating_df.columns:
    if i.startswith('i_important'):
        i_imp.append(i)

i_imp

['i_important_attractive',
 'i_important_sincere',
 'i_important_intellicence',
 'i_important_funny',
 'i_important_ambtition',
 'i_important_shared_interests']

In [971]:
dating_df['i_imp_sum'] = dating_df[i_imp].sum(axis=1)

In [972]:
dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,o_imp_sum,i_imp_sum
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,0.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0,100.0,100.0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,0.0,0.0,40.0,0.0,0.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0,100.0,100.0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,-99.0,1,100.0,100.0
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1,100.0,100.0
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1,100.0,100.0


In [973]:
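# Too many rows have totals that are not exactly 100 (odd values such as 99.99 or 110)
# They need to be rescaled, e.g. with a weight, so that each row sums to 100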
dating_df[dating_df['o_imp_sum'] != 100]

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,o_imp_sum,i_imp_sum
7,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,33.33,11.11,11.11,11.11,11.11,22.22,6.0,7.0,5.0,6.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,4.0,9.0,7.0,6.0,5.0,6.0,0.50,3.0,2.0,6.0,7.0,0,99.99,100.0
17,female,24.0,27.0,European/Caucasian-American,European/Caucasian-American,2.0,5.0,33.33,11.11,11.11,11.11,11.11,22.22,7.0,7.0,7.0,7.0,7.0,5.0,45.0,5.0,25.0,20.0,0.0,5.0,5.0,8.0,7.0,5.0,9.0,5.0,0.07,4.0,5.0,5.0,6.0,0,99.99,100.0
27,female,25.0,27.0,European/Caucasian-American,European/Caucasian-American,8.0,4.0,33.33,11.11,11.11,11.11,11.11,22.22,4.0,5.0,6.0,4.0,6.0,4.0,35.0,10.0,35.0,10.0,10.0,0.0,7.0,9.0,9.0,8.0,9.0,7.0,0.29,4.0,2.0,8.0,7.0,0,99.99,100.0
37,female,23.0,27.0,European/Caucasian-American,European/Caucasian-American,1.0,1.0,33.33,11.11,11.11,11.11,11.11,22.22,6.0,7.0,8.0,6.0,6.0,5.0,20.0,20.0,20.0,20.0,10.0,10.0,5.0,9.0,9.0,5.0,9.0,7.0,0.15,1.0,2.0,6.0,6.0,0,99.99,100.0
47,female,21.0,27.0,European/Caucasian-American,European/Caucasian-American,8.0,1.0,33.33,11.11,11.11,11.11,11.11,22.22,5.0,5.0,6.0,5.0,5.0,5.0,20.0,5.0,25.0,25.0,10.0,15.0,5.0,5.0,7.0,6.0,7.0,5.0,0.07,7.0,10.0,6.0,4.0,0,99.99,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8343,male,27.0,23.0,Black/AfricanAmerican,European/Caucasian-American,2.0,1.0,20.00,25.00,25.00,30.00,5.00,5.00,7.0,6.0,6.0,5.0,5.0,-99.0,40.0,20.0,20.0,20.0,0.0,0.0,8.0,5.0,5.0,7.0,7.0,5.0,0.53,3.0,-99.0,7.0,3.0,0,110.00,100.0
8351,male,27.0,26.0,Black/AfricanAmerican,Latino/HispanicAmerican,2.0,1.0,10.00,10.00,30.00,20.00,10.00,15.00,10.0,5.0,8.0,6.0,9.0,8.0,40.0,20.0,20.0,20.0,0.0,0.0,7.0,7.0,7.0,7.0,7.0,7.0,0.41,3.0,-99.0,8.0,1.0,0,95.00,100.0
8364,male,25.0,30.0,European/Caucasian-American,Latino/HispanicAmerican,1.0,1.0,15.00,20.00,20.00,20.00,5.00,10.00,6.0,5.0,4.0,3.0,6.0,5.0,70.0,0.0,15.0,15.0,0.0,0.0,6.0,9.0,9.0,9.0,8.0,8.0,0.43,10.0,-99.0,7.0,6.0,0,90.00,100.0
8365,male,25.0,23.0,European/Caucasian-American,European/Caucasian-American,1.0,1.0,20.00,25.00,25.00,30.00,5.00,5.00,8.0,7.0,6.0,7.0,7.0,2.0,70.0,0.0,15.0,15.0,0.0,0.0,5.0,8.0,8.0,7.0,7.0,8.0,0.47,10.0,-99.0,6.0,7.0,0,110.00,100.0


In [974]:
# Rescale each row by 100 / o_imp_sum so the o_important_* values sum to 100
dating_df[o_imp] = dating_df.apply(lambda x: (100 / x['o_imp_sum']) * x[o_imp], axis = 1)

In [975]:
dating_df[o_imp].sum(axis=1).min()

99.99999999999997

In [976]:
# Same rescaling for the i_important_* values
dating_df[i_imp] = dating_df.apply(lambda x: (100 / x['i_imp_sum']) * x[i_imp], axis = 1)

In [977]:
dating_df[i_imp].sum(axis=1).min()

99.99999999999997
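
Row-wise `apply` works, but the same rescaling can be expressed as a column operation, which is usually faster on large frames. This is a sketch on toy data with hypothetical values, using `DataFrame.div` with `axis=0` to divide each row by its own sum.

```python
import pandas as pd

# Toy importance columns whose rows sum to 100, 110 and 90 (hypothetical values)
toy = pd.DataFrame({
    'o_important_attractive': [35.0, 20.0, 10.0],
    'o_important_sincere': [65.0, 90.0, 80.0],
})
cols = ['o_important_attractive', 'o_important_sincere']

# Divide every row by its total, then scale to 100
row_sum = toy[cols].sum(axis=1)
toy[cols] = toy[cols].div(row_sum, axis=0) * 100

print(toy[cols].sum(axis=1))  # each row now sums to 100 up to floating-point error
```

As with the `apply` version, a row whose total is 0 would produce NaN or infinite values, so such rows would need separate handling.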

In [978]:
dating_df.drop(['i_imp_sum', 'o_imp_sum'], axis=1, inplace = True)

In [979]:
dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,0.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,0.0,0.0,40.0,0.0,0.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,-99.0,1
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1


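### Age gap: `def age_func(x):~`
- Because missing values were filled with -99, a plain subtraction would produce misleading values. The function is therefore defined so that the age gap is also -99 whenever either age is the -99 flag.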

In [980]:
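def age_func(x):
    # Propagate the -99 missing-value flag instead of subtracting it
    if x['age'] == -99:
        return -99
    elif x['age_o'] == -99:
        return -99
    # Female rows: partner's age minus own age; other rows: own minus partner's
    elif x['gender'] == 'female':
        return x['age_o'] - x['age']
    else:
        return x['age'] - x['age_o']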

In [981]:
dating_df['age_gap'] = dating_df.apply(age_func, axis=1)

In [982]:
dating_df['age_gap'] 

0       6.0
1       1.0
2       1.0
3       2.0
4       3.0
       ... 
8372    1.0
8373   -1.0
8374    1.0
8376    3.0
8377    3.0
Name: age_gap, Length: 8130, dtype: float64
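
The same rule can also be expressed column-wise instead of row by row. The following is a minimal sketch, assuming the -99 sentinel described above; the `toy` frame and the `MISSING` constant are hypothetical names introduced only for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

MISSING = -99  # sentinel used for missing ages, as described above

# Hypothetical toy rows; not taken from the dating data set.
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "age":    [21.0, 25.0, MISSING, 30.0],
    "age_o":  [27.0, 26.0, 24.0, MISSING],
})

missing = (toy["age"] == MISSING) | (toy["age_o"] == MISSING)
is_female = toy["gender"] == "female"

# Conditions are checked in order, so the sentinel case wins first.
toy["age_gap"] = np.select(
    [missing, is_female],
    [MISSING, toy["age_o"] - toy["age"]],
    default=toy["age"] - toy["age_o"],
)
print(toy["age_gap"].tolist())
```

On a frame of a few thousand rows the difference in speed is small, so the row-wise `apply` version remains a reasonable choice for readability.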

In [983]:
dating_df['age_gap_dir'] = dating_df['age_gap'].apply(lambda x: 'positive' if x > 0 else 'negative' if x < 0  else 'zero' )

In [984]:
dating_df['age_gap'] = abs(dating_df['age_gap'])
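
One thing to watch in the two cells above: a row whose `age_gap` is the -99 sentinel would be labelled `negative`, and `abs()` would turn it into 99. The sketch below shows one possible way to keep the sentinel intact. It is an illustration under assumptions, not the notebook's method: the `"missing"` label and the `MISSING` constant are hypothetical.

```python
import numpy as np
import pandas as pd

MISSING = -99

# Hypothetical signed gaps, including one sentinel value.
gap = pd.Series([6.0, -1.0, 0.0, MISSING])
is_missing = gap == MISSING

# np.sign gives -1, 0 or 1; map those to labels, then override missing rows.
direction = np.sign(gap).map({1.0: "positive", -1.0: "negative", 0.0: "zero"})
direction = direction.mask(is_missing, "missing")  # "missing" label is an assumption

# Take the absolute value only where the gap is real, so -99 does not become 99.
abs_gap = gap.abs().mask(is_missing, MISSING)

print(direction.tolist())
print(abs_gap.tolist())
```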

In [985]:
dating_df.head(10)

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,0.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0,6.0,positive
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,0.0,0.0,40.0,0.0,0.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0,1.0,positive
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,-99.0,1,1.0,positive
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1,2.0,positive
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1,3.0,positive
5,female,21.0,25.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,50.0,0.0,30.0,10.0,0.0,10.0,7.0,7.0,8.0,8.0,7.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,4.0,9.0,7.0,4.0,6.0,4.0,0.25,3.0,2.0,6.0,5.0,0,4.0,positive
6,female,21.0,30.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,15.0,25.0,10.0,5.0,10.0,3.0,6.0,7.0,5.0,8.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,7.0,4.0,6.0,7.0,0.34,3.0,2.0,6.0,5.0,0,9.0,positive
7,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,33.333333,11.111111,11.111111,11.111111,11.111111,22.222222,6.0,7.0,5.0,6.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,4.0,9.0,7.0,6.0,5.0,6.0,0.5,3.0,2.0,6.0,7.0,0,6.0,positive
8,female,21.0,28.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,50.0,0.0,25.0,10.0,0.0,15.0,7.0,7.0,8.0,8.0,8.0,9.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,9.0,8.0,8.0,0.28,3.0,2.0,7.0,7.0,1,7.0,positive
9,female,21.0,24.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,100.0,0.0,0.0,0.0,0.0,0.0,6.0,6.0,6.0,6.0,6.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,6.0,8.0,10.0,8.0,-0.36,3.0,2.0,6.0,6.0,0,3.0,positive


In [986]:
dating_df.tail(10)

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir
8367,male,25.0,28.0,European/Caucasian-American,Other,1.0,1.0,25.0,15.0,25.0,15.0,10.0,10.0,6.0,6.0,6.0,6.0,6.0,6.0,70.0,0.0,15.0,15.0,0.0,0.0,2.0,7.0,6.0,6.0,6.0,7.0,0.37,10.0,-99.0,5.0,5.0,0,3.0,negative
8368,male,25.0,26.0,European/Caucasian-American,European/Caucasian-American,1.0,1.0,10.0,30.0,20.0,15.0,15.0,10.0,9.0,7.0,7.0,8.0,8.0,6.0,70.0,0.0,15.0,15.0,0.0,0.0,3.0,4.0,6.0,4.0,4.0,4.0,0.27,10.0,-99.0,4.0,5.0,0,1.0,negative
8369,male,25.0,22.0,European/Caucasian-American,European/Caucasian-American,1.0,1.0,10.0,20.0,15.0,20.0,15.0,20.0,8.0,9.0,9.0,7.0,6.0,7.0,70.0,0.0,15.0,15.0,0.0,0.0,3.0,3.0,9.0,6.0,9.0,6.0,0.45,10.0,-99.0,6.0,3.0,0,3.0,positive
8370,male,25.0,27.0,European/Caucasian-American,Asian/PacificIslander/Asian-American,1.0,1.0,10.0,25.0,20.0,20.0,5.0,20.0,7.0,2.0,6.0,5.0,8.0,3.0,70.0,0.0,15.0,15.0,0.0,0.0,2.0,7.0,9.0,8.0,7.0,8.0,0.35,10.0,-99.0,6.0,6.0,0,2.0,negative
8371,male,25.0,25.0,European/Caucasian-American,Asian/PacificIslander/Asian-American,1.0,1.0,15.0,20.0,25.0,20.0,10.0,10.0,8.0,6.0,7.0,7.0,6.0,4.0,70.0,0.0,15.0,15.0,0.0,0.0,8.0,7.0,8.0,8.0,-99.0,-99.0,0.59,10.0,-99.0,7.0,6.0,1,0.0,zero
8372,male,25.0,24.0,European/Caucasian-American,European/Caucasian-American,1.0,1.0,10.0,15.0,30.0,20.0,15.0,10.0,8.0,8.0,7.0,7.0,8.0,6.0,70.0,0.0,15.0,15.0,0.0,0.0,7.0,5.0,5.0,5.0,6.0,-99.0,0.28,10.0,-99.0,4.0,4.0,0,1.0,positive
8373,male,25.0,26.0,European/Caucasian-American,Latino/HispanicAmerican,1.0,1.0,10.526316,10.526316,31.578947,21.052632,10.526316,15.789474,10.0,5.0,3.0,2.0,6.0,5.0,70.0,0.0,15.0,15.0,0.0,0.0,3.0,5.0,5.0,5.0,-99.0,-99.0,0.64,10.0,-99.0,2.0,5.0,0,1.0,negative
8374,male,25.0,24.0,European/Caucasian-American,Other,1.0,1.0,50.0,20.0,10.0,5.0,10.0,5.0,6.0,3.0,7.0,3.0,7.0,2.0,70.0,0.0,15.0,15.0,0.0,0.0,4.0,6.0,8.0,4.0,4.0,-99.0,0.71,10.0,-99.0,4.0,4.0,0,1.0,positive
8376,male,25.0,22.0,European/Caucasian-American,Asian/PacificIslander/Asian-American,1.0,1.0,10.0,25.0,25.0,10.0,10.0,20.0,5.0,7.0,5.0,5.0,3.0,6.0,70.0,0.0,15.0,15.0,0.0,0.0,4.0,6.0,5.0,4.0,-99.0,5.0,0.62,10.0,-99.0,5.0,5.0,0,3.0,positive
8377,male,25.0,22.0,European/Caucasian-American,Asian/PacificIslander/Asian-American,1.0,1.0,20.0,20.0,10.0,15.0,5.0,30.0,8.0,8.0,7.0,7.0,7.0,7.0,70.0,0.0,15.0,15.0,0.0,0.0,3.0,7.0,6.0,4.0,8.0,1.0,0.01,10.0,-99.0,4.0,5.0,0,3.0,positive


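### Race : `dating_df['same_race_point'].head(30)`

- Rather than encoding "same race" as just 0 or 1, feature engineering can add detail: for example, +2 points for a same-race pair and -2 points for a different-race pair, where the sign comes from whether the races match and the magnitude comes from `importance_same_race`. This draws out more of the information the data already holds.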
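
As a rough illustration of the idea, the sketch below builds the signed score on a tiny hypothetical frame. The labels `"A"` and `"B"` and the `toy` frame are placeholders, not values from the dating data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; "A" and "B" stand in for race labels.
toy = pd.DataFrame({
    "race":   ["A", "A", "B"],
    "race_o": ["A", "B", "B"],
    "importance_same_race": [2.0, 2.0, 8.0],
})

# +1 when the races match, -1 otherwise, so a mismatch is not flattened to 0.
sign = np.where(toy["race"] == toy["race_o"], 1, -1)
toy["same_race_point"] = sign * toy["importance_same_race"]
print(toy["same_race_point"].tolist())
```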

In [987]:
dating_df['importance_same_race'].value_counts()

1.0     2749
3.0      964
2.0      938
5.0      644
8.0      631
7.0      536
6.0      516
4.0      494
9.0      404
10.0     246
0.0        8
Name: importance_same_race, dtype: int64

In [988]:
dating_df['race'].unique()

array(['Asian/PacificIslander/Asian-American',
       'European/Caucasian-American', 'Other', 'Latino/HispanicAmerican',
       'Black/AfricanAmerican'], dtype=object)

In [989]:
dating_df['race_o'].unique()

array(['European/Caucasian-American',
       'Asian/PacificIslander/Asian-American', 'Latino/HispanicAmerican',
       'Other', 'Black/AfricanAmerican'], dtype=object)

In [990]:
# Converting the boolean to a number and storing it as a separate column
# makes it possible to combine it with other columns and derive meaningful information
dating_df['same_race'] = (dating_df['race'] == dating_df['race_o'] ).astype('int')

In [991]:
dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir,same_race
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,0.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0,6.0,positive,0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,0.0,0.0,40.0,0.0,0.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0,1.0,positive,0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,-99.0,1,1.0,positive,1
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1,2.0,positive,0
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1,3.0,positive,0


In [992]:
dating_df['same_race'] * dating_df['importance_same_race']

0       0.0
1       0.0
2       2.0
3       0.0
4       0.0
       ... 
8372    1.0
8373    0.0
8374    0.0
8376    0.0
8377    0.0
Length: 8130, dtype: float64

In [993]:
# Problem: when the races differ, the product is always 0, no matter how much the person cares about race
# Fix: replace 0 with -1 so that a mismatch yields a negative score

dating_df['same_race'] = dating_df['same_race'].replace({0:-1})

In [994]:
dating_df['same_race_point'] = dating_df['same_race'] * dating_df['importance_same_race']

In [995]:
dating_df['same_race_point'].head(30)

0    -2.0
1    -2.0
2     2.0
3    -2.0
4    -2.0
5    -2.0
6    -2.0
7    -2.0
8    -2.0
9    -2.0
10    2.0
11    2.0
12   -2.0
13    2.0
14   -2.0
15    2.0
16    2.0
17    2.0
18    2.0
19    2.0
20    8.0
21    8.0
22   -8.0
23    8.0
24   -8.0
25    8.0
26    8.0
27    8.0
28    8.0
29    8.0
Name: same_race_point, dtype: float64

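## **5-4. Creating and Using Rating Features**

### Frequently used Jupyter Notebook shortcuts

- Steps to merge two cells in Jupyter Notebook
    - 1. Click the blue bar on the left of the cell
    - 2. Shift + arrow key to extend the selection
    - 3. Shift + M to merge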

### zip : **`for i, j in zip(i_important, i_score):~`**

- for in zip : `zip` pairs up the elements of several sequences one position at a time and yields them as tuples, which a `for` loop can then process in order.
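
A small sketch of this pattern, using shortened hypothetical lists rather than the full column lists:

```python
# Shortened, hypothetical column-name lists for illustration.
important = ["i_important_attractive", "i_important_sincere"]
score = ["i_score_attractive", "i_score_sincere"]
rating = ["i_rating_attractive", "i_rating_sincere"]

# Each iteration receives one element from every list, at the same position.
triples = []
for imp_col, score_col, rating_col in zip(important, score, rating):
    triples.append((imp_col, score_col, rating_col))
print(triples[0])

# zip stops at the shortest input, so unequal lengths are silently truncated.
short = list(zip([1, 2, 3], ["a", "b"]))
print(short)
```

Because of the truncation behaviour, it is worth confirming that the zipped lists have the same length and the same item order before relying on the pairing.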

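### Rating : **`def rating(data, important, score)`**

- Weighted score: each score is multiplied by its importance, so every item contributes according to its own weight.
    - This approach is used when the elements differ in importance or contribution.
- Harmonic mean, `2ab/(a+b)`: used to combine the two rating totals into a single value.
    - For the same sum, the closer the two values are, the higher the harmonic mean.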

In [996]:
o_important = []
o_score = []
i_important = []
i_score = []

for i in dating_df.columns:
    if i.startswith('o_important'):
        o_important.append(i)
    elif i.startswith('o_score'):
        o_score.append(i)
    elif i.startswith('i_important'):
        i_important.append(i)
    elif i.startswith('i_score'):
        i_score.append(i)

In [997]:
dating_df[o_important] = dating_df[o_important].replace({0:-99})

In [998]:
dating_df[i_important] = dating_df[i_important].replace({0:-99})

In [999]:
dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir,same_race,same_race_point
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,-99.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0,6.0,positive,-1,-2.0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,-99.0,-99.0,40.0,-99.0,-99.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0,1.0,positive,-1,-2.0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,-99.0,1,1.0,positive,1,2.0
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1,2.0,positive,-1,-2.0
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1,3.0,positive,-1,-2.0


In [1000]:
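# Combining importance with score gives a score closer to what actually matters to the person
# The way to combine them can vary by machine learning algorithm (scaling considerations, etc.)
# For now, at this learning stage, we simply multiply them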

dating_df['o_important_attractive'] * dating_df['o_score_attractive']

0       210.000000
1       420.000000
2       190.000000
3       210.000000
4       240.000000
           ...    
8372     80.000000
8373    105.263158
8374    300.000000
8376     50.000000
8377    160.000000
Length: 8130, dtype: float64

In [1001]:
# Values filled with -99 during preprocessing are handled as exceptions before multiplying

def rating(data, important, score):
    if data[score] == -99:
        return -99
    elif data[important] == -99:
        return -99
    else:
        return data[important] * data[score]
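
For comparison, the same sentinel rule can be written column-wise. This is only a sketch under the assumption that -99 marks missing values; `rating_vectorized`, `MISSING`, and the `toy` frame are hypothetical names, not part of the notebook.

```python
import pandas as pd

MISSING = -99

def rating_vectorized(data, important, score):
    """Column-wise sketch of the row-wise rating(); hypothetical helper."""
    is_missing = (data[important] == MISSING) | (data[score] == MISSING)
    # Multiply everything, then overwrite rows where either input was missing.
    return (data[important] * data[score]).mask(is_missing, MISSING)

toy = pd.DataFrame({
    "o_important_attractive": [35.0, MISSING, 19.0],
    "o_score_attractive": [6.0, 7.0, MISSING],
})
result = rating_vectorized(toy, "o_important_attractive", "o_score_attractive")
print(result.tolist())
```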

In [1002]:
dating_df.apply(lambda x: rating(x, 'o_important_attractive', 'o_score_attractive'), axis=1)

0       210.000000
1       420.000000
2       190.000000
3       210.000000
4       240.000000
           ...    
8372     80.000000
8373    105.263158
8374    300.000000
8376     50.000000
8377    160.000000
Length: 8130, dtype: float64

In [1003]:
o_important

['o_important_attractive',
 'o_important_sincere',
 'o_important_intelligence',
 'o_important_funny',
 'o_important_ambitious',
 'o_important_shared_interests']

In [1004]:
i_score

['i_score_attractive',
 'i_score_sincere',
 'i_score_intelligence',
 'i_score_funny',
 'i_score_ambition',
 'i_score_shared_interests']

In [1005]:
i_rating = ['i_rating_attractive',
 'i_rating_sincere',
 'i_rating_intelligence',
 'i_rating_funny',
 'i_rating_ambitious',
 'i_rating_shared_interests']

In [1006]:
o_rating = ['o_rating_attractive',
 'o_rating_sincere',
 'o_rating_intelligence',
 'o_rating_funny',
 'o_rating_ambitious',
 'o_rating_shared_interests']

In [1007]:
for i, j, k in zip(o_important, o_score, o_rating):
    dating_df[k] = dating_df.apply(lambda x: rating(x, i, j), axis=1)

dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,i_score_attractive,i_score_sincere,i_score_intelligence,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir,same_race,same_race_point,o_rating_attractive,o_rating_sincere,o_rating_intelligence,o_rating_funny,o_rating_ambitious,o_rating_shared_interests
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,-99.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,6.0,9.0,7.0,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0,6.0,positive,-1,-2.0,210.0,160.0,160.0,160.0,-99.0,30.0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,-99.0,-99.0,40.0,-99.0,-99.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,8.0,7.0,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0,1.0,positive,-1,-2.0,420.0,-99.0,-99.0,280.0,-99.0,-99.0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,8.0,9.0,8.0,5.0,7.0,0.16,3.0,2.0,7.0,-99.0,1,1.0,positive,1,2.0,190.0,180.0,190.0,180.0,140.0,120.0
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,7.0,6.0,8.0,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1,2.0,positive,-1,-2.0,210.0,40.0,135.0,320.0,45.0,40.0
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,6.0,7.0,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1,3.0,positive,-1,-2.0,240.0,70.0,180.0,60.0,90.0,140.0


In [1008]:
for i, j, k in zip(i_important, i_score, i_rating):
    dating_df[k] = dating_df.apply(lambda x: rating(x, i, j), axis=1)

dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,...,i_score_funny,i_score_ambition,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir,same_race,same_race_point,o_rating_attractive,o_rating_sincere,o_rating_intelligence,o_rating_funny,o_rating_ambitious,o_rating_shared_interests,i_rating_attractive,i_rating_sincere,i_rating_intelligence,i_rating_funny,i_rating_ambitious,i_rating_shared_interests
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,-99.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,...,7.0,6.0,5.0,0.14,3.0,2.0,7.0,6.0,0,6.0,positive,-1,-2.0,210.0,160.0,160.0,160.0,-99.0,30.0,90.0,180.0,140.0,105.0,90.0,75.0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,-99.0,-99.0,40.0,-99.0,-99.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,...,8.0,5.0,6.0,0.54,3.0,2.0,7.0,5.0,0,1.0,positive,-1,-2.0,420.0,-99.0,-99.0,280.0,-99.0,-99.0,105.0,160.0,140.0,120.0,75.0,90.0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,...,8.0,5.0,7.0,0.16,3.0,2.0,7.0,-99.0,1,1.0,positive,1,2.0,190.0,180.0,190.0,180.0,140.0,120.0,75.0,160.0,180.0,120.0,75.0,105.0
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,...,7.0,6.0,8.0,0.61,3.0,2.0,7.0,6.0,1,2.0,positive,-1,-2.0,210.0,40.0,135.0,320.0,45.0,40.0,105.0,120.0,160.0,105.0,90.0,120.0
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,...,7.0,6.0,6.0,0.21,3.0,2.0,6.0,6.0,1,3.0,positive,-1,-2.0,240.0,70.0,180.0,60.0,90.0,140.0,75.0,120.0,140.0,105.0,90.0,90.0


In [1009]:
# Check the calculation on a single sample row
dating_df[i_rating].loc[8377][dating_df[i_rating].loc[8377] > 0].mean()

120.0

In [1010]:
dating_df['i_rating_total'] = dating_df[i_rating].apply(lambda x: x [x > 0].mean(), axis=1)

In [1011]:
dating_df['o_rating_total'] = dating_df[o_rating].apply(lambda x: x [x > 0].mean(), axis=1)
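
The row-wise mean over positive values can also be sketched without `apply`, by masking the non-positive entries first. The `toy` frame below is hypothetical and only illustrates the mechanics.

```python
import pandas as pd

# Hypothetical ratings; -99 marks a missing rating.
toy = pd.DataFrame({
    "i_rating_attractive": [90.0, 210.0, -99.0],
    "i_rating_sincere": [180.0, -99.0, -99.0],
    "i_rating_funny": [105.0, 60.0, -99.0],
})
cols = list(toy.columns)

# where() keeps positive values and turns the rest into NaN; mean() skips NaN.
toy["i_rating_total"] = toy[cols].where(toy[cols] > 0).mean(axis=1)
print(toy["i_rating_total"].tolist())
```

A row with no positive rating at all ends up as NaN in either formulation, and that NaN carries over into anything computed from the totals afterwards.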

In [1012]:
dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,...,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir,same_race,same_race_point,o_rating_attractive,o_rating_sincere,o_rating_intelligence,o_rating_funny,o_rating_ambitious,o_rating_shared_interests,i_rating_attractive,i_rating_sincere,i_rating_intelligence,i_rating_funny,i_rating_ambitious,i_rating_shared_interests,i_rating_total,o_rating_total
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,-99.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,...,5.0,0.14,3.0,2.0,7.0,6.0,0,6.0,positive,-1,-2.0,210.0,160.0,160.0,160.0,-99.0,30.0,90.0,180.0,140.0,105.0,90.0,75.0,113.333333,144.0
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,-99.0,-99.0,40.0,-99.0,-99.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,...,6.0,0.54,3.0,2.0,7.0,5.0,0,1.0,positive,-1,-2.0,420.0,-99.0,-99.0,280.0,-99.0,-99.0,105.0,160.0,140.0,120.0,75.0,90.0,115.0,350.0
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,...,7.0,0.16,3.0,2.0,7.0,-99.0,1,1.0,positive,1,2.0,190.0,180.0,190.0,180.0,140.0,120.0,75.0,160.0,180.0,120.0,75.0,105.0,119.166667,166.666667
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,...,8.0,0.61,3.0,2.0,7.0,6.0,1,2.0,positive,-1,-2.0,210.0,40.0,135.0,320.0,45.0,40.0,105.0,120.0,160.0,105.0,90.0,120.0,116.666667,131.666667
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,...,6.0,0.21,3.0,2.0,6.0,6.0,1,3.0,positive,-1,-2.0,240.0,70.0,180.0,60.0,90.0,140.0,75.0,120.0,140.0,105.0,90.0,90.0,103.333333,130.0


In [1013]:
dating_df.tail()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,...,i_score_shared_interests,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir,same_race,same_race_point,o_rating_attractive,o_rating_sincere,o_rating_intelligence,o_rating_funny,o_rating_ambitious,o_rating_shared_interests,i_rating_attractive,i_rating_sincere,i_rating_intelligence,i_rating_funny,i_rating_ambitious,i_rating_shared_interests,i_rating_total,o_rating_total
8372,male,25.0,24.0,European/Caucasian-American,European/Caucasian-American,1.0,1.0,10.0,15.0,30.0,20.0,15.0,10.0,8.0,8.0,7.0,7.0,8.0,6.0,70.0,-99.0,15.0,15.0,-99.0,-99.0,...,-99.0,0.28,10.0,-99.0,4.0,4.0,0,1.0,positive,1,1.0,80.0,120.0,210.0,140.0,120.0,60.0,490.0,-99.0,75.0,75.0,-99.0,-99.0,213.333333,121.666667
8373,male,25.0,26.0,European/Caucasian-American,Latino/HispanicAmerican,1.0,1.0,10.526316,10.526316,31.578947,21.052632,10.526316,15.789474,10.0,5.0,3.0,2.0,6.0,5.0,70.0,-99.0,15.0,15.0,-99.0,-99.0,...,-99.0,0.64,10.0,-99.0,2.0,5.0,0,1.0,negative,-1,-1.0,105.263158,52.631579,94.736842,42.105263,63.157895,78.947368,210.0,-99.0,75.0,75.0,-99.0,-99.0,120.0,72.807018
8374,male,25.0,24.0,European/Caucasian-American,Other,1.0,1.0,50.0,20.0,10.0,5.0,10.0,5.0,6.0,3.0,7.0,3.0,7.0,2.0,70.0,-99.0,15.0,15.0,-99.0,-99.0,...,-99.0,0.71,10.0,-99.0,4.0,4.0,0,1.0,positive,-1,-1.0,300.0,60.0,70.0,15.0,70.0,10.0,280.0,-99.0,120.0,60.0,-99.0,-99.0,153.333333,87.5
8376,male,25.0,22.0,European/Caucasian-American,Asian/PacificIslander/Asian-American,1.0,1.0,10.0,25.0,25.0,10.0,10.0,20.0,5.0,7.0,5.0,5.0,3.0,6.0,70.0,-99.0,15.0,15.0,-99.0,-99.0,...,5.0,0.62,10.0,-99.0,5.0,5.0,0,3.0,positive,-1,-1.0,50.0,175.0,125.0,50.0,30.0,120.0,280.0,-99.0,75.0,60.0,-99.0,-99.0,138.333333,91.666667
8377,male,25.0,22.0,European/Caucasian-American,Asian/PacificIslander/Asian-American,1.0,1.0,20.0,20.0,10.0,15.0,5.0,30.0,8.0,8.0,7.0,7.0,7.0,7.0,70.0,-99.0,15.0,15.0,-99.0,-99.0,...,1.0,0.01,10.0,-99.0,4.0,5.0,0,3.0,positive,-1,-1.0,160.0,160.0,70.0,105.0,35.0,210.0,210.0,-99.0,90.0,60.0,-99.0,-99.0,120.0,123.333333


#### Harmonic mean
- 2ab/(a+b)
- The larger the gap between the two values, the lower the mean
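
A quick numeric check of this property, as a sketch with made-up numbers. Both pairs below have the same arithmetic mean of 100; `harmonic_mean` is a hypothetical helper.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers: 2ab / (a + b)."""
    return 2 * a * b / (a + b)

balanced = harmonic_mean(100, 100)  # both sides rate each other equally
uneven = harmonic_mean(50, 150)     # same sum, but a one-sided rating
print(balanced, uneven)
```

The uneven pair scores lower even though the sum is identical, which is why this mean suits a feature that should reward mutual interest.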

In [1014]:
dating_df['rating_mean'] = 2* dating_df['o_rating_total'] * dating_df['i_rating_total']/(dating_df['o_rating_total'] + dating_df['i_rating_total'])

In [1015]:
dating_df.head()

Unnamed: 0,gender,age,age_o,race,race_o,importance_same_race,importance_same_religion,o_important_attractive,o_important_sincere,o_important_intelligence,o_important_funny,o_important_ambitious,o_important_shared_interests,o_score_attractive,o_score_sincere,o_score_intelligence,o_score_funny,o_score_ambitous,o_score_shared_interests,i_important_attractive,i_important_sincere,i_important_intellicence,i_important_funny,i_important_ambtition,i_important_shared_interests,...,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir,same_race,same_race_point,o_rating_attractive,o_rating_sincere,o_rating_intelligence,o_rating_funny,o_rating_ambitious,o_rating_shared_interests,i_rating_attractive,i_rating_sincere,i_rating_intelligence,i_rating_funny,i_rating_ambitious,i_rating_shared_interests,i_rating_total,o_rating_total,rating_mean
0,female,21.0,27.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,35.0,20.0,20.0,20.0,-99.0,5.0,6.0,8.0,8.0,8.0,8.0,6.0,15.0,20.0,20.0,15.0,15.0,15.0,...,0.14,3.0,2.0,7.0,6.0,0,6.0,positive,-1,-2.0,210.0,160.0,160.0,160.0,-99.0,30.0,90.0,180.0,140.0,105.0,90.0,75.0,113.333333,144.0,126.839378
1,female,21.0,22.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,60.0,-99.0,-99.0,40.0,-99.0,-99.0,7.0,8.0,10.0,7.0,7.0,5.0,15.0,20.0,20.0,15.0,15.0,15.0,...,0.54,3.0,2.0,7.0,5.0,0,1.0,positive,-1,-2.0,420.0,-99.0,-99.0,280.0,-99.0,-99.0,105.0,160.0,140.0,120.0,75.0,90.0,115.0,350.0,173.11828
2,female,21.0,22.0,Asian/PacificIslander/Asian-American,Asian/PacificIslander/Asian-American,2.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,10.0,10.0,10.0,10.0,10.0,10.0,15.0,20.0,20.0,15.0,15.0,15.0,...,0.16,3.0,2.0,7.0,-99.0,1,1.0,positive,1,2.0,190.0,180.0,190.0,180.0,140.0,120.0,75.0,160.0,180.0,120.0,75.0,105.0,119.166667,166.666667,138.969874
3,female,21.0,23.0,Asian/PacificIslander/Asian-American,European/Caucasian-American,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,8.0,9.0,8.0,9.0,8.0,15.0,20.0,20.0,15.0,15.0,15.0,...,0.61,3.0,2.0,7.0,6.0,1,2.0,positive,-1,-2.0,210.0,40.0,135.0,320.0,45.0,40.0,105.0,120.0,160.0,105.0,90.0,120.0,116.666667,131.666667,123.713647
4,female,21.0,24.0,Asian/PacificIslander/Asian-American,Latino/HispanicAmerican,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,7.0,9.0,6.0,9.0,7.0,15.0,20.0,20.0,15.0,15.0,15.0,...,0.21,3.0,2.0,6.0,6.0,1,3.0,positive,-1,-2.0,240.0,70.0,180.0,60.0,90.0,140.0,75.0,120.0,140.0,105.0,90.0,90.0,103.333333,130.0,115.142857


### Too many columns: trimming the columns / one-hot encoding

In [1016]:
# Print the column list as a Series
pd.Series(dating_df.columns)

0                            gender
1                               age
2                             age_o
3                              race
4                            race_o
5              importance_same_race
6          importance_same_religion
7            o_important_attractive
8               o_important_sincere
9          o_important_intelligence
10                o_important_funny
11            o_important_ambitious
12     o_important_shared_interests
13               o_score_attractive
14                  o_score_sincere
15             o_score_intelligence
16                    o_score_funny
17                 o_score_ambitous
18         o_score_shared_interests
19           i_important_attractive
20              i_important_sincere
21         i_important_intellicence
22                i_important_funny
23            i_important_ambtition
24     i_important_shared_interests
25               i_score_attractive
26                  i_score_sincere
27             i_score_intel

In [1017]:
# Select columns by their position in the Series index
dating_df.columns[[0,1,3,]]

Index(['gender', 'age', 'race'], dtype='object')

In [1018]:
dating_df.columns[31:]

Index(['interests_correlate', 'expected_happy_with_sd_people',
       'expected_num_interested_in_me', 'like', 'guess_prob_liked', 'match',
       'age_gap', 'age_gap_dir', 'same_race', 'same_race_point',
       'o_rating_attractive', 'o_rating_sincere', 'o_rating_intelligence',
       'o_rating_funny', 'o_rating_ambitious', 'o_rating_shared_interests',
       'i_rating_attractive', 'i_rating_sincere', 'i_rating_intelligence',
       'i_rating_funny', 'i_rating_ambitious', 'i_rating_shared_interests',
       'i_rating_total', 'o_rating_total', 'rating_mean'],
      dtype='object')

In [1019]:
select_col = list(dating_df.columns[[0,1,3,]]) + list(dating_df.columns[31:])

In [1020]:
select_col

['gender',
 'age',
 'race',
 'interests_correlate',
 'expected_happy_with_sd_people',
 'expected_num_interested_in_me',
 'like',
 'guess_prob_liked',
 'match',
 'age_gap',
 'age_gap_dir',
 'same_race',
 'same_race_point',
 'o_rating_attractive',
 'o_rating_sincere',
 'o_rating_intelligence',
 'o_rating_funny',
 'o_rating_ambitious',
 'o_rating_shared_interests',
 'i_rating_attractive',
 'i_rating_sincere',
 'i_rating_intelligence',
 'i_rating_funny',
 'i_rating_ambitious',
 'i_rating_shared_interests',
 'i_rating_total',
 'o_rating_total',
 'rating_mean']

In [1025]:
select_col.remove('same_race')

In [1026]:
final_df = dating_df[select_col]

In [1027]:
final_df.head()

Unnamed: 0,gender,age,race,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,age_gap_dir,same_race_point,o_rating_attractive,o_rating_sincere,o_rating_intelligence,o_rating_funny,o_rating_ambitious,o_rating_shared_interests,i_rating_attractive,i_rating_sincere,i_rating_intelligence,i_rating_funny,i_rating_ambitious,i_rating_shared_interests,i_rating_total,o_rating_total,rating_mean
0,female,21.0,Asian/PacificIslander/Asian-American,0.14,3.0,2.0,7.0,6.0,0,6.0,positive,-2.0,210.0,160.0,160.0,160.0,-99.0,30.0,90.0,180.0,140.0,105.0,90.0,75.0,113.333333,144.0,126.839378
1,female,21.0,Asian/PacificIslander/Asian-American,0.54,3.0,2.0,7.0,5.0,0,1.0,positive,-2.0,420.0,-99.0,-99.0,280.0,-99.0,-99.0,105.0,160.0,140.0,120.0,75.0,90.0,115.0,350.0,173.11828
2,female,21.0,Asian/PacificIslander/Asian-American,0.16,3.0,2.0,7.0,-99.0,1,1.0,positive,2.0,190.0,180.0,190.0,180.0,140.0,120.0,75.0,160.0,180.0,120.0,75.0,105.0,119.166667,166.666667,138.969874
3,female,21.0,Asian/PacificIslander/Asian-American,0.61,3.0,2.0,7.0,6.0,1,2.0,positive,-2.0,210.0,40.0,135.0,320.0,45.0,40.0,105.0,120.0,160.0,105.0,90.0,120.0,116.666667,131.666667,123.713647
4,female,21.0,Asian/PacificIslander/Asian-American,0.21,3.0,2.0,6.0,6.0,1,3.0,positive,-2.0,240.0,70.0,180.0,60.0,90.0,140.0,75.0,120.0,140.0,105.0,90.0,90.0,103.333333,130.0,115.142857
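
The totals shown in row 0 are consistent with averaging each rating group while skipping `-99` values, and with `rating_mean` being the harmonic mean of the two totals. The sketch below reproduces that arithmetic for row 0 only. Treating `-99` as a missing-value sentinel is an assumption read off the displayed numbers, and the helper `mean_ignoring_sentinel` is hypothetical rather than the notebook's own code.

```python
import numpy as np
import pandas as pd

MISSING = -99  # assumption: sentinel that marks a missing rating

# Row 0 values copied from the final_df.head() output above
o_ratings = pd.Series([210.0, 160.0, 160.0, 160.0, -99.0, 30.0])
i_ratings = pd.Series([90.0, 180.0, 140.0, 105.0, 90.0, 75.0])

def mean_ignoring_sentinel(values, sentinel=MISSING):
    """Average a Series after turning sentinel entries into NaN (skipped by mean)."""
    return values.replace(sentinel, np.nan).mean()

o_total = mean_ignoring_sentinel(o_ratings)
i_total = mean_ignoring_sentinel(i_ratings)

# Harmonic mean of the two totals
rating_mean = 2 * i_total * o_total / (i_total + o_total)

print(round(i_total, 6), round(o_total, 6), round(rating_mean, 6))
```

The printed values can be compared against the `i_rating_total`, `o_rating_total`, and `rating_mean` columns of row 0.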


In [1028]:
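# One-hot encode the text columns (gender, race, age_gap_dir); the result is only displayed here, not assigned
pd.get_dummies(final_df)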

Unnamed: 0,age,interests_correlate,expected_happy_with_sd_people,expected_num_interested_in_me,like,guess_prob_liked,match,age_gap,same_race_point,o_rating_attractive,o_rating_sincere,o_rating_intelligence,o_rating_funny,o_rating_ambitious,o_rating_shared_interests,i_rating_attractive,i_rating_sincere,i_rating_intelligence,i_rating_funny,i_rating_ambitious,i_rating_shared_interests,i_rating_total,o_rating_total,rating_mean,gender_female,gender_male,race_Asian/PacificIslander/Asian-American,race_Black/AfricanAmerican,race_European/Caucasian-American,race_Latino/HispanicAmerican,race_Other,age_gap_dir_negative,age_gap_dir_positive,age_gap_dir_zero
0,21.0,0.14,3.0,2.0,7.0,6.0,0,6.0,-2.0,210.000000,160.000000,160.000000,160.000000,-99.000000,30.000000,90.0,180.0,140.0,105.0,90.0,75.0,113.333333,144.000000,126.839378,1,0,1,0,0,0,0,0,1,0
1,21.0,0.54,3.0,2.0,7.0,5.0,0,1.0,-2.0,420.000000,-99.000000,-99.000000,280.000000,-99.000000,-99.000000,105.0,160.0,140.0,120.0,75.0,90.0,115.000000,350.000000,173.118280,1,0,1,0,0,0,0,0,1,0
2,21.0,0.16,3.0,2.0,7.0,-99.0,1,1.0,2.0,190.000000,180.000000,190.000000,180.000000,140.000000,120.000000,75.0,160.0,180.0,120.0,75.0,105.0,119.166667,166.666667,138.969874,1,0,1,0,0,0,0,0,1,0
3,21.0,0.61,3.0,2.0,7.0,6.0,1,2.0,-2.0,210.000000,40.000000,135.000000,320.000000,45.000000,40.000000,105.0,120.0,160.0,105.0,90.0,120.0,116.666667,131.666667,123.713647,1,0,1,0,0,0,0,0,1,0
4,21.0,0.21,3.0,2.0,6.0,6.0,1,3.0,-2.0,240.000000,70.000000,180.000000,60.000000,90.000000,140.000000,75.0,120.0,140.0,105.0,90.0,90.0,103.333333,130.000000,115.142857,1,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8372,25.0,0.28,10.0,-99.0,4.0,4.0,0,1.0,1.0,80.000000,120.000000,210.000000,140.000000,120.000000,60.000000,490.0,-99.0,75.0,75.0,-99.0,-99.0,213.333333,121.666667,154.958541,0,1,0,0,1,0,0,0,1,0
8373,25.0,0.64,10.0,-99.0,2.0,5.0,0,1.0,-1.0,105.263158,52.631579,94.736842,42.105263,63.157895,78.947368,210.0,-99.0,75.0,75.0,-99.0,-99.0,120.000000,72.807018,90.627843,0,1,0,0,1,0,0,1,0,0
8374,25.0,0.71,10.0,-99.0,4.0,4.0,0,1.0,-1.0,300.000000,60.000000,70.000000,15.000000,70.000000,10.000000,280.0,-99.0,120.0,60.0,-99.0,-99.0,153.333333,87.500000,111.418685,0,1,0,0,1,0,0,0,1,0
8376,25.0,0.62,10.0,-99.0,5.0,5.0,0,3.0,-1.0,50.000000,175.000000,125.000000,50.000000,30.000000,120.000000,280.0,-99.0,75.0,60.0,-99.0,-99.0,138.333333,91.666667,110.265700,0,1,0,0,1,0,0,0,1,0
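
The call above returns a new frame and leaves `final_df` unchanged, so the encoded result has to be assigned to be reused. A minimal sketch on a toy frame (`toy_df` is hypothetical, not the notebook's data) shows assigning the result, requesting integer 0/1 columns explicitly with `dtype=int` (newer pandas versions return booleans by default), and the optional `drop_first=True` variant that drops one level per encoded column.

```python
import pandas as pd

# Toy frame standing in for final_df (illustrative values only)
toy_df = pd.DataFrame({
    "gender": ["female", "male", "male"],
    "age_gap_dir": ["positive", "negative", "zero"],
    "age": [21.0, 25.0, 25.0],
    "match": [0, 1, 0],
})

# Columns that are not numeric are the ones to encode
cat_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# dtype=int yields 0/1 columns regardless of the pandas version's default
encoded = pd.get_dummies(toy_df, columns=cat_cols, dtype=int)

# drop_first=True removes the first level of each encoded column,
# which avoids perfectly collinear dummy columns in linear models
encoded_drop = pd.get_dummies(toy_df, columns=cat_cols, dtype=int, drop_first=True)

print(encoded.columns.tolist())
print(encoded_drop.columns.tolist())
```

Whether to drop a level depends on the model: tree-based models are generally unaffected by the redundant column, while unregularized linear models can be.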
