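# Cohort 19 KNN Regular Session Assignment

## Implementing KNN
### 1. Preprocessing / EDA
Using what you have learned so far, freely carry out preprocessing and EDA on the given data.
### 2. KNN implementation & parameter tuning
Referring to the lecture and lab materials, implement KNN, tune its parameters, and compare the results.
### 3. Evaluation
Evaluate the results and add your own interpretation.

**Data:** [blackfriday | Kaggle](https://www.kaggle.com/llopesolivei/blackfriday)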

---

## 0. Loading the data

In [243]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
df = pd.read_csv("blackfriday.csv", index_col = 0)
df.head()

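| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001088 | P00046042 | F | 0-17 | 10 | A | 3 | 0 | 5 | 17.0 | NaN | 2010 |
| 1 | 1004493 | P00347742 | F | 0-17 | 10 | A | 1 | 0 | 7 | NaN | NaN | 4483 |
| 2 | 1005302 | P00048942 | F | 0-17 | 10 | A | 1 | 0 | 1 | 4.0 | NaN | 7696 |
| 3 | 1001348 | P00145242 | F | 0-17 | 10 | A | 3 | 0 | 2 | 4.0 | NaN | 16429 |
| 4 | 1001348 | P00106742 | F | 0-17 | 10 | A | 3 | 0 | 3 | 5.0 | NaN | 5780 |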


In [244]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4998 entries, 0 to 4997
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User_ID                     4998 non-null   int64  
 1   Product_ID                  4998 non-null   object 
 2   Gender                      4998 non-null   object 
 3   Age                         4998 non-null   object 
 4   Occupation                  4998 non-null   int64  
 5   City_Category               4998 non-null   object 
 6   Stay_In_Current_City_Years  4998 non-null   object 
 7   Marital_Status              4998 non-null   int64  
 8   Product_Category_1          4998 non-null   int64  
 9   Product_Category_2          3465 non-null   float64
 10  Product_Category_3          1544 non-null   float64
 11  Purchase                    4998 non-null   int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 507.6+ KB


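### Variable descriptions (found by searching online)
```text
 User_ID                     Unique customer ID
 Product_ID                  Unique product ID
 Gender                      Gender
 Age                         Age group
 Occupation                  Occupation
 City_Category               City category (A, B, C)
 Stay_In_Current_City_Years  Years lived in the current city
 Marital_Status              Marital status
 Product_Category_1          Product category 1
 Product_Category_2          Product category 2
 Product_Category_3          Product category 3
 # Everything above is a categorical variable.
 # Only Purchase appears to be continuous.
 Purchase                    Purchase amount (unit: $)
```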

In [245]:
df['Occupation'].value_counts()

4     706
0     572
7     535
1     460
17    339
12    305
20    285
2     251
14    245
16    244
6     186
3     160
15    133
10    122
5      98
11     95
19     73
13     65
18     56
9      50
8      18
Name: Occupation, dtype: int64

## 1. Preprocessing / EDA

Missing values will be imputed or dropped, and all irrelevant features will be removed.

In [246]:
df.isnull().sum()

User_ID                          0
Product_ID                       0
Gender                           0
Age                              0
Occupation                       0
City_Category                    0
Stay_In_Current_City_Years       0
Marital_Status                   0
Product_Category_1               0
Product_Category_2            1533
Product_Category_3            3454
Purchase                         0
dtype: int64

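Only `Product_Category_2` (1,533 rows) and `Product_Category_3` (3,454 rows) have missing values; every other column is complete.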

In [247]:
def showNanRatio(df, cat):
    # Fraction of missing values in column `cat`
    data_ratio = df[cat].isna().sum() / df[cat].shape[0]
    print(cat + ' missing ratio : %0.3f' % data_ratio)


showNanRatio(df, 'Product_Category_2')
showNanRatio(df, 'Product_Category_3')

Product_Category_2 missing ratio : 0.307
Product_Category_3 missing ratio : 0.691
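The per-column ratio can also be computed for every column in one step. The sketch below is only an illustration on a made-up toy frame (`toy` and its values are assumptions, not the Black Friday data); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "Product_Category_1": [5, 7, 1, 2],
    "Product_Category_2": [17.0, np.nan, 4.0, 4.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 5.0],
})

# isna() gives a boolean frame; the column-wise mean is the missing ratio.
nan_ratio = toy.isna().mean()
print(nan_ratio.round(3))
```

Applied to the real `df`, the same expression would list all columns at once, which makes it easy to spot which ones need attention.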


Judging from the data, these columns appear to depend on other variables, which makes imputation difficult, so I decided to drop them.

In [248]:
df = df.drop(columns=['Product_Category_2', 'Product_Category_3'])

Next, drop the ID-related features, which do not affect the analysis.

In [249]:
df = df.drop(columns=['User_ID', 'Product_ID'])

Next, encode the categorical variables that are stored as strings or have an incorrect type.

In [250]:
# First list the unique values to decide between a binary mapping and get_dummies
def showUnique(df, col):
    print(col + ' : ' + ', '.join(str(item) for item in df[col].unique()))
    
for _c in ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']:
    showUnique(df, _c)

Gender : F, M
Age : 0-17, 18-25, 26-35, 36-45, 46-50, 51-55, 55+
City_Category : A, B, C
Stay_In_Current_City_Years : 3, 1, 2, 4+, 0


Gender, City_Category => replace with 0, 1 and 1, 2, 3 respectively

The remaining columns are one-hot encoded with `pd.get_dummies()`.

In [251]:
# Remap the Gender and City_Category values
target = [
    [
        {'col' : 'Gender', 'bVal' : 'F', 'aVal' : 0 },
        {'col' : 'Gender', 'bVal' : 'M', 'aVal' : 1 }
    ],
    [
        {'col' : 'City_Category', 'bVal' : 'A', 'aVal' : 1 },
        {'col' : 'City_Category', 'bVal' : 'B', 'aVal' : 2 },
        {'col' : 'City_Category', 'bVal' : 'C', 'aVal' : 3 }
    ]
]

def changeValue(df, info):
    for _info in info:
        col, bVal, aVal = _info.values()
        df.loc[df[col] == bVal, col] = aVal

for _t in target:
    changeValue(df, _t)
    
# Cast Gender and City_Category to int64
df['Gender'] = df['Gender'].astype('int64')
df['City_Category'] = df['City_Category'].astype('int64')

In [252]:
# Age, Stay_In_Current_City_Years One-hot Encoding
target = ['Age', 'Stay_In_Current_City_Years']
encodedData = pd.get_dummies(df[target])

df = pd.concat([df[df.describe().columns], encodedData], axis=1)
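To make the effect of `pd.get_dummies()` concrete, here is a small sketch on a toy frame (`toy` and `encoded` are illustrative names with made-up rows). `dtype="uint8"` is passed explicitly as an assumption to match the `df.info()` output in this notebook, since newer pandas releases return `bool` columns by default.

```python
import pandas as pd

# Toy frame with the two string columns that get one-hot encoded.
toy = pd.DataFrame({
    "Age": ["0-17", "26-35", "55+"],
    "Stay_In_Current_City_Years": ["3", "4+", "0"],
})

# Each distinct value becomes its own 0/1 column named "<column>_<value>".
encoded = pd.get_dummies(toy, dtype="uint8")
print(encoded.columns.tolist())
```

Every row has exactly one 1 per original column, so the two source columns here contribute a row sum of 2.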

In [253]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4998 entries, 0 to 4997
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Gender                         4998 non-null   int64
 1   Occupation                     4998 non-null   int64
 2   City_Category                  4998 non-null   int64
 3   Marital_Status                 4998 non-null   int64
 4   Product_Category_1             4998 non-null   int64
 5   Purchase                       4998 non-null   int64
 6   Age_0-17                       4998 non-null   uint8
 7   Age_18-25                      4998 non-null   uint8
 8   Age_26-35                      4998 non-null   uint8
 9   Age_36-45                      4998 non-null   uint8
 10  Age_46-50                      4998 non-null   uint8
 11  Age_51-55                      4998 non-null   uint8
 12  Age_55+                        4998 non-null   uint8
 13  Stay_In_Current_Ci

Based on the above, KNN is implemented with `Purchase` as the target variable Y.

In [254]:
X = df.drop(['Purchase'], axis=1)
Y = df['Purchase']

## 2. KNN implementation & parameter tuning

First, I split the data into training and test sets.

In [255]:
# See the previous assignment (week 2) for the related code.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, shuffle=True)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Y_train shape: {Y_train.shape}")
print(f"Y_test shape: {Y_test.shape}")

X_train shape: (3498, 17)
X_test shape: (1500, 17)
Y_train shape: (3498,)
Y_test shape: (1500,)
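The split above does not set `random_state`, so its rows change on every run. The following sketch, on made-up arrays (`X_toy`, `y_toy` are illustrative), shows how fixing `random_state` makes a split repeatable; the seed value 2023 is just an example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 rows, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same seed produce identical partitions.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=2023)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=2023)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same, split_a[0].shape, split_a[1].shape)
```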


In [262]:
#train / test set
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, train_size=0.7, random_state=2023)

In [264]:
train_target = train_df['Purchase']
test_target = test_df['Purchase']
train_data = train_df.drop('Purchase', axis=1)
test_data = test_df.drop('Purchase', axis=1)

In [265]:
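# Start KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Note: Purchase is continuous, so the classifier treats every distinct amount as its own class.
knn = KNeighborsClassifier()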

In [267]:
# Parameter grid for the grid search
# Based on the lab material.
params = {
    "n_neighbors": list(range(1, 20, 2)),
    "p": [1, 2],  # 1 = Manhattan, 2 = Euclidean
    "weights": ['uniform']
}

grid_cv = GridSearchCV(knn, param_grid = params, cv = 3)
grid_cv.fit(train_data, train_target)
grid_cv.best_params_



{'n_neighbors': 13, 'p': 2, 'weights': 'uniform'}
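Since `Purchase` is a continuous amount, one alternative worth considering is `KNeighborsRegressor`, tuned against a regression metric. The sketch below is not the notebook's method: it runs on synthetic data (`X_demo`, `y_demo` are invented), and the scaling step, the `"distance"` weighting, and the MAE scoring are all assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 rows, 3 features on very different scales.
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 20.0, 0.1])
y_demo = 1000 * X_demo[:, 0] + rng.normal(scale=10.0, size=200)

pipe = Pipeline([
    ("scale", StandardScaler()),      # KNN is distance-based, so put features on one scale
    ("knn", KNeighborsRegressor()),   # predicts the average target of the neighbours
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {
    "knn__n_neighbors": list(range(1, 20, 2)),
    "knn__p": [1, 2],                 # 1 = Manhattan, 2 = Euclidean
    "knn__weights": ["uniform", "distance"],
}

# scikit-learn maximises scores, so MAE is exposed as its negative.
search = GridSearchCV(pipe, param_grid=param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling statistics.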

In [268]:
knn_new = KNeighborsClassifier(
    n_neighbors = grid_cv.best_params_['n_neighbors'], 
    p = grid_cv.best_params_['p'], 
    weights = grid_cv.best_params_['weights']
)
knn_new.fit(train_data, train_target)

## 3. Evaluation

In [269]:
test_pred = knn_new.predict(test_data)

In [271]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("test_data MAE : ", mean_absolute_error(test_target, test_pred))
print("test_data MSE : ", mean_squared_error(test_target, test_pred))
print("test_data R2 : ", r2_score(test_target, test_pred))

test_data MAE :  5413.7
test_data MSE :  47060238.92933334
test_data R2 :  -0.9467525285271354
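As a reading aid for these numbers: the last value printed is the `r2_score`, and a negative value means the predictions have a larger squared error than simply predicting the mean of the test targets. The MSE of about 4.7e7 corresponds to an RMSE of roughly 6,860, in the same unit as `Purchase`. The sketch below illustrates the R² definition on made-up amounts (`y_true`, `y_bad`, and `r2_manual` are illustrative, not notebook results).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([2000.0, 4500.0, 7700.0, 16400.0, 5800.0])  # made-up amounts
y_mean = np.full_like(y_true, y_true.mean())  # baseline: always predict the mean
y_bad = y_true[::-1].copy()                   # deliberately poor predictions

def r2_manual(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r2_score(y_true, y_mean))   # the mean baseline scores 0 by definition
print(r2_score(y_true, y_bad))    # negative: worse than the mean baseline
print(np.sqrt(mean_squared_error(y_true, y_bad)))  # RMSE, in the target's unit
```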
