# KNN (K-Nearest Neighbors)

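## KNN Classification
    A new observation is assigned the majority class among its k nearest neighbors in the existing (training) data.
    Avoid even values of k where possible, because the vote can end in a tie (with two classes, an odd k rules ties out).
    A k close to 1 tends to overfit and a large k tends to underfit, so k has to be chosen carefully.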
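
The voting rule described above can be sketched in plain Python. This is a minimal illustration and not how sklearn implements it: the function name `knn_classify`, the Euclidean distance choice, and the toy points are all assumptions made for the example.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_classify(train_X, train_y, query, k):
    """Return (label, votes) for `query` using a majority vote of its k nearest points."""
    # Rank training points by their distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    # Count the class labels of the k closest points.
    votes = Counter(train_y[i] for i in order[:k])
    label = votes.most_common(1)[0][0]
    return label, dict(votes)


# Toy data (hypothetical): two features, classes "a" and "b".
train_X = [(1.0, 1.0), (1.5, 1.5), (0.5, 1.0), (3.0, 3.2), (2.5, 2.6), (3.5, 3.2)]
train_y = ["a", "a", "a", "b", "b", "b"]
query = (2.0, 2.0)

label_k2, votes_k2 = knn_classify(train_X, train_y, query, k=2)
label_k3, votes_k3 = knn_classify(train_X, train_y, query, k=3)
print(votes_k2)  # one vote per class: a tie that needs an arbitrary tie-break
print(votes_k3, label_k3)
```

With k=2 the two nearest points belong to different classes, so the result depends on how the tie is broken; with k=3 the vote is decisive.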
    
## KNN Regression
    The idea is the same as for classification, but the prediction is the (weighted) mean of the target values of the k nearest neighbors.
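
As a rough sketch of the plain and weighted means mentioned above, the following pure-Python function averages the targets of the k nearest points. The name `knn_regress`, the inverse-distance weighting scheme, and the toy data are assumptions chosen for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def knn_regress(train_X, train_y, query, k, weighted=False):
    """Predict a numeric target as the (optionally inverse-distance weighted) mean of k neighbors."""
    # Keep the k nearest neighbors as (distance, target) pairs.
    nearest = sorted((dist(x, query), y) for x, y in zip(train_X, train_y))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / k
    # A neighbor at distance 0 would get infinite weight, so average exact matches directly.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Toy data (hypothetical): one feature, numeric target.
train_X = [(1.0,), (2.0,), (4.0,), (8.0,)]
train_y = [10.0, 20.0, 40.0, 80.0]
query = (3.0,)

plain = knn_regress(train_X, train_y, query, k=3)
weighted = knn_regress(train_X, train_y, query, k=3, weighted=True)
print(plain, weighted)
```

The weighted version pulls the prediction toward the closer neighbors, so the more distant third neighbor contributes less than it does in the plain mean.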
    
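## sklearn - KNeighborsClassifier()
    The sklearn class for fitting a KNN classification model.
    The n_neighbors argument sets how many neighboring points are considered for each prediction (default: 5).
    As n_neighbors approaches 1 the model tends to overfit; as it grows, the model tends to underfit.
    Pass the independent variables (X) and the dependent variable (y) to the fit() method of a KNeighborsClassifier instance.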
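
To make the overfitting and underfitting tendency concrete, the sketch below compares training and held-out accuracy for several `n_neighbors` values. The data is synthetic (generated with `make_classification`), and the k values and split ratio are arbitrary assumptions, so the numbers say nothing about the datasets used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy binary data (an assumption for illustration only).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

scores = {}
for k in [1, 5, 25, 101]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns mean accuracy; record it for training and held-out rows.
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k:>3}  train={train_acc:.3f}  test={test_acc:.3f}")
```

With k=1 every training point is its own nearest neighbor, so training accuracy is perfect regardless of label noise. The gap between the two columns is what signals overfitting.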
    
## sklearn - KNeighborsRegressor()
    The sklearn class for fitting a KNN regression model.

In [1]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

In [2]:
df = pd.read_csv("iris.csv")
df.head(2)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa


In [3]:
# 1 if the row is setosa, 0 otherwise
df["is_setosa"] = (df["Species"] == "setosa").astype(int)
df.head(2)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,is_setosa
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1


In [4]:
model_c = KNeighborsClassifier(n_neighbors = 3)
model_c.fit(X=df.iloc[:,:4], y=df["is_setosa"])
model_c

KNeighborsClassifier(n_neighbors=3)

In [5]:
model_c.predict(df.iloc[:,:4])[:5]

array([1, 1, 1, 1, 1])

In [6]:
from sklearn.metrics import accuracy_score

In [7]:
# Accuracy on the same rows the model was fitted on (training accuracy)
accuracy_score(y_true=df["is_setosa"],
               y_pred=model_c.predict(df.iloc[:, :4]))

1.0
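
The 1.0 above is measured on the rows the model was trained on, so it may overstate how well the model generalizes. A hedged sketch of the same setosa-vs-rest task with a held-out split follows. It uses sklearn's bundled `load_iris` data instead of `iris.csv` (an assumption, so the example is self-contained), and the split ratio and seed are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data                          # the four sepal/petal measurements
y = (iris.target == 0).astype(int)     # class 0 is setosa in load_iris

# Hold out 30% of the rows; stratify keeps the setosa share equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=123, stratify=y)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
test_acc = accuracy_score(y_true=y_test, y_pred=model.predict(X_test))
print(test_acc)
```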

In [8]:
model_r = KNeighborsRegressor(n_neighbors=3)
model_r.fit(X=df.iloc[:,:3],
           y=df["Petal.Width"])
model_r

KNeighborsRegressor(n_neighbors=3)

In [9]:
pred_r = model_r.predict(df.iloc[:,:3])
pred_r[:5]

array([0.26666667, 0.2       , 0.23333333, 0.2       , 0.16666667])

In [11]:
from sklearn.metrics import mean_squared_error as mse

In [12]:
mse(y_true=df["Petal.Width"], y_pred=pred_r)

0.018651851851851857

In [13]:
mse(y_true=df["Petal.Width"], y_pred=pred_r) ** 0.5

0.13657178278052848
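
Taking the square root of the MSE, as in the cell above, gives the RMSE, which is in the same units as the target. A minimal pure-Python sketch of that calculation, using made-up numbers rather than the iris predictions:

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: three exact predictions and one that is off by 2.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

result = rmse(y_true, y_pred)
print(result)  # squared errors 0, 0, 0, 4 -> MSE 1.0 -> RMSE 1.0
```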

Problem 01. If the number of pregnancies, glucose, and blood pressure are used to predict whether diabetes occurs, what is the accuracy?

In [14]:
df = pd.read_csv("diabetes.csv")
df.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [15]:
from sklearn.model_selection import train_test_split

In [17]:
df_train, df_test = train_test_split(df, train_size=0.7, random_state=123)
df_train.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
429,1,95,82,25,180,35.0,0.233,43,1
524,3,125,58,0,0,31.6,0.151,24,0


In [19]:
X_cols = ["Pregnancies", "Glucose", "BloodPressure"]
model = KNeighborsClassifier()  # default n_neighbors=5
model.fit(X=df_train.loc[:, X_cols], y=df_train["Outcome"])
pred = model.predict(df_test.loc[:, X_cols])
pred[:5]

array([1, 0, 0, 0, 0], dtype=int64)

In [20]:
from sklearn.metrics import accuracy_score

In [22]:
accuracy_score(y_pred = pred, y_true = df_test["Outcome"])

0.7272727272727273

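Problem 02. With diabetes onset as the dependent variable and pregnancy status, glucose, blood pressure, insulin, and BMI as the independent variables, which k value is not correctly paired with its accuracy?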

In [23]:
df = pd.read_csv("diabetes.csv")
df.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [24]:
# 1 if the patient has been pregnant at least once, 0 otherwise
df["is_preg"] = (df["Pregnancies"] > 0).astype(int)
df.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,is_preg
0,6,148,72,35,0,33.6,0.627,50,1,1
1,1,85,66,29,0,26.6,0.351,31,0,1


In [25]:
df_train, df_test = train_test_split(df, train_size=0.8, random_state=123)

In [27]:
X_cols = ["is_preg", "Glucose", "BloodPressure", "Insulin", "BMI"]

neighbors = [3,5,10,20]
accs = []
for n_n in neighbors:
    model = KNeighborsClassifier(n_neighbors=n_n)
    model.fit(X=df_train.loc[:,X_cols], y=df_train["Outcome"])
    pred = model.predict(df_test.loc[:, X_cols])
    acc_sub = accuracy_score(y_pred = pred, y_true = df_test["Outcome"])
    accs.append(acc_sub)
    
df_score = pd.DataFrame({"neighbors":neighbors, "accs":accs})
df_score["accs"] = df_score["accs"].round(2)
df_score

Unnamed: 0,neighbors,accs
0,3,0.71
1,5,0.73
2,10,0.78
3,20,0.76
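
The accuracies above come from a single train/test split and differ by only a few points, so the ranking of k values may change with another split. One common way to get a steadier comparison is k-fold cross-validation. The sketch below uses sklearn's `cross_val_score` on synthetic data (an assumption, since `diabetes.csv` is not bundled here); the choice of 5 folds is also arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the real features (illustration only).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)

cv_means = {}
for k in [3, 5, 10, 20]:
    # Mean accuracy over 5 folds for this value of n_neighbors.
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X, y, cv=5, scoring="accuracy")
    cv_means[k] = fold_scores.mean()

best_k = max(cv_means, key=cv_means.get)
print(cv_means, best_k)
```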


Problem 03. With BMI as the dependent variable and pregnancy status, glucose, blood pressure, and insulin as the independent variables, which k value is not correctly paired with its RMSE?

In [28]:
df = pd.read_csv("diabetes.csv")
df["is_preg"] = (df["Pregnancies"] > 0).astype(int)
df_train, df_test = train_test_split(df, train_size=0.8, random_state=123)

In [29]:
from sklearn.metrics import mean_squared_error

In [32]:
X_cols = ["is_preg", "Glucose", "BloodPressure", "Insulin"]

neighbors = [3,5,10,20]
rmses = []
for n_n in neighbors:
    model = KNeighborsRegressor(n_neighbors=n_n)
    model.fit(X=df_train.loc[:,X_cols], y=df_train["BMI"])
    pred = model.predict(df_test.loc[:, X_cols])
    rmses_sub = mean_squared_error(y_pred = pred, y_true = df_test["BMI"]) ** 0.5
    rmses.append(rmses_sub)
    
df_score = pd.DataFrame({"neighbors":neighbors, "rmses":rmses})
df_score["rmses"] = df_score["rmses"].round(3)
df_score

Unnamed: 0,neighbors,rmses
0,3,8.508
1,5,8.706
2,10,8.517
3,20,8.514
