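# KNN Regression Algorithm

Predicting movie ratings

The `neighbors` module of `scikit-learn`

- `KNeighborsRegressor`: creates the KNN model
   - `weights = 'distance'`: weights each neighbor by the inverse of its distance
   - `n_neighbors = 10`: sets the value of k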
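
To make the effect of `weights = 'distance'` concrete, the following minimal sketch reproduces one prediction by hand as an inverse-distance weighted average of the k nearest targets. The toy data and the query point are hypothetical and are not taken from the IMDb dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: one feature, four training points
X_train = np.array([[0.0], [1.0], [2.0], [4.0]])
y_train = np.array([1.0, 2.0, 3.0, 5.0])

model = KNeighborsRegressor(n_neighbors=3, weights='distance')
model.fit(X_train, y_train)

query = np.array([[1.5]])
sklearn_pred = model.predict(query)[0]

# Reproduce the prediction by hand:
# weight each of the k nearest targets by 1 / distance, then average
distances, indices = model.kneighbors(query)
weights = 1.0 / distances[0]
manual_pred = np.sum(weights * y_train[indices[0]]) / np.sum(weights)

print(sklearn_pred, manual_pred)
```

Both printed values should agree, which shows that closer neighbors pull the prediction more strongly than distant ones.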

In [66]:
# Import the required libraries

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_squared_error
from matplotlib import pyplot as plt

**Data preprocessing**

We preprocess the Kaggle IMDb movie data into the form needed to predict each movie's mean rating from its release year, running time, and production budget.

In [3]:
movies = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv')
ratings = pd.read_csv('../input/imdb-extensive-dataset/IMDb ratings.csv')
data = pd.merge(movies, ratings, how='inner', on='imdb_title_id')
data = data[['title', 'year', 'duration', 'budget', 'mean_vote']]
data = data.dropna(axis=0).reset_index(drop=True)

def budget_convert(x):
    # Keep only the digit characters, dropping any currency symbol or code
    num = ''
    for i in x:
        if i.isdigit(): num += i
    return int(num)

# Convert both columns to integers
data['year'] = data['year'].apply(int)
data['budget'] = data['budget'].apply(budget_convert)
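
The digit-extraction step can be tried in isolation. The sketch below uses a regular expression instead of a character loop; the helper name `digits_to_int` and the sample strings are hypothetical and only illustrate the kind of input the conversion is meant to handle.

```python
import re

def digits_to_int(text):
    """Keep only the digit characters of `text` and convert them to int."""
    return int(re.sub(r'\D', '', text))

# Hypothetical raw budget strings, for illustration only
samples = ['$ 2250', 'ITL 45000', '$ 1,000,000']
converted = [digits_to_int(s) for s in samples]
print(converted)  # → [2250, 45000, 1000000]
```

Note that this approach discards the currency marker, so amounts given in different currencies end up on the same numeric scale without any conversion.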

In [4]:
data.head()

| | title | year | duration | budget | mean_vote |
|---|---|---|---|---|---|
| 0 | The Story of the Kelly Gang | 1906 | 70 | 2250 | 6.3 |
| 1 | Cleopatra | 1912 | 100 | 45000 | 5.3 |
| 2 | Quo Vadis? | 1913 | 120 | 45000 | 6.2 |
| 3 | Independenta Romaniei | 1912 | 120 | 400000 | 7.1 |
| 4 | Richard III | 1912 | 55 | 30000 | 5.4 |


In [5]:
# Use only the first 1,000 movies; .copy() avoids modifying a view of `data` later
movies = data[['title']][:1000].copy()
movie_data = data[['year', 'duration', 'budget']][:1000].copy()
movie_target = data[['mean_vote']][:1000].copy()

In [6]:
movie_data.head()

| | year | duration | budget |
|---|---|---|---|
| 0 | 1906 | 70 | 2250 |
| 1 | 1912 | 100 | 45000 |
| 2 | 1913 | 120 | 45000 |
| 3 | 1912 | 120 | 400000 |
| 4 | 1912 | 55 | 30000 |


In [7]:
movie_target.head()

| | mean_vote |
|---|---|
| 0 | 6.3 |
| 1 | 5.3 |
| 2 | 6.2 |
| 3 | 7.1 |
| 4 | 5.4 |


In [8]:
movie_data.describe()

| | year | duration | budget |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 1935.887 | 95.313 | 840410.6 |
| std | 7.77436 | 30.846986 | 1823768.0 |
| min | 1906.0 | 48.0 | 2000.0 |
| 25% | 1932.0 | 76.0 | 231500.0 |
| 50% | 1937.0 | 90.0 | 500000.0 |
| 75% | 1942.0 | 105.0 | 1000000.0 |
| max | 2001.0 | 306.0 | 48000000.0 |


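## 1. Data normalization

The summary above shows that the scales differ greatly between columns (features). <br>
We therefore normalize the data before training the model. <br>
We define a min-max normalization function and apply it to `movie_data` above so that every feature has a minimum of 0 and a maximum of 1.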

In [9]:
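def normalize(dataset):
    # Min-max scaling: map the smallest value to 0 and the largest to 1.
    # The min and max are computed once instead of on every iteration.
    lo, hi = min(dataset), max(dataset)
    return [(value - lo) / (hi - lo) for value in dataset]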

In [10]:
for colname in movie_data.columns:
    movie_data[colname] = normalize(movie_data[colname])

movie_data.describe()

| | year | duration | budget |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 0.3146 | 0.183384 | 0.017468 |
| std | 0.081835 | 0.119562 | 0.037997 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.273684 | 0.108527 | 0.004781 |
| 50% | 0.326316 | 0.162791 | 0.010375 |
| 75% | 0.378947 | 0.22093 | 0.020793 |
| max | 1.0 | 1.0 | 1.0 |
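
The same min-max scaling is available in scikit-learn as `MinMaxScaler`. The sketch below is a possible variation, not what this notebook does: it fits the scaler on the training rows only, so that the test rows do not influence the minimum and maximum. The `features` DataFrame is a hypothetical stand-in with made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for movie_data (values made up for illustration)
rng = np.random.default_rng(0)
features = pd.DataFrame({
    'year': rng.integers(1906, 2002, size=100),
    'duration': rng.integers(48, 307, size=100),
    'budget': rng.integers(2_000, 48_000_001, size=100),
})

X_train, X_test = train_test_split(features, test_size=0.2, random_state=30)

scaler = MinMaxScaler()
# Learn each column's min and max from the training rows only
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train),
                              columns=features.columns, index=X_train.index)
# Reuse those training min/max values for the test rows
X_test_scaled = pd.DataFrame(scaler.transform(X_test),
                             columns=features.columns, index=X_test.index)

print(X_train_scaled.describe().loc[['min', 'max']])
```

With this variation, scaled test values can fall slightly outside the range 0 to 1 whenever a test row is more extreme than anything in the training rows.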


## 2. Building the model

In [27]:
X_train, X_test, y_train, y_test = train_test_split(movie_data, movie_target, test_size = 0.2, random_state = 30)

print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

800 200
800 200


**Creating the KNN model**

Let's create and train the model with k = 10.

In [28]:
regressor = KNeighborsRegressor(n_neighbors = 10, weights = 'distance')
regressor.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=10, weights='distance')

In [29]:
y_pred = regressor.predict(X_test)
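
The predictions for the whole test split can be scored with the `r2_score` and `mean_squared_error` functions imported at the top. The following sketch shows one way to do that; it runs on synthetic stand-in data (an assumption, not the IMDb data), so the printed numbers say nothing about the movie model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the IMDb data): 3 features in [0, 1], noisy target
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = 5 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30)

regressor = KNeighborsRegressor(n_neighbors=10, weights='distance')
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# RMSE is the square root of the mean squared error over all test rows
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print('RMSE: {:.4f}, R^2: {:.4f}'.format(rmse, r2))
```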

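**Predicting the rating of a new movie**

Let's predict the rating of a movie in the test data and compute the RMSE. Because only one movie is evaluated here, the RMSE is simply the absolute difference between the predicted and actual ratings.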

In [71]:
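movie_name = input()
# Row labels of every movie with this title (may be empty or contain several rows)
idx = movies[movies['title'] == movie_name].index
# Keep only the rows that ended up in the test split
test_idx = idx[idx.isin(X_test.index)]
if len(test_idx) > 0:
    pred = regressor.predict(X_test.loc[test_idx]).ravel()
    actual = y_test.loc[test_idx, 'mean_vote']
    rmse = np.sqrt(mean_squared_error(actual, pred))
    print('The predicted rating of {} is {:.2f}, with an RMSE of {:.4f}.'
          .format(movie_name, pred[0], rmse))
else:
    print('Please enter a different movie!')

 Fantasia


The predicted rating of Fantasia is 7.76, with an RMSE of 0.0428.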
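
As a small check on how to read the number reported for a single movie, the sketch below compares the RMSE of one prediction with its absolute error, and then computes the RMSE over several predictions. All ratings in it are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical ratings, for illustration only
actual_one, predicted_one = [7.0], [7.5]
rmse_one = np.sqrt(mean_squared_error(actual_one, predicted_one))
abs_error = abs(actual_one[0] - predicted_one[0])

# With several movies, RMSE summarizes all of the errors in one number
actual_many = [7.0, 6.0, 8.0]
predicted_many = [7.5, 6.0, 7.0]
rmse_many = np.sqrt(mean_squared_error(actual_many, predicted_many))

print(rmse_one, abs_error, rmse_many)
```

For one sample the two quantities coincide, so a per-movie RMSE is best read as that movie's prediction error rather than as a measure of overall model quality.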
