## 1. Load the CSV File into a DataFrame

In [1]:
import pandas as pd
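movies = pd.read_csv("./dataset/tmdb_5000_movies.csv")         # read tmdb_5000_movies.csv into a DataFrame
movies = movies[["original_title", "overview"]].copy()         # keep two columns; .copy() avoids SettingWithCopyWarning on the next line
movies["overview"] = movies["overview"].astype("str")          # cast to str; where this yields plain Python strings, missing values become the literal "nan"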
movies

| | original_title | overview |
|---|---|---|
| 0 | Avatar | In the 22nd century, a paraplegic Marine is di... |
| 1 | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... |
| 2 | Spectre | A cryptic message from Bond’s past sends him o... |
| 3 | The Dark Knight Rises | Following the death of District Attorney Harve... |
| 4 | John Carter | John Carter is a war-weary, former military ca... |
| ... | ... | ... |
| 4798 | El Mariachi | El Mariachi just wants to play his guitar and ... |
| 4799 | Newlyweds | A newlywed couple's honeymoon is upended by th... |
| 4800 | Signed, Sealed, Delivered | "Signed, Sealed, Delivered" introduces a dedic... |
| 4801 | Shanghai Calling | When ambitious New York attorney Sam is sent t... |


## 2. Preprocessing

In [2]:
# count null values per column
movies.isnull().sum()

original_title    0
overview          0
dtype: int64

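Convert the "overview" column to lowercase and keep only word characters (\w: letters, digits, and underscore), replacing everything else with a space.\
See https://wikidocs.net/21703 for reference.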

In [3]:
import re
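# raw string r'\W' avoids an invalid-escape warning; this only displays the cleaned text and does not assign it back to movies
movies['overview'].apply(lambda x: re.sub(r'\W', ' ', x.lower()))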

0       in the 22nd century  a paraplegic marine is di...
1       captain barbossa  long believed to be dead  ha...
2       a cryptic message from bond s past sends him o...
3       following the death of district attorney harve...
4       john carter is a war weary  former military ca...
                              ...                        
4798    el mariachi just wants to play his guitar and ...
4799    a newlywed couple s honeymoon is upended by th...
4800     signed  sealed  delivered  introduces a dedic...
4801    when ambitious new york attorney sam is sent t...
4802    ever since the second grade when he first saw ...
Name: overview, Length: 4803, dtype: object
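
The substitution above can be checked on a single string. The following is a minimal sketch, not part of the original notebook: `clean_text` is a hypothetical helper name, and the extra step that collapses repeated spaces is an assumption added for readability (the notebook cell leaves double spaces in place).

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, replace non-word characters with spaces, collapse repeated spaces.

    Hypothetical helper for illustration; the whitespace collapsing is an
    addition that the notebook cell does not perform.
    """
    lowered = text.lower()
    spaced = re.sub(r"\W", " ", lowered)        # \W = anything that is not a letter, digit, or underscore
    return re.sub(r"\s+", " ", spaced).strip()  # squeeze runs of whitespace into one space


print(clean_text("Captain Barbossa, long believed to be dead"))
# captain barbossa long believed to be dead
print(clean_text("snake_case stays"))
# snake_case stays
```

Because the underscore counts as a word character, it survives the substitution. To keep the cleaned text for later cells, the result would have to be assigned, for example `movies["overview"] = movies["overview"].apply(clean_text)`.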

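Use TF-IDF\
=> converts text into numeric vectors through feature extraction\
The most basic approach is CountVectorizer, but because it relies on raw counts it risks **assigning high values to words that carry little meaning**, such as particles and articles\
See https://chan-lab.tistory.com/24?category=810217 for reference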
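
To see why TF-IDF dampens common words, the weighting can be worked out by hand on a toy corpus. This is a hand-rolled sketch rather than scikit-learn's implementation: the three sentences are made up, tokenization is a plain whitespace split, and the final L2 row normalization that `TfidfVectorizer` applies by default is left out. The `idf` formula is the smoothed variant that scikit-learn documents for `smooth_idf=True`.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "the hero saves the city",
    "the villain attacks the city",
    "the hero meets a wizard",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(tokens: list) -> dict:
    """Raw term count multiplied by idf (no normalization in this sketch)."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}


# In the third document every word occurs exactly once, so raw counts
# cannot tell "the" apart from "wizard"; TF-IDF can.
weights = tfidf(tokenized[2])
for term in ("the", "hero", "wizard"):
    print(term, round(weights[term], 3))
# the 1.0
# hero 1.288
# wizard 1.693
```

"the" appears in every document and receives the minimum weight, while "wizard", which appears in only one, receives the highest.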

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
# set the n-gram range to 1-2 words
# e.g. two-word phrases such as "go home" or "very nice" also become features
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = tfidf_vec.fit_transform(movies['overview'])  # learn the vocabulary and build the TF-IDF matrix
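
To make the effect of `ngram_range=(1, 2)` concrete, the sketch below lists the features that a single sentence would contribute. `word_ngrams` is a hypothetical helper written for illustration; it splits on whitespace, whereas scikit-learn's default tokenizer also drops single-character tokens, so the real vocabulary can differ slightly.

```python
def word_ngrams(text: str, ngram_range=(1, 2)) -> list:
    """Return all word n-grams of `text` for every n in the inclusive range.

    Simplified illustration: tokens come from a plain whitespace split.
    """
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("go home very nice"))
# ['go', 'home', 'very', 'nice', 'go home', 'home very', 'very nice']
```

Adding bigrams grows the vocabulary considerably, since each distinct pair of adjacent words becomes its own column in the matrix.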

## 3. Cosine Similarity

In [5]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
print("tfidf_matrix shape: ",tfidf_matrix.shape) # (number of documents, number of unique n-grams)
plot_similarity = cosine_similarity(tfidf_matrix,tfidf_matrix) # pairwise cosine similarity between overviews, computed from tfidf_matrix
print("### COSINE Similarity ###")
print(plot_similarity)
similar_index = np.argsort(-plot_similarity)  # per row, indices sorted from most to least similar
print("### Indices sorted by similarity ###")
print(similar_index)

tfidf_matrix shape:  (4803, 154844)
### COSINE Similarity ###
[[1.         0.01514413 0.00614504 ... 0.01195829 0.00572386 0.006304  ]
 [0.01514413 1.         0.01308527 ... 0.0176922  0.00997908 0.00666831]
 [0.00614504 0.01308527 1.         ... 0.01289    0.00565554 0.00612954]
 ...
 [0.01195829 0.0176922  0.01289    ... 1.         0.01532978 0.00900306]
 [0.00572386 0.00997908 0.00565554 ... 0.01532978 1.         0.01649947]
 [0.006304   0.00666831 0.00612954 ... 0.00900306 0.01649947 1.        ]]
### Indices sorted by similarity ###
[[   0 3604  634 ... 4140 2596 2669]
 [   1 2379 2542 ...  161 2656 4458]
 [   2 1343 3162 ... 4144 4148 4180]
 ...
 [4800 4034  569 ... 2853 4140 1038]
 [4801 2017 1480 ... 4140 4458 2108]
 [4802 2586  868 ... 3988 4513 3152]]
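
Each row of the sorted index starts with the movie itself (similarity 1), followed by its nearest neighbours. One way to turn that into title recommendations is sketched below. This is an illustrative assumption, not code from the notebook: `cosine` and `recommend` are hypothetical helpers, and the toy titles and vectors are made up so the block runs without the dataset.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(title, titles, similar_index, top_n=3) -> list:
    """Return the `top_n` titles most similar to `title`, skipping the title itself.

    `similar_index[i]` must list item indices from most to least similar to item i.
    If a title occurs more than once, the first occurrence is used.
    """
    titles = list(titles)
    idx = titles.index(title)
    ranked = [int(j) for j in similar_index[idx] if int(j) != idx]
    return [titles[j] for j in ranked[:top_n]]


# Toy data (made up): four "movies" described by three-dimensional vectors
toy_titles = ["A", "B", "C", "D"]
toy_vectors = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

sim = [[cosine(u, v) for v in toy_vectors] for u in toy_vectors]
# Same idea as np.argsort(-plot_similarity): indices by descending similarity
order = [sorted(range(len(row)), key=lambda j: -row[j]) for row in sim]

print(recommend("A", toy_titles, order, top_n=2))
# ['B', 'D']
```

With the notebook's variables, a call such as `recommend("Avatar", movies["original_title"], similar_index, top_n=5)` should work the same way, since the function only needs indexable rows of integer positions.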
