# **Movie Metadata and TF-IDF Practice**

Apply TF-IDF to movie plot summaries and compute the similarity between movies.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2021)  # fix the random seed

# 1. Data

1.1 Data Load

From the data, we use the title column (movie title) and the overview column (plot summary).

In [5]:
df = pd.read_csv("movies_metadata.csv")


In [6]:
df = df[["title", "overview"]]

Use only the first 1,000 rows of the full dataset.

In [7]:
df = df.iloc[:1000]

In [8]:
df.shape

(1000, 2)

In [9]:
df

| | title | overview |
|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... |
| 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... |
| 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... |
| 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... |
| ... | ... | ... |
| 995 | The Three Caballeros | For Donald's birthday he receives a box with t... |
| 996 | The Sword in the Stone | Wart is a young boy who aspires to be a knight... |
| 997 | So Dear to My Heart | The tale of Jeremiah Kincaid and his quest to ... |
| 998 | Robin Hood: Prince of Thieves | When the dastardly Sheriff of Nottingham murde... |


1.2 Data Cleaning

Where 'overview' is missing, replace it with a blank string.

In [10]:
df["overview"].isna().sum()

12

In [33]:
df["overview"] = df["overview"].fillna(' ')
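The cleaning step can be tried on a small table first. This is a minimal sketch under stated assumptions: the `toy` DataFrame and its rows are made up for illustration, and the `.copy()` call is an addition that the notebook itself does not use. It makes the column subset an independent DataFrame, which avoids pandas' `SettingWithCopyWarning` when the column is reassigned.

```python
import pandas as pd

# Toy stand-in for the movie table; the rows are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["A toy comes to life.", None, "A knight seeks a sword."],
    "budget": [1, 2, 3],
})

# .copy() (an assumption, not in the notebook) detaches the subset from `toy`,
# so the assignment below works on an independent DataFrame.
subset = toy[["title", "overview"]].iloc[:2].copy()

n_missing_before = int(subset["overview"].isna().sum())
subset["overview"] = subset["overview"].fillna("")
n_missing_after = int(subset["overview"].isna().sum())

print(n_missing_before, n_missing_after)
```

Filling with `""` or `" "` should make no difference to TF-IDF, since neither contains a token.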

# 2. Computing TF-IDF

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

2.1 Sample Data

In [23]:
# Look at just the first two sample overviews
df["overview"].values[:2]

array(["Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",
       "When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures."],
      dtype=object)

In [24]:
# stop_words='english' -> common words such as "and" or "or" are excluded from the TF-IDF vocabulary
transformer = TfidfVectorizer(stop_words='english')
# fit_transform learns the vocabulary and converts the text into a TF-IDF matrix
tfidf_matrix = transformer.fit_transform(df['overview'].values[:2])

In [25]:
tfidf_matrix.toarray()

array([[0.        , 0.        , 0.14358239, 0.        , 0.43074717,
        0.14358239, 0.14358239, 0.        , 0.14358239, 0.43074717,
        0.14358239, 0.        , 0.14358239, 0.        , 0.        ,
        0.14358239, 0.        , 0.14358239, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.14358239, 0.14358239,
        0.        , 0.        , 0.        , 0.        , 0.14358239,
        0.14358239, 0.14358239, 0.14358239, 0.        , 0.14358239,
        0.        , 0.        , 0.        , 0.14358239, 0.        ,
        0.14358239, 0.14358239, 0.        , 0.        , 0.        ,
        0.10216005, 0.        , 0.14358239, 0.14358239, 0.        ,
        0.        , 0.14358239, 0.        , 0.        , 0.43074717,
        0.        , 0.        ],
       [0.15160873, 0.15160873, 0.        , 0.30321746, 0.        ,
        0.        , 0.        , 0.15160873, 0.        , 0.        ,
        0.        , 0.15160873, 0.        , 0.15160873, 0.15160873,
        0.     
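To see where numbers like these come from, the weighting can be reproduced by hand. This is a minimal sketch, assuming scikit-learn's documented default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the toy token lists and the helper `tfidf_row` are hypothetical and stand in for already tokenized, stop-word-free overviews.

```python
import math
from collections import Counter

# Toy corpus (made up); tokens are assumed lower-cased with stop words removed.
docs = [["woody", "buzz", "room"], ["alan", "game", "room"]]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})
# Document frequency: number of documents that contain each term.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}
# Smoothed IDF, as documented for scikit-learn's defaults.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Hypothetical helper: raw count * idf per term, then L2-normalize the row."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
shared = rows[0][vocab.index("room")]  # term present in both documents
unique = rows[0][vocab.index("buzz")]  # term present in only one document
print(shared / unique)                 # equals 1 / (ln(1.5) + 1)
```

The two-overview output above is consistent with this: 0.10216005 / 0.14358239 matches 1 / (ln 1.5 + 1), as expected for a term that occurs in both overviews versus one that occurs in only one, and 0.43074717 is three times 0.14358239, as expected for a term that occurs three times.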

In [26]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(tfidf_matrix.toarray(), columns=transformer.get_feature_names_out()).T.head(10)

In [None]:
transformer.get_feature_names_out()  # shows which terms were extracted

2.2 Fit

In [28]:
transformer = TfidfVectorizer(stop_words='english')

2.3 Transform

In [29]:
tfidf_matrix = transformer.fit_transform(df["overview"])

In [31]:
tfidf_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Check the keywords

In [32]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
transformer.get_feature_names_out()[-5:]

# 3. Computing Movie Similarity

Cosine similarity can be used to compute the similarity between movies.

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

In [37]:
similarity = cosine_similarity(tfidf_matrix)

In [38]:
similarity

array([[1.        , 0.01570657, 0.        , ..., 0.        , 0.        ,
        0.01234882],
       [0.01570657, 1.        , 0.05047128, ..., 0.        , 0.01578968,
        0.02378018],
       [0.        , 0.05047128, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.01578968, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.01234882, 0.02378018, 0.        , ..., 0.        , 0.        ,
        1.        ]])
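The structure of this matrix (symmetric, with 1 on the diagonal) follows from the definition of cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch, assuming dense input; `cosine_similarity_matrix` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of a dense 2-D float array."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Keep all-zero rows as zeros instead of dividing by 0.
    unit = np.divide(x, norms, out=np.zeros_like(x), where=norms > 0)
    return unit @ unit.T

vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],  # stands in for a movie whose overview is blank
])
sim = cosine_similarity_matrix(vectors)
print(np.round(sim, 4))
```

Rows that share no terms get similarity 0, which is why most off-diagonal entries in the movie matrix are 0. In this sketch an all-zero row also gets 0 against itself, so a movie with a blank overview would not have a 1 on the diagonal.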

# 4. Recommending Similar Movies

As an example, recommend movies similar to the one at index 998.

In [39]:
idx = 998
print(df.loc[idx,'title'])

Robin Hood: Prince of Thieves


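From the similarity matrix computed above, extract the similarities between the movie at index 998 and every other movie, then return the indices with the highest similarity.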

In [40]:
similarity_one_idx = similarity[idx]

In [41]:
similarity_one_idx  # a single 1-D array

array([0.        , 0.01578968, 0.        , 0.06497967, 0.        ,
       0.01669058, 0.        , 0.        , 0.01884548, 0.        ,
       0.        , 0.03206453, 0.        , 0.01847308, 0.01592856,
       0.        , 0.        , 0.        , 0.03370223, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.05644874, 0.        , 0.00826806, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.01164103, 0.0793925 , 0.        ,
       0.        , 0.        , 0.01173592, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.01950889,
       0.        , 0.        , 0.        , 0.00772065, 0.        ,
       0.        , 0.        , 0.00915389, 0.01540557, 0.02512398,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.01417411, 0.        , 0.        , 0.     

1. `argsort` returns the indices that would sort the values in ascending order.
2. Reversing the result of `argsort` puts the most similar indices first.
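The two steps can be checked on a small array. The `scores` values here are made up for illustration.

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.0, 0.5])

ascending = scores.argsort()   # indices from smallest to largest value
descending = ascending[::-1]   # reversed: largest value first

print(ascending)   # [2 0 3 1]
print(descending)  # [1 3 0 2]
```

When several entries tie, as the many zeros in the similarity row do, their relative order in the result is not meaningful.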


In [46]:
order_idx = similarity_one_idx.argsort()[::-1]  # a step of -1 reverses the array, giving descending order

In [47]:
order_idx[:100]

array([998, 515, 913, 215, 779, 598,  43, 150, 675,   3, 392, 148, 181,
        25,  99, 231, 241, 548, 725, 363, 331, 447, 988, 207, 270, 804,
       256, 517, 940, 807, 401, 319, 245, 660, 977, 400,  18, 648,  11,
       695, 459, 621, 402, 503, 903, 135, 971, 594, 790, 975, 185, 978,
       513, 823, 883, 408, 853, 628, 944, 668, 663,  64, 446, 147, 953,
       718, 240, 264, 670, 642, 484, 742, 461, 857, 178, 797, 295, 671,
       763, 412, 470, 564, 243,  54, 268, 829, 841, 547, 890,   8, 597,
        13, 623, 778, 581, 768, 316, 867, 777, 133])

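The movie itself comes first with the highest similarity, followed by the indices of the most similar movies, so the first six entries are the movie plus its five closest matches.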

In [48]:
top5 = order_idx[:6]
top5

array([998, 515, 913, 215, 779, 598])

The titles in the original data for each of these indices are shown below.

The other "Robin Hood" movies are recommended as similar to "Robin Hood: Prince of Thieves".

In [49]:
df.loc[top5,'title']

998      Robin Hood: Prince of Thieves
515          Robin Hood: Men in Tights
913       The Adventures of Robin Hood
215                   Boys on the Side
779                          Lone Star
598    Candyman: Farewell to the Flesh
Name: title, dtype: object
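The lookup steps in this section can be wrapped into one function. This is a sketch only: `recommend` is a hypothetical helper, and the titles and similarity values are made up. Unlike the cells above, it drops the query movie explicitly rather than relying on it being ranked first.

```python
import numpy as np

def recommend(idx, similarity, titles, k=5):
    """Return the k titles most similar to titles[idx], excluding idx itself."""
    order = np.argsort(similarity[idx])[::-1]  # most similar first
    order = order[order != idx][:k]            # drop the query movie explicitly
    return [titles[i] for i in order]

# Toy data (made up for illustration).
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.4],
    [0.0, 0.3, 0.4, 1.0],
])

print(recommend(0, similarity, titles, k=2))  # ['Movie C', 'Movie B']
```

With the notebook's objects, the equivalent call would be something like `recommend(998, similarity, df["title"].tolist())`.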