<a href="https://colab.research.google.com/github/dusrbrla-mbb/kubig-portfolio/blob/temp/ml_exe_cf_intro_movie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

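### Introduction to Recommender Systems - movies dataset
##### By what criterion should items be ranked?
1. Content-Based (CB) - ranks by the item's detailed description
2. Knowledge-Based (KB) - ranks by rules that domain experts wrote in advance
* Advantage of both: they can serve a new customer or a new item right away.
* Disadvantage of both: they stop being practical as the number and variety of items grows.  
\-> They are therefore used where the catalog is small (e.g. high-end markets, where face-to-face service matters).
3. Collaborative Filtering (CF) - ranks by the links between customers and items  

    1) User-based CF - recommends items tailored to one customer  
    2) Item-based CF - recommends similar items to someone who picked a given item
    * Advantage: it avoids the scaling problem of CB and KB, so it suits big data
    * Disadvantage: it cannot respond right away to a new customer or **a new item** (the cold-start issue).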
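
As a rough illustration of the content-based idea, the sketch below ranks titles by how much their genre sets overlap (Jaccard similarity). The helper names `jaccard` and `most_similar_by_genre` and the three-title catalog are assumptions made for this example; genres are only one possible stand-in for an item's "detailed description".

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A & B| / |A | B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy catalog; genre strings follow the pipe-separated style of movies.csv.
catalog = {
    'Toy Story (1995)': 'Adventure|Animation|Children|Comedy|Fantasy',
    'Jumanji (1995)': 'Adventure|Children|Fantasy',
    'Grumpier Old Men (1995)': 'Comedy|Romance',
}
genre_sets = {title: set(genres.split('|')) for title, genres in catalog.items()}

def most_similar_by_genre(title: str) -> list:
    """Rank every other title by genre overlap with `title`, best first."""
    base = genre_sets[title]
    scores = [(other, jaccard(base, g)) for other, g in genre_sets.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

cb_ranking = most_similar_by_genre('Toy Story (1995)')
print(cb_ranking)
```

Because the score depends only on item attributes, a brand-new title can be ranked as soon as its genres are known, which is the advantage listed above.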

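### Business Understanding
##### Cross / Up / Down - Sell
* Cross-Sell: set menus / bundling / showing items that go well together
* Up-Sell: "a large set for 500 won more" -> showing a somewhat pricier item with a higher margin
* Down-Sell: showing a cheaper item to a customer who intends to buy but balks at the price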

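### Collaborative Filtering
##### Uses information about tastes and preferences gathered from users.
* It assumes that past tendencies will continue to hold in the future (an assumption shared with supervised learning, and in that sense a fixed preconception).
* It can recommend tailored items to me based on the past choices of people who chose similarly to me.
* It cross-recommends items that customers with similar tastes have not bought yet.
* Item-based CF: recommends items similar to a given item to someone who searched for that item.
* User-based CF: recommends to a given customer the choices of other people who chose similarly to that customer.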
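
The two variants can be sketched on a tiny made-up rating matrix before touching the real data. The frame `toy` and its values are assumptions for illustration only. Correlating columns gives item-item similarity, and correlating rows (via a transpose) gives user-user similarity.

```python
import pandas as pd

# Hypothetical ratings: rows are users, columns are items.
toy = pd.DataFrame(
    {'A': [5, 4, 1, 2], 'B': [4, 5, 2, 1], 'C': [1, 2, 5, 4]},
    index=['u1', 'u2', 'u3', 'u4'],
    dtype=float,
)

toy_item_sim = toy.corr()    # item-based: Pearson correlation between columns
toy_user_sim = toy.T.corr()  # user-based: transpose so users become columns

print(toy_item_sim.round(2))
print(toy_user_sim.round(2))
```

Here items A and B are rated alike by everyone and come out positively correlated, while A and C are rated in opposite ways. Likewise u1 and u2 look similar, and u1 and u3 look like opposites.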

In [1]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/My Drive/Study/Jupyter/movie'
movies = pd.read_csv(path + '/' + 'movies.csv')
ratings = pd.read_csv(path + '/' + 'ratings.csv')
print(movies.shape)
print(ratings.shape)

Mounted at /content/drive
(9742, 3)
(100836, 4)


In [2]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
combined = pd.merge(ratings, movies, on='movieId')
# Both frames contain a movieId column,
# so the two data frames are joined on it (an inner join by default).
combined.head()

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
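
A merge like this can silently drop or duplicate rows, so it may be worth checking the join. The sketch below uses two small made-up frames (`toy_ratings`, `toy_movies`) and the `validate` and `indicator` options of `pd.merge`. Whether the real files need such checks is not something the notebook establishes.

```python
import pandas as pd

toy_ratings = pd.DataFrame({'userId': [1, 2], 'movieId': [10, 99], 'rating': [4.0, 3.0]})
toy_movies = pd.DataFrame({'movieId': [10, 20], 'title': ['A', 'B']})

# validate='many_to_one' raises MergeError if a movieId repeats on the right side.
# how='left' keeps every rating; indicator=True adds a `_merge` column that
# says whether each row found a match.
merged = pd.merge(toy_ratings, toy_movies, on='movieId', how='left',
                  validate='many_to_one', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only']

print(merged)
print(f'ratings without a matching movie: {len(unmatched)}')
```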


In [5]:
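pvt = combined.pivot_table(index='userId', columns='title', values='rating').fillna(0)
# Express the user-item links ("user A gave item B a rating of C")
# as a data frame.
# Each row holds one user's data.
# Each column holds one item's data.
# The cell where a row and a column meet is the rating C that user A gave item B.
# Note: fillna(0) records "not rated" as a rating of 0, which affects the correlations below.
pvt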

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,,,,,,,,,,,,,,,,,,,,,
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.5,3.5,0.0,0.0,0.0
609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
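
Most cells in a user-item matrix are empty, and filling them with 0 is a modelling choice rather than a neutral default. The sketch below, on a made-up long-format frame `toy_long` shaped like `combined`, contrasts the pivot before and after `fillna(0)` and measures how sparse it is.

```python
import pandas as pd

# Hypothetical long-format ratings, shaped like `combined`.
toy_long = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'title':  ['A', 'B', 'A', 'C', 'B'],
    'rating': [4.0, 5.0, 3.0, 2.0, 1.0],
})

sparse = toy_long.pivot_table(index='userId', columns='title', values='rating')
dense = sparse.fillna(0)

# Share of user-item cells that carry no rating at all.
sparsity = sparse.isna().to_numpy().mean()
print(sparse)
print(f'sparsity: {sparsity:.2f}')

# Mean rating of item 'A': NaN cells are skipped, zero-filled cells are counted.
print(sparse['A'].mean(), dense['A'].mean())
```

In the zero-filled version the user who never rated item A pulls its mean down, and the same effect carries into any correlation computed on that matrix.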


### Item-Based

In [6]:
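item_corr = pvt.corr()
# Compute the correlation coefficient between every pair of items.
# If the same people gave items A, B, and C the same high score (say 4),
# A, B, and C come out as more similar.
# Strictly speaking, ratings are ordinal, so Spearman correlation (computed on ranks)
# is more appropriate than Pearson (the default here), which treats them as integers or reals.
# Knowing how the two coefficients are defined leads to a better-justified choice.
item_corr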

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,,,,,,,,,,,,,,,,,,,,,
'71 (2014),1.000000,-0.001642,-0.002324,-0.001642,-0.002254,-0.001642,-0.006407,-0.001642,0.135943,-0.004325,...,-0.001642,0.339935,0.542247,0.706526,-0.001642,-0.007675,0.134327,0.325287,-0.008185,-0.001642
'Hellboy': The Seeds of Creation (2004),-0.001642,1.000000,0.706526,-0.001642,-0.002254,-0.001642,-0.006407,-0.001642,-0.010568,-0.004325,...,-0.001642,-0.004589,-0.002808,-0.002324,-0.001642,-0.007675,-0.007744,-0.003594,-0.008185,-0.001642
'Round Midnight (1986),-0.002324,0.706526,1.000000,-0.002324,-0.003191,-0.002324,0.170199,-0.002324,-0.014958,-0.006121,...,-0.002324,-0.006495,-0.003975,-0.003289,-0.002324,-0.010863,-0.010961,-0.005087,-0.011585,-0.002324
'Salem's Lot (2004),-0.001642,-0.001642,-0.002324,1.000000,0.857269,-0.001642,-0.006407,-0.001642,-0.010568,-0.004325,...,-0.001642,-0.004589,-0.002808,-0.002324,-0.001642,-0.007675,-0.007744,-0.003594,-0.008185,-0.001642
'Til There Was You (1997),-0.002254,-0.002254,-0.003191,0.857269,1.000000,-0.002254,-0.008797,-0.002254,-0.014510,-0.005938,...,-0.002254,-0.006301,-0.003856,-0.003191,-0.002254,-0.010538,-0.010632,-0.004935,-0.011238,-0.002254
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ (1999),-0.007675,-0.007675,-0.010863,-0.007675,-0.010538,-0.007675,0.187953,0.212646,0.053614,0.115396,...,-0.007675,-0.021449,-0.013126,-0.010863,-0.007675,1.000000,0.163022,-0.016800,0.138611,-0.007675
xXx (2002),0.134327,-0.007744,-0.010961,-0.007744,-0.010632,-0.007744,0.062174,-0.007744,0.241092,-0.000060,...,0.063291,0.291410,0.163464,0.240394,-0.007744,0.163022,1.000000,0.259049,0.065673,-0.007744
xXx: State of the Union (2005),0.325287,-0.003594,-0.005087,-0.003594,-0.004935,-0.003594,-0.014025,-0.003594,0.139511,-0.009467,...,-0.003594,0.376455,0.172818,0.227658,-0.003594,-0.016800,0.259049,1.000000,-0.017917,-0.003594
¡Three Amigos! (1986),-0.008185,-0.008185,-0.011585,-0.008185,-0.011238,-0.008185,0.353194,0.175610,0.125905,0.234514,...,0.175610,-0.022876,-0.013999,-0.011585,-0.008185,0.138611,0.065673,-0.017917,1.000000,-0.008185
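
The Pearson versus Spearman point can be tried directly, since `DataFrame.corr` accepts a `method` argument. The sketch below uses a small made-up frame `r` that keeps NaN for "not rated". It also uses `min_periods` so that pairs with too few co-raters yield NaN instead of a misleading coefficient. The threshold of 3 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical user x item ratings with NaN for "not rated".
r = pd.DataFrame({
    'X': [5.0, 4.0, 3.0, 1.0, np.nan],
    'Y': [4.5, 4.0, 2.0, 1.0, np.nan],
    'Z': [np.nan, np.nan, np.nan, 2.0, 5.0],
})

# Only users who rated both items in a pair enter that pair's coefficient;
# pairs with fewer than 3 such users are reported as NaN.
pearson = r.corr(method='pearson', min_periods=3)
spearman = r.corr(method='spearman', min_periods=3)

print(pearson.round(3))
print(spearman.round(3))
```

X and Y are ranked identically by their four common raters, so Spearman gives exactly 1 while Pearson is slightly lower. X and Z share a single rater, so their entry is NaN.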


In [7]:
target = 'Wick'
# Look up the exact item name first (a simple, case-sensitive search).
for title in item_corr.columns:
    if target in title:
        print(title)

Crow, The: Wicked Prayer (2005)
John Wick (2014)
John Wick: Chapter Two (2017)
Something Wicked This Way Comes (1983)
Wicked Blood (2014)
Wicked City (Yôjû toshi) (1987)
Wicker Man, The (1973)
Wicker Man, The (2006)
Wicker Park (2004)
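
The same lookup can be written with pandas string methods, which also makes a case-insensitive search easy. This is a sketch only: `search_titles` is a hypothetical helper and `sample_titles` stands in for `item_corr.columns`.

```python
import pandas as pd

def search_titles(titles: pd.Index, keyword: str) -> list:
    """Return titles containing `keyword`, ignoring case (plain substring, no regex)."""
    mask = titles.str.contains(keyword, case=False, regex=False)
    return titles[mask].tolist()

# Stand-in for item_corr.columns.
sample_titles = pd.Index(['John Wick (2014)', 'Wicker Park (2004)', 'Toy Story (1995)'])
hits = search_titles(sample_titles, 'wick')
print(hits)
```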


In [8]:
interested = 'John Wick (2014)'
# Find other items similar to this one.
# The top row is the item itself (correlation 1.0).
item_corr.sort_values(by=interested, ascending=False)[interested].head()

title
John Wick (2014)                                  1.000000
Mad Max: Fury Road (2015)                         0.617796
Snowpiercer (2013)                                0.604255
Fast Five (Fast and the Furious 5, The) (2011)    0.557030
Rogue One: A Star Wars Story (2016)               0.545675
Name: John Wick (2014), dtype: float64
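
Since the first hit is always the queried item itself, a small wrapper can drop it before taking the top results. The function name `top_similar` and the toy matrix are assumptions for this sketch. It works the same way for an item-item or a user-user correlation frame.

```python
import pandas as pd

def top_similar(corr: pd.DataFrame, key, n: int = 5) -> pd.Series:
    """Return the n entries most correlated with `key`, leaving out `key` itself."""
    return corr[key].drop(labels=key).nlargest(n)

# Toy symmetric correlation matrix standing in for item_corr.
toy_corr = pd.DataFrame(
    [[1.0, 0.6, 0.1], [0.6, 1.0, -0.2], [0.1, -0.2, 1.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'],
)
similar_to_a = top_similar(toy_corr, 'A', n=2)
print(similar_to_a)
```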

### User-Based

In [9]:
# Transpose so that users become columns, then correlate users with each other.
user_corr = pvt.T.corr()
user_corr

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,,,,,,,,,,,,,,,,,,,,,
1,1.000000,0.019396,0.053052,0.176911,0.120862,0.104406,0.143785,0.128542,0.055263,-0.000307,...,0.066248,0.149934,0.186959,0.056523,0.134402,0.121958,0.254192,0.262225,0.085430,0.098693
2,0.019396,1.000000,-0.002595,-0.003808,0.013181,0.016252,0.021564,0.023748,-0.003450,0.061877,...,0.198547,0.010885,-0.004038,-0.005348,-0.007923,0.011290,0.005809,0.032723,0.024371,0.089321
3,0.053052,-0.002595,1.000000,-0.004559,0.001886,-0.004581,-0.005637,0.001701,-0.003112,-0.005504,...,0.000148,-0.000588,0.011203,-0.004824,0.003674,-0.003255,0.012881,0.008089,-0.002964,0.015953
4,0.176911,-0.003808,-0.004559,1.000000,0.121014,0.065707,0.100595,0.054231,0.002412,0.015607,...,0.072841,0.114280,0.281852,0.039692,0.065483,0.164812,0.115109,0.116843,0.023926,0.062498
5,0.120862,0.013181,0.001886,0.121014,1.000000,0.294134,0.101721,0.426575,-0.004187,0.023468,...,0.061908,0.414929,0.095386,0.254115,0.141073,0.090149,0.145760,0.122600,0.258288,0.040361
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.121958,0.011290,-0.003255,0.164812,0.090149,0.047476,0.172484,0.081904,0.057979,0.054858,...,0.153879,0.084190,0.224593,0.035234,0.106729,1.000000,0.115978,0.188312,0.052375,0.093788
607,0.254192,0.005809,0.012881,0.115109,0.145760,0.142158,0.173287,0.178130,0.003252,-0.004817,...,0.080027,0.187581,0.173008,0.126261,0.101129,0.115978,1.000000,0.258232,0.142529,0.098496
608,0.262225,0.032723,0.008089,0.116843,0.122600,0.137932,0.305429,0.175906,0.086221,0.048357,...,0.136304,0.174056,0.164440,0.133722,0.144878,0.188312,0.258232,1.000000,0.109556,0.248902
609,0.085430,0.024371,-0.002964,0.023926,0.258288,0.207121,0.084491,0.421626,-0.003940,0.014980,...,0.029660,0.331051,0.045991,0.232113,0.089806,0.052375,0.142529,0.109556,1.000000,0.033702


In [10]:
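interested = 379
# Find other customers who are highly similar to a given customer:
# a customer B who made choices similar to customer A's in the past.
# That overlap is the basis for judging that A and B have similar tastes.
# The top row is the customer themselves (correlation 1.0).
user_corr.sort_values(by=interested, ascending=False)[interested].head()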

userId
379    1.000000
126    0.812737
470    0.700414
347    0.688932
94     0.688177
Name: 379, dtype: float64

In [11]:
user_1, user_2 = 379, 470
# After identifying customer B as similar to customer A,
# exclude the items both have seen (the intersection) from the recommendations,
# because repeat purchases are not meaningful in this domain.
# Use a set difference to keep only the items that B alone has chosen.
u1_title = set(combined.loc[combined['userId'] == user_1, 'title'])
u2_title = set(combined.loc[combined['userId'] == user_2, 'title'])
diff = u2_title.difference(u1_title)

In [12]:
u2_data = combined.loc[combined['userId'] == user_2]
# Among the items only B has chosen, pick the ones B rated highest.
u2_data.loc[u2_data['title'].isin(diff)].sort_values(by='rating', ascending=False)['title'].head()

162                        Toy Story (1995)
19602    Four Weddings and a Funeral (1994)
32487             Executive Decision (1996)
30909         Star Trek: Generations (1994)
26669        Remains of the Day, The (1993)
Name: title, dtype: object
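
The last two cells can be folded into one reusable function. The sketch below is one possible packaging, not the notebook's own code: `recommend_from_neighbor` is a hypothetical name and `toy_combined` is made-up data with the same columns as `combined`. Returning the rating next to the title makes it easier to see why an item was suggested.

```python
import pandas as pd

def recommend_from_neighbor(ratings_long: pd.DataFrame, target_user, neighbor,
                            n: int = 5) -> pd.DataFrame:
    """Titles the neighbor rated but the target user has not, best-rated first."""
    seen = set(ratings_long.loc[ratings_long['userId'] == target_user, 'title'])
    neighbor_rows = ratings_long.loc[ratings_long['userId'] == neighbor]
    # Keep only what the target user has not seen (the set difference).
    unseen = neighbor_rows.loc[~neighbor_rows['title'].isin(seen)]
    return unseen.sort_values(by='rating', ascending=False)[['title', 'rating']].head(n)

# Hypothetical data shaped like `combined`.
toy_combined = pd.DataFrame({
    'userId': [1, 1, 2, 2, 2],
    'title':  ['A', 'B', 'A', 'C', 'D'],
    'rating': [5.0, 4.0, 5.0, 2.0, 4.5],
})
recs = recommend_from_neighbor(toy_combined, target_user=1, neighbor=2)
print(recs)
```

With the real data the call would look like `recommend_from_neighbor(combined, 379, 470)`. Filtering out low ratings before `head(n)` is a natural refinement, since the neighbor may have disliked some of the unseen titles.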