# Dataset types
- Datasets for comparing the performance of recommendation algorithms
  1. MovieLens
  2. KMRD
  3. Netflix
(The Netflix dataset is fairly large for a Google Colab environment.)

In [1]:
import os
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
data_path = "/content/drive/My Drive/data/"

## MovieLens
  - A movie rating dataset developed at the University of Minnesota and widely used to evaluate recommendation algorithms.
  - Some analyses also refer to the [IMDb movie site](https://www.imdb.com).
  - [Dataset download link](https://grouplens.org/datasets/movielens/)
  - Several variants exist depending on dataset size, such as `ml-latest`, `ml-25m`, `ml-1m`, and `ml-10m`.
  - This notebook uses `ml-latest-small`.
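
If the files are not on Drive yet, a download step along these lines could fetch them. This is only a sketch: the `files.grouplens.org` URL pattern and the assumption that the archive unpacks into a folder named after the dataset should be checked against the download page linked above, and `movielens_zip_url` and `download_movielens` are hypothetical helper names. The cells below read from a `movielens` folder, so the extracted files would still need to be moved or the path adjusted.

```python
import os
import urllib.request
import zipfile

# Assumed base URL; confirm it on the GroupLens download page.
MOVIELENS_BASE_URL = "https://files.grouplens.org/datasets/movielens"


def movielens_zip_url(name="ml-latest-small"):
    """Build the archive URL for a MovieLens variant (assumed pattern)."""
    return f"{MOVIELENS_BASE_URL}/{name}.zip"


def download_movielens(dest_dir, name="ml-latest-small"):
    """Download and unpack one MovieLens variant into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    zip_path = os.path.join(dest_dir, f"{name}.zip")
    # Skip the download when the archive is already present.
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(movielens_zip_url(name), zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    # Assumption: the archive contains a top-level folder named after the dataset.
    return os.path.join(dest_dir, name)


print(movielens_zip_url())
```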

In [4]:
path = data_path + "movielens"
tags_df = pd.read_csv(os.path.join(path, 'tags.csv'), encoding='utf-8')
ratings_df = pd.read_csv(os.path.join(path, 'ratings.csv'), index_col = 'userId', encoding='utf-8')
movies_df = pd.read_csv(os.path.join(path, 'movies.csv'), index_col = 'movieId', encoding='utf-8')

In [5]:
def get_simple_df_info(df):
  # Print the shape, column info, summary statistics, and first rows of a DataFrame.
  print("dataframe size: ", df.shape)
  print("\n")
  print("dataframe info")
  print(df.info())
  print("\n")
  print("dataframe summary statistics")
  print(df.describe())
  print("\n")
  print("a few sample rows of the dataframe")
  print(df.head())

In [6]:
get_simple_df_info(df=tags_df)

dataframe size:  (3683, 4)


dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB
None


dataframe summary statistics
            userId        movieId     timestamp
count  3683.000000    3683.000000  3.683000e+03
mean    431.149335   27252.013576  1.320032e+09
std     158.472553   43490.558803  1.721025e+08
min       2.000000       1.000000  1.137179e+09
25%     424.000000    1262.500000  1.137521e+09
50%     474.000000    4454.000000  1.269833e+09
75%     477.000000   39263.000000  1.498457e+09
max     610.000000  193565.000000  1.537099e+09


a few sample rows of the dataframe
   userId  movieId              tag   timestamp
0       2    60756            funny  144
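
The `timestamp` column in the summary above holds Unix epoch seconds, which are hard to read as raw integers. A minimal sketch of converting them with `pd.to_datetime` follows; the sample rows and the helper name `add_datetime_column` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows shaped like tags.csv; the values are illustrative only.
sample_tags = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [10, 20],
    "tag": ["funny", "classic"],
    "timestamp": [1_200_000_000, 1_500_000_000],
})


def add_datetime_column(df, source="timestamp", target="datetime"):
    """Return a copy of df with epoch seconds converted to UTC datetimes."""
    out = df.copy()
    out[target] = pd.to_datetime(out[source], unit="s", utc=True)
    return out


converted = add_datetime_column(sample_tags)
print(converted[["tag", "datetime"]])
```

The same call should apply to the `timestamp` column of `ratings_df` and, assuming it also stores epoch seconds, to the `time` column of the KMRD `rates.csv`.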

## KMRD
  - Korean Movie Recommender system Dataset
  - A Korean dataset in the MovieLens style, built from the Naver Movie rating site
  - https://github.com/lovit/kmrd

In [7]:
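path = data_path + "kmrd"

# Clone the repository on the first run; %cd into it would fail before it exists.
if not os.path.exists(path):
  %cd $data_path
  !git clone https://github.com/lovit/kmrd.git

# Install the kmr_dataset package from the repository root.
%cd $path
!python setup.py install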

/content/drive/My Drive/data/kmrd
running install
running bdist_egg
running egg_info
writing kmr_dataset.egg-info/PKG-INFO
writing dependency_links to kmr_dataset.egg-info/dependency_links.txt
writing requirements to kmr_dataset.egg-info/requires.txt
writing top-level names to kmr_dataset.egg-info/top_level.txt
reading manifest file 'kmr_dataset.egg-info/SOURCES.txt'
writing manifest file 'kmr_dataset.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/kmr_dataset
copying build/lib/kmr_dataset/install.py -> build/bdist.linux-x86_64/egg/kmr_dataset
copying build/lib/kmr_dataset/io.py -> build/bdist.linux-x86_64/egg/kmr_dataset
copying build/lib/kmr_dataset/__init__.py -> build/bdist.linux-x86_64/egg/kmr_dataset
creating build/bdist.linux-x86_64/egg/kmr_dataset/datafile
creating build/bdist.linux-x86_64/egg/kmr_dataset/datafile/kmrd-small
copying buil

- Dataset sizes
  - 'small', '2m', '5m'
  - delimiter = '\t'

- `2m` and `5m` are distributed as zip files, so load them with the code below.
- Make sure the current directory is the one that contains `kmr_dataset`.

In [8]:
from kmr_dataset import load_rates
from kmr_dataset import get_paths

paths = get_paths(size='2m')
rates, timestamps = load_rates(size='2m')

skip 44048 lines which are duplicated (user, item), #uniques=2570549
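
The log line above reports that duplicated (user, item) pairs are skipped while loading. Which copy `kmr_dataset` keeps is not stated here, so the sketch below only illustrates the idea on a plain DataFrame and assumes the most recent rating should win; `drop_duplicate_pairs` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

# Made-up ratings in which user 0 rated movie 10003 twice.
raw_rates = pd.DataFrame({
    "user": [0, 0, 1, 1],
    "movie": [10003, 10003, 10004, 10018],
    "rate": [7, 9, 10, 8],
    "time": [100, 200, 300, 400],
})


def drop_duplicate_pairs(df, keep="last"):
    """Keep one row per (user, movie) pair and report how many rows were dropped."""
    ordered = df.sort_values("time")
    deduped = ordered.drop_duplicates(subset=["user", "movie"], keep=keep)
    n_skipped = len(df) - len(deduped)
    return deduped.reset_index(drop=True), n_skipped


unique_rates, n_skipped = drop_duplicate_pairs(raw_rates)
print(f"skipped {n_skipped} duplicated rows, {len(unique_rates)} unique pairs remain")
```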


In [9]:
path = data_path + "kmrd/kmr_dataset/datafile/kmrd-small"
print(os.listdir(path))

['castings.csv', 'peoples.txt', 'countries.csv', 'rates.csv', 'movies.txt', 'genres.csv']


- Files and their contents

| File name | Columns | Separator |
|---|---|---|
| castings.csv | movie id, people id, credit order, leading (lead actor, 0 or 1) | comma(,) |
| countries.csv | movie id, country name | comma(,) |
| genres.csv | movie id, genre | comma(,) |
| movies.txt | movie id, Korean title, English title, release year, viewing grade | tab(\t) |
| peoples.txt | people id, Korean name, English name | tab(\t) |
| rates.csv | user id, movie id, rating (0 ~ 10), time | comma(,) |
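
Because the separator differs between the `.csv` and `.txt` files, one option is to keep the table above as a lookup and pick the separator by file name. The sketch below assumes nothing beyond the table; `KMRD_SEPARATORS` and `read_kmrd_table` are hypothetical names, and the in-memory buffers stand in for the real files.

```python
import io

import pandas as pd

# Separators copied from the table above.
KMRD_SEPARATORS = {
    "castings.csv": ",",
    "countries.csv": ",",
    "genres.csv": ",",
    "movies.txt": "\t",
    "peoples.txt": "\t",
    "rates.csv": ",",
}


def read_kmrd_table(source, file_name):
    """Read one KMRD file (a path or a buffer) with the separator listed for its name."""
    return pd.read_csv(source, sep=KMRD_SEPARATORS[file_name])


# Tiny stand-ins for movies.txt and genres.csv.
demo_movies = io.StringIO("movie\ttitle\n10001\t시네마 천국\n")
demo_genres = io.StringIO("movie,genre\n10001,드라마\n")

movies_demo_df = read_kmrd_table(demo_movies, "movies.txt")
genres_demo_df = read_kmrd_table(demo_genres, "genres.csv")
print(movies_demo_df)
print(genres_demo_df)
```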


In [10]:
castings_df = pd.read_csv(os.path.join(path, 'castings.csv'), encoding='utf-8')
get_simple_df_info(castings_df)

dataframe size:  (9776, 4)


dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9776 entries, 0 to 9775
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movie    9776 non-null   int64
 1   people   9776 non-null   int64
 2   order    9776 non-null   int64
 3   leading  9776 non-null   int64
dtypes: int64(4)
memory usage: 305.6 KB
None


dataframe summary statistics
              movie         people        order      leading
count   9776.000000    9776.000000  9776.000000  9776.000000
mean   10499.104746   36151.930851     9.799509     0.295315
std      287.023933   62989.430164    12.576221     0.456208
min    10001.000000       5.000000     1.000000     0.000000
25%    10260.000000    4327.000000     3.000000     0.000000
50%    10485.000000   14048.500000     6.000000     0.000000
75%    10754.250000   27978.000000    10.000000     1.000000
max    10999.000000  420466.000000   101.000000     1.000000


a few sample rows of the dataframe
 

In [11]:
countries_df = pd.read_csv(os.path.join(path, 'countries.csv'), encoding='utf-8')
get_simple_df_info(countries_df)

dataframe size:  (1109, 2)


dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1109 entries, 0 to 1108
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movie    1109 non-null   int64 
 1   country  1109 non-null   object
dtypes: int64(1), object(1)
memory usage: 17.5+ KB
None


dataframe summary statistics
              movie
count   1109.000000
mean   10496.257890
std      285.409915
min    10001.000000
25%    10253.000000
50%    10492.000000
75%    10746.000000
max    10999.000000


a few sample rows of the dataframe
   movie country
0  10001    이탈리아
1  10001     프랑스
2  10002      미국
3  10003      미국
4  10004      미국


In [12]:
genres_df = pd.read_csv(os.path.join(path, 'genres.csv'), encoding='utf-8')
get_simple_df_info(genres_df)

dataframe size:  (2025, 2)


dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2025 entries, 0 to 2024
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   movie   2025 non-null   int64 
 1   genre   2025 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.8+ KB
None


dataframe summary statistics
              movie
count   2025.000000
mean   10474.521975
std      289.972315
min    10001.000000
25%    10221.000000
50%    10474.000000
75%    10719.000000
max    10999.000000


a few sample rows of the dataframe
   movie   genre
0  10001     드라마
1  10001  멜로/로맨스
2  10002      SF
3  10002     코미디
4  10003      SF


In [13]:
movies_df = pd.read_csv(os.path.join(path, 'movies.txt'), sep='\t', encoding='utf-8')
get_simple_df_info(movies_df)

dataframe size:  (999, 5)


dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      999 non-null    int64  
 1   title      992 non-null    object 
 2   title_eng  991 non-null    object 
 3   year       609 non-null    float64
 4   grade      957 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 39.1+ KB
None


dataframe summary statistics
              movie         year
count    999.000000   609.000000
mean   10500.000000  1987.471264
std      288.530761    15.303710
min    10001.000000  1926.000000
25%    10250.500000  1982.000000
50%    10500.000000  1989.000000
75%    10749.500000  1991.000000
max    10999.000000  2020.000000


a few sample rows of the dataframe
   movie                 title  ...    year    grade
0  10001                시네마 천국  ...  2013.0   전체 관람가
1  10002              빽 투 더 퓨쳐  ...  2015.0  12세 관람가
2  10003  
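
The info output above shows that `year` is non-null for only 609 of the 999 movies, so 390 values are missing. A compact per-column report makes such gaps easier to spot than reading the `Non-Null Count` column. This is a minimal sketch; `missing_value_report` is a hypothetical helper and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def missing_value_report(df):
    """Return the count and ratio of missing values per column, largest first."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "ratio": n_missing / len(df),
    })
    return report.sort_values("n_missing", ascending=False)


# Made-up frame shaped like a slice of movies.txt.
demo_movies_df = pd.DataFrame({
    "movie": [10001, 10002, 10003, 10004],
    "title": ["a", "b", None, "d"],
    "year": [2013.0, np.nan, np.nan, 1990.0],
})

report = missing_value_report(demo_movies_df)
print(report)
```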

In [14]:
peoples_df = pd.read_csv(os.path.join(path, 'peoples.txt'), sep='\t', encoding='utf-8')
get_simple_df_info(peoples_df)

dataframe size:  (7172, 3)


dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7172 entries, 0 to 7171
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   people    7172 non-null   int64 
 1   korean    7172 non-null   object
 2   original  6305 non-null   object
dtypes: int64(1), object(2)
memory usage: 168.2+ KB
None


dataframe summary statistics
              people
count    7172.000000
mean    45828.791132
std     70461.756830
min         5.000000
25%      7157.000000
50%     15658.500000
75%     42337.000000
max    420466.000000


a few sample rows of the dataframe
   people    korean        original
0       5    아담 볼드윈    Adam Baldwin
1       8   애드리안 라인     Adrian Lyne
2       9     에이단 퀸     Aidan Quinn
3      13  구로사와 아키라  Akira Kurosawa
4      15     알 파치노       Al Pacino


In [15]:
rates_df = pd.read_csv(os.path.join(path, 'rates.csv'), encoding='utf-8')
get_simple_df_info(rates_df)

dataframe size:  (140710, 4)


dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140710 entries, 0 to 140709
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   user    140710 non-null  int64
 1   movie   140710 non-null  int64
 2   rate    140710 non-null  int64
 3   time    140710 non-null  int64
dtypes: int64(4)
memory usage: 4.3 MB
None


dataframe summary statistics
                user          movie           rate          time
count  140710.000000  140710.000000  140710.000000  1.407100e+05
mean    14948.679916   10278.818861       8.953258  1.297460e+09
std     14539.728057     292.806259       2.106047  1.374877e+08
min         0.000000   10001.000000       1.000000  1.069340e+09
25%      2980.000000   10048.000000       9.000000  1.180398e+09
50%      9292.000000   10148.000000      10.000000  1.271521e+09
75%     24129.000000   10489.000000      10.000000  1.409478e+09
max     52027.000000   10998.000000      10.00000
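
The statistics above show ratings concentrated at the top of the scale (median 10, mean about 8.95). For recommender work it is also common to check how densely the user-item matrix is filled. The sketch below computes that density on made-up rows; `rating_density` is a hypothetical helper, and it assumes each (user, movie) pair appears at most once.

```python
import pandas as pd


def rating_density(df, user_col="user", item_col="movie"):
    """Fraction of the user-item matrix that contains a rating."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    return len(df) / (n_users * n_items)


# Made-up ratings: 3 users, 3 movies, 4 observed ratings.
demo_rates_df = pd.DataFrame({
    "user": [0, 0, 1, 2],
    "movie": [10001, 10002, 10001, 10003],
    "rate": [9, 10, 8, 10],
})

density = rating_density(demo_rates_df)
print(f"density: {density:.3f}")
```

Calling `rating_density(rates_df)` on the real data would show how sparse the matrix is.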

# Checking the genres data

In [16]:
genres_df.head()

   movie   genre
0  10001     드라마
1  10001  멜로/로맨스
2  10002      SF
3  10002     코미디
4  10003      SF


In [18]:
groups = genres_df.groupby('movie')
groups.head()

      movie   genre
0     10001     드라마
1     10001  멜로/로맨스
2     10002      SF
3     10002     코미디
4     10003      SF
...     ...     ...
2020  10998      모험
2021  10998     스릴러
2022  10999      SF
2023  10999     드라마


In [24]:
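# groupby yields (movie id, sub-frame) pairs, so the key can be used directly.
# Note: some genre names (e.g. '멜로/로맨스') already contain '/', so the joined
# string cannot be split back into the original genres unambiguously.
genres = [(movie, '/'.join(group['genre'].values)) for movie, group in groups]
combined_genres_df = pd.DataFrame(data=genres, columns=['movie', 'genres'])
combined_genres_df = combined_genres_df.set_index('movie')
combined_genres_df.head()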

[(10001, '드라마/멜로/로맨스'), (10002, 'SF/코미디'), (10003, 'SF/코미디'), (10004, '서부/SF/판타지/코미디'), (10005, '판타지/모험/SF/액션'), (10006, '판타지/모험/SF/액션'), (10007, '판타지/SF/액션/모험'), (10008, 'SF/액션/모험/가족'), (10009, '판타지/SF/액션/모험'), (10010, '액션/코미디/판타지/SF/모험'), (10011, 'SF/액션/모험/가족'), (10012, '액션/스릴러/범죄'), (10013, '멜로/로맨스/판타지/SF/모험/가족/스릴러/공포'), (10014, '드라마/전쟁'), (10015, '드라마/액션/스릴러'), (10016, '코미디/가족/모험/범죄'), (10017, '드라마/서부'), (10018, '판타지/SF/모험/가족'), (10019, '드라마/서부/범죄'), (10020, '멜로/로맨스/드라마/전쟁'), (10021, '드라마/액션'), (10022, '드라마/액션'), (10023, '드라마/액션'), (10024, '드라마/액션'), (10025, '드라마/액션'), (10026, '액션/스릴러/범죄'), (10027, '액션/코미디/범죄'), (10028, 'SF/액션/스릴러'), (10029, '스릴러/공포'), (10030, '모험/스릴러/공포'), (10031, '스릴러/공포'), (10032, '스릴러/공포'), (10033, '액션/모험'), (10034, '판타지/액션/모험'), (10035, '판타지/액션/모험'), (10036, '액션'), (10037, 'SF/공포'), (10038, 'SF/액션/스릴러/공포'), (10039, '드라마/전쟁'), (10040, '코미디/스릴러/모험/가족'), (10041, '드라마/액션/모험'), (10042, '드라마/모험/전쟁'), (10043, '드라마/액션/스릴러/전쟁'), (10044, '액션/전쟁'), (10045, '드라마/전쟁'), (10

              genres
movie
10001     드라마/멜로/로맨스
10002         SF/코미디
10003         SF/코미디
10004  서부/SF/판타지/코미디
10005   판타지/모험/SF/액션
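
Joining genres with `/` is compact, but a genre such as `멜로/로맨스` already contains that character, so the combined string cannot be split back reliably. As an alternative sketch (not the approach used in this notebook), the genres can be collected into lists or turned into a multi-hot matrix with `pd.crosstab`. The demo frame reuses the first rows shown above.

```python
import pandas as pd

# First four rows of genres.csv as printed above.
demo_genres_df = pd.DataFrame({
    "movie": [10001, 10001, 10002, 10002],
    "genre": ["드라마", "멜로/로맨스", "SF", "코미디"],
})

# One list of genres per movie; each genre name stays intact.
genre_lists = demo_genres_df.groupby("movie")["genre"].agg(list)

# One row per movie and one 0/1 column per genre.
genre_matrix = pd.crosstab(demo_genres_df["movie"], demo_genres_df["genre"])

print(genre_lists)
print(genre_matrix)
```

The multi-hot form can be fed directly to content-based models, while the list form is convenient for display.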


In [25]:
movies_df = movies_df.set_index('movie')
movies_df.head()

                      title                           title_eng    year    grade
movie
10001                시네마 천국              Cinema Paradiso , 1988  2013.0   전체 관람가
10002              빽 투 더 퓨쳐           Back To The Future , 1985  2015.0  12세 관람가
10003            빽 투 더 퓨쳐 2    Back To The Future Part 2 , 1989  2015.0  12세 관람가
10004            빽 투 더 퓨쳐 3  Back To The Future Part III , 1990  1990.0   전체 관람가
10005  스타워즈 에피소드 4 - 새로운 희망                    Star Wars , 1977  1997.0       PG


# Checking the peoples and castings data

In [37]:
peoples_df.head()

   people    korean        original
0       5    아담 볼드윈    Adam Baldwin
1       8   애드리안 라인     Adrian Lyne
2       9     에이단 퀸     Aidan Quinn
3      13  구로사와 아키라  Akira Kurosawa
4      15     알 파치노       Al Pacino


In [38]:
castings_df.head()

   movie  people  order  leading
0  10001    4374      1        1
1  10001     178      2        1
2  10001    3241      3        1
3  10001   47952      4        1
4  10001   47953      5        0


In [42]:
# Collect the people ids of each movie; groupby yields (movie id, sub-frame) pairs.
castings = [(movie, group['people'].values) for movie, group in castings_df.groupby('movie')]
combined_castings_df = pd.DataFrame(data=castings, columns=['movie','people'])
combined_castings_df = combined_castings_df.set_index('movie')
combined_castings_df.head()

                                                  people
movie
10001  [4374, 178, 3241, 47952, 47953, 19538, 18991, ...
10002    [1076, 4603, 917, 8637, 5104, 9986, 7470, 9987]
10003  [1076, 4603, 917, 5104, 391, 5106, 5105, 5107,...
10004  [1076, 4603, 1031, 5104, 10001, 5984, 10002, 1...
10005                         [1007, 535, 215, 1236, 35]
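
The combined frame above stores raw people ids, which are hard to read. One way to make them readable is to merge the castings with the peoples table before grouping, as in the sketch below. The peoples rows come from the sample shown earlier, but the castings rows (which person appears in which movie) are made up for illustration.

```python
import pandas as pd

demo_peoples_df = pd.DataFrame({
    "people": [5, 8, 9],
    "korean": ["아담 볼드윈", "애드리안 라인", "에이단 퀸"],
    "original": ["Adam Baldwin", "Adrian Lyne", "Aidan Quinn"],
})

# Hypothetical castings: the movie-to-person pairing is not from the dataset.
demo_castings_df = pd.DataFrame({
    "movie": [10001, 10001, 10002],
    "people": [8, 5, 9],
    "order": [2, 1, 1],
    "leading": [0, 1, 1],
})

# Attach names to each casting row, then collect them per movie in credit order.
named_castings = demo_castings_df.merge(demo_peoples_df, on="people", how="left")
cast_names = (
    named_castings.sort_values(["movie", "order"])
    .groupby("movie")["korean"]
    .agg(list)
)
print(cast_names)
```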


In [43]:
movies_df = pd.concat([movies_df, combined_castings_df], axis=1)

In [45]:
movies_df.head()

                      title                           title_eng    year    grade                                             people
movie
10001                시네마 천국              Cinema Paradiso , 1988  2013.0   전체 관람가  [4374, 178, 3241, 47952, 47953, 19538, 18991, ...
10002              빽 투 더 퓨쳐           Back To The Future , 1985  2015.0  12세 관람가    [1076, 4603, 917, 8637, 5104, 9986, 7470, 9987]
10003            빽 투 더 퓨쳐 2    Back To The Future Part 2 , 1989  2015.0  12세 관람가  [1076, 4603, 917, 5104, 391, 5106, 5105, 5107,...
10004            빽 투 더 퓨쳐 3  Back To The Future Part III , 1990  1990.0   전체 관람가  [1076, 4603, 1031, 5104, 10001, 5984, 10002, 1...
10005  스타워즈 에피소드 4 - 새로운 희망                    Star Wars , 1977  1997.0       PG                         [1007, 535, 215, 1236, 35]
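
`pd.concat(..., axis=1)` aligns on the index and keeps the union of both indexes, so a movie id that exists only in the castings would show up as a new row with missing titles. If the movie list should stay fixed, a left join is an alternative. The sketch below contrasts the two on made-up frames and also attaches a combined genres column in the same way.

```python
import pandas as pd

# Made-up frames indexed by movie id, mirroring the shapes used above.
demo_movies = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index([10001, 10002, 10003], name="movie"),
)
demo_genres = pd.DataFrame(
    {"genres": ["드라마", "SF/코미디"]},
    index=pd.Index([10001, 10002], name="movie"),
)
demo_castings = pd.DataFrame(
    {"people": [[4374, 178], [1076]]},
    index=pd.Index([10001, 10004], name="movie"),
)

# Outer alignment: movie 10004 is added even though it has no title.
outer = pd.concat([demo_movies, demo_castings], axis=1)

# Left joins: only the movies already in demo_movies are kept.
left = demo_movies.join(demo_genres, how="left").join(demo_castings, how="left")

print(outer)
print(left)
```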
