# About the Data

- movies.dat

    Contains the items (i.e., movies) that were rated in the tweets, together with their genre metadata in the following format: movie_id::movie_title (movie_year)::genre|genre|genre. For example:

    0110912::Pulp Fiction (1994)::Crime|Thriller

    The file is UTF-8 encoded to deal with the many foreign movie titles contained in tweets.

- ratings.dat

    In this file the extracted ratings are stored in the following format: user_id::movie_id::rating::rating_timestamp. For example:

    14927::0110912::9::1375657563

    The ratings contained in the tweets are scaled from 0 to 10, as is the norm on the IMDb platform. To prevent information loss we have chosen to not down-scale this rating value, so all rating values of this dataset are contained in the interval [0,10].
    
- users.dat

    Maps each user id to the user's actual Twitter id in the following format: user_id::twitter_id. For example:

    1::177651718

    We provide the Twitter id and not the Twitter @handle (username) because the @handle can be changed, whereas the id always remains the same. A Twitter id can be converted to an @handle with an online tool such as [Tweeterid](http://tweeterid.com/) or through the Twitter API itself. The mapping provided here facilitates additional metadata enrichment.
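
As a minimal sketch of the three record formats described above, the sample lines can be split on the `::` delimiter. The helper names (`parse_movie`, `parse_rating`, `parse_user`) are hypothetical and not part of the dataset. The movie parser assumes the year is always the last parenthesized group of the title field, and the rating parser assumes `rating_timestamp` is a Unix timestamp in seconds, which the description does not state explicitly.

```python
from datetime import datetime, timezone


def parse_movie(line):
    """Split 'movie_id::movie_title (movie_year)::genre|genre' into a dict."""
    movie_id, title_year, genres = line.split("::")
    # Assumption: the year is the last " (....)" group, so titles that
    # themselves contain parentheses stay intact.
    title, _, year = title_year.rpartition(" (")
    return {
        "movie_id": movie_id,  # kept as text to preserve leading zeros
        "movie_title": title,
        "movie_year": int(year.rstrip(")")),
        "genres": genres.split("|") if genres else [],
    }


def parse_rating(line):
    """Split 'user_id::movie_id::rating::rating_timestamp' into a dict."""
    user_id, movie_id, rating, timestamp = line.split("::")
    rating = int(rating)
    if not 0 <= rating <= 10:
        raise ValueError(f"rating outside [0, 10]: {rating}")
    return {
        "user_id": user_id,
        "movie_id": movie_id,
        "rating": rating,
        # Assumption: the timestamp counts seconds since the Unix epoch.
        "rated_at": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }


def parse_user(line):
    """Split 'user_id::twitter_id' into a dict."""
    user_id, twitter_id = line.split("::")
    return {"user_id": user_id, "twitter_id": twitter_id}


movie = parse_movie("0110912::Pulp Fiction (1994)::Crime|Thriller")
rating = parse_rating("14927::0110912::9::1375657563")
user = parse_user("1::177651718")
```

Ids are left as strings here so that values such as `0110912` keep their leading zeros and still match across the movie and rating records.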
    

In [119]:
import pandas as pd

with open("movies.dat", "r", encoding="utf-8") as f:
    movies = f.read()

with open("ratings.dat", "r", encoding="utf-8") as f:
    ratings = f.read()

with open("users.dat", "r", encoding="utf-8") as f:
    users = f.read()

In [120]:
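# Parse "movie_id::movie_title (movie_year)::genres".
# rpartition splits on the last " (" so titles that contain parentheses stay intact.
movie_list = []
for line in movies.split("\n"):
    if not line:
        continue  # skip blank lines such as the one after the final newline
    movie_id, title_year, genres = line.split("::")
    title, _, year = title_year.rpartition(" (")
    movie_list.append([movie_id, title, year.rstrip(")"), genres])
movie_df = pd.DataFrame(movie_list)

# Ratings and users are plain "::"-delimited records.
rating_list = [line.split("::") for line in ratings.split("\n") if line]
rating_df = pd.DataFrame(rating_list)

user_list = [line.split("::") for line in users.split("\n") if line]
user_df = pd.DataFrame(user_list)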
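
An alternative to splitting the text by hand is to let pandas parse the `::` delimiter directly. The sketch below assumes `pandas.read_csv` with `sep="::"` and the Python parser engine, since multi-character separators are not handled by the default C engine. `load_dat` is a hypothetical helper, and `dtype=str` is chosen so that ids such as `0110912` keep their leading zeros.

```python
import io

import pandas as pd


def load_dat(source, columns):
    """Read a '::'-delimited file (path or file-like object) into a DataFrame."""
    return pd.read_csv(
        source,
        sep="::",          # interpreted as a regular expression by the Python engine
        engine="python",
        names=columns,     # the files have no header row
        dtype=str,         # keep every field as text
    )


# In-memory stand-in for ratings.dat, using the sample line from the description
sample = io.StringIO("14927::0110912::9::1375657563\n")
sample_df = load_dat(sample, ["user_id", "movie_id", "rating", "rating_timestamp"])
```

For the real files the call would take a path instead, for example `load_dat("ratings.dat", [...])`. The title and year in movies.dat would still need to be separated afterwards, because they share one `::`-delimited field.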

In [122]:
movie_df.columns = ["movie_id", "movie_title", "movie_year", "genres"]
rating_df.columns = ["user_id", "movie_id", "rating", "rating_timestamp"]
user_df.columns = ["user_id", "twitter_id"]

In [124]:
movie_df = movie_df.set_index("movie_id")
user_df = user_df.set_index("user_id")

In [126]:
movie_df.head()

movie_id,movie_title,movie_year,genres
8,Edison Kinetoscopic Record of a Sneeze,1894,Documentary|Short
10,La sortie des usines Lumière,1895,Documentary|Short
12,The Arrival of a Train,1896,Documentary|Short
25,The Oxford and Cambridge University Boat Race,1895,
91,Le manoir du diable,1896,Short|Horror


In [127]:
user_df.head()

user_id,twitter_id
1,397291295
2,40501255
3,417333257
4,138805259
5,2452094989


In [128]:
rating_df.head()

,user_id,movie_id,rating,rating_timestamp
0,1,111161,10,1373234211
1,1,117060,7,1373415231
2,1,120755,6,1373424360
3,1,317919,6,1373495763
4,1,454876,10,1373621125
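
Because every field is parsed from text, the columns of `rating_df` hold strings even though they print like numbers. The following is a hedged sketch of converting them to numeric and datetime types. It assumes `rating_timestamp` is a Unix timestamp in seconds (the description does not state the unit) and uses a small stand-in frame, `sample_ratings`, built from two of the rows shown above.

```python
import pandas as pd

# Stand-in for rating_df: string-typed columns, as produced by split("::")
sample_ratings = pd.DataFrame(
    [["1", "111161", "10", "1373234211"], ["1", "117060", "7", "1373415231"]],
    columns=["user_id", "movie_id", "rating", "rating_timestamp"],
)

typed_ratings = sample_ratings.assign(
    rating=sample_ratings["rating"].astype(int),
    # Assumption: timestamps are seconds since the Unix epoch
    rating_timestamp=pd.to_datetime(
        sample_ratings["rating_timestamp"].astype("int64"), unit="s"
    ),
)

# Sanity check against the documented [0, 10] rating interval
ratings_in_range = bool(typed_ratings["rating"].between(0, 10).all())
```

The ids are deliberately left as strings so they can still be matched against the `movie_id` and `user_id` indexes.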


In [105]:
print(f"Number of movies: {len(movie_df)}")
print(f"Number of users: {len(user_df)}")
print(f"Number of ratings: {len(rating_df)}")

Number of movies: 34438
Number of users: 60284
Number of ratings: 814506
