# Movies
In this exercise, we use the larger [MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/).

## Download the dataset

You need `unzip` installed on your server. Check that it is available by running `unzip` in the shell.
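
If you would rather check from the notebook itself, a minimal sketch using the standard library's `shutil.which` could look like this. The helper name `has_command` is hypothetical and only used for illustration.

```python
import shutil


def has_command(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if has_command("unzip"):
    print("unzip is available")
else:
    print("unzip not found; install it or extract the archive with Python's zipfile module")
```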

In [None]:
!mkdir -p /data/dataset/movielens

In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip -O /data/dataset/movielens/ml-25m.zip

In [None]:
!unzip /data/dataset/movielens/ml-25m.zip -d /data/dataset/movielens/
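
If `unzip` is not available, the archive can probably be extracted with Python's built-in `zipfile` module instead. The sketch below is one way to do it; the helper name `extract_zip` is hypothetical, and the commented-out call assumes the same paths as the cells above.

```python
import zipfile
from pathlib import Path


def extract_zip(archive: str, target_dir: str) -> list:
    """Extract every member of `archive` into `target_dir` and return the member names."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


# Assumed paths, matching the ones used in this exercise:
# extract_zip("/data/dataset/movielens/ml-25m.zip", "/data/dataset/movielens/")
```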

In [None]:
!ls /data/dataset/movielens/ml-25m/

## Sneak Peek at the Data

In [None]:
# peek at the first few ratings
!head -n 3 /data/dataset/movielens/ml-25m/ratings.csv

In [None]:
# peek at the first few movies
!head -n 3 /data/dataset/movielens/ml-25m/movies.csv
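
The same kind of peek can be done in pandas by limiting the number of parsed rows with `nrows`, which avoids loading the whole file. The sketch below uses made-up rows in the column layout of `ratings.csv` so that it runs without the dataset; with the real file, you would pass its path instead of the `StringIO` object.

```python
import io

import pandas as pd

# Made-up rows in the layout of ratings.csv, for illustration only.
sample_ratings = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,1000000000\n"
    "1,20,3.5,1000000100\n"
    "2,10,5.0,1000000200\n"
)

# nrows limits how many data rows are parsed.
peek = pd.read_csv(sample_ratings, nrows=2)
print(peek)
print(peek.dtypes)
```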

## Pre-Processing for Movies

In [None]:
# load csv file
import pandas as pd

df = pd.read_csv("/data/dataset/movielens/ml-25m/movies.csv")

In [None]:
# split genres into an array
df['genres'] = df['genres'].str.split(pat="|")

In [None]:
# use empty string instead of '(no genres listed)'
mask_no_genres = df['genres'].apply(lambda x: x == ['(no genres listed)'])
df.loc[mask_no_genres, 'genres'] = ''

In [None]:
# join the genre lists back into a "|"-separated string
df['genres'] = df['genres'].str.join("|")
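
To see what the split, mask, and join steps do, here is a small sketch on made-up rows (the titles and ids are invented for illustration). It also shows a single-step variant using `Series.replace`, which should give the same result when the placeholder is the only value that needs changing.

```python
import pandas as pd

# Made-up sample in the layout of movies.csv, for illustration only.
sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A (1995)", "Movie B (2001)", "Movie C (2010)"],
    "genres": ["Adventure|Comedy", "(no genres listed)", "Drama"],
})

# Same three steps as in the cells above: split, blank out the placeholder, join.
a = sample.copy()
a["genres"] = a["genres"].str.split(pat="|")
mask = a["genres"].apply(lambda x: x == ["(no genres listed)"])
a.loc[mask, "genres"] = ""
a["genres"] = a["genres"].str.join("|")

# Single-step alternative: replace the placeholder string directly.
b = sample.copy()
b["genres"] = b["genres"].replace("(no genres listed)", "")

print(a["genres"].tolist())
print(b["genres"].tolist())
```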

In [None]:
# extract the year of the movie
df['year'] = df['title'].str[-5:-1]

In [None]:
# and remove the year from the title
df['title'] = df['title'].str[:-7]
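
The fixed-position slices above assume that every title ends in exactly ` (YYYY)`. If some titles carry trailing whitespace or have no year at all (an assumption worth checking on your copy of the file), a regex-based variant is more forgiving. The following is a sketch on made-up titles, not a drop-in replacement for the cells above.

```python
import pandas as pd

# Made-up titles: one regular, one with trailing whitespace, one without a year.
titles = pd.Series(["Movie A (1995)", "Movie B (2001) ", "Movie C"])

# Capture four digits inside parentheses at the end of the title; NaN when there is no match.
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Remove the trailing "(YYYY)" only where it is present, then trim whitespace.
clean_title = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True).str.strip()

print(year.tolist())
print(clean_title.tolist())
```

Rows without a year end up with a missing value in `year` rather than four arbitrary characters cut from the title.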

In [None]:
# re-arrange the columns
df = df[["movieId", "title", "year", "genres"]]

In [None]:
# save the results
df.to_csv("/data/dataset/movielens/ml-25m/movies_cleaned.csv", index=False, sep='\t')
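
Since the file is written with a tab separator, it has to be read back with `sep='\t'` as well. The sketch below round-trips a few made-up rows through a temporary file to illustrate this. Passing `dtype=str` and `keep_default_na=False` is one possible choice to keep the year as text and empty genres as empty strings; other settings may suit your use case better.

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned rows, for illustration only.
cleaned = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Movie A", "Movie B"],
    "year": ["1995", "2001"],
    "genres": ["Adventure|Comedy", ""],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies_cleaned.csv")
    cleaned.to_csv(path, index=False, sep="\t")
    # dtype=str reads every column as text; keep_default_na=False keeps '' instead of NaN.
    restored = pd.read_csv(path, sep="\t", dtype=str, keep_default_na=False)

print(restored)
```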