# Movies
In this exercise, we will analyse good and bad movies based on the [MovieLens dataset](https://grouplens.org/datasets/movielens/).

## Download the dataset

You need `unzip` on your computer. On Debian or Ubuntu, install it with:

```bash
sudo apt install unzip
```

In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip

In [None]:
!unzip ml-latest-small.zip

## Inspecting the data we need

In [None]:
!head ml-latest-small/movies.csv

In [None]:
!wc -l ml-latest-small/movies.csv

In [None]:
!head ml-latest-small/ratings.csv

In [None]:
!wc -l ml-latest-small/ratings.csv

## Task 1 - Compute the Average for Each Movie

Given the file `ml-latest-small/ratings.csv`, we want to output the `movieId` and the average rating of each movie that has at least 5 ratings.

Implement this as a MapReduce algorithm.
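Before writing the job itself, it can help to sketch the map and reduce steps in plain Python. The following is a minimal sketch, not the reference solution: the names `map_rating`, `reduce_ratings`, and `run_job` are hypothetical, the shuffle is simulated in memory, and the column order `userId,movieId,rating,timestamp` is an assumption you should check against the `head` output above.

```python
from collections import defaultdict

MIN_RATINGS = 5  # threshold from the task description


def map_rating(line):
    """Map one CSV line to (movieId, rating); skip the header row."""
    # Assumed column order: userId,movieId,rating,timestamp
    _user_id, movie_id, rating, _timestamp = line.strip().split(",")
    if movie_id == "movieId":  # header line
        return
    yield movie_id, float(rating)


def reduce_ratings(movie_id, ratings):
    """Emit the average rating if the movie has enough ratings."""
    ratings = list(ratings)
    if len(ratings) >= MIN_RATINGS:
        yield movie_id, sum(ratings) / len(ratings)


def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_rating(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for movie_id, average in reduce_ratings(key, groups[key]):
            results[movie_id] = average
    return results
```

In a real job the framework performs the grouping between the two phases; `run_job` only stands in for that step so the logic can be tested on a handful of lines.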

### Implementation

In [None]:
%%writefile average.py



### Run Task 1

Save the results to `averages.txt`. The result should look like:

```text
3113	2.7
3114	3.8608247422680413
3115	4.0
```

In [None]:
!python average.py ml-latest-small/ratings.csv > averages.txt

## Task 2 - Preprocessing of Movies

We only need the `movieId` and the `title` from `ml-latest-small/movies.csv`.
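Titles can contain commas (for example `Walk, Don't Run (1966)`), so splitting a line on `,` is not safe. Below is a minimal sketch of one way to extract the two fields with Python's `csv` module. The helper names `parse_movie_line` and `format_record` are hypothetical, the column order `movieId,title,genres` is an assumption to check against the `head` output above, and JSON-encoding the title is only one way to produce quoted output.

```python
import csv
import json


def parse_movie_line(line):
    """Return (movieId, title) for one line of movies.csv, or None for the header."""
    # csv.reader handles quoted fields, so commas inside a title are preserved.
    # Assumed column order: movieId,title,genres
    movie_id, title, _genres = next(csv.reader([line]))
    if movie_id == "movieId":  # header line
        return None
    return movie_id, title


def format_record(movie_id, title):
    """Format one record as a tab-separated line with a JSON-encoded title."""
    return f"{movie_id}\t{json.dumps(title)}"


# Hypothetical sample line; the genre value is a placeholder.
sample = '6407,"Walk, Don\'t Run (1966)",Comedy'
print(format_record(*parse_movie_line(sample)))
```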

### Implementation

In [None]:
%%writefile movie_cleaning.py



### Run Task 2

Again, let us save the results to `movies.txt`.

The result should look like:

```text
6400	"Murder on a Sunday Morning (Un coupable id\u00e9al) (2001)"
6402	"Siam Sunset (1999)"
6405	"Treasure Island (1950)"
6407	"Walk, Don't Run (1966)"
6408	"Animals are Beautiful People (1974)"
```

Do not worry about the special characters like `\u00e9`.
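Sequences such as `\u00e9` look like JSON string escapes, which is what Python's `json` module writes by default for non-ASCII characters. Assuming the titles were written that way, the short sketch below shows that decoding restores the original character, so no information is lost.

```python
import json

# One title from the expected output, as it would appear in the file.
encoded = '"Murder on a Sunday Morning (Un coupable id\\u00e9al) (2001)"'

# json.loads turns the escape sequence back into the character it stands for.
title = json.loads(encoded)
print(title)

# json.dumps escapes non-ASCII characters again by default (ensure_ascii=True).
round_trip = json.dumps(title)
```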

In [None]:
!python movie_cleaning.py ml-latest-small/movies.csv > movies.txt

## Task 3 - Joining `movies.txt` and `averages.txt`

Implement another MapReduce algorithm that reads `movies.txt` and `averages.txt` and joins them on `movieId`.

Try to sort the output (best movies first).

You can find out which input file the current record comes from by reading an environment variable:

```python
import os
file_name = os.environ['mapreduce_map_input_file']
```
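Knowing the input file lets the mapper tag each record, so the reducer can tell titles from ratings. The following is a minimal plain-Python sketch of such a reduce-side join, not the reference solution: the names `map_record`, `reduce_join`, and `join` are hypothetical, the file name is passed in as an argument instead of being read from the environment, and the final in-memory sort only stands in for whatever sorting the real job uses.

```python
from collections import defaultdict


def map_record(file_name, line):
    """Tag each record with its source so the reducer can tell them apart."""
    movie_id, value = line.rstrip("\n").split("\t", 1)
    # Assumption: the file name alone identifies the kind of record.
    tag = "title" if file_name.endswith("movies.txt") else "rating"
    yield movie_id, (tag, value)


def reduce_join(movie_id, tagged_values):
    """Emit (rating, title) only when both sides of the join are present."""
    record = dict(tagged_values)
    if "title" in record and "rating" in record:
        yield float(record["rating"]), record["title"]


def join(inputs):
    """Join (file_name, line) pairs on movieId; return rows, best rating first."""
    groups = defaultdict(list)
    for file_name, line in inputs:
        for key, value in map_record(file_name, line):
            groups[key].append(value)
    rows = [row for key, values in groups.items() for row in reduce_join(key, values)]
    # Plain in-memory sort, standing in for the framework's sorting.
    return sorted(rows, reverse=True)
```

Movies without an average (fewer than 5 ratings) have no `rating` record and are dropped by `reduce_join`, which matches an inner join.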

In [None]:
%%writefile join.py



### Testing locally (no sort)

As we have learned, MRJob does not do the sorting in local mode. However, it is still useful to test if the algorithm is working in general.

The result should look like:

```text
3.43	"Star Wars: Episode III - Revenge of the Sith (2005)"
3.43	"Red Dawn (1984)"
3.43	"Mystery Train (1989)"
3.43	"Batman (1989)"
```

Note, I've rounded the ratings here to two decimal places.

In [None]:
!python join.py ./*.txt

### Testing with Hadoop (Sorted)

Let us run the same command with Hadoop, so that sorting works. We will save the results in `movies_rated.txt`.

In [None]:
!python join.py -r hadoop ./*.txt > movies_rated.txt

## Inspect the Results

Have a look at the results. Do you agree?