**Initial Setup**

1. First, set up your Colab environment by running the cell below.

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Download both anime.csv and rating.csv and store them in your Google Drive. It is advisable to create a separate project folder where you can keep this dataset and your code.

The script below lists the title and ID of every file in the Drive folder named in the query, which should include the two files.

In [None]:
file_list = drive.ListFile({'q': "'1Oi8cMnAfJVZH9-FyXGxwOrGGCIkkB7uy' in parents"}).GetList()
for f in file_list:
  print('title: %s, id: %s' % (f['title'], f['id']))

After you run the cell below, the two dataset files we need for this Colab should appear under the "Files" tab in the left panel.

In [None]:
# Change the IDs if they differ from the ones below.
# (Named variables avoid shadowing Python's built-in id().)
anime_file_id = '1TppJoj4QVJlc_HML20xmH847Brrw0Zfc'
downloaded = drive.CreateFile({'id': anime_file_id})
downloaded.GetContentFile('anime.csv')

rating_file_id = '1f76dQZxRB1fNaReBv_DnUDVkIXNm7mw9'
downloaded = drive.CreateFile({'id': rating_file_id})
downloaded.GetContentFile('rating.csv')

Here is a list of packages that might be useful to you. 

**Student Activity: Add the packages you need to carry out your analysis here** 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Student Activity: Add your packages here.


**This step initializes the Spark context.**

In [None]:
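import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

# configure the port used by the Spark UI
conf = SparkConf().set("spark.ui.port", "4050")

# create the context, then the session that runs on top of it
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()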

Displaying the spark object in a cell shows the current Spark version and a link to the web interface. In the Spark UI, you can monitor the progress of your job and debug performance bottlenecks (if your Colab is running with a local runtime).

In [None]:
spark

## **From this point onwards, you are expected to write the code yourself. Complete each step below in its designated place.**

**1. Student Activity: Read the datasets here. You must write the script for the first question and explore both files here.**

Q1. Identify and describe the number of columns in the two dataset files.
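One way to approach Q1 is sketched below with pandas on an in-memory stand-in for a CSV file. The header and values are hypothetical and are not taken from anime.csv or rating.csv; replace the StringIO object with the real file path.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; this header is hypothetical.
csv_text = "anime_id,name,episodes\n1,A,12\n2,B,24\n"
df = pd.read_csv(io.StringIO(csv_text))

# shape is (rows, columns); columns holds the header names.
n_rows, n_columns = df.shape
column_names = list(df.columns)

print(f"{n_columns} columns, {n_rows} rows: {column_names}")
print(df.dtypes)  # one inferred type per column
```

On a Spark DataFrame, the analogous information should be available through its columns attribute and printSchema() method.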

**2. Student Activity: Preprocess the datasets here. You must write the script for the second question here. Make sure to check that the script runs correctly.**

Q2. Merge/join/combine the two datasets and identify the common key column on which you performed the join.
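The idea behind Q2 can be sketched on toy data as follows. The column names (anime_id, user_id, name, rating) are assumptions made for illustration, so check them against the real files. The toy frames deliberately share two column names to show why the join key is worth naming explicitly instead of joining on every shared column.

```python
import pandas as pd

# Toy stand-ins for the two files; all column names are assumptions.
anime = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "rating": [7.9, 8.6, 6.4],   # average rating per title
})
ratings = pd.DataFrame({
    "user_id": [10, 10, 11, 12],
    "anime_id": [1, 2, 2, 4],
    "rating": [8, 9, 7, 6],      # rating given by one user
})

# Columns present in both frames are the candidate join keys.
common_columns = sorted(set(anime.columns) & set(ratings.columns))

# Join on the identifier only; suffixes keep the two rating columns apart.
merged = ratings.merge(anime, on="anime_id", how="inner",
                       suffixes=("_user", "_avg"))
print(common_columns)
print(merged)
```

An inner join drops ratings whose key has no match in the other table (anime_id 4 in this toy data), so comparing row counts before and after the join is a quick sanity check.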

**3. Student Activity: Now do some exploratory analysis. You must write the script for the third and fourth questions here. Make sure to check that the script runs correctly.**

Q3. Find the top 10 anime based on rating. Use tabular/graphical presentation to provide evidence of your analysis.

Q4. Find the top 10 anime with the most episodes. Use tabular/graphical presentation to provide evidence of your analysis.
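A top-N query of the kind Q3 and Q4 ask for can be sketched as below. The data is a toy example, top_n is a hypothetical helper, and the column names are assumptions. The sketch also assumes that some values may be missing or non-numeric (for example a text placeholder in the episodes column), which would otherwise break the ranking.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
anime = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "rating": [7.5, 9.1, 8.3, None, 6.0],
    "episodes": ["12", "Unknown", "500", "24", "1"],
})

def top_n(df, column, n=10):
    """Return the n rows with the largest numeric value in `column`."""
    # Non-numeric entries become NaN and are dropped before ranking.
    numeric = pd.to_numeric(df[column], errors="coerce")
    cleaned = df.assign(**{column: numeric}).dropna(subset=[column])
    return cleaned.nlargest(n, column)

top_rated = top_n(anime, "rating", n=3)
most_episodes = top_n(anime, "episodes", n=3)
print(top_rated[["name", "rating"]])
print(most_episodes[["name", "episodes"]])
```

For the graphical presentation, a horizontal bar chart of the resulting table (for instance via the DataFrame's plot.barh method with matplotlib) is one reasonable choice.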

**4. Student Activity: Design the recommendation system. Remember to split the dataset into training and testing sets to validate your recommendation model. This section will help you answer question 5.**

Q5. Design a recommendation system based on collaborative filtering.
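The training/testing split mentioned above, together with an error metric for validation, can be sketched in plain Python as follows. The rating triples are made up, the 80/20 ratio and the seed are arbitrary choices, and the mean-rating baseline only stands in for a real model's predictions.

```python
import math
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical (user_id, anime_id, rating) triples.
ratings = [(u, a, (u + a) % 10 + 1) for u in range(1, 6) for a in range(1, 5)]
train, test = train_test_split(ratings)

# Baseline: predict the training-set mean rating for every test row.
mean_rating = sum(r for _, _, r in train) / len(train)
baseline_rmse = rmse([r for _, _, r in test], [mean_rating] * len(test))
print(len(train), len(test), round(baseline_rmse, 3))
```

A trained model is only useful if its test error beats a trivial baseline like this one. In PySpark, the same two steps are usually done with the DataFrame randomSplit method and a RegressionEvaluator configured for RMSE.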

**Student Activity: Analyse the output of the test dataset here.**

Q6. Give examples of the three best anime recommendations for at least 10 users.
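To make the idea of collaborative filtering and per-user top-3 recommendations concrete, here is a minimal user-based sketch in plain Python. It is an illustration on made-up ratings, not the required solution: the user and item IDs are hypothetical, and cosine and recommend are helper names introduced here. In Spark, collaborative filtering is more commonly done with the ALS estimator in pyspark.ml.recommendation.

```python
import math
from collections import defaultdict

# Hypothetical ratings: user_id -> {anime_id: rating}.
ratings = {
    "u1": {"a": 9, "b": 8, "c": 2},
    "u2": {"a": 8, "b": 9, "d": 7},
    "u3": {"c": 9, "d": 3, "e": 8},
    "u4": {"a": 9, "e": 2},
}

def cosine(r1, r2):
    """Cosine similarity of two rating vectors (unrated items count as 0)."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in shared)
    norm1 = math.sqrt(sum(v * v for v in r1.values()))
    norm2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (norm1 * norm2)

def recommend(user, ratings, k=3):
    """Return up to k unseen items ranked by similarity-weighted rating."""
    weighted = defaultdict(float)
    weights = defaultdict(float)
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                weighted[item] += sim * rating
                weights[item] += sim
    scores = {item: weighted[item] / weights[item] for item in weighted}
    # Highest score first; ties are broken by item ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]

recommendations = {user: recommend(user, ratings) for user in ratings}
for user, items in recommendations.items():
    print(user, items)
```

Each user only receives titles they have not rated yet, and the score of a title is the average of other users' ratings weighted by how similar those users are. With real data, a user may receive fewer than three recommendations if the similar users have rated few unseen titles.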