# Six Degrees of Kevin Bacon
**Introduction** - Six Degrees of Kevin Bacon is a game based on the "six degrees of separation"
concept, which posits that any two people on Earth are six or fewer acquaintance links apart. Movie
buffs challenge each other to find the shortest path between an arbitrary actor and prolific actor
Kevin Bacon. It rests on the assumption that anyone involved in the film industry can be linked
through their film roles to Bacon within six steps.
The analysis of social networks can be a computationally intensive task, especially when dealing with
large volumes of data. It is also a challenging problem to devise a correct methodology to infer an
informative social network structure. Here, we will analyze a social network of actors and actresses
who appeared together in movies. We will do some simple descriptive analysis and, in the end, try to
relate an actor's or actress's position in the social network to the success of the movies in which they
participate.

#### Rules & Notes - Please take your time to read the following points:

1. The submission deadline will be set for the 5th of June at 23:59.
2. It is acceptable that you **discuss** with your colleagues different approaches to solve each step of the problem set, but the assignment is individual. That is, you are responsible for writing your own code and analysing the results. Clear cases of cheating will be penalized with 0 points in this assignment.
3. After review of your submission files, and before a mark is attributed, you might be called to orally defend your submission.
4. You will be scored first and foremost by the number of correct answers, and secondly by the logic used to approach each step of the problem set.
5. Consider skipping questions that you are stuck on, and get back to them later.
6. Expect computations to take a few minutes to finish in some of the steps.
7. **IMPORTANT** It is expected you have developed skills beyond writing SQL queries. Any question where you directly write a SQL query (for example, create a temporary table and then use spark.sql to pass the query) will receive a 25% penalty. Using the spark syntax (for example dataframe.select("\*").where("conditions")) is acceptable and does not incur this penalty.
8. **Questions** – Any questions about this assignment should be posted in the Forum@Moodle. The last class will be an open office session for anyone with questions concerning the assignment. 
9. **Delivery** - To fulfil this activity you will have to upload the following materials to Moodle:
    1. An exported IPython notebook. The notebook should be solved (have results displayed), but should contain all necessary code so that when the notebook is run in databricks it replicates these results. This means that all data downloading and processing should be done in this notebook. It is also important that you clearly indicate where your final answer to each question is when you are using multiple cells (for example, print "my final answer is" before your answer or use cell comments).
    2. A signed statement of authorship, which is on the last page.
    3. It is recommended you read the whole assignment before starting.
    4. You can add as many cells as you like to answer the questions.
    5. You can make use of caching or persisting your RDDs or Dataframes; this may speed up performance.
    6. If you have trouble with graphframes in databricks (specifically the import statement), make sure the graphframes package is installed on the cluster you are running. Click home on the left, then click on the graphframes library which you loaded in Lab 9; from there you can install the package on your cluster (check the graphframes checkbox and click install).

#### Data Sources and Description
We will use data from IMDB. You can download the raw datafiles
from https://datasets.imdbws.com. Note that the files are tab delimited (.tsv). You can find a
description of each datafile at https://www.imdb.com/interfaces/

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
import os

## Questions
### Data loading and preparation
Review the file descriptions and load the necessary data onto your databricks cluster and into spark dataframes. You will need to use shell commands to download the data, unzip the data, and load the data into spark. Note that the data might require parsing and preprocessing to be ready for the questions below.

**Hints** You can use 'gunzip' to unzip the .gz files. The data files will then be tab separated (.tsv), which you can load into a dataframe using the tab separated option instead of the comma separated option we have typically used in class: `.option("sep","\t")`
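
To make the hint concrete, here is a minimal plain-Python sketch (not Spark) of what reading a tab-separated file involves. It assumes a made-up two-row sample in the layout of `name.basics.tsv`, and it treats the `\N` marker that appears in these files as a missing value. The helper name `read_tsv` and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample in the layout of name.basics.tsv (made-up values).
sample = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\n"
    "nm9999991\tExample Actor\t1950\t\\N\n"
    "nm9999992\tAnother Actor\t\\N\t\\N\n"
)

def read_tsv(text):
    """Parse tab-separated text into dicts, mapping the \\N marker to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {key: (None if value == "\\N" else value) for key, value in row.items()}
        for row in reader
    ]

rows = read_tsv(sample)
print(rows[0]["birthYear"], rows[1]["birthYear"])
```

In Spark, the CSV reader's `sep` option plays the role of `delimiter` here. Every value arrives as text, so columns that contain `\N` need explicit handling before any numeric comparison.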

In [0]:
##########################
# Execute this only once #
##########################
# Downloading data and unzipping

In [0]:
%sh
wget https://datasets.imdbws.com/name.basics.tsv.gz
wget https://datasets.imdbws.com/title.ratings.tsv.gz 
wget https://datasets.imdbws.com/title.principals.tsv.gz 
wget https://datasets.imdbws.com/title.basics.tsv.gz 
gunzip name.basics.tsv.gz title.ratings.tsv.gz title.principals.tsv.gz title.basics.tsv.gz

In [0]:
##########################
# Execute this only once #
##########################
# Copying data to /dbfs/tmp

In [0]:
%sh
cp name.basics.tsv /dbfs/tmp
cp title.ratings.tsv /dbfs/tmp
cp title.principals.tsv /dbfs/tmp
cp title.basics.tsv /dbfs/tmp

In [0]:
# Reading the data into DataFrames
names_of_actors=spark.read.options(sep='\t',header=True,inferSchema=True).csv('dbfs:/tmp/name.basics.tsv')
ratings=spark.read.options(sep='\t',header=True,inferSchema=True).csv('dbfs:/tmp/title.ratings.tsv')
titles_with_actors=spark.read.options(sep='\t',header=True,inferSchema=True).csv('dbfs:/tmp/title.principals.tsv')
names_of_title=spark.read.options(sep='\t',header=True,inferSchema=True).csv('dbfs:/tmp/title.basics.tsv')

### Network Inference, Let’s build a network
In the following questions you will look to summarise the data and build a network. We want to examine a network that abstracts how actors and actresses are related through their co-participation in movies. To that end, perform the following steps:

**Q1** Create a DataFrame that combines **all the information** on each of the titles (i.e., movies, tv-shows, etc …) and **all of the information** on the participants in those movies (i.e., actors, directors, etc … ). Make sure the actual names of the movies and participants are included. It may be worth reviewing the following questions to see how this dataframe will be used.

How many rows does your dataframe have?

In [0]:
#DataFrame that combines all the information on each of the titles (i.e., movies, tv-shows, etc …) and all of the information the participants in those movies (i.e., actors, directors, etc … )
titles_actors_names=names_of_title.join(titles_with_actors,on='tconst').join(names_of_actors,on='nconst')
print(titles_actors_names.count())

**Q2** Create a new DataFrame based on the previous step, with the following removed:
1. Any participant that is not an actor or actress (as measured by the category column);
1. All adult movies;
1. All dead actors or actresses;
1. All actors or actresses born before 1920 or with no date of birth listed;
1. All titles that are not of the type movie.

How many rows does your dataframe have?

In [0]:
filtered_titles_actors_names=titles_actors_names.filter(
  (titles_actors_names.category=='actor')|(titles_actors_names.category=='actress')).filter( # selecting only actors and actresses
  titles_actors_names.isAdult=='0').filter(                                                  # removing adult movies
  titles_actors_names.deathYear=='\\N').filter(                                              # removing dead actors and actresses
  (titles_actors_names.birthYear!='\\N')&(titles_actors_names.birthYear.cast('int')>=1920)).filter( # removing actors and actresses born before 1920 / with no dob listed (both conditions must hold; the year is compared as a number)
  titles_actors_names.titleType=='movie')                                                    # selecting only movies
print(filtered_titles_actors_names.count())
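
The five Q2 rules must all hold for a row to be kept, so they combine with a logical AND. The plain-Python sketch below spells that out as a single predicate. It assumes rows are dictionaries of strings with `\N` as the missing marker; the function name `keep_row` and the sample rows are hypothetical.

```python
def keep_row(row):
    """Return True if a joined title/participant row survives all five Q2 rules."""
    birth = row["birthYear"]
    return (
        row["category"] in ("actor", "actress")    # rule 1: actors and actresses only
        and row["isAdult"] == "0"                  # rule 2: no adult movies
        and row["deathYear"] == "\\N"              # rule 3: no death year listed
        and birth != "\\N" and int(birth) >= 1920  # rule 4: known birth year, not before 1920
        and row["titleType"] == "movie"            # rule 5: movies only
    )

base = {"category": "actor", "isAdult": "0", "deathYear": "\\N",
        "birthYear": "1958", "titleType": "movie"}
rows = [
    base,
    {**base, "birthYear": "\\N"},      # no date of birth listed
    {**base, "birthYear": "1915"},     # born before 1920
    {**base, "category": "director"},  # not an actor or actress
    {**base, "deathYear": "2001"},     # no longer alive
]
kept = [keep_row(r) for r in rows]
print(kept)  # [True, False, False, False, False]
```

Note that the birth year is converted to an integer before comparing; comparing the raw strings would order them alphabetically rather than numerically.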

**Q3** Convert the above Dataframe to an RDD. Use map and reduce to create a paired RDD which counts how many movies each actor / actress appears in.

Display names of the top 10 actors/actresses according to the number of movies in which they appeared. Be careful to deal with different actors / actresses with the same name, these could be different people.

In [0]:
filtered_titles_actors_names.createOrReplaceTempView("filtered_titles_actors_names")
count_movies_by_nconst=spark.sql("""
SELECT nconst,COUNT(nconst) as count_nconst 
FROM filtered_titles_actors_names 
GROUP BY nconst""")                                                                           # DataFrame of actor id and the number of movies the actor is in
nconst_to_names=filtered_titles_actors_names.select('nconst','primaryName').distinct()        # DataFrame of distinct actor id and name pairs
names_to_count_movies=nconst_to_names.rdd.join(count_movies_by_nconst.rdd).values()           # Converting both to paired RDDs keyed by actor id and joining to get (name, count of movies)
names_to_count_movies.sortBy(lambda x: x[1],ascending=False).take(10)                         # Returning the top 10 in descending order of number of movies
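
The map and reduce pattern that Q3 asks for can be sketched in plain Python as follows. This is an illustration of the idea only, not Spark code: `functools.reduce` stands in for what `reduceByKey` does on a paired RDD, and the records are hypothetical. The key is the actor id together with the name, so two different people who share a name are counted separately.

```python
from functools import reduce

# Hypothetical (actor id, name, movie id) records; two different people share a name.
records = [
    ("nm1", "Alex Doe", "tt1"),
    ("nm1", "Alex Doe", "tt2"),
    ("nm2", "Alex Doe", "tt3"),
    ("nm3", "Sam Roe", "tt1"),
]

# map step: emit one ((id, name), 1) pair per appearance
pairs = map(lambda r: ((r[0], r[1]), 1), records)

def add_pair(counts, pair):
    """Fold one (key, value) pair into the running per-key totals."""
    key, value = pair
    counts[key] = counts.get(key, 0) + value
    return counts

# reduce step: sum the values that share a key
counts = reduce(add_pair, pairs, {})
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```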

**Q4** Start with the dataframe from Q2. Generate a DataFrame that lists all links of your network. Here we shall consider that a link connects a pair of actors/actresses if they participated in at least one movie together (actors / actresses should be represented by their unique IDs). Every pair of actors who appeared together in a movie should produce a link in each direction (A -> B and B -> A). However, links should be distinct: we do not need duplicates when two actors worked together in several movies.

In [0]:
links_between_actors=spark.sql("""
SELECT DISTINCT f1.nconst as src,f2.nconst as dst
FROM filtered_titles_actors_names as f1,filtered_titles_actors_names as f2
WHERE f1.nconst!=f2.nconst AND f1.tconst=f2.tconst
""") # one directed link per ordered pair of distinct actors who share a movie
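
As a small illustration of the link rules in Q4 (both directions, no self-links, no duplicates), the following plain-Python sketch builds the link set from a hypothetical mapping of movie id to cast ids. The name `cast_by_movie` and its contents are made up for the example.

```python
from itertools import permutations

# Hypothetical cast lists keyed by movie id.
cast_by_movie = {
    "tt1": ["nmA", "nmB", "nmC"],
    "tt2": ["nmA", "nmB"],  # nmA and nmB worked together twice
}

links = set()  # a set keeps each (src, dst) pair only once
for cast in cast_by_movie.values():
    # permutations of size 2 yields both (a, b) and (b, a) and never (a, a)
    links.update(permutations(set(cast), 2))

print(sorted(links))
```

The three-person cast contributes six directed links; the second movie adds nothing new because its two links already exist.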

**Q5** Compute the page rank of each actor. This can be done using GraphFrames or
by using RDDs and the iterative implementation of the PageRank algorithm. Do not take
more than 5 iterations and use reset probability = 0.1.

List the top 10 actors / actresses by pagerank.

`https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.0-s_2.12/graphframes-0.8.2-spark3.0-s_2.12.jar`

Upload the above package into the cluster libraries and install before running this cell

In [0]:
from graphframes import *
graph = GraphFrame(nconst_to_names.selectExpr("nconst as id","primaryName as name"), links_between_actors) # creating graph using GraphFrame. Reference: https://graphframes.github.io/graphframes/docs/_site/user-guide.html
results = graph.pageRank(resetProbability=0.1, maxIter=5)  # pageRank reference: https://graphframes.github.io/graphframes/docs/_site/user-guide.html#pagerank 
results.vertices.sort('pagerank',ascending=False).take(10) # Sorting the pagerank results and returning top 10
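
For readers who want to see what the iterative alternative mentioned in Q5 looks like, here is a minimal plain-Python sketch of the textbook iteration with a reset probability of 0.1 and 5 iterations. It is an assumption-laden illustration: the function name `pagerank` and the toy star graph are hypothetical, every node is assumed to have at least one outgoing link, and the values are not claimed to match what GraphFrames returns.

```python
def pagerank(edges, reset_probability=0.1, max_iter=5):
    """Textbook iterative PageRank over a list of directed (src, dst) edges."""
    nodes = {node for edge in edges for node in edge}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {node: 1.0 for node in nodes}  # every node starts with rank 1
    for _ in range(max_iter):
        incoming = {node: 0.0 for node in nodes}
        for src, dsts in out_links.items():
            for dst in dsts:
                # each node splits its current rank evenly over its out-links
                incoming[dst] += ranks[src] / len(dsts)
        ranks = {
            node: reset_probability + (1 - reset_probability) * incoming[node]
            for node in nodes
        }
    return ranks

# Toy star graph: A has worked with B, C and D (links in both directions).
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A"), ("A", "D"), ("D", "A")]
ranks = pagerank(edges)
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))
```

The hub `A` ends up with the highest rank, which matches the intuition that well-connected actors score higher.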

**Q6**: Create an RDD with the number of outDegrees for each actor. Display the top 10 by outdegrees.

In [0]:
# outDegrees Reference: https://docs.databricks.com/_static/notebooks/graphframes-user-guide-py.html
graph.outDegrees.rdd.sortBy(lambda x: x[1],ascending=False).take(10) # Finding out degrees, sorting in descending order and returning top 10

### Let’s play Kevin’s own game

**Q7** Start with the graphframe / dataframe you developed in the previous questions. Using Spark GraphFrame and/or Spark Core library perform the following steps:

1. Identify the id of Kevin Bacon. There are two actors named ‘Kevin Bacon’; we will use the one with the highest degree, that is, the one who participated in the most titles;
1. Estimate the shortest path between every actor in the database and Kevin Bacon. Keep a dataframe with this information, as you will need it later;
1. Summarise the data, that is, count the number of actors at each number of degrees from Kevin Bacon. You will need to deal with actors unconnected to Kevin Bacon: give these actors / actresses a score/degree of 20.

In [0]:
# Takes each row of the shortest-path results (id, name and a map of shortest distances) and converts it to (id, name, shortest distance to Kevin Bacon); unconnected actors get 20
def get_distance(x):
  if(kbid in x[2]):
    return (x[0],x[1],x[2][kbid])
  return (x[0],x[1],20)

In [0]:
count_movies_by_nconst.createOrReplaceTempView("count_movies_by_nconst")
nconst_to_names.createOrReplaceTempView("nconst_to_names")

kbid=spark.sql("SELECT count_movies_by_nconst.nconst FROM count_movies_by_nconst,(SELECT nconst as n2 FROM nconst_to_names WHERE primaryName LIKE 'Kevin Bacon') WHERE count_movies_by_nconst.nconst=n2 ORDER BY count_nconst DESC").take(1)[0]['nconst']                   # getting the id of Kevin Bacon

shortestpath_results=graph.shortestPaths(landmarks=[kbid]) # finding shortest path with Kevin Bacon (passed through landmarks into shortestPaths function). Reference: https://graphframes.github.io/graphframes/docs/_site/user-guide.html#shortest-paths

summary=shortestpath_results.rdd.map(get_distance)         # Since shortest path gives results in a map, converting it to int using get_distance function

spark.createDataFrame(summary).createOrReplaceTempView("summary")

answer=spark.sql("""
SELECT _3 as degree,COUNT(_3) as freq
FROM summary
GROUP BY _3
""")                                                       # Counting the frequency of shortest distance in summary
answer.collect()                                           # returns the number of actors at each number of degrees from Kevin Bacon
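
The idea behind the degree computation can be sketched in plain Python as a breadth-first search from the landmark, with 20 assigned to anyone who is never reached. This is an illustration on a hypothetical toy graph, not the GraphFrames implementation; the function name `degrees_from` is made up, and links are treated as undirected.

```python
from collections import Counter, deque

def degrees_from(source, edges, nodes, unreachable=20):
    """Breadth-first search distances from source; unreachable nodes get a default."""
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in distance:  # first visit is the shortest path
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return {node: distance.get(node, unreachable) for node in nodes}

# Toy graph: KB - A - B - C, and X connected to nobody.
nodes = ["KB", "A", "B", "C", "X"]
edges = [("KB", "A"), ("A", "B"), ("B", "C")]
degrees = degrees_from("KB", edges, nodes)
summary_counts = Counter(degrees.values())  # number of actors at each degree
print(degrees)
print(sorted(summary_counts.items()))
```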

### Exploring the data with RDDs

Using RDDs (and not dataframes) answer the following questions (if you loaded your data into spark in a dataframe you can convert to an RDD of rows easily using `.rdd`):

**Q8** Movies can have multiple genres. Considering only titles of the type 'movie', what is the combination of genres that is the most popular (as measured by number of reviews)? Hint: paired RDDs will be useful.

In [0]:
movies_reviews=names_of_title.filter(names_of_title.titleType=='movie').join(ratings,on='tconst') # joining movie information with movie ratings

movies_reviews.createOrReplaceTempView("movies_reviews")

genere_votes=spark.sql("""
SELECT genres,SUM(numVotes)
FROM movies_reviews
GROUP BY genres
""").rdd.sortBy(lambda x: x[1],ascending=False)                                                   # taking total number of votes for each combination of genres and sorting in descending order
genere_votes.take(1)                                                                              # returning combination of genres with highest votes

**Q9** Movies can have multiple genres. Considering only titles of the type 'movie', and movies with more than 400 ratings, what is the combination of genres that has the highest **average movie rating** (you can average the movie rating for each movie in that genre combination). Hint: paired RDD's will be useful.

In [0]:
genere_ratings=spark.sql("""
SELECT genres,AVG(averageRating)
FROM movies_reviews
WHERE numVotes>400
GROUP BY genres
""").rdd.sortBy(lambda x: x[1],ascending=False) # taking the average rating for each combination of genres and sorting in descending order
genere_ratings.take(1)                          # returning combination of genres with highest average rating
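
With paired RDDs, an average per key is usually built from a running (sum, count) pair, because averages themselves cannot be merged. The plain-Python sketch below shows that pattern on hypothetical `(genres, averageRating, numVotes)` tuples, including the filter on more than 400 votes; it is an illustration, not Spark code.

```python
# Hypothetical (genres, averageRating, numVotes) records.
movies = [
    ("Drama", 8.0, 1000),
    ("Drama", 6.0, 500),
    ("Comedy,Romance", 7.5, 900),
    ("Comedy,Romance", 9.9, 12),  # dropped: not more than 400 votes
]

totals = {}  # genres -> (sum of ratings, number of movies)
for genres, rating, votes in movies:
    if votes > 400:
        rating_sum, count = totals.get(genres, (0.0, 0))
        totals[genres] = (rating_sum + rating, count + 1)

# divide only at the end, once all partial sums have been merged
averages = {genres: s / c for genres, (s, c) in totals.items()}
best = max(averages, key=averages.get)
print(best, averages[best])
```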

**Q10** Movies can have multiple genres. What is **the individual genre** which is the most popular as measured by number of votes? Votes for multiple genres count towards each genre listed. Hint: flatMap and paired RDDs will be useful here.

In [0]:
votes_per_genre=dict()
for genre_votes in genere_votes.collect():
  for genre in genre_votes.genres.split(','):                                                        # taking combination of genres, splitting to get a list of genres and iterating over it
    if genre in votes_per_genre:                                                                     # counting the number of votes towards each genre and storing in votes_per_genre
      votes_per_genre[genre]+=genre_votes['sum(numVotes)']
    else:
      votes_per_genre[genre]=genre_votes['sum(numVotes)']
genre_ordered=[k for k, v in sorted(votes_per_genre.items(), key=lambda item: item[1],reverse=True)] # ordering the genres by votes
genre_ordered[0]                                                                                     # returning genre with highest votes
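
The flatMap hint in Q10 amounts to turning each movie into one (genre, votes) pair per listed genre and then summing per genre. A minimal plain-Python sketch of that idea, on hypothetical data and with a made-up helper name `explode`, could look like this:

```python
from collections import Counter

# Hypothetical (genres, numVotes) records.
movies = [("Drama,Romance", 100), ("Drama", 50), ("Comedy", 120)]

def explode(movie):
    """flatMap-style step: one (genre, votes) pair per genre listed on the movie."""
    genres, votes = movie
    return [(genre, votes) for genre in genres.split(",")]

# flatten the per-movie lists into one stream of pairs
pairs = [pair for movie in movies for pair in explode(movie)]

# sum the votes per individual genre
votes_by_genre = Counter()
for genre, votes in pairs:
    votes_by_genre[genre] += votes

print(votes_by_genre.most_common(1))  # [('Drama', 150)]
```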

## Engineering the perfect cast
We have created a number of potential features for predicting the rating of a movie based on its cast. Use sparkML to build a simple linear model to predict the rating of a movie based on the following features:

1. The total number of movies in which the actors / actresses have acted (based on Q3)
1. The average pagerank of the cast in each movie (based on Q5)
1. The average outDegree of the cast in each movie (based on Q6)
1. The average value for the cast of degrees of Kevin Bacon (based on Q7).

You will need to create a dataframe with the required features and label. Use a pipeline to create the vectors required by sparkML and apply the model. Remember to split your dataset, leave 30% of the data for testing, when splitting your data use the option seed=0.

**Q11** Provide the coefficients of the regression and the accuracy of your model on that test dataset according to RMSE.

In [0]:
# function to get vector features from a DataFrame
def addFeatureVectors(x,inputCols):
  vecAssembler = VectorAssembler(inputCols=inputCols, outputCol="features")
  return vecAssembler.transform(x)

In [0]:
actor_data=count_movies_by_nconst.join(
  results.vertices,count_movies_by_nconst.nconst==results.vertices.id).drop('name').drop('id').join(
  graph.outDegrees,graph.outDegrees.id==count_movies_by_nconst.nconst).drop('id').join(
  spark.createDataFrame(summary,['nconst','name','KBD']),on='nconst') # obtaining features of actors such as total movies, page rank, out degree, distance from Kevin Bacon
full_final_data=filtered_titles_actors_names.drop('titleType','primaryTitle','originalTitle','ordering',
                                                  'isAdult','startYear','endYear','runtimeMinutes','genres',
                                                  'category','job','characters','primaryName','birthYear','deathYear',
                                                  'primaryProfession','knownForTitles').join(
  actor_data,on='nconst').drop('name').groupBy('tconst').agg({'count_nconst':'mean','pagerank':'mean','outDegree':'mean','KBD':'mean'}).join(ratings,on='tconst') # obtaining average of total movies, page rank, out degree, distance from Kevin Bacon of the cast of a movie
train,test=addFeatureVectors(full_final_data.drop('tconst','numVotes'),['avg(outDegree)','avg(count_nconst)','avg(pagerank)','avg(KBD)']).randomSplit([0.7,0.3],0) # getting vector features and splitting the dataset into train (70%) and test (30%) with seed 0

In [0]:
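from pyspark.ml.regression import LinearRegressionModel
lr=LinearRegression(labelCol='averageRating') # initializing a LinearRegression model
if os.path.exists("/dbfs/tmp/q11model"):      # os.path needs the local /dbfs mount path, not the dbfs:/ URI
  model=LinearRegressionModel.load("dbfs:/tmp/q11model") # reusing a previously saved model
else:
  model=lr.fit(train)                         # fitting training data to the model
  model.write().overwrite().save("dbfs:/tmp/q11model")
testSummary=model.evaluate(test)              # evaluating the model on test data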

In [0]:
model.intercept,model.coefficients # model.intercept gives the trained intercept and model.coefficients gives the trained coefficients of the LinearRegression model

In [0]:
testSummary.rootMeanSquaredError # RMSE on test data
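
For reference, the root mean squared error is the square root of the mean of the squared differences between actual and predicted ratings. The plain-Python sketch below computes it by hand on hypothetical numbers; the function name `rmse` and the sample values are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long, non-empty sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings: errors of 1, 0 and -2 give sqrt((1 + 0 + 4) / 3).
actual = [7.0, 6.0, 8.0]
predicted = [6.0, 6.0, 10.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, one large miss weighs more than several small ones, and the result is in the same units as the rating itself.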

In [0]:
def get_prediction_on_titanic(model,data,features):
  titanic_id=names_of_title.filter(((names_of_title.primaryTitle=='Titanic')|(names_of_title.originalTitle=='Titanic'))&(names_of_title.startYear=='1997')&(names_of_title.titleType=='movie')).take(1)[0]['tconst'] # getting the id of movie Titanic
  titanic_data=data.filter(data.tconst==titanic_id)                                    # getting the training data of movie Titanic using its id
  titanic_vector=addFeatureVectors(titanic_data.drop('tconst','numVotes'),features)    # getting features vector of Titanic
  titanic_result=model.predict(titanic_vector.collect()[0].features)                   # prediction on Titanic
  return titanic_result                                                                # returning prediction result
# get_prediction_on_titanic(model,full_final_data,['avg(outDegree)','avg(count_nconst)','avg(pagerank)','avg(KBD)'])

In [0]:
get_prediction_on_titanic(model,full_final_data,['avg(outDegree)','avg(count_nconst)','avg(pagerank)','avg(KBD)'])

**Q12** What score would your model predict for the 1997 movie Titanic?

In [0]:
titanic_id=names_of_title.filter(((names_of_title.primaryTitle=='Titanic')|(names_of_title.originalTitle=='Titanic'))&(names_of_title.startYear=='1997')&(names_of_title.titleType=='movie')).take(1)[0]['tconst'] # getting the id of movie Titanic
titanic_data=full_final_data.filter(full_final_data.tconst==titanic_id)  # getting the training data of movie Titanic using its id
titanic_vector=addFeatureVectors(titanic_data.drop('tconst','numVotes'),['avg(outDegree)','avg(count_nconst)','avg(pagerank)','avg(KBD)']) # getting features vector of Titanic
titanic_result=model.predict(titanic_vector.collect()[0].features)       # prediction on Titanic
titanic_result                                                           # returning prediction result

**Q13** Create dummy variables for each of the top 10 movie genres from Q10. These variables should have a value of 1 if the movie was rated with that genre and 0 otherwise. For example, the 1997 movie Titanic should have a 1 in the dummy variable column for Romance, a 1 in the dummy variable column for Drama, and 0's in all the other dummy variable columns.

Does adding these variables to the regression improve your results? What are the new RMSE and the predicted rating for the 1997 movie Titanic?

In [0]:
top_ten_genres=genre_ordered[:10] # getting top 10 genres
full_final_data_with_genres=filtered_titles_actors_names.drop('titleType','primaryTitle','originalTitle','ordering','isAdult','startYear','endYear','runtimeMinutes','category','job','characters','primaryName','birthYear','deathYear','primaryProfession','knownForTitles').join(actor_data,on='nconst').drop('name').groupBy('tconst','genres').agg({'count_nconst':'mean','pagerank':'mean','outDegree':'mean','KBD':'mean'}).join(ratings,on='tconst')  # obtaining genres along with average of total movies, page rank, out degree, distance from Kevin Bacon of the cast of a movie 

from pyspark.sql import functions as F
genres_array=F.split(F.col('genres'),',') # splitting the comma separated genres string into an array of genres
for genre in top_ten_genres:
  full_final_data_with_genres=full_final_data_with_genres.withColumn(genre,F.array_contains(genres_array,genre).cast('int')) # dummy column: 1 if the movie lists this genre, 0 otherwise
full_final_data_with_genres=full_final_data_with_genres.drop('genres') # computing the dummies per row avoids joining on 'genres', which repeated every movie once per movie sharing its genres string

train1,test1=addFeatureVectors(full_final_data_with_genres.drop('tconst','numVotes'),['avg(outDegree)','avg(count_nconst)','avg(pagerank)','avg(KBD)']+top_ten_genres).randomSplit([0.7,0.3],0) # getting vector features and splitting the dataset into train and test

In [0]:
train1.toPandas().shape # size of the training set used for the model with genre dummies

In [0]:
lr1=LinearRegression(labelCol='averageRating')
model1=lr1.fit(train1)
testSummary1=model1.evaluate(test1)

In [0]:
testSummary1.rootMeanSquaredError

In [0]:
get_prediction_on_titanic(model1,full_final_data_with_genres,['avg(outDegree)','avg(count_nconst)','avg(pagerank)','avg(KBD)']+top_ten_genres)
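
The dummy variables described in Q13 can be illustrated in plain Python as follows. The sketch assumes genres arrive as a comma-separated string and uses a hypothetical short genre list; the function name `genre_dummies` is made up. It matches whole genre names, since a substring test would wrongly flag Music for a movie tagged Musical.

```python
def genre_dummies(genres, top_genres):
    """Return a 0/1 indicator per top genre for one comma-separated genres string."""
    present = set(genres.split(","))  # exact genre names listed on the movie
    return {genre: int(genre in present) for genre in top_genres}

# Hypothetical short list standing in for the top 10 genres.
top_genres = ["Drama", "Comedy", "Romance", "Music"]

titanic_like = genre_dummies("Drama,Romance", top_genres)
musical = genre_dummies("Musical", top_genres)
print(titanic_like)  # {'Drama': 1, 'Comedy': 0, 'Romance': 1, 'Music': 0}
print(musical)       # {'Drama': 0, 'Comedy': 0, 'Romance': 0, 'Music': 0}
```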

**Q14 - Open Question**: Improve your model by testing different machine learning algorithms, using hyperparameter tuning on these algorithms, or changing the included features. What is the RMSE of your final model, and what rating does it predict for the 1997 movie Titanic?

In [0]:
# triangle count indicates the embeddedness of a node in a network
triangles=graph.triangleCount()
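
To show what a per-node triangle count measures, here is a minimal plain-Python sketch on a hypothetical toy graph. A triangle through a node is a pair of its neighbours that are also linked to each other. The function name `triangle_counts` is made up, and this is an illustration rather than the GraphFrames algorithm.

```python
from itertools import combinations

def triangle_counts(edges):
    """Count, for each node, the triangles it takes part in (edges are undirected)."""
    neighbours = {}
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    counts = {}
    for node, adjacent in neighbours.items():
        # a triangle exists for every pair of neighbours that are themselves linked
        counts[node] = sum(
            1 for u, v in combinations(sorted(adjacent), 2) if v in neighbours[u]
        )
    return counts

# Toy graph: A, B and C all worked together; D only worked with C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_triangles = triangle_counts(toy_edges)
print(toy_triangles)
```

Here `D` has a link but no triangle, which is the sense in which it is less embedded in the network than `A`, `B` or `C`.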

In [0]:
data_with_traiangles=filtered_titles_actors_names.drop('titleType','primaryTitle','originalTitle','ordering',
                                                  'isAdult','startYear','endYear','runtimeMinutes','genres',
                                                  'category','job','characters','primaryName','birthYear','deathYear',
                                                  'primaryProfession','knownForTitles').join(
                                                  actor_data.join(triangles,triangles.id==actor_data.nconst # including triangle count
                                                  ).drop('id','name'),on='nconst').drop('name').groupBy('tconst').agg(
                                                  {'count_nconst':'mean','pagerank':'mean','outDegree':'mean','KBD':'mean','count':'mean'}).join(ratings,on='tconst')

train2,test2=addFeatureVectors(data_with_traiangles.drop('tconst','numVotes'),['avg(count)','avg(outDegree)','avg(count_nconst)','avg(pagerank)','avg(KBD)']).randomSplit([0.7,0.3],0) # getting vector features and splitting the dataset into train and test (the 1e-8 sample was dropped, since it leaves almost no rows to train on)

lr2=LinearRegression(labelCol='averageRating')
model2=lr2.fit(train2)
testSummary2=model2.evaluate(test2)

In [0]:
testSummary2.rootMeanSquaredError

In [0]:
get_prediction_on_titanic(model2,data_with_traiangles,['avg(count)','avg(outDegree)','avg(count_nconst)','avg(pagerank)','avg(KBD)'])