# IMDb Movie Segments
**Using principal components and k-means to find clusters of similar movies**

# Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
   1. [Import and Unzip](#Import-and-Unzip)
   1. [Transform](#Transform)
      1. [Visualize](#Visualize)
   1. [Upload](#Upload)
1. [Train PCA](#Train-PCA)
1. [Host PCA](#Host-PCA)
   1. [Score PCA](#Score-PCA)
      1. [Visualize Components](#Visualize-Components)
1. [Train k-means](#Train-k-means)
1. [Host k-means](#Host-k-means)
   1. [Score k-means](#Score-k-means)

# Background

Clustering is a common unsupervised machine learning task, used in contexts from marketing to recommender systems.  However, clustering does have difficulty in very high dimensional spaces, where all observations in the data start to look dissimilar because they randomly happen to differ on some (potentially irrelevant) feature.

To correct for this, dimensionality reduction techniques are often used to bring the data into a lower dimensional space, reducing redundant variance, and allowing for better clustering solutions.

In this notebook, we walk through an example which starts with IMDb movie data on genre, ratings, age, etc. and utilizes Principal Component Analysis (PCA) for dimensionality reduction, and k-means for clustering within that reduced dimensional space.
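
To make the idea of dimensionality reduction concrete, here is a minimal local sketch of PCA computed with a singular value decomposition in NumPy. It is only an illustration on made-up toy data: the function name `pca_project` and the toy matrix are assumptions for this example, and it is not the implementation used by the hosted PCA algorithm later in this notebook.

```python
import numpy as np

def pca_project(X, num_components):
    """Project the rows of X onto its top principal components (illustrative only)."""
    # Center each column so the components describe variance around the mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Variance captured by each component
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    projected = X_centered @ Vt[:num_components].T
    return projected, explained_variance[:num_components]

# Toy data (hypothetical): 6 columns where the last 3 nearly duplicate the first 3
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

projected, variance = pca_project(X, num_components=3)
total_variance = np.var(X, axis=0, ddof=1).sum()
print(projected.shape)
print(variance.sum() / total_variance)
```

Because half of the toy columns are near copies of the other half, three components retain almost all of the variance, which is the kind of redundancy PCA is meant to remove before clustering.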

---
# Setup

Here we specify the linkage and authentication to AWS services. There are three parts to this:

* The credentials and region for the account that's running training. Upload the credentials in the normal AWS credentials file format to '~/.aws/' or run 'aws configure' from a Jupyter terminal. The region must always be `us-west-2` during the Beta program.
* The roles used to give learning and hosting access to your data. See the documentation for how to specify these.
* The S3 bucket that you want to use for training and model data.

In [None]:
import os
import boto3

os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'
role = boto3.client('iam').list_instance_profiles()['InstanceProfiles'][0]['Roles'][0]['Arn']

bucket = '<your_s3_bucket_here>'
pca_prefix = 'pca_kmeans_movie_clustering/pca'
kmeans_prefix = 'pca_kmeans_movie_clustering/kmeans'

Let's also bring in the Python libraries we'll want to use for this exercise.

In [None]:
import pandas as pd
import numpy as np
import sys
import convert_data
import boto3
import time
import json
import io
import matplotlib.pyplot as plt
from IPython.display import display

---
# Data

For this notebook, we'll be using the IMDb dataset, which is openly available on S3.  There is a great deal of detail, but to keep this straightforward, let's limit ourselves to basic details and user ratings for movies.

## Import and Unzip

In [None]:
!aws s3api get-object --request-payer requester --bucket imdb-datasets --key documents/v1/current/title.basics.tsv.gz ./title.basics.tsv.gz
!aws s3api get-object --request-payer requester --bucket imdb-datasets --key documents/v1/current/title.ratings.tsv.gz ./title.ratings.tsv.gz

In [None]:
!gunzip -f title.basics.tsv.gz
!gunzip -f title.ratings.tsv.gz

## Transform

Let's filter down to just movies and remove those with incomplete or irrelevant data or a small number of reviews.

In [None]:
basics = pd.read_csv('title.basics.tsv', sep='\t')
movies = basics[(basics['titleType'] == 'movie') & \
                (basics['isAdult'] == 0) & \
                (basics['startYear'] != '\\N') & \
                (basics['runtimeMinutes'] != '\\N')]
ratings = pd.read_csv('title.ratings.tsv', sep='\t')
movies = movies.merge(ratings[ratings['numVotes'] >= 100], on='tconst')
movies

There are text columns which need to be converted to a numeric representation in order to use them in our machine learning models.  In this case, that text information is genre.  We'd like to convert this single column into a set of columns which are 1 if the movie is a member of that genre and 0 otherwise.  Since a movie can be in multiple genres, the below pre-processing is necessary.

In [None]:
def split_indicators(df, col):
    keys = df[col].unique()
    split_keys = pd.concat([pd.DataFrame(keys), pd.Series(keys).apply(lambda x: pd.Series([i for i in x.split(',')]))], axis=1)
    split_keys.columns = [col] + ['x.{}'.format(i) for i in range(1, len(split_keys.columns))]
    key_list = split_keys.melt(id_vars=col)
    key_list['dummy'] = 1
    return key_list.pivot_table(index=col, columns='value', values='dummy').fillna(0)

def add_indicators(df, col):
    indicators = split_indicators(df, col)
    indicators[col] = indicators.index
    return df.merge(indicators, on=col)

In [None]:
movies = add_indicators(movies, 'genres')
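
For comparison, the same kind of multi-genre indicator columns can be sketched with the built-in pandas method `Series.str.get_dummies`. This is an alternative shown on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration), not the approach used by `add_indicators` above.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the comma-separated genres column
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'genres': ['Comedy,Romance', 'Drama', 'Comedy,Drama'],
})

# One column per distinct genre: 1 if the movie lists that genre, 0 otherwise
indicators = toy['genres'].str.get_dummies(sep=',')
toy_with_indicators = pd.concat([toy, indicators], axis=1)
print(toy_with_indicators)
```

A movie that lists several genres gets a 1 in each of the corresponding columns, which is the property the pre-processing above is designed to preserve.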

Now let's:
1. Convert all columns to numbers
1. Drop columns that we won't use as features for training our machine learning algorithms
1. Standardize (give each column a mean of 0 and a standard deviation of 1, since columns like startYear are on a completely different scale than averageRating)
1. Convert to numpy matrix

In [None]:
movies['startYear'] = pd.to_numeric(movies['startYear'])
movies['runtimeMinutes'] = pd.to_numeric(movies['runtimeMinutes'])
train_data = movies.drop(['tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'endYear', 'genres', '\\N'], axis=1)
train_data = (train_data - train_data.mean()) / train_data.std()
# DataFrame.as_matrix() was removed in pandas 1.0; .values works across versions
train_data = train_data.values.astype(float)
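
As a quick check of what the standardization step does, here is a small sketch on made-up values (the `toy` frame is an assumption for illustration). It applies the same formula as above and confirms that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd

# Hypothetical values on very different scales
toy = pd.DataFrame({
    'startYear': [1950, 1980, 2000, 2015],
    'averageRating': [6.1, 7.4, 5.2, 8.0],
})

# Same z-score formula as the notebook; pandas std() uses the sample
# standard deviation (ddof=1) by default
standardized = (toy - toy.mean()) / toy.std()
print(standardized.mean().round(6).tolist())
print(standardized.std().tolist())
```

Note that a column with a single repeated value has a standard deviation of 0, so this formula would produce NaN for it. It may be worth checking for such columns before training.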

### Visualize

The best case scenario in clustering is that the machine learning model is more of a formality, with the clusters already being visibly apparent.  However, the higher the dimensionality of the data, the harder it becomes to see such structure.  Let's look at scatterplots for just the first few columns in our training data.

In [None]:
pd.plotting.scatter_matrix(pd.DataFrame(train_data).iloc[:, 0:5], figsize=(12, 12))
plt.show()

As we can see, the data are often continuously distributed, with occasional outliers, but minimal other distinction which we can use to visibly separate them into clusters.

## Upload

Let's upload the data to S3 so that we can train our model in EASE.  Notice we are using the convert_data functions, which write our in-memory datasets to a recordIO protobuf format for improved performance.

In [None]:
train_file = 'pca_train.data'

f = io.BytesIO()
for row in train_data:
    convert_data.write_recordio(f, convert_data.list_to_record_bytes(row, label=0, feature_size=31))
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(pca_prefix, 'train', train_file)).upload_fileobj(f)

---
# Train PCA

PCA is a technique that...

Let's start by specifying our training parameters needed for the IM API, including:
1. The role to use
1. Our training job name
1. The PCA algorithm container
1. Training instance type and count
1. S3 location for training data
1. S3 location for output data
1. Algorithm hyperparameters
1. Stopping conditions

In [None]:
pca_job = 'pca-poc-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

print("Job name is:", pca_job)

pca_training_params = {
    "RoleArn": role,
    "TrainingJobName": pca_job,
    "AlgorithmSpecification": {
        "TrainingImage": "900597767885.dkr.ecr.us-east-1.amazonaws.com/ease-pca:latest",
        "TrainingInputMode": "File"
    },
    "ResourceConfig": {
        "InstanceCount": 2,
        "InstanceType": "c4.8xlarge",
        "VolumeSizeInGB": 50
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, pca_prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/".format(bucket, pca_prefix)
    },
    "HyperParameters": {
        'algorithm_mode': 'randomized',
        'num_components': '5',
        'subtract_mean': 'True',
        'extra_components': '-1',
        'feature_dim': '31',
        'mini_batch_size': '5000'
    },
    "StoppingCondition": {
        "MaxRuntimeInHours": 1
    }
}

Now let's kick off our training job on EASE, using the parameters we just created.  Because training is serverless, we don't have to wait for our job to finish to continue, but for this case, let's use a waiter to block until the job finishes and then check its final status.

In [None]:
%%time

im = boto3.client('im')
im.create_training_job(**pca_training_params)

status = im.describe_training_job(TrainingJobName=pca_job)['TrainingJobStatus']
print(status)
im.get_waiter('TrainingJob_Created').wait(TrainingJobName=pca_job)
# Re-read the status after the waiter returns; the value fetched above is stale
status = im.describe_training_job(TrainingJobName=pca_job)['TrainingJobStatus']
print(status)
if status == 'Failed':
    message = im.describe_training_job(TrainingJobName=pca_job)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')
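
As an alternative to the waiter, status can be monitored with a hand-written polling loop. The sketch below is generic and runs without any AWS calls: `wait_for_status` is a hypothetical helper, the fake status sequence stands in for a real describe call, and the terminal state names other than `'Failed'` are assumptions.

```python
import time

def wait_for_status(get_status, terminal_states, poll_seconds=30, max_polls=120):
    """Call get_status() repeatedly until it returns a terminal state or polls run out."""
    status = get_status()
    polls = 1
    while status not in terminal_states and polls < max_polls:
        time.sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status

# Stand-in for a describe call: returns a scripted sequence of statuses
fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    terminal_states={'Completed', 'Failed', 'Stopped'},  # assumed names
    poll_seconds=0,
)
print(final_status)
```

In this notebook, `get_status` could be something like `lambda: im.describe_training_job(TrainingJobName=pca_job)['TrainingJobStatus']`. The `max_polls` bound keeps the loop from running forever if the job never reaches a terminal state.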

---
# Host PCA

Now that we've trained the PCA algorithm on our data, let's set up a model which can later be hosted.  We will:
1. Point to the scoring container
1. Point to the model.tar.gz that came from training
1. Create the hosting model

In [None]:
pca_hosting_container = {
    'Image': "900597767885.dkr.ecr.us-east-1.amazonaws.com/ease-pca:latest",
    'ModelDataUrl': im.describe_training_job(TrainingJobName=pca_job)['ModelArtifacts']['S3ModelArtifacts']
}

create_model_response = im.create_model(
    ModelName=pca_job,
    ExecutionRoleArn=role,
    PrimaryContainer=pca_hosting_container)

print(create_model_response['ModelArn'])

Once we've set up a model, we can configure what our hosting endpoints should be.  Here we specify:
1. EC2 instance type to use for hosting
1. Lower and upper bounds for number of instances
1. Our hosting model name

In [None]:
pca_endpoint_config = 'pca-poc-endpoint-config-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(pca_endpoint_config)
create_endpoint_config_response = im.create_endpoint_config(
    EndpointConfigName=pca_endpoint_config,
    ProductionVariants=[{
        'InstanceType': 'c4.xlarge',
        'MaxInstanceCount': 3,
        'MinInstanceCount': 1,
        'ModelName': pca_job,
        'VariantName': 'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Now that we've specified how our endpoint should be configured, we can create it.  This can be done in the background, but for now let's wait on the endpoint and print its status so that we know when it is ready for use.

In [None]:
%%time

pca_endpoint = 'pca-poc-endpoint-' + time.strftime("%Y%m%d%H%M", time.gmtime())
print(pca_endpoint)
create_endpoint_response = im.create_endpoint(
    EndpointName=pca_endpoint,
    EndpointConfigName=pca_endpoint_config)
print(create_endpoint_response['EndpointArn'])

resp = im.describe_endpoint(EndpointName=pca_endpoint)
status = resp['EndpointStatus']
print("Status: " + status)

im.get_waiter('Endpoint_Created').wait(EndpointName=pca_endpoint)

resp = im.describe_endpoint(EndpointName=pca_endpoint)
status = resp['EndpointStatus']
print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

if status != 'InService':
    raise Exception('Endpoint creation did not succeed')

## Score PCA

Now that our endpoint is live, we can generate predictions from it.  In this case, we'll use it to score our training data, which results in the reduced dimensional components.

In [None]:
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()

In [None]:
runtime = boto3.Session().client(service_name='runtime.maeve', endpoint_url='https://maeveruntime.prod.us-west-2.ml-platform.aws.a2z.com')

minibatch_rows = 5000000. / sys.getsizeof(np2csv(train_data[0]))
split_array = np.array_split(train_data, int(train_data.shape[0] / float(minibatch_rows) + 1))
components = []
for array in split_array:
    payload = np2csv(array)
    response = runtime.invoke_endpoint(EndpointName=pca_endpoint,
                                       ContentType='text/csv',
                                       Body=payload)
    result = json.loads(response['Body'].read().decode())
    components += [p['projection'] for p in result['projections']]

components = np.array(components)
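
The batching logic above sizes each request from `sys.getsizeof`, which reports the in-memory size of a Python string object and so only approximates the encoded payload size. The sketch below shows one way to size chunks from the encoded CSV bytes instead. The helper names `to_csv` and `split_for_payload`, the toy data, and the 4096-byte limit are all assumptions for illustration.

```python
import io
import numpy as np

def to_csv(arr):
    """Serialize a 2-D array to CSV text, one row per line."""
    buf = io.BytesIO()
    np.savetxt(buf, arr, delimiter=',', fmt='%g')
    return buf.getvalue().decode().rstrip()

def split_for_payload(arr, max_bytes):
    """Split rows into chunks whose CSV encoding stays within max_bytes."""
    # Largest encoded row, plus one byte for its newline
    row_bytes = max(len(to_csv(row.reshape(1, -1)).encode()) + 1 for row in arr)
    # Falls back to one row per chunk if a single row exceeds max_bytes
    rows_per_chunk = max(1, max_bytes // row_bytes)
    num_chunks = -(-arr.shape[0] // rows_per_chunk)  # ceiling division
    return np.array_split(arr, num_chunks)

# Hypothetical toy data and limit
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))
chunks = split_for_payload(data, max_bytes=4096)
sizes = [len(to_csv(chunk).encode()) for chunk in chunks]
print(len(chunks), max(sizes))
```

Measuring every row is slower than sampling the first one, so for large arrays a sampled estimate with a safety margin may be a reasonable trade-off.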

### Visualize Components

As mentioned above, ideally the clusters would already be visibly apparent in our data.  Now that we've run PCA to reduce the dimensionality of our data, let's look at some scatterplots to understand if we can easily make out any groups of movies.

In [None]:
pd.plotting.scatter_matrix(pd.DataFrame(components), figsize=(12, 12))
plt.show()

The scatterplots tend to be dominated by one large mass of data points, but there are also several other sizable groups which are noticeably distinct.  We can utilize k-means to find these groupings in a robust manner.
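
For intuition about what k-means does, here is a minimal sketch of the standard Lloyd's algorithm in NumPy, run on made-up data. It is a plain textbook version for illustration only, not the implementation in the hosted k-means container. The function name `kmeans` and the toy blobs are assumptions.

```python
import numpy as np

def assign(X, centroids):
    """Return the index of the nearest centroid for each row of X."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(X, num_clusters, num_iterations=20, seed=0):
    """Lloyd's algorithm with random initialization from the data points."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies the rows, so updating centroids leaves X untouched
    centroids = X[rng.choice(X.shape[0], size=num_clusters, replace=False)]
    for _ in range(num_iterations):
        assignments = assign(X, centroids)
        for k in range(num_clusters):
            members = X[assignments == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids, assign(X, centroids)

# Toy data (hypothetical): two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

centroids, assignments = kmeans(X, num_clusters=2)
print(centroids)
```

The random choice of starting centroids is why cluster numbering, and sometimes cluster membership, can change from run to run unless a seed is fixed.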

---
# Train k-means

Next, let's run k-means on our reduced dimensional output.  Start by outputting the data to S3.  Notice we'll use the same bucket, but a different S3 prefix to avoid supplying conflicting training data.

In [None]:
# TODO update to newer protobuf format
train_file = 'kmeans_train.data'

vectors = [t.tolist() for t in components]
labels = [t.tolist() for t in components[:, 0]]

f = io.BytesIO()
convert_data.to_proto(f, labels=labels, vectors=vectors)
f.seek(0)

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(kmeans_prefix, 'train', train_file)).upload_fileobj(f)

Now we'll set up our k-means training parameters.  This is essentially the same as our definition for pca_training_params except we've changed:
1. The container image to k-means
1. S3 output path
1. Algorithm hyperparameters (notice our feature dimension is now 5 as we're clustering the 5 components output by PCA).

In [None]:
# TODO update to newer container... Hosting already is
kmeans_job = 'kmeans-poc-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

print("Job name is:", kmeans_job)

kmeans_training_params = {
    "RoleArn": role,
    "TrainingJobName": kmeans_job,
    "AlgorithmSpecification": {
        "TrainingImage": "900597767885.dkr.ecr.us-east-1.amazonaws.com/kmeanswebscale:latest",
        "TrainingInputMode": "File"
    },
    "ResourceConfig": {
        "InstanceCount": 2,
        "InstanceType": "c4.8xlarge",
        "VolumeSizeInGB": 50
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train".format(bucket, kmeans_prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/".format(bucket, kmeans_prefix)
    },
    "HyperParameters": {
        "num_clusters": "8",
        "feature_dim": "5",
        "mini_batch_size": "5000",
        "init_method": "random",
        "epochs": "1"
    },
    "StoppingCondition": {
        "MaxRuntimeInHours": 1
    }
}

Now invoke EASE for serverless training.

In [None]:
%%time

im = boto3.client('im')
im.create_training_job(**kmeans_training_params)

status = im.describe_training_job(TrainingJobName=kmeans_job)['TrainingJobStatus']
print(status)
im.get_waiter('TrainingJob_Created').wait(TrainingJobName=kmeans_job)
# Re-read the status after the waiter returns; the value fetched above is stale
status = im.describe_training_job(TrainingJobName=kmeans_job)['TrainingJobStatus']
print(status)
if status == 'Failed':
    message = im.describe_training_job(TrainingJobName=kmeans_job)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')

---
# Host k-means

Define our model for hosting.

In [None]:
kmeans_hosting_container = {
    'Image': "900597767885.dkr.ecr.us-east-1.amazonaws.com/aialgorithmskmeanswebscalecontainer:latest",
    'ModelDataUrl': im.describe_training_job(TrainingJobName=kmeans_job)['ModelArtifacts']['S3ModelArtifacts']
}

create_model_response = im.create_model(
    ModelName=kmeans_job,
    ExecutionRoleArn=role,
    PrimaryContainer=kmeans_hosting_container)

print(create_model_response['ModelArn'])

Set up our endpoint configuration.

In [None]:
kmeans_endpoint_config = 'kmeans-poc-endpoint-config-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(kmeans_endpoint_config)
create_endpoint_config_response = im.create_endpoint_config(
    EndpointConfigName=kmeans_endpoint_config,
    ProductionVariants=[{
        'InstanceType': 'c4.xlarge',
        'MaxInstanceCount': 3,
        'MinInstanceCount': 1,
        'ModelName': kmeans_job,
        'VariantName': 'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Initiate our endpoints.

In [None]:
%%time

kmeans_endpoint = 'kmeans-poc-endpoint-' + time.strftime("%Y%m%d%H%M", time.gmtime())
print(kmeans_endpoint)
create_endpoint_response = im.create_endpoint(
    EndpointName=kmeans_endpoint,
    EndpointConfigName=kmeans_endpoint_config)
print(create_endpoint_response['EndpointArn'])

resp = im.describe_endpoint(EndpointName=kmeans_endpoint)
status = resp['EndpointStatus']
print("Status: " + status)

im.get_waiter('Endpoint_Created').wait(EndpointName=kmeans_endpoint)

resp = im.describe_endpoint(EndpointName=kmeans_endpoint)
status = resp['EndpointStatus']
print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

if status != 'InService':
    raise Exception('Endpoint creation did not succeed')

## Score k-means

Now that our endpoint is live, we can generate predictions from it.  In this case, we'll use it to score our training data, which results in the assigned cluster for each movie.

In [None]:
runtime = boto3.Session().client(service_name='runtime.maeve', endpoint_url='https://maeveruntime.prod.us-west-2.ml-platform.aws.a2z.com')

minibatch_rows = 5000000. / sys.getsizeof(np2csv(components[0]))
split_array = np.array_split(components, int(components.shape[0] / float(minibatch_rows) + 1))
clusters = []
for array in split_array:
    payload = np2csv(array)
    response = runtime.invoke_endpoint(EndpointName=kmeans_endpoint,
                                       ContentType='text/csv',
                                       Body=payload)
    result = json.loads(response['Body'].read().decode())
    clusters += [r['closest_cluster'] for r in result['predictions']]


movies['cluster'] = clusters
movies['cluster'] = movies['cluster'].astype(object)

Let's take a quick look at how the clusters differ from one another.

_Note that because of random initialization of cluster centroids, results may vary slightly across runs._

In [None]:
pd.crosstab(index=movies['cluster'], columns='% observations', normalize='columns')

Most movies belong to one of two large clusters, but there aren't any unreasonably small clusters, which is a good sign.  Let's look at how their distributions differ.

In [None]:
%matplotlib inline

for column in ['startYear', 'averageRating', 'numVotes']: #movies.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = movies[[column, 'cluster']].hist(by='cluster', bins=30, figsize=(12, 3), layout=(1, 8))
    plt.show()

As we can see:
- Clusters 2, 5, and 6 (in particular) skew toward substantially earlier release dates.
- Clusters 0, 3, and 5 have wider ratings distributions.
- Cluster 7 appears to skew toward the most popular movies, with a larger portion having a very high number of reviews.

Now let's get recent examples from each cluster.

In [None]:
# groupby yields (cluster label, sub-DataFrame) pairs
for cluster, group in movies[(movies['startYear'] > 2000) & (movies['numVotes'] > 1000)].groupby('cluster'):
    print('Cluster:', cluster)
    display(group.sample(n=10, replace=True, random_state=0))

We can see:
- Cluster 0 includes a wide variety of movies, some foreign, some low budget, most with mediocre ratings.
- Cluster 1 has many Romance and Drama films.
- Cluster 2 tends to be Crime and Thriller movies.
- Cluster 3 is largely low rated Horror.
- Cluster 4 is largely Biographies.
- Cluster 5 is another broad category, with some foreign films and others with mediocre ratings.
- Cluster 6 appears to be mostly Musicals.
- Cluster 7 is mostly Hollywood blockbusters with broad appeal.

Although many of the clusters are defined by genre, there are clusters like 2, which span multiple related genres, and do not require membership in both to be included in the cluster.
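
One way to back up observations like these with numbers is to compute the share of each genre within each cluster. The sketch below does this on a small made-up DataFrame. The `toy` values are assumptions, and in this notebook the genre indicator columns in `movies` would play the role of the toy columns.

```python
import pandas as pd

# Hypothetical cluster assignments and genre indicator columns
toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1],
    'Horror':  [1, 1, 0, 0, 0],
    'Drama':   [0, 1, 1, 1, 1],
    'Romance': [0, 0, 1, 0, 1],
})

# The mean of a 0/1 indicator is the fraction of movies in that genre
genre_share = toy.groupby('cluster')[['Horror', 'Drama', 'Romance']].mean()
# Most common genre per cluster
top_genre = genre_share.idxmax(axis=1)
print(genre_share)
print(top_genre)
```

A cluster where two related genres each have a high share, without every movie belonging to both, would match the pattern described for cluster 2.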