# Exploring Movie Data (Inspecting and Visualizing)

Three movie data files are stored in an S3 bucket (mlmovieinfofiles). They are loaded into DataFrames, inspected, and transformed for the machine learning training process.

## Loading title.akas.tsv file into a DataFrame for inspection 

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

#reading title.akas.tsv file from S3 bucket and loading data into DataFrame
titleAkas_df = pd.read_csv("s3://mlmovieinfofiles/title.akas.tsv", sep='\t', dtype={'region':'category','language':'category','types':'category','attributes':'category','ordering':'uint8','isOriginalTitle' : 'object'})

In [None]:
titleAkas_df.head()

In [None]:
#checking data types
titleAkas_df.dtypes

In [None]:
#checking for duplicate rows in titleAkas_df 
duplicate_Akasrows_df = titleAkas_df[titleAkas_df.duplicated()]
print("number of duplicate rows: ", duplicate_Akasrows_df.shape[0])

In [None]:
#checking the dimensionality of the DataFrame
titleAkas_df.shape

Visualizing data in some of the columns in titleAkas_df DataFrame.

In [None]:
#Bar chart for data in column types
sns.countplot(x='types', data=titleAkas_df)
plt.xticks(rotation=90)

In [None]:
#Bar chart for data in column attributes
sns.countplot(x='attributes', data=titleAkas_df)
plt.xticks(rotation=90)

Some of the cells in the DataFrame contain "\N", the placeholder these files use for a missing value. The cells below count how many such cells each column has. They carry no information, so they need to be removed.

In [None]:
titleAkas_df.title.str.count("\\\\N").sum()

In [None]:
titleAkas_df.region.str.count("\\\\N").sum()

In [None]:
titleAkas_df.language.str.count("\\\\N").sum()

In [None]:
titleAkas_df.types.str.count("\\\\N").sum()

In [None]:
titleAkas_df.attributes.str.count("\\\\N").sum()

In [None]:
titleAkas_df.isOriginalTitle.str.count("\\\\N").sum()
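
The per-column checks above can also be run in one pass. The following is a minimal sketch and not the notebook's own method: `count_placeholder` and `toy_df` are hypothetical names, and the helper counts whole cells equal to "\N" rather than occurrences of the pattern inside a cell.

```python
import pandas as pd

def count_placeholder(df, placeholder="\\N"):
    """Return, per column, how many cells are exactly the placeholder string."""
    # astype(str) lets category and object columns be compared the same way
    return df.astype(str).eq(placeholder).sum()

# Toy data standing in for a few rows of the real file (hypothetical values)
toy_df = pd.DataFrame({
    "title": ["Carmencita", "\\N", "Pauvre Pierrot"],
    "language": ["\\N", "\\N", "fr"],
})

placeholder_counts = count_placeholder(toy_df)
print(placeholder_counts)
```

Calling `count_placeholder(titleAkas_df)` would return one count per column, which makes it easier to compare columns side by side before deciding which ones to drop.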

Based on these counts, the columns with too many "\N" cells are dropped.

In [None]:
#dropping irrelevant columns 
titleAkas_df = titleAkas_df.drop(['types','attributes','language'],axis=1)
titleAkas_df.head()

## Loading title.basics.tsv file into a DataFrame for inspection 

In [None]:
#reading title.basics.tsv file from s3 bucket and loading data into DataFrame
titleBasics_df = pd.read_csv("s3://mlmovieinfofiles/title.basics.tsv", sep='\t', dtype={'isAdult':'uint8', 'startYear':'category', 'endYear':'category','genres':'object'})

In [None]:
titleBasics_df.head()

In [None]:
#checking data types
titleBasics_df.dtypes

In [None]:
#renaming confusing column names
titleBasics_df = titleBasics_df.rename(columns={'tconst': 'titleId'})

In [None]:
#checking for duplicate rows in titleBasics_df
duplicate_Basicsrows_df = titleBasics_df[titleBasics_df.duplicated()]
print("number of duplicate rows: ", duplicate_Basicsrows_df.shape[0])

In [None]:
#checking the dimensionality of the DataFrame
titleBasics_df.shape

Visualizing data in some of the columns in titleBasics_df DataFrame.

In [None]:
#Bar chart for data in column titleType
sns.countplot(x='titleType', data=titleBasics_df)
plt.xticks(rotation=90)

In [None]:
#Bar chart for data in column endYear
sns.countplot(x='endYear', data=titleBasics_df)
plt.xticks(rotation=90)

In [None]:
#Bar chart for data in column runtimeMinutes
sns.countplot(x='runtimeMinutes', data=titleBasics_df)
plt.xticks(rotation=90)

Again, some of the cells in the DataFrame contain "\N". The cells below count how many such cells each column has. They carry no information, so they need to be removed.

In [None]:
titleBasics_df.startYear.str.count("\\\\N").sum()

In [None]:
titleBasics_df.endYear.str.count("\\\\N").sum()

In [None]:
titleBasics_df.runtimeMinutes.str.count("\\\\N").sum()

In [None]:
titleBasics_df.genres.str.count("\\\\N").sum()

Based on these counts, the columns with too many "\N" cells are dropped.

In [None]:
#dropping irrelevant columns 
titleBasics_df = titleBasics_df.drop(['endYear','runtimeMinutes'],axis=1)
titleBasics_df.head()
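
An alternative to counting "\N" strings after loading is to let pandas treat them as missing values at read time. This is a sketch under assumptions, not what the notebook does: `toy_tsv` is made-up data, and the only real API used is the `na_values` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Toy stand-in for a few rows of a tab-separated file (not real data)
toy_tsv = "tconst\tstartYear\tendYear\ntt1\t1894\t\\N\ntt2\t\\N\t\\N\n"

# na_values tells read_csv to parse the literal string \N as NaN
toy_basics_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", na_values="\\N")

# With real NaN values, the usual missing-data tools apply
missing_per_column = toy_basics_df.isna().sum()
print(missing_per_column)
```

With this approach, `isna().sum()` replaces the separate `str.count` cells and `dropna` can remove the affected rows.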

## Loading title.ratings.tsv file into a DataFrame for inspection 

In [None]:
#reading title.ratings.tsv file from s3 bucket and loading data into DataFrame
titleRatings_df = pd.read_csv("s3://mlmovieinfofiles/title.ratings.tsv", sep='\t')

In [None]:
titleRatings_df.head()

In [None]:
#checking data types
titleRatings_df.dtypes

In [None]:
#renaming confusing column names
titleRatings_df = titleRatings_df.rename(columns={'tconst': 'titleId'})

In [None]:
#checking for duplicate rows in titleRatings_df 
duplicate_Ratingsrows_df = titleRatings_df[titleRatings_df.duplicated()]
print("number of duplicate rows: ", duplicate_Ratingsrows_df.shape[0])

In [None]:
#checking the dimensionality of the DataFrame
titleRatings_df.shape

### Finding Outliers in titleRatings_df Dataframe

In [None]:
#Displaying a boxplot for titleRatings_df
ratingsdata = (titleRatings_df.averageRating,titleRatings_df.numVotes)

red_square = dict(markerfacecolor='r', marker='s')
fig, ax = plt.subplots()
ax.set_title('Title Ratings Boxplot')
ax.set_xlabel('Ratings')

ax.boxplot(ratingsdata, vert=False, flierprops=red_square)

In [None]:
#Displaying a scatterplot for titleRatings_df
x = titleRatings_df.averageRating
y = titleRatings_df.numVotes


plt.scatter(x, y, alpha=0.5)
plt.show()

### Using the IQR score technique to detect and remove outliers

In [None]:
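#numeric_only=True keeps the titleId string column out of the quantile calculation
Q1 = titleRatings_df.quantile(0.25, numeric_only=True)
Q3 = titleRatings_df.quantile(0.75, numeric_only=True)
IQR = (Q3 - Q1)
print(IQR)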

In [None]:
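#Removing outliers: rows with any numeric value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
ratingCols_df = titleRatings_df[Q1.index]
outlier_mask = ((ratingCols_df < (Q1 - 1.5 * IQR)) | (ratingCols_df > (Q3 + 1.5 * IQR))).any(axis=1)
titleRatings_df = titleRatings_df[~outlier_mask]
titleRatings_df.shape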
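
The IQR rule used above can be sketched on a small example. The helper name `remove_iqr_outliers`, the multiplier argument `k`, and the toy values are assumptions made for illustration only.

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values in `columns` all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # ge/le with axis=1 compare each column against its own bound
    inside = (df[columns].ge(lower, axis=1) & df[columns].le(upper, axis=1)).all(axis=1)
    return df[inside]

# Toy ratings: the last row has an extreme vote count
toy_ratings = pd.DataFrame({
    "averageRating": [5.0, 6.0, 7.0, 6.5, 5.5],
    "numVotes": [10, 12, 11, 13, 100000],
})

filtered = remove_iqr_outliers(toy_ratings, ["averageRating", "numVotes"])
print(filtered)
```

In the toy data, numVotes has Q1 = 11 and Q3 = 13, so the accepted range is 8 to 16 and only the row with 100000 votes is removed.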

In [None]:
#Displaying a scatterplot after outliers have been removed
x = titleRatings_df.averageRating
y = titleRatings_df.numVotes


plt.scatter(x, y, alpha=0.5)
plt.show()

## Joining titleAkas_df, titleBasics_df and titleRatings_df Dataframes

In [None]:
from pandas import DataFrame

#first joining titleBasics_df to titleRatings_df
ratingsBasics_df = pd.merge(titleRatings_df, titleBasics_df,on='titleId')

ratingsBasics_df.head()

In [None]:
#checking dimensionality of joined DataFrame
ratingsBasics_df.shape

In [None]:
#Assigning rows in titleAkas_df DataFrame with isOriginalTitle value "1" to a new dataframe
titleAkasOriginals_df = titleAkas_df[titleAkas_df.isOriginalTitle=='1']
titleAkasOriginals_df.head()

In [None]:
#joining titleAkasOriginals_df to ratingsBasics_df
ratingsBasicsAkas_df = pd.merge(ratingsBasics_df, titleAkasOriginals_df,on='titleId')
ratingsBasicsAkas_df.head()
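
As a small illustration of what these joins do, the sketch below merges two toy frames on titleId. The frames and their values are hypothetical; `how` and `validate` are existing `pd.merge` parameters, and `validate` is an optional safety check that the notebook itself does not use.

```python
import pandas as pd

# Toy frames sharing the titleId key (made-up values)
toy_ratings = pd.DataFrame({
    "titleId": ["tt1", "tt2", "tt3"],
    "averageRating": [5.6, 6.1, 7.0],
})
toy_basics = pd.DataFrame({
    "titleId": ["tt1", "tt2", "tt4"],
    "primaryTitle": ["A", "B", "D"],
})

# An inner join keeps only keys present in both frames;
# validate raises an error if a key is repeated on either side
joined = pd.merge(toy_ratings, toy_basics, on="titleId", how="inner", validate="one_to_one")
print(joined)
```

Rows for tt3 and tt4 disappear because each exists in only one frame, which is why the shape is checked after every join.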

In [None]:
#checking data types of joined DataFrame
ratingsBasicsAkas_df.dtypes

In [None]:
#checking if any row in joined DataFrame has null value
ratingsBasicsAkas_df.isnull().sum()

In [None]:
#checking dimensionality of joined DataFrame
ratingsBasicsAkas_df.shape

In [None]:
#checking for duplicate rows in ratingsBasicsAkas_df
duplicate_joinedData_df = ratingsBasicsAkas_df[ratingsBasicsAkas_df.duplicated()]
print("number of duplicate rows: ", duplicate_joinedData_df.shape[0])

Feature engineering on columns of the joined DataFrame.

In [None]:
#Feature engineering for titleType column: folding TV and game variants into their broader types
titleType_map = {'tvShort': 'short', 'tvMovie': 'movie', 'tvMiniSeries': 'tvSeries', 'videoGame': 'video'}
ratingsBasicsAkas_df['titleType'] = ratingsBasicsAkas_df['titleType'].replace(titleType_map)

In [None]:
#Bar chart for data in column titleType
sns.countplot(x='titleType', data=ratingsBasicsAkas_df)
plt.xticks(rotation=90)

Some of the cells in the joined DataFrame still contain "\N". The cells below count how many such cells are left in each column, since they need to be removed.

In [None]:
ratingsBasicsAkas_df.startYear.str.count("\\\\N").sum()

In [None]:
ratingsBasicsAkas_df.genres.str.count("\\\\N").sum()

In [None]:
ratingsBasicsAkas_df.region.str.count("\\\\N").sum()

In [None]:
#dropping irrelevant columns 
ratingsBasicsAkas_df = ratingsBasicsAkas_df.drop(['ordering','title','region','isOriginalTitle'],axis=1)
ratingsBasicsAkas_df.head()

Removing the remaining rows that still contain "\N".

In [None]:
ratingsBasicsAkas_df.genres.str.count("\\\\N").sum()

In [None]:
ratingsBasicsAkas_df = ratingsBasicsAkas_df[ratingsBasicsAkas_df.genres != '\\N']

In [None]:
ratingsBasicsAkas_df.shape

In [None]:
#ensuring rows in genres containing "\N" have been removed
ratingsBasicsAkas_df.genres.str.count("\\\\N").sum()

In [None]:
ratingsBasicsAkas_df.startYear.str.count("\\\\N").sum()

In [None]:
ratingsBasicsAkas_df = ratingsBasicsAkas_df[ratingsBasicsAkas_df.startYear != '\\N']

In [None]:
ratingsBasicsAkas_df.shape

In [None]:
#ensuring rows in startYear containing "\N" have been removed
ratingsBasicsAkas_df.startYear.str.count("\\\\N").sum()

In [None]:
#dropping some columns to do more feature engineering on column titleType
convertData = ratingsBasicsAkas_df.drop(['titleId','averageRating','numVotes','originalTitle','isAdult','startYear','genres'],axis=1)

convertData.head()

In [None]:
#converting categorical data found in column titleType into numerical data
cat_vars = ['titleType']
data1 = convertData
for var in cat_vars:
    #one indicator column per category, e.g. titleType_movie
    catList = pd.get_dummies(convertData[var], prefix=var)
    data1 = data1.join(catList)

#keeping every column except the original categorical ones
data_vars = data1.columns.values.tolist()
to_keep = [i for i in data_vars if i not in cat_vars]
data_final = data1[to_keep]
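
The one-hot encoding step above can be seen on a toy frame. This is a minimal sketch with hypothetical data; the `dtype` argument is optional and is used here only so the indicator columns print as 0 and 1.

```python
import pandas as pd

# Toy frame with one categorical column (made-up values)
toy_titles = pd.DataFrame({
    "primaryTitle": ["A", "B", "C"],
    "titleType": ["movie", "short", "movie"],
})

# get_dummies creates one indicator column per category, prefixed with the column name
dummies = pd.get_dummies(toy_titles["titleType"], prefix="titleType", dtype="uint8")

# Replace the categorical column with its indicator columns
encoded = toy_titles.drop(columns="titleType").join(dummies)
print(encoded)
```

Each row ends up with exactly one indicator set to 1, which gives K-Means numeric inputs in place of the text category.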

In [None]:
data_final.head()

In [None]:
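#Joining ratingsBasicsAkas_df to data_final after completing feature engineering on column titleType
#note: primaryTitle is not unique, so this merge yields repeated rows; duplicates are dropped further down
trainData_df = pd.merge(ratingsBasicsAkas_df, data_final,on='primaryTitle')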
trainData_df.head()

In [None]:
#dropping columns that can not be used as features for the K-means model
trainData_df = trainData_df.drop(['titleId','titleType','originalTitle'],axis=1)
trainData_df.head()

In [None]:
#changing index of trainData_df to primaryTitle + genres
trainData_df.index=trainData_df['primaryTitle'] + "-" + trainData_df['genres'].astype(object)
drop=["primaryTitle" , "genres"]
trainData_df.drop(drop, axis=1, inplace=True)
trainData_df.head()

In [None]:
#rearranging columns in trainData_df DataFrame
trainData_df = trainData_df[['startYear', 'numVotes', 'averageRating', 'isAdult','titleType_movie','titleType_short','titleType_tvEpisode','titleType_tvSeries','titleType_tvSpecial','titleType_video']]
trainData_df.head()

In [None]:
#checking data types 
trainData_df.dtypes

In [None]:
#checking dimensionality of DataFrame
trainData_df.shape

In [None]:
#checking for duplicate rows in trainData_df
trainDataduplicates_df = trainData_df[trainData_df.duplicated()]
print("number of duplicate rows: ", trainDataduplicates_df.shape[0])

In [None]:
#dropping duplicates in trainData_df
trainData_df = trainData_df.drop_duplicates()
trainData_df.shape

In [None]:
#converting all data types to float
train_data = trainData_df.values.astype('float32')

# Training the K-Means model

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker import KMeans
import boto3
import os
import mxnet as mx
from random import randint

role = get_execution_role()
bucket_name = 'mlmovieinfofiles'
data_location = 's3://{}/data'.format(bucket_name)
output_location = 's3://{}/output'.format(bucket_name)

print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))

#defining the hyperparameters of the K-Means model and specifying 10 clusters to be identified
num_clusters = 10
kmeans = KMeans(role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge',
                output_path=output_location,
                k=num_clusters,
                data_location=data_location)

Training the model on the training data

In [None]:
%%time

kmeans.fit(kmeans.record_set(train_data))
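
For intuition about what a K-Means fit produces, here is a small local sketch of the classic Lloyd iteration in NumPy. It is not SageMaker's implementation; the function name `lloyd_kmeans`, its parameters, and the toy points are all assumptions for illustration.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd iteration: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random (fancy indexing returns a copy)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1), centroids

# Two well-separated toy groups
toy_points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
toy_labels, toy_centroids = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
```

Distances are computed on raw feature values, so a column with a large range (such as a vote count) has more influence on the clusters than a 0/1 indicator column.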

# Setting up hosting for the model

Deploying the model we just trained behind a real-time hosted endpoint. 

In [None]:
%%time

kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')


The original training set is passed to the endpoint to get a cluster label for each entry. It has to be passed in chunks because there is a limit on how much data a single request can carry. The result tells us which cluster each movie, and therefore each movie watcher, belongs to.

In [None]:
#result=kmeans_predictor.predict(train_data)
result=kmeans_predictor.predict(train_data[0:98300])

In [None]:
result1=kmeans_predictor.predict(train_data[98300:150000])

In [None]:
result2=kmeans_predictor.predict(train_data[150000:179793])
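
The chunked calls above can be sketched as a loop so that the slice boundaries do not have to be typed by hand. In this sketch `iter_chunks`, `predict_in_chunks` and `fake_predict` are hypothetical names, and `fake_predict` is only a local stand-in for the endpoint call; the chunk size would have to be chosen to fit the real request limit.

```python
import numpy as np

def iter_chunks(array, chunk_size):
    """Yield consecutive slices of at most chunk_size rows."""
    for start in range(0, len(array), chunk_size):
        yield array[start:start + chunk_size]

def predict_in_chunks(predict_fn, array, chunk_size):
    """Call predict_fn on each chunk and collect the results in row order."""
    results = []
    for chunk in iter_chunks(array, chunk_size):
        results.extend(predict_fn(chunk))
    return results

def fake_predict(chunk):
    # Stand-in for a real endpoint call: returns one made-up label per row
    return [int(row.sum()) % 3 for row in chunk]

toy_data = np.arange(20, dtype="float32").reshape(10, 2)
toy_results = predict_in_chunks(fake_predict, toy_data, chunk_size=4)
print(toy_results)
```

Because the results are collected in row order, they line up with the rows of the original array without any per-chunk bookkeeping.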

Breakdown of cluster counts and the distribution of clusters in each chunk of the training set passed above.

In [None]:
cluster_labels = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]
pd.DataFrame(cluster_labels)[0].value_counts()

In [None]:
cluster_labels1 = [r.label['closest_cluster'].float32_tensor.values[0] for r in result1]
pd.DataFrame(cluster_labels1)[0].value_counts()

In [None]:
cluster_labels2 = [r.label['closest_cluster'].float32_tensor.values[0] for r in result2]
pd.DataFrame(cluster_labels2)[0].value_counts()

Mapping cluster labels back to each movie. This has to be done in chunks since the predictions were made in chunks.

In [None]:
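#.copy() gives each chunk its own data, so adding a column below does not raise SettingWithCopyWarning
traindata = trainData_df.iloc[0:98300].copy()
traindata1 = trainData_df.iloc[98300:150000].copy()
traindata2 = trainData_df.iloc[150000:179793].copy()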
traindata['clusterLabels']=list(map(int, cluster_labels))
traindata1['clusterLabels']=list(map(int, cluster_labels1))
traindata2['clusterLabels']=list(map(int, cluster_labels2))
finalData = pd.concat([traindata,traindata1,traindata2])
finalData.head()

In [None]:
#resetting index and renaming index column (made up of primaryTitle + genres) as Title
finalData.reset_index(level=0, inplace=True)
finalData = finalData.rename(columns={'index': 'Title'})
finalData.head()

In [None]:
#splitting column Title back into primaryTitle and genres at the first "-"
#note: a title that itself contains "-" is also split at its first hyphen
split_df = finalData["Title"].str.split("-", n = 1, expand = True)
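
To make the behavior of this split concrete, the sketch below runs the same call on two made-up strings. The values are hypothetical; only `str.split` with `n` and `expand` is taken from the notebook.

```python
import pandas as pd

# Two made-up combined labels in the form primaryTitle + "-" + genres
toy_index = pd.Series(["Heat-Crime,Drama", "Spider-Man-Action"])

# n=1 splits only at the first hyphen; expand=True returns the parts as columns 0 and 1
toy_split = toy_index.str.split("-", n=1, expand=True)
print(toy_split)
```

The first row splits cleanly into a title and its genres, while the second row is cut inside the title, so titles that contain a hyphen may need a different separator when the index is built.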

In [None]:
#joining newly created columns due to Title split to finalData Dataframe
finalData_df = pd.merge(finalData, split_df,right_index=True,left_index=True)
finalData_df.head()

In [None]:
#dropping numerical data columns for previously created titleType features
finalData_df = finalData_df.drop(['Title','titleType_movie','titleType_short','titleType_tvEpisode','titleType_tvSeries','titleType_tvSpecial','titleType_video'],axis=1)

In [None]:
#renaming newly added columns created out of Title split
#column 1 holds the genres part of the index built earlier (primaryTitle + "-" + genres)
finalData_df = finalData_df.rename(columns={0: 'Title', 1: 'genres'})

In [None]:
#rearranging columns in the DataFrame
finalData_df = finalData_df[['Title', 'genres','startYear','numVotes', 'averageRating','isAdult','clusterLabels' ]]
finalData_df.head()

In [None]:
#checking for duplicate rows in finalData_df 
duplicatesData_df = finalData_df[finalData_df.duplicated()]
print("number of duplicate rows: ", duplicatesData_df.shape[0])

In [None]:
#dropping duplicates in finalData_df
finalData_df = finalData_df.drop_duplicates()
finalData_df.shape

In [None]:
checkconcludeduplicates_df = finalData_df[finalData_df.duplicated()]
print("number of duplicate rows: ", checkconcludeduplicates_df.shape[0])

In [None]:
#converting column Title data type to string
finalData_df['Title'] = finalData_df['Title'].astype('string')

In [None]:
finalData_df.dtypes

In [None]:
cluster1=finalData_df[finalData_df['clusterLabels']==0]
cluster2=finalData_df[finalData_df['clusterLabels']==1]
cluster3=finalData_df[finalData_df['clusterLabels']==2]
cluster4=finalData_df[finalData_df['clusterLabels']==3]
cluster5=finalData_df[finalData_df['clusterLabels']==4]
cluster6=finalData_df[finalData_df['clusterLabels']==5]
cluster7=finalData_df[finalData_df['clusterLabels']==6]
cluster8=finalData_df[finalData_df['clusterLabels']==7]
cluster9=finalData_df[finalData_df['clusterLabels']==8]
cluster10=finalData_df[finalData_df['clusterLabels']==9]

In [None]:
print(kmeans_predictor.endpoint_name)

In [None]:
#deleting endpoints
sagemaker.Session().delete_endpoint(kmeans_predictor.endpoint_name)


# Suggesting Movies 

In [None]:
cluster1_movies = cluster1['Title'].to_list()
movie_watcher1 = cluster1_movies[1]
print("Movie Watcher 1 watched: " + movie_watcher1)
print("\n")

cluster2_movies = cluster2['Title'].to_list()
movie_watcher2 = cluster2_movies[1]
print("Movie Watcher 2 watched: " + movie_watcher2)
print("\n")

cluster3_movies = cluster3['Title'].to_list()
movie_watcher3 = cluster3_movies[2]
print("Movie Watcher 3 watched: " + movie_watcher3)
print("\n")

cluster4_movies = cluster4['Title'].to_list()
movie_watcher4 = cluster4_movies[3]
print("Movie Watcher 4 watched: " + movie_watcher4)
print("\n")

cluster5_movies = cluster5['Title'].to_list()
movie_watcher5 = cluster5_movies[4]
print("Movie Watcher 5 watched: " + movie_watcher5)
print("\n")

cluster6_movies = cluster6['Title'].to_list()
movie_watcher6 = cluster6_movies[5]
print("Movie Watcher 6 watched: " + movie_watcher6)
print("\n")

cluster7_movies = cluster7['Title'].to_list()
movie_watcher7 = cluster7_movies[6]
print("Movie Watcher 7 watched: " + movie_watcher7)
print("\n")

cluster8_movies = cluster8['Title'].to_list()
movie_watcher8 = cluster8_movies[7]
print("Movie Watcher 8 watched: " + movie_watcher8)
print("\n")

cluster9_movies = cluster9['Title'].to_list()
movie_watcher9 = cluster9_movies[8]
print("Movie Watcher 9 watched: " + movie_watcher9)
print("\n")

cluster10_movies = cluster10['Title'].to_list()
movie_watcher10 = cluster10_movies[9]
print("Movie Watcher 10 watched: " + movie_watcher10)

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher1).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf1 = finalData_df[finalData_df['Title'] == movie_watcher1]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue1 = watcheddf1['clusterLabels'].iloc[0]
    newdf1 = finalData_df[finalData_df['clusterLabels']==labelvalue1]

    print("You finished watching: " + movie_watcher1 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies1 = newdf1['Title'].to_list()   
    i = 0
    checklistlen1 = len(list_of_movies1)
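    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value1 = randint(0, checklistlen1 - 1)
        #skipping the title that was just watched
        if movie_watcher1 != list_of_movies1[value1]:
            print(list_of_movies1[value1])
        i+=1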
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher2).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf2 = finalData_df[finalData_df['Title'] == movie_watcher2]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue2 = watcheddf2['clusterLabels'].iloc[0]
    newdf2 = finalData_df[finalData_df['clusterLabels']==labelvalue2]

    print("You finished watching: " + movie_watcher2 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies2 = newdf2['Title'].to_list()   
    i = 0
    checklistlen2 = len(list_of_movies2)
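    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value2 = randint(0, checklistlen2 - 1)
        #skipping the title that was just watched
        if movie_watcher2 != list_of_movies2[value2]:
            print(list_of_movies2[value2])
        i+=1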
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher3).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf3 = finalData_df[finalData_df['Title'] == movie_watcher3]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue3 = watcheddf3['clusterLabels'].iloc[0]
    newdf3 = finalData_df[finalData_df['clusterLabels']==labelvalue3]

    print("You finished watching: " + movie_watcher3 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies3 = newdf3['Title'].to_list()   
    i = 0
    checklistlen3 = len(list_of_movies3)
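    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value3 = randint(0, checklistlen3 - 1)
        #skipping the title that was just watched
        if movie_watcher3 != list_of_movies3[value3]:
            print(list_of_movies3[value3])
        i+=1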
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher4).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf4 = finalData_df[finalData_df['Title'] == movie_watcher4]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue4 = watcheddf4['clusterLabels'].iloc[0]
    newdf4 = finalData_df[finalData_df['clusterLabels']==labelvalue4]

    print("You finished watching: " + movie_watcher4 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies4 = newdf4['Title'].to_list()   
    i = 0
    checklistlen4 = len(list_of_movies4)
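    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value4 = randint(0, checklistlen4 - 1)
        #skipping the title that was just watched
        if movie_watcher4 != list_of_movies4[value4]:
            print(list_of_movies4[value4])
        i+=1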
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher5).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf5 = finalData_df[finalData_df['Title'] == movie_watcher5]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue5 = watcheddf5['clusterLabels'].iloc[0]
    newdf5 = finalData_df[finalData_df['clusterLabels']==labelvalue5]

    print("You finished watching: " + movie_watcher5 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies5 = newdf5['Title'].to_list()   
    i = 0
    checklistlen5 = len(list_of_movies5)
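    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value5 = randint(0, checklistlen5 - 1)
        #skipping the title that was just watched
        if movie_watcher5 != list_of_movies5[value5]:
            print(list_of_movies5[value5])
        i+=1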
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher6).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf6 = finalData_df[finalData_df['Title'] == movie_watcher6]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue6 = watcheddf6['clusterLabels'].iloc[0]
    newdf6 = finalData_df[finalData_df['clusterLabels']==labelvalue6]

    print("You finished watching: " + movie_watcher6 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies6 = newdf6['Title'].to_list()   
    i = 0
    checklistlen6 = len(list_of_movies6)
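    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value6 = randint(0, checklistlen6 - 1)
        #skipping the title that was just watched
        if movie_watcher6 != list_of_movies6[value6]:
            print(list_of_movies6[value6])
        i+=1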
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher7).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf7 = finalData_df[finalData_df['Title'] == movie_watcher7]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue7 = watcheddf7['clusterLabels'].iloc[0]
    newdf7 = finalData_df[finalData_df['clusterLabels']==labelvalue7]

    print("You finished watching: " + movie_watcher7 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies7 = newdf7['Title'].to_list()   
    i = 0
    checklistlen7 = len(list_of_movies7)
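    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value7 = randint(0, checklistlen7 - 1)
        #skipping the title that was just watched
        if movie_watcher7 != list_of_movies7[value7]:
            print(list_of_movies7[value7])
        i+=1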
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher8).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf8 = finalData_df[finalData_df['Title'] == movie_watcher8]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue8 = watcheddf8['clusterLabels'].iloc[0]
    newdf8 = finalData_df[finalData_df['clusterLabels']==labelvalue8]

    print("You finished watching: " + movie_watcher8 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies8 = newdf8['Title'].to_list()   
    i = 0
    checklistlen8 = len(list_of_movies8)
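    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value8 = randint(0, checklistlen8 - 1)
        #skipping the title that was just watched
        if movie_watcher8 != list_of_movies8[value8]:
            print(list_of_movies8[value8])
        i+=1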
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher9).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf9 = finalData_df[finalData_df['Title'] == movie_watcher9]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue9 = watcheddf9['clusterLabels'].iloc[0]
    newdf9 = finalData_df[finalData_df['clusterLabels']==labelvalue9]

    print("You finished watching: " + movie_watcher9 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies9 = newdf9['Title'].to_list()   
    i = 0
    checklistlen9 = len(list_of_movies9)
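    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value9 = randint(0, checklistlen9 - 1)
        #skipping the title that was just watched
        if movie_watcher9 != list_of_movies9[value9]:
            print(list_of_movies9[value9])
        i+=1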
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')

In [None]:
#First checking if the movie title exists in the DataFrame. If yes, get its cluster label and suggest titles from the same cluster

if (finalData_df['Title'] == movie_watcher10).any():
    #exact match instead of a regex match, so special characters in a title are safe
    watcheddf10 = finalData_df[finalData_df['Title'] == movie_watcher10]
    #reading the label from the matched row; the index is no longer positional after drop_duplicates
    labelvalue10 = watcheddf10['clusterLabels'].iloc[0]
    newdf10 = finalData_df[finalData_df['clusterLabels']==labelvalue10]

    print("You finished watching: " + movie_watcher10 + ".")
    print("Try one of these titles next.")
    print("\n")
    
    list_of_movies10 = newdf10['Title'].to_list()   
    i = 0
    checklistlen10 = len(list_of_movies10)
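    while i<11:
        #randint includes both endpoints, so the last valid index is len - 1
        value10 = randint(0, checklistlen10 - 1)
        #skipping the title that was just watched
        if movie_watcher10 != list_of_movies10[value10]:
            print(list_of_movies10[value10])
        i+=1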
else:
    print('Sorry, we can not suggest other movies based on the movie title you just entered!')
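
The ten cells above repeat the same lookup-and-suggest steps, so they could be folded into one function. The following is a sketch under assumptions: `suggest_titles`, its parameters, and the toy DataFrame are hypothetical, and sampling without repetition is a design choice that differs slightly from the `randint` loop used above.

```python
import random
import pandas as pd

def suggest_titles(df, watched_title, n=10, seed=None):
    """Return up to n other titles from the same cluster as watched_title."""
    matches = df[df["Title"] == watched_title]
    if matches.empty:
        return []  # unknown title: nothing to suggest
    label = matches["clusterLabels"].iloc[0]
    same_cluster = (df["clusterLabels"] == label) & (df["Title"] != watched_title)
    candidates = df.loc[same_cluster, "Title"].drop_duplicates().tolist()
    rng = random.Random(seed)
    # sample() never repeats a title and never picks an out-of-range index
    return rng.sample(candidates, k=min(n, len(candidates)))

# Toy stand-in for finalData_df (made-up titles and labels)
toy_final = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "clusterLabels": [0, 0, 0, 1, 1],
})

suggestions = suggest_titles(toy_final, "A", n=10, seed=0)
print(suggestions)
```

With a function like this, each watcher needs only one call, for example `suggest_titles(finalData_df, movie_watcher1)`, and an empty list signals that no suggestion could be made.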