# Building a Movie Recommendation System with the Top 5000 Nigerian Movies on IMDb

Nollywood is the third largest movie production industry in the world. It is by far one of Africa's greatest achievements and one of Nigeria's best exports. <br><br>I have always enjoyed Nigerian movies and series. Recently, I thought about how I could help myself and others interested in Nigerian movies get recommendations on what to watch based on what they liked.
<br><br> Of course, one way to do this would be to create accounts on the numerous streaming sites that carry Nigerian content, such as Iroko TV, Netflix and Amazon Prime, and search there, or to run Google searches, which ***might*** not initially give you what you are looking for.
<br><br> I decided to build a Nigerian movie recommendation system, getting data from the Internet Movie Database (IMDb), the largest online database of movie-related information. <br><br> This recommendation system is content-based: it uses attributes like cast, director, synopsis, genre and movie title. It works on the assumption that if a user likes a particular movie, then similar movies will also be of interest.

# Web Scraping the IMDb Website using R

I need to initialize the rpy2 package, which will allow me to use R and Python side by side in the same notebook.

In [None]:
!pip install rpy2==3.5.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


### Installing Packages and Importing Libraries

In [None]:
%%R

suppressWarnings(suppressPackageStartupMessages({
  install.packages("rvest")
  install.packages("tidyverse")
  install.packages("purrr")
  install.packages('skimr')
  install.packages('stringr')
  install.packages('xlsx') 
}))

In [None]:
%%R
install.packages("openxlsx", dependencies=TRUE)

In [None]:
%%R

suppressWarnings(suppressPackageStartupMessages({
  library(rvest)
  library(tidyverse)
  library(purrr)
  library(skimr)
  library(stringr)
  library(openxlsx) 
}))

### Loading the Data

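I will first extract the title, year, genre, certificate, rating and synopsis of the top 5000 movies. Each results page lists 50 movies, so the loop steps through the `start` offsets 1, 51, 101, and so on up to 4951.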

In [None]:
%%R

movies1 = data.frame()

for(page_result in seq(from = 1, to = 4951, by = 50)){
  
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  
  page <- read_html(link)

  df <- page %>% 
  html_nodes(".mode-advanced") %>% 
  map_df(~list(title = html_nodes(.x, '.lister-item-header a') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               year = html_nodes(.x, '.text-muted.unbold') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               genre = html_nodes(.x, '.genre') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               certificate = html_nodes(.x, '.certificate') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               rating = html_nodes(.x, '.ratings-imdb-rating strong') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               synopsis = html_nodes(.x, '.ratings-bar+ .text-muted') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .}))
              

  movies1 = rbind(movies1, df)
  print(paste("Page:", page_result))

}

*I set up print statements to monitor the status of the scraping per page. However, I cleared the output to improve the readability of this notebook because the output was very long.*

In [None]:
%%R 

#Checking the head of the dataframe
head(movies1)

# A tibble: 6 × 6
  title                 year        genre            certificate rating synopsis
  <chr>                 <chr>       <chr>            <chr>       <chr>  <chr>   
1 The Trade             (2023)      "\nCrime, Drama… <NA>        6.3    "\nThis…
2 After Party           (2021)      "\nComedy      … <NA>        7.4    "\nThe …
3 Strangers             (IV) (2022) "\nDrama       … <NA>        9.0    "\nForg…
4 Battle on Buka Street (2022)      "\nComedy      … <NA>        8.8    "\nAfte…
5 Gangs of Lagos        (2023)      "\nCrime       … <NA>        <NA>    <NA>   
6 Shanty Town           (2023– )    "\nAction, Crim… TV-MA       4.7    "\nA gr…


In [None]:
%%R 

#Summary on the dataframe to confirm the data
skim(movies1)

── Data Summary ────────────────────────
                           Values 
Name                       movies1
Number of rows             5000   
Number of columns          6      
_______________________           
Column type frequency:            
  character                6      
________________________          
Group variables            None   

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 title                 0        1        2  84     0     4896          0
2 year                  0        1        0  18    56      256          0
3 genre               247        0.951   16  46     0      180          0
4 certificate        4911        0.0178   1   9     0       10          0
5 rating             4496        0.101    3   4     0       75          0
6 synopsis             83        0.983   20 251     0     1898          0


In [None]:
%%R
write.xlsx(movies1, "movies1.xlsx")

In [None]:
%%R

movies1_copy <- movies1

Now I will get the cast and directors for all the movies. Because this requires opening the full-credits page of every movie, I will not scrape all 5000 at once. Instead, I will break the job into smaller chunks of 1000 movies at a time and then merge the results.

*I set up print statements to monitor the status of the scraping per page. However, I cleared some of the outputs to improve the readability of this notebook because the output was very long.*
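
To make the chunking concrete, here is a small Python sketch of the page offsets each chunk covers. It only reproduces the `start` values used by the R loops (1 to 951, 1001 to 1951, and so on); the helper name `chunk_starts` is my own and is not part of the scraping code.

```python
def chunk_starts(chunk_index, chunk_size=1000, page_size=50):
    """Return the `start` offsets for one chunk of search-result pages.

    chunk_index is 0-based, so chunk 0 covers movies 1-1000.
    (Hypothetical helper, for illustration only.)
    """
    first = chunk_index * chunk_size + 1
    return list(range(first, first + chunk_size, page_size))


# Five chunks of 20 pages each, 50 movies per page.
all_starts = []
for i in range(5):
    starts = chunk_starts(i)
    print(f"chunk {i + 1}: start={starts[0]}..{starts[-1]} ({len(starts)} pages)")
    all_starts.extend(starts)

# Together the chunks cover the same offsets as the single loop used for movies1.
print(all_starts == list(range(1, 4952, 50)))
```

Since the five chunks visit the same pages, in the same order, as the first loop, the rows of the cast and director data should line up with the rows of `movies1`, provided the search results do not change between runs.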

In [None]:
%%R

#Cast and Directors for movies 1 - 1000

get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  cast = movie_page %>% html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>% html_text() %>% paste(collapse = ",")
  directors = movie_page %>% html_nodes("h4:contains('Directed by') + table a") %>% html_text() %>% paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}

movies2 = data.frame()
for(page_result in seq(from = 1, to = 951, by = 50)){
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  movie_links = page %>% html_nodes(".lister-item-header a") %>% html_attr("href") %>% str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies2 = rbind(movies2, df)


print(paste("Page:", page_result))
}

[1] "Page: 1"
[1] "Page: 51"
[1] "Page: 101"
[1] "Page: 151"
[1] "Page: 201"
[1] "Page: 251"
[1] "Page: 301"
[1] "Page: 351"
[1] "Page: 401"
[1] "Page: 451"
[1] "Page: 501"
[1] "Page: 551"
[1] "Page: 601"
[1] "Page: 651"
[1] "Page: 701"
[1] "Page: 751"
[1] "Page: 801"
[1] "Page: 851"
[1] "Page: 901"
[1] "Page: 951"


In [None]:
%%R
head(movies2)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              cast
1                                    

In [None]:
%%R

#Cast and Directors for movies 1001 - 2000

get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  cast = movie_page %>% html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>% html_text() %>% paste(collapse = ",")
  directors = movie_page %>% html_nodes("h4:contains('Directed by') + table a") %>% html_text() %>% paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}

movies3 = data.frame()
for(page_result in seq(from = 1001, to = 1951, by = 50)){
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  movie_links = page %>% html_nodes(".lister-item-header a") %>% html_attr("href") %>% str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies3 = rbind(movies3, df)


print(paste("Page:", page_result))
}

In [None]:
%%R
head(movies3)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         cast
1                                                                                                                                                                                                                                                                                                                                                                                                                                                         

In [None]:
%%R
skim(movies3)

── Data Summary ────────────────────────
                           Values 
Name                       movies3
Number of rows             1000   
Number of columns          2      
_______________________           
Column type frequency:            
  character                2      
________________________          
Group variables            None   

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min  max empty n_unique whitespace
1 cast                  0             1   0 1221    17      971          0
2 directors             0             1   0   66    14      407          0


In [None]:
%%R
skim(movies2)

── Data Summary ────────────────────────
                           Values 
Name                       movies2
Number of rows             1000   
Number of columns          2      
_______________________           
Column type frequency:            
  character                2      
________________________          
Group variables            None   

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min  max empty n_unique whitespace
1 cast                  0             1   0 2644    10      987          0
2 directors             0             1   0  128    13      457          0


In [None]:
%%R

#Cast and Directors for movies 2001 - 3000

get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  cast = movie_page %>% html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>% html_text() %>% paste(collapse = ",")
  directors = movie_page %>% html_nodes("h4:contains('Directed by') + table a") %>% html_text() %>% paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}

movies4 = data.frame()
for(page_result in seq(from = 2001, to = 2951, by = 50)){
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  movie_links = page %>% html_nodes(".lister-item-header a") %>% html_attr("href") %>% str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies4 = rbind(movies4, df)


print(paste("Page:", page_result))
}

In [None]:
%%R
head(movies4)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             cast
1                                                                                                                                                                                                                     

In [None]:
%%R
skim(movies4)

── Data Summary ────────────────────────
                           Values 
Name                       movies4
Number of rows             1000   
Number of columns          2      
_______________________           
Column type frequency:            
  character                2      
________________________          
Group variables            None   

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 cast                  0             1   0 953    33      953          0
2 directors             0             1   0  76    31      430          0


In [None]:
%%R

#Cast and Directors for movies 3001 - 4000

get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  cast = movie_page %>% html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>% html_text() %>% paste(collapse = ",")
  directors = movie_page %>% html_nodes("h4:contains('Directed by') + table a") %>% html_text() %>% paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}

movies5 = data.frame()
for(page_result in seq(from = 3001, to = 3951, by = 50)){
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  movie_links = page %>% html_nodes(".lister-item-header a") %>% html_attr("href") %>% str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies5 = rbind(movies5, df)


print(paste("Page:", page_result))
}

In [None]:
%%R
head(movies5)

                                                                                                                                                                                                                                                                                                                                    cast
1                                                                                                                                                                                                                                                                                        Nkechi First\n, Chizoba Nwokoye\n, Frank Tana\n
2                                                                                                 Shola Abimbola\n, Afeez Abiodun\n, Mojisola Adedeji\n, Lateef Adedimeji\n, Busayo Akinboboye\n, Wasiu Alabi\n, Bolanle Ayoola\n, Kiitan Bukola\n, Funke Igiowo\n, Isaiq Jamiu\n, Mide Funmi Martins\n, Kunle Omisore\n, Oseni Samson\n
3            

In [None]:
%%R
skim(movies5)

── Data Summary ────────────────────────
                           Values 
Name                       movies5
Number of rows             1000   
Number of columns          2      
_______________________           
Column type frequency:            
  character                2      
________________________          
Group variables            None   

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min  max empty n_unique whitespace
1 cast                  0             1   0 1058    33      948          0
2 directors             0             1   0   51    21      457          0


In [None]:
%%R

#Cast and Directors for movies 4001 - 5000

get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  cast = movie_page %>% html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>% html_text() %>% paste(collapse = ",")
  directors = movie_page %>% html_nodes("h4:contains('Directed by') + table a") %>% html_text() %>% paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}

movies6 = data.frame()
for(page_result in seq(from = 4001, to = 4951, by = 50)){
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  movie_links = page %>% html_nodes(".lister-item-header a") %>% html_attr("href") %>% str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies6 = rbind(movies6, df)


print(paste("Page:", page_result))
}

In [None]:
%%R
head(movies6)

                                                                                                                                                                                                                                                                                                                                                                                                                                           cast
1  Emeka Amakeze\n, Queency Asogwa\n, Chidi Chijioke\n, Ijibueze Chuks\n, Malechi Chukwudebe\n, Grace Denny\n, Mike Ezuruonye\n, Jonathan Ganagana\n, Ugo Gbams\n, Jim Iyke\n, Paulinus Magbo\n, Prince Nwafor\n, Emmanuel Obi\n, Gabriel Obi\n, Elochukwu Obinna\n, Chinedu Odinachi\n, Adamma Oforah\n, Chinedu Ogah\n, Love Okapare\n, Mike Okechukwu\n, Augustine Okeke\n, Oge Okoye\n, Oyin Omar\n, Francis Oniah\n, Emmanuel Onyemeziem\n
2                                                                                                                                       

In [None]:
%%R
skim(movies6)

── Data Summary ────────────────────────
                           Values 
Name                       movies6
Number of rows             1000   
Number of columns          2      
_______________________           
Column type frequency:            
  character                2      
________________________          
Group variables            None   

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min  max empty n_unique whitespace
1 cast                  0             1   0 1155    37      930          0
2 directors             0             1   0   49    36      481          0


Making copies of the extracted data, then stacking the five chunks into a single data frame of 5000 rows.

In [None]:
%%R

movies2_copy <- movies2
movies3_copy <- movies3
movies4_copy <- movies4
movies5_copy <- movies5
movies6_copy <- movies6

In [None]:
%%R
movies2_copy = rbind(movies2_copy, movies3_copy)

In [None]:
%%R
movies2_copy = rbind(movies2_copy, movies4_copy)

In [None]:
%%R
movies2_copy = rbind(movies2_copy, movies5_copy)

In [None]:
%%R
movies2_copy = rbind(movies2_copy, movies6_copy)

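Merging it all together. The cast and director columns are bound column-wise to the first data frame, which relies on both data frames having the same 5000 movies in the same order.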

In [None]:
%%R

movies_copy = cbind(movies1_copy, movies2_copy)

In [None]:
%%R

head(movies_copy)

                  title        year                              genre
1             The Trade      (2023)         \nCrime, Drama            
2           After Party      (2021)               \nComedy            
3             Strangers (IV) (2022)                \nDrama            
4 Battle on Buka Street      (2022)               \nComedy            
5        Gangs of Lagos      (2023)                \nCrime            
6           Shanty Town    (2023– ) \nAction, Crime, Drama            
  certificate rating
1        <NA>    6.3
2        <NA>    7.4
3        <NA>    9.0
4        <NA>    8.8
5        <NA>   <NA>
6       TV-MA    4.7
                                                                                                                                                                                                                                  synopsis
1 \nThis is the story of a notoriously cunning kidnapper known only by name, who has ravaged the southern part of Nigeria

In [None]:
%%R

skim(movies_copy)

── Data Summary ────────────────────────
                           Values     
Name                       movies_copy
Number of rows             5000       
Number of columns          8          
_______________________               
Column type frequency:                
  character                8          
________________________              
Group variables            None       

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min  max empty n_unique whitespace
1 title                 0        1        2   84     0     4896          0
2 year                  0        1        0   18    56      256          0
3 genre               247        0.951   16   46     0      180          0
4 certificate        4911        0.0178   1    9     0       10          0
5 rating             4496        0.101    3    4     0       75          0
6 synopsis             83        0.983   20  251     0     1898          0
7

### Data Pre-Processing

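I will now clean the data to remove unwanted elements: the newline characters left over from the HTML, and the commas separating genres, cast members and directors.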

In [None]:
%%R

movies_copy1 <- movies_copy

In [None]:
%%R
movies_copy1$genre <- gsub("\n", "", movies_copy1$genre)
movies_copy1$genre <- gsub(",", "", movies_copy1$genre)
movies_copy1$cast <- gsub("\n", "", movies_copy1$cast)
movies_copy1$cast <- gsub(",", "", movies_copy1$cast)
movies_copy1$directors <- gsub("\n", "", movies_copy1$directors)
movies_copy1$directors <- gsub(",", "", movies_copy1$directors)
movies_copy1$synopsis <- gsub("\n", "", movies_copy1$synopsis)
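
For readers following along in Python, the same clean-up can be sketched with plain string methods. This is an illustrative equivalent of the `gsub` calls above, not code from the notebook; `clean_field` is a hypothetical helper, and the sample strings are copied from the scraped output shown earlier.

```python
def clean_field(value, drop_commas=True):
    """Remove newlines (and optionally commas) from one scraped field.

    Missing values are passed through unchanged, as gsub does with NA.
    (Hypothetical helper, for illustration only.)
    """
    if value is None:
        return None
    cleaned = value.replace("\n", "")
    if drop_commas:
        cleaned = cleaned.replace(",", "")
    return cleaned


raw_cast = "Nkechi First\n, Chizoba Nwokoye\n, Frank Tana\n"
raw_genre = "\nCrime, Drama            "

print(repr(clean_field(raw_cast)))
print(repr(clean_field(raw_genre)))
# The synopsis keeps its commas, so only newlines are removed there.
print(repr(clean_field("\nAfter a lifetime of rivalry, two half-sisters", drop_commas=False)))
```

Two things are worth noticing. The trailing spaces in `genre` survive this step, which matches the padded values in the cleaned output. Also, once the commas are gone, a cast list becomes a plain run of words, so names are later treated as individual tokens rather than as whole names.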

In [None]:
%%R
head(movies_copy1)

                  title        year                          genre certificate
1             The Trade      (2023)        Crime Drama                    <NA>
2           After Party      (2021)             Comedy                    <NA>
3             Strangers (IV) (2022)              Drama                    <NA>
4 Battle on Buka Street      (2022)             Comedy                    <NA>
5        Gangs of Lagos      (2023)              Crime                    <NA>
6           Shanty Town    (2023– ) Action Crime Drama                   TV-MA
  rating
1    6.3
2    7.4
3    9.0
4    8.8
5   <NA>
6    4.7
                                                                                                                                                                                                                                synopsis
1 This is the story of a notoriously cunning kidnapper known only by name, who has ravaged the southern part of Nigeria for over a decade. 'Eric' under

In [None]:
%%R

write.xlsx(movies_copy1, "movies_copy1.xlsx")

I am saving the data in Excel format for storage, and also because I need to merge the contents of the title and year columns into a single column (the `movie_title` column that appears when the file is loaded below).

# Building the Recommendation System using Python

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import difflib

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Loading the Data

In [2]:
movies_data = pd.read_excel("/content/movies_copy1.xlsx")

### Understanding the Data

In [3]:
# number of rows and columns in the data frame

movies_data.shape

(5000, 10)

In [4]:
movies_data.head(n = 10) #First 10 rows

Unnamed: 0,index,title,year,movie_title,genre,certificate,rating,synopsis,cast,directors
0,1,The Trade,2023,The Trade 2023,Crime Drama,,6.3,This is the story of a notoriously cunning kid...,Nengi Adoki Chiwetalu Agu Blossom Chukwujekwu...,Jadesola Osiberu
1,2,After Party,2021,After Party 2021,Comedy,,7.4,The universe comes crashing down on a group of...,Funny Bone Timo Elliott Peggy Henshaw Ufuoma ...,Tope Alake
2,3,Strangers,IV 2022,Strangers IV 2022,Drama,,9.0,Forgotten in a remote village and battling a l...,Musa Abdullahi Femi Adebayo Lateef Adedimeji ...,Biodun Stephen
3,4,Battle on Buka Street,2022,Battle on Buka Street 2022,Comedy,,8.8,"After a lifetime of rivalry, two half-sisters ...",Bimbo Ademoye Funke Akindele Sani Danja Mosho...,Funke Akindele Tobi Makinde
4,5,Gangs of Lagos,2023,Gangs of Lagos 2023,Crime,,,,Demi Banwo Adesua Etomi-Wellington Tobi Bakre...,Jadesola Osiberu
5,6,Shanty Town,2023–,Shanty Town 2023–,Action Crime Drama,TV-MA,4.7,A group of courtesans attempts to escape the g...,Sola Sobowale Nancy Isime Richard Mofe-Damijo...,Dimeji Ajibola
6,7,Daughters,IV 2020,Daughters IV 2020,Drama,,3.7,Sold as a sex slave by her father because of d...,Rotimi Adelegan Agbe Adeyemi Amaka Ashley Ash...,Gbemi Phillips
7,8,Wura,2023–,Wura 2023–,Drama,,9.5,Set against the backdrop of the gold mining in...,Scarlet Gomez Martha Ehinome Ray Adeka Iremid...,Dimeji Ajibola Yemi Morafa Rogers Ofime Adeol...
8,9,Half of a Yellow Sun,2013,Half of a Yellow Sun 2013,Drama Romance,R,6.1,Sisters Olanna and Kainene return home to 1960...,Thandiwe Newton Chiwetel Ejiofor Anika Noni R...,Biyi Bandele
9,10,Brotherhood,I 2022,Brotherhood I 2022,Action Crime,,5.0,After years of fighting to survive on the stre...,Jide Kene Achufusi Adetayo Adebowale Adebowal...,Loukman Ali


In [5]:
movies_data.tail(n = 10) #Last 10 rows

Unnamed: 0,index,title,year,movie_title,genre,certificate,rating,synopsis,cast,directors
4990,4991,Mewa n sele,2006 Video,Mewa n sele 2006 Video,Drama,,,Add a Plot,Ebun Oloyede Idowu Philips Yinka Quadri,
4991,4992,Onye Okoso,2018 TV Movie,Onye Okoso 2018 TV Movie,Family,,,Add a Plot,Ngozi Ezeonu Browny Igboegwu,Bruce Natty
4992,4993,World of Commotion 2,2007 Video,World of Commotion 2 2007 Video,Drama,,,Add a Plot,Nonso Diobi Mike Ezuruonye Uche Jombo Zack Or...,Michael Jaja
4993,4994,Olórí,2007 Video,Olórí 2007 Video,Drama,,,Add a Plot,Idowu Philips Remi Surutu,Alade Aromire
4994,4995,The Alternative,2022,The Alternative 2022,Drama,,,Add a Plot,Aminat Adebayo Wale Adebayo Lateef Adedimeji ...,Adewale Rasaq
4995,4996,The Fish Girl,2016 Video,The Fish Girl 2016 Video,Drama,,,A certain woman has been barren for sometime a...,Don Brymo Uchegbu Regina Daniels Mike Odiachi...,Henry Mgbemele
4996,4997,Who Killed Chief?,2017,Who Killed Chief? 2017,Mystery,,,"When a wealthy man dies from poisoning, one of...",Bolanle Babalola Preach Bassey Ubong David Fr...,Kabat Esosa Egbon
4997,4998,The Bond: A Boy in the Middle,2008 Video,The Bond: A Boy in the Middle 2008 Video,Drama,,,Add a Plot,Stephanie Apel Leo Ekwese Samuel Iheanacho Na...,Tola Balogun
4998,4999,Oga on Top,2013,Oga on Top 2013,Comedy Fantasy,,,Add a Plot,Funke Akindele Roy De Nani Uchenna Nnanna Nke...,Amayo Uzo Philips
4999,5000,The Prince of My Heart 2,2007 Video,The Prince of My Heart 2 2007 Video,Romance,,,Add a Plot,Chika Ike Emeka Ike Omotola Jalade-Ekeinde Vi...,Kalu Anya


In building this recommendation system for Nigerian movies, I am relying on my own observation of how Nigerians choose what to watch: mainly because of the actors, and sometimes the directors, due to their "star power". <br><br> Genre comes next, with the most interest and excitement going to drama, comedy and romance, and finally the synopsis. <br><br> I will be using these four features (cast, directors, genre and synopsis) to build the recommendation system.

In [6]:
movies_data['genre'].value_counts() #Confirming the genre counts


Drama                                2911
Comedy                                285
Romance                               178
Drama Romance                         175
Short Drama                           146
                                     ... 
Short Fantasy Sci-Fi                    1
Fantasy Comedy                          1
Short Drama Fantasy                     1
Action Adventure                        1
Animation Short Drama                   1
Name: genre, Length: 180, dtype: int64

### Building the Model

In [7]:
# selecting the unique features of Movies that Nigerians gravitate to for the recommendation

unique_features = ['cast','directors','genre','synopsis']

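I need to replace all empty and NA values in these features with an empty string. The features will be concatenated into one text per movie, and concatenating a string with a missing value would leave the whole combined text missing for that movie.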

In [8]:
# replacing the null values with an empty string in the unique features

for feature in unique_features:
  movies_data[feature] = movies_data[feature].fillna('')

In [9]:
# combining the unique features per movie

combined_features = movies_data['cast'] +' '+ movies_data['directors'] +' '+ movies_data['genre'] +' '+ movies_data['synopsis']

In [10]:
print(combined_features)

0        Nengi Adoki Chiwetalu Agu Blossom Chukwujekwu...
1        Funny Bone Timo Elliott Peggy Henshaw Ufuoma ...
2        Musa Abdullahi Femi Adebayo Lateef Adedimeji ...
3        Bimbo Ademoye Funke Akindele Sani Danja Mosho...
4        Demi Banwo Adesua Etomi-Wellington Tobi Bakre...
                              ...                        
4995     Don Brymo Uchegbu Regina Daniels Mike Odiachi...
4996     Bolanle Babalola Preach Bassey Ubong David Fr...
4997     Stephanie Apel Leo Ekwese Samuel Iheanacho Na...
4998     Funke Akindele Roy De Nani Uchenna Nnanna Nke...
4999     Chika Ike Emeka Ike Omotola Jalade-Ekeinde Vi...
Length: 5000, dtype: object


I need to convert the combined features into a numerical representation by fitting and transforming them with the TfidfVectorizer. This lets me compute cosine similarity scores, since raw text cannot be used directly for machine learning modelling.

In [11]:
# converting the text data to feature vectors

feature_vectors = TfidfVectorizer().fit_transform(combined_features)
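
To see what `TfidfVectorizer` does to each movie's text, here is a simplified pure-Python sketch. It assumes plain whitespace tokenisation, plus the smoothed IDF formula and L2 normalisation that scikit-learn documents as its defaults. The real vectorizer tokenises with a regular expression and returns a sparse matrix, so treat this as an illustration rather than a replacement. The three short "movies" are toy strings built from names in the dataset.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, each scaled to unit length.

    Simplified illustration: lower-case, split on whitespace, then
    weight = term count * (ln((1 + n_docs) / (1 + doc_freq)) + 1).
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Toy combined-feature strings (cast + genre), for illustration only.
docs = [
    "Funke Akindele Comedy",
    "Funke Akindele Drama",
    "Sola Sobowale Drama",
]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

In the first toy movie, "comedy" gets a higher weight than "funke" because it appears in only one document. This is the property that lets rare but distinctive words, such as an actor's surname, count for more than words shared by many movies.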

In [None]:
print(feature_vectors)

  (0, 7045)	0.08090489307338264
  (0, 19792)	0.08158786051618001
  (0, 9659)	0.09732069677653854
  (0, 3185)	0.0834570404073819
  (0, 7048)	0.10099328640264679
  (0, 19372)	0.08386280956738246
  (0, 4553)	0.12352228818216764
  (0, 11129)	0.08608940604599491
  (0, 3654)	0.09732069677653854
  (0, 19379)	0.05395456406866064
  (0, 10257)	0.08428069301273138
  (0, 20253)	0.12352228818216764
  (0, 6865)	0.07206122606875759
  (0, 5191)	0.11784656495511432
  (0, 15995)	0.0830627028983608
  (0, 7809)	0.04921686050645324
  (0, 13019)	0.06862228639292
  (0, 16301)	0.09276541584159419
  (0, 18623)	0.12352228818216764
  (0, 17052)	0.12352228818216764
  (0, 8592)	0.060928038627861045
  (0, 20899)	0.05163109484462821
  (0, 12757)	0.10598604218690276
  (0, 3794)	0.10790912813732129
  (0, 15281)	0.07153949900291377
  :	:
  (4998, 19915)	0.25486428550049306
  (4998, 7440)	0.23441977509043022
  (4998, 16585)	0.06978175186076865
  (4998, 415)	0.06992016289867006
  (4998, 5160)	0.26501280079241246
  (4998,
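
Each line of the output above is a `(row, column)` position followed by a TF-IDF weight: the row is the movie and the column is a term in the vocabulary. The following is a minimal pure-Python sketch of how such weights arise, assuming scikit-learn's documented defaults (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation). The three toy strings and the helper names `tokenize` and `tfidf_vector` are made up for illustration; this is not the library's internal implementation.

```python
import math
import re
from collections import Counter

# Toy stand-ins for the combined cast/director/genre/synopsis strings (made up).
docs = [
    "Funke Akindele comedy Lagos family",
    "Funke Akindele drama Lagos",
    "Ramsey Nouah thriller Abuja",
]


def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: the number of documents each term appears in.
df = Counter(t for tokens in tokenized for t in set(tokens))

# Smoothed inverse document frequency: rarer terms get larger values.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}


def tfidf_vector(tokens):
    # Term count times IDF, then scale the vector to unit (L2) length.
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


vectors = [tfidf_vector(tokens) for tokens in tokenized]

# Print only the non-zero entries, similar to the sparse output above.
for row, vec in enumerate(vectors):
    for col, weight in enumerate(vec):
        if weight > 0:
            print((row, col), vocabulary[col], round(weight, 4))
```

In this toy data, "comedy" appears in one document and "funke" in two, so "comedy" receives the larger weight in the first row. Terms that occur in fewer movies count for more when movies are compared.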

Now that I have converted the combined features into numerical values, I can determine their similarity scores, that is, how similar each movie is to every other movie.

In [12]:
#Similarity scores using cosine similarity

similarity = cosine_similarity(feature_vectors)

In [13]:
print(similarity)

[[1.         0.02044106 0.04046479 ... 0.01172706 0.         0.01463448]
 [0.02044106 1.         0.00282032 ... 0.         0.01159521 0.        ]
 [0.04046479 0.00282032 1.         ... 0.02370689 0.         0.        ]
 ...
 [0.01172706 0.         0.02370689 ... 1.         0.00757772 0.00884096]
 [0.         0.01159521 0.         ... 0.00757772 1.         0.01138508]
 [0.01463448 0.         0.         ... 0.00884096 0.01138508 1.        ]]


In [14]:
print(similarity.shape)

(5000, 5000)
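
To make the structure of this matrix concrete, here is a small sketch that computes cosine similarity by hand for three toy vectors. The vectors and the helper name `cosine` are assumptions for illustration only; the notebook itself relies on `cosine_similarity` from scikit-learn.

```python
import math


def cosine(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


# Three made-up feature vectors: the first two share a term, the third shares none.
toy_vectors = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

# One row and one column per item, as in the (5000, 5000) matrix above.
toy_similarity = [[cosine(u, v) for v in toy_vectors] for u in toy_vectors]

for row in toy_similarity:
    print([round(s, 3) for s in row])
```

The diagonal is 1 because every item is identical to itself, and the matrix is symmetric because the similarity of A to B equals that of B to A. Both properties are visible in the printed matrix above.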


There are 5000 movies, so the similarity matrix has 5000 rows and 5000 columns: each movie has a similarity score against every movie in the dataset, including itself. <br>Now I will structure the input parameters to take input from the user.

### Setting the Parameters

In [None]:
# To get a movie name from the user

movie_name = input('Enter your favourite movie name : ')  # I will be testing it as I build with a movie called "Wedding Party"

Enter your favourite movie name : wedding party


In [None]:
# creating a list of all the movie titles in the dataset, so that the name I entered can be matched against them.

list_of_all_titles = movies_data['movie_title'].to_list()
print(list_of_all_titles)



In [None]:
# finding close matches for the movie name given by the user. I have lowered the similarity cutoff from the default 0.6 to 0.5, because some movie titles have numbers and special characters attached to them.

find_close_match = difflib.get_close_matches(word = movie_name, possibilities = list_of_all_titles, n = 5, cutoff = 0.5)
print(find_close_match)

['The Wedding Party II 2016', 'Wedding Saga 2019', 'Wedding Plan 2022', 'Leading Lady 2018', 'Wedding Night 2022']


In [None]:
close_match = find_close_match[0]
print(close_match)

The Wedding Party II 2016
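
The effect of the cutoff can be checked in isolation. The sketch below uses three of the titles from the output above and a hypothetical helper, `best_title_match`, which returns `None` instead of raising an `IndexError` when nothing clears the cutoff. It is an illustration under those assumptions, not part of the original notebook.

```python
import difflib

# Three titles taken from the close-match output above.
titles = ["The Wedding Party II 2016", "Wedding Saga 2019", "Leading Lady 2018"]


def best_title_match(query, titles, cutoff=0.5):
    # get_close_matches returns candidates sorted from best to worst match.
    matches = difflib.get_close_matches(query, titles, n=5, cutoff=cutoff)
    return matches[0] if matches else None


# With the default cutoff of 0.6 versus the lowered cutoff of 0.5.
strict = difflib.get_close_matches("wedding party", titles, n=5, cutoff=0.6)
relaxed = difflib.get_close_matches("wedding party", titles, n=5, cutoff=0.5)

print("cutoff 0.6:", strict)
print("cutoff 0.5:", relaxed)
print("best match:", best_title_match("wedding party", titles))
print("no match:", best_title_match("zzzz", titles))
```

`difflib` compares characters case-sensitively, and the titles carry extra words and a year, so a lowercase query such as "wedding party" scores below the default cutoff against these titles.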


In [None]:
# confirming the index of the movie with title

index_of_the_movie = movies_data[movies_data.movie_title == close_match]['index'].values[0]
print(index_of_the_movie)

50


Now I will use the index of the closely matched movie to look up its similarity scores against every movie in the data frame.

In [None]:
# getting a list of similar movies

similarity_score = list(enumerate(similarity[index_of_the_movie]))
print(similarity_score)

[(0, 0.045184039174245925), (1, 0.0), (2, 0.011459690618368568), (3, 0.0), (4, 0.0), (5, 0.029102996363876824), (6, 0.0017492219481488973), (7, 0.0019222393667701666), (8, 0.042838149588462826), (9, 0.010142486139056333), (10, 0.0017099408929409576), (11, 0.017884546222214186), (12, 0.00970231135831608), (13, 0.0), (14, 0.002491054303286952), (15, 0.001831552880402206), (16, 0.0017913909882638968), (17, 0.031051101181087313), (18, 0.05529574588559897), (19, 0.0), (20, 0.0016194813174028927), (21, 0.01704084189901195), (22, 0.003110945023354837), (23, 0.001923583143609195), (24, 0.009000834042332494), (25, 0.0015286401760070875), (26, 0.0), (27, 0.018293062194352238), (28, 0.03210438899085695), (29, 0.03576050406494982), (30, 0.009045205731573594), (31, 0.002018589853849546), (32, 0.017679932656598604), (33, 0.0), (34, 0.023478851126938233), (35, 0.0), (36, 0.02276471334176249), (37, 0.0017548001353092613), (38, 0.0), (39, 0.0449570025097577), (40, 0.011417597415706755), (41, 0.0), (42,

Every movie is listed with its index number and its similarity score with respect to the searched item.

In [None]:
# sorting the movies based on their similarity score

sorted_similar_movies = sorted(similarity_score, key = lambda x:x[1]) 
print(sorted_similar_movies)

[(1, 0.0), (3, 0.0), (4, 0.0), (13, 0.0), (19, 0.0), (26, 0.0), (33, 0.0), (35, 0.0), (38, 0.0), (41, 0.0), (42, 0.0), (48, 0.0), (51, 0.0), (58, 0.0), (59, 0.0), (63, 0.0), (65, 0.0), (72, 0.0), (74, 0.0), (83, 0.0), (85, 0.0), (89, 0.0), (90, 0.0), (92, 0.0), (94, 0.0), (112, 0.0), (115, 0.0), (128, 0.0), (132, 0.0), (142, 0.0), (143, 0.0), (147, 0.0), (149, 0.0), (152, 0.0), (159, 0.0), (160, 0.0), (166, 0.0), (174, 0.0), (182, 0.0), (186, 0.0), (187, 0.0), (197, 0.0), (198, 0.0), (199, 0.0), (204, 0.0), (205, 0.0), (210, 0.0), (215, 0.0), (218, 0.0), (223, 0.0), (224, 0.0), (232, 0.0), (246, 0.0), (247, 0.0), (254, 0.0), (267, 0.0), (283, 0.0), (286, 0.0), (288, 0.0), (299, 0.0), (303, 0.0), (308, 0.0), (312, 0.0), (313, 0.0), (314, 0.0), (315, 0.0), (321, 0.0), (328, 0.0), (330, 0.0), (334, 0.0), (340, 0.0), (342, 0.0), (345, 0.0), (349, 0.0), (352, 0.0), (353, 0.0), (357, 0.0), (358, 0.0), (361, 0.0), (373, 0.0), (376, 0.0), (383, 0.0), (384, 0.0), (387, 0.0), (392, 0.0), (394, 0

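The movies are ordered from the lowest to the highest similarity score. Because `sorted` is ascending by default, the entries at the front of this list are the least similar titles (score 0.0); passing `reverse = True` would place the most similar movies first. What is left now is to print the list for the user.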
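
The difference the sort direction makes is easiest to see on a toy list. The `(index, score)` pairs below are made up for illustration and do not come from the dataset.

```python
# Made-up (index, similarity score) pairs, in the same form as similarity_score.
toy_scores = [(0, 0.05), (1, 0.0), (2, 1.0), (3, 0.31), (4, 0.12)]

# Default behaviour: lowest scores first, so the least similar items lead.
ascending = sorted(toy_scores, key=lambda pair: pair[1])

# reverse=True: highest scores first, so the most similar items lead.
descending = sorted(toy_scores, key=lambda pair: pair[1], reverse=True)

print("ascending :", ascending[:3])
print("descending:", descending[:3])
```

In the descending list the first entry has a score of 1.0, which in a real similarity row would be the searched movie itself, so a recommender would normally skip it.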

In [None]:
# Displaying the similar movies based on the index

print('Recommended Movies : \n')

for i, movie in enumerate(sorted_similar_movies[:5], start = 1):
  index = movie[0]
  title_from_index = movies_data[movies_data.index==index]['movie_title'].values[0]
  print(i, '.',title_from_index)

Recommended Movies : 

1 . After Party 2021
2 . Battle on Buka Street 2022
3 . Gangs of Lagos 2023
4 . Amina 2021
5 . Coming from Insanity 2019


### Putting it all together

I will now put everything together and recommend 20 movies based on the searched item.

In [16]:
movie_name = input('Enter your favourite movie name : ')
list_of_all_titles = movies_data['movie_title'].to_list()
find_close_match = difflib.get_close_matches(word = movie_name, possibilities = list_of_all_titles, n = 5, cutoff = 0.5)
close_match = find_close_match[0]
index_of_the_movie = movies_data[movies_data.movie_title == close_match]['index'].values[0]
similarity_score = list(enumerate(similarity[index_of_the_movie]))
sorted_similar_movies = sorted(similarity_score, key = lambda x:x[1]) 

print('Recommended Movies : \n')

for i, movie in enumerate(sorted_similar_movies[:20], start = 1):
  index = movie[0]
  title_from_index = movies_data[movies_data.index==index]['movie_title'].values[0]
  print(i, '.',title_from_index)


Enter your favourite movie name : wedding party
Recommended Movies : 

1 . After Party 2021
2 . Battle on Buka Street 2022
3 . Gangs of Lagos 2023
4 . Amina 2021
5 . Coming from Insanity 2019
6 . Star Girl 2021
7 . Sugar Rush III 2019
8 . Christmas in Miami 2021
9 . Omo Ghetto: The Saga 2020
10 . Chief Daddy 2018
11 . The Real Housewives of Abuja 2023– 
12 . Charmed 2018
13 . My Village People 2021
14 . Legacy I 2010
15 . Dinner at My Place 2022
16 . Baby Maker 2023
17 . Double Mama 2013 Video
18 . Introducing the Kujus 2020
19 . Leaked 2022
20 . Obsession I 2022
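
The same steps can also be wrapped in a reusable function. The sketch below is a hypothetical variant, not the cell above: `recommend`, the toy titles and the toy similarity matrix are all assumptions for illustration. Unlike the cell above, it sorts in descending order, skips the matched movie itself, and returns an empty result when no title is close enough.

```python
import difflib


def recommend(query, titles, similarity, top_n=20, cutoff=0.5):
    """Return (matched_title, list of up to top_n most similar titles)."""
    matches = difflib.get_close_matches(query, titles, n=1, cutoff=cutoff)
    if not matches:
        return None, []  # nothing cleared the cutoff

    matched_title = matches[0]
    idx = titles.index(matched_title)  # first occurrence if titles repeat

    # Pair every other movie with its score against the matched movie.
    scored = [(i, s) for i, s in enumerate(similarity[idx]) if i != idx]

    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return matched_title, [titles[i] for i, _ in scored[:top_n]]


# Made-up titles and a made-up symmetric similarity matrix.
toy_titles = ["Movie A 2020", "Movie B 2021", "Other Film 2019"]
toy_similarity = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

matched, picks = recommend("Other Film", toy_titles, toy_similarity, top_n=2)
print("Matched:", matched)
for rank, title in enumerate(picks, start=1):
    print(rank, ".", title)
```

With the notebook's data, `titles` would be `list_of_all_titles` and `similarity` the matrix computed earlier, assuming the row order of the matrix matches the order of the titles.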


# Conclusion

Note:
<br> As an avid lover of Nigerian movies, I can confirm that the recommended movies are ones I would enjoy if I liked The Wedding Party. I have actually watched most of the movies on this list.

Challenges:
I would have loved to have a more holistic dataset with more columns, e.g. short notes on the movies instead of the synopsis, and movie tags in addition to the genre.