<a href="https://colab.research.google.com/github/bohuslavska/Study-projects/blob/main/Content_Based_Recommender/Content_Based_Recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#About the project

This project was completed as part of the Machine Learning Engineer course at *robot_dreams*, an online school for programming, analytics, and data science.

The aim of this project was to build a simple content-based recommender system.
While doing this project, I learned how to:

*   preprocess text with the help of the NLTK library
*   vectorize text with the help of CountVectorizer
*   apply cosine similarity

#Data preprocessing 

In [29]:
#Imports

import string
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
pd.set_option('display.max_colwidth', 500)
import gradio as gr

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [30]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [31]:
#Let's read the movies_metadata dataset and choose three columns with text data.

df_meta = pd.read_csv('/content/gdrive/MyDrive/ML_projects/movies_metadata.csv')
df_content = df_meta[['title','tagline', 'overview']]
df_content.head(10)

Unnamed: 0,title,tagline,overview
0,Toy Story,,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."
1,Jumanji,Roll the dice and unleash the excitement!,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures."
2,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for Love.,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max."
3,Waiting to Exhale,Friends are the people who let you be yourself... and never let you forget it.,"Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive ""good man"" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe."
4,Father of the Bride Part II,Just When His World Is Back To Normal... He's In For The Surprise Of His Life!,"Just when George Banks has recovered from his daughter's wedding, he receives the news that she's pregnant ... and that George's wife, Nina, is expecting too. He was planning on selling their home, but that's a plan that -- like George -- will have to change with the arrival of both a grandchild and a kid of his own."
5,Heat,A Los Angeles Crime Saga,"Obsessive master thief, Neil McCauley leads a top-notch crew on various insane heists throughout Los Angeles while a mentally unstable detective, Vincent Hanna pursues him without rest. Each man recognizes and respects the ability and the dedication of the other even though they are aware their cat-and-mouse game may end in violence."
6,Sabrina,You are cordially invited to the most surprising merger of the year.,"An ugly duckling having undergone a remarkable change, still harbors feelings for her crush: a carefree playboy, but not before his business-focused brother has something to say about it."
7,Tom and Huck,The Original Bad Boys.,"A mischievous young boy, Tom Sawyer, witnesses a murder by the deadly Injun Joe. Tom becomes friends with Huckleberry Finn, a boy with no future and no family. Tom has to choose between honoring a friendship or honoring an oath because the town alcoholic is accused of the murder. Tom and Huck go through several adventures trying to retrieve evidence."
8,Sudden Death,Terror goes into overtime.,"International action superstar Jean Claude Van Damme teams with Powers Boothe in a Tension-packed, suspense thriller, set against the back-drop of a Stanley Cup game.Van Damme portrays a father whose daughter is suddenly taken during a championship hockey game. With the captors demanding a billion dollars by game's end, Van Damme frantically sets a plan in motion to rescue his daughter and abort an impending explosion before the final buzzer..."
9,GoldenEye,No limits. No fears. No substitutes.,James Bond must unmask the mysterious head of the Janus Syndicate and prevent the leader from utilizing the GoldenEye weapons system to inflict devastating revenge on Britain.


In [32]:
#Let's check for missing values.

df_content.isnull().sum()

title           6
tagline     25054
overview      954
dtype: int64

In [33]:
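#Let's drop rows with a missing title or overview and fill missing taglines with empty strings.

#.copy() gives us an independent frame, so later in-place changes do not act on a slice of df_content.
filtered_df = df_content[df_content[['overview', 'title']].notna().all(axis=1)].copy()
filtered_df['tagline'] = filtered_df['tagline'].fillna('')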
filtered_df.isnull().sum()

title       0
tagline     0
overview    0
dtype: int64

In [34]:
#Also, let's drop duplicate rows.

print(filtered_df.shape)
filtered_df.drop_duplicates(inplace=True)
print(filtered_df.shape)

(44506, 3)
(44471, 3)


In [35]:
#Now let's create a dataset with two columns ('title' and 'text') and choose 10,000 random samples for convenience (the sample changes between runs unless random_state is set).

filtered_df['text'] =  filtered_df['tagline'] + ' ' + filtered_df['overview']
df_final = pd.concat([filtered_df['title'], filtered_df['text']], axis = 1)

df_final = df_final.sample(n = 10000)
df_final.reset_index(inplace = True, drop = True)

#Text preprocessing and vectorization

In [36]:
#Let's preprocess the text: lowercase, tokenize, lemmatize, then drop punctuation and stop words.

#Build these once instead of on every call.
wordnet_lemmatizer = WordNetLemmatizer()
punctuation = set(string.punctuation)
stop_words = set(stopwords.words('english'))

def text_preprocessing(text_data):
    tknzd_text = word_tokenize(text_data.lower())
    lemma_text = [wordnet_lemmatizer.lemmatize(word) for word in tknzd_text]

    #Only single-character ASCII punctuation tokens are removed here; tokens such as 's or n't are kept.
    no_punct = [i for i in lemma_text if i not in punctuation]
    final = [i for i in no_punct if i not in stop_words]
    return ' '.join(final)

df_final['text'] = df_final['text'].apply(text_preprocessing)
df_final

Unnamed: 0,title,text
0,Still Walking,twelve year beloved eldest son junpei drowned saving stranger 's life kyohei toshiko welcome surviving child home family reunion younger son ryota still feel parent resent n't one died new wife yukari awkwardly meeting rest family first time daughter chinami strain fill uncomfortable pause forced cheer
1,Heidi,heidi orphaned girl initially raised aunt dete maienfeld switzerland order get job frankfurt dete brings 5-year-old heidi grandfather ha odds villager year life seclusion alm first resents heidi 's arrival girl manages penetrate harsh exterior subsequently ha delightful stay best friend young peter goat-herd
2,The Racket,silent film renegade police captain set catch sadistic mob bos
3,Re-cycle,ting-yin young novelist struggling come followup best-selling trilogy romance novel drafting first chapter stop deletes file computer start seeing strange unexplainable thing find experiencing supernatural event described novel-to-be
4,The Sun,sun russian сóлнце solntse 2005 russian biographical film depicting japanese emperor shōwa hirohito final day world war ii film third drama director aleksandr sokurov 's trilogy included taurus soviet union 's vladimir lenin moloch nazi germany 's adolf hitler
...,...,...
9995,The Snowman,lost father 30 year old conspiracy daughter 's journey uncover truth 1978 jimmy graham thirty four year old happily married father two scored dream job operation deepfreeze training american scientist survival skill antarctica left december year three month later arrived back agitated paranoid said ice stumbled onto secret american nuclear site cia given chemical lobotomy keep quiet jimmy rapidly descended schizophrenia behaviour became frightening wife france fled safety taking two child se...
9996,One Man's Journey,dr. eli watt widower come small town considering failure attempt meaningful career new york raise son jimmy well letty baby whose mother ha died childbirth whose father blame watt abandon child watt dream returning research study always something get way epidemic child 's need need generally ungrateful patient passing year doe come find future n't past n't quite failure believed
9997,In God We Teach,story kearny nj high school student secretly recorded history teacher class accused proselytizing “ god teach ” story matthew laclair student kearny nj public high school secretly recorded history teacher david paszkiewicz class accused proselytizing jesus
9998,Special Forces,survival honor sacrifice afghanistan war correspondent elsa casanova taken hostage taliban faced imminent execution special force unit dispatched free world ’ breathtaking yet hostile landscape relentless pursuit begin kidnapper intention letting prey escape group soldier risk life pursuit single aim – bring home alive strong independent woman men duty thrown together forced confront situation great danger inextricably bind – emotionally violently intimately
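
The preprocessed text above still contains tokens such as `'s`, `n't`, `“` and `–`, because the punctuation filter only removes tokens that are exactly one ASCII punctuation character. One possible refinement, which is not part of the original notebook, is to keep only word-like tokens. The sketch below uses only the standard library; `WORD_LIKE` and `keep_word_like` are hypothetical names, and the pattern is an assumption that would also drop abbreviations such as `dr.`.

```python
import re

# Assumed pattern: word characters, optionally joined by single hyphens.
WORD_LIKE = re.compile(r"\w+(?:-\w+)*")

def keep_word_like(tokens):
    """Drop tokens that are not word-like (e.g. 's, n't, quotes, dashes)."""
    return [tok for tok in tokens if WORD_LIKE.fullmatch(tok)]

# Tokens taken from the preprocessed output shown above.
tokens = ["stranger", "'s", "life", "n't", "“", "god", "”", "–", "5-year-old", "goat-herd"]
print(keep_word_like(tokens))
```

Applied inside `text_preprocessing`, a filter like this would replace the `no_punct` step; whether losing tokens like `dr.` is acceptable depends on the data.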


In [37]:
#Now we need to vectorize our text data and compute the pairwise cosine similarity between all films.

count_vect = CountVectorizer()
films_vect = count_vect.fit_transform(df_final['text'])
cosine_sim = cosine_similarity(films_vect, films_vect)
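
To make the two steps above more concrete, here is a minimal pure-Python sketch of the same idea on three toy texts: count the words in each text, then divide the dot product of two count vectors by the product of their lengths. It is a simplification, not what scikit-learn does internally: it splits on whitespace only, while `CountVectorizer` applies its own tokenization, and `count_vector`, `cosine` and the toy texts are made up for illustration.

```python
import math
from collections import Counter

def count_vector(text):
    """Bag-of-words counts for a whitespace-tokenized string."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Toy texts (made up); the first two share the words "toy" and "space".
docs = [
    "toy cowboy space ranger toy",
    "toy space adventure",
    "detective thief heist",
]
vectors = [count_vector(d) for d in docs]

# Pairwise similarity matrix, analogous in shape to cosine_sim above.
sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim:
    print([round(s, 3) for s in row])
```

The first two texts get a score between 0 and 1, and the third scores 0 against both because it shares no words with them.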

#Getting recommendations

In [38]:
#Let's define a function that returns the 10 titles most similar to a given title.

def get_recommendations(title, cosine_sim, dataset):

    #Row position of the first movie with this title (titles are not unique).
    idx = dataset.index[dataset['title'] == title][0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    #Skip the top hit, which is normally the movie itself, and keep the next 10.
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    recom = dataset['title'].iloc[movie_indices]
    return ', '.join(recom.values.tolist())
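
The slice `[1:11]` relies on the queried movie being first in the sorted list. If another movie with a lower index has an identical vector, the stable sort puts that movie first, so the queried movie is not skipped and can show up among its own recommendations. A possible alternative, sketched here in plain Python and not taken from the original notebook, excludes the query by index instead of by position; `top_k_similar` is a hypothetical helper.

```python
def top_k_similar(sim_row, query_idx, k=10):
    """Return the indices of the k highest scores in sim_row, skipping query_idx."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]

# Toy similarity row for the query at index 2; item 0 has identical text.
row = [1.0, 0.2, 1.0, 0.7, 0.0]

# Positional slicing keeps the query itself (index 2) in the result.
by_position = [i for i, _ in sorted(enumerate(row), key=lambda x: x[1], reverse=True)[1:4]]
print(by_position)  # prints [2, 3, 1]

# Excluding by index never returns the query.
print(top_k_similar(row, query_idx=2, k=3))  # prints [0, 3, 1]
```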

In [39]:
#Let's create a dataset with recommendations for every title.

my_recommendations = pd.concat([df_final['title'], df_final['title'].apply(lambda x: get_recommendations(x, cosine_sim, df_final))], axis = 1)
#Both columns are named 'title' after the concat, so assign the names explicitly.
my_recommendations.columns = ['title', 'recommendations']
my_recommendations.head(10)

Unnamed: 0,title,recommendations
0,Still Walking,"The Myth Of Fingerprints, Father and Son, Home Sweet Home, This Christmas, Mr & Mrs Bridge, Holiday Reunion, Jetsons: The Movie, Mission Kashmir, Sweet Nothing, Happy Christmas"
1,Heidi,"Imaginaerum, I Love You, Don't Touch Me!, The Road to Ruin, Memories of My Melancholy Whores, Outrage, First Girl I Loved, The Fruit is Swelling, That Night, Miss Kicki, Heidi"
2,The Racket,"Wise Guys, See This Movie, Intervista, Swimming to Cambodia, Film ist. 7-12, Stage Struck, The Gardener, Faust, Subramaniapuram, American Madness"
3,Re-cycle,"Silk, Azazel, Nim's Island, The Crystal Ball, Beloved, The Mayor of Casterbridge, The Forsyte Saga, Eat, The Ten Lives of Titanics the Cat, Electric Dreams"
4,The Sun,"Education for Death, Intervista, The Lonely Voice of Man, See This Movie, My Way Home, Men Behind the Sun, The Early Years: Erik Nietzsche Part 1, Past Life, Images of a Relief, Faust"
5,The Country Girl,"I Touched All Your Stuff, Wind, Silk, Tears of Stone, The Monkey's Mask, That's Entertainment! III, American: The Bill Hicks Story, Joni's Promise, The Dirty Picture, The Sarnos: A Life in Dirty Movies"
6,A Dangerous Method,"Zandalee, The Music of Chance, The Odd Couple, Still of the Night, Condo Painting, Eminem AKA, Love on Delivery, Epitaph, Princess Aurora, Labyrinth"
7,Close,"A Million, Pulling Strings, Light Gradient, Regret, En rachâchant, Night Watch, The Science of Sleep, Rings, American Justice, Pillow to Post"
8,White Light/Black Rain: The Destruction of Hiroshima and Nagasaki,"The Awful Truth, The Saboteurs, Lorelei: The Witch of the Pacific Ocean, Black Rain, Day One, Copperhead, Memphis Belle, After the Apocalypse, Broken Windows, Looking for Lenny"
9,Captain America,"5 Steps to Danger, Prague, Captain America, Swamp Devil, Silent Trigger, Dead Men Don't Wear Plaid, The Governess, Soylent Green, The Man from Snowy River, HK: Forbidden Super Hero"


In [48]:
list_recomm = get_recommendations(title = 'A Dangerous Method', cosine_sim = cosine_sim, dataset = df_final)
#The titles are joined with ', ', so split on the same separator (a title that itself contains ', ' would be split too).
for i in list_recomm.split(', '):
  print(i)

Zandalee
The Music of Chance
The Odd Couple
Still of the Night
Condo Painting
Eminem AKA
Love on Delivery
Epitaph
Princess Aurora
Labyrinth
