Lumaa AI / ML Coding Challenge

1. Import Libraries & Load Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
df = pd.read_csv("wiki_movie_plots_deduped.csv")
df.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [3]:
df = df[['Title', 'Plot']].dropna()
df = df.sample(n=150, random_state=1).iloc[:100]
df.head()

Unnamed: 0,Title,Plot
32409,Seeta Ramula Kalyanam Lankalo,Chandra Shekhar aka Chandu (Nitin) is a darede...
33991,Iskandar,"Karl Iskandar (Awie), son of Tan Sri Hisham Al..."
25597,First Love Letter,First Love Letter is a Pahlaj Nihalani film in...
1551,Mama Loves Papa,While Wilbur Todd (Charlie Ruggles) is content...
11516,Shipwrecked,"Haakon Haakonsen (Stian Smestad), a young Norw..."


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 32409 to 18242
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Title   100 non-null    object
 1   Plot    100 non-null    object
dtypes: object(2)
memory usage: 2.3+ KB


2. Data Transformation (Vectorization)

In [5]:
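def preprocess_text(data, text_column):
    """Fit a TF-IDF vectorizer on data[text_column]; return the vectorizer and the document-term matrix."""
    # English stop words are dropped so that common function words do not dominate the similarity.
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(data[text_column])
    return vectorizer, tfidf_matrix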
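To make the shape of the TF-IDF output concrete, here is a minimal sketch on a made-up three-sentence corpus. The names toy_plots, toy_vectorizer and toy_matrix are hypothetical and not part of the notebook; only TfidfVectorizer and its default settings come from scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from the dataset), used only to show shapes.
toy_plots = [
    "A detective hunts a killer in the city",
    "A young detective falls in love",
    "Aliens invade the city",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_plots)

# One row per document, one column per term kept after stop-word removal.
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# With the default norm="l2", every row has unit Euclidean length.
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

The matrix is sparse, so it stays small even when the vocabulary grows to tens of thousands of terms.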

3. Recommendation

In [6]:
def get_recommendations(user_input, vectorizer, tfidf_matrix, data, text_column, top_n=5):
    """Return the top_n rows of data most similar to user_input by TF-IDF cosine similarity."""
    # Project the query into the vocabulary learned from the dataset.
    user_tfidf = vectorizer.transform([user_input])
    similarity_scores = cosine_similarity(user_tfidf, tfidf_matrix).flatten()
    # Row positions of the highest scores, best match first.
    top_indices = np.argsort(similarity_scores)[::-1][:top_n]

    recommendations = data.iloc[top_indices].copy()
    recommendations["similarity_score"] = similarity_scores[top_indices]

    return recommendations[['Title', text_column, 'similarity_score']]
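The ranking step can be checked by hand on a small example. The sketch below uses a hypothetical toy corpus and query (not from the notebook) and also shows that, because TfidfVectorizer L2-normalizes rows by default, cosine similarity here equals a plain dot product.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus and query, for illustration only.
toy_plots = [
    "A detective hunts a killer in the city",
    "A young detective falls in love",
    "Aliens invade the city",
]
vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(toy_plots)

query = vec.transform(["detective in the city"])

# Cosine similarity of the query against every document.
scores = cosine_similarity(query, matrix).flatten()

# Rows are unit length, so the dot product gives the same numbers.
dot_scores = (query @ matrix.T).toarray().flatten()

# Document positions ordered from most to least similar.
ranking = np.argsort(scores)[::-1]
print(ranking, scores[ranking])
```

The first document shares both query terms ("detective" and "city"), so it ranks first; the other two share one term each.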


In [7]:
vectorizer, tfidf_matrix = preprocess_text(df, "Plot")

user_query = input("Enter your preference description: ")
recommendations = get_recommendations(user_query, vectorizer, tfidf_matrix, df, "Plot")["Title"]

print("Top Recommendations:")
recommendations

Top Recommendations:


20534     Dancin' thru the Dark
2729        An Angel from Texas
9080                 Sugar Hill
2241          The Last Gangster
6540     The Girl Can't Help It
Name: Title, dtype: object
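The output above lists titles only, so it does not show how strong each match is. If a query shares no terms with the fitted vocabulary, every similarity score is 0 and the top-n selection still returns rows. One possible guard, sketched here with a hypothetical helper top_nonzero that is not part of the notebook, is to drop zero-score rows and show the score next to each title.

```python
import numpy as np
import pandas as pd

def top_nonzero(similarity_scores, titles, top_n=5):
    """Hypothetical helper: top_n titles by score, excluding zero-similarity rows."""
    order = np.argsort(similarity_scores)[::-1][:top_n]
    # Keep only positions whose score is strictly positive.
    order = order[similarity_scores[order] > 0]
    return pd.DataFrame({
        "Title": titles.iloc[order].to_numpy(),
        "similarity_score": similarity_scores[order],
    })

# Made-up scores and titles, for illustration only.
demo_scores = np.array([0.0, 0.4, 0.0, 0.1])
demo_titles = pd.Series(["A", "B", "C", "D"])
demo_result = top_nonzero(demo_scores, demo_titles)
print(demo_result)
```

An empty result then signals that the query had no overlap with the dataset, rather than silently returning unrelated titles.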