# AIML Engineer Take-Home Project


## Exercise Info and Requirements

### Objective
You are tasked with building a recommendation system using a provided public dataset,
focusing on both classical recommendation models and modern LLM-based approaches.


### Project Requirements
Package your application as a service using a suitable framework, e.g. FastAPI or Flask. We
are intentionally not being prescriptive here, but do your best to demonstrate your
understanding of best practices when building your solution.
1. Dataset
  - Use the [Anime Recommendations Database from Kaggle](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database). This dataset contains
information on various anime, including user ratings, genres, and other relevant
attributes.
2. Data Endpoint
  - Your service should have an API endpoint to query the dataset
3. Classical Recommendation System
  - Build a recommendation system that utilizes the dataset to suggest top k anime
for a user based on their viewing history and preferences.
  - Ensure that the recommendation logic excludes recently viewed items (e.g.,
anime watched within the last 7 days).
4. Contextual LLM-Based Personalization
  - Implement a feature where users can get personalized anime recommendations
based on a natural language description of their current mood or preferences
(e.g., "I want something uplifting and adventurous").
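
The recency rule in item 3 can be sketched as a post-filter on a ranked candidate list. Nothing in this notebook shows a timestamp column in the data, so the `history` mapping with its viewing dates, the `filter_recent` name, and the sample IDs and dates are all hypothetical; this is a minimal illustration under those assumptions, not the service's actual logic.

```python
from datetime import datetime, timedelta

def filter_recent(candidates, history, now, window_days=7):
    """Drop candidate anime IDs that the user watched within the last `window_days`.

    `history` maps anime_id -> datetime of the most recent viewing
    (a hypothetical field; the dataset itself is assumed to lack timestamps).
    """
    cutoff = now - timedelta(days=window_days)
    recent = {anime_id for anime_id, watched_at in history.items() if watched_at >= cutoff}
    # Preserve the ranking order of the remaining candidates.
    return [anime_id for anime_id in candidates if anime_id not in recent]

now = datetime(2024, 1, 15)
history = {
    6702: datetime(2024, 1, 12),  # watched 3 days ago: excluded
    3588: datetime(2023, 12, 1),  # watched more than 7 days ago: still eligible
}
ranked = [6702, 3588, 11757]      # candidate IDs, best first
kept = filter_recent(ranked, history, now)
```

Filtering after ranking keeps the model itself unaware of time, so the window length can be changed without retraining.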




### Submission
Submit a single .zip file that includes:
- All source code
- A System Design doc

  - Provide a document (e.g. `SYSTEM_DESIGN.md`) that explains your choices and
the architecture.
  - Discuss how you would extend the current system to make it more accurate in
response to more “vague” user input
  - Include recommendations of how to transition this to an LLM deployed in-house
- A Presentation Slide Deck to present your project for 30 mins during a panel interview
  - Can include information from the System Design doc

## Exercise

In [1]:
# !pip install pandas
# !pip install numpy
# # !pip install seaborn
# !pip install scikit-learn


### First ideas

My first idea is to use `NMF`, but let's first find out the current state of the art (SOTA) for recommender systems.

#### Current state of the art

There is no single best algorithm for recommender systems. Solutions for the Netflix Prize include[^1][^2]:
- Decomposition models (SVD, NMF, SVD++, etc.)
- Restricted Boltzmann Machines (RBM)
- Decision Tree-based methods (Gradient Boosted Decision Trees, etc.)
- Neural Networks
- Support Vector Machines (SVM)

It is also common to blend several models, as in the BellKor Solution to the Netflix Grand Prize[^3]. This makes sense, since combining models in an ensemble tends to improve on any single one.

#### Way to go

Considering this and the scope of this project, I will test simple models.


[^1]: Stephen Gower. [Netflix Prize and SVD](http://buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-gower-netflix-SVD.pdf). April 18, 2014.

[^2]: [Netflix Recommendations: Beyond the 5 stars (Part 2)](https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-2-d9b96aa399f5)

[^3]: [The BellKor Solution to the Netflix Grand Prize](https://www2.seas.gwu.edu/~simhaweb/champalg/cf/papers/KorenBellKor2009.pdf)

Considering this and the scope of this project, I will use the NMF algorithm.
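
As a reminder of what NMF does: it approximates a non-negative user-item matrix `X` by the product `W @ H`, where `W` holds per-user weights over a few latent components and `H` holds per-item weights for the same components. The sketch below is a stand-in written with plain NumPy multiplicative updates (Lee and Seung); it is not the `nmf` helper from `utils`, whose implementation is not shown here. The function name, toy matrix, and parameter values are made up for illustration.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (users x items) as W @ H.

    Uses multiplicative updates for the squared Frobenius reconstruction loss.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = X.shape
    W = rng.random((n_users, n_components)) + eps
    H = rng.random((n_components, n_items)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative and does not increase the loss.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy ratings: rows are users, columns are anime.
# Assumption: 0 stands for "not rated" and is treated as a real value by the loss.
X = np.array([
    [9, 8, 0, 0],
    [8, 9, 0, 1],
    [0, 0, 7, 8],
    [1, 0, 8, 9],
], dtype=float)

W, H = nmf_multiplicative(X, n_components=2)
error = np.linalg.norm(X - W @ H)  # Frobenius reconstruction error
```

Treating missing ratings as zeros is a simplification; it biases the factors toward predicting low scores for unseen items, which matters less when only the ranking of items is used.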

### Feature engineering

Some candidate features for the model:
- User's anime ratings
- User's anime watching history
- User's history by anime type (TV, Movie, OVA, etc.)
- User's history by episode count
- Anime's average rating
- Anime's popularity
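
The two item-level features can be derived directly from the ratings table. The sketch below uses a toy frame with the `user_id`, `anime_id` and `rating` columns; the feature names `avg_rating` and `n_watchers` are illustrative choices, and it follows the Kaggle dataset description in treating a rating of -1 as "watched but not rated".

```python
import pandas as pd

# Toy ratings in the same shape as rating.csv; -1 marks "watched but not rated".
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "anime_id": [20, 30, 20, 30, 20, 40],
    "rating":   [8, -1, 6, 9, 10, 7],
})

# Explicit ratings only, mirroring the notebook's `ratings.rating != -1` filter.
rated = ratings[ratings["rating"] != -1]

item_features = pd.DataFrame({
    # Average over explicit ratings only.
    "avg_rating": rated.groupby("anime_id")["rating"].mean(),
    # Popularity as the number of distinct users who watched, rated or not.
    "n_watchers": ratings.groupby("anime_id")["user_id"].nunique(),
})
```

Counting unrated views toward popularity but not toward the average keeps the -1 placeholder from dragging scores down.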

In [2]:
import numpy as np  # used directly as `np` in the cells below
# import pandas as pd
# import matplotlib.pyplot as plt
# from sklearn.metrics import mean_squared_error
# from sklearn.model_selection import train_test_split
# from sklearn.pipeline import Pipeline
# from sklearn.linear_model import LinearRegression

from utils import *

In [3]:
RATINGS_PATH = 'data/rating.csv'
ANIME_PATH = 'data/anime.csv'
ratings, anime = preprocess_data(RATINGS_PATH, ANIME_PATH)

In [4]:
rated = ratings[ratings.rating != -1]
w1, h1, user_item_matrix = nmf(rated.iloc[0:10000], redo=True)

NMF components (w1 and h1) have been saved to 'data/nmf_components.pkl'
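
The `nmf` helper comes from `utils` and is not shown in this notebook. One plausible way to build the `user_item_matrix` it returns is a pivot of the explicit ratings; the snippet below is a guess at that step on toy data, not the helper's actual code, and the `dense` name and zero-fill are assumptions.

```python
import pandas as pd

rated = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "anime_id": [20, 30, 20, 40],
    "rating":   [8, 7, 6, 10],
})

# Rows are users, columns are anime, cells are ratings; unrated pairs become NaN.
user_item_matrix = rated.pivot_table(index="user_id", columns="anime_id", values="rating")

# NMF needs a fully populated non-negative input, so missing ratings are filled with 0 here.
dense = user_item_matrix.fillna(0)
```

Keeping the NaN version around is useful later: it still distinguishes "not rated" from a real score when already-seen anime have to be excluded.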


In [6]:
user = rated.iloc[0:10000].sample(1)['user_id'].values[0]
rec_anime_id, rec_weight = recommend(user, w1, h1, user_item_matrix)
print(f'{user = }')

user = np.int64(127)


In [12]:
a = [1,2,3,4]
a[:4]

[1, 2, 3, 4]

In [10]:
print(anime[anime['anime_id'].isin(rec_anime_id[:10])][['name', 'anime_id']])
for idx in rec_anime_id[:10]:
    print(idx, anime[anime['anime_id'] == idx]['name'])

                        name  anime_id
288               Fairy Tail      6702
382   Kamisama Hajimemashita     14713
440               Soul Eater      3588
804         Sword Art Online     11757
850             Gakuen Alice        74
904                Special A      3470
1046       Sukitte Ii na yo.     14289
1083           Inu x Boku SS     11013
1266            Shugo Chara!      2923
1566          Kaze no Stigma      1691
6702 288    Fairy Tail
Name: name, dtype: object
2923 1266    Shugo Chara!
Name: name, dtype: object
3588 440    Soul Eater
Name: name, dtype: object
11757 804    Sword Art Online
Name: name, dtype: object
3470 904    Special A
Name: name, dtype: object
14289 1046    Sukitte Ii na yo.
Name: name, dtype: object
74 850    Gakuen Alice
Name: name, dtype: object
1691 1566    Kaze no Stigma
Name: name, dtype: object
11013 1083    Inu x Boku SS
Name: name, dtype: object
14713 382    Kamisama Hajimemashita
Name: name, dtype: object


In [26]:
group_anime = recommend(user, w1, h1, user_item_matrix)
print(user)

123


In [22]:
rated.iloc[0:1000]['user_id'].unique()

array([ 1,  2,  3,  5,  7,  8,  9, 10, 11])

In [27]:
rec_anime_id, rec_weight = top_n_anime(user, group_anime, user_item_matrix, n=5)

In [28]:
user, rec_anime_id, rec_weight
# rec_cols = np.argsort(group_anime)[::-1]
# group_anime[rec_cols] / np.sum(group_anime[rec_cols])

(np.int64(123),
 Index([11061,    20,  8408,  6746, 11737, 22135, 13677, 12189,   895, 12231,
        ...
        10391, 10396, 10397, 10448, 10456, 10464, 10465, 10491, 10495,     1],
       dtype='int64', name='anime_id', length=1717),
 array([0.26428815, 0.20498794, 0.19741301, ..., 0.        , 0.        ,
        0.        ]))

In [34]:
anime.head(10)

| | anime_id | name | genre | type | episodes | rating | members | Action | Adventure | Cars | ... | Slice of Life | Space | Sports | Super Power | Supernatural | Thriller | Unknown | Vampire | Yaoi | Yuri |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 32935 | Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga... | Comedy, Drama, School, Shounen, Sports | TV | 10 | 9.15 | 93351 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 11061 | Hunter x Hunter (2011) | Action, Adventure, Shounen, Super Power | TV | 148 | 9.13 | 425855 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 820 | Ginga Eiyuu Densetsu | Drama, Military, Sci-Fi, Space | OVA | 110 | 9.11 | 80679 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 15335 | Gintama Movie: Kanketsu-hen - Yorozuya yo Eien... | Action, Comedy, Historical, Parody, Samurai, S... | Movie | 1 | 9.1 | 72534 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 15417 | Gintama&#039;: Enchousen | Action, Comedy, Historical, Parody, Samurai, S... | TV | 13 | 9.11 | 81109 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [None]:
rec_cols = np.argsort(group_anime)[::-1]
rec_cols = rec_cols[~np.isin(rec_cols, watched_cols)]

In [None]:
r_user = user_item_matrix.index.get_loc(user)  # user is already a scalar id, so no .values[0]
groups = w1.argmax(axis=1)
user_group = groups[r_user]

In [None]:
group_anime = h1[user_group, :]
print(len(group_anime), len(user_item_matrix.columns))

In [None]:
cols = user_item_matrix.loc[[user]].dropna(axis=1).columns  # [[user]] keeps a one-row DataFrame
# [cols.get_loc(col) for col in cols]
[user_item_matrix.columns.get_loc(col) for col in cols]

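a, b = np.array([2, 4, 6, 8]), np.array([3, 4, 5])  # boolean masks need arrays, not lists
a[~np.isin(a, b)]  # elements of a that are not in b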

In [None]:
user

In [None]:
r_user = user_item_matrix.index.get_loc(user)  # user is already a scalar id, so no .values[0]
watched_anime = user_item_matrix.loc[[user]].dropna(axis=1).columns  # [[user]] keeps a one-row DataFrame
watched_cols = np.array([user_item_matrix.columns.get_loc(col) for col in watched_anime])

rec_cols = np.argsort(group_anime)[::-1]
rec_cols = rec_cols[~np.isin(rec_cols, watched_cols)]

rec_anime_id = user_item_matrix.iloc[r_user, rec_cols].index
rec_weight = group_anime[rec_cols]# user_item_matrix.loc[user, cols[1]]
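
The scratch cells above can be consolidated into one function. The following is a sketch of that logic on toy data, not the `recommend` or `top_n_anime` helpers from `utils`. It assumes that `user_item_matrix` marks unwatched anime with NaN (as the `dropna` calls suggest); the function name `recommend_top_k` and the toy `W` and `H` values are hypothetical.

```python
import numpy as np
import pandas as pd

def recommend_top_k(user, W, H, user_item_matrix, k=5):
    """Return (anime_ids, normalized_weights) from the user's dominant NMF component."""
    row = user_item_matrix.index.get_loc(user)
    group = W[row].argmax()              # component with the largest user weight
    scores = H[group]                    # that component's weight for every anime
    # Assumption: NaN means "not watched"; anything else counts as watched.
    watched = user_item_matrix.iloc[row].notna().to_numpy()
    order = np.argsort(scores)[::-1]     # column positions, best score first
    order = order[~watched[order]][:k]   # drop watched anime, keep the top k
    weights = scores[order]
    total = weights.sum()
    if total > 0:
        weights = weights / total        # normalize so the returned weights sum to 1
    return user_item_matrix.columns[order], weights

user_item_matrix = pd.DataFrame(
    [[9.0, np.nan, np.nan, np.nan],
     [np.nan, np.nan, 8.0, np.nan]],
    index=pd.Index([123, 127], name="user_id"),
    columns=pd.Index([20, 895, 6702, 11061], name="anime_id"),
)
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
H = np.array([[0.5, 0.1, 0.0, 0.4],
              [0.0, 0.3, 0.6, 0.1]])

ids, weights = recommend_top_k(123, W, H, user_item_matrix, k=2)
```

Masking by column position rather than by anime ID avoids the mismatch between weight values and IDs that is easy to introduce when mixing `np.isin` with label-based lookups.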

In [None]:
rec_anime_id[:6], rec_weight[:6] / np.sum(rec_weight[:6])

In [None]:
anime[anime['anime_id'].isin(rec_anime_id[:6])].head()

In [None]:
print(group_anime)
watched = user_item_matrix.loc[[user]].dropna(axis=1).columns
rec = group_anime[~np.isin(user_item_matrix.columns, watched)]  # weights of unwatched anime only
print(rec)

In [None]:
order = np.argsort(group_anime)[::-1]  # column positions sorted by descending weight
rec_anime = user_item_matrix.columns[order]
rec_anime[~np.isin(rec_anime, user_item_matrix.loc[[user]].dropna(axis=1).columns)]

In [None]:
user_item_matrix.loc[[user]].dropna(axis=1).columns  # [[user]] keeps a one-row DataFrame

In [None]:
anime[anime['genre'].str.contains('School')].head()