# Manga, Manhua, Manhwa Recommender System
Just some practice for recommender systems. This notebook is based off [Manga, Manhwa and Manhua Dataset](https://www.kaggle.com/datasets/victorsoeiro/manga-manhwa-and-manhua-dataset) by Victor Soeiro on Kaggle. This practice is my research to finding out how to make good recommender systems and, specifically, in the style I wish to see. Simply looking for ways to reach the best recommendations, relying *more* on my own tastes, wants, and curiosities rather than others deem reputable or what is profitable. 

Dataset is flawed (in my opinion). Some have repeats (like part 1, part 2 or episode 0, episode 1 when the story should be treated as one entry and not several). The website which the information was scrapped off of also do not involve detailed enough tags. If you want to see most of the project, skip to step 2!

In [1]:
!pip install pandas
!pip install matplotlib
!pip install scikit-learn
!pip install scipy
!pip install zarr




[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\noble\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\noble\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\noble\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\noble\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip





## Step 0: Exploratory Data Analysis

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

In [3]:
df.head(10)

Unnamed: 0,title,description,rating,year,tags,cover
0,Salad Days (Tang LiuZang) - Part 2,The second season of Salad Days (Tang LiuZang).,4.7,2021.0,"['BL', 'Manhua', 'Romance', 'Shounen-ai', 'Spo...",https://cdn.anime-planet.com/manga/primary/sal...
1,The Master of Diabolism,As the grandmaster who founded the Demonic Sec...,4.7,2017.0,"['Action', 'Adventure', 'BL', 'Comedy', 'Manhu...",https://cdn.anime-planet.com/manga/primary/the...
2,JoJo's Bizarre Adventure Part 7: Steel Ball Run,"Set in 1890, Steel Ball Run spotlights Gyro Ze...",4.7,2004.0,"['Action', 'Adventure', 'Horror', 'Mystery', '...",https://cdn.anime-planet.com/manga/primary/joj...
3,A Sign of Affection,"Yuki is a typical college student, whose world...",4.7,2019.0,"['Romance', 'Shoujo', 'Slice of Life', 'Disabi...",https://cdn.anime-planet.com/manga/primary/a-s...
4,Moriarty the Patriot,"Before he was Sherlock’s rival, Moriarty fough...",4.7,2016.0,"['Mystery', 'Shounen', 'Detectives', 'England'...",https://cdn.anime-planet.com/manga/primary/mor...
5,Classroom of the Elite (Light Novel),At the prestigious Tokyo Metropolitan Advanced...,4.7,2015.0,"['Comedy', 'Drama', 'Ecchi', 'Harem', 'Light N...",https://cdn.anime-planet.com/manga/primary/cla...
6,Who Made Me a Princess,The beautiful Athanasia was killed at the hand...,4.7,2017.0,"['Fantasy', 'Manhwa', 'Romance', 'Slice of Lif...",https://cdn.anime-planet.com/manga/primary/who...
7,Demon Slayer: Kimetsu no Yaiba,The setting is Taisho era Japan. Tanjirou is a...,4.7,2016.0,"['Action', 'Adventure', 'Comedy', 'Drama', 'Fa...",https://cdn.anime-planet.com/manga/primary/dem...
8,Omniscient Reader (Novel),"One day, our MC finds himself stuck in the wor...",4.7,2018.0,"['Action', 'Adventure', 'Drama', 'Fantasy', 'M...",https://cdn.anime-planet.com/manga/primary/omn...
9,Omniscient Reader,"Back then, Dokja had no idea. He had no idea h...",4.7,2020.0,"['Action', 'Adventure', 'Drama', 'Fantasy', 'M...",https://cdn.anime-planet.com/manga/primary/omn...


We remove cover since we don't need the links to make recommendations.

Initially, I had an issue about the possibility that description could cause confusion in the model. Let's say we use Term Frequency-Inverse Document Frequency (TF-IDF). This would catch words that could cause confusion. If recommendations were based closely on words in descriptions, someone who likes the manga Blue Period may accidentally get the recommendation Sweet Home, two entirely different stories of different styles simply because the main character in the description is a high school student.

But, for now, we leave it to be included. We could find a better way to calculating the weight of description similarity in the later parts.

In [4]:
remove = ['cover']
df = df.drop(columns=remove)
df = df.drop_duplicates(subset='title')

## Step 1: Simple Version
Let's think of this recommender system pretty one-dimensionally. Since we don't have much statistics on how these comics were rated and, in some sense, might not be the most reliable ratings due to demographic differences and possible other factors, demographic filtering might not be the best choice for this dataset.

If we try content based filtering, we might get better results. For the simple version, let's take the features we have narrow down to and do some similarity score processing. I found out there are two different ways of using NLP to understand and extract text/words to calculate for similarity: CountVectorizer and TfidfVectorizer. Let's try them both!

### CountVectorizer Version
For this conversion method, we must create a metadata soup for the method to take things in. While rating is very important, for the sake of this method, we will omit it. After all, if you liked a comic rated 2.1, would you want a recommendation for a comic of a similar scoring? Probably not. 

While year can be really important to finding underlying patterns that the tags cannot, for the sake of this section it will be omitted. This is because comparing year numerically like determining the generation gap between two comics is more valuable than comparing the string value of the year which is what these two vector methods do. We will also address this problem later.

In [5]:
from ast import literal_eval

df['tags'] = df['tags'].apply(literal_eval)

In [6]:
def get_list(x):
    if isinstance(x, list):
        names = x
        if len(names) > 9:
            names = names[:9]
        return names
    return []

def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    
def create_soup(x):
    return ' '.join(x['tags'])

In [7]:
df['tags'] = df['tags'].apply(get_list)
df.head(3)

Unnamed: 0,title,description,rating,year,tags
0,Salad Days (Tang LiuZang) - Part 2,The second season of Salad Days (Tang LiuZang).,4.7,2021.0,"[BL, Manhua, Romance, Shounen-ai, Sports, Webt..."
1,The Master of Diabolism,As the grandmaster who founded the Demonic Sec...,4.7,2017.0,"[Action, Adventure, BL, Comedy, Manhua, Myster..."
2,JoJo's Bizarre Adventure Part 7: Steel Ball Run,"Set in 1890, Steel Ball Run spotlights Gyro Ze...",4.7,2004.0,"[Action, Adventure, Horror, Mystery, Seinen, 1..."


In [8]:
from sklearn.feature_extraction.text import CountVectorizer

df['tags'] = df['tags'].apply(clean_data)
df['soup'] = df.apply(create_soup, axis=1)
count = CountVectorizer(stop_words='english') # good habit to rid possible filler words
count_matrix = count.fit_transform(df['soup'])

In [9]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100)
reduced_matrix = svd.fit_transform(count_matrix)

reduced_similarity = cosine_similarity(reduced_matrix, reduced_matrix)

In [10]:
df = df.reset_index()
indices = pd.Series(df.index, index=df['title'])

In [11]:
def get_simple_recs(title, similarity, amt):
    idx = indices[title] 
    similarity_scores = list(enumerate(similarity[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:amt]
    
    comic_indices = [i[0] for i in similarity_scores]
    scores = [i[1] for i in similarity_scores]
    
    titles = df['title'].iloc[comic_indices].reset_index(drop=True)
    results = pd.DataFrame({'title': titles, 'similarity_score': scores})
    
    return results


In [12]:
get_simple_recs('Beware the Villainess!', reduced_similarity, 10)

Unnamed: 0,title,similarity_score
0,It's Been a While Since the Original Was Finished,1.0
1,I'll Marry Him!,0.99299
2,Tricked into Becoming the Heroine's Stepmother,0.940253
3,Miss Not-So Sidekick,0.940171
4,The Male Lead's Little Lion Daughter,0.939137
5,Another Typical Fantasy Romance,0.938971
6,Villainesses Have More Fun,0.938711
7,The Tyrant's Only Perfumer,0.938711
8,Villainess's Idol,0.938711


### TfidfVectorizer Version
This conversion method we will also omit the rating since we aren't looking for that possibly false way of comparing for similarity. The same thing for year happens hear since we aren't looking at using integer values quite yet. We will have to find another way to create a similarity score based off of all 3 main factors.

In [13]:
def get_list(x):
    if isinstance(x, list):
        names = x
        if len(names) > 12:
            names = names[:12]
        return names
    return []

In [14]:
from ast import literal_eval

df = pd.read_csv('data.csv')
remove = ['cover']
df = df.drop(columns=remove)


df['tags'] = df['tags'].apply(literal_eval)
df['tags'] = df['tags'].apply(get_list)
df['tags'] = df['tags'].apply(clean_data)
df.head(5)

Unnamed: 0,title,description,rating,year,tags
0,Salad Days (Tang LiuZang) - Part 2,The second season of Salad Days (Tang LiuZang).,4.7,2021.0,"[bl, manhua, romance, shounen-ai, sports, webt..."
1,The Master of Diabolism,As the grandmaster who founded the Demonic Sec...,4.7,2017.0,"[action, adventure, bl, comedy, manhua, myster..."
2,JoJo's Bizarre Adventure Part 7: Steel Ball Run,"Set in 1890, Steel Ball Run spotlights Gyro Ze...",4.7,2004.0,"[action, adventure, horror, mystery, seinen, 1..."
3,A Sign of Affection,"Yuki is a typical college student, whose world...",4.7,2019.0,"[romance, shoujo, sliceoflife, disability]"
4,Moriarty the Patriot,"Before he was Sherlock’s rival, Moriarty fough...",4.7,2016.0,"[mystery, shounen, detectives, england, europe..."


In [15]:
df = df.reset_index()
indices = pd.Series(df.index, index=df['title'])

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

df['tags'] = df['tags'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')

tfidf = TfidfVectorizer(stop_words='english')
df['tags'] = df['tags'].fillna('')

tfidf_matrix = tfidf.fit_transform(df['tags'])
tfidf_matrix.shape

(70945, 589)

In [17]:
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import linear_kernel

svd = TruncatedSVD(n_components=100)
reduced_tfidf_matrix = svd.fit_transform(tfidf_matrix)

reduced_similarity = linear_kernel(reduced_tfidf_matrix, reduced_tfidf_matrix)

indices = pd.Series(df.index, index=df['title']).drop_duplicates()

In [18]:
get_simple_recs('Beware the Villainess!', reduced_similarity, 10)

Unnamed: 0,title,similarity_score
0,Here Comes the Silver Spoon!,0.823919
1,The Lady Tames the Swordmaster,0.823919
2,Tricked into Becoming the Heroine's Stepmother,0.796597
3,Beware the Villainess!,0.792703
4,The Evil Cinderella Needs a Villain,0.790021
5,I Woke Up as the Ugly Duckling,0.785855
6,Miss Not-So Sidekick,0.780704
7,It's Been a While Since the Original Was Finished,0.779653
8,Becoming the Dark Hero's Daughter,0.777913


Now, in the end, we get two quite different results. With the level that I am working in, there is not much more I can do. Tags may not be reliable enough and a lot goes into the math of the functions we decide to use when creating the similarity score. A way to better improve this work is to include the rest of the features, namely the description, year, and rating.

## Step 2: Individual Recommender
Now that we kind of get the gist of the most important part of our recommender: the tags, we have the option to upgrade our calculating of scores. Because all four parts, descripion, rating, year, and tags are so different in how they are assessed and used when looking at a piece of media, we have to treat all three pieces differently.

This is how I want it done:
* tags are the most important, it defines taste but weighing too much of it can overfit the data
* good ratings are only important if results reflect taste and poor ratings are not desired
* year is only important for underlying patterns that appear as trends through time
* descriptions can be the factor to differentiating core elements, not styles, between each media

All that would mean, different factors will be weighed differently. In a simple equation, it can look something like this:
$$S=(S_t×w_t)+(S_y×w_y)+(S_r×w_r)+(S_d×w_d)$$

where:
* $S, S_t, S_y, S_r, S_d$ - similarity, tag similarity, year similarity, rating similarity, decription similarity
* $w_t, w_y, w_r, w_d$ - tag weight, year weight, rating weight, decription weight


### Step 2.1: Redefining
Now, because we're doing the recommendation a different way, let's reload the database and process everything over again. This makes sure that we aren't using the old data that could mess anything up. Also this is easier for me to rerun the notebook from this step directly...

In [1]:
def get_list(x):
    if isinstance(x, list):
        names = x
        if len(names) > 9:
            names = names[:9]
        return names
    return []

def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    
def create_soup(x):
    return ' '.join(x['tags'])

In [2]:
from ast import literal_eval
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
remove = ['cover']
df = df.drop(columns=remove)
df = df.drop_duplicates(subset='title')

df['tags'] = df['tags'].apply(literal_eval)
df['tags'] = df['tags'].apply(get_list)
df['tags'] = df['tags'].apply(clean_data)
df.head(5)

Unnamed: 0,title,description,rating,year,tags
0,Salad Days (Tang LiuZang) - Part 2,The second season of Salad Days (Tang LiuZang).,4.7,2021.0,"[bl, manhua, romance, shounen-ai, sports, webt..."
1,The Master of Diabolism,As the grandmaster who founded the Demonic Sec...,4.7,2017.0,"[action, adventure, bl, comedy, manhua, myster..."
2,JoJo's Bizarre Adventure Part 7: Steel Ball Run,"Set in 1890, Steel Ball Run spotlights Gyro Ze...",4.7,2004.0,"[action, adventure, horror, mystery, seinen, 1..."
3,A Sign of Affection,"Yuki is a typical college student, whose world...",4.7,2019.0,"[romance, shoujo, sliceoflife, disability]"
4,Moriarty the Patriot,"Before he was Sherlock’s rival, Moriarty fough...",4.7,2016.0,"[mystery, shounen, detectives, england, europe..."


### Step 2.2: Description Factors
Starting with the first factor, let's use TFIDF which can deal well with importance of each word in a summary since it is equiped for that. Because we're looking at details that describe a plot line, it's better to use this method to calculate the values of what the descriptions can compare to. Similar to step 1, we just create a matrix of the words as values and do the comparison.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.decomposition import TruncatedSVD

tfidf = TfidfVectorizer(stop_words='english')
df['description'] = df['description'].fillna('')
description_matrix = tfidf.fit_transform(df['description']).astype(np.float32)

description_score = linear_kernel(description_matrix, description_matrix).astype(np.float32)

### Step 2.3: Rating Factors
Looking at ratings, we can quickly mention that ratings don't mean everything but it's still a very solid guideline when looking for new things to consume. You would always want to look for better ratings. We can just easily do that with some normalizing of the rating number and let's add that number as a column to save the values. BUT normalizing may leave out the fact that lower ratings are much more not sought out so we need to penalize low ratings and benefit higher ratings in recommending according to rating.

If the recommendation really fits your taste, a lower rating should not matter. But when there are a lot of options, we definitely want recommendations with higher ratings to be recommended. I took it upon myself to come up with something I would like to see and drew some example equations on Desmos. Since I wanted higher ratings from 4-5 to be recommended but lower ratings to be around the same (so that, in my logic, when they compare to each other, the difference is not as large). My equation is
$$\frac{e^{1.34R_{old}}}{811} = R_{new}$$

Lil' awkward but I'm learning and trying things out so for now this is what I've got

In [4]:
def exponent_ratings(x):
    return (np.exp(1.34 * x)) / 811

df['rating_score'] = df['rating'].apply(exponent_ratings).astype(np.float32).fillna(0.0)

### Step 2.4: Year Factors
Well, what a bummer. Honestly did my research but I couldn't find any efficient way to calculate year. Most other people working on recommender systems really focus on tags, keywords, casts, ratings, reviews, and descriptions but nothing really stood out to me to calculate year difference which I think could really help with defining some recommendations. Some people don't like art styles from the 90s while others prefer the format of more recent styles and I just couldn't calculate that.

My plan was plain simple. Create a function to calculate the year difference and produce a value that encourages smaller year differences. The score for year would be calculated that way. But because the nature of my plan was different from what I've learned from tutorials on other features, I couldn't carry over similar transformations for the matrix/array to produce something fast. After multiple attempts from 2 minute runs to 30 minute runs, the code always managed to crash while trying to formulate numbers. It was too hard to fill in every value of the comparison map in 70k entries. Maybe a different database would be easier to deal with. But for now, I will just move on since, after all, this is just experimenting and practicing!

In [5]:
def calculate_year(x):
    n = len(x)
    matrix = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            matrix[i, j] = 1 / (1 + abs(x[i] - x[j]))
    return matrix

year_score = calculate_year(df['year'].values.astype(np.float32))

### Step 2.5: Tags Factors
To be completely honest, I had a really hard time picking between the two methods of calculating the similarity score for tags. In the end, I chose CountVectorizer since I wanted the tags correlation to be more straightforward rather than weighing of each words in each tag since the description does more of that. In hopes that description can cover for what tags cannot, I decided to stick with CountVectorizer. This section is essentially the copy of that code.

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

df['soup'] = df.apply(create_soup, axis=1)
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup']).astype(np.float32)

tags_score = cosine_similarity(count_matrix, count_matrix).astype(np.float32)

### Step 2.6: Combine Factors
Excuse the extreme conversion of all the values near end, there was an issue with my device being unable to run the creation of final_score so I needed to shrink the types. The values I choose to be the weights are quite arbitrary as there is no perfect way I can find to see which values are the best. The process of approving the values is difficult since my thoughts on what should be recommended is very subjective, hence, I cannot say when it is a wrong recommendation or correct. After tinkering a bit, this is what I found to be the best attempt. Also very difficult to say because, on my device, it takes over an hour to run every length of this code hence difficult to redo and retest. If I had a better device, cloud computing, or optimized code, I would probably be able to test and measure all day and night long, but this is what I could come up with.

In [7]:
description_score = description_score.astype(np.float16)
tags_score = tags_score.astype(np.float16)
rating_score = df['rating_score'].values.astype(np.float16)
year_score = year_score.astype(np.float16)
final_score = (0.37 * description_score) +  (0.68 * tags_score) + (0.19 * df['rating_score'].values[np.newaxis, :]) + (0.09 * year_score)

In [10]:
def get_top_recs(titles, similarity, amt):
    all_recommendations = []

    for title in titles:
        idx = indices[title]
        scores = list(enumerate(similarity[idx]))
        scores = sorted(scores, key=lambda x: x[1], reverse=True)
        scores = scores[1:amt+1]  

        recommended_indices = [i[0] for i in scores]
        all_scores = [i[1] for i in scores]

        recommended_titles = df['title'].iloc[recommended_indices].reset_index(drop=True)
        recommendations = pd.DataFrame({'title': recommended_titles, 'scores': all_scores})

        all_recommendations.append(recommendations)

    combined_recommendations = pd.concat(all_recommendations).reset_index(drop=True)
    sorted_recommendations = combined_recommendations.sort_values(by='scores', ascending=False).reset_index(drop=True)
    return sorted_recommendations


In [11]:
titles = ['Who Made Me a Princess', 'Lady Baby', 'Daughter of the Archmage']
indices = pd.Series(df.index, index=df['title'])
top_recs = get_top_recs(titles, final_score, 8)
print(top_recs)

                                                title    scores
0                      Who Made Me a Princess (Novel)  0.949432
1                       Becoming the Villain’s Family  0.742264
2                                    A Chance At Last  0.741970
3                     Flirting with The Villain's Dad  0.737576
4                      The Red Knight Seeks No Reward  0.733174
5                    The Villainess Flips the Script!  0.733067
6                                Miss Not-So Sidekick  0.727947
7   The Story of a Low-Rank Soldier Becoming a Mon...  0.718693
8                       Trump (Chae-Eun LEE) - Part 4  0.710357
9                         Masters of Lightning Knives  0.701580
10                              Tower of God - Part 3  0.698311
11                                       Princess Shu  0.690508
12                                      Kill the Hero  0.687932
13                              The Marriage Business  0.686199
14                                    Bl