## <font color='purple'> Data 255 - Lab 1 - Part 1

<font color='purple'> Part 1: Deep Learning-Based Recommendation (10 Points)

<font color='purple'> Read the paper Wide and Deep Learning for Recommender Systems.

<font color='purple'> Download the files anime-dataset-2023.csv, users-details-2023.csv, and users-score-2023.csv from the following link: https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset

<font color='purple'> Based on the architecture described in the paper, build your own Wide and Deep Recommender system for the Anime Dataset. Your model should learn the features of each user and anime, not just the associated ID numbers. Utilize an 80/20 train-test split and record your model’s prediction accuracy.

#### <font color='blue'> <b>Step 1:</b> Understanding the Wide & Deep Learning Paper
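
As a rough orientation for this step: the paper's joint model adds the logit of a linear "wide" part (over sparse and cross-product features) to the logit of a "deep" feed-forward part (over dense embeddings) and passes the sum through a sigmoid. The NumPy sketch below only illustrates that forward pass with random, untrained weights. All sizes, the `params` layout, and the function name are assumptions made for illustration and are not taken from the training notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_and_deep_forward(x_wide, cat_ids, params):
    """Forward pass: sigmoid(wide logit + deep logit + bias)."""
    # Wide part: linear model over sparse (one-hot / cross-product) features.
    wide_logit = x_wide @ params["w_wide"]
    # Deep part: look up one embedding per categorical id, then apply ReLU layers.
    h = params["embedding"][cat_ids]           # shape (batch, emb_dim)
    for W, b in params["hidden"]:
        h = np.maximum(0.0, h @ W + b)         # ReLU activation
    deep_logit = h @ params["w_deep"]
    return sigmoid(wide_logit + deep_logit + params["bias"])

# Hypothetical sizes, chosen only for illustration.
n_wide, n_categories, emb_dim, hidden_dim = 6, 10, 4, 8
params = {
    "w_wide": rng.normal(size=n_wide),
    "embedding": rng.normal(size=(n_categories, emb_dim)),
    "hidden": [(rng.normal(size=(emb_dim, hidden_dim)), np.zeros(hidden_dim))],
    "w_deep": rng.normal(size=hidden_dim),
    "bias": 0.0,
}

x_wide = rng.integers(0, 2, size=(3, n_wide)).astype(float)  # binary wide features
cat_ids = np.array([1, 4, 7])                                # one categorical id per row
probs = wide_and_deep_forward(x_wide, cat_ids, params)
print(probs.shape)
```

In a real model both parts are trained jointly, so the wide weights only need to capture what the deep part misses.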

#### <font color='blue'> <b>Step 2:</b> Preparing the Dataset

In [1]:
import pandas as pd
import numpy as np
import gc
from sklearn.model_selection import StratifiedShuffleSplit

<font color='blue'> Load the datasets

In [2]:
anime_data = pd.read_csv('anime-dataset-2023.csv')
user_details = pd.read_csv('users-details-2023.csv')
user_scores = pd.read_csv('users-score-2023.csv')

In [3]:
anime_data.head(2)

Unnamed: 0,anime_id,Name,English name,Other name,Score,Genres,Synopsis,Type,Episodes,Aired,...,Studios,Source,Duration,Rating,Rank,Popularity,Favorites,Scored By,Members,Image URL
0,1,Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,8.75,"Action, Award Winning, Sci-Fi","Crime is timeless. By the year 2071, humanity ...",TV,26.0,"Apr 3, 1998 to Apr 24, 1999",...,Sunrise,Original,24 min per ep,R - 17+ (violence & profanity),41.0,43,78525,914193.0,1771505,https://cdn.myanimelist.net/images/anime/4/196...
1,5,Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,8.38,"Action, Sci-Fi","Another day, another bounty—such is the life o...",Movie,1.0,"Sep 1, 2001",...,Bones,Original,1 hr 55 min,R - 17+ (violence & profanity),189.0,602,1448,206248.0,360978,https://cdn.myanimelist.net/images/anime/1439/...


In [4]:
anime_data.columns

Index(['anime_id', 'Name', 'English name', 'Other name', 'Score', 'Genres',
       'Synopsis', 'Type', 'Episodes', 'Aired', 'Premiered', 'Status',
       'Producers', 'Licensors', 'Studios', 'Source', 'Duration', 'Rating',
       'Rank', 'Popularity', 'Favorites', 'Scored By', 'Members', 'Image URL'],
      dtype='object')

<font color='blue'> Drop Irrelevant Columns

In [5]:
anime_data = anime_data.drop(['English name', 'Other name', 'Synopsis', 'Premiered', 'Licensors', 'Scored By', 
                             'Members', 'Image URL'], axis=1)

In [6]:
anime_data.columns

Index(['anime_id', 'Name', 'Score', 'Genres', 'Type', 'Episodes', 'Aired',
       'Status', 'Producers', 'Studios', 'Source', 'Duration', 'Rating',
       'Rank', 'Popularity', 'Favorites'],
      dtype='object')

In [7]:
anime_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   anime_id    24905 non-null  int64 
 1   Name        24905 non-null  object
 2   Score       24905 non-null  object
 3   Genres      24905 non-null  object
 4   Type        24905 non-null  object
 5   Episodes    24905 non-null  object
 6   Aired       24905 non-null  object
 7   Status      24905 non-null  object
 8   Producers   24905 non-null  object
 9   Studios     24905 non-null  object
 10  Source      24905 non-null  object
 11  Duration    24905 non-null  object
 12  Rating      24905 non-null  object
 13  Rank        24905 non-null  object
 14  Popularity  24905 non-null  int64 
 15  Favorites   24905 non-null  int64 
dtypes: int64(3), object(13)
memory usage: 3.0+ MB


<font color='blue'> Almost all columns (13 of 16) have dtype object, including numeric fields such as Score, Episodes, and Rank

In [8]:
anime_data.isna().sum()

anime_id      0
Name          0
Score         0
Genres        0
Type          0
Episodes      0
Aired         0
Status        0
Producers     0
Studios       0
Source        0
Duration      0
Rating        0
Rank          0
Popularity    0
Favorites     0
dtype: int64

<font color='blue'> isna() reports no missing values, but many entries hold the placeholder string "Unknown" (in some letter case) instead

In [9]:
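# Count 'unknown' placeholders per column, case-insensitively (avoids the deprecated applymap)
unknown_counts = anime_data.astype(str).apply(lambda col: col.str.lower() == 'unknown').sum()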
print(unknown_counts)

anime_id          0
Name              0
Score          9213
Genres         4929
Type             74
Episodes        611
Aired             0
Status            0
Producers     13350
Studios       10526
Source         3689
Duration        663
Rating          669
Rank           4612
Popularity        0
Favorites         0
dtype: int64
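
One way to make such placeholders visible to `isna()` from the start is to declare them at load time. The sketch below uses a tiny made-up CSV in place of anime-dataset-2023.csv and assumes the token is spelled `UNKNOWN`, `Unknown`, or `unknown`; the rows and the choice of columns are illustrative only.

```python
import io
import pandas as pd

# Tiny stand-in for anime-dataset-2023.csv (made-up rows, same placeholder token).
csv_text = """anime_id,Name,Score,Episodes,Rank
1,Show A,8.75,26,41
2,Show B,UNKNOWN,UNKNOWN,UNKNOWN
3,Show C,7.10,12,UNKNOWN
"""

# Without na_values the placeholders are plain strings, so isna() sees nothing.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.isna().sum().sum())

# Declaring the token for the numeric columns makes pandas parse it as NaN
# and infer float dtypes directly.
numeric_cols = ["Score", "Episodes", "Rank"]
clean = pd.read_csv(
    io.StringIO(csv_text),
    na_values={col: ["UNKNOWN", "Unknown", "unknown"] for col in numeric_cols},
)
print(clean[numeric_cols].isna().sum())
print(clean.dtypes)
```

Restricting `na_values` to the numeric columns leaves the categorical columns untouched, so "unknown" can still be treated there as a category of its own.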


<font color='blue'> Change the Unknown Values for Numeric Columns to NaN

In [10]:
columns_to_convert = ['Score', 'Episodes', 'Rank']

anime_data[columns_to_convert] = anime_data[columns_to_convert].replace(
    to_replace=['unknown', 'UNKNOWN', 'Unknown'], value=np.nan
)
anime_data[columns_to_convert] = anime_data[columns_to_convert].apply(pd.to_numeric, errors='coerce')

<font color='blue'> Fill the NaN values: mean for Score and Rank, median for Episodes

In [11]:
mean_score = anime_data['Score'].mean()
anime_data['Score'] = anime_data['Score'].fillna(mean_score)

median_episodes = anime_data['Episodes'].median()
anime_data['Episodes'] = anime_data['Episodes'].fillna(median_episodes)

mean_rank = anime_data['Rank'].mean()
anime_data['Rank'] = anime_data['Rank'].fillna(mean_rank)
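
The coercion and imputation steps can also be folded into one small helper. This is a minimal sketch on toy data; `coerce_and_impute` and its `strategies` argument are hypothetical names, not part of pandas. The toy Episodes column shows why the median is the safer choice for a skewed column: one very long series would pull the mean far above a typical value.

```python
import pandas as pd

def coerce_and_impute(df, strategies):
    """Return a copy where each listed column is numeric and NaNs are filled.

    `strategies` maps column name -> "mean" or "median" (hypothetical helper).
    """
    out = df.copy()
    for col, how in strategies.items():
        out[col] = pd.to_numeric(out[col], errors="coerce")  # 'UNKNOWN' -> NaN
        fill = out[col].mean() if how == "mean" else out[col].median()
        out[col] = out[col].fillna(fill)
    return out

toy = pd.DataFrame({
    "Score": ["8.0", "UNKNOWN", "6.0", "7.0"],
    "Episodes": ["12", "1", "UNKNOWN", "1000"],
})
filled = coerce_and_impute(toy, {"Score": "mean", "Episodes": "median"})
print(filled)
```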

In [12]:
anime_data.dtypes

anime_id        int64
Name           object
Score         float64
Genres         object
Type           object
Episodes      float64
Aired          object
Status         object
Producers      object
Studios        object
Source         object
Duration       object
Rating         object
Rank          float64
Popularity      int64
Favorites       int64
dtype: object

In [13]:
# Recount 'unknown' placeholders per column (avoids the deprecated applymap)
unknown_counts = anime_data.astype(str).apply(lambda col: col.str.lower() == 'unknown').sum()
print(unknown_counts)

anime_id          0
Name              0
Score             0
Genres         4929
Type             74
Episodes          0
Aired             0
Status            0
Producers     13350
Studios       10526
Source         3689
Duration        663
Rating          669
Rank              0
Popularity        0
Favorites         0
dtype: int64


<font color='blue'> Split the Aired column into Air Start and Air End (year only)

In [14]:
anime_data[['Air Start', 'Air End']] = anime_data['Aired'].str.split(' to ', expand=True)

anime_data['Air Start'] = pd.to_datetime(anime_data['Air Start'], format='%b %d, %Y', errors='coerce').dt.year
anime_data['Air End'] = pd.to_datetime(anime_data['Air End'], format='%b %d, %Y', errors='coerce').dt.year

anime_data.drop(columns=['Aired'], inplace=True)

<font color='blue'> Fill missing values for Air Start and Air End with the median

In [15]:
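# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
anime_data['Air Start'] = anime_data['Air Start'].fillna(anime_data['Air Start'].median())
anime_data['Air End'] = anime_data['Air End'].fillna(anime_data['Air End'].median())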
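
A side effect of the median fill is that titles with a single air date get an unrelated end year: in the `head(10)` output further down, the movie aired on Sep 1, 2001 ends up with Air End 2013.0. One possible refinement, shown here as a sketch on made-up strings, is to fall back to the start year before applying the median. The fallback rule is an assumption for illustration, not what this notebook does.

```python
import pandas as pd

aired = pd.Series([
    "Apr 3, 1998 to Apr 24, 1999",  # full range
    "Sep 1, 2001",                  # single date (e.g. a movie)
    "Oct 20, 1999 to ?",            # open-ended range
    "Not available",                # no usable date
])

parts = aired.str.split(" to ", n=1, expand=True)
start = pd.to_datetime(parts[0], format="%b %d, %Y", errors="coerce").dt.year
end = pd.to_datetime(parts[1], format="%b %d, %Y", errors="coerce").dt.year

# Assumption: a title with no parseable end date did not end before it started,
# so use the start year first and only then fall back to the median.
end = end.fillna(start)

start = start.fillna(start.median())
end = end.fillna(end.median())
print(pd.DataFrame({"Air Start": start, "Air End": end}))
```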

<font color='blue'> Convert Duration to Minutes

In [16]:
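def convert_duration_to_minutes(duration):
    """Convert strings like '1 hr 55 min' or '24 min per ep' to total minutes.

    Strings with no 'hr' or 'min' part (e.g. 'Unknown') yield 0.
    """
    total_minutes = 0
    if 'hr' in duration:
        # Text before 'hr' is the hour count
        hours = int(duration.split('hr')[0].strip())
        total_minutes += hours * 60

    if 'min' in duration:
        # Last token before 'min' is the minute count
        minutes = int(duration.split('min')[0].split()[-1].strip())
        total_minutes += minutes

    return total_minutes

anime_data['Duration (min)'] = anime_data['Duration'].apply(convert_duration_to_minutes)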

In [17]:
anime_data.drop(columns=['Duration'], inplace=True)

In [18]:
print((anime_data['Duration (min)'] == 0).sum())

1268


In [19]:
# Note: this mean still includes the 0 placeholders that are about to be replaced
mean_duration = anime_data['Duration (min)'].mean()
anime_data['Duration (min)'] = anime_data['Duration (min)'].replace(0, mean_duration)
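
`convert_duration_to_minutes` returns 0 for any string without an `hr` or `min` part. The 1268 zeros against 663 "Unknown" durations suggest that other entries, presumably ones given only in seconds, are zeroed as well, and those zeros also lower the mean used for imputation. A regex-based variant could handle both issues. The sketch below is a hypothetical alternative on made-up strings, not the parser used in this notebook.

```python
import re
import numpy as np
import pandas as pd

_UNIT_MINUTES = {"hr": 60.0, "min": 1.0, "sec": 1.0 / 60.0}
_PATTERN = re.compile(r"(\d+)\s*(hr|min|sec)")

def duration_to_minutes(text):
    """Parse strings such as '1 hr 55 min' or '24 min per ep' into minutes.

    Returns NaN when no hr/min/sec component is found (e.g. 'Unknown').
    """
    matches = _PATTERN.findall(str(text))
    if not matches:
        return np.nan
    return sum(int(value) * _UNIT_MINUTES[unit] for value, unit in matches)

durations = pd.Series(["24 min per ep", "1 hr 55 min", "30 sec", "Unknown"])
minutes = durations.apply(duration_to_minutes)
# Impute only true unknowns, using a mean that is not diluted by placeholders.
minutes = minutes.fillna(minutes.mean())
print(minutes.tolist())
```

Returning NaN instead of 0 keeps "unknown" distinguishable from "very short", so `fillna` touches only the rows that really lack a duration.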

In [20]:
# Recount 'unknown' placeholders per column (avoids the deprecated applymap)
unknown_counts = anime_data.astype(str).apply(lambda col: col.str.lower() == 'unknown').sum()
print(unknown_counts)

anime_id              0
Name                  0
Score                 0
Genres             4929
Type                 74
Episodes              0
Status                0
Producers         13350
Studios           10526
Source             3689
Rating              669
Rank                  0
Popularity            0
Favorites             0
Air Start             0
Air End               0
Duration (min)        0
dtype: int64


In [21]:
anime_data.head(10)

Unnamed: 0,anime_id,Name,Score,Genres,Type,Episodes,Status,Producers,Studios,Source,Rating,Rank,Popularity,Favorites,Air Start,Air End,Duration (min)
0,1,Cowboy Bebop,8.75,"Action, Award Winning, Sci-Fi",TV,26.0,Finished Airing,Bandai Visual,Sunrise,Original,R - 17+ (violence & profanity),41.0,43,78525,1998.0,1999.0,24.0
1,5,Cowboy Bebop: Tengoku no Tobira,8.38,"Action, Sci-Fi",Movie,1.0,Finished Airing,"Sunrise, Bandai Visual",Bones,Original,R - 17+ (violence & profanity),189.0,602,1448,2001.0,2013.0,115.0
2,6,Trigun,8.22,"Action, Adventure, Sci-Fi",TV,26.0,Finished Airing,Victor Entertainment,Madhouse,Manga,PG-13 - Teens 13 or older,328.0,246,15035,1998.0,1998.0,24.0
3,7,Witch Hunter Robin,7.25,"Action, Drama, Mystery, Supernatural",TV,26.0,Finished Airing,"Bandai Visual, Dentsu, Victor Entertainment, T...",Sunrise,Original,PG-13 - Teens 13 or older,2764.0,1795,613,2002.0,2002.0,25.0
4,8,Bouken Ou Beet,6.94,"Adventure, Fantasy, Supernatural",TV,52.0,Finished Airing,"TV Tokyo, Dentsu",Toei Animation,Manga,PG - Children,4240.0,5126,14,2004.0,2005.0,23.0
5,15,Eyeshield 21,7.92,Sports,TV,145.0,Finished Airing,"TV Tokyo, Nihon Ad Systems, TV Tokyo Music, Sh...",Gallop,Manga,PG-13 - Teens 13 or older,688.0,1252,1997,2005.0,2008.0,23.0
6,16,Hachimitsu to Clover,8.0,"Comedy, Drama, Romance",TV,24.0,Finished Airing,"Dentsu, Genco, Fuji TV, Asmik Ace, Shueisha",J.C.Staff,Manga,PG-13 - Teens 13 or older,589.0,862,4136,2005.0,2005.0,23.0
7,17,Hungry Heart: Wild Striker,7.55,"Comedy, Slice of Life, Sports",TV,52.0,Finished Airing,UNKNOWN,Nippon Animation,Manga,PG-13 - Teens 13 or older,1551.0,4212,237,2002.0,2003.0,23.0
8,18,Initial D Fourth Stage,8.16,"Action, Drama",TV,24.0,Finished Airing,"OB Planning, Studio Jack",A.C.G.T.,Manga,PG-13 - Teens 13 or older,393.0,1273,1237,2004.0,2006.0,27.0
9,19,Monster,8.87,"Drama, Mystery, Suspense",TV,74.0,Finished Airing,"VAP, Shogakukan-Shueisha Productions, Nippon T...",Madhouse,Manga,R+ - Mild Nudity,26.0,142,47235,2004.0,2005.0,24.0


<font color='blue'> Preprocessing of "anime-dataset-2023.csv" is complete at this point

In [22]:
user_details.head()

Unnamed: 0,Mal ID,Username,Gender,Birthday,Location,Joined,Days Watched,Mean Score,Watching,Completed,On Hold,Dropped,Plan to Watch,Total Entries,Rewatched,Episodes Watched
0,1,Xinil,Male,1985-03-04T00:00:00+00:00,California,2004-11-05T00:00:00+00:00,142.3,7.37,1.0,233.0,8.0,93.0,64.0,399.0,60.0,8458.0
1,3,Aokaado,Male,,"Oslo, Norway",2004-11-11T00:00:00+00:00,68.6,7.34,23.0,137.0,99.0,44.0,40.0,343.0,15.0,4072.0
2,4,Crystal,Female,,"Melbourne, Australia",2004-11-13T00:00:00+00:00,212.8,6.68,16.0,636.0,303.0,0.0,45.0,1000.0,10.0,12781.0
3,9,Arcane,,,,2004-12-05T00:00:00+00:00,30.0,7.71,5.0,54.0,4.0,3.0,0.0,66.0,0.0,1817.0
4,18,Mad,,,,2005-01-03T00:00:00+00:00,52.0,6.27,1.0,114.0,10.0,5.0,23.0,153.0,42.0,3038.0


In [23]:
user_details.columns

Index(['Mal ID', 'Username', 'Gender', 'Birthday', 'Location', 'Joined',
       'Days Watched', 'Mean Score', 'Watching', 'Completed', 'On Hold',
       'Dropped', 'Plan to Watch', 'Total Entries', 'Rewatched',
       'Episodes Watched'],
      dtype='object')

In [24]:
user_details.shape

(731290, 16)

In [25]:
user_details.isnull().mean() * 100

Mal ID               0.000000
Username             0.000137
Gender              69.316824
Birthday            77.017599
Location            79.104733
Joined               0.000000
Days Watched         0.001094
Mean Score           0.001094
Watching             0.001094
Completed            0.001094
On Hold              0.001094
Dropped              0.001094
Plan to Watch        0.001094
Total Entries        0.001094
Rewatched            0.001094
Episodes Watched     0.001094
dtype: float64

In [26]:
user_details['Location'].nunique()

53285

<font color='blue'> Drop Irrelevant Columns

In [27]:
user_details.drop(columns = ['Birthday', 'Watching', 'On Hold', 'Dropped', 'Joined', 'Rewatched',
                            'Episodes Watched', 'Plan to Watch'], inplace = True)

<font color='blue'> Fill missing values column by column \
    Gender - Fill with Mode \
    Location - Fill with Unknown \
    Username - Fill with Unknown \
    Days Watched - Fill with Median \
    Mean Score - Fill with Median \
    Completed - Fill with Median \
    Total Entries - Fill with Median

In [28]:
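# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
user_details['Gender'] = user_details['Gender'].fillna(user_details['Gender'].mode()[0])
user_details['Location'] = user_details['Location'].fillna('Unknown')
user_details['Username'] = user_details['Username'].fillna('Unknown')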

median_days_watched = user_details['Days Watched'].median()
user_details['Days Watched'] = user_details['Days Watched'].fillna(median_days_watched)

median_mean_score = user_details['Mean Score'].median()
user_details['Mean Score'] = user_details['Mean Score'].fillna(median_mean_score)

median_completed = user_details['Completed'].median()
user_details['Completed'] = user_details['Completed'].fillna(median_completed)

median_total_entries = user_details['Total Entries'].median()
user_details['Total Entries'] = user_details['Total Entries'].fillna(median_total_entries)
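
Gender is missing for roughly 69% of users, so filling with the mode assigns the most frequent gender to all of them. An alternative worth considering is to keep the missingness visible as its own category. The toy comparison below is a sketch of that idea; the `"Unknown"` label is an assumption and not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 3, 4, 9, 18],
    "Gender": ["Male", "Male", "Female", np.nan, np.nan],
})

# Option used in this notebook: fill with the most frequent value.
mode_filled = users["Gender"].fillna(users["Gender"].mode()[0])

# Alternative (assumption, not the notebook's method): keep missingness visible.
flagged = users["Gender"].fillna("Unknown")

print(mode_filled.value_counts().to_dict())
print(flagged.value_counts().to_dict())
```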

In [29]:
user_details.head()

Unnamed: 0,Mal ID,Username,Gender,Location,Days Watched,Mean Score,Completed,Total Entries
0,1,Xinil,Male,California,142.3,7.37,233.0,399.0
1,3,Aokaado,Male,"Oslo, Norway",68.6,7.34,137.0,343.0
2,4,Crystal,Female,"Melbourne, Australia",212.8,6.68,636.0,1000.0
3,9,Arcane,Male,Unknown,30.0,7.71,54.0,66.0
4,18,Mad,Male,Unknown,52.0,6.27,114.0,153.0


In [30]:
user_details.isna().sum()

Mal ID           0
Username         0
Gender           0
Location         0
Days Watched     0
Mean Score       0
Completed        0
Total Entries    0
dtype: int64

In [31]:
# Count 'unknown' placeholders per column (avoids the deprecated applymap)
unknown_counts = user_details.astype(str).apply(lambda col: col.str.lower() == 'unknown').sum()
print(unknown_counts)

Mal ID                0
Username              2
Gender                0
Location         578558
Days Watched          0
Mean Score            0
Completed             0
Total Entries         0
dtype: int64


<font color='blue'> In user_details, the column "Mal ID" is the unique ID of each user. We therefore rename "Mal ID" to "user_id" so that it matches the key in user_scores

In [32]:
user_details.rename(columns={"Mal ID": "user_id"}, inplace=True)

In [33]:
user_scores.head()

Unnamed: 0,user_id,Username,anime_id,Anime Title,rating
0,1,Xinil,21,One Piece,9
1,1,Xinil,48,.hack//Sign,7
2,1,Xinil,320,A Kite,5
3,1,Xinil,49,Aa! Megami-sama!,8
4,1,Xinil,304,Aa! Megami-sama! Movie,8


In [34]:
user_scores.isna().sum()

user_id          0
Username       232
anime_id         0
Anime Title      0
rating           0
dtype: int64

In [35]:
user_scores.drop(columns=['Username'], inplace=True)

In [36]:
user_scores.shape

(24325191, 4)

#### <font color='blue'> <b>Step 3:</b> Merging the Datasets

<font color='blue'> Delete DataFrames that are no longer needed and run garbage collection to release memory

In [37]:
# Attach user features to every rating (inner join on user_id)
user_data = pd.merge(user_scores, user_details, on='user_id')

# Free the inputs that are no longer needed
del user_scores
del user_details
del unknown_counts
gc.collect()

# Attach anime features (inner join on anime_id)
merged_data = pd.merge(user_data, anime_data, on='anime_id')

del user_data
del anime_data
gc.collect()

0
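
Both merges use the default inner join, so ratings whose `user_id` or `anime_id` has no match are dropped silently: the row count falls from 24,325,191 ratings to the 23,803,248 merged rows reported below, a difference of 521,943. A sketch of how such losses could be inspected on toy frames follows; the data is made up, and the check assumes the keys on the right-hand side are unique.

```python
import pandas as pd

scores = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "anime_id": [10, 11, 10, 99],
    "rating": [9, 7, 8, 5],
})
users = pd.DataFrame({"user_id": [1, 2], "Gender": ["Male", "Female"]})
anime = pd.DataFrame({"anime_id": [10, 11], "Type": ["TV", "Movie"]})

# A left merge with indicator=True labels each rating row by where its key was found;
# validate= raises an error if the right-hand key is not unique.
check = scores.merge(users, on="user_id", how="left",
                     validate="many_to_one", indicator=True)
print(check["_merge"].value_counts())

# The inner merges keep only rows whose keys exist on both sides.
merged = scores.merge(users, on="user_id").merge(anime, on="anime_id")
print(len(scores), "->", len(merged))
```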

In [38]:
merged_data.head(2)

Unnamed: 0,user_id,anime_id,Anime Title,rating,Username,Gender,Location,Days Watched,Mean Score,Completed,...,Producers,Studios,Source,Rating,Rank,Popularity,Favorites,Air Start,Air End,Duration (min)
0,1,21,One Piece,9,Xinil,Male,California,142.3,7.37,233.0,...,"Fuji TV, TAP, Shueisha",Toei Animation,Manga,PG-13 - Teens 13 or older,55.0,20,198986,1999.0,2013.0,24.0
1,20,21,One Piece,9,vondur,Male,"Bergen, Norway",73.1,8.06,94.0,...,"Fuji TV, TAP, Shueisha",Toei Animation,Manga,PG-13 - Teens 13 or older,55.0,20,198986,1999.0,2013.0,24.0


In [39]:
merged_data.shape

(23803248, 27)

In [40]:
merged_data.columns

Index(['user_id', 'anime_id', 'Anime Title', 'rating', 'Username', 'Gender',
       'Location', 'Days Watched', 'Mean Score', 'Completed', 'Total Entries',
       'Name', 'Score', 'Genres', 'Type', 'Episodes', 'Status', 'Producers',
       'Studios', 'Source', 'Rating', 'Rank', 'Popularity', 'Favorites',
       'Air Start', 'Air End', 'Duration (min)'],
      dtype='object')

In [41]:
merged_data.isna().sum()

user_id           0
anime_id          0
Anime Title       0
rating            0
Username          0
Gender            0
Location          0
Days Watched      0
Mean Score        0
Completed         0
Total Entries     0
Name              0
Score             0
Genres            0
Type              0
Episodes          0
Status            0
Producers         0
Studios           0
Source            0
Rating            0
Rank              0
Popularity        0
Favorites         0
Air Start         0
Air End           0
Duration (min)    0
dtype: int64

In [42]:
merged_data.drop(columns = ['Location'], inplace = True)

In [43]:
merged_data.columns

Index(['user_id', 'anime_id', 'Anime Title', 'rating', 'Username', 'Gender',
       'Days Watched', 'Mean Score', 'Completed', 'Total Entries', 'Name',
       'Score', 'Genres', 'Type', 'Episodes', 'Status', 'Producers', 'Studios',
       'Source', 'Rating', 'Rank', 'Popularity', 'Favorites', 'Air Start',
       'Air End', 'Duration (min)'],
      dtype='object')

In [44]:
merged_data.dtypes

user_id             int64
anime_id            int64
Anime Title        object
rating              int64
Username           object
Gender             object
Days Watched      float64
Mean Score        float64
Completed         float64
Total Entries     float64
Name               object
Score             float64
Genres             object
Type               object
Episodes          float64
Status             object
Producers          object
Studios            object
Source             object
Rating             object
Rank              float64
Popularity          int64
Favorites           int64
Air Start         float64
Air End           float64
Duration (min)    float64
dtype: object

<font color='blue'> Perform Stratified Sampling using a combination of Gender and Type

In [45]:
merged_data['Type_Gender'] = merged_data['Type'].astype(str) + '_' + merged_data['Gender'].astype(str)

print(merged_data['Type_Gender'].value_counts())

Type_Gender
TV_Male               12613597
TV_Female              3644652
Movie_Male             2472211
OVA_Male               1748390
Special_Male           1110570
Movie_Female            785542
OVA_Female              503125
Special_Female          336095
ONA_Male                287630
ONA_Female              105359
TV_Non-Binary            72276
Music_Male               52390
Music_Female             28941
Movie_Non-Binary         18525
OVA_Non-Binary           11468
Special_Non-Binary        7860
ONA_Non-Binary            3668
Music_Non-Binary           945
UNKNOWN_Male                 2
UNKNOWN_Female               2
Name: count, dtype: int64


In [46]:
merged_data.shape

(23803248, 27)

In [47]:
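def stratified_sample(data, strat_col, frac=0.5, random_state=42):
    """Return a stratified random subset holding `frac` of the rows of `data`."""
    # Note: this converts `strat_col` to category dtype on the caller's DataFrame
    data[strat_col] = data[strat_col].astype('category')
    split = StratifiedShuffleSplit(n_splits=1, test_size=1-frac, random_state=random_state)
    # n_splits=1, so the generator yields a single (train, test) index pair
    train_index, _ = next(split.split(data, data[strat_col]))
    return data.iloc[train_index]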

<font color='blue'> Print the Distribution for the Stratified Splits

In [48]:
data_sampled = stratified_sample(merged_data, strat_col='Type_Gender', frac=0.5, random_state=42)
data_sampled2 = merged_data[~merged_data.index.isin(data_sampled.index)]

print("Original distribution:")
print(merged_data['Type'].value_counts(normalize=True))
print("\nSampled distribution:")
print(data_sampled['Type'].value_counts(normalize=True))

print("\nOriginal Gender distribution:")
print(merged_data['Gender'].value_counts(normalize=True).head())
print("\nSampled Gender distribution:")
print(data_sampled['Gender'].value_counts(normalize=True).head())

Original distribution:
Type
TV         6.860629e-01
Movie      1.376400e-01
OVA        9.507035e-02
Special    6.110616e-02
ONA        1.666399e-02
Music      3.456503e-03
UNKNOWN    1.680443e-07
Name: proportion, dtype: float64

Sampled distribution:
Type
TV         6.860628e-01
Movie      1.376400e-01
OVA        9.507039e-02
Special    6.110620e-02
ONA        1.666394e-02
Music      3.456419e-03
UNKNOWN    1.680443e-07
Name: proportion, dtype: float64

Original Gender distribution:
Gender
Male          0.768164
Female        0.227016
Non-Binary    0.004820
Name: proportion, dtype: float64

Sampled Gender distribution:
Gender
Male          0.768164
Female        0.227016
Non-Binary    0.004820
Name: proportion, dtype: float64


In [49]:
merged_data.shape

(23803248, 27)

In [50]:
data_sampled.shape

(11901624, 27)

In [51]:
data_sampled2.shape

(11901624, 27)
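
The same kind of split can be sketched with pandas alone: sample half of each stratum with `groupby(...).sample`, then take the remaining rows as the second half. The toy frame below is made up, and this is an alternative illustration rather than the method used above. It relies on the index being unique, as it is for a freshly merged frame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Type_Gender": rng.choice(["TV_Male", "TV_Female", "Movie_Male"],
                              size=1000, p=[0.6, 0.25, 0.15]),
    "rating": rng.integers(1, 11, size=1000),
})

# Draw half of each stratum, then take everything else as the second half.
half_a = toy.groupby("Type_Gender", group_keys=False).sample(frac=0.5, random_state=42)
half_b = toy.drop(half_a.index)

print(toy["Type_Gender"].value_counts(normalize=True).round(3))
print(half_a["Type_Gender"].value_counts(normalize=True).round(3))
```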

<font color='blue'> Save the two halves as pickle files for use in training

In [52]:
data_sampled.to_pickle('data_sampled.pkl')

In [53]:
data_sampled2.to_pickle('data_sampled2.pkl')
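
With roughly 11.9 million rows per half, shrinking dtypes before pickling may reduce both file size and memory use in the training notebook. The helper below is a sketch on toy data; `shrink` is a hypothetical name, and casting to float32 assumes that this precision is sufficient for the features involved.

```python
import numpy as np
import pandas as pd

def shrink(df):
    """Return a copy with smaller dtypes (hypothetical helper).

    float64 -> float32, integers -> smallest fitting integer type,
    low-cardinality object columns -> category.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            out[col] = out[col].astype("float32")
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif out[col].dtype == object and out[col].nunique() < 0.5 * len(out):
            out[col] = out[col].astype("category")
    return out

toy = pd.DataFrame({
    "user_id": np.arange(10_000, dtype="int64"),
    "Score": np.linspace(1, 10, 10_000),
    "Type": np.tile(["TV", "Movie", "OVA", "ONA"], 2_500),
})
small = shrink(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(before, "->", after)
```

Pickle preserves dtypes, so a frame shrunk this way comes back in the same compact form when loaded with `pd.read_pickle`.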

#### <font color='blue'> Training continues in a separate notebook (.ipynb)