# Recommendation System

## Objective:
* The objective of this assignment is to implement a recommendation system using cosine similarity on an anime dataset.

## Dataset:
* Use the Anime Dataset which contains information about various anime, including their titles, genres,No.of episodes and user ratings etc.

## Tasks:

Data Preprocessing:

* Load the dataset into a suitable data structure (e.g., pandas DataFrame).
Handle missing values, if any.
* Explore the dataset to understand its structure and attributes.

Feature Extraction:

* Decide on the features that will be used for computing similarity (e.g., genres, user ratings).
* Convert categorical features into numerical representations if necessary.
Normalize numerical features if required.

Recommendation System:

* Design a function to recommend anime based on cosine similarity.
* Given a target anime, recommend a list of similar anime based on cosine similarity scores.
* Experiment with different threshold values for similarity scores to adjust the recommendation list size.

Evaluation:

* Split the dataset into training and testing sets.
* Evaluate the recommendation system using appropriate metrics such as precision, recall, and F1-score.
* Analyze the performance of the recommendation system and identify areas of improvement.

Interview Questions:
1. Can you explain the difference between user-based and item-based collaborative filtering?
2. What is collaborative filtering, and how does it work?

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings('ignore')

## 1. Data Preprocessing

In [31]:
# Load the dataset
anime_df = pd.read_csv('/content/anime.csv')
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [6]:
print(anime_df.shape)

(12017, 7)


In [38]:
print(anime_df.columns)

Index(['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members'], dtype='object')


Handle Missing Values

In [86]:
# Handle missing values in 'rating' by filling with the median
anime_df['rating'].fillna(anime_df['rating'].median(), inplace=True)

# Handle missing values in 'genre' by filling with 'Unknown'
anime_df['genre'].fillna('Unknown', inplace=True)

# Handle 'episodes' column: Replace 'Unknown' or '?' with 0 and convert to numeric
anime_df['episodes'] = anime_df['episodes'].replace({'Unknown': 0, '?': 0}).astype(int)


In [87]:
print(anime_df.info())  # View structure
print(anime_df.describe())  # View summary statistics

<class 'pandas.core.frame.DataFrame'>
Index: 12017 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12017 non-null  int64  
 1   name      12017 non-null  object 
 2   genre     12017 non-null  object 
 3   type      12017 non-null  object 
 4   episodes  12017 non-null  int64  
 5   rating    12017 non-null  float64
 6   members   12017 non-null  float64
dtypes: float64(2), int64(2), object(3)
memory usage: 751.1+ KB
None
           anime_id      episodes        rating       members
count  12017.000000  12017.000000  1.201700e+04  1.201700e+04
mean   13638.001165     12.292419  3.784200e-17  5.676300e-17
std    11231.076675     46.754770  1.000042e+00  1.000042e+00
min        1.000000      0.000000 -4.696423e+00 -3.311688e-01
25%     3391.000000      1.000000 -5.745810e-01 -3.273219e-01
50%     9959.000000      2.000000  8.960208e-02 -3.033560e-01
75%    23729.000000     12.000000  6.854134

## 2: Feature Extraction

In [88]:
# Convert 'genre' column into one-hot encoded features
genres_one_hot = anime_df['genre'].str.get_dummies(sep=', ')

# Concatenate the one-hot encoded genres with the original dataframe
anime_df_clean = pd.concat([anime_df, genres_one_hot], axis=1)

# Drop unnecessary columns ('anime_id', 'name', 'genre', 'type')
anime_df_clean = anime_df_clean.drop(columns=['anime_id', 'name', 'genre', 'type'])

# Normalize numerical features (rating and members)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
anime_df_clean[['rating', 'members']] = scaler.fit_transform(anime_df_clean[['rating', 'members']])

print(anime_df_clean.head())


   episodes    rating   members  Action  Adventure  Cars  Comedy  Dementia  \
0         1  0.924370  0.197867       0          0     0       0         0   
1        64  0.911164  0.782769       1          1     0       0         0   
2        51  0.909964  0.112683       1          0     0       1         0   
3        24  0.900360  0.664323       0          0     0       0         0   
4        51  0.899160  0.149180       1          0     0       1         0   

   Demons  Drama  ...  Shounen Ai  Slice of Life  Space  Sports  Super Power  \
0       0      1  ...           0              0      0       0            0   
1       0      1  ...           0              0      0       0            0   
2       0      0  ...           0              0      0       0            0   
3       0      0  ...           0              0      0       0            0   
4       0      0  ...           0              0      0       0            0   

   Supernatural  Thriller  Vampire  Yaoi  Yuri  
0

## 3. Recommendation System using Cosine Similarity

In [90]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Function to recommend anime based on a given anime title
def recommend_anime_optimized(title, df, clean_df, top_n=5, sim_threshold=0.3):
    # Get the index of the anime matching the title
    idx = df[df['name'] == title].index[0]

    # Get the features of the target anime
    target_features = clean_df.iloc[idx].values.reshape(1, -1)

    # Compute cosine similarity between target anime and all others
    sim_scores = cosine_similarity(target_features, clean_df)[0]

    # Apply similarity threshold
    sim_scores = [(i, score) for i, score in enumerate(sim_scores) if score > sim_threshold]

    # Sort the anime based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]

    # Get the indices of the top similar anime
    anime_indices = [i[0] for i in sim_scores]

    # Return the top N most similar anime
    return df[['name', 'genre', 'rating']].iloc[anime_indices]



In [91]:
# Example usage: Recommend anime similar to "Kimi no Na wa."
recommended_anime = recommend_anime_optimized('Kimi no Na wa.', anime_df, anime_df_clean, top_n=5)
print(recommended_anime)

                                                   name  \
1111              Aura: Maryuuin Kouga Saigo no Tatakai   
504   Clannad: After Story - Mou Hitotsu no Sekai, K...   
208                       Kokoro ga Sakebitagatterunda.   
1201                     Angel Beats!: Another Epilogue   
1494                                           Harmonie   

                                             genre    rating  
1111  Comedy, Drama, Romance, School, Supernatural  1.164016  
504                         Drama, Romance, School  1.505875  
208                         Drama, Romance, School  1.798897  
1201                   Drama, School, Supernatural  1.124946  
1494                   Drama, School, Supernatural  1.017505  


## 4: Evaluation

In [94]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Split the dataset into training and testing sets
train_df, test_df = train_test_split(anime_df, test_size=0.2, random_state=42)

# Function to get true positive recommendations based on a genre
def get_true_positives(anime_title, test_df, recommended_anime):
  true_genre = test_df[test_df['name'] == anime_title]['genre'].values[0]
  tp_count = 0
  for index, row in recommended_anime.iterrows():
      if true_genre in row['genre']:
          tp_count += 1
  return tp_count


# Evaluate the recommendation system
precisions = []
recalls = []
f1_scores = []

for index, row in test_df.iterrows():
    anime_title = row['name']
    try:
      recommended_anime = recommend_anime_optimized(anime_title, anime_df, anime_df_clean)

      true_positives = get_true_positives(anime_title, test_df, recommended_anime)
      if len(recommended_anime) > 0:
          precision = true_positives / len(recommended_anime) if len(recommended_anime) > 0 else 0
          recall = true_positives / 1 if 1 > 0 else 0
          f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

          precisions.append(precision)
          recalls.append(recall)
          f1_scores.append(f1)

    except IndexError:
        pass


# Calculate average precision, recall, and F1-score
avg_precision = np.mean(precisions)
avg_recall = np.mean(recalls)
avg_f1 = np.mean(f1_scores)

print(f"Average Precision: {avg_precision}")
print(f"Average Recall: {avg_recall}")
print(f"Average F1-Score: {avg_f1}")



Average Precision: 0.19710391822827938
Average Recall: 0.985519591141397
Average F1-Score: 0.32850653038046573



# Analysis of Performance and Areas of Improvement:

## Based on the evaluation metrics (precision, recall, and F1-score), we can analyze the performance of the recommendation system.

1. Precision:
  * Precision indicates the proportion of recommended anime that are relevant.
  *  A low precision score suggests that many recommended anime are not actually relevant to the target anime.

2. Recall:
  * Recall indicates the proportion of relevant anime that are actually recommended.
  * A low recall score suggests that many relevant anime are not being captured in the recommendations.

3. F1-Score:
  * The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of both precision and recall.

# Areas for Improvement:

1. Feature Engineering:
   - We can improve the performance by exploring more features or considering different feature combinations.
   - For example, we could include user ratings or textual descriptions of the anime in the analysis.
  - We could also experiment with different weighting schemes for different features.

2. Similarity Measures:
   - We can try other similarity measures beyond cosine similarity, such as Pearson correlation or
     Jaccard index, to explore whether they provide better results for this dataset.

3. Handling Sparse Data:
   - The dataset may have sparse data, where there are few ratings or interactions for some anime.
   - We could employ techniques to address sparse data issues, such as matrix factorization or
     collaborative filtering, to create better recommendations.

4. Threshold Selection:
   - The threshold value used for similarity scores can impact the number of recommendations.
   - We can analyze how different threshold values affect precision and recall and fine-tune the threshold
     to find the optimal balance.

5. User-Specific Recommendations:
   - The current recommendation system recommends anime based solely on the similarity between anime.
   - We could personalize the recommendations by considering user preferences and past viewing
     history to provide more relevant suggestions.

In conclusion, the performance of the recommendation system can be further improved through advanced feature engineering, exploring alternative similarity measures, handling sparse data, and fine-tuning parameters such as similarity thresholds. The ultimate goal is to provide more accurate and relevant recommendations to users.


# Interview Questions

## 1. Can you explain the difference between user-based and item-based collaborative filtering?

- **User-based Collaborative Filtering**:
   - In this method, we identify users with similar tastes or behaviors based on their past interactions (e.g., ratings, clicks).
   - The idea is to recommend items that **similar users** have liked. For example, if user A and user B have a similar viewing history, a recommendation for user A would be based on items that user B has enjoyed but user A hasn't interacted with yet.
   - **Example**: Netflix recommends a show to a user based on what other users with similar viewing patterns have watched.

   - **Challenges**:
     - This method can struggle with large datasets since there are usually far more users than items. It also faces problems with sparsity when there are too few overlapping users to make accurate recommendations.

- **Item-based Collaborative Filtering**:
   - In item-based filtering, we focus on the relationships between items. The algorithm finds items similar to those that the user has liked or interacted with and recommends those items.
   - The similarity is based on how other users have rated or interacted with both items.
   - **Example**: Amazon recommends a product similar to one you’ve bought or rated highly, based on how other users have interacted with both products.

   - **Advantages**:
     - Item-based collaborative filtering is often more scalable than user-based, especially for large datasets, since there are usually fewer items than users.
     - Item relationships tend to be more stable over time than user preferences.

---

## 2. What is Collaborative Filtering, and How Does it Work?

- **Collaborative Filtering**:
   - Collaborative filtering is a technique used by recommendation systems to predict a user's interests by collecting preferences or interactions (e.g., ratings) from many users. The core idea is that users who agreed on past preferences will likely agree again in the future.
   - This method is "collaborative" because it leverages the behavior of a community of users to make recommendations.

- **How It Works**:
   - Collaborative filtering works in two main ways: **user-based** and **item-based** (as described above). Both approaches use past interactions to recommend new items.
   - The method can be summarized in three steps:
     1. **Data Collection**: Gather interaction data, which can be implicit (clicks, views) or explicit (ratings, reviews).
     2. **Similarity Calculation**: Determine the similarity between users (user-based) or items (item-based) using metrics like cosine similarity, Pearson correlation, or Jaccard similarity.
     3. **Recommendation**: Use the similarity scores to suggest new items that a user is likely to be interested in, based on what similar users have interacted with (user-based) or similar items they have engaged with (item-based).

- **Pros**:
   - Can make accurate predictions even for unseen data by leveraging collective preferences.
   - Doesn't require content-specific information about items (like metadata) since it's purely based on user-item interactions.

- **Cons**:
   - Suffers from the **cold start problem**: it struggles to recommend items to new users or when new items are introduced because there isn't enough interaction data.
   - May struggle with **sparsity** if user interactions with items are low, leading to difficulty in finding sufficient similarities.

---

These methods are at the heart of many recommendation systems, powering platforms like Netflix, YouTube, and Amazon.
