**RECOMMENDATION SYSTEM**

**1. Data Preprocessing**

The goal here is to prepare the raw data for feature extraction and
modeling.

| Step                          | Action                                                                                                                    | Method/Technique                                                                                                                                                                 | Rationale                                          |
|----------|-----------------|-------------------------------|---------------|
| **1.1 Load Data**             | Load the dataset (e.g., anime.csv) into a DataFrame.                                                                      | pandas.read_csv()                                                                                                                                                                | Standard practice for data manipulation in Python. |
| **1.2 Handle Missing Values** | Identify and handle missing values, particularly in key columns like **genre** and **rating**.                            | **Imputation/Removal:** If **Genre** is missing, dropping the row might be best, as genre is crucial. For missing **ratings**, use the mean rating of all anime or drop the row. | Missing data can skew similarity calculations.     |
| **1.3 Data Cleaning**         | Clean up the **Genre** column by splitting the genre string (often separated by commas) into a list of individual genres. | str.split(',') or str.replace()                                                                                                                                                  | Prepares the genre data for one-hot encoding.      |
| **1.4 Explore Data**          | Check data types, unique values, and distribution of key features like **Type** and **Rating**.                           | df.info(), df.describe(), df\['column'\].value_counts()                                                                                                                          | Confirms data quality and structure.               |
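
The four steps above can be sketched in a few lines of pandas. This is a minimal illustration rather than a definitive pipeline: the inline CSV stands in for anime.csv, and the column names (`name`, `genre`, `type`, `rating`, `members`) and the helper `preprocess` are assumptions made for the example.

```python
import io
import pandas as pd

# Tiny inline stand-in for anime.csv; the column names are assumptions.
RAW_CSV = io.StringIO(
    "name,genre,type,rating,members\n"
    'Alpha,"Action, Comedy",TV,8.0,1000\n'
    "Beta,,Movie,7.0,500\n"
    'Gamma,"Drama, Romance",OVA,,200\n'
    'Delta,"Action, Drama",TV,6.0,300\n'
)

def preprocess(source) -> pd.DataFrame:
    """Load the data, handle missing values, and split genres into lists."""
    df = pd.read_csv(source)                                  # 1.1 load
    df = df.dropna(subset=["genre"]).copy()                   # 1.2 drop rows without a genre
    df["rating"] = df["rating"].fillna(df["rating"].mean())   # 1.2 mean-impute ratings
    # 1.3 split the comma-separated string and strip stray whitespace
    df["genre_list"] = df["genre"].str.split(",").apply(
        lambda genres: [g.strip() for g in genres]
    )
    return df.reset_index(drop=True)

anime_df = preprocess(RAW_CSV)
anime_df.info()                                               # 1.4 explore
print(anime_df["type"].value_counts())
```

Stripping whitespace after the split matters because genre strings are often separated by a comma followed by a space; without it, "Comedy" and " Comedy" would be treated as different genres.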


**2. Feature Extraction**

We need to create a numerical feature vector for each anime upon which
the similarity will be calculated.

**A. Feature Selection**

Based on the dataset description, the key features for a content-based
recommender are:

1.  **Genre:** Primary descriptor of content.

2.  **Rating & Community Members:** Indicate quality and popularity.

3.  **Type (Broadcast Type):** Distinguishes between TV, OVA, Movie,
    etc.

**B. Feature Transformation**

| Feature               | Transformation Method          | Resulting Feature Vector                                                                                                                                     |
|--------------|---------------------|--------------------------------------|
| **Genre**             | **One-Hot Encoding (OHE)**     | A sparse vector where 1 indicates the presence of a genre (e.g., Action, Comedy, Drama), and 0 indicates its absence. This is the **most critical** feature. |
| **Type**              | **One-Hot Encoding (OHE)**     | Dummy variables for TV, Movie, OVA, etc.                                                                                                                     |
| **Rating**            | **Weighted Rating (Optional)** | A single normalized numerical value.                                                                                                                         |
| **Community Members** | **Normalization/Scaling**      | A single normalized numerical value (e.g., between 0 and 1).                                                                                                 |


**Combining Features (The Feature Matrix)**

1.  **Create the Feature Matrix (X):** Combine the OHE vectors for
    **Genre** and **Type**, and the scaled numerical features (Rating,
    Community Members) into a single matrix where each row represents an
    anime.

$$X = \left[\, \text{Genre}_{\text{OHE}} \mid \text{Type}_{\text{OHE}} \mid \text{Scaled Rating} \mid \text{Scaled Members} \,\right]$$

2.  **Normalization of Numerical Features:** Scale the numerical
    features (Rating and Community Members) with **Min-Max Scaling** or
    **Standardization** before combining them with the OHE columns.
    Raw member counts are orders of magnitude larger than the 0/1 OHE
    values, so left unscaled they would dominate the similarity score.

$$\text{Scaled}_x = \frac{x - \min(x)}{\max(x) - \min(x)}$$
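
One way to assemble the feature matrix is sketched below with scikit-learn. The toy DataFrame, its column names (`genre_list`, `type`, `rating`, `members`), and the helper `build_feature_matrix` are assumptions for illustration. Because each anime can carry several genres, the genre encoding here uses `MultiLabelBinarizer`, which produces one 0/1 column per genre.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, MultiLabelBinarizer

# Toy data; the column names are assumptions.
toy_df = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma"],
    "genre_list": [["Action", "Comedy"], ["Action"], ["Drama"]],
    "type": ["TV", "Movie", "TV"],
    "rating": [8.0, 6.0, 7.0],
    "members": [1000, 200, 600],
})

def build_feature_matrix(df: pd.DataFrame) -> np.ndarray:
    """Return X = [Genre OHE | Type OHE | scaled rating | scaled members]."""
    genre_ohe = MultiLabelBinarizer().fit_transform(df["genre_list"])
    type_ohe = pd.get_dummies(df["type"]).to_numpy(dtype=float)
    scaled = MinMaxScaler().fit_transform(df[["rating", "members"]])
    return np.hstack([genre_ohe, type_ohe, scaled])

X = build_feature_matrix(toy_df)
print(X.shape)  # 3 genre columns + 2 type columns + 2 numeric columns
```

For a full catalogue the genre block is mostly zeros, so a sparse representation (for example `MultiLabelBinarizer(sparse_output=True)` combined with `scipy.sparse.hstack`) may be worth considering.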

**3. Recommendation System Implementation**

The core of the system is calculating the **Cosine Similarity** between
the feature vectors.

**Cosine Similarity**

Cosine similarity measures the cosine of the angle between two non-zero
vectors in a high-dimensional space. The closer the vectors are in
direction, the higher the similarity score (closer to 1).

$$\text{Similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
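
As a small worked example, the formula can be checked by hand on three toy genre vectors. The vectors and the helper `cosine_sim` are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 1.0, 0.0])   # e.g. Action + Comedy
b = np.array([1.0, 0.0, 0.0])   # e.g. Action only
c = np.array([0.0, 0.0, 1.0])   # e.g. Drama only

print(round(cosine_sim(a, b), 4))  # 1 / sqrt(2), about 0.7071
print(cosine_sim(a, c))            # 0.0, no shared genres
```

Since all features here are non-negative, the scores fall between 0 and 1.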

**Recommendation Function Design**

**Function:** recommend_anime(target_anime_title, N=10,
similarity_threshold=0.8)

1.  **Identify Target:** Get the **index** of the target_anime_title
    from the DataFrame.

2.  **Extract Target Vector:** Extract the feature vector A for the
    target anime from the Feature Matrix X.

3.  **Compute Similarities:** Calculate the **cosine similarity**
    between the target vector A and **every other anime vector** B in
    the matrix X.

4.  **Sort:** Store the similarities as a list of (index, score) pairs
    and **sort** them in descending order by score.

5.  **Filter and Return:**

    -   **Filter** the results: Exclude the target anime itself.

    -   **Apply Threshold:** Only include anime with a similarity score
        **above** the similarity_threshold.

    -   **Limit:** Return the top N anime titles from the filtered list.
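
The five steps above could be implemented roughly as follows. This is a sketch under stated assumptions: the toy `titles_df` and `X_toy` stand in for the real DataFrame and feature matrix, the extra `df` and `X` parameters are added only to keep the example self-contained, and the rows of `X` are assumed to be in the same order as the rows of `df`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the real DataFrame and feature matrix X (assumptions).
titles_df = pd.DataFrame({"name": ["Alpha", "Beta", "Gamma", "Delta"]})
X_toy = np.array([
    [1.0, 1.0, 0.0],   # Alpha
    [1.0, 1.0, 0.1],   # Beta: nearly identical to Alpha
    [0.0, 0.0, 1.0],   # Gamma: orthogonal to Alpha
    [1.0, 0.0, 0.0],   # Delta: partial overlap with Alpha
])

def recommend_anime(target_anime_title, N=10, similarity_threshold=0.8,
                    df=titles_df, X=X_toy):
    """Return up to N (title, score) pairs scoring above the threshold."""
    matches = np.flatnonzero((df["name"] == target_anime_title).to_numpy())
    if len(matches) == 0:
        raise KeyError(f"Unknown title: {target_anime_title}")
    pos = int(matches[0])                              # 1. row position of the target
    scores = cosine_similarity(X[pos:pos + 1], X)[0]   # 2-3. target vs. every row
    order = np.argsort(scores)[::-1]                   # 4. descending by score
    results = [
        (df["name"].iloc[i], float(scores[i]))
        for i in order
        if i != pos and scores[i] > similarity_threshold   # 5. drop self, apply threshold
    ]
    return results[:N]                                 # 5. keep the top N

print(recommend_anime("Alpha", N=5, similarity_threshold=0.7))
```

With the default threshold of 0.8 only "Beta" is returned for "Alpha"; lowering the threshold to 0.7 also admits "Delta", whose score is about 0.707.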

**Experimentation with Threshold**

-   **Low Threshold (e.g., 0.5):** Will result in a **larger** and
    potentially less relevant recommendation list (higher **Recall**).

-   **High Threshold (e.g., 0.9):** Will result in a **smaller** and
    potentially more relevant/niche recommendation list (higher
    **Precision**).

-   **Action:** Test the function with thresholds like **0.75, 0.85, and
    0.95** to observe the impact on the list size and assumed relevance.
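
The effect of the threshold on list size can be illustrated on a handful of scores. The score values below are hypothetical, chosen only to show the trend, and `count_above` is a helper introduced for this example.

```python
import numpy as np

# Hypothetical similarity scores between one target and seven candidates.
scores = np.array([0.97, 0.91, 0.88, 0.80, 0.76, 0.62, 0.40])

def count_above(scores: np.ndarray, threshold: float) -> int:
    """Number of candidates whose score is strictly above the threshold."""
    return int((scores > threshold).sum())

list_sizes = {t: count_above(scores, t) for t in (0.75, 0.85, 0.95)}
for threshold, size in list_sizes.items():
    print(f"threshold={threshold:.2f} -> {size} candidates")
```

Here the list shrinks from 5 to 3 to 1 candidates as the threshold rises, which is the precision/recall trade-off described above in miniature.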

**4. Evaluation**

Evaluating a content-based system requires a proxy for "relevance" since
we don't have explicit user feedback.

**A. Splitting the Dataset**

Unlike typical supervised learning, the split here is conceptual. You
aren't training coefficients, but rather building the feature vectors.

-   **Approach:** Split the **list of anime titles** into a **training
    set** (e.g., 80% used to build the similarity matrix) and a
    **testing set** (the remaining 20% to be used as target anime for
    recommendations).
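
A minimal sketch of this split, assuming scikit-learn is available; the titles are placeholders for the real catalogue.

```python
from sklearn.model_selection import train_test_split

# Hypothetical titles standing in for the real catalogue.
all_titles = [f"Anime {i}" for i in range(10)]

# 80% form the candidate pool; 20% are held out as query targets.
train_titles, test_titles = train_test_split(
    all_titles, test_size=0.2, random_state=42
)
print(len(train_titles), len(test_titles))
```

Held-out targets still need feature vectors, so the encoders and scalers would be fitted on the training titles and then applied to the test titles with `transform`.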

**B. Evaluation Metrics (Surrogate Evaluation)**

Since this is not a user-item collaborative filter, we must use a
surrogate evaluation: **Evaluate how well the system recommends anime
belonging to the same genre(s) or type as the target anime.**

1.  **Define "Relevant":** An anime r in the recommendation list is
    considered **"Relevant"** if it shares a minimum number of **key
    genres** (e.g., at least 3) with the target anime t.

2.  **Calculate Metrics:** For each target anime t in the test set:

    -   Generate a list of N recommendations $R_t$.

    -   Count the number of relevant recommendations $R_{relevant}$.

    -   Calculate **Precision, Recall, and F1-score** for that single
        target, then average the results across all test targets.

| Metric        | Formula                            | Interpretation                                                             |
|----------|-----------------------------|----------------------------------|
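| **Precision** | $\frac{\lvert R_{relevant} \rvert}{\lvert R_t \rvert}$ | Fraction of the recommended anime that are relevant. |
| **Recall**    | $\frac{\lvert R_{relevant} \rvert}{\lvert G_t \rvert}$, where $G_t$ is the set of all anime relevant to $t$ | Fraction of all relevant anime that appear in the recommendations. |
| **F1-Score**  | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean of Precision and Recall; a single metric that balances both. |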
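
The surrogate evaluation for a single target can be sketched in plain Python. The genre sets, titles, and helper names (`is_relevant`, `precision_recall_f1`) are hypothetical; relevance follows the "at least 3 shared genres" rule from step 1, and recall is computed against every relevant title in the candidate pool.

```python
def is_relevant(target_genres, candidate_genres, min_shared=3):
    """A candidate is relevant if it shares at least min_shared genres with the target."""
    return len(set(target_genres) & set(candidate_genres)) >= min_shared

def precision_recall_f1(recommended, relevant):
    """Per-target metrics; `relevant` holds every relevant title in the pool."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical genre sets for one target and a pool of four candidates.
target_genres = {"Action", "Adventure", "Fantasy", "Shounen"}
pool = {
    "A": {"Action", "Adventure", "Fantasy"},
    "B": {"Action", "Adventure", "Shounen", "Comedy"},
    "C": {"Action", "Comedy"},
    "D": {"Adventure", "Fantasy", "Shounen"},
}
relevant_titles = [t for t, g in pool.items() if is_relevant(target_genres, g)]
recommended_titles = ["A", "C"]   # pretend output of the recommender

precision, recall, f1 = precision_recall_f1(recommended_titles, relevant_titles)
print(precision, round(recall, 3), round(f1, 3))
```

In this example one of the two recommendations is relevant (precision 0.5) and one of three relevant titles is recovered (recall about 0.333), giving an F1-score of 0.4. Averaging these tuples over all test targets yields the system-level figures.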


**C. Performance Analysis and Improvements**

-   **Analysis:**

    -   What is the average F1-score across the test set?

    -   Which features contributed most to similarity (check the weight
        of genres vs. rating in the feature vector)?

    -   Did the **sparsity** of the genre vectors affect performance?

-   **Areas for Improvement:**

    1.  **Feature Weighting:** Apply **Term Frequency-Inverse Document
        Frequency (TF-IDF)** on the **Genre** features instead of simple
        OHE. This would give more weight to rare genres and less weight
        to common genres (like "Action" or "Shounen"), potentially
        improving distinctiveness and relevance.

    2.  **Hybrid Approach:** Integrate **User Ratings** more
        effectively, perhaps by adjusting the similarity score with a
        factor based on the difference in average ratings.

    3.  **Advanced Feature Engineering:** Incorporate textual data like
        **synopses** using techniques like **Word2Vec** or **BERT**
        embeddings if the dataset includes descriptions.
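
As a sketch of the first improvement, TF-IDF weighting of genres can be tried with scikit-learn's `TfidfVectorizer` and a custom tokenizer that splits on commas. The genre strings and the helper `split_genres` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical genre strings; "Action" appears in every row, the others once.
genre_strings = [
    "Action, Shounen",
    "Action, Comedy",
    "Action, Psychological",
]

def split_genres(text):
    """Treat each comma-separated genre as one token."""
    return [g.strip() for g in text.split(",")]

vectorizer = TfidfVectorizer(
    tokenizer=split_genres, lowercase=False, token_pattern=None
)
tfidf = vectorizer.fit_transform(genre_strings)

vocab = vectorizer.vocabulary_
row0 = tfidf[0].toarray().ravel()
# The ubiquitous genre gets a smaller weight than the rare one.
print(row0[vocab["Action"]] < row0[vocab["Shounen"]])  # True
```

The resulting matrix can replace the genre OHE block in the feature matrix, so that matching on a rare genre contributes more to the cosine similarity than matching on a common one.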