# 1. Problem Definition & Objective

## a. Selected Project Track
**Mood-Based Music Recommendation System**

## b. Clear Problem Statement
Users often struggle to find music that matches their current mood or activity, especially across languages. Generic recommenders may surface a "Sad" track that is actually an upbeat remix, or overlook regional content, such as Telugu or Korean songs, that fits the mood well. The goal is to build a system that curates playlists by filtering songs with mood-specific search keywords and "anti-keywords" (negative filtering), so that results stay relevant across five languages: English, Hindi, Spanish, Korean, and Telugu.

## c. Real-world Relevance and Motivation
- **Cross-Cultural Access:** Music is universal, but discovery is often language-gated. This tool breaks that barrier.
- **Context Awareness:** Mood and activity are primary drivers of music consumption (e.g., Gym, Sleep, Party).
- **Quality of Experience:** Filtering out "remix" results for sad moods or "acoustic" results for party moods keeps each playlist consistent with the listener's intent.


In [None]:
import pandas as pd
import numpy as np
import joblib
import os
import requests
import time
from sklearn.ensemble import RandomForestClassifier

# Setup directory for data
os.makedirs('data', exist_ok=True)


# 2. Data Understanding & Preparation

## a. Dataset Source
- **Source:** iTunes Search API (Public, Free).
- **Type:** Real-time collected data based on search queries.
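
For orientation, the request sent to the API can be previewed without touching the network. The sketch below only builds the query URL from the same parameters that `fetch_from_itunes` passes later in this notebook (`term`, `media`, `entity`, `limit`). The helper name `build_itunes_url` is illustrative and not part of the notebook.

```python
from urllib.parse import urlencode

ITUNES_SEARCH_URL = "https://itunes.apple.com/search"

def build_itunes_url(term, limit=3):
    """Return the full search URL for one query term (no request is sent)."""
    params = {"term": term, "media": "music", "entity": "song", "limit": limit}
    return f"{ITUNES_SEARCH_URL}?{urlencode(params)}"

print(build_itunes_url("telugu sad songs"))
# https://itunes.apple.com/search?term=telugu+sad+songs&media=music&entity=song&limit=3
```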

## b. Data Loading and Exploration
We define specific search terms for different moods across 5 languages: English, Hindi, Spanish, Korean, and Telugu.

## c. Cleaning, Preprocessing, Feature Engineering
- **Text Construction:** We combine `Track Name`, `Artist Name`, and `Collection Name` into a single text blob for filtering.
- **Labeling:** We automatically assign a "Predicted Emoji" and "Mood Label" based on the search term that yielded the result.

## d. Handling Missing Values or Noise
- **Negative Filtering:** This is our core preprocessing step. We define "negative keywords" for each mood. For example, if we are looking for 'Sad' songs, we exclude results containing 'remix', 'club', or 'party'.
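
A minimal sketch of the negative filter, using made-up track titles. `has_negative_substring` mirrors the plain substring check used in this notebook. `has_negative_word` is a hypothetical stricter variant (not used in the notebook) that matches whole words only, shown because substring matching also rejects unrelated words that happen to contain a keyword.

```python
import re

def has_negative_substring(text, negatives):
    """Substring check, as in this notebook's filtering step."""
    text = text.lower()
    return any(neg in text for neg in negatives)

def has_negative_word(text, negatives):
    """Hypothetical stricter variant: match whole words only."""
    text = text.lower()
    return any(re.search(rf"\b{re.escape(neg)}\b", text) for neg in negatives)

sad_negatives = ['remix', 'club', 'party']

print(has_negative_substring("Heartache (Club Remix)", sad_negatives))  # True
print(has_negative_substring("Quiet Rain", sad_negatives))              # False

# Substring matching flags "Clubhouse" because it contains "club".
print(has_negative_substring("Clubhouse Blues", ['club']))  # True
print(has_negative_word("Clubhouse Blues", ['club']))       # False
```

Whether the stricter variant is preferable depends on the keyword list: short keywords such as 'pop' or 'mix' are the ones most likely to match inside longer words.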



In [None]:
# Emoji Configuration
EMOJI_MAPPING = {
    '😊': {'label': 'Happy', 'negative': ['sad', 'gloom', 'breakup', 'remix']},
    '😢': {'label': 'Sad', 'negative': ['remix', 'club', 'dance', 'happy', 'party', 'mix', 'techno']},
    '😌': {'label': 'Calm', 'negative': ['rock', 'metal', 'techno', 'dubstep']},
    '🔥': {'label': 'Energetic', 'negative': ['lullaby', 'sleep', 'ballad', 'slow']},
    '💪': {'label': 'Motivated', 'negative': ['sad', 'weak', 'slow']},
    '😴': {'label': 'Sleepy', 'negative': ['rock', 'pop', 'dance', 'drum', 'beat']},
    '🥰': {'label': 'Romantic', 'negative': ['breakup', 'hate', 'metal']},
    '😠': {'label': 'Angry', 'negative': ['calm', 'soft', 'love']},
    '🎉': {'label': 'Party', 'negative': ['acoustic', 'slow', 'sad']},
    '🙏': {'label': 'Devotion', 'negative': ['explicit']},
    '😎': {'label': 'Cool', 'negative': ['country', 'metal']},
    '💭': {'label': 'Thoughtful', 'negative': ['party', 'scream']},
    '🌙': {'label': 'Melancholic', 'negative': ['happy', 'upbeat', 'dance']},
}

# Localized Search Terms Configuration
LOCALIZED_TERMS = {
    'Happy': {
        'English': ['happy hits', 'feel good pop', 'upbeat hits', 'walking on sunshine'],
        'Hindi': ['bollywood happy songs', 'hindi dance hits', 'punjabi bhangra', 'bollywood party'],
        'Spanish': ['latin pop hits', 'reggaeton fiesta', 'musica alegre', 'happy latin'],
        'Korean': ['k-pop upbeat', 'k-pop dance hits', 'happy k-pop', 'korean pop energy'],
        'Telugu': ['telugu dance hits', 'tollywood party', 'telugu upbeat', 'telugu mass songs'] 
    },
    'Sad': {
        'English': ['sad songs', 'heartbreak', 'piano ballads', 'cry me a river'],
        'Hindi': ['bollywood sad songs', 'arijit singh sad', 'hindi breakup', 'dard bhare'],
        'Spanish': ['musica triste', 'baladas romanticas', 'cortavenas', 'sad latin'],
        'Korean': ['k-pop ballad', 'k-drama ost sad', 'sad k-pop', 'korean heartbreak'],
        'Telugu': ['telugu sad songs', 'tollywood melody sad', 'telugu heartbreak', 'love failure telugu']
    },
    'Calm': {
        'English': ['acoustic chill', 'lo-fi beats', 'relaxing piano', 'stress relief'],
        'Hindi': ['bollywood acoustic', 'hindi lo-fi', 'sufi songs', 'calm hindi'],
        'Spanish': ['latin acoustic', 'guitarras relajantes', 'bossa nova', 'calm spanish'],
        'Korean': ['k-indie', 'korean acoustic', 'piano k-pop', 'calm k-drama'],
        'Telugu': ['telugu melody', 'telugu acoustic', 'calm tollywood', 'pleasant telugu']
    },
    'Energetic': {
        'English': ['workout hits', 'gym motivation', 'power rock', 'high energy pop'],
        'Hindi': ['bollywood workout', 'punjabi high energy', 'hindi gym songs', 'chak de india'],
        'Spanish': ['latin gym', 'reggaeton workout', 'zumba hits', 'energia latina'],
        'Korean': ['k-pop workout', 'k-pop high energy', 'gym k-pop', 'korean rock'],
        'Telugu': ['telugu workout', 'tollywood action', 'mass beats telugu', 'dsp hits high energy']
    },
    'Romantic': {
        'English': ['love songs', 'romantic ballads', 'wedding songs', 'first dance'],
        'Hindi': ['bollywood romantic', 'love songs hindi', 'arijit singh romantic', 'shreya ghoshal love'],
        'Spanish': ['musica romantica', 'latin love songs', 'bachata romantica', 'amor latino'],
        'Korean': ['k-drama romance', 'sweet k-pop', 'korean love songs', 'wedding k-pop'],
        'Telugu': ['telugu love songs', 'sid sriram melody', 'romantic tollywood', 'telugu duets']
    },
    'Party': {
        'English': ['party hits', 'club bangers', 'dance pop', 'house music'],
        'Hindi': ['bollywood party anthem', 'punjabi party mix', 'remix hindi', 'badshah hits'],
        'Spanish': ['fiesta latina', 'reggaeton hits', 'salsa party', 'club latino'],
        'Korean': ['k-pop party', 'club k-pop', 'korean edm', 'big bang hits'],
        'Telugu': ['telugu folk songs', 'teenmaar beats', 'tollywood party mix', 'ramuloo ramulaa']
    }
}

# Fallback terms for moods that have no entry in LOCALIZED_TERMS
DEFAULT_TERMS = {
    'Motivated': ['motivation', 'champions', 'success'],
    'Sleepy': ['sleep', 'lullaby', 'ambient'],
    'Angry': ['rock', 'metal', 'rage'],
    'Devotion': ['devotional', 'spiritual', 'gospel'],
    'Cool': ['cool', 'jazz', 'smooth'],
    'Thoughtful': ['focus', 'study', 'instrumental'],
    'Melancholic': ['lonely', 'sad', 'night']
}

LANGUAGES = ['English', 'Hindi', 'Spanish', 'Korean', 'Telugu']


# 3. Model / System Design

## a. AI Technique Used
**Hybrid Rule-Based Information Retrieval:** 
Instead of a traditional "black box" deep learning model that analyzes audio waveforms (which is computationally expensive and requires the audio files themselves, e.g., MP3s), we use a curated **Keyword-Based Filtering System**. It acts as a deterministic classifier: the search terms define the class, and negative-keyword lookups prune false positives.

## b. Architecture / Pipeline
1.  **Input:** User selects a mood (Emoji).
2.  **Query Generation:** System maps Emoji -> Mood -> Localized Search Terms.
3.  **Data Fetching:** iTunes API is queried for each term.
4.  **Filtering:** Results are passed through a "Negative Keyword Filter" to remove incompatible tracks.
5.  **Output:** A clean, labeled dataset is generated/saved.
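
The five steps above can be sketched end to end without network access. Everything here is a stand-in: `MOODS` and `TERMS` are cut-down versions of the notebook's configuration tables, `fake_fetch` replaces the API call with canned records, and the track data is invented for illustration.

```python
# Illustrative stand-ins for the notebook's EMOJI_MAPPING and LOCALIZED_TERMS.
MOODS = {'😢': {'label': 'Sad', 'negative': ['remix', 'party']}}
TERMS = {'Sad': {'English': ['sad songs']}}

def fake_fetch(term):
    """Stub for the API call: returns canned records instead of querying iTunes."""
    return [
        {'trackName': 'Rainy Window', 'artistName': 'Example Artist',
         'collectionName': 'Grey Days'},
        {'trackName': 'Rainy Window (Party Remix)', 'artistName': 'Example Artist',
         'collectionName': 'Grey Days'},
    ]

def run_pipeline(emoji, lang, fetch=fake_fetch):
    mood = MOODS[emoji]                                  # 1. input: emoji
    terms = TERMS[mood['label']][lang]                   # 2. query generation
    kept = []
    for term in terms:
        for item in fetch(term):                         # 3. data fetching
            text = (f"{item['trackName']} {item['artistName']} "
                    f"{item['collectionName']}").lower()
            if any(neg in text for neg in mood['negative']):  # 4. filtering
                continue
            kept.append({'name': item['trackName'],
                         'mood_label': mood['label'],
                         'language': lang})
    return kept                                          # 5. labeled output

print(run_pipeline('😢', 'English'))
# [{'name': 'Rainy Window', 'mood_label': 'Sad', 'language': 'English'}]
```

Passing the fetcher as a parameter is a choice made for this sketch so the flow can be exercised offline. The notebook itself calls the API directly.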

## c. Justification
- **Speed:** API lookup is faster than audio processing.
- **Accuracy for Metadata:** For querying "Bollywood Sad Songs", the metadata (Title/Album) is often more reliable than analyzing the audio waveform without context.



In [None]:
def fetch_from_itunes(term, limit=10):
    '''Search iTunes API for tracks.'''
    url = "https://itunes.apple.com/search"
    params = {
        'term': term,
        'media': 'music',
        'entity': 'song',
        'limit': limit
    }
    try:
        response = requests.get(url, params=params, timeout=10)
        if response.status_code == 200:
            return response.json().get('results', [])
    except Exception as e:
        print(f"Error fetching {term}: {e}")
    return []

def get_search_terms(mood_label, lang):
    '''Return search terms for a mood/language pair, falling back to generic terms.'''
    if mood_label in LOCALIZED_TERMS and lang in LOCALIZED_TERMS[mood_label]:
        return LOCALIZED_TERMS[mood_label][lang]
    
    # Fallback: generic terms, suffixed with the language name for non-English queries
    base_terms = DEFAULT_TERMS.get(mood_label, [mood_label.lower()])
    if lang == 'English':
        return base_terms
    return [f"{term} {lang}" for term in base_terms]


# 4. Core Implementation

## a. Model Training / Inference Logic
Here we implement the `build_dataset` function which orchestrates the fetching and filtering. 

## b. Prompt Engineering / Query Construction
The `get_search_terms` function (above) plays the role that prompt engineering plays for language models, applied to search queries: it tailors each request to a specific cultural context (e.g., "Tollywood Party" vs. "K-Pop Upbeat").
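
To make the two query paths concrete, here is a small worked example. It re-implements the lookup on cut-down tables (`LOCALIZED` and `DEFAULTS` are abbreviated stand-ins, and `search_terms` is a renamed copy for illustration) so that it runs on its own.

```python
LOCALIZED = {'Party': {'Telugu': ['tollywood party mix', 'teenmaar beats']}}
DEFAULTS = {'Sleepy': ['sleep', 'lullaby', 'ambient']}

def search_terms(mood, lang):
    """Prefer curated per-language terms; otherwise suffix generic terms with the language."""
    if mood in LOCALIZED and lang in LOCALIZED[mood]:
        return LOCALIZED[mood][lang]
    base = DEFAULTS.get(mood, [mood.lower()])
    return base if lang == 'English' else [f"{t} {lang}" for t in base]

print(search_terms('Party', 'Telugu'))    # ['tollywood party mix', 'teenmaar beats']
print(search_terms('Sleepy', 'Korean'))   # ['sleep Korean', 'lullaby Korean', 'ambient Korean']
print(search_terms('Sleepy', 'English'))  # ['sleep', 'lullaby', 'ambient']
```

The fallback path produces literal queries such as "sleep Korean", which are likely to be less culturally specific than the curated terms.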

## c. Code Execution
The following cell runs the pipeline top-to-bottom.



In [None]:
def build_dataset():
    print("Fetching data from iTunes (Free API)...")
    all_tracks = []
    
    # Common Negative Filters
    GLOBAL_NEGATIVE = ['karaoke', 'tribute', 'cover', 'ringtone', 'podcast', 'commentary']
    
    for lang in LANGUAGES:
        print(f"--- Fetching {lang} songs ---")
        for emoji, data in EMOJI_MAPPING.items():
            mood_label = data['label']
            terms = get_search_terms(mood_label, lang)
            
            for term in terms:
                # Limit set low (3) for demonstration speed in Notebook. 
                # In prod script it was 12.
                results = fetch_from_itunes(term, limit=3) 
                
                for item in results:
                    song_name = item.get('trackName', '')
                    artist_name = item.get('artistName', '')
                    collection_name = item.get('collectionName', '')
                    full_text = f"{song_name} {artist_name} {collection_name}".lower()
                    
                    # 1. Global Negative Filter
                    if any(bad in full_text for bad in GLOBAL_NEGATIVE):
                        continue
                        
                    # 2. Mood Specific Negative Filter
                    if any(neg in full_text for neg in data.get('negative', [])):
                        continue

                    # Basic info
                    track_info = {
                        'id': str(item.get('trackId')),
                        'name': song_name,
                        'artist': artist_name,
                        'album': collection_name,
                        'image_url': (item.get('artworkUrl100') or '').replace('100x100', '600x600'),  # tolerate missing artwork
                        'preview_url': item.get('previewUrl'),
                        'predicted_emoji': emoji, 
                        'mood_label': mood_label,
                        'language': lang
                    }
                    all_tracks.append(track_info)
                time.sleep(0.05) # Polite delay
            
    # Remove duplicates
    df = pd.DataFrame(all_tracks)
    if not df.empty:
        df = df.drop_duplicates(subset=['id'])
    return df

# Run the pipeline
df_music = build_dataset()
print(f"Total songs collected: {len(df_music)}")


# 5. Evaluation & Analysis

## a. Sample Outputs

![Sample Outputs](recommendation_songs.png)

Let's view the collected data to verify that the mood labels match the song content.

## b. Performance Analysis
We inspect the distribution of songs across different languages and moods.
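
Beyond the separate per-mood and per-language counts, a cross-tabulation can show which language/mood combinations ended up with no tracks after filtering. The sketch below uses a toy DataFrame with invented rows standing in for `df_music`; only the column names `language` and `mood_label` are taken from the notebook.

```python
import pandas as pd

# Toy rows standing in for df_music; values are made up for illustration.
toy = pd.DataFrame({
    'language':   ['English', 'English', 'Telugu', 'Korean'],
    'mood_label': ['Sad',     'Happy',   'Sad',    'Sad'],
})

# Rows: language, columns: mood. A zero cell means no tracks for that pair.
coverage = pd.crosstab(toy['language'], toy['mood_label'])
print(coverage)

# List language/mood pairs that have no surviving tracks.
empty_pairs = [(lang, mood)
               for lang in coverage.index
               for mood in coverage.columns
               if coverage.loc[lang, mood] == 0]
print(empty_pairs)  # [('Korean', 'Happy'), ('Telugu', 'Happy')]
```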



In [None]:
if not df_music.empty:
    display(df_music[['name', 'artist', 'mood_label', 'language']].sample(min(10, len(df_music))))
else:
    print("No data found.")

# Distribution
if not df_music.empty:
    print("\n--- Distribution by Mood ---")
    print(df_music['mood_label'].value_counts())
    
    print("\n--- Distribution by Language ---")
    print(df_music['language'].value_counts())


# 6. Ethical Considerations & Responsible AI

## a. Bias and Fairness
- **Language Bias:** The iTunes API may have richer results for English than for regional languages like Telugu. We try to mitigate this by using specific local search terms ("Tollywood", "K-Pop") rather than direct translations.
- **Cultural Stereotypes:** Manually mapped keywords might enforce stereotypes (e.g., assuming all Latin music is for partying). We mitigate this by including diverse categories like 'Calm' and 'Sad' for all languages.

## b. Dataset Limitations
- **Data Quality:** Dependent on Apple's metadata. If Apple tags a song incorrectly, our system inherits that error.
- **Availability:** Only 30-second previews are available freely.

## c. Responsible Use
- This tool is for recommendations only and does not store user listening data, preserving privacy.



# 7. Conclusion & Future Scope

## a. Summary of Results
We built a pipeline that aggregates music across 5 languages and 13 mood categories and filters out noise (e.g., karaoke versions, covers, and remixes that clash with the requested mood) to keep results relevant.

## b. Possible Improvements
- **Audio Analysis:** Integrate `librosa` to analyze BPM and Key for more accurate "Energy" classification.
- **User Feedback Loop:** Allow users to "Dislike" a recommendation to remove that keyword association in the future.
- **Spotify Integration:** Use Spotify API for full track playback.
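
As one possible shape for the feedback loop, the sketch below counts dislikes per (mood, search term) pair and retires a term once it reaches a threshold. The design is entirely hypothetical: the class `TermFeedback`, its method names, and the threshold are assumptions, and it presumes each track records the search term that produced it, which the current dataset does not store.

```python
from collections import Counter

class TermFeedback:
    """Hypothetical tracker: counts dislikes per (mood, search term)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.dislikes = Counter()

    def dislike(self, mood, term):
        """Record one dislike for a track that came from this search term."""
        self.dislikes[(mood, term)] += 1

    def active_terms(self, mood, terms):
        """Return the terms that have not yet reached the dislike threshold."""
        return [t for t in terms if self.dislikes[(mood, t)] < self.threshold]

feedback = TermFeedback(threshold=2)
feedback.dislike('Sad', 'sad latin')
feedback.dislike('Sad', 'sad latin')
print(feedback.active_terms('Sad', ['musica triste', 'sad latin']))  # ['musica triste']
```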



In [None]:
# Save the model artifact
def save_artifacts(df):
    if df.empty:
        print("Empty dataframe, skipping save.")
        return
        
    artifacts = {
        'model': None, # No ML model object needed
        'data': df
    }
    joblib.dump(artifacts, 'model.pkl')
    print("Saved 'model.pkl' successfully.")

save_artifacts(df_music)
