<a href="https://colab.research.google.com/github/V-Nayak/ML/blob/main/music_recom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement:**

You are tasked with building a Music Recommendation System leveraging machine learning techniques. This system should provide personalized recommendations to users based on their listening history and the similarity of songs. In addition to implementing the recommendation engine, you will also need to demonstrate strong coding skills, problem-solving abilities, and a solid understanding of statistics, mathematics, and probability.

# **1. Build a Music Recommendation System**

Design an algorithm to recommend songs to a user based on:
- Their highest-rated songs.
- Similarity of songs (calculated using cosine similarity based on song features).

Inputs:
- User listening history in the format `User_ID | Song_ID | Rating (1-5)`
- Song feature dataset in the format `Song_ID | Feature_1 | Feature_2 | ... | Feature_N`

## **Deliverables**:
A function recommend_songs(user_id, user_data, song_features) that outputs the top 5 recommendations for a given user.
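
To make the similarity measure concrete: the cosine similarity of two feature vectors is their dot product divided by the product of their lengths, so it depends only on direction, not on magnitude. The block below is a minimal, dependency-free sketch of that formula. The function name `cosine_similarity`, the three-feature vectors, and the choice to return 0 for a zero vector are assumptions made for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors (e.g. tempo, energy, danceability)
song_x = [1.0, 2.0, 3.0]
song_y = [2.0, 4.0, 6.0]   # same direction as song_x, twice the magnitude
song_z = [3.0, 0.0, -1.0]  # orthogonal to song_x (dot product is 0)

print(cosine_similarity(song_x, song_y))  # approximately 1
print(cosine_similarity(song_x, song_z))
```

Because `song_y` is a scaled copy of `song_x`, the two score as near-identical, which is why cosine similarity suits features measured on different overall scales.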

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import heapq
from scipy import stats

# 1. Music Recommendation System
class MusicRecommender:
    def __init__(self, user_data, song_features):

        #Initialize the recommender with user data and song features

        #Parameters:
        # user_data: DataFrame with columns [User_ID, Song_ID, Rating]
        # song_features: DataFrame with columns [Song_ID, Feature_1, Feature_2, ...]

        self.user_data = user_data
        self.song_features = song_features

    def calculate_cosine_similarity(self, song1, song2):

        #Calculate cosine similarity between two songs

        #Parameters:
        #- song1: Feature vector of first song
        #- song2: Feature vector of second song

        #Returns:
        #- Cosine similarity score

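        # Cast to float so object-dtype rows (e.g. from iterrows) work with NumPy
        song1 = np.asarray(song1, dtype=float)
        song2 = np.asarray(song2, dtype=float)

        # Cosine similarity is undefined for a zero vector; treat it as 0
        norm_product = np.linalg.norm(song1) * np.linalg.norm(song2)
        if norm_product == 0:
            return 0.0

        # Dot product divided by the product of the vector lengths
        return float(np.dot(song1, song2) / norm_product)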

    def recommend_songs(self, user_id, top_n=5):

        #Recommend top N songs for a given user

        #Parameters:
        #- user_id: ID of the user to recommend songs for
        #- top_n: Number of recommendations to return

        #Returns:
        #- List of recommended song IDs

        # Get user's highest-rated songs
        user_songs = self.user_data[self.user_data['User_ID'] == user_id]
        top_rated_songs = user_songs.nlargest(3, 'Rating')['Song_ID'].tolist()

        # Songs the user has already rated are excluded from the candidates
        rated_ids = set(user_songs['Song_ID'])

        # Score each unrated song by its best similarity to any top-rated song
        recommendations = {}
        for song_id in top_rated_songs:
            # Get features of the top-rated song; skip it if none are available
            match = self.song_features[self.song_features['Song_ID'] == song_id]
            if match.empty:
                continue
            song_features = match.iloc[0, 1:].values

            # Compare with all songs the user has not rated
            for _, other_song in self.song_features.iterrows():
                other_id = other_song['Song_ID']
                if other_id in rated_ids:
                    continue
                similarity = self.calculate_cosine_similarity(song_features, other_song.values[1:])
                # Cosine similarity lies in [-1, 1], so -1 is a safe floor
                recommendations[other_id] = max(recommendations.get(other_id, -1.0), similarity)

        # Return top N recommendations
        return sorted(recommendations, key=recommendations.get, reverse=True)[:top_n]

# **2. Optimize Data Structure for Play Count Analysis**

Given a list of songs with their play counts, write a function to efficiently find the top k most-played songs.
Optimize for large datasets by using an appropriate data structure, such as a heap.
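
A heap bounded to k entries finds the top k of n songs in O(n log k) time, compared with O(n log n) for a full sort. Python's standard library packages this idea as `heapq.nlargest`. The block below is a minimal sketch, assuming play counts are stored in a dictionary; the function name `top_k_with_nlargest` and the sample counts are hypothetical.

```python
import heapq

def top_k_with_nlargest(play_counts, k):
    """Return the k most-played song IDs, most played first."""
    # Rank (song, count) pairs by count and keep only the k largest
    top = heapq.nlargest(k, play_counts.items(), key=lambda item: item[1])
    return [song for song, _ in top]

# Hypothetical play counts
play_counts = {"song_a": 120, "song_b": 45, "song_c": 300, "song_d": 87}
print(top_k_with_nlargest(play_counts, 2))  # ['song_c', 'song_a']
```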

In [None]:
# 2. Top K Most-Played Songs using Heap

def find_top_k_songs(play_counts, k):

    #Efficiently find top k most-played songs using a heap

    #Parameters:
    #- play_counts: Dictionary of {song_id: play_count}
    #- k: Number of top songs to return

    #Returns:
    #- List of k most-played songs

    # Use a min-heap of size k
    heap = []
    for song, count in play_counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, song))
        else:
            heapq.heappushpop(heap, (count, song))

    # Return songs sorted in descending order of play count
    return [song for _, song in sorted(heap, reverse=True)]

# **3. Random Walk Simulation**

Simulate a random walk in 2D space to visualize the trajectory of a user navigating through a playlist in a probabilistic manner.

Deliverables:
- A function `random_walk(n)` that simulates n steps and visualizes the walk using Matplotlib.
- Display the Euclidean distance from the origin after n steps.
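
As a sanity check on the reported distance: for a walk of n independent unit steps with no preferred direction, the expected squared distance from the origin is n, so the root-mean-square distance is sqrt(n). The block below is a rough sketch that estimates this by averaging over many walks without plotting. The function name `rms_distance`, the number of walks, and the seed are assumptions chosen for illustration.

```python
import math
import random

def rms_distance(n_steps, n_walks=2000, seed=0):
    """Root-mean-square end-to-end distance over many lattice random walks."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    total_sq = 0
    for _ in range(n_walks):
        x = y = 0
        for _ in range(n_steps):
            dx, dy = rng.choice(moves)
            x += dx
            y += dy
        total_sq += x * x + y * y  # squared distance of this walk's endpoint
    return math.sqrt(total_sq / n_walks)

n = 100
print(f"simulated RMS distance: {rms_distance(n):.2f}, sqrt(n): {math.sqrt(n):.2f}")
```

A single walk can end far from this value; only the average over many walks settles near sqrt(n).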

In [None]:
# 3. Random Walk Simulation
def random_walk(n):
    """
    Simulate a random walk in 2D space

    Parameters:
    - n: Number of steps to take

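    Returns:
    - x, y: coordinates of the final position
    - final_distance: Euclidean distance of the final position from the origin

    Side effect: displays the trajectory plot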
    """
    # Initialize starting point
    x, y = 0, 0

    # Track trajectory
    xs, ys = [x], [y]

    # Perform random walk
    for _ in range(n):
        # Random step: can move in any of 4 directions
        dx, dy = random.choice([(0,1), (0,-1), (1,0), (-1,0)])
        x += dx
        y += dy
        xs.append(x)
        ys.append(y)

    # Calculate and display Euclidean distance from origin
    final_distance = np.sqrt(x**2 + y**2)
    print(f'Euclidean distance from origin after {n} steps: {final_distance:.2f}')

    # Visualize walk
    plt.figure(figsize=(8, 6))
    plt.plot(xs, ys, marker='o')
    plt.title(f'Random Walk ({n} Steps)')
    plt.xlabel('X Coordinate')
    plt.ylabel('Y Coordinate')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    return x, y, final_distance

# **4. Hypothesis Testing for User Ratings**

Perform a statistical analysis to determine if the average rating of songs differs significantly between two user groups.
Clearly state your null and alternative hypotheses and perform a two-sample t-test using Python.
Summarize your findings with a short interpretation of the results.
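
Stated explicitly, the null hypothesis H0 is that the two groups have the same mean rating, and the alternative H1 is that the means differ (a two-sided test). With its default `equal_var=True`, `scipy.stats.ttest_ind` uses the pooled-variance (Student's) t statistic. The block below is a minimal sketch that computes that statistic by hand with the standard library; the function name `pooled_t_statistic` and both rating samples are hypothetical.

```python
import math
import statistics

def pooled_t_statistic(sample1, sample2):
    """Student's two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance is the sample variance (divides by n - 1)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

# Hypothetical ratings for two user groups
group_a = [4, 5, 3, 4, 5]
group_b = [2, 3, 3, 2, 4]
t_stat, df = pooled_t_statistic(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df}")
```

The p-value is then the two-sided tail probability of a t distribution with `df` degrees of freedom. If the groups may have unequal variances, Welch's test (`equal_var=False` in SciPy) is the usual alternative.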

In [None]:
# 4. Hypothesis Testing for User Ratings
def user_ratings_hypothesis_test(group1_ratings, group2_ratings, alpha=0.05):
    """
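    Perform a two-sample t-test on user ratings

    H0 (null): the mean rating is the same in both groups.
    H1 (alternative): the mean ratings differ (two-sided).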

    Parameters:
    - group1_ratings: Ratings for first user group
    - group2_ratings: Ratings for second user group
    - alpha: Significance level

    Returns:
    - t-statistic, p-value, and interpretation
    """
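    # Perform two-sample t-test (SciPy's default assumes equal variances;
    # pass equal_var=False for Welch's test if that assumption is doubtful)
    t_statistic, p_value = stats.ttest_ind(group1_ratings, group2_ratings)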

    # Interpret results
    if p_value < alpha:
        conclusion = "Reject the null hypothesis. There is a significant difference in mean ratings between the groups."
    else:
        conclusion = "Fail to reject the null hypothesis. No significant difference in mean ratings between the groups."

    return t_statistic, p_value, conclusion


# **5. Math Puzzle: Probability of Consecutive Songs**

Calculate the probability that two favorite songs (Song A and Song B) are played consecutively in a playlist of 10 songs.
Write a Python simulation to validate the theoretical probability.
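
The theoretical value follows from a counting argument. Treat Song A and Song B as a single block: the block and the other n - 2 songs can be arranged in (n - 1)! ways, and the block has 2 internal orders, giving 2(n - 1)! favorable playlists out of n!. The probability is therefore 2/n, or 0.2 for a playlist of 10 songs. This assumes a uniformly random shuffle and counts either order (A then B, or B then A) as consecutive. The block below is a small sketch that confirms the figure exactly by enumerating the possible slot pairs for the two songs; the function name `adjacent_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def adjacent_probability(playlist_size):
    """Exact probability that two given songs occupy neighbouring slots."""
    # Every ordered pair of distinct slots (slot of A, slot of B) is equally likely
    slot_pairs = list(permutations(range(playlist_size), 2))
    adjacent = sum(1 for a, b in slot_pairs if abs(a - b) == 1)
    return Fraction(adjacent, len(slot_pairs))

print(adjacent_probability(10))  # 1/5
```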

In [None]:
# 5. Probability of Consecutive Songs
def consecutive_songs_probability(playlist_size=10):
    """
    Calculate probability of two favorite songs being played consecutively

    Parameters:
    - playlist_size: Total number of songs in playlist

    Returns:
    - Theoretical probability
    - Simulated probability
    """
    # Theoretical probability: treat the two favorites as one block, giving
    # 2 * (n - 1)! favorable orderings out of n!, which simplifies to 2 / n
    theoretical_prob = 2 / playlist_size

    # Simulation
    num_simulations = 100000
    consecutive_count = 0

    for _ in range(num_simulations):
        # Create a random playlist
        playlist = list(range(playlist_size))
        np.random.shuffle(playlist)

        # Check for consecutive favorite songs
        for i in range(len(playlist) - 1):
            if playlist[i] in [0, 1] and playlist[i+1] in [0, 1]:
                consecutive_count += 1
                break

    simulated_prob = consecutive_count / num_simulations

    return theoretical_prob, simulated_prob