# Trends Over Time

___
**Atoosa Rashid** 

[GitHub](https://github.com/atoosa-r/)

[LinkedIn](https://www.linkedin.com/in/atoosarashid/) 
___


## Table of Contents

1. [Introduction](#introduction)  

2. [Behavioural Insights](#Behavioural-Insights)
   - [Skipped Tracks](#Skipped-Tracks)
   - [Start and End Reasons](#Start-and-End-Reasons)

3. [Country-Based Listening Trends](#Country-Based-Listening-Trends)

4. [Daily and Weekly Patterns](#daily-and-weekly-patterns)  
   - [Days You Listen the Most](#days-you-listen-the-most)  
   - [Hourly Trends](#hourly-trends)
   - [Weekday vs. Weekend](#weekday-vs-weekend)  

5. [Monthly and Seasonal Trends](#monthly-and-seasonal-trends)  
   - [Monthly Trends](#monthly-trends)  
   - [Seasonal Trends](#seasonal-trends)  
   - [Monthly Peak Days](#monthly-peak-days)

6. [Conclusion](#Conclusion)
___

## Introduction

### Purpose
This notebook aims to uncover trends and patterns in individual Spotify streaming behavior over time. By analyzing listening habits, the goal is to provide insights into how preferences and behaviors change on a daily, weekly, monthly, and yearly basis. 

### Scope
This analysis focuses on key metrics such as:
- **`ms_played` (milliseconds played)** – The total duration a track has been listened to, which helps measure engagement.
- **Track counts** – The number of times a song has been played, skipped, or replayed.

The goal is to identify peak listening times, seasonal variations, and long-term trends in listening behavior.

---

In [82]:
# Importing libraries: 

import numpy as np                   
import pandas as pd                   
import matplotlib.pyplot as plt       
import seaborn as sns                 
import os
from wordcloud import WordCloud
      

In [None]:
# Importing the cleaned dataset: 

df = pd.read_csv("/cleaned_spotify_streaming.csv") #insert the path of the cleaned dataset

df.head()

___
## Behavioural Insights


### Skipped Tracks

**Objective**: The goal of this section is to analyze skipping behavior to understand when and why tracks are skipped. Skipping does not always indicate dislike; instead, it may reflect selective listening habits, mood-based preferences, or navigation patterns within a playlist.

**Approach**:
- **Identify Skipped Tracks**: Filter the dataset to include only tracks that were skipped.
- **Track-Level Analysis**: Identify the most frequently skipped tracks and compare them to overall listening trends.

In [84]:
# Counting and sorting most skipped tracks:

skipped_tracks = df[df['skipped'] == True]

track_skip_counts = skipped_tracks.groupby('master_metadata_track_name').size().reset_index(name='skip_count')

top_skipped_tracks = track_skip_counts.sort_values(by='skip_count', ascending=False).head(10)

In [None]:
# Plotting bar chart for most skipped tracks

plt.figure(figsize=(10, 5))

sns.barplot(data=top_skipped_tracks, x='skip_count', y='master_metadata_track_name', color='royalblue')

plt.xlabel('\n Skip Count \n ')
plt.ylabel('\n Track Name \n')
plt.title('\n Top 10 Most Skipped Tracks \n')

plt.show()

Interestingly, the most skipped tracks are also some of the most frequently played. This suggests that skipping does not necessarily indicate dislike. Instead, it may reflect different listening behaviors, such as:
- Replaying favourites often: Skipping may happen when you're navigating to a specific part of the track or restarting it.
- Mood-based listening: You might love a song but skip it when it doesn’t fit your current vibe.
- Playlist structure: If these tracks are in frequently played playlists, they may get skipped when shuffling.
- Partial listening: You might start a track but skip before it ends, meaning it still racks up high play counts.
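
To check whether heavily skipped tracks are also heavily played, skip counts can be set against play counts. The following is a minimal sketch on made-up data, not part of the original analysis; the `sample` frame and the `skip_rate` column are illustrative names, and it assumes `skipped` is a boolean column.

```python
import pandas as pd

# Toy data standing in for the cleaned dataset (values are illustrative only)
sample = pd.DataFrame({
    "master_metadata_track_name": ["A", "A", "A", "A", "B", "B", "C"],
    "skipped": [True, True, False, False, True, False, False],
})

# Count plays and skips per track in one pass
track_stats = (
    sample.groupby("master_metadata_track_name")["skipped"]
    .agg(play_count="size", skip_count="sum")
)

# Share of plays that ended in a skip (hypothetical helper column)
track_stats["skip_rate"] = track_stats["skip_count"] / track_stats["play_count"]

track_stats = track_stats.sort_values("skip_count", ascending=False)
print(track_stats)
```

A track with a high `skip_count` but a moderate `skip_rate` is simply played a lot, which is consistent with the interpretation above.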


____
### Start and End Reasons
**Objective**: Analyze playback behavior by looking at the reasons tracks are started or ended (e.g., user skip, forward button, autoplay).  
**Approach**: Group and count occurrences of `reason_start` and `reason_end` to identify common patterns.  

#### Key Questions
- **Are tracks primarily started through autoplay or user-driven actions? This could reveal differences between passive and intentional listening habits.**
- **Do tracks frequently end due to manual skipping rather than natural completion? Identifying this trend may provide insights into user engagement and content preferences.**


In [86]:
# Count occurrences of start reasons
start_reason_counts = df['reason_start'].value_counts().reset_index()
start_reason_counts.columns = ['reason_start', 'count']

# Count occurrences of end reasons
end_reason_counts = df['reason_end'].value_counts().reset_index()
end_reason_counts.columns = ['reason_end', 'count']

In [None]:
# Plotting bar chart for start reasons
plt.figure(figsize=(9, 5))

sns.barplot(data=start_reason_counts, x='reason_start', y='count', color='skyblue')

plt.xlabel('Start Reason')
plt.ylabel('Count')
plt.title('\n Distribution of Track Start Reasons \n')
plt.xticks(rotation=45, ha='right')

plt.show()

# Plotting bar chart for end reasons
plt.figure(figsize=(9, 5))

sns.barplot(data=end_reason_counts, x='reason_end', y='count', color='royalblue')

plt.xlabel('End Reason')
plt.ylabel('Count')
plt.title('\n Distribution of Track End Reasons \n')
plt.xticks(rotation=45, ha='right')

plt.show()
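
The key questions above are about proportions rather than raw counts, so it may help to express each reason as a share of all streams. This is a small sketch on toy data; the reason labels used here are placeholders and may not match the labels in your export.

```python
import pandas as pd

# Toy data; the reason labels are illustrative placeholders
sample = pd.DataFrame({
    "reason_start": ["trackdone", "clickrow", "trackdone", "fwdbtn"],
    "reason_end": ["trackdone", "fwdbtn", "fwdbtn", "trackdone"],
})

# normalize=True returns fractions; multiply by 100 for percentages
start_share = sample["reason_start"].value_counts(normalize=True).mul(100).round(1)
end_share = sample["reason_end"].value_counts(normalize=True).mul(100).round(1)

print("Start reasons (% of streams):")
print(start_share)
print("\nEnd reasons (% of streams):")
print(end_share)
```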

___
## Country-Based Listening Trends

**Objective**: Identify top countries for listening activity, review any unknown country streams, and compare listening volumes across regions.

**Approach**: Aggregate `ms_played` or track counts by `conn_country`.  

In [None]:
# Viewing streams by country

country_counts = df['conn_country'].value_counts()
print("Value counts for `conn_country` column:")
print(country_counts)

In [None]:
# Plotting the distribution of streams by country 

plt.figure(figsize=(9,4))
country_counts.plot(kind='bar', color='royalblue', edgecolor='black')

plt.title(' \n Distribution of Streams by Country \n')
plt.xlabel(' \n Country Code \n')
plt.ylabel('\n Number of Streams \n')
plt.xticks(rotation=45, fontsize=9)

plt.show()

In [None]:
# Filtering rows where the country code is unknown ('ZZ')

filtered_country_ZZ = df[df['conn_country'] == "ZZ"]
filtered_country_ZZ.head()

The presence of streams from an unknown country ("ZZ") suggests that Spotify was unable to determine the listener’s geographic location.  
This could be due to several reasons:  
- The use of VPNs or proxy servers, which mask the actual country of the listener.  
- Incomplete or missing geo-data in Spotify’s logging system.  
- Streams coming from devices or networks that do not transmit location data (e.g., offline mode or restricted territories).  
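
The approach mentions aggregating `ms_played` as well as counts, and it can be useful to know how much of the history falls under "ZZ". A minimal sketch on invented data follows; the `sample` frame and variable names are illustrative assumptions.

```python
import pandas as pd

# Toy data standing in for the cleaned dataset
sample = pd.DataFrame({
    "conn_country": ["CA", "CA", "ZZ", "US", "CA"],
    "ms_played": [60_000, 120_000, 30_000, 60_000, 30_000],
})

# Total listening time per country, converted from milliseconds to minutes
minutes_by_country = (
    sample.groupby("conn_country")["ms_played"].sum() / (1000 * 60)
).sort_values(ascending=False)

# Share of streams and of listening time with an unknown country code
unknown_mask = sample["conn_country"].eq("ZZ")
unknown_stream_share = unknown_mask.mean() * 100
unknown_time_share = (
    sample.loc[unknown_mask, "ms_played"].sum() / sample["ms_played"].sum() * 100
)

print(minutes_by_country)
print(f"Unknown-country streams: {unknown_stream_share:.1f}% of plays, "
      f"{unknown_time_share:.1f}% of listening time")
```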

____
## Daily and Weekly Patterns

### Days You Listen the Most

**Objective**: Identify which days of the week have the highest and lowest listening activity.  
**Approach**:  
- Aggregate total and average listening time for each weekday.  
- Compare day-to-day variations to determine consistency in engagement.  


In [91]:
df['ts'] = pd.to_datetime(df['ts'])

df['weekday'] = df['ts'].dt.day_name() # Extract the weekday name from the timestamp
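
Weekday and hour values depend on the time zone of `ts`. If the timestamps are stored in UTC (an assumption worth verifying against your own export), late-evening listening can land on the following day. The sketch below shows one way to convert to local time before extracting these fields; `America/Toronto` is a hypothetical zone chosen purely for illustration.

```python
import pandas as pd

# Toy timestamps in UTC (illustrative only)
sample = pd.DataFrame({"ts": ["2023-07-01T02:30:00Z", "2023-07-01T15:00:00Z"]})
sample["ts"] = pd.to_datetime(sample["ts"], utc=True)

# Hypothetical local time zone; replace with your own
local_tz = "America/Toronto"
sample["ts_local"] = sample["ts"].dt.tz_convert(local_tz)

# Extract weekday and hour from the local timestamp
sample["weekday_local"] = sample["ts_local"].dt.day_name()
sample["hour_local"] = sample["ts_local"].dt.hour

print(sample[["ts", "ts_local", "weekday_local", "hour_local"]])
```

In this toy example the first stream is a Saturday in UTC but a Friday evening in local time.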

In [92]:
# Aggregate total listening time per weekday (convert ms to hours)

listening_by_day = df.groupby('weekday')['ms_played'].sum() / (1000 * 60 * 60)  # Convert to hours

In [93]:
# Reordering days of the week for correct chronological order
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

listening_by_day = listening_by_day.reindex(weekday_order)

In [None]:
# Plotting the total listening time by day of the week: 

plt.figure(figsize=(10, 5))
plt.bar(listening_by_day.index, listening_by_day.values, color='skyblue')

plt.xlabel("\n Day of the Week \n")
plt.ylabel("\n Total Listening Time (Hours) \n")  # Updated to Hours
plt.title("\n Total Listening Time by Day of the Week\n")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()

In [95]:
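# Calculating the overall average listening time per active day (converted from milliseconds to minutes)
average_listening_per_day = df.groupby(df['ts'].dt.date)['ms_played'].sum().mean() / (1000 * 60)  

# Counting how many distinct dates fall on each weekday
days_per_weekday = df.groupby('weekday')['ts'].apply(lambda s: s.dt.date.nunique())

# Aggregating: total per weekday divided by the number of dates for that weekday
# (dividing by the count of all dates would understate every weekday's average)
average_listening_by_day = df.groupby('weekday')['ms_played'].sum() / days_per_weekday / (1000 * 60)

# Reordering 
average_listening_by_day = average_listening_by_day.reindex(weekday_order)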


In [None]:
# Plotting the average listening time per day by weekday graph:

plt.figure(figsize=(10, 5))
plt.bar(average_listening_by_day.index, average_listening_by_day.values, color='royalblue')

plt.xlabel("\nDay of the Week\n")
plt.ylabel("\n Average Listening Time per Day (Minutes)\n")
plt.title("\nAverage Listening Time per Day by Weekday\n")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()


____
### Hourly Trends  
**Objective**: Determine peak listening hours throughout the day.  
**Approach**:  
- Extract the hour from the `ts` column.  
- Aggregate and visualize listening activity by hour.  
- Identify trends such as morning vs. evening listening behavior.  

In [97]:
# Creating hourly heatmap of weekday listening time
df['hour'] = df['ts'].dt.hour

heatmap_data = df.pivot_table(index='weekday', columns='hour', values='ms_played', aggfunc='sum') / (1000 * 60 * 60) 

heatmap_data = heatmap_data.reindex(weekday_order)

In [None]:
# Plotting the listening time heatmap

plt.figure(figsize=(12, 6))
sns.heatmap(heatmap_data, cmap='Blues', annot=False, linewidths=0.6)

plt.xlabel("\n Hour of the Day \n")
plt.ylabel("\n Day of the Week \n")
plt.title("\n Listening Time Heatmap (Hours) \n")

plt.show()
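
For the morning vs. evening comparison mentioned in the approach, hours can be grouped into broader parts of the day. This is a sketch on toy data; the bin edges and labels are arbitrary choices made for illustration, not definitions used elsewhere in this notebook.

```python
import pandas as pd

# Toy data standing in for the cleaned dataset
sample = pd.DataFrame({
    "ts": pd.to_datetime([
        "2023-01-02 07:15:00", "2023-01-02 13:30:00", "2023-01-02 20:45:00",
        "2023-01-03 23:10:00", "2023-01-03 02:00:00",
    ]),
    "ms_played": [180_000, 240_000, 200_000, 60_000, 120_000],
})
sample["hour"] = sample["ts"].dt.hour

# Illustrative bins: [0, 6) Night, [6, 12) Morning, [12, 18) Afternoon, [18, 24) Evening
bins = [0, 6, 12, 18, 24]
labels = ["Night", "Morning", "Afternoon", "Evening"]
sample["part_of_day"] = pd.cut(sample["hour"], bins=bins, labels=labels, right=False)

# Total listening time per part of day, in minutes
minutes_by_part = (
    sample.groupby("part_of_day", observed=False)["ms_played"].sum() / (1000 * 60)
)
print(minutes_by_part)
```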

____
### Weekday vs. Weekend  
**Objective**: Analyze how listening behavior differs between workdays and leisure days.  
**Approach**:  
- Categorize listening sessions into **weekdays (Monday–Friday)** and **weekends (Saturday–Sunday)**.  
- Compare total listening time (`ms_played`) and track counts.  
- Identify whether more listening happens on weekdays or on weekends.  

In [99]:
# Creating a new column categorizing weekday vs. weekend
df['week_part'] = df['ts'].dt.weekday.map(lambda x: 'Weekend' if x >= 5 else 'Weekday')

# Aggregating total listening time for weekdays vs. weekends (convert ms to hours)
week_part_listening = df.groupby('week_part')['ms_played'].sum() / (1000 * 60 * 60)  # Convert to hours


In [None]:
# Plotting a pie chart showing the proportion of listening time on weekdays vs. weekends

plt.figure(figsize=(7, 7))

plt.pie(week_part_listening, labels=week_part_listening.index, autopct='%1.1f%%', colors=['royalblue', 'skyblue'], startangle=90, wedgeprops = {"edgecolor" : "black", 'linewidth': 2, 'antialiased': True}) 
plt.title("Proportion of Listening Time: Weekdays vs. Weekends")

plt.show()
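
Because there are five weekdays and only two weekend days, total listening time naturally leans toward weekdays. A per-day comparison, together with the track counts mentioned in the approach, can give a fairer picture. The sketch below uses toy data, and the column names `active_days` and `hours_per_active_day` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: three weekday streams (Mon-Wed) and three weekend streams (Sat, Sat, Sun)
sample = pd.DataFrame({
    "ts": pd.to_datetime([
        "2023-01-02 09:00", "2023-01-03 09:00", "2023-01-04 09:00",
        "2023-01-07 10:00", "2023-01-07 18:00", "2023-01-08 11:00",
    ]),
    "ms_played": [1_800_000, 1_800_000, 1_800_000, 3_600_000, 1_800_000, 1_800_000],
})
sample["week_part"] = np.where(sample["ts"].dt.weekday >= 5, "Weekend", "Weekday")
sample["date"] = sample["ts"].dt.date

# Total time, number of streams, and number of distinct days per category
summary = sample.groupby("week_part").agg(
    total_ms=("ms_played", "sum"),
    track_count=("ms_played", "count"),
    active_days=("date", "nunique"),
)
summary["total_hours"] = summary["total_ms"] / (1000 * 60 * 60)

# Normalise by the number of days that actually had listening activity
summary["hours_per_active_day"] = summary["total_hours"] / summary["active_days"]
print(summary[["total_hours", "track_count", "active_days", "hours_per_active_day"]])
```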


___

## Monthly and Seasonal Trends  

### Monthly Trends  
**Objective**: Identify trends in listening habits over different months of the year.  
**Approach**:  
- Group data by month and year to calculate total `ms_played` and track counts.  
- Identify months with the highest and lowest engagement.  
- Compare listening patterns year-over-year to find long-term trends.  

In [None]:
# Extract month and year for trend analysis
df['month_year'] = df['ts'].dt.to_period('M')

# Aggregating
monthly_listening = df.groupby('month_year')['ms_played'].sum() / (1000 * 60 * 60) 

# Convert PeriodIndex to datetime for plotting
monthly_listening.index = monthly_listening.index.to_timestamp()

In [None]:
# Plotting the monthly listening trends

plt.figure(figsize=(15, 5))

plt.plot(monthly_listening.index, monthly_listening.values, marker='o', linestyle='-', color='royalblue')
plt.xlabel("\n Month \n")
plt.ylabel("\n Total Listening Time (Hours) \n")
plt.title("\n Monthly Listening Trends Over Time \n")
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.7)

plt.show()
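
For the year-over-year comparison mentioned in the approach, one option is to pivot the data so that each year becomes a column and each calendar month a row. This is a minimal sketch on invented data; `yoy_hours` and `yoy_change` are illustrative names.

```python
import pandas as pd

# Toy data spanning two years
sample = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-15", "2022-02-10", "2023-01-20", "2023-01-25", "2023-02-05",
    ]),
    "ms_played": [3_600_000, 7_200_000, 3_600_000, 3_600_000, 1_800_000],
})
sample["year"] = sample["ts"].dt.year
sample["month"] = sample["ts"].dt.month

# Rows: calendar month, columns: year, values: total listening hours
yoy_hours = sample.pivot_table(
    index="month", columns="year", values="ms_played", aggfunc="sum", fill_value=0
) / (1000 * 60 * 60)

# Change in hours for the same month between the two years
yoy_change = yoy_hours[2023] - yoy_hours[2022]

print(yoy_hours)
print(yoy_change)
```

Calling `yoy_hours.plot()` on the result draws one line per year over the same month axis, which makes the same months easy to compare across years.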

___

### Seasonal Trends  
**Objective**: Analyze how listening patterns change based on seasonal variations.  
**Approach**:  
- Categorize months into seasons:  
  - **Winter** (Dec–Feb)  
  - **Spring** (Mar–May)  
  - **Summer** (Jun–Aug)  
  - **Fall** (Sep–Nov)  
- Calculate total `ms_played` and track counts per season.  
- Identify whether certain seasons correlate with higher music engagement.  


In [103]:
# Helper function that maps a month number (1-12) to its season

def get_season(month):
    if month in [12, 1, 2]:
        return "Winter"
    elif month in [3, 4, 5]:
        return "Spring"
    elif month in [6, 7, 8]:
        return "Summer"
    elif month in [9, 10, 11]:
        return "Fall"

In [104]:
# Add year and season columns
df['year'] = df['ts'].dt.year
df['month'] = df['ts'].dt.month
df['season'] = df['month'].apply(get_season)
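
Note that filtering on calendar `year` together with `season == "Winter"` groups January, February, and December of the same calendar year, which belong to two different winters. If a contiguous winter is preferred, December can be assigned to the following year's winter. The sketch below shows one way to do that; `season_year` is a hypothetical column that does not exist in the notebook.

```python
import pandas as pd

# Toy timestamps covering two Decembers and one summer date
sample = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-12-20", "2023-01-10", "2023-02-01", "2023-12-05", "2023-07-04",
    ])
})
sample["year"] = sample["ts"].dt.year
sample["month"] = sample["ts"].dt.month

# Hypothetical column: December counts toward the winter of the following year,
# so Dec 2022, Jan 2023 and Feb 2023 all share season_year 2023
sample["season_year"] = sample["year"] + (sample["month"] == 12).astype(int)

print(sample)
```

Filtering on `season_year` instead of `year` would then keep each winter in one piece, at the cost of the label no longer matching the calendar year for December.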

**Seasonal Festival Generator**

Plan your own music festival based on your streaming data! This generator takes a specific year from your streaming history and creates four seasonal concert lineups: Spring (March, April, May), Summer (June, July, August), Fall (September, October, November), and Winter (December, January, February). Each lineup features the 15 artists with the most streams (play counts) during that season.

In [105]:
# Festival generator function

def generate_festival_lineup(year):
    lineups = {}
    for season in ["Spring", "Summer", "Fall", "Winter"]:
        seasonal_data = df[(df['year'] == year) & (df['season'] == season)]
        
        # Getting the top 15 streamed artists for the season
        top_artists = (
            seasonal_data['master_metadata_album_artist_name']
            .value_counts()
            .head(15)
            .index.tolist()
        )
        
        lineups[season] = top_artists
    
    return lineups

In [None]:
# Generate and display the festival lineup for a specific year

year = 2023  # Change this to the desired year
festival_lineup = generate_festival_lineup(year)

# Displaying the festival lineups
for season, artists in festival_lineup.items():
    print(f"{season} Festival Lineup ({year}):")
    print(", ".join(artists) if artists else "No data available for this season")
    print("\n----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n")


**Visual Display of Festival Lineup** 

In [107]:
# Function to generate a word cloud for a given season

def generate_seasonal_wordcloud(festival_lineup, year):
    # Define colormap for each season
    season_colors = {
        "Spring": "PiYG",       
        "Summer": "Reds",       
        "Fall": "copper",      
        "Winter": "PuBu"        
    }

    for season, artists in festival_lineup.items():
        if artists:
            # Weight artists inversely to rank so the most-streamed artist appears largest
            artist_weights = {artist: len(artists) - rank for rank, artist in enumerate(artists)}
            
            # Creating WordCloud object with specified colormap for seasons 
            wordcloud = WordCloud(
                width=800, 
                height=400, 
                background_color='white',
                colormap=season_colors[season]
            ).generate_from_frequencies(artist_weights)

            # Plotting the word cloud
            plt.figure(figsize=(6, 4))
            plt.imshow(wordcloud, interpolation='bilinear')
            plt.axis("off")
            plt.title(f"{season} Festival Word Cloud ({year}) \n", fontsize=14)
            plt.show()
        else:
            print(f"No data available for {season} ({year}).")

In [None]:
# Generating the festival lineup and its seasonal word clouds

year = 2023  # Specify the year
festival_lineup = generate_festival_lineup(year)

generate_seasonal_wordcloud(festival_lineup, year)


______

### **Top Songs Based on Year and Season**

Want to know your top songs for a specific year? This generator breaks down your favorite tracks by season: Spring (March, April, May), Summer (June, July, August), Fall (September, October, November), and Winter (December, January, February). It lists the 10 songs you streamed most often during each season, along with how many times you played them.



In [None]:
# First we'll calculate the top songs overall for a given year 

# Function to get the top songs for a given year, ranked by total listening time (ms_played)
def top_songs_by_year(df, year, top_n=10):
    df_year = df[df['ts'].dt.year == year]
    
    top_songs = df_year.groupby('master_metadata_track_name')['ms_played'].sum().reset_index()
    
    top_songs = top_songs.sort_values(by='ms_played', ascending=False).head(top_n)
    
    return top_songs

# Example 
year_input = 2023  # Change this to any year you want
top_songs = top_songs_by_year(df, year_input)

# Displaying the top songs
print(f"Top 10 Songs of {year_input}:")
print(top_songs)

In [110]:
# We'll continue and calculate the top songs by season for a given year 

# Function for generating the top 10 songs based on year and season
def generate_top_songs_by_season(year):
    results = {}
    for season in ["Spring", "Summer", "Fall", "Winter"]:

        seasonal_data = df[(df['year'] == year) & (df['season'] == season)]
        
        # Get the top 10 streamed songs for the season along with their stream counts
        top_counts = seasonal_data['master_metadata_track_name'].value_counts().head(10)
        
        # Combine song names with their stream counts
        top_songs_with_counts = list(zip(top_counts.index.tolist(), top_counts.tolist()))
        
        # Add results to the dictionary
        results[season] = top_songs_with_counts
    
    return results

In [None]:
# Input the year and generate the top 10 songs for each season
year = 2023  # Change this to the desired year
top_songs_by_season = generate_top_songs_by_season(year)

# Displaying the results
for season, songs in top_songs_by_season.items():
    print(f"Top 10 Streamed Songs in {season} ({year}):")
    print("---------------------------------------------------------------------------------")

    if songs:
        for i, (song, count) in enumerate(songs, start=1):
            print(f"{i}. {song} - {count} streams")
    else:
        print("No data available for this season.")
    print("\n")


____

## Conclusion
This analysis offers a comprehensive look at personal streaming habits, highlighting trends in track skipping, listening times, and seasonal preferences. By exploring these patterns, users can gain deeper insights into their music consumption and uncover habits they may not have noticed before. Whether identifying peak listening periods or recognizing shifts in preferences, this exploration empowers users to engage with their streaming data in a meaningful way.