This project is a comprehensive SQL-based data analysis of Spotify music streaming data. Using PostgreSQL, we set up a database from raw CSV data and answered a variety of business-relevant questions to gain insights into track performance, audience behavior, and platform trends.
- π Why This Project Is Important
- π Dataset
- π§ Tools Used
- π Database Setup
- π Business Questions & Insights
- Q1. Tracks with over 1B Streams
- Q2. Total Comments on Licensed Tracks
- Q3. Albums and Artists
- Q4. Single Album Tracks
- Q5. Track Count per Artist
- Q6. Average Danceability per Album
- Q7. Top 5 Energetic Tracks
- Q8. Views and Likes of Official Videos
- Q9. Total Views per Album
- Q10. Spotify > YouTube Streams
- Q11. Top 3 Most-Viewed Tracks per Artist
- Q12. Above Average Liveness
- Q13. Energy Range per Album
- βοΈ Key Insights
- π§ What I Have Learned
- π Acknowledgments
- βοΈ License
This project showcases my ability to think like a Data Analyst by transforming raw Spotify data into structured insights using PostgreSQL. It demonstrates end-to-end analytical skillsβfrom designing a database and writing meaningful queries to interpreting results and communicating insights clearly. By focusing on real-world business questions, the project highlights how data can support strategic decisions, making it a valuable piece in my data analytics portfolio.
- Source: Provided via
spotify.csvfrom Sanjana Chaudhari - Contents: Information on tracks, albums, artists, views, likes, comments, danceability, energy, and streaming platforms (Spotify vs YouTube).
- PostgreSQL: Used to create and manage the relational database
- VSCode: Main development environment for writing SQL scripts and managing files
- pgAdmin: GUI client for executing SQL queries, managing tables, and inspecting results
- ChatGPT: For query optimization and insight helper
The database is set up in 3 steps:
1_database_setup.sqlβ Creates the database.2_table_setup.sqlβ Defines tables and schemas.3_data_populate.sqlβ Loads data from the CSV into the tables.
All SQL files are located in sql_files/.
Each query is structured to solve a business question, followed by a code snippet, visual result, and interpretation.
SELECT
track,
stream
FROM spotify
WHERE stream > 1000000000
ORDER BY stream DESC;
π‘ Insight: Identifies globally popular tracks with massive reach on Spotify.
SELECT SUM(comments) num_comments_with_licensed_being_true
FROM spotify
WHERE licensed = TRUE;
π‘ Insight: Provides insights into user engagement (via comments) for officially licensed tracks.
SELECT
album,
artist
FROM spotify
GROUP BY 1, 2;
π‘ Insight: Shows which artists are associated with each album, helpful for cataloging.
SELECT track
FROM spotify
WHERE album_type = 'single';
π‘ Insight: Highlights tracks released as singlesβoften promotional or hit-focused.
SELECT
artist,
COUNT(track) total_tracks
FROM spotify
GROUP BY 1
ORDER BY 2 DESC;
π‘ Insight: Reveals the most prolific artists in the dataset.
SELECT
album,
ROUND(AVG(danceablity)::NUMERIC, 2) avg_danceability
FROM spotify
GROUP BY 1
ORDER BY 2 DESC;
π‘ Insight: Indicates how danceable albums are on averageβuseful for mood or genre classification.
SELECT
track,
MAX(energy)
FROM spotify
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
π‘ Insight: Identifies the most energetic tracks, useful for workout or hype playlists.
SELECT
track,
SUM(views) total_views,
SUM(likes) total_likes
FROM spotify
WHERE official_video = TRUE
GROUP BY 1;
π‘ Insight: Measures performance of tracks that also have official videos.
SELECT
album,
track,
SUM(views) total_views
FROM spotify
GROUP BY 1, 2
ORDER BY 3 DESC;
π‘ Insight: Determines album popularity based on cumulative views.
WITH tracks_played_count AS (
SELECT
track,
COALESCE(SUM(CASE WHEN most_played_on = 'Spotify' THEN stream END), 0) spotify_count,
COALESCE(SUM(CASE WHEN most_played_on = 'YouTube' THEN stream END), 0) youtube_count
FROM spotify
GROUP BY 1
)
SELECT
track,
spotify_count
FROM tracks_played_count
WHERE
spotify_count > youtube_count
π‘ Insight: Shows which tracks perform better on Spotify compared to YouTube.
WITH artist_ranking AS (
SELECT
artist,
track,
SUM(views) total_views,
DENSE_RANK() OVER(PARTITION BY artist ORDER BY SUM(views) DESC) rank
FROM spotify
GROUP BY 1, 2
ORDER BY 1, 3 DESC
)
SELECT *
FROM artist_ranking
WHERE rank <= 3;
π‘ Insight: Lists the most viewed tracksβkey indicators of popularity.
SELECT
track,
artist,
liveness
FROM spotify
WHERE liveness > (
SELECT AVG(liveness)
FROM spotify
)
π‘ Insight: Identifies tracks with a strong "live performance" vibe.
β Q13. Calculate the difference between the highest and lowest energy values for tracks in each album.
WITH album_max_low_energies AS (
SELECT
album,
MAX(energy) highest_energy,
MIN(energy) lowest_energy
FROM spotify
GROUP BY 1
)
SELECT
album,
highest_energy - lowest_energy energy_diff
FROM album_max_low_energies
ORDER BY 2 DESC;
π‘ Insight: Measures the variability in energy levels across the dataset.
- Global Popularity Revealed: Multiple tracks exceeded one billion streams on Spotify, indicating massive global reach and audience engagement.
- Licensing and Engagement Go Hand in Hand: Licensed tracks tend to receive more user interaction (e.g., comments), suggesting that official releases are more trusted and promoted.
- Prolific Artists Identified: Certain artists contribute significantly more tracks to the platform, offering a strategic opportunity for label-focused marketing and promotion.
- Energy & Danceability = Popularity: Tracks with higher energy and danceability scores are consistently among the most streamed and viewed, making them ideal for upbeat playlists and social media use.
- Visual Content Drives Performance: Tracks with official videos not only have more visibility but also outperform in terms of views and likes, emphasizing the value of multimedia promotion.
- YouTube vs Spotify Trends: Some tracks perform better on Spotify than on YouTube, suggesting platform-specific audience preferences that can inform targeted advertising.
- Album-Level Trends: Aggregated album data (e.g., total views, average danceability) reveals performance at a higher level and can guide production and curation decisions.
- Live-Feeling Tracks Stand Out: Tracks with above-average liveness may cater to niche audiences that prefer the feel of live performances.
- Wide Variability in Track Energy: The range between the highest and lowest energy scores shows that the dataset includes a diverse mix of moods and stylesβideal for personalized playlists.
- Working on this project gave me hands-on experience with real-world data analysis tasks using SQL and PostgreSQL. Key learnings include:
- Data Modeling & Schema Design: I learned how to translate raw CSV data into a structured relational database, including setting up normalized tables and defining relationships.
- SQL Proficiency: I enhanced my ability to write efficient SQL queries to extract, aggregate, and analyze dataβanswering practical business questions using real data.
- Insight Generation: I practiced interpreting query results to generate actionable business insights, reinforcing the importance of storytelling in data analytics.
- Tool Integration: I gained experience working with PostgreSQL through pgAdmin for database management, and VSCode for scripting and developmentβtools I expect to use in a real analytics role.
- Attention to Data Quality: By cleaning and standardizing the dataset before analysis, I learned the importance of consistent formatting and validation for trustworthy insights.
- Portfolio & Communication Skills: This project helped me practice how to present data projects clearly and professionally in a portfolio, including documenting process, insights, and technical steps.
- Dataset creator: Sanjana Chaudhari
- Inspired by real-world music analytics and data storytelling practices
This project is open for educational and portfolio use.
