## Analysing Spotify Streaming History Data

- Using my own Spotify streaming data.
- Using the [`polars`](https://pola.rs) library for analysis.
- Columns:
    - `endTime`: Represents the time at which the track finished streaming _(in datetime format)_.
    - `artistName`: Name of the artist.
    - `trackName`: Name of the song/music.
    - `msPlayed`: How long the track was played _(in milliseconds)_.
- **WARNING**: The graphs and plots may not render on **GitHub**; if they don't, download the notebook and view it locally.
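
To make the column descriptions concrete, here is a minimal plain-Python sketch (no `polars`) of one record and how its fields can be read. The values come from one of the rows in the dataset preview; the names `raw`, `record`, `end_time` and `minutes_played` are illustrative, and I am assuming each export file is a JSON array of objects with these four keys.

```python
import json
from datetime import datetime

# One streaming record, shaped like the entries in the StreamingHistory files
# (assumption: each file is a JSON array of objects with these four keys).
raw = """
[
  {
    "endTime": "2022-12-02 14:47",
    "artistName": "Anuv Jain",
    "trackName": "Riha",
    "msPlayed": 238915
  }
]
"""

record = json.loads(raw)[0]

# endTime is a plain string in the export; parse it into a datetime.
end_time = datetime.strptime(record["endTime"], "%Y-%m-%d %H:%M")

# msPlayed is in milliseconds; 60_000 ms make one minute.
minutes_played = record["msPlayed"] / 60_000

print(end_time, record["artistName"], record["trackName"], round(minutes_played, 2))
```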

In [1]:
import typing

import polars as pl

### References

| ShortCode | Description     |
| :-------: | --------------- |
|  **T/A**  | Track/Artist    |
| **T/As**  | Track/Artist(s) |

## Some awesome insights

- [x] Top T/As in the whole dataset
- [x] Top T/As in each month
- [x] Monthly most listened Tracks and Artists
- [x] First day when a T/A was played
- [x] No. of distinct T/As listened to in each month/year
- [x] A T/A's streaming as a bar plot (shows how you streamed it over time)
- [x] Which daytime the user listens most, and to whom
- [x] Tracks listened to the most times in a day
- [x] Tracks streaming streak (by day/week)
- [x] T/As which were played only once
- [ ] Dates when the user did not play any track

In [2]:
pl.Config.set_fmt_str_lengths(40)

polars.config.Config

In [3]:
df = pl.concat(
    [
        pl.read_json("data.arv/Spotify Account Data/StreamingHistory0.json"),
        pl.read_json("data.arv/Spotify Account Data/StreamingHistory1.json"),
        pl.read_json("data.arv/Spotify-Data-5.Jan/StreamingHistory0.json"),
        pl.read_json("data.arv/Spotify-Data-5.Jan/StreamingHistory1.json"),
    ]
).unique()
print(df.shape)
df.head()

(29742, 4)


endTime,artistName,trackName,msPlayed
str,str,str,i64
"""2022-12-02 03:34""","""Pritam""","""Kahani (From ""Laal Singh Chaddha"")""",208539
"""2022-12-02 08:40""","""Zodiac Wave""","""Emptiness""",253100
"""2022-12-02 09:14""","""Jordan Seigel""","""The Journey Begins""",61048
"""2022-12-02 14:47""","""Anuv Jain""","""Riha""",238915
"""2022-12-03 03:59""","""Piyush Bhisekar""","""Woh Raaz""",277866


### Preprocessing

In [4]:
df = df.with_columns(
    pl.col("endTime").str.to_datetime(),
    pl.col("trackName")
    .str.replace(r"\(.*", "")
    .str.replace(r"-.*", "")
    .str.strip_chars_end(),
)
df.head()

endTime,artistName,trackName,msPlayed
datetime[μs],str,str,i64
2022-12-02 03:34:00,"""Pritam""","""Kahani""",208539
2022-12-02 08:40:00,"""Zodiac Wave""","""Emptiness""",253100
2022-12-02 09:14:00,"""Jordan Seigel""","""The Journey Begins""",61048
2022-12-02 14:47:00,"""Anuv Jain""","""Riha""",238915
2022-12-03 03:59:00,"""Piyush Bhisekar""","""Woh Raaz""",277866
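
To see what the three string steps do to a title, here is a rough plain-Python equivalent using the standard `re` module. It assumes that `str.replace` only touches the first match and that `strip_chars_end()` without arguments trims trailing whitespace. The helper name `clean_track_name` and the last two titles are made up for illustration; only the first title comes from the data.

```python
import re


def clean_track_name(name: str) -> str:
    """Mimic the three string steps of the preprocessing cell."""
    name = re.sub(r"\(.*", "", name, count=1)  # drop everything from the first "("
    name = re.sub(r"-.*", "", name, count=1)   # drop everything from the first "-"
    return name.rstrip()                       # trim trailing whitespace


print(clean_track_name('Kahani (From "Laal Singh Chaddha")'))
print(clean_track_name("Song Title - Remastered 2011"))
print(clean_track_name("Anti-Hero"))
```

The last call is worth noticing: a title that legitimately contains a hyphen gets cut at the hyphen as well, so this cleanup trades some accuracy for simpler names.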


### Preprocess `msPlayed`

In [5]:
df.get_column("msPlayed").describe().filter(
    pl.col("statistic").ne("count"),
).with_columns(
    pl.col("value").truediv(60_000).alias("value_min"),
)

statistic,value,value_min
str,f64,f64
"""null_count""",0.0,0.0
"""mean""",166146.338948,2.769106
"""std""",104243.160473,1.737386
"""min""",0.0,0.0
"""25%""",90606.0,1.5101
"""50%""",184090.0,3.068167
"""75%""",227784.0,3.7964
"""max""",3884711.0,64.745183


In [6]:
df.get_column("msPlayed").__truediv__(60_000).plot.kde("msPlayed")



- **Conclusion!**
  - Some tracks have `0ms` playtime.
  - Some streams are podcasts, because the `max` playtime is about **64 minutes**.
  - On the other hand, the **25th percentile is `1.5min`** and the **median (50%) is about `3.1min`**.
- **Why?**
  - The `64min` stream is most likely a podcast. To tackle this, we have to decide on a playtime threshold.

In [7]:
# top_k streams by="msPlayed"
df.top_k(20, by="msPlayed").with_columns(
    pl.col("msPlayed").truediv(60_000).cast(int).alias("minutesPlayed"),
).drop("endTime").sort("msPlayed", descending=True)

artistName,trackName,msPlayed,minutesPlayed
str,str,i64,i64
"""Prakhar Ke Pravachan""","""Aman Dhattarwal shut me up for 3 hours …",3884711,64
"""Talks at Google""","""Ep231""",2329387,38
"""Lex Fridman Podcast""","""#267 – Mark Zuckerberg: Meta, Facebook,…",1729189,28
"""Lex Fridman Podcast""","""#298 – Susan Cain: The Power of Introve…",1626436,27
"""The Open Podcast - Podcast By An Open L…","""E40 — Third Wave in India: Omicron Saga…",1232825,20
"""Dostcast""","""English Teacher ROASTS Indian Schools |…",1149724,19
"""Commander Karan Saxena ""","""Ek Vaigyanik Ki Talash : Chapter 2""",1074589,17
"""Commander Karan Saxena ""","""Ek Vaigyanik Ki Talash : Chapter 1""",1045081,17
"""Lovemotives""","""Boost Happiness and Positivity""",1024000,17
"""Lovemotives""","""Boost Happiness and Positivity""",1024000,17


In [8]:
# Streams greater than or equal to 20min
df.filter(
    pl.col("msPlayed").ge(60_000 * 20),
).drop("endTime").sort("msPlayed", descending=True)

artistName,trackName,msPlayed
str,str,i64
"""Prakhar Ke Pravachan""","""Aman Dhattarwal shut me up for 3 hours …",3884711
"""Talks at Google""","""Ep231""",2329387
"""Lex Fridman Podcast""","""#267 – Mark Zuckerberg: Meta, Facebook,…",1729189
"""Lex Fridman Podcast""","""#298 – Susan Cain: The Power of Introve…",1626436
"""The Open Podcast - Podcast By An Open L…","""E40 — Third Wave in India: Omicron Saga…",1232825


- **Conclusion!**
  - These long-playtime streams are podcasts.
  - We can eliminate streams with a playtime of `20min` or more.
  - Above you can see that some tracks (songs) have a `13-14min` playtime. I also listen to some songs that really are 13-14min long.
- **Why?**
  - Podcasts generally have a longer playtime, like `1hr+`.
  - Podcasts are not tracks or songs; they are a separate category, and I am not focusing on podcasts in this notebook.
  - Some streams have a long playtime but seem to be normal tracks (songs); maybe this happens because the user is streaming in **single-track repeat mode**.

In [9]:
# Streams shorter than 10sec
df.filter(
    pl.col("msPlayed").lt(10_000),
).drop("endTime").sort("msPlayed", descending=True)

artistName,trackName,msPlayed
str,str,i64
"""The Weeknd""","""I Heard You’re Married""",9991
"""Kendrick Lamar""","""LOVE. FEAT. ZACARI.""",9986
"""Sachet Tandon""","""Mehram""",9984
"""James Krivchenia""","""The Eternal Spectator""",9973
"""Bharat Chauhan""","""Usne Kaha Tha""",9972
"""Mahendra Kapoor""","""Tum Agar Saath Dene Ka Vada Karo""",9961
"""Sachin-Jigar""","""Apna Bana Le""",9960
"""Ram Sampath""","""Yeh Beetey Din""",9945
"""Anuv Jain""","""Alag Aasmaan""",9942
"""Mickey Singh""","""Feels Like""",9941


- **Conclusion!**
  - These streams are shorter than `10sec`.
  - We can **eliminate them** from the analysis because they act as outliers.
  - There are a lot of them (about `3.2k`).
- **Why?**
  - An accidental click on a track.
  - An unwanted track that came up during shuffle play and was skipped.

### Drop the streams identified above

In [10]:
prev_height = df.height
df = df.filter(
    # Keep only streams longer than 10sec (drops accidental plays)
    pl.col("msPlayed").gt(10_000),
    # Keep only streams shorter than 20min (drops podcasts)
    pl.col("msPlayed").lt(60_000 * 20),
)

new_height = df.height
print(f"We've dropped {prev_height - new_height}")

We've dropped 3176
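
The two thresholds boil down to a simple rule per stream. Here is a small plain-Python sketch of that rule, checked against a few `msPlayed` values that appear in the tables above. The names `MIN_MS`, `MAX_MS` and `keep_stream` are mine and not part of the notebook.

```python
MIN_MS = 10_000       # 10 seconds
MAX_MS = 20 * 60_000  # 20 minutes


def keep_stream(ms_played: int) -> bool:
    """Return True when a stream lies strictly between the two thresholds."""
    return MIN_MS < ms_played < MAX_MS


# msPlayed values taken from the outputs above.
samples = {
    "accidental skip": 9_991,
    "regular song": 208_539,
    "long but under 20min": 1_149_724,
    "podcast episode": 3_884_711,
}

for label, ms in samples.items():
    print(f"{label:>22}: {ms / 60_000:6.2f} min -> keep={keep_stream(ms)}")
```

Both comparisons are strict, so a stream of exactly `10_000` ms or exactly 20 minutes is dropped too.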


In [11]:
df.get_column("msPlayed").describe().with_columns(
    pl.col("value").truediv(60_000).name.suffix("_minutes")
)

statistic,value,value_minutes
str,f64,f64
"""count""",26566.0,0.442767
"""null_count""",0.0,0.0
"""mean""",185222.450877,3.087041
"""std""",87858.807315,1.464313
"""min""",10004.0,0.166733
"""25%""",138754.0,2.312567
"""50%""",193606.0,3.226767
"""75%""",233386.0,3.889767
"""max""",1149724.0,19.162067


In [12]:
df.get_column("msPlayed").__truediv__(60_000).plot.kde("msPlayed")

### Basic Info

In [13]:
_start = df["endTime"].dt.date().min()
_end = df["endTime"].dt.date().max()

print(f"Datetime range of dataset: ({_start:%d %B, %Y}) — ({_end:%d %B, %Y})")

Datetime range of dataset: (02 January, 2022) — (02 December, 2023)


### Distinct T/As present in dataset

In [14]:
# eg: 1234 tracks are played by the user
df.select(
    pl.col("artistName").n_unique(),
    pl.col("trackName").n_unique(),
    # Some tracks (songs) share the same name.
    # To tell them apart, use both artistName and trackName for the n_unique count.
    pl.col("artistName").add(pl.col("trackName")).n_unique().alias("uniqueTrackName"),
)

artistName,trackName,uniqueTrackName
u32,u32,u32
1541,3911,4180
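
Why the third count is larger than the second can be shown with a toy example in plain Python. All names here are invented for illustration. It also shows a small caveat of gluing the two strings together without a separator: two different pairs can collapse into the same string.

```python
# (artistName, trackName) pairs; all values are made up.
streams = [
    ("Artist A", "Home"),
    ("Artist B", "Home"),  # same title, different artist
    ("Artist A", "Home"),  # repeat play of the first track
    ("ab", "c"),
    ("a", "bc"),           # concatenates to the same string as ("ab", "c")
]

unique_titles = {track for _, track in streams}
unique_concat = {artist + track for artist, track in streams}
unique_pairs = set(streams)

print(len(unique_titles), len(unique_concat), len(unique_pairs))
```

Counting the pairs themselves is the safest of the three. With real artist and track names such a collision is unlikely, and putting a separator that cannot occur in the names between the two columns would rule it out.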


### No. of unique T/As (monthly)

In [15]:
(
    df.group_by(
        pl.col("endTime").dt.year().alias("year"),
        pl.col("endTime").dt.month().alias("month"),
    )
    .agg(
        pl.col("artistName", "trackName").n_unique(),
        pl.col("artistName")
        .add(pl.col("trackName"))
        .n_unique()
        .alias("uniqueTrackName"),
    )
    .sort("year", "month")
)

year,month,artistName,trackName,uniqueTrackName
i32,i8,u32,u32,u32
2022,1,213,370,374
2022,2,155,267,269
2022,3,115,168,171
2022,4,245,364,383
2022,5,178,300,304
2022,6,179,374,378
2022,7,272,527,542
2022,8,210,433,443
2022,9,192,374,381
2022,10,209,367,376


### Top T/As in dataset

In [16]:
# Top artists in dataset
(
    df.group_by("artistName")
    .agg(
        pl.col("msPlayed").sum().truediv(60_000).ceil(),
    )
    .top_k(10, by="msPlayed")
    .plot.bar("artistName", "msPlayed", title="Top 10 artists", rot=45)
)

In [17]:
# Top tracks in dataset
(
    df.group_by("artistName", "trackName")
    .agg(
        pl.col("msPlayed").sum().truediv(60_000).ceil(),
    )
    .top_k(10, by="msPlayed")
    .plot.bar(
        "trackName",
        "msPlayed",
        title="Top 10 tracks",
        rot=45,
    )
)
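
The aggregation behind both plots is "sum `msPlayed` per group, convert to minutes, round up, keep the largest". A minimal sketch of that logic with the standard library, on made-up numbers (the artists and durations below are not from the dataset):

```python
import math
from collections import Counter

# Illustrative (artistName, msPlayed) pairs.
streams = [
    ("Artist A", 200_000),
    ("Artist B", 180_000),
    ("Artist A", 250_000),
    ("Artist C", 30_000),
    ("Artist B", 61_000),
]

# Sum the milliseconds per artist.
total_ms = Counter()
for artist, ms in streams:
    total_ms[artist] += ms

# Convert to whole minutes (rounded up) and keep the two largest totals.
top = [(artist, math.ceil(ms / 60_000)) for artist, ms in total_ms.most_common(2)]
print(top)
```

Because of the rounding up, even a group with only a few seconds of playtime shows up as 1 minute.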

### Monthly top T/As

In [18]:
def plot_monthly_top(
    df: pl.DataFrame,
    name: typing.Literal["artistName", "trackName"],
    year: int,
    month: int,
):
    """Plot monthly top track or artist. Just choose year and month."""
    return (
        df.group_by(
            pl.col("endTime").dt.year().alias("year"),
            pl.col("endTime").dt.month().alias("month"),
            name,
        )
        .agg(
            pl.col("msPlayed").sum().truediv(60_000).ceil(),
        )
        .sort("year", "month", "msPlayed", descending=True)
        .group_by("year", "month")
        .map_groups(lambda x: x.limit(10))
        .filter(
            pl.col("year").eq(year),
            pl.col("month").eq(month),
        )
        .plot.bar(
            name,
            "msPlayed",
            title=f"Top 10 {name.removesuffix('Name')}s in {year}-{month:02d}",
            rot=45,
        )
    )

In [19]:
plot_monthly_top(df, "artistName", 2023, 10)
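
The function above ranks every month and only then filters to the requested one. The same top list can be obtained by looking at the requested month alone, which is sketched here in plain Python on invented rows. The names `rows`, `totals` and `monthly_top` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative (endTime, artistName, msPlayed) rows.
rows = [
    (datetime(2023, 10, 1, 9, 0), "Artist A", 240_000),
    (datetime(2023, 10, 3, 21, 30), "Artist B", 120_000),
    (datetime(2023, 10, 9, 7, 15), "Artist A", 60_000),
    (datetime(2023, 11, 2, 8, 0), "Artist B", 300_000),
]

# Total milliseconds per (year, month) and artist.
totals = defaultdict(lambda: defaultdict(int))
for end_time, artist, ms in rows:
    totals[(end_time.year, end_time.month)][artist] += ms


def monthly_top(year: int, month: int, n: int = 10):
    """Return up to n (artist, total_ms) pairs for one month, largest first."""
    month_totals = totals.get((year, month), {})
    ranked = sorted(month_totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


print(monthly_top(2023, 10))
```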

### First day when T/A was played

In [20]:
def streamed_first_time(
    df: pl.DataFrame,
    *,
    track: str | None = None,
    artist: str | None = None,
):
    """Return the first time the given track and/or artist was streamed."""
    if track is None and artist is None:
        raise ValueError("Must pass either track or artist, or both.")
    if track:
        df = df.filter(pl.col("trackName").eq(track))
    if artist:
        df = df.filter(pl.col("artistName").eq(artist))
    return df["endTime"].min()

In [21]:
streamed_first_time(df, track="Thank You", artist="Dido")

datetime.datetime(2023, 7, 28, 5, 8)

### A T/A streaming graph that shows how you streamed it over time

- Day-by-Day
- Week-by-Week
- Month-by-Month

In [22]:
(
    df.filter(
        pl.col("trackName").eq("Bloom"),
        pl.col("endTime").dt.year().eq(2023),
    )
    .group_by(
        pl.col("endTime").dt.month().alias("month"),
    )
    .agg(
        pl.sum("msPlayed").truediv(60_000).ceil(),
    )
    .sort("month")
    .plot.bar("month", "msPlayed")
)

### Daytime insights

In [23]:
# Extract "daytime" feature from "endTime"
df = df.with_columns(
    pl.col("endTime")
    .dt.hour()
    .cut(
        # These breaks & labels are from my POV
        breaks=[0, 4, 10, 15, 19, 23],
        labels=[
            "Night",
            "Mid-Night",
            "Morning",
            "Afternoon",
            "Evening",
            "Study-Hour",
            "Night",
        ],
        left_closed=True,
    )
    # .cast(pl.Categorical)  # TODO: Learn how to do ordering on Categorical dtype
    .alias("daytime")
)
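
To check which hour ends up in which bucket, here is the same mapping written with `bisect` from the standard library. It assumes that `left_closed=True` makes every bin include its lower break and exclude its upper one; the helper name `daytime` is mine.

```python
from bisect import bisect_right

breaks = [0, 4, 10, 15, 19, 23]
labels = [
    "Night",       # hours below 0: never happens for hour-of-day values
    "Mid-Night",   # 00:00 - 03:59
    "Morning",     # 04:00 - 09:59
    "Afternoon",   # 10:00 - 14:59
    "Evening",     # 15:00 - 18:59
    "Study-Hour",  # 19:00 - 22:59
    "Night",       # 23:00 - 23:59
]


def daytime(hour: int) -> str:
    # bisect_right counts the breaks that are <= hour,
    # which is exactly the index of the left-closed bin.
    return labels[bisect_right(breaks, hour)]


for hour in (0, 3, 4, 9, 10, 14, 15, 18, 19, 22, 23):
    print(f"{hour:02d}:00 -> {daytime(hour)}")
```

With these breaks the first `"Night"` label is only a placeholder, and "Night" effectively covers just the 23:00 hour.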

### Streaming pattern of T/A during `"daytime"`

In [24]:
(
    df.filter(
        pl.col("artistName").eq("Mukesh"),
    )
    .group_by("daytime")
    .agg(
        pl.sum("msPlayed").truediv(60_000).ceil(),
    )
    .sort("daytime")
    .plot.line("daytime", "msPlayed")
)

### Streaming pattern of T/As during `"daytime"`

In [25]:
artists = [
    "Mukesh",
    "Kishore Kumar",
    "Lata Mangeshkar",
    "Salma Agha",
    "Amit Trivedi",
]

In [26]:
(
    df.filter(
        pl.col("artistName").is_in(artists),
    )
    .group_by("artistName", "daytime")
    .agg(
        pl.sum("msPlayed").truediv(60_000).ceil(),
    )
    .sort("daytime")
    .plot.line("daytime", "msPlayed", by="artistName", legend="top")
)

In [27]:
tracks = [
    "Bloom",
    "Awaraa Ho",
    "Shauq",
    "Dil Mere",
]

In [28]:
(
    df.filter(
        pl.col("trackName").is_in(tracks),
    )
    .group_by("trackName", "daytime")
    .agg(
        pl.sum("msPlayed").truediv(60_000).ceil(),
    )
    .sort("daytime")
    .plot.line("daytime", "msPlayed", by="trackName", legend="top")
)

### Streaming time series

- By Day
- By Week
- By Month

In [29]:
(
    df.filter(
        pl.col("endTime").dt.year().eq(2023),
    )
    .group_by(
        pl.col("endTime").dt.date().alias("date"),
        pl.col("endTime").dt.month().alias("month"),
    )
    .agg(
        pl.col("msPlayed").sum().truediv(60_000).ceil(),
    )
    .plot.line("date", "msPlayed", by="month", width=1000)
)

In [30]:
(
    df.group_by(
        pl.col("endTime").dt.year().alias("year"),
        pl.col("endTime").dt.ordinal_day().alias("ordinal_day"),
    )
    .agg(
        pl.col("msPlayed").sum().truediv(60_000).ceil(),
    )
    .sort("ordinal_day")
    .plot.line(
        "ordinal_day",
        "msPlayed",
        by="year",
        xlabel="Day Of Year",
        ylabel="Tracks Streams (in Min.)",
        width=1000,
    )
)

In [31]:
(
    df.group_by(
        pl.col("endTime").dt.year().alias("year"),
        pl.col("endTime").dt.month().alias("month"),
    )
    .agg(
        pl.col("msPlayed").sum().truediv(60_000).ceil(),
    )
    .sort("month")
    .plot.line(
        "month",
        "msPlayed",
        by="year",
        ylabel="Tracks Streams (in Min.)",
        width=1000,
    )
)

In [32]:
(
    df.group_by(
        pl.col("endTime").dt.year().alias("year"),
        pl.col("endTime").dt.month().alias("month"),
    )
    .len()
    .sort("month")
    .plot.line("month", "len", by="year", ylabel="No. of Tracks", width=1000)
)

In [33]:
(
    df.group_by(
        pl.col("endTime").dt.year().alias("year"),
        pl.col("endTime").dt.week().alias("week"),
    )
    .len()
    .sort("week")
    .plot.line("week", "len", by="year", ylabel="No. of Tracks", width=1000)
)
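
A small caveat on week numbers: as far as I know, `dt.week()` returns the ISO week, while the `%W` format code counts weeks from the first Monday of the calendar year. Around New Year the two disagree, and pairing an ISO week with the calendar year can place the first days of January in week 52. The plain-Python sketch below shows the difference; `week_labels` is a hypothetical helper.

```python
from datetime import date


def week_labels(d: date):
    """Return (calendar year, ISO year, ISO week, %W week) for a date."""
    iso_year, iso_week, _ = d.isocalendar()
    return d.year, iso_year, iso_week, d.strftime("%W")


for d in (date(2023, 1, 1), date(2023, 1, 2), date(2023, 8, 28)):
    year, iso_year, iso_week, w = week_labels(d)
    print(f"{d}: calendar year {year} | ISO {iso_year}-W{iso_week:02d} | %W {w}")
```

1 January 2023 was a Sunday, so it belongs to ISO week 52 of 2022 but to `%W` week `00` of 2023.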

### Tracks/Artists Streaming Count

- By Day
- By Week

In [34]:
# Tracks streaming (by day)
# TODO: which type of plot/graph
(
    df.filter(
        pl.col("endTime").dt.year().eq(2023),
    )
    .group_by(
        pl.col("endTime").dt.date().alias("date"),
        "artistName",
        "trackName",
    )
    .len()
    .filter(pl.col("len").ge(5))
    .group_by("len")
    .map_groups(lambda x: x.head())
    .sort("len", descending=True)
    .select("date", "trackName", "len")
)

date,trackName,len
date,str,u32
2023-05-11,"""Tu Hi Bataa""",9
2023-08-28,"""Deathcab""",7
2023-08-26,"""A Love of Some Kind""",6
2023-06-25,"""Kya Dekhu""",6
2023-08-24,"""Madhubala""",6
2023-06-28,"""Aasaan Nahin Hota""",6
2023-10-27,"""Ik Dooje Ke Liye""",6
2023-05-22,"""Humara Ho Gaya""",5
2023-09-09,"""Sapna""",5
2023-09-15,"""Faasle""",5


In [35]:
# Artists streaming (by day)
(
    df.filter(
        pl.col("endTime").dt.year().eq(2023),
    )
    .group_by(
        pl.col("endTime").dt.date().alias("date"),
        "artistName",
    )
    .len()
    .filter(pl.col("len").ge(20))
    .group_by("len")
    .map_groups(lambda x: x.head())
    .sort("len", descending=True)
)

date,artistName,len
date,str,u32
2023-06-25,"""Osho Jain""",27
2023-03-28,"""Bayaan""",27
2023-08-25,"""Big Thief""",26
2023-08-10,"""Piyush Bhisekar""",25
2023-10-27,"""Ankur Tewari""",23
2023-09-19,"""Jagjit Singh""",22
2023-08-05,"""Prateek Kuhad""",22
2023-06-23,"""Big Thief""",22
2023-07-06,"""Tajdar Junaid""",21
2023-07-11,"""Tanmaya Bhatnagar""",21


In [36]:
# Tracks streaming (by week)
(
    df.filter(
        pl.col("endTime").dt.year().eq(2023),
    )
    .group_by(
        pl.col("endTime").dt.strftime("(%W) %Y").alias("week"),
        "artistName",
        "trackName",
    )
    .len()
    .filter(pl.col("len").ge(10))
    .group_by("len")
    .map_groups(lambda x: x.head())
    .sort("len", descending=True)
    .select("week", "trackName", "len")
)

week,trackName,len
str,str,u32
"""(35) 2023""","""Deathcab""",18
"""(26) 2023""","""Aasaan Nahin Hota""",17
"""(19) 2023""","""Tu Hi Bataa""",16
"""(25) 2023""","""Paul""",15
"""(25) 2023""","""Certainty""",14
"""(18) 2023""","""Mera Musafir""",14
"""(25) 2023""","""Change""",14
"""(35) 2023""","""A Love of Some Kind""",14
"""(30) 2023""","""Tu Jo Paas""",13
"""(24) 2023""","""Khoya Khoya""",13


In [37]:
# Artists streaming (by week)
(
    df.filter(
        pl.col("endTime").dt.year().eq(2023),
    )
    .group_by(
        pl.col("endTime").dt.strftime("(%W) %Y").alias("week"),
        "artistName",
    )
    .len()
    .filter(pl.col("len").ge(50))
    .group_by("len")
    .map_groups(lambda x: x.head())
    .sort("len", descending=True)
)

week,artistName,len
str,str,u32
"""(25) 2023""","""Big Thief""",99
"""(35) 2023""","""Adrianne Lenker""",90
"""(12) 2023""","""Piyush Bhisekar""",62
"""(26) 2023""","""Bharat Chauhan""",59
"""(38) 2023""","""Jagjit Singh""",58
"""(25) 2023""","""Osho Jain""",56
"""(45) 2023""","""Bharat Chauhan""",53
"""(32) 2023""","""Piyush Bhisekar""",53
"""(45) 2023""","""Bayaan""",52
"""(43) 2023""","""Ankur Tewari""",51


### T/As which only played once

In [38]:
df.group_by("artistName", "trackName").len().filter(
    pl.col("len").eq(1),
)

artistName,trackName,len
str,str,u32
"""Sonu Nigam""","""Main Agar Kahoon""",1
"""Anirudh Ravichander""","""Climax Fight""",1
"""Yatharth Geeta (Hindi)""","""1. प्रथम अध्याय""",1
"""RADWIMPS""","""First Aid""",1
"""Aman Pant""","""Bharam""",1
"""Sanjeeta Bhattacharya""","""Red""",1
"""The Japanese House""","""i saw you in a dream""",1
"""Maple Leaf Learning""","""Pumpkin, Pumpkin""",1
"""JVKE""","""golden hour""",1
"""Vishal Mishra""","""Woh Chaand Kahan Se Laogi""",1


### Dates when the user did not play any track

In [62]:
start_date = df.get_column("endTime").dt.date().min()
end_date = df.get_column("endTime").dt.date().max()
date_range: pl.Series = pl.date_range(
    start_date,  # type: ignore
    end_date,  # type: ignore
    interval="1d",
    eager=True,
).alias("date_range")
date_range.shape

(700,)


date_range
date
2022-01-02
2022-01-03
2022-01-04
2022-01-05
2022-01-06
2022-01-07
2022-01-08
2022-01-09
2022-01-10
2022-01-11


In [52]:
start_date, end_date

(datetime.date(2022, 1, 2), datetime.date(2023, 12, 2))

In [77]:
date_range.filter(
    date_range.is_in(
        df.get_column("endTime").dt.date(),
    ).not_(),
)

date_range
date
2022-03-24
2022-06-20
2022-09-18
2022-10-03
2022-10-04
2022-10-05
2022-10-27
2023-06-10
