## Exercise: Pandas & NumPy with Spotify Dataset
Use pandas and NumPy to analyze, filter, manipulate, and visualize data from the Spotify 2023 dataset.

### Task 1: Data exploration and cleaning
1. Load the dataset in pandas.
2. Check for missing values and handle them:
    - Replace missing values in the "key" column with "Unknown".
    - Fill missing values in "in_shazam_charts" with 0.
3. Filter the dataset:
    - Extract all tracks from 2023 that have been in Spotify Charts at least 50 times.
    - Save this subset as "popular_tracks_2023.csv".

In [68]:
import pandas as pd

spotify_df = pd.read_csv("spotify-2023.csv", encoding_errors="ignore")
print(spotify_df)  # preview; use spotify_df.info() for a column summary

                              track_name      artist(s)_name  artist_count  \
0    Seven (feat. Latto) (Explicit Ver.)    Latto, Jung Kook             2   
1                                   LALA         Myke Towers             1   
2                                vampire      Olivia Rodrigo             1   
3                           Cruel Summer        Taylor Swift             1   
4                         WHERE SHE GOES           Bad Bunny             1   
..                                   ...                 ...           ...   
948                         My Mind & Me        Selena Gomez             1   
949            Bigger Than The Whole Sky        Taylor Swift             1   
950                 A Veces (feat. Feid)  Feid, Paulo Londra             2   
951                        En La De Ella  Feid, Sech, Jhayco             3   
952                                Alone           Burna Boy             1   

     released_year  released_mo

In [74]:
spotify_df.dtypes

track_name              object
artist(s)_name          object
artist_count             int64
released_year            int64
released_month           int64
released_day             int64
in_spotify_playlists     int64
in_spotify_charts        int64
streams                 object
in_apple_playlists       int64
in_apple_charts          int64
in_deezer_playlists     object
in_deezer_charts         int64
in_shazam_charts        object
bpm                      int64
key                     object
mode                    object
danceability_%           int64
valence_%                int64
energy_%                 int64
acousticness_%           int64
instrumentalness_%       int64
liveness_%               int64
speechiness_%            int64
dtype: object
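
Step 2 of Task 1 asks for a check of missing values before they are filled. A minimal sketch on a small made-up frame (the rows below are invented for illustration, not taken from the dataset) counts missing values per column with `isna().sum()`, fills them as the task describes, and counts again:

```python
import numpy as np
import pandas as pd

# Toy stand-in for spotify_df; values are invented for illustration only.
toy_df = pd.DataFrame({
    "key": ["C#", np.nan, "F"],
    "in_shazam_charts": [12.0, np.nan, np.nan],
})

# Count missing values per column before cleaning.
missing_before = toy_df.isna().sum()
print(missing_before)

# Fill the gaps as the task describes.
toy_df["key"] = toy_df["key"].fillna("Unknown")
toy_df["in_shazam_charts"] = toy_df["in_shazam_charts"].fillna(0)

# Count again to confirm nothing is left.
missing_after = toy_df.isna().sum()
print(missing_after)
```

On the real data the same check would be `spotify_df.isna().sum()`, run once before and once after the `fillna` calls.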

In [25]:

spotify_df['key'] = spotify_df['key'].fillna('Unknown')
spotify_df['in_shazam_charts'] = spotify_df['in_shazam_charts'].fillna(0)
na_count_k = spotify_df["key"].isna().sum()
na_count_k

0

In [27]:
na_count_sh = spotify_df["in_shazam_charts"].isna().sum()
na_count_sh

0

In [39]:

popular_tracks_2023 = spotify_df[(spotify_df['released_year'] == 2023) & (spotify_df['in_spotify_charts'] >= 50)]

popular_tracks_2023.to_csv('popular_tracks_2023.csv', index=False)

### Task 2: Statistical analysis and aggregation
1. Calculate basic statistics:
    - Find the average BPM (tempo) by key.
    - Find the average energy level for songs with more than 100 million streams.
2. Sort the dataset:
    - Find the top 10 most streamed songs.
    - Find the 5 least danceable songs.
3. Group the dataset:
    - Count how many tracks belong to each mode (Major/Minor).

In [43]:
avg_bpm_by_key = spotify_df.groupby('key')['bpm'].mean().sort_values(ascending=False)
print(avg_bpm_by_key)

key
A          127.840000
F#         125.479452
D          123.802469
D#         123.393939
G#         123.021978
C#         122.341667
G          122.208333
E          121.935484
B          121.543210
F          120.235955
Unknown    119.947368
A#         119.719298
Name: bpm, dtype: float64


In [58]:
print(spotify_df["streams"])

0      141381703
1      133716286
2      140003974
3      800840817
4      303236322
         ...    
948     91473363
949    121871870
950     73513683
951    133895612
952     96007391
Name: streams, Length: 953, dtype: object

In [76]:
# Convert to numeric; values that cannot be parsed become NaN
spotify_df['streams'] = pd.to_numeric(spotify_df['streams'], errors="coerce")
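
`streams` is stored as `object`, so it has to be converted before any numeric comparison. With `errors="coerce"`, a value that cannot be parsed becomes `NaN` instead of raising an error. The sketch below uses three invented values (one deliberately corrupted) to show the effect and how to count the rows lost in conversion:

```python
import pandas as pd

# Invented values: two clean numeric strings and one corrupted entry.
raw_streams = pd.Series(["141381703", "not-a-number", "800840817"])

# Unparseable entries turn into NaN, so the result is a float column.
clean_streams = pd.to_numeric(raw_streams, errors="coerce")
print(clean_streams)

# Count how many entries failed to parse.
n_bad = clean_streams.isna().sum()
print(n_bad)
```

Because `NaN` forces a floating-point column, converted values print as floats in later cells.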

In [78]:
high_streams = spotify_df[spotify_df['streams'] > 100000000]
avg_energy = high_streams['energy_%'].mean()
print(avg_energy)

64.1125


In [81]:
top_streamed = spotify_df.nlargest(10, 'streams')[['track_name', 'artist(s)_name', 'streams']]
print(top_streamed)

                                        track_name  \
55                                 Blinding Lights   
179                                   Shape of You   
86                               Someone You Loved   
620                                   Dance Monkey   
41   Sunflower - Spider-Man: Into the Spider-Verse   
162                                      One Dance   
84                       STAY (with Justin Bieber)   
140                                       Believer   
725                                         Closer   
48                                         Starboy   

                   artist(s)_name       streams  
55                     The Weeknd  3.703895e+09  
179                    Ed Sheeran  3.562544e+09  
86                  Lewis Capaldi  2.887242e+09  
620                   Tones and I  2.864792e+09  
41          Post Malone, Swae Lee  2.808097e+09  
162           Drake, WizKid, Kyla  2.713922e+09  
84   Justin Bieber, The Kid Laroi  2.665344e+09  
140  

In [83]:
least_danceable = spotify_df.nsmallest(5, 'danceability_%')[['track_name', 'artist(s)_name', 'danceability_%']]
print(least_danceable)

                                            track_name  \
469                                    White Christmas   
447           It's the Most Wonderful Time of the Year   
387  Lift Me Up - From Black Panther: Wakanda Forev...   
521                                            Dawn FM   
523                                        Starry Eyes   

                                        artist(s)_name  danceability_%  
469  Bing Crosby, John Scott Trotter & His Orchestr...              23  
447                                      Andy Williams              24  
387                                            Rihanna              25  
521                                         The Weeknd              27  
523                                         The Weeknd              28  
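
The last item of Task 2, counting tracks per mode, has no cell above. One way to do it is `value_counts()`, or equivalently `groupby('mode').size()`. The sketch below runs on a made-up five-row frame rather than the real data, so the counts are illustrative only:

```python
import pandas as pd

# Toy stand-in for spotify_df; rows are invented for illustration only.
toy_df = pd.DataFrame({
    "track_name": ["a", "b", "c", "d", "e"],
    "mode": ["Major", "Minor", "Major", "Major", "Minor"],
})

# Count tracks per mode, most frequent first.
mode_counts = toy_df["mode"].value_counts()
print(mode_counts)

# Same counts via groupby, ordered by the group label instead.
mode_counts_gb = toy_df.groupby("mode").size()
print(mode_counts_gb)
```

On the real dataset the corresponding call would be `spotify_df['mode'].value_counts()`.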


### Task 3: Feature engineering
1. Create a new variable "track_popularity" using the following logic:
    - "Super Hit" if streams > 500M.
    - "Hit" if streams between 100M and 500M.
    - "Moderate" if streams between 50M and 100M.
    - "Less Popular" otherwise.
2. Save the modified dataset with the new column as "track_popularity_data.csv".

In [86]:
def classify_popularity(streams):
    # Check thresholds from highest to lowest; NaN fails every comparison
    # and falls through to "Less Popular".
    if streams > 500_000_000:
        return "Super Hit"
    elif streams >= 100_000_000:
        return "Hit"
    elif streams >= 50_000_000:
        return "Moderate"
    else:
        return "Less Popular"

spotify_df["track_popularity"] = spotify_df["streams"].apply(classify_popularity)
print(spotify_df[["streams", "track_popularity"]].head())

       streams track_popularity
0  141381703.0              Hit
1  133716286.0              Hit
2  140003974.0              Hit
3  800840817.0        Super Hit
4  303236322.0              Hit


In [88]:
spotify_df.to_csv('track_popularity_data.csv', index=False)
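
The Task 3 thresholds can also be applied without calling a Python function per row. The sketch below uses `np.select`, which assigns each row the label of the first condition that matches. The toy values and the inclusive lower bounds (`>=`) are assumptions, since the task text does not say how exact boundary values should be treated:

```python
import numpy as np
import pandas as pd

# Toy stand-in for spotify_df; the last row mimics a value lost to coercion.
toy_df = pd.DataFrame(
    {"streams": [800_840_817, 303_236_322, 73_513_683, 10_000_000, np.nan]}
)

# Conditions are checked in order, so list them from highest to lowest.
conditions = [
    toy_df["streams"] > 500_000_000,
    toy_df["streams"] >= 100_000_000,
    toy_df["streams"] >= 50_000_000,
]
labels = ["Super Hit", "Hit", "Moderate"]

# Rows matching no condition (including NaN) get the default label.
toy_df["track_popularity"] = np.select(conditions, labels, default="Less Popular")
print(toy_df)
```

Whether a track with missing `streams` should really be labelled "Less Popular" is a design choice; dropping or flagging such rows first is an equally reasonable alternative.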

### Task 4: Unique Task
Each student must create their own unique variable in the dataset. Choose one approach:
1. Assign a playlist ranking (playlist_rank):
    - Generate a random rank between 1 and 100 for each track using np.random.randint().
2. Create an emotional category (mood_category):
    - Categorize songs based on valence (happiness) and energy.
      - "Energetic & Happy" if valence > 60 and energy > 70.
      - "Calm & Happy" if valence > 60 and energy <= 70.
      - "Sad" if valence < 40.
      - "Neutral" otherwise.
3. Define a personal popularity score (custom_popularity_score):
    - Use a custom formula (e.g., (streams / bpm) * danceability_%).

In [92]:
import numpy as np
spotify_df['playlist_rank'] = np.random.randint(1, 101, size=len(spotify_df))
print(spotify_df[['track_name', 'playlist_rank']].head())


                            track_name  playlist_rank
0  Seven (feat. Latto) (Explicit Ver.)             18
1                                 LALA             74
2                              vampire             75
3                         Cruel Summer             14
4                       WHERE SHE GOES             34
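
The cell above implements option 1. For comparison, option 2 (`mood_category`) could be sketched as follows. The column names `valence_%` and `energy_%` come from the dtypes listing earlier, while the four rows are invented so that each category appears once:

```python
import numpy as np
import pandas as pd

# Toy stand-in for spotify_df; values are invented for illustration only.
toy_df = pd.DataFrame({
    "valence_%": [80, 75, 30, 50],
    "energy_%": [90, 60, 85, 50],
})

happy = toy_df["valence_%"] > 60

# Conditions follow the task's rules in the order they are listed.
conditions = [
    happy & (toy_df["energy_%"] > 70),
    happy & (toy_df["energy_%"] <= 70),
    toy_df["valence_%"] < 40,
]
labels = ["Energetic & Happy", "Calm & Happy", "Sad"]

# Anything not covered (valence from 40 to 60) is labelled "Neutral".
toy_df["mood_category"] = np.select(conditions, labels, default="Neutral")
print(toy_df)
```

Unlike the random `playlist_rank`, this column is deterministic, so rerunning the notebook gives the same categories.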
