# Spotify Top 50 (2020) Analysis

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_csv('spotifytoptracks.csv')

print(df)

## Data Cleaning

In [None]:
''' Check for missing Values '''

df.isnull().sum()

In [34]:
''' Remove duplicate samples '''

df.drop_duplicates(inplace=True)

In [None]:
''' Handling Outliers '''
# Select only numeric columns
numeric_cols = df.select_dtypes(include=['float64', 'int64'])

Q1 = numeric_cols.quantile(0.25)
Q3 = numeric_cols.quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers based on the bounds
outliers = (numeric_cols < lower_bound) | (numeric_cols > upper_bound)
print(outliers)
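The boolean mask printed above is hard to read at a glance. A minimal sketch of how it could be summarised follows. It uses a small hypothetical DataFrame (`toy`) with made-up values instead of the real dataset, and applies the same 1.5 × IQR rule.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of the dataset
toy = pd.DataFrame({
    "danceability": [0.70, 0.72, 0.68, 0.71, 0.69, 0.10],
    "loudness": [-6.0, -5.5, -6.2, -5.8, -6.1, -5.9],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Boolean mask: True where a value lies outside the 1.5 * IQR fences
mask = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# Number of flagged values in each column
outlier_counts = mask.sum()

# Rows that contain at least one flagged value
outlier_rows = toy[mask.any(axis=1)]

print(outlier_counts)
print(outlier_rows)
```

Whether flagged rows should be dropped is a separate decision. In a chart of only 50 tracks, an extreme value may well be a genuine song rather than a data error.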


## Exploratory Data Analysis (EDA)

### General Information:

In [37]:
''' Number of observations '''

df.shape[0] #gives number of rows

50

In [38]:
''' Number of features '''

df.shape[1] #gives the number of columns

17

### Feature Classification:

In [None]:
# Categorical features
df.select_dtypes(include=['object']).columns


In [None]:
# Numeric features
df.select_dtypes(include=['float64', 'int64']).columns

### Questions:

#### 1. Artists with more than one popular track:

In [None]:
df['artist'].value_counts()[df['artist'].value_counts() > 1]

#### 2. Most popular artist (by number of tracks):

In [47]:
df['artist'].value_counts().idxmax()

'Billie Eilish'
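Note that `idxmax` returns only the first label when several artists share the maximum count. A minimal sketch of listing every artist tied for first place follows. It uses a hypothetical toy Series with placeholder names instead of the dataset.

```python
import pandas as pd

# Hypothetical artist column; the names are placeholders, not dataset values
artists = pd.Series(["A", "B", "A", "C", "B", "D"], name="artist")

counts = artists.value_counts()

# Keep every artist whose track count equals the maximum, not just the first
top_artists = counts[counts == counts.max()]

print(top_artists)
```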

#### 3. Total number of artists:

In [50]:
df['artist'].nunique()

40

#### 4. Albums with more than 1 popular track:

In [None]:
df['album'].value_counts()[df['album'].value_counts() > 1]

#### 5. Total Number of Albums:

In [52]:
df['album'].nunique()

45

#### 6. Tracks with a danceability score above 0.7:

In [None]:
df[df['danceability'] > 0.7]['track_name']

#### 7. Tracks with a danceability score below 0.4:

In [None]:
df[df['danceability'] < 0.4]['track_name']

#### 8. Tracks with loudness above -5:

In [None]:
df[df['loudness'] > -5]['track_name']

#### 9. Tracks with loudness below -8:

In [None]:
df[df['loudness'] < -8]['track_name']

#### 10. Longest and Shortest track:

In [None]:
# Longest Track
Longest_track = (df.loc[df['duration_ms'].idxmax(), 'track_name'])
Longest_track_artist = (df.loc[df['duration_ms'].idxmax(), 'artist'])
print(f"The longest track is '{Longest_track}' by {Longest_track_artist}")

# Shortest Track
Shortest_track = (df.loc[df['duration_ms'].idxmin(), 'track_name'])
Shortest_track_artist = (df.loc[df['duration_ms'].idxmin(), 'artist'])
print(f"The shortest track is '{Shortest_track}' by {Shortest_track_artist}")
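Since `duration_ms` is stored in milliseconds, it can help to report track lengths in minutes and seconds. A small sketch follows; `format_duration` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
def format_duration(duration_ms: int) -> str:
    """Convert a duration in milliseconds to a 'minutes:seconds' string."""
    total_seconds = int(duration_ms) // 1000  # drop the fractional second
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


print(format_duration(200040))  # 3:20
```

Applied to the notebook's data, the value passed in would be something like `df['duration_ms'].max()`.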

#### 11. Most popular genres and genres with only one song:

In [None]:
# Most Popular Genres:

# Getting value counts for the genre
genre_counts = df['genre'].value_counts()

# Resetting the index so "genre" doesn't appear at the top of output
genre_counts = genre_counts.reset_index()

# Rename columns
genre_counts.columns = ['Genre', 'Count']

# Display the result
print("The Most Popular Genres are:\n")
print(genre_counts)


# Genres with only one song:

# Getting value counts for the genre
single_track_genres = df['genre'].value_counts()[df['genre'].value_counts() == 1]

# Resetting the index so "genre" doesn't appear at the top of output
single_track_genres = single_track_genres.reset_index()

# Rename columns
single_track_genres.columns = ['Genre', 'Count']

# Display the result
print("\nThe Genres with only one song are:\n")
print(single_track_genres)



### Correlations:

#### 1. Strong positive/negative correlations:

In [None]:
# Selecting only numeric columns
numeric_cols = df.select_dtypes(include='number').columns

# Calculate the correlation matrix
corr_matrix = df[numeric_cols].corr()

# Display the correlation matrix
print(corr_matrix)


##### Key Insights:

##### Energy & Loudness:
- Strong positive correlation (0.79): energetic songs tend to be louder.

##### Energy & Acousticness:
- Strong negative correlation (-0.68): energetic songs tend to be less acoustic.

##### Danceability & Valence:
- Moderate positive correlation (0.48): more danceable songs tend to have higher valence, i.e. sound happier.
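Pairs like these can also be pulled out of a correlation matrix programmatically instead of by eye. The sketch below uses a hypothetical toy DataFrame with made-up values, and the 0.6 cut-off for "strong" is an assumption, not a threshold taken from the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real analysis would use the dataset's numeric columns
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8, 0.9],
    "loudness": [-10.0, -8.0, -6.5, -5.0, -4.0],
    "acousticness": [0.9, 0.7, 0.4, 0.2, 0.1],
    "tempo": [120, 90, 130, 100, 110],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one entry per (feature, feature) pair
pairs = upper.stack()

# Assumed cut-off of 0.6 for a "strong" correlation; NaN entries never pass the filter
strong_pairs = pairs[pairs.abs() > 0.6].sort_values(
    key=lambda s: s.abs(), ascending=False
)

print(strong_pairs)
```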


#### 2. Comparison of specific features across genres:

In [None]:
# Filtering specific genres
selected_genres = df[df['genre'].isin(['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie'])]

# Group by genre and calculate the mean of selected features
genre_comp = selected_genres.groupby('genre')[['danceability', 'loudness', 'acousticness']].mean()

# Print the result
print(genre_comp)
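To read off which genre leads or trails on each feature without scanning the table, `idxmax` and `idxmin` can be applied to the grouped means. A minimal sketch on hypothetical toy rows follows; the values are placeholders, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; values are placeholders, not taken from the dataset
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Dance/Electronic"],
    "danceability": [0.60, 0.70, 0.80, 0.76, 0.75],
    "loudness": [-6.0, -7.0, -7.5, -6.5, -5.0],
})

genre_means = toy.groupby("genre")[["danceability", "loudness"]].mean()

# For each feature, the genre with the highest and the lowest mean
highest = genre_means.idxmax()
lowest = genre_means.idxmin()

print(pd.DataFrame({"highest": highest, "lowest": lowest}))
```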

##### Comparative Analysis of Genres Based on Danceability, Loudness, and Acousticness:

##### Danceability:
- Hip-Hop/Rap (0.766) and Dance/Electronic (0.755) are the most danceable genres.
- Pop (0.678) and Alternative/Indie (0.662) are moderately danceable.

##### Loudness:
- Dance/Electronic is the loudest (-5.34 dB), followed closely by Alternative/Indie (-5.42 dB).
- Hip-Hop/Rap (-6.92 dB) and Pop (-6.46 dB) are quieter, which may reflect a softer production style than that of Dance/Electronic.

##### Acousticness:
- Alternative/Indie (0.583) is the most acoustic genre, while Dance/Electronic (0.099) is the least, consistent with its reliance on electronic production.
- Pop (0.324) and Hip-Hop/Rap (0.189) fall in between.

##### Summary:
Dance/Electronic and Hip-Hop/Rap score highest on danceability, with Dance/Electronic being louder and less acoustic than Hip-Hop/Rap. Alternative/Indie is the most acoustic genre, while Pop falls between the extremes on all three features.