# **Spotify Popularity Analysis Notebook**


## 1- Introduction and Setup
This Jupyter notebook is part of the Spotify Popularity Analysis project. We begin by importing the Python libraries needed for data analysis and visualization, then load, clean, and explore the data.

### Importing Libraries

Let's start by importing the required libraries for our analysis:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from ipywidgets import interact
import datetime
import statsmodels.api as sm
import os

### Loading the Data

In this section, we load the Spotify dataset for 2023. We build the path to the CSV file relative to the project's root directory and read it with the 'ISO-8859-1' encoding.

In [None]:
# Get the project's root directory
project_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Define the path to the CSV file relative to the project directory
csv_file_path = os.path.join(project_dir, 'data', 'spotify-2023.csv')

# Load the data (CSV format) with 'ISO-8859-1' encoding
data = pd.read_csv(csv_file_path, encoding='ISO-8859-1')
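
The explicit encoding matters because pandas decodes CSV files as UTF-8 by default. The following is a minimal sketch of that behavior, using a hypothetical temporary file and made-up rows rather than the project's data: a file saved as ISO-8859-1 that contains an accented character fails under the default and reads correctly once the matching encoding is passed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample; the accented character triggers the problem.
sample_text = "track_name,streams\nCanción,100\nPlain,200\n"

# Write the sample to a temporary file encoded as ISO-8859-1 (Latin-1).
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as tmp:
    tmp.write(sample_text.encode("ISO-8859-1"))
    tmp_path = tmp.name

try:
    # Default UTF-8 decoding rejects the Latin-1 byte used for the accent.
    try:
        pd.read_csv(tmp_path)
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Passing the matching encoding reads the file correctly.
    sample = pd.read_csv(tmp_path, encoding="ISO-8859-1")
finally:
    os.remove(tmp_path)
```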

## 2- Data Exploration

This dataset contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. It offers insights into each song's attributes, popularity, and presence on various music platforms. Key features include:

- `track_name`: Name of the song
- `artist(s)_name`: Name of the artist(s) of the song
- `artist_count`: Number of artists contributing to the song
- `released_year`: Year of song release
- `released_month`: Month of song release
- `released_day`: Day of the month of song release
- `in_spotify_playlists`: Number of Spotify playlists the song is in
- `in_spotify_charts`: Presence and rank on Spotify charts
- `streams`: Total streams on Spotify
- `in_apple_playlists`: Number of Apple Music playlists the song is in
- `in_apple_charts`: Presence and rank on Apple Music charts
- `in_deezer_playlists`: Number of Deezer playlists the song is in
- `in_deezer_charts`: Presence and rank on Deezer charts
- `in_shazam_charts`: Presence and rank on Shazam charts
- `bpm`: Beats per minute (song tempo)
- `key`: Key of the song
- `mode`: Mode of the song (major or minor)
- `danceability_%`: Percentage indicating dance suitability
- `valence_%`: Positivity of musical content
- `energy_%`: Perceived energy level
- `acousticness_%`: Amount of acoustic sound
- `instrumentalness_%`: Amount of instrumental content
- `liveness_%`: Presence of live performance elements
- `speechiness_%`: Amount of spoken words


### 2.1 Data Overview

##### Dataset Structure

First, let's look at the first few rows of the dataset to get a sense of its structure and features:

In [None]:
# Display the first few rows of the dataset
data.head()

##### Dataset Dimension

Let's check the dimensions of the dataset:



In [None]:
# Get the number of rows and columns in the dataset
num_rows, num_columns = data.shape

num_rows, num_columns

##### Column Data Types

Display the data types of all columns in the dataset:

In [None]:
# Check data types of all columns
data.dtypes

### 2.2 Data Cleaning

##### Missing Values
Now, we'll check for missing values in the dataset. Any column that contains missing values is then dropped entirely.

In [None]:
# Check for missing values
missing_values = data.isnull().sum()

# Display columns with missing values
missing_columns = missing_values[missing_values > 0]
missing_columns


In [None]:
# Remove columns with missing values
data.drop(columns=missing_columns.index, inplace=True)
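
Dropping whole columns is one option among several. As a rough sketch on a hypothetical toy frame (the values are invented, not taken from the dataset), the share of missing values per column can be inspected first, and dropping incomplete rows can be compared with dropping incomplete columns:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Spotify data.
toy = pd.DataFrame({
    "track_name": ["a", "b", "c", "d"],
    "key": ["C#", None, "F", None],
    "streams": [10, 20, 30, 40],
})

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

# Option 1: drop every column that has at least one missing value.
dropped_columns = toy.drop(columns=missing_pct[missing_pct > 0].index)

# Option 2: keep all columns and drop only the incomplete rows.
dropped_rows = toy.dropna()
```

Which option loses less information depends on how the missing values are spread: the first keeps every row but loses a feature, while the second keeps every feature but loses rows.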

##### Handling Duplicates
Next, let's address duplicate rows in the dataset. We'll remove them if necessary.

In [None]:
# Check for duplicate rows
duplicate_rows = data[data.duplicated(keep='first')]
duplicate_rows


In [None]:
# Remove duplicate rows
data = data.drop_duplicates(keep='first')
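
Note that `duplicated()` without arguments only flags rows that match in every column. As a hedged sketch on invented rows, the `subset` parameter can be used to look for repeated track and artist pairs whose other values differ; whether such rows should count as duplicates is a judgment call that depends on the data.

```python
import pandas as pd

# Hypothetical rows; "Song B" appears twice with different stream counts.
toy = pd.DataFrame({
    "track_name": ["Song A", "Song A", "Song B", "Song B"],
    "artist(s)_name": ["X", "X", "Y", "Y"],
    "streams": [100, 100, 200, 250],
})

# Exact duplicates: every column must match an earlier row.
exact = toy.duplicated(keep="first")

# Same track and artist, regardless of the remaining columns.
same_track = toy.duplicated(subset=["track_name", "artist(s)_name"], keep="first")
```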

##### Data Type Conversions
Finally, we'll convert the `streams` column to a numeric type. Any entry that cannot be parsed as a number is replaced with `NaN`.

In [None]:
# Convert the streams column to numeric; unparseable entries become NaN
data['streams'] = pd.to_numeric(data['streams'], errors='coerce')

# Check data types again to confirm the conversion
data.dtypes
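
To illustrate what `errors='coerce'` does, here is a small sketch on a made-up series (the values are hypothetical). It also shows one way to list the raw entries that failed to parse, which is worth checking before they are silently turned into `NaN`.

```python
import pandas as pd

# Hypothetical raw values; the middle entry is not a valid number.
raw = pd.Series(["1000", "not a number", "2500"], name="streams")

# Unparseable entries become NaN, so the result is float rather than int.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but could not be parsed.
unparsed = raw[converted.isna() & raw.notna()]
```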

### 2.3 Summary Statistics 

##### Basic Statistics
Let's start by calculating summary statistics for the numerical columns, including measures like mean, median, standard deviation, minimum, and maximum values.

In [None]:
numerical_summary = data.describe()
numerical_summary
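
In the output of `describe()`, the median is the row labeled `50%`. A minimal sketch on invented numbers shows how to read it and how a single large value pulls the mean away from the median:

```python
import pandas as pd

# Hypothetical values with one large outlier.
toy = pd.DataFrame({"streams": [10, 20, 30, 1000]})

summary = toy.describe()

# The median is reported as the 50th percentile.
median_streams = summary.loc["50%", "streams"]
mean_streams = summary.loc["mean", "streams"]
```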

##### Data Distribution
Next, we'll visualize the distribution of numerical data using histograms.

In [None]:
# Create histograms for numerical columns
numerical_columns = data.select_dtypes(include=['int64', 'float64']).columns
for column in numerical_columns:
    plt.figure(figsize=(8, 6))
    sns.histplot(data[column], kde=True)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()