# Exploratory Data Analysis using Music Sales

This notebook practices data analysis and data science, using the pandas package for data preparation and the plotly package for data visualization. The dataset was originally downloaded from Kaggle.

## Data preparation

### Importing packages and loading the data

In [6]:
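import pandas as pd

# Load the CSV, use its 'index' column as the row index,
# and drop the 'Number of Records' column, which is not used here.
music_sales = 'music_sales.csv'
df = pd.read_csv(music_sales, index_col='index').drop(columns='Number of Records')
# Treat missing values as 0 and floor negative values at 0.
df['Value (Actual)'] = df['Value (Actual)'].fillna(0).clip(lower=0)
df.head()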

| index | Format | Metric | Year | Value (Actual) |
|-------|--------|--------|------|----------------|
| 0 | CD | Units | 1973 | 0.0 |
| 1 | CD | Units | 1974 | 0.0 |
| 2 | CD | Units | 1975 | 0.0 |
| 3 | CD | Units | 1976 | 0.0 |
| 4 | CD | Units | 1977 | 0.0 |
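
The cleaning applied to `Value (Actual)` in the cell above is easier to see on a tiny frame. The sketch below is only an illustration: the `toy` frame and its numbers are made up and are not taken from the dataset. It shows that a missing value becomes 0 and a negative value is floored at 0, while valid values pass through unchanged.

```python
import pandas as pd

# Toy stand-in for the real CSV; these values are invented for illustration.
toy = pd.DataFrame({
    "Format": ["CD", "CD", "Cassette"],
    "Metric": ["Units", "Units", "Units"],
    "Year": [1983, 1984, 1983],
    "Value (Actual)": [float("nan"), 5.8, -0.1],
})

# Replace missing values with 0, then floor anything negative at 0.
toy["Value (Actual)"] = toy["Value (Actual)"].fillna(0).clip(lower=0)

cleaned_values = toy["Value (Actual)"].tolist()
print(cleaned_values)
```

Filling with 0 keeps every row in the later charts, at the cost of not distinguishing "no sales" from "not recorded"; that trade-off is an assumption of this notebook rather than a property of the data.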


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3008 entries, 0 to 3007
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Format          3008 non-null   object 
 1   Metric          3008 non-null   object 
 2   Year            3008 non-null   int64  
 3   Value (Actual)  3008 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 117.5+ KB


## Data visualization

In [8]:
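import plotly.express as px
from plotly.colors import qualitative

# Split the data: unit counts in one frame, all other metrics in another.
df_units = df[df['Metric'] == 'Units']
df_other_metrics = df[df['Metric'] != 'Units']

# With both x and y given, px.histogram sums y within each Year bin.
fig_1 = px.histogram(
    data_frame=df_units, x='Year', y='Value (Actual)', color='Format',
    color_discrete_sequence=qualitative.Alphabet,
    nbins=df['Year'].nunique(),  # aim for one bin per year
    height=600,
)
fig_1.show()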

*The histogram shows music sales in units for each year, stacked by format. The x-axis represents the year, and the height of each bar is the total number of units sold in that year. Each colored segment is the contribution of one sales format.*
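
To make the bar heights concrete: when `px.histogram` receives both `x` and `y`, it sums `y` within each `x` bin by default, so the chart corresponds roughly to a group-by aggregation. The sketch below reproduces that aggregation with pandas on a small, made-up frame (`toy_units` and its numbers are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical unit sales for two formats over two years.
toy_units = pd.DataFrame({
    "Format": ["CD", "CD", "Cassette", "Cassette"],
    "Year": [1990, 1991, 1990, 1991],
    "Value (Actual)": [10, 20, 30, 40],
})

# Total bar height per year: the sum over all formats.
yearly_totals = toy_units.groupby("Year")["Value (Actual)"].sum()

# Height of each colored segment: one column per format.
by_format = toy_units.pivot_table(
    index="Year", columns="Format", values="Value (Actual)", aggfunc="sum"
)

print(yearly_totals)
print(by_format)
```

Running the same two aggregations on `df_units` should give the numbers behind the chart, assuming each year falls into its own bin.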

In [9]:
# Same chart for the remaining metrics, with one facet row per metric.
fig_2 = px.histogram(
    data_frame=df_other_metrics, x='Year', y='Value (Actual)', color='Format',
    facet_row='Metric',
    color_discrete_sequence=qualitative.Alphabet,
    nbins=df['Year'].nunique(),
    height=1200,
)
fig_2.show()

*The histogram above shows the distribution of music sales by metric (excluding units) across the years. The x-axis represents the year, and the y-axis represents the summed value of music sales. Each colored segment corresponds to one format of music sales. The chart is faceted by metric, so each metric gets its own row.*
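
Because the facet rows can sit on different scales, it can help to compare formats by their share of each metric's yearly total instead of by raw value. The following is a minimal sketch of that calculation on invented data. The metric name `Value` is an assumption made for illustration (only `Units` and `Value (Adjusted)` appear in this notebook's code), and the numbers are made up.

```python
import pandas as pd

# Invented rows; 'Value' as a metric name is an assumption for illustration.
toy = pd.DataFrame({
    "Format": ["CD", "Cassette", "CD", "Cassette"],
    "Metric": ["Value", "Value", "Value (Adjusted)", "Value (Adjusted)"],
    "Year": [1990, 1990, 1990, 1990],
    "Value (Actual)": [3.0, 1.0, 6.0, 2.0],
})

# Keep everything except unit counts, as in the faceted histogram.
other = toy[toy["Metric"] != "Units"]

# Total per (metric, year), aligned back to the original rows.
totals = other.groupby(["Metric", "Year"])["Value (Actual)"].transform("sum")

# Each format's share of its metric's yearly total.
other = other.assign(share=other["Value (Actual)"] / totals)
print(other[["Format", "Metric", "Year", "share"]])
```

The same `share` column could be passed as `y` to a chart to compare formats on a common 0 to 1 scale.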

In [10]:
# Draw one bubble chart per metric; marker size and color both encode the value.
for metric in ['Units', 'Value (Adjusted)']:
    df_metric = df[df['Metric'] == metric].dropna()
    px.scatter(data_frame=df_metric, x='Year', y='Format', size='Value (Actual)', color='Value (Actual)', title=metric).show()

*The scatter plots show how the sales of each format change across the years. The x-axis represents the year, and the y-axis represents the format of music sales. The size and color of each data point reflect the corresponding actual value of music sales. One plot is drawn for each of two metrics, Units and Value (Adjusted).*
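
The scatter plots are effectively a Format by Year grid in which bubble size stands for the value. As a rough illustration, the sketch below builds that grid with a pivot table and reads off the leading format for each year. The frame `toy` and its numbers are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical unit sales; values are invented for illustration.
toy = pd.DataFrame({
    "Format": ["CD", "CD", "Cassette", "Cassette"],
    "Metric": ["Units"] * 4,
    "Year": [1990, 1991, 1990, 1991],
    "Value (Actual)": [10.0, 50.0, 30.0, 20.0],
})

units = toy[toy["Metric"] == "Units"]

# Rows are formats, columns are years: the same layout as the scatter plot.
grid = units.pivot_table(
    index="Format", columns="Year", values="Value (Actual)", aggfunc="sum"
)

# For each year, the format with the largest value (the biggest bubble).
leading_format = grid.idxmax(axis=0)
print(grid)
print(leading_format)
```

Applied to the real `df`, the same pivot would give a numeric counterpart to the visual impression of which format dominates in each period.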

## Conclusion

The visualizations above, a stacked histogram of units, a faceted histogram of the remaining metrics, and a pair of scatter plots, together describe how the music industry has changed over the years covered by the dataset.

The first histogram shows the distribution of music sales by units across the years. The height of each bar is the total number of units sold in that year, and each colored segment is the contribution of one format, so the rise and decline of individual formats is visible at a glance.

The second histogram covers the metrics other than units, which focus on the value of music sales over the same years. Color again distinguishes the formats, and faceting by metric gives each metric its own row. The chart therefore gives an overview of how the value of music sales changes across formats and years.

The scatter plots place the year on the x-axis and the format on the y-axis, with the size and color of each data point reflecting the corresponding actual value. Drawing one plot for Units and one for Value (Adjusted) makes it possible to compare how many units each format sold with how much those sales were worth.

In conclusion, the histogram, the faceted histogram, and the scatter plots together capture the distribution and value of music sales formats over time.