
Final CSE 163 Project: What makes a song popular?

Overview:
This project analyzes Spotify track features, such as danceability and energy, to explore what makes a song popular.

In [None]:
!pip install -q folium mapclassify
!pip install pyspark



Libraries

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy
import pyspark
from scipy.stats import pearsonr, ttest_ind  # SciPy for statistical tests
import plotly.express as px

Collaboration and Conduct

Code

In [None]:
your_name = "Anika Mohan"
sources = [
    "Model Evaluation lecture notes",
    "Mapping assessment",
    "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html",
    "https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html",
    "https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html",
    "https://seaborn.pydata.org/generated/seaborn.histplot.html",
    "https://pieriantraining.com/pearson-correlation-coefficient-with-scipy-pearsonr/#:~:text=To%20find%20the%20Pearson%20Correlation,value%20is%20the%20p%2Dvalue.",
    "https://stackoverflow.com/questions/39581893/find-percentile-stats-of-a-given-column",
    "https://psiaims.github.io/CAMIS/python/two_samples_t_test.html",
    "https://stackoverflow.com/questions/419163/what-does-if-name-main-do",
    "source on plotly.express and interactive line plots",
]

assert your_name != "", "your_name cannot be empty"
assert ... not in sources, "sources should not include the placeholder ellipsis"
assert len(sources) >= 6, "must include at least 6 sources, inclusive of lectures and sections"

In [None]:
#df = load_data("/content/songs.csv")  # load the dataset into df
#display(df.head())  # preview the first rows

In [27]:
# set Seaborn style for plots
sns.set(style="whitegrid")

# load the dataset using pandas and display the first rows;
# missing values are dropped later, per analysis, rather than here
def load_data(csv_file_path):
  """
  Load the songs dataset from a CSV file and parse the
  album release date column as datetimes.

  Parameters: csv_file_path (str): path to the CSV file

  Returns: a pandas DataFrame
  """
  df = pd.read_csv(csv_file_path)
  # unparseable dates become NaT instead of raising an error
  df["track_album_release_date"] = pd.to_datetime(df["track_album_release_date"], errors="coerce")
  return df

df = load_data("/content/songs.csv")
display(df.head())

# Statistical hypothesis testing using the SciPy library
# first, a Pearson correlation measures how strongly two
# variables are linearly related, e.g. popularity and danceability;
# a strong correlation would give insight into what makes a song popular
# second, a t-test checks whether popular songs are statistically
# different from unpopular songs,
# comparing the top and bottom 10% of songs (ranked by the popularity column)
def compute_correlation(df, feature1, feature2):
  """
  Compute and print the Pearson correlation between two columns
  of the dataset using the SciPy library.

  Parameters: 1. df (pandas DataFrame)
  2. feature1 (str): the name of the first column
  3. feature2 (str): the name of the second column

  Returns: None (the coefficient and p-value are printed)
  """
  df[feature1] = pd.to_numeric(df[feature1], errors="coerce")
  df[feature2] = pd.to_numeric(df[feature2], errors="coerce")

  # drop rows where either feature is missing, since pearsonr does not skip NaN values
  df_clean = df[[feature1, feature2]].dropna()
  # pearsonr returns the correlation coefficient and its two-sided p-value
  corr, p_value = pearsonr(df_clean[feature1], df_clean[feature2])
  print("Pearson Correlation Coefficient between", feature1, "and", feature2, "is", corr)
  print("P-value:", p_value)

# Research Question 1: Popularity vs. Danceability
compute_correlation(df, "popularity", "danceability")

def perform_t_test(df, feature):
  """
  Perform a t-test comparing a feature between the top and bottom
  10% of songs by popularity.

  Parameters: 1. df (pandas DataFrame)
  2. feature (str): the name of the column to perform the test on

  Returns: None (the t-statistic and p-value are printed)
  """
  df[feature] = pd.to_numeric(df[feature], errors="coerce")

  # split into two groups: songs in the top and bottom 10% of popularity,
  # using the 90th and 10th percentile popularity scores as cutoffs;
  # drop missing feature values so they cannot turn the result into NaN
  top_10 = df[df["popularity"] >= df["popularity"].quantile(0.90)][feature].dropna()
  bottom_10 = df[df["popularity"] <= df["popularity"].quantile(0.10)][feature].dropna()

  # apply the t-test
  # ttest_ind() compares the means of the two groups; the difference
  # between the most and least popular songs is treated as
  # statistically significant when p < 0.05
  # equal_var=False runs Welch's t-test, which does not assume equal variances
  t_stat, p_value = ttest_ind(top_10, bottom_10, equal_var=False)
  print("T-statistic for ", feature, ": ", "T-statistic = ", t_stat, "P-value = ", p_value)

# Research Question #2: Energy difference between popular & unpopular songs
perform_t_test(df, "energy")

def plot_pop_genre_graph(df):
  """
  Plot the average popularity of each genre per release year (2000-2023)
  as an interactive Plotly line chart.
  """
  # work on a copy so the caller's DataFrame is not modified
  df = df.copy()
  # extract the release year from the track_album_release_date column
  df["year"] = df["track_album_release_date"].dt.year
  df["popularity"] = pd.to_numeric(df["popularity"], errors="coerce")
  # keep only songs released from 2000 through 2023
  df = df[(df["year"] >= 2000) & (df["year"] <= 2023)]


  # group by year and genre
  pop_trend = df.groupby(["year", "playlist_genre"])["popularity"].mean().reset_index()

  # create interactive line plot using plotly
  fig = px.line(pop_trend, x="year", y="popularity", color="playlist_genre",
                title="Average Popularity by Year and Genre",
                labels={"popularity": "Avg Popularity", "year": "Year"},
                markers=True)
  fig.show()

# Research Question 3: How has song popularity changed over time?
plot_pop_genre_graph(df)
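
To make the two statistics above less of a black box, here is a minimal pure-Python sketch of what `pearsonr` and `ttest_ind(..., equal_var=False)` compute. The helper names `pearson_r` and `welch_t` are hypothetical and only cover the statistics themselves, not the p-values; SciPy remains the tool used in the analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def welch_t(a, b):
    """Welch's t-statistic: difference in means over its standard error,
    with each group keeping its own sample variance."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


# toy values, not taken from the songs dataset
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(welch_t([1, 2, 3], [2, 3, 4]))
```

Because the numerator of `welch_t` is the first group's mean minus the second's, a negative statistic means the first group has the lower mean.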









Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
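
The warning above means pandas could not settle on a single date format and fell back to parsing each value individually. The rows in the data preview use `YYYY-MM-DD`, so one possible fix is to pass that format explicitly. This is a sketch under the assumption that the whole column follows that pattern, which has not been verified here; any value in another shape (for example a bare year) would become `NaT` under `errors="coerce"`.

```python
import pandas as pd

# toy values standing in for the track_album_release_date column
raw_dates = pd.Series(["2019-06-28", "2019-07-11", "not a date"])

# an explicit format keeps parsing consistent and silences the warning;
# values that do not match are coerced to NaT rather than raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("unparsed values:", parsed.isna().sum())
```

Counting the `NaT` values after parsing is a quick way to check how many rows did not match the assumed format.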



Unnamed: 0.1,Unnamed: 0,track_id,track_name,track_artist,popularity,track_album_release_date,playlist_genre,danceability,energy,key,...,Unnamed: 821,Unnamed: 822,Unnamed: 823,Unnamed: 824,Unnamed: 825,Unnamed: 826,Unnamed: 827,Unnamed: 828,Unnamed: 829,Unnamed: 830
0,0,6oJ6le65B3SEqPwMRNXWjY,higher love,Kygo,0.5,2019-06-28,Pop,0.6326797385620915,0.6673462817898917,0.7272727272727273,...,,,,,,,,,,
1,1,3yNZ5r3LKfdmjoS3gkhUCT,bad guy (with justin bieber),Billieeilish,0.3181818181818183,2019-07-11,Pop,0.6026143790849674,0.4259040669599743,0.0,...,,,,,,,,,,
2,2,0qc4QlcCxVTGyShurEv1UU,post malone (feat. rani),Samfeldt,0.3181818181818183,2019-05-24,Pop,0.4980392156862744,0.6287155274171049,0.6363636363636364,...,,,,,,,,,,
3,3,4PkIDTPGedm0enzdvilLNd,sixteen,Elliegoulding,0.2272727272727275,2019-04-12,Pop,0.6013071895424837,0.7993346925635799,0.7272727272727273,...,,,,,,,,,,
4,4,5PYQUBXc7NYeI1obMKSJK0,never really over,Katyperry,0.4090909090909091,2019-05-31,Pop,0.7333333333333334,0.8862538899023502,0.7272727272727273,...,,,,,,,,,,
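
The preview above shows index-like columns (`Unnamed: 0.1`, `Unnamed: 0`) and several hundred empty `Unnamed:` columns. A minimal sketch of one way to remove them, together with exact duplicate rows, is below. The helper name `drop_unnamed_columns` is hypothetical, and the toy frame is not the real dataset.

```python
import pandas as pd


def drop_unnamed_columns(df):
    """Return df without columns whose name starts with 'Unnamed'."""
    keep = ~df.columns.str.startswith("Unnamed")
    return df.loc[:, keep]


# toy frame: one leftover index column and one duplicated row
demo = pd.DataFrame({
    "Unnamed: 0": [0, 0, 1],
    "track_id": ["a", "a", "b"],
    "popularity": [0.5, 0.5, 0.3],
})

cleaned = drop_unnamed_columns(demo).drop_duplicates()
print(cleaned)
```

Dropping the leftover index columns first matters for deduplication, since two otherwise identical rows with different index values would not count as duplicates.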




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
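
The warning above is raised when a column is assigned on a DataFrame that was produced by filtering another one, because pandas cannot tell whether the original should change too. One common way to avoid it, sketched here on a toy frame with hypothetical values, is to take an explicit `.copy()` of the filtered rows before assigning.

```python
import pandas as pd

songs = pd.DataFrame({
    "year": [1999, 2005, 2020],
    "popularity": ["0.5", "0.3", "x"],
})

# .copy() makes the filtered rows an independent DataFrame,
# so the assignment below is unambiguous and raises no warning
recent = songs[(songs["year"] >= 2000) & (songs["year"] <= 2023)].copy()
recent["popularity"] = pd.to_numeric(recent["popularity"], errors="coerce")

print(recent)
```

The original `songs` frame keeps its string values, which also makes it clear that the conversion only applies to the filtered subset.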



Pearson Correlation Coefficient between popularity and danceability is 0.06938694767002712
P-value: 0.037413384307856484
T-statistic for  energy :  T-statistic =  -2.7468153940169895 P-value =  0.006563426741030975
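
Both p-values above fall below 0.05, but significance alone says little about the size of an effect. The correlation of about 0.069 corresponds to an r squared of roughly 0.005, so danceability accounts for about half a percent of the variance in popularity. The t-statistic is negative, which with the argument order `ttest_ind(top_10, bottom_10)` means the top group has the lower mean energy. A small sketch of reporting strength alongside significance follows; the helper `summarize_correlation` is hypothetical, and its strength cutoffs are illustrative conventions rather than fixed rules.

```python
def summarize_correlation(r, p_value, alpha=0.05):
    """Describe a Pearson result by strength, shared variance, and significance."""
    magnitude = abs(r)
    # illustrative cutoffs only; different fields use different conventions
    if magnitude < 0.1:
        strength = "negligible"
    elif magnitude < 0.3:
        strength = "weak"
    elif magnitude < 0.5:
        strength = "moderate"
    else:
        strength = "strong"
    return {
        "strength": strength,
        "r_squared": r ** 2,
        "significant": p_value < alpha,
    }


# values copied from the printed output above
summary = summarize_correlation(0.06938694767002712, 0.037413384307856484)
print(summary)
```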
