
# Y.Music

# Content <a id='back'></a>

* [Introduction](#intro)
* [Step 1. Data Review](#data_review)
    * [Data Review Conclusions](#data_review_conclusions)
* [Step 2. Data Preprocessing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Data Preprocessing Conclusions](#data_preprocessing_conclusions)
* [Step 3. Hypothesis Testing](#hypotheses)
    * [3.1 Hypothesis 1: User Activity in Both Cities](#activity)
    * [3.2 Hypothesis 2: Music Preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: Genre Preferences in Springfield and Shelbyville](#genre)
* [Conclusion](#end)

## Introduction

Every research project starts with a hypothesis that can be tested. Sometimes we accept the hypothesis, and sometimes we reject it. To make informed decisions, a business must understand whether its assumptions hold.

In this project, I compare music preferences in the cities of Springfield and Shelbyville. I use actual Y.Music data to test the hypotheses below and compare user behavior in both cities.

## Goals
Test three hypotheses:

1. User activity varies depending on the day and the city.
2. On Monday morning, residents of Springfield and Shelbyville listen to different genres. The same is true for Friday night.
3. Users in Springfield and Shelbyville have different preferences. In Springfield, they prefer pop music, while in Shelbyville, rap music has more fans.

## Steps
The data on user behavior is stored in the file /datasets/music_project_en.csv. No information is available about the quality of the data, so it must be examined before testing any hypotheses.

First, I will evaluate the data quality and determine whether any issues are significant. Then, during the data preprocessing stage, I will address the most serious problems.

This project consists of three steps:

1. Data Review
2. Data Preprocessing
3. Hypothesis Testing

## Step 1. Data Review

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('/content/drive/MyDrive/DS/SPRINT_1/music_project_en.csv')

In [4]:
df.head()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


This table contains seven columns, all of the same data type: object.

According to the documentation:

* 'userID': user identifier
* 'Track': track title
* 'artist': artist name
* 'genre': track genre
* 'City': city where the user is located
* 'time': time of day when the track was played
* 'Day': day of the week

We can identify three issues with the column names:

1. Some column names contain uppercase letters while others are entirely lowercase.
2. Some column names contain leading or trailing spaces.
3. The terms "city", "time", and "day" are ambiguous, as it is unclear whether they refer to the song or the user.

The non-null counts differ between columns, which indicates that the data contains missing values.

**Data Review Conclusion**

Each row in the table contains data on the played track. Some columns describe the song itself: track title, artist, and genre. The rest convey information about the user: their city of origin, and the time they played the song.

It is evident that the data is sufficient for testing hypotheses, although there are missing values.

Next, we need to conduct data pre-processing before proceeding.

## Step 2. Data Preprocessing

### 2.1 Header Style



In [6]:
print(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


Rename the columns to lowercase snake_case and remove the extra spaces.

In [7]:
df = df.rename(columns={
    '  userID':'user_id',
    'Track':'track',
    '  City  ':'city',
    'Day':'day'
})
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
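
The explicit mapping above works well for seven columns. As a minimal sketch of an alternative, the renaming could also be derived from the raw names by a rule. The helper `to_snake_case` below is hypothetical (it is not part of the original notebook), and the empty frame only reuses the raw headers shown earlier.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding spaces, split camelCase words, and lowercase the result."""
    name = name.strip()
    # Insert an underscore where a lowercase letter or digit is followed by an uppercase letter.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


# Empty toy frame that only carries the raw headers from the Y.Music file.
raw = pd.DataFrame(columns=["  userID", "Track", "artist", "genre", "  City  ", "time", "Day"])

# DataFrame.rename accepts a function and applies it to every column name.
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

A rule like this scales to wider tables, but an explicit dictionary remains easier to audit when there are only a handful of columns.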


### 2.2 Missing Values

In [8]:
df.isnull().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

Some values are missing in the track, artist, and genre columns. I will replace them with 'unknown' and later check whether this replacement has a significant effect on the calculations.

In [9]:
columns_to_replace = ['track','artist','genre']

for column_to_replace in columns_to_replace:
    df[column_to_replace] = df[column_to_replace].fillna('unknown')

df.isnull().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
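
To judge how much the 'unknown' placeholder might matter later, it can help to record the share of missing values per column before filling them. The sketch below uses a small made-up frame (`toy`), not the Y.Music file, and also shows that the three columns can be filled in a single assignment instead of a loop.

```python
import pandas as pd

# Toy data for illustration only; None marks a missing value.
toy = pd.DataFrame({
    "track": ["Soul People", None, "My Name", "Jalopiina"],
    "artist": ["Space Echo", None, "McLean", None],
    "genre": ["dance", "rock", None, "industrial"],
})

# Share of missing values per column, measured before filling.
missing_share = toy.isna().mean()
print(missing_share)

# Fill all three text columns at once.
columns_to_replace = ["track", "artist", "genre"]
toy[columns_to_replace] = toy[columns_to_replace].fillna("unknown")
print(toy.isna().sum())
```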

### 2.3 Duplicates

In [10]:
df.duplicated().sum()

3826

There are 3826 duplicated rows, which I will drop so that they do not distort the counts.

In [11]:
df = df.drop_duplicates().reset_index(drop=True)
df

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
...,...,...,...,...,...,...,...
61248,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
61249,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hip,Shelbyville,10:00:00,Monday
61250,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
61251,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [12]:
df.duplicated().sum()

0

In [13]:
df_sorted = df.sort_values(by='genre')
df_sorted['genre'].unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

The list of genres shows that the hiphop genre is spelled in several different ways.

The following implicit duplicates are present:
* hip
* hop
* hip-hop

To remove them, I create a function that replaces all implicit duplicates with a single name (hiphop).

In [14]:
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [15]:
duplicates_genres = ['hip','hop','hip-hop']
correct_genre_hiphop = 'hiphop'
replace_wrong_genres(duplicates_genres,correct_genre_hiphop)

In [16]:
df_sorted = df.sort_values(by='genre')
df_sorted['genre'].unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

### 2.4 Data Preprocessing Conclusion



I detected three problems with the data:
1. Inconsistent header style
2. Missing values
3. Explicit and implicit duplicates

All headers have been cleaned up to make processing easier. All missing values have been replaced with 'unknown'; their effect on the calculations is reviewed later. Duplicates have been removed so that the counts are more accurate.

Next, we move on to hypothesis testing.

## Step 3. Hypothesis Testing

### 3.1 Hypothesis 1: User Activity in Both Cities

According to the first hypothesis, users in the two cities behave differently when listening to music. This hypothesis is tested using data from Monday, Wednesday, and Friday.

In [17]:
df.groupby('city')['track'].count()

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

In [18]:
df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

In [19]:
def number_tracks(day,city):
    track_list_day = df[df['day'] == day]
    track_list = track_list_day[track_list_day['city'] == city]
    track_list_count = track_list.count()['user_id']
    print(day,city,track_list_count)

In [20]:
number_tracks('Monday','Springfield')
number_tracks('Monday','Shelbyville')
number_tracks('Wednesday','Springfield')
number_tracks('Wednesday','Shelbyville')
number_tracks('Friday','Springfield')
number_tracks('Friday','Shelbyville')

Monday Springfield 15740
Monday Shelbyville 5614
Wednesday Springfield 11056
Wednesday Shelbyville 7003
Friday Springfield 15945
Friday Shelbyville 5895


In [21]:
result_columns = ['city', 'monday', 'wednesday', 'friday']

cities = ['Springfield','Shelbyville']

result = [
    [cities[0],15740,11056,15945],
    [cities[1],5614,7003,5895]]

number_track = pd.DataFrame(data=result, columns=result_columns)

print(number_track)

          city  monday  wednesday  friday
0  Springfield   15740      11056   15945
1  Shelbyville    5614       7003    5895
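
The table above was assembled by copying the printed counts by hand. A minimal sketch of how the same kind of table could be computed directly is shown below, using `pd.crosstab` on a small made-up frame (`toy`) rather than the real data.

```python
import pandas as pd

# Toy plays for illustration only; one row per played track.
toy = pd.DataFrame({
    "city": ["Springfield", "Springfield", "Shelbyville", "Springfield", "Shelbyville"],
    "day": ["Monday", "Friday", "Wednesday", "Monday", "Monday"],
})

# Count the rows for every (city, day) pair in one call.
counts = pd.crosstab(toy["city"], toy["day"])

# Put the days in calendar order instead of alphabetical order.
counts = counts.reindex(columns=["Monday", "Wednesday", "Friday"], fill_value=0)
print(counts)
```

Computing the table this way avoids transcription errors if the preprocessing steps change and the counts shift.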


The analysis reveals differences in user behavior.

* In Springfield, the number of songs played peaks on Mondays and Fridays, with a drop in activity on Wednesdays.
* In Shelbyville, users listen to more music on Wednesdays. Activity is lower on Mondays and Fridays.

This data also shows Spri
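
Because the two cities differ in total plays, raw counts are hard to compare across rows. As a sketch of one way to normalize them, each row of the table can be divided by its own total. The counts are copied from the outputs above; nothing else is assumed.

```python
import pandas as pd

# Counts copied from the table above.
number_track = pd.DataFrame(
    {"monday": [15740, 5614], "wednesday": [11056, 7003], "friday": [15945, 5895]},
    index=["Springfield", "Shelbyville"],
)

# Divide each row by its own total so that every city sums to 1.
shares = number_track.div(number_track.sum(axis=1), axis=0)
print(shares.round(3))
```

In these shares, Wednesday accounts for roughly 26% of the plays in Springfield and roughly 38% in Shelbyville, which matches the pattern described in the bullets.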

### 3.2 Hypothesis 2: Music Preferences on Monday and Friday

According to the second hypothesis, on Monday morning and Friday evening, residents of Springfield listen to different genres than residents of Shelbyville.

In [22]:
spr_general = df[df['city'] == 'Springfield']

In [23]:
shel_general = df[df['city'] == 'Shelbyville']

In [24]:
def genre_weekday(city,day,time1,time2):
    # `city` is a DataFrame already filtered to one city (e.g. spr_general).
    # Times are zero-padded 'HH:MM:SS' strings, so comparing them as strings
    # orders them correctly. Both bounds are exclusive.
    genre_df = city[(city['day'] == day) & (city['time'] > time1) & (city['time'] < time2)]
    # Count plays per genre and keep the 15 most popular ones.
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted[:15]
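
The function above compares the 'time' values as strings, which works for this dataset because every value shown is zero-padded ('08:34:34'). As a defensive sketch, assuming times might not always be padded, the values can be converted with `pd.to_timedelta` before comparing. The helper `in_time_window` and the toy values are hypothetical.

```python
import pandas as pd


def in_time_window(times: pd.Series, time1: str, time2: str) -> pd.Series:
    """Return True where time1 < time < time2, compared as durations since midnight."""
    as_delta = pd.to_timedelta(times)
    return (as_delta > pd.to_timedelta(time1)) & (as_delta < pd.to_timedelta(time2))


# Toy values for illustration only; the second one is deliberately not zero-padded.
toy_times = pd.Series(["08:34:34", "9:30:00", "14:07:09"])

delta_mask = in_time_window(toy_times, "07:00:00", "11:00:00")
string_mask = (toy_times > "07:00:00") & (toy_times < "11:00:00")

print(delta_mask.tolist())
print(string_mask.tolist())
```

The two masks disagree only on the unpadded value, which the string comparison excludes because '9' sorts after '1'.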

In [25]:
genre_weekday(spr_general,'Monday','07:00:00','11:00:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: user_id, dtype: int64

In [26]:
genre_weekday(shel_general,'Monday','07:00:00','11:00:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: user_id, dtype: int64

In [27]:
genre_weekday(spr_general,'Friday','17:00:00','23:00:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: user_id, dtype: int64

In [28]:
genre_weekday(shel_general,'Friday','17:00:00','23:00:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: user_id, dtype: int64

The `genre_weekday` function returns the 15 most popular genres in one city on a given day, within the window between two timestamps.

From these results, we can draw two conclusions:
1. Users from Springfield and Shelbyville listen to the same genres. The top five genres are identical in both cities, and only the order changes: rock and electronic swap places on Monday morning, and dance and electronic on Friday evening.

2. In Springfield, the number of missing values is significant: on Monday morning, 'unknown' ranks tenth. Missing values therefore make up a considerable share of the data, which may raise questions about the accuracy of our conclusions.

Thus, the second hypothesis is not supported:
* Users listen to similar music at the start and end of the week.
* There are no significant differences between Springfield and Shelbyville. In both cities, pop is the most popular genre.

Moreover, the number of missing values makes these results questionable. In Springfield, there are enough missing values to affect the top 15 genres. Without them, the results might have been different.
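
The comparison of the top genres can also be checked mechanically rather than by eye. The sketch below copies the top five genres from the Monday-morning outputs above and reports whether the sets match and at which ranks the order differs.

```python
# Top-five genres copied from the Monday-morning outputs above, in ranked order.
spr_monday_top5 = ["pop", "dance", "electronic", "rock", "hiphop"]
shel_monday_top5 = ["pop", "dance", "rock", "electronic", "hiphop"]

# Same genres regardless of order?
same_genres = set(spr_monday_top5) == set(shel_monday_top5)

# Ranks (starting at 1) where the two cities disagree.
rank_differences = [
    (rank, spr, shel)
    for rank, (spr, shel) in enumerate(zip(spr_monday_top5, shel_monday_top5), start=1)
    if spr != shel
]

print(same_genres)
print(rank_differences)
```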

### 3.3 Hypothesis 3: Genre Preferences in Springfield and Shelbyville

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

In [29]:
spr_genres = spr_general.groupby('genre')['genre'].count()
spr_genres = spr_genres.sort_values(ascending=False)

In [30]:
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [31]:
shel_genres = shel_general.groupby('genre')['genre'].count()
shel_genres = shel_genres.sort_values(ascending=False)

In [32]:
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

The hypothesis has been partially proven true:

* In Springfield, pop is the most popular genre, as expected. However, pop turned out to be just as popular in Shelbyville, and rap is not in the top 5 for either city.
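
Since Springfield has more than twice as many plays as Shelbyville, "equally popular" is easier to verify with shares than with raw counts. The sketch below uses only numbers printed earlier in this notebook: the pop counts from the two top-10 lists and the total plays per city from the first hypothesis.

```python
# Counts copied from the outputs above: pop plays and total plays per city.
pop_plays = {"Springfield": 5892, "Shelbyville": 2431}
total_plays = {"Springfield": 42741, "Shelbyville": 18512}

# Share of each city's plays that belong to the pop genre.
pop_share = {city: pop_plays[city] / total_plays[city] for city in pop_plays}

for city, share in pop_share.items():
    print(f"{city}: {share:.1%}")
```

Pop accounts for roughly 13 to 14 percent of plays in both cities, which supports the statement that it is about equally popular in each.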

## Conclusion

We have tested the following three hypotheses:

1. User activity in Springfield and Shelbyville depends on the day of the week, and the pattern differs between the two cities.
2. On Monday morning, residents of Springfield and Shelbyville listen to different genres. This also applies to Friday night.
3. Listeners in Springfield and Shelbyville have different preferences: Springfield prefers pop, while Shelbyville prefers rap.

After analyzing the available data, we can conclude that:

* User activity in Springfield and Shelbyville depends on the day of the week, and the pattern differs between the cities. The first hypothesis can be fully accepted.

* Music preferences do not vary significantly over the week in either city. The order of the top genres differs slightly, but in both Springfield and Shelbyville users mostly listen to pop music. Therefore, the second hypothesis cannot be accepted. It is also important to note that the results might differ if the data had no missing values.

* The music preferences of users from Springfield and Shelbyville turn out to be very similar. The third hypothesis is rejected. If there are differences in preferences, we cannot detect them from this data.