## Introduction <a id='intro'></a>
In this project, we will compare the music preferences of users in the cities of Springfield and Shelbyville. we will study actual Y.Music data to test the hypotheses below and compare user behavior in these two cities.

### Purpose: 
Testing three hypotheses:
1. User activity varies depending on the day and city.
2. On Monday mornings, residents of Springfield and Shelbyville tune in to different genres. This also applies on Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, users prefer pop music, while in Shelbyville rap music has more fans.

### Phases:
 
This project will consist of three phases:
  1. Data Overview
  2. Pre-processing of Data
  3. Hypothesis testing

 
[Kembali ke Konten](#back)

## 1. Data Overview. Ikhtisar Data <a id='data_review'></a>



In [1]:

import pandas as pd

read csv files

In [2]:

df = pd.read_csv('/datasets/music_project_en.csv')
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,21:51:22,Friday
freq,76,136,136,8850,45360,14,23149


In [3]:
df.head(10) 

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [45]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61253 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  61253 non-null  object
 1   track    61253 non-null  object
 2   artist   61253 non-null  object
 3   genre    61253 non-null  object
 4   city     61253 non-null  object
 5   time     61253 non-null  object
 6   day      61253 non-null  object
dtypes: object(7)
memory usage: 3.7+ MB




Each row in the table stores data related to the track of the song being played. Several columns store data that describes the track itself: track title, artist, and genre. The rest stores data related to user information: their hometown, when they played the track.



##  2. Pre-processing of Data <a id='data_preprocessing'></a>


In [41]:
df.columns 

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

changing columns name

In [6]:
df = df.rename(
    columns= {
        '  userID': 'user_id',
        'Track': 'track',
        '  City  ': 'city',
        'Day': 'day',
    }
)    

checking the result of columns name

In [7]:
df.columns 

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

### Handling missing values <a id='missing_values'></a>


checking missing values

In [8]:
df.isnull().sum()
df.isna().sum() 

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

replacing missing values in columns `'track'`, `'artist'`, and `'genre'` with `'unknown'`

In [9]:
columns_to_replace = ['track', 'artist', 'genre']
for x in columns_to_replace:
    df[x] = df[x].fillna('unknown') 

In [10]:
df.isna().sum() 

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Handling Duplicate <a id='duplicates'></a>


In [11]:
df.duplicated().sum() 

3826

Erasing all duplicates

In [42]:

df.drop_duplicates(inplace=True)

checking duplicates

In [43]:
df.duplicated().sum() 

0

checking duplicates in column 'genre'

In [14]:

df['genre']
df.sort_values(by='genre', ascending=True)
df['genre'].unique()

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'unknown', 'alternative', 'children', 'rnb', 'hip',
       'jazz', 'postrock', 'latin', 'classical', 'metal', 'reggae',
       'triphop', 'blues', 'instrumental', 'rusrock', 'dnb', 'türk',
       'post', 'country', 'psychedelic', 'conjazz', 'indie',
       'posthardcore', 'local', 'avantgarde', 'punk', 'videogame',
       'techno', 'house', 'christmas', 'melodic', 'caucasian',
       'reggaeton', 'soundtrack', 'singer', 'ska', 'salsa', 'ambient',
       'film', 'western', 'rap', 'beats', "hard'n'heavy", 'progmetal',
       'minimal', 'tropical', 'contemporary', 'new', 'soul', 'holiday',
       'german', 'jpop', 'spiritual', 'urban', 'gospel', 'nujazz',
       'folkmetal', 'trance', 'miscellaneous', 'anime', 'hardcore',
       'progressive', 'korean', 'numetal', 'vocal', 'estrada', 'tango',
       'loungeelectronic', 'classicmetal', 'dubstep', 'club', 'deep',
       'southern', 'black', 'folkrock', 

replacing duplicates in column 'genre'

In [15]:

def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [16]:

duplicates = ['hip', 'hop', 'hip-hop']
name = 'hiphop'
replace_wrong_genres(duplicates, name)
df

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
...,...,...,...,...,...,...,...
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65075,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hiphop,Shelbyville,10:00:00,Monday
65076,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


checking result in column 'genre'

In [17]:

df['genre'].unique()

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'unknown', 'alternative', 'children', 'rnb',
       'hiphop', 'jazz', 'postrock', 'latin', 'classical', 'metal',
       'reggae', 'triphop', 'blues', 'instrumental', 'rusrock', 'dnb',
       'türk', 'post', 'country', 'psychedelic', 'conjazz', 'indie',
       'posthardcore', 'local', 'avantgarde', 'punk', 'videogame',
       'techno', 'house', 'christmas', 'melodic', 'caucasian',
       'reggaeton', 'soundtrack', 'singer', 'ska', 'salsa', 'ambient',
       'film', 'western', 'rap', 'beats', "hard'n'heavy", 'progmetal',
       'minimal', 'tropical', 'contemporary', 'new', 'soul', 'holiday',
       'german', 'jpop', 'spiritual', 'urban', 'gospel', 'nujazz',
       'folkmetal', 'trance', 'miscellaneous', 'anime', 'hardcore',
       'progressive', 'korean', 'numetal', 'vocal', 'estrada', 'tango',
       'loungeelectronic', 'classicmetal', 'dubstep', 'club', 'deep',
       'southern', 'black', 'folkrock


We've detected three problems in our data:

- Incorrect title writing style
- Missing values
- Explicit and implicit duplicates

## 3. Hypothesis Testing <a id='hypotheses'></a>

### Hypothesis 1: Comparing User Behavior in Two Cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville have different behavior in listening to music. This test uses data taken from three days of the week: Monday, Wednesday, and Friday.


In [18]:

df.groupby('city').count()

Unnamed: 0_level_0,user_id,track,artist,genre,time,day
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Shelbyville,18512,18512,18512,18512,18512,18512
Springfield,42741,42741,42741,42741,42741,42741


Users from Springfield played more tracks than users from Shelbyville. However, this does not imply that Springfield residents listen to music more often. The city is bigger, and there are more users.



In [19]:

df.groupby('day').count()

Unnamed: 0_level_0,user_id,track,artist,genre,city,time
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Friday,21840,21840,21840,21840,21840,21840
Monday,21354,21354,21354,21354,21354,21354
Wednesday,18059,18059,18059,18059,18059,18059


Wednesday is the quietest day overall.

In [20]:


def number_tracks(day, city):
    track_list = df[(df['day']==day) & (df['city']==city)].sort_values(by= ['day', 'city'])
    track_list_count = track_list['user_id'].count()
    return track_list_count

number of songs played at Springfield on Monday

In [21]:

number_tracks('Monday', 'Springfield')

15740

number of songs played at Shelbyville on Monday

In [22]:

number_tracks('Monday', 'Shelbyville')

5614

number of songs played at Springfield on Wed

In [23]:

number_tracks('Wednesday', 'Springfield')

11056

number of songs played at Shelbyville on Wed

In [24]:

number_tracks('Wednesday', 'Shelbyville')

7003

number of songs played at Springfield on Fry

In [25]:

number_tracks('Friday', 'Springfield')

15945

number of songs played at Shelbyville on Fry

In [26]:
#
number_tracks('Friday', 'Shelbyville')

5895

Result in Table

In [27]:

number_tracks = [  
        ['Shelbyville', 5614, 7003, 5895], 
        ['Springfield', 15740, 11056, 15945]
]
city_day_filtered = ['city', 'monday', 'wednesday', 'friday']

number_tracks_filtered = pd.DataFrame(data=number_tracks , columns=city_day_filtered)
number_tracks_filtered
        

Unnamed: 0,city,monday,wednesday,friday
0,Shelbyville,5614,7003,5895
1,Springfield,15740,11056,15945


**Conclusion**

The data you get reveals differences in user behavior:

- In the city of Springfield, track play peaked Monday and Friday, while Wednesday saw a decline in activity.
- In the city of Shelbyville, on the other hand, users listen to more music on Wednesdays.

Less user activity on Mondays and Fridays.

### Hypothesis 2:  Music at the Beginning and End of the Week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday evenings, Springfield residents listen to a different genre of music than the people of Shelbyville enjoy.


* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

Filtering data 

In [28]:

spr_general = df[df['city'] == 'Springfield']

In [29]:

shel_general = df[df['city'] == 'Shelbyville']

creating function to sort music preferences from each city

In [2]:



def genre_weekday(df, day, time1, time2):
    genre_df = df[df['day']==day] 
    genre_df = df[df['time']<=time2]
    genre_df = df[df['time']>=time1] 
    genre_df_grouped = genre_df.groupby('genre')['user_id'].count() 
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False) 
    return genre_df_sorted[:15]

calling the function for Monday morning in Springfield

In [31]:

genre_weekday(spr_general, 'Monday', '07.00', '11.00')

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
jazz            980
unknown         849
metal           832
soundtrack      785
folk            692
Name: user_id, dtype: int64

calling the function for Monday morning in Shelbyville 

In [32]:

genre_weekday(shel_general, 'Monday', '07.00', '11.00')

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
jazz            486
metal           378
soundtrack      331
rnb             321
rap             309
Name: user_id, dtype: int64

calling the function for Friday morning in Springfield

In [33]:

x = df[df['city'] == 'Springfield']
genre_weekday(x, 'Friday', '17.00', '23.00')

genre
pop            1983
dance          1430
rock           1386
electronic     1284
hiphop          685
world           501
classical       487
alternative     475
ruspop          435
rusrap          382
jazz            332
soundtrack      284
unknown         255
folk            237
metal           233
Name: user_id, dtype: int64

calling the function for Friday morning in Shelbyville 

In [34]:

y = df[df['city'] == 'Shelbyville']
genre_weekday(y, 'Monday', '17.00', '23.00')

genre
pop            839
rock           667
electronic     649
dance          626
hiphop         332
alternative    234
classical      219
rusrap         184
jazz           179
world          179
ruspop         170
metal          135
soundtrack     123
unknown        120
folk           104
Name: user_id, dtype: int64

**Conclusion**

After comparing the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to music of the same genre. The top five genres from both cities are the same, only the rock and electronic genres switch places.

2. In Springfield, the number of missing values is very large, so the value `'unknown'` is in 10th place. This means that the missing values account for a sizeable proportion of the data, so this fact could serve as a basis for questioning the reliability of our conclusions.

For Friday night, the situation is also similar. Individual genres vary quite a bit, but overall, the top 15 genres for both cities are the same.

Thus, the second hypothesis is partially proven correct:
* Users listen to the same music at the beginning and end of the week.
* There are no notable differences between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the significance of the number of missing values makes this result questionable. In Springfield, there are so many missing values that influence our top 15 genre results. If we didn't have these missing values, the results might have been different.

### Hypothesis 3: Genre Preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield residents prefer pop.

In [35]:

spr_genres = spr_general.groupby('genre')['user_id']\
.count()\
.sort_values(ascending=False)

In [36]:

spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: user_id, dtype: int64

In [37]:

shel_genres = shel_general.groupby('genre')['user_id']\
.count()\
.sort_values(ascending=False)

In [38]:

shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: user_id, dtype: int64

**Conclusion**

This hypothesis has been partially proven:
* Pop music is the most popular genre in Springfield, as we would expect.
* However, pop music proved equally popular in both Springfield and Shelbyville, and rap music did not make the top 5 genre lists for either city.


# Final Conclusions <a id='end'></a>

We have tested the following three hypotheses:

1. User activity in Springfield and Shelbyville depends on the day of the week, although these two cities vary in many ways.
2. On Monday mornings, residents of Springfield and Shelbyville tune in to different genres. This also applies on Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In both Springfield and Shelbyville, they preferred pop music.

After analyzing the available data, we can conclude that:

1. User activity in Springfield and Shelbyville depends on the day, even if the city is different.

The first hypothesis can be fully accepted.

2. Musical preferences did not vary significantly throughout the week in Springfield and Shelbyville. We can see a small difference in the order on Monday, but:
* In both Springfield and Shelbyville, users listen to pop music the most.

Therefore, we cannot accept this hypothesis. It is also important to remember that the results obtained could have been different had we not had the missing values.
3. It turns out that the music preferences of users from Springfield and Shelbyville are very similar.

The third hypothesis is rejected. If there are indeed differences in preference, unfortunately we cannot know this from this data.