# Y.Music

# Content <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data Overview](#data_review)
    * [Conclusion](#data_review_conclusions)
* [Stage 2. Data Preprocessing](#data_preprocessing)
    * [2.1 Title Writing Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusion](#data_preprocessing_conclusions)
* [Stage 3. Hypothesis Testing](#hypotheses)
    * [3.1 Hypothesis 1: User activity in both cities](#activity)
    * [3.2 Hypothesis 2: Music preferences on Mondays and Fridays](#week)
    * [3.3 Hypothesis 3: Genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Every analysis starts with hypotheses that then need to be tested. Sometimes the tests lead us to accept a hypothesis; other times we have to reject it. To make informed business decisions, we need to know whether our assumptions hold.

In this project, you will compare the music preferences of users in the cities of Springfield and Shelbyville. You will study actual Y.Music data to test the following hypotheses and compare user behavior in these two cities.

### Objectives: 

Test three hypotheses:

1. User activities vary depending on the day and city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This also applies to Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, users prefer pop music, while in Shelbyville, rap music has more fans.

### Stages

Data on user behavior is stored in the *file* `/datasets/music_project_en.csv`. There is no information available regarding the data quality, so you need to check it before testing the hypotheses.

First, you will evaluate the data quality and see if there are any significant issues. Then, during the data preprocessing stage, you will try to address the most serious problems.

This project will consist of three stages:

1. Data Overview
2. Data Preprocessing
3. Hypothesis Testing

[Back to Content](#back)

## Stage 1. Data Overview <a id='data_review'></a>

Open the Y.Music-related data and explore the data.

You will need the Pandas library, so please import it.

In [22]:
# Import Pandas
import pandas as pd

Read the music_project_en.csv file from the /datasets/ folder and save it to the variable df:

In [23]:
# Read the file and save it to df
df = pd.read_csv('datasets/music_project_en.csv')

Display the first 10 rows of the table:

In [24]:
# Get the first 10 rows of the df table
df.head(10)

|   | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Shelbyville | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Springfield | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Shelbyville | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Shelbyville | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Springfield | 08:34:34 | Monday |
| 5 | 842029A1 | Chains | Obladaet | rusrap | Shelbyville | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Springfield | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Springfield | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | L’estate | Julia Dalia | ruspop | Springfield | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Shelbyville | 21:20:49 | Wednesday |


Get general information about the table with a single command:

In [25]:
# Get general information about the available data in df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table has seven columns. They all store data of the same type: object.

Based on the documentation:
- `'userID'` — user ID
- `'Track'` — track title
- `'artist'` — artist name
- `'genre'`
- `'City'` — city where the user is located
- `'time'` — time of day when the track was played
- `'Day'` — day of the week

We can see three issues with the column name styling:
1. Some names contain uppercase letters, while others are all lowercase.
2. Some names contain leading or trailing spaces.
3. `userID` joins two words without a separator instead of using snake_case (`user_id`).

Separately, the documentation gives no description for the 'genre' column.

The number of non-null values differs between columns, which indicates that the data contains missing values.

### Conclusion <a id='data_review_conclusions'></a> 

Each row in the table contains data related to a played music track. Several columns store data describing the track itself: track title, artist, and genre. The rest of the columns store data related to user information: their city of origin and the time they played the music track.

It is clear that the data we have is sufficient to test the hypotheses. However, we do have missing values.

To proceed with the analysis, we need to perform data preprocessing first.

[Back to Content](#back)

## Stage 2. Data Preprocessing <a id='data_preprocessing'></a>
Fix the formatting of column titles and handle missing values. Then, check if your data contains duplicates.

### Title Writing Style <a id='header_style'></a>
Display the column titles:


In [26]:
# List the column names in the df table.
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Change the column names according to good writing style rules:

* If the name consists of multiple words, use snake_case.
* All characters should be lowercase.
* Remove spaces.
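
As an alternative to listing every column by hand, the stripping and lowercasing rules can be applied to all names at once. The following is only a minimal sketch on a toy DataFrame: `raw_columns` and `toy` are illustrative names, and the multi-word name still needs an explicit mapping.

```python
import pandas as pd

# Toy frame with the same raw column names as the table (cell values are placeholders).
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
toy = pd.DataFrame([['x'] * len(raw_columns)], columns=raw_columns)

# Strip surrounding spaces and lowercase every name in one pass.
toy.columns = toy.columns.str.strip().str.lower()

# Multi-word names still need an explicit snake_case mapping.
toy = toy.rename(columns={'userid': 'user_id'})

print(list(toy.columns))
```

This approach scales better when a table has many columns, but an explicit `rename()` mapping makes every change visible to the reader.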

In [27]:
# Rename the columns
df = df.rename(columns={
    '  userID': 'user_id',
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day',
})

Check the result. Display the column names again.

In [28]:
# Check the result. Display the column names once again.
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

[Back to Content](#back)

### Missing Values <a id='missing_values'></a>
First, find the number of missing values in the table. To do this, use two Pandas methods:

In [29]:
# Count the missing values.
df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values have an impact on the study. For example, missing values in the 'track' and 'artist' columns may not be crucial. You can simply replace them with a clear indicator. However, missing values in the 'genre' column can affect the comparison of music preferences in Springfield and Shelbyville. In real life, it is useful to investigate why the data is missing and attempt to rectify it. Unfortunately, we don't have that opportunity in this project. Therefore, you should:
* Fill in the missing values with an indicator.
* Evaluate how much the missing values can impact your calculations.
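
One simple way to evaluate the impact is to look at the share of missing values per column instead of the raw counts. The sketch below uses a made-up toy table (`toy` is an illustrative name, not the project data) to show the idea.

```python
import pandas as pd

# Toy data standing in for the real table (values are made up for illustration).
toy = pd.DataFrame({
    'track': ['a', None, 'c', 'd'],
    'genre': ['pop', 'rock', None, None],
})

# isna() marks missing cells; mean() turns the True/False marks into a fraction.
missing_share = toy.isna().mean()
print(missing_share)
```

With the counts shown above, 1198 of the 65079 rows (about 1.8%) have no genre, so the gap is small overall but can still matter in narrow slices of the data.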

Replace the missing values in the 'track', 'artist', and 'genre' columns with the string 'unknown'. To do this, create a list called columns_to_replace, loop over it with for, and replace the missing values in each column:

In [30]:
# Loop through the column names and replace the missing values with 'unknown'.

columns_to_replace = ['track', 'artist', 'genre']
for col in columns_to_replace:
    df[col] = df[col].fillna('unknown')
    df[col] = df[col].replace('', 'unknown')

Ensure that there are no more missing values in the table. Recalculate the number of missing values.

In [31]:
# Recount the missing values.
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

[Back to Content](#back)

### Duplicates <a id='duplicates'></a>
Find the number of explicit duplicates in the table using a single command:

In [32]:
# Calculate explicit duplicates.
df.duplicated().sum()

3826

Call the Pandas method to remove explicit duplicates:

In [33]:
# Remove explicit duplicates.
df = df.drop_duplicates()

Calculate explicit duplicates once again to ensure that you have removed all of them:

In [34]:
# Check for duplicates.
df.duplicated().sum()

0

Now, remove the implicit duplicates in the 'genre' column. For example, genre names may be written differently. Such errors will also affect your results.

Display a list containing unique genre names, then sort the list alphabetically. To do this:
* Extract the intended DataFrame column.
* Apply a sorting method to the column.
* For the sorted column, call a method that will generate all the unique values of the column.
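
The three steps above can be sketched on a toy column as follows. The `genres` Series is illustrative; sorting the extracted column leaves the row order of the DataFrame itself unchanged.

```python
import pandas as pd

# Toy genre column with an implicit duplicate ('hip-hop' vs 'hiphop'); values are illustrative.
genres = pd.Series(['rock', 'hiphop', 'pop', 'hip-hop', 'rock'], name='genre')

# Sort the column first, then ask for its unique values.
unique_sorted = genres.sort_values().unique()
print(unique_sorted)
```

In a sorted list, near-identical spellings end up next to each other, which makes implicit duplicates easier to spot.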

In [35]:
# View unique genre names
df = df.sort_values(by='genre')
df['genre'].unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Inspect the list carefully to find implicit duplicates of the 'hiphop' genre. The duplicates could be in the form of misspelled names or alternative names for the same genre.

You will find the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To remove them, use the replace_wrong_genres() function with two parameters:
* `wrong_genres=` — a list of duplicates to be replaced
* `correct_genre=` — a string with the correct value

The function should correct the names in the 'genre' column of the df table by replacing each value from the wrong_genres list with the value from the correct_genre.
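
For comparison, here is a hedged sketch of a variant that receives the table as an argument instead of modifying the global `df`. The name `replace_wrong_genres_in` and the toy data are hypothetical; the project itself uses the two-parameter version described above.

```python
import pandas as pd

def replace_wrong_genres_in(frame, wrong_genres, correct_genre):
    """Return a copy of frame with every wrong genre name replaced (hypothetical helper)."""
    result = frame.copy()
    # Series.replace accepts a list of values to swap for a single replacement value.
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
cleaned = replace_wrong_genres_in(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned['genre'].unique())
```

Passing the table explicitly makes the function easier to test, because the original DataFrame stays untouched.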

In [36]:
# Function to replace implicit duplicates.
def replace_wrong_genres(wrong_genres, correct_genre): 
    for wrong_genre in wrong_genres: 
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

Call replace_wrong_genres() and pass the arguments to the function so that it can remove the implicit duplicates (hip, hop, and hip-hop) and replace them with hiphop:

In [37]:
# Apply the function that replaces implicit duplicates.
duplicates = ['hip', 'hop', 'hip-hop']
name = 'hiphop'
replace_wrong_genres(duplicates, name)
df['genre']  # 'genre' column with the implicit duplicates replaced

32883         acid
3650      acoustic
20644     acoustic
39129     acoustic
14811     acoustic
           ...    
6189         world
18268        world
28665    worldbeat
40493    worldbeat
8514           ïîï
Name: genre, Length: 61253, dtype: object

Ensure that the duplicated values have been removed. Display the list of unique values from the 'genre' column:

In [38]:
# Check for implicit duplicates.
df['genre'].unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

[Back to Content](#back)

### Conclusion <a id='data_preprocessing_conclusions'></a>
We have identified three issues in our data:

- Incorrect title writing style
- Missing values
- Explicit and implicit duplicates

The column titles have now been cleaned for easier table processing. All missing values have been replaced with 'unknown'. However, we still need to consider whether the missing values in the 'genre' column will affect our calculations.

The absence of duplicates will make our results more accurate and easier to understand.

Now, we can proceed to hypothesis testing.

[Back to Content](#back)

## Stage 3. Hypothesis Testing <a id='hypotheses'></a>

### Hypothesis 1: Comparing User Behavior in Two Cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville have different music listening behaviors. This test uses data collected from three days of the week: Monday, Wednesday, and Friday.

* Divide users into groups based on the city.
* Compare the number of tracks played by each group on Monday, Wednesday, and Friday.

As an exercise, perform each calculation separately.

Evaluate user activity in each city. Group the data by city and find the number of tracks played in each group.

In [39]:
# Calculate the number of tracks played in each city.
df.groupby('city')['track'].count().reset_index()

|   | city | track |
|---|---|---|
| 0 | Shelbyville | 18512 |
| 1 | Springfield | 42741 |


Users from Springfield play more tracks than users from Shelbyville. However, this does not imply that Springfield residents listen to music more frequently. Springfield is a larger city with more users.

Now, group the data by day and find the number of tracks played on Monday, Wednesday, and Friday.

In [40]:
# Calculate the number of tracks played on each day.
df.groupby('day')['track'].count().reset_index()

|   | day | track |
|---|---|---|
| 0 | Friday | 21840 |
| 1 | Monday | 21354 |
| 2 | Wednesday | 18059 |


Wednesday is the overall quietest day. However, if we consider each city separately, we may reach a different conclusion.

You have seen how grouping by city or day works. Now, write a function that will group the data by city and day.

Create the number_tracks() function to count the number of tracks played for a specific day and city. The function will require two parameters:
* the name of the weekday
* the name of the city

In the function you create, use a variable to store rows from the original table, where:
  * The value in the 'day' column is equal to the day parameter
  * The value in the 'city' column is equal to the city parameter

Apply sequential filtering with logical indexing.

Then, count the values in the 'user_id' column in the resulting table. Save the result to a new variable. Return this variable from the function.
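
Sequential filtering means applying one condition at a time, each to the result of the previous step. A minimal sketch on toy data follows; `count_plays_sequential` and `toy` are illustrative names, and the helper takes the table as an extra argument so that it is self-contained.

```python
import pandas as pd

# Toy play log (values are illustrative).
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4'],
    'city': ['Springfield', 'Shelbyville', 'Springfield', 'Springfield'],
    'day': ['Monday', 'Monday', 'Friday', 'Monday'],
})

def count_plays_sequential(frame, day, city):
    """Hypothetical helper: filter by day, then by city, then count rows."""
    by_day = frame[frame['day'] == day]        # step 1: keep the requested day
    by_city = by_day[by_day['city'] == city]   # step 2: keep the requested city
    return by_city['user_id'].count()

print(count_plays_sequential(toy, 'Monday', 'Springfield'))
```

Combining both conditions in a single mask with `&` selects the same rows; the sequential form is mostly easier to read and debug step by step.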

In [60]:
# Creating the number_tracks() function
# We will declare a function with two parameters: day=, city=.
# Set the track_list variable to store the rows from df where
# the value in the 'day' column is equal to the day parameter, and at the same time,
# the value in the 'city' column is equal to the city parameter (apply sequential filtering
# with logical indexing).
# Set the track_list_count variable to store the count of values in the 'user_id' column of track_list
# (find it using the count() method).
# Set the function you created to return the value of track_list_count as a number.
# The function counts the tracks played for a specific city and day.
# First, it will take the rows with the desired day from the table, then filter those rows with the desired city,
# then find the count of the 'user_id' values in the filtered table,
# then return that count.
# To see the result, wrap the function call in print().

def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

Call number_tracks() six times and change the parameter values in each call so that you can retrieve data for both cities for each day (Monday, Wednesday, and Friday).

In [42]:
# Number of tracks played in Springfield on Monday.
monday_springfield=number_tracks('Monday','Springfield')
monday_springfield

15740

In [43]:
# Number of tracks played in Shelbyville on Monday.
monday_shelbyville=number_tracks('Monday','Shelbyville')
monday_shelbyville

5614

In [44]:
# Number of tracks played in Springfield on Wednesday.
wednesday_springfield=number_tracks('Wednesday','Springfield')
wednesday_springfield

11056

In [45]:
# Number of tracks played in Shelbyville on Wednesday.
wednesday_shelbyville=number_tracks('Wednesday','Shelbyville')
wednesday_shelbyville

7003

In [46]:
# Number of tracks played in Springfield on Friday.
friday_springfield=number_tracks('Friday','Springfield')
friday_springfield

15945

In [47]:
# Number of tracks played in Shelbyville on Friday.
friday_shelbyville=number_tracks('Friday','Shelbyville')
friday_shelbyville

5895

Create a table using pd.DataFrame, where:
* The column names are: `['City', 'Monday', 'Wednesday', 'Friday']`
* The data consists of the results from `number_tracks()`
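
Outside of this exercise, a table of this shape could also be produced in one step by grouping on both columns. The sketch below uses toy data and is an alternative under that assumption, not the approach required here.

```python
import pandas as pd

# Toy play log (values are illustrative).
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'city': ['Springfield', 'Shelbyville', 'Springfield', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Monday', 'Friday', 'Monday', 'Wednesday'],
})

# Count plays for every (city, day) pair, then spread the days into columns.
counts = toy.groupby(['city', 'day'])['user_id'].count().unstack(fill_value=0)
print(counts)
```

Here `fill_value=0` fills in city and day combinations that never occur in the data, so the table has no gaps.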

In [48]:
# Table with the results.
headers=['City', 'Monday', 'Wednesday', 'Friday']
rows={'City': ['Springfield','Shelbyville'],
      'Monday': [monday_springfield, monday_shelbyville],
      'Wednesday': [wednesday_springfield, wednesday_shelbyville],
      'Friday': [friday_springfield, friday_shelbyville]
      }
      
table_result=pd.DataFrame(data=rows, columns=headers)
table_result

|   | City | Monday | Wednesday | Friday |
|---|---|---|---|---|
| 0 | Springfield | 15740 | 11056 | 15945 |
| 1 | Shelbyville | 5614 | 7003 | 5895 |


**Conclusion**

The data you obtained reveals differences in user behavior:

- In Springfield, the number of tracks played peaks on Monday and Friday, with a decrease in activity on Wednesday.
- In Shelbyville, on the other hand, users listen to more music on Wednesday, and activity is lower on Monday and Friday.
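
Because Springfield has far more plays overall, the pattern is easier to compare when each count is expressed as a share of its city's total. The sketch below reuses the numbers from the results table above; the variable names are illustrative.

```python
import pandas as pd

# Play counts copied from the results table above.
counts = pd.DataFrame(
    {'Monday': [15740, 5614], 'Wednesday': [11056, 7003], 'Friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

# Divide each row by its own total so the two cities become comparable.
shares = counts.div(counts.sum(axis=1), axis=0).round(3)
print(shares)
```

Wednesday accounts for roughly 26% of the plays in Springfield but roughly 38% in Shelbyville, which matches the conclusion above.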

[Back to Content](#back)

### Hypothesis 2: Music Preferences on Mondays and Fridays <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday nights, residents of Springfield listen to different music genres compared to those enjoyed by Shelbyville residents.

Get the following tables (make sure the variable names match those used in the two code cells below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [49]:
# Get the spr_general table from the df rows
# where the value in the 'city' column is 'Springfield'
spr_general=df[df['city']=='Springfield']

In [50]:
# Get the shel_general table from the df rows
# where the value in the 'city' column is 'Shelbyville'
shel_general=df[df['city']=='Shelbyville']

Write the genre_weekday() function with four parameters:
* A table for the data
* Day name
* First timestamp, in the 'hh:mm' format
* Last timestamp, in the 'hh:mm' format

The function should provide information about the 15 most popular genres on a specific day within a period between two timestamps.
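
The 'time' column stores strings such as '08:34:34', so the timestamps are compared as text. This works because zero-padded times sort chronologically when compared character by character. A small sketch with made-up sample values:

```python
# Sample 'time' values in the 'hh:mm:ss' format used by the table.
times = ['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00']

time1, time2 = '07:00', '11:00'

# Strings compare character by character, so zero-padded times sort chronologically.
in_window = [t for t in times if time1 < t < time2]
print(in_window)
```

Note the boundary behavior: '07:00:00' is longer than '07:00' and therefore counts as greater, so it is kept, while '11:00:00' is greater than '11:00' and is dropped.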

In [61]:
# Declare the genre_weekday() function with the town=, day=, time1=, and time2= parameters. The function should
# provide information about the most popular genres on a specific day and time:
#
# 1) Set the genre_df variable to store the rows that satisfy the following conditions:
#    - the value in the 'day' column is equal to the day argument
#    - the value in the 'time' column is greater than the time1 argument
#    - the value in the 'time' column is less than the time2 argument
#    Use sequential filtering with logical indexing.
#
# 2) Group genre_df by the 'genre' column, then take one of its columns,
#    and use the count() method to find the number of entries for each
#    represented genre; store the resulting Series in the
#    genre_df_count variable.
#
# 3) Sort genre_df_count in descending order based on frequency and store the result
#    in the genre_df_sorted variable.
#
# 4) Generate a Series object with the first 15 genre_df_sorted values - the 15 most
#    popular genres (on a specific day, within a specific time range)
#
# write your function here

def genre_weekday(town, day, time1, time2):

    # sequential filtering
    # genre_df keeps only the rows of the table whose day equals the day argument
    genre_df = town[town['day'] == day]

    # genre_df keeps only the rows whose time is less than time2
    genre_df = genre_df[genre_df['time'] < time2]

    # genre_df keeps only the rows whose time is greater than time1
    genre_df = genre_df[genre_df['time'] > time1]

    # group the filtered DataFrame by the 'genre' column, take the 'genre' column, and count the rows for each genre with count()
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()

    # sort the result in descending order (so that the most popular genres come first in the Series)
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)

    # return a Series with the 15 most popular genres on the given day within the given time range
    return genre_df_sorted[:15]

Compare the results of the genre_weekday() function for Springfield and Shelbyville on Monday mornings (from 07:00 to 11:00) and on Friday nights (from 17:00 to 23:00):

In [52]:
# calling the function for Monday morning in Springfield (use spr_general instead of the df table)
genre_weekday(spr_general,'Monday','07:00','11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [53]:
# calling the function for Monday morning in Shelbyville (use shel_general instead of the df table)
genre_weekday(shel_general,'Monday','07:00','11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [54]:
# calling the function for Friday night in Springfield
genre_weekday(spr_general,'Friday','17:00','23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [55]:
# calling the function for Friday night in Shelbyville
genre_weekday(shel_general,'Friday','17:00','23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Conclusion**

After comparing the top 15 genres on Monday mornings, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to the same genres of music. The top five genres are the same for both cities, with only rock and electronic genres swapping places.

2. In Springfield, the missing value count turns out to be quite significant, so the 'unknown' value is ranked at number 10. This means that the missing values encompass a considerable proportion of the data, so this fact could be grounds for questioning the reliability of our conclusions.

For Friday nights, the situation is similar. Individual rankings shift slightly, but 14 of the top 15 genres are shared by both cities (only rnb in Springfield and rap in Shelbyville differ).

Thus, the second hypothesis is not confirmed by the data:

* Users listen to similar music at the beginning and at the end of the week.
* There is no significant difference between Springfield and Shelbyville. Pop music is the most popular genre in both cities.

However, the significance of the missing values raises doubts about these findings. In Springfield, there are so many missing values that they affect our top 15 genre results. If we didn't have these missing values, the results might be different.
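
To put a number on this concern, the share of 'unknown' genres can be computed for each city. The following is a rough sketch on made-up toy data (`toy` is illustrative), not a result from the project table.

```python
import pandas as pd

# Toy data (values are illustrative).
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville'],
    'genre': ['pop', 'unknown', 'rock', 'pop', 'unknown', 'pop'],
})

# Mark placeholder rows, then average the marks within each city.
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

If the share is small compared with the gaps between neighboring genres in the ranking, the placeholder rows are unlikely to change the order of the top genres.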

[Back to Content](#back)

### Hypothesis 3: Genre Preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville residents prefer rap music, while Springfield residents prefer pop music.

Group the spr_general table by genre and find the count of tracks played for each genre using the count() method. Then, sort the results in descending order and save them to spr_genres.

In [56]:
# In one line: group the spr_general table by the 'genre' column,
# Count the values in the 'track' column within each group,
# Sort the resulting Series in descending order and save it to spr_genres
spr_genres = spr_general.groupby('genre')['track'].count().sort_values(ascending=False)

Display the first 10 rows of spr_genres:

In [62]:
# Display the first 10 rows of spr_genres:
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: track, dtype: int64

Now do the same for the data from Shelbyville.

Group the shel_general table by genre and find the count of tracks played for each genre. Then, sort the results in descending order and save them to shel_genres:

In [58]:
# In one line: group the shel_general table by the 'genre' column,
# Count the values in the 'track' column within each group using count(),
# Sort the resulting Series in descending order and save it to shel_genres
shel_genres = shel_general.groupby('genre')['track'].count().sort_values(ascending=False)

Display the first 10 rows of shel_genres:

In [59]:
# Display the first 10 rows of shel_genres
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: track, dtype: int64

**Conclusion**

This hypothesis is partially proven true:
* Pop music is the most popular genre in Springfield, as we predicted.
* However, pop music is also the most popular genre in Shelbyville, and rap music does not make it into the top 5 genres in either city.
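
Raw counts are hard to compare because Springfield has more plays overall, so it helps to look at shares instead. The sketch below reuses the totals per city and the pop counts from the outputs above; the dictionary names are illustrative.

```python
# Totals and pop counts copied from the outputs above.
total_plays = {'Springfield': 42741, 'Shelbyville': 18512}
pop_plays = {'Springfield': 5892, 'Shelbyville': 2431}

# Share of pop among all plays in each city.
pop_share = {city: round(pop_plays[city] / total_plays[city], 3) for city in total_plays}
print(pop_share)
```

Pop accounts for roughly 13.8% of the plays in Springfield and 13.1% in Shelbyville, so its relative popularity is close in both cities.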

[Back to Content](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity varies depending on the day of the week and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This also applies to Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, users prefer pop music, while in Shelbyville, rap music has more fans.

After analyzing the available data, we can conclude that:

1. User activity in Springfield and Shelbyville depends on the day of the week, and the pattern differs between the cities: Springfield peaks on Monday and Friday, while Shelbyville peaks on Wednesday.

The first hypothesis is fully accepted.

2. Music preferences do not vary significantly throughout the week in Springfield and Shelbyville. The rankings differ slightly on Monday, but users in both cities listen to pop music the most.

Therefore, this hypothesis cannot be accepted. It is also important to note that the results obtained could be different if we did not have missing values.

3. It turns out that the music preferences of users from Springfield and Shelbyville are very similar.

The third hypothesis is rejected. If there are indeed differences in preferences, unfortunately, we cannot know them from this data.

### Note
In a real project, research involves statistical hypothesis testing, which is more accurate and quantitative. Also, note that you cannot always draw conclusions about an entire city based on data from a single source.
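
As a rough illustration of what such a test could look like, the sketch below computes a chi-square statistic for the city and day counts from Hypothesis 1 by hand. It treats every play as an independent observation, which is a simplification because one user contributes many plays, and it is not necessarily the method taught later in the course.

```python
# Observed play counts (rows: Springfield, Shelbyville; columns: Mon, Wed, Fri).
observed = [[15740, 11056, 15945],
            [5614, 7003, 5895]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# 5.991 is the 5% critical value of the chi-square distribution with
# (2 - 1) * (3 - 1) = 2 degrees of freedom.
print(chi2 > 5.991)
```

A statistic above the critical value suggests that the distribution of plays across days differs between the cities, subject to the independence assumption stated above.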

You will learn about hypothesis testing in the statistical data analysis sprint.

[Back to Content](#back)