# Yandex.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data Description](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data Preprocessing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Hypothesis Testing](#hypotheses)
    * [3.1 Hypothesis 1: Compare User Behavior in the Two Cities](#activity)
    * [3.2 Hypothesis 2: Music at the Beginning and End of the Week](#week)
    * [3.3 Hypothesis 3: Genre Preferences in Springfield and Shelbyville](#genre)
* [Conclusions](#end)

## Introduction<a id='intro'></a>
Whenever we conduct research, we need to formulate hypotheses that we can later test. Sometimes we accept these hypotheses; other times, we reject them. To make the right decisions, a company must be able to tell whether its assumptions are correct.

In this project, you will compare the musical preferences of the cities of Springfield and Shelbyville. You will study real data from Yandex.Music to test the hypotheses below and compare the user behavior of these two cities.

### Objective:
Test three hypotheses:
1. User activity differs depending on the day of the week and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. The same happens on Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. Springfield prefers pop music, while Shelbyville has more rap enthusiasts.

### Stages
User behavior data is stored in the file `/datasets/music_project_en.csv`. There is no information about data quality, so you will need to examine it before testing the hypotheses.

First, you will evaluate the quality of the data and see if the issues are significant. Then, during data preprocessing, you will address the most critical issues.

Your project will consist of three stages:
1. Data Description
2. Data Preprocessing
3. Hypothesis Testing

[Back to Contents](#back)

## Stage 1. Data Description <a id='data_review'></a>

Open the Yandex.Music data and examine it.

You will need `pandas`, so import it.

In [1]:
# Importing pandas
import pandas as pd

Read the file `music_project_en.csv` from the `/datasets/` folder and store it in the variable `df`:

In [2]:
# Reading the file and storing it in df
df = pd.read_csv('/datasets/music_project_en.csv')

Print the first 10 rows of the table:

In [3]:
# Getting the first 10 rows of the df table
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


Get general information about the table with a command:

In [4]:
# Getting general information about the data in df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns. All of them store the same type of data: object.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist name
- `'genre'` — genre
- `'City'` — user city
- `'time'` — the exact time the track was played
- `'Day'` — day of the week

We can see three issues with the naming style of the columns:
1. Some names contain uppercase characters, while others are entirely lowercase.
2. Some names contain leading or trailing spaces.
3. The multi-word name `userID` is not written in snake_case.

The non-null counts differ between columns, which means that the data contains missing values.
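
The naming issues can also be surfaced programmatically. The following is a minimal sketch on a hypothetical empty frame (`toy`) that reuses the header spellings from this dataset; the variable names `padded` and `not_lower` are illustrative only.

```python
import pandas as pd

# Hypothetical empty frame that reuses the header spellings of the dataset
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Headers with leading or trailing spaces
padded = [name for name in toy.columns if name != name.strip()]

# Headers that contain uppercase characters
not_lower = [name for name in toy.columns if name != name.lower()]

print(padded)
print(not_lower)
```

Printing the headers as a list (rather than reading them from `df.info()`) makes the surrounding spaces visible, because each name appears inside quotes.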

### Conclusions <a id='data_review_conclusions'></a>

Each row in the table stores data about the track that was played. Some columns describe the track itself: its title, artist, and genre. The rest provide user information: the city they come from and the time the track was played.

It is clear that the data is sufficient to test the hypotheses. However, there are missing values.

To proceed, we need to preprocess the data.

[Back to Contents](#back)

## Stage 2. Data Preprocessing <a id='data_preprocessing'></a>
Correct the format in the column headers and handle the missing values. Then, check for duplicates in the data.

### Header Style <a id='header_style'></a>
Print the column headers:

In [5]:
# Here is the list of column names in the `df` table
print(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


Change the column names according to good style rules:
* If the name has multiple words, use snake_case
* All characters should be in lowercase
* Remove spaces

In [6]:
# Rename the columns
new_names = {
    '  userID': 'userid',
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day'
}

df = df.rename(columns=new_names)

Check the result. Print the column names once more:

In [7]:
# Checking the result: the list of column names
print(df.columns)

Index(['userid', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
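
As an alternative to an explicit mapping, the headers could be normalized in one pass with the string accessor on `df.columns`. This is only a sketch on a hypothetical empty frame (`toy`); note that it strips spaces and lowercases names but does not insert underscores between words.

```python
import pandas as pd

# Hypothetical empty frame with the original header spellings
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Strip surrounding spaces, then lowercase every header
toy.columns = toy.columns.str.strip().str.lower()

print(list(toy.columns))
```

The explicit dictionary used above has the advantage of documenting each rename; the one-pass version scales better when a table has many columns.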


[Back to Contents](#back)

### Missing Values <a id='missing_values'></a>
First, find the number of missing values in the table. To do this, chain two pandas methods, `isna()` and `sum()`:

In [8]:
# Calculating missing values
print(df.isna().sum())
# isnull() is an alias of isna(), so this call prints the same counts
print(df.isnull().sum())

userid       0
track     1343
artist    7567
genre     1198
city         0
time         0
day          0
dtype: int64
userid       0
track     1343
artist    7567
genre     1198
city         0
time         0
day          0
dtype: int64


Not all missing values affect the investigation. For example, missing values in the `'track'` and `'artist'` columns are not crucial. You can simply replace them with clear markers.

However, missing values in the `'genre'` column can affect the comparison of musical preferences between Springfield and Shelbyville. In real life, it would be helpful to understand the reasons for missing data and try to recover it. But we don't have that opportunity in this project. So you will need to:
* Fill in those missing values with markers
* Assess how much the missing values might affect your computations.
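
One way to gauge the possible impact is to look at the share of missing genres rather than the raw count, overall and per city. The sketch below uses a small hypothetical frame (`toy`) and must run before the missing values are replaced with markers.

```python
import pandas as pd

# Hypothetical sample: two of five rows lack a genre
toy = pd.DataFrame({
    'genre': ['pop', None, 'rock', 'pop', None],
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Shelbyville', 'Shelbyville'],
})

# Fraction of rows with a missing genre, overall
missing_share = toy['genre'].isna().mean()

# The same fraction, split by city
missing_by_city = toy['genre'].isna().groupby(toy['city']).mean()

print(missing_share)
print(missing_by_city)
```

With the counts printed above (1,198 missing genres out of 65,079 rows), the overall share in the real table is roughly 1.8%.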

Replace the missing values in `'track'`, `'artist'`, and `'genre'` with the string `'unknown'`. To do this, create the list `columns_to_replace`, iterate through it with a `for` loop, and replace the missing values in each column:

In [9]:
# Iterating through the column names and replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for col in columns_to_replace:
    df[col] = df[col].fillna('unknown')

Make sure the table contains no more missing values. Count the missing values again.

In [10]:
# Counting missing values.
print(df.isna().sum())

userid    0
track     0
artist    0
genre     0
city      0
time      0
day       0
dtype: int64


[Back to Contents](#back)

### Duplicates <a id='duplicates'></a>
Find the number of obvious duplicates in the table using one command:

In [11]:
# Counting duplicates
num_duplicados = df.duplicated().sum()
print(num_duplicados)

3826


Call the `pandas` method to remove duplicates:

In [12]:
# Removing duplicates
df = df.drop_duplicates().reset_index(drop=True)

Count the duplicates once again to make sure they have all been removed:

In [13]:
# Checking for duplicates
num_duplicados = df.duplicated().sum()
print(num_duplicados)

0
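
To see what these calls do on a small scale, here is a sketch with a hypothetical four-row frame (`toy`) in which one row appears three times.

```python
import pandas as pd

# Hypothetical sample: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    'userid': ['A1', 'A1', 'B2', 'A1'],
    'track': ['Chains', 'Chains', 'True', 'Chains'],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence;
# reset_index(drop=True) renumbers the remaining rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

Without `reset_index(drop=True)`, the surviving rows would keep their original labels (0 and 2 here), leaving gaps in the index.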


Now, get rid of implicit duplicates in the `genre` column. For example, a genre name might be written in different ways. Such errors can also affect the results.

Print a list of unique genre names, sorted in alphabetical order. Here’s how to do it:
* Retrieve the desired DataFrame column.
* Apply a sorting method to it.
* For the sorted column, call a method that will return all unique column values.

In [14]:
# Inspecting the unique genre names
genre_list = df['genre'].sort_values().unique().tolist()
print(genre_list)

['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans', 'alternative', 'ambient', 'americana', 'animated', 'anime', 'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook', 'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber', 'children', 'chill', 'chinese', 'choral', 'christian', 'christmas', 'classical', 'classicmetal', 'club', 'colombian', 'comedy', 'conjazz', 'contemporary', 'country', 'cuban', 'dance', 'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic', 'electropop', 'emo', 'entehno', 'epicmetal', 'estrada', 'ethnic', 'eurofolk', 'european', 'experimental', 'extrememetal', 'fado', 'film', 'fitness', 'flamenco', 'folk', 'folklor

Search the list to find implicit duplicates of the `hiphop` genre. These may be misspelled names or alternative names for the same genre.

You will see the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To get rid of them, declare the `replace_wrong_genres()` function with two parameters: 
* `wrong_genres=` — the list of duplicates
* `correct_genre=` — the string with the correct value

The function should correct the names in the `'genre'` column of the `df` table, i.e., replace each value in the `wrong_genres` list with the value in `correct_genre`.

In [15]:
# Function to replace the implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for genre in wrong_genres:
        df['genre'] = df['genre'].replace(genre, correct_genre)

Call `replace_wrong_genres()` and pass arguments to remove the implicit duplicates (`hip`, `hop`, and `hip-hop`) and replace them with `hiphop`:

In [16]:
# Removing implicit duplicates
replace_wrong_genres(['hip', 'hop', 'hip-hop'], 'hiphop')
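
`Series.replace()` also accepts a list of values, so the loop is not strictly necessary. The sketch below is a hypothetical variant (`replace_wrong_values`, not part of the project) that takes the table as a parameter and returns a copy instead of modifying the global `df`.

```python
import pandas as pd

def replace_wrong_values(frame, column, wrong_values, correct_value):
    """Return a copy of `frame` with every value in `wrong_values` replaced in `column`."""
    result = frame.copy()
    # replace() matches whole values, so 'pop' is not affected by 'hop'
    result[column] = result[column].replace(wrong_values, correct_value)
    return result

# Hypothetical sample containing the three implicit duplicates
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
cleaned = replace_wrong_values(toy, 'genre', ['hip', 'hop', 'hip-hop'], 'hiphop')

print(cleaned['genre'].tolist())
```

Returning a new table makes the function easier to test and reuse, at the cost of one extra assignment at the call site.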

Ensure that the duplicate names have been removed. Print the list of unique values from the `'genre'` column:

In [17]:
# Reviewing for implicit duplicates
unique_genres = df['genre'].unique()
unique_genres.sort()
print(unique_genres)

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisch' 

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>

We identified three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been corrected to simplify the processing of the table.

All missing values have been replaced with `'unknown'`. However, we still need to determine if the missing values in `'genre'` affect our calculations.

The removal of duplicates will make the results more accurate and easier to understand.

We can now proceed with testing the hypotheses.

[Back to Contents](#back)

## Stage 3. Hypothesis Testing <a id='hypotheses'></a>

### Hypothesis 1: Compare User Behavior in the Two Cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. Verify this using data from three days of the week: Monday, Wednesday, and Friday.

* Divide users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday, and Friday.

For the sake of the exercise, perform each calculation separately.

Evaluate user activity in each city. Group the data by city and find the number of songs played in each group.

In [18]:
# Counting the number of tracks played in each city
tracks_by_city = df[['city', 'track']]
tracks_by_city_grouped = tracks_by_city.groupby('city').count()
print(tracks_by_city_grouped)

             track
city              
Shelbyville  18512
Springfield  42741


Springfield has played more tracks than Shelbyville. However, this does not imply that Springfield residents listen to music more frequently. This city is simply larger and has more users.
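
To separate "more users" from "more listening per user", the play count can be divided by the number of distinct users in each city. This is a sketch on hypothetical data (`toy`); the column name `plays_per_user` is illustrative.

```python
import pandas as pd

# Hypothetical sample: Springfield has more users, Shelbyville has one heavy listener
toy = pd.DataFrame({
    'city': ['Springfield'] * 4 + ['Shelbyville'] * 2,
    'userid': ['A1', 'A1', 'B2', 'C3', 'D4', 'D4'],
    'track': ['t1', 't2', 't3', 't4', 't5', 't6'],
})

# Plays and distinct users per city
per_city = toy.groupby('city').agg(
    plays=('track', 'count'),
    users=('userid', 'nunique'),
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']

print(per_city)
```

In this toy sample the city with fewer total plays has the higher plays-per-user value, which is exactly the situation a raw count cannot reveal.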

Now, group the data by day of the week and find the number of tracks played on Monday, Wednesday, and Friday.

In [19]:
# Calculating the number of tracks played on each of the three days
filtered_data = df.loc[df['day'].isin(['Monday', 'Wednesday', 'Friday'])]
tracks_per_day = filtered_data.groupby('day')['track'].count()
print(tracks_per_day)

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64


Wednesday was the quietest day of all. However, if we consider the two cities separately, we might reach a different conclusion.

You have already seen how grouping by city or day works. Now, write a function that filters by both.

Create the function `number_tracks()` to calculate the number of tracks played on a specific day and city. It will require two parameters:
* day of the week
* name of the city

In the function, use a variable to store the rows of the original table where:
  * the value in the `'day'` column is equal to the day parameter
  * the value in the `'city'` column is equal to the city parameter

Apply consecutive filtering using logical indexing.

Then, count the values in the `'userid'` column of the resulting table. Store the result in a new variable. Return this variable from the function.

In [20]:
# Creating the function number_tracks()

def number_tracks(day, city):
    """Count the tracks played on a given day in a given city.

    The function first keeps the rows of df for the requested day, then keeps
    the rows for the requested city from that result (consecutive filtering
    with logical indexing), and returns the number of rows that remain.
    'userid' has no missing values, so this equals the count of 'userid' values.
    To see what the function returns, wrap the call in print().
    """
    # Rows of df where 'day' equals the day parameter
    track_list = df[df['day'] == day]
    # Of those rows, the ones where 'city' equals the city parameter
    track_list = track_list[track_list['city'] == city]
    # Number of rows left after both filters
    track_list_count = len(track_list)
    return track_list_count

print(number_tracks('Monday', 'Springfield'))

15740
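
Consecutive filtering can also be written as a single combined condition with the `&` operator. The following sketch uses a hypothetical helper (`number_tracks_combined`) that takes the table as a parameter, together with a small made-up sample.

```python
import pandas as pd

def number_tracks_combined(frame, day, city):
    """Count the rows of `frame` that match both `day` and `city`."""
    # Each comparison yields a boolean Series; & combines them element-wise
    mask = (frame['day'] == day) & (frame['city'] == city)
    # True counts as 1, so the sum is the number of matching rows
    return int(mask.sum())

# Hypothetical sample
toy = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Friday', 'Monday'],
    'city': ['Springfield', 'Shelbyville', 'Springfield', 'Springfield'],
})

print(number_tracks_combined(toy, 'Monday', 'Springfield'))
```

The parentheses around each comparison are required, because `&` binds more tightly than `==` in Python.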


Call `number_tracks()` six times, changing the parameter values, to retrieve data for both cities for each of the three days.

In [21]:
# The number of tracks played in Springfield on Monday
number_tracks('Monday', 'Springfield')

15740

In [22]:
# The number of tracks played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614

In [23]:
# The number of tracks played in Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056

In [24]:
# The number of tracks played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003

In [25]:
# The number of tracks played in Springfield on Friday
number_tracks('Friday', 'Springfield')

15945

In [26]:
# The number of tracks played in Shelbyville on Friday
number_tracks('Friday', 'Shelbyville')

5895

Use `pd.DataFrame` to create a table where:
* The column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data are the results you obtained from `number_tracks()`

In [27]:
table_col = pd.DataFrame(data={
    'city': ['Springfield', 'Shelbyville'],
    'monday': [15740, 5614],
    'wednesday': [11056, 7003],
    'friday': [15945, 5895]
})

In [28]:
# Table with the results
table_col

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
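
Outside the exercise, the same city-by-day table could be built in one step with `pd.crosstab()` instead of six separate calls. This is a sketch on hypothetical rows (`toy`), not the project's required approach.

```python
import pandas as pd

# Hypothetical sample of plays
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# Count rows for every (city, day) pair, then put the days in calendar order
days = ['Monday', 'Wednesday', 'Friday']
summary = pd.crosstab(toy['city'], toy['day']).reindex(columns=days, fill_value=0)

print(summary)
```

Because the counts come straight from the data, this version cannot drift out of sync with the table the way hand-copied numbers can.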


**Conclusions**

The data reveal differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays, while there is a drop in activity on Wednesdays.
- In Shelbyville, on the other hand, users listen to more music on Wednesdays. User activity on Mondays and Fridays is lower.

Thus, the first hypothesis seems to be correct.
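
Because Springfield has far more plays overall, the comparison is easier to read when each city's counts are converted to shares of its own total. The sketch below reuses the six counts obtained from `number_tracks()`; the names `counts` and `shares` are illustrative.

```python
import pandas as pd

# Counts taken from the results of number_tracks()
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

# Divide each row by its own total so both cities are on the same scale
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares.round(3))
```

On this scale, Wednesday is the smallest share for Springfield and the largest for Shelbyville, which matches the conclusion drawn from the raw counts.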

[Back to Contents](#back)

### Hypothesis 2: Music at the Beginning and End of the Week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday nights, the citizens of Springfield listen to genres that differ from those enjoyed by users in Shelbyville.

Obtain tables (make sure the name of your combined table matches the DataFrame given in the two code blocks below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [29]:
# Obtaining the table `spr_general` from the rows of `df`, where the values in the column `'city'` are `'Springfield'`
spr_general = df[df["city"] == "Springfield"]

In [30]:
# Obtaining the table `shel_general` from the rows of `df`, where the values in the column `'city'` are `'Shelbyville'`
shel_general = df[df["city"] == "Shelbyville"]

Write the function `genre_weekday()` with four parameters:
* A table for the data (`df`)
* The day of the week (`day`)
* The start of the time window in 'hh:mm' format (`time1`)
* The end of the time window in 'hh:mm' format (`time2`)

The function should return information about the 15 most popular genres for a given day within a period between two timestamps.

In [31]:
# Declaring the `genre_weekday()` function with parameters `df=`, `day=`, `time1=`, and `time2=`.
# It should return information about the most popular genres for a given day at a specific time:

"""
1) Let the variable `genre_df` store the rows that meet several conditions:
   - The value in the 'day' column equals the value of the `day=` argument
   - The value in the 'time' column is greater than or equal to the value of the `time1=` argument
   - The value in the 'time' column is less than or equal to the value of the `time2=` argument
Use consecutive filtering with logical indexing.

2) Group `genre_df` by the 'genre' column, take one of its columns,
   and use the `count()` method to find the number of entries for each genre;
   store the resulting Series in the `genre_df_grouped` variable.

3) Sort `genre_df_grouped` in descending order of frequency and save the result
   in the `genre_df_sorted` variable.

4) Return a Series object with the top 15 values from `genre_df_sorted` - the 15
   most popular genres (on a specific day, during a specific time period).
"""


def genre_weekday(df, day, time1, time2):

    # Keep only the rows of `df` where the day equals `day=`
    genre_df = df.loc[df['day'] == day]

    # Of those rows, keep only the ones where the time is less than or equal to `time2=`
    genre_df = genre_df.loc[genre_df['time'] <= time2]

    # Of those rows, keep only the ones where the time is greater than or equal to `time1=`
    genre_df = genre_df.loc[genre_df['time'] >= time1]

    # Group the filtered DataFrame by the 'genre' column, select the 'genre' column, and find the number of rows for each genre using the `count()` method
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()

    # Sort the result in descending order (so that the most popular genres will appear first in the Series object)
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)

    # Return the Series object that stores the 15 most popular genres on a specific day during a specified time period
    return genre_df_sorted[:15]
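
The `'time'` column holds strings, so the comparisons in `genre_weekday()` are lexicographic. This works because the values are zero-padded, but it has one subtlety at the upper bound. A small sketch with made-up timestamps:

```python
# Zero-padded 'hh:mm:ss' strings sort in the same order as the times they represent
times = ['08:37:09', '20:28:33', '11:00:00', '07:00:00', '10:59:59']

# Same comparison style as in genre_weekday(): 'hh:mm:ss' values against 'hh:mm' bounds
in_window = [t for t in times if '07:00' <= t <= '11:00']

print(in_window)
```

A string that merely extends another one sorts after it, so `'07:00:00'` passes the lower bound while `'11:00:00'` fails the upper bound. In practice the window therefore covers 07:00:00 through 10:59:59.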

Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 07:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [32]:
# Calling the function for Monday morning in Springfield (using spr_general instead of the df table)
genre_weekday(spr_general, day='Monday', time1='07:00', time2='11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [33]:
# Calling the function for Monday morning in Shelbyville (using shel_general instead of the df table)
genre_weekday(shel_general, day='Monday', time1='07:00', time2='11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [34]:
# Calling the function for Friday evening in Springfield
genre_weekday(spr_general, day='Friday', time1='17:00', time2='23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [35]:
# Calling the function for Friday evening in Shelbyville
genre_weekday(shel_general, day='Friday', time1='17:00', time2='23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Conclusion**

Having compared the 15 most popular genres on Monday morning, we can conclude the following:

1. Users in Springfield and Shelbyville listen to similar music. The top five genres are the same, although rock and electronic music have swapped positions.

2. In Springfield, the number of missing values was so high that the value `'unknown'` ranked tenth. This means that missing values make up a significant portion of the data, which might be the basis for questioning the reliability of our conclusions.

For Friday evening, the situation is similar. While individual genres vary somewhat, the top 15 are quite similar in both cities.

Thus, the second hypothesis has been only partially confirmed:
* Users listen to similar music at the beginning and end of the week.
* There is no significant difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result somewhat questionable. In Springfield, the missing values affect our top 15 significantly. Without these missing values, the results might look different.
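
The similarity of the two Monday-morning rankings can be checked with a simple set comparison. The sketch below copies the genre names from the two Monday outputs above; the variable names are illustrative.

```python
# Top-15 genres on Monday morning, copied from the outputs above
spr_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop', 'ruspop', 'world',
              'rusrap', 'alternative', 'unknown', 'classical', 'metal', 'jazz',
              'folk', 'soundtrack']
shel_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop', 'ruspop',
               'alternative', 'rusrap', 'jazz', 'classical', 'world', 'rap',
               'soundtrack', 'rnb', 'metal']

# Genres present in both rankings, and those unique to each city
shared = set(spr_monday) & set(shel_monday)
only_spr = set(spr_monday) - set(shel_monday)
only_shel = set(shel_monday) - set(spr_monday)

print(len(shared), sorted(only_spr), sorted(only_shel))
```

Thirteen of the fifteen genres appear in both lists. The Springfield-only entries are `folk` and the `unknown` marker, and the Shelbyville-only entries are `rap` and `rnb`.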

[Back to Contents](#back)

### Hypothesis 3: Genre Preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music, while the citizens of Springfield prefer pop.

Group the `spr_general` table by the 'genre' column and find the number of songs played for each genre using the `count()` method. Then, sort the result in descending order and store it in `spr_genres`.

In [36]:
# Group the spr_general table by the column 'genre',
# count the rows in each group using size(),
# sort the resulting Series in descending order, and store it in spr_genres
spr_genres = spr_general.groupby('genre').size()
spr_genres = spr_genres.sort_values(ascending=False)

Print the first 10 rows of `spr_genres`:

In [37]:
# Printing the first 10 rows of `spr_genres`
print(spr_genres.head(10))

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
dtype: int64
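
For a single column, `value_counts()` gives the same ranking in one call, since it counts and sorts in descending order by default. A sketch on a hypothetical sample (`toy`):

```python
import pandas as pd

# Hypothetical sample of plays
toy = pd.DataFrame({'genre': ['pop', 'rock', 'pop', 'dance', 'pop', 'rock']})

# Two-step version, as in the cell above
by_groupby = toy.groupby('genre').size().sort_values(ascending=False)

# One-call version
by_value_counts = toy['genre'].value_counts()

print(by_value_counts)
```

Both approaches produce the same counts; the order of genres with equal counts may differ between them.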


Now do the same with the Shelbyville data.

Group the `shel_general` table by genre and find the number of songs played for each genre. Then, sort the result in descending order and save it in the `shel_genres` table:

In [38]:
# Group the `shel_general` table by the `'genre'` column,
# count the rows in each group with `size()`,
# sort the resulting Series in descending order, and save it in `shel_genres`.
shel_genres = shel_general.groupby('genre').size()
shel_genres = shel_genres.sort_values(ascending=False)

Print the first 10 rows of `shel_genres`:

In [39]:
# Printing the first 10 rows of `shel_genres`
print(shel_genres.head(10))

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
dtype: int64


**Conclusion**

The hypothesis has been partially confirmed:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be just as popular in Shelbyville as in Springfield, and rap was not among the top 5 most popular genres in either city.
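
Since the two cities differ in size, "equally popular" is best read in relative terms. The following sketch divides each city's pop count by its total number of plays, using figures copied from the outputs above; the dictionary names are illustrative.

```python
# Total plays per city and pop plays per city, copied from the outputs above
total_plays = {'Springfield': 42741, 'Shelbyville': 18512}
pop_plays = {'Springfield': 5892, 'Shelbyville': 2431}

# Share of all plays that are pop, per city
pop_share = {city: pop_plays[city] / total_plays[city] for city in total_plays}

for city, share in pop_share.items():
    print(f'{city}: {share:.1%}')
```

Pop accounts for about 13.8% of plays in Springfield and about 13.1% in Shelbyville, so the two shares are within one percentage point of each other.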

[Back to Contents](#back)

# Conclusions <a id='end'></a>

We tested the following three hypotheses:

1. User activity varies depending on the day of the week and between different cities.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. The same applies to Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. Springfield prefers pop music, while Shelbyville has more rap enthusiasts.

After analyzing the data, we conclude:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the pattern differs between the two cities.

The first hypothesis is fully accepted.

2. Musical preferences do not vary significantly over the course of the week in Springfield and Shelbyville. We observe minor differences on Mondays, but pop is the most listened-to genre in both cities.

Therefore, we cannot accept this hypothesis. We should also consider that the result might have been different if it weren't for the missing values.

3. It turns out that the musical preferences of users in Springfield and Shelbyville are quite similar.

The third hypothesis is rejected. If there are any differences in preferences, they are not observable in the data.

[Back to Contents](#back)