# Yandex Music

The comparison between Moscow and Saint Petersburg is surrounded by myths. For example:
* Moscow is a metropolis, governed by the strict rhythm of the working week;
* Saint Petersburg is the cultural capital, with its own tastes.

Using Yandex Music data, we'll compare user behavior in the two capitals.

**Research goal** — to test three hypotheses:
1. User activity depends on the day of the week. Moreover, this manifests differently in Moscow and Saint Petersburg.
2. On Monday morning, certain music genres prevail in Moscow while others prevail in Saint Petersburg. Similarly, different genres prevail on Friday evenings depending on the city.
3. Moscow and Saint Petersburg prefer different music genres. In Moscow, people listen to pop music more often, while in Saint Petersburg, Russian rap is more popular.

**Research Process**
We will obtain user behavior data from the `yandex_music_project.csv` file. The quality of the data is unknown. Therefore, before testing the hypotheses, we'll need a data overview.

We will check the data for errors and assess their impact on the research. Then, during the preprocessing stage, we'll look for ways to fix the most critical data errors.

Thus, the research will proceed in three stages:
1. Data overview.
2. Data preprocessing.
3. Hypothesis testing.


## Data Overview

Let's form a first impression of the Yandex Music data.



**Task 1**

In [1]:
import pandas as pd # import the pandas library

**Task 2**

In [2]:
df = pd.read_csv('/datasets/yandex_music_project.csv') # reading the data file and saving it to df

**Task 3**

In [3]:
df.head(10) # getting the first 10 rows of the df table

|   | userID | Track | artist | genre | City | time | Day |
|---|--------|-------|--------|-------|------|------|-----|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Saint-Petersburg | 21:20:49 | Wednesday |


**Task 4**

In [4]:
df.info() # getting general information about the data in the df table

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The number of values in the columns differs. This means there are missing values in the data.

**Task 5**

The column names contain three style violations:
- Lowercase letters are mixed with uppercase.
- There are spaces.

Find the third style violation.

In [5]:
# Write your answer here as a comment. Do not delete the # symbol. Do not change the type of this cell to Markdown.
# The first column name, userID, is written in camelCase; it should be renamed in snake_case as user_id.

**Conclusions**

Each row in the table contains data about a listened track. Some columns describe the composition itself: the title, the artist, and the genre. Other data provides information about the user: which city they are from and when they listened to the music.

Preliminarily, there appears to be enough data to test the hypotheses. However, the data contains missing values, and the column names do not follow good style.

To move forward, it is necessary to address the issues in the data.

## Data Preprocessing

### Renaming Columns

**Task 6**

In [6]:
df.columns # the list of column names in the dataframe df

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

**Task 7**

In [7]:
df = df.rename( # renaming columns
    columns = {
        "  userID": "user_id",
        "Track": "track",
        "  City  ": "city",
        "Day": "day"
    }
)

**Task 8**

In [8]:
df.columns # checking the results - list of column names

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
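The manual mapping works well for seven columns. As an alternative, the names could be normalized programmatically. The following is a minimal sketch, not the method used in this project: the helper `to_snake_case` and the empty `demo` table are hypothetical, and the regular expression only handles the patterns seen in this dataset (surrounding spaces, a capitalized first letter, and a lowercase-to-uppercase boundary as in `userID`).

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding spaces and convert a camelCase name to snake_case."""
    name = name.strip()
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower().replace(' ', '_')


# Hypothetical empty table with the same raw column names as the dataset
demo = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
demo.columns = [to_snake_case(column) for column in demo.columns]
print(list(demo.columns))
```

For a small, fixed set of columns an explicit dictionary is easier to review, so this approach is mainly useful when a table has many columns.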

### Handling missing values

**Task 9**

In [9]:
# counting missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64
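Absolute counts are easier to judge as shares of the table. The sketch below recomputes the percentages from the numbers printed above (65079 rows according to `df.info()`); the variable names are illustrative. On the real table, `df.isna().mean()` would give the same fractions directly.

```python
import pandas as pd

total_rows = 65079  # number of entries reported by df.info()

# Missing-value counts copied from the output of df.isna().sum()
missing_counts = pd.Series({'track': 1231, 'artist': 7203, 'genre': 1198})

# Share of missing values in each column, in percent
missing_share = (missing_counts / total_rows * 100).round(2)
print(missing_share)
```

The artist column has by far the largest share of gaps, while track and genre are each missing in under 2% of rows.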

**Task 10**

In [10]:
# replacing missing values with ‘unknown’
columns_to_replace = ['track','artist','genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

**Task 11**

In [11]:
# checking for the absence of missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Processing duplicates

**Task 12**

In [12]:
# Counting explicit duplicates
df.duplicated().sum()

3826

**Task 13**

In [13]:
# Removing explicit duplicates, creating new indices, and dropping the old ones
df = df.drop_duplicates().reset_index(drop=True)

**Task 14**

In [14]:
# Checking for the absence of explicit duplicates
df.duplicated().sum()

0
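To make the behavior of `duplicated()` and `drop_duplicates()` concrete, here is a small sketch on invented data (the `toy` table is hypothetical). A row counts as an explicit duplicate only if every column matches an earlier row.

```python
import pandas as pd

# Hypothetical listens: row 2 repeats row 0 exactly, row 4 differs from row 1 by one second
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'A1', 'C3', 'B2'],
    'track': ['Soul People', 'True', 'Soul People', 'Pessimist', 'True'],
    'time': ['08:34:34', '13:00:07', '08:34:34', '21:20:49', '13:00:08'],
})

# duplicated() marks every repeat of an earlier row; the first occurrence is not marked
n_duplicates = toy.duplicated().sum()

# Drop the repeats and rebuild a continuous index without keeping the old one
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

Without `reset_index(drop=True)`, the index of the result would keep a gap where the removed row used to be.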

**Task 15**

In [15]:
# Viewing unique sorted genre names
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Task 16**

In [16]:
# eliminating implicit duplicates: map the variant spellings in the genre column to 'hiphop'
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')


**Task 17**

In [17]:
# checking for the absence of implicit duplicates
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

The preprocessing revealed three issues in the data:

- violations in the header style,
- missing values,
- duplicates — both explicit and implicit.

We corrected the headers to simplify working with the table. Removing duplicates will make the analysis more accurate.

Missing values were replaced with 'unknown'. It remains to be seen whether the missing values in the genre column will affect the analysis.
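One way to gauge the possible impact is to measure how often the placeholder appears in each city. This is a minimal sketch on invented data (the `toy` table and the `is_unknown` column are hypothetical), not a result from the real dataset.

```python
import pandas as pd

# Hypothetical listens after missing genres were replaced with 'unknown'
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'unknown', 'dance', 'pop'],
})

# The mean of a boolean column is the fraction of True values
unknown_share = (
    toy.assign(is_unknown=toy['genre'] == 'unknown')
       .groupby('city')['is_unknown']
       .mean()
)
print(unknown_share)
```

If the share differs noticeably between the cities, comparisons of genre rankings should be read with more caution.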

Now we can proceed to hypothesis testing.

## Hypothesis Testing

### Comparison of user behavior in two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. We will test this assumption using data from three weekdays: Monday, Wednesday, and Friday. To do this, we will:

- Separate the users from Moscow and St. Petersburg.
- Compare the number of tracks listened to by each group of users on Monday, Wednesday, and Friday.

**Task 18**



In [25]:
# counting listens in each city
city_track_count = df.groupby('city')['genre'].count()
city_track_count

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64

**Task 19**


In [26]:
# Counting listens on each of the three days
day_track_count = df.groupby('day')['genre'].count()
day_track_count

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

**Task 20**

In [27]:
def number_tracks(day, city):
    # Select only the rows of df whose day column equals the day argument
    track_list = df[df['day'] == day]
    # Of those, keep only the rows whose city column equals the city argument
    track_list = track_list[track_list['city'] == city]
    # Count the values in the user_id column, i.e. the number of listens
    track_list_count = track_list['user_id'].count()
    # Return the number of listens for the given day and city
    return track_list_count

**Task 21**

In [28]:
# the number of listens in Moscow on Mondays
number_tracks('Monday', 'Moscow')

15740

In [29]:
# the number of listens in St. Petersburg on Mondays
number_tracks('Monday', 'Saint-Petersburg')

5614

In [30]:
# the number of listens in Moscow on Wednesdays
number_tracks('Wednesday', 'Moscow')

11056

In [31]:
# the number of listens in St. Petersburg on Wednesdays
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [32]:
# the number of listens in Moscow on Fridays
number_tracks('Friday', 'Moscow')

15945

In [33]:
# the number of listens in Saint Petersburg on Fridays
number_tracks('Friday', 'Saint-Petersburg')

5895

**Task 22**

In [35]:
# creating a table with the results
info = pd.DataFrame(
    data=[['Moscow',15740,11056,15945],['Saint-Petersburg', 5614, 7003,5895]],
    columns=['city', 'monday', 'wednesday', 'friday']
)
# displaying the table on the screen
info

|   | city | monday | wednesday | friday |
|---|------|--------|-----------|--------|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
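The table above is assembled by hand from six function calls. A comparable table can be produced in one step with `pd.crosstab`. The sketch below uses invented rows (the `toy` table is hypothetical) purely to show the shape of the result.

```python
import pandas as pd

# Hypothetical listens with the same city and day columns as the project table
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg', 'Moscow'],
    'day': ['Monday', 'Friday', 'Wednesday', 'Monday', 'Monday', 'Wednesday'],
})

# Count rows for every (city, day) pair, then put the days in calendar order
day_order = ['Monday', 'Wednesday', 'Friday']
listens = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)
print(listens)
```

This avoids copying numbers between cells, which removes one source of transcription errors.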


**Conclusions**

The data shows differences in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable drop on Wednesday.
- In St. Petersburg, by contrast, the most listens occur on Wednesday. Activity on Monday and Friday is lower than on Wednesday and roughly equal between the two days.

Thus, the data supports the first hypothesis.
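Because Moscow has more than twice as many listens as St. Petersburg, the raw counts are easier to compare as shares within each city. The sketch below reuses the counts from the results table; the name `shares` is illustrative.

```python
import pandas as pd

# Counts copied from the results table, indexed by city
info = pd.DataFrame(
    data=[['Moscow', 15740, 11056, 15945], ['Saint-Petersburg', 5614, 7003, 5895]],
    columns=['city', 'monday', 'wednesday', 'friday'],
).set_index('city')

# Divide each row by its own total to get the share of listens per day, in percent
shares = info.div(info.sum(axis=1), axis=0).mul(100).round(1)
print(shares)
```

In these shares, Wednesday is the weakest of the three days in Moscow and the strongest in St. Petersburg, which matches the pattern seen in the raw counts.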

### Music at the beginning and end of the week

According to the second hypothesis, certain genres prevail in the morning on Mondays in Moscow, while different genres dominate in St. Petersburg. Similarly, different genres are predominant on Friday evenings, depending on the city.

**Task 23**

In [36]:
# creating the moscow_general table from the rows of the df table where the value in the city column is equal to ‘Moscow’
moscow_general = df[df['city'] == 'Moscow']

In [37]:
# creating the spb_general table from the rows of the df table where the value in the city column is equal to ‘Saint-Petersburg’
spb_general = df[df['city'] == 'Saint-Petersburg']

**Task 24**

In [69]:
def genre_weekday(df, day, time1, time2):
    # sequential filtering
    # keep only the rows of `df` where the day is equal to `day`
    genre_df = df[df['day'] == day]
    # keep only the rows of genre_df where the time is earlier than time2
    genre_df = genre_df[genre_df['time'] < time2]
    # keep only the rows of genre_df where the time is later than time1
    genre_df = genre_df[genre_df['time'] > time1]
    # group the filtered DataFrame by genre and count the rows for each genre
    # (user_id has no missing values, so count() on it gives the number of listens)
    genre_df_grouped = genre_df.groupby('genre')['user_id'].count()
    # sort in descending order so that the most popular genres come first
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    # return a Series with the 10 most popular genres in the given time range on the given day
    return genre_df_sorted[:10]
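The time filter compares strings, not time objects. This works because the `time` column holds zero-padded `HH:MM:SS` strings, for which character-by-character order coincides with chronological order. A small sketch with invented times (the `times` Series is hypothetical) shows which values fall in a `'07:00'` to `'11:00'` window.

```python
import pandas as pd

# Hypothetical listening times stored as zero-padded 'HH:MM:SS' strings
times = pd.Series(['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00', '20:28:33'])

# Strings are compared character by character; a string is greater than its own prefix,
# so '07:00:00' > '07:00' is True while '11:00:00' < '11:00' is False
in_window = times[(times > '07:00') & (times < '11:00')]
print(in_window.tolist())
# → ['07:00:00', '08:34:34', '10:59:59']
```

The comparison would break if times were stored without leading zeros (for example `'7:00:00'`), so this shortcut relies on the format seen in the data overview.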

**Task 25**

In [71]:
# call the function for Monday morning in Moscow (instead of df, use the table moscow_general)
genre_weekday(moscow_general, "Monday","07:00","11:00")
                    

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64

In [72]:
# call the function for Monday morning in Saint Petersburg (instead of df, use the table spb_general)
genre_weekday(spb_general, "Monday","07:00","11:00")

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64

In [73]:
# call the function for Friday evening in Moscow
genre_weekday(moscow_general, "Friday","17:00","23:00")

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64

In [75]:
# call the function for Friday evening in Saint Petersburg
genre_weekday(spb_general, "Friday","17:00","23:00")

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64

**Conclusions**

Comparing the top 10 genres on Monday morning, we can draw the following conclusions:

1.	Moscow and St. Petersburg listeners enjoy similar music. The only difference is that the Moscow ranking includes the genre “world,” while the St. Petersburg ranking features jazz and classical music.
2.	In Moscow, there were so many missing values that the value 'unknown' took the tenth place among the most popular genres. This indicates that missing values constitute a significant portion of the data and threaten the reliability of the research.
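The comparison of the two Monday-morning rankings can be checked mechanically with set operations. The lists below are copied from the outputs above; with the real results, the `.index` of each returned Series could be used instead.

```python
# Top-10 genres on Monday morning, copied from the function outputs
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# Genres that appear in only one of the two rankings
only_moscow = sorted(set(moscow_monday) - set(spb_monday))
only_spb = sorted(set(spb_monday) - set(moscow_monday))
# Genres that both rankings share
shared = sorted(set(moscow_monday) & set(spb_monday))

print(only_moscow)  # → ['unknown', 'world']
print(only_spb)     # → ['classical', 'jazz']
print(len(shared))  # → 8
```

Eight of the ten genres coincide; apart from the 'unknown' placeholder, the Moscow list differs only by "world" and the St. Petersburg list by jazz and classical.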

Friday evening does not change this picture. Some genres rise slightly, while others drop, but overall, the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:

- Users listen to similar music at the beginning and end of the week.
- The difference between Moscow and St. Petersburg is not very pronounced. Russian pop music is more popular in Moscow, while jazz is favored in St. Petersburg.

However, the missing data raises doubts about this result. In Moscow, there are so many missing values that the top 10 ranking could look different if it weren’t for the lost genre data.

### Genre Preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, where this genre is listened to more frequently than in Moscow. Meanwhile, Moscow is a city of contrasts, yet pop music dominates there.

**Task 26**

In [76]:
# in one line: grouping the moscow_general table by the 'genre' column, selecting the genre column,
# counting the number of 'genre' values using the count() method, and saving it to moscow_genres
moscow_genres = moscow_general.groupby('genre')['genre'].count()
# sorting the resulting Series in descending order and saving it back to moscow_genres
moscow_genres = moscow_genres.sort_values(ascending = False)

**Task 27**

In [77]:
# viewing the first 10 rows of moscow_genres
moscow_genres[:10]

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

**Task 28**

In [78]:
# in one line: grouping the spb_general table by the 'genre' column, selecting the genre column,
# counting the number of ‘genre’ values using the count() method, and saving it to spb_genres.
spb_genres = spb_general.groupby('genre')['genre'].count()
# sorting the resulting Series in descending order and saving it back to spb_genres
spb_genres = spb_genres.sort_values(ascending = False)

**Task 29**

In [79]:
# viewing the first 10 rows of spb_genres
spb_genres[:10]

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
- Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a closely related genre—Russian pop music—also appears in the top 10 genres.
- Contrary to expectations, rap is equally popular in both Moscow and St. Petersburg.
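Since the cities differ in total listens, the rap comparison is clearer in relative terms. The sketch below divides a few genre counts from the two top-10 outputs by the per-city totals from Task 18 (42741 and 18512); the table name `counts` and the choice of genres are illustrative.

```python
import pandas as pd

# Total listens per city, from the groupby in Task 18
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Selected genre counts copied from the two top-10 outputs
counts = pd.DataFrame({
    'Moscow': {'pop': 5892, 'hiphop': 2096, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'hiphop': 960, 'rusrap': 564},
})

# Dividing by a Series aligns on column names, giving each genre's share per city in percent
shares = (counts / totals * 100).round(2)
print(shares)
```

Russian rap accounts for roughly 2.7% of listens in Moscow and 3.0% in St. Petersburg, a small gap that is in line with the conclusion that the genre is about equally popular in both cities.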

## Research findings

We tested three hypotheses and established the following:

1.	The day of the week influences user activity differently in Moscow and St. Petersburg.
The first hypothesis was fully confirmed.

2.	Musical preferences do not change significantly throughout the week—whether in Moscow or St. Petersburg. Slight differences are noticeable at the beginning of the week, on Mondays:
- In Moscow, users listen to “world” music,
- In St. Petersburg, they prefer jazz and classical music.  
Thus, the second hypothesis was only partially confirmed. This result could have been different if there were no gaps in the data.  

3.	There is more commonality than difference in the tastes of users from Moscow and St. Petersburg. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.
The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable in the majority of users.

**In practice, research involves testing statistical hypotheses.**
Data from a sample of one service's users cannot support conclusions about all of the service's users without statistical methods.
Statistical hypothesis testing shows how reliable such conclusions are given the available data.
You will become familiar with hypothesis testing methods in the following topics.
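As a preview of what such a test might look like, the sketch below computes a chi-square statistic for independence between city and day of the week from the counts obtained for the first hypothesis. This is an illustration under strong assumptions, not part of the project: the test treats every listen as an independent observation, which is unlikely here because one user contributes many listens. The statistic is computed by hand with NumPy, and 5.991 is the standard critical value for 2 degrees of freedom at the 0.05 significance level.

```python
import numpy as np

# Listens by city (rows) and day (columns: Monday, Wednesday, Friday)
observed = np.array([
    [15740, 11056, 15945],  # Moscow
    [5614, 7003, 5895],     # Saint-Petersburg
])

# Counts expected if the day distribution were the same in both cities
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic and degrees of freedom for a contingency table
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

critical_value = 5.991  # chi-square critical value for dof = 2 at the 0.05 level
print(dof, chi2 > critical_value)
```

A statistic above the critical value would argue against identical day-of-week distributions in the two cities, subject to the independence caveat above.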