Research: Music of Big Cities

The Yandex.Music service provides data on music listening by users in Moscow and Saint Petersburg.

Task: Compare the genres and features of music listening among the residents of the two cities.

# Table of Contents

* [Stage 1. Data Acquisition](#stage-1-data-acquisition)
    * [Importing Libraries](#importing-libraries)
* [Stage 2. Data Preprocessing](#stage-2-data-preprocessing)
* [Stage 3. Data Analysis](#stage-3-data-analysis)
    * [Do people in different cities really listen to music differently?](#do-people-in-different-cities-really-listen-to-music-differently)
    * [Is the music on Monday morning and Friday evening different, or is it the same?](#is-the-music-on-monday-morning-and-friday-evening-different-or-is-it-the-same)
    * [Moscow and St. Petersburg: two different capitals, two different directions in music. Is that true?](#moscow-and-st-petersburgtwo-different-capitals-two-different-directions-in-music-is-that-true)
* [Stage 4. Research Results](#stage-4-research-results)

# Stage 1. Data Acquisition

We will examine the data provided by the service for the project.

## Importing Libraries

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/datasets/music_project.csv')

In [3]:
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Saint-Petersburg | 21:20:49 | Wednesday |


General information about the data in the *df* table.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


Let's examine the obtained information in more detail.

There are a total of 7 columns in the table, and each column has a data type of `object`.

The columns in *df* contain the following information:

* `userID` — user identifier;
* `Track` — track title;
* `artist` — artist name;
* `genre` — genre name;
* `City` — the city where the listening took place;
* `time` — the time when the user listened to the track;
* `Day` — day of the week.

The non-null counts differ between columns, which means the data contain missing values.

**Conclusion**

Each row of the table describes one listening event: a track of a certain genre by a specific artist, which a user played in one of the cities at a specific time and on a specific day of the week. Two problems need to be addressed: missing values and inconsistent column names. For testing the working hypotheses, the columns *time*, *Day*, and *City* are particularly valuable. Data from the *genre* column will help identify the most popular genres.

# Stage 2. Data Preprocessing

We will eliminate missing values, rename columns, and also check the data for duplicates.

First, we obtain the list of column names. What problem can be seen in them, besides those already mentioned?

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Two of the column names contain leading or trailing spaces, and the names mix upper and lower case. Both issues make the columns awkward to access.
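As a minimal sketch of how the whitespace and case problems could be handled programmatically, the snippet below strips and lowercases the names of a small toy frame. The frame `raw` and the variable `cleaned` are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Toy frame that reproduces the header quirks seen in df.columns.
raw = pd.DataFrame(
    [["FFB692EC", "Kamigata To Boots", "rock", "Saint-Petersburg"]],
    columns=["  userID", "Track", "genre", "  City  "],
)

# Strip surrounding whitespace and lowercase every column name.
cleaned = raw.rename(columns=lambda name: name.strip().lower())

print(list(cleaned.columns))
```

This only normalizes what is already there. Descriptive names such as `user_id` or `track_name` still require an explicit renaming step.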

Let's rename the columns for easier use in future work. Check the result.

In [6]:
# Assign the new names directly; the inplace argument of set_axis() was removed in pandas 2.0.
df.columns = ['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time', 'weekday']

In [7]:
df.columns

Index(['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time',
       'weekday'],
      dtype='object')

Let's count the missing values in each column.

In [8]:
df.isnull().sum()

user_id           0
track_name     1231
artist_name    7203
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

Missing values indicate that for some tracks, not all information is available. The reasons could vary: for instance, a specific performer of a folk song might not be named. Worse, there could be issues with data recording. Each case needs to be examined individually to identify the cause.

Replace the missing values in the track name and artist columns with the string 'unknown'. After this operation, confirm that these two columns no longer contain missing values.

In [9]:
df['track_name'] = df['track_name'].fillna('unknown')

In [10]:
df['artist_name'] = df['artist_name'].fillna('unknown')

In [11]:
df.isnull().sum()

user_id           0
track_name        0
artist_name       0
genre_name     1198
city              0
time              0
weekday           0
dtype: int64
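The two fills above could also be written as a single call by passing a dictionary to `fillna()`. The sketch below uses a toy frame (`toy` and `filled` are names chosen for illustration) and leaves the genre column untouched on purpose.

```python
import pandas as pd

# Toy data with gaps in all three descriptive columns.
toy = pd.DataFrame({
    "track_name": ["Soul People", None],
    "artist_name": [None, "Space Echo"],
    "genre_name": ["dance", None],
})

# Fill only the listed columns; genre_name keeps its missing value.
filled = toy.fillna({"track_name": "unknown", "artist_name": "unknown"})

print(filled.isnull().sum())
```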

Remove empty values in the genre column; ensure there are no more left.

In [12]:
df.dropna(subset=['genre_name'], inplace=True)

In [13]:
df.isnull().sum()

user_id        0
track_name     0
artist_name    0
genre_name     0
city           0
time           0
weekday        0
dtype: int64

It is necessary to check for duplicates. If any are found, remove them, and verify that all have been removed.

In [14]:
df.duplicated().sum()

3755

In [15]:
df = df.drop_duplicates().reset_index(drop=True)

In [16]:
df.duplicated().sum()

0

Duplicates may have appeared because of a data recording error. It is worth investigating where such 'information noise' comes from.
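One way to start looking into the cause is to display the duplicated rows before dropping them. The following is a sketch on invented toy data, not on the project table: `keep=False` marks every member of a duplicate group, whereas the default marks only the repeats.

```python
import pandas as pd

# Toy data: rows 0, 1 and 3 are identical, row 2 differs by user.
toy = pd.DataFrame({
    "user_id": ["A1", "A1", "B2", "A1"],
    "track_name": ["True", "True", "True", "True"],
    "time": ["13:00:07", "13:00:07", "13:00:07", "13:00:07"],
})

# All rows that belong to a duplicate group, originals included.
dupes = toy[toy.duplicated(keep=False)]

# Number of rows that drop_duplicates() would remove.
n_repeats = int(toy.duplicated().sum())

print(dupes)
print(n_repeats)
```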

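Save the list of unique values from the genre column in the variable genres_list.

Declare a function find_genre() to search for implicit duplicates in the genre column, that is, cases where the name of the same genre is written in different ways. Because genres_list contains unique values, the function returns 1 if the given spelling occurs in the data and 0 otherwise.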

In [17]:
genres_list = df['genre_name'].unique()

In [18]:
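def find_genre(genre_name):
    """Count how many entries of genres_list equal genre_name exactly."""
    counter = 0

    # genres_list holds unique values, so the result is either 0 or 1.
    for genre in genres_list:
        if genre == genre_name:
            counter += 1

    return counter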

Call the *find_genre()* function to search for different variants of the genre name 'hip-hop' in the table.

The correct name is *hiphop*. Let's search for other variants:

* hip
* hop
* hip-hop


In [19]:
find_genre('hip')

1

In [20]:
find_genre('hop')

0

In [21]:
find_genre('hip-hop')

0
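Checking spellings one at a time only finds the variants one thinks of in advance. As a rough alternative sketch, the candidates can be listed in one pass with a substring match on the distinct genre names. The series `genres` below is toy data and the pattern `"hip|hop"` is an assumption about which fragments are worth searching for.

```python
import pandas as pd

# Toy genre column with two misspellings of 'hiphop'.
genres = pd.Series(["hiphop", "hip", "pop", "rock", "hiphop", "hip-hop", "ruspop"])

# Count each distinct spelling.
counts = genres.value_counts()

# Keep every genre whose name contains 'hip' or 'hop' as a substring.
candidates = counts[counts.index.str.contains("hip|hop")]

print(candidates)
```

The match is deliberately broad, so the resulting list still needs a manual look before anything is replaced.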

Declare a function find_hip_hop(), which replaces the incorrect names of this genre in the 'genre_name' column with 'hiphop' and verifies the success of the replacement.

This will correct all the spelling variants identified during the check.

In [22]:
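def find_hip_hop(df, wrong):
    # Replace the misspelled variant with the canonical name 'hiphop' (modifies df).
    df['genre_name'] = df['genre_name'].replace(wrong, 'hiphop')
    # Count the rows that still carry the wrong name; 0 means the replacement worked.
    return df['genre_name'][df['genre_name'] == wrong].count()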

In [23]:
find_hip_hop(df, 'hip')

0
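If several wrong spellings had been found, the same idea could be extended to a list of variants. The helper `replace_wrong_genres()` below is hypothetical and shown on toy data; it relies on `Series.replace()` accepting a list of values to replace.

```python
import pandas as pd


def replace_wrong_genres(df, wrong_genres, correct_genre):
    """Hypothetical helper: replace every listed variant and count leftovers."""
    df["genre_name"] = df["genre_name"].replace(wrong_genres, correct_genre)
    # Number of rows that still contain one of the wrong variants.
    return int(df["genre_name"].isin(wrong_genres).sum())


toy = pd.DataFrame({"genre_name": ["hip", "hiphop", "hip-hop", "pop"]})
remaining = replace_wrong_genres(toy, ["hip", "hop", "hip-hop"], "hiphop")

print(remaining)
print(toy["genre_name"].tolist())
```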

Get general information about the data. Ensure that the cleaning was successful.

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60126 entries, 0 to 60125
Data columns (total 7 columns):
user_id        60126 non-null object
track_name     60126 non-null object
artist_name    60126 non-null object
genre_name     60126 non-null object
city           60126 non-null object
time           60126 non-null object
weekday        60126 non-null object
dtypes: object(7)
memory usage: 3.2+ MB


**Conclusion**

During preprocessing, the data revealed not only missing values and problematic column names but also explicit and implicit duplicates. Removing them makes the analysis more accurate. Because genre information is essential for the analysis, only rows with a missing genre were dropped; missing artist names and track titles were filled with 'unknown' so that those rows could be kept. The column names are now consistent and convenient for further work.

# Stage 3. Data Analysis

## Do people in different cities really listen to music differently?

A hypothesis was proposed that users in Moscow and St. Petersburg listen to music differently. We will test this assumption using data from three days of the week—Monday, Wednesday, and Friday.

For each city, determine the number of tracks listened to on these days with a known genre, and compare the results.

Group the data by city and use the count() method to count the tracks for which the genre is known.

In [25]:
df.groupby('city')['genre_name'].count()

city
Moscow              41892
Saint-Petersburg    18234
Name: genre_name, dtype: int64

Moscow has more listens than St. Petersburg, but that does not mean Moscow users are more active. Yandex.Music has more users in Moscow overall, so a larger total is expected there; the raw counts reflect the size of the audience rather than per-user activity.
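To separate audience size from activity, the number of listens could be divided by the number of distinct users in each city. The sketch below uses invented toy data and assumes that `user_id` identifies one user consistently; the column names `listens`, `users`, and `listens_per_user` are chosen for illustration.

```python
import pandas as pd

# Toy data: 3 Moscow users with 6 listens, 2 St. Petersburg users with 3 listens.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3", "u4", "u4", "u5"],
    "city": ["Moscow"] * 6 + ["Saint-Petersburg"] * 3,
})

# Listens and distinct users per city.
per_city = toy.groupby("city").agg(
    listens=("user_id", "size"),
    users=("user_id", "nunique"),
)
per_city["listens_per_user"] = per_city["listens"] / per_city["users"]

print(per_city)
```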

Group the data by the day of the week and count the tracks listened to on Monday, Wednesday, and Friday, for which the genre is known.

In [26]:
df.groupby('weekday')['genre_name'].count()

weekday
Friday       21482
Monday       20866
Wednesday    17778
Name: genre_name, dtype: int64

Across both cities combined, Monday and Friday have more listens than Wednesday. One possible explanation is that users are busier with work midweek, but the data alone do not show the cause.

Create a function number_tracks() that takes a table, day of the week, and city name as parameters, and returns the number of tracks listened to, for which the genre is known. Check the number of tracks listened to for each city on Monday, then on Wednesday and Friday.

In [27]:
def number_tracks(df, day, city):
    track_list = df[(df['weekday'] == day) & (df['city'] == city)]
    track_list_count = track_list['genre_name'].count()
    return track_list_count
    

In [28]:
number_tracks(df, 'Monday', 'Moscow')

15347

In [29]:
number_tracks(df, 'Monday', 'Saint-Petersburg')

5519

In [30]:
number_tracks(df, 'Wednesday', 'Moscow')

10865

In [31]:
number_tracks(df, 'Wednesday', 'Saint-Petersburg')

6913

In [32]:
number_tracks(df, 'Friday', 'Moscow')

15680

In [33]:
number_tracks(df, 'Friday', 'Saint-Petersburg')

5802

Combine the obtained information into a single table where the column names are ['city', 'monday', 'wednesday', 'friday'].


In [34]:
table = pd.DataFrame(data=[
    ['Moscow', number_tracks(df, 'Monday', 'Moscow'), number_tracks(df, 'Wednesday', 'Moscow'), number_tracks(df, 'Friday', 'Moscow')],
    ['Saint-Petersburg', number_tracks(df, 'Monday', 'Saint-Petersburg'), number_tracks(df, 'Wednesday', 'Saint-Petersburg'), number_tracks(df, 'Friday', 'Saint-Petersburg')]
], columns=['city', 'monday', 'wednesday', 'friday'])

In [35]:
table

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15347 | 10865 | 15680 |
| 1 | Saint-Petersburg | 5519 | 6913 | 5802 |
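The same kind of city-by-day table can also be built in one step with `pd.crosstab()`, without calling a helper six times. This is a sketch on toy data rather than the approach used in the notebook; `reindex()` is used only to put the days in calendar order and to fill absent combinations with zero.

```python
import pandas as pd

# Toy listening log with only the two columns needed for the table.
toy = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Saint-Petersburg", "Moscow", "Saint-Petersburg"],
    "weekday": ["Monday", "Friday", "Wednesday", "Monday", "Wednesday"],
})

# Count rows for every (city, weekday) combination.
table = pd.crosstab(toy["city"], toy["weekday"])

# Calendar order instead of alphabetical order; missing days become 0.
table = table.reindex(columns=["Monday", "Wednesday", "Friday"], fill_value=0)

print(table)
```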


**Conclusion**

The results show that the two cities follow opposite weekly patterns. In Moscow, the number of listens peaks on Monday and Friday and drops on Wednesday. In St. Petersburg, Wednesday is the busiest day, while Monday and Friday are lower and close to each other (5519 and 5802 listens).
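Because the cities differ so much in total volume, the pattern is easier to compare when each day is expressed as a share of that city's listens over the three days. The sketch below re-enters the counts from the table above by hand; `counts` and `shares` are names chosen for illustration.

```python
import pandas as pd

# Counts copied from the combined table.
counts = pd.DataFrame(
    {
        "monday": [15347, 5519],
        "wednesday": [10865, 6913],
        "friday": [15680, 5802],
    },
    index=["Moscow", "Saint-Petersburg"],
)

# Divide every row by its own total so that each row sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares.round(3))
```

On these numbers, Wednesday is the smallest share for Moscow and the largest for St. Petersburg, which matches the conclusion drawn from the raw counts.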

## Is the music on Monday morning and Friday evening different, or is it the same?

We are looking for the answer to the question of which genres dominate in different cities on Monday morning and Friday evening. There is a hypothesis that on Monday morning, users listen to more energizing music (e.g., pop), while on Friday evening, they tend to listen to more dance music (e.g., electronic).

Retrieve the data tables for Moscow as moscow_general and for St. Petersburg as spb_general.

In [36]:
moscow_general = df[df['city'] == 'Moscow']

In [37]:
spb_general = df[df['city'] == 'Saint-Petersburg']


Create a function genre_weekday() that returns the ten most listened-to genres for the requested day of the week and time interval, given as a start time and an end time.

In [38]:
def genre_weekday(df, day, time1, time2):
    genre_list = df[(df['weekday'] == day) & (df['time'] > time1) & (df['time'] < time2)]
    genre_list_sorted = genre_list.groupby('genre_name')['genre_name'].count().sort_values(ascending=False).head(10)
    return genre_list_sorted
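Note that `genre_weekday()` compares times as strings. This works because the values are zero-padded `HH:MM:SS`, so lexicographic order coincides with chronological order, and the strict inequalities exclude listens exactly at the boundary times. The sketch below checks this on a few toy values by comparing the string filter with the same filter on parsed durations; the variable names are illustrative.

```python
import pandas as pd

times = pd.Series(["06:59:59", "07:00:00", "08:34:34", "11:00:00", "20:28:33"])

# Lexicographic comparison on zero-padded strings, as in genre_weekday().
as_text = (times > "07:00:00") & (times < "11:00:00")

# The same filter on durations since midnight.
parsed = pd.to_timedelta(times)
as_time = (parsed > pd.Timedelta(hours=7)) & (parsed < pd.Timedelta(hours=11))

print(as_text.tolist())
print(as_text.equals(as_time))
```

If the time column ever contained values without leading zeros, such as `7:05:00`, the string comparison would no longer be reliable and parsing would be the safer choice.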

Compare the results obtained from the table for Moscow and St. Petersburg on Monday morning (from 7 to 11) and Friday evening (from 17 to 23).

In [39]:
genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00')

genre_name
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
classical      157
Name: genre_name, dtype: int64

In [40]:
genre_weekday(spb_general, 'Monday', '07:00:00', '11:00:00')

genre_name
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre_name, dtype: int64

In [41]:
genre_weekday(moscow_general, 'Friday', '17:00:00', '23:00:00')

genre_name
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre_name, dtype: int64

In [42]:
genre_weekday(spb_general, 'Friday', '17:00:00', '23:00:00')

genre_name
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre_name, dtype: int64

The popular genres on Monday morning turned out to be similar in Moscow and St. Petersburg: as expected, pop leads in both cities. The only difference is at the bottom of the top 10: *jazz* makes the list in St. Petersburg, while *world* makes it in Moscow.

By Friday evening, the picture is much the same. Pop still holds the top spot, and the lists again differ by a single genre: *world* now appears in both cities, *jazz* remains specific to St. Petersburg, and *ruspop* stays in the top 10 only in Moscow.
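The top-10 lists printed above can be compared mechanically with set differences, which avoids misreading the long outputs. In this sketch the genre names are copied by hand from the four results; the variable names are chosen for illustration.

```python
# Top-10 genres copied from the genre_weekday() outputs above.
moscow_monday = ["pop", "dance", "electronic", "rock", "hiphop",
                 "ruspop", "world", "rusrap", "alternative", "classical"]
spb_monday = ["pop", "dance", "rock", "electronic", "hiphop",
              "ruspop", "alternative", "rusrap", "jazz", "classical"]
moscow_friday = ["pop", "rock", "dance", "electronic", "hiphop",
                 "world", "ruspop", "alternative", "classical", "rusrap"]
spb_friday = ["pop", "rock", "electronic", "dance", "hiphop",
              "alternative", "jazz", "classical", "rusrap", "world"]

# Genres present in one city's top 10 but not in the other's.
monday_only_moscow = set(moscow_monday) - set(spb_monday)
monday_only_spb = set(spb_monday) - set(moscow_monday)
friday_only_moscow = set(moscow_friday) - set(spb_friday)
friday_only_spb = set(spb_friday) - set(moscow_friday)

print(monday_only_moscow, monday_only_spb)
print(friday_only_moscow, friday_only_spb)
```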

**Conclusion**

Pop is the undisputed leader, and the top five genres are the same in both cities, although their order varies slightly. The bottom of the list is more variable: *jazz* appears only in St. Petersburg, while *world* and *ruspop* enter or leave a city's top 10 depending on the day and time.

## Moscow and St. Petersburg—two different capitals, two different directions in music. Is that true?

Hypothesis: St. Petersburg is rich in rap culture, so this genre is listened to more frequently there, while Moscow is a city of contrasts, but the majority of users tend to listen to pop music.

Group the *moscow_general* table by genre, count the number of tracks in each genre using the *count()* method, sort in descending order, and save the result in the *moscow_genres* table.

View the first 10 rows of this new table.

In [43]:
moscow_genres = moscow_general.groupby('genre_name')['genre_name'].count().sort_values(ascending=False)

In [44]:
moscow_genres.head(10)

genre_name
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre_name, dtype: int64

Group the *spb_general* table by genre, count the number of tracks in each genre using the *count()* method, sort in descending order, and save the result in the *spb_genres* table.

View the first 10 rows of this table. Now you can compare the two cities.

In [45]:
spb_genres = spb_general.groupby('genre_name')['genre_name'].count().sort_values(ascending=False)

In [46]:
spb_genres.head(10)

genre_name
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre_name, dtype: int64

**Conclusion**

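The top five genres are the same in both cities and appear in the same order: pop, dance, rock, electronic, hiphop. Russian pop (*ruspop*) ranks ninth in both cities, so the rankings do not show a distinctive Moscow preference for it. Contrary to the hypothesis, rap holds similar positions in both cities: *hiphop* is fifth in each, and *rusrap* is tenth in Moscow and eighth in St. Petersburg.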
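Ranks alone hide how large the differences are, so it can help to express selected genres as a share of each city's listens. The sketch below re-enters counts from the outputs above by hand and uses the per-city totals from the earlier `groupby('city')` result (41892 and 18234); `counts`, `city_totals`, and `shares` are illustrative names.

```python
import pandas as pd

# Per-city totals from the earlier groupby('city') output.
city_totals = pd.Series({"Moscow": 41892, "Saint-Petersburg": 18234})

# Selected genre counts copied from moscow_genres and spb_genres.
counts = pd.DataFrame(
    {
        "Moscow": [5892, 2096, 1372, 1161],
        "Saint-Petersburg": [2431, 960, 538, 564],
    },
    index=["pop", "hiphop", "ruspop", "rusrap"],
)

# Divide each column by the total number of listens in that city.
shares = counts.div(city_totals, axis=1)

print((shares * 100).round(2))
```

The shares differ between the cities by well under one percentage point for each of these genres, which is consistent with the conclusion that the cities have similar tastes.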

# Stage 4. Research Results


Working Hypotheses:

* Music is listened to differently in Moscow and St. Petersburg.

* The top ten most popular genres on Monday morning and Friday evening have distinct differences.

* The populations of the two cities prefer different music genres.

**General Results**

Moscow and St. Petersburg share similar tastes, with pop music being dominant in both cities. There is no significant variation in preferences depending on the day of the week within each city—people consistently listen to what they enjoy. However, between the cities, there is a mirrored pattern in terms of listening behavior during the week: Moscow listens more on Monday and Friday, while St. Petersburg listens more on Wednesday and less on Monday and Friday.

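As a result, the first hypothesis is confirmed. The second is at most partially supported, since the top of the list is stable and only the bottom of the top 10 changes between Monday morning and Friday evening. The third hypothesis is not confirmed.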