# Data Analysis of Yandex.Music Service: Comparing Users from Two Cities

**Research Objective** - Test three hypotheses:
1. User activity varies based on the day of the week, and this difference is manifested differently in Moscow and St. Petersburg.
2. On Monday mornings, certain genres prevail in Moscow, while different ones prevail in St. Petersburg. Similarly, on Friday evenings, different genres dominate based on the city.
3. Moscow and St. Petersburg have distinct genre preferences. Pop music is more frequently listened to in Moscow, while Russian rap is more popular in St. Petersburg.

The research will be conducted in three **stages:**
1. Data overview
2. Data preprocessing
3. Hypothesis testing

## Data overview

In [1]:
# Import the pandas library
import pandas as pd  

In [2]:
# Read the data file and store it in the variable df
df = pd.read_csv('/datasets/yandex_music_project.csv')

In [3]:
# Get the first 10 rows of the df table
df.head(10)  

|   | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой |  | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist |  | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [4]:
# Get general information about the data in the df table
df.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, the table consists of seven columns. The data type in all columns is `object`.

According to the documentation for the data:
* `userID` - user identifier;
* `Track` - track name;
* `artist` - artist's name;
* `genre` - genre name;
* `City` - user's city;
* `time` - start time of playback;
* `Day` - day of the week.

The number of values in the columns varies, which indicates that there are missing values in the data.

Each row in the table contains data about a played track. Some columns describe the composition itself: the track's name, artist, and genre. The other data provides information about the user: their city and the time of music playback.

At first glance, there is enough data to test the hypotheses. However, some values are missing, and the column names are styled inconsistently: some mix upper and lower case, and some contain leading or trailing spaces.

To proceed further, it's necessary to address the data issues.

## Data preprocessing

### Header Style

In [5]:
# List of column names in the df table
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
# Renaming Columns
df = df.rename(columns={
    '  userID': 'user_id',
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day'
}) 
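As an alternative to listing every column by hand, the renaming could be done programmatically. The sketch below is an illustration only: the helper `to_snake_case` is a hypothetical name, not part of the original notebook, and it assumes that stripping whitespace and splitting camelCase is all the cleanup the headers need.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding spaces and convert a camelCase header to snake_case."""
    name = name.strip()
    # Put an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# Empty frame with the same headers as the raw table
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# rename() accepts a function and applies it to every column label
renamed = raw.rename(columns=to_snake_case)
print(list(renamed.columns))
```

With these headers, the result matches the manual mapping used in the cell above.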

### Missing Values

In [7]:
# Counting missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the research. For example, missing values in the `track` and `artist` columns are not important for this analysis. It is sufficient to replace them with explicit labels.

However, missing values in the `genre` column could hinder the comparison of musical preferences between Moscow and St. Petersburg. In practice, it would be best to determine the reasons for these missing values and attempt to restore the data. Unfortunately, in an educational project, we might not have this opportunity. We'll need to:

1. Fill these missing values with explicit labels.
2. Evaluate how much they might affect the calculations.

In [8]:
# List of columns where values need to be replaced
columns_to_replace = ['track', 'artist', 'genre'] 

# Loop through columns and replace missing values with 'unknown'
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')  

In [9]:
# Counting missing values 
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates

In [10]:
# Counting explicit duplicates
df.duplicated().sum()  

3826

In [11]:
# Removing explicit duplicates
df = df.drop_duplicates()

# Checking for the absence of duplicates
df.duplicated().sum()  

0

In [12]:
# Viewing unique genre names
print(df['genre'].sort_values(ascending=True).unique())  


['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 

In [13]:
# List of incorrect names
duplicates = ['hip', 'hop', 'hip-hop']  

# Correct name
correct_name = 'hiphop'  

df['genre'] = df['genre'].replace(duplicates, correct_name)

# Checking for implicit duplicates
print(df['genre'].sort_values(ascending=True).unique())  

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 
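Scanning a long list of genre names by eye is error-prone. As a rough aid, candidate pairs could be flagged with the standard library's `difflib`. This is a heuristic sketch rather than the method used in this notebook: the function name `similar_names`, the sample list, and the `cutoff` value are all assumptions, and short variants such as `hip` fall below this cutoff, so manual review is still needed.

```python
from difflib import get_close_matches


def similar_names(names, cutoff=0.8):
    """Map each name to other names whose similarity ratio is at least `cutoff`."""
    candidates = {}
    for name in names:
        others = [n for n in names if n != name]
        matches = get_close_matches(name, others, n=5, cutoff=cutoff)
        if matches:
            candidates[name] = matches
    return candidates


# Small illustrative sample, not the full genre list
sample = ['hip', 'hiphop', 'hip-hop', 'hop', 'pop', 'rock', 'ruspop', 'rusrap']
candidates = similar_names(sample)
print(candidates)
```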

Data preprocessing has identified three issues in the data:

1. Inconsistent header style.
2. Missing values.
3. Explicit and implicit duplicates.

We have rectified the header style to streamline working with the table. Without duplicates, the research will become more accurate.

We replaced missing values with 'unknown'. The impact of missing values in the 'genre' column on the research still needs to be assessed.
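One simple way to start that assessment is to compute the share of rows labeled `'unknown'` in each city. The sketch below uses a small made-up table (`plays` is a stand-in for the cleaned `df`), so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for the cleaned table; values are invented for illustration
plays = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'pop', 'unknown'],
})

# The mean of a boolean column is the fraction of True values
unknown_share = plays['genre'].eq('unknown').groupby(plays['city']).mean()
print(unknown_share)
```

If the share differs noticeably between the cities, genre comparisons between them should be treated with more caution.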

Now we can proceed with hypothesis testing.

## Hypothesis testing

### Comparing User Behavior in Two Capitals

The first hypothesis states that user activity depends on the day of the week, and that this dependence differs between Moscow and St. Petersburg. Let's test this assumption based on data for three days of the week: Monday, Wednesday, and Friday. To do this:

1. Separate users from Moscow and St. Petersburg.
2. Compare the number of tracks listened to by each group of users on Monday, Wednesday, and Friday.

In [14]:
# Counting listens in each city
print(df.groupby('city')['time'].count())  

city
Moscow              42741
Saint-Petersburg    18512
Name: time, dtype: int64


There are more listens in Moscow than in St. Petersburg. However, this doesn't necessarily mean that Moscow users listen to music more frequently. A more likely explanation is that there are simply more users in Moscow, although the totals alone cannot confirm this.
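To separate the effect of audience size from listening frequency, the number of listens could be divided by the number of unique users in each city. The following is a minimal sketch on invented data; `plays` and its values are assumptions for illustration, not results from the real table.

```python
import pandas as pd

# Toy stand-in for the cleaned table; values are invented for illustration
plays = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'D4', 'C3', 'C3', 'C3'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
})

# Count listens and distinct users per city, then take their ratio
per_city = plays.groupby('city')['user_id'].agg(listens='count', users='nunique')
per_city['listens_per_user'] = per_city['listens'] / per_city['users']
print(per_city)
```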

In [15]:
# Counting listens for each of the three days
print(df.groupby('day')['time'].count())  

day
Friday       21840
Monday       21354
Wednesday    18059
Name: time, dtype: int64


Across both cities combined, there are fewer listens on Wednesday than on Monday or Friday. However, the picture might change if we examine each city separately.

In [16]:
# Function for counting listens for a specific city and day.
# By applying sequential filtering with logical indexing, 
# this function first extracts rows with the desired day from the original table. 
# Then, it further filters rows based on the desired city. 
# Using the `count()` method, it calculates the number of values in the `user_id` column. 
# The function returns this count as a result:

def number_tracks(day, city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [17]:
# Number of listens in Moscow on Mondays
number_tracks('Monday', 'Moscow')  

15740

In [18]:
# Number of listens in St. Petersburg on Mondays
number_tracks('Monday', 'Saint-Petersburg')  

5614

In [19]:
# Number of listens in Moscow on Wednesdays
number_tracks('Wednesday', 'Moscow')  

11056

In [20]:
# Number of listens in St. Petersburg on Wednesdays
number_tracks('Wednesday', 'Saint-Petersburg')  

7003

In [21]:
# Number of listens in Moscow on Fridays
number_tracks('Friday', 'Moscow') 

15945

In [22]:
# Number of listens in St. Petersburg on Fridays
number_tracks('Friday', 'Saint-Petersburg')  

5895

In [23]:
# Table with results
info = pd.DataFrame(data=[
    ['Moscow', 15740, 11056, 15945],
    ['Saint-Petersburg', 5614, 7003, 5895]
], columns=['city', 'monday', 'wednesday', 'friday']) 
print(info)

               city  monday  wednesday  friday
0            Moscow   15740      11056   15945
1  Saint-Petersburg    5614       7003    5895
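The same city-by-day table could also be built in one step with `pd.crosstab`, which avoids calling a helper function six times and typing the results by hand. The sketch below runs on a small invented table (`plays` is a stand-in for `df`), so its counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for the cleaned table; values are invented for illustration
plays = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Monday', 'Friday', 'Wednesday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are the number of listens
counts = pd.crosstab(plays['city'], plays['day'])

# Put the days in calendar order instead of alphabetical order
counts = counts[['Monday', 'Wednesday', 'Friday']]
print(counts)
```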


The data reveal differences in user behavior:

1) In Moscow, listens peak on Monday and Friday, with a noticeable decline on Wednesday.

2) In St. Petersburg, on the contrary, more music is listened to on Wednesday, while activity on Monday and Friday is lower and nearly equal.

This **supports the first hypothesis.**
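Because Moscow has far more listens overall, the pattern is easier to compare when each city's counts are converted to shares of that city's total. The sketch below reuses the six counts reported in this section; only the variable name `shares` is new.

```python
import pandas as pd

# Counts taken from the results table in this section
info = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total so that every row sums to 1
shares = info.div(info.sum(axis=1), axis=0)
print(shares.round(3))
```

In share terms, Wednesday accounts for roughly 26% of the listens in Moscow but roughly 38% in St. Petersburg, which is consistent with the conclusion above.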

### Music at the Beginning and End of the Week

According to the second hypothesis, different genres prevail on Monday mornings in Moscow than in St. Petersburg. Similarly, on Friday evenings, different genres dominate depending on the city.

In [24]:
# Creating the moscow_general table from rows in the df table
# where the value in the 'city' column is 'Moscow'
moscow_general = df[df['city'] == 'Moscow']

# Creating the spb_general table from rows in the df table
# where the value in the 'city' column is 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']  

In [25]:
# This function filters the DataFrame based on the specified day and time range, 
# groups the filtered DataFrame by genre, counts the occurrences of each genre, 
# sorts the results in descending order, and returns the top 10 genres for the 
# specified time range on the specified day.

def genre_weekday(df, day, time1, time2):
    genre_df = df[df['day'] == day]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df_grouped = genre_df.groupby(['genre'])['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    return genre_df_sorted[:10]
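Note that the function compares times as strings. This works because the `time` values are zero-padded `HH:MM:SS` strings, which sort in chronological order. Since `'07:00'` is a prefix of `'07:00:00'`, a listen at exactly 07:00:00 passes the `> '07:00'` check, while a listen at exactly 11:00:00 fails the `< '11:00'` check. An equivalent, more compact version could combine the three filters into one mask. In the sketch below, the name `top_genres` and the sample rows are assumptions for illustration.

```python
import pandas as pd


def top_genres(plays, day, start, end, n=10):
    """Return the n most frequent genres for a day and a time window."""
    # One combined boolean mask instead of three successive filters
    mask = (plays['day'] == day) & (plays['time'] > start) & (plays['time'] < end)
    # value_counts() counts each genre and sorts in descending order
    return plays.loc[mask, 'genre'].value_counts().head(n)


# Toy stand-in for a city table; values are invented for illustration
plays = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:34:34', '08:37:09', '06:59:59', '09:15:00', '08:00:00'],
    'genre': ['dance', 'folk', 'rock', 'dance', 'pop'],
})

result = top_genres(plays, 'Monday', '07:00', '11:00')
print(result)
```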

In [26]:
# Calling the function for Monday morning in Moscow
print(genre_weekday(moscow_general, 'Monday', '07:00', '11:00'))

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64


In [27]:
# Calling the function for Monday morning in St. Petersburg
print(genre_weekday(spb_general, 'Monday', '07:00', '11:00'))

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64


In [28]:
# Calling the function for Friday evening in Moscow
print(genre_weekday(moscow_general, 'Friday', '17:00', '23:00'))

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64


In [29]:
# Calling the function for Friday evening in St. Petersburg
print(genre_weekday(spb_general, 'Friday', '17:00', '23:00'))

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64


**Conclusions**

Comparing the top 10 genres on Monday morning, the following conclusions can be drawn:

1. Moscow and St. Petersburg listen to similar music. The only difference is that the "world" genre is included in the Moscow ranking, while jazz and classical music are included in the St. Petersburg ranking.

2. In Moscow, there are so many missing values that 'unknown' took tenth place among the most popular genres. This indicates that missing values constitute a significant portion of the data and threaten the reliability of the research.

3. Friday evening does not alter this picture significantly. Some genres rise slightly, others drop, and one entry changes in each city (in Moscow, 'classical' replaces 'unknown'; in St. Petersburg, 'world' replaces 'ruspop'), but the top 10 remains largely the same.

Thus, the **second hypothesis is only partially confirmed:**

- Users listen to similar music at the beginning and end of the week.
- The difference between Moscow and St. Petersburg is not very pronounced. Moscow users listen to more Russian pop music, while jazz is more popular in St. Petersburg.

However, the presence of missing values casts doubt on this result. In Moscow, there are so many missing values that the top 10 ranking could have looked different if genre data were not lost.

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, where rap is listened to more often than in Moscow. Moscow, in turn, is a city of contrasts, but pop music nevertheless predominates there.

In [30]:
# This code groups the `moscow_general` table by the 'genre' column, 
# counts the occurrences of each genre, sorts them in descending order 
# (highest count first), and stores the result in the `moscow_genres` table.

moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [31]:
# Viewing the first 10 rows of moscow_genres
print(moscow_genres.head(10))  

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64


In [32]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

# Viewing the top 10 genres in St. Petersburg
print(spb_genres.head(10))  

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64


**The hypothesis has been partially confirmed:**

- Pop music is the most popular genre in Moscow, as the hypothesis suggested. Furthermore, a closely related genre, Russian pop music, is also present in the top 10 genres.
- Contrary to expectations, rap is about equally popular in both cities: hiphop ranks fifth in each, and rusrap sits near the bottom of both top 10 lists (tenth in Moscow, eighth in St. Petersburg).
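Since the two cities differ in total volume, raw counts are easier to compare as shares of each city's listens. The sketch below uses only numbers printed earlier in this notebook (the per-city totals and the pop, hiphop, and rusrap counts); the variable names are new.

```python
import pandas as pd

# Total listens per city, from the per-city count earlier in the notebook
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Genre counts from the two top 10 lists above
counts = pd.DataFrame(
    {'Moscow': [5892, 2096, 1161], 'Saint-Petersburg': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# Divide each city's column by that city's total number of listens
shares = counts.div(totals, axis='columns')
print((shares * 100).round(1))
```

The rusrap share comes out at about 2.7% in Moscow and about 3.0% in St. Petersburg, so the difference is small.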

## Research Summary

In this study, we explored various aspects of user behavior and musical preferences in Moscow and St. Petersburg. By analyzing the data and testing hypotheses, we gained valuable insights into the following key findings:

1. **Day-of-the-Week Impact**: The first hypothesis was confirmed. User activity varies by day of the week, and the two cities show opposite patterns: Moscow peaks on Monday and Friday, while St. Petersburg peaks on Wednesday. The data do not show what causes this difference.

2. **Consistency in Musical Preferences**: The second hypothesis was only partially confirmed. Musical preferences remain largely consistent between Monday morning and Friday evening. Slight differences were observed on Monday morning, when "world" music entered the top 10 in Moscow and jazz and classical entered it in St. Petersburg, but the overall patterns were similar between the two cities.

3. **Shared Musical Tastes**: Contrary to initial expectations, testing the third hypothesis showed that the genre preferences of Moscow and St. Petersburg users are more similar than different. Minor variations exist, but they are not large enough to draw clear distinctions between the preferences of most users in the two cities.

In conclusion, the two cities differ mainly in when users listen, not in what they listen to. The genre comparisons should be read with caution, because missing genre values were frequent enough to reach the Moscow top 10 on Monday morning.