# Yandex.Music

The comparison of Moscow and St. Petersburg is always surrounded by myths. For example:
 * Moscow is a metropolis with a rigid workweek rhythm;
 * St. Petersburg is a cultural capital with its own tastes.

Using the Yandex.Music dataset, we will compare the behavior of users in the two capitals.

**Objective**: test three hypotheses.
1. User activity depends on the day of the week. Moreover, this manifests itself in different ways in Moscow and St. Petersburg.
2. On Monday mornings, certain genres dominate in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city. 
3. Moscow and St. Petersburg prefer different genres of music. In Moscow people listen to pop music more often, in St. Petersburg to Russian rap.

**Tasks**

There is no information about the quality of the dataset. Therefore, we'll do some EDA before hypothesis testing. 

We'll check our data for errors and assess their impact on the research. During data preprocessing, we'll correct the most critical data errors. 
 
So there are 3 stages of our research:
 1. Data verification.
 2. Data pre-processing.
 3. Hypothesis testing.



## Data overview

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/yandex_music_project.csv')

In [3]:
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


There are 7 columns in the dataset. Every column has the `object` data type.

According to the dataset documentation:
* `userID` - unique id of a user;
* `Track` - track name;
* `artist` - artist name;
* `genre` - genre name;
* `City` - city of the user;
* `time` - start time of listening;
* `Day` - day of the week.

There are 3 style violations in the column headings:
1. Lowercase letters are combined with uppercase letters.
2. There are white spaces in the column headers.
3. Snake case is not used in column names.
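All three violations can be fixed by a single normalization rule. The sketch below is one possible approach, not the method used later in this notebook: the helper `to_snake_case` is a hypothetical name, and the toy frame only mimics headers with the same kinds of problems.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding whitespace, split camelCase words, and lowercase."""
    stripped = name.strip()
    # Put an underscore between a lowercase letter and the uppercase letter after it.
    separated = re.sub(r'([a-z])([A-Z])', r'\1_\2', stripped)
    return separated.lower()


# Toy, empty frame whose headers have padding, mixed case, and no snake case.
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# `rename` accepts a function and applies it to every column label.
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

A rule like this only handles simple camelCase names; headers with other separators or digits would need extra cases.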



The non-null counts show that the `Track`, `artist`, and `genre` columns contain missing values.

**Conclusions**

* **Each row contains information about one track that a user listened to. Some columns describe the track: its name, artist, and genre. Other columns describe the user: their city and the time when they listened to the music.**

* **Preliminarily, it can be argued that there is enough data to test the hypotheses. But there are missing values in the data and bad style in the column names.**

## Data pre-processing

### Column names correction

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
df = df.rename(columns={'  userID': 'user_id',
                       'Track': 'track',
                        '  City  ': 'city',
                        'Day': 'day'
                       })

In [7]:
# check
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

### Missing values

In [8]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the research. Missing values in the `track` and `artist` columns are not important, so it is enough to replace them with explicit values.

But missing values in the `genre` column can interfere with the comparison of musical tastes in Moscow and St. Petersburg. The correct approach would be to find out why the values are missing and restore the data, but that is not possible in this training project. So we have to:
* fill nans with the explicit values,
* evaluate how much they will influence the calculations. 
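A first rough estimate of that influence is the share of rows with a missing genre, overall and per city. The sketch below illustrates the idea on a hypothetical toy sample (`sample` is not the real dataset); the same expressions could be applied to `df` before the missing values are filled.

```python
import pandas as pd

# Hypothetical toy sample; the real data lives in `df`.
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', None, 'rock', 'jazz', None],
})

# isna() gives True/False per row; the mean of booleans is the share of True.
missing = sample['genre'].isna()
overall_share = missing.mean()
share_by_city = missing.groupby(sample['city']).mean()

print(overall_share)
print(share_by_city)
```

In the full dataset, 1198 of 65079 rows lack a genre, which is a little under 2% overall; the per-city and per-time-slot shares may differ.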

We'll replace nans in the `track`, `artist` and `genre` columns with the string `unknown`.

In [9]:
# for loop to replace nans with unknown string
columns_to_replace = ['track', 'artist', 'genre'] # list of columns with nans

for column in columns_to_replace: # for loop
    df[column] = df[column].fillna('unknown') # replacing nans

In [10]:
# missing values check
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Dealing with duplicates

In [11]:
# evaluating duplicates
df.duplicated().sum()

3826

In [12]:
# deleting duplicates and resetting the index
df = df.drop_duplicates().reset_index(drop=True)

In [13]:
# duplicates check
df.duplicated().sum()

0

Next, we'll look for implicit duplicates in the `genre` column. For example, the name of the same genre may be spelled in slightly different ways. Such errors would also distort the research results.

Let's look at a list of unique genre names in alphabetical order. To do this:
* extract the column we want,
* call a method that returns its unique values,
* sort those values alphabetically.

In [14]:
# unique genres
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'alternativepunk',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'author',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'chanson',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',


There are some implicit duplicates for the hip-hop genre:
* *hip*,
* *hop*,
* *hip-hop*.

To clean our dataset of these duplicates, we'll create a function that takes 2 arguments: 
* `wrong_genres` - list of duplicate genres,
* `correct_genre` - string with a correct name.

In [15]:
# duplicates replacement function
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [16]:
# replacing duplicates
duplicates = ['hip', 'hop', 'hip-hop']

correct_genre = 'hiphop'

replace_wrong_genres(duplicates, correct_genre)
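As a side note, `Series.replace` also accepts a list of values to replace, so the loop can be avoided. The variant below is only a sketch: `unify_genres` is a hypothetical helper that returns a new Series instead of modifying the global `df`, and it is shown on a toy Series.

```python
import pandas as pd


def unify_genres(genres: pd.Series, wrong_genres: list, correct_genre: str) -> pd.Series:
    """Return a copy of `genres` with every name in `wrong_genres` replaced."""
    # replace() matches whole values, so 'hip' does not touch 'hiphop'.
    return genres.replace(wrong_genres, correct_genre)


sample = pd.Series(['hip', 'rock', 'hop', 'hip-hop', 'hiphop'])
unified = unify_genres(sample, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(unified.tolist())
```

Returning a new Series makes the function easier to test, at the cost of an explicit assignment such as `df['genre'] = unify_genres(df['genre'], duplicates, correct_genre)`.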

In [17]:
# check
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'alternativepunk',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'author',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'chanson',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',


**Conclusions**

* **The preprocessing of the data revealed three problems in our dataset:**

    - header style irregularities,
    - missing values,
    - duplicates - explicit and implicit.

* **We have corrected the headers to make the table easier to work with. Removing duplicates will make the research more accurate. Missing values have been replaced by an `unknown` string. It remains to be seen if the missing values in the `genre` column will harm the research.**

* **Now we can move on to hypothesis testing.**

## Hypothesis testing

### Comparison of user behavior between two capitals

The first hypothesis is that users in Moscow and St. Petersburg listen to music differently. We'll test this hypothesis using data from three days of the week - Monday, Wednesday, and Friday:

* we'll separate Moscow and St. Petersburg users
* compare how many tracks each group of users listened to on Monday, Wednesday, and Friday.

In [18]:
# counting track listenings in each city
df.groupby('city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more listens recorded in Moscow than in St. Petersburg. This doesn't mean that Moscow users listen to music more often; there are simply more users in Moscow.

Now we'll group our data by day of the week, counting listens on Monday, Wednesday, and Friday. Note that the data only contains information for these days.

In [19]:
# counting track listenings for each of three days
df.groupby('day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

On average, users from both cities are less active on Wednesdays. But the picture of user behavior can change if we look at each city separately.

Let's create a function that combines the two calculations. It will count the tracks listened to for a given day and city. There will be two arguments:
* day of the week,
* city name.

In [20]:
# counting track listenings function
def number_tracks(day, city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [21]:
# number of tracks listenings in Moscow on Mondays
number_tracks('Monday', 'Moscow')

15740

In [22]:
# number of tracks listenings in St. Petersburg on Mondays
number_tracks('Monday', 'Saint-Petersburg')

5614

In [23]:
# number of tracks listenings in Moscow on Wednesdays
number_tracks('Wednesday', 'Moscow')

11056

In [24]:
# number of tracks listenings in St. Petersburg on Wednesdays
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [25]:
# number of tracks listenings in Moscow on Fridays
number_tracks('Friday', 'Moscow')

15945

In [26]:
# number of tracks listenings in St. Petersburg on Fridays
number_tracks('Friday', 'Saint-Petersburg')

5895

In [27]:
# creating dataframe with results
columns = ['city', 'monday', 'wednesday', 'friday'] # columns names

data = [['Moscow', 15740, 11056, 15945],
        ['Saint-Petersburg', 5614, 7003, 5895]] # number of listenings

total = pd.DataFrame(data=data, columns=columns)

display(total)

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
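A city-by-day table like this one can also be built in a single step, without typing the counts by hand. The sketch below uses `pd.crosstab` on a hypothetical toy sample with the same `city` and `day` columns; it illustrates the technique and is not the code that produced the table above.

```python
import pandas as pd

# Hypothetical toy sample with the same column names as `df`.
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

days = ['Monday', 'Wednesday', 'Friday']

# crosstab counts rows for every (city, day) pair; reindex puts the days in weekday order.
counts = pd.crosstab(sample['city'], sample['day']).reindex(columns=days, fill_value=0)
print(counts)
```

Applied to the real data, `pd.crosstab(df['city'], df['day'])` should give the same six counts as the six `number_tracks` calls.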


**Conclusions**

The data shows the difference in listening patterns:

- In Moscow, listening peaks on Mondays and Fridays, with a noticeable drop on Wednesdays.
- In St. Petersburg, on the other hand, more music is listened to on Wednesdays. Activity on Mondays and Fridays is lower than on Wednesdays and roughly equal between the two days.

So the data supports the first hypothesis.
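Because Moscow has many more listens overall, raw counts are hard to compare across cities. One way to make the pattern easier to see, sketched here as an illustration rather than as part of the original analysis, is to convert each city's counts into shares of that city's total.

```python
import pandas as pd

# Counts copied from the results table above.
total = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide every row by its own sum so that each city's shares add up to 1.
shares = total.div(total.sum(axis=1), axis=0)
print(shares.round(3))
```

With these counts, Wednesday accounts for roughly 26% of the listens in Moscow and roughly 38% in St. Petersburg, which matches the opposite weekly rhythms described above.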

### Difference between beginning and end of week

According to the second hypothesis, Monday mornings in Moscow are dominated by different genres than in St. Petersburg. Similarly, Friday evenings are dominated by different genres depending on the city.

In [28]:
# getting rows for Moscow
moscow_general = df[df['city'] == 'Moscow']

In [29]:
# getting rows for St. Petersburg
spb_general = df[df['city'] == 'Saint-Petersburg']

We'll create a function that returns the top 10 genres listened to on a given day between two timestamps. The function will take four arguments:
* the DataFrame containing the data,
* the day of the week,
* the start timestamp in `'hh:mm'` format,
* the end timestamp in `'hh:mm'` format.

In [30]:
# counting top-10 genres function
def genre_weekday(table, day, time1, time2):
    genre_df = table[table['day'] == day]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted.head(10)
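The function compares times as strings. This works only because the `time` values are zero-padded `'hh:mm:ss'` strings, for which alphabetical order coincides with chronological order. The sketch below shows a more compact variant of the same filter on a hypothetical toy sample; `top_genres` is an illustrative name, not part of the notebook.

```python
import pandas as pd


def top_genres(table: pd.DataFrame, day: str, time1: str, time2: str, n: int = 10) -> pd.Series:
    """Count listens per genre for one day, strictly between time1 and time2."""
    # String comparison is chronological only for zero-padded 'hh:mm:ss' values.
    mask = (table['day'] == day) & (table['time'] > time1) & (table['time'] < time2)
    # value_counts() counts each genre and sorts the result in descending order.
    return table.loc[mask, 'genre'].value_counts().head(n)


sample = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:15:00', '09:30:10', '12:00:00', '08:00:00'],
    'genre': ['pop', 'pop', 'rock', 'jazz'],
})

result = top_genres(sample, 'Monday', '07:00', '11:00')
print(result)
```

If the times were stored without leading zeros (for example `'8:15:00'`), the string comparison would give wrong results and the column would have to be converted to a time type first.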

Let's check the difference between Moscow and St. Petersburg for Monday morning (07:00-11:00) and Friday evening (17:00-23:00).

In [31]:
# genres for Moscow on Monday mornings
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [32]:
# genres for St. Petersburg on Monday mornings
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [33]:
# genres for Moscow on Friday evenings
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [34]:
# genres for St. Petersburg on Friday evenings
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw some conclusions:

1. People in Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow top 10 includes the "world" genre, while the St. Petersburg top 10 includes jazz and classical music.

2. In Moscow there were so many missing values that the "unknown" genre took 10th place among the most popular genres. This means that the missing values represent a significant proportion of the data and threaten the credibility of the research.

Friday nights don't change the picture. Some genres go up a bit, others go down, but overall the top 10 remains the same.

So the second hypothesis is only partially confirmed:
* Users listen to similar music at the beginning and end of the week.
* The difference between Moscow and St. Petersburg is not so great. In Moscow people listen more often to Russian popular music, in St. Petersburg to jazz.

However, the missing values cast doubt on this result. There are so many of them in Moscow that the top 10 could look different if there were data without nans.
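The overlap between the two Monday-morning rankings can be checked with plain set operations. The sketch below copies the genre names from the outputs above; it is an illustration of the comparison, not part of the original analysis.

```python
# Top-10 genres on Monday morning, copied from the outputs above.
moscow_top = {'pop', 'dance', 'electronic', 'rock', 'hiphop',
              'ruspop', 'world', 'rusrap', 'alternative', 'unknown'}
spb_top = {'pop', 'dance', 'rock', 'electronic', 'hiphop',
           'ruspop', 'alternative', 'rusrap', 'jazz', 'classical'}

only_moscow = sorted(moscow_top - spb_top)   # genres present only in Moscow's top 10
only_spb = sorted(spb_top - moscow_top)      # genres present only in St. Petersburg's top 10
shared = sorted(moscow_top & spb_top)        # genres present in both rankings

print(only_moscow)  # ['unknown', 'world']
print(only_spb)     # ['classical', 'jazz']
print(len(shared))  # 8
```

Eight of the ten genres coincide, and one of Moscow's two distinct entries is the `unknown` placeholder rather than a real genre.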

### Moscow and St. Petersburg genre preferences

Hypothesis: St. Petersburg is the Russian capital of rap, and this genre is heard more often in this city than in Moscow. And Moscow is a city of contrasts, where pop music still predominates.

We'll group the `moscow_general` table by genre and count the listens of each genre using the `count()` method. Then we'll sort the results in descending order.

In [35]:
# counting genres for Moscow
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [36]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [37]:
# counting genres for St. Petersburg
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [38]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partly confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also contains a closely related genre, Russian pop (`ruspop`).
* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg.
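Since the two cities differ a lot in the total number of listens, the comparison is easier to read as shares. The sketch below is a small worked calculation using the counts printed earlier in this notebook; it is an illustration, not part of the original analysis.

```python
# Counts copied from the outputs above (after duplicates were removed).
listens = {'Moscow': 42741, 'Saint-Petersburg': 18512}
pop = {'Moscow': 5892, 'Saint-Petersburg': 2431}
rusrap = {'Moscow': 1161, 'Saint-Petersburg': 564}

# Share of each genre in the city's total number of listens.
pop_share = {city: pop[city] / n for city, n in listens.items()}
rusrap_share = {city: rusrap[city] / n for city, n in listens.items()}

for city in listens:
    print(f"{city}: pop {pop_share[city]:.1%}, rusrap {rusrap_share[city]:.1%}")
```

Pop makes up roughly 13 to 14% of listens in both cities, and Russian rap roughly 3% in both, so the difference between the cities is a fraction of a percentage point.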

## Summary

We tested three hypotheses and concluded that

1. The day of the week has a different effect on user activity in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week in either Moscow or St. Petersburg. There are slight differences at the beginning of the week, on Mondays:
* in Moscow they listen to "world" music,
* in St. Petersburg they listen to jazz and classical music.

Thus, the second hypothesis was only partially confirmed. The result could have been different if there were no missing values in the data.

3. The tastes of Moscow and St. Petersburg users have more similarities than differences. Contrary to expectations, genre preferences in St. Petersburg were similar to those in Moscow.

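The third hypothesis was only partially confirmed: pop does lead in Moscow, but Russian rap is no more popular in St. Petersburg than in Moscow. If there are differences in preferences, they are not noticeable for the majority of users.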