# Study of Yandex.Music

Comparisons of Moscow and St. Petersburg are surrounded by myths. For example:
 * Moscow is a metropolis subject to the rigid rhythm of the working week;
 * St. Petersburg is a cultural capital, with its own tastes.

Using Yandex.Music data, we will compare the behavior of users in the two capitals.

**The purpose of the study** is to test three hypotheses:
1. User activity depends on the day of the week, and this dependence looks different in Moscow and St. Petersburg.
2. On Monday morning, certain genres prevail in Moscow, while others prevail in St. Petersburg. Likewise, Friday evenings are dominated by different genres, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow, people listen to pop music more often; in St. Petersburg, to Russian rap.

**Research Progress**

We will load the user behavior data from the `yandex_music_project.csv` file. Nothing is known yet about the quality of the data, so we will review it before testing the hypotheses.

We will check the data for errors and assess their impact on the study. Then, at the preprocessing stage, we will correct the most critical errors.
 
Thus, the study will take place in three stages:
 1. Data overview.
 2. Data preprocessing.
 3. Hypothesis testing.




## Data Overview

Let's take a first look at the Yandex.Music data.




Our main tool will be the `pandas` library. Let's import it.

In [1]:
import pandas as pd  # importing pandas library

We will read the file `yandex_music_project.csv` and save it as `df`:

In [2]:
df = pd.read_csv('yandex_music_project.csv')  # reading csv file and saving it to df

Let's display the first 10 rows of the dataframe:

In [3]:
df.head(10) # displaying first 10 df rows

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


Let's also get the general information on the dataframe:

In [4]:
df.info()  # getting general info on the df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The dataframe has seven columns, and the data type of every column is `object`.

According to the data documentation:
* `userID` - user ID;
* `Track` - track name;
* `artist` - artist name;
* `genre` - genre name;
* `City` - user's city;
* `time` - time when the playback started;
* `Day` - day of the week.

The column headers violate good style in three ways:
1. Lowercase letters are mixed with uppercase ones.
2. Some names contain leading or trailing spaces.
3. Compound names are not written in snake_case.


The non-null counts differ between columns, which means the data contains missing values.


**Conclusions**

Each row of the dataframe describes one playback of a track. Some columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

Preliminarily, there appears to be enough data to test the hypotheses. However, the data contains missing values, and the column names do not follow the lowercase snake_case naming style that PEP 8 recommends for Python names.

To move forward, we need to fix these problems in the data.


## Data Preprocessing
We will correct the style of the column names and fill in the missing values. Then we will check the data for duplicates.

### Column header style
Let's display the column names:

In [5]:
df.columns  # getting list of the df columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

We will bring the column headers in line with good style:
* write multi-word names in snake_case,
* make all characters lowercase,
* remove spaces.

To do this, let's rename the columns like this:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [6]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})  # renaming columns

Let's check the result by displaying the column headers once more:

In [7]:
df.columns # displaying the list of column names

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
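
The manual mapping above works well for seven columns. For wider tables, the same three rules could be applied programmatically. The sketch below is one possible approach and not part of the original study: the helper name `to_snake_case` is hypothetical, and its regular expression only handles simple camelCase names such as `userID`.

```python
import re


def to_snake_case(name):
    """Normalize a column header: strip spaces, split camelCase, lowercase."""
    name = name.strip()               # remove leading and trailing spaces
    name = re.sub(r'\s+', '_', name)  # replace inner whitespace with underscores
    # insert an underscore where a lowercase letter or digit meets an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# the raw headers as they appear in the file
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(column) for column in raw_columns]
print(clean_columns)
```

With a dataframe at hand, the cleaned names could then be assigned back with `df.columns = clean_columns`.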

### Missing Values
Let's count how many missing values there are in the dataframe. To do this, we chain two standard `pandas` methods:

In [8]:
df.isna().sum()  # counting the missing values

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. For example, missing values in the `track` and `artist` columns are not important for our work; it is enough to replace them with an explicit marker.

But missing values in `genre` may interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, it would be correct to determine the cause of these omissions and restore the data, but that option is not available here, so we have to:
* fill in these missing values with an explicit marker,
* estimate how much they will distort the calculations.
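
A rough first estimate of the possible distortion is the share of rows affected in each column. The sketch below recomputes it from the counts printed above (65079 rows in total, before duplicates are removed); on the dataframe itself, `df.isna().mean()` should give the same fractions. The variable names are illustrative.

```python
total_rows = 65079  # number of entries reported by df.info()
missing = {'track': 1231, 'artist': 7203, 'genre': 1198}  # counts from df.isna().sum()

# share of missing values per column, in percent
missing_share = {column: round(count / total_rows * 100, 2) for column, count in missing.items()}

for column, share in missing_share.items():
    print(f'{column}: {share}% missing')
```

Under these counts, fewer than 2% of rows lack a genre, while roughly 11% lack an artist.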


Let's replace the missing values in the `track`, `artist` and `genre` columns with the value `'unknown'`. To do this, we will create a list `columns_to_replace`, iterate over it with a `for` loop, and replace the missing values in each column:

In [9]:
columns_to_replace = ['track', 'artist', 'genre']  # creating list of the columns to fill in
for column in columns_to_replace:  # iterating over the column names
    df[column] = df[column].fillna('unknown')  # replacing missing values with 'unknown'

Let's make sure there are no more missing values in the dataframe by counting them once more:

In [10]:
df.isna().sum()  # counting the missing values

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
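
The loop used earlier is easy to read, but `fillna()` also works on several columns at once. Below is a minimal sketch on a toy dataframe; the `toy` name and its rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy dataframe (hypothetical values) with gaps in the same three columns
toy = pd.DataFrame({
    'track': ['True', None, 'Pessimist'],
    'artist': ['Roman Messer', None, None],
    'genre': ['dance', 'rock', None],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg'],
})

columns_to_replace = ['track', 'artist', 'genre']
# fillna() on a column subset fills all selected columns in one call
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

print(toy.isna().sum())
```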

### Duplicates
Let's count the explicit duplicates in the dataframe with a single command:

In [11]:
df.duplicated().sum()  # counting explicit duplicates

3826

Now we will call a `pandas` method to delete the explicit duplicates:

In [12]:
df = df.drop_duplicates().reset_index(drop=True)  # deleting explicit duplicates and resetting the index

Let's count the explicit duplicates once again to make sure there are none left:

In [13]:
df.duplicated().sum()  # making sure the duplicates are absent

0

Now we will get rid of the implicit duplicates in the `genre` column. For example, the name of the same genre can be specified in slightly different ways. Such errors will also affect the result of the study.


Let's display the list of unique genre names sorted alphabetically. To do this, we need to:
* extract the required dataframe column,
* apply a sort method to it,
* call a method on the sorted column that returns its unique values.


In [14]:
df['genre'].sort_values().unique()  # displaying the list of unique values in the 'genre' column

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Let's look through the list for implicit duplicates of the name `hiphop`. These may be misspelled names or alternative names for the same genre.

We've noticed the following implicit duplicates:
* *hip*,
* *hop*,
* *hip-hop*.

To clear the dataframe of them, we will define a `replace_wrong_genres()` function with two parameters:
* `wrong_genres` - list of duplicates,
* `correct_genre` - a string with the correct value.

The function should correct the `genre` column in the `df` table: replace each value from the `wrong_genres` list with the value of `correct_genre`.


In [15]:
# Defining a function that replaces the implicit duplicates:
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:  # iterating over the implicit duplicates
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)  # calling replace() for each duplicate

Let's call `replace_wrong_genres()` with arguments that eliminate the implicit duplicates: instead of `hip`, `hop` and `hip-hop`, the dataframe should contain the value `hiphop`:


In [16]:
duplicates = ['hip', 'hop', 'hip-hop']  # creating a list of duplicates
correct_name = 'hiphop'  # assigning the correct string value that we will use for the replacement
replace_wrong_genres(duplicates, correct_name)  # eliminating implicit duplicates by calling the function
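
As an alternative to the loop inside `replace_wrong_genres()`, `Series.replace()` also accepts a list of values to replace, so all the duplicates can be handled in one call. A minimal sketch on a toy Series; the `genres` and `cleaned` names are illustrative.

```python
import pandas as pd

# toy genre column containing the implicit duplicates
genres = pd.Series(['hip', 'rock', 'hop', 'hip-hop', 'hiphop'])

# every value from the list is replaced with the single correct name;
# only whole values are matched, so 'hiphop' itself is left untouched
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')

print(cleaned.tolist())
```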

Let's check that the incorrect names `hip`, `hop` and `hip-hop` are gone from the dataframe by displaying the sorted list of unique values of the `genre` column:

In [17]:
df['genre'].sort_values().unique()  # checking the implicit duplicates

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

Preprocessing revealed three problems in the data:

- inconsistent header style,
- missing values,
- duplicates: explicit and implicit ones.

We have fixed the column headers to make the dataframe easier to work with. Without duplicates, the study will become more accurate.

We have replaced the missing values with `'unknown'`. We still have to find out whether the missing values in the `genre` column will harm the study.

Now we can move on to hypothesis testing.


## Hypothesis Testing

### Comparison of user behavior in the two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. We'll check this assumption using data for three days of the week: Monday, Wednesday and Friday. To do this, we will:

* separate Moscow and St. Petersburg users,
* compare how many tracks each user group listened to on Monday, Wednesday and Friday.



First, we will look at user activity in each city. We will group the data by city and count the playbacks in each group.



In [18]:
df.groupby('city')['track'].count()  # counting playbacks in each city

city
Moscow              42741
Saint-Petersburg    18512
Name: track, dtype: int64

There are more playbacks in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often: there are simply more users in Moscow.

Now let's group the data by day of the week and count the playbacks on Monday, Wednesday and Friday. Note that the data contains playbacks for these three days only.



In [19]:
df.groupby('day')['track'].count()  # counting playbacks on each of three weekdays

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Across both cities combined, users are least active on Wednesday. However, the picture may change if we look at each city separately.

We have seen how grouping by city and by weekday works. Now we will define a function that combines these two calculations.

Let's create a `number_tracks()` function that will count the playbacks for a given day and city. It needs two parameters:
* day of the week,
* city name.

Inside the function, we will save to a variable the rows of the source dataframe for which:
  * the `day` column value is equal to the `day` parameter,
  * the `city` column value is equal to the `city` parameter.

To do this, we will apply sequential filtering with logical indexing.

Then we will count the values in the `user_id` column of the resulting table. We will save the result to a new variable and return it from the function.


In [20]:
# defining a function with two parameters: day, city
def number_tracks(day, city):
    track_list = df[df['day'] == day]  # keeping the df rows where the 'day' column equals the day parameter
    track_list = track_list[track_list['city'] == city]  # then keeping the rows where the 'city' column equals the city parameter
    track_list_count = track_list['user_id'].count()  # counting the values in the 'user_id' column
    return track_list_count  # the function returns a number: the value of track_list_count


Let's create two lists and call the function for each pair of values:

In [21]:
cities = ['Moscow', 'Saint-Petersburg']  # creating a list of cities
days = ['Monday', 'Wednesday', 'Friday']  # creating a list of weekdays
for day in days:  # iterating over the days
    for city in cities:  # iterating over the cities
        print(day, city, number_tracks(day, city))  # printing the result for each pair

Monday Moscow 15740
Monday Saint-Petersburg 5614
Wednesday Moscow 11056
Wednesday Saint-Petersburg 7003
Friday Moscow 15945
Friday Saint-Petersburg 5895


Using the `pd.DataFrame` constructor, we will create a new dataframe with the following parameters:
* columns - `['city', 'monday', 'wednesday', 'friday']`;
* data - the results returned by the `number_tracks()` function.

In [22]:
columns = ['city', 'monday', 'wednesday', 'friday']  # creating a list with the column names
data = [['Moscow', 15740, 11056, 15945], ['Saint-Petersburg', 5614, 7003, 5895]]  # creating a list with the data
result = pd.DataFrame(data=data, columns=columns)  # the new dataframe
result.head()  # checking the results

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
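
The table above was assembled by hand from the printed numbers. A city-by-day table of this kind can likely be built in one step with `pd.crosstab()`. The sketch below uses a small made-up dataframe, so its counts are not those of the study; the `toy` and `counts` names are illustrative.

```python
import pandas as pd

# toy playback log (hypothetical rows)
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# rows are cities, columns are days, cells are playback counts
counts = pd.crosstab(toy['city'], toy['day'])

# crosstab sorts the columns alphabetically, so restore the calendar order
counts = counts.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)

print(counts)
```

Applied to the real data, the same call would take `df['city']` and `df['day']` as its arguments.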


**Conclusions**

The data shows a difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, people listen to music more on Wednesday. Activity on Monday and Friday is lower than on Wednesday by roughly the same amount.

So the data supports the first hypothesis.
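
Because Moscow has more playbacks overall, absolute counts are hard to compare between the cities. One way to make the pattern easier to see is to express each day as a share of the city's total. The sketch below uses the counts from the results table; the variable names are illustrative.

```python
# playbacks per day, taken from the results table
counts = {
    'Moscow': {'monday': 15740, 'wednesday': 11056, 'friday': 15945},
    'Saint-Petersburg': {'monday': 5614, 'wednesday': 7003, 'friday': 5895},
}

shares = {}
for city, by_day in counts.items():
    total = sum(by_day.values())  # all playbacks in the city over the three days
    # share of each day in the city's total, in percent
    shares[city] = {day: round(n / total * 100, 1) for day, n in by_day.items()}
    print(city, shares[city])
```

Under these numbers, Wednesday accounts for roughly a quarter of Moscow playbacks but more than a third of those in St. Petersburg.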


### Music at the beginning and at the end of the week

According to the second hypothesis, on Monday morning certain genres prevail in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.

We will save the data for each city to a separate variable:
* Moscow data - to `moscow_general`;
* St. Petersburg data - to `spb_general`.

In [23]:
moscow_general = df[df['city'] == 'Moscow']  # creating the dataframe 'moscow_general' from the rows of df where the 'city' column is 'Moscow'
moscow_general.head()  # checking the result

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,unknown,ruspop,Moscow,09:17:40,Friday


In [24]:
spb_general = df[df['city'] == 'Saint-Petersburg']  # creating the dataframe 'spb_general' from the rows of df where the 'city' column is 'Saint-Petersburg'
spb_general.head()  # checking the result

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Saint-Petersburg,21:20:49,Wednesday


Let's create a function `genre_weekday()` with four parameters:
* dataframe with the data,
* day of the week,
* start time in 'hh:mm' format,
* end time in 'hh:mm' format.

The function should return information about the top 10 genres of those tracks that were listened to on the specified day, in the interval between two timestamps.


In [25]:
# Defining the function genre_weekday() with the parameters table, day, time1, time2:
def genre_weekday(table, day, time1, time2):
    genre_df = table[table['day'] == day]  # keeping the rows where the 'day' column equals the day argument
    genre_df = genre_df[genre_df['time'] > time1]  # keeping the rows where 'time' is later than time1
    genre_df = genre_df[genre_df['time'] < time2]  # keeping the rows where 'time' is earlier than time2
    genre_df_count = genre_df.groupby('genre')['track'].count()  # grouping by genre and counting the tracks of each genre
    genre_df_sorted = genre_df_count.sort_values(ascending=False)  # sorting the counts in descending order
    return genre_df_sorted[:10]  # returning the top 10 genres (on a given day, within the given hours)
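
Note that `genre_weekday()` compares times as strings: the `time` column holds values such as `'08:34:34'`, while the boundaries are passed as `'07:00'`. This works because zero-padded time strings sort lexicographically in chronological order. The sketch below illustrates the behavior, including the boundary cases; the helper name `in_interval` is hypothetical.

```python
def in_interval(time, start, end):
    """Mimic the two string comparisons used in genre_weekday()."""
    return start < time < end


# zero-padded 'hh:mm:ss' strings compare character by character,
# which matches chronological order
print(in_interval('08:34:34', '07:00', '11:00'))  # a morning playback
print(in_interval('14:07:09', '07:00', '11:00'))  # an afternoon playback

# boundary cases: a longer string with the same prefix counts as "greater",
# so 07:00:00 passes the lower bound while 11:00:00 fails the upper bound
print(in_interval('07:00:00', '07:00', '11:00'))
print(in_interval('11:00:00', '07:00', '11:00'))
```

The comparison relies on zero padding: a value written as `'7:00'` would sort after `'10:00'` and break the filter.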

Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):


In [26]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00') # calling the function for Monday morning in Moscow 


genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: track, dtype: int64

In [27]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')  # calling the function for Monday morning in St. Petersburg 

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: track, dtype: int64

In [28]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')  # calling the function for Friday night in Moscow 

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: track, dtype: int64

In [29]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')  # calling the function for Friday night in St. Petersburg 

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: track, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:
1. People in Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow ranking includes the “world” genre, while the St. Petersburg ranking includes jazz and classical.

2. There are so many missing values in Moscow that the value `'unknown'` took tenth place among the most popular genres. This means that missing values make up a noticeable share of the data and threaten the reliability of the study.

Friday evening does not change this picture. Some genres rise a little higher, others go down, but overall the top 10 stays much the same.

Thus, the second hypothesis has been only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, people listen to Russian pop music more often; in St. Petersburg, to jazz.

However, the missing values cast doubt on this result. There are so many of them in Moscow that the top 10 ranking could look different if the genre data had not been lost.


### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and this genre is listened to there more often than in Moscow. Moscow is a city of contrasts, which is nevertheless dominated by pop music.

Let's group the `moscow_general` table by genre and count the playbacks of each genre's tracks using the `count()` method. Then sort the result in descending order and store it in the `moscow_genres` table.


In [30]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)  # getting moscow_genres: playback counts per genre, sorted in descending order


Let's display the top 10 of `moscow_genres`:

In [31]:
moscow_genres.head(10)  # displaying the top 10 of moscow_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64
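
Grouping, counting and sorting can probably be collapsed into a single call: `Series.value_counts()` counts each unique value and sorts the result in descending order by default. A sketch on a toy dataframe with made-up values; the `toy`, `grouped` and `counted` names are illustrative.

```python
import pandas as pd

# toy data (hypothetical genre values)
toy = pd.DataFrame({'genre': ['pop', 'rock', 'pop', 'dance', 'pop', 'rock']})

# the approach used in this study: group, count, sort
grouped = toy.groupby('genre')['genre'].count().sort_values(ascending=False)

# the shorter alternative
counted = toy['genre'].value_counts()

print(counted)
```

Both variants give the same ranking here; with tied counts, the order of the tied genres may differ between the two.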

Now we will do the same for St. Petersburg.

We will group the `spb_general` table by genre, count the playbacks of each genre's tracks, then sort the result in descending order and store it in `spb_genres`:



In [32]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)  # getting spb_genres: playback counts per genre, sorted in descending order

As above, we will display the top 10 of `spb_genres`:

In [33]:
spb_genres.head(10)  # displaying the top 10 of spb_genres

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis has been partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 includes a closely related genre: Russian pop music.
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg.
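
Since Moscow has more than twice as many playbacks, the popularity of a genre is best compared in relative terms. The sketch below computes each genre's share of the city's playbacks from the top 10 counts and the city totals counted earlier (42741 and 18512); the variable names are illustrative.

```python
city_totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}  # playbacks per city
genre_counts = {
    'Moscow': {'pop': 5892, 'hiphop': 2096, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'hiphop': 960, 'rusrap': 564},
}

genre_shares = {}
for city, counts in genre_counts.items():
    # share of each genre in the city's playbacks, in percent
    genre_shares[city] = {genre: round(n / city_totals[city] * 100, 1) for genre, n in counts.items()}
    print(city, genre_shares[city])
```

Under these numbers, the shares of each of the three genres differ between the cities by less than one percentage point.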



## Study Results

We have tested three hypotheses and found out that:

1. The day of the week affects user activity differently in Moscow and St. Petersburg.

The first hypothesis has been fully confirmed.

2. Musical preferences do not change much during the week, in either Moscow or St. Petersburg. Minor differences are noticeable at the beginning of the week, on Mondays:
* in Moscow, people listen to music of the “world” genre,
* in St. Petersburg, to jazz and classical music.

Thus, the second hypothesis has been only partly confirmed. This result could have been different if there had not been missing values in the data.

3. The tastes of users of Moscow and St. Petersburg have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis has not been confirmed. If there are differences in preferences, they are not visible among the bulk of users.

**In practice, studies contain tests of statistical hypotheses.**
Based on the data of a single service, it is not always possible to draw conclusions about all the inhabitants of a city.
Statistical hypothesis tests show how reliable such conclusions are, given the available data.
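
As an illustration of what such a test might look like for the first hypothesis, the sketch below computes a chi-square statistic of independence for the city-by-day table. This is not part of the original study: the choice of test is an assumption, and it treats every playback as an independent observation, which is questionable because one user can produce many playbacks.

```python
# observed playbacks: rows are cities, columns are Monday, Wednesday, Friday
observed = [
    [15740, 11056, 15945],  # Moscow
    [5614, 7003, 5895],     # Saint-Petersburg
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected is the count implied by independence of city and day
chi2 = 0.0
for i, row in enumerate(observed):
    for j, value in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (value - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 5.991  # chi-square critical value for 2 degrees of freedom at the 0.05 level

print(round(chi2, 1), degrees_of_freedom, chi2 > critical_value)
```

If the statistic exceeds the critical value, the assumption that playbacks are distributed over the days in the same way in both cities would be rejected at the 5% level, subject to the independence caveat above.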
