# Yandex Music

The comparison of Moscow and St. Petersburg:
 * Moscow is a metropolis, subject to the tough rhythm of the working week;
 * St. Petersburg is a cultural capital, with its own tastes.

Using Yandex.Music data, I will compare the behavior of users of the two capitals.

**The purpose of the study** is to test three hypotheses:
1. User activity depends on the day of the week, and this shows up differently in Moscow and St. Petersburg.
2. On Monday morning, some genres prevail in Moscow and others in St. Petersburg. Similarly, on Friday evening different genres prevail depending on the city.
3. Moscow and St. Petersburg prefer different genres of music: pop is listened to more often in Moscow, Russian rap in St. Petersburg.

**Research progress**

I got the data on user behavior from the yandex_music_project.csv file. Nothing is known about the quality of the data. Therefore, a review of the data will be needed before testing hypotheses.

I will check the data for errors and evaluate their impact on the study. Then, at the preprocessing stage, I will correct the most critical data errors.
 
Thus, the study will take place in three stages:
 1. Data overview.
 2. Data preprocessing.
 3. Testing of hypotheses.



## Data overview

Let's check the data.




In [1]:
import pandas # importing pandas

Reading `yandex_music_project.csv` from `/datasets` and saving it as `df`:

In [2]:
df = pandas.read_csv('/datasets/yandex_music_project.csv')
df.head()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday


In [3]:
df.head(10) # let's take a look at the first 10 lines of df

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
df.info() # getting general information about df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in the table. The data type in all columns is `object`.

According to the data documentation:
* `userID` — user ID;
* `Track` — the name of the track;  
* `artist` — artist's name;
* `genre` — the name of the genre;
* `City` — the user's city;
* `time` — the start time of listening;
* `Day` — the day of the week.

Three style violations are visible in the column names:
1. Lowercase letters are mixed with uppercase ones.
2. Some names contain leading or trailing spaces.
3. A name such as `userID` is not written in snake_case.



The number of non-null values differs between columns, which means there are missing values in the data.


**Conclusions**

Each row of the table contains data about one listened track. Some columns describe the track itself: the name, the artist and the genre. The rest describe the user: which city they are from and when they listened to the music.

At first glance, there is enough data to test the hypotheses. However, there are missing values in the data and style errors in the column names.

To move on, we need to fix these problems.


## Data preprocessing
Correcting the style of the column headers, handling missing values, and checking the data for duplicates.

### Header style


In [5]:
df.columns # all df columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Changing the names of the columns:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [6]:
df = df.rename(columns={'  userID' : 'user_id', 'Track': 'track', '  City  ': 'city', 'Day' : 'day'}) # renaming the columns

In [7]:
df = df.rename(columns = {
    '  userID':'user_id',
    'Track':'track',
    '  City  ':'city',
    'Day':'day'})

Checking the result. To do this I'll display the column names again:

In [8]:
df.columns 

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
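
For a table with many columns, listing every rename by hand gets tedious. Below is a minimal sketch of a generic alternative. It assumes a hypothetical helper `to_snake_case()` (not part of pandas) and a toy frame that only reproduces the original headers:

```python
import re

import pandas


def to_snake_case(name):
    """Hypothetical helper: strip spaces, split camelCase, lowercase."""
    name = name.strip()
    # put an underscore where a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# toy frame with the same headers as the raw table (no rows needed)
toy = pandas.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

Passing a function to `rename(columns=...)` applies it to every header, so the explicit dictionary is only needed when a name requires special handling.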

### Missing values
First, count how many missing values there are in the table. Two pandas methods are enough for this:

In [9]:
df.isnull().sum() # counting missing values

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. Missing values in `track` and `artist` are not important: it is enough to replace them with an explicit placeholder.

But missing values in `genre` can interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, it would be right to find out why the values are missing and restore the data. There is no such opportunity in this training project, so I'll have to:
* fill in these gaps with explicit designations,
* assess how much they will alter the calculations. 
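
A first rough gauge of the possible impact is the share of rows affected in each column. The sketch below only reuses the counts already printed above (65079 rows in total and the per-column numbers of missing values). The names `total_rows`, `missing` and `missing_share` are introduced here for illustration:

```python
import pandas

total_rows = 65079  # number of entries reported by df.info()
# missing values per column, as reported by df.isnull().sum()
missing = pandas.Series({'track': 1231, 'artist': 7203, 'genre': 1198})

# percentage of rows with a missing value in each column
missing_share = (missing / total_rows * 100).round(1)
print(missing_share)
```

Fewer than 2% of the rows lack a genre. On the real table the same fractions can be obtained directly with `df.isnull().mean()`.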

Replace the missing values in the `track`, `artist`, and `genre` columns with the string `'unknown'`. To do this, I'll create a `columns_to_replace` list, iterate over its elements with a `for` loop and replace the missing values in each column:

In [10]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

Let's make sure that there are no gaps left in the table. To do this, let's count the missing values again.

In [11]:
df.isnull().sum() # counting gaps

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates
Counting the explicit (fully identical) duplicate rows in the table with one command:

In [12]:
df.duplicated().sum() 

3826

Calling the pandas method that removes explicit duplicates:

In [13]:
df = df.drop_duplicates().reset_index(drop=True)

In [14]:
df.duplicated().sum() # checking the results

0

In [15]:
print("Duplicates: {}".format(df.duplicated().sum()))

Duplicates: 0


Now we should get rid of implicit duplicates in the `genre` column. For example, the name of the same genre may be spelled in slightly different ways. Such errors will also affect the results of the study.

Displaying a list of unique genre names sorted alphabetically. To do this:
* extract the necessary dataframe column,
* apply a sorting method,
* call a method on the sorted column that returns its unique values.

In [16]:

df['genre'].sort_values(ascending = True).unique() # viewing the unique genre names

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Browsing the list, I find implicit duplicates of the name `hiphop`. These may be misspelled or alternative names of the same genre: `hip`, `hop`, `hip-hop`.

To clear the table of these duplicates, I'll write the `replace_wrong_genres()` function with two parameters:
* `wrong_genres` — a list of duplicates,
* `correct_genre` — a string with the correct value.

The function will correct the `genre` column in the `df` table: it replaces each value from the `wrong_genres` list with the value from `correct_genre`.

In [17]:
def replace_wrong_genres(wrong_genres, correct_genre):
    # replace each wrong spelling in turn with the correct genre name
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [18]:
duplicates = ['hip', 'hop', 'hip-hop']
genre = 'hiphop'
replace_wrong_genres(duplicates, genre) 
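
The same clean-up can also be written without a loop, because `Series.replace()` accepts a list of values to replace. This is a minimal sketch on made-up toy values, not the project data:

```python
import pandas

# made-up toy values, only to show the call
genres = pandas.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'])

# every value from the list is replaced by the single correct name;
# only exact matches are replaced, so 'hiphop' and 'rock' stay as they are
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned.unique())
```

The loop version and the list version should give the same result. The loop is kept in the notebook because it shows every step explicitly.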

In [19]:
df['genre'].sort_values(ascending = True).unique() 

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

Preprocessing revealed three problems in the data:

- header style violations,
- missing values,
- duplicates — explicit and implicit.

I have corrected the headers to make it easier to work with the table. Without duplicates, the study will become more accurate.

I have replaced the missing values with 'unknown'. It remains to be seen whether omissions in the 'genre' column will alter the study.

Now we can proceed to hypothesis testing.


## Testing of hypotheses

### Comparison of user behavior of the two capitals

The first hypothesis states that the users listen to music differently in Moscow and St. Petersburg. We should check this assumption based on data on three days of the week — Monday, Wednesday and Friday. To do this:

* Separate the users of Moscow and St. Petersburg
* Compare how many tracks each user group listened to on Monday, Wednesday and Friday.


First, to evaluate user activity in each city, I'll group the data by city and count the listened tracks in each group.



In [20]:
df.groupby('city').count() # counting values in each city

Unnamed: 0_level_0,user_id,track,artist,genre,time,day
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Moscow,42741,42741,42741,42741,42741,42741
Saint-Petersburg,18512,18512,18512,18512,18512,18512


In [21]:
df.groupby('city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more listens in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often: most likely there are simply more users in Moscow.

Now we should group the data by day of the week and count the listens on Monday, Wednesday and Friday.
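
One way to check this assumption would be to divide the number of listens by the number of distinct users in each city. The sketch below shows the idea on made-up toy data (one row per listen). The column names follow the cleaned table, and the values are invented for illustration:

```python
import pandas

# made-up toy data: one row per listen
toy = pandas.DataFrame({
    'user_id': ['a', 'a', 'b', 'c', 'c', 'c', 'd'],
    'city': ['Moscow'] * 6 + ['Saint-Petersburg'],
})

# listens = number of rows, users = number of distinct user ids per city
per_city = toy.groupby('city')['user_id'].agg(listens='count', users='nunique')
per_city['listens_per_user'] = per_city['listens'] / per_city['users']
print(per_city)
```

On the real table the same grouping could be applied to `df`. Only the per-user figure allows a fair comparison between cities of different size.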



In [22]:
df.groupby('day').count() # counting listens on each of three days

Unnamed: 0_level_0,user_id,track,artist,genre,city,time
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Friday,21840,21840,21840,21840,21840,21840
Monday,21354,21354,21354,21354,21354,21354
Wednesday,18059,18059,18059,18059,18059,18059


Overall, users from the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

Creating a `number_tracks()` function that counts the listens for a given day and city. It needs two parameters:
* day of the week,
* name of the city.

Inside the function, I'll save to a variable the rows of the original table where:
  * the value in the `day` column equals the `day` parameter,
  * the value in the `city` column equals the `city` parameter.

To do this, I'll apply sequential filtering with logical indexing.

Then I'll count the values in the `user_id` column of the resulting table, save the result to a new variable and return it from the function.

In [23]:
def number_tracks(day, city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count  = track_list['user_id'].count()
    return track_list_count



Calling `number_tracks()` six times, changing the parameter values to get the data for each city on each of the three days:

In [24]:
moscow_mon = number_tracks('Monday', 'Moscow') 
moscow_mon

15740

In [25]:
peterburg_mon = number_tracks('Monday', 'Saint-Petersburg')
peterburg_mon 

5614

In [26]:
moscow_wed = number_tracks('Wednesday', 'Moscow')
moscow_wed 

11056

In [27]:
peterburg_wed = number_tracks('Wednesday', 'Saint-Petersburg')
peterburg_wed 

7003

In [28]:
moscow_fri = number_tracks('Friday', 'Moscow') 
moscow_fri 

15945

In [29]:
peterburg_fri = number_tracks('Friday', 'Saint-Petersburg') 
peterburg_fri

5895

Using the `pandas.DataFrame` constructor, I'll create a table where
* the column names are `['city', 'monday', 'wednesday', 'friday']`;
* the data are the results obtained with `number_tracks()`.

In [30]:

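# a separate name is used so that the number_tracks() function is not overwritten
listen_data = [
    ['Moscow', moscow_mon, moscow_wed, moscow_fri],
    ['Petersburg', peterburg_mon, peterburg_wed, peterburg_fri]
]
column_names = ['city', 'monday', 'wednesday', 'friday']
listen_week = pandas.DataFrame(data=listen_data, columns=column_names)
display(listen_week)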



Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Petersburg,5614,7003,5895
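
As a cross-check, a table of the same shape can be built in one step with `pandas.crosstab()`, which counts rows for every combination of two columns. The sketch uses a made-up toy frame, so the numbers are illustrative only:

```python
import pandas

# made-up toy data: one row per listen
toy = pandas.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# rows are cities, columns are days, cells are the number of listens
counts = pandas.crosstab(toy['city'], toy['day'])
# put the days in calendar order instead of alphabetical order
counts = counts.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(counts)
```

Applied to `df['city']` and `df['day']`, this would be expected to reproduce the six values returned by `number_tracks()`.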


**Conclusions**

The data shows the difference in user behavior:

- In Moscow, listens peak on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, people listen to music more on Wednesday. Activity on Monday and Friday is lower than on Wednesday, and roughly equal between the two days.

This means that the data support the first hypothesis.
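
Because the two cities differ in size, the pattern is easier to see as shares of each city's total. The sketch below rebuilds the `listen_week` table from the counts reported above and divides each row by its sum. The name `day_share` is introduced here for illustration:

```python
import pandas

# counts copied from the listen_week table
listen_week = pandas.DataFrame(
    data=[['Moscow', 15740, 11056, 15945], ['Petersburg', 5614, 7003, 5895]],
    columns=['city', 'monday', 'wednesday', 'friday'],
).set_index('city')

# divide every row by its own total to get the share of listens per day
day_share = listen_week.div(listen_week.sum(axis=1), axis=0).round(3)
print(day_share)
```

Wednesday accounts for roughly 26% of the Moscow listens but about 38% of the St. Petersburg ones, which is the contrast described in the conclusions.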

### Music at the beginning and at the end of the week

According to the second hypothesis, some genres prevail in Moscow on Monday morning and others in St. Petersburg. Similarly, different genres prevail on Friday evening depending on the city.

I'll save the data for each city in a separate table:
* for Moscow  — `moscow_general`;
* for St. Petersburg  — `spb_general`.

In [31]:
moscow_general = df[df['city'] == 'Moscow'] # the table moscow_general consists of those rows in df where 'city' equals 'Moscow'


In [42]:
spb_general = df[df['city'] == 'Saint-Petersburg']  # the table spb_general consists of those rows in df where 'city' equals 'Saint-Petersburg'


I'll create a `genre_weekday()` function with four parameters:
* a table (dataframe) with the data,
* the day of the week,
* the start time in the format 'hh:mm',
* the end time in the format 'hh:mm'.

The function should return the top 10 genres of the tracks that were listened to on the specified day between the two times.

In [43]:
def genre_weekday(table, day, time1, time2):
    genre_df = table[table['day'] == day]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df_count = genre_df.groupby('genre')['day'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending = False)
    return genre_df_sorted.head(10)
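
The function compares times as plain strings. This works here because the `time` values are zero-padded `hh:mm:ss` strings, so alphabetical order matches chronological order. A small sketch with made-up time values shows which rows pass the two strict comparisons:

```python
import pandas

# made-up time values in the same zero-padded format as the 'time' column
times = pandas.Series(['06:59:59', '07:00:00', '08:34:34', '11:00:00', '20:28:33'])

# the same two strict comparisons as in genre_weekday()
in_window = (times > '07:00') & (times < '11:00')
kept = times[in_window].tolist()
print(kept)

# without zero padding the string order breaks: '8...' sorts after '11...'
unpadded_ok = '8:34:34' < '11:00'
print(unpadded_ok)
```

If the column ever contained unpadded hours, converting it to a proper time type before filtering would be the safer choice.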


Comparing the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [47]:
moscow_morning_mon = genre_weekday(moscow_general, 'Monday', '07:00', '11:00')
moscow_morning_mon # calling the function


genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: day, dtype: int64

In [48]:
spb_morning_mon = genre_weekday(spb_general, 'Monday', '07:00', '11:00')
spb_morning_mon


genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: day, dtype: int64

In [49]:
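# the day must be 'Friday' for the Friday-evening comparison;
# the output shown below was produced with 'Monday', so the cell needs a re-run
moscow_evening_fri = genre_weekday(moscow_general, 'Friday', '17:00', '23:00')
moscow_evening_fri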

genre
pop            717
dance          524
rock           518
electronic     485
hiphop         238
alternative    182
classical      172
world          172
ruspop         149
rusrap         133
Name: day, dtype: int64

In [50]:
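# the day must be 'Friday' for the Friday-evening comparison;
# the output shown below was produced with 'Monday', so the cell needs a re-run
spb_evening_fri = genre_weekday(spb_general, 'Friday', '17:00', '23:00')
spb_evening_fri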

genre
pop            263
rock           208
electronic     192
dance          191
hiphop         104
alternative     72
classical       71
jazz            57
rusrap          54
ruspop          53
Name: day, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. In Moscow and St. Petersburg people listen to similar music. The only difference is that the “world” genre entered the Moscow rating, and jazz and classical music entered the St. Petersburg rating.

2. In Moscow there were so many missing values that `'unknown'` took tenth place among the most popular genres. This means that missing values make up a significant share of the data and compromise the reliability of the study.

Friday night doesn't change that picture. Some genres rise a little higher, others go down, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and St. Petersburg is not too pronounced. In Moscow people listen to Russian popular music more often, in St. Petersburg it is jazz.

However, gaps in the data make this result doubtful. There are so many of them in Moscow that the top-10 rating could look different if not for the missing genre values.

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, music of this genre is listened to there more often than in Moscow. And Moscow is a city of contrasts, where, nevertheless, pop music prevails.

I'll group the `moscow_general` table by genre and count the listens of tracks of each genre using the `count()` method. Then I'll sort the result in descending order and save it in `moscow_genres`.

In [51]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)




In [53]:
moscow_genres.head(10) # checking first 10 rows of moscow_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now I'll repeat the same for St. Petersburg.

I'll group the `spb_general` table by genre, count the listens of tracks of each genre, sort the result in descending order and save it in `spb_genres`:


In [54]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)


In [56]:
spb_genres.head(10) 

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 includes a closely related genre, Russian pop.
* Contrary to expectations, Russian rap is about equally popular in Moscow and St. Petersburg.
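
The raw counts are hard to compare because Moscow has more listens overall. The sketch below turns the reported counts into percentages of each city's total. It reuses only numbers printed earlier in the notebook, and the names `total`, `counts` and `share_pct` are introduced here for illustration:

```python
import pandas

# total listens per city, from df.groupby('city')['user_id'].count()
total = pandas.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# genre counts copied from the two top-10 lists
counts = pandas.DataFrame({
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
})

# share of each genre in the city's listens, in percent
share_pct = counts.div(total, axis=0).mul(100).round(1)
print(share_pct)
```

Russian rap makes up roughly 3% of listens in both cities, and pop roughly 13 to 14%, so neither genre separates the two cities clearly.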


## General conclusion

I have tested three hypotheses and established the following:

1. The day of the week has different effects on user activity in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences don't change much during the week whether it's Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow people listen to music of the genre “world”,
* in St. Petersburg people prefer jazz and classical.

Thus, the second hypothesis was only partially confirmed. This result could have been different if not for the gaps in the data.

3. The tastes of users in Moscow and St. Petersburg have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

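The third hypothesis was only partially confirmed: pop music does lead in Moscow, but Russian rap is not noticeably more popular in St. Petersburg. Even if there are differences in preferences, they are not visible among the bulk of users.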

