# Yandex Music

This project compares the behavior of Yandex Music users in two cities: Moscow and St. Petersburg.

**The purpose of the study** is to test three hypotheses:
1. User activity depends on the day of the week, and this dependence differs between Moscow and St. Petersburg.
2. On Monday morning, some genres prevail in Moscow and others in St. Petersburg. Similarly, on Friday evening, different genres prevail depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow, pop music is listened to more often; in St. Petersburg, Russian rap.

**Research progress**

During the data overview, before testing the hypotheses, we will check the data for errors and evaluate their impact on the study. Then, at the preprocessing stage, we will look for ways to correct the most critical errors.

Thus, the study takes place in three stages:
 1. Data overview.
 2. Data preprocessing.
 3. Hypothesis testing.

## Data overview

Let's get a first impression of the Yandex Music data.

Import the `pandas` library:

In [38]:
import pandas as pd # importing the pandas library

Let's read the file `yandex_music_dataset.csv` and save it in the variable `df`:

In [39]:
df = pd.read_csv('yandex_music_dataset.csv') # reading the data file and saving it to df

Let's display the first ten rows of the table:

In [None]:
print(df.head(10)) # getting the first 10 rows of the df table

| user_id | track | artist           | genre | city | time | day |
|---|---|------------------|---|---|---|---|
| FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 20EC38 | Funiculì funiculà | Mario Lanza      | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| A3DD03C9 | Dragons in the Sunset | Fire + Ice       | folk | Saint-Petersburg | 08:37:09 | Monday |
| E2DC1FAE | Soul People | Space Echo       | dance | Moscow | 08:34:34 | Monday |
| 842029A1 | Преданная | IMPERVTOR        | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 4CB90AA5 | True | Roman Messer     | dance            | Moscow | 13:00:07 | Wednesday |
| F03E1C1F | Feeling This Way | Polina Griffith  | dance | Moscow | 20:47:49 | Wednesday |
| 8FA1D3BE | И вновь продолжается бой | unknown          | ruspop | Moscow | 09:17:40 | Friday |
| E772D5C0 | Pessimist | unknown          | dance | Saint-Petersburg | 21:20:49 | Wednesday |


Get general information about the table using the `info()` method:

In [None]:
df.info() # getting general information about the data in the df table

| Column | Non-Null Count | Dtype |
| --- | --- | --- |
| user_id | 61253 | object |
| track | 61253 | object |
| artist | 61253 | object |
| genre | 61253 | object |
| city | 61253 | object |
| time | 61253 | object |
| day | 61253 | object |

**dtypes:** object(7); **memory usage:** 3.7+ MB

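So, the table has seven columns, and every column has the `object` data type.

If a column has fewer non-null values than the table has rows, it contains missing values. This data does have gaps, in `track`, `artist` and `genre`; they are counted at the preprocessing stage.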

**Conclusions**

Each row of the table holds data about one played track. Some of the columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

Preliminarily, there seems to be enough data to test the hypotheses. But there are missing values in the data, and the column names deviate from good naming style.

To move on, let's fix the problems in the data.

## Data preprocessing
Let's fix the style of the column headers and handle the missing values. Then we will check the data for duplicates.

### Header style

Let's display the column names on the screen:

In [None]:
print(df.columns) # list of column names in the df table

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Let's bring the names to good style:
* write multi-word names in snake_case;
* make all characters lowercase;
* remove the spaces.

To do this, rename the columns as follows:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [43]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'}) # renaming columns

Let's check the result. To do this, once again display the column names on the screen:

In [None]:
df.columns # checking the results - a list of column names

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
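The renaming above lists every column by hand. As an alternative, the same three rules can be applied programmatically. The sketch below is only an illustration: the helper `to_snake_case` is a hypothetical name that does not appear in the study, and the empty table `raw` merely reproduces the raw column names shown earlier.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Hypothetical helper: trim spaces, split camelCase, lowercase."""
    name = name.strip()
    # put an underscore between a lowercase letter or digit and a following capital
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# an empty table that only carries the raw column names from the study
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# rename() accepts a function and applies it to every column name
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

An explicit dictionary, as used in the study, is easier to audit on a small table; a function like this is mainly useful when there are many columns.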

### Missing values

First, let's calculate how many missing values are in the table. Two `pandas` methods are enough for this:

In [None]:
df.isna().sum() # counting missing values

| column | missing values |
| :--- | :--- |
| user\_id | 0 |
| track | 1231 |
| artist | 7203 |
| genre | 1198 |
| city | 0 |
| time | 0 |
| day | 0 |

Not all missing values affect the study. In `track` and `artist`, missing values are not important for our work. It is enough to replace them with explicit placeholders.

But missing values in `genre` may prevent comparing musical tastes in Moscow and St. Petersburg. In another project, it would be correct to establish the reason for the gaps and restore the data. There is no such possibility in this project. Therefore, we will:
* fill in these gaps with explicit placeholders;
* assess how much they will damage the calculations.

Replace the missing values in the columns `track`, `artist` and `genre` with the string `'unknown'`. To do this, create a list `columns_to_replace`, iterate through its elements with a `for` loop, and replace the missing values in each column:

In [46]:
columns_to_replace = ['track', 'artist', 'genre']
# iterating over column names in a loop and replacing missing values with 'unknown'
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

Let's make sure that there are no gaps left in the table. To do this, we will count the missing values again.

In [None]:
df.isna().sum() # counting missing values

| column | missing values |
| :--- | :--- |
| user\_id | 0 |
| track | 0 |
| artist | 0 |
| genre | 0 |
| city | 0 |
| time | 0 |
| day | 0 |
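One way to assess how much the filled gaps may affect later comparisons is to look at the share of `'unknown'` genres in each city. The sketch below uses a small made-up table `toy` (not the study data) and assumes the same `city` and `genre` column names.

```python
import pandas as pd

# made-up rows for illustration only, not taken from the study data
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', None, 'rock', 'dance', 'jazz', None],
})

# same placeholder as in the study
toy['genre'] = toy['genre'].fillna('unknown')

# the mean of a boolean mask is the fraction of True values in each group
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

On the real table the same expression would be applied to `df`. A noticeably different share between the two cities would be a warning sign for the genre comparisons.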

### Duplicates

Let's count the explicit duplicates in the table with one command:

In [None]:
df.duplicated().sum() # counting explicit duplicates

3826

Let's call the `pandas` method `drop_duplicates()` to remove explicit duplicates:

In [49]:
df = df.drop_duplicates() # removing explicit duplicates

Let's count the explicit duplicates once more to make sure we have completely got rid of them:

In [None]:
df.duplicated().sum() # checking for the absence of duplicates

0
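After `drop_duplicates()`, the surviving rows keep their original index labels, so the index has gaps. This does not affect the counts in this study, but a common follow-up is to renumber the rows. The sketch below shows this on a made-up table `toy`; it is an optional variation, not a step of the study.

```python
import pandas as pd

# made-up table with one exact duplicate in the middle
toy = pd.DataFrame({
    'user_id': ['A', 'B', 'A', 'C'],
    'track': ['x', 'y', 'x', 'z'],
})

# after dropping duplicates the index is 0, 1, 3
deduped = toy.drop_duplicates()

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels
renumbered = deduped.reset_index(drop=True)
print(list(deduped.index), list(renumbered.index))
```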

Now let's get rid of implicit duplicates in the `genre` column. For example, the name of the same genre may be written in slightly different ways. Such errors will also affect the result of the study.

Let's display a list of unique genre names, sorted alphabetically. To do this:
* extract the desired dataframe column;
* apply the sorting method to it;
* for a sorted column, call a method that returns unique values from the column.

In [None]:
df['genre'].sort_values().unique() # view unique genre names

Looking through the list, we find implicit duplicates of the name `hiphop`. These may be misspelled or alternative names of the same genre.

We observe the following implicit duplicates:
* *hip*;
* *hop*;
* *hip-hop*.

To clear the table of them, we use the `replace()` method with two arguments: a list of duplicate strings (*hip*, *hop* and *hip-hop*) and a string with the correct value. Let's fix the `genre` column in the `df` table by replacing each value from the list of duplicates with `hiphop`:

In [52]:
# elimination of implicit duplicates: every listed spelling becomes 'hiphop'
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')

Let's check that the wrong names have been replaced:

*   *hip*;
*   *hop*;
*   *hip-hop*.

Output a sorted list of unique values of the `genre` column:

In [None]:
df['genre'].sort_values().unique() # checking for implicit duplicates
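If more groups of implicit duplicates turn up, the replacement can be wrapped in a small helper. The function `replace_wrong_genres` below is a hypothetical name used only for this sketch, and `toy` is made-up data. Note that `replace()` without `regex=True` matches whole cell values, so `'hip'` does not alter the inside of `'hiphop'`.

```python
import pandas as pd

def replace_wrong_genres(table, wrong_genres, correct_genre):
    """Hypothetical helper: return a copy with each wrong spelling replaced."""
    result = table.copy()
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

# made-up data for illustration only
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})

fixed = replace_wrong_genres(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(sorted(fixed['genre'].unique()))
```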

**Conclusions**

Preprocessing found three problems in the data:

- violations in the style of headlines;
- missing values;
- duplicates — explicit and implicit.

We have corrected the headers to simplify working with the table. Without duplicates, the study will become more accurate.

We replaced the missing values with `'unknown'`. It remains to be seen whether the missing values in the `genre` column will harm the study.

Now we turn to hypothesis testing.

## Hypothesis testing

### Comparing user behavior in the two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. We check this assumption using data for three days of the week: Monday, Wednesday and Friday. To do this, we will:

* Separate the users of Moscow and St. Petersburg.
* Compare how many tracks each user group listened to on Monday, Wednesday and Friday.

We perform each of the calculations separately.

First, we evaluate user activity in each city. We group the data by city and count the plays in each group.

In [None]:
df.groupby('city')['user_id'].count() # counting plays in each city

| city | user\_id |
| :--- | :--- |
| Moscow | 42741 |
| Saint-Petersburg | 18512 |

There are more plays in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often. There are simply more users in Moscow.

Now we group the data by day of the week and count the plays on Monday, Wednesday and Friday. Note that the data contains information about plays only for these days.

In [None]:
df.groupby('day')['genre'].count() # counting plays on each of the three days

| day | genre |
| :--- | :--- |
| Friday | 21840 |
| Monday | 21354 |
| Wednesday | 18059 |

Taking both cities together, users are less active on Wednesdays. But the picture may change if we consider each city separately.

Let's create a function `number_tracks()` that counts plays for a given day and city. It needs two parameters:
* day of the week;
* name of the city.

In the function, we save to a variable the rows of the source table for which:
* the value in the `day` column equals the parameter `day`;
* the value in the `city` column equals the parameter `city`.

To do this, we apply sequential filtering with logical indexing.

Then we count the values in the `user_id` column of the resulting table, save the result to a new variable, and return this variable from the function.

In [56]:
def number_tracks(day, city):
    # A function for counting plays for a specific city and day.
    # Sequential filtering with logical indexing:
    # first keep the rows of the source table with the desired day,
    track_list = df[df['day'] == day]
    # then keep the rows of the result with the desired city,
    track_list = track_list[track_list['city'] == city]
    # then count the values in the user_id column with count().
    track_list_count = track_list['user_id'].count()
    # This count is what the function returns.
    return track_list_count
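The two sequential filters can also be written as one combined boolean mask. The sketch below is a variation, not the study's function: `count_plays` is a hypothetical name, it takes the table as a parameter instead of reading the global `df`, and `toy` is made-up data. Summing the mask counts rows, which equals counting `user_id` values as long as `user_id` has no missing values (as is the case in this study).

```python
import pandas as pd

def count_plays(table, day, city):
    """Hypothetical variant: count rows for a day and city with one mask."""
    # & combines the two conditions element-wise; the parentheses are required
    mask = (table['day'] == day) & (table['city'] == city)
    # True counts as 1, so the sum is the number of matching rows
    return int(mask.sum())

# made-up data for illustration only
toy = pd.DataFrame({
    'user_id': ['A', 'B', 'C', 'D'],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow'],
    'day': ['Monday', 'Monday', 'Monday', 'Friday'],
})

print(count_plays(toy, 'Monday', 'Moscow'))
```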

Let's call `number_tracks()` six times, changing the parameter values, to get data for each city on each of the three days.

In [None]:
number_tracks('Monday', 'Moscow') # number of plays in Moscow on Mondays

15740

In [None]:
number_tracks('Monday', 'Saint-Petersburg') # number of plays in St. Petersburg on Mondays

5614

In [None]:
number_tracks('Wednesday', 'Moscow') # number of plays in Moscow on Wednesdays

11056

In [None]:
number_tracks('Wednesday', 'Saint-Petersburg') # number of plays in St. Petersburg on Wednesdays

7003

In [None]:
number_tracks('Friday', 'Moscow') # number of plays in Moscow on Fridays

15945

In [None]:
number_tracks('Friday', 'Saint-Petersburg') # number of plays in St. Petersburg on Fridays

5895

Using the `pd.DataFrame` constructor, we will create a table where
* the column names are `['city', 'monday', 'wednesday', 'friday']`;
* the data are the results we got using `number_tracks()`.

In [63]:
data = [['Moscow', 15740, 11056, 15945],
        ['Saint-Petersburg', 5614, 7003, 5895]] 
columns = ['city', 'monday', 'wednesday', 'friday'] 
info = pd.DataFrame(data = data, columns = columns) # Results table
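The same city-by-day table can also be built in one step, without typing the six numbers by hand. The sketch below uses `pd.crosstab` on a made-up table `toy`; on the real data the arguments would be `df['city']` and `df['day']`. This is an alternative to the approach used in the study, not a replacement for it.

```python
import pandas as pd

# made-up rows for illustration only
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# crosstab counts how many rows fall into each (city, day) combination
counts = pd.crosstab(toy['city'], toy['day'])

# columns come out in alphabetical order, so put them in calendar order
counts = counts[['Monday', 'Wednesday', 'Friday']]
print(counts)
```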

**Conclusions**

The data shows a difference in user behavior:

- In Moscow, plays peak on Monday and Friday, and on Wednesday there is a noticeable decline.
- In St. Petersburg, on the contrary, users listen to music more on Wednesdays. Activity on Monday and Friday is lower than on Wednesday by a similar margin.

So, the data speak in favor of the first hypothesis.
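Because Moscow has many more plays overall, the pattern is easier to compare when each day is expressed as a share of the city's total. The sketch below rebuilds the results table from the six counts obtained above and divides each row by its sum; the name `shares` is introduced only for this illustration.

```python
import pandas as pd

# the six counts returned by number_tracks() above
data = [['Moscow', 15740, 11056, 15945],
        ['Saint-Petersburg', 5614, 7003, 5895]]
columns = ['city', 'monday', 'wednesday', 'friday']
info = pd.DataFrame(data=data, columns=columns).set_index('city')

# divide every row by its own total to get the share of each day within a city
shares = info.div(info.sum(axis=1), axis=0).round(3)
print(shares)
```

In these shares, Wednesday accounts for roughly a quarter of Moscow's plays but well over a third of St. Petersburg's, which matches the conclusion above.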

### Music at the beginning and end of the week

According to the second hypothesis, some genres prevail in Moscow on Monday morning and others in St. Petersburg. Similarly, on Friday evening, different genres prevail depending on the city.

Let's save the tables with data in two variables:
* for Moscow, in `moscow_general`;
* for St. Petersburg, in `spb_general`.

In [64]:
moscow_general = df[df['city'] == 'Moscow'] # getting the moscow_general table from those rows of the df table,
# for which the value in the 'city' column is 'Moscow'

In [65]:
spb_general = df[df['city'] == 'Saint-Petersburg'] 
# getting the spb_general table from those rows of the df table,
# for which the value in the 'city' column is 'Saint-Petersburg'

Let's create a function `genre_weekday()` with four parameters:
* a table (dataframe) with data;
* the day of the week;
* the start timestamp in the format 'hh:mm';
* the end timestamp in the format 'hh:mm'.

The function should return the top 10 genres of the tracks that were listened to on the specified day, in the interval between the two timestamps.

In [66]:
def genre_weekday(df, day, time1, time2):
    # sequential filtering
    # keep in genre_df only the rows of df whose day equals day
    genre_df = df[df['day'] == day]
    # keep only the rows of genre_df whose time is earlier than time2
    genre_df = genre_df[genre_df['time'] < time2]
    # keep only the rows of genre_df whose time is later than time1
    genre_df = genre_df[genre_df['time'] > time1]
    # group by genre and count the rows for each genre with count()
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    # sort in descending order so that the most popular genres come first
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    # return a Series with the 10 most popular genres for the given day and time window
    return genre_df_sorted[:10]
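The `time` column holds strings, so the filter above compares text, not clock times. This works because the values are zero-padded (`hh:mm:ss`), which makes alphabetical order agree with chronological order. The sketch below illustrates this in plain Python; `in_window` is a hypothetical helper that mirrors the two strict comparisons.

```python
def in_window(time_str, start, end):
    """Hypothetical helper: the same strict string comparison as the filter."""
    return start < time_str < end

# a Monday-morning play from the sample rows
inside = in_window('08:34:34', '07:00', '11:00')

# '07:00:00' starts with '07:00' and is longer, so it sorts after it
left_edge = in_window('07:00:00', '07:00', '11:00')

# by the same rule '11:00:00' sorts after '11:00' and falls outside
right_edge = in_window('11:00:00', '07:00', '11:00')

# without zero padding the text order breaks: '9' sorts after '1'
unpadded = in_window('9:17:40', '07:00', '11:00')

print(inside, left_edge, right_edge, unpadded)
```

So the window effectively includes its start second and excludes its end, and the approach relies on every timestamp being zero-padded. Converting the column with `pd.to_datetime` would remove that dependence, at the cost of an extra preprocessing step.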

Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [None]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')
# function call for Monday morning in Moscow (the moscow_general table is passed as df)

| genre | count |
| :--- | :--- |
| pop | 781 |
| dance | 549 |
| electronic | 480 |
| rock | 474 |
| hiphop | 286 |
| ruspop | 186 |
| world | 181 |
| rusrap | 175 |
| alternative | 164 |
| unknown | 161 |

In [None]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00') 
# function call for Monday morning in St. Petersburg (the spb_general table is passed as df)

| genre | count |
| :--- | :--- |
| pop | 218 |
| dance | 182 |
| rock | 162 |
| electronic | 147 |
| hiphop | 80 |
| ruspop | 64 |
| alternative | 58 |
| rusrap | 55 |
| jazz | 44 |
| classical | 40 |

In [None]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00') 
# function call for Friday evening in Moscow

| genre | count |
| :--- | :--- |
| pop | 713 |
| rock | 517 |
| dance | 495 |
| electronic | 482 |
| hiphop | 273 |
| world | 208 |
| ruspop | 170 |
| alternative | 163 |
| classical | 163 |
| rusrap | 142 |

In [None]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00') 
# function call for Friday evening in St. Petersburg

| genre | count |
| :--- | :--- |
| pop | 256 |
| electronic | 216 |
| rock | 216 |
| dance | 210 |
| hiphop | 97 |
| alternative | 63 |
| jazz | 61 |
| classical | 60 |
| rusrap | 59 |
| world | 54 |

**Conclusions**

Comparing the top 10 genres on Monday morning, we can draw the following conclusions:

1. Moscow and St. Petersburg listen to similar music. The only difference is that the “world” genre made the Moscow top 10, while jazz and classical music made the St. Petersburg top 10.

2. In Moscow, there were so many missing values that `unknown` took tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the study.

Friday evening doesn't change that picture. Some genres rise a little higher, others go down, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not too pronounced. In Moscow, they listen to Russian popular music more often; in St. Petersburg, jazz.

However, missing values in the data cast doubt on this result. There are so many of them in Moscow that the top 10 could look different if the genre data had not been lost.

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to there more often than in Moscow. Moscow is a city of contrasts, in which pop music nevertheless prevails.

Let's group the `moscow_general` table by genre and count the plays of each genre with the `count()` method. Then we sort the result in descending order and save it in `moscow_genres`.

In [71]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False) 
# in one line: grouping the moscow_general table by the 'genre' column,
# counting the number of 'genre' values in this grouping by the count() method,
# sorting the resulting Series in descending order and saving to moscow_genres

Let's display the first ten rows of `moscow_genres`:

In [None]:
moscow_genres.head(10) # viewing the first 10 rows of moscow_genres

| genre | count |
| :--- | :--- |
| pop | 5892 |
| dance | 4435 |
| rock | 3965 |
| electronic | 3786 |
| hiphop | 2096 |
| classical | 1616 |
| world | 1432 |
| alternative | 1379 |
| ruspop | 1372 |
| rusrap | 1161 |

Now let's repeat the same for St. Petersburg.

Let's group the `spb_general` table by genre and count the plays of each genre. Then we sort the result in descending order and save it in `spb_genres`:


In [73]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False) 
# in one line: grouping the spb_general table by the 'genre' column,
# counting the number of 'genre' values in this grouping by the count() method,
# sorting the resulting Series in descending order and saving to spb_genres

Let's display the first ten rows of `spb_genres`:

In [None]:
spb_genres.head(10) # viewing the first 10 rows of spb_genres

| genre | count |
| :--- | :--- |
| pop | 2431 |
| dance | 1932 |
| rock | 1879 |
| electronic | 1736 |
| hiphop | 960 |
| alternative | 649 |
| classical | 646 |
| rusrap | 564 |
| ruspop | 538 |
| world | 515 |
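Raw counts are hard to compare because Moscow has more than twice as many plays. A rough way to put the two cities on the same scale is to divide each genre count by the city's total number of plays. The sketch below copies a few counts from the two top-10 tables and uses the per-city totals from the first hypothesis as denominators; treating those totals as the right denominators is an assumption of this illustration.

```python
import pandas as pd

# total plays per city, from the grouping by city earlier in the study
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# a few genre counts copied from the two top-10 tables above
genre_counts = pd.DataFrame(
    {'Moscow': [5892, 2096, 1161], 'Saint-Petersburg': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# dividing a DataFrame by a Series aligns the Series index with the columns
genre_shares = (genre_counts / totals * 100).round(2)
print(genre_shares)
```

On the full tables, `value_counts(normalize=True)` on the `genre` column of each city table gives the same kind of shares directly. In this sketch the share of `rusrap` comes out at roughly 3% in both cities, with St. Petersburg only slightly ahead.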

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also contains a closely related genre, Russian popular music.
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg.

## Results of the study

We tested three hypotheses and established:

1. The day of the week affects user activity differently in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week, whether in Moscow or in St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow, they listen to music of the “world” genre;
* in St. Petersburg, jazz and classical music.

Thus, the second hypothesis was only partially confirmed. This result could have been different if not for the missing values in the data.

3. The tastes of users in Moscow and St. Petersburg have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable among the bulk of users.