# Yandex Music (Project 1)

Using the Yandex Music dataset, we will compare the behavior of users in two cities - Moscow and Saint-Petersburg.

**The purpose of this research is to test the following three hypotheses:**

1. User activity depends on the day of the week. Furthermore, in Moscow and St. Petersburg it manifests in different forms.
2. Certain music genres tend to play more often in Moscow in the morning, while other genres tend to play more often in St. Petersburg. Similarly, depending on the city, Friday evenings are dominated by different genre plays.
3. Moscow and St. Petersburg users prefer different genres of music. In Moscow, users listen to pop music more frequently, and in St. Petersburg, users listen to rap music.

**Research Progress:**

The user behavior data comes from the `yandex_music_project.csv` file. The quality of the data is unknown, so it needs to be reviewed before we test the hypotheses.

We are going to check the data for errors, and assess their impact on the study. Then, in the pre-processing phase, we will seek opportunities to correct the most critical data errors.
 
Thus, the case study will proceed in three stages:
 1. Data review.
 2. Data preprocessing.
 3. Hypothesis testing.



## Data overview
Get a first impression of the Yandex Music dataset.




**Task 1**

The main analytics tool is `pandas`. Let's import this library.

In [1]:
# import pandas library
import pandas as pd

**Task 2**

Read the `yandex_music_project.csv` file from the `/datasets` folder and save it in the `df` variable:

In [2]:
# reading data from dataset and saving it to variable df
df = pd.read_csv('yandex_music_project.csv')

**Task 3**


Display the first ten rows of the table:

In [3]:
# getting first ten rows of df
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


**Task 4**


Get general information about the table using the `info()` method in one command:

In [4]:
# getting general information about the data in the df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


According to the data documentation:
* `userID` - user ID;
* `Track` - track name;
* `artist` - artist name;
* `genre` - the name of the genre;
* `City` - user's city;
* `time` - listening start time;
* `Day` - day of the week.

The number of values in the columns varies. Therefore, there are missing values in the data.

**Task 5**

**Free-Form Question**

There are some style violations in the column names:
* Lowercase letters are combined with uppercase letters.
* There are spaces.

What is the third violation?

In [5]:
# The userID column should be written in snakecase as user_id

**Conclusions**

Each row of the table contains data about one track that a user listened to. Some columns describe the composition itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the track.

At first glance, the data are sufficient to test the hypotheses. However, there are gaps in the data, and the column names deviate from good style.

To move forward, we need to resolve problems in the data.

## Data preprocessing
Correct the style in the column headings, eliminate gaps. Then check the data for duplicates.

### Headers style

**Task 6**

Display the column names:

In [6]:
# column names of our data frame
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

**Task 7**


Let's bring the column names in line with good style:
* write multi-word names in snake_case,
* make all characters lowercase,
* eliminate spaces.

To do this, let's rename the columns:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [7]:
#  columns renaming
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})

**Task 8**


Let's check the result. To do this, display the column names again:

In [8]:
# displaying column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
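The renaming above lists each column by hand. As a rough sketch of the same idea applied automatically, the hypothetical helper `to_snake_case()` below (not part of the original notebook) trims spaces, splits camelCase at lowercase-to-uppercase boundaries, and lowercases the result. It is only meant for names shaped like the ones in this table.

```python
import re

def to_snake_case(name):
    """Hypothetical helper: trim spaces, split camelCase, lowercase."""
    trimmed = name.strip()
    # insert an underscore at every lowercase/digit -> uppercase boundary
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', trimmed)
    return with_underscores.lower()

# the raw column names shown in the output of Task 6
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(name) for name in raw_columns]
print(clean_columns)
```

With a dataframe at hand, the same list comprehension could be assigned to `df.columns`.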

### Missing values

**Task 9**

First, we count how many missing values are in the table. Two `pandas` methods are enough:

In [9]:
# counting missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. Gaps in `track` and `artist` are not important for our work; it is enough to replace them with an explicit marker.

Gaps in `genre`, however, can interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, it would be correct to determine the cause of the missing values and restore the data. This option is not available in our research, so we need to:

* fill the missing values with an explicit marker;
* estimate how much they will affect the calculations.
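One way to estimate the impact of the gaps is to look at their share rather than their absolute count, overall and per city. The sketch below uses a small made-up table (`sample` is an assumption, not the real dataset) to show the idea.

```python
import pandas as pd

# toy data standing in for the real table (values are made up)
sample = pd.DataFrame({
    'track': ['a', None, 'c', 'd'],
    'genre': ['pop', 'rock', None, None],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow'],
})

# share of missing values in each column
missing_share = sample.isna().mean()

# share of missing genres within each city
missing_genre_by_city = sample['genre'].isna().groupby(sample['city']).mean()

print(missing_share)
print(missing_genre_by_city)
```

For the counts reported above, 1198 of 65079 rows lack a genre, which is roughly 1.8% of the table.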


**Task 10**

We are going to replace the missing values in the `track`, `artist` and `genre` columns using the string `'unknown'`. To do this, we create a `columns_to_replace` list, iterate through its elements with a `for` loop, and for each column, replace the missing values:

In [10]:
# loop through column names and replace missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')
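The loop is what the task asks for. As an alternative sketch, `fillna()` can also be applied to several columns at once by selecting them with a list; the toy `sample` table below is an assumption used only for illustration.

```python
import pandas as pd

# toy data with gaps in the three columns of interest
sample = pd.DataFrame({
    'track': ['a', None],
    'artist': [None, 'b'],
    'genre': ['pop', None],
    'city': ['Moscow', 'Moscow'],
})

columns_to_replace = ['track', 'artist', 'genre']

# fill the gaps in all selected columns in one statement
sample[columns_to_replace] = sample[columns_to_replace].fillna('unknown')
print(sample)
```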

**Task 11**

We make sure there are no missing values in the table. To do this, count the missing values again.

In [11]:
# counting missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates

**Task 12**

Count the explicit duplicates in the table using one command:

In [12]:
# duplicates count
df.duplicated().sum()

3826

**Task 13**

Call the dedicated `pandas` method to remove the explicit duplicates:

In [13]:
# removing duplicates
df = df.drop_duplicates().reset_index(drop=True)

**Task 14**

Once again, we count the explicit duplicates in the table to make sure we got rid of them completely:

In [14]:
# counting duplicates after removing them
df.duplicated().sum()

0
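To make the behavior of these two methods concrete, here is a minimal sketch on a made-up table (`sample` is an assumption). With default settings, `duplicated()` marks the second and later copies of a row, and `drop_duplicates()` keeps the first copy; `reset_index(drop=True)` then renumbers the remaining rows.

```python
import pandas as pd

# toy table: rows 1 and 3 repeat row 0
sample = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['x', 'x', 'y', 'x'],
})

# True for every row that repeats an earlier row
dup_mask = sample.duplicated()

# keep the first copy of each row and renumber the index
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(dup_mask.sum())
print(deduplicated)
```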

Now we get rid of the implicit duplicates in the `genre` column. For example, the name of the same genre may be spelled slightly differently in different rows. Such errors would also affect the result of the study.

**Task 15**

We are going to display the list of unique genre names sorted alphabetically. To do this:
1. extract the desired dataframe column;
2. apply a sort method to it;
3. on the sorted column, call a method that returns its unique values.

In [15]:
# viewing genre column unique values
df['genre'].sort_values(ascending=True).unique()


array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Task 16**

Next, we look through the list and search for implicit duplicates of the name `hiphop`. These may be misspelled or alternative names of the same genre.

The list contains the following implicit duplicates:
* *hip*,
* *hop*,
* *hip-hop*.

To clear them from the table, let's use the `replace()` method with two arguments: a list of duplicate strings (*hip*, *hop*, and *hip-hop*) and a string with the correct value. We apply it to the `genre` column of the `df` table so that every `hip`, `hop` and `hip-hop` value becomes `hiphop`:

In [16]:
#  removing implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
correction = 'hiphop'
df['genre'] = df['genre'].replace(duplicates, correction)

**Task 17**

We have to check that all the wrong names were replaced:

* hip,
* hop,
* hip-hop.

Print a sorted list of unique values in the `genre` column:

In [17]:
# checking the list once again
df['genre'].sort_values(ascending=True).unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
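Since the printed list of genres is long, a programmatic check may be easier to read than scanning the array by eye. The sketch below runs on a made-up `genres` Series (an assumption standing in for `df['genre']`) and counts how many of the wrong names are left after the replacement.

```python
import pandas as pd

# toy Series standing in for the genre column
genres = pd.Series(['hip', 'hiphop', 'hop', 'rock', 'hip-hop'])

duplicates = ['hip', 'hop', 'hip-hop']
genres = genres.replace(duplicates, 'hiphop')

# number of rows that still hold one of the wrong names
leftover = genres.isin(duplicates).sum()
print(leftover)
```

A result of zero means that none of the listed spellings remain.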

**Conclusions**

During preprocessing of data we found the following problems in our initial dataset:

- Header style issues,
- Some missing values,
- Duplicates - explicit and implicit.

We were able to fix headers to make the table easier to work with. Moreover, without the duplicates our study will become more accurate.

We have also replaced the missing values with `'unknown'`. It remains unclear whether the missing values in the `genre` column will affect the accuracy of our study.

Now we can proceed with hypothesis testing. 

## Testing the hypotheses

### Comparison of user behavior in two capitals

The first hypothesis states that users listen to music in a different way in Moscow and St. Petersburg. Let's check this assumption against the data on the three days of the week - Monday, Wednesday and Friday. For this we will:

* Separate users of Moscow and St. Petersburg.
* Compare how many tracks each user group listened to on Monday, Wednesday and Friday in each city.

**Task 18**

First we perform each of the calculations separately.

To estimate user activity in each city, we group the data by city and count the plays in each group.



In [18]:
#  listen counts by city
df.groupby('city')['user_id'].count()


city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more plays in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; most likely there are simply more users in Moscow.
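To separate "more plays" from "more users", the play count can be divided by the number of distinct users in each city. The sketch below shows the calculation on a made-up table (`sample` is an assumption); the notebook itself does not run this check.

```python
import pandas as pd

# toy data: two users in Moscow, one very active user in Saint-Petersburg
sample = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'C', 'C'],
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
})

# plays and distinct users per city
per_city = sample.groupby('city')['user_id'].agg(plays='count', users='nunique')

# average number of plays per user
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```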

**Task 19**

Now we group the data by day of the week and count the plays on Monday, Wednesday, and Friday. Note that the data contains plays for these three days only.


In [19]:
# plays counting for each of three days
df.groupby('day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

Taken together, users from the two cities are less active on Wednesday. If we consider each city separately, the picture may change.

**Task 20**


We have seen how grouping by city and by day of the week works. Now let's write a function that will combine these two calculations.

Let's create a `number_tracks()` function that counts the plays for a given day and city. The function needs two parameters:
* day of the week,
* name of the city.

Inside the function, keep only the rows of the source table for which:
  * the `day` column is equal to the `day` parameter,
  * the `city` column is equal to the `city` parameter.

To do this, apply sequential filtering with logical indexing (or a compound logical expression in one line).

Then count the values in the `user_id` column of the resulting table, save the result to a new variable, and return this variable from the function.

In [20]:
# <creating function number_tracks()>

def number_tracks(day, city):
    # keep only the rows of df whose 'day' column equals the day parameter
    track_list = df[df['day'] == day]
    # of those rows, keep only the ones whose 'city' column equals the city parameter
    # (sequential filtering with logical indexing)
    track_list = track_list[track_list['city'] == city]
    # count the values in the 'user_id' column of the filtered table
    track_list_count = track_list['user_id'].count()
    # return the resulting number of plays
    return track_list_count


**Task 21**

We will call the `number_tracks()` function six times in total, changing the arguments to get the data for each city on each of the three days.

In [21]:
# play counts for Moscow on Monday
number_tracks('Monday', 'Moscow')

15740

In [22]:
# play counts for Saint-Petersburg on Monday
number_tracks('Monday', 'Saint-Petersburg')

5614

In [23]:
# play counts for Moscow on Wednesday
number_tracks('Wednesday', 'Moscow')

11056

In [24]:
# play counts for Saint-Petersburg on Wednesday
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [25]:
# play counts for Moscow on Friday
number_tracks('Friday', 'Moscow')

15945

In [26]:
# play counts for Saint-Petersburg on Friday
number_tracks('Friday', 'Saint-Petersburg')

5895

**Task 22**

Let's create a table using `pd.DataFrame` constructor, where:
* column names - `['city', 'monday', 'wednesday', 'friday']`;
* data is the results we got with `number_tracks`.

In [27]:
# result table

data = [['Moscow', 15740, 11056, 15945],
        ['Saint-Petersburg', 5614, 7003, 5895]] 
columns = ['city', 'monday', 'wednesday', 'friday'] 


results = pd.DataFrame(data = data, columns = columns)
display(results)

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
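The table above was typed in by hand from the six function calls. As a sketch of an alternative, `pd.crosstab()` can build the same kind of city-by-day table directly from the rows. The `sample` table below is made up for illustration, so its numbers differ from the real results.

```python
import pandas as pd

# toy data standing in for the cleaned table
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
    'user_id': ['A', 'B', 'C', 'D', 'E'],
})

# count rows for every combination of city and day
plays = pd.crosstab(sample['city'], sample['day'])

# put the days in calendar order; combinations with no rows stay at 0
plays = plays.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(plays)
```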


**Conclusions**

Data shows the following difference in user behavior:

- In Moscow, the peak of listening falls on Monday and Friday, and on Wednesday there is a noticeable downturn.
- In St. Petersburg, on the contrary, users listen to music more on Wednesday. Activity on Monday and Friday is almost equal, and on both days it is lower than on Wednesday.

Therefore, our data analysis supports the first hypothesis, that users listen to music in Moscow and St. Petersburg differently.
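Because Moscow has many more plays overall, the two cities are easier to compare when each day is expressed as a share of that city's total. The sketch below reuses the six counts from the result table; the variable name `day_share` is an assumption introduced here.

```python
import pandas as pd

# play counts taken from the result table above
results = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# divide every row by its own total to get per-city day shares
day_share = results.div(results.sum(axis=1), axis=0)
print(day_share.round(3))
```

In these shares, Wednesday is the weakest day in Moscow and the strongest day in St. Petersburg, which matches the conclusion above.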

### Music at the beginning and end of the week

According to the second hypothesis, on Monday morning certain genres predominate in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.

**Task 23**

We will save the data for each city in a separate variable:
* for Moscow - in `moscow_general`;
* for St. Petersburg - in `spb_general`.

In [28]:
# getting the moscow_general table from those rows of the df table,
# for which the value in the 'city' column is 'Moscow'
moscow_general = df[df['city'] == 'Moscow']

In [29]:
# getting the spb_general table from those rows of the df table,
# for which the value in the 'city' column is 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']

**Task 24**

We are going to create a function `genre_weekday()` with four parameters:
* table (dataframe) with data,
* day of the week,
* initial timestamp in 'hh:mm' format,
* last timestamp in 'hh:mm' format.

The function should return information about the top 10 genres of those tracks that were listened to on the specified day, in the interval between two timestamps.

In [30]:
# Declare the function genre_weekday() with parameters df (the table), day, time1, time2.
# It returns the most popular genres on the specified day within the
# given time interval:
# 1) save in genre_df the rows of the passed dataframe for which,
#    at the same time:
#    - the value in the day column equals the day argument
#    - the value in the time column is greater than the time1 argument
#    - the value in the time column is less than the time2 argument
#    Use sequential filtering with boolean indexing.
# 2) group genre_df by the genre column, take one of its columns and use
#    the count() method to count the entries for each genre; store the
#    resulting Series in genre_df_count
# 3) sort genre_df_count in descending order and save the result in
#    genre_df_sorted
# 4) return the first 10 values of genre_df_sorted: the top 10 genres
#    on the specified day in the specified time interval

def genre_weekday(df, day, time1, time2):
    # sequential filtering
    # leave in genre_df only those df lines whose day is equal to day
    genre_df = df[df['day'] == day]
    # leave in genre_df only those genre_df lines whose time is less than time2
    genre_df = genre_df[genre_df['time'] < time2]
    # leave in genre_df only those genre_df lines whose time is greater than time1
    genre_df = genre_df[genre_df['time'] > time1]
    # group the filtered dataframe by the column with genre names, take the genre column and count the number of rows for each genre using the count() method
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    # sort the result in descending order (so that the most popular genres are at the beginning of the Series)
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    # return a Series with the 10 most popular genres in the specified time interval of the specified day
    return genre_df_sorted[:10]
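The function compares the `time` column with `time1` and `time2` as plain strings. This works because the times are zero-padded (`'hh:mm:ss'`), so alphabetical order coincides with chronological order. The sketch below illustrates this with a hypothetical helper `in_window()` that mirrors the two strict comparisons; it is not part of the notebook.

```python
def in_window(time_str, start, end):
    """Hypothetical helper mirroring the strict string comparisons above."""
    return start < time_str < end

# zero-padded time inside the window
print(in_window('08:37:09', '07:00', '11:00'))

# a time without zero padding is compared character by character
# and falls outside the window even though 8:37 is between 7:00 and 11:00
print(in_window('8:37:09', '07:00', '11:00'))

# boundaries: '07:00:00' sorts after its prefix '07:00',
# while '11:00:00' does not sort before '11:00'
print(in_window('07:00:00', '07:00', '11:00'))
print(in_window('11:00:00', '07:00', '11:00'))
```

Under this comparison the window effectively includes 07:00:00 and excludes 11:00:00.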

**Task 25**


Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [31]:
# function call for Monday morning in Moscow (instead of df - moscow_general table)
# time objects are strings and are compared as strings
# call example: genre_weekday(moscow_general, 'Monday', '07:00', '11:00')
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [32]:
# function call for Monday morning in St. Petersburg (instead of df - spb_general table)
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [33]:
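# function call for Friday evening in Moscow
# (the day argument must be 'Friday'; the output recorded below came from a run that passed 'Monday')
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')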

genre
pop            717
dance          524
rock           518
electronic     485
hiphop         238
alternative    182
classical      172
world          172
ruspop         149
rusrap         133
Name: genre, dtype: int64

In [34]:
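# function call for Friday evening in St. Petersburg
# (the day argument must be 'Friday'; the output recorded below came from a run that passed 'Monday')
genre_weekday(spb_general, 'Friday', '17:00', '23:00')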

genre
pop            263
rock           208
electronic     192
dance          191
hiphop         104
alternative     72
classical       71
jazz            57
rusrap          54
ruspop          53
Name: genre, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. In Moscow and St. Petersburg users listen to similar music. The only difference is that the Moscow ranking includes the "world" genre, while the St. Petersburg ranking includes jazz and classical.

2. There were so many missing values in Moscow that the value `'unknown'` took tenth place among the most popular genres. This means that missing values make up a noticeable share of the data and jeopardize the reliability of the study.

Friday night does not change this picture. Some genres rise a little higher, others go down, but overall the top 10 stays the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, the top 10 includes the "world" genre, while in St. Petersburg it includes jazz and classical music.

However, gaps in the data cast doubt on this result. Many of the missing values are in Moscow, so the top 10 ranking could have been different if the genre data had not been lost.
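The Monday-morning comparison can be reproduced with set operations on the two top-10 lists. The sketch below copies the genre names from the two Monday-morning outputs above; the variable names are assumptions introduced for illustration.

```python
# top-10 genres on Monday morning, copied from the outputs above
moscow_top = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
              'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
           'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# genres present in both rankings and in only one of them
shared = set(moscow_top) & set(spb_top)
only_moscow = sorted(set(moscow_top) - set(spb_top))
only_spb = sorted(set(spb_top) - set(moscow_top))

print(len(shared), only_moscow, only_spb)
```

Eight of the ten genres coincide; apart from `'unknown'`, the lists differ only in "world" on the Moscow side and jazz and classical on the St. Petersburg side.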

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to more often there than in Moscow. Moscow is a city of contrasts that is nevertheless dominated by pop music.

**Task 26**

We group the `moscow_general` table by genre. Then we count the number of plays for each genre, sort the result in descending order, and save it in `moscow_genres`.

In [35]:
# in one line: group moscow_general table by 'genre' column,
# counting the number of 'genre' values in this grouping using the count() method,
# sort the resulting Series in descending order and store it in moscow_genres

moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

**Task 27**

Display first 10 rows of `moscow_genres`:

In [36]:
# displaying first 10 rows of moscow_genres
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

**Task 28**


Now we do the same for St. Petersburg.

We group the `spb_general` table by genre. Count the number of plays for each genre. Sort the result in descending order and save it in a table `spb_genres`:


In [37]:
# in one line: group spb_general table by 'genre' column,
# counting the number of 'genre' values in this grouping using the count() method,
# sort the resulting Series in descending order and store it in spb_genres
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

**Task 29**

Display the first 10 rows of `spb_genres`:

In [38]:
#  the first 10 rows of spb_genres
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also contains a closely related genre, Russian popular music (`ruspop`).
* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg.
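Raw play counts are hard to compare between cities of different size, so it can help to convert them to shares of each city's total. The sketch below uses the per-city totals and the genre counts printed earlier in this notebook; the dictionary names are assumptions introduced here.

```python
# total plays per city (from the grouping by city above)
city_totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}

# selected genre counts (from the top-10 outputs above)
genre_plays = {
    'Moscow': {'pop': 5892, 'hiphop': 2096, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'hiphop': 960, 'rusrap': 564},
}

# share of each genre in the city's total plays
genre_share = {
    city: {genre: count / city_totals[city] for genre, count in counts.items()}
    for city, counts in genre_plays.items()
}

for city, shares in genre_share.items():
    print(city, {genre: round(share, 3) for genre, share in shares.items()})
```

The shares of `hiphop` and `rusrap` differ between the cities by well under one percentage point, which is in line with the conclusion that rap is about equally popular in both.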


## Research summary

We evaluated three hypotheses and found the following:

1. The day of the week has a different effect on the activity of users in Moscow and St. Petersburg.

Thus, the first hypothesis was fully confirmed.

2. Musical preferences do not vary much during the week, either in Moscow or in St. Petersburg. Small differences are noticeable at the beginning of the week, on Monday mornings:
* in Moscow, users listen to “world” genre music;
* in St. Petersburg, jazz and classical music.

Thus, the second hypothesis was confirmed partially. These results could have been different if missing values in the data were not present.

3. The tastes of users in Moscow and St. Petersburg have more similarities than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they do not show up among the most popular genres.

**In practice, studies contain tests of statistical hypotheses.**

From the data of one service, it is not always possible to draw conclusions about all the inhabitants of a city.
Tests of statistical hypotheses show how reliable such conclusions are, given the available data.
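As an illustration of what such a test might look like for the first hypothesis, the sketch below computes a chi-square statistic of independence between city and day of the week from the six play counts reported earlier. This is only a sketch: it treats every play as an independent observation, which is an assumption that does not hold when one user contributes many plays, so the result likely overstates the evidence.

```python
# observed plays per city on Monday, Wednesday, Friday (from the result table)
observed = {
    'Moscow': [15740, 11056, 15945],
    'Saint-Petersburg': [5614, 7003, 5895],
}

row_totals = {city: sum(counts) for city, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# sum of (observed - expected)^2 / expected over all six cells
chi2 = 0.0
for city, counts in observed.items():
    for col_total, count in zip(col_totals, counts):
        expected = row_totals[city] * col_total / grand_total
        chi2 += (count - expected) ** 2 / expected

# chi-square critical value for 2 degrees of freedom at the 5% level
CRITICAL_5PCT_DF2 = 5.991
print(round(chi2, 1), chi2 > CRITICAL_5PCT_DF2)
```

A 2 by 3 table has (2 - 1) * (3 - 1) = 2 degrees of freedom, which is why the statistic is compared with the critical value for two degrees of freedom.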