# Yandex.Music
## Analysis of user behavior in Moscow and St. Petersburg

**The purpose of the study** is to test three hypotheses:
1. User activity depends on the day of the week, and the pattern differs between Moscow and St. Petersburg.
2. On Monday morning, different genres prevail in Moscow than in St. Petersburg. The same holds on Friday evening.
3. Moscow and St. Petersburg prefer different genres of music: pop is listened to more often in Moscow, Russian rap in St. Petersburg.

**Course of the research:**

The data is in the file `yandex_music_project.csv`. Nothing is known about its quality, so the data must be reviewed before the hypotheses can be tested.
 
Thus, the study will proceed in three phases:
1. Data review
2. Data preprocessing
3. Hypothesis testing



## Data review

In [2]:
import pandas as pd

In [4]:
# data loading
df = pd.read_csv('/datasets/yandex_music_project.csv')
df.to_csv('yandex_music_project.csv', index=False)

In [None]:
# the first rows
print(df.head(10))

     userID                        Track            artist   genre  \
0  FFB692EC            Kamigata To Boots  The Mass Missile    rock   
1  55204538  Delayed Because of Accident  Andreas Rönnberg    rock   
2    20EC38            Funiculì funiculà       Mario Lanza     pop   
3  A3DD03C9        Dragons in the Sunset        Fire + Ice    folk   
4  E2DC1FAE                  Soul People        Space Echo   dance   
5  842029A1                    Преданная         IMPERVTOR  rusrap   
6  4CB90AA5                         True      Roman Messer   dance   
7  F03E1C1F             Feeling This Way   Polina Griffith   dance   
8  8FA1D3BE     И вновь продолжается бой               NaN  ruspop   
9  E772D5C0                    Pessimist               NaN   dance   

             City        time        Day  
0  Saint-Petersburg  20:28:33  Wednesday  
1            Moscow  14:07:09     Friday  
2  Saint-Petersburg  20:58:07  Wednesday  
3  Saint-Petersburg  08:37:09     Monday  
4            M

Each row of the table describes one listening event. Some columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the track.

In [4]:
# general info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in the table. The data type in all columns is `object`.

According to the data documentation:
* `userID` - user ID
* `Track` - track name  
* `artist` - artist name
* `genre` - genre name
* `City` - user's city
* `time` - start time of listening
* `Day` - day of the week

The non-null counts differ between columns, which means that some values are missing.

## Data preprocessing
First, let's fix the style of the column headers and handle the missing values. Then let's check the data for duplicates.

### Header style

In [6]:
# list of the column names
print(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


Let's rename the columns to lowercase snake_case without stray spaces, in line with PEP 8 naming conventions:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [None]:
df = df.rename(
    columns={
        '  userID': 'user_id',
        'Track': 'track',
        '  City  ': 'city',
        'Day': 'day'
    }
)

In [8]:
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
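When a table has many columns, the same renaming can be derived rather than typed by hand. The sketch below is one possible approach and is not part of the original notebook: `to_snake_case` is a hypothetical helper that trims whitespace, inserts an underscore before an uppercase letter that follows a lowercase letter or digit, and lowercases the result.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Hypothetical helper: '  userID' -> 'user_id', '  City  ' -> 'city'."""
    name = name.strip()
    # Insert '_' at a lowercase/digit -> uppercase boundary (userID -> user_ID)
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# Column names exactly as printed by df.columns above
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
demo = pd.DataFrame(columns=raw_columns)
demo.columns = [to_snake_case(column) for column in demo.columns]
print(list(demo.columns))
```

An explicit mapping, as used in this notebook, is easier to review. A rule-based helper is mainly useful when the number of columns makes manual renaming error-prone.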


### Missing values

In [9]:
# count of missing values
print(df.isna().sum())

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64


Not all missing values affect the study. The gaps in `track` and `artist` are not important for our work, so it is enough to replace them with explicit placeholders.

Missing values in `genre`, however, may interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, the right approach would be to find out why the values are missing and restore the data. That is not possible within this study project, so we will have to:
* fill in these missing values with explicit designations
* assess how much they will damage the calculations
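Overall, 1198 of the 65079 rows (about 1.8%) have no genre. To assess the damage for a city-by-city comparison, the share of missing genres can also be computed per city, before the gaps are filled. The sketch below uses hypothetical toy data rather than the project file, and the names `toy` and `missing_share` are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the project file)
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', None, 'rock', 'dance', None, 'jazz'],
})

# isna() gives True/False per row; the mean of booleans within each city
# is the share of rows with a missing genre
missing_share = toy['genre'].isna().groupby(toy['city']).mean()
print(missing_share)
```

This check has to run before `fillna('unknown')`. Afterwards the same share can be obtained by comparing `genre` with `'unknown'` instead of calling `isna()`.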

In [10]:
# replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [11]:
# check
print(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64


### Duplicates

In [12]:
# duplicate counting
print(df.duplicated().sum())

3826


In [15]:
# deletion of obvious duplicates
df = df.drop_duplicates()

In [16]:
# check
print(df.duplicated().sum())

0
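The 3826 exact duplicates are about 5.9% of the 65079 rows, so 61253 rows remain. After `drop_duplicates()` the index keeps gaps where rows were removed. A minimal sketch on hypothetical toy data, showing how the index could be renumbered if positional access matters later:

```python
import pandas as pd

# Toy data (hypothetical): the first two rows are exact duplicates
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'B2'],
    'track': ['True', 'True', 'Pessimist', 'Soul People'],
})

# Drop exact duplicates, then renumber the rows from 0 without keeping the old index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

The analysis in this notebook filters by column values only, so skipping the index reset does not change any of the counts.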


Now let us get rid of implicit duplicates in the `genre` column, i.e. cases where the name of the same genre is spelled slightly differently. Such errors would also distort the results of the study.

In [17]:
# unique genre names
print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 

The `genre` column contains implicit duplicates of the name `hiphop`:
* *hip*
* *hop*
* *hip-hop*

In [19]:
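# replacing implicit duplicates in the genre column only,
# so that matching track or artist names are left untouched
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')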
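The scope of the replacement matters: `DataFrame.replace` rewrites matching values in every column, while `Series.replace` touches only the selected one. The sketch below illustrates the difference on hypothetical toy data in which a track happens to be titled `hop`.

```python
import pandas as pd

# Toy data (hypothetical): a track title coincides with a misspelled genre
toy = pd.DataFrame({
    'track': ['hop', 'Soul People', 'True'],
    'genre': ['hip', 'hip-hop', 'dance'],
})
wrong_genres = ['hip', 'hop', 'hip-hop']

# Frame-wide replacement also rewrites the track title
whole_frame = toy.replace(wrong_genres, 'hiphop')

# Column-level replacement changes genre values only
genre_only = toy.copy()
genre_only['genre'] = genre_only['genre'].replace(wrong_genres, 'hiphop')

print(whole_frame)
print(genre_only)
```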

In [20]:
# check
print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 

**Summary:**

Preprocessing found three problems in the data:

- irregularities in header style,
- missing values,
- duplicates, both explicit and implicit.

We corrected the headings to make the table easier to work with. Without duplicates, the study will be more accurate. We replaced the missing values with `'unknown'`. 

Now we can move on to hypothesis testing.

## Hypothesis testing

### Comparison of user behavior of the two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Let's test this assumption using data on three days of the week - Monday, Wednesday and Friday.

In [22]:
# number of listening sessions per city
df.groupby('city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more listens in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; Moscow simply has more users.
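The claim about the number of users can be checked by counting distinct user IDs per city next to the number of plays. A minimal sketch on hypothetical toy data; the column names `plays`, `users` and `plays_per_user` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the project file)
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'C3', 'C3'],
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
})

# Total plays and number of distinct users per city
per_city = (
    toy.groupby('city')['user_id']
    .agg(['count', 'nunique'])
    .rename(columns={'count': 'plays', 'nunique': 'users'})
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```

On the real table, a similar `plays_per_user` value in both cities would support the explanation that the difference in totals comes from the size of the audience.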


In [23]:
# number of listening sessions per day
df.groupby('day')['day'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: day, dtype: int64

Taken together, users in the two cities are less active on Wednesdays. But the picture may change if we look at each city separately.

Let's create a function that combines these two calculations.

In [24]:
def number_tracks(day, city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()

    return track_list_count

In [25]:
# number of listening sessions in Moscow on Mondays
number_tracks('Monday', 'Moscow')

15740

In [26]:
# number of listening sessions in Saint-Petersburg on Mondays
number_tracks('Monday', 'Saint-Petersburg')

5614

In [27]:
# number of listening sessions in Moscow on Wednesdays
number_tracks('Wednesday', 'Moscow')

11056

In [28]:
# number of listening sessions in Saint-Petersburg on Wednesdays
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [29]:
# number of listening sessions in Moscow on Fridays
number_tracks('Friday', 'Moscow')

15945

In [30]:
# number of listening sessions in Saint-Petersburg on Fridays
number_tracks('Friday', 'Saint-Petersburg')

5895

In [31]:
# table with results
info = pd.DataFrame(
    data=[['Moscow', 15740, 11056, 15945], ['Saint-Petersburg', 5614, 7003, 5895]],
    columns=['city', 'monday', 'wednesday', 'friday']
)
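The same city-by-day table can also be built in a single step instead of six separate calls. The sketch below uses `pd.crosstab` on hypothetical toy data. It is an alternative shown for comparison, not the approach taken in this notebook.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the project file)
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Wednesday'],
})

# Rows are cities, columns are days, cells are numbers of plays
counts = pd.crosstab(toy['city'], toy['day'])

# crosstab sorts the columns alphabetically; put the days in calendar order
counts = counts.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(counts)
```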

**Summary**:

The data shows the difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, listening peaks on Wednesday. Activity on Monday and Friday is lower than on Wednesday, and by roughly the same amount.

So the data are in favor of the first hypothesis.
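Because Moscow has more than twice as many plays in total, the pattern is easier to compare as shares within each city. The sketch below converts the counts obtained above into row-wise proportions. The variable names `counts` and `shares` are illustrative.

```python
import pandas as pd

# Counts returned by number_tracks() above
counts = pd.DataFrame(
    data=[[15740, 11056, 15945], [5614, 7003, 5895]],
    index=['Moscow', 'Saint-Petersburg'],
    columns=['monday', 'wednesday', 'friday'],
)

# Divide every row by its own total to get each day's share within the city
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

Wednesday accounts for about 26% of the plays in Moscow and about 38% in St. Petersburg, which matches the conclusion drawn from the raw counts.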

### Music at the beginning and end of the week

According to the second hypothesis, on Monday morning in Moscow certain genres prevail, and in St. Petersburg other genres prevail. In the same way, different genres prevail on Friday evenings, depending on the city.

In [32]:
moscow_general = df[df['city'] == 'Moscow']

In [33]:
spb_general = df[df['city'] == 'Saint-Petersburg']

Let's create a `genre_weekday()` function with four parameters:
* a table (dataframe) with the data,
* day of the week,
* initial timestamp in the format 'hh:mm', 
* last timestamp in the format 'hh:mm'.

The function will return information about the top 10 genres of those tracks listened to on the specified day, between two time stamps.

In [36]:
def genre_weekday(df, day, time1, time2):
    genre_df = df[df['day'] == day]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending = False)
    
    return genre_df_sorted[:10]
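The function compares times as strings. This works because zero-padded `hh:mm:ss` values sort lexicographically in chronological order. The sketch below shows an equivalent formulation with a single boolean mask and `value_counts()`. `top_genres` is a hypothetical name, and the toy data is not taken from the project file.

```python
import pandas as pd


def top_genres(data, day, start, end, n=10):
    """Hypothetical variant of genre_weekday(): one mask plus value_counts()."""
    # String comparison is valid only for zero-padded 24-hour times
    mask = (data['day'] == day) & (data['time'] > start) & (data['time'] < end)
    # value_counts() returns genre frequencies sorted in descending order
    return data.loc[mask, 'genre'].value_counts().head(n)


toy = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['07:15:00', '08:30:10', '10:59:59', '11:20:00', '08:00:00'],
    'genre': ['pop', 'rock', 'pop', 'jazz', 'pop'],
})

result = top_genres(toy, 'Monday', '07:00', '11:00')
print(result)
```

The row at `11:20:00` and the Friday row fall outside the selection, so only `pop` and `rock` are counted.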

Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [37]:
# Monday morning in Moscow
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [38]:
# Monday morning in St. Petersburg
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [39]:
# Friday evening in Moscow
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [40]:
# Friday evening in St. Petersburg
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Findings**.

If we compare the top 10 genres on Monday morning, we can draw these conclusions:

1. People in Moscow and St. Petersburg listen to similar music. The only difference in genres is that the Moscow ranking includes "world", while the St. Petersburg ranking includes jazz and classical.

2. In Moscow there were so many missing values that `'unknown'` took the tenth place among the most popular genres. This means that the missing values take up a significant proportion of the data and threaten the validity of the study.

Friday evening doesn't change this picture. Some genres rise a little higher, others drop, but overall the top 10 remains largely the same.

Thus, the second hypothesis is only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and St. Petersburg is not very pronounced. Russian pop is listened to more often in Moscow, jazz in St. Petersburg.

However, the missing values cast doubt on this result. There are so many of them in Moscow that the top 10 could look different if the genre data were complete.

### Genre preferences in Moscow and St. Petersburg

**Hypothesis**: St. Petersburg is the capital of rap music, and music of this genre is listened to there more often than in Moscow. Moscow is a city of contrasts, in which, however, pop music prevails.

In [41]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [42]:
# the top 10 most popular genres in Moscow
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [43]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [44]:
# the top 10 most popular genres in St. Petersburg
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Summary**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. A closely related genre, Russian pop (`ruspop`), is also in the Moscow top 10.
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg: `hiphop` ranks fifth in both cities, and `rusrap` sits in the lower part of both top 10 lists.
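Raw counts are hard to compare across cities of different size, so the conclusion is easier to verify with shares. The sketch below divides the genre counts printed above by each city's total number of plays after deduplication (42741 and 18512, from the per-city count earlier in the notebook). Only three genres are included for brevity.

```python
import pandas as pd

# Total plays per city after deduplication (from the per-city count above)
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Genre counts copied from the two top-10 lists above
counts = pd.DataFrame(
    {'Moscow': [5892, 2096, 1161], 'Saint-Petersburg': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# Divide each city's column by that city's total
shares = counts.div(totals, axis='columns')
print((shares * 100).round(1))
```

Pop accounts for roughly 13.8% of plays in Moscow and 13.1% in St. Petersburg. For `rusrap` the shares are about 2.7% and 3.0%, and for `hiphop` about 4.9% and 5.2%. The differences are small in every case.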


## Study results

We tested three hypotheses and found that:

1. The day of the week has a different effect on user activity in Moscow and St. Petersburg. 

**The first hypothesis** is completely confirmed.

2. Music preferences do not change much during the week, whether in Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Monday mornings:
* in Moscow, the "world" genre makes the top 10,
* in St. Petersburg, jazz and classical do.

Thus, **the second hypothesis** was only partially confirmed. This result could have been different if the genre data had no missing values.

3. There are more similarities than differences in the tastes of Moscow and St. Petersburg users. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

**The third hypothesis** was only partially confirmed: pop does lead in Moscow, but rap is not noticeably more popular in St. Petersburg. If there are differences in preferences, they are unnoticeable for the bulk of users.