# Music of big cities

Yandex.Music user activity log analysis

**The purpose of the research** is to test three hypotheses:
1. User activity depends on the day of the week, and this dependence looks different in Moscow and St. Petersburg.
2. On Monday morning, some genres prevail in Moscow and others in St. Petersburg. Likewise, different genres dominate Friday evenings depending on the city.
3. Moscow and St. Petersburg prefer different genres of music: pop is played more often in Moscow, Russian rap in St. Petersburg.

## Content

1. [Introduction to data](#start)
2. [Data preprocessing](#preprocessing)
 - [Column headings style](#headers)
 - [Gaps](#missing)
 - [Duplicates](#duplicates)
3. [Hypothesis testing](#main)
 - [Comparison of two capitals user behavior](#behavior)
 - [Listening at the beginning and end of the week](#weeks)
 - [Genre preferences in Moscow and St. Petersburg](#genres)
4. [Research results](#final)

<a id="start"></a>
## Introduction to data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('yandex_music_project.csv')
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table has seven columns, and every column has the `object` data type.

According to the data documentation:
* `userID` — user ID;
* `Track` — track name;  
* `artist` — artist name;
* `genre` — genre name;
* `City` — user city;
* `time` — listening start time;
* `Day` — day of the week.

There are three style violations in the column headings:
1. Lowercase letters are mixed with uppercase.
2. Some names contain leading or trailing spaces.
3. `userID` is not written in snake_case (it should be `user_id`).

The number of values in the columns varies. This means there are missing values in the data.

**Conclusion**

Each row in `df` contains data about a track that was played. Some columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

Preliminarily, we can say that there is enough data to test the hypotheses. However, the data contains gaps, and the column names do not follow a consistent style.

To move forward, we need to fix these problems.

<a id="preprocessing"></a>
## Data preprocessing

<a id="headers"></a>
### Column headings style

Let's have a look at the column names

In [4]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Rename the column names as follows:

* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [5]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})

Check the result

In [6]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
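
The same renaming can also be done programmatically instead of listing every column by hand. The sketch below is only an illustration: the helper `to_snake_case()` is a hypothetical name, not part of the original notebook, and the toy frame reproduces only the headers shown above.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Strip surrounding spaces and convert a camelCase header to snake_case."""
    name = name.strip()
    # Insert an underscore where a lowercase letter or digit is followed by an uppercase letter.
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# Empty toy frame with the same headers as the raw log (hypothetical stand-in for df).
sample = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```

Passing a function to `rename(columns=...)` applies it to every header, so new columns with the same kind of problem would be handled without editing a mapping.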

<a id="missing"></a>
### Gaps

Let's count the number of gaps

In [7]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the research. The gaps in the `track` and `artist` columns are not important for this work; it is enough to replace them with an explicit placeholder.

Gaps in `genre` could interfere with the comparison of musical tastes in Moscow and St. Petersburg, but they make up less than 2% of all rows (1198 of 65079), so they are not critical.
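
The shares behind this judgment can be checked with simple arithmetic. The sketch below uses only the counts printed by `df.info()` and `df.isna().sum()` above; the variable names are illustrative.

```python
# Counts taken from the notebook output above.
total_rows = 65079
missing = {'track': 1231, 'artist': 7203, 'genre': 1198}

# Share of missing values per column, in percent.
# With the DataFrame at hand, df.isna().mean() * 100 yields the same kind of figures.
missing_share = {column: count / total_rows * 100 for column, count in missing.items()}

for column, share in missing_share.items():
    print(f'{column}: {share:.2f}%')
```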

Let us fill in the gaps with explicit notation.

In [8]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

Checking

In [9]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
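
As an alternative to the loop, `fillna()` also accepts a dictionary that maps column names to fill values. This is a minimal sketch on a made-up toy table, not the notebook's actual data.

```python
import pandas as pd

# Toy table with gaps in the same three columns (values are illustrative).
toy = pd.DataFrame({
    'track': ['Soul People', None],
    'artist': [None, 'Roman Messer'],
    'genre': ['dance', None],
    'city': ['Moscow', 'Moscow'],
})

# One placeholder per column; columns not listed are left untouched.
fill_values = {column: 'unknown' for column in ['track', 'artist', 'genre']}
toy = toy.fillna(fill_values)
print(toy)
```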

<a id="duplicates"></a>
### Duplicates

Counting explicit duplicates

In [10]:
df.duplicated().sum()

3826

That is about 6% of the rows, which is not critical. Let's delete them.

In [11]:
df = df.drop_duplicates().reset_index(drop=True)

Checking

In [12]:
df.duplicated().sum()

0
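
For illustration, here is how `duplicated()` and `drop_duplicates()` behave on a tiny made-up table, together with the duplicate share computed from the counts reported above (3826 of 65079 rows). The toy data and variable names are assumptions for the example.

```python
import pandas as pd

# Toy table: the second row repeats the first one exactly.
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'B2'],
    'track': ['True', 'True', 'True', 'Pessimist'],
    'day': ['Monday', 'Monday', 'Monday', 'Friday'],
})

# duplicated() marks every repeat of an earlier row; the first occurrence is not marked.
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)

# Share of explicit duplicates in the real log, from the numbers in the notebook.
log_share = 3826 / 65079 * 100
print(n_duplicates, len(deduplicated), f'{log_share:.1f}%')
```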

According to the research brief, we also need to check for implicit duplicates in the `genre` column.

In [13]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Let's examine the implicit duplicates of the `hiphop` genre.

The following variants were found:
* *hip*,
* *hop*,
* *hip-hop*.

Let's replace them with `hiphop` using the `replace_wrong_genres()` function.

In [14]:
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

Using the function

In [15]:
replace_wrong_genres(['hip', 'hop', 'hip-hop'], 'hiphop')

Checking

In [16]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
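
A possible variant of this step, sketched under the assumption that we prefer a function that does not rely on the global `df`: `Series.replace()` accepts a list of values, so the loop is not strictly needed. The name `replace_genre_variants()` and the toy data are hypothetical.

```python
import pandas as pd

def replace_genre_variants(table, wrong_genres, correct_genre):
    """Return a copy of the table with every listed genre variant replaced."""
    result = table.copy()
    # replace() matches whole cell values, so 'pop' is not affected by 'hop'.
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
cleaned = replace_genre_variants(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned['genre'].tolist())
```

Returning a new table makes the function easier to test on small examples, at the cost of copying the data.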

**Conclusions**

As a preprocessing result, three problems were found in the data:

- bad column headings style,
- gaps in data,
- duplicates - explicit and implicit.

We fixed the headings to make it easier to work with the table. With duplicates removed, the research will be more accurate.

We have replaced missing values with `'unknown'`. It remains to be seen whether the gaps in the `genre` column will harm the study.

Now we can move on to hypothesis testing.

<a id="main"></a>
## Hypothesis testing

<a id="behavior"></a>
### Comparison of two capitals user behavior

The first hypothesis states that users in Moscow and St. Petersburg listen to music differently. Let's check this assumption against the data on the three days of the week - Monday, Wednesday and Friday. For this we will:

* Separate users from Moscow and St. Petersburg
* Compare how many tracks each group of users listened to on Monday, Wednesday and Friday.

In [17]:
df.groupby('city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more plays in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; there are simply more users in Moscow.

Now let's group the data by day of the week and count the plays on Monday, Wednesday, and Friday.

In [18]:
df.groupby('day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

On average, users from the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

Let's create a `number_tracks()` function that will count the plays for a given day and city.

In [19]:
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count


Let's look at the data by day of the week and cities

In [20]:
print('Number of auditions in Moscow on Mondays:', number_tracks('Monday', 'Moscow'))

Number of auditions in Moscow on Mondays: 15740


In [21]:
print('Number of auditions in St. Petersburg on Mondays:', number_tracks('Monday', 'Saint-Petersburg'))

Number of auditions in St. Petersburg on Mondays: 5614


In [22]:
print('Number of auditions in Moscow on Wednesdays:', number_tracks('Wednesday', 'Moscow'))

Number of auditions in Moscow on Wednesdays: 11056


In [23]:
print('Number of auditions in St. Petersburg on Wednesdays:', number_tracks('Wednesday', 'Saint-Petersburg'))

Number of auditions in St. Petersburg on Wednesdays: 7003


In [24]:
print('Number of auditions in Moscow on Fridays:', number_tracks('Friday', 'Moscow'))

Number of auditions in Moscow on Fridays: 15945


In [25]:
print('Number of auditions in St. Petersburg on Fridays:', number_tracks('Friday', 'Saint-Petersburg'))

Number of auditions in St. Petersburg on Fridays: 5895
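
The six calls above could also be written as one loop over all day and city combinations. This is only a sketch: `count_plays()` is a hypothetical variant that takes the table as an argument, and the toy table is made up so that the example runs on its own.

```python
from itertools import product
import pandas as pd

def count_plays(table, day, city):
    """Count the rows of the table that match the given day and city."""
    mask = (table['day'] == day) & (table['city'] == city)
    return int(mask.sum())

# Toy table (illustrative values only).
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3', 'D4', 'E5'],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Monday', 'Monday', 'Friday', 'Wednesday'],
})

days = ['Monday', 'Wednesday', 'Friday']
cities = ['Moscow', 'Saint-Petersburg']
counts = {(day, city): count_plays(toy, day, city) for day, city in product(days, cities)}

for (day, city), n in counts.items():
    print(f'{city}, {day}: {n}')
```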


Let's create a pivot table for day and city statistics

In [26]:
day_city_count = pd.pivot_table(df, values='user_id', index='day', columns='city', aggfunc='count').sort_index()
day_city_count.columns = ['Moscow', 'St. Petersburg']
day_city_count

day,Moscow,St. Petersburg
Friday,15945,5895
Monday,15740,5614
Wednesday,11056,7003
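
Because Moscow has many more rows overall, it can help to compare each day's share within its own city rather than raw counts. The sketch below rebuilds the table from the numbers shown above; `day_city_share` is an illustrative name.

```python
import pandas as pd

# Counts copied from the pivot table above.
day_city_count = pd.DataFrame(
    {'Moscow': [15945, 15740, 11056], 'St. Petersburg': [5895, 5614, 7003]},
    index=pd.Index(['Friday', 'Monday', 'Wednesday'], name='day'),
)

# Divide every column by its own total to get each day's share of the city's plays, in percent.
day_city_share = day_city_count / day_city_count.sum() * 100
print(day_city_share.round(1))
```

In these shares, Wednesday is the weakest of the three days in Moscow and the strongest in St. Petersburg, which is the same pattern the raw counts show.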


**Conclusions**

The data shows a difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, users listen to music most on Wednesday. Activity on Monday and Friday is lower than on Wednesday by almost the same amount.

So the data supports the first hypothesis.

<a id="weeks"></a>
### Listening at the beginning and end of the week

According to the second hypothesis, on Monday morning certain genres predominate in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.

Let's save the data for each city in its own variable:
* Moscow - in `moscow_general`;
* St. Petersburg - in `spb_general`.

In [27]:
moscow_general = df[df['city'] == 'Moscow']

In [28]:
spb_general = df[df['city'] == 'Saint-Petersburg']

Let's create a `genre_weekday()` function with four parameters:
* table (df),
* day of the week,
* start timestamp in 'hh:mm' format,
* last timestamp in 'hh:mm' format.

The function should return information about the top 10 genres of those tracks that were listened to on the specified day, in the interval between two timestamps.

In [29]:
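def genre_weekday(table, day, time1, time2):
    # Keep only the rows for the requested day of the week.
    genre_df = table[table['day'] == day]
    # The sample rows show zero-padded 'hh:mm:ss' strings, so comparing them
    # as strings with 'hh:mm' bounds follows chronological order.
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]

    # Count plays per genre and return the ten most frequent genres.
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted.head(10)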
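
The time filter in `genre_weekday()` compares strings, not time objects. The short sketch below shows how such comparisons behave at the edges of the window; it assumes zero-padded `'hh:mm:ss'` values like those in the sample rows, and `in_window()` is a hypothetical helper written only for this illustration.

```python
def in_window(time_str, start, end):
    """Mirror the strict string comparisons used in genre_weekday()."""
    return start < time_str < end

checks = {
    # A time well inside the window passes.
    'inside': in_window('08:34:34', '07:00', '11:00'),
    # '07:00:00' sorts after its own prefix '07:00', so the lower edge is included.
    'lower_edge': in_window('07:00:00', '07:00', '11:00'),
    # '11:00:00' also sorts after '11:00', so the upper edge is excluded.
    'upper_edge': in_window('11:00:00', '07:00', '11:00'),
    # Without zero padding the comparison breaks: '9' sorts after '1'.
    'not_padded': in_window('9:15:00', '07:00', '11:00'),
}
print(checks)
```

If the times were not zero-padded, converting the column to a real time type before filtering would be the safer choice.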


Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 AM to 11:00 AM) and Friday evening (from 5:00 PM to 11:00 PM):

In [30]:
print('Top 10 listened genres for Monday morning in Moscow:')
print(genre_weekday(moscow_general, 'Monday', '07:00', '11:00'))

Top 10 listened genres for Monday morning in Moscow:
genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64


In [31]:
print('Top 10 listened genres for Monday morning in St. Petersburg:')
print(genre_weekday(spb_general, 'Monday', '07:00', '11:00'))

Top 10 listened genres for Monday morning in St. Petersburg:
genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64


In [32]:
print('Top 10 listened genres for Friday evening in Moscow:')
print(genre_weekday(moscow_general, 'Friday', '17:00', '23:00'))

Top 10 listened genres for Friday evening in Moscow:
genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64


In [33]:
print('Top 10 listened genres for Friday evening in St. Petersburg:')
print(genre_weekday(spb_general, 'Friday', '17:00', '23:00'))

Top 10 listened genres for Friday evening in St. Petersburg:
genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64


**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. Users in Moscow and St. Petersburg listen to similar music. The only difference in genres is that the Moscow top 10 includes "world", while the St. Petersburg top 10 includes "jazz" and "classical".

2. There were so many missing values in Moscow that the value `unknown` took tenth place among the most popular genres. This means that missing values occupy a significant share in the data and threaten the reliability of the study.

Friday evening statistics do not change this picture. Some genres rise a little, others fall, but overall the top 10 stays the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, users listen to Russian pop music more often; in St. Petersburg, jazz.

However, gaps in the data cast doubt on this result. There are so many of them in Moscow that the top 10 ranking could look different if it were not for the lost genre data.
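
The differences between the four top 10 lists can be checked with plain set operations. The lists below are copied from the outputs above; `only_in_first()` is an illustrative helper, not part of the original notebook.

```python
# Top 10 genres copied from the four outputs above.
monday_moscow = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
monday_spb = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']
friday_moscow = ['pop', 'rock', 'dance', 'electronic', 'hiphop',
                 'world', 'ruspop', 'alternative', 'classical', 'rusrap']
friday_spb = ['pop', 'electronic', 'rock', 'dance', 'hiphop',
              'alternative', 'jazz', 'classical', 'rusrap', 'world']

def only_in_first(first, second):
    """Genres present in the first top 10 but absent from the second."""
    return sorted(set(first) - set(second))

print('Monday, Moscow only:', only_in_first(monday_moscow, monday_spb))
print('Monday, St. Petersburg only:', only_in_first(monday_spb, monday_moscow))
print('Friday, Moscow only:', only_in_first(friday_moscow, friday_spb))
print('Friday, St. Petersburg only:', only_in_first(friday_spb, friday_moscow))
```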

<a id="genres"></a>
### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to there more often than in Moscow. Moscow is a city of contrasts which is nevertheless dominated by pop music.

Let's group the `moscow_general` table by genre and count the plays of each genre using the `count()` method. Then sort the result in descending order and store it in the `moscow_genres` variable.

In [34]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [35]:
print('Top 10 genre preferences in Moscow:')
print(moscow_genres.head(10))

Top 10 genre preferences in Moscow:
genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64


Now let's repeat the same for St. Petersburg.

In [36]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [37]:
print('Top 10 genre preferences in St. Petersburg:')
print(spb_genres.head(10))

Top 10 genre preferences in St. Petersburg:
genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64


**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also includes a closely related genre, Russian pop (`ruspop`).
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg.
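
Since the two cities differ in size, the comparison is easier to read as shares of each city's plays. The sketch below uses only the counts printed above (city totals after deduplication and the per-genre counts); the variable names are illustrative.

```python
# Totals per city and genre counts, copied from the outputs above.
city_totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}
genre_counts = {
    'Moscow': {'pop': 5892, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'rusrap': 564},
}

# Share of each genre among all plays in the city, in percent.
# On the DataFrame, value_counts(normalize=True) on the 'genre' column gives such shares directly.
genre_share = {
    city: {genre: count / city_totals[city] * 100 for genre, count in counts.items()}
    for city, counts in genre_counts.items()
}

for city, shares in genre_share.items():
    for genre, share in shares.items():
        print(f'{city}, {genre}: {share:.2f}%')
```

In this sample the `rusrap` shares of the two cities differ by less than half a percentage point, which is consistent with the conclusion that rap is about equally popular in both.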

<a id="final"></a>
## Research results

We tested three hypotheses and found the following:

1. The day of the week has a different effect on the activity of users in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week, whether in Moscow or in St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow, the top 10 includes the "world" genre,
* in St. Petersburg, "jazz" and "classical".

Thus, the second hypothesis was only partly confirmed. This result could be different due to gaps in the data.

3. The tastes of users of Moscow and St. Petersburg have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable in this data sample.