# Yandex Music

Comparisons of Moscow and St. Petersburg are surrounded by myths. For example:

- Moscow is a metropolis subject to a strict workweek rhythm;
- St. Petersburg is a cultural capital with its own tastes.

In this Yandex Music data analysis, we will compare the behavior of users in these two cities.

The research aims to test three hypotheses:

- User activity depends on the day of the week, and this varies differently in Moscow and St. Petersburg.
- Different music genres predominate on Monday mornings in Moscow and St. Petersburg. The same goes for Friday evenings, depending on the city.
- Moscow and St. Petersburg prefer different music genres. Moscow tends to listen to pop music more often, while St. Petersburg leans toward Russian rap.

Course of the Research:

We will obtain user behavior data from the yandex_music_project.csv file. The quality of the data is unknown, so we will start by reviewing the data.

We will check the data for errors and assess their impact on the research. During the preprocessing stage, we will look for opportunities to correct the most critical data errors.

Thus, the research will be conducted in three stages:

1. Data overview.
2. Data preprocessing.
3. Hypothesis testing.

## Data overview.


Let's form an initial understanding of the Yandex Music data.

__Task 1__

The primary tool for data analysis is "pandas." Let's import this library.

In [1]:
import pandas as pd

__Task 2__

Let's read the file "yandex_music_project.csv" from the datasets folder (here, a local copy of it) and store it in the variable "df".

In [2]:
df = pd.read_csv('/Users/daniyardjumaliev/Jupyter/Projects/datasets/yandex_music_project.csv')

__Task 3__

Let's display the first ten rows of the table on the screen:

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


**Task 4**

Let's obtain general information about the table using the info() method in one command:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in the table, and the data type in all columns is 'object'.

According to the data documentation:

- 'userID' - user identifier;
- 'Track' - track name;
- 'artist' - artist name;
- 'genre' - genre name;
- 'City' - user's city;
- 'time' - start time of listening;
- 'Day' - day of the week.

The number of values in the columns varies, indicating the presence of missing values in the data.

__Task 5__

Open-ended Question

In the column names, there are three style violations:

- Lowercase letters are combined with uppercase letters.
- Spaces are present.
- Multi-word names are not written in snake_case (for example, userID).

__Conclusions__

Each row in the table contains data about a listened track. Some columns describe the composition itself: the title, artist, and genre. The remaining columns describe the user: their city, and the time and day when they listened to the track.

Preliminarily, it can be stated that there is enough data to test the hypotheses. However, there are missing values in the data, and there are style discrepancies in the column names.

To proceed further, it is necessary to address the data issues.

## Data Preprocessing
Let's correct the style in the column headers, eliminate missing values, and then check the data for duplicates.

### Column Headers Style
__Task 6__

Display the column names on the screen:

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

__Task 7__

Let's bring the column names into good style:

- Convert multi-word names to "snake_case".
- Make all characters lowercase.
- Remove spaces.

To do this, rename the columns as follows:

- '  userID' → 'user_id'
- 'Track' → 'track'
- '  City  ' → 'city'
- 'Day' → 'day'

In [6]:
renamed = {
    '  userID' : 'user_id',
    'Track' : 'track',
    '  City  ' : 'city',
    'Day' : 'day'
}
df.rename(columns = renamed, inplace=True)
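
As a hedged alternative to listing every column by hand, the same clean-up can be sketched programmatically. The `toy` frame below is a hypothetical stand-in that only reuses the headers shown by df.columns. Stripping and lowercasing handle the spaces and capital letters, so only the multi-word name still needs an explicit mapping.

```python
import pandas as pd

# Hypothetical empty frame that reuses the headers shown by df.columns
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Remove surrounding spaces and lowercase every header in one pass
toy.columns = toy.columns.str.strip().str.lower()

# 'userid' is the only multi-word name, so it still needs an explicit snake_case mapping
toy = toy.rename(columns={'userid': 'user_id'})

print(list(toy.columns))
```

The explicit dictionary used in the cell above remains the safer choice when only a few columns need changing, because it documents every rename.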


__Task 8__

Let's verify the result. To do this, display the column names on the screen again:

In [7]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

### Missing Values
__Task 9__

First, let's calculate the number of missing values in the table. To do this, we can chain two pandas methods, isna() and sum():

In [8]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the research. For example, in the 'track' and 'artist' columns, missing values are not important for our work. It is sufficient to replace them with explicit labels.

However, missing values in the 'genre' column can hinder the comparison of musical preferences in Moscow and St. Petersburg. In practice, it would be correct to determine the cause of these missing values and recover the data. However, this is not possible in an educational project. Therefore, we will need to:

- Fill in these missing values with explicit labels.
- Evaluate the extent to which they may affect the calculations.
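
One way to gauge that extent is to look at the share of missing values instead of the raw count. Below is a minimal sketch on a hypothetical five-row frame; `toy` and all of its values are invented for illustration.

```python
import pandas as pd

# Invented sample with one missing track and one missing genre
toy = pd.DataFrame({
    'track': ['a', None, 'c', 'd', 'e'],
    'genre': ['rock', 'pop', None, 'pop', 'dance'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
})

# isna() gives True/False per cell; the mean of booleans is the share of missing values
missing_share = toy.isna().mean()
print(missing_share)
```

For the real table, the counts above give 1198 missing genres out of 65079 rows, which is about 1.8%.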

__Task 10__

Let's replace the missing values in the 'track', 'artist', and 'genre' columns with the string 'unknown'. To do this, we will create a list 'columns_to_replace', iterate through its elements using a 'for' loop, and perform the replacement for each column:

In [9]:
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

__Task 11__

Make sure there are no more missing values in the table. To do this, we calculate the number of missing values again.

In [10]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates

__Task 12__

Count explicit duplicates in the table with a single command:

In [11]:
df.duplicated().sum()

3826

__Task 13__

Let's call the special pandas method to remove explicit duplicates:

In [12]:
df = df.drop_duplicates()

__Task 14__

Let's calculate explicit duplicates in the table once again to ensure that we have completely eliminated them:

In [13]:
df.duplicated().sum()

0
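
A side effect worth knowing: drop_duplicates() keeps the original index labels, so the index has gaps afterwards. The sketch below shows this on a hypothetical four-row frame. Chaining reset_index(drop=True) is an optional extra step and is not what the cell above does.

```python
import pandas as pd

# Invented sample: rows 0 and 2 are exact duplicates
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'A1', 'C3'],
    'track': ['x', 'y', 'x', 'z'],
})

# The surviving rows keep their old labels, so one label is missing
deduped = toy.drop_duplicates()
print(list(deduped.index))

# reset_index(drop=True) renumbers the rows and discards the old labels
renumbered = toy.drop_duplicates().reset_index(drop=True)
print(list(renumbered.index))
```

The gaps do not affect the counts computed later in this notebook; they only matter when rows are addressed by label.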

Now let's get rid of implicit duplicates in the genre column. For example, the name of the same genre can be written slightly differently. Such inconsistencies can also affect the research results.

__Task 15__

Let's display a list of unique genre names sorted in alphabetical order on the screen. To do this:

- Extract the relevant column from the DataFrame.
- Apply a sorting method to it.
- Use a method to return unique values from the sorted column.

In [14]:
genre_unique = df['genre']
genre_unique = genre_unique.sort_values()
genre_unique = genre_unique.unique()
genre_unique

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

__Task 16__

Let's review the list and find implicit duplicates of the 'hiphop' genre name. These could be names with errors or alternative names for the same genre.

We can see the following implicit duplicates:

- 'hip'
- 'hop'
- 'hip-hop'
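
With a long list of genre names, scanning by eye is error-prone. A hedged sketch of a quicker check is to filter the unique names by a substring pattern. The `genres` series is a hypothetical sample; on the real column the same pattern could also match unrelated names, so the matches still need a manual look.

```python
import pandas as pd

# Invented sample of genre values, including the three misspellings
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'pop', 'rock', 'hip', 'pop'])

# Work on unique names only, then keep those containing 'hip' or 'hop'
unique_genres = pd.Series(genres.unique())
candidates = unique_genres[unique_genres.str.contains('hip|hop')]

print(sorted(candidates))
```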

To remove them from the table, we will use the replace() method with two arguments: a list of the duplicate strings and the correct value. Every 'hip', 'hop', and 'hip-hop' in the genre column of df should become 'hiphop':

In [15]:
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')

__Task 17__

Let's verify that we have replaced the incorrect genre names:

- 'hip'
- 'hop'
- 'hip-hop'

Display the sorted list of unique values in the genre column:

In [16]:
df['genre'].sort_values().unique()


array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
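
Since the printed array is long, a direct check can be easier to read than scanning it. Below is a minimal sketch on a hypothetical sample column; the names `toy`, `wrong_names`, and `leftovers` are illustrative only.

```python
import pandas as pd

# Invented sample column with the three misspellings
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})
wrong_names = ['hip', 'hop', 'hip-hop']

toy['genre'] = toy['genre'].replace(wrong_names, 'hiphop')

# isin() flags rows that still hold one of the old spellings
leftovers = toy['genre'].isin(wrong_names).sum()
print(leftovers)
```

A result of zero means none of the old spellings remain in the column.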

__Conclusions__

Data preprocessing revealed three issues in the data:

- Style violations in column headers.
- Missing values.
- Duplicates - explicit and implicit.

We corrected the headers to simplify working with the table. Removing duplicates will make the research more accurate.

We replaced missing values with 'unknown'. We still need to determine whether missing values in the genre column will affect the research.

Now, we can proceed to test the hypotheses.

## Hypothesis Testing

### Comparing the Behavior of Users from Two Capitals

The first hypothesis suggests that users listen to music differently in Moscow and St. Petersburg. Let's test this assumption using data for three weekdays: Monday, Wednesday, and Friday. 

To do this:

- Separate users from Moscow and St. Petersburg.
- Compare how many tracks each group of users listened to on Monday, Wednesday, and Friday.

__Task 18__

To practice, let's first perform each of the calculations separately.

Let's assess user activity in each city. We'll group the data by city and calculate the number of tracks listened to in each group.

In [17]:
city_listeners = df.groupby('city')['time'].count()
city_listeners


city
Moscow              42741
Saint-Petersburg    18512
Name: time, dtype: int64

In Moscow, there are more track plays than in St. Petersburg. However, this doesn't necessarily mean that Moscow users listen to music more often. It's possible that there are simply more users in Moscow.
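
To separate "more plays" from "more users", the play count can be set against the number of distinct users in each city. This is a sketch on invented data, not a result from the real table.

```python
import pandas as pd

# Invented sample: three Moscow users with four plays, one St. Petersburg user with two
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'D', 'D'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
})

# count gives plays per city, nunique gives distinct users per city
per_city = toy.groupby('city')['user_id'].agg(plays='count', users='nunique')
per_city['plays_per_user'] = per_city['plays'] / per_city['users']

print(per_city)
```

In this toy sample Moscow has more plays in total, yet fewer plays per user.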

__Task 19__

Now let's group the data by the day of the week and calculate the number of track plays on Monday, Wednesday, and Friday. Please note that the data only contains information about plays on these days.

In [18]:
day_listeners = df.groupby('day')['time'].count()
day_listeners

day
Friday       21840
Monday       21354
Wednesday    18059
Name: time, dtype: int64

On average, users from both cities are less active on Wednesdays. However, the picture may change when we consider each city separately.

__Task 20__

We've seen how grouping by city and by days of the week works. Now let's write a function that combines these calculations.

We'll create a function called number_tracks() that calculates the number of track plays for a given day and city. It will take two parameters:

- day: the day of the week
- city: the city name

Inside the function, we will keep only the rows of the table where:

- the day column is equal to the day parameter;
- the city column is equal to the city parameter.

We will apply both conditions at once, combining them with & in logical indexing.

Then, we'll count the values in the user_id column of the filtered table and save the result in a new variable. Finally, we'll return this variable from the function.

In [19]:
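def number_tracks(day, city):
    # Keep only the rows for the requested day and city
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    # user_id has no missing values, so count() equals the number of plays
    track_list_count = track_list['user_id'].count()
    return track_list_count

print(number_tracks('Friday', 'Moscow'))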

15945


__Task 21__

Let's call number_tracks() six times, changing the parameter values to obtain data for each city on each of the three days.

In [20]:
number_tracks('Monday','Moscow')

15740

In [21]:
number_tracks('Monday', 'Saint-Petersburg')

5614

In [22]:
number_tracks('Wednesday','Moscow')

11056

In [23]:
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [24]:
number_tracks('Friday','Moscow')

15945

In [25]:
number_tracks('Friday','Saint-Petersburg')

5895

__Task 22__

Using the pd.DataFrame constructor, let's create a table where:

- The column names are ['city', 'monday', 'wednesday', 'friday'].
- The data consists of the results we obtained using number_tracks.

In [26]:
columns = ['city', 'monday', 'wednesday', 'friday']
data = [['Moscow', 15740, 11056, 15945],['Saint-Petersburg', 5614, 7003, 5895]]

city_listeners = pd.DataFrame(data = data, columns = columns)
city_listeners

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
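
Typing results into a list by hand invites copy errors. As a hedged alternative, pd.crosstab can count plays per city and day in one call. The sketch below runs on an invented five-row sample rather than on the real table.

```python
import pandas as pd

# Invented sample of plays
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Friday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are play counts
counts = pd.crosstab(toy['city'], toy['day'])

# crosstab sorts columns alphabetically; put them back in weekday order
counts = counts[['Monday', 'Wednesday', 'Friday']]

print(counts)
```

Applied to df['city'] and df['day'], the same call should reproduce the six numbers returned by number_tracks().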


__Conclusions__

The data show differences in user behavior:

- In Moscow, the peak of track plays occurs on Monday and Friday, with a noticeable decrease on Wednesday.
- In St. Petersburg, on the contrary, more music is listened to on Wednesdays. Activity on Monday and Friday falls short of Wednesday by almost the same amount.

Therefore, the data support the first hypothesis.
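
Because Moscow has far more plays overall, the weekly pattern is easier to compare as shares of each city's own total. The sketch below reuses the six counts returned by number_tracks() above; the names `plays` and `shares` are illustrative.

```python
import pandas as pd

# Counts taken from the six number_tracks() calls above
plays = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total so the two cities become comparable
shares = plays.div(plays.sum(axis=1), axis=0)

print(shares.round(3))
```

In this framing, Wednesday accounts for about 26% of the Moscow plays and about 38% of the St. Petersburg plays.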

### Music at the Beginning and End of the Week

According to the second hypothesis, different genres dominate in Moscow and St. Petersburg on Monday mornings, as well as on Friday evenings. Let's test this hypothesis using the data.

__Task 23__

Let's save the data tables into two variables:

- For Moscow - moscow_general
- For St. Petersburg - spb_general

In [27]:
moscow_general = df[df['city'] == 'Moscow']

In [28]:
spb_general = df[df['city'] == 'Saint-Petersburg']

__Task 24__

Let's create a function called genre_weekday() with four parameters:

- A data table (DataFrame) with the data.
- A day of the week.
- A start time label in the format 'hh:mm'.
- An end time label in the format 'hh:mm'.

The function should return information about the top 10 genres of tracks listened to on the specified day within the time interval between the two time labels.

In [29]:
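def genre_weekday(df, day, time1, time2):
    # Inside the function, df is the table passed in (it shadows the global df)
    # Times are zero-padded strings, so string comparison follows clock order
    genre_df = df[(df['day'] == day) & (df['time'] > time1) & (df['time'] < time2)]
    # Count plays per genre and keep the ten most frequent
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    return genre_df_sorted[:10]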
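
The time filter relies on plain string comparison. A small sketch with invented timestamps shows why this works for zero-padded 'hh:mm:ss' values compared against 'hh:mm' labels, and what happens exactly at the boundaries.

```python
import pandas as pd

# Invented timestamps around a 07:00 to 11:00 window
times = pd.Series(['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00'])

# Zero-padded strings compare character by character, which matches clock order.
# A longer string that starts with the label ('07:00:00' vs '07:00') counts as greater.
in_window = times[(times > '07:00') & (times < '11:00')]

print(list(in_window))
```

So a play at exactly 07:00:00 falls inside the window, while a play at exactly 11:00:00 falls outside it.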

__Task 25__

Let's compare the results of the genre_weekday() function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [30]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [31]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [32]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [33]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

__Conclusions__

Comparing the top 10 genres on Monday morning, we can draw the following conclusions:

Moscow and St. Petersburg listen to similar music. The only difference is that the "world" genre is in the Moscow ranking, while jazz and classical are in the St. Petersburg ranking.

In Moscow, there are so many missing values that the 'unknown' value has taken the tenth place among the most popular genres. This means that missing values occupy a significant portion of the data and threaten the reliability of the research.

Friday evening does not change this picture significantly. Some genres move slightly up or down, but overall, the top 10 remains the same.

Thus, the second hypothesis is only partially confirmed:

- Users listen to similar music at the beginning and end of the week.
- The difference between Moscow and St. Petersburg is not very pronounced. Moscow listens to Russian pop music more often, while St. Petersburg listens to jazz.

However, the missing data casts doubt on this result. In Moscow, there are so many missing values that the top 10 ranking could look different if data about genres were not lost.
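
How much weight 'unknown' carries can be checked per city as a share of plays. The sketch below uses an invented six-row sample; on the real data the same expression would run on df['genre'] and df['city'].

```python
import pandas as pd

# Invented sample: two of four Moscow plays have no genre
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'unknown', 'pop', 'jazz'],
})

# The comparison yields booleans; their mean per city is the share of 'unknown'
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()

print(unknown_share)
```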

### Genre Preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to more often there than in Moscow. Meanwhile, Moscow is a city of contrasts where pop music still dominates.

__Task 26__

Let's group the moscow_general table by genre and calculate the number of track plays for each genre using the count() method. Then, we'll sort the result in descending order and save it in the moscow_genres table.

In [34]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

__Task 27__

Let's display the first ten rows of moscow_genres:

In [35]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

__Task 28__

Now let's do the same for St. Petersburg.

We'll group the spb_general table by genre, calculate the number of track plays for each genre, sort the result in descending order, and save it in the spb_genres table.

In [36]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

__Task 29__

Let's display the first ten rows of spb_genres:

In [37]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

__Conclusions__

The hypothesis is partially confirmed:

- Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a similar genre, Russian pop music, is also present in the top 10 genres.
- Contrary to expectations, rap is equally popular in both Moscow and St. Petersburg.
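
Raw counts are hard to compare because Moscow has more than twice as many plays. A hedged sketch: divide the genre counts from the two top-10 tables by the city totals from the groupby('city') output above. The dictionary names are illustrative.

```python
# City totals from the groupby('city') output; genre counts from the two top-10 tables
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}
genre_counts = {
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
}

# Share of each genre within its own city
shares = {
    genre: {city: count / totals[city] for city, count in by_city.items()}
    for genre, by_city in genre_counts.items()
}

for genre, by_city in shares.items():
    for city, share in by_city.items():
        print(f'{genre:7s} {city:17s} {share:.1%}')
```

On these numbers, rusrap takes roughly 3% of plays in each city, which is the sense in which rap is equally popular in both.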

## Research Summary

We have tested three hypotheses and found the following:

1. The day of the week has a different impact on user activity in Moscow and St. Petersburg.
The first hypothesis was fully confirmed.

2. Musical preferences do not change significantly throughout the week, whether in Moscow or St. Petersburg. Small differences are noticeable on Monday mornings:
- In Moscow, the "world" genre makes the top 10.
- In St. Petersburg, jazz and classical music make the top 10 instead.

Thus, the second hypothesis was only partially confirmed, and this result could have been different if there were no missing data.

3. Users in Moscow and St. Petersburg have more in common in their music tastes than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.
The third hypothesis was only partially confirmed: pop does lead in Moscow, but rap is about equally popular in both cities. If there are differences in preferences, they are not noticeable among the majority of users.

In practice, research often involves statistical hypothesis testing. Data from a single service may not always reflect the preferences of an entire city's population. Statistical hypothesis tests can show how reliable the conclusions are based on the available data.