# Audio streaming app

Comparisons between Moscow and St. Petersburg are surrounded by myths. For example:

* Moscow is a metropolis that lives by a strict workweek rhythm.
* St. Petersburg is the cultural capital, with its own tastes.

Using data from a music streaming service, we compare the behavior of users in both cities.

**Research objective** - We will test three hypotheses:

1. User activity depends on the day of the week, and this manifests differently in Moscow and St. Petersburg.
2. On Monday mornings, different music genres prevail in Moscow compared to St. Petersburg. The same applies to Friday evenings - different genres dominate depending on the city.
3. Moscow and St. Petersburg prefer different music genres. Pop music is more common in Moscow, while Russian rap is more popular in St. Petersburg.

**Research process**

Nothing is known about the data quality. Therefore, before testing the hypotheses, we need to review the data for errors and assess their impact on the study. 

Subsequently, during the data preprocessing stage, we will look for ways to rectify the most critical data errors.

Hence, the study will consist of three stages:

1. Data overview.
2. Data preprocessing.
3. Hypothesis testing.

## Review of data

In [113]:
import pandas as pd

In [114]:
df = pd.read_csv('...')

In [115]:
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [116]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table has seven columns, and every column has the `object` data type.

According to the data documentation:

* userID - User identifier.
* Track - Track name.
* artist - Artist name.
* genre - Genre name.
* City - User's city.
* time - Start time of listening.
* Day - Day of the week.

The number of values in the columns varies, indicating that there are missing values in the data.

The column names violate good style:
* Lowercase letters are mixed with uppercase letters.
* Some names contain leading or trailing spaces.

**Conclusions**

Each row of the table describes one played track. Some columns describe the track itself: its name, artist, and genre. The rest describe the play: which user listened, in which city, and at what time and on what day.

Provisionally, the data look sufficient to test the hypotheses. However, there are missing values, and the column names deviate from good style.

To move forward, we need to fix data problems.

## Data Preparation

### Column name style

In [118]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [119]:
df = df.rename(
    columns={
    '  userID': 'user_id',
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day'
    }
)

In [120]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
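
The manual rename above works well for four columns. For wider tables, the same cleanup could be automated. The sketch below is only an illustration: `to_snake_case` is a hypothetical helper, not part of the original notebook, and it assumes the only problems are surrounding spaces and camelCase names.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Strip surrounding spaces and convert camelCase to snake_case."""
    name = name.strip()
    # Insert an underscore between a lowercase letter or digit and an uppercase letter.
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# An empty table with the same raw column names as the dataset.
raw = pd.DataFrame(
    columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
)

# rename() accepts a function and applies it to every column name.
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

Passing a function to `rename()` avoids retyping each name, at the cost of being less explicit than a dictionary.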

### Missing values

In [121]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

In [122]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [123]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates

In [124]:
df.duplicated().sum()

3826

In [125]:
df = df.drop_duplicates()

In [126]:
df.duplicated().sum()

0

In [127]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [128]:
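# Merge the variant spellings of hip-hop into a single genre name.
# The replacement is limited to the 'genre' column so that identical
# strings in other columns (for example, a track named 'hop') stay untouched.
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')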

In [129]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
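
If more implicit duplicates turn up later, the replacement step could be wrapped in a small reusable function. This is a minimal sketch on toy data; `replace_wrong_genres` and the toy table are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def replace_wrong_genres(frame, wrong_genres, correct_genre):
    """Return a copy of frame with variant genre names merged into one."""
    frame = frame.copy()
    # Only the 'genre' column is modified; other columns are left as they are.
    frame['genre'] = frame['genre'].replace(wrong_genres, correct_genre)
    return frame


# Toy data: note the track that happens to be named 'hop'.
toy = pd.DataFrame({
    'track': ['hop', 'a', 'b', 'c'],
    'genre': ['hip', 'hop', 'hip-hop', 'rock'],
})

fixed = replace_wrong_genres(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(fixed)
```

Restricting the replacement to one column keeps a track or artist that shares a name with a misspelled genre from being rewritten by accident.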

**Conclusions**

The data preprocessing revealed three issues in the dataset:

- Violations in the header style.
- Missing values.
- Explicit and implicit duplicates.

We fixed the headers to simplify working with the table, and the removal of duplicates will make the research more accurate.

We replaced missing values with 'unknown'. However, we still need to assess whether the missing values in the 'genre' column will affect the research.
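
One way to start that assessment is to look at the share of plays with an unknown genre in each city. The sketch below uses a small toy table rather than the real data, so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Toy data with the same column names as the cleaned table.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'pop', 'unknown'],
})

# The mean of a boolean Series is the fraction of True values,
# so grouping by city gives the share of 'unknown' genres per city.
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

If the shares differ noticeably between cities, the genre rankings for the two cities are not affected by the gaps to the same degree.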

Now we can proceed with hypothesis testing.

## Hypothesis Check

### Comparison of user behavior between two cities

The first hypothesis states that users in Moscow and St. Petersburg listen to music differently. We check this assumption using data for three days of the week: Monday, Wednesday, and Friday. To do this, we:

* Split the users into a Moscow group and a St. Petersburg group.
* Compare how many tracks each group listened to on Monday, Wednesday, and Friday.


In [130]:
df.groupby(by='city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more plays in Moscow than in St. Petersburg. This does not mean that Moscow users listen more often: there are simply more users in Moscow.

In [131]:
df.groupby(by='day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

Across both cities combined, users are less active on Wednesdays. But the picture may change if each city is considered separately.

In [None]:
# Count track plays for a given day and city.
# Sequential filtering with logical indexing: first keep the rows of df
# with the requested day, then keep only the rows with the requested city.
# The result is the number of values in the 'user_id' column of the
# filtered table, computed with count().
def number_tracks(day, city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [133]:
number_tracks('Monday', 'Moscow')

15740

In [134]:
number_tracks('Monday', 'Saint-Petersburg')

5614

In [135]:
number_tracks('Wednesday', 'Moscow')

11056

In [136]:
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [137]:
number_tracks('Friday', 'Moscow')

15945

In [138]:
number_tracks('Friday', 'Saint-Petersburg')

5895

In [139]:
columns = ['city', 'monday', 'wednesday', 'friday']
data = [['Moscow', 15740, 11056, 15945], ['Saint-Petersburg', 5614, 7003, 5895]]
pd.DataFrame(data=data, columns=columns)

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
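
The same city-by-day table can also be built in one step instead of typing the counts by hand. The sketch below uses `pd.crosstab` on a toy table with the same `city` and `day` columns; the toy rows are made up for illustration.

```python
import pandas as pd

# Toy data: one row per play, as in the cleaned table.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# crosstab counts rows for every (city, day) pair.
# reindex puts the days in calendar order instead of alphabetical order.
day_order = ['Monday', 'Wednesday', 'Friday']
plays = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)
print(plays)
```

Building the table from the data removes the risk of copying a number incorrectly.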


**Conclusions**

The data shows the difference in user behavior:

- In Moscow, plays peak on Monday and Friday, with a noticeable dip on Wednesday.
- In St. Petersburg, on the contrary, the most plays occur on Wednesday.

So the data supports the first hypothesis.
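
Because Moscow has far more plays overall, the weekly pattern is easier to compare as shares within each city. The sketch below takes the six counts reported above and divides each row by its total; the layout of the table is an assumption made for illustration.

```python
import pandas as pd

# Play counts from the table above, indexed by city.
plays = pd.DataFrame(
    {
        'monday': [15740, 5614],
        'wednesday': [11056, 7003],
        'friday': [15945, 5895],
    },
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide every row by that city's total so each row sums to 1.
shares = plays.div(plays.sum(axis=1), axis=0)
print(shares.round(3))
```

Expressed this way, the Wednesday dip in Moscow and the Wednesday peak in St. Petersburg no longer depend on the size of each city's audience.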

### Music at the beginning and end of the week

According to the second hypothesis, different genres prevail in Moscow and St. Petersburg on Monday morning. The same holds for Friday evening: different genres prevail depending on the city.

In [140]:
# getting the moscow_general table from those rows of the table df, 
# for which the value in the 'city' column is 'Moscow'
moscow_general = df[df['city'] == 'Moscow']

In [141]:
# getting the spb_general table from those rows of the table df,
# for which the value in the 'city' column is 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']

In [142]:
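# Return the ten most played genres in the given table for one day,
# keeping only rows whose 'time' string lies strictly between time1 and time2.
# The 'time' column holds zero-padded 'HH:MM:SS' strings, so comparing
# them as strings orders them chronologically.
# Note: the df parameter shadows the global df inside the function.
def genre_weekday(df, day, time1, time2):
    genre_df = df[df['day'] == day]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df_grouped = genre_df.groupby(by='genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    return genre_df_sorted[:10]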
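
The time filter in `genre_weekday()` compares strings, which works only because the times are zero-padded. One consequence is that a play at exactly `07:00:00` passes the `> '07:00'` check, since the longer string sorts after its prefix. The sketch below, on toy values, shows the string comparison next to a possible alternative that parses the times and states the bounds explicitly; the alternative is an illustration, not what the notebook does.

```python
import pandas as pd

# Toy listening times in the same zero-padded 'HH:MM:SS' format as the 'time' column.
times = pd.Series(['06:59:59', '07:00:00', '08:34:34', '11:00:00', '20:28:33'])

# String comparison, as in genre_weekday().
mask_str = (times > '07:00') & (times < '11:00')

# Alternative: parse to Timedelta and make the bounds explicit
# (start included, end excluded).
elapsed = pd.to_timedelta(times)
mask_td = (elapsed >= pd.Timedelta(hours=7)) & (elapsed < pd.Timedelta(hours=11))

print(times[mask_str].tolist())
print(times[mask_td].tolist())
```

For these values both masks select the same rows. The parsed version is more verbose but does not depend on how the strings are formatted.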

In [143]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [144]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [145]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [146]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, the following conclusions can be drawn:

1. Moscow and St. Petersburg have similar music preferences. The only difference is that the "world" genre made it into Moscow's top 10, while jazz and classical appear in St. Petersburg's top 10.

2. In Moscow, there are so many missing values that the `'unknown'` value ranks tenth among the most popular genres. This suggests that missing values constitute a significant portion of the data and pose a threat to the reliability of the research.

Friday evening does not change this picture significantly. Some genres may slightly rise or fall, but the overall top 10 remains the same.

Thus, the second hypothesis is only partially confirmed:
* Users listen to similar music at the beginning and end of the week.
* The difference between Moscow and St. Petersburg is not very pronounced. Moscow leans more towards Russian pop music, while St. Petersburg leans more towards jazz.

However, the missing data casts doubt on this result. In Moscow, there are so many missing values that the top 10 ranking could have looked different if the data on genres were not lost.

### Genre Preferences in Moscow and Petersburg

Hypothesis: St. Petersburg is the capital of rap, so music of this genre is played there more often than in Moscow. Moscow is a city of contrasts in which pop music nevertheless dominates.

In [147]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [148]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [149]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [150]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a closely related genre, Russian pop, is also in the top 10.
* Contrary to expectations, Russian rap is about equally popular in both cities: it accounts for roughly 2.7% of plays in Moscow (1161 of 42741) and 3.0% in St. Petersburg (564 of 18512).
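
Raw counts are hard to compare between cities of different size, so it can help to convert them to shares of each city's plays. The sketch below does this for two genres, using the counts and city totals reported earlier in this notebook; the table layout is an assumption made for illustration.

```python
import pandas as pd

# Total plays per city after deduplication, as reported above.
city_totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Play counts for two genres taken from the top-10 outputs above.
counts = pd.DataFrame(
    {'Moscow': [5892, 1161], 'Saint-Petersburg': [2431, 564]},
    index=['pop', 'rusrap'],
)

# Divide each column by its city total to get the share of plays.
shares = counts.div(city_totals, axis=1)
print((shares * 100).round(1))
```

On the real table, `value_counts(normalize=True)` on the `genre` column of each city's subset would give the same kind of shares for every genre at once.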


## Research Results

We have tested three hypotheses and established the following:

1. The day of the week affects user activity differently in Moscow and St. Petersburg. The first hypothesis was fully confirmed.

2. Musical preferences do not change significantly during the week, whether in Moscow or St. Petersburg. Minor differences are noticeable at the beginning of the week, specifically on Mondays:
   * In Moscow, users listen to music of the "world" genre.
   * In St. Petersburg, they lean towards jazz and classical genres.

Thus, the second hypothesis was only partially confirmed. This result could have been different if not for the missing data.

3. The musical tastes of Moscow and St. Petersburg users have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was only partially confirmed: pop does lead in Moscow, but Russian rap is about equally popular in both cities. If there are any differences in preferences, they are not evident in the majority of users.

**In practical research, statistical hypothesis testing is essential.**
Conclusions drawn from data of a single service may not always apply to all residents of a city.
Statistical hypothesis testing will indicate the level of reliability of findings based on the available data.
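
As an illustration of what such a test could look like for the first hypothesis, the sketch below runs a chi-square test of independence on the city-by-day play counts reported above. It is a rough sketch under strong assumptions: it treats every play as an independent observation, which is not true when one user contributes many plays, so the resulting p-value is likely too optimistic. The p-value formula used here is valid only for two degrees of freedom, which is the case for a 2x3 table.

```python
import math

import pandas as pd

# Observed play counts per city and day, from the table above.
observed = pd.DataFrame(
    {
        'monday': [15740, 5614],
        'wednesday': [11056, 7003],
        'friday': [15945, 5895],
    },
    index=['Moscow', 'Saint-Petersburg'],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()

# Expected counts if the day of the week were independent of the city.
expected = pd.DataFrame(
    row_totals[:, None] * col_totals[None, :] / n,
    index=observed.index,
    columns=observed.columns,
)

# Pearson chi-square statistic and degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).to_numpy().sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# With exactly 2 degrees of freedom the chi-square distribution is
# exponential with mean 2, so the upper-tail probability is exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2)

print(f'chi2 = {chi2:.1f}, dof = {dof}, p-value = {p_value:.3g}')
```

A small p-value here would say that the weekly distribution of plays differs between the cities more than chance alone would explain. It would not say that the pattern holds for all residents of either city.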