# Yandex Music Analysis

# Contents <a id='back'></a>

* [Intro](#intro)
* [Stage 1. Data description](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: compare user behavior in the two cities](#activity)
    * [3.2 Hypothesis 2: music at the beginning and end of the week](#week)
    * [3.3 Hipótesis 3: gender preferences in Springfield and Shelbyville](#genre)
* [Final conclusions](#end)

## Introduction <a id='intro'></a>  

Whenever we do research, we need to formulate hypotheses that we can then test. Sometimes we accept these hypotheses; others, we reject them. To make the right decisions, a company must be able to understand if it is making the right assumptions.

In this project, we will compare the music preferences of the cities of Springfield and Shelbyville. We will study real data from Yandex Music to test the hypotheses below and compare the user behavior of those two cities.

### Target: 
We will test three hypotheses:
1. The activity of the users differs according to the day of the week and depending on the city.
2. On Monday mornings, the people of Springfield and Shelbyville listen to different genres. The same goes for Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield they prefer pop while in Shelbyville there are more fans of rap.

### Steps 
User behavior data is stored in the `music_project_en.csv` file. There is no information on the quality of the data so we will need to examine it before testing the hypotheses.

First, we will assess the quality of the data and see if the problems are significant. So, during data preprocessing, we will take into account the most critical problems.

Our project will consist of three stages:
1. Description of the data
2. Data preprocessing
3. Hypothesis test
 
[Back to Contents](#back)

## Stage 1. Data description <a id='data_review'></a>

Opening the data in Yandex.Music and examine it. We will need `pandas` so we need to import it.

In [1]:
# importing pandas
import pandas as pd

Read the `music_project_en.csv` file and save it in the `df` variable:

In [2]:
# reading the file and storing it in df
df = pd.read_csv('/datasets/music_project_en.csv')

Print the first 10 rows of the table:

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


Get the general information about the table with a command:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns. They all store the same type of data: object.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist name
- `'genre'` — genre
- `'City'` — user's city
- `'time'` — the exact time period that the track was played
- `'Day'` — day of the week

We can see three problems with the style of the column names:
1. Some names are in upper case, others in lower case.
2. There are some spaces in some names.
3. "userID" should be written as "user_id".

The number of values in the columns is different. This means that the dataframe contains missing values.


### Conclusions <a id='data_review_conclusions'></a> 

Each row of the table stores data for the track that was played. Some columns describe the track itself: its title, artist, and genre. The rest transmit the user's information: the city he came from, the time he has played the track.

It is clear that the data is sufficient to test the hypothesis. However, there are missing values.

To continue, we need to preprocess the data.

[Back to Contents](#back)

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>

We are going to modify the column headers and take care of missing values. Next, we check for duplicates in the data.

### 2.1 Header Style <a id='header_style'></a>

Print the column header:

In [6]:
# the list of column names in the df table
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

We are going to change the column names according to the rules of good style:
* If the name has multiple words, using snake_case
* All characters must be lowercase
* Removing spaces

In [7]:
# rename the columns
df = df.rename(
    columns={
        '  userID': 'user_id',
        'Track' : 'track',
        'artist': 'artist',
        'genre': 'genre',
        '  City  ': 'city',
        'time': 'time',
        'Day': 'day'
    }    
)

Checking the result. Print the column names one more time:

In [9]:
# checking the result: the list of column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

[Back to Contents](#back)

### 2.2 Missing Values <a id='missing_values'></a>

First, we find the number of missing values in the table. To do this, we use two pandas methods:

In [10]:
# calculating missing values
df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the investigation. For example, the missing values in track and artist are not critical. We can simply replace them with clear markers.

But missing values in `'genre'` may affect the comparison between Springfield and Shelbyville music preferences. In real life, it would be useful to know the reasons for missing data and try to recover it. But we don't have that opportunity in this project. So we will have to:
* Fill those missing values with markers
* Assess how much missing values might affect your computations.

Replace missing values in `'track'`, `'artist'`, and `'genre'` with the string `'unknown'`. To do this, create the `columns_to_replace` list, iterate through it with a `for` loop, and replace the missing values in each of the columns:

In [11]:
# looping through column names and replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for val in columns_to_replace:
    df[val] = df[val].fillna('unknown')

Making sure that the table does not contain any more missing values. Recounts the missing values.

In [12]:
# counting missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

[Back to Contents](#back)

### 2.3 Duplicates <a id='duplicates'></a>
Finding the number of obvious duplicates in the table using one command:

In [13]:
# counting of obvious duplicate 
df.duplicated().sum()

3826

We call the `pandas` method to get rid of obvious duplicates:

In [14]:
# removing obvious duplicates
df = df.drop_duplicates()

We count the obvious duplicates one more time to make sure they've all been removed:

In [15]:
# checking duplicates
df.duplicated().sum()

0

Now we get rid of the implicit duplicates in the genre column. For example, the name of a genre can be written in several ways. These errors can also affect the result.

Printing a list of unique genre names, arranged in alphabetical order. How it's done:
* Retrieves the desired DataFrame column
* Apply an order method to it
* For the sorted column, call the method that will return all unique column values

In [16]:
# inspecting unique genre names
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Looking in the list to find implicit duplicates of the genus `hiphop`. These may be names written incorrectly or alternative names for the same generation.

See the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To clear them, we declare the `replace_wrong_genres()` function with two parameters:
* `wrong_genres=` — the list of duplicates
* `correct_genre=` — the string with the correct value

The function must correct the names in the `'genre'` column of the `df` tabla, i.e., replace each value in the `wrong_genres` list with the value in `correct_genre`.

In [17]:
# function to replace implied duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)


Calling `replace wrong genres()` and passing arguments to it to remove implicit duplicates (`hip`, `hop` and `hip-hop`) and replace them with `hiphop`:

In [18]:
# removing implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
correct = 'hiphop'

replace_wrong_genres(duplicates, correct)

Making sure that duplicate names have been removed. Print the list of unique values of the `'genre'` column:

In [19]:
# checking for implicit duplicates
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a> 

We detected three problems with the data:

- Wrong header styles
- Missing values
- Obvious and explicit duplicates

The headers have been removed to make table processing easier.

All missing values have been replaced with `'unknown'`. But we still have to see if the missing values in `'genre'` point to our calculations.

The absence of duplicates will make the results more accurate and easier to understand.

Now we can continue testing the hypotheses.

[Back to Contents](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### 3.1 Hypothesis 1: compare user behavior in the two cities <a id='activity'></a>

According to the first hypothesis, users in Springfield and Shelbyville listen to music differently. Check this using data for three days of the week: Monday, Wednesday, and Friday.

* Divide users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday and Friday.

For the sake of exercise, we perform each calculation separately.

We evaluate user activity in each city. Grouping the data by city and find the number of songs played in each group.

In [20]:
# counting tracks played in each city
shelby = 0
spring = 0
for value in df['city']:
    if value == 'Shelbyville':
        shelby += 1
    elif value == 'Springfield':
        spring += 1

print('The songs played in Shelbyville are', shelby)
print('The songs played in Springfield are', spring)
    


The songs played in Shelbyville are 18512
The songs played in Springfield are 42741


Springfield has played more tracks than Shelbyville. But that doesn't mean Springfieldians are listening music more often. This city is simply bigger and there are more users.

Now group the data by day of the week and find the number of tracks played on Monday, Wednesday, and Friday.

In [21]:
# calculating the tracks played on each of the three days
mon = 0
wed = 0
fri = 0
for val in df['day']:
    if val == 'Monday':
        mon += 1
    elif val == 'Wednesday':
        wed += 1
    elif val == 'Friday':
        fri += 1
        
print('The songs played on Monday are', mon)
print('The songs played on Wednesday are', wed)
print('The songs played on Friday are', fri)
    

The songs played on Monday are 21354
The songs played on Wednesday are 18059
The songs played on Friday are 21840


Wednesday was the quietest day of all. But if we consider the two cities separately we could reach a different conclusion.

We have already seen how grouping by city or day works. Now we write the function that will group both.

Create the `number_tracks()` function to calculate the number of songs played on a given day and city. It will require two parameters:
* weekday
* name of the city

In the function, we use a variable to store the rows from the original table, where:
   * the value of the `'day'` column is equal to the day parameter
   * the value of the `'city'` column is equal to the city parameter

Apply consecutive filtering with logical indexing.

Then, it calculates the values of the `'user_id'` column in the resulting table. Store the result in the new variable. Retrieve this variable from the function.

In [22]:
# <creating the number_tracks() function>
# declare the function with two parameters: day=, city=.
def number_tracks(day, city):
    filtro_day = df['day'] == day
    filtro_city = df['city'] == city
    filtro_fin = filtro_day & filtro_city
    track_list = df[filtro_fin]
    
    track_list_count = track_list['user_id'].count()    
        
    return track_list_count

We call `number_tracks()` six times, changing the parameter values, so that you retrieve the data for both cities for each of the three days.

In [23]:
# the number of songs played in Springfield on Monday
number_tracks('Monday', 'Springfield')

15740

In [24]:
# the number of songs played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614

In [25]:
# the number of songs played in Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056

In [26]:
# the number of songs played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003

In [35]:
# the number of songs played in Springfield on Friday
number_tracks('Friday', 'Springfield')

15945

In [27]:
# the number of songs played in Shelbyville on Friday
number_tracks('Friday', 'Shelbyville')

5895

We use `pd.DataFrame` to create a table, where
* Column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the results you got from `number_tracks()`

In [38]:
nueva_tabla = {'city' : ['Springfield', 'Shelbyville'], 
               'monday' : [15740, 5614], 
               'wednesday' : [11056, 7003], 
               'friday' : [15945,5895]
               }

In [41]:
# tabla con los resultados
pd.DataFrame(nueva_tabla)

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays while on Wednesdays there is a dip in activity.
- In Shelbyville, the opposite, users listen more music on Wednesdays. User activity on Mondays and Fridays is lower.

So the first hypothesis seems to be correct.

[Back to Contents](#back)

### 3.2 Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday nights the citizens of Springfield listen to genres that differ from those of users of Shelbyville.

We get tables (making sure the name of your joined table matches the DataFrame given in the two code blocks below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [28]:
# getting the table spr_general from the rows of df,
# where the values in the 'city' column is 'Springfield'

filtro_city = df['city'] == 'Springfield'
spr_general = df[filtro_city]

spr_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
65073,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65076,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [34]:
# getting shell_general from rows df,
# where the value of the 'city' column is 'Shelbyville'

filtro_city = df['city'] == 'Shelbyville'
shel_general = df[filtro_city]

shel_general


Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
65063,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
65064,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
65065,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
65066,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


We write the function genre_weekday() with four parameters:
* A table for the data (`df`)
* The day of the week (`day`)
* The timestamp in 'hh:mm' (`time1`) format
* The timestamp in 'hh:mm' (`time2`) format

The function should return information for the 15 most popular genres for a given day in the period between two timestamps.

In [29]:
# declaring the function genre_weekday() with the parameters day=, time1= and time2=. Should
# return information about the most popular genres on a given day at a given time:

def genre_weekday(data, day, time1, time2):
    filtro_day = df['day'] == day
    filtro_time1 = df['time'] > time1
    filtro_time2 = df['time'] < time2 
    filtro_fin = filtro_day & filtro_time1 & filtro_time2
    genre_df = df[filtro_fin]
    
    genre_df_count = genre_df.groupby('genre').count()   

    genre_df_sorted = genre_df_count.sort_values(by = 'user_id', ascending = False)

    genre_df_sorted.head(15)

    return genre_df_sorted.head(15)

Comparing the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (7am-11am) and Friday afternoon (5pm-11pm):

In [32]:
# calling the function for Monday morning in Springfield (using spr_general instead of table df)

genre_weekday(spr_general, 'Monday', '07:00', '11:00')

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,999,999,999,999,999,999
dance,731,731,731,731,731,731
rock,636,636,636,636,636,636
electronic,627,627,627,627,627,627
hiphop,366,366,366,366,366,366
ruspop,250,250,250,250,250,250
rusrap,230,230,230,230,230,230
alternative,222,222,222,222,222,222
world,217,217,217,217,217,217
classical,197,197,197,197,197,197


In [35]:
# calling the function for Monday morning in Shelbyville (using shel_general instead of the df table)

genre_weekday(shel_general, 'Monday', '07:00', '11:00')

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,999,999,999,999,999,999
dance,731,731,731,731,731,731
rock,636,636,636,636,636,636
electronic,627,627,627,627,627,627
hiphop,366,366,366,366,366,366
ruspop,250,250,250,250,250,250
rusrap,230,230,230,230,230,230
alternative,222,222,222,222,222,222
world,217,217,217,217,217,217
classical,197,197,197,197,197,197


In [36]:
# calling the function for Friday afternoon in Springfield

genre_weekday(spr_general, 'Friday', '17:00', '23:00')

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,969,969,969,969,969,969
rock,733,733,733,733,733,733
dance,705,705,705,705,705,705
electronic,698,698,698,698,698,698
hiphop,370,370,370,370,370,370
world,262,262,262,262,262,262
alternative,226,226,226,226,226,226
classical,223,223,223,223,223,223
ruspop,217,217,217,217,217,217
rusrap,201,201,201,201,201,201


In [44]:
# calling the function for Friday afternoon in Shelbyville

genre_weekday(shel_general, 'Friday', '17:00', '23:00')

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,969,969,969,969,969,969
rock,733,733,733,733,733,733
dance,705,705,705,705,705,705
electronic,698,698,698,698,698,698
hiphop,370,370,370,370,370,370
world,262,262,262,262,262,262
alternative,226,226,226,226,226,226
classical,223,223,223,223,223,223
ruspop,217,217,217,217,217,217
rusrap,201,201,201,201,201,201


**Conclusions**

Having compared the 15 most popular genres on Monday morning we can conclude the following:

1. Users in Springfield and Shelbyville listen to similar music. The five most popular genres are the same, only rock and electronic have swapped places.

2. In Springfield the number of missing values turned out to be so high that the `'unknown'` value reached tenth. This means that the missing values form a considerable part of the data, which could be the basis of the question about the reliability of our conclusions.

For Friday afternoon, the situation is similar. Individual genres vary somewhat, but overall the top 15 are similar across the two cities.

In this way, the second hypothesis has been partially demonstrated:
* Users listen to similar music at the beginning and end of the week.
* There is not a huge difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result somewhat questionable. In Springfield, there are so many that it affects our 15 most popular. If we weren't missing those values, things might look different.

[Back to Contents](#back)

### 3.3 Hipótesis 3: gender preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. The citizens of Springfield like more pop music.

Grouping the `spr_general` table by genre and find the number of songs played from each genre with the `count()` method. Then, we sort the result in descending order and store it in `spr_genres`.

In [45]:
# on one line: we group the spr_general table by the 'genre' column,
# count the 'genre' values with count() in the pool,
# sort the resulting Series in descending order, and store it in spr_genres

spr_genres = spr_general.groupby('genre').count().sort_values(by = 'user_id', ascending = False)

Print the first 10 rows of `spr_genres`:

In [46]:
spr_genres.head(10)

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,5892,5892,5892,5892,5892,5892
dance,4435,4435,4435,4435,4435,4435
rock,3965,3965,3965,3965,3965,3965
electronic,3786,3786,3786,3786,3786,3786
hiphop,2096,2096,2096,2096,2096,2096
classical,1616,1616,1616,1616,1616,1616
world,1432,1432,1432,1432,1432,1432
alternative,1379,1379,1379,1379,1379,1379
ruspop,1372,1372,1372,1372,1372,1372
rusrap,1161,1161,1161,1161,1161,1161


Now we do the same for the Shelbyville data.

Grouping  the `shel_general` table by genre and find the number of songs played from each genre. Then sort the result in descending order and store it in the `shel_genres` table:

In [47]:
# on one line: group the shel_general table by the 'genre' column,
# count the 'genre' values in the pool with count(),
# sort the resulting Series in descending order and store it in shel_genres

shel_genres = shel_general.groupby('genre').count().sort_values(by = 'user_id', ascending = False)

Print the first 10 rows of `shel_genres`:

In [48]:
shel_genres.head(10)

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,2431,2431,2431,2431,2431,2431
dance,1932,1932,1932,1932,1932,1932
rock,1879,1879,1879,1879,1879,1879
electronic,1736,1736,1736,1736,1736,1736
hiphop,960,960,960,960,960,960
alternative,649,649,649,649,649,649
classical,646,646,646,646,646,646
rusrap,564,564,564,564,564,564
ruspop,538,538,538,538,538,538
world,515,515,515,515,515,515


**Conclusion**

The hypothesis has been partially proven:
*Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be just as popular in Springfield as it was in Shelbyville, and rap was not in the top 5 in either city.

[back to Contents](#back)

# Final conclusions <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and in different cities.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. The same goes for Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In both cities, Springfield and Shelbyville, pop is preferred.

After analyzing the data, we conclude:

1. User activity in Springfield and Shelbyville depends on the day of the week although the cities vary in different ways.

The first hypothesis has been fully accepted.

2. Music preferences do not vary significantly over the course of the week in Springfield and Shelbyville. We can observe small differences in the order on Mondays, but:
* In Springfield and Shelbyville, what people listen to the most is pop music.

So we cannot accept this hypothesis. We should also keep in mind that the result might have been different if it weren't for the missing values.

3. It turns out that the music preferences of users in Springfield and Shelbyville are quite similar.

The third hypothesis is rejected. If there is a difference in preferences cannot be observed in the data.


### Note
In real projects, research involves the study of statistical hypotheses that is more precise and quantitative. Also keep in mind that we can't always draw conclusions about an entire city based on data from a single source.

[Back to Contents](#back)