# Y.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Step 1. Data Overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Step 2. Data pre-processing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Step 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: comparing user behavior in two cities](#activity)
    * [3.2 Hypothesis 2: music at the beginning and end of the week](#week)
    * [3.3 Hypothesis 3: preferences in Springfield and Shelbyville](#genre)
* [Conclusions](#end)

## Introduction <a id='intro'></a>
This project compares the musical preferences of Springfield and Shelbyville residents. Real data from Y.Music is analyzed to test the hypotheses below and compare user behavior in the two cities.

### Objective: 
Test the following hypotheses:
1. User activity differs depending on the day of the week and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. The same is true for Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, people prefer pop, while Shelbyville has more rap fans.

### Stages 
Data about user behavior is stored in the file `/datasets/music_project_en.csv`. Nothing is known about the quality of the data, so it will be examined before the hypotheses are tested.

First, the quality of the data will be assessed to see whether the issues are significant. Then, during data pre-processing, the most critical problems will be addressed.

This project consists of three stages:
 1. Data overview
 2. Data pre-processing
 3. Testing the hypotheses

[Back to Index](#back)

## Step 1. Data Overview <a id='data_review'></a>

Opening and exploration of data in Y.Music.

Importing and using `pandas`.

In [1]:
import pandas as pd # importing pandas

Reading the `music_project_en.csv` file from the `/datasets/` folder and linking it to the `df` variable:

In [2]:
# reading the file and storing it in df
df = pd.read_csv('/datasets/music_project_en.csv')

Display of the first 10 rows of the table:

In [3]:
df.head(10) # getting first 10 rows from table df

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


Getting the general information about the table with one command:

In [4]:
df.info() # getting general information about the data in df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


In [5]:
df.describe() # summary statistics (count, unique, top, freq) for the object columns

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,08:14:07,Friday
freq,76,136,136,8850,45360,14,23149


The table contains seven columns. All of them store the same data type: `object`.

According to the documentation:
- `'userID'` — user ID
- `'Track'` — song title
- `'artist'` — name of the artist
- `'genre'` — the genre
- `'City'` — user's city
- `'time'` — exact time the song was played
- `'Day'` — day of the week

We can see three problems with the style of the column names:
1. Some names are capitalized, some are lowercase.
2. Some names contain spaces.
3. Words within a name are not separated by underscores.

The number of non-null values differs between columns, which means the data contains missing values.


### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data about a song that has been played. Some columns describe the song itself: its title, artist, and genre. The rest contain information about the user and the play: the user's city and the time and day of the week the song was played.

It is clear that the data are sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to pre-process the data.

[Back to Index](#back)

## Step 2. Data pre-processing <a id='data_preprocessing'></a>
Fixing the column header formatting and handling missing and duplicate values.

### Header style <a id='header_style'></a>
Column header display:

In [6]:
df.columns # the list of column names in table df

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Changing the column names according to the rules of good style practice:
* Use of snake_case when the name has several words
* Conversion of all characters to lowercase
* Removal of spaces

In [7]:
# renaming columns
df = df.rename(columns={
    '  userID': 'userid', 
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day'
})
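
The same cleanup can also be done without listing every column by hand. The snippet below is a minimal sketch on an empty toy frame (the names `raw` and `clean` are illustrative and not part of the project) that carries the raw headers from the `df.columns` output above; it strips the surrounding spaces and lowercases every header with the pandas string accessor.

```python
import pandas as pd

# Toy frame (hypothetical) with the same raw headers as the df.columns output
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Strip surrounding spaces, then lowercase every header
clean = raw.copy()
clean.columns = clean.columns.str.strip().str.lower()

print(list(clean.columns))
```

This produces the same names as the explicit `rename()` call. The explicit mapping is still a reasonable choice when only a few columns need changing, because it documents exactly which names were altered.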

Result check. Display the column names one more time:

In [8]:
# checking the result: the list of column names
df.columns

Index(['userid', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

[Back to Index](#back)

### Missing Values <a id='missing_values'></a>
First, the number of missing values in the table was found by chaining two pandas methods:

In [9]:
# calculating missing values
df.isna().sum()

userid       0
track     1343
artist    7567
genre     1198
city         0
time         0
day          0
dtype: int64

Not all missing values affect the analysis. For example, missing values in `'track'` and `'artist'` are not critical and can simply be replaced with clear markers.

Missing values in `'genre'` may affect the comparison of musical preferences in Springfield and Shelbyville. In real life, it would be useful to find out why the data is missing and try to recover it. That is not possible in this project, so the following steps were taken:
* Fill in the missing values with markers
* Evaluate how much the missing values can affect the calculations

The missing values in `'track'`, `'artist'`, and `'genre'` were replaced with the string `'unknown'` by creating a `columns_to_replace` list and looping over it with a `for` loop:

In [10]:
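# looping through column names and replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for col in columns_to_replace:
    # assign the result back: fillna(inplace=True) on a selected column
    # may not update df itself in recent pandas versions
    df[col] = df[col].fillna('unknown')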

The values were counted again to ensure that there were no more missing values.

In [11]:
# counting the missing values
df.isna().sum()

userid    0
track     0
artist    0
genre     0
city      0
time      0
day       0
dtype: int64
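
To gauge how much the replaced values might matter, the missing-value counts reported before filling can be related to the size of the table. The sketch below only re-uses the numbers already shown in the `df.info()` and `df.isna().sum()` outputs; the variable names are illustrative.

```python
import pandas as pd

total_rows = 65079  # number of entries reported by df.info()

# Missing-value counts reported by df.isna().sum() before filling
missing = pd.Series({'track': 1343, 'artist': 7567, 'genre': 1198})

# Share of missing values per column, in percent
missing_share = (missing / total_rows * 100).round(2)
print(missing_share)
```

Under these numbers, fewer than 2% of the rows lack a genre, while roughly one row in nine lacks an artist. On the live table, the same shares could be obtained with `df.isna().mean() * 100`, provided it is run before the missing values are filled.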

[Back to Index](#back)

### Duplicates <a id='duplicates'></a>
The number of obvious duplicates in the table was found by chaining the `duplicated()` and `sum()` methods:

In [12]:
# counting clear duplicates
df.duplicated().sum()

3826

The `drop_duplicates()` method was called to remove the obvious duplicates, and the index was reset afterwards:

In [13]:
# removing obvious duplicates
df = df.drop_duplicates().reset_index(drop=True)

Obvious duplicates were counted one more time to make sure they were all removed:

In [14]:
# checking duplicates
df.duplicated().sum()

0
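
For readers unfamiliar with these methods, the following toy example illustrates what counts as an obvious duplicate. The frame `toy` and its values are made up for illustration; only rows that match in every column are flagged.

```python
import pandas as pd

# Hypothetical data: rows 1 and 3 repeat row 0 exactly
toy = pd.DataFrame({
    'userid': ['A1', 'A1', 'B2', 'A1'],
    'track': ['True', 'True', 'Chains', 'True'],
    'genre': ['dance', 'dance', 'rusrap', 'dance'],
})

# duplicated() marks every repeat of an earlier row; sum() counts the marks
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild a gap-free index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

Resetting the index with `drop=True` discards the old row labels, so later positional lookups are not affected by the gaps left by removed rows.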

Next, the implicit duplicates in the `'genre'` column were removed. For example, the name of a genre can be spelled in different ways, and such inconsistencies could affect the result.

The unique genre names were displayed in alphabetical order, following these steps:
* Select the `'genre'` column of the DataFrame
* Call the method that returns all unique values of the column
* Sort the result

In [15]:
# viewing unique genre names
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

Looking through the list, I found implicit duplicates of the hiphop genre. These may be misspelled names or alternative names for the same genre.

The following implicit duplicates were observed:
* hip
* hop
* hip-hop

To remove them, the replace_wrong_genres() function was declared with two parameters:
* wrong_genres= — the list of duplicates
* correct_genre= — the string with the correct value

The function corrects the names in the 'genre' column of the df table by replacing each value from the wrong_genres list with the value of correct_genre.

In [16]:
# function to replace implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    # iterate over the wrong_genres parameter, not over a global list
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)


The replace_wrong_genres() function was used and arguments were passed to it so that it could eliminate the implicit duplicates (hip, hop, and hip-hop) and replace them with hiphop:

In [17]:
# removing implicit duplicates
wrong = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

replace_wrong_genres(wrong, correct_genre)
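
As a possible variation, the loop can be avoided because `Series.replace()` accepts a list of values to replace. The sketch below uses a hypothetical helper, `replace_genres()`, that takes the table as an argument and returns a corrected copy instead of modifying a global `df`; the toy data is made up.

```python
import pandas as pd

# Hypothetical data containing the three implicit duplicates
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})

def replace_genres(data, wrong_genres, correct_genre):
    """Return a copy of data with every value in wrong_genres replaced."""
    result = data.copy()
    # replace() maps every listed value to the single correct value
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

fixed = replace_genres(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(sorted(fixed['genre'].unique()))
```

Passing the table explicitly makes the helper easier to test on small samples, at the cost of one extra argument per call.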

We then made sure that the duplicate names were removed by displaying the column's list of unique values alphabetically:

In [18]:
# checking for duplicate values
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

[Back to Index](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected three problems with the data:

- Incorrect heading style
- Missing values
- Obvious and implied duplicates

The header has been cleaned up to make table processing simpler.

All missing values have been replaced with 'unknown'. But we have yet to see whether missing values in 'genre' will affect our calculations.

The absence of duplicates will make the results more accurate and easier to understand.

After processing the data, hypotheses can be tested.

[Back to Index](#back)

## Step 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. This hypothesis was tested using data from three days of the week: Monday, Wednesday, and Friday.

* Users from each city were divided into groups.
* The number of songs each group played on Monday, Wednesday, and Friday was compared.

For practice, each of these calculations was done separately.

To evaluate user activity in each city, the data was grouped by city and the number of songs played in each group was counted.

In [19]:
# Counting the songs played in each city
df_bycity = df.groupby(['city'])['track'].count()
df_bycity

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

Springfield has more songs played than Shelbyville. But that doesn't mean Springfield citizens listen to music more often. This city is just bigger, and has more users.

Data were grouped by day of the week to find the number of songs played on Monday, Wednesday and Friday.

In [20]:
# Calculating the songs listened to on each of these three days
df_byday = df.groupby(['day'])['track'].count()
df_byday

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Wednesday is the quietest day overall. But if we consider the two cities separately, the picture changes.

In [21]:
df.groupby(['city', 'day'])['track'].count()

city         day      
Shelbyville  Friday        5895
             Monday        5614
             Wednesday     7003
Springfield  Friday       15945
             Monday       15740
             Wednesday    11056
Name: track, dtype: int64
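
The two-level result above is easier to compare side by side when the days are spread into columns. The following sketch rebuilds the Series from the counts shown in the output (rather than from the original file) and reshapes it with `unstack()`; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the groupby output, in the same (city, day) order
counts = pd.Series(
    [5895, 5614, 7003, 15945, 15740, 11056],
    index=pd.MultiIndex.from_product(
        [['Shelbyville', 'Springfield'], ['Friday', 'Monday', 'Wednesday']],
        names=['city', 'day'],
    ),
    name='track',
)

# Move the 'day' level into columns and put the days in calendar order
wide = counts.unstack('day')[['Monday', 'Wednesday', 'Friday']]
print(wide)
```

On the real data, the equivalent call would be `df.groupby(['city', 'day'])['track'].count().unstack('day')`.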

We have now seen how grouping by city or by day of the week works. Next, a function that filters the data by both criteria was written.

The number_tracks() function calculates the number of songs played on a given day of the week in a given city. It takes two parameters:
* The day of the week
* The name of the city

Inside the function, a variable stores a filter for the rows of the original table where:
   * the value of the 'day' column is equal to the day parameter
   * the value of the 'city' column is equal to the city parameter

Both conditions were combined with logical indexing.

The values of the 'userid' column in the filtered table were then counted, and the result was stored in a new variable and returned from the function.

In [22]:
def number_tracks(day, city):
    """Return the number of songs played in the given city on the given day."""
    # boolean mask selecting the df rows where the value in the 'day' column
    # equals day= and, at the same time, the value in the 'city' column equals city=
    track_list = (df['day'] == day) & (df['city'] == city)
    # number of 'userid' values in the filtered rows, found with count()
    track_list_count = df[track_list]['userid'].count()
    # the function returns a number: the value of track_list_count
    return track_list_count

The `number_tracks()` function was called six times, changing the parameter values, in order to retrieve data for both cities for the three days.

In [23]:
# the number of songs played in Springfield on Monday
number_tracks('Monday', 'Springfield')

15740

In [24]:
# the number of songs played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614

In [25]:
# the number of songs played in Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056

In [26]:
# the number of songs played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003

In [27]:
# the number of songs played in Springfield on Friday
number_tracks('Friday', 'Springfield')

15945

In [28]:
# the number of songs played in Shelbyville on Friday
number_tracks('Friday', 'Shelbyville')

5895

`pd.DataFrame` was used to create a table, where:
* The column names are `['city', 'monday', 'wednesday', 'friday']`
* The data is the results returned by `number_tracks()`

In [29]:
# table with results
pd.DataFrame({
    'city': ['Springfield', 'Shelbyville'],
    'monday': [number_tracks('Monday', 'Springfield'), number_tracks('Monday', 'Shelbyville')],
    'wednesday': [number_tracks('Wednesday', 'Springfield'), number_tracks('Wednesday', 'Shelbyville')],
    'friday': [number_tracks('Friday', 'Springfield'), number_tracks('Friday', 'Shelbyville')],
})


Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
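
Because Springfield has far more plays overall, raw counts are hard to compare across cities. One way to put the two rows on the same scale is to divide each count by its city's total. This is a sketch built from the numbers in the table above, not a step from the original analysis, and the names `plays` and `shares` are illustrative.

```python
import pandas as pd

# Counts copied from the results table
plays = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

# Divide every row by its own total so each row sums to 1
shares = plays.div(plays.sum(axis=1), axis=0).round(3)
print(shares)
```

Expressed as shares, Wednesday accounts for roughly a quarter of Springfield's plays but more than a third of Shelbyville's, which matches the pattern visible in the raw counts.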


**Conclusions**

The data reveal differences in user behavior:

- In Springfield, the amount of music played peaks on Mondays and Fridays, while activity drops on Wednesdays.
- In Shelbyville, on the contrary, users listen to more music on Wednesdays. Activity on Mondays and Fridays is lower.

So the first hypothesis seems to be correct.

[Back to Index](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, Springfield users listen to genres that differ from those Shelbyville users like.

The data was split into two tables, one per city, created in the two code cells below:
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [30]:
# getting spr_general table from rows df,
# where the value in column 'city' is 'Springfield'
spr_general = df[df['city'] == 'Springfield']
spr_general 

Unnamed: 0,userid,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
61247,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
61248,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
61250,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
61251,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [31]:
# getting the shel_general table from the df rows
# where the value in column 'city' is 'Shelbyville'
shel_general = df[df['city'] == 'Shelbyville']
shel_general 

Unnamed: 0,userid,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
61239,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
61240,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
61241,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
61242,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


The genre_weekday() function was written with four parameters:
* A table with the data (`df`)
* The day of the week (`day`)
* The first timestamp, in 'HH:MM' format (`time1`)
* The last timestamp, in 'HH:MM' format (`time2`)

The function returns information about the 15 most popular genres on a given day, within the period between the two timestamps.

In [32]:
# Declaring the genre_weekday() function with df=, day=, time1=, and time2= parameters.
# It returns information about the most popular genres on a given day in a specified period.

def genre_weekday(df, day, time1, time2):
    # 1) genre_df stores only the df rows that satisfy all three conditions:
    #    - the value in the 'day' column is equal to day=
    #    - the value in the 'time' column is greater than time1=
    #    - the value in the 'time' column is less than time2=
    #    (times are zero-padded 'HH:MM:SS' strings, so string comparison follows clock order)
    genre_df = df[(df['day'] == day) & (df['time'] > time1) & (df['time'] < time2)]

    # 2) group genre_df by the 'genre' column and count the rows for each genre with count()
    genre_df_grouped = genre_df.groupby('genre')['userid'].count()

    # 3) sort in descending order so that the most popular genres come first in the Series
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)

    # 4) return a Series with the 15 most popular genres on the given day, within the given time range
    return genre_df_sorted[:15]
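
The time filter compares strings, not clock values, so it is worth checking how the boundaries behave. The sketch below uses a hypothetical helper, `in_window()`, that mirrors the two strict comparisons in the filter, and applies it to a few timestamps in the same 'HH:MM:SS' format as the 'time' column.

```python
def in_window(time, time1, time2):
    """Strict lexicographic check, as in the filter: time1 < time < time2."""
    return time1 < time < time2

# A morning timestamp falls inside the 07:00 to 11:00 window
print(in_window('08:37:09', '07:00', '11:00'))  # True

# An evening timestamp does not
print(in_window('21:20:49', '07:00', '11:00'))  # False

# Boundary cases: '07:00:00' sorts after the shorter string '07:00',
# so it is included, while '11:00:00' sorts after '11:00' and is excluded
print(in_window('07:00:00', '07:00', '11:00'))  # True
print(in_window('11:00:00', '07:00', '11:00'))  # False
```

This works only because every timestamp is zero-padded. A value such as '8:37:09' would sort after '11:00' and be dropped, so converting the column to a proper time type would be the safer choice if the format were not guaranteed.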

The results of the `genre_weekday()` function were compared for Springfield and Shelbyville on Monday morning (7am to 11am) and Friday evening (5pm to 11pm):

In [33]:
# calling the function for Monday morning in Springfield (using spr_general instead of df table)
genre_weekday (spr_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: userid, dtype: int64

In [34]:
# calling the function for Monday morning in Shelbyville (use shel_general instead of df table)
genre_weekday (shel_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: userid, dtype: int64

In [35]:
# calling the function for Friday evening in Springfield
genre_weekday (spr_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: userid, dtype: int64

In [36]:
# calling the function for Friday evening in Shelbyville
genre_weekday (shel_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: userid, dtype: int64

**Conclusion**

Having compared the 15 most listened genres on Monday morning, we can draw the following conclusions:

1. Springfield and Shelbyville users listen to similar music. The five most listened genres are the same, only rock and electronic music have switched places.

2. In Springfield, there were so many missing values that 'unknown' came in 10th. Missing values therefore account for a considerable portion of the data, which may call the reliability of the conclusions into question.

For Friday evening, the situation is similar. Individual genres vary slightly, but overall, the top 15 most listened genres are similar for the two cities.

Thus, the second hypothesis was partially proved:
* Users listen to similar music genres at the beginning and end of the week.
* There is not much difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affected the top 15. With complete data, the picture could be different.

[Back to Index](#back)

### Hypothesis 3: preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap, while Springfield residents prefer pop.

We group the spr_general table by genre and find the number of songs played for each genre with the count() method. Then we arrange the results in descending order and store them in spr_genres.

In [37]:
# In one line: we group the spr_general table by the 'genre' column,
# We count the values of 'genre' with count() in the grouping,
# Sort the resulting Series object in descending order, and store it in spr_genres
spr_genres = spr_general.groupby('genre')['genre'].count().sort_values(ascending = False)
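
A shorter route to the same counts is `value_counts()`, which groups, counts, and sorts in descending order in one call. The comparison below is a sketch on made-up data (the frame `toy` is hypothetical) showing that both approaches yield the same numbers.

```python
import pandas as pd

# Hypothetical data: three plays of pop, two of rock, one of dance
toy = pd.DataFrame({'genre': ['pop', 'rock', 'pop', 'dance', 'pop', 'rock']})

# The approach used in the cell above
by_groupby = toy.groupby('genre')['genre'].count().sort_values(ascending=False)

# The one-call alternative
by_value_counts = toy['genre'].value_counts()

print(by_groupby.to_dict() == by_value_counts.to_dict())
```

The two results can differ in the Series name and in how ties are ordered, so the explicit `groupby()` chain remains useful when those details matter.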

Displaying the first 10 lines of spr_genres:

In [38]:
# displaying the first 10 lines of spr_genres
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Then the same procedure was performed with the Shelbyville data.

The shel_general table was grouped by genre and the number of songs played for each genre was found. Then, the result was organized in descending order, storing it in the shel_genres table:

In [39]:
# in one line: we group the shel_general table by the 'genre' column,
# count the 'genre' values in the grouping with count(),
# sort the resulting Series object in descending order and store it in shel_genres
shel_genres = shel_general.groupby('genre')['genre'].count().sort_values(ascending = False)

The first 10 lines of shel_genres:

In [40]:
# displaying the first 10 lines of shel_genres
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusion**

The hypothesis was partially proved:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be equally popular in Springfield and Shelbyville. Rap wasn't in the top 5 in either city.
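
The claim that pop is equally popular in both cities can be checked by relating the pop counts to each city's total number of plays. This is a small worked calculation from numbers already reported in this notebook (the per-city totals and the top-10 tables), not an additional analysis of the raw data; the dictionary names are illustrative.

```python
# Total plays per city, from the groupby-by-city output
city_totals = {'Springfield': 42741, 'Shelbyville': 18512}

# Pop plays per city, from the two top-10 tables
pop_plays = {'Springfield': 5892, 'Shelbyville': 2431}

# Share of pop among all plays in each city, in percent
pop_share = {
    city: round(pop_plays[city] / city_totals[city] * 100, 2)
    for city in city_totals
}
print(pop_share)
```

Under these numbers, pop accounts for a little under 14% of plays in Springfield and a little over 13% in Shelbyville, which supports reading the two cities as very similar on this genre.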

[Back to Index](#back)

# Conclusions <a id='end'></a>

We tested the following three hypotheses:

1. User activity varies depending on the day of the week and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This is also true for Friday nights.
3. Springfield and Shelbyville listeners have different preferences. In Springfield, people prefer pop, while Shelbyville has more rap fans.

After analyzing the data, we conclude:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the pattern differs between the two cities.

The first hypothesis is fully accepted.

2. Music preferences do not vary significantly over the course of the week in either Springfield or Shelbyville. We can see small differences in the order on Mondays, but:
* In Springfield and Shelbyville, people listen to more pop music.

So we can accept this hypothesis. We must also bear in mind that the result could have been different if not for so many missing values.

3. It turns out that Springfield and Shelbyville users' music preferences are quite similar.

The third hypothesis was rejected. If there is any difference in preferences, it cannot be seen in this data.

### Observation
In real projects, such research involves statistical hypothesis tests, which are more precise and more quantitative. Note also that conclusions about an entire city cannot always be drawn from data from a single source.

[Back to Index](#back)