# Yandex.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: user activity in the two cities](#activity)
    * [3.2 Hypothesis 2: music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Whenever we're doing research, we need to formulate hypotheses that we can then test. Sometimes we accept these hypotheses; other times, we reject them. To make the right decisions, a business must be able to understand whether or not it's making the right assumptions.

In this project, you'll compare the music preferences of the cities of Springfield and Shelbyville. You'll study real Yandex.Music data to test the hypotheses below and compare user behavior for these two cities.

### Goal: 
Test three hypotheses:
1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

### Stages 
Data on user behavior is stored in the file `/datasets/music_project_en.csv`. There is no information about the quality of the data, so you will need to explore it before testing the hypotheses. 

First, you'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, you will try to account for the most critical problems.
 
Your project will consist of three stages:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses
 
[Back to Contents](#back)

## Stage 1. Data overview <a id='data_review'></a>

Open the data on Yandex.Music and explore it.

You'll need `pandas`, so import it. Once the file is loaded, print the first 10 table rows.

In [21]:
# importing pandas
import pandas
import numpy

Read the file `music_project_en.csv` from the `/datasets/` folder and save it in the `df` variable:

In [9]:
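# reading the file and storing it to df
try:
    df = pandas.read_csv('/datasets/music_project_en.csv')
except FileNotFoundError:
    # fall back to a local copy when the platform path is unavailable
    df = pandas.read_csv(r'C:\Users\Alar\Downloads\music_project_en.csv')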

In [4]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [5]:
df.tail(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
65069,BE1AAD74,Waterwalk,Eduardo Gonzales,electronic,Springfield,20:38:59,Monday
65070,49F35D53,Ass Up,Rameez,dance,Springfield,14:08:58,Friday
65071,92378E24,Swing it Like You Mean it,OJOJOJ,techno,Springfield,21:12:56,Friday
65072,C532021D,We Can Not Be Silenced,Pänzer,extrememetal,Springfield,08:38:24,Friday
65073,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65075,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hip,Shelbyville,10:00:00,Monday
65076,C5E3A0D5,Jalopiina,,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday
65078,3A64EF84,Tell Me Sweet Little Lies,Monica Lopez,country,Springfield,21:59:46,Friday


Get general information about the table with one command:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns. They all store the same data type: `object`.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist's name
- `'genre'`
- `'City'` — user's city
- `'time'` — the exact time the track was played
- `'Day'` — day of the week

We can see three issues with style in the column names:
1. Some names are uppercase, some are lowercase.
2. There are spaces in some names.
3. One name, `userID`, runs two words together; multi-word names should be written in snake_case (`user_id`).
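
As a minimal sketch of how these naming issues could be handled in code, the snippet below builds an empty toy table with the same seven column names and cleans them. The `toy` variable is hypothetical and only stands in for `df`. Stripping and lowercasing take care of the first two issues, while the snake_case rename still has to be spelled out by hand.

```python
import pandas

# hypothetical empty table that only mimics the column names of df
toy = pandas.DataFrame(
    columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
)

# issues 1 and 2: remove surrounding spaces and convert to lowercase
toy.columns = toy.columns.str.strip().str.lower()

# issue 3: 'userid' joins two words, so it is renamed explicitly
toy = toy.rename(columns={'userid': 'user_id'})

column_names = list(toy.columns)
print(column_names)
```

An explicit `rename()` mapping, as used later in this project, does the same job and makes every change visible at a glance.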

The number of non-null values differs from column to column. This means the data contains missing values.


### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data on a track that was played. Some columns describe the track itself: its title, artist and genre. The rest convey information about the user: the city they come from, the time they played the track. 

It's clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>

### Header style <a id='header_style'></a>
Print the column names, then rename them to fix the style issues:

In [7]:
# the list of column names in the df table
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [10]:
#renaming columns :

df = df.rename(columns={
        '  userID' : 'user_id',
        'Track' : 'track', 
        '  City  ' : 'city', 
        'Day' : 'day',
    }
)


In [11]:
#checking result:
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')


[Back to Contents](#back)

### Missing values <a id='missing_values'></a>
First, let's find the number of missing values in the table. To do so, use two `pandas` methods:

In [12]:
# calculating missing values
print(df.isna().sum()) 

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64


In [13]:
df.isnull().sum()/len(df)

user_id    0.000000
track      0.020636
artist     0.116274
genre      0.018408
city       0.000000
time       0.000000
day        0.000000
dtype: float64

Replace the missing values in `'track'`, `'artist'`, and `'genre'` with the string `'unknown'`. To do this, create the `columns_to_replace` list, loop over it with `for`, and replace the missing values in each of the columns:

In [14]:
# looping over column names and replacing missing values with 'unknown'

columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')
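
The same replacement can probably be written without a loop by selecting all three columns at once. This is only a sketch on a small hypothetical table (`toy`, with made-up rows), not the approach the task asks for.

```python
import pandas

# hypothetical two-row table with gaps in the three text columns
toy = pandas.DataFrame({
    'track': ['Pessimist', None],
    'artist': [None, 'Rameez'],
    'genre': ['dance', None],
    'city': ['Shelbyville', 'Springfield'],
})

columns_to_replace = ['track', 'artist', 'genre']

# fill the gaps in all listed columns with a single assignment
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

# total number of missing values left in the table
missing_left = int(toy.isna().sum().sum())
print(missing_left)
```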

Make sure the table contains no more missing values. Count the missing values again.

In [15]:
print(df.isna().sum()) 

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64


### Duplicates <a id='duplicates'></a>
Find the number of obvious duplicates in the table using one command:

In [16]:
# counting clear duplicates
print(df.duplicated().sum())

3826


Call the `pandas` method for getting rid of obvious duplicates:

In [17]:
# removing obvious duplicates
df = df.drop_duplicates()

In [18]:
df = df.reset_index(drop=True)

In [19]:
# checking for duplicates
print(df.duplicated().sum())

0
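
To see what counts as an obvious duplicate, here is a small sketch on a hypothetical table (`toy`). A row is flagged only when every column matches an earlier row, and `reset_index(drop=True)` closes the gaps that dropping rows leaves in the index.

```python
import pandas

# hypothetical plays: the second row repeats the first one exactly
toy = pandas.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['True', 'True', 'True', 'Chains'],
})

# rows that are identical to an earlier row
n_before = int(toy.duplicated().sum())

# drop them and renumber the remaining rows from zero
toy = toy.drop_duplicates().reset_index(drop=True)
n_after = int(toy.duplicated().sum())

print(n_before, n_after, list(toy.index))
```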


Now get rid of implicit duplicates in the `genre` column. For example, the name of a genre can be written in different ways. Such errors will also affect the result.

Print a list of unique genre names, sorted in alphabetical order. To do so:
* Retrieve the intended DataFrame column 
* Apply a sorting method to it
* For the sorted column, call the method that will return all unique column values

In [23]:
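# viewing genre names in alphabetical order
# (this prints the whole sorted column; chain .unique() after sort_values() to list each genre once)
print(df['genre'].sort_values())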

31916         acid
3636      acoustic
20264     acoustic
37784     acoustic
14611     acoustic
           ...    
6158         world
17963        world
27939    worldbeat
39063    worldbeat
8448           ïîï
Name: genre, Length: 61253, dtype: object
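
A minimal sketch of the three steps listed above (take the column, sort it, ask for unique values), using a short hypothetical Series in place of `df['genre']`:

```python
import pandas

# hypothetical genre column with a few spellings of the same genre
toy = pandas.Series(
    ['rock', 'hip', 'pop', 'hiphop', 'rock', 'hip-hop', 'hop'],
    name='genre',
)

# sort first, then keep each value once; the order of first appearance is preserved
genre_list = list(toy.sort_values().unique())
print(genre_list)
```

Sorting first puts similar spellings next to each other, which makes implicit duplicates easier to spot.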


Look through the list to find implicit duplicates of the genre `hiphop`. These could be names written incorrectly or alternative names of the same genre.

You will see the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To get rid of them, declare the function `replace_wrong_genres()` with two parameters: 
* `wrong_genres=` — the list of duplicates
* `correct_genre=` — the string with the correct value

The function should correct the names in the `'genre'` column from the `df` table, i.e. replace each value from the `wrong_genres` list with the value in `correct_genre`.

In [21]:
# function for replacing implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    # replace every misspelled or alternative name with the correct one
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)


In [22]:
df.genre

0              rock
1              rock
2               pop
3              folk
4             dance
            ...    
61248           rnb
61249           hip
61250    industrial
61251          rock
61252       country
Name: genre, Length: 61253, dtype: object

Call `replace_wrong_genres()` and pass it arguments so that it clears implicit duplicates (`hip`, `hop`, and `hip-hop`) and replaces them with `hiphop`:

In [None]:
# removing implicit duplicates
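
A minimal sketch of how the function and its call might look, run on a small hypothetical table so that the effect is easy to check. The toy `df` below stands in for the real one, and the function body is one possible implementation of the description above.

```python
import pandas

# hypothetical stand-in for the real df
df = pandas.DataFrame({'genre': ['rock', 'hip', 'hop', 'hip-hop', 'hiphop', 'pop']})

def replace_wrong_genres(wrong_genres, correct_genre):
    # replace each exact match from wrong_genres with correct_genre
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

replace_wrong_genres(['hip', 'hop', 'hip-hop'], 'hiphop')

remaining = sorted(df['genre'].unique())
print(remaining)
```

In the outputs recorded in this notebook, `hip` still appears as a separate genre next to `hiphop`, which suggests the replacement was not applied before the later cells were run.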

Make sure the duplicate names were removed. Print the list of unique values from the `'genre'` column:

In [23]:
# checking for implicit duplicates
print(df['genre'].unique())

['rock' 'pop' 'folk' 'dance' 'rusrap' 'ruspop' 'world' 'electronic'
 'unknown' 'alternative' 'children' 'rnb' 'hip' 'jazz' 'postrock' 'latin'
 'classical' 'metal' 'reggae' 'triphop' 'blues' 'instrumental' 'rusrock'
 'dnb' 'türk' 'post' 'country' 'psychedelic' 'conjazz' 'indie'
 'posthardcore' 'local' 'avantgarde' 'punk' 'videogame' 'techno' 'house'
 'christmas' 'melodic' 'caucasian' 'reggaeton' 'soundtrack' 'singer' 'ska'
 'salsa' 'ambient' 'film' 'western' 'rap' 'beats' "hard'n'heavy"
 'progmetal' 'minimal' 'tropical' 'contemporary' 'new' 'soul' 'holiday'
 'german' 'jpop' 'spiritual' 'urban' 'gospel' 'nujazz' 'folkmetal'
 'trance' 'miscellaneous' 'anime' 'hardcore' 'progressive' 'korean'
 'numetal' 'vocal' 'estrada' 'tango' 'loungeelectronic' 'classicmetal'
 'dubstep' 'club' 'deep' 'southern' 'black' 'folkrock' 'fitness' 'french'
 'disco' 'religious' 'hiphop' 'drum' 'extrememetal' 'türkçe'
 'experimental' 'easy' 'metalcore' 'modern' 'argentinetango' 'old' 'swing'
 'breaks' 'eurofolk' 

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All missing values have been replaced with `'unknown'`. But we still have to see whether the missing values in `'genre'` will affect our calculations.

The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to testing hypotheses. 

[Back to Contents](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. Test this using the data on three days of the week: Monday, Wednesday, and Friday.

* Divide the users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday, and Friday.


In [24]:
#Divide the users into groups by city
users_per_city = df.groupby('city')['user_id'].count()
print(users_per_city) 

city
Shelbyville    18512
Springfield    42741
Name: user_id, dtype: int64


In [25]:
#Compare how many tracks each group played on Monday, Wednesday, and Friday
songs_per_day = df.groupby('day')['track'].count()
print(songs_per_day) 

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64


In [26]:
# Count the tracks played for each combination of city and day
df1 = df.groupby(['city','day'])['track'].count()
print(df1)

city         day      
Shelbyville  Friday        5895
             Monday        5614
             Wednesday     7003
Springfield  Friday       15945
             Monday       15740
             Wednesday    11056
Name: track, dtype: int64


For the sake of practice, perform each computation separately. 

Evaluate user activity in each city. Group the data by city and find the number of songs played in each group.



Springfield has more tracks played than Shelbyville. But that does not imply that citizens of Springfield listen to music more often. This city is simply bigger, and there are more users.

Now group the data by day of the week and find the number of tracks played on Monday, Wednesday, and Friday.


In [27]:
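# Counting tracks for each day and timestamp
# (grouping by 'time' as well gives one count per second; the per-day totals are in the output of In [25] above)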
track_play = df.groupby(['day', 'time'])['track'].count()
print(track_play)

day        time    
Friday     08:00:00    1
           08:00:03    2
           08:00:04    2
           08:00:05    1
           08:00:07    1
                      ..
Wednesday  22:00:49    1
           22:00:54    2
           22:00:55    1
           22:00:56    1
           22:00:59    2
Name: track, Length: 39587, dtype: int64


Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

You have seen how grouping by city or day works. Now write a function that will group by both.

Create the `number_tracks()` function to calculate the number of songs played for a given day and city. It will require two parameters:
* day of the week
* name of the city

In the function, use a variable to store the rows from the original table, where:
  * `'day'` column value is equal to the `day` parameter
  * `'city'` column value is equal to the `city` parameter

Apply consecutive filtering with logical indexing.

Then count the `'user_id'` column values in the resulting table. Store the result to a new variable. Return this variable from the function.

In [28]:
# <creating the function number_tracks()>
# We'll declare a function with two parameters: day=, city=.
# Let the track_list variable store the df rows where
# the value in the 'day' column is equal to the day= parameter and, at the same time, 
# the value in the 'city' column is equal to the city= parameter (apply consecutive filtering 
# with logical indexing).
# Let the track_list_count variable store the number of 'user_id' column values in track_list
# (found with the count() method).
# Let the function return a number: the value of track_list_count.
def number_tracks(day, city):
    track_list = df.loc[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count
# The function counts tracks played for a certain city and day.
# It keeps the rows that match both the intended day and the intended city
# (one combined boolean mask, equivalent to filtering twice in a row),
# then finds the number of 'user_id' values in the filtered table,
# then returns that number.
# To see what it returns, wrap the function call in print().

Call `number_tracks()` six times, changing the parameter values, so that you retrieve the data on both cities for each of the three days.

In [29]:
# the number of songs played in Springfield on Monday
day = 'Monday'
city ='Springfield'
number_tracks(day, city)

15740

In [30]:
# the number of songs played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614

In [31]:
# the number of songs played in Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056

In [32]:
# the number of songs played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003

In [33]:
# the number of songs played in Springfield on Friday
number_tracks('Friday', 'Springfield')

15945

In [34]:
# the number of songs played in Shelbyville on Friday
number_tracks('Friday', 'Shelbyville')

5895
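
The six calls follow one pattern, so they could also be produced with two nested loops. The sketch below uses a hypothetical helper, `number_tracks_in()`, which takes the table as an explicit argument, and a made-up five-row table, so the counts are illustrative only.

```python
import pandas

# hypothetical plays used only to illustrate the pattern
toy = pandas.DataFrame({
    'user_id': ['A1', 'B2', 'C3', 'D4', 'E5'],
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Monday', 'Monday', 'Friday', 'Wednesday'],
})

def number_tracks_in(table, day, city):
    # keep the rows for the given day and city, then count them
    selected = table[(table['day'] == day) & (table['city'] == city)]
    return int(selected['user_id'].count())

# one entry per (city, day) pair
counts = {}
for city in ['Springfield', 'Shelbyville']:
    for day in ['Monday', 'Wednesday', 'Friday']:
        counts[(city, day)] = number_tracks_in(toy, day, city)

print(counts)
```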

Use `pandas.DataFrame` to create a table, where:
* Column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the results you got from `number_tracks()`

In [19]:
# table with results
datamusic = [
    ['Springfield', 15740, 11056, 15945],
    ['Shelbyville', 5614, 7003, 5895],
]

columns = ['city','monday','wednesday','friday']
city_table = pandas.DataFrame(data=datamusic, columns=columns)
print(city_table)

          city  monday  wednesday  friday
0  Springfield   15740      11056   15945
1  Shelbyville    5614       7003    5895
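
Typing the numbers in by hand works for six values, but a city-by-day table can also be built directly from the data. A possible sketch with `pandas.crosstab()`, shown on a hypothetical five-row table rather than the real `df`:

```python
import pandas

# hypothetical plays; the real table would be df
toy = pandas.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Monday', 'Monday', 'Friday', 'Wednesday'],
})

# rows are cities, columns are days, cells are the number of plays
table = pandas.crosstab(toy['city'], toy['day'])
print(table)
```

The columns come out in alphabetical order (Friday, Monday, Wednesday), so they may need reordering for display.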


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays, while on Wednesday there is a decrease in activity.
- In Shelbyville, on the contrary, users listen to music more on Wednesday. User activity on Monday and Friday is smaller.

So the first hypothesis seems to be correct.
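
Because Springfield has far more plays overall, the pattern is easier to compare when each day is expressed as a share of the city's own total. The sketch below reuses the six counts obtained above; the `shares`, `busiest`, and `quietest` names are introduced here for illustration.

```python
import pandas

# counts taken from the results table above
city_table = pandas.DataFrame(
    data=[
        ['Springfield', 15740, 11056, 15945],
        ['Shelbyville', 5614, 7003, 5895],
    ],
    columns=['city', 'monday', 'wednesday', 'friday'],
).set_index('city')

# divide every row by its own total so that each row sums to 1
shares = city_table.div(city_table.sum(axis=1), axis=0)

# day with the largest and the smallest share in each city
busiest = shares.idxmax(axis=1)
quietest = shares.idxmin(axis=1)

print(shares.round(3))
print(busiest)
print(quietest)
```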

[Back to Contents](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, residents of Springfield listen to genres that differ from the ones Shelbyville users enjoy.

Get tables (make sure that the name of your combined table matches the DataFrame given in the two code blocks below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [36]:
# create the spr_general table from the df rows, 
# where the value in the 'city' column is 'Springfield'
spr_general=df[df['city']=='Springfield']

print(spr_general)

        user_id                          track                   artist  \
1      55204538    Delayed Because of Accident         Andreas Rönnberg   
4      E2DC1FAE                    Soul People               Space Echo   
6      4CB90AA5                           True             Roman Messer   
7      F03E1C1F               Feeling This Way          Polina Griffith   
8      8FA1D3BE                       L’estate              Julia Dalia   
...         ...                            ...                      ...   
61247  83A474E7  I Worship Only What You Bleed  The Black Dahlia Murder   
61248  729CBB09                        My Name                   McLean   
61250  C5E3A0D5                      Jalopiina                  unknown   
61251  321D0506                  Freight Train            Chas McDevitt   
61252  3A64EF84      Tell Me Sweet Little Lies             Monica Lopez   

              genre         city      time        day  
1              rock  Springfield  14:07:09 

In [37]:
# create the shel_general table from the df rows,
# where the value in the 'city' column is 'Shelbyville'
shel_general=df[df['city']=='Shelbyville']

print(shel_general)

Write the `genre_weekday()` function with four parameters:
* A table for data (`df`)
* The day of the week (`day`)
* The first timestamp, in 'hh:mm' format (`time1`)
* The last timestamp, in 'hh:mm' format (`time2`)

The function should return info on the 15 most popular genres on a given day within the period between the two timestamps.
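
One detail worth checking first: the `'time'` column stores strings such as `'08:34:34'`, while the timestamps are passed as `'hh:mm'`. Zero-padded time strings compare in chronological order, so plain `>` and `<` should work. The sketch below illustrates this on a hypothetical Series of times.

```python
import pandas

# hypothetical timestamps in the same 'hh:mm:ss' format as the 'time' column
times = pandas.Series(
    ['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00', '21:20:49']
)

# string comparison goes character by character, left to right
mask = (times > '07:00') & (times < '11:00')
selected = times[mask].tolist()
print(selected)
```

A longer string that starts with the shorter one counts as greater, so `'07:00:00'` passes the `> '07:00'` check while `'11:00:00'` fails the `< '11:00'` check.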

In [38]:
# 1) Let the genre_df variable store the rows that meet several conditions:
#    - the value in the 'day' column is equal to the value of the day= argument
#    - the value in the 'time' column is greater than the value of the time1= argument
#    - the value in the 'time' column is smaller than the value of the time2= argument
#    Use consecutive filtering with logical indexing.


# 2) Group genre_df by the 'genre' column, take one of its columns, 
#    and use the count() method to find the number of entries for each of 
#    the represented genres; store the resulting Series to the
#    genre_df_count variable

# 3) Sort genre_df_count in descending order of frequency and store the result
#    to the genre_df_sorted variable

# 4) Return a Series object with the first 15 genre_df_sorted value - the 15 most
#    popular genres (on a given day, within a certain timeframe)

# Write your function here
def genre_weekday(df, day, time1, time2):
    # consecutive filtering:
    # keep only the rows where the day is equal to day=
    genre_df = df[df["day"]==day]
    # keep only the rows where the time is greater than time1=
    genre_df = genre_df[genre_df['time']>time1]
    # keep only the rows where the time is smaller than time2=
    genre_df = genre_df[genre_df['time']<time2]
    # group by genre and count the rows for each genre
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    # sort in descending order so that the most popular genres come first
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    # return the 15 most popular genres on the given day in the given timeframe
    return genre_df_sorted[:15]

Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 7AM to 11AM) and on Friday evening (from 17:00 to 23:00):

In [39]:
# calling the function for Monday morning in Springfield (use spr_general instead of the df table)
genre_weekday(spr_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hip            281
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [40]:
# calling the function for Monday morning in Shelbyville (use shel_general instead of the df table)
genre_weekday(shel_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hip             79
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [41]:
# calling the function for Friday evening in Springfield
genre_weekday(spr_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hip            267
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [42]:
# calling the function for Friday evening in Shelbyville
genre_weekday(shel_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hip             94
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Conclusion**

Having compared the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same; only rock and electronic have switched places.

2. In Springfield, the number of missing values turned out to be so big that the value `'unknown'` came in 10th. This means that missing values make up a considerable portion of the data, which may be a basis for questioning the reliability of our conclusions.

For Friday evening, the situation is similar. Individual genres vary somewhat, but on the whole, the top 15 is similar for the two cities.

Thus, the second hypothesis has been partially proven true:
* Users listen to similar music at the beginning and end of the week.
* There is no major difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. Were we not missing these values, things might look different.
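
The claim about the Monday-morning top five can be checked mechanically. The sketch below copies the first five genres from the two Monday outputs above into plain lists (the variable names are introduced here) and compares them.

```python
# top five genres on Monday morning, copied from the outputs above
spr_monday_top5 = ['pop', 'dance', 'electronic', 'rock', 'hip']
shel_monday_top5 = ['pop', 'dance', 'rock', 'electronic', 'hip']

# the same genres appear in both lists, regardless of order
same_genres = set(spr_monday_top5) == set(shel_monday_top5)

# genres whose rank differs between the two cities
moved = [
    genre for genre in spr_monday_top5
    if spr_monday_top5.index(genre) != shel_monday_top5.index(genre)
]

print(same_genres, moved)
```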

[Back to Contents](#back)

### Hypothesis 3: genre preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

Group the `spr_general` table by genre and find the number of songs played for each genre with the `count()` method. Then sort the result in descending order and store it to `spr_genres`.

In [43]:
# on one line: group the spr_general table by the 'genre' column, 
# count the 'genre' values with count() in the grouping, 
# sort the resulting Series in descending order, and store it to spr_genres
spr_genres=spr_general.groupby('genre')['genre'].count().sort_values(ascending = False)

Print the first 10 rows from `spr_genres`:

In [44]:
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hip            2041
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now do the same with the data on Shelbyville.

Group the `shel_general` table by genre and find the number of songs played for each genre. Then sort the result in descending order and store it to the `shel_genres` table:


In [45]:
# on one line: group the shel_general table by the 'genre' column, 
# count the 'genre' values in the grouping with count(), 
# sort the resulting Series in descending order and store it to shel_genres
shel_genres=shel_general.groupby('genre')['genre'].count().sort_values(ascending=False)

Print the first 10 rows of `shel_genres`:

In [46]:
# printing the first 10 rows from shel_genres
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hip             934
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusion**

The hypothesis has been partially proven true:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be equally popular in Springfield and Shelbyville, and rap wasn't in the top 5 for either city.
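
Raw counts are hard to compare because Springfield has more plays overall. As a rough check, the sketch below divides the pop counts from the two tables above by the total number of plays per city found earlier (42741 and 18512). The dictionary names are introduced here for illustration.

```python
# total plays per city and pop plays per city, taken from the outputs above
plays = {'Springfield': 42741, 'Shelbyville': 18512}
pop_plays = {'Springfield': 5892, 'Shelbyville': 2431}

# share of pop among all plays in each city
pop_share = {city: pop_plays[city] / plays[city] for city in plays}

for city, share in pop_share.items():
    print(f'{city}: {share:.1%}')
```

Both shares come out between 13% and 14%, which is consistent with pop being about equally popular in the two cities.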


[Back to Contents](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and from city to city.
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings.
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

After analyzing the data, we concluded:

1. User activity in Springfield and Shelbyville depends on the day of the week, though the cities vary in different ways. 

The first hypothesis is fully accepted.

2. Musical preferences do not vary significantly over the course of the week in either Springfield or Shelbyville. We can see small differences in order on Mondays, but:
* In Springfield and Shelbyville, people listen to pop music most.

So we can't accept this hypothesis. We must also keep in mind that the result could have been different if not for the missing values.

3. It turns out that the musical preferences of users from Springfield and Shelbyville are quite similar.

The third hypothesis is rejected. If there is any difference in preferences, it cannot be seen from this data.

### Note 
In real projects, research involves statistical hypothesis testing, which is more precise and more quantitative. Also note that you cannot always draw conclusions about an entire city based on the data from just one source.
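
To give a flavor of what such a test might look like, here is a rough sketch of a chi-square statistic for the city-by-day counts from the first hypothesis, written in plain Python. It assumes every play is an independent observation, which is unlikely to hold here because one user can play many tracks, so treat it as an illustration and not as a result of this project.

```python
# observed plays per city and day, taken from the first hypothesis
observed = {
    'Springfield': {'monday': 15740, 'wednesday': 11056, 'friday': 15945},
    'Shelbyville': {'monday': 5614, 'wednesday': 7003, 'friday': 5895},
}
days = ['monday', 'wednesday', 'friday']

row_totals = {city: sum(counts.values()) for city, counts in observed.items()}
col_totals = {day: sum(observed[city][day] for city in observed) for day in days}
grand_total = sum(row_totals.values())

# compare each observed count with the count expected if city and day were unrelated
chi_square = 0.0
for city in observed:
    for day in days:
        expected = row_totals[city] * col_totals[day] / grand_total
        chi_square += (observed[city][day] - expected) ** 2 / expected

# critical value of the chi-square distribution for 2 degrees of freedom at the 5% level
critical_value = 5.991

print(round(chi_square, 1), chi_square > critical_value)
```

A statistic above the critical value would point to a relationship between city and day of the week, under the independence assumption stated above.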

You will study hypothesis testing in the sprint on statistical data analysis.

[Back to Contents](#back)