# Yandex.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: user activity in the two cities](#activity)
    * [3.2 Hypothesis 2: music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Whenever we're doing research, we need to formulate hypotheses that we can then test. Sometimes we accept these hypotheses; other times, we reject them. To make the right decisions, a business must be able to understand whether or not it's making the right assumptions.

In this project, you'll compare the music preferences of the cities of Springfield and Shelbyville. You'll study real Yandex.Music data to test the hypotheses below and compare user behavior for these two cities.

### Goal: 
Test three hypotheses:
1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

### Stages 
Data on user behavior is stored in the file `/datasets/music_project_en.csv`. There is no information about the quality of the data, so you will need to explore it before testing the hypotheses. 

First, you'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, you will try to account for the most critical problems.
 
Your project will consist of three stages:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses
 
[Back to Contents](#back)

## Stage 1. Data overview <a id='data_review'></a>

Let's import pandas, load the data, and take a first look at it.

In [1]:
import pandas as pd # importing pandas


In [2]:
df = pd.read_csv('/datasets/music_project_en.csv') # reading the file and storing it to df



In [None]:
df.head(10) # obtaining the first 10 rows from the df table

In [None]:
# obtaining general information about the data in df
df.info()

Our data has seven columns and 65079 rows. Each row in the table stores data on a track that was played. Some columns describe the track itself: its title, artist, and genre. The rest convey information about the user: the city they come from and the time they played the track.
The data isn't in perfect condition. We can see the following problems:
1. Some column names are uppercase, some are lowercase.
2. There are spaces in some column names.
3. Some column names are unclear and should be made more descriptive.
4. Some columns have an incorrect data type.
5. The data contains missing values.

To move forward, we need to preprocess the data.

[Back to Contents](#back)

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>

### Header style <a id='header_style'></a>

In [None]:
df = df.rename(
    columns={
        '  userID' : 'user_id',
        'Track' : 'track',
        '  City  ' : 'city',
        'Day' : 'day',
    } 
) # renaming columns

In [None]:
df.columns # the list of column names in the df table
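As a complement to the explicit `rename()` call above, the same cleanup can be sketched programmatically. This is a minimal illustration on a toy table, not the project data: the values are made up, and only the four problematic header names are taken from the cell above. Stripping and lowercasing turns `userID` into `userid`, so one explicit rename is still needed to get `user_id`.

```python
import pandas as pd

# Toy table (hypothetical values) with the same header problems as the real file
toy = pd.DataFrame(
    [["u1", "song a", "Springfield", "Monday"]],
    columns=["  userID", "Track", "  City  ", "Day"],
)

# Remove surrounding spaces and convert every header to lowercase
toy.columns = toy.columns.str.strip().str.lower()

# 'userid' still lacks the underscore, so rename it explicitly
toy = toy.rename(columns={"userid": "user_id"})

print(list(toy.columns))
```

This approach scales better when many headers are affected, while the explicit dictionary keeps full control over each new name.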

[Back to Contents](#back)

### Missing values <a id='missing_values'></a>

In [None]:
df.isna().sum()
# calculating missing values

The columns `track`, `artist` and `genre` contain missing values. Not all missing values affect the research. For instance, the missing values in `track` and `artist` are not critical. But missing values in `genre` can affect the comparison of music preferences in Springfield and Shelbyville. So we'll replace them with a clear marker, for example the string 'unknown'.


In [None]:
# looping over column names and replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [None]:
df.isna().sum() # counting missing values

Now we can be sure that all the missing values have been filled in.
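The loop above can also be written as a single assignment over several columns. The sketch below uses a small made-up table (all values are hypothetical) to show the idea and to confirm that no missing values remain afterwards.

```python
import pandas as pd

# Toy table with gaps in the three text columns (hypothetical values)
toy = pd.DataFrame({
    "track": ["song a", None, "song c"],
    "artist": [None, "artist b", "artist c"],
    "genre": ["pop", None, "rock"],
    "city": ["Springfield", "Shelbyville", "Springfield"],
})

columns_to_replace = ["track", "artist", "genre"]

# Fill all selected columns in one step instead of looping
toy[columns_to_replace] = toy[columns_to_replace].fillna("unknown")

# Count the missing values that are left in each column
missing_after = toy.isna().sum()
print(missing_after)
```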

[Back to Contents](#back)

### Duplicates <a id='duplicates'></a>

Now we'll check whether there are any duplicates in our data and drop them if we find any.

In [None]:
# counting clear duplicates
df.duplicated().sum()

In [None]:
# removing obvious duplicates
df = df.drop_duplicates()

In [None]:
# checking for duplicates
df.duplicated().sum()

We can now be sure that our data doesn't contain obvious duplicates. But what about implicit duplicates? Let's check the `genre` column: are there genres that are written in different ways but are actually the same? Such errors will also affect the result. Let's print a list of unique genre names, sorted in alphabetical order.

In [None]:
# viewing unique genre names
df['genre'].sort_values().unique()

Looking through the list, we can find implicit duplicates of the genre `hiphop`. The names `hip`, `hop` and `hip-hop` are all variants of one genre, `hiphop`. Let's unify all the variants under the name `hiphop` and check that we've done everything right.

In [None]:
# function for replacing implicit duplicates
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'


def replace_wrong_genres(wrong_genres, correct_genre):
    # replace every listed variant in the 'genre' column of the global df
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)



In [None]:
# removing implicit duplicates
replace_wrong_genres(wrong_genres,correct_genre)

In [None]:
# checking for implicit duplicates
sorted(df.genre.unique())

Yes, we did everything right.
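For comparison, here is a hedged sketch of the same unification without a loop and without relying on a global table. `Series.replace()` accepts a list of values to replace, so all variants can be mapped in one call. The helper name `unify_genres` and the toy values are hypothetical and used only for illustration.

```python
import pandas as pd


def unify_genres(genre_column, wrong_genres, correct_genre):
    """Return a copy of the column with every listed variant replaced."""
    # replace() matches whole values, so 'hiphop' itself is left untouched
    return genre_column.replace(wrong_genres, correct_genre)


# Toy genre column containing the variants found in the data
genres = pd.Series(["hip", "hop", "hip-hop", "hiphop", "pop"], name="genre")
unified = unify_genres(genres, ["hip", "hop", "hip-hop"], "hiphop")

print(sorted(unified.unique()))
```

Passing the column in and returning a new one makes the function easier to reuse and to test than a version that modifies `df` in place.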

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been renamed to make processing the data easier.

All missing values have been replaced with 'unknown'. But we still have to see whether the missing values in `genre` will affect our calculations.

We have dealt with duplicates. The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to testing hypotheses. 

[Back to Contents](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. We need to test this thesis using the data on three days of the week: Monday, Wednesday, and Friday. So we will:

* Divide the users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday, and Friday.


In [None]:
# Counting up the tracks played in each city
df.groupby(['city'])['track'].count()

Springfield has more tracks played than Shelbyville. But that does not imply that citizens of Springfield listen to music more often. This city is simply bigger, and there are more users.

Now we will group the data by day of the week and find the number of tracks played on Monday, Wednesday, and Friday.

In [None]:
# Calculating tracks played on each of the three days
df.groupby(['day'])['track'].count()

Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

We have seen how grouping by city or day works. Now we'll write a function that will group by both.

In [None]:
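# creating the function number_tracks()
def number_tracks(day, city):
    # keep only the rows for the given day and city
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    # count() skips missing values, so this counts the plays that have a user_id
    track_list_count = track_list['user_id'].count()
    return track_list_count


number_tracks('Monday', 'Springfield')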


In [None]:
# the number of songs played in Springfield on Monday
sp_m = number_tracks('Monday','Springfield')
sp_m

In [None]:
# the number of songs played in Shelbyville on Monday
sh_m = number_tracks('Monday','Shelbyville')
sh_m

In [None]:
# the number of songs played in Springfield on Wednesday

sp_w = number_tracks('Wednesday','Springfield')
sp_w

In [None]:
# the number of songs played in Shelbyville on Wednesday
sh_w = number_tracks('Wednesday','Shelbyville')
sh_w

In [None]:
# the number of songs played in Springfield on Friday
sp_f = number_tracks('Friday','Springfield')
sp_f

In [None]:
# the number of songs played in Shelbyville on Friday
sh_f = number_tracks('Friday','Shelbyville')
sh_f

For our convenience, let's create one table with the results of our calculations.

In [None]:
# table with results
dict_option = {
    'city': ['Springfield', 'Shelbyville'],
    'monday': [number_tracks('Monday', 'Springfield'), number_tracks('Monday', 'Shelbyville')],
    'wednesday': [number_tracks('Wednesday', 'Springfield'), number_tracks('Wednesday', 'Shelbyville')],
    'friday': [number_tracks('Friday', 'Springfield'), number_tracks('Friday', 'Shelbyville')],
}
tot_dataframe = pd.DataFrame(dict_option)
tot_dataframe
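A city-by-day table like the one above can also be produced in one step with `pd.crosstab()`. The sketch below runs on a small made-up table (the counts are hypothetical, not the project results). Because Springfield has more users, it also shows row-wise shares via `normalize="index"`, which may make the two cities easier to compare.

```python
import pandas as pd

# Toy plays: 5 in Springfield, 3 in Shelbyville (hypothetical values)
toy = pd.DataFrame({
    "city": ["Springfield"] * 5 + ["Shelbyville"] * 3,
    "day": [
        "Monday", "Monday", "Wednesday", "Friday", "Friday",
        "Monday", "Wednesday", "Wednesday",
    ],
})

day_order = ["Monday", "Wednesday", "Friday"]

# Number of plays for every city/day combination
counts = pd.crosstab(toy["city"], toy["day"])[day_order]

# Share of each city's plays that falls on each day (each row sums to 1)
shares = pd.crosstab(toy["city"], toy["day"], normalize="index")[day_order]

print(counts)
print(shares)
```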


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of tracks played peaks on Monday and Friday, while on Wednesday there is a decrease in activity.
- In Shelbyville, on the contrary, users listen to music more on Wednesday. User activity on Monday and Friday is lower.

So the first hypothesis seems to be correct.

[Back to Contents](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday evening, Springfield residents listen to different genres than Shelbyville residents do. Let's create two groups of listeners: Springfield residents and Shelbyville residents.

In [None]:
# create the spr_general table from the df rows, 
# where the value in the 'city' column is 'Springfield'
spr_general = df[df['city'] == 'Springfield']
spr_general

In [None]:
# create the shel_general from the df rows,
# where the value in the 'city' column is 'Shelbyville'
shel_general = df[df['city'] == 'Shelbyville']
shel_general

Let's create a function that returns the 15 most popular genres on a given day within the period between two timestamps.

In [None]:
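def genre_weekday(my_df, day, time1, time2):
    # keep only the rows for the given day
    genre_df = my_df[my_df['day'] == day]
    # keep only the rows played strictly between time1 and time2
    # (the bounds are passed as 'HH:MM:SS' strings)
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]
    # count the plays per genre and sort in descending order
    genre_df_count = genre_df.groupby(['genre'])['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    # return the 15 most popular genres
    return genre_df_sorted[:15]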


In [None]:
result = genre_weekday(df,'Monday','07:00:00','11:00:00')
result
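The filtering logic of `genre_weekday()` can be checked on a tiny example. The sketch below assumes, as the calls above suggest, that times are stored as zero-padded `'HH:MM:SS'` strings, in which case comparing the strings gives the same order as comparing the times. The function name `top_genres` and the toy rows are hypothetical. It combines the three conditions into one mask and uses `value_counts()`, which already sorts in descending order.

```python
import pandas as pd

# Toy plays (hypothetical values); times are zero-padded 'HH:MM:SS' strings
toy = pd.DataFrame({
    "day": ["Monday"] * 5 + ["Friday"],
    "time": ["07:30:00", "08:15:00", "10:59:59", "11:00:00", "06:59:59", "08:00:00"],
    "genre": ["pop", "pop", "rock", "dance", "rock", "pop"],
})


def top_genres(data, day, time1, time2, n=15):
    """Return the n most played genres on `day` strictly between time1 and time2."""
    mask = (data["day"] == day) & (data["time"] > time1) & (data["time"] < time2)
    return data.loc[mask, "genre"].value_counts().head(n)


result = top_genres(toy, "Monday", "07:00:00", "11:00:00")
print(result)
```

Note that both bounds are exclusive: the play at exactly `11:00:00` is left out, just as it would be in `genre_weekday()`.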

Let's compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 07:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [None]:
# calling the function for Monday morning in Springfield
spr_mon = genre_weekday(spr_general,'Monday','07:00:00','11:00:00')
spr_mon

In [None]:
# calling the function for Monday morning in Shelbyville
shel_mon = genre_weekday(shel_general,'Monday','07:00:00','11:00:00')
shel_mon

In [None]:
# calling the function for Friday evening in Springfield
spr_fr = genre_weekday(spr_general,'Friday','17:00:00','23:00:00')
spr_fr

In [None]:
# calling the function for Friday evening in Shelbyville
shel_fr = genre_weekday(shel_general,'Friday','17:00:00','23:00:00')
shel_fr

**Conclusion**

Having compared the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same, only rock and electronic have switched places.

2. In Springfield the number of missing values turned out to be so big that the value `'unknown'` came in 10th place. This means that missing values make up a considerable portion of the data, which may be a basis for questioning the reliability of our conclusions.

For Friday evening, the situation is similar. Individual genres vary, but in general, the top 15 is similar for both cities.

Thus, the second hypothesis has been refuted:
* Users listen to similar music at the beginning and the end of the week.
* There is no big difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. If we didn't have these missing values, things might look different.
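One way to quantify this concern is to measure what share of plays in each city has the genre `'unknown'`. The sketch below shows the calculation on a made-up table. The numbers are hypothetical and say nothing about the real data.

```python
import pandas as pd

# Toy plays (hypothetical values): 4 in Springfield, 2 in Shelbyville
toy = pd.DataFrame({
    "city": ["Springfield"] * 4 + ["Shelbyville"] * 2,
    "genre": ["pop", "unknown", "rock", "unknown", "pop", "rock"],
})

# The mean of a True/False column is the share of True values,
# so this gives the fraction of plays with an unknown genre per city
unknown_share = (toy["genre"] == "unknown").groupby(toy["city"]).mean()

print(unknown_share)
```

If the share turned out to be small in both cities, the top-15 comparison would be more trustworthy. A large or uneven share would support the caveat above.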

[Back to Contents](#back)

### Hypothesis 3: genre preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

In [None]:
# group the spr_general table by the 'genre' column,
# count the tracks played in each genre with count(),
# sort the resulting Series in descending order, and store it to spr_genres
spr_genres = spr_general.groupby(['genre'])['track'].count()
spr_genres = spr_genres.sort_values(ascending=False)

In [None]:
# printing the first 10 rows of spr_genres
spr_genres.head(10)

In [None]:
# group the shel_general table by the 'genre' column,
# count the tracks played in each genre with count(),
# sort the resulting Series in descending order, and store it to shel_genres
shel_genres = shel_general.groupby(['genre'])['track'].count()
shel_genres = shel_genres.sort_values(ascending=False)

In [None]:
# printing the first 10 rows from shel_genres
shel_genres.head(10)
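Because the two cities differ in size, raw counts are hard to compare directly. A possible complement, sketched here on made-up data, is to look at each genre's share of plays within its city. The toy values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy plays (hypothetical values): 6 in Springfield, 3 in Shelbyville
toy = pd.DataFrame({
    "city": ["Springfield"] * 6 + ["Shelbyville"] * 3,
    "genre": ["pop", "pop", "pop", "rock", "rock", "rap", "pop", "pop", "rap"],
})

# Share of each genre within its city: one row per city, one column per genre
genre_shares = (
    toy.groupby("city")["genre"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

print(genre_shares)
```

In this toy example, pop has more plays in Springfield (3 versus 2) but a larger share in Shelbyville (two thirds versus one half), which is the kind of difference that raw counts can hide.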

**Conclusion**

The hypothesis has been partially proven true:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be equally popular in Springfield and Shelbyville. Rap wasn't in the top 5 for Springfield's citizens.

[Back to Contents](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different music preferences. 

After analyzing the data, we concluded:

1. User activity in Springfield and Shelbyville depends on the day of the week. In Springfield, people listen to music more on Mondays and Fridays; in Shelbyville, on the contrary, users listen to music more on Wednesdays.
The first hypothesis is fully accepted.

2. Musical preferences do not vary significantly between the beginning and the end of the week in either Springfield or Shelbyville. In both cities, people listen to pop music the most.
    So we can't accept this hypothesis. We must also keep in mind that the result could have been different if there were no missing values.

3. It turns out that the musical preferences of users from Springfield and Shelbyville are quite similar.
The third hypothesis is rejected. 

[Back to Contents](#back)