# The final project for 'Data Analysis with Python' course by Yandex.Praktikum

**Yandex is a Russian multinational corporation specializing in Internet-related products and services, including transportation, search and information services, eCommerce, navigation, mobile applications, and online advertising. Yandex provides over 70 services in total. Incorporated in the Netherlands, Yandex primarily serves audiences in Russia and the CIS.** 

**In the final project I have to answer the following question for the Yandex.Music streaming service: 
"What music do people usually listen to while commuting to work on Monday morning, Wednesday, or Friday evening? Compare Moscow and Saint-Petersburg residents' music preferences."**

Let's have a look at the dataset.

## Step 0. Import libraries

In [1]:
import pandas as pd

Read *music_project.csv* and save as *df*. 

In [2]:
df = pd.read_csv('/datasets/music_project.csv')

Look at the first 10 rows.

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


Get the general information about *df*.




In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


There are 7 columns, and each column's type is object.
The columns in *df* are as follows:
* userID — user ID;
* Track — song name;  
* artist — artist name;
* genre — genre;
* City — city where the track was listened to;
* time — time of the day when the track was listened to;
* Day — weekday.<br>
The number of non-null values differs between columns, which means that some values are missing.

**Conclusion**

Each row describes one play: a track of a certain genre that was listened to in a certain city at a certain time on a certain day of the week. Two problems need to be solved: missing values and inconsistent column names. For testing the hypotheses, we need *time*, *Day* and *City*. The *genre* column can tell us which genres are the most popular with Moscow and Saint-Petersburg residents.

# Step 1. Data preprocessing

Exclude the missing values, rename the columns and check for duplicates.

Let's see what columns are there.

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Some column names contain leading or trailing spaces, and capitalization is inconsistent.

Let's rename them.

In [6]:
df.columns = ['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time', 'weekday']

In [7]:
df.head()

Unnamed: 0,user_id,track_name,artist_name,genre_name,city,time,weekday
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
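As an alternative sketch, the rename can be written as an explicit mapping from old names to new ones, which does not depend on the column order. The one-row frame *raw* and the dictionary name *new_names* are made up for illustration; only the column names themselves come from the notebook.

```python
import pandas as pd

# Toy stand-in for df with the same raw column names (note the stray spaces).
raw = pd.DataFrame(
    [["FFB692EC", "Kamigata To Boots", "The Mass Missile", "rock",
      "Saint-Petersburg", "20:28:33", "Wednesday"]],
    columns=["  userID", "Track", "artist", "genre", "  City  ", "time", "Day"],
)

# Explicit old -> new mapping; the keys must match the raw names exactly.
new_names = {
    "  userID": "user_id",
    "Track": "track_name",
    "artist": "artist_name",
    "genre": "genre_name",
    "  City  ": "city",
    "time": "time",
    "Day": "weekday",
}
renamed = raw.rename(columns=new_names)
print(list(renamed.columns))
```

A mapping makes it harder to assign a name to the wrong column by accident, at the cost of having to reproduce the stray spaces in the keys.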


Find the number of the missing values in each column.

In [8]:
df.isnull().sum()

user_id           0
track_name     1231
artist_name    7203
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

*track_name*, *artist_name* and *genre_name* have missing values.

Replace the missing values in *track_name* and *artist_name* with *unknown*.

In [9]:
df['track_name'] = df['track_name'].fillna('unknown')

In [10]:
df['artist_name'] = df['artist_name'].fillna('unknown')

In [11]:
df.isnull().sum()

user_id           0
track_name        0
artist_name       0
genre_name     1198
city              0
time              0
weekday           0
dtype: int64
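The two `fillna` calls could also be combined into one by passing a dictionary of per-column fill values. This is a minimal sketch on a made-up frame (*toy* is a hypothetical name), not a change to the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "track_name": ["Pessimist", None],
    "artist_name": [None, "Mario Lanza"],
    "genre_name": ["dance", None],
})

# Fill only the two columns where a placeholder is acceptable;
# genre_name is deliberately left untouched.
filled = toy.fillna({"track_name": "unknown", "artist_name": "unknown"})
print(filled.isnull().sum())
```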

Delete the rows with missing values in *genre_name*.

In [12]:
df.dropna(subset=['genre_name'], inplace= True)

In [13]:
df.isnull().sum()

user_id        0
track_name     0
artist_name    0
genre_name     0
city           0
time           0
weekday        0
dtype: int64

Check for duplicates. If duplicates exist, remove them.

In [14]:
df.duplicated().sum()

3755

In [15]:
df = df.drop_duplicates().reset_index(drop=True)

In [16]:
df.duplicated().sum()

0

It's important to find out why duplicates appear in data.
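One possible starting point for such an investigation is to look at every copy of the repeated rows instead of only counting them. The sketch below uses a made-up three-column frame; the names *toy*, *all_copies* and *copy_counts* are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "track_name": ["True", "True", "Pessimist",
                   "Soul People", "Soul People", "Soul People"],
    "time": ["13:00:07", "13:00:07", "21:20:49",
             "08:34:34", "08:34:34", "08:34:34"],
})

# keep=False marks every copy of a repeated row, not just the later ones.
all_copies = toy[toy.duplicated(keep=False)]

# Count how many times each repeated row occurs.
copy_counts = all_copies.groupby(list(toy.columns)).size()
print(copy_counts)
```

Rows that repeat many times for the same user and timestamp may point to a logging problem rather than to real listening behaviour, but that is only a guess until the source of the data is checked.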

Save the unique genre names in *genres_list*. 
The list will be used to look for implicit duplicates, e.g. different spellings of the same genre.

In [17]:
genres_list=df['genre_name'].unique()
genres_list

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'alternative', 'children', 'rnb', 'hip', 'jazz',
       'postrock', 'latin', 'classical', 'metal', 'reggae', 'tatar',
       'blues', 'instrumental', 'rusrock', 'dnb', 'türk', 'post',
       'country', 'psychedelic', 'conjazz', 'indie', 'posthardcore',
       'local', 'avantgarde', 'punk', 'videogame', 'techno', 'house',
       'christmas', 'melodic', 'caucasian', 'reggaeton', 'soundtrack',
       'singer', 'ska', 'shanson', 'ambient', 'film', 'western', 'rap',
       'beats', "hard'n'heavy", 'progmetal', 'minimal', 'contemporary',
       'new', 'soul', 'holiday', 'german', 'tropical', 'fairytail',
       'spiritual', 'urban', 'gospel', 'nujazz', 'folkmetal', 'trance',
       'miscellaneous', 'anime', 'hardcore', 'progressive', 'chanson',
       'numetal', 'vocal', 'estrada', 'russian', 'classicmetal',
       'dubstep', 'club', 'deep', 'southern', 'black', 'folkrock',
       'fitness', 'french', 'd

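Create the function *find_genre()* that takes a genre name as a parameter and counts how many times it occurs in *genres_list*. Since *genres_list* contains only unique values, the result is 1 if that spelling exists in the data and 0 otherwise.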

In [19]:
def find_genre(genre):
    # genres_list is unique, so this returns 1 if the spelling exists, else 0.
    genre_count = 0
    for row in genres_list:
        if row == genre:
            genre_count += 1
    return genre_count

Look for different spellings of *hip-hop* in the data.

The correct spelling is *hiphop*. Possible variants:

* hip
* hop
* hip-hop


In [20]:
find_genre('hiphop')

1

In [21]:
find_genre('hip')

1

In [22]:
find_genre('hop')

0

In [23]:
find_genre('hip-hop')

0
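The four checks above could also be done in one pass over the genre column itself, which additionally shows how many rows use each spelling. This is a sketch on a made-up series; *genres* and *candidates* are hypothetical names and the counts are not taken from the dataset.

```python
import pandas as pd

genres = pd.Series(["hiphop", "hip", "rock", "hip", "pop", "hiphop", "hiphop"])
candidates = ["hiphop", "hip", "hop", "hip-hop"]

# value_counts gives the rows per spelling; reindex adds absent spellings as 0.
spelling_counts = genres.value_counts().reindex(candidates, fill_value=0)
print(spelling_counts)
```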

Create the function *find_hip_hop()* that replaces a wrong spelling of 'hip-hop' with *'hiphop'* and returns the number of rows that still contain the wrong spelling.

In [24]:
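def find_hip_hop(df, wrong_name):
    # Replace the wrong spelling with the canonical one.
    df['genre_name'] = df['genre_name'].replace(wrong_name, 'hiphop')
    # Count the rows that still contain the wrong spelling (0 means success).
    count = (df['genre_name'] == wrong_name).sum()
    return count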

In [26]:
find_hip_hop(df,'hip')

0
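If several wrong spellings had been found, they could be replaced in a single call by passing a list to `replace`. The helper *unify_genre()* below is a hypothetical generalisation, sketched on a made-up frame; unlike *find_hip_hop()* it returns a copy and leaves its input unchanged.

```python
import pandas as pd

def unify_genre(frame, wrong_names, correct_name):
    """Return a copy of frame with every spelling in wrong_names replaced."""
    result = frame.copy()
    result["genre_name"] = result["genre_name"].replace(wrong_names, correct_name)
    return result

toy = pd.DataFrame({"genre_name": ["hip", "hop", "hip-hop", "hiphop", "rock"]})
cleaned = unify_genre(toy, ["hip", "hop", "hip-hop"], "hiphop")
print(cleaned["genre_name"].value_counts())
```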

Check if everything is OK with *df*

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60126 entries, 0 to 60125
Data columns (total 7 columns):
user_id        60126 non-null object
track_name     60126 non-null object
artist_name    60126 non-null object
genre_name     60126 non-null object
city           60126 non-null object
time           60126 non-null object
weekday        60126 non-null object
dtypes: object(7)
memory usage: 3.2+ MB


**Conclusion**

The dataset contained missing values, stray spaces and inconsistent naming in the column names, and duplicates. Since the genre is more important for drawing conclusions than the track or artist name, missing values in *track_name* and *artist_name* were replaced with *unknown*, while rows with a missing *genre_name* were removed. The columns were renamed consistently, full duplicate rows were dropped, and the alternative spelling *hip* was replaced with *hiphop*.

# Step 2. Data Analysis

# Are there differences in how people listen to music in different cities?

There's a hypothesis that *“There's a difference between how Moscow and Saint-Petersburg residents listen to
music.”* <br>
To test this hypothesis, compare the number of tracks played on Monday, Wednesday and Friday in Moscow and Saint-Petersburg.

Group the data by city and count the tracks played in each city.

In [28]:
df.groupby('city')['genre_name'].count()

city
Moscow              41892
Saint-Petersburg    18234
Name: genre_name, dtype: int64

More tracks were played in Moscow than in Saint-Petersburg, but this doesn't necessarily mean that Moscow residents listen more actively. In general, Yandex.Music simply has more users in Moscow.
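One way to separate "more users" from "more active users" would be to divide the number of plays by the number of distinct users in each city. The sketch below assumes that *user_id* identifies a listener across rows, which the notebook does not confirm; the frame *toy* and the column names *plays*, *users* and *plays_per_user* are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "city": ["Moscow", "Moscow", "Moscow", "Moscow",
             "Saint-Petersburg", "Saint-Petersburg"],
})

# Plays and distinct users per city, then the average plays per user.
per_city = toy.groupby("city").agg(
    plays=("user_id", "size"),
    users=("user_id", "nunique"),
)
per_city["plays_per_user"] = per_city["plays"] / per_city["users"]
print(per_city)
```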

Group the columns by weekday to see when people listen to music more often.

In [29]:
df.groupby('weekday')['genre_name'].count()

weekday
Friday       21482
Monday       20866
Wednesday    17778
Name: genre_name, dtype: int64

People listen to music more often on Monday and Friday than on Wednesday. Perhaps they are more focused on work midweek.

Create the function *number_tracks()* that counts the tracks played on a given weekday in a given city. It takes three parameters:<br>*data*, *day*, *city*. 

In [30]:
def number_tracks(data, day, city):
    track_list = data[(data['weekday'] == day) & (data['city'] == city)]
    track_list_count = track_list['genre_name'].count()
    return track_list_count

In [31]:
number_tracks(df,'Monday','Moscow')

15347

In [32]:
number_tracks(df,'Monday','Saint-Petersburg')

5519

In [33]:
number_tracks(df, 'Wednesday' , 'Moscow')

10865

In [34]:
number_tracks(df, 'Wednesday', 'Saint-Petersburg')

6913

In [35]:
number_tracks(df,'Friday','Moscow')

15680

In [36]:
number_tracks(df,'Friday', 'Saint-Petersburg')

5802

Create a table with ['city', 'monday', 'wednesday', 'friday'] as column names.

In [37]:
data = [['Moscow', 15347, 10865, 15680],['Saint-Petersburg', 5519, 6913, 5802]]
columns = ['city', 'monday', 'wednesday', 'friday']
table = pd.DataFrame(data = data, columns = columns)
table

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15347,10865,15680
1,Saint-Petersburg,5519,6913,5802
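The same kind of table can be built without typing the counts in by hand, for example with `pd.crosstab`. This is a sketch on a made-up frame, so the numbers it prints are not the ones above; *toy* and *counts* are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Moscow",
             "Saint-Petersburg", "Saint-Petersburg", "Moscow"],
    "weekday": ["Monday", "Friday", "Monday",
                "Wednesday", "Wednesday", "Wednesday"],
})

# Rows are cities, columns are weekdays, cells are play counts.
counts = pd.crosstab(toy["city"], toy["weekday"])

# Put the weekdays in calendar order rather than alphabetical order.
counts = counts[["Monday", "Wednesday", "Friday"]]
print(counts)
```

Deriving the table from the data avoids copy errors and keeps it in sync if the preprocessing changes.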


**Conclusion**

The findings are interesting. People in Moscow listen to music most actively on Monday and Friday (15,347 and 15,680 plays) and least on Wednesday (10,865), whereas people in Saint-Petersburg are most active on Wednesday (6,913 plays versus 5,519 and 5,802).

# Monday morning and Friday evening — different music or the same?

Let’s see if people in Moscow and Saint-Petersburg prefer different genres on Monday morning and Friday evening. There’s a hypothesis that on Monday morning people prefer uplifting music, e.g. pop music, and on Friday evening dance music, e.g. electronic. 

Create two separate tables for Moscow (*moscow_general*) and for Saint-Petersburg (*spb_general*).

In [38]:
moscow_general = df[df['city']=='Moscow']
moscow_general

Unnamed: 0,user_id,track_name,artist_name,genre_name,city,time,weekday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,unknown,ruspop,Moscow,09:17:40,Friday
...,...,...,...,...,...,...,...
60120,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Moscow,21:07:12,Monday
60121,729CBB09,My Name,McLean,rnb,Moscow,13:32:28,Wednesday
60123,C5E3A0D5,Jalopiina,unknown,industrial,Moscow,20:09:26,Friday
60124,321D0506,Freight Train,Chas McDevitt,rock,Moscow,21:43:59,Friday


In [39]:
spb_general = df.loc[df.loc[:,'city']=='Saint-Petersburg']
spb_general

Unnamed: 0,user_id,track_name,artist_name,genre_name,city,time,weekday
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Saint-Petersburg,21:20:49,Wednesday
...,...,...,...,...,...,...,...
60112,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Saint-Petersburg,21:14:40,Monday
60113,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Saint-Petersburg,21:06:50,Monday
60114,29E04611,Bre Petrunko,Perunika Trio,world,Saint-Petersburg,13:56:00,Monday
60115,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Saint-Petersburg,09:22:13,Monday


Create the function *genre_weekday()* that returns the number of tracks played per genre for a given weekday and time window. It takes the parameters *df*, *day*, *time1* and *time2*.<br>
Rows are put in *genre_list* if both conditions are met: <br>
1) 'weekday' == *day*,<br>
2) *time1* < 'time' < *time2*.<br>
*genre_list_sorted* groups *genre_list* by *genre_name*, counts the rows in each group and sorts the counts in descending order. The times are compared as strings, which works here because every value uses the zero-padded *HH:MM:SS* format.

In [41]:
def genre_weekday(df,day,time1,time2):
    genre_list = df[(df['weekday']==day)&(df['time']>time1)&(df['time']<time2)]
    genre_list_sorted = genre_list.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)
    return genre_list_sorted
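The function compares the *time* column with the bounds as plain strings. A short sketch of why that gives the expected result for zero-padded *HH:MM:SS* values, next to an equivalent filter on real durations; the series *times* is made up, and `pd.to_timedelta` is shown only as one possible alternative, not as what the notebook does.

```python
import pandas as pd

times = pd.Series(["08:37:09", "20:28:33", "07:00:00", "10:59:59"])

# String comparison, as used in genre_weekday(); both bounds are exclusive.
as_text = (times > "07:00:00") & (times < "11:00:00")

# The same window expressed as durations since midnight.
elapsed = pd.to_timedelta(times)
as_duration = (elapsed > pd.Timedelta(hours=7)) & (elapsed < pd.Timedelta(hours=11))

print(as_text.tolist())
print(as_duration.tolist())
```

Both filters select the same rows here. The string version would stop being reliable if some values were not zero-padded, e.g. `8:37:09`.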

Compare the results for Moscow and Saint-Petersburg on Monday morning (from 7am until 11am) and Friday evening  (from 5pm until 11pm).

In [42]:
genre_weekday(moscow_general, 'Monday', '07:00:00' , '11:00:00')

genre_name
pop           781
dance         549
electronic    480
rock          474
hiphop        286
             ... 
glitch          1
folklore        1
flamenco        1
eastern         1
adult           1
Name: genre_name, Length: 152, dtype: int64

In [43]:
genre_weekday(spb_general, 'Monday','07:00:00' , '11:00:00')

genre_name
pop            218
dance          182
rock           162
electronic     147
hiphop          80
              ... 
dub              1
drum             1
ukrrock          1
deutschrock      1
adult            1
Name: genre_name, Length: 107, dtype: int64

In [44]:
genre_weekday(moscow_general, 'Friday', '17:00:00' , '23:00:00')

genre_name
pop           713
rock          517
dance         495
electronic    482
hiphop        273
             ... 
rockindie       1
thrash          1
triphop         1
tropical        1
adult           1
Name: genre_name, Length: 163, dtype: int64

In [45]:
genre_weekday(spb_general, 'Friday', '17:00:00' , '23:00:00')

genre_name
pop           256
rock          216
electronic    216
dance         210
hiphop         97
             ... 
european        1
eurofolk        1
ethnic          1
dub             1
acoustic        1
Name: genre_name, Length: 126, dtype: int64

It turned out that the genre preferences on Monday morning in Moscow and Saint-Petersburg are very similar - both cities prefer pop music. Pop music is also preferred at the end of the week in both cities.

**Conclusion**

Undoubtedly, pop music is the most popular genre in both cities. While the top 5 doesn't really change from city to city, the end of the top 10 list differs depending on weekday and time.
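A comparison like this can be made explicit with set operations on the top genres. The sketch below uses the five leading Moscow counts from the Monday-morning and Friday-evening outputs above; the helper *top_genres()* is a hypothetical name.

```python
import pandas as pd

def top_genres(counts, n):
    """Return the n most frequent genre names as a set."""
    return set(counts.sort_values(ascending=False).head(n).index)

# Leading Moscow counts copied from the outputs above.
monday = pd.Series({"pop": 781, "dance": 549, "electronic": 480,
                    "rock": 474, "hiphop": 286})
friday = pd.Series({"pop": 713, "rock": 517, "dance": 495,
                    "electronic": 482, "hiphop": 273})

shared = top_genres(monday, 5) & top_genres(friday, 5)
only_monday = top_genres(monday, 5) - top_genres(friday, 5)
print(sorted(shared), sorted(only_monday))
```

Applied with `n=10` to the full results of *genre_weekday()*, the same difference would list the genres that enter or leave the top 10 between the two time slots.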

# Moscow and Saint-Petersburg — two different capitals, two different music genres. Is it so?

Hypothesis: *"Saint-Petersburg is famous for its rap culture, so rap should be listened to there more often. Although Moscow is a city of contrasts, the majority of people there listen to pop music."* 

Group the table *moscow_general* by genre and count the compositions for each genre sorting in descending order. Save the results in *moscow_genres*.

Show the first 10 rows.

In [None]:
moscow_genres = moscow_general.groupby('genre_name')['genre_name'].count().sort_values(ascending=False)

In [47]:
moscow_genres.head(10)

genre_name
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre_name, dtype: int64

Do the same with *spb_general* and save the results in *spb_genres*. Show the first 10 rows.<br>
Now we can compare the results.

In [48]:
spb_genres = spb_general.groupby('genre_name')['genre_name'].count().sort_values(ascending=False)

In [49]:
spb_genres.head(10)

genre_name
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre_name, dtype: int64

**Conclusion**

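Pop is the most popular genre in both cities, and the top five genres are identical. Russian pop (*ruspop*) also appears in both top-10 lists. As for rap, interestingly, it is about equally popular in both cities: *hiphop* ranks fifth in each, and *rusrap* accounts for roughly 3% of plays in each city, so the hypothesis about Saint-Petersburg is not supported.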
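Because Moscow has more than twice as many plays, raw counts are hard to compare between the cities; shares are easier to read. A minimal sketch using the city totals and the *rusrap* counts from the outputs above (the variable names are made up):

```python
import pandas as pd

# Counts copied from the outputs above: total plays and rusrap plays per city.
totals = pd.Series({"Moscow": 41892, "Saint-Petersburg": 18234})
rusrap = pd.Series({"Moscow": 1161, "Saint-Petersburg": 564})

# Share of all plays in each city, as a percentage.
rusrap_share = (rusrap / totals * 100).round(2)
print(rusrap_share)
```

On the full data, `value_counts(normalize=True)` on *genre_name* would give the same kind of shares for every genre at once.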

# Step 3. Research results


The hypotheses were:

* Moscow and Saint-Petersburg residents have different music habits;

* Top 10 genres on Monday morning are different from top 10 on Friday evening;

* Residents of two different cities prefer different genres.

**Results**

Both cities' residents prefer pop music. The preferred genres do not change depending on weekday. However, people in Moscow listen to music more often on Monday and Friday, as opposed to people in Saint-Petersburg who listen to music more often on Wednesday than on Monday and Friday. 

Thus, the first hypothesis was supported, the second was partially correct and the third one was not supported.