Comparisons of Moscow and Saint Petersburg are surrounded by myths. For example:
 * Moscow is a metropolis governed by the strict rhythm of the working week;
 * Saint Petersburg is the cultural capital, with its own tastes.

Using Yandex Music data, you will compare the behavior of users in the two capitals.

**The purpose of the study** is to test three hypotheses:
1. User activity depends on the day of the week, and this dependence shows up differently in Moscow and Saint Petersburg.
2. On Monday morning, some genres prevail in Moscow and others in Saint Petersburg. Likewise, on Friday evening different genres prevail depending on the city.
3. Moscow and Saint Petersburg prefer different genres of music.

**Research progress**

You will get the user behavior data from the file `yandex_music_project.csv`. Nothing is known about the quality of the data, so you will need to review it before testing the hypotheses.

You will check the data for errors and evaluate their impact on the research. Then, in the pre-processing phase you will look for an opportunity to fix the most critical data errors.
 
Thus, the study will take place in three stages:
 1. Review of data.
 2. Data pre-processing.
 3. Hypothesis testing.

## Data Review

Get a first impression of the Yandex Music data.

**Job 1**

The main analytics tool is `pandas`. Import this library.

In [1]:
import pandas as pd

**Job 2**

Read the file `yandex_music_project.csv` from the folder `/datasets` and save it in the variable `df`:

In [3]:
df = pd.read_csv('/Applications/Python/Проекты/Датасеты/yandex_music_project.csv')

**Job 3**


Display the first ten rows of the table:

In [5]:
print(df.head(10))

     userID                        Track            artist   genre  \
0  FFB692EC            Kamigata To Boots  The Mass Missile    rock   
1  55204538  Delayed Because of Accident  Andreas Rönnberg    rock   
2    20EC38            Funiculì funiculà       Mario Lanza     pop   
3  A3DD03C9        Dragons in the Sunset        Fire + Ice    folk   
4  E2DC1FAE                  Soul People        Space Echo   dance   
5  842029A1                    Преданная         IMPERVTOR  rusrap   
6  4CB90AA5                         True      Roman Messer   dance   
7  F03E1C1F             Feeling This Way   Polina Griffith   dance   
8  8FA1D3BE     И вновь продолжается бой               NaN  ruspop   
9  E772D5C0                    Pessimist               NaN   dance   

             City        time        Day  
0  Saint-Petersburg  20:28:33  Wednesday  
1            Moscow  14:07:09     Friday  
2  Saint-Petersburg  20:58:07  Wednesday  
3  Saint-Petersburg  08:37:09     Monday  
4            M

**Job 4**


Get general information about the table with a single command, using the `info()` method:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in the table. The data type of every column is `object`.

According to the data documentation:
* `userID` - user identifier;
* `Track` - track title;
* `artist` - artist name;
* `genre` - genre name;
* `City` - the user's city;
* `time` - the time the play started;
* `Day` - day of the week.

The number of values in the columns varies. This means that there are missing values in the data.

**Job 5**

**Question with free form of answer**

The column names show style violations:
* Lowercase letters are mixed with uppercase letters.
* Some names contain spaces.

What is the third violation?

Multiple words should be separated with underscores, as in `user_id`.

**Conclusions**

Each row of the table holds data about one track that was played. Some columns describe the track itself: title, artist, and genre. The rest describe the user: which city they are in and when they listened to music.

Provisionally, the data appear sufficient to test the hypotheses. However, there are missing values in the data, and the column names deviate from good style.

To move forward, we need to fix these data problems.

## Data Preprocessing
Fix the style of the column headers and handle the missing values. Then check the data for duplicates.

### Header Style

**Job 6**

Print column names on the screen:

In [7]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

**Job 7**


Bring the names in line with good style:
* write multi-word names in snake_case,
* make all characters lowercase,
* remove the spaces.

To do this, rename the columns as follows:

* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [8]:
df = df.rename(columns = {
    '  userID' : 'user_id',
    'Track' : 'track',
    '  City  ' : 'city',
    'Day' : 'day'
})

**Job 8**


Check the result. To do this, print the column names again:

In [9]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
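
For tables with many columns, renaming by hand gets tedious. Below is a minimal sketch of automating the same clean-up. The helper `to_snake_case()` and the toy table are hypothetical and not part of the original project; the toy table only reuses the raw header names shown above.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip spaces, split camelCase, lowercase."""
    name = name.strip()
    # Insert an underscore between a lowercase letter or digit and an
    # uppercase letter: 'userID' -> 'user_ID'.
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower().replace(' ', '_')


# Toy table with the same raw headers as the project file (no rows needed).
toy = pd.DataFrame(columns=['  userID', 'Track', '  City  ', 'Day'])
toy.columns = [to_snake_case(column) for column in toy.columns]
print(list(toy.columns))
```

An explicit `rename()` mapping, as used in Job 7, remains easier to review when only a few columns need fixing.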

### Missing values

**Job 9**

First, count how many values are missing in the table. Two `pandas` methods are enough for this:

In [10]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. Missing values in `track` and `artist` do not matter for your work; it is enough to replace them with an explicit placeholder.

However, missing values in `genre` may interfere with the comparison of musical tastes in Moscow and Saint Petersburg. In practice, the right approach would be to find the cause of the gaps and recover the data. That option is not available in this training project, so you will have to:
* fill in these missing values with an explicit placeholder;
* assess how much they distort the calculations.
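
One simple way to start that assessment is to look at the share of missing values per column rather than the raw counts. The sketch below uses a made-up toy table, not the real dataset; the variable names are illustrative.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'genre': ['rock', None, 'pop', None],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow'],
})

# isna() yields True/False per cell; the mean of booleans is the share of True.
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to the project table, the same expression would be `df.isna().mean()`.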

**Job 10**

Replace the missing values in the columns `track`, `artist`, and `genre` with the string `'unknown'`. To do this, create a list `columns_to_replace`, iterate over its elements with a `for` loop, and replace the missing values in each column:

In [11]:
columns_to_replace = ['track', 'artist', 'genre']  # column names as plain strings
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

**Job 11**

Make sure that no missing values are left in the table. To do this, count the missing values again.

In [12]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
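
The loop is not the only option. As a sketch, `fillna()` also accepts a dictionary that maps column names to fill values, so the same replacement can be written as one call. The toy table below is made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'track': ['True', None],
    'artist': [None, 'Mario Lanza'],
    'genre': ['dance', None],
    'city': ['Moscow', 'Moscow'],
})

# Each key is a column name, each value is the placeholder for that column.
filled = toy.fillna({'track': 'unknown', 'artist': 'unknown', 'genre': 'unknown'})
print(filled.isna().sum())
```

Columns not listed in the dictionary, such as `city` here, are left untouched.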

### Duplicates

**Job 12**

Count the explicit duplicates in the table with one command:

In [13]:
df.duplicated().sum()

3826

**Job 13**

Call the dedicated `pandas` method to remove the explicit duplicates:

In [14]:
df = df.drop_duplicates().reset_index(drop=True)

**Job 14**

Count the explicit duplicates again to make sure you have removed them all:

In [15]:
df.duplicated().sum()

0
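
To see what `duplicated()` actually counts, it can help to try it on a tiny table. In this sketch (toy data, made up for illustration) the default setting flags only the repeated copies, while `keep=False` flags every row that has a twin.

```python
import pandas as pd

# Toy data: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['True', 'True', 'Pessimist', 'True'],
})

later_copies = toy.duplicated()            # default keep='first'
all_copies = toy.duplicated(keep=False)    # mark every member of a duplicate group
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(later_copies.sum(), all_copies.sum(), len(deduplicated))
```

`reset_index(drop=True)` renumbers the remaining rows so that the index has no gaps after the removal.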

Now get rid of the implicit duplicates in the column `genre`. For example, the name of the same genre may be spelled in slightly different ways. Such errors will also affect the outcome of the study.

**Job 15**

Display a list of unique genre names in alphabetical order. To do this:
1. Extract the desired dataframe column; 
2. Apply the sorting method to it;
3. For the sorted column, call the method that returns unique values from the column.

In [16]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Job 16**

Review the list and find implicit duplicates of the name `hiphop`. These may be misspelled names or alternative names of the same genre.

You will see the following implicit duplicates:
* *hip*,
* *hop*,
* *hip-hop*.
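
Scanning a long list by eye is error-prone. As a rough sketch, a substring search can narrow the list down to candidates. The series below is a made-up sample of genre names, and the pattern `'hip|hop'` is only an assumption about what the variants look like.

```python
import pandas as pd

# Made-up sample of genre names for illustration.
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock', 'pop'])

# str.contains() treats the pattern as a regular expression by default.
candidates = genres[genres.str.contains('hip|hop')].unique()
print(sorted(candidates))
```

Such a search only suggests candidates; each hit still needs a manual check, because an unrelated name could contain the same substring.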

To clean the table, use the `replace()` method with two arguments: a list of duplicate strings (*hip*, *hop*, and *hip-hop*) and a string with the correct value. Correct the column `genre` in the table `df` by replacing each value from the list of duplicates with the correct one. Instead of `hip`, `hop`, and `hip-hop`, the table should contain the value `hiphop`:

In [17]:
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')  # limit the replacement to the genre column

**Job 17**

Check that you have replaced the wrong names:

*   hip,
*   hop,
*   hip-hop.

Print a sorted list of the unique values in the column `genre`:

In [18]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

The pre-processing revealed three data problems:

- style violations in the column headers,
- missing values,
- duplicates - explicit and implicit.

You corrected the headers to make the table easier to work with. Removing the duplicates will make the study more accurate.

The missing values have been replaced with `'unknown'`. It remains to be seen whether the missing values in the column `genre` will harm the research.

Now we can move on to hypothesis testing. 

## Hypothesis Testing

### Comparing user behavior between two capitals

The first hypothesis states that users listen to music differently in Moscow and Saint Petersburg. Check this assumption using data for three days of the week: Monday, Wednesday, and Friday. To do this:

* Separate the users of Moscow and Saint Petersburg.
* Compare how many tracks each user group listened to on Monday, Wednesday, and Friday.

**Job 18**

As practice, first perform each calculation separately.

Evaluate user activity in each city. Group the data by city and count the plays in each group.

In [19]:
cities = df.groupby('city')['track'].count()
cities

city
Moscow              42741
Saint-Petersburg    18512
Name: track, dtype: int64

There are more plays in Moscow than in Saint Petersburg. This does not mean that Moscow users listen to music more often; there are simply more users in Moscow.

**Job 19**

Now group the data by day of the week and count the plays on Monday, Wednesday, and Friday. Note that the data contain plays for these three days only.

In [20]:
days = df.groupby('day')['track'].count()
days

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

On average, users from the two cities are less active on Wednesdays. But the picture may change if you consider each city separately.

**Job 20**


You have seen how grouping by city and by day of the week works. Now write a function that combines these two calculations.

Create a function `number_tracks()` that counts the plays for a given day and city. It needs two parameters:
* day of the week,
* city name.

Inside the function, save to a variable the rows of the source table in which:
  * the value in column `day` equals the parameter `day`,
  * the value in column `city` equals the parameter `city`.

To do this, apply sequential filtering with logical indexing (or combine the conditions into one logical expression on a single line, if you are already familiar with that).

Then count the values in the column `user_id` of the resulting table. Save the result to a new variable and return this variable from the function.

In [21]:
def number_tracks(day, city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count
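
The single-expression variant mentioned in the task could look roughly like this. It is a sketch on a made-up toy table; the function name `number_tracks_one_line()` and its extra `table` parameter are illustrative and not part of the original solution.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3', 'D4'],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday'],
})


def number_tracks_one_line(table, day, city):
    # Each comparison needs its own parentheses, because & binds
    # more tightly than ==.
    mask = (table['day'] == day) & (table['city'] == city)
    return table.loc[mask, 'user_id'].count()


print(number_tracks_one_line(toy, 'Monday', 'Moscow'))
```

Passing the table as a parameter, instead of reading the global `df`, makes the function easy to try on small test data.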


**Job 21**

Call `number_tracks()` six times, changing the parameter values so as to get data for each city on each of the three days.

In [22]:
result = number_tracks('Monday', 'Moscow')
result

15740

In [23]:
result = number_tracks('Monday', 'Saint-Petersburg')
result

5614

In [24]:
result = number_tracks('Wednesday', 'Moscow')
result

11056

In [25]:
result = number_tracks('Wednesday', 'Saint-Petersburg')
result

7003

In [26]:
result = number_tracks('Friday', 'Moscow')
result

15945

In [27]:
result = number_tracks('Friday', 'Saint-Petersburg')
result

5895

**Job 22**

Using the `pd.DataFrame` constructor, create a table where
* the column names are `['city', 'monday', 'wednesday', 'friday']`;
* the data are the results that you obtained with `number_tracks()`.

In [28]:
data = [['Moscow', 15740, 11056, 15945],
       ['Saint-Petersburg', 5614, 7003, 5895]] # Table with the results
columns = ['city', 'monday', 'wednesday', 'friday']
print(pd.DataFrame(data=data, columns=columns))

               city  monday  wednesday  friday
0            Moscow   15740      11056   15945
1  Saint-Petersburg    5614       7003    5895
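
Typing six results into a list by hand works for a training exercise, but the same city-by-day table can also be produced in one step. The sketch below uses `pd.crosstab()` on a made-up toy table; applying it to the project data would mean passing `df['city']` and `df['day']` instead.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Wednesday', 'Monday', 'Wednesday'],
})

# Rows are cities, columns are days, cells are the number of matching rows.
counts = pd.crosstab(toy['city'], toy['day'])
print(counts)
```

Note that the day columns come out in alphabetical order, not in calendar order.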


**Conclusions**

The data show a difference in user behavior:

- In Moscow, plays peak on Monday and Friday, with a noticeable decline on Wednesday.
- In Saint Petersburg, on the contrary, people listen to more music on Wednesday than on Monday or Friday.

So the data support the first hypothesis.
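
Because Moscow has far more plays overall, the pattern is easier to compare as shares within each city. The sketch below re-enters the six counts from the table in Job 22 and divides each row by its total; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the results table built in Job 22.
plays = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide every row by that city's total so each row sums to 1.
shares = plays.div(plays.sum(axis=1), axis=0)
print(shares.round(3))
```

With these numbers, Wednesday accounts for roughly 26% of the Moscow plays and roughly 38% of the Saint Petersburg plays.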

### Music at the beginning and end of the week

According to the second hypothesis, on Monday morning some genres prevail in Moscow and others in Saint Petersburg. Likewise, on Friday evening different genres prevail depending on the city.

**Job 23**

Save tables with data into two variables:
* for Moscow, in `moscow_general`;
* for Saint Petersburg, in `spb_general`.

In [29]:
moscow_general = df[df['city'] == 'Moscow']

In [30]:
spb_general = df[df['city'] == 'Saint-Petersburg']

**Job 24**

Create the function `genre_weekday()` with four parameters:
* a table (dataframe) with data,
* day of the week,
* start timestamp in the format 'hh:mm',
* end timestamp in the format 'hh:mm'.

The function should return information about the top 10 genres of the tracks that were played on the specified day between the two timestamps.

In [31]:
def genre_weekday(df, day, time1, time2):
    genre_df = df[df['day'] == day]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending = False)
    return genre_df_sorted[:10]
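
The time filter works on plain strings: because the values are zero-padded ('hh:mm:ss'), comparing them character by character gives the same order as comparing clock times. The sketch below shows this on a made-up toy table, using a more compact variant of the function. The name `top_genres()` and the parameter `n` are illustrative, and `value_counts()` is swapped in for the group-count-sort steps.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'genre': ['pop', 'pop', 'rock', 'dance', 'pop'],
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:37:09', '10:59:59', '07:30:00', '14:07:09', '09:00:00'],
})


def top_genres(table, day, time_from, time_to, n=10):
    # String comparison is valid here only because the times are zero-padded.
    mask = (
        (table['day'] == day)
        & (table['time'] > time_from)
        & (table['time'] < time_to)
    )
    # value_counts() counts each genre and sorts in descending order.
    return table.loc[mask, 'genre'].value_counts().head(n)


result = top_genres(toy, 'Monday', '07:00', '11:00')
print(result)
```

If the times were stored without leading zeros (for example '8:37:09'), this comparison would give wrong results and the column would first need converting to a proper time type.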

**Job 25**


Compare the results of the function for Moscow and Saint Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [32]:
print(genre_weekday(moscow_general, 'Monday', '07:00', '11:00'))

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64


In [33]:
print(genre_weekday(spb_general, 'Monday', '07:00', '11:00'))

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64


In [34]:
print(genre_weekday(moscow_general, 'Friday', '17:00', '23:00'))

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64


In [35]:
print(genre_weekday(spb_general, 'Friday', '17:00', '23:00'))

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64


**Conclusions**

Comparing the top 10 genres on Monday morning leads to the following conclusions:

1. Moscow and Saint Petersburg listen to similar music. The only difference is that the Moscow top 10 includes the genre "world", while the Saint Petersburg top 10 includes jazz and classical.

2. In Moscow there were so many missing values that `'unknown'` ranked tenth among the most popular genres. So missing values make up a significant share of the data and threaten the credibility of the study.

Friday evening does not change this picture. Some genres move up a little and others move down, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and Saint Petersburg is not very pronounced.

However, the missing values cast doubt on this result. There are so many of them in Moscow that the top 10 might look different if the genre data had not been lost.
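
One way to put a number on that doubt is to compute the share of `'unknown'` genres in each city. The sketch below uses a made-up toy table; on the project data the same expression would use `df['genre']` and `df['city']`.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'unknown', 'pop', 'jazz'],
})

# The comparison gives True/False per row; the mean within each city
# is the share of rows with an unknown genre.
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```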

### Genre Preferences in Moscow and Petersburg

Hypothesis: Saint Petersburg is the capital of rap, and music of this genre is listened to there more often than in Moscow. Moscow, in turn, is a city of contrasts in which pop music nevertheless dominates.

**Job 26**

Group the table `moscow_general` by genre and count the plays of tracks in each genre with `count()`. Then sort the result in descending order and save it in the table `moscow_genres`.

In [36]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)

**Job 27**

Display the first ten rows of `moscow_genres`:

In [37]:
moscow_genres[:10]

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

**Job 28**


Now repeat the same for Petersburg.

Group the table `spb_general` by genre and count the plays of tracks in each genre. Sort the result in descending order and save it in the table `spb_genres`:

In [38]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)


**Job 29**

Display the first ten rows of `spb_genres`:

In [39]:
spb_genres[:10]

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also includes a closely related genre, Russian pop music.
* Contrary to expectations, rap is equally popular in Moscow and Saint Petersburg.
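
Raw counts are hard to compare between cities of different size, so the comparison is clearer as a share of each city's plays. The sketch below re-enters a few counts from the outputs above (the genre counts from Jobs 27 and 29, the city totals from Job 18); the variable names are illustrative.

```python
import pandas as pd

# Total plays per city, copied from the output of Job 18.
total_plays = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Plays per genre, copied from the top-10 outputs of Jobs 27 and 29.
genre_plays = pd.DataFrame(
    {'Moscow': [5892, 2096, 1161], 'Saint-Petersburg': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# Dividing a DataFrame by a Series aligns the Series index with the columns.
genre_shares = genre_plays / total_plays
print((genre_shares * 100).round(1))
```

On these figures, `hiphop` and `rusrap` each take a slightly larger share of plays in Saint Petersburg, but the gap is well under one percentage point, which is consistent with the conclusion above.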

## Research Results

You tested three hypotheses and established:

1. The day of the week affects user activity differently in Moscow and Saint Petersburg.

The first hypothesis is fully confirmed.

2. Musical preferences do not change much over the week, in either Moscow or Saint Petersburg. Small differences are visible at the beginning of the week, on Mondays:
* in Moscow, people listen to "world" music,
* in Saint Petersburg, to jazz and classical.

Thus, the second hypothesis was only partially confirmed. This result might have been different if not for the missing data.

3. The tastes of users in Moscow and Saint Petersburg have more in common than differences.

The third hypothesis was not confirmed. If differences in preferences exist, they are not noticeable among the bulk of users.