# User comparison between two cities

## Data overview

In [3]:
# importing pandas
import pandas as pd

In [4]:
# reading csv file and assigning the result to variable
df = pd.read_csv('https://code.s3.yandex.net/datasets/yandex_music_project.csv')

In [5]:
# displaying the head of the DataFrame
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | | dance | Saint-Petersburg | 21:20:49 | Wednesday |


- Column names and data use lowercase and uppercase letters;
- Column names are not in snake case.

In [6]:
# printing information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


- All columns contain object-type data;
- Several columns have missing data: `Track`, `artist`, and `genre`;
- The names of the `City` and `userID` columns contain extra spaces.

**Conclusion**

We have enough data for analysis and all columns contain object-type data.

We need to fix these issues:
- Column names mix lowercase and uppercase letters;
- Column names are not in snake case;
- Several columns have missing data: `Track`, `artist`, and `genre`;
- The names of the `City` and `userID` columns contain extra spaces.



## Data preprocessing

### Heading style

In [5]:
# printing list of column names
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Let's rename columns:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [8]:
# renaming columns
df = df.rename(columns={'  userID' : 'user_id', 'Track' : 'track', '  City  ' : 'city', 'Day' : 'day'})

In [9]:
# checking column names
df.head(10)

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | | dance | Saint-Petersburg | 21:20:49 | Wednesday |
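
An explicit mapping is fine for seven columns. For a wider table, a general normalizer might be more convenient. Here is a minimal sketch, assuming the names only need trimming, splitting at camelCase boundaries, and lowercasing. `to_snake_case` is a hypothetical helper written for this illustration, not a pandas function.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Trim whitespace, split camelCase at lower-to-upper boundaries, lowercase."""
    stripped = name.strip()
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()


# Empty toy frame that reuses the raw column names of this dataset
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# rename() accepts a function and applies it to every column name
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

For these seven names the helper produces the same result as the manual mapping. Names with acronyms or other separators would need extra rules.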


### Missing values

In [11]:
# counting missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

There are some cases where missing values don't affect the study. For instance, in the case of `track` and `artist`, blank values can be replaced with an explicit notation such as 'unknown'. 

However, missing values in `genre` can interfere with the comparison of musical tastes in the two cities. Ideally, we would identify the reasons for the gaps and restore the data, but that option is not included in the curriculum. Instead, we can:
* fill in these gaps with explicit notation,
* estimate how much it will affect calculations.

Let's fill in the gaps in `track`, `artist`, and `genre` with `'unknown'`. To do that, I will create a list `columns_to_replace`, loop through its elements, and replace the missing values in each column:

In [12]:
# creating a list, looping through its elements and replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [10]:
# checking for gaps
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
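
To estimate how much the filled gaps could affect the calculations, it helps to look at shares rather than raw counts. The sketch below only reuses the numbers printed above (65079 rows and the three missing-value counts). It does not touch the real `df`.

```python
import pandas as pd

# Numbers copied from the df.info() and df.isna().sum() outputs above
total_rows = 65079
missing_counts = pd.Series({'track': 1231, 'artist': 7203, 'genre': 1198})

# Percentage of rows with a missing value in each column
missing_share = (missing_counts / total_rows * 100).round(2)
print(missing_share)
```

Fewer than 2% of `genre` values were missing in the raw data, while `artist` has gaps in about 11% of rows. On the real DataFrame, calling `df.isna().mean()` before the `fillna()` step should give the same shares as fractions.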

### Duplicates

In [11]:
# checking for duplicates
df.duplicated().sum()

3826

Let's remove them by calling the `pandas` method that drops obvious duplicates:

In [13]:
# removing duplicates
df = df.drop_duplicates().reset_index(drop=True)

In [13]:
# checking for duplicates
df.duplicated().sum()

0
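
Before dropping rows, it can be useful to look at what counts as a duplicate. This is a small sketch on made-up data (the `toy` frame is an assumption for illustration, not a slice of the real dataset). It shows how `duplicated(keep=False)` exposes every copy and how `drop_duplicates()` keeps the first one.

```python
import pandas as pd

# Made-up rows: the first two are identical, the last differs only in user_id
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3'],
    'track': ['True', 'True', 'Pessimist', 'True'],
    'time': ['13:00:07', '13:00:07', '21:20:49', '13:00:07'],
})

# keep=False marks every copy of a repeated row, not only the later ones
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy; reset_index(drop=True) renumbers the rows
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(len(all_copies), len(deduplicated))
```

Only rows that match in every column are treated as duplicates, which is why the last row survives.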

Let's check for implicit duplicates, where the same genre is spelled differently. To do that, print the alphabetical list of unique genres.

In [17]:
# printing out the alphabetical list of unique genres
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Hip-hop appears under several spellings:
* *hip*,
* *hop*,
* *hip-hop*.

Let's write a function `replace_wrong_genres()` with two parameters: 
* `wrong_genres` — list with duplicates,
* `correct_genre` — correct spelling.

The function will replace each value from the `wrong_genres` list with the `correct_genre` value.

In [18]:
# creating a function replace_wrong_genres()
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)
duplicates = ['hip', 'hop', 'hip-hop']
name = 'hiphop'

In [16]:
# calling replace_wrong_genres() function
replace_wrong_genres(duplicates, name)

In [17]:
# checking for duplicates
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

All duplicates have been renamed.
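
`replace_wrong_genres()` changes the global `df` in place. A variant that takes the table as an argument and returns a copy may be easier to test. The sketch below is one possible way to write it. `replace_genres` and the `toy` frame are hypothetical names used only for this illustration.

```python
import pandas as pd


def replace_genres(table, wrong_genres, correct_genre):
    """Return a copy of table with every wrong spelling replaced in 'genre'."""
    result = table.copy()
    # Series.replace accepts a list of values and replaces each exact match
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result


toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'pop', 'hiphop']})
fixed = replace_genres(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(fixed['genre'].tolist())
```

The replacement matches whole values only, so `'hip-hop'` becomes `'hiphop'` in one step and the original `toy` frame stays unchanged.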

**Conclusion**

- Column names have been renamed;
- Missing values have been filled in;
- Duplicates have been removed;
- Wrong genre names have been renamed.

Now we can move on to hypothesis testing.

## Hypothesis testing

### A comparison of the behavior of users on different days of the week in two cities

According to the first hypothesis, users in two cities listen to music differently on different days of the week.

The data for Monday, Wednesday, and Friday will be used to test this assumption. I will:

* separate users in two cities;
* compare the number of tracks each user group listened to on Monday, Wednesday and Friday.

In [21]:
# separating users in two cities
df.groupby('city')['track'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: track, dtype: int64

More tracks were played in Moscow, probably because Moscow has more users than St. Petersburg.

In [22]:
# comparing the number of tracks listened to on Monday, Wednesday and Friday.
df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Overall, fewer tracks are played on Wednesdays than on Mondays or Fridays.

Let's create a function `number_tracks()` that counts the tracks listened to in a given city on a given day of the week. It needs two parameters:
* day of the week;
* city name.

In [23]:
# creating a function number_tracks()
def number_tracks(day, city):
    track_list = df.loc[(df['day']==day) & (df['city']==city)]
    track_list_count=track_list['user_id'].count()
    return track_list_count

Call `number_tracks()` six times, changing parameters to get data for each city on each of the three days.

In [27]:
# getting number of tracks listened to in Moscow on Mondays
mon_m = number_tracks('Monday','Moscow')
mon_m

15740

In [28]:
# getting number of tracks listened to in St. Petersburg on Mondays
mon_sp = number_tracks('Monday','Saint-Petersburg')
mon_sp

5614

In [29]:
# getting number of tracks listened to in Moscow on Wednesdays
wed_m = number_tracks('Wednesday','Moscow')
wed_m

11056

In [30]:
# getting number of tracks listened to in St. Petersburg on Wednesdays
wed_sp = number_tracks('Wednesday','Saint-Petersburg')
wed_sp

7003

In [31]:
# getting number of tracks listened to in Moscow on Fridays
fr_m = number_tracks('Friday','Moscow')
fr_m

15945

In [32]:
# getting number of tracks listened to in St. Petersburg on Fridays
fr_sp = number_tracks('Friday','Saint-Petersburg')
fr_sp 

5895

Create a DataFrame where:
* the column names are `['city', 'monday', 'wednesday', 'friday']`;
* the values come from the `number_tracks()` calls.

In [33]:
# creating a DataFrame
data=[['Moscow', mon_m, wed_m, fr_m],
      ['Saint-Petersburg', mon_sp, wed_sp, fr_sp]]
result_table = pd.DataFrame(data=data, columns=['city', 'monday', 'wednesday', 'friday'])
display(result_table)

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
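
The same city-by-day table can also be built in one step. The sketch below uses `pd.crosstab` on a made-up frame (`toy` is an assumption for illustration). On the real data, the two arguments would be `df['city']` and `df['day']`.

```python
import pandas as pd

# Made-up plays: one row per track listened to
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are the number of plays
counts = pd.crosstab(toy['city'], toy['day'])

# Put the days in calendar order instead of alphabetical order
counts = counts[['Monday', 'Wednesday', 'Friday']]
print(counts)
```

Combinations that never occur, such as Moscow on Wednesday in this toy frame, are filled with 0.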


**Conclusion**

There is a difference in user behavior:

- Listening peaks in Moscow on Mondays and Fridays, and declines on Wednesdays; 
- Listening peaks in St. Petersburg on Wednesdays, and declines on Mondays and Fridays. 

The analysis supports the first hypothesis.
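
Moscow has more than twice as many plays as St. Petersburg, so raw counts are hard to compare directly. One way to check the conclusion is to convert each city's counts into shares of its own total. The sketch below only reuses the six counts obtained above.

```python
import pandas as pd

# Counts copied from the number_tracks() results above
plays = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total so that both cities are on the same scale
shares = plays.div(plays.sum(axis=1), axis=0).mul(100).round(1)
print(shares)
```

Wednesday accounts for roughly 26% of the Moscow plays but about 38% of the St. Petersburg plays, which agrees with the pattern described in the conclusion.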

### Musical preferences at the beginning and end of the week

According to the second hypothesis, certain genres predominate in Moscow on Monday mornings, whereas others predominate in St. Petersburg. Similarly, Friday evenings vary by city in terms of genre.

In [38]:
# creating a table with Moscow data
moscow_general=df[df.city=='Moscow']

In [39]:
# creating a table with St. Petersburg data
spb_general=df[df.city=='Saint-Petersburg']

Let's create a function `genre_weekday()` with four parameters:
* dataframe;
* day of the week;
* initial timestamp in 'hh:mm' format;
* last timestamp in 'hh:mm' format.

It should return information about the top 10 genres of tracks listened to in the interval between two timestamps on the specified day.

In [40]:
# creating a function genre_weekday() 
def genre_weekday(table, day, time1, time2):
    genre_df = table.loc[(table.day==day) & (table.time>time1) & (table.time<time2)]
    genre_df_count = genre_df.groupby('genre')['track'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted[:10]
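
`genre_weekday()` compares times as plain strings. That works here because the `time` values are zero-padded `hh:mm:ss` strings, so alphabetical order matches chronological order. The sketch below checks the boundaries on a made-up frame (`toy` is an assumption for illustration).

```python
import pandas as pd

# Made-up Monday plays around the 07:00 and 11:00 boundaries
toy = pd.DataFrame({
    'day': ['Monday'] * 5,
    'time': ['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00'],
    'genre': ['rock', 'pop', 'dance', 'pop', 'jazz'],
    'track': ['a', 'b', 'c', 'd', 'e'],
})

# Same conditions as in genre_weekday(): string comparison on 'time'
morning = toy.loc[(toy['day'] == 'Monday') & (toy['time'] > '07:00') & (toy['time'] < '11:00')]

# '07:00:00' > '07:00' is True (a longer string with the same prefix sorts later),
# while '11:00:00' < '11:00' is False, so the window runs from 07:00:00 to 10:59:59
print(morning['time'].tolist())

# Plays per genre, most frequent first
print(morning.groupby('genre')['track'].count().sort_values(ascending=False))
```

If the times were not zero-padded (for example `7:00:00`), string comparison would no longer be reliable and the column would need to be parsed first.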

I will compare Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00) using `genre_weekday()`.

In [41]:
# displaying top-10 genres in Moscow from 7:00 to 11:00
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hip            281
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: track, dtype: int64

In [42]:
# displaying top-10 genres in St. Petersburg from 7:00 to 11:00
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hip             79
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: track, dtype: int64

In [44]:
# displaying top-10 genres in Moscow from 17:00 to 23:00
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hip            267
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: track, dtype: int64

In [45]:
# displaying top-10 genres in St. Petersburg from 17:00 to 23:00
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hip             94
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: track, dtype: int64

**Conclusion**

After comparing the top 10 genres on Monday morning, I can say that:

1. Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow top 10 includes world music, while the St. Petersburg top 10 includes jazz and classical music.

2. The Moscow data has so many missing values that `unknown` takes tenth place among the most popular genres. Missing values therefore make up a noticeable share of the data and threaten its credibility.

This picture does not change on Friday evening. Some genres rise a little higher, others drop, but overall the top 10 remains the same.

The second hypothesis was partially supported:
* Users listen to similar music at the beginning and at the end of the week;
* There is not much difference between Moscow and St. Petersburg.

Missing values, however, cast doubt on this result. With so many of them in the Moscow data, the top 10 could look quite different.
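
To put a number on this risk, one could measure how large the `unknown` share is in each slice. The sketch below is a hypothetical helper (`unknown_share` is not defined elsewhere in this notebook) demonstrated on made-up data. On the real tables it could be applied to the rows selected for Monday morning or Friday evening.

```python
import pandas as pd


def unknown_share(table):
    """Return the percentage of rows whose genre is the 'unknown' placeholder."""
    return (table['genre'] == 'unknown').mean() * 100


# Made-up slice: 2 of the 8 plays have no genre
toy = pd.DataFrame({'genre': ['pop', 'unknown', 'rock', 'pop', 'unknown', 'dance', 'pop', 'rock']})
print(unknown_share(toy))
```

A small share would suggest that the top 10 is unlikely to change much, while a large one would call for caution.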

### Genre preferences in two cities

According to the third hypothesis, rap is the most popular genre in St. Petersburg. Moscow is dominated by pop music.

In [51]:
# grouping the moscow_general table by genre
# counting the number of tracks in each genre
# sorting in the descending order
moscow_genres = moscow_general.groupby('genre')['track'].count().sort_values(ascending=False)
moscow_genres[:10]

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hip            2041
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: track, dtype: int64

In [53]:
# grouping the spb_general table by genre
# counting the number of tracks in each genre
# sorting in the descending order
spb_genres = spb_general.groupby('genre')['track'].count().sort_values(ascending=False)
spb_genres[:10]

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hip             934
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: track, dtype: int64

**Conclusion**

The third hypothesis was partially supported:

- Pop music is the most popular genre in Moscow;
- Pop music is also the most popular genre in St. Petersburg, contrary to the hypothesis.
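
Because the two cities differ in size, comparing shares can be more telling than comparing counts. The sketch below reuses the top five counts printed above and the city totals from the earlier `groupby('city')` output (42741 and 18512). It does not touch the real tables.

```python
import pandas as pd

# Top-5 counts copied from the moscow_genres and spb_genres outputs above
genres = ['pop', 'dance', 'rock', 'electronic', 'hip']
moscow_top = pd.Series([5892, 4435, 3965, 3786, 2041], index=genres)
spb_top = pd.Series([2431, 1932, 1879, 1736, 934], index=genres)

# City totals copied from the groupby('city') output earlier in the notebook
comparison = pd.DataFrame({
    'moscow_pct': (moscow_top / 42741 * 100).round(1),
    'spb_pct': (spb_top / 18512 * 100).round(1),
})
print(comparison)
```

On the real tables, `moscow_general['genre'].value_counts(normalize=True)` should give the same shares as fractions.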

## Final conclusion

I tested three hypotheses and concluded that the day of the week has a different effect on the activity of users in Moscow and St. Petersburg.

The analysis supports the first hypothesis that users in two cities listen to music differently on different days of the week:
- Listening peaks in Moscow on Mondays and Fridays, and declines on Wednesdays;
- Listening peaks in St. Petersburg on Wednesdays, and declines on Mondays and Fridays.

The second hypothesis was partially supported. It states that certain genres predominate in Moscow on Monday mornings, whereas others predominate in St. Petersburg. Similarly, Friday evenings vary by city in terms of genre:

- Users listen to similar music at the beginning and at the end of the week;
- There is not much difference between Moscow and St. Petersburg.

The third hypothesis was partially supported. It states that rap is the most popular genre in St. Petersburg, while Moscow is dominated by pop music:

- Pop music is the most popular genre in Moscow;
- Pop music is also the most popular genre in St. Petersburg, contrary to the hypothesis.