<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-overview" data-toc-modified-id="Data-overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data overview</a></span></li><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Missing-values" data-toc-modified-id="Missing-values-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Missing values</a></span></li><li><span><a href="#Duplicate-rows" data-toc-modified-id="Duplicate-rows-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Duplicate rows</a></span></li></ul></li><li><span><a href="#Hypothesis-Testing" data-toc-modified-id="Hypothesis-Testing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Hypothesis Testing</a></span><ul class="toc-item"><li><span><a href="#Comparison-of-user-behavior-of-two-citys" data-toc-modified-id="Comparison-of-user-behavior-of-two-citys-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Comparison of user behavior of two citys</a></span></li><li><span><a href="#Music-at-the-beginning-and-end-of-the-week" data-toc-modified-id="Music-at-the-beginning-and-end-of-the-week-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Music at the beginning and end of the week</a></span></li><li><span><a href="#Genre-preferences-in-Moscow-and-St.-Petersburg" data-toc-modified-id="Genre-preferences-in-Moscow-and-St.-Petersburg-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Genre preferences in Moscow and St. Petersburg</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

**Research goal:**  
Test three hypotheses:
1. User activity depends on the day of the week, and the pattern differs between Moscow and St. Petersburg.
2. On Monday morning, certain genres dominate in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres depending on the city.
3. Moscow and St. Petersburg prefer different genres of music: pop is listened to more often in Moscow, Russian rap in St. Petersburg.


## Data overview

In [1]:
# Importing libraries
import pandas as pd
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import ttest_ind, f_oneway


In [2]:
# Loading the dataset
file_path = "C:/Users/Julia/Documents/Portfolio/yandex_music_project/yandex_music_project.csv"

try:
    df = pd.read_csv(file_path)
    print("File successfully loaded")
    
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print("An error occurred:", str(e))


File successfully loaded


In [3]:
df.head()

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |


The column names should be converted to snake_case.

In [4]:
# Renaming columns to snake_case and removing spaces
df.rename(columns=lambda x: x.strip().replace(' ', '').lower()
                              .replace('userid', 'user_id'), inplace=True)

df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
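The rename above handles `userID` with a hard-coded replacement. A more general conversion could look like the sketch below. The helper name `to_snake_case` and the toy column labels are assumptions made for illustration, and unlike the cell above it turns inner whitespace into underscores instead of removing it.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a column label such as '  userID' or 'City' to snake_case."""
    name = name.strip()
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    # Turn inner whitespace into underscores and lowercase the result
    return re.sub(r'\s+', '_', name).lower()


# Toy frame whose labels mimic the raw header (assumed, not read from the CSV)
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'City  ', 'Day'])
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

Passing a function to `rename(columns=...)` applies it to every label, so new columns would be converted without editing the mapping.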

In [13]:
from IPython.display import display


def data_info(df):
    """Print an overview of a DataFrame: sample, dtypes, statistics, gaps, duplicates."""
    print('Random 10 rows of the dataset')
    display(df.sample(n=10))
    print('\nInformation about the number of observations and data types\n')
    df.info()
    print('\n\nStatistical data')
    # datetime_is_numeric was removed in pandas 2.0; this frame has no datetime columns
    display(df.describe(include='all'))
    print('\n\nMissing values:')
    display(pd.DataFrame(round(df.isna().mean() * 100, 1)).style.background_gradient('coolwarm'))
    print('\n\nNumber of zero values in columns:')
    display(pd.DataFrame(df.eq(0).sum()))
    print('\nNumber of explicit duplicates:', df.duplicated().sum())



In [14]:

data_info(df)

Random 10 rows of the dataset


| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 4207 | 2864D190 | Давай поженимся | Олег Пахомов | other | Moscow | 20:16:57 | Monday |
| 44431 | B71ABAB6 | Searchin' (Re-Mastered) | The Coasters | rock | Moscow | 13:57:44 | Monday |
| 12659 | 91EE6865 | Поле чудес | Гарри Бардин | children | Moscow | 21:30:49 | Friday |
| 62161 | 962D83D2 | Al Pacino | | dance | Saint-Petersburg | 14:28:56 | Wednesday |
| 31747 | E3AA0FCE | Hard Corpse | Bass Kittens | electronic | Saint-Petersburg | 20:24:48 | Friday |
| 6809 | 618195A6 | Save Tonight | The Blackout | posthardcore | Saint-Petersburg | 08:38:06 | Wednesday |
| 36178 | F22A584B | Heavenly | Priscilla Renea | rnb | Moscow | 21:48:02 | Friday |
| 47585 | 112AE694 | Pulsar Activity | Cosmic Replicant | electronic | Moscow | 20:56:45 | Friday |
| 9862 | 56496127 | King Cobra | | soundtrack | Saint-Petersburg | 20:03:27 | Monday |
| 386 | CA88AF3B | Healing | Love Eternal | reggae | Saint-Petersburg | 13:23:43 | Wednesday |



Information about the number of observations and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  65079 non-null  object
 1   track    63848 non-null  object
 2   artist   57876 non-null  object
 3   genre    63881 non-null  object
 4   city     65079 non-null  object
 5   time     65079 non-null  object
 6   day      65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


Statistical data


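| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| count | 65079 | 63848 | 57876 | 63881 | 65079 | 65079 | 65079 |
| unique | 41748 | 47245 | 43605 | 289 | 2 | 20392 | 3 |
| top | A8AE9169 | Intro | Sasha | pop | Moscow | 08:14:07 | Friday |
| freq | 76 | 34 | 6 | 8850 | 45360 | 14 | 23149 |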




Missing values:


| column | missing values, % |
|---|---|
| user_id | 0.0 |
| track | 1.9 |
| artist | 11.1 |
| genre | 1.8 |
| city | 0.0 |
| time | 0.0 |
| day | 0.0 |




Number of zero values in columns:


| column | zero values |
|---|---|
| user_id | 0 |
| track | 0 |
| artist | 0 |
| genre | 0 |
| city | 0 |
| time | 0 |
| day | 0 |



Number of explicit duplicates: 3826


The `track`, `artist`, and `genre` columns contain missing values.  
About 5.9% of the rows (3826 of 65079) are explicit duplicates.  
The dataset must also be checked for implicit duplicates.  
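The share of duplicate rows can be recomputed from the two figures printed by `data_info`. The sketch below does that and then shows the same calculation on a frame. The toy frame is made up for illustration and is not the project's data.

```python
import pandas as pd

n_rows = 65079       # RangeIndex size reported by df.info()
n_duplicates = 3826  # value reported by df.duplicated().sum()
print(f"Share of duplicates: {n_duplicates / n_rows:.1%}")

# The same share computed directly on a frame (toy data, for illustration only)
toy = pd.DataFrame({'user_id': ['A', 'A', 'B', 'C'],
                    'genre': ['pop', 'pop', 'rock', 'rap']})
toy_share = toy.duplicated().mean()  # mean of a boolean mask is the fraction of True
print(f"Toy share of duplicates: {toy_share:.1%}")
```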


## Data Preprocessing

### Missing values

The `track` and `artist` columns are not used in the hypotheses, so their missing values are replaced with "unknown".
Missing values in `genre` are replaced with the most frequent genre of the same artist; if no match exists, they are replaced with "unknown".

In [None]:
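# Replace missing values in the columns that the hypotheses do not use
columns_to_replace = ['track', 'artist']
replace_value = 'unknown'
df[columns_to_replace] = df[columns_to_replace].fillna(replace_value)
# Most frequent genre per known artist; the pooled 'unknown' placeholder is excluded
# so that it does not lend its mode to unrelated tracks
known = df[(df['artist'] != replace_value) & df['genre'].notna()]
artist_mode = known.groupby('artist')['genre'].agg(lambda x: x.mode().iat[0])
# Fill missing genres from the artist's mode, falling back to 'unknown'
df['genre'] = df['genre'].fillna(df['artist'].map(artist_mode)).fillna(replace_value)
df.isna().sum()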

### Duplicate rows

Check the `genre` column for implicit duplicates, that is, different spellings of the same genre.

In [None]:
genre = df['genre'].sort_values().unique()
genre

Some genres appear under several spellings; these need to be brought to a single spelling.

In [None]:
# Replacement of synonyms
df['genre'] = df['genre'].replace({
    'frankreich': 'french',
    'französisch': 'french',
    'hip': 'hiphop',
    'hip-hop': 'hiphop',
    'hop': 'hiphop',
    'independent': 'indie',
    'neue': 'new',
    'türk': 'türkçe',
    'электроника': 'electronic'
})
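Scanning 289 unique genres by eye is error-prone, so a string-similarity pass could help shortlist candidates for manual review. The sketch below uses `difflib.SequenceMatcher` from the standard library. The helper `similar_pairs`, the 0.8 threshold, and the sample list are assumptions for illustration, not part of the original analysis.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Small sample of genre labels (illustrative subset, not the full column)
genres = ['hip', 'hip-hop', 'hiphop', 'hop', 'indie', 'independent', 'pop', 'rock']


def similar_pairs(labels, threshold=0.8):
    """Return label pairs whose similarity ratio is at least `threshold`."""
    pairs = []
    for a, b in combinations(sorted(set(labels)), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


candidates = similar_pairs(genres)
print(candidates)
```

A pass like this only flags near-identical spellings such as `hip-hop` and `hiphop`. Translations and abbreviations (`französisch`, `hip`, `independent`) still need the manual mapping shown above.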

In [None]:
df['city'].unique()

In [None]:
df.duplicated().sum()

In [None]:
df = df.drop_duplicates()

In [None]:
df.duplicated().sum()

## Hypothesis Testing

### Comparison of user behavior of two citys

* **Null Hypothesis (H0)**:   
There is no significant difference in user activity based on the day of the week between Moscow and Saint Petersburg.  

* **Alternative Hypothesis (H1)**:   
User activity varies significantly based on the day of the week, and this variation differs between Moscow and Saint Petersburg.

In [None]:
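# Number of listens (rows) in Moscow; despite the name, this is not a count of unique users
moscow_users = len(df.query('city == "Moscow"')['user_id'])
moscow_users

In [None]:
# Number of listens (rows) in St. Petersburg
spb_users = len(df.query('city == "Saint-Petersburg"')['user_id'])
spb_users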

In [None]:
%matplotlib inline
plt.style.use('dark_background')
# Group data by day and count activities
grouped = df.groupby(['day', 'city']).size().reset_index(name='count')
grouped['percentage_city_users'] = grouped.apply(
    lambda row: row['count'] / moscow_users * 100 
    if row['city'] == 'Moscow' 
    else row['count'] / spb_users * 100, axis=1)
# Visualization
plt.figure(figsize=(10, 6))
sns.barplot(data=grouped, x='day', y='percentage_city_users', hue='city')

for p in plt.gca().patches:
    height = p.get_height()
    plt.gca().text(p.get_x() + p.get_width() / 2, height + 0.5, f'{height:.1f}%', ha='center', va='bottom')

plt.xlabel('Day of the Week')
plt.ylabel('Percentage of City Listens')
plt.title('Percentage of City Listens by Day of the Week for Moscow and Saint-Petersburg')
plt.xticks(rotation=45)
plt.tight_layout()
plt.legend(loc='lower right')
plt.show()


User activity across the days of the week follows different patterns in the two cities. In Moscow, activity drops noticeably on Wednesday, while in St. Petersburg it rises slightly on the same day. Overall, Moscow accounts for considerably more listens than St. Petersburg.

In [None]:
# Dividing the data into two groups: Moscow and St. Petersburg
moscow = grouped[grouped['city'] == 'Moscow']['count']
spb = grouped[grouped['city'] == 'Saint-Petersburg']['count']
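# Independent two-sample t-test on the per-day listen counts (three values per city);
# it compares the mean daily volume of the two cities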
t_stat, p_value = ttest_ind(moscow, spb)
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

# Determining Statistical Significance
alpha = 0.05
if p_value < alpha:
    print("\nReject null hypothesis: There is a significant difference between Moscow and Saint-Petersburg.")
else:
    print("\nFail to reject null hypothesis: No significant difference between Moscow and Saint-Petersburg.")
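The t-test above compares mean daily counts using only three values per city, so it mostly reflects Moscow's larger volume. Whether the distribution of listens over days differs between the cities could alternatively be checked with a chi-square test of independence on the city-by-day contingency table. The sketch below is not part of the original analysis and runs on made-up rows; the toy sample is far too small for a valid test and only illustrates the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy listen log (made-up rows, not the project's data)
toy = pd.DataFrame({
    'city': ['Moscow'] * 6 + ['Saint-Petersburg'] * 6,
    'day': ['Monday', 'Monday', 'Monday', 'Wednesday', 'Friday', 'Friday',
            'Monday', 'Wednesday', 'Wednesday', 'Wednesday', 'Friday', 'Friday'],
})

# Rows: city, columns: day, cells: number of listens
table = pd.crosstab(toy['city'], toy['day'])

# H0: the distribution of listens over days is the same in both cities
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

On the real data the same call would take `pd.crosstab(df['city'], df['day'])`, which uses every listen instead of three aggregated counts per city.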

### Music at the beginning and end of the week

* **Null hypothesis (H0):** There is no difference in the predominant genres between Moscow and Saint-Petersburg on Monday mornings or Friday evenings.

* **Alternative hypothesis (H1):** There is a difference in the predominant genres between Moscow and Saint-Petersburg on Monday mornings and Friday evenings.

In [None]:
moscow_general = df[df['city'] == 'Moscow']
spb_general = df[df['city'] == 'Saint-Petersburg']

In [None]:
def genre_weekday(df, day, time1, time2):
    # Filter the DataFrame to keep rows where the 'day' column equals the specified 'day'.
    genre_df = df[df['day'] == day]

    # Filter the 'genre_df' DataFrame to keep rows with a 'time' column greater than 'time1'.
    genre_df = genre_df[genre_df['time'] > time1]

    # Filter the 'genre_df' DataFrame again to keep rows with a 'time' column less than 'time2'.
    genre_df = genre_df[genre_df['time'] < time2]

    # Group the filtered DataFrame by the 'genre' column and count the occurrences of each genre.
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()

    # Sort the resulting Series in descending order to have the most popular genres first.
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)

    # Return a Series with the top 10 most popular genres during the specified time interval of the given day.
    return genre_df_sorted[:10]

In [None]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

In [None]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

In [None]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

In [None]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')
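Comparing two top-10 lists by eye is easy to get wrong. A small helper could list the genres shared by both rankings and those unique to each. The function name `compare_top` and the toy rankings are assumptions for illustration; on the real data the inputs would be the Series returned by `genre_weekday`.

```python
import pandas as pd


def compare_top(a: pd.Series, b: pd.Series):
    """Return genres present in both rankings, only in `a`, and only in `b`."""
    set_a, set_b = set(a.index), set(b.index)
    return sorted(set_a & set_b), sorted(set_a - set_b), sorted(set_b - set_a)


# Toy rankings indexed by genre (made-up counts)
moscow_top = pd.Series({'pop': 5, 'rock': 4, 'world': 1})
spb_top = pd.Series({'pop': 3, 'rock': 3, 'jazz': 1})

common, only_moscow, only_spb = compare_top(moscow_top, spb_top)
print('Shared:', common)
print('Only in Moscow:', only_moscow)
print('Only in St. Petersburg:', only_spb)
```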

**Conclusions**

Comparing the top 10 genres on Monday morning leads to the following conclusions:

Moscow and St. Petersburg have similar music preferences. The only difference is that "world" music appears in Moscow's ranking, while jazz and classical music appear in St. Petersburg's.

In Moscow, missing values are so numerous that the "unknown" genre takes tenth place among the most popular genres. Missing values therefore make up a significant share of the data and threaten the reliability of the study.

Friday evening does not change this picture significantly. Some genres rise slightly and others fall, but the top 10 remains largely the same.

Thus, the second hypothesis is only partially confirmed:

* Users listen to similar music at the beginning and end of the week.
* The difference between Moscow and St. Petersburg is not very pronounced. Moscow residents listen to Russian pop music more often, while St. Petersburg residents prefer jazz.

However, the missing data calls this result into question. In Moscow, there are so many missing values that the top 10 could look different if the genre data had not been lost.

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and this genre is listened to there more often than in Moscow. Moscow is a city of contrasts that is nevertheless dominated by pop music.

In [None]:
moscow_genres = moscow_general.groupby('genre')['genre'].count()
moscow_genres = moscow_genres.sort_values(ascending=False)

In [None]:
moscow_genres.head(10)

In [None]:
spb_genres = spb_general.groupby('genre')['genre'].count()
spb_genres = spb_genres.sort_values(ascending=False)

In [None]:
spb_genres.head(10)
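Because Moscow has many more listens than St. Petersburg, raw counts are hard to compare across cities. Expressing each genre as a share of its city's listens puts both on the same scale. The sketch below uses `pd.crosstab` with `normalize='columns'` on made-up rows; it is an illustration, not the original method.

```python
import pandas as pd

# Toy listen log (made-up rows, not the project's data)
toy = pd.DataFrame({
    'city': ['Moscow'] * 5 + ['Saint-Petersburg'] * 4,
    'genre': ['pop', 'pop', 'rock', 'rap', 'dance',
              'pop', 'rock', 'rap', 'rap'],
})

# Share of each genre within its city, in percent (each column sums to 100)
shares = pd.crosstab(toy['genre'], toy['city'], normalize='columns') * 100
print(shares.round(1))
```

On the real data, `pd.crosstab(df['genre'], df['city'], normalize='columns')` would give the share of pop and rap in each city directly.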

**Conclusions**  

The hypothesis was partially confirmed:

* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 includes a closely related genre, Russian popular music.
* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg.

## Conclusion

In this analysis, three hypotheses were tested regarding the influence of the day of the week on user activity in Moscow and St. Petersburg, as well as the variation in music preferences during the week.

The first hypothesis was fully confirmed, suggesting that the day of the week indeed has different effects on user activity in Moscow and St. Petersburg.

The second hypothesis was partially confirmed. It was found that musical preferences do not change significantly during the week, whether in Moscow or St. Petersburg. However, minor differences were observed on Mondays: "world" music is more popular in Moscow, while jazz and classical music are preferred in St. Petersburg. It's worth noting that this result could have been different if there were no missing data.

The third hypothesis was only partially confirmed. Pop music does lead in Moscow, but rap is equally popular in both cities, and overall the analysis revealed more similarities than differences in the music preferences of Moscow and St. Petersburg users. If preferences do vary, the variation is not discernible in the overall user base.

In light of these findings, it is essential to revisit this analysis, taking into account the limitations introduced by missing data. Further investigations or data collection efforts may provide deeper insights into the differences and commonalities in user behavior and music preferences in Moscow and St. Petersburg.