# Y.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Step 1: Overview of Data](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Step 2: Data Pre-processing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Step 3: Testing Hypothesis](#hypotheses)
    * [3.1 Hypothesis 1: Comparing User Behavior in Two Cities](#activity)
    * [3.2 Hypothesis 2: Music at the Beginning and End of the Week](#week)
    * [3.3 Hypothesis 3: Preferences in Springfield and Shelbyville](#genre)
* [Conclusions](#end)

## Introduction <a id='intro'></a>
Whenever we conduct research, we need to formulate hypotheses that we can later test. Sometimes we accept these hypotheses, and other times we reject them. To make the right choices, a business must be able to understand whether it is making the right assumptions or not.

In this project, we will compare the music preferences of the inhabitants of Springfield and Shelbyville. We will study real data from Y.Music to test the hypothesis below and compare user behavior for these two cities.

### The objective is to test three hypotheses:

- User activity is different depending on the day of the week and the city.
- During Monday mornings, residents of Springfield and Shelbyville listen to different genres. This is also true for Friday nights.
- Listeners in Springfield and Shelbyville have different preferences. In Springfield, people prefer pop, while Shelbyville has more rap fans.

### Steps:
The data on user behavior is stored in the file /datasets/music_project_en.csv. There is no information about the quality of the data, so we will need to examine it before testing the hypothesis.

First, we will assess the data's quality and see if there are significant problems. Then, during data preprocessing, we will try to address the most critical issues.

## Our project will consist of three steps:

1 - Overview of the Data
2 - Data Pre-processing
3 - Testing the Hypotheses

[Voltar ao Índice](#back)

## Step 1: Overview of Data <a id='data_review'></a>



In [1]:
# importing pandas
import pandas as pd

In [5]:
# reading the file and storing it in a dataframe
df = pd.read_csv('/Users/andrewferreira/Downloads/music_project_en.csv')


In [6]:
#printing first 10 lines of the dataframe
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


Getting general information about the table with a command:

In [7]:
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,08:14:07,Friday
freq,76,136,136,8850,45360,14,23149


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


In [9]:
len(df)

65079

The table contains seven columns. They store the same type of data: object.

According to the documentation:
- `'userID'` — user identification
- `'Track'` — song title
- `'artist'` — artist name
- `'genre'` — the genre
- `'City'` — user's city
- `'time'` — exact time the song was played
- `'Day'` — day of the week 

We can see three problems with the style in the column names:
1. Some names are in uppercase, while others are in lowercase.
2. There are spaces in some names.
3. The 'time' column is marked as an object and should be marked as time.
4. The quantity of values in the columns is different. This indicates that the data contains missing values.

### Conclusion <a id='data_review_conclusions'></a> 

Each row in the table stores data about a song that was played. Some columns describe the song itself: its title, artist, and genre. The rest contain information about the user: the city they come from, and the number of times the song was played.

It is clear that the data is sufficient to test hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

[Voltar ao Índice](#back)

## Step 2: Data Pre-processing <a id='data_preprocessing'></a>

On this step we are going to change the column heading and take care of missing values. We are also going to check for duplicates in data.

In [10]:
#printing column heading
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Changing the column names according to the good practice style rules:

* Use snake_case if there are multiple words
* All characters should be in lowercase
* Delete spaces

In [11]:
# changing index names
df=df.rename(
    columns={
        '  userID': 'user_id',
        'Track': 'track',
        'artist': 'artist',
        'genre': 'genre',
        '  City  ': 'city',
        'time': 'time',
        'Day': 'day'
    }
)
               

After changing the index name, let's check if it's working

In [12]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

[Voltar ao Índice](#back)

### Missing Values <a id='missing_values'></a>
"First, we will find the number of missing values in the table. To do this,we will use two pandas methods:"

In [13]:
#checking for missing values

df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the analysis. For example, missing values in the 'song' and 'artist' columns are not crucial. You can simply replace them with clear markers.

However, missing values in 'genre' may impact the comparison of musical preferences between Springfield and Shelbyville. In real life, it would be useful to find out the reasons for missing data and try to compensate for them. But we don't have that option in this project. So, we'll need to:

* Fill in missing values with markers
* Assess how much missing values may impact our calculations.

Replace the missing values in 'track,' 'artist,' and 'genre' with the string 'unknown.' To do this, we'll create the list columns_to_replace, iterate through it with a for loop, and replace the missing values in each of the columns:

In [14]:
df['track'] = df['track'].fillna('unknown')
df['artist'] = df['artist'].fillna('unknown')
df['genre'] = df['genre'].fillna('unknown')

Let's check if we still have any missing values:

In [15]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

[Voltar ao Índice](#back)

### Duplicates <a id='duplicates'></a>
Let's find the number of obvious duplicates in the table using:

In [16]:
df.duplicated().sum()

3826

Getting rid of the duplicates:

In [17]:
df = df.drop_duplicates()

In [18]:
#Checking if there are any duplicates
df.duplicated().sum() 

0

Now, let's get rid of the implicit duplicates in the 'genre' column. For example, the name of a genre may be written in different ways. Some errors will also affect the result.

Displaying the list of unique genre names, organized in alphabetical order. To do this:
* Retrieve the DataFrame from the desired column
* Apply a selection method for this
* For the selected column, we'll call the method that will return all unique values in the column


In [19]:
df['genre'].unique()# visualizing unique genre names

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'unknown', 'alternative', 'children', 'rnb', 'hip',
       'jazz', 'postrock', 'latin', 'classical', 'metal', 'reggae',
       'triphop', 'blues', 'instrumental', 'rusrock', 'dnb', 'türk',
       'post', 'country', 'psychedelic', 'conjazz', 'indie',
       'posthardcore', 'local', 'avantgarde', 'punk', 'videogame',
       'techno', 'house', 'christmas', 'melodic', 'caucasian',
       'reggaeton', 'soundtrack', 'singer', 'ska', 'salsa', 'ambient',
       'film', 'western', 'rap', 'beats', "hard'n'heavy", 'progmetal',
       'minimal', 'tropical', 'contemporary', 'new', 'soul', 'holiday',
       'german', 'jpop', 'spiritual', 'urban', 'gospel', 'nujazz',
       'folkmetal', 'trance', 'miscellaneous', 'anime', 'hardcore',
       'progressive', 'korean', 'numetal', 'vocal', 'estrada', 'tango',
       'loungeelectronic', 'classicmetal', 'dubstep', 'club', 'deep',
       'southern', 'black', 'folkrock', 

Looking at the list we can find implicit duplicates of the genre 'hiphop.' These could be incorrectly written names or alternative names for the same genre.

We can see the following implicit duplicates::
* hip
* hop
* hip-hop

To get rid of them, we have to declare the function replace_wrong_genres() with two parameters: 
* wrong_genres= — list of duplicates
* correct_genre= — string with the correct value

The function should correct the names in the 'genre' column of the df table, that is, replacing each value from the wrong_genres list with values from correct_genre.

In [20]:
# function to substitute implicit duplicates
def replace_wrong_genres(wrong_genres,correct_genre):
    df['genre'] = df['genre'].replace(wrong_genres, correct_genre)

We are gonna call 'replace_wrong_genres()' and pass arguments to it so that it can eliminate implicit duplicates (hip, hop, and hip-hop) and replace them with hiphop:

In [21]:
# removing implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

replace_wrong_genres(duplicates, correct_genre)

Checking if duplicate names were removed:

In [22]:
df['genre'].unique()

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'unknown', 'alternative', 'children', 'rnb',
       'hiphop', 'jazz', 'postrock', 'latin', 'classical', 'metal',
       'reggae', 'triphop', 'blues', 'instrumental', 'rusrock', 'dnb',
       'türk', 'post', 'country', 'psychedelic', 'conjazz', 'indie',
       'posthardcore', 'local', 'avantgarde', 'punk', 'videogame',
       'techno', 'house', 'christmas', 'melodic', 'caucasian',
       'reggaeton', 'soundtrack', 'singer', 'ska', 'salsa', 'ambient',
       'film', 'western', 'rap', 'beats', "hard'n'heavy", 'progmetal',
       'minimal', 'tropical', 'contemporary', 'new', 'soul', 'holiday',
       'german', 'jpop', 'spiritual', 'urban', 'gospel', 'nujazz',
       'folkmetal', 'trance', 'miscellaneous', 'anime', 'hardcore',
       'progressive', 'korean', 'numetal', 'vocal', 'estrada', 'tango',
       'loungeelectronic', 'classicmetal', 'dubstep', 'club', 'deep',
       'southern', 'black', 'folkrock

[Voltar ao Índice](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>
We have identified three issues with the data:

- Header style was incorrect
- Missing values
- Obvious and implicit duplicates

The header has been changed to simplify table processing.

All missing values have been replaced with 'unknown'. However, we still need to assess if missing values in 'genre' will affect our calculations.

The absence of duplicates will make the results more accurate and easier to understand.

Now you can proceed to test hypotheses.

[Voltar ao Índice](#back)

## Step 3: Testing Hypothesis <a id='hypotheses'></a>

### 3.1 Hypothesis 1: Comparing User Behavior in Two Cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. we will test this hypothesis using the data for three days of the week: Monday, Wednesday, and Friday.

- First we will divide the users from each city into groups. 
- Second we will compare the number of songs each group listened to on Monday, Wednesday, and Friday.


In [23]:
df.groupby('city').count()


Unnamed: 0_level_0,user_id,track,artist,genre,time,day
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Shelbyville,18512,18512,18512,18512,18512,18512
Springfield,42741,42741,42741,42741,42741,42741


Springfield has more songs played than Shelbyville. However, this doesn't necessarily mean that the citizens of Springfield listen to music more frequently. This city is simply larger and has more users.

Now,we are going to group the data by day of the week and find the quantity of songs played on Monday, Wednesday, and Friday.


In [24]:

df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Wednesday is generally the calmest day overall. However, if we consider the two cities separately, we may arrive at a different conclusion.

We saw how grouping by city or days of week work. Now, we're going to see how to group data by these two criteria.

To create the function 'number_tracks()' that calculates the number of played songs in a specific day of the week and in each city we are going to need two things:
- Day of the week
- Name of the city

Na função, use a variável para armazenar as linhas da tabela original, onde:
  * o valor da coluna 'day' é igual ao parâmetro dia
  * o valor da coluna 'city' é igual ao parâmetro cidade

Aplique filtros consecutivos com indexação lógica.

Depois, calcule os valores da coluna 'user_id' na tabela resultante. Armazene o resultado na nova variável. Retorne essa variável da função.

In [25]:
def number_tracks(day,city):
    track_list= df[(df['day'] == day)]
    track_list= track_list[track_list['city'] == city]
    track_list_count= track_list['user_id'].count()
    
    return track_list_count


Chame a `number_tracks()` seis vezes, mudando os valores dos parâmetros, para que você recupere os dados de ambas as cidades para os três dias.

In [26]:
print(number_tracks(day='Monday', city='Springfield'))

15740


In [27]:
print(number_tracks(day='Monday', city='Shelbyville'))

5614


In [28]:
print(number_tracks(day='Wednesday', city='Springfield'))

11056


In [29]:
print(number_tracks(day='Wednesday', city='Shelbyville'))# a quantidade de músicas tocadas em Shelbyville na quarta-feira

7003


In [30]:
print(number_tracks(day='Friday', city='Springfield'))

15945


In [31]:
print(number_tracks(day='Friday', city='Shelbyville'))

5895


Use pd.DataFrame to create one table, where:
* Column names are: ['city', 'monday', 'wednesday', 'friday']`
* Data is the result of what you get from number_tracks()

In [32]:




springfield_monday = number_tracks(day='Monday', city='Springfield')
springfield_wednesday = number_tracks(day='Wednesday', city='Springfield')
springfield_friday = number_tracks(day='Friday', city='Springfield')
shelbyville_monday = number_tracks(day='Monday', city='Shelbyville')
shelbyville_wednesday = number_tracks(day='Wednesday', city='Shelbyville')
shelbyville_friday = number_tracks(day='Friday', city='Shelbyville')
dados = [
    ['Springfield',
    springfield_monday,
    springfield_wednesday,
    springfield_friday],
    ['Shelbyville',
    shelbyville_monday,
    shelbyville_wednesday,
    shelbyville_friday]
]

pd.DataFrame(data = dados, columns =['city', 'monday', 'wednesday', 'friday'])




Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the peak in the number of songs played occurs on Mondays and Fridays, while there is a decrease in activity on Wednesdays.
- In Shelbyville, on the contrary, users listen to more music on Wednesdays. Activity on Mondays and Fridays is low.

So, the first hypothesis seems to be correct.

[Voltar ao Índice](#back)

### 3.2 Hypothesis 2: Music at the Beginning and End of the Week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday evenings, residents of Springfield listen to genres that differ from those favored by some users in Shelbyville.

Let's obtain one table:
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [33]:
spr_general = df.loc[df.loc[:, 'city'] == 'Springfield']
spr_general


Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
65073,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65076,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [34]:
shel_general = df.loc[df.loc[:, 'city'] == 'Shelbyville']
shel_general


Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
65063,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
65064,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
65065,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
65066,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


Now we will write the function 'genre_weekday()' with four parameters:
* One table for data (`df`)
* Day of the week (`day`)
* First time in format: 'HH:MM' (`time1`)
* Last time in format: 'HH,MM' (`time2`)

The function should return information about the 15 most popular genres on a specific day, within the time range defined by the two timestamps.

In [35]:
df = spr_general
day, time1, time2 = 'Monday' ,'07:00:00' ,'11:00:00'


def genre_weekday(df, day, time1, time2):
    
      #Escreva a sua função aqui

    # filtragem consecutiva
    # genre_df armazenará apenas as linhas df onde o dia é igual a day=
    
    genre_df = df[(df['day']==day)] # escreva o seu código aqui

    # genre_df armazenará apenas aslinhas df que o tempo é menor do que time2=
    genre_df = genre_df.loc[genre_df['time'] < time2] # escreva o seu código aqui

    # genre_df armazenará apenas as linhas onde onde o tempo é maior do que time1=
    genre_df = genre_df.loc[genre_df['time'] > time1]  # escreva o seu código aqui

    # agrope o DataFrame filtrado pela coluna com nomes dos gêneros, pegue a coluna gênero, e encontre o número de linhas para cada gênero com o método count()
    
    genre_df_count = genre_df.groupby('genre').count() # escreva o seu código aqui

    # nós vamos armazenar o resultado em ordem decrescente (para que os gêneros mais populares venham primeiro no objeto Series)
    genre_df_sorted = genre_df_count.sort_values(by = 'time', ascending=False).head(1) # escreva o seu código aqui
    
    # nós vamos retornar o objeto Serie armazenando os 15 gêneros mais populares em um determinado dia, dentro de um determinado intervalo de tempo
    return genre_df_sorted[:15]





In [36]:
genre_weekday(spr_general, 'Monday', '07:00:00', '11:00:00')

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,781,781,781,781,781,781


In [37]:
genre_weekday(shel_general, 'Monday','07:00:00','11:00:00')

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,218,218,218,218,218,218


In [38]:
genre_weekday(spr_general, 'Friday','17:00:00','23:00:00')

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,713,713,713,713,713,713


In [39]:
genre_weekday(shel_general, 'Friday','17:00:00','23:00:00')


Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,256,256,256,256,256,256


**Conclusions**


Having compared the 15 most listened-to genres on Monday mornings, we can draw the following conclusions:
1. Users in Springfield and Shelbyville listen to similar music. The top five genres are the same, with only rock and electronic music swapping positions.
2. In Springfield, the amount of missing values turned out to be so significant that 'unknown' came in 10th. This indicates that missing values constitute a considerable portion of the data, raising questions about the reliability of the conclusions.

On Friday afternoons, the situation is similar. Individual genres may vary slightly, but overall, the top 15 genres are similar for both cities.
Thus, the second hypothesis was partially supported:
- Users listen to similar music genres at the beginning and end of the week.
- There is no significant difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.


However, the number of missing values makes this result questionable. In Springfield, there are so many that they affected the top 15. If we did not have these missing values, things could be different.

[Voltar ao Índice](#back)

### 3.3 Hypothesis 3: Preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap. Springfield's citizens prefer pop.

We will group the table spr_general by genre and find the number of songs played for each genre with the count() method. Then sort the result in descending order and store it in spr_genres.

In [40]:

spr_general = spr_general.groupby('genre').count()


spr_general = spr_general.sort_values(by='genre', ascending=False)


Printing the first 10 lines of spr_genres:

In [41]:
spr_general.sort_values('genre').head(10)

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
acid,1,1,1,1,1,1
acoustic,3,3,3,3,3,3
action,4,4,4,4,4,4
adult,16,16,16,16,16,16
africa,12,12,12,12,12,12
afrikaans,4,4,4,4,4,4
alternative,1379,1379,1379,1379,1379,1379
ambient,183,183,183,183,183,183
americana,7,7,7,7,7,7
animated,2,2,2,2,2,2


Let's do the same on Shelbyville's data grouping the table shel_general by genre finding the total number of songs played by genre. Then sort the result in descending order and store it in shel_genres:

In [42]:
shel_general = shel_general.groupby('genre').count()


shel_general = shel_general.sort_values(by='genre', ascending=False)


Printing the first 10 lines of shel_genres:

In [43]:
shel_general.sort_values('genre').head(10)

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
acoustic,2,2,2,2,2,2
adult,8,8,8,8,8,8
africa,4,4,4,4,4,4
alternative,649,649,649,649,649,649
ambient,64,64,64,64,64,64
americana,1,1,1,1,1,1
anime,29,29,29,29,29,29
arabesk,2,2,2,2,2,2
arabic,1,1,1,1,1,1
argentinetango,7,7,7,7,7,7


**Conclusion**

The hypothesis was partially proven:
- As expected, Pop music is the most popular genre in Springfield.
- However, pop music turned out to be equally popular in Springfield and Shelbyville, and rap was not in the top 5 in either city.

[Voltar ao Índice](#back)

# Conclusions <a id='end'></a>

We tested the following three hypotheses:

1. User activity varies depending on the day of the week and the city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This is also true for Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In both Springfield and Shelbyville, they prefer pop.

After analyzing the data, we concluded:

- User activity in Springfield and Shelbyville depends on the day of the week, although the cities vary in different ways. The first hypothesis is fully accepted.

- Musical preferences do not vary significantly throughout the week in both Springfield and Shelbyville. We can see small differences in the order on Mondays, but in Springfield and Shelbyville, people listen to more pop music. So, we can accept this hypothesis. We should also keep in mind that the result might have been different without the missing values.

- It turns out that the musical preferences of users in Springfield and Shelbyville are quite similar. The third hypothesis was rejected. If there is any difference in preferences, it cannot be seen in this data.


### Observation:
In real projects, research involving statistical hypothesis testing is more accurate and quantitative. Also, note that you cannot always draw conclusions about an entire city based on data from just one source.


[Voltar ao Índice](#back)