# Music App

## Introduction
Through the analysis of real music streaming data, this project compares the musical preferences of residents of Springfield and Shelbyville, with the aim of validating a hypothesis about user behavior in both cities.

### Goal:
Validate the following hypothesis:
1. User activity differs depending on the day of the week and the city.


### Stages
User behavior data is stored in the file `./data/music_project_en.csv`. Because nothing is known about the quality of the data, we will examine it before testing the hypothesis.


[Back to Contents](#back)

## Stage 1. Data description

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('./data/music_project_en.csv')

In [4]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table has seven columns, all of type object.

In [6]:
df.isna().sum()

  userID       0
Track       1343
artist      7567
genre       1198
  City         0
time           0
Day            0
dtype: int64

In [7]:
df.duplicated().sum()

3826

### Data review conclusions: <a id='data_review_conclusions'></a>

- Each row represents a song played by a user of a music streaming app. The records include information about when and where the song was played (time, day, and city), as well as information about the song itself (track, artist, and genre).

- The dataset is sufficient to test the hypothesis. Because it covers only two cities, the result holds only for those cities. To generalize the hypothesis, more cities would have to be included; if it then still held, we could state it with a higher degree of certainty.

- A preliminary analysis of the data suggests that preprocessing is necessary to handle null values, duplicate records, and inconsistent data types across columns.
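
The checks used in this stage (`info()`, `isna().sum()`, and `duplicated().sum()`) can be bundled into a single summary. The following is a minimal sketch of one way to do that. The helper name `quality_report` and the toy rows are assumptions made for illustration and are not part of the project data.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
    })


# Toy rows (hypothetical) that mimic the raw headers, including stray spaces.
sample = pd.DataFrame({
    "  userID": ["A1", "B2", "B2", "C3"],
    "Track": ["Song X", "Song Z", "Song Z", None],
    "genre": ["rock", "pop", "pop", None],
})

report = quality_report(sample)
# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(report)
print("fully duplicated rows:", duplicate_rows)
```

Keeping the per-column counts in a DataFrame makes it easy to compare the same report before and after preprocessing.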

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>


### Header style <a id='header_style'></a>


In [8]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [9]:
# The following code is an alternative to the suggested solution. It defines a function called
# 'column_name_formater', which is passed as the 'columns' argument of the 'rename' method.

# def column_name_formater(column_name):
#    return column_name.strip().replace(' ', '_').lower()

# df = df.rename(columns=column_name_formater)

# It is also possible to use the anonymous function version using lambda syntax
#lambda x: x.strip().replace(' ','_').lower() 

# There are two different ways to rename the columns using a for loop.
# The first option builds a dict that maps old names to new names.

# fixed_column_names = {}
# for column_name in df.columns:
#     fixed_column_names[column_name] = column_name.strip().replace(' ','_').lower()

# print(fixed_column_names)
# df = df.rename(columns=fixed_column_names)

# The second alternative is using a list to redeclare the 'columns' attribute of the given DataFrame
fixed_column_names = []
for column_name in df.columns:
    fixed_column_names.append(column_name.strip().replace(' ', '_').lower())

df.columns = fixed_column_names

print(df.columns)

Index(['userid', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')


In [11]:
df = df.rename(columns={'userid':'user_id'})

In [12]:
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
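
As a possible alternative to the loop above, the same clean-up can be sketched with the vectorized string methods available on `df.columns`. The one-row frame below reuses the first record from the preview and only three of the raw headers, so it is an illustration rather than the project's actual cell.

```python
import pandas as pd

# One-row toy frame with three of the raw header names, stray spaces included.
raw = pd.DataFrame(
    [["FFB692EC", "Kamigata To Boots", "Shelbyville"]],
    columns=["  userID", "Track", "  City  "],
)

# Strip outer spaces, lower-case, and replace inner spaces in one chain.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

# Rename the one column that still lacks a word separator.
clean = raw.rename(columns={"userid": "user_id"})

print(list(clean.columns))
```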


### Missing values <a id='missing_values'></a>
 

In [13]:
# Count the number of missing values
df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

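Only `track`, `artist`, and `genre` contain missing values. Let's replace them with the placeholder `'unknown'` so that these rows are kept in the analysis.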

In [32]:
column_names = ['track', 'artist', 'genre']
for column_name in column_names:
    df[column_name] = df[column_name].fillna('unknown')

In [15]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates <a id='duplicates'></a>

In [16]:
df.duplicated().sum()

3826

In [17]:
df = df.drop_duplicates()

In [18]:
df.duplicated().sum()


0
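
Before dropping explicit duplicates it can be useful to look at them. The sketch below, on hypothetical toy rows, uses `duplicated(keep=False)` to show every member of a duplicated group and then resets the index after dropping. The index reset is an optional extra step that the cells above do not perform.

```python
import pandas as pd

# Toy rows (hypothetical); rows 1 and 2 are exact repeats.
sample = pd.DataFrame({
    "user_id": ["A1", "B2", "B2", "C3"],
    "track": ["Song X", "Song Z", "Song Z", "Song Y"],
    "day": ["Monday", "Friday", "Friday", "Monday"],
})

# keep=False marks every member of a duplicated group, not only the repeats.
duplicate_groups = sample[sample.duplicated(keep=False)]
print(duplicate_groups)

# Drop the repeats, then rebuild a gap-free 0..n-1 index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), "->", len(deduplicated))
```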

Because the `genre` column holds a wide range of free-text values, it is likely to contain implicit duplicates, that is, different spellings of the same genre. Let's look for them.

In [19]:
# The following code shows two different ways to search for similar genres.

# First approach: for every genre, collect the other genres that contain it as a substring.
genres = df['genre'].unique()
similar_values = {}
for term in genres:
    for genre in genres:
        if term in genre and term != genre:
            similar_values.setdefault(term, []).append(genre)
print(similar_values)

# Second approach: print each genre followed by every genre that contains it.
# regex=False makes str.contains treat the genre as literal text rather than a pattern.
for genre in df['genre'].sort_values().unique():
    print(genre)
    print(df.loc[df['genre'].str.contains(genre, regex=False), 'genre'].unique())
                

{'rock': ['postrock', 'rusrock', 'folkrock', 'stonerrock', 'deutschrock', 'rockabilly'], 'pop': ['ruspop', 'jpop', 'k-pop', 'electropop', 'dancepop', 'cantopop', 'popeurodance', 'popelectronic', 'synthpop', 'indipop', 'mandopop'], 'folk': ['folkmetal', 'folkrock', 'eurofolk', 'folklore', 'folktronica'], 'dance': ['dancehall', 'dancepop', 'popeurodance'], 'world': ['worldbeat'], 'electronic': ['loungeelectronic', 'popelectronic'], 'hip': ['hiphop', 'hip-hop'], 'jazz': ['conjazz', 'nujazz', 'tradjazz'], 'latin': ['latino'], 'metal': ['progmetal', 'folkmetal', 'numetal', 'classicmetal', 'extrememetal', 'metalcore', 'epicmetal'], 'reggae': ['reggaeton'], 'türk': ['türkçe'], 'post': ['postrock', 'posthardcore'], 'techno': ['hardtechno'], 'rap': ['rusrap'], 'new': ['newage', 'newwave'], 'soul': ['soulful'], 'hardcore': ['posthardcore'], 'tango': ['argentinetango'], 'nu': ['nujazz', 'numetal'], 'dub': ['dubstep'], 'tech': ['techno', 'hardtechno'], 'top': ['cantopop'], 'sound': ['soundtrack'],

In [20]:
wrong_genres = ['hip', 'hip-hop', 'hip hop']
correct_genre = 'hiphop'

# The replace_wrong_genres function replaces every genre in a given list of wrong genres with the
# correct genre.
# The function overwrites the global DataFrame `df` directly, but an alternative is to work on
# a copy of the DataFrame created with the .copy() method and return it.

def replace_wrong_genres(wrong_genres: list, correct_genre: str, column_name: str = 'genre') -> None:
    for genre in wrong_genres:
        df[column_name] = df[column_name].replace(genre, correct_genre)

In [21]:
replace_wrong_genres(wrong_genres, correct_genre)
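
The comment above mentions returning a copy instead of overwriting `df`. A minimal sketch of that alternative could look as follows. The function name `with_replaced_genres` is hypothetical, and the toy Series stands in for the real `genre` column.

```python
import pandas as pd


def with_replaced_genres(frame, wrong_genres, correct_genre, column_name="genre"):
    """Return a copy of `frame` in which every wrong genre is replaced."""
    result = frame.copy()
    # Series.replace accepts a list of values to replace with a single value.
    result[column_name] = result[column_name].replace(wrong_genres, correct_genre)
    return result


# Toy data (hypothetical) containing the three misspellings and two clean values.
sample = pd.DataFrame({"genre": ["hip", "hip-hop", "hiphop", "rock", "hip hop"]})
fixed = with_replaced_genres(sample, ["hip", "hip-hop", "hip hop"], "hiphop")

print(fixed["genre"].tolist())
print(sample["genre"].tolist())  # the input frame is unchanged
```

Because the function does not depend on a global variable, it can be reused on any DataFrame and is easier to test.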

In [22]:
# Check for implicit duplicates
print(df['genre'].sort_values().unique())


['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisch' 

### Data Preprocessing conclusions <a id='data_preprocessing_conclusions'></a>

When analyzing tabular data with DataFrames, it is necessary to handle duplicated values. In this case the data contained both explicit and implicit duplicates. We removed the explicit duplicates with the `drop_duplicates()` method. Removing implicit duplicates requires a closer look at the values present in a column; for that purpose pandas provides the `unique()` method, which returns an array of the distinct values in a column.
In addition, there are minor genres that could be merged into major groups. The genre-search code in the previous cells goes deeper into this problem: for each genre it lists the other genres whose names contain it, which is a starting point for building such minor-major groups.
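
Merging minor genres into major groups could be sketched with a mapping dict. The mapping below is purely hypothetical: the genre names appear in the dataset, but the project does not define which minor genre belongs to which major group.

```python
import pandas as pd

# Hypothetical minor-to-major mapping, chosen only for illustration.
major_genre = {
    "rusrock": "rock",
    "postrock": "rock",
    "ruspop": "pop",
    "electropop": "pop",
    "rusrap": "rap",
}

genres = pd.Series(
    ["rock", "rusrock", "ruspop", "dance", "rusrap", "postrock"], name="genre"
)

# map() yields NaN for genres missing from the dict, so fall back to the original value.
grouped = genres.map(major_genre).fillna(genres)

print(grouped.value_counts())
```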


[Back to Contents](#back)

## Stage 3. Hypothesis <a id='hypothesis'></a>

The hypothesis states that there are differences in how users from Springfield and Shelbyville consume music. To test this, we will use data from three days of the week: Monday, Wednesday, and Friday.

In [23]:
df.groupby('city')['track'].count()

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64


Springfield users played more than twice as many tracks as Shelbyville users (42741 versus 18512).

In [24]:
df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64


Users played the most tracks on Friday (21840), closely followed by Monday (21354); Wednesday had the fewest (18059).

In [25]:
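def number_tracks(day, city):
    # Keep only the rows for the requested day, then for the requested city.
    tracks_by_day = df[df['day'] == day]
    tracks_filtered = tracks_by_day[tracks_by_day['city'] == city]
    # Count the plays; user_id has no missing values, so every row is counted.
    track_count = tracks_filtered['user_id'].count()
    return track_count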



In [26]:
number_tracks('Monday', 'Shelbyville')

5614

In [27]:
number_tracks('Monday', 'Springfield')

15740

In [28]:
number_tracks('Wednesday', 'Shelbyville')

7003

In [29]:
number_tracks('Wednesday', 'Springfield')

11056

In [30]:
number_tracks('Friday', 'Shelbyville')

5895

In [31]:
number_tracks('Friday', 'Springfield')

15945
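
The six separate calls can also be expressed as one cross-tabulation of city against day. The sketch below runs on hypothetical toy rows with the cleaned column names. It is an alternative way to obtain the same kind of counts, not the cell used in this notebook.

```python
import pandas as pd

# Toy plays (hypothetical rows) using the cleaned column names.
plays = pd.DataFrame({
    "user_id": ["A1", "B2", "C3", "D4", "E5", "F6"],
    "city": ["Springfield", "Springfield", "Shelbyville",
             "Springfield", "Shelbyville", "Springfield"],
    "day": ["Monday", "Monday", "Monday", "Wednesday", "Friday", "Friday"],
})

# Put the weekdays in calendar order instead of alphabetical order.
day_order = ["Monday", "Wednesday", "Friday"]

# Count rows for every (city, day) pair; missing pairs are filled with 0.
counts = pd.crosstab(plays["city"], plays["day"]).reindex(
    columns=day_order, fill_value=0
)

print(counts)
```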

**Conclusions**
1. User activity varies by day of the week and city

Based on the results, the hypothesis is supported: the number of songs played differs between the two cities on every day examined. The cities also do not share a common weekly trend: Springfield is least active on Wednesday, whereas Shelbyville is most active on that day.
Springfield users are more active in the app. Overall they played more tracks than Shelbyville users, and they also played more tracks on each of the three days. The following table summarizes the total number of tracks played on the given days:
    <table>
    <thead>
        <tr>
            <th>City</th>
            <th>Monday</th>
            <th>Wednesday</th>
            <th>Friday</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Springfield</td>
            <td>15740</td>
            <td>11056</td>
            <td>15945</td>
        </tr>
        <tr>
            <td>Shelbyville</td>
            <td>5614</td>
            <td>7003</td>
            <td>5895</td>
        </tr>
    </tbody>
    </table>
    
In conclusion, the two cities show no common day-of-week pattern, but Springfield is clearly the more active city.
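
Because Springfield has far more plays overall, one way to compare the weekly rhythm of the two cities independently of their size is to convert the counts into each city's share of plays per day. The sketch below uses the counts reported in the table; the variable names are illustrative.

```python
import pandas as pd

# Play counts copied from the table above.
counts = pd.DataFrame(
    {"Monday": [15740, 5614], "Wednesday": [11056, 7003], "Friday": [15945, 5895]},
    index=["Springfield", "Shelbyville"],
)

# Divide each row by its own total to get the share of plays per day.
shares = counts.div(counts.sum(axis=1), axis=0)

# Day with the largest share for each city.
busiest_day = shares.idxmax(axis=1)

print(shares.round(3))
print(busiest_day)
```

With these counts, the largest share falls on Friday for Springfield and on Wednesday for Shelbyville.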


The given hypothesis was validated through data analysis. Data preprocessing is a crucial part of the analysis and can seriously affect the final results. As part of the study, we found that Friday was the day with the most played songs; this information can be used to scale server capacity for the higher demand on Fridays.