# Y.Music

# Table of Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1 - Data Review](#data_review)
    * [Conclusion](#data_review_conclusions)
* [Stage 2 - Data Processing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusion](#data_preprocessing_conclusions)
* [Stage 3 - Hypotheses Testing](#hypotheses)
    * [3.1 Hypothesis 1: User Activity in Both Cities](#activity)
    * [3.2 Hypothesis 2: Music Preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: Genre Preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

# Introduction <a id='intro'></a>
Every time I conduct an analysis, I need to formulate several hypotheses to test. Sometimes the results lead me to accept these hypotheses, and sometimes I have to reject them. To make the right decisions in a business context, it is crucial to understand whether my assumptions are correct or need to be revisited.

In this project, I will compare the music preferences of listeners in two cities: *Springfield* and *Shelbyville*. I will use real data from Y.Music to test a number of hypotheses and analyze user behavior in these two cities.

## Objective
I will test the following three hypotheses:
1. User activity differs depending on the day of the week and the city.
2. On Monday morning, listeners in Springfield and Shelbyville listen to different music genres. This also applies to Friday night.
3. Listeners in Springfield and Shelbyville have different genre preferences. In Springfield, users tend to prefer pop music, while in Shelbyville, rap music is more popular.

## Stages
User behavior data is stored in the file `1_music_project_en.csv`. Since there is no initial information regarding the quality of the data, I will first inspect and evaluate its quality. If any issues are found, I will address them during the data preprocessing stage.

This project will be carried out in three main stages:
1. Data review
2. Data preprocessing
3. Hypothesis testing
 
[Back to Table of Contents](#back)

# Stage 1 - Data Review <a id='data_review'></a>

I will load the Y.Music data and take a first look at it.

In [1]:
# importing Pandas
import pandas as pd

Read the file `1_music_project_en.csv` and save it in `df`:

In [2]:
df = pd.read_csv('data/ymusic_data.csv')
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,08:14:07,Friday
freq,76,136,136,8850,45360,14,23149


Show the first 10 rows of the table:

In [3]:
(df.head(10))

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


Obtain general information about the table with a single command:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


This table contains seven columns. All columns have the same data type: `object`.

Based on the documentation:
- `userID` — User ID
- `Track` — Song title
- `artist` — Artist name
- `genre`
- `City` — User's city
- `time` — Time the song was played
- `Day` — Day of the week

We can observe three issues with the column names:
1. Some names start with an uppercase letter (`Track`, `Day`), while others are entirely lowercase (`artist`, `genre`).
2. Some names include leading or trailing spaces (`'  userID'`, `'  City  '`).
3. `userID` mixes uppercase and lowercase letters without a word separator; snake_case (`user_id`) is easier to read and type.

We also notice that the non-null counts differ between columns. This means that the data contains missing values.

## Conclusion <a id='data_review_conclusions'></a> 

Each row in the table stores data related to a song track that was played. Some columns store data that describe the track itself: song title, artist, and genre. The remaining columns store user-related information: their city of origin and the time they played the track.

The data appears sufficient to test the hypotheses. However, the `track`, `artist`, and `genre` columns contain missing values.

To continue with the analysis, we need to perform data preprocessing first.

[Back to Table of Contents](#back)

# Stage 2 - Data Processing <a id='data_preprocessing'></a>
Fix the column header format and handle the missing values. Then check whether the data contains duplicates.

## 2.1. Header Style <a id='header_style'></a>

In [5]:
# list the column names in df table
(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Rename the columns so that they follow a consistent style:
* If a column name consists of several words, use snake_case
* Use lowercase letters only
* Remove spaces

In [6]:
# rename the columns to lowercase snake_case without spaces
df = df.rename(
    columns = {
        '  userID' : 'user_id',
        'Track' : 'track',
        'artist' : 'artist',
        'genre' : 'genre',
        '  City  ' : 'city',
        'time' : 'time',
        'Day' : 'day', 
    }
)

In [7]:
(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
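
The explicit mapping above works well for seven columns. For wider tables, the same three rules can be applied programmatically. The following is only a sketch: `to_snake_case` is a hypothetical helper that is not part of the original notebook, and the empty frame only reuses the raw header names shown earlier.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Strip surrounding whitespace and convert camelCase to lowercase snake_case."""
    name = name.strip()
    # insert an underscore between a lowercase letter or digit and the uppercase letter after it
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# empty frame that only carries the raw header names from the data review
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# rename() accepts a function and applies it to every column name
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

Passing a function to `rename()` avoids retyping every name, at the cost of being less explicit than the dictionary used in the cell above.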

[Back to Table of Contents](#back)

## 2.2. Missing Values <a id='missing_values'></a>
First, count the missing values by chaining two pandas methods:

In [8]:
# counting missing values
(df.isna().sum())

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the analysis. Missing values in the `track` and `artist` columns are not critical and can simply be replaced with a clear placeholder. Missing values in the `genre` column, however, could affect the comparison of music preferences in Springfield and Shelbyville. In a real project, it would be worth studying why the data is missing and trying to recover it. That is not possible here, so the plan is:

1. Replace missing values in the `track` and `artist` columns with a clear marker, such as "Unknown" or "Missing".
2. Assess how missing values in the `genre` column might influence the comparison of music preferences between Springfield and Shelbyville, by looking at the extent of the missing data and considering whether it could distort the conclusions or whether imputation (such as filling missing genres with the most common genre per city) would be appropriate.

Replace missing values in `track`, `artist`, and `genre` with the string `'unknown'`. To do so, create a list named `columns_to_replace`, loop over it with `for`, and replace the missing values in each column:

In [9]:
# loop over the listed columns and replace missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

Make sure that no column still contains missing values by counting them again.

In [10]:
# count the missing values
(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
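
To judge how much the placeholder could affect the genre comparison, one option is to measure the share of `'unknown'` genres per city. The sketch below uses a small made-up table (`toy`) rather than the real `df`, so the numbers are purely illustrative.

```python
import pandas as pd

# toy data, made up for illustration; the real table is `df` from the cells above
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'unknown', 'pop'],
})

# share of rows per city whose genre is the 'unknown' placeholder
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

If the share is similar in both cities, the placeholder is less likely to bias the comparison between them, although it can still shift the ranking within each city.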

[Back to Table of Contents](#back)

## 2.3. Duplicates <a id='duplicates'></a>
Count the explicit duplicates in the table with a single command:

In [11]:
(df.duplicated().sum())

3826
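
Before removing rows, it can be useful to look at the duplicated records themselves. A minimal sketch on made-up data (the `toy` table is an assumption, not part of the project data):

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'C3'],
    'track': ['Song X', 'Song X', 'Song Y', 'Song Z', 'Song Z'],
    'time': ['08:00:00', '08:00:00', '09:30:00', '10:15:00', '10:15:01'],
})

# keep=False marks every member of a duplicate group, not only the repeats
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# the default keep='first' counts only the repeated copies
print(toy.duplicated().sum())
```

Here the two `C3` rows differ by one second, so they are not treated as explicit duplicates.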

Call a single pandas method to remove the explicit duplicates:

In [12]:
df = df.drop_duplicates().reset_index(drop=True)

Count the duplicates once more to make sure that they have all been removed:

In [13]:
(df.duplicated().sum())

0

Now, remove implicit duplicates in the `genre` column. For example, the same genre may be spelled in several different ways. Such inconsistencies will distort the analysis.

Display the unique genre names, sorted alphabetically:
1. Select the `genre` column from the DataFrame.
2. Sort the column.
3. Call the method that returns the unique values of the sorted column.

In [14]:
# show the unique genre name
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Look through the list to find implicit duplicates of the `hiphop` genre. These could be misspellings or alternative names for the genre.

The following implicit duplicates appear:
* `hip`
* `hop`
* `hip-hop`

To correct this, use a `replace_wrong_values()` function with two parameters:
* `wrong_values`: a list of the duplicates that need to be replaced
* `correct_value`: a string with the correct value

The function corrects the `genre` column of the `df` table by replacing every value from the `wrong_values` list with `correct_value`.
Once it has been applied, the `df['genre']` column will hold the corrected genre names, and the implicit duplicates will be resolved.

In [15]:
def replace_wrong_values(wrong_values, correct_value):
    # replace each wrong spelling in the 'genre' column of the global df with the correct one
    for wrong_value in wrong_values:
        df['genre'] = df['genre'].replace(wrong_value, correct_value)

Call `replace_wrong_values()` with the appropriate arguments so that it replaces the implicit duplicates (`hip`, `hop`, and `hip-hop`) with `hiphop`:

In [16]:
duplicates = ['hip', 'hop', 'hip-hop']
name = 'hiphop' 
replace_wrong_values(duplicates, name)

df['genre']

0              rock
1              rock
2               pop
3              folk
4             dance
            ...    
61248           rnb
61249        hiphop
61250    industrial
61251          rock
61252       country
Name: genre, Length: 61253, dtype: object

Make sure that the implicit duplicates have been removed. Show the list of unique values in the `genre` column:

In [17]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo
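
Scanning a long array by eye is error-prone, so a direct check can complement it. The sketch below runs on a made-up genre column; with the real data, the same `isin()` test could be applied to `df['genre']`.

```python
import pandas as pd

# toy genre column, made up for illustration
genres = pd.Series(['rock', 'hip', 'hiphop', 'hop', 'pop', 'hip-hop'])

wrong_values = ['hip', 'hop', 'hip-hop']
# replace() matches whole values, so 'pop' and 'hiphop' are left untouched
genres = genres.replace(wrong_values, 'hiphop')

# True if any of the wrong spellings survived the replacement
leftovers = genres.isin(wrong_values).any()
print(leftovers)
print((genres == 'hiphop').sum())
```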

[Back to Table of Contents](#back)

## 2.4. Conclusion <a id='data_preprocessing_conclusions'></a>
We have detected three problems in the data:

- Inconsistent column header style
- Missing values
- Explicit and implicit duplicates

The column names are now consistent, which makes the table easier to work with.
Every missing value has been replaced with `unknown`. We still need to check whether the missing values in the `genre` column affect our calculations.

With the duplicates removed, the results will be more accurate and easier to interpret.

[Back to Table of Contents](#back)

# Stage 3 - Hypotheses Testing <a id='hypotheses'></a>

## Hypothesis 1: User Activity in Both Cities <a id='activity'></a>

Hypothesis: users from Springfield and Shelbyville exhibit different behaviors in music listening. This test uses data collected from three days of the week: Monday, Wednesday, and Friday.

* Group the users based on their city.
* Compare how many tracks were played by each group on Monday, Wednesday, and Friday.

In [18]:
# Counting tracks played in each city
tracks_by_city = df.groupby('city')['track'].count().reset_index()
(tracks_by_city)

Unnamed: 0,city,track
0,Shelbyville,18512
1,Springfield,42741


Users from Springfield played more music tracks than users from Shelbyville. However, this does not necessarily indicate that Springfield residents listen to music more frequently. Springfield is a larger city with more users, so this is to be expected.

Now I will group the data by day and find the total number of tracks played on Monday, Wednesday, and Friday.

In [19]:
# Counting tracks played each day
tracks_by_day = df.groupby('day')['track'].count().reset_index()
(tracks_by_day)

Unnamed: 0,day,track
0,Friday,21840
1,Monday,21354
2,Wednesday,18059


Wednesday is the overall "quietest" day. However, if we consider the two cities separately, we might arrive at a different conclusion.

We have seen how grouping works by city or by day. Now I will write a function that groups the data by both city and day.

Create a function called `number_tracks()` to count the number of music tracks played for a given day and city. This function requires two parameters:
* the name of the day of the week (`day=`)
* the name of the city (`city=`)

In the function, use a variable (`track_list`) to store the rows from the original table where:
  * The value in the `'day'` column matches the `day` parameter
  * The value in the `'city'` column matches the `city` parameter

Apply both filters using logical indexing.

Then, count the values in the `'user_id'` column of the resulting table. Store the result in a new variable (`track_list_count`) and return it from the function.

In [20]:
def number_tracks(day,city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

Call `number_tracks()` six times and change the parameter in each call, so we can obtain data from both cities for each day (Monday, Wednesday, and Friday).

In [21]:
# total tracks played in Springfield on Monday
number_tracks('Monday','Springfield')

15740

In [22]:
# total tracks played in  Shelbyville on Monday
number_tracks('Monday','Shelbyville')

5614

In [23]:
# total tracks played in Springfield on Wednesday
number_tracks('Wednesday','Springfield')

11056

In [24]:
# total tracks played in Shelbyville on Wednesday
number_tracks('Wednesday','Shelbyville')

7003

In [25]:
# total tracks played in Springfield on Friday
number_tracks('Friday','Springfield')

15945

In [26]:
# total tracks played in Shelbyville on Friday
number_tracks('Friday','Shelbyville')

5895

Use `pd.DataFrame` to create a table with
* Column names: `['city', 'monday', 'wednesday', 'friday']`
* The content is the outcome of `number_tracks()`

In [27]:
data = [
    ['Springfield', number_tracks('Monday','Springfield'), number_tracks('Wednesday','Springfield'), number_tracks('Friday','Springfield')],
    ['Shelbyville', number_tracks('Monday','Shelbyville'), number_tracks('Wednesday','Shelbyville'), number_tracks('Friday','Shelbyville')]
]

pd.DataFrame(data, columns = ['city', 'monday', 'wednesday', 'friday'])

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
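
The same city-by-day table can also be built in one step instead of six separate calls. The sketch below uses `pd.crosstab` on a small made-up table (`toy`); with the real data, `df['city']` and `df['day']` would take its place.

```python
import pandas as pd

# toy data, made up for illustration; the real table is `df`
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Friday', 'Monday', 'Wednesday', 'Wednesday'],
})

# one row per city, one column per day, each cell a count of plays
plays = pd.crosstab(toy['city'], toy['day'])
# put the days in calendar order instead of alphabetical order
plays = plays.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(plays)

# share of each city's plays that falls on each day
shares = plays.div(plays.sum(axis=1), axis=0)
print(shares)
```

Looking at shares rather than raw counts makes the two cities comparable even though one has many more users than the other.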


**Conclusion**

The data reveals several differences in user behavior:

- In the city of Springfield, the number of music tracks played peaks on Monday and Friday, while there is a decline in activity on Wednesday.  
- In the city of Shelbyville, on the contrary, users tend to listen to more music on Wednesday. User activity is lower on Monday and Friday.  

Thus, it can be concluded that the first hypothesis appears to be correct.

[Back to Table of Contents](#back)

## Hypothesis 2: Music Preferences on Monday and Friday <a id='week'></a>

Hypothesis: On Monday mornings and Friday evenings, users of Springfield listen to different music genres compared to those enjoyed by users of Shelbyville.

Split `df` into one table per city, using the names that the code cells below rely on:
* For Springfield — `spr_general`  
* For Shelbyville — `shel_general`

In [28]:
# get the spr_general table from the rows of df where the value in the 'city' column is 'Springfield'
spr_general = df[(df['city'] == 'Springfield')]

In [29]:
# get the shel_general table from the rows of df where the value in the 'city' column is 'Shelbyville'
shel_general = df[(df['city'] == 'Shelbyville')]

Create a function called `genre_weekday()` with four parameters:  
* A table containing the data  
* The name of the day  
* A start timestamp in the format 'hh:mm'  
* An end timestamp in the format 'hh:mm'  

The function should return information about the 15 most popular genres on a given day within the time period between the two timestamps.

In [30]:
def genre_weekday(data, day, time1, time2):

    # sequential filtering
    # keep only the rows where 'day' equals the day argument
    genre_df = data[data['day'] == day]

    # keep only the rows where 'time' is earlier than time2
    genre_df = genre_df[genre_df['time'] < time2]

    # keep only the rows where 'time' is later than time1
    genre_df = genre_df[genre_df['time'] > time1]

    # group the filtered DataFrame by genre and count the rows in each group via the 'user_id' column
    genre_df_grouped = genre_df.groupby('genre')['user_id'].count()

    # sort in descending order so that the most popular genres come first in the Series
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)

    # return a Series with the 15 most popular genres for the given day and time window
    return genre_df_sorted[:15]
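
The function compares the `time` column with the `'hh:mm'` arguments as plain strings. This works because zero-padded time strings sort in the same order as the times they represent. The sketch below illustrates the behavior at the boundaries; `in_window` is a hypothetical helper that mirrors the two strict comparisons, not part of the notebook.

```python
def in_window(time_str, start, end):
    """Mirror the strict string comparisons used in genre_weekday()."""
    return start < time_str < end


# a time inside the window
print(in_window('08:14:07', '07:00', '11:00'))  # True

# '07:00' is a prefix of '07:00:00', so it sorts first and the start second is included
print(in_window('07:00:00', '07:00', '11:00'))  # True

# for the same reason, '11:00:00' sorts after '11:00' and is excluded
print(in_window('11:00:00', '07:00', '11:00'))  # False

# a value that is not zero-padded breaks the ordering: '9' sorts after '1'
print(in_window('9:05:00', '07:00', '11:00'))   # False
```

The last case shows why this approach relies on every value in `time` being zero-padded; converting the column to a proper time type would remove that assumption.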

Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 07:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [31]:
# call the function for Monday morning in Springfield (use `spr_general` instead of the `df` table)
genre_weekday(spr_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: user_id, dtype: int64

In [32]:
# call the function for Monday morning in Shelbyville (use `shel_general` instead of the `df` table)
genre_weekday(shel_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: user_id, dtype: int64

In [33]:
# call the function for Friday night in Springfield
genre_weekday(spr_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: user_id, dtype: int64

In [34]:
# call the function for Friday night in Shelbyville
genre_weekday(shel_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: user_id, dtype: int64

**Conclusion**

After comparing the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from both Springfield and Shelbyville listen to the same music genres. The top five genres in both cities are the same, with only rock and electronic swapping places.

2. In Springfield, the number of missing values is quite significant, causing `'unknown'` to rank 10th. This means the missing values make up a substantial portion of the data, which raises questions about the reliability of our conclusion.

For Friday evening, the situation is similar. Individual genres vary slightly, but overall, the top 15 genres largely overlap between the two cities.

Therefore, the second hypothesis is not supported by the data:
* Users listen to the same music at the beginning and end of the week.
* There are no significant differences between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values calls these results into question. In Springfield, rows with an `'unknown'` genre are numerous enough to appear in the top 15. If those genres were known, the outcome might be different.

[Back to Table of Contents](#back)

## Hypothesis 3: Genre Preferences in Springfield and Shelbyville <a id='genre'></a>
Hypothesis: users in Shelbyville like rap music, while users in Springfield like pop music.

Group the `spr_general` table by genre and find the number of tracks played for each genre using the `count()` method. Then, sort the results in descending order and save it to `spr_genres`.

In [35]:
# in a single line: group the `spr_general` table by the `genre` column,
# count the plays in each group (via the `track` column),
# sort the resulting Series in descending order, then save the result to `spr_genres`.

spr_genres = spr_general.groupby(['genre'])['track'].count().sort_values(ascending=False)

Show the first 10 rows of `spr_genres`:

In [36]:
(spr_genres.head(10))

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: track, dtype: int64

Now, we do the same for the data from Shelbyville.

Group the `shel_general` table by genre and find the number of tracks played for each genre. Then, sort the results in descending order and save the result to the `shel_genres` table.

In [37]:
# in a single line: group the `shel_general` table by the `genre` column,
# count the plays in each group (via the `track` column) using `count()`,
# sort the resulting Series in descending order, and save it to `shel_genres`.

shel_genres = shel_general.groupby(['genre'])['track'].count().sort_values(ascending=False)

Show the first 10 rows of `shel_genres`:

In [38]:
(shel_genres.head(10))

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: track, dtype: int64
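
Because Springfield has more than twice as many plays as Shelbyville, raw counts are hard to compare directly. One way to put the two rankings side by side is to convert them to shares. The sketch below uses made-up data (`toy`); with the real table, `df['genre']` and `df['city']` would be passed instead.

```python
import pandas as pd

# toy data, made up for illustration; the real table is `df`
toy = pd.DataFrame({
    'city': ['Springfield'] * 4 + ['Shelbyville'] * 3,
    'genre': ['pop', 'pop', 'rock', 'rap', 'pop', 'pop', 'rap'],
})

# rows: genres, columns: cities, cells: share of that city's plays
shares = pd.crosstab(toy['genre'], toy['city'], normalize='columns')
print(shares)

# most played genre in each city
top_genre = shares.idxmax()
print(top_genre)
```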

**Conclusion**

This hypothesis is partially confirmed:
* Pop music is the most popular genre in Springfield, as we expected.
* However, pop music is also the most popular genre in Shelbyville, and rap music did not make it into the top 5 genres for either city.


[Back to Table of Contents](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity varies depending on the day and the city.  
2. On Monday mornings, residents of Springfield and Shelbyville listen to different music genres. The same applies to Friday evenings.  
3. Listeners in Springfield and Shelbyville have different genre preferences. In Springfield, users prefer pop music, while in Shelbyville, rap music is more popular.

After analyzing the available data, we can conclude that:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the two cities vary in different ways.

The first hypothesis can be fully accepted.

2. Music preferences do not vary significantly throughout the week in Springfield and Shelbyville. We can observe slight differences in the rankings, but:
* In both Springfield and Shelbyville, users listen to pop music the most.

Therefore, this hypothesis cannot be accepted. It's also important to remember that the results might have been different if we had no missing values.

3. It turns out that the music preferences of users in Springfield and Shelbyville are very similar.  

The third hypothesis is rejected. If there are indeed differences in preferences, unfortunately, they cannot be determined from this data.

### Note  
In real-world projects, research involves testing statistical hypotheses, which is more precise and quantitative. Also note that you cannot always draw conclusions about an entire city based on data from a single source.

You will learn about hypothesis testing in the statistical data analysis sprint.
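
As a rough illustration of what a quantitative check could look like, the sketch below computes a chi-square statistic by hand for the city-by-day counts from the summary table in Hypothesis 1. It is not part of the original analysis, and it treats every play as an independent observation, which is a strong assumption because many plays come from the same users.

```python
import pandas as pd

# play counts per city and day, copied from the summary table in Hypothesis 1
observed = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
total = observed.to_numpy().sum()

# counts expected if the day-of-week split were the same in both cities
expected = pd.DataFrame(
    row_totals[:, None] * col_totals[None, :] / total,
    index=observed.index,
    columns=observed.columns,
)

# chi-square statistic and degrees of freedom for a 2 x 3 table
chi2 = ((observed - expected) ** 2 / expected).to_numpy().sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# 5.991 is the chi-square critical value for 2 degrees of freedom at the 5% level
print(chi2 > 5.991, dof)
```

A statistic above the critical value would suggest that the day-of-week distribution differs between the two cities, subject to the independence caveat above.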

[Back to Table of Contents](#back)