# Data Normalization

What does data normalization mean in this context? Is it different from what it means when dealing with databases?

No, it means much the same thing here as it does in database management systems.
In this section, we're thinking less about analysis and more about how we should store and manage our data.

The goal of normalizing our datasets is to reduce duplication when we store them.
This means splitting our data into separate tables, one per observational unit.
Normalization lets us fix an error in the data in only one location, and that fix propagates
when we combine the separate tables for analysis later.
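
As a minimal sketch of that idea, the toy example below uses two small hypothetical tables (`songs` and `ratings`, with made-up rows that are not read from `billboard.csv`). A typo is corrected once in the songs table, and the correction shows up on every matching row after the merge.

```python
import pandas as pd

# Hypothetical toy data: one row per song, one row per weekly rating.
songs = pd.DataFrame({
    "id": [1, 2],
    "artist": ["3 Doors Dwn", "504 Boyz"],  # deliberate typo in the first artist
    "track": ["Kryptonite", "Wobble Wobble"],
})
ratings = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "week": ["wk1", "wk2", "wk3", "wk1", "wk2"],
    "rating": [81, 70, 68, 57, 34],
})

# Fix the typo once, in the songs table only.
songs.loc[songs["id"] == 1, "artist"] = "3 Doors Down"

# Combining the tables for analysis picks up the fix on every matching row.
combined = ratings.merge(songs, on="id", how="left")
print(combined)
```

In a single denormalized table, the same correction would have to be applied to every row that repeats the artist name.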

---

## Multiple Observational Units in a Table (Normalization)

To know if multiple observational units are represented in a table, we can look at each row and see if any cells or values are repeating from
row to row.
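
One way to make that check concrete, sketched on a hypothetical miniature table rather than the real data, is to compare the number of rows with the number of distinct combinations of the columns we suspect are repeating. The column list `candidate_cols` is an assumption chosen for this illustration.

```python
import pandas as pd

# Hypothetical miniature of a long-format table (not the real billboard data).
df = pd.DataFrame({
    "artist": ["A", "A", "A", "B", "B"],
    "track": ["x", "x", "x", "y", "y"],
    "week": ["wk1", "wk2", "wk3", "wk1", "wk2"],
    "rating": [87, 82, 72, 91, 87],
})

# Columns we suspect describe a separate observational unit (an assumption).
candidate_cols = ["artist", "track"]

# Number of rows versus number of distinct (artist, track) combinations.
n_rows = len(df)
n_unique = len(df.drop_duplicates(subset=candidate_cols))

# Rows whose candidate columns repeat an earlier row.
n_repeats = int(df.duplicated(subset=candidate_cols).sum())

print(f"{n_rows} rows, {n_unique} distinct songs, {n_repeats} repeated rows")
```

When the distinct count is much smaller than the row count, those columns are probably describing a second observational unit.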

In [1]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

billboard = pd.read_csv('./billboard.csv')

billboard

Unnamed: 0,year,artist,track,time,date.entered,wk1,wk2,wk3,wk4,wk5,...,wk67,wk68,wk69,wk70,wk71,wk72,wk73,wk74,wk75,wk76
0,2000,2 Pac,Baby Don't Cry (Keep...,4:22,2000-02-26,87,82.0,72.0,77.0,87.0,...,,,,,,,,,,
1,2000,2Ge+her,The Hardest Part Of ...,3:15,2000-09-02,91,87.0,92.0,,,...,,,,,,,,,,
2,2000,3 Doors Down,Kryptonite,3:53,2000-04-08,81,70.0,68.0,67.0,66.0,...,,,,,,,,,,
3,2000,3 Doors Down,Loser,4:24,2000-10-21,76,76.0,72.0,69.0,67.0,...,,,,,,,,,,
4,2000,504 Boyz,Wobble Wobble,3:35,2000-04-15,57,34.0,25.0,17.0,17.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
312,2000,Yankee Grey,Another Nine Minutes,3:10,2000-04-29,86,83.0,77.0,74.0,83.0,...,,,,,,,,,,
313,2000,"Yearwood, Trisha",Real Live Woman,3:55,2000-04-01,85,83.0,83.0,82.0,81.0,...,,,,,,,,,,
314,2000,Ying Yang Twins,Whistle While You Tw...,4:19,2000-03-18,95,94.0,91.0,85.0,84.0,...,,,,,,,,,,
315,2000,Zombie Nation,Kernkraft 400,3:30,2000-09-02,99,99.0,,,,...,,,,,,,,,,


Looking at this data, the first thing to notice is that the week columns (`wk1` through `wk76`) should be melted into a single variable.
So, let's do that first.

In [2]:
billboard_long = billboard.melt(
    id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
    var_name='week',
    value_name='rating',
)

billboard_long

Unnamed: 0,year,artist,track,time,date.entered,week,rating
0,2000,2 Pac,Baby Don't Cry (Keep...,4:22,2000-02-26,wk1,87.0
1,2000,2Ge+her,The Hardest Part Of ...,3:15,2000-09-02,wk1,91.0
2,2000,3 Doors Down,Kryptonite,3:53,2000-04-08,wk1,81.0
3,2000,3 Doors Down,Loser,4:24,2000-10-21,wk1,76.0
4,2000,504 Boyz,Wobble Wobble,3:35,2000-04-15,wk1,57.0
...,...,...,...,...,...,...,...
24087,2000,Yankee Grey,Another Nine Minutes,3:10,2000-04-29,wk76,
24088,2000,"Yearwood, Trisha",Real Live Woman,3:55,2000-04-01,wk76,
24089,2000,Ying Yang Twins,Whistle While You Tw...,4:19,2000-03-18,wk76,
24090,2000,Zombie Nation,Kernkraft 400,3:30,2000-09-02,wk76,


In [3]:
billboard_long['track'].value_counts()

track
Where I Wanna Be           152
Baby Don't Cry (Keep...     76
No Leaf Clover (Live...     76
Case Of The Ex (What...     76
Just Friends                76
                          ... 
Learn To Fly                76
Take A Picture              76
The Rockafeller Skan...     76
I Will Love Again           76
Bent                        76
Name: count, Length: 316, dtype: int64

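Most tracks appear 76 times, once per week column. `Where I Wanna Be` appears 152 times (2 × 76), which suggests that two different songs share that title, so the track name alone does not identify a song.

After tidying up the dataset, it's in a suitable format for analysis.
However, if we wanted to store this data somewhere, we can see that there is a lot of redundant information.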

For example, multiple rows contain the same information about the same track.
Conceptually, this table contains two types of data:
1. track information (`year`, `artist`, `track`, `time`)
2. the track's weekly rating information (`date.entered`, `week`, `rating`)

Again for analysis, this is fine but if we want to store this, we could normalize the dataset by placing the 
track information in its own table and have a reference to each track for the corresponding observations in the 
weekly ratings table. 

So, then let's do that. 
We'll create a separate dataset with the song information and use the unique ID as a reference 
in another dataset containing the weekly rating information.
Also notice we're doing it with the tidy dataset and not the original.

In [4]:
# create dataset containing track information
billboard_songs = billboard_long[['year', 'artist', 'track', 'time']]
billboard_songs

Unnamed: 0,year,artist,track,time
0,2000,2 Pac,Baby Don't Cry (Keep...,4:22
1,2000,2Ge+her,The Hardest Part Of ...,3:15
2,2000,3 Doors Down,Kryptonite,3:53
3,2000,3 Doors Down,Loser,4:24
4,2000,504 Boyz,Wobble Wobble,3:35
...,...,...,...,...
24087,2000,Yankee Grey,Another Nine Minutes,3:10
24088,2000,"Yearwood, Trisha",Real Live Woman,3:55
24089,2000,Ying Yang Twins,Whistle While You Tw...,4:19
24090,2000,Zombie Nation,Kernkraft 400,3:30


Good, but now we need to make sure each song has only one entry in this dataset.

In [5]:
billboard_songs = billboard_songs.drop_duplicates()
billboard_songs

Unnamed: 0,year,artist,track,time
0,2000,2 Pac,Baby Don't Cry (Keep...,4:22
1,2000,2Ge+her,The Hardest Part Of ...,3:15
2,2000,3 Doors Down,Kryptonite,3:53
3,2000,3 Doors Down,Loser,4:24
4,2000,504 Boyz,Wobble Wobble,3:35
...,...,...,...,...
312,2000,Yankee Grey,Another Nine Minutes,3:10
313,2000,"Yearwood, Trisha",Real Live Woman,3:55
314,2000,Ying Yang Twins,Whistle While You Tw...,4:19
315,2000,Zombie Nation,Kernkraft 400,3:30


Next, we'll need to add a column containing a unique identifier that we can use as a reference 
for the song in another dataset containing the weekly rating information.

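To create a unique identifier, we'll just increment the row index by one.
This works here because `drop_duplicates()` keeps the first occurrence of each song along with its original row label, so every label is still distinct.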

In [6]:
billboard_songs['id'] = billboard_songs.index + 1
billboard_songs

Unnamed: 0,year,artist,track,time,id
0,2000,2 Pac,Baby Don't Cry (Keep...,4:22,1
1,2000,2Ge+her,The Hardest Part Of ...,3:15,2
2,2000,3 Doors Down,Kryptonite,3:53,3
3,2000,3 Doors Down,Loser,4:24,4
4,2000,504 Boyz,Wobble Wobble,3:35,5
...,...,...,...,...,...
312,2000,Yankee Grey,Another Nine Minutes,3:10,313
313,2000,"Yearwood, Trisha",Real Live Woman,3:55,314
314,2000,Ying Yang Twins,Whistle While You Tw...,4:19,315
315,2000,Zombie Nation,Kernkraft 400,3:30,316
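
Using `index + 1` depends on the row labels already being distinct and in a sensible order. As a hedged alternative, the sketch below renumbers a hypothetical toy table whose index has gaps (as could happen after filtering or sorting) before assigning ids. The table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy table with a non-contiguous index.
songs = pd.DataFrame(
    {"artist": ["A", "B", "C"], "track": ["x", "y", "z"]},
    index=[0, 5, 9],
)

# index + 1 would give ids 1, 6, 10; renumbering first gives 1, 2, 3.
songs = songs.reset_index(drop=True)
songs["id"] = range(1, len(songs) + 1)
print(songs)
```

Either approach yields unique ids; the renumbered version simply does not depend on what the index looked like beforehand.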


The dataset containing information about songs is now complete. The only thing left to do is to use this
unique id to match each song to its weekly ratings.

We'll do this by merging the tidied dataframe with the new songs dataframe
on their shared attributes, which attaches the matching id to every row.

In [7]:
billboard_ratings = billboard_long.merge(
    billboard_songs,
    on=['year', 'artist', 'track', 'time']
)

billboard_ratings

Unnamed: 0,year,artist,track,time,date.entered,week,rating,id
0,2000,2 Pac,Baby Don't Cry (Keep...,4:22,2000-02-26,wk1,87.0,1
1,2000,2Ge+her,The Hardest Part Of ...,3:15,2000-09-02,wk1,91.0,2
2,2000,3 Doors Down,Kryptonite,3:53,2000-04-08,wk1,81.0,3
3,2000,3 Doors Down,Loser,4:24,2000-10-21,wk1,76.0,4
4,2000,504 Boyz,Wobble Wobble,3:35,2000-04-15,wk1,57.0,5
...,...,...,...,...,...,...,...,...
24087,2000,Yankee Grey,Another Nine Minutes,3:10,2000-04-29,wk76,,313
24088,2000,"Yearwood, Trisha",Real Live Woman,3:55,2000-04-01,wk76,,314
24089,2000,Ying Yang Twins,Whistle While You Tw...,4:19,2000-03-18,wk76,,315
24090,2000,Zombie Nation,Kernkraft 400,3:30,2000-09-02,wk76,,316
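
A merge like this silently multiplies rows if the key columns are not unique in the songs table. As a sketch on hypothetical toy data, pandas' `validate` argument to `merge` can guard against that, and a left join plus a null check confirms that every rating row found its song. The frame names `songs` and `ratings_long` are made up for this example.

```python
import pandas as pd

# Hypothetical toy versions of the songs table and the tidy (long) table.
songs = pd.DataFrame({
    "artist": ["A", "B"],
    "track": ["x", "y"],
    "id": [1, 2],
})
ratings_long = pd.DataFrame({
    "artist": ["A", "A", "B"],
    "track": ["x", "x", "y"],
    "week": ["wk1", "wk2", "wk1"],
    "rating": [87.0, 82.0, 91.0],
})

# validate="many_to_one" raises pandas.errors.MergeError if the key
# columns are not unique in the right-hand (songs) table.
ratings = ratings_long.merge(
    songs,
    on=["artist", "track"],
    how="left",
    validate="many_to_one",
)

# After a left merge, a missing id would mean a song had no match.
assert ratings["id"].notna().all()
assert len(ratings) == len(ratings_long)
print(ratings)
```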


So far, we've created a new dataframe containing song information only,
and then used it to match each song's unique id to its rating information.
Finally, we need to create the separate rating dataframe so that we can store it.

This is how we normalize the data for storage: one table (entity) for songs and another
table (entity) for rating information that references the songs table.

In [8]:
# create a ratings dataframe by filtering out the song attributes 
# and keeping the rating information only
billboard_ratings = billboard_ratings[['id', 'date.entered', 'week', 'rating']]
billboard_ratings

Unnamed: 0,id,date.entered,week,rating
0,1,2000-02-26,wk1,87.0
1,2,2000-09-02,wk1,91.0
2,3,2000-04-08,wk1,81.0
3,4,2000-10-21,wk1,76.0
4,5,2000-04-15,wk1,57.0
...,...,...,...,...
24087,313,2000-04-29,wk76,
24088,314,2000-04-01,wk76,
24089,315,2000-03-18,wk76,
24090,316,2000-09-02,wk76,


Now we can store the `billboard_songs` and `billboard_ratings` dataframes in some kind of
datastore (e.g., a SQL database) without repeating the song information on every rating row.
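
As one possible illustration, the sketch below writes two hypothetical toy tables to an in-memory SQLite database with `DataFrame.to_sql` and reads them back with a SQL join. The table names `songs` and `ratings`, and the choice of SQLite, are assumptions made for this example and are not part of the original notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical toy versions of the two normalized tables.
songs = pd.DataFrame({
    "id": [1, 2],
    "artist": ["A", "B"],
    "track": ["x", "y"],
})
ratings = pd.DataFrame({
    "id": [1, 1, 2],
    "week": ["wk1", "wk2", "wk1"],
    "rating": [87.0, 82.0, 91.0],
})

# An in-memory database keeps the example self-contained.
con = sqlite3.connect(":memory:")
try:
    # Store each observational unit in its own table.
    songs.to_sql("songs", con, index=False)
    ratings.to_sql("ratings", con, index=False)

    # Recombine the tables with a join when they are needed for analysis.
    combined = pd.read_sql_query(
        """
        SELECT s.artist, s.track, r.week, r.rating
        FROM ratings AS r
        JOIN songs AS s ON s.id = r.id
        ORDER BY r.id, r.week
        """,
        con,
    )
finally:
    con.close()

print(combined)
```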

When we analyze normalized data like this, we'll need the skills from the previous section,
merging and concatenating, to put the observational units back together in a tidy manner.
