# Eric Johnson, Video Game Music Composers

## Overview of this File

This Jupyter Notebook contains all the code that combines my five intermediate datasets into a single final dataset. The file is divided into sections that show which datasets are combined with which, and in what order. At the end of the file there is a link that downloads the final dataset, which is also included as a separate, already-downloaded file in my project folder.

### Reading in Intermediate Datasets

First, the five intermediate datasets need to be read in. Because some of the data contains unusual punctuation, several of the files use a separator character other than the typical comma, space, or tab. The choice of these separators is discussed in the individual documentation for each dataset.

In [82]:
import pandas as pd

In [83]:
kaggle_file = '../Datasets/Kaggle/video-game-sales-with-ratings/video_game_sales_cleaned.csv'
kaggle = pd.read_csv(kaggle_file)

In [84]:
wiki_file = '../Datasets/Wikipedia/List of Video Game Musicians/all_composer_data_cleaned_adjusted.txt'
wiki = pd.read_csv(wiki_file, sep=';')

In [85]:
vgmdb_artist_file = '../Datasets/vgmdb/highest_rated_artists_cleaned_adjusted.txt'
vgmdb_artist = pd.read_csv(vgmdb_artist_file, sep='<')

In [86]:
vgmdb_least_pop_album_file = '../Datasets/vgmdb/least_popular_albums_cleaned_adjusted.txt'
vgmdb_least_pop_album = pd.read_csv(vgmdb_least_pop_album_file)

In [87]:
mp3_downloads_file = '../Datasets/Video Game Music mp3 Downloads/top_1000_downloaded_soundtracks_cleaned_adjusted.txt'
mp3_downloads = pd.read_csv(mp3_downloads_file, sep=';')
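
As a small illustration of why a non-comma separator helps, the sketch below reads a semicolon-separated sample in which one title contains a comma. The sample text and its composer names are made up for this example and do not come from the project files.

```python
import io
import pandas as pd

# Hypothetical sample data: the first title contains a comma,
# so ';' is used as the field separator instead.
raw = (
    "Soundtrack;Composer\n"
    "Red, Blue and Yellow;Example Composer\n"
    "Tetris;Another Composer\n"
)

# io.StringIO lets read_csv treat the string like a file on disk.
sample = pd.read_csv(io.StringIO(raw), sep=';')

print(sample.shape)                  # two rows, two columns
print(sample.loc[0, 'Soundtrack'])   # the comma stays inside the title
```

With `sep=';'`, the comma in the first title is treated as ordinary text rather than as a column boundary.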

### Kaggle Dataset Preparation

Since all the other datasets will eventually be combined with the Kaggle dataset, let's take a look at that one first and prepare it for future combination. 

In [88]:
kaggle.head(20)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,Column 17,Column 18,Column 19
0,Wii Sports,Wii,2006,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E,,,
1,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,,,,
2,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E,,,
3,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E,,,
4,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,,,,
5,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26,,,,,,,,,
6,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.28,9.14,6.5,2.88,29.8,89.0,65.0,8.5,431.0,Nintendo,E,,,
7,Wii Play,Wii,2006,Misc,Nintendo,13.96,9.18,2.93,2.84,28.92,58.0,41.0,6.6,129.0,Nintendo,E,,,
8,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.44,6.94,4.7,2.24,28.32,87.0,80.0,8.4,594.0,Nintendo,E,,,
9,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31,,,,,,,,,


In [89]:
len(kaggle.index)

16719

Since we won't need all the columns, let's extract just the ones we will need:

In [90]:
kaggle = kaggle.loc[:, ['Name', 'Platform', 'Year_of_Release', 'Genre', 'Publisher', 'Critic_Score', 'User_Score', 
                        'Developer', 'Rating']]
kaggle.head(20)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,Critic_Score,User_Score,Developer,Rating
0,Wii Sports,Wii,2006,Sports,Nintendo,76.0,8.0,Nintendo,E
1,Super Mario Bros.,NES,1985,Platform,Nintendo,,,,
2,Mario Kart Wii,Wii,2008,Racing,Nintendo,82.0,8.3,Nintendo,E
3,Wii Sports Resort,Wii,2009,Sports,Nintendo,80.0,8.0,Nintendo,E
4,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,,,,
5,Tetris,GB,1989,Puzzle,Nintendo,,,,
6,New Super Mario Bros.,DS,2006,Platform,Nintendo,89.0,8.5,Nintendo,E
7,Wii Play,Wii,2006,Misc,Nintendo,58.0,6.6,Nintendo,E
8,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,87.0,8.4,Nintendo,E
9,Duck Hunt,NES,1984,Shooter,Nintendo,,,,


Everything looks good with the Kaggle dataset for now, so let's move on to the next ones.

### Combining Wikipedia and VGMDB Artist Datasets

The first datasets that we will actually combine are the Wikipedia data and the vgmdb.net artist dataset, both of which provide data on composers. The merge adds the vgmdb data on composer ratings and rankings to the Wikipedia data on composers, their birthdays, and the soundtracks they have worked on. These datasets can be combined on the shared musician/artist column.

To start, let's take a look at the Wikipedia data:

In [91]:
wiki.head(20)

Unnamed: 0,Soundtrack,Composer,Composer Birthday
0,1942,Akari Kaida,1974-01-10
1,Ace Attorney,Akari Kaida,1974-01-10
2,Bionic Commando,Akari Kaida,1974-01-10
3,Breath Of Fire,Akari Kaida,1974-01-10
4,Buster Bros.,Akari Kaida,1974-01-10
5,Commando,Akari Kaida,1974-01-10
6,Darkstalkers,Akari Kaida,1974-01-10
7,Dark Void,Akari Kaida,1974-01-10
8,Dead Rising,Akari Kaida,1974-01-10
9,Devil May Cry,Akari Kaida,1974-01-10


In [92]:
len(wiki.index)

7987

That looks okay, so we'll check the Highest Rated Artist dataset from vgmdb.net:

In [93]:
vgmdb_artist.head(20)

Unnamed: 0,Rank,Artist Name,Rating,#votes,Column 4
0,1,Hiromi Uehara,4.98,22.0,
1,2,Kyle Scott,4.98,21.0,
2,3,Kou Nakamura,4.94,43.0,
3,4,Janne Sala,4.94,24.0,
4,5,Alessandro Salerno,4.9,21.0,
5,6,Asuka Oda,4.9,68.0,
6,7,Junpei Ohno,4.9,25.0,
7,8,Meiko Nakahara,4.89,31.0,
8,9,Monami Okawa,4.88,65.0,
9,10,Miho Matsumoto,4.88,21.0,


In [94]:
len(vgmdb_artist.index)

10414

We don't need all of those columns, so we can extract just the ones we want:

In [95]:
vgmdb_artist = vgmdb_artist.loc[:, ['Rank', 'Artist Name', 'Rating']]
vgmdb_artist.head(20)

Unnamed: 0,Rank,Artist Name,Rating
0,1,Hiromi Uehara,4.98
1,2,Kyle Scott,4.98
2,3,Kou Nakamura,4.94
3,4,Janne Sala,4.94
4,5,Alessandro Salerno,4.9
5,6,Asuka Oda,4.9
6,7,Junpei Ohno,4.9
7,8,Meiko Nakahara,4.89
8,9,Monami Okawa,4.88
9,10,Miho Matsumoto,4.88


There is some extra whitespace at the beginning of some of the column names, which we can strip from the headers using a basic lambda function.$^{1}$

In [96]:
wiki = wiki.rename(columns=lambda x: x.strip())
wiki.columns

Index(['Soundtrack', 'Composer', 'Composer Birthday'], dtype='object')

In [97]:
vgmdb_artist = vgmdb_artist.rename(columns=lambda x: x.strip())
vgmdb_artist.columns

Index(['Rank', 'Artist Name', 'Rating'], dtype='object')
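
The cells above strip whitespace from the column names only. If the values themselves also carried stray spaces, they could be cleaned in a similar way. The sketch below uses a small made-up frame (`demo` is a hypothetical name, and the padded spaces are added for illustration) and is not a claim about what the real files contain.

```python
import pandas as pd

# Toy frame with padded header names and padded string values.
demo = pd.DataFrame({
    ' Composer': [' Akari Kaida', 'Jun Senoue '],
    ' Rating': [4.04, 4.12],
})

# Strip the header names, as in the cells above.
demo = demo.rename(columns=lambda x: x.strip())

# Strip the string values in one column with the .str accessor.
demo['Composer'] = demo['Composer'].str.strip()

print(list(demo.columns))
print(demo['Composer'].tolist())
```

Cleaning the values matters for merging, because `' Akari Kaida'` and `'Akari Kaida'` would not be treated as the same key.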

Next, in order to combine these two datasets and not have repeated data across two columns, we will rename "Artist Name" in the vgmdb dataset to "Composer" to match the name in the Wikipedia dataset, as these are the two columns that will be merged. 

In [98]:
vgmdb_artist = vgmdb_artist.rename(columns={'Artist Name': 'Composer'})
vgmdb_artist.head(20)

Unnamed: 0,Rank,Composer,Rating
0,1,Hiromi Uehara,4.98
1,2,Kyle Scott,4.98
2,3,Kou Nakamura,4.94
3,4,Janne Sala,4.94
4,5,Alessandro Salerno,4.9
5,6,Asuka Oda,4.9
6,7,Junpei Ohno,4.9
7,8,Meiko Nakahara,4.89
8,9,Monami Okawa,4.88
9,10,Miho Matsumoto,4.88


These two datasets can now be combined using the pandas merge function. We use an outer join so that the new dataset contains as many different composer names as possible, even if that means some rows will be missing some attributes.

In [99]:
vgmdb_artist_merge_wiki = pd.merge(wiki, vgmdb_artist, how='outer', on='Composer')
vgmdb_artist_merge_wiki.head(20)

Unnamed: 0,Soundtrack,Composer,Composer Birthday,Rank,Rating
0,1942,Akari Kaida,1974-01-10,7252.0,4.04
1,Ace Attorney,Akari Kaida,1974-01-10,7252.0,4.04
2,Bionic Commando,Akari Kaida,1974-01-10,7252.0,4.04
3,Breath Of Fire,Akari Kaida,1974-01-10,7252.0,4.04
4,Buster Bros.,Akari Kaida,1974-01-10,7252.0,4.04
5,Commando,Akari Kaida,1974-01-10,7252.0,4.04
6,Darkstalkers,Akari Kaida,1974-01-10,7252.0,4.04
7,Dark Void,Akari Kaida,1974-01-10,7252.0,4.04
8,Dead Rising,Akari Kaida,1974-01-10,7252.0,4.04
9,Devil May Cry,Akari Kaida,1974-01-10,7252.0,4.04


In [100]:
len(vgmdb_artist_merge_wiki.index)

18276
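
To see which rows of an outer join came from which side, `pd.merge` accepts an `indicator=True` argument that adds a `_merge` column. The sketch below uses a tiny toy subset (the frame names `left` and `right` are hypothetical, and the rows are deliberately incomplete) rather than the full datasets.

```python
import pandas as pd

# Toy subset: one composer appears on both sides, one on each side only.
left = pd.DataFrame({
    'Soundtrack': ['1942', 'Tetris'],
    'Composer': ['Akari Kaida', 'Hirokazu Tanaka'],
})
right = pd.DataFrame({
    'Composer': ['Akari Kaida', 'Hiromi Uehara'],
    'Rating': [4.04, 4.98],
})

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only', or 'right_only' for every row.
merged = pd.merge(left, right, how='outer', on='Composer', indicator=True)

n_both = (merged['_merge'] == 'both').sum()
n_left_only = (merged['_merge'] == 'left_only').sum()
n_right_only = (merged['_merge'] == 'right_only').sum()
print(n_both, n_left_only, n_right_only)
```

Rows marked `left_only` have no rating data and rows marked `right_only` have no soundtrack or birthday, which is where the missing attributes in an outer join come from.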

### Combining Wikipedia-VGMDB Artist Dataset with Kaggle

Now that we have this dataset, we can combine it with the Kaggle dataset. These two datasets will be combined on the Soundtrack/Game columns, matching up the individual soundtracks each artist has worked on with the Games of the same name listed in the Kaggle data. 

The first step in this process is to rename some of the Kaggle columns so they don't conflict with other column names in other datasets:

In [101]:
kaggle = kaggle.rename(columns={'Name': 'Game', 'Rating': 'ESRB Rating'})
kaggle.head(20)

Unnamed: 0,Game,Platform,Year_of_Release,Genre,Publisher,Critic_Score,User_Score,Developer,ESRB Rating
0,Wii Sports,Wii,2006,Sports,Nintendo,76.0,8.0,Nintendo,E
1,Super Mario Bros.,NES,1985,Platform,Nintendo,,,,
2,Mario Kart Wii,Wii,2008,Racing,Nintendo,82.0,8.3,Nintendo,E
3,Wii Sports Resort,Wii,2009,Sports,Nintendo,80.0,8.0,Nintendo,E
4,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,,,,
5,Tetris,GB,1989,Puzzle,Nintendo,,,,
6,New Super Mario Bros.,DS,2006,Platform,Nintendo,89.0,8.5,Nintendo,E
7,Wii Play,Wii,2006,Misc,Nintendo,58.0,6.6,Nintendo,E
8,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,87.0,8.4,Nintendo,E
9,Duck Hunt,NES,1984,Shooter,Nintendo,,,,


Additionally, we will need to rename some of the columns from the Wikipedia-VGMDB Artist dataset in order to allow for merging the datasets, and again to avoid confusion or conflict between similar column names. 

In [102]:
vgmdb_artist_merge_wiki = vgmdb_artist_merge_wiki.rename(
    columns={'Soundtrack': 'Game', 'Rank': 'Composer Rank', 
             'Rating': 'Composer Rating'})
vgmdb_artist_merge_wiki.head(20)

Unnamed: 0,Game,Composer,Composer Birthday,Composer Rank,Composer Rating
0,1942,Akari Kaida,1974-01-10,7252.0,4.04
1,Ace Attorney,Akari Kaida,1974-01-10,7252.0,4.04
2,Bionic Commando,Akari Kaida,1974-01-10,7252.0,4.04
3,Breath Of Fire,Akari Kaida,1974-01-10,7252.0,4.04
4,Buster Bros.,Akari Kaida,1974-01-10,7252.0,4.04
5,Commando,Akari Kaida,1974-01-10,7252.0,4.04
6,Darkstalkers,Akari Kaida,1974-01-10,7252.0,4.04
7,Dark Void,Akari Kaida,1974-01-10,7252.0,4.04
8,Dead Rising,Akari Kaida,1974-01-10,7252.0,4.04
9,Devil May Cry,Akari Kaida,1974-01-10,7252.0,4.04
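
To illustrate why these renames are useful, the sketch below merges two toy frames that both have a `Rating` column. Without renaming, pandas keeps both columns and disambiguates them with the default `_x` and `_y` suffixes, which are hard to read later. The frame names `games` and `composers` are hypothetical.

```python
import pandas as pd

games = pd.DataFrame({'Game': ['Wii Sports'], 'Rating': ['E']})
composers = pd.DataFrame({
    'Game': ['Wii Sports'],
    'Composer': ['Kazumi Totaka'],
    'Rating': [4.07],
})

# Merging as-is: the two 'Rating' columns become 'Rating_x' and 'Rating_y'.
clash = pd.merge(games, composers, on='Game')
print(list(clash.columns))

# Renaming first gives each column a descriptive name.
renamed = pd.merge(
    games.rename(columns={'Rating': 'ESRB Rating'}),
    composers.rename(columns={'Rating': 'Composer Rating'}),
    on='Game',
)
print(list(renamed.columns))
```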


The Wikipedia-VGMDB Artist dataset can now be combined with the Kaggle dataset. For this join, we will use the default inner join, which keeps only the rows whose game name appears in both datasets. Composer rows that do not match a game in the Kaggle dataset are dropped because they would not provide enough information to do any kind of analysis, and Kaggle games with no matching composer data are likewise left out of the merged result. The `kaggle` DataFrame itself is not modified by the merge.

These two datasets are being combined on the "Game" column, which matches up the Kaggle game names with the previously named "Soundtrack" column from the Wikipedia-VGMDB Artist data. 

In [103]:
vgmdb_wiki_kaggle = pd.merge(kaggle, vgmdb_artist_merge_wiki, on='Game')
vgmdb_wiki_kaggle

Unnamed: 0,Game,Platform,Year_of_Release,Genre,Publisher,Critic_Score,User_Score,Developer,ESRB Rating,Composer,Composer Birthday,Composer Rank,Composer Rating
0,Wii Sports,Wii,2006,Sports,Nintendo,76.0,8,Nintendo,E,Kazumi Totaka,1967-08-23,6777.0,4.07
1,Wii Sports,Wii,2006,Sports,Nintendo,76.0,8,Nintendo,E,Kazumi Totaka,1967-08-23,6777.0,4.07
2,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,,,,,Junichi Masuda,1968-01-12,5712.0,4.14
3,Tetris,GB,1989,Puzzle,Nintendo,,,,,Hirokazu Tanaka,1957-12-13,6489.0,4.09
4,Tetris,GB,1989,Puzzle,Nintendo,,,,,Jun Senoue,1970-08-02,6091.0,4.12
5,Tetris,GB,1989,Puzzle,Nintendo,,,,,Kazumi Totaka,1967-08-23,6777.0,4.07
6,Tetris,GB,1989,Puzzle,Nintendo,,,,,Tomoya Ohtani,1974-07-01,5079.0,4.18
7,Tetris,NES,1988,Puzzle,Nintendo,,,,,Hirokazu Tanaka,1957-12-13,6489.0,4.09
8,Tetris,NES,1988,Puzzle,Nintendo,,,,,Jun Senoue,1970-08-02,6091.0,4.12
9,Tetris,NES,1988,Puzzle,Nintendo,,,,,Kazumi Totaka,1967-08-23,6777.0,4.07


In [104]:
len(vgmdb_wiki_kaggle.index)

23264

Some of the rows seem to be exact duplicates of each other, so we can drop any duplicate records. 

In [105]:
vgmdb_wiki_kaggle = vgmdb_wiki_kaggle.drop_duplicates()
vgmdb_wiki_kaggle.head(20)

Unnamed: 0,Game,Platform,Year_of_Release,Genre,Publisher,Critic_Score,User_Score,Developer,ESRB Rating,Composer,Composer Birthday,Composer Rank,Composer Rating
0,Wii Sports,Wii,2006,Sports,Nintendo,76.0,8.0,Nintendo,E,Kazumi Totaka,1967-08-23,6777.0,4.07
2,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,,,,,Junichi Masuda,1968-01-12,5712.0,4.14
3,Tetris,GB,1989,Puzzle,Nintendo,,,,,Hirokazu Tanaka,1957-12-13,6489.0,4.09
4,Tetris,GB,1989,Puzzle,Nintendo,,,,,Jun Senoue,1970-08-02,6091.0,4.12
5,Tetris,GB,1989,Puzzle,Nintendo,,,,,Kazumi Totaka,1967-08-23,6777.0,4.07
6,Tetris,GB,1989,Puzzle,Nintendo,,,,,Tomoya Ohtani,1974-07-01,5079.0,4.18
7,Tetris,NES,1988,Puzzle,Nintendo,,,,,Hirokazu Tanaka,1957-12-13,6489.0,4.09
8,Tetris,NES,1988,Puzzle,Nintendo,,,,,Jun Senoue,1970-08-02,6091.0,4.12
9,Tetris,NES,1988,Puzzle,Nintendo,,,,,Kazumi Totaka,1967-08-23,6777.0,4.07
10,Tetris,NES,1988,Puzzle,Nintendo,,,,,Tomoya Ohtani,1974-07-01,5079.0,4.18


In [106]:
len(vgmdb_wiki_kaggle.index)

12783
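
One plausible reason the row count grows during the merge and then shrinks after `drop_duplicates` is that the "Game" key is not unique on either side: an inner merge then produces every pairing of matching rows. The toy frames below (`games` and `credits` are hypothetical names) sketch this effect and are not an analysis of the real data.

```python
import pandas as pd

# Two platform entries for the same game name.
games = pd.DataFrame({'Game': ['Tetris', 'Tetris'], 'Platform': ['GB', 'NES']})

# Three composer credits for that name, one of them repeated.
credits = pd.DataFrame({
    'Game': ['Tetris', 'Tetris', 'Tetris'],
    'Composer': ['Hirokazu Tanaka', 'Jun Senoue', 'Jun Senoue'],
})

# Every game row pairs with every matching credit row: 2 x 3 rows.
paired = pd.merge(games, credits, on='Game')

# Rows that are identical in every column collapse into one.
deduped = paired.drop_duplicates()

print(len(paired), len(deduped))
```

Note that `drop_duplicates` keeps the original index labels of the surviving rows, which is why the index in the output above skips from 0 to 2.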

### VGMDB Albums and mp3 Downloads Datasets

The next stage will be combining the vgmdb.net album data with the mp3 Downloads data. These datasets feature information on the ranking and popularity of soundtracks, and can be combined on their album name columns.

First we will look at the Least Popular Albums data from vgmdb.net. This data was essentially identical to the Most Popular Albums data, so it did not matter which dataset was chosen. 

In [107]:
vgmdb_least_pop_album.head(20)

Unnamed: 0,Rank,Album Name,Rating,#votes,Popularity
0,1,Diabolik Lovers,3.8,15.0,0.0006
1,2,Eiga Precure Dream Stars!,4.1,15.0,0.0006
2,3,Diabolik Lovers,4.04,14.0,0.0006
3,4,Haikyu,4.04,14.0,0.0006
4,5,Eiga Kirakira precure a La Mode Paris-tto! Omo...,4.5,13.0,0.0006
5,6,Nightmare,4.31,13.0,0.0006
6,14,Dive & Drama,5.0,11.0,0.0006
7,16,Square Enix,4.58,13.0,0.0029
8,18,Kirakira precure a La Mode 1: Precure sound de...,4.33,15.0,0.004
9,19,Ao Haru Ride,4.29,14.0,0.0043


In [108]:
len(vgmdb_least_pop_album.index)

1807

First we'll strip the white spaces from the header names like before:

In [109]:
vgmdb_least_pop_album = vgmdb_least_pop_album.rename(columns=lambda x: x.strip())
vgmdb_least_pop_album.columns

Index(['Rank', 'Album Name', 'Rating', '#votes', 'Popularity'], dtype='object')

Now we can extract just the columns that we want and rename them to avoid conflicts with similar column names in the other datasets. We also rename the "Album Name" column to "Game" to be consistent with the Kaggle data that this data will eventually be merged with.

In [110]:
vgmdb_least_pop_album = vgmdb_least_pop_album.loc[:, ['Album Name', 'Rating', 'Popularity']]

vgmdb_least_pop_album = vgmdb_least_pop_album.rename(
    columns={'Album Name': 'Game', 
             'Rating': 'VGMDB Soundtrack Rating', 
             'Popularity': 'VGMDB Soundtrack Popularity'})
vgmdb_least_pop_album.head(20)

Unnamed: 0,Game,VGMDB Soundtrack Rating,VGMDB Soundtrack Popularity
0,Diabolik Lovers,3.8,0.0006
1,Eiga Precure Dream Stars!,4.1,0.0006
2,Diabolik Lovers,4.04,0.0006
3,Haikyu,4.04,0.0006
4,Eiga Kirakira precure a La Mode Paris-tto! Omo...,4.5,0.0006
5,Nightmare,4.31,0.0006
6,Dive & Drama,5.0,0.0006
7,Square Enix,4.58,0.0029
8,Kirakira precure a La Mode 1: Precure sound de...,4.33,0.004
9,Ao Haru Ride,4.29,0.0043


Now that that dataset is ready, let's look at the mp3 Downloads dataset and prepare it to combine with the vgmdb album data we just looked at.

In [111]:
mp3_downloads.head(20)

Unnamed: 0,#,Album
0,1,Persona 5
1,2,Need For Speed: Most Wanted
2,3,Super Mario World
3,4,Minecraft
4,5,Legend Of Zelda: Ocarina Of Time
5,6,Super Smash Bros Brawl: Gamerip
6,7,Persona 4
7,8,Need For Speed: Underground 2
8,9,Nier Automata
9,10,Legend Of Zelda: Majora's Mask


In [112]:
len(mp3_downloads.index)

983

Let's strip the white space from the headers again. 

In [113]:
mp3_downloads = mp3_downloads.rename(columns=lambda x: x.strip())
mp3_downloads.columns

Index(['#', 'Album'], dtype='object')

Again, we will rename the columns to clarify the data and to make it consistent with the other columns that we will be joining on. 

In [114]:
mp3_downloads = mp3_downloads.rename(columns={'#': 'mp3 Downloads Rank', 
                                             'Album': 'Game'})
mp3_downloads.head(20)

Unnamed: 0,mp3 Downloads Rank,Game
0,1,Persona 5
1,2,Need For Speed: Most Wanted
2,3,Super Mario World
3,4,Minecraft
4,5,Legend Of Zelda: Ocarina Of Time
5,6,Super Smash Bros Brawl: Gamerip
6,7,Persona 4
7,8,Need For Speed: Underground 2
8,9,Nier Automata
9,10,Legend Of Zelda: Majora's Mask


We can now merge the mp3 Downloads dataset with the VGMDB Album dataset on the "Game" column. This will be done using an outer join to account for as many different album names as possible, as was done with the Wikipedia and VGMDB Artist datasets.

In [115]:
mp3_merge_vgmdb = pd.merge(vgmdb_least_pop_album, mp3_downloads, how='outer', on='Game')
mp3_merge_vgmdb.head(20)

Unnamed: 0,Game,VGMDB Soundtrack Rating,VGMDB Soundtrack Popularity,mp3 Downloads Rank
0,Diabolik Lovers,3.8,0.0006,
1,Diabolik Lovers,4.1,0.0077,
2,Eiga Precure Dream Stars!,4.1,0.0006,
3,Diabolik Lovers,4.04,0.0006,
4,Haikyu,4.04,0.0006,
5,Eiga Kirakira precure a La Mode Paris-tto! Omo...,4.5,0.0006,
6,Nightmare,4.31,0.0006,
7,Dive & Drama,5.0,0.0006,
8,Square Enix,4.58,0.0029,
9,Square Enix,3.17,0.0749,


In [116]:
len(mp3_merge_vgmdb.index)

2673

In case there are any duplicate rows, we can remove those:

In [117]:
mp3_merge_vgmdb = mp3_merge_vgmdb.drop_duplicates()
mp3_merge_vgmdb.head(20)

Unnamed: 0,Game,VGMDB Soundtrack Rating,VGMDB Soundtrack Popularity,mp3 Downloads Rank
0,Diabolik Lovers,3.8,0.0006,
1,Diabolik Lovers,4.1,0.0077,
2,Eiga Precure Dream Stars!,4.1,0.0006,
3,Diabolik Lovers,4.04,0.0006,
4,Haikyu,4.04,0.0006,
5,Eiga Kirakira precure a La Mode Paris-tto! Omo...,4.5,0.0006,
6,Nightmare,4.31,0.0006,
7,Dive & Drama,5.0,0.0006,
8,Square Enix,4.58,0.0029,
9,Square Enix,3.17,0.0749,


In [118]:
len(mp3_merge_vgmdb.index)

2672

### Creating the Final Dataset

We can now merge the mp3 Downloads-VGMDB Album dataset with the Kaggle-Wikipedia-VGMDB Artist dataset. This final merge will create the final, new dataset. 

This merge will be an inner join, so only games that appear in both datasets are kept. Any rows in the mp3 Downloads-VGMDB Album data that do not match a game in the Kaggle-Wikipedia-VGMDB Artist data are dropped, as they would not be useful for performing any type of analysis. The datasets will be combined on the "Game" column.

In [119]:
final_dataset = pd.merge(vgmdb_wiki_kaggle, mp3_merge_vgmdb, how='inner', on='Game')
final_dataset.head(20)

Unnamed: 0,Game,Platform,Year_of_Release,Genre,Publisher,Critic_Score,User_Score,Developer,ESRB Rating,Composer,Composer Birthday,Composer Rank,Composer Rating,VGMDB Soundtrack Rating,VGMDB Soundtrack Popularity,mp3 Downloads Rank
0,Wii Sports,Wii,2006,Sports,Nintendo,76.0,8.0,Nintendo,E,Kazumi Totaka,1967-08-23,6777.0,4.07,,,278.0
1,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,,,,,Junichi Masuda,1968-01-12,5712.0,4.14,,,235.0
2,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,,,,,Junichi Masuda,1968-01-12,5712.0,4.14,,,948.0
3,Tetris,GB,1989,Puzzle,Nintendo,,,,,Hirokazu Tanaka,1957-12-13,6489.0,4.09,,,364.0
4,Tetris,GB,1989,Puzzle,Nintendo,,,,,Jun Senoue,1970-08-02,6091.0,4.12,,,364.0
5,Tetris,GB,1989,Puzzle,Nintendo,,,,,Kazumi Totaka,1967-08-23,6777.0,4.07,,,364.0
6,Tetris,GB,1989,Puzzle,Nintendo,,,,,Tomoya Ohtani,1974-07-01,5079.0,4.18,,,364.0
7,Tetris,NES,1988,Puzzle,Nintendo,,,,,Hirokazu Tanaka,1957-12-13,6489.0,4.09,,,364.0
8,Tetris,NES,1988,Puzzle,Nintendo,,,,,Jun Senoue,1970-08-02,6091.0,4.12,,,364.0
9,Tetris,NES,1988,Puzzle,Nintendo,,,,,Kazumi Totaka,1967-08-23,6777.0,4.07,,,364.0


In [120]:
len(final_dataset.index)

1395

We can then drop any final duplicate rows in the dataset:

In [121]:
final_dataset = final_dataset.drop_duplicates()
final_dataset.head(20)

Unnamed: 0,Game,Platform,Year_of_Release,Genre,Publisher,Critic_Score,User_Score,Developer,ESRB Rating,Composer,Composer Birthday,Composer Rank,Composer Rating,VGMDB Soundtrack Rating,VGMDB Soundtrack Popularity,mp3 Downloads Rank
0,Wii Sports,Wii,2006,Sports,Nintendo,76.0,8.0,Nintendo,E,Kazumi Totaka,1967-08-23,6777.0,4.07,,,278.0
1,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,,,,,Junichi Masuda,1968-01-12,5712.0,4.14,,,235.0
2,Pokemon Red/green/blue/yellow,GB,1996,Role-Playing,Nintendo,,,,,Junichi Masuda,1968-01-12,5712.0,4.14,,,948.0
3,Tetris,GB,1989,Puzzle,Nintendo,,,,,Hirokazu Tanaka,1957-12-13,6489.0,4.09,,,364.0
4,Tetris,GB,1989,Puzzle,Nintendo,,,,,Jun Senoue,1970-08-02,6091.0,4.12,,,364.0
5,Tetris,GB,1989,Puzzle,Nintendo,,,,,Kazumi Totaka,1967-08-23,6777.0,4.07,,,364.0
6,Tetris,GB,1989,Puzzle,Nintendo,,,,,Tomoya Ohtani,1974-07-01,5079.0,4.18,,,364.0
7,Tetris,NES,1988,Puzzle,Nintendo,,,,,Hirokazu Tanaka,1957-12-13,6489.0,4.09,,,364.0
8,Tetris,NES,1988,Puzzle,Nintendo,,,,,Jun Senoue,1970-08-02,6091.0,4.12,,,364.0
9,Tetris,NES,1988,Puzzle,Nintendo,,,,,Kazumi Totaka,1967-08-23,6777.0,4.07,,,364.0


In [122]:
len(final_dataset.index)

1395
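
Many rows in the output above have empty soundtrack rating and popularity values. The sketch below reproduces the same pattern on toy data: an outer join followed by an inner join, then a per-column count of missing values. The frame names (`albums`, `downloads`, `soundtracks`, `games`, `final`) are hypothetical and the rows are a tiny illustrative subset.

```python
import pandas as pd

albums = pd.DataFrame({'Game': ['Diabolik Lovers'], 'VGMDB Soundtrack Rating': [3.8]})
downloads = pd.DataFrame({'Game': ['Tetris'], 'mp3 Downloads Rank': [364]})

# Outer join: both names are kept, each missing the other side's column.
soundtracks = pd.merge(albums, downloads, how='outer', on='Game')

games = pd.DataFrame({'Game': ['Tetris', 'Duck Hunt'], 'Platform': ['GB', 'NES']})

# Inner join: only names present in both frames survive.
final = pd.merge(games, soundtracks, how='inner', on='Game')

# Count missing values in each column of the result.
missing = final.isna().sum()

print(final['Game'].tolist())
print(missing)
```

Here "Duck Hunt" is dropped because it has no soundtrack row, and "Tetris" survives with a download rank but no VGMDB rating. A count like `isna().sum()` on the real final dataset would show how many rows are affected in each column.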

Finally, to create a download link for this final dataset, we can use the following code:$^{2}$

In [123]:
from IPython.display import HTML
import base64 

In [124]:
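def create_download_link(df, title="Download CSV file", filename="johnson_eric_final_dataset.csv"):
    """Return an HTML link that downloads `df` as a CSV file when clicked."""
    csv = df.to_csv()                     # serialize the DataFrame (index included) to CSV text
    b64 = base64.b64encode(csv.encode())  # base64-encode the UTF-8 bytes
    payload = b64.decode()                # convert back to str so it can be embedded in the link
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)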

create_download_link(final_dataset)
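
The link works by embedding the whole CSV in the page as a base64 data URI. The sketch below shows that encoding step on a toy frame, outside of a notebook, and decodes the payload again to check that nothing is lost. It passes `index=False` to `to_csv`, which is a choice made for this sketch (the function above keeps the index), and the names `payload`, `href`, and `restored` are illustrative.

```python
import base64
import io
import pandas as pd

df = pd.DataFrame({'Game': ['Tetris'], 'Platform': ['GB']})

# Serialize to CSV text; index=False leaves out the row index column.
csv_text = df.to_csv(index=False)

# Encode the UTF-8 bytes as base64 and build the data URI used in the link.
payload = base64.b64encode(csv_text.encode()).decode()
href = f"data:text/csv;base64,{payload}"

# Decoding the payload gives back the original CSV text.
decoded = base64.b64decode(payload).decode()
restored = pd.read_csv(io.StringIO(decoded))

print(decoded == csv_text)
print(list(restored.columns))
```

Because the data is embedded in the link itself, this approach is best suited to small or moderately sized datasets. For larger ones, writing a file with `df.to_csv(path)` may be more practical.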

### References

1. https://stackoverflow.com/questions/21606987/how-can-i-strip-the-whitespace-from-pandas-dataframe-headers 
2. https://blog.softhints.com/jupyter-ipython-download-files/#zip 