## Preprocessing 
This notebook preprocesses the scrobbles CSV exported with [lastfm-to-csv](https://benjaminbenben.com/lastfm-to-csv/).

First, we will load the data and take a peek at it. 

In [290]:
import pandas as pd 

data = "scrobbles.csv"

scrobbles = pd.read_csv(data, names=['Artist', 'Album', 'Track', 'Time'])

scrobbles

Unnamed: 0,Artist,Album,Track,Time
0,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,born again so many times you forget you are,26 Feb 2021 11:33
1,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,he never existed in the first place,26 Feb 2021 11:29
2,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,Through Hallways,26 Feb 2021 11:26
3,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,Waiting for Your Return Like Running Backwards,26 Feb 2021 11:24
4,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,A Map Of the Stars,26 Feb 2021 11:21
...,...,...,...,...
126910,Radiohead,The Bends,(Nice Dream),01 Jan 1970 00:00
126911,Radiohead,The Bends,(Nice Dream),01 Jan 1970 00:00
126912,Radiohead,The Bends,(Nice Dream),01 Jan 1970 00:00
126913,Radiohead,The Bends,(Nice Dream),01 Jan 1970 00:00


### Times

The scrobble times are stored as strings of the form "DD Mon YYYY HH:MM", for example "26 Feb 2021 11:33". Numeric components are easier to work with, so we write a function that parses each string with `datetime.strptime` and returns the day, month, year, hour and minute. We then apply it to the `Time` column and replace that column with five numeric columns: `Day`, `Month`, `Year`, `Hour` and `Minute`.

In [291]:
times = scrobbles["Time"]

times[0]

'26 Feb 2021 11:33'

In [292]:
from datetime import datetime

def convert_time(time):
    ''' Parse a "DD Mon YYYY HH:MM" string into a (day, month, year, hour, minute) tuple. '''
    parsed = datetime.strptime(time, "%d %b %Y %H:%M")

    return parsed.day, parsed.month, parsed.year, parsed.hour, parsed.minute

scrobbles['Time'] = times.apply(convert_time)
scrobbles[['Day', 'Month', 'Year', 'Hour', 'Minute']] = pd.DataFrame(scrobbles['Time'].tolist(), index=scrobbles.index) # Create new column for day, month, year, hour, minute
scrobbles.drop(columns='Time', inplace=True) # Drop original time column
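
As a possible alternative to the row-by-row `apply`, the same split can be sketched with `pd.to_datetime` and the `.dt` accessor, which parse the whole column at once. This is only an illustration and not part of the original notebook: `sample` is a hypothetical two-row frame standing in for `scrobbles`, and the month abbreviations are assumed to be English.

```python
import pandas as pd

# Hypothetical stand-in for the scrobbles frame, using the same time format.
sample = pd.DataFrame({
    "Track": ["Through Hallways", "(Nice Dream)"],
    "Time": ["26 Feb 2021 11:26", "01 Jan 1970 00:00"],
})

# Parse the whole column in one call; an explicit format avoids per-row guessing.
parsed = pd.to_datetime(sample["Time"], format="%d %b %Y %H:%M")

# Extract each component through the .dt accessor.
sample["Day"] = parsed.dt.day
sample["Month"] = parsed.dt.month
sample["Year"] = parsed.dt.year
sample["Hour"] = parsed.dt.hour
sample["Minute"] = parsed.dt.minute

# Drop the original string column, as in the cell above.
sample = sample.drop(columns="Time")
```

Keeping `parsed` around as a single datetime column would also make later sorting or resampling by time easier, if that turns out to be needed.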

In [293]:
scrobbles.tail(100)

Unnamed: 0,Artist,Album,Track,Day,Month,Year,Hour,Minute
126815,Tyler The Creator,Wolf,Awkward,1,1,1970,0,1
126816,The Antlers,Hospice,Atrophy,1,1,1970,0,1
126817,The Antlers,Hospice,Atrophy,1,1,1970,0,1
126818,The Antlers,Hospice,Atrophy,1,1,1970,0,1
126819,The Antlers,Hospice,Atrophy,1,1,1970,0,1
...,...,...,...,...,...,...,...,...
126910,Radiohead,The Bends,(Nice Dream),1,1,1970,0,0
126911,Radiohead,The Bends,(Nice Dream),1,1,1970,0,0
126912,Radiohead,The Bends,(Nice Dream),1,1,1970,0,0
126913,Radiohead,The Bends,(Nice Dream),1,1,1970,0,0


Looking at the tail, we also see scrobbles dated 1 January 1970 (the Unix epoch), which is clearly wrong. We will remove every entry whose year is 1970. There are 1679 such entries.

In [294]:
time_errors = scrobbles.loc[scrobbles['Year'] == 1970]
time_errors

Unnamed: 0,Artist,Album,Track,Day,Month,Year,Hour,Minute
125236,Rush,Moving Pictures,YYZ,1,1,1970,0,27
125237,Rush,Moving Pictures,YYZ,1,1,1970,0,27
125238,Rush,Moving Pictures,YYZ,1,1,1970,0,27
125239,Rush,Moving Pictures,YYZ,1,1,1970,0,27
125240,Rush,Moving Pictures,YYZ,1,1,1970,0,27
...,...,...,...,...,...,...,...,...
126910,Radiohead,The Bends,(Nice Dream),1,1,1970,0,0
126911,Radiohead,The Bends,(Nice Dream),1,1,1970,0,0
126912,Radiohead,The Bends,(Nice Dream),1,1,1970,0,0
126913,Radiohead,The Bends,(Nice Dream),1,1,1970,0,0


In [301]:
scrobbles.drop(scrobbles[scrobbles['Year'] == 1970].index, inplace=True)
scrobbles

Unnamed: 0,Artist,Album,Track,Day,Month,Year,Hour,Minute
0,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,born again so many times you forget you are,26,2,2021,11,33
1,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,he never existed in the first place,26,2,2021,11,29
2,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,Through Hallways,26,2,2021,11,26
3,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,Waiting for Your Return Like Running Backwards,26,2,2021,11,24
4,Brave Little Abacus,Masked Dancers: Concern in So Many Things You ...,A Map Of the Stars,26,2,2021,11,21
...,...,...,...,...,...,...,...,...
125231,Jai Paul,Jasmine (Demo),Jasmine (Demo),25,1,2014,22,24
125232,Radiohead,The Bends,Fake Plastic Trees,25,1,2014,22,17
125233,Radiohead,The Bends,High and Dry,25,1,2014,22,13
125234,Radiohead,The Bends,The Bends,25,1,2014,22,9
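
The same clean-up can also be written as a boolean mask, which makes it easy to count the bad rows before discarding them. The sketch below is an illustration under stated assumptions rather than the notebook's own method: `sample`, `is_epoch`, `n_bad` and `cleaned` are hypothetical names, and the three rows are taken from the tables above.

```python
import pandas as pd

# Hypothetical stand-in for the scrobbles frame after the time split.
sample = pd.DataFrame({
    "Artist": ["Radiohead", "Jai Paul", "Radiohead"],
    "Track": ["High and Dry", "Jasmine (Demo)", "(Nice Dream)"],
    "Year": [2014, 2014, 1970],
})

# True for every row that carries the bogus epoch year.
is_epoch = sample["Year"] == 1970

# Count the affected rows before removing them.
n_bad = int(is_epoch.sum())

# Keep only the valid rows and renumber the index from zero.
cleaned = sample[~is_epoch].reset_index(drop=True)
```

Unlike `drop(..., inplace=True)`, this leaves the original frame untouched, so the cell can be re-run without side effects.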


In [309]:
for e, name in enumerate(scrobbles.Artist.unique()):
    print(e, name)

1306 Digga D
1307 Tangerine Dream
1308 Terry Riley
1309 Oval
1310 Kenso
1311 Portobello
1312 Pity Sex
1313 Eric Lau
1314 Soviet Soviet
1315 Prophet
1316 Kenichiro Nishihara
1317 bbno$
1318 Lil Peep
1319 B. Lustmord
1320 Chinese Football
1321 Nujabes
1322 SALES
1323 LOVING
1324 Nicotine
1325 Jakob Ogawa
1326 zack villere
1327 Good Morning
1328 Library Tapes
1329 TEMPOREX
1330 Puma Blue
1331 Dirty Three
1332 Oliver Houston
1333 Flume
1334 Malik Flavors
1335 村中りか
1336 Akira Yamaoka
1337 Steve Reich
1338 Sir Froderick
1339 Gabriela Parra
1340 Peter Broderick
1341 David Wenngren
1342 PandRezz
1343 Ol' Dirty Bastard
1344 Dark Sky
1345 lovelettertypewriter
1346 Saba Alizadeh
1347 KID FRESINO
1348 Lord Huron
1349 Home
1350 S U R F I N G
1351 Eyeliner
1352 VAPERROR
1353 Blank Banshee
1354 18 Carat Affair
1355 Milo
1356 DJ Muggs
1357 Cassiano
1358 Tim Maia
1359 Beto Guedes
1360 Jackson do Pandeiro
1361 Milton Nascimento
1362 Cartola
1363 Lô Borges
1364 Bellaire
1365 Waste of Space Orchestra
1366

In [318]:
scrobbles.groupby(['Artist']).size().reset_index(name='count').sort_values(by='count').tail(20)

Unnamed: 0,Artist,count
1852,Slowdive,1082
1850,Slint,1095
1608,Pink Floyd,1113
1375,Mike,1174
2140,The World Is a Beautiful Place & I Am No Longe...,1192
511,Danny Brown,1307
854,Have a Nice Life,1361
2025,The Brave Little Abacus,1365
621,Duster,1366
172,BROCKHAMPTON,1369
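
For reference, a shorter way to get per-artist play counts is `Series.value_counts`, which groups, counts and sorts in descending order in one step. This is a minimal sketch on a hypothetical `sample` frame, not the cell that produced the table above.

```python
import pandas as pd

# Hypothetical stand-in: one row per scrobble.
sample = pd.DataFrame({
    "Artist": ["Duster", "Slint", "Duster", "Slowdive", "Duster", "Slint"],
})

# Count scrobbles per artist, most played first, and keep the top two.
top = sample["Artist"].value_counts().head(2)
```

Note that the result is ordered from most to least played, the reverse of the `sort_values(by='count').tail(20)` ordering used above.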
