# Exploring Rolling Stone Top 500 Albums 2012

The dataset used here came from [Kaggle](https://www.kaggle.com/datasets) via [data.world](https://data.world/notgibs/rolling-stones-top-500-albums). The analysis performed in this notebook heavily focuses on the Pandas functionality found in [The Pandas Cheatsheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf). 

If something is used that isn't in the cheatsheet, follow the links! They lead to the official docs, and they're included in lieu of explicit explanations of what each cell is doing, because that's what docs are for! Get the most out of this by digging in, messing around, and breaking stuff.

In [21]:
import pandas as pd

df = pd.read_csv('albumlist.csv')

With our data imported, let's take a cursory [look](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html) at what we're dealing with.

## High Level Analysis

In [35]:
df.dtypes

Number       int64
Year         int64
Album       object
Artist      object
Genre       object
Subgenre    object
dtype: object

In [36]:
df.head()

| | Number | Year | Album | Artist | Genre | Subgenre |
|---|---|---|---|---|---|---|
| 0 | 1 | 1967 | Sgt. Pepper's Lonely Hearts Club Band | The Beatles | Rock | Rock & Roll, Psychedelic Rock |
| 1 | 2 | 1966 | Pet Sounds | The Beach Boys | Rock | Pop Rock, Psychedelic Rock |
| 2 | 3 | 1966 | Revolver | The Beatles | Rock | Psychedelic Rock, Pop Rock |
| 3 | 4 | 1965 | Highway 61 Revisited | Bob Dylan | Rock | Folk Rock, Blues Rock |
| 4 | 5 | 1965 | Rubber Soul | The Beatles | Rock, Pop | Pop Rock |


Six columns. We've got:
- *Number*, a simple integer index.
- *Year*, the year the album was released.
- *Album*, the title. (Stored as `object`, the dtype pandas uses for strings.)
- *Artist*, who recorded the album.
- *Genre*
- *Subgenre*

With a quick [summary](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html#pandas.DataFrame.describe) of the dataset, we can get a feel for what might be some interesting questions to ask about our data and scheme out our analysis for the rest of the notebook.

In [27]:
df.describe(include='all')

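| | Number | Year | Album | Artist | Genre | Subgenre |
|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500 | 500 | 500 | 500 |
| unique | | | 497 | 289 | 63 | 290 |
| top | | | Greatest Hits | The Beatles | Rock | None |
| freq | | | 3 | 10 | 249 | 29 |
| mean | 250.5 | 1979.27 | | | | |
| std | 144.481833 | 12.093701 | | | | |
| min | 1.0 | 1955.0 | | | | |
| 25% | 125.75 | 1970.0 | | | | |
| 50% | 250.5 | 1976.0 | | | | |
| 75% | 375.25 | 1988.0 | | | | |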


This small summary says a lot!

*Number* isn't gonna be useful since it's just an index. Let's plan on disposing of that.
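
As a minimal sketch of that disposal, the cell below rebuilds the five rows shown by `df.head()` inline (so it runs without `albumlist.csv`) and drops the column with [`DataFrame.drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html). The names `sample` and `trimmed` are made up for this illustration.

```python
import pandas as pd

# The five rows from df.head(), rebuilt inline so this runs without albumlist.csv.
sample = pd.DataFrame({
    "Number": [1, 2, 3, 4, 5],
    "Year": [1967, 1966, 1966, 1965, 1965],
    "Album": [
        "Sgt. Pepper's Lonely Hearts Club Band",
        "Pet Sounds",
        "Revolver",
        "Highway 61 Revisited",
        "Rubber Soul",
    ],
    "Artist": ["The Beatles", "The Beach Boys", "The Beatles", "Bob Dylan", "The Beatles"],
    "Genre": ["Rock", "Rock", "Rock", "Rock", "Rock, Pop"],
    "Subgenre": [
        "Rock & Roll, Psychedelic Rock",
        "Pop Rock, Psychedelic Rock",
        "Psychedelic Rock, Pop Rock",
        "Folk Rock, Blues Rock",
        "Pop Rock",
    ],
})

# drop() returns a new frame; `sample` keeps its Number column unless we reassign.
trimmed = sample.drop(columns="Number")
print(trimmed.columns.tolist())
```

On the real data the same call would be `df = df.drop(columns="Number")`.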

The min *Year* of this dataset is 1955 and the max is 2011, so we know that this info covers 56 years of albums. The quartiles (the 3 rows below *min*) tell us even more, and the gap between the first and third is the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range): 1988 - 1970 = 18 years, so the middle half of the list fits in an 18-year window. The first quartile (Q1) is 1970, so 1/4 of all these albums came out in or before 1970, within the first 15 years of this time period. The next 25% of the data, between Q1 and the median (Q2, 1976), is all from the following 6 years! That tells us that a ton of these top 500 albums came from the 70s, so maybe that'd be a good subset of the data to analyze later. The third quarter, between the median and Q3 (1988), covers 12 years, and the last quarter stretches over the final 23. This shows us another trend: Rolling Stone added albums to their best-of-all-time at a much slower rate after 1988. Something else to consider later.
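
To check the arithmetic on those spans, here is a small sketch that works from the boundary years alone. The 1955, 1970, 1976, and 1988 values come from the `describe()` output; the 2011 max is taken from the text. The names `bounds`, `spans`, and `iqr` are made up for this illustration.

```python
import pandas as pd

# Boundary years: min and quartiles from df.describe(), max (2011) from the text.
bounds = pd.Series(
    [1955, 1970, 1976, 1988, 2011],
    index=["min", "25%", "50%", "75%", "max"],
)

# Width in years of each quarter of the list (difference between neighbors).
spans = bounds.diff().dropna().astype(int)
print(spans.tolist())  # [15, 6, 12, 23]

# The interquartile range proper is the single number Q3 - Q1.
iqr = bounds["75%"] - bounds["25%"]
print(iqr)  # 18

# On the full frame, the same boundaries would come from:
# df["Year"].quantile([0, 0.25, 0.5, 0.75, 1])
```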

Next, we look at *Album* and notice that there are only 497 unique titles in this list of 500 albums. That doesn't seem right... The most repeated title is *Greatest Hits* (*freq* tells us that it appears 3 times). That only accounts for two of the three missing titles, so one other title must appear twice. We'll come back to that later as well.
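
One way to chase down repeated titles is [`DataFrame.duplicated`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html). The sketch below uses a tiny hypothetical frame: every row in it except the title *Greatest Hits* is invented for illustration and is not taken from `albumlist.csv`.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
sample = pd.DataFrame({
    "Album": ["Greatest Hits", "Title X", "Greatest Hits",
              "Greatest Hits", "Title X", "Title Y"],
    "Artist": ["Artist A", "Artist B", "Artist C",
               "Artist D", "Artist E", "Artist F"],
})

# keep=False flags every row whose title occurs more than once,
# not just the second and later occurrences.
repeats = sample[sample.duplicated("Album", keep=False)]

# How many times each repeated title shows up.
counts = repeats["Album"].value_counts()
print(counts)
```

Swapping `sample` for `df` would list every repeated title along with the artists behind it.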

*Artist*: We see the Beatles made it on 10 times. Quite a few other artists made it on more than once as well, since there are only 289 unique artists across 500 rows of data.
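
To see who appears more than once, [`Series.value_counts`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) does the counting. This sketch only uses the *Artist* values from the five `df.head()` rows, so the numbers it prints describe that tiny sample, not the full list. The names `artists`, `counts`, and `multi` are made up for this illustration.

```python
import pandas as pd

# Artist column of the five rows shown by df.head().
artists = pd.Series(
    ["The Beatles", "The Beach Boys", "The Beatles", "Bob Dylan", "The Beatles"],
    name="Artist",
)

# Appearances per artist, most frequent first.
counts = artists.value_counts()

# Keep only the artists that show up more than once.
multi = counts[counts > 1]
print(multi)
```

On the full frame, `df["Artist"].value_counts()` gives the same kind of ranking.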

Finally, *Genre* and *Subgenre*. Rock dominates the genres in the list, with just under half the "best" albums (249 of 500) tagged as plain Rock, and likely more once combined tags like "Rock, Pop" are counted. *Subgenre* will be interesting to look at, since even though the #1 subgenre is `None`, that only happens 29 times, so 471 of these have subgenres listed.
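
Because a *Genre* cell can hold several comma-separated tags, counting exact strings and counting individual tags give different answers. A rough sketch of both, using only the *Genre* values from the five `df.head()` rows and assuming tags are always separated by a comma and a space (the names `genre`, `exact`, and `per_tag` are made up for this illustration):

```python
import pandas as pd

# Genre column of the five rows shown by df.head().
genre = pd.Series(["Rock", "Rock", "Rock", "Rock", "Rock, Pop"], name="Genre")

# Exact-string counts: "Rock, Pop" is treated as its own category.
exact = genre.value_counts()
print(exact["Rock"])  # 4

# Split each cell into a list of tags, then explode to one tag per row.
per_tag = genre.str.split(", ").explode().value_counts()
print(per_tag["Rock"])  # 5
print(per_tag["Pop"])   # 1
```

The `freq` of 249 in the summary is the exact-string kind of count.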

Armed with some paths to head down, let's explore.

## Diving In
 