
# Lab: Cleaning Rock Song Data

---


In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns

%matplotlib inline

### 1. Load `rock.csv` and do an initial examination of its data columns.

In [2]:
rockfile = "../../../../resource-datasets/rock_songs/rock.csv"

In [3]:
# Load the data.
rockdata = pd.read_csv(rockfile)

In [4]:
# Look at the information regarding its columns.
rockdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2230 entries, 0 to 2229
Data columns (total 8 columns):
Song Clean      2230 non-null object
ARTIST CLEAN    2230 non-null object
Release Year    1653 non-null object
COMBINED        2230 non-null object
First?          2230 non-null int64
Year?           2230 non-null int64
PlayCount       2230 non-null int64
F*G             2230 non-null int64
dtypes: int64(4), object(4)
memory usage: 139.5+ KB


### 2.  Clean up the column names.

Let's clean up the column names. There are three ways we can accomplish this:

#### 2.A Change the column names when you import the data using `pd.read_csv()`.

When you pass `names=[...]` (a list of strings) to `pd.read_csv()` and the list has one entry per column, those strings replace the column names.

NOTE: The first row of the `.csv` is already a header, so when you supply custom column names you must tell `pandas` to skip that row. Otherwise, the old header is read in as a row of data. The `skiprows=1` keyword argument to `read_csv()` tells `pandas` to skip the first row.

In [5]:
# Change the column names when loading the '.csv':

# Set the list of new names.
new_nameslist = ['song_clean', 'artist_clean', 'release_year', 'combined', 'first', 'year', 'play_count', 'fg']

# Load the data and change the column names.
rockdata = pd.read_csv(rockfile, skiprows=1, names=new_nameslist)

# Print the first rows.
rockdata.head()

Unnamed: 0,song_clean,artist_clean,release_year,combined,first,year,play_count,fg
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1
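
As a minimal sketch of why `skiprows=1` matters when passing `names`, the snippet below reads a tiny in-memory CSV twice. The `csv_text` contents and the variable names are made up for illustration and are not part of `rock.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for rock.csv.
csv_text = "Song Clean,Release Year\nCaught Up in You,1982\nFantasy Girl,\n"
new_names = ["song_clean", "release_year"]

# Without skiprows=1, the old header row is read as the first data row.
with_header_row = pd.read_csv(io.StringIO(csv_text), names=new_names)

# With skiprows=1, the old header is dropped and only the data rows remain.
renamed = pd.read_csv(io.StringIO(csv_text), skiprows=1, names=new_names)

print(len(with_header_row), len(renamed))
```

In the first read, the old header strings end up in row 0, which also forces the year column to be read as text.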


#### 2.B Change column names using the `.rename()` function.

The `.rename()` function takes an argument, `columns=name_dict`, in which `name_dict` is a dictionary containing the original column names as keys and the new column names as values.
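
One detail that is easy to miss: `.rename()` returns a new DataFrame and leaves the original untouched unless you assign the result. A small sketch on a hypothetical two-column frame (the `toy` data is illustrative only):

```python
import pandas as pd

# Hypothetical miniature frame with two of the original column names.
toy = pd.DataFrame({"Song Clean": ["Caught Up in You"], "PlayCount": [82]})

# Calling .rename() without assigning the result leaves `toy` unchanged.
toy.rename(columns={"Song Clean": "song_clean"})
print(list(toy.columns))

# Assigning the result keeps the change; unlisted columns keep their names.
toy_renamed = toy.rename(columns={"Song Clean": "song_clean"})
print(list(toy_renamed.columns))
```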

In [6]:
# Change the column names using the `.rename()` function.

# Reload the data.
rockdata = pd.read_csv(rockfile)

# Map each original column name to its new name.
new_names = {
    'Song Clean': 'song_clean',
    'ARTIST CLEAN': 'artist_clean',
    'Release Year': 'release_year',
    'COMBINED': 'combined',
    'First?': 'first',
    'Year?': 'year',
    'PlayCount': 'play_count',
    'F*G': 'fg',
}

# `.rename()` returns a new DataFrame, so assign the result back.
rockdata = rockdata.rename(columns=new_names)

# Print the first rows.
rockdata.head()

Unnamed: 0,song_clean,artist_clean,release_year,combined,first,year,play_count,fg
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1


#### 2.C Reassigning the `.columns` attribute of a DataFrame.

You can also just reassign the `.columns` attribute to a list of strings containing the new column names. 

The only caveat with reassigning `.columns` is that you have to reassign all of the column names at once. You can't partially replace a value by working on `.columns` directly. You have to reassign `.columns` with a list of equal length. 
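
The equal-length requirement can be seen in a short sketch. The `toy` frame here is hypothetical; only the `.columns` behaviour is the point.

```python
import pandas as pd

toy = pd.DataFrame({"Song Clean": ["Hold On Loosely"], "PlayCount": [85]})

# A list with the wrong number of names is rejected with a ValueError.
try:
    toy.columns = ["song_clean"]
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

# A list of equal length replaces every name at once.
toy.columns = ["song_clean", "play_count"]
print(list(toy.columns))
```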

In [7]:
# Replace the column names by reassigning the `.columns` attribute.
rockdata.columns = new_nameslist

#print
rockdata.head()

Unnamed: 0,song_clean,artist_clean,release_year,combined,first,year,play_count,fg
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1


### 3. Subsetting data where null values exist.

We have mixed `str` and `NaN` values in the `release_year` column. `NaN` stands for "not a number" and is how `pandas` represents null or missing data. We can use the `.isnull()` method of a Series to find null values.

Print the first rows of the subset of the data in which the `release_year` column is null.

In [8]:
# Show records where rockdata['release_year'] is null.
rockdata[rockdata["release_year"].isnull()].head()

Unnamed: 0,song_clean,artist_clean,release_year,combined,first,year,play_count,fg
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
10,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
13,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
16,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
24,Let's Get It Up,AC/DC,,Let's Get It Up by AC/DC,1,0,4,0
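
To make the masking step explicit, here is a small sketch on a hypothetical three-row frame. It shows that `.isnull()` produces a boolean Series, which can be summed to count nulls or used to select rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "song_clean": ["Caught Up in You", "Fantasy Girl", "Hold On Loosely"],
    "release_year": ["1982", np.nan, "1981"],
})

# .isnull() returns a boolean Series: True wherever the value is missing.
null_mask = toy["release_year"].isnull()

# Summing the mask counts the nulls; indexing with it keeps only those rows.
print(null_mask.sum())
print(toy[null_mask])
```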


### 4. Update slices of your DataFrame based on mask selection/slices.

In many scenarios, we want to update values in our DataFrame according to some criteria. Let's say we want to set all of the null values in `release_year` to 0.

To modify the original DataFrame reliably, use `.loc` with a row mask and a column label in a single indexing step. Chained indexing may assign to a temporary copy instead of the original, and newer versions of `pandas` warn about it.

For example, the following won't always work:
```python
df[row_mask]['column_name'] = new_value
```

The best way to accomplish the same task is:
```python
df.loc[row_mask, 'column_name'] = new_value
```

For multiple column assignment, you would use:
```python
df.loc[row_mask, ['col_1', 'col_2', 'col_3']] = new_value
```
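
As an illustration of the multiple-column form, the sketch below fills the nulls in two columns of a hypothetical frame in one `.loc` assignment. The column values are invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "song_clean": ["Caught Up in You", "Fantasy Girl"],
    "release_year": [1982.0, np.nan],
    "fg": [82.0, np.nan],
})

row_mask = toy["release_year"].isnull()

# One indexing step: rows chosen by the mask, columns chosen by label.
toy.loc[row_mask, ["release_year", "fg"]] = 0
print(toy)
```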

#### 4.A Let's try it out. Make all of the null values in `release_year` 0.

In [9]:
# Replace release_year nulls with 0 using a mask and `.loc`.
rockdata.loc[rockdata['release_year'].isnull(), 'release_year'] = 0

# Print the first rows.
rockdata.head()

Unnamed: 0,song_clean,artist_clean,release_year,combined,first,year,play_count,fg
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,0,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1


#### 4.B Verify that `release_year` contains no null values.

In [10]:
# A:
rockdata['release_year'].isnull().sum()

0

### 5. Ensure that the data types of the columns make sense. 

Verifying column data types is a critical part of data munging. If columns have the wrong data type, then there is usually corrupted or incorrect data in some of the observations.

#### 5.A Look at the data types for the columns. Are any incorrect given what the data represents?

In [11]:
# A: The release_year column should be int64, but it is stored as object.
rockdata.dtypes

song_clean      object
artist_clean    object
release_year    object
combined        object
first            int64
year             int64
play_count       int64
fg               int64
dtype: object

### 6. Investigate and clean up the `release_year` column.

The `release_year` column is a string (`object`) data type when it should be an integer.

#### 6.A Figure out what value(s) are causing the `release_year` column to be encoded as a string instead of an integer.

In [12]:
# A: At least one value is a non-numeric string. To find it, try the conversion:
# rockdata['release_year'].astype(int)
# The resulting error message names the value that cannot be converted.




#### 6.B Look at the rows in which there is incorrect data in the `release_year` column.

In [13]:
# A: 
rockdata.loc[rockdata['release_year'] == 'SONGFACTS.COM']

Unnamed: 0,song_clean,artist_clean,release_year,combined,first,year,play_count,fg
1504,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1
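
When the offending value is not known in advance, one possible way to locate it is `pd.to_numeric` with `errors="coerce"`. This is a sketch on a hypothetical Series rather than the approach used in this lab; `years` and `offenders` are illustrative names.

```python
import pandas as pd

# Hypothetical column mixing year strings with one stray text value.
years = pd.Series(["1982", "0", "SONGFACTS.COM", "1975"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_numbers = pd.to_numeric(years, errors="coerce")

# Values that became NaN but were not null originally are the offenders.
offenders = years[as_numbers.isnull() & years.notnull()]
print(offenders.tolist())
```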


#### 6.C Clean up the data.

Normally we might replace the offending data with null (`np.nan`) values. However, we previously converted all of the null values in the `release_year` column to zeros, so we will continue with the same practice. Replacing the value with 0 (or `np.nan`) will allow us to convert the column to a numeric type.

In [14]:
# A: 
rockdata.loc[rockdata['release_year'] == 'SONGFACTS.COM', 'release_year'] = 0

# Change the release_year type to int.
rockdata['release_year'] = rockdata['release_year'].astype(int)

# Check the new data type.
rockdata['release_year'].dtype


dtype('int64')

### 7. Get summary statistics for the `release_year` column using the `.describe()` function.

Now that the `release_year` column is finally a numeric data type, we can apply the `.describe()` function.

#### 7.A Print out the summary stats for the `release_year` column. What are the earliest and latest release years?

In [15]:
# A: earliest = 1071, latest = 2014

# Get the earliest (minimum) year, ignoring the zero placeholders.
print(rockdata.loc[rockdata['release_year'] > 0, 'release_year'].min())

# Describe the column.
rockdata['release_year'].describe()

1071


count    2230.000000
mean     1465.331390
std       867.196161
min         0.000000
25%         0.000000
50%      1973.000000
75%      1981.000000
max      2014.000000
Name: release_year, dtype: float64

#### 7.B Based on the summary statistics, is there anything else wrong with the `release_year` column?

In [16]:
# A: The minimum is zero, the placeholder we used for missing years.
print("earliest year =", rockdata['release_year'].describe()['min'])

earliest year = 0.0


### 8. Make changes and investigate using custom functions with `.apply()`.

Let's say we want to traverse every single row in our data set and apply a function to that row.

#### 8.A Write a function that will take a row of a DataFrame and print out the song, artist, and whether or not the release date is < 1970.

In [17]:
# A:
def printdata(row):
    # Print the song, the artist, and whether the release year is before 1970.
    # Rows with the placeholder year 0 (missing) also fall in the "before" branch.
    if row['release_year'] < 1970:
        era = "before 1970"
    else:
        era = "in 1970 or later"
    print("song : " + str(row['song_clean']) + ", Artist: " + str(row['artist_clean'])
          + ", was released " + era + ".")

#### 8.B Using the `.apply()` function, apply the function you wrote to the first four rows of the DataFrame.

You will need to tell the `apply` function to operate row by row. Setting the keyword argument `axis=1` indicates that the function should be applied to each row individually.

In [18]:
# A:
rockdata.head(4).apply(printdata, axis=1)

song : Caught Up in You, Artist: .38 Special, was released in 1970 or later.
song : Fantasy Girl, Artist: .38 Special, was released before 1970.
song : Hold On Loosely, Artist: .38 Special, was released in 1970 or later.
song : Rockin' Into the Night, Artist: .38 Special, was released in 1970 or later.

0    None
1    None
2    None
3    None
dtype: object

You'll notice that there will be a final output Series of `None` values. The `.apply()` function, if a return value is not specified, will return a Series of `None` values (similar to how the default return for Python functions is `None` when a return statement is not specified).
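
By contrast, when the applied function returns a value, `.apply()` collects those values into the resulting Series. A minimal sketch, assuming a hypothetical two-row frame and an illustrative helper named `released_before_1970`:

```python
import pandas as pd

toy = pd.DataFrame({
    "song_clean": ["Revolution", "Back In Black"],
    "release_year": [1968, 1980],
})

def released_before_1970(row):
    # Returning a value (instead of printing) gives .apply() something to collect.
    return row["release_year"] < 1970

flags = toy.apply(released_before_1970, axis=1)
print(flags.tolist())
```

The returned Series can then be stored as a new column or used as a row mask.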

### 9. Write a function that converts cells in a DataFrame to float and otherwise replaces them with `np.nan`.

If applied to our data, it would keep only the numeric information and otherwise input null values.

Recall that the try-except syntax in Python is a great way to try something and take another action if the initial step fails:

```python
try:
    # Perform some action.
    ...
except Exception:
    # Perform some other action if the first failed with an error.
    ...
```

#### 9.A Write the function that takes a column and converts all of its values to float if possible and `np.nan` otherwise. The return value should be the converted Series.

In [19]:
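# A:
def convertfun(cell):
    # Convert a single cell to float; return NaN if the conversion fails.
    try:
        return float(cell)
    except (ValueError, TypeError):
        return np.nan


def tofloat(col):
    # Apply the cell-level conversion to every value in the column (a Series).
    return col.map(convertfun)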

In [32]:
# A:
rockdata.apply(tofloat).head()


Unnamed: 0,song_clean,artist_clean,release_year,combined,first,year,play_count,fg
0,,,1982.0,,1.0,1.0,82.0,82.0
1,,,0.0,,1.0,0.0,3.0,0.0
2,,,1981.0,,1.0,1.0,85.0,85.0
3,,,1980.0,,1.0,1.0,18.0,18.0
4,,,1975.0,,1.0,1.0,1.0,1.0


#### 9.C Describe the new float-only DataFrame.

In [33]:
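# A: Note that this describes `rockdata` itself. The converted frame from
# `rockdata.apply(tofloat)` was not saved, so only the numeric columns appear.
rockdata.describe()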

Unnamed: 0,release_year,first,year,play_count,fg
count,2230.0,2230.0,2230.0,2230.0,2230.0
mean,1465.33139,1.0,0.741256,16.872646,15.04843
std,867.196161,0.0,0.438043,25.302972,25.288366
min,0.0,1.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,1.0,0.0
50%,1973.0,1.0,1.0,4.0,3.0
75%,1981.0,1.0,1.0,21.0,18.0
max,2014.0,1.0,1.0,142.0,142.0


### 10. What are the top 20 most popular songs by plays?

In [66]:
# A: Group the songs by play count and print each group in ascending order
# of play count. Note that this lists every group, not only the top 20.
most_popular = rockdata.groupby('play_count', sort=True)
print(most_popular)
for name, group in most_popular:
    print(name)
    print(group)

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x107b73240>
0
                    song_clean  artist_clean  release_year  \
494  Hey, Hey (What Can I Do?)  Led Zeppelin          1970   
579                      Layla  Eric Clapton             0   

                                      combined  first  year  play_count  fg  
494  Hey, Hey (What Can I Do?) by Led Zeppelin      1     1           0   0  
579                      Layla by Eric Clapton      1     0           0   0  
1
                               song_clean          artist_clean  release_year  \
4                       Art For Arts Sake                  10cc          1975   
6                                   Loser          3 Doors Down          2000   
9                              Take On Me                  a-ha          1985   
10                  Baby, Please Don't Go                 AC/DC             0   
17                         Hard As A Rock                 AC/DC          1995   
22                  

                               song_clean                  artist_clean  \
7                           When I'm Gone                  3 Doors Down   
12                                Big Gun                         AC/DC   
73                        You Oughta Know             Alanis Morissette   
97                                Melissa          Allman Brothers Band   
104                         Whippin' Post          Allman Brothers Band   
150                        Cradle of Love                    Billy Idol   
225                    Feel Like a Number                     Bob Seger   
394                              Carry On  Crosby, Stills, Nash & Young   
395                                  Ohio  Crosby, Stills, Nash & Young   
482                          Sunset Grill                    Don Henley   
582                            Pretending                  Eric Clapton   
624                                  Tusk                 Fleetwood Mac   
625                      

34
                                 song_clean          artist_clean  \
33                            Thunderstruck                 AC/DC   
52                        Janie's Got A Gun             Aerosmith   
101                            Ramblin' Man  Allman Brothers Band   
190                              Sweet Leaf         Black Sabbath   
550   Saturday Night's Alright for Fighting            Elton John   
651                Long, Long Way From Home             Foreigner   
931                             Pink Houses       John Mellencamp   
1116                          Enter Sandman             Metallica   
2113                      (Oh) Pretty Woman             Van Halen   
2196              I've Seen All Good People                   Yes   
2211                       Cheap Sunglasses                ZZ Top   

      release_year                                           combined  first  \
33            1990                             Thunderstruck by AC/DC      1   
52      

2101  Sunday Bloody Sunday by U2      1     1          54  54  
55
                      song_clean         artist_clean  release_year  \
934                   Small Town      John Mellencamp          1985   
1416  Rock And Roll, Hoochie Koo       Rick Derringer          1974   
1809                  Revolution          The Beatles          1968   
1891          Long Train Runnin'  The Doobie Brothers          1973   
2159             Brown Eyed Girl         Van Morrison          1967   

                                          combined  first  year  play_count  \
934                  Small Town by John Mellencamp      1     1          55   
1416  Rock And Roll, Hoochie Koo by Rick Derringer      1     1          55   
1809                     Revolution by The Beatles      1     1          55   
1891     Long Train Runnin' by The Doobie Brothers      1     1          55   
2159               Brown Eyed Girl by Van Morrison      1     1          55   

      fg  
934   55  
1416  55 

2006             Behind Blue Eyes by The Who      1     1          75  75  
76
             song_clean                   artist_clean  release_year  \
1557    No One Like You                      Scorpions          1982   
1689   The Logical Song                     Supertramp          1979   
2056  I Won't Back Down  Tom Petty & The Heartbreakers          1989   

                                               combined  first  year  \
1557                       No One Like You by Scorpions      1     1   
1689                     The Logical Song by Supertramp      1     1   
2056  I Won't Back Down by Tom Petty & The Heartbrea...      1     1   

      play_count  fg  
1557          76  76  
1689          76  76  
2056          76  76  
77
               song_clean   artist_clean  release_year  \
606       Go Your Own Way  Fleetwood Mac          1977   
631             Slow Ride         Foghat          1975   
1520  The Spirit of Radio           Rush          1980   

               

2135          96  96  
97
         song_clean                   artist_clean  release_year  \
11    Back In Black                          AC/DC          1980   
660   All Right Now                           Free          1970   
2062        Refugee  Tom Petty & The Heartbreakers          1979   

                                      combined  first  year  play_count  fg  
11                      Back In Black by AC/DC      1     1          97  97  
660                      All Right Now by Free      1     1          97  97  
2062  Refugee by Tom Petty & The Heartbreakers      1     1          97  97  
99
                  song_clean         artist_clean  release_year  \
203  (Don't Fear) The Reaper     Blue Oyster Cult          1976   
457                    Layla  Derek & The Dominos          1970   

                                        combined  first  year  play_count  fg  
203  (Don't Fear) The Reaper by Blue Oyster Cult      1     1          99  99  
457                 Layl
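
A more direct route to a "top N by plays" answer is `DataFrame.nlargest`. The sketch below uses a hypothetical four-row frame and `n=2` to keep the output short; on the full data, `n=20` and the `play_count` column would be the assumed equivalents.

```python
import pandas as pd

toy = pd.DataFrame({
    "song_clean": ["Back In Black", "Small Town", "Slow Ride", "Loser"],
    "play_count": [97, 55, 77, 1],
})

# nlargest(n, column) orders by the column, largest first, and keeps n rows.
top_two = toy.nlargest(2, "play_count")
print(top_two)
```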

### 11. Which years have the most plays?

In [23]:
# A: 
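
One possible approach, sketched on a hypothetical toy frame rather than the real data: treat "most plays" as the total `play_count` per `release_year`, and exclude the 0 placeholder used for missing years. Both choices are assumptions about how to read the question.

```python
import pandas as pd

toy = pd.DataFrame({
    "release_year": [1980, 1980, 1975, 0],
    "play_count": [97, 18, 77, 3],
})

# Drop the placeholder year 0 so missing years do not form their own group.
known = toy[toy["release_year"] > 0]

# Total plays per year, largest first.
plays_by_year = (
    known.groupby("release_year")["play_count"]
    .sum()
    .sort_values(ascending=False)
)
print(plays_by_year)
```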

### 12. Which records don't have matching "Play Count" corresponding to "F*G"?

In [24]:
# A:
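
A possible sketch for this question compares the two columns element by element and keeps the rows where they differ. The toy rows below mirror the first three rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "song_clean": ["Caught Up in You", "Fantasy Girl", "Hold On Loosely"],
    "play_count": [82, 3, 85],
    "fg": [82, 0, 85],
})

# The comparison yields a boolean mask; True marks rows where the counts differ.
mismatched = toy[toy["play_count"] != toy["fg"]]
print(mismatched)
```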

### Bonus: Which artists have the most missing values between each of the variables? 

In [25]:
# A:
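
As a rough sketch, missing values can be counted per artist by grouping the null mask on the artist column. This assumes the nulls are still present (that is, before they were replaced with 0 in step 4); the toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "artist_clean": [".38 Special", ".38 Special", "AC/DC", "AC/DC", "10cc"],
    "release_year": [1982, np.nan, np.nan, np.nan, 1975],
})

# Build the null mask for every column except the grouping key, sum it per
# artist, then add across columns to get one total per artist.
missing_by_artist = (
    toy.drop(columns="artist_clean")
    .isnull()
    .groupby(toy["artist_clean"])
    .sum()
    .sum(axis=1)
    .sort_values(ascending=False)
)
print(missing_by_artist)
```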