
## Lab: Cleaning Rock Song Data

_Authors: Dave Yerrington (SF)_

---


In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns

%matplotlib inline

### 1. Load `rock.csv` and do an initial examination of its data columns.

In [2]:
rockfile = "datasets/rock.csv"

In [3]:
# Load the data.
data = pd.read_csv(rockfile)

data.head(20)


Unnamed: 0,Song,Artist,Release Year,Combined,First?,Year?,PlayCount,F*G
0,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
1,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1
2,Kryptonite,3 Doors Down,2000.0,Kryptonite by 3 Doors Down,1,1,13,13
3,Loser,3 Doors Down,2000.0,Loser by 3 Doors Down,1,1,1,1
4,When I'm Gone,3 Doors Down,2002.0,When I'm Gone by 3 Doors Down,1,1,6,6
5,What's Up?,4 Non Blondes,1992.0,What's Up? by 4 Non Blondes,1,1,3,3
6,Take On Me,a-ha,1985.0,Take On Me by a-ha,1,1,1,1
7,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
8,Back In Black,AC/DC,1980.0,Back In Black by AC/DC,1,1,97,97
9,Big Gun,AC/DC,1993.0,Big Gun by AC/DC,1,1,6,6


In [4]:
# Look at the information regarding its columns.
data.keys()
data.isnull().sum()

Song               0
Artist             0
Release Year     533
Combined           0
First?             0
Year?              0
PlayCount          0
F*G                0
dtype: int64

### 2.  Clean up the column names.

Let's clean up the column names. There are two ways we can accomplish this:

#### 2.A Change the column names when you import the data using `pd.read_csv()`.

When you pass `names=` a list of strings whose length matches the number of columns, `read_csv()` uses those strings as the column names.

NOTE: The first row of the `.csv` file is already a header row, so when you supply custom column names you must tell `pandas` to skip it. Otherwise, the old header is read in as a row of data. The `skiprows=1` keyword argument to `read_csv()` tells `pandas` to skip the first row.

In [5]:
# Change the column names when loading the '.csv':
columnNames = ['SongName', 'Artist', 'Release Year', 'Combined', 'First?', 'year?', 'Playcount', 'fg']
df = pd.read_csv(rockfile, names=columnNames, skiprows=1)
df.head()


Unnamed: 0,SongName,Artist,Release Year,Combined,First?,year?,Playcount,fg
0,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
1,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
2,Kryptonite,3 Doors Down,2000,Kryptonite by 3 Doors Down,1,1,13,13
3,Loser,3 Doors Down,2000,Loser by 3 Doors Down,1,1,1,1
4,When I'm Gone,3 Doors Down,2002,When I'm Gone by 3 Doors Down,1,1,6,6
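
As an alternative to `skiprows=1`, `read_csv()` also accepts `header=0` together with `names=`, which declares row 0 to be a header that the supplied names replace. Here is a minimal sketch on a small in-memory CSV. The sample rows and the names `csv_text` and `new_names` are illustrative and not part of the lab data.

```python
import io
import pandas as pd

# Tiny stand-in for rock.csv (illustrative rows only).
csv_text = (
    "Song Clean,ARTIST CLEAN,Release Year\n"
    "Kryptonite,3 Doors Down,2000\n"
    "Loser,3 Doors Down,2000\n"
)
new_names = ["song", "artist", "release"]

# Option 1: skip the header row and supply the names.
df_skip = pd.read_csv(io.StringIO(csv_text), names=new_names, skiprows=1)

# Option 2: declare row 0 as the header and replace it with the names.
df_header = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

# Both approaches should produce the same DataFrame.
print(df_skip.equals(df_header))
```

Either form works. `header=0` states the intent (replace the header) a little more directly than skipping a row.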


#### 2.B Change column names using the `.rename()` function.

The `.rename()` function takes an argument, `columns=name_dict`, in which `name_dict` is a dictionary containing the original column names as keys and the new column names as values.

In [6]:
# Change the column names using the `.rename()` function.
df = pd.read_csv(rockfile)

rename_columns = {
    # Original column: renamed column
    'Song Clean':    'song',
    'ARTIST CLEAN':  'artist',
    'Release Year':  'release',
    'COMBINED':      'song_artist',
    'First?':        'first',
    'Year?':         'year',
    'PlayCount':     'playcount',
    'F*G':           'fg'
}
df.rename(columns=rename_columns, inplace=True)
df.head()


Unnamed: 0,Song,Artist,Release Year,Combined,first,year,playcount,fg
0,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
1,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
2,Kryptonite,3 Doors Down,2000,Kryptonite by 3 Doors Down,1,1,13,13
3,Loser,3 Doors Down,2000,Loser by 3 Doors Down,1,1,1,1
4,When I'm Gone,3 Doors Down,2002,When I'm Gone by 3 Doors Down,1,1,6,6
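
By default, `.rename()` silently ignores dictionary keys that do not match an existing column, which can leave some columns unrenamed without any warning. The sketch below uses a hypothetical two-column DataFrame named `toy` and assumes a reasonably recent `pandas`, since the `errors=` keyword is not available in very old releases.

```python
import pandas as pd

toy = pd.DataFrame({"Song": ["Loser"], "PlayCount": [1]})

# 'Song Clean' is not a column of `toy`, so that entry is silently skipped.
mapping = {"Song Clean": "song", "PlayCount": "playcount"}
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))

# errors="raise" turns the silent skip into a KeyError.
try:
    toy.rename(columns=mapping, errors="raise")
    missing_key_detected = False
except KeyError:
    missing_key_detected = True
print(missing_key_detected)
```

If a rename appears to do nothing, comparing the dictionary keys against `df.columns` is usually the quickest check.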


#### 2.C Reassigning the `.columns` attribute of a DataFrame.

You can also just reassign the `.columns` attribute to a list of strings containing the new column names. 

The only caveat with reassigning `.columns` is that you have to reassign all of the column names at once. You can't partially replace a value by working on `.columns` directly. You have to reassign `.columns` with a list of equal length. 

In [7]:
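# Replace the column names by reassigning the `.columns` attribute.
df = pd.read_csv(rockfile)

rename_columns = {
    # Original column: renamed column
    'Song Clean':    'song',
    'ARTIST CLEAN':  'artist',
    'Release Year':  'release',
    'COMBINED':      'song_artist',
    '??First?':      'first',
    '??Year?':       'year',
    'PlayCount':     'playcount',
    'F*G':           'fg'
}
# Caution: iterating over a dictionary yields its keys, so this assignment
# sets the column names to the dictionary's keys (the original names), not
# its values. To apply the new names, assign a list of the values instead.
df.columns = rename_columns
df.head(12)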

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,??First?,??Year?,PlayCount,F*G
0,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
1,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1
2,Kryptonite,3 Doors Down,2000.0,Kryptonite by 3 Doors Down,1,1,13,13
3,Loser,3 Doors Down,2000.0,Loser by 3 Doors Down,1,1,1,1
4,When I'm Gone,3 Doors Down,2002.0,When I'm Gone by 3 Doors Down,1,1,6,6
5,What's Up?,4 Non Blondes,1992.0,What's Up? by 4 Non Blondes,1,1,3,3
6,Take On Me,a-ha,1985.0,Take On Me by a-ha,1,1,1,1
7,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
8,Back In Black,AC/DC,1980.0,Back In Black by AC/DC,1,1,97,97
9,Big Gun,AC/DC,1993.0,Big Gun by AC/DC,1,1,6,6
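
Here is a minimal sketch of a full reassignment with a list. It uses a hypothetical two-column DataFrame named `toy`, and the list is built from a mapping so that its order follows the existing columns.

```python
import pandas as pd

toy = pd.DataFrame({"Song Clean": ["Loser"], "F*G": [1]})
mapping = {"Song Clean": "song", "F*G": "fg"}

# Build the full list of new names in the current column order.
toy.columns = [mapping[name] for name in toy.columns]
new_columns = list(toy.columns)
print(new_columns)

# A list of the wrong length is rejected with a ValueError.
try:
    toy.columns = ["song"]
    length_checked = False
except ValueError:
    length_checked = True
print(length_checked)
```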


### 3. Subsetting data where null values exist.

We have mixed `str` and `NaN` values in the `release` column. `NaN` stands for "not a number" and is the way `pandas` handles "nulls" or nonexistent data. We can use the `.isnull()` method of a Series to find null values.

Print the first rows of the subset of the data in which the `release` column is null.

In [13]:
# Show records where df['Release Year'] is null.
print(df.isnull().sum())

null_release_mask = df['Release Year'].isnull()
df[null_release_mask].head(10)



Song Clean        0
ARTIST CLEAN      0
Release Year    533
COMBINED          0
??First?          0
??Year?           0
PlayCount         0
F*G               0
dtype: int64


Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,??First?,??Year?,PlayCount,F*G
7,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
10,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
12,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
16,Live Wire,AC/DC,,Live Wire by AC/DC,1,0,2,0
17,Moneytalks,AC/DC,,Moneytalks by AC/DC,1,0,20,0
19,Shoot To Thrill,AC/DC,,Shoot To Thrill by AC/DC,1,0,45,0
21,Sin City,AC/DC,,Sin City by AC/DC,1,0,1,0
24,What Do You Do For Money Honey,AC/DC,,What Do You Do For Money Honey by AC/DC,1,0,2,0
30,"Baby, Please Don't Go",Aerosmith,,"Baby, Please Don't Go by Aerosmith",1,0,1,0
33,Come Together,Aerosmith,,Come Together by Aerosmith,1,0,65,0


### 4. Update slices of your DataFrame based on mask selection/slices.

In many scenarios, we want to update values in our DataFrame according to some criteria. Let's say we want to set all of the null values in `release` to 0.

To modify the original DataFrame, use `.loc` with a row mask and a column label in a single indexing operation.

For example, the following chained indexing won't always work, because `df[row_mask]` may return a copy, in which case the assignment never reaches `df`:
```python
df[row_mask]['column_name'] = new_value
```

The best way to accomplish the same task is:
```python
df.loc[row_mask, 'column_name'] = new_value
```

For multiple column assignment, you would use:
```python
df.loc[row_mask, ['col_1', 'col_2', 'col_3']] = new_value
```
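
As a concrete sketch of both forms, here is a hypothetical two-row DataFrame named `toy` with one missing release year:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "song": ["Back In Black", "Sin City"],
    "release": [1980.0, np.nan],
    "year": [1, 0],
})
row_mask = toy["release"].isnull()

# Single column: set missing release years to 0.
toy.loc[row_mask, "release"] = 0

# Several columns at once: the scalar is broadcast to every selected cell.
toy.loc[row_mask, ["release", "year"]] = 0

print(toy)
```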

#### 4.A Let's try it out. Make all of the null values in `release` 0.

In [71]:
# Replace release nulls with 0.
null_release_mask = df['Release Year'].isnull()
df.loc[null_release_mask, 'Release Year'] = 0

# Display the rows that were updated.
df[df['Release Year'] == 0]

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,??First?,??Year?,PlayCount,F*G
7,"Baby, Please Don't Go",AC/DC,0.0,"Baby, Please Don't Go by AC/DC",1,0,1,0
10,CAN'T STOP ROCK'N'ROLL,AC/DC,0.0,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
12,Girls Got Rhythm,AC/DC,0.0,Girls Got Rhythm by AC/DC,1,0,24,0
16,Live Wire,AC/DC,0.0,Live Wire by AC/DC,1,0,2,0
17,Moneytalks,AC/DC,0.0,Moneytalks by AC/DC,1,0,20,0
19,Shoot To Thrill,AC/DC,0.0,Shoot To Thrill by AC/DC,1,0,45,0
21,Sin City,AC/DC,0.0,Sin City by AC/DC,1,0,1,0
24,What Do You Do For Money Honey,AC/DC,0.0,What Do You Do For Money Honey by AC/DC,1,0,2,0
30,"Baby, Please Don't Go",Aerosmith,0.0,"Baby, Please Don't Go by Aerosmith",1,0,1,0
33,Come Together,Aerosmith,0.0,Come Together by Aerosmith,1,0,65,0


#### 4.B Verify that `release` contains no null values.

In [72]:
# A:
print(df.isnull().sum())

Song Clean      0
ARTIST CLEAN    0
Release Year    0
COMBINED        0
??First?        0
??Year?         0
PlayCount       0
F*G             0
dtype: int64


### 5. Ensure that the data types of the columns make sense. 

Verifying column data types is a critical part of data munging. If columns have the wrong data type, then there is usually corrupted or incorrect data in some of the observations.
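
To see why a wrong data type hints at bad observations, consider this small sketch. The two Series are hypothetical; the stray string `"SONGFACTS.COM"` is used only as an example of a non-numeric entry.

```python
import pandas as pd

clean = pd.Series([1981, 1975, 2000])
dirty = pd.Series([1981, "SONGFACTS.COM", 2000])

# A single non-numeric value forces the whole column to the generic object dtype.
print(pd.api.types.is_integer_dtype(clean))
print(dirty.dtype)
```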

#### 5.A Look at the data types for the columns. Are any incorrect given what the data represents?

In [73]:
# A:
df.info()

df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 8 columns):
Song Clean      2087 non-null object
ARTIST CLEAN    2087 non-null object
Release Year    2087 non-null float64
COMBINED        2087 non-null object
??First?        2087 non-null int64
??Year?         2087 non-null int64
PlayCount       2087 non-null int64
F*G             2087 non-null int64
dtypes: float64(1), int64(4), object(3)
memory usage: 130.5+ KB


Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,??First?,??Year?,PlayCount,F*G
0,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
1,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1
2,Kryptonite,3 Doors Down,2000.0,Kryptonite by 3 Doors Down,1,1,13,13
3,Loser,3 Doors Down,2000.0,Loser by 3 Doors Down,1,1,1,1
4,When I'm Gone,3 Doors Down,2002.0,When I'm Gone by 3 Doors Down,1,1,6,6


### 6. Investigate and clean up the `release` column.

The `release` column is a string data type when it should be an integer.

#### 6.A Figure out what value(s) are causing the `release` column to be encoded as a string instead of an integer.

In [74]:
# A:
df['Release Year'].unique()
# Note: attribute access (`df.release`) fails here because the column is still
# named 'Release Year'; names containing spaces require bracket notation.

array([1981., 1975., 2000., 2002., 1992., 1985.,    0., 1980., 1993.,
       1984., 1977., 1979., 1990., 1986., 1974., 2014., 1987., 1976.,
       1973., 2001., 1989., 1997., 1995., 1971., 1972., 1994., 1970.,
       1966., 1965., 1982., 1983., 1955., 1978., 1969., 1999., 1968.,
       1988., 1962., 2007., 1967., 1958., 1071., 1996., 1991., 2005.,
       2011., 2004., 2012., 2003., 1998., 2008., 1964., 2013., 2006.,
       1961., 1963.])
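
When a column has too many unique values to scan by eye, one way to isolate the offenders is to coerce the column with `pd.to_numeric()` and keep the entries that were present but failed to parse. This is a sketch on a hypothetical Series named `release`; its values are illustrative.

```python
import numpy as np
import pandas as pd

release = pd.Series(["1981", "SONGFACTS.COM", np.nan, "1975"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_number = pd.to_numeric(release, errors="coerce")

# Present in the raw data but not parseable as a number.
offenders = release[release.notnull() & as_number.isnull()]
print(offenders.unique())
```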

#### 6.B Look at the rows in which there is incorrect data in the `release` column.

In [76]:
# A:
relase_mask = df['Release Year'] == "SONGFACTS.COM"
df[relase_mask]

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,??First?,??Year?,PlayCount,F*G


#### 6.C Clean up the data.

Normally we would replace the offending data with null (`np.nan`) values. However, we previously converted all of the `NaN` values in the `release` column to zeros, so we might as well continue with the same practice. Replacing the offending values with 0 (or `NaN`) allows us to convert the column to a numeric type.

In [77]:
# A:
df.info()
df.loc[relase_mask, 'Release Year'] = np.nan
df['Release Year'] = df['Release Year'].astype(float)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 8 columns):
Song Clean      2087 non-null object
ARTIST CLEAN    2087 non-null object
Release Year    2087 non-null float64
COMBINED        2087 non-null object
??First?        2087 non-null int64
??Year?         2087 non-null int64
PlayCount       2087 non-null int64
F*G             2087 non-null int64
dtypes: float64(1), int64(4), object(3)
memory usage: 130.5+ KB
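
If the goal is an integer column, a vectorized route is to coerce, fill, and cast in one chain. This is a sketch on a hypothetical Series named `release`; using 0 as the placeholder follows the choice made earlier in this lab and is not a general recommendation.

```python
import pandas as pd

release = pd.Series(["1981", "SONGFACTS.COM", None, "1975"])

# Unparseable and missing entries both become NaN, then 0, then the
# whole column is cast to integers.
release_int = pd.to_numeric(release, errors="coerce").fillna(0).astype(int)
print(release_int.tolist())
```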


In [69]:
# Inspect the rows whose release year is 0. These were originally null, so
# they still need to be dropped or replaced before any analysis of the years.
zero_release_mask = df['Release Year'] == 0
df[zero_release_mask]


Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,??First?,??Year?,PlayCount,F*G
7,"Baby, Please Don't Go",AC/DC,0.0,"Baby, Please Don't Go by AC/DC",1,0,1,0
10,CAN'T STOP ROCK'N'ROLL,AC/DC,0.0,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
12,Girls Got Rhythm,AC/DC,0.0,Girls Got Rhythm by AC/DC,1,0,24,0
16,Live Wire,AC/DC,0.0,Live Wire by AC/DC,1,0,2,0
17,Moneytalks,AC/DC,0.0,Moneytalks by AC/DC,1,0,20,0
19,Shoot To Thrill,AC/DC,0.0,Shoot To Thrill by AC/DC,1,0,45,0
21,Sin City,AC/DC,0.0,Sin City by AC/DC,1,0,1,0
24,What Do You Do For Money Honey,AC/DC,0.0,What Do You Do For Money Honey by AC/DC,1,0,2,0
30,"Baby, Please Don't Go",Aerosmith,0.0,"Baby, Please Don't Go by Aerosmith",1,0,1,0
33,Come Together,Aerosmith,0.0,Come Together by Aerosmith,1,0,65,0


### 7. Get summary statistics for the `release` column using the `.describe()` function.

Now that the `release` column is finally a numeric data type, we can apply the `.describe()` function.  

#### 7.A Print out the summary stats for the `release` column. What are the earliest and latest release dates?

In [62]:
# A:
#df.info()
df['Release Year'].describe()

count    2086.000000
mean     1472.519655
std       863.132348
min         0.000000
25%         0.000000
50%      1973.000000
75%      1981.000000
max      2014.000000
Name: Release Year, dtype: float64

#### 7.B Based on the summary statistics, is there anything else wrong with the `release` column? 

In [83]:
# A:
# Many release years are 0 (the 25th percentile is 0), and one is 1071.
df['Release Year'].unique()
df[df['Release Year'] == 0]
df[df['Release Year'] == 1071]


Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,??First?,??Year?,PlayCount,F*G
501,Levon,Elton John,1071.0,Levon by Elton John,1,1,8,8


_Looking at the DataFrame that contains the year 1071, we can see that the year was probably corrupted and should be replaced with something else if possible._
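
When the true value is unknown, one option is to mark implausible years as missing so they no longer distort the summary statistics. In the sketch below, `toy` is a hypothetical DataFrame, and the lower bound `EARLIEST` is an assumption chosen for illustration; the upper bound matches the latest year seen in the data above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "song": ["Levon", "Big Gun", "Sin City"],
    "release": [1071.0, 1993.0, 0.0],
})

# Hypothetical plausibility window for release years; adjust as needed.
EARLIEST, LATEST = 1900, 2014

# Anything outside the window (including the 0 placeholders) becomes NaN.
implausible = ~toy["release"].between(EARLIEST, LATEST)
toy.loc[implausible, "release"] = np.nan

print(toy["release"].describe())
```

`.describe()` skips `NaN` values, so the minimum and the quartiles then reflect only the plausible years.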

### 8. Make changes and investigate using custom functions with `.apply()`.

Let's say we want to traverse every single row in our data set and apply a function to that row.

#### 8.A Write a function that will take a row of a DataFrame and print out the song, artist, and whether or not the release date is < 1970.


In [110]:
# Reload the data to start from a clean state.
columnName = ['Song', 'Artist', 'Release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=columnName, skiprows=1)
df.head()



Unnamed: 0,Song,Artist,Release,song_artist,first,year,playcount,fg
0,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
1,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
2,Kryptonite,3 Doors Down,2000,Kryptonite by 3 Doors Down,1,1,13,13
3,Loser,3 Doors Down,2000,Loser by 3 Doors Down,1,1,1,1
4,When I'm Gone,3 Doors Down,2002,When I'm Gone by 3 Doors Down,1,1,6,6


In [117]:
# A:
def release(row):
    # Print the song, artist, and release year, and whether the year is before 1970.
    print('_______________________________')
    print(row['Song'], row['Artist'], row['Release'], '< 1970?', int(row['Release']) < 1970)


#### 8.B Using the `.apply()` function, apply the function you wrote to the first five rows of the DataFrame.

You will need to tell the `apply` function to operate row by row. Setting the keyword argument as `axis=1` indicates that the function should be applied to each row individually.

In [120]:
# A:
df.head().apply(release, axis=1)

_______________________________
Hold On Loosely .38 Special 1981 < 1970? False
_______________________________
Art For Arts Sake 10cc 1975 < 1970? False
_______________________________
Kryptonite 3 Doors Down 2000 < 1970? False
_______________________________
Loser 3 Doors Down 2000 < 1970? False
_______________________________
When I'm Gone 3 Doors Down 2002 < 1970? False


0    None
1    None
2    None
3    None
4    None
dtype: object

Notice the final output: a Series of `None` values. When the applied function has no `return` statement, `.apply()` collects the default return value, `None`, for every row (just as any Python function returns `None` when no return statement is given).
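
If the function returns a value instead of printing it, `.apply()` collects those values into a Series that can be stored as a new column. This is a small sketch on a hypothetical DataFrame named `toy`; the function name `released_before_1970` and the column name `pre_1970` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Song": ["Hold On Loosely", "Yesterday"],
    "Release": [1981, 1965],
})

def released_before_1970(row):
    # Return the comparison result instead of printing it.
    return row["Release"] < 1970

toy["pre_1970"] = toy.apply(released_before_1970, axis=1)
print(toy)
```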

### 9. Write a function that converts cells in a DataFrame to float and otherwise replaces them with `np.nan`.

Applied to our data, it keeps the numeric values and replaces everything else with nulls.

Recall that the try-except syntax in Python is a great way to try something and take another action if the initial step fails:

```python
try:
    ...  # Perform some action.
except Exception:
    ...  # Perform some other action if the first failed with an error.
```
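
As a concrete sketch of the pattern, the hypothetical helper `to_float_or_none` below catches the two exceptions that `float()` typically raises on bad input:

```python
def to_float_or_none(value):
    """Return float(value), or None when the conversion fails."""
    try:
        return float(value)
    except (ValueError, TypeError):
        # ValueError: text that is not a number, e.g. "SONGFACTS.COM".
        # TypeError: objects float() cannot handle, e.g. None.
        return None

results = [to_float_or_none(v) for v in ["1981", "SONGFACTS.COM", None]]
print(results)  # [1981.0, None, None]
```

Naming the expected exceptions, rather than using a bare `except:`, keeps unrelated errors visible.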

#### 9.A Write the function that takes a column and converts all of its values to float if possible and `np.nan` otherwise. The return value should be the converted Series.

In [131]:
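# A:
df.info()

def value_converter(value):
    # Convert a single value to float, or return NaN if that is not possible.
    try:
        return float(value)
    except (ValueError, TypeError):
        return np.nan

def column_converter_to_float(column):
    # Apply the conversion to every value and return the converted Series.
    return column.map(value_converter)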

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 8 columns):
Song           2087 non-null object
Artist         2087 non-null object
Release        1554 non-null object
song_artist    2087 non-null object
first          2087 non-null int64
year           2087 non-null int64
playcount      2087 non-null int64
fg             2087 non-null int64
dtypes: int64(4), object(4)
memory usage: 130.5+ KB


#### 9.B Try your function out on the rock song data and ensure the output is what you expected.


In [138]:
# A:
df2 = df.apply(column_converter_to_float)
df2.head()

Unnamed: 0,Song,Artist,Release,song_artist,first,year,playcount,fg
0,,,1981.0,,1.0,1.0,85.0,85.0
1,,,1975.0,,1.0,1.0,1.0,1.0
2,,,2000.0,,1.0,1.0,13.0,13.0
3,,,2000.0,,1.0,1.0,1.0,1.0
4,,,2002.0,,1.0,1.0,6.0,6.0


#### 9.C Describe the new float-only DataFrame.

In [140]:
# A: 
df2.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Song,2.0,1012.0,1367.544515,45.0,528.5,1012.0,1495.5,1979.0
Artist,0.0,,,,,,,
Release,1553.0,1977.898261,24.875307,1071.0,1971.0,1977.0,1984.0,2014.0
song_artist,0.0,,,,,,,
first,2087.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
year,2087.0,0.744609,0.436185,0.0,0.0,1.0,1.0,1.0
playcount,2087.0,16.67609,25.021707,0.0,1.0,4.0,21.0,142.0
fg,2087.0,14.896981,25.049489,0.0,0.0,3.0,17.5,142.0
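
Columns that contain no numeric information at all end up with a count of 0 in the summary. One way to tidy this up is to drop columns that are entirely `NaN` before describing. This is a sketch on a hypothetical DataFrame named `converted`, standing in for the float-only result.

```python
import numpy as np
import pandas as pd

converted = pd.DataFrame({
    "Artist": [np.nan, np.nan],
    "Release": [1981.0, 1975.0],
    "playcount": [85.0, 1.0],
})

# Drop columns in which every value is NaN, then summarize the rest.
numeric_only = converted.dropna(axis=1, how="all")
summary = numeric_only.describe().T
print(summary)
```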
