<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Lab: Cleaning Rock Song Data

_Authors: Dave Yerrington (SF)_

---


### 1. Load `rock.csv` and do an initial examination of its data columns.

In [1]:
import pandas as pd
import numpy as np 

### 2.  Clean up the column names.

Let's clean up the column names. There are three ways we can accomplish this:

#### 2.A Change the column names when you import the data using `pd.read_csv()`.

When you pass `names=[...]` (a list of strings) and the list has as many entries as the file has columns, `pandas` uses those strings as the column names.

NOTE: When you create custom column names, the first row of the `.csv` already represents a header. It is important to tell `pandas` to skip that row. The `skiprows=1` keyword argument to `read_csv()` will tell `pandas` to skip the first row.
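As a minimal sketch of why the header row matters, the example below reads a tiny in-memory CSV three ways. The CSV text and its `old_*` header names are hypothetical stand-ins for `rock.csv`. It also shows `header=0`, which has the same effect as `skiprows=1` when `names` is supplied.

```python
import io
import pandas as pd

# Hypothetical stand-in for rock.csv: one header row plus two data rows.
csv_text = (
    "old_song,old_artist,old_year\n"
    "Caught Up in You,.38 Special,1982\n"
    "Fantasy Girl,.38 Special,\n"
)
new_names = ["Song", "Artist", "Release Year"]

# Without skipping, the original header is read as an ordinary data row.
df_wrong = pd.read_csv(io.StringIO(csv_text), names=new_names)

# Option 1: skip the original header row and supply our own names.
df_skip = pd.read_csv(io.StringIO(csv_text), names=new_names, skiprows=1)

# Option 2: declare row 0 as the header, then override it with `names`.
df_header = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

print(len(df_wrong), len(df_skip), len(df_header))
print(df_skip.columns.tolist())
```

In `df_wrong` the old header shows up as the first data row, which is the mistake that `skiprows=1` prevents.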

In [2]:
df = pd.read_csv("../../data/rock.csv",
                 names=["Song","Artist","Release Year","Full Title", "First","Year","Play Count","YearCount"], 
                 skiprows=1)
df

Unnamed: 0,Song,Artist,Release Year,Full Title,First,Year,Play Count,YearCount
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
...,...,...,...,...,...,...,...,...
2225,She Loves My Automobile,ZZ Top,,She Loves My Automobile by ZZ Top,1,0,1,0
2226,Tube Snake Boogie,ZZ Top,1981,Tube Snake Boogie by ZZ Top,1,1,32,32
2227,Tush,ZZ Top,1975,Tush by ZZ Top,1,1,109,109
2228,TV Dinners,ZZ Top,1983,TV Dinners by ZZ Top,1,1,1,1


#### 2.B Change column names using the `.rename()` function.

The `.rename()` function takes an argument, `columns=name_dict`, in which `name_dict` is a dictionary containing the original column names as keys and the new column names as values.

In [3]:
df=df.rename(columns= {
    "Song":"Song Name",
    "Artist":"Artist Name",
    "Release Year":"Release Year",
    "Full Title":"Title and Artist",
    "First":"First in Charts", 
    "Year":"Year Available",
    "Play Count":"Number of Plays",
    "YearCount":"Plays in Year"
})
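Unlike the other approaches, `.rename()` lets you change only some of the names. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical name) to show a partial rename, and that the original DataFrame is left untouched unless you reassign the result.

```python
import pandas as pd

# Hypothetical one-row frame used only for illustration.
toy = pd.DataFrame({"Song": ["Tush"], "Artist": ["ZZ Top"], "Year": [1975]})

# Only the keys present in the mapping are renamed; "Year" is left alone.
renamed = toy.rename(columns={"Song": "Song Name", "Artist": "Artist Name"})

# rename() returns a new DataFrame, so `toy` keeps its old column names.
print(toy.columns.tolist())
print(renamed.columns.tolist())
```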



#### 2.C Reassigning the `.columns` attribute of a DataFrame.

You can also just reassign the `.columns` attribute to a list of strings containing the new column names. 

The only caveat with reassigning `.columns` is that you have to reassign all of the column names at once. You can't partially replace a value by working on `.columns` directly. You have to reassign `.columns` with a list of equal length. 
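The length requirement can be sketched on a made-up three-column DataFrame (`toy` is a hypothetical name). A list of the wrong length raises a `ValueError`. To change a single name this way, one option is to copy the names into a list, edit the list, and assign the whole list back.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Reassigning all three names at once works.
toy.columns = ["x", "y", "z"]

# A list of the wrong length is rejected.
try:
    toy.columns = ["only", "two"]
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)

# To change one name, edit a copy of the full list and assign it back.
names = toy.columns.tolist()
names[0] = "first"
toy.columns = names
print(toy.columns.tolist())
```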

In [4]:
# your answer here

df.columns=("Song Names", "Artist Names","Year of Release",
           "Title&Artist","Top Charts","Year","# of Plays", "Plays in Year")


### 3. Subsetting data where null values exist.

The `release` column (named `Year of Release` after step 2.C) contains `NaN` values alongside its real entries. `NaN` stands for "not a number" and is how `pandas` represents nulls, or missing data. We can use the `.isnull()` method of a Series to find null values.

Print the head of the subset of the data in which the `release` column is null.

In [5]:
# your answer here

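# Build a boolean mask that is True wherever the release year is missing.
nulls = df['Year of Release'].isnull()

df[nulls].head()

# OR, without the intermediate variable (this shows every matching row):
df[df['Year of Release'].isnull()]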


Unnamed: 0,Song Names,Artist Names,Year of Release,Title&Artist,Top Charts,Year,# of Plays,Plays in Year
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
10,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
13,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
16,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
24,Let's Get It Up,AC/DC,,Let's Get It Up by AC/DC,1,0,4,0
...,...,...,...,...,...,...,...,...
2216,"I'm Bad, I'm Nationwide",ZZ Top,,"I'm Bad, I'm Nationwide by ZZ Top",1,0,10,0
2218,Just Got Paid,ZZ Top,,Just Got Paid by ZZ Top,1,0,2,0
2221,My Head's In Mississippi,ZZ Top,,My Head's In Mississippi by ZZ Top,1,0,1,0
2222,Party On The Patio,ZZ Top,,Party On The Patio by ZZ Top,1,0,14,0
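The masking pattern above can be sketched on a small made-up DataFrame (`toy` and its values are illustrative only). Because the mask is a boolean Series, summing it counts the nulls and `~` inverts it to select the rows that do have a value.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing release year.
toy = pd.DataFrame({
    "Song Names": ["Tush", "Fantasy Girl", "TV Dinners"],
    "Year of Release": [1975, np.nan, 1983],
})

null_mask = toy["Year of Release"].isnull()

# True counts as 1, so the sum is the number of missing values.
n_missing = int(null_mask.sum())

missing_rows = toy[null_mask]    # rows where the year is null
known_rows = toy[~null_mask]     # rows where the year is present

print(n_missing)
print(missing_rows["Song Names"].tolist())
```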


### 4. Update slices of your DataFrame based on mask selection/slices.

In many scenarios, we want to update values in our DataFrame according to some criteria. Let's say we wanted to set all of the null values in `release` to 0.

With newer versions of `pandas`, in order to manipulate data in the original DataFrame, we have to use `.loc` while performing reassignment using a mask and an index.

For example, the following chained indexing won't always work, because the assignment may land on a temporary copy and leave `df` unchanged:
```python
df[row_mask]['column_name'] = new_value
```

The best way to accomplish the same task is:
```python
df.loc[row_mask, 'column_name'] = new_value
```

For multiple column assignment, you would use:
```python
df.loc[row_mask, ['col_1', 'col_2', 'col_3']] = new_value
```
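Here is a small runnable sketch of the `.loc` pattern, using a made-up DataFrame (`toy`, its column names, and the replacement values are assumptions for illustration). It covers both the single-column and the multiple-column form.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
toy = pd.DataFrame({
    "artist": ["AC/DC", "ZZ Top", "AC/DC"],
    "plays": [5, 109, 24],
    "year_count": [0, 109, 0],
})

row_mask = toy["artist"] == "AC/DC"

# Single column: rows selected by the mask, column selected by name.
toy.loc[row_mask, "plays"] = 0

# Multiple columns: pass a list of column names.
toy.loc[~row_mask, ["plays", "year_count"]] = -1

print(toy)
```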

#### 4.A Let's try it out. Make all of the null values in `release` 0.

In [27]:
df.loc[df['Year of Release'].isnull(),'Year of Release']=0


#### 4.B Verify that `release` contains no null values.

In [28]:
df['Year of Release'].isnull().value_counts()

False    2230
Name: Year of Release, dtype: int64
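As an alternative sketch for 4.A and 4.B, `.fillna()` replaces nulls in one call, and `.isnull().sum()` or `.isnull().any()` gives a compact check. The Series below is made up for illustration; this is not the approach the lab uses above, only an equivalent one for this particular task.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing years.
years = pd.Series([1982.0, np.nan, 1981.0, np.nan], name="Year of Release")

# fillna returns a new Series with every null replaced by 0.
filled = years.fillna(0)

remaining_nulls = int(filled.isnull().sum())   # number of nulls left
has_nulls = bool(filled.isnull().any())        # True if any null is left

print(remaining_nulls, has_nulls)
```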

### 5. Ensure that the data types of the columns make sense. 

Verifying column data types is a critical part of data munging. If columns have the wrong data type, then there is usually corrupted or incorrect data in some of the observations.

#### 5.A Look at the data types for the columns. Are any incorrect given what the data represents?

In [29]:
df.dtypes

Song Names          object
Artist Names        object
Year of Release    float64
Title&Artist        object
Top Charts           int64
Year                 int64
# of Plays           int64
Plays in Year        int64
dtype: object

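_Only the `release` column (here `Year of Release`) looks wrong. A release year should be an integer, but the column is stored as `float64` in this run because the missing values forced a float type; if a non-numeric string were present, it would load as `object` instead._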
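If an integer year column is the goal, one possible route (an assumption on our part, not something this lab requires) is the nullable `Int64` dtype in `pandas`, which stores whole numbers while still allowing missing values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical year column: floats because of the missing value.
years = pd.Series([1982.0, np.nan, 1981.0])
print(years.dtype)

# The nullable integer dtype keeps the gap as <NA> instead of forcing floats.
as_nullable_int = years.astype("Int64")
print(as_nullable_int.dtype)
print(int(as_nullable_int.isna().sum()))
```

The capital "I" matters: plain `int64` cannot hold missing values, so casting a column that still contains `NaN` to it raises an error.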

### 6. Investigate and clean up the `release` column.

The `release` column is a string data type when it should be an integer.

#### 6.A Figure out what value(s) are causing the `release` column to be encoded as a string instead of an integer.

In [30]:
df['Year of Release'].unique()

array([1982.,    0., 1981., 1980., 1975., 2000., 2002., 1992., 1985.,
       1993., 1976., 1995., 1979., 1984., 1977., 1990., 1986., 1974.,
       2014., 1987., 1973., 2001., 1989., 1997., 1971., 1972., 1994.,
       1970., 1966., 1965., 1983., 1955., 1978., 1969., 1999., 1968.,
       1988., 1962., 2007., 1967., 1958., 1071., 1996., 1991., 2005.,
       2011., 2004., 2012., 2003., 1998., 2008., 1964., 2013., 2006.,
       1963., 1961.])

#### 6.B Look at the rows in which there is incorrect data in the `release` column.

In [31]:
# your answer here
df[df['Year of Release']=='SONGFACTS.COM']

Unnamed: 0,Song Names,Artist Names,Year of Release,Title&Artist,Top Charts,Year,# of Plays,Plays in Year


#### 6.C Clean up the data.

Normally we might replace the offending data with null (`np.nan`) values. However, we previously converted all of the null values in the `release` column to zeros, so we might as well continue with the same practice. Replacing the offending values with 0 (or `np.nan`) allows us to convert the column to a numeric type.

In [32]:
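# No rows match in this run, but any stray non-numeric marker is set to 0, like the nulls in 4.A.
df.loc[df['Year of Release']=='SONGFACTS.COM','Year of Release'] = 0
# Convert the whole column to a numeric (float) type.
df['Year of Release'] = df['Year of Release'].astype(float)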

**Note:** A year can also be treated as a descriptive label rather than a quantity, so it can make sense for a year column to be an `object`. However, as in this situation, converting the column to a numeric type is a great way of identifying improper values.
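One way to use numeric conversion as a detector is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN`. The sketch below runs on a made-up string column; the `SONGFACTS.COM` entry mirrors the value searched for in 6.B, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical raw column: years stored as text, one bad entry, one null.
raw = pd.Series(["1982", "SONGFACTS.COM", None, "1975"])

# Unparseable strings become NaN instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

# Offenders are values that were present but failed to convert.
offending = raw[converted.isnull() & raw.notnull()]

print(offending.tolist())
```

Comparing against `raw.notnull()` keeps genuinely missing entries out of the list of offenders.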

### 7. Get summary statistics for the `release` column using the `.describe()` function.

Now that the `release` column is finally a numeric data type, we can apply the `.describe()` function.  

#### 7.A Print out the summary stats for the `release` column. What are the earliest and latest release years?

In [35]:
df.describe()

Unnamed: 0,Year of Release,Top Charts,Year,# of Plays,Plays in Year
count,2230.0,2230.0,2230.0,2230.0,2230.0
mean,1465.33139,1.0,0.741256,16.872646,15.04843
std,867.196161,0.0,0.438043,25.302972,25.288366
min,0.0,1.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,1.0,0.0
50%,1973.0,1.0,1.0,4.0,3.0
75%,1981.0,1.0,1.0,21.0,18.0
max,2014.0,1.0,1.0,142.0,142.0


#### 7.B Based on the summary statistics, is there anything else wrong with the `release` column? 

In [36]:
# your answer here

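_The `.unique()` output in 6.A includes the year 1071, which was probably corrupted and should be replaced with something else if possible. The zeros we substituted for nulls also pull the minimum, the 25th percentile, and the mean (about 1465) far below any real release year._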
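A sketch of how placeholder zeros and an implausible year affect the summary, using a made-up Series. The cutoff of 1900 for an "implausible" year is our assumption, not something the data set defines.

```python
import pandas as pd

# Hypothetical years: two 0 placeholders and one corrupted value (1071).
years = pd.Series([1982.0, 0.0, 1981.0, 1071.0, 1975.0, 0.0])

# With the placeholders included, the minimum is just the placeholder.
all_stats = years.describe()

# Excluding the zeros shows the earliest value actually recorded.
real_stats = years[years > 0].describe()

# Flag non-placeholder years before an assumed plausibility cutoff of 1900.
suspicious = years[(years > 0) & (years < 1900)]

print(all_stats["min"], real_stats["min"])
print(suspicious.tolist())
```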

### 8. Make changes and investigate using custom functions with `.apply()`.

Let's say we want to traverse every single row in our data set and apply a function to that row.

#### 8.A Write a function that will take a row of a DataFrame and print out the song, artist, and whether or not the release date is < 1970.


In [37]:
def organized(row):
    # Print the song, the artist, and whether the release year is before 1970.
    print(row['Song Names'], row['Artist Names'], '<1970?', row["Year of Release"] < 1970)


#### 8.B Using the `.apply()` function, apply the function you wrote to the first four rows of the DataFrame.

You will need to tell the `apply` function to operate row by row. Setting the keyword argument as `axis=1` indicates that the function should be applied to each row individually.

In [38]:
df.head(4).apply(organized, axis=1)

Caught Up in You .38 Special <1970? False
Fantasy Girl .38 Special <1970? True
Hold On Loosely .38 Special <1970? False
Rockin' Into the Night .38 Special <1970? False


0    None
1    None
2    None
3    None
dtype: object

Notice the final output: a Series of `None` values. `.apply()` collects whatever the function returns for each row, and a Python function with no `return` statement returns `None`, so the result is a Series of `None` values.
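By contrast, a function that returns a value gives back a Series of those values. The sketch below uses a made-up DataFrame and a hypothetical function name, `released_before_1970`, and compares the result with the equivalent vectorized comparison.

```python
import pandas as pd

# Hypothetical frame; the 0 stands in for a missing year, as in step 4.
toy = pd.DataFrame({
    "Song Names": ["Caught Up in You", "Fantasy Girl", "Hypothetical Oldie"],
    "Year of Release": [1982.0, 0.0, 1965.0],
})

def released_before_1970(row):
    # Return the flag instead of printing it.
    return row["Year of Release"] < 1970

# axis=1 passes each row to the function; the results form a Series.
flags = toy.apply(released_before_1970, axis=1)

# The same result without apply, computed on the whole column at once.
vectorized = toy["Year of Release"] < 1970

print(flags.tolist())
```

The 0 placeholder is flagged as earlier than 1970, which is why "Fantasy Girl" prints `True` in the output above.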