<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Lab: Cleaning Rock Song Data

_Authors: Dave Yerrington (SF)_

---


In [99]:
import pandas as pd
import numpy as np 
import seaborn as sns

%matplotlib inline

### 1. Load `rock.csv` and do an initial examination of its data columns.

In [100]:
rockfile =("/Users/bianca/Documents/GitHub/DAT-10-14/class material/Unit2/data/rock.csv")

In [101]:
# Load the data.
df = pd.read_csv(rockfile)


In [102]:
# Look at the information regarding its columns.
df[:10]

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1
5,Kryptonite,3 Doors Down,2000.0,Kryptonite by 3 Doors Down,1,1,13,13
6,Loser,3 Doors Down,2000.0,Loser by 3 Doors Down,1,1,1,1
7,When I'm Gone,3 Doors Down,2002.0,When I'm Gone by 3 Doors Down,1,1,6,6
8,What's Up?,4 Non Blondes,1992.0,What's Up? by 4 Non Blondes,1,1,3,3
9,Take On Me,a-ha,1985.0,Take On Me by a-ha,1,1,1,1


### 2.  Clean up the column names.

Let's clean up the column names. There are two ways we can accomplish this:

#### 2.A Change the column names when you import the data using `pd.read_csv()`.

When you pass `names=[...]` (a list of strings) and the length of that list matches the number of columns in the file, `pandas` uses those strings as the column names.

NOTE: When you create custom column names, the first row of the `.csv` already represents a header. It is important to tell `pandas` to skip that row. The `skiprows=1` keyword argument to `read_csv()` will tell `pandas` to skip the first row.
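
To see why `skiprows=1` matters, here is a minimal sketch on a small in-memory CSV. The two-row `csv_text` and the names `song` and `artist` are made up for illustration and are not taken from `rock.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for rock.csv.
csv_text = "Song Clean,ARTIST CLEAN\nKryptonite,3 Doors Down\nLoser,3 Doors Down\n"
new_names = ['song', 'artist']

# Without skiprows, the old header line is read as the first data row.
with_header_row = pd.read_csv(io.StringIO(csv_text), names=new_names)

# With skiprows=1, the old header line is dropped and only data remains.
clean = pd.read_csv(io.StringIO(csv_text), names=new_names, skiprows=1)

print(len(with_header_row))  # old header plus two songs
print(len(clean))            # just the two songs
```

Passing `header=0` together with `names=` is another way to replace the existing header row.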

In [41]:
# Change the column names when loading the '.csv':
col_names = ['True Song', 'True Artist', 'Year Released', 'Song_Artist', 'First Album', 'Year','PlayCount', 'F*G']

df = pd.read_csv("/Users/bianca/Documents/GitHub/DAT-10-14/class material/Unit2/data/rock.csv", skiprows=1, names=col_names)

df.head()

Unnamed: 0,True Song,True Artist,Year Released,Song_Artist,First Album,Year,PlayCount,F*G
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1


#### 2.B Change column names using the `.rename()` function.

The `.rename()` function takes an argument, `columns=name_dict`, in which `name_dict` is a dictionary containing the original column names as keys and the new column names as values.

In [38]:
# Change the column names using the `.rename()` function.

In [114]:
col_name_dict ={
    
    'Song Clean':'True Song', 
    'ARTIST CLEAN':'True Artist', 
    'Release Year':'Year Released',
    'COMBINED':'Song_Artist',
    'First?':'First Album',
    'Year?':'Year',
    'PlayCount':'PlayCount',
    'F*G':'F*G'
}

df.rename(columns=col_name_dict, inplace = True)

df.head()


# or: df = df.rename({'Old Column Name': 'New Column Name'}, axis=1)


Unnamed: 0,True Song,True Artist,Year Released,Song_Artist,First Album,Year,PlayCount,F*G
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,0,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1


In [None]:
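# Optional: lowercase all column names.
# Note: the cells below use the capitalized names, so running this cell would require updating them.
new_lower_cols = [col.lower() for col in df.columns.tolist()]

df.columns = new_lower_cols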

### 3. Subsetting data where null values exist.

We have mixed `str` and `NaN` values in the `release` column (named `Year Released` in this notebook). `NaN` stands for "not a number" and is the way `pandas` represents nulls, or missing data. We can use the `.isnull()` method of a Series to find null values.

Print the head of the subset of the data where the `release` column is null.
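
As a minimal sketch of how the mask works, here is the same idea on a toy DataFrame. The three rows are hypothetical; only the column names mirror this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror this notebook's renamed columns.
toy = pd.DataFrame({
    'True Song': ['Fantasy Girl', 'Kryptonite', 'Live Wire'],
    'Year Released': [np.nan, '2000', np.nan],
})

# .isnull() returns a boolean Series: True where the value is missing.
null_mask = toy['Year Released'].isnull()

# Indexing with the mask keeps only the rows where it is True.
missing = toy[null_mask]
print(missing)
print(null_mask.sum())  # number of missing values
```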

In [106]:
# Show records where df['Year Released'] is null

df[df['Year Released'].isnull()]

Unnamed: 0,True Song,True Artist,Year Released,Song_Artist,First Album,Year,PlayCount,F*G
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
10,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
13,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
16,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
24,Let's Get It Up,AC/DC,,Let's Get It Up by AC/DC,1,0,4,0
25,Live Wire,AC/DC,,Live Wire by AC/DC,1,0,2,0
26,Moneytalks,AC/DC,,Moneytalks by AC/DC,1,0,20,0
29,Shoot To Thrill,AC/DC,,Shoot To Thrill by AC/DC,1,0,45,0
31,Sin City,AC/DC,,Sin City by AC/DC,1,0,1,0
35,What Do You Do For Money Honey,AC/DC,,What Do You Do For Money Honey by AC/DC,1,0,2,0


### 4. Update slices of your DataFrame based on mask selection/slices.

In many scenarios, we want to update values in our DataFrame according to some criteria. Let's say we wanted to set all of the null values in `release` to 0.

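In recent versions of `pandas`, chained indexing such as `df[row_mask]['column_name']` may operate on a temporary copy, so assigning to it is not guaranteed to change the original DataFrame. To update the original, select the rows and the column in a single `.loc` call.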

For example, the following won't always work:
```python
df[row_mask]['column_name'] = new_value
```

The best way to accomplish the same task is:
```python
df.loc[row_mask, 'column_name'] = new_value
```

For multiple column assignment, you would use:
```python
df.loc[row_mask, ['col_1', 'col_2', 'col_3']] = new_value
```
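
The `.loc` pattern can be sketched on a toy DataFrame as follows. The `year` and `plays` columns are hypothetical stand-ins, not the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing year.
toy = pd.DataFrame({'year': [1982.0, np.nan, 1981.0], 'plays': [82, 3, 85]})
row_mask = toy['year'].isnull()

# A single .loc call selects rows and column together, so the assignment
# is applied to `toy` itself rather than to a temporary copy.
toy.loc[row_mask, 'year'] = 0

print(toy)
```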

#### 4.A Let's try it out. Make all of the null values in `release` 0.

In [116]:
# Replace release nulls with 0 using a mask and .loc
df.loc[df['Year Released'].isnull(), 'Year Released'] = 0

In [117]:
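# Another way to replace release nulls with 0.
# Assigning the result back avoids fillna(inplace=True) on a selected column, which newer versions of pandas warn about.
df['Year Released'] = df['Year Released'].fillna(0)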


#### 4.B Verify that `release` contains no null values.

In [118]:
# A: 

df['Year Released'].isnull().sum()

0

### 5. Ensure that the data types of the columns make sense. 

Verifying column data types is a critical part of data munging. If columns have the wrong data type, then there is usually corrupted or incorrect data in some of the observations.
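
A minimal sketch of why this check is useful: a single stray string is enough to turn an otherwise numeric column into an `object` column. Both Series below are hypothetical examples.

```python
import pandas as pd

# Hypothetical columns: one clean, one with a stray non-numeric string.
clean_years = pd.Series([1982, 1981, 1980])
dirty_years = pd.Series([1982, 1981, 'SONGFACTS.COM'])

print(clean_years.dtype)  # an integer dtype
print(dirty_years.dtype)  # object, because the values are of mixed types
```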

#### 5.A Look at the data types for the columns. Are any incorrect given what the data represents?

In [119]:
# A: 

df.dtypes

True Song        object
True Artist      object
Year Released    object
Song_Artist      object
First Album       int64
Year              int64
PlayCount         int64
F*G               int64
dtype: object

### 6. Investigate and clean up the `release` column.

The `release` column is stored as strings (dtype `object`) when it should be an integer.

#### 6.A Figure out what value(s) are causing the `release` column to be encoded as a string instead of an integer.

In [120]:
df['Year Released'].unique()

array(['1982', 0, '1981', '1980', '1975', '2000', '2002', '1992', '1985',
       '1993', '1976', '1995', '1979', '1984', '1977', '1990', '1986',
       '1974', '2014', '1987', '1973', '2001', '1989', '1997', '1971',
       '1972', '1994', '1970', '1966', '1965', '1983', '1955', '1978',
       '1969', '1999', '1968', '1988', '1962', '2007', '1967', '1958',
       '1071', '1996', '1991', '2005', '2011', '2004', '2012', '2003',
       '1998', '2008', '1964', '2013', '2006', 'SONGFACTS.COM', '1963',
       '1961'], dtype=object)

#### 6.B Look at the rows in which there is incorrect data in the `release` column.

In [125]:
df['Year Released'] = df['Year Released'].astype(str)
df["Year Released"].str.isdigit()

0       True
1       True
2       True
3       True
4       True
5       True
6       True
7       True
8       True
9       True
10      True
11      True
12      True
13      True
14      True
15      True
16      True
17      True
18      True
19      True
20      True
21      True
22      True
23      True
24      True
25      True
26      True
27      True
28      True
29      True
        ... 
2200    True
2201    True
2202    True
2203    True
2204    True
2205    True
2206    True
2207    True
2208    True
2209    True
2210    True
2211    True
2212    True
2213    True
2214    True
2215    True
2216    True
2217    True
2218    True
2219    True
2220    True
2221    True
2222    True
2223    True
2224    True
2225    True
2226    True
2227    True
2228    True
2229    True
Name: Year Released, Length: 2230, dtype: bool

In [126]:
# A:

df[~df['Year Released'].str.isdigit()]

Unnamed: 0,True Song,True Artist,Year Released,Song_Artist,First Album,Year,PlayCount,F*G
1504,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1


#### 6.C Clean up the data.

Normally we would replace the offending data with null `np.nan` values. However, we previously converted all of the `NaN` values in the `release` column to zeros, so we might as well continue with the same practice. Replacing with 0 (or `NaN`) will allow us to convert the column to a numeric type.

In [14]:
# A: 
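
The answer cell above was left empty. One possible approach, sketched on a toy Series rather than the real data, is shown below. The `years` Series is a hypothetical stand-in for `df['Year Released']` after the `astype(str)` step in 6.B, and this is not necessarily the lab's intended solution.

```python
import pandas as pd

# Hypothetical stand-in for df['Year Released'] after astype(str) in 6.B.
years = pd.Series(['1982', '0', '1981', 'SONGFACTS.COM'])

# Replace anything that is not all digits with '0', matching the earlier
# choice of 0 as the placeholder for missing years.
years = years.where(years.str.isdigit(), '0')

# Every value is now a digit string, so the conversion to int succeeds.
years = years.astype(int)
print(years.tolist())
```

On the real DataFrame, the same two steps would be applied to `df['Year Released']` and the result assigned back to that column.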

### 7. Get summary statistics for the `release` column using the `.describe()` function.

Now that the `release` column is finally a numeric data type, we can apply the `.describe()` function.  

#### 7.A Print out the summary stats for the `release` column. What is the earliest and latest release date?

In [127]:
# A: 

df.describe()

Unnamed: 0,First Album,Year,PlayCount,F*G
count,2230.0,2230.0,2230.0,2230.0
mean,1.0,0.741256,16.872646,15.04843
std,0.0,0.438043,25.302972,25.288366
min,1.0,0.0,0.0,0.0
25%,1.0,0.0,1.0,0.0
50%,1.0,1.0,4.0,3.0
75%,1.0,1.0,21.0,18.0
max,1.0,1.0,142.0,142.0


#### 7.B Based on the summary statistics, is there anything else wrong with the `release` column? 

In [16]:
# A:

_Looking at the DataFrame that contains the year 1071, we can see that the year was probably corrupted and should be replaced with something else if possible._
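
Note that `df.describe()` only summarizes numeric columns, so a `release` column that is still stored as strings will not appear in its output. Once the column is numeric, a check along the following lines could surface implausible years. The toy rows and the 1900 cutoff are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical numeric years, including the 0 placeholder and the suspect 1071.
toy = pd.DataFrame({
    'True Song': ['Song A', 'Song B', 'Song C', 'Song D'],
    'Year Released': [1982, 0, 1071, 1955],
})

# Exclude the 0 placeholders so they do not distort the minimum.
known = toy[toy['Year Released'] > 0]
print(known['Year Released'].describe())

# Years before 1900 are implausible for rock songs; 1900 is an assumed cutoff.
suspect = known[known['Year Released'] < 1900]
print(suspect)
```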

### 8. Make changes and investigate using custom functions with `.apply()`.

Let's say we want to traverse every single row in our data set and apply a function to that row.

#### 8.A Write a function that will take a row of a DataFrame and print out the song, artist, and whether or not the release date is < 1970.


In [17]:
# A: 

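def get_vals(row):
    # Print the song, artist, release year, and whether the release year is before 1970.
    # Assumes 'Year Released' has already been converted to a numeric type (see 6.C).
    print(row['True Song'], row['True Artist'], row['Year Released'],
          '<1970?:', row['Year Released'] < 1970)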

#### 8.B Using the `.apply()` function, apply the function you wrote to the first four rows of the DataFrame.

You will need to tell the `apply` function to operate row by row. Setting the keyword argument as `axis=1` indicates that the function should be applied to each row individually.
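
As a minimal sketch of row-wise `apply`, the toy example below returns a value per row instead of printing. The DataFrame and the `before_1970` helper are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'song': ['A', 'B'], 'year': [1965, 1982]})

def before_1970(row):
    # With axis=1, `row` is a Series holding one row, indexed by column name.
    return row['year'] < 1970

# apply returns one result per row, aligned to the DataFrame's index.
flags = toy.apply(before_1970, axis=1)
print(flags.tolist())
```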

In [18]:
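# A:

# The comparison in get_vals needs numeric years; this assumes the non-numeric value was already replaced in 6.C.
df['Year Released'] = df['Year Released'].astype(int)

# Apply the function row by row (axis=1) to the first four rows.
df.head(4).apply(get_vals, axis=1)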