In [1]:
import pandas as pd, numpy as np, seaborn as sns

%matplotlib inline

In [2]:
rockfile = "../../../datasets/rock_songs/rock.csv"

df = pd.read_csv(rockfile)
df.head()

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2230 entries, 0 to 2229
Data columns (total 8 columns):
Song Clean      2230 non-null object
ARTIST CLEAN    2230 non-null object
Release Year    1653 non-null object
COMBINED        2230 non-null object
First?          2230 non-null int64
Year?           2230 non-null int64
PlayCount       2230 non-null int64
F*G             2230 non-null int64
dtypes: int64(4), object(4)
memory usage: 139.4+ KB


# Clean Column Names

Let's clean up the column names.  There are three ways we can do this:

## 1. At the point when we import our data with pd.read_csv()

When we pass `names=[A LIST]`, the names in the list replace the column names, matched by position.


In [17]:
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names)
df.head()

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
1,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
2,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
3,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
4,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18


## 2. Using the .rename() function
The `.rename()` method takes a dictionary that maps each original column name to its new name.

In [20]:
df = pd.read_csv(rockfile)

rename_map = {
    # Original column: [renamed column]
    'Song Clean':    'song', 
    'ARTIST CLEAN':  'artist', 
    'Release Year':  'release', 
    'COMBINED':      'song_artist', 
    'First?':        'first', 
    'Year?':         'year', 
    'PlayCount':     'playcount', 
    'F*G':           'fg'
}

df.rename(columns=rename_map, inplace=True)
df

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
5,Kryptonite,3 Doors Down,2000,Kryptonite by 3 Doors Down,1,1,13,13
6,Loser,3 Doors Down,2000,Loser by 3 Doors Down,1,1,1,1
7,When I'm Gone,3 Doors Down,2002,When I'm Gone by 3 Doors Down,1,1,6,6
8,What's Up?,4 Non Blondes,1992,What's Up? by 4 Non Blondes,1,1,3,3
9,Take On Me,a-ha,1985,Take On Me by a-ha,1,1,1,1
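
`.rename()` only touches the columns named in the mapping, so it also works for partial renames.  Here is a minimal sketch on a tiny stand-in frame (the `demo` frame and its single row are assumptions for illustration, not the full rock dataset):

```python
import pandas as pd

# Tiny stand-in frame; `demo` is a hypothetical name used only for this sketch.
demo = pd.DataFrame({"Song Clean": ["Take On Me"], "ARTIST CLEAN": ["a-ha"], "PlayCount": [1]})

# Only the keys present in the mapping are renamed; other columns are left alone.
renamed = demo.rename(columns={"Song Clean": "song", "ARTIST CLEAN": "artist"})

print(list(renamed.columns))
print(list(demo.columns))  # the original is unchanged because inplace=True was not used
```

Without `inplace=True`, `.rename()` returns a new DataFrame, which makes it easy to check the result before overwriting `df`.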


## 3. Using the .columns attribute of a dataframe
The only caveat with `.columns` is that you have to reassign all of it at once.  You can't replace a single name by working on `.columns` directly.  You have to reassign `.columns` with a list of equal length.

In [22]:
df = pd.read_csv(rockfile)
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df.columns = column_names
df.head()

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1
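
The caveat can be checked directly.  This is a small sketch on a made-up two-column frame (`demo` is a hypothetical name), showing the two ways a partial update of `.columns` fails and the full reassignment that works:

```python
import pandas as pd

# Made-up frame for illustration only.
demo = pd.DataFrame({"Song Clean": ["Kryptonite"], "PlayCount": [13]})

# Replacing a single label in place is not allowed: the column Index is immutable.
try:
    demo.columns[0] = "song"
except TypeError as err:
    item_error = type(err).__name__

# Assigning a list of the wrong length fails as well.
try:
    demo.columns = ["song"]
except ValueError as err:
    length_error = type(err).__name__

# A list with one name per column works.
demo.columns = ["song", "playcount"]
print(item_error, length_error, list(demo.columns))
```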


## Accessing Null Values

We have mixed str and NaN values in "release".  NaN stands for "not a number", and it is the way Pandas represents "nulls".  We can use the `.isnull()` method of a series to find null values.

In [24]:
# This will show us records where df['release'] is null

null_release_mask = df['release'].isnull()
df[null_release_mask]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
10,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
13,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
16,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
24,Let's Get It Up,AC/DC,,Let's Get It Up by AC/DC,1,0,4,0
25,Live Wire,AC/DC,,Live Wire by AC/DC,1,0,2,0
26,Moneytalks,AC/DC,,Moneytalks by AC/DC,1,0,20,0
29,Shoot To Thrill,AC/DC,,Shoot To Thrill by AC/DC,1,0,45,0
31,Sin City,AC/DC,,Sin City by AC/DC,1,0,1,0
35,What Do You Do For Money Honey,AC/DC,,What Do You Do For Money Honey by AC/DC,1,0,2,0
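
Because the mask is just a series of booleans, it can also be counted or inverted.  A minimal sketch on three rows borrowed from the top of the dataset (the `demo` frame itself is an assumption for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "song": ["Caught Up in You", "Fantasy Girl", "Hold On Loosely"],
    "release": ["1982", np.nan, "1981"],
})

null_mask = demo["release"].isnull()   # boolean Series, True where the value is NaN
n_missing = int(null_mask.sum())       # True counts as 1, so the sum is the number of nulls
with_release = demo[~null_mask]        # ~ flips the mask to keep the non-null rows

print(n_missing)
print(with_release["song"].tolist())
```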


# Updating slices of our dataframe, based on mask selection / slices

Many times, we want to update a value in our DataFrame.  Let's show how to do that for `release`.  Let's set all of the null values in `release` to 0 (this isn't a practical choice for this data, but it shows how it's done).

For newer versions of Pandas, in order to change data in the original dataframe, it's necessary to use `.loc` rather than the more obvious chained indexing when doing reassignment using a mask and a column.

This won't always work:
```python
df[mask]['column_name'] = new_value
```

The best way to accomplish the same thing is:
```python
df.loc[mask, 'column_name'] = new_value
```

For multiple column assignment:
```python
df.loc[mask, ['col_1', 'col_2', 'col_3']] = new_value
```

Let's try it out.


In [31]:
# We're going to reload our data to a fresh state
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names)

null_release_mask = df['release'].isnull()
df.loc[null_release_mask, 'release'] = 0

# Print out our DataFrame
df

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
1,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
2,Fantasy Girl,.38 Special,0,Fantasy Girl by .38 Special,1,0,3,0
3,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
4,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
5,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
6,Kryptonite,3 Doors Down,2000,Kryptonite by 3 Doors Down,1,1,13,13
7,Loser,3 Doors Down,2000,Loser by 3 Doors Down,1,1,1,1
8,When I'm Gone,3 Doors Down,2002,When I'm Gone by 3 Doors Down,1,1,6,6
9,What's Up?,4 Non Blondes,1992,What's Up? by 4 Non Blondes,1,1,3,3
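
The multiple-column form works the same way.  Here is a small sketch on made-up numeric data (`demo` and its values are assumptions), where one `.loc` call fills the masked rows of two columns at once:

```python
import numpy as np
import pandas as pd

# Made-up float columns with one missing row.
demo = pd.DataFrame({
    "release": [1982.0, np.nan, 1981.0],
    "playcount": [82.0, np.nan, 85.0],
})

mask = demo["release"].isnull()

# One .loc call selects rows (mask) and columns (list) together, so the
# assignment is written into `demo` itself rather than into a temporary copy.
demo.loc[mask, ["release", "playcount"]] = 0.0

print(demo)
```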


## Searching for gremlins, in `release`

So far we know that `release` is an object.  Let's find out why.

In [35]:
# We're going to reload our data to a fresh state
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names)

df['release'].value_counts()

1973             104
1977              83
1975              83
1970              81
1971              75
1969              72
1980              70
1978              64
1979              63
1967              61
1981              61
1983              60
1976              56
1982              54
1984              51
1972              50
1974              48
1968              46
1987              39
1985              39
1986              37
1991              34
1989              32
1966              30
1988              29
1965              28
1994              25
1990              22
1993              19
1992              14
1964              14
1999              13
1995              10
1996               9
1997               9
1963               9
2002               6
1998               6
2004               5
2005               5
2012               5
2001               4
2011               3
2008               3
1962               3
2003               3
2000               3
2007         
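
Another quick way to see why a column ends up as `object` is to count the Python types it holds.  This is only a sketch on made-up values that mimic the mix of numeric strings, stray text, and NaN (the `release` series here is an assumption, not the real column):

```python
import numpy as np
import pandas as pd

release = pd.Series(["1982", "1981", "SONGFACTS.COM", np.nan], dtype=object)

# Map every element to the name of its Python type, then tally the names.
type_counts = release.map(lambda value: type(value).__name__).value_counts()

print(type_counts)
```

Strings and floats living in the same series are what force the `object` dtype.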

## We have "SONGFACTS.COM" in our "release" data

Let's get rid of it the direct way.  First let's select the row.

In [40]:
# Slice and assign
release_mask = df['release'] == "SONGFACTS.COM"
df[release_mask]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
1505,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1


Let's take our row slice, and replace the value in "release" with NaN (`np.nan`).

In [42]:
df.loc[release_mask, 'release'] = np.nan
df[release_mask]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
1505,Bullfrog Blues,Rory Gallagher,,Bullfrog Blues by Rory Gallagher,1,1,1,1


## It appears we still have cruft in our `release` column series

Let's look for several values at once, using `.isin()` to build a mask that matches the exact values found in the `release` series.

In [69]:
# We're going to reload our data to a fresh state
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names)

release_mask = df['release'].isin(["Release Year", "SONGFACTS.COM"])
df[release_mask]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
1505,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1


So it appears that we have the column headers in our first row.  When we pass `names=[list of columns]`, Pandas treats the header line of the CSV as data and pushes it into the dataset.  One way to deal with this is to pass `skiprows=1` to `pd.read_csv()`, which tells it to skip the first line of the file:  `pd.read_csv(rockfile, names=column_names, skiprows=1)`
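
The effect of `skiprows=1` can be seen on a two-line stand-in for the file.  In this sketch, `csv_text` and `short_names` are hypothetical names, and `io.StringIO` stands in for the real path:

```python
import io
import pandas as pd

# A two-line stand-in for the CSV file: one header line, one data line.
csv_text = "Song Clean,PlayCount\nCaught Up in You,82\n"
short_names = ["song", "playcount"]

# names= alone: the file's header line is read as the first data row.
with_header_row = pd.read_csv(io.StringIO(csv_text), names=short_names)

# names= plus skiprows=1: the header line is skipped before parsing.
clean = pd.read_csv(io.StringIO(csv_text), names=short_names, skiprows=1)

print(len(with_header_row), len(clean))
print(with_header_row["playcount"].dtype, clean["playcount"].dtype)
```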

However, if we want to keep the row and handle it in code, we can update every entry in a series that matches any of several values, as in this example.

In [70]:
release_mask = df['release'].isin(["Release Year", "SONGFACTS.COM"])
df.loc[release_mask, 'release'] = np.nan

## Now let's try to convert `release` to floats
Why floats?  Because the `float()` conversion will be able to handle the `NaN` values, where `int()` won't.

In [71]:
# Try to check the output of df['release'].apply(float), before reassigning!
df['release'] = df['release'].apply(float)
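
The sketch below first shows why `int()` is not an option, then a second route to the same result.  `pd.to_numeric` with `errors="coerce"` is a different technique from the `.apply(float)` call above: it turns anything it cannot parse into `NaN` instead of raising.  The `release` values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# int() cannot represent NaN, float() can.
try:
    int(np.nan)
except ValueError:
    int_handles_nan = False
else:
    int_handles_nan = True

release = pd.Series(["1982", np.nan, "SONGFACTS.COM"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_float = pd.to_numeric(release, errors="coerce")

print(int_handles_nan)
print(as_float.dtype, as_float.tolist())
```

The trade-off is that coercion hides the bad values, so it is worth looking at them first, as we did with `.isin()`.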

## Now let's check our dtypes out

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2231 entries, 0 to 2230
Data columns (total 8 columns):
song           2231 non-null object
artist         2231 non-null object
release        1652 non-null float64
song_artist    2231 non-null object
first          2231 non-null object
year           2231 non-null object
playcount      2231 non-null object
fg             2231 non-null object
dtypes: float64(1), object(7)
memory usage: 139.5+ KB


Alright!  Looks like our `release` column is finally a fully fledged `float64` type, allowing us to do a `describe`.  Our percentiles are having some issues calculating because of a problem with division by 0.  There is more info to read about this in the [numpy.seterr](http://docs.scipy.org/doc/numpy/reference/generated/numpy.seterr.html) documentation.

In [74]:
df['release'].describe()

count    1652.000000
mean     1978.019976
std        24.191247
min      1071.000000
25%              NaN
50%              NaN
75%              NaN
max      2014.000000
Name: release, dtype: float64

In [150]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2230 entries, 0 to 2229
Data columns (total 9 columns):
Song Clean      2230 non-null object
ARTIST CLEAN    2230 non-null object
Release Year    1653 non-null object
COMBINED        2230 non-null object
First?          2230 non-null int64
Year?           2230 non-null int64
PlayCount       2230 non-null int64
F*G             2230 non-null int64
cleaned_year    1652 non-null float64
dtypes: float64(1), int64(4), object(4)
memory usage: 156.9+ KB


## More About Apply

Let's say we want to traverse every single value to see what our series looks like with axis=1 (rows).

In [83]:
# We're going to reload our data to a fresh state
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names)

In [102]:
def inspect_data(row):
    
    print("------------- Begin Row -------------")
    for key in row.index:
        print(key, ":", row[key], "type:", type(row[key]).__name__)

# Operating on axis=1 (row-wise)
df.head(3).apply(inspect_data, axis=1)

_ = "" # suppresses the output of the statement above by ending the cell on a dummy assignment

------------- Begin Row -------------
song : Song Clean type: str
artist : ARTIST CLEAN type: str
release : Release Year type: str
song_artist : COMBINED type: str
first : First? type: str
year : Year? type: str
playcount : PlayCount type: str
fg : F*G type: str
------------- Begin Row -------------
song : Caught Up in You type: str
artist : .38 Special type: str
release : 1982 type: str
song_artist : Caught Up in You by .38 Special type: str
first : 1 type: str
year : 1 type: str
playcount : 82 type: str
fg : 82 type: str
------------- Begin Row -------------
song : Fantasy Girl type: str
artist : .38 Special type: str
release : nan type: float
song_artist : Fantasy Girl by .38 Special type: str
first : 1 type: str
year : 0 type: str
playcount : 3 type: str
fg : 0 type: str


Let's find out what can be converted to a float, and what can't.

In [100]:
def inspect_data(row):
    
    print("------------- Begin Row #", row.name, " -------------")

    for key in row.index:
        try:
            row[key] = float(row[key])
            print("Can convert `", key, "`: ", row[key])
        except (ValueError, TypeError):
            print("Can't be converted `", key, "`: ", row[key])

# Operating on axis=1 (row-wise)
df.head(3).apply(inspect_data, axis=1)

_ = "" # suppresses the output of the statement above by ending the cell on a dummy assignment

------------- Begin Row # 0  -------------
Can't be converted ` song `:  Song Clean
Can't be converted ` artist `:  ARTIST CLEAN
Can't be converted ` release `:  Release Year
Can't be converted ` song_artist `:  COMBINED
Can't be converted ` first `:  First?
Can't be converted ` year `:  Year?
Can't be converted ` playcount `:  PlayCount
Can't be converted ` fg `:  F*G
------------- Begin Row # 1  -------------
Can't be converted ` song `:  Caught Up in You
Can't be converted ` artist `:  .38 Special
Can convert ` release `:  1982.0
Can't be converted ` song_artist `:  Caught Up in You by .38 Special
Can convert ` first `:  1.0
Can convert ` year `:  1.0
Can convert ` playcount `:  82.0
Can convert ` fg `:  82.0
------------- Begin Row # 2  -------------
Can't be converted ` song `:  Fantasy Girl
Can't be converted ` artist `:  .38 Special
Can convert ` release `:  nan
Can't be converted ` song_artist `:  Fantasy Girl by .38 Special
Can convert ` first `:  1.0
Can convert ` year `:  0.

## Alright, axis = 1 is rows, how does this look for columns?

Rather than by rows, we now iterate by columns, which can be strange to think about.  Picture a dictionary whose keys are the column names, each holding a list of that column's values for every row, rather than a list of dictionaries with one dictionary per row.  Confusing?  Let's have a look.


In [105]:
def inspect_data(column):
    
    print("------------- Begin Column #", column.name, " -------------")

    for key in column.index:
        try:
            column[key] = float(column[key])
            print("Can convert `", key, "`: ", column[key])
        except (ValueError, TypeError):
            print("Can't be converted `", key, "`: ", column[key])

# Operating on axis=0 (column-wise)
df.head(3).apply(inspect_data, axis=0)

_ = "" # suppresses the output of the statement above by ending the cell on a dummy assignment

------------- Begin Column # song  -------------
Can't be converted ` 0 `:  Song Clean
Can't be converted ` 1 `:  Caught Up in You
Can't be converted ` 2 `:  Fantasy Girl
------------- Begin Column # artist  -------------
Can't be converted ` 0 `:  ARTIST CLEAN
Can't be converted ` 1 `:  .38 Special
Can't be converted ` 2 `:  .38 Special
------------- Begin Column # release  -------------
Can't be converted ` 0 `:  Release Year
Can convert ` 1 `:  1982.0
Can convert ` 2 `:  nan
------------- Begin Column # song_artist  -------------
Can't be converted ` 0 `:  COMBINED
Can't be converted ` 1 `:  Caught Up in You by .38 Special
Can't be converted ` 2 `:  Fantasy Girl by .38 Special
------------- Begin Column # first  -------------
Can't be converted ` 0 `:  First?
Can convert ` 1 `:  1.0
Can convert ` 2 `:  1.0
------------- Begin Column # year  -------------
Can't be converted ` 0 `:  Year?
Can convert ` 1 `:  1.0
Can convert ` 2 `:  0.0
------------- Begin Column # playcount  ---------

Same thing, but by column instead of by row.  Notice that each column series only has 3 elements.  This is because we are chaining `.head(3)` before the `.apply(inspect_data, axis=0)`.

Also note that our `inspect_data(column)` function is being fed an entire column of values at a time, instead of a row.
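
A compact way to confirm what the function receives under each `axis` is to return the length of its input.  This is a minimal sketch on a made-up frame with 3 rows and 2 columns (`demo` is a hypothetical name):

```python
import pandas as pd

demo = pd.DataFrame({"first": [1, 1, 1], "year": [1, 0, 1]})  # 3 rows, 2 columns

# axis=0: the function is called once per column and receives that column.
per_column = demo.apply(lambda s: len(s), axis=0)

# axis=1: the function is called once per row and receives that row.
per_row = demo.apply(lambda s: len(s), axis=1)

print(per_column.to_dict())
print(per_row.to_dict())
```

With `axis=0` each input has 3 elements (one per row), and with `axis=1` each input has 2 elements (one per column).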

## Let's get serious and clean everything we can using a single function
We'll do it with `.apply()`.  First, let's kill that first row, then build a function to run through apply.  We will set anything that is not convertible to `np.nan`, just to illustrate how you might design a process for dealing with bad data.

Then we will only run it on the columns that we want to handle in this way.  First let's make it work with all our data.

In [113]:
# We're going to reload our data to a fresh state
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names)

def deep_clean(row):
    
    for key in row.index:
        try:
            row[key] = float(row[key])
        except (ValueError, TypeError):
            row[key] = np.nan
    
    # don't forget to return if we want to update the copy of our data
    return row
            
# Chop off the first row by selecting everything after it, then reassigning to original `df` object.
df = df[1:]

# apply our function, testing first, not reassigning so we can check our results before committing
df.apply(deep_clean).head(20) # only the first 20 records..

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
1,,,1982.0,,1.0,1.0,82.0,82.0
2,,,,,1.0,0.0,3.0,0.0
3,,,1981.0,,1.0,1.0,85.0,85.0
4,,,1980.0,,1.0,1.0,18.0,18.0
5,,,1975.0,,1.0,1.0,1.0,1.0
6,,,2000.0,,1.0,1.0,13.0,13.0
7,,,2000.0,,1.0,1.0,1.0,1.0
8,,,2002.0,,1.0,1.0,6.0,6.0
9,,,1992.0,,1.0,1.0,3.0,3.0
10,,,1985.0,,1.0,1.0,1.0,1.0


## Looks like our function can be applied to everything!

Great, right?  Not quite: it wiped out our text columns, because it processes everything it is fed.  We only want to apply the `deep_clean` function to selected columns of our DataFrame.  Let's do that.

In [116]:
# Select only the variables with mostly continuous values
continious_columns = ['release', 'first', 'year', 'playcount', 'fg']
df[continious_columns].apply(deep_clean)

Unnamed: 0,release,first,year,playcount,fg
1,1982.0,1.0,1.0,82.0,82.0
2,,1.0,0.0,3.0,0.0
3,1981.0,1.0,1.0,85.0,85.0
4,1980.0,1.0,1.0,18.0,18.0
5,1975.0,1.0,1.0,1.0,1.0
6,2000.0,1.0,1.0,13.0,13.0
7,2000.0,1.0,1.0,1.0,1.0
8,2002.0,1.0,1.0,6.0,6.0
9,1992.0,1.0,1.0,3.0,3.0
10,1985.0,1.0,1.0,1.0,1.0


This looks great so let's commit to this update!

In [118]:
df[continious_columns] = df[continious_columns].apply(deep_clean)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2230 entries, 1 to 2230
Data columns (total 8 columns):
song           2 non-null object
artist         0 non-null object
release        1652 non-null float64
song_artist    0 non-null object
first          2230 non-null float64
year           2230 non-null float64
playcount      2230 non-null float64
fg             2230 non-null float64
dtypes: float64(5), object(3)
memory usage: 139.4+ KB
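
For comparison, here is a sketch of the same selective clean-up without a hand-written function.  It swaps `deep_clean` for `pd.to_numeric` with `errors="coerce"`, which is a different technique with a similar outcome.  The frame and the `numeric_columns` list are assumptions for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "song": ["Caught Up in You", "Bullfrog Blues"],
    "release": ["1982", "SONGFACTS.COM"],
    "playcount": ["82", "1"],
})

numeric_columns = ["release", "playcount"]  # hypothetical selection for this sketch

# DataFrame.apply forwards extra keyword arguments to the function, so each
# selected column is passed through pd.to_numeric(..., errors="coerce").
demo[numeric_columns] = demo[numeric_columns].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
print(demo)
```

One difference from `deep_clean`: a column with no bad values, such as `playcount` here, comes back as integers rather than floats.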


## And that's it!  Check out describe now:
`.T` transposes the result, so each column of our DataFrame becomes a row of the summary.

In [120]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
release,1652.0,1978.019976,24.191247,1071.0,,,,2014.0
first,2230.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
year,2230.0,0.741256,0.438043,0.0,0.0,1.0,1.0,1.0
playcount,2230.0,16.872646,25.302972,0.0,1.0,4.0,21.0,142.0
fg,2230.0,15.04843,25.288366,0.0,0.0,3.0,18.0,142.0


Let's try a different approach for the same thing, without using `.apply()`

In [142]:
# We're going to reload our data to a fresh state
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names)

# We know that release date SHOULD be 4 digits long. 
# Let's find everything that is greater than 4 characters.
df[df['release'].str.len() > 4]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
1505,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1


## Now let's update those values to nans
We're performing the mask selection inline here.  Also keep in mind that since it is a mask, we need to use `.loc` in order to do the assignment.

In [147]:
df.loc[df['release'].str.len() > 4, 'release'] = np.nan
df['release'] = df['release'].map(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2231 entries, 0 to 2230
Data columns (total 8 columns):
song           2231 non-null object
artist         2231 non-null object
release        1652 non-null float64
song_artist    2231 non-null object
first          2231 non-null object
year           2231 non-null object
playcount      2231 non-null object
fg             2231 non-null object
dtypes: float64(1), object(7)
memory usage: 139.5+ KB


We updated the type for `release` successfully!
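
The length mask is safe on a column that mixes strings and NaN because of how the missing values propagate.  A small sketch on made-up values (the `release` series here is an assumption):

```python
import numpy as np
import pandas as pd

release = pd.Series(["1982", np.nan, "SONGFACTS.COM"], dtype=object)

lengths = release.str.len()   # NaN stays NaN; strings give their character count
too_long = lengths > 4        # comparing NaN with a number is False, so nulls are not flagged

print(lengths.tolist())
print(too_long.tolist())
```

Only the over-long string is selected, so the existing nulls are left untouched by the `.loc` assignment.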

## The Final Trick

The final piece of wisdom is using `applymap` to search for values that are strings but could easily be converted to floats.  What `.applymap()` does is apply a function individually to every single element of a DataFrame.  This is different from `DataFrame.apply()`, because apply gives your function a whole series as input (multiple values), not individual values one at a time.  Calling `.apply()` on a single series, as in the cells below, also works one value at a time.
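
To make the difference concrete, this sketch reports the type of whatever the function is handed.  The frame and the `kind` helper are hypothetical.  Recent pandas versions expose the element-wise method as `DataFrame.map` and deprecate `applymap`, so the sketch picks whichever one is available.

```python
import pandas as pd

demo = pd.DataFrame({"release": ["1982", "SONGFACTS.COM"], "playcount": ["82", "1"]})

def kind(x):
    """Return the type name of whatever pandas passes in."""
    return type(x).__name__

# DataFrame.apply hands the function one whole column (a Series) at a time.
per_column = demo.apply(kind)

# Element-wise application hands it one cell at a time.
elementwise = demo.map if hasattr(demo, "map") else demo.applymap
per_cell = elementwise(kind)

print(per_column.tolist())
print(per_cell)
```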

In [163]:
# We're going to reload our data to a fresh state
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names)

# Create mask to find numeric values in object / string, that CAN be converted
digit_mask = df['year'].apply(lambda x: x.isdigit())

# Find everything that's NOT a digit hanging out in an object / string
df[~digit_mask]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G


This is obviously the easy way, but you should know how to investigate your data by careful forensics, slicing, and Pandas operations.  From here, you can take this mask reference and update the values as needed.

In [167]:
def find_numeric(value):
    
    try:
        # Is a string made up only of digits
        return value.isdigit()
    except AttributeError:
        # No .isdigit() method, so not a string: a float here means NaN
        if isinstance(value, float):
            return True
        # Not digits or NaN (a naughty value)
        else:
            return False

# Create mask to find numeric values in object / string, that CAN be converted
digit_mask = df['release'].apply(find_numeric)

# Find everything that's NOT a digit hanging out in an object / string
df[~digit_mask]


Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
1505,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1
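
The same two rows can also be found by comparing the column before and after a coercing conversion.  This is a sketch of an alternative to `find_numeric`, not a replacement for it, and the `release` values are made up to mimic the column:

```python
import numpy as np
import pandas as pd

release = pd.Series(["Release Year", "1982", np.nan, "SONGFACTS.COM"], dtype=object)

# Anything that cannot be parsed as a number becomes NaN.
converted = pd.to_numeric(release, errors="coerce")

# A gremlin is a value that was present before conversion but is NaN afterwards.
gremlin_mask = release.notna() & converted.isna()

print(release[gremlin_mask].tolist())
```

Values that were already null are excluded by `release.notna()`, so only the strings that failed to convert are reported.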
