## Introduction

Import required libraries

In [1]:
%matplotlib inline
import pandas as pd

Provide the path to the file containing the data

In [2]:
path_to_csv = "data/abalone.csv"

Load the data into pandas. The data contains the following columns:

| Name           | Units | Description                          |
|:---------------|:-----:|:-------------------------------------|
| Sex            |       | M (male), F (female), and I (infant) |
| Length         | mm    | Longest shell measurement            |
| Diameter       | mm    | Perpendicular to length              |
| Height         | mm    | With meat in shell                   |
| Whole Height   | grams | Whole abalone                        |
| Shucked weight | grams | Weight of meat                       |
| Viscera weight | grams | Gut weight (after bleeding)          |
| Shell weight   | grams | After being dried                    |
| Rings          |       | +1.5 gives the age in years          |

In [3]:
columns = ["Sex", "Length", "Diameter", "Height", "Whole Height", "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
df = pd.read_csv(path_to_csv, header=None, names=columns)

Let's take a look at the data.

In [4]:
df.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole Height,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
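
The column table notes that adding 1.5 to the ring count gives the age in years. A minimal sketch of that conversion, assuming a small hypothetical sample (`sample` and the `Age` column are illustrative names, not part of the dataset):

```python
import pandas as pd

# Hypothetical three-row sample; the ring counts mirror the first rows shown above.
sample = pd.DataFrame({"Sex": ["M", "M", "F"], "Rings": [15, 7, 9]})

# Age in years is the ring count plus 1.5, as stated in the column table.
sample["Age"] = sample["Rings"] + 1.5
print(sample)
```

The same expression could be applied to `df["Rings"]` once the data has been cleaned.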


### Missing Data

Let's search for missing data. pandas makes this easy: the code below selects every row that contains at least one null value.

In [5]:
nans = df[df.isnull().any(axis=1)]
nans

Unnamed: 0,Sex,Length,Diameter,Height,Whole Height,Shucked weight,Viscera weight,Shell weight,Rings
878,F,0.635,0.485,0.165,1.2945,0.668,,0.2715,9.0
1888,F,0.565,0.445,0.125,0.8305,0.3135,0.1785,0.23,
3093,,0.52,0.43,0.15,0.728,0.302,0.1575,0.235,11.0
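
To see how the row search works, here is a small sketch on a hypothetical toy frame (`toy` and its values are made up for illustration). `isnull()` flags each missing cell, `any(axis=1)` collapses the flags to one boolean per row, and `sum()` gives a per-column count that can help locate where the gaps are.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each of two rows.
toy = pd.DataFrame({
    "Sex": ["M", None, "F"],
    "Height": [0.095, 0.150, 0.125],
    "Rings": [15, 11, np.nan],
})

mask = toy.isnull().any(axis=1)          # True for rows with at least one null
rows_with_nans = toy[mask]               # same kind of selection as the search above
missing_per_column = toy.isnull().sum()  # number of nulls in each column

print(rows_with_nans)
print(missing_per_column)
```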


The search returns 3 rows that contain a NaN value:

- Row 878 is missing its Viscera weight
- Row 1888 is missing its Rings value
- Row 3093 is missing its Sex value

We can now drop these rows from our dataframe. To confirm that exactly 3 rows are removed, let's display the shape of the dataframe before and after the drop.

In [6]:
df.shape

(4177, 9)

In [7]:
df.drop(df.index[[878,1888,3093]], inplace=True)

Now that the rows have been dropped, let's check that the dataframe has shrunk by 3 rows.

In [8]:
df.shape

(4174, 9)

The row count has dropped from 4177 to 4174, a difference of exactly 3 rows.
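
Dropping by hard-coded index works here, but it has to be updated whenever the data changes. As an alternative sketch, `DataFrame.dropna()` removes every row that contains a null value without listing indexes by hand. The toy frame below is hypothetical and only illustrates the before and after check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 1 and 2 each contain one missing value.
toy = pd.DataFrame({
    "Sex": ["M", None, "F", "I"],
    "Rings": [15, 11, np.nan, 7],
})

rows_before = toy.shape[0]
n_missing = int(toy.isnull().any(axis=1).sum())  # rows with at least one null

cleaned = toy.dropna()  # returns a new frame; toy itself is left unchanged
rows_after = cleaned.shape[0]

# The row count should shrink by exactly the number of incomplete rows.
print(rows_before, rows_after, n_missing)
```

Note that `dropna()` returns a new dataframe by default, so the result must be assigned to a name to be kept.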

### Erroneous Data

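A physical measurement of zero is not plausible for an abalone, so we search for rows in which any numeric column is equal to 0.

In [9]:
df[(df.select_dtypes(include=['number']) == 0).any(axis=1)]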

Unnamed: 0,Sex,Length,Diameter,Height,Whole Height,Shucked weight,Viscera weight,Shell weight,Rings
1257,I,0.43,0.34,0.0,0.428,0.2065,0.086,0.115,8
3996,I,0.315,0.23,0.0,0.134,0.0575,0.0285,0.3505,6
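
The zero search can be broken into steps, sketched below on a hypothetical toy frame. Whether such rows should be dropped or corrected is a separate decision; the last line only shows one possible option, inverting the mask to keep the rows without zeros.

```python
import pandas as pd

# Hypothetical toy frame: the first row has an implausible Height of 0.
toy = pd.DataFrame({
    "Sex": ["I", "M", "I"],
    "Height": [0.0, 0.095, 0.08],
    "Rings": [8, 15, 7],
})

numeric = toy.select_dtypes(include=["number"])  # leaves out the Sex column
has_zero = (numeric == 0).any(axis=1)            # True where any numeric value is 0

zero_rows = toy[has_zero]       # the suspicious rows
without_zeros = toy[~has_zero]  # one option: keep only the remaining rows

print(zero_rows)
```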


Let's make sure all values in the Sex column are upper case for consistency. We will also strip any leading or trailing whitespace that has crept into the Sex column unintentionally.

In [13]:
df['Sex'] = df['Sex'].str.upper()   # Convert to upper case
df['Sex'] = df['Sex'].str.strip()   # Strip leading and trailing whitespace
df.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole Height,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
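
One detail worth keeping in mind: the pandas `.str` methods return a new Series and leave the original column untouched, so the result has to be assigned back to take effect. A short sketch with hypothetical messy values:

```python
import pandas as pd

# Hypothetical messy values: mixed case plus stray whitespace.
toy = pd.DataFrame({"Sex": [" m", "F ", "i"]})

toy["Sex"].str.upper()            # result is discarded, toy is unchanged
unchanged = toy["Sex"].tolist()

# Assign the result back to the column to keep the cleaned values.
toy["Sex"] = toy["Sex"].str.strip().str.upper()
cleaned = toy["Sex"].tolist()

print(unchanged)
print(cleaned)
```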
