Import necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### At first, I will load data and check if it needs cleaning

I load the data from a _.csv_ file:

In [2]:
df = pd.read_csv("abalone.csv", sep=',')

I make a copy :)

In [3]:
df_copy = df.copy(deep = True)

And look at what we have here:

In [4]:
df_copy.head()

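| Unnamed: 0 | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |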


So, the first thing to do is to **change strings to integers**.  

Let's see how many categories we have in _Age group_:

In [5]:
df_copy.groupby(['Age group']).size()

KeyError: 'Age group'
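
The `KeyError` means the frame loaded from `abalone.csv` has no _Age group_ column, so the column has to exist before it can be grouped or encoded. Here is a minimal sketch of checking for it and, if needed, deriving it from _Rings_. The toy frame, the bin edges and the way labels are assigned to bins are all hypothetical placeholders, not the definition used in this notebook:

```python
import pandas as pd

# Toy frame standing in for df_copy (only the Rings column matters here).
toy = pd.DataFrame({"Rings": [4, 7, 9, 10, 15, 22]})

# Check for the column first instead of running into a KeyError.
has_age_group = "Age group" in toy.columns

if not has_age_group:
    # HYPOTHETICAL bin edges: replace with the real age-group definition.
    edges = [0, 7, 10, 15, 30]
    labels = ["young abalone", "mature abalone",
              "middle-aged abalone", "senior abalone"]
    # Intervals are right-closed: (0, 7], (7, 10], (10, 15], (15, 30].
    toy["Age group"] = pd.cut(toy["Rings"], bins=edges, labels=labels)

# Same grouping call as above, now on a column that exists.
counts = toy.groupby("Age group", observed=False).size()
```

With the real definition of the age groups in place of the placeholder edges, the `groupby` call above would run without the error.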

#### Now I change strings to integers in _Age group_:
I could do this automatically using LabelEncoder, but I want the values to grow with age, as below: 
- young -> 1
- mature -> 2
- middle-aged -> 3
- senior -> 4  


So, I will change it 'by hand' ;)

In [None]:
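d = {'young abalone':1,'mature abalone':2,'middle-aged abalone':3, 'senior abalone':4}
# Replace only inside the Age group column, so no other column is touched by accident
df_copy['Age group'] = df_copy['Age group'].replace(d)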
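
A sketch of the same ordinal encoding with `Series.map`, shown on a toy column rather than the notebook's frame. The names `ages`, `order` and `encoded` are illustrative. Unlike `replace`, `map` turns any label missing from the dictionary into NaN, which makes typos in the labels easy to spot:

```python
import pandas as pd

# Toy column with the four labels used in the mapping above.
ages = pd.Series(["young abalone", "senior abalone",
                  "mature abalone", "middle-aged abalone"])

order = {"young abalone": 1, "mature abalone": 2,
         "middle-aged abalone": 3, "senior abalone": 4}

# map() works on this one column; labels missing from the dict become NaN.
encoded = ages.map(order)

# Quick sanity check that every label was covered by the dict.
all_mapped = encoded.notna().all()
```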

Does it work?

In [None]:
df_copy.head()

Yup! Now it's time for _Sex_ feature:

In [None]:
df_copy.groupby(['Sex']).size()

Here the numbers can be assigned arbitrarily, so I will change it using LabelEncoder :)

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [None]:
df_copy['Sex'] = le.fit_transform(df_copy['Sex'])
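
To see which integer ended up on which label, the fitted encoder can be inspected. This is a small sketch on a toy series (the first five _Sex_ values from the table above). `LabelEncoder` sorts the labels, so the codes follow alphabetical order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# First five Sex values shown by head() earlier in the notebook.
sex = pd.Series(["M", "M", "F", "M", "I"])

le = LabelEncoder()
codes = le.fit_transform(sex)

# classes_ holds the original labels; the position in the array is the code.
mapping = {label: code for code, label in enumerate(le.classes_)}

# inverse_transform recovers the original strings from the codes.
restored = le.inverse_transform(codes)
```

Here `mapping` pairs each label with its code, which is useful later when reading plots or model output that show only the integers.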

Does this method work too?

In [None]:
df_copy.head()

Yes! :)

### Next I will check some information:

Shape:

In [None]:
df_copy.shape

Basic information:

In [None]:
df_copy.info()

I check if there are any null values:

In [None]:
df_copy.isnull().values.any()

And check how many duplicates we have:

In [None]:
df_copy.duplicated().sum()
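
If either check reports a problem, a per-column view helps to locate it. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Length": [0.455, 0.350, np.nan, 0.455],
    "Rings": [15, 7, 9, 15],
})

# Missing values per column, rather than a single True/False for the frame.
nulls_per_column = toy.isnull().sum()

# keep=False marks every member of a duplicate group, so all copies can be inspected.
duplicate_rows = toy[toy.duplicated(keep=False)]
```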

Let's see basic statistics:

In [None]:
df_copy.describe()

#### It looks like we may have some outliers in _Height_, _Shucked weight_ and _Shell weight_.  
Let's check it out!
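
Besides reading the value counts by eye, suspicious values can be flagged with a simple rule. The sketch below uses Tukey's 1.5 × IQR rule, which is a common heuristic and not the method used in this notebook. The toy heights are made up, apart from the 0.000 and 1.130 values discussed below:

```python
import pandas as pd

# Toy heights: mostly plausible values plus one zero and one very large value.
height = pd.Series([0.095, 0.090, 0.135, 0.125, 0.080, 0.000, 1.130, 0.110])

q1, q3 = height.quantile(0.25), height.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag anything further than 1.5 * IQR from the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = height[(height < lower) | (height > upper)]
```

On real data this rule usually flags more points than are truly wrong, so the flagged rows are candidates for a closer look, not an automatic drop list.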

#### Height

In [None]:
df_copy.groupby(['Height']).size()

There are two zeros (0.000) and one 1.130. In my opinion, all three values are incorrect.  
I can either replace these values, e.g. with the mean, or just drop these rows.  
We have more than 4000 observations, so dropping 3 rows will not make a big difference.
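
For comparison, the replacement option could look like the sketch below, on toy data. The suspicious values are masked out first, so that they do not distort the mean that replaces them:

```python
import pandas as pd

height = pd.Series([0.095, 0.000, 0.135, 1.130, 0.125])

# Treat the suspicious values as missing first, so they do not skew the mean.
suspicious = height.isin([0.0, 1.13])
cleaned = height.mask(suspicious)

# Fill the gaps with the mean of the remaining, trusted values.
imputed = cleaned.fillna(cleaned.mean())
```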

So, which rows should I drop?

In [None]:
# Height is a float column, so the outlier values must be floats, not strings
height_outliers = [0.0, 1.13]

df_copy.loc[df_copy['Height'].isin(height_outliers)]

Ok, let's do this:

In [None]:
to_drop = [1257,2051,3996]
df_copy.drop(to_drop, inplace=True, axis=0)
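
The index labels do not have to be typed in by hand. A sketch that reads them off a boolean mask instead, using a toy frame whose index imitates the three rows dropped above:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Height": [0.095, 0.000, 0.135, 1.130, 0.000]},
    index=[0, 1257, 2, 2051, 3996],
)

# Height is numeric, so compare against floats rather than strings.
bad = toy["Height"].isin([0.0, 1.13])

# The labels to drop come from the mask, so they stay correct if the data changes.
to_drop = toy.index[bad].tolist()
cleaned = toy.drop(index=to_drop)
```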

Did it work?

In [None]:
df_copy.loc[df_copy['Height'].isin(height_outliers)]

Yes, it's quite clean here ;)

#### Shucked weight

In [None]:
df_copy.groupby(['Shucked weight']).size()

Hmm, these values look good. Nothing to change.

#### Shell weight

In [None]:
df_copy.groupby(['Shell weight']).size()

This feature is also ok. 

## How does our data look after cleaning?

In [None]:
df_copy.info()

In [None]:
df_copy.describe()

### Histograms time :)

In [None]:
sns.set()
df_copy.hist(figsize=(10,10), color='blue')
plt.show()

### Scatter matrix

In [None]:
from pandas.plotting import scatter_matrix
p=scatter_matrix(df_copy,figsize=(25, 25))

Whoa! :o  
Can't wait to see correlations!

But first I'll check how _Age group_ looks here:

In [None]:
p=sns.pairplot(df_copy, hue = 'Age group')

### Correlation

In [None]:
correlation = df_copy.corr()

# plt.subplots returns a (figure, axes) pair; draw the heatmap on that axes
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='Blues', ax=ax)

Wow! Almost every pair of features (apart from pairs involving _Sex_) has a strong positive correlation (corr > 0.41)! This is important to know when building a model.

#### So, our target is to assign abalones to the correct age group based on features other than the number of rings. Apart from _Rings_, the features most correlated with _Age group_ are: 
- _Shell weight_ , correlation = 0.63
- _Height_ , correlation = 0.62

Diameter and length have comparable correlations (0.6 and 0.59), but I will focus on the two above. I don't want to use too many features, especially since all of them are fairly strongly or strongly correlated with each other. A model can give unreliable results if the features it uses are correlated with each other.

#### Remember that _Shell weight_ and _Height_ are strongly correlated (correlation = 0.89)!!!
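
Both numbers, the correlation with the target and the correlation between the chosen features, can be read from the correlation matrix in code instead of from the heatmap. A minimal sketch on a toy frame (the values are made up and will not reproduce the figures quoted above):

```python
import pandas as pd

toy = pd.DataFrame({
    "Height":       [0.08, 0.09, 0.10, 0.12, 0.14, 0.15],
    "Shell weight": [0.05, 0.07, 0.10, 0.15, 0.21, 0.20],
    "Rings":        [5, 7, 8, 10, 13, 15],
    "Age group":    [1, 1, 2, 2, 3, 4],
})

corr = toy.corr()

# Correlation of every candidate feature with the target, strongest first.
# Rings is excluded because the goal is to predict without it.
ranked = (corr["Age group"]
          .drop(["Age group", "Rings"])
          .abs()
          .sort_values(ascending=False))

# Correlation between the two chosen features, to see how much they overlap.
pair_corr = corr.loc["Height", "Shell weight"]
```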

## Models

All models which I will create:
* are **classifiers**, not regressors - _it is because our target feature, Age group, contains classes, it is not continuous,_
* have test size = 0.2,
* have random state = 42.
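
The split settings listed above can be sketched as follows, with toy arrays standing in for the features and the _Age group_ target. The `stratify=y` argument is an assumption added here so that each class keeps the same share in the train and test parts. The notebook does not state whether it is used:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and the Age group target.
X = np.arange(40).reshape(20, 2)
y = np.array([1, 2, 3, 4] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% of the rows are held out for testing
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y,         # ASSUMPTION: keep class proportions equal in both parts
)
```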


<img src="tbc1.jpg">