# Exercise 1 - Data Cleaning, Exploration and Visualization

## Overview

In this exercise we will learn how to perform data cleaning and transformation using Python.

## Exercise 1

Look at the visualizations in the following article: https://flowingdata.com/2017/01/24/one-dataset-visualized-25-ways/. In your opinion, which two visualizations are the best and which two are the worst?

## Feature Engineering

Getting the input data into the correct format before training Machine Learning algorithms is extremely important. Machine Learning algorithms only work properly if they are trained on data whose features have specific characteristics. These features are usually in the form of structured columns. Incorrect or inconsistent data leads to false conclusions, so how well you clean and understand the data has a large impact on the quality of your results. In fact, a simple algorithm can outperform a complex one just because it was given enough high-quality data.


### Data Exploration and Cleaning

Data Cleaning consists of:
- Inspection: Detect unexpected, incorrect, and inconsistent data.
- Cleaning: Fix or remove the anomalies discovered.
- Verifying: After cleaning, the results are inspected to verify correctness.
- Reporting: Make a report about the changes and the quality of the data.

Some common data cleaning techniques include:
1. Check for duplicates.
2. Check for syntax errors.
3. Fill missing values (also known as imputation).
4. Handle outliers.
5. Scale/transform the data.
6. Handle categorical data.
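
Several of the techniques listed above (duplicates, imputation, outliers, scaling, categorical data) can be sketched on a tiny DataFrame. This is a minimal illustration only: the column names and values are hypothetical and not taken from the Abalone dataset, and median imputation, IQR clipping, min-max scaling and one-hot encoding are just one reasonable choice for each step.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Abalone dataset).
toy_df = pd.DataFrame({
    "gender": ["M", "F", "F", "I", "M"],
    "height": [0.10, 0.12, 0.12, None, 0.90],
    "weight": [0.50, 0.70, 0.70, 0.20, 0.60],
})

# Duplicates: drop rows that are exact copies of an earlier row.
toy_df = toy_df.drop_duplicates().reset_index(drop=True)

# Missing values: impute the missing height with the column median.
toy_df["height"] = toy_df["height"].fillna(toy_df["height"].median())

# Outliers: clip heights to the 1.5 * IQR fences.
q1, q3 = toy_df["height"].quantile([0.25, 0.75])
iqr = q3 - q1
toy_df["height"] = toy_df["height"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Scaling: min-max scale the weight into the range [0, 1].
weight = toy_df["weight"]
toy_df["weight_scaled"] = (weight - weight.min()) / (weight.max() - weight.min())

# Categorical data: one-hot encode the gender column.
toy_df = pd.get_dummies(toy_df, columns=["gender"])

print(toy_df)
```

Which strategy is appropriate depends on the data: for example, dropping rows may be preferable to imputation when only a few values are missing.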

## Exercise 2 - Data Exploration

Let's explore a simplified version of the Abalone dataset. The original dataset can be found at https://archive.ics.uci.edu/ml/datasets/abalone. This dataset is used to predict the age of abalone from physical measurements.

![image.png](attachment:image.png)

In [None]:
import sys
!{sys.executable} -m pip install seaborn

In [None]:
# Import the necessary libraries.
import pandas as pd # data processing, CSV file I/O
import seaborn as sns # data visualization

The first thing you want to do when you get a new dataset is to quickly verify its contents with the `.head()` method.

In [None]:
# Import the dataset
abalone_df = pd.read_csv('abalone.csv')
abalone_df.head()

Now let’s quickly look at the names and types of the columns. Most of the time you’re going to get data that is not quite what you expected, such as dates that are actually stored as strings, and other oddities. It is worth checking upfront:

In [None]:
# Get column names
column_names = abalone_df.columns
print(column_names)

# Get column data types (printed explicitly, because a notebook cell
# only displays the value of its last expression)
print(abalone_df.dtypes)

# Also check whether each column contains only unique values
for name in column_names:
    print(f'{name} is unique: {abalone_df[name].is_unique}')
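
When a column turns out to have the wrong type, it can usually be converted explicitly. The following is a minimal sketch on a hypothetical DataFrame (the `collected_on` and `rings` columns and their values are invented for illustration and are not part of the Abalone data); it assumes dates in `YYYY-MM-DD` format and uses `errors="coerce"` so that unparseable entries become missing values instead of raising an error.

```python
import pandas as pd

# Hypothetical data: both columns were read in as strings.
example_df = pd.DataFrame({
    "collected_on": ["2021-01-05", "2021-02-17", "not recorded"],
    "rings": ["15", "7", "9"],
})

# Convert date strings to datetimes; invalid entries become NaT.
example_df["collected_on"] = pd.to_datetime(
    example_df["collected_on"], format="%Y-%m-%d", errors="coerce"
)

# Convert numeric strings to numbers; invalid entries would become NaN.
example_df["rings"] = pd.to_numeric(example_df["rings"], errors="coerce")

print(example_df.dtypes)
```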

In [None]:
abalone_df.describe()

In [None]:
# View the unique data by column.
for item in abalone_df.columns:
    print(item)
    print(abalone_df[item].unique())

In [None]:
# Count the nulls in each column
abalone_df.isnull().sum(axis=0)
# To fill nulls with a constant instead, one option is:
# abalone_df = abalone_df.fillna(value=0)

### Step 1. What are the column names of the dataset?

### Step 2. Rename the 'Class_number_of_rings' column to 'Rings'.
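As a hint, renaming is typically done with `DataFrame.rename` and a mapping from old to new names. A minimal sketch on a small stand-in DataFrame (the `Length` column and all values here are assumptions made for illustration):

```python
import pandas as pd

# Stand-in for the real data; values are made up for illustration.
demo_df = pd.DataFrame({
    "Length": [0.455, 0.350],
    "Class_number_of_rings": [15, 7],
})

# rename returns a new DataFrame, so assign the result back.
demo_df = demo_df.rename(columns={"Class_number_of_rings": "Rings"})

print(demo_df.columns.tolist())
```

The same call applied to `abalone_df` makes the `Rings` column used in the later steps available.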

### Step 3. How many observations (i.e. rows) are in this data frame?

### Step 4. Examples for descriptive statistics

In [None]:
abalone_df['Rings'].describe()

In [None]:
abalone_df['Gender'].value_counts()

### Step 5. Plot the histogram for Rings

### Step 6. Create a Weight/Height scatter plot

### Step 7. Print the first 4 rows of the dataset. What are the values of the Rings feature for these observations?

### Step 8. Extract the last 3 rows of the data frame. What is the weight of these abalones?

### Step 9. What is the value of the diameter in row 755?
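Note that "row 755" can be read either as a position or as an index label, and pandas distinguishes the two: `.iloc` selects by zero-based position, `.loc` by label. A minimal sketch with a hypothetical three-row DataFrame whose index deliberately does not start at 0 (the `Diameter` column name and values are assumptions):

```python
import pandas as pd

# Hypothetical data with index labels 10, 11, 12.
demo_df = pd.DataFrame({"Diameter": [0.30, 0.40, 0.50]}, index=[10, 11, 12])

# Position-based: the second row (positions start at 0).
by_position = demo_df["Diameter"].iloc[1]

# Label-based: the row whose index label is 11.
by_label = demo_df.loc[11, "Diameter"]

print(by_position, by_label)
```

With the default `RangeIndex` that `read_csv` produces, position and label coincide, so both approaches return the same row.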

### Step 10. How many missing values are in the height column?

### Step 11. What is the mean of the height column? Exclude missing values from this calculation.
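As a hint for the two steps above: `isna().sum()` counts missing values, and `Series.mean()` skips missing values by default (`skipna=True`). A minimal sketch on made-up values (the `Height` column name is an assumption about the dataset):

```python
import pandas as pd

# Made-up heights with two missing entries.
demo_df = pd.DataFrame({"Height": [0.10, float("nan"), 0.20, float("nan"), 0.30]})

# Number of missing values in the column.
n_missing = demo_df["Height"].isna().sum()

# Mean over the non-missing values only (skipna=True is the default).
mean_height = demo_df["Height"].mean()

print(n_missing, mean_height)
```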

### Step 12. Extract the subset of rows of the data frame where gender is M and weight values are below 0.75. What is the mean of diameter in this subset?
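Subsets like this are usually built with a boolean mask, combining conditions with `&` and wrapping each condition in parentheses. A minimal sketch on hypothetical rows (column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows for illustration.
demo_df = pd.DataFrame({
    "Gender": ["M", "M", "F", "M"],
    "Weight": [0.50, 0.90, 0.40, 0.70],
    "Diameter": [0.30, 0.45, 0.28, 0.40],
})

# Each condition yields a boolean Series; & combines them element-wise.
mask = (demo_df["Gender"] == "M") & (demo_df["Weight"] < 0.75)
subset = demo_df[mask]

mean_diameter = subset["Diameter"].mean()
print(len(subset), mean_diameter)
```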

### Step 13. What is the most frequent rings value?

### Step 14. What is the minimum length when rings is equal to 18?
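The last two steps combine counting and filtering. A minimal sketch on made-up data (the `Rings` and `Length` values are hypothetical): `value_counts()` orders values by frequency, and a boolean mask restricts a column before aggregating it.

```python
import pandas as pd

# Made-up data for illustration.
demo_df = pd.DataFrame({
    "Rings": [9, 18, 9, 18, 9],
    "Length": [0.40, 0.62, 0.45, 0.58, 0.50],
})

# Most frequent value: the index of the largest count.
most_frequent = demo_df["Rings"].value_counts().idxmax()

# Minimum length among the rows where Rings equals 18.
min_length = demo_df.loc[demo_df["Rings"] == 18, "Length"].min()

print(most_frequent, min_length)
```

If several values are tied for most frequent, `Series.mode()` returns all of them, whereas `idxmax()` returns only one.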