# Lesson 6: Data Wrangling

*Learn to prepare data for visualization and analytics.*


## Instructions
This tutorial provides step-by-step training divided into numbered sections. The sections often contain embedded executable code for demonstration.  This tutorial is accompanied by a practice notebook: [L06-Data_Wrangling-Practice.ipynb](./L06-Data_Wrangling-Practice.ipynb). 

Throughout this tutorial, sections labeled as "Tasks" are interspersed and indicated with the icon: ![Task](http://icons.iconarchive.com/icons/sbstnblnd/plateau/16/Apps-gnome-info-icon.png). You should follow the instructions provided in these sections by performing them in the practice notebook.  When the tutorial is completed, you can turn in the final practice notebook. 

## Introduction
The purpose of this assignment is to build on Tidy data cleaning by using Python tools to "massage" or "wrangle" data into formats that are most useful for visualization and analytics.

**What is data wrangling?**

> Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. 

- [Data Wrangling](https://en.wikipedia.org/wiki/Data_wrangling) *Wikipedia*

Previously, we learned about Tidy rules for reformatting data.  Transforming data into a Tidy dataset is data wrangling.  We have also learned how to correct data types and how to remove missing values and duplicates.  This lesson is, therefore, an opportunity to bring everything together.  Some of the material will be a review, but it should help reinforce the concepts.

---
## 1. Getting Started
As before, we import any needed packages at the top of our notebook. Let's import NumPy and Pandas:

In [4]:
import numpy as np
import pandas as pd

---
## 2. Data Exploration
The first step in any data analytics task is to import and explore the data.  At this point, we have learned all of the steps we need to identify the data columns and their data types, recognize where we have missing values, and distinguish categorical from numeric variables in the data.   

For this tutorial we will use a dataset named "Abalone" from the [University of California Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Abalone). The data file is named `abalone.data` and is available in the data directory that accompanies this notebook.  The data has 10 "attributes" or variables. The following table describes these 10 variables, their types, and additional details.

<table>
    <tr><th>Name</th><th>Data Type</th><th>Metric</th><th>Description</th></tr>
    <tr><td>Sample ID</td><td>integer</td><td></td><td>A unique number for each sample taken</td></tr>
    <tr><td>Sex</td><td>nominal</td><td></td><td>M = 0, F = 1, and I = 2 (infant)</td></tr>
	<tr><td>Length</td><td>continuous</td><td>mm</td><td>Longest shell measurement</td></tr>
	<tr><td>Diameter</td><td>continuous</td><td>mm</td><td>perpendicular to length</td></tr>
	<tr><td>Height</td><td>continuous</td><td>mm</td><td>with meat in shell</td></tr>
	<tr><td>Whole weight</td><td>continuous</td><td>grams</td><td>whole abalone</td></tr>
	<tr><td>Shucked weight</td><td>continuous</td><td>grams</td><td>weight of meat</td></tr>
	<tr><td>Viscera weight</td><td>continuous</td><td>grams</td><td>gut weight (after bleeding)</td></tr>
	<tr><td>Shell weight</td><td>continuous</td><td>grams</td><td>after being dried</td></tr>
	<tr><td>Rings</td><td>integer</td><td></td><td>+1.5 gives the age in years</td></tr>
</table>

***Note:*** To demonstrate specific techniques of data wrangling, the dataset provided to you was altered: a sample ID column was added, the Sex column contains numeric IDs, and missing values were added as were duplicates.

This data has no header information, so we'll provide the column names when we import the data:

In [6]:
abalone = pd.read_csv('afs505_u2/git files/data/abalone.data', header = None)
abalone.columns = ['Sample_ID','Sex', 'Length', 'Diameter', 'Height', 
          'Whole_weight', 'Shucked_weight', 'Viscera_weight', 
          'Shell_weight', 'Rings']
abalone.head()

### 2.1 Exploring Data Types
First, let's explore how Pandas imported the data types:

In [3]:
abalone.dtypes

The first, second, and last columns (`Sample_ID`, `Sex`, and `Rings`) were imported as integers. All other columns were imported as `float64`, which is a decimal (floating-point) type.  This looks correct for the data.

Let's get a sense of how big the data is:

In [None]:
abalone.shape

Next, we can explore the distribution of numerical data using the `describe` function:

In [None]:
abalone.describe()

Observe that even though the 'Sex' column was provided as a numeric value, it is actually meant to be categorical, with each sex represented as a unique number.  We can explore the categorical data using the `groupby` function, followed by the `size` function.

In [None]:
abalone.groupby(by=['Sex']).size()

### 2.2 Finding Missing Values
Before proceeding with any analysis, you should know the state of missing values in the dataset.  Most analytic methods do not support missing values. Some tools will automatically ignore them, but in some cases it may be easier to remove them.

First, let's quantify how many missing values we have. The `isna` function will convert the data into `True` or `False` values: `True` if the value is missing:

In [None]:
abalone.isna().head()

We can use the `sum` function to then identify how many missing values we have per column:

In [None]:
abalone.isna().sum()

### 2.3 Inspecting Duplicates
Whether we want duplicates in the data depends on the expectations of the experiments and the measurements taken. Sometimes duplicates may represent human error in data entry. So, let's look for duplicated data.  We have 4,184 rows, so let's see how many unique values we have per column:

In [None]:
abalone.nunique()

Every column has fewer than 4,184 unique values.  A column like 'Sex' has only 3 unique values, so repeated values there are expected.  The decimal columns also contain repeated values. The likelihood of seeing the exact same decimal value depends on the distribution of the variable and the number of decimal places in the measurement.  The number of repeated values does not seem unusual.  However, the sample ID should be unique, yet we have 4,177 unique IDs instead of 4,184. This implies we have duplicated samples in the data. 

We can identify the number of duplicated 'Sample_ID' values in the data by using the `duplicated` function. 

In [None]:
abalone.duplicated(subset='Sample_ID').sum()

We have 7 duplicated rows. Now let's see which rows have duplicated samples:

In [None]:
abalone[abalone.duplicated(subset='Sample_ID', keep= False)]

It looks like the rows are exact duplicates, so this was probably a data entry error. We need to remove the copied rows. We will do so in the **3.3 Removing Duplicates** section below.
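
The two `duplicated` calls above differ only in the `keep` argument, which changes what gets flagged. The following is a minimal sketch on a small made-up data frame (the values are illustrative and are not taken from the abalone file): by default only the later copies are marked, while `keep=False` marks every row that shares an ID.

```python
import pandas as pd

# Hypothetical toy data: IDs 2 and 3 each appear twice.
toy = pd.DataFrame({'Sample_ID': [1, 2, 2, 3, 3],
                    'Rings': [7, 9, 9, 10, 10]})

# Default (keep='first'): only the second and later copies are flagged.
later_copies = toy.duplicated(subset='Sample_ID')

# keep=False: every row whose ID occurs more than once is flagged.
all_copies = toy.duplicated(subset='Sample_ID', keep=False)

print(later_copies.sum(), all_copies.sum())
```

This is why counting with the default setting gives the number of extra rows, while displaying with `keep=False` shows each original row alongside its copy.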

---
## 3. Cleanup
### 3.1 Correcting Data Types
During the data exploration phase above, we noticed that the Sex column uses a number to represent each sex category, and therefore Pandas imported that column as a numeric value. We need to convert it to a categorical value, because the meaning of the column is neither ordinal nor numeric. We should convert it to a string object.

We can do that with two functions that work on Series:  
- `astype`  converts the type of data in the series. 
- `replace`  replaces values in the series.

We'll use `astype` to convert the column to a string and `replace` to convert the numbers to more easily recognizable 'Male', 'Female' and 'Infant' strings.

In [None]:
# First convert the column from an integer to a string.
sex = abalone['Sex'].astype(str)

# Second convert 0 to Male, 1 to Female, and 2 to Infant.
sex = sex.replace('0', 'Male')
sex = sex.replace('1', 'Female')
sex = sex.replace('2', 'Infant')

# Now replace the 'Sex' column of the dataframe with the new Series.
abalone['Sex'] = sex
abalone.head()
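
The three `replace` calls above can also be collapsed into one call by passing a dictionary. This is a minimal sketch on a small made-up Series (the values are illustrative, not read from the abalone file), assuming the same 0/1/2 coding described in the table:

```python
import pandas as pd

# Hypothetical stand-in for the 'Sex' column (illustrative values only).
sex_codes = pd.Series([0, 1, 2, 0])

# Convert to strings, then map every code in a single call with a dictionary.
sex_labels = sex_codes.astype(str).replace(
    {'0': 'Male', '1': 'Female', '2': 'Infant'})

print(sex_labels.tolist())
```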

In addition, the Sample ID column, although it contains numbers, should not be treated as a numeric column, so let's convert that too:

In [None]:
# Convert Sample_ID to a string
abalone['Sample_ID'] = abalone['Sample_ID'].astype(str)

# Let's check out the datatypes to make sure they match our expectations:
abalone.dtypes

### 3.2 Handling Missing Values
As observed in section 2.2, we do indeed have missing values! Let's remove rows with missing values.  We can do so with the `dropna` function:

In [None]:
abalone = abalone.dropna(axis=0)
abalone.shape

Observe that the `axis` argument is set to 0, indicating that we will remove rows with missing values. If we compare the `shape` of the dataframe now with the shape when we first loaded it, we will see that we have lost 2 rows with missing values.

In addition to `dropna` you can also use the `fillna` and `replace` functions to rewrite the missing values to something else.
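
As a minimal sketch of the `fillna` alternative, the example below fills a missing value with the mean of the observed values in that column. The data frame is made up for illustration (it is not the abalone data), and using the column mean is only one possible choice, not a recommendation for every dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing height (illustrative values only).
toy = pd.DataFrame({'Height': [0.10, np.nan, 0.14],
                    'Rings': [7, 9, 10]})

# Fill the missing height with the mean of the observed heights.
# The dictionary limits the fill to the 'Height' column.
filled = toy.fillna({'Height': toy['Height'].mean()})

print(filled)
```

Unlike `dropna`, this keeps all three rows, at the cost of introducing a value that was never measured.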

### 3.3 Removing Duplicates

To remove duplicates we can use the [drop_duplicates](http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html) function of Pandas.  If we explore the duplicated rows of section 2.3 above, we'll see that the rows are the same for all columns.  In this case we can call `drop_duplicates` with no arguments.  However, let's assume we can't guarantee that every column matches, but we still want to remove duplicated samples.  We can do this by using the `subset` argument of the `drop_duplicates` function. We don't want to drop every copy of a duplicated sample; we need to keep one row for each. Therefore, we'll use the `keep` argument to do this.

In [None]:
abalone = abalone.drop_duplicates(['Sample_ID'], keep='first')
abalone.shape

In practice, the `keep` argument defaults to `'first'`, so we don't need to provide it, but including it makes the code clearer.  We have now dropped all duplicated rows and we have 4,177 valid rows.
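
To see what `subset` and `keep` do, here is a minimal sketch on a small made-up data frame (illustrative values, not the abalone data). It compares keeping the first copy of each sample with dropping every copy:

```python
import pandas as pd

# Hypothetical toy data: Sample_ID '2' appears twice.
toy = pd.DataFrame({'Sample_ID': ['1', '2', '2', '3'],
                    'Rings': [7, 9, 9, 10]})

# Keep the first row for each Sample_ID and drop later copies.
first_kept = toy.drop_duplicates(subset=['Sample_ID'], keep='first')

# keep=False drops every row whose Sample_ID is repeated.
none_kept = toy.drop_duplicates(subset=['Sample_ID'], keep=False)

print(len(first_kept), len(none_kept))
```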

---
## 4. Reshaping Data
Data reshaping is about altering the way data is housed in the data frames of Pandas. It includes filtering of rows, merging data frames, concatenating data frames, grouping, melting and pivoting. We have learned about all of these functions already. As a reminder, the following is a summary of what we've learned:

**Subsetting by Column**:
- *Indexing with column names*
  - Purpose: Allows you to slice the dataframe using column index names.
  - Introduced:  Pandas Part 1 Notebook
  - Example:
  ```python
   # Get the columns: Sample_ID, Sex, Height and Rings
   subset = abalone[['Sample_ID', 'Sex', 'Height', 'Rings']]
  ```
- *Indexing with the `loc` function*
  - Purpose: Allows you to slice the dataframe using row and column index names.
  - Introduced:  Pandas Part 1 Notebook
  - Example:
  ```python
   # Get the columns: Sample_ID, Sex, Height and Rings
   subset = abalone.loc[:,['Sample_ID', 'Sex', 'Height', 'Rings']]
  ```    
  
**Filtering Rows**:
- *Boolean Indexing*
  - Purpose: to filter rows that match desired criteria
  - Introduced:  Pandas Part 1 Notebook
  - Example:  
  ```python
   # Finds all rows with sex of "Male" and the number of rings > 3.
   matches = (abalone['Sex'] == 'Male') & (abalone['Rings'] > 3)
   male = abalone[matches]

   # Or more succinctly
   male = abalone[(abalone['Sex'] == 'Male') & (abalone['Rings'] > 3)]
  ```

**Grouping Data**:
- *`groupby` function*
  - Purpose:  To group rows together by "classes" or values of data. Allows you to perform aggregate functions, such as calculating means, summations, sizes, etc. You can create new data frames with aggregated values.
  - Introduced:  Pandas Part 2 Notebook. 
  - Example:
  ```python
  # Calculate the mean of each numeric column for each sex.
  # numeric_only=True skips string columns such as Sample_ID.
  abalone.groupby(by="Sex").mean(numeric_only=True)
  ```
  
**Merging DataFrames**:
- *`concat` function*
  - Purpose: To combine two dataframes.  How the data frames are combined depends on whether their column and row indexes are the same.
  - Introduced:  Pandas Part 2 Notebook.
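
As a reminder of how `concat` behaves, the following is a minimal sketch using small made-up data frames (illustrative values, not the abalone data). Frames with the same columns are stacked as rows, and frames with the same row index can be placed side by side with `axis=1`:

```python
import pandas as pd

# Hypothetical toy frames that share the same columns.
top = pd.DataFrame({'Sample_ID': ['1', '2'], 'Rings': [7, 9]})
bottom = pd.DataFrame({'Sample_ID': ['3'], 'Rings': [10]})

# Same columns: stack the rows and renumber the row index.
stacked = pd.concat([top, bottom], ignore_index=True)

# Same row index (0 and 1): add the new column side by side.
heights = pd.DataFrame({'Height': [0.10, 0.12]})
side_by_side = pd.concat([top, heights], axis=1)

print(stacked.shape, side_by_side.shape)
```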

**Melting**:
- *`melt` function*
  - Purpose:  Handles the case where categorical observations are stored in the header labels (i.e. violates Tidy rules).  It moves the header names into a new column and matches the corresponding values.
  - Introduced:  Tidy Part 1 Notebook.
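
The sketch below shows `melt` on a small made-up data frame (illustrative values, not the abalone data). The new column names `Measurement` and `Grams` are chosen here only for illustration:

```python
import pandas as pd

# Hypothetical wide data: one column per kind of weight.
wide = pd.DataFrame({'Sample_ID': ['1', '2'],
                     'Whole_weight': [0.51, 0.23],
                     'Shell_weight': [0.15, 0.07]})

# Move the two weight headers into a 'Measurement' column,
# with the matching values in a 'Grams' column.
long_form = wide.melt(id_vars='Sample_ID',
                      var_name='Measurement',
                      value_name='Grams')

print(long_form)
```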

**Pivoting**:
- *`pivot` and `pivot_table` functions*
  - Purpose: The opposite of `melt`. Uses unique values from one or more columns to create new columns.
  - Introduced: Tidy Part 1 Notebook.
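
A minimal sketch of `pivot` on a small made-up long-format data frame follows (illustrative values and column names, not the abalone data). Each unique value in the `Measurement` column becomes its own column:

```python
import pandas as pd

# Hypothetical long data: one row per sample and measurement.
long_form = pd.DataFrame({
    'Sample_ID': ['1', '1', '2', '2'],
    'Measurement': ['Whole_weight', 'Shell_weight',
                    'Whole_weight', 'Shell_weight'],
    'Grams': [0.51, 0.15, 0.23, 0.07]})

# Spread the unique 'Measurement' values into separate columns.
wide = long_form.pivot(index='Sample_ID',
                       columns='Measurement',
                       values='Grams')

print(wide)
```

Note that `pivot` expects each index/column pair to occur only once; when pairs repeat, `pivot_table` with an aggregation function is the usual alternative.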
  
You can use any of these functions and techniques to reshape the data so that it meets Tidy standards and suits the analysis or visualization you want to perform.