# Data Wrangling



-----------------------

# The idea behind Data Wrangling

Data wrangling, also known as data munging, is the step in the data analysis process where raw data is cleaned, structured, and enriched so that it can be used for analysis. In practice it means transforming and mapping data from a "raw" form into a format that is more appropriate and valuable for downstream purposes such as analytics and reporting. This section looks at the idea behind data wrangling and why it matters.

## Why is Data Wrangling Important?

Data wrangling is essential for several reasons:

1. **Facilitates Easier Analysis:** Data in its raw form is often complex and unwieldy. Wrangling simplifies this data, making it easier to work with in analysis tools.

2. **Improves Data Quality:** The process helps identify and correct errors or inconsistencies in data, leading to more accurate analysis results.

3. **Saves Time:** Although data wrangling can be time-consuming, it ultimately saves time during the analysis phase by ensuring that data is in a consistent format that can be easily manipulated and explored.


### Concept of Tidy Data (Long Format)





Tidy datasets are easy to manipulate, model and visualize, and have a specific structure:

- each variable is a column
- each observation is a row
- and each type of observational unit is a table

Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset. It is also very pleasant to work with in the data exploration and plotting phases of a data project.

![Tidy data](https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png)

In [None]:
# example of a tidy dataset --> Penguins dataset

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
penguins.head()

### Penguins is a tidy dataset. Why is this data tidy?

Characteristics that make it tidy:
- each observation (row) is one penguin
- each variable is a column
- the table contains a single type of observational unit: penguins


#### Untidy datasets can violate the Tidy data structure rules above in different ways. For example:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
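As an illustration of the second violation (multiple variables stored in one column), here is a minimal sketch on a small hand-made table. The column name `species_sex` and the body mass values are hypothetical and are not taken from the penguins dataset.

```python
import pandas as pd

# Hypothetical untidy table: species and sex are packed into one column
untidy = pd.DataFrame({
    "species_sex": ["Adelie_Male", "Adelie_Female", "Gentoo_Male"],
    "body_mass_g": [3750, 3800, 5000],
})

# Split the combined column into two separate variables
tidy = untidy.copy()
tidy[["species", "sex"]] = tidy["species_sex"].str.split("_", expand=True)
tidy = tidy.drop(columns="species_sex")

print(tidy)
```

After the split, each variable has its own column, so filtering or grouping by `sex` no longer requires string parsing.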

In [None]:
penguins.describe()

In [None]:
penguins.info()

In [None]:
penguins.shape

#### Let's visualize how many penguins live on each island, split by sex.

In [None]:
sns.countplot(data=penguins, hue='sex', x='island')

#### Now imagine we want to compare penguins from different islands, using the type of bill measurement (length or depth) as the hue.

In [None]:
# let's experiment with the possible options: barplot instead of countplot

sns.barplot(data=penguins, x='island', y='bill_length_mm')

In [None]:
# let's experiment with the possible options: boxplot instead of barplot

sns.boxplot(hue='sex', y='island', x='bill_length_mm', data=penguins)

There is no easy way to show both bill measurements in a single plot without changing the format of the data. To do so, the distinction between the two measurements needs to live in a column of its own, the same way `island` does.

### Wide versus Long Data Format
Data in pandas and tabular data in general can exist in two forms: long and wide format.

- In the long format, for each type of variable, there is a single value column and another column that contains the variable name for each of the values. This format is great for plotting with seaborn.

- In the wide format, each variable has its own column. This format is great for calculating descriptive statistics or for applying machine learning with sklearn.

The conversion between long and wide format helps you to bring data into the right format for merging, concatenation or plotting.

Before applying the transformations, make sure that your data is Tidy Data. Once your data is tidy, transformations from one format to the other will become simple.
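To make the two layouts concrete, the sketch below builds the same information by hand in both formats. The two penguins and their measurements are hypothetical values chosen for illustration.

```python
import pandas as pd

# Wide format: one column per variable, one row per penguin
wide = pd.DataFrame({
    "id": [0, 1],
    "bill_length_mm": [39.1, 46.5],
    "bill_depth_mm": [18.7, 17.9],
})

# Long format: one column names the variable, another holds its value
long = pd.DataFrame({
    "id": [0, 1, 0, 1],
    "bill_measurement": ["bill_length_mm", "bill_length_mm",
                         "bill_depth_mm", "bill_depth_mm"],
    "value": [39.1, 46.5, 18.7, 17.9],
})

print(wide.shape)  # (2, 3)
print(long.shape)  # (4, 3)
```

Both tables carry the same four measurements; only the arrangement differs.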


We can melt the two bill measurement columns into a single value column, which gives us a long-format dataframe.

In [None]:
# before we do that, it would be helpful to have an additional penguin id column
penguins['id'] = penguins.index

In [None]:
penguins.head()

`melt` is used to transform the data from wide to long format. Its main parameters:

- `id_vars`: column(s) to use as identifier variables
- `value_vars`: column(s) to unpivot; if not specified, uses all columns that are not set as `id_vars`
- `var_name`: name to use for the ‘variable’ column
- `value_name`: name to use for the ‘value’ column
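The defaults are easiest to see on a small example. The sketch below uses a hand-made table with hypothetical values and leaves out `value_vars`, `var_name`, and `value_name`, so pandas unpivots every non-identifier column and falls back to its default column names.

```python
import pandas as pd

# Hypothetical two-penguin table
df = pd.DataFrame({
    "id": [0, 1],
    "island": ["Biscoe", "Dream"],
    "bill_length_mm": [39.1, 46.5],
    "bill_depth_mm": [18.7, 17.9],
})

# value_vars omitted: every column not listed in id_vars is unpivoted
melted = pd.melt(df, id_vars=["id", "island"])

print(melted.columns.tolist())  # ['id', 'island', 'variable', 'value']
print(melted)
```

Passing `var_name` and `value_name` explicitly, as in the next cell, gives the resulting columns more descriptive names.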

In [None]:
penguins_long = pd.melt(penguins,                                        # dataset
                        id_vars=['id', 'sex', 'species', 'island'],      # Column(s) to use as identifier variables
                        value_vars=['bill_length_mm', 'bill_depth_mm'],  # Column(s) to unpivot
                        var_name='bill_measurement',                     # Name to use for the ‘variable’ column
                        value_name='value')                              # Name to use for the ‘value’ column

In [None]:
penguins_long.head()

How do the dimensions differ from the original dataframe, and how can we interpret the difference?

In [None]:
penguins_long.shape

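A: Every penguin now has 2 rows, one per bill measurement, so the number of rows has doubled. The two bill columns were collapsed into a single `value` column (with `bill_measurement` holding the variable name), and the columns not listed in `id_vars` or `value_vars` were dropped.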
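The shape of a melted table can be predicted before running the code: rows are multiplied by the number of unpivoted columns, and the columns become the identifiers plus two. A small sketch on a hand-made table with hypothetical values:

```python
import pandas as pd

# Hypothetical three-penguin table with one column we will not carry over
wide = pd.DataFrame({
    "id": [0, 1, 2],
    "island": ["Biscoe", "Dream", "Dream"],
    "bill_length_mm": [39.1, 46.5, 50.0],
    "bill_depth_mm": [18.7, 17.9, 15.2],
    "body_mass_g": [3750, 3500, 5000],
})

id_vars = ["id", "island"]
value_vars = ["bill_length_mm", "bill_depth_mm"]

long = pd.melt(wide, id_vars=id_vars, value_vars=value_vars,
               var_name="bill_measurement", value_name="value")

expected_rows = len(wide) * len(value_vars)  # one row per penguin per measurement
expected_cols = len(id_vars) + 2             # identifiers + variable + value
print(long.shape, (expected_rows, expected_cols))
```

Note that `body_mass_g` is absent from the result because it appears in neither `id_vars` nor `value_vars`.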

In [None]:
penguins_long.sort_values(by='id')

In [None]:
sns.boxplot(hue='bill_measurement', y='value', x='island', data=penguins_long)

We can go back to the wide format using the `pivot` function.

In [None]:
wide_penguins = pd.pivot(penguins_long,                  # <=== tidy/long format df
                         columns='bill_measurement',     # <=== column(s) whose values we want as our new columns
                         index=['id', 'island', 'sex'],  # <=== column(s) that will be used as a new index
                         values='value')                 # <=== column whose values we want to populate our new wide dataframe

In [None]:
wide_penguins.reset_index()

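#### Is the data the same as before? If not, why do you think so?

Not quite. `flipper_length_mm` and `body_mass_g` were not listed in `id_vars` or `value_vars` when we melted the data, so they were dropped and cannot be recovered from `penguins_long`. `species` is missing as well, because we left it out of the `index` when pivoting.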
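When every identifier column is carried through both steps, the round trip loses nothing. The following is a minimal sketch on a small hand-made table (hypothetical values); passing a list as `index` to `pivot` assumes a reasonably recent pandas version, as in the cell above.

```python
import pandas as pd

# Hypothetical three-penguin table
wide = pd.DataFrame({
    "id": [0, 1, 2],
    "island": ["Biscoe", "Dream", "Dream"],
    "bill_length_mm": [39.1, 46.5, 50.0],
    "bill_depth_mm": [18.7, 17.9, 15.2],
})

# Wide -> long, keeping all identifier columns
long = wide.melt(id_vars=["id", "island"],
                 var_name="bill_measurement", value_name="value")

# Long -> wide, using the same identifier columns as the index
restored = (
    long.pivot(index=["id", "island"], columns="bill_measurement", values="value")
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover 'bill_measurement' axis label
)
restored = restored[wide.columns]  # pivot sorts the new columns alphabetically

print(restored)
```

The only differences left to tidy up after `pivot` are cosmetic: the column order and the name attached to the column axis.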

### What are the average bill measurements? More about penguins statistics

There are a few ways to get descriptive statistics about the data:
- run the `.describe()` method
- perform a `groupby`
- build a pivot table
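The three options can be compared side by side on a small example. This sketch uses a hand-made table with hypothetical bill lengths rather than the real penguins data.

```python
import pandas as pd

# Hypothetical bill lengths for four penguins on two islands
df = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream"],
    "bill_length_mm": [40.0, 50.0, 38.0, 42.0],
})

# 1. describe: summary statistics over the whole column
overall = df["bill_length_mm"].describe()

# 2. groupby: mean per island, returned as a Series
by_island_groupby = df.groupby("island")["bill_length_mm"].mean()

# 3. pivot_table: mean per island, returned as a DataFrame
by_island_pivot = df.pivot_table(values="bill_length_mm",
                                 index="island", aggfunc="mean")

print(overall["mean"])
print(by_island_groupby)
print(by_island_pivot)
```

For a single grouping column, `groupby` and `pivot_table` give the same numbers; `pivot_table` becomes more convenient when a second variable should be spread across the columns.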

### Pivot Table
- `pivot_table` pivots the table and aggregates the values
- the default aggregation function is the mean
- do not use it unless you actually want aggregation
- if the index/column combinations contain duplicates, `pivot` raises an error
- in the same situation, `pivot_table` aggregates the duplicates (by default it takes their mean)
- be careful which function you use, so that you do not aggregate your data when that is not your goal
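The difference in how duplicates are handled can be checked on a small example. In this sketch the table is hand-made with hypothetical values, and two rows share the same island and measurement on purpose.

```python
import pandas as pd

# Hypothetical long table: two rows share the same (island, bill_measurement) pair
long = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream"],
    "bill_measurement": ["bill_length_mm"] * 3,
    "value": [40.0, 50.0, 38.0],
})

# pivot cannot decide which of the duplicate values to keep and raises
try:
    long.pivot(index="island", columns="bill_measurement", values="value")
    pivot_error = None
except ValueError as err:
    pivot_error = err

# pivot_table resolves the duplicates by aggregating (mean by default)
aggregated = long.pivot_table(index="island", columns="bill_measurement",
                              values="value")

print(type(pivot_error).__name__)
print(aggregated)
```

This is why `pivot_table` can silently change the data: the two Biscoe values are replaced by their average without any warning.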

In [None]:
penguins_wide = penguins_long.pivot_table(values='value',
                                          columns='bill_measurement',
                                          index=['island', 'sex'],
                                          aggfunc="mean")

In [None]:
penguins_wide

In [None]:
penguins_wide.reset_index()