# Data Wrangling - Review

**Objectives**
1. Students will be able to understand the purpose and process of data wrangling.
2. Students will be able to utilize Pandas for data wrangling.
3. When given a dataset and objectives, students will be able to create a data profile report and implement wrangling.

**Non-Objectives**
1. Performing statistical analysis on data.
2. Preparing data for Machine Learning.

# Introduction to Data Wrangling

Data wrangling covers the activities that prepare data for the analysis stage. It is necessary because we assume data is of poor quality until proven otherwise.

Bad data slows down the process and confounds analysis results. It includes:
1. incorrect definitions
2. inconsistent structures
3. inaccurate values
4. incomplete information
5. structures that do not meet the analysis requirements (debatable!)

## Data Wrangling Process

Data wrangling consists of three main stages:
1. Access
2. Profile & Transform
3. Publish

### 1. Access

In this stage, we retrieve data from various sources such as databases, CSV files, spreadsheets, and APIs.
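
As a minimal sketch of the access stage with Pandas, the example below loads CSV text into a DataFrame. The column names and values are made up for illustration, and `io.StringIO` stands in for a real file path so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
raw_csv = io.StringIO(
    "customer_id,name,age\n"
    "1,Ani,34\n"
    "2,Budi,27\n"
)

# read_csv parses the text into a DataFrame and infers the column types.
customers = pd.read_csv(raw_csv)

print(customers.shape)
print(customers.dtypes)
```

Other sources follow the same pattern with a different reader, for example `pd.read_excel` for spreadsheets or `pd.read_sql` for databases.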

### 2. Profile & Transform

#### **Profiling**

Profiling is the process of understanding the current state of the data. At the start, we need to answer questions such as:
1. What is the definition of this data?
2. What is the definition of each column?
3. What should be the column data type?
4. What are the allowed values for each column?

##### Individual profiling
Checking the correctness at the cell level, including:
- correct data type, e.g., age is int, not str
- allowed values, e.g., age > 0
- correct structure, e.g., date formatted YYYY-MM-DD
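
The three cell-level checks above can be sketched in Pandas as follows. The `people` table and its values are hypothetical and contain deliberate errors so that each check flags something.

```python
import pandas as pd

# Hypothetical sample data with one problem per check.
people = pd.DataFrame({
    "age": ["34", "-5", "abc"],
    "birth_date": ["1990-01-31", "31/01/1990", "1995-12-01"],
})

# Data type: values that cannot be parsed as numbers become NaN.
age_numeric = pd.to_numeric(people["age"], errors="coerce")
bad_type = age_numeric.isna()

# Allowed values: age must be greater than 0.
bad_range = age_numeric.le(0)

# Structure: dates must be written as YYYY-MM-DD.
bad_format = ~people["birth_date"].str.fullmatch(r"\d{4}-\d{2}-\d{2}")

print(people[bad_type | bad_range | bad_format])
```

Each mask is a boolean Series, so the masks can be combined or counted to report how many cells fail each rule.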

##### Set-based profiling
Checking the correctness of values in groups, both columns and data subsets:
- duplication
- missing values
- valid percentage
- value distribution
- anomalies
- relationships between variables
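
A few of these set-based checks map directly onto Pandas methods. The sketch below uses a hypothetical `orders` table; it covers duplication, missing values, valid percentage, and value distribution, and leaves anomalies and relationships between variables aside.

```python
import pandas as pd

# Hypothetical data: one repeated row and one row with missing values.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "city": ["Jakarta", "Bandung", "Bandung", None],
    "amount": [100.0, 250.0, 250.0, None],
})

# Duplication: count rows that repeat an earlier row exactly.
duplicate_rows = int(orders.duplicated().sum())

# Missing values: count nulls per column.
missing_per_column = orders.isna().sum()

# Valid percentage: share of non-null cells per column.
valid_pct = orders.notna().mean() * 100

# Value distribution: frequencies for a categorical column,
# summary statistics for a numeric one.
city_counts = orders["city"].value_counts()
amount_summary = orders["amount"].describe()

print(duplicate_rows)
print(missing_per_column)
print(valid_pct)
```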

Profiling results should be considered in determining the next steps in the DA/DS process.



#### **Transforming**

##### Cleansing
Cleansing is the process of cleaning cell values, including:
- handling nulls
- handling duplicates
- structure validation
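
One possible way to apply these three cleansing steps in Pandas is sketched below. The table is hypothetical, and the choices made here (filling nulls with `"Unknown"`, turning badly formatted dates into `NaT`) are assumptions for illustration; the right treatment depends on the dataset and the analysis goal.

```python
import pandas as pd

# Hypothetical data: a repeated row, a null city, and a badly formatted date.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "city": ["Jakarta", "Bandung", "Bandung", None],
    "order_date": ["2020-12-31", "2020-12-30", "2020-12-30", "31/12/2020"],
})

# Handling duplicates: keep the first occurrence of each repeated row.
cleaned = orders.drop_duplicates().copy()

# Handling nulls: replace missing cities with a placeholder label.
cleaned["city"] = cleaned["city"].fillna("Unknown")

# Structure validation: dates that are not YYYY-MM-DD become NaT.
cleaned["order_date"] = pd.to_datetime(
    cleaned["order_date"], format="%Y-%m-%d", errors="coerce"
)

print(cleaned)
```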

##### Structuring
Structuring is the process of changing the structure of data, including:
- sorting rows and columns
- splitting fields, e.g., `created_at:2020-12-31 -> (create_year:2020, create_month:12, create_date:31)`
- merging fields, e.g., `(id_prov:11, id_kota:01, id_kec:23) -> id_lokasi: 110123`
- aggregation
- pivoting
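
The structuring operations listed above can be sketched with a small hypothetical `sales` table. The field names `created_at`, `id_prov`, `id_kota`, and `id_kec` follow the examples in the list; the `product` and `amount` columns and all values are made up.

```python
import pandas as pd

sales = pd.DataFrame({
    "created_at": pd.to_datetime(["2020-12-31", "2020-12-31", "2021-01-15"]),
    "id_prov": ["11", "11", "12"],
    "id_kota": ["01", "01", "02"],
    "id_kec": ["23", "24", "05"],
    "product": ["A", "B", "A"],
    "amount": [100, 50, 70],
})

# Splitting: break the timestamp into year, month, and day-of-month fields.
sales["create_year"] = sales["created_at"].dt.year
sales["create_month"] = sales["created_at"].dt.month
sales["create_date"] = sales["created_at"].dt.day

# Merging: concatenate the ID parts. They are strings, so leading zeros survive.
sales["id_lokasi"] = sales["id_prov"] + sales["id_kota"] + sales["id_kec"]

# Aggregation: total amount per year.
total_by_year = sales.groupby("create_year")["amount"].sum()

# Pivoting: years as rows, products as columns, summed amounts as values.
pivot = sales.pivot_table(
    index="create_year", columns="product", values="amount",
    aggfunc="sum", fill_value=0,
)

# Sorting: largest amounts first.
sales_sorted = sales.sort_values("amount", ascending=False)

print(pivot)
```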

##### Enrichment
Enrichment is the process of adding new information to a dataset.

Enrichment can be achieved by joining with other datasets, e.g., `Transactions JOIN Customers ON customer_id` to obtain demographic information from transactions.
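
In Pandas, this kind of join is typically written with `merge`. The sketch below uses hypothetical `transactions` and `customers` tables; a left join is an assumption chosen here so that no transaction is dropped when its customer is unknown.

```python
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "customer_id": [10, 11, 99],   # customer 99 has no demographic record
    "amount": [100, 250, 75],
})

customers = pd.DataFrame({
    "customer_id": [10, 11],
    "age": [34, 27],
    "city": ["Jakarta", "Bandung"],
})

# Left join: keep every transaction and attach customer columns where available.
enriched = transactions.merge(customers, on="customer_id", how="left")

print(enriched)
```

Transactions without a matching customer end up with nulls in the demographic columns, which is itself worth profiling after the join.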

Enrichment can also come from derived values, that is, additional information computed from existing columns, such as:
- calculating the remaining warranty period from `purchase_date` and `guarantee_duration`
- predicting sentiment from `comments` in the `ProductReview` dataset
- finding location from `address` using the Google Maps API on `Outlets` data
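
The first derivation above can be sketched as follows. The source does not state the unit of `guarantee_duration`, so this example assumes it is a number of days; the `as_of` reference date and all values are also hypothetical, with the date fixed so the result is reproducible.

```python
import pandas as pd

purchases = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2024-01-01", "2023-01-01"]),
    "guarantee_duration": [365, 365],  # assumption: duration in days
})

# Fixed reference date; in practice this might be pd.Timestamp.today().
as_of = pd.Timestamp("2024-06-01")

# Warranty expiry = purchase date + warranty length.
expiry = purchases["purchase_date"] + pd.to_timedelta(
    purchases["guarantee_duration"], unit="D"
)

# Remaining days, with expired warranties floored at 0.
purchases["remaining_days"] = (expiry - as_of).dt.days.clip(lower=0)

print(purchases)
```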

### 3. Publish

At this stage, we share the ready-to-use data via database, CSV, spreadsheet, or WhatsApp.
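
A minimal sketch of publishing to CSV with Pandas is shown below. The table is hypothetical, and an in-memory buffer is used in place of a file so the snippet leaves nothing on disk; passing a path such as `"clean_data.csv"` (an assumed name) would write a file instead.

```python
import io

import pandas as pd

clean = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Jakarta", "Bandung"],
})

# index=False leaves out the row index, which consumers rarely need.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```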