#### **What is data preparation?**

Data preparation is the process of preparing source data for efficient and accurate analysis. Data-preparation activities include removing missing values, formatting features uniformly, and appending related data from external sources. Data preparation is sometimes called data munging or data wrangling.

Data wrangling has six steps.


1. **Discovering**
- Discovery, also called data exploration, familiarizes the data scientist with source data in preparation for subsequent steps.

2. **Structuring**	
- Structuring data transforms features to uniform formats, units, and scales.

3. **Cleaning**	
- Cleaning data removes or replaces missing and outlier data.

4. **Enriching**	
- Enriching data derives new features from existing features and appends new data from external sources.

5. **Validating**	
- Validating data verifies that the dataset is internally consistent and accurate.
    
6. **Publishing**	
- Publishing data makes the dataset available to other data scientists by storing data in a database, uploading data to the cloud, or distributing data files.

---

#### **Extract, Transform, Load**
***Extract, Transform, Load (ETL)*** is a process that extracts data from transactional databases, transforms the data, and loads the data into an analytic database. ETL transforms data in a staging area, such as a temporary database, prior to loading data to the analytic database. 

***Extract, Load, Transform (ELT)*** is a variant of ETL that loads raw data directly to the analytic database and transforms the data in place.

ETL is similar to data wrangling. Both processes structure, clean, enrich, and publish data. However, data wrangling is usually an informal process, executed manually by data scientists on a static dataset. ETL is an automated process that repeatedly extracts new data from transactional databases. ETL is usually applied to larger data volumes and more sources than data wrangling.

ETL tools, also called data integration tools, extract and merge data from many different database systems. In principle, ETL tools can be used for data wrangling. However, because data wrangling is often manual and ad-hoc, data scientists usually prefer spreadsheets or programming languages such as Python, R, and SQL.

---

#### **Data Preparation with Python**
***pandas*** is a Python package that supports data wrangling. ***DataFrame*** is a pandas class that stores and manipulates datasets. In this material, ***dataframe***, in lowercase, refers to a DataFrame object.

Dataframes consist of rows and columns, representing dataset instances and features. Each column has a data type. Rows and columns are identified by integer or string ***labels***. The set of row labels is called the ***index*** and the set of column labels is called ***columns***. Usually, row labels are automatically generated integers and column labels are manually specified strings.
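The terms above can be seen on a tiny dataframe. The sketch below is illustrative only: it reuses three rows and two features from the country table shown later in this material, and the variable name `df` is an arbitrary choice.

```python
import pandas as pd

# Three instances (rows) and two features (columns), taken from the
# country table that appears later in this material.
df = pd.DataFrame(
    {
        "Country": ["Albania", "Algeria", "Angola"],
        "Years": [10.0, 8.0, 5.1],
    }
)

print(df.index.tolist())    # row labels: automatically generated integers
print(df.columns.tolist())  # column labels: manually specified strings
print(df.dtypes)            # each column has its own data type
```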

#### **Data Preparation with pandas**

Method	|  Parameters   |   Description

**read_csv()**
- filepath_or_buffer, sep=NoDefault.no_default	
- Returns a dataframe constructed from a CSV file. filepath_or_buffer is a string containing the full path for the CSV file. When the file is in the same directory as the code, only the file name is needed. sep specifies the character that separates values in the CSV file.

**read_excel()**	
- io, sheet_name=0	
- Returns a dataframe constructed from an Excel spreadsheet. io is a string containing the full path for the Excel file. When the file is in the same directory as the code, only the file name is needed. sheet_name is a string or integer that specifies which Excel sheet to read.

**read_sql_table()**	
- table_name, con, schema=None, columns=None	
- Returns a dataframe constructed from an SQL table. table_name specifies the table name. con specifies a database server connection string. schema specifies the schema in the database server. columns specifies which table columns to include in the dataframe.

**DataFrame()**	
- data=None, index=None, columns=None
- Returns a new dataframe. data specifies dataframe values as an array, dictionary, or another dataframe. index and columns specify row and column labels. The defaults index=None and columns=None generate integer labels.

**dataframe.at[]**	
- index, column	
- Returns the dataframe value stored at index and column.

**dataframe.info()**
- verbose=None	
- Prints information about the dataframe, such as the number of rows and columns, data types, and memory usage. If verbose=False, prints only summary dataframe information and hides column details.

**dataframe.loc[]**	
- indexRange, columnRange	
- Returns a slice of dataframe. indexRange specifies rows in the slice, as startIndex:endIndex. columnRange specifies columns in the slice as startLabel:endLabel. Unlike ordinary Python slices, label slices include the end label.

**dataframe.sort_values()**	
- by, axis=0, ascending=True, inplace=False	
- Sorts dataframe columns or rows. by specifies indexes or labels on which to sort. axis specifies whether to sort rows (0) or columns (1). ascending specifies whether to sort ascending or descending. inplace specifies whether to sort dataframe or return a new dataframe.
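A minimal sketch of several methods from the table, assuming an in-memory CSV in place of a file on disk so the example is self-contained. The four rows are copied from the country table in this material; the variable names are arbitrary.

```python
import io
import pandas as pd

# An in-memory CSV stands in for a file path (assumption for portability).
csv_text = """Country,Continent,Years
Afghanistan,Asia,3.8
Albania,Europe,10.0
Algeria,Africa,8.0
Angola,Africa,5.1
"""
df = pd.read_csv(io.StringIO(csv_text), sep=",")

# Print a summary without per-column details.
df.info(verbose=False)

# Single value, addressed by row label and column label.
years_albania = df.at[1, "Years"]

# Label-based slice: both endpoints are included.
subset = df.loc[1:2, "Country":"Continent"]

# Sort rows by a column; inplace=False (the default) returns a new dataframe.
by_years = df.sort_values(by="Years", ascending=False)
print(by_years)
```

Because `inplace` is left at its default, `df` keeps its original row order and `by_years` holds the sorted copy.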

---
---
---

## **Data Manipulation: Comparison of Groups**

The first step of the data wrangling process is data discovery, or exploring patterns and trends within a dataset. Data exploration can be done visually through plots or figures, or numerically by comparing descriptive statistics.

***Data manipulation*** is the process of organizing or subsetting a dataset to explore a research problem. Data manipulation is used to split datasets into multiple groups based on a categorical feature, or compare values of a dataset according to a specific condition. After data manipulation, descriptive statistics like the mean, median, or proportion can be calculated and compared across groups or conditions.

Ex.
1. The United Nations has several programs designed to increase access to education. One measure of educational access is the average years in school.
2. GDP and EducationYears are unavailable for Brazil. Since data is missing, Brazil is filtered, or ignored, during data manipulation.
3. The mean, or average, EducationYears for the five countries in the dataset with data available is (4.7 + 8.5 + 5.7 + 14.2 + 13.5) / 5 = 9.32, or about 9.3 years.
4. However, differences exist between countries, which affect access to education. Countries in the same continent or region are more likely to have similar education structures.
5. **Grouping by continent highlights regional differences in education. Based on this data, students in Asia tend to spend less time in school than students in Europe and North America.**
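The filtering and averaging in the example can be sketched as follows. The EducationYears values come from the example; the country names other than Brazil are placeholders, because the example does not name them.

```python
import numpy as np
import pandas as pd

# EducationYears values are from the example above. "Country A" to
# "Country E" are placeholder names (assumption), not real labels.
education = pd.DataFrame(
    {
        "Country": ["Country A", "Country B", "Country C",
                    "Country D", "Country E", "Brazil"],
        "EducationYears": [4.7, 8.5, 5.7, 14.2, 13.5, np.nan],
    }
)

# Filter out instances with missing EducationYears (Brazil).
available = education.dropna(subset=["EducationYears"])

mean_years = available["EducationYears"].mean()
print(round(mean_years, 1))  # 9.3
```

pandas skips missing values in `mean()` by default, so the explicit `dropna()` mainly makes the filtering step visible.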

---

#### **Grouping Data**
***Grouping*** is used to separate a dataframe into subsets based on levels of a categorical feature. In some cases, a different analysis or model may be applied to each group. In other cases, grouping may be a temporary operation for calculating group sizes or descriptive statistics like group means. A ***frequency table*** is a table containing group sizes for a categorical feature.

Ex. 
1. A data scientist would like to compare the mean years of schooling (Years) for each continent.
2. The dataset is grouped into three subsets: one subset for Africa, one subset for the Americas, and one subset for Asia.
3. After grouping, the mean is calculated for each subset.
4. Means for each continent are combined into a single table. Based on the table, countries in Africa have the lowest mean years in school, then Asia, then the Americas.
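A minimal sketch of the same split, calculate, and combine steps with pandas. It uses only eight rows copied from the country table in this material, so the means are for this small subset and not for the full dataset.

```python
import pandas as pd

# Eight rows from the country table in this material (a subset only).
schooling = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Algeria", "Angola", "Argentina",
                    "Uruguay", "Uzbekistan", "Vietnam", "Zambia"],
        "Continent": ["Asia", "Africa", "Africa", "Americas",
                      "Americas", "Asia", "Asia", "Africa"],
        "Years": [3.8, 8.0, 5.1, 9.9, 8.7, 11.5, 8.2, 7.0],
    }
)

# Frequency table: group sizes for the categorical feature.
frequency = schooling["Continent"].value_counts()
print(frequency)

# Split by continent, compute each subset's mean, combine into one table.
mean_years = schooling.groupby("Continent")["Years"].mean().sort_values()
print(mean_years.round(2))
```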

---

#### **Pivot Tables**

A ***pivot table*** calculates and displays descriptive statistics after grouping based on values of two categorical features. One categorical feature is assigned to the pivot table's rows, and another categorical feature is assigned to the columns. A ***contingency table*** is a special case of a pivot table in which the descriptive statistic is the number of instances in each combination of categorical features.

Ex.

1. Pivot tables provide the number of instances that share a combination of two categorical features.
2. The row feature's unique values are listed on the pivot table's rows.
3. The column feature's unique values are listed on the pivot table's columns.
4. A descriptive statistic like the number of instances is added to each row/column combination. Ex: 26 European countries have high internet access. 13 Asian countries have low internet access.
5. Pivot tables may contain descriptive statistics for additional features. Ex: The mean years in school for European countries with high internet access is 11.6 years, and the mean for Asian countries with low internet access is 6.3 years.
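The two kinds of table described above can be sketched on a small sample. The nine rows are copied from the country table in this material; using `pd.crosstab` for the contingency table is one reasonable choice, not the only one.

```python
import pandas as pd

# Nine rows from the country table in this material (a subset only).
sample = pd.DataFrame(
    {
        "Continent": ["Asia", "Europe", "Africa", "Africa", "Americas",
                      "Americas", "Asia", "Asia", "Africa"],
        "Internet access": ["Low", "Moderate", "Low", "Low", "High",
                            "High", "Moderate", "Moderate", "Low"],
        "Years": [3.8, 10.0, 8.0, 5.1, 9.9, 8.7, 11.5, 8.2, 7.0],
    }
)

# Contingency table: number of instances per combination of categories.
contingency = pd.crosstab(sample["Continent"], sample["Internet access"])
print(contingency)

# Pivot table with a different descriptive statistic: mean years in school.
mean_table = sample.pivot_table(
    values="Years", index="Continent", columns="Internet access", aggfunc="mean"
)
print(mean_table)
```

Combinations with no instances show a count of 0 in the contingency table and NaN in the table of means.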

---

#### **Data Manipulation with Python**

In [4]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import data and display
country = pd.read_csv('/Users/dylanlam/Documents/GitHub/data_science_practice_and_skills/datasets/country_complete.csv')
country


| | Country | Continent | Years | Internet access | Emissions range | Fertility | Emissions | Internet |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 3.8 | Low | Low | 4.33 | 0.254 | 16.8 |
| 1 | Albania | Europe | 10.0 | Moderate | Low | 1.71 | 1.590 | 65.4 |
| 2 | Algeria | Africa | 8.0 | Low | Moderate | 2.64 | 3.690 | 49.0 |
| 3 | Angola | Africa | 5.1 | Low | Low | 5.55 | 1.120 | 29.0 |
| 4 | Argentina | Americas | 9.9 | High | Moderate | 2.26 | 4.410 | 77.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146 | Uruguay | Americas | 8.7 | High | Moderate | 1.97 | 2.010 | 80.7 |
| 147 | Uzbekistan | Asia | 11.5 | Moderate | Moderate | 2.23 | 2.810 | 55.2 |
| 148 | Vietnam | Asia | 8.2 | Moderate | Moderate | 1.95 | 2.160 | 69.8 |
| 149 | Zambia | Africa | 7.0 | Low | Low | 4.87 | 0.302 | 14.3 |


In [5]:
# Categorical features are sorted in alphabetical order by default
# np.size counts the number of entries
country['Internet access'] = country['Internet access'].astype('category')
country.pivot_table(
    values='Years', index='Continent', columns='Internet access', aggfunc=np.size
)

| Continent / Internet access | High | Low | Moderate | Very high |
|---|---|---|---|---|
| Africa | 1.0 | 36.0 | 8.0 | NaN |
| Americas | 8.0 | 7.0 | 10.0 | 1.0 |
| Asia | 10.0 | 13.0 | 9.0 | 8.0 |
| Europe | 26.0 | NaN | 3.0 | 7.0 |
| Oceania | 2.0 | 1.0 | 1.0 | NaN |


In [6]:
# cat.reorder_categories is useful for rearranging the order
# (ex: low to high)
country['Internet access'] = country['Internet access'].cat.reorder_categories(
    ['Low', 'Moderate', 'High', 'Very high']
)
# Display the number of countries in a pivot table of continent and
# internet access
country.pivot_table(
    values='Years', index='Continent', columns='Internet access', aggfunc=np.size
)

| Continent / Internet access | Low | Moderate | High | Very high |
|---|---|---|---|---|
| Africa | 36.0 | 8.0 | 1.0 | NaN |
| Americas | 7.0 | 10.0 | 8.0 | 1.0 |
| Asia | 13.0 | 9.0 | 10.0 | 8.0 |
| Europe | NaN | 3.0 | 26.0 | 7.0 |
| Oceania | 1.0 | 1.0 | 2.0 | NaN |


In [10]:
# Which 7 countries in the Americas have low Internet access?
country[(country['Continent'] == 'Americas') & (country['Internet access'] == 'Low')][['Country', 'Continent', 'Internet access']]

| | Country | Continent | Internet access |
|---|---|---|---|
| 14 | Belize | Americas | Low |
| 17 | Bolivia | Americas | Low |
| 43 | El Salvador | Americas | Low |
| 56 | Guatemala | Americas | Low |
| 59 | Haiti | Americas | Low |
| 60 | Honduras | Americas | Low |
| 101 | Nicaragua | Americas | Low |


---
---
---


## **Structuring Data**

#### **Formatting Data**

All data within a single column and similar data in multiple columns should be stored in a uniform format. Ex:

- All dates and times might be stored as the datetime data type, with a 24-hour clock in Coordinated Universal Time (UTC).
- All lengths might be stored as meters.
- All percentages might be stored as decimal values between 0 and 1, rather than integers between 0 and 100.
- All names of people might be stored as 'FirstName LastName' with no prefix, suffix, or middle initial.

A uniform storage format facilitates aggregating and comparing data within and across columns. Standardizing the storage format also minimizes analysis errors.

Although storage formats should be uniform, display formats may vary according to the audience. Ex: Gross Domestic Product (GDP) might be stored as integer dollars but displayed as billions of dollars.
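A minimal sketch of uniform storage formats and a separate display format. The column names and all values are hypothetical and exist only to illustrate the conversions.

```python
import pandas as pd

# Hypothetical raw records with mixed time zones, centimeters, and
# integer percentages (all values are made up for illustration).
raw = pd.DataFrame(
    {
        "Timestamp": ["2024-03-01 14:30:00+01:00", "2024-03-01 09:15:00-05:00"],
        "LengthCm": [250.0, 1200.0],
        "GrowthPercent": [5, 12],
        "GDP": [1_500_000_000, 23_000_000_000],
    }
)

# Storage format: UTC datetimes, meters, decimals between 0 and 1.
clean = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(raw["Timestamp"], utc=True),
        "LengthM": raw["LengthCm"] / 100,
        "Growth": raw["GrowthPercent"] / 100,
        "GDP": raw["GDP"],  # stored as integer dollars
    }
)

# Display format: derived on demand, the stored integers are unchanged.
gdp_display = clean["GDP"].map(lambda dollars: f"${dollars / 1e9:.1f} billion")
print(clean)
print(gdp_display.tolist())
```

Keeping the display string out of the stored dataframe means GDP can still be summed or compared numerically.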

---

#### **Feature Scaling**

The numeric features in a dataset often have different scales. In some datasets, scales may differ by orders of magnitude. Many algorithms execute faster or generate better results when scales are similar or identical. ***Feature scaling*** converts numeric features to uniform ranges. Two common feature scaling methods are ***standardization*** and normalization.

Standardization converts features to a range centered at 0, with 1 representing a standard deviation:

$$x_{\text{standardized}} = \frac{x_{\text{original}} - \mu_x}{\sigma_x}$$

$\mu_x$ is the mean and $\sigma_x$ is the standard deviation of the feature. The standardized value is called a z-score. Since each unit represents one standard deviation, most z-scores fall between -2 and 2.

***Normalization*** converts features to the range [0,1]:

$$x_{\text{normalized}} = \frac{x_{\text{original}} - \min_x}{\max_x - \min_x}$$

Standardization is usually preferred over normalization, since standardization positions values relative to the mean and standard deviation. Normalization is useful when algorithms require all features on identical scales.

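Standardization is usually the better choice when outliers are present. Outliers still influence the mean and standard deviation, but standardized values are not forced into a fixed range, whereas a single extreme value compresses most normalized values into a small part of [0,1].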

***Terminology***

- Feature scaling terminology varies. Standardization is sometimes called z-score normalization. Normalization is sometimes called min-max scaling.

Ex. 
1. The housing dataset has the features Price and Age. The feature scales differ by orders of magnitude.
2. Standardized Price values are computed from the mean, 170,590, and standard deviation, 74,010.
3. Standardized Age values are computed from the mean, 21.2, and standard deviation, 5.8.
4. Normalized Price values are computed from the minimum, 90,300, and maximum, 269,500.
5. Normalized Age values are computed from the minimum, 14, and maximum, 28.
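The statistics quoted in the example can be reproduced by applying the two formulas directly. This sketch assumes the same five-row housing data used in the scaling code in this material, and it assumes the population standard deviation (`ddof=0`), since that is the choice that matches the quoted values.

```python
import pandas as pd

housing = pd.DataFrame(
    {
        "Price": [90300, 150500, 269500, 98000, 244650],
        "Age": [14, 27, 22, 15, 28],
    }
)

# ddof=0 selects the population standard deviation. pandas defaults to
# ddof=1 (sample standard deviation), which gives slightly larger values.
mean = housing.mean()
std = housing.std(ddof=0)

# Standardization: (x - mean) / standard deviation
standardized = (housing - mean) / std

# Normalization: (x - min) / (max - min)
normalized = (housing - housing.min()) / (housing.max() - housing.min())

print(mean.round(1))
print(std.round(1))
print(standardized.round(6))
print(normalized.round(6))
```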

---

#### **Structuring Data with Python**

The Python language, and the pandas and sklearn packages, have many data structuring methods. Selected methods and functions that format, scale, and unpack data are described in the tables below. The tables include all required parameters and important optional parameters, but exclude infrequently used optional parameters.

string[start:end] is slice syntax rather than a method, but it is useful for unpacking string columns and is therefore included in the table.

Methods that change data contain an optional copy parameter. If copy is True, changes are returned in a new dataframe or array. If copy is False, changes are made to the input dataframe or array.

##### **Python data structuring methods.**

**Method	Parameters	Description**

***string[start:end]***	
- none	
- Returns the substring of string that begins at the index start and ends at the index end - 1.

***string.capitalize(), string.upper(), string.lower(), string.title()***	
- none	
- Returns a copy of string with the initial character uppercase, all characters uppercase, all characters lowercase, or the initial character of all words uppercase.

***to_datetime()***	
- arg	
- Converts arg to datetime data type and returns the converted object. Data type of arg may be int, float, str, datetime, list, tuple, one-dimensional array, Series, or DataFrame.

***to_numeric()***	
- arg	
- Converts arg to numeric data type and returns the converted object. Data type of arg may be scalar, list, tuple, one-dimensional array, or Series.


##### **pandas data structuring methods**

**Method	Parameters	Description**

***df.astype()***	
- dtype, copy=True	
- Converts the data type of all dataframe df columns to dtype. To alter individual columns, specify dtype as {col: dtype, col:dtype, . . .}.

***df.insert()***	
- loc, column, value	
- Inserts a new column with label column at location loc in dataframe df. value is a Scalar, Series, or Array of values for the new column.

##### **sklearn data structuring methods**

**Method	Parameters	Description**

***preprocessing.scale()***
- X, axis=0, with_mean=True, with_std=True, copy=True	
- Standardizes data in input X of data type Array or DataFrame. axis indicates whether to standardize along columns (0) or rows (1). with_mean=True centers the data at the mean value. with_std=True scales the data so that one represents a standard deviation.

***preprocessing.MinMaxScaler().fit_transform()***	
- feature_range=(0,1), copy=True, X	
- Normalizes data in input X, a fit_transform() parameter of data type Array or DataFrame. feature_range specifies the range of scaled data. feature_range and copy are MinMaxScaler() parameters.
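A minimal sketch of the formatting and unpacking methods from the tables above. The records are hypothetical, and the `.str` accessor is used as the column-wise counterpart of the plain string methods.

```python
import pandas as pd

# Hypothetical records stored entirely as text (values are made up).
df = pd.DataFrame(
    {
        "Name": ["ada LOVELACE", "alan turing"],
        "Joined": ["2021-01-15", "2020-11-30"],
        "Score": ["88", "92.5"],
    }
)

# Slicing a plain string: characters at index 0 through 3.
year_text = "2021-01-15"[0:4]

# .str applies a string method to every value in a column.
df["Name"] = df["Name"].str.title()

# Convert text columns to datetime and numeric data types.
df["Joined"] = pd.to_datetime(df["Joined"])
df["Score"] = pd.to_numeric(df["Score"])

# Insert a derived column at position 1, then set its data type.
df.insert(1, "JoinYear", df["Joined"].dt.year)
df = df.astype({"JoinYear": "int64"})

print(df)
print(df.dtypes)
```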

In [12]:
import pandas as pd
from sklearn import preprocessing

data = {'Price': [90300, 150500, 269500, 98000, 244650],
        'Age': [14, 27, 22, 15, 28]}

original = pd.DataFrame(data)
original

| | Price | Age |
|---|---|---|
| 0 | 90300 | 14 |
| 1 | 150500 | 27 |
| 2 | 269500 | 22 |
| 3 | 98000 | 15 |
| 4 | 244650 | 28 |


In [16]:
# Standardize dataframe and return as an array
standardizedArray = preprocessing.scale(original)

# Convert standardized array to dataframe 'standardized'
standardized = pd.DataFrame(standardizedArray, columns=["Price", "Age"])
standardized

| | Price | Age |
|---|---|---|
| 0 | -1.084852 | -1.231895 |
| 1 | -0.271449 | 0.99236 |
| 2 | 1.336439 | 0.136877 |
| 3 | -0.980812 | -1.060798 |
| 4 | 1.000674 | 1.163456 |


In [15]:
# Normalize dataframe and return as an array
normalizedArray = preprocessing.MinMaxScaler().fit_transform(original)

# Convert normalized array to dataframe 'normalized'
normalized = pd.DataFrame(normalizedArray, columns=["Price", "Age"])
normalized

| | Price | Age |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 0.335938 | 0.928571 |
| 2 | 1.0 | 0.571429 |
| 3 | 0.042969 | 0.071429 |
| 4 | 0.861328 | 1.0 |


---
---
---

## **Cleaning Data**

#### **Dirty Data**

Raw datasets often contain missing, outlier, and duplicate data.

- ***Missing data*** is an unknown or inapplicable value. In a database, missing data is represented as NULL. In Python, missing data is represented as NaN (not a number), NaT (not a time), None (an unspecified object), or a blank value.

- ***Outlier data*** is a numeric value that is much larger or smaller than other values in the same feature. An outlier is usually defined as a value more than two or three standard deviations from the feature mean.

- ***Duplicate data*** are two or more identical instances in a dataset. Duplicate instances are usually erroneous and should be removed.

Missing, outlier, and duplicate data are collectively called ***dirty data***. A ***dirty instance*** and a ***dirty feature*** contain dirty data.

Dirty data creates bias and inefficiencies in data analysis. Data scientists may struggle to interpret missing data. Values in erroneous duplicates appear too often and are weighted too heavily. Outliers skew results due to one potentially erroneous value. Consequently, missing, outlier, and duplicate data should be corrected or deleted.
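The three kinds of dirty data can be detected with a few pandas calls. This is a sketch on a hypothetical dataset: the values are made up, and the two-standard-deviation cutoff is one of the thresholds mentioned above, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one missing value, one extreme value, and one
# repeated row (all values are made up for illustration).
df = pd.DataFrame(
    {
        "City": ["A", "B", "C", "D", "E", "F", "G", "H", "H"],
        "Temp": [21.0, 22.5, np.nan, 20.0, 23.0, 21.5, 95.0, 22.0, 22.0],
    }
)

# Missing data: count missing values per feature.
missing_counts = df.isna().sum()

# Duplicate data: True marks a row identical to an earlier row.
duplicate_rows = df.duplicated()

# Outlier data: z-scores more than 2 standard deviations from the mean.
z = (df["Temp"] - df["Temp"].mean()) / df["Temp"].std()
outliers = df[z.abs() > 2]

print(missing_counts)
print(df[duplicate_rows])
print(outliers)
```

In very small samples a single extreme value inflates the standard deviation, so a three-standard-deviation cutoff may never be reached; this is one reason the cutoff is a judgment call.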

---

#### **Discarding data**

Dirty data may be removed from a dataset by discarding instances, discarding features, or pairwise discarding.

***Discarding instances***, also called ***listwise deletion*** or ***complete case removal***, removes dirty instances from the dataset. Dirty instances are usually discarded when:

- The dirty instances comprise a small percentage of the dataset.
- The dirty instances are random. When missing or outlier values are correlated with values in another feature, discarding dirty instances introduces bias.
- Instances are duplicates. Usually, duplicate instances are erroneous, and one instance should be discarded.

***Discarding features*** removes dirty features that contain a high percentage of missing values, such as 60% or more. Discarding features does not usually apply to outlier data since, by definition, only a small percentage of values can be outliers. Discarding features never applies to duplicate data.

***Pairwise discarding*** retains dirty instances for some analyses and discards dirty instances for others. Instances are discarded only when an analysis uses a dirty feature. With pairwise discarding, the total number of instances varies for different analyses, which complicates comparisons and correlations. For this reason, pairwise discarding is not commonly used.
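The three discarding strategies can be sketched as follows. The dataset is hypothetical, and the 60% threshold is the example figure from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 'Notes' is mostly missing, 'Income' has one gap.
df = pd.DataFrame(
    {
        "Age": [34, 45, 29, 51, 38],
        "Income": [52000.0, np.nan, 48000.0, 61000.0, 55000.0],
        "Notes": [np.nan, np.nan, "call back", np.nan, np.nan],
    }
)

# Discarding features: drop columns with 60% or more missing values.
missing_share = df.isna().mean()
features_kept = df.drop(columns=missing_share[missing_share >= 0.6].index)

# Discarding instances (listwise deletion): drop rows that still
# contain any missing value.
complete_cases = features_kept.dropna(axis=0, how="any")

# Pairwise discarding: each analysis ignores only the rows that are
# missing in the feature it uses, so the instance count varies.
mean_age = features_kept["Age"].mean()        # uses all 5 rows
mean_income = features_kept["Income"].mean()  # uses the 4 rows with Income

print(features_kept.columns.tolist())
print(len(complete_cases), mean_age, mean_income)
```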

---

#### **Imputing Data**
Imputing data replaces missing and outlier data with new values. Imputing is more complex than discarding but retains all instances and features. Data may be imputed in several ways:

- ***Hot-deck and cold-deck imputation*** replace missing and outlier data with a value from a randomly selected instance. In hot-deck imputation, the value is selected from other instances in the same dataset. In cold-deck imputation, the value is selected from a different dataset.

- ***Mean imputation*** replaces missing and outlier data with the mean value of the feature. Missing and outlier data are excluded from the computation of the mean.

- ***Regression imputation*** replaces missing and outlier data with a value computed from a regression model. In the regression model, the dependent variable is the dirty feature and the independent variables are other features. ***Stochastic regression imputation*** introduces uncertainty by adding a random error term, scaled to the regression's residual variance, to the new value. Regression models are discussed elsewhere in this material.

Regression imputation is valuable if the dirty feature is highly correlated with other features. If not, mean imputation is commonly used. Hot- and cold-deck imputation were common when less computer power was available to compute mean and regression but are not widely used today.
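Mean imputation and regression imputation can be compared on a small case. This sketch uses a hypothetical dataset in which the dirty feature is perfectly correlated with another feature, and it fits the regression with `numpy.polyfit` as a simple stand-in for a full regression model.

```python
import numpy as np
import pandas as pd

# Hypothetical data: where known, Weight equals 0.5 * Height - 20.
df = pd.DataFrame(
    {
        "Height": [150.0, 160.0, 170.0, 180.0, 190.0],
        "Weight": [55.0, 60.0, 65.0, 70.0, np.nan],
    }
)

# Mean imputation: the mean is computed from the known values only.
mean_imputed = df["Weight"].fillna(df["Weight"].mean())

# Regression imputation: fit Weight on Height using complete rows,
# then predict Weight for the rows where it is missing.
known = df.dropna(subset=["Weight"])
slope, intercept = np.polyfit(known["Height"], known["Weight"], deg=1)
predicted = slope * df["Height"] + intercept
regression_imputed = df["Weight"].fillna(predicted)

print(mean_imputed.tolist())
print(regression_imputed.round(2).tolist())
```

Here mean imputation fills in 62.5 while regression imputation fills in about 75, which follows the trend in Height. The gap illustrates why regression imputation is valuable when features are highly correlated.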

---

#### **Cleaning data with Python**

In Python, three special symbols represent missing values:

- None represents any missing Python object, such as a string.

- NaN represents a missing numeric value. NaN is a NumPy value, specified as numpy.nan. The alias numpy.NaN was removed in NumPy 2.0.

- NaT represents a missing datetime value. NaT is a pandas value, specified as pandas.NaT.

In addition, blank and 0 sometimes indicate a missing value, as in any tool.

The pandas DataFrame class has methods that identify dirty data and discard and impute values. Important data cleaning methods are described in the table below. The table includes all required parameters and important optional parameters but excludes infrequently used optional parameters.

Methods that change data contain an optional inplace parameter. If inplace is True, changes are made in the input dataframe. If inplace is False, changes are returned in a new dataframe.

##### **pandas data cleaning methods.**

**Method	Parameters	Description**

***df.drop()***	
- labels=None, axis=0, inplace=False	
- Removes rows (axis=0) or columns (axis=1) from dataframe df. labels specifies the labels of rows or columns to drop.

***df.drop_duplicates()***	
- subset=None, inplace=False	
- Removes duplicate rows from df. subset specifies the labels of columns used to identify duplicates. If subset=None, all columns are used.

***df.dropna()***	
- axis=0, how='any', subset=None, inplace=False	
- Removes rows (axis=0) or columns (axis=1) containing missing values from df. subset specifies labels on the opposite axis to consider for missing values. how indicates whether to drop the row or column if any or if all values are missing.

***df.duplicated()***	
- subset=None	
- Returns a Boolean series that identifies duplicate rows in df. True indicates a duplicate row. subset specifies the labels of columns used to identify duplicates. If subset=None, all columns are used.

***df.fillna()***	
- value=None, inplace=False	
- Replaces NA and NaN values in df with value, which may be a scalar, dict, Series, or DataFrame.

***df.isnull(), df.isna()***	
- none	
- Returns a dataframe of Boolean values. True in the returned dataframe indicates the corresponding value of the input df is None, NaT or NaN.

***df.mean()***	
- axis=0, skipna=True, numeric_only=False	
- Returns the mean of each column (axis=0) or each row (axis=1) of df. skipna indicates whether to exclude missing values from the calculation. numeric_only indicates whether to include only numeric columns. The default is False in pandas 2.x and was None in pandas 1.x.

***df.replace()***	
- to_replace=None, value=NoDefault.no_default, inplace=False	
- Replaces to_replace values in df with value. to_replace and value may be str, dict, list, regex, or other data types.
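A minimal sketch that chains several methods from the table. The dataset is hypothetical, including the use of -1.0 as a placeholder for an unknown value.

```python
import numpy as np
import pandas as pd

# Hypothetical records: a repeated row, a -1.0 placeholder that stands
# for "unknown", and genuinely missing values (all values are made up).
df = pd.DataFrame(
    {
        "Country": ["Chile", "Chile", "Peru", "Fiji", "Kenya"],
        "Years": [10.0, 10.0, -1.0, np.nan, 6.0],
        "Code": ["CL", "CL", "PE", None, "KE"],
    }
)

n_duplicates = df.duplicated().sum()

# Remove the repeated row; inplace=False returns a new dataframe.
cleaned = df.drop_duplicates()

# Turn the placeholder into a proper missing value.
cleaned = cleaned.assign(Years=cleaned["Years"].replace(-1.0, np.nan))

# Impute per column: feature mean for Years, a label for Code.
cleaned = cleaned.fillna({"Years": cleaned["Years"].mean(), "Code": "unknown"})

print(cleaned)
```

Replacing the placeholder before computing the mean matters: otherwise -1.0 would be averaged in as if it were a real measurement.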

---
---
---

## **Enriching Data**

#### **Appending data**
Datasets can be enriched by appending new instances or features from external datasets. Leading sources of public datasets are described in the table below.

##### **Leading public datasets.**

**Name	Link	Description**

***Kaggle***	
- kaggle.com	
- Over 50,000 datasets on a broad range of subjects. Also provides Jupyter notebooks that analyze the datasets.

***FiveThirtyEight***	
- data.fivethirtyeight.com	
- Datasets on politics, sports, science, economics, health, and culture, initially developed to support FiveThirtyEight publications.

***University of California Irvine Machine Learning Repository***	
- archive.ics.uci.edu	
- 622 datasets, primarily in science, engineering, and business.

***Data.gov***	
- data.gov	
- U.S. government datasets on agriculture, climate, energy, maritime, oceans, and health.

***World Bank Open Data***	
- data.worldbank.org	
- Global datasets on subjects such as health, education, agriculture, and economics.

***Nasdaq Data Link***	
- data.nasdaq.com	
- Financial and economic datasets.

To append instances or features, prepare a subset of external data as follows:

1. Identify the external dataset of interest.
2. Identify a matching feature in the external and original datasets. The matching feature must uniquely identify instances of both datasets.
3. Usually, only a subset of the external dataset is of interest. Extract the subset, including the matching feature.
4. Structure and clean the subset, as described elsewhere in this material.

To append instances, insert subset instances into the original dataset. To append features, merge subset instances with original instances using the matching feature.

Appending data may create missing data:

- When appending instances, missing data is created if the two datasets have different features.
- When appending features, missing data is created if instances of the two datasets do not match.

Discard or impute the new missing data, as described elsewhere in this material.
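Both ways in which appending creates missing data can be sketched with small dataframes. The values are taken from the country table in this material, but the split into an original dataset and external subsets is an assumption made for illustration.

```python
import pandas as pd

original = pd.DataFrame(
    {"Country": ["Albania", "Algeria"], "Years": [10.0, 8.0]}
)

# Appending instances: the external subset has a different feature
# (Fertility instead of Years), so both features gain missing values.
new_instances = pd.DataFrame({"Country": ["Angola"], "Fertility": [5.55]})
appended_rows = pd.concat([original, new_instances], axis=0, ignore_index=True)

# Appending features: Country is the matching feature. Algeria has no
# match in the external subset, so its Internet value is missing.
new_features = pd.DataFrame(
    {"Country": ["Albania", "Angola"], "Internet": [65.4, 29.0]}
)
appended_cols = original.merge(new_features, how="left", on="Country")

print(appended_rows)
print(appended_cols)
```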

#### **Enriching data with Python**

pandas has many data enriching methods. Selected methods that append and derive data are described in the table below. The table includes all required parameters and important optional parameters but excludes infrequently used optional parameters.

df.merge() emulates a relational database join. Relational joins merge two tables by specifying join columns in each table. The join columns correspond to the matching feature described in Appending data, above.

A relational join merges rows that have matching join column values. Relational joins can be executed in several ways, including inner, outer, left, and right joins. These join types specify how to handle rows that do not have matching join column values. Inner, outer, left, and right joins are described in detail elsewhere in this material.

##### **Python data enriching methods**

**Method	Parameters	Description**

***pd.concat()***
- objs, axis=0, join='outer', ignore_index=False	
- Appends dataframes specified in objs parameter. Appends rows if axis=0 or columns if axis=1. join specifies whether to perform an 'outer' or 'inner' join. Resulting index values are unchanged if ignore_index=False or renumbered if ignore_index=True.

***df.apply()***	
- func, axis=0	
- Applies the function specified in func parameter to a dataframe df. Applies function to each column if axis=0 or to each row if axis=1. Returns a Series or DataFrame.

***df.insert()***	
- loc, column, value	
- Inserts a column to df. loc specifies the integer position of the new column. column specifies a string or numeric column label. value specifies column values as a Scalar or Series.

***df.merge()***	
- right, how='inner', on=None, sort=False	
- Joins df with the right dataframe. how specifies whether to perform a 'left', 'right', 'outer', or 'inner' join. on specifies join column labels, which must appear in both dataframes. If on=None, all matching labels become join columns. sort=True sorts rows on the join columns.
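A minimal sketch of `df.merge()`, `df.apply()`, and `df.insert()` working together. The values come from the country table in this material. The split into two dataframes and the derived `InternetShare` column are hypothetical.

```python
import pandas as pd

left = pd.DataFrame(
    {"Country": ["Albania", "Algeria", "Angola"],
     "Emissions": [1.59, 3.69, 1.12]}
)
right = pd.DataFrame(
    {"Country": ["Algeria", "Angola", "Argentina"],
     "Internet": [49.0, 29.0, 77.7]}
)

# Inner join keeps only countries present in both dataframes.
inner = left.merge(right, how="inner", on="Country")

# Outer join keeps every country; unmatched values become NaN.
outer = left.merge(right, how="outer", on="Country", sort=True)

# Derive a feature row by row (axis=1): the percentage as a fraction.
share = inner.apply(lambda row: row["Internet"] / 100, axis=1)

# Insert the derived feature as the second column.
inner.insert(1, "InternetShare", share)

print(inner)
print(outer)
```

For simple arithmetic like this, a direct column expression such as `inner["Internet"] / 100` is usually faster than `apply`; `apply` is shown here because it generalizes to arbitrary row-wise functions.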

---