# Advanced Data Transformations
## An introduction to log transformation and other advanced data transformations
*** 

In this article, you will learn how to transform data in particular situations, such as when the data you’re working with is skewed. You will learn:

* How to transform data that is skewed
* What log transformation is and why it is useful
* How to implement log transformation in Python
* Different kinds of advanced data transformations

### Skewed Data

As a data analyst and scientist, you will often work with normally distributed data. Just as a refresher, normally distributed data is data where most data points are close to a central value with fewer instances as you get further away from the center. Visually, this looks like a “bell curve”:

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/advanced-data-transformations/bellcurve.jpeg width=500>

There are several reasons why having normally distributed data is useful:

* Normal distributions are symmetric around their mean. Moreover, the mean, median, and mode of a normal distribution are equal. These properties make the data easier to analyze because we can make some assumptions about how many data points are on either side of the mean.
* As shown above, approximately 68% of the data falls within 1 standard deviation of the mean, and approximately 95% of the data falls within 2 standard deviations of the mean. Again, these properties make it easier to make generalizations about the data you’re working with.
* Many statistical and machine learning methods make normality assumptions. Linear regression, for example, assumes that its errors are normally distributed. Therefore, transforming data so that it is closer to normally distributed can enhance your ability to fit an accurate predictive model.
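
The properties listed above can be checked on simulated data. The following is a minimal sketch, assuming a normal sample generated with NumPy; the mean of 50, the standard deviation of 10, and the sample size are arbitrary illustrative values.

```python
import numpy as np

# Simulated data: 100,000 draws from a normal distribution.
# The mean (50) and standard deviation (10) are arbitrary illustrative values.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=50, scale=10, size=100_000)

mean = sample.mean()
std = sample.std()

# Fraction of points within 1 and 2 standard deviations of the mean
within_1_std = np.mean(np.abs(sample - mean) <= 1 * std)
within_2_std = np.mean(np.abs(sample - mean) <= 2 * std)

print(f"mean:   {mean:.2f}")
print(f"median: {np.median(sample):.2f}")
print(f"within 1 std: {within_1_std:.3f}")  # roughly 0.68
print(f"within 2 std: {within_2_std:.3f}")  # roughly 0.95
```

The mean and median should come out nearly identical, which reflects the symmetry of the distribution.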


Oftentimes, real-life data is messy and does not conform to a normal distribution. Instead, the data might be skewed to the left or right.

For example, imagine you’re a data analyst looking at income data in a population. You may find that a lot of the data is centered around lower values but that there are high values in the data as well. This results in a skewed distribution:

<img src=https://content.codecademy.com/courses/learn-pandas/distribution-types-ii-skew-right-noline.svg width=500>

This is an example of a right-skewed data set. Lower incomes, which make up most of the data, fall on the left side of the diagram, while higher incomes fall on the right.
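
To make this shape concrete, here is a minimal sketch using simulated data. The log-normal sample, its parameters, and the `skewness` helper are assumptions for illustration and do not represent real income data. In a right-skewed sample, the long right tail pulls the mean above the median, and the skewness is positive.

```python
import numpy as np

# Hypothetical income data: a log-normal sample stands in for real incomes.
rng = np.random.default_rng(seed=1)
incomes = rng.lognormal(mean=10.5, sigma=0.8, size=50_000)

def skewness(values):
    """Moment-based sample skewness: mean cubed deviation over std cubed."""
    values = np.asarray(values, dtype=float)
    deviations = values - values.mean()
    return np.mean(deviations**3) / values.std()**3

print(f"median income: {np.median(incomes):,.0f}")
print(f"mean income:   {incomes.mean():,.0f}")   # larger than the median
print(f"skewness:      {skewness(incomes):.2f}")  # positive for right skew
```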

Feeding this data to some machine learning or statistical models will present issues if the model assumes normality, because the model will be trained on a much larger number of lower incomes. The skewness of the data violates the assumptions of the model.

In situations like this, we would like to make our data less skewed so that it conforms to model assumptions and is easier to work with. To help with this issue, we can use *Log Transformation*.

### Log Transformation

Log transformation is a data transformation method that replaces each value x of a variable with log(x).

When log transformation is applied to right-skewed data, the result is typically less skewed, or more “normal”, than before. For example, after applying log transformation to the right-skewed data set above, we would see something similar to the following:

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/advanced-data-transformations/logtransform.jpeg width=500>

Notice how the shape of the data changed. The logarithm function squeezes together the larger values in your data set and stretches out the smaller values, making the transformed data more “normal” than before.
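
The squeezing effect is easy to see on a handful of numbers. In this small sketch (the values are illustrative), each value is ten times larger than the previous one, so the gaps between neighbors grow quickly. After taking the natural log, every gap is the same size.

```python
import numpy as np

# Illustrative values, each ten times larger than the last
values = np.array([10, 100, 1_000, 10_000])

log_values = np.log(values)

# Gaps between neighbors before and after the transformation
print(np.diff(values))      # gaps grow: 90, 900, 9000
print(np.diff(log_values))  # every gap equals ln(10), about 2.303
```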

Let’s see how we can apply this on real data in Python.

### Python Implementation

Python contains libraries that make it simple to transform your data using log transformation.

One quick method is to use `NumPy`, a Python library that makes it easy to work with arrays of data and perform statistical analysis. NumPy’s `np.log()` function computes the natural logarithm of every element in an array, like so:


```python
import numpy as np
 
data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
 
# log transform
log_data = np.log(data)
 
print(log_data)
# 0, 0, 0, 0, 0.693, 0.693, 0.693, 1.098, 1.098, 1.386, 1.386
```

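Data scientists and machine learning engineers often use machine learning libraries such as `scikit-learn` to transform data. You can import `PowerTransformer` from sklearn’s `preprocessing` module. Note that `PowerTransformer` does not take a plain logarithm: it fits a power transformation (Yeo-Johnson by default) that serves the same purpose of making the data more normal-like. It also expects a 2D array with one column per feature:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# scikit-learn expects a 2D array: one row per sample, one column per feature
data_2d = np.array(data).reshape(-1, 1)

# power transform (Yeo-Johnson by default) to make the data more normal-like
power_transform = PowerTransformer()
transformed_data = power_transform.fit_transform(data_2d)
```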
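
For an exact logarithm inside a scikit-learn workflow, one option is `FunctionTransformer`, which wraps an ordinary function such as `np.log`. This is an assumption about how you might structure the code rather than something the article relies on, and the variable names below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4]

# scikit-learn transformers work on 2D arrays: one column per feature
data_2d = np.array(data, dtype=float).reshape(-1, 1)

# Wrap np.log so it can be used like any other scikit-learn transformer;
# inverse_func lets inverse_transform undo the transformation
log_transformer = FunctionTransformer(func=np.log, inverse_func=np.exp)
log_data = log_transformer.fit_transform(data_2d)

print(log_data.ravel())
```

The result should match calling `np.log()` on the data directly. The wrapper is mainly useful when the transformation needs to sit inside a scikit-learn pipeline.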

Let’s try to perform log transformation on real data in Python. We’ll be working with home price data from Kaggle.

First, we will need to import several Python libraries:

* `numpy` (to perform log transformation)
* `pandas` (to read in a CSV file and perform data analysis)
* `seaborn` (to plot our data)

```python
import numpy as np
import pandas as pd
import seaborn as sns
```

Next, we need to read in our data file and store the home price data in a new variable:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# read in data file 
home_data = pd.read_csv('home_data.csv')
 
# store home price data
home_prices = home_data['SalePrice']
```

Now that we have our data stored, we can plot it using `seaborn`:

```python
import numpy as np
import pandas as pd
import seaborn as sns
 
home_data = pd.read_csv('home_data.csv')
home_prices = home_data['SalePrice']
 
# plot data (histplot replaces distplot, which is deprecated in recent seaborn releases)
sns.histplot(home_prices, kde=True)
```

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/advanced-data-transformations/skewed.jpeg width=600>

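Notice how the distribution is skewed to the right, with a long tail of expensive homes. Log transformation should make its shape much more symmetrical (normal). Let’s apply `np.log()` to the home prices and see how skewed the transformed data is:

```python
# log transform the home prices, then check the skewness of the result
print(np.log(home_prices).skew())
```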

**`0.12133506220520406`**

Applying log transformation reduced the skewness from 1.88 to 0.12, a substantial drop!

Now that our transformed data is less skewed and more normally distributed, we can use this data in our prediction model to more accurately predict home prices.
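
If you do not have the data file at hand, the same before-and-after comparison can be tried on synthetic data. The sketch below is an assumption-laden stand-in: the log-normal sample and its parameters are invented for illustration and are not the Kaggle home prices, so the printed skewness values will differ from the ones above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a price column; not the Kaggle data
rng = np.random.default_rng(seed=42)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1_500), name="SalePrice")

# np.log applied to a pandas Series returns a Series
log_prices = np.log(prices)

print(f"skewness before: {prices.skew():.2f}")
print(f"skewness after:  {log_prices.skew():.2f}")
```

The skewness after the transformation should be much closer to zero than before.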

### Other Transformations

In this article, we discussed only one advanced data transformation method – log transformation. But different situations call for other types of transformations.

In the examples above, we dealt with income and price data, which cannot be negative. But what happens if you have zero or negative numbers in your data? The log of zero and the log of a negative number are both undefined, so log transformation will not work.
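
A short sketch shows what NumPy does in this situation. The input values are illustrative. Rather than raising an error, `np.log()` emits runtime warnings and returns `nan` for the negative number and `-inf` for zero, which can quietly corrupt later calculations.

```python
import numpy as np

values = np.array([-5.0, 0.0, 5.0])

# Suppress NumPy's runtime warnings so the output stays readable
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(values)

print(logged)  # nan for -5, -inf for 0, about 1.609 for 5
```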

In this case, you would have to explore other data transformation techniques, such as cube root transformation, which involves converting x to x^(1/3). This transformation reduces the right skewness but also has the benefit of working with zero and negative values.
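
As a minimal sketch, cube root transformation can be applied with NumPy's `np.cbrt()`, which handles negative inputs directly. The data and the `skewness` helper below are assumptions for illustration.

```python
import numpy as np

def skewness(values):
    """Moment-based sample skewness: mean cubed deviation over std cubed."""
    values = np.asarray(values, dtype=float)
    deviations = values - values.mean()
    return np.mean(deviations**3) / values.std()**3

# Illustrative right-skewed data that includes zero and negative values
values = np.array([-8.0, -1.0, 0.0, 1.0, 8.0, 27.0, 64.0, 125.0, 1000.0])

# np.cbrt works for negative numbers and zero
cube_root = np.cbrt(values)

print(cube_root)
print(f"skewness before: {skewness(values):.2f}")
print(f"skewness after:  {skewness(cube_root):.2f}")
```

Using `np.cbrt()` rather than `values ** (1/3)` is a deliberate choice here: raising a negative float to a fractional power in NumPy produces `nan`.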

To reduce left skewness, you might look at a technique such as square transformation, which involves converting x to x^2.
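
The following sketch tries square transformation on simulated left-skewed data. The beta-distributed sample (imagine scores expressed as a fraction of the maximum) and the `skewness` helper are assumptions for illustration. Squaring brings the skewness closer to zero here, although it does not remove it entirely.

```python
import numpy as np

def skewness(values):
    """Moment-based sample skewness: mean cubed deviation over std cubed."""
    values = np.asarray(values, dtype=float)
    deviations = values - values.mean()
    return np.mean(deviations**3) / values.std()**3

# Hypothetical left-skewed data on [0, 1]: most values sit near the top
rng = np.random.default_rng(seed=3)
scores = rng.beta(a=5, b=1, size=20_000)

# Square transformation: replace each x with x^2
squared = scores ** 2

print(f"skewness before: {skewness(scores):.2f}")   # negative (left skew)
print(f"skewness after:  {skewness(squared):.2f}")  # closer to zero
```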

An important part of your job as a data analyst and scientist is understanding the data you’re working with and the shape of that data. Once you know this, it becomes easier to figure out how you should transform that data.

As you progress in your journey, you’ll learn many techniques to deal with unique situations.