# Hypothesis Testing in Exploratory Data Analysis (EDA): Correlations and Normality

Let's look at how hypothesis testing can formalize the questions we explored in the Exploratory Data Analysis (EDA) lessons. It lets us statistically evaluate relationships between variables and examine their distributions more rigorously.

We'll start by applying hypothesis testing to relationships between variables, and then we'll examine distributions, specifically normality.

We'll use the housing price dataset. 

In [None]:
import pandas as pd

# Load the housing price dataset
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/housing_price_eda.csv")

# Correlation

During EDA, we often have certain presumptions about the data. For example, we might hypothesize that houses with larger 'LotArea' have higher 'SalePrice'. Hypothesis testing allows us to test these assumptions rigorously.

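To test the relationship statistically, we use a correlation test such as Pearson's correlation test, which measures the strength of a linear relationship. The **null hypothesis is that there is no linear relationship between the two variables** (the correlation coefficient is zero).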
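
As a minimal sketch of how this test can be run, the cell below uses `scipy.stats.pearsonr`. To keep it self-contained it builds a small synthetic stand-in for the housing data (the `demo` frame, its coefficients, and the 0.05 significance level are assumptions for illustration, not values from the real dataset); with the real data you would pass `df["LotArea"]` and `df["SalePrice"]` instead.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the housing data (an assumption, not the real file):
# SalePrice is constructed to rise with LotArea, plus random noise.
rng = np.random.default_rng(42)
lot_area = rng.lognormal(mean=9.0, sigma=0.4, size=500)
sale_price = 50_000 + 12 * lot_area + rng.normal(0, 40_000, size=500)
demo = pd.DataFrame({"LotArea": lot_area, "SalePrice": sale_price})

# Pearson's test: H0 is "the linear correlation is zero".
r, p_value = stats.pearsonr(demo["LotArea"], demo["SalePrice"])

alpha = 0.05  # conventional significance level, chosen here for illustration
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
if p_value < alpha:
    print("Reject H0: evidence of a linear relationship.")
else:
    print("Fail to reject H0: no evidence of a linear relationship.")
```

A small p-value only says the correlation is unlikely to be zero; the size of `r` tells you how strong the linear relationship is.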

# Normal Distribution

Many statistical techniques assume that data is normally distributed. Checking for normality and applying necessary transformations to make data more "normal" can be crucial for the success of these techniques. 

Applications of Normality Tests (or other ways of checking for normality):

1. **Assumption Testing**: Normality tests are employed to **assess the assumption of normality in various statistical techniques, such as t-tests, analysis of variance (ANOVA), linear regression, and others**. Violations of normality assumptions may require alternative approaches or data transformations.

2. **Data Exploration**: Normality tests help analysts understand the distributional properties of the data they are working with. This information can **guide the selection of appropriate statistical methods and provide insights into the nature of the variables**.

We'll continue with the housing price dataset. Let's check the normality of the 'SalePrice', 'LotArea' and '1stFlrSF' columns and try a transformation if needed.

## Checking if data is normally distributed

Checking for normality can be done visually (using histograms and Q-Q plots) and statistically (using tests such as the Kolmogorov-Smirnov (K-S) test or the Shapiro-Wilk test).

**Visual Inspection:**
- **Histogram**: A bell-shaped curve in a histogram is indicative of a normal distribution.
- **Q-Q Plot**: In this plot, the quantiles of your data are plotted against the quantiles of a normal distribution. If the data is normally distributed, the points should lie roughly along a straight reference line (the y=x line when the data is standardized).
- **Box Plots**: The symmetry of a box plot can give hints about data normality.

**Statistical Tests:**

- **Shapiro-Wilk Test**: This is a popular test for normality. A low p-value (typically p<0.05) is evidence that the data is not normally distributed.
- **Kolmogorov-Smirnov Test**: This test compares the cumulative distribution of your data to that of a reference distribution, here the normal. Again, a low p-value suggests non-normality.

**Descriptive Statistics:**
- **Skewness and Kurtosis**: Skewness measures the asymmetry of the data distribution, while kurtosis measures the "tailedness". For a normal distribution, skewness should be close to 0 (indicating symmetry), and kurtosis should be close to 3. Note that some libraries report excess kurtosis (kurtosis minus 3), which is close to 0 for normal data. These statistics should be used in conjunction with other methods.
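
To make the skewness and kurtosis conventions concrete, here is a small sketch on two synthetic samples (both samples and their parameters are assumptions for illustration). It uses pandas' `Series.skew()` and `Series.kurt()`, which report excess kurtosis, alongside `scipy.stats.kurtosis(..., fisher=False)`, which reports the Pearson definition where a normal distribution scores about 3.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Two synthetic samples (assumptions for illustration only).
rng = np.random.default_rng(0)
normal_like = pd.Series(rng.normal(loc=0, scale=1, size=5_000))
right_skewed = pd.Series(rng.lognormal(mean=0, sigma=0.8, size=5_000))

for name, s in [("normal_like", normal_like), ("right_skewed", right_skewed)]:
    skew = s.skew()                                 # about 0 for symmetric data
    excess_kurt = s.kurt()                          # pandas: excess kurtosis, about 0 for normal
    pearson_kurt = stats.kurtosis(s, fisher=False)  # Pearson definition, about 3 for normal
    print(
        f"{name}: skew={skew:.2f}, "
        f"excess kurtosis={excess_kurt:.2f}, Pearson kurtosis={pearson_kurt:.2f}"
    )
```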


### Histogram and Q-Q plot

We'll visualize the distribution using a histogram and also use a Q-Q plot, which plots the quantiles of our data against the quantiles of a normal distribution.
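
One possible way to draw both plots side by side is sketched below with matplotlib and `scipy.stats.probplot`. The helper name `plot_hist_and_qq`, the bin count, and the synthetic right-skewed sample are all assumptions made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats


def plot_hist_and_qq(values, name):
    """Draw a histogram and a normal Q-Q plot side by side; return the figure."""
    fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))

    ax_hist.hist(values, bins=40, edgecolor="black")
    ax_hist.set_title(f"Histogram of {name}")
    ax_hist.set_xlabel(name)
    ax_hist.set_ylabel("Count")

    # probplot sorts the data, pairs it with theoretical normal quantiles,
    # and adds a least-squares reference line to the given axes.
    stats.probplot(values, dist="norm", plot=ax_qq)
    ax_qq.set_title(f"Q-Q plot of {name}")

    fig.tight_layout()
    return fig


# Synthetic right-skewed sample (an assumption standing in for a real column).
rng = np.random.default_rng(1)
skewed_sample = rng.lognormal(mean=12, sigma=0.4, size=1_000)
fig = plot_hist_and_qq(skewed_sample, "synthetic right-skewed sample")
```

With the real data, the same helper could be called as `plot_hist_and_qq(df["SalePrice"], "SalePrice")`, followed by `plt.show()` when running outside a notebook. Points that curve away from the reference line in the upper tail are the typical signature of right skew.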

#### SalePrice

#### LotArea

#### 1stFlrSF

### Hypothesis Testing

Perform hypothesis tests on variables like 'SalePrice', 'LotArea' and '1stFlrSF' to explore their distribution.

#### Kolmogorov-Smirnov (K-S) Test:
The K-S test is a non-parametric test that compares the cumulative distribution function of the sample data to that of a specified theoretical distribution (like the normal distribution). The null hypothesis of the test is that the sample data follows the specified distribution.

By conducting the Kolmogorov-Smirnov test, we can gain insights into the distributional properties of the variable and determine if it follows a normal distribution or not.
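
Here is a minimal sketch of the K-S test with `scipy.stats.kstest`, run on synthetic samples (the helper `ks_normality` and both samples are assumptions for illustration). One detail matters: calling `kstest(values, "norm")` without `args` compares the data to a standard normal with mean 0 and standard deviation 1, so the sketch passes the sample's own mean and standard deviation. Because those parameters are estimated from the same data, the resulting p-value is only approximate and tends to be conservative; the Lilliefors variant of the test is designed for that situation.

```python
import numpy as np
from scipy import stats


def ks_normality(values):
    """K-S test against a normal with the sample's own mean and std."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std(ddof=1)
    return stats.kstest(values, "norm", args=(mean, std))


# Synthetic samples (assumptions standing in for real columns).
rng = np.random.default_rng(2)
normal_sample = rng.normal(loc=180_000, scale=40_000, size=1_000)
skewed_sample = rng.lognormal(mean=12, sigma=0.8, size=1_000)

for name, sample in [("normal_sample", normal_sample), ("skewed_sample", skewed_sample)]:
    result = ks_normality(sample)
    print(f"{name}: D = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")
```

With the real data, `ks_normality(df["SalePrice"])` would apply the same check to one column.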

##### SalePrice

##### LotArea

##### 1stFlrSF

#### Shapiro-Wilk Test

The Shapiro-Wilk test is another popular method to check for normality. Its null hypothesis is that the data follows a normal distribution. The Shapiro-Wilk test is known to be more appropriate for smaller sample sizes compared to the K-S test.

One normality test is enough in practice, but we'll also run the Shapiro-Wilk test on SalePrice to show how it works.
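
The call itself is short. The sketch below applies `scipy.stats.shapiro` to two synthetic samples (both are assumptions for illustration); with the real data you would pass `df["SalePrice"]`.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumptions standing in for real columns).
rng = np.random.default_rng(3)
normal_sample = rng.normal(loc=180_000, scale=40_000, size=300)
skewed_sample = rng.lognormal(mean=12, sigma=0.8, size=300)

for name, sample in [("normal_sample", normal_sample), ("skewed_sample", skewed_sample)]:
    # H0: the sample was drawn from a normal distribution.
    w_stat, p_value = stats.shapiro(sample)
    verdict = "reject H0" if p_value < 0.05 else "fail to reject H0"
    print(f"{name}: W = {w_stat:.3f}, p-value = {p_value:.3g} -> {verdict}")
```

SciPy warns that the p-value may not be accurate for samples larger than 5000 observations, which is one reason the test is usually recommended for smaller samples.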

##### SalePrice

## Transforming Data to Be Normally Distributed

Transforming data to be approximately normal can aid in statistical analysis and modeling. 

**Log Transformation:**
- Useful for data that shows exponential growth, like population or financial data.
- Use when data is right-skewed.

**Square Root Transformation:**
- Moderates the impact of extreme values.
- Suitable for data with mild skewness.

**Box-Cox Transformation:**
- Requires strictly positive data values.
- Automatically estimates the power parameter (lambda) that makes the transformed data as close to normal as possible.


#### Logarithmic Transformation

If the 'SalePrice' distribution seems non-normal, a common technique is to apply a logarithmic transformation to the data to make it more normal.
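
A minimal sketch of the log transformation, assuming a synthetic right-skewed series in place of the real 'SalePrice' column (the series and its parameters are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (an assumption standing in for df["SalePrice"]).
rng = np.random.default_rng(4)
sale_price = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1_000), name="SalePrice")

# Natural log; every value must be strictly positive.
log_price = np.log(sale_price)

print(f"skewness before: {sale_price.skew():.2f}")
print(f"skewness after:  {log_price.skew():.2f}")
```

If a column contains zeros, `np.log1p` (which computes log(1 + x)) is a common alternative.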

#### Square Root Transformation

'1stFlrSF' could be a candidate for the Square Root transformation given its mild to moderate skewness.
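
The sketch below shows the square root transformation on a synthetic, mildly skewed series (a gamma sample chosen as an assumption to mimic a moderately right-skewed area measurement; it is not the real '1stFlrSF' column):

```python
import numpy as np
import pandas as pd

# Synthetic, mildly right-skewed areas (an assumption standing in for df["1stFlrSF"]).
rng = np.random.default_rng(5)
first_flr = pd.Series(rng.gamma(shape=4.0, scale=300.0, size=1_000), name="1stFlrSF")

# Square root; every value must be non-negative.
sqrt_first_flr = np.sqrt(first_flr)

print(f"skewness before: {first_flr.skew():.2f}")
print(f"skewness after:  {sqrt_first_flr.skew():.2f}")
```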

#### Box-Cox Transformation

'LotArea' seems like a good candidate for the Box-Cox transformation due to its high skewness and positive values.
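
As a sketch, `scipy.stats.boxcox` returns both the transformed values and the fitted lambda when no lambda is supplied. The heavily skewed synthetic sample below is an assumption standing in for the real 'LotArea' column.

```python
import numpy as np
from scipy import stats

# Synthetic, heavily right-skewed lot areas (an assumption standing in for df["LotArea"]).
rng = np.random.default_rng(6)
lot_area = rng.lognormal(mean=9.0, sigma=0.8, size=1_000)

# With no lambda given, boxcox estimates it by maximum likelihood.
# Every value must be strictly positive.
lot_area_bc, fitted_lambda = stats.boxcox(lot_area)

print(f"fitted lambda:   {fitted_lambda:.3f}")
print(f"skewness before: {stats.skew(lot_area):.2f}")
print(f"skewness after:  {stats.skew(lot_area_bc):.2f}")
```

A fitted lambda near 0 means Box-Cox is behaving almost like a log transformation, while a value near 0.5 is close to a square root.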

## After Transformation

- **Re-assess Distribution:** After applying a transformation, visually assess the distribution again using histograms and Q-Q plots.
- **Statistical Testing:** Shapiro-Wilk or Kolmogorov-Smirnov tests can be used to statistically assess normality.

Remember to reverse transformations (when needed) for interpretation.
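
Reversing each of the three transformations can be sketched as follows. The synthetic `values` array is an assumption for illustration; `scipy.special.inv_boxcox` needs the same lambda that was fitted during the forward transformation, so keep it around.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic positive values (an assumption for illustration).
rng = np.random.default_rng(9)
values = rng.lognormal(mean=12, sigma=0.5, size=200)

# Log -> exponentiate to get back to the original scale.
log_back = np.exp(np.log(values))

# Square root -> square.
sqrt_back = np.sqrt(values) ** 2

# Box-Cox -> inverse Box-Cox with the lambda fitted on the way in.
bc_values, fitted_lambda = stats.boxcox(values)
bc_back = inv_boxcox(bc_values, fitted_lambda)

print(np.allclose(log_back, values), np.allclose(sqrt_back, values), np.allclose(bc_back, values))
```

One caveat when interpreting results: back-transforming a mean computed on the log scale gives the geometric mean of the original values, not their arithmetic mean.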

Always consider the underlying reasons for any non-normality, as transformations might not always be the best solution.

Let's do it just for SalePrice.

Let's also check with Kolmogorov-Smirnov (K-S) Test.
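
A compact way to compare "before" and "after" is to collect skewness and both test p-values in one place. The sketch below does this for a synthetic price sample and its log transform (the helper `normality_report` and the data are assumptions for illustration; the K-S test uses the sample's own mean and standard deviation, so its p-value is approximate).

```python
import numpy as np
from scipy import stats


def normality_report(values):
    """Return skewness plus Shapiro-Wilk and K-S p-values for one sample."""
    values = np.asarray(values, dtype=float)
    shapiro_p = stats.shapiro(values).pvalue
    ks_p = stats.kstest(
        values, "norm", args=(values.mean(), values.std(ddof=1))
    ).pvalue
    return {
        "skew": float(stats.skew(values)),
        "shapiro_p": float(shapiro_p),
        "ks_p": float(ks_p),
    }


# Synthetic right-skewed prices (an assumption standing in for df["SalePrice"]).
rng = np.random.default_rng(7)
sale_price = rng.lognormal(mean=12, sigma=0.4, size=1_000)
log_price = np.log(sale_price)

before = normality_report(sale_price)
after = normality_report(log_price)
print("before:", before)
print("after: ", after)
```

On real data the transformed column may still fail the tests even when it looks much closer to normal, so read the p-values together with the plots.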

## 💡 Check for understanding

- Run the post-transformation checks for `LotArea` and `1stFlrSF`.
- Choose another numerical continuous variable from the dataset and check if it's normally distributed. If it's not, try transforming it so it becomes normally distributed, and check for normality again. Explain why you chose that variable and your results.

## Note: Central Limit Theorem

The Central Limit Theorem (CLT) states that, for independent random samples of the same size drawn from a population with finite variance, the sampling distribution of the mean approaches a normal distribution as the sample size grows, regardless of the shape of the underlying population. A common rule of thumb is n > 30, although heavily skewed populations may need larger samples.

1. **Large Sample Size & Individual Data Points**: Even with a large sample, the distribution of individual data points could still be non-normal. For instance, a dataset with millions of data points could still be heavily skewed or have extreme kurtosis.
  
2. **Large Sample Size & Averages of Samples**: If you're taking multiple samples from a population and calculating their averages, the distribution of those averages tends to be normal due to the CLT, even if the underlying population is not normal.

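3. **Practical Implications**: While the CLT is powerful, remember that it concerns sample means, not individual observations. Methods whose validity depends on the distribution of the individual data points or residuals (for example, prediction intervals) are not rescued by a large dataset, whereas procedures based on the sample mean, such as t-tests, become more robust to non-normality as the sample size grows.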
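
The contrast between points 1 and 2 can be seen in a short simulation. This is a sketch under stated assumptions: the exponential population, the sample size of 50, and the number of repeated samples are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# A heavily right-skewed "population" (exponential, theoretical skewness 2).
population = rng.exponential(scale=1.0, size=200_000)

# Draw many samples of size n (with replacement) and keep each sample's mean.
n = 50
num_samples = 5_000
sample_means = rng.choice(population, size=(num_samples, n)).mean(axis=1)

print(f"skewness of individual values: {stats.skew(population):.2f}")
print(f"skewness of sample means:      {stats.skew(sample_means):.2f}")
print(f"population mean: {population.mean():.3f}, mean of sample means: {sample_means.mean():.3f}")
```

The individual values stay skewed no matter how many there are, while the distribution of the sample means is far more symmetric and centers on the population mean.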