# Correlation and Causation, and ANOVA

Correlation and causation, as well as ANOVA (Analysis of Variance), are important concepts in statistics and data analysis. Let's briefly explain each of them:

1. **Correlation and Causation:**
   - **Correlation:** Correlation refers to the statistical relationship between two or more variables. It measures the strength and direction of the association between variables. A correlation can be positive (both variables increase together), negative (one variable increases while the other decreases), or zero (no linear relationship). Common correlation measures include the Pearson Correlation Coefficient and the Spearman Rank Correlation Coefficient.
   - **Causation:** Causation, on the other hand, refers to a cause-and-effect relationship between two variables. It implies that changes in one variable directly influence the changes in another variable. However, establishing causation requires more than just observing a correlation. Additional evidence, such as experimental studies or rigorous statistical analyses, is needed to demonstrate causation. The phrase "correlation does not imply causation" emphasizes the importance of distinguishing between the two concepts.

2. **ANOVA (Analysis of Variance):**
   ANOVA is a statistical technique used to compare the means of two or more groups to determine if there are any statistically significant differences among them. It helps to assess whether the variation between group means is greater than the variation within groups. ANOVA is commonly used when comparing means across different treatments, interventions, or categories. It provides an F-statistic and a p-value to determine if there are significant differences between groups. If the p-value is below a predefined significance level (usually 0.05), it indicates that there are statistically significant differences between at least two groups.

In summary, correlation measures the statistical relationship between variables, while causation implies a cause-and-effect relationship between variables, which requires additional evidence. ANOVA, on the other hand, helps to compare means between different groups to identify significant differences among them. All these concepts are essential in drawing accurate and meaningful conclusions from data analysis.
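
As a rough illustration of the two correlation measures named above, the sketch below compares them on synthetic data where one variable rises monotonically, but not linearly, with the other. The arrays `x` and `y` are invented for this example; `stats.pearsonr` and `stats.spearmanr` are the SciPy functions for the two coefficients.

```python
import numpy as np
from scipy import stats

# Synthetic data: y grows monotonically with x, but not along a straight line
x = np.linspace(0, 5, 50)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson r    = {pearson_r:.3f}")     # below 1: the relationship is not linear
print(f"Spearman rho = {spearman_rho:.3f}")  # 1.000: the ranks agree perfectly
```

Spearman works on ranks, so any strictly increasing relationship gives 1, while Pearson drops below 1 as soon as the points leave a straight line.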

## 4- Correlation and Causation 

**Correlation:** a measure of the extent of interdependence between variables.

**Causation:** the relationship between cause and effect between two variables.

It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.
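
One common way a correlation arises without causation is a shared hidden cause. The following sketch simulates that situation under stated assumptions: `z` is a hypothetical confounder, `x` and `y` each depend on `z` but not on each other, and `residuals` is a helper written only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder z drives both x and y; x never influences y directly
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

r_xy = np.corrcoef(x, y)[0, 1]

def residuals(a, b):
    """Return what is left of a after removing its least-squares linear fit on b."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)

# Correlation between x and y once the influence of z is removed from both
r_xy_given_z = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"corr(x, y)               = {r_xy:.3f}")
print(f"corr(x, y) controlling z = {r_xy_given_z:.3f}")
```

In this simulation `x` and `y` are clearly correlated, yet the association largely disappears once `z` is accounted for. With real data the confounder is usually unknown, which is why experiments are often needed.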

#### Pearson Correlation

The Pearson Correlation measures the linear dependence between two variables X and Y.

The resulting coefficient is a value between -1 and 1 inclusive, where:

- 1: Perfect positive linear correlation.
- 0: No linear correlation. The two variables may still be related in a non-linear way.
- -1: Perfect negative linear correlation.

Pearson Correlation is the default method of the pandas function "corr". As before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.
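
To make the coefficient concrete, here is a minimal sketch that computes it by hand and compares it with `DataFrame.corr`. The `toy` table, the `pearson_r` helper, and the column names are made up for illustration; `numeric_only=True` assumes pandas 1.5 or later.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Toy data (invented, not taken from the car dataset)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "up": [2.0, 4.0, 6.0, 8.0, 10.0],    # exact increasing line
    "down": [9.0, 7.0, 5.0, 3.0, 1.0],   # exact decreasing line
    "label": ["a", "b", "c", "d", "e"],  # non-numeric column
})

r_up = pearson_r(toy["x"], toy["up"])      # perfect positive correlation
r_down = pearson_r(toy["x"], toy["down"])  # perfect negative correlation

# DataFrame.corr uses method="pearson" by default; numeric_only=True skips the text column
corr_matrix = toy.corr(numeric_only=True)

# A coefficient near 0 rules out only a linear relationship: y = x**2 depends fully on x
xs = np.linspace(-1, 1, 101)
r_parabola = pearson_r(xs, xs ** 2)

print(corr_matrix)
print(r_up, r_down, r_parabola)
```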

#### P-value

What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one we measured if the two variables were in fact uncorrelated. A small P-value therefore suggests that the correlation is statistically significant. Normally, we choose a significance level of 0.05, which means we accept a 5% chance of declaring a correlation significant when none actually exists.
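
One way to see what this probability means is a permutation test, sketched below on synthetic data. This is an illustration of the idea rather than the calculation SciPy performs internally; the data, the number of shuffles `n_perm`, and the variable names are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic, clearly related data (illustrative only)
x = rng.normal(size=50)
y = x + 0.5 * rng.normal(size=50)

r_obs, p_scipy = stats.pearsonr(x, y)

# Shuffling y breaks any real link to x, so the shuffled coefficients show
# what "no correlation" looks like for samples of this size
n_perm = 2000
r_null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)])

# Share of shuffles at least as extreme as the observed coefficient
# (the +1 terms keep the estimate from being exactly zero)
p_perm = (np.sum(np.abs(r_null) >= abs(r_obs)) + 1) / (n_perm + 1)

print(f"r = {r_obs:.3f}, scipy p = {p_scipy:.2e}, permutation p = {p_perm:.4f}")
```

Both approaches ask the same question: how often would chance alone produce a coefficient this far from zero?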

By convention, when the

- p-value is $<$ 0.001: we say there is strong evidence that the correlation is significant.
- p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.
- p-value is $<$ 0.1: there is weak evidence that the correlation is significant.
- p-value is $>$ 0.1: there is no evidence that the correlation is significant.


We can obtain this information using the "stats" module in the "scipy" library.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
from scipy import stats

In [3]:
df = pd.read_csv('clean_df')
df.head()

Unnamed: 0.1,Unnamed: 0,symboling,normalized-losses,make,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,city-L/100km,length_new,length_zscore,horsepower-binned,fuel-type-diesel,fuel-type-gas,aspiration-std,aspiration-turbo,aspiration-std.1,aspiration-turbo.1
0,0,3,122,alfa-romero,two,convertible,rwd,front,88.6,0.811148,...,11.190476,0.413433,-0.438315,Low,False,True,True,False,True,False
1,1,3,122,alfa-romero,two,convertible,rwd,front,88.6,0.811148,...,11.190476,0.413433,-0.438315,Low,False,True,True,False,True,False
2,2,1,122,alfa-romero,two,hatchback,rwd,front,94.5,0.822681,...,12.368421,0.449254,-0.243544,Medium,False,True,True,False,True,False
3,3,2,164,audi,four,sedan,fwd,front,99.8,0.84863,...,9.791667,0.529851,0.19469,Low,False,True,True,False,True,False
4,4,2,164,audi,four,sedan,4wd,front,99.4,0.84863,...,13.055556,0.529851,0.19469,Low,False,True,True,False,True,False


#### Wheel-Base vs. Price

Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.

In [5]:
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])

In [6]:
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

The Pearson Correlation Coefficient is 0.584641822265508  with a P-value of P = 8.076488270732947e-20


Conclusion:
Since the p-value is < 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585).

The pearson_coef and p_value shown here are the two values returned by stats.pearsonr. Together they help us judge whether two sets of data are related. In this case, we want to see if there's a connection between the wheel-base of cars and their prices.

The pearson_coef value tells us how strong the linear relationship is. If it's close to 1 or -1, there's a strong positive or negative relationship, respectively. If it's close to 0, there's little to no linear relationship. For example, a pearson_coef of 0.8 would indicate a strong positive relationship between wheel-base and price.

The p_value tells us whether the relationship is statistically significant, that is, how likely a coefficient this large would be if there were no real relationship. If the p_value is small (usually less than 0.05), we can be more confident that there's a genuine relationship between wheel-base and price. For example, a p_value of 0.02 would count as moderate evidence of a relationship.

In summary, the pearson_coef tells us how strong the relationship is, and the p_value tells us whether it is likely to be more than chance.
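
Strength and significance are separate questions, which the following sketch tries to show with synthetic data: a very large sample with a deliberately faint linear link. The sample size and the slope of 0.05 are arbitrary choices for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Large synthetic sample with a deliberately faint linear link
n = 100_000
x = rng.normal(size=n)
y = 0.05 * x + rng.normal(size=n)

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.1e}")
```

Here the coefficient stays close to 0 while the p-value is far below 0.001: the relationship is statistically significant but weak, so both numbers need to be read together.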

#### Horsepower vs. Price
Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.

In [7]:
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

In [8]:
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

The Pearson Correlation Coefficient is 0.8096068016571054  with a P-value of P = 6.273536270650436e-48


Conclusion:
Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).

#### Length vs. Price

Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.

In [10]:
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print(pearson_coef)
print(p_value)

0.6906283804483639
8.016477466159153e-30


Conclusion:

Since the p-value is  <  0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).

#### Width vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).

#### Curb-Weight vs. Price
Conclusion:
Since the p-value is  <  0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).

#### Engine-Size vs. Price
Conclusion:
Since the p-value is  <  0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).

#### Bore vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).

We can repeat the process for 'city-mpg' and 'highway-mpg':

#### City-mpg vs. Price
Conclusion:
Since the p-value is  <  0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.

#### Highway-mpg vs. Price
Conclusion:
Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of about -0.705 shows that the relationship is negative and moderately strong.
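
The repeated steps above could be wrapped in a loop. The sketch below is one possible way to do it, not the notebook's own code: `evidence_label` and `correlate_with_target` are hypothetical helpers, and `demo` is a synthetic stand-in whose numbers are invented.

```python
import numpy as np
import pandas as pd
from scipy import stats

def evidence_label(p_value):
    """Map a p-value to the conventional wording used in this section."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"

def correlate_with_target(frame, columns, target):
    """Return one row per column: Pearson coefficient, p-value, and evidence label."""
    rows = []
    for col in columns:
        r, p = stats.pearsonr(frame[col], frame[target])
        rows.append({"variable": col, "pearson_r": r, "p_value": p,
                     "evidence": evidence_label(p)})
    return pd.DataFrame(rows).set_index("variable")

# Synthetic stand-in for the car data; the numbers are invented
rng = np.random.default_rng(1)
size = rng.normal(size=200)
demo = pd.DataFrame({
    "engine-size": size,
    "city-mpg": -size + 0.5 * rng.normal(size=200),
    "price": 2.0 * size + 0.5 * rng.normal(size=200),
})

summary = correlate_with_target(demo, ["engine-size", "city-mpg"], "price")
print(summary)
```

On the real data the same call would take the list of numeric column names and 'price' as the target, assuming missing values have already been handled.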


## 5- ANOVA

#### ANOVA: Analysis of Variance

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two values:

**F-test score:** ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.

**P-value:** The P-value tells us how statistically significant the calculated F-test score is.

If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.
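
For readers who want to see where the F-test score comes from, here is a minimal sketch that computes it by hand as the between-group mean square divided by the within-group mean square, then checks it against `stats.f_oneway`. The helper `f_statistic` and the three small samples are made up for this example.

```python
import numpy as np
from scipy import stats

def f_statistic(*groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)      # number of groups
    n = all_values.size  # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up values for three hypothetical groups
a = [10.0, 12.0, 11.0, 13.0]
b = [20.0, 22.0, 21.0, 23.0]
c = [11.0, 12.0, 10.0, 13.0]

f_manual = f_statistic(a, b, c)
f_scipy, p_scipy = stats.f_oneway(a, b, c)
print(f_manual, f_scipy, p_scipy)
```

The score is large here because group `b` sits far from the other two while the spread inside each group is small.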

### Drive Wheels

Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.

To see if different types of 'drive-wheels' impact 'price', we group the data.

In [12]:
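# Keep only the columns needed for the group comparison
df_gptest = df[['drive-wheels','body-style','price']]
# Group by a single column name (a scalar key, so get_group('4wd') takes a plain string)
grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby('drive-wheels')
grouped_test2.head(2)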

Unnamed: 0,drive-wheels,price
0,rwd,13495.0
1,rwd,16500.0
3,fwd,13950.0
4,4wd,17450.0
5,fwd,15250.0
136,4wd,7603.0


In [13]:
df_gptest

Unnamed: 0,drive-wheels,body-style,price
0,rwd,convertible,13495.0
1,rwd,convertible,16500.0
2,rwd,hatchback,16500.0
3,fwd,sedan,13950.0
4,4wd,sedan,17450.0
...,...,...,...
196,rwd,sedan,16845.0
197,rwd,sedan,19045.0
198,rwd,sedan,21485.0
199,rwd,sedan,22470.0


We can obtain the values of a single group using the method "get_group".


In [14]:
grouped_test2.get_group('4wd')['price']

4      17450.0
136     7603.0
140     9233.0
141    11259.0
144     8013.0
145    11694.0
150     7898.0
151     8778.0
Name: price, dtype: float64
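
The grouping step can also be tried in isolation. The sketch below uses a small invented table (`toy`) in place of the car data to show two ways of reaching the per-group prices: `get_group` for one category, and iteration over the groupby object for all of them.

```python
import pandas as pd

# Small invented table standing in for the 'drive-wheels' and 'price' columns
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "4wd", "fwd", "rwd", "4wd"],
    "price": [7000.0, 16000.0, 9000.0, 8000.0, 18000.0, 10000.0],
})

grouped = toy.groupby("drive-wheels")

# get_group returns the rows of one category, keeping their original index labels
prices_4wd = grouped.get_group("4wd")["price"]

# Iterating over the groupby object yields (category, sub-frame) pairs
prices_by_group = {name: frame["price"].tolist() for name, frame in grouped}

print(prices_4wd)
print(prices_by_group)
```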

We can use the function 'f_oneway' in the module 'stats' to obtain the F-test score and P-value.

In [16]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val)   

ANOVA results: F= 67.95406500780399 , P = 3.3945443577149576e-23


This is a great result: the large F-test score shows that the mean price differs strongly between drive-wheel groups, and a P-value of almost 0 implies almost certain statistical significance. But does this mean every pair of groups differs this strongly?

Let's examine them separately.

### fwd and rwd

In [17]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val )

ANOVA results: F= 130.5533160959111 , P = 2.2355306355677366e-23


### 4wd and rwd

In [18]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])  
   
print( "ANOVA results: F=", f_val, ", P =", p_val)   

ANOVA results: F= 8.580681368924756 , P = 0.004411492211225367


### 4wd and fwd

In [19]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])  
 
print("ANOVA results: F=", f_val, ", P =", p_val)   

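ANOVA results: F= 0.665465750252303 , P = 0.4162011669784502

The pairwise tests answer the question: fwd vs. rwd (P ≈ 2.2e-23) and 4wd vs. rwd (P ≈ 0.0044) differ significantly in mean price, while 4wd vs. fwd (P ≈ 0.42) does not.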
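
The three pairwise cells follow the same pattern, so they could be generated in a loop. The following is a sketch under stated assumptions: the `samples` dictionary holds invented numbers, with keys that mirror the drive-wheels categories only for readability, and `itertools.combinations` enumerates the pairs.

```python
from itertools import combinations
from scipy import stats

# Invented samples; not the car prices
samples = {
    "fwd": [1.0, 2.0, 3.0, 4.0, 5.0],
    "4wd": [1.5, 2.5, 3.5, 4.5, 5.5],
    "rwd": [11.0, 12.0, 13.0, 14.0, 15.0],
}

# Overall test across every group at once
f_all, p_all = stats.f_oneway(*samples.values())
print(f"all groups: F = {f_all:.2f}, P = {p_all:.3g}")

# One test per pair of groups
pairwise = {}
for first, second in combinations(samples, 2):
    f_val, p_val = stats.f_oneway(samples[first], samples[second])
    pairwise[(first, second)] = (f_val, p_val)
    print(f"{first} vs {second}: F = {f_val:.2f}, P = {p_val:.3g}")
```

Keeping the results in a dictionary keyed by the pair makes it easy to look up a specific comparison later.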


### Conclusion: Important Variables

We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:

Continuous numerical variables:

- Length
- Width
- Curb-weight
- Engine-size
- Horsepower
- City-mpg
- Highway-mpg
- Wheel-base
- Bore

Categorical variables:

- Drive-wheels

As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.