
#Correlation Analysis


##Overview

Correlation analysis is a fundamental statistical technique used in data science to understand the relationship between two or more variables in a dataset. It helps data scientists and analysts to identify patterns, dependencies, and associations between different features or attributes. In correlation analysis, the strength and direction of the relationship between variables are quantified, providing valuable insights for decision-making, feature selection, and predictive modeling.

Here's a brief overview of correlation analysis in data science:

1. Pearson Correlation Coefficient:
The Pearson correlation coefficient, often denoted as "r," is the most common measure of correlation used in data science. It quantifies the linear relationship between two continuous variables. The value of "r" ranges from -1 to 1, where:
   - r = 1 indicates a perfect positive correlation (both variables increase together).
   - r = -1 indicates a perfect negative correlation (as one variable increases, the other decreases).
   - r ≈ 0 indicates little to no linear correlation between the variables.

2. Spearman Rank Correlation:
The Spearman rank correlation coefficient is used when dealing with ordinal or non-normally distributed data. It calculates the correlation between the ranks of the data rather than the actual data values. Like the Pearson coefficient, it also ranges from -1 to 1.

3. Kendall Rank Correlation:
The Kendall rank correlation is another method used to measure the correlation between two variables based on their ranks. It is particularly useful for dealing with small sample sizes and ties in the data.

4. Visualizing Correlation:
Correlation analysis can be complemented with visualizations, such as scatter plots and heatmaps, to better understand the relationship between variables. Scatter plots help visualize the overall pattern of the relationship, while heatmaps provide a quick overview of the correlation matrix when dealing with multiple variables.

5. Correlation and Causation:
It's essential to understand that correlation does not imply causation. Just because two variables are correlated doesn't mean that one causes the other. It is crucial to interpret correlation results carefully and consider domain knowledge and experimentation to establish causation.

6. Feature Selection:
Correlation analysis is valuable in feature selection for predictive modeling. Highly correlated features might introduce multicollinearity in regression models, affecting model performance. In such cases, it might be necessary to remove one of the highly correlated features.

7. Limitations of Correlation Analysis:
Each coefficient detects only one kind of relationship: Pearson correlation captures linear relationships, and the rank-based coefficients (Spearman and Kendall) capture monotonic ones. None of them captures non-monotonic associations such as a U-shaped curve. Additionally, a pairwise correlation does not account for other variables that might influence the relationship between the two variables being studied.
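
The points in this overview can be illustrated with a minimal sketch that applies the three coefficients to three relationships. The data are synthetic and chosen only for illustration; `pearsonr`, `spearmanr`, and `kendalltau` are the corresponding functions in `scipy.stats`.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Synthetic x values, symmetric around zero (illustrative, not from a dataset)
x = np.arange(-10, 11, dtype=float)

relationships = {
    "linear (y = 2x + 1)": 2 * x + 1,
    "monotonic, non-linear (y = exp(x))": np.exp(x),
    "non-monotonic (y = x^2)": x ** 2,
}

results = {}
for label, y in relationships.items():
    # Each function returns (coefficient, p-value); keep only the coefficient
    r = pearsonr(x, y)[0]
    rho = spearmanr(x, y)[0]
    tau = kendalltau(x, y)[0]
    results[label] = (r, rho, tau)
    print(f"{label}: Pearson={r:.3f}, Spearman={rho:.3f}, Kendall={tau:.3f}")
```

For the linear case all three coefficients equal 1. For the exponential curve the rank-based coefficients stay at 1 while Pearson drops well below it, and for the symmetric parabola all three are essentially 0 even though y is completely determined by x.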



#Understanding correlation: Pearson correlation, Spearman correlation


Pearson correlation and Spearman correlation are two common measures used in data science to quantify the relationship between two variables. Both methods assess the strength and direction of the association, but they differ in the type of data they are suitable for and their sensitivity to different types of relationships.

**1. Pearson Correlation:**
Pearson correlation, also known as Pearson's r or Pearson product-moment correlation coefficient, is used to measure the linear relationship between two continuous variables. It provides a value between -1 and 1, where:

- A correlation of +1 indicates a perfect positive linear relationship, meaning that the points fall exactly on an upward-sloping straight line.
- A correlation of -1 indicates a perfect negative linear relationship, meaning that the points fall exactly on a downward-sloping straight line.
- A correlation of 0 indicates no linear relationship between the two variables.

The formula for Pearson correlation divides the covariance between the two variables by the product of their individual standard deviations. The coefficient is sensitive to outliers. Computing it does not require normally distributed data, but the usual significance test (the p-value) assumes that the data are approximately normally distributed.

Pearson correlation is suitable for continuous variables whose relationship is approximately linear.
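
To make the covariance-based definition concrete, here is a minimal sketch that computes r by hand as the covariance divided by the product of the standard deviations, and compares it with `scipy.stats.pearsonr`. The sample values are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Small illustrative sample (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Sample covariance and sample standard deviations (both use ddof=1)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)

# r = cov(x, y) / (s_x * s_y)
r_manual = cov_xy / (s_x * s_y)

# Reference value from SciPy
r_scipy = pearsonr(x, y)[0]

print("manual:", r_manual)
print("scipy: ", r_scipy)
```

The same `ddof` must be used for the covariance and both standard deviations; otherwise the normalizing factors do not cancel and the hand-computed value drifts from SciPy's.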

**2. Spearman Correlation:**
Spearman correlation, also known as Spearman's rank correlation coefficient, is a non-parametric measure used to assess the monotonic relationship between two variables. It does not assume that the relationship is linear or that the data follows a specific distribution.

Instead of working with the actual values of the variables, Spearman correlation converts the values to their ranks and then calculates the correlation on those ranks. This approach makes Spearman correlation less sensitive to outliers and more suitable for relationships that are monotonic but not linear.

Like Pearson correlation, Spearman correlation also provides a value between -1 and 1, where the interpretations are similar. A Spearman correlation of +1 indicates a perfect monotonic positive relationship, -1 indicates a perfect monotonic negative relationship, and 0 indicates no monotonic relationship.

Spearman correlation is preferred when dealing with ordinal or non-normally distributed data or when the relationship between variables is not expected to be linear.
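
The rank-based idea can be sketched as follows: rank both variables with `scipy.stats.rankdata`, apply Pearson's formula to the ranks, and compare the result with `spearmanr`. The data are made up, and the last y value is a deliberate outlier.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Made-up data: y increases with x, and the last point is an extreme outlier
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 4, 5, 7, 8, 100], dtype=float)

# Replace each value by its rank (1 = smallest)
rank_x = rankdata(x)
rank_y = rankdata(y)

r_raw = pearsonr(x, y)[0]                # Pearson on the raw values
r_on_ranks = pearsonr(rank_x, rank_y)[0] # Pearson on the ranks
rho = spearmanr(x, y)[0]                 # Spearman from SciPy

print("Pearson (raw values):", round(r_raw, 3))
print("Pearson on ranks:    ", round(r_on_ranks, 3))
print("Spearman:            ", round(rho, 3))
```

Because y never decreases as x increases, the ranks agree perfectly and Spearman's coefficient is 1, while the outlier pulls the Pearson coefficient on the raw values well below 1.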

In summary, Pearson correlation is appropriate for assessing linear relationships between continuous variables, while Spearman correlation is more suitable for assessing monotonic relationships and works well with ordinal or non-normally distributed data. The appropriate method depends on the nature of the data and the research question.

##Pearson correlation



The Pearson correlation coefficient is a measure of the linear relationship between two continuous variables. It quantifies the strength and direction of the linear association between two variables, ranging from -1 to 1. A value of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

To calculate the Pearson correlation coefficient using Python and the `scipy` library, we can use the `pearsonr` function from the `scipy.stats` module.

Here's an example of how to calculate the Pearson correlation coefficient between two variables in the Pima Indians Diabetes dataset:


In [None]:
import pandas as pd
from scipy.stats import pearsonr

# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=names)

# Select the variables of interest
variable1 = data['Glucose']
variable2 = data['BMI']

# Calculate the Pearson correlation coefficient and p-value
corr_coeff, p_value = pearsonr(variable1, variable2)

# Print the results
print("Pearson Correlation Coefficient: ", corr_coeff)
print("p-value: ", p_value)


In this example, we selected the 'Glucose' variable and the 'BMI' variable from the Pima Indians Diabetes dataset. We then used the `pearsonr` function to calculate the Pearson correlation coefficient and the corresponding p-value between these two variables. Finally, we printed the results.

Note that the `pearsonr` function returns two values: the Pearson correlation coefficient and the p-value. The p-value indicates the statistical significance of the correlation coefficient. If the p-value is less than a chosen significance level (e.g., 0.05), we can conclude that there is a statistically significant linear relationship between the two variables.
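
Statistical significance is not the same as strength: the p-value depends on the sample size as well as on the coefficient. The sketch below assumes the standard t-test for a Pearson correlation (valid for approximately normal data) and uses a hypothetical helper, `pearson_p_value`, to show how the same weak correlation of r = 0.1 becomes significant as n grows. A tiny made-up sample is used to cross-check the helper against `pearsonr`.

```python
import numpy as np
from scipy.stats import pearsonr, t

def pearson_p_value(r, n):
    """Two-sided p-value for H0 'no linear correlation', given r and sample size n.

    Hypothetical helper for illustration; assumes the usual t-test with n - 2
    degrees of freedom.
    """
    dof = n - 2
    t_stat = r * np.sqrt(dof / (1.0 - r ** 2))
    return 2 * t.sf(abs(t_stat), dof)

# The same weak correlation at different sample sizes
for n in (20, 200, 5000):
    print(f"r = 0.1, n = {n}: p-value = {pearson_p_value(0.1, n):.3g}")

# Cross-check against SciPy on a tiny made-up sample
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r, p = pearsonr(x, y)
print("pearsonr p-value:", p, "helper p-value:", pearson_p_value(r, len(x)))
```

A small p-value therefore says that a correlation is unlikely to be exactly zero, not that it is large; the coefficient itself should always be reported alongside it.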


##Spearman correlation



The Spearman correlation coefficient, also known as Spearman's rank correlation coefficient, is a non-parametric measure of the strength and direction of monotonic association between two variables. It assesses how well the relationship between two variables can be described using a monotonic function.

To calculate the Spearman correlation coefficient using Python's `scipy` library with the Pima Indian Diabetes dataset, follow these steps:

**Step 1: Load Libraries and Data**


In [None]:
import pandas as pd
from scipy.stats import spearmanr

# Load the Pima Indians Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=names)

# Select two variables for correlation analysis
variable1 = data['Glucose']
variable2 = data['BMI']


**Step 2: Calculate Spearman Correlation**


In [None]:
# Calculate Spearman correlation coefficient and p-value
correlation, p_value = spearmanr(variable1, variable2)

# Print the correlation coefficient and p-value
print("Spearman Correlation Coefficient: %.3f" % correlation)
print("p-value: %.3g" % p_value)  # %.3g keeps very small p-values from printing as 0.000


In this example, we selected the 'Glucose' variable as `variable1` and the 'BMI' variable as `variable2`. We then used the `spearmanr` function from `scipy.stats` to calculate the Spearman correlation coefficient between the two variables. The function returns the correlation coefficient and the corresponding p-value.

Finally, we printed the Spearman correlation coefficient and p-value using the `print` statements.

The Spearman correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative monotonic relationship, 1 indicates a perfect positive monotonic relationship, and 0 indicates no monotonic relationship. The p-value helps assess the statistical significance of the correlation. If the p-value is less than a chosen significance level (e.g., 0.05), it suggests that the correlation is statistically significant.
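
When more than two columns are of interest, both coefficients can be computed for every pair of columns at once with pandas' `DataFrame.corr`. The sketch below uses a synthetic DataFrame with hypothetical column names (`feature_a`, `feature_b`, `feature_c`) rather than the diabetes data, and an arbitrary threshold of 0.9 to flag highly correlated pairs, as one might do before feature selection.

```python
import numpy as np
import pandas as pd

# Synthetic data: feature_b is almost a linear copy of feature_a,
# feature_c is independent noise (all names are hypothetical)
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "feature_a": a,
    "feature_b": 2 * a + rng.normal(scale=0.1, size=n),
    "feature_c": rng.normal(size=n),
})

# Pairwise correlation matrices
pearson_matrix = df.corr(method="pearson")
spearman_matrix = df.corr(method="spearman")
print(pearson_matrix.round(2))
print(spearman_matrix.round(2))

# List each pair of distinct columns whose |r| exceeds the threshold
threshold = 0.9  # arbitrary cut-off chosen for illustration
cols = pearson_matrix.columns
high_pairs = [
    (cols[i], cols[j], pearson_matrix.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(pearson_matrix.iloc[i, j]) > threshold
]
print(high_pairs)
```

A matrix like `pearson_matrix` is also what a heatmap would visualize; only the pair built to be nearly collinear should be flagged here.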


#Reflection Points

1. **Understanding Correlation**: Reflect on the concept of correlation and its significance in analyzing relationships between variables. What are the key aspects of correlation, and why is it important in data analysis?

Answer: Correlation measures the strength and direction of the relationship between two variables (linear for Pearson, monotonic for Spearman). It helps identify patterns, dependencies, and associations in data. A positive correlation indicates that as one variable increases, the other tends to increase, while a negative correlation suggests that as one variable increases, the other tends to decrease.

2. **Pearson Correlation Coefficient**: Consider the Pearson correlation coefficient and its characteristics. How does it measure the strength and direction of the linear relationship between two variables? What is the range of values it can take?

Answer: The Pearson correlation coefficient quantifies the linear relationship between two variables. It ranges from -1 to +1. A value close to +1 indicates a strong positive correlation, a value close to -1 indicates a strong negative correlation, and a value close to 0 suggests little or no linear correlation.

3. **Calculating Pearson Correlation**: Reflect on the process of calculating the Pearson correlation coefficient using SciPy. What are the steps involved? How can you interpret the resulting correlation coefficient?

Answer: To calculate the Pearson correlation coefficient using SciPy's `pearsonr` function, you need to provide two arrays of equal length representing the variables of interest. The function returns the correlation coefficient and a p-value for testing non-correlation. A coefficient with a larger absolute value (close to -1 or +1) suggests a stronger linear relationship, while a value close to 0 indicates a weak linear relationship or none.

4. **Interpreting Pearson Correlation Results**: Consider scenarios where you obtain different Pearson correlation coefficients. How can you interpret these results in terms of the strength and direction of the relationship between variables?

Answer: When the Pearson correlation coefficient is close to +1, it indicates a strong positive linear relationship. As it approaches 0, the correlation weakens, indicating a weaker linear relationship. A value close to -1 suggests a strong negative linear relationship, where one variable tends to decrease as the other increases.

5. **Spearman Rank-Order Correlation Coefficient**: Reflect on the Spearman rank-order correlation coefficient and its purpose. How does it differ from the Pearson correlation coefficient? When is it more appropriate to use?

Answer: The Spearman rank-order correlation coefficient measures the strength and direction of the monotonic relationship between two variables. Unlike Pearson's coefficient, it does not assume linearity. It is useful when the variables' relationship is not strictly linear, but their rankings or orders still hold meaning.

6. **Calculating Spearman Correlation**: Consider the steps involved in calculating the Spearman rank-order correlation coefficient using SciPy's `spearmanr` function. How does it handle tied ranks, and what does the returned p-value indicate?

Answer: The `spearmanr` function in SciPy takes two arrays as input representing the variables. It computes the correlation coefficient and returns it along with a p-value. Tied values are each assigned the average of the ranks they would otherwise occupy, and the coefficient is then computed on those ranks. The p-value indicates the probability of observing a correlation at least as extreme as the calculated value, assuming the variables are not correlated.

7. **Interpreting Spearman Correlation Results**: Reflect on scenarios where you obtain different Spearman correlation coefficients. How can you interpret these results in terms of the strength and monotonicity of the relationship between variables?

Answer: The Spearman correlation coefficient ranges from -1 to +1. A value close to +1 indicates a strong positive monotonic relationship, meaning that as one variable increases, the other tends to increase in rank order. A value close to -1 suggests a strong negative monotonic relationship, where one variable tends to decrease as the other increases in rank order. A value close to 0 indicates little or no monotonic relationship.
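
As a small illustration for point 6, the sketch below shows how tied values are ranked. It relies on the default `method="average"` of `scipy.stats.rankdata` and checks that `spearmanr` agrees with Pearson's coefficient computed on those average ranks. The data are made up.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Made-up data containing ties in both variables
x = np.array([10, 20, 20, 30, 40], dtype=float)
y = np.array([1, 3, 2, 3, 5], dtype=float)

# Tied values share the average of the ranks they would otherwise occupy
rank_x = rankdata(x)  # the two 20s share rank 2.5
rank_y = rankdata(y)  # the two 3s share rank 3.5
print("ranks of x:", rank_x)
print("ranks of y:", rank_y)

rho = spearmanr(x, y)[0]
r_on_ranks = pearsonr(rank_x, rank_y)[0]
print("Spearman:", rho, "Pearson on average ranks:", r_on_ranks)
```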


#A quiz on Correlation Analysis


1. What is the purpose of correlational analysis?
<br>a) To determine causation between variables
<br>b) To identify the strength and direction of a relationship between two variables
<br>c) To measure the effect size of an experiment
<br>d) To analyze the distribution of data

2. In correlational analysis, what value represents a perfect positive correlation?
<br>a) 1.0
<br>b) -1.0
<br>c) 0.0
<br>d) There is no specific value for a perfect positive correlation

3. Which SciPy function is used to compute the Pearson correlation coefficient between two datasets?
<br>a) `scipy.stats.corrcoef()`
<br>b) `scipy.stats.pearsonr()`
<br>c) `scipy.correlation_coefficient()`
<br>d) `scipy.stats.correlation()`

4. What does a correlation coefficient value of -0.75 indicate?
<br>a) A strong positive correlation
<br>b) A strong negative correlation
<br>c) A weak positive correlation
<br>d) A weak negative correlation

5. When interpreting a correlation coefficient, what range of values does it typically fall between?
<br>a) -1.0 to 1.0
<br>b) 0 to 1.0
<br>c) -1 to 0
<br>d) 0 to 100

6. In a scatter plot, how is a strong positive correlation represented?
<br>a) Points scattered randomly with no apparent pattern
<br>b) Points forming a straight line from top-left to bottom-right
<br>c) Points forming a straight line from bottom-left to top-right
<br>d) Points clustering around the center of the plot

7. What does a p-value in correlation analysis signify?
<br>a) The strength of the relationship between the variables
<br>b) The probability of observing a correlation at least this strong if the variables were actually uncorrelated
<br>c) The number of data points in the dataset
<br>d) The range of the correlation coefficient

8. In Python's SciPy library, which function is used to calculate the p-value of a correlation?
<br>a) `scipy.stats.pvalue()`
<br>b) `scipy.stats.pearsonr()`
<br>c) `scipy.stats.p_corr()`
<br>d) `scipy.pvalue.correlation()`

---

Answers:

1. b) To identify the strength and direction of a relationship between two variables
2. a) 1.0
3. b) `scipy.stats.pearsonr()`
4. b) A strong negative correlation
5. a) -1.0 to 1.0
6. c) Points forming a straight line from bottom-left to top-right
7. b) The probability of observing a correlation at least this strong if the variables were actually uncorrelated
8. b) `scipy.stats.pearsonr()`

---

Note: Correlational analysis is used to examine relationships between variables, but it cannot determine causation. It only provides information about how two variables are related to each other.