![Title](Images/cisco.png)

# Lab - Correlation Analysis in Python


### Scenario/Background


**Correlation analysis** is a statistical method used to evaluate the strength and direction of the relationship between two quantitative variables. It measures how changes in one variable are associated with changes in another variable.

The correlation coefficient, typically denoted by "r", quantifies this relationship. The most widely used one, the Pearson correlation coefficient, ranges from -1 to +1:

- A positive correlation (closer to +1) indicates that as one variable increases, the other tends to increase as well.
- A negative correlation (closer to -1) indicates that as one variable increases, the other tends to decrease.
- A correlation close to zero suggests a weak or no linear relationship between the variables.
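
The three cases above can be illustrated with a few small series. This is only a sketch: the values are made up for illustration and are not lab data. It relies on the pandas `Series.corr()` method, which computes the Pearson coefficient by default.

```python
import pandas as pd

# Made-up illustrative data, not taken from any real dataset
x = pd.Series([1, 2, 3, 4, 5])
y_up = pd.Series([2, 4, 5, 4, 5])     # tends to rise with x
y_down = pd.Series([5, 4, 5, 4, 2])   # tends to fall as x rises
y_curve = (x - 3) ** 2                # depends on x, but not linearly

r_up = x.corr(y_up)
r_down = x.corr(y_down)
r_curve = x.corr(y_curve)

print(f"rising:  r = {r_up:.2f}")
print(f"falling: r = {r_down:.2f}")
print(f"curved:  r = {r_curve:.2f}")
```

Note the last case: `y_curve` is completely determined by `x`, yet its coefficient is essentially zero, because Pearson's r only captures linear association.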


Correlation analysis helps in understanding and exploring associations between variables in various fields such as economics, social sciences, finance, and more. 

However, it's important to note that correlation does not imply causation, meaning that even if two variables are correlated, it doesn't necessarily mean that changes in one variable cause changes in the other.
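
A small simulation can make this point concrete. The sketch below assumes a hypothetical hidden factor `z` that drives two observed variables `x` and `y`; all names, the noise level, and the sample size are invented for illustration. Neither `x` nor `y` influences the other, yet they end up strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical hidden factor that drives both observed variables
z = rng.normal(size=1000)
x = z + rng.normal(scale=0.5, size=1000)
y = z + rng.normal(scale=0.5, size=1000)

# x and y never influence each other, yet they are clearly correlated
r_xy = np.corrcoef(x, y)[0, 1]
print(f"correlation between x and y: {r_xy:.2f}")
```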

![Title](Images/correlation.png)


Besides the Pearson coefficient, statistics offers other correlation coefficients, such as Spearman's rank correlation and Kendall's tau, which measure monotonic rather than strictly linear relationships.
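
One commonly used alternative is Spearman's rank correlation, which can be computed as the Pearson coefficient of the ranked values. The sketch below uses made-up data where `y` grows with `x` along a curve, so the two coefficients differ. pandas also accepts `method='spearman'` and `method='kendall'` in `corr()`; depending on the installation, those options may need SciPy, which is why the ranks are computed explicitly here.

```python
import pandas as pd

# Made-up data: y grows with x, but along a curve (y = x cubed)
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3

pearson_r = x.corr(y)                 # linear association
spearman_r = x.rank().corr(y.rank())  # Pearson applied to the ranks

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because `y` always increases when `x` does, the rank-based coefficient reaches 1, while the Pearson coefficient stays below 1 since the points do not lie on a straight line.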


In this lab, you will learn how to use Python to calculate correlation. In Part 1, you will set up the dataset. In Part 2, you will learn how to identify whether the variables in a given dataset are suitable for correlation analysis. Finally, in Part 3, you will use Python to calculate the correlation between two variables.

## Example 1

In [4]:
import pandas as pd

# Sample data
data = {
    'Variable_1': [1, 2, 3, 4, 5],
    'Variable_2': [5, 4, 3, 2, 1]
}

# Creating a DataFrame from the sample data
df = pd.DataFrame(data)

# Calculating the mean of Variable_1
mean_of_variable1 = df['Variable_1'].mean()
print(f'Mean of variable1: {mean_of_variable1}')

# Calculating Pearson correlation coefficient between Variable_1 and Variable_2
pearson_corr = df['Variable_1'].corr(df['Variable_2'])

print(f"Pearson correlation coefficient: {pearson_corr}")


Mean of variable1: 3.0
Pearson correlation coefficient: -0.9999999999999999
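
To see where this number comes from, the Pearson coefficient can be computed by hand from its definition:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

The following is a minimal sketch using NumPy on the same values as Example 1. The variable names (`dx`, `dy`, `r_manual`) are chosen here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([5, 4, 3, 2, 1], dtype=float)

# Deviations of each value from its mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of products of deviations, scaled by both spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(f"manual r: {r_manual}")
print(f"numpy r:  {np.corrcoef(x, y)[0, 1]}")
```

The data lie exactly on a descending straight line, so the true coefficient is -1. A printed value such as -0.9999999999999999 is that same result up to floating-point rounding.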


## Example 2

- Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level interface for creating attractive and informative statistical graphics and integrates well with pandas data structures.

- Seaborn provides built-in datasets that are useful for practicing data visualization, testing plotting functions, and learning data analysis techniques. These datasets are accessible through Seaborn's `load_dataset()` function, which downloads them from seaborn's online repository and returns them as pandas DataFrames.

- Let's use real-life data to calculate the Pearson correlation coefficient between two variables.

- We'll use the seaborn library to load a sample dataset and calculate the correlation between two columns.

- First, if you don't have seaborn installed, you can install it via pip:

In [None]:
!pip install seaborn

Here's an example using the tips dataset available in seaborn to calculate the correlation between the total bill amount and the tip amount:

## Import Dataset

In [None]:
import seaborn as sns

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')


## Exploring dataset

In [None]:
# Display the first few rows of the dataset
print(tips.head())

In [None]:
print(tips.tail())

In [None]:
print(tips.shape)

In [None]:
tips.describe()

## Calculating Correlation with Python

In [None]:
# Calculate the Pearson correlation coefficient between 'total_bill' and 'tip'
correlation = tips['total_bill'].corr(tips['tip'])
print(f"Pearson correlation coefficient: {correlation}")

Question: 
 
1. Create manDF and womanDF dataframes by filtering the data based on the 'sex' column.
2. Analyze correlation between total_bill and tip for each dataframe


In [None]:
manDF = tips[tips['sex'] == 'Male']
type(manDF)

In [None]:
correlation = manDF['total_bill'].corr(manDF['tip'])
print(f"Pearson correlation coefficient: {correlation}")

In [None]:
womanDF = tips[tips['sex'] == 'Female']
correlation = womanDF['total_bill'].corr(womanDF['tip'])
print(f"Pearson correlation coefficient: {correlation}")

Question: 
 
1. Create smokerDF and nonsmokerDF dataframes by filtering the data based on the 'smoker' column.
2. Analyze the correlation between total_bill and tip for each dataframe.

In [None]:
smokerDF = tips[tips['smoker'] == 'Yes']
correlation = smokerDF['total_bill'].corr(smokerDF['tip'])
print(f"Pearson correlation coefficient: {correlation}")

In [None]:
nonsmokerDF = tips[tips['smoker'] == 'No']
correlation = nonsmokerDF['total_bill'].corr(nonsmokerDF['tip'])
print(f"Pearson correlation coefficient: {correlation}")
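
The filter-then-correlate pattern used in the cells above can also be written once with `groupby`, which splits the frame by a column and lets you compute the coefficient for each subset. The sketch below uses a small made-up frame so that it runs without downloading anything; the `group` column and all values are hypothetical stand-ins for a column such as `sex` or `smoker` in `tips`.

```python
import pandas as pd

# Made-up stand-in for the tips data (values are illustrative only)
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'total_bill': [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
    'tip': [1.0, 2.5, 2.8, 4.5, 3.0, 2.0, 2.5, 1.0],
})

# One Pearson coefficient per group, keyed by the group label
per_group = {
    name: subset['total_bill'].corr(subset['tip'])
    for name, subset in toy.groupby('group')
}

for name, r in per_group.items():
    print(f"group {name}: r = {r:.2f}")
```

With the real dataset, replacing `toy` with `tips` and `'group'` with `'smoker'` should give the same two coefficients as the separate cells, without creating a dataframe per category.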

## Visualizing Data

In [None]:
# Render plots inside the notebook; run this before any plotting cell
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
# Scatter plot of tip versus total bill for smokers
plt.scatter(smokerDF['total_bill'], smokerDF['tip'])

plt.title('Smokers')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.show()

In [None]:
# Scatter plot of tip versus total bill for non-smokers
plt.scatter(nonsmokerDF['total_bill'], nonsmokerDF['tip'])

plt.title('Non-smokers')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.show()

In [None]:
tips

## Calculating Correlation for the Entire tips DataFrame
The pandas `corr()` method provides an easy way to calculate correlation for a whole dataframe. Calling it on a dataframe returns a table with the correlation between every pair of columns at once.

In [None]:
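# Pairwise Pearson correlation between the numeric columns.
# numeric_only=True skips the categorical columns (sex, smoker, day, time);
# pandas 2.0 and later raise an error on them otherwise.
tips.corr(method='pearson', numeric_only=True)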
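
Correlation is only defined for numeric data, and a frame like `tips` also holds text or categorical columns. The sketch below shows two ways to restrict the calculation to numeric columns, using a small made-up frame (the values are illustrative only). The `numeric_only` argument assumes pandas 1.5 or later; selecting the columns explicitly works on older versions as well.

```python
import pandas as pd

# Made-up frame mixing numeric and text columns
toy = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'tip': [1.0, 2.5, 2.8, 4.5],
    'day': ['Thur', 'Fri', 'Sat', 'Sun'],
})

# Option 1: let pandas drop the non-numeric columns
corr_auto = toy.corr(numeric_only=True)

# Option 2: pick the numeric columns yourself
corr_picked = toy[['total_bill', 'tip']].corr()

print(corr_auto)
```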

Notice the main diagonal (top-left to bottom-right) of the correlation table generated above. Why is the diagonal filled with 1s? Is that a coincidence? Explain.

Still looking at the correlation table above, notice that the values are mirrored; values below the 1 diagonal have a mirrored counterpart above the 1 diagonal. Is that a coincidence? Explain.

In [None]:
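# Correlation matrix of the numeric columns, drawn as an annotated heatmap
tcorr = tips.corr(numeric_only=True)
sns.heatmap(tcorr, annot=True, vmin=-1, vmax=1)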