# **Correlation**

- Describes direction and strength of relationship between two variables
- Can help us use variables to predict future outcomes

In [None]:
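# numeric_only=True skips non-numeric columns such as dates (assumes pandas 1.5 or later)
divorce.corr(numeric_only=True)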

In [None]:
                  income_man  income_woman  marriage_duration  num_kids  marriage_year
income_man              1.000         0.318              0.085     0.041          0.019
income_woman            0.318         1.000              0.079    -0.018          0.026
marriage_duration       0.085         0.079              1.000     0.447         -0.812
num_kids                0.041        -0.018              0.447     1.000         -0.461
marriage_year           0.019         0.026             -0.812    -0.461          1.000

- `.corr()` calculates the Pearson correlation coefficient by default, which measures the strength and direction of the linear relationship between two variables
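
To make the Pearson coefficient concrete, here is a minimal sketch that computes it by hand and compares the result with pandas. The `toy` DataFrame and its values are made up for illustration; they are not part of the divorce dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the divorce dataset)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1],
})

# Deviations of each variable from its own mean
x_dev = toy["x"] - toy["x"].mean()
y_dev = toy["y"] - toy["y"].mean()

# Pearson r: sum of co-deviations divided by the product of the deviation magnitudes
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same quantity from pandas (Pearson is the default method)
r_pandas = toy["x"].corr(toy["y"])

print(r_manual, r_pandas)
```

The two values should agree up to floating-point error, and both are close to 1 because the toy points lie almost on a straight line.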

# **Correlation heatmaps**

- Let's wrap our `divorce.corr()` results in a Seaborn heatmap for quick visual interpretation.
- A heatmap has the benefit of color coding so that strong positive and negative correlations, represented in deep purple and beige respectively, are easier to spot.
- Setting the `annot` argument to `True` labels the correlation coefficient inside each cell.
- Here, we can see that marriage year and marriage duration are strongly negatively correlated; in our dataset, marriages in later years are typically shorter.

In [None]:
# import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'divorce' is a pandas DataFrame; numeric_only=True skips non-numeric columns (pandas 1.5 or later)
sns.heatmap(divorce.corr(numeric_only=True), annot=True)
plt.show()

![image.png](attachment:image.png)

# **Correlation in context**

- However, this highlights an important point about correlations: we must always interpret them within the context of our data! Because our dataset only covers marriages that ended between 2000 and 2015, marriages that started in earlier years will necessarily have a longer duration than those that started in later ones.

In [None]:
# Find the earliest divorce date (the latest is found in the next cell)
divorce["divorce_date"].min()

In [None]:
Timestamp('2000-01-08 00:00:00')

In [None]:
divorce["divorce_date"].max()

In [None]:
Timestamp('2015-11-03 00:00:00')
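
The effect of this observation window can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `sim` DataFrame, the year ranges, and the uniform distributions are all hypothetical and are not taken from the divorce dataset. Marriage year and duration are drawn independently, yet keeping only marriages that ended between 2000 and 2015 produces a strong negative correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Simulated (hypothetical) marriages: start year and duration are independent
sim = pd.DataFrame({
    "marriage_year": rng.integers(1960, 2016, size=n),   # 1960 to 2015
    "marriage_duration": rng.integers(1, 41, size=n),    # 1 to 40 years
})
sim["divorce_year"] = sim["marriage_year"] + sim["marriage_duration"]

# Keep only marriages that ended inside the observation window
observed = sim[sim["divorce_year"].between(2000, 2015)]

corr_all = sim["marriage_year"].corr(sim["marriage_duration"])
corr_observed = observed["marriage_year"].corr(observed["marriage_duration"])

print(f"All simulated marriages: {corr_all:.3f}")
print(f"Ended 2000-2015 only:    {corr_observed:.3f}")
```

In the full simulated population the correlation is near zero by construction; the strong negative value appears only after filtering on the divorce year, which is the same kind of selection our dataset applies.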

# **Visualizing relationships**

- We also need to be careful to remember that the Pearson coefficient we've been looking at only describes the linear correlation between variables. 
- Variables can have a strong non-linear relationship and still have a Pearson correlation coefficient close to zero.
- Alternatively, data might have a correlation coefficient indicating a strong linear relationship when another relationship, such as quadratic, is actually a better fit for the data. 
- This is why it's important to complement our correlation calculations with scatter plots!

![image.png](attachment:image.png)

- Strong relationship, but not linear
- Pearson correlation coefficient: -6.48e-18 (effectively zero)

![image.png](attachment:image.png)

- Quadratic relationship; not linear
- Pearson correlation coefficient: 0.971211
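
Both situations can be reproduced with synthetic data. The sketch below uses made-up arrays (not the data behind the figures above): a parabola over a range symmetric around zero, and the same parabola over positive values only.

```python
import numpy as np

# Case 1: y = x^2 over a range symmetric around zero.
# y is fully determined by x, but the relationship is not linear.
x_sym = np.linspace(-1, 1, 101)
y_sym = x_sym ** 2
r_sym = np.corrcoef(x_sym, y_sym)[0, 1]

# Case 2: y = x^2 over positive x only.
# The relationship is still quadratic, but it rises steadily,
# so a straight line approximates it reasonably well.
x_pos = np.linspace(0, 10, 101)
y_pos = x_pos ** 2
r_pos = np.corrcoef(x_pos, y_pos)[0, 1]

print(f"Symmetric parabola:     r = {r_sym:.2e}")
print(f"Positive-only parabola: r = {r_pos:.3f}")
```

The first coefficient is zero up to floating-point error, while the second is high even though neither relationship is linear. A scatter plot of either pair would reveal the curve immediately.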

# **Scatter plots**

- For example, the monthly income of the female partner and the male partner at the time of divorce showed a correlation coefficient of 0.32 in our heatmap.
- Let's check that this correctly indicates a weak positive relationship between the two variables by passing them as the `x` and `y` arguments to Seaborn's `scatterplot` function.
- It looks like the relationship exists but is not particularly strong, just as our heatmap suggested.

In [None]:
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'divorce' is a pandas DataFrame with columns 'income_man' and 'income_woman'
sns.scatterplot(data=divorce, x="income_man", y="income_woman")
plt.show()

![image.png](attachment:image.png)

# **Pairplots**

- We can take our scatterplots to the next level with Seaborn's pairplot. 
- When passed a DataFrame, pairplot plots all pairwise relationships between numerical variables in one visualization. On the diagonal from upper left to lower right, we see the distribution of each variable's observations. 
- This is useful for a quick overview of relationships within the dataset. 
- However, having this much information in one visual can be difficult to interpret, especially for datasets with many variables, where the plot labels become very small, like the ones we see here.

In [None]:
# import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'divorce' is a pandas DataFrame
sns.pairplot(data=divorce)
plt.show()

![image.png](attachment:image.png)

- We can limit the number of plotted relationships by setting the `vars` argument equal to a list of the variables of interest.
- This visual reassures us that what our correlation coefficients told us was true: variables representing the income of each partner as well as the marriage duration variable all have fairly weak relationships with each other. 
- We also notice in the lower right plot that the distribution of marriage durations includes many shorter marriages and fewer longer marriages.

In [None]:
# Create a pairplot
sns.pairplot(data=divorce, vars=["income_man", "income_woman", "marriage_duration"])

# Display the plot
plt.show()

![image.png](attachment:image.png)