In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [13]:
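df = pd.read_csv('master.csv')
population = df['population']
# GDP is stored as strings with comma thousands separators; strip the commas and convert to integers
gdp_per_year = df['gdp_per_year($)'].str.replace(',', '').astype('int64')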

Correlation describes whether, and how strongly, two variables are related, for example an independent variable $x$ and a dependent variable $y$. Let's look at the dataset below. How do we know whether population is related to GDP per year?

### Dataset - Suicide Rates Overview 1985 to 2016
https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

In [14]:
df[['population','gdp_per_year($)']]

| | population | gdp_per_year($) |
|---|---|---|
| 0 | 312900 | 2156624900 |
| 1 | 308000 | 2156624900 |
| 2 | 289700 | 2156624900 |
| 3 | 21800 | 2156624900 |
| 4 | 274300 | 2156624900 |
| ... | ... | ... |
| 27815 | 3620833 | 63067077179 |
| 27816 | 348465 | 63067077179 |
| 27817 | 2762158 | 63067077179 |
| 27818 | 2631600 | 63067077179 |


To see whether there is a correlation between population and GDP, we can calculate _Pearson's correlation coefficient_, or _Pearson's R_. Pearson's R measures the strength and direction of the linear relationship between two variables and ranges from $-1$ (perfect negative correlation) through $0$ (no linear correlation) to $1$ (perfect positive correlation). To calculate Pearson's R, we can use the following formula:

$$r = \frac{\sum(Z_xZ_y)}{N - 1}$$

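where:
$$Z_x = \frac{x - \bar{x}}{s_x} = \text{ Z-Score for X}$$
$$Z_y = \frac{y - \bar{y}}{s_y} = \text{ Z-Score for Y}$$
$$N = \text{Number of data points}$$

Here $\bar{x}$ and $\bar{y}$ are the means, and $s_x$ and $s_y$ the sample standard deviations, of X and Y.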
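
Writing each z-score out in full shows that this formula is the same as the more commonly quoted covariance form of Pearson's R. The short derivation below is a sketch that assumes $s_x$ and $s_y$ are sample standard deviations, that is, $s_x = \sqrt{\sum(x_i - \bar{x})^2 / (N - 1)}$ and likewise for $s_y$:

$$r = \frac{1}{N - 1}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\,s_x\,s_y} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}$$

The last step uses $(N - 1)\,s_x\,s_y = \sqrt{\sum(x_i - \bar{x})^2}\,\sqrt{\sum(y_i - \bar{y})^2}$, so the $N - 1$ terms cancel.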

In [16]:
x = population
y = gdp_per_year
N = len(df)

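# Calculate Z-Scores (pandas .std() defaults to the sample standard deviation, ddof=1,
# which matches the N - 1 denominator used below)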
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Pearson's R
r = sum(zx * zy) / (N - 1)
print(f"Pearson's R: {r}")

Pearson's R: 0.7106973227934138
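
As a sanity check, the z-score formula can be compared against a library implementation. The sketch below does this on a small made-up sample rather than the real dataset; `pearson_r`, `x_demo` and `y_demo` are illustrative names that are not part of the notebook, and `np.corrcoef` is NumPy's built-in correlation matrix function.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's R from z-scores, using sample standard deviations (ddof=1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return float(np.sum(zx * zy) / (n - 1))

# Made-up toy data, for illustration only
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]

r_manual = pearson_r(x_demo, y_demo)
# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = float(np.corrcoef(x_demo, y_demo)[0, 1])

print(f"Manual:  {r_manual:.4f}")
print(f"NumPy:   {r_numpy:.4f}")
```

Both values should agree up to floating-point rounding (about 0.775 for this toy sample). On the real data, `population.corr(gdp_per_year)` should likewise reproduce the value printed above, since pandas' `Series.corr` uses the Pearson method by default.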


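We can interpret the relationship between population and GDP with the following table. The calculated R value is approximately 0.71, which indicates a positive correlation: rows with larger populations tend to have a larger GDP per year. Note that correlation on its own does not show that population causes GDP to change; it only tells us that the two variables move together.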

![image.png](attachment:ef9fe148-3558-42a1-bfe5-ce9970819305.png)