# Introduction

We are provided with the CSV file `all_data.csv`, a small dataset containing information from the [World Bank](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD) and the [World Health Organization](https://apps.who.int/gho/data/node.main.688) on life expectancy and GDP across six nations.

The dataset contains the following columns:

* **Country** - nation  
* **Year** - the year for the observation  
* **Life expectancy at birth (years)** - life expectancy value in years  
* **GDP** - Gross Domestic Product in U.S. dollars  

We will be investigating the following five questions:

* Has life expectancy increased over time in the six nations?
* Has GDP increased over time in the six nations?
* Is there a correlation between life expectancy of a country and GDP?
* What is the average life expectancy in these nations?
* What is the distribution of that life expectancy?

# Import the necessary libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm

%matplotlib inline

# Load and inspect the data

Now we load the CSV file into a pandas DataFrame and inspect it with the `DataFrame.head()` and `DataFrame.describe()` methods and the `DataFrame.shape` attribute.

In [None]:
df = pd.read_csv('../input/life-expectancy-and-gdp-data/all_data.csv')
df.head()

In [None]:
df.shape

The DataFrame (or table) has 96 rows and four columns.

In [None]:
df.describe()

# Explore the data

Here we use a variety of visualizations to assess the dataset and to draw out patterns that will help us answer the questions stated in the Introduction.

First, we look at a list of the unique countries in the dataset:

In [None]:
print(df.Country.unique())

Now we look at the unique years so we can determine the range of time we will be concerned with (and ensure that the dataset is complete in providing data for all years).

In [None]:
print(df.Year.unique())
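
Listing the unique years shows the time range, but it does not by itself confirm that every country has an observation for every year. A minimal sketch of such a completeness check follows, assuming a toy table with the same `Country` and `Year` column names; the values are illustrative and are not taken from `all_data.csv`.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame (illustrative values only)
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
})

# Number of distinct years observed for each country
years_per_country = toy.groupby("Country").Year.nunique()

# The table is complete if every country covers every year in the dataset
is_complete = (years_per_country == toy.Year.nunique()).all()

print(years_per_country)
print("Complete:", is_complete)
```

Running the same two lines against `df` would show whether all six countries cover all sixteen years.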

We saw in the initial inspection of the data that there is a column titled "Life expectancy at birth (years)". This name is unwieldy when writing code, so we will shorten it to the acronym "LEABY" and then use `DataFrame.head()` to confirm that the change took effect.

In [None]:
df = df.rename({"Life expectancy at birth (years)":"LEABY"}, axis = "columns")
df.head()

**Has life expectancy increased over time in the six nations?**

We use Seaborn to create a line plot of life expectancy against year.

In [None]:
plt.figure(figsize=(8,6))
sns.lineplot(x=df.Year, y=df.LEABY, hue=df.Country)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.ylabel("Life expectancy at birth (years)");

We can see that all six countries have experienced an increase in life expectancy over the sixteen years plotted. Except for Zimbabwe, all the countries appear to have had approximately linear growth in LEABY, although growth in Chile and Mexico was less smooth. Zimbabwe seems to be an outlier in its minimum, maximum, and pattern of growth.

In [None]:
countries_not_zimbabwe = df[df.Country != 'Zimbabwe'].LEABY
zimbabwe = df[df.Country == 'Zimbabwe'].LEABY

print('Minimum LEABY from Chile, China, Germany, Mexico, US:', np.min(countries_not_zimbabwe))
print('Maximum LEABY from Chile, China, Germany, Mexico, US:', np.max(countries_not_zimbabwe))
print('Mean LEABY from Chile, China, Germany, Mexico, US:', round(np.mean(countries_not_zimbabwe), 1), '\n')

print('Minimum LEABY from Zimbabwe:', np.min(zimbabwe))
print('Maximum LEABY from Zimbabwe:', np.max(zimbabwe))
print('Mean LEABY from Zimbabwe:', round(np.mean(zimbabwe), 1))

The following calculates a Series mapping each country to the range of its LEABY, measured as its maximum life expectancy over the time series minus its minimum life expectancy.

In [None]:
df.groupby('Country').LEABY.apply(lambda country: country.max() - country.min())

Compare the above to the following, which calculates the difference between life expectancy in 2015 and life expectancy in 2000 (the difference between the last and first observations of LEABY for each country in the dataset).

In [None]:
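# Sort by year so that iloc[0] and iloc[-1] are the first and last observations
df.sort_values('Year').groupby('Country').LEABY.agg(lambda leaby: leaby.iloc[-1] - leaby.iloc[0])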

Zimbabwe's difference is not the same in the two results because its life expectancy continued to fall after 2000. On the line plot above, the minimum appears to occur in about 2004, which we can verify:

In [None]:
df.loc[(df.Country == 'Zimbabwe') & (df.LEABY == df[df.Country == 'Zimbabwe'].LEABY.min())]

This is the minimum life expectancy Zimbabwe experienced, as calculated earlier, so we eyeballed it accurately. We can look at the full data for Zimbabwe just to make sure:

In [None]:
df[df.Country == 'Zimbabwe']
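
The gap between the range (maximum minus minimum) and the net change (last minus first) can be reproduced on a small example. The sketch below assumes a toy table with the notebook's column names; country "A" dips before recovering and country "B" rises steadily. The values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy data (illustrative values only): "A" dips then recovers, "B" rises steadily
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "LEABY": [46.0, 44.0, 50.0, 60.0, 70.0, 71.0, 72.0, 73.0],
})

# Group after sorting by year so first()/last() are the earliest/latest observations
leaby = toy.sort_values("Year").groupby("Country").LEABY

value_range = leaby.max() - leaby.min()      # maximum minus minimum
net_change = leaby.last() - leaby.first()    # last year minus first year

# Row holding each country's minimum LEABY, located with idxmin
min_rows = toy.loc[toy.groupby("Country").LEABY.idxmin(), ["Country", "Year", "LEABY"]]

print(pd.DataFrame({"range": value_range, "net_change": net_change}))
print(min_rows)
```

The two measures agree whenever the series rises monotonically and differ when the minimum falls after the first year, which is the pattern described for Zimbabwe.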

We can look at the six countries individually, with each LEABY axis scaled to the range of that country's observations. This makes it clear that all countries have experienced an increase in LEABY, answering our first question. We render these graphs as scatterplots with lines of best fit included.

In [None]:
graphLEABY = sns.FacetGrid(df, col="Country", col_wrap=3,
                      hue = "Country", sharey = False)

graphLEABY = (graphLEABY.map(sns.regplot,"Year","LEABY")
         .add_legend()
         .set_axis_labels("Year","LEABY"))

graphLEABY;

Zimbabwe's LEABY growth appears to be non-linear. We can check whether, after fitting a linear model, the residuals meet the conditions of normality and homoscedasticity.

In [None]:
zimb = df[df.Country == 'Zimbabwe']

model = sm.OLS.from_formula('LEABY ~ Year', zimb)
results = model.fit()

fitted_values = results.predict(zimb)

residuals = zimb.LEABY - fitted_values

# Check for normality
plt.hist(residuals)
plt.show()

In [None]:
# Check for homoscedasticity
plt.scatter(fitted_values, residuals)
plt.show()

Neither condition is met, which suggests that linear regression is inappropriate for assessing Zimbabwe's LEABY growth over the period.
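
One rough way to put a number on how poorly a straight line describes a dip-and-recovery pattern is to compare the residual sum of squares of a linear fit with that of a quadratic fit. The sketch below uses `numpy.polyfit` on a hypothetical series; the values are illustrative and are not Zimbabwe's data, and the quadratic is only a comparison curve, not a claim about the true shape.

```python
import numpy as np

# Hypothetical dip-and-recovery series (illustrative values only)
years = np.arange(2000, 2016)
x = years - years.mean()                  # centre the years to keep the fit well conditioned
leaby = 45 + 0.15 * (years - 2004) ** 2   # falls until 2004, then rises

def sse(degree):
    """Residual sum of squares of a polynomial fit of the given degree."""
    coeffs = np.polyfit(x, leaby, degree)
    residuals = leaby - np.polyval(coeffs, x)
    return float(np.sum(residuals ** 2))

sse_linear = sse(1)
sse_quadratic = sse(2)

print("Linear SSE:   ", round(sse_linear, 2))
print("Quadratic SSE:", round(sse_quadratic, 2))
```

A much larger error for the straight line than for the curve is consistent with the patterned residuals seen in the diagnostic plots.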

Now we consider GDP instead of LEABY. **Has GDP increased over time in the six nations?**

In [None]:
plt.figure(figsize=(8,6))
sns.lineplot(x=df.Year, y=df.GDP, hue=df.Country)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.ylabel("GDP (U.S. dollars)");

It appears that only the United States and China have had considerable growth, in absolute terms, in GDP. Zimbabwe appears on the graph to have experienced no GDP growth at all, but looking at the countries individually clarifies things a bit:

In [None]:
graphGDP = sns.FacetGrid(df, col="Country", col_wrap=3,
                      hue = "Country", sharey = False)

graphGDP = (graphGDP.map(sns.lineplot,"Year","GDP")
         .add_legend()
         .set_axis_labels("Year","GDP"))

graphGDP;

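All countries experienced GDP growth over the period, including Zimbabwe. GDP fell in 2015 for Chile, Germany, and Mexico. Because the World Bank series is measured in current U.S. dollars, this decline may reflect exchange-rate movements as well as changes in output.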
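
Because the countries' GDPs differ by orders of magnitude, absolute growth and relative growth can rank them differently. A minimal sketch of a growth factor (last year's GDP divided by first year's GDP) follows, assuming a toy table with the notebook's column names; the values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002] * 2,
    "GDP": [1.0e12, 1.5e12, 4.0e12, 2.0e12, 2.2e12, 2.4e12],
})

# Group after sorting by year so first()/last() are the earliest/latest observations
gdp = toy.sort_values("Year").groupby("Country").GDP

# Ratio of final to initial GDP; a value above 1 means GDP grew over the period
growth_factor = gdp.last() / gdp.first()

print(growth_factor.sort_values(ascending=False))
```

Applied to `df`, the same expression would rank the six countries by relative growth rather than by the absolute growth visible on the shared-axis line plot.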

**Is there a correlation between life expectancy of a country and GDP?**

In [None]:
sns.scatterplot(x=df.GDP, y=df.LEABY, hue=df.Country).legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1);

Positive correlations between LEABY and GDP are suggested everywhere except Zimbabwe, where this graph is ambiguous: GDP appears not to grow at all, although we know from earlier that Zimbabwe's GDP did in fact grow over the period. We will look at the relationship within each country next, and we anticipate a positive correlation for each. Chile's LEABY appears to have increased sharply alongside a minor rise in GDP, while life expectancy in the United States and China appears to rise more slowly as GDP grows (for China, a logarithmic relationship is suggested).

In [None]:
graph = sns.FacetGrid(df, col="Country", col_wrap=3,
                      hue = "Country", sharey = False, sharex = False)
graph = (graph.map(sns.scatterplot,"GDP", "LEABY")
         .add_legend()
         .set_axis_labels("GDP (U.S. dollars)", "Life expectancy at birth (years)"));
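
The scatterplots can be complemented with a Pearson correlation coefficient computed separately for each country. The sketch below uses `Series.corr` on a toy table with the notebook's column names; the values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "GDP": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "LEABY": [70.0, 71.0, 72.5, 73.0, 50.0, 49.0, 55.0, 60.0],
})

# Pearson correlation between GDP and LEABY within each country
correlations = pd.Series({
    country: group.GDP.corr(group.LEABY)
    for country, group in toy.groupby("Country")
})

print(correlations.round(3))
```

Pearson's coefficient measures linear association only, so a curved relationship such as the one suggested for China would still be better judged alongside the plot.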

**What is the average life expectancy in these nations?**

In [None]:
df.groupby('Country').LEABY.mean()

**What is the distribution of that life expectancy?**

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df.LEABY, kde=True)
plt.xlabel("Life expectancy at birth (years)");

The distribution of all LEABY datapoints is left-skewed, partially explained by Zimbabwe's being an outlier with unusually low LEABY numbers for the dataset.
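
Skewness can also be checked numerically. A minimal sketch using `Series.skew` follows, on a hypothetical sample with a cluster of high values and a few low ones; the values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample (illustrative values only): most values high, a few low
leaby = pd.Series([44.0, 46.0, 50.0, 74.0, 75.0, 76.0, 78.0, 79.0, 80.0, 80.0])

# Negative sample skewness indicates a longer tail on the left
skewness = leaby.skew()

# In a left-skewed sample the low tail typically pulls the mean below the median
mean_below_median = leaby.mean() < leaby.median()

print("Skewness:", round(skewness, 2))
print("Mean below median:", mean_below_median)
```

Calling `df.LEABY.skew()` on the notebook's data would give the corresponding figure for the histogram above.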

# Conclusion

- Has life expectancy increased over time in the six nations?
    - Life expectancy increased in all six nations over the sixteen years covered by the dataset, 2000-2015. Zimbabwe had the most pronounced increase.
- Has GDP increased over time in the six nations?
    - GDP has also increased for all six countries, with China seeing the largest factor of growth.
- Is there a correlation between life expectancy of a country and its GDP?
    - The per-country scatterplots suggest a positive association between LEABY and GDP in all six countries.
- What is the average life expectancy in these nations?
    - For Zimbabwe, about 50 years; for the others, between 74 and 80 years.
- What is the distribution of that life expectancy?
    - The distribution of life expectancy across all observations is left-skewed.