Inferred from population.csv, child_mortality.csv, fertility.csv, life_expectancy.csv, family_heights.csv

In [None]:
import numpy as np
from numpy.random import default_rng
rng = default_rng()
import pandas as pd
from scipy.optimize import minimize

# These lines do some fancy plotting magic.
import matplotlib
# This is a magic function that renders the figure in the notebook, instead of displaying a dump of the figure object.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# **Part 1: Population**
The global population of humans reached 1 billion around 1800, 3 billion around 1960, and 7 billion around 2011. The potential impact of exponential population growth has concerned scientists, economists, and politicians alike.

The UN Population Division estimates that the world population will likely continue to grow throughout the 21st century, but at a slower rate, perhaps reaching 11 billion by 2100. However, the UN does not rule out scenarios of more extreme growth.

In this section, we will examine some of the factors that influence population growth and how they are changing around the world.

In [None]:
population = pd.read_csv('population.csv', header=0)
population.drop(population.index[(population["time"] >= 2021)],axis=0,inplace=True)
population.shape

### Haiti

In [None]:
h_pop_hti = population.loc[population['geo'] == 'hti']
h_pop_time = h_pop_hti[(h_pop_hti['time'] >= 1970) & (h_pop_hti['time'] <= 2020)]
h_pop = h_pop_time[['time','population_total']]
h_pop.shape
tens = np.arange(1970,2021,10)
h_decade = h_pop.loc[h_pop['time'].isin(tens)]
h_decade
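# Convert to NumPy arrays so the division pairs values by position
# (1970 with 1980, 1980 with 1990, ...) instead of aligning on index labels.
initial = h_decade[(h_decade['time'] >= 1970) & (h_decade['time'] <= 2010)]['population_total'].to_numpy()
changed = h_decade[(h_decade['time'] >= 1980) & (h_decade['time'] <= 2020)]['population_total'].to_numpy()

# Average annual growth rate over each 10-year span
growth_rates = ((changed/initial)**0.1)-1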

h_1970_through_2010 = h_decade.loc[h_decade['time'] <= 2010]
h_decade_growth = h_1970_through_2010.assign(annual_growth = growth_rates)

While the population has grown every decade since 1970, the annual growth rate decreased dramatically from 1980 to 2020. Let's look at some other information in order to develop a possible explanation.
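The annual growth rates above rest on the compound-growth relationship `changed = initial * (1 + rate) ** years`. The following is a minimal sketch of that calculation on made-up numbers; the helper name `annual_growth_rate` and the example populations are assumptions for illustration and are not taken from population.csv.

```python
import numpy as np

def annual_growth_rate(initial, changed, years):
    """Average annual growth rate implied by going from `initial` to `changed` in `years` years."""
    # np.asarray drops any pandas index, so values are paired by position.
    initial = np.asarray(initial, dtype=float)
    changed = np.asarray(changed, dtype=float)
    return (changed / initial) ** (1 / years) - 1

# Hypothetical populations at the start and end of two 10-year spans
example_start = np.array([1_000_000, 2_000_000])
example_end = np.array([1_200_000, 2_200_000])

example_rates = annual_growth_rate(example_start, example_end, 10)
print(example_rates)
```

Compounding each rate for 10 years recovers the end population, which is a quick way to check the formula.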

In [None]:
life_expectancy = pd.read_csv('life_expectancy.csv', header=0)
life_expectancy.drop(life_expectancy.loc[life_expectancy['time'] >= 2021].index, inplace=True)

child_mortality = pd.read_csv('child_mortality.csv', header=0)
child_mortality.drop(child_mortality.loc[child_mortality['time'] >= 2021].index, inplace=True)
child_mortality.rename(columns={"child_mortality_0_5_year_olds_dying_per_1000_born":"child_mortality_under_5_per_1000_born"}, inplace=True)

fertility = pd.read_csv('fertility.csv', header=0)
fertility.drop(fertility.loc[fertility['time'] >= 2021].index, inplace=True)

In [None]:
life_expectancy_haiti = life_expectancy.loc[life_expectancy['geo'] == 'hti']
life_expectancy_haiti = life_expectancy_haiti[(life_expectancy_haiti['time'] >= 1970)]
life_expectancy_haiti = life_expectancy_haiti[['time','life_expectancy_years']]

# plotting the data...
plt.figure(figsize = (5, 5))
plt.plot(life_expectancy_haiti['time'], life_expectancy_haiti['life_expectancy_years'], marker = 'o', linestyle = '-', color = 'blue')
plt.title('Change in Life Expectancy in Haiti (1970-2020)')
plt.xlabel('Year')
plt.ylabel('Life Expectancy At Birth')
plt.grid(True)
plt.show()

Assuming everything else stays the same, the trend in the life expectancy graph above does not explain why the population growth rate decreased from 1980 to 2010. Life expectancy trends upward over this period, which on its own would push the growth rate up rather than down. The one sharp drop, in 2010, coincides with the 2010 Haiti earthquake, which had a catastrophic impact on Haiti's population. The earthquake, rather than any sustained decline in life expectancy, is the more plausible reason for a lower growth rate around 2010.

In [None]:
def fertility_over_time(country, start):
    """Create a two-column dataframe that describes a country's total fertility rate each year."""
    fertility_data_frame = fertility.loc[(fertility['geo'] == country) & (fertility['time'] >= start)][['time','children_per_woman_total_fertility']]
    fertility_data_frame.rename(columns={'time':'Year','children_per_woman_total_fertility':'Children per woman'},inplace=True)
    return fertility_data_frame

In [None]:
haiti_code = 'hti'
fertility_over_time(haiti_code, 1970).plot(0, 1)

Yes, the trend in fertility in the graph above directly explains why the population growth rate decreased from 1980 to 2020. The graph shows children per woman falling steadily from 1970 to 2020. As fertility goes down, each mother has fewer children on average, which lowers the population growth rate.

In [None]:
post_1969_fertility_and_child_mortality = child_mortality.join(fertility.set_index(['geo','time']), on=['geo','time'])
post_1969_fertility_and_child_mortality = post_1969_fertility_and_child_mortality[post_1969_fertility_and_child_mortality['time'] >= 1970]
post_1969_fertility_and_child_mortality.rename(columns={'time':'Year','children_per_woman_total_fertility':'Children per woman','child_mortality_under_5_per_1000_born':'Child deaths per 1000 born'},inplace=True)
post_1969_fertility_and_child_mortality.plot.scatter('Children per woman','Child deaths per 1000 born')

The scatter plot shows a positive association between children per woman and child deaths per 1000 born: where women have more children on average, child mortality tends to be higher. This is consistent with the idea that reduced child mortality leads parents to choose to have fewer children, although the plot alone cannot establish the direction of cause. A few outliers suggest that a low number of children per woman can occasionally coincide with a high mortality rate, but this seems to be quite rare.

In [None]:
post_1969_fertility_and_child_mortality.plot.scatter('Year', 'Child deaths per 1000 born')

Plotting child deaths per 1000 born against the year shows that the outlier corresponds to the 2010 Haiti earthquake. Historical events therefore come into play when analysing Haiti's population.

In [None]:
heights = pd.read_csv('family_heights.csv')
heights

In [None]:
heights.plot.scatter('midparentHeight', 'childHeight')
plt.xlabel('Midparent Height')
plt.ylabel('Child Height')
plt.grid(True)
plt.title("Scatter Plot of Child Height vs. Midparent Height")
plt.show()

In [None]:
close_to_70_data = heights[(heights['midparentHeight'] >= 69.5) & (heights['midparentHeight'] <= 70.5)]
close_to_70_average = close_to_70_data['childHeight'].mean()
close_to_70 = round(close_to_70_average, 4)

close_to_70

In [None]:
plt.scatter(heights['midparentHeight'], heights['childHeight'], s=10)  # plots every child against midparent height
plt.xlabel('Parent Average')
plt.ylabel('Child')

plt.plot([69.5, 69.5], [50, 85], color='red', lw=2)  # plots red vertical lines
plt.plot([70.5, 70.5], [50, 85], color='red', lw=2)
plt.scatter(70, close_to_70, color='gold', s=40);  # plots gold point

In [None]:
def predict_child(avg_height_of_parents):
    close_to_avg_height_of_parents = heights[(heights['midparentHeight'] >= (avg_height_of_parents - 0.5)) &
                       (heights['midparentHeight'] <= (avg_height_of_parents + 0.5))]
    close_to_avg_height_of_parents = round(close_to_avg_height_of_parents['childHeight'].mean(), 4)

    return close_to_avg_height_of_parents

In [None]:
child_predictions = []

for avg_parent_height in heights['midparentHeight']:
    predicted_childs_height = predict_child(avg_parent_height)
    child_predictions.append(predicted_childs_height)

# the code below makes a copy of heights and inserts the Predictions column
height_and_prediction = heights.copy()
height_and_prediction.insert(8, "Predictions", child_predictions, True)
height_and_prediction

In [None]:
plt.scatter(height_and_prediction['midparentHeight'], height_and_prediction['childHeight'], color='green', label='Actual Data')
plt.scatter(height_and_prediction['midparentHeight'], height_and_prediction['Predictions'], color='red', label='Predicted Data')
plt.xlabel('Midparent Height')
plt.ylabel('Child Height')
plt.legend()
plt.grid(True)
plt.title("Scatter Plot of Child Height vs. Midparent Height")
plt.show()

In [None]:
correlation_coefficient = np.corrcoef(height_and_prediction['midparentHeight'], height_and_prediction['Predictions'])[0, 1]
correlation_coefficient

A correlation coefficient between 0.9 and 1 indicates a strong positive linear association between the average parent height and the predicted child height. A value of exactly 1 would mean a perfect positive linear relationship between the two columns, so the predictions come close to that: as the average height of the parents increases, the predicted height of the child also increases. The data are treated as i.i.d. (independent and identically distributed): each child's height is assumed not to depend on the heights of other children, even those from the same family, and all heights are assumed to come from the same distribution.
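One way to see what the correlation coefficient measures is to compute it as the average product of the two variables in standard units. The sketch below does this on synthetic data and compares the result with `np.corrcoef`; the helper names `standard_units` and `correlation`, and the generated `demo_x` and `demo_y` values, are assumptions for illustration and do not come from family_heights.csv.

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units using the population SD (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """Correlation coefficient as the mean product of standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Synthetic, roughly height-like data (hypothetical values)
rng_demo = np.random.default_rng(0)
demo_x = rng_demo.normal(69, 2, size=200)
demo_y = 0.6 * demo_x + rng_demo.normal(0, 3, size=200)

demo_r = correlation(demo_x, demo_y)
print(demo_r, np.corrcoef(demo_x, demo_y)[0, 1])
```

The two printed values should agree up to floating-point error, since both compute the same Pearson correlation.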

In [None]:
def converts_to_standard_units(arr):
    mean = arr.mean()
    std = arr.std()
    z = (arr - mean) / std
    return z

midparent_SU = converts_to_standard_units(heights['midparentHeight'])
child_SU = converts_to_standard_units(heights['childHeight'])

heights_SU = pd.DataFrame({'MidParent SU': midparent_SU,'Child SU': child_SU})

print(heights_SU.head())

sd_midparent = heights['midparentHeight'].std()
close = 0.5 / sd_midparent

def predict_child_SU(avg_height_of_parents):
    close_to_avg_height_of_parents = heights_SU[(heights_SU['MidParent SU'] >= (avg_height_of_parents - close)) &
                       (heights_SU['MidParent SU'] <= (avg_height_of_parents + close))]
    predicted_child_height = round(close_to_avg_height_of_parents['Child SU'].mean(), 4)

    return predicted_child_height

child_predictions_SU = []

for avg_parent_height_SU in heights_SU['MidParent SU']:
    predicted_child_SU = predict_child_SU(avg_parent_height_SU)
    child_predictions_SU.append(predicted_child_SU)

# the code below makes a copy of heights_SU and inserts the Predictions SU column
height_and_prediction_SU = heights_SU.copy()
height_and_prediction_SU.insert(2, "Predictions SU", child_predictions_SU, True)
height_and_prediction_SU

In [None]:
plt.scatter(height_and_prediction_SU['MidParent SU'], height_and_prediction_SU['Child SU'], color='green', label='Actual Data in SU')
plt.scatter(height_and_prediction_SU['MidParent SU'], height_and_prediction_SU['Predictions SU'], color='red', label='Predicted Data in SU')
plt.xlabel('Midparent Height in SU')
plt.ylabel('Child Height in SU')
plt.legend()
plt.grid(True)
plt.title("Scatter Plot of Child Height vs. Midparent Height")

plt.show()

In [None]:
def slope(df, x_label, y_label):
    r = np.corrcoef(df[x_label], df[y_label])[0, 1]
    sd_x = df[x_label].std()
    sd_y = df[y_label].std()
    slope = r * (sd_y / sd_x)
    return slope

def intercept(df, x_label, y_label):
    mean_x = df[x_label].mean()
    mean_y = df[y_label].mean()
    slope_value = slope(df, x_label, y_label)
    intercept = mean_y - (slope_value * mean_x)
    return intercept

family_slope = slope(heights, 'midparentHeight', 'childHeight')
family_intercept = intercept(heights, 'midparentHeight', 'childHeight')
family_slope, family_intercept

In [None]:
def reg_prediction(avg_parent_height):
    regression_eqn =  family_slope * avg_parent_height + family_intercept
    return regression_eqn

regression_predictions = []

for parent_height in heights['midparentHeight']:
    prediction = reg_prediction(parent_height)
    regression_predictions.append(prediction)

# the code below adds the Regression Predictions column to height_and_prediction
height_and_prediction.insert(9, "Regression Predictions", regression_predictions, True)
height_and_prediction

In [None]:
# Scatter plot - original child height data
plt.scatter(height_and_prediction['midparentHeight'], height_and_prediction['childHeight'], color='blue', label='Original Child Height Data')

# Scatter plot - predicted data
plt.scatter(height_and_prediction['midparentHeight'], height_and_prediction['Predictions'], color='green', label='Predicted Data', alpha=0.5)

# Scatter plot - regression prediction data
plt.scatter(height_and_prediction['midparentHeight'], height_and_prediction['Regression Predictions'], color='red', label='Regression Prediction', alpha=0.5)

plt.xlabel('Midparent Height')
plt.ylabel('Child Height')
plt.title('Comparison of Original, Predicted, and Regression Predicted Child Height Data')
plt.legend()
plt.grid(True)
plt.show()

The regression predictions follow the trend of the predicted data, consistent with the relationship between the average height of the parents and the child's height. Because the correlation coefficient is quite high, both the predicted heights and the regression predictions approximate the child heights well. The model suggests that the relationship follows an upward linear trend: taller parents, on average, tend to have taller children.
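As a sanity check on the regression line, the slope `r * sd_y / sd_x` and the intercept `mean_y - slope * mean_x` should match an ordinary least-squares fit. The following sketch compares them with `np.polyfit` on synthetic data; the helper `fit_line` and the generated `demo_parent` and `demo_child` values are assumptions for illustration, not the family heights themselves.

```python
import numpy as np

def fit_line(x, y):
    """Slope and intercept of the regression line via the correlation coefficient."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    slope = r * y.std() / x.std()
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# Synthetic midparent and child heights in inches (hypothetical values)
rng_demo = np.random.default_rng(1)
demo_parent = rng_demo.normal(69, 2, size=300)
demo_child = 22 + 0.65 * demo_parent + rng_demo.normal(0, 3, size=300)

demo_slope, demo_intercept = fit_line(demo_parent, demo_child)
ls_slope, ls_intercept = np.polyfit(demo_parent, demo_child, 1)  # degree-1 least squares
print(demo_slope, demo_intercept)
print(ls_slope, ls_intercept)
```

Both approaches minimise the same squared error, so the two pairs of numbers should agree up to floating-point error.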