# An Analysis of

### Authored by: Gavin Crisologo, Josue Melendez, Caleb Solomon, & Matthew Yu



## Table of Contents
### Introduction
### [Part 1: Data Collection](#Part-1--Data-Collection)
### [Part 2: Data Cleaning](#Part-2--Data-Cleaning)
### [Part 3: Exploratory Data Analysis](#Part-3--Exploratory-Data-Analysis)
### [Part 4: Model Implementation](#Part-4--Model-Implementation)
### [Part 5: Visualizations](#Part-5--Visualizations)
### [Part 6: Conclusions](#Part-6--Conclusions)

## Introduction

Welcome! Data science in finance and economics can be confusing. Not only are there plenty of ways to use data science in economics, but there are also dozens of topics that can be covered, such as economic forecasting or financial consulting. To make the learning process easier for any prospective data scientist, this tutorial guides you through the process of obtaining data, cleaning it, and modeling it for any future projects. For the purposes of this tutorial, we will apply data science principles to modeling economic health using GDP and GDP per capita.

But what are Gross Domestic Product and Gross Domestic Product per capita?

Gross Domestic Product, otherwise known as GDP, is the total monetary value of all goods and services produced in a country over a period of time, and it is typically measured annually. It is an enormous measurement, accounting for everything from gum at a gas station to professional medical services. It represents the production power of an economy and is used in a variety of applications, from economic forecasting to policy making to business and investment strategies.

Gross Domestic Product per capita refers to the average economic output per person. It is calculated by dividing Gross Domestic Product by the population size. GDP per capita provides an indicator of the economic prosperity and standard of living within a country. To learn more about GDP per capita, its definition, and its uses, please visit [here](https://www.investopedia.com/terms/p/per-capita-gdp.asp).
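
As a quick illustration of the division described above, the sketch below computes GDP per capita for a made-up country. The figures and the helper name `gdp_per_capita` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def gdp_per_capita(gdp, population):
    """Return GDP divided by population (average output per person)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return gdp / population

# Hypothetical country: $500 billion GDP, 10 million residents
example_gdp = 500_000_000_000
example_population = 10_000_000
print(gdp_per_capita(example_gdp, example_population))  # 50000.0
```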

Throughout this tutorial we will refer to Gross Domestic Product as GDP, and we will use a GDP per capita dataset alongside CO2 emissions per capita and daily income per capita datasets to create daily income predictions. We want to use GDP because it is an overall indicator of economic health, and CO2 emissions are often associated with a country's industrial production, alongside the use of things like cars. The goal is to see whether economic health (GDP) and industrialization (CO2) together have any effect on daily incomes.

But how can we use this data?

When discussing metrics such as GDP, the "how" of the process is often overlooked or neglected. While it is not wrong to say GDP, CO2, and daily income data can be used for all sorts of things, various questions arise along the way about how to clean, organize, and use the data. These are the questions this tutorial will answer.

Throughout the tutorial we will be covering the following aspects of data science in the context of GDP data:

1. Data Collection
2. Data Cleaning 
3. Exploratory Data Analysis
4. Model Implementation 
5. Visualization
6. Conclusions and Next Steps






## Part 1 - Data Collection

As mentioned in the [Introduction](#introduction), we will be using three datasets to create machine learning models and predict future daily income per capita. The three datasets contain data relating to:

1. Previous GDP per capita
2. CO2 Emissions per capita
3. Daily income

All three datasets used here have been gathered from Gapminder. Gapminder is an educational non-profit that aims to "fight devastating ignorance" and tackles misconceptions about trending topics by using reliable data to create teaching materials. To this end, Gapminder provides free access to various relevant and reliable datasets, so anyone can use the data to educate themselves. If you're interested in learning more about Gapminder, its mission, and its resources, please go [here](https://www.gapminder.org/).

We have additionally provided a guide on how to access every dataset used. Each entry below links to Gapminder's dataset directory, lists the menu selections needed to find the specific dataset, and ends with a link to learn more about the dataset and how it can be used.

GDP per capita dataset from: https://www.gapminder.org/data/  (gdp_pcap.csv)
1) Select an indicator
2) Economy
3) Incomes & growth
4) GDP per capita

Additional information about the dataset can be found at:  
http://gapm.io/dgdpcap_cppp

CO2 Emissions per capita dataset from: https://www.gapminder.org/data/  (co2_pcap_cons.csv)
1) Select an indicator
2) CO2 Emissions per capita

Additional information about the dataset can be found at:  
http://gapm.io/dco2_consumption_historic

Daily income dataset from: https://www.gapminder.org/data/  (mincpcap_cppp.csv)
1) Select an indicator
2) Daily income

Additional information about the dataset can be found at:  
http://gapm.io/dmincpcap_cppp

## Part 2 - Data Cleaning

Now that our datasets have been collected, it is time to clean them. Cleaning data refers to adjusting the contents of datasets to correct for errors and missing data, as well as prepare the dataset itself for visualization and further analysis. This is done to ensure the reliability and accuracy of our data and analysis, as "dirty" datasets can lead to various errors such as inaccurate predictions, biased results, etc. We first begin by importing any and all necessary libraries.

### Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

We begin by using pandas (imported above) to read in the dataset CSVs downloaded from Gapminder. Pandas provides a useful function called read_csv() that takes the file name of a dataset as an argument and reads the dataset into a pandas DataFrame.

Pandas is a Python library that is regularly used by data scientists to clean, explore, and analyze various datasets. To learn more about pandas and its capabilities, please visit [this link](https://pandas.pydata.org/).

In [None]:
# Load data from CSVs to pandas DataFrames
co2_percap = pd.read_csv('co2_pcap_cons.csv')
gdp_percap = pd.read_csv('gdp_pcap.csv')
inc_day = pd.read_csv('mincpcap_cppp.csv')

We then display the raw, uncleaned data to get a glimpse of its structure. We see that each dataset is structured with countries as rows and years as columns, with each cell holding the value of that dataset's measure (e.g., CO2 emissions) for that country and year, allowing for easy access.

In [None]:
# Display GDP per capita dataset
print("\nGDP Per Capita Data:")
print(gdp_percap.head())

In [None]:
# Display CO2 per capita dataset
print("CO2 Per Capita Consumption Data:")
print(co2_percap.head())

In [None]:
# Display Daily income dataset
print("\nIncome Per Capita Data:")
print(inc_day.head())
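
If a long layout (one row per country and year) is easier to work with, the wide layout shown by these previews can be reshaped with `DataFrame.melt`. The sketch below uses a tiny made-up table rather than the real CSVs; the country names, values, and the variable name `long_df` are placeholders.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year
wide = pd.DataFrame({
    'country': ['A', 'B'],
    '2000': [1.0, 2.0],
    '2001': [1.5, 2.5],
})

# Reshape to long form: one row per (country, year) pair
long_df = wide.melt(id_vars='country', var_name='year', value_name='value')
long_df['year'] = long_df['year'].astype(int)
print(long_df)
```

The rest of this tutorial keeps the wide layout, since the plotting and averaging code below indexes the year columns directly.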

From here, we begin to filter our DataFrames to ensure we only work with relevant data. We first identify the countries common to all three datasets and store them in a variable named 'common_countries'. We then filter each DataFrame to drop any country that is not in all 3 datasets. This gives us complete data for every country we model, avoiding errors that can arise when a country appears in one dataset but not another (for example, if 'El Salvador' were in the GDP per capita dataset but not in the CO2 emissions per capita dataset).

In [None]:
# Identify common countries across all three datasets
common_countries = set(co2_percap['country']) & set(gdp_percap['country']) & set(inc_day['country'])

# Filter DataFrames and keep only the common countries
co2_percap = co2_percap[co2_percap['country'].isin(common_countries)]
gdp_percap = gdp_percap[gdp_percap['country'].isin(common_countries)]
inc_day = inc_day[inc_day['country'].isin(common_countries)]

''' Note: We have placed the filtered datasets back into their original DataFrames; however, different variables are typically used so that
the original versions of the DataFrames are preserved. This is done in case the original DataFrames are needed further down the line,
so that different versions of the DataFrames remain accessible and the ipynb file can be rerun without causing any errors. '''
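
To illustrate the note above, here is a minimal sketch that stores the filtered result under a new name so the original DataFrame stays available. The toy tables and the `_raw` and `_filtered` suffixes are illustrative choices, not names used elsewhere in this tutorial.

```python
import pandas as pd

# Toy stand-ins for two of the datasets
co2_raw = pd.DataFrame({'country': ['A', 'B', 'C'], '2000': [1.0, 2.0, 3.0]})
gdp_raw = pd.DataFrame({'country': ['B', 'C', 'D'], '2000': [10.0, 20.0, 30.0]})

# Countries present in both tables
shared = set(co2_raw['country']) & set(gdp_raw['country'])

# .copy() gives independent DataFrames; the raw tables are left untouched
co2_filtered = co2_raw[co2_raw['country'].isin(shared)].copy()
gdp_filtered = gdp_raw[gdp_raw['country'].isin(shared)].copy()

print(sorted(shared))
print(len(co2_raw), len(co2_filtered))
```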

The datasets we use contain predicted values from 2024 onwards. As it is currently May 2024, we only want data prior to 2024, as that will be actual measurements rather than predictions. We will use this data to compute our own predictions. Using Python list comprehensions, we gather the columns we're keeping from each dataset and then filter our DataFrames to contain only these columns.

In [None]:
# Keep only columns for years before 2024 (later years are projections, not measurements)
columns_to_keep_co2 = ['country'] + [col for col in co2_percap.columns[1:] if col.isdigit() and int(col) < 2024]
columns_to_keep_gdp = ['country'] + [col for col in gdp_percap.columns[1:] if col.isdigit() and int(col) < 2024]
columns_to_keep_inc = ['country'] + [col for col in inc_day.columns[1:] if col.isdigit() and int(col) < 2024]

# Replace data in our Dataframes with only the data we want to keep
co2_percap = co2_percap[columns_to_keep_co2]
gdp_percap = gdp_percap[columns_to_keep_gdp]
inc_day = inc_day[columns_to_keep_inc]

After ensuring the data within our Dataframes is relevant, we will now clean the data itself to ensure it can be used properly.

Before cleaning, the values are stored as formatted strings such as "1,000" and "−20", which are readable but cannot be used directly in numeric Python functions (Note: − is the Unicode minus sign rather than the typical hyphen -, so it must be converted). We wish to convert these values into floats, such as 1000.0 and -20.0. We do so by defining a function 'num_to_float' that takes in a pandas DataFrame, converts the values in every column after the first to strings, and replaces the offending characters. It then calls the built-in pandas function to_numeric, which converts these strings to their float counterparts and, because of errors='coerce', turns anything it cannot parse into NaN. We call this on every DataFrame we have, and our data is now clean.

In [None]:
# Convert formatted number strings to floats; values that cannot be parsed become NaN
def num_to_float(df):
    df = df.copy()  # work on a copy so the filtered slice passed in is not modified in place
    for col in df.columns[1:]:
        df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '').str.replace('−', '-'), errors='coerce')
    return df

co2_percap = num_to_float(co2_percap)
gdp_percap = num_to_float(gdp_percap)
inc_day = num_to_float(inc_day)
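
One caveat with `errors='coerce'`: any value that is still not a plain number after the replacements becomes `NaN` without warning. If your downloaded files abbreviate large values with suffixes such as `k` or `M` (an assumption about the export format; check the output of `.head()` on your own files), a variant like the sketch below could expand them first. The helper name `parse_abbreviated` and the suffix table are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical suffix multipliers; extend if your files use others
SUFFIXES = {'k': 1e3, 'M': 1e6, 'B': 1e9}

def parse_abbreviated(value):
    """Convert strings such as '1,000', '−20', or '10.5k' to float (NaN if unparseable)."""
    text = str(value).strip().replace(',', '').replace('−', '-')
    multiplier = 1.0
    if text and text[-1] in SUFFIXES:
        multiplier = SUFFIXES[text[-1]]
        text = text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        return np.nan

sample = pd.Series(['1,000', '−20', '10.5k', 'n/a'])
parsed = sample.map(parse_abbreviated)
print(parsed.tolist())
```

Comparing the null counts before and after a change like this is a simple way to see whether any values were being lost to coercion.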

After cleaning our data, we then run a null check on our dataframes, and will check to see which datasets contain null values. 

Note: Null is not always a bad thing. In the case of GDP per capita data, some countries did not exist within the time ranges within the dataset. (As an example: South Sudan was established in 2011)

In [None]:
# Check for null values in each dataframe
print("\nNull values in CO2 Per Capita Data:")
print(co2_percap.isnull().sum().sum())
print("\nNull values in GDP Per Capita Data:")
print(gdp_percap.isnull().sum().sum())
print("\nNull values in Daily Income Data:")
print(inc_day.isnull().sum().sum())
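
The totals above say how many values are missing but not where. A per-row count shows which countries are affected. The sketch uses a small made-up table; with the real data the same call could be applied to `co2_percap`, `gdp_percap`, or `inc_day`.

```python
import numpy as np
import pandas as pd

# Made-up table with gaps for one country
toy = pd.DataFrame({
    'country': ['A', 'B'],
    '2009': [1.0, np.nan],
    '2010': [1.1, np.nan],
    '2011': [1.2, 0.4],
})

# Count missing values per country (row-wise), then show only rows with gaps
nulls_per_country = toy.set_index('country').isnull().sum(axis=1)
print(nulls_per_country[nulls_per_country > 0])
```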

Now that all datasets have been filtered and null checked, we display the head of our data to get a glimpse of what the data may look like.

In [None]:
# Display the first few rows of each filtered dataframe
print("\nFiltered CO2 Per Capita Data:")
print(co2_percap.head())
print("\nFiltered GDP Per Capita Data:")
print(gdp_percap.head())
print("\nFiltered Daily Income Data:")
print(inc_day.head())

## Part 3 - Exploratory Data Analysis

Now that we have clean, relevant data to work with, we will conduct an Exploratory Data Analysis, or EDA for short. An EDA typically consists of getting to know the data and using descriptive methods to note its general patterns.

It is not meant to be an in-depth method of observation, so the techniques used within this EDA aim to capture the generalities within our DataFrames. These general patterns inform how we build our models and predictions and support a more accurate, in-depth analysis later on. This EDA consists of first examining each dataset in its entirety to note general trends, and then homing in on specific subsets of the data based on relevant criteria, which will allow us to further develop our hypothesis.


The first step in our EDA is to examine our datasets visually, allowing us to draw some conclusions from visual trends. We do this using matplotlib (imported in Part 2), a Python library that is commonly used to visualize datasets.

In [None]:
# Plot CO2 per capita over time for each country
plt.figure(figsize=(10, 5))
# for every country in the CO2 dataset, extract the years and the emissions
# for each year, and plot them.
for country in co2_percap['country']:
    # extracting year and emission
    years = co2_percap.columns[1:].astype(int)
    emissions = co2_percap[co2_percap['country'] == country].values[0][1:].astype(float)
    # plot the emission related the current year, with a label pertaining to the current country
    plt.plot(years, emissions, label=country)
# create the x-axis label, y-axis label, and graph title
plt.xlabel('Year')
plt.ylabel('CO2 Emissions per Capita')
plt.title('CO2 Emissions per Capita Over Time by Country')
# Tighten the legend so the plot isn't so big
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=4)
plt.show()

# Plot GDP per capita over time for each country
plt.figure(figsize=(10, 5))
# for every country in the GDP per capita dataset, extract the current year, and the 
# GDP related with that year and plot it.
for country in gdp_percap['country']:
    # extracting year and GDP
    years = gdp_percap.columns[1:].astype(int)
    gdp = gdp_percap[gdp_percap['country'] == country].values[0][1:].astype(float)
    # plot the GDP related the current year, with a label pertaining to the current country
    plt.plot(years, gdp, label=country)
# create the x-axis label, y-axis label, and graph title
plt.xlabel('Year')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita Over Time by Country')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=4) 
plt.show()

# Plot Daily Income per capita over time for each country
plt.figure(figsize=(10, 5))
# for every country in the Daily Income dataset, extract the current year, and the 
# income related with that year and plot it.
for country in inc_day['country']:
    # extracting year and income
    years = inc_day.columns[1:].astype(int)
    income = inc_day[inc_day['country'] == country].values[0][1:].astype(float)
    # plot the income for each year, with a label for the current country
    plt.plot(years, income, label=country)
# create the x-axis label, y-axis label, and graph title
plt.xlabel('Year')
plt.ylabel('Daily Income per Capita')
plt.title('Daily Income per Capita Over Time by Country')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=4)
plt.show()


Clearly, these graphs are massive and cluttered. While there is an apparent general upward trend in all cases (as we would expect), the GDP per capita graph especially is incredibly difficult to read, and very little useful information can be garnered from it. Consequently, we would like to analyze smaller samples of the data to gain a better understanding of interesting sub-trends that a model could generalize. We take the following four potentially interesting cases based on a knowledge of history and a cursory glance at the above. For each of these, we provide a set of three graphs, one for each dataset, to visualize each of our three metrics.

1) High Emissions Countries

Such countries would be those with the broadest industrialization infrastructure that we expect to have massive carbon emissions. For these, we include the USA, China, Russia, India, and Japan.

2) Various European Countries

This allows us to get a gauge of potential trends on one subregion of the world.

3) "Developed" Countries

A set of countries considered to be "developed," first-world nations. We should expect these to have matured industrial economies.

4) "Developing" Countries

Opposite case 3, this might give us a gauge of potential trends (or potential volatility!) of nations whose industrial economies and transition we might not expect to have completely stabilized.

Now that the limiting criteria have been defined, we will create a function called 'plot_data' that, for a given criteria set such as High Emission Countries, iterates through the datasets and generates 3 subplots showing CO2 emissions per capita, GDP per capita, and daily income per capita respectively.

In [None]:
limiting_criteria_sets = [
    ('High Emission Countries', ['USA', 'China', 'India', 'Russia', 'Japan']),
    ('European Countries', ['UK', 'Germany', 'France', 'Italy', 'Spain']),
    ('Developed Countries', ['USA', 'Germany', 'UK', 'France', 'Japan']),
    ('Developing Countries', ['India', 'China', 'Brazil', 'South Africa', 'Nigeria'])
]

# Function to plot CO2 Emissions per Capita, GDP per Capita, and Income per Day data for a given set of countries
def plot_data(countries, title_prefix, datasets, set_names):
    # Generate a figure
    fig = plt.figure(figsize=(12, 12))
    
    # Loop through datasets
    for i in range(len(datasets)):
        # Grab years and generate a subplot grid
        years = datasets[i].columns[1:].astype(int)
        ax = plt.subplot2grid((2, 2), (i // 2, i % 2))
        
        # Keep the max for margin cutoffs
        m = 0
        for country in countries:
            # Iterate through countries in the limiting sets, grabbing and adding their data
            data = datasets[i][datasets[i]['country'] == country].values[0][1:].astype(float)
            ax.plot(years, data, label=country)
            m = max(m, np.nanmax(data))  # nanmax ignores missing values when tracking the maximum

        # Labeling and cleaning
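        # Only the CO2 series is measured in metric tons, so leave the unit off the other labels
        unit = ' (metric tons)' if 'CO2' in set_names[i] else ''
        ax.set_ylabel(f'{set_names[i]} per Capita{unit}')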
        ax.set_xlabel('Year')
        ax.set_xticks(years[::50]) # 50 year increments for cleanliness

        ax.set_title(f'{title_prefix} - {set_names[i]} per Capita')
        ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
        ax.set_ylim(0, m * 1.1)  # Ensure y-axis includes all data

    plt.tight_layout()
    plt.show()

# Loop over each set of limiting criteria and generate CO2 Emissions per Capita, GDP per Capita, and Income per Day (per capita) plots
for title_prefix, countries in limiting_criteria_sets:
    plot_data(countries, title_prefix, [co2_percap, gdp_percap, inc_day], ['CO2 Emissions', 'GDP', 'Daily Income'])


From these plots, we can see general, upwards exponential-like trends; however, as reflected in the original large graphs and still visible from these, GDP per capita is much more volatile than the other two measures. As a potentially interesting question, we now ask if there is any relationship between carbon dioxide emissions per capita and the GDP per capita in predicting daily income; that is, do changes in what we might expect to be an indicator of a mature, industrial economy, as well as changes in the GDP, predict the direction and magnitude of the change in daily income per capita?

## Part 4 - Model Implementation

We have now conducted our exploratory data analysis, looked at general trends in the data, and from this analysis arrived at a question yet to be answered: Do changes in indicators of a mature, industrial economy, alongside changes in GDP, predict the direction and magnitude of change in daily income per capita? In answering this question we want to know if it's possible to predict future daily income per capita changes. If so, we want to use these results to predict future daily incomes per capita; if not, that answer is just as satisfactory.

To begin, we will use all data prior to 2015 as training data. When we test the models, we will treat the period 2020-2022 separately, since COVID-19 significantly affected all three factors being considered (GDP, daily income, and CO2 emissions) during this time. We will be using gradient descent to create our models, and will develop a univariate and a multivariate gradient descent model.

Gradient descent is an optimization algorithm that is common in machine learning. It is used to minimize a model's loss function. A loss function is a metric that represents the magnitude of a model's prediction errors; it tells us that we were "this far off" from the correct values. The goal is to minimize the loss, since a model with a loss of 0 predicts every value perfectly.

Note: In practice, a loss of exactly 0 on the training data typically indicates that the model is overfitting, and may lead to a much larger loss when the model is tested on new data.
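
To make the idea of a loss concrete, the sketch below evaluates half the sum of squared errors, the same quantity the `grad_descent` function later in this part records, for two candidate parameter vectors. The data are made up, and the helper name `squared_error_loss` is illustrative.

```python
import numpy as np

# Made-up data generated by y = 1 + 2x
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])   # first column is the intercept term
y = np.array([1.0, 3.0, 5.0])

def squared_error_loss(X, y, theta):
    """Half the sum of squared differences between predictions and targets."""
    residuals = X.dot(theta) - y
    return 0.5 * np.sum(residuals ** 2)

print(squared_error_loss(X, y, np.array([0.0, 0.0])))  # poor guess: 17.5
print(squared_error_loss(X, y, np.array([1.0, 2.0])))  # exact fit: 0.0
```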

Univariate gradient descent refers to gradient descent with a single independent variable and a single dependent variable. Gradient descent repeatedly adjusts the weight that determines how strongly the independent variable affects the predicted dependent variable. In our case, CO2 emissions per capita is the independent variable, and its weight is optimized to predict daily income per capita.
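
A minimal sketch of the univariate case on made-up data, assuming a single weight `w` with no intercept for brevity. The learning rate and iteration count are arbitrary choices that happen to converge for this toy problem.

```python
import numpy as np

# Made-up data following y = 3x exactly
x = np.array([0.0, 0.5, 1.0])
y = 3.0 * x

w = 0.0        # single weight to learn (no intercept, for brevity)
alpha = 0.1    # learning rate (step size)
for _ in range(500):
    gradient = np.sum((w * x - y) * x)  # derivative of 0.5 * sum((w*x - y)**2) with respect to w
    w -= alpha * gradient               # step against the gradient

print(round(w, 4))
```

Each step moves `w` a little closer to 3, the value that makes every prediction match its target.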

Multivariate gradient descent is similar to univariate gradient descent, but has multiple independent variables and one dependent variable. Each independent variable receives its own weight, which sets its degree of impact on the dependent variable, and all of the weights are optimized together to minimize the loss function. In our case we will be using CO2 emissions per capita and GDP per capita as independent variables and daily income per capita as the dependent variable.
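
A minimal sketch of multivariate gradient descent on made-up data with two features and an intercept, compared against NumPy's closed-form least-squares solver as a sanity check. The data, learning rate, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up features in [0, 1]; targets follow y = 2 + 1*x1 + 4*x2 exactly
features = rng.random((50, 2))
X = np.hstack((np.ones((50, 1)), features))   # prepend intercept column
true_theta = np.array([2.0, 1.0, 4.0])
y = X.dot(true_theta)

theta = np.zeros(3)
alpha = 0.01
for _ in range(5000):
    theta -= alpha * X.T.dot(X.dot(theta) - y)   # full-batch gradient step

# Closed-form least-squares solution for comparison
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta, 3), np.round(theta_lstsq, 3))
```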

Note: For the sake of this tutorial, we have left out much of the math behind gradient descent (how it works, how it's calculated, etc.) in favor of a general concept that will be explained throughout this tutorial. Our goal is to show how to create a gradient descent model and use it, rather than to cover the math behind it.

For those interested in learning more about gradient descent and loss function please see:

- Gradient Descent: [here](https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c)
- Univariate Gradient Descent: [here](https://medium.com/swlh/the-math-of-machine-learning-i-gradient-descent-with-univariate-linear-regression-2afbfb556131)
- Multivariate Gradient Descent: [here](https://medium.com/@IwriteDSblog/gradient-descent-for-multivariable-regression-in-python-d430eb5d2cd8)
- Loss functions: [here](https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent)


First, we create our training sets of data:

In [None]:
# Using data prior to 2015 as the training dataset (keep only year columns before 2015)
co2_percap_train = co2_percap[['country'] + [col for col in co2_percap.columns[1:] if int(col) < 2015]]
gdp_percap_train = gdp_percap[['country'] + [col for col in gdp_percap.columns[1:] if int(col) < 2015]]
inc_day_train = inc_day[['country'] + [col for col in inc_day.columns[1:] if int(col) < 2015]]

# Get all country averages
co2_percap_mean = co2_percap_train.set_index('country').mean(axis=1).reset_index()
gdp_percap_mean = gdp_percap_train.set_index('country').mean(axis=1).reset_index()
inc_day_mean = inc_day_train.set_index('country').mean(axis=1).reset_index()

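# Merge datasets on country
merged_data = co2_percap_mean.merge(gdp_percap_mean, on='country').merge(inc_day_mean, on='country')
merged_data.columns = ['country', 'CO2_per_capita', 'GDP_per_capita', 'Income_per_capita']
# Drop countries with a missing average so NaN values cannot propagate into training
merged_data = merged_data.dropna()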

Then we set up the dependent and independent variables for each of our models, and define the gradient descent function used to train them.

In [None]:
# Prepare two predictor sets: a univariate and a multivariate set
X_univariate = merged_data[['CO2_per_capita']].values
X_multivariate = merged_data[['CO2_per_capita', 'GDP_per_capita']].values
y = merged_data['Income_per_capita'].values


# Add intercept term for each
X_univariate = np.hstack((np.ones((X_univariate.shape[0], 1)), X_univariate))
X_multivariate = np.hstack((np.ones((X_multivariate.shape[0], 1)), X_multivariate))

# Scale each feature to the [0, 1] range (the scaler is refit for each feature set)
scaler = MinMaxScaler()
X_univariate[:, 1:] = scaler.fit_transform(X_univariate[:, 1:])
X_multivariate[:, 1:] = scaler.fit_transform(X_multivariate[:, 1:])

# Gradient descent
def grad_descent(X, y, T, alpha):
    m, n = X.shape  # m = #examples, n = #features
    theta = np.zeros(n)
    f = np.zeros(T)  # loss
    for i in range(T):
        # loss for current parameter vector theta
        f[i] = 0.5 * np.linalg.norm(X.dot(theta) - y)**2
        # compute steepest ascent at f(theta)
        g = np.transpose(X).dot(X.dot(theta) - y)
        # step down the gradient
        theta = theta - alpha * g
    return theta, f # return loss as well

Now that we've created our independent and dependent variable sets and defined gradient descent, we can move on to training and visualizing our models.

## Part 5 - Visualizations

First, we train each of the models and display the loss over epochs for each.

In [None]:
# Training parameters
T = 1000
alpha = 0.001 # learning rate; the same small value is used for both models

# Train univariate model
theta_uni, loss_uni = grad_descent(X_univariate, y, T, alpha)

# Train multivariate model
theta_multi, loss_multi = grad_descent(X_multivariate, y, T, alpha)

# Plot loss over epochs
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.plot(range(T), loss_uni, label='Univariate Model')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Univariate Model Loss Over Epochs')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(range(T), loss_multi, label='Multivariate Model')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Multivariate Model Loss Over Epochs')
plt.legend()

plt.tight_layout()
plt.show()

# Calculate MSE for univariate and multivariate models
mse_training_univariate = np.mean((X_univariate.dot(theta_uni) - y)**2)
mse_training_multivariate = np.mean((X_multivariate.dot(theta_multi) - y)**2)

print(f'MSE for Univariate Model: {mse_training_univariate}')
print(f'MSE for Multivariate Model: {mse_training_multivariate}')

Next, we want to test our models on two different test datasets: one uses data from 2015-2019 (inclusive), and the other uses data from 2020-2022 (inclusive). Because COVID-19 may have changed the underlying relationships, it is feasible that the models' performance differs significantly between these two datasets.

In [None]:
# Creates a merged dataset of per-country averages for a given time range (inclusive)
def merge_datasets(time_range):
    # Columns to keep: the country name plus every year in the range
    cols = ['country'] + [str(year) for year in range(time_range[0], time_range[1] + 1)]
    co2_percap_range = co2_percap.filter(items=cols)
    gdp_percap_range = gdp_percap.filter(items=cols)
    inc_day_range = inc_day.filter(items=cols)

    # Average each country's values over the selected years
    co2_percap_mean = co2_percap_range.set_index('country').mean(axis=1).reset_index()
    gdp_percap_mean = gdp_percap_range.set_index('country').mean(axis=1).reset_index()
    inc_day_mean = inc_day_range.set_index('country').mean(axis=1).reset_index()

    merged_data = co2_percap_mean.merge(gdp_percap_mean, on='country').merge(inc_day_mean, on='country')
    merged_data.columns = ['country', 'CO2_per_capita', 'GDP_per_capita', 'Income_per_capita']

    return merged_data

# Obtain desired data for the two testing time periods
merged_data_2015_2019 = merge_datasets([2015, 2019]).dropna()
merged_data_2020_2022 = merge_datasets([2020, 2022]).dropna()

# Build a scaled design matrix (intercept column + features) for a test set
def build_test_matrix(test_data, feature_cols):
    X = test_data[feature_cols].values
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    # The scaler is refit on each test set's features
    X[:, 1:] = MinMaxScaler().fit_transform(X[:, 1:])
    return X

# Test both trained models on each period and print the MSE for each
for label, test_data in [('2015-2019', merged_data_2015_2019), ('2020-2022', merged_data_2020_2022)]:
    y_test = test_data['Income_per_capita'].values
    y_univariate_pred = build_test_matrix(test_data, ['CO2_per_capita']).dot(theta_uni)
    y_multivariate_pred = build_test_matrix(test_data, ['CO2_per_capita', 'GDP_per_capita']).dot(theta_multi)

    # Calculate MSE for both models
    mse_univariate = np.mean((y_univariate_pred - y_test)**2)
    mse_multivariate = np.mean((y_multivariate_pred - y_test)**2)

    # Print MSE for both models
    print(f'MSE for Univariate Model ({label}): {mse_univariate}')
    print(f'MSE for Multivariate Model ({label}): {mse_multivariate}')
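
The cells above fit a fresh `MinMaxScaler` on each test set, so the same raw CO2 or GDP value can map to a different scaled value in training and in testing. A common alternative is to fit the scaler once on the training features and reuse it with `transform` on the test features. The sketch below shows the difference on made-up numbers; it is not the procedure used to produce the results reported in this tutorial.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
train_features = np.array([[0.0], [5.0], [10.0]])
test_features = np.array([[5.0], [20.0]])

# Fit on the training data only, then apply the same mapping to the test data
scaler = MinMaxScaler().fit(train_features)
test_scaled_shared = scaler.transform(test_features)

# Refitting on the test data gives the same raw value a different scaled value
test_scaled_refit = MinMaxScaler().fit_transform(test_features)

print(test_scaled_shared.ravel())  # raw 5.0 maps to 0.5, as it would in training
print(test_scaled_refit.ravel())   # raw 5.0 maps to 0.0 after refitting
```

With a shared scaler, test values outside the training range fall outside [0, 1], which is expected and keeps the learned weights meaningful.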

## Part 6 - Conclusions

We can see that the MSEs of the univariate and multivariate models differ on both test datasets, and in the same direction.  
On the pre-COVID years (2015-2019), the MSE of the univariate model is lower than that of the multivariate model.  
Likewise, on the COVID-era years (2020-2022), the MSE of the univariate model is lower than that of the multivariate model.  
This means that, in both instances, predicting daily income from CO2 emissions alone, without GDP per capita, is more accurate.  
This is interesting, because it suggests that CO2 emissions per capita are the more useful predictor of daily income in these models, and that adding GDP per capita does not help.  
It could also mean that GDP per capita is a poor predictor of daily income only when CO2 emissions are also factored in.  
One next step would be to create another model that uses GDP per capita as its single predictor variable and compare it against the univariate model we already have.  
Another would be to re-segment our training and test datasets, training and testing our models on different sections of the data rather than only training on old data and testing on the most recent data.
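
As one way to re-segment the data, scikit-learn's `train_test_split` (already imported in Part 2) can hold out a random share of countries instead of a block of years. The sketch below runs on a made-up table with the same column names as `merged_data`; the 80/20 split and the `random_state` value are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Made-up stand-in for merged_data: 20 fictional countries
toy = pd.DataFrame({
    'country': [f'Country {i}' for i in range(20)],
    'CO2_per_capita': rng.random(20) * 10,
    'GDP_per_capita': rng.random(20) * 50000,
    'Income_per_capita': rng.random(20) * 100,
})

X = toy[['CO2_per_capita', 'GDP_per_capita']].values
y = toy['Income_per_capita'].values

# Hold out 20% of the countries at random for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (16, 2) (4, 2)
```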


### Congratulations!
You've fully completed this data science tutorial, and are now prepared to use the data science pipeline to analyze some of your own datasets.  
In this tutorial, you accomplished the following:

- Grabbed raw data from an online source and dataset repository
- Cleaned the data to make it ready for data processing and analysis, and trimmed the data down to a manageable, useful size
- Explored the data through basic graphs and grouped the data into more useful subsets
- Trained multiple machine learning models to predict daily income in various countries around the world based on their CO2 emissions per capita and GDP per capita
- Compared the models to each other by average error, and decided which model was better for our purposes

We hope that you found this tutorial useful, and that you now feel ready to try out these steps on your own.  
Good luck!