# An Analysis of

### Authored by: Gavin Crisologo, Josue Melendez, Caleb Solomon, & Matthew Yu



## Table of Contents
### [Introduction](#Introduction)
### [Part 1- Data Collection](#Part-1--Data-Collection)
### [Part 2- Data Cleaning](#Part-2--Data-Cleaning)
### [Part 3- Exploratory Data Analysis](#Part-3--Exploratory-Data-Analysis)
### [Part 4- Model Implementation](#Part-4--Model-Implementation)
### [Part 5- Visualizations](#Part-5--Visualizations)
### [Part 6- Conclusions](#Part-6--Conclusions)

## Introduction

The purpose of this analysis and project is to walk a prospective data scientist through the data science pipeline via a worked example.  
For this project, we will use Gapminder's information on GDP per capita for various countries around the world, and train a model to extrapolate future GDP per capita based on the following factors:
1) Previous GDP per capita
2) CO2 emissions per capita
3) Daily income


## Part 1- Data Collection

The GDP per capita dataset (gdp_pcap.csv) comes from https://www.gapminder.org/data/ and is reached through the following menu selections:
1) Select an indicator
2) Economy
3) Incomes & growth
4) GDP per capita

Additional information about the dataset can be found at:  
http://gapm.io/dgdpcap_cppp

The CO2 emissions per capita dataset (co2_pcap_cons.csv) comes from https://www.gapminder.org/data/ and is reached through the following menu selections:
1) Select an indicator
2) CO2 Emissions per capita

Additional information about the dataset can be found at:  
http://gapm.io/dco2_consumption_historic

The daily income dataset (mincpcap_cppp.csv) comes from https://www.gapminder.org/data/ and is reached through the following menu selections:
1) Select an indicator
2) Daily income

Additional information about the dataset can be found at:  
http://gapm.io/dmincpcap_cppp

## Part 2- Data Cleaning

In [4]:
# Import necessary libraries
import pandas as pd

In [5]:
# Load data from CSVs to pandas DataFrames
co2_percap = pd.read_csv('co2_pcap_cons.csv')
gdp_percap = pd.read_csv('gdp_pcap.csv')
inc_day = pd.read_csv('mincpcap_cppp.csv')

In [6]:
# Display GDP per capita dataset
print("\nGDP Per Capita Data:")
print(gdp_percap.head())


GDP Per Capita Data:
       country  1800  1801  1802  1803  1804  1805  1806  1807  1808  ...  \
0  Afghanistan   599   599   599   599   599   599   599   599   599  ...   
1       Angola   465   466   469   471   472   475   477   479   481  ...   
2      Albania   585   587   588   590   592   593   595   597   598  ...   
3      Andorra  1710  1710  1710  1720  1720  1720  1730  1730  1730  ...   
4          UAE  1420  1430  1430  1440  1450  1450  1460  1460  1470  ...   

    2091   2092   2093   2094   2095   2096   2097   2098   2099   2100  
0   4800   4910   5030   5150   5270   5390   5520   5650   5780   5920  
1  24.8k  25.3k  25.9k  26.4k  26.9k  27.4k    28k  28.5k  29.1k  29.6k  
2    54k  54.6k  55.2k  55.8k  56.4k  56.9k  57.5k  58.1k  58.7k  59.2k  
3  79.3k  79.5k  79.8k  80.1k  80.4k  80.7k    81k  81.2k  81.5k  81.8k  
4  92.5k  92.6k  92.6k  92.7k  92.8k  92.9k  92.9k    93k  93.1k  93.1k  

[5 rows x 302 columns]
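
The output above shows that some GDP values are stored as abbreviated strings such as `24.8k` rather than as numbers, so those columns would need converting before any modeling. The following is a minimal sketch of one way to do that on a toy frame. The helper name `parse_abbreviated` is hypothetical, and the `M` and `B` suffixes are an assumption added as a precaution; only `k` is visible in the output shown here.

```python
import pandas as pd

# Assumed suffix multipliers; only "k" is confirmed by the printed output.
SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9}

def parse_abbreviated(value):
    """Convert values such as '24.8k' to floats; pass plain numbers through."""
    if isinstance(value, str):
        text = value.strip()
        if text and text[-1] in SUFFIXES:
            return float(text[:-1]) * SUFFIXES[text[-1]]
        return float(text)
    return float(value)

# Toy frame mimicking the mixed formatting seen in the GDP output
sample = pd.DataFrame({
    "country": ["Afghanistan", "Angola"],
    "2091": [4800, "24.8k"],
    "2092": [4910, "25.3k"],
})

# Convert every year column, leaving the country column untouched
year_cols = [col for col in sample.columns if col.isdigit()]
for col in year_cols:
    sample[col] = sample[col].map(parse_abbreviated)

print(sample.dtypes)
```

Running the same loop over `gdp_percap` would be the natural application, provided no other suffixes or symbols appear in the real file.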


In [7]:
# Display CO2 per capita dataset
print("CO2 Per Capita Consumption Data:")
print(co2_percap.head())

CO2 Per Capita Consumption Data:
       country   1800   1801   1802   1803   1804   1805   1806   1807   1808  \
0  Afghanistan  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001   
1       Angola  0.009  0.009  0.009  0.009  0.009  0.009  0.010  0.010  0.010   
2      Albania  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001   
3      Andorra  0.333  0.335  0.337  0.340  0.342  0.345  0.347  0.350  0.352   
4          UAE  0.063  0.063  0.064  0.064  0.064  0.064  0.065  0.065  0.065   

   ...  2013    2014    2015    2016    2017    2018    2019    2020    2021  \
0  ...  0.28   0.253   0.262   0.245   0.247   0.254   0.261   0.261   0.279   
1  ...  1.28   1.640   1.220   1.180   1.150   1.120   1.150   1.120   1.200   
2  ...  2.27   2.250   2.040   2.010   2.130   2.080   2.050   2.000   2.120   
3  ...   5.9   5.830   5.970   6.070   6.270   6.120   6.060   5.630   5.970   
4  ...    27  26.800  27.000  26.700  23.900  23.500  21.200  19.700  20.700   


In [8]:
# Display Daily income dataset
print("\nIncome Per Capita Data:")
print(inc_day.head())


Income Per Capita Data:
       country   1800   1801   1802   1803   1804   1805   1806   1807   1808  \
0  Afghanistan  1.330  1.330  1.330  1.330  1.330  1.330  1.330  1.330  1.330   
1       Angola  0.779  0.781  0.785  0.789  0.791  0.795  0.799  0.802  0.806   
2      Albania  0.919  0.921  0.924  0.927  0.929  0.932  0.935  0.937  0.940   
3      Andorra  1.880  1.880  1.880  1.890  1.890  1.890  1.900  1.900  1.900   
4          UAE  1.650  1.660  1.670  1.670  1.680  1.680  1.690  1.700  1.700   

   ...   2091   2092   2093   2094   2095   2096   2097   2098   2099   2100  
0  ...   10.7   10.9   11.2   11.4   11.7   12.0   12.3   12.6   12.8   13.2  
1  ...   19.8   20.2   20.6   21.0   21.4   21.9   22.3   22.7   23.2   23.6  
2  ...   56.7   57.4   58.0   58.6   59.2   59.8   60.5   61.1   61.7   62.3  
3  ...   87.1   87.4   87.8   88.1   88.4   88.7   89.0   89.3   89.6   89.9  
4  ...  102.0  102.0  102.0  102.0  102.0  102.0  102.0  102.0  102.0  103.0  

[5 rows x 302 columns]

In [9]:
# Identify common countries across all three datasets
common_countries = set(co2_percap['country']) & set(gdp_percap['country']) & set(inc_day['country'])

In [10]:
# Filter DataFrames and keep only the common countries
co2_percap = co2_percap[co2_percap['country'].isin(common_countries)]
gdp_percap = gdp_percap[gdp_percap['country'].isin(common_countries)]
inc_day = inc_day[inc_day['country'].isin(common_countries)]
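
Before discarding rows, it can help to see which countries each dataset loses to the intersection. The following is a minimal sketch on toy data; the frame contents and the helper name `dropped_countries` are hypothetical and are not taken from the Gapminder files.

```python
import pandas as pd

def dropped_countries(frames):
    """Map each dataset name to the countries absent from the shared intersection."""
    country_sets = {name: set(df["country"]) for name, df in frames.items()}
    common = set.intersection(*country_sets.values())
    return {name: sorted(countries - common) for name, countries in country_sets.items()}

# Toy stand-ins for co2_percap, gdp_percap and inc_day
frames = {
    "co2": pd.DataFrame({"country": ["Albania", "Angola"]}),
    "gdp": pd.DataFrame({"country": ["Albania", "Angola", "Andorra"]}),
    "income": pd.DataFrame({"country": ["Albania", "Andorra"]}),
}

report = dropped_countries(frames)
for name, missing in report.items():
    print(f"{name}: {len(missing)} dropped -> {missing}")
```

A long list of dropped countries would suggest checking for spelling differences between the files before filtering.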

In [11]:
# Drop year columns after 2024 so that Gapminder's projected values do not leak into our model
columns_to_keep_co2 = ['country'] + [col for col in co2_percap.columns[1:] if col.isdigit() and int(col) <= 2024]
columns_to_keep_gdp = ['country'] + [col for col in gdp_percap.columns[1:] if col.isdigit() and int(col) <= 2024]
columns_to_keep_inc = ['country'] + [col for col in inc_day.columns[1:] if col.isdigit() and int(col) <= 2024]

co2_percap = co2_percap[columns_to_keep_co2]
gdp_percap = gdp_percap[columns_to_keep_gdp]
inc_day = inc_day[columns_to_keep_inc]
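
The same year filter is applied three times above, so it could be factored into one function. The following is a sketch under the assumption that every year column is named with digits only; the helper name `keep_years_through` and its `id_col` parameter are hypothetical.

```python
import pandas as pd

def keep_years_through(df, last_year, id_col="country"):
    """Return a copy of df with id_col plus the year columns up to last_year."""
    year_cols = [
        col for col in df.columns
        if col != id_col and str(col).isdigit() and int(col) <= last_year
    ]
    # .copy() gives an independent frame, so later assignments do not
    # trigger pandas chained-assignment warnings
    return df[[id_col] + year_cols].copy()

# Toy frame with one column beyond the cutoff
toy = pd.DataFrame({"country": ["Albania"], "2023": [1.0], "2024": [2.0], "2025": [3.0]})
trimmed = keep_years_through(toy, 2024)
print(list(trimmed.columns))
```

With this helper, each dataset would need a single call such as `keep_years_through(gdp_percap, 2024)`.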

In [None]:
# Check for null values in each dataframe
print("\nNull values in CO2 Per Capita Data:")
print(co2_percap.isnull().sum())
print("\nNull values in GDP Per Capita Data:")
print(gdp_percap.isnull().sum())
print("\nNull values in Daily Income Data:")
print(inc_day.isnull().sum())
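
Printing `isnull().sum()` for a frame with hundreds of year columns produces a long, truncated listing. A more compact check is sketched below on toy data: it reports the total number of missing cells and the countries they belong to. The helper name `null_summary` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def null_summary(df, id_col="country"):
    """Return the total count of missing cells and the countries that have any."""
    missing_per_row = df.drop(columns=[id_col]).isnull().sum(axis=1)
    affected = df.loc[missing_per_row > 0, id_col].tolist()
    return int(missing_per_row.sum()), affected

# Toy frame in which one country has no values at all
toy = pd.DataFrame({
    "country": ["Albania", "Andorra"],
    "2020": [2.0, np.nan],
    "2021": [2.12, np.nan],
})

total, affected = null_summary(toy)
print(f"{total} missing values; affected countries: {affected}")
```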

In [12]:
# Display the first few rows of each filtered dataframe
print("\nFiltered CO2 Per Capita Consumption Data:")
print(co2_percap.head())
print("\nFiltered GDP Per Capita Data:")
print(gdp_percap.head())
print("\nFiltered Daily Income Data:")
print(inc_day.head())


Filtered CO2 Per Capita Consumption Data:
       country   1800   1801   1802   1803   1804   1805   1806   1807   1808  \
0  Afghanistan  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001   
1       Angola  0.009  0.009  0.009  0.009  0.009  0.009  0.010  0.010  0.010   
2      Albania  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001  0.001   
3      Andorra  0.333  0.335  0.337  0.340  0.342  0.345  0.347  0.350  0.352   
4          UAE  0.063  0.063  0.064  0.064  0.064  0.064  0.065  0.065  0.065   

   ...  2013    2014    2015    2016    2017    2018    2019    2020    2021  \
0  ...  0.28   0.253   0.262   0.245   0.247   0.254   0.261   0.261   0.279   
1  ...  1.28   1.640   1.220   1.180   1.150   1.120   1.150   1.120   1.200   
2  ...  2.27   2.250   2.040   2.010   2.130   2.080   2.050   2.000   2.120   
3  ...   5.9   5.830   5.970   6.070   6.270   6.120   6.060   5.630   5.970   
4  ...    27  26.800  27.000  26.700  23.900  23.500  21.200  19.700  20.700   
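
The three frames are still in wide format, with one column per year. One possible next step, sketched here as an assumption rather than as the project's actual method, is to reshape each to long format and merge them into one row per country and year. The helper name `to_long`, the output column names, and all toy values are invented for illustration.

```python
import pandas as pd

def to_long(df, value_name):
    """Reshape a wide country-by-year table to (country, year, value) rows."""
    long_df = df.melt(id_vars="country", var_name="year", value_name=value_name)
    long_df["year"] = long_df["year"].astype(int)
    return long_df

# Toy stand-ins for the cleaned gdp_percap, co2_percap and inc_day frames
gdp = pd.DataFrame({"country": ["Albania", "Angola"], "2020": [13000, 6000], "2021": [14000, 6100]})
co2 = pd.DataFrame({"country": ["Albania", "Angola"], "2020": [2.0, 1.12], "2021": [2.12, 1.2]})
income = pd.DataFrame({"country": ["Albania", "Angola"], "2020": [12.1, 5.0], "2021": [12.5, 5.1]})

# An inner merge keeps only country-year pairs present in all three tables
combined = (
    to_long(gdp, "gdp_pcap")
    .merge(to_long(co2, "co2_pcap"), on=["country", "year"], how="inner")
    .merge(to_long(income, "daily_income"), on=["country", "year"], how="inner")
    .sort_values(["country", "year"])
    .reset_index(drop=True)
)
print(combined)
```

Because the merge is an inner join, years missing from any one dataset are dropped from the combined table.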

## Part 3- Exploratory Data Analysis

## Part 4- Model Implementation

## Part 5- Visualizations

## Part 6- Conclusions