# Capstone Project Exploratory Data Analysis (EDA)

In this step of the capstone project, I explored the data to understand how it is distributed. I started by importing the data, which came with several difficulties and challenges. Once those were resolved, I identified which columns were needed to explore the data and dropped the ones that were not. Next, I compared some of the oldest recorded values with the newest ones, looked at how the values correlated overall, and finally used pandas profiling to generate a report of the data. In this EDA, I use the GDP data because without it I do not believe I could accurately visualize the overall impact of the price of healthcare and how governments are choosing to alleviate this issue.

## Importing the Data

### Difficulties and Changes

When starting out with the data, I realized that the JSON format the data was originally saved in would not be easy to work with. The data was deeply nested, and once I had worked that out, I found that it did not include the data points I needed. I therefore decided to pivot and pull the data as CSVs instead. From there, it was easier to import the data into pandas, and I believe this will help further down the line with compression and when importing the data into a warehouse or lake.

### Data Importation Methods

With the files I currently have from my data source, [the World Bank](https://data.worldbank.org/), I had to import the data in two different ways:

In the first way, I imported pandas and pandas profiling and read the whole CSV into a pandas data frame. The first four rows were skipped because they contain information on when the file was last updated and other artifacts that are not part of the data.
```
import pandas as pd
from pandas_profiling import ProfileReport

# Skip the four metadata lines that precede the header row.
df = pd.read_csv('data.csv', skiprows=4)
df
```
In the second way, all I needed to do was import pandas and pandas profiling and read the CSV. Using the `skipfooter` parameter, I am able to remove unnecessary rows from the bottom of the file. The C parser does not support `skipfooter`, hence `engine='python'`.

```
import pandas as pd
from pandas_profiling import ProfileReport

# Drop the 49 trailing lines that are not part of the data.
df = pd.read_csv('data.csv', engine='python', skipfooter=49)
df
```
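
As a minimal sketch of how the two parameters behave, the snippet below parses a tiny in-memory CSV. The file contents, the `raw` string, and the name `df_demo` are invented for illustration and are not the World Bank files; both parameters are combined in one call only to keep the example short.

```python
import io

import pandas as pd

# Hypothetical miniature of an export: two metadata lines, a header row,
# two data rows, and two footer lines (contents invented for illustration).
raw = (
    "Data Source,Example\n"
    "Last Updated,2021-01-01\n"
    "Country Name,Country Code,1960\n"
    "Afghanistan,AFG,537777811.1\n"
    "Zambia,ZMB,713000000.0\n"
    "Footer note one\n"
    "Footer note two\n"
)

# skiprows drops lines from the top; skipfooter drops lines from the bottom
# and needs the Python parsing engine.
df_demo = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=2, engine="python")
print(df_demo)
```

Only the header row and the two data rows should survive, which is a quick way to check the counts before applying them to a real file.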

Below, I used the second method to import GDP figures from 1960 through 2020.

In [1]:
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('data.csv', engine='python', skipfooter = 49)
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,GDP (current US$),NY.GDP.MKTP.CD,Afghanistan,AFG,5.377778e+08,5.488889e+08,5.466667e+08,7.511112e+08,8.000000e+08,1.006667e+09,...,1.780511e+10,1.990732e+10,2.014640e+10,2.049713e+10,1.913421e+10,1.811656e+10,1.875347e+10,1.805323e+10,1.879945e+10,2.011614e+10
1,GDP (current US$),NY.GDP.MKTP.CD,Albania,ALB,,,,,,,...,1.289076e+10,1.231983e+10,1.277622e+10,1.322815e+10,1.138685e+10,1.186120e+10,1.301969e+10,1.515643e+10,1.540024e+10,1.488763e+10
2,GDP (current US$),NY.GDP.MKTP.CD,Algeria,DZA,2.723593e+09,2.434727e+09,2.001428e+09,2.702960e+09,2.909293e+09,3.136259e+09,...,2.000130e+11,2.090590e+11,2.097550e+11,2.138100e+11,1.659790e+11,1.600340e+11,1.700970e+11,1.749110e+11,1.717670e+11,1.450090e+11
3,GDP (current US$),NY.GDP.MKTP.CD,American Samoa,ASM,,,,,,,...,5.700000e+08,6.400000e+08,6.380000e+08,6.430000e+08,6.730000e+08,6.710000e+08,6.120000e+08,6.390000e+08,6.480000e+08,7.090000e+08
4,GDP (current US$),NY.GDP.MKTP.CD,Andorra,AND,,,,,,,...,3.629204e+09,3.188809e+09,3.193704e+09,3.271808e+09,2.789870e+09,2.896679e+09,3.000181e+09,3.218316e+09,3.155065e+09,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212,GDP (current US$),NY.GDP.MKTP.CD,Virgin Islands (U.S.),VIR,,,,,,,...,4.223000e+09,4.089000e+09,3.738000e+09,3.565000e+09,3.663000e+09,3.798000e+09,3.794000e+09,3.900000e+09,4.068000e+09,
213,GDP (current US$),NY.GDP.MKTP.CD,West Bank and Gaza,PSE,,,,,,,...,1.118610e+10,1.220840e+10,1.351550e+10,1.398970e+10,1.397240e+10,1.540540e+10,1.612800e+10,1.627660e+10,1.713350e+10,1.556130e+10
214,GDP (current US$),NY.GDP.MKTP.CD,"Yemen, Rep.",YEM,,,,,,,...,3.272642e+10,3.540134e+10,4.041524e+10,4.322859e+10,4.244510e+10,3.131737e+10,2.684013e+10,2.160614e+10,,
215,GDP (current US$),NY.GDP.MKTP.CD,Zambia,ZMB,7.130000e+08,6.962857e+08,6.931429e+08,7.187143e+08,8.394286e+08,1.082857e+09,...,2.345952e+10,2.550306e+10,2.803724e+10,2.714102e+10,2.125122e+10,2.095841e+10,2.587360e+10,2.631159e+10,2.330867e+10,1.811063e+10


## Data Cleaning

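Afterwards, I realized that some columns are not needed in the EDA, although they may be useful later on. I removed them from my results using `df.drop()`. The column names to drop depend on which file was imported: the example below appears to use the names from the first import method, while the executed cell further down uses the names from the second.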

```
df = df.drop(columns=['Indicator Name', 'Indicator Code', 'Unnamed: 65'])
df
```

After that, the only other issue was that the numbers were displayed in scientific notation, because they range from millions to trillions of dollars. To fix that, I changed the display option for floats: `pd.options.display.float_format = '{:.2f}'.format`
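
The following sketch shows a variation on that setting, assuming a temporary change is preferred over a global one. It uses `pd.option_context` and adds thousands separators; the series `gdp` holds two illustrative values and is not part of the notebook.

```python
import pandas as pd

# Two illustrative GDP values (hypothetical series, not the full dataset).
gdp = pd.Series([537777811.10, 2.011614e10], index=["1960", "2020"])

# Apply the float format only inside this block; the global option is untouched.
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = repr(gdp)

print(formatted)
```

Scoping the option this way avoids having to call `pd.reset_option` later, at the cost of wrapping each display in a `with` block.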

In [2]:
df = df.drop(columns=['Series Name', 'Series Code'])
pd.options.display.float_format = '{:.2f}'.format
df

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Afghanistan,AFG,537777811.10,548888895.60,546666677.80,751111191.10,800000044.40,1006666638.00,1399999967.00,1673333418.00,...,17805113119.00,19907317066.00,20146404996.00,20497126770.00,19134211764.00,18116562465.00,18753469630.00,18053228579.00,18799450743.00,20116137326.00
1,Albania,ALB,,,,,,,,,...,12890764531.00,12319830437.00,12776220507.00,13228147516.00,11386850130.00,11861199831.00,13019689337.00,15156432310.00,15400242875.00,14887629268.00
2,Algeria,DZA,2723593385.00,2434727330.00,2001428328.00,2702960118.00,2909292864.00,3136258897.00,3039834559.00,3370843066.00,...,200013000000.00,209059000000.00,209755000000.00,213810000000.00,165979000000.00,160034000000.00,170097000000.00,174911000000.00,171767000000.00,145009000000.00
3,American Samoa,ASM,,,,,,,,,...,570000000.00,640000000.00,638000000.00,643000000.00,673000000.00,671000000.00,612000000.00,639000000.00,648000000.00,709000000.00
4,Andorra,AND,,,,,,,,,...,3629203786.00,3188808943.00,3193704343.00,3271808157.00,2789870188.00,2896679212.00,3000180750.00,3218316013.00,3155065488.00,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212,Virgin Islands (U.S.),VIR,,,,,,,,,...,4223000000.00,4089000000.00,3738000000.00,3565000000.00,3663000000.00,3798000000.00,3794000000.00,3900000000.00,4068000000.00,
213,West Bank and Gaza,PSE,,,,,,,,,...,11186100000.00,12208400000.00,13515500000.00,13989700000.00,13972400000.00,15405400000.00,16128000000.00,16276600000.00,17133500000.00,15561300000.00
214,"Yemen, Rep.",YEM,,,,,,,,,...,32726417212.00,35401341663.00,40415235702.00,43228585321.00,42445102387.00,31317365269.00,26840128755.00,21606140907.00,,
215,Zambia,ZMB,713000000.00,696285714.30,693142857.10,718714285.70,839428571.40,1082857143.00,1264285714.00,1368000000.00,...,23459515276.00,25503060420.00,28037239463.00,27141023558.00,21251216799.00,20958412538.00,25873601261.00,26311590297.00,23308667781.00,18110631358.00


## Data at a Glance

Overall, I am left with 217 rows, one per country, and 63 columns: the country name, the country code, and one column for each year from 1960 to 2020. I then used `df.describe()` to get an overview of the data and compared the mean values from 1960 and 2020. In 60 years, the mean GDP across countries increased roughly 37-fold, from about $11.6 billion to about $431 billion. Note that the two means cover different sets of countries (98 values in 1960 versus 194 in 2020), so the comparison is only indicative. Some rows are missing GDP data because GDP was not recorded for that country in those years.

In [3]:
df1960mean = df['1960'].mean()
df2020mean = df['2020'].mean()
print('Mean GDP in 1960: {m1960}\nMean GDP in 2020: {m2020}'.format(m1960=df1960mean, m2020=df2020mean))
print('Difference in GDP between 1960 to 2020: {mdif}'.format(mdif=df2020mean-df1960mean))
df.describe()

Mean GDP in 1960: 11614905878.833878
Mean GDP in 2020: 430990113284.3397
Difference in GDP between 1960 to 2020: 419375207405.50586


Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
count,98.0,101.0,104.0,104.0,104.0,113.0,115.0,118.0,120.0,120.0,...,210.0,209.0,210.0,210.0,209.0,208.0,208.0,208.0,205.0,194.0
mean,11614905878.83,11906336553.98,12630162297.38,13615198721.58,14905901025.68,15014657828.13,16114938593.33,16758918740.8,17843252928.97,19678786073.43,...,348244521134.71,357773577302.25,366132129630.55,376109507240.34,355408590044.97,362773134768.17,385953305417.55,410166193965.98,422337620580.79,430990113284.34
std,55924191269.38,57154619021.81,60527339386.19,64025794833.88,68948817766.79,71868552907.04,78010010134.75,81574522403.85,88412522827.03,95903668520.18,...,1346255683795.77,1406897759652.32,1452330159374.29,1519571482416.66,1555346055776.19,1600167269667.4,1688091240186.3,1810180258700.04,1881976352922.97,1914465862733.67
min,12012012.01,11592011.59,9122751.45,10840095.13,12712471.4,13593932.32,14469078.18,15835177.93,14600000.0,15850000.0,...,38711810.21,37671774.69,37509075.11,37290607.54,35492074.22,36547799.58,40619251.99,42588164.97,47271463.33,48855550.2
25%,283286082.35,271066000.0,284161587.15,328767087.58,364379525.75,353251800.0,354474300.0,333335651.62,361919591.18,404472722.62,...,5172994369.25,5456009385.0,5428970192.25,5719453359.25,5799000000.0,5808218370.0,6299854430.5,6644358830.25,7220395248.0,7780402262.5
50%,1023646137.5,1058975266.0,1114083740.5,1179979554.0,1258118950.0,1006666638.0,1246908374.0,1242262431.5,1399524386.5,1559142014.0,...,23711274353.5,25503060420.0,24504743859.0,24904670209.5,22881558192.0,23221376289.0,25426395630.5,26143956879.0,26896660000.0,31192751628.5
75%,4255954032.25,4817580184.0,5114321932.75,5724960131.0,5975611956.75,5906636557.0,6500398188.0,6810960498.25,7099120710.5,8347340740.5,...,169072500000.0,182592000000.0,190880500000.0,200581500000.0,166774000000.0,161676000000.0,174377250000.0,190487750000.0,205144000000.0,203106750000.0
max,543300000000.0,563300000000.0,605100000000.0,638600000000.0,685800000000.0,743700000000.0,815000000000.0,861700000000.0,942500000000.0,1019900000000.0,...,15542600000000.0,16197000000000.0,16784800000000.0,17527200000000.0,18238300000000.0,18745100000000.0,19543000000000.0,20611900000000.0,21433200000000.0,20953000000000.0
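
Because `Series.mean()` skips missing values, the 1960 and 2020 means are taken over different groups of countries. The sketch below, on a hypothetical three-country frame named `toy`, contrasts that with a mean restricted to countries that report both years; it is one possible refinement, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical toy frame; country C has no 1960 value, like many real rows.
toy = pd.DataFrame(
    {
        "Country Name": ["A", "B", "C"],
        "1960": [10.0, 30.0, None],
        "2020": [100.0, 500.0, 900.0],
    }
)

# Ratio of means over every available value (NaN is skipped per column).
ratio_all = toy["2020"].mean() / toy["1960"].mean()

# Ratio of means restricted to countries that report both years.
both = toy.dropna(subset=["1960", "2020"])
ratio_common = both["2020"].mean() / both["1960"].mean()

print(f"all available: {ratio_all:.1f}x, common countries: {ratio_common:.1f}x")
```

In the toy data the two ratios differ noticeably, which is why the set of countries behind each mean matters when interpreting the growth figure.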


## Data Insights

From there, I decided to ask a few questions of my data:

1. Year over year, how does the data correlate?
2. Overall, how many missing values are there?
3. What was the lowest recorded GDP from 1960?
4. What was the highest recorded GDP from 1960?
5. What were the five countries with the lowest recorded GDP from 1960?
6. What were the five countries with the highest recorded GDP from 1960?
7. What was the lowest recorded GDP from 2020?
8. What was the highest recorded GDP from 2020?
9. What were the five countries with the lowest recorded GDP from 2020?
10. What were the five countries with the highest recorded GDP from 2020?

Below, each of the questions is answered in order.

### 1. Year over year, how does the data correlate?

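The values year over year have a strong to very strong positive correlation. In the portion of the matrix shown, adjacent years correlate at above 0.999, and even the most distant pairs, such as 1962 and 2020, stay above 0.85.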

In [4]:
corr = df.corr()
pd.options.display.float_format = '{:.6f}'.format
corr

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
1960,1.000000,0.999583,0.999130,0.998771,0.998465,0.998523,0.998208,0.997148,0.995912,0.994906,...,0.906976,0.901237,0.901711,0.898131,0.899460,0.899196,0.891055,0.878608,0.881140,0.869865
1961,0.999583,1.000000,0.999872,0.999652,0.999396,0.999333,0.999050,0.998387,0.997421,0.996539,...,0.899806,0.894166,0.893280,0.889369,0.891563,0.891719,0.882716,0.870109,0.872738,0.861310
1962,0.999130,0.999872,1.000000,0.999808,0.999600,0.999472,0.999228,0.998704,0.997949,0.997144,...,0.897902,0.891695,0.890084,0.885722,0.887571,0.887906,0.878691,0.865603,0.868128,0.856396
1963,0.998771,0.999652,0.999808,1.000000,0.999813,0.999692,0.999419,0.999138,0.998436,0.997741,...,0.901664,0.895064,0.892987,0.888409,0.889548,0.890051,0.880806,0.867697,0.870135,0.858293
1964,0.998465,0.999396,0.999600,0.999813,1.000000,0.999953,0.999548,0.999308,0.998613,0.998099,...,0.906111,0.899569,0.897157,0.892475,0.893480,0.894118,0.884921,0.871939,0.874298,0.862625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016,0.899196,0.891719,0.887906,0.890051,0.894118,0.897755,0.897863,0.895221,0.892123,0.894375,...,0.983359,0.989298,0.994891,0.997118,0.999605,1.000000,0.999502,0.998235,0.998237,0.996413
2017,0.891055,0.882716,0.878691,0.880806,0.884921,0.888690,0.888507,0.885502,0.882149,0.884314,...,0.980370,0.987023,0.994102,0.996886,0.999419,0.999502,1.000000,0.999361,0.999290,0.997762
2018,0.878608,0.870109,0.865603,0.867697,0.871939,0.875963,0.875775,0.872471,0.868695,0.870822,...,0.974970,0.982732,0.991423,0.995135,0.998483,0.998235,0.999361,1.000000,0.999857,0.999310
2019,0.881140,0.872738,0.868128,0.870135,0.874298,0.878287,0.878024,0.874635,0.870949,0.873000,...,0.973641,0.981746,0.990461,0.994268,0.998378,0.998237,0.999290,0.999857,1.000000,0.999464
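
In pandas 2.0 and later, `DataFrame.corr()` no longer silently drops text columns such as the country name, so the cell above may raise an error there. A minimal sketch of one way around this, assuming a hypothetical frame `toy`, is to select the numeric columns first:

```python
import pandas as pd

# Hypothetical toy frame with one text column and two year columns.
toy = pd.DataFrame(
    {
        "Country Name": ["A", "B", "C", "D"],
        "1960": [1.0, 2.0, 3.0, None],
        "2020": [2.0, 4.0, 7.0, 9.0],
    }
)

# Keep only numeric (year) columns so text columns cannot interfere.
year_cols = toy.select_dtypes(include="number")

# Pearson correlation by default; each pair of columns uses only the rows
# where both values are present.
corr_demo = year_cols.corr()
print(corr_demo)
```

Since missing values are excluded pair by pair, correlations involving early years are based on fewer countries than those between recent years.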


### 2. Overall, how many missing values are there?

Depending on the year, the number of missing values ranges from 7 to 119. In total, there are 3,114 missing GDP values.

In [5]:
df_rotate = df.isnull().sum().to_frame().T
pd.set_option('display.max_columns', df.shape[1]+1)
print(df_rotate.sum(axis=1))
df_rotate

0    3114
dtype: int64


Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,0,0,119,116,113,113,113,104,102,99,97,97,91,89,89,89,88,86,85,82,83,82,71,68,67,66,65,63,61,57,54,54,40,45,41,38,34,25,25,25,23,22,18,17,12,12,12,12,11,11,10,10,9,7,8,7,7,8,9,9,9,12,23
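
The same three figures (total, per-year maximum, per-year minimum) can also be computed directly rather than read off the table. This is a sketch on a hypothetical frame `toy`, with variable names chosen for illustration:

```python
import pandas as pd

# Hypothetical toy frame with a few gaps in the year columns.
toy = pd.DataFrame(
    {
        "Country Name": ["A", "B", "C"],
        "1960": [1.0, None, None],
        "2020": [5.0, 6.0, None],
    }
)

# Count missing values per column, then restrict to the year columns.
missing_per_column = toy.isna().sum()
year_counts = missing_per_column.drop("Country Name")

total_missing = int(missing_per_column.sum())
print(total_missing, int(year_counts.max()), int(year_counts.min()))
```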


### 3. What was the lowest recorded GDP from 1960?

The lowest recorded GDP in 1960 came from Seychelles, with a total of $12,012,012.01.

In [6]:
pd.reset_option('^display.', silent=True)
pd.options.display.float_format = '{:.2f}'.format
df1960 = df[df['1960'] == df['1960'].min()]
df1960

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
169,Seychelles,SYC,12012012.01,11592011.59,12642026.57,13923029.26,15393032.35,15603032.8,16443034.56,16632032.81,...,1065826670.0,1060226126.0,1328157609.0,1343007845.0,1377495054.0,1426651769.0,1528242026.0,1547690759.0,1582841059.0,1059886364.0


### 4. What was the highest recorded GDP from 1960?

The highest recorded GDP in 1960 came from the United States, with a total of $543,300,000,000.00.

In [7]:
df1960 = df[df['1960'] == df['1960'].max()]
df1960

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
206,United States,USA,543300000000.0,563300000000.0,605100000000.0,638600000000.0,685800000000.0,743700000000.0,815000000000.0,861700000000.0,...,15542600000000.0,16197000000000.0,16784800000000.0,17527200000000.0,18238300000000.0,18745100000000.0,19543000000000.0,20611900000000.0,21433200000000.0,20953000000000.0
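
An alternative to filtering on `min()` and `max()` is to look up the row label of the extreme value with `idxmin` and `idxmax`, which skip missing values by default. The frame `toy` below is hypothetical:

```python
import pandas as pd

# Hypothetical toy frame; country C has no 1960 value.
toy = pd.DataFrame(
    {
        "Country Name": ["A", "B", "C"],
        "1960": [30.0, 12.0, None],
    }
)

# idxmin/idxmax return the index label of the extreme value, ignoring NaN.
lowest = toy.loc[toy["1960"].idxmin()]
highest = toy.loc[toy["1960"].idxmax()]

print(lowest["Country Name"], highest["Country Name"])
```

Unlike the boolean filter, this returns a single row even when several countries tie for the extreme value.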


### 5. What were the five countries with the lowest recorded GDP from 1960?

The five countries with the lowest recorded GDP in 1960 were Seychelles, St. Kitts and Nevis, St. Vincent and the Grenadines, Belize, and Botswana.

In [8]:
df = df.sort_values('1960', ascending=True)
df.head(5)

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
169,Seychelles,SYC,12012012.01,11592011.59,12642026.57,13923029.26,15393032.35,15603032.8,16443034.56,16632032.81,...,1065826670.0,1060226126.0,1328157609.0,1343007845.0,1377495054.0,1426651769.0,1528242026.0,1547690759.0,1582841059.0,1059886364.0
181,St. Kitts and Nevis,KNA,12366563.61,12483229.31,12541562.15,12833226.39,13416554.86,13593932.32,14469078.18,16742338.25,...,817759259.3,800414814.8,839770370.4,916566666.7,923155555.6,1008888889.0,1060740741.0,1078518519.0,1164814815.0,980740740.7
184,St. Vincent and the Grenadines,VCT,13066557.78,13999883.33,14524878.96,13708219.1,14758210.35,15108207.43,16099865.83,15835177.93,...,676129629.6,692933333.3,721207407.4,727714814.8,755400000.0,774429629.6,792177777.8,811300000.0,825040740.7,807474074.1
19,Belize,BLZ,28071888.56,29964370.71,31856922.86,33749405.01,36193826.12,40069930.07,44405594.41,47379310.34,...,1460797903.0,1522897506.0,1579411253.0,1667335061.0,1721700991.0,1789304088.0,1858529677.0,1915899787.0,1982518541.0,1636280797.0
25,Botswana,BWA,30412308.99,32902336.64,35643207.63,38091150.57,41613969.05,45790869.75,51464435.15,58646443.51,...,15351972361.0,14380004175.0,14901750991.0,15654660710.0,13578754072.0,15082578065.0,16088437675.0,16914245098.0,16593720656.0,15061922802.0


### 6. What were the five countries with the highest recorded GDP from 1960?

The five countries with the highest recorded GDP in 1960 were the United States, the United Kingdom, France, China, and Japan. Countries without a recorded 1960 value, such as Germany, cannot appear in this ranking.

In [9]:
df = df.sort_values('1960', ascending=False)
df.head(5)

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
206,United States,USA,543300000000.0,563300000000.0,605100000000.0,638600000000.0,685800000000.0,743700000000.0,815000000000.0,861700000000.0,...,15542600000000.0,16197000000000.0,16784800000000.0,17527200000000.0,18238300000000.0,18745100000000.0,19543000000000.0,20611900000000.0,21433200000000.0,20953000000000.0
205,United Kingdom,GBR,73233967692.0,77741965703.0,81247564157.0,86561961812.0,94407558351.0,101825000000.0,108573000000.0,113117000000.0,...,2674890000000.0,2719160000000.0,2803290000000.0,3087170000000.0,2956570000000.0,2722850000000.0,2699020000000.0,2900790000000.0,2878670000000.0,2759800000000.0
68,France,FRA,62225478001.0,67461644222.0,75607529810.0,84759195106.0,94007851047.0,101537000000.0,110046000000.0,118973000000.0,...,2861410000000.0,2683830000000.0,2811080000000.0,2852170000000.0,2438210000000.0,2471290000000.0,2588740000000.0,2789590000000.0,2728870000000.0,2630320000000.0
41,China,CHN,59716467625.0,50056868958.0,47209359006.0,50706799903.0,59708343489.0,70436266147.0,76720285970.0,72881631327.0,...,7551500000000.0,8532230000000.0,9570410000000.0,10475700000000.0,11061600000000.0,11233300000000.0,12310400000000.0,13894800000000.0,14279900000000.0,14722700000000.0
98,Japan,JPN,44307342950.0,53508617739.0,60723018684.0,69498131797.0,81749006382.0,90950278258.0,105628000000.0,123782000000.0,...,6233150000000.0,6272360000000.0,5212330000000.0,4896990000000.0,4444930000000.0,5003680000000.0,4930840000000.0,5036890000000.0,5148780000000.0,5057760000000.0
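
The top-five and bottom-five lookups can also be written with `nsmallest` and `nlargest`, which leave the original frame's row order untouched instead of reassigning a sorted copy to `df`. A sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical toy frame; country C has no 1960 value.
toy = pd.DataFrame(
    {
        "Country Name": ["A", "B", "C", "D", "E"],
        "1960": [5.0, 1.0, None, 3.0, 4.0],
    }
)

# Rows with the two smallest and two largest 1960 values; NaN rows are
# only used if there are not enough non-missing values.
bottom2 = toy.nsmallest(2, "1960")
top2 = toy.nlargest(2, "1960")

print(bottom2["Country Name"].tolist(), top2["Country Name"].tolist())
```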


### 7. What was the lowest recorded GDP from 2020?

The lowest recorded GDP in 2020 came from Tuvalu, with a total of $48,855,550.20.

In [10]:
df2020 = df[df['2020'] == df['2020'].min()]
df2020

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
201,Tuvalu,TUV,,,,,,,,,...,38711810.21,37671774.69,37509075.11,37290607.54,35492074.22,36547799.58,40619251.99,42588164.97,47271463.33,48855550.2


### 8. What was the highest recorded GDP from 2020?

The highest recorded GDP in 2020 came from the United States, with a total of $20,953,000,000,000.00.

In [11]:
df2020 = df[df['2020'] == df['2020'].max()]
df2020

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
206,United States,USA,543300000000.0,563300000000.0,605100000000.0,638600000000.0,685800000000.0,743700000000.0,815000000000.0,861700000000.0,...,15542600000000.0,16197000000000.0,16784800000000.0,17527200000000.0,18238300000000.0,18745100000000.0,19543000000000.0,20611900000000.0,21433200000000.0,20953000000000.0


### 9. What were the five countries with the lowest recorded GDP from 2020?

The five countries with the lowest recorded GDP in 2020 were Tuvalu, Nauru, Kiribati, the Marshall Islands, and Palau.

In [12]:
df = df.sort_values('2020', ascending=True)
df.head(5)

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
201,Tuvalu,TUV,,,,,,,,,...,38711810.21,37671774.69,37509075.11,37290607.54,35492074.22,36547799.58,40619251.99,42588164.97,47271463.33,48855550.2
137,Nauru,NRU,,,,,,,,,...,66055407.67,96927201.48,98491843.64,104654365.2,86529661.37,99723394.96,109359680.2,124021393.7,118724073.8,114626625.6
102,Kiribati,KIR,,,,,,,,,...,181705153.6,190243432.8,185114059.6,179703165.4,171117816.7,178328984.1,187276124.8,200157020.6,188391770.6,197508774.3
124,Marshall Islands,MHL,,,,,,,,,...,172188500.0,180436300.0,184840400.0,182142800.0,183814300.0,201510900.0,213204100.0,221588900.0,239462200.0,244462400.0
150,Palau,PLW,,,,,,,,,...,196911100.0,212397800.0,221117200.0,241669800.0,280457700.0,298300000.0,285300000.0,284700000.0,274200000.0,257700000.0


### 10. What were the five countries with the highest recorded GDP from 2020?

The five countries with the highest recorded GDP in 2020 were the United States, China, Japan, Germany, and the United Kingdom.

In [13]:
df = df.sort_values('2020', ascending=False)
df.head(5)

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
206,United States,USA,543300000000.0,563300000000.0,605100000000.0,638600000000.0,685800000000.0,743700000000.0,815000000000.0,861700000000.0,...,15542600000000.0,16197000000000.0,16784800000000.0,17527200000000.0,18238300000000.0,18745100000000.0,19543000000000.0,20611900000000.0,21433200000000.0,20953000000000.0
41,China,CHN,59716467625.0,50056868958.0,47209359006.0,50706799903.0,59708343489.0,70436266147.0,76720285970.0,72881631327.0,...,7551500000000.0,8532230000000.0,9570410000000.0,10475700000000.0,11061600000000.0,11233300000000.0,12310400000000.0,13894800000000.0,14279900000000.0,14722700000000.0
98,Japan,JPN,44307342950.0,53508617739.0,60723018684.0,69498131797.0,81749006382.0,90950278258.0,105628000000.0,123782000000.0,...,6233150000000.0,6272360000000.0,5212330000000.0,4896990000000.0,4444930000000.0,5003680000000.0,4930840000000.0,5036890000000.0,5148780000000.0,5057760000000.0
73,Germany,DEU,,,,,,,,,...,3744410000000.0,3527340000000.0,3732740000000.0,3883920000000.0,3356240000000.0,3467500000000.0,3681730000000.0,3975350000000.0,3888330000000.0,3846410000000.0
205,United Kingdom,GBR,73233967692.0,77741965703.0,81247564157.0,86561961812.0,94407558351.0,101825000000.0,108573000000.0,113117000000.0,...,2674890000000.0,2719160000000.0,2803290000000.0,3087170000000.0,2956570000000.0,2722850000000.0,2699020000000.0,2900790000000.0,2878670000000.0,2759800000000.0


## Data Report

Finally, using [pandas profiling](https://github.com/ydataai/pandas-profiling), I am able to create a report that gives an overview of the data, including some of the insights gleaned above, in a nicer format. With this amount of data, I am only able to run a minimal exploration using pandas profiling, but the other datasets I have for the capstone could be processed fully.

In [14]:
df = pd.read_csv('data.csv', engine='python', skipfooter = 49)
pd.options.display.float_format = '{:.2f}'.format
df = df.drop(columns=['Series Name', 'Series Code'])
profile = ProfileReport(df, title='GDP (current US$)', minimal=True, dark_mode=True)
profile




## Conclusion

In conclusion, from the EDA of the data I have so far, I am better able to understand government healthcare budgets and how they relate to a country's overall GDP, its GDP per capita, and how much people in each country pay out-of-pocket for healthcare.