<a href="https://colab.research.google.com/github/avellozo/DIO/blob/main/Carrefour/Gapminder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Gapminder dataset analysis

# Imports and general initialization


In [1]:
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format
import plotly.graph_objects as go
import plotly.express as px

#Dataset load

In [2]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/DIO/Carrefour Data Engineer/Cusro_Python_Pandas_Digital_Innovation-master/datasets/Gapminder.csv", sep=";")

#Exploring the dataset contents

In [3]:
df.sample(10)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
933,Eritrea,Africa,1952,35.93,1438760,328.94
1382,Iceland,Europe,1990,77.98,254719,26372.88
3123,Ukraine,FSU,1992,69.0,51869109,6650.45
2945,Syria,Asia,1957,48.28,4149908,2117.23
3296,Zambia,Africa,1992,46.1,8381163,1210.88
2381,Poland,Europe,2001,74.33,38643641,11825.87
185,Azerbaijan,Asia,1992,65.62,7413618,3455.54
798,Czech Republic,Europe,2003,75.46,10251087,18237.59
1624,Kenya,Africa,1997,54.41,28263827,1360.49
3050,Togo,Africa,1967,46.77,1735550,1477.6


In [4]:
df.shape

(3312, 6)

The dataset has 3312 records and 6 columns. The columns have the following data types:

In [5]:
df.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

#Preliminary analysis

##Null and categorical values

Which columns have null values?

In [6]:
df.isna().any()

country      False
continent     True
year         False
lifeExp      False
pop          False
gdpPercap    False
dtype: bool

In [7]:
df.isna().sum()

country        0
continent    301
year           0
lifeExp        0
pop            0
gdpPercap      0
dtype: int64

In [8]:
df[df.continent.isna()].country.value_counts()

Canada                   57
Australia                56
Hong Kong, China         12
Reunion                  12
Sao Tome and Principe    12
Haiti                    12
Belize                   11
Papua New Guinea         10
Barbados                 10
Bahamas                  10
Georgia                   9
New Caledonia             9
French Polynesia          9
Grenada                   8
Micronesia, Fed. Sts.     8
Netherlands Antilles      8
Maldives                  8
Aruba                     8
Samoa                     7
Vanuatu                   7
Tonga                     7
Armenia                   4
Uzbekistan                4
Martinique                1
Guadeloupe                1
French Guiana             1
Name: country, dtype: int64

What are the possible values of continent and country?

In [9]:
df.continent.nunique()

6

In [10]:
df.continent.value_counts()

Europe      1302
Africa       613
Asia         557
Americas     343
FSU          122
Oceania       74
Name: continent, dtype: int64

In [11]:
df.country.nunique()

187

In [12]:
df.country.value_counts()

Portugal         58
Finland          58
Taiwan           58
Sweden           58
Iceland          58
                 ..
Armenia           4
Azerbaijan        4
French Guiana     1
Guadeloupe        1
Martinique        1
Name: country, Length: 187, dtype: int64

##Fixing the most relevant data

In [13]:
df.loc[df.country == 'Canada','continent'] = 'Americas'
df.loc[df.country == 'Australia','continent'] = 'Oceania'
df.loc[df.country == 'Hong Kong, China','continent'] = 'Asia'
df.loc[df.country == 'Sao Tome and Principe','continent'] = 'Africa'
df.loc[df.country == 'Reunion','continent'] = 'Africa'
df.loc[df.country == 'Haiti','continent'] = 'Americas'
df.loc[df.country == 'Belize','continent'] = 'Americas'
df.loc[df.country == 'Barbados','continent'] = 'Americas'
df.loc[df.country == 'Bahamas','continent'] = 'Americas'
df.loc[df.country == 'Papua New Guinea','continent'] = 'Oceania'

# the remaining countries with a null continent could be fixed as well...
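The remaining null continents could be filled from a lookup table instead of one `df.loc` assignment per country. The sketch below is only an illustration on toy data: `df_demo` and `continent_by_country` are hypothetical names, and the mapping would need to be extended with the other countries listed above.

```python
import pandas as pd

# Toy frame standing in for the Gapminder data (values are illustrative only).
df_demo = pd.DataFrame({
    "country": ["Canada", "Canada", "Aruba", "Brazil"],
    "continent": [None, None, None, "Americas"],
})

# Hypothetical lookup table; extend it with the remaining countries as needed.
continent_by_country = {"Canada": "Americas", "Aruba": "Americas"}

# Fill only the missing continents, leaving existing values untouched.
df_demo["continent"] = df_demo["continent"].fillna(
    df_demo["country"].map(continent_by_country)
)
print(df_demo)
```

Because `fillna` only touches missing entries, countries that already have a continent keep their original value.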

##Summary of the numeric columns (describe)

In [14]:
df.describe()

Unnamed: 0,year,lifeExp,pop,gdpPercap
count,3312.0,3312.0,3312.0,3312.0
mean,1980.3,65.25,31614890.82,11317.12
std,16.93,11.77,104119342.89,11369.14
min,1950.0,23.6,59412.0,241.17
25%,1967.0,58.34,2678572.0,2514.63
50%,1982.0,69.61,7557218.5,7838.51
75%,1996.0,73.66,19585221.75,17357.88
max,2007.0,82.67,1318683096.0,113523.13


First observations:
- Mean life expectancy is 65.25 years and the median is 69.61. A mean below the median suggests a tail of very low values.
- Population has a standard deviation larger than its mean, so the values are widely dispersed.
- GDP per capita is also widely dispersed.
- The mean (11317.12) and median (7838.51) of GDP per capita are very different, which suggests a tail of very high values.
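The mean-versus-median reasoning above can be checked with a skewness statistic. This is a minimal sketch on made-up numbers, not on the Gapminder data; `summarize_skew` is a hypothetical helper name. A negative skew goes with a tail of low values (mean below median), a positive skew with a tail of high values.

```python
import pandas as pd

def summarize_skew(series: pd.Series) -> dict:
    """Return mean, median and sample skewness for a numeric series."""
    return {
        "mean": series.mean(),
        "median": series.median(),
        "skew": series.skew(),
    }

# Toy values with a tail of low numbers, so the mean falls below the median.
left_tailed = pd.Series([30, 60, 68, 70, 71, 72, 73])
stats = summarize_skew(left_tailed)
print(stats)
```

On the real data the same helper could be applied to `df["lifeExp"]` and `df["gdpPercap"]`.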

Direct questions raised by this first analysis:
- Which record (country and year) has:
    - the lowest life expectancy?
    - the highest and lowest GDP per capita?
    - the smallest population?
- Are there many records with a large population?
- Is there any correlation between small countries (low population) and life expectancy?
- What is the correlation between GDP per capita and life expectancy?

#Answers to the initial questions


#Life expectancy analysis

## Life expectancy extremes

In [15]:
df[df.lifeExp == min(df.lifeExp)]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
2519,Rwanda,Africa,1992,23.6,7290203,737.07


In [16]:
df[df.lifeExp == max(df.lifeExp)]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1597,Japan,Asia,2006,82.67,127463611,31001.72


In [17]:
px.box(df, x="year", y="lifeExp", title = "Life expectancy over the years")

As the box plot shows, there are many outliers at the low end of life expectancy over the years.

Question: is Rwanda the country with the lowest life expectancies?
And do the highest life expectancies belong to Japan?

In [18]:
df.sort_values(by='lifeExp').head(20)['country'].value_counts()

Afghanistan      3
Sierra Leone     3
Guinea-Bissau    2
Angola           2
Gambia           2
Cambodia         1
Burkina Faso     1
China            1
Mozambique       1
Somalia          1
Yemen, Rep.      1
Rwanda           1
Guinea           1
Name: country, dtype: int64

Although Rwanda had the single lowest life expectancy (in 1992), the 20 lowest values are well spread across countries.

In [19]:
df.sort_values(by='lifeExp').tail(20)['country'].value_counts()

Japan               8
Switzerland         4
Iceland             3
Hong Kong, China    2
Australia           2
Italy               1
Name: country, dtype: int64

Japan, on the other hand, appears many times among the 20 highest values.

Life expectancy extremes by continent


In [20]:
df.sort_values(by='lifeExp').head(20)['continent'].value_counts()

Africa    14
Asia       6
Name: continent, dtype: int64

In [21]:
df.sort_values(by='lifeExp').tail(20)['continent'].value_counts()

Asia       10
Europe      8
Oceania     2
Name: continent, dtype: int64

Asia and Europe are roughly balanced among the 20 highest values.

## Which countries had decreases in life expectancy over the years?

In [22]:
df_country_year = df.sort_values(by=['country', 'year']) #be sure the data is sorted accordingly
df_country_year['diff_lifeExp'] = df_country_year.groupby('country')['lifeExp'].diff()
df_country_year[df_country_year['diff_lifeExp']<0]['country'].value_counts()

Iceland             23
Latvia              22
Bulgaria            22
Hungary             20
Portugal            17
                    ..
Papua New Guinea     1
Togo                 1
Germany              1
Kazakhstan           1
Puerto Rico          1
Name: country, Length: 89, dtype: int64

Several countries had at least one decrease in life expectancy. Portugal and Iceland are examples.

In [23]:
df_country_year[df_country_year['country']=='Iceland']

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,diff_lifeExp
1342,Iceland,Europe,1950,71.01,142938,7750.29,
1343,Iceland,Europe,1951,71.05,145604,7468.81,0.04
1344,Iceland,Europe,1952,72.49,147962,7267.69,1.44
1345,Iceland,Europe,1953,72.31,151036,8211.29,-0.18
1346,Iceland,Europe,1954,73.36,154563,8800.72,1.05
1347,Iceland,Europe,1955,73.3,158044,9458.67,-0.06
1348,Iceland,Europe,1956,72.98,161358,9489.02,-0.32
1349,Iceland,Europe,1957,73.47,165110,9244.0,0.49
1350,Iceland,Europe,1958,73.43,168771,9789.87,-0.04
1351,Iceland,Europe,1959,72.66,172314,9768.93,-0.77


In [24]:
df_country_year[df_country_year['country']=='Portugal']

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,diff_lifeExp
2388,Portugal,Europe,1950,58.53,8442750,2961.89,
2389,Portugal,Europe,1951,58.72,8490250,3077.24,0.19
2390,Portugal,Europe,1952,59.82,8526050,3068.32,1.1
2391,Portugal,Europe,1953,61.12,8578950,3262.2,1.3
2392,Portugal,Europe,1954,62.26,8632100,3397.69,1.14
2393,Portugal,Europe,1955,61.43,8692600,3513.19,-0.83
2394,Portugal,Europe,1956,61.24,8756000,3639.99,-0.19
2395,Portugal,Europe,1957,61.51,8817650,3774.57,0.27
2396,Portugal,Europe,1958,63.81,8888550,3793.66,2.3
2397,Portugal,Europe,1959,62.99,8961550,3966.47,-0.82


##Mean life expectancy over the years

In [25]:
df.groupby("year")["lifeExp"].mean().diff().sort_values()

year
1952   -16.70
1957   -16.34
1962   -14.83
1967   -14.18
1972   -12.63
1977   -11.73
1982   -10.97
1987   -10.47
2007   -10.02
2002    -9.42
1992    -9.37
1997    -9.16
1990    -0.64
1959    -0.59
2005    -0.20
1969    -0.04
1985     0.01
1980     0.05
1991     0.09
1995     0.10
1994     0.13
1999     0.13
1971     0.14
1956     0.14
1975     0.16
1989     0.16
1966     0.19
1965     0.20
1976     0.22
2001     0.23
1960     0.24
1981     0.27
1974     0.28
1979     0.30
1970     0.31
1984     0.31
2000     0.32
2004     0.33
1986     0.34
1955     0.35
1961     0.39
1964     0.47
1996     0.48
1954     0.79
2006     1.17
1951     3.90
1993     9.32
1998     9.70
2003     9.75
1988    10.78
1983    11.42
1978    12.29
1973    13.03
1968    14.43
1963    15.56
1958    17.20
1953    17.47
1950      nan
Name: lifeExp, dtype: float64

Note that each year with a large drop in mean life expectancy is followed by a large increase the next year, and that this happens every 5 years (1952, 1957, ..., 2007). This may be because some countries (with low life expectancy) only report data every 5 years.
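One way to test this hypothesis is to count how many countries contribute to each yearly mean. The sketch below uses invented toy data (countries `A` and `B` are hypothetical) to show how a low-value country that reports only every 5 years pulls the mean down in exactly those years.

```python
import pandas as pd

# Toy data: country A reports every year, country B only in 1952 and 1957.
toy = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] * 2,
    "year": [1952, 1953, 1954, 1955, 1956, 1957, 1952, 1957],
    "lifeExp": [70.0, 70.2, 70.4, 70.6, 70.8, 71.0, 40.0, 42.0],
})

# Number of reporting countries and mean life expectancy per year.
per_year = toy.groupby("year").agg(
    n_countries=("country", "nunique"),
    mean_lifeExp=("lifeExp", "mean"),
)
print(per_year)
```

Applied to `df`, a jump in `n_countries` in the same years as the drops in the mean would support the explanation above.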

##Which countries have records every 5 years?

In [26]:
# get the countries whose consecutive records are 5 years apart
country5years = df.sort_values(by='year').country.where(df.groupby('country')['year'].diff() == 5)
country5years.value_counts()

Jordan         11
Mauritius      11
Senegal        11
Algeria        11
Nepal          11
               ..
Timor-Leste     3
Uzbekistan      3
Tajikistan      3
Poland          1
Luxembourg      1
Name: country, Length: 155, dtype: int64

Of the 187 countries, most (155) report data at 5-year intervals.

Mean life expectancy for countries that have records every 5 years:

In [27]:
df[df['country'].isin(country5years)].lifeExp.mean()

59.25679285482999

Mean life expectancy for countries that do not have records every 5 years:

In [28]:
df[~df['country'].isin(country5years)].lifeExp.mean()

72.85455654557919

So the initial impression was correct: on average, the countries that report data every 5 years have a lower life expectancy.
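The grouping above counts a country as soon as it has a single 5-year gap (Poland and Luxembourg each have only one). A stricter alternative, sketched here on toy data and not the method used in this notebook, classifies each country by its median gap between consecutive records. The names `median_gap` and `five_year_countries` are illustrative.

```python
import pandas as pd

# Toy data: A reports mostly yearly (one 5-year gap), B reports every 5 years.
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 3,
    "year": [1950, 1951, 1952, 1957, 1952, 1957, 1962],
    "lifeExp": [70.0, 71.0, 72.0, 73.0, 40.0, 42.0, 44.0],
})

toy = toy.sort_values(["country", "year"])
# Gap in years between consecutive records of the same country.
gaps = toy.groupby("country")["year"].diff()
# Typical (median) gap per country; the NaN of each first row is ignored.
median_gap = gaps.groupby(toy["country"]).median()
five_year_countries = median_gap[median_gap == 5].index

is_five_year = toy["country"].isin(five_year_countries)
mean_five_year = toy[is_five_year]["lifeExp"].mean()
mean_other = toy[~is_five_year]["lifeExp"].mean()
print(mean_five_year, mean_other)
```

With this rule a country with one isolated 5-year gap, like `A` here, stays in the yearly group.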

##Some graphs showing life expectancy over time

In [29]:
fig = px.line(df, x="year", y="lifeExp", color='country', labels="country", 
              title="Life expectancy x years. Click on the country to view it.")
fig.for_each_trace(
    lambda trace: trace.update(visible="legendonly") if trace.name not in ("country=Brazil", "country=Australia",
                                                                           "country=Portugal", "country=Ireland") 
                                                    else ()
)
fig.show()
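The filter above matches trace names of the form `country=Brazil`. As far as I know, more recent Plotly Express releases name the traces with the bare value (`Brazil`), in which case every trace would be hidden. A small version-tolerant helper is sketched below; `country_from_trace_name` and `should_hide` are hypothetical names, and the assumption is that no country name contains `=`.

```python
def country_from_trace_name(trace_name: str) -> str:
    """Return the country part of a trace name such as 'country=Brazil' or 'Brazil'."""
    return trace_name.split("=", 1)[-1]

# Countries that should be visible when the figure first renders.
selected = {"Brazil", "Australia", "Portugal", "Ireland"}

def should_hide(trace_name: str) -> bool:
    """True when the trace should start as 'legendonly'."""
    return country_from_trace_name(trace_name) not in selected

print(should_hide("country=Brazil"), should_hide("Brazil"), should_hide("Japan"))
```

It could then be used as `fig.for_each_trace(lambda t: t.update(visible="legendonly") if should_hide(t.name) else None)`.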

#Population analysis

## Highest and lowest values

In [30]:
print(df[df['pop'] == min(df['pop'])])
print(df[df['pop'] == max(df['pop'])])

   country continent  year  lifeExp    pop  gdpPercap
65   Aruba       NaN  1977    71.83  59412    7390.36
    country continent  year  lifeExp         pop  gdpPercap
638   China      Asia  2007    72.96  1318683096    4959.11


In [31]:
df.sort_values(by='pop').head(20)['country'].value_counts()

Aruba                    8
Sao Tome and Principe    6
Micronesia, Fed. Sts.    3
Djibouti                 2
Belize                   1
Name: country, dtype: int64

In [32]:
df.sort_values(by='pop').tail(20)['country'].value_counts()

China    16
India     4
Name: country, dtype: int64

As expected, China accounts for most of the records with the largest populations.

##Which countries had decreases in population over the years?

In [33]:
df_country_year = df.sort_values(by=['country', 'year'])
df_country_year['diff_pop'] = df_country_year.groupby('country')['pop'].diff()
df_country_year[df_country_year['diff_pop']<0]['country'].value_counts()

Hungary                   26
Czech Republic            19
Bulgaria                  17
Latvia                    17
Estonia                   16
Lithuania                 15
Ukraine                   14
Russia                    13
Portugal                  10
Belarus                   10
Poland                     9
Slovenia                   7
Austria                    6
Grenada                    5
Germany                    4
Denmark                    4
Trinidad and Tobago        4
Ireland                    3
Romania                    3
Armenia                    3
Switzerland                3
Georgia                    3
Guyana                     3
Micronesia, Fed. Sts.      2
Malta                      2
Belgium                    2
Suriname                   2
Finland                    2
Samoa                      2
Bosnia and Herzegovina     2
Moldova                    2
Kazakhstan                 2
Aruba                      1
Cyprus                     1
Italy         

Note: Japan has had no population decrease. Could this have been offset by its higher life expectancy?
Question: is there any correlation between population decreases and GDP per capita? Between population and life expectancy?

### How did the population change over the years for Hungary and the Czech Republic?



In [34]:
df_country_year[df_country_year['country']=='Hungary']

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,diff_pop
1285,Hungary,Europe,1950,62.07,9338000,4726.19,
1286,Hungary,Europe,1951,62.46,9423000,5135.95,85000.0
1287,Hungary,Europe,1952,64.03,9504000,5263.67,81000.0
1288,Hungary,Europe,1953,63.87,9595000,5308.49,91000.0
1289,Hungary,Europe,1954,65.43,9706000,5431.86,111000.0
1290,Hungary,Europe,1955,66.88,9825000,5850.99,119000.0
1291,Hungary,Europe,1956,66.04,9911000,5537.72,86000.0
1292,Hungary,Europe,1957,66.41,9839000,6040.18,-72000.0
1293,Hungary,Europe,1958,67.42,9882000,6416.73,43000.0
1294,Hungary,Europe,1959,67.32,9937000,6639.87,55000.0


In [35]:
df_country_year[df_country_year['country']=='Czech Republic']

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,diff_pop
745,Czech Republic,Europe,1950,64.43,8925122,6690.75,
746,Czech Republic,Europe,1951,65.25,9023170,6734.89,98048.0
747,Czech Republic,Europe,1952,66.87,9125183,6876.14,102013.0
748,Czech Republic,Europe,1953,67.56,9220908,6774.04,95725.0
749,Czech Republic,Europe,1954,68.05,9290617,6979.22,69709.0
750,Czech Republic,Europe,1955,68.97,9365969,7495.88,75352.0
751,Czech Republic,Europe,1956,69.37,9442040,7855.58,76071.0
752,Czech Republic,Europe,1957,69.03,9513758,8256.34,71718.0
753,Czech Republic,Europe,1958,69.94,9574650,8811.03,60892.0
754,Czech Republic,Europe,1959,69.92,9618554,9135.7,43904.0


In [36]:
df_country_year[df_country_year['country']=='China']

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,diff_pop
604,China,Asia,1953,44.56,581390000,508.95,
605,China,Asia,1954,46.47,595310000,511.08,13920000.0
606,China,Asia,1955,48.02,608655000,526.74,13345000.0
607,China,Asia,1956,50.45,621465000,560.44,12810000.0
608,China,Asia,1957,50.55,637408000,575.99,15943000.0
609,China,Asia,1958,50.16,653235000,622.5,15827000.0
610,China,Asia,1959,38.4,666005000,616.19,12770000.0
611,China,Asia,1960,31.63,667070000,591.82,1065000.0
612,China,Asia,1961,34.1,660330000,492.01,-6740000.0
613,China,Asia,1962,44.5,665770000,487.67,5440000.0


## Population mean over the years
Some countries do not have records available for every year.

In [37]:
df.groupby("year")["pop"].mean()

year
1950   20785507.05
1951   20046182.67
1952   12945090.37
1953   44678917.21
1954   45534643.29
1955   46371144.00
1956   47183055.62
1957   18506198.26
1958   48244111.60
1959   49040590.24
1960   47471464.81
1961   47487319.85
1962   19231450.32
1963   48886529.42
1964   49771869.42
1965   48879121.85
1966   49858388.19
1967   20701011.82
1968   51753864.26
1969   52768959.93
1970   53827883.41
1971   54924942.19
1972   21388402.63
1973   54109282.06
1974   57831377.26
1975   58662985.04
1976   59409382.93
1977   23087709.05
1978   60771965.26
1979   61452908.63
1980   62122669.04
1981   62787792.48
1982   25199168.77
1983   26411949.78
1984   26583977.48
1985   26752586.67
1986   26922367.56
1987   27561857.46
1988   27248958.63
1989   27432087.63
1990   30020262.97
1991   31714771.94
1992   29579355.27
1993   32090293.52
1994   32249667.00
1995   32395185.76
1996   32527463.15
1997   31624026.49
1998   32780221.52
1999   32895068.09
2000   33002844.00
2001   33096589.76
2002   

In [38]:
df.groupby("year")["pop"].mean().diff().sort_values()

year
1982   -37588623.72
1977   -36321673.88
1972   -33536539.55
1967   -29157376.36
1957   -28676857.37
1962   -28255869.52
1952    -7101092.30
1992    -2135416.67
2006    -1932774.49
1960    -1569125.43
1997     -903436.66
1965     -892747.57
1951     -739324.38
1988     -312898.83
2005     -220903.71
2002        7305.25
1961       15855.04
2001       93745.76
2000      107775.91
1999      114846.58
1996      132277.39
1995      145518.76
1994      159373.48
1985      168609.19
1986      169780.89
1984      172027.70
2003      172488.47
1989      183129.00
1987      639489.90
1981      665123.44
1980      669760.41
1979      680943.37
1976      746397.89
1959      796478.64
1956      811911.62
1975      831607.78
1955      836500.71
1954      855726.08
1964      885340.00
1966      979266.33
2004     1009317.83
1969     1015095.67
1970     1058923.48
1971     1097058.78
1998     1156195.03
1983     1212781.01
1991     1694508.97
1993     2510938.25
1990     2588175.34
2007     368065

According to the numbers above, the countries that report data only every 5 years may have smaller populations than the others.

What do the differences look like over a 5-year window?

In [39]:
df.groupby("year")["pop"].mean().diff(5).sort_values()

year
1986   -35865424.93
1985   -35370082.37
1984   -34868931.15
1983   -34360015.48
2006     -964566.65
1961      304264.22
2003      496161.97
2001      569126.61
2000      607658.24
1963      642417.82
1999      645401.09
1972      687390.81
1998      689928.00
1962      725252.07
1964      731279.18
1996      812691.21
1988      837008.85
1989      848110.15
2005     1061953.60
1960     1100320.81
2004     1390633.22
1965     1407657.04
1967     1469561.50
2002     1479868.52
1977     1699306.42
1992     2017497.81
1997     2044671.22
1982     2111459.72
1973     2355417.80
1987     2362688.69
1966     2371068.34
1995     2374922.79
2007     2708781.73
1968     2867334.84
1969     2997090.50
1990     3267676.30
1981     3378409.56
1980     3459684.00
1959     3505946.95
1958     3565194.39
1979     3621531.37
1976     4484440.74
1991     4792404.38
1994     4817579.37
1975     4835101.63
1993     4841334.89
1970     4948761.56
1974     5062417.33
1971     5066554.00
1957     556110

There are fewer years with negative differences than with a one-year window. From 1983 to 1986, some large countries may be missing (China or India?).

In [40]:
window = 5
before_list = []
group = df.groupby("year")["country"]
for e in group:
    if (len(before_list)>=window):
        current_year = e[0]
        old_year = before_list[0][0]
        added = set(e[1]).difference(before_list[0][1])
        removed = set(before_list[0][1]).difference(e[1])
        if len(added) ==0:
            added = 'Nothing'
        if len(removed) == 0:
            removed ='Nothing'
        print("from {} to {} countries added: {}".format(old_year, current_year, added))
        print("from {} to {} countries removed: {}".format(old_year, current_year, removed))
        before_list.pop(0)
    before_list.append(e)

from 1950 to 1955 countries added: {'China'}
from 1950 to 1955 countries removed: {'Libya', 'Mexico', 'Belize', 'Ukraine', 'Puerto Rico', 'Sri Lanka', 'Uganda', 'Costa Rica', 'Luxembourg', 'United Kingdom', 'Greece', 'Cuba', 'Thailand', 'Moldova', 'Russia', 'Germany'}
from 1951 to 1956 countries added: {'China'}
from 1951 to 1956 countries removed: {'Ireland'}
from 1952 to 1957 countries added: {'China'}
from 1952 to 1957 countries removed: Nothing
from 1953 to 1958 countries added: {'Poland'}
from 1953 to 1958 countries removed: Nothing
from 1954 to 1959 countries added: {'Poland'}
from 1954 to 1959 countries removed: Nothing
from 1955 to 1960 countries added: {'Luxembourg', 'Poland'}
from 1955 to 1960 countries removed: Nothing
from 1956 to 1961 countries added: {'Luxembourg', 'Poland'}
from 1956 to 1961 countries removed: Nothing
from 1957 to 1962 countries added: {'Barbados', 'Guyana', 'Bahamas', 'Malta', 'Belize', 'Papua New Guinea', 'Fiji'}
from 1957 to 1962 countries removed: No

As we can see above, China's data is missing from 1983 to 1986. This explains the large differences in the population mean in those years.
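A more direct check for gaps like this is a country-by-year presence table. The following is a sketch on toy data (the years and records are invented for illustration); `presence` and `missing_years` are hypothetical names.

```python
import pandas as pd

# Toy records: China has no entry for 1983, India has one for every year listed.
toy = pd.DataFrame({
    "country": ["China", "China", "China", "India", "India", "India", "India"],
    "year": [1982, 1987, 1988, 1982, 1983, 1987, 1988],
})

# Rows are countries, columns are years, True where a record exists.
presence = pd.crosstab(toy["country"], toy["year"]).astype(bool)

def missing_years(country: str) -> list:
    """Years present in the data set for which `country` has no record."""
    row = presence.loc[country]
    return [int(year) for year in row.index[~row.to_numpy()]]

print(missing_years("China"))
```

On the real data, `missing_years("China")` built from `df` would list every year in which China has no record.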

## Population of all countries (sum)

In [41]:
df.groupby("year")["pop"].sum().sort_values()

year
1951     481108384
2006     578376416
1983     713122644
1984     717767392
1985     722319840
1986     726903924
1988     735721883
1989     740666366
1950     810634775
1990     960648415
2005    1021943928
1991    1046587474
1993    1058979686
1994    1064239011
1995    1069041130
1953    1072294013
1996    1073406284
1998    1081747310
1999    1085537247
2000    1089093852
2001    1092187462
1954    1092831439
2004    1097142442
2003    1098120655
1955    1112907456
1956    1132393335
1958    1206102790
1959    1226014756
1960    1234258085
1961    1234670316
1963    1271049765
1964    1294068605
1965    1319736290
1966    1346176481
1968    1397354335
1969    1424761918
1970    1453352852
1971    1482973439
1974    1561447186
1975    1583900596
1976    1604053339
1978    1640843062
1979    1659228533
1980    1677312064
1981    1695270397
1973    1731497026
1952    1851147923
1957    2664892549
1962    2903948999
1967    3229357844
1972    3593251642
1977    3947998247
1982   

##Some graphs

In [42]:
fig = px.line(df, x="year", y="pop", color='country', labels="country", 
              title="Population x years. Click on the country to view it.")
fig.for_each_trace(
    lambda trace: trace.update(visible="legendonly") if trace.name not in ("country=Australia", "country=Hungary",
                                                                           "country=Portugal", "country=Ireland") 
                                                    else ()
)
pop_mean = df.groupby("year")["pop"].mean()  # select the column first so text columns are never aggregated
pop_sum = df.groupby("year")["pop"].sum()
fig.add_trace(go.Scatter(mode="lines", x = pop_mean.index, y = pop_mean, name="Mean World pop"))
fig.add_trace(go.Scatter(visible="legendonly", mode="lines", x = pop_sum.index, y = pop_sum, name="Total World pop"))
fig.show()

#GDP per capita analysis

## Highest and lowest values

In [43]:
print(df[df['gdpPercap'] == min(df['gdpPercap'])])
print(df[df['gdpPercap'] == max(df['gdpPercap'])])

              country continent  year  lifeExp       pop  gdpPercap
673  Congo, Dem. Rep.    Africa  2002    44.97  55379852     241.17
     country continent  year  lifeExp     pop  gdpPercap
1652  Kuwait      Asia  1957    58.03  212846  113523.13


In [44]:
df.sort_values("gdpPercap")

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
673,"Congo, Dem. Rep.",Africa,2002,44.97,55379852,241.17
674,"Congo, Dem. Rep.",Africa,2007,46.46,64606759,277.55
1717,Lesotho,Africa,1952,42.14,748747,298.85
1227,Guinea-Bissau,Africa,1952,32.50,580653,299.85
672,"Congo, Dem. Rep.",Africa,1997,42.59,47798986,312.19
...,...,...,...,...,...,...
2466,Qatar,Asia,2007,75.59,907229,82010.98
1653,Kuwait,Asia,1962,60.47,358266,95458.11
1651,Kuwait,Asia,1952,55.56,160000,108382.35
1655,Kuwait,Asia,1972,67.71,841934,109347.87


In [45]:
df.sort_values(by='gdpPercap').head(20)['country'].value_counts()

Myanmar              6
Congo, Dem. Rep.     3
Lesotho              2
Eritrea              2
Burundi              2
Cambodia             1
Ethiopia             1
Guinea-Bissau        1
Malawi               1
Equatorial Guinea    1
Name: country, dtype: int64

In [46]:
df.sort_values(by='gdpPercap').tail(20)['country'].value_counts()

Luxembourg    7
Kuwait        6
Qatar         5
Brunei        2
Name: country, dtype: int64

##Which countries had decreases in GDP per capita over the years?

In [47]:
df_country_year = df.sort_values(by=['country', 'year'])
df_country_year['diff_gdp'] = df_country_year.groupby('country')['gdpPercap'].diff()
df_country_year[df_country_year['diff_gdp']<0]['country'].value_counts()

Iceland        17
New Zealand    15
Bulgaria       14
Switzerland    12
Hungary        11
               ..
Barbados        1
Sri Lanka       1
Croatia         1
Tunisia         1
Tajikistan      1
Name: country, Length: 165, dtype: int64

### How did gdpPercap change over the years for Iceland and New Zealand?



In [48]:
df_country_year[df_country_year['country']=='Iceland']['gdpPercap'].diff().sort_values()

1384   -1188.23
1360    -878.46
1375    -769.71
1394    -667.34
1380    -487.98
1359    -381.38
1343    -281.49
1349    -245.02
1344    -201.12
1353    -165.58
1381    -117.76
1385    -113.53
1387    -101.05
1367     -78.14
1399     -52.48
1383     -40.26
1351     -20.94
1348      30.35
1382      55.42
1352     121.14
1374     148.20
1361     217.67
1350     545.87
1346     589.43
1398     603.27
1377     605.87
1354     625.67
1347     657.95
1357     661.01
1364     669.24
1373     686.55
1366     705.56
1376     715.94
1393     738.52
1395     773.80
1391     786.24
1371     804.10
1355     871.48
1365     871.96
1362     873.73
1368     881.81
1358     906.39
1386     910.68
1356     912.24
1345     943.60
1372     981.08
1370     994.72
1389    1033.85
1392    1049.09
1388    1186.77
1390    1195.60
1378    1286.01
1397    1350.82
1369    1475.71
1363    1595.99
1379    1815.48
1396    2342.17
1342        nan
Name: gdpPercap, dtype: float64

In [49]:
df_country_year[df_country_year['country']=='Iceland'][['year', 'gdpPercap']]

Unnamed: 0,year,gdpPercap
1342,1950,7750.29
1343,1951,7468.81
1344,1952,7267.69
1345,1953,8211.29
1346,1954,8800.72
1347,1955,9458.67
1348,1956,9489.02
1349,1957,9244.0
1350,1958,9789.87
1351,1959,9768.93


Although GDP per capita decreased many times, over the years it increased from 7750 in 1950 to 36180 in 2007.

In [50]:
df_country_year[df_country_year['country']=='New Zealand']['gdpPercap']

2105   11449.38
2106   10362.99
2107   10556.58
2108   10637.22
2109   11839.00
2110   11814.68
2111   12186.70
2112   12247.40
2113   12437.62
2114   13039.68
2115   12816.33
2116   13195.10
2117   13175.68
2118   13719.26
2119   14107.05
2120   14730.25
2121   15384.75
2122   14463.92
2123   14279.12
2124   15586.76
2125   15150.93
2126   15674.74
2127   16046.04
2128   16823.02
2129   17439.20
2130   16910.21
2131   17126.42
2132   16233.72
2133   16294.54
2134   16633.44
2135   16718.10
2136   17445.44
2137   17632.41
2138   17919.54
2139   18612.88
2140   18647.68
2141   18979.55
2142   19007.19
2143   18874.65
2144   18983.92
2145   18833.60
2146   18380.78
2147   18363.32
2148   19337.19
2149   20116.23
2150   20661.48
2151   21064.07
2152   21050.41
2153   20876.11
2154   21689.01
2155   21895.16
2156   22426.59
2157   23189.80
2158   23728.48
2159   25185.01
Name: gdpPercap, dtype: float64

## GDP per capita mean over the years
Some countries do not have records available for every year.

In [51]:
df.groupby("year")["gdpPercap"].mean()

year
1950    5788.11
1951    7166.68
1952    3800.77
1953    7285.91
1954    7559.36
1955    7906.26
1956    8146.75
1957    4358.65
1958    8329.08
1959    8637.00
1960    9311.45
1961    9657.45
1962    4771.03
1963   10302.72
1964   10880.00
1965   11009.44
1966   11407.94
1967    5640.61
1968   12095.22
1969   12717.19
1970   13193.38
1971   13653.77
1972    7612.26
1973   13910.29
1974   15255.04
1975   15174.66
1976   15647.20
1977    8410.91
1978   16385.25
1979   16882.20
1980   17198.89
1981   17323.69
1982    8553.18
1983   18293.20
1984   18935.84
1985   19371.59
1986   19941.49
1987    8799.16
1988   21020.91
1989   21583.94
1990   19945.55
1991   19894.17
1992    8960.83
1993   19541.87
1994   19962.68
1995   20389.87
1996   20857.62
1997    9769.40
1998   22261.29
1999   22965.89
2000   23866.07
2001   24238.00
2002   10657.93
2003   25046.67
2004   25949.49
2005   26509.63
2006   27249.62
2007   12403.13
Name: gdpPercap, dtype: float64

In [52]:
df.groupby("year")["gdpPercap"].mean().diff().sort_values()

year
2007   -14846.50
2002   -13580.07
1987   -11142.34
1997   -11088.22
1992   -10933.35
1982    -8770.51
1977    -7236.29
1972    -6041.51
1967    -5767.32
1962    -4886.41
1957    -3788.10
1952    -3365.91
1990    -1638.39
1975      -80.37
1991      -51.38
1981      124.79
1965      129.43
1956      240.49
1954      273.44
1959      307.93
1980      316.69
1961      346.00
1955      346.90
2001      371.92
1966      398.50
1994      420.81
1995      427.19
1985      435.75
1971      460.39
1996      467.75
1976      472.54
1970      476.19
1979      496.95
2005      560.14
1989      563.03
1986      569.90
1964      577.28
1969      621.98
1984      642.64
1960      674.44
1999      704.60
2006      739.99
2000      900.18
2004      902.82
1974     1344.74
1951     1378.57
1953     3485.15
1958     3970.43
1963     5531.69
1973     6298.03
1968     6454.60
1978     7974.34
1983     9740.03
1993    10581.04
1988    12221.75
1998    12491.89
2003    14388.74
1950         nan
Name: gdp

According to the numbers above, the countries that report data only every 5 years may have a lower GDP per capita than the others.

What do the differences look like over a 5-year window?

In [53]:
df.groupby("year")["gdpPercap"].mean().diff(5).sort_values()

year
1994   -1621.26
1993   -1479.04
1991     -47.32
1982     142.27
1992     161.67
1987     245.98
1962     412.39
1995     444.32
1957     557.88
1990     573.96
1977     798.65
1997     808.57
1967     869.58
2002     888.53
1996     963.45
1956     980.06
1958    1043.16
1959    1077.65
1960    1405.19
1961    1510.70
1979    1627.17
1981    1676.49
1965    1697.99
2007    1745.20
1966    1750.49
1968    1792.50
1973    1815.07
1969    1837.19
1983    1907.95
1972    1971.64
1963    1973.65
1975    1981.28
1976    1993.43
1980    2024.23
1984    2053.64
1955    2118.14
1985    2172.69
1970    2183.95
1964    2243.00
1971    2245.83
1978    2474.96
1974    2537.84
1986    2617.80
2005    2643.56
1989    2648.10
1998    2719.42
1988    2727.71
2003    2785.38
2004    2983.60
1999    3003.21
2006    3011.63
2001    3380.38
2000    3476.21
1950        nan
1951        nan
1952        nan
1953        nan
1954        nan
Name: gdpPercap, dtype: float64

There are fewer years with negative differences than with a one-year window. In 1993 and 1994, some countries with high GDP per capita may be missing, or countries with low GDP per capita may have been added.

In [54]:
window = 5
before_list = []
group = df.groupby("year")["country"]
for e in group:
    if (len(before_list)>=window):
        current_year = e[0]
        old_year = before_list[0][0]
        added = set(e[1]).difference(before_list[0][1])
        removed = set(before_list[0][1]).difference(e[1])
        if len(added) ==0:
            added = 'Nothing'
        if len(removed) == 0:
            removed ='Nothing'
        print("from {} to {} countries added: {}".format(old_year, current_year, added))
        print("from {} to {} countries removed: {}".format(old_year, current_year, removed))
        before_list.pop(0)
    before_list.append(e)

from 1950 to 1955 countries added: {'China'}
from 1950 to 1955 countries removed: {'Libya', 'Mexico', 'Belize', 'Ukraine', 'Puerto Rico', 'Sri Lanka', 'Uganda', 'Costa Rica', 'Luxembourg', 'United Kingdom', 'Greece', 'Cuba', 'Thailand', 'Moldova', 'Russia', 'Germany'}
from 1951 to 1956 countries added: {'China'}
from 1951 to 1956 countries removed: {'Ireland'}
from 1952 to 1957 countries added: {'China'}
from 1952 to 1957 countries removed: Nothing
from 1953 to 1958 countries added: {'Poland'}
from 1953 to 1958 countries removed: Nothing
from 1954 to 1959 countries added: {'Poland'}
from 1954 to 1959 countries removed: Nothing
from 1955 to 1960 countries added: {'Luxembourg', 'Poland'}
from 1955 to 1960 countries removed: Nothing
from 1956 to 1961 countries added: {'Luxembourg', 'Poland'}
from 1956 to 1961 countries removed: Nothing
from 1957 to 1962 countries added: {'Barbados', 'Guyana', 'Bahamas', 'Malta', 'Belize', 'Papua New Guinea', 'Fiji'}
from 1957 to 1962 countries removed: No

As we can see above, the countries of the former USSR were added.

##Some graphs

In [55]:
fig = px.line(df, x="year", y="gdpPercap", color='country', labels="country", 
              title="GDP per capita x years. Click on the country to view it.")
fig.for_each_trace(
    lambda trace: trace.update(visible="legendonly") if trace.name not in ("country=Australia", "country=Brazil",
                                                                           "country=Portugal", "country=Ireland") 
                                                    else ()
)
gdp_mean = df.groupby("year")["gdpPercap"].mean()  # select the column first so text columns are never aggregated
gdp_sum = df.groupby("year")["gdpPercap"].sum()
fig.add_trace(go.Scatter(mode="lines", x = gdp_mean.index, y = gdp_mean, name="Mean World gdpPercap"))
fig.add_trace(go.Scatter(visible="legendonly", mode="lines", x = gdp_sum.index, y = gdp_sum, name="Total World gdpPercap"))
fig.show()

#Correlation between population, life expectancy and GDP per capita

In [56]:
df.corr()

Unnamed: 0,year,lifeExp,pop,gdpPercap
year,1.0,0.38,0.02,0.31
lifeExp,0.38,1.0,-0.0,0.63
pop,0.02,-0.0,1.0,-0.04
gdpPercap,0.31,0.63,-0.04,1.0


Life expectancy has a fairly strong positive correlation with GDP per capita (0.63).
Life expectancy and GDP per capita also have a moderate correlation with year (0.38 and 0.31), which means both tended to grow as the years passed.
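Depending on the pandas version, calling `corr()` on a frame that still contains text columns may raise an error instead of silently skipping them. A version-tolerant sketch, on invented toy values, selects the numeric columns first. The Spearman variant is an addition for comparison and is not part of the analysis above.

```python
import pandas as pd

# Toy frame with one text column and three numeric columns (illustrative values).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 1957, 1952, 1957],
    "lifeExp": [40.0, 45.0, 70.0, 72.0],
    "gdpPercap": [500.0, 700.0, 9000.0, 11000.0],
})

# Keep only numeric columns so text columns never reach corr().
numeric = toy.select_dtypes(include="number")
pearson = numeric.corr()                    # linear correlation
spearman = numeric.corr(method="spearman")  # rank-based, less sensitive to outliers
print(pearson.round(2))
print(spearman.round(2))
```

The same `select_dtypes` step could be placed in front of `df.corr()` and, per group, in front of `df.groupby("country").corr()`.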

In [57]:
df.groupby("country").corr()

Unnamed: 0_level_0,Unnamed: 1_level_0,year,lifeExp,pop,gdpPercap
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,year,1.00,0.97,0.90,-0.07
Afghanistan,lifeExp,0.97,1.00,0.81,-0.05
Afghanistan,pop,0.90,0.81,1.00,0.01
Afghanistan,gdpPercap,-0.07,-0.05,0.01,1.00
Albania,year,1.00,0.95,0.99,0.82
...,...,...,...,...,...
Zambia,gdpPercap,-0.42,0.67,-0.52,1.00
Zimbabwe,year,1.00,-0.24,0.99,0.41
Zimbabwe,lifeExp,-0.24,1.00,-0.25,0.42
Zimbabwe,pop,0.99,-0.25,1.00,0.40
