# Population Example

First, load the dataset.

In [136]:
import pandas as pd
import altair as alt

df = pd.read_csv('data/population.csv')
df.head()

Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Aruba,54608.0,55811.0,56682.0,57475.0,58178.0,58782.0,59291.0,59522.0,59471.0,...,102112.0,102880.0,103594.0,104257.0,104874.0,105439.0,105962.0,106442.0,106585.0,106537.0
1,Africa Eastern and Southern,130692579.0,134169237.0,137835590.0,141630546.0,145605995.0,149742351.0,153955516.0,158313235.0,162875171.0,...,552530654.0,567891875.0,583650827.0,600008150.0,616377331.0,632746296.0,649756874.0,667242712.0,685112705.0,702976832.0
2,Afghanistan,8622466.0,8790140.0,8969047.0,9157465.0,9355514.0,9565147.0,9783147.0,10010030.0,10247780.0,...,30466479.0,31541209.0,32716210.0,33753499.0,34636207.0,35643418.0,36686784.0,37769499.0,38972230.0,40099462.0
3,Africa Western and Central,97256290.0,99314028.0,101445032.0,103667517.0,105959979.0,108336203.0,110798486.0,113319950.0,115921723.0,...,376797999.0,387204553.0,397855507.0,408690375.0,419778384.0,431138704.0,442646825.0,454306063.0,466189102.0,478185907.0
4,Angola,5357195.0,5441333.0,5521400.0,5599827.0,5673199.0,5736582.0,5787044.0,5827503.0,5868203.0,...,25188292.0,26147002.0,27128337.0,28127721.0,29154746.0,30208628.0,31273533.0,32353588.0,33428486.0,34503774.0


Collapse the year columns into a single column.

Use the `melt()` method to reshape `df` from wide to long format, keeping `Country Name` as the identifier column. Each year column becomes a set of rows, with the year stored in `variable` and the population in `value`.

In [137]:
df = df.melt(id_vars='Country Name')
df.head()

Unnamed: 0,Country Name,variable,value
0,Aruba,1960,54608.0
1,Africa Eastern and Southern,1960,130692579.0
2,Afghanistan,1960,8622466.0
3,Africa Western and Central,1960,97256290.0
4,Angola,1960,5357195.0
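
As a minimal sketch of what `melt()` does, the toy example below reshapes a two-country, two-year table. The data and the names `wide` and `long_df` are invented for illustration; `var_name` and `value_name` are optional `melt()` parameters that replace the default `variable` and `value` column names.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year (values are made up).
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1960": [10.0, 20.0],
    "1961": [11.0, 22.0],
})

# The id_vars column is kept; every other column becomes (Year, Population) rows.
long_df = wide.melt(id_vars="Country Name", var_name="Year", value_name="Population")
print(long_df)
```

The result has one row per country and year, so a table with 2 countries and 2 year columns yields 4 rows.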


Convert the `variable` column, which holds the years, to integers.

In [138]:
df['variable'] = df['variable'].astype('int')
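
The conversion is needed because the year labels come from the CSV header, so after `melt()` they are strings rather than numbers. The sketch below, on made-up data, shows the conversion and a check of the result; the names `long_df` and `span` are illustrative.

```python
import pandas as pd

# Made-up long table whose years are still strings.
long_df = pd.DataFrame({"variable": ["1960", "1961", "2021"],
                        "value": [1.0, 2.0, 3.0]})

# Parse the strings as integers so the years can be treated as a quantity.
long_df["variable"] = long_df["variable"].astype("int")

# Arithmetic on years now works.
span = long_df["variable"].max() - long_df["variable"].min()
print(span)
```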

In [139]:
len(df)

16492

Disable Altair's maximum row limit. By default, Altair raises an error when a chart's dataset has more than 5,000 rows, and this DataFrame has 16,492.

In [140]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

Draw the chart

In [141]:
chart = alt.Chart(df).mark_line().encode(
    x = 'variable:Q',
    y = 'value:Q',
    color = 'Country Name:N'
)
chart

The previous chart is confusing and has the following problems:

* too many countries

* too many colors

* no focus

* axes with wrong titles

To solve these problems, work with continents instead of individual countries. The dataset already contains aggregate rows for continents and regions. List the values of `Country Name` using `unique()`.

In [142]:
df['Country Name'].unique()

array(['Aruba', 'Africa Eastern and Southern', 'Afghanistan',
       'Africa Western and Central', 'Angola', 'Albania', 'Andorra',
       'Arab World', 'United Arab Emirates', 'Argentina', 'Armenia',
       'American Samoa', 'Antigua and Barbuda', 'Australia', 'Austria',
       'Azerbaijan', 'Burundi', 'Belgium', 'Benin', 'Burkina Faso',
       'Bangladesh', 'Bulgaria', 'Bahrain', 'Bahamas, The',
       'Bosnia and Herzegovina', 'Belarus', 'Belize', 'Bermuda',
       'Bolivia', 'Brazil', 'Barbados', 'Brunei Darussalam', 'Bhutan',
       'Botswana', 'Central African Republic', 'Canada',
       'Central Europe and the Baltics', 'Switzerland', 'Channel Islands',
       'Chile', 'China', "Cote d'Ivoire", 'Cameroon', 'Congo, Dem. Rep.',
       'Congo, Rep.', 'Colombia', 'Comoros', 'Cabo Verde', 'Costa Rica',
       'Caribbean small states', 'Cuba', 'Curacao', 'Cayman Islands',
       'Cyprus', 'Czechia', 'Germany', 'Djibouti', 'Dominica', 'Denmark',
       'Dominican Republic', 'Algeria',
 

Build a list of continents.

In [143]:
continents = ['Africa Eastern and Southern',
              'Africa Western and Central',
              'Middle East & North Africa',
              'Sub-Saharan Africa',
              'Europe & Central Asia',
              'Latin America & Caribbean',
              'North America',
              'Pacific island small states',
              'East Asia & Pacific']

Filter the dataset with `isin()` to keep only the rows for the continents.

In [144]:
df = df[df['Country Name'].isin(continents)]
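
As a small sketch of this filter on invented data: `isin()` returns a boolean mask, and adding `.copy()` (an optional step, not used in the cell above) makes the result an independent DataFrame, so that columns added later are written to the copy. The names `toy`, `regions`, and `subset` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["Aruba", "North America", "Angola", "Sub-Saharan Africa"],
    "variable": [1960, 1960, 1960, 1960],
    "value": [1.0, 2.0, 3.0, 4.0],
})
regions = ["North America", "Sub-Saharan Africa"]

# isin() builds a boolean mask; .copy() returns an independent DataFrame.
subset = toy[toy["Country Name"].isin(regions)].copy()

# Adding a column to the copy leaves the original untouched.
subset["diff"] = 0.0
print(subset)
```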

Draw the chart again.

In [145]:
chart = alt.Chart(df).mark_line().encode(
    x = 'variable:Q',
    y = 'value:Q',
    color = 'Country Name:N'
)
chart

The chart is now readable. However, the following problems remain:

* too many colors

* no focus

* axes with wrong titles

In addition, it is difficult to compare the continents because they start from different values. Take the 1960 value of each continent as the baseline, so that every line starts at zero, and calculate the difference between each year and that initial value.

In [146]:
baseline = df[df['variable'] == 1960]
baseline

Unnamed: 0,Country Name,variable,value
1,Africa Eastern and Southern,1960,130692600.0
3,Africa Western and Central,1960,97256290.0
63,East Asia & Pacific,1960,1043334000.0
65,Europe & Central Asia,1960,666273700.0
134,Latin America & Caribbean,1960,219142600.0
153,Middle East & North Africa,1960,104958300.0
170,North America,1960,198624800.0
197,Pacific island small states,1960,905537.0
217,Sub-Saharan Africa,1960,227948900.0


Calculate the difference between the current year and the baseline and store it into a new column called `diff`.

In [147]:
for continent in continents:
    baseline_value = baseline[baseline['Country Name'] == continent]['value'].values[0]
    m = df['Country Name'] == continent
    df.loc[m, 'diff'] = df.loc[m,'value'] - baseline_value

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = empty_value
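
The warning appears because `df` is itself a filtered slice of the original DataFrame; the `diff` column is still computed. As a hedged alternative, the sketch below computes the same difference without a loop by mapping each row's name to its 1960 value. The toy data and the names `toy` and `base` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["A", "B", "A", "B"],
    "variable": [1960, 1960, 1961, 1961],
    "value": [100.0, 50.0, 110.0, 70.0],
})

# Series of 1960 values indexed by name: the baseline for each group.
base = toy.loc[toy["variable"] == 1960].set_index("Country Name")["value"]

# map() looks up each row's baseline; the subtraction is vectorized.
toy["diff"] = toy["value"] - toy["Country Name"].map(base)
print(toy)
```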


Draw the new chart.

In [148]:
chart = alt.Chart(df).mark_line().encode(
    x = 'variable:Q',
    y = 'diff:Q',
    color = 'Country Name:N'
)
chart

In [149]:
colors= []
for continent in continents:
    if continent == 'North America':
        colors.append('#80C11E')
    else:
        colors.append('lightgrey')
    
colors

['lightgrey',
 'lightgrey',
 'lightgrey',
 'lightgrey',
 'lightgrey',
 'lightgrey',
 '#80C11E',
 'lightgrey',
 'lightgrey']
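
The colors in `range` are assigned to the scale's domain in order, and for a nominal field the default domain is sorted alphabetically rather than following the order of `continents`. The list above works because 'North America' happens to sit at the same position in both orders. A minimal sketch that makes the pairing explicit is shown below; the names `domain` and `HIGHLIGHT` are introduced here for illustration, and the resulting lists would be passed as `alt.Scale(domain=domain, range=colors)`.

```python
continents = ['Africa Eastern and Southern',
              'Africa Western and Central',
              'Middle East & North Africa',
              'Sub-Saharan Africa',
              'Europe & Central Asia',
              'Latin America & Caribbean',
              'North America',
              'Pacific island small states',
              'East Asia & Pacific']

HIGHLIGHT = 'North America'  # the single series to emphasize

# Fix the domain order explicitly, then build colors in that same order.
domain = sorted(continents)
colors = ['#80C11E' if name == HIGHLIGHT else 'lightgrey' for name in domain]

for name, color in zip(domain, colors):
    print(f'{name}: {color}')
```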

In [150]:
chart = alt.Chart(df).mark_line().encode(
    x = 'variable:Q',
    y = 'diff:Q',
    color = alt.Color('Country Name:N', scale=alt.Scale(range=colors),legend=None)
)
chart

In [151]:
chart = alt.Chart(df).mark_line().encode(
    x = 'variable:Q',
    y = 'diff:Q',
    color = alt.Color('Country Name:N', scale=alt.Scale(range=colors),legend=None)
).properties(
    title='Population in North America over the last 60 years'
)
chart

In [152]:
chart = alt.Chart(df).mark_line().encode(
    x = alt.X('variable:Q',title=None, axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('diff:Q', title='Difference of population from 1960'),
    color = alt.Color('Country Name:N', scale=alt.Scale(range=colors),legend=None)
).properties(
    title='Population in North America over the last 60 years',
    width=400,
    height=300
)

chart

In [153]:
chart = alt.Chart(df).mark_line().encode(
    x = alt.X('variable:Q',title=None, axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('diff:Q', title='Difference of population from 1960', axis=alt.Axis(format='.2s')),
    color = alt.Color('Country Name:N', scale=alt.Scale(range=colors),legend=None)
).properties(
    title='Population in North America over the last 60 years',
    width=400,
    height=300
).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

chart

In [155]:
chart = alt.Chart(df).mark_line().encode(
    x = alt.X('variable:Q',title=None, axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('diff:Q', title='Difference of population from 1960',axis=alt.Axis(format='.2s')),
    color = alt.Color('Country Name:N', scale=alt.Scale(range=colors),legend=None),
    opacity = alt.condition(alt.datum['Country Name'] == 'North America', alt.value(1), alt.value(0.3))
).properties(
    title='Population in North America over the last 60 years',
    width=400,
    height=250
).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

chart

In [156]:
mask = df['Country Name'].isin(['North America'])
df_mean = df[~mask].groupby(by='variable').mean(numeric_only=True).reset_index()

In [157]:
df[~mask].groupby(by='variable').mean(numeric_only=True).reset_index()

Unnamed: 0,variable,value,diff
0,1960,3.113139e+08,0.000000e+00
1,1961,3.150044e+08,3.690484e+06
2,1962,3.203885e+08,9.074520e+06
3,1963,3.272891e+08,1.597513e+07
4,1964,3.342369e+08,2.292296e+07
5,1965,3.413318e+08,3.001786e+07
6,1966,3.487836e+08,3.746967e+07
7,1967,3.561910e+08,4.487711e+07
8,1968,3.638216e+08,5.250768e+07
9,1969,3.717585e+08,6.044453e+07
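
The cell above averages the remaining regions for each year; this average is what later cells label 'World'. A minimal sketch of the same aggregation on made-up data follows. `numeric_only=True` restricts the mean to numeric columns, which pandas 2.0 and later require when a text column such as `Country Name` is present. The names `toy` and `others_mean` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["A", "B", "North America", "A", "B", "North America"],
    "variable": [1960, 1960, 1960, 1961, 1961, 1961],
    "value": [10.0, 30.0, 99.0, 20.0, 40.0, 99.0],
})

# Exclude the highlighted region, then average the rest per year.
mask = toy["Country Name"].isin(["North America"])
others_mean = (
    toy[~mask]
    .groupby("variable")
    .mean(numeric_only=True)  # skip the text column
    .reset_index()
)
print(others_mean)
```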


In [158]:
df_grouped = pd.DataFrame({ 
    'Year' : df[mask]['variable'].values,
    'North America' : df[mask]['value'].values, 
    'World': df_mean['value'].values
})

df_grouped

Unnamed: 0,Year,North America,World
0,1960,198624756.0,3.113139e+08
1,1961,202007500.0,3.150044e+08
2,1962,205198600.0,3.203885e+08
3,1963,208253700.0,3.272891e+08
4,1964,211262900.0,3.342369e+08
5,1965,214031100.0,3.413318e+08
6,1966,216659000.0,3.487836e+08
7,1967,219176000.0,3.561910e+08
8,1968,221503000.0,3.638216e+08
9,1969,223759000.0,3.717585e+08


In [159]:
df_melt = df_grouped.melt(id_vars='Year', var_name='Continent', value_name='Population')
df_melt

Unnamed: 0,Year,Continent,Population
0,1960,North America,1.986248e+08
1,1961,North America,2.020075e+08
2,1962,North America,2.051986e+08
3,1963,North America,2.082537e+08
4,1964,North America,2.112629e+08
5,1965,North America,2.140311e+08
6,1966,North America,2.166590e+08
7,1967,North America,2.191760e+08
8,1968,North America,2.215030e+08
9,1969,North America,2.237590e+08


In [161]:
colors=['#80C11E', 'grey']
chart = alt.Chart(df_melt).mark_line().encode(
    x = alt.X('Year:Q',title=None, axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('Population:Q', title='Population',axis=alt.Axis(format='.2s')),
    color = alt.Color('Continent:N', scale=alt.Scale(range=colors),legend=None),
    opacity = alt.condition(alt.datum['Continent'] == 'North America', alt.value(1), alt.value(0.3))
).properties(
    title='Population in North America over the last 60 years',
    width=400,
    height=250
).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

chart

In [162]:
baseline = df_melt[df_melt['Year'] == 1960]

In [163]:
continents = ['North America', 'World']
for continent in continents:
    baseline_value = baseline[baseline['Continent'] == continent]['Population'].values[0]
    m = df_melt['Continent'] == continent
    df_melt.loc[m, 'Diff'] = df_melt.loc[m,'Population'] - baseline_value

In [164]:
df_melt

Unnamed: 0,Year,Continent,Population,Diff
0,1960,North America,1.986248e+08,0.000000e+00
1,1961,North America,2.020075e+08,3.382744e+06
2,1962,North America,2.051986e+08,6.573844e+06
3,1963,North America,2.082537e+08,9.628944e+06
4,1964,North America,2.112629e+08,1.263814e+07
5,1965,North America,2.140311e+08,1.540634e+07
6,1966,North America,2.166590e+08,1.803424e+07
7,1967,North America,2.191760e+08,2.055124e+07
8,1968,North America,2.215030e+08,2.287824e+07
9,1969,North America,2.237590e+08,2.513424e+07


In [224]:
colors=['#80C11E', 'grey']
chart = alt.Chart(df_melt).mark_line().encode(
    x = alt.X('Year:Q',title=None, axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('Diff:Q', title='Difference from 1960',axis=alt.Axis(format='.2s')),
    color = alt.Color('Continent:N', scale=alt.Scale(range=colors),legend=None),
    opacity = alt.condition(alt.datum['Continent'] == 'North America', alt.value(1), alt.value(0.3))
).properties(
    title='Population in North America over the last 60 years',
    width=400,
    height=250
).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

chart

In [166]:
mask = df_melt['Year'] == 2021
na = df_melt[mask]['Diff'].values[0]   # North America (first row of the 2021 slice)
oth = df_melt[mask]['Diff'].values[1]  # mean of the other regions, labelled 'World'

In [167]:
df_text = pd.DataFrame({'text' : ['World','North America'],
       'x' : [2023,2023],
       'y' : [oth,na]})

df_text

Unnamed: 0,text,x,y
0,World,2023,538693400.0
1,North America,2023,171579000.0
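
The values `na` and `oth` are read by position (`values[0]` and `values[1]`), which depends on the row order of `df_melt`. A hedged alternative is to look them up by label, as sketched below on toy data; the names `toy` and `last` are introduced for illustration.

```python
import pandas as pd

# Toy long table; the 2021 rows are deliberately not in a fixed order.
toy = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021],
    "Continent": ["North America", "World", "World", "North America"],
    "Diff": [1.0, 2.0, 5.0, 3.0],
})

# Index the final-year rows by label so the lookup ignores row order.
last = toy[toy["Year"] == 2021].set_index("Continent")["Diff"]
na = last["North America"]
oth = last["World"]
print(na, oth)
```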


In [168]:
text = alt.Chart(df_text).mark_text(fontSize=14, align='left').encode(
    x = 'x',
    y = 'y',
    text = 'text',
    color = alt.condition(alt.datum.text == 'North America', alt.value('#80C11E'), alt.value('grey'))
)

text

In [169]:
chart = alt.Chart(df_melt).mark_line().encode(
    x = alt.X('Year:Q',title=None, axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('Diff:Q', title='Difference from 1960',axis=alt.Axis(format='.2s')),
    color = alt.Color('Continent:N', scale=alt.Scale(range=colors),legend=None),
    opacity = alt.condition(alt.datum['Continent'] == 'North America', alt.value(1), alt.value(0.3))
).properties(
    title='Population in North America over the last 60 years',
    width=400,
    height=250
)

In [170]:
total = (chart + text)

In [172]:
total.configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

In [173]:
oth - na

367114388.25

In [174]:
offset = 10000000
line = alt.Chart(pd.DataFrame({'y' : [oth - offset,na + offset], 'x' : [2021,2021]})).mark_line(color='black').encode(
    y = 'y',
    x = 'x'
)

line

In [175]:
chart + text + line

In [176]:
diff = oth - na
df_annotation = pd.DataFrame({'text' : [f'{diff / 1e6:.0f}M'],  # gap in millions
       'x' : [2022],
       'y' : [na + diff / 2]})  # midpoint between the two lines

df_annotation

Unnamed: 0,text,x,y
0,367M,2022,355136200.0


In [177]:
ann = alt.Chart(df_annotation).mark_text(fontSize=30, align='left').encode(
    x = 'x',
    y = 'y',
    text = 'text'
)

ann

In [178]:
total = chart + text + line + ann

In [179]:
total.configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

In [180]:
df_context = pd.DataFrame({'text' : ['Why this gap?',
                            '1. Lower Fertility Rate', 
                            '2. Lower Immigration Rate', 
                            '3. Higher Average Age'],
          'y': [0,1,2,3]})

df_context

Unnamed: 0,text,y
0,Why this gap?,0
1,1. Lower Fertility Rate,1
2,2. Lower Immigration Rate,2
3,3. Higher Average Age,3


In [181]:
context = alt.Chart(df_context).mark_text(fontSize=14, align='left', dy=50).encode(
    y = alt.Y('y:O', axis=None),
    text = 'text',
    stroke = alt.condition(alt.datum.y == 0, alt.value('#80C11E'), alt.value('black')),
    strokeWidth = alt.condition(alt.datum.y == 0, alt.value(1), alt.value(0))
)

context

In [183]:
(context | total).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

In [222]:
df_cta = pd.DataFrame({
    'Strategy': ['Immigration Development', 'Enhance Family-Friendly Policies', 'Revitalize Rural Areas'],
    'Population Increase': [20, 30, 15]  # Sample population increase percentages
})

# Create a horizontal bar chart, sorted by population increase
cta = alt.Chart(df_cta).mark_bar(color='#80C11E').encode(
    x='Population Increase:Q',
    y=alt.Y('Strategy:N', sort='-x', title=None),
    tooltip=['Strategy', 'Population Increase']
).properties(
    title='Strategies for population growth in North America',
)

# Display the chart
cta


In [223]:
alt.vconcat((context | total), cta,center=True).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=20,
    color='#80C11E',
    offset=20
).configure_view(
    strokeWidth=0
).configure_concat(
    spacing=50
)