# INFO 3402 – Week 08: Persuasion

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)  

## Setup

In [15]:
import numpy as np
np.set_printoptions(suppress=True)

import pandas as pd
idx = pd.IndexSlice
pd.options.display.max_columns = 100

%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sb

## Background

We'll review concepts from previous weeks, especially making catplots, with a new dataset: the U.S. Census Bureau's [International Database](https://www.census.gov/programs-surveys/international-programs/about/idb.html) of population estimates and forecasts for countries from 1960 through 2100.

## Load and explore data

### Single year estimates
([docs](https://api.census.gov/data/timeseries/idb/1year/variables.html)); SEX: 0 = Both; 1 = Male; 2 = Female

There is more detailed annual data for fewer variables. Notice that this is an excellent example of "tidy" data. The data is pipe-separated ("|") rather than comma-separated, so we need to change the default `sep` parameter. The file also contains non-ASCII characters that aren't encoded as UTF-8; the "latin1" encoding appears to work.

In [16]:
# Load data
singleyear_df = pd.read_csv(
    'idbsingleyear.all',
    sep = '|',
    encoding = 'latin1'
)

# Report shape
print(singleyear_df.shape)

# Don't need these right now
singleyear_df.drop(columns = ['GENC','FIPS','AREA_KM2'],inplace=True)

# Replace obnoxious column name
singleyear_df.rename(columns={"#YR":"YEAR"},inplace=True)

# Replace obscure gender codes
singleyear_df.replace(
    {'SEX':
     {
        0:'Both',
        1:'Male',
        2:'Female'
     }
    },
    inplace = True
)

# Inspect
singleyear_df.head()

(7929813, 8)


Unnamed: 0,YEAR,SEX,POP,NAME,AGE
0,1990,Both,504,Andorra,0
1,1990,Both,550,Andorra,1
2,1990,Both,489,Andorra,2
3,1990,Both,515,Andorra,3
4,1990,Both,535,Andorra,4
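
To see why the `sep` and `encoding` arguments matter, here is a minimal sketch on a tiny in-memory file. The rows are invented for illustration and only mimic the pipe-separated layout; `raw_bytes` and `toy_df` are hypothetical names, not part of the real data files.

```python
import io
import pandas as pd

# Made-up rows that mimic the pipe-separated layout; not real IDB figures
raw_text = "#YR|SEX|POP|NAME|AGE\n2020|0|100|Curaçao|0\n2020|1|51|Curaçao|0\n"
raw_bytes = raw_text.encode("latin1")  # 'ç' is stored as the single byte 0xE7

# Decoding these bytes as UTF-8 fails because 0xE7 is not followed
# by valid continuation bytes
try:
    raw_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reading with the matching separator and encoding works
toy_df = pd.read_csv(io.BytesIO(raw_bytes), sep="|", encoding="latin1")
toy_df = toy_df.rename(columns={"#YR": "YEAR"})
print(utf8_failed)
print(toy_df)
```

If the country names come out garbled or the load raises a decoding error, the encoding is the first thing to check.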


How many unique countries are represented?

In [17]:
len(singleyear_df['NAME'].unique())

227

In [19]:
sorted(singleyear_df['NAME'].unique())

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burma',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czechia',
 "Côte d'Ivoire",
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Polynesia',
 'Gabo

What are the minimum and maximum values of "AGE"?

In [20]:
singleyear_df['AGE'].min(), singleyear_df['AGE'].max()

(0, 100)

### Exercise 01: United States estimates

Filter to the United States and count the number of observations per "YEAR".
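
One possible approach, sketched on a toy stand-in for `singleyear_df`: the populations below are invented, and `toy_df`, `us_df`, and `obs_per_year` are hypothetical names. Swap in the real DataFrame to get the actual counts.

```python
import pandas as pd

# Toy stand-in for singleyear_df; the populations are invented for illustration
toy_df = pd.DataFrame({
    "YEAR": [2020, 2020, 2020, 2021, 2021, 2020],
    "SEX":  ["Both", "Male", "Female", "Both", "Male", "Both"],
    "POP":  [100, 51, 49, 102, 52, 10],
    "NAME": ["United States"] * 5 + ["Andorra"],
    "AGE":  [0, 0, 0, 0, 0, 0],
})

# A boolean mask keeps only the United States rows
us_df = toy_df[toy_df["NAME"] == "United States"]

# value_counts tallies rows per year; sort_index puts the years in order
obs_per_year = us_df["YEAR"].value_counts().sort_index()
print(obs_per_year)
```

In the full dataset, a year with every age from 0 to 100 and all three sex categories would contribute 101 × 3 = 303 rows, so a smaller count suggests missing observations.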

### Exercise 02: Pivot and slice the data
Pivot the data so "YEAR" and "SEX" form the index, "AGE" forms the columns, and "POP" supplies the values.

Get the estimates and forecasts of the number of 18-year-old men and women. (Hint: Use the `idx` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.IndexSlice.html)) index slicer to access MultiIndex levels.)

Unstack and plot the population estimates.
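
The pivot, slice, and unstack steps might look like the sketch below. It runs on invented numbers (`toy_df` is a hypothetical stand-in for the filtered single-year data) and stops short of plotting; calling `.plot()` on the final DataFrame would draw one line per sex.

```python
import pandas as pd

idx = pd.IndexSlice

# Toy data: two years, two sexes, three ages; populations are invented
rows = []
for year in (2020, 2021):
    for sex, base in (("Male", 50), ("Female", 48)):
        for age in (17, 18, 19):
            rows.append({"YEAR": year, "SEX": sex, "AGE": age,
                         "POP": base + age + (year - 2020)})
toy_df = pd.DataFrame(rows)

# YEAR and SEX become a MultiIndex, AGE the columns, POP the values
pivoted = toy_df.pivot(index=["YEAR", "SEX"], columns="AGE", values="POP").sort_index()

# All years (the bare colon), only Male and Female rows, only the age-18 column
age18 = pivoted.loc[idx[:, ["Male", "Female"]], 18]

# Move SEX from the index to the columns: one row per year, one column per sex
age18_wide = age18.unstack("SEX")
print(age18_wide)
```

Sorting the index after pivoting is a defensive choice here, since slicing a MultiIndex is most reliable when the index is sorted.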

### Exercise 03: Birth rate for a country in conflict

Filter the `singleyear_df` data to a single country, age, and sex.

Plot the filtered data. Add an `axvline` for a year when the conflict started.
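
A minimal sketch of the filter-then-plot pattern, assuming the population aged 0 is an acceptable rough proxy for births. The country "Exampleland", its populations, and the `conflict_start` year are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for a hypothetical country
toy_df = pd.DataFrame({
    "YEAR": list(range(2000, 2006)) * 2,
    "SEX":  ["Both"] * 6 + ["Male"] * 6,
    "AGE":  [0] * 12,
    "NAME": ["Exampleland"] * 12,
    "POP":  [90, 92, 94, 80, 78, 77, 46, 47, 48, 41, 40, 39],
})

# Combine three conditions with & to keep one country, one age, one sex
mask = (
    (toy_df["NAME"] == "Exampleland")
    & (toy_df["AGE"] == 0)
    & (toy_df["SEX"] == "Both")
)
newborns = toy_df.loc[mask].set_index("YEAR")["POP"]

conflict_start = 2003  # hypothetical year, chosen only for the illustration

fig, ax = plt.subplots()
newborns.plot(ax=ax)
ax.axvline(conflict_start, color="red", linestyle="--")  # vertical marker
ax.set_xlabel("Year")
ax.set_ylabel("Population aged 0")
```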

### Exercise 04: Visualize the sex ratio in 2020

Filter the data to exclude the "Both" category and keep only a single country and a single year (2020).

Pivot the data so the "AGE" is an index, columns are "SEX", and the values are "POP".

Divide "Female" by "Male" and preserve as "FM ratio".

Make a line plot with "FM ratio" as the y-axis and "AGE" as the x-axis. Add an `axhline` at 1.
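
These four steps could be sketched as follows. The data are invented, "Exampleland" is a hypothetical country, and `ratio_df` is a name introduced only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented populations for one hypothetical country in one year
toy_df = pd.DataFrame({
    "YEAR": [2020] * 9,
    "NAME": ["Exampleland"] * 9,
    "AGE":  [0, 0, 0, 50, 50, 50, 90, 90, 90],
    "SEX":  ["Both", "Male", "Female"] * 3,
    "POP":  [205, 105, 100, 200, 100, 100, 30, 10, 20],
})

# Drop "Both" and keep a single country and year
mask = (
    (toy_df["SEX"] != "Both")
    & (toy_df["NAME"] == "Exampleland")
    & (toy_df["YEAR"] == 2020)
)

# AGE becomes the index, SEX the columns, POP the values
ratio_df = toy_df.loc[mask].pivot(index="AGE", columns="SEX", values="POP")

# Column-wise division aligns on the AGE index
ratio_df["FM ratio"] = ratio_df["Female"] / ratio_df["Male"]

fig, ax = plt.subplots()
ratio_df["FM ratio"].plot(ax=ax)
ax.axhline(1, color="gray", linestyle="--")  # parity between the sexes
ax.set_ylabel("FM ratio")
```

Values above the horizontal line mean more women than men at that age, and values below it mean the reverse.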

### Exercise 05: Make a population pyramid (Intermediate)
A [population pyramid](https://en.wikipedia.org/wiki/Population_pyramid) is a common data visualization within sociology and demography to represent the distribution of ages by gender.

Cast "AGE" to a `str` type to make it categorical.

Make the values for "Male" negative so their bars extend in the opposite direction from "Female".

Make two `barplot`s with the "AGE" on the y-axis and the "Male" and "Female" on the x-axis.
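
Here is a rough sketch of the pyramid on invented data. Note the swap: it uses matplotlib's `barh` rather than seaborn's `barplot` to keep the example's dependencies small, but the casting and sign-flipping steps are the same either way. `pyramid_df` is a hypothetical name for data already pivoted to one row per age.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented counts, already pivoted to one row per age
pyramid_df = pd.DataFrame({
    "AGE":    [0, 1, 2, 3],
    "Male":   [105, 103, 101, 99],
    "Female": [100, 99, 98, 97],
})

# Strings make each age its own category on the axis
pyramid_df["AGE"] = pyramid_df["AGE"].astype(str)

# Negative values push the male bars to the left of zero
pyramid_df["Male"] = -pyramid_df["Male"]

fig, ax = plt.subplots()
ax.barh(pyramid_df["AGE"], pyramid_df["Male"], label="Male")
ax.barh(pyramid_df["AGE"], pyramid_df["Female"], label="Female")
ax.axvline(0, color="black", linewidth=0.8)  # center line of the pyramid
ax.set_xlabel("Population")
ax.set_ylabel("Age")
ax.legend()
```

The x-axis tick labels will show negative numbers on the male side; relabeling them with absolute values is a common finishing touch.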

### 5-year estimates
([docs](https://api.census.gov/data/timeseries/idb/5year/variables.html))

This is an excellent example of "wide" data and is used by Wickham as an example of un-tidy data because there are multiple variables in the column names (sex, age, and variable). We could (should?) clean this up... at some point but let's keep the data wide for now.

In [11]:
fiveyear_df = pd.read_csv(
    'idb5yr.all',
    sep = '|',
    encoding = 'latin1'
)

fiveyear_df.head()

Unnamed: 0,#YR,TFR,SRB,RNI,POP95_99,POP90_94,POP85_89,POP80_84,POP75_79,POP70_74,POP65_69,POP60_64,POP5_9,POP55_59,POP50_54,POP45_49,POP40_44,POP35_39,POP30_34,POP25_29,POP20_24,POP15_19,POP10_14,POP100_,POP0_4,POP,NMR,NAME,MR1_4,MR0_4,MPOP95_99,MPOP90_94,MPOP85_89,MPOP80_84,MPOP75_79,MPOP70_74,MPOP65_69,MPOP60_64,MPOP5_9,MPOP55_59,MPOP50_54,MPOP45_49,MPOP40_44,MPOP35_39,MPOP30_34,MPOP25_29,MPOP20_24,MPOP15_19,MPOP10_14,MPOP100_,MPOP0_4,MPOP,MMR1_4,MMR0_4,IMR_M,IMR_F,IMR,GRR,GR,FPOP95_99,FPOP90_94,FPOP85_89,FPOP80_84,FPOP75_79,FPOP70_74,FPOP65_69,FPOP60_64,FPOP5_9,FPOP55_59,FPOP50_54,FPOP45_49,FPOP40_44,FPOP35_39,FPOP30_34,FPOP25_29,FPOP20_24,FPOP15_19,FPOP10_14,FPOP100_,FPOP0_4,FPOP,FMR1_4,FMR0_4,GENC,FIPS,E0_M,E0_F,E0,CDR,CBR,ASFR45_49,ASFR40_44,ASFR35_39,ASFR30_34,ASFR25_29,ASFR20_24,ASFR15_19,AREA_KM2,POP_DENS
0,1950,,,,,,,,,,,,,,,,,,,,,,,,,6176,,Andorra,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,AD,AN,,,,,,,,,,,,,468,13.2
1,1951,,,,,,,,,,,,,,,,,,,,,,,,,6310,,Andorra,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,AD,AN,,,,,,,,,,,,,468,13.5
2,1952,,,,,,,,,,,,,,,,,,,,,,,,,5866,,Andorra,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,AD,AN,,,,,,,,,,,,,468,12.5
3,1953,,,,,,,,,,,,,,,,,,,,,,,,,5591,,Andorra,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,AD,AN,,,,,,,,,,,,,468,11.9
4,1954,,,,,,,,,,,,,,,,,,,,,,,,,5503,,Andorra,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,AD,AN,,,,,,,,,,,,,468,11.8


In [41]:
fiveyear_df['NAME'].value_counts()

Andorra          151
Romania          151
Malaysia         151
Mozambique       151
Namibia          151
                ... 
Guam             151
Guinea-Bissau    151
Guyana           151
Zimbabwe         151
United States    111
Name: NAME, Length: 227, dtype: int64

## Appendix

Here's our anatomy of a matplotlib figure for reference.

![Anatomy of a matplotlib figure](https://matplotlib.org/stable/_images/sphx_glr_anatomy_001.png)

### Cleaning 5-year estimates

In [1]:
import numpy as np
import pandas as pd
idx = pd.IndexSlice

import re

In [2]:
age_bucket_order = ['0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39',
                    '40-44','45-49','50-54','55-59','60-64','65-69','70-74','75-79',
                    '80-84','85-89','90-94','95-99','100-']

In [8]:
fiveyear_df = pd.read_csv('idb5yr.all',sep='|',encoding='latin1')
fiveyear_df.head()

Unnamed: 0,#YR,TFR,SRB,RNI,POP95_99,POP90_94,POP85_89,POP80_84,POP75_79,POP70_74,...,CBR,ASFR45_49,ASFR40_44,ASFR35_39,ASFR30_34,ASFR25_29,ASFR20_24,ASFR15_19,AREA_KM2,POP_DENS
0,1950,,,,,,,,,,...,,,,,,,,,468,13.2
1,1951,,,,,,,,,,...,,,,,,,,,468,13.5
2,1952,,,,,,,,,,...,,,,,,,,,468,12.5
3,1953,,,,,,,,,,...,,,,,,,,,468,11.9
4,1954,,,,,,,,,,...,,,,,,,,,468,11.8


Let's filter to population estimates by 5-year age bucket.

In [9]:
pop_cols = [col for col in fiveyear_df.columns if 'POP' in col and "_" in col and "DENS" not in col]
pop_cols

['POP95_99',
 'POP90_94',
 'POP85_89',
 'POP80_84',
 'POP75_79',
 'POP70_74',
 'POP65_69',
 'POP60_64',
 'POP5_9',
 'POP55_59',
 'POP50_54',
 'POP45_49',
 'POP40_44',
 'POP35_39',
 'POP30_34',
 'POP25_29',
 'POP20_24',
 'POP15_19',
 'POP10_14',
 'POP100_',
 'POP0_4',
 'MPOP95_99',
 'MPOP90_94',
 'MPOP85_89',
 'MPOP80_84',
 'MPOP75_79',
 'MPOP70_74',
 'MPOP65_69',
 'MPOP60_64',
 'MPOP5_9',
 'MPOP55_59',
 'MPOP50_54',
 'MPOP45_49',
 'MPOP40_44',
 'MPOP35_39',
 'MPOP30_34',
 'MPOP25_29',
 'MPOP20_24',
 'MPOP15_19',
 'MPOP10_14',
 'MPOP100_',
 'MPOP0_4',
 'FPOP95_99',
 'FPOP90_94',
 'FPOP85_89',
 'FPOP80_84',
 'FPOP75_79',
 'FPOP70_74',
 'FPOP65_69',
 'FPOP60_64',
 'FPOP5_9',
 'FPOP55_59',
 'FPOP50_54',
 'FPOP45_49',
 'FPOP40_44',
 'FPOP35_39',
 'FPOP30_34',
 'FPOP25_29',
 'FPOP20_24',
 'FPOP15_19',
 'FPOP10_14',
 'FPOP100_',
 'FPOP0_4']

Filter to `pop_cols` and start the tidying process by moving the id variables to the index.

In [10]:
fiveyear_pop_df = fiveyear_df.set_index(['#YR','FIPS','NAME','GENC']).loc[:,pop_cols]
fiveyear_pop_df.head()

#YR,FIPS,NAME,GENC,POP95_99,POP90_94,POP85_89,POP80_84,POP75_79,POP70_74,POP65_69,POP60_64,POP5_9,POP55_59,...,FPOP45_49,FPOP40_44,FPOP35_39,FPOP30_34,FPOP25_29,FPOP20_24,FPOP15_19,FPOP10_14,FPOP100_,FPOP0_4
1950,AN,Andorra,AD,,,,,,,,,,,...,,,,,,,,,,
1951,AN,Andorra,AD,,,,,,,,,,,...,,,,,,,,,,
1952,AN,Andorra,AD,,,,,,,,,,,...,,,,,,,,,,
1953,AN,Andorra,AD,,,,,,,,,,,...,,,,,,,,,,
1954,AN,Andorra,AD,,,,,,,,,,,...,,,,,,,,,,


Stack and rename columns.

In [13]:
fiveyear_pop_stack_df = fiveyear_pop_df.stack().reset_index()
fiveyear_pop_stack_df.columns = ['year','fips','name','genc','variable','population']
fiveyear_pop_stack_df.head()

Unnamed: 0,year,fips,name,genc,variable,population
0,1990,AN,Andorra,AD,POP95_99,27.0
1,1990,AN,Andorra,AD,POP90_94,67.0
2,1990,AN,Andorra,AD,POP85_89,292.0
3,1990,AN,Andorra,AD,POP80_84,573.0
4,1990,AN,Andorra,AD,POP75_79,981.0
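
For comparison, the same wide-to-long reshaping can be sketched with `melt` followed by an explicit `dropna`. This is an alternative to the `stack` call above, not what the notebook uses, and the values in `wide_df` are invented.

```python
import numpy as np
import pandas as pd

# Invented wide-format rows with some missing values
wide_df = pd.DataFrame({
    "#YR":     [1990, 1991],
    "NAME":    ["Exampleland", "Exampleland"],
    "POP0_4":  [10.0, np.nan],
    "MPOP0_4": [5.0, 6.0],
    "FPOP0_4": [5.0, np.nan],
})

long_df = (
    wide_df
    # id_vars stay as columns; every other column becomes a (variable, value) row
    .melt(id_vars=["#YR", "NAME"], var_name="variable", value_name="population")
    # Make the handling of missing values explicit
    .dropna(subset=["population"])
    .rename(columns={"#YR": "year", "NAME": "name"})
    .reset_index(drop=True)
)
print(long_df)
```

Dropping missing rows explicitly keeps the result independent of how a given pandas version treats missing values during reshaping.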


Create new `gender` and `age` columns by parsing the `variable` names.

In [14]:
# Use a regular expression to extract the gender character at the start of the POP variables
fiveyear_pop_stack_df['gender'] = (fiveyear_pop_stack_df['variable']
                                   .str.findall(r'(\w)POP')
                                   .str.get(0)
                                  )

# Fill in missing values as B for Both (assign back instead of a chained inplace fill)
fiveyear_pop_stack_df['gender'] = fiveyear_pop_stack_df['gender'].fillna('B')

# Replace the gender short variables with full names
fiveyear_pop_stack_df.replace({'gender':{'B':'Both','M':'Male','F':'Female'}},inplace=True)

# Use a regex to extract and clean the age ranges at the end of the POP variables
fiveyear_pop_stack_df['age'] = (fiveyear_pop_stack_df['variable']
                                .str.findall(r'POP([\d\_]+)')
                                .str.get(0)
                                .str.replace('_','-')
                               )

# Drop uninteresting columns
fiveyear_pop_stack_df.drop(columns=['fips','genc','variable'],inplace=True)

# Cast the age variables to categorical
fiveyear_pop_stack_df['age'] = pd.Categorical(
    fiveyear_pop_stack_df['age'],
    categories=age_bucket_order,
    ordered=True
)

# Write to disk
fiveyear_pop_stack_df.to_csv('fiveyear_pop.csv',index=False)

# Inspect
fiveyear_pop_stack_df.head()

Unnamed: 0,year,name,population,gender,age
0,1990,Andorra,27.0,Both,95-99
1,1990,Andorra,67.0,Both,90-94
2,1990,Andorra,292.0,Both,85-89
3,1990,Andorra,573.0,Both,80-84
4,1990,Andorra,981.0,Both,75-79
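
The parsing in the cell above can also be sketched on three column names with `str.extract` and named groups. This is an alternative to the `findall`/`get` approach, shown only for illustration; `variables` and `parts` are hypothetical names. The last step shows why the ordered categorical matters for sorting.

```python
import pandas as pd

# Three variable names in the same style as the 5-year columns
variables = pd.Series(["FPOP100_", "POP0_4", "MPOP95_99"])

# An optional M/F prefix, the literal POP, then digits and underscores
parts = variables.str.extract(r"^(?P<gender>[MF]?)POP(?P<age>[\d_]+)$")

# An empty prefix means the column covers both sexes
parts["gender"] = parts["gender"].map({"": "Both", "M": "Male", "F": "Female"})
parts["age"] = parts["age"].str.replace("_", "-", regex=False)

# Plain string sorting puts "100-" before "95-99"
string_order = sorted(parts["age"])

# An ordered categorical sorts by the declared bucket order instead
parts["age"] = pd.Categorical(
    parts["age"], categories=["0-4", "95-99", "100-"], ordered=True
)
categorical_order = parts.sort_values("age")["age"].astype(str).tolist()

print(string_order)
print(categorical_order)
```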
