# Overview - [Preppin' Data Challenge 2023: Week 17 - Population Growth vs Country Size](https://preppindata.blogspot.com/2023/04/2023-week-17-population-growth-vs.html)

In this project we will be practicing cleaning and preparing data for analysis in Python.

**Challenge Level: Advanced**

We will be using the dataset linked in the title from the blog Preppin' Data, and will look to satisfy the following requirements per its instructions:

### Requirements
- Input the data
- Population data:
    - Use the data interpreter to ensure the headers are read as headers 
    - Remove unnecessary fields
    - Trim leading & trailing spaces from country names
    - Use the Country/Region data role to tidy up country names
    - Pivot data so we have 3 columns for year, population and country name
- Country size data: 
    - Split the Land in km2 (mi2) field to get the values for land size in km2 only 
    - Remove unnecessary fields - we only want country and land size km2
    - Clean country names and trim any trailing spaces
    - Use the Country/Region data role to tidy up country names
    - Group together Jersey and Guernsey to make the Channel Islands
- Join the two datasets so that any country names/regions that aren’t in both datasets are excluded
    - 36 countries from the country size file will not be included
- Exclude ‘World’ from the dataset
- Calculate the population density
- We want to be able to select which years to compare in order to work out population growth, i.e. what was the population growth from Year A to Year B. To do this, we need to create two parameters:
    - One for the more recent year (You don’t need to enter every year in the dataset, just give a few examples including 2021)
    - One for the less recent year (Again, don’t include every year, just a few examples including 2000)
- Set the parameters to 2000 and 2021
- Filter to just these 2 years and have the population density for each year as 2 fields in the dataset for each country (215 rows in total)
- Calculate the % change in population density between the two years
- Create a rank for % change in population density so that the country with the greatest change ranks 1
- Create another rank for total population density in 2021, so that the country with the highest density ranks 1
- Find the top 10 ranking countries for population density 2021
    - Output the data
- Find the top 10 ranking countries for change in population density 2000-2021
    - Output the data
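
The calculation steps in the list above (population density, % change, and the two ranks) can be sketched on a tiny table before touching the real files. This is a minimal illustration only: the two countries and all figures are made up, and the column names `pop_2000`, `pop_2021`, `density_*`, `pct_change`, and `rank_*` are assumptions rather than names from the challenge.

```python
import pandas as pd

# Made-up figures for two hypothetical countries; not taken from the challenge data.
toy = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Land in km2": [100.0, 200.0],
    "pop_2000": [1000.0, 4000.0],
    "pop_2021": [1500.0, 5000.0],
})

# Population density = population / land area (people per km2).
toy["density_2000"] = toy["pop_2000"] / toy["Land in km2"]
toy["density_2021"] = toy["pop_2021"] / toy["Land in km2"]

# % change between the two years, relative to the earlier year.
toy["pct_change"] = (toy["density_2021"] - toy["density_2000"]) / toy["density_2000"] * 100

# Rank 1 = greatest value; method="min" gives tied rows the same (lowest) rank.
toy["rank_change"] = toy["pct_change"].rank(ascending=False, method="min").astype(int)
toy["rank_density_2021"] = toy["density_2021"].rank(ascending=False, method="min").astype(int)

print(toy[["Country", "density_2021", "pct_change", "rank_change", "rank_density_2021"]])
```

Country A grows faster (50% against 25%) while Country B stays denser, so the two ranks disagree, which is the contrast the two final outputs are meant to show.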

# Project Code

## Import Necessary Packages for Project

We import the following packages/modules for these reasons:
- **pandas:** allows us to create/format/clean our dataset for easy analysis
- **NumPy:** helps us identify and clean records with null values
- **re:** helps us identify/cleanse records by specifying certain regex patterns

In [1]:
import pandas as pd
import numpy as np
import re

## Load Country Size DataFrame

In [2]:
#load dataset

#specify filepath variable in case file path needs to be changed
filepath = r"C:\Work\Projects\Python Projects\Preppin Data\2023 Week 17_Population Growth vs Country Size\drive-download-20230704T072535Z-001\country size.csv".replace("\\", "/")

#specify encoding to ensure that csv is read without errors 
df = pd.read_csv(filepath, encoding = "ISO-8859-1")

#preview dataframe
df.head(3)

Unnamed: 0,Rank,Country / Dependency,Total in km2 (mi2),Land in km2 (mi2),Water in km2 (mi2),% water
0,-,World,"510,072,000 (196,940,000)","148,940,000 (57,510,000)","361,132,000 (139,434,000)",70.8
1,1,Russia,"17,098,246 (6,601,670)","16,378,410 (6,323,740)","719,836 (277,930)",4.21
2,-,Antarctica,"14,200,000 (5,500,000)","14,200,000 (5,500,000)",0 (0),0.0


## Drop Unnecessary Columns from Country Size DataFrame

In [3]:
#create list of columns to reference when specifying which columns to drop and keep  
cols = list(df.columns)
cols

['Rank',
 'Country / Dependency',
 'Total in km2 (mi2)',
 'Land in km2 (mi2)',
 'Water in km2 (mi2)',
 '% water']

In [4]:
#create list of columns to keep by slicing the prior list (the keep list is the shorter of the two)
keep_cols = cols[1:2] + cols[3:4]
keep_cols

['Country / Dependency', 'Land in km2 (mi2)']

In [5]:
#grab inverse of prior list to specify columns to delete
delete_cols = [x for x in cols if x not in keep_cols]
delete_cols

['Rank', 'Total in km2 (mi2)', 'Water in km2 (mi2)', '% water']
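
The keep/drop steps above rely on positional slices of the column list. As an alternative sketch, the wanted columns can be selected by name, which does not depend on column order. The toy frame below reuses the column names and the first two rows shown in the preview; `toy` and `slim` are illustrative variable names.

```python
import pandas as pd

# Toy frame with the same column names as the country size file.
toy = pd.DataFrame({
    "Rank": ["-", "1"],
    "Country / Dependency": ["World", "Russia"],
    "Total in km2 (mi2)": ["510,072,000 (196,940,000)", "17,098,246 (6,601,670)"],
    "Land in km2 (mi2)": ["148,940,000 (57,510,000)", "16,378,410 (6,323,740)"],
    "Water in km2 (mi2)": ["361,132,000 (139,434,000)", "719,836 (277,930)"],
    "% water": [70.8, 4.21],
})

keep_cols = ["Country / Dependency", "Land in km2 (mi2)"]

# Selecting by name returns a new frame; .copy() makes it independent of the original.
slim = toy[keep_cols].copy()
print(slim.columns.tolist())
```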

In [6]:
#drop unnecessary columns using prior list  
df.drop(columns=delete_cols, inplace=True)
#view current remaining columns
df.head(3)

Unnamed: 0,Country / Dependency,Land in km2 (mi2)
0,World,"148,940,000 (57,510,000)"
1,Russia,"16,378,410 (6,323,740)"
2,Antarctica,"14,200,000 (5,500,000)"


## Clean and Reformat Data: Country Size DataFrame
### Look for nulls in "Land in km2 (mi2)" column, convert "not determined" values to nulls

In [7]:
#look for exceptions that will trigger errors using regex 
#try two different criteria for matching regex to pull records in dataframe: a full match and a contain criteria

#show "Land in km2 (mi2)" records that do not completely match the following: one or more ",", ".", or digit characters, a space, a left parenthesis, one or more ",", ".", or digit characters, a right parenthesis
# df.loc[~df['Land in km2 (mi2)'].str.fullmatch(r"[.\d,]+[ ][(][.\d,]+[)]"), :]

#show "Land in km2 (mi2)" records that do not contain the following: one or more ",", ".", or digit characters, a space, a left parenthesis, one or more ",", ".", or digit characters, a right parenthesis
#the raw string (r"...") keeps "\d" from being read as an invalid Python escape sequence
df.loc[~df['Land in km2 (mi2)'].str.contains(r"[.\d,]+[ ][(][.\d,]+[)]"), :]

Unnamed: 0,Country / Dependency,Land in km2 (mi2)
218,Akrotiri and Dhekelia (United Kingdom),not determined
244,Saint Barthélemy (France),not determined
245,Sint Eustatius (Netherlands),not determined
247,Saba (Netherlands),not determined


In [8]:
#checking for nulls in "Land in km2 (mi2)" column
df.loc[df['Land in km2 (mi2)'].isnull(), :]

Unnamed: 0,Country / Dependency,Land in km2 (mi2)


In [9]:
#replace "not determined" records with null (np.nan) values

#na=True makes str.contains return True for any existing NaN value, so after the ~ negation those rows are left out of the selection (they are already null) and the mask contains only booleans
df.loc[~df['Land in km2 (mi2)'].str.contains(r"[.\d,]+[ ][(][.\d,]+[)]", na=True), 'Land in km2 (mi2)'] = np.nan
#check for new null values, can cross check with prior isnull()
df.loc[df['Land in km2 (mi2)'].isnull(), :]

Unnamed: 0,Country / Dependency,Land in km2 (mi2)
218,Akrotiri and Dhekelia (United Kingdom),
244,Saint Barthélemy (France),
245,Sint Eustatius (Netherlands),
247,Saba (Netherlands),
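
The two cells above (find the rows that break the pattern, then overwrite them with nulls) can also be written as a single mask-and-replace step. The sketch below is one possible variant, not the notebook's method: it uses `str.fullmatch` and `Series.where` on a small hand-made Series, and the names `land`, `is_valid`, and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Small hand-made sample: two well-formed values, one text placeholder, one existing null.
land = pd.Series(["16,378,410 (6,323,740)", "not determined", "0 (0)", np.nan])

# Raw string so that \d reaches the regex engine unchanged.
pattern = r"[.\d,]+ \([.\d,]+\)"

# fullmatch requires the whole value to fit the pattern; na=False treats NaN as "no match".
is_valid = land.str.fullmatch(pattern, na=False)

# Keep valid values and replace everything else with NaN.
cleaned = land.where(is_valid)
print(cleaned)
```

Using `fullmatch` is stricter than `contains`: a value with extra text before or after the expected shape is also flagged.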


### Remove miles value in "Land in km2 (mi2)" column, Rename column

In [10]:
#drop miles value in "Land in km2 (mi2)" column and rename column to reflect new values 

df["Land in km2 (mi2)"] = df["Land in km2 (mi2)"].str.split(" ").str[0]
df.rename(columns={"Land in km2 (mi2)": "Land in km2"}, inplace=True)
print(df["Land in km2"].head(3))

0    148,940,000
1     16,378,410
2     14,200,000
Name: Land in km2, dtype: object


### Change "Land in km2" column data type

In [11]:
#double check data type of "Land in km2" column
df.dtypes

Country / Dependency    object
Land in km2             object
dtype: object

In [12]:
#get rid of commas in numeric column, strip whitespace, convert to numeric column data type
df["Land in km2"] = df["Land in km2"].str.replace(",", "").str.strip()
df["Land in km2"] = df["Land in km2"].astype(float)

#verify new data type of "Land in km2" column
df.dtypes

Country / Dependency     object
Land in km2             float64
dtype: object
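
The split, comma removal, and type conversion above can be condensed into one extraction step. The following is a sketch under assumptions: it uses `str.extract` and `pd.to_numeric` instead of the notebook's `split`/`astype` calls, and the second sample value is made up to show a decimal figure.

```python
import pandas as pd

# First value comes from the preview above; the second is made up; the third is the text placeholder.
land = pd.Series(["148,940,000 (57,510,000)", "6.5 (2.5)", "not determined"])

# Capture the leading run of digits, commas and periods (the km2 figure).
km2_text = land.str.extract(r"^([\d,.]+)", expand=False)

# Strip thousands separators, then convert; errors="coerce" turns anything non-numeric into NaN.
km2 = pd.to_numeric(km2_text.str.replace(",", "", regex=False), errors="coerce")
print(km2)
```

With this approach the "not determined" rows become NaN as a side effect of the extraction, so a separate null-replacement pass may not be needed.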

### Clean/Format Country Column

In [13]:
#look at rows in country column to glance over values that might need cleaning

#look at all rows -- not recommended for large datasets
with pd.option_context('display.max_rows', None,):
   print(df["Country / Dependency"])
#look at some rows -- recommended for large datasets
# df["Country / Dependency"].head(30)

0                                                  World
1                                                 Russia
2                                             Antarctica
3                                         Canada[Note 1]
4                                                  China
5                                          United States
6                                                 Brazil
7                                              Australia
8                                                  India
9                                              Argentina
10                                            Kazakhstan
11                                               Algeria
12                                              DR Congo
13                       Danish Realm Kingdom of Denmark
14                                   Greenland (Denmark)
15                                          Saudi Arabia
16                                                Mexico
17                             

In [14]:
# pull country records that contain anything other than unaccented alphabetical characters and whitespace
df.loc[df["Country / Dependency"].str.contains(r"[^A-Za-z\s]+")]

Unnamed: 0,Country / Dependency,Land in km2
3,Canada[Note 1],9093507.0
14,Greenland (Denmark),2166086.0
127,Svalbard (Norway),62045.0
140,Guinea-Bissau,28120.0
157,New Caledonia (France),18275.0
165,Falkland Islands (United Kingdom),12173.0
171,Puerto Rico (United States),9104.0
172,French Southern Territories (France),7668.0
177,French Polynesia (France),3827.0
179,South Georgia and the South Sandwich Islands ...,3903.0


In [15]:
#strip country column of whitespace
df["Country / Dependency"] = df["Country / Dependency"].str.strip()

# [deprecated solution] replaced with latter code since edge case [246] "Cocos (Keeling) Islands (Australia)" was not satisfied
# split and save first index item in country columns that contains 1 or more chars that arent alphabetical or whitespace
# df.loc[df["Country / Dependency"].str.contains("[^A-Za-z\s]+"), "Country / Dependency"] = df.loc[df["Country / Dependency"].str.contains("[^A-Za-z\s]+"), "Country / Dependency"].str.split("(").str[0].str.strip()
# split and save first index item in country columns that contains "["
# df.loc[df["Country / Dependency"].str.contains("\["), "Country / Dependency"] = df.loc[df["Country / Dependency"].str.contains("\["), "Country / Dependency"].str.split("[").str[0].str.strip()

#remove any parentheses or brackets, along with the text inside them
#regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df["Country / Dependency"] = df["Country / Dependency"].str.replace(r"[\(\[].*?[\]\)]", "", regex=True).str.strip()

# check revised column values (only names containing hyphens, accents, periods or commas should remain)
print(df.loc[df["Country / Dependency"].str.contains(r"[^A-Za-z\s]+"), "Country / Dependency"])
# check for any items that contain a "[" -- should be empty now
print(df.loc[df["Country / Dependency"].str.contains(r"\["), "Country / Dependency"])


140                                   Guinea-Bissau
188                           São Tomé and Príncipe
203                                         Curaçao
209                             U.S. Virgin Islands
212    Saint Helena, Ascension and Tristan da Cunha
244                                Saint Barthélemy
Name: Country / Dependency, dtype: object
Series([], Name: Country / Dependency, dtype: object)
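
To see what the bracket-stripping pattern does on the edge cases mentioned above, here is a small standalone sketch. It uses a slightly different pattern from the cell above (an added leading `\s*`, which is an assumption of this example) so that a group in the middle of a name does not leave a double space behind.

```python
import pandas as pd

names = pd.Series([
    "Canada[Note 1]",
    "Greenland (Denmark)",
    "Cocos (Keeling) Islands (Australia)",
    "Guinea-Bissau",
])

# Remove each bracketed or parenthesised group together with the whitespace before it.
# regex=True is passed explicitly because the default of this argument changed in pandas 2.0.
cleaned = names.str.replace(r"\s*[\(\[].*?[\]\)]", "", regex=True).str.strip()
print(cleaned.tolist())
# ['Canada', 'Greenland', 'Cocos Islands', 'Guinea-Bissau']
```

The non-greedy `.*?` matters here: a greedy `.*` would swallow everything from the first "(" to the last ")" and turn the Cocos entry into just "Cocos".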


In [16]:
#rename column to reflect new cleaned country column value
df.rename(columns={"Country / Dependency": "Country"}, inplace=True)
#verify new country column name
df.columns

Index(['Country', 'Land in km2'], dtype='object')

### Combine Jersey and Guernsey Territories under "Channel Islands"

In [17]:
#view Jersey and Guernsey territories that will be combined together

print(df.loc[df["Country"].str.contains("Jersey"), :])
print(df.loc[df["Country"].str.contains("Guernsey"), :])

    Country  Land in km2
229  Jersey        116.0
      Country  Land in km2
232  Guernsey         78.0


In [18]:
#combine Jersey and Guernsey records under "Channel Islands", calculate sum of land 

#save column labels so that we dont need to manually type each time
cols = list(df.columns)

#look up the land area of Jersey and Guernsey (exact name match; .iloc[0] extracts the single value, since int() on a whole Series is deprecated)
jersey_land = int(df.loc[df["Country"] == "Jersey", "Land in km2"].iloc[0])
guernsey_land = int(df.loc[df["Country"] == "Guernsey", "Land in km2"].iloc[0])

new_row = {
    cols[0]: "Channel Islands", 
    cols[1]: jersey_land + guernsey_land
          }

In [19]:
# add new row to last index in df
df.loc[len(df)] = new_row 
       
# drop rows used to make new row
df.drop([229,232], axis=0, inplace=True)

# reset index
df.reset_index(drop=True, inplace=True)

In [20]:
#check last row for new "Channel Islands" record
df.iloc[len(df)-1]

Country        Channel Islands
Land in km2                194
Name: 252, dtype: object
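
The append-and-drop approach above depends on the hard-coded row labels 229 and 232. A label-free alternative, sketched here on a three-row toy frame, is to relabel both islands and aggregate with `groupby`. The land figures match the values shown above; `channel` and `grouped` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Jersey", "Guernsey", "Monaco"],
    "Land in km2": [116.0, 78.0, 2.02],
})

# Relabel both islands, then sum the land area per label.
channel = {"Jersey": "Channel Islands", "Guernsey": "Channel Islands"}
toy["Country"] = toy["Country"].replace(channel)

# sort=False keeps the groups in order of first appearance.
grouped = toy.groupby("Country", as_index=False, sort=False)["Land in km2"].sum()
print(grouped)
```

Because every other country appears only once, the sum leaves those rows unchanged and only the two relabelled rows are merged.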

### Review Country Size DataFrame and save

In [21]:
df

Unnamed: 0,Country,Land in km2
0,World,1.489400e+08
1,Russia,1.637841e+07
2,Antarctica,1.420000e+07
3,Canada,9.093507e+06
4,China,9.326410e+06
...,...,...
248,Clipperton Island,2.000000e+00
249,Ashmore and Cartier Islands,5.000000e+00
250,Monaco,2.020000e+00
251,Vatican City,4.900000e-01


In [22]:
#save a copy of df as cs_df (country size) so it stays independent of later changes to df
cs_df = df.copy()
cs_df

Unnamed: 0,Country,Land in km2
0,World,1.489400e+08
1,Russia,1.637841e+07
2,Antarctica,1.420000e+07
3,Canada,9.093507e+06
4,China,9.326410e+06
...,...,...
248,Clipperton Island,2.000000e+00
249,Ashmore and Cartier Islands,5.000000e+00
250,Monaco,2.020000e+00
251,Vatican City,4.900000e-01


## Load Population DataFrame

In [23]:
#load dataset

#specify filepath variable in case file path needs to be changed
filepath = r"C:\Work\Projects\Python Projects\Preppin Data\2023 Week 17_Population Growth vs Country Size\drive-download-20230704T072535Z-001\Population Data.xls".replace("\\", "/")

#header=3 tells read_excel that the column headers sit on the fourth row of the sheet (0-indexed row 3)
df = pd.read_excel(filepath, header=3)

#preview dataframe
df.head(10)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54608.0,55811.0,56682.0,57475.0,58178.0,58782.0,...,102112.0,102880.0,103594.0,104257.0,104874.0,105439.0,105962.0,106442.0,106585.0,106537.0
1,Africa Eastern and Southern,AFE,"Population, total",SP.POP.TOTL,130692579.0,134169237.0,137835590.0,141630546.0,145605995.0,149742351.0,...,552530654.0,567891875.0,583650827.0,600008150.0,616377331.0,632746296.0,649756874.0,667242712.0,685112705.0,702976832.0
2,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8622466.0,8790140.0,8969047.0,9157465.0,9355514.0,9565147.0,...,30466479.0,31541209.0,32716210.0,33753499.0,34636207.0,35643418.0,36686784.0,37769499.0,38972230.0,40099462.0
3,Africa Western and Central,AFW,"Population, total",SP.POP.TOTL,97256290.0,99314028.0,101445032.0,103667517.0,105959979.0,108336203.0,...,376797999.0,387204553.0,397855507.0,408690375.0,419778384.0,431138704.0,442646825.0,454306063.0,466189102.0,478185907.0
4,Angola,AGO,"Population, total",SP.POP.TOTL,5357195.0,5441333.0,5521400.0,5599827.0,5673199.0,5736582.0,...,25188292.0,26147002.0,27128337.0,28127721.0,29154746.0,30208628.0,31273533.0,32353588.0,33428486.0,34503774.0
5,Albania,ALB,"Population, total",SP.POP.TOTL,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,...,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0,2866376.0,2854191.0,2837849.0,2811666.0
6,Andorra,AND,"Population, total",SP.POP.TOTL,9443.0,10216.0,11014.0,11839.0,12690.0,13563.0,...,71013.0,71367.0,71621.0,71746.0,72540.0,73837.0,75013.0,76343.0,77700.0,79034.0
7,Arab World,ARB,"Population, total",SP.POP.TOTL,93359407.0,95760348.0,98268683.0,100892507.0,103618568.0,106444103.0,...,380383408.0,389131555.0,397922915.0,406501999.0,415077960.0,423664839.0,432545676.0,441467739.0,449228296.0,456520777.0
8,United Arab Emirates,ARE,"Population, total",SP.POP.TOTL,133426.0,140984.0,148877.0,157006.0,165305.0,173797.0,...,8664969.0,8751847.0,8835951.0,8916899.0,8994263.0,9068296.0,9140169.0,9211657.0,9287289.0,9365145.0
9,Argentina,ARG,"Population, total",SP.POP.TOTL,20349744.0,20680653.0,21020359.0,21364017.0,21708487.0,22053661.0,...,41733271.0,42202935.0,42669500.0,43131966.0,43590368.0,44044811.0,44494502.0,44938712.0,45376763.0,45808747.0


## Clean and Reformat Data: Population Dataset

### Delete Unnecessary Records

In [24]:
#check for unique "Indicator Name" column values
df["Indicator Name"].unique()

#would need to filter out non "Population, total" records if other metrics appeared in the unique() output

array(['Population, total'], dtype=object)

### Delete Unnecessary Columns

In [25]:
#check for required columns 
df.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
      dtype='object')

In [26]:
#drop unnecessary columns
df.drop(columns=['Country Code', 'Indicator Name', 'Indicator Code'], inplace=True)
df.columns

Index(['Country Name', '1960', '1961', '1962', '1963', '1964', '1965', '1966',
       '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975',
       '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984',
       '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993',
       '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002',
       '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011',
       '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020',
       '2021'],
      dtype='object')

### Check for Countries in Population Dataset that are not in Country Size Dataset, Spot Correct Individual Country Names before Inner Join
**Checking for the following reasons:** 
- to see whether further data cleaning is needed before joining the datasets on country values 
- to show which country names need to be cleaned and how

In [27]:
#check which rows of df "Country Name" are not in cs_df "Country"
#print all
not_in = df[~df["Country Name"].isin(cs_df["Country"])]
with pd.option_context('display.max_rows', None,):
   print(not_in)

                                          Country Name          1960  \
1                          Africa Eastern and Southern  1.306926e+08   
3                           Africa Western and Central  9.725629e+07   
7                                           Arab World  9.335941e+07   
36                      Central Europe and the Baltics  9.140176e+07   
49                              Caribbean small states  4.209141e+06   
51                                             Curacao  1.248260e+05   
61         East Asia & Pacific (excluding high income)  8.964823e+08   
62                          Early-demographic dividend  9.794615e+08   
63                                 East Asia & Pacific  1.043334e+09   
64       Europe & Central Asia (excluding high income)  2.557261e+08   
65                               Europe & Central Asia  6.662737e+08   
68                                           Euro area  2.652450e+08   
73                                      European Union  3.569471

In [28]:
#show tally of how many countries dont match between the two datasets
not_in["Country Name"].count()

53

We will now check the following individual countries, taken from the list above, for mismatched spellings between the two datasets:
- Curacao
- Virgin Islands (U.S.)
- Kosovo

In [29]:
#check cs_df for a specific "Country" value
#verified possible spellings of Curacao via Google, will test cs_df for the revised value

cs_df.loc[cs_df["Country"].str.contains("Curaçao"),:]

Unnamed: 0,Country,Land in km2
203,Curaçao,444.0


In [30]:
#there is a Curacao present with the new spelling in cs_df (Curaçao)
#replace "Curacao" in population dataset with "Curaçao" to match cs_df

df.loc[df["Country Name"].str.contains("Curacao"), "Country Name"] = "Curaçao"
df.loc[df["Country Name"] == "Curaçao", :]

Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
51,Curaçao,124826.0,126125.0,128414.0,130860.0,133148.0,135266.0,136682.0,138140.0,140298.0,...,152088.0,153822.0,155909.0,157980.0,159664.0,160175.0,159336.0,157441.0,154947.0,152369.0


In [31]:
#check cs_df for a specific "Country" value
#a Google search turned up no alternative spellings of Kosovo, so we test cs_df for the current value

cs_df.loc[cs_df["Country"].str.contains("Kosovo"),:]

#there is no Kosovo present in cs_df to match with, will proceed to the next case

Unnamed: 0,Country,Land in km2


In [32]:
#check cs_df for a specific "Country" value
#will use shortened version of "Virgin Islands" to check cs_df for different formatting

cs_df.loc[cs_df["Country"].str.contains("Virgin Islands"),:]

Unnamed: 0,Country,Land in km2
209,U.S. Virgin Islands,346.0
226,British Virgin Islands,151.0


In [33]:
#there is a "U.S. Virgin Islands" present in cs_df
#replace "Virgin Islands (U.S.)" in population dataset with "U.S. Virgin Islands" to match cs_df

#raw string with escaped parentheses and periods so the pattern matches the literal name only
df.loc[df["Country Name"].str.contains(r"Virgin Islands \(U\.S\.\)"), "Country Name"] = "U.S. Virgin Islands"
df.loc[df["Country Name"] == "U.S. Virgin Islands", :]

Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
256,U.S. Virgin Islands,32500.0,34300.0,35000.0,39800.0,40800.0,43500.0,46200.0,49100.0,55700.0,...,108188.0,108041.0,107882.0,107712.0,107516.0,107281.0,107001.0,106669.0,106290.0,105870.0


- **Individual countries/territories have been corrected above.**
- **Most of the remaining entries in the list above are regions or groupings rather than individual countries or territories.**
    - **As a result, we will move on to save the population dataset and join the two cleaned datasets together.**
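
The individual spelling fixes made above can also be collected in one lookup table, which keeps all the corrections in a single place if more mismatches turn up later. This is a sketch on a hand-made Series; the dictionary name `name_fixes` is an assumption of this example.

```python
import pandas as pd

pop_names = pd.Series(["Curacao", "Virgin Islands (U.S.)", "Kosovo", "Canada"])

# Left: spelling in the population file; right: spelling in the country size file.
name_fixes = {
    "Curacao": "Curaçao",
    "Virgin Islands (U.S.)": "U.S. Virgin Islands",
}

# Series.replace with a dict matches whole values exactly, so no regex escaping is needed.
fixed = pop_names.replace(name_fixes)
print(fixed.tolist())
```

Names without an entry in the dictionary, such as Kosovo, pass through unchanged and simply fall out of the inner join later.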

In [34]:
#save df as p_df for clarity
p_df = df

## Merge Country Size DataFrame and Population DataFrame

### Outer Join Version (to test for missing matches)

In [35]:
#join the two datasets together via outer join to check whether a significant number of matches are missing

total_df = pd.merge(cs_df, p_df, left_on="Country", right_on="Country Name", how="outer")

In [36]:
#preview dataframe

total_df.head(3)

Unnamed: 0,Country,Land in km2,Country Name,1960,1961,1962,1963,1964,1965,1966,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,World,148940000.0,World,3031565000.0,3072511000.0,3126935000.0,3193509000.0,3260518000.0,3328285000.0,3398561000.0,...,7140896000.0,7229185000.0,7317509000.0,7404911000.0,7491934000.0,7578158000.0,7661776000.0,7742682000.0,7820982000.0,7888409000.0
1,Russia,16378410.0,Russia,119897000.0,121236000.0,122591000.0,123960000.0,125345000.0,126745000.0,127468000.0,...,143201700.0,143507000.0,143819700.0,144096900.0,144342400.0,144496700.0,144477900.0,144406300.0,144073100.0,143449300.0
2,Antarctica,14200000.0,,,,,,,,,...,,,,,,,,,,


In [37]:
#view entire joined DataFrame 
#not recommended for large datasets

with pd.option_context("display.max_rows", None,):
    print(total_df)

                                          Country   Land in km2  \
0                                           World  1.489400e+08   
1                                          Russia  1.637841e+07   
2                                      Antarctica  1.420000e+07   
3                                          Canada  9.093507e+06   
4                                           China  9.326410e+06   
5                                   United States  9.147593e+06   
6                                          Brazil  8.460415e+06   
7                                       Australia  7.633565e+06   
8                                           India  2.973190e+06   
9                                       Argentina  2.736690e+06   
10                                     Kazakhstan  2.699700e+06   
11                                        Algeria  2.381741e+06   
12                                       DR Congo  2.267048e+06   
13                Danish Realm Kingdom of Denmark  2.220072e+0

In [38]:
#reorder columns so that country columns are next to each other
old_cols = list(total_df.columns)

new_cols = old_cols[:1] + old_cols[2:3] + old_cols[1:2] + old_cols[3:]
total_df = total_df[new_cols]
total_df.columns

Index(['Country', 'Country Name', 'Land in km2', '1960', '1961', '1962',
       '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971',
       '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980',
       '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989',
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998',
       '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018', '2019', '2020', '2021'],
      dtype='object')

In [39]:
#print any row where either "Country" or "Country Name" is null
missing_df = total_df.loc[total_df["Country Name"].isnull() | total_df["Country"].isnull() , ["Country", "Country Name"]]
missing_df

Unnamed: 0,Country,Country Name
2,Antarctica,
13,Danish Realm Kingdom of Denmark,
80,Western Sahara,
127,Svalbard,
139,Taiwan,
...,...,...
299,,Middle East & North Africa (IDA & IBRD countries)
300,,South Asia (IDA & IBRD)
301,,Sub-Saharan Africa (IDA & IBRD countries)
302,,Upper middle income


In [40]:
#test missing_df for rows that are mismatched
#modify the test_string as needed

test_string = "Kos"
missing_df.loc[missing_df["Country"].str.contains(test_string, na=False) | missing_df["Country Name"].str.contains(test_string, na=False), :]

Unnamed: 0,Country,Country Name
303,,Kosovo
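
The null check on both country columns above can also be done with the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording which side each row came from. The sketch below uses two tiny hand-made frames; the population numbers are placeholders, not real figures.

```python
import pandas as pd

size = pd.DataFrame({
    "Country": ["Russia", "Antarctica"],
    "Land in km2": [16378410.0, 14200000.0],
})
pop = pd.DataFrame({
    "Country Name": ["Russia", "Kosovo"],
    "2021": [100.0, 200.0],  # placeholder numbers
})

merged = pd.merge(size, pop, left_on="Country", right_on="Country Name",
                  how="outer", indicator=True)

# _merge is "both", "left_only" (size file only) or "right_only" (population file only).
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["Country", "Country Name", "_merge"]])
```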


### Inner Join Version (final)

In [41]:
#join the two datasets together via inner join
total_df = pd.merge(cs_df, p_df, left_on="Country", right_on="Country Name", how="inner")

In [42]:
total_df

Unnamed: 0,Country,Land in km2,Country Name,1960,1961,1962,1963,1964,1965,1966,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,World,1.489400e+08,World,3.031565e+09,3.072511e+09,3.126935e+09,3.193509e+09,3.260518e+09,3.328285e+09,3.398561e+09,...,7.140896e+09,7.229185e+09,7.317509e+09,7.404911e+09,7.491934e+09,7.578158e+09,7.661776e+09,7.742682e+09,7.820982e+09,7.888409e+09
1,Russia,1.637841e+07,Russia,1.198970e+08,1.212360e+08,1.225910e+08,1.239600e+08,1.253450e+08,1.267450e+08,1.274680e+08,...,1.432017e+08,1.435070e+08,1.438197e+08,1.440969e+08,1.443424e+08,1.444967e+08,1.444779e+08,1.444063e+08,1.440731e+08,1.434493e+08
2,Canada,9.093507e+06,Canada,1.790936e+07,1.827100e+07,1.861400e+07,1.896400e+07,1.932500e+07,1.967800e+07,2.004800e+07,...,3.471422e+07,3.508295e+07,3.543744e+07,3.570291e+07,3.610949e+07,3.654524e+07,3.706508e+07,3.760123e+07,3.803720e+07,3.824611e+07
3,China,9.326410e+06,China,6.670700e+08,6.603300e+08,6.657700e+08,6.823350e+08,6.983550e+08,7.151850e+08,7.354000e+08,...,1.354190e+09,1.363240e+09,1.371860e+09,1.379860e+09,1.387790e+09,1.396215e+09,1.402760e+09,1.407745e+09,1.411100e+09,1.412360e+09
4,United States,9.147593e+06,United States,1.806710e+08,1.836910e+08,1.865380e+08,1.892420e+08,1.918890e+08,1.943030e+08,1.965600e+08,...,3.138777e+08,3.160599e+08,3.183863e+08,3.207390e+08,3.230718e+08,3.251221e+08,3.268382e+08,3.283300e+08,3.315011e+08,3.318937e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,Tuvalu,2.600000e+01,Tuvalu,5.404000e+03,5.436000e+03,5.471000e+03,5.503000e+03,5.525000e+03,5.548000e+03,5.591000e+03,...,1.085400e+04,1.091800e+04,1.089900e+04,1.087700e+04,1.085200e+04,1.082800e+04,1.086500e+04,1.095600e+04,1.106900e+04,1.120400e+04
211,Nauru,2.100000e+01,Nauru,4.582000e+03,4.753000e+03,4.950000e+03,5.198000e+03,5.484000e+03,5.804000e+03,6.021000e+03,...,1.044400e+04,1.069400e+04,1.094000e+04,1.118500e+04,1.143700e+04,1.168200e+04,1.192400e+04,1.213200e+04,1.231500e+04,1.251100e+04
212,Gibraltar,6.500000e+00,Gibraltar,2.182200e+04,2.190700e+04,2.224900e+04,2.279600e+04,2.334700e+04,2.391000e+04,2.447700e+04,...,3.216000e+04,3.241100e+04,3.245200e+04,3.252000e+04,3.256500e+04,3.260200e+04,3.264800e+04,3.268500e+04,3.270900e+04,3.266900e+04
213,Monaco,2.020000e+00,Monaco,2.179700e+04,2.190700e+04,2.210600e+04,2.244200e+04,2.276600e+04,2.302200e+04,2.319800e+04,...,3.470000e+04,3.542500e+04,3.611000e+04,3.676000e+04,3.707100e+04,3.704400e+04,3.702900e+04,3.703400e+04,3.692200e+04,3.668600e+04


## Clean Joined DataFrame

### Drop Unnecessary Columns/Rows, Reset Index

In [43]:
# drop world from the dataset (selected by name rather than by row position)
total_df.drop(total_df[total_df["Country"] == "World"].index, inplace=True)
#reset index
total_df.reset_index(drop=True, inplace=True)

#drop duplicate country name column
total_df.drop(columns="Country Name", inplace=True)

total_df

Unnamed: 0,Country,Land in km2,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Russia,16378410.00,119897000.0,121236000.0,122591000.0,123960000.0,125345000.0,126745000.0,127468000.0,128196000.0,...,1.432017e+08,1.435070e+08,1.438197e+08,1.440969e+08,1.443424e+08,1.444967e+08,1.444779e+08,1.444063e+08,1.440731e+08,1.434493e+08
1,Canada,9093507.00,17909356.0,18271000.0,18614000.0,18964000.0,19325000.0,19678000.0,20048000.0,20412000.0,...,3.471422e+07,3.508295e+07,3.543744e+07,3.570291e+07,3.610949e+07,3.654524e+07,3.706508e+07,3.760123e+07,3.803720e+07,3.824611e+07
2,China,9326410.00,667070000.0,660330000.0,665770000.0,682335000.0,698355000.0,715185000.0,735400000.0,754550000.0,...,1.354190e+09,1.363240e+09,1.371860e+09,1.379860e+09,1.387790e+09,1.396215e+09,1.402760e+09,1.407745e+09,1.411100e+09,1.412360e+09
3,United States,9147593.00,180671000.0,183691000.0,186538000.0,189242000.0,191889000.0,194303000.0,196560000.0,198712000.0,...,3.138777e+08,3.160599e+08,3.183863e+08,3.207390e+08,3.230718e+08,3.251221e+08,3.268382e+08,3.283300e+08,3.315011e+08,3.318937e+08
4,Brazil,8460415.00,73092515.0,75330008.0,77599218.0,79915555.0,82262794.0,84623747.0,86979283.0,89323288.0,...,1.999777e+08,2.017218e+08,2.034596e+08,2.051882e+08,2.068596e+08,2.085050e+08,2.101666e+08,2.117829e+08,2.131963e+08,2.143262e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209,Tuvalu,26.00,5404.0,5436.0,5471.0,5503.0,5525.0,5548.0,5591.0,5657.0,...,1.085400e+04,1.091800e+04,1.089900e+04,1.087700e+04,1.085200e+04,1.082800e+04,1.086500e+04,1.095600e+04,1.106900e+04,1.120400e+04
210,Nauru,21.00,4582.0,4753.0,4950.0,5198.0,5484.0,5804.0,6021.0,6114.0,...,1.044400e+04,1.069400e+04,1.094000e+04,1.118500e+04,1.143700e+04,1.168200e+04,1.192400e+04,1.213200e+04,1.231500e+04,1.251100e+04
211,Gibraltar,6.50,21822.0,21907.0,22249.0,22796.0,23347.0,23910.0,24477.0,25047.0,...,3.216000e+04,3.241100e+04,3.245200e+04,3.252000e+04,3.256500e+04,3.260200e+04,3.264800e+04,3.268500e+04,3.270900e+04,3.266900e+04
212,Monaco,2.02,21797.0,21907.0,22106.0,22442.0,22766.0,23022.0,23198.0,23281.0,...,3.470000e+04,3.542500e+04,3.611000e+04,3.676000e+04,3.707100e+04,3.704400e+04,3.702900e+04,3.703400e+04,3.692200e+04,3.668600e+04


### Unpivot Year Data 

In [44]:
#save columns to a list for easy column referencing
cols = list(total_df.columns)

#print columns with corresponding index
for index, col in enumerate(cols):
    print(col, index)

Country 0
Land in km2 1
1960 2
1961 3
1962 4
1963 5
1964 6
1965 7
1966 8
1967 9
1968 10
1969 11
1970 12
1971 13
1972 14
1973 15
1974 16
1975 17
1976 18
1977 19
1978 20
1979 21
1980 22
1981 23
1982 24
1983 25
1984 26
1985 27
1986 28
1987 29
1988 30
1989 31
1990 32
1991 33
1992 34
1993 35
1994 36
1995 37
1996 38
1997 39
1998 40
1999 41
2000 42
2001 43
2002 44
2003 45
2004 46
2005 47
2006 48
2007 49
2008 50
2009 51
2010 52
2011 53
2012 54
2013 55
2014 56
2015 57
2016 58
2017 59
2018 60
2019 61
2020 62
2021 63


In [45]:
#unpivot the year columns into rows: one row per country per year
total_df = pd.melt(total_df, id_vars=cols[:2], value_vars=cols[2:], var_name="year", value_name="population")

In [46]:
#view year unpivot changes
total_df

Unnamed: 0,Country,Land in km2,year,population
0,Russia,16378410.00,1960,119897000.0
1,Canada,9093507.00,1960,17909356.0
2,China,9326410.00,1960,667070000.0
3,United States,9147593.00,1960,180671000.0
4,Brazil,8460415.00,1960,73092515.0
...,...,...,...,...
13263,Tuvalu,26.00,2021,11204.0
13264,Nauru,21.00,2021,12511.0
13265,Gibraltar,6.50,2021,32669.0
13266,Monaco,2.02,2021,36686.0
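To see what the unpivot does on a smaller scale, here is a minimal sketch on a toy table. The countries and numbers are made up for illustration; only the column names mirror the notebook.

```python
import pandas as pd

# Toy wide table (made-up values) with the same layout as total_df before the melt
wide = pd.DataFrame({
    "Country": ["A", "B"],
    "Land in km2": [10.0, 20.0],
    "1960": [100.0, 200.0],
    "1961": [110.0, 210.0],
})

# Identifier columns stay as they are; every other column is treated as a year
id_cols = ["Country", "Land in km2"]
year_cols = [c for c in wide.columns if c not in id_cols]

long = wide.melt(id_vars=id_cols, value_vars=year_cols,
                 var_name="year", value_name="population")
print(long)
```

Each country now appears once per year column, and the year labels arrive as strings, which is why the next step converts them to integers.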


### Change Data Type of "Year" Column

In [47]:
total_df["year"] = total_df["year"].astype(int)

In [48]:
total_df.dtypes

Country         object
Land in km2    float64
year             int32
population     float64
dtype: object

## Obtain Final Outputs 

## (Highest Pop Density [2021], Highest Pop Density Change [2000-2021]) 

In [49]:
#keep only the two comparison years, 2000 and 2021
years = [2000, 2021]

total_df = total_df[total_df["year"].isin(years)]
total_df

Unnamed: 0,Country,Land in km2,year,population
8560,Russia,16378410.00,2000,1.465969e+08
8561,Canada,9093507.00,2000,3.068573e+07
8562,China,9326410.00,2000,1.262645e+09
8563,United States,9147593.00,2000,2.821624e+08
8564,Brazil,8460415.00,2000,1.758737e+08
...,...,...,...,...
13263,Tuvalu,26.00,2021,1.120400e+04
13264,Nauru,21.00,2021,1.251100e+04
13265,Gibraltar,6.50,2021,3.266900e+04
13266,Monaco,2.02,2021,3.668600e+04
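The challenge asks for the two years to be parameters. One possible way to express that in Python is a small helper function. The name `filter_years` and its arguments are hypothetical, not part of the original notebook.

```python
import pandas as pd

def filter_years(df, earlier_year=2000, later_year=2021, year_col="year"):
    """Return only the rows for the two comparison years (hypothetical helper)."""
    if earlier_year >= later_year:
        raise ValueError("earlier_year must come before later_year")
    return df[df[year_col].isin([earlier_year, later_year])].copy()

# Toy long-format data (made-up values)
demo = pd.DataFrame({
    "Country": ["A", "A", "A"],
    "year": [1999, 2000, 2021],
    "population": [90.0, 100.0, 150.0],
})

filtered = filter_years(demo, earlier_year=2000, later_year=2021)
print(filtered)
```

Changing the comparison then only requires different arguments, for example `filter_years(demo, 1999, 2021)`.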


In [50]:
#create list of columns for easy referencing in pivoting code
cols = list(total_df.columns)
cols

['Country', 'Land in km2', 'year', 'population']

In [51]:
#pivot data to compare years 2000 and 2021

total_df = total_df.pivot(index=cols[:2], columns=cols[2:3], values=cols[3:])
total_df

Country,Land in km2,population (2000),population (2021)
Afghanistan,652867.0,19542982.0,40099462.0
Albania,27398.0,3089027.0,2811666.0
Algeria,2381741.0,30774621.0,44177969.0
American Samoa,199.0,58230.0,45035.0
Andorra,468.0,66097.0,79034.0
...,...,...,...
Venezuela,882050.0,24427729.0,28199867.0
Vietnam,310070.0,79001142.0,97468029.0
Yemen,555000.0,18628700.0,32981641.0
Zambia,743398.0,9891136.0,19473125.0


### Reset Index, Delete Multi Index

In [52]:
total_df.reset_index(inplace=True)
total_df

Unnamed: 0,Country,Land in km2,population (2000),population (2021)
0,Afghanistan,652867.0,19542982.0,40099462.0
1,Albania,27398.0,3089027.0,2811666.0
2,Algeria,2381741.0,30774621.0,44177969.0
3,American Samoa,199.0,58230.0,45035.0
4,Andorra,468.0,66097.0,79034.0
...,...,...,...,...
209,Venezuela,882050.0,24427729.0,28199867.0
210,Vietnam,310070.0,79001142.0,97468029.0
211,Yemen,555000.0,18628700.0,32981641.0
212,Zambia,743398.0,9891136.0,19473125.0


In [53]:
#inspect the MultiIndex column labels before flattening them

cols = list(total_df.columns)
cols

[('Country', ''),
 ('Land in km2', ''),
 ('population', 2000),
 ('population', 2021)]

In [54]:
#build flat column labels from the MultiIndex tuples
new_cols = [cols[0][0], cols[1][0], str(cols[2][1]), str(cols[3][1])] 
new_cols

['Country', 'Land in km2', '2000', '2021']

In [55]:
# overwrite column labels 
total_df.columns = new_cols

total_df.head(10)

Unnamed: 0,Country,Land in km2,2000,2021
0,Afghanistan,652867.0,19542982.0,40099462.0
1,Albania,27398.0,3089027.0,2811666.0
2,Algeria,2381741.0,30774621.0,44177969.0
3,American Samoa,199.0,58230.0,45035.0
4,Andorra,468.0,66097.0,79034.0
5,Angola,1246700.0,16394062.0,34503774.0
6,Antigua and Barbuda,442.6,75055.0,93219.0
7,Argentina,2736690.0,37070774.0,45808747.0
8,Armenia,28342.0,3168523.0,2790974.0
9,Aruba,180.0,89101.0,106537.0
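As an alternative sketch (not the approach used above), passing `values` as a single column name rather than a list gives single-level year columns, so there is no MultiIndex to take apart. The data below is made up for illustration.

```python
import pandas as pd

# Toy long-format data (made-up values)
long = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Land in km2": [10.0, 10.0, 20.0, 20.0],
    "year": [2000, 2021, 2000, 2021],
    "population": [100.0, 150.0, 200.0, 180.0],
})

# A scalar `values` argument yields one column per year, with no outer "population" level
wide = long.pivot(index=["Country", "Land in km2"],
                  columns="year", values="population").reset_index()

# Convert the integer year labels to strings; assigning a plain list also drops the "year" axis name
wide.columns = [str(c) for c in wide.columns]
print(wide)
```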


### Calculate Population Density 2000, 2021

In [56]:
#calculate population density for 2000 and 2021
total_df["Population Density 2000 (People/km2)"] = total_df["2000"]/total_df["Land in km2"]
total_df["Population Density 2021 (People/km2)"] = total_df["2021"]/total_df["Land in km2"]

In [57]:
total_df

Unnamed: 0,Country,Land in km2,2000,2021,Population Density 2000 (People/km2),Population Density 2021 (People/km2)
0,Afghanistan,652867.0,19542982.0,40099462.0,29.934094,61.420568
1,Albania,27398.0,3089027.0,2811666.0,112.746441,102.623038
2,Algeria,2381741.0,30774621.0,44177969.0,12.921061,18.548603
3,American Samoa,199.0,58230.0,45035.0,292.613065,226.306533
4,Andorra,468.0,66097.0,79034.0,141.232906,168.876068
...,...,...,...,...,...,...
209,Venezuela,882050.0,24427729.0,28199867.0,27.694268,31.970826
210,Vietnam,310070.0,79001142.0,97468029.0,254.784861,314.342016
211,Yemen,555000.0,18628700.0,32981641.0,33.565225,59.426380
212,Zambia,743398.0,9891136.0,19473125.0,13.305303,26.194750
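As a quick check of the density arithmetic, the sketch below recomputes the first two rows from the figures shown above. Looping over the year labels is one possible way to avoid writing the same formula twice; it is an illustration, not the notebook's code.

```python
import pandas as pd

# First two rows copied from the output above
check = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Land in km2": [652867.0, 27398.0],
    "2000": [19542982.0, 3089027.0],
    "2021": [40099462.0, 2811666.0],
})

# Population density = population / land area, computed once per year column
for year in ("2000", "2021"):
    check[f"Population Density {year} (People/km2)"] = check[year] / check["Land in km2"]

print(check.round(6))
```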


### Calculate Population Density Change

In [58]:
#create list of columns for easy referencing in population density change calculations
cols = list(total_df.columns)
cols

['Country',
 'Land in km2',
 '2000',
 '2021',
 'Population Density 2000 (People/km2)',
 'Population Density 2021 (People/km2)']

### Calculate % Change in Population Density

In [59]:
#calculate the change in population density from 2000 to 2021, relative to 2000
#the result is a fraction (1.0 = 100%), not a percentage
total_df["% Change in Population Density"] = (total_df[cols[-1]]-total_df[cols[-2]])/total_df[cols[-2]]
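The formula is (density in 2021 - density in 2000) / density in 2000, so the stored value is a ratio: 1.0 means the density doubled. A small worked example, using Qatar's densities as reported in the final output of this notebook:

```python
# Qatar's densities (people per km2) as shown in the notebook's final output
density_2000 = 55.75151
density_2021 = 232.024426

# Relative change as a fraction of the 2000 value
change = (density_2021 - density_2000) / density_2000

# Multiply by 100 only when a percentage is wanted for display
print(f"fraction: {change:.5f}, percent: {change * 100:.1f}%")
```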

### Create Rank Columns by "% Change in Population Density" and "Population Density 2021"

In [60]:
# rank countries by population density % change 2000-2021 (1 = largest increase)
# method="min" gives tied countries the same (lowest) rank and skips the ranks that follow
total_df["Rank % Change"] = total_df["% Change in Population Density"].rank(method="min", ascending=False)

# rank countries by population density in 2021 (1 = most dense)
# ties are handled the same way with method="min"
total_df["Rank Population Density 2021"] = total_df["Population Density 2021 (People/km2)"].rank(method="min", ascending=False)
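To illustrate how `method="min"` treats ties, here is a small sketch on made-up values, with `method="dense"` shown for comparison.

```python
import pandas as pd

# Made-up values with one tie (the two 50.0 entries)
values = pd.Series([50.0, 30.0, 50.0, 10.0])

# "min": tied values share the lowest rank of their group, and the next rank is skipped
rank_min = values.rank(method="min", ascending=False)

# "dense": tied values share a rank, and no rank is skipped afterwards
rank_dense = values.rank(method="dense", ascending=False)

print(pd.DataFrame({"value": values, "min": rank_min, "dense": rank_dense}))
```

With `"min"`, a filter such as `rank <= 10` can return more than ten rows if there is a tie at the cutoff.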

### Export Output Data

In [61]:
#view column names
total_df.columns

Index(['Country', 'Land in km2', '2000', '2021',
       'Population Density 2000 (People/km2)',
       'Population Density 2021 (People/km2)',
       '% Change in Population Density', 'Rank % Change',
       'Rank Population Density 2021'],
      dtype='object')

In [62]:
#top 10 ranking countries for population density
top10popdensity2021 = total_df.loc[total_df["Rank Population Density 2021"] <= 10, ["Country", "Population Density 2021 (People/km2)", "Rank Population Density 2021"]].sort_values(by="Rank Population Density 2021")
top10popdensity2021

Unnamed: 0,Country,Population Density 2021 (People/km2),Rank Population Density 2021
115,Macao,24347.765957,1.0
128,Monaco,18161.386139,2.0
173,Singapore,7616.712291,3.0
86,Hong Kong,6702.622061,4.0
75,Gibraltar,5026.0,5.0
14,Bahrain,1861.660305,6.0
119,Maldives,1749.855705,7.0
121,Malta,1640.936709,8.0
15,Bangladesh,1261.893859,9.0
174,Sint Maarten,1260.176471,10.0


In [63]:
#save file path variables for easy editing

output_folderpath = r"C:\Work\Projects\Python Projects\Preppin Data\2023 Week 17_Population Growth vs Country Size\Outputs".replace("\\", "/")
output_file = "Top10_popdensity2021.xlsx"
output_filepath = output_folderpath + "/" + output_file

output_filepath

'C:/Work/Projects/Python Projects/Preppin Data/2023 Week 17_Population Growth vs Country Size/Outputs/Top10_popdensity2021.xlsx'

In [64]:
#output file
top10popdensity2021.to_excel(output_filepath, sheet_name="Top10_popdensity2021", header=True, index=False)

In [65]:
#top 10 ranking countries for change in population density 2000-2021
top10popdensitypercentchange20002021 = total_df.loc[total_df["Rank % Change"] <= 10, ["Country", "Population Density 2000 (People/km2)", "Population Density 2021 (People/km2)", "% Change in Population Density", "Rank % Change"]].sort_values(by="Rank % Change")
top10popdensitypercentchange20002021

Unnamed: 0,Country,Population Density 2000 (People/km2),Population Density 2021 (People/km2),% Change in Population Density,Rank % Change
158,Qatar,55.75151,232.024426,3.16176,1.0
203,United Arab Emirates,39.178624,112.023266,1.859296,2.0
198,Turks and Caicos Islands,43.590698,104.916279,1.40685,3.0
60,Equatorial Guinea,24.418987,58.267655,1.386162,4.0
100,Jordan,56.937614,125.540844,1.204884,5.0
104,Kuwait,108.592491,238.52924,1.196554,6.0
141,Niger,9.175547,19.935835,1.172714,7.0
5,Angola,13.149966,27.676084,1.104651,8.0
38,Chad,6.559035,13.643377,1.080089,9.0
14,Bahrain,905.142494,1861.660305,1.056759,10.0


In [66]:
#save file paths for easy editing

output_file = "Top10_popdensity_percentchange.xlsx"
output_filepath = output_folderpath + "/" + output_file

output_filepath

'C:/Work/Projects/Python Projects/Preppin Data/2023 Week 17_Population Growth vs Country Size/Outputs/Top10_popdensity_percentchange.xlsx'

In [67]:
#output file
top10popdensitypercentchange20002021.to_excel(output_filepath, sheet_name="Top10_popdensity_percentchange", header=True, index=False)
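As an aside, the output paths could also be built with `pathlib` instead of string replacement and concatenation. This is a minimal sketch: the folder `C:\Work\Outputs` is a hypothetical stand-in, and `PureWindowsPath` is used only so the example behaves the same on any operating system (on a real Windows machine, `Path` would be the usual choice).

```python
from pathlib import PureWindowsPath

# Hypothetical output folder; substitute the real project folder
output_folder = PureWindowsPath(r"C:\Work\Outputs")

# The / operator joins path parts without manual separator handling
output_path = output_folder / "Top10_popdensity_percentchange.xlsx"

# as_posix() gives the forward-slash form that the notebook builds with .replace()
print(output_path.as_posix())
```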