### Explore staging/countries_of_the_world.csv

Exploration summary:
- 227 records in total
- 20 columns, all read in as strings
    - Country and Region are genuinely strings
    - Population, Area (sq. mi.), GDP ($ per capita) and Climate are whole numbers
    - All other columns are decimals that use a comma as the decimal separator
- Country, the primary key, contains no duplicate or missing values
- Descriptive statistics look as expected (e.g. China is the most populous country)

Cleaning steps needed:
- Clean column names: remove spaces, punctuation, parentheses, etc.
- Replace the decimal comma with a period (US format)
- Remove leading and trailing whitespace from Country and Region
- Cast each column to its proper data type
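
The cleaning steps above could be sketched as a list of Spark SQL expressions passed to `selectExpr`. This is only a minimal sketch, not the cleaning job itself: the helper names `clean_column_name` and `build_select_exprs` and the snake_case naming scheme are assumptions made for illustration. Climate is listed as a whole number above; it is cast to DOUBLE here purely as a conservative assumption, since that cast also accepts whole numbers.

```python
import re

# Column groups taken from the exploration summary (raw CSV header names)
STRING_COLUMNS = {'Country', 'Region'}
INTEGER_COLUMNS = {'Population', 'Area (sq. mi.)', 'GDP ($ per capita)'}


def clean_column_name(name):
    """Lowercase a name and collapse every run of non-alphanumerics to '_'."""
    return re.sub(r'[^0-9a-z]+', '_', name.lower()).strip('_')


def build_select_exprs(columns):
    """Return one Spark SQL expression string per raw column name."""
    exprs = []
    for raw in columns:
        # Backticks let Spark SQL accept spaces, dots and parentheses in names
        quoted = '`{}`'.format(raw)
        if raw in STRING_COLUMNS:
            # Strip leading and trailing whitespace
            expr = 'trim({})'.format(quoted)
        elif raw in INTEGER_COLUMNS:
            expr = 'CAST({} AS INT)'.format(quoted)
        else:
            # Swap the decimal comma for a period before casting
            expr = "CAST(regexp_replace({}, ',', '.') AS DOUBLE)".format(quoted)
        exprs.append('{} AS {}'.format(expr, clean_column_name(raw)))
    return exprs


# Hypothetical usage, where df is the raw dataframe as read from the CSV:
# df_clean = df.selectExpr(*build_select_exprs(df.columns))
```

Building the expressions as plain strings keeps the renaming, trimming and casting rules in one place and makes them easy to inspect before they are run against the cluster.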

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, DoubleType
import pyspark.sql.functions as F

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1590937788150_0002,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
# Get SparkSession object
spark = SparkSession.builder.getOrCreate()

In [3]:
# Filename of input file
filename = 's3://data-eng-capstone-cf/staging/countries_of_the_world.csv'

In [4]:
# Read into a spark dataframe
df = spark.read.csv(filename, header=True)

In [5]:
# Print schema
df.printSchema()

root
 |-- Country: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Population: string (nullable = true)
 |-- Area (sq. mi.): string (nullable = true)
 |-- Pop. Density (per sq. mi.): string (nullable = true)
 |-- Coastline (coast/area ratio): string (nullable = true)
 |-- Net migration: string (nullable = true)
 |-- Infant mortality (per 1000 births): string (nullable = true)
 |-- GDP ($ per capita): string (nullable = true)
 |-- Literacy (%): string (nullable = true)
 |-- Phones (per 1000): string (nullable = true)
 |-- Arable (%): string (nullable = true)
 |-- Crops (%): string (nullable = true)
 |-- Other (%): string (nullable = true)
 |-- Climate: string (nullable = true)
 |-- Birthrate: string (nullable = true)
 |-- Deathrate: string (nullable = true)
 |-- Agriculture: string (nullable = true)
 |-- Industry: string (nullable = true)
 |-- Service: string (nullable = true)

In [6]:
# What does the file look like?
df.take(1)

[Row(Country='Afghanistan ', Region='ASIA (EX. NEAR EAST)         ', Population='31056997', Area (sq. mi.)='647500', Pop. Density (per sq. mi.)='48,0', Coastline (coast/area ratio)='0,00', Net migration='23,06', Infant mortality (per 1000 births)='163,07', GDP ($ per capita)='700', Literacy (%)='36,0', Phones (per 1000)='3,2', Arable (%)='12,13', Crops (%)='0,22', Other (%)='87,65', Climate='1', Birthrate='46,6', Deathrate='20,34', Agriculture='0,38', Industry='0,24', Service='0,38')]

In [7]:
# How many records?
df.count()

227

In [8]:
# How many columns?
len(df.columns)

20

In [9]:
# How many NaN/None values does the Country column (the primary key) contain?
# https://stackoverflow.com/questions/44627386/how-to-find-count-of-null-and-nan-values-for-each-column-in-a-pyspark-dataframe
df.select([F.sum((F.isnan(c) | F.col(c).isNull()).cast(IntegerType())).alias(c) for c in df.columns if c == 'Country']).show()

+-------+
|Country|
+-------+
|      0|
+-------+
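
The cell above restricts the check to Country. A possible way to extend it to every column is sketched below; the helper name `null_count_exprs` is hypothetical, and the sketch relies on `count(column)` skipping nulls while `count(*)` counts all rows. It only detects nulls, not empty or whitespace-only strings.

```python
def null_count_exprs(columns):
    """Build one Spark SQL expression per column that counts its null values."""
    # count(*) counts every row; count(col) counts only the non-null ones
    return ['count(*) - count(`{0}`) AS `{0}`'.format(c) for c in columns]


# Hypothetical usage, where df is the dataframe read from the CSV:
# df.selectExpr(*null_count_exprs(df.columns)).show()
```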

In [10]:
# Check that Country values are unique
df.select('Country').count() == df.select('Country').distinct().count()

True

In [11]:
# Which regions have the most countries?
df.groupBy('Region').count().orderBy('count', ascending=False).show(truncate=False)

+-----------------------------------+-----+
|Region                             |count|
+-----------------------------------+-----+
|SUB-SAHARAN AFRICA                 |51   |
|LATIN AMER. & CARIB                |45   |
|ASIA (EX. NEAR EAST)               |28   |
|WESTERN EUROPE                     |28   |
|OCEANIA                            |21   |
|NEAR EAST                          |16   |
|EASTERN EUROPE                     |12   |
|C.W. OF IND. STATES                |12   |
|NORTHERN AFRICA                    |6    |
|NORTHERN AMERICA                   |5    |
|BALTICS                            |3    |
+-----------------------------------+-----+

In [12]:
# Most populous countries
df = df.withColumn('Population', F.col('Population').cast(IntegerType()))
df.select('Country', 'Population').orderBy('Population', ascending=False).show()

+-----------------+----------+
|          Country|Population|
+-----------------+----------+
|           China |1313973713|
|           India |1095351995|
|   United States | 298444215|
|       Indonesia | 245452739|
|          Brazil | 188078227|
|        Pakistan | 165803560|
|      Bangladesh | 147365352|
|          Russia | 142893540|
|         Nigeria | 131859731|
|           Japan | 127463611|
|          Mexico | 107449525|
|     Philippines |  89468677|
|         Vietnam |  84402966|
|         Germany |  82422299|
|           Egypt |  78887007|
|        Ethiopia |  74777981|
|          Turkey |  70413958|
|            Iran |  68688433|
|        Thailand |  64631595|
|Congo, Dem. Rep. |  62660551|
+-----------------+----------+
only showing top 20 rows

In [13]:
# Countries with largest areas
df = df.withColumnRenamed('Area (sq. mi.)', 'area')
df = df.withColumn('area', F.col('area').cast(IntegerType()))
df.select('Country', 'area').orderBy('area', ascending=False).show()

+-----------------+--------+
|          Country|    area|
+-----------------+--------+
|          Russia |17075200|
|          Canada | 9984670|
|   United States | 9631420|
|           China | 9596960|
|          Brazil | 8511965|
|       Australia | 7686850|
|           India | 3287590|
|       Argentina | 2766890|
|      Kazakhstan | 2717300|
|           Sudan | 2505810|
|         Algeria | 2381740|
|Congo, Dem. Rep. | 2345410|
|       Greenland | 2166086|
|          Mexico | 1972550|
|    Saudi Arabia | 1960582|
|       Indonesia | 1919440|
|           Libya | 1759540|
|            Iran | 1648000|
|        Mongolia | 1564116|
|            Peru | 1285220|
+-----------------+--------+
only showing top 20 rows

In [14]:
# Richest countries
df = df.withColumnRenamed('GDP ($ per capita)', 'gdp')
df = df.withColumn('gdp', F.col('gdp').cast(IntegerType()))
df.select('Country', 'gdp').orderBy('gdp', ascending=False).show()

+---------------+-----+
|        Country|  gdp|
+---------------+-----+
|    Luxembourg |55100|
| United States |37800|
|        Norway |37800|
|       Bermuda |36000|
|Cayman Islands |35000|
|    San Marino |34600|
|   Switzerland |32700|
|       Denmark |31100|
|       Iceland |30900|
|       Austria |30000|
|        Canada |29800|
|       Ireland |29600|
|       Belgium |29100|
|     Australia |29000|
|     Hong Kong |28800|
|   Netherlands |28600|
|         Japan |28200|
|         Aruba |28000|
|United Kingdom |27700|
|        France |27600|
+---------------+-----+
only showing top 20 rows

In [15]:
# Countries with the most phones per 1000 people
df = df.withColumn('phones', F.regexp_replace('Phones (per 1000)', ',', '.').cast(DoubleType()))
df.select('Country', 'phones').orderBy('phones', ascending=False).show()

+--------------------+------+
|             Country|phones|
+--------------------+------+
|             Monaco |1035.6|
|      United States | 898.0|
|          Gibraltar | 877.7|
|            Bermuda | 851.4|
|           Guernsey | 842.4|
|     Cayman Islands | 836.3|
|             Jersey | 811.3|
|             Sweden | 715.0|
|         San Marino | 704.3|
|St Pierre & Mique...| 683.2|
|        Switzerland | 680.9|
|        Isle of Man | 676.0|
|            Germany | 667.9|
|     Virgin Islands | 652.8|
|            Iceland | 647.7|
|Saint Kitts & Nevis | 638.9|
|            Denmark | 614.6|
|             Taiwan | 591.0|
|             Greece | 589.7|
|             France | 586.4|
+--------------------+------+
only showing top 20 rows