Prompt Category: Dataset description or manipulation
Dialogue:
User:
I need to do an exploratory data analysis on this data set, can you start by showing me how to load the data set?

Model:
Sure, here is the code to import the file:

```python
import pandas as pd

df = pd.read_csv('world-population-by-country-2020.csv')
```
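
After loading, a quick look at the shape, column names, and first rows helps confirm the file was read as expected. The sketch below uses a made-up two-row, in-memory stand-in for the CSV (the country names and values are hypothetical, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical two-row excerpt standing in for the real CSV file;
# the values are made up for illustration only.
sample_csv = io.StringIO(
    'no,Country (or dependency),Population 2020,Yearly Change\n'
    '1,Exampleland,"1,000,000",0.50 %\n'
    '2,Samplestan,"250,000",N.A.\n'
)

sample = pd.read_csv(sample_csv)

# Quick checks after loading: dimensions, column names, first rows
print(sample.shape)
print(sample.columns.tolist())
print(sample.head())
```

The same three calls can be run on df once the real file is loaded.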

User:
Now show me the code and the expected outputs for cleaning the data.

Model:
The first step to cleaning your dataset is to identify the data types present:
```python
# Identify the data types present
print(df.dtypes)
```
This code outputs:
```
no                          int64
Country (or dependency)    object
Population 2020            object
Yearly Change              object
Net Change                 object
Density (P/Km²)            object
Land Area (Km²)            object
Migrants (net)             object
Fert. Rate                 object
Med. Age                   object
Urban Pop %                object
World Share                object
dtype: object
```
Most of these columns are stored as object (text), which prevents numeric calculations, so the next step is to convert them to numeric values:
```python
# Remove thousands separators and percent signs so the values can be parsed
# (applymap is deprecated since pandas 2.1, where DataFrame.map replaces it)
df = df.applymap(lambda x: x.replace(',', '').replace('%', '') if isinstance(x, str) else x)

# Convert the value columns to numeric; unparseable entries become NaN
numeric_cols = ['Population 2020', 'Yearly Change', 'Net Change', 'Density (P/Km²)',
                'Land Area (Km²)', 'Migrants (net)', 'Fert. Rate', 'Med. Age',
                'Urban Pop %', 'World Share']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Identify missing data
missing_values = df.isna().sum()
print(missing_values)
```
This shows us that we are missing values in Migrants (net), Fert. Rate, Med. Age, and Urban Pop %:
```
no                          0
Country (or dependency)     0
Population 2020             0
Yearly Change               0
Net Change                  0
Density (P/Km²)             0
Land Area (Km²)             0
Migrants (net)             34
Fert. Rate                 34
Med. Age                   34
Urban Pop %                13
World Share                 0
```
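
Missing-value counts like these typically come from the conversion step itself: errors='coerce' turns any string that cannot be parsed as a number into NaN. A minimal sketch on made-up values, where the 'N.A.' placeholder is an assumption about how a file might mark missing data:

```python
import pandas as pd

# Made-up values; 'N.A.' is an assumed placeholder for missing data
raw = pd.Series(['1,234', '56%', 'N.A.'])

# Strip thousands separators and percent signs, as in the cleaning step
stripped = raw.str.replace(',', '', regex=False).str.replace('%', '', regex=False)

# errors='coerce' turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(stripped, errors='coerce')

print(numeric.tolist())      # [1234.0, 56.0, nan]
print(numeric.isna().sum())  # 1
```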
Our final step is to remove these NA values by doing the following:
```python
# Specify the columns with NA values
columns_with_na = ['World Share', 'Med. Age', 'Net Change', 'Fert. Rate']

# Drop rows with NA values in the specified columns
df = df.dropna(subset=columns_with_na)
```
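
Note that dropna with a subset only removes rows that have NA in the listed columns; NA values in any other column stay in the DataFrame. A small sketch on toy data (the columns 'a' and 'b' are hypothetical) shows the behavior:

```python
import numpy as np
import pandas as pd

# Made-up toy data: columns 'a' and 'b' are hypothetical
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, 5.0, 6.0],
})

# Only rows with NaN in 'a' are dropped; NaN in 'b' is left in place
kept = toy.dropna(subset=['a'])

print(len(kept))               # 2
print(kept['b'].isna().sum())  # 1
```

Running df.isna().sum() again after the drop is a simple way to see which columns still contain NA values.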

User:
Can you output the summary statistics code and results?

Input File Name: world-population-by-country-2020.csv
Golden Answer:
```
               no  Population 2020  ...  Urban Pop %  World Share
count  201.000000     2.010000e+02  ...   194.000000   201.000000
mean   101.000000     3.877661e+07  ...    59.634021     0.497463
std     58.167861     1.454245e+08  ...    22.226695     1.865678
min      1.000000     9.792900e+04  ...    13.000000     0.000000
25%     51.000000     1.886198e+06  ...    43.000000     0.020000
50%    101.000000     8.654622e+06  ...    60.000000     0.110000
75%    151.000000     2.769102e+07  ...    78.750000     0.360000
max    201.000000     1.439324e+09  ...   100.000000    18.470000
```
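
A table with these row labels (count, mean, std, min, quartiles, max) is the shape that pandas' DataFrame.describe() produces, so the code half of the request can plausibly be answered with a call like the one below. This is a sketch on made-up numbers, not the real file:

```python
import pandas as pd

# Made-up numeric data standing in for the cleaned DataFrame
toy = pd.DataFrame({
    'no': [1, 2, 3, 4],
    'Population 2020': [100.0, 200.0, 300.0, 400.0],
})

# describe() reports count, mean, std, min, quartiles, and max
# for every numeric column
summary = toy.describe()
print(summary)
```

On the cleaned data, the equivalent call would be print(df.describe()).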