# Data Preprocessing

This notebook contains data preprocessing steps, including exploration, data cleaning, and feature generation.

Input: Kaggle World Cities Dataset, loaded from data/raw/worldcitiespop.csv

Output: Preprocessed dataset, saved to data/processed/world_cities_processed.csv

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

DATA_RAW = Path("../data/raw/worldcitiespop.csv")
DATA_PROC = Path("../data/processed")
DATA_PROC.mkdir(parents=True, exist_ok=True)

In [4]:
df = pd.read_csv(DATA_RAW)
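
The Region column shows up as `object` dtype in this notebook even though its values look numeric (`6.0`), which suggests the raw file mixes numeric and text codes. A minimal sketch, assuming region codes should be treated as strings, of passing an explicit `dtype` to `pd.read_csv`. The in-memory CSV is made-up sample data, not the real file.

```python
import io

import pandas as pd

# Made-up sample that mimics the raw file's columns (not real data).
sample_csv = io.StringIO(
    "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"
    "xx,testville,Testville,06,,10.0,20.0\n"
    "yy,sampleton,Sampleton,A1,1000,30.0,40.0\n"
)

# Assumption: region codes are identifiers, so read them as strings
# instead of letting pandas infer a mixed numeric/text column.
sample = pd.read_csv(sample_csv, dtype={"Region": str})

print(sample["Region"].tolist())
print(sample.dtypes)
```

If that assumption holds for the real file, the same `dtype` argument could be added to the `pd.read_csv(DATA_RAW)` call.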


## Exploration

In [5]:
print("Initial Data Shape:", df.shape)
df.head()

Initial Data Shape: (3173958, 7)


| | Country | City | AccentCity | Region | Population | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | ad | aixas | Aixàs | 6.0 | NaN | 42.483333 | 1.466667 |
| 1 | ad | aixirivali | Aixirivali | 6.0 | NaN | 42.466667 | 1.5 |
| 2 | ad | aixirivall | Aixirivall | 6.0 | NaN | 42.466667 | 1.5 |
| 3 | ad | aixirvall | Aixirvall | 6.0 | NaN | 42.466667 | 1.5 |
| 4 | ad | aixovall | Aixovall | 6.0 | NaN | 42.466667 | 1.483333 |


In [6]:
df.describe()

| | Population | Latitude | Longitude |
|---|---|---|---|
| count | 47980.0 | 3173958.0 | 3173958.0 |
| mean | 47719.57 | 27.18817 | 37.08886 |
| std | 302888.7 | 21.95262 | 63.22302 |
| min | 7.0 | -54.93333 | -179.9833 |
| 25% | 3732.0 | 11.63333 | 7.303176 |
| 50% | 10779.0 | 32.49722 | 35.28 |
| 75% | 27990.5 | 43.71667 | 95.70354 |
| max | 31480500.0 | 82.48333 | 180.0 |


In [8]:
df.describe(include='all')

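| | Country | City | AccentCity | Region | Population | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| count | 3173958 | 3173952 | 3173954 | 3173950.0 | 47980.0 | 3173958.0 | 3173958.0 |
| unique | 234 | 2351891 | 2375759 | 490.0 | NaN | NaN | NaN |
| top | cn | san jose | San Antonio | 7.0 | NaN | NaN | NaN |
| freq | 238985 | 328 | 317 | 93917.0 | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | 47719.57 | 27.18817 | 37.08886 |
| std | NaN | NaN | NaN | NaN | 302888.7 | 21.95262 | 63.22302 |
| min | NaN | NaN | NaN | NaN | 7.0 | -54.93333 | -179.9833 |
| 25% | NaN | NaN | NaN | NaN | 3732.0 | 11.63333 | 7.303176 |
| 50% | NaN | NaN | NaN | NaN | 10779.0 | 32.49722 | 35.28 |
| 75% | NaN | NaN | NaN | NaN | 27990.5 | 43.71667 | 95.70354 |
| max | NaN | NaN | NaN | NaN | 31480500.0 | 82.48333 | 180.0 |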


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3173958 entries, 0 to 3173957
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   country      object 
 1   city         object 
 2   accent_city  object 
 3   region       object 
 4   population   float64
 5   lat          float64
 6   lon          float64
dtypes: float64(3), object(4)
memory usage: 169.5+ MB


In [10]:
df.isna().sum()

Country             0
City                6
AccentCity          4
Region              8
Population    3125978
Latitude            0
Longitude           0
dtype: int64

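The data contains NaN values, mostly in Population (3,125,978 of 3,173,958 rows), plus a few in City, AccentCity, and Region. City names also repeat (for example, san jose appears 328 times), so duplicate entries need to be handled.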
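
The checks above can be condensed into a per-column missing share and a count of rows whose (Country, City) pair occurs more than once. This is a minimal sketch on a made-up toy frame; the names `toy`, `missing_share`, and `n_dup_rows` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up toy frame using the raw column names (not real data).
toy = pd.DataFrame({
    "Country": ["aa", "aa", "bb", "bb"],
    "City": ["alpha", "alpha", "alpha", None],
    "Population": [100.0, None, None, None],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Rows whose (Country, City) pair appears more than once.
# keep=False marks every member of a duplicated group, not just the repeats.
dup_mask = toy.duplicated(subset=["Country", "City"], keep=False)
n_dup_rows = int(dup_mask.sum())

print(missing_share)
print("Rows in duplicated (Country, City) groups:", n_dup_rows)
```

Checking duplicates on the country and city pair, rather than on the city name alone, avoids counting same-named cities in different countries as duplicates.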

## Preprocessing

In [11]:
df = df.rename(columns={
    'Country': 'country',
    'City': 'city',
    'AccentCity': 'accent_city',
    'Region': 'region',
    'Latitude': 'lat',
    'Longitude': 'lon',
    'Population': 'population'
})

In [13]:
# Drop rows with NaN in the critical columns (names and coordinates)
df = df.dropna(subset=['city', 'accent_city', 'lat', 'lon'])

# Normalize country and city names: strip whitespace and lowercase
df['country'] = df['country'].str.strip().str.lower()
df['city'] = df['city'].str.strip().str.lower()
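
To illustrate what this cleaning step does, here is a minimal sketch on a made-up toy frame with the renamed columns. The `.copy()` call and the loop over columns are choices made for this example, not something the notebook itself does.

```python
import pandas as pd

# Made-up toy frame with the renamed columns (not real data).
toy = pd.DataFrame({
    "country": ["AA ", "bb", "cc"],
    "city": [" Alpha", None, "Gamma"],
    "accent_city": ["Alpha", "Beta", "Gamma"],
    "lat": [1.0, 2.0, 3.0],
    "lon": [4.0, 5.0, 6.0],
})

# Drop rows missing any critical field; copy so later assignments
# operate on an independent frame.
cleaned = toy.dropna(subset=["city", "accent_city", "lat", "lon"]).copy()

# Strip surrounding whitespace and lowercase the key columns.
for col in ["country", "city"]:
    cleaned[col] = cleaned[col].str.strip().str.lower()

print(cleaned[["country", "city"]])
```

The row with a missing city is dropped, and the remaining names are normalized so that the later duplicate check compares like with like.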

In [14]:
# Drop duplicates, keeping the entry with the highest population
# (NaN populations sort last, so rows with a known population are preferred)
df = df.sort_values('population', ascending=False)
df = df.drop_duplicates(subset=['country', 'city'], keep='first')
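
The sort-then-deduplicate pattern relies on `sort_values` placing NaN last by default (`na_position="last"`), even with `ascending=False`. A minimal sketch on a made-up toy frame showing which row survives in each group:

```python
import pandas as pd

# Made-up toy frame (not real data): three entries for the same city,
# one of them without a population, plus a city with no population at all.
toy = pd.DataFrame({
    "country": ["aa", "aa", "aa", "bb"],
    "city": ["alpha", "alpha", "alpha", "beta"],
    "population": [None, 500.0, 900.0, None],
})

# Sort descending by population; NaN rows go to the end by default,
# so keep="first" retains the largest known population per (country, city).
deduped = (
    toy.sort_values("population", ascending=False)
       .drop_duplicates(subset=["country", "city"], keep="first")
)

print(deduped)
```

A city whose entries all lack a population is still kept once, with its population left as NaN.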

In [15]:
df.shape

(2611430, 7)

In [16]:
df.head()

| | country | city | accent_city | region | population | lat | lon |
|---|---|---|---|---|---|---|---|
| 1544449 | jp | tokyo | Tokyo | 40.0 | 31480498.0 | 35.685 | 139.751389 |
| 570824 | cn | shanghai | Shanghai | 23.0 | 14608512.0 | 31.045556 | 121.399722 |
| 1327914 | in | bombay | Bombay | 16.0 | 12692717.0 | 18.975 | 72.825833 |
| 2200161 | pk | karachi | Karachi | 5.0 | 11627378.0 | 24.9056 | 67.0822 |
| 1331162 | in | delhi | Delhi | 7.0 | 10928270.0 | 28.666667 | 77.216667 |
