# Q5: Data Visualization


Link to the recorded video: [video](https://youtu.be/d9rjh6-Qhfs)

## Problems faced and how they were handled

1. Different datasets had different numbers of countries. This was solved by only considering countries present in all 3 datasets.

2. To color the data points by continent, I had to use the gapminderData.csv dataset, since the other 3 datasets contain no continent information. gapminderData.csv covers only around 140 countries, so the remaining countries (around 40-50) were assigned the continent 'Unavailable'. This was acceptable because most of those countries had little effect on the visualization.

3. Different datasets had data for different time periods. This was solved by only considering data from 1962-2019, which existed for all 3 datasets.

4. Initially I tried to use CO2 emissions and GDP growth as the features, with Total GDP determining the size of the markers. However, the resulting plots did not reveal a clear relationship or pattern. I handled this by trying different features and figuring out which relationships were clearer and more interpretable.

## Download Datasets

In the following code cell, we download the datasets which I have uploaded on a public GitHub repository.

Datasets used:

1. For the x-axis: Child Mortality (0-5 year-olds dying per 1000 born) from [Gapminder Website](https://www.gapminder.org/data/)

2. For the y-axis: Population Growth (Annual %) from [Gapminder Website](https://www.gapminder.org/data/)

3. For the sizes of the markers: Total Population from [Gapminder Website](https://www.gapminder.org/data/)

4. For a mapping between countries and continents: [gapminderData.csv](https://python-graph-gallery.com/wp-content/uploads/gapminderData.csv)


In [None]:
import requests 

urls = ["https://raw.githubusercontent.com/frank-chris/CS-328-Assignments/main/HW-1/child_mortality.csv",
        "https://raw.githubusercontent.com/frank-chris/CS-328-Assignments/main/HW-1/gapminderData.csv",
        "https://raw.githubusercontent.com/frank-chris/CS-328-Assignments/main/HW-1/population_growth.csv",
        "https://raw.githubusercontent.com/frank-chris/CS-328-Assignments/main/HW-1/population_total.csv"]

filenames = ["child_mortality.csv", "gapminderData.csv", "population_growth.csv", "population_total.csv"]

for url, filename in zip(urls, filenames):
    r = requests.get(url)
    # Stop early on a failed download instead of saving an error page as a CSV
    r.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(r.content)


## Reading the csv files into DataFrames

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px


population = pd.read_csv('population_total.csv')
child_mortality = pd.read_csv('child_mortality.csv')
pop_growth = pd.read_csv('population_growth.csv')
continent_mapping = pd.read_csv('gapminderData.csv')

## Processing the dataset before plotting

In the following code cell, we remove data for years before 1962 and build a DataFrame that maps countries to continents from the gapminderData.csv dataset.

In [None]:
continent_mapping = continent_mapping[['country', 'continent']].drop_duplicates()
population.drop([str(i) for i in range(1800, 1962)], axis = 1, inplace=True) 
child_mortality.drop([str(i) for i in range(1800, 1962)], axis = 1, inplace=True) 
pop_growth.drop([str(i) for i in range(1961, 1962)], axis = 1, inplace=True) 
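
The year columns can also be selected by range rather than dropped by hard-coded spans, which avoids having to know the first year of each file. The following is only a sketch, assuming wide-format frames with a `country` column and year columns named as strings; the helper `keep_years` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def keep_years(df, first_year, last_year):
    """Return 'country' plus the year columns within [first_year, last_year]."""
    year_cols = [col for col in df.columns
                 if col.isdigit() and first_year <= int(col) <= last_year]
    return df[['country'] + year_cols]

# Toy wide-format frame (made-up values) with years on both sides of the range
toy = pd.DataFrame({'country': ['A', 'B'],
                    '1961': [1.0, 2.0],
                    '1962': [1.1, 2.1],
                    '2019': [1.5, 2.5],
                    '2020': [1.6, 2.6]})

trimmed = keep_years(toy, 1962, 2019)
print(list(trimmed.columns))
```

Because this returns a new DataFrame instead of modifying one in place, the same call works for all three datasets regardless of which years each one contains.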

The following code cell finds the countries that are missing from at least one of the 3 datasets. We only use data for countries that exist in all three datasets.

In [None]:
a = set(child_mortality.country).symmetric_difference(set(pop_growth.country))
b = set(child_mortality.country).symmetric_difference(set(population.country))
c = (set(population.country).symmetric_difference(set(pop_growth.country))).union(a, b)

print(c)

{'Holy See'}
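
The union of pairwise symmetric differences used above is equivalent to "countries in at least one dataset, minus countries in every dataset". A small sketch of that idea with plain Python sets; the function names and the toy country lists are hypothetical, chosen only for illustration.

```python
def common_countries(*country_lists):
    """Return the countries that appear in every list."""
    sets = [set(countries) for countries in country_lists]
    return set.intersection(*sets)

def missing_somewhere(*country_lists):
    """Return the countries that appear in at least one list but not in all."""
    sets = [set(countries) for countries in country_lists]
    return set.union(*sets) - set.intersection(*sets)

# Toy country lists standing in for the `country` columns of the three datasets
mortality = ['India', 'Kenya', 'Peru']
growth = ['India', 'Kenya', 'Peru', 'Holy See']
total = ['India', 'Kenya', 'Peru', 'Holy See']

print(missing_somewhere(mortality, growth, total))
print(sorted(common_countries(mortality, growth, total)))
```

This form extends to any number of datasets without writing out each pair by hand.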


In the following code cell, we count the number of NaNs and the number of rows containing NaNs in each dataset. We remove countries with NaNs because there are very few of them, and imputing their values would not add much to the visualization.

z is the set of the countries to be dropped. It also includes the countries that are not present in all three datasets.

In [None]:
print(population.shape[0] - population.dropna().shape[0])
print(sum(population.isnull().values.ravel()))

x = set(population[population.isna().any(axis=1)].country)

print(child_mortality.shape[0] - child_mortality.dropna().shape[0])
print(sum(child_mortality.isnull().values.ravel()))

y = set(child_mortality[child_mortality.isna().any(axis=1)].country)

print(pop_growth.shape[0] - pop_growth.dropna().shape[0])
print(sum(pop_growth.isnull().values.ravel()))

z = set(pop_growth[pop_growth.isna().any(axis=1)].country).union(x, y, c)

print(z, len(z))

0
0
0
0
5
70
{'Palestine', 'Holy See', 'Kuwait', 'Serbia', 'New Zealand', 'Eritrea'} 6
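
For reference, the two counts printed for each dataset (rows containing a NaN, then total NaN cells) can be reproduced on a toy frame. This is a minimal sketch with made-up values; the variable names are hypothetical and the pandas calls are an equivalent, slightly shorter way to get the same numbers.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame: country B has two missing cells, country C has one
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    '1962': [0.01, np.nan, 0.03],
                    '1963': [0.02, np.nan, np.nan]})

rows_with_nan = toy.isna().any(axis=1)          # boolean mask, one entry per row
n_rows_with_nan = int(rows_with_nan.sum())      # number of rows with at least one NaN
n_nan_cells = int(toy.isna().sum().sum())       # total number of NaN cells
countries_to_drop = set(toy.loc[rows_with_nan, 'country'])

print(n_rows_with_nan, n_nan_cells, sorted(countries_to_drop))
```

In the output above, the pair `5` and `70` for pop_growth means 5 countries account for all 70 missing cells.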


In the following code cell, we drop the 6 countries identified above, and also drop data for years after 2019, since one of the datasets only has data up to 2019. We also run a sanity check to make sure the 3 datasets now contain the same set of countries.

In [None]:
population = population[~population.country.isin(z)] 
child_mortality = child_mortality[~child_mortality.country.isin(z)] 
pop_growth = pop_growth[~pop_growth.country.isin(z)] 

population.drop([str(i) for i in range(2020, 2101)], axis = 1, inplace=True) 
child_mortality.drop([str(i) for i in range(2020, 2101)], axis = 1, inplace=True) 

print(set(population.country) == set(child_mortality.country))
print(set(population.country) == set(pop_growth.country))
print(set(child_mortality.country) == set(pop_growth.country))

True
True
True


## Plotting

In the following code cell, we first melt the DataFrames into long format, combine the 3 datasets into a single DataFrame (named plot_df), and then plot it using plotly.

In [None]:

population = population.melt(id_vars=['country'])
population = population.sort_values(by =['country', 'variable']) 

pop_growth = pop_growth.melt(id_vars=['country'])
pop_growth = pop_growth.sort_values(by =['country', 'variable']) 

child_mortality = child_mortality.melt(id_vars=['country'])
child_mortality = child_mortality.sort_values(by =['country', 'variable']) 

plot_df = pd.DataFrame()

plot_df['country'] = population['country']
plot_df['year'] = population['variable']
plot_df['population'] = population['value']
plot_df['pop_growth'] = pop_growth['value']
plot_df['child_mortality'] = child_mortality['value']

# Map each country to its continent; countries missing from gapminderData.csv
# are labelled 'Unavailable'
country_to_continent = dict(zip(continent_mapping.country, continent_mapping.continent))
plot_df['continent'] = plot_df['country'].map(country_to_continent).fillna('Unavailable')

fig = px.scatter(plot_df, x="child_mortality", y="pop_growth", animation_frame="year", animation_group="country",
            size="population", color="continent", hover_name="country",
            size_max=55, range_x=[-5, 400], range_y=[-0.02, 0.05], opacity=0.95,
            labels={ "child_mortality": "Child Mortality (0-5 year-olds dying per 1000 born)",
                    "pop_growth": "Population Growth (Annual %)",
                    "continent": "Continent"},
            title="Population Growth vs Child Mortality from 1962 to 2019",
            color_discrete_sequence=['#636EFA', '#FFA15A', '#00CC96', '#AB63FA', '#D62728', '#19D3F3', '#FF6692', '#B6E880', '#FF97FF', '#FECB52'])
fig.show()
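
The cell above combines the three melted DataFrames by assigning columns, which relies on the rows lining up by index. A more defensive alternative is to merge on the `country` and `year` keys. The following is a sketch under assumptions, not the method used in this notebook: the `to_long` helper and all `toy_` frames are hypothetical, and the values are made up.

```python
import pandas as pd

def to_long(wide, value_name):
    """Melt a wide frame (country + year columns) into country/year/value rows."""
    return wide.melt(id_vars=['country'], var_name='year', value_name=value_name)

toy_population = pd.DataFrame({'country': ['A', 'B'],
                               '1962': [100, 200], '1963': [110, 210]})
# Deliberately listed in a different row order to show that key-based merging still lines up
toy_pop_growth = pd.DataFrame({'country': ['B', 'A'],
                               '1962': [0.02, 0.01], '1963': [0.025, 0.015]})
toy_child_mortality = pd.DataFrame({'country': ['A', 'B'],
                                    '1962': [50.0, 80.0], '1963': [48.0, 77.0]})
# Country B has no continent entry, mirroring countries absent from gapminderData.csv
toy_continents = pd.DataFrame({'country': ['A'], 'continent': ['Asia']})

toy_plot_df = (to_long(toy_population, 'population')
               .merge(to_long(toy_pop_growth, 'pop_growth'), on=['country', 'year'])
               .merge(to_long(toy_child_mortality, 'child_mortality'), on=['country', 'year'])
               .merge(toy_continents, on='country', how='left')
               .sort_values(['country', 'year'])
               .reset_index(drop=True))
toy_plot_df['continent'] = toy_plot_df['continent'].fillna('Unavailable')

print(toy_plot_df)
```

With key-based merges, each value is attached to the right country and year even if the source files list countries in different orders.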