In [51]:
import pandas as pd

### Dataset Overview:
**Title:** *Country_population_2023.csv*

**Source:** *World Bank Group*  
**Reference:** [World Bank - Population Data](https://data.worldbank.org/indicator/SP.POP.TOTL)

**Description:**  
This dataset presents the population figures for various countries around the globe for the year 2023. It is structured as a two-dimensional dataset, providing key demographic data on a country-by-country basis.

**Columns:**
- **Country:** The name of each country included in the dataset.
- **Population:** The total population of the respective country as recorded in 2023.

**Key Details:**
- **Coverage:** The dataset covers the countries and territories reported by the World Bank, including non-sovereign territories such as American Samoa and the U.S. Virgin Islands.
- **Time Frame:** The data is specifically for the year 2023.
- **Unit of Measure:** Population figures are presented as the total number of people.


In [52]:
df = pd.read_csv("Country_population_2023.csv")
df

Unnamed: 0,Country,Population
0,Afghanistan,42239854
1,Albania,2745972
2,Algeria,45606480
3,American Samoa,43914
4,Andorra,80088
...,...,...
212,Virgin Islands (U.S.),104917
213,West Bank and Gaza,5165775
214,"Yemen, Rep.",34449825
215,Zambia,20569737
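
Before adding derived columns, it can help to confirm that the loaded table matches the description above. The sketch below runs a few basic checks on an in-memory CSV; the rows and the `toy` name are made up for illustration and are not values from the World Bank file.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the last row has an empty population.
csv_text = """Country,Population
Country A,1000
Country B,250000
Country C,
"""

toy = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy.shape                      # (rows, columns)
missing = toy["Population"].isna().sum()        # number of empty population cells
duplicates = toy["Country"].duplicated().sum()  # number of repeated country names

print(n_rows, n_cols, missing, duplicates)
```

A missing value matters for the later cells: casting a column that contains `NaN` with `.astype(int)` raises an error, so such rows would need to be dropped or filled first.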


## Listing Column Names:

The **`df.columns`** attribute is used to retrieve and display the names of all columns in a DataFrame.

### Purpose:
* **Verify Structure:** This allows you to confirm the structure of the DataFrame, ensuring that the column names match your expectations.
* **Column Identification:** Understanding the exact column names is crucial for operations such as renaming, selecting, or manipulating specific columns within the DataFrame.


In [53]:
df.columns

Index(['Country', 'Population'], dtype='object')
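
The structure check described above can also be automated. This is a minimal sketch on a toy DataFrame, assuming the later cells rely on exactly the `Country` and `Population` columns; the `expected` list and the `toy` frame are illustrative names.

```python
import pandas as pd

# Toy frame with the same column names as the notebook's DataFrame.
toy = pd.DataFrame({"Country": ["A", "B"], "Population": [100, 200]})

expected = ["Country", "Population"]  # columns the later cells rely on
missing_cols = [c for c in expected if c not in toy.columns]

# Fail early with a readable message if the file layout has changed.
if missing_cols:
    raise KeyError(f"Missing columns: {missing_cols}")

column_names = toy.columns.tolist()  # plain Python list of the column names
print(column_names)
```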

## Going from 2D to ND

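### Adding `Mean Deviation`
Compute, for each country, the absolute difference between its population and the mean population of all countries in the dataset. Note that this is a per-country absolute deviation; the statistic usually called the mean absolute deviation is the average of these values.

**Reasons for Adding a 'Mean Deviation' Column:**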

1. **Quantifying Population Dispersion:**  
   The 'Mean Deviation' column will quantify how much the population of each country deviates from the global average. This helps in understanding the distribution of population across different countries and identifying outliers.

2. **Comparative Insights:**  
   With the 'Mean Deviation' column, you can easily compare each country's population to the global average, enabling a better understanding of how each country stands relative to others.

3. **Supporting Data Visualization:**  
   The mean deviation values can be used to create visualizations that highlight population differences, making it easier to communicate these disparities to stakeholders or in reports.


In [54]:
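# Mean population across all rows
mean_value = df['Population'].mean()
# Absolute distance from the mean, computed on the whole column at once and truncated to an integer
df["Mean Deviation"] = (df['Population'] - mean_value).abs().astype(int)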
df

Unnamed: 0,Country,Population,Mean Deviation
0,Afghanistan,42239854,5366011
1,Albania,2745972,34127870
2,Algeria,45606480,8732637
3,American Samoa,43914,36829928
4,Andorra,80088,36793754
...,...,...,...
212,Virgin Islands (U.S.),104917,36768925
213,West Bank and Gaza,5165775,31708067
214,"Yemen, Rep.",34449825,2424017
215,Zambia,20569737,16304105
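
To make the calculation easy to verify by hand, here is a small sketch on made-up numbers (three hypothetical countries, not rows from the dataset). It uses vectorized column arithmetic and also computes the single-number mean absolute deviation for comparison.

```python
import pandas as pd

# Made-up populations chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({"Country": ["A", "B", "C"], "Population": [10, 20, 60]})

mean_value = toy["Population"].mean()  # (10 + 20 + 60) / 3 = 30.0

# Per-row absolute distance from the mean: |10-30|, |20-30|, |60-30|
toy["Mean Deviation"] = (toy["Population"] - mean_value).abs().astype(int)

# Averaging those distances gives the mean absolute deviation of the column.
mad = toy["Mean Deviation"].mean()  # (20 + 10 + 30) / 3 = 20.0

print(toy)
print(mad)
```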


### Adding `Rank`

Assign a rank to each country based on its population size, with the highest population receiving the top rank. This involves ordering the countries from most to least populous.

**Reasons for Adding a 'Rank' Column:**

1. **Hierarchical Ordering:**  
   The 'Rank' column allows you to order countries based on population size, making it easier to identify the most and least populous countries at a glance.

2. **Comparative Analysis:**  
   Adding a rank helps in comparing countries relative to each other. It provides a quick reference for understanding a country's position in the global population hierarchy.

3. **Simplifying Data Interpretation:**  
   Ranking simplifies data interpretation by converting raw population figures into an ordinal format. This can make it easier to present and discuss population data in reports or visualizations.

4. **Supporting Performance Metrics:**  
   If population data for other years is added, the 'Rank' column can serve as a benchmark, for example to track how a country's rank changes over time. With the single 2023 snapshot used here, it describes relative position only.

5. **Enhanced Data Visualization:**  
   Including a rank facilitates the creation of charts and graphs that highlight differences in population more clearly, enhancing the visual impact of the data.


In [55]:
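# method="min" gives tied countries the same whole-number rank, so the int cast never truncates a fractional rank
df["Rank"] = df["Population"].rank(ascending=False, method="min").astype(int)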
df.sort_values("Rank")

Unnamed: 0,Country,Population,Mean Deviation,Rank
89,India,1428627663,1391753820,1
41,China,1410710000,1373836157,2
206,United States,334914895,298041052,3
90,Indonesia,277534122,240660279,4
149,Pakistan,240485658,203611815,5
...,...,...,...,...
183,St. Martin (French part),32077,36841765,213
27,British Virgin Islands,31538,36842304,214
150,Palau,18058,36855784,215
137,Nauru,12780,36861062,216
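
How ties are handled depends on the `method` argument of `Series.rank`. The sketch below compares three options on made-up values with one tie; the series and its labels are hypothetical.

```python
import pandas as pd

# Made-up populations; B and C are tied.
pop = pd.Series([500, 300, 300, 100], index=["A", "B", "C", "D"])

# Default method="average": tied values share the mean of their positions (2.5).
avg_rank = pop.rank(ascending=False)

# method="min": tied values share the lowest position, and the next rank is skipped.
min_rank = pop.rank(ascending=False, method="min").astype(int)

# method="dense": like "min", but no ranks are skipped after a tie.
dense_rank = pop.rank(ascending=False, method="dense").astype(int)

print(pd.DataFrame({"average": avg_rank, "min": min_rank, "dense": dense_rank}))
```

With the default method, casting to `int` would silently turn the shared rank 2.5 into 2, so choosing `"min"` or `"dense"` explicitly makes the intended tie behavior visible in the code.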


### Adding `Normalized`

Apply min-max normalization to the population values: subtract the smallest population and divide by the range (largest minus smallest). Every value then falls between 0 and 1, with the least populous country at 0 and the most populous at 1.

**Reasons for Adding a 'Normalized' Column:**

1. **Standardization Across Scales:**  
   Normalizing the population values ensures that all data points are on a common scale, making it easier to compare countries regardless of their actual population sizes.

2. **Enhanced Data Analysis:**  
   Normalized data allows for more meaningful statistical analysis, such as detecting patterns or trends that may not be apparent with raw population figures.

3. **Improved Model Performance:**  
   When used in machine learning models, normalized data can improve performance by ensuring that features are scaled similarly, which helps algorithms converge more effectively.

4. **Facilitates Visualization:**  
   Normalized values are particularly useful for creating visualizations that require data on a uniform scale, such as heatmaps or radar charts, which can make trends and comparisons more apparent.

5. **Comparative Insights:**  
   The normalized column provides a relative measure of population size, allowing for easier comparison between countries with vastly different population figures.

6. **Consistency in Reporting:**  
   Normalized values ensure consistency in reporting and analysis, especially when combining datasets or integrating with other metrics that are also normalized.


In [56]:
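# Show floats with 8 decimal places; this display setting applies to every DataFrame printed afterwards
pd.options.display.float_format = '{:.8f}'.format
# Min-max normalization: (value - min) / (max - min)
df["Normalized"] = (df["Population"] - df["Population"].min())/(df["Population"].max() - df["Population"].min())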
df.sort_values("Rank")

Unnamed: 0,Country,Population,Mean Deviation,Rank,Normalized
89,India,1428627663,1391753820,1,1.00000000
41,China,1410710000,1373836157,2,0.98745803
206,United States,334914895,298041052,3,0.23442509
90,Indonesia,277534122,240660279,4,0.19425981
149,Pakistan,240485658,203611815,5,0.16832670
...,...,...,...,...,...
183,St. Martin (French part),32077,36841765,213,0.00001448
27,British Virgin Islands,31538,36842304,214,0.00001410
150,Palau,18058,36855784,215,0.00000466
137,Nauru,12780,36861062,216,0.00000097
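
The transformation used above is $x' = (x - x_{min}) / (x_{max} - x_{min})$. As a sketch on made-up numbers (a hypothetical three-value series, not rows from the dataset), the example below applies it and then inverts it to recover the original values.

```python
import pandas as pd

# Made-up populations chosen so the arithmetic is easy to follow.
pop = pd.Series([10, 20, 60], index=["A", "B", "C"])

low = pop.min()          # 10
span = pop.max() - low   # 60 - 10 = 50

# Min-max normalization: the smallest value maps to 0 and the largest to 1.
normalized = (pop - low) / span

# The mapping is linear, so it can be undone to get the raw figures back.
restored = normalized * span + low

print(normalized)
print(restored)
```

Because the scaling is linear, it preserves the order of the countries and the relative gaps between them. If every value in a column were identical, `span` would be 0 and the division would produce `NaN`, so that case would need separate handling.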
