
In [7]:
!ls

 sample_data  'Unit04 Global_GDP.csv'  'Unit04 Global_Population.csv'


**Loading the CSV files**

In [8]:
import pandas as pd

# df1 holds the GDP table, df2 the population table
df1 = pd.read_csv('Unit04 Global_GDP.csv')
df2 = pd.read_csv('Unit04 Global_Population.csv')

**Inspecting the data**

In [9]:
print(df1.head())
print(df2.head())

                  Country Name Country Code     Indicator Name  \
0                        Aruba          ABW  GDP (current US$)   
1  Africa Eastern and Southern          AFE  GDP (current US$)   
2                  Afghanistan          AFG  GDP (current US$)   
3   Africa Western and Central          AFW  GDP (current US$)   
4                       Angola          AGO  GDP (current US$)   

   Indicator Code          1960          1961          1962          1963  \
0  NY.GDP.MKTP.CD           NaN           NaN           NaN           NaN   
1  NY.GDP.MKTP.CD  1.929944e+10  1.970954e+10  2.147872e+10  2.571501e+10   
2  NY.GDP.MKTP.CD  5.377778e+08  5.488889e+08  5.466667e+08  7.511112e+08   
3  NY.GDP.MKTP.CD  1.040428e+10  1.112805e+10  1.194335e+10  1.267652e+10   
4  NY.GDP.MKTP.CD           NaN           NaN           NaN           NaN   

           1964          1965  ...          2011          2012          2013  \
0           NaN           NaN  ...  2.549721e+09  2.534637e+

**Data overview**

These datasets contain GDP and population data for a wide range of countries and regional aggregates (for example, "Africa Eastern and Southern"). Judging by the column counts printed in the missing-value check further down, the GDP file covers 1960 to 2020 and the population file covers 1960 to 2021. GDP values are measured in current US dollars and reflect annual economic output, while population figures record total inhabitants per year. Population data is largely complete, with a few gaps such as missing entries for Aruba in 2020 and 2021. GDP data has far more missing values, especially for smaller economies and earlier years, so it needs careful preprocessing before analysis.
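
As a quick check on the coverage described above, the sketch below reports the first and last year column and the overall share of missing cells for a wide-format table. The small `gdp_toy` frame and the helper `year_coverage` are hypothetical stand-ins rather than the real files; the same call could be applied to `df1` and `df2`, assuming their year columns are named with plain digit strings.

```python
import pandas as pd

# Hypothetical toy table in the same wide layout (one column per year).
gdp_toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1960": [None, 1.0],
    "1961": [2.0, 3.0],
})

def year_coverage(df):
    """Return (first_year, last_year, share_of_missing_cells) for the year columns."""
    years = [c for c in df.columns if str(c).isdigit()]
    first = min(int(c) for c in years)
    last = max(int(c) for c in years)
    # Mean of a boolean array gives the fraction of True (missing) cells.
    missing_share = float(df[years].isna().to_numpy().mean())
    return first, last, missing_share

coverage = year_coverage(gdp_toy)
print(coverage)
```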

**Identifying missing values**

In [10]:
print("Missing values in GDP data:")
print(df1.isnull().sum())

print("\nMissing values in Population data:")
print(df2.isnull().sum())

Missing values in GDP data:
Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
1960              138
                 ... 
2016               10
2017               10
2018               10
2019               13
2020               25
Length: 65, dtype: int64

Missing values in Population data:
Country Name    3
Country Code    5
Series Name     5
Series Code     5
1960            5
               ..
2017            5
2018            5
2019            5
2020            6
2021            6
Length: 66, dtype: int64
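
In the population output, nearly every column is missing the same five entries, which suggests a handful of rows that are empty or contain only notes. That reading is an assumption, not something the output proves, so it is worth looking at those rows before deciding how to treat them. A minimal sketch on a hypothetical toy frame (`pop_toy` and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy table: two real rows, one empty row, one note-only row.
pop_toy = pd.DataFrame({
    "Country Name": ["A", "B", None, "some footer text"],
    "Country Code": ["AAA", "BBB", None, None],
    "1960": [100.0, 200.0, None, None],
    "2021": [None, 250.0, None, None],
})

year_cols = [c for c in pop_toy.columns if c.isdigit()]

# Rows without a country code are candidates for non-data rows; inspect them first.
suspect_rows = pop_toy[pop_toy["Country Code"].isna()]
print(suspect_rows)

# If they turn out to be non-data rows, drop them instead of imputing values.
cleaned = pop_toy.dropna(subset=["Country Code"]).reset_index(drop=True)
print(cleaned[year_cols].isna().sum())
```

Dropping such rows first would keep them from being filled with column means and the "Unknown" placeholder later on.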


**Filling missing values**


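The cell below fills missing values in the year columns of both datasets. It selects the year columns (those whose names are purely numeric) and fills missing GDP values with each year's column mean. For the population data it first converts the year columns to numeric, turning any non-numeric entries into NaN, and then fills those with the column mean as well. Note that the column mean is taken across all rows, including regional aggregates, so it is only a coarse estimate for any single country. Finally, the cell prints the remaining missing-value counts: the year columns are now complete, while the text columns of the population data still contain gaps.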

In [20]:
# extracting the GDP year columns (column names that are purely numeric)
year_columns = df1.columns[df1.columns.str.isnumeric()]

# fill each year's missing GDP values with that year's mean across all rows
df1[year_columns] = df1[year_columns].fillna(df1[year_columns].mean())

print("Missing values in GDP data after filling:")
print(df1.isnull().sum())

# extracting the population year columns
year_columns = df2.columns[df2.columns.str.isnumeric()]

# coerce to numeric so that non-numeric entries become NaN, then fill with the column mean
df2[year_columns] = df2[year_columns].apply(pd.to_numeric, errors='coerce')
df2[year_columns] = df2[year_columns].fillna(df2[year_columns].mean())
print(df2.isnull().sum())

Missing values in GDP data after filling:
Country Name      0
Country Code      0
Indicator Name    0
Indicator Code    0
1960              0
                 ..
2016              0
2017              0
2018              0
2019              0
2020              0
Length: 65, dtype: int64
Country Name    3
Country Code    5
Series Name     5
Series Code     5
1960            0
               ..
2017            0
2018            0
2019            0
2020            0
2021            0
Length: 66, dtype: int64
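
Filling with the column mean gives every country the same value for a given year, regardless of its size. One possible alternative, shown here only as a sketch and not as the method used in this notebook, is to interpolate along each row so that a gap is estimated from the same country's neighbouring years. The frame `wide_toy` is hypothetical; `limit_direction="both"` also fills gaps at the start or end of a row using the nearest available value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one row per country, one numeric column per year.
wide_toy = pd.DataFrame({
    "1960": [np.nan, 10.0],
    "1961": [2.0, np.nan],
    "1962": [4.0, 30.0],
})

# Linear interpolation across the columns (axis=1), i.e. within each country.
filled = wide_toy.interpolate(axis=1, limit_direction="both")
print(filled)
```

Rows with no valid values at all would remain NaN under this approach and would still need a separate decision.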


**Handling non-numeric missing values**

After filling the year columns, the population data still had missing values in its text columns (Country Name, Country Code, Series Name, and Series Code). To address this, I replaced them with the placeholder "Unknown". This keeps every row in the dataset, so no data is lost before further analysis.

In [22]:
df2 = df2.fillna("Unknown")

print("Non-numeric missing values handled in Population data:")
print(df2.isnull().sum())

Non-numeric missing values handled in Population data:
Country Name    0
Country Code    0
Series Name     0
Series Code     0
1960            0
               ..
2017            0
2018            0
2019            0
2020            0
2021            0
Length: 66, dtype: int64
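
Calling `fillna("Unknown")` on the whole frame is safe here only because the year columns were already filled; had any NaN remained in a year column, it would have received the string too and the column would no longer be purely numeric. A more defensive variant, sketched on a hypothetical toy frame (`labels_toy` is invented for illustration), limits the placeholder to the non-year columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with gaps in both a text column and a year column.
labels_toy = pd.DataFrame({
    "Country Name": ["A", None],
    "Country Code": ["AAA", None],
    "1960": [1.0, np.nan],
})

# Treat every column whose name is not a year as a label column.
label_cols = [c for c in labels_toy.columns if not c.isdigit()]

# Fill only the label columns; year columns keep their numeric dtype.
labels_toy[label_cols] = labels_toy[label_cols].fillna("Unknown")
print(labels_toy)
print(labels_toy.dtypes)
```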
