# Data Analysis Team Project

<b>Team 8:</b> Abdalla Mohamed Abdalla Al-Zaro, Frida Gallardo Islas, Souleymane Abdoul Aziz Meite, Yue Xu


Eastern Europe is a loosely defined term for the eastern part of Europe, commonly bounded by the Baltic Sea to the north, the Black Sea to the south, and the Ural Mountains to the east.

More than 99% of the population in Eastern Europe is Caucasian, and most of the ethnic groups are Slavic. The principal religion in this region is Orthodox Christianity; the second most common is Islam, whose followers are concentrated in countries such as Albania.
The landform of Eastern Europe is relatively simple, dominated by the East European Plain. The land is rich in natural resources, most of which are distributed across this plain.
Most Eastern European countries have a temperate climate: winters are severely cold, summers are warm, spring and autumn are short, and the annual temperature range is large. Summers are rainy, with moisture brought in mainly from the North Atlantic by westerly winds, and rainfall is about 500 mm.

Historically, Eastern European countries were influenced by communist and socialist movements after World War II, which resulted in a shared history and similar problems such as poverty, corruption, and pollution. After the drastic changes in Europe and the disintegration of the Soviet Union, all Eastern European countries adopted the capitalist system, and some of them joined the European Union and NATO. Despite these changes, Eastern European countries still lag far behind the rest of Europe on most economic indicators, which results in a lower standard of living than in their Western European counterparts.

One of the long-lasting effects of Soviet-era influence on Eastern Europe is the high level of contamination and pollution resulting from the lack of industrial control. In the 1990s, Eastern Europe was considered to be the world's most polluted region. Vegetables were considered a treat because the high pollution levels affected the soil, preventing the countries from using their forests and excellent agricultural resources. Today the region still has higher-than-average pollution rates, despite the more sustainable practices implemented by Eastern European countries.

The fall of the Soviet Union resulted in a period of instability for Eastern Europe during the late '80s and '90s. A combination of revolutionary efforts calling for democratization, along with ethnic and religious wars, resulted in the splitting of Eastern Europe into smaller countries with relatively small populations. An example of this splitting process is the former country of Yugoslavia, which disintegrated into five independent states (Serbia, Croatia, Slovenia, Macedonia, and Bosnia & Herzegovina), each with fewer than 10 million people.

Our approach was first to clean the data by filtering it down to the region we are most interested in, Eastern Europe. After that, based on our research, we settled on the following as the most critical features of our region:

1. Total population
2. CO2 emissions (metric tons per capita)
3. GDP per person employed (constant 2011 PPP $)
4. Income share held by lowest 20%
5. Employment to population ratio, 15+, total (%) (modeled ILO estimate)

We then applied a second filter to keep only those specific attributes. Once we had the table with the features we needed, we inspected our dataset using .describe() and .info(), and then counted the number of missing values per column.

After inspecting our dataset, we dropped six entries: five countries that do not match the geographical definition of Eastern Europe (Sweden, Andorra, Norway, San Marino, and Greece), and Kosovo, which is not universally recognized as an independent country.
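
Dropping rows by country name rather than by row index can make this step easier to audit. The following is a minimal sketch on a toy table; the `toy` DataFrame, its values, and the `excluded` list are illustrative assumptions, and the names would need to match the spelling used in the actual dataset.

```python
import pandas as pd

# Toy stand-in for the regional table (values are invented for illustration)
toy = pd.DataFrame({
    "Country Name": ["Croatia", "Sweden", "Poland", "Kosovo", "Albania"],
    "Population, total": [4.3, 9.2, 38.1, 1.8, 2.9],  # millions, illustrative
})

# Entries we do not want to treat as part of the region
excluded = ["Sweden", "Andorra", "Norway", "San Marino", "Greece", "Kosovo"]

# Keep only the rows whose country name is NOT in the exclusion list
kept = toy[~toy["Country Name"].isin(excluded)].reset_index(drop=True)

print(kept["Country Name"].tolist())
```

Names in `excluded` that do not occur in the table are simply ignored, so the same list can be reused on differently filtered tables.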

After dropping those countries, we counted the missing values again: only 5 remained, all in the 'Income share' column. We could replace them with either the median or the mean of that column. We chose the median, since it is less affected by extreme values than the mean. To verify the substitution, we counted the missing values once more and found 0 for the column, confirming that no missing values remain.
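
To illustrate why the choice of fill value matters, here is a small sketch with made-up numbers (not taken from our dataset): a single extreme value shifts the mean noticeably but leaves the median at a typical value, and `fillna` followed by `isnull().sum()` confirms that no gaps remain.

```python
import numpy as np
import pandas as pd

# Made-up income-share values with two gaps and one extreme value
share = pd.Series([6.1, 7.0, 8.8, np.nan, 9.1, np.nan, 25.0])

mean_value = share.mean()      # pulled upward by the extreme value 25.0
median_value = share.median()  # middle observed value, robust to the extreme

# Fill the gaps with the median and verify that nothing is missing
filled = share.fillna(median_value)
missing_after = int(filled.isnull().sum())

print(round(mean_value, 2), median_value, missing_after)
```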

From our datasets, we were able to make this analysis:

The CO2 emissions distribution has a mean of 6.94 and a median of 6.34 metric tons per capita.
The country with the highest CO2 emissions is Estonia, with over 13 metric tons per capita, and the country with the lowest is Albania, with around 1.50, giving a range of 11.30.
Comparing the regional average against the world, we notice that CO2 emissions per capita in Eastern Europe are higher than the world average (green line), which suggests that our region has a comparatively high level of pollution.
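
The highest and lowest values, and the range between them, can be read off with `idxmax` and `idxmin`. The sketch below uses a toy table with invented values; `toy`, the column name `co2`, and `world_avg` are assumptions for illustration only.

```python
import pandas as pd

# Toy table indexed by country label (values invented for illustration)
toy = pd.DataFrame(
    {"co2": [12.0, 1.5, 6.0, 8.5]},
    index=["A", "B", "C", "D"],
)

highest = toy["co2"].idxmax()   # index label of the largest value
lowest = toy["co2"].idxmin()    # index label of the smallest value
value_range = toy["co2"].max() - toy["co2"].min()

# Hypothetical world average used only for the comparison
world_avg = 5.0
above_world = toy["co2"].mean() > world_avg

print(highest, lowest, value_range, above_world)
```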

For GDP per person employed, we found a median of 52,586 and a mean of 51,102. The country with the highest value is Finland with 88,767, and the country with the lowest is Albania with 2,475.
Even though the mean and median GDP per person employed in Eastern Europe are higher than the world average, the Western European average is almost double that of Eastern Europe, which is an indicator of the current economic situation of the region.

The total population distribution has a median of 3,480,915 and a mean of 6,222,782. The country with the highest total population is Poland with 38,125,759, and the country with the lowest is Malta with 409,379.
Our data is skewed to the right, with Poland being an outlier at more than 30 million people compared to the regional average of around 6 million. The relatively small populations in the region are a result of the independence movements of the past decades.
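
A mean well above the median, as we see for population, is a typical sign of right skew. As a rough check, pandas provides a sample skewness through `Series.skew()`; the sketch below uses invented numbers, with one large value playing the role of Poland.

```python
import pandas as pd

# Invented populations (in millions); the last value acts as an outlier
population = pd.Series([0.4, 1.3, 2.0, 3.5, 4.3, 7.1, 10.0, 38.0])

mean_pop = population.mean()
median_pop = population.median()
skewness = population.skew()   # positive values indicate a right-skewed sample

print(mean_pop > median_pop, skewness > 0)
```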

Our fourth point of analysis is the employment-to-population ratio, with a median of 49.72% and a mean of 49.53%. The country with the highest ratio is Estonia with 58%, and the country with the lowest is Bosnia and Herzegovina with 37%.
Our region has a lower average employment ratio than the rest of the world, including Western Europe. We attribute this to a weak labor market, meaning that not enough people qualify for the available jobs. This low ratio contributes to increasing emigration to Western Europe and reflects the economic situation and standard of living in Eastern Europe.

Finally, our last point is the income share held by the lowest 20%, with a median of 8.8% and a mean of 8.5%. The country with the highest income share is Slovenia with 10.2%, and the country with the lowest is Latvia with 6.1%.
By overlaying the distributions before and after imputing the missing income share values with the median, we notice a change in the shape of the distribution: the peak is higher, because every imputed value equals the median.
We therefore cannot rely too heavily on the income share data or use it as a main factor, since a large part of it is imputed.
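
This change in shape can be reproduced on a toy series: filling several gaps with one constant adds a spike at that value and shrinks the spread. The numbers below are invented and only illustrate the mechanism.

```python
import numpy as np
import pandas as pd

# Invented values with three gaps
original = pd.Series([6.0, 7.0, 8.0, 9.0, 10.0, np.nan, np.nan, np.nan])

median_value = original.median()          # computed on observed values only
imputed = original.fillna(median_value)

# The spread shrinks because the filled-in points sit exactly at the centre
std_before = original.std()
std_after = imputed.std()

# Number of observations equal to the median before and after imputation
at_median_before = int((original == median_value).sum())
at_median_after = int((imputed == median_value).sum())

print(std_before > std_after, at_median_before, at_median_after)
```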

By analyzing the interquartile range and the box plot for each feature, we found that Croatia and Poland are the countries that best represent our region on average.
From these results, we chose Croatia as the country that most resembles an average Eastern European country. We decided against Poland because its large population is uncharacteristic of an Eastern European nation; the box plot and the histograms clearly show that Poland is an outlier in population.
Finally, it should be noted that the income share of the lowest 20% for both Croatia and Poland was imputed using the regional median, so using these values for comparison would not give us any useful insight.
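
The interquartile-range filter can also be written as a loop over the feature columns, which avoids repeating the condition for every indicator. This is a sketch on invented data; the column names `x` and `y` and the helper `within_iqr` are hypothetical and not part of our notebook.

```python
import pandas as pd

def within_iqr(df, columns):
    """Return the rows whose value lies inside [Q1, Q3] for every given column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        lower = df[col].quantile(0.25)
        upper = df[col].quantile(0.75)
        mask &= df[col].between(lower, upper)  # inclusive on both ends
    return df[mask]

# Invented data: only country "C" is central on both features
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [20.0, 10.0, 30.0, 50.0, 40.0],
})

typical = within_iqr(toy, ["x", "y"])
print(typical["country"].tolist())
```

A one-sided condition, such as a lower bound only for population, would need to be handled separately from this loop.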


In [None]:
#Import packages 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

#Open Dataset
file = "final_Project_Dataset.xlsx"

#Read the file using pd
project = pd.read_excel(io = file,
                       sheet_name = 'Data',
                       header = 0)

project

In [None]:
#Filter for Eastern Europe

region = project.loc[project['Hult Region'] == "Eastern Europe", :]

region

In [None]:
#Look names of columns
region.columns

In [None]:
#Filter by columns of interest
region_columns = region[['Country Name','Hult Region','CO2 emissions (metric tons per capita)','Income share held by lowest 20%','GDP per person employed (constant 2011 PPP $)','Employment to population ratio, 15+, total (%) (modeled ILO estimate)','Population, total']]

#Print result Eastern Europe with filtered columns 
region_columns 

In [None]:
#Information of our data
region_columns.describe()

In [None]:
#Data type 
region_columns.info()

In [None]:
#Counting missing values
region_columns.isnull().sum()

In [None]:
#drop regions that we are not going to consider: Sweden, Andorra, Greece, Norway, San Marino, Kosovo
region_drop=region_columns.drop(index=[75,181,144,4,173,212])

#Print result
region_drop


In [None]:
#New information after drop countries

region_drop.describe()

In [None]:
#Number of missing values per column
region_drop.isnull().sum()

In [None]:
#Histogram for our variable before fill missing values

# histogram for Income share
# (histplot replaces the deprecated distplot; requires seaborn >= 0.11)
sns.histplot(data  = region_drop,
             x     = 'Income share held by lowest 20%',
             bins  = 'fd',
             color = 'blue')

#Add mean and median vertical lines, labelled so the legend picks them up
plt.axvline(x = region_drop['Income share held by lowest 20%'].mean(),
            color = 'red',
            label = 'mean')
plt.axvline(x = region_drop['Income share held by lowest 20%'].median(),
            color = 'blue',
            label = 'median')

# Add title and name axis
plt.title(label = "Income share Distribution")
plt.xlabel(xlabel = 'Income share')
plt.ylabel(ylabel = 'Frequency')

#legend (built from the labelled lines)
plt.legend()

# Compile and display the plot 
plt.tight_layout()
plt.show()




In [None]:
#Flagging missing values for income share as mv_income and create a copy of 
#data set before filling missing values
region_drop['mv_income']= region_drop['Income share held by lowest 20%'].isnull().astype(int)
region_drop_before = region_drop.copy()

#Print result
region_drop_before

In [None]:
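#Fill missing values of income share with the median of that column only
income_col = 'Income share held by lowest 20%'
region_drop[income_col] = region_drop[income_col].fillna(region_drop[income_col].median())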

#Check if we fill all the missing values
region_drop.isnull().sum()

In [None]:
#check mv_income column that flag mv
region_drop

In [None]:
#Pull out information about the world to compare
world = project.loc[project['Hult Region'] == "World",
                    ['Country Name','Hult Region','CO2 emissions (metric tons per capita)','Income share held by lowest 20%','GDP per person employed (constant 2011 PPP $)','Employment to population ratio, 15+, total (%) (modeled ILO estimate)','Population, total']]

worldlist=world.iloc[0].to_list()

type(worldlist[2])



In [None]:
#Information about Western Europe
region_1 = project.loc[project['Hult Region'] == "Western Europe", :]


In [None]:
#Histogram for our variables

# Histogram for CO2
# (histplot replaces the deprecated distplot; requires seaborn >= 0.11)
sns.histplot(data  = region_drop,
             x     = 'CO2 emissions (metric tons per capita)',
             bins  = 'fd',
             color = 'blue')

#Add mean, median and world vertical lines, labelled for the legend
plt.axvline(x = region_drop['CO2 emissions (metric tons per capita)'].mean(),
            color = 'red',
            label = 'mean')
plt.axvline(x = region_drop['CO2 emissions (metric tons per capita)'].median(),
            color = 'blue',
            label = 'median')
plt.axvline(x = worldlist[2], color = 'green', label = 'world')

# Add title and name axis
plt.title(label = "CO2 emissions Distribution")
plt.xlabel(xlabel = 'CO2 emissions')
plt.ylabel(ylabel = 'Frequency')

#legend (built from the labelled lines)
plt.legend()

# Compile and display the plot 
plt.tight_layout()
plt.show()



# histogram for GDP per person employed
# (histplot replaces the deprecated distplot; requires seaborn >= 0.11)
sns.histplot(data  = region_drop,
             x     = 'GDP per person employed (constant 2011 PPP $)',
             bins  = 'fd',
             color = 'blue')

#Add mean, median, world and Western Europe vertical lines, labelled for the legend
plt.axvline(x = region_drop['GDP per person employed (constant 2011 PPP $)'].mean(),
            color = 'red',
            label = 'mean')
plt.axvline(x = region_drop['GDP per person employed (constant 2011 PPP $)'].median(),
            color = 'blue',
            label = 'median')
plt.axvline(x = worldlist[4], color = 'green', label = 'world')
plt.axvline(x = region_1['GDP per person employed (constant 2011 PPP $)'].mean(),
            color = 'purple',
            label = 'Western Europe')

# Add title and name axis
plt.title(label = "GDP per person employed Distribution")
plt.xlabel(xlabel = 'GDP per person employed')
plt.ylabel(ylabel = 'Frequency')

#legend (built from the labelled lines)
plt.legend()

# Compile and display the plot 
plt.tight_layout()
plt.show()


# Histogram for employment to population ratio
# (histplot replaces the deprecated distplot; requires seaborn >= 0.11)
sns.histplot(data  = region_drop,
             x     = 'Employment to population ratio, 15+, total (%) (modeled ILO estimate)',
             bins  = 'fd',
             color = 'blue')

#Add mean, median, world and Western Europe vertical lines, labelled for the legend
plt.axvline(x = region_drop['Employment to population ratio, 15+, total (%) (modeled ILO estimate)'].mean(),
            color = 'red',
            label = 'mean')
plt.axvline(x = region_drop['Employment to population ratio, 15+, total (%) (modeled ILO estimate)'].median(),
            color = 'blue',
            label = 'median')
plt.axvline(x = worldlist[5], color = 'green', label = 'world')
plt.axvline(x = region_1['Employment to population ratio, 15+, total (%) (modeled ILO estimate)'].mean(),
            color = 'purple',
            label = 'Western Europe')

# Add title and name axis
plt.title(label = "Distribution of employment to population ratio")
plt.xlabel(xlabel = 'Employment to population ratio (%)')
plt.ylabel(ylabel = 'Frequency')

#legend (built from the labelled lines)
plt.legend()

# Compile and display the plot 
plt.tight_layout()
plt.show()


# Histogram for total population
# (histplot replaces the deprecated distplot; requires seaborn >= 0.11)
sns.histplot(data  = region_drop,
             x     = 'Population, total',
             bins  = 'fd',
             color = 'blue')

#Add mean and median vertical lines, labelled for the legend
plt.axvline(x = region_drop['Population, total'].mean(),
            color = 'red',
            label = 'mean')
plt.axvline(x = region_drop['Population, total'].median(),
            color = 'blue',
            label = 'median')

# Add title and name axis
plt.title(label = "Total population Distribution")
plt.xlabel(xlabel = 'Total population')
plt.ylabel(ylabel = 'Frequency')

#legend (built from the labelled lines)
plt.legend()

# Compile and display the plot 
plt.tight_layout()
plt.show()


# histogram for Income share
# (histplot replaces the deprecated distplot; requires seaborn >= 0.11)
sns.histplot(data  = region_drop,
             x     = 'Income share held by lowest 20%',
             bins  = 'fd',
             color = 'blue')

#Add mean, median and world vertical lines, labelled for the legend
plt.axvline(x = region_drop['Income share held by lowest 20%'].mean(),
            color = 'red',
            label = 'mean')
plt.axvline(x = region_drop['Income share held by lowest 20%'].median(),
            color = 'blue',
            label = 'median')
plt.axvline(x = worldlist[3], color = 'green', label = 'world')

# Add title and name axis
plt.title(label = "Income share Distribution")
plt.xlabel(xlabel = 'Income share')
plt.ylabel(ylabel = 'Frequency')

#legend (now includes the world line, which was missing before)
plt.legend()

# Compile and display the plot 
plt.tight_layout()
plt.show()


#Comparison before and after filling mv

# overlay the original and imputed distributions for clarity
fig,ax = plt.subplots(figsize = [8,5],
                     sharex = True,
                     sharey = True)

#histogram for Income share, on a density scale so both KDE curves are comparable
# (histplot replaces the deprecated distplot; requires seaborn >= 0.11)
sns.histplot(data  = region_drop_before,
             x     = 'Income share held by lowest 20%',
             bins  = 'fd',
             stat  = 'density',
             kde   = True,
             color = 'black',
             label = 'original distribution')

sns.histplot(data  = region_drop,
             x     = 'Income share held by lowest 20%',
             bins  = 'fd',
             stat  = 'density',
             kde   = True,
             color = 'gray',
             label = 'imputed distribution')

#add titles and labels
plt.title(label = "Income Share Distribution")
plt.xlabel(xlabel = 'Income Share')
plt.ylabel(ylabel = 'Density')

#add legend (built from the labelled histograms)
plt.legend()

#display plot
plt.tight_layout()
plt.show()




In [None]:
#One figure per feature (the last column, mv_income, is only a flag and is skipped)
for i in region_drop.columns[2:-1]:
    #Setting figure size
    fig, ax = plt.subplots(figsize=(10, 5))

    #25th and 75th percentiles of the feature
    lower = region_drop[i].quantile(0.25)
    upper = region_drop[i].quantile(0.75)

    #Split the countries into those inside and outside the interquartile range
    near_mean = region_drop[(region_drop[i] >= lower) & (region_drop[i] <= upper)]
    not_near_mean = region_drop[(region_drop[i] <= lower) | (region_drop[i] >= upper)]

    ax = sns.boxplot(x=i,
                     y=None,
                     data=region_drop,
                     color='lightgray')

    ax = sns.stripplot(x=i,
                       y='Hult Region',
                       data=near_mean,
                       size=8,
                       hue='Country Name')

    ax = sns.stripplot(x=i,
                       y=None,
                       data=not_near_mean,
                       size=4,
                       color= 'black')

    ax.legend(loc='upper right')

    # formatting and displaying the plot
    plt.title(label=i)
    plt.xlabel(xlabel=i)
    plt.show()

In [None]:
#method 1 to find the countries closest to the regional average:
#keep those that lie inside the interquartile range of each feature

# function returns 25th pctl of given column
def get_lower(col_name):
    lower = region_drop[col_name].quantile(0.25)
    return lower

# function returns 75th pctl of given column
def get_upper(col_name):
    upper = region_drop[col_name].quantile(0.75)
    return upper

# subsetting region data for rows that fit all criteria - within pctls for each col
# (population is only checked against its lower bound)
final_countries = region_drop[(region_drop['CO2 emissions (metric tons per capita)'] >= get_lower('CO2 emissions (metric tons per capita)')) &
                  (region_drop['CO2 emissions (metric tons per capita)'] <= get_upper('CO2 emissions (metric tons per capita)')) &
                  (region_drop['Income share held by lowest 20%'] >= get_lower('Income share held by lowest 20%')) &
                  (region_drop['Income share held by lowest 20%'] <= get_upper('Income share held by lowest 20%')) &
                  (region_drop['GDP per person employed (constant 2011 PPP $)'] >= get_lower('GDP per person employed (constant 2011 PPP $)')) &
                  (region_drop['GDP per person employed (constant 2011 PPP $)'] <= get_upper('GDP per person employed (constant 2011 PPP $)')) &
                  (region_drop['Employment to population ratio, 15+, total (%) (modeled ILO estimate)'] >= get_lower('Employment to population ratio, 15+, total (%) (modeled ILO estimate)')) &
                  (region_drop['Employment to population ratio, 15+, total (%) (modeled ILO estimate)'] <= get_upper('Employment to population ratio, 15+, total (%) (modeled ILO estimate)')) &
                  (region_drop['Population, total'] >= get_lower('Population, total'))]

final_countries

In [None]:
#filling nulls with median of data (only the income share column has gaps)

#method 2 to find the countries that are closest to the mean,
#using a band of a quarter of a standard deviation around it
income_col = 'Income share held by lowest 20%'
region_drop[income_col] = region_drop[income_col].fillna(region_drop[income_col].median())

#Set lists for the upper and lower limits
list_up=[]
list_down=[]

#Loop over the feature columns of region_drop
for i in region_drop.columns[2:]:
    #Upper limit: sample mean plus a quarter of the sample standard deviation
    std_up = region_drop[i].mean() + (region_drop[i].std())/4
    #Lower limit: sample mean minus a quarter of the sample standard deviation
    std_down = region_drop[i].mean() - (region_drop[i].std())/4
    #Add the upper and lower limits to the lists previously set
    list_up.append(std_up)
    list_down.append(std_down)
print(list_down,list_up)

#Convert the limits to arrays once, outside the loop
arup = np.array(list_up)
ardown = np.array(list_down)

#List of country names (not called "list", to avoid shadowing the built-in)
countries = region_drop['Country Name'].to_list()

for country in countries:
    # Select all of the feature values for this country
    x = region_drop.iloc[ : , 2:][region_drop['Country Name']==country]
    values = np.array(x)
    #True where the value is below the upper limit and above the lower limit
    country_lst = np.logical_and(arup - values > 0, values - ardown > 0)
    #Print the result with the country name; the country with the most True
    #values is the one closest to the regional average
    print(country, country_lst)


References:

https://en.wikipedia.org/wiki/Eastern_Europe 

https://www.newworldencyclopedia.org/entry/Eastern_Europe

https://saylordotorg.github.io/text_world-regional-geography-people-places-and-globalization/s05-05-eastern-europe.html
