In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("billionaires.csv")
display(df)


In [None]:
df.info()

As we can see, the dataset consists of 30 columns and 2591 rows, none of which contain NULL values. The data type of each column is also listed.

In [None]:
df.describe()

This cell shows summary statistics for all numeric columns, including the minimum, maximum, mean, median and more.

Because of the user filtering system (which will be unveiled during the presentation🤫), two new columns, "1st_initial" and "last_initial", were added to make the filtering more accurate and versatile.
   
   🎯 "1st_initial" - the very first letter of the "full_name" column
   
   🎯 "last_initial" - the "full_name" column is split into words and the initial of the last word (the surname) is taken
                       
The unique values of the new columns are then printed to check that all initials are alphabetical, with no leading or trailing symbols.

In [None]:
df["1st_initial"] = df["full_name"].str[0].str.title()
print(df["1st_initial"].unique())
df["last_initial"] = df["full_name"].str.split(" ").str[-1].str[0].str.title()
print(df["last_initial"].unique())

display(df)
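
Instead of reading the printed initials by eye, the same check can be done programmatically. The sketch below uses a small hypothetical frame (the names are invented for illustration and are not taken from billionaires.csv). It assumes that names may carry stray whitespace, so it strips them before taking the initials.

```python
import pandas as pd

# Hypothetical sample; the real notebook would use df["full_name"] instead.
sample = pd.DataFrame({"full_name": ["Ada Lovelace", " grace hopper", "Alan Turing "]})

# Strip surrounding whitespace so the first and last words are real name parts.
names = sample["full_name"].str.strip()
sample["1st_initial"] = names.str[0].str.upper()
sample["last_initial"] = names.str.split().str[-1].str[0].str.upper()

# True only if every initial is an alphabetic character.
first_ok = sample["1st_initial"].str.isalpha().all()
last_ok = sample["last_initial"].str.isalpha().all()
print(first_ok, last_ok)
```

If either flag is False, filtering the frame with `~sample["1st_initial"].str.isalpha()` would show the offending rows.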

## Data cleaning

### Now 🥁*drum roll*🥁 let's consider the relevance of our columns and see which ones would yield insightful analyses. The process will involve removing duplicate rows, handling missing values and dropping some columns.

To start the data cleaning process, let us look at the shape of our data frame. As the data show, we have 2591 observations and 32 features (the original 30 plus the two initial columns added above), which describe each billionaire.

In [None]:
df.shape

In [None]:
df["position"].nunique()

By analyzing the dataset, we concluded that the "position" column represents the billionaires' ranking by wealth. It takes only 219 distinct values, so many billionaires share a rank (and therefore the same wealth) with someone else. Subtracting this number from the number of rows gives the number of rows whose rank already appears earlier in the dataset:
 => 2591 - 219 = 2372
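
Note that rows minus distinct values counts the rows that repeat an already-seen rank. That is not the same quantity as the number of people tied with at least one other person. The sketch below shows the difference on a tiny hypothetical series of ranks (invented values, not the real "position" column).

```python
import pandas as pd

# Hypothetical ranks: three people tie at rank 2, one person each at ranks 1 and 5.
ranks = pd.Series([1, 2, 2, 2, 5], name="position")

# Rows minus distinct values: rows that repeat an already-seen rank.
surplus_rows = len(ranks) - ranks.nunique()

# keep=False flags every member of a tied group, including the first one.
people_in_ties = int(ranks.duplicated(keep=False).sum())

print(surplus_rows, people_in_ties)
```

On the real data, `df["position"].duplicated(keep=False).sum()` would give the number of billionaires who share their rank with someone else.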

In [None]:
df["g_primary_ed_enroll"].describe()

As we can see, the "g_primary_ed_enroll" column was expected to hold percentage values, but it contains misleading, out-of-range values. Additionally, primary education is only the foundation for later education, so no insightful conclusions can be drawn from this column.
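
The range claim can also be checked directly rather than read off the summary table. This is a minimal sketch on hypothetical values (not the real column), assuming a valid percentage lies between 0 and 100 inclusive.

```python
import pandas as pd

# Hypothetical enrollment values; the real check would use df["g_primary_ed_enroll"].
enroll = pd.Series([98.5, 101.9, 100.0, 142.0, 87.3], name="g_primary_ed_enroll")

# between() is inclusive on both ends by default, so 0 and 100 count as valid.
out_of_range = ~enroll.between(0, 100)

print(f"{out_of_range.sum()} of {len(enroll)} values fall outside 0-100")
print(enroll[out_of_range].tolist())
```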

In [None]:
duplicate_rows = df.duplicated().sum() 
missing_values = df.isnull().sum().sum()
print(f"there are {duplicate_rows} duplicate rows and {missing_values} missing values")

As can be seen, there are no duplicate rows and no missing values. We came to this conclusion using the .duplicated() method, which returns a boolean series that is True for every row that repeats an earlier row. Summing that series gives the number of duplicate rows, which turned out to be 0. The missing values were counted in much the same way: .isnull() marks every missing cell, and summing twice (first per column, then across columns) gives the total.
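
Since both counts are 0 on our data, the mechanics are easier to see on a toy example. The frame below is hypothetical: it was built to contain one repeated row and two missing cells.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one repeated row and two missing cells.
toy = pd.DataFrame({
    "full_name": ["A One", "B Two", "A One", "C Three"],
    "wealth": [1.0, np.nan, 1.0, np.nan],
})

# duplicated() marks a row True when an identical row appeared earlier.
dup_mask = toy.duplicated()
duplicate_rows = int(dup_mask.sum())

# isnull() gives a boolean frame; the first sum() counts per column, the second totals them.
missing_values = int(toy.isnull().sum().sum())

print(dup_mask.tolist())
print(duplicate_rows, missing_values)
```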

### Based on the analysis above, the following columns are dropped:
    🎯 'position' - no insightful information
    🎯 'g_primary_ed_enroll' - no insightful information
    🎯 'cpi_change_country' - the time period of the CPI change is not indicated, therefore redundant
    🎯 'residence_region' - the data relate to the US only
    🎯 'residence_state' - the data relate to the US only
    


In [None]:
df.drop(["g_primary_ed_enroll", "cpi_change_country", "residence_state", "residence_region", "position"], axis = 1, inplace = True)
df

Since the values of the "wealth" column are four-digit numbers (in the thousands), we divide them by 1000 for convenience, so that wealth is expressed in billions.

In [None]:
df["wealth"] = df["wealth"]/1000
df
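
One thing to watch for: the cell above overwrites "wealth" in place, so running it a second time divides by 1000 again. A possible alternative, sketched here on hypothetical values, writes the result to a new column (the name "wealth_billions" is an assumption, not part of the dataset), which makes the cell safe to re-run.

```python
import pandas as pd

# Hypothetical four-digit wealth values, as in the raw column.
toy = pd.DataFrame({"wealth": [2500, 1000, 9800]})

# Writing to a new column leaves the raw values untouched, so re-running is harmless.
toy["wealth_billions"] = toy["wealth"] / 1000

print(toy)
```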