# MATH 189 Final Project


## Statement of Project

Analyzing the wealth distribution among billionaires is a useful lens on the broader economic inequalities in today's economy. The insights also bear on lower-income groups, because they help evaluate how effectively economic systems and policies foster equitable growth. By studying the dynamics of wealth accumulation, we hope to uncover patterns in, and barriers to, wealth equality, which should aid both our understanding of the current economic climate and the design of more inclusive economic strategies.

Dataset source: https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset

## Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sys

sys.path.append('D:/Documents/School/WI24/MATH189/MATH189Final/modules')
from data_cleaning import *


## Importing Dataset

In [2]:
### DEFINING parent directory, different for each person
data_dir = 'D:/Documents/School/WI24/MATH189/FinalData'

### Billionaires Data
billionaires = pd.read_csv(data_dir + '/billionaires.csv')

### EdStats Data
ed_stats_country = pd.read_csv(data_dir+'/EdStatsCountry.csv')
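
Since the data directory differs for each collaborator, one option is to read it from an environment variable instead of editing the notebook. The sketch below is only an illustration: the variable name `MATH189_DATA_DIR`, the helper `resolve_data_dir`, and the `data` fallback folder are all hypothetical and not part of the project.

```python
import os
from pathlib import Path


def resolve_data_dir(default="data"):
    """Return the data directory from the (hypothetical) MATH189_DATA_DIR
    environment variable, falling back to `default` when it is unset."""
    return Path(os.environ.get("MATH189_DATA_DIR", default))


data_dir = resolve_data_dir()

# Build file paths with pathlib so separators are handled on any OS.
billionaires_path = data_dir / "billionaires.csv"
ed_stats_path = data_dir / "EdStatsCountry.csv"
print(billionaires_path)
```

Each collaborator would then set the variable once on their machine, and `pd.read_csv(billionaires_path)` works unchanged.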

### Merging data

In [3]:
df = merge_billionaires_ed_stats(billionaires, ed_stats_country)
df.head(1)

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,Unnamed: 31
0,1,211000,Fashion & Retail,Bernard Arnault & family,74.0,France,Paris,LVMH,Fashion & Retail,France,...,Special Data Dissemination Standard (SDDS),2006. Rolling census based on continuous sampl...,,"Expenditure survey/budget survey (ES/BS), 1994/95",Yes,2010,2009.0,2012.0,2007,
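
The body of `merge_billionaires_ed_stats` lives in the `data_cleaning` module and is not shown here. As a rough sketch of what a country-level join of this kind could look like, the toy example below does a left merge on country name and flags rows that found no match. The join keys (`country` and `Short Name`) and the toy frames are assumptions for illustration, not the project's actual implementation.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are illustrative only).
billionaires_toy = pd.DataFrame({
    "personName": ["A", "B", "C"],
    "country": ["France", "United States", "Atlantis"],
})
ed_stats_toy = pd.DataFrame({
    "Short Name": ["France", "United States"],
    "Income Group": ["High income: OECD", "High income: OECD"],
})

# A left merge keeps every billionaire row; indicator=True adds a `_merge`
# column recording whether a matching country row was found.
merged = billionaires_toy.merge(
    ed_stats_toy,
    how="left",
    left_on="country",
    right_on="Short Name",
    indicator=True,
)

# Countries with no match end up with NaN in the country-level columns.
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

Checking the unmatched list after a merge like this is a quick way to spot country names that are spelled differently in the two sources.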


## EDA

In [5]:
# drop columns with too many nulls (2000+) or redundant
df = df.drop(columns=['organization', 'title', 'Other groups', 'Vital registration complete', 'Alternative conversion factor', 'Unnamed: 31'])

#convert datetime columns
placeholder_date = pd.Timestamp('2024-03-01')
df['birthDate'] = pd.to_datetime(df['birthDate']).fillna(placeholder_date)
df['date'] = pd.to_datetime(df['date'])
df['birthYear'] = df['birthYear'].fillna(-1).astype(int)
df['birthMonth'] = df['birthMonth'].fillna(-1).astype(int)
df['birthDay'] = df['birthDay'].fillna(-1).astype(int)
df['National accounts reference year'] = df['National accounts reference year'].fillna(-1).astype(int)
uncleaned_dates = ['Latest population census', 'Latest agricultural census', 'Latest industrial data', 'Latest trade data',
                  'Latest water withdrawal data']
for col in uncleaned_dates:
    # raw string avoids an invalid-escape warning; expand=False returns a Series
    df[col] = df[col].astype(str).str.extract(r'(\d{4})', expand=False)
    df[col] = pd.to_datetime(df[col], format='%Y', errors='coerce').fillna(placeholder_date)

#convert numerical columns
numerical_columns = ['age', 'cpi_country', 'cpi_change_country', 'gdp_country', 'gross_tertiary_education_enrollment', 
                     'gross_primary_education_enrollment_country', 'life_expectancy_country', 'tax_revenue_country_country',
                     'total_tax_rate_country', 'population_country', 'latitude_country', 'longitude_country']
for col in numerical_columns:
    # strip '$', ',' and whitespace from string-typed values
    # (assumed to be the only non-numeric characters in these columns)
    if not pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].astype(str).str.replace(r'[$,\s]', '', regex=True)
    # coerce to numeric, then fill missing values with the column's own mean
    df[col] = pd.to_numeric(df[col], errors='coerce')
    df[col] = df[col].fillna(df[col].mean())

#categorical columns ('Currency Unit' holds currency names, so it is treated as categorical)
categorical_columns = ['category', 'city', 'country', 'state', 'gender', 'status', 'residenceStateRegion', 'source', 'industries', 
                       'countryOfCitizenship', 'Country Code', 'Short Name', 'Table Name', 'Long Name', '2-alpha code', 'Region', 
                      'Income Group', 'WB-2 code', 'National accounts base year', 'SNA price valuation', 'Lending category',
                      'System of National Accounts', 'PPP survey year', 'External debt Reporting status', 'System of trade', 
                      'Government Accounting concept', 'IMF data dissemination standard', 'Latest household survey', 
                      'Source of most recent Income and expenditure data', 'Balance of Payments Manual in use', 'Currency Unit']
for col in categorical_columns:
    df[col] = df[col].astype('category').cat.add_categories(['Unknown'])
    df[col] = df[col].fillna('Unknown')

#fill nan for remaining columns
df['firstName'] = df['firstName'].fillna('Unknown')
df['Special Notes'] = df['Special Notes'].fillna('None')
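
To make two of the cleaning steps above easier to follow, here is a small self-contained sketch on a toy frame: extracting a four-digit year from free-text census fields, and imputing numeric gaps with a column mean. The toy values are made up, and the `census_year` column name is hypothetical. Unlike the cell above, this sketch leaves missing dates as `NaT` rather than a placeholder date so the affected rows stay visible.

```python
import pandas as pd

toy = pd.DataFrame({
    "Latest population census": ["2006. Rolling census", "2010", None],
    "age": [74.0, None, 50.0],
    "life_expectancy_country": [82.5, 78.5, None],
})

# Pull the first four-digit year out of the text; rows without one become NaN.
years = (
    toy["Latest population census"]
    .astype(str)
    .str.extract(r"(\d{4})", expand=False)
)
# Parse the year strings; unparseable or missing values become NaT.
toy["census_year"] = pd.to_datetime(years, format="%Y", errors="coerce")

# Mean imputation: each numeric column is filled with its own mean.
for col in ["age", "life_expectancy_country"]:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```

Mean imputation keeps every row but shrinks the variance of the imputed column, which is worth keeping in mind when interpreting later regressions.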


In [6]:
missing_values = df.isnull().sum()
columns_with_missing_values = missing_values[missing_values > 0]
columns_with_missing_values

Series([], dtype: int64)
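
The empty Series above indicates that no column has missing values left. When running the same check before cleaning, it can help to see the share of missing rows alongside the counts. The helper below, `missing_report`, is a hypothetical convenience function sketched on toy data, not something defined in the project.

```python
import pandas as pd


def missing_report(frame):
    """Return missing-value counts and shares for columns with any gaps,
    sorted from most to least missing."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(frame)})
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


toy = pd.DataFrame({
    "a": [1, None, None],
    "b": [1, 2, 3],
    "c": [None, "x", "y"],
})
print(missing_report(toy))
```

A report like this makes a drop threshold (such as the 2000+ nulls used earlier) easier to justify column by column.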

## Analysis