<a href="https://colab.research.google.com/github/asthanas/DataScienceProjects/blob/master/Worlds_Wealthiest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **World's Wealthiest - Descriptive Statistics**

The World's Billionaires is an annual ranking, by documented net worth, of the world's wealthiest billionaires, compiled and published each March by the American business magazine Forbes. The list was first published in March 1987. The total net worth of each individual on the list is estimated and cited in United States dollars, based on their documented assets and accounting for debt. Royalty and dictators whose wealth comes from their positions are excluded. The ranking covers only individuals whose wealth can be documented; those whose wealth cannot be completely ascertained are left out. (Source: Wikipedia)

The dataset has the following features:
 - Year
 - Rank
 - Name
 - Net_Worth
 - Age
 - Nationality
 - Source_wealth

**Objective**: Perform descriptive analytics to understand what the data tells us about the world's wealthiest.

In [0]:
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',25)



Read input file billionaires.csv

In [0]:
data=pd.read_csv("billionaires.csv")

**Show the first 10 rows of the data**

In [0]:
data.head(10)

Unnamed: 0,year,rank,name,net_worth,age,natinality,source_wealth
0,2019,1,Jeff Bezos,131.0,55,United States,Amazon
1,2019,2,Bill Gates,96.5,63,United States,Microsoft
2,2019,3,Warren Buffett,82.5,88,United States,Berkshire Hathaway
3,2019,4,Bernard Arnault,76.0,70,France,LVMH
4,2019,5,Carlos Slim,64.0,79,Mexico,"América Móvil, Grupo Carso"
5,2019,6,Amancio Ortega,62.7,82,Spain,"Inditex, Zara"
6,2019,7,Larry Ellison,62.5,74,United States,Oracle Corporation
7,2019,8,Mark Zuckerberg,62.3,34,United States,Facebook
8,2019,9,Michael Bloomberg,55.5,77,United States,Bloomberg L.P.
9,2019,10,Larry Page,50.8,45,United States,Alphabet Inc.


Correct the column name from "natinality" to "nationality"

In [0]:
data.rename(columns = {'natinality':'nationality'}, inplace = True)

In [0]:
data.head(20)

Unnamed: 0,year,rank,name,net_worth,age,nationality,source_wealth
0,2019,1,Jeff Bezos,131.0,55,United States,Amazon
1,2019,2,Bill Gates,96.5,63,United States,Microsoft
2,2019,3,Warren Buffett,82.5,88,United States,Berkshire Hathaway
3,2019,4,Bernard Arnault,76.0,70,France,LVMH
4,2019,5,Carlos Slim,64.0,79,Mexico,"América Móvil, Grupo Carso"
5,2019,6,Amancio Ortega,62.7,82,Spain,"Inditex, Zara"
6,2019,7,Larry Ellison,62.5,74,United States,Oracle Corporation
7,2019,8,Mark Zuckerberg,62.3,34,United States,Facebook
8,2019,9,Michael Bloomberg,55.5,77,United States,Bloomberg L.P.
9,2019,10,Larry Page,50.8,45,United States,Alphabet Inc.


In [0]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   year           180 non-null    int64 
 1   rank           180 non-null    int64 
 2   name           180 non-null    object
 3   net_worth      180 non-null    object
 4   age            180 non-null    int64 
 5   nationality    180 non-null    object
 6   source_wealth  180 non-null    object
dtypes: int64(3), object(4)
memory usage: 10.0+ KB


In [0]:
data.describe()

Unnamed: 0,year,rank,age
count,180.0,180.0,180.0
mean,2010.5,5.394444,66.666667
std,5.202599,2.793642,13.689698
min,2002.0,1.0,31.0
25%,2006.0,3.0,55.0
50%,2010.5,5.5,69.0
75%,2015.0,8.0,78.0
max,2019.0,10.0,92.0


In [0]:
data.describe(include='object')

Unnamed: 0,name,net_worth,nationality,source_wealth
count,180,180.0,180,180
unique,45,122.0,14,43
top,Bill Gates,20.0,United States,Microsoft
freq,18,9.0,99,23


In [0]:
data['net_worth']=data['net_worth'].astype('float')

In [0]:
data['year']=data['year'].astype('object')
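
`astype('float')` raises an error if any entry cannot be parsed as a number. A more defensive conversion is sketched below on a small made-up Series (the values are illustrative and not taken from billionaires.csv). It uses `pd.to_numeric` with `errors='coerce'`, so unparseable entries become NaN and can be inspected:

```python
import pandas as pd

# Hypothetical sample: numbers stored as strings, one of them malformed.
raw = pd.Series(["131.0", "96.5", "n/a", "82.5"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
net_worth = pd.to_numeric(raw, errors="coerce")

# Count how many entries failed to parse before relying on the column.
n_bad = int(net_worth.isna().sum())
print(net_worth.dtype, n_bad)
```

If `n_bad` is zero, the result matches what `astype('float')` would produce.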

In [0]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   year           180 non-null    object 
 1   rank           180 non-null    int64  
 2   name           180 non-null    object 
 3   net_worth      180 non-null    float64
 4   age            180 non-null    int64  
 5   nationality    180 non-null    object 
 6   source_wealth  180 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 10.0+ KB


In [0]:
data['source_wealth'].value_counts()

Microsoft                                  23
Berkshire Hathaway                         18
Oracle Corporation                         14
Koch Industries                            12
Wal-Mart                                   11
Aldi Süd                                    7
Walmart                                     6
Telmex, América Móvil, Grupo Carso          6
Inditex Group                               6
Arcelor Mittal                              5
IKEA                                        5
Reliance Industries                         4
Amazon                                      4
Kingdom Holding Company                     4
LVMH Moët Hennessy • Louis Vuitton          4
Facebook                                    4
LVMH                                        4
América Móvil, Grupo Carso                  4
Inditex, Zara                               3
EBX Group                                   3
Inditex                                     3
Bloomberg L.P.                    

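Based on the counts above, Microsoft is the most common source of wealth, accounting for 23 of the 180 entries. Some companies are listed under more than one label (for example Wal-Mart and Walmart), so their raw counts are split across rows.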
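
Variant labels such as Wal-Mart and Walmart, or the three Inditex entries, can be merged before counting. The sketch below uses a handful of counts copied from the output above and a hypothetical `canonical` mapping (which labels to merge is an assumption, not part of the original analysis):

```python
import pandas as pd

# Counts copied from the value_counts() output above for a few labels.
counts = pd.Series({
    "Microsoft": 23,
    "Wal-Mart": 11,
    "Walmart": 6,
    "Inditex Group": 6,
    "Inditex, Zara": 3,
    "Inditex": 3,
})

# Hypothetical mapping from variant spellings to one canonical label.
canonical = {
    "Wal-Mart": "Walmart",
    "Inditex Group": "Inditex",
    "Inditex, Zara": "Inditex",
}

# Rename the index labels, then add up counts that now share a label.
merged = counts.rename(index=canonical).groupby(level=0).sum()
merged = merged.sort_values(ascending=False)
print(merged)
```

On the full DataFrame, the same mapping could be applied with `data['source_wealth'].replace(canonical)` before calling `value_counts()`.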

In [0]:
data_by_year=data.groupby('year')
data_by_year.describe()

year,rank count,rank mean,rank std,rank min,rank 25%,rank 50%,rank 75%,rank max,net_worth count,net_worth mean,net_worth std,net_worth min,net_worth 25%,net_worth 50%,net_worth 75%,net_worth max,age count,age mean,age std,age min,age 25%,age 50%,age 75%,age max
2002,10.0,5.5,3.02765,1.0,3.25,5.5,7.75,10.0,10.0,27.52,14.434665,16.1,19.55,20.0,27.4,60.0,10.0,55.0,15.570628,35.0,43.25,51.0,66.0,80.0
2003,10.0,5.2,2.616189,1.0,3.25,5.5,7.75,8.0,10.0,26.63,10.266889,20.5,20.55,22.15,26.4,52.8,10.0,60.8,12.968338,46.0,53.25,56.5,67.75,82.0
2004,10.0,4.5,1.900292,1.0,3.25,5.5,6.0,6.0,10.0,25.5,10.229152,20.0,20.0,20.5,22.625,46.6,10.0,61.6,13.882043,47.0,52.0,57.0,69.75,84.0
2005,10.0,5.5,3.02765,1.0,3.25,5.5,7.75,10.0,10.0,26.22,10.339761,18.3,19.125,23.35,24.7,46.5,10.0,62.8,12.787146,49.0,52.5,60.5,71.75,85.0
2006,10.0,5.5,3.02765,1.0,3.25,5.5,7.75,10.0,10.0,27.74,11.020304,18.8,20.375,22.75,29.5,52.0,10.0,64.3,12.970479,49.0,53.5,61.5,76.5,82.0
2007,10.0,5.5,3.02765,1.0,3.25,5.5,7.75,10.0,10.0,34.35,13.00438,22.0,24.5,29.25,45.0,56.0,10.0,65.9,11.512795,49.0,56.5,69.0,75.25,80.0
2008,10.0,5.5,3.02765,1.0,3.25,5.5,7.75,10.0,10.0,42.6,13.631662,27.0,30.25,42.5,54.75,62.0,10.0,63.8,16.287691,40.0,51.25,62.5,76.75,88.0
2009,10.0,5.5,3.02765,1.0,3.25,5.5,7.75,10.0,10.0,25.39,8.439648,18.3,19.35,21.75,31.875,40.0,10.0,70.6,13.672357,52.0,59.5,71.0,81.75,89.0
2010,10.0,5.5,3.02765,1.0,3.25,5.5,7.75,10.0,10.0,34.22,11.932011,23.5,27.125,28.35,42.5,53.5,10.0,66.0,12.328828,53.0,55.5,63.5,73.0,90.0
2011,10.0,5.5,3.02765,1.0,3.25,5.5,7.75,10.0,10.0,40.61,15.351399,26.5,30.25,35.3,47.75,74.0,10.0,63.7,9.031427,53.0,56.25,62.0,69.75,80.0


The pandas `describe` function is a powerful way to get summary statistics for an entire dataset. Above we can see the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum of each numeric column for each year.
The mean age of the top-10 billionaires was 55 in 2002 and 66.7 in 2019.
Their mean net worth (in billions of US dollars) was 27.52 in 2002 and 74.38 in 2019.
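
A minimal sketch of how the `describe` output relates to the median, using the ages of the 2019 top 10 copied from the `data.head(10)` output above. The `50%` row is the median, while the mode is not part of the numeric summary and would need a separate call such as `Series.mode()`:

```python
import pandas as pd

# Ages of the 2019 top 10, copied from the data.head(10) output above.
ages = pd.Series([55, 63, 88, 70, 79, 82, 74, 34, 77, 45])

summary = ages.describe()

# The "50%" row of describe() is the same number as median().
print(summary["mean"], summary["50%"], ages.median())

# The numeric summary has no "mode" row.
print(list(summary.index))
```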

**Descriptive statistics** summarize a given data set. They are broken down into (1) measures of central tendency and (2) measures of spread.

# **Measures of Central Tendency**

The obvious question when looking at a wealth dataset is "How much are these people worth?" Nobody asking it wants hundreds of rows of data; they want a single number that can represent the entire dataset. That is exactly what a measure of central tendency provides. There are three such measures: **mean, median, and mode**.
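
As a rough illustration of the three measures, the sketch below rebuilds the 2019 top 10 from the `data.head(10)` output above (the small DataFrame `top10_2019` is constructed here only for the example). The mean and median apply to the numeric `net_worth` column, while the mode is most useful for a categorical column such as `nationality`:

```python
import pandas as pd

# The 2019 top 10, copied from the data.head(10) output above.
top10_2019 = pd.DataFrame({
    "net_worth": [131.0, 96.5, 82.5, 76.0, 64.0, 62.7, 62.5, 62.3, 55.5, 50.8],
    "nationality": [
        "United States", "United States", "United States", "France", "Mexico",
        "Spain", "United States", "United States", "United States", "United States",
    ],
})

mean_worth = top10_2019["net_worth"].mean()        # pulled up by the largest value
median_worth = top10_2019["net_worth"].median()    # middle of the sorted values
mode_nation = top10_2019["nationality"].mode()[0]  # most frequent category

print(round(mean_worth, 2), round(median_worth, 2), mode_nation)
```

The mean sits above the median here because a single very large net worth pulls the average up, which is why both are worth reporting for skewed data.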



**MEAN**: The mean is the average value: the sum of the values divided by the number of values.
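
As a formula, for $n$ values $x_1, \dots, x_n$ the mean is:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

For example, the ten 2019 net worths shown in the `data.head(10)` output sum to 743.8, so their mean is $743.8 / 10 = 74.38$.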

In [0]:
data_by_year[['net_worth','age']].mean()

year,net_worth,age
2002,27.52,55.0
2003,26.63,60.8
2004,25.5,61.6
2005,26.22,62.8
2006,27.74,64.3
2007,34.35,65.9
2008,42.6,63.8
2009,25.39,70.6
2010,34.22,66.0
2011,40.61,63.7


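**Observation**: The table gives the mean net worth and mean age of the top 10 for each year. In the years shown, mean net worth peaked at 42.6 in 2008, dropped to 25.39 in 2009, and recovered to 40.61 by 2011, while mean age rose from 55.0 in 2002 to a high of 70.6 in 2009.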