### Purpose:

Use descriptive statistics to answer questions about the military expenditure data, such as:

+ Which countries have the highest and lowest average spending
+ Which countries are above and below average
+ How countries group by percentile
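
As a rough preview of the last two questions, the sketch below flags rows as above or below the mean and groups them by quartile. The `toy` DataFrame, its `spend` column, and the quartile labels are made-up assumptions for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "spend": [10, 20, 30, 40, 50, 60, 70, 1000],
})

# Flag each row as above (True) or not above (False) the column mean.
toy["above_average"] = toy["spend"] > toy["spend"].mean()

# Split the rows into four equal-sized groups using the quartiles of "spend".
toy["quartile"] = pd.qcut(toy["spend"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(toy)
```

One large value pulls the mean up far enough that only a single row ends up above average, while the quartile groups stay evenly sized.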

What concepts will be practiced?

+ Estimates of Location
+ Estimates of Variability
+ Exploring Data Distribution
+ Binary and Categorical Data

In [1]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats

df = pd.read_csv('data/military_expenditure.csv')
df.head(5)

Unnamed: 0,Name,Population,Code,Type,Indicator Name,2016,2017,2018
0,Afghanistan,38928346,AFG,Country,Military expenditure (current USD),185878310,191407100.0,198086300.0
1,Angola,32866272,AGO,Country,Military expenditure (current USD),2764054937,3062873000.0,1983614000.0
2,Albania,2877797,ALB,Country,Military expenditure (current USD),130853163,144382700.0,180488700.0
3,Argentina,45195774,ARG,Country,Military expenditure (current USD),4509647660,5459644000.0,4144992000.0
4,Armenia,2963243,ARM,Country,Military expenditure (current USD),431396219,443610400.0,608854600.0


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            153 non-null    object 
 1   Population      153 non-null    int64  
 2   Code            153 non-null    object 
 3   Type            153 non-null    object 
 4   Indicator Name  153 non-null    object 
 5   2016            153 non-null    int64  
 6   2017            152 non-null    float64
 7   2018            150 non-null    float64
dtypes: float64(2), int64(2), object(4)
memory usage: 9.7+ KB


In [3]:
round(df.describe().T,2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Population,153.0,49633380.0,165268400.0,98347.0,4649658.0,11402528.0,37742150.0,1439324000.0
2016,153.0,10597500000.0,52360880000.0,0.0,137649100.0,552381264.0,4357991000.0,600000000000.0
2017,152.0,11089940000.0,53407600000.0,0.0,159574200.0,592642871.0,4408412000.0,606000000000.0
2018,150.0,11812200000.0,57575010000.0,0.0,200507900.0,697514650.0,4714833000.0,649000000000.0


## Calculating Different Estimates of Location

### The first exercise uses the year 2016

#### Mean 
+ Sum of all values divided by the number of values

#### Median
+ (50th percentile) The value such that half of the data lies above it and the other half below

#### Trimmed Mean
+ The average of all values after dropping a fixed number of the most extreme values from each end

#### Weighted Mean
+ The sum of each value multiplied by its weight, divided by the sum of the weights
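
To see how the four estimates react to an extreme value, here is a minimal sketch on made-up numbers. The `values` and `weights` arrays are illustrative assumptions, not data from the notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical values with one extreme observation, plus hypothetical weights.
values = np.array([1, 2, 3, 4, 100])
weights = np.array([1, 1, 1, 1, 6])

mean = values.mean()                             # sum of values / number of values
median = np.median(values)                       # middle value of the sorted data
trimmed = stats.trim_mean(values, 0.2)           # drop 20% (one value) from each end, then average
weighted = np.average(values, weights=weights)   # sum(value * weight) / sum(weights)

print(mean, median, trimmed, weighted)
```

The mean is dragged towards the extreme value, the median and trimmed mean are not, and the weighted mean depends entirely on which observations carry the most weight.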

In [4]:
# take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
year_2016 = df[['Name', 'Population', 'Code', '2016']].copy()
year_2016['2016'] = year_2016['2016'].fillna(0)
year_2016.head(5)

Unnamed: 0,Name,Population,Code,2016
0,Afghanistan,38928346,AFG,185878310
1,Angola,32866272,AGO,2764054937
2,Albania,2877797,ALB,130853163
3,Argentina,45195774,ARG,4509647660
4,Armenia,2963243,ARM,431396219


In [5]:
mean_spend = round(year_2016['2016'].mean(), 2)
mean_spend

10597496483.98

In [6]:
median_spend = round(year_2016['2016'].median(), 2)
median_spend

552381264.0

In [7]:
trim_mean_spend = round(stats.trim_mean(year_2016['2016'], 0.1),2)
trim_mean_spend

2226828354.13
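
For intuition about what `stats.trim_mean` does with a proportion of 0.1, the sketch below sorts the values, drops 10% of them from each end (rounded down), and averages the rest. `manual_trim_mean` and `sample` are hypothetical names and data introduced only for this illustration.

```python
import numpy as np
from scipy import stats

def manual_trim_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * len(ordered))  # values dropped per tail, rounded down
    return ordered[k:len(ordered) - k].mean()

# Hypothetical sample with one extreme value.
sample = [5, 1, 9, 3, 7, 2, 8, 4, 6, 1000]

manual = manual_trim_mean(sample, 0.1)
library = stats.trim_mean(sample, 0.1)
print(manual, library)
```

With the 153 rows used above and a proportion of 0.1, the same rounding would drop 15 values from each end before averaging.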

In [8]:
weighted_mean = np.average(year_2016['2016'], weights = year_2016['Population'])
weighted_mean

84432394984.36955

## Estimates of Variability 

### Deviations
+ The difference between an observed value and the expected value of a variable. In other words, it is the distance from the centre point
+ Deviation = value - mean of the data

### Variance 
+ The sum of squared deviations from the mean divided by n - 1, where n is the number of data values
+ Sum of squared deviations / (count of data values - 1)

### Standard Deviation
+ The Square Root of the Variance

### Mean Absolute Deviation
+ The mean of the absolute values of the deviations from the mean

### Percentile
+ The value such that P percent of the values take on this value or less, and (100 - P) percent take on this value or more

### Interquartile Range (IQR)
+ The difference between the 75th percentile and the 25th percentile
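
The definitions above can be worked through by hand on a small array. This is a minimal sketch on made-up numbers (`data` is an illustrative assumption), using the n - 1 divisor for the variance as defined above.

```python
import numpy as np

# Hypothetical data chosen so the arithmetic is easy to follow.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()
deviations = data - mean                                 # value - mean, one per observation
variance = (deviations ** 2).sum() / (len(data) - 1)     # sum of squared deviations / (n - 1)
std = np.sqrt(variance)                                  # square root of the variance
mean_abs_dev = np.abs(deviations).mean()                 # mean of the absolute deviations
q1, q3 = np.percentile(data, [25, 75])                   # 25th and 75th percentiles
iqr = q3 - q1                                            # interquartile range

print(variance, std, mean_abs_dev, iqr)
```

The deviations always sum to zero, which is why they are squared or taken in absolute value before being averaged.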

In [9]:
# calculating the sample standard deviation (n - 1 divisor) of the Population column

std_deviation = round(df['Population'].std(),2)
std_deviation

165268351.51
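
One detail worth keeping in mind: pandas `.std()` divides by n - 1 by default (`ddof=1`), while `np.std` divides by n (`ddof=0`). The sketch below shows the difference on a made-up Series; `s` is an illustrative assumption, not the `Population` column.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = s.std()                   # pandas default: ddof=1 (divide by n - 1)
population_std = s.std(ddof=0)         # divide by n instead
numpy_default = np.std(s.to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_default)
```

On a column with 153 rows the two versions differ only slightly, but the gap grows as the number of values shrinks.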

In [10]:
iqr = round(df['Population'].quantile(0.75) - df['Population'].quantile(0.25), 2)
iqr

33092496.0

In [31]:
# using numpy to find mean absolute deviation

data = df['Population']
mean = np.mean(data)
abs_dev = np.absolute(data - mean)
mad = round(np.mean(abs_dev), 2)
mad

59524147.87

In [33]:
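# now let's cross-check with pandas to see if we get the same answer
# note: Series.mad() was deprecated in pandas 1.5 and removed in 2.0,
# so the same quantity is computed here with Series methods

p_mad = round((df['Population'] - df['Population'].mean()).abs().mean(), 2)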
p_mad

59524147.87