# Finding the Happiness Factor
For our CMSC320 class, our team performed an exploratory analysis of the World Happiness Report 2021 and the World Development Indicators to determine which development factors best correspond with increased happiness across the globe.
- Project by Victor Novichkov, Jacob Livchitz

### Downloading the datasets
- Need to have Python installed
- Recommended to install the kaggle pip library
    - ```pip install kaggle```
- Copy the commands listed under the datasets

### Datasets Used
[World Development Indicators - World Bank](https://datacatalog.worldbank.org/dataset/world-development-indicators) (~200MB at time of writing)
- How to download
    - Go to the **Data & Resources** tab
    - Download the **CSV**

[World Happiness Report 2021 - Ajaypal Singh](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021) (155KB at time of writing)
- How to download
    - ```kaggle datasets download -d ajaypalsinghlo/world-happiness-report-2021```
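
The Kaggle command above normally leaves a zip archive in the working directory. The following is a minimal sketch of unpacking it from Python, using only the standard library. The helper name `extract_csvs` is hypothetical and not part of this project.

```python
import zipfile
from pathlib import Path


def extract_csvs(archive_path, out_dir):
    """Extract every CSV member of a zip archive and return the extracted paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.namelist():
            # Skip anything that is not a CSV (licenses, notes, etc.)
            if member.lower().endswith('.csv'):
                archive.extract(member, out_dir)
                extracted.append(out_dir / member)
    return extracted
```

A call might look like `extract_csvs('world-happiness-report-2021.zip', './world-happiness-index')`, where the archive name is an assumption based on the dataset slug. The notebook below reads `./world-happiness-index/whr.csv`, so the extracted file may need to be renamed to match.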


## Looking at the Happiest Countries

```python
import pandas as pd
import numpy as np

whr_path = './world-happiness-index/whr.csv'
wdi_path = './world-development-index/WDIData_trimmed_n.csv'

whr_data = pd.read_csv(whr_path)
simplified_whr_data = whr_data[['Country Name', 'Life Ladder', 'Year']] # get just the name / score / and year

countries = pd.unique(simplified_whr_data['Country Name'])
print("Total Countries with Indices: {}".format(len(countries)))

wdi_data = pd.read_csv(wdi_path, index_col=0)
```

```
Total Countries with Indices: 166
```


```python
print(simplified_whr_data[simplified_whr_data['Country Name'] == 'Rwanda']['Year'])
```

```
1455    2006
1456    2008
1457    2009
1458    2011
1459    2012
1460    2013
1461    2014
1462    2015
1463    2016
1464    2017
1465    2018
1466    2019
Name: Year, dtype: int64
```
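
The Rwanda output shows that survey years are not contiguous: 2007 and 2010 are absent. Below is a small sketch of how that coverage could be summarized per country. The helper `year_coverage` is a hypothetical name, and the sample frame simply re-enters the Rwanda years printed above.

```python
import pandas as pd


def year_coverage(df, country_col='Country Name', year_col='Year'):
    """Summarize first year, last year, surveyed years, and gaps per country."""
    summary = df.groupby(country_col)[year_col].agg(
        first_year='min', last_year='max', n_years='nunique'
    )
    # Years inside the first-to-last span that have no survey entry
    span = summary['last_year'] - summary['first_year'] + 1
    summary['missing_years'] = span - summary['n_years']
    return summary


# Rwanda's years, copied from the output above
rwanda = pd.DataFrame({
    'Country Name': 'Rwanda',
    'Year': [2006, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
})
print(year_coverage(rwanda))
```

Running the same helper on `simplified_whr_data` would give one row per country, which makes it easier to spot countries with sparse coverage before merging.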


## Dealing with Too Much Data
Although the World Bank dataset is amazing, it is too large for GitHub to accept. To be able to collaborate, we had to cut down how much data we were working with. Luckily, there were a couple of methods we could use to cut our dataset roughly in half without drastically influencing our analysis:
- Get rid of countries in the World Development Indicators data that don't have corresponding happiness indices
- Get rid of rows that are mostly NaN/null (we won't be looking at them in our analysis)
    - Note: The World Happiness Report started in 2012, which means most of the World Development Indicators data going back to 1960 isn't relevant. As a result, we can look for rows that have at least 20 non-null entries (1990-2020).

```python
# Creating the truncated dataset
wdi_data = pd.read_csv(wdi_path)

# First by removing countries that don't have indices
boolean_mask = wdi_data['Country Name'].isin(countries)
wdi_cleaned = wdi_data[boolean_mask].copy()  # copy so the in-place edits below don't act on a slice of wdi_data
print("% of dataset trimmed: {0:.4f}".format((1 - (len(wdi_cleaned)/len(wdi_data)))*100))

# Remove all data from 1960 - 1999 (most of the data from before the Happiness Index even existed)
wdi_pre_drop = len(wdi_cleaned.columns)
wdi_cleaned = wdi_cleaned.drop(columns=wdi_cleaned.columns[4:44])
# Fraction of columns removed: 1 - (columns after / columns before)
print("% of dataset trimmed: {0:.4f}".format((1 - (len(wdi_cleaned.columns)/wdi_pre_drop))*100))

# Get rid of all rows that have fewer than 8 non-null values over the time period (2000-2020)
# Note: thresh counts every non-null cell in the row, including the identifier columns
pre_dropna = len(wdi_cleaned)
wdi_cleaned = wdi_cleaned.dropna(axis=0, thresh=8)
print("% of dataset trimmed: {0:.4f}".format((1 - (len(wdi_cleaned)/pre_dropna)) * 100))

# wdi_cleaned.to_csv('./world-development-index/WDIData_trimmed_n.csv')
```

Since the original dataset could not be uploaded, we can't show the comparison live. In our own run, however:

- We trimmed 44.7% of the data just by removing countries without happiness indices.
- We trimmed a further 45.45% of the data by removing the data points from 1960-1989, leaving only 1990-2020.
- We trimmed an additional 15.99% of the data by dropping sparse rows, bringing our final file size to ~27 MB.
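
One detail of the sparse-row filter is worth illustrating: `dropna(thresh=...)` counts every non-null cell in a row, identifier columns included. The toy frame below is made-up illustrative data, not taken from the World Bank file. It is a sketch of how passing `subset` restricts the count to the year columns only.

```python
import numpy as np
import pandas as pd

# Made-up example: two identifier columns and three year columns
toy = pd.DataFrame({
    'Country Name': ['A', 'B'],
    'Indicator Code': ['X', 'X'],
    '2000': [1.0, np.nan],
    '2001': [2.0, np.nan],
    '2002': [3.0, 5.0],
})
year_cols = [c for c in toy.columns if c.isdigit()]

# Without subset, row B passes thresh=3 (2 identifiers + 1 value)
kept_all_cols = toy.dropna(thresh=3)

# With subset, only actual datapoints are counted, so row B is dropped
kept_year_cols = toy.dropna(thresh=3, subset=year_cols)

print(len(kept_all_cols), len(kept_year_cols))
```

Whether to count the identifier columns is a judgment call. Restricting the count to year columns makes the threshold match the number of datapoints directly.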

```python
sample_code = "EG.CFT.ACCS.ZS"

# Gets Country Name, Country Code, Year, Indicator Value, Score
def getAllForIndicator(df, hi, indicator_code):

    countries = df[df['Indicator Code'] == indicator_code].copy()
    countries.drop(axis=1, labels=['Indicator Name', 'Indicator Code'], inplace=True)
    countries = pd.melt(countries, id_vars=['Country Name', 'Country Code'], var_name='Year', value_name='indicator_value')
    countries = countries.astype({'Year': np.int64})

    return countries.merge(hi, how='inner', on=['Country Name', 'Year'])

getAllForIndicator(wdi_data, simplified_whr_data, sample_code)
```


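|      | Country Name   | Country Code | Year | indicator_value | Life Ladder |
|------|----------------|--------------|------|-----------------|-------------|
| 0    | Australia      | AUS          | 2005 | 100.00          | 7.341       |
| 1    | Belgium        | BEL          | 2005 | 100.00          | 7.262       |
| 2    | Brazil         | BRA          | 2005 | 91.04           | 6.637       |
| 3    | Canada         | CAN          | 2005 | 100.00          | 7.418       |
| 4    | Czech Republic | CZE          | 2005 | 95.65           | 6.439       |
| ...  | ...            | ...          | ...  | ...             | ...         |
| 1685 | United Kingdom | GBR          | 2020 | NaN             | 6.798       |
| 1686 | United States  | USA          | 2020 | NaN             | 6.310       |
| 1687 | Uruguay        | URY          | 2020 | NaN             | 6.310       |
| 1688 | Zambia         | ZMB          | 2020 | NaN             | 4.838       |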
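
Once an indicator is merged with the happiness scores, one possible next step is to measure how strongly the two move together. The sketch below assumes a plain Pearson correlation is an acceptable first look. The helper `indicator_correlation` is a hypothetical name, and the sample frame re-enters only the first five rows and one missing-value row from the output above, far too few rows to draw any conclusion.

```python
import numpy as np
import pandas as pd

# A handful of rows copied from the merged output above (illustration only)
merged = pd.DataFrame({
    'Country Name': ['Australia', 'Belgium', 'Brazil', 'Canada',
                     'Czech Republic', 'United Kingdom'],
    'Year': [2005, 2005, 2005, 2005, 2005, 2020],
    'indicator_value': [100.00, 100.00, 91.04, 100.00, 95.65, np.nan],
    'Life Ladder': [7.341, 7.262, 6.637, 7.418, 6.439, 6.798],
})


def indicator_correlation(df, value_col='indicator_value', score_col='Life Ladder'):
    """Pearson correlation between an indicator and the happiness score.

    Rows missing either value are dropped before computing the correlation.
    """
    complete = df.dropna(subset=[value_col, score_col])
    return complete[value_col].corr(complete[score_col])


r = indicator_correlation(merged)
print(round(r, 3))
```

Applied to the full result of `getAllForIndicator`, the same helper would give one number per indicator code, which could then be ranked across indicators.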


```python
# When one thinks of happiness, one thinks of money. Let's first plot national income per capita alongside the happiness score.
ipc_name = 'Adjusted net national income per capita (current US$)'
year_cols = [c for c in wdi_data.columns if str(c).isdigit()]  # keep only the year columns

# Reshape the income rows from wide (one column per year) to long (one row per country-year)
incomePerCap = (
    wdi_data[wdi_data['Indicator Name'] == ipc_name]
    .melt(id_vars=['Country Name'], value_vars=year_cols, var_name='Year', value_name='IPC')
    .dropna(subset=['IPC'])
    .astype({'Year': np.int64})
)

# Match income to happiness scores on country and year.
# Both frames must use the same key names: simplified_whr_data calls its column 'Year', not 'year'.
whrIPC = incomePerCap.merge(simplified_whr_data, how='inner', on=['Country Name', 'Year'])
whrIPC = whrIPC.rename(columns={'Country Name': 'Country', 'Life Ladder': 'Score'})
whrIPC = whrIPC[['Country', 'Year', 'Score', 'IPC']]
print(whrIPC.head(100))

# Plot income per capita and happiness score against year on the same axes
ax = incomePerCap.plot.scatter(x='Year', xlabel='year', y='IPC', ylabel='income per capita', rot=90, color='b', title='year vs income per cap.')
simplified_whr_data.plot.scatter(x='Year', xlabel='year', y='Life Ladder', ylabel='Happy Score', rot=90, color='r', title='happiness vs time', ax=ax)
```
