# The World Happiness Report Data Analysis

## Table of Contents


[A. Importing, cleaning and numerical summaries](#Import) <br>
[B. Indexing and grouping](#Indexing) <br>
[C. Bar plot of the Happiness Score](#plot) <br>
[D. Histogram of Job Satisfaction](#hist) <br>
[E. Pairwise Scatter plots](#scat) <br>
[F. Correlation](#corr) <br>
[G. Probabilities](#prob) <br>
[H. Matrices](#mat)



***

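## A. Importing, cleaning and numerical summaries
<a id="Import"></a>

1. Download the dataset. Select the csv from the "Resources" tab.
2. Import the data as a DataFrame.
3. Check the number of observations.
4. Obtain the column headings.
5. Check the data type of each column.
6. Check whether there are any missing values.
7. If necessary, remove observations so that there are no missing values and the values in each column share the same data type.
8. Obtain the mean, minimum and maximum value of each column that contains numerical data.
9. List the 10 happiest countries.
10. List the 10 least happy countries.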

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
WHR = pd.read_csv("World Happiness Report.csv")

In [None]:
WHR.head(10)

In [None]:
WHR.shape

In [None]:
print("There are {:,} rows and {} columns in our data".format(*WHR.shape))

In [None]:
WHR.set_index('Country', inplace=True)

In [None]:
WHR.info()

In [None]:
WHR.isnull().sum()

In [None]:
NULLS = WHR[WHR.isnull().any(axis=1)]

In [None]:
NULLS.head()

In [None]:
WHR.dropna(inplace=True)

In [None]:
WHR.isnull().sum()
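
Step 7 also asks that every value in a column share one data type, which `dropna` alone does not check. The sketch below shows one way to handle a numeric column that was read in as text. It runs on a small made-up frame (`toy` and its values are illustrative assumptions, not the report data).

```python
import pandas as pd

# Toy stand-in for the report data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        "Happiness Score": ["7.5", "6.1", "n/a", "4.2"],  # numbers stored as text
        "Region": ["Western Europe", "Africa", "Africa", "Asia-Pacific"],
    }
).set_index("Country")

# Coerce the column to numeric; entries that cannot be parsed become NaN.
toy["Happiness Score"] = pd.to_numeric(toy["Happiness Score"], errors="coerce")

# Drop the rows that could not be parsed, then inspect the resulting dtypes.
clean = toy.dropna(subset=["Happiness Score"])
print(clean.dtypes)
print(clean)
```

Whether this step is needed for the real file depends on what `WHR.info()` reports for each column.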

In [None]:
WHR.duplicated().sum()

In [None]:
WHR.describe()

In [None]:
WHR.sort_values(by="Happiness Rank", ascending=True).head(10)

In [None]:
WHR.sort_values(by="Happiness Rank", ascending=False).head(10)
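
The two cells above rely on the `Happiness Rank` column. As an alternative sketch, `nlargest` and `nsmallest` pick the extremes directly from the score. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up scores for five hypothetical countries.
toy = pd.DataFrame(
    {"Happiness Score": [7.5, 6.1, 3.4, 4.2, 5.0]},
    index=pd.Index(["A", "B", "C", "D", "E"], name="Country"),
)

# The two happiest and the two least happy countries, ranked by score.
happiest = toy.nlargest(2, "Happiness Score")
least_happy = toy.nsmallest(2, "Happiness Score")

print(happiest)
print(least_happy)
```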

***

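## B. Indexing and grouping
<a id="Indexing"></a>

1. Use the Region column to create a separate DataFrame for each of the six regions: North America, Latin America, Western Europe, Eastern Europe, Asia-Pacific and Africa.
2. Compute the mean Happiness Score for each region and rank the regions from happiest to least happy.
3. Compute the number of countries in each region that have a Happiness Score above 6.0.
4. Compute the difference between the maximum and minimum Happiness Score for each region. Which region has the largest range of Happiness Scores?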

In [None]:
WHR_Region = WHR.groupby('Region')

In [None]:
WHR_Region['Happiness Score'].describe().sort_values(by="mean",ascending=True).head(10)

In [None]:
WHR[WHR["Region"]=="Europe"].head()

In [None]:
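WHR['Region'] = WHR['Region'].replace('Europe', 'Eastern Europe')  # relabel the Region column only

In [None]:
WHR_Region = WHR.groupby('Region')  # regroup: the earlier groupby predates the relabelling
WHR_Region['Happiness Score'].describe().sort_values(by="mean", ascending=False).head(10)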

In [None]:
WHR_A = WHR[WHR['Region'] == 'Africa']
WHR_WE = WHR[WHR['Region'] == 'Western Europe']
WHR_EE = WHR[WHR['Region'] == 'Eastern Europe']
WHR_LA = WHR[WHR['Region'] == 'Latin America']
WHR_AP = WHR[WHR['Region'] == 'Asia-Pacific']
WHR_NA = WHR[WHR['Region'] == 'North America']

In [None]:
len(WHR_A[WHR_A['Happiness Score'] > 6])

In [None]:
print("There are {} countries in Africa that have a happiness score above 6.0 ".format(len(WHR_A[WHR_A['Happiness Score'] > 6])))


In [None]:
len(WHR_WE[WHR_WE['Happiness Score'] > 6])

In [None]:
print("There are {} countries in Western Europe that have a happiness score above 6.0 ".format(len(WHR_WE[WHR_WE['Happiness Score'] > 6])))


In [None]:
len(WHR_EE[WHR_EE['Happiness Score'] > 6])

In [None]:
print("There is {} country in Eastern Europe that has a happiness score above 6.0 ".format(len(WHR_EE[WHR_EE['Happiness Score'] > 6])))


In [None]:
len(WHR_AP[WHR_AP['Happiness Score'] > 6])

In [None]:
print("There are {} countries in the Asia-Pacific region that have a happiness score above 6.0".format(len(WHR_AP[WHR_AP['Happiness Score'] > 6])))


In [None]:
len(WHR_LA[WHR_LA['Happiness Score'] > 6])

In [None]:
print("There are {} countries in Latin America that have a happiness score above 6.0".format(len(WHR_LA[WHR_LA['Happiness Score'] > 6])))


In [None]:
len(WHR_NA[WHR_NA['Happiness Score'] > 6])

In [None]:
print("There are {} countries in North America that have a happiness score above 6.0".format(len(WHR_NA[WHR_NA['Happiness Score'] > 6])))
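
The six per-region counts can also be obtained in a single grouped operation. This is a minimal sketch on made-up data (`toy` is an illustrative assumption). It sums a boolean mask within each region, so regions with no country above 6.0 still appear with a count of 0.

```python
import pandas as pd

# Made-up regions and scores for illustration.
toy = pd.DataFrame(
    {
        "Region": ["Western Europe", "Western Europe", "Africa", "Africa", "Asia-Pacific"],
        "Happiness Score": [7.5, 6.4, 4.1, 5.9, 6.2],
    }
)

# True/False per country, summed within each region (True counts as 1).
above_six = (toy["Happiness Score"] > 6).groupby(toy["Region"]).sum()

print(above_six)
```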


In [None]:
Delta_NA = WHR_NA['Happiness Score'].max() - WHR_NA['Happiness Score'].min()
print(Delta_NA)

In [None]:
Delta_EE = WHR_EE['Happiness Score'].max() - WHR_EE['Happiness Score'].min()
print(Delta_EE)

In [None]:
Delta_WE = WHR_WE['Happiness Score'].max() - WHR_WE['Happiness Score'].min()
print(Delta_WE)

In [None]:
Delta_A = WHR_A['Happiness Score'].max() - WHR_A['Happiness Score'].min()
print(Delta_A)

In [None]:
Delta_LA = WHR_LA['Happiness Score'].max() - WHR_LA['Happiness Score'].min()
print(Delta_LA)

In [None]:
Delta_AP = WHR_AP['Happiness Score'].max() - WHR_AP['Happiness Score'].min()
print(Delta_AP)

In [None]:
Deltas = {}

In [None]:
Deltas["North America"] = Delta_NA
Deltas["Eastern Europe"] = Delta_EE
Deltas["Western Europe"] = Delta_WE
Deltas["Africa"] = Delta_A
Deltas["Latin America"] = Delta_LA
Deltas["Asia-Pacific"] = Delta_AP

In [None]:
print("The {} region seems to have the largest range of happiness scores".format(max(Deltas, key=Deltas.get)))
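
The same question can be answered without one variable per region. The sketch below computes every region's range with a single `groupby` and picks the widest with `idxmax`. The data in `toy` are made up for illustration.

```python
import pandas as pd

# Made-up regions and scores for illustration.
toy = pd.DataFrame(
    {
        "Region": ["Western Europe", "Western Europe", "Africa", "Africa", "Africa",
                   "Asia-Pacific", "Asia-Pacific"],
        "Happiness Score": [7.5, 6.4, 4.1, 5.9, 2.9, 6.2, 5.0],
    }
)

# Range (max minus min) of the score within each region.
scores = toy.groupby("Region")["Happiness Score"]
ranges = scores.max() - scores.min()

# Label of the region with the largest range.
widest = ranges.idxmax()

print(ranges.sort_values(ascending=False))
print("Largest range:", widest)
```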

***

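## C. Bar plot of the Happiness Score
<a id="plot"></a>

1. Obtain a horizontal bar chart of the Happiness Score of the top 10 countries. The country names should be listed vertically along the y-axis, and the x-axis should have a label for each number from 0 to 8. Make sure the chart has an appropriate title and labels.
2. Modify the bar chart from step 1 into a stacked bar chart in which the overall Happiness Score is divided into seven parts corresponding to the following columns:
   * Economy
   * Family
   * Health
   * Freedom
   * Generosity
   * Corruption
   * Dystopia

   Choose a distinct color for each category and add an appropriate legend to the chart.
3. Obtain the same stacked horizontal bar chart as in step 2, but this time consider all countries in the Africa region instead of the top 10 countries.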

In [None]:
WHR['Happiness Score'].head(10).plot(xticks=np.arange(9), kind='barh', figsize= (10, 10))
plt.xlabel("Happiness Score")
plt.title('Happiness Score of the top 10 Countries')

In [None]:
WHR[['Economy', 'Family','Health', 'Freedom', 'Generosity', 'Corruption', 'Dystopia']].head(10).plot(kind='barh',
                                                                xticks=np.arange(9), stacked=True, figsize= (10, 10))

plt.xlabel("Happiness Score")
plt.title('Happiness Score of the top 10 Countries')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
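
A stacked bar only reaches the Happiness Score if the seven component columns add up to it. The notebook assumes this about the dataset without checking it. The sketch below shows one way to test that assumption, using made-up values in `toy` and an illustrative tolerance of 0.01.

```python
import numpy as np
import pandas as pd

parts = ["Economy", "Family", "Health", "Freedom", "Generosity", "Corruption", "Dystopia"]

# Made-up component values; country "B" deliberately does not add up.
toy = pd.DataFrame(
    {
        "Economy": [1.5, 0.3],
        "Family": [1.4, 0.9],
        "Health": [0.8, 0.2],
        "Freedom": [0.6, 0.3],
        "Generosity": [0.3, 0.2],
        "Corruption": [0.3, 0.1],
        "Dystopia": [2.3, 1.5],
        "Happiness Score": [7.2, 3.9],
    },
    index=["A", "B"],
)

# Row-wise sum of the seven parts, compared with the reported score.
component_sum = toy[parts].sum(axis=1)
matches = np.isclose(component_sum, toy["Happiness Score"], atol=0.01)

print(toy.assign(component_sum=component_sum, matches=matches))
```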

In [None]:
WHR_A[['Economy', 'Family','Health', 'Freedom', 'Generosity', 'Corruption', 'Dystopia']].plot(kind='barh',
                                                                xticks=np.arange(9), stacked=True, figsize= (10, 10))

plt.xlabel("Happiness Score")
plt.title('Happiness Score of the Countries in Africa')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

***

## D. Histogram of Job Satisfaction
<a id="hist"></a>

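Obtain a histogram of Job Satisfaction using the following categories, which match the bin edges used in the cell below: 40%-50%, 50%-60%, 60%-70%, 70%-80%, 80%-90%, 90%-100%.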


In [None]:
WHR['Job Satisfaction'].plot(kind='hist', bins=[ 40, 50, 60, 70, 80, 90, 100], figsize=(10,10))

plt.xlabel("Job Satisfaction in Percent")
plt.title("Distribution of Job Satisfaction")
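
To read off the count in each bin rather than estimate it from the plot, the same bin edges can be passed to `np.histogram`, which matplotlib uses internally for `hist`. The values in `job` are made up for illustration.

```python
import numpy as np
import pandas as pd

bins = [40, 50, 60, 70, 80, 90, 100]

# Made-up job satisfaction percentages.
job = pd.Series([45.0, 55.5, 58.0, 72.3, 91.0, 99.9], name="Job Satisfaction")

# Each bin is half-open [left, right), except the last, which also includes 100.
counts, edges = np.histogram(job, bins=bins)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print("{:.0f}-{:.0f}: {}".format(left, right, n))
```

Values below 40 would fall outside every bin and be silently left out of both the counts and the plot.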

***

## E. Pairwise Scatter plots
<a id="scat"></a>

Obtain scatter plots of the Happiness Score versus each of the other variables. The plots should be displayed as a grid of multiple plots and obtained with one command, as opposed to a separate command for each plot.

In [None]:
sns.pairplot(data=WHR, kind='reg', height=5,  # `size` was renamed to `height` in seaborn 0.9
                  x_vars=['Happiness Score'],
                  y_vars=['Economy', 'Family','Health', 'Freedom', 'Generosity', 'Corruption', 'Dystopia'])

In [None]:
sns.pairplot(data=WHR, height=5, hue='Region',  # `size` was renamed to `height` in seaborn 0.9
                  x_vars=['Happiness Score'],
                  y_vars=['Economy', 'Family','Health', 'Freedom', 'Generosity', 'Corruption', 'Dystopia'])

***

## F. Correlation
<a id="corr"></a>

Obtain the correlation between the Happiness Score and each of the other variables. Which variable has the highest correlation with the Happiness Score?

In [None]:
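# Restrict to numeric columns: the text column Region cannot be correlated.
WHR.select_dtypes(include='number').corr(method="pearson", min_periods=20)["Happiness Score"].sort_values(ascending=False)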

In [None]:
WHR.select_dtypes(include='number').corr(method="pearson", min_periods=20)["Happiness Score"].abs().sort_values(ascending=False)

If we ignore the Happiness Rank, Job Satisfaction seems to have the highest correlation with the Happiness Score.
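
The variable with the strongest correlation can also be picked out programmatically. The sketch below uses a made-up frame (`toy` and its values are illustrative assumptions). It drops the score itself and `Happiness Rank` before taking the largest absolute correlation.

```python
import pandas as pd

# Made-up values for five hypothetical countries.
toy = pd.DataFrame(
    {
        "Happiness Score": [7.5, 6.4, 5.0, 4.1, 3.0],
        "Happiness Rank": [1, 2, 3, 4, 5],
        "Economy": [1.6, 1.4, 1.0, 0.7, 0.3],
        "Generosity": [0.3, 0.1, 0.4, 0.2, 0.3],
        "Region": ["Western Europe", "Western Europe", "Asia-Pacific", "Africa", "Africa"],
    }
)

# Pearson correlation of every numeric column with the score.
corr_with_score = (
    toy.select_dtypes(include="number")
    .corr(method="pearson")["Happiness Score"]
    .drop(["Happiness Score", "Happiness Rank"])  # trivial or rank-derived entries
)

# Column whose correlation with the score is largest in absolute value.
strongest = corr_with_score.abs().idxmax()

print(corr_with_score)
print("Strongest:", strongest)
```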

In [None]:
WHR.select_dtypes(include='number').corr(method="pearson", min_periods=20)

In [None]:
corr = WHR.select_dtypes(include='number').corr(method="pearson")

f, ax = plt.subplots(figsize=(10, 10))

# np.bool was removed from NumPy; the builtin bool is the replacement.
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax)

***

## G. Probabilities
<a id="prob"></a>

Compute the probability that a randomly selected country with a Happiness Score over 6.0 is from Western Europe. You will have to use pandas to count the appropriate quantities.

In [None]:
WHR[WHR['Happiness Score'] > 6].shape[0]

In [None]:
WHR[(WHR['Happiness Score'] > 6) & (WHR['Region'] == 'Western Europe')].shape[0]

In [None]:
float(len(WHR[(WHR['Happiness Score'] > 6) & (WHR['Region'] == 'Western Europe')]))/float(len(WHR[WHR['Happiness Score'] > 6]))

In [None]:
p_we = float(len(WHR[(WHR['Happiness Score'] > 6) & (WHR['Region'] == 'Western Europe')])) / len(WHR[WHR['Happiness Score'] > 6])
print("The probability that a randomly selected country with a happiness score over 6.0 is from Western Europe is {0:.0%}".format(p_we))
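
The quantity above is a conditional probability: the number of countries that are in Western Europe and score above 6.0, divided by the number of countries that score above 6.0. Equivalently, it is the mean of a boolean mask taken over the high-scoring countries only. The sketch below uses made-up data in `toy`.

```python
import pandas as pd

# Made-up regions and scores for illustration.
toy = pd.DataFrame(
    {
        "Region": ["Western Europe", "Western Europe", "Asia-Pacific", "North America", "Africa"],
        "Happiness Score": [7.5, 6.4, 6.2, 7.0, 4.1],
    }
)

# Condition on the event "score above 6.0" first...
high = toy[toy["Happiness Score"] > 6]

# ...then take the share of those countries that are in Western Europe.
p = (high["Region"] == "Western Europe").mean()

print("{:.0%}".format(p))
```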

***

## H. Matrices
<a id="mat"></a>

Define a matrix whose rows correspond to countries and the columns to the regions. Fill in the matrix with 0/1 values where entry (i,j) is a 1 if the country in row i is in the region in column j and a 0 otherwise.

In [None]:
WHR.shape

In [None]:
Western_Europe = []
Eastern_Europe = []
North_America = []
Latin_America = []
Asia_Pacific = []
Africa = []

In [None]:
for x in WHR['Region']:
    if x == 'Western Europe':
        Western_Europe.append(1)
    else:
        Western_Europe.append(0)

In [None]:
for x in WHR['Region']:
    if x == 'Eastern Europe':
        Eastern_Europe.append(1)
    else:
        Eastern_Europe.append(0)

In [None]:
for x in WHR['Region']:
    if x == 'North America':
        North_America.append(1)
    else:
        North_America.append(0)

In [None]:
for x in WHR['Region']:
    if x == 'Latin America':
        Latin_America.append(1)
    else:
        Latin_America.append(0)

In [None]:
for x in WHR['Region']:
    if x == 'Asia-Pacific':
        Asia_Pacific.append(1)
    else:
        Asia_Pacific.append(0)

In [None]:
for x in WHR['Region']:
    if x == 'Africa':
        Africa.append(1)
    else:
        Africa.append(0)

In [None]:
Matrix = pd.DataFrame(index=WHR.index)

In [None]:
Matrix['Western Europe'] = Western_Europe
Matrix['Eastern Europe'] = Eastern_Europe
Matrix['North America'] = North_America
Matrix['Latin America'] = Latin_America
Matrix['Asia-Pacific'] = Asia_Pacific
Matrix['Africa'] = Africa

In [None]:
Matrix.head(20)

In [None]:
array_Matrix = Matrix.iloc[:,:].values


In [None]:
array_Matrix
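
The same 0/1 matrix can be built in one call with `pd.get_dummies`, shown here as an alternative sketch on made-up data (`toy` is an illustrative assumption). The columns come out in alphabetical order rather than the order used above.

```python
import pandas as pd

# Made-up countries and regions for illustration.
toy = pd.DataFrame(
    {"Region": ["Western Europe", "Africa", "Asia-Pacific", "Africa"]},
    index=pd.Index(["A", "B", "C", "D"], name="Country"),
)

# One column per region; entry (i, j) is 1 if country i is in region j, else 0.
indicator = pd.get_dummies(toy["Region"], dtype=int)

# Plain NumPy array, analogous to array_Matrix.
matrix = indicator.to_numpy()

print(indicator)
print(matrix)
```

Because each country belongs to exactly one region, every row of the matrix sums to 1, which makes a quick sanity check.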