## Exploratory Data Analysis (EDA)

### Describe your dataset

Consider the following questions to guide you in your exploration:

- Who: Which company/agency/organization provided this data?

This data is from http://serebii.net/ (was scraped)

- What: What is in your data?

**Content of the data:**

1. **country:** The English name of the country
1. **period:** The year the data was collected for
1. **h_index:** The Happiness Index score
1. **alco:** Alcohol total per capita (15+) consumption
1. **homi:** Estimated rate of homicide per 100 000 population
1. **road:** Estimated number of road traffic deaths
1. **suic:** Crude suicide rates (per 100 000 population)


- When: When was your data collected (for example, for which years)?

This data covers 2015-2019 and was last updated to include 2019.

- Why: What is the purpose of your dataset? Is it for transparency/accountability, public interest, fun, learning, etc...

The Happiness Index is used as a measure of the 'happiness' of a country's population. It has been used to assess the effectiveness of policy making and is useful for gauging a population's reaction to different events and circumstances.

The indicator data is from the WHO and is useful for knowing the overall health of populations across the world. It is useful for determining what areas the world should be focusing on to improve overall health.

- How: How was your data collected? Was it a human collecting the data? Historical records digitized? Server logs?

The Gallup World Poll interviews people directly and works with partnering organizations to collect data. The WHO has a Global Health Observatory data repository that mainly derives data from population-based sources (household surveys, civil registration systems of vital events) and institution-based sources (administrative and operational activities of institutions).

In [1]:
from scripts import project_functions as pf  # Absolute import of the local `scripts` package
from scripts import vari

# pandas and NumPy are reached through project_functions as pf.pd and pf.np
df = pf.pd.read_csv('../../data/processed/h_ind_merged_noalco.csv')
df.head()
df.describe(include=[pf.np.number]).T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| period | 5562.0 | 2016.997843 | 1.415483 | 2015.0 | 2016.0 | 2017.0 | 2018.0 | 2019.0 |
| h_index | 5562.0 | 5.448548 | 1.129708 | 2.839 | 4.518 | 5.3875 | 6.321 | 7.769 |
| homi | 5562.0 | 7.44253 | 13.655071 | 0.0 | 1.18 | 2.93 | 7.82 | 124.5 |
| road | 5562.0 | 8308.38254 | 29579.482764 | 6.95 | 535.0 | 2033.5 | 5686.0 | 258175.0 |
| suic | 5562.0 | 9.044962 | 7.634582 | 0.35 | 3.65 | 6.86 | 11.7725 | 61.71 |


In [2]:
df_al=pf.pd.read_csv('../../data/processed/h_alco.csv')
df_al.head()
df_al.describe( include=[pf.np.number]).T

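| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| period | 719.0 | 2016.997218 | 2.00139 | 2015.0 | 2015.0 | 2015.0 | 2019.0 | 2019.0 |
| h_index | 719.0 | 5.485014 | 1.125695 | 2.839 | 4.5985 | 5.399 | 6.2775 | 7.769 |
| alco | 719.0 | 6.240334 | 5.424805 | 0.0 | 1.8 | 4.8 | 9.7 | 24.7 |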


In [3]:
# Mean happiness score per country, highest first
h_rank = df.groupby('country')[['h_index']].mean()
h_rank = h_rank.sort_values('h_index', ascending=False)
h_rank = h_rank.reset_index()
h_rank

| | country | h_index |
| --- | --- | --- |
| 0 | Denmark | 7.5460 |
| 1 | Norway | 7.5410 |
| 2 | Finland | 7.5378 |
| 3 | Switzerland | 7.5114 |
| 4 | Iceland | 7.5110 |
| ... | ... | ... |
| 119 | Yemen | 3.6258 |
| 120 | Togo | 3.5442 |
| 121 | Afghanistan | 3.5128 |
| 122 | Rwanda | 3.4386 |
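
The summary for `df` reports 5562 rows, while the ranking lists 123 countries. With five years of data (2015-2019), that allows at most 615 distinct country-period pairs, so it may be worth checking whether some pairs occur more than once in the merged file. The sketch below shows one way to count rows per pair. The helper name `key_multiplicity` and the toy values are assumptions for illustration only, not part of the project code or data.

```python
import pandas as pd


def key_multiplicity(frame, keys=("country", "period")):
    """Count how many rows share each combination of the key columns."""
    counts = frame.groupby(list(keys)).size().rename("n_rows").reset_index()
    # Most repeated combinations first
    return counts.sort_values("n_rows", ascending=False).reset_index(drop=True)


# Toy frame with made-up values, only to show the shape of the check
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "period": [2015, 2015, 2016, 2015],
    "h_index": [5.0, 5.0, 5.1, 6.0],
})

mult = key_multiplicity(toy)
print(mult)
```

On the real data the call would be `key_multiplicity(df)`; any `n_rows` above 1 means a country-period pair contributes several rows to the per-country means.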


In [5]:
# Mean alcohol consumption per country, highest first
alco_rank = df_al.groupby('country')[['alco']].mean()
alco_rank = alco_rank.sort_values('alco', ascending=False)
alco_rank = alco_rank.reset_index()
alco_rank

| | country | alco |
| --- | --- | --- |
| 0 | Lithuania | 14.583333 |
| 1 | Germany | 13.000000 |
| 2 | Estonia | 12.983333 |
| 3 | Uganda | 12.950000 |
| 4 | Ireland | 12.733333 |
| ... | ... | ... |
| 119 | Mauritania | 0.000000 |
| 120 | Bangladesh | 0.000000 |
| 121 | Kuwait | 0.000000 |
| 122 | Saudi Arabia | 0.000000 |


In [6]:
# Mean homicide rate per country, highest first
homi_rank = df.groupby('country')[['homi']].mean()
homi_rank = homi_rank.sort_values('homi', ascending=False)
homi_rank = homi_rank.reset_index()
homi_rank

| | country | homi |
| --- | --- | --- |
| 0 | El Salvador | 85.768000 |
| 1 | Honduras | 64.532667 |
| 2 | Jamaica | 51.252667 |
| 3 | Colombia | 41.845333 |
| 4 | South Africa | 37.853333 |
| ... | ... | ... |
| 119 | Switzerland | 0.548000 |
| 120 | Qatar | 0.485333 |
| 121 | Bahrain | 0.346000 |
| 122 | Singapore | 0.326000 |


In [7]:
# Mean estimated number of road traffic deaths per country, highest first
road_rank = df.groupby('country')[['road']].mean()
road_rank = road_rank.sort_values('road', ascending=False)
road_rank = road_rank.reset_index()
road_rank

| | country | road |
| --- | --- | --- |
| 0 | China | 254190.800000 |
| 1 | India | 206626.200000 |
| 2 | Nigeria | 40315.400000 |
| 3 | Brazil | 38416.800000 |
| 4 | Indonesia | 31302.200000 |
| ... | ... | ... |
| 119 | Cyprus | 68.666000 |
| 120 | Montenegro | 56.400000 |
| 121 | Luxembourg | 33.296364 |
| 122 | Malta | 19.582000 |


In [8]:
# Mean crude suicide rate per country, highest first
suic_rank = df.groupby('country')[['suic']].mean()
suic_rank = suic_rank.sort_values('suic', ascending=False)
suic_rank = suic_rank.reset_index()
suic_rank

| | country | suic |
| --- | --- | --- |
| 0 | Lithuania | 31.256667 |
| 1 | South Africa | 24.502000 |
| 2 | Belarus | 23.903333 |
| 3 | Ukraine | 22.222000 |
| 4 | Latvia | 21.287333 |
| ... | ... | ... |
| 119 | Indonesia | 2.391333 |
| 120 | Turkey | 2.316000 |
| 121 | Jamaica | 2.274667 |
| 122 | Philippines | 2.204667 |
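
The five ranking cells above repeat the same three steps: average a column per country, sort from highest to lowest, and reset the index. They could be folded into one helper, sketched below. The name `rank_by_mean` and the toy values are hypothetical and only illustrate the pattern.

```python
import pandas as pd


def rank_by_mean(frame, column, group="country"):
    """Average `column` per group and sort from highest to lowest."""
    return (
        frame.groupby(group)[[column]]   # select the column before averaging
        .mean()
        .sort_values(column, ascending=False)
        .reset_index()
    )


# Toy frame with made-up values
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "h_index": [5.0, 6.0, 7.0, 7.5, 4.0],
    "suic": [10.0, 12.0, 3.0, 5.0, 8.0],
})

h_rank_toy = rank_by_mean(toy, "h_index")
suic_rank_toy = rank_by_mean(toy, "suic")
print(h_rank_toy)
print(suic_rank_toy)
```

With the project data this would read, for example, `rank_by_mean(df, "homi")` or `rank_by_mean(df_al, "alco")`.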


# Data Analysis and Visualizations


## Initial Comparison of Happiness with Four Health Indicators

![Alcohol vs H index](Hindex_alco.png)

1. The relationship between happiness and alcohol consumption does not appear to be linear or strongly correlated. The countries with the highest alcohol consumption have above-average happiness. The densest cluster sits around medium happiness, where countries have very low levels of alcohol consumption. On average, countries with higher happiness indexes also have higher alcohol consumption.

![Road Accidents vs H index](Hindex_road.png)

2. This distribution is relatively flat and concentrated around low numbers of road traffic deaths and middle happiness. The relationship between the variables is again non-linear. The countries with the highest numbers of road traffic deaths have middling happiness (compared to the overall mean of 5.4), while the countries with the highest happiness have very few road traffic deaths.

![Homicide Rates vs H index](Hindex_homi.png)

3. This distribution is also relatively flat, but the widest variation in homicide rates occurs among countries in the top 75th percentile of happiness. The happiest countries appear to have low homicide rates, but the same can be said of the least happy countries. On average, countries with average happiness have low homicide rates.

![Suicide Rates vs H index](Hindex_suic.png)

4. This distribution has a bit of a fan shape. Although it does not follow a clear linear or other pattern, it is interesting to note that happier countries have a fairly wide range of suicide rates. Countries with a happiness score above the mean (5.4) have the highest suicide rates overall. Furthermore, the range of rates is wider for the top 90th percentile of happiness than for the bottom 10th percentile.
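
The observations above are read off the scatter plots. They could be checked numerically with a correlation summary and a per-bin spread, as in the rough sketch below. The function names and the toy values are assumptions for illustration. Spearman correlation is computed here as Pearson correlation on ranks, which picks up monotone but non-linear relationships.

```python
import pandas as pd


def correlation_summary(frame, target, indicators):
    """Pearson and Spearman correlation of each indicator with `target`."""
    rows = []
    for name in indicators:
        pearson = frame[target].corr(frame[name])
        # Spearman = Pearson correlation of the ranked values
        spearman = frame[target].rank().corr(frame[name].rank())
        rows.append({"indicator": name, "pearson": pearson, "spearman": spearman})
    return pd.DataFrame(rows).set_index("indicator")


def spread_by_bin(frame, target, indicator, bins=2):
    """Min, max and range of `indicator` within equal-count bins of `target`."""
    labels = pd.qcut(frame[target], q=bins, labels=False).rename("happiness_bin")
    stats = frame.groupby(labels)[indicator].agg(["min", "max"])
    stats["range"] = stats["max"] - stats["min"]
    return stats


# Toy frame with made-up values: alco grows monotonically but not linearly,
# and suic spreads out more in the upper half of h_index
toy = pd.DataFrame({
    "h_index": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "alco": [1.0, 2.0, 4.0, 8.0, 16.0, 32.0],
    "suic": [4.0, 5.0, 6.0, 2.0, 10.0, 20.0],
})

summary = correlation_summary(toy, "h_index", ["alco", "suic"])
spread = spread_by_bin(toy, "h_index", "suic")
print(summary)
print(spread)
```

A Spearman value well above the Pearson value would support the "non-linear" reading, and a larger `range` in the upper bins would support the fan shape described for suicide rates.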

In [None]:
from pandas_profiling import ProfileReport
# Your solution for `pandas_profiling`

prof = ProfileReport(df)
prof

# Archived

## function testing used for test 2

In [43]:
def over_budget(budget, food_bill, electricity_bill, internet_bill, rent):
    # Return True when the four bills add up to more than the budget
    sum_cost = float(food_bill) + float(electricity_bill) + float(internet_bill) + float(rent)
    if sum_cost > float(budget):
        print('You have gone over budget!')
        return True
    return False

In [44]:
over_budget(200, 90, 20, 100, 10)

You have gone over budget!


True

In [12]:
try:
    print("hi!")
    num = 'three' 
    if num%3 != 0 :
        print("nope")
    else:
        print("yep")
except:
    print("huh?")
finally:
    print("ok")

hi!
huh?
ok


In [13]:
try:
    num1 = 8
    num2 = 0 
    print(num1*num2) 
    print(num1/num2)
    print("a")
except:
    print("b")
else:
    print("c")
finally:
    print("d")

0
b
d
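
In the cell above, `0` is printed before the division fails, `except` then prints `b`, the `else` branch is skipped because an exception occurred, and `finally` prints `d`. The sketch below records which branches run for a failing and a succeeding division. The function name `trace_division` is hypothetical, and it catches `ZeroDivisionError` by name instead of using a bare `except`.

```python
def trace_division(numerator, denominator):
    """Return the list of branches that run while dividing two numbers."""
    steps = []
    try:
        steps.append("try")
        _ = numerator / denominator      # raises ZeroDivisionError when denominator is 0
        steps.append("try-finished")
    except ZeroDivisionError:
        steps.append("except")           # runs only if the try block raised
    else:
        steps.append("else")             # runs only if the try block did not raise
    finally:
        steps.append("finally")          # always runs
    return steps


failing = trace_division(8, 0)
succeeding = trace_division(8, 2)
print(failing)
print(succeeding)
```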


In [66]:
def sum_values(my_dictionary):
    current_sum=0
    for key in my_dictionary:
        current_sum=current_sum+float(my_dictionary[key])
    return current_sum

In [68]:
sum_values({'five': 6, 'seven': 30, 'two': 100, 'three': 30})

166.0

In [19]:
my_dict={'five': 6, 'seven': 3}
for key in my_dict:
    print(my_dict[key])

6
3


Write a function named max_key that takes a dictionary named my_dictionary as a parameter. The function should return the key associated with the largest value in the dictionary. Hint: Begin by creating two variables named largest_key and largest_value. Initialize largest_value to be the smallest number possible (you can use float("-inf")). Initialize largest_key to be an empty string. Loop through all key/value pairs in the dictionary. Any time you find a value larger than what is currently stored in largest_value, replace largest_value with that new value. Similarly, replace largest_key with the key associated with the new largest value.

In [14]:
#return the key associated with the largest value in the dictionary
def max_key(my_dictionary): 
    largest_key=''
    largest_value=float("-inf")
    for key in my_dictionary:
        if float(my_dictionary[key])>largest_value:
            largest_key=key
            largest_value=float(my_dictionary[key])
    return largest_key

In [15]:
my_dict={'min': -100, 'max': 30, 'mid':0, 'not this':9, 'also not this':10}

max_key(my_dict)

'max'
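
For comparison, the same lookup can be written with the built-in `max` and a key function. This is a sketch under the assumption that all values are numbers that can be compared directly. The name `max_key_builtin` is hypothetical, and the empty-dictionary case is handled explicitly so that it matches the loop version's empty-string result.

```python
def max_key_builtin(my_dictionary):
    """Return the key with the largest value, or '' for an empty dictionary."""
    if not my_dictionary:
        return ''
    # max iterates over the keys and compares them by their values
    return max(my_dictionary, key=my_dictionary.get)


example = {'min': -100, 'max': 30, 'mid': 0, 'not this': 9, 'also not this': 10}
result = max_key_builtin(example)
print(result)
```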

Create a function named remove_middle which has three parameters named lst, start, and end. The function should return a list where all elements in lst with an index between start and end (inclusive) have been removed. For example, `remove_middle([4, 8, 15, 16, 23, 42], 1, 3)` should return `[4, 23, 42]` because elements at indices 1, 2, and 3 have been removed.

In [61]:
# Return a list where all elements in lst with an index between start and end (inclusive) have been removed

def remove_middle(lst, start, end):
    # Slice by position; looking items up with lst.index() would
    # misplace repeated values, since it returns the first match only
    return lst[:start] + lst[end + 1:]

In [62]:
remove_middle([4, 8 , 15, 16, 23, 42,30,40,90], 0, 2)

[16, 23, 42, 30, 40, 90]

In [42]:
temp = {'Fruits': ['apple','banana','grapefruit'], 'Mass (g)': [50,30,45]}
pf.pd.DataFrame.from_dict(temp)

| | Fruits | Mass (g) |
| --- | --- | --- |
| 0 | apple | 50 |
| 1 | banana | 30 |
| 2 | grapefruit | 45 |


In [69]:
mygenerator = (x*x for x in range(3))
for i in mygenerator:
    print(i)

0
1
4


In [70]:
for i in mygenerator:
    print(i)
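
The second loop prints nothing because a generator can only be consumed once; after the first loop it is exhausted. A small sketch of the difference between a generator and a list follows. The names used are illustrative.

```python
def squares(n):
    """Return a generator over the squares of 0..n-1."""
    return (x * x for x in range(n))


gen = squares(3)
first_pass = list(gen)     # consumes the generator
second_pass = list(gen)    # nothing left to produce

reusable = [x * x for x in range(3)]   # a list can be iterated repeatedly
again = list(reusable)

print(first_pass, second_pass, again)
```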
    

In [3]:
counter = 0

def update(): 
    new_counter = counter + 1 
    return new_counter

counter

0
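
`counter` is still 0 here for two reasons: `update()` is never called, and even when called it only reads the outer variable and returns a new value without rebinding it. One way to make the change explicit is to pass the value in and store the result, as in the sketch below. This signature is an assumed variant for illustration, not the original no-argument function.

```python
def update(counter):
    """Return counter + 1 without modifying the caller's variable."""
    return counter + 1


counter = 0
unchanged = (update(counter), counter)   # result is not stored, counter stays as it was
counter = update(counter)                # storing the result is what changes it

print(unchanged, counter)
```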

In [23]:
def always_false(num):

    # Your solution here
    if float(num) <= float('inf') or float(num) >= -float('inf'):
        return False
    
always_false(float('inf'))

False

In [24]:
#Write a function named sum_values that takes a dictionary named my_dictionary as a parameter. The function should return the sum of the values of the dictionary

def sum_values(my_dictionary):
    curr_sum=0
    for key in my_dictionary:
        curr_sum=float(my_dictionary[key])+curr_sum
        
    return curr_sum
  
    
sum_values({'one':1, 'two':2,'three':3, 'four':4})

10.0

Create a function named more_frequent_item that has three parameters named lst, item1, and item2. Return either item1 or item2 depending on which item appears more often in lst. If the two items appear the same number of times, return item1.

In [21]:
def more_frequent_item(lst,item1,item2):
    count_1=lst.count(item1)
    count_2=lst.count(item2)
    if count_1 >= count_2:
        return item1
    else:
        return item2
    
    
more_frequent_item([2,2,1,],1,2)

2