# Project: Finding Correlations among Unrelated Variables

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploring the Data</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## I) Introduction

**Broad question:** How do total forest area and frequency of natural disasters shape a country's obesity rates and murder rates?

> I picked factors that seem unrelated (geography and frequency of natural disasters on one side, rates of obesity and murder on the other) to formulate new, interesting questions and uncover unexpected patterns. I also wanted to approach this project through an experimental, free-for-all lens, just to see if I could draw any fun or comical conclusions from the given unrelated datasets. To narrow down my focus, I picked a subcategory of [TK] for geography, [TK] for education and [TK] to encompass murder rates. I obtained all of my data from GapMinder.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
# II) Data Wrangling

## A) Gathering Data

#### Natural Disasters
> I took [TK list dataset names here], each recording the number of deaths from its respective natural disaster. To combine those datasets, I generated a new CSV file named ‘natural_disaster_deaths.csv’, which holds the sum of deaths across all natural disasters per year for each country.

In [116]:
#Load all DataFrames for natural disasters
filepath_disasters = './data/natural_disasters/'

#assign country names as primary indexes
df_drought = pd.read_csv(filepath_disasters + 'indicator_drought_killed.csv', index_col = 'Drought killed')
df_earthquake = pd.read_csv(filepath_disasters + 'indicator_earthquake_killed.csv', index_col = 'Earthquake killed')
df_epidemic = pd.read_csv(filepath_disasters + 'indicator_epidemic_killed.csv', index_col = 'Epidemic killed')
df_flood = pd.read_csv(filepath_disasters + 'indicator_flood_killed.csv', index_col = 'Flood killed')
df_storm = pd.read_csv(filepath_disasters + 'indicator_storm_killed.csv', index_col = 'Storm killed')
df_tsunami = pd.read_csv(filepath_disasters + 'indicator_tsunami_killed.csv', index_col = 'Tsunami killed')

In [117]:
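#put all DataFrames in a 1-D NumPy object array so later cells can loop over them
#(filled one by one: recent NumPy versions may reject np.array() on DataFrames of different shapes)
dfs_nd = np.empty(6, dtype=object)
for i, frame in enumerate([df_drought, df_earthquake, df_epidemic, df_flood, df_storm, df_tsunami]):
    dfs_nd[i] = frame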

#### Total Forest Area
> I am focusing on total natural forest land per country, and so I will exclude factors that indicate if said forest land is reserved for agricultural production.

> According to GapMinder, forest area is described as ‘land under natural or planted stands of trees of at least 5 meters in situ, whether productive or not, and excludes tree stands in agricultural production systems (for example, in fruit plantations and agroforestry systems) and trees in urban parks and gardens.’ The dataset ‘forest_area_sq_km.csv’ keeps track of the total forest area from 1990 to 2015.

In [118]:
df_forest = pd.read_csv('./data/forest_area_sq_km.csv', index_col='country')

#### Obesity Rates
> GapMinder provides the age-standardized mean BMI, split into separate values for men and women.



In [119]:
df_bmi_male = pd.read_csv('./data/bmi_rates/bmi_male.csv', index_col='Country')
df_bmi_female = pd.read_csv('./data/bmi_rates/bmi_female.csv', index_col='Country')

#### Murder Rates
> Encompasses number of murders per 100,000 people, accounting for all ages.

In [120]:
df_murder = pd.read_csv('./data/homicide_rates.csv', index_col='Murder per 100,000, age adjusted')

## B) Data Cleaning

### Natural Disasters
> To account for all natural disasters that occurred in each country, my goal is to generate a new CSV file where each cell contains the sum of deaths across all the natural disaster DataFrames for that country and year.

Let's examine the shape of each natural disaster DataFrame. There are 195 total countries in the world, so I am expecting there to be at most 195 rows.

In [121]:
for df in dfs_nd:
    #display the index name and (rows, columns) of each DataFrame
    print(df.index.name, df.shape)

Drought killed (128, 39)
Earthquake killed (97, 39)
Epidemic killed (143, 38)
Flood killed (182, 39)
Storm killed (181, 39)
Tsunami killed (18, 15)


> It can be seen that tsunamis are the least reported of all the natural disasters (the DataFrame covers only 18 countries), while the most documented natural disaster is floods, at 182 countries. This adds more ambiguity as to how we should generate our final CSV file accounting for all natural disasters in all countries.

> Now there are two possibilities for approaching this:

Approach 1) The missing countries means that no natural disasters occurred in them, and so it was not necessary to include them in their respective DataFrames.

> Therefore **it is safe to set the values of the missing countries to 0 and generate the final CSV as a sum of all the DataFrames.**

Approach 2) We do not know if any natural disasters occurred in the missing countries.

> Therefore **we should focus only on the countries that all DataFrames have in common and that have no missing values.**

I ended up picking **Approach 2** because, as the df_drought DataFrame shows, zeros are recorded for countries and years in which no drought deaths occurred. The best conclusion I can come to for the missing countries is that no data has been recorded for them, and therefore I cannot assume whether or not droughts ever occurred there. This logic extends to the rest of the natural disaster DataFrames.
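
To make Approach 2 concrete, here is a minimal sketch of keeping only the countries shared by every DataFrame. The `toy_` frames and their values are made up for illustration and are not taken from the GapMinder files.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for two of the disaster tables (values are made up).
toy_drought = pd.DataFrame(
    {"1970": [0, 5, 0], "1971": [2, 0, 0]},
    index=pd.Index(["Afghanistan", "Albania", "Algeria"], name="country"),
)
toy_flood = pd.DataFrame(
    {"1970": [1, 0], "1971": [0, 7]},
    index=pd.Index(["Albania", "Algeria"], name="country"),
)
toy_frames = [toy_drought, toy_flood]

# Intersect the country indexes of every frame.
common_countries = reduce(
    lambda left, right: left.intersection(right),
    (frame.index for frame in toy_frames),
)

# Keep only the shared countries in each frame.
toy_frames_common = [frame.loc[common_countries] for frame in toy_frames]

print(sorted(common_countries))
```

With the real data, the same pattern would presumably be applied to the list of disaster DataFrames once their indexes share a name.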

**But before we go limiting countries, let's rename the index of each DataFrame to 'country' to allow for more consistent indexing.**

In [122]:
#rename indexed country header as 'country'
for df in dfs_nd:
    df.index.name = 'country'

    #check that the name change did occur
    print(df.head())


                     1970  1971  1972  1973  1974  1975  1976  1977  1978  \
country                                                                     
Afghanistan             0     0     0     0     0     0     0     0     0   
Albania                 0     0     0     0     0     0     0     0     0   
Algeria                 0     0     0     0     0     0     0     0     0   
Angola                  0     0     0     0     0     0     0     0     0   
Antigua and Barbuda     0     0     0     0     0     0     0     0     0   

                     1979  ...  1999  2000  2001  2002  2003  2004  2005  \
country                    ...                                             
Afghanistan             0  ...     0    37     0     0     0     0     0   
Albania                 0  ...     0     0     0     0     0     0     0   
Algeria                 0  ...     0     0     0     0     0     0    12   
Angola                  0  ...     0     0    58     0     0     0     0   
Anti

> The index rename worked. Next, let's check for null values: as the counts below show, only the last DataFrame (tsunamis) contains any.

#### *DROP NULLS*

In [123]:
#Display number of NaN values in each dataset
for df in dfs_nd:
    print(df.isna().sum().sum())

0
0
0
0
0
248


In [124]:
#Drop nulls from tsunami dataset
df_tsunami.dropna(inplace=True)

In [125]:
#Confirm changes
df_tsunami.isna().any().any()

False

**But hold up!**
> Before we move on, let's check what is left of the tsunami data set:

In [126]:
#Check how many values exist in the tsunami data set
print(df_tsunami.count().sum())

df_tsunami

0


Empty DataFrame
Columns: [1979, 1980, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007]
Index: []


> The tsunami dataset is now completely empty, which means we'll have to discard it. My conclusion is that ***every single row in the data set contained at least one null value, so dropping all rows with nulls left nothing behind.***

> I know it's painful to have to discard an entire DataFrame, but at least that is better than incorporating largely unreliable data into our final analysis.
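
As a small illustration of why the table emptied out, the sketch below uses a made-up `toy_tsunami` frame (not the real file) in which every row has at least one gap. It also shows `how="all"` and the share of null cells, which are only alternatives for inspecting sparsity, not what I used above.

```python
import numpy as np
import pandas as pd

# Toy table (made-up values) where every row has at least one gap.
toy_tsunami = pd.DataFrame(
    {"1979": [np.nan, 10.0, np.nan], "1980": [3.0, np.nan, np.nan]},
    index=pd.Index(["Chile", "Indonesia", "Japan"], name="country"),
)

# Default dropna() removes any row containing a null, leaving nothing here.
strict = toy_tsunami.dropna()

# how="all" only removes rows that are entirely null.
lenient = toy_tsunami.dropna(how="all")

# Share of null cells, a quick way to judge how sparse a table is.
null_share = toy_tsunami.isna().to_numpy().mean()

print(len(strict), len(lenient), round(null_share, 2))
```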

Before moving on, let's create a new NumPy array of the DataFrames without the tsunami dataset.



In [154]:
#Drop tsunami dataset
dfs_nd_dropped = np.delete(dfs_nd, -1)

#Check changes
len(dfs_nd_dropped)

5

There are only 5 DataFrames in the NumPy array instead of 6, so we can move on.

#### *DEDUPE DATA*

In [156]:
#Count the duplicated rows in each DataFrame
for df in dfs_nd_dropped:
    print(df.duplicated().sum())

101
21
21
30
29


> But wait, we don't want to drop duplicate data just yet. As we know, many of the DataFrames are populated with 0's. Before we drop duplicate rows, let's make sure that they are all actually **zeroes** and not just rows populated with the same repeating values.

In [158]:
#Check whether the duplicated rows in each DataFrame are all zeroes
for df in dfs_nd_dropped:
    #print True if any duplicated row contains a value that's not 0
    print(df[df.duplicated(keep=False)].any().any())

    #uncomment to inspect the duplicated rows themselves
    # print(df[df.duplicated(keep=False)])

False
True
True
False
True


> Some of the DataFrames have values that are greater than 0. But I've come to realize I don't necessarily care about the duplicate values themselves. The real question is: **are they repeating countries?** 


In [159]:
#Check if any of the DataFrames have duplicate index 'country'
for df in dfs_nd_dropped:
    #print True if the dataset contains a repeating country
    print(df.index.duplicated().any())

False
False
False
False
False


> None of the DataFrames have repeating countries for indexes. This is good news! Our data has turned out to be much more reliable than expected.

Now we can decide whether or not to drop the duplicated rows, and I have **ultimately decided not to.** This is an exceptional case: each row belongs to a different country, so dropping duplicates would remove the natural disaster records of entire countries and harm the reliability of our data.
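
The difference between duplicated rows and duplicated countries can be seen in a tiny made-up example. The `toy` frame below is purely illustrative.

```python
import pandas as pd

# Toy table (made-up values): two different countries share identical rows.
toy = pd.DataFrame(
    {"1970": [0, 0, 4], "1971": [0, 0, 1]},
    index=pd.Index(["Albania", "Angola", "Chile"], name="country"),
)

# Row-wise check: Angola repeats Albania's values, so it is flagged.
duplicate_rows = toy.duplicated().sum()

# Index check: no country name appears twice.
duplicate_countries = toy.index.duplicated().sum()

# drop_duplicates() compares values only, so it would remove Angola entirely.
after_drop = toy.drop_duplicates()

print(duplicate_rows, duplicate_countries, list(after_drop.index))
```

Because `duplicated()` ignores the index, a country with the same yearly values as another one counts as a "duplicate" even though it is a separate observation.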

#### *FIX DATA TYPES*

Counting the number of natural disaster deaths means handling **discrete variables**, so it would make the most sense for all of the values in every DataFrame to be integers.

In [162]:
#Display data types of each DataFrame
for df in dfs_nd_dropped:
    #info() prints its report directly and returns None, so no print() is needed
    df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 128 entries, Afghanistan to Zimbabwe
Data columns (total 39 columns):
1970    128 non-null int64
1971    128 non-null int64
1972    128 non-null int64
1973    128 non-null int64
1974    128 non-null int64
1975    128 non-null int64
1976    128 non-null int64
1977    128 non-null int64
1978    128 non-null int64
1979    128 non-null int64
1980    128 non-null int64
1981    128 non-null int64
1982    128 non-null int64
1983    128 non-null int64
1984    128 non-null int64
1985    128 non-null int64
1986    128 non-null int64
1987    128 non-null int64
1988    128 non-null int64
1989    128 non-null int64
1990    128 non-null int64
1991    128 non-null int64
1992    128 non-null int64
1993    128 non-null int64
1994    128 non-null int64
1995    128 non-null int64
1996    128 non-null int64
1997    128 non-null int64
1998    128 non-null int64
1999    128 non-null int64
2000    128 non-null int64
2001    128 non-null int64
2002    128 non-null 

All of the columns contain only integer values, so we do not have to perform any data type conversion or extraction.

Before generating a final DataFrame containing the sum of each natural disaster occurrence per country and year, **let's first check to see that their year ranges are consistent.**

In [216]:
#Extract year columns from last DataFrame, use as point of comparison
column_names = dfs_nd_dropped[-1].columns


for i in range(len(dfs_nd_dropped)):
    print(dfs_nd_dropped[i].columns)

Index(['1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978',
       '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987',
       '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996',
       '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008'],
      dtype='object')
Index(['1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978',
       '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987',
       '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996',
       '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008'],
      dtype='object')
Index(['1970', '1971', '1972', '1974', '1975', '1976', '1977', '1978', '1979',
       '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988',
       '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997',
       '1998', '1999', 

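The 3rd DataFrame (epidemics) does not have the year '1973', which matches the 38 columns reported for it in the shape check earlier. Besides that, the other four DataFrames each have 39 year columns.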
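
Rather than reading the printed indexes by eye, the comparison could also be done programmatically. This is a minimal sketch on made-up `toy_frames`, with the dictionary keys chosen only for illustration.

```python
import pandas as pd

# Toy frames (made-up values); the second one lacks the 1973 column.
toy_frames = {
    "drought": pd.DataFrame({"1972": [0], "1973": [1], "1974": [0]}, index=["Albania"]),
    "epidemic": pd.DataFrame({"1972": [2], "1974": [0]}, index=["Albania"]),
}

# Use the union of all year columns as the reference.
all_years = sorted(set().union(*(frame.columns for frame in toy_frames.values())))

# Report which reference years each frame is missing.
missing_years = {
    name: [year for year in all_years if year not in frame.columns]
    for name, frame in toy_frames.items()
}

print(missing_years)
```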



#### *OUTER MERGE DATASETS*
> **Main idea:** We are using an ***outer merge*** because our data implies that each country in each DataFrame satisfies the following conditions:

1) contains reliable non-null data

2) represents the true number of deaths from its respective natural disaster per year

> Consider that we are merging df1 and df2.

If df1 contains countries that df2 does not AND df2 contains countries that df1 does not, we want all of those countries to show up in the final result.

If df1 contains years that df2 does not AND df2 contains years that df1 does not, we want all of those years to show up in the final result anyway.
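
One way to get this outer behaviour while summing is `DataFrame.add`, which aligns both frames on the union of their rows and columns. The sketch below uses made-up `toy_` frames. Note that `fill_value=0` is an assumption: it counts an entry missing from one frame as 0, which is the same judgment call discussed under Approach 1 and Approach 2.

```python
from functools import reduce

import pandas as pd

# Toy frames (made-up values) with partly different countries and years.
toy_drought = pd.DataFrame(
    {"1972": [1, 0], "1973": [2, 0]},
    index=pd.Index(["Albania", "Angola"], name="country"),
)
toy_epidemic = pd.DataFrame(
    {"1972": [5, 3]},
    index=pd.Index(["Angola", "Chile"], name="country"),
)

# add() aligns on the union of countries and years (an outer alignment).
# fill_value=0 treats a cell missing from one frame as 0 when the other frame
# has it; a cell missing from both frames stays NaN.
toy_total = reduce(
    lambda left, right: left.add(right, fill_value=0),
    [toy_drought, toy_epidemic],
)

print(toy_total)
```

Here Chile in 1973 stays null because neither toy frame reports it, so the result still distinguishes "no data" from "zero deaths" in that case.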


#### *CLASSIFY FREQUENCY AS 'LOW', 'MEDIUM', OR 'HIGH'*
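
I have not settled on cut-offs yet, so the following is only a sketch of one possible approach: splitting per-country totals into three equally sized groups with `pd.qcut`. The `toy_totals` values, the single-letter country names and the choice of tertiles are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical total per country (made-up values and placeholder names).
toy_totals = pd.Series(
    [0, 3, 12, 40, 150, 900],
    index=pd.Index(["A", "B", "C", "D", "E", "F"], name="country"),
)

# Split the countries into three equally sized groups by value.
toy_levels = pd.qcut(toy_totals, q=3, labels=["low", "medium", "high"])

print(toy_levels)
```

With real data that contains many zeros, the quantile edges can coincide and `pd.qcut` raises an error, so fixed thresholds with `pd.cut` might turn out to be the more practical choice.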

### Obesity Rates
> Though generally it may seem that a higher BMI indicates a 'healthier' weight for males, [this article](https://signup.weightwatchers.co.uk/util/art/index_art.aspx?art_id=31901&tabnum=1&sc=803&subnav=Science+Library%3A+Health+and+Weight) clarifies that BMI values above 25 would still indicate ill health: an equal expectation for both men and women. Based on this fact, I decided it was safe to generate a single new CSV, 'bmi_indicator', covering both men and women, in which each cell holds the mean of the male and female BMI values.
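
A minimal sketch of that combination, using made-up `toy_` frames rather than the real BMI files. It takes an unweighted mean, which assumes men and women make up equal shares of each country's population.

```python
import pandas as pd

# Toy BMI tables (made-up values) indexed the same way as the real files.
toy_bmi_male = pd.DataFrame(
    {"1980": [24.0, 26.0], "1981": [24.2, 26.4]},
    index=pd.Index(["Albania", "Chile"], name="Country"),
)
toy_bmi_female = pd.DataFrame(
    {"1980": [25.0, 27.0], "1981": [25.2, 27.0]},
    index=pd.Index(["Albania", "Chile"], name="Country"),
)

# Element-wise mean of the two tables; pandas aligns on country and year labels.
toy_bmi_mean = (toy_bmi_male + toy_bmi_female) / 2

# Flag cells above the BMI threshold of 25 mentioned above.
toy_above_25 = toy_bmi_mean > 25

print(toy_bmi_mean)
print(toy_above_25)
```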

In [38]:
#Creating 'natural_disasters_killed.csv'

<a id='eda'></a>
# III) Exploring the Data


### How much have global obesity rates changed in countries with high forest area vs. countries with low forest area?

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Does a country’s likelihood of experiencing a natural disaster affect the homicidal tendencies of its citizens?

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
# IV) Conclusions


## Submitting your Project 

> Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])