# Title

<a id='intro'></a>
## Introduction

> **Research has shown that increased food consumption can be a coping mechanism for stress or other mental health issues, and that it has risen alongside the growing presence of technology in daily life.**

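The purpose of this study is to explore, using Gapminder data, how food consumption per person has changed over time and whether it is related to a country's region, its number of suicides, and its number of internet users.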

### Contents:
<ul>
    <li><a href="#intro">Introduction</a></li>
    <li><a href="#source">Data Sources</a></li>
    <li><a href="#wrangle">Data Wrangling & Cleaning</a></li>
    <li><a href="#explore">Exploratory Data Analysis</a></li>
        <ul>
        <li><a href="#Q1">Q1. How has the consumption of food per person (per day) changed over time throughout the world?</a></li>
        <li><a href="#Q2">Q2. Is consumption of food per person related to region of the world?</a></li>
        <li><a href="#Q3">Q3. Is there a correlation between the food consumption per person in a country and that country's rate of suicide?</a></li>
        <li><a href="#Q4">Q4. Is there a correlation between the food consumption per person in a country and the number of internet users in that country?</a></li>
        </ul>
    <li><a href="#summary">Summary & Conclusions</a></li>
</ul>

<a id='source'></a>
## Data Sources

See the README.md documentation for the specific sources of the data used in this analysis.

All the data was retrieved from [Gapminder](http://www.gapminder.org). Gapminder is an independent Swedish foundation with no political, religious or economic affiliations.


<a id='wrangle'></a>
## Data Wrangling & Cleaning

Before the data is analyzed, the quality of the datasets downloaded from the Gapminder database is evaluated and the datasets are cleaned accordingly. First, the libraries used at different steps of this process, and later for the visualizations, are imported.

The country_converter package is documented at https://github.com/konstantinstadler/country_converter

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

import pycountry # connects country names with country codes
import country_converter as coco # connects country code with its continent 

%matplotlib inline

#### Wrangling *population* and *foodsupply* raw datasets
As hinted by the initial questions stated above, knowing the population numbers will help normalize the different datasets so they can be compared appropriately.

Since this quantitative correlation study explores relationships (if any) between the different datasets, it is important that they cover consistent year ranges. First, the world population dataset is explored to understand its general properties.
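As a rough illustration of the normalization mentioned above, the sketch below converts total counts into a rate per 100,000 people. The two small dataframes and the helper name `per_100k` are made up for this example; they only mimic the wide Gapminder layout (one row per country, one column per year).

```python
import pandas as pd

# Toy data in the wide Gapminder layout (values are illustrative only)
counts = pd.DataFrame({
    'country': ['A', 'B'],
    '1990': [50, 200],
    '1991': [60, 210],
})
population = pd.DataFrame({
    'country': ['A', 'B'],
    '1990': [1_000_000, 10_000_000],
    '1991': [1_200_000, 10_500_000],
})

def per_100k(df_counts, df_pop):
    """Divide counts by population, aligned on country and year, scaled to 100,000 people."""
    c = df_counts.set_index('country')
    p = df_pop.set_index('country')
    # DataFrame.div aligns on both the index (country) and the columns (year)
    return c.div(p).mul(100_000)

rates = per_100k(counts, population)
print(rates)
```

Because the division aligns on labels, a country or year present in only one of the two dataframes ends up as NaN rather than being silently paired with the wrong value.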

In [42]:
# read population csv
df_population = pd.read_csv('population_total.csv')

# view structure of dataset
df_population.head(3)

|   | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 3280000 | 3280000 | 3280000 | 3280000 | 3280000 | 3280000 | 3280000 | 3280000 | 3280000 | ... | 76600000 | 76400000 | 76300000 | 76100000 | 76000000 | 75800000 | 75600000 | 75400000 | 75200000 | 74900000 |
| 1 | Albania | 400000 | 402000 | 404000 | 405000 | 407000 | 409000 | 411000 | 413000 | 414000 | ... | 1330000 | 1300000 | 1270000 | 1250000 | 1220000 | 1190000 | 1170000 | 1140000 | 1110000 | 1090000 |
| 2 | Algeria | 2500000 | 2510000 | 2520000 | 2530000 | 2540000 | 2550000 | 2560000 | 2560000 | 2570000 | ... | 70400000 | 70500000 | 70500000 | 70600000 | 70700000 | 70700000 | 70700000 | 70700000 | 70700000 | 70700000 |


In [39]:
# understand the shape of population dataset
df_population.shape

(195, 302)

The food consumption dataset is central to answering the questions drafted for this project, so it is explored next, after the global population dataset.

In [22]:
# read food consumption csv
df_foodsupply = pd.read_csv('food_supply_kilocalories_per_person_and_day.csv')

# view structure of dataset
df_foodsupply.head(3)

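|   | country | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 3000.0 | 2920.0 | 2700.0 | 2950.0 | 2960.0 | 2740.0 | 2970.0 | 2920.0 | 2940.0 | ... | 1970.0 | 1950.0 | 1970 | 2050 | 2040 | 2080 | 2100 | 2110 | 2100 | 2090 |
| 1 | Albania | 2220.0 | 2240.0 | 2160.0 | 2270.0 | 2250.0 | 2250.0 | 2260.0 | 2340.0 | 2400.0 | ... | 2790.0 | 2870.0 | 2860 | 2860 | 2950 | 2990 | 3080 | 3130 | 3180 | 3190 |
| 2 | Algeria | 1620.0 | 1570.0 | 1530.0 | 1540.0 | 1590.0 | 1570.0 | 1650.0 | 1710.0 | 1710.0 | ... | 2990.0 | 2960.0 | 3050 | 3040 | 3050 | 3110 | 3140 | 3220 | 3270 | 3300 |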


In [23]:
# understand the shape of the foodsupply dataset
df_foodsupply.shape

(168, 54)

The population dataset covers a much broader year range than the foodsupply dataset (301 years of data versus 53). Because the two datasets will be compared, they need to cover the same years. Therefore the years outside the range of the foodsupply dataset will be removed from the population dataset.

While exploring the population dataset, it was noted that it contains values for future years, which cannot be observed data. This raises a flag about the integrity of the data. Because the project is limited to Gapminder data, these columns will be deleted so that the last column is 2013, as in the foodsupply dataset.

In [43]:
# remove years in population that fall outside the foodsupply range (1961-2013)
df_population.drop(df_population.loc[:,'1800':'1960'].columns, axis=1, inplace=True)
df_population.drop(df_population.loc[:,'2014':'2100'].columns, axis=1, inplace=True)

# confirm columns in both datasets match
(df_population.columns == df_foodsupply.columns).all()

True
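The drops above hard-code the year boundaries. As an alternative sketch, the shared year columns can be derived from the two dataframes themselves. The helper name `shared_year_columns` and the toy dataframes are assumptions made for this illustration and are not part of the original notebook.

```python
import pandas as pd

def shared_year_columns(df_a, df_b, id_col='country'):
    """Return both dataframes restricted to id_col plus the year columns they share."""
    columns_b = set(df_b.columns)
    # keep the column order of df_a so both results line up
    shared = [c for c in df_a.columns if c != id_col and c in columns_b]
    keep = [id_col] + shared
    return df_a[keep], df_b[keep]

# Toy wide dataframes with different year ranges (illustrative values)
wide = pd.DataFrame({'country': ['A'], '1960': [1], '1961': [2], '1962': [3], '1963': [4]})
narrow = pd.DataFrame({'country': ['A'], '1961': [10.0], '1962': [11.0]})

wide_trimmed, narrow_trimmed = shared_year_columns(wide, narrow)
print(list(wide_trimmed.columns))
print((wide_trimmed.columns == narrow_trimmed.columns).all())
```

This returns new dataframes instead of modifying the inputs in place, so the original wide tables remain available if a different year range is needed later.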

#### Wrangling *suicide* and *internet* raw datasets
The number of suicides in each country, as a measure of mental health, and internet use per country, as a measure of the rise of technology in daily life, will be analyzed later in conjunction with the population and food supply datasets. (The values in `it_net_user_zs.csv` appear to be percentages of the population rather than counts.)

The steps previously applied to population and foodsupply will also be applied to the suicide and internet data.

In [172]:
# read suicide csv
df_suicide = pd.read_csv('suicide_total_deaths.csv')

# view structure of dataset
df_suicide.head(3)

|   | country | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | ... | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 703.0 | 754.0 | 820.0 | 894.0 | 977.0 | 1050.0 | 1100.0 | 1130.0 | 1170.0 | ... | 1680.0 | 1710.0 | 1750.0 | 1760.0 | 1810.0 | 1870.0 | 1990.0 | 2080.0 | 2170.0 | 2250.0 |
| 1 | Albania | 127.0 | 130.0 | 131.0 | 135.0 | 136.0 | 142.0 | 150.0 | 162.0 | 170.0 | ... | 204.0 | 205.0 | 201.0 | 195.0 | 191.0 | 188.0 | 186.0 | 184.0 | 183.0 | 181.0 |
| 2 | Algeria | 806.0 | 822.0 | 843.0 | 866.0 | 888.0 | 912.0 | 941.0 | 983.0 | 1020.0 | ... | 1240.0 | 1250.0 | 1270.0 | 1290.0 | 1310.0 | 1340.0 | 1370.0 | 1410.0 | 1420.0 | 1440.0 |


The year range for population and foodsupply is 1961-2013. Ideally, the other two measures need comparable year ranges. Since foodsupply will be compared with each of them, columns will be dropped so that foodsupply, internet users, and suicide numbers cover only the years 1990-2013.

Since the <a href="#Q1">first question</a> of the analysis only looks at population and food supply, a separate merged table will be created for it that keeps the expanded set of years.

In [171]:
# read internet csv
df_internet = pd.read_csv('it_net_user_zs.csv')

# view structure of dataset
df_internet.head(3)

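|   | country | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.0 | 5.0 | 5.45 | 5.9 | 7.0 | 8.26 | 11.2 | 13.5 | 13.5 | 13.5 |
| 1 | Albania | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 45.0 | 49.0 | 54.7 | 57.2 | 60.1 | 63.3 | 66.4 | 71.8 | 71.8 | 69.6 |
| 2 | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 12.5 | 14.9 | 18.2 | 22.5 | 29.5 | 38.2 | 42.9 | 47.7 | 59.6 | 59.6 |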


At first glance, the early years of the internet users table appear to contain many null values. The null values within the table will be inspected more closely.

In [168]:
# exploring null values for years in internet users
df_internet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 61 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  194 non-null    object 
 1   1960     7 non-null      float64
 2   1961     0 non-null      float64
 3   1962     0 non-null      float64
 4   1963     0 non-null      float64
 5   1964     0 non-null      float64
 6   1965     7 non-null      float64
 7   1966     0 non-null      float64
 8   1967     0 non-null      float64
 9   1968     0 non-null      float64
 10  1969     0 non-null      float64
 11  1970     7 non-null      float64
 12  1971     0 non-null      float64
 13  1972     0 non-null      float64
 14  1973     0 non-null      float64
 15  1974     0 non-null      float64
 16  1975     7 non-null      float64
 17  1976     7 non-null      float64
 18  1977     7 non-null      float64
 19  1978     7 non-null      float64
 20  1979     7 non-null      float64
 21  1980     7 non-n

There are many null values in the years 1960-1989. Keeping the range 1990-2013, which matches the first year of the suicide data, will work well for analysis. Both datasets will have the years outside the range 1990-2013 dropped.
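Reading the full `.info()` listing works, but the same question can be answered more compactly by computing the share of countries with data in each year column. The sketch below does this on a small made-up dataframe; the 50% threshold is an arbitrary assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide dataframe with sparse early years (illustrative values)
df = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    '1988': [np.nan, np.nan, np.nan],
    '1989': [np.nan, 0.1, np.nan],
    '1990': [0.0, 0.2, 0.0],
    '1991': [0.5, 0.4, np.nan],
})

# Share of countries with a non-null value in each year column
coverage = df.drop(columns='country').notna().mean()

# Year columns where at least half of the countries have data
usable_years = coverage[coverage >= 0.5].index.tolist()

print(coverage)
print(usable_years)
```

Applied to the internet dataset, a check like this would make the cut-off year an explicit, reproducible choice rather than a visual judgment.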

In [173]:
# remove unneeded year columns in internet dataset
df_internet.drop(df_internet.loc[:,'1960':'1989'].columns, axis=1, inplace=True)
df_internet.drop(df_internet.loc[:,'2014':'2019'].columns, axis=1, inplace=True)

# remove unneeded year columns in suicide dataset
df_suicide.drop(df_suicide.loc[:,'2014':'2016'].columns, axis=1, inplace=True)

# confirm columns in both datasets match
(df_internet.columns == df_suicide.columns).all()

True

#### Dropping Unique Countries
All four datasets need to contain the same countries, so countries that do not appear in every dataset will be dropped.
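The shapes shown earlier (195 rows for population, 168 for foodsupply, 194 for internet) indicate that the datasets do not cover identical sets of countries. One possible way to keep only the shared countries is sketched below. The helper name `keep_common_countries` and the toy dataframes are assumptions for this example, and it relies on country names being spelled identically across files.

```python
import pandas as pd

def keep_common_countries(dfs, id_col='country'):
    """Return copies of the dataframes restricted to countries present in all of them."""
    common = set(dfs[0][id_col])
    for df in dfs[1:]:
        common &= set(df[id_col])
    # filter each dataframe and renumber its rows
    return [df[df[id_col].isin(common)].reset_index(drop=True) for df in dfs]

# Toy dataframes with partially overlapping countries (illustrative values)
df_a = pd.DataFrame({'country': ['Albania', 'Algeria', 'Andorra'], '1990': [1, 2, 3]})
df_b = pd.DataFrame({'country': ['Algeria', 'Albania'], '1990': [4.0, 5.0]})

df_a_common, df_b_common = keep_common_countries([df_a, df_b])
print(df_a_common['country'].tolist())
print(df_b_common['country'].tolist())
```

Each dataframe keeps its own row order, so sorting by country afterwards is still needed if the rows are to be compared position by position.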

#### Cleaning datasets

In [None]:
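def missing_values(df, column):
    """
    Identifies the countries that have no data in the given column (missing data per country)
    
    Args:
        (dataframe) df: dataframe with a 'country' column
        (str) column: name of the column to check for missing values
    Return:
        (list) missing_countries: list of missing countries
    """
    all_countries = list(df.country.unique())
    # countries that still have a row once rows missing `column` are dropped
    complete_countries = set(df.dropna(subset=[column]).country.unique())
    missing_countries = [x for x in all_countries if x not in complete_countries]
    missing_countries.sort()
    
    return missing_countries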

In [None]:
# make sure every column is in the correct type
for df in [df_population, df_foodsupply, df_suicide, df_internet]:
    df.info()

In [None]:
def data_details(df):
    """
    Provides some summary data for each column in df to help with cleaning
    
    Args:
        (dataframe) df: dataframe to summarize
    Return:
        None; the summary is printed
    """
    print("="*6, "  DUPLICATES  ", "="*6)
    print("Number of duplicated rows:", sum(df.duplicated()), "\n")
    
    print("="*6, "  MAX & MIN VALUES  ", "="*6)
    # min and max of every numeric column
    print(df.select_dtypes(include='number').agg(['min', 'max']), "\n")


Although the year columns make comparison easy, creating a 'year' column with multiple rows for each country lends itself better to visualizations and aggregations later in the project. Therefore the [pandas melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function will be used so that each year becomes an individual row, sorted by country and year.
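To make the reshaping concrete, here is a minimal sketch of `melt` on a toy dataframe in the same wide layout. The values and the column name `value` are placeholders chosen for this example.

```python
import pandas as pd

# Toy wide dataframe: one row per country, one column per year
wide = pd.DataFrame({
    'country': ['Albania', 'Algeria'],
    '1990': [10.0, 20.0],
    '1991': [11.0, 21.0],
})

# Each year column becomes a row; the former column name goes into 'year'
long = wide.melt(id_vars='country', var_name='year', value_name='value')

# Year labels start out as strings, so convert them before sorting
long['year'] = long['year'].astype(int)
long = long.sort_values(['country', 'year']).reset_index(drop=True)

print(long)
```

The result has one row per country and year (four rows here), which is the shape that `groupby` and seaborn plotting functions generally expect.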

A function is written below that can be applied to other dataframes in the future. The function takes two or more dataframes, melts each one individually, and then merges them. After merging, it sorts by year and country. This will facilitate the process for future dataframes as well, since all data gathered from Gapminder shares a similar format.

In [None]:
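def df_melt(dfs, col):
    """
    Uses pandas melt to arrange Gapminder data in a better format; merges the
    melted dataframes and sorts by year and country.
    
    Args:
        (list) dfs: list of wide dataframes (one row per country, one column per year)
        (list) col: names the value columns will use, one per dataframe
    
    Returns:
        (dataframe) df_merged: dataframe containing the merged dfs formatted for analysis
    """
    # melt each dataframe so that every country-year pair becomes one row
    melted = [df.melt(id_vars='country', var_name='year', value_name=name)
              for df, name in zip(dfs, col)]
    
    # merge the melted dataframes on country and year
    df_merged = melted[0]
    for df_next in melted[1:]:
        df_merged = df_merged.merge(df_next, on=['country', 'year'])
    
    df_merged['year'] = df_merged['year'].astype(int)
    return df_merged.sort_values(['year', 'country']).reset_index(drop=True)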
        

In [None]:
# confirm successful completion of 


In [None]:
df_suicide = 

In [None]:
dfs = [df_suicide, df_internet]
col = ['suicide', 'internet_use']

df_merge = df_melt(dfs, col)

In [None]:
# confirm changes and view df
df_population_melt.head(2)

In [None]:
# looking for any missing values in df
df_population_melt.info()

The same procedures outlined above are repeated for the datasets on the number of suicides per country each year and the number of internet users.

<a id='explore'></a>
## Exploratory Data Analysis

<a id='Q1'></a>
### Q1. How has the consumption of food per person (per day) changed over time throughout the world?
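One possible starting point for this question is a population-weighted world average of food supply per year, so that large countries count proportionally more than small ones. The sketch below assumes a long-format table with the hypothetical column names `food_supply` and `population`; the numbers are made up and are not Gapminder figures.

```python
import pandas as pd

# Toy long-format data (illustrative values only)
df = pd.DataFrame({
    'country': ['A', 'B', 'A', 'B'],
    'year': [1961, 1961, 1962, 1962],
    'food_supply': [2000.0, 3000.0, 2100.0, 3100.0],  # kcal per person per day
    'population': [1_000_000, 3_000_000, 1_000_000, 3_000_000],
})

# Total kilocalories per day in each country, then summed per year
df['total_kcal'] = df['food_supply'] * df['population']
yearly = df.groupby('year')[['total_kcal', 'population']].sum()

# Population-weighted average food supply per person per day
world_avg = yearly['total_kcal'] / yearly['population']
print(world_avg)
```

An unweighted mean across countries (`df.groupby('year')['food_supply'].mean()`) answers a slightly different question, namely how the typical country has changed rather than the typical person.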

<a id='Q2'></a>
### Q2. How has the world's population changed over the same time period above?

<a id='Q3'></a>
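### Q3. Is there a correlation between the food consumption per person in a country and that country's rate of suicide?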
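The contents list frames this question as a correlation between food consumption per person and a country's suicide rate. Since the suicide dataset holds total deaths, one possible approach is to convert the totals to a rate per 100,000 people and then compute a Pearson correlation. The sketch below uses made-up numbers and hypothetical column names for a single year of data.

```python
import pandas as pd

# Toy single-year data, one row per country (illustrative values only)
df = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'food_supply': [2000.0, 2500.0, 3000.0, 3500.0],  # kcal per person per day
    'suicides': [100, 300, 240, 700],                 # total deaths
    'population': [1_000_000, 2_000_000, 1_000_000, 2_000_000],
})

# Convert total deaths to a rate per 100,000 people
df['suicide_rate'] = df['suicides'] / df['population'] * 100_000

# Series.corr computes the Pearson correlation coefficient by default
r = df['food_supply'].corr(df['suicide_rate'])
print(round(r, 3))
```

With the real data, the same calculation could be repeated per year (for example with `groupby('year')`) to see whether any relationship is stable over time.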

<a id='Q4'></a>
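### Q4. Is there a correlation between the food consumption per person in a country and the number of internet users in that country?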

<a id='summary'></a>
## Summary & Conclusions