# GPHY 491-591: Lab 2 
Written by Cascade Tuholske, Jan. 2024  

### <u> Goals </u>
The goal of this lab is to practice writing code with numpy, pandas, and matplotlib using a real-world dataset. The dataset was developed to understand the relationship between child undernutrition, measured by [USAID Demographic and Health Surveys](https://www.usaid.gov/global-health/demographic-and-health-surveys-program), and national-level indicators of economic development. Child undernutrition is often broken into two categories: low height for age (stunting, reflecting chronic food insecurity) and low weight for height (wasting, reflecting acute food insecurity). <br>
    
Orthodox theories of economic development posit that as a country urbanizes, its economy grows and food security improves. Thus, more urbanized countries should be more food secure compared to more rural countries. Further, within a country, urban areas should be more food secure compared to rural areas.<br>
    
We will explore this dataset to understand whether economic development and food security are well correlated, and to see which proxy indicators of economic development also correlate with child food insecurity. <br>
    
### <u> Instructions </u>
1. Please rename your notebook as: `Last_First_Lab2` **NOTE:** Naming conventions matter for files, including notebooks and data. Always be consistent. 
2. Complete the lab by writing code or answering questions in the cells as instructed by the comments.
3. Copy your notebook to `/home/YOURNETID/gphy591/submissions/`



# Let's import our packages
![import](assets/import.png)

In [None]:
# Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os # this package allows us to interface with our operating system 

## Let's load the .csv file with pandas and explore it.

In [None]:
# we write file paths and names as their own variables for easy editing
fn = os.path.join('./data/DhsPrevalenceWCovar.csv')
type(fn) # fn is a plain string; os.path.join builds it with the correct path separator for your operating system
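
As a small aside, `os.path.join` is most useful when it is given several path components, because it inserts the separator that is appropriate for your operating system. The sketch below assumes the same folder and file name used above; `fn_example` is a hypothetical name, and the code only builds the string without checking that the file exists.

```python
import os

# Build the path from separate pieces (folder and file name match the cell above)
data_dir = 'data'
file_name = 'DhsPrevalenceWCovar.csv'
fn_example = os.path.join('.', data_dir, file_name)  # hypothetical variable name

print(fn_example)        # the separator depends on your operating system
print(type(fn_example))  # still a plain Python string
```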

In [None]:
# open the data frame and check it
df = pd.read_csv(fn)
df.head(4) # show first four rows

In [None]:
# write a for loop that prints all the column names sorted by first letter
for column in sorted(df.columns):
    print(column)

In [None]:
# What is the shape of the data frame?

In [None]:
# Print the data type for each column.

In [None]:
# Which rows are missing data?

Our dataset is organized in survey-year-country rows. This means that each row is a single survey for a single country in a single year. All the metrics are national-level, meaning that stunting and wasting are reported as rates (i.e., prevalence). 

In [None]:
# Which country-year had the highest stunting prevalence for all households?
df.loc[df['stunt_all'].idxmax()]
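
To see why the line above works: `idxmax()` returns the index label of the largest value, and `.loc` then pulls the full row stored at that label. Here is a minimal sketch on a made-up table; the countries, years, and values are invented for illustration and are not from the lab dataset.

```python
import pandas as pd

# Toy data (invented values, NOT from the DHS dataset)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'year': [2001, 2005, 2010],
    'stunt_all': [0.31, 0.47, 0.22],
})

top_label = toy['stunt_all'].idxmax()  # index label of the largest value
top_row = toy.loc[top_label]           # the full row stored at that label

print(top_label)
print(top_row)
```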

In [None]:
# Which country-year had the lowest stunting prevalence for all households?

In [None]:
# How many surveys were conducted in Senegal?
geog = 'Senegal'
df_ = df[df['country'] == geog]
len(df_)

In [None]:
# What years were surveys conducted in Senegal?

In [None]:
# Using a for loop, print the year and urban wasting prevalence for Senegal, ranked from highest to lowest.

In [None]:
# Another way to do this is with the built-in pandas DataFrame method 'sort_values'
df_.sort_values('waste_urban', ascending=False) # ascending=False ranks from highest to lowest

In [None]:
# You can list all the attributes and methods available on an object using the built-in function 'dir'
dir(df_)

In [None]:
# You can use 'help' to understand objects, methods, functions, etc.
# Parameters (often called 'arguments') are the objects you pass to a function when you call it
help(df_.sort_values)

In [None]:
# You can also summarize each numeric column easily
df_.describe()

## New Columns
You can easily make new columns in a pandas DataFrame using simple math.

In [None]:
# Make a new column that is the ratio of agLabor to agLand
df_['agLabor-to-agLand'] = df_['agLabor'] / df_['agLand']

The warning is important. **Always** read about warnings before deciding if you are okay with moving forward. In this case, we are okay to continue.

In [None]:
# what kind of data is df_['agLabor-to-agLand']?

In [None]:
# Let's round our data to 2 decimals
df_['agLabor-to-agLand'] = df_['agLabor-to-agLand'].round(2)

In [None]:
# what kind of data is df_['agLabor-to-agLand'] now?
df_['agLabor-to-agLand']

In [None]:
# Let's change it to save memory
df_['agLabor-to-agLand'] = df_['agLabor-to-agLand'].astype('float32')
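
To check that the conversion really saves memory, you can compare `memory_usage` before and after. The sketch below uses a synthetic Series rather than the lab data; `values` and `small` are hypothetical names. Keep in mind that `float32` stores only about seven significant digits, so the savings come at the cost of precision.

```python
import numpy as np
import pandas as pd

# Synthetic data: 1000 evenly spaced numbers, stored as float64 by default
values = pd.Series(np.linspace(0, 1, 1000))
small = values.astype('float32')

# memory_usage(index=False) reports the bytes used by the values only
print(values.dtype, values.memory_usage(index=False))
print(small.dtype, small.memory_usage(index=False))
```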

### Let's do the same, but for agLabor to foodExport

In [None]:
# Make a new column that is the ratio of agLabor to foodExport
df_c = df_.copy()
df_c.loc[:, 'agLabor-to-foodExport'] = df_c['agLabor'] / df_c['foodExport']

Notice that `.copy()` removes the warning. By creating an explicit copy of the Senegal data, you make it clear to pandas that you are modifying a new DataFrame rather than a slice of `df`.
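
A minimal sketch of the same pattern on a toy table shows that the copy is independent of the original: the new column appears in the subset only. The values below are invented for illustration, and `toy` and `subset` are hypothetical names.

```python
import pandas as pd

# Toy data (invented values, NOT from the DHS dataset)
toy = pd.DataFrame({
    'country': ['Senegal', 'Senegal', 'Mali'],
    'agLabor': [50.0, 40.0, 60.0],
    'foodExport': [25.0, 20.0, 30.0],
})

# Filter first, then take an explicit copy before adding a column
subset = toy[toy['country'] == 'Senegal'].copy()
subset['agLabor-to-foodExport'] = subset['agLabor'] / subset['foodExport']

print(subset)
print(toy.columns.tolist())  # the original table is unchanged
```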

## Now explore a bit yourself

Which country and year had the highest [GDP PPP](https://www.cia.gov/the-world-factbook/field/real-gdp-purchasing-power-parity/country-comparison#:~:text=A%20nation%27s%20GDP%20at%20purchasing,prevailing%20in%20the%20United%20States.)? 

In [None]:
# Code here. (note: you should use the df object, not df_)

Across all years, what is the correlation between GDP PPP and stunting? <br> Hint - you can try `np.corrcoef(x, y)`
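
A note on reading the output of `np.corrcoef`: it returns a 2 x 2 correlation matrix, not a single number, and the correlation between `x` and `y` sits in the off-diagonal entry `[0, 1]`. The sketch below uses two made-up arrays, not the lab data. If either input contains missing values, the result will be `nan`, so you may need to drop missing rows first.

```python
import numpy as np

# Made-up arrays: y decreases by exactly 2 every time x increases by 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

r_matrix = np.corrcoef(x, y)  # 2 x 2 correlation matrix
r = r_matrix[0, 1]            # correlation between x and y

print(r_matrix.shape)
print(round(r, 2))
```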

In [None]:
# Code here.

What is the correlation between GDP PPP and stunting for surveys conducted after 2000? <br> Hint - you can subset your DataFrame by year with  `df[df['year'] > 2000]`
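
The subsetting in the hint works in two steps: the comparison produces a column of `True`/`False` values (a boolean mask), and indexing with that mask keeps only the `True` rows. A minimal sketch on a toy table with invented values; `toy`, `mask`, and `recent` are hypothetical names. Note that `> 2000` excludes the year 2000 itself.

```python
import pandas as pd

# Toy data (invented values, NOT from the DHS dataset)
toy = pd.DataFrame({'year': [1995, 2000, 2003, 2010], 'value': [1, 2, 3, 4]})

mask = toy['year'] > 2000  # boolean Series, one True/False per row
recent = toy[mask]         # keeps only the rows where the mask is True

print(mask.tolist())
print(recent)
```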

In [None]:
# Code here.

# Let's do some plots

In [None]:
# Make a scatter plot of the relationship between GDP PPP and All Stunting

In [None]:
# Make a scatter plot of the relationship between GDP PPP and Urban Stunting

In [None]:
# Make a scatter plot of the relationship between GDP PPP and Rural Stunting

In [None]:
# Make a scatter plot of the relationship between GDP PPP and both Urban and Rural Stunting, 
# plotted on the same plot

FYI: UrbanPop is the percentage of the total population classified as 'urban'. 

In [None]:
# Make a three panel figure with panel (A) the relationship between UrbanPop and All wasting 
# panel (B) the relationship between UrbanPop and rural wasting, and panel (C) the relationship 
# between UrbanPop and urban wasting.
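
A multi-panel layout like the one described above can be sketched with `plt.subplots`, which returns one figure and an array of axes to draw on. The example below is only a layout template using synthetic random data; the variable names, labels, and figure size are assumptions, and you will need to substitute the real columns yourself.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for layout purposes only (NOT from the DHS dataset)
rng = np.random.default_rng(0)
x = rng.uniform(10, 90, 30)
ys = [rng.uniform(0, 20, 30) for _ in range(3)]

# One row, three columns; sharey=True keeps the y-axes comparable across panels
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, y, label in zip(axes, ys, ['(A)', '(B)', '(C)']):
    ax.scatter(x, y)
    ax.set_title(label)
    ax.set_xlabel('x (synthetic)')
axes[0].set_ylabel('y (synthetic)')
fig.tight_layout()
```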

### Take some time to read about the package [Seaborn](https://seaborn.pydata.org)

In [None]:
import seaborn as sns

In [None]:
# Try making a histogram of GDP PPP using Seaborn

In [None]:
# Try making a scatter plot of the relationship between GDP PPP and All Stunting 
# with a regression line using Seaborn

In [None]:
# Try making a pairplot of GDP PPP, UrbanPop, AgLabor, All Stunting, Rural Stunting, and Urban Stunting

# Questions

### Question 1: Which variables are most highly correlated: GDP PPP, UrbanPop, AgLabor, All Stunting, Rural Stunting, and Urban Stunting? Provide evidence to support your conclusions.

Answer 1: <br>
xyz ...