##### Lecture 10 
ENVR 890-010: Python for Environmental Research, Fall 2021

November 5, 2021

By Rosa Cuppari. Some material adapted from Andrew Hamilton, Greg Characklis, David Gorelick and H.B. Zeff.

## Summary
In this lecture we will be transitioning from Jupyter Notebooks to Spyder, an environment for coding in Python. Spyder should have been downloaded to your computers along with Anaconda. We will also be reviewing and reinforcing concepts from the rest of the class and introducing GitHub. 

### A brief statistics interlude
During our regression lecture, we mentioned many important tests to run on data to evaluate normality as well as indicator values for understanding whether our regression is statistically significant. To review what the important values mean, here is a list of key terms:

1. **p-value:** the value you compare against your chosen "significance" level. It is the probability of seeing a result at least as extreme as the one you got (in particular, the t-value) if the null hypothesis were true (e.g., if the coefficients were all zero). If it is very low (e.g., <0.05), you can reject the null hypothesis and call the result statistically significant. 
1. **t-test:** based on the Student's t-statistic and t-distribution. It calculates a p-value used to reject (or fail to reject) the null hypothesis. It essentially compares an estimate (a sample mean or, in a regression, a coefficient) with the value expected under the null hypothesis, scaled by the uncertainty expected for a sample of your size. 
1. **F-test:** based on a different distribution than the t-test (the F-distribution), it is used to compare variances instead of means, comparing your regression to an equation with the coefficients set to zero. Note: unlike the t-test, which looks at a single variable at a time, the F-statistic evaluates the *overall* significance of your results (i.e., all the variables together). 
1. **r2 versus adjusted r2:** r2 is the share of the variance in the dependent variable that the model explains (in a regression with a single predictor, it is the correlation coefficient squared). The adjusted r2 takes into account how many variables you are inputting to your regression equation, to account for potential overfitting. 
1. **Log-likelihood:** the higher the better. It is the log of the probability of observing your data given the coefficients in your model, so you can use it to compare two relatively simple models fit to the same data.  
1. **Akaike Information Criterion (AIC):** a way to measure how good your model is compared to others, describing "information loss." The lower the better! It includes a "penalty" for more parameters.
1. **Bayesian Information Criterion (BIC):** similar to AIC, it describes the quality of the model, also penalizing the model for more parameters (a bit more than AIC). The lower the better! 
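1. **Skew:** how asymmetrical your data is. A perfectly symmetric distribution (such as the normal) has a skew of 0. As a rough rule of thumb, between -3 and 3 is ok, otherwise your data may not be normally distributed.
1. **Kurtosis:** how "tall"/"peaked" (and heavy-tailed) your distribution is. A normal distribution has a kurtosis of 3 (equivalently, an "excess" kurtosis of 0), so check which convention your software reports. As a rough rule of thumb, between -3 and 3 is ok, otherwise your data may not be normally distributed.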
1. **Durbin-Watson:** a test for autocorrelation in model residuals at a lag of 1. The values range from 0 to 4, with <2 indicating a positive autocorrelation, >2 indicating a negative autocorrelation, and 2 indicating no autocorrelation.
1. **Jarque-Bera:** a test for normal skewness and kurtosis. The further from zero it is, the more non-normal. 
1. **Omnibus:** another test for normality! A value close to zero is good for the Omnibus test and close to one is good for the Prob (Omnibus).
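
To make a few of these terms concrete, here is a minimal sketch that computes some of them by hand for a regression with a single predictor. It uses only plain Python; the function name `regression_diagnostics` and the toy data are made up for illustration, and in practice a statistics package would report these values for you.

```python
def regression_diagnostics(x, y):
    """Fit y = intercept + slope * x by least squares and return a few diagnostics."""
    n = len(y)
    n_predictors = 1  # this sketch handles a single explanatory variable
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Least-squares slope and intercept
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)        # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)   # total variation

    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

    # Durbin-Watson: squared differences of successive residuals over ss_res
    durbin_watson = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n)
    ) / ss_res

    # Skew and kurtosis of the residuals (their mean is zero because the
    # model includes an intercept); kurtosis here is 3 for a normal distribution
    m2 = ss_res / n
    m3 = sum(e ** 3 for e in residuals) / n
    m4 = sum(e ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2

    # Jarque-Bera combines skew and kurtosis into one normality statistic
    jarque_bera = n / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)

    return {
        "slope": slope,
        "intercept": intercept,
        "r2": r2,
        "adj_r2": adj_r2,
        "durbin_watson": durbin_watson,
        "skew": skew,
        "kurtosis": kurtosis,
        "jarque_bera": jarque_bera,
    }


# Toy data: a straight line plus noise that alternates in sign
x = list(range(10))
y = [1 + 2 * xi + (0.5 if xi % 2 == 0 else -0.5) for xi in x]

for name, value in regression_diagnostics(x, y).items():
    print(f"{name}: {value:.3f}")
```

Because the toy noise flips sign at every step, the residuals are negatively autocorrelated, so the Durbin-Watson value should come out above 2.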

Some of the infinitely many resources available: 
1. Brief Princeton U Library Guides: [Regression Intro](https://dss.princeton.edu/online_help/analysis/regression_intro.htm), [Interpreting Regressions](https://dss.princeton.edu/online_help/analysis/interpreting_regression.htm) (*they need a website update*)
1. [A very long linear regression 101 guide](https://dss.princeton.edu/training/Regression101.pdf) (*sadly based in STATA*)
1. [Statistics How To](https://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/) (*as an intro, easy read*)


## And now, Spyder! 
Spyder is useful because: 
1. It's easier to keep track of the variables you have created and you can easily see how they are categorized
1. I find it's a little bit faster 
1. You can look into your folders more easily
1. You don't use cells (I find they make it easier to lose track of things) 

**Important note: by default, when you run a script, Spyder will automatically set your working directory to the folder the file is saved in!**
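
If you are ever unsure where Python is looking for your files, you can check. The sketch below is one way to do that; the helper name `data_path` and the file name `GDP.csv` are only illustrative.

```python
import os
from pathlib import Path

# Where Python will look for files referenced by a relative path
print("Current working directory:", os.getcwd())


def data_path(filename, folder=None):
    """Build a path to a data file; defaults to the current working directory."""
    base = Path(folder) if folder is not None else Path.cwd()
    return base / filename


# Path that a call like pd.read_csv('GDP.csv') would resolve to
print(data_path("GDP.csv"))
```

If the data lives somewhere else, passing a full path to `pd.read_csv` is usually less error-prone than changing the working directory with `os.chdir`.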

## Going backwards: applications in Spyder 
We have learned a ton of stuff this semester, but we have focused on Jupyter Notebooks. Implementation in Spyder, as you have seen, is nearly identical, so we are going to take advantage of these similarities to review some practical skills in Spyder. 

### In-class exercise: reading in multiple files in Spyder
First, let's read in our data. I have collected data from the [World Bank Open Data Portal](https://data.worldbank.org/). The data is really nice because it's global, it is downloaded with consistent columns (i.e., one column per year from 1960 onwards), and you can find some long time series. 

In [5]:
## Let's crowdsource what we should import 
## hint: there are two/three you should essentially always import
import pandas as pd

In [6]:
## Create a list with the names of each data file in the folder 
names = ['AgLand','CO2','Electricity','Freshwater','GDP','Population','Poverty','Renewables',
        'Traffic_Mortality','UrbanPop','WaterSan','MaternalMortality','Literacy']

In [7]:
## create an empty dataframe to hold all of our information 
df = pd.DataFrame()

In [8]:
## Create a for loop which takes the names of the indicators and reads in the csv files for each one, 
## concatenating them into one large dataframe

## Set up what the for loop is going to iterate over 
for n in names: 
    ## Within the for loop, write a line to read in the file where the file name is changing according to 
    ## the indicator. Skip 3 lines.
    ## hint: you will need to combine the indicator name (the loop variable, which is a string) with the string '.csv'
    new = pd.read_csv(##INSERT NAME!)
    ## Within the for loop, concatenate your files based on rows. 
    ## hint: pd.concat([df1,df2]) will concatenate based on rows 
    ## the parameter axis=0 is implied; changing this to axis=1 will concatenate based on columns.
    df = 

df = df.iloc[:,:65] ##need to drop last column to clean data 


In [9]:
## print data head and data tail once you are done, checking to see that the columns have consistent values 
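
One possible way to fill in the loop above is sketched below. It is not the only solution: the function name `read_indicators` and the `folder` argument are made up for illustration, and it collects the pieces in a list and concatenates once at the end instead of growing `df` inside the loop (both approaches stack the files by rows).

```python
from pathlib import Path

import pandas as pd


def read_indicators(names, folder=".", skiprows=3):
    """Read '<name>.csv' for every indicator name and stack the files by rows."""
    frames = []
    for n in names:
        # The file name changes with the indicator: name + '.csv'
        new = pd.read_csv(Path(folder) / (n + ".csv"), skiprows=skiprows)
        frames.append(new)
    # axis=0 (the default) concatenates based on rows
    return pd.concat(frames, ignore_index=True)
```

With the data files in the working directory, `df = read_indicators(names)` followed by `df = df.iloc[:,:65]`, `print(df.head())`, and `print(df.tail())` would cover the steps of this exercise.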

**For those that want to practice: do this without writing the names explicitly into a list. Hint: you can iterate over all of the files in your directory.** 
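
As a starting point for this practice exercise, here is a small sketch that builds the list of names from whatever CSV files sit in a folder. The function name `indicator_names` is made up, and it assumes every `.csv` file in the folder is one of our indicator files.

```python
from pathlib import Path


def indicator_names(folder="."):
    """Return the names (without '.csv') of all CSV files in a folder, sorted."""
    return sorted(path.stem for path in Path(folder).glob("*.csv"))
```

The result can replace the hand-written `names` list, e.g. `names = indicator_names()`.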

### In-class exercise: subsetting and plotting data 
Now we are going to subset our data for a handful of countries (e.g., Afghanistan, Brazil, Niger, and Bangladesh), and then plot them. We will be using a function that takes the names of countries as a list and the variable of interest, and then uses a for loop to create a line plot with each of the countries displayed. This, of course, needs a legend so that we can interpret it. 

In [20]:
## define your function with three inputs: the dataset to use, countries of interest, and indicator of interest
## hint: make sure you know what the indicators are actually called in your dataset. 
## df['column_name'].unique() could be helpful
countries = ['Afghanistan','Lebanon','Colombia','Brazil','Rwanda','Niger','Nepal','Bangladesh']

def country_plots(df, countries, variable): 
    ## Start by isolating our indicator variable 

    
    ## write a line to subset the data based on whether the country column contains any of the 
    ## names within our list. There are a few ways to do this, but in order for the function to be useful, 
    ## you will probably want to find the number of values in the countries list and iterate over them
    
    df_subset = pd.DataFrame()
    for n in countries: 
        df_new = 
        
        ## For our purposes later, you will want to TRANSPOSE the data, that is flip it so that the 
        ## years are all in the first column. I have added the line below, meant to happen AFTER you subset the df
        ## I have also named your column 
        df_transposed = df_new.T
        df_transposed.columns = [str(n)] 
        ## you will also need to drop the leading metadata rows so that only the years remain; check that the
        ## slice start below matches your data (in your spare time, check out what happens if you do not)
        df_transposed = df_transposed.iloc[5:,0]
        
        df_subset = ## this will be your combined dataset with year as index and columns as country 
        ## hint: concatenate the old with the new 

    ## create a line plot with different lines for each country over time
    ## hint: you can do this a few different ways, but a for loop is a safe bet (feel free to play around with this)
    
    ## also create a boxplot where the x axis is the year and the y axis is the spread for all of the countries combined 
    
    return df_subset

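
For reference, here is a sketch of just the subsetting step (no plots). It assumes the columns are called `Country Name`, `Country Code`, `Indicator Name`, and `Indicator Code`, which is an assumption about the downloaded files that you should check against your own data. The function name `subset_countries` is also made up. It drops the metadata columns by name instead of by position, so there is no need to count rows after transposing.

```python
import pandas as pd

# Assumed names of the non-year columns; adjust if your files differ
META_COLS = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]


def subset_countries(df, countries, variable):
    """Return a table with years as the index and one column per country."""
    # Isolate the indicator of interest
    df_var = df[df["Indicator Name"] == variable]

    columns = []
    for n in countries:
        df_new = df_var[df_var["Country Name"] == n]
        if df_new.empty:
            continue  # country not present for this indicator
        # Keep only the year columns, then flip so that years become rows
        df_transposed = df_new.drop(columns=META_COLS).T
        df_transposed.columns = [str(n)]
        columns.append(df_transposed)

    if not columns:
        return pd.DataFrame()
    # axis=1 places the countries side by side
    return pd.concat(columns, axis=1)
```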


**A thought: what would you do if you wanted to create a new variable for each country?** 

**Another bonus exercise: can you create a nested dictionary where the key is the country, then you sort by indicator, and then the final layer is the value? How would you plot this using a for loop?** 
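
One way to approach the nested dictionary is sketched below: the outer key is the country, the inner key is the indicator, and the innermost value is that indicator's series of values by year. The function name `build_nested` and the column names in `META_COLS` are assumptions for illustration.

```python
import pandas as pd

# Assumed names of the non-year columns; adjust if your files differ
META_COLS = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]


def build_nested(df):
    """Return {country: {indicator: Series of values indexed by year}}."""
    year_cols = [c for c in df.columns if c not in META_COLS]
    nested = {}
    for _, row in df.iterrows():
        country = row["Country Name"]
        indicator = row["Indicator Name"]
        # setdefault creates the inner dictionary the first time a country appears
        nested.setdefault(country, {})[indicator] = row[year_cols].astype(float)
    return nested
```

To plot, you could loop over `nested.items()` for the countries and then over each inner dictionary for the indicators, drawing one line per series.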

**A final bonus! How would you make this function spit out plots for many variables? (or even all the indicators in our folder?)**
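
A possible sketch for this bonus: loop over a list of indicators and draw one figure per indicator, with one line per country. The function name `plot_indicators` and the column names are assumptions, and the figures are returned rather than shown so that you can decide what to do with them (e.g., call `plt.show()` or save them).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed names of the non-year, non-country columns; adjust if your files differ
OTHER_META_COLS = ["Country Code", "Indicator Name", "Indicator Code"]


def plot_indicators(df, countries, variables):
    """Create one line plot per indicator; return {indicator: figure}."""
    figures = {}
    for variable in variables:
        rows = df[(df["Indicator Name"] == variable)
                  & (df["Country Name"].isin(countries))]
        # Years become the index, countries become the columns
        by_year = rows.drop(columns=OTHER_META_COLS).set_index("Country Name").T

        fig, ax = plt.subplots()
        for country in by_year.columns:
            ax.plot(by_year.index, by_year[country], label=country)
        ax.set_xlabel("Year")
        ax.set_ylabel(variable)
        ax.legend()
        figures[variable] = fig
    return figures
```

To cover all the indicators in the folder, `variables` could be `df['Indicator Name'].unique()`.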

### In-class exercise: regression analysis 
What if we want to read in all of the indicators for the same country and test the relationships between them? Let's give it a try: reading in all of the files and selecting for a single country. 

In [10]:
## start by reading in all of the data files using a similar for loop to the first exercise 
## (i.e. iterating over the indicators)


## subset by COUNTRY this time instead of by indicator 

## manipulate the data so that the years are the index/rows, and the columns are the indicator names 
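
A sketch of the subsetting and reshaping steps is below. It assumes the stacked dataframe has the columns `Country Name`, `Country Code`, `Indicator Name`, and `Indicator Code`, and that every remaining column label is a year such as `'1960'`; both are assumptions to check against your data. The function name `country_table` is made up.

```python
import pandas as pd


def country_table(df, country):
    """Return a table with years as the index and indicator names as columns."""
    # Subset by COUNTRY instead of by indicator
    rows = df[df["Country Name"] == country]
    # Drop the remaining metadata, label each row by its indicator, then flip
    table = (
        rows.drop(columns=["Country Name", "Country Code", "Indicator Code"],
                  errors="ignore")
        .set_index("Indicator Name")
        .T
    )
    # Year labels are read in as text; convert them to integers
    table.index = table.index.astype(int)
    table.index.name = "Year"
    return table
```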



In [11]:
## let us regress! Import the necessary packages 

## pick 3-5 variables and conduct a regression analysis to pick one (there is a variety of data here) 
## make sure to consider whether time (years) is the predominant influence on the outcome (and why!)
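
As one way to probe the influence of time, the sketch below fits an ordinary least-squares regression with NumPy and optionally includes the year as an extra predictor, so you can compare the fit with and without it. It assumes a table with numeric years as the index and indicators as columns; the function name `fit_ols` is made up, and a dedicated statistics library would report many more diagnostics than this does.

```python
import numpy as np
import pandas as pd


def fit_ols(table, response, predictors, include_year=True):
    """Least-squares fit of one column on others; returns (coefficients, r2)."""
    data = table[[response] + list(predictors)].dropna()
    names = list(predictors)
    columns = [data[p].to_numpy(dtype=float) for p in predictors]
    if include_year:
        # Add time itself as a predictor to see how much it explains
        columns.append(data.index.to_numpy(dtype=float))
        names.append("Year")

    # Design matrix: a column of ones for the intercept, then the predictors
    X = np.column_stack([np.ones(len(data))] + columns)
    y = data[response].to_numpy(dtype=float)

    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coefs
    ss_res = float(np.sum((y - fitted) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - ss_res / ss_tot
    return pd.Series(coefs, index=["Intercept"] + names), r2
```

If r2 barely changes when `include_year` is switched off, time is probably not the main driver; if it drops sharply, the other predictors may mostly be standing in for a shared trend.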

#### Bonus
Can you create a for loop that will automatically create a regression for each variable? What could you do if you wanted to run many regressions and record the results without printing them? 
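
One possible approach is sketched below: loop over the columns, fit a one-predictor regression for each, and store the results in a dataframe instead of printing them. It assumes a table with indicators as columns; the function name `regress_each` is made up, and it uses NumPy's `polyfit` for the straight-line fit.

```python
import numpy as np
import pandas as pd


def regress_each(table, response):
    """Regress `response` on every other column, one at a time; return a summary table."""
    records = []
    for predictor in table.columns:
        if predictor == response:
            continue
        # Keep only the rows where both variables have data
        pair = table[[predictor, response]].dropna()
        if len(pair) < 3 or pair[predictor].nunique() < 2:
            continue  # too little data (or no variation) to fit a line
        x = pair[predictor].to_numpy(dtype=float)
        y = pair[response].to_numpy(dtype=float)

        slope, intercept = np.polyfit(x, y, 1)  # straight-line fit
        r2 = np.corrcoef(x, y)[0, 1] ** 2       # one predictor: r2 = correlation squared
        records.append({
            "predictor": predictor,
            "slope": slope,
            "intercept": intercept,
            "r2": r2,
            "n_obs": len(pair),
        })
    return pd.DataFrame(records)
```

The returned dataframe can then be sorted (e.g., by `r2`) or saved with `to_csv`, so nothing needs to be printed while the loop runs.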


## GitHub Intro
Great! Now we have some cool code we want to share with the world because other people might want to replicate our analysis (yay for FAIR data and transparent science!). You might wonder whether you can publish Jupyter Notebooks or Spyder analyses somewhere. The answer is yes: on GitHub. In fact, you can find the lectures we use in [Andrew's online repository](https://github.com/ahamilton144/Python-For-Environmental-Research), though changes from last year to this year are not yet publicly available. We will also be using another one of Andrew's repositories, [his Git Tutorial](https://github.com/ahamilton144/GitTutorial), to guide us for the second half of this lecture and the first half of next week's lecture.

GitHub is also really useful as an online backup for your own work. You can upload your code as you work on it so that if your computer crashes or your external hard drive breaks, you have a backup option. You can also keep repositories private, so that others cannot see your work in progress. 