# Welcome to the tutorial!

## This is Part 1.

### Let's start by exploring some data

Some links to useful help pages:

* https://docs.python.org/3/tutorial/
* https://www.learnpython.org/ (nice interactive tutorials)
* http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3 (pandas cheat sheet)
* https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf (another pandas cheat sheet)

This tutorial was made using Python 3. 

### Short introduction to the environment

In [None]:
# This is a comment in Python. Anything written after a '#' will not be executed
# Follow the script, read the comments and play around

# To run the code in a cell:
# click inside the cell you want to run, then either click the run button at the top or press Ctrl+Enter
# This cell will not do anything when run because it contains only comments, so there is no code to execute

In [None]:
# Check what the current working directory is
# "import" loads a Python module (a collection of Python code) that contains useful functions we need
import os
os.getcwd()

In [None]:
# Change the working directory (into the data directory)
os.chdir('data/')
os.getcwd()

In [None]:
# But we actually want to be where we started, so let's go back (../ means go up one directory)
os.chdir('../')
os.getcwd()

In [None]:
# We assign the result of 2 + 2 to the variable a, then display its value
a = 2 + 2
a

In [None]:
# A Jupyter notebook cell only displays the value of its last expression automatically
# To show multiple outputs from a single cell you must use the print() function
# Try it without using print()
a = 2 + 2
print(a)

b = a + 4
print(b)

### Reading in the data

In [None]:
# pandas is a very useful python data analysis library
# The 'as' allows us to refer to it by a shorter name elsewhere in the code

import pandas as pd

In [None]:
# Now we read in the data file
# Notice we have used the name 'pd'
un_data = pd.read_csv('data/UN.csv')

### Exploring the structure of the dataframe

In [None]:
# What does our data look like?
# The head function returns the first 5 rows of the dataframe
# Parentheses are used to call a function. Here we are using the head() function
un_data.head()

# Type "un_data." and, without running the code, just press <tab>. What do you find?

In [None]:
# We can specify the number of rows to display
un_data.head(10)

In [None]:
# Can you tell what this function does?
un_data.describe()

In [None]:
# What do you think this is telling you?
# Note there are no parentheses after shape. This is because shape is not a function
# We are just asking for a property (an attribute) of the dataframe, rather than applying some function to it
un_data.shape
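
The difference between a property and a function can be sketched on a tiny stand-in table. The dataframe `toy` below is hypothetical, with made-up values; it is not the UN data.

```python
import pandas as pd

# Hypothetical toy data standing in for UN.csv (all values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "ppgdp": [100.0, 250.0, 400.0],
})

# A property (attribute): no parentheses, it just reports something stored
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# A function (method): parentheses, it does some work and returns a result
first_two = toy.head(2)
print(first_two)
```

Leaving the parentheses off a function, as in `toy.head`, does not raise an error; it displays the function object itself rather than calling it.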

In [None]:
# The info() function gives information about the dataframe
# Is this consistent with the results from head() and shape?
un_data.info()

### Subsetting the dataframe

In [None]:
# You can select a specific column by name
pctUrban = un_data.pctUrban

# Use the head() function to view the object
# Compare the output of pctUrban.head() with un_data.head() - you can either scroll up or run the command again
# Can you see what we have done here?
pctUrban.head()

In [None]:
# What size do you expect this object to be? Check if you are right (Hint: use pctUrban.size)


In [None]:
# We can check that the result of un_data.describe() from earlier is correct
# (Hint: use pctUrban.mean() and pctUrban.std())


In [None]:
# Another way to select a column
gdp = un_data['ppgdp']

# Use the head() function to check we have done the same thing. And again compare with un_data.head()
gdp.head()

In [None]:
# In a similar way, we can select multiple columns at once
reduced = un_data[['country', 'ppgdp', 'fertility']]
reduced.head()

In [None]:
# Another way to select a column is by its position (index)
# When we select by position we use iloc, which takes square brackets rather than parentheses
fertility = un_data.iloc[:, 3]  # This gets the 4th column because Python indexing starts at 0

In [None]:
# Again we can check what we have done
fertility.head()
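
Selecting by position and selecting by name should give the same column. Here is a minimal sketch on a hypothetical toy table (made-up values, not the UN data) that checks this.

```python
import pandas as pd

# Hypothetical toy data (all values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "region": ["X", "Y", "X"],
    "ppgdp": [100.0, 250.0, 400.0],
})

# Select the third column by position (counting starts at 0)
by_position = toy.iloc[:, 2]

# Select the same column by name
by_name = toy["ppgdp"]

# equals() checks that both selections hold the same values
same = by_position.equals(by_name)
print(same)

# If you know the name but not the position, get_loc() looks it up
position = toy.columns.get_loc("ppgdp")
print(position)
```

Selecting by name is usually safer, because it keeps working if the column order of the file changes.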

In [None]:
# Selecting rows: we just use the row index to select a specific row
print("The second row:")
print(un_data.iloc[1])

In [None]:
# We can select a specific element in the dataframe by specifying row and column number
print("\nThe 3rd row and 4th column:")
print(un_data.iloc[2, 3])

In [None]:
# Here we are selecting a specific row according to a criterion
# What does un_data.country == 'Spain' do? (Hint: try running it by itself)
un_data[un_data.country == 'Spain']

In [None]:
# We can also select by other criteria to get multiple rows
un_data[un_data.ppgdp > 50000]
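
A comparison such as `un_data.ppgdp > 50000` produces a column of True/False values (a boolean mask), and the dataframe keeps only the rows where the mask is True. The sketch below uses a hypothetical toy table with made-up values to show the mask on its own and how two conditions might be combined.

```python
import pandas as pd

# Hypothetical toy data (all values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["X", "Y", "X", "Y"],
    "ppgdp": [100.0, 60000.0, 75000.0, 300.0],
})

# The comparison alone gives one True/False value per row
mask = toy.ppgdp > 50000
print(mask)

# Using the mask inside [] keeps only the rows marked True
rich = toy[mask]
print(rich)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
rich_in_x = toy[(toy.ppgdp > 50000) & (toy.region == "X")]
print(rich_in_x)
```

Note that pandas uses `&` and `|` here rather than the Python keywords `and` and `or`, which do not work element by element on a column.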

### Exploring the data further

In [None]:
# How many NA values are there in a specific column?
# Remember un_data.ppgdp selects the column by name
# If you are not sure what is happening here try just running un_data.ppgdp.isna()
un_data.ppgdp.isna().sum()
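
Counting missing values works because `isna()` returns True/False for each row, and `sum()` counts each True as 1. This is a small sketch on hypothetical made-up data, which also shows the same count for every column at once.

```python
import pandas as pd

nan = float("nan")  # a missing numeric value

# Hypothetical toy data (all values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "ppgdp": [100.0, nan, 400.0, nan],
})

# True where the value is missing, False otherwise
missing_mask = toy.ppgdp.isna()
print(missing_mask)

# Summing the mask counts the True values
n_missing = missing_mask.sum()
print(n_missing)

# Calling isna() on the whole dataframe gives a count per column
per_column = toy.isna().sum()
print(per_column)
```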

In [None]:
# We take a column and find all the unique values
unique_regions = un_data.region.unique()
print(unique_regions)

# It seems that there is a nan here. Where does that come from?
# Continue to explore

In [None]:
# Here we select all the rows that have a value of NA for the region
un_data[un_data.region.isna()]

In [None]:
# Based on these results we probably want to remove all these rows from our table
# The dropna() function drops every row that has NA data in any column
print(un_data.shape)
un_data_nona = un_data.dropna()
print(un_data_nona.shape)
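
Because `dropna()` removes a row if any column is missing, it can discard more rows than intended. As a sketch, on a hypothetical toy table with made-up values, the `subset` argument limits the check to the columns you name.

```python
import pandas as pd

nan = float("nan")  # a missing numeric value

# Hypothetical toy data (all values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "region": ["X", None, "Y"],
    "ppgdp": [100.0, 250.0, nan],
})

# Default behaviour: drop a row if ANY column is missing
dropped_any = toy.dropna()

# Only drop rows where the region is missing
dropped_region = toy.dropna(subset=["region"])

# dropna() returns a new dataframe; the original is left unchanged
print(toy.shape, dropped_any.shape, dropped_region.shape)
```

Whether the default or the `subset` version is the right choice depends on which columns the later analysis actually needs.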

In [None]:
# This is a loop in Python. Note the structure. Run the code. Can you see what is going on here?
# Ask if you are not sure
for region in unique_regions:
    print(region)

In [None]:
# Is this different to the lines above? Can you see from the error why this is?
for region in unique_regions:
print(region)