# CORE Skills Prerequisite - Intro to Python

This lesson is adapted from the [Data Carpentry Ecology lesson](http://www.datacarpentry.org/python-ecology-lesson/)


## How to use a Jupyter Notebook

https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html

https://jupyterlab.readthedocs.io/en/stable/user/notebook.html

- The file autosaves
- You run a cell with **shift + enter** or using the run button in the tool bar
- If you run a cell with **option + enter** (**alt + enter** on Windows and Linux) it will also create a new cell below
- In the classic Jupyter Notebook see *Help > Keyboard Shortcuts* or the *Cheatsheet* for more info
- In Jupyter Lab use the Search from the left pane to search for shortcuts


- The notebook has different types of cells: Code and Markdown are the most commonly used
- **Code** cells expect code for the kernel you have chosen, with syntax highlighting available. Comments are marked with `#`; anything after it on the same line is not executed
- **Markdown** cells allow you to write report-style text, using Markdown to format it (e.g. headers, bold face, etc.)

## Introduction to Python and data analysis using pandas

Python is a high-level, interpreted programming language. This means the code is easy for humans to read and there is no need for us to compile it. In many cases we also do not have to think too much about the underlying system, e.g. memory usage.

As a consequence, we can use it in two ways:
- Using the interpreter as an "advanced calculator" in interactive mode:

In [None]:
# Calculations


In [None]:
# Printing text to screen


- Executing programs/scripts saved as a text file, usually with *.py extension:

In [None]:
# running scripts (using jupyter notebook magics)
%run my_script.py

# Types of Data

How information is stored in a DataFrame or a Python object affects what we can do with it and the outputs of calculations as well. There are two main types of data that we will explore in this lesson: numeric and character types.


## Numeric Data Types

Numeric data types include integers and floats. A **floating point** number (known as a
float) has a decimal point even if the value after that decimal point is 0. For
example: 1.13, 2.0, 1234.345. If we have a column that contains both integers and
floating point numbers, Pandas will assign the entire column to the float data
type so the decimal points are not lost. In a vector or data frame (we will learn about these types later) the entire object or an entire column will be of the same type.

An **integer** will never have a decimal point. If 1.13 were converted to an integer it would be stored as 1,
and 1234.345 would be stored as 1234. You will often see the data type `int64` in Pandas,
which stands for 64-bit integer. The 64 simply refers to the memory allocated to
store each value, which effectively determines how large a number can be
stored in each "cell". Allocating space ahead of time allows computers to
optimize storage and processing efficiency.
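The difference between the two numeric types can be checked directly in Python. The following is a minimal sketch using the values from the text above; the variable names `whole`, `decimal` and `total` are chosen only for illustration.

```python
whole = 42     # an integer: no decimal point
decimal = 2.0  # a float: has a decimal point even though the value is whole

print(type(whole))    # <class 'int'>
print(type(decimal))  # <class 'float'>

# Converting a float to an integer drops everything after the decimal point
print(int(1.13))      # 1
print(int(1234.345))  # 1234

# Mixing an integer and a float in a calculation gives a float
total = whole + decimal
print(total, type(total))  # 44.0 <class 'float'>
```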



## Character Data Types

Strings are values that contain numbers and / or characters.
For example, a string might be a word, a sentence, or several sentences.
A string can also contain or consist of numbers. For instance, '1234' could be stored as a
string. As could '10.23'. However, **strings that contain numbers cannot be used
for mathematical operations** until they are converted to a numeric type!
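A short sketch of what this looks like in practice, assuming a hypothetical variable `number_as_text` that holds the string '1234' from the text above:

```python
number_as_text = "1234"

# Adding a number to a string raises a TypeError
try:
    number_as_text + 1
except TypeError as error:
    print("Cannot add:", error)

# Convert the string to a numeric type first, then do the arithmetic
print(int(number_as_text) + 1)  # 1235
print(float("10.23") * 2)       # 20.46
```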





In [None]:
# Examples of numeric and text data
text = "Data Science"
number = 42
pi_value = 3.1415

Here we've assigned data to variables, namely `text`, `number` and `pi_value`,
using the assignment operator `=`. The variable called `text` is a string which
means it can contain letters and numbers. We could reassign the variable `text`
to an integer too - but be careful reassigning variables as this can get 
confusing.

To see the value stored in a variable, we can simply type the name of the
variable into the interpreter.

By default, a cell displays the value of the last expression it evaluates (unless that value is assigned to a variable).

Thus, in scripts, and to display anything other than the last expression in a cell, we must use the `print` function:

In [None]:
# Next line will print out text
print(text)

In [None]:
# Only the last expression (number) is displayed here;
# we need the print function if we want to see more than one variable
text
number
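As a minimal sketch, the cell above could be rewritten with `print` so that both values are shown. It reuses the `text` and `number` variables defined earlier in this lesson.

```python
text = "Data Science"
number = 42

# print() accepts several arguments and separates them with a space
print(text, number)  # Data Science 42

# Calling print() once per variable puts each value on its own line
print(text)
print(number)
```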

### Mathematical Operators

We can perform mathematical calculations in Python using the basic operators
 `+, -, /, *, %` and the exponent operator `**`:

In [None]:
6*7

In [None]:
2**16

In [None]:
13%5

**In Python 2, dividing one integer by another returns an integer!**
In Python 3 the same division with `/` returns a float.

If you use Python 2 (not recommended, as it reached end of life in January 2020), remember to convert your integers to floats when you want floating point precision for divisions!

In [None]:
# testing integer division


In [None]:
# convert to integer
a = 6.6
int(a)

In [None]:
# convert to float


### Logical Operators
We can also use comparison operators:
`<, >, ==, !=, <=, >=` and logical operators such as
`and, or, not`. The data type returned by these is
called a _boolean_.
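A few of these operators in action, as a minimal sketch. The variable `weight` and its value are hypothetical and chosen only for illustration.

```python
weight = 42  # hypothetical value

print(weight > 40)                  # True
print(weight == 40)                 # False
print(weight != 40)                 # True
print(weight > 40 and weight < 50)  # True
print(weight < 40 or weight > 50)   # False
print(not weight > 40)              # False
print(type(weight > 40))            # <class 'bool'>
```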

## Sequential types: Lists and Tuples

### Lists

**Lists** are a common data structure to hold an ordered sequence of
elements. Each element can be accessed by an index. Note that Python
indexing starts at 0 instead of 1:

In [None]:
numbers = [1,2,3]
numbers[0]

To add a single element to the end of a list, we can use the `append` method:

In [None]:
numbers.append(5)
print(numbers)

To add multiple elements to the end of a list, we can use the `extend` method:
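A minimal sketch of `extend`, continuing from the `numbers` list used above. The values 6 and 7 are arbitrary.

```python
numbers = [1, 2, 3]
numbers.append(5)       # adds one element: [1, 2, 3, 5]
numbers.extend([6, 7])  # adds each element of another list
print(numbers)          # [1, 2, 3, 5, 6, 7]
```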

**Methods** are a way to interact with an object (a list, for example). We can invoke
a method using the dot `.` followed by the method name and a list of arguments in parentheses.
To find out what methods are available for an object, we can use the built-in `help` function:

In [None]:
help(numbers)

In [None]:
# try some methods


We can also access a list of methods using `dir`. Some method names are
surrounded by double underscores. Those methods are called "special", and
usually we access them in a different way. For example, the `__add__` method is
responsible for the `+` operator.

In [None]:
dir(numbers)
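The link between `__add__` and `+` can be seen in a small sketch. The name `public_methods` is hypothetical; filtering on the double underscore prefix is just one simple way to hide the special methods from the output of `dir`.

```python
numbers = [1, 2, 3]

# The + operator and the special method __add__ give the same result
print(numbers + [4])         # [1, 2, 3, 4]
print(numbers.__add__([4]))  # [1, 2, 3, 4]

# dir() returns names as strings, so special methods can be filtered out
public_methods = [name for name in dir(numbers) if not name.startswith("__")]
print(public_methods)
```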

### Tuples

A tuple is similar to a list in that it's an ordered sequence of elements. However,
tuples cannot be changed once created (they are "immutable"). Tuples are
created by placing comma-separated values inside parentheses `()`.

In [None]:
a_tuple = (1,2,3)
another_tuple = ('blue','green','red')
a_list = [1,2,3]

### Challenge
1. What happens when you type `a_tuple[2]=5` vs `a_list[1]=5` ?
2. Type `type(a_tuple)` into python - what is the object type?


# Working With Pandas DataFrames in Python

## Starting in the same spot

To help the lesson run smoothly, let's ensure everyone is in the same directory.
This should help us avoid path and file name issues. At this time please
navigate to the workshop directory. If you are working in a Jupyter Notebook, be sure
that you start your notebook in the workshop directory.

As a quick aside, there are Python libraries like the [OS
Library](https://docs.python.org/3/library/os.html) that can work with our
directory structure; however, that is not our focus today.

If you need to change your directory, `import os` and use `os.chdir`.

## Our Data 

For this lesson, we will be using the Portal Teaching data, a subset of the data
from Ernest et al.
[Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA](http://www.esapubs.org/archive/ecol/E090/118/default.htm).

We will be using files from the [Portal Project Teaching Database](https://figshare.com/articles/Portal_Project_Teaching_Database/1314459).
This section will use the `surveys.csv` file, which can be found in /data/python/python_data.

We are studying the species and weight of animals caught in plots in our study
area. The dataset is stored as a `.csv` file: each row holds information for a
single animal, and the columns represent:

| Column           | Description                        |
|------------------|------------------------------------|
| record_id        | Unique id for the observation      |
| month            | month of observation               |
| day              | day of observation                 |
| year             | year of observation                |
| plot             | ID of a particular plot            |
| species          | 2-letter code                      |
| sex              | sex of animal ("M", "F")           |
| wgt              | weight of the animal in grams      |


The first few rows of our first file look like this:

```
record_id,month,day,year,plot,species,sex,wgt
1,7,16,1977,2,NA,M,
2,7,16,1977,3,NA,M,
3,7,16,1977,2,DM,F,
```

## About Libraries

A library in Python contains a set of tools (called functions) that perform
tasks on our data. Importing a library is like getting a piece of lab equipment
out of a storage locker and setting it up on the bench for use in a project.
Once a library is set up, it can be used or called to perform many tasks.

Python doesn't load all of the libraries available to it by default. We have to
add an `import` statement to our code in order to use library functions. To import
a library, we use the syntax `import libraryName`. If we want to give the
library a nickname to shorten the command, we can add `as nickNameHere`.  An
example of importing the pandas library using the common nickname `pd` is below.
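Both forms of the `import` statement can be sketched as follows. The standard-library module `math` is not part of this lesson and is used here only to show an import without a nickname; the example assumes pandas is installed.

```python
import math          # no nickname: call functions as math.<name>
import pandas as pd  # nickname: call functions as pd.<name>

print(math.sqrt(16))  # 4.0
print(pd.__name__)    # pandas
```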

You only need to load a library once during your session. You can load the library when needed
or you can load all necessary libraries at the beginning of your script.
Loading them at the beginning is good practice, especially for the readability of your code.

## Pandas in Python

One of the best options for working with tabular data in Python is to use the
[Python Data Analysis Library](http://pandas.pydata.org/) (a.k.a. Pandas). The
Pandas library provides data structures, produces high quality plots with
[matplotlib](http://matplotlib.org/) and integrates nicely with other libraries
that use [NumPy](http://www.numpy.org/) (which is another Python library) arrays.

A handy **Pandas cheatsheet** can be found [here](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

Each time we call a function that's in a library, we use the syntax
`LibraryName.FunctionName`. Adding the library name with a `.` before the
function name tells Python where to find the function. In this lesson, we
import Pandas as `pd`. This means we don't have to type out `pandas` each
time we call a Pandas function.

In [None]:
# check if you need to change your directory
import os
os.getcwd()  

In [None]:
os.listdir("../")

In [None]:
os.chdir("../data/")

In [None]:
os.getcwd()  

In [None]:
import pandas as pd
# check your version; we need v0.19 or higher
pd.__version__

# Reading CSV Data Using Pandas

We will begin by locating and reading our survey data which are in CSV format.
We can use Pandas' `read_csv` function to pull the file directly into a
[DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).

## So What's a DataFrame?

A DataFrame is a 2-dimensional data structure that can store data of different
types (including characters, integers, floating point values, factors and more)
in columns. It is similar to a spreadsheet or an SQL table or the `data.frame` in
R. A DataFrame always has an index (0-based by default). An index refers to the position of
an element in the data structure.


In [None]:
# note that pd.read_csv is used because we imported pandas as pd
pd.read_csv("surveys.csv")

We can see that there were 35,549 rows parsed. Each row has 9
columns. The first column is the index of the DataFrame. The index is used to
identify the position of the data, but it is not an actual column of the DataFrame.
It looks like the `read_csv` function in Pandas read our file properly. However,
we haven't saved the data to a variable, so we can't work with it yet. We need to assign the
DataFrame to a variable. Remember that a variable is a name for a value, such as `x`
or `data`. We can create a new object with a variable name by assigning a value to it using `=`.

Let's call the imported survey data `surveys_df`:



In [None]:
surveys_df = pd.read_csv("surveys.csv")

Notice that when you assign the imported DataFrame to a variable, Python does not
produce any output on the screen. We can print the value of the `surveys_df`
object by typing its name into the Python command prompt.
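The same steps can be tried without the data file. The sketch below reads the first rows of `surveys.csv` (as shown earlier in this lesson) from a string and assigns the result to a variable. The names `sample_csv` and `sample_df` are hypothetical, and `io.StringIO` is used only so the example is self-contained.

```python
import io
import pandas as pd

# The first rows of surveys.csv, as shown earlier in this lesson
sample_csv = """record_id,month,day,year,plot,species,sex,wgt
1,7,16,1977,2,NA,M,
2,7,16,1977,3,NA,M,
3,7,16,1977,2,DM,F,
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)   # (3, 8): 3 rows, 8 columns (the index is not counted)
print(sample_df.dtypes)  # the data type of each column
print(sample_df)         # the DataFrame itself, with the index on the left
```

With the default settings, `read_csv` treats empty fields and the text `NA` as missing values, so the `wgt` column in this sample is read as floats containing only missing values.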
