# Jupyter Notebook Example

Credit: This is slightly modified from examples used in the FSCI 2017 Computational Reproducibility Day (https://osf.io/sbnz7/), which was created by Courtney Soderberg and Jennifer Smith from the Center for Open Science. 


# Setting up the notebook

## Let's get started

The notebook is built up from separate editable areas, or cells.

A new notebook contains a single *code* cell.

Add a line of code and execute it by:
* *clicking the run button*, or
* clicking in the cell and pressing SHIFT-RETURN

In [None]:
print('hello world')

## Navigating and Selecting Cells

To select a cell, click on it. The selected cell will be surrounded by a box with the left hand side highlighted.

Move the selection focus to the cell above/below using the keyboard up/down arrow keys.

Additionally select adjacent cells using SHIFT-UP ARROW or SHIFT-DOWN ARROW.

## Managing Cells - Add, Delete, Reorder

Add a new cell to the notebook by:
* clicking the + button on the toolbar
* choosing Insert -> Insert Cell Above, or pressing ESC-A
* choosing Insert -> Insert Cell Below, or pressing ESC-B

Delete a cell by selecting it and:
* clicking the scissors button on the toolbar
* choosing Edit -> Delete Cells, or pressing ESC-X

Undelete the last deleted cell by:
* choosing Edit -> Undo Delete Cells, or pressing ESC-Z

Each cell has a cell history associated with it. Use CMD-Z (CTRL-Z on Windows and Linux) to step back through previous cell contents.

Reorder cells by:
* moving them up and down the notebook using the up and down arrows on the toolbar
* Edit -> Move Cell Up or Edit -> Move Cell Down
* cutting and pasting them:
    * Edit -> Cut Cells, then Edit -> Paste Cells Above or Edit -> Paste Cells Below

You can also copy selected cells using the toolbar, Edit -> Copy Cells, or ESC-C.

## More resources

[More stuff you can do in notebooks](https://towardsdatascience.com/how-to-effortlessly-optimize-jupyter-notebooks-e864162a06ee)

## About Libraries in Python

Let's use our first code cell to import a library. A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is imported, it can be used or called to perform many tasks.

Python doesn’t load all of the libraries available to it by default. We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax `import libraryName`. If we want to give the library a nickname to shorten the command, we can add `as nickNameHere`. An example of importing the Pandas library using the common nickname `pd` is below.

**`import`** `pandas` **`as`** `pd`

## The Pandas Library

One of the best options for working with tabular data in Python is the Python Data Analysis Library (a.k.a. Pandas). The Pandas library is built on top of the NumPy package (another Python library). Pandas provides data structures, produces high quality plots with matplotlib, and integrates nicely with other libraries that use NumPy arrays. Those familiar with spreadsheets should become comfortable with Pandas data structures.
  

In [None]:
import pandas as pd
import numpy as np

Each time we call a function that’s in a library, we use the syntax `LibraryName.FunctionName`. Adding the library name with a `.` before the function name tells Python where to find the function. In the example above, we have imported Pandas as `pd`. This means we don’t have to type out `pandas` each time we call a Pandas function.
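As a small illustration of the `LibraryName.FunctionName` pattern, the sketch below calls a few functions through the `np` and `pd` nicknames. The values and the name `toy_values` are made up for this example.

```python
import numpy as np
import pandas as pd

# Call functions through the library nickname: nickname.function_name(...)
values = np.array([1.0, 2.0, 3.0, 4.0])  # np.array comes from NumPy
average = np.mean(values)                # np.mean also comes from NumPy

# pd.Series comes from Pandas; it wraps the same values in a labelled column
series = pd.Series(values, name="toy_values")  # hypothetical name

print(average)
print(series.max())
```

Without the `np.` or `pd.` prefix, Python would not know where to look for `array`, `mean`, or `Series`.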

See this free [Pandas cheat sheet](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet) from DataCamp for the most common Pandas commands. 

## Markdown

We've seen how we can have code cells and show their output below them, but what about that plain language I mentioned? We can add another type of cell, a Markdown cell, that contains narrative text. Markdown is a popular lightweight markup language that also accepts inline HTML. To learn more, see [Jupyter's Markdown guide](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html) or revisit the [Reproducible Research lesson on Markdown](https://github.com/Reproducible-Science-Curriculum/introduction-RR-Jupyter/blob/master/notebooks/Navigating%20the%20notebook%20-%20instructor%20script.ipynb).

Let's add a markdown cell above our library imports. To do this:

* Change the cell type using the drop down list in the toolbar, or by using the ESC-M keyboard shortcut.
* To "open" or select a markdown cell for editing, double click the cell.
* View the rendered markdown by running the cell.

Markdown cells can contain:

* headings
    Prefix a line of text in a markdown cell with one or more # signs, followed by a space, to specify the level of the heading required: `# Heading 1`, `## Heading 2`, and so on up to `###### Heading 6`.

# Getting data into the notebook

We will begin by locating and reading our data which are in a table format as a tab-delimited file. We will use Pandas’ `read_csv` function to pull the file directly into a `DataFrame`.

## What’s a `DataFrame`?
A `DataFrame` is a 2-dimensional data structure whose columns can each store a different type of data (including characters, integers, floating point values, factors and more). It is similar to a spreadsheet, a SQL table, or a `data.frame` in R. A `DataFrame` always has an index (0-based by default). An index refers to the position of an element in the data structure.

Note that we use `pd.read_csv`, not just `read_csv` or `pandas.read_csv`, because we imported Pandas as `pd`.

In our original file, the columns in the data set are separated by a TAB. We tell the `read_csv` function in Pandas to use that delimiter with `sep='\t'`.



In [None]:
url = "https://osf.io/z274d/download"
#You can also read your table in from a file directory
gapminder = pd.read_csv(url, sep = "\t")
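To see what the `sep` argument does without downloading anything, here is a minimal sketch that reads a tiny tab-delimited table from a string. The table contents (`Atlantis` and its numbers) are invented for illustration and are not part of the gapminder data.

```python
import io
import pandas as pd

# A tiny, made-up tab-delimited table standing in for a real file
toy_tsv = "country\tyear\tpop\nAtlantis\t1950\t100\nAtlantis\t1960\t150\n"

# With sep="\t" each tab-separated field becomes its own column
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t")
print(toy)

# With the default comma separator, each whole line lands in a single column
wrong = pd.read_csv(io.StringIO(toy_tsv))
print(wrong.shape)
```

If a freshly loaded table shows up as one wide column, a mismatched delimiter is a likely cause.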

The first thing to do when loading data into the notebook is to actually "look" at it.  How many rows and columns are there?  What types of variables are in it and what values can they take?

There are usually too many rows to print to the screen.  By default, when you type the name of the `DataFrame` and run a cell, Pandas knows to not print the whole thing.  Instead, you will see the first and last few rows with dots in between.  A neater way to see a preview of the dataset is the `head()` method.  Calling `dataset.head()` will display the first 5 rows of the data.  You can specify how many rows you want to see as an argument, like `dataset.head(10)`.  The `tail()` method does the same with the last rows of the `DataFrame`.

In [None]:
gapminder.head()

Sometimes the table has too many columns to print on screen. Calling `df.columns.values` will print all the column names in an array.

In [None]:
gapminder.columns.values

# Assess the structure and cleanliness


## How many rows and columns are in the data?
We often want to know how many rows and columns are in the data, that is, the "shape" of the `DataFrame`. Shape is an attribute of the `DataFrame`. Pandas has a convenient way of getting that information: `DataFrame.shape` (using `DataFrame` here as a generic name for your `DataFrame`). This returns a tuple (an immutable sequence of values separated by commas) representing the dimensions of the `DataFrame` (rows, columns).

To get the shape of the gapminder `DataFrame`:

In [None]:
gapminder.shape
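Because `shape` is a plain tuple, it can be unpacked into two variables. The sketch below uses a small invented `DataFrame` (the name `toy` and its values are assumptions for the example).

```python
import pandas as pd

# Small made-up table: 3 rows, 2 columns
toy = pd.DataFrame({"country": ["A", "B", "C"], "pop": [1.0, 2.0, 3.0]})

# shape is an attribute, so there are no parentheses after it
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```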

We can learn even more about our `DataFrame`. The `info()` method gives a few useful pieces of information, including the shape of the `DataFrame`, the variable type of each column, and the amount of memory used.

The output from `info()` displayed below shows that the fields ‘year’ and ‘pop’ (population) are represented as ‘float’ (that is, numbers with a decimal point). This is not appropriate: year and population should be integers or whole numbers. We can change the data type with the `astype()` method, and we will do so later in this lesson.

In [None]:
gapminder.info()

The `describe()` method will take the numeric columns and provide a summary of their values. This is useful for getting a sense of the ranges of values and seeing if there are any unusual or suspicious numbers.


In [None]:
gapminder.describe()

The `DataFrame` function `describe()` just blindly looks at all numeric variables. We wouldn't actually want to take the mean year. Additionally, we obtain ‘NaN’ values for our quartiles. This suggests we might have missing data which we can (and will) deal with shortly when we begin to clean our data.

For now, let's pull out only the columns that are truly continuous numbers (i.e. ignore the description for ‘year’). This is a preview of selecting columns from the data; we'll talk more about how to do it later in the lesson.

In [None]:
gapminder[['pop', 'lifeexp', 'gdppercap']].describe()

# Data cleaning

## Referencing objects vs copying objects
Before we get started with cleaning our data, let's practice good data hygiene by first creating a copy of our original data set. Often, you want to leave the original data untouched.  To protect your original, you can make a copy of your data (and save it to a new `DataFrame` variable) before operating on the data or a subset of the data.  This will ensure that a new version of the original data is created and your original is preserved.

###### Why this is important
Suppose you take a subset of your `DataFrame` and store it in a new variable, like `gapminder_early = gapminder[gapminder['year'] < 1970]`. Depending on how the subset was taken and on your Pandas version, the result may still be tied to the data in the original `gapminder` rather than being a fully independent object. Changes you make to `gapminder_early` can then show up in the corresponding rows of `gapminder`, or Pandas may warn you (`SettingWithCopyWarning`) that it cannot tell which object you meant to change. Plain assignment such as `gapminder_alias = gapminder` never copies anything: both names refer to the same `DataFrame`. Calling `.copy()` removes the ambiguity.


In [None]:
gapminder = pd.read_csv(url, sep = "\t")
gapminder_copy = gapminder.copy()
gapminder_copy.head()
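The difference between a second name for the same object and a true copy can be sketched with a small invented table. The names `original`, `alias`, and `backup` are hypothetical and used only for this example.

```python
import pandas as pd

original = pd.DataFrame({"year": [1950, 1960], "pop": [100.0, 150.0]})

alias = original          # a second name for the SAME DataFrame
backup = original.copy()  # an independent copy of the data

# Changing a value through the alias changes the original too
alias.loc[0, "pop"] = -1.0

print(original.loc[0, "pop"])  # reflects the change made through alias
print(backup.loc[0, "pop"])    # still holds the value from before the change
print(alias is original, backup is original)
```

Working on `backup` (or on `gapminder_copy` in this lesson) leaves the data that was first loaded untouched.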

## Handling Missing Data

Missing data (often denoted as 'NaN', short for "not a number", in Pandas, or as 'null') is an important issue to handle because missing values can silently distort summaries or propagate through calculations. 'NaN' or 'null' does not mean the value at that position is zero; it means that there is no information at that position. Ignoring missing data doesn't make it go away. There are different ways of dealing with it, which include:

* analyzing only the available data (i.e. ignoring the missing data)
* imputing the missing data with replacement values and treating these as though they were observed
* imputing the missing data and accounting for the fact that these values were imputed with uncertainty (e.g. creating a new boolean variable so you know that these values were not actually observed)
* using statistical models that allow for missing data, making assumptions about their relationships with the available data as necessary

For our purposes with the dirty gapminder data set, we know our missing data is excess (and unnecessary) and we are going to choose to analyze only the available data. To do this, we will simply remove rows with missing values.

This is incredibly easy to do because Pandas allows you to either remove all instances with null data or replace them with a particular value.

`df = df.dropna()` drops rows with any column having NA/null data.  `df = df.fillna(value)` replaces all NA/null data with the argument `value`.

For more fine-grained control of which rows (or columns) to drop, you can use `how` or `thresh`. These are more advanced topics and are not covered in this lesson; you are encouraged to explore them on your own.

In [None]:
gapminder_copy = gapminder_copy.dropna()
gapminder_copy.head()
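For readers who want to explore `how` and `thresh`, here is a minimal sketch on an invented table with several kinds of gaps. The data and variable names are assumptions for illustration, and filling with zero is shown only to demonstrate `fillna`, not as a recommended replacement value.

```python
import numpy as np
import pandas as pd

# Made-up table with missing values in different places
toy = pd.DataFrame({
    "country": ["A", "B", None, None],
    "pop": [100.0, np.nan, np.nan, 5.0],
    "lifeexp": [70.0, 65.0, np.nan, np.nan],
})

any_missing = toy.dropna()           # drop rows with at least one missing value
all_missing = toy.dropna(how="all")  # drop only rows where every value is missing
at_least_two = toy.dropna(thresh=2)  # keep rows with 2 or more non-missing values

# Replace missing values in one column only (zero is just a placeholder here)
filled = toy.fillna({"pop": 0.0})

print(len(any_missing), len(all_missing), len(at_least_two))
print(filled)
```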

## Changing Data Types
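We can change the data type of a column with the `astype()` method. For example, the ‘year’ and ‘pop’ columns, which were read in as floats, can be converted to integers now that the rows with missing values have been removed.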
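A minimal sketch of `astype()` on an invented table follows; the column names mirror the lesson's ‘year’ and ‘pop’, but the values are made up. Note that converting to integers raises an error if the column still contains NaN, which is one reason to handle missing data first.

```python
import pandas as pd

# Made-up table whose numeric columns were stored as floats
toy = pd.DataFrame({"year": [1950.0, 1960.0], "pop": [100.0, 150.0]})
print(toy.dtypes)

# astype accepts a dictionary mapping column names to the desired type
toy = toy.astype({"year": int, "pop": int})
print(toy.dtypes)
```

The same pattern should apply to `gapminder_copy`, assuming its ‘year’ and ‘pop’ columns hold only whole numbers.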

## Subsetting

We can subset (or slice) the data by giving the numbers of the rows we want to see between square brackets.

*REMINDER:* Python uses 0-based indexing. This means that the first element in an object is located at position 0. This is different from other tools like R and Matlab that index elements within objects starting at 1.

In [None]:
gapminder_copy[0:15]

In [None]:
#Select the first 15 rows
gapminder_copy[:15]

In [None]:
#Select the last 10 rows
gapminder_copy[-10:]

Subsetting can also be done by selecting a particular column or a particular value in a column; for instance, select the rows that have ‘africa’ in the column ‘continent’. Note the double equal sign: single equal signs are used in Python to assign something to a variable. The double equal sign is a comparison: the variable to the left has to be exactly equal to the string to the right.

In [None]:
#Select for a particular column
gapminder_copy['year']

#this syntax, calling the column as an attribute, gives you the same output
gapminder_copy.year
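Selecting rows by value with the double equal sign can be sketched as follows. The table is invented for the example; only the column names echo the lesson's data.

```python
import pandas as pd

# Made-up table for illustration
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "continent": ["africa", "asia", "africa"],
    "year": [1950, 1950, 1960],
})

# The comparison produces one True/False value per row
is_africa = toy["continent"] == "africa"
print(is_africa)

# Passing that boolean Series in square brackets keeps only the True rows
africa_rows = toy[is_africa]
print(africa_rows)

# Conditions can be combined with & (and); each one needs its own parentheses
early_africa = toy[(toy["continent"] == "africa") & (toy["year"] < 1960)]
print(early_africa)
```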

## Summarize and plot

Summaries (but can’t *say* statistics…)
* Sort data
* Can make a note about using NumPy functions and the difference between a `DataFrame` and an `array`

Good plots for the data/variable type

Plots
* of subsets
* single variables
* pairs of variables
* Matplotlib syntax (with Seaborn for prettier defaults; the package is also good for more analysis later)

Exploring is often iterative - summarize, plot, summarize, plot, etc. - sometimes it branches…


# Summarizing data

Remember that the `info()` method gives a few useful pieces of information, including the shape of the `DataFrame`, the variable type of each column, and the amount of memory used. We can see many of our changes (continent and country columns instead of region, higher number of rows, etc.) reflected in the output of the `info()` method.

In [None]:
gapminder_copy.info()

We also saw above that the `describe()` method will take the numeric columns and give a summary of their values. We have to remember that we changed the column names and this time it shouldn't have NaNs.

In [None]:
gapminder_copy[['pop', 'lifeexp', 'gdppercap']].describe()

### More summaries

What if we just want a single value, like the mean of the population? We can call mean on a single column this way:


In [None]:
gapminder_copy['pop'].mean()

What if we want to know the mean population by _continent_? Then we need to use the Pandas `groupby()` method and tell it which column we want to group by.


In [None]:
gapminder_copy[['continent', 'pop']].groupby(by='continent').mean()

What if we wanted a new `DataFrame` that just contained these summaries? This could be a table in a report, for example.

In [None]:
# Mean population per continent
continent_mean_pop = gapminder_copy[['continent', 'pop']].groupby(by='continent').mean()
continent_mean_pop = continent_mean_pop.rename(columns={'pop': 'meanpop'})
# Number of rows per continent
continent_row_ct = gapminder_copy[['continent', 'country']].groupby(by='continent').count()
continent_row_ct = continent_row_ct.rename(columns={'country': 'nrows'})
# Median population per continent
continent_median_pop = gapminder_copy[['continent', 'pop']].groupby(by='continent').median()
continent_median_pop = continent_median_pop.rename(columns={'pop': 'medianpop'})
# Combine the three summaries side by side; they share the continent index
gapminder_summs = pd.concat([continent_row_ct, continent_mean_pop, continent_median_pop], axis=1)
gapminder_summs
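A more compact alternative, offered here as a sketch and not as the lesson's own method, is Pandas' named aggregation, which builds the same kind of summary table in one `groupby()` call. The table below is invented; the output column names match those used in the cell above.

```python
import pandas as pd

# Made-up table for illustration
toy = pd.DataFrame({
    "continent": ["africa", "africa", "asia"],
    "country": ["A", "B", "C"],
    "pop": [100.0, 300.0, 50.0],
})

# Each keyword is a new column name; the tuple is (source column, summary)
summary = toy.groupby("continent").agg(
    nrows=("country", "count"),
    meanpop=("pop", "mean"),
    medianpop=("pop", "median"),
)
print(summary)
```

The step-by-step version is easier to inspect while learning, while the single call avoids the intermediate variables and the final `pd.concat`.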

## Visualization with `matplotlib`

[matplotlib](http://matplotlib.org) is Python's main visualization 
library. It provides a range of tools for constructing plots, and numerous 
high-level plotting libraries (e.g., [Seaborn](http://seaborn.pydata.org)) are 
built with matplotlib in mind. We load these libraries like so:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

*Consider the above commands to be essential practice for plotting (as
essential as **`import`** `pandas` **`as`** `pd` is for data munging).*

Now, let's turn to data visualization. In order to get a feel for the properties
of the data set we are working with, data visualization is key. While we will
focus only on the essentials of how to properly construct plots in univariate
and bivariate settings here, it's worth noting that both matplotlib and Seaborn
support a diversity of plots: [matplotlib 
gallery](http://matplotlib.org/gallery.html), [Seaborn
gallery](http://seaborn.pydata.org/examples/). 


---

### Single variables

* __Histograms__ - provide a quick way of visualizing the distribution of numerical
  data, or the frequencies of observations for categorical variables.

In [None]:
# Histogram of life expectancy across all rows
plt.hist(gapminder_copy['lifeexp'])
plt.xlabel('lifeexp')
plt.ylabel('count')

Hmmm, something doesn't look right.  Let's check the data.

In [None]:
gapminder_copy['lifeexp']

In [None]:
gapminder_copy['lifeexp'].describe()

Ok, it seems that there's a 999999 value that was used to indicate **something**, and it's throwing off the histogram. So let's remove rows that don't have a life expectancy between 0 and 100, and then try again.

In [None]:
gapminder_clean = gapminder_copy[gapminder_copy['lifeexp'].between(0, 100)]
gapminder_clean['lifeexp'].describe()

In [None]:
plt.hist(gapminder_clean['lifeexp'])
plt.xlabel('lifeexp')
plt.ylabel('count')

* __Boxplots__ - provide a way of comparing the summary measures (e.g., max, min,
  quartiles) across variables in a data set. Boxplots can be particularly useful with larger data sets.

---

In [None]:
sns.boxplot(x='year', y='lifeexp', data = gapminder_clean)
plt.xlabel('year')
plt.ylabel('lifeexp')

### Pairs of variables

* __Scatterplots__ - visualize the relationship between two variables.

In [None]:
# Scatterplot of life expectancy against GDP per capita
plt.scatter(gapminder_clean['gdppercap'], gapminder_clean['lifeexp'])
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')

In [None]:
plt.scatter(gapminder_clean['gdppercap'], gapminder_clean['lifeexp'])
plt.xscale('log')
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')

### Saving your plots as image files  
If you'd like to save a plot as an image file, call `plt.savefig('my_figure.png')` in the same cell that draws the plot, where `my_figure.png` is the file name.
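When you want an explicit figure object to save, one option is to create it with `plt.subplots()`. The sketch below uses made-up points and writes to a temporary folder; `out_path` and the plotted values are assumptions for the example, not part of the lesson's data.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# plt.subplots() returns the figure and its axes, so fig is defined explicitly
fig, ax = plt.subplots()
ax.scatter([1, 10, 100], [40, 60, 80])  # made-up points
ax.set_xscale("log")
ax.set_xlabel("gdppercap")
ax.set_ylabel("lifeexp")

# Save into a temporary folder; replace out_path with any path you like
out_path = os.path.join(tempfile.mkdtemp(), "my_figure.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)

print(out_path)
```

The file extension determines the image format, so `my_figure.pdf` would produce a PDF.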

# Putting it all together

On your own or with a partner, using the techniques you've learned in this lesson, try to create a plot of life expectancy in Canada during the 1950s and 1960s. We've provided headers to guide you through the process.

#### Import your data

#### Describe your data set here using *Markdown*
What is the general shape of your `DataFrame`? What are the datatypes? Are there missing values? What questions do you have about your data set and how will you answer those questions?

Answers:

#### Create a subset of the data for just Canada and for years between 1950 and 1969

#### Create a plot of life expectancy by year. Is there anything wrong?

#### Fix the 99999 coding error, rerun the plot and save it. *Hint:* try using `df.replace()` or `df.loc[]` to change the value