# DATA 202 @ Calvin U
## Lab 1: First Steps with Jupyter Notebooks and Pandas
Adapted with permission from materials for Harvard CS109A and Berkeley DS100, and from "Introduction to Data Analysis with Python" by Ofra Amir at the Technion

## Objectives

In this lab, we will:

* Get familiar with Jupyter Notebooks (like this one!)
* Review a bit of Python
* Practice with Pandas

### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** below. (That's a good way to learn your classmates' names.)

**Collaborators**: *list collaborators here*

## Documentation

- **General**: Look under the Help menu for quick links to reference guides
- **Jupyter Notebooks**: skim [this tutorial](http://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb)
- **Python** [documentation](https://docs.python.org/3/)
- **Pandas** Cheatsheets: [1](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) and [2](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet)

# Using Jupyter

### Running Cells and Displaying Output

Run the following cell.

In [2]:
print("Hello World!")

In Jupyter notebooks, the output of every `print` call appears below the cell. In addition, the value of the last expression in a cell is displayed below the cell when it runs, even without `print`.

In [3]:
"Will this line be displayed?"

print("Hello" + ",", "world!")

5 + 3

### Viewing Documentation
There are a few ways you can get to documentation:

* Type a `?` after a function or method (e.g., `min?`) and run the cell
* Use the `help` function (e.g., `help(min)`)
* Press Shift-TAB several times when your cursor is over the function name. Each press (in quick sequence) will open a successively larger help window.
* Open the documentation from the Help menu

The object you're trying to get help about must already be defined in the kernel for this to work.
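
As a rough sketch of that last point, the cell below looks up documentation for a built-in and then for a name that was never defined. The use of `inspect.getdoc` and the name `undefined_function` are illustrative assumptions, not part of the lab; the `?` syntax is omitted because it only works inside Jupyter/IPython.

```python
import inspect

# `min` is a built-in, so it is always defined and its docs can be looked up.
min_doc = inspect.getdoc(min)
print(min_doc.splitlines()[0])

# A name that has not been defined or imported yet cannot be looked up.
try:
    help(undefined_function)  # hypothetical name, deliberately not defined
except NameError as err:
    print("Define or import it first:", err)
```

The same idea applies to library functions: asking for help on something like `pd.read_csv` only works after the `import pandas as pd` cell has been run.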

**Try two of these ways now.**

### Importing Libraries

In DATA 202, we will be using common Python libraries to help us process data. By convention, we import all libraries at the very top of the notebook. There is also a set of standard aliases that shorten the library names. Below are some of the libraries that you may encounter throughout the course, along with their respective aliases. Run this cell now.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

* Pandas is a "Python Data Analysis Library" (hence the name). We'll be using it extensively.
* Numpy is the standard Python numerical computing library. It provides the underlying math operations for Pandas, and we'll very occasionally use it directly.
* Matplotlib is the standard Python plotting library. We'll sometimes use it directly too.
* Seaborn is a layer on top of matplotlib that makes it easy to produce nice-looking plots. It can't do everything, though.

`%matplotlib inline` is a Jupyter "magic command" that configures the notebook so that Matplotlib displays any plots that you draw directly in the notebook (rather than to a separate window or a file), allowing you to view the plots upon executing your code. You don't need to know anything about how it works, but if your plots don't show up, make sure you didn't forget this line :)
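
To see the aliases in use, here is a minimal sketch with made-up numbers (the names `values` and `toy_df` are hypothetical and not part of the lab). It builds a NumPy array with `np` and wraps it in a Pandas DataFrame with `pd`.

```python
import numpy as np
import pandas as pd

# np.arange builds a NumPy array of the integers 0 through 4.
values = np.arange(5)

# pd.DataFrame turns a dict of arrays into a table with named columns.
toy_df = pd.DataFrame({"x": values, "x_squared": values ** 2})

# to_numpy() hands a column back as a NumPy array.
print(type(toy_df["x_squared"].to_numpy()).__name__)
print(toy_df["x_squared"].sum())
```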

### Keyboard Shortcuts

Even if you are familiar with Jupyter, we strongly encourage you to become proficient with keyboard shortcuts (this will save you time in the future). To learn about keyboard shortcuts, go to **Help --> Keyboard Shortcuts** in the menu above. 

Here are a few that we like:
1. `Ctrl` + `Return` : *Evaluate the current cell*
1. `Shift` + `Return`: *Evaluate the current cell and move to the next*
1. `ESC` : *command mode* (may need to press before using any of the commands below)
1. `a` : *create a cell above*
1. `b` : *create a cell below*
1. `dd` : *delete a cell*
1. `z` : *undo the last cell operation*
1. `m` : *convert a cell to markdown*
1. `y` : *convert a cell to code*

### Python

Python is the main programming language we'll use in the course. We expect that you've taken CS 104/106/108, so we will not be covering general Python syntax. If any of the below exercises are challenging (or if you would like to refresh your Python knowledge), please review one or more of the following materials.

- **[Python Tutorial](https://docs.python.org/3.7/tutorial/)**: Introduction to Python from the creators of Python.
- **[Composing Programs Chapter 1](http://composingprograms.com/pages/11-getting-started.html)**: This is more of an introduction to programming with Python.
- **[Advanced Crash Course](http://cs231n.github.io/python-numpy-tutorial/)**: A fast crash course which assumes some programming background.
- **[Practice Exercises](http://www.practicepython.org/)**
- **[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)**
- Some people have found [Chris Albon's notes](https://chrisalbon.com/) helpful.

# Loading and Summarizing Data

**1**: Use your web browser to download `temperature-anomaly.csv` by clicking on the Data tab on [this "Our World in Data" chart](https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions#introduction).

**2**: Load it into a variable called `anomaly_df` using the `pd.read_csv` function.

In [5]:
# Your code here
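
If `pd.read_csv` is new to you, the sketch below shows its general shape on a tiny made-up table rather than the lab's dataset. The CSV text, column names, and the variable `toy_df` are all hypothetical; an in-memory `io.StringIO` object stands in for a file on disk, where you would normally pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; pd.read_csv also accepts a file path.
toy_csv = io.StringIO(
    "city,year,rainfall_mm\n"
    "Grand Rapids,2019,1050\n"
    "Grand Rapids,2020,980\n"
    "Haifa,2019,540\n"
)

toy_df = pd.read_csv(toy_csv)

# The header row becomes the column names; each remaining line becomes a row.
print(toy_df.shape)
print(list(toy_df.columns))
```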


**3**: Use the `head` method to see the first few rows of `anomaly_df`.

In [6]:
anomaly_df.head()

**4**: Use the `info` method to see the structure of `anomaly_df`.

In [7]:
# Your code here


**5**: Use the `describe` method to get a summary of the numeric columns.

In [8]:
# Your code here
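
For reading `describe` output, it may help to try it on a small made-up table first. This is only a sketch under assumed data (`toy_df` and its columns are hypothetical): it shows that the summary covers numeric columns by default and that each summary row matches the corresponding column method.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns.
toy_df = pd.DataFrame({
    "city": ["Grand Rapids", "Grand Rapids", "Haifa"],
    "year": [2019, 2020, 2019],
    "rainfall_mm": [1050, 980, 540],
})

summary = toy_df.describe()

# By default, only the numeric columns are summarized.
print(list(summary.columns))

# The row labels name the statistics that were computed.
print(list(summary.index))

# A single entry in the summary matches the column's own method.
print(summary.loc["max", "rainfall_mm"], toy_df["rainfall_mm"].max())
```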


**6**: Use the outputs above to answer the following questions:

* How many total rows are in the dataset?
* What is the range of years covered? If there were one row per year, how many rows would you expect?
* What is the mean of the "Median (℃)" column?

You can access a single column using square-bracket notation. For example, here is how to get the "Median (℃)" column. (Yes, that "℃" is a special character. You can either write "Median (\u2103)" or just copy and paste the name from one of the outputs above.)

In [10]:
anomaly_df["Median (℃)"]
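
The sketch below illustrates both points on a made-up table (`toy_df` and its values are hypothetical, not the real data): the escaped and pasted spellings name the same column, and square brackets return that column as a Pandas Series.

```python
import pandas as pd

# Hypothetical toy data; the real values come from the downloaded CSV.
toy_df = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Median (℃)": [0.3, 0.4, 0.5],
})

# The escape sequence and the pasted character spell the same string.
same_name = "Median (\u2103)" == "Median (℃)"
print(same_name)

# Square brackets with a column name return that column as a Series.
median_col = toy_df["Median (\u2103)"]
print(type(median_col).__name__)
print(median_col.tolist())
```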

**7**: Call the `mean`, `min`, and `max` methods on the "Median (℃)" column and verify that the results are the same as what was shown in the `describe` output above.

In [12]:
# Your code here


**8**: The `Entity` column is not numeric. Use the `value_counts` method to see what values the `Entity` column takes on. Do the values make sense in light of your answer to the question above about the range of years?

In [14]:
# Your code here
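
As a quick illustration of what `value_counts` reports, here is a sketch on a made-up column (the `fruit` data and `toy_df` are hypothetical). It tallies how many rows hold each distinct value.

```python
import pandas as pd

# Hypothetical toy column of category labels.
toy_df = pd.DataFrame({"fruit": ["apple", "pear", "apple", "plum", "apple", "pear"]})

# value_counts counts the rows for each distinct value,
# sorted from most frequent to least frequent.
counts = toy_df["fruit"].value_counts()
print(counts.to_dict())
```

The counts always add up to the number of non-missing rows in the column, which is a useful sanity check against the total row count.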


**9**: Now let's plot. Run the two plotting commands below, which use the `seaborn` library to quickly make reasonable-looking plots.

In [65]:
sns.lineplot(x="Year", y="Median (℃)", hue="Entity", data=anomaly_df);

In [66]:
sns.lmplot(x="Year", y="Median (℃)", col="Entity", data=anomaly_df, fit_reg=False);

**10**: Compare and contrast the two plots. Are there trends or relationships that are easier to see in the first plot than in the second? Vice versa?

**your answer here**

**11**: Try tweaking the `lmplot` command in the following ways. Look in the documentation if necessary (don't Google it at this point).

**11a**: Each column is a region in the plot above; try making each row a region.

In [68]:
# Your code here


**11b**: Add a line of best fit. (Hint: you won't need to add any code.)

In [69]:
# Your code here


**11c**: Try using a "lowess" curve instead of a line. What's the same? What's different?

In [70]:
# Your code here
