## Session 1: Introduction to Jupyter Notebooks and Python

### 1. An Introduction to the Environment

This first session covers the basics of Python and introduces elements that will help you get familiar with Python as an interactive computational environment for exploring data. The material is presented in an interactive environment that runs within your web browser, called a Jupyter Notebook. It allows text and graphics to be combined with Python code that can be run interactively, with the results appearing inline. We are looking at a Jupyter Notebook now. Jupyter is a relatively recent name, so you may still see it referred to as an IPython notebook. Jupyter is the successor to IPython notebooks, and it now also supports a variety of other languages and tools.

Let's start by getting familiar with the Jupyter Notebook and how it works.

### 1.1 Launching a Jupyter Notebook on DataHub

In this course we will use hosted computing facilities, which make it easy to start learning to code in Python and to leverage university computing infrastructure. In class we will use DataHub, and you can use a CalNet login to connect to it at http://datahub.berkeley.edu.

### 1.2 Using Jupyter Notebooks

Once you launch the Jupyter Notebook and it opens in your browser, you will see a listing of the folder you were in when you launched it. You can either open an existing notebook if you see one in the listing, or create a new one.

The Jupyter Notebook is made up of cells.  Some cells are known as ***markdown cells***, which contain formatted text written in Markdown. (Like this cell!)

You can edit the contents of a cell by double-clicking on it. Try it on this cell. When you are done editing a markdown cell, press Shift-Enter to render it.

You can use Markdown syntax to format your text. Documentation on the markdown syntax is available here: https://www.markdownguide.org/basic-syntax/

### 1.3 Code Cells

In addition to markdown cells, Jupyter has ***code cells***, which contain programming code.  It's in these cells that the magic happens!  Try it: put your cursor in the cell below, and hit "Ctrl-Enter" to run the code cell.


In [2]:
import math
math.sqrt(52)

7.211102550927978

That was cool!!!  Try changing the 52 to a different number, and run the cell again.

### 1.4 Learning the Python Language

Python is a language, and like natural human languages, it takes time to learn.  There is vocabulary, as well as rules (the "syntax") for how that vocabulary is put together.  Both of these just take practice, practice, practice.  We will teach you the rules, but you need to practice on your own.

However, programming languages differ from natural language in one important way:

> The rules are rigid. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes. Errors are okay; even experienced programmers make many errors. When you make an error, you just have to find the source of the problem, fix it, and move on. Fixing it can take a long time.  A long long time.
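
As a small illustration of how rigid the rules are, the sketch below misspells a variable name by a single letter. The names `housing_units` and `housing_unit` and the value are made up for this example, and the `try`/`except` wrapper is only there so the cell prints the error and keeps going instead of stopping.

```python
# A variable holding a number of housing units (value made up for illustration)
housing_units = 1171

try:
    # Typo: "housing_unit" is missing the final "s", so Python cannot find it
    doubled = housing_unit * 2
except NameError as error:
    # Python reports which name it did not recognize
    message = str(error)
    print("Python raised a NameError:", message)

# The fix is to spell the name exactly as it was defined
doubled = housing_units * 2
print(doubled)
```

A human reader would guess what was meant; Python will not, so reading the error message closely is usually the fastest way to find the problem.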

### 2. Moving from Excel to Python

We're going to jump right into the Python language by replicating in a programming language what we did with our ACS data in Excel.  We're starting with a very simple datasheet that includes housing data for census tracts in Alameda County (the county that includes Oakland): the number of occupied housing units, the number that are owned, and the number that are rented.

#### 2.1 Introduction to Tables

A table is a fundamental object type for representing data sets. Each column of a table data structure must contain values of the same kind. For example, a table can have a column of integers or a column of strings. We can create a table from scratch or from a CSV file. Let's try reading our ACS .csv file into a table.

One of the challenging things about using open source software is that it is rarely presented as a "complete" software package.  If you're working in Excel, you don't need to go outside of the program to insert a "square root of the sum of squares" equation, for example.  It's a function in Excel. With open source software, different functions were created by different programmers, and we often have to "call" in that external program to do what we want. These are often called libraries.
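
To make the idea of "calling in" a library concrete, here is a minimal sketch of the "square root of the sum of squares" calculation in Python. It uses the `math` library that ships with Python; the two input values are made up and chosen so the answer is easy to check by hand.

```python
import math  # bring in the math library, which provides sqrt

# Two made-up values; 3 and 4 are chosen so the result is easy to verify
values = [3, 4]

# Sum of squares: 3*3 + 4*4 = 25
sum_of_squares = sum(v ** 2 for v in values)

# Square root of the sum of squares
result = math.sqrt(sum_of_squares)
print(result)  # 5.0
```

Without the `import math` line, Python would not know what `math.sqrt` means, which is the practical difference from a built-in Excel function.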

Two important libraries in Python are:

>**numpy**: a general-purpose array-processing package. What does this mean?  A package that lets us work with arrays: grids of values, much like the rows and columns of a table.

>**pandas**: a software library for data manipulation and analysis.

We import these two libraries with the following commands. The abbreviation after `as` is what we will use to "call" functions that belong to that library.

In [3]:
import numpy as np
import pandas as pd
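
As a rough sketch of how the abbreviations are used, the cell below calls one numpy function through `np` and builds a small table from scratch through `pd`. The tract labels and counts are invented for illustration and are not taken from the ACS file.

```python
import numpy as np
import pandas as pd

# np. prefix: functions that belong to numpy
numbers = np.array([3.0, 4.0])
root_sum_squares = np.sqrt(np.sum(numbers ** 2))
print(root_sum_squares)  # 5.0

# pd. prefix: build a table from scratch; each dictionary key becomes a column
tracts = pd.DataFrame({
    "Tract": ["A", "B", "C"],    # a column of strings
    "Owner_HU": [10, 20, 30],    # a column of integers
    "Renter_HU": [5, 15, 25],    # a column of integers
})
print(tracts)

# Each column holds one kind of value; dtypes lists the kind for each column
print(tracts.dtypes)
```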

In [4]:
pd.read_csv('Lab5_Tenure_Alameda.csv', delimiter = ',')

Unnamed: 0,Census Tract,County,Total Population,Occupied housing units,Occupied housing units: Owner Occupied,Occupied housing units: Renter occupied
0,"Census Tract 9900, Alameda County, California",Alameda,0,0,0,0
1,"Census Tract 4432, Alameda County, California",Alameda,3704,1171,1131,40
2,"Census Tract 4511.02, Alameda County, California",Alameda,4011,1322,1268,54
3,"Census Tract 4420, Alameda County, California",Alameda,3212,997,926,71
4,"Census Tract 4261, Alameda County, California",Alameda,5922,2116,1958,158
5,"Census Tract 4351.03, Alameda County, California",Alameda,6403,2045,1885,160
6,"Census Tract 4403.32, Alameda County, California",Alameda,3310,866,792,74
7,"Census Tract 4304, Alameda County, California",Alameda,2046,747,682,65
8,"Census Tract 4046, Alameda County, California",Alameda,4353,1800,1642,158
9,"Census Tract 4431.03, Alameda County, California",Alameda,4029,1166,1062,104


So cool!!!!  But, it might be helpful to give the dataset a name, and then just look at the first few lines, rather than seeing the full dataset.

In [5]:
tenure_2017 = pd.read_csv('Lab5_Tenure_Alameda.csv', delimiter = ',')

In [6]:
tenure_2017.head(5)

Unnamed: 0,Census Tract,County,Total Population,Occupied housing units,Occupied housing units: Owner Occupied,Occupied housing units: Renter occupied
0,"Census Tract 9900, Alameda County, California",Alameda,0,0,0,0
1,"Census Tract 4432, Alameda County, California",Alameda,3704,1171,1131,40
2,"Census Tract 4511.02, Alameda County, California",Alameda,4011,1322,1268,54
3,"Census Tract 4420, Alameda County, California",Alameda,3212,997,926,71
4,"Census Tract 4261, Alameda County, California",Alameda,5922,2116,1958,158
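
Besides `head`, a few other quick checks are useful right after loading a table. The sketch below assumes nothing about the lab file: it reads made-up CSV text through `io.StringIO`, which lets `pd.read_csv` treat a string like a file, so it runs without any file on disk.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """Tract,Owner_HU,Renter_HU
A,10,5
B,20,15
C,30,25
D,40,35
"""

# io.StringIO makes the string behave like an open file
sample = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns)
print(sample.shape)

# head(n) returns the first n rows, tail(n) the last n rows
first_two = sample.head(2)
last_two = sample.tail(2)
print(first_two)
print(last_two)

# columns lists the column names, handy for copying exact spellings
print(list(sample.columns))
```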


What if I just want to look at a couple of columns?

In [7]:
tenure_2017[['Occupied housing units', 'Occupied housing units: Renter occupied']].head(5)

Unnamed: 0,Occupied housing units,Occupied housing units: Renter occupied
0,0,0
1,1171,40
2,1322,54
3,997,71
4,2116,158
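
The bracket syntax has two forms that are easy to mix up, sketched below on a small made-up table (the values are not from the ACS file). A single pair of brackets with one name returns one column; a double pair, which is really a list of names inside the brackets, returns a smaller table.

```python
import pandas as pd

# Small made-up table for illustration
sample = pd.DataFrame({
    "Tract": ["A", "B", "C"],
    "Owner_HU": [10, 20, 30],
    "Renter_HU": [5, 15, 25],
})

# Single brackets with one name: a single column (a pandas Series)
one_column = sample["Renter_HU"]
print(type(one_column).__name__)

# Double brackets (a list of names): a smaller table (a pandas DataFrame)
two_columns = sample[["Tract", "Renter_HU"]]
print(type(two_columns).__name__)

# Single columns can be combined with arithmetic, row by row
total_hu = sample["Owner_HU"] + sample["Renter_HU"]
print(list(total_hu))  # [15, 35, 55]
```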


Long column names are a pain, because we have to type a lot and it's easy to make a typo. One approach is to clean up your datasheet in Excel before reading it into Python.  But let's practice renaming columns.

In [12]:
tenure_2017.rename(columns={'Total Population': 'Total_Pop', 'Occupied housing units': 'HU', 'Occupied housing units: Owner Occupied': 'Owner_HU', 'Occupied housing units: Renter occupied': 'Renter_HU'}, inplace=True)

In [13]:
tenure_2017.head(5)

Unnamed: 0,Census Tract,County,Total_Pop,HU,Owner_HU,Renter_HU
0,"Census Tract 9900, Alameda County, California",Alameda,0,0,0,0
1,"Census Tract 4432, Alameda County, California",Alameda,3704,1171,1131,40
2,"Census Tract 4511.02, Alameda County, California",Alameda,4011,1322,1268,54
3,"Census Tract 4420, Alameda County, California",Alameda,3212,997,926,71
4,"Census Tract 4261, Alameda County, California",Alameda,5922,2116,1958,158
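
One thing to watch for: `rename` matches column names exactly, so a key that differs by even a single space is skipped without any error message. The sketch below shows this on a tiny made-up table, along with the `errors="raise"` option of `rename`, which turns a silent mismatch into a `KeyError`. The variable names are chosen only for this example.

```python
import pandas as pd

# Tiny made-up table with one long column name
sample = pd.DataFrame({"Occupied housing units: Renter occupied": [40, 54]})

# This key is missing the space after the colon, so it matches no column
wrong_key = {"Occupied housing units:Renter occupied": "Renter_HU"}

# By default, unmatched keys are ignored and the column keeps its old name
unchanged = sample.rename(columns=wrong_key)
print(list(unchanged.columns))

# errors="raise" makes pandas complain about keys that match nothing
try:
    sample.rename(columns=wrong_key, errors="raise")
    mismatch_detected = False
except KeyError:
    mismatch_detected = True
print(mismatch_detected)

# With the exact column name as the key, the rename takes effect
right_key = {"Occupied housing units: Renter occupied": "Renter_HU"}
renamed = sample.rename(columns=right_key)
print(list(renamed.columns))
```

Checking `list(table.columns)` after a rename, or copying names from it before writing the rename, is a simple way to avoid this kind of typo.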
