## start Data 100 FA2020 lab01 content

[link here](https://github.com/DS-100/fa20/blob/master/lab/lab01/lab01.ipynb)

## Part 1: Jupyter Tips


### Viewing Documentation

To output the documentation for a function, use the `help` function.

In [None]:
help(print)

You can also use Jupyter to view function documentation inside your notebook. The function must already be defined in the kernel for this to work.

Below, click your mouse anywhere on `print()` and use `Shift` + `Tab` to view the function's documentation. 

In [None]:
print('Welcome to [class name].')

### Importing Libraries and Magic Commands

In [class name], we will be using common Python libraries to help us process data. By convention, we import all libraries at the very top of the notebook. There is also a set of standard aliases used to shorten the library names. Below are some of the libraries that you may encounter throughout the course, along with their respective aliases.

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

%matplotlib inline

`%matplotlib inline` is a [Jupyter magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that configures the notebook so that Matplotlib displays any plots that you draw directly in the notebook rather than to a file, allowing you to view the plots upon executing your code. (Note: In practice, this is no longer necessary in recent versions of Jupyter, but we show it here anyway.)

Another useful magic command is `%%time`, which times the execution of that cell. You can use this by writing it as the first line of a cell. (Note that `%%` is used for *cell magic commands* that apply to the entire cell, whereas `%` is used for *line magic commands* that only apply to a single line.)

In [None]:
%%time

lst = []
for i in range(100):
    lst.append(i)

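`%%time` reports a single run, which can be noisy for code this fast. As a rough sketch of an alternative (not part of the original lab), the standard-library `timeit` module calls a function many times and returns the total elapsed time; inside a notebook, the `%timeit` line magic offers similar repeated timing. The function names `build_with_loop` and `build_with_comprehension` are made up for this illustration.

```python
import timeit


def build_with_loop(n=100):
    """Build a list by appending inside a for loop (same work as the cell above)."""
    lst = []
    for i in range(n):
        lst.append(i)
    return lst


def build_with_comprehension(n=100):
    """Build the same list with a list comprehension."""
    return [i for i in range(n)]


# number=10_000 means each function is called 10,000 times;
# the return value is the total time in seconds for all calls.
loop_seconds = timeit.timeit(build_with_loop, number=10_000)
comp_seconds = timeit.timeit(build_with_comprehension, number=10_000)

print(f"loop: {loop_seconds:.4f}s, comprehension: {comp_seconds:.4f}s")
```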
### Keyboard Shortcuts

Even if you are familiar with Jupyter, we strongly encourage you to become proficient with keyboard shortcuts (this will save you time in the future). To learn about keyboard shortcuts, go to **Help --> Keyboard Shortcuts** in the menu above. 

Here are a few that we like:
1. `Ctrl` + `Return` : *Evaluate the current cell*
1. `Shift` + `Return`: *Evaluate the current cell and move to the next*
1. `ESC` : *command mode* (may need to press before using any of the commands below)
1. `a` : *create a cell above*
1. `b` : *create a cell below*
1. `dd` : *delete a cell*
1. `z` : *undo the last cell operation*
1. `m` : *convert a cell to markdown*
1. `y` : *convert a cell to code*


### Python

Python is the main programming language we'll use in the course. We expect that you've taken CS 61A, Data 8, or an equivalent class, so we will not be covering general Python syntax. If any of the following exercises are challenging (or if you would like to refresh your Python knowledge), please review one or more of the following materials.

- **[Python Tutorial](https://docs.python.org/3.5/tutorial/)**: Introduction to Python from the creators of Python.
- **[Composing Programs Chapter 1](http://composingprograms.com/pages/11-getting-started.html)**: This is more of an introduction to programming with Python.
- **[Advanced Crash Course](http://cs231n.github.io/python-numpy-tutorial/)**: A fast crash course which assumes some programming background.

### NumPy

NumPy is the numerical computing module introduced in Data 8, which is a prerequisite for this course. Here's a quick recap of NumPy. For more review, read the following materials.

- **[NumPy Quick Start Tutorial](https://numpy.org/doc/stable/user/quickstart.html)**
- **[DS100 NumPy Review](http://ds100.org/fa17/assets/notebooks/numpy/Numpy_Review.html)**
- **[Stanford CS231n NumPy Tutorial](http://cs231n.github.io/python-numpy-tutorial/#numpy)**
- **[The Data 8 Textbook Chapter on NumPy](https://www.inferentialthinking.com/chapters/05/1/Arrays)**

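As a brief recap sketch (the array values are arbitrary examples, not lab data), the cell below shows three NumPy ideas that come up repeatedly: element-wise arithmetic, boolean masking, and reductions along an axis.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Arithmetic applies element-wise, with no explicit loop.
doubled = arr * 2

# A boolean mask keeps only the elements where the condition is True.
evens = arr[arr % 2 == 0]

# reshape turns the six values 0..5 into 2 rows and 3 columns.
grid = np.arange(6).reshape(2, 3)

# axis=0 collapses the rows (one result per column);
# axis=1 collapses the columns (one result per row).
col_sums = grid.sum(axis=0)
row_means = grid.mean(axis=1)

print(doubled, evens, col_sums, row_means)
```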
## end Data 100 FA2020 lab01 content

## Part 2: Working with Census data

In [None]:
can_census_data_ct = pd.read_csv('~/git/urban-data-science-notebooks/labs/lab01/lab01_data.csv')
can_census_data_ct

In [None]:
# view columns
can_census_data_ct.columns.values

In [None]:
can_census_data_ct.columns

In [None]:
# fix missingness and data types
can_census_data_ct = can_census_data_ct.fillna(0)
can_census_data_ct = can_census_data_ct.replace({'NA': 0})
can_census_data_ct = can_census_data_ct.replace({'': 0})
# assign by column label: in recent pandas, assigning through .iloc[:, 4:] can keep the old dtypes
value_cols = can_census_data_ct.columns[4:]
can_census_data_ct[value_cols] = can_census_data_ct[value_cols].apply(pd.to_numeric)
can_census_data_ct["GeoUID"] = can_census_data_ct["GeoUID"].astype(str)

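The cell above replaces each missing-value marker separately. An alternative sketch, assuming the markers are non-numeric strings such as `'NA'` or `''`, is to let `pd.to_numeric(..., errors="coerce")` turn anything it cannot parse into `NaN` and then fill once. The toy frame and its values are invented for illustration; only the column names `GeoUID`, `Population`, and `Dwellings` come from the lab data.

```python
import pandas as pd

# Toy data (invented values) with three kinds of "missing": 'NA', '', and None.
toy = pd.DataFrame({
    "GeoUID": ["5350001.00", "5350002.00", "5350003.00"],
    "Population": ["1200", "NA", "3100"],
    "Dwellings": ["500", "", None],
})

value_cols = ["Population", "Dwellings"]

# errors="coerce" converts unparseable entries to NaN instead of raising.
# Caution: this also hides genuine typos (e.g. "12o0"), so inspect the data first.
toy[value_cols] = toy[value_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

print(toy.dtypes)
print(toy)
```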
In [None]:
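# convert spaces, parens, commas to underscore in column names
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
can_census_data_ct.columns = can_census_data_ct.columns.str.replace(r" |\(|\)|,", "_", regex=True)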

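To see what the renaming pattern does, here is a small sketch on a stand-alone `pd.Index`. The names `Area (sq km)` and `Region Name` are hypothetical, and `v_CA21_4239: Renter` is an assumed raw form of the `v_CA21_4239:_Renter` column used later in this lab. The sketch also shows that with `regex=False` the pattern is searched for as a literal string, so nothing is replaced.

```python
import pandas as pd

# Hypothetical raw column names for illustration.
cols = pd.Index(["Area (sq km)", "v_CA21_4239: Renter", "Region Name"])

# The pattern matches a space, "(", ")", or "," and replaces each with "_".
pattern = r" |\(|\)|,"

cleaned = cols.str.replace(pattern, "_", regex=True)

# With regex=False, pandas looks for the literal text ' |\(|\)|,', which never occurs.
literal = cols.str.replace(pattern, "_", regex=False)

print(list(cleaned))
print(list(literal))
```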
In [None]:
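# handle accidental division by zero for plain Python numbers
# (NumPy and pandas division yields inf/NaN instead of raising, so it is not caught here)
def div_0(n, d):
    try:
        return n / d
    except ZeroDivisionError:
        return 0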

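`div_0` relies on an exception being raised, which happens for plain Python numbers but not for pandas Series: element-wise division by zero quietly produces `inf` or `NaN`. A minimal sketch of one way to get the same "zero on failure" behavior for whole columns, using invented values:

```python
import numpy as np
import pandas as pd

# Plain Python numbers raise ZeroDivisionError, which try/except can catch.
try:
    scalar_result = 1 / 0
except ZeroDivisionError:
    scalar_result = 0

# pandas Series do not raise: x/0 gives inf and 0/0 gives NaN.
num = pd.Series([10.0, 5.0, 0.0])
den = pd.Series([2.0, 0.0, 0.0])
raw = num / den

# Map both inf and NaN to 0 to mimic div_0 element-wise.
safe = raw.replace([np.inf, -np.inf], np.nan).fillna(0)

print(scalar_result)
print(raw.tolist())
print(safe.tolist())
```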
## Exercise 1

Using `sort_values(...)`, identify which census tract in Toronto has the largest population.

In [None]:
# solution: it's tract 5350012.01
can_census_data_ct.sort_values(by = "Population", ascending = False)

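Sorting the whole table works, but when only the top row is needed there are shorter routes. The sketch below compares three approaches on a toy frame (the `GeoUID` labels and populations are invented) and checks that they agree.

```python
import pandas as pd

# Toy stand-in for the census table (invented values).
toy = pd.DataFrame({
    "GeoUID": ["A", "B", "C"],
    "Population": [4200, 9100, 3300],
})

# 1. Sort descending and take the first row, as in the exercise.
by_sort = toy.sort_values(by="Population", ascending=False).iloc[0]["GeoUID"]

# 2. nlargest returns the n rows with the largest values in a column.
by_nlargest = toy.nlargest(1, "Population")["GeoUID"].iloc[0]

# 3. idxmax returns the index label of the maximum, which .loc can look up.
by_idxmax = toy.loc[toy["Population"].idxmax(), "GeoUID"]

print(by_sort, by_nlargest, by_idxmax)
```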
## Exercise 2

Calculate total population, total number of dwellings, and the total number of renter and owner private households in Toronto. 

In [None]:
# total pop
can_census_data_ct["Population"].sum()

In [None]:
# total dwellings
can_census_data_ct["Dwellings"].sum()

In [None]:
# total renter private households
can_census_data_ct["v_CA21_4239:_Renter"].sum()

In [None]:
# total owner private households
can_census_data_ct["v_CA21_4238:_Owner"].sum()

In [None]:
# can also calculate them all at once
can_census_data_ct[["Population", "Dwellings", "v_CA21_4239:_Renter", "v_CA21_4238:_Owner"]].sum()

In [None]:
# compute the proportion of private households that are renters
can_census_data_ct["v_CA21_4239:_Renter"].sum() / can_census_data_ct["v_CA21_4237:_Total_-_Private_households_by_tenure"].sum()

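The proportion above divides one city-wide sum by another. That is not the same as averaging each tract's renter share, because tracts differ in size. A small sketch with invented numbers (the column names `renter` and `total` are placeholders for the longer census column names):

```python
import pandas as pd

# Two hypothetical tracts: a small one with few renters, a large one with many.
toy = pd.DataFrame({
    "renter": [10, 300],
    "total": [100, 400],
})

# City-wide share: total renters over total households (310 / 500).
overall_share = toy["renter"].sum() / toy["total"].sum()

# Average of per-tract shares: (0.10 + 0.75) / 2, which ignores tract size.
mean_of_tract_shares = (toy["renter"] / toy["total"]).mean()

print(overall_share, mean_of_tract_shares)
```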
## Exercise 3

### Common functions and calculations

In [None]:
# read in the lemonade data
lemonade = pd.read_csv("lemonade_sales.csv")
lemonade

In [None]:
# convert Day to a datetime column
# drop the leftover index column "Unnamed: 0"
# partition the data into two weeks (week 2 starts on 2019-01-14)
lemonade["day_datetime"] = pd.to_datetime(lemonade["Day"])
lemonade.drop(columns = "Unnamed: 0", inplace = True)
lemonade['week_num'] = (lemonade['day_datetime'] >= "2019-01-14") + 1
lemonade

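The `week_num` line works because a comparison yields booleans, and adding 1 converts `False`/`True` into 1/2. A short sketch with four invented dates around the 2019-01-14 cutoff:

```python
import pandas as pd

# Invented dates: two before the cutoff, two on or after it.
days = pd.to_datetime(pd.Series(["2019-01-07", "2019-01-13", "2019-01-14", "2019-01-20"]))

# The comparison gives a boolean Series; pandas parses the string as a date.
is_second_week = days >= "2019-01-14"

# False + 1 == 1 and True + 1 == 2, so this labels week 1 and week 2.
week_num = is_second_week + 1

print(week_num.tolist())
```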
In [None]:
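# compute: totals (AM & PM) for every sales and profit column
# (selecting by name avoids summing the date and week_num columns)
lemonade.filter(regex="Sale Count|Profit").sum()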

In [None]:
# compute: totals (AM + PM) of sales
am_pm_sales = lemonade.loc[:, lemonade.columns.str.contains("Sale Count")].sum()
am_pm_sales.sum()

In [None]:
# compute: totals (AM + PM) of profits
am_pm_profit = lemonade.loc[:, lemonade.columns.str.contains("Profit")].sum()
am_pm_profit.sum()

In [None]:
# fraction of total sales made in AM, PM (multiply by 100 for a percentage)
am_pm_sales / am_pm_sales.sum()

In [None]:
# fraction of total profit made in AM, PM (multiply by 100 for a percentage)
am_pm_profit / am_pm_profit.sum()

In [None]:
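# average AM and PM sales and profits for week 1 and week 2
# numeric_only=True skips the text Day column, which pandas 2.0+ would otherwise fail on
lemonade.groupby('week_num').mean(numeric_only=True)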

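As a sketch of how the grouped average behaves, the toy frame below mixes a text column with numeric ones. Passing `numeric_only=True` restricts the mean to numeric columns; without it, pandas 2.0 and later raise a `TypeError` on the text column. The column names `AM Sale Count` and `PM Profit` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy lemonade-style data (invented values; column names are assumed).
toy = pd.DataFrame({
    "Day": ["2019-01-07", "2019-01-08", "2019-01-14", "2019-01-15"],
    "week_num": [1, 1, 2, 2],
    "AM Sale Count": [10, 20, 30, 50],
    "PM Profit": [4.0, 6.0, 9.0, 11.0],
})

# One row per week; the text column "Day" is left out of the result.
weekly_means = toy.groupby("week_num").mean(numeric_only=True)

print(weekly_means)
```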
In [None]:
# compute daily totals for sales and profits (axis=1 sums across the columns of each row)
lemonade['total_sales'] = lemonade.loc[:, lemonade.columns.str.contains("Sale Count")].sum(axis = 1)
lemonade['total_profit'] = lemonade.loc[:, lemonade.columns.str.contains("Profit")].sum(axis = 1)
lemonade

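The only difference between the column totals computed earlier and the daily totals here is the `axis` argument. A minimal sketch on a two-day toy frame (invented values, assumed column names):

```python
import pandas as pd

# Two days of hypothetical sales counts.
toy = pd.DataFrame({
    "AM Sale Count": [3, 5],
    "PM Sale Count": [7, 1],
})

# Default axis=0: add down each column, giving one total per column.
column_totals = toy.sum()

# axis=1: add across each row, giving one total per day.
daily_totals = toy.sum(axis=1)

print(column_totals.tolist(), daily_totals.tolist())
```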
In [None]:
# the daily average sales
lemonade['total_sales'].mean()

In [None]:
# the daily average profit
lemonade['total_profit'].mean()

In [None]:
# the maximum sales in a day (AM + PM)
lemonade['total_sales'].max()

In [None]:
# the maximum profit in a day (AM + PM)
lemonade['total_profit'].max()

In [None]:
# the minimum sales in a day (AM + PM)
lemonade['total_sales'].min()

In [None]:
# the minimum profit in a day (AM + PM)
lemonade['total_profit'].min()

In [None]:
# command to compute these summary statistics and more all at once
lemonade[['total_sales', 'total_profit']].describe()