# Getting Started with Python and JupyterLab
### Jonathan Kropko

Welcome to DS 6001: Practice and Application of Data Science. We are excited about this course and we hope you are too. Python is the most important tool for modern data science, and it includes incredible tools for running models for [machine learning and artificial intelligence](https://www.tensorflow.org/install/), processing [image](https://scikit-image.org/) and [video](https://github.com/scikit-video/scikit-video) files, generating [static](https://seaborn.pydata.org/) and [interactive](https://plotly.com/python/) visualizations, making [frontend dashboards](https://plotly.com/dash/) to show off results and publish them on the web, and many more tasks that involve skills that are in high demand now and will continue to be valued in the future. 

We understand the eagerness to jump right into advanced modeling, predictions, image recognition, and other cutting-edge work. But all of these skills involve the analysis of data, and are not possible unless you know exactly how to get data, load it into Python, and manipulate it so that it is in the form that packages for modeling and visualization expect. 

This course is all about the things you will need to do to get data, clean data, and explore data in Python. We will work on skills for data analysis and data science that are foundational and crucial for any career that involves work with data. We aren't going to spend a lot of time on the flashiest techniques, but we will emphasize the skills you need to make the flashy techniques possible. As an analogy, if we were practicing basketball, this course would be about dribbling, passing, and playing defense, not 360 degree slam dunks. If we were learning salsa dancing, we would be practicing the basic steps, keeping the rhythm, and maintaining the connection with a dance partner, and not the moves that these [salsa masters](https://www.youtube.com/watch?v=Skl9QIkYzuU) can show off. 

This document is called a Jupyter notebook. It combines text with hypertext features, code, and the results of the code in a single document. We will be working with Jupyter notebooks a lot in this course. What follows is an **orientation guide**. Please read this document carefully, in its entirety, prior to the start of the class. There are also **exercises** included below. These exercises will not be graded, but they are designed to guide you through the steps to get all the required free software installed on your own computer and to understand how to get started writing your own Python code.

If you have any questions, don't hesitate to contact the instructor Jon Kropko (jkropko@virginia.edu) over email or over a direct message on Slack.

**Table of Contents:**

* [Introduction: What is Python and Why is it So Popular?](#intro)
  * [The History of Python](#history)
  * [Open-Source Software and Data](#opensource)
  * [Python vs. R vs. Julia](#versus)
* [Downloading and Installing All the Software You Will Need](#download)
  * [Downloading Python 3](#python)
  * [Anaconda Navigator](#anaconda)
  * [Installing and Importing Packages](#packages)
* [Using JupyterLab](#jupyter)
  * [Jupyter Notebooks](#notebooks)
  * [The Markdown Language and Text Cells](#markdown)
* [Getting Started with Python Code](#basics)



## <a name="intro"></a> Introduction: What is Python and Why is it So Popular?
### <a name="history"></a> The History of Python

Python is an open-source computing environment that can be used for many tasks, including data science. Python was [written between 1989 and 1991 by Guido van Rossum](https://en.wikipedia.org/wiki/History_of_Python) and was named in honor of [Monty Python's Flying Circus](https://www.youtube.com/watch?v=eCLp7zodUiI). At the time of writing, the most up-to-date version of Python is version 3.8.3. Code that is written in Python version 2 often cannot be run by interpreters that read Python version 3, so be careful when working with someone else's Python code and pay attention to the version number. 
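
One quick way to see which version of Python is running is to ask the interpreter itself. The sketch below uses the standard `sys` module; the function name `describe_python` is a hypothetical helper chosen for illustration, not something built into Python.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return a short description of a Python version tuple."""
    major, minor, micro = version_info[:3]
    label = f"Python {major}.{minor}.{micro}"
    if major < 3:
        # Python 2 code often needs changes before it runs on Python 3
        return label + " (Python 2: not suitable for this course)"
    return label


# Describe the interpreter that is running this code
print(describe_python())

# A Python 2 version tuple, passed in by hand to show the other branch
print(describe_python((2, 7, 18)))
```

Running this in a notebook or console is a simple check to make before trying someone else's code.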

Python was based on ABC, an earlier all-purpose programming language developed at the Centrum voor Wiskunde en Informatica in the Netherlands. Guido van Rossum aimed to replicate the best features of ABC while eliminating many of its flaws. In [van Rossum's words](https://www.python-course.eu/python3_history_and_philosophy.php):

> I remembered all my experience and some of my frustration with ABC. I decided to try to design a simple scripting language that possessed some of ABC's better properties, but without its problems. So I started typing. I created a simple virtual machine, a simple parser, and a simple runtime. I made my own version of the various ABC parts that I liked. I created a basic syntax, used indentation for statement grouping instead of curly braces or begin-end blocks, and developed a small number of powerful data types: a hash table (or dictionary, as we call it), a list, strings, and numbers.

The motivation behind Python was to create a programming language that is intuitive, understandable, and readable. Guido van Rossum and his collaborators expressed the principles behind the design of Python with a series of maxims, which include:

* Beautiful is better than ugly.
* Explicit is better than implicit.
* Simple is better than complex.
* Complex is better than complicated.
* Flat is better than nested.
* Sparse is better than dense.
* Readability counts.
* Special cases aren't special enough to break the rules.
* Although practicality beats purity.
* Errors should never pass silently.
* Unless explicitly silenced.
* In the face of ambiguity, refuse the temptation to guess.
* There should be one -- and preferably only one -- obvious way to do it.
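
These maxims ship with Python itself: importing the standard `this` module prints the full list, known as the Zen of Python. The sketch below is a minimal illustration that relies on a CPython detail, namely that the module stores the text ROT13-encoded in the attribute `this.s`; the variable names are chosen for illustration.

```python
import codecs
import this  # prints the Zen of Python the first time it is imported

# CPython keeps the text ROT13-encoded in this.s; decode it to get a string
zen = codecs.decode(this.s, "rot13")

# Skip the title line and the blank line that follows it
maxims = [line for line in zen.splitlines()[2:] if line]

print(len(maxims), "maxims; the seventh is:", maxims[6])
```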

Although Python is frequently criticized for being [slow to accomplish some universal tasks](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/), Python is incredibly popular and is the language in which most of the most widely used data science software is written. [Python is so popular](https://www.kdnuggets.com/2017/07/6-reasons-python-suddenly-super-popular.html) because the language is (comparatively) easier to read and understand than other programming languages, because of the proliferation of excellent third-party extensions and the large and active developer community that supports these extensions, because of its functionality to work with big data, and because of its overall versatility. 

---
**Exercise 1**: The [PYPL (PopularitY of Programming Language) Index](http://pypl.github.io/PYPL.html) ranks all programming languages by popularity, as measured by analyzing how often language tutorials are searched on Google. Where does Python rank among programming languages as of June 2020? Where do the other widely used data science programming languages, R and Julia, rank?

---

### <a name="opensource"></a> Open Source Software and Data
[Open source](https://en.wikipedia.org/wiki/Open_source) refers to software, code, data, and other products for which the internal design is accessible to anyone who wants to see it. Open source products are generally free to use and to distribute. For operating systems, Microsoft Windows and Mac OS are closed and proprietary, but Linux is free and open source. Open source work has a huge influence on the tech industry and on other industries. Many proprietary products such as smart phone operating systems are based on Linux, and likewise, a great deal of profitable analysis and proprietary software is created using Python, which is open source.

Open source projects use [licenses](https://en.wikipedia.org/wiki/Open-source_license) that describe the rights that users have with regard to applying, adapting, and distributing open source products. Python, for example, has a [license](https://docs.python.org/3/license.html) through the Python Software Foundation (PSF) that gives users the following rights:

> PSF hereby grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce, analyze, test, perform and/or display publicly, prepare derivative works, distribute, and otherwise use Python 3.8.3 alone or in any derivative version

In addition, most extensions (called packages or libraries) to Python are also open source, and include licenses that grant similar rights to users.

The three most commonly used platforms for data science are all open source: Python, [R](https://cran.r-project.org/), and [Julia](https://julialang.org/). Other platforms for data analysis include proprietary competitors such as Microsoft Excel, Stata, SAS, and SPSS. While the proprietary options can be used to manage data and run statistical models, the open source options are considered superior for developing new approaches to working with data and for using cutting edge methods. The development of these open source environments is directly connected to the development of data science as a field of study. New methods can only be used by the community of scholars and practitioners if there is software that implements the method. And while proprietary software packages like Stata and SAS do release updates with new functionality, they do so at a much less frequent rate than open source alternatives, and they may charge for these extensions.

---
**Exercise 2**: Visit the Python package repository at https://pypi.org/. Click on browse projects, and Python 3 compatible projects. Then order the listings by date last updated. How many pages do you have to click through before you have seen all of the packages that were released or updated just today?

---

### <a name="versus"></a> Python vs. R vs. Julia
While Python is the most popular programming language of the three options overall, R and Julia are also growing and are used by many people in the field of data science. The choice between [Python vs. R](https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis) evokes a great deal of debate on the internet. Both are excellent and highly marketable tools. R tends to be used more by people who work more closely with statistics, and Python tends to be used more by people whose work intersects with software development. That said, in recent years both R and Python have greatly expanded their functionality so that both can do statistics, software development, and many other tasks. Individual employers might have an in-house preference for R or Python, so the choice of which one you use might be dictated by the people you work with. Some people claim that one of these languages is "easier" than the other, but I think they are roughly equivalent in terms of their ease of use. Both are **object-oriented** programming languages, and although the two languages differ in how they deal with programming objects, they can both handle the same kinds of operations.

Julia is a relatively new option and is gaining popularity because of the evidence that it accomplishes certain tasks [more quickly than Python and R](https://qz.com/1360318/is-julia-a-good-alternative-to-r-and-python-for-programmers/). The disadvantage of Julia, at the moment, is that its community of developers is quite a bit smaller than those involved with Python and R, and because there are fewer developers there are fewer packages, so Julia lags a bit in functionality. That may change in the future if the Julia community grows.

Incidentally, the Jupyter project, which creates the Jupyter notebook interface that we will use to write and share Python code, [named their project to be a play on the words Julia, Python, and R](https://blog.jupyter.org/i-python-you-r-we-julia-baf064ca1fb6).

## <a name="download"></a> Downloading and Installing All the Software You Will Need
In this course we will only be using open source software. You are welcome to add additional software to your personal toolkit, but given the proliferation of excellent software that is free of charge, it is strongly recommended that you do not pay for software. Many businesses that employ data analysts and scientists use open source software exclusively, and the skills to effectively use open source software environments are generally more marketable than proprietary software skills.

There are two major pieces of software that we will download and install. The first is the latest version of Python, which enables your computer to understand and execute Python code. The second is Anaconda Navigator, which includes several **integrated development environments (IDEs)**. An IDE provides a user interface that we can use to write and run Python code. In addition we will frequently be downloading and importing packages that extend the functionality of Python.

### <a name="python"></a> Downloading Python 3
To download Python 3, visit https://www.python.org/, click on "Downloads", and click on the grey button that reads `Python 3.8.3`. The website will then initiate a download of the latest version of Python for the operating system you are using.

<img src="https://github.com/jkropko/DS-6001/raw/master/localimages/python.png" width="600">

Click through the installation wizard that pops up. Agree to the open source license and install Python in the default location.

---
**Exercise 3**: Install the latest version of Python 3 on your computer.

---

### <a name="anaconda"></a> Anaconda Navigator
Anaconda is a company that develops open source software. You would be correct to wonder: how does a company that writes free software make money? Anaconda has three [pricing levels](https://www.anaconda.com/pricing). The "Individual Edition" is free and includes all of the open source software we will be using, bundled together in a well-organized interface. The "Team Edition" adds cloud computing, technical support, and some additional package management software for $10,000 a year. The "Enterprise Edition" adds on top of that some resources to assist in model deployment for the cost of "I'm afraid to ask."

We will be using Anaconda Navigator. To download Navigator, visit https://www.anaconda.com/products/individual, scroll to the bottom of the page to the section labeled "Anaconda Installers", and click on the link for the **64-bit graphical installer** for Python 3.7 and your operating system.

<img src="https://github.com/jkropko/DS-6001/raw/master/localimages/anacondadownload.png" width="600">

After installing Anaconda Navigator, you will be able to open the program. On Windows computers you will see a link to Anaconda Navigator in the start menu. On Mac computers, you will see an icon for Anaconda Navigator in Launchpad and in your Applications folder. See [Using JupyterLab](#jupyter) below for instructions on how we will be using Anaconda Navigator in this course.

---
**Exercise 4**: Download Anaconda Navigator on your computer and open it. Make sure that you can see JupyterLab included in the list of icons that appears on the screen when Navigator opens.

---

### <a name="packages"></a> Installing and Importing Packages
One of the best things about Python is the existence of a massive repository of free and open source software that extends the functionality of Python. There are Python packages for pretty much [anything you can imagine doing](https://xkcd.com/353/) in Python:

<img src="https://imgs.xkcd.com/comics/python.png" width="500">

There are a few key pieces of terminology to remember when working with extensions to Python's base code. A **package** consists of a collection of scripts that implement various new functions in Python. Each script contains one function, or just a few related functions, and is called a **module**. In other words, a package consists of many modules. In Python, these extensions need to be downloaded and **installed** just once (and once again anytime you want to update the extensions to their latest versions), but **imported** into every script or notebook you write in which you want to use the extension. It is possible to import an entire package or just a specific module from a package. Python packages are also sometimes called Python **libraries**, and sometimes people use the words module and package interchangeably.
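
The difference between a package and a module can be seen with `json`, which comes with Python and so needs no installation. This is a small sketch rather than a rule to memorize: it checks for the `__path__` attribute, which Python sets on packages (folders of modules) but not on single-file modules.

```python
import json                 # import an entire package
from json import decoder    # import a single module from that package

# A package is a folder of modules, so it carries a __path__ attribute;
# a plain module is a single file and does not.
print(hasattr(json, "__path__"))
print(hasattr(decoder, "__path__"))

# Functions and classes are reached through the name that was imported
print(json.loads("[1, 2, 3]"))
print(decoder.JSONDecoder().decode('{"course": "DS 6001"}'))
```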

To install a package, you need three things:

1. Knowledge of the exact name of the package you need

2. A place to type in and execute a one-line command

3. The command to install a package

The largest repository for Python packages is https://pypi.org/ which allows us to [browse the packages](https://pypi.org/search/) by various characteristics, including by topic. There are other places where we can get Python packages, most commonly on individual developers' [GitHub](https://github.com/) pages, which should list installation instructions on the front page of the project's repository.

There are a couple of ways to enter a single-line command. One option is to follow the instructions described below for [opening a console window in JupyterLab](#notebooks). The console is designed to read and run single lines of code that you enter into the text bar at the bottom of the screen, and is a good place to go to enter the command to install new packages. A second option is to use the **command line interface** (CLI) on your computer. To use the CLI, within JupyterLab click on "File", then "New", then "Terminal". Alternatively, on a Mac, you can click the magnifying glass in the upper-right corner of your screen, type "Terminal", and press enter. On Windows there are [many different ways](https://www.howtogeek.com/235101/10-ways-to-open-the-command-prompt-in-windows-10/) to get access to the CLI, so choose whichever method works well for you. Once you have access to a window that lets you use the CLI, you can enter single-line commands, and the CLI should recognize the `pip` and `conda` commands once Python and Anaconda have been installed. 

There are two commands that download and install a package. One command is
```
pip install packagename
```
and the other is 
```
conda install packagename
```
where `packagename` is the exact, case-sensitive name of the package we want to install. For most intents and purposes, these two commands do the same job. In general, there is a reason to prefer `conda` to `pip` because `conda` has the ability to [download packages written both in Python and other programming languages](https://stackoverflow.com/questions/20994716/what-is-the-difference-between-pip-and-conda#20994790), in case a Python package has a dependency that is written in JavaScript, C, or a similar language, while `pip` can only install Python packages. You **should not** download packages twice: choose one version of this command and execute it once.

To update a package that is already installed, use either
```
pip install --upgrade packagename
```
or
```
conda update packagename
```
to obtain the latest version of the package.
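
After installing or updating, it can be useful to confirm from inside Python which version of a package is present. The sketch below uses `importlib.metadata` from the standard library (available in Python 3.8 and later); `installed_version` is a hypothetical helper name, and the last package name in the list is deliberately made up so that the "not installed" branch is shown.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ["numpy", "pandas", "surely-not-a-real-package-6001"]:
    version = installed_version(name)
    print(name, "->", version if version else "not installed")
```

Note that this lookup uses the name a package was installed under, which is not always the same as the name used to import it.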

---
**Exercise 5**: Use either `pip` or `conda` (but not both) to download and install the following packages using the command line interface:
* `numpy`
* `bson`
* `bs4`
* `requests`
* `pandas`
* `matplotlib`
* `seaborn`
* `plotly`

---

After downloading and installing a package, the package must be imported into your notebook if you want to use it in your code. To import a package, type
```
import packagename
```
near the top of your notebook. Alternatively a package can be imported with an alias, so that we can use the alias instead of the full package name for all the subsequent code that uses that package:
```
import packagename as alias
```
For example, `numpy` is an important package that gives Python the ability to perform operations on vectors and arrays, and `pandas` is the best package in Python for manipulating and cleaning dataframes. The traditional aliases for these two packages are `np` and `pd` respectively. I can import `numpy` and `pandas` with the following code:


```
import numpy as np
import pandas as pd
```

To use a function that comes from an external package, after installing and importing the package, we need to type the name of the package, then a period, then the name of the function. If we've imported a package with an alias, we can use the alias here instead of the full package name. For example, the `numpy` package includes a function `log()` that takes the natural logarithm of a number. I can use this function to take the natural log of 5 as follows:






```
np.log(5)
```

```
1.6094379124341003
```

If I hadn't supplied an alias for the `numpy` package, I would have had to type `numpy.log(5)` instead.

The one way around the requirement of typing the package name in front of functions from that package is to import the function name directly into Python by typing
```
from packagename import function1, function2
```
where `function1` and `function2` are the functions we want to use (if there's just one function we want to import, the code is `from packagename import function`), and `packagename` is the exact name of the package that includes the function. If we take this step, then we no longer have to specify the package name. For example, I can import the `log()` function directly from `numpy`:

```
from numpy import log
log(5)
```

```
1.6094379124341003
```

Importing functions directly into Python can cause problems, however, if there are conflicts with functions in base Python or in another package that share the same name. It is recommended to only import functions directly if these functions are especially important and if you will use those functions frequently. There will be an exercise below that guides you through the process of importing packages and functions in a Jupyter notebook.
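
To make the name-conflict problem concrete, here is a small sketch that uses two standard library modules, `math` and `cmath`, which both define a function called `sqrt`. These modules are chosen only for illustration; the same thing can happen with any two packages that share a function name.

```python
import math
import cmath

from math import sqrt    # real-valued square root
from cmath import sqrt   # complex square root: silently replaces the first

print(sqrt(-1))          # uses cmath's version, which returns a complex number

# With the module prefix there is no ambiguity about which function runs
print(cmath.sqrt(-1))
try:
    math.sqrt(-1)
except ValueError as error:
    print("math.sqrt(-1) failed:", error)
```

No warning is given when the second import replaces the first, which is why keeping the package prefix is usually the safer habit.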

## <a name="jupyter"></a> Using JupyterLab
Now that you have downloaded and installed Anaconda Navigator, open this program. You will see a dashboard with buttons to launch several programs:

<img src="https://github.com/jkropko/DS-6001/raw/master/localimages/anaconda.png" width="600">

In this course, we will be using JupyterLab. Click launch, and it will open a window in whatever web browser is the default browser on your computer. If you have previously used JupyterLab and left any files open, the files will open automatically the next time you run JupyterLab. If this is the first time you are using JupyterLab, or if you closed all your files previously, you will see the Launcher window:

<img src="https://github.com/jkropko/DS-6001/raw/master/localimages/launcher.png" width="400">

The two options here that are most important are **notebook** and **console**, both of which should have a Python 3 button. Python 3 is the underlying code base that the notebook or console will use to read and evaluate the code you write.

---
**Exercise 6**: Open a new Jupyter notebook by clicking on the Python 3 button.

---

After opening a new notebook, you will see a screen that looks like this:

<img src="https://github.com/jkropko/DS-6001/raw/master/localimages/newnotebook.png" width="800">

This is a blank notebook file ready to be filled up with code, results, text, and images.

The first thing to do is to save this notebook file on your computer. Saving a file for the first time is the single most annoying thing about JupyterLab. To save a file, first, outside of JupyterLab, identify or create a folder on your computer where you want to save this file. Then copy the file address for this folder. Then, inside JupyterLab, click "File" and "Save Notebook As". Most modern pieces of software will give you a window that allows you to click your way to the folder you need, but not JupyterLab. Instead you will see a box that says "Save File As.." and provides a textbox. Paste the address to the folder where you want to save the file into the textbox. Then type a forward slash /, then type the name you want to give your notebook. Make sure the file name ends with ".ipynb", which is the file extension for a Jupyter notebook.
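
The text you type into the box is just a folder address, a forward slash, and a file name ending in ".ipynb". As a rough illustration of how those pieces fit together, the sketch below assembles such a path with the standard `pathlib` module; `notebook_path` is a hypothetical helper and the folder name is made up.

```python
from pathlib import Path


def notebook_path(folder, name):
    """Join a folder and a notebook name, adding .ipynb if it is missing."""
    if not name.endswith(".ipynb"):
        name = name + ".ipynb"
    return Path(folder) / name


# Hypothetical folder; replace it with the address you copied
path = notebook_path("Documents/DS6001", "mynotebook")
print(path.as_posix())
```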

---
**Exercise 7**: Save your new notebook as "mynotebook.ipynb" in a folder on your computer.

---

The good news is that you only have to go through this annoying process once. Now, whenever you want to save a notebook file that you are working on, you only have to press the save button (with the standard disk picture) in the upper-left corner, just underneath the tab for the notebook.

### <a name="notebooks"></a>Jupyter Notebooks
A notebook is composed of a series of **cells**. A cell can be of three types: code, markdown, or raw:

* **Code** cells contain Python code. When you execute the code in the cell (more on that below), the results of the code (if there are any to show) will be displayed immediately underneath this cell.

* **Markdown** cells contain text, with optional formatting to make text bold or italicized, to include section titles, or to display images. To include these elements of stylization, we will use a very lightweight markup language called [markdown](https://www.markdownguide.org/).

* We won't be using **raw** cells. Raw is for displaying code but not evaluating the code. There are better ways, however, to show code that we don't want to run.
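
Behind the scenes, a notebook file is a JSON document in which every cell records its type. The sketch below builds a tiny notebook by hand using only the standard library. It assumes version 4 of the notebook file format, and the cell contents are made up for illustration; in practice JupyterLab writes this file for you.

```python
import json

# A minimal notebook with one cell of each type (format version 4 assumed)
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["I will now add one plus one"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["1 + 1"]},
        {"cell_type": "raw", "metadata": {},
         "source": ["shown, never run"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Convert to the text that would be stored in an .ipynb file, then read it back
text = json.dumps(notebook, indent=1)
for cell in json.loads(text)["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"]))
```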

---
**Exercise 7**: Click inside the first cell and find the button at the top that reads "Code" and has a downward pointing arrow. Click on this box and change the type for this cell from Code to Markdown. Then change it from Markdown to Raw. Then change it back to Code.

---
You won't notice any changes other than the disappearance and reappearance of the bracket symbols next to the cell. These brackets only appear for code cells.

The purpose of having cells for code and different cells for text is to make our work more transparent and easier to check for errors. Traditionally, if researchers shared their code at all, it was in a script that contained nothing but code. Raw code is hard for humans to read, even for people who are very good at coding. Instead of a huge dump of raw code, a Jupyter notebook does what is called **weaving**: combining code, results, and explanatory text together all in one readable document. The code is broken into small chunks of just a few lines, and the results of the code are displayed immediately following the code cell. It is best practice not to write more than a few lines of code in any one code cell. Seeing the results immediately after the code helps us understand what the code is doing. Jupyter notebooks are part of a scientific movement towards Reproducible Research -- the idea is that notebooks make scientific work much more transparent. It is a great idea to build a habit right now of writing text as you go along in a notebook so that you can produce work that fits with this movement. More detailed guidelines for how to [write Jupyter notebooks for scientific research](https://www.researchgate.net/publication/328380478_Ten_Simple_Rules_for_Reproducible_Research_in_Jupyter_Notebooks) are laid out by researchers at UC San Diego.

Text cells allow us to explain what we are trying to do and why. The document you are reading is a Jupyter notebook, and this cell is a markdown cell. I can write as much as I like here to present the material to you. For example, I can tell you that I am about to write code that tells Python to evaluate one plus one. Here is the code:



```
1 + 1
```

```
2
```

Notice that the result of the code, 2, appears immediately following the cell. The result **will not appear automatically, however, unless the code is executed**. To execute the code inside a code cell, click inside the cell, then press SHIFT + ENTER (or RETURN on a Mac). This is a very simple example, but this combination of text, code, and results becomes more and more useful as the complexity of our code-based work grows. Text cells must be executed too: executing a text cell displays the text as neatly-formatted HTML text.

To create a new cell, push the button with the + sign underneath the notebook's tab, between the disk and the scissors.

---
**Exercise 8:** Set the first cell to be a markdown cell and type "I will now take the square root of 9". Then create a code cell underneath it and type
```
import math
math.sqrt(9)
```
Execute both cells, and make sure the result of the code cell displays.

---

The order of the cells in your document matters because the cells will be run in order. For example, it will be important to write a cell that loads data prior to writing a cell that runs a model on that data. To manage the order of the cells there are a few tools. Cells can be copied or cut and pasted to other parts of the document. To delete a cell entirely, right click inside the cell and click "Delete Cells".

---
**Exercise 9**: Right click inside the first cell in the document that currently reads "I will now take the square root of 9" and select "Copy Cells". Then right-click inside the code cell that appears second, and select "Paste Cells Below", and change the text of the new cell to "I just took the square root of 9 and it is 3." Then delete this last cell.

---

To split a cell into two cells, find the place where you would like to split the cell and push enter/return twice so that there is a blank line between the two parts of the cell. Then place the cursor on this blank line. Then click "Edit" and "Split Cell". Then two cells will exist where one previously did, one containing the code/text above this blank line, and one containing the code/text below the line.

The **kernel** refers to the programming language that the notebook evaluates when we execute a code cell, as well as all of the items that exist in the background memory after having run code inside the notebook. When we start a new notebook, we select a programming language to include in the kernel. In this course, we will always choose Python 3, but it is also possible to include a different programming language like Python 2, R, or Julia. Then as we run code cells, we create objects that will persist in the kernel. For example, the following code cell




```
speedoflight = 299792458
```

records the speed of light (299,792,458 meters per second) into the kernel. Now the variable `speedoflight` can be used in other cells (we discuss Python variables in more detail below). For example, if I want to convert the speed to miles per day, I can multiply by the number of seconds in a day (60 × 60 × 24) and divide by the number of meters in a mile (1609.344):


```
speedoflight * (60) * (60) * (24) * (1 / 1609.344)
```

```
16094799105.225481
```

We can run the code cells in any order we like, but if we execute this second cell before the first we will get an error because Python does not yet have a variable in its memory called `speedoflight`. That's why it is important to write the cells in the order we want the cells to be run. To reset all of the cells and to run them all again in sequential order, click on "Kernel" and "Restart Kernel and Run All Cells". The brackets next to each code cell will display numbers that tell us the order in which the cells were executed.
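
The effect of running cells out of order can be imitated in ordinary Python. In the sketch below, a plain dictionary stands in for the kernel's memory and each "cell" is a string of code. Both `run_cell` and the dictionary are illustrative stand-ins, not how Jupyter is actually implemented.

```python
def run_cell(source, kernel):
    """Run one cell's code against the shared kernel memory."""
    try:
        exec(source, kernel)
        return "ok"
    except NameError as error:
        return f"NameError: {error}"


define_cell = "speedoflight = 299792458"
convert_cell = "converted = speedoflight * 60 * 60 * 24 * (1 / 1609.344)"

kernel = {}  # a plain dictionary standing in for the kernel's memory
print(run_cell(convert_cell, kernel))  # runs too early: the name is unknown
print(run_cell(define_cell, kernel))   # stores speedoflight in the kernel
print(run_cell(convert_cell, kernel))  # succeeds now that the name exists

kernel = {}  # "restarting the kernel" wipes every stored variable
print(run_cell(convert_cell, kernel))  # fails again until the first cell is rerun
```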





---
**Exercise 10**: Create a code cell that creates a Python variable named `virginia_gdp` and set it equal to the GDP of the state of Virginia in 2019: $508,662,000,000. Then create a second code cell that reports the per capita GDP of Virginia by dividing the `virginia_gdp` variable by the population of Virginia: 8,001,024. Finally, restart the kernel and run both cells.

---

JupyterLab can have many notebooks open at once. If we open a second notebook, it will appear as a tab next to the notebook we already have open. We can click to switch between the tabs and we can drag the tabs to any order we would like for them to appear. Another very useful feature of JupyterLab is the ability to display two notebooks side by side, or one on top of the other. To display two notebooks side by side, click a tab other than the one that is currently displayed, and drag it to the side of the screen. When you see a blue rectangle appear, release the mouse, and the second notebook will appear to the side of the one we had open. To display a notebook on top of another notebook, drag the tab for the second notebook to the bottom of the screen. 

Notebooks are just one kind of tab that JupyterLab works with. The other important tab is the **console**. A console allows us to type Python code one line at a time into the Python kernel. In general, we won't be doing our main work in the console, but the console is useful for displaying help documentation and playing around with code to better understand it. To create a console window, we can click "File", "New", and "Console", which opens up a console with an empty kernel. Or better yet, we can right click anywhere on an open notebook and select "New Console for Notebook", which loads a console with the same kernel as the notebook. To issue code to the console, type in the text box at the bottom of the console window, and press SHIFT + ENTER/RETURN to execute that line of code. The console is another good option, in addition to the command line interface, for issuing `pip install` or `conda install` commands to download and install new packages.


In [1]:
virginia_gdp = 508662e6

In [2]:
virginia_pop = 8001024
virginia_gdp / virginia_pop

63574.61244960645



---
**Exercise 11**: Your notebook has a cell that reads 
```
import math
math.sqrt(9)
```
Make sure that this cell has been executed (you can tell that a code cell has been run if there is a number inside the square brackets next to the cell). Then right click and call up a console window that shares the same kernel as the notebook. Then move that console window to appear side-by-side with the notebook. Finally, in the console window type
```
?math.sqrt
```
to call up the help documentation for the `math.sqrt()` function in Python.

---
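
The `?` prefix is a shortcut provided by IPython, the kernel behind Jupyter notebooks and consoles, so it will not work in a plain Python script. As a minimal sketch, the same documentation can be reached with tools that are built into Python itself:

```python
import math

# help() is built into Python and prints the documentation for a function
help(math.sqrt)

# The raw documentation text is also stored on the function itself
doc_text = math.sqrt.__doc__
print(doc_text)
```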

In [Installing and Importing Packages](#packages) above, we discussed the process of importing packages and functions into Python, and you downloaded and installed eight common and important packages. Refer back to that section to complete the following exercise:

---
**Exercise 12**: Write a code cell in which you import `numpy` with the alias `np`, `pandas` with the alias `pd`, `matplotlib` with the alias `plt`, and `seaborn` with the alias `sns`. Also import the `requests` and `plotly` packages without aliases. Do not import the `bson` and `bs4` packages in their entirety, but import the functions `dumps` and `loads` from `bson` and the `BeautifulSoup` function from `bs4`.

---

In [4]:
import math
math.sqrt(9)

3.0

In [5]:
?math.sqrt

Signature: math.sqrt(x, /)
Docstring: Return the square root of x.
Type:      builtin_function_or_method


### <a name="markdown"></a>The Markdown Language and Text Cells
Markdown is a lightweight markup language that is rendered as Hypertext Markup Language (HTML), with the stylistic tags stripped down to make the language easier to use. To insert text into a text cell, just type the text you want to appear and press SHIFT + ENTER (or RETURN) to render the cell. To italicize text, place ONE star `*` before and after the text. For bold text, place TWO stars `**` before and after the text. For struck through text, place TWO tildes `~~` before and after the text. For example, the following markdown code
```   
*this will be italicized*
**this will be bold**
~~this will be struck out~~
```
yields the following results:

*this will be italicized*

**this will be bold**

~~this will be struck out~~





To start a new paragraph, push ENTER (or RETURN) **twice**, so that there is a blank line separating the paragraphs.

For a hyperlink, either type the address itself (it will automatically become a link), or use syntax like this to place the link on top of other text,
```
The movie Space Jam from 1996 still has its [original website online](https://www.spacejam.com/1996/)
 ```
 which yields: The movie Space Jam from 1996 still has its [original website online](https://www.spacejam.com/1996/)




For block quotes, push ENTER (or RETURN) at least once, then start the quote with > and a space. The quote will continue until you start a new paragraph. For example, the markdown code
```
Here’s a profound quote:
> I’d rather have this bottle in front of me than a frontal lobotomy
```
evaluates to:

Here’s a profound quote:
> I’d rather have this bottle in front of me than a frontal lobotomy





One of the most important ways to make a document readable is to use sectioning to organize the document. Section titles are denoted with hashtags (pound signs). The more hashtags, the smaller the text in the section header.

* One hashtag # followed by some text denotes a document title.
* Two hashtags ## followed by some text denotes a section title.
* Three hashtags ### followed by some text denotes a subsection title.
* Four hashtags #### followed by some text denotes a sub-subsection, and so on.

There are two ways to include an image (that is not generated by Python code) in the notebook. With markdown code you can write
```
![](image.jpg)
```
where `image.jpg` is the name of the image file. If the image exists in the same folder as your notebook, you can refer to the image with its file name and extension, as in the code above. If the image exists on the internet, you can refer to the image with the full URL. The second way is more complicated, as it uses HTML code, but gives you more control over the size of the image:
```
<img src="image.jpg" width="600">
```
Again, change `image.jpg` to the name of the local image file or to the URL of an image on the internet, and change width to whatever number gives you the size you want.





---
**Exercise 13**: Create a new markdown cell, and write markdown code that exactly replicates the following output (the image is available [here](https://news.virginia.edu/sites/default/files/article_image/Rotunda_Copper_Dome_UTDA[1]_0.jpg)):

# Title: The University of Virginia
## Section: Introduction
The **University of Virginia** (U.Va. or UVA) is a public research university in [Charlottesville, Virginia](https://en.wikipedia.org/wiki/Charlottesville,_Virginia). It was founded in 1819 by United States *Declaration of Independence* author Thomas Jefferson. 

![](https://news.virginia.edu/sites/default/files/article_image/Rotunda_Copper_Dome_UTDA[1]_0.jpg)

It is the flagship university of Virginia and home to Jefferson's Academical Village, a UNESCO World Heritage Site. UVA is known for its historic foundations, student-run honor code and ~~secret societies~~ championship basketball.

### Subsection: Founders
The original governing Board of Visitors included Jefferson, James Madison, and James Monroe who once said
> The best form of government is that which is most likely to prevent the greatest sum of evil.

Monroe was the sitting President of the United States at the time of its foundation and earlier Presidents Jefferson and Madison were UVA's first two rectors. 

### Subsection: History
Jefferson conceived and designed the original courses of study and original architecture. UVA was the first university of the American South elected to the research-driven Association of American Universities in 1904. More than a century later, the journal Science credited UVA faculty with two of the top ten global scientific breakthroughs of 2015. The University of Virginia, along with the University at Buffalo, are the only two colleges founded by United States Presidents. 





---

To include a bulleted list (without numbering) in the notebook, use stars * to denote items in the list, with each item on a new line, and denote subitems by pressing tab once for each level the item is nested. For example, the code 
```
* item 1
* item 2
  * item 2a
  * item 2b
    * item 2b, part 1
    * item 2b, part 2
* item 3
```
yields:
* item 1
* item 2
  * item 2a
  * item 2b
    * item 2b, part 1
    * item 2b, part 2
* item 3

Numbered lists are also possible, but they have some [quirks and restrictions](https://riptutorial.com/markdown/example/1805/numbered-lists).

To refer to code in a text block without running the code, use a single backquote (a "backtick", which shares the same key as ~ in the upper left corner of the keyboard) to begin and end the text that is code. This technique changes the font of the text and changes the background to grey to make it clear the text is code. For example, the markdown syntax
```
Next I will discuss the `pd.read_csv()` function
```
yields: Next I will discuss the `pd.read_csv()` function.

To show a block of code for illustration purposes (not to be run), write three backticks on a line, press enter, type all the code you want, and then write three more backticks on a new line to end the code block.

To include mathematical equations, place a dollar sign at the beginning and at the end of the expression to display it in line with the text, or use two dollar signs on each side to place the math on its own line. In between the dollar signs, use LaTeX code to express the mathematical symbols. If you aren't familiar with LaTeX, [here](https://en.wikibooks.org/wiki/LaTeX/Mathematics) is a list of LaTeX code for various mathematical symbols. For example, to include the quadratic formula in my notebook, I can type
```
$$ x=\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$
```
which yields:
$$ x=\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$






Finally, markdown includes code to include a table. I generally find the code too cumbersome to remember, but there are some excellent websites that take a table that we enter in manually and output markdown code. Then we can copy this markdown code and paste it directly into a markdown cell. My favorite website for this is https://www.tablesgenerator.com/markdown_tables.

For example, if I type
```
| Day       | High Temp | Low Temp |
|-----------|-----------|----------|
| Monday    | 93        | 79       |
| Tuesday   | 88        | 73       |
| Wednesday | 76        | 67       |
| Thursday  | 89        | 84       |
| Friday    | 101       | 82       |
| Saturday  | 96        | 80       |
| Sunday    | 85        | 72       |
```
the output is

| Day       | High Temp | Low Temp |
|-----------|-----------|----------|
| Monday    | 93        | 79       |
| Tuesday   | 88        | 73       |
| Wednesday | 76        | 67       |
| Thursday  | 89        | 84       |
| Friday    | 101       | 82       |
| Saturday  | 96        | 80       |
| Sunday    | 85        | 72       |





---
**Exercise 14**: Create a new markdown cell, and write markdown code that exactly replicates the following output:

## My favorite math functions in Python

* The natural logarithm $\ln(x)$ is `math.log(x)`
  * The log base 10, $\log_{10}(x)$, is `math.log10(x)`
  * The log base $a$, $\log_{a}(x)$, is `math.log(x, a)`

* Trigonometric functions:

| Function | Mathematical notation | Python code       |
|----------|-----------------------|-------------------|
| Sine     | $\sin(\theta)$        | `math.sin(theta)` |
| Cosine   | $\cos(\theta)$        | `math.cos(theta)` |
| Tangent  | $\tan(\theta)$        | `math.tan(theta)` |

* $\pi$ rounded to ten decimal places:
```
round(math.pi, 10)
```

---

## <a name="basics"></a> Getting Started with Python Code
### <a name="object"></a> Object Oriented Programming and Python Variables
Python uses [object oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming), a particular approach to computer programming that allows users to define "objects" that exist in the background memory of the computing environment (the kernel) and can be used in subsequent code. When I think about object-oriented programming, it helps me to think about a fishtank:

<img src="https://i.ytimg.com/vi/--ztGaF4m2U/maxresdefault.jpg" width="500">

In this analogy, the tank is the background memory on my computer that Python allocates to the things I create with code. The fish are the objects, things that exist in the background memory. There are many kinds of objects, just like there can be many kinds of fish: big objects and small ones, objects with different shapes and behaviors. 

Objects in Python are called **variables**. That will be confusing to any of you that have learned about computing from a statistical point of view (like with R). In statistics, variables are columns in a data table. In machine learning, and generally among Python users, columns in a data table are called "features", and objects in the Python kernel are variables. Please keep this distinction in mind. To create a Python variable, we write the name we want to give the variable, an equal sign, and the value(s) we want to assign to the variable. Variable names can include underscores, but not dashes or spaces.
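
As a quick check of the naming rule, here is a small sketch that uses the string method `.isidentifier()`, which reports whether a piece of text would be accepted by Python as a name. The candidate names are only illustrations.

```python
# Underscores are allowed in names; dashes and spaces are not
valid = "virginia_gdp".isidentifier()
with_dash = "virginia-gdp".isidentifier()
with_space = "virginia gdp".isidentifier()

print(valid, with_dash, with_space)
```

Only the first candidate is reported as a usable name.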

There are many different types of Python variables. The first distinction is whether a variable is atomic (a single datapoint) or non-atomic (consisting of many datapoints). An atomic variable can be an integer, a float (a numeric type with a decimal point), or a string (words and characters instead of numbers). The values of string variables need to be enclosed in either single or double quotes. Here I define a variable of each type:




In [6]:
my_integer = 5
my_float = 5.0
my_string = 'five'

To see the type of a variable, use the `type()` function:


In [7]:
type(my_integer)

int

In [8]:
type(my_float)

float

In [9]:
type(my_string)

str

Once a variable has been defined, we can use it in other commands, for example, to calculate $5^2$:

In [10]:
my_integer ** 2

25

---
**Exercise 15**: Create a variable that contains the string: "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness." Then use the `len()` function to display the number of characters in this string.

---

Non-atomic variables contain many individual values. The simplest non-atomic variable is a **list**, which is denoted with square brackets and separates different values with commas. The individual values can be integers, floats, or strings:

In [11]:
my_list = [5, 5.0, "five"]
type(my_list)

list

One property of lists is that they can be **indexed**. That is, we can call individual elements of the list by listing the element number inside square brackets. One important thing to remember is that **Python elements always count from 0** not 1! That's different from R and other programming environments and causes a lot of errors among new Python users. It helps me to remember this joke:
> What did the Python programmer sing during the 7th Inning Stretch at the baseball game? For it's 0, 1, 2 strikes you're out!

So the first, second, and third elements of the list are

In [12]:
my_list[0]

5

In [13]:
my_list[1]

5.0

In [14]:
my_list[2]

'five'

To call more than one element of the list, write `[a:b]` after the list name, where `a` is the first element to be called and `b` is the number one higher than the last element to be called -- in other words, this syntax captures the `a` element and stops just short of the `b` element. To capture the first two elements of the list, but not the third, we type:

In [15]:
my_list[0:2]

[5, 5.0]
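
The `[a:b]` syntax has a few shortcuts that are worth knowing. The sketch below rebuilds the same list so that it runs on its own; the variable names `first_two`, `last_two`, and `last_item` are just for illustration.

```python
my_list = [5, 5.0, "five"]

first_two = my_list[:2]    # leaving out a means "start at the beginning"
last_two = my_list[1:]     # leaving out b means "continue through the end"
last_item = my_list[-1]    # negative numbers count backwards from the end

print(first_two)
print(last_two)
print(last_item)
```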

A **tuple** is similar to a list, but the elements are contained within parentheses instead of brackets:

In [None]:
my_tuple = (5, 5.0, "five")
type(my_tuple)

tuple

Lists and tuples share much of the same functionality. Unlike a list, however, a tuple is designed to provide specific data in specific slots: we can provide the exact time of an event in the form of a tuple in which the first element is the hour, the second element is the minute, and the third element is the second. Tuples are called **immutable**, which means that once a tuple is created its length is fixed: no elements can be added to, removed from, or replaced in the tuple. The reason for the fixed structure is that functions that use tuples expect particular information to exist in specific slots, and adding or removing elements shifts elements in a way that can break these functions.
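
Here is a minimal sketch of the time example described above. The values in `event_time` are made up for illustration. The code unpacks the three slots into separate variables, and then shows that Python refuses an attempt to change one of the slots.

```python
# (hour, minute, second): each slot has a fixed meaning
event_time = (14, 30, 5)

# Unpack the three slots into three separate variables
hour, minute, second = event_time
print(hour, minute, second)

# Trying to change a slot raises a TypeError because tuples are immutable
try:
    event_time[0] = 15
except TypeError as err:
    tuple_error = str(err)

print(tuple_error)
```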

A **dictionary** places the elements into curly braces and gives each element a name, called a key, which is separated from the element's value by a colon:

In [None]:
my_dictionary = {'university': 'University of Virginia',
                 'enrollment': 24639,
                 'location': 'Charlottesville, Virginia'}
type(my_dictionary)

dict

Unlike the elements of a list or tuple, the elements of a dictionary cannot be extracted by their position number. Instead, they are extracted by referencing the name of the element:

In [None]:
my_dictionary['location']

'Charlottesville, Virginia'
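
The same bracket syntax can be used to add a new named element to a dictionary. As a small sketch, the code below rebuilds the dictionary so that it runs on its own, adds a `'founded'` element (the key name is my own choice), and lists all of the names with the `.keys()` method.

```python
my_dictionary = {'university': 'University of Virginia',
                 'enrollment': 24639,
                 'location': 'Charlottesville, Virginia'}

# Assigning to a name that does not exist yet adds a new element
my_dictionary['founded'] = 1819

# .keys() reports the names of all the elements
names = list(my_dictionary.keys())
print(names)
```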

Dictionaries are very similar to [JSON](https://en.wikipedia.org/wiki/JSON) formatted data, which is one of the most important coding standards for transferring data via the internet. We will discuss JSON data at length in module 3.
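
To give a sense of how close the two are, here is a minimal sketch that uses the `json` package from Python's standard library to convert a dictionary to JSON-formatted text and back again. The variable names `json_text` and `round_trip` are just for illustration.

```python
import json

my_dictionary = {'university': 'University of Virginia',
                 'enrollment': 24639,
                 'location': 'Charlottesville, Virginia'}

# dumps() converts the dictionary into a string of JSON text
json_text = json.dumps(my_dictionary)
print(json_text)

# loads() converts JSON text back into a Python dictionary
round_trip = json.loads(json_text)
print(round_trip == my_dictionary)
```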

Finally, a dataframe is a table that contains data. A `DataFrame` variable in Python comes from the `pandas` package, and this kind of variable is the primary way data must be stored in Python in order to run statistical analyses and other kinds of analysis. Most of this course is devoted to finding ways to take data from all kinds of places and many different formats and manipulating the data into a `DataFrame`:

In [None]:
import pandas as pd
my_dict = {'day': ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'],
         'high_temp': [87, 80, 91, 102, 92, 86, 78],
         'low_temp': [74, 68, 78, 85, 81, 72, 61]}
my_dataframe = pd.DataFrame(my_dict)
my_dataframe

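|   | day       | high_temp | low_temp |
|---|-----------|-----------|----------|
| 0 | Sunday    | 87        | 74       |
| 1 | Monday    | 80        | 68       |
| 2 | Tuesday   | 91        | 78       |
| 3 | Wednesday | 102       | 85       |
| 4 | Thursday  | 92        | 81       |
| 5 | Friday    | 86        | 72       |
| 6 | Saturday  | 78        | 61       |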


---
**Exercise 16**: The Star Wars movies in the order of release are

* Episode IV – A New Hope, 1977 	
* Episode V – The Empire Strikes Back, 1980 	
* Episode VI – Return of the Jedi, 1983
* Episode I – The Phantom Menace, 1999 
* Episode II – Attack of the Clones, 2002 
* Episode III – Revenge of the Sith, 2005
* Episode VII – The Force Awakens, 2015
* Rogue One: A Star Wars Story, 2016
* Episode VIII – The Last Jedi, 2017
* Solo: A Star Wars Story, 2018
* Episode IX – The Rise of Skywalker, 2019

Create a list of the episode numbers (use S1 and S2 for Rogue One and Solo), then a list in which each element is a tuple containing the episode number and title for each of the movies, then a dataframe that contains the episode number, title, and release year.

---

In [42]:
import pandas as pd

sw_dict = {'episodes': ['E4', 'E5', 'E6', 'E1', 'E2', 'E3', 'E7', 'S1', 'E8', 'S2', 'E9'],
           'titles': ['A New Hope', 'The Empire Strikes Back', 'Return of the Jedi',
                      'The Phantom Menace', 'Attack of the Clones', 'Revenge of the Sith',
                      'The Force Awakens', 'Rogue One: A Star Wars Story', 'The Last Jedi',
                      'Solo: A Star Wars Story', 'The Rise of Skywalker'],
           'release_year': [1977, 1980, 1983, 1999, 2002, 2005, 2015, 2016, 2017, 2018, 2019]}

# Build the dataframe from the dictionary and display it
df_starwars = pd.DataFrame(sw_dict)
df_starwars

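|    | episodes | titles                       | release_year |
|----|----------|------------------------------|--------------|
| 0  | E4       | A New Hope                   | 1977         |
| 1  | E5       | The Empire Strikes Back      | 1980         |
| 2  | E6       | Return of the Jedi           | 1983         |
| 3  | E1       | The Phantom Menace           | 1999         |
| 4  | E2       | Attack of the Clones         | 2002         |
| 5  | E3       | Revenge of the Sith          | 2005         |
| 6  | E7       | The Force Awakens            | 2015         |
| 7  | S1       | Rogue One: A Star Wars Story | 2016         |
| 8  | E8       | The Last Jedi                | 2017         |
| 9  | S2       | Solo: A Star Wars Story      | 2018         |
| 10 | E9       | The Rise of Skywalker        | 2019         |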
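
Exercise 16 also asks for a list in which each element is a tuple. One possible approach, sketched here for only the first three movies to keep the example short, is the built-in `zip()` function, which pairs up the elements of two lists position by position. The variable names are my own choices.

```python
episodes = ['E4', 'E5', 'E6']
titles = ['A New Hope', 'The Empire Strikes Back', 'Return of the Jedi']

# zip() matches the first episode with the first title, and so on;
# list() collects the resulting pairs into a list of tuples
episode_title_pairs = list(zip(episodes, titles))
print(episode_title_pairs)
```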


A Python **function** takes an input and supplies an output. Functions have names, and include parentheses in which the input should be typed. For example, the function `math.exp()` takes a numeric input and outputs the value of $e = 2.71 ...$ raised to the power of the input:

In [35]:
import math
math.exp(2)

7.38905609893065
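
Functions are not limited to the ones that packages provide. As a minimal sketch, the hypothetical function `meters_per_second_to_mph()` below (the name and its parameter are my own illustration, not part of any package) takes one numeric input and returns one output, using the same conversion factors as the speed of light example earlier in this notebook.

```python
def meters_per_second_to_mph(meters_per_second):
    """Convert a speed in meters per second to miles per hour."""
    seconds_per_hour = 60 * 60
    meters_per_mile = 1609.344
    return meters_per_second * seconds_per_hour / meters_per_mile

# The input goes inside the parentheses, just as with math.exp()
speed_mph = meters_per_second_to_mph(299792458)
print(speed_mph)
```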

Some Python variables have specific methods and attributes associated with them. We will discuss how to find out what the methods and attributes are for a specific variable later in this course. A **method** is a function that is attached to an existing Python variable, and operates on the data that exists within that variable. To call a method, type the name of the variable we want to apply the method to, then a period, then the name of the method with parentheses. If there are additional inputs, they should be typed inside the parentheses. For example, to extract summary statistics from the dataframe I created above, I can use the `.describe()` method:

In [37]:
df_starwars.describe(include='all')

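|        | episodes | titles     | release_year |
|--------|----------|------------|--------------|
| count  | 11       | 11         | 11.0         |
| unique | 11       | 11         | NaN          |
| top    | E4       | A New Hope | NaN          |
| freq   | 1        | 1          | NaN          |
| mean   | NaN      | NaN        | 2002.818182  |
| std    | NaN      | NaN        | 16.172929    |
| min    | NaN      | NaN        | 1977.0       |
| 25%    | NaN      | NaN        | 1991.0       |
| 50%    | NaN      | NaN        | 2005.0       |
| 75%    | NaN      | NaN        | 2016.5       |
| max    | NaN      | NaN        | 2019.0       |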


An **attribute** is a variable that can be extracted from another variable. To extract the attribute, type the name of the variable, a period, and the name of the attribute you want. Unlike a method, an attribute is not followed by parentheses. For example, we can extract a table that shows us the data types of each column in the dataframe with the `.dtypes` attribute:

In [38]:
df_starwars.dtypes

episodes        object
titles          object
release_year     int64
dtype: object
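
Methods and attributes are not unique to dataframes. As a small sketch using only Python's standard library, a `date` variable has a `.year` attribute and an `.isoformat()` method. The date below is only an illustration, and the built-in `callable()` function is used to show which of the two needs parentheses.

```python
from datetime import date

release_date = date(1977, 5, 25)

# An attribute is extracted without parentheses
release_year = release_date.year

# A method is a function attached to the variable, so it needs parentheses
release_text = release_date.isoformat()

print(release_year, release_text)

# callable() is True for things that can be run with parentheses
print(callable(release_date.isoformat), callable(release_date.year))
```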

---
**Exercise 17**: Use the `.head()` method to look at only the first three rows of the Star Wars dataframe you created in the last exercise. Then use the `.shape` attribute to report the number of rows and columns in this dataframe.

---

In [40]:
df_starwars.head(3)

|   | episodes | titles                  | release_year |
|---|----------|-------------------------|--------------|
| 0 | E4       | A New Hope              | 1977         |
| 1 | E5       | The Empire Strikes Back | 1980         |
| 2 | E6       | Return of the Jedi      | 1983         |


In [41]:
df_starwars.shape

(11, 3)