# [PUBPOL/ETHSTD C164A] 1. Introduction to Python and Jupyter 

**Estimated time:** 60 minutes

**Notebook developed by:** <br>
Team Lead: Skye Pickett  <br>
Fall 2022 Developers: Leah Hong, Emily Guo, Reynolds Zhang <br>
Summer 2022 Developers: Vaidehi Bulusu, Leah Hong, Drishti Gupta, Hans Ocampo <br>


### Learning Outcomes

In this notebook, you will learn about:
- Navigating a Jupyter Notebook
- Basics of Coding in Python
- How to Understand Tables
- How to Reorganize and Manipulate Tables
- How to Find Statistical Analyses from Tables

## Table of Contents
1. [Jupyter Basics](#1.-Jupyter-Basics)
1. [Python Basics](#2.-Python-Basics)
1. [Introduction to Tables](#3.-Introduction-to-Tables)
1. [Introducing the Dataset](#4.-Introducing-the-Dataset)
1. [Statistical Analysis](#5.-Statistical-Analysis)
1. [Conclusion](#6.-Conclusion)
1. [Submitting Your Work](#7.-Submitting-Your-Work)
1. [Explore Data Science Opportunities](#8.-Explore-Data-Science-Opportunities)
1. [Feedback Form](#9.-Feedback-Form)
***


### Helpful Data Science Resources
Here are some resources you can check out while doing this notebook!

- [DATA 8 Textbook](https://inferentialthinking.com/chapters/03/1/Expressions.html) (also the reference source of this notebook)
- [Reference Sheet for the datascience Module](http://data8.org/sp22/python-reference.html) (This is extremely helpful whenever you need a cheatsheet!)
- [Documentation for the datascience Module](http://data8.org/datascience/index.html)


### Peer Consulting

If you find yourself having trouble with any content in this notebook, Data Peer Consultants are an excellent resource! Click [here](https://dlab.berkeley.edu/training/frontdesk-info) to locate live help.

Peer Consultants are there to answer all data-related questions, whether it be about the content of this notebook, applications of data science in the world, or other data science courses offered at Berkeley.

---

# 1. Jupyter Basics

### 1.1 Cells

#### What is a cell?
This whole document is a Jupyter Notebook, and a cell is one block within it. Each cell is highlighted when you click on it.

#### Types of Cells

**Markdown Cell**

A markdown cell is where you write text. This cell is a markdown cell!

To edit a markdown cell, you must double click the cell's text. Try double clicking this block of text!

**Code Cell**

In [None]:
# This is a code cell 
# A code cell is where you will write code
# If the code contains a mistake, running the cell will display an error message

#### Adding a cell

To add a cell, find the plus `+` sign on the tab bar above.

When you add a cell, it is automatically a code cell.

To switch from a code cell to a markdown cell, click the *Cell* button in the menu bar then click *Cell Type* then click *Markdown*.

**Running a Cell**

To run a code cell, press *shift+return* on your keyboard or click the `Run` button in the tab bar.

In [None]:
# try to run this cell 

1+12

The above cell should return 13 below the cell.

### 1.2 Navigating the Notebook

**Editing, saving, and submitting work**

To delete a cell, you can click the *Scissors* icon in the tab bar, or click the *Edit* button in the menu bar and then *Delete Cells*.

To save your notebook and your progress, click the *File* button in the menu bar, then *Save and Checkpoint*.  
**Tip:** It is important to ***Save and Checkpoint*** your progress so that you can return to your saved work if you need to restart your kernel!

To submit your completed notebook, click the *File* button then the *Download as* button. Finally, download the notebook as  *Notebook (.ipynb)* or as *PDF*.

**How to get more Jupyter help**

["Jupyter Notebook Tips, Tricks, and Shortcuts"](https://dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/): This guide covers shortcuts and tools for looking up information about functions and variables, along with tips on using Jupyter with other programming languages. It may be especially useful if you already have experience in another programming language.

*** 
# 2. Python Basics

**Python** is a popular programming language for both data science and general software development. It gives us a way to communicate with the computer and give it instructions, which is why mastering the fundamentals is critical.

Just like any language, Python has a set vocabulary made up of words it can understand, and a syntax which provides the rules for how to structure our commands and give instructions. 

### 2.1 Errors
Errors in programming are common and totally okay! Don't be afraid when you see an error, because more often than not the error message itself points to the solution. Let's see what an error looks like. <font color = #d14d0f>**Run the cell below to see the output.**</font>

In [None]:
print('This line is missing something.'

The last line of the error message in the output attempts to tell you what went wrong. Depending on your Python version, you should see a message such as "SyntaxError: unexpected EOF while parsing" (Python 3.9 and earlier) or "SyntaxError: '(' was never closed" (Python 3.10 and later). Either way, it means Python reached the end of your code while still expecting a closing parenthesis. <font color = #d14d0f>**Try adding a closing parenthesis to end the statement and watch the error message disappear!**</font>
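
If you are curious, the same check can be made without triggering the error in your cell. The sketch below uses Python's built-in `compile` function to ask whether a piece of code follows the grammar rules; the helper name `has_valid_syntax` is made up for this illustration and is not part of the notebook.

```python
def has_valid_syntax(code_text):
    """Return True if Python can parse code_text, False if it raises a SyntaxError."""
    try:
        compile(code_text, "<example>", "exec")
        return True
    except SyntaxError:
        return False

broken = "print('This line is missing something.'"
fixed = "print('This line is missing something.')"

print(has_valid_syntax(broken))  # False: the closing parenthesis is missing
print(has_valid_syntax(fixed))   # True: the statement is complete
```

You do not need to understand `def` or `try` yet; the point is only that the single missing `)` is what separates valid code from a SyntaxError.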

### 2.2 Expressions

Programs are made up of expressions, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a * symbol between two numerical expressions. 

Expressions, such as 5 * 3, are evaluated by the computer. The value (the result of evaluation) of the last expression in each cell, 15 in this case, is displayed below the cell. <font color = #d14d0f>**Try running the cell below to see the value of the expression!**</font>

In [None]:
5 * 3

The grammar rules of a programming language are strict. In Python, the * symbol cannot appear twice in a row with a space in between. Instead of evaluating the expression, Python shows a SyntaxError. The syntax of a language is its set of grammar rules, and a SyntaxError indicates that the structure of an expression doesn’t match any of the rules of the language. <font color = #d14d0f>**Run the cell below to see the SyntaxError!**</font>

In [None]:
5 * * 3

Small changes to an expression can change its meaning entirely. Below, the space between the *’s has been removed. Because ** appears between two numerical expressions, the expression is an exponentiation expression (the first number raised to the power of the second number, so 5 to the power of 3 or 5 times 5 times 5). The symbols * and ** are called operators, and the values they combine are called operands.

In [None]:
5 ** 3

#### Common Operators

In Python, the following operators are essential:

| Expression Type | Operator | Example | Value |
| --- | --- | --- | --- |
| Addition | + | 2 + 3 | 5 |
| Subtraction | - | 2 - 3 | -1 |
| Multiplication | * | 2 * 3 | 6 |
| Division | / | 6 / 3 | 2.0 |
| Remainder | % | 7 % 3 | 1 |
| Exponentiation | ** | 2 ** 3 | 8 |
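
As a quick check, the sketch below evaluates one example per operator from the table. The variable names are chosen only for this illustration. Note that in Python 3 the `/` operator produces a float even when the result is a whole number.

```python
# One example per operator from the table.
addition = 2 + 3          # 5
subtraction = 2 - 3       # -1
multiplication = 2 * 3    # 6
division = 6 / 3          # 2.0 (the / operator produces a float)
remainder = 7 % 3         # 1
exponentiation = 2 ** 3   # 8

print(addition, subtraction, multiplication, division, remainder, exponentiation)
print(type(division))
```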

Python expressions obey the same familiar rules of PEMDAS as in algebra: multiplication and division occur before addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression. <font color = #d14d0f>**Try running the cell below to see the difference parentheses can make!**</font> 

In [None]:
1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10

In [None]:
1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10
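
To see why the two results differ, it can help to evaluate each expression one step at a time. The sketch below breaks both expressions into intermediate values; the names `power`, `product`, and `grouped` are made up for this illustration.

```python
# First expression: exponentiation is applied first, then * and /, then + and -.
power = 6 ** 3                       # 216
product = 2 * 3 * 4 * 5 / power      # 120 / 216
first = 1 + product + 7 + 8 - 9 + 10

# Second expression: the parentheses are evaluated before the exponent.
grouped = 3 * 4 * 5 / 6              # 10.0
second = 1 + 2 * grouped ** 3 + 7 + 8 - 9 + 10   # 2017.0

print(first)
print(second)
```

In the first expression only the 6 is raised to the third power; in the second, the whole parenthesized value is.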

#### Assignment Statements

Names are given to values in Python using an **assignment statement**. An assignment statement consists of the name (on the left), followed by =, which is followed by any expression. 

The value of the expression to the right of the = is **assigned** to the name (on the left). Once you've assigned a value to a name, you can access that value through the name in future instances. <font color = #d14d0f>**Try running the cell below to see how names are stored!**</font> 

In [None]:
a = 5 
b = 7 
a + b

Sometimes, instead of retyping a long calculation like 4 - 2 * (1 + 6 / 3) every time you need it, you will want to store its value under a name for easy access in future calculations. <font color = #d14d0f>**Check out how we can use assignment statements to our advantage below!**</font>

In [None]:
# Instead of performing this calculation over and over again ...
4 - 2 * (1 + 6 / 3)

In [None]:
# Try assigning it to a name for future use!
y = 4 - 2 * (1 + 6 / 3)
# This cell doesn't return anything since it only assigns and doesn't say to return y

An assignment statement, such as `y = 4 - 2 * (1 + 6 / 3)` has three parts:
>On the left is the variable name (y) <br>
On the right is the variable's value (4 - 2 * (1 + 6 / 3)) <br> 
The equals sign in the middle tells the computer to assign the value to the name.

You might have noticed that running that second cell did not output anything. However, the value is now stored under the name `y`, so we can access it again and again in the future.



In [None]:
# We can print the value as follows
y

In [None]:
# We can also use it in other calculations now!
y * 2

Names must start with a letter or an underscore, and can contain letters, numbers, and underscores. A name cannot contain a space; instead, it is common to use an underscore character _ to replace each space.
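
One way to test whether a piece of text would be accepted as a name is the built-in string method `isidentifier()`. The candidate names below are made up for this illustration.

```python
candidate_names = ["unemployment_rate", "rate2", "_private", "2nd_rate", "rate 2"]

for name in candidate_names:
    # isidentifier() is True when the text follows Python's naming rules
    print(name, "->", name.isidentifier())
```

The last two fail because one starts with a digit and the other contains a space. Keep in mind that reserved words such as `for` pass this check but still cannot be used as names.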

Names are only as useful as you make them; it’s up to the programmer to choose names that are easy to interpret. Typically, more meaningful names can be invented than a and b. For example, to describe the different types of data from a labor report, the following names clarify the meaning of the various quantities involved.

In [None]:
year = 2022
unemployment_rate_females = 0.04
unemployment_rate_males = 0.037
total_unemployment_rate = unemployment_rate_females + unemployment_rate_males
total_unemployment_rate

## 2.3 Data Types

Every value has a type, and the built-in `type` function returns the type of the result of any expression.

### Numbers

One type we have encountered already is Numbers. 

Python distinguishes between *two different types of numbers*:

Integers are called int values in the Python language. They can only represent whole numbers (negative, zero, or positive) that don’t have a fractional component.

Recall earlier we ran a cell that stated: `a = 5`

In [None]:
#the "a" we just defined is an int type number
type(a)

Real numbers are called float values (or floating point values) in the Python language. They can represent whole or fractional numbers but have some limitations due to approximation.
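
The approximation mentioned above can be seen directly. In this small sketch (the variable names are illustrative only), adding 0.1 and 0.2 does not give exactly 0.3, so it is safer to round before comparing floats.

```python
total = 0.1 + 0.2
print(total)                    # 0.30000000000000004
print(total == 0.3)             # False, because of floating point approximation
print(round(total, 2) == 0.3)   # True once the value is rounded

quotient = 6 / 3
print(type(quotient))           # <class 'float'>, even though the result is a whole number
```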

Recall earlier we ran a cell that stated:
`y = 4 - 2 * (1 + 6 / 3)`

In [None]:
#the "y" we just defined is a float type number
type(y)

### Strings

Much of the world’s data is text, and a piece of text represented in a computer is called a string. A string can represent a word, a sentence, or even the contents of every book in a library. 

The meaning of an expression depends both upon its structure and the types of values that are being combined. 

So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value.

In [None]:
"public" + " " + "policy"

We use the `split()` method to split a string into a list. You can specify the separator (the default separator is any whitespace). 

In [None]:
txt = "welcome to PP190"

x = txt.split() 
# using the default separator 

print(x)

In [None]:
txt = "apple#banana#orange"

x = txt.split("#") 
# specifying the separator

print(x)

The string `join()` method returns a string by joining all the elements of a list of strings, separated by the given separator.

In [None]:
text = ['PP190', 'is', 'a', 'fun', 'class.']

# join elements of text with space
print(' '.join(text))

The string `replace()` method replaces a specified phrase with another specified phrase.

In [None]:
txt = "I like listening to pop music."

x = txt.replace("pop", "country")

print(x)
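
The three methods can also be combined. The sketch below, with variable names chosen only for illustration, splits a sentence into words, joins the words with hyphens, and then replaces the hyphens with spaces to recover the original text. Each method returns a new value and leaves the original string unchanged.

```python
sentence = "welcome to PP190"

words = sentence.split()               # ['welcome', 'to', 'PP190']
hyphenated = "-".join(words)           # 'welcome-to-PP190'
restored = hyphenated.replace("-", " ")

print(words)
print(hyphenated)
print(restored == sentence)            # True
print(sentence)                        # the original string is unchanged
```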

*** 
# 3. Introduction to Tables

### What is a Table?

A table is an object in Python that allows you to store data. It is a collection of rows and columns. Each row corresponds to one entry in the table and each column corresponds to a particular aspect you have data about. For example, say you have a table with information about 10 college students. It would have one row for each student and one column for each aspect of the students (e.g. name, major, year in college, etc.).

### How to Use Tables

You can create and edit tables using functions. One type of function we can use on tables is called a table method. We use table methods in a specific format: `table_name.method_name(any arguments)`. We'll look at plenty of examples of table methods below and in future notebooks!

#### How to Create Tables

Let's look at how to create tables. First, we have to import the `datascience` module using the following import statement:

In [None]:
# Run this cell (otherwise the next code cells will error)
from datascience import *
import numpy as np  # provides np.mean, np.median, and np.std, used in Section 5
import otter
grader = otter.Notebook()
print("All necessary packages have been imported!")

A module is a collection of functions. We have to import modules using an import statement to use the functions they contain. We want to use the table functions in the `datascience` module, so we're importing it. The above statement is basically saying: from the `datascience` module, import all functions (which is what the * means).

Next, we can create an empty table using the `Table()` function:

In [None]:
new_table = Table()

Here, we created a new variable called `new_table`, which is assigned to an empty table. Because it's an empty table, nothing shows up when we display it:

In [None]:
new_table

Now, let's add some data to this table. We can do this by creating the columns we want the table to have, then adding those columns to the table.

Each column is a collection of values of the same type. So, we can use the `make_array` function to create columns.

In [None]:
cafe_names = make_array("Peet's", "Romeo's", "Milano", "Strada")
cafe_prices = make_array(4, 5, 6.5, 3)

# Like before, this cell won't return anything since it's only assignment statements

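An **array** is similar to a list, except that all of its values share one type. As seen above, an array can hold strings (`cafe_names`) or numbers (`cafe_prices`). When integers and floats are mixed, as in `cafe_prices`, the integers are stored as floats.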

Now we've created two columns for our table: one for the names of different cafes in Berkeley and one for the prices of their coffee *(note: prices are made-up)*. 

Now, let's add these columns to the table using the `.with_columns` method where our first argument is the name of the column and the second argument is the column values:



> **`Table().with_columns(...)`**: create a new table by adding columns
     

In [None]:
new_table = new_table.with_columns("Cafe Names", cafe_names,
                                  "Cafe Prices", cafe_prices)

The first argument in the `.with_columns` method is the name of your column – make sure to enclose this name within single or double quotes. The second argument is the array with the values for this column.

If we look at `new_table` now, we'll see that two new columns are added to it:

In [None]:
new_table

This is a good resource for learning about tables: https://inferentialthinking.com/chapters/06/Tables.html

Feel free to also look through [documentation](http://data8.org/datascience/tables.html) on Tables, which lists the different methods of the `Table` object. We'll learn more about tables in future notebooks!

***
# 4. Introducing the Dataset

In this notebook, you will use data from the [USBLS (U.S. Bureau of Labor Statistics)](https://www.bls.gov/cps/demographics.htm#families).

The USBLS reports data analyzing different factors that contribute to the varying levels and causes of unemployment and labor force participation in the United States. 

In this dataset, you will see marital and family labor force statistics from the Current Population Survey about unemployment rates of individuals with children of different ages over the years. This will provide a framework for understanding the different factors that can go into unemployment rates, and effective policies to reduce it.


## 4.1 Context of the Data

Below you will find a data dictionary for future reference. This data dictionary goes over what the column names mean in the data we are about to load.

| Column Name | Definition |
| :- | :- |
| year | The year the data in the row is for. |
| overall_6_to_17 | The overall unemployment rate (%) for individuals with children strictly between the ages of 6 and 17. |
| men_6_to_17 | The unemployment rate (%) for men with children strictly between the ages of 6 and 17. |
| women_6_to_17 | The unemployment rate (%) for women with children strictly between the ages of 6 and 17. |
| overall_under_6 | The overall unemployment rate (%) for individuals with children under the age of 6. |
| men_under_6 | The unemployment rate (%) for men with children under the age of 6. |
| women_under_6 | The unemployment rate (%) for women with children under the age of 6. |

Now, let's load the data we'll be working with! Run the cell below to see the data.

In [None]:
# Below we see an assignment statement.
# We are telling the computer to create a Table by reading in some data.
unemployment = Table.read_table('Data/marital.csv')

# This next command will display the top 5 entries. You can change the number to view a different number of entries at a time.
unemployment.show(5)

## 4.2 Exploring the Data

Now that we have the table loaded and saved with an assignment statement, we can start to use some of the table methods from the documentation introduced above to reorganize the data as we like!

### Sorting


Let's say that we want to specifically see which years the unemployment rate was the highest for women with children under the age of 6.

To do this, we would use the `tbl.sort` method, which creates a copy of a table sorted by the values in a column. The table generated will default to ascending order unless `descending = True` is included.

Run the cell below to generate the sorted table!

In [None]:
sorted_table = unemployment.sort("women_under_6", descending = True) 
# We included descending = True because we want to see the years with the highest unemployment rate!

sorted_table

Notice that the table produced earlier and the table presented now are the same data, but the rows are in a different order.

### Selecting Column(s)

So, now we have a table with the sorted data, but we can still see the other columns! How would we generate a table with the sorted data, but just displaying the column we want to see?

To accomplish this, we would use the `tbl.select` method, which creates a copy of the table displaying only the column(s) specified.

Run the cell below to see the table with just the sorted data!

In [None]:
filtered_table = sorted_table.select("year", "women_under_6") # Place the column name in quotes
filtered_table

### Grouping Columns

Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups.


[This page on the `group` function](https://inferentialthinking.com/chapters/08/3/Cross-Classifying_by_More_than_One_Variable.html?highlight=group) from the Data8 Textbook is an excellent resource with great examples!

Let's try using the group function. 
> **`tablename.group(column_name(s), func)`**: Group rows by unique values or combinations of values in a column(s). Multiple columns must be entered in array or list form. Other values aggregated by count (default) or optional argument func.

Below, the `group` method lists every distinct year in the unemployment table. Because we do not pass the optional `func` argument described above, it uses the default behavior: counting the number of rows for each year in the dataset.

In [None]:
grouped_years = unemployment.group('year')
grouped_years

The result of the previous code cell shows that all the years appear just once in this table.
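
To get a feel for what counting by group means, here is a plain-Python analogy using `collections.Counter` from the standard library. It is only a sketch of the idea, not how `group` is implemented, and the list of years is made up so that one year appears twice.

```python
from collections import Counter

# Made-up list of years; 2021 appears twice on purpose.
years = [2019, 2020, 2021, 2021]

counts = Counter(years)
for year, count in sorted(counts.items()):
    print(year, count)
```

In the real dataset every count is 1, which tells us there is exactly one row per year.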

*** 
# 5. Statistical Analysis

Now that we have some of the basic table manipulation concepts down, let's attempt to play around with the data a little! Let's try to find the difference in the average unemployment rates of men and women with children under 6.

First, we'll need to find the average unemployment rates of both columns. To get all the values of a column as an array, use the `.column` method with the name of the column as its argument.

In [None]:
# This will store the values in the 'men_under_6' column as an array!
men_column = unemployment.column('men_under_6')


<!-- BEGIN QUESTION -->
<font color = #d14d0f>

#### Question 1:
Now you try! In the cell below, replace the ... with the column name.

In [None]:
women_column = unemployment.column(...)

<!-- END QUESTION -->
<!-- BEGIN QUESTION -->
<font color = #d14d0f>
    
#### Question 2:
Next, we want to find the average unemployment rates of the respective columns. For this, we need to use the `np.mean` function.

In [None]:
men_averages = np.mean(men_column)

# Now you try!
women_averages = ...

<!-- END QUESTION -->
Now, to find the difference between the average unemployment rates, you would subtract one from the other!

In [None]:
difference = women_averages - men_averages
difference

<!-- BEGIN QUESTION -->
<font color = #d14d0f>
    
#### Question 3:
What about the median unemployment rates of the respective columns? For this, we need to use the `np.median` function.

In [None]:
men_median = np.median(men_column)

# Now you try!
women_median = ...

<!-- END QUESTION -->
<!-- BEGIN QUESTION -->
<font color = #d14d0f>

#### Question 4:

What about the standard deviation for unemployment rates of the respective columns? For this, we need to use the `np.std` function.

In [None]:
men_sd = np.std(men_column)

# Now you try!
women_sd = ...

<!-- END QUESTION -->
<!-- BEGIN QUESTION -->
<font color = #d14d0f>

#### Question 5:    

What are the ranges for unemployment rates of the respective columns? Remember that range is the difference between the largest and smallest numbers. For this, we need to use the `max` and `min` functions.

In [None]:
men_range = max(men_column) - min(men_column)

# Now you try!
women_range = ...

<!-- END QUESTION -->
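
To make the four summary statistics concrete, the sketch below computes them on a small made-up list using Python's standard `statistics` module instead of NumPy. The numbers are for illustration only and do not come from the dataset. One detail worth knowing: by default `np.std` divides by the number of values (the population standard deviation), which corresponds to `statistics.pstdev` rather than `statistics.stdev`.

```python
import statistics

# Made-up rates for illustration only; these are not values from the dataset.
rates = [3.0, 5.0, 4.0, 8.0]

average = statistics.mean(rates)          # (3 + 5 + 4 + 8) / 4 = 5.0
middle = statistics.median(rates)         # midpoint of 4.0 and 5.0 = 4.5
spread = statistics.pstdev(rates)         # population standard deviation
value_range = max(rates) - min(rates)     # 8.0 - 3.0 = 5.0

print(average, middle, spread, value_range)
```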
***
# 6. Conclusion

Over the course of this notebook, you were introduced to the basic types of objects in Python and how to understand and manipulate tables. In the next notebook, you will build further on the coding and quantitative skills you learned today. Stay tuned!

### Congratulations! You have finished the notebook! ##

***
# 7. Submitting Your Work

**Make sure that you've answered all the questions.**

Follow these steps: 
1. Go to `File` in the menu bar, then select `Save and Checkpoint` (or press CTRL+S).
2. Go to `Cell` in the menu bar, then select `Run All`.
3. Click the link produced by the code cell below.
4. Submit the downloaded PDF on bCourses according to your professor's instructions.

**Note:** If clicking the link below doesn't work for you, don't worry! Simply click `File` in the menu, find `Download As`, and choose `PDF via LaTeX (.pdf)` to save a copy of your PDF onto your computer.

**Check the PDF before submitting and make sure all of your answers and any changes are shown.**

In [None]:
# This may take a few extra seconds.
from otter.export import export_notebook
from IPython.display import display, HTML
export_notebook("1. Introduction to Python and Jupyter.ipynb", filtering=True, pagebreaks=False)
display(HTML("<p style='font-size:20px'> <br>Save this notebook, then click <a href='1. Introduction to Python and Jupyter.pdf' download>here</a> to open the pdf.<br></p>"))

***
# 8. Explore Data Science Opportunities

Interested in learning more about how to get involved in data science or learn about data science applications in your field of study? The following resources might help support your learning:

- Data Science Modules: http://data.berkeley.edu/education/modules
- Data Science Offerings at Berkeley: https://data.berkeley.edu/academics/undergraduate-programs/data-science-offerings
- Data 8 Course Information: http://data8.org/
- Data 100 Course Information: https://ds100.org/


***
# 9. Feedback Form 

<div class="alert alert-info">
<b> We encourage you to fill out the following feedback form to share your experience with this Module notebook. The form will take no longer than 5 minutes. At UC Berkeley Data Science Undergraduate Studies Modules, we appreciate all feedback that helps us improve student learning and the experience of using Jupyter Notebooks for Data Science Education: </b> 
</div>

# [UC Berkeley Data Science Feedback Form](https://forms.gle/hipxf2uFw5Ud4Hyn8)
***