# [PP190] Introduction to Python and Jupyter Notebook
### Professor Anibel Ferus-Comelo

***
## Table of Contents
* [Jupyter Basics](#Jupyter-Basics)
* [Python Basics](#Python-Basics)
* [Introduction to Tables](#Introduction-to-Tables)
* [Introducing the Dataset](#Introducing-the-Dataset)
* [Exploring the Data](#Exploring-the-Data)
* [Conclusion](#Conclusion)

In [1]:
import numpy as np

## Jupyter Basics

#### What is a cell?
A cell is a block in the Notebook framework, and each cell will be highlighted when clicked on.

### Types of Cells

**Markdown Cell**

A markdown cell is where you write text. This is a markdown cell! 

To edit a markdown cell, you must double click the cell's text.

**Code Cell**

In [2]:
# This is a code cell 
# A code cell is where you will write code
# Errors can arise in code cells if you write bad code

#### Adding a cell

To add a cell, find the plus + sign on the tab bar above.

When you add a cell, it is automatically a code cell.

To switch from a code cell to a markdown cell, click the *Cell* button in the tab bar then click *Cell Type*.

**Running a Cell**

To run a code cell, press *Shift+Enter* or click the *Run* button in the tab bar.

In [3]:
# try to run this cell 

1+12

13

**Editing, saving, and submitting work**

To delete a cell, you can click the *Scissors* icon in the tab bar or click the *Edit* button in the tab bar.  

To add a cell, you can click the *+* icon in the tab bar or click the *Insert* button in the tab bar.  

To save your notebook and your progress, click the *File* button in the tab bar.  
Tip: It is important to *Save and Checkpoint* your progress in order to be able to revert back to your data if you need to restart your kernel!  

To submit your completed notebook, click the *File* button then the *Download as* button. Finally, download the notebook as  *Notebook (.ipynb)*.

**How to get more Jupyter help**

Here are some resources to familiarize yourself even more with Jupyter! 

Here is the link to the "Jupyter Project Documentation". Here you can find everything about Jupyter, from how to install it on your computer (we're using the web version right now) to Release Notes.

https://docs.jupyter.org/en/latest/

Here is the link to "Jupyter Notebook Tips, Tricks, and Shortcuts". Here you can find some more advanced tools for looking up function and variable information, as well as tips on using Jupyter with other languages, which may be helpful for students with a background in another programming language.

https://dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

Here is a link to "The Jupyter Notebook". Here you can find an introduction to Jupyter Notebook's features that can help you learn how to create a notebook yourself. Other topics are listed in the sidebar on the left.

https://jupyter-notebook.readthedocs.io/en/stable/notebook.html

*** 
## Python Basics

**Python** is a popular programming language for both data science and general software development. It gives us a way to communicate with the computer and give it instructions, which is why mastering the fundamentals is critical. 

Just like any language, Python has a set vocabulary made up of words it can understand, and a syntax which provides the rules for how to structure our commands and give instructions. 

#### Errors
Errors in programming are common and totally okay! Don't be afraid when you see an error, because more often than not the solution lies in the error message itself! Let's see what an error looks like. <font color = #d14d0f>**Run the cell below to see the output.**</font>

In [4]:
print('This line is missing something.'

SyntaxError: unexpected EOF while parsing (2588290324.py, line 1)

The last line of the error message in the output attempts to tell you what went wrong. You should see a message saying "SyntaxError: unexpected EOF while parsing." This just means Python reached the end of your code while still expecting something, in this instance a closing parenthesis. <font color = #d14d0f>**Try adding a closing parenthesis to end the statement and watch the error message disappear!**</font>

### Expressions

Programs are made up of expressions, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a * symbol between two numerical expressions. 

Expressions, such as 5 * 3, are evaluated by the computer. The value (the result of evaluation) of the last expression in each cell, 15 in this case, is displayed below the cell. <font color = #d14d0f>**Try running the cell below to see the value of the expression!**</font>

In [5]:
5 * 3

15

The grammar rules of a programming language are strict. In Python, the * symbol cannot appear twice in a row with a space in between. Instead of computing a value, Python will show a SyntaxError. The syntax of a language is its set of grammar rules, and a SyntaxError indicates that an expression's structure doesn’t match any of the rules of the language. <font color = #d14d0f>**Run the cell below to see the SyntaxError!**</font>

In [6]:
5 * * 3

SyntaxError: invalid syntax (2421471793.py, line 1)

Small changes to an expression can change its meaning entirely. Below, the space between the *’s has been removed. Because ** appears between two numerical expressions, the expression is an exponentiation expression (the first number raised to the power of the second: 5 times 5 times 5). The symbols * and ** are called operators, and the values they combine are called operands.

In [7]:
5 ** 3

125

### Common Operators

In Python, the following operators are essential:

| Expression Type | Operator | Example | Value |
| --- | --- | --- | --- |
| Addition | + | 2 + 3 | 5 |
| Subtraction | - | 2 - 3 | -1 |
| Multiplication | * | 2 * 3 | 6 |
| Division | / | 6 / 3 | 2.0 |
| Remainder | % | 7 % 3 | 1 |
| Exponentiation | ** | 2 ** 3 | 8 |

Python expressions obey the same familiar rules of PEMDAS as in algebra: multiplication and division occur before addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression. <font color = #d14d0f>**Try running the cell below to see the difference parentheses can make!**</font> 

In [8]:
1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10

17.555555555555557

In [9]:
1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10

2017.0
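
To see where the first result comes from, it can help to evaluate the expression one precedence level at a time. The sketch below is only an illustration; the intermediate names (`power`, `product`, `quotient`, `result`) are made up for this example and do not appear elsewhere in the notebook.

```python
# Step-by-step evaluation of: 1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10
power = 6 ** 3                 # exponentiation happens first: 216
product = 2 * 3 * 4 * 5        # multiplication, left to right: 120
quotient = product / power     # division always returns a float
result = 1 + quotient + 7 + 8 - 9 + 10   # addition and subtraction come last

# The step-by-step value matches the one-line expression
print(result == 1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10)   # True
```

In the second cell, the parentheses force `3 * 4 * 5 / 6` to be computed before it is raised to the third power, which is why the result is so much larger.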

#### Assignment Statements

Names are given to values in Python using an **assignment statement**. An assignment statement consists of the name (on the left), followed by =, which is followed by any expression. 

The value of the expression to the right of the = is **assigned** to the name (on the left). Once you've assigned a value to a name, you can access that value through the name in future instances. <font color = #d14d0f>**Try running the cell below to see how names are stored!**</font> 

In [10]:
a = 5 
b = 7 
a + b

12

Sometimes, instead of retyping a long calculation like 4 - 2 * (1 + 6 / 3) every time you need it, you will want to store its result under a name for easy access in future calculations. <font color = #d14d0f>**Check out how we can use assignment statements to our advantage below!**</font>

In [11]:
# Instead of performing this calculation over and over again ...
4 - 2 * (1 + 6 / 3)

-2.0

In [12]:
# Try assigning it to a name for future use!
y = 4 - 2 * (1 + 6 / 3)

An assignment statement, such as y = 4 - 2 * (1 + 6 / 3) has three parts: on the left is the variable name (y), on the right is the variable's value (4 - 2 * (1 + 6 / 3)), and the equals sign in the middle tells the computer to assign the value to the name.

You might have noticed that running that second cell did not output anything. However, the value is now stored under the name, so we can access it again and again in the future.



In [13]:
# We can display the value as follows
y

-2.0

In [14]:
# We can also use it in other calculations now!
y * 2

-4.0
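
One detail worth noting is that a name stores the value the expression had when the assignment ran, not the expression itself. The minimal sketch below reuses `a` and `b` from earlier together with a hypothetical name, `total`, chosen only for this example.

```python
a = 5
b = 7
total = a + b      # total is 12

a = 10             # reassigning a does not change total
print(total)       # 12

total = a + b      # run the assignment again to update it
print(total)       # 17
```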

Names must start with a letter or an underscore, and can contain letters, numbers, and underscores. A name cannot contain a space; instead, it is common to use an underscore character _ to replace each space.
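
One way to check these rules is the built-in string method `isidentifier()`, which reports whether a piece of text has the form of a valid name. The candidate names below are made up for illustration.

```python
# Hypothetical candidate names, written as strings so we can test them
candidates = ["unemployment_rate", "rate2022", "2022_rate", "cafe prices"]

for name in candidates:
    # True means the text could be used as a name in an assignment statement
    print(name, "->", name.isidentifier())
```

The last two candidates fail the check: one starts with a digit and the other contains a space. Note that reserved words such as `class` pass this check but still cannot be used as names.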

Names are only as useful as you make them; it’s up to the programmer to choose names that are easy to interpret. Typically, more meaningful names can be invented than a and b. For example, to describe the different types of data from a labor report, the following names clarify the meaning of the various quantities involved.

In [15]:
year = 2022
unemployment_rate_females = 0.04
unemployment_rate_males = 0.037
total_unemployment_rate = unemployment_rate_females + unemployment_rate_males
total_unemployment_rate

0.077

## Data Types

Every value has a type, and the built-in type function returns the type of the result of any expression.



### Numbers

One type we have encountered already is Numbers. 

Python distinguishes between two different types of numbers:

Integers are called int values in the Python language. They can only represent whole numbers (negative, zero, or positive) that don’t have a fractional component.

In [16]:
#the "a" we just defined is an int type number
type(a)

int

Real numbers are called float values (or floating point values) in the Python language. They can represent whole or fractional numbers but have some limitations.

In [17]:
#the "y" we just defined is a float type number
type(y)

float
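
The limitations of floats come mostly from rounding: a float stores a fixed number of binary digits, so some decimal fractions can only be approximated. The short sketch below shows what that looks like in practice. It uses `math.isclose` from the standard library as one common way to compare floats, and the `//` (floor division) operator, neither of which is covered elsewhere in this notebook.

```python
import math

total = 0.1 + 0.2
print(total)                      # 0.30000000000000004
print(total == 0.3)               # False, because of rounding
print(math.isclose(total, 0.3))   # True, equal within a small tolerance

print(type(6 / 3))                # <class 'float'>, since / always gives a float
print(type(6 // 3))               # <class 'int'>, since // on two ints gives an int
```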

### Strings

Much of the world’s data is text, and a piece of text represented in a computer is called a string. A string can represent a word, a sentence, or even the contents of every book in a library. 

The meaning of an expression depends both upon its structure and the types of values that are being combined. 

So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value.

In [18]:
"public"+" "+"policy"

'public policy'

We use the `split()` method to split a string into a list. You can specify the separator; if you don't, the default separator is any whitespace.

In [19]:
txt = "welcome to PP190"

x = txt.split()

print(x)

['welcome', 'to', 'PP190']


In [20]:
txt = "apple#banana#orange"

x = txt.split("#")

print(x)

['apple', 'banana', 'orange']


The string `join()` method returns a string by joining all the elements of a list (or another sequence of strings), separated by the string it is called on.

In [21]:
text = ['PP190', 'is', 'a', 'fun', 'class.']

# join elements of text with space
print(' '.join(text))

PP190 is a fun class.


The string `replace()` method replaces a specified phrase with another specified phrase.

In [22]:
txt = "I like listening to pop music."

x = txt.replace("pop", "country")

print(x)

I like listening to country music.
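
These three methods are often combined. The sketch below, using a made-up string, splits on one separator, joins with another, and then replaces part of the result. It also shows that string methods return new strings and leave the original unchanged.

```python
raw = "public#policy#190"        # hypothetical example text

parts = raw.split("#")           # ['public', 'policy', '190']
spaced = " ".join(parts)         # 'public policy 190'
label = spaced.replace("190", "PP190")

print(label)                     # public policy PP190
print(raw)                       # public#policy#190 (the original is unchanged)
```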


*** 
## Introduction to Tables

### What is a Table?

A table is an object in Python that allows you to store data. It is a collection of rows and columns. Each row corresponds to one entry in the table and each column corresponds to a particular aspect you have data about. For example, say you have a table with information about 10 college students. It would have one row for each student and one column for each aspect of the students (e.g. name, major, year in college, etc.).

### Table Functions

You can create and edit tables using functions. One type of function we can use on tables is called a table method. We use table methods in a specific format: table_name.method_name(any arguments). We'll look at plenty of examples of table methods below!

#### How to Create Tables

Let's look at how to create tables. First, we have to import the `datascience` module using the following import statement:

In [23]:
from datascience import *

A module is a collection of functions. We have to import modules using an import statement to use the functions they contain. We want to use the table functions in the `datascience` module, so we're importing it. The above statement is basically saying: from the `datascience` module, import all functions (which is what the * means).

Next, we can create an empty table using the `Table()` function:

In [24]:
new_table = Table()

Here, we created a new variable called `new_table`, which is assigned to an empty table. Because it's an empty table, nothing shows up when we display it:

In [25]:
new_table

Now, let's add some data to this table. We can do this by creating the columns we want the table to have, then adding those columns to the table.

Each column is a collection of values of the same type. So, we can use the `make_array` function to create columns.

In [26]:
cafe_names = make_array("Peet's", "Romeo's", "Milano", "Strada")
cafe_prices = make_array(4, 5, 6.5, 3)

Above, we created two columns: one for the names of different cafes in Berkeley and one for the prices of their coffee (note that the prices are made-up). Now, let's add these columns to the table using the `.with_columns` method:

In [27]:
new_table = new_table.with_columns("Cafe Names", cafe_names,
                                  "Cafe Prices", cafe_prices)

The first argument in the `.with_columns` method is the name of your column – make sure to enclose this name within single or double quotes. The second argument is the array with the values for this column.

If we look at `new_table` now, we'll see that two new columns are added to it:

In [28]:
new_table

Cafe Names,Cafe Prices
Peet's,4.0
Romeo's,5.0
Milano,6.5
Strada,3.0


This is a good resource for learning about tables: https://inferentialthinking.com/chapters/06/Tables.html

Feel free to also look through [documentation](http://data8.org/datascience/tables.html) on Tables, which lists the different methods of the `Table` object. We'll learn more about tables within the coming week!

## Introducing the Dataset

In this notebook, you will use data from the USBLS (U.S. Bureau of Labor Statistics).

The USBLS reports data analyzing different factors that contribute to the varying levels and causes of unemployment and labor force participation in the United States. 

In this dataset, you will use information from the marital and family labor force statistics of the Current Population Survey about unemployment rates, over the years, of individuals with children of different ages. This will provide a framework for understanding the different factors that can go into unemployment rates, and effective policies to reduce them.

If you are interested in learning more, please visit:

[USBLS data and information](https://www.bls.gov/cps/demographics.htm#families)

### The Data

Below you will find a data dictionary for future reference. This data dictionary goes over what the column names mean in the data we are about to load.

| Column Name | Definition |
| :- | :- |
| year | The year the data in the row is for. |
| overall_6_to_17 | The overall unemployment rate (%) for individuals with children strictly between the ages of 6 and 17. |
| men_6_to_17 | The unemployment rate (%) for men with children strictly between the ages of 6 and 17. |
| women_6_to_17 | The unemployment rate (%) for women with children strictly between the ages of 6 and 17. |
| overall_under_6 | The overall unemployment rate (%) for individuals with children under the age of 6. |
| men_under_6 | The unemployment rate (%) for men with children under the age of 6. |
| women_under_6 | The unemployment rate (%) for women with children under the age of 6. |

Now, we're going to load the data we're going to be working with! Run the cell below to see the data.

In [29]:
# Below we see an assignment statement.
# We are telling the computer to read some data from a file into a Table.
import numpy as np
unemployment = Table.read_table('marital.csv')

# This next command will display the top 5 entries. You can change the number to view a different number of entries at a time.
unemployment.show(5)

FileNotFoundError: [Errno 2] No such file or directory: 'marital.csv'

## Exploring the Data

Now that we have the table loaded and saved with an assignment statement, we can start to use some of the table documentation that we learned above to reorganize the data as we like!

### Sorting


Let's say that we want to see specifically which years had the highest unemployment rate for women with children under the age of 6.

To do this, we would use the `tbl.sort` method, which creates a copy of a table sorted by the values in a column. The table generated will default to ascending order unless `descending = True` is included.

Run the cell below to generate the sorted table!

In [None]:
sorted_table = unemployment.sort("women_under_6", descending = True) 
# We included descending = True because we want to see the years with the highest unemployment rate!

sorted_table

### Filtering

So, now we have a table with the sorted data, but we can still see the other columns! How would we generate a table with the sorted data, but just displaying the column we want to see?

To accomplish this, we would use the `tbl.select` method, which creates a copy of the table containing only the column(s) specified. (The `tbl.where` method, by contrast, keeps only the rows that meet a condition.)

Run the cell below to see the table with just the sorted data!

In [None]:
filtered_table = sorted_table.select("women_under_6") # Place the column name in quotes
filtered_table
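
For intuition about what these table operations do, here is a plain-Python sketch that does not use the `datascience` library. It models a table as a list of rows, with made-up numbers that are not taken from the BLS data, and shows three separate ideas: sorting rows, keeping one column, and keeping only some rows.

```python
# Hypothetical rows; the rates are invented for illustration only
rows = [
    {"year": 2019, "women_under_6": 4.1},
    {"year": 2020, "women_under_6": 8.9},
    {"year": 2021, "women_under_6": 6.0},
]

# Sort rows by one column, highest value first (like descending = True)
sorted_rows = sorted(rows, key=lambda row: row["women_under_6"], reverse=True)

# Keep a single column from every row
one_column = [row["women_under_6"] for row in sorted_rows]

# Keep only the rows that meet a condition
high_years = [row["year"] for row in sorted_rows if row["women_under_6"] > 5]

print(one_column)   # [8.9, 6.0, 4.1]
print(high_years)   # [2020, 2021]
```

Like the table methods, each step builds a new result and leaves the original `rows` untouched.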

### Grouping Columns

Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. The `group` method, with a single argument, counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.

Below, the `group` method lists every distinct year in the unemployment table, and provides the count for the number of times that year appears in the dataset (all the years appear just once).

In [None]:
grouped_years = unemployment.group('year')
grouped_years

The call to `group` creates a table of counts in each category. The column is called `count` by default, and contains the number of rows in each category.
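
Grouping is more interesting when values repeat. As a rough analogy in plain Python (not the `datascience` implementation), `collections.Counter` from the standard library counts how many times each distinct value appears. The list of majors below is hypothetical.

```python
from collections import Counter

# Hypothetical column of categories (not from the dataset)
majors = ["Economics", "Public Policy", "Economics", "Sociology", "Economics"]

counts = Counter(majors)

# One line per distinct value, with the number of times it appears
for major in sorted(counts):
    print(major, counts[major])
```

Here "Economics" has a count of 3, while the other two majors each have a count of 1.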

### Statistical Analysis

Now that we have some of the basic table manipulation concepts down, let's attempt to play around with the data a little! Let's try to find the difference in the average unemployment rates of men and women with children under 6.

First, we'll need to find the average unemployment rates of both columns. To store all the values of a column as an array, use the `.column` method and pass in the name of the column.

In [30]:
# This will store the values in the 'men_under_6' column as an array!
men_column = unemployment.column('men_under_6')

# Now you try!
women_column = 

SyntaxError: invalid syntax (3592633639.py, line 5)

Next, we want to find the average unemployment rates of the respective columns. For this, we need to use the `np.mean` function.

In [31]:
men_averages = np.mean(men_column)

# Now you try!
women_averages = 

SyntaxError: invalid syntax (934767716.py, line 4)

Now, to find the difference between the average unemployment rates, you would subtract one from the other!

In [32]:
difference = women_averages - men_averages
difference

NameError: name 'women_averages' is not defined

What about the median unemployment rates of the respective columns? For this, we need to use the `np.median` function.

In [33]:
men_median = np.median(men_column)

# Now you try!
women_median = ...

NameError: name 'men_column' is not defined

What about the standard deviation for unemployment rates of the respective columns? For this, we need to use the `np.std` function.

In [34]:
men_sd = np.std(men_column)

# Now you try!
women_sd = ...

NameError: name 'men_column' is not defined

What are the ranges for unemployment rates of the respective columns? Remember that range is the difference between the largest and smallest numbers. For this, we need to use the `max` and `min` functions.

In [35]:
men_range = max(men_column) - min(men_column)

# Now you try!
women_range = ...

NameError: name 'men_column' is not defined
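
To check your understanding of these four summaries, it can help to compute them on a list small enough to verify by hand. The sketch below uses Python's standard `statistics` module as a stand-in for the NumPy functions, with made-up rates that are not taken from the dataset. `statistics.pstdev` is used because `np.std` computes the population standard deviation by default.

```python
import statistics

# Hypothetical unemployment rates (%); not taken from the dataset
rates = [3.5, 4.0, 6.5, 5.0, 4.5]

average = statistics.mean(rates)        # like np.mean: 23.5 / 5
middle = statistics.median(rates)       # like np.median: middle value once sorted
spread = statistics.pstdev(rates)       # like np.std: population standard deviation
value_range = max(rates) - min(rates)   # largest minus smallest

print(average, middle, value_range)
print(round(spread, 3))
```

Sorting the list by hand gives 3.5, 4.0, 4.5, 5.0, 6.5, so the median is the middle value, 4.5, and the range is 6.5 - 3.5 = 3.0.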

## Conclusion

In [36]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

men_under_6 = sum(unemployment.column('men_under_6'))
women_under_6 = sum(unemployment.column('women_under_6'))
men_6_to_17 = sum(unemployment.column('men_6_to_17'))
women_6_to_17 = sum(unemployment.column('women_6_to_17'))
fig2 = px.pie(values=[men_under_6, women_under_6, men_6_to_17, women_6_to_17], names= ['Men U6', 'Women U6', 'Men 6-17', 'Women 6-17'], 
                title='Percentage of Individuals with Children Unemployed from 2009-2021')

fig2.show()

NameError: name 'unemployment' is not defined

Great! Over the course of this notebook, you were introduced to the basic types of objects in Python and how to understand and manipulate tables. In the next notebook, you will build further on the coding and quantitative skills you learned today. Stay tuned!

#  Data Science Resources 

## Peer Consulting

If you had trouble with any content in this notebook, Data Peer Consultants are here to help! You can view their locations and availabilities at this link: https://data.berkeley.edu/education/data-peer-consulting.
Peer Consultants are there to answer all data-related questions, whether it be about the content of this notebook, applications of data science in the world or other data science courses offered at Berkeley -- make sure to take advantage of this wonderful resource!

## Helpful Resources

Here are some resources you can check out to explore further!

- [DATA 8 Textbook](https://inferentialthinking.com/chapters/07/Visualization.html) (also the reference source of this notebook)
- [Reference Sheet for the datascience Module](http://data8.org/sp22/python-reference.html) (This is extremely helpful whenever you need a cheatsheet!)
- [Documentation for the datascience Modules](http://data8.org/datascience/index.html)


This brings us to the end of the notebook. Now you can try applying these basic Python table skills to any table you find!

---