# [ARE 212] Discussion Section - Python 01



## Introduction to `Python` Discussion Sections

### Learning goals for this series of discussion sections

1.  Build on the skills we developed with `R` to provide a basic familiarity with coding in `Python` (and deepen your overall understanding of coding for econometrics in the process)
2.  Go from no `Python` experience to being able to follow [lecture applications](https://datahub.berkeley.edu/user/benjaminkrause/tree/ARE212_Materials)
 and engage in [bcourses discussions](https://bcourses.berkeley.edu/courses/1487913/discussion_topics) via `Python`
 
(Note: I'm starting the numbering over to keep things distinct from our `R` notes.)

### Source Material

As before, these notes are the fruits of the arduous labor of others.  My contributions are minimal, and know that this will be much rougher than our journey thus far.  So please expect to find some errors in the code (and even in the explanations), and assume them all to be mine. If you find any mistakes or you are knocking your head against a screen drenched in the red letters of <font color='red'>Error</font>/desperation/existential dread/rage, please [let me know](mailto:benjaminkrause@berkeley.edu) sooner rather than later.
    
The primary sources of these notes (and honestly the contributors of the real work here) are:
- Ethan and in particular his EEP 153 Notes
- [Computational and Inferential Thinking: The Foundations of Data Science](https://www.inferentialthinking.com/chapters/intro.html) which is the textbook for UC Berkeley's [Data 8: The Foundations of Data Science](http://data8.org/) course.  All of the notes, readings, labs, and assignments are fully available online as well.  For instance, here is [Spring 2020](http://data8.org/sp20/).
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

### [Wherefore](https://www.merriam-webster.com/words-at-play/wherefore-meaning-shakespeare) `R`'t thou?

I'm going to attempt to both get us up to speed on `Python` (and I do very much include myself in "us") while also ensuring that you're coming out of this class with a solid coding foundation in at least one language.  

For this reason, I'll also circulate an updated version of each of the remaining discussion sections over the next month to ensure that you have a full set of notes for coding econometrics in `R` from this course.  

I do not expect that we'll be devoting any more time this semester to `R` in discussion section itself, but please do know that I'll be happy to field any questions via email/zoom/passenger pigeon/smoke signals throughout the semester as well as to [infinity and beyond](https://www.youtube.com/watch?v=2VSYmGSJtCA).  So please do not feel shy about reaching out for anything `R`, `Python`, Econometrics, ARE, or life related (I'm not always the fastest in responses . . . and frankly I know relatively little about any of these topics, but I'm always happy to muddle through together).  

In addition, if you haven't already discovered these, my very much more illustrious predecessors have their own sets of `R` notes for this class freely available on their respective websites, so know that you can always find more from [Ed Rubin](http://edrub.in/ARE212/Spring2017/notes.html) now at University of Oregon Econ and [Fiona Burlig](https://www.fionaburlig.com/teaching/are212) now at University of Chicago Harris School of Public Policy.

Finally, as we're parting with `R` (at least during discussion section), to lessen your sweet sorrow, let me again recommend [R for Data Science](https://r4ds.had.co.nz/) as it is genuinely my favorite place to practice `R`.   

### Class expectations for the rest of the semester 

- [Syllabus](https://github.com/ligonteaching/ARE212_Materials/blob/master/syllabus.pdf)
- Keep up with [Ethan's materials](https://github.com/ligonteaching/ARE212_Materials) hosted on [github.com](https://github.com/)
- [Discussion instead of Problem Sets](https://bcourses.berkeley.edu/courses/1487913/discussion_topics/5731445) (20% of your overall grade)
    - Interactions take place in the bcourses [Discussions](https://bcourses.berkeley.edu/courses/1487913/discussion_topics) tab
    - You are responsible to:
        - Respond to two or more of my discussion prompts.  Posting a response will "unlock" the discussion for that item for you, so you'll be able to see the rest of the discussion thread.   Please post these responses by the end of Wednesday.
        - Comment on at least three of the posts responding to the discussion prompt (i.e., comment on others' comments).    A livelier discussion will result if you don't wait until the last minute (end of Sunday) to post these comments, so that there's some opportunity for dialog.
    - ". . . we just want to see that you're contributing constructively to the conversation."
- Final (30% of your overall grade)

## Introduction to `Python`

### Learning Goals Today

1.  Open a `jupyter` notebook on `datahub.berkeley.edu`.
2.  Understand simple `python` expressions.
3.  Work with lists & dictionaries.
4.  Work with `pandas.DataFrames`.
5.  Submit indication of completion.

### 1. Open a `jupyter` notebook on `datahub.berkeley.edu`

- 

#### Simple Expressions



Python is a general-purpose language, extended via *modules*.
One important building block of Python is the *expression*.  Here are some examples:



In [14]:
# Arithmetic 
1 + 1

# String (Delineated using single- or double-quotes)
"Hello world!"

# To see output, use the "print" function:
print(1+1)
print("Hello" + "world!")

2
Helloworld!


The above provides examples of:

-   **Comments:** Text that begins with a "#" character.
-   **Function calls:** Something that takes arguments (in parentheses)
    and returns some output.  Here "print" is a
    function.
-   **Objects:** 1 and "Hello" are examples of objects.
-   **Operators:** "+" is an operator.  Notice that it functions
    differently depending on what it's operating on;
    that's because its operation depends on the `type`
    of the objects (*operands*) it's operating on.
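
To make the point about operators concrete, here is a minimal sketch (the variable names are illustrative only, not from the course materials) showing `+` acting on two integers, on two strings, and failing when the operand types are mixed.

```python
# "+" adds numbers but concatenates strings.
int_sum = 1 + 1
str_sum = "Hello" + " " + "world!"

print(int_sum)
print(str_sum)

# Mixing an int and a str is an error: Python will not guess what you meant.
try:
    mixed = 1 + "Hello"
except TypeError as err:
    mixed = None
    print("TypeError:", err)

# Convert explicitly when you do want to combine them.
label = str(1) + "Hello"
print(label)
```

Note that string concatenation adds no space of its own, so any space has to be supplied explicitly.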

Predict the output of the following two lines:



In [2]:
print(type(1+1))
print(type("Hello"))

<class 'int'>
<class 'str'>


#### Lists



Strings and integers are simple examples of different data types (or
objects).  A very important type that is more complicated is the
`list`.  Here are some examples.  Examine them, and predict the output:



In [3]:
a = [1,2,3]
b = ["Hello","world"]

c = a + b
print(c)
print(len(c))  # Here len returns the "length" of the list.

[1, 2, 3, 'Hello', 'world']
5


Extra optional practice to refresh your memory of list manipulation and slicing.
Examine these first, then predict the output.



In [4]:
print(c[2]) # Remember how Python counts arrays.
print(c[1:2]) # Why does Python only return one item instead of two?
print(b[0]*4)
print(c[0::2])
print(list(map(lambda x: x*2,a)))

3
[2]
HelloHelloHelloHello
[1, 3, 'world']
[2, 4, 6]
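
The indexing and slicing rules behind these results can be sketched as follows. This is only an illustration: the list `c` is rebuilt so the block runs on its own, and the other names are made up for the example.

```python
c = [1, 2, 3, "Hello", "world"]

# Indexing counts from zero, so c[2] is the third element.
third = c[2]

# A slice start:stop includes start but excludes stop,
# so c[1:2] holds only the element at position 1.
one_item = c[1:2]
two_items = c[1:3]

# A third number is the step: here, every second element from the start.
every_other = c[0::2]

# Negative indices count from the end of the list.
last = c[-1]

# A list comprehension is a common alternative to map + lambda.
doubled = [x * 2 for x in c[:3]]

print(third, one_item, two_items, every_other, last, doubled)
```

One consequence of the half-open convention is that `len(c[i:j])` equals `j - i` whenever both indices lie within the list.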


#### Dictionaries



Another very basic kind of compound object is the `dict` (dictionary;
also called an associative array or hash in other languages).  Predict
the output:



In [5]:
d = {'name': "Barney", 'species': "Dinosaur", 'age': 27, 'color': "Purple"}

print("{name} the {species} is {color}.".format(**d))

Barney the Dinosaur is Purple.
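
A few more dictionary operations, as a rough sketch. The `'diet'` key and the variable names are hypothetical and used only for illustration.

```python
d = {'name': "Barney", 'species': "Dinosaur", 'age': 27, 'color': "Purple"}

# Look up a value by its key.
age = d['age']

# Add or overwrite an entry by assigning to a key.
d['age'] = age + 1

# .get returns a default instead of raising KeyError for a missing key.
diet = d.get('diet', "unknown")

# **d unpacks the dict into keyword arguments, so these two calls are equivalent.
s1 = "{name} the {species} is {color}.".format(**d)
s2 = "{name} the {species} is {color}.".format(
    name=d['name'], species=d['species'], color=d['color'])

# An f-string is another way to build the same sentence.
s3 = f"{d['name']} the {d['species']} is {d['color']}."
print(s3)
```

Keys that the format string does not mention (here `'age'`) are simply ignored by `.format(**d)`.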


#### DataFrames



A much more complicated data structure, the `DataFrame`, is provided by a module called
`pandas`; you can find a quick tutorial at
[http://pandas.pydata.org/pandas-docs/version/0.23/10min.html](http://pandas.pydata.org/pandas-docs/version/0.23/10min.html).  The
`DataFrame` object will be very important for us.

A `DataFrame` is basically a rectangular array of data, with names for rows
and columns, rather like a spreadsheet.  In fact, one important thing
one can do with DataFrames is to import data *from* spreadsheets.



In [6]:
import pandas as pd

# Try looking at https://docs.google.com/spreadsheets/d/1ObK5N_5aVXzVHE7ZXWBg0kQvPS3k1enRwsUjhytwh5A in your browser.
SHEET = "https://docs.google.com/spreadsheets/d/1q1ikP1CXCcLf_Tq6VbhoskOYRvn_nDU5MHStIoVzMgA"

# The following line goes online and turns the spreadsheet into a pandas DataFrame:
df = pd.read_csv(SHEET + "/export?format=csv")

# This line will show us only the first five rows of data. Try removing .head() to see the full list of items.
# Guess what happens if you replace .head() with .tail(). Try it out!
df.head()

Unnamed: 0,Food,Quantity,Units,Price,Date,Location,NDB
0,"Milk, 2% fat",1.0,gallon,4.99,[2019-09-14 Sat],"Monterey Market, Berkeley",45226447
1,"Eggs, extra large",12.0,xl_egg,3.59,[2019-09-14 Sat],"Monterey Market, Berkeley",45208918
2,Crumpets,6.0,crumpet,3.19,[2019-09-14 Sat],"Monterey Market, Berkeley",45324369
3,Bananas,1.0,pound,3.15,[2019-09-14 Sat],"Monterey Market, Berkeley",9040
4,"Carrots, Organic",2.0,pound,2.29,[2019-09-14 Sat],"Monterey Market, Berkeley",11124


If this worked (!) you should be able to see some data from a recent
shopping trip of mine.  What are the different variables available in
the DataFrame `df`?  They correspond to the *columns* of the spreadsheet.



In [7]:
df.columns

Index(['Food', 'Quantity', 'Units', 'Price', 'Date', 'Location', 'NDB'], dtype='object')

Now, what else can we do?  Let's figure out how much my total grocery
bill was:



In [8]:
df['Price'].sum()

72.22

Let's say I'm on a budget. Naturally, we'd want to identify the item(s)
I'm spending the most on. We can sort the values to investigate further.



In [9]:
# Note that we're indicating we want to sort by the 'Price' column and specify that it should be in descending order.
df.sort_values(by='Price', ascending=False).head()

Unnamed: 0,Food,Quantity,Units,Price,Date,Location,NDB
10,"Mushrooms, King Oyster",1.0,pound,12.0,[2019-09-14 Sat],"Monterey Market, Berkeley",45218868
12,Orange juice,0.5,gallon,8.98,[2019-09-14 Sat],"Monterey Market, Berkeley",45213207
6,"Endive, Red",1.26,pound,6.27,[2019-09-14 Sat],"Monterey Market, Berkeley",11213
9,"Lettuce, Little Gem",1.0,pound,5.98,[2019-09-14 Sat],"Monterey Market, Berkeley",45276886
0,"Milk, 2% fat",1.0,gallon,4.99,[2019-09-14 Sat],"Monterey Market, Berkeley",45226447


Everything here looks straightforward, but let's take a closer look at
Red Endive and calculate the price per pound to make comparison easier.



In [10]:
# This line selects the 7th item in the dataframe (note the index number is 6 because we start counting at 0 when we use Python)
# and selects the 'Price' value for this particular item. It divides it by 'Quantity' to get the price per pound.
df.iloc[6]['Price']/df.iloc[6]['Quantity']

4.976190476190475

You'll find throughout the semester that unit price is a pretty useful statistic
to calculate. Let's do it for all the items on this grocery list. Thankfully we
don't have to do this one by one.



In [11]:
# This line creates a new column in our dataframe named 'Unit Price' and populates each row with the respective price value 
# divided by the quantity value.
df['Unit Price'] = df['Price']/df['Quantity']
df.head()

Unnamed: 0,Food,Quantity,Units,Price,Date,Location,NDB,Unit Price
0,"Milk, 2% fat",1.0,gallon,4.99,[2019-09-14 Sat],"Monterey Market, Berkeley",45226447,4.99
1,"Eggs, extra large",12.0,xl_egg,3.59,[2019-09-14 Sat],"Monterey Market, Berkeley",45208918,0.299167
2,Crumpets,6.0,crumpet,3.19,[2019-09-14 Sat],"Monterey Market, Berkeley",45324369,0.531667
3,Bananas,1.0,pound,3.15,[2019-09-14 Sat],"Monterey Market, Berkeley",9040,3.15
4,"Carrots, Organic",2.0,pound,2.29,[2019-09-14 Sat],"Monterey Market, Berkeley",11124,1.145
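
Column arithmetic of this kind works element by element. The sketch below uses a small hypothetical table (`toy`, with made-up contents loosely modeled on the shopping list) and checks the one-line version against a row-by-row calculation.

```python
import pandas as pd

# Hypothetical stand-in for the shopping data.
toy = pd.DataFrame({
    'Food': ["Eggs", "Endive", "Potato"],
    'Quantity': [12.0, 1.26, 10.0],
    'Price': [3.59, 6.27, 2.98],
})

# Dividing one column by another pairs up the values row by row.
toy['Unit Price'] = toy['Price'] / toy['Quantity']

# The same numbers computed one row at a time, for comparison only.
by_hand = [toy.iloc[i]['Price'] / toy.iloc[i]['Quantity']
           for i in range(len(toy))]

print(toy['Unit Price'].round(3).tolist())
print([round(x, 3) for x in by_hand])
```

The loop is there only to show what the one-line version computes; in practice the column expression is shorter and typically much faster.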


Almost there! Let's pare down our dataframe to look more friendly to the eye. We
don't want to see the following columns: Date, Location, NDB. Also, we only want to see
the first few rows of the dataframe. 

In the previous blocks, we used `.iloc`, which stands for index (or integer) location: we used an integer to specify which
row we wanted. In this section, we'll use `.loc`, which selects rows and columns by label instead. For extra practice, try to
achieve the same result but by using `.iloc` instead.



In [12]:
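# Note that in both the .iloc and .loc syntax, the first set of parameters refers to rows and the second set refers to columns.
# Unlike ordinary Python slices, a .loc slice includes its end label, so 0:5 returns rows 0 through 5.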
df.loc[0:5, ['Food', 'Quantity', 'Units', 'Price', 'Unit Price']]

Unnamed: 0,Food,Quantity,Units,Price,Unit Price
0,"Milk, 2% fat",1.0,gallon,4.99,4.99
1,"Eggs, extra large",12.0,xl_egg,3.59,0.299167
2,Crumpets,6.0,crumpet,3.19,0.531667
3,Bananas,1.0,pound,3.15,3.15
4,"Carrots, Organic",2.0,pound,2.29,1.145
5,Cauliflower,2.51,pound,4.24,1.689243
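
The difference between label-based and position-based selection is easy to trip over, so here is a minimal sketch on a hypothetical four-row table (`toy` and its contents are invented for the example).

```python
import pandas as pd

# Hypothetical stand-in for the shopping data, with the default 0, 1, 2, ... row labels.
toy = pd.DataFrame({
    'Food': ["Milk", "Eggs", "Crumpets", "Bananas"],
    'Quantity': [1.0, 12.0, 6.0, 1.0],
    'Price': [4.99, 3.59, 3.19, 3.15],
})

# .loc selects by label, and a label slice includes both endpoints.
by_label = toy.loc[0:2, ['Food', 'Price']]

# .iloc selects by integer position, and the stop position is excluded.
by_position = toy.iloc[0:2, [0, 2]]

print(len(by_label))      # rows labeled 0, 1, and 2
print(len(by_position))   # rows at positions 0 and 1
```

With the default integer row labels the two look alike, which is exactly why the inclusive endpoint of `.loc` is easy to miss.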


Here's one last exercise that might be useful. Often you will only want to view data
that fits a certain criterion. In this case, let's only look at items where the unit price
is less than 1.



In [13]:
# This line will return all rows in the dataframe where the Unit Price is < 1. Using what we've covered prior,
# modify the view of this dataframe to only include Food and Unit Price.
df[df['Unit Price'] < 1]

Unnamed: 0,Food,Quantity,Units,Price,Date,Location,NDB,Unit Price
1,"Eggs, extra large",12.0,xl_egg,3.59,[2019-09-14 Sat],"Monterey Market, Berkeley",45208918,0.299167
2,Crumpets,6.0,crumpet,3.19,[2019-09-14 Sat],"Monterey Market, Berkeley",45324369,0.531667
11,"Onion, yellow",1.0,pound,0.39,[2019-09-14 Sat],"Monterey Market, Berkeley",45339306,0.39
16,"Potato, russet",10.0,pound,2.98,[2019-09-14 Sat],"Monterey Market, Berkeley",45364251,0.298
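
It may help to see what the filter is doing under the hood. In this sketch (again on a hypothetical `toy` table, with illustrative names), the comparison first produces a column of True/False values, and indexing with that column keeps only the True rows.

```python
import pandas as pd

# Hypothetical stand-in for the shopping data.
toy = pd.DataFrame({
    'Food': ["Milk", "Eggs", "Crumpets", "Bananas"],
    'Quantity': [1.0, 12.0, 6.0, 1.0],
    'Price': [4.99, 3.59, 3.19, 3.15],
})
toy['Unit Price'] = toy['Price'] / toy['Quantity']

# The comparison yields one True/False value per row ...
mask = toy['Unit Price'] < 1

# ... and indexing with that mask keeps only the rows marked True.
cheap = toy[mask]

# Conditions combine with & (and) or | (or); each one needs its own parentheses.
cheap_small = toy[(toy['Unit Price'] < 1) & (toy['Quantity'] < 10)]

print(mask.tolist())
print(cheap['Food'].tolist())
print(cheap_small['Food'].tolist())
```

The filtered rows keep their original row labels, which is why the index in a filtered table can skip numbers.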


Extra things to refresh that may be helpful for Project 1: basic visualizations, datatypes, index, joins.



#### Final words



Throughout this class, you will be exposed to a variety of Python modules and tools, and the data that you work with
may or may not be clean. In any case, learning how to find and use online documentation and resources is a
valuable skill that will benefit you greatly in this course and beyond. Be sure to use our course `piazza`
for any questions you might have; there's a good chance a peer has a similar question or already has the answer.
As the semester goes on, course staff will update the "Useful Links/Resources" post with any outside Python resources 
that may be helpful for the whole class.

