# A Reintroduction to Python

If you've been off for the summer (lucky you, a month in Bali sounds lovely), or you haven't worked in Python before or in a while, this is a chance to refamiliarize yourself with the language. 

There is a limitless set of tutorials online. My goal isn't to replace or even do as well as those; rather, the goal is to have a limited set of work that is right in line with today's lecture. 

## Observations and Vectors 

When conceptualizing and writing out research, it is important to keep in mind what or who your fundamental unit of observation is. Suppose that you are doing research on climate change: 

- Is the fundamental unit of observation the daily temperature measurement at a research station? 
- Is it the average across a set of research stations? 
- Is it the monthly average at a single research station? 

Any of these, depending on the question you seek to answer, might be the most appropriate unit of observation. 

By the time you're analyzing your data, you will rely on the fact that you've structured your units of observation in some meaningful way; you've stacked the measurements into a *variable*, perhaps called `daily_temp`. This variable is stored as a vector, which in Python is referred to as a **list**. For the purposes of analysis, thinking at the *vector* level is probably the most useful mode of thought. 

In Python, lists are created within square brackets `[]`. Lists are ordered collections of values. 

In [None]:
[1,2,3]

Because data flying past you is useless, you can store these lists in named objects, known as variables. Assignment into a variable is undertaken using the equals sign, `=`. The named variable goes on the left, and the values that are to be assigned to that variable go on the right. 

In [None]:
x = [1, 2, 3, 4, 6]

In [None]:
x

In [None]:
type(x)

If a vector is one-dimensional, then we can reference a location in that vector with a single index:

- `x[2]` will print the value in the third position
- `x[0]` will print the value in the first position
- `x[:2]` will print the values in the first and second positions (a slice stops just before its end index)
- `x[-1]` will print the first value, starting from the end of the list. Note that counting from the end isn't zero-indexed. 
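As a quick check on the indexing rules above, here is a minimal sketch that reuses the list `x` from the earlier cells. The names `third`, `first`, `first_two`, and `last` are just labels I've made up for illustration.

```python
# The same list that was assigned to x in the cells above.
x = [1, 2, 3, 4, 6]

third = x[2]       # zero-indexed, so index 2 is the third position
first = x[0]       # index 0 is the first position
first_two = x[:2]  # a slice stops *before* its end index
last = x[-1]       # negative indices count back from the end

print(third, first, first_two, last)
# 3 1 [1, 2] 6
```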

For this class, we will spend most of our time working with dataframes, which collect lists of different types in a single object. [pandas](https://pandas.pydata.org) is the de facto choice for manipulating dataframes, though I'll point out the very good work that is covered in [dask](https://dask.org) and developing work with [python datatable](https://datatable.readthedocs.io/en/latest/index.html) (an analogue of R's data.table). 

# Imports 

In [None]:
import pandas as pd 
import numpy as np

If vectors are single-dimensional objects that can be indexed, then a `DataFrame` is just a collection of vectors that can also be indexed. Importantly, once we have more than one vector collected, we've got more than a single dimension that we have to index. This is *ok*, because we can just index in both of these directions. 

Note as well that one of the features that we appreciate about these DataFrames is that they can collect vectors of different types in a way that a matrix cannot. 
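To make the contrast with a matrix concrete, here is a small sketch. The names `m` and `df` are hypothetical and exist only for this illustration; the point is that a NumPy array holds a single type, while a DataFrame keeps one type per column.

```python
import numpy as np
import pandas as pd

# A NumPy array (a matrix) holds a single dtype, so mixing numbers and
# strings coerces every element to a string.
m = np.array([[1, 'a'], [2, 'b']])
print(m.dtype.kind)  # 'U', NumPy's code for unicode strings

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({'value': [1, 2], 'group': ['a', 'b']})
print(df.dtypes)
```

The exact dtype names that pandas reports can differ between versions, so look at which column is numeric rather than at the precise labels.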

In [None]:
d = pd.DataFrame({
      'value': (np.arange(0,20))**2, 
      'group': ['a', 'b', 'c', 'd', 'e'] * 4},
    index = np.arange(1,21)
)

In [None]:
d.head()

We can reference any of the vectors that are stored in the data frame by *indexing* into that object. Column indexing can be accomplished either using dot-notation, or by passing the column name in square brackets. 

In [None]:
d.group.head()

In [None]:
d['group'].head()

Note that by calling the stored DataFrame object we're scoping from the global namespace into the namespace for that object, and then calling for the named column (a pandas Series). 
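One caveat about the two notations is worth a quick sketch. Dot-notation only works when the column name is a valid Python identifier that doesn't collide with something the DataFrame already defines. The frame `df` and its column names below are hypothetical, chosen to trigger both problems.

```python
import pandas as pd

# Hypothetical frame: 'count' collides with a DataFrame method, and
# 'unit price' contains a space.
df = pd.DataFrame({'count': [3, 5], 'unit price': [1.5, 2.0]})

# Dot-notation finds the DataFrame.count method, not the column.
print(callable(df.count))         # True

# Bracket indexing always returns the column.
print(df['count'].tolist())       # [3, 5]
print(df['unit price'].tolist())  # [1.5, 2.0]
```

So bracket indexing is the safer habit whenever you don't control the column names.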

Rather than just showing the `head` of the result of this query, we *could* also pull this based on location. 

In [None]:
d[0:5].value

In [None]:
d.loc[0:5, 'value']

Though this brings on considerable risk: If the data structure should change in the future, you'll pull from a position that might have different information than you think it does. 

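(Note that we've used the `.loc` indexer, which selects by index *label*; its counterpart, `.iloc`, selects by integer position. Here the labels run from 1 to 20 in order, so the slice `0:5` returns the first five rows.) 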
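Because labels and positions are easy to mix up, here is a minimal sketch of the difference. It assumes a hypothetical copy of the frame, called `demo`, so that nothing in `d` is touched. `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for d, rebuilt so this block runs on its own.
demo = pd.DataFrame({
    'value': np.arange(0, 20) ** 2,
    'group': ['a', 'b', 'c', 'd', 'e'] * 4},
    index=np.arange(1, 21)
)

by_label = demo.loc[1:5, 'value']         # labels 1 through 5, end included
by_position = demo['value'].iloc[0:5]     # positions 0 through 4, end excluded

print(by_label.tolist())      # [0, 1, 4, 9, 16]
print(by_position.tolist())   # [0, 1, 4, 9, 16]

# The two disagree as soon as a label and a position differ.
print(demo.loc[5, 'value'])       # 16, the row labelled 5
print(demo['value'].iloc[5])      # 25, the sixth row
```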

Here's an example of this risk: Suppose you want to pull the values that are less than 20 from the DataFrame `d`. Presently, the DataFrame is sorted so that a `.loc` indexer that pulls the first five values will catch the result you seek. But if new data flowed into the frame (or out of it), that same index might no longer meet your goal. 

For example, suppose the second observation was remeasured and came out to be 21. 

In [None]:
d.loc[2, 'value'] = 21

Now, your `.loc` based index isn't going to accomplish what you'd like it to. 

In [None]:
d.loc[0:5, 'value']

Instead, it is safer to pass the test you'd like to apply, and let that test index the rows for you. 

In [None]:
d.loc[d.value < 20, 'value']
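It can help to look at what that test actually produces before it is handed to `.loc`. The sketch below assumes a hypothetical one-column copy of the frame, called `demo`, with the remeasured second observation already applied. The test is itself a Series of `True`/`False` values, one per row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for d, with the remeasured second observation.
demo = pd.DataFrame({'value': np.arange(0, 20) ** 2}, index=np.arange(1, 21))
demo.loc[2, 'value'] = 21

mask = demo['value'] < 20             # one True/False per row
print(mask.head(6).tolist())          # [True, False, True, True, True, False]

# .loc keeps only the rows where the mask is True.
print(demo.loc[mask, 'value'].tolist())   # [0, 4, 9, 16]
```

Since the mask is recomputed from the data every time, it keeps giving the right rows even after values change or rows are reordered.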

# Some light practice

With the following data: 

In [None]:
d = pd.DataFrame({
    'name': ['alex', 'becky', 'carl', 'doris', 'eve', 'frank', 'gertrude', 
             'harry', 'ingrid', 'jerry', 'kerri', 'lou', 'mary', 'neal',
             'olivia', 'patrick', 'qing', 'roy', 'samantha', 'teddy', 
             'usha', 'victor', 'wendy', 'xander', 'yvette', 'zeke'], 
    'age': np.random.choice([23,24,25,26,27,28,29,30], 26), 
    'focus': np.random.choice(['data science', 'product management', 'phd'], 26)
})

Do the following: 

- pull the first ten rows, and all the columns. 

- pull the last ten rows, and only the name and focus columns. 

- pull all the columns, for the rows where people are an odd age. 

- pull all the columns, for the rows where people are an odd age and whose names start with a vowel. 

- pull all the columns, for the rows where people are an odd age and whose name's second letter is a vowel.

# Here come the answers

But, don't look at them right away. 
.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.



Do the following: 

- pull the first ten rows, and all the columns. 

In [None]:
d.loc[0:9,:]

- pull the last ten rows, and only the name and focus columns. 

In [None]:
d.loc[16:, ['name', 'focus']]

- pull all the columns, for the rows where people are an odd age. 

In [None]:
d.loc[d.age % 2 == 1]

- pull all the columns, for the rows where people are an odd age and whose names start with a vowel. 

In [None]:
d.loc[(d.age % 2 == 1) & (d.name.isin(['alex', 'eve', 'ingrid', 'olivia', 'usha']))]

- pull all the columns, for the rows where people are an odd age and whose name's second letter is a vowel.

In [None]:
# oh snap, you can't just hard code this. 
# as you can see, even the last solution is getting pretty complex inside that indexing call. 
# for this answer, I'll split out a mask for each condition, and then combine them. 

age_mask = d.age % 2 == 1
vowel_mask = d.name.str.get(1).isin(['a', 'e', 'i', 'o', 'u'])

d.loc[age_mask & vowel_mask]
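The same mask pattern would also remove the hard-coded names from the "starts with a vowel" answer. Here is a minimal sketch, assuming a small hypothetical frame called `people` with fixed ages so that the result is reproducible (the practice data draws ages at random).

```python
import pandas as pd

# Hypothetical frame with fixed ages; not the randomly generated practice data.
people = pd.DataFrame({
    'name': ['alex', 'becky', 'eve', 'frank', 'olivia'],
    'age': [23, 25, 24, 27, 29],
})

vowels = ['a', 'e', 'i', 'o', 'u']
age_mask = people['age'] % 2 == 1
# .str.get(0) pulls the first character of every name.
first_letter_mask = people['name'].str.get(0).isin(vowels)

print(people.loc[age_mask & first_letter_mask, 'name'].tolist())
# ['alex', 'olivia']
```

Unlike the list of names, this version keeps working if someone named "ursula" joins the class.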