# Data in Python

## Stats 141B

## Lecture 6

## Jupyter notebook

- ``jupyter notebook`` from terminal
- runs ipython in background
- notebook consists of
 - markdown cells
 - code cells
- code cells act like ipython prompt: tab completion, magic commands, etc.
- Markdown is a lightweight markup language for producing formatted text
 - e.g. `#` makes header text, `$ \alpha $` makes LaTeX equations as in $\alpha$
 - see [markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

## What is Data?

- Computation takes Data as input
- Statistics are computed from Data

## Data Types in Python

- Different data types have different common operations
- Python sequence data types (lists, tuples, etc.) are iterable, so they can be used directly in ``for`` loops
- Python is good for string manipulation
- Choose the data type (e.g. list vs. dict) based on what you plan to do
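As a rough illustration of the last point, the sketch below handles the same records two ways: a list for looping over every row, and a dict for lookup by key. The three records are copied from the State of the Union sample used later in this lecture; the variable names are chosen only for this example.

```python
# Three (name, words) records taken from the State of the Union sample
records = [
    ("George Washington", 1089),
    ("George Washington", 1401),
    ("John Adams", 2063),
]

# A list keeps order and is natural for looping over every row
total_words = 0
for name, words in records:
    total_words += words

# A dict is natural for lookup by key, here total words per president
words_by_name = {}
for name, words in records:
    words_by_name[name] = words_by_name.get(name, 0) + words

print(total_words)                         # 4553
print(words_by_name["George Washington"])  # 2490
```

If the plan is to visit every record in order, the list is enough; if the plan is to ask "how many words did this president speak?", the dict saves a search each time.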

## Some Python rules 

- indentation is required!
- a code block is initiated by a line ending in a colon, e.g. ``if header:``
- methods are called on objects with dot notation, e.g. the file object method ``file.readline()``
- ``+`` is defined for strings (operator overloading): ``loc = 'CollegeScorecard/' + filename``
- for loops act on iterables like lists: ``for filename in cs_files:``
- print to screen with ``print('header differs for ' + filename)``; in Python 3 ``print`` is a function, unlike the ``print`` statement in Python 2.7
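The rules above can be seen together in one small sketch. The file names and header strings here are made up for illustration and kept in memory, so nothing is read from disk; only ``cs_files``, ``filename``, ``loc``, and ``header`` echo names from the bullets.

```python
# Hypothetical file names and header lines, invented for this example
cs_headers = {
    "scorecard_a.csv": "id,name,city",
    "scorecard_b.csv": "id,name,city",
    "scorecard_c.csv": "id,name,state",
}
cs_files = list(cs_headers)          # list of file names to loop over
header = cs_headers[cs_files[0]]     # reference header from the first file

differing = []
for filename in cs_files:                        # for loop over an iterable
    loc = 'CollegeScorecard/' + filename         # + concatenates strings
    if cs_headers[filename] != header:           # colon starts an indented block
        print('header differs for ' + filename)  # print is a function in Python 3
        differing.append(loc)
```

Only the third made-up file has a different header, so the loop prints one line and ``differing`` ends up holding one path.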

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [None]:
%run some_script.py

In [None]:
%save temp.py 1-2
# it's magic!

## IPython - text editor workflow

- run simple commands in ipython
- ``%save`` to temp file, move to python file and clean up
- run longer version (on full data) or save for reproducibility
- run in ipython with ``%run``
- run in terminal with ``python scriptname.py``

In [4]:
! head data/state_of_union.csv

date,name,words
"January 8, 1790",George Washington,1089
"December 8, 1790",George Washington,1401
"October 25, 1791",George Washington,2302
"November 6, 1792",George Washington,2101
"December 3, 1793",George Washington,1968
"November 19, 1794",George Washington,2918
"December 8, 1795",George Washington,1989
"December 7, 1796",George Washington,2871
"November 22, 1797",John Adams,2063


In [5]:
datafile = "data/state_of_union.csv"

with open(datafile,'r') as sou:
    header = sou.readline()
    sou_line = sou.readline()

In [6]:
print(header) # print header string
print(sou_line)

date,name,words

"January 8, 1790",George Washington,1089



In [7]:
type(header) # type() returns the type of an object

str

In [8]:
header = header.strip() # remove whitespace in front/back
header

'date,name,words'

In [9]:
header = header.split(',') # split the string by ,
header

['date', 'name', 'words']

In [10]:
header = tuple(header) # convert list to tuple
header

('date', 'name', 'words')

### Built-in data structures

**Sequences:**
- list
- tuple
- string

**Numeric:**
- boolean
- integer
- float

**Mappings: dictionaries**

**Exceptions, Classes**
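A quick way to get a feel for these built-in types is to call ``type`` on a few literals. The values below are illustrative, borrowed loosely from the State of the Union data; ``examples`` and ``type_names`` are names chosen for this sketch.

```python
examples = [
    [1089, 1401],               # list
    ('date', 'name', 'words'),  # tuple
    'George Washington',        # string
    True,                       # boolean
    1089,                       # integer
    1089 / 2,                   # float (true division always gives a float)
    {'name': 'John Adams'},     # dictionary
]

# type(obj).__name__ is the short name of each object's type
type_names = [type(obj).__name__ for obj in examples]
print(type_names)
# ['list', 'tuple', 'str', 'bool', 'int', 'float', 'dict']
```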

In [11]:
print('"January 8, 1790",George Washington,1089') # first line

def proc_sou(line):
    """Process one State of the Union line (in the future, use pandas)."""
    line_date, line_rest = line.strip().split('",')  # strip, then split between date and name
    line_date = line_date.strip('"')  # remove the leading "
    name, words = line_rest.split(',')  # split name and word count
    return line_date, name, int(words)  # convert word count to int

"January 8, 1790",George Washington,1089


In [12]:
with open(datafile,'r') as sou:
    head_temp = sou.readline() # skip the first line
    sou_data = [proc_sou(line) for line in sou] # list comp - return to this
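The hand-written splitting in ``proc_sou`` works for this file, but quoted fields like the date are exactly what the standard library ``csv`` module is designed for. The sketch below is one possible alternative, not the approach used in this lecture; it reads from an in-memory ``io.StringIO`` holding lines copied from the sample above, and ``sample``, ``csv_header``, and ``rows`` are names chosen for illustration.

```python
import csv
import io

# Header plus two lines copied from the sample file; StringIO stands in for an open file
sample = io.StringIO(
    'date,name,words\n'
    '"January 8, 1790",George Washington,1089\n'
    '"November 22, 1797",John Adams,2063\n'
)

reader = csv.reader(sample)       # handles the quoted date field for us
csv_header = tuple(next(reader))  # first row is the header
rows = [(date, name, int(words)) for date, name, words in reader]

print(csv_header)
print(rows)
```

With a real file, the same code should work after replacing ``sample`` with the file object from ``open(datafile, 'r', newline='')``.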

In [14]:
sou_data

[('January 8, 1790', 'George Washington', 1089),
 ('December 8, 1790', 'George Washington', 1401),
 ('October 25, 1791', 'George Washington', 2302),
 ('November 6, 1792', 'George Washington', 2101),
 ('December 3, 1793', 'George Washington', 1968),
 ('November 19, 1794', 'George Washington', 2918),
 ('December 8, 1795', 'George Washington', 1989),
 ('December 7, 1796', 'George Washington', 2871),
 ('November 22, 1797', 'John Adams', 2063),
 ('December 8, 1798', 'John Adams', 2218),
 ('December 3, 1799', 'John Adams', 1505),
 ('November 22, 1800', 'John Adams', 1372),
 ('December 8, 1801', 'Thomas Jefferson', 3224),
 ('December 15, 1802', 'Thomas Jefferson', 2197),
 ('October 17, 1803', 'Thomas Jefferson', 2263)]
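The list comprehension that built ``sou_data`` is shorthand for a loop that appends to a list. The sketch below shows the two forms side by side on two lines from the sample; ``word_count`` is a hypothetical helper written only for this comparison.

```python
lines = [
    '"January 8, 1790",George Washington,1089\n',
    '"December 8, 1790",George Washington,1401\n',
]

def word_count(line):
    """Hypothetical helper: return the last comma-separated field as an int."""
    return int(line.strip().split(',')[-1])

# List comprehension: build the list in one expression
counts_comp = [word_count(line) for line in lines]

# Equivalent explicit loop
counts_loop = []
for line in lines:
    counts_loop.append(word_count(line))

print(counts_comp == counts_loop)  # True
```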

In [15]:
head_temp

'date,name,words\n'

In [13]:
dates, names, words = zip(*sou_data) # zip and tuple unpacking

In [16]:
words

(1089,
 1401,
 2302,
 2101,
 1968,
 2918,
 1989,
 2871,
 2063,
 2218,
 1505,
 1372,
 3224,
 2197,
 2263)

### Python [built-in functions](https://docs.python.org/3/library/functions.html)

- zip: combines multiple sequences into sequence of tuples
- range(i,j): iterable of integers from i to j-1
- format: insert variables into string
- print
- iter, next: create an iterator from an iterable and step through it
- all, any, max, min, len
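A short tour of the functions listed above, using three values from the State of the Union sample. The variable names are chosen for this sketch only.

```python
pres = ['George Washington', 'John Adams', 'Thomas Jefferson']
counts = [1089, 2063, 3224]

pairs = list(zip(pres, counts))    # sequence of (name, count) tuples
indices = list(range(0, 3))        # integers 0, 1, 2
message = '{} spoke {} words'.format(pres[0], counts[0])

it = iter(counts)                  # make an iterator from the list
first = next(it)                   # 1089
second = next(it)                  # 2063

all_positive = all(c > 0 for c in counts)      # True
any_over_3000 = any(c > 3000 for c in counts)  # True
summary = (min(counts), max(counts), len(counts))

print(message)   # George Washington spoke 1089 words
print(summary)   # (1089, 3224, 3)
```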

### Tuple

- sequence whose elements can be of mixed types
- tuple unpacking is a neat trick
- immutable
- think of as a single record of multiple variables

In [12]:
header

('date', 'name', 'words')

In [13]:
x, y = 1, 2 #assign values
y, x = x, y #swap values
r, (x, y) = (x**2 + y**2)**0.5, (y, x) #nested value assignment and swapping
print(x, y, r)

1 2 2.23606797749979


In [14]:
header

('date', 'name', 'words')

In [15]:
header[1] = 'pres' # cannot modify tuple

TypeError: 'tuple' object does not support item assignment

In [16]:
header = (header[0], 'pres', header[2]) # reassign tuple
header

('date', 'pres', 'words')

- zip: combines multiple sequences into sequence of tuples
- unpack a sequence into separate function arguments with ``*``
- `zip(*arg)` unzips
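The round trip between rows and columns can be sketched as follows, using two records from the sample data. The names ``sample_rows``, ``dates_col``, ``names_col``, and ``words_col`` are chosen for illustration.

```python
sample_rows = [
    ('January 8, 1790', 'George Washington', 1089),
    ('November 22, 1797', 'John Adams', 2063),
]

# * unpacks the list so each row becomes a separate argument to zip,
# which regroups the values by position: rows in, columns out
dates_col, names_col, words_col = zip(*sample_rows)

# Zipping the columns back together recovers the original rows
rezipped = list(zip(dates_col, names_col, words_col))

# * also works in any function call, e.g. str.format
line = '{}: {} ({} words)'.format(*sample_rows[0])

print(words_col)                # (1089, 2063)
print(rezipped == sample_rows)  # True
print(line)                     # January 8, 1790: George Washington (1089 words)
```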

In [17]:
dates, names, words = zip(*sou_data) 
words = list(words)

In [18]:
for n,w in zip(names, words): # zip combines lists/iterables
    print("{} spoke {} words".format(n,w)) # format string method

George Washington spoke 1089 words
George Washington spoke 1401 words
George Washington spoke 2302 words
George Washington spoke 2101 words
George Washington spoke 1968 words
George Washington spoke 2918 words
George Washington spoke 1989 words
George Washington spoke 2871 words
John Adams spoke 2063 words
John Adams spoke 2218 words
John Adams spoke 1505 words
John Adams spoke 1372 words
Thomas Jefferson spoke 3224 words
Thomas Jefferson spoke 2197 words
Thomas Jefferson spoke 2263 words


### Lists

- sequences
- appendable, mutable
- slice-able
- think of as many rows of a dataset
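Before looking at slicing, here is a minimal sketch of what "appendable" and "mutable" mean in practice. The numbers are the first few word counts from the sample; ``counts`` and ``snapshot`` are names chosen for this example.

```python
counts = [1089, 1401, 2302]

counts.append(2101)    # appendable: the list grows in place
counts[0] = 0          # mutable: an element can be reassigned

snapshot = counts[:]   # a full slice makes a (shallow) copy
snapshot[1] = -1       # changing the copy leaves the original alone

print(counts)    # [0, 1401, 2302, 2101]
print(snapshot)  # [0, -1, 2302, 2101]
```

Contrast this with the tuple example above, where item assignment raised a ``TypeError``.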

In [19]:
print(words)

[1089, 1401, 2302, 2101, 1968, 2918, 1989, 2871, 2063, 2218, 1505, 1372, 3224, 2197, 2263]


In [20]:
words[0] # select single element

1089

In [21]:
words[0:8] # select first 8

[1089, 1401, 2302, 2101, 1968, 2918, 1989, 2871]

In [22]:
words[-1] # select last

2263

In [23]:
words[-3:] # last 3

[3224, 2197, 2263]

In [24]:
words[::4] # every 4th element, starting at index 0

[1089, 1968, 2063, 3224]