# Getting to know Python through examples
#### A whirlwind tour of Python

## Nonclinical Biostatistics Conference 2019

### Daniel Chen

### Virginia Tech

#### 2019-06-19

- PhD Student at Virginia Tech
    - Genetics, Bioinformatics, and Computational Biology
- Intern at RStudio
    - Garrett Grolemund 
    - `learnr` + `grader` (`gradethis`)

# This "talk"

- How to get Python
- Introduction to Python via pandas
- Tidy data
- Concatenation
- Functions
- Models

# Python, how to get it?

## Anaconda

https://www.anaconda.com/distribution

- Comes with Python + scientific computing stack (numpy, scikit-learn, pandas)
    - Don't need to set up your compilers
    - Installs as local user (good for admin blocks)
    - `conda` package manager
        - You could use it to install R as well
            - Just have to use conda to install your packages
            - not `install.packages`
- Spyder IDE (similar to RStudio)
- Jupyter (notebooks!)

# Introduction to Python (pandas)

How to compress a 4-hour workshop into 40 minutes

- Loading data
- Slicing data
- Groupby statements

In [114]:
# import pandas to give us the dataframe object
import pandas

In [115]:
# pandas.__version__ gives us the current version of pandas
print('pandas version: ' + pandas.__version__)

# don't use python 2 (probably use 3.6+)
import sys
print('python version: ' + sys.version)

pandas version: 0.24.2
python version: 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0]


## There is a literal countdown for when Python 2.7 will no longer be supported

https://pythonclock.org/

In [116]:
# load dataset using pandas
pandas.read_csv('../data/gapminder.tsv', sep='\t')

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
5,Afghanistan,Asia,1977,38.438,14880372,786.113360
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
9,Afghanistan,Asia,1997,41.763,22227415,635.341351


In [117]:
# save the dataset into a variable, df
df = pandas.read_csv('../data/gapminder.tsv',
                     sep='\t')
df

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
5,Afghanistan,Asia,1977,38.438,14880372,786.113360
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
9,Afghanistan,Asia,1997,41.763,22227415,635.341351


In [118]:
# import pandas using an alias
import pandas as pd

# never import things using * notation
# from pandas import *
# from numpy import *

In [119]:
# before
df = pandas.read_csv('../data/gapminder.tsv', sep='\t')

# now
df = pd.read_csv('../data/gapminder.tsv', sep='\t')

In [120]:
# type is a built-in python function (like `class`)
type(df)

pandas.core.frame.DataFrame

In [121]:
# shape is an "attribute" of a dataframe
# like `dim`
df.shape

(1704, 6)

In [122]:
# info is a "method" of a dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


In [123]:
# you'll learn what is an attribute vs. a method:
# shape is an attribute, so calling it like a method raises an error
df.shape()

TypeError: 'tuple' object is not callable

In [124]:
# head/tail are methods
# (a notebook cell only displays its last expression, so only the tail shows)
df.head()
df.tail()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.44996
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298


There are 3 parts to a dataframe. They are all attributes.

- `columns` (like `names`)
- `index` (like `rownames`)
- `values` (the actual body of the dataframe)
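
As a minimal sketch of these three parts, the snippet below builds a small toy dataframe (hypothetical data, not the gapminder file) with explicit row labels and prints each attribute in turn.

```python
import pandas as pd

# toy data made up for illustration
toy = pd.DataFrame(
    {'country': ['A', 'B'], 'year': [1952, 1957]},
    index=['row1', 'row2'],
)

print(toy.columns)  # the column labels (like `names`)
print(toy.index)    # the row labels (like `rownames`)
print(toy.values)   # the body of the dataframe as a NumPy array
```

None of the three is followed by round brackets, because they are attributes rather than methods.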

In [125]:
print(df.head())
df.columns

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

In [126]:
print(df.head())
df.index

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


RangeIndex(start=0, stop=1704, step=1)

In [None]:
print(df.head())
df.values

In [127]:
# just look at the type of our columns
# similar to: lapply(df, class)
df.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

## Subsetting columns

In [128]:
country = df['country'] # a single column as a Series, like df[, 'country', drop=TRUE]
type(country)

pandas.core.series.Series

In [129]:
# square brackets are used for subsetting
# as well as creating a list
# lists in python are similar to c() + list()

country = df[['country']] # a one-column dataframe, like df[, 'country', drop=FALSE]
type(country)
type(country)

pandas.core.frame.DataFrame
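
The two bracket styles can be compared side by side. This is a small sketch on a hypothetical toy dataframe: a single column name returns a Series, while a list of column names (even a list of one) returns a DataFrame.

```python
import pandas as pd

# toy data made up for illustration
toy = pd.DataFrame({'country': ['A', 'B'], 'year': [1952, 1957]})

as_series = toy['country']           # one name -> Series
as_frame = toy[['country']]          # list with one name -> DataFrame
two_cols = toy[['country', 'year']]  # list with several names -> DataFrame

print(type(as_series))
print(type(as_frame))
print(two_cols.shape)
```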

In [130]:
# dropping column(s)
df = pd.read_csv('../data/gapminder.tsv', sep='\t')
df = df.drop(['continent', 'country'], axis='columns')

## Subsetting rows by "rowname"

In [131]:
# just to reset the data back
df = pd.read_csv('../data/gapminder.tsv', sep='\t')

df.loc[0]

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

In [None]:
df.loc[-1]  # KeyError: loc looks up row labels, and no row is labelled -1

## Python starts counting from 0

## Negative index values count from the end!

## Subsetting rows by "row index" (position)

In [132]:
# get index (position) 0 aka the first row
df.iloc[0]

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object

In [133]:
# get the last row
df.iloc[-1]

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 1703, dtype: object
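
With the default index, labels and positions happen to match, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical toy dataframe whose row labels are deliberately out of order to make the distinction visible.

```python
import pandas as pd

# toy data: the row labels 2, 0, 1 do not match the row positions 0, 1, 2
toy = pd.DataFrame({'x': [10, 20, 30]}, index=[2, 0, 1])

by_label = toy.loc[0, 'x']     # the row whose label is 0
by_position = toy.iloc[0, 0]   # the first row, whatever its label
last_row = toy.iloc[-1, 0]     # negative positions count from the end

print(by_label, by_position, last_row)

# loc does not understand negative positions; -1 would have to be a label
try:
    toy.loc[-1]
except KeyError:
    print('no row is labelled -1')
```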

## Subsetting rows AND columns

In [134]:
# use slicing notation
subset = df.loc[:, ['year', 'pop']]
subset.head()

Unnamed: 0,year,pop
0,1952,8425333
1,1957,9240934
2,1962,10267083
3,1967,11537966
4,1972,13079460


Left inclusive, right exclusive slicing


```

list:  | 0 | 1 | 2 | 3 |
       |   |   |   |   |
index: 0   1   2   3   4

```

[0:3] = [0, 1, 2]

[ :3] = [0, 1, 2]

In [135]:
# list:  | 0 | 1 | 2 | 3 |
#        |   |   |   |   |
# index: 0   1   2   3   4

l = [0, 1, 2, 3]

print(l[:3])

print(l[1:3])

[0, 1, 2]
[1, 2]
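
One caveat worth a quick sketch: positional slicing with `iloc` follows the same left inclusive, right exclusive rule as a list, but label slicing with `loc` includes both endpoints. The toy dataframe below is made up for illustration.

```python
import pandas as pd

# toy data with the default 0..3 index
toy = pd.DataFrame({'x': [10, 20, 30, 40]})

by_position = toy.iloc[0:3]  # positions 0, 1, 2 (right end excluded)
by_label = toy.loc[0:3]      # labels 0, 1, 2, 3 (right end included)

print(len(by_position))
print(len(by_label))
```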


In [136]:
subset = df.iloc[:, [2, 4]]
subset.head()

Unnamed: 0,year,pop
0,1952,8425333
1,1957,9240934
2,1962,10267083
3,1967,11537966
4,1972,13079460


## Boolean subsetting

In [137]:
df.loc[df['country'] == 'United States']

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1608,United States,Americas,1952,68.44,157553000,13990.48208
1609,United States,Americas,1957,69.49,171984000,14847.12712
1610,United States,Americas,1962,70.21,186538000,16173.14586
1611,United States,Americas,1967,70.76,198712000,19530.36557
1612,United States,Americas,1972,71.34,209896000,21806.03594
1613,United States,Americas,1977,73.38,220239000,24072.63213
1614,United States,Americas,1982,74.65,232187835,25009.55914
1615,United States,Americas,1987,75.02,242803533,29884.35041
1616,United States,Americas,1992,76.09,256894189,32003.93224
1617,United States,Americas,1997,76.81,272911760,35767.43303


In [138]:
# element-wise comparisons use & and | instead of and/or;
# wrap each condition in round brackets
df.loc[(df['country'] == 'United States') & (df['year'] == 1982)]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1614,United States,Americas,1982,74.65,232187835,25009.55914
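
The same pattern extends to "or" (`|`) and "not" (`~`). The following is a small sketch on hypothetical toy data; each condition sits in its own round brackets because `&`, `|`, and `~` bind more tightly than `==`.

```python
import pandas as pd

# toy data made up for illustration
toy = pd.DataFrame({
    'country': ['United States', 'United States', 'Canada'],
    'year': [1982, 1987, 1982],
})

both = toy.loc[(toy['country'] == 'United States') & (toy['year'] == 1982)]
either = toy.loc[(toy['country'] == 'Canada') | (toy['year'] == 1987)]
not_us = toy.loc[~(toy['country'] == 'United States')]

print(both)
print(either)
print(not_us)
```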


## Group by statements

In [139]:
# for each year, get the lifeExp column, and calculate the mean
df.groupby('year')['lifeExp'].mean()

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

In [None]:
# can use any function in the groupby statement with `agg`

import numpy as np
df.groupby('year')['lifeExp'].agg(np.mean)
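
`agg` also accepts a function you write yourself. As a sketch, the hypothetical helper `value_range` below is applied per group on made-up toy data, first on its own and then alongside a built-in aggregation named by string.

```python
import pandas as pd

# toy data made up for illustration
toy = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'lifeExp': [30.0, 40.0, 35.0, 45.0],
})

def value_range(x):
    """Spread of the values within one group (hypothetical helper)."""
    return x.max() - x.min()

# a single custom function
spread = toy.groupby('year')['lifeExp'].agg(value_range)

# several aggregations at once: one column per function
summary = toy.groupby('year')['lifeExp'].agg(['mean', value_range])

print(spread)
print(summary)
```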

In [140]:
# these are pipes!

# note the round brackets beginning and ending the statement
(df
 .groupby(['year', 'continent'])
 [['lifeExp', 'gdpPercap']]
 .agg(np.mean)
 .reset_index() # flatten the dataset to remove hierarchical index
 .sample(10)
)

Unnamed: 0,year,continent,lifeExp,gdpPercap
38,1987,Europe,73.642167,17214.310727
11,1962,Americas,58.39876,4901.54187
1,1952,Americas,53.27984,4079.062552
12,1962,Asia,51.563223,5729.369625
54,2002,Oceania,79.74,26938.77804
3,1952,Europe,64.4085,5661.057435
25,1977,Africa,49.580423,2585.938508
43,1992,Europe,74.4401,17061.568084
56,2007,Americas,73.60812,11003.031625
57,2007,Asia,70.728485,12473.02687


# Tidy data

"Tidy Data" - Hadley Wickham (https://vita.had.co.nz/papers/tidy-data.pdf)

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.


## Targets for cleaning your data

Tidy:

- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.

Normalize:

- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.

In [141]:
pew = pd.read_csv('../data/pew.csv')
pew.head()

Unnamed: 0,religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k,$75-100k,$100-150k,>150k,Don't know/refused
0,Agnostic,27,34,60,81,76,137,122,109,84,96
1,Atheist,12,27,37,52,35,70,73,59,74,76
2,Buddhist,27,21,30,34,33,58,62,39,53,54
3,Catholic,418,617,732,670,638,1116,949,792,633,1489
4,Don’t know/refused,15,14,15,11,10,35,21,17,18,116


In [142]:
pew_tidy = pew.melt(id_vars='religion',
                    var_name='income',
                    value_name='count')
pew_tidy.sample(10)

Unnamed: 0,religion,income,count
87,Other Faiths,$40-50k,49
39,Catholic,$20-30k,732
63,Jewish,$30-40k,25
89,Unaffiliated,$40-50k,341
112,Don’t know/refused,$75-100k,21
62,Jehovah's Witness,$30-40k,24
71,Unaffiliated,$30-40k,365
90,Agnostic,$50-75k,137
88,Other World Religions,$40-50k,2
105,Other Faiths,$50-75k,63


In [143]:
billboard = pd.read_csv('../data/billboard.csv')
billboard.head()

Unnamed: 0,year,artist,track,time,date.entered,wk1,wk2,wk3,wk4,wk5,...,wk67,wk68,wk69,wk70,wk71,wk72,wk73,wk74,wk75,wk76
0,2000,2 Pac,Baby Don't Cry (Keep...,4:22,2000-02-26,87,82.0,72.0,77.0,87.0,...,,,,,,,,,,
1,2000,2Ge+her,The Hardest Part Of ...,3:15,2000-09-02,91,87.0,92.0,,,...,,,,,,,,,,
2,2000,3 Doors Down,Kryptonite,3:53,2000-04-08,81,70.0,68.0,67.0,66.0,...,,,,,,,,,,
3,2000,3 Doors Down,Loser,4:24,2000-10-21,76,76.0,72.0,69.0,67.0,...,,,,,,,,,,
4,2000,504 Boyz,Wobble Wobble,3:35,2000-04-15,57,34.0,25.0,17.0,17.0,...,,,,,,,,,,


In [144]:
billboard_tidy = billboard.melt(
    id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
    value_name='rank',
    var_name='week')
billboard_tidy.sample(10)

Unnamed: 0,year,artist,track,time,date.entered,week,rank
1743,2000,Kelis,Caught Out There,4:09,1999-12-04,wk6,54.0
4581,2000,Jay-Z,Anything,3:41,2000-02-26,wk15,
14190,2000,Rascal Flatts,Prayin' For Daylight,3:36,2000-05-06,wk45,
7272,2000,"Vassar, Phil",Carlene,4:07,2000-03-04,wk23,
18527,2000,Jagged Edge,Let's Get Married,4:23,2000-05-06,wk59,
964,2000,"Aguilera, Christina",What A Girl Wants,3:18,1999-11-27,wk4,18.0
11502,2000,Eiffel 65,Blue,3:29,1999-12-11,wk37,
21106,2000,Lox,"Ryde or Die, Chick",3:56,2000-03-18,wk67,
1636,2000,"Carey, Mariah",Crybaby,5:19,2000-06-24,wk6,90.0
11823,2000,Eminem,The Way I Am,4:40,2000-08-26,wk38,


In [145]:
# pipe syntax
billboard_tidy = (billboard
    .melt(id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
              value_name='rank',
              var_name='week')
    .groupby('artist')['rank']
    .mean()
)

In [None]:
ebola = pd.read_csv('../data/country_timeseries.csv')
ebola.tail()

In [146]:
ebola_long = ebola.melt(id_vars=['Date', 'Day'],
                        var_name='cd_country', value_name='count')
ebola_long.head()

Unnamed: 0,Date,Day,cd_country,count
0,1/5/2015,289,Cases_Guinea,2776.0
1,1/4/2015,288,Cases_Guinea,2775.0
2,1/3/2015,287,Cases_Guinea,2769.0
3,1/2/2015,286,Cases_Guinea,
4,12/31/2014,284,Cases_Guinea,2730.0


In [147]:
var_split_df = ebola_long['cd_country'].str.split('_', expand=True)
var_split_df.head()

Unnamed: 0,0,1
0,Cases,Guinea
1,Cases,Guinea
2,Cases,Guinea
3,Cases,Guinea
4,Cases,Guinea


In [148]:
ebola_long[['case2', 'country2']] = var_split_df
ebola_long.head()

Unnamed: 0,Date,Day,cd_country,count,case2,country2
0,1/5/2015,289,Cases_Guinea,2776.0,Cases,Guinea
1,1/4/2015,288,Cases_Guinea,2775.0,Cases,Guinea
2,1/3/2015,287,Cases_Guinea,2769.0,Cases,Guinea
3,1/2/2015,286,Cases_Guinea,,Cases,Guinea
4,12/31/2014,284,Cases_Guinea,2730.0,Cases,Guinea


In [149]:
weather = pd.read_csv('../data/weather.csv')
weather.head()

Unnamed: 0,id,year,month,element,d1,d2,d3,d4,d5,d6,...,d22,d23,d24,d25,d26,d27,d28,d29,d30,d31
0,MX17004,2010,1,tmax,,,,,,,...,,,,,,,,,27.8,
1,MX17004,2010,1,tmin,,,,,,,...,,,,,,,,,14.5,
2,MX17004,2010,2,tmax,,27.3,24.1,,,,...,,29.9,,,,,,,,
3,MX17004,2010,2,tmin,,14.4,14.4,,,,...,,10.7,,,,,,,,
4,MX17004,2010,3,tmax,,,,,32.1,,...,,,,,,,,,,


In [150]:
weather_long = weather.melt(
    id_vars=['id', 'year', 'month', 'element'],
    var_name='day',
    value_name='temp')

weather_long.head()

Unnamed: 0,id,year,month,element,day,temp
0,MX17004,2010,1,tmax,d1,
1,MX17004,2010,1,tmin,d1,
2,MX17004,2010,2,tmax,d1,
3,MX17004,2010,2,tmin,d1,
4,MX17004,2010,3,tmax,d1,


In [151]:
weather_tidy = (weather_long
    .pivot_table(index=['id', 'year', 'month', 'day'],
                 columns='element',
                 values='temp')
    .reset_index()
)

weather_tidy.sample(10)

element,id,year,month,day,tmax,tmin
3,MX17004,2010,2,d23,29.9,10.7
22,MX17004,2010,10,d14,29.5,13.0
32,MX17004,2010,12,d6,27.8,10.5
25,MX17004,2010,10,d7,28.1,12.9
17,MX17004,2010,8,d13,29.8,16.5
1,MX17004,2010,2,d11,29.7,13.4
18,MX17004,2010,8,d25,29.7,15.6
27,MX17004,2010,11,d5,26.3,7.9
9,MX17004,2010,5,d27,33.2,18.2
0,MX17004,2010,1,d30,27.8,14.5
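
The melt-then-pivot round trip is easier to follow on a table small enough to read at a glance. This sketch uses hypothetical toy data shaped like the weather file: days are spread across columns, and `tmax`/`tmin` are stored as rows.

```python
import pandas as pd

# toy data made up for illustration, shaped like the weather file
wide = pd.DataFrame({
    'id': ['s1', 's1'],
    'element': ['tmax', 'tmin'],
    'd1': [27.8, 14.5],
    'd2': [29.9, 10.7],
})

# step 1: the day columns become rows
long = wide.melt(id_vars=['id', 'element'],
                 var_name='day',
                 value_name='temp')

# step 2: the values in `element` become columns
tidy = (long
    .pivot_table(index=['id', 'day'],
                 columns='element',
                 values='temp')
    .reset_index()
)
tidy.columns.name = None  # drop the leftover 'element' label on the columns

print(tidy)
```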


# Concatenation

Indices are automatically aligned (similar to how `data.table` works)

In [152]:
df1 = pd.read_csv('../data/concat_1.csv')
df2 = pd.read_csv('../data/concat_2.csv')
df3 = pd.read_csv('../data/concat_3.csv')

In [153]:
print(df1)
print(df2)
print(df3)

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
    A   B   C   D
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7
     A    B    C    D
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11


In [154]:
pd.concat([df1, df2, df3]) # look: repeated row names!

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1
2,a2,b2,c2,d2
3,a3,b3,c3,d3
0,a4,b4,c4,d4
1,a5,b5,c5,d5
2,a6,b6,c6,d6
3,a7,b7,c7,d7
0,a8,b8,c8,d8
1,a9,b9,c9,d9
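
If the repeated row names are not wanted, `ignore_index=True` asks `pd.concat` to renumber the rows. A minimal sketch with two hypothetical toy dataframes:

```python
import pandas as pd

# toy data made up for illustration
df_a = pd.DataFrame({'A': ['a0', 'a1']})
df_b = pd.DataFrame({'A': ['a2', 'a3']})

stacked = pd.concat([df_a, df_b])                         # keeps 0, 1, 0, 1
renumbered = pd.concat([df_a, df_b], ignore_index=True)   # renumbers 0..3

print(list(stacked.index))
print(list(renumbered.index))

# with repeated labels, one label can match several rows
print(stacked.loc[0])
```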


In [155]:
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['H', 'F', 'A', 'C']

print(df1)
print(df2)
print(df3)

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
    E   F   G   H
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7
     H    F    A    C
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11


In [157]:
pd.concat([df1, df2, df3]) # future warning is about sorting the columns

# pd.concat([df1, df3, df2], sort=True) # note swapping df2 and df3
# pd.concat([df1, df3, df2], sort=False)

FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


Unnamed: 0,A,B,C,D,E,F,G,H
0,a0,b0,c0,d0,,,,
1,a1,b1,c1,d1,,,,
2,a2,b2,c2,d2,,,,
3,a3,b3,c3,d3,,,,
0,,,,,a4,b4,c4,d4
1,,,,,a5,b5,c5,d5
2,,,,,a6,b6,c6,d6
3,,,,,a7,b7,c7,d7
0,c8,,d8,,,b8,,a8
1,c9,,d9,,,b9,,a9
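
When the column names only partly overlap, `pd.concat` keeps every column by default and fills the gaps with missing values. The sketch below, on hypothetical toy data, contrasts that with `join='inner'`, which keeps only the columns shared by all inputs.

```python
import pandas as pd

# toy data made up for illustration: only column B is shared
left = pd.DataFrame({'A': ['a0'], 'B': ['b0']})
right = pd.DataFrame({'B': ['b1'], 'C': ['c1']})

outer = pd.concat([left, right], sort=False)   # union of columns, NaN where absent
inner = pd.concat([left, right], join='inner') # intersection of columns

print(outer)
print(inner)
```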


# Functions

- Making a function
- Apply a function
- Vectorize functions

Functions can be used in the "tidy data" process

In [158]:
def my_sq(x):
    return x ** 2

assert my_sq(4) == 16

In [159]:
df = pd.DataFrame({ # curly brackets denote a dictionary
    'a': [10, 20, 30],
    'b': [20, 30, 40]
})

df

Unnamed: 0,a,b
0,10,20
1,20,30
2,30,40


In [160]:
df['a'].apply(my_sq)

0    100
1    400
2    900
Name: a, dtype: int64

In [161]:
def my_exp(x, e):
    return x ** e

assert my_exp(4, 2) == 16

In [162]:
df['a'].apply(my_exp, e=4)

0     10000
1    160000
2    810000
Name: a, dtype: int64

In [None]:
import numpy as np

# function to vectorize
def avg_2_mod(x, y):
    """return the average between x and y,
    unless x is 20, then return missing
    """
    if (x == 20):
        return np.nan # missing values are np.nan (np.NaN and np.NAN were removed in NumPy 2.0)
    else:
        return (x + y) / 2

print(avg_2_mod(10, 20))
print(avg_2_mod(20, 30))

In [163]:
avg_2_mod(df['a'], df['b'])

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [164]:
# vectorize the function with np.vectorize
avg_2_mod_vec = np.vectorize(avg_2_mod)
avg_2_mod_vec(df['a'], df['b'])

array([15., nan, 35.])

In [165]:
# a decorator (the @ line) is a function that takes a function as its input
# and returns a modified version of the input function
@np.vectorize
def v_avg_2_mod(x, y):
    if (x == 20):
        return np.nan
    else:
        return (x + y) / 2

v_avg_2_mod(df['a'], df['b'])

array([15., nan, 35.])
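
The `@` syntax is shorthand for passing a function through another function and keeping the result under the original name. As a sketch, the hypothetical decorator `announce` below (not part of numpy or pandas) prints a message before each call; the two ways of applying it are equivalent.

```python
def announce(func):
    """Hypothetical decorator: report each call, then run the original function."""
    def wrapper(*args, **kwargs):
        print('calling ' + func.__name__)
        return func(*args, **kwargs)
    return wrapper

# decorator syntax
@announce
def my_sq(x):
    return x ** 2

# the same thing written out by hand
def my_cube(x):
    return x ** 3

my_cube = announce(my_cube)

print(my_sq(4))
print(my_cube(2))
```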

In [166]:
import numba
@numba.vectorize
def v_avg_2_mod_numba(x, y):
    if (x == 20):
        return np.nan
    else:
        return (x + y) / 2

# you have to use `.values` here to get the array representation
v_avg_2_mod_numba(df['a'].values, df['b'].values)

array([15., nan, 35.])

In [167]:
def avg_2(x, y):
    return (x + y) / 2

In [168]:
%%timeit
avg_2(df['a'], df['b'])

466 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
%%timeit
v_avg_2_mod(df['a'], df['b'])

In [None]:
%%timeit
v_avg_2_mod_numba(df['a'].values, df['b'].values)
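
For a simple if/else like this one, another option is to skip the scalar function entirely and express the condition on whole columns with `np.where`. This is a sketch of that alternative, not the approach used in the cells above; it rebuilds the same small dataframe under the hypothetical name `toy`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [10, 20, 30],
    'b': [20, 30, 40],
})

# where a == 20 use a missing value, otherwise use the average of a and b
result = np.where(toy['a'] == 20, np.nan, (toy['a'] + toy['b']) / 2)

print(result)
```

Both branches are computed for every row and then selected element by element, so this only suits cases where evaluating the unused branch is harmless.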

# Models

- statsmodels
- scikit-learn

## statsmodels

In [170]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

tips = sns.load_dataset('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [171]:
# multiple variable regression
# note: sm.OLS does not add an intercept; use sm.add_constant(exog) to include one
model = sm.OLS(endog=tips['tip'], exog=tips[['total_bill', 'size']])
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,tip,R-squared:,0.902
Model:,OLS,Adj. R-squared:,0.901
Method:,Least Squares,F-statistic:,1117.0
Date:,"Mon, 17 Jun 2019",Prob (F-statistic):,6.16e-123
Time:,17:20:15,Log-Likelihood:,-353.88
No. Observations:,244,AIC:,711.8
Df Residuals:,242,BIC:,718.8
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
total_bill,0.1007,0.009,11.174,0.000,0.083,0.118
size,0.3621,0.071,5.074,0.000,0.222,0.503

0,1,2,3
Omnibus:,12.83,Durbin-Watson:,2.059
Prob(Omnibus):,0.002,Jarque-Bera (JB):,27.284
Skew:,0.179,Prob(JB):,1.19e-06
Kurtosis:,4.599,Cond. No.,23.7
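
The coefficient table above has no intercept row because `sm.OLS` fits only the columns it is given. The effect of leaving the intercept out can be sketched with plain numpy least squares on made-up data; in statsmodels the equivalent of the column of ones is added with `sm.add_constant`.

```python
import numpy as np

# toy data on an exact line: intercept 2, slope 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# without a constant column the fit is forced through the origin
X_no_const = x.reshape(-1, 1)
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# with a column of ones the intercept is estimated as well
X_const = np.column_stack([np.ones_like(x), x])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(coef_no_const)  # slope only, biased by the missing intercept
print(coef_const)     # intercept, then slope
```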


In [172]:
model = smf.ols(formula='tip ~ total_bill + sex + smoker + size',
                data=tips)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,tip,R-squared:,0.469
Model:,OLS,Adj. R-squared:,0.46
Method:,Least Squares,F-statistic:,52.72
Date:,"Mon, 17 Jun 2019",Prob (F-statistic):,8.47e-32
Time:,17:20:15,Log-Likelihood:,-347.78
No. Observations:,244,AIC:,705.6
Df Residuals:,239,BIC:,723.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.6115,0.219,2.793,0.006,0.180,1.043
sex[T.Female],0.0273,0.137,0.198,0.843,-0.243,0.298
smoker[T.No],0.0837,0.138,0.605,0.546,-0.189,0.356
total_bill,0.0941,0.009,9.996,0.000,0.076,0.113
size,0.1803,0.088,2.049,0.042,0.007,0.354

0,1,2,3
Omnibus:,26.891,Durbin-Watson:,2.099
Prob(Omnibus):,0.0,Jarque-Bera (JB):,50.438
Skew:,0.589,Prob(JB):,1.12e-11
Kurtosis:,4.891,Cond. No.,78.5


## scikit-learn

In [174]:
import pandas as pd
import seaborn as sns
from sklearn import linear_model

tips = sns.load_dataset('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [175]:
tips['sex_dummy'] = pd.get_dummies(tips['sex'],
                                   drop_first=True)
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_dummy
0,16.99,1.01,Female,No,Sun,Dinner,2,1
1,10.34,1.66,Male,No,Sun,Dinner,3,0
2,21.01,3.5,Male,No,Sun,Dinner,3,0
3,23.68,3.31,Male,No,Sun,Dinner,2,0
4,24.59,3.61,Female,No,Sun,Dinner,4,1


In [182]:
lr = linear_model.LinearRegression()
X = tips[['total_bill', 'sex_dummy', 'size']]
y = tips['tip']

In [183]:
# fit() returns the fitted model itself, not predictions
fitted = lr.fit(X, y)

In [184]:
fitted.intercept_

0.6554552419246682

In [185]:
fitted.coef_

array([0.09292034, 0.02641868, 0.19258767])

The formula API from statsmodels uses a library called `patsy`

https://patsy.readthedocs.io/en/latest/

You could also use `patsy` directly for your sklearn models, but your mileage may vary.
Most online tutorials and training materials use slicing notation to pass columns into sklearn.

# Main take aways

1. Python starts counting from 0
2. Negative index numbers count backwards
3. Slicing is left inclusive, right exclusive
    - think about fence posts!
4. Everything is an object
    - Functions
    - Methods
    - Attributes
5. `list`: `[ ]`
6. `dict`: `{ }`
7. `tuple`: `( )`
8. `set`: `{ }` (values only, no keys)
9. `import` things with namespaces
10. `True` and `False`
11. Only have `=`, no `<-`
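
Several of these points fit in one short sketch using only built-in Python and made-up values. Note that a set literal uses curly brackets like a dict, but holds values only.

```python
l = [10, 20, 30, 40]   # list: square brackets
d = {'a': 1, 'b': 2}   # dict: curly brackets with key: value pairs
t = (1, 2)             # tuple: round brackets
s = {1, 2, 2, 3}       # set: curly brackets, duplicates collapse

first = l[0]       # counting starts from 0
last = l[-1]       # negative numbers count from the end
middle = l[1:3]    # left inclusive, right exclusive

print(first, last, middle)
print(type(l), type(d), type(t), type(s))
print(len(s))
```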

# Cool extras

In [None]:
# the underscore can be used as a separator in numbers

print(1000000)

print(1_000_000)

In [None]:
# trailing commas in a list are ignored

l1 = [1, 2, 3]
l2 = [1, 2, 3, ]

print(l1)
print(l2)

params = [
    'param1',
    'param2',
    #'param3',
]
print(params)

- `pandas_flavor` library
    - `register_dataframe_method`

- `pyjanitor` library (inspired by the R `janitor` package)

```python
import pandas_flavor as pf

@pf.register_dataframe_method
def my_data_cleaning_function(df, arg1, arg2, ...):
    # Put data processing function here.
    return df
```

use with

```python
df.my_data_cleaning_function()
```
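
If installing an extra library is not an option, pandas' built-in `pipe` method gives a similar chaining style without registering anything. The sketch below uses a hypothetical helper `add_total` and made-up toy data.

```python
import pandas as pd

def add_total(df, col1, col2):
    """Hypothetical cleaning step: add a column holding the sum of two others."""
    out = df.copy()  # leave the caller's dataframe untouched
    out['total'] = out[col1] + out[col2]
    return out

# toy data made up for illustration
toy = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

# pipe passes the dataframe as the first argument of the function
result = toy.pipe(add_total, 'a', 'b')

print(result)
```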

# Thanks!

Twitter: @chendaniely

Slides: https://github.com/chendaniely/ncb-2019-python

# Extra

# Pipelines

Breast Cancer Wisconsin (Diagnostic) dataset (classification)

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

```
Attribute Information:

1) ID number 
2) Diagnosis (M = malignant, B = benign) 
3-32) 

Ten real-valued features are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 
b) texture (standard deviation of gray-scale values) 
c) perimeter 
d) area 
e) smoothness (local variation in radius lengths) 
f) compactness (perimeter^2 / area - 1.0) 
g) concavity (severity of concave portions of the contour) 
h) concave points (number of concave portions of the contour) 
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)


```

```
The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

class:
WDBC-Malignant
WDBC-Benign
```

In [29]:
import pandas as pd

In [None]:
df = pd.read_csv('../data/breast-cancer-wisconsin-data.zip')

In [30]:
df.head()

In [31]:
df.shape

In [32]:
# drop the ID column and the unnamed trailing column
cancer = df.drop(labels=['id', 'Unnamed: 32'], axis='columns')

In [None]:
cancer.info()

In [33]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.iloc[:, 1:], cancer['diagnosis'], random_state=42)