Python Workshop 1: basic coding and exploratory data analysis

https://colab.research.google.com/ (import Jupyter Notebook from GitHub, have data in Google Drive, put link to Google Drive folder in this document and link to a short tutorial about how to run the code)

In [7]:
# With Python, you can do pretty much everything you can do in R -- and more!
# All in relatively few lines of code
# Today we will cover the Python code to achieve results similar to what previous R workshops have focused on
# and also explore some Python idiosyncrasies and useful functions, tips, and tricks

# In the notebook environment, output is displayed directly below the code cell that produces it
print('Hello world!')

Hello world!


In [8]:
# In a notebook we don't even need print(): the value of the last expression in a cell is displayed automatically
# (strings displayed this way keep their quotes)
'Hello world!'

'Hello world!'

In [10]:
# Just like R uses packages to extend its base functionality, Python has libraries 
# Let's import the libraries we will use today -- aliases make calling functions from the library simpler

# Importing libraries
# import [library] as [alias]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
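
As a quick illustration of what an alias buys us, the sketch below calls the same NumPy functions through the alias and through the full library name. The numbers passed in are arbitrary examples, not workshop data.

```python
import numpy
import numpy as np

# Both names refer to the same library; the alias just saves typing
print(np.sqrt(16))
print(numpy.sqrt(16))
print(np is numpy)

# Every function in the library is reached with the alias.function() pattern
print(np.mean([1, 2, 5, 8]))
```

The aliases np, pd, plt, and sns are conventions rather than requirements, but using them makes code easier for other Python users to read.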

In [19]:
# Let's save some simple variables 
# In a notebook, any variable defined like this is accessible in the global environment 
# (i.e., we can use the value of this variable in any other code block in our notebook)
x = 'this is a string'
y = 7
z = 3.14

In [21]:
# The type() function tells us the type (i.e., class) of the object passed to it
type(x)
type(y)
type(z)

float
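
Types matter because they decide which operations are allowed. The sketch below shows the built-in conversion functions str(), int(), and float() moving values between types, and what happens when a string and a number are combined without converting. The variable names and the example strings are illustrative only.

```python
y = 7
z = 3.14

# str() turns a number into text so it can be joined to other strings
y_text = str(y)
print(y_text + ' apples')

# int() and float() go the other way, from text to numbers
print(int('7') + 1)
print(float('3.14') * 2)

# Mixing a string and a number without converting raises a TypeError
try:
    'Total: ' + y
except TypeError as err:
    print('TypeError:', err)
```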

In [45]:
# If we want to have multiple lines of output, we need to print each line 
# The notebook will otherwise just give us the output of our final line of code 
# We can also change the datatype of the two numbers to a string so we can print full sentences describing our variables
# \ is the escape character in Python -- put it before a quotation mark to include that mark in your printed string
# (inside a single-quoted string, a double quote also works without the backslash)

print('\"' + x + '\" is a ' + str(type(x)) + '.')
print('\"' + str(y) + '\" is a ' + str(type(y)) + '.')
print('\"' + str(z) + '\" is a ' + str(type(z)) + '.')

"this is a string" is a <class 'str'>.
"7" is a <class 'int'>.
"3.14" is a <class 'float'>.


In [46]:
# Notice that we copied and pasted several times above
# Whenever we are copying and pasting, we are defeating one of the main points of coding: efficiency! 

# Let's instead make a simple loop to achieve the same output as above in fewer lines 
# First, we'll put our variables in a list (notice how multiple data types can be in a list)
var_list = [x,y,z]
type(var_list)

list

In [47]:
# Now we'll loop through our list to print the sentences equivalent to what we have above
for variable in var_list: 
    print('\"' + str(variable) + '\" is a ' + str(type(variable)) + '.')

"this is a string" is a <class 'str'>.
"7" is a <class 'int'>.
"3.14" is a <class 'float'>.


In [48]:
# We might want to use this pattern again, so let's make it a function 
def var_type_phrase(var): 
    return '\"' + str(var) + '\" is a ' + str(type(var)) + '.'
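
The same sentence can be built without escape characters or explicit str() calls by using an f-string, which is available in Python 3.6 and later. This is a sketch of an alternative, not a replacement for the function above; the names var_type_phrase_f and var_type_name are hypothetical.

```python
def var_type_phrase_f(var):
    # Expressions inside {} are evaluated and converted to text automatically
    return f'"{var}" is a {type(var)}.'


def var_type_name(var):
    # type(var).__name__ gives just the short class name, e.g. 'int'
    return f'"{var}" is a {type(var).__name__}.'


for variable in ['this is a string', 7, 3.14]:
    print(var_type_phrase_f(variable))
    print(var_type_name(variable))
```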

In [49]:
for variable in var_list:
    print(var_type_phrase(variable))

"this is a string" is a <class 'str'>.
"7" is a <class 'int'>.
"3.14" is a <class 'float'>.


In [71]:
# Let's now take a step back and go over some key Python terms
# When we call the type() function, we are seeing what the <class> of an individual <object> is
# Individual objects can have <attributes> specific to the object's class 
# Individual objects can also have <methods> which are defined by the object's class 
# (think of methods like class-specific functions)

# Let's see an example of this
# First, make a list of numbers
num_list = [1,2,5,8]
print(num_list)

# Our num_list object is indeed of the list class
print(type(num_list))

# A max() function exists to find the maximum value of our list 
# This is NOT a method of the list class -- just a regular function 
print(max(num_list))
print('\n')

# We can change the object type by creating a numpy array out of our list 
num_array = np.array(num_list)
print(num_array)

# Let's check the type 
print(type(num_array))

# We can also use the max() function 
print(max(num_array))

# But a max() method also exists for objects of the numpy array class that gives the same result!
print(num_array.max())

# Attributes are more like properties of the object (no parentheses needed to access them)
# shape is an attribute of numpy arrays 
print(num_array.shape)
print('\n')

# Methods can give us an object with different attributes 
# reshape() returns a reshaped array; num_array itself keeps its shape of (4,)
num_array_reshaped = num_array.reshape(2,2)
print(num_array_reshaped)
print(num_array_reshaped.shape)

[1, 2, 5, 8]
<class 'list'>
8


[1 2 5 8]
<class 'numpy.ndarray'>
8
8
(4,)


[[1 2]
 [5 8]]
(2, 2)
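
One way to tell methods and attributes apart is that methods are callable and attributes are plain values. The sketch below checks this with the built-in callable() function and also confirms that reshape() leaves the original array unchanged. It reuses the same numbers as the cell above.

```python
import numpy as np

num_array = np.array([1, 2, 5, 8])
num_array_reshaped = num_array.reshape(2, 2)

# Methods are callable (they need parentheses); attributes are plain values
print(callable(num_array.max))
print(callable(num_array.shape))

# reshape() did not touch the original array
print(num_array.shape)
print(num_array_reshaped.shape)
```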


In [77]:
# One more useful Python programming tool before we dive into some data!
# Suppose that you have a list of objects and you want to figure out how many even integers there are in that list
# The modulo operator % returns the remainder of y divided by x when given y % x, so if y % 2 == 0, y is even
4 % 2 

0

In [78]:
5 % 2

1
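
The even-number check can be wrapped in a small function so it reads like a question. The sketch below does that and shows two related details: divmod() returns the quotient and remainder together, and in Python the result of % takes the sign of the divisor. The function name is_even is hypothetical.

```python
def is_even(n):
    """Return True when n leaves no remainder after division by 2."""
    return n % 2 == 0


print(is_even(4))
print(is_even(5))

# divmod() returns the quotient and the remainder together
print(divmod(7, 2))

# In Python the result of % takes the sign of the divisor, so this is 1, not -1
print(-7 % 2)
```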

In [83]:
potential_evens = [2,4,5,1,8,9,48]
true_evens = []
odds = []

for num in potential_evens:
    if num % 2 == 0:
        true_evens.append(num)
    else:
        odds.append(num)

print(true_evens)
print(len(true_evens))
print(odds)

[2, 4, 8, 48]
4
[5, 1, 9]


In [84]:
# But in Python, there is an even better, shorter, faster, more "pythonic" way to do this!
# It's called a list comprehension
true_evens = [num for num in potential_evens if num % 2 == 0]
true_evens

[2, 4, 8, 48]
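
A list comprehension follows the pattern [expression for item in iterable if condition]. The sketch below uses the same list to build the odd numbers as well, and shows that the expression part can transform each item rather than just copy it. The name even_squares is illustrative.

```python
potential_evens = [2, 4, 5, 1, 8, 9, 48]

# Keep only the items that pass the condition
true_evens = [num for num in potential_evens if num % 2 == 0]
odds = [num for num in potential_evens if num % 2 != 0]

# The expression part can transform each item, here squaring the even numbers
even_squares = [num ** 2 for num in true_evens]

print(true_evens)
print(odds)
print(even_squares)
```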

In [85]:
# What if, however, our list of potential_evens contains objects that are not numbers at all?
potential_evens = [2,4,5,1,8,9,48,'bus']
true_evens = []
not_evens = []

for obj in potential_evens:
    if obj % 2 == 0:
        true_evens.append(obj)
    else:
        not_evens.append(obj)

TypeError: not all arguments converted during string formatting

In [86]:
# 'bus' % 2 fails because % means string formatting when the left-hand object is a string
# We can use a try-except statement to deal with an error if it comes up
# This is not really possible to do well with a list comprehension 
potential_evens = [2,4,5,1,8,9,48,'bus']
true_evens = []
not_evens = []

for obj in potential_evens:
    try:
        if obj % 2 == 0:
            true_evens.append(obj)
        else:
            not_evens.append(obj)
    except TypeError:  # 'bus' % 2 raises a TypeError; other kinds of errors are not hidden
        not_evens.append(obj)
        
print(true_evens)
print(not_evens)

[2, 4, 8, 48]
[5, 1, 9, 'bus']
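
If a comprehension is still wanted, one possible workaround is to move the check into a helper function and call that from the comprehension. The sketch below uses a type check with isinstance() instead of try-except, so it is a different technique from the loop above. The helper name is_even_number is hypothetical, and it assumes only integers should count as even (a float such as 4.0 would land in not_evens).

```python
def is_even_number(obj):
    """Hypothetical helper: True only for integers divisible by 2."""
    return isinstance(obj, int) and obj % 2 == 0


potential_evens = [2, 4, 5, 1, 8, 9, 48, 'bus']

true_evens = [obj for obj in potential_evens if is_even_number(obj)]
not_evens = [obj for obj in potential_evens if not is_even_number(obj)]

print(true_evens)
print(not_evens)
```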


In [12]:
# We will be looking at some Census data on healthcare trends 
# We can turn this into a nice panel dataset -- there are cross-sectional observations 
# at regularly spaced (yearly) intervals

# Importing data from a csv using the read_csv function from pandas (recall that pd is the pandas alias)
df = pd.read_csv('../All_years.csv')

# Look at the first five rows with the head() method 
df.head()

,year,cpi99,statefip,perwt,sex,age,marst,race,raced,hcovany,hinscaid,hinscare,educ,educd,inctot
0,2008,0.774,6,103,2,32,6,2,200,2,1,1,6,64,25500
1,2008,0.774,48,76,1,37,2,1,100,1,1,1,10,101,200000
2,2008,0.774,6,77,2,44,1,1,100,2,1,1,10,101,65000
3,2008,0.774,24,125,2,72,5,1,100,2,2,2,4,40,7600
4,2008,0.774,26,32,2,77,1,1,100,2,1,2,6,63,5800
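
For readers without access to All_years.csv, the sketch below reads the five rows shown above from an in-memory string instead of a file. It relies on io.StringIO from the standard library, which lets pd.read_csv treat text as if it were a file. The names csv_text and sample are illustrative.

```python
import io

import pandas as pd

# The five rows displayed above, typed in as CSV text (index column omitted)
csv_text = """year,cpi99,statefip,perwt,sex,age,marst,race,raced,hcovany,hinscaid,hinscare,educ,educd,inctot
2008,0.774,6,103,2,32,6,2,200,2,1,1,6,64,25500
2008,0.774,48,76,1,37,2,1,100,1,1,1,10,101,200000
2008,0.774,6,77,2,44,1,1,100,2,1,1,10,101,65000
2008,0.774,24,125,2,72,5,1,100,2,2,2,4,40,7600
2008,0.774,26,32,2,77,1,1,100,2,1,2,6,63,5800
"""

# read_csv accepts any file-like object, not only a path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.head(2))  # first two rows only
```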


In [13]:
# We can also look at the first 10 rows if we want 
df.head(10)

,year,cpi99,statefip,perwt,sex,age,marst,race,raced,hcovany,hinscaid,hinscare,educ,educd,inctot
0,2008,0.774,6,103,2,32,6,2,200,2,1,1,6,64,25500
1,2008,0.774,48,76,1,37,2,1,100,1,1,1,10,101,200000
2,2008,0.774,6,77,2,44,1,1,100,2,1,1,10,101,65000
3,2008,0.774,24,125,2,72,5,1,100,2,2,2,4,40,7600
4,2008,0.774,26,32,2,77,1,1,100,2,1,2,6,63,5800
5,2008,0.774,25,83,1,17,6,1,100,2,1,1,5,50,0
6,2008,0.774,45,80,2,44,1,1,100,2,1,1,11,114,60000
7,2008,0.774,53,97,2,19,6,1,100,2,1,1,6,63,0
8,2008,0.774,4,56,2,73,5,1,100,2,1,2,5,50,10800
9,2008,0.774,37,71,2,63,1,1,100,2,1,2,7,71,5900


In [14]:
# Or the last ten (only the final five rows of the output are shown below)
df.tail(10)

,year,cpi99,statefip,perwt,sex,age,marst,race,raced,hcovany,hinscaid,hinscare,educ,educd,inctot
3429157,2018,0.663,12,86,2,33,1,1,100,2,1,1,11,115,36000
3429158,2018,0.663,36,78,2,44,1,1,100,2,1,1,6,65,65000
3429159,2018,0.663,6,61,1,27,2,7,700,2,2,1,3,30,700
3429160,2018,0.663,36,56,2,90,4,1,100,2,1,2,4,40,39300
3429161,2018,0.663,36,68,1,49,6,1,100,2,1,1,6,64,52000


In [15]:
# The info() method lists the variables and datatype of each variable
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3429162 entries, 0 to 3429161
Data columns (total 15 columns):
 #   Column    Dtype  
---  ------    -----  
 0   year      int64  
 1   cpi99     float64
 2   statefip  int64  
 3   perwt     int64  
 4   sex       int64  
 5   age       int64  
 6   marst     int64  
 7   race      int64  
 8   raced     int64  
 9   hcovany   int64  
 10  hinscaid  int64  
 11  hinscare  int64  
 12  educ      int64  
 13  educd     int64  
 14  inctot    int64  
dtypes: float64(1), int64(14)
memory usage: 392.4 MB
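
The pieces of information that info() prints are also available individually. The sketch below uses a tiny hand-built dataframe (three columns and two rows, with values taken from the rows displayed earlier) to show the dtypes and shape attributes and the astype() method. The name toy is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [2008, 2018],
    'cpi99': [0.774, 0.663],
    'inctot': [25500, 36000],
})

# dtypes is an attribute holding one data type per column
print(toy.dtypes)

# astype() is a method that returns a converted copy of a column
inctot_float = toy['inctot'].astype(float)
print(inctot_float.dtype)

# shape is an attribute: (number of rows, number of columns)
print(toy.shape)
```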


In [24]:
# The describe() method gives summary statistics for every numeric column 
df.describe()

,year,cpi99,statefip,perwt,sex,age,marst,race,raced,hcovany,hinscaid,hinscare,educ,educd,inctot
count,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0,3429162.0
mean,2013.061,0.7211128,27.65057,101.4881,1.512115,40.54053,3.601857,1.766201,178.4144,1.891136,1.17613,1.196557,5.973344,62.16537,1786708.0
std,3.158196,0.03650449,16.03609,80.32649,0.4998533,23.51276,2.309683,1.822552,186.4583,0.3114688,0.380931,0.3973948,3.235699,32.16469,3791467.0
min,2008.0,0.663,1.0,1.0,1.0,0.0,1.0,1.0,100.0,1.0,1.0,1.0,0.0,1.0,-19998.0
25%,2010.0,0.694,12.0,55.0,1.0,20.0,1.0,1.0,100.0,2.0,1.0,1.0,4.0,40.0,9900.0
50%,2013.0,0.715,27.0,80.0,2.0,41.0,4.0,1.0,100.0,2.0,1.0,1.0,6.0,63.0,30000.0
75%,2016.0,0.764,42.0,121.0,2.0,59.0,6.0,1.0,100.0,2.0,1.0,1.0,8.0,81.0,86000.0
max,2018.0,0.777,56.0,2415.0,2.0,97.0,6.0,9.0,990.0,2.0,2.0,2.0,11.0,116.0,9999999.0
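
Summary statistics are a good place to spot suspicious values. In the table above, inctot has a mean of roughly 1.79 million but a median of 30,000, and its maximum is 9999999, which looks like a placeholder code rather than a real income. That reading is an assumption here, and the codebook for the data should confirm it. The sketch below shows, on a small made-up column that mixes three incomes from the rows above with two placeholder values, how replacing such a code with NaN changes the mean.

```python
import numpy as np
import pandas as pd

# Toy column: three incomes from the rows shown earlier plus two placeholder values
toy = pd.DataFrame({'inctot': [25500, 200000, 9999999, 7600, 9999999]})

# Assumption: 9999999 marks a missing value, so swap it for NaN
cleaned = toy['inctot'].replace(9999999, np.nan)

print(toy['inctot'].mean())  # mean with the placeholder values included
print(cleaned.mean())        # mean() skips NaN by default
print(cleaned.describe())    # count now reflects only non-missing rows
```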


In [28]:
# The columns attribute lists the columns of our dataframe 
# df.columns is a pandas Index object; list() turns it into a regular Python list
list(df.columns)

['year',
 'cpi99',
 'statefip',
 'perwt',
 'sex',
 'age',
 'marst',
 'race',
 'raced',
 'hcovany',
 'hinscaid',
 'hinscare',
 'educ',
 'educd',
 'inctot']
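
Column names are mostly useful for selecting data. The sketch below uses a tiny hand-built dataframe (values taken from rows displayed earlier) to show that one name in single brackets returns a Series, while a list of names in double brackets returns a smaller dataframe. The names toy, ages, and subset are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [2008, 2018],
    'age': [32, 33],
    'inctot': [25500, 36000],
})

# tolist() is a method of the Index object and does the same job as list()
column_names = toy.columns.tolist()
print(column_names)

# One column name in single brackets gives a Series
ages = toy['age']
print(type(ages).__name__)

# A list of names in double brackets gives a smaller DataFrame
subset = toy[['year', 'inctot']]
print(type(subset).__name__)

# The in operator checks whether a column exists
print('age' in toy.columns)
```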

In [4]:
# Show how to do some of the important things that we've learned to do in R with Python
# https://github.com/UChicagoOeconomica/WorkshopMaterials2020-2021
# Creating plots from data, fitting regression line, and getting regression table
# Interactive plots
# Exporting/saving plots