# CP201A - Python Bootcamp for Module 3
### November 5, 2025

## 0. Load data and packages/libraries

In [None]:
# !pip install census
# !pip install pandas
# !pip install numpy

from census import Census
import pandas as pd
import numpy as np

api_key = 'ab895c1f94c45324d4cfa4f724f1aec7f1a274a4'
c = Census(key=api_key)

In [None]:
# We are going to use the age of structure table (B25034) and pull this data for every county in California (state FIPS code 06)

variables_of_interest = {
    'NAME': 'NAME',
    'GEO_ID': 'GEO_ID',
    'B25034_001E': 'total',
    'B25034_001M': 'total_moe',
    'B25034_002E': '2020_later',
    'B25034_002M': '2020_later_moe',
    'B25034_003E': '2010_2019',
    'B25034_003M': '2010_2019_moe',
    'B25034_004E': '2000_2009',
    'B25034_004M': '2000_2009_moe',
    'B25034_005E': '1990_1999',
    'B25034_005M': '1990_1999_moe',
    'B25034_006E': '1980_1989',
    'B25034_006M': '1980_1989_moe',
    'B25034_007E': '1970_1979',
    'B25034_007M': '1970_1979_moe',
    'B25034_008E': '1960_1969',
    'B25034_008M': '1960_1969_moe',
    'B25034_009E': '1950_1959',
    'B25034_009M': '1950_1959_moe',
    'B25034_010E': '1940_1949',
    'B25034_010M': '1940_1949_moe',
    'B25034_011E': '1939_earlier',
    'B25034_011M': '1939_earlier_moe',
}

df = pd.DataFrame(
    c.acs5.get(
        list(variables_of_interest.keys()),
        {'for': 'county:*', 'in': 'state:06'},
        year=2023
    )
).rename(columns=variables_of_interest)


In [None]:
df_county = df.copy() # It's best practice to make a copy of your raw dataframe at the start, so that you can reset the dataframe when necessary
df_county.head()

## 1. Variables and data types

In [None]:
# variable_value is a variable I've created - it is not part of a dataframe
variable_value = 0.4 # I am "assigning" 0.4 to the variable called variable_value
variable_value

I can reassign a new value to my variable – Python will only store the last value you assigned to a given variable

In [None]:
variable_value = "Hello there!"
variable_value

Our variables can store all different kinds of data

In [None]:
#This is a new variable that contains all the information from the '2020_later' column in the county dataframe - this makes it a series.
built_2020_later = df_county['2020_later'] 
built_2020_later

When you're playing with data in Python, it's important to keep track of your variables' data types (string, integer, series, dataframe) 

We use `type()` to check that:

In [None]:
type(built_2020_later) 
#built_2020_later is a pandas series - it's an array, like one column of a dataframe, that pandas knows how to interpret

Remember, there are several core data types: 

1. **Integers (int)**: Whole numbers without a decimal point.
2. **Floating-Point Numbers (float)**: Numbers that have a decimal point.
3. **Strings (str)**: Text inside quotation marks (single or double).
4. **Booleans (bool)**: A value that is either True or False.
5. **Lists (list)**: An ordered collection of items (like a shopping list).
6. **Dictionaries (dict)**: A way to store data as key–value pairs (like a mini phone book).

When you are working in pandas, for numeric data, you’ll see things like *float64* or *int64*. For text or mixed types, you’ll see *object*.
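
As a quick illustration of the types listed above, here is a minimal sketch. The variable names and the small dataframe (`toy_df`) are made up for this example and are not part of the Census data.

```python
import pandas as pd

# Built-in Python types
units = 1200                      # int
share = 0.4                       # float
county = "Alameda County"         # str
is_bay_area = True                # bool
modes = ["walk", "bike", "bus"]   # list
fips = {"Alameda": "001"}         # dict

# Print the name of each value's type
for value in [units, share, county, is_bay_area, modes, fips]:
    print(type(value).__name__)

# A made-up dataframe to show how pandas reports column types
toy_df = pd.DataFrame({
    "name": ["Alameda County", "Marin County"],
    "total": [100, 250],
    "share": [0.4, 0.6],
})
print(toy_df.dtypes)
```

Whole-number columns and decimal columns get different numeric dtypes, which is worth checking before you do any arithmetic on a column.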

In [None]:
# This is how we check what datatypes are present in each column in the dataframe
df_county.dtypes

## 2. Conditional logic

There are tons of situations in which you want to edit, transform, add, or remove data based on certain conditions. Maybe you want to calculate root sum of squares of MOEs, exclude columns where the MOE is too large, or filter your data to, say, counties with over 50% renter households. 

You want to apply *conditional logic* to Python's tasks. In other words, Python will only execute a given task if a specific *condition* is met.

First, you have to know how to "test" a condition. Usually, we use a Boolean expression to evaluate a condition as either True or False.

In [None]:
# A double equal sign checks whether two things are equal. It is a comparison, NOT variable assignment (which uses a single equal sign)
pct_2020 = 0.4
pct_before_1939 = 0.3

# This line of code asks Python if the two variables are equal to each other 
pct_2020 == pct_before_1939 
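
Equality is only one of the comparisons you can test. The sketch below shows a few others, and then applies the same idea to a whole column to filter rows (for instance, keeping places where more than half of households rent). The dataframe `toy_df` and its values are invented for illustration.

```python
import pandas as pd

pct_renter = 0.55

# Each comparison evaluates to True or False
print(pct_renter == 0.5)   # equal to
print(pct_renter != 0.5)   # not equal to
print(pct_renter > 0.5)    # greater than
print(pct_renter <= 0.5)   # less than or equal to

# The same comparison applied to a whole column returns one True/False per row
toy_df = pd.DataFrame({
    "county": ["A", "B", "C"],
    "pct_renter": [0.35, 0.62, 0.51],
})
over_half = toy_df["pct_renter"] > 0.5
print(over_half)

# Use the True/False series to keep only the rows where the condition holds
majority_renter = toy_df[over_half]
print(majority_renter)
```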

### if statements
```if``` statements use conditional logic. The idea here is to run some code only when a certain condition is true.

**The basic format is IF THIS, THEN THAT**

**Here is a description of how ```if``` logic works in Python**


|Here is the logic for a standard ```if``` statement:| Here is the logic for an ```if else``` statement:| 
|:-:|:-:| 
|<img src="if_logic.png" width="350"> | <img src="if_else_logic.png" width="350">|

[Image source](https://www.geeksforgeeks.org/python/conditional-statements-in-python/)

In [None]:
# We could conduct the same check that we did above using ==, but use an if statement to give us a more specific output

if pct_2020 == pct_before_1939:
    print("Percentage in 2020 is the same as the percentage in 1939.")
else: 
    print("Percentage in 2020 is NOT the same as the percentage in 1939.")

In [None]:
# A more practical example: we can use conditional logic to check specific data
geoid_alameda = '0500000US06001'

# This if/else structure checks whether a given GEOID is Alameda County (state FIPS 06 + county FIPS 001)
if geoid_alameda.endswith('06001'): 
    print("GEOID is Alameda County, CA")
else:
    print("GEOID is NOT Alameda County, CA")

## 3. For loops

The `for` loop is a magical thing. Coding is so handy because it lets you systematize and streamline your data cleaning and analysis process. `for` loops are what let you turn repetitive tasks into a few lines of Python code. 

`for` loops are generally structured in a similar way to `if` statements. Your code is given parameters for its task and completes it accordingly. In this case, `for` loops give Python a **range** for which to do a certain thing.   

`for` loops work with a "loop variable" (or ticker) that steps through a sequence a set number of times. We can either provide: 

1. a numeric value of how many iterations to loop through, like this:  
`for i in range(5):`\
*the "i" is something you will see often. It is simply a conventional name (short for "index") for the loop variable. On the first pass, `i` is set to the first value produced by `range()`; every time the loop repeats, `i` becomes the next value.* \
*or* 
2. a finite list or sequence, like this:  
`transportation=["walk","bike","carpool","bus","train"]`  
`for mode in transportation:`  


Part of the art of using `for` loops is learning to recognize when one might be useful. Whenever you are making Python do the same task several times, consider what changes each time: this is what your code can "loop" through to make this iterative process much faster. 
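
The two forms described above can be sketched as follows. This is only an illustration; the lists `counted` and `labels` are made-up names used to collect what each loop produces.

```python
# Form 1: loop a fixed number of times; range(5) produces 0, 1, 2, 3, 4
counted = []
for i in range(5):
    counted.append(i)
print(counted)

# Form 2: loop over the items in a list
transportation = ["walk", "bike", "carpool", "bus", "train"]
labels = []
for mode in transportation:
    labels.append("I commute by " + mode)
print(labels)
```

Note that `range(5)` starts counting at 0 and stops before 5, so the loop still runs five times.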

Here is how Python handles a ```for``` loop:

<img src="for_loop_logic.png" width="400">

[Image source](https://www.geeksforgeeks.org/python/loops-in-python/)

In [None]:
# This is creating an index (kind of like a list) of all the columns in the dataframe
columns = df_county.columns 

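# Read this as "for each of the items in my index (which we called columns), do this:"
# Note: we use "col" rather than "c" as the loop variable, because "c" already holds our Census API connection
for col in columns:
    print(col) 

In [None]:
# What happens if we swap col out for item?
for item in columns: 
    print(item)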

Let's try this with the if/else statement we used earlier. What if I wanted to check whether each county in the dataframe is part of the Bay Area? Instead of writing that if/else statement over and over, I'll make Python loop through all of my geoids...

In [None]:
geoids = df_county['GEO_ID']
bay_area_fips = ['013', '001', '085', '081', '075', '041', '097', '055', '095'] # These are the FIPS codes for the 9 counties in the Bay Area

for g in geoids: #For each of the geoids in my series of geoids
    if g[-3:] in bay_area_fips: # This checks if the last three characters of the full GEOID code match any of the items in the FIPS list
        print("This county is in the Bay Area!!!")
    else:
        print("This county is NOT in the Bay Area :(")
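
Printing a message for every row is useful for seeing what the loop does, but you will usually want to keep the result. Here is a minimal sketch, assuming a small made-up series of GEOIDs (`toy_geoids`) in place of `df_county['GEO_ID']`. It collects the True/False answers in a list, then shows a pandas shortcut that performs the same check without writing a loop.

```python
import pandas as pd

# Made-up sample: Contra Costa (013), Los Angeles (037), Alameda (001)
toy_geoids = pd.Series(["0500000US06013", "0500000US06037", "0500000US06001"])
bay_area_fips = ["013", "001", "085", "081", "075", "041", "097", "055", "095"]

# Loop version: build a list with one True/False per GEOID
in_bay_area = []
for g in toy_geoids:
    in_bay_area.append(g[-3:] in bay_area_fips)
print(in_bay_area)

# pandas version: .str[-3:] slices every string, .isin() checks membership
in_bay_area_pd = toy_geoids.str[-3:].isin(bay_area_fips)
print(in_bay_area_pd.tolist())
```

Both versions give the same answers. The loop makes the logic explicit, while the pandas version is shorter once you are comfortable with it.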

Let's try a more practical example. Let's say I want to add columns to my dataframe for *percent* of units in each 'age' bracket. Instead of individually creating each column, let's use a `for` loop.\
\
**Step 1**: Identify **what you are looping through** (i.e., what is changing each time you repeat the task).\
  &nbsp;&nbsp;&nbsp;&nbsp; *In this case, my year interval is changing.*\
  \
**Step 2**: Write a **list** or other sequential variable that contains each of the elements you want to loop through.\
     &nbsp;&nbsp;&nbsp;&nbsp; *Note: Make sure your syntax and spelling are exact here; otherwise your `for` loop won't work*\
     \
**Step 3**: Build your `for` loop, asking Python to go through each of the elements in your list and do a specific thing.\
      &nbsp;&nbsp;&nbsp;&nbsp;Here, I want a new column to be created, with the same name as my original column + "_pct" and I want it to contain the share of the total for that year interval\
      \
**Step 4**: Run it, make sure it works by using .head() and checking one of your calculations, then be proud of yourself for making a cool `for` loop!
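
To see the four steps in action on something other than the housing data, here is a sketch using a made-up table of commute counts. The dataframe `toy_commute` and its column names are invented for illustration.

```python
import pandas as pd

toy_commute = pd.DataFrame({
    "place": ["City A", "City B"],
    "total": [200, 400],
    "walk": [50, 40],
    "bike": [30, 60],
    "bus": [120, 300],
})

# Steps 1 and 2: the commute mode is what changes, so list the mode columns
modes = ["walk", "bike", "bus"]

# Step 3: for each mode, create a new "<mode>_pct" column with its share of the total
for mode in modes:
    toy_commute[mode + "_pct"] = toy_commute[mode] / toy_commute["total"]

# Step 4: check the result (for City A, walk_pct should be 50 / 200 = 0.25)
print(toy_commute.head())
```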

In [None]:
#Practice: 

# 1. Identify the columns you want to loop through

# 2. Write a list of all those elements
columns_to_use=[]

# 3. Build your 'for' loop
for _____ in ______:
    df_county[]=

# 4. Display the results
df_county.head()

## 4. Functions

Functions help us expedite data processing even more. They take a given input and convert it to a specified output, which allows us to repeat an operation again and again, but in a more specific way than a for loop. They are often used in conjunction with for loops and if statements.

Let's try with a simple example first.

In [None]:
# Let's turn the for loop that we just wrote into a function.

def get_pct_cols(df):
    
    '''
    This function converts the estimates for columns in the ACS table B25034 into percentages of the total. 
    '''
   
    columns_to_use=['2020_later', '2010_2019', '2000_2009', '1990_1999', '1980_1989', 
                    '1970_1979', '1960_1969', '1950_1959', '1940_1949', '1939_earlier']

    #Build the 'for' loop
    for col in columns_to_use:
        df[col+"_pct"] = df[col] / df['total']

    return df

Now let's try running this function.

In [None]:
get_pct_cols(df_county)
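
Note that `get_pct_cols` adds the new columns directly to the dataframe you pass in, so `df_county` itself changes when the cell above runs. If you would rather leave the input untouched, one option is to work on a copy inside the function. Here is a minimal sketch, assuming a hypothetical variant named `get_pct_cols_copy` and a made-up two-column table:

```python
import pandas as pd

def get_pct_cols_copy(df, columns_to_use):
    '''
    Returns a copy of df with a "<column>_pct" column added for each
    column in columns_to_use. The original dataframe is not modified.
    '''
    out = df.copy()
    for col in columns_to_use:
        out[col + "_pct"] = out[col] / out["total"]
    return out

# Made-up data, not real ACS values
toy_df = pd.DataFrame({"total": [10, 20], "2020_later": [1, 5]})
toy_with_pct = get_pct_cols_copy(toy_df, ["2020_later"])

print(toy_df.columns.tolist())        # original columns only
print(toy_with_pct.columns.tolist())  # includes the new _pct column
```

Either approach works. Which one you choose depends on whether you want the original dataframe to keep its raw shape.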

Let's try a more complicated example. Let's write a function that combines rows together to create an overall estimate for the Bay Area.

Now, as we all know, ABAG uses the 9-county definition of the Bay Area, NOT the 14-county combined statistical area (CSA) definition that the Census Bureau uses. So let's write a function to subset only those 9 rows in the dataframe. 

In [None]:
def get_bay_counties(df):
    bay_area_geoids = ['0500000US06013', '0500000US06001', '0500000US06085', '0500000US06081', 
                       '0500000US06075', '0500000US06041', '0500000US06097', '0500000US06055', '0500000US06095'] 
    new_df = df[df['GEO_ID'].isin(bay_area_geoids)]
    return new_df
    

In [None]:
bay_counties = get_bay_counties(df_county)
bay_counties

Now let's create a function to add the columns together to create a single estimate for the Bay Area. But first, let's explore each of the steps of the function. 

In [None]:
# First, we create a list of the estimate columns we want to add up
estimates_to_sum = ['total', '2020_later', '2010_2019', '2000_2009', '1990_1999',
       '1980_1989', '1970_1979', '1960_1969', '1950_1959', '1940_1949', '1939_earlier']
    
# Now we sum all of the estimate columns for the 9 Bay Area counties. 
# Axis = 0 indicates that we want to sum every row within each column, rather than across columns.
estimates_agg=bay_counties[estimates_to_sum].sum(axis=0)

# Now let's see what that did
print(type(estimates_agg))
print(estimates_agg)

In [None]:
# Now we do the same thing with the MOEs, but we calculate the square root of the sum of squares instead of a basic sum.
moes_to_sum = ['total_moe', '2020_later_moe', '2010_2019_moe', '2000_2009_moe', '1990_1999_moe',
'1980_1989_moe', '1970_1979_moe', '1960_1969_moe', '1950_1959_moe', '1940_1949_moe', '1939_earlier_moe']
    
moes_agg=(bay_counties[moes_to_sum]**2).sum(axis=0)**0.5

# Now let's see what that did
print(type(moes_agg))
print(moes_agg)
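
To check the root-sum-of-squares logic by hand, here is a tiny made-up example (the values are not real ACS margins of error). With two MOEs of 3 and 4, the combined MOE is the square root of 3² + 4² = 25, which is 5. That is noticeably less than the simple sum of 7.

```python
import pandas as pd

# Made-up margins of error for two geographies
toy_moes = pd.DataFrame({
    "total_moe": [3, 4],
    "2020_later_moe": [6, 8],
})

# Square every value, add up each column, then take the square root
toy_moes_agg = (toy_moes ** 2).sum(axis=0) ** 0.5
print(toy_moes_agg)
```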

In [None]:
# Next, let's combine the two series
bayarea = pd.concat([estimates_agg,moes_agg])

# And let's see what that did
print(type(bayarea))
print(bayarea)

In [None]:
# We want our data in dataframe format. In order to do that, we need to transpose the data and use pandas to convert into a dataframe. 
bayarea_transpose = pd.DataFrame(bayarea).transpose()

# Now let's look at the dataframe
bayarea_transpose
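
If the transpose step feels opaque, this sketch on a made-up series shows what happens at each stage. Wrapping a series in `pd.DataFrame()` gives a single column with the series labels as the row index. Transposing flips that into a single row with the labels as column names.

```python
import pandas as pd

# A made-up series with two labeled values
toy_series = pd.Series({"total": 300, "total_moe": 5.0})

as_column = pd.DataFrame(toy_series)   # 2 rows x 1 column
as_row = as_column.transpose()         # 1 row x 2 columns

print(as_column.shape)
print(as_row.shape)
print(as_row.columns.tolist())
```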

In [None]:
# The final step: let's add some descriptive columns to the dataframe
bayarea_transpose['NAME'] = 'Bay Area'
bayarea_transpose['GEO_ID'] = np.nan # Adding a null value for our GEO_ID column (matching the column name in df_county)

# You'll see the new columns on the far right of the dataframe
bayarea_transpose.head()

Now let's bring together all of the above steps into a function!

In [None]:
# Here's a break down of exactly what this function does:

# First, we define the function (combine_counties), and specify that we will give a single dataframe as the input (df)
def combine_counties(df): 
    
    '''
    This function takes a dataframe for the ACS table B25034 and combines all the rows into a single geography.
    It returns a new dataframe that contains the combined data. 
    '''
    
    # Next we give the function all the estimate columns we'll want added up
    estimates_to_sum = ['total', '2020_later', '2010_2019', '2000_2009', '1990_1999',
       '1980_1989', '1970_1979', '1960_1969', '1950_1959', '1940_1949', '1939_earlier']
    
    # Now we tell the function to sum all of the estimate columns. 
    # Axis = 0 indicates that we want to sum every row within each column, rather than across columns.
    # Note that we are referring to the dataframe as "df", because this is the input we specified earlier
    estimates_agg=df[estimates_to_sum].sum(axis=0)

    # Now we do the same thing with the MOEs, but we calculate the square root of the sum of squares instead of a basic sum.
    moes_to_sum = ['total_moe', '2020_later_moe', '2010_2019_moe', '2000_2009_moe', '1990_1999_moe',
       '1980_1989_moe', '1970_1979_moe', '1960_1969_moe', '1950_1959_moe', '1940_1949_moe', '1939_earlier_moe']
    
    moes_agg=(df[moes_to_sum]**2).sum(axis=0)**0.5

    # Finally we combine the two sets of aggregated data
    bayarea = pd.concat([estimates_agg,moes_agg])

    # And transpose it to the correct format
    bayarea_transpose = pd.DataFrame(bayarea).transpose()

    # Then we add the geography name and a null value for the GEO_ID column
    bayarea_transpose['NAME'] = 'Bay Area'
    bayarea_transpose['GEO_ID'] = np.nan

    # We ask the function to "return" the new dataframe. 
    # In other words, we specify the "bayarea_transpose" dataframe as our output.
    return bayarea_transpose

In [None]:
# Now we can "call" (aka use) the function

# We assign the output of the function (the "return" dataframe) to a new variable called df_ba
df_ba = combine_counties(bay_counties)
df_ba

In [None]:
# If we don't assign the output of the function to a new variable, the function will still work
# But the output won't be stored in the computer's memory for later use
combine_counties(bay_counties)

# Assigning the output to a new variable makes it easier to reference and use in the future

#### Bonus exercise

The function above will let us easily combine counties (or any geography type for this specific ACS table). That's great!
We could expand it even more... Maybe we want to use this function for a different table, with different variable names.\
**How could we change our function to adapt it to other dataframes with different ACS data?**

What about `combine_counties` is specific to Table B25034?

```python
def combine_counties(df): 
    '''
    This function takes a dataframe for the ACS table B25034 and combines all the rows into a single geography.
    It returns a new dataframe that contains the combined data. 
    '''

    estimates_to_sum = ['total', '2020_later', '2010_2019', '2000_2009', '1990_1999',
       '1980_1989', '1970_1979', '1960_1969', '1950_1959', '1940_1949', '1939_earlier']
    
    estimates_agg=df[estimates_to_sum].sum(axis=0)

    moes_to_sum = ['total_moe', '2020_later_moe', '2010_2019_moe', '2000_2009_moe', '1990_1999_moe',
       '1980_1989_moe', '1970_1979_moe', '1960_1969_moe', '1950_1959_moe', '1940_1949_moe', '1939_earlier_moe']
    
    moes_agg=(df[moes_to_sum]**2).sum(axis=0)**0.5

    bayarea = pd.concat([estimates_agg,moes_agg])

    bayarea_transpose = pd.DataFrame(bayarea).transpose()

    bayarea_transpose['NAME'] = 'Bay Area'
    bayarea_transpose['GEO_ID'] = np.nan #Adding a null value for our GEO_ID column

    return bayarea_transpose
```

Let's try to write a new function that takes several inputs (i.e. the things that will change from one ACS table to the next) so that we could use this on several different tables.

In [None]:
# Practice:
# Try writing a new function that has 2 inputs: the dataframe and the list of estimates to sum

def combine_counties_updated():
    # Write your function here
    pass

## 5. Additional resources

#### Want to review the Python basics we just covered?

Here are some resources to help you practice and clarify confusing concepts: 
* Practice Platforms
    * [HackerRank](https://www.hackerrank.com/) (Beginner to Intermediate)
    * [Codewars](https://www.codewars.com/) (Beginner to Advanced)
* Online Courses
    * [Programming for Everybody (UMich)](https://www.coursera.org/learn/python) on Coursera (Beginner)
    * [CS50's Introduction to Programming with Python (Harvard)](https://cs50.harvard.edu/python/) (Beginner)
    * [Data8: Foundations of Data Science (UCB)](https://www.data8.org/sp25/) (Beginner)
    * [How to Think Like a Computer Scientist](https://runestone.academy/ns/books/published/thinkcspy/index.html) (Beginner)
* Videos
    * [Tech with Tim](https://www.youtube.com/@TechWithTim) (Beginner to Intermediate)
 

Finally, please remember to make use of GSI office hours! You can drop in to review Python concepts and skills regardless of whether your questions are related to a class assignment. 

#### Too easy? Here are some more advanced Python resources you can review. 

Here are some resources to help you advance your skills:
* Practice Platforms
    * [LeetCode](https://leetcode.com/) (Intermediate to Advanced)
    * [Kaggle](https://www.kaggle.com/) (Intermediate)
* Python Documentation
    * Dig around on the [official Python documentation](https://docs.python.org/3/)!
* Videos
    * [Corey Schafer's Python Tutorials](https://www.youtube.com/user/schafer5) (Intermediate)
    * [ArjanCode](https://www.youtube.com/@ArjanCodes) (Advanced)
  


Source: https://ucb-urban-informatics.github.io/cp255_web/docs/tutorials/python_reference.html