Automate Data Tasks With Loops in Python

Loops remove repetition from your code by running the same steps on each item in a sequence.

Automate Calculations on Values in Lists

Create list of values for loop

In [1]:
# Create list of average monthly precip (inches) in Boulder, CO
avg_monthly_precip_in = [0.70, 0.75, 1.85, 2.93, 3.05, 2.02,
                        1.93, 1.62, 1.84, 1.31, 1.39, 0.84]

Write loop

Suppose you want to convert each item in the list from inches to mm (recall that 1 inch = 25.4 mm). You have a fixed list of values, and you want to repeat the same calculation on each one.

In [2]:
# Convert each item in list from in to mm
for month in avg_monthly_precip_in:
    month *= 25.4
    print(month)

17.779999999999998
19.049999999999997
46.99
74.422
77.46999999999998
51.308
49.022
41.148
46.736
33.274
35.306
21.336
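
The long decimal tails in the output above (for example, 17.779999999999998) come from floating-point arithmetic, not from the conversion itself. If you want tidier values, one option is to round each result. The sketch below assumes that two decimal places are enough; that precision and the variable name month_mm are choices made here for illustration, not part of the original lesson.

```python
# Average monthly precip (inches) in Boulder, CO
avg_monthly_precip_in = [0.70, 0.75, 1.85, 2.93, 3.05, 2.02,
                         1.93, 1.62, 1.84, 1.31, 1.39, 0.84]

# Convert each item from in to mm and round to 2 decimal places
# (2 decimal places is an assumption made for display purposes)
for month in avg_monthly_precip_in:
    month_mm = round(month * 25.4, 2)
    print(month_mm)
```

Rounding changes only what is stored in month_mm; the values in avg_monthly_precip_in stay as they were.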


Expand loop to add results to new list

In the loop above, each month's value is converted from inches to mm and printed; however, the new value is not saved anywhere.

You can expand the loop so that each converted value is added to a new list.

This takes only two new lines of code:
1. First, create an empty list that will receive the new values using listname = [].
2. Then, inside the loop, append each value after it is calculated using listname += [value].

In [3]:
# Create new empty list to receive values
avg_monthly_precip_mm = []

# Convert each item from in to mm and add to new list
for month in avg_monthly_precip_in:
    month *= 25.4
    avg_monthly_precip_mm += [month]

In [4]:
# Print original list in inches
print(avg_monthly_precip_in)

# Print new list after loop is complete
print(avg_monthly_precip_mm)

[0.7, 0.75, 1.85, 2.93, 3.05, 2.02, 1.93, 1.62, 1.84, 1.31, 1.39, 0.84]
[17.779999999999998, 19.049999999999997, 46.99, 74.422, 77.46999999999998, 51.308, 49.022, 41.148, 46.736, 33.274, 35.306, 21.336]


The list variable avg_monthly_precip_mm was explicitly created: you defined it yourself as an empty list before the loop.

The variable month is the placeholder variable, meaning that it was not explicitly created by you.

Rather, it is created as part of the loop and serves as a placeholder to represent each item from the original list (avg_monthly_precip_in), as the loop iterates.

At the end of the loop, the placeholder variable is equal to the last value that it was assigned (e.g. month is equal to 21.336 when the loop ends).

In [5]:
# Final value of month
month

21.336
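
The loop in this section adds values with listname += [value]. Python lists also provide an .append() method that does the same job, and a list comprehension can express the whole loop in one line. The sketch below shows both alternatives side by side; the names mm_appended and mm_comprehension are made up for this illustration.

```python
# Average monthly precip (inches) in Boulder, CO
avg_monthly_precip_in = [0.70, 0.75, 1.85, 2.93, 3.05, 2.02,
                         1.93, 1.62, 1.84, 1.31, 1.39, 0.84]

# Option 1: add each converted value with the list method .append()
mm_appended = []
for month in avg_monthly_precip_in:
    mm_appended.append(month * 25.4)

# Option 2: build the same list with a list comprehension
mm_comprehension = [month * 25.4 for month in avg_monthly_precip_in]

# Both approaches produce identical lists
print(mm_appended == mm_comprehension)

# The original list is not modified by either approach
print(avg_monthly_precip_in[0])
```

Which form to use is largely a matter of readability: an explicit loop is easier to extend with extra steps, while a comprehension is compact when the loop body is a single expression.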

Automate Summary Stats on Multiple Numpy Arrays

You can build a loop that calculates summary statistics (such as the sum and median) of multiple data structures, such as numpy arrays.

Recall that you can use the functions np.sum() and np.median() to calculate the sum and median of a numpy array.

In [6]:
# Import necessary packages
import numpy as np

# Array of average monthly precip (inches) for 2002 in Boulder, CO
precip_2002_arr = np.array([1.07, 0.44, 1.50, 0.20, 3.20, 1.18,
                           0.09, 1.44, 1.52, 2.44, 0.78, 0.02])

# Array of average monthly precip (inches) for 2013 in Boulder, CO
precip_2013_arr = np.array([0.27, 1.33, 1.72, 4.14, 2.66, 0.61,
                           1.03, 1.40, 18.16, 2.24, 0.29, 0.50])

Create list of Numpy arrays for loop

Just like in the previous example, begin by creating the list upon which your loop will iterate.

As you want to iterate on multiple numpy arrays, you can create a list that contains the object names for all of the numpy arrays that you want to work with in the loop.

In [7]:
# Create list of numpy arrays
arr_list = [precip_2002_arr, precip_2013_arr]

In [8]:
for arr in arr_list:
    arr_sum = np.sum(arr)
    print("sum:", arr_sum)
    
    arr_median = np.median(arr)
    print("median:", arr_median)

sum: 13.879999999999999
median: 1.125
sum: 34.35
median: 1.365


Again, you can capture these values in new, separate lists by defining empty lists and using the augmented assignment operator (listname += [value]) to add the results to each list.

In [9]:
# Create new empty lists to receive values
monthly_precip_sum = []
monthly_precip_median = []

# Calculate sum and median for each numpy array and add to new lists
for arr in arr_list:
    arr_sum = np.sum(arr)
    monthly_precip_sum += [arr_sum]
    
    arr_median = np.median(arr)
    monthly_precip_median += [arr_median]

Review the list being iterated upon and the placeholder in loop

In [10]:
# Lists contain the calculated values
print(monthly_precip_sum)
print(monthly_precip_median)

[13.879999999999999, 34.35]
[1.125, 1.365]


The variable arr is the placeholder variable that is created as part of the loop; it represents each item from the original list (arr_list) as the loop iterates.

At the end of the loop, arr is equal to the last value that it was assigned (e.g. precip_2013_arr, the last array in the list).

Similarly, at the end of the loop, arr_sum and arr_median are also equal to the last value that was calculated for each (e.g. the sum and median values for precip_2013_arr).

In [11]:
# Final value of arr
print(arr)

# Final value of arr_sum
print(arr_sum)

# Final value of arr_median
print(arr_median)

[ 0.27  1.33  1.72  4.14  2.66  0.61  1.03  1.4  18.16  2.24  0.29  0.5 ]
34.35
1.365
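
When the loop prints only "sum:" and "median:", it is not obvious which array each line belongs to. One way to label the results, sketched here, is to loop over the arrays together with a list of labels using the built-in zip() function. The list year_labels and the list labeled_sums are assumptions added for this example; they are not part of the lesson's code.

```python
import numpy as np

# Monthly precip (inches) for 2002 and 2013 in Boulder, CO
precip_2002_arr = np.array([1.07, 0.44, 1.50, 0.20, 3.20, 1.18,
                            0.09, 1.44, 1.52, 2.44, 0.78, 0.02])
precip_2013_arr = np.array([0.27, 1.33, 1.72, 4.14, 2.66, 0.61,
                            1.03, 1.40, 18.16, 2.24, 0.29, 0.50])

arr_list = [precip_2002_arr, precip_2013_arr]

# Hypothetical labels, listed in the same order as arr_list
year_labels = ["2002", "2013"]

# Store each sum together with its label
labeled_sums = []

# zip() pairs each label with the array at the same position
for year, arr in zip(year_labels, arr_list):
    arr_sum = np.sum(arr)
    labeled_sums.append((year, arr_sum))
    print(year, "sum:", arr_sum, "median:", np.median(arr))
```

The labels and arrays must be listed in the same order, because zip() matches them purely by position.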


Automate Calculation on Multiple Columns in Pandas Dataframe

In addition to running a loop on multiple data structures (e.g. multiple numpy arrays like in the previous example), you can also run loops on multiple columns of a pandas dataframe.

For example, you may need to convert the measurement units of multiple columns, such as converting the precipitation values from inches to mm (1 inch = 25.4 mm).

In [12]:
# Import necessary packages
import pandas as pd

# Average monthly precip (inches) in 2002 and 2013 for Boulder, CO
precip_2002_2013_df = pd.DataFrame(columns=["month", "precip_2002", "precip_2013"],
                                  data=[
                                      ["Jan", 1.07, 0.27], ["Feb", 0.44, 1.13],
                                      ["Mar", 1.50, 1.72], ["Apr", 0.20, 4.14],
                                      ["May", 3.20, 2.66], ["June", 1.18, 0.61],
                                      ["July", 0.09, 1.03], ["Aug", 1.44, 1.40],
                                      ["Sept", 1.52, 18.16], ["Oct", 2.44, 2.24],
                                      ["Nov", 0.78, 0.29], ["Dec", 0.02, 0.50]
                                  ])

precip_2002_2013_df

Unnamed: 0,month,precip_2002,precip_2013
0,Jan,1.07,0.27
1,Feb,0.44,1.13
2,Mar,1.5,1.72
3,Apr,0.2,4.14
4,May,3.2,2.66
5,June,1.18,0.61
6,July,0.09,1.03
7,Aug,1.44,1.4
8,Sept,1.52,18.16
9,Oct,2.44,2.24


Create List of Column Names

As you want to iterate on multiple columns in a pandas dataframe, you can create a list that contains the column names that you want to work with in the loop.

In [13]:
# Create a list of column names
cols = ["precip_2002", "precip_2013"]

In [14]:
# Convert values for each column in cols list
for column in cols:
    precip_2002_2013_df[column] *= 25.4
    
# Print new values
precip_2002_2013_df

Unnamed: 0,month,precip_2002,precip_2013
0,Jan,27.178,6.858
1,Feb,11.176,28.702
2,Mar,38.1,43.688
3,Apr,5.08,105.156
4,May,81.28,67.564
5,June,29.972,15.494
6,July,2.286,26.162
7,Aug,36.576,35.56
8,Sept,38.608,461.264
9,Oct,61.976,56.896


In the first iteration, column holds the name precip_2002, so precip_2002_2013_df[column] selects the precip_2002 column; in the last iteration, column holds the name precip_2013.

You know that column is a placeholder (implicit) variable because its value changes with each iteration.

Also, notice the placement of the code precip_2002_2013_df, which displays the dataframe after the loop is completed.

This code is not contained within the loop, so you do not see the dataframe each time that the loop iterates. You only see the dataframe once, when the loop is completed.
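
To see what the placeholder holds on each pass, you can print it inside the loop. The sketch below uses a shortened, three-row version of the dataframe (an assumption made to keep the example small, with the made-up name precip_df). It shows that column is a string containing the column name, which is then used to select the actual column.

```python
import pandas as pd

# Shortened dataframe for illustration (first three months only)
precip_df = pd.DataFrame(columns=["month", "precip_2002", "precip_2013"],
                         data=[["Jan", 1.07, 0.27],
                               ["Feb", 0.44, 1.13],
                               ["Mar", 1.50, 1.72]])

# List of column names to convert
cols = ["precip_2002", "precip_2013"]

for column in cols:
    # column is a string with the column name, not the column's values
    print(column, type(column))

    # Use the name to select the column and convert in to mm
    precip_df[column] *= 25.4

# This line is outside the loop, so the dataframe is shown only once
print(precip_df)
```

Moving the final print() call inside the loop (by indenting it) would instead show the dataframe after every iteration, which can help when checking intermediate results.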

Automate Data Downloads Using EarthPy

In [15]:
import os
import earthpy as et

Create a list of URLs For Loop

As you want to iterate on multiple URLs, you can create a list that contains the URLs for all of the files that you want to download.

It is useful to create variables for the individual URLs first, so that you can easily manage them as well as make the code more readable.

In [22]:
# URL for avg monthly precip (inches) for Boulder, CO
avg_month_precip_url = 'https://ndownloader.figshare.com/files/12565616'

# URL for precip data (inches) for 2002 and 2013
precip_2002_2013_url = 'https://ndownloader.figshare.com/files/12707792'

# Create list of URLs
urls = [avg_month_precip_url, precip_2002_2013_url]

In [23]:
# Download each url in list
for file_url in urls:
    et.data.get_data(url=file_url)

You can use os.listdir() to list the contents of the download directory.

EarthPy downloads files to a subdirectory called earthpy-downloads under the data directory in the earth-analytics directory (e.g. earth-analytics/data/earthpy-downloads/).

In [24]:
# Create path for data directory
data_dir = os.path.join(et.io.HOME, "earth-analytics", "data", "earthpy-downloads")

os.listdir(data_dir)

['avg-monthly-precip.txt', 'monthly-precip-2002-2013.csv']
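
If the downloads have not run yet, os.listdir() raises a FileNotFoundError because the directory does not exist. The sketch below shows one way to guard against that with os.path.isdir(). It uses a temporary directory and empty placeholder files as stand-ins for the real earthpy-downloads folder, so it runs without downloading anything; the names fake_home, exists_before, and contents are assumptions made for this illustration.

```python
import os
import tempfile

# Temporary directory standing in for the home directory (assumption)
with tempfile.TemporaryDirectory() as fake_home:
    # Build the path piece by piece so it works on any operating system
    data_dir = os.path.join(fake_home, "earth-analytics", "data",
                            "earthpy-downloads")

    # The directory does not exist until something has been downloaded
    exists_before = os.path.isdir(data_dir)

    # Simulate two downloaded files with empty placeholder files
    os.makedirs(data_dir)
    for file_name in ["avg-monthly-precip.txt", "monthly-precip-2002-2013.csv"]:
        open(os.path.join(data_dir, file_name), "w").close()

    # List the contents only if the directory exists
    if os.path.isdir(data_dir):
        contents = sorted(os.listdir(data_dir))
    else:
        contents = []

print(exists_before)
print(contents)
```

os.listdir() returns names in arbitrary order, so the sketch sorts the result to make the output predictable.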