Because the directions for this lab were written for an out-of-date version of the IMDb website, my instructor directed me to conduct some webscraping on whatever piqued my interest, so long as I could learn or glean some helpful information from my work. I decided, therefore, to apply webscraping to see if a personal project of mine could be made easier and more effective.

I've been trying to figure out a way to divide up the study of the different books of the Standard Works of the Church of Jesus Christ of Latter-day Saints into chunks of daily reading that are both manageable and consistent. I have tried doing such breakdowns by verse, but have found that there is quite a wide variation in verse lengths, even in the same chapters or on the same pages. That inconsistency can lead to major differences in the amount of time a reader will spend on one chunk of daily reading. I often will read 17 verses one day and find it only spans about a page, and then read 17 verses the next day and find that the same number of verses spans two pages because the verses are longer on the second day than on the first. 

So, I wanted to use webscraping to see if I could more effectively break up those daily chunks by the number of lines to read. This level of nuance will provide a higher level of consistency from day to day as I seek to come closer to Christ by studying His word, while also helping me not be so stressed out when I find that one day's reading is twice as long as the previous day's despite being the same number of verses.
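
As a rough sketch of what "breaking up by lines" could look like once line counts are available, the snippet below greedily groups consecutive verses until the next verse would push a day past a target number of lines. The function name `chunk_by_lines`, the target of 10 lines, and the sample line counts are all made up for illustration; this is one simple approach under those assumptions, not the final breakdown method.

```python
def chunk_by_lines(line_counts, target_lines):
    """Greedily group consecutive verses so each day stays at or under target_lines.

    line_counts[i] is the number of lines in verse i + 1.
    A single verse longer than target_lines still gets its own day.
    """
    chunks = []      # list of days, each a list of verse numbers
    current = []     # verses assigned to the day being built
    total = 0        # lines accumulated in the current day
    for verse_number, lines in enumerate(line_counts, start=1):
        # Start a new day if this verse would overshoot a non-empty day
        if current and total + lines > target_lines:
            chunks.append(current)
            current, total = [], 0
        current.append(verse_number)
        total += lines
    if current:
        chunks.append(current)
    return chunks


# Hypothetical line counts for eight verses
sample_line_counts = [4, 4, 3, 3, 2, 4, 2, 4]
days = chunk_by_lines(sample_line_counts, target_lines=10)
print(days)
```

With this kind of grouping, days differ in how many verses they hold but stay close to each other in reading length, which is the consistency I am after.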

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import re

I started off by setting my working directory, pointing to the Doctrine and Covenants webpage found on the main website of the Church, and sending a GET request so I could start looking at the HTML code of the page.

In [3]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
os.chdir(r"D:\Faith and Religion Stuff\Come, Follow Me Breakdowns")

In [4]:
# This url may need to be updated from time to time. 
url = 'https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/1?lang=eng'


In [5]:
response = requests.get(url)
response

<Response [200]>

In [6]:
response.content



In [7]:
response.headers

{'Content-Security-Policy': "frame-src 'self' https://*.churchofjesuschrist.org/ https://*.trustarc.com/;style-src 'self' 'unsafe-inline' *", 'Content-Type': 'text/html', 'ETag': 'W/"12702-MlBla2xpbBujMZkOLujd4LHlmJo"', 'X-Response-Time': '192.404ms', 'X-Vcap-Request-Id': 'cfed052f-a84f-4862-7098-83b7cb030a0f', 'X-Akamai-Transformed': '9 75522 0 pmb=mRUM,1', 'Content-Encoding': 'gzip', 'Expires': 'Mon, 05 Aug 2024 22:08:31 GMT', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Date': 'Mon, 05 Aug 2024 22:08:31 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive, Transfer-Encoding', 'Vary': 'Accept-Encoding', 'Server-Timing': 'edge; dur=42, origin; dur=241, cdn-cache; desc=MISS, ak_p; desc="1722895711259_399222596_540988346_28179_6148_220_31_-";dur=1'}

In [8]:
print(response.headers['Content-Type'])

text/html


In [9]:
soup = BeautifulSoup(response.content, "html.parser")

In [10]:
type(soup)

bs4.BeautifulSoup

In [11]:
content = soup.prettify()  # This formats the HTML in a readable way
print(content)

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title data-react-helmet="true">
   Doctrine and Covenants 1
  </title>
  <meta content="4.16.1" name="version"/>
  <meta content="width=device-width, initial-scale=1, minimum-scale=1" name="viewport"/>
  <meta content="128361512" data-react-helmet="true" name="Search.doc-aid">
   <meta content="Doctrine and Covenants 1" data-react-helmet="true" name="title">
    <meta content="Revelation given through Joseph Smith the Prophet, on November 1, 1831, during a special conference of elders of the Church, held at Hiram, Ohio. Many revelations had been received from the Lord prior to this time, and the compilation of these for publication in book form was one of the principal subjects passed upon at the conference. This section constitutes the Lord’s preface to the doctrines, covenants, and commandments given in this dispensation." data-react-helmet="true" name="description">
     <meta content="Doctrine and Covenants 1" da

Using BeautifulSoup allowed me to identify a few important pieces of information. First off, when looking at any given chapter of scripture, I found verse tags that gave me information about each verse and how it is displayed. Then, when looking at the contents page of the Doctrine and Covenants, I found that there were href links to each section, and in each of the < a > lines there was also a text title associated with the tag < a class="sc-omeqik-0 ewktus">. With those two pieces, I knew that I could get the information I needed.
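
To make the idea of pulling an href and its title out of the same < a > tag concrete, here is a small sketch. It uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the HTML string is a made-up stand-in for the contents page. The class name `LinkCollector` and the nesting of the title paragraph inside the link are assumptions for illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags that carry a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []            # finished (href, text) pairs
        self._current_href = None  # href of the <a> we are inside, if any
        self._text_parts = []      # text seen so far inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes and attr_map.get("href"):
            self._current_href = attr_map["href"]
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching <a>
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


# Hypothetical snippet shaped like the contents page described above
sample_html = """
<ul>
  <li><a class="sc-omeqik-0 ewktus" href="/study/scriptures/dc-testament/dc/1?lang=eng"><p class="title">Doctrine and Covenants 1</p></a></li>
  <li><a class="other" href="/ignored">Ignored</a></li>
  <li><a class="sc-omeqik-0 ewktus" href="/study/scriptures/dc-testament/dc/2?lang=eng"><p class="title">Doctrine and Covenants 2</p></a></li>
</ul>
"""

collector = LinkCollector("sc-omeqik-0")
collector.feed(sample_html)
links = collector.links
print(links)
```

Reading the title from inside the same tag as the link keeps each title attached to its own href, so the two never have to be lined up by position afterwards.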

I got a lot of help from ChatGPT on this, but I was able to work with it to download and access ChromeDriver (Google's driver for Chrome) and use Selenium to do a lot of the same webscraping things that BeautifulSoup can do.

After installing the driver and Selenium, I had to establish the path for accessing the driver, test to see if it was working, set options for Chrome, and create the service function I'd be using. 

In [2]:
# Define the path to the chromedriver executable
chrome_driver_dir = r'D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\chromedriver-win64'
chrome_driver_path = os.path.join(chrome_driver_dir, 'chromedriver.exe')

# Verify that the path is correct and the file exists
print(f"Checking if the chromedriver exists at: {chrome_driver_path}")
if not os.path.isfile(chrome_driver_path):
    raise FileNotFoundError(f"The chromedriver executable was not found at the specified path: {chrome_driver_path}")
else:
    print("Chromedriver found!")

# Set up the headless browser options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--window-size=1920x1080")

# Set up the Chrome service
service = Service(chrome_driver_path)

Checking if the chromedriver exists at: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\chromedriver-win64\chromedriver.exe
Chromedriver found!


From there, it was a lot of poking around and tweaking and adjusting. My first order of business was to see if I could even access the text of a verse. After that, I relied on ChatGPT's guidance to get the font-size and line heights to estimate the number of lines. 
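
The arithmetic behind the estimate is simply the rendered height of a verse divided by the height of one line. The sketch below isolates that step as a plain function so it can be checked without a browser. The helper names `parse_px` and `estimate_lines` are hypothetical, the `32px` and `20px` values are made up, and treating a `normal` line height as 1.2 times the font size is only a common approximation, not something the page guarantees. The heights 127 and 98 are taken from the D&C 1 run in this notebook.

```python
import re


def parse_px(css_value):
    """Turn a computed CSS length such as '31.5px' into a float number of pixels."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)px\s*", css_value)
    if match is None:
        raise ValueError(f"Not a pixel length: {css_value!r}")
    return float(match.group(1))


def estimate_lines(height_px, line_height_css, font_size_css):
    """Estimate how many text lines fit in an element of the given height."""
    if line_height_css.strip() == "normal":
        # Assumption: approximate 'normal' as 1.2 x the font size
        line_height = 1.2 * parse_px(font_size_css)
    else:
        line_height = parse_px(line_height_css)
    # Round to the nearest whole line, and never report fewer than one line
    return max(1, round(height_px / line_height))


lines_a = estimate_lines(127, "32px", "20px")
lines_b = estimate_lines(98, "32px", "20px")
lines_c = estimate_lines(48, "normal", "20px")
print(lines_a, lines_b, lines_c)
```

Rounding rather than floor division matters when the height falls just short of a whole multiple of the line height: with these made-up values, `127 // 32` would give 3 even though the verse visibly spans 4 lines.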

In [13]:
# Initialize the Chrome WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Use driver to get url
driver.get(url)

# Allow some time for rendering
time.sleep(2)

# Find the element containing the text
container = driver.find_element(By.CSS_SELECTOR, '.verse')

# Get the text of the element
text = container.text

# Get the bounding rectangle of the element
rect = container.rect

# Print the information
print(f"Container dimensions: width={rect['width']} height={rect['height']}")
print(f"Text content: {text}")

# Check if the content height suggests line breaks
verse_element = driver.find_element(By.CSS_SELECTOR, '.verse')
font_size = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('font-size');", verse_element)

# Calculate line height
line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", verse_element)
line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string
num_lines = rect['height'] // line_height_numeric

print(f"Estimated number of lines: {num_lines}")

# Close the browser
driver.quit()

Container dimensions: width=512 height=127
Text content: 1 Hearken, O ye people of my church, saith the voice of him who dwells on high, and whose eyes are upon all men; yea, verily I say: Hearken ye people from afar; and ye that are upon the islands of the sea, listen together.
Estimated number of lines: 4


Once I established that I could do that with one verse, and that the estimate was accurate, I could do it for a whole chapter. 

In [14]:
# Initialize the Chrome WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Run the driver
driver.get(url)

# Allow some time for rendering
time.sleep(2)

# Find all elements containing the text
verses = driver.find_elements(By.CSS_SELECTOR, '.verse')

# Iterate over each verse element
for verse in verses:
    # Get the text of the element
    text = verse.text

    # Get the bounding rectangle of the element
    rect = verse.rect

    # Print the information
    print(f"Container dimensions: width={rect['width']} height={rect['height']}")
    print(f"Text content: {text}")

    # Calculate font size
    font_size = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('font-size');", verse)

    # Calculate line height
    line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", verse)
    line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

    # Calculate number of lines
    num_lines = rect['height'] // line_height_numeric

    print(f"Estimated number of lines: {num_lines}")
    print("-" * 50)  # Separator for clarity

# Close the browser
driver.quit()

Container dimensions: width=512 height=127
Text content: 1 Hearken, O ye people of my church, saith the voice of him who dwells on high, and whose eyes are upon all men; yea, verily I say: Hearken ye people from afar; and ye that are upon the islands of the sea, listen together.
Estimated number of lines: 4
--------------------------------------------------
Container dimensions: width=512 height=127
Text content: 2 For verily the voice of the Lord is unto all men, and there is none to escape; and there is no eye that shall not see, neither ear that shall not hear, neither heart that shall not be penetrated.
Estimated number of lines: 4
--------------------------------------------------
Container dimensions: width=512 height=98
Text content: 3 And the rebellious shall be pierced with much sorrow; for their iniquities shall be spoken upon the housetops, and their secret acts shall be revealed.
Estimated number of lines: 3
--------------------------------------------------
Container dimen

After getting that to work, I knew I could move on to just saving the information collected to a dataframe. 

In [15]:
# Initialize the Chrome WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Run the driver
driver.get(url)

# Find all elements containing the text
verses = driver.find_elements(By.CSS_SELECTOR, '.verse')

# Initialize a list to store data dictionaries
data_list = []

# Iterate over each verse element
for verse in verses:
    # Get the text of the element
    text = verse.text

    # Extract verse number (assuming it's in the format "1 ", "2 ", etc.)
    verse_number = text.split(' ')[0]  # Assuming verse number is at the start of text
    
    # Get the bounding rectangle of the element
    rect = verse.rect

    # Calculate line height
    line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", verse)
    line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

    # Calculate number of lines
    num_lines = rect['height'] // line_height_numeric

    # Append data dictionary to list
    data_list.append({
        'Verse Number': verse_number,
        'Num Lines': num_lines
    })

# Convert list of dictionaries to DataFrame
df = pd.DataFrame(data_list)

# Print the DataFrame (optional)
print(df)

# Close the browser
driver.quit()

   Verse Number  Num Lines
0             1          4
1             2          4
2             3          3
3             4          3
4             5          2
5             6          4
6             7          2
7             8          4
8             9          2
9            10          4
10           11          2
11           12          2
12           13          3
13           14          5
14           15          2
15           16          6
16           17          4
17           18          3
18           19          3
19           20          2
20           21          1
21           22          1
22           23          3
23           24          4
24           25          1
25           26          2
26           27          2
27           28          3
28           29          4
29           30          6
30           31          2
31           32          2
32           33          3
33           34          3
34           35          4
35           36          3
3

And once that was working, I knew I could just write a function that would take the input of a url to a chapter of scripture and then use that url to get the line lengths of each verse. 

In [3]:
def get_verse_lines(url):
    """
    Take the URL of a chapter of Holy Scripture and return a pandas DataFrame
    with one row per verse: the verse number and the estimated number of lines
    that verse occupies when the page is rendered.
    """
    # Import needed packages
    import os
    import re
    import pandas as pd
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Define the path to the chromedriver executable
    chrome_driver_dir = r'D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\chromedriver-win64'
    chrome_driver_path = os.path.join(chrome_driver_dir, 'chromedriver.exe')

    # Set up the headless browser options
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920x1080")

    # Set up the Chrome service
    service = Service(chrome_driver_path)

    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Initialize a list to store data dictionaries
    data_list = []

    try:
        # Run the driver
        driver.get(url)

        # Wait up to 10 seconds for at least one verse to render
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '.verse'))
            )
        except TimeoutException:
            pass  # No verses appeared; an empty DataFrame is returned below

        # Find all elements containing the text
        verses = driver.find_elements(By.CSS_SELECTOR, '.verse')

        # Iterate over each verse element
        for verse in verses:
            # Get the text of the element
            text = verse.text

            # Extract verse number (assuming it's in the format "1 ", "2 ", etc.)
            verse_number = text.split(' ')[0]  # Assuming verse number is at the start of text

            # Get the bounding rectangle of the element
            rect = verse.rect

            # Calculate line height
            line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", verse)
            line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

            # Calculate number of lines
            num_lines = rect['height'] // line_height_numeric

            # Append data dictionary to list
            data_list.append({
                'verse_number': verse_number,
                'num_lines': num_lines
            })
    finally:
        # Close the browser even if something above fails
        driver.quit()

    # Convert list of dictionaries to DataFrame
    return pd.DataFrame(data_list)

Having successfully done that, I could have called it a day, but instead I decided that if I could get that information, surely I could also use webscraping to get a dataframe of links that I could feed into that function using an iteration loop. 

Again, I relied on ChatGPT to walk me through the process and help me gather the right information. 

I started with just getting the links. 

In [4]:
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    # Navigate to the page with your elements
    driver.get('https://www.churchofjesuschrist.org/study/scriptures/dc-testament?lang=eng')

    # Find all elements with the specified class name
    elements = driver.find_elements(By.CLASS_NAME, 'sc-omeqik-0')

    # Initialize a list to store href values
    href_list = []
    title_list = []

    # Iterate over each element and extract the href attribute
    for element in elements:
        href = element.get_attribute('href')
        href_list.append(href)

    # Create a DataFrame to store the href values
    urls_df = pd.DataFrame({'Href': href_list})

    # Print the DataFrame (optional)
    print(urls_df)

except Exception as e:
    print(f"An error occurred: {e}")

finally:
    # Close the browser
    driver.quit()

                                                  Href
0    https://www.churchofjesuschrist.org/study/scri...
1    https://www.churchofjesuschrist.org/study/scri...
2    https://www.churchofjesuschrist.org/study/scri...
3    https://www.churchofjesuschrist.org/study/scri...
4    https://www.churchofjesuschrist.org/study/scri...
..                                                 ...
140  https://www.churchofjesuschrist.org/study/scri...
141  https://www.churchofjesuschrist.org/study/scri...
142  https://www.churchofjesuschrist.org/study/scri...
143  https://www.churchofjesuschrist.org/study/scri...
144  https://www.churchofjesuschrist.org/study/scri...

[145 rows x 1 columns]


But the links would do no good at all if I didn't have the section titles, because I realized I'd also want to save each section or chapter's verse lengths as separate dataframes that I could export as csv files. So, I spent a long time trying to get that worked out. Well, actually, I spent a long time trying to work out nonexistent issues because I didn't realize the output of my attempts to get titles and links was a scrollable cell, so I thought I kept just getting links. 

Some of the additions to this cell's code handle things that I had to remove: there were more titles than href links, so I had to go in and remove the titles that didn't have links. Additionally, Official Declarations 1 and 2 have different links but the same title, so I had to duplicate that title. This process was just a whole lot of trial and error.
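
Since Official Declarations 1 and 2 share a title but not a link, one possible way to tell them apart is to build an identifier from the link itself. The sketch below is only an illustration of that idea and is not what the cell below does; the function name `key_from_href` and the `BOOK_NUMBER` format it returns are assumptions.

```python
from urllib.parse import urlparse


def key_from_href(href):
    """Build an identifier like 'DC_1' or 'OD_2' from a chapter link.

    Returns None for links that do not end in a number (title page, etc.).
    """
    # Keep only the path, e.g. /study/scriptures/dc-testament/dc/1
    parts = urlparse(href).path.strip("/").split("/")
    if len(parts) >= 2 and parts[-1].isdigit():
        return f"{parts[-2].upper()}_{parts[-1]}"
    return None


section_key = key_from_href(
    "https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/1?lang=eng"
)
declaration_key = key_from_href(
    "https://www.churchofjesuschrist.org/study/scriptures/dc-testament/od/2?lang=eng"
)
title_page_key = key_from_href(
    "https://www.churchofjesuschrist.org/study/scriptures/dc-testament/title-page?lang=eng"
)
print(section_key, declaration_key, title_page_key)
```

Because every link is unique, identifiers built this way stay unique even where the visible titles repeat.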

In [5]:
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    # Navigate to the page with your elements
    driver.get('https://www.churchofjesuschrist.org/study/scriptures/dc-testament?lang=eng')

    # Find all elements with the specified classes
    href_elements = driver.find_elements(By.CSS_SELECTOR, 'a.sc-omeqik-0.ewktus')
    title_elements = driver.find_elements(By.CSS_SELECTOR, 'p.title')

    # Debugging: Print lengths of elements found
    print(f"Number of href elements: {len(href_elements)}")
    print(f"Number of title elements: {len(title_elements)}")

    # Skip the very first title element and keep the next 144
    if title_elements:
        title_elements = title_elements[1:145]

    # Delete the 4th remaining title element (index 3)
    if len(title_elements) > 3:
        del title_elements[3]

    # Duplicate the last title element (Official Declarations 1 and 2 share one title)
    if title_elements:
        title_elements.append(title_elements[-1])

    # Delete the 143rd href element (index 142)
    if len(href_elements) > 142:
        del href_elements[142]

    # Initialize lists to store href and title values
    href_list = [element.get_attribute('href') for element in href_elements]
    title_list = [element.text for element in title_elements]

    # Create a list of dictionaries to store matched data
    matched_data = []
    min_length = min(len(href_list), len(title_list))

    # Match hrefs and titles based on the minimum length
    for i in range(min_length):
        matched_data.append({'Title': title_list[i], 'Href': href_list[i]})

    # Create a DataFrame from matched data
    dc_links_df = pd.DataFrame(matched_data)

    # Set Pandas display options to show full content of 'Href' column
    pd.set_option('display.max_colwidth', None)

    # Print the DataFrame to verify the 'Href' column contents
    print(dc_links_df)

except Exception as e:
    print(f"An error occurred: {e}")

finally:
    # Close the browser
    driver.quit()

Number of href elements: 145
Number of title elements: 148
                               Title  \
0                         Title Page   
1                       Introduction   
2    Chronological Order of Contents   
3                           Contents   
4           Doctrine and Covenants 1   
..                               ...   
139       Doctrine and Covenants 136   
140       Doctrine and Covenants 137   
141       Doctrine and Covenants 138   
142            Official Declarations   
143            Official Declarations   

                                                                                        Href  
0      https://www.churchofjesuschrist.org/study/scriptures/dc-testament/title-page?lang=eng  
1    https://www.churchofjesuschrist.org/study/scriptures/dc-testament/introduction?lang=eng  
2     https://www.churchofjesuschrist.org/study/scriptures/dc-testament/chron-order?lang=eng  
3    https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/_conte

Once it was working, though, I could subset the dataframe into the introductory stuff, the actual sections of the Doctrine and Covenants, and the Official Declarations. I had to do this because, while the sections of the Doctrine and Covenants are "versified," the introductory materials and Official Declarations are not, so I'll have to get those line lengths later. 

In [6]:
dc_sections = dc_links_df.iloc[4:142]
dc_intro = dc_links_df.iloc[:3]
dc_ods = dc_links_df.iloc[142:]

In [16]:
dc_extras = pd.concat([dc_intro,dc_ods])
dc_extras

Unnamed: 0,Title,Href
0,Title Page,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/title-page?lang=eng
1,Introduction,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/introduction?lang=eng
2,Chronological Order of Contents,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/chron-order?lang=eng
142,Official Declarations,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/od/1?lang=eng
143,Official Declarations,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/od/2?lang=eng


After subsetting the data all I had to do was establish my iteration loop, and the result was a nice set of individual csv files that I'll be able to use SQL queries to join and access so that I can make my own daily breakdowns later. 

**NOTE:** The cell below has been commented out, as later on in my process I amended what I wanted to do as part of my iteration loop that creates the csv files, and so I didn't want to accidentally run this cell again and create all the csvs multiple times in the future.

In [56]:
# # Define the directory path where you want to save the CSV files
# dir_path = r'D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines'

# # Ensure the directory exists, if not create it
# os.makedirs(dir_path, exist_ok=True)

# for index, row in dc_sections.iterrows():
#     title = row['Title']
#     link = row['Href']
    
#     verse_lines_df = get_verse_lines(link)

#     if verse_lines_df is not None and not verse_lines_df.empty:
#         csv_filename = f'{title.replace(" ","_").lower()}_verse_lines.csv'
#         full_path = os.path.join(dir_path, csv_filename)
        
#         # Debugging: Print full path to ensure it's correct
#         print(f'Saving to: {full_path}')

#         verse_lines_df.to_csv(full_path, index=False)

#         print(f'CSV file for "{title}" saved successfully as {csv_filename}.')
#     else:
#         print(f'No data for "{title}", skipping CSV creation.')

Saving to: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines\doctrine_and_covenants_1_verse_lines.csv
CSV file for "Doctrine and Covenants 1" saved successfully as doctrine_and_covenants_1_verse_lines.csv.
Saving to: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines\doctrine_and_covenants_2_verse_lines.csv
CSV file for "Doctrine and Covenants 2" saved successfully as doctrine_and_covenants_2_verse_lines.csv.
Saving to: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines\doctrine_and_covenants_3_verse_lines.csv
CSV file for "Doctrine and Covenants 3" saved successfully as doctrine_and_covenants_3_verse_lines.csv.
No data for "Doctrine and Covenants 4", skipping CSV creation.
Saving to: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines\doctrine_and_covenants_5_verse_lines.csv
CSV file for "Doctrine and Covenants 5" saved successfully as doctrine_and_covenants_
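
An alternative to commenting the loop out would be to make it safe to re-run by skipping any section whose csv already exists. The sketch below shows only that check, using a temporary folder in place of my real directory. The helper names `csv_path_for` and `titles_still_needed` are hypothetical, and the file naming simply mirrors the pattern used in the loop above.

```python
import os
import tempfile


def csv_path_for(title, dir_path):
    """Build the csv path using the same naming pattern as the loop above."""
    csv_filename = f'{title.replace(" ", "_").lower()}_verse_lines.csv'
    return os.path.join(dir_path, csv_filename)


def titles_still_needed(titles, dir_path):
    """Return only the titles that do not have a csv file in dir_path yet."""
    return [t for t in titles if not os.path.isfile(csv_path_for(t, dir_path))]


# Demonstrate with a throwaway directory instead of the real output folder
with tempfile.TemporaryDirectory() as demo_dir:
    # Pretend section 1 has already been scraped and saved
    open(csv_path_for("Doctrine and Covenants 1", demo_dir), "w").close()

    remaining = titles_still_needed(
        ["Doctrine and Covenants 1", "Doctrine and Covenants 2"], demo_dir
    )

print(remaining)
```

Filtering the titles first would also avoid opening a browser for sections that have already been collected.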

Having downloaded all of that information, I could then move on to editing each csv file to be more friendly for accessing in SQL. Among things to be edited are:

1) Creating an individual csv sheet with all the primary keys that will be used to identify each section, and eventually each chapter of each book of scripture. Included in this table will be the primary key (for example "DC_1", because special characters like the ampersand are not SQL friendly) and information about that chapter such as number of verses and number of lines total in that chapter. At some point I'll likely figure out how to pull and add the length of the audio-recording of that chapter. This information will all be useful for establishing how long entire chapters are in comparison to each other. 
2) Adding the correct Foreign Key to each corresponding chapter and section of Scripture (for example, making sure every row in D&C 1 has the Primary Key "DC_1" listed as a Foreign Key, which will facilitate accessing information). 
3) Establishing a Primary Key for each verse that will facilitate accessing information about each of those verses specifically. 

The first task will be fairly simple, given that I've already created a pandas dataframe with the section Titles and links. 

In [20]:
dc_sections

Unnamed: 0,Title,Href
4,Doctrine and Covenants 1,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/1?lang=eng
5,Doctrine and Covenants 2,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/2?lang=eng
6,Doctrine and Covenants 3,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/3?lang=eng
7,Doctrine and Covenants 4,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/4?lang=eng
8,Doctrine and Covenants 5,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/5?lang=eng
...,...,...
137,Doctrine and Covenants 134,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/134?lang=eng
138,Doctrine and Covenants 135,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/135?lang=eng
139,Doctrine and Covenants 136,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/136?lang=eng
140,Doctrine and Covenants 137,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/137?lang=eng


In [21]:
def get_pk(title):
    # The section number is the last space-separated piece of the title,
    # e.g. "Doctrine and Covenants 25" -> "25"
    return title.rsplit(' ', 1)[-1]

# Work on an explicit copy so pandas does not warn about setting values on a slice
dc_sections = dc_sections.copy()
dc_sections['PK'] = "DC_" + dc_sections['Title'].apply(get_pk)

columns = ['PK','Title','Href']

dc_sections = dc_sections[columns]

dc_sections


Unnamed: 0,PK,Title,Href
4,DC_1,Doctrine and Covenants 1,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/1?lang=eng
5,DC_2,Doctrine and Covenants 2,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/2?lang=eng
6,DC_3,Doctrine and Covenants 3,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/3?lang=eng
7,DC_4,Doctrine and Covenants 4,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/4?lang=eng
8,DC_5,Doctrine and Covenants 5,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/5?lang=eng
...,...,...,...
137,DC_134,Doctrine and Covenants 134,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/134?lang=eng
138,DC_135,Doctrine and Covenants 135,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/135?lang=eng
139,DC_136,Doctrine and Covenants 136,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/136?lang=eng
140,DC_137,Doctrine and Covenants 137,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/137?lang=eng


Having gotten that set up, now I can adjust the code I ran above to include columns with the appropriate Foreign Key and Primary Key. 

In [28]:
# Define the directory path where you want to save the CSV files
dir_path = r'D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines'

# Ensure the directory exists, if not create it
os.makedirs(dir_path, exist_ok=True)

for index, row in dc_sections.iterrows():
    title = row['Title']
    link = row['Href']
    foreign_key = row['PK']
    
    verse_lines_df = get_verse_lines(link)

    if verse_lines_df is not None and not verse_lines_df.empty:
        # Add the PK from dc_sections dataframe as a foreign key in the created dataframe
        verse_lines_df['foreign_key'] = foreign_key

        # Create the new pk column
        # Uses a lambda function to concatenate the foreign_key (the original PK) and the verse number. 
        verse_lines_df['pk'] = verse_lines_df.apply(lambda x: f"{foreign_key}_{x['verse_number']}", axis=1)

        # re-order columns to be most useable and understandable
        verse_lines_df = verse_lines_df[['foreign_key','pk', 'verse_number', 'num_lines']]

        csv_filename = f'{title.replace(" ","_").lower()}_verse_lines.csv'
        full_path = os.path.join(dir_path, csv_filename)
        
        # Debugging: Print full path to ensure it's correct
        print(f'Saving to: {full_path}')

        verse_lines_df.to_csv(full_path, index=False)

        print(f'CSV file for "{title}" saved successfully as {csv_filename}.')
    else:
        print(f'No data for "{title}", skipping CSV creation.')

Saving to: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines\doctrine_and_covenants_1_verse_lines.csv
CSV file for "Doctrine and Covenants 1" saved successfully as doctrine_and_covenants_1_verse_lines.csv.
Saving to: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines\doctrine_and_covenants_2_verse_lines.csv
CSV file for "Doctrine and Covenants 2" saved successfully as doctrine_and_covenants_2_verse_lines.csv.
Saving to: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines\doctrine_and_covenants_3_verse_lines.csv
CSV file for "Doctrine and Covenants 3" saved successfully as doctrine_and_covenants_3_verse_lines.csv.
Saving to: D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\D&C Sections Verse Lines\doctrine_and_covenants_4_verse_lines.csv
CSV file for "Doctrine and Covenants 4" saved successfully as doctrine_and_covenants_4_verse_lines.csv.
Saving to: D:\\Faith and Religion Stuff\\Com

Something that tripped me up when executing the cell above was naming conventions - I was trying to reorder the columns, but using the wrong name for the column num_lines, calling it 'verse_lines' instead. That's something I need to watch out for and be careful of. 

In [32]:
dc_sections

Unnamed: 0,PK,Title,Href
4,DC_1,Doctrine and Covenants 1,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/1?lang=eng
5,DC_2,Doctrine and Covenants 2,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/2?lang=eng
6,DC_3,Doctrine and Covenants 3,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/3?lang=eng
7,DC_4,Doctrine and Covenants 4,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/4?lang=eng
8,DC_5,Doctrine and Covenants 5,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/5?lang=eng
...,...,...,...
137,DC_134,Doctrine and Covenants 134,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/134?lang=eng
138,DC_135,Doctrine and Covenants 135,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/135?lang=eng
139,DC_136,Doctrine and Covenants 136,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/136?lang=eng
140,DC_137,Doctrine and Covenants 137,https://www.churchofjesuschrist.org/study/scriptures/dc-testament/dc/137?lang=eng


In [38]:
dc_sections.to_csv('dc_pk_and_links.csv',index=False)
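
To check that the key layout described above will actually support the kind of SQL join I have in mind, here is a small sketch using Python's built-in `sqlite3` with an in-memory database. Choosing SQLite is an assumption on my part, as are the table names `sections` and `verses`; the column names follow the csv files, and the three sample rows use the first three line counts from D&C 1.

```python
import sqlite3

# In-memory database so nothing is written to disk
conn = sqlite3.connect(":memory:")

# One row per section, keyed by the section-level primary key
conn.execute("CREATE TABLE sections (pk TEXT PRIMARY KEY, title TEXT)")

# One row per verse, pointing back to its section through foreign_key
conn.execute(
    """
    CREATE TABLE verses (
        pk TEXT PRIMARY KEY,
        foreign_key TEXT REFERENCES sections(pk),
        verse_number INTEGER,
        num_lines INTEGER
    )
    """
)

conn.execute("INSERT INTO sections VALUES ('DC_1', 'Doctrine and Covenants 1')")
conn.executemany(
    "INSERT INTO verses VALUES (?, ?, ?, ?)",
    [
        ("DC_1_1", "DC_1", 1, 4),
        ("DC_1_2", "DC_1", 2, 4),
        ("DC_1_3", "DC_1", 3, 3),
    ],
)

# Join verses to their section and total up verses and lines per section
summary = conn.execute(
    """
    SELECT s.pk, COUNT(v.pk) AS num_verses, SUM(v.num_lines) AS total_lines
    FROM sections AS s
    JOIN verses AS v ON v.foreign_key = s.pk
    GROUP BY s.pk
    """
).fetchall()
conn.close()

print(summary)
```

The same join is what would supply the per-section totals (number of verses and number of lines) that I want in the primary key sheet.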