### 1 Jan 2025

The purpose of this notebook is to begin making improvements on my General Conference Breakdown program. 

Things I need to fix include the following: 

Functionality issues:
* the calculator is printing out days after the readthrough is supposed to be finished. Currently, on these days, it prints out the start point as the paragraph after the final paragraph of the talk, and the end point as the final paragraph of the talk. 
* when exporting to a csv, special characters (like ones with accent marks) seem to be saving in something akin to unicode.
* consider also saving the total number of paragraphs in the final breakdown, and maybe use short role instead of role.
* make sure code can handle handle different and incorrect inputs (such as different date formats or just putting the wrong thing in as input)
* ~~descriptions of photographs are being counted as paragraphs, throwing off the count.~~ Corrected **24 Feb 2024**
* ~~some talks ended up having the wrong number of counted paragraphs - this could be related to the photograph descriptions, or to something else in the code.~~ Corrected **24 Feb 2024**

Efficiency issues:
* adjust function to save conference talks as their own information and draw information about them from it's memory, not export then re-import them. 
* adjust function to only drive to the web one time, rather than driving to the web to get information about the conference as a whole, then to navigate to it a second time to get information about each of the talks (so include the work done in the `get_talks` functions in the initial driver navigation)
* remove unnecessary debugging print lines, implement more informative and useful ones until they are no longer needed. 

Later features:
* enable the function to create a breakdown regardless of any information being missing, like start or end date, number of readthroughs, etc. 
* enable the function to be able to account for "break days"

As I work through different issues, I will use the ~~strikethrough~~ markdown to indicate that an issue in the list above has been resolved. I will also include documentation in the markdowns below explaining how I approached fixing the issues. 


Before starting with anything else, though, I needed to first import the necessary packages and then divide the calculator function into separate parts or stages. This will help me handle things at different steps of the process without having to go through the whole breakdown process every time, and enable me to identify easily when and where my code is getting choked up. 

In [1]:
# import necessary packages
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import glob
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import re
from datetime import datetime, timedelta

### 2 Jan 2025

In the cell below, I tested the old calculator to ensure it was working properly. 

In [None]:
def breakdown_gc():
    """
    This function takes all of the webscraping, cleaning, and manipulating I have done in other notebooks, and combines the entire process into one huge function that asks for significantly more user input. 
    """


    ##### Setup #####
    


    # import necessary packages
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    import os
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options
    import glob
    from selenium.webdriver.chrome.service import Service as ChromeService
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from webdriver_manager.chrome import ChromeDriverManager
    import re
    from datetime import datetime, timedelta

    # set working directory for accessing and saving files
    os.chdir('d:\\Faith and Religion Stuff\\Come, Follow Me\\come-follow-me-breakdown-builder')

    # ask for user input for link
    conf_link = input('Please paste the link to the landing page of the conference you would like to breakdown.')

    # ask for user input for year and month of conference
    month = input('Please enter whether the conference was held in April or October: ').lower().replace('il','',1).replace('ober','',1)
    year = input('Please enter the year of the conference: ').strip().replace('20','',1)
    month_year = month + year

    # ask for user input for the start date and end dates
    ini_start = str(input("What day would you like to start reading?"))
    ini_end = str(input("What day would you like to stop reading?"))

    # Convert user input into datetime objects
    ini_start_date = pd.to_datetime(ini_start,format='%m/%d/%Y')
    ini_end_date = pd.to_datetime(ini_end,format='%m/%d/%Y')
    
    # Get the total number of days for the breakdown plan
    ini_total_days = ini_end_date - (ini_start_date - timedelta(days=1))
    
    
    
    ##### Getting information such as author, title, and description of each link on the conference landing page. #####



    # get response
    response = requests.get(conf_link)

    # Define the path to the chromedriver executable
    chrome_driver_dir = r'D:\\Faith and Religion Stuff\\Come, Follow Me\\chromedriver-win64'
    chrome_driver_path = os.path.join(chrome_driver_dir, 'chromedriver.exe')

    # Set up the headless browser options
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920x1080")

    # Set up the Chrome service
    service = Service(chrome_driver_path)    
        
    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Establish a try loop that tries to navigate to the provided link and find and store bits of info that we need 
    try:
        # Navigate to the page with your elements - in this case the April 2024 General Conference
        driver.get(conf_link)

        # Find all elements with the specified class name
        # Gotta use dots, not spaces, here because CSS considers each of those spaces to be defining a dif class object
        elements = driver.find_elements(By.CSS_SELECTOR, 'a.sc-omeqik-0.ewktus.list-tile.listTile-WHLxI')  

        # Initialize a list to store authors, titles, descriptions, and links
        primary_meta_list = []
        title_list = []
        description_list = []
        href_list = []

        # Iterate over each element
        # This for loop will, for all the videos/links to talks on the 2024 General Conference page, run through each of the following operations before moving onto the next
        for element in elements:
            # Try to get the author
            try:
                # Finds and stores the primary meta element (which is the author of the talk or report)
                primary_meta_element = element.find_element(By.CSS_SELECTOR,'p.primaryMeta')
                # Saves the stored author information as text
                primary_meta = primary_meta_element.text
            # If there is no author, save the author as None or Null
            except:
                primary_meta = None
            # Adds the author (or the None) to the list of authors in the appropriate row
            primary_meta_list.append(primary_meta)

            # Try to get the title - every link/video should have a title
            try:
                # Finds and stores the title element (the title of the video, talk, or report)
                title_element = element.find_element(By.CSS_SELECTOR,'p.title')
                # Saves the stored title as text
                title = title_element.text
            # If there is no title, save the title as None - THIS SHOULD NEVER BE THE CASE
            except:
                title = None
            # add the title (or the None) to the list of titles in the appropriate row
            title_list.append(title)

            # Try to get the description - the summary blurb about the video, talk, or report
            try:
                # Finds and stores the description element (the title of the video, talk, or report)
                description_element = element.find_element(By.CSS_SELECTOR,'p.description')
                # Saves the stored description as text
                description = description_element.text
            # If there is no description, save it as None - THIS SHOULD NEVER BE THE CASE
            except:
                description = None
            # add the title (or the None) to the list of descriptions in the appropriate row
            description_list.append(description)

            # Finds and stores the link (or href) to the video, talk, or report
            # This is ultimately going to be the information we use later to get the lengths (in paragraphs) of the talks and the lengths (in lines) of each of those paragraphs
            href = element.get_attribute('href')
            # Adds the stored href to the list of hrefs in the appropriate row
            href_list.append(href)

        # Creates a dataframe to store all the found and stored lists together
        ini_conf_df = pd.DataFrame({
            'Author': primary_meta_list,
            'Title': title_list,
            'Description': description_list,
            'Link': href_list
        })

    # If anything doesn't work for some reason, tell why
    except Exception as e:
        print(f"An error occurred: {e}")

    # After running everything, close the driver we opened to collect the data
    finally:
        # Close the browser
        driver.quit()



    ##### Removing things we don't need. #####
    


    # initialize empty list of rows that need to be dropped
    rows_to_drop = []

    # Adds the indexes (or row numbers) of rows to the list of rows to be dropped if there is either no Author or Description
        ## This exclusionary list is easy to edit
    for index,row in ini_conf_df.iterrows():
        if row['Author'] == None:
            rows_to_drop.append(index)
        elif row['Description'] == None:
            rows_to_drop.append(index)
        elif 'Sustaining' in row['Title']:
            rows_to_drop.append(index)
        elif 'Audit' in row['Title']:
            rows_to_drop.append(index)
    
    # Drops the rows in the list of rows to drop from the dataframe and resets the index
        ## This eliminates from the dataframe the session videos and the sustaining of the officers of the Church
    conf_df_1 = ini_conf_df.drop(rows_to_drop).reset_index(drop=True)
    
    
    
    ##### Getting additional information about each talk. ##### 



    # Define function for getting the total number of lines all talks 
        ## This function uses the urls stored in the dataframe
    def get_total_lines(url):
        """
        This function was designed specifically to run using an already active webdriver to gather paragraph and line length information about a general conference talk.
        First, it uses a webdriver to navigate to a url and then finds the paragraphs within a body block, and gets the size of the rectangles within which each of those paragraphs are assigned to appear. 
        Then, calculates the height of each line, saves that number as an integer, and calculates how many of those lines would fit into the assigned rectangle. 
        Then, it adds the paragraph number and the number of lines in that paragraph to the previously created dataframe.
        Finally, it calculates and returns the total number of lines in the talk by getting the sum of all paragraph lengths in lines. 
        """
        # initialize empty dataframe "data_list", with columns "paragraph" and "lines" being initially populated with NA values
        data_list = pd.DataFrame()
        data_list['paragraph'] = pd.NA
        data_list['lines'] = pd.NA

        # Find all elements containing the text
        paragraphs = driver.find_elements(By.CSS_SELECTOR, '.body-block p')

        # Iterate over each paragraph element
        for index, paragraph in enumerate(paragraphs, start=1):
            # Log paragraph number, since the paragraphs are not numbered. 
            paragraph_number = index

            # Get the bounding rectangle of the element
            rect = paragraph.rect

            # Calculate line height
            line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", paragraph)
            line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

            # Calculate number of lines
            num_lines = rect['height'] // line_height_numeric

            # Append data dictionary to list
            data_list.at[index,'paragraph'] = paragraph_number
            data_list.at[index, 'lines'] = num_lines

            total_lines = sum(data_list['lines'])
        
        return total_lines
    
    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

    # Copy conf_df_1 to create a reversion point if necessary (it won't be)
    conf_df_2 = conf_df_1.copy()
    
    # Initialize columns in the dataframe with NA values to later be filled
    conf_df_2['time'] = pd.NA
    conf_df_2['paragraphs'] = pd.NA
    conf_df_2['lines'] = pd.NA
    conf_df_2['role'] = pd.NA

    # iterate the following over each row in the apr_2024_df dataframe
    for index, row in conf_df_2.iterrows():
        # for each row, when the function calls for title, url, and author it is looking for the Title, Link, and Author columns in that row, respectively
        title = row['Title']
        url = row['Link']
        author = row['Author']

        # run the driver, navigating to the linked page in the row currently being worked on
        driver.get(url)

        # Wait for the page to load completely
        driver.implicitly_wait(10)

        # Simulate clicking the play button using the class attribute
        try:
            play_button = driver.find_element(By.CSS_SELECTOR, "button.sc-1g7hsbc-0.lcWZjw.sc-bvqtyr-4.eYHLNi")
            play_button.click()
            print(f"Clicked the play button for {title} to start the media.")
        except:
            # ### Added - attempt to find alternately labeled play button
            print("Trying alternative play button selector.")
            try:
                play_button = driver.find_element(By.CSS_SELECTOR, "button.sc-1g7hsbc-0.bCKkuP.sc-bvqtyr-3.eKGiZd")
                play_button.click()
                print(f"Clicked the play button for {title} to start the media.")
            except:
                print(f"Play button for {title} not found.")

        # Wait for the video element to be present in the DOM
        try:
            video_element = WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.TAG_NAME, 'video'))
            )
            print("Video element found in the DOM.")
        except:
            print("No video element found.")

        # Wait for a short period to allow the video to start loading
        time.sleep(2)

        # Attempt to retrieve the video duration using JavaScript
        try:
            video_duration = driver.execute_script("""
                let video = document.querySelector('video');
                if (video) {
                    console.log('Video element is present, checking duration...');
                    return video.duration;
                } else {
                    let audio = document.querySelector('audio');
                    if (audio) {
                        console.log('Audio element is present, checking duration...');
                        return audio.duration;
                    }
                }
                return null;  // No media element found
            """)
            
            # if video_duration exists
            if video_duration:
                # print a message saying how long the talk is in seconds
                print(f"{title} duration: {video_duration:.2f} seconds")
                # save the duration into the dataframe in the same row
                conf_df_2.at[index, 'time'] = video_duration
            # otherwise, print a message saying no video or audio element was found for the talk
            else:
                print(f"No video or audio element found for {title}.")
        
        # If there is an error, say there was an error and what it was, and try to get the next piece of information       
        except Exception as e:
            print(f"Error retrieving video duration for {title}: {e}")

        # attempt to find the body block
        try:
            paragraphs = driver.find_elements(By.CSS_SELECTOR, '.body-block p')
            # if body block is found, find the number of paragraphs, and save that number to the dataframe in the same row
            conf_df_2.at[index, 'paragraphs'] = len(paragraphs)
            # print a message giving the length of the talk in paragraphs
            print(f"Paragraph length of {title}: {len(paragraphs)} paragraphs.")

            # Use the get_total_lines function to get the total number of lines in the talk
            num_lines = get_total_lines(url)
            
            # save the number of lines to the dataframe in the same row
            conf_df_2.at[index, 'lines'] = num_lines
            
            # print a message telling the number of lines in the talk
            print(f"Line length of {title}: {num_lines} lines.")

        # if there is an error or a problem, print a message saying what the problem was, and try to get the next piece of information 
        except Exception as e:
            print(f"Error calculating lines and paragraphs for {title}: {e}")
        
        # try to find the author role
        try:
            role = driver.find_element(By.CLASS_NAME, 'author-role')
            # if there is one, save it to the dataframe in the same row
            conf_df_2.at[index, 'role'] = role.text
            # print a message displaying the role of the author
            print(f"Role of {author}: {role.text}")
        # if there is an error or a problem, print a message saying what the problem was and then move onto the next row
        except Exception as e:
            print(f"Error retrieving role for {author}: {e}")

    # Close the browser after all rows have been iterated through
    driver.quit()


    # converts all numeric columns to integers for easier use later
    conf_df_2['time'] = conf_df_2['time'].astype(int).round(0)
    conf_df_2['paragraphs'] = conf_df_2['paragraphs'].astype(int)
    conf_df_2['lines'] = conf_df_2['lines'].astype(int)



    ##### Establishing a primary key column and getting read and day weights
    


    # copy conf_df_2 to establish a reversion point
    conf_df_3 = conf_df_2.copy()

    # copies the role column onto a newly created short_role column
    conf_df_3['short_role'] = conf_df_3['role']

    # initializes a replacement dictionary to shorten information in newly created 'short_role' column
    rep_dict = {}

    # Adds specific shortenings of each role to the replacement dictionary
        ## this list is also easily editable if any other office becomes prominently represented in future conferences
        ## this list also puts members of the Presidency of the Seventy and of any other member of any other Quorum of the Seventy on equal ground       
    for index, row in conf_df_3.iterrows():
        if 'President of The Church'in row['role']:
            rep_dict[row['short_role']] = 'President of the Church'
        elif 'First Presidency' in row['role']:
            rep_dict[row['short_role']] = 'First Presidency'
        elif 'Quorum of the Twelve' in row['role']:
            rep_dict[row['short_role']] = 'Quorum of the Twelve'
        elif 'the Seventy' in row['role']:
            rep_dict[row['short_role']] = 'Seventy'
        elif 'Relief Society' in row['role']:
            rep_dict[row['short_role']] = 'Relief Society Presidency'
        elif 'Presiding' in row['role']:
            rep_dict[row['short_role']] = 'Presiding Bishopric'
        elif 'Sunday School' in row['role']:
            rep_dict[row['short_role']] = 'Sunday School Presidency'
        elif 'Young Men' in row['role']:
            rep_dict[row['short_role']] = 'Young Men Presidency'
        elif 'Young Women' in row['role']:
            rep_dict[row['short_role']] = 'Young Women Presidency'
        elif 'Primary' in row['role']:
            rep_dict[row['short_role']] = 'Primary Presidency'
        else:
            rep_dict[row['short_role']] = 'other speakers'
    
    # uses replacement dictionary to replace (shorten) all the entries in the short_role column
    for words, replacement in rep_dict.items():
        conf_df_3['short_role'] = conf_df_3['short_role'].replace(words, replacement).str.strip()
    
    # defines function to get the initials of the speaker for use in creation of primary key column
    def get_initials(full_name):
        parts = full_name.split()
        initials = [part[0].lower() for part in parts]
        return ''.join(initials)

    # Create a new column with initials
    conf_df_3['initials'] = conf_df_3['Author'].apply(get_initials)

    # create a primary key column that combines the initials of the speaker and the month and year of the conference
    conf_df_3['pk'] = (conf_df_3['initials'] + "_" + month_year)

    # initialize an empty list of read weights
    read_weights = []

    # for every unique role code in the role_code column of the apr_2024_info dataframe...
    for short_role in conf_df_3.short_role.unique():
        # ... ask the user what the read weight should be and...
        read_weight = int(input(f"How many times would you like to read talks given by the {short_role}?"))
        # ... save both the role code and the read weight to the read_weights list
        read_weights.append({'short_role':short_role, 'read_weight':read_weight})

    # convert the read_weights list to a dataframe, save with the same name to replace the old item
    read_weights = pd.DataFrame(read_weights)

    # left-merge the read_weights dataframe to the apr_2024_info dataframe useing the role_code columns as a guide for merging
    # left-merge keeps everything in the dataframe being merged to, and only merges data from the second dataframe that has a corresponding value in the original dataframe
    conf_df_3 = conf_df_3.merge(read_weights, on='short_role',how='left')

    # initialize an empty list of day weights
    day_weights = []

    # for every unique role code in the role_code column of the apr_2024_info dataframe...
    for short_role in conf_df_3.short_role.unique():
        # ... ask the user what the day weight should be and...
        day_weight = int(input(f"How many more or fewer days would you like to spend on talks given by the {short_role}?\n"
                               f"\nIf you want to spend more days reading talks from the {short_role}, enter a number above 0.\n"
                               f"\nOr if you want to spend fewer days reading talks from the {short_role}, enter a number below 0 by using a minus sign or dash.\n"
                               f"\nIf you would rather spend a relatively the same amount of time on each talk from this organization as others, enter 0."))
        # ... save both the role code and the day weight to the read_weights list
        day_weights.append({'short_role':short_role, 'day_weight':day_weight})

    # convert the read_weights list to a dataframe, save with the same name to replace the old item
    day_weights = pd.DataFrame(day_weights)

    # left-merge the read_weights dataframe to the apr_2024_info dataframe useing the role_code columns as a guide for merging
    # left-merge keeps everything in the dataframe being merged to, and only merges data from the second dataframe that has a corresponding value in the original dataframe
    conf_df_3 = conf_df_3.merge(day_weights, on='short_role',how='left')
    


    ##### Getting information about each talk



    # save a copy of conf_df_3 as a reversion point
    conf_df_4 = conf_df_3.copy()

    # define a function that takes a link and gets the text and counts the lines of text of each talk given in the linked conference
    def get_talks(talk_link):
        """
        This function finds the description and text of a talk found at the talk link, numbers each paragraph of that talk and gets the text and length in lines of each paragraph. It returns as a dataframe all this information about each talk. 
        """

        # Define the path to the chromedriver executable
        chrome_driver_dir = r'D:\\Faith and Religion Stuff\\Come, Follow Me\\chromedriver-win64'
        chrome_driver_path = os.path.join(chrome_driver_dir, 'chromedriver.exe')

        # Set up the headless browser options
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--window-size=1920x1080")

        # Set up the Chrome service
        service = Service(chrome_driver_path)    
        
        # Initialize the Chrome WebDriver
        driver = webdriver.Chrome(service=service, options=chrome_options)

        # Run the driver
        driver.get(talk_link)

        # Initialize a list to store data dictionaries
        data_list = []

        # Try to get the description - the summary blurb about the video, talk, or report
        try:
            # Finds and stores the description (kicker) element (the title of the video, talk, or report)
            description_element = driver.find_element(By.CSS_SELECTOR,'p.kicker')
            # Saves the stored description as text
            description = description_element.text

            # set paragraph number for description as 0
            paragraph_number = 0
            
            # Get the bounding rectangle of the element
            rect = description_element.rect

            # Calculate line height
            line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", description_element)
            line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

            # Calculate number of lines
            num_lines = rect['height'] // line_height_numeric

            # add text of and information about description to the data_list dictionary
            data_list.append({
                'paragraph_number': paragraph_number,
                'num_lines': num_lines,
                'text': description
            })

        # If there is no description, save it as None - THIS SHOULD NEVER BE THE CASE
        except:
            description = None

        # Find all elements containing the text
        paragraphs = driver.find_elements(By.CSS_SELECTOR, '.body-block p')

        # Iterate over each paragraph element
        for index, paragraph in enumerate(paragraphs, start=1):
            # Get the text of the element
            text = paragraph.text

            # Log paragraph number, since the paragraphs are not numbered. 
            paragraph_number = index

            # Get the bounding rectangle of the element
            rect = paragraph.rect

            # Calculate line height
            line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", paragraph)
            line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

            # Calculate number of lines
            num_lines = rect['height'] // line_height_numeric

            # Append data dictionary to list
            data_list.append({
                'paragraph_number': paragraph_number,
                'num_lines': num_lines,
                'text': text
            })

        # Convert list of dictionaries to DataFrame
        df = pd.DataFrame(data_list)

        # Close the browser
        driver.quit()
        return df
    
    # Define directory path for saving CSV files
    dir_path = input(f'Please paste here the location of the folder in which you would like to store information from this general conference.\n'
                     f'\nExample:    D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\April 2024 GC Talks\n'
                     f'\nThis will require you to have already created a folder in which you want the information for the talks to be saved.')

    # establish a loop that iterates through every row of the conference dataframe
    for index, row in conf_df_4.iterrows():
        # save information from the 'Title' row as title
        title = row['Title']
        # save information from the 'Link' row as link
        link = row['Link']
        # save information from the 'pk' Primary Key row as foreign_key
        foreign_key = row['pk']

        # using the saved link, create a dataframe that contains the paragraph line counts information for the talk in the current row 
        talk_lines_df = get_talks(link)

        # if the created dataframe is not populated with None data and is not empty:
        if talk_lines_df is not None and not talk_lines_df.empty:
            # creates a new column in the dataframe that uses the primary key of the talk as the foreign key
            talk_lines_df['foreign_key'] = foreign_key
            # Uses a lambda function to create a primary key for each paragraph consisting of the foreign key + the paragraph number
            talk_lines_df['pk'] = talk_lines_df.apply(lambda x:f"{foreign_key}_{x.get('paragraph_number')}", axis=1)
        
            # very rudimentarily define which columns to keep, and add column names to that list in the desired order
            columns_to_keep = ['foreign_key','pk']
            columns_to_keep.append('paragraph_number')
            columns_to_keep.append('text')
            columns_to_keep.append('num_lines')

            # save dataframe with columns in the order specified in the columns_to_keep list
            talk_lines_df = talk_lines_df[columns_to_keep]
            
            # save file name
            csv_filename = f'{foreign_key}_lines.csv'
            # combine the file name and the user inputted folder location to create a complete save path
            full_path = os.path.join(dir_path,csv_filename)

            # Debugging line to state where files can be found
            print(f'Saving to: {full_path}')

            # export dataframe as a csv file to the location specified
            talk_lines_df.to_csv(full_path, index = False)
            # print message saying that csv file has been created
            print(f'CSV file for "{title}" saved successfully as {csv_filename}.')
        
        # otherwise, if the dataframe is filled with None values or is empty
        else:
            # print a message saying no data was found for the talk
            print(f'No data found for "{title}", skipping CSV creation.')
    
    ### stupidly import data I just exported because I don't have the bandwidth to come up with another solution and want to go to bed ###

    # define the beginning of the file location
        ## this is done by accessing the dir_path given by the user, and adding \\ to the end of it.
    path_start = f'{dir_path}\\'

    # initialize filenames list
    csv_files = []

    # get the names of all the csv files in the directory
    for file in os.listdir(path_start):
        if file.endswith(".csv"):
            csv_files.append(file)

    # initialize a dictionary to store the dataframes
    all_talks_dict = {}

    # import the csv files into pandas dataframes, store each dataframe in the dictionary
    for file in csv_files:
        talk = file[:-4]
        all_talks_dict[talk] = pd.read_csv(os.path.join(path_start, file))
        print(f'file string: {file}\n'
            f'talk string: {talk}')

    # iterate over every dataframe stored in the all_talks_df dictionary
    for talk, df in all_talks_dict.items():
        # create a new column in each dataframe that is the cumulative sum of the number of lines
        df['running_lines'] = df['num_lines'].cumsum()


    
    ##### Assigning a number of days for each readthrough of each talk #####


    
    # Since conf_df_4 was not altered in anyway during the previous major step, we don't need to save a copy

    # Get some information about the conference as a whole
    total_time = sum(conf_df_4['time'])
    total_lines = sum(conf_df_4['lines'])
    
    # create a new conference consumption column that gives a proportional weight to each talk based on it's length and user input
    conf_df_4['conf_cons'] = (
        ((1/32) +                                                                  # Each talk is 1 of 32 given, this treats each equally
        conf_df_4['time']/total_time +                                             # time weight - longer "heavier"
        conf_df_4['lines']/total_lines +                                           # lines weight - longer "heavier"
        ((conf_df_4['day_weight'] + 1)/(conf_df_4['day_weight'] + 1).sum()))       # preference weight - user input factors in here
        / 4                                                                        # Adding each of those and then dividing by 4 gets the average
    )
    
    # create a column with the total number of days to be spent on each talk
        ## multiply the number of days specified in the plan by the conference consumption ratio
    conf_df_4['tot_num_days'] = ini_total_days.days * conf_df_4['conf_cons']
    conf_df_4['tot_num_days'] = conf_df_4['tot_num_days'].round()

    # convert the newly created column into integers rather than floats
    conf_df_4['tot_num_days'] = conf_df_4['tot_num_days'].astype(int)

    # find and save the highest number in the read_weights column
    max_reads = conf_df_4.read_weight.max()

    # start a loop that, for every number between 1 and whatever the max_reads number is, inclusive...
    for i in range(1,max_reads+1):
        # create a new column of NA values titled "Readthrough # _(whatever number the loop is on)_"
        conf_df_4[f"Readthrough #{i}"] = pd.NA
        
    # convert all NA values to "0"
    conf_df_4.fillna(0, inplace=True)

    def distribute_days(conf_df):
        """
        This function takes a dataframe like the one I have crafted above and distributes the total number of days into the "Readthrough #_" columns.
        """
        # establish that the function needs to repeat for every row of the dataframe
        for index, row in conf_df.iterrows():
            # get total number of days for that talk
            total_days = row['tot_num_days']
            # initialize number of distributed days as 0
            dist_days = 0
            # establish that the function needs to proceed with the following operation until dist_days and total_days are equal
            while dist_days < total_days:
                # for every whole number between 1 and whatever the read_weight (or number of readthroughs) is...
                for i in range(1,row['read_weight']+1):
                    # if dist_days is still less than total_days...
                    if dist_days < total_days:
                        # add 1 to whatever value is in the "Readthrough #(number between 1 and number of readthroughs)" column and...
                        conf_df.at[index, f'Readthrough #{i}'] += 1
                        # add 1 to dist_days
                        dist_days += 1
                        # go back to add 1 to the next column until dist_days is no longer less than total_days
                    # if/when dist_days is equal to total_days
                    else:
                        # break the process of adding one to each column, and move on to the next row to start the process over
                        break
        # when everything is done, the output of this function is the same dataframe with all of the updated columns
        return conf_df

    # run the function on my dataframe
    conf_df_5 = distribute_days(conf_df_4)



    ##### Distributing lines from each talk across each day of each readthrough #####

    # Use a series of loops to create a line start, number of lines, line end, paragraph start, and paragraph end column for every day of every readthrough of every talk
    for i in range(1, conf_df_5['read_weight'].max()+1):
        for x in range(1, conf_df_5[f'Readthrough #{i}'].max()+1):
            for index, row in conf_df_5.iterrows():
                if row[f'Readthrough #{i}'] != 0:
                    conf_df_5.at[index, f'r{i}d{x}_l_start'] = int(0)
                    conf_df_5.at[index, f'r{i}d{x}_lines'] = int(0)
                    conf_df_5.at[index, f'r{i}d{x}_l_end'] = int(0)
                    conf_df_5.at[index, f'r{i}d{x}_p_start'] = int(0)
                    conf_df_5.at[index, f'r{i}d{x}_p_end'] = int(0)
                else:
                    conf_df_5.at[index, f'r{i}d{x}_l_start'] = pd.NA
                    conf_df_5.at[index, f'r{i}d{x}_lines'] = pd.NA
                    conf_df_5.at[index, f'r{i}d{x}_l_end'] = pd.NA
                    conf_df_5.at[index, f'r{i}d{x}_p_start'] = pd.NA
                    conf_df_5.at[index, f'r{i}d{x}_p_end'] = pd.NA
    
    # establishes a regular expression pattern
    pattern = r'\D\d\D\d+'
    # identifies columns that are floats and not integers
    float_cols = conf_df_5.select_dtypes(include=['float'])
    # targets specific float columns using regular expression pattern
    change_cols = [col for col in float_cols if re.search(pattern, col)]
    # converts targeted columns to integers
    conf_df_5[change_cols] = conf_df_5[change_cols].astype(int)

    # I struggled for about 5 hours because I forgot this step. Including it was as simple as a copy paste.
    # This also came in answer to my prayer for help. I prayed for God to help me know where to look, and this was the next thing I looked at. 
    # God is good. 
    
    def distribute_lines(conf_df):
        """
        This function takes a dataframe like the one I have crafted above and distributes the total number of days into the "Readthrough #_" columns.
        """
        # initialize readthrough count as zero
        readthrough = 0
        # establish maximum number of readthroughs so the loop I create below knows when to end or stop
        max_readthroughs = conf_df['read_weight'].max()
        # start a loop of action that will continue until readthroughs is bigger than max_readthroughs, at which point it will stop
        while readthrough <= max_readthroughs:
            # A - first action: add 1 to readthroughs, establishing which readthrough we are working with
            readthrough += 1
            # B - Check if the current readthrough exists in the DataFrame
            if f'Readthrough #{readthrough}' not in conf_df.columns:
                break  # If the column doesn't exist, exit the loop

            # C - second action: for every row in the dataframe do the following:
            for index, row in conf_df.iterrows():
                # D - second action, cont'd: save the info in the row 'lines' as the total number of lines to be distributed
                total_lines = row['lines']
                # E - second action, cont'd: initialize count of distributed lines as zero
                dist_lines = 0
                # debugging print line - remove later
                print(f"Row {index} - Readthrough {readthrough} has {row[f'Readthrough #{readthrough}']} days.")
                # F - check whether the number of days assigned to a talk in a particular readthrough is 0
                if row[f'Readthrough #{readthrough}'] == 0:
                    # debugging print line - remove later
                    print(f"Skipping row {index} - Readthrough {readthrough} because it has 0 days.")
                    # if that talk has zero days alloted for that readthrough, skip to the next talk
                    continue
                
                # G - second action, cont'd: open another while loop that will continue until the number of distributed lines is equal to the number of total lines
                while dist_lines < total_lines:
                    # H - first action of second loop: open another loop that operates i number of times, where i is the number of days in the readthrough
                    for i in range(1, row[f'Readthrough #{readthrough}']+1):
                        # debugging print line - remove later
                        print(f"Processing row {index}, readthrough {readthrough}, day {i}.")
                        # I - first action of second loop, cont'd: add 1 to the corresponding i day of the current readthrough
                        conf_df.at[index, f'r{readthrough}d{i}_lines'] += 1
                        # J - first action of second loop, cont'd: add 1 to the number of distributed lines
                        dist_lines += 1
                        # debugging print line - remove later
                        print(f"Distributed lines: {dist_lines}/{total_lines}")
                        # K - check if dist_lines is less than total_lines: 
                        if dist_lines >= total_lines:
                            break
                            # Unstated action: If it is less, return to point H and repeat this loop.
                            # If not, end *this* loop and return to point D for the next row.
                    
            # L - check if readthroughs is less than or equal to max_readthroughs
            if readthrough > max_readthroughs:
                # If it is, repeat this loop, starting with point A
                break
                # Unstated action: If not, end this loop and go to point M.
        
        # M - return the newly modified dataframe
        return conf_df
    
    conf_df_5 = distribute_lines(conf_df_5)
    
    ##### Assigning lines and paragraphs for each day of each readthrough. Final Major Step. #####


    # save a copy of conf_df_5 as a reversion point
    conf_df_6 = conf_df_5.copy()
    

    def get_paragraphs(conf_df,talks_dictionary):
        """
        This function assigns starting and ending lines and paragraphs for each day of each readthrough of each talk contained in the conference dataframe.
        """    
        # initialize readthrough count as 0
        rt = 0
        # establish maximum number of readthroughs so the loop I create below knows when to end or stop
        max_rts = conf_df['read_weight'].max()
        # create a list of the column names in the given dataframe
        cols = list(conf_df.columns)
        # start a loop of action that will continue until readthroughs is bigger than max_readthroughs, at which point it will stop
        while rt <= max_rts:
            # A - first action: add 1 to readthroughs, establishing which readthrough we are working with
            rt += 1
            # debugging line - remove later
            print(f'Starting readthrough {rt}.')
            # B - Check if the current readthrough exists in the DataFrame
            if f'Readthrough #{rt}' not in conf_df.columns:
                # debugging line - remove later
                print(f'Readthrough {rt} not found. Function complete.')
                # if a column for the current readthrough number doesn't exist, exit the loop
                break

            # C-1 - Second action: start another loop that does the follow for the highest number of days in whatever readthrough number the function is on
            for day in range(1, conf_df[f'Readthrough #{rt}'].max()+1):
                
                # debugging line - remove later
                print(f'Working through readthrough {rt} day {day}.')

                # C-2 - set patterns for pulling out info for the start, lines, end, and start of next day
                rt_start_l_pattern = fr'r{rt}d{day}_l_start'
                rt_lines_pattern = fr'r{rt}d{day}_lines'
                rt_end_l_pattern = fr'r{rt}d{day}_l_end'
                rt_next_start_l_pattern = fr'r{rt}d{day+1}_l_start'
                rt_start_p_pattern = fr'r{rt}d{day}_p_start'
                rt_end_p_pattern = fr'r{rt}d{day}_p_end'
                rt_next_start_p_pattern = fr'r{rt}d{day+1}_p_start'

                # debugging line - remove later
                print(f'Readthrough {rt} day {day} patterns saved.')

                # C-3 - initialize variables as empty lists
                start_l_col = []
                lines_col = []
                end_l_col = []
                next_start_l_col = []
                start_p_col = []
                end_p_col = []
                next_start_p_col = []

                # debugging line - remove later
                print(f'Variables initialized.')

                # C-4 - search for and save columns defined in patterns
                for col in cols:
                    if re.search(rt_start_l_pattern, col):
                        start_l_col.append(col)
                    elif re.search(rt_lines_pattern, col):
                        lines_col.append(col)
                    elif re.search(rt_end_l_pattern, col):
                        end_l_col.append(col)
                    elif re.search(rt_next_start_l_pattern, col):
                        next_start_l_col.append(col)
                    elif re.search(rt_start_p_pattern, col):
                        start_p_col.append(col)
                    elif re.search(rt_end_p_pattern, col):
                        end_p_col.append(col)
                    elif re.search(rt_next_start_p_pattern, col):
                        next_start_p_col.append(col)

                # debugging line - remove later
                print(f'Columns found and saved.'
                    f'Starting column name: {start_l_col}'
                    f'Number of lines column: {lines_col}'
                    f'End column name: {end_l_col}'
                    f'Next start column name: {next_start_l_col}'
                    f'Starting column name: {start_p_col}'
                    f'Number of lines column: {lines_col}'
                    f'End column name: {end_p_col}'
                    f'Next start column name: {next_start_p_col}')
                
                # C-5-a -Start another loop
                for index, row in conf_df.iterrows():
                    # C-5-b - check if the number of days assigned for the current readthrough of the current talk is 0
                    if row[f'Readthrough #{rt}'] == 0:
                        # debugging print line - remove later
                        print(f"Skipping row {index} - Readthrough {rt} because it has 0 days.")
                        # if that talk has zero days alloted for that readthrough, skip to the next talk
                        continue
                    
                    # unstated action - if the number of days assigned for the current readthrough is greater than zero, proceed to C-5-c

                    # C-5-c - establish the connection between conf_df and talks_dict using the primary key column of the conf_df
                    talk = talks_dictionary[f"{conf_df.loc[index,'pk']}_lines"]

                    # debugging line - remove later
                    print(f'Connecting conf_df to talks_dict using primary key {conf_df.loc[index,"pk"]}.')

                    # C-5-d - check if the day number is 1
                    if day == 1:
                        # C-5-d-1 - if so, initialize the start column as 1 for every row
                        conf_df[start_l_col] = 1
                        conf_df[start_p_col] = 1
                        # debugging line - remove later
                        print(f'Readthrough {rt} day {day} set at 1.')
                        # C-5-d-1 - then save the end point as the start point (1) plus the number of lines to be read 
                        # the minus 1 at the end ensures that we end at the assigned reading line, not the line after
                        for start, read, end in zip(start_l_col, lines_col, end_l_col):
                            conf_df.loc[index, end] = conf_df.loc[index, start] + conf_df.loc[index, read] - 1
                        # C-5-d-2 - then save today's ending point as the starting point for the next day, to be accessed later
                        for end, next_start in zip(end_l_col, next_start_l_col):
                            conf_df.loc[index, next_start] = conf_df.loc[index, end]
                        # C-5-d-3 - save the end line just calculated as object 'end_line'
                        end_line = conf_df.loc[index, f'r{rt}d{day}_l_end']

                        # debugging line - remove later
                        print(f'Readthrough {rt} day {day} end line set. End line: {end_line}')

                        # C-5-d-4 - find in the talk the paragraph with a running total that is greater than or equal to the end line 
                        end_paragraph_1 = talk.loc[talk['running_lines'] >= end_line].index[0]
                        # C-5-d-5 - find in the talk the paragraph before the one above
                        end_paragraph_2 = talk.loc[talk['running_lines'] >= end_line].index[-1]
                        # C-5-d-6 - determine which paragraph would yield a number of lines being read closer to the target end line and save as end_paragraph
                        if abs(talk.loc[end_paragraph_1,'running_lines'] - end_line) < abs(talk.loc[end_paragraph_2,'running_lines']- end_line):
                            end_paragraph = end_paragraph_1
                        else:
                            end_paragraph = end_paragraph_2
                        # C-5-d-7 - save the selected end_paragraph as the paragraph ending point of the current day of the current readthrough
                        for end_p in end_p_col:
                            conf_df.loc[index, end_p] = end_paragraph
                        
                        for end_p, next_start_p in zip(end_p_col, next_start_p_col):
                            conf_df.loc[index, next_start_p] = conf_df.loc[index, end_p] + 1

                        # debugging line - remove later
                        print(f'Readthrough {rt} day {day} end paragraph set: End paragraph: {end_paragraph}')
                        

                        
                    
                    # C-5-e - if the day number IS NOT 1...
                    else:
                        # C-5-e-1 - access whatever the start column has already been saved as, add the number of lines, and save that as the end point
                        for start, read, end in zip(start_l_col, lines_col, end_l_col):
                            conf_df.loc[index, end] = conf_df.loc[index, start] + conf_df.loc[index, read]
                        # C-5-e-2 - then save today's ending point as the starting point for the next day, to be accessed later
                        for end, next_start in zip(end_l_col, next_start_l_col):
                            conf_df.loc[index, next_start] = conf_df.loc[index, end]
                        # C-5-e-3 - save the end line just calculated as object 'end_line'
                        end_line = conf_df.loc[index, f'r{rt}d{day}_l_end']

                        # debugging line - remove later
                        print(f'Readthrough {rt} day {day} end line set. End line: {end_line}')

                        # C-5-e-4 - find in the talk the paragraph with a running total that is greater than or equal to the end line 
                        end_paragraph_1 = talk.loc[talk['running_lines'] >= end_line].index[0]
                        # C-5-e-5 - find in the talk the paragraph before the one above
                        end_paragraph_2 = talk.loc[talk['running_lines'] >= end_line].index[-1]
                        # C-5-e-6 - determine which paragraph would yield a number of lines being read closer to the target end line and save as end_paragraph
                        if abs(talk.loc[end_paragraph_1,'running_lines'] - end_line) < abs(talk.loc[end_paragraph_2,'running_lines']- end_line):
                            end_paragraph = end_paragraph_1
                        else:
                            end_paragraph = end_paragraph_2
                        # C-5-e-7 - save the selected end_paragraph as the paragraph ending point of the current day of the current readthrough
                        for end_p in end_p_col:
                            conf_df.loc[index, end_p] = end_paragraph

                        for end_p, next_start_p in zip(end_p_col, next_start_p_col):
                            conf_df.loc[index, next_start_p] = conf_df.loc[index, end_p] + 1

                        # debugging line - remove later
                        print(f'Readthrough {rt} day {day} end paragraph set: End paragraph: {end_paragraph}')

                # debugging print line - remove later
                print(f'Readthrough {rt} columns day {day} saved. Looping back.')
            
            # debugging print line - remove later
            print(f'Readthrough {rt} loop finished.')
        
        return conf_df
    
    # runs the get paragraphs function, saves as conf_df_7
    conf_df_7 = get_paragraphs(conf_df_6, all_talks_dict)

    
    ##### Setting up and exporting final breakdown #####
    
    # establishes patterns for getting a list of start and end columns
    start_cols_pat = r'r\d+d\d+_p_start'
    end_cols_pat = r'r\d+d\d+_p_end'

    # create a list of columns names in conf_df_7 
    cols = list(conf_df_7.columns)

    # initialize empty lists for start and end columns
    final_start_cols = []
    final_end_cols = []

    # look at all of the column names in the column names list
    for col in cols:
        # if the column name matches the pattern for start columns, add it to the list of start columns
        if re.search(start_cols_pat, col):
            final_start_cols.append(col)
        # if the column name matches the pattern for end columns, add it to the list of end columns
        elif re.search(end_cols_pat, col):
            final_end_cols.append(col)

    # initialize an empty list for zipping the other two together
        ## zipping two lists together basically entails combining them in the order of list 1 item 1, list 2 item 1, list 1 item 2, list 2 item 2, list 1 item 3, list 2 item 3, etc. 
    zipped_cols = []

    # create a loop that 'zips' the start columns list and end columns list together
    for start_col, end_col in zip(final_start_cols, final_end_cols):
        zipped_cols.append(start_col)
        zipped_cols.append(end_col)
    
    # create a list of necessary information columns
    info_cols = ['Author', 'role', 'Title']

    # combine the info_cols list and the zipped_cols list
    final_cols = info_cols + zipped_cols

    # keep only the columns in the final_cols list, save as final_breakdown
    final_breakdown = conf_df_7[final_cols]

    # Get user input about saving breakdown to computer
    response_1 = input('Do you want to save the breakdown to your computer? ')
    
    # if the user wants to export the breakdown
    if response_1.lower() == 'yes':
        # ask the user where they want to store it
        input_path = input(f'Please paste the location of the folder you would like to save the breakdown in: \n'
                           f'\nFor example: D:\Faith and Religion Stuff\Come, Follow Me')
        # add \\ to that path to make it compatible
        path_start = f'{input_path}\\'
        # ask the user if they want to use a custom name
        response_2 = input('Would you like to save the file with a custom name?')
        # if they do, ask the user for the custom name
        if response_2.lower() == 'yes':
            custom_name = input('Please enter the name you would like to save the file as: ')
            name = f'{custom_name}.csv'
        # otherwise generate a generic file name
        else:
            name = f'{month_year}_breakdown.csv'
        
        # combine the destination folder with the name of the file
        final_path = os.path.join(path_start, name)
        
        # export final_breakdown as a csv to the destination folder
        final_breakdown.to_csv(final_path, index=False)
        
    # Display the final breakdown for viewing in this notebook
    return final_breakdown


In [None]:
breakdown_gc()

Trying alternative play button selector.
Clicked the play button for The Triumph of Hope to start the media.
Video element found in the DOM.
No video or audio element found for The Triumph of Hope.
Paragraph length of The Triumph of Hope: 44 paragraphs.
Line length of The Triumph of Hope: 177 lines.
Role of Neil L. Andersen: Of the Quorum of the Twelve Apostles
Trying alternative play button selector.
Clicked the play button for Live Up to Your Privileges to start the media.
Video element found in the DOM.
No video or audio element found for Live Up to Your Privileges.
Paragraph length of Live Up to Your Privileges: 34 paragraphs.
Line length of Live Up to Your Privileges: 170 lines.
Role of Emily Belle Freeman: Young Women General President
Trying alternative play button selector.
Clicked the play button for God’s Favourite to start the media.
Video element found in the DOM.
No video or audio element found for God’s Favourite.
Paragraph length of God’s Favourite: 20 paragraphs.
Line l

### 27 Jan 2025

Upon trying to run the calculator above, I found that for some reason, the video/audio element was not being found, which was resulting in the function no longer being able to calculate the length of videos of these talks. I could spend a few days trying to get that up and working, but ultimately, I think that what will be easiest and simplest would be to forego any operations that require 'clicking play' and looking for the length of the video. I'll spend the rest of my time today trying to figure it out, and if I can't, I'll probably just move on.

The main reason I think I'll just move on is that this calculator was working a few months ago when I calculated out the October 2024 General Conference, but now trying to work with the same conference, it's not working, suggesting that some back-end identifiers got updated or something. Even when I tried to run the function in my `Conference Calculator 1.0` notebook, I find that it is no longer able to find the play button for the videos. This suggests that just in the last few months the html code has been updated, and if I'm going to have to rebuild this every six months to accomodate that, it's probably not worth it, especially considering the marginal level of additional accuracy using the video lengths of each talk gives. 

So, that's a big thing to undo, I think: setting up this calculator to no longer use the video. That will speed it up and make it simpler. And, it seems that the number of lines and paragraphs are still calculating and counting, suggesting that although the video and play button tags have been updated, the tags for the text and paragraphs have not. 

I think that with the last few minutes I have today, I will breakdown the first aspect of the calculator, and see what I can do about making sure it is working. As I did before, I'm going to have to modularize this, and then put it all together. 

In [9]:
def system_setup_for_breakdown():
    """The purpose of this function is to import the packages I'll need for the breakdown to work properly and to set the working directory."""

    # import packages
    import pandas as pd
    print('Pandas imported as pd.')
    import requests
    print('Requests imported.')
    from bs4 import BeautifulSoup
    print('BeautifulSoup imported.')
    import os
    print('OS imported.')
    import time
    print('Time imported.')
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options
    print('Webdriver, Service, By, and Options imported from Selenium.')
    from selenium.webdriver.chrome.service import Service as ChromeService
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    print('ChromeService, WebDriverWait, and ExpectedConditions imported from Selenium.')
    import glob
    print('Glob imported.')
    from webdriver_manager.chrome import ChromeDriverManager
    print('ChromeDriverManager imported.')
    import re
    print('Regular expression imported.')
    from datetime import datetime, timedelta
    print('Date and time imported.')

    # Ask for input to establish the working directory
    working_directory = input("Please enter the path to the working directory you'd like to use for this calculator.")
    
    # set working directory for accessing and saving files
    os.chdir(working_directory)
    print(f'Working directory set as [{os.getcwd()}].')
    
    #### save packages imported into list 'packages'
    packages = [pd, requests, BeautifulSoup, os, time, 
                webdriver, Service, By, Options, ChromeService, 
                WebDriverWait, EC, glob, ChromeDriverManager, re, 
                datetime, timedelta]
    
    # return from running this function 'packages'
    return packages


#### Call the function and unpack the imports
# pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()


Great, we'll call that good for the day. The first modular aspect of my breakdown is working well. 

### 4 Feb 2025

Okay, now I need to work on the next step of the breakdown. 

In [2]:
def set_ini_inputs():
    # ask for user input for link
    conf_link = input('Please paste the link to the landing page of the conference you would like to breakdown.')
    # print conference link
    print(f'Conference Link: {conf_link}')

    # Define directory path for saving CSV files
    save_path = input(f'Please paste here the location of the folder in which you would like to store information from this general conference.\n'
                     f'\nExample:    D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\April 2024 GC Talks\n'
                     f'\nThis will require you to have already created a folder in which you want the information for the talks to be saved.')

    # ask for user input for year and month of conference
    month = input('Please enter whether the conference was held in April or October: ').lower().replace('il','',1).replace('ober','',1)
    year = input('Please enter the year of the conference: ').strip().replace('20','',1)
    month_year = month + year
    # print month_year
    print(f'Conference Month-year: {month_year}')

    # ask for user input for the start date and end dates
    ini_start = str(input("What day would you like to start reading?"))
    ini_end = str(input("What day would you like to stop reading?"))
    # print ini_start and ini_end
    print(f'ini_start: {ini_start}')
    print(f'ini_end: {ini_end}')

    # Convert user input into datetime objects
    ini_start_date = pd.to_datetime(ini_start,format='%m/%d/%Y')
    ini_end_date = pd.to_datetime(ini_end,format='%m/%d/%Y')
    # print ini_start_date and ini_end_date
    print(f'ini_start as datetime: {ini_start_date}')
    print(f'ini_end as datetime: {ini_end_date}')
    
    # Get the total number of days for the breakdown plan
    ini_total_days = ini_end_date - (ini_start_date - timedelta(days=1))
    # print total number of days to be calculated over
    print(f'Total number of days: {ini_total_days.days}')

    #### Get user input on the minimum number of days they would like to spend on each talk on each readthrough
    min_per_rt = int(input('What is the minimum number of days you want to spend reading each talk on each readthrough?'))
    #### print min_per_rt
    print(f'Minimum days per readthrough: {min_per_rt}')

    ini_inputs = [conf_link,save_path, month_year, ini_start_date, ini_end_date, ini_total_days, min_per_rt]

    return ini_inputs

In [21]:
#### Call the function and unpack the imports
# pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# conf_link, month_year, ini_start_date, ini_end_date, ini_total_days = set_ini_inputs()

I immediately ran into the problem of the packages only being imported within the `system_setup_for_breakdown` function. So, I modified it above to see if that will work. I modified the code above (defining the `system_setup_for_breakdown` function) by adding a line that saves the packages in a list, and changing how I use the function. Rather than just running it, I run it as saving the packages as different items that the function returns. All this simply overcomes that fact that packages imported inside a function are only imported for that function as stop working as soon as that function ends. 

And it worked, as shown by the output above. 

Now for the third chunk of the overall function - getting the initial information about each talk.

In [3]:
def get_landing_page_info():
    '''
    This function uses requests and ChromeDriver to navigate to the conf_link landing page and get information such as author, title, and description of each talk on the landing page of the conference.
    '''
    # get response
    response = requests.get(conf_link)

    # Define the path to the chromedriver executable
    chrome_driver_dir = r'D:\\Faith and Religion Stuff\\Come, Follow Me\\chromedriver-win64'
    chrome_driver_path = os.path.join(chrome_driver_dir, 'chromedriver.exe')

    # Set up the headless browser options
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920x1080")

    # Set up the Chrome service
    service = Service(chrome_driver_path)    
        
    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Establish a try loop that tries to navigate to the provided link and find and store bits of info that we need 
    try:
        # Navigate to the page with your elements - in this case the April 2024 General Conference
        driver.get(conf_link)

        # Find all elements with the specified class name
        # Gotta use dots, not spaces, here because CSS considers each of those spaces to be defining a dif class object
        elements = driver.find_elements(By.CSS_SELECTOR, 'a.sc-omeqik-0.ewktus.list-tile.listTile-WHLxI')  

        # Initialize a list to store authors, titles, descriptions, and links
        primary_meta_list = []
        title_list = []
        description_list = []
        href_list = []

        # Iterate over each element
        # This for loop will, for all the videos/links to talks on the 2024 General Conference page, run through each of the following operations before moving onto the next
        for element in elements:
            # Try to get the author
            try:
                # Finds and stores the primary meta element (which is the author of the talk or report)
                primary_meta_element = element.find_element(By.CSS_SELECTOR,'p.primaryMeta')
                # Saves the stored author information as text
                primary_meta = primary_meta_element.text
            # If there is no author, save the author as None or Null
            except:
                primary_meta = None
            # Adds the author (or the None) to the list of authors in the appropriate row
            primary_meta_list.append(primary_meta)

            # Try to get the title - every link/video should have a title
            try:
                # Finds and stores the title element (the title of the video, talk, or report)
                title_element = element.find_element(By.CSS_SELECTOR,'p.title')
                # Saves the stored title as text
                title = title_element.text
            # If there is no title, save the title as None - THIS SHOULD NEVER BE THE CASE
            except:
                title = None
            # add the title (or the None) to the list of titles in the appropriate row
            title_list.append(title)

            # Try to get the description - the summary blurb about the video, talk, or report
            try:
                # Finds and stores the description element (the title of the video, talk, or report)
                description_element = element.find_element(By.CSS_SELECTOR,'p.description')
                # Saves the stored description as text
                description = description_element.text
            # If there is no description, save it as None - THIS SHOULD NEVER BE THE CASE
            except:
                description = None
            # add the title (or the None) to the list of descriptions in the appropriate row
            description_list.append(description)

            # Finds and stores the link (or href) to the video, talk, or report
            # This is ultimately going to be the information we use later to get the lengths (in paragraphs) of the talks and the lengths (in lines) of each of those paragraphs
            href = element.get_attribute('href')
            # Adds the stored href to the list of hrefs in the appropriate row
            href_list.append(href)

        # Creates a dataframe to store all the found and stored lists together
        ini_conf_df = pd.DataFrame({
            'Author': primary_meta_list,
            'Title': title_list,
            'Description': description_list,
            'Link': href_list
        })

    # If anything doesn't work for some reason, tell why
    except Exception as e:
        print(f"An error occurred: {e}")

    # After running everything, close the driver we opened to collect the data
    finally:
        # Close the browser
        driver.quit()

    return ini_conf_df

In [23]:
# #### Call the function and unpack the imports
# pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# conf_link, month_year, ini_start_date, ini_end_date, ini_total_days = set_ini_inputs()

# ini_conf_df = get_landing_page_info()

After defining the `get_landing_page_info` function, I tried to run it with the others and found that I needed to go back and set up the `set_ini_inputs` function to return all of those initial variables. Running the cell below, which displays the initially collected conference dataframe, shows that the three modular pieces I have created are all functioning properly together. 

In [24]:
# ini_conf_df

Now I can move onto the next phase of the modularized function: cleaning up that initial dataframe to only contain the things I want, another easy bit. 

Function below was updated on **25 Feb 2025** to include the section about adding a primary key. 

In [4]:
def clean_ini_df(df,monthyear):
    """This function removes rows from a dataframe based on the contents of their respective columns, and requires the following inputs:
    - df: a pandas dataframe that contain columns entitled Author, Description, and Title
    - monthyear: some saved object or variable that contains information about the month and year of a given conference
    """

    # initialize empty list of rows that need to be dropped
    rows_to_drop = []

    # Adds the indexes (or row numbers) of rows to the list of rows to be dropped if there is either no Author or Description
        ## This exclusionary list is easy to edit
    for index,row in df.iterrows():
        if row['Author'] == None:
            rows_to_drop.append(index)
        elif row['Description'] == None:
            rows_to_drop.append(index)
        elif 'Sustaining' in row['Title']:
            rows_to_drop.append(index)
        elif 'Audit' in row['Title']:
            rows_to_drop.append(index)
    
    # Drops the rows in the list of rows to drop from the dataframe and resets the index
        ## This eliminates from the dataframe the session videos and the sustaining of the officers of the Church
    df_1 = df.drop(rows_to_drop).reset_index(drop=True)

    #### Added 25 Feb 2025 - establish primary key (pk) for all talks
    # defines function to get the initials of the speaker for use in creation of primary key column
    def get_initials(full_name):
        parts = full_name.split()
        initials = [part[0].lower() for part in parts]
        return ''.join(initials)

    # Create a new column with initials
    df_1['initials'] = df_1['Author'].apply(get_initials)

    # create a primary key column that combines the initials of the speaker and the month and year of the conference
    df_1['pk'] = (df_1['initials'] + "_" + monthyear)

    return df_1


In [26]:
# Commented out 25 Feb 2025 due to updates in the functions called

# # Call the function and unpack the imports
# pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# # call set_ini_inputs and save outputs
# conf_link, month_year, ini_start_date, ini_end_date, ini_total_days = set_ini_inputs()

# # call get_landing_page_info and save as ini_conf_df
# ini_conf_df = get_landing_page_info()

# # run clean_ini_df on ini_conf_df and save as conf_df_1
# conf_df_1 = clean_ini_df(ini_conf_df)

# conf_df_1

Alright! That's good for the day. When running the cell above, it shows that all my modular pieces up to this point are functioning properly and working together the way they should!

### 10 Feb 2025

After re-running my previous code to make sure it's running properly, I can now move on to getting the next bit of my code modularized and working with the rest. 

In my original code, I have a function that collects the lines, paragraphs, and video length all at once. I previously decided that I would not try to collect information about the video length, but I'm a wishy-washy person, so I think I'll try to modularize the video thing on it's own, so that if I can get it up and running, I can use it, but if I can't, I don't need to worry about it. That will just require creating a conditional operation later that only accounts for the video length if it is present. I'll start with the easy stuff, though, which is getting the lines and paragraphs. 

In [27]:
#### commented out 12 Feb 2025

# def get_line_count(url):
#         """
#         This function was designed specifically to run using an already active webdriver to gather paragraph and line length information about a general conference talk.
#         First, it uses a webdriver to navigate to a url and then finds the paragraphs within a body block, and gets the size of the rectangles within which each of those paragraphs are assigned to appear. 
#         Then, calculates the height of each line, saves that number as an integer, and calculates how many of those lines would fit into the assigned rectangle. 
#         Then, it adds the paragraph number and the number of lines in that paragraph to the previously created dataframe.
#         Finally, it calculates and returns the total number of lines in the talk by getting the sum of all paragraph lengths in lines. 
#         """
#         # initialize empty dataframe "data_list", with columns "paragraph" and "lines" being initially populated with NA values
#         data_list = pd.DataFrame()
#         data_list['paragraph'] = pd.NA
#         data_list['lines'] = pd.NA

#         # Find all elements containing the text
#         paragraphs = driver.find_elements(By.CSS_SELECTOR, '.body-block p')

#         # Iterate over each paragraph element
#         for index, paragraph in enumerate(paragraphs, start=1):
#             # Log paragraph number, since the paragraphs are not numbered. 
#             paragraph_number = index

#             # Get the bounding rectangle of the element
#             rect = paragraph.rect

#             # Calculate line height
#             line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", paragraph)
#             line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

#             # Calculate number of lines
#             num_lines = rect['height'] // line_height_numeric

#             # Append data dictionary to list
#             data_list.at[index,'paragraph'] = paragraph_number
#             data_list.at[index, 'lines'] = num_lines

#             total_lines = sum(data_list['lines'])
        
#         return total_lines

In [28]:
### commented out 24 Feb 2025

# def get_lpr(conf_df):
#     """This function is designed to work with a dataframe that has the title, author, and url for each talk listed on a page that lists conference talks. It applies the get_line_count to each row, and also tries to get the role of each speaker. It then stores collected info in appropriate columns."""
#     # Define the path to the chromedriver executable
#     chrome_driver_dir = r'D:\\Faith and Religion Stuff\\Come, Follow Me\\chromedriver-win64'
#     chrome_driver_path = os.path.join(chrome_driver_dir, 'chromedriver.exe')

#     # Set up the headless browser options
#     chrome_options = Options()
#     chrome_options.add_argument("--headless")
#     chrome_options.add_argument("--disable-gpu")
#     chrome_options.add_argument("--window-size=1920x1080")

#     # Set up the Chrome service
#     service = Service(chrome_driver_path)    
        
#     # Initialize the Chrome WebDriver
#     driver = webdriver.Chrome(service=service, options=chrome_options)

#     # Initialize columns in the dataframe with NA values to later be filled
#     conf_df['paragraphs'] = pd.NA
#     conf_df['lines'] = pd.NA
#     conf_df['role'] = pd.NA

#      # iterate the following over each row in the apr_2024_df dataframe
#     for index, row in conf_df.iterrows():
#         # for each row, when the function calls for title, url, and author it is looking for the Title, Link, and Author columns in that row, respectively
#         title = row['Title']
#         url = row['Link']
#         author = row['Author']

#         # run the driver, navigating to the linked page in the row currently being worked on
#         driver.get(url)

#         # attempt to find the body block
#         try:
#             paragraphs = driver.find_elements(By.CSS_SELECTOR, '.body-block p')
#             # if body block is found, find the number of paragraphs, and save that number to the dataframe in the same row
#             conf_df.at[index, 'paragraphs'] = len(paragraphs)
#             # print a message giving the length of the talk in paragraphs
#             print(f"Paragraph length of {title}: {len(paragraphs)} paragraphs.")

#             # Use the get_line_count function to get the total number of lines in the talk
#             num_lines = get_line_count(url)
            
#             # save the number of lines to the dataframe in the same row
#             conf_df.at[index, 'lines'] = num_lines
            
#             # print a message telling the number of lines in the talk
#             print(f"Line length of {title}: {num_lines} lines.")

#         # if there is an error or a problem, print a message saying what the problem was, and try to get the next piece of information 
#         except Exception as e:
#             print(f"Error calculating lines and paragraphs for {title}: {e}")
        
#         # try to find the author role
#         try:
#             role = driver.find_element(By.CLASS_NAME, 'author-role')
#             # if there is one, save it to the dataframe in the same row
#             conf_df.at[index, 'role'] = role.text
#             # print a message displaying the role of the author
#             print(f"Role of {author}: {role.text}")
#         # if there is an error or a problem, print a message saying what the problem was and then move onto the next row
#         except Exception as e:
#             print(f"Error retrieving role for {author}: {e}")

#     # Close the browser after all rows have been iterated through
#     driver.quit()

#     # converts all numeric columns to integers for easier use later
#     conf_df['paragraphs'] = conf_df['paragraphs'].astype(int)
#     conf_df['lines'] = conf_df['lines'].astype(int)
            
#     return conf_df


In [29]:
# Commented out 24 Feb 2024

# # Call the function and unpack the imports
# pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# # call set_ini_inputs and save outputs
# conf_link, month_year, ini_start_date, ini_end_date, ini_total_days = set_ini_inputs()

# # call get_landing_page_info and save as ini_conf_df
# ini_conf_df = get_landing_page_info()

# # run clean_ini_df on ini_conf_df and save as conf_df_1
# conf_df_1 = clean_ini_df(ini_conf_df)

# # run get_lpr on conf_df_1 and save as conf_df_2
# conf_df_2 = get_lpr(conf_df_1)

Issues I have run into:
* driver not defined - This seems to be an issue with the `get_line_counts` function. I addressed it by just redefining the driver within that code. 
* cannot access local variable 'total_lines' where it is not associated with a value - This is also an issue with the `get_line_counts` function. 

So, before I can insert the `get_line_counts` function into `get_lpr` and into my overall calculator, I need to get it up and running by applying it to individual links and trying to figure out what is wrong. 

I also need to remember to adapt the code that gets the paragraph counts to exclude 'paragraphs' that are actually just subtext for images included with the talks. 

### 12 Feb 2025

I'll start today off by testing the `get_line_count` function on one of the urls from the conference. I've commented out the code above, and I've just copied and pasted the `get_total_lines` function from the original breakdown function to test it's utility. 

Ultimately, I could not get the `get_line_count` function to work using the link to Neil L. Andersen's talk. The issue I was running into is that for some reason the function was not picking up the paragraphs like they should. I inspected my code and the html structure of the webpage, and confirmed that I am calling the paragraphs correctly. I relied mostly on the assistance of BlackboxAI today, and didn't really make any meaningful progress, so I'll try again to figure out what's going on tomorrow using ChatGPT or some other GPT model.

### 24 Feb 2025

Good news and bad news. Good new is that I finally have a job! Bad news is that means that I need to expedite this process, and the fastest way to do that, I think, is to get the old breakdown function working so that if I can't get done what I want to, I still have something to use and fall back on. 

The old function is getting caught or stuck on not being able to find the play buttons for the talks, so I am gonna spend some more time trying to get that up and running, because if I can, the rest of the function **should** still work. 

In digging into the html code of Elder Andersen's talk, and working with ChatGPT, I've found that this snippet of html contains some valuable information.

`<div id="bmpui-id-150" class="bmpui-ui-seekbar" role="slider" aria-label="Video timeline" tabindex="0" aria-valuemin="0" aria-valuemax="811.777"`

The code above defines a slider seekbar used for video content, and the `aria-valuemax` is the length of the video in seconds. My code before relied on a bit of Javascript provide by ChatGPT that I don't really understand. This bit of info could be useful if I decide in the future that I don't want to use the Java, and instead want to rely on what has been programmed into the html code. 

I need to figure out why the original calculator isn't finding the video element. 

Okay, I figured it out. The problem was that I was looking for the play button to be found using the selector `button.sc-1g7hsbc-0.bCKkuP.sc-bvqtyr-3.eKGiZd`, but it is actually found using this one: `button.sc-1g7hsbc-0.lcWZjw.sc-bvqtyr-4.eYHLNi`. In the code above I had tried to add in a secondary selector that would make it so that my calculator could be used on older conferences (which use the first selector), and then I got myself confused and turned around as to which selector is the right one to use. Fixing that enabled the function in Conference Calculator 1.0 to work as it is supposed to. 

Getting that squared away is a big deal for a few reasons. First, it means that I have a proper place to start with when trying to get my code to find the proper play button. Second, it means that if I can't get all or even any of the changes I want to working before next conference (in a few weeks), I still have my original calculator to fall back on, which *does work*, just not as well as I want it to quite yet. 

Having that squared away, I can get back to trying to modularize the Conference Calculator 1.0 into pieces that I can mess around with individually. 

In the cell below, I define a new function `get_vplr()` which stands for get video length, paragraph length, line length, and role. I copied relevant sections from the Conference Calculator 1.0 above more slowly and thoughtfully than I usually do. 

In [30]:
### Commented out 25 Feb 2025 - updated below on same date

# def get_vplr(conf_df):
#     """
#     This function does not get the Variance of the Predicted Log-Ratios from a dataframe containing information about a given general conference. Instead it gets the video length, paragraph length, and line length of each talk in a given general conference or set of talks and gets the role of each speaker. The function first copies the provided conference dataframe and saves it as a new dataframe that will be added to. It then initializes new rows for information to be populated into, and starts up a webdriver, which navigates to each url provided in the original dataframe. At each url attempts to remotely 'click' the play button, and when it successfully does so, it records the length of the video of the talk. It then finds all the paragraph elements on the page, removes captions to pictures, and then stores the number of paragraph elements as the paragraph length of the talk. Then it calculates the total number of lines in the talk and stores that as the number of lines. It then finds the role of the speaker, saves it in the 'role' column, and returns the updated dataframe. 
#     """
#     # copy the conf_df being passed through the function so it doesn't change it
#     conf_df_2 = conf_df.copy()

#     # Initialize columns in the dataframe with NA values to later be filled
#     conf_df_2['time'] = pd.NA
#     conf_df_2['paragraphs'] = pd.NA
#     conf_df_2['lines'] = pd.NA
#     conf_df_2['role'] = pd.NA

#     # Initialize the Chrome WebDriver
#     driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

#     # iterate the following over each row in the apr_2024_df dataframe
#     for index, row in conf_df_2.iterrows():
#         # for each row, when the function calls for title, url, and author it is looking for the Title, Link, and Author columns in that row, respectively
#         title = row['Title']
#         url = row['Link']
#         author = row['Author']

#         # run the driver, navigating to the linked page in the row currently being worked on
#         driver.get(url)

#         # Wait for the page to load completely
#         #### Updated - reduced time for webdriver to wait to 2.5 seconds
#         driver.implicitly_wait(2.5)

#         # Simulate clicking the play button using the class attribute
#         try:
#             play_button = driver.find_element(By.CSS_SELECTOR, "button.sc-1g7hsbc-0.lcWZjw.sc-bvqtyr-4.eYHLNi")
#             play_button.click()
#             print(f"Clicked the play button for {title} to start the media.")
#         except:
#             # ### Added - attempt to find alternately labeled play button
#             print("Trying alternative play button selector.")
#             try:
#                 play_button = driver.find_element(By.CSS_SELECTOR, "button.sc-1g7hsbc-0.bCKkuP.sc-bvqtyr-3.eKGiZd")
#                 play_button.click()
#                 print(f"Clicked the play button for {title} to start the media.")
#             except:
#                 print(f"Play button for {title} not found.")

#         # Wait for the video element to be present in the DOM
#         #### Updated - reduced time for webdriver to wait to 2.5 seconds
#         try:
#             video_element = WebDriverWait(driver, 2.5).until(
#                 EC.presence_of_element_located((By.TAG_NAME, 'video'))
#             )
#             print("Video element found in the DOM.")
#         except:
#             print("No video element found.")

#         # Wait for a short period to allow the video to start loading
#         #### Updated - reduced time for webdriver to wait to 1.5 seconds instead of 2
#         time.sleep(1.5)

#         # Attempt to retrieve the video duration using JavaScript
#         try:
#             video_duration = driver.execute_script("""
#                 let video = document.querySelector('video');
#                 if (video) {
#                     console.log('Video element is present, checking duration...');
#                     return video.duration;
#                 } else {
#                     let audio = document.querySelector('audio');
#                     if (audio) {
#                         console.log('Audio element is present, checking duration...');
#                         return audio.duration;
#                     }
#                 }
#                 return null;  // No media element found
#             """)
            
#             # if video_duration exists
#             if video_duration:
#                 # print a message saying how long the talk is in seconds
#                 print(f"{title} duration: {video_duration:.2f} seconds")
#                 # save the duration into the dataframe in the same row
#                 conf_df_2.at[index, 'time'] = video_duration
#             # otherwise, print a message saying no video or audio element was found for the talk
#             else:
#                 print(f"No video or audio element found for {title}.")
        
#         # If there is an error, say there was an error and what it was, and try to get the next piece of information       
#         except Exception as e:
#             print(f"Error retrieving video duration for {title}: {e}")

#         # attempt to find the body block
#         try:
#             #### save all paragraph elements in list 'all_paragraphs'
#             all_paragraphs = driver.find_elements(By.CSS_SELECTOR, '.body-block p')
#             #### only keep those paragraphs whose element height is greater than or equal to 28 pixels
#             #### this will effectively remove from the paragraph list the subtext to pictures
#             paragraphs = [p for p in all_paragraphs if p.rect.get('height', 0) >= 28]
#             # if body block is found, find the number of paragraphs, and save that number to the dataframe in the same row
#             conf_df_2.at[index, 'paragraphs'] = len(paragraphs)
#             # print a message giving the length of the talk in paragraphs
#             print(f"Paragraph length of {title}: {len(paragraphs)} paragraphs.")

#             # initialize empty dataframe "data_list", with columns "paragraph" and "lines" being initially populated with NA values
#             data_list = pd.DataFrame()
#             data_list['paragraph'] = pd.NA
#             data_list['lines'] = pd.NA

#             # Iterate over each paragraph element
#             for idx, paragraph in enumerate(paragraphs, start=1):
#                 # Log paragraph number, since the paragraphs are not numbered. 
#                 paragraph_number = idx

#                 # Get the bounding rectangle of the element
#                 rect = paragraph.rect

#                 # Calculate line height
#                 line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", paragraph)
#                 line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

#                 # Calculate number of lines
#                 num_lines = rect['height'] // line_height_numeric

#                 # Append data dictionary to list
#                 data_list.at[idx,'paragraph'] = paragraph_number
#                 data_list.at[idx, 'lines'] = num_lines

#                 total_lines = sum(data_list['lines'])

#             # Use the get_total_lines function to get the total number of lines in the talk
#             num_lines = total_lines
            
#             # save the number of lines to the dataframe in the same row
#             conf_df_2.at[index, 'lines'] = num_lines
            
#             # print a message telling the number of lines in the talk
#             print(f"Line length of {title}: {num_lines} lines.")

#         # if there is an error or a problem, print a message saying what the problem was, and try to get the next piece of information 
#         except Exception as e:
#             print(f"Error calculating lines and paragraphs for {title}: {e}")
        
#         # try to find the author role
#         try:
#             role = driver.find_element(By.CLASS_NAME, 'author-role')
#             # if there is one, save it to the dataframe in the same row
#             conf_df_2.at[index, 'role'] = role.text
#             # print a message displaying the role of the author
#             print(f"Role of {author}: {role.text}")
#         # if there is an error or a problem, print a message saying what the problem was and then move onto the next row
#         except Exception as e:
#             print(f"Error retrieving role for {author}: {e}")

#     # Close the browser after all rows have been iterated through
#     driver.quit()

#     return conf_df_2

In [31]:
# commented out 26 Feb 2025

# # Call the function and unpack the imports
# pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# # call set_ini_inputs and save outputs
# conf_link, month_year, ini_start_date, ini_end_date, ini_total_days = set_ini_inputs()

In [32]:
# commented out 26 Feb 2025

# # call get_landing_page_info and save as ini_conf_df
# ini_conf_df = get_landing_page_info()

# # run clean_ini_df on ini_conf_df and save as conf_df_1
# conf_df_1 = clean_ini_df(ini_conf_df)

In [33]:
# commented out 26 Feb 2025 - get_vplr no longer exists

# # run get_lpr on conf_df_1 and save as conf_df_2
# conf_df_2 = get_vplr(conf_df_1)

That's solid work for today. I got the section that gathers further information about the talk modularized, and I corrected one of the main issues I wanted to see corrected while working through this breakdown. Great progress today. 

### 25 Feb 2025

Okay, today I can start working on the next bit of modularized code. As always, I'll start the day's work off by making sure that everything from yesterday is still working. 

In [34]:
# commented out 26 Feb 2025

# # Call the function and unpack the imports
# pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# # call set_ini_inputs and save outputs
# conf_link, save_path, month_year, ini_start_date, ini_end_date, ini_total_days = set_ini_inputs()

In [35]:
# Commented out 25 Feb 2025 due to changes in functions
 
# # call get_landing_page_info and save as ini_conf_df
# ini_conf_df = get_landing_page_info()

# # run clean_ini_df on ini_conf_df and save as conf_df_1
# conf_df_1 = clean_ini_df(ini_conf_df)

In [36]:
# Commented out 25 Feb 2025 because get_lpr and get_vplr no longer exist

# # run get_lpr on conf_df_1 and save as conf_df_2
# conf_df_2 = get_vplr(conf_df_1)

One thing that happens with my current calculator is that it exports the talks to my device. However, it currently, stupidly, does that after collecting all the information in the talks, meaning that it has to navigate to each one again. It also, stupidly, exports the talks to my device without saving them in python's memory, and then imports them from my device to save them in python's memory. This is inefficient and time consuming. I can adapt my `get_vplr()` function to grab all the information I'm currently after **and** save the text of each talk locally **and** export the text to my device. 

Before proceeding with that, though, I think I want to move getting the role of each speaker to almost the initial stage of the calculator. My reasoning for this is that it will remove the `r` (role) aspect from the `get_vplr()` function, making it a little simpler before making the change above. It will also help me to get the entire calculator set up in such a way that most, if not all, of the requests for user input are concentrated at the beginning of the calculator program, rather than scattered throughout the process. 

The function below needs to be run after `clean_ini_df()`. It takes a list of talks and links, navigates to each, gets and stores the role of the author, and asks the user how many times they would like to read talks given by people with certain roles and if they would like to spend any extra days on talks given by speakers with certain roles. 

In [5]:
def get_role_inputs(df):
    """
    This function takes a dataframe that has a list of conference talks and their links, navigates to each of the links, and collects and stores the role of the author. After collecting the role of the author, it shortens the role of that author and stores the new short role in a new column. The function then creates and populates new columns using user input to establish a read_weight and day_weight for each of the unique roles represented in the Conference. Read_weight is the number of times the user wants to read talks given by each unique role, and day_weight is any number of extra or fewer days the user wants to spend reading talks given by any of the unique roles. 
    """
    
    # copy the dataframe being passed through the function so it isn't changed
    df = df.copy()
    
    # initialize author role column
    df['role'] = pd.NA

    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

    # iterate the following over each row in the apr_2024_df dataframe
    for index, row in df.iterrows():
        # for each row, when the function calls for title, url, and author it is looking for the Title, Link, and Author columns in that row, respectively
        title = row['Title']
        url = row['Link']
        author = row['Author']

        # run the driver, navigating to the linked page in the row currently being worked on
        driver.get(url)

        # Wait for the page to load completely
        #### Updated - reduced time for webdriver to wait to 0.25 seconds
        driver.implicitly_wait(0.25)

        # try to find the author role
        try:
            role = driver.find_element(By.CLASS_NAME, 'author-role')
            # if there is one, save it to the dataframe in the same row
            df.at[index, 'role'] = role.text
            # print a message displaying the role of the author
            print(f"Role of {author}: {role.text}")
        # if there is an error or a problem, print a message saying what the problem was and then move onto the next row
        except Exception as e:
            print(f"Error retrieving role for {author}: {e}")
    
    # Close the browser after all rows have been iterated through
    driver.quit()

    # initialize short_role column as a copy of the role column
    df['short_role'] = df['role']

    # initializes a replacement dictionary to shorten information in newly created 'short_role' column
    rep_dict = {}

    # Adds specific shortenings of each role to the replacement dictionary
        ## this list is also easily editable if any other office becomes prominently represented in future conferences
        ## this list also puts members of the Presidency of the Seventy and of any other member of any other Quorum of the Seventy on equal ground       
    for index, row in df.iterrows():
        if 'President of The Church'in row['role']:
            rep_dict[row['short_role']] = 'President of the Church'
        elif 'First Presidency' in row['role']:
            rep_dict[row['short_role']] = 'First Presidency'
        elif 'Quorum of the Twelve' in row['role']:
            rep_dict[row['short_role']] = 'Quorum of the Twelve'
        elif 'the Seventy' in row['role']:
            rep_dict[row['short_role']] = 'Seventy'
        elif 'Relief Society' in row['role']:
            rep_dict[row['short_role']] = 'Relief Society Presidency'
        elif 'Presiding' in row['role']:
            rep_dict[row['short_role']] = 'Presiding Bishopric'
        elif 'Sunday School' in row['role']:
            rep_dict[row['short_role']] = 'Sunday School Presidency'
        elif 'Young Men' in row['role']:
            rep_dict[row['short_role']] = 'Young Men Presidency'
        elif 'Young Women' in row['role']:
            rep_dict[row['short_role']] = 'Young Women Presidency'
        elif 'Primary' in row['role']:
            rep_dict[row['short_role']] = 'Primary Presidency'
        else:
            rep_dict[row['short_role']] = 'other speakers'

    # uses replacement dictionary to replace (shorten) all the entries in the short_role column
    for words, replacement in rep_dict.items():
        df['short_role'] = df['short_role'].replace(words, replacement).str.strip()

    # initialize an empty list of read weights
    read_weights = []

    # for every unique role code in the role_code column of the apr_2024_info dataframe...
    for short_role in df.short_role.unique():
        # ... ask the user what the read weight should be and...
        read_weight = int(input(f"How many times would you like to read talks given by the {short_role}?"))
        # ... save both the role code and the read weight to the read_weights list
        read_weights.append({'short_role':short_role, 'read_weight':read_weight})

    # convert the read_weights list to a dataframe, save with the same name to replace the old item
    read_weights = pd.DataFrame(read_weights)

    # left-merge the read_weights dataframe to the apr_2024_info dataframe useing the role_code columns as a guide for merging
    # left-merge keeps everything in the dataframe being merged to, and only merges data from the second dataframe that has a corresponding value in the original dataframe
    df = df.merge(read_weights, on='short_role',how='left')

    # initialize an empty list of day weights
    day_weights = []

    # for every unique role code in the role_code column of the apr_2024_info dataframe...
    for short_role in df.short_role.unique():
        # ... ask the user what the day weight should be and...
        day_weight = int(input(f"How many more or fewer days would you like to spend on talks given by the {short_role}?\n"
                               f"\nIf you want to spend more days reading talks from the {short_role}, enter a number above 0.\n"
                               f"\nOr if you want to spend fewer days reading talks from the {short_role}, enter a number below 0 by using a minus sign or dash.\n"
                               f"\nIf you would rather spend a relatively the same amount of time on each talk from this organization as others, enter 0."))
        # ... save both the role code and the day weight to the read_weights list
        day_weights.append({'short_role':short_role, 'day_weight':day_weight})

    # convert the read_weights list to a dataframe, save with the same name to replace the old item
    day_weights = pd.DataFrame(day_weights)

    # left-merge the read_weights dataframe to the apr_2024_info dataframe useing the role_code columns as a guide for merging
    # left-merge keeps everything in the dataframe being merged to, and only merges data from the second dataframe that has a corresponding value in the original dataframe
    df = df.merge(day_weights, on='short_role',how='left')

    return df

In the cell below, I updated the `get_vplr()` function. I first removed it's collection of the speaker's role, as that aspect of the function has been incorporated earlier in the process. 

In [51]:
def get_vplt(conf_df):
    """
    This function does not get the Variance of the Predicted Log-Ratios from a dataframe containing information about a given general conference. Instead it gets the video length, paragraph length, and line length of each talk in a given general conference or set of talks. The function first copies the provided conference dataframe and saves it as a new dataframe that will be added to. It then initializes new rows for information to be populated into, and starts up a webdriver, which navigates to each url provided in the original dataframe. At each url attempts to remotely 'click' the play button, and when it successfully does so, it records the length of the video of the talk. It then finds all the paragraph elements on the page, removes captions to pictures, and then stores the number of paragraph elements as the paragraph length of the talk. Then it calculates the total number of lines in the talk and stores that as the number of lines. 
    
    The function also gathers all the text of each talk, and saves to a dictionary nuanced information about each talk, including the text and length of each paragraph. The dataframes saved to this dictionary are made SQL friendly with the incorporation of primary and foreign keys. Each dataframe is also exported as a csv to the location specified in earlier stages of the calculator. 
    
    It then returns the updated dataframe and the dictionary populated with dataframes for each of the talks. 
    """
    # copy the conf_df being passed through the function so it doesn't change it
    conf_df_2 = conf_df.copy()

    # Initialize columns in the dataframe with NA values to later be filled
    conf_df_2['time'] = pd.NA
    conf_df_2['paragraphs'] = pd.NA
    conf_df_2['lines'] = pd.NA

    # initialize a dictionary to store individual talk dataframes in
    all_talks_dict = {}

    # Initialize the Chrome WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

    # iterate the following over each row in the apr_2024_df dataframe
    for index, row in conf_df_2.iterrows():
        # for each row, when the function calls for title, url, and author it is looking for the Title, Link, and Author columns in that row, respectively
        title = row['Title']
        url = row['Link']
        author = row['Author']
        pk = row['pk']

        # run the driver, navigating to the linked page in the row currently being worked on
        driver.get(url)

        # Wait for the page to load completely
        #### Updated - reduced time for webdriver to wait to 2.5 seconds
        driver.implicitly_wait(2.5)

        # Simulate clicking the play button using the class attribute
        try:
            play_button = driver.find_element(By.CSS_SELECTOR, "button.sc-1g7hsbc-0.lcWZjw.sc-bvqtyr-4.eYHLNi")
            play_button.click()
            print(f"Clicked the play button for {title} to start the media.")
        except:
            # ### Added - attempt to find alternately labeled play button
            print("Trying alternative play button selector.")
            try:
                play_button = driver.find_element(By.CSS_SELECTOR, "button.sc-1g7hsbc-0.bCKkuP.sc-bvqtyr-3.eKGiZd")
                play_button.click()
                print(f"Clicked the play button for {title} to start the media.")
            except:
                print(f"Play button for {title} not found.")

        # Wait for the video element to be present in the DOM
        #### Updated - reduced time for webdriver to wait to 2.5 seconds
        try:
            video_element = WebDriverWait(driver, 2.5).until(
                EC.presence_of_element_located((By.TAG_NAME, 'video'))
            )
            print("Video element found in the DOM.")
        except:
            print("No video element found.")

        # Wait for a short period to allow the video to start loading
        #### Updated - reduced time for webdriver to wait to 1.5 seconds instead of 2
        time.sleep(1.5)

        # Attempt to retrieve the video duration using JavaScript
        try:
            video_duration = driver.execute_script("""
                let video = document.querySelector('video');
                if (video) {
                    console.log('Video element is present, checking duration...');
                    return video.duration;
                } else {
                    let audio = document.querySelector('audio');
                    if (audio) {
                        console.log('Audio element is present, checking duration...');
                        return audio.duration;
                    }
                }
                return null;  // No media element found
            """)
            
            # if video_duration exists
            if video_duration:
                # print a message saying how long the talk is in seconds
                print(f"{title} duration: {video_duration:.2f} seconds")
                # save the duration into the dataframe in the same row
                conf_df_2.at[index, 'time'] = video_duration
            # otherwise, print a message saying no video or audio element was found for the talk
            else:
                print(f"No video or audio element found for {title}.")
        
        # If there is an error, say there was an error and what it was, and try to get the next piece of information       
        except Exception as e:
            print(f"Error retrieving video duration for {title}: {e}")

        # attempt to find the body block
        try:
            #### save all paragraph elements in list 'all_paragraphs'
            all_paragraphs = driver.find_elements(By.CSS_SELECTOR, '.body-block p')
            #### only keep those paragraphs whose element height is greater than or equal to 28 pixels
            #### this will effectively remove from the paragraph list the subtext to pictures
            paragraphs = [p for p in all_paragraphs if p.rect.get('height', 0) >= 28]
            # if body block is found, find the number of paragraphs, and save that number to the dataframe in the same row
            conf_df_2.at[index, 'paragraphs'] = len(paragraphs)
            # print a message giving the length of the talk in paragraphs
            print(f"Paragraph length of {title}: {len(paragraphs)} paragraphs.")

            # initialize empty dataframe "data_list", with columns "paragraph" and "lines" being initially populated with NA values
            data_list = pd.DataFrame()
            data_list['paragraph'] = pd.NA
            data_list['lines'] = pd.NA
            data_list['text'] = pd.NA

            # Iterate over each paragraph element
            for idx, paragraph in enumerate(paragraphs, start=1):
                # get the text of the element
                text = paragraph.text
                
                # Log paragraph number, since the paragraphs are not numbered. 
                paragraph_number = idx

                # Get the bounding rectangle of the element
                rect = paragraph.rect

                # Calculate line height
                line_height_str = driver.execute_script("return window.getComputedStyle(arguments[0]).getPropertyValue('line-height');", paragraph)
                line_height_numeric = int(re.search(r'\d+', line_height_str).group())  # Extract numeric value from string

                # Calculate number of lines
                num_lines = rect['height'] // line_height_numeric

                # Append data dictionary to list
                data_list.at[idx,'paragraph'] = paragraph_number
                data_list.at[idx, 'lines'] = num_lines
                data_list.at[idx, 'text'] = text

            total_lines = sum(data_list['lines'])

            # Use the get_total_lines function to get the total number of lines in the talk
            num_lines = total_lines
            
            # save the number of lines to the dataframe in the same row
            conf_df_2.at[index, 'lines'] = num_lines
            
            # print a message telling the number of lines in the talk
            print(f"Line length of {title}: {num_lines} lines.")

        # if there is an error or a problem, print a message saying what the problem was, and try to get the next piece of information 
        except Exception as e:
            print(f"Error calculating lines and paragraphs for {title}: {e}")

        # if the data_list dataframe is not populated with None data and is not empty:
        if data_list is not None and not data_list.empty:
            # create a new column that uses the primary key of the talk as the foreign key
            data_list['foreign_key'] = pk
            # use a lambda function to create a primary key for each paragraph consisting of the foreign key + the paragraph number
            data_list['pk'] = data_list.apply(lambda x:f"{pk}_{x.get('paragraph')}", axis=1)

            # define new column order
            new_column_order = ['foreign_key','pk','paragraph','text','lines']

            # apply new order to data_list
            data_list = data_list[new_column_order]

            # save file name
            csv_filename = f'{pk}_lines.csv'
            # combine the file name and the user inputted folder location to create a complete save path
            full_path = os.path.join(save_path,csv_filename)

            # Debugging line to state where files can be found
            print(f'Saving to: {full_path}')

            # export dataframe as a csv file to the location specified
            data_list.to_csv(full_path, index = False)
            # print message saying that csv file has been created
            print(f'CSV file for "{title}" saved successfully as {csv_filename}.')

            # save the text and other information about the talk to the all_talks_dict
            all_talks_dict[pk] = data_list
        
        # otherwise, if the dataframe is filled with None values or is empty
        else:
            # print a message saying no data was found for the talk
            print(f'No data found for "{title}", skipping CSV creation.')
        
    # Close the browser after all rows have been iterated through
    driver.quit()

    #### re-added 30 Mar 2025
    # converts all numeric columns to integers for easier use later
    conf_df_2['time'] = conf_df_2['time'].astype(int).round(0)
    conf_df_2['paragraphs'] = conf_df_2['paragraphs'].astype(int)
    conf_df_2['lines'] = conf_df_2['lines'].astype(int)


    #### re-added 30 Mar 2025
    # iterate over every dataframe stored in the all_talks_df dictionary
    for talk, df in all_talks_dict.items():
        # create a new column in each dataframe that is the cumulative sum of the number of lines
        df['running_lines'] = df['lines'].cumsum()


    return conf_df_2, all_talks_dict

In [39]:
# Call the function and unpack the imports
pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# call set_ini_inputs and save outputs
conf_link, save_path, month_year, ini_start_date, ini_end_date, ini_total_days = set_ini_inputs()

Pandas imported as pd.
Requests imported.
BeautifulSoup imported.
OS imported.
Time imported.
Webdriver, Service, By, and Options imported from Selenium.
ChromeService, WebDriverWait, and ExpectedConditions imported from Selenium.
Glob imported.
ChromeDriverManager imported.
Regular expression imported.
Date and time imported.
Working directory set as [D:\Faith and Religion Stuff\Come, Follow Me\come-follow-me-breakdown-builder].
Conference Link: https://www.churchofjesuschrist.org/study/general-conference/2024/10?lang=eng
Conference Month-year: oct24
ini_start: 10/10/2024
ini_end: 4/4/2025
ini_start as datetime: 2024-10-10 00:00:00
ini_end as datetime: 2025-04-04 00:00:00
Total number of days: 177


In [40]:
# call get_landing_page_info and save as ini_conf_df
ini_conf_df = get_landing_page_info()

# run clean_ini_df on ini_conf_df and save as conf_df_1
conf_df_1 = clean_ini_df(ini_conf_df,month_year)

In [41]:
conf_df_2 = get_role_inputs(conf_df_1)

conf_df_2

Role of Neil L. Andersen: Of the Quorum of the Twelve Apostles
Role of Emily Belle Freeman: Young Women General President
Role of Karl D. Hirst: Of the Seventy
Role of Dale G. Renlund: Of the Quorum of the Twelve Apostles
Role of David P. Homer: Of the Seventy
Role of Gregorio E. Casillas: Of the Seventy
Role of Dallin H. Oaks: First Counselor in the First Presidency
Role of D. Todd Christofferson: Of the Quorum of the Twelve Apostles
Role of José A. Teixeira: Of the Presidency of the Seventy
Role of Juan Pablo Villar: Of the Seventy
Role of Patrick Kearon: Of the Quorum of the Twelve Apostles
Role of David L. Buckner: Of the Seventy
Role of D. Martin Goury: Of the Seventy
Role of Aroldo B. Cavalcante: Of the Seventy
Role of Ulisses Soares: Of the Quorum of the Twelve Apostles
Role of Gerrit W. Gong: Of the Quorum of the Twelve Apostles
Role of Kristin M. Yee: Second Counselor in the Relief Society General Presidency
Role of Kyle S. McKay: Of the Seventy
Role of Jorge M. Alvarado: Of t

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0


In [42]:
conf_df_3, all_talks_dict = get_vplt(conf_df_2)

Clicked the play button for The Triumph of Hope to start the media.
Video element found in the DOM.
The Triumph of Hope duration: 811.78 seconds
Paragraph length of The Triumph of Hope: 44 paragraphs.
Line length of The Triumph of Hope: 177 lines.
Saving to: D:\Faith and Religion Stuff\Come, Follow Me\Test Oct 2024 GC\nla_oct24_lines.csv
CSV file for "The Triumph of Hope" saved successfully as nla_oct24_lines.csv.
Clicked the play button for Live Up to Your Privileges to start the media.
Video element found in the DOM.
Live Up to Your Privileges duration: 705.54 seconds
Paragraph length of Live Up to Your Privileges: 34 paragraphs.
Line length of Live Up to Your Privileges: 170 lines.
Saving to: D:\Faith and Religion Stuff\Come, Follow Me\Test Oct 2024 GC\ebf_oct24_lines.csv
CSV file for "Live Up to Your Privileges" saved successfully as ebf_oct24_lines.csv.
Clicked the play button for God’s Favourite to start the media.
Video element found in the DOM.
God’s Favourite duration: 631.40 

In [43]:
conf_df_3

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,811.777,44,177
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705.538,34,170
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631.397,20,157
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,828.994,22,181
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0,619.986,36,150
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0,576.509,15,122
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2,810.976,33,161
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,791.557,30,188
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0,649.448,21,147
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0,613.412,16,114


In [44]:
all_talks_dict

{'nla_oct24':    foreign_key            pk paragraph  \
 1    nla_oct24   nla_oct24_1         1   
 2    nla_oct24   nla_oct24_2         2   
 3    nla_oct24   nla_oct24_3         3   
 4    nla_oct24   nla_oct24_4         4   
 5    nla_oct24   nla_oct24_5         5   
 6    nla_oct24   nla_oct24_6         6   
 7    nla_oct24   nla_oct24_7         7   
 8    nla_oct24   nla_oct24_8         8   
 9    nla_oct24   nla_oct24_9         9   
 10   nla_oct24  nla_oct24_10        10   
 11   nla_oct24  nla_oct24_11        11   
 12   nla_oct24  nla_oct24_12        12   
 13   nla_oct24  nla_oct24_13        13   
 14   nla_oct24  nla_oct24_14        14   
 15   nla_oct24  nla_oct24_15        15   
 16   nla_oct24  nla_oct24_16        16   
 17   nla_oct24  nla_oct24_17        17   
 18   nla_oct24  nla_oct24_18        18   
 19   nla_oct24  nla_oct24_19        19   
 20   nla_oct24  nla_oct24_20        20   
 21   nla_oct24  nla_oct24_21        21   
 22   nla_oct24  nla_oct24_22        22  

The cells above that display `conf_df_3` and `all_talks_dict` show that my functions are all working the way I hoped and expected them to, so that's a great stopping point for the day. 

### 26 Feb 2025

As usual, I will start the day off by running all the code parts to ensure that they are functioning properly. I ran them all in the cells above, and it went great! Total run-time is about 6 minutes so far, which is about 33% less than what it was to get through all the webscraping in the original calculator, so that's big progress!

Unfortunately that's also all I have time for today. 

### 28 Feb 2025

Okay, I might not have a ton of time today, but I'm gonna see about getting the next step of the calculator modularized. 

In [7]:
# Call the function and unpack the imports
pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# call set_ini_inputs and save outputs
conf_link, save_path, month_year, ini_start_date, ini_end_date, ini_total_days, min_per_rt = set_ini_inputs()

# call get_landing_page_info and save as ini_conf_df
ini_conf_df = get_landing_page_info()

# run clean_ini_df on ini_conf_df and save as conf_df_1
conf_df_1 = clean_ini_df(ini_conf_df,month_year)

# run get_role_inputs
conf_df_2 = get_role_inputs(conf_df_1)

# run get vplt to get conf_df_3 and all_talks_dict
conf_df_3, all_talks_dict = get_vplt(conf_df_2)

Pandas imported as pd.
Requests imported.
BeautifulSoup imported.
OS imported.
Time imported.
Webdriver, Service, By, and Options imported from Selenium.
ChromeService, WebDriverWait, and ExpectedConditions imported from Selenium.
Glob imported.
ChromeDriverManager imported.
Regular expression imported.
Date and time imported.
Working directory set as [D:\Faith and Religion Stuff\Come, Follow Me\come-follow-me-breakdown-builder].
Conference Link: https://www.churchofjesuschrist.org/study/general-conference/2024/10?lang=eng
Conference Month-year: oct24
ini_start: 10/10/2024
ini_end: 4/4/2025
ini_start as datetime: 2024-10-10 00:00:00
ini_end as datetime: 2025-04-04 00:00:00
Total number of days: 177
Minimum days per readthrough: 2
Role of Neil L. Andersen: Of the Quorum of the Twelve Apostles
Role of Emily Belle Freeman: Young Women General President
Role of Karl D. Hirst: Of the Seventy
Role of Dale G. Renlund: Of the Quorum of the Twelve Apostles
Role of David P. Homer: Of the Seventy

The next step in the process is establishing the number of days that will be spent reading each talk. In order to do that, I'll have to go back to my inputs stage and ask for input on a minimum number of days, and then incorporate that into my function below, `distribute_days()`. Okay, I've gone back and added that bit. That's probably all I'll have time for today.

In [9]:
conf_df_3

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,811.777,44,177
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705.538,34,170
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631.397,20,157
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,828.994,22,181
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0,619.986,36,150
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0,576.509,15,122
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2,810.976,33,161
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,791.557,30,188
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0,649.448,21,147
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0,613.412,16,114


### 1 Mar 2025

Today I'll get started on my `distribute_days()` function. 

In [None]:
#### commented out 4 Mar 2025 because it doesn't work
# 
# 
# def distribute_days(conference_dataframe, initial_total_days, min_days_per_readthrough):
#     """
    
#     """
#     # copy the df used so that it is not affected or changed by being passed through the program
#     df = conference_dataframe.copy()
#     # rename min_days_per_readthrough so it's not so cumbersome to use in the function
#     min_days = min_days_per_readthrough

#     #### subtract the total number of days needed to spend the minimum number of days on each talk from the total days
#     days_wo_min = initial_total_days.days - (min_days * len(df))

#     # Get some information about the conference as a whole
#     total_time = sum(df['time'])
#     total_lines = sum(df['lines'])
    
#     # create a new conference consumption column that gives a proportional weight to each talk based on it's length and user input
#     df['conf_cons'] = (
#         ((1/len(df)) +                                                      # Each talk is 1 of all the talks given, this treats each equally
#         df['time']/total_time +                                             # time weight - longer "heavier"
#         df['lines']/total_lines +                                           # lines weight - longer "heavier"
#         ((df['day_weight'] + 1)/(df['day_weight'] + 1).sum()))              # preference weight - user input factors in here
#         / 4                                                                 # Adding each of those and then dividing by 4 gets the average
#     )
    
#     # create a column with the total number of days to be spent on each talk
#         ## multiply the number of days specified in the plan by the conference consumption ratio
#     df['tot_num_days'] = (days_wo_min * df['conf_cons'])

#     # convert the newly created column into floats
#     df['tot_num_days'] = df['tot_num_days'].astype(float)

#     # round the floats to their nearest whole numbers
#     df['tot_num_days'] = df['tot_num_days'].round()

#     # convert those rounded floats to integers
#     df['tot_num_days'] = df['tot_num_days'].astype(int)

#     return df

In [None]:
# conf_df_4 = distribute_days(conf_df_3, ini_total_days, min_per_rt)

# conf_df_4

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines,conf_cons,tot_num_days
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,811.777,44,177,0.034622,4
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705.538,34,170,0.025687,3
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631.397,20,157,0.024376,3
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,828.994,22,181,0.03497,4
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0,619.986,36,150,0.02395,3
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0,576.509,15,122,0.022271,2
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2,810.976,33,161,0.033899,4
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,791.557,30,188,0.034914,4
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0,649.448,21,147,0.024107,3
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0,613.412,16,114,0.022277,2


In [35]:
conf_df_4.dtypes

Author          object
Title           object
Description     object
Link            object
initials        object
pk              object
role            object
short_role      object
read_weight      int64
day_weight       int64
time            object
paragraphs      object
lines           object
conf_cons       object
tot_num_days    object
dtype: object

In [43]:
ini_total_days.days - sum(conf_df_4['tot_num_days'])

62

In [45]:
sum(conf_df_4['tot_num_days'])

115

Hm. Well I'm out of time for today. I am running into a problem where my days are not distributing across the talks quite properly. More on that on Monday or something. 

### 4 March 2025

Picking up where I left off, I'm trying to figure out why the days are not distributing properly.

In [8]:
# Call the function and unpack the imports
pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# call set_ini_inputs and save outputs
conf_link, save_path, month_year, ini_start_date, ini_end_date, ini_total_days, min_per_rt = set_ini_inputs()

# call get_landing_page_info and save as ini_conf_df
ini_conf_df = get_landing_page_info()

# run clean_ini_df on ini_conf_df and save as conf_df_1
conf_df_1 = clean_ini_df(ini_conf_df,month_year)

# run get_role_inputs
conf_df_2 = get_role_inputs(conf_df_1)

# run get vplt to get conf_df_3 and all_talks_dict
conf_df_3, all_talks_dict = get_vplt(conf_df_2)

Pandas imported as pd.
Requests imported.
BeautifulSoup imported.
OS imported.
Time imported.
Webdriver, Service, By, and Options imported from Selenium.
ChromeService, WebDriverWait, and ExpectedConditions imported from Selenium.
Glob imported.
ChromeDriverManager imported.
Regular expression imported.
Date and time imported.
Working directory set as [D:\Faith and Religion Stuff\Come, Follow Me\come-follow-me-breakdown-builder].
Conference Link: https://www.churchofjesuschrist.org/study/general-conference/2024/10?lang=eng
Conference Month-year: oct24
ini_start: 10/10/2024
ini_end: 4/4/2025
ini_start as datetime: 2024-10-10 00:00:00
ini_end as datetime: 2025-04-04 00:00:00
Total number of days: 177
Minimum days per readthrough: 2
Role of Neil L. Andersen: Of the Quorum of the Twelve Apostles
Role of Emily Belle Freeman: Young Women General President
Role of Karl D. Hirst: Of the Seventy
Role of Dale G. Renlund: Of the Quorum of the Twelve Apostles
Role of David P. Homer: Of the Seventy

In the cell below, I define the function `distribute_days()` which requires a few inputs, including the current iteration of the conference dataframe, the initial total number of days as calculated above based on the start and end dates given, and the minimum number of days per talk per readthrough, as input by the user above. It then copies the given dataframe (so it remains unaffected) and the min_days variable. There's a lot that goes into it, and it needs a lot of improvement, but it is working the way I want it to for now. 

In [7]:
def distribute_days(conference_dataframe, initial_total_days, min_days_per_readthrough):
    """
    This function calculates how many total days should be spent reading each talk, and then distributes those days across the appropriate number of readthroughs. 
    """
    # copy the df used so that it is not affected or changed by being passed through the program
    df = conference_dataframe.copy()
    # rename min_days_per_readthrough so it's not so cumbersome to use in the function
    min_days = min_days_per_readthrough

    #### subtract the total number of days needed to spend the minimum number of days on each talk from the total days
    days_wo_min = initial_total_days.days - (min_days * len(df))

    # Get some information about the conference as a whole
    total_time = sum(df['time'])
    total_lines = sum(df['lines'])
    
    # create a new conference consumption column that gives a proportional weight to each talk based on it's length and user input
    df['conf_cons'] = (
        ((1/len(df)) +                                                      # Each talk is 1 of all the talks given, this treats each equally
        df['time']/total_time +                                             # time weight - longer "heavier"
        df['lines']/total_lines +                                           # lines weight - longer "heavier"
        ((df['day_weight'] + 1)/(df['day_weight']+1).sum()))              # preference weight - user input factors in here
        / 4                                                                 # Adding each of those and then dividing by 4 gets the average
    )
    
    # create a column with the total number of days to be spent on each talk
        ## multiply the number of days specified in the plan by the conference consumption ratio
    df['tot_num_days'] = ((days_wo_min * df['conf_cons']) + min_days)

    # convert the newly created column into floats
    df['tot_num_days'] = df['tot_num_days'].astype(float)

    # round the floats to their nearest whole numbers
    df['tot_num_days'] = df['tot_num_days'].round()

    # convert those rounded floats to integers
    df['tot_num_days'] = df['tot_num_days'].astype(int)


    #### added to handle over/undershooting on number of days
    
    #### calculate the difference between the sum of the total number of days and the ini_total_days
    diff = sum(df['tot_num_days']) - initial_total_days.days

    #### if diff > 0 (meaning too many days have been distributed)
    if diff > 0:
        # subset the data so that it only contains those rows where tot_num_days is greater than min_days * read_weight
        subset_df = df[df.tot_num_days > (df.read_weight * min_days)]
        # then subset the data again so that only those rows with the lowest tot_num_days are included
        subset_df = subset_df[subset_df['tot_num_days'] == subset_df['tot_num_days'].min()]
        # then recalculate 'tot_num_days' so that it will still be a float
        subset_df['tot_num_days'] = ((days_wo_min * subset_df['conf_cons']) + min_days)
        subset_df['tot_num_days'] = subset_df['tot_num_days'].astype(float)
        # then sort the subset by tot_num_days in ascending order and only keep a number of rows equal to diff
        subset_df = subset_df.sort_values(by='tot_num_days', ascending=True).head(diff)
        # then subtract 1 from each of the tot_num_days values in the subset
        subset_df['tot_num_days'] = subset_df['tot_num_days'] - 1
        # round the floats to their nearest whole numbers
        subset_df['tot_num_days'] = subset_df['tot_num_days'].round()
        # convert those rounded floats to integers
        subset_df['tot_num_days'] = subset_df['tot_num_days'].astype(int)
        # then replace the values in the original df with the values in the subset
        for index, row in subset_df.iterrows():
            df.loc[index, 'tot_num_days'] = subset_df.loc[index, 'tot_num_days']

    #### if diff < 0 (meaning too few days have been distributed)
    elif diff < 0:
        # initialize a subset of the data by copying the given df
        subset_df = df.copy()
        # then recalculate 'tot_num_days' so that it will still be a float
        subset_df['tot_num_days'] = ((days_wo_min * subset_df['conf_cons']) + min_days)
        subset_df['tot_num_days'] = subset_df['tot_num_days'].astype(float)
        # import math for the next function
        import math
        # then initialize a new column in the subset for rounded up values
        subset_df['tot_num_days_rounded_up'] = pd.NA
        # populate that new column with the rounded up values
        for index, row in subset_df.iterrows():
            value = subset_df.loc[index, 'tot_num_days']
            subset_df.loc[index, 'tot_num_days_rounded_up'] = math.ceil(value)
        # then create a new column that shows the difference between the rounded values and the original values
        subset_df['rounded_diff'] = subset_df['tot_num_days_rounded_up'] - subset_df['tot_num_days']
        # then sort the subset by the rounded_diff ascending, and only keep the number of rows equal to diff
        subset_df = subset_df.sort_values(by='rounded_diff', ascending=True).head(abs(diff))
        # then add 1 to each of the tot_num_days values in the subset
        subset_df['tot_num_days'] = subset_df['tot_num_days'] + 1
        # round the floats to their nearest whole numbers
        subset_df['tot_num_days'] = subset_df['tot_num_days'].round()
        # convert those rounded floats to integers
        subset_df['tot_num_days'] = subset_df['tot_num_days'].astype(int)
        # then replace the values in the original df with the values in the subset
        for index, row in subset_df.iterrows():
            df.loc[index, 'tot_num_days'] = subset_df.loc[index, 'tot_num_days']

    # find and save the highest number in the read_weights column
    max_reads = df.read_weight.max()

    # start a loop that, for every number between 1 and whatever the max_reads number is, inclusive...
    for i in range(1,max_reads+1):
        # create a new column of NA values titled "Readthrough # _(whatever number the loop is on)_"
        df[f"Readthrough #{i}"] = pd.NA
        
    # convert all NA values to "0"
    df.fillna(0, inplace=True)

    def distribute_rt_days(conf_df):
        """
        This function takes a dataframe like the one I have crafted above and distributes the total number of days into the "Readthrough #_" columns.
        """
        # establish that the function needs to repeat for every row of the dataframe
        for index, row in conf_df.iterrows():
            # get total number of days for that talk
            total_days = row['tot_num_days']
            # initialize number of distributed days as 0
            dist_days = 0
            # establish that the function needs to proceed with the following operation until dist_days and total_days are equal
            while dist_days < total_days:
                # for every whole number between 1 and whatever the read_weight (or number of readthroughs) is...
                for i in range(1,row['read_weight']+1):
                    # if dist_days is still less than total_days...
                    if dist_days < total_days:
                        # add 1 to whatever value is in the "Readthrough #(number between 1 and number of readthroughs)" column and...
                        conf_df.at[index, f'Readthrough #{i}'] += 1
                        # add 1 to dist_days
                        dist_days += 1
                        # go back to add 1 to the next column until dist_days is no longer less than total_days
                    # if/when dist_days is equal to total_days
                    else:
                        # break the process of adding one to each column, and move on to the next row to start the process over
                        break
        # when everything is done, the output of this function is the same dataframe with all of the updated columns
        return conf_df

    # run the function on my dataframe
    new_df = distribute_rt_days(df)


    return new_df

In [11]:
conf_df_4 = distribute_days(conf_df_3,ini_total_days,min_per_rt)

In [149]:
conf_df_4.tot_num_days.sum()

177

In [12]:
conf_df_4

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines,conf_cons,tot_num_days,Readthrough #1,Readthrough #2,Readthrough #3
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,811.777,44,177,0.034622,6,3,3,0
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705.538,34,170,0.025687,5,3,2,0
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631.397,20,157,0.024376,5,3,2,0
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,828.994,22,181,0.03497,6,3,3,0
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0,619.986,36,150,0.02395,4,2,2,0
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0,576.509,15,122,0.022271,4,2,2,0
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2,810.976,33,161,0.033899,6,3,3,0
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,791.557,30,188,0.034914,6,3,3,0
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0,649.448,21,147,0.024107,5,3,2,0
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0,613.412,16,114,0.022277,4,2,2,0


In [101]:
subset_df = conf_df_4

# then recalculate 'tot_num_days' so that it will still be a float
subset_df['tot_num_days'] = (((ini_total_days.days - (min_per_rt * len(subset_df))) * subset_df['conf_cons']) + min_per_rt)
subset_df['tot_num_days'] = subset_df['tot_num_days'].astype(float)
# import math for the next function
import math
# then initialize a new column in the subset for rounded up values
subset_df['tot_num_days_rounded_up'] = pd.NA
# populate that new column with the rounded up values
for index, row in subset_df.iterrows():
    value = subset_df.loc[index, 'tot_num_days']
    subset_df.loc[index, 'tot_num_days_rounded_up'] = math.ceil(value)
# then create a new column that shows the difference between the rounded values and the original values
subset_df['rounded_diff'] = subset_df['tot_num_days_rounded_up'] - subset_df['tot_num_days']
# then sort the subset by the rounded_diff ascending, and only keep the number of rows equal to diff
subset_df = subset_df.sort_values(by='rounded_diff', ascending=True).head(8)
# then add 1 to each of the tot_num_days values in the subset
subset_df['tot_num_days'] = subset_df['tot_num_days'] + 1
# round the floats to their nearest whole numbers
subset_df['tot_num_days'] = subset_df['tot_num_days'].round()
# convert those rounded floats to integers
subset_df['tot_num_days'] = subset_df['tot_num_days'].astype(int)
# then replace the values in the original df with the values in the subset
for index, row in subset_df.iterrows():
    conf_df_4.loc[index, 'tot_num_days'] = subset_df.loc[index, 'tot_num_days']

subset_df

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines,conf_cons,tot_num_days,tot_num_days_rounded_up,rounded_diff
23,L. Todd Budge,Seek Him with All Your Heart,Bishop Budge speaks about the importance of qu...,https://www.churchofjesuschrist.org/study/gene...,ltb,ltb_oct24,Second Counselor in the Presiding Bishopric,Presiding Bishopric,2,0,723.556,31,177,0.027389,6,5,0.014565
19,David A. Bednar,In the Space of Not Many Years,Elder Bednar uses the examples of the Nephites...,https://www.churchofjesuschrist.org/study/gene...,dab,dab_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,976.375,43,207,0.03622,7,6,0.05205
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705.538,34,170,0.026899,6,5,0.068009
18,Jorge M. Alvarado,Embrace the Lord’s Gift of Repentance,Elder Alvarado teaches about repentance and te...,https://www.churchofjesuschrist.org/study/gene...,jma,jma_oct24,Of the Seventy,Seventy,2,0,694.593,39,170,0.026791,6,5,0.079756
22,Brook P. Hales,Mortality Works!,Elder Hales teaches that because of the gospel...,https://www.churchofjesuschrist.org/study/gene...,bph,bph_oct24,Of the Seventy,Seventy,2,0,659.2909,32,177,0.026757,6,5,0.083541
25,Bradley R. Wilcox,O Youth of the Noble Birthright,"Speaking to youth, Brother Wilcox addresses th...",https://www.churchofjesuschrist.org/study/gene...,brw,brw_oct24,First Counselor in the Young Men General Presi...,Young Men Presidency,2,0,752.885,23,148,0.026382,6,5,0.124379
24,Gary E. Stevenson,Days Never to Be Forgotten,Elder Stevenson looks ahead to the next 10 yea...,https://www.churchofjesuschrist.org/study/gene...,ges,ges_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,859.024,33,205,0.034975,7,6,0.187747
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631.397,20,157,0.025588,6,5,0.210922


In [102]:
conf_df_4

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines,conf_cons,tot_num_days,tot_num_days_rounded_up,rounded_diff
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,811.777,44,177,0.033258,5.625123,6,0.374877
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705.538,34,170,0.026899,6.0,5,0.068009
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631.397,20,157,0.025588,6.0,5,0.210922
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,828.994,22,181,0.033606,5.66309,6,0.33691
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0,619.986,36,150,0.025163,4.742725,5,0.257275
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0,576.509,15,122,0.023483,4.559641,5,0.440359
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2,810.976,33,161,0.032535,5.546308,6,0.453692
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,791.557,30,188,0.033551,5.657014,6,0.342986
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0,649.448,21,147,0.025319,4.75973,5,0.24027
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0,613.412,16,114,0.023489,4.560271,5,0.439729


In [85]:
conf_df_4.tot_num_days.sum()

177

### 30 Mar 2024 

I haven't had much time to work on this since starting my new job. It's been great, but dang, my interest in this project has gone way down since. 

But, I **do** want to try to make *some* progress. 

I left off distributing days to each talk.

In [10]:
# Call the function and unpack the imports
pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# call set_ini_inputs and save outputs
conf_link, save_path, month_year, ini_start_date, ini_end_date, ini_total_days, min_per_rt = set_ini_inputs()

# call get_landing_page_info and save as ini_conf_df
ini_conf_df = get_landing_page_info()

# run clean_ini_df on ini_conf_df and save as conf_df_1
conf_df_1 = clean_ini_df(ini_conf_df,month_year)

# run get_role_inputs
conf_df_2 = get_role_inputs(conf_df_1)

# run get vplt to get conf_df_3 and all_talks_dict
conf_df_3, all_talks_dict = get_vplt(conf_df_2)

conf_df_4 = distribute_days(conf_df_3,ini_total_days,min_per_rt)

Pandas imported as pd.
Requests imported.
BeautifulSoup imported.
OS imported.
Time imported.
Webdriver, Service, By, and Options imported from Selenium.
ChromeService, WebDriverWait, and ExpectedConditions imported from Selenium.
Glob imported.
ChromeDriverManager imported.
Regular expression imported.
Date and time imported.
Working directory set as [D:\Faith and Religion Stuff\Come, Follow Me\come-follow-me-breakdown-builder].
Conference Link: https://www.churchofjesuschrist.org/study/general-conference/2024/10?lang=eng
Conference Month-year: oct24
ini_start: 10/10/2024
ini_end: 4/4/2025
ini_start as datetime: 2024-10-10 00:00:00
ini_end as datetime: 2025-04-04 00:00:00
Total number of days: 177
Minimum days per readthrough: 2
Role of Neil L. Andersen: Of the Quorum of the Twelve Apostles
Role of Emily Belle Freeman: Young Women General President
Role of Karl D. Hirst: Of the Seventy
Role of Dale G. Renlund: Of the Quorum of the Twelve Apostles
Role of David P. Homer: Of the Seventy

In [13]:
conf_df_4

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines,conf_cons,tot_num_days,Readthrough #1,Readthrough #2,Readthrough #3
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,811.777,44,177,0.034622,6,3,3,0
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705.538,34,170,0.025687,5,3,2,0
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631.397,20,157,0.024376,5,3,2,0
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,828.994,22,181,0.03497,6,3,3,0
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0,619.986,36,150,0.02395,4,2,2,0
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0,576.509,15,122,0.022271,4,2,2,0
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2,810.976,33,161,0.033899,6,3,3,0
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,791.557,30,188,0.034914,6,3,3,0
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0,649.448,21,147,0.024107,5,3,2,0
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0,613.412,16,114,0.022277,4,2,2,0


After checking that everything else is working properly, I can now move on to modularizing the 'distribute lines' aspect of the code. Before doing that, though, just because I don't want to try and combine two things into one, I'll first make a `prep_for_line_dist()` function, which will create a series of columns that will be needed when I create the `distribute_lines()` function. 

In [14]:
def prep_for_line_dist(df):
    """
    This function creates a series of new columns based on the number of times a conference talk is supposed to be read and the number of days that are supposed to be spent on each readthrough of that talk. These empty columns will eventually be populated with information about what line to start and stop reading at for each day of each readthrough.
    """

    #### Copy the dataframe and use the copy
    df = df.copy()

    # Use a series of loops to create a line start, number of lines, line end, paragraph start, and paragraph end column for every day of every readthrough of every talk
    for i in range(1, df['read_weight'].max()+1):
        for x in range(1, df[f'Readthrough #{i}'].max()+1):
            for index, row in df.iterrows():
                if row[f'Readthrough #{i}'] != 0:
                    df.at[index, f'r{i}d{x}_l_start'] = int(0)
                    df.at[index, f'r{i}d{x}_lines'] = int(0)
                    df.at[index, f'r{i}d{x}_l_end'] = int(0)
                    df.at[index, f'r{i}d{x}_p_start'] = int(0)
                    df.at[index, f'r{i}d{x}_p_end'] = int(0)
                else:
                    df.at[index, f'r{i}d{x}_l_start'] = pd.NA
                    df.at[index, f'r{i}d{x}_lines'] = pd.NA
                    df.at[index, f'r{i}d{x}_l_end'] = pd.NA
                    df.at[index, f'r{i}d{x}_p_start'] = pd.NA
                    df.at[index, f'r{i}d{x}_p_end'] = pd.NA
    
    # establishes a regular expression pattern
    pattern = r'\D\d\D\d+'
    # identifies columns that are floats and not integers
    float_cols = df.select_dtypes(include=['float'])
    # targets specific float columns using regular expression pattern
    change_cols = [col for col in float_cols if re.search(pattern, col)]
    # converts targeted columns to integers
    df[change_cols] = df[change_cols].astype(int)

    return df

In [15]:
conf_df_5 = prep_for_line_dist(conf_df_4)

In [18]:
conf_df_5.columns

Index(['Author', 'Title', 'Description', 'Link', 'initials', 'pk', 'role',
       'short_role', 'read_weight', 'day_weight', 'time', 'paragraphs',
       'lines', 'conf_cons', 'tot_num_days', 'Readthrough #1',
       'Readthrough #2', 'Readthrough #3', 'r1d1_l_start', 'r1d1_lines',
       'r1d1_l_end', 'r1d1_p_start', 'r1d1_p_end', 'r1d2_l_start',
       'r1d2_lines', 'r1d2_l_end', 'r1d2_p_start', 'r1d2_p_end',
       'r1d3_l_start', 'r1d3_lines', 'r1d3_l_end', 'r1d3_p_start',
       'r1d3_p_end', 'r2d1_l_start', 'r2d1_lines', 'r2d1_l_end',
       'r2d1_p_start', 'r2d1_p_end', 'r2d2_l_start', 'r2d2_lines',
       'r2d2_l_end', 'r2d2_p_start', 'r2d2_p_end', 'r2d3_l_start',
       'r2d3_lines', 'r2d3_l_end', 'r2d3_p_start', 'r2d3_p_end',
       'r3d1_l_start', 'r3d1_lines', 'r3d1_l_end', 'r3d1_p_start',
       'r3d1_p_end', 'r3d2_l_start', 'r3d2_lines', 'r3d2_l_end',
       'r3d2_p_start', 'r3d2_p_end'],
      dtype='object')

Whew. That worked. I just want to include my comments regarding this particular aspect of my code, so I'll always remember, and hopefully so anyone else who reads this will, too. 

"# I struggled for about 5 hours because I forgot this step. Including it was as simple as a copy paste."

"# This also came in answer to my prayer for help. I prayed for God to help me know where to look, and this was the next thing I looked at."

"# God is good."

Alright, now for the `distribute_lines()` function. 

In [19]:
def distribute_lines(df):
    """
    This function takes a dataframe like the one I have crafted above and distributes the total number of days into the "Readthrough #_" columns.
    """
    #### copy the given dataframe and work from the copy
    df = df.copy()
    
    # initialize readthrough count as zero
    readthrough = 0
    # establish maximum number of readthroughs so the loop I create below knows when to end or stop
    max_readthroughs = df['read_weight'].max()
    # start a loop of action that will continue until readthroughs is bigger than max_readthroughs, at which point it will stop
    while readthrough <= max_readthroughs:
        # A - first action: add 1 to readthroughs, establishing which readthrough we are working with
        readthrough += 1
        # B - Check if the current readthrough exists in the DataFrame
        if f'Readthrough #{readthrough}' not in df.columns:
            break  # If the column doesn't exist, exit the loop

        # C - second action: for every row in the dataframe do the following:
        for index, row in df.iterrows():
            # D - second action, cont'd: save the info in the row 'lines' as the total number of lines to be distributed
            total_lines = row['lines']
            # E - second action, cont'd: initialize count of distributed lines as zero
            dist_lines = 0
            # debugging print line - remove later
            print(f"Row {index} - Readthrough {readthrough} has {row[f'Readthrough #{readthrough}']} days.")
            # F - check whether the number of days assigned to a talk in a particular readthrough is 0
            if row[f'Readthrough #{readthrough}'] == 0:
                # debugging print line - remove later
                print(f"Skipping row {index} - Readthrough {readthrough} because it has 0 days.")
                # if that talk has zero days alloted for that readthrough, skip to the next talk
                continue
            
            # G - second action, cont'd: open another while loop that will continue until the number of distributed lines is equal to the number of total lines
            while dist_lines < total_lines:
                # H - first action of second loop: open another loop that operates i number of times, where i is the number of days in the readthrough
                for i in range(1, row[f'Readthrough #{readthrough}']+1):
                    # debugging print line - remove later
                    print(f"Processing row {index}, readthrough {readthrough}, day {i}.")
                    # I - first action of second loop, cont'd: add 1 to the corresponding i day of the current readthrough
                    df.at[index, f'r{readthrough}d{i}_lines'] += 1
                    # J - first action of second loop, cont'd: add 1 to the number of distributed lines
                    dist_lines += 1
                    # debugging print line - remove later
                    print(f"Distributed lines: {dist_lines}/{total_lines}")
                    # K - check if dist_lines is less than total_lines: 
                    if dist_lines >= total_lines:
                        break
                        # Unstated action: If it is less, return to point H and repeat this loop.
                        # If not, end *this* loop and return to point D for the next row.
                
        # L - check if readthroughs is less than or equal to max_readthroughs
        if readthrough > max_readthroughs:
            # If it is, repeat this loop, starting with point A
            break
            # Unstated action: If not, end this loop and go to point M.
    
    # M - return the newly modified dataframe
    return df

I copy/pasted the code above directly from the original function above, and at first it didn't work. On inspection, I found that this *might* be related to the fact that the **paragraphs** and **lines** columns of all the dataframes from conf_df_3 onward are objects, not integers. I need to go back to one of the functions to fix this, probably to `get_vplt()` since that's where the lines and paragraphs are initially collected. 

It should be noted that because of the way the `distribute_days()` function is designed, the only columns that populate are the **r{x}d{y}** columns. 

In [33]:
conf_df_5.lines = conf_df_5.lines.astype(int)
conf_df_5.paragraphs = conf_df_5.paragraphs.astype(int)

conf_df_5.dtypes

Author            object
Title             object
Description       object
Link              object
initials          object
pk                object
role              object
short_role        object
read_weight        int64
day_weight         int64
time              object
paragraphs         int32
lines              int32
conf_cons         object
tot_num_days       int32
Readthrough #1     int64
Readthrough #2     int64
Readthrough #3     int64
r1d1_l_start       int32
r1d1_lines         int32
r1d1_l_end         int32
r1d1_p_start       int32
r1d1_p_end         int32
r1d2_l_start       int32
r1d2_lines         int32
r1d2_l_end         int32
r1d2_p_start       int32
r1d2_p_end         int32
r1d3_l_start       int32
r1d3_lines         int32
r1d3_l_end         int32
r1d3_p_start       int32
r1d3_p_end         int32
r2d1_l_start       int32
r2d1_lines         int32
r2d1_l_end         int32
r2d1_p_start       int32
r2d1_p_end         int32
r2d2_l_start       int32
r2d2_lines         int32


In [35]:
conf_df_6 = distribute_lines(conf_df_5)

pd.set_option('display.max_columns', None)

conf_df_6

Row 0 - Readthrough 1 has 3 days.
Processing row 0, readthrough 1, day 1.
Distributed lines: 1/177
Processing row 0, readthrough 1, day 2.
Distributed lines: 2/177
Processing row 0, readthrough 1, day 3.
Distributed lines: 3/177
Processing row 0, readthrough 1, day 1.
Distributed lines: 4/177
Processing row 0, readthrough 1, day 2.
Distributed lines: 5/177
Processing row 0, readthrough 1, day 3.
Distributed lines: 6/177
Processing row 0, readthrough 1, day 1.
Distributed lines: 7/177
Processing row 0, readthrough 1, day 2.
Distributed lines: 8/177
Processing row 0, readthrough 1, day 3.
Distributed lines: 9/177
Processing row 0, readthrough 1, day 1.
Distributed lines: 10/177
Processing row 0, readthrough 1, day 2.
Distributed lines: 11/177
Processing row 0, readthrough 1, day 3.
Distributed lines: 12/177
Processing row 0, readthrough 1, day 1.
Distributed lines: 13/177
Processing row 0, readthrough 1, day 2.
Distributed lines: 14/177
Processing row 0, readthrough 1, day 3.
Distributed

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines,conf_cons,tot_num_days,Readthrough #1,Readthrough #2,Readthrough #3,r1d1_l_start,r1d1_lines,r1d1_l_end,r1d1_p_start,r1d1_p_end,r1d2_l_start,r1d2_lines,r1d2_l_end,r1d2_p_start,r1d2_p_end,r1d3_l_start,r1d3_lines,r1d3_l_end,r1d3_p_start,r1d3_p_end,r2d1_l_start,r2d1_lines,r2d1_l_end,r2d1_p_start,r2d1_p_end,r2d2_l_start,r2d2_lines,r2d2_l_end,r2d2_p_start,r2d2_p_end,r2d3_l_start,r2d3_lines,r2d3_l_end,r2d3_p_start,r2d3_p_end,r3d1_l_start,r3d1_lines,r3d1_l_end,r3d1_p_start,r3d1_p_end,r3d2_l_start,r3d2_lines,r3d2_l_end,r3d2_p_start,r3d2_p_end
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,811.777,44,177,0.034622,6,3,3,0,0,59,0,0,0,0,59,0,0,0,0,59,0,0,0,0,59,0,0,0,0,59,0,0,0,0,59,0,0,0,,,,,,,,,,
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705.538,34,170,0.025687,5,3,2,0,0,57,0,0,0,0,57,0,0,0,0,56,0,0,0,0,85,0,0,0,0,85,0,0,0,0,0,0,0,0,,,,,,,,,,
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631.397,20,157,0.024376,5,3,2,0,0,53,0,0,0,0,52,0,0,0,0,52,0,0,0,0,79,0,0,0,0,78,0,0,0,0,0,0,0,0,,,,,,,,,,
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,828.994,22,181,0.03497,6,3,3,0,0,61,0,0,0,0,60,0,0,0,0,60,0,0,0,0,61,0,0,0,0,60,0,0,0,0,60,0,0,0,,,,,,,,,,
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0,619.986,36,150,0.02395,4,2,2,0,0,75,0,0,0,0,75,0,0,0,0,0,0,0,0,0,75,0,0,0,0,75,0,0,0,0,0,0,0,0,,,,,,,,,,
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0,576.509,15,122,0.022271,4,2,2,0,0,61,0,0,0,0,61,0,0,0,0,0,0,0,0,0,61,0,0,0,0,61,0,0,0,0,0,0,0,0,,,,,,,,,,
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2,810.976,33,161,0.033899,6,3,3,0,0,54,0,0,0,0,54,0,0,0,0,53,0,0,0,0,54,0,0,0,0,54,0,0,0,0,53,0,0,0,,,,,,,,,,
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,791.557,30,188,0.034914,6,3,3,0,0,63,0,0,0,0,63,0,0,0,0,62,0,0,0,0,63,0,0,0,0,63,0,0,0,0,62,0,0,0,,,,,,,,,,
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0,649.448,21,147,0.024107,5,3,2,0,0,49,0,0,0,0,49,0,0,0,0,49,0,0,0,0,74,0,0,0,0,73,0,0,0,0,0,0,0,0,,,,,,,,,,
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0,613.412,16,114,0.022277,4,2,2,0,0,57,0,0,0,0,57,0,0,0,0,0,0,0,0,0,57,0,0,0,0,57,0,0,0,0,0,0,0,0,,,,,,,,,,


Moving on to the last major step, I know define `distribute_paragraphs()` to set the starting and ending lines and paragraphs for each day of each readthrough of each talk. 

In [44]:
def distribute_paragraphs(df,dict):
    """
    This function assigns starting and ending lines and paragraphs for each day of each readthrough of each talk contained in the conference dataframe.
    """    
    #### copy the dataframe and work on that so as to preserve the dataframe being passed through this function
    df = df.copy()

    # initialize readthrough count as 0
    rt = 0
    # establish maximum number of readthroughs so the loop I create below knows when to end or stop
    max_rts = df['read_weight'].max()
    # create a list of the column names in the given dataframe
    cols = list(df.columns)
    # start a loop of action that will continue until readthroughs is bigger than max_readthroughs, at which point it will stop
    while rt <= max_rts:
        # A - first action: add 1 to readthroughs, establishing which readthrough we are working with
        rt += 1
        # debugging line - remove later
        print(f'Starting readthrough {rt}.')
        # B - Check if the current readthrough exists in the DataFrame
        if f'Readthrough #{rt}' not in df.columns:
            # debugging line - remove later
            print(f'Readthrough {rt} not found. Function complete.')
            # if a column for the current readthrough number doesn't exist, exit the loop
            break

        # C-1 - Second action: start another loop that does the follow for the highest number of days in whatever readthrough number the function is on
        for day in range(1, df[f'Readthrough #{rt}'].max()+1):
            
            # debugging line - remove later
            print(f'Working through readthrough {rt} day {day}.')

            # C-2 - set patterns for pulling out info for the start, lines, end, and start of next day
            rt_start_l_pattern = fr'r{rt}d{day}_l_start'
            rt_lines_pattern = fr'r{rt}d{day}_lines'
            rt_end_l_pattern = fr'r{rt}d{day}_l_end'
            rt_next_start_l_pattern = fr'r{rt}d{day+1}_l_start'
            rt_start_p_pattern = fr'r{rt}d{day}_p_start'
            rt_end_p_pattern = fr'r{rt}d{day}_p_end'
            rt_next_start_p_pattern = fr'r{rt}d{day+1}_p_start'

            # debugging line - remove later
            print(f'Readthrough {rt} day {day} patterns saved.')

            # C-3 - initialize variables as empty lists
            start_l_col = []
            lines_col = []
            end_l_col = []
            next_start_l_col = []
            start_p_col = []
            end_p_col = []
            next_start_p_col = []

            # debugging line - remove later
            print(f'Variables initialized.')

            # C-4 - search for and save columns defined in patterns
            for col in cols:
                if re.search(rt_start_l_pattern, col):
                    start_l_col.append(col)
                elif re.search(rt_lines_pattern, col):
                    lines_col.append(col)
                elif re.search(rt_end_l_pattern, col):
                    end_l_col.append(col)
                elif re.search(rt_next_start_l_pattern, col):
                    next_start_l_col.append(col)
                elif re.search(rt_start_p_pattern, col):
                    start_p_col.append(col)
                elif re.search(rt_end_p_pattern, col):
                    end_p_col.append(col)
                elif re.search(rt_next_start_p_pattern, col):
                    next_start_p_col.append(col)

            # debugging line - remove later
            print(f'Columns found and saved.'
                f'Starting column name: {start_l_col}'
                f'Number of lines column: {lines_col}'
                f'End column name: {end_l_col}'
                f'Next start column name: {next_start_l_col}'
                f'Starting column name: {start_p_col}'
                f'Number of lines column: {lines_col}'
                f'End column name: {end_p_col}'
                f'Next start column name: {next_start_p_col}')
            
            # C-5-a -Start another loop
            for index, row in df.iterrows():
                # C-5-b - check if the number of days assigned for the current readthrough of the current talk is 0
                if row[f'Readthrough #{rt}'] == 0:
                    # debugging print line - remove later
                    print(f"Skipping row {index} - Readthrough {rt} because it has 0 days.")
                    # if that talk has zero days alloted for that readthrough, skip to the next talk
                    continue
                
                # unstated action - if the number of days assigned for the current readthrough is greater than zero, proceed to C-5-c

                # C-5-c - establish the connection between conf_df and talks_dict using the primary key column of the conf_df
                talk = dict[f"{df.loc[index,'pk']}"]

                # debugging line - remove later
                print(f'Connecting conf_df to talks_dict using primary key {df.loc[index,"pk"]}.')

                # C-5-d - check if the day number is 1
                if day == 1:
                    # C-5-d-1 - if so, initialize the start column as 1 for every row
                    df[start_l_col] = 1
                    df[start_p_col] = 1
                    # debugging line - remove later
                    print(f'Readthrough {rt} day {day} set at 1.')
                    # C-5-d-1 - then save the end point as the start point (1) plus the number of lines to be read 
                    # the minus 1 at the end ensures that we end at the assigned reading line, not the line after
                    for start, read, end in zip(start_l_col, lines_col, end_l_col):
                        df.loc[index, end] = df.loc[index, start] + df.loc[index, read] - 1
                    # C-5-d-2 - then save today's ending point as the starting point for the next day, to be accessed later
                    for end, next_start in zip(end_l_col, next_start_l_col):
                        df.loc[index, next_start] = df.loc[index, end]
                    # C-5-d-3 - save the end line just calculated as object 'end_line'
                    end_line = df.loc[index, f'r{rt}d{day}_l_end']

                    # debugging line - remove later
                    print(f'Readthrough {rt} day {day} end line set. End line: {end_line}')

                    # C-5-d-4 - find in the talk the paragraph with a running total that is greater than or equal to the end line 
                    end_paragraph_1 = talk.loc[talk['running_lines'] >= end_line].index[0]
                    # C-5-d-5 - find in the talk the paragraph before the one above
                    end_paragraph_2 = talk.loc[talk['running_lines'] >= end_line].index[-1]
                    # C-5-d-6 - determine which paragraph would yield a number of lines being read closer to the target end line and save as end_paragraph
                    if abs(talk.loc[end_paragraph_1,'running_lines'] - end_line) < abs(talk.loc[end_paragraph_2,'running_lines']- end_line):
                        end_paragraph = end_paragraph_1
                    else:
                        end_paragraph = end_paragraph_2
                    # C-5-d-7 - save the selected end_paragraph as the paragraph ending point of the current day of the current readthrough
                    for end_p in end_p_col:
                        df.loc[index, end_p] = end_paragraph
                    
                    for end_p, next_start_p in zip(end_p_col, next_start_p_col):
                        df.loc[index, next_start_p] = df.loc[index, end_p] + 1

                    # debugging line - remove later
                    print(f'Readthrough {rt} day {day} end paragraph set: End paragraph: {end_paragraph}')
                    

                    
                
                # C-5-e - if the day number IS NOT 1...
                else:
                    # C-5-e-1 - access whatever the start column has already been saved as, add the number of lines, and save that as the end point
                    for start, read, end in zip(start_l_col, lines_col, end_l_col):
                        df.loc[index, end] = df.loc[index, start] + df.loc[index, read]
                    # C-5-e-2 - then save today's ending point as the starting point for the next day, to be accessed later
                    for end, next_start in zip(end_l_col, next_start_l_col):
                        df.loc[index, next_start] = df.loc[index, end]
                    # C-5-e-3 - save the end line just calculated as object 'end_line'
                    end_line = df.loc[index, f'r{rt}d{day}_l_end']

                    # debugging line - remove later
                    print(f'Readthrough {rt} day {day} end line set. End line: {end_line}')

                    # C-5-e-4 - find in the talk the paragraph with a running total that is greater than or equal to the end line 
                    end_paragraph_1 = talk.loc[talk['running_lines'] >= end_line].index[0]
                    # C-5-e-5 - find in the talk the paragraph before the one above
                    end_paragraph_2 = talk.loc[talk['running_lines'] >= end_line].index[-1]
                    # C-5-e-6 - determine which paragraph would yield a number of lines being read closer to the target end line and save as end_paragraph
                    if abs(talk.loc[end_paragraph_1,'running_lines'] - end_line) < abs(talk.loc[end_paragraph_2,'running_lines']- end_line):
                        end_paragraph = end_paragraph_1
                    else:
                        end_paragraph = end_paragraph_2
                    # C-5-e-7 - save the selected end_paragraph as the paragraph ending point of the current day of the current readthrough
                    for end_p in end_p_col:
                        df.loc[index, end_p] = end_paragraph

                    for end_p, next_start_p in zip(end_p_col, next_start_p_col):
                        df.loc[index, next_start_p] = df.loc[index, end_p] + 1

                    # debugging line - remove later
                    print(f'Readthrough {rt} day {day} end paragraph set: End paragraph: {end_paragraph}')

            # debugging print line - remove later
            print(f'Readthrough {rt} columns day {day} saved. Looping back.')
        
        # debugging print line - remove later
        print(f'Readthrough {rt} loop finished.')
    
    return df

While running `distribute_paragraphs()`, I found a few errors, most of which seem pretty small.

First of these was an issue just with naming conventions - rather than saving items in `all_talks_dict` as, say, 'nla_oct24_lines' as I had originally, I saved that talk as 'nla_oct24'. That was easy enough to address and fix in this iteration. 

Next, in that same dictionary, when I originally created it, I created a column that had a running number of lines for each paragraph, meaning the total number of lines in each talk up to the end of that paragraph. Apparently, when modularizing that aspect of this iteration of the calculator, I *did not* create in the dictionary that running tally. So, I had to go back and fix that as well. 

In [41]:
conf_df_7 = distribute_paragraphs(conf_df_6,all_talks_dict)

conf_df_7

Starting readthrough 1.
Working through readthrough 1 day 1.
Readthrough 1 day 1 patterns saved.
Variables initialized.
Columns found and saved.Starting column name: ['r1d1_l_start']Number of lines column: ['r1d1_lines']End column name: ['r1d1_l_end']Next start column name: ['r1d2_l_start']Starting column name: ['r1d1_p_start']Number of lines column: ['r1d1_lines']End column name: ['r1d1_p_end']Next start column name: ['r1d2_p_start']
Connecting conf_df to talks_dict using primary key nla_oct24.
Readthrough 1 day 1 set at 1.
Readthrough 1 day 1 end line set. End line: 59


KeyError: 'running_lines'

In [39]:
all_talks_dict

{'nla_oct24':    foreign_key            pk paragraph  \
 1    nla_oct24   nla_oct24_1         1   
 2    nla_oct24   nla_oct24_2         2   
 3    nla_oct24   nla_oct24_3         3   
 4    nla_oct24   nla_oct24_4         4   
 5    nla_oct24   nla_oct24_5         5   
 6    nla_oct24   nla_oct24_6         6   
 7    nla_oct24   nla_oct24_7         7   
 8    nla_oct24   nla_oct24_8         8   
 9    nla_oct24   nla_oct24_9         9   
 10   nla_oct24  nla_oct24_10        10   
 11   nla_oct24  nla_oct24_11        11   
 12   nla_oct24  nla_oct24_12        12   
 13   nla_oct24  nla_oct24_13        13   
 14   nla_oct24  nla_oct24_14        14   
 15   nla_oct24  nla_oct24_15        15   
 16   nla_oct24  nla_oct24_16        16   
 17   nla_oct24  nla_oct24_17        17   
 18   nla_oct24  nla_oct24_18        18   
 19   nla_oct24  nla_oct24_19        19   
 20   nla_oct24  nla_oct24_20        20   
 21   nla_oct24  nla_oct24_21        21   
 22   nla_oct24  nla_oct24_22        22  

In [52]:
# Call the function and unpack the imports
pd, requests, BeautifulSoup, os, time, webdriver, Service, By, Options, ChromeService, WebDriverWait, EC, glob, ChromeDriverManager, re, datetime, timedelta = system_setup_for_breakdown()

# call set_ini_inputs and save outputs
conf_link, save_path, month_year, ini_start_date, ini_end_date, ini_total_days, min_per_rt = set_ini_inputs()

# call get_landing_page_info and save as ini_conf_df
ini_conf_df = get_landing_page_info()

# run clean_ini_df on ini_conf_df and save as conf_df_1
conf_df_1 = clean_ini_df(ini_conf_df,month_year)

# run get_role_inputs
conf_df_2 = get_role_inputs(conf_df_1)

# run get_vplt to get conf_df_3 and all_talks_dict
conf_df_3, all_talks_dict = get_vplt(conf_df_2)

# run distribute_days
conf_df_4 = distribute_days(conf_df_3,ini_total_days,min_per_rt)

# prep the df for line distribution
conf_df_5 = prep_for_line_dist(conf_df_4)

# distribute lines
conf_df_6 = distribute_lines(conf_df_5)

# get the final reading schedule
conf_df_7 = distribute_paragraphs(conf_df_6,all_talks_dict)

Pandas imported as pd.
Requests imported.
BeautifulSoup imported.
OS imported.
Time imported.
Webdriver, Service, By, and Options imported from Selenium.
ChromeService, WebDriverWait, and ExpectedConditions imported from Selenium.
Glob imported.
ChromeDriverManager imported.
Regular expression imported.
Date and time imported.
Working directory set as [D:\Faith and Religion Stuff\Come, Follow Me\come-follow-me-breakdown-builder].
Conference Link: https://www.churchofjesuschrist.org/study/general-conference/2024/10?lang=eng
Conference Month-year: oct24
ini_start: 10/10/2024
ini_end: 4/4/2025
ini_start as datetime: 2024-10-10 00:00:00
ini_end as datetime: 2025-04-04 00:00:00
Total number of days: 177
Minimum days per readthrough: 2
Role of Neil L. Andersen: Of the Quorum of the Twelve Apostles
Role of Emily Belle Freeman: Young Women General President
Role of Karl D. Hirst: Of the Seventy
Role of Dale G. Renlund: Of the Quorum of the Twelve Apostles
Role of David P. Homer: Of the Seventy

Great! Running the cell above and below shows that I have something awesome to be proud of. I got this all ready up to the point where, if I get nothing else done in regards to patching this calculator up, it still works, despite having modularized it. In addition to that, though, I also can be proud because I cut the run time for the whole thing basically in half, down from almost 10 minutes to just over 5. That in and of itself is a huge accomplishment. 

In [54]:
conf_df_7

Unnamed: 0,Author,Title,Description,Link,initials,pk,role,short_role,read_weight,day_weight,time,paragraphs,lines,conf_cons,tot_num_days,Readthrough #1,Readthrough #2,Readthrough #3,r1d1_l_start,r1d1_lines,r1d1_l_end,r1d1_p_start,r1d1_p_end,r1d2_l_start,r1d2_lines,r1d2_l_end,r1d2_p_start,r1d2_p_end,r1d3_l_start,r1d3_lines,r1d3_l_end,r1d3_p_start,r1d3_p_end,r2d1_l_start,r2d1_lines,r2d1_l_end,r2d1_p_start,r2d1_p_end,r2d2_l_start,r2d2_lines,r2d2_l_end,r2d2_p_start,r2d2_p_end,r2d3_l_start,r2d3_lines,r2d3_l_end,r2d3_p_start,r2d3_p_end,r3d1_l_start,r3d1_lines,r3d1_l_end,r3d1_p_start,r3d1_p_end,r3d2_l_start,r3d2_lines,r3d2_l_end,r3d2_p_start,r3d2_p_end
0,Neil L. Andersen,The Triumph of Hope,Elder Andersen teaches that when we have hope ...,https://www.churchofjesuschrist.org/study/gene...,nla,nla_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,811,44,177,0.034619,6,3,3,0,1,59,59,1,13,59,59,118,14,33,118,59,177,34,44,1,59,59,1,13,59,59,118,14,33,118,59,177,34,44,1,,,1,,,,,,
1,Emily Belle Freeman,Live Up to Your Privileges,President Freeman encourages women and young w...,https://www.churchofjesuschrist.org/study/gene...,ebf,ebf_oct24,Young Women General President,Young Women Presidency,2,0,705,34,170,0.025686,5,3,2,0,1,57,57,1,8,57,57,114,9,17,114,56,170,18,34,1,85,85,1,14,85,85,170,15,34,170,0,170,35,34,1,,,1,,,,,,
2,Karl D. Hirst,God’s Favourite,Elder Hirst teaches us how to feel the divine ...,https://www.churchofjesuschrist.org/study/gene...,kdh,kdh_oct24,Of the Seventy,Seventy,2,0,631,20,157,0.024376,5,3,2,0,1,53,53,1,9,53,52,105,10,14,105,52,157,15,20,1,79,79,1,11,79,78,157,12,20,157,0,157,21,20,1,,,1,,,,,,
3,Dale G. Renlund,“This Is My Gospel”—“This Is My Church”,Elder Renlund teaches that the combination of ...,https://www.churchofjesuschrist.org/study/gene...,dgr,dgr_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,828,22,181,0.034966,6,3,3,0,1,61,61,1,7,61,60,121,8,14,121,60,181,15,22,1,61,61,1,7,61,60,121,8,14,121,60,181,15,22,1,,,1,,,,,,
4,David P. Homer,Trusting Our Father,Elder Homer teaches that we receive blessings ...,https://www.churchofjesuschrist.org/study/gene...,dph,dph_oct24,Of the Seventy,Seventy,2,0,619,36,150,0.023945,4,2,2,0,1,75,75,1,16,75,75,150,17,36,150,0,150,37,36,1,75,75,1,16,75,75,150,17,36,150,0,150,37,36,1,,,1,,,,,,
5,Gregorio E. Casillas,God Loves All His Children,Elder Casillas teaches that we can bless the l...,https://www.churchofjesuschrist.org/study/gene...,gec,gec_oct24,Of the Seventy,Seventy,2,0,576,15,122,0.02227,4,2,2,0,1,61,61,1,7,61,61,122,8,15,122,0,122,16,15,1,61,61,1,7,61,61,122,8,15,122,0,122,16,15,1,,,1,,,,,,
6,Dallin H. Oaks,Following Christ,President Oaks teaches the importance of follo...,https://www.churchofjesuschrist.org/study/gene...,dho,dho_oct24,First Counselor in the First Presidency,First Presidency,2,2,810,33,161,0.033894,6,3,3,0,1,54,54,1,11,54,54,108,12,20,108,53,161,21,33,1,54,54,1,11,54,54,108,12,20,108,53,161,21,33,1,,,1,,,,,,
7,D. Todd Christofferson,Burying Our Weapons of Rebellion,Elder Christofferson encourages us to bury any...,https://www.churchofjesuschrist.org/study/gene...,dtc,dtc_oct24,Of the Quorum of the Twelve Apostles,Quorum of the Twelve,2,2,791,30,188,0.034914,6,3,3,0,1,63,63,1,10,63,63,126,11,24,126,62,188,25,30,1,63,63,1,10,63,63,126,11,24,126,62,188,25,30,1,,,1,,,,,,
8,José A. Teixeira,Bonded to Jesus Christ: Becoming the Salt of t...,Elder Teixeira teaches four simple but profoun...,https://www.churchofjesuschrist.org/study/gene...,jat,jat_oct24,Of the Presidency of the Seventy,Seventy,2,0,649,21,147,0.024106,5,3,2,0,1,49,49,1,7,49,49,98,8,14,98,49,147,15,21,1,74,74,1,10,74,73,147,11,21,147,0,147,22,21,1,,,1,,,,,,
9,Juan Pablo Villar,His Hand Ready to Help Us,Elder Villar teaches that the Savior is always...,https://www.churchofjesuschrist.org/study/gene...,jpv,jpv_oct24,Of the Seventy,Seventy,2,0,613,16,114,0.022277,4,2,2,0,1,57,57,1,9,57,57,114,10,16,114,0,114,17,16,1,57,57,1,9,57,57,114,10,16,114,0,114,17,16,1,,,1,,,,,,


In the cell below, I define the `export_df()` function, which, you guessed it, prepares the created reading plan for export and then exports it.

In [56]:
def export_df(df):
    """
    
    """

    # copy the dataframe and work with the copy
    df = df.copy()

    # establishes patterns for getting a list of start and end columns
    start_cols_pat = r'r\d+d\d+_p_start'
    end_cols_pat = r'r\d+d\d+_p_end'

    # create a list of columns names in conf_df_7 
    cols = list(df.columns)

    # initialize empty lists for start and end columns
    final_start_cols = []
    final_end_cols = []

    # look at all of the column names in the column names list
    for col in cols:
        # if the column name matches the pattern for start columns, add it to the list of start columns
        if re.search(start_cols_pat, col):
            final_start_cols.append(col)
        # if the column name matches the pattern for end columns, add it to the list of end columns
        elif re.search(end_cols_pat, col):
            final_end_cols.append(col)

    # initialize an empty list for zipping the other two together
        ## zipping two lists together basically entails combining them in the order of list 1 item 1, list 2 item 1, list 1 item 2, list 2 item 2, list 1 item 3, list 2 item 3, etc. 
    zipped_cols = []

    # create a loop that 'zips' the start columns list and end columns list together
    for start_col, end_col in zip(final_start_cols, final_end_cols):
        zipped_cols.append(start_col)
        zipped_cols.append(end_col)
    
    # create a list of necessary information columns
    info_cols = ['Author', 'role', 'Title']

    # combine the info_cols list and the zipped_cols list
    final_cols = info_cols + zipped_cols

    # keep only the columns in the final_cols list, save as final_breakdown
    final_breakdown = df[final_cols]

    # Get user input about saving breakdown to computer
    response_1 = input('Do you want to save the breakdown to your computer? ')
    
    # if the user wants to export the breakdown
    if response_1.lower() == 'yes':
        # ask the user where they want to store it
        input_path = input(f'Please paste the location of the folder you would like to save the breakdown in: \n'
                           f'\nFor example: D:\Faith and Religion Stuff\Come, Follow Me')
        # add \\ to that path to make it compatible
        path_start = f'{input_path}\\'
        # ask the user if they want to use a custom name
        response_2 = input('Would you like to save the file with a custom name?')
        # if they do, ask the user for the custom name
        if response_2.lower() == 'yes':
            custom_name = input('Please enter the name you would like to save the file as: ')
            name = f'{custom_name}.csv'
        # otherwise generate a generic file name
        else:
            name = f'{month_year}_breakdown.csv'
        
        # combine the destination folder with the name of the file
        final_path = os.path.join(path_start, name)
        
        # export final_breakdown as a csv to the destination folder
        final_breakdown.to_csv(final_path, index=False)


In [57]:
export_df(conf_df_7)

Running the cell above and checking my targeted folder, I see that the breakdown did in fact show up, meaning that I really do have everything about it back up and running. Awesome. 