# Part 1: Web Scraping with Selenium and Beautiful Soup

In [1]:
pip install selenium

Note: you may need to restart the kernel to use updated packages.


Initializing Selenium WebDriver for Chrome

This code first imports the webdriver module from the selenium package. Then, it creates an instance of the Chrome WebDriver. The webdriver.Chrome() command launches a new Chrome browser session controlled by Selenium. The driver object created here will serve as our primary interface to interact with the web browser. With driver, we can navigate to URLs, interact with web elements, and extract data as needed.

In [2]:
from selenium import webdriver

driver = webdriver.Chrome()


The following Python code demonstrates how to use Selenium WebDriver for Chrome to navigate to the CFA Institute's website and extract URLs of refresher readings. Selinum is sed to extract the links in each page and the links are added all_links array. Selinum is used to extract the links because the webpage is dynamic and article links are store in javascript. 

The code performs the following tasks:

1) Initialization and Page Navigation: 
  * We start by creating a Chrome WebDriver instance (driver) and navigating to the CFA Institute's refresher readings page.

2) Handling Privacy Banner: 
  * A function close_privacy_banner() is defined and used to close any potential privacy consent banners that may appear on the site.

3) Extracting URLs in a Loop: 
  * We then enter a while loop to traverse through the web pages. In each iteration, the script:
  * Waits for the page content to load.
  * Extracts the URLs of the refresher readings using JavaScript and stores them in the all_links list.
  * Checks and clicks the 'next page' button if available, or exits the loop if there are no more pages.
  * Error Handling:The script includes error handling for timeouts and cases where elements (the next page button) is not found.

4) Closing the Browser: 
  * Finally, after extracting all URLs or encountering an error, the script closes the browser session.

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, StaleElementReferenceException

driver = webdriver.Chrome()
driver.get("https://www.cfainstitute.org/en/membership/professional-development/refresher-readings#sort=%40refreadingcurriculumyear%20descending")
wait = WebDriverWait(driver, 10)

def close_privacy_banner():
    try:
        driver.execute_script("document.getElementById('privacy-banner').style.display='none';")
    except Exception as e:
        print("Privacy banner not found or could not be closed:", e)
all_links = []  # Array to store all the links

current_page = 1
while True:
    close_privacy_banner()

    try:
        # Wait for the new content to load
        wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "CoveoResultLink")))

        # Extract URLs using JavaScript
        links = driver.execute_script(
            "return Array.from(document.querySelectorAll('.CoveoResultLink')).map(link => link.getAttribute('href'));"
        )
        for link in links:
            print(f"Page {current_page}: {link}")
            all_links.append(link)
        # Check if the next page exists
        try:
            next_page = wait.until(EC.presence_of_element_located((By.XPATH, f"//a[contains(@class, 'coveo-pager-anchor') and text()='{current_page + 1}']")))
            if next_page:
                next_page.click()
                current_page += 1
                wait.until(EC.staleness_of(next_page))  # Wait for the old next page to go stale
        except NoSuchElementException:
            print("No more pages to navigate.")
            break
      

    except TimeoutException as e:
        print(f"Error loading page content: {e}")
        break

driver.quit()


Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/time-series-analysis
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/credit-analysis-models
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/introduction-alternative-investments
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/credit-default-swaps
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/valuation-contingent-claims
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/introduction-commodities-commodity-derivatives
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/understanding-income-statements
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/pricing-and-valuation-of-forward-commitments
Page 1: ht

Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/ethics-and-trust-investment-profession
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/ethics-application
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/guidance-standards-i-vii-l3
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/introduction-gips
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/trade-strategy-execution
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/portfolio-performance-evaluation
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/exchange-traded-funds-mechanics-applications
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/fixed-income-active-management-credit-strategies
Page 8: 

Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/ICE-RSS-FEED-active-equity-investing-strategies
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Refresher-Reading
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Refresher-Reading
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Copy-of-Refresher-Reading-Test
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/QA-Test-Refresher-Reading-1
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Test-3-big-data-projects
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Test-1-big-data-projects
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Test20-cost-capital
Pag

Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Industry-and-Competitive-Analysis
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Company-Analysis-Forecasting
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/investments-real-estate-pubicly-traded-securities
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/cost-capital-advanced-topics
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/arbitrage-replication-cost-carry-pricing-derivatives
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Company-Analysis-Past-and-Present
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Business-Models
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Fixed-Incom

In [4]:
all_links[0]


'https://www.cfainstitute.org/membership/professional-development/refresher-readings/time-series-analysis'

In [5]:
pip install requests beautifulsoup4 pandas


Note: you may need to restart the kernel to use updated packages.


This Python script is designed to scrape information from the URLs of CFA Institute refresher readings (previously collected) and then process and save this information into a CSV file. The script performs several key operations:

Text Cleaning Function:

1) clean_text: 
  * A function defined to clean and normalize text. 
  * It converts the text to ASCII, handles special characters like dashes and quotation marks, and performs other general cleaning tasks.

2) Data Scraping and Processing:
  * The script iterates through each URL stored in the all_links list.
  * For each URL, it sends an HTTP GET request, retrieves the HTML content, and parses it using BeautifulSoup.
  * It then extracts various pieces of information, such as the title (topic name), year, level, learning outcomes, summary, and download link. If certain data is not found, it is marked as 'N/A', expcet for year it is 0.
  * The extracted text is cleaned using the clean_text function to ensure readability and uniformity.

3) Data Organization:
  * Extracted data is organized into a dictionary (new_row) for each URL.
  * These dictionaries are collected in the all_rows list.

4) DataFrame Creation and CSV Export:

  * A DataFrame is created from the all_rows list using pandas, with predefined columns.
  * This DataFrame is then exported to a CSV file named 'Assignment.csv'.
  * This script effectively automates the process of extracting and organizing data from web pages into a structured format, which can be easily analyzed or shared.

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import unicodedata
def clean_text(text):
    # Normalize the text to NFKD form which separates characters and their diacritics
    # which should turn characters like "â€" into their separate components "a" and "€"
    text = unicodedata.normalize('NFKD', text)
    
    # Encode to ASCII bytes, then decode back to a string ignoring non-ASCII characters
    text = text.encode('ascii', 'ignore').decode('ascii')
    
    # Replace common encoding issues
    replacements = {
        '\u2013': '-',  # En-dash
        '\u2014': '--', # Em-dash
        '\u2018': "'",  # Left single quotation mark
        '\u2019': "'",  # Right single quotation mark
        '\u201c': '"',  # Left double quotation mark
        '\u201d': '"',  # Right double quotation mark
        '\u2026': '...',# Ellipsis
        '\u00a0': ' ',  # Non-breaking space
    }
    
    # Apply replacements
    for src, dest in replacements.items():
        text = text.replace(src, dest)
    
    # Additional cleaning can be done here if needed
    text = re.sub(r'\n\s*', ' ', text)
    text = text.replace('\r', ' ')

    return text


def extract_year(text):
    # Match only the year number
    year_match = re.search(r'\b\d{4}\b', text)
    return year_match.group() if year_match else '0'

# Set the path for the CSV file
csv_path = 'Assignment.csv'
# Define the columns of the DataFrame
columns = ["Name of the Title", "Topic", "Level", "Year", "Introduction", "Learning Outcomes", "Summary", "Link to the Summary Page", "Download Link"]
all_rows = []


# Loop through all the URLs in the `all_links` list
for url in all_links:
    response = requests.get(url)
    response.encoding = 'utf-8'
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    new_row = {}

    # Fetching the title
    title = soup.find('h1', class_='article-title')
    new_row['Name of the Title'] = clean_text(title.get_text()) if title else 'Data Not Found.'
    
    topic_container = soup.find('span', class_='content-utility-topics')

    # Initialize the topic text as 'N/A' by default
    topic_text = 'Data Not Found.'

    if topic_container:
        # Find the nested span within the found container
        topic = topic_container.find('span', class_='content-utility-topic')
        if topic:
            topic_text = topic.get_text(strip=True)

# Assuming 'new_row' is a dictionary where you want to store the extracted topic
    new_row['Topic'] = topic_text
    
    
    
    # Fetching the Level
    level = soup.find('span', class_='content-utility-topic')
    new_row['Level'] = clean_text(level.get_text()) if level else 'Data Not Found.'

    # Learning Outcomes
    outcomes_header = soup.find('h2', class_='article-section', text='Learning Outcomes')
    outcomes_text = outcomes_header.find_next_sibling('section').get_text() if outcomes_header else 'Data Not Found.'
    new_row['Learning Outcomes'] = clean_text(outcomes_text)

    # Summary
    summary_header = soup.find('h2', class_='article-section', text='Summary')
    summary_text = summary_header.find_next_sibling('div').get_text() if summary_header else 'Data Not Found.'
    new_row['Summary'] = clean_text(summary_text)

    # Link to the Summary Page
    new_row['Link to the Summary Page'] = url
    
    # Download Link
    locked_content_links = soup.find_all('a', class_='locked-content')
    download_links = [link for link in locked_content_links if 'underlined-anchor' not in link.get('class', [])]
    new_row['Download Link'] = download_links[0]['href'] if download_links else 'Data Not Found.'


    # Year
    year_span = soup.find('span', class_='content-utility-curriculum')
    new_row['Year'] = extract_year(year_span.get_text()) if year_span else '0'


    # Introduction
    intro_header = soup.find('h2', class_='article-section', text='Introduction')
    if intro_header:
        # Find the parent section of the introduction header
        intro_section = intro_header.find_parent('section')
        # Find all p tags following the introduction header within the section
        intro_paragraphs = intro_section.find_all('p', recursive=False) if intro_section else []
        # Join the text from all p tags
        intro_text = ' '.join(p.get_text(strip=True) for p in intro_paragraphs)
        new_row['Introduction'] = clean_text(intro_text)
    else:
        new_row['Introduction'] = 'Data Not Found.'



    # Collect the new row in our list
    all_rows.append(new_row)

# Create a DataFrame from our list of rows
df = pd.DataFrame(all_rows, columns=columns)

# Save the DataFrame to a CSV file
df.to_csv(csv_path, index=False)


Download link is missing the https://www.cfainstitute.org

![](IMG20240301131354.jpg)

# Data Cleaning

In [7]:
df.columns = df.columns.str.replace(' ', '_')
df.head()

Unnamed: 0,Name_of_the_Title,Topic,Level,Year,Introduction,Learning_Outcomes,Summary,Link_to_the_Summary_Page,Download_Link
0,Time-Series Analysis,Quantitative Methods,Level II,2024,"As financial analysts, we often use time-serie...",The member should be able to: calculate and e...,The predicted trend value of a time series in...,https://www.cfainstitute.org/membership/profes...,/-/media/documents/protected/refresher-reading...
1,Credit Analysis Models,Fixed Income,Level II,2024,Credit analysis plays an important role in the...,The member should be able to: explain expecte...,This reading has covered several important top...,https://www.cfainstitute.org/membership/profes...,/-/media/documents/protected/refresher-reading...
2,Introduction to Alternative Investments,Alternative Investments,Level I,2023,"In this section, we explain what alternative i...",The member should be able to: describe types ...,This reading provides a comprehensive introduc...,https://www.cfainstitute.org/membership/profes...,/-/media/documents/protected/refresher-reading...
3,Credit Default Swaps,Fixed Income,Level II,2024,Derivative instruments in which the underlying...,The member should be able to: describe credit...,A credit default swap (CDS) is a contract bet...,https://www.cfainstitute.org/membership/profes...,/-/media/documents/protected/refresher-reading...
4,Valuation of Contingent Claims,Derivatives,Level II,2024,A contingent claim is a derivative instrument ...,The member should be able to: describe and in...,This reading on the valuation of contingent cl...,https://www.cfainstitute.org/membership/profes...,/-/media/documents/protected/refresher-reading...


In [8]:
df['Level'] = df['Level'].str.strip()

In [9]:
pattern = re.compile(r'^https?:\/\/[^\s\/$.?#].[^\s]*$')
def correct_url(url):
    # If URL is already valid or is a placeholder, return it unchanged
    if pattern.match(url) or url in [None, 'Data Not Found.', '']:
        return url
    # Add missing parts to make URL valid
    corrected_url = 'https://www.cfainstitute.org/' + url if not url.startswith('http://') and not url.startswith('https://') else url
    # Ensure the URL ends with .pdf
    if not corrected_url.endswith('.pdf'):
        corrected_url += '.pdf'
    return corrected_url

# Apply the correction function to the Download_Link column
df['Download_Link'] = df['Download_Link'].apply(lambda x: correct_url(str(x)))

# Display the updated DataFrame
print(df)

                                     Name_of_the_Title  \
0                                 Time-Series Analysis   
1                               Credit Analysis Models   
2              Introduction to Alternative Investments   
3                                 Credit Default Swaps   
4                       Valuation of Contingent Claims   
..                                                 ...   
219                  Fixed-Income Cash Flows and Types   
220  Private Capital, Real Estate, Infrastructure, ...   
221                  Extensions of Multiple Regression   
222  Pricing and Valuation of Forward Contracts and...   
223           Option Replication Using Put-Call Parity   

                       Topic     Level  Year  \
0       Quantitative Methods  Level II  2024   
1               Fixed Income  Level II  2024   
2    Alternative Investments   Level I  2023   
3               Fixed Income  Level II  2024   
4                Derivatives  Level II  2024   
..             

In [10]:
columns_to_check = ['Introduction', 'Learning_Outcomes', 'Summary']

# Function to replace text if its length is less than 5
def replace_short_text(text):
    if len(text) < 40:
        return 'Data Not Found.'
    else:
        return text

# Apply the function to the specified columns
for column in columns_to_check:
    df[column] = df[column].apply(lambda x: replace_short_text(str(x)))

In [11]:
csv_path = 'Assignment2.csv'
df.to_csv(csv_path, index=False)

In [12]:
df.isnull().sum()

Name_of_the_Title           0
Topic                       0
Level                       0
Year                        0
Introduction                0
Learning_Outcomes           0
Summary                     0
Link_to_the_Summary_Page    0
Download_Link               0
dtype: int64

In [13]:
import pandas as pd
df= pd.read_csv("Assignment2.csv")

In [14]:
columns_to_strip = ['Introduction', 'Learning_Outcomes', 'Summary']

# Strip whitespace from the beginning and end of specified columns
for column in columns_to_strip:
    df[column] = df[column].str.strip()

In [15]:
df.head(10)

Unnamed: 0,Name_of_the_Title,Topic,Level,Year,Introduction,Learning_Outcomes,Summary,Link_to_the_Summary_Page,Download_Link
0,Time-Series Analysis,Quantitative Methods,Level II,2024,"As financial analysts, we often use time-serie...",The member should be able to: calculate and ev...,The predicted trend value of a time series in ...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
1,Credit Analysis Models,Fixed Income,Level II,2024,Credit analysis plays an important role in the...,The member should be able to: explain expected...,This reading has covered several important top...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
2,Introduction to Alternative Investments,Alternative Investments,Level I,2023,"In this section, we explain what alternative i...",The member should be able to: describe types a...,This reading provides a comprehensive introduc...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
3,Credit Default Swaps,Fixed Income,Level II,2024,Derivative instruments in which the underlying...,The member should be able to: describe credit ...,A credit default swap (CDS) is a contract betw...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
4,Valuation of Contingent Claims,Derivatives,Level II,2024,A contingent claim is a derivative instrument ...,The member should be able to: describe and int...,This reading on the valuation of contingent cl...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
5,Introduction to Commodities and Commodity Deri...,Alternative Investments,Level II,2024,"In the upcoming sections, we present the chara...",The member should be able to: compare characte...,Commodities are a diverse asset class comprisi...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
6,Understanding Income Statements,Financial Reporting and Analysis,Level I,2023,The income statement presents information on t...,The member should be able to: describe the com...,This reading has presented the elements of inc...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
7,Pricing and Valuation of Forward Commitments,Derivatives,Level II,2024,"Forward commitments include forwards, futures,...",The member should be able to: describe the car...,This reading on forward commitment pricing and...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
8,Private Equity Investments,Alternative Investments,Level II,2023,Private equitys shift from a niche activity to...,The member should be able to: explain sources ...,Private equity funds seek to add value by vari...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...
9,Valuation and Analysis of Bonds with Embedded ...,Fixed Income,Level II,2024,"The valuation of a fixed-rate, option-free bon...",The member should be able to: describe fixed-i...,An embedded option represents a right that can...,https://www.cfainstitute.org/membership/profes...,https://www.cfainstitute.org//-/media/document...


# PYDANTIC & PYTEST

In [16]:
from pydantic import BaseModel, validator, ValidationError, Field, root_validator, constr
from typing import Optional
import pandas as pd  # For pd.isna()
import datetime
from datetime import datetime  # Corrected import
import re  # Add this import statement

class RecordModel(BaseModel):  #url class
    Name_of_the_Title: Optional[str] = Field(min_length=5, description="The title of the record. Must start with a capital letter and be at least 5 characters long.")
    Topic: Optional[str] = Field(min_length=5, description="The main topic covered by the record. Begins with a capital letter and requires a minimum of 5 characters.")
    Level: Optional[str] = Field(None, description="The difficulty level or audience level the record is intended for. Must follow the pattern 'Level I', 'Level II', or 'Level III'.")
    Year: Optional[int] = Field(ge=0, description="The year associated with the record. A non-negative integer.")
    Introduction: Optional[str] = Field(min_length=10, description="A brief introduction to the record. Requires a minimum of 50 characters to ensure sufficient detail.")
    Learning_Outcomes: Optional[str] = Field(min_length=10, description="Expected learning outcomes from engaging with the record. Must be descriptive and at least 50 characters long.")
    Summary: Optional[str] = Field(min_length=10, description="A concise summary of the record's content. Should provide a clear overview in at least 50 characters.")
    Link_to_the_Summary_Page: Optional[str] = Field(None, alias='Link to the Summary Page', description="URL to a webpage providing a detailed summary of the record.")
    Download_Link: Optional[str] = Field(None, description="Direct link to download the record or associated materials.")

    # Validators and Config class as before


    @root_validator(pre=True)
    def check_for_nulls_and_replace(cls, values):
        for field, value in values.items():
            if pd.isna(value) or value is None:
                values[field] = "Data Not Found."
        return values

    @validator('Year', pre=True)
    def validate_year_not_in_future(cls, v):
        if v != 0:
            current_year = datetime.now().year
            if v > current_year or v < 2000:
                raise ValueError("Year cannot be in the future or before 2000.")
        return v


    from pydantic import validator

    @validator('Download_Link', pre=True)
    def urls_must_be_fully_qualified(cls, v):
        if v is not None and v != "Data Not Found.":
            if not v.startswith('https://www.cfainstitute'): 
                raise ValueError('URL must be fully qualified and start with https://www.cfainstitute')
            if not v.endswith(".pdf"):
                raise ValueError('URL must end with .pdf')
        return v

    
    @validator('Link_to_the_Summary_Page', pre=True)
    def urls_must_be_fully_qualified_and_follow_title_pattern(cls, v, values, **kwargs):
        title = values.get('Name_of_the_Title', '')
        # Create a URL-friendly version of the title
        url_friendly_title = urllib.parse.quote(title.replace(' ', '-'), safe='')

        expected_url_format = f"https://www.cfainstitute.org/{url_friendly_title}"
        if v and not v.startswith(expected_url_format):
            raise ValueError(f"URL must follow the pattern: {expected_url_format}")
        return v

    @validator('Level', pre=True)
    def level_must_start_with_level_and_followed_by_roman_numeral(cls, v):
        if v is not None:
            # Regular expression to match the pattern "Level " followed by I, II, or III
            pattern = r"^Level (I|II|III)$"
            if not re.match(pattern, v):
                raise ValueError("Level must start with 'Level' and be followed by a Roman numeral I, II, or III.")
        return v
    
    
    # Adjusted content length validator (combined logic, if applicable)
    @validator('Introduction', 'Learning_Outcomes', 'Summary', pre=True)
    def check_content_quality(cls, v):
        if v == "Data Not Found.":
            return v  # Skip further checks for placeholder text
        if len(v) < 50:
            raise ValueError('Content is too short to be meaningful')
        if not v[0].isupper():
            raise ValueError(f"'{v}' must start with a capital letter.")
        if not v.endswith('.'):
            raise ValueError(f"'{v}' must end with a full stop.")
        return v
    
    @validator('Learning_Outcomes', pre=True, always=True)
    def learning_outcomes_must_start_with_specified_phrases(cls, v):
        if v is not None and not (v.startswith("The candidate should be able to:") or v.startswith("The member should be able to:") or v.startswith("Data Not Found.") ):
            raise ValueError("Learning_Outcomes must start with 'The candidate should be able to:' or 'The member should be able to:'.")
        return v
    
    @validator('Name_of_the_Title', pre=True)
    def title_must_have_proper_case(cls, v):
        if v is not None or v!="Data Not Found.":
            # List of words that should not be capitalized
            lowercase_words = ['and', 'to', 'of', 'for', 'with', 'in', 'on', 'at', 'by', 'from']
            
            # Split the title into words
            words = v.split()
            
            # Capitalize the first word and words not in lowercase_words
            corrected_words = [words[0].capitalize()] + [word if word in lowercase_words else word.capitalize() for word in words[1:]]
            
            # Reconstruct the title
            corrected_title = ' '.join(corrected_words)
            
            # Check if the corrected title matches the original
            if corrected_title != v:
                raise ValueError("Title words must start with a capital letter, except for specific lowercased words.")
                
        return v
    
    @validator('Topic', pre=True)
    def title_must_have_proper_case(cls, v):
        if v is not None or v!="Data Not Found.":
            # List of words that should not be capitalized
            lowercase_words = ['and', 'to', 'of', 'for', 'with', 'in', 'on', 'at', 'by', 'from']
            
            # Split the title into words
            words = v.split()
            
            # Capitalize the first word and words not in lowercase_words
            corrected_words = [words[0].capitalize()] + [word if word in lowercase_words else word.capitalize() for word in words[1:]]
            
            # Reconstruct the title
            corrected_title = ' '.join(corrected_words)
            
            # Check if the corrected title matches the original
            if corrected_title != v:
                raise ValueError("Topic words must start with a capital letter, except for specific lowercased words.")
                
        return v
    
    
   
    
  


C:\Users\devmi\anaconda3\lib\site-packages\ipykernel_launcher.py:22: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
C:\Users\devmi\anaconda3\lib\site-packages\ipykernel_launcher.py:29: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
C:\Users\devmi\anaconda3\lib\site-packages\ipykernel_launcher.py:40: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_v

Cleaning the csv file

In [17]:
validation_results = []

for index, row in df.iterrows():
    try:
        record = RecordModel(**row.to_dict())
        validation_results.append((index, "Validation successful."))
    except Exception as e:
        validation_results.append((index, f"Validation failed: {e}"))

for result in validation_results:
    if "Validation failed" in result[1]:  # Check if the validation result indicates a failure
        print(f"Row {result[0]}: {result[1]}")

Row 2: Validation failed: 1 validation error for RecordModel
Introduction
  Value error, 'In this section, we explain what alternative investments are and why assets under management in alternative investments have grown in recent decades. We also explain how alternative investments differ from traditional investments, and we examine their perceived investment merit. We conclude this section with a brief overview of the various categories of alternative investments; these categories will be explored further in later sections.Alternative investments is a label for a disparate group of investments that are distinguished from long-only, publicly traded investments in stocks, bonds, and cash (often referred to as traditional investments). The terms traditional and alternative should not imply that alternatives are necessarily uncommon or that they are relatively recent additions to the investment universe. Alternative investments include such assets as real estate and commodities, which ar

In [18]:
print(df.iloc[41])

Name_of_the_Title                                        Measures of Leverage
Topic                                                       Corporate Finance
Level                                                                 Level I
Year                                                                     2023
Introduction                This reading presents elementary topics in lev...
Learning_Outcomes           The member should be able to: define and expla...
Summary                     In this reading, we have reviewed the fundamen...
Link_to_the_Summary_Page    https://www.cfainstitute.org/membership/profes...
Download_Link               https://www.cfainstitute.org//-/media/document...
Name: 41, dtype: object


The above validation errors occured due to missing '.' and short length of texts

In [19]:
# Function to convert a string to title case with exceptions
def to_title_case_except(s):
    lower_case_words = ['a', 'an', 'the', 'and', 'but', 'or', 'for', 'nor', 'on', 'at', 'to', 'from', 'by']
    return ' '.join([word if word in lower_case_words else word.capitalize() for word in s.split()])

# Apply title case transformation with exceptions for specific columns
df['Name_of_the_Title'] = df['Name_of_the_Title'].apply(to_title_case_except)
df['Topic'] = df['Topic'].apply(to_title_case_except)

# Ensure Introduction and Learning_Outcomes end with a period
def ensure_ending_period(s):
    return s if s.endswith('.') else s + '.'

df['Introduction'] = df['Introduction'].apply(ensure_ending_period)
df['Learning_Outcomes'] = df['Learning_Outcomes'].apply(ensure_ending_period)
df['Summary'] = df['Summary'].apply(ensure_ending_period)


In [20]:
validation_results = []

for index, row in df.iterrows():
    try:
        record = RecordModel(**row.to_dict())
        validation_results.append((index, "Validation successful."))
    except Exception as e:
        validation_results.append((index, f"Validation failed: {e}"))

for result in validation_results:
    if "Validation failed" in result[1]:  # Check if the validation result indicates a failure
        print(f"Row {result[0]}: {result[1]}")

Row 143: Validation failed: 1 validation error for RecordModel
Learning_Outcomes
  Value error, Content is too short to be meaningful [type=value_error, input_value='The candidate should be able to: LL Outcomes.', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/value_error


In [21]:
# Function to replace text if its length is less than 5
def replace_short_text(text):
    if len(text) < 60:
        return 'Data Not Found.'
    else:
        return text

df['Learning_Outcomes'] = df['Learning_Outcomes'].apply(lambda x: replace_short_text(str(x)))

In [22]:
validation_results = []

for index, row in df.iterrows():
    try:
        record = RecordModel(**row.to_dict())
        validation_results.append((index, "Validation successful."))
    except Exception as e:
        validation_results.append((index, f"Validation failed: {e}"))

for result in validation_results:
    if "Validation failed" in result[1]:  # Check if the validation result indicates a failure
        print(f"Row {result[0]}: {result[1]}")

In [23]:
csv_path = 'Assignment2.csv'
df.to_csv(csv_path, index=False)

In [24]:
pip install pytest


Note: you may need to restart the kernel to use updated packages.


# PYTEST

In [25]:
import pytest
from datetime import datetime
import ipytest
ipytest.autoconfig()
import urllib.parse


# Test for valid Learning_Outcomes format
def test_valid_learning_outcomes_format_should_pass():
    obj = RecordModel(
        Name_of_the_Title="Understanding Financial Markets",
        Topic="Financial Markets",
        Year=2023,
        Level="Level II",
        Introduction="This section introduces financial markets and their importance." * 2,
        Learning_Outcomes="The candidate should be able to: identify key participants in financial markets." * 2,
        Summary="This summary covers the basics of financial markets, including their structure and participants." * 2,
        Link_to_the_Summary_Page="https://www.cfainstitute.org/Understanding-Financial-Markets",
        Download_Link="https://www.cfainstitute.org/download/understanding-financial-markets.pdf"
    )

# Test for valid URL format for Download_Link
def test_valid_download_link_format_should_pass():
    obj = RecordModel(
        Name_of_the_Title="Investment Portfolio Management",
        Topic="Portfolio Management",
        Year=2022,
        Level="Level III",
        Introduction="This section delves into the strategies for effective investment portfolio management." * 2,
        Learning_Outcomes="The candidate should be able to: construct and manage a diversified investment portfolio." * 2,
        Summary="This summary explores various approaches to portfolio management and their implications for risk and return." * 2,
        Link_to_the_Summary_Page="https://www.cfainstitute.org/Investment-Portfolio-Management",
        Download_Link="https://www.cfainstitute.org/download/investment-portfolio-management.pdf"
    )

# Test for valid title and topic casing
def test_valid_title_and_topic_casing_should_pass():
    obj = RecordModel(
        Name_of_the_Title="Risk Management and Financial Institutions",
        Topic="Risk Management",
        Year=2021,
        Level="Level I",
        Introduction="This section examines the principles of risk management within financial institutions." * 2,
        Learning_Outcomes="The candidate should be able to: assess and mitigate various types of financial risks." * 2,
        Summary="This summary provides an overview of risk management practices in banking and financial institutions." * 2,
        Link_to_the_Summary_Page="https://www.cfainstitute.org/Risk-Management-and-Financial-Institutions",
        Download_Link="https://www.cfainstitute.org/download/risk-management-and-financial-institutions.pdf"
    )



# Example corrected test case
def test_year_less_than_2000_should_fail():
    with pytest.raises(ValidationError):
        obj =RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=1799,  # Invalid year
            Level="Level I",
            Introduction="A valid introduction that is long enough." * 5,
            Learning_Outcomes="The candidate should be able to: demonstrate understanding." * 5,
            Summary="A valid summary that is sufficiently long." * 5,
            Link_to_the_Summary_Page="https://study.cfainstitute.org/Example-Title",
            Download_Link="https://study.cfainstitute.org/download/example-title.pdf"
        )

# Test for year validation: greater than current year should fail
def test_year_greater_than_current_year_should_fail():
    future_year = datetime.now().year + 1
    with pytest.raises(ValidationError):  # Expect ValidationError
         obj =RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=future_year,  # Future year
            Level="Level I",
            Introduction="A valid introduction that is long enough."*5,
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://study.cfainstitute.org/Example-Title",
            Download_Link="https://study.cfainstitute.org/download/example-title.pdf"
        )

# Test for successful creation with valid year
def test_valid_year_should_pass():
    try:
        obj = RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=2023,  # Valid year
            Level="Level I",
            Introduction="A valid introduction that is long enough."*5,
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://www.cfainstitute.org/Example-Title",
            Download_Link="https://www.cfainstitute.org/download/example-title.pdf"
        )
    except ValidationError as e:
        pytest.fail(f"Unexpected ValidationError occurred: {e}")


# Test for HTML tags in introduction should fail
def test_html_tags_in_introduction_should_fail():
    with pytest.raises(ValueError):
        obj =RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=2023,
            Level="Level I",
            Introduction="<h1>Invalid Introduction</h1>"*5,  # Invalid because of HTML
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://study.cfainstitute.org/Example-Title",
            Download_Link="https://study.cfainstitute.org/download/example-title.pdf"
        )

# Test for introduction should fail
def test_puncutation_tags_in_introduction_should_fail():
    with pytest.raises(ValueError):
        obj =RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=2023,
            Level="Level I",
            Introduction="Invalid Introduction"*5,  # Invalid because of no "." at end
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://study.cfainstitute.org/Example-Title",
            Download_Link="https://study.cfainstitute.org/download/example-title.pdf"
        )

# Test for introduction should fail
def test_case_tags_in_introduction_should_fail():
    with pytest.raises(ValueError):
        obj =RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=2023,
            Level="Level I",
            Introduction="invalid Introduction."*5,  # Invalid because of starting with lower case
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://study.cfainstitute.org/Example-Title",
            Download_Link="https://study.cfainstitute.org/download/example-title.pdf"
        )
        
# Test for introduction should fail
def test_case_introduction_should_pass():
        obj =RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=2023,
            Level="Level I",
            Introduction="Valid Introduction."*5,  # valid 
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://www.cfainstitute.org/Example-Title",
            Download_Link="https://www.cfainstitute.org/download/example-title.pdf"
        )


# Test for Level format: "Level IIII" should fail due to incorrect Roman numeral
def test_incorrect_level_format_should_fail():
    with pytest.raises(ValueError):
        obj = RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=2023,
            Level="Level IIII",  # Invalid Level
            Introduction="A valid introduction that is long enough."*5,
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://study.cfainstitute.org/Example-Title",
            Download_Link="https://study.cfainstitute.org/download/example-title.pdf"
        )


        

# Test for Level format: "Advance" incorrect level at first
def test_incorrect_level_format1_should_fail():
    with pytest.raises(ValueError):
        obj = RecordModel(
            Name_of_the_Title="Example Title",
            Topic="Example Topic",
            Year=2023,
            Level="Advance",  # Invalid Level
            Introduction="A valid introduction that is long enough."*5,
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://study.cfainstitute.org/Example-Title",
            Download_Link="https://study.cfainstitute.org/download/example-title.pdf"
        )
        
def test_title_with_improper_case_should_fail():
    with pytest.raises(ValueError):
        obj = RecordModel(
            Name_of_the_Title="example Title With Prepositions And Conjunctions",  # Incorrect capitalization
            Topic="Example Topic",
            Year=2023,
            Level="Level I",
            Introduction="A valid introduction that is long enough."*5,
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://study.cfainstitute.org/download/example-Title-With-Prepositions-And-Conjunctions",
            Download_Link="https://study.cfainstitute.org/Example-Title.pdf"
        )


    
def test_title_with_proper_case_should_pass():
        obj = RecordModel(
            Name_of_the_Title="Example and Title",  # correct capitalization
            Topic="Example Topic",
            Year=2023,
            Level="Level I",
            Introduction="A valid introduction that is long enough."*5,
            Learning_Outcomes="The candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary that is sufficiently long."*5,
            Link_to_the_Summary_Page="https://www.cfainstitute.org/download/Example-and-Title",
            Download_Link="https://www.cfainstitute.org/Example-Title.pdf"
        )

        
def test_Learning_Outcomes_with_improper_format_should_fail():
      with pytest.raises(ValidationError):
        obj = RecordModel(
            Name_of_the_Title="Example and Title",
            Topic="Example Topic",
            Year=2023,
            Level="Level I",
            Introduction="A valid introduction."*5,
            Learning_Outcomes="candidate should be able to: demonstrate understanding."*5,
            Summary="A valid summary."*5,
            Link_to_the_Summary_Page="//study.cfainstitute.org/wrong-format-example-and-title",
            Download_Link="https://study.cfainstitute.org/Example-Title.pdf"
        )


In [26]:
ipytest.run('-vv')


platform win32 -- Python 3.7.6, pytest-7.4.4, pluggy-0.13.1 -- C:\Users\devmi\anaconda3\python.exe
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('C:\\Users\\devmi\\Downloads\\.hypothesis\\examples')
rootdir: C:\Users\devmi\Downloads
plugins: hypothesis-5.5.4, arraydiff-0.3, astropy-header-0.1.2, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2
[1mcollecting ... [0mcollected 15 items

t_4708730ae8654c6f85f615b41a24ffa5.py::test_valid_learning_outcomes_format_should_pass [32mPASSED[0m[32m [  6%][0m
t_4708730ae8654c6f85f615b41a24ffa5.py::test_valid_download_link_format_should_pass [32mPASSED[0m[32m    [ 13%][0m
t_4708730ae8654c6f85f615b41a24ffa5.py::test_valid_title_and_topic_casing_should_pass [32mPASSED[0m[32m  [ 20%][0m
t_4708730ae8654c6f85f615b41a24ffa5.py::test_year_less_than_2000_should_fail [32mPASSED[0m[32m           [ 26%][0m
t_4708730ae8654c6f85f615b41a24ffa5.py::test_year_greater_than_current_year_should_fai

<ExitCode.OK: 0>