## **📦 Importing Libraries**

- `BeautifulSoup` from `bs4`: Used for parsing HTML and XML documents. It's helpful for web scraping and navigating the structure of web pages.
- `requests`: Allows sending HTTP/1.1 requests easily. Used here to fetch the content of web pages.
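- `pandas`: A data manipulation and analysis library. Used here to store the scraped data in a DataFrame and export it.
- `quote_plus` from `urllib.parse`: Encodes the search keyword so it can be placed safely in a URL query string (spaces become `+`).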


In [28]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import pandas as pd

## **🔧 Define scraping function**

---

### Function: `auto_Scrapper_Class()`

This function extracts course information like titles, organizations, difficulty levels, and skills by parsing the HTML of Coursera's search results.

#### Parameters:
- **`user_query (str)`**: The keyword to search courses for (e.g., `"data science"`).
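- **`number_page (int)`**: The number of result pages to scrape. Note that the implementation below does not accept this as an argument; the page range is hard-coded as `range(1, 50)`, i.e. pages 1 to 49.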
- **`html_tag (str)`**: The HTML tag used to find content (e.g., `h3`, `p`).
- **`course_case (list)`**: A list to store the extracted content.
- **`tag_class (str)`**: The class name associated with the tag (used to target the correct element).
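- **`div_class (str, optional)`**: Class of a `<div>` to target instead. When given, the function collects the full text of every `<div>` with this class and ignores `html_tag` and `tag_class` (used for the metadata and skills sections).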

#### Function Logic:
1. Encodes the user query for use in a URL.
2. Iterates through pages of Coursera search results.
3. Fetches HTML content and parses it with BeautifulSoup.
4. Depending on tag and class parameters, extracts desired data.
5. Appends found data to the provided list, ensuring consistent list size even if data is missing.

---
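
Steps 1 and 2 of the logic above can be sketched in isolation. This is a minimal illustration rather than the notebook's actual code: `build_search_urls` is a hypothetical helper, and it treats `number_page` as a real argument, which is an assumption.

```python
from urllib.parse import quote_plus


def build_search_urls(user_query, number_page):
    """Return the search URLs for pages 1..number_page (hypothetical helper)."""
    # quote_plus turns spaces into '+' and escapes characters such as '&'.
    encoded_query = quote_plus(user_query)
    return [
        f"https://www.coursera.org/search?query={encoded_query}&page={page}"
        for page in range(1, number_page + 1)
    ]


urls = build_search_urls("data science", 3)
for url in urls:
    print(url)
```

Building the URLs up front keeps the encoding step separate from the network requests, which makes it easy to check before any page is fetched.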

In [29]:
def auto_Scrapper_Class(user_query, html_tag, course_case, tag_class, div_class=None):
    """
    Scrape course titles or other info from Coursera based on the user's search query.
    
    Parameters:
    - user_query (str): the search keyword from user, e.g., "data science"
    - html_tag (str): HTML tag to look for
    - course_case (list): list to store the results
    - tag_class (str): class of the target HTML tag
    - div_class (str, optional): class of the <div> to target; if given, html_tag and tag_class are ignored
    """
    encoded_query = quote_plus(user_query)

    # range(1, 50) requests pages 1 to 49; adjust as needed. According to the Coursera
    # website at the time of writing, there are 83 result pages for all courses.
    for i in range(1, 50):
        url = f"https://www.coursera.org/search?query={encoded_query}&page={i}"
        
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')

        if div_class:
            elements = soup.find_all('div',  class_ = div_class)

            if (len(elements)) != 12:  
                for j in range(0,12):    # There are 12 courses per page
                    course_case.append(None)
                continue
            for name in elements:
                x = name.get_text()
                if x:
                    course_case.append(x)
                else:
                    course_case.append(None)

        else:
            element = soup.find_all(html_tag,  class_ = tag_class)
            if (len(element)) != 12:
                for j in range(0,12):
                    course_case.append(None)
                continue

            for name in element:
                x = name.get_text()
                if x:
                    course_case.append(x)
                else:
                    course_case.append(None)  # same placeholder as the div branch
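
The per-page padding rule can also be looked at on its own. The sketch below is a simplified variant, not the function above: `pad_page_results` is a hypothetical helper that works on a list of already extracted strings, assumes 12 results per page, and maps empty strings to `None` in every case.

```python
def pad_page_results(texts, per_page=12):
    """Return exactly `per_page` entries for one results page.

    If the page did not yield the expected number of items, the whole page
    is replaced by None placeholders so that every list grows by the same
    amount per page. Empty strings are also mapped to None.
    """
    if len(texts) != per_page:
        return [None] * per_page
    return [text if text else None for text in texts]


full_page = pad_page_results([f"Course {n}" for n in range(12)])
short_page = pad_page_results(["Only one result"])
print(len(full_page), len(short_page))
```

Note the trade-off in this rule: a page with 11 valid results is discarded entirely, which keeps the lists aligned at the cost of losing data.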


### 📥 User Input:

In [None]:
user_query = input("Enter the course you want to search: ")

### 📄 Lists to Store Scraped Data:

In [31]:
course_title = []
course_organization = []
course_Certificate_type = []
course_difficulty = []
course_skills = []

### ⚙️ Function Usage:
Each call to `auto_Scrapper_Class` scrapes a different part of the course info.

1. **Course Title**  
   Extracted from `<h3>` tags with class `cds-CommonCard-title`.

2. **Course Organization**  
   Extracted from `<p>` tags with class `cds-ProductCard-partnerNames`.

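3. **Course Difficulty**  
   Extracted from `<div>` elements with class `cds-CommonCard-metadata`. Because `div_class` is passed, the function takes the full text of each `<div>` (difficulty, certificate type, and duration together), and the `'p'` / `'css-vac8rf'` arguments are not used for matching.

4. **Skills Taught**  
   Extracted from `<div>` elements with class `cds-CommonCard-bodyContent` in the same way; the `'p'` / `'css-vac8rf'` arguments are again ignored.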


In [32]:
# 1. Course Title - Fixed class name
auto_Scrapper_Class(user_query,'h3', course_title, 'cds-CommonCard-title')

# 2. Course Organization - Fixed class name
auto_Scrapper_Class(user_query,'p', course_organization, 'cds-ProductCard-partnerNames')

# 3. Course Difficulty - Look in metadata section
auto_Scrapper_Class(user_query,'p', course_difficulty, 'css-vac8rf', 'cds-CommonCard-metadata')

# 4. Skills - Look in body content
auto_Scrapper_Class(user_query,'p', course_skills, 'css-vac8rf', 'cds-CommonCard-bodyContent')    
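
To see what the tag/class and `div_class` lookups select, the same `find_all` calls can be tried on a small static snippet. The HTML below is hand-written for illustration and only borrows the class names used above; it is an assumption, not Coursera's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names; not copied from Coursera.
SAMPLE_HTML = """
<div class="card">
  <h3 class="cds-CommonCard-title">Machine Learning</h3>
  <p class="cds-ProductCard-partnerNames">Multiple educators</p>
  <div class="cds-CommonCard-metadata"><p class="css-vac8rf">Beginner · Specialization · 1 - 3 Months</p></div>
</div>
<div class="card">
  <h3 class="cds-CommonCard-title">Machine Learning with Python</h3>
  <p class="cds-ProductCard-partnerNames">IBM</p>
  <div class="cds-CommonCard-metadata"><p class="css-vac8rf">Intermediate · Course · 1 - 3 Months</p></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Tag + class lookup, as used for titles and organizations.
titles = [tag.get_text(strip=True) for tag in soup.find_all("h3", class_="cds-CommonCard-title")]
partners = [tag.get_text(strip=True) for tag in soup.find_all("p", class_="cds-ProductCard-partnerNames")]

# <div> lookup, as used when div_class is passed: the whole div text is taken.
metadata = [div.get_text(strip=True) for div in soup.find_all("div", class_="cds-CommonCard-metadata")]

print(titles, partners, metadata)
```

Class names such as `css-vac8rf` look auto-generated and may change without notice, so a check like this against a saved page is a cheap way to confirm the selectors still match.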

---
## **🧹 Clean and Organize Scraped Data**

After scraping, the data is compiled and cleaned using `pandas`.

---

In [33]:
# Truncate every list to the shortest length so all DataFrame columns align.
min_len = min(len(course_title), len(course_organization), len(course_skills), len(course_difficulty))
data = {
    'Title': course_title[:min_len],
    'Organization': course_organization[:min_len],
    'Skills': course_skills[:min_len],
    'Metadata': course_difficulty[:min_len]
}
df = pd.DataFrame(data)
df['Skills'] = df['Skills'].str.replace("Skills you'll gain:", '', regex=False)
df

| | Title | Organization | Skills | Metadata |
|---|---|---|---|---|
| 0 | Machine Learning | Multiple educators | Unsupervised Learning, Supervised Learning, M... | Beginner · Specialization · 1 - 3 Months |
| 1 | Machine Learning with Python | IBM | Unsupervised Learning, Supervised Learning, R... | Intermediate · Course · 1 - 3 Months |
| 2 | Mathematics for Machine Learning and Data Science | DeepLearning.AI | Descriptive Statistics, Bayesian Statistics, ... | Intermediate · Specialization · 1 - 3 Months |
| 3 | IBM Machine Learning | IBM | Exploratory Data Analysis, Feature Engineerin... | Intermediate · Professional Certificate · 3 - ... |
| 4 | Python for Data Science, AI & Development | IBM | Jupyter, Python Programming, Data Structures,... | Beginner · Course · 1 - 3 Months |
| ... | ... | ... | ... | ... |
| 583 | Introduction to Artificial Intelligence (AI) | IBM | Generative AI, ChatGPT, Natural Language Proc... | Beginner · Course · 1 - 4 Weeks |
| 584 | Mathematics for Machine Learning | Imperial College London | Linear Algebra, Dimensionality Reduction, Num... | Beginner · Specialization · 3 - 6 Months |
| 585 | Fundamentals of Machine Learning and Artificia... | Amazon Web Services | Artificial Intelligence and Machine Learning ... | Mixed · Course · 1 - 4 Weeks |
| 586 | Supervised Machine Learning: Regression and Cl... | DeepLearning.AI | Supervised Learning, Jupyter, Scikit Learn (M... | Beginner · Course · 1 - 4 Weeks |
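
The `Metadata` values appear to follow a `difficulty · type · duration` pattern. Assuming that pattern and the ` · ` separator hold for every row, a hypothetical helper such as `split_metadata` could break the string into separate fields. This is a sketch, not part of the notebook.

```python
def split_metadata(metadata, sep=" · "):
    """Split 'Difficulty · Type · Duration' into named fields (hypothetical helper)."""
    keys = ("difficulty", "certificate_type", "duration")
    if not metadata:
        # Placeholder rows (None) yield empty fields.
        return dict.fromkeys(keys)
    parts = [part.strip() for part in metadata.split(sep)]
    # Pad so that short or malformed strings still yield three fields.
    parts += [None] * (3 - len(parts))
    return dict(zip(keys, parts[:3]))


example = split_metadata("Beginner · Specialization · 1 - 3 Months")
print(example)
```

Applied row by row (for instance with `df['Metadata'].apply(split_metadata)`), this would give one value per field, which is easier to filter on than the combined string.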


## **💾 Export Data to CSV**
- This command saves the cleaned DataFrame `df` to a CSV file named `coursera_course_dataset.csv`.
- The file will be created in the current working directory.

In [34]:
df.to_csv("coursera_course_dataset.csv")
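
By default `DataFrame.to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. If that column is not wanted, passing `index=False` is one option. The small `df_demo` frame below is made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the scraped DataFrame (illustrative values only).
df_demo = pd.DataFrame(
    {"Title": ["Machine Learning"], "Organization": ["Multiple educators"]}
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df_demo.to_csv()
without_index = df_demo.to_csv(index=False)

header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]
print(repr(header_with_index))
print(repr(header_without_index))
```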