### **Web Scraping and Text Analysis of Martin Luther King Jr.'s Speech**

This project scrapes the text of a Martin Luther King Jr. speech from a web page, preprocesses it, and analyzes its word frequencies. The script uses the Python libraries **BeautifulSoup**, **Requests**, **re** (regular expressions), and **pandas** to extract, clean, and analyze the text.

#### **Project Workflow:**

1. **Web Scraping:**  
   - The script fetches the HTML content of the webpage containing the speech using the `requests` library.  
   - It parses the HTML using `BeautifulSoup` and extracts all paragraphs (`<p>` tags) that contain the speech text.

2. **Text Preprocessing:**  
   - The extracted text is combined into a single string for uniform analysis.  
   - Carriage returns and newline characters are replaced with spaces.  
   - Punctuation is removed using regular expressions.  
   - The text is converted to lowercase to ensure consistency.  
   - The cleaned text is split into individual words.

3. **Word Frequency Analysis:**  
   - The words are organized into a `pandas` DataFrame.  
   - The frequency of each unique word is calculated and stored.

4. **Exporting Results:**  
   - The word frequency data is exported to a CSV file (`mlk_speech_counts.csv`) with two columns: `Word` and `Counts`.
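
The preprocessing and counting steps above can be tried without any network access. The following is a minimal sketch, assuming a made-up sample string and a hypothetical helper named `count_words`; it uses `collections.Counter` from the standard library in place of a `pandas` DataFrame so that it runs with no third-party packages.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Clean a raw string and count how often each word occurs."""
    # Replace carriage returns and newlines with spaces.
    flattened = re.sub(r'[\r\n]+', ' ', text)
    # Drop every character that is neither a word character nor whitespace.
    no_punctuation = re.sub(r'[^\w\s]', '', flattened)
    # Lowercase so that "Ring" and "ring" are counted together.
    lowered = no_punctuation.lower()
    # str.split() with no arguments splits on runs of whitespace
    # and never yields empty strings.
    return Counter(lowered.split())


# Hypothetical sample text, not taken from the scraped page.
sample = "Let it ring. Let it RING!\r\nRing, again."
counts = count_words(sample)
print(counts.most_common(3))
```

Note that stripping punctuation this way also merges contractions and hyphenated words (for example, `don't` becomes `dont`), which may or may not be desirable depending on the analysis.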

#### **Purpose of the Project:**  
- **Data Collection:** Demonstrates the use of web scraping techniques to extract unstructured text data from a webpage.  
- **Text Cleaning:** Highlights preprocessing steps to prepare raw text for analysis.  
- **Data Analysis:** Provides insights into the most frequently used words in the speech.  
- **Data Export:** Stores the results in a structured format (CSV) for further exploration and visualization.
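
As an illustration of the export and follow-up exploration, the sketch below writes a few counts in the same two-column layout (`Word`, `Counts`) and reads them back to list the most frequent words. The counts are hypothetical, and the standard-library `csv` module with an in-memory buffer stands in for `pandas` and a file on disk, purely to keep the example self-contained.

```python
import csv
import io

# Hypothetical word counts standing in for the real analysis output.
word_counts = {"ring": 3, "let": 2, "it": 2, "again": 1}

# Write the counts in the two-column layout (Word, Counts),
# sorted from most to least frequent.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Word", "Counts"])
for word, count in sorted(word_counts.items(), key=lambda item: item[1], reverse=True):
    writer.writerow([word, count])

# Read the CSV back and keep the words that occur at least twice.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
frequent = [(row["Word"], int(row["Counts"])) for row in rows if int(row["Counts"]) >= 2]
print(frequent)
```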


#### **Webpage Link:**
- <http://analytictech.com/mb021/mlk.htm>

This project serves as a foundational example of how web scraping, text processing, and data analysis can be integrated into a cohesive workflow using Python.

In [44]:
# Importing the BeautifulSoup class from the bs4 library for web scraping and HTML parsing.
from bs4 import BeautifulSoup  

# Importing the requests library to make HTTP requests to fetch web page content.
import requests  

# Importing the re module for regular expression operations, useful for pattern matching in text.
import re  

# Importing the pandas library to work with data in a tabular format, such as dataframes.
import pandas as pd  


In [46]:
# Define the URL of the webpage containing the text of Martin Luther King Jr.'s speech.
url = 'http://analytictech.com/mb021/mlk.htm'

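# Send an HTTP GET request to the specified URL to retrieve the page content.
# The timeout keeps the request from hanging indefinitely on an unresponsive server.
page = requests.get(url, timeout=10)

# Raise an exception if the server returned an error status (4xx or 5xx).
page.raise_for_status()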

# Parse the HTML content of the page using BeautifulSoup.
soup = BeautifulSoup(page.text, 'html')

# Extract all paragraphs (<p>) from the HTML document.
mlkj_speech = soup.find_all('p')

# Extract the text content from each paragraph tag and store it in a list.
speech_combined = [p.text for p in mlkj_speech]

# Combine all paragraph texts into a single string, separating each paragraph with a space.
string_speech = ' '.join(speech_combined)

# Display the first 100 characters of the speech string (for preview purposes).
# print() is needed because a notebook only echoes the last expression in a cell.
print(string_speech[:100])

# Clean the string by replacing carriage returns and newline characters with a space.
string_speech_cleaned = re.sub(r'[\r\n]+', ' ', string_speech)

# Remove all punctuation from the string using a regular expression.
speech_no_punct = re.sub(r'[^\w\s]', '', string_speech_cleaned)

# Convert the cleaned string into lowercase for uniformity.
speech_lower = speech_no_punct.lower()

# Split the lowercase string into individual words on runs of whitespace.
# str.split() with no arguments also ignores leading and trailing whitespace,
# so no empty strings end up in the word list.
speech_broken_out = speech_lower.split()

# Create a pandas DataFrame from the list of words and count the frequency of each unique word.
df = pd.DataFrame(speech_broken_out).value_counts()

# Export the word frequency data to a CSV file with the word as the index and its count as a column.
df.to_csv(r'Web Scraper + Regular Expression Project/mlk_speech_counts.csv', header=['Counts'], index_label='Word')
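
To sanity-check the paragraph-extraction step without downloading the page, the same idea can be exercised on an inline HTML snippet. The sketch below is a dependency-free stand-in for `soup.find_all('p')`, not the method used above: it relies on the standard-library `html.parser.HTMLParser`, and both the `ParagraphCollector` class and the sample HTML are hypothetical names and data introduced only for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        # A new <p> always starts a fresh entry, even if the previous
        # paragraph was never explicitly closed.
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        # Text outside of any <p> (headings, titles) is ignored.
        if self._inside_p:
            self.paragraphs[-1] += data


# Hypothetical inline HTML used in place of the downloaded page.
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> line.</p><p>Second line.</p>"
    "</body></html>"
)
collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
paragraphs = collector.paragraphs
print(paragraphs)
```

The resulting list plays the same role as `speech_combined` in the cell above and can be joined with `' '.join(paragraphs)` before cleaning.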
