# Part 1: Data Collection and Wrangling

## Introduction to Data Collection
- Data collection is the systematic approach to gathering and measuring information from varied sources to get a complete and accurate picture of an area of interest.
- In big data settings, what matters is not only the volume of data but also the diversity of sources and data types, from traditional databases to Internet of Things (IoT) devices.

## Web Scraping
- Web scraping is a data collection technique that extracts data from websites.
- It converts the content of web pages into a structured format that can be analyzed and reused in other applications.

## Key Python Libraries
- `requests`: Sends HTTP requests and gives you access to the response's status code, headers, and content.
- `BeautifulSoup` from `bs4`: A Python package for parsing HTML and XML documents. It creates parse trees that can be used to extract data from HTML, which is very useful for web scraping.
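
As a minimal sketch of how the parse tree is used, the snippet below parses a small hand-written HTML string and pulls out text and an attribute. The markup is an assumption made for illustration; in practice it would typically come from `response.text` after a `requests.get` call.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written inline so the example runs without network access.
html = """
<html><body>
  <h1>Reading List</h1>
  <ul>
    <li><a href="/books/1">Dune</a></li>
    <li><a href="/books/2">Emma</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() returns its text content.
title = soup.find("h1").get_text()

# find_all() returns every matching tag; attributes are read like dictionary keys.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```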

## Data Wrangling
- Data wrangling, sometimes called data munging, is the process of transforming and mapping raw data into a format better suited to downstream uses such as analytics.
- A major part of data wrangling is data cleaning, which covers tasks such as removing duplicates, correcting errors, and handling missing values.
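
The three cleaning tasks named above can be sketched on a tiny table. The data and column names are hypothetical, and dropping rows is only one of several reasonable ways to handle a missing value.

```python
import pandas as pd

# Hypothetical raw records with a duplicate row, inconsistent formatting, and a missing value.
raw = pd.DataFrame({
    "city": ["Paris", "paris ", "Lyon", "Lyon", None],
    "temp_c": [21.0, 21.0, 19.5, 19.5, 18.0],
})

clean = raw.copy()
# Correct formatting errors: trim whitespace and normalize capitalization.
clean["city"] = clean["city"].str.strip().str.title()
# Deal with missing values: here, rows without a city are simply dropped.
clean = clean.dropna(subset=["city"])
# Remove rows that are now exact duplicates of an earlier row.
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Note that the order matters in this sketch: normalizing the text first lets `drop_duplicates` recognize `"paris "` and `"Paris"` as the same city.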

## Pandas for Data Wrangling
- `pandas` is a cornerstone library for data analysis in Python, providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.
- It offers data structures like DataFrames and Series, along with a comprehensive set of operations for data manipulation, cleaning, and preparation.
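
To make the two structures concrete, here is a short sketch with made-up inventory data. It shows that a DataFrame column is itself a Series and that filtering and arithmetic work on whole columns at once.

```python
import pandas as pd

# A DataFrame is a table of labeled columns that share one index (hypothetical data).
stock = pd.DataFrame(
    {"price": [3.5, 2.0, 4.25], "quantity": [10, 0, 4]},
    index=["apple", "pear", "plum"],
)

# Selecting a single column gives back a Series, a one-dimensional labeled array.
price = stock["price"]

# Boolean masks filter rows; arithmetic is applied element-wise across columns.
in_stock = stock[stock["quantity"] > 0]
inventory_value = (in_stock["price"] * in_stock["quantity"]).sum()

print(type(price).__name__)
print(in_stock)
print(inventory_value)
```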

## Ethical Considerations
- Respect the `robots.txt` file and terms of service of the website.
- Avoid sending too many requests to the website in a short period, which might overload the server.
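
Both points can be checked in code. The sketch below uses the standard library's `urllib.robotparser` on a hypothetical `robots.txt`, parsed inline so that nothing is fetched, and pauses between simulated requests. The user-agent name, paths, and rules are assumptions for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed inline so the example needs no network.
# For a live site you would call set_url(".../robots.txt") followed by read() instead.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 1",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "my-course-scraper"  # hypothetical user-agent name
allowed = parser.can_fetch(user_agent, "https://example.com/page/1")
blocked = parser.can_fetch(user_agent, "https://example.com/private/data")

# Fall back to a one-second pause if the site declares no crawl delay.
delay = parser.crawl_delay(user_agent) or 1

for path in ["/page/1", "/page/2"]:
    if parser.can_fetch(user_agent, "https://example.com" + path):
        print("would fetch", path)  # a real scraper would call requests.get here
        time.sleep(delay)  # pause between requests to avoid overloading the server

print(allowed, blocked, delay)
```

A `robots.txt` check does not replace reading the site's terms of service; it only covers what the site has declared for crawlers.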

## Legal Considerations
- Ensure that your web scraping activities are legal and do not infringe on copyright laws or privacy regulations.
- Always be transparent about your data collection methods and purposes.

# Part 2: Follow Me - Web Scraping and Data Cleaning

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Web Scraping Example: Retrieving Quotes and Authors
We will begin with a simple web scraping example using the `requests` and `BeautifulSoup` libraries to retrieve quotes and their authors from a demo website.

In [2]:
# Sending a GET request to the website (the timeout keeps the call from hanging indefinitely)
url = 'http://quotes.toscrape.com/'
response = requests.get(url, timeout=10)

In [3]:
# Ensuring we received a successful response
if response.ok:
    print("Response is successful!")
else:
    print("Failed to retrieve data.")


Response is successful!


In [4]:
# Parsing the page using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
print("Page parsed with BeautifulSoup.")


Page parsed with BeautifulSoup.


In [5]:
# Finding all the span tags that contain quotes
quotes = soup.find_all('span', class_='text')
# Finding all the small tags that contain authors
authors = soup.find_all('small', class_='author')

# Extracting the text from the quotes and authors
quotes_text = [quote.get_text() for quote in quotes]
authors_names = [author.get_text() for author in authors]
print("Extracted quotes and authors.")


Extracted quotes and authors.


In [6]:
# Creating a DataFrame from the scraped data
quotes_df = pd.DataFrame({
    'Quote': quotes_text,
    'Author': authors_names
})

# Displaying the first few entries of the DataFrame
print("Scraped Data:")
print(quotes_df.head())


Scraped Data:
                                               Quote           Author
0  “The world as we have created it is a process ...  Albert Einstein
1  “It is our choices, Harry, that show what we t...     J.K. Rowling
2  “There are only two ways to live your life. On...  Albert Einstein
3  “The person, be it gentleman or lady, who has ...      Jane Austen
4  “Imperfection is beauty, madness is genius and...   Marilyn Monroe


In [7]:
# Before dropping duplicates
print("\nNumber of entries before removing duplicates:", quotes_df.shape[0])

# Removing duplicate rows (reassigning the result instead of using inplace=True)
quotes_df = quotes_df.drop_duplicates()

# After dropping duplicates
print("Number of entries after removing duplicates:", quotes_df.shape[0])



Number of entries before removing duplicates: 10
Number of entries after removing duplicates: 10
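
One caveat about the extraction step above: collecting quotes and authors with two separate `find_all` calls assumes the two lists line up one-to-one. If a page ever omitted an author, later authors would shift up a row, or the DataFrame constructor would fail on unequal lengths. A possible alternative, sketched here on hypothetical inline markup rather than the live site, is to loop over each quote's container and read both fields from it.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup; the second block deliberately has no author tag.
html = """
<div class="quote"><span class="text">First quote.</span><small class="author">Ann Lee</small></div>
<div class="quote"><span class="text">Second quote.</span></div>
<div class="quote"><span class="text">Third quote.</span><small class="author">Bo Chen</small></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for block in soup.find_all("div", class_="quote"):
    text_tag = block.find("span", class_="text")
    author_tag = block.find("small", class_="author")
    rows.append({
        "Quote": text_tag.get_text(strip=True) if text_tag else None,
        # A missing tag becomes None instead of shifting later authors up a row.
        "Author": author_tag.get_text(strip=True) if author_tag else None,
    })

paired_df = pd.DataFrame(rows)
print(paired_df)
```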


# Part 3: Your Turn - Advanced Web Scraping and Data Wrangling

This part of the course will engage you in a real-world web scraping and data wrangling project. You'll apply the techniques you've learned to extract data from the web and then clean and prepare it for analysis. This will involve handling various data complexities such as missing values and format inconsistencies.

## Web Scraping:
- Use the `requests` and `BeautifulSoup` libraries to perform web scraping.
- Identify a website that allows scraping and collect data, ensuring you comply with any terms of service.

## Data Cleaning:
- Data may come with missing values or inconsistencies. Determine the best strategies for handling these issues, whether it's filling in missing values (imputation) or removing affected rows.
- Clean your data to ensure accuracy in your analysis, which might involve addressing outliers or errors.
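
As a starting point, the sketch below contrasts dropping and imputing on hypothetical product data, then flags outliers with the 1.5 × IQR rule. That rule is one common convention, not the only option, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical scraped product data with gaps and one implausible price.
products = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "price": [10.0, None, 12.0, 11.0, 500.0],
    "rating": [4.5, 4.0, None, 3.5, 4.0],
})

# Count missing values per column before choosing a strategy.
missing_counts = products.isna().sum()

# Strategy 1: drop rows where an essential field (price) is missing.
cleaned = products.dropna(subset=["price"]).copy()

# Strategy 2: impute a less critical field with the column median.
cleaned["rating"] = cleaned["rating"].fillna(cleaned["rating"].median())

# Flag (rather than delete) prices outside 1.5 * IQR of the middle 50%.
q1, q3 = cleaned["price"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned["price_outlier"] = (
    (cleaned["price"] < q1 - 1.5 * iqr) | (cleaned["price"] > q3 + 1.5 * iqr)
)

print(missing_counts)
print(cleaned)
```

Flagging keeps the suspicious row visible, so you can decide later whether it is a scraping error or a genuine extreme value.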

## Feature Engineering:
- Consider ways to enrich your data, such as creating new columns that summarize or combine existing information in useful ways.
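
For instance, a quotes table similar in shape to the one built in Part 2 could be enriched as follows. The rows are made up and the new column names are only suggestions.

```python
import pandas as pd

# Hypothetical data with the same two columns as the Part 2 DataFrame.
quotes = pd.DataFrame({
    "Quote": ["Short one.", "A somewhat longer example quote.", "Why not?"],
    "Author": ["Ann Lee", "Bo Chen", "Ann Lee"],
})

# Summarize an existing column: number of words in each quote.
quotes["word_count"] = quotes["Quote"].str.split().str.len()

# Derive a flag from the text content.
quotes["is_question"] = quotes["Quote"].str.endswith("?")

# Combine information across rows: how many quotes each author has in the table.
quotes["author_quote_count"] = quotes.groupby("Author")["Quote"].transform("count")

print(quotes)
```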

## Advanced Aggregations:
- Use pandas functionality such as `.groupby()` to organize your data by relevant categories and compute aggregate statistics.
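
A minimal sketch of grouped aggregation, using hypothetical data and pandas named aggregation to compute several statistics per category in one pass:

```python
import pandas as pd

# Hypothetical item-level data.
items = pd.DataFrame({
    "category": ["fiction", "fiction", "poetry", "poetry", "poetry"],
    "price": [12.0, 8.0, 5.0, 7.0, 9.0],
    "rating": [4, 5, 3, 4, 5],
})

# Each keyword names an output column and maps to (input column, aggregation).
summary = (
    items.groupby("category")
    .agg(
        n_items=("price", "count"),
        mean_price=("price", "mean"),
        max_rating=("rating", "max"),
    )
    .sort_values("mean_price", ascending=False)
)

print(summary)
```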

## Visualization:
- Translate your data into visual form with libraries like Matplotlib or Seaborn. Choose plots that effectively communicate the patterns you've discovered.
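
As one possible starting point, the sketch below draws a horizontal bar chart with Matplotlib from made-up counts. The data, labels, and output filename are all assumptions; horizontal bars are chosen here because long category names stay readable.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical aggregated counts, for example the result of a groupby.
counts = pd.Series({"Ann Lee": 4, "Bo Chen": 2, "Cy Diaz": 1}, name="quotes")

fig, ax = plt.subplots(figsize=(6, 3))
# Sorting first places the largest bar at the top of the horizontal chart.
counts.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Number of quotes")
ax.set_ylabel("Author")
ax.set_title("Quotes per author (hypothetical data)")
fig.tight_layout()
fig.savefig("quotes_per_author.png")  # hypothetical output filename
```

In a Jupyter notebook the figure is normally displayed inline as well, so the `savefig` call is optional.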

## Instructions:
1. Scrape data from your chosen website using `requests` and `BeautifulSoup`.
2. Load the scraped data into a pandas DataFrame.
3. Perform necessary data cleaning and feature engineering.
4. Aggregate and analyze the data to find trends and insights.
5. Visualize your findings with appropriate graphs.
6. Compile your steps and insights into the Jupyter notebook and submit it as your completed assignment.