Python Web Scraping Project

This project demonstrates how to perform web scraping using Python, focusing on the playwright and BeautifulSoup4 libraries. It covers scraping information from a job site and saving the collected data into a CSV file. Additionally, it briefly explores the feasibility of scraping Instagram and TikTok.

Required Libraries

The main Python libraries required for this project are:

playwright: A library for browser automation, useful for data collection on dynamic websites.
beautifulsoup4: A library for extracting data from HTML and XML files.
pandas: A library for data analysis and manipulation, used to save scraped data into CSV files.

Installation

To install the necessary libraries for this project, use the following command:

pip install playwright beautifulsoup4 pandas

Before using playwright, you need to download the browser binaries it controls with:

playwright install
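
As an optional check after installing, the short sketch below reports which of the three packages Python cannot find. The helper name missing_packages is made up for this example. Note that beautifulsoup4 is imported under the module name bs4, and that this check does not confirm the browser binaries were downloaded.

```python
from importlib.util import find_spec

# Maps the pip package name to the module name used in import statements.
# beautifulsoup4 is imported as bs4; the other two share their pip name.
REQUIRED = {
    "playwright": "playwright",
    "beautifulsoup4": "bs4",
    "pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```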

Web Scraping Overview

The project follows these steps:

Loading Web Pages: Use playwright to load the web page and scroll to the section containing the desired data.
Extracting Data: Parse the page's HTML with BeautifulSoup and extract the needed data.
Saving Data: Use pandas to save the extracted data into a CSV file.
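
The three steps above can be sketched as follows. This is a minimal illustration, not the code in the scrape_x.py files: the function names, the CSS selectors (div.job-card, .title, .company), the scroll distance, and the sample HTML are all assumptions made for the example, and a real job site will need its own selectors.

```python
from bs4 import BeautifulSoup
import pandas as pd

COLUMNS = ["title", "company", "link"]

# Made-up markup standing in for a rendered job listing page.
SAMPLE_HTML = """
<div class="job-card">
  <a href="/jobs/1"><span class="title">Backend Engineer</span></a>
  <span class="company">Acme</span>
</div>
<div class="job-card">
  <a href="/jobs/2"><span class="title">Data Analyst</span></a>
  <span class="company">Globex</span>
</div>
"""


def fetch_html(url: str) -> str:
    """Step 1: load the page in headless Chromium and return its HTML."""
    # Imported here so the parsing and saving helpers run without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Scroll down so lazily loaded listings render (distance is a guess).
        page.mouse.wheel(0, 5000)
        page.wait_for_timeout(1000)
        html = page.content()
        browser.close()
    return html


def parse_jobs(html: str) -> list[dict]:
    """Step 2: extract one dict per job card from the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.job-card"):  # hypothetical selector
        title = card.select_one(".title")
        company = card.select_one(".company")
        link = card.select_one("a")
        jobs.append({
            "title": title.get_text(strip=True) if title else "",
            "company": company.get_text(strip=True) if company else "",
            "link": link.get("href", "") if link else "",
        })
    return jobs


def save_jobs(jobs: list[dict], path_or_buffer) -> pd.DataFrame:
    """Step 3: write the jobs to CSV and return the DataFrame."""
    df = pd.DataFrame(jobs, columns=COLUMNS)
    df.to_csv(path_or_buffer, index=False)
    return df
```

For a live page, calling save_jobs(parse_jobs(fetch_html(url)), path) with a URL and a CSV path of your choice would chain the three steps. Keeping the parsing separate from the browser step makes it possible to try selectors against saved HTML without reloading the page.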

Instagram and TikTok Scraping Test

Instagram and TikTok implement robust mechanisms to prevent scraping. This project briefly explores the possibility of scraping these platforms, but users must adhere to the platforms' terms of service.
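
One small courtesy check before attempting any scrape is to consult the site's robots.txt. The sketch below uses the standard library's urllib.robotparser for that. The rules shown are made-up sample text, not taken from Instagram, TikTok, or any real site, and the helper name is_allowed is hypothetical. A robots.txt file is a separate, narrower signal than the terms of service, so passing this check does not by itself mean scraping is permitted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/jobs", "https://example.com/private/page"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, url))
```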

Project Structure

The project is structured as follows:

lecture_x.py: Simple syntax explanations and example code (lecture1.py through lecture4.py).
scrape_x.py: The main scripts containing the web scraping logic (scrape_1.py through scrape_5.py).
jobs/: A directory for storing scraped data.

Usage Example

This README gives an overview of the project's purpose, the technologies used, installation, and project structure. For the full implementation, see the scrape_x.py files.
