This project is a C# example that extracts job information from HTML documents using the HtmlAgilityPack library. The extracted data is then serialized to both CSV and JSON for further analysis and integration with other systems.
To run this project, ensure that you have the following:
- .NET SDK (version 7.0.0 or higher)
- HtmlAgilityPack library
- Clone the repository: `git clone https://github.com/agailloty/webscraping-cs.git`
- Navigate to the project directory: `cd webscraping-cs`
- Restore the NuGet packages: `dotnet restore`
- Place the HTML files containing job information in the "files" directory within the project.
- Open the `Program.cs` file and locate the `Main` method.
- Within the `Main` method, ensure that the file encoding matches the encoding of your HTML files. Modify the following line if necessary: `doc.Load(file, Encoding.UTF8, false);`
- Run the project: `dotnet run`
- The program will extract job information from each HTML file in the "files" directory, serialize it into CSV and JSON formats, and save the results as "jobs.csv" and "jobs.json", respectively.
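
The serialization step can be sketched roughly as follows. This is a minimal illustration, not the code in `Program.cs`: the `Job` record, its three fields, and the `JobExport` helper are hypothetical names chosen for the example, and `System.Text.Json` is assumed as the JSON serializer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical shape of one extracted job; the real project may use other fields.
public record Job(string Title, string Company, string Location);

public static class JobExport
{
    // Quote a field only when it contains a comma, a quote, or a line break,
    // doubling any embedded quotes (the usual CSV convention).
    public static string EscapeCsv(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Produces a header line followed by one line per job.
    public static IEnumerable<string> ToCsvLines(IEnumerable<Job> jobs)
    {
        yield return "Title,Company,Location";
        foreach (var job in jobs)
        {
            var fields = new[] { job.Title, job.Company, job.Location };
            yield return string.Join(",", fields.Select(EscapeCsv));
        }
    }

    // Serializes the jobs as an indented JSON array.
    public static string ToJson(IEnumerable<Job> jobs) =>
        JsonSerializer.Serialize(jobs, new JsonSerializerOptions { WriteIndented = true });
}
```

Under these assumptions, the results could be written with `File.WriteAllLines("jobs.csv", JobExport.ToCsvLines(jobs))` and `File.WriteAllText("jobs.json", JobExport.ToJson(jobs))`.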
The project does not require any additional configuration. However, you can modify the following aspects according to your needs:
- File path and directory:
  - The "files" directory is used to store the input HTML files. You can change the directory name or location by modifying the `Directory.GetFiles("files")` line in the `Main` method.
  - The output CSV file name and path can be customized in the `File.WriteAllLines("jobs.csv", ...)` line.
  - The output JSON file name and path can be customized in the `Persist("jobs.json", ...)` line.
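
If you would rather not edit these lines each time, one option is to read the paths from the command line and fall back to the defaults listed above. The sketch below is only a suggestion: the `Paths` class and its `Resolve` method are hypothetical and are not part of the project.

```csharp
using System;

public static class Paths
{
    // Returns the input directory and the two output paths.
    // Missing arguments fall back to the defaults described above.
    public static (string InputDir, string CsvPath, string JsonPath) Resolve(string[] args)
    {
        string inputDir = args.Length > 0 ? args[0] : "files";
        string csvPath = args.Length > 1 ? args[1] : "jobs.csv";
        string jsonPath = args.Length > 2 ? args[2] : "jobs.json";
        return (inputDir, csvPath, jsonPath);
    }
}
```

With such a helper wired into `Main`, the program could be invoked as `dotnet run -- data out/jobs.csv out/jobs.json`, where everything after `--` is passed to the program as its arguments.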
This project utilizes the following dependencies:
- HtmlAgilityPack (version 1.11.46): A library for parsing and manipulating HTML documents. It is used for extracting job information from HTML files.
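
As a rough illustration of how HtmlAgilityPack can pull values out of a page, the sketch below collects job titles with an XPath query. The `JobTitles` class and the `//h2[@class='job-title']` selector are assumptions made for this example; the actual markup of your HTML files, and the selectors used in `Program.cs`, may differ.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class JobTitles
{
    // Parses HTML from a string. To read a file instead, call
    // doc.Load(path, Encoding.UTF8, false) in place of doc.LoadHtml(html).
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Hypothetical selector: assumes each title is an <h2 class="job-title">.
        var nodes = doc.DocumentNode.SelectNodes("//h2[@class='job-title']");

        var titles = new List<string>();
        if (nodes == null)
            return titles; // SelectNodes returns null when nothing matches

        foreach (var node in nodes)
        {
            // Decode HTML entities such as &amp; and trim surrounding whitespace.
            titles.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return titles;
    }
}
```

The null check matters because `SelectNodes` returns `null` rather than an empty collection when no node matches, which is a common source of `NullReferenceException` in scraping code.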
This project is licensed under the MIT License.
Contributions to this project are welcome. Feel free to open issues or submit pull requests to suggest improvements or bug fixes.
For any questions or inquiries, please contact Axel-Cleris Gailloty.