Learn HTML Agility Pack Step by step --> Web Scraping using C#

Visit Website for More...

Scrape websites using HTML Agility Pack C#

✌️ Chapter 1: What is HTML Agility Pack? 🤠

So what exactly is HTML Agility Pack 😍, and why is it used? It is often necessary to read, or technically speaking parse, an HTML document whose source could be a file, a string, or a web page. HTML Agility Pack is a .NET library that lets the C# developer 😗 read and write the DOM (Document Object Model), with explicit support for plain XPath and XSLT. The bonus ☺️ is that you don't even have to know these terms to get started. The library is forgiving: it still works when the source HTML is malformed or does not follow the standards. That makes it a better choice than writing the parsing code all by yourself.

[Subscribe YouTube Channel](http://bit.ly/2lSE3r6)

Let's Kick off with HTML Agility Pack 😋

✌️ Chapter 2: Learn to Install HTML agility pack and Load an HTML Document

1. First, open the HtmlAgilityPack package page on nuget.org.
2. Under the Package Manager section, copy the install command. For example, if the page shows PM> Install-Package HtmlAgilityPack -Version x.x.x, copy the text that follows PM>.
3. In Visual Studio, click the Tools menu in the menu bar.
4. From the drop-down, go to NuGet Package Manager → Package Manager Console.
5. The Package Manager Console opens in the lower half of the window, with the cursor blinking.
6. Paste the command you copied in step 2 with Ctrl+V ☺️.
7. Press Enter, and Visual Studio takes care of the installation 😃.

👉 Coding Snippet

// download the page and parse it into an HtmlDocument
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://technologycrowds.com");
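
Chapter 1 mentions that the source can be a file, a string, or a web page. The following is a minimal sketch of those three loading paths, not a definitive recipe. The file name page.html and the inline markup are placeholders chosen for illustration and are not part of the original tutorial.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public class LoadExamples
{
    public static void Main()
    {
        // 1. From a string that is already in memory
        var fromString = new HtmlDocument();
        fromString.LoadHtml("<html><body><h1>Hello</h1></body></html>");
        Console.WriteLine(fromString.DocumentNode.SelectSingleNode("//h1").InnerText);

        // 2. From a file on disk (page.html is a hypothetical path)
        const string path = "page.html";
        if (File.Exists(path))
        {
            var fromFile = new HtmlDocument();
            fromFile.Load(path);
            Console.WriteLine(fromFile.DocumentNode.OuterHtml.Length);
        }

        // 3. From a URL, using HtmlWeb as in the snippet above
        var web = new HtmlWeb();
        HtmlDocument fromWeb = web.Load("https://technologycrowds.com");
        // The page may have no <title>, so guard against a null node
        Console.WriteLine(fromWeb.DocumentNode.SelectSingleNode("//title")?.InnerText);
    }
}
```

All three paths end with the same HtmlDocument type, so the querying code in the later chapters works regardless of where the HTML came from.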

✌️ Chapter 3: How to Get elements by class in Html Agility Pack C#

HTML Agility Pack is a handy tool: it lets you traverse the entire HTML content of a web page in C#, and that content can be manipulated with just as much ease. The snippet below filters elements by class name using HasClass.

👉 Coding Snippet

using System;
using HtmlAgilityPack;
using System.Collections.Generic;
using System.Linq;

public class Program
{
	public static void Main()
	{
		// load the page into an HtmlDocument
		HtmlWeb web = new HtmlWeb();
		HtmlAgilityPack.HtmlDocument doc = web.Load("https://en.wikipedia.org/wiki/Main_Page");
		
		// filter html elements on the basis of class name
		IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n => n.HasClass("mw-jump-link"));
		
		foreach(var item in nodes)
		{
			// displaying final output
			Console.WriteLine(item.InnerText);	
		}
	}
}
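
The same filter can also be written as an XPath expression, which may be convenient when the rest of a query is already in XPath. This is only a sketch under assumptions: the inline markup is invented for illustration, and the padded contains() test is a common XPath 1.0 idiom for whole-word class matching rather than anything specific to this tutorial.

```csharp
using System;
using HtmlAgilityPack;

public class ClassXPathExample
{
    public static void Main()
    {
        // Hypothetical markup: one element carries the class, the other only a similar name
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<div>
<a class=""nav mw-jump-link"" href=""#a"">Jump to navigation</a>
<a class=""mw-jump-link-extra"" href=""#b"">Not a match</a>
</div>");

        // Pad the class attribute with spaces so that only whole class names match
        string xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' mw-jump-link ')]";
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null when nothing matches
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.InnerText);
            }
        }
    }
}
```

Only the first link is selected here, because the second one has a class name that merely starts with the same text. HasClass gives the same whole-name behaviour with less ceremony, so the XPath form is mostly useful when combined with other path steps.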

✌️ Chapter 4: Extract Meta-Information from the website using HTML agility pack

Namespace

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;

Load HTML document using HTML Agility Pack

HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc = web.Load("http://technologyCrowds.com");
GetMetaInformation(doc, "description");

👉 Creating Methods to Extract Meta Information

static void GetMetaInformation(HtmlAgilityPack.HtmlDocument htmldoc, string value)
{
 // find the <meta> tag whose name attribute matches the requested value
 HtmlNode tcNode = htmldoc.DocumentNode.SelectSingleNode("//meta[@name='" + value + "']");
 if (tcNode != null)
 {
  // the content attribute may be missing, so check before reading it
  HtmlAttribute desc = tcNode.Attributes["content"];
  if (desc != null)
  {
   Console.ForegroundColor = ConsoleColor.Red;
   Console.Write(desc.Value);
  }
  Console.ReadLine();
 }
}
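
If more than one meta tag is needed, the lookup above can be generalized. The sketch below collects every named meta tag into a dictionary; the helper name GetAllMeta and the sample markup are assumptions made for illustration and are not part of the library or the original tutorial.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class MetaExample
{
    // Hypothetical helper: collects every <meta name="..." content="..."> pair
    static Dictionary<string, string> GetAllMeta(HtmlDocument doc)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        HtmlNodeCollection metaNodes = doc.DocumentNode.SelectNodes("//meta[@name]");
        if (metaNodes == null)
        {
            return result; // no named meta tags in the document
        }
        foreach (HtmlNode meta in metaNodes)
        {
            // GetAttributeValue falls back to the default when the attribute is absent
            string name = meta.GetAttributeValue("name", string.Empty);
            string content = meta.GetAttributeValue("content", string.Empty);
            if (name.Length > 0)
            {
                result[name] = content; // a later duplicate overwrites an earlier one
            }
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<html><head>
<meta name=""description"" content=""Sample description"">
<meta name=""keywords"" content=""html, parsing"">
</head><body></body></html>");

        foreach (var pair in GetAllMeta(doc))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

Using GetAttributeValue with a default avoids the null check that indexing Attributes["content"] requires.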

✌️ Chapter 5: Select Nodes using Html Agility Pack

SelectNodes()

👉 Coding Snippet

var html = @"<TD>
</TD>
<TD>
<INPUT value=Technology>
<INPUT value=Crowds>
</TD>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

// SelectNodes returns every node that matches the XPath expression
var nodes = htmlDoc.DocumentNode.SelectNodes("//td/input");

foreach (var node in nodes)
{
  Console.WriteLine(node.Attributes["value"].Value);
}

Output

Technology
Crowds

SelectSingleNode(String)

SelectSingleNode takes an XPath expression and returns the first HtmlAgilityPack.HtmlNode that matches it. The return value is null if no node matches.

👉 Coding Snippet

var html = @"<TD>
</TD>
<TD>
<INPUT value=Technology>
<INPUT value=Crowds>
</TD>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

// SelectSingleNode returns only the first matching node
var value = htmlDoc.DocumentNode.SelectSingleNode("//td/input")
          .Attributes["value"].Value;
Console.WriteLine(value);

Output

Technology
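
Since both methods can return null when nothing matches, it may help to guard the result before using it. The following is a small sketch of one way to do that; the //td/select query is an arbitrary example of an expression with no matches, and the fallback strings are placeholders.

```csharp
using System;
using HtmlAgilityPack;

public class NullSafeSelect
{
    public static void Main()
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml("<TD><INPUT value=Technology></TD>");

        // No <select> elements exist, so SelectNodes returns null by default
        HtmlNodeCollection missing = htmlDoc.DocumentNode.SelectNodes("//td/select");
        Console.WriteLine(missing == null ? "no matches" : missing.Count + " matches");

        // The null-conditional operator avoids a NullReferenceException on a missing node
        string value = htmlDoc.DocumentNode
            .SelectSingleNode("//td/input")
            ?.GetAttributeValue("value", "(none)");
        Console.WriteLine(value ?? "(no input found)");
    }
}
```

Without such a check, a foreach over the result of SelectNodes throws as soon as a page does not contain the expected elements.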

✌️ Chapter 6: HTML Manipulation using HTML Agility Pack

Inner HTML

👉 Coding Snippet

var html =
@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>
<h1>.Net Core with Angular</h1>
<p>This is <b>HTML Agility Pack</b> sample</p>
</body>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/p");

foreach (var node in htmlNodes)
{
 // InnerHtml keeps the markup nested inside each <p>
 Console.WriteLine(node.InnerHtml);
}

Output

This is <b>C#, ASP.Net</b> paragraph
This is <b>HTML Agility Pack</b> sample

Inner Text

👉 Coding Snippet

var html =
@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>
<h1>.Net Core with Angular</h1>
<p>This is <b>HTML Agility Pack</b> sample</p>
</body>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/p");

foreach (var node in htmlNodes)
{
 // InnerText strips the nested tags and keeps only the text
 Console.WriteLine(node.InnerText);
}

Output

This is C#, ASP.Net paragraph
This is HTML Agility Pack sample

Outer Html

👉 Coding Snippet

var html =
@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>
   
<h1>.Net Core with Angular</h1>
<p>This is <b>HTML Agility Pack</b> sample</p>
</body>";

 var htmlDoc = new HtmlDocument();
 htmlDoc.LoadHtml(html);

 var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/p");

 foreach (var node in htmlNodes)
 {
  Console.WriteLine(node.OuterHtml);
 }

Output

<p>This is <b>C#, ASP.Net</b> paragraph</p>
<p>This is <b>HTML Agility Pack</b> sample</p>

Parent Node

👉 Coding Snippet

var html =
@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>   
<h1>.Net Core with Angular</h1>
<p>This is <b>HTML Agility Pack</b> sample</p>
</body>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

var node = htmlDoc.DocumentNode.SelectSingleNode("//body/h1");

HtmlNode parentNode = node.ParentNode;
Console.WriteLine(parentNode.Name);

Output

body
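
The properties above only read from the document. As a hedged sketch of actual manipulation, the example below rewrites a paragraph, adds an attribute, and removes a heading. The replacement text and the class name edited are invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

public class ModifyExample
{
    public static void Main()
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>
</body>");

        HtmlNode paragraph = htmlDoc.DocumentNode.SelectSingleNode("//body/p");

        // Replace the paragraph's markup and tag it with a class attribute
        paragraph.InnerHtml = "This is an <i>edited</i> paragraph";
        paragraph.SetAttributeValue("class", "edited");

        // Remove the heading from the tree entirely
        htmlDoc.DocumentNode.SelectSingleNode("//body/h1").Remove();

        // Serialize the modified document back to a string
        Console.WriteLine(htmlDoc.DocumentNode.OuterHtml);
    }
}
```

The changes live only in the in-memory document. Calling htmlDoc.Save with a file path would be one way to persist them.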


anjankant/LearnHTMLAgilityPack
