anjankant/LearnHTMLAgilityPack

Learn HTML Agility Pack Step by step --> Web Scraping using C#

Scrape websites using HTML Agility Pack C#

✌️ Chapter 1: What is HTML Agility Pack? 🤠

So what exactly is HTML Agility Pack 😍, and why is it used? You often need to read, or in technical terms parse, an HTML document whose source may be a file, a string, or a web page. HTML Agility Pack is a .NET library that lets C# developers 😗 read and write the DOM (Document Object Model), with support for XPath and XSLT, and the bonus ☺️ is that you can get started without knowing those technologies in depth. The parser is forgiving: it still builds a usable DOM when the source HTML is malformed. That makes it a much better choice than writing the parsing code yourself.
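To make the "forgiving parser" point concrete, here is a minimal sketch that parses a deliberately malformed string instead of a live site. The sample markup and the `ParseLenient` helper name are made up for illustration; `LoadHtml`, `ParseErrors` and `InnerText` are the library members it relies on.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LenientParsingDemo
{
    // Hypothetical helper (not part of the library): parse an HTML string into a document.
    public static HtmlDocument ParseLenient(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // malformed markup is tolerated rather than rejected
        return doc;
    }

    public static void Main()
    {
        // Illustrative markup: the <b> tag is never closed.
        HtmlDocument doc = ParseLenient("<div><b>bold text</div>");

        // Problems noticed while parsing are collected here instead of being thrown.
        Console.WriteLine("Parse errors: " + doc.ParseErrors.Count());

        // The text is still reachable through the DOM.
        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}
```

Checking `ParseErrors` is optional; it is mainly useful when you want to log how broken a page is before scraping it.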

[Subscribe YouTube Channel](http://bit.ly/2lSE3r6)

Let's kick off with HTML Agility Pack 😋

✌️ Chapter 2: Learn to Install HTML Agility Pack and Load an HTML Document

  • First, open the HtmlAgilityPack NuGet package page from the link.
  • Under the Package Manager section, copy the install command. For example, if it shows PM> Install-Package HtmlAgilityPack -Version x.x.x, copy the text that follows PM>.
  • Then go to Visual Studio and click the Tools menu in the menu bar.
  • From the drop-down, go to NuGet Package Manager → Package Manager Console.
  • The Package Manager Console opens in the lower half of the window with the cursor blinking.
  • Paste the command you copied in step 2 using Ctrl+V ☺️.
  • Press Enter and Visual Studio takes care of the installation 😃.

👉 Coding Snippet

HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://technologycrowds.com");

✌️ Chapter 3: How to Get Elements by Class in HTML Agility Pack C#

HTML Agility Pack can traverse the entire HTML content of a web page in C#, which also makes it easy to pick out specific parts of that content. The snippet below filters elements by class name.

👉 Coding Snippet

using System;
using HtmlAgilityPack;
using System.Collections.Generic;
using System.Linq;

public class Program
{
	public static void Main()
	{
		// declaring & loading dom
		HtmlWeb web = new HtmlWeb();
		HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
		doc = web.Load("https://en.wikipedia.org/wiki/Main_Page");
		
		// filter html elements on the basis of class name
		IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n => n.HasClass("mw-jump-link"));
		
		foreach(var item in nodes)
		{
			// displaying final output
			Console.WriteLine(item.InnerText);	
		}
	}
}

✌️ Chapter 4: Extract Meta-Information from the Website using HTML Agility Pack

Namespace

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;

Load HTML document using HTML Agility Pack

HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc = web.Load("http://technologyCrowds.com");
GetMetaInformation(doc, "description");

👉 Creating Methods to Extract Meta Information

static void GetMetaInformation(HtmlAgilityPack.HtmlDocument htmldoc, string value)
{
 // find the <meta> tag whose name attribute matches, e.g. "description"
 HtmlNode tcNode = htmldoc.DocumentNode.SelectSingleNode("//meta[@name='" + value + "']");
 if (tcNode != null)
 {
  // the content attribute holds the meta value; it may be missing
  HtmlAttribute desc = tcNode.Attributes["content"];
  if (desc != null)
  {
   Console.ForegroundColor = ConsoleColor.Red;
   Console.Write(desc.Value);
   Console.ResetColor();
  }
  Console.ReadLine();
 }
}
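If you would rather get the value back than print it, a variant along these lines may be handy. It is only a sketch: the `MetaReader` class, the `GetMetaContent` name and the inline sample HTML are assumptions made for illustration, and it parses a string so it runs without network access.

```csharp
using System;
using HtmlAgilityPack;

public static class MetaReader
{
    // Hypothetical helper: returns the content of <meta name="..."> or null when absent.
    public static string GetMetaContent(HtmlDocument doc, string name)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//meta[@name='" + name + "']");

        // Both the node and its content attribute can be missing, so guard each step.
        return node?.Attributes["content"]?.Value;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Illustrative page with a single description meta tag.
        doc.LoadHtml("<html><head><meta name=\"description\" content=\"Sample page\"></head><body></body></html>");

        Console.WriteLine(GetMetaContent(doc, "description") ?? "(not found)");
        Console.WriteLine(GetMetaContent(doc, "keywords") ?? "(not found)");
    }
}
```

Returning null for a missing tag leaves it to the caller to decide whether that is an error or simply a page without that meta entry.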

✌️ Chapter 5: Select Nodes using HTML Agility Pack

SelectNodes()

👉 Coding Snippet

var html = @"<TD>
</TD>
<TD>
<INPUT value=Technology>
<INPUT value=Crowds>
</TD>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// select every <input> that is a direct child of a <td>
var nodes = htmlDoc.DocumentNode.SelectNodes("//td/input");

foreach (var node in nodes)
{
  Console.WriteLine(node.Attributes["value"].Value);
}

Output

  • Technology
  • Crowds

SelectSingleNode(String)

SelectSingleNode takes an XPath expression and returns the first matching HtmlAgilityPack.HtmlNode. It returns null if no node matches.

👉 Coding Snippet

var html = @"<TD>
</TD>
<TD>
<INPUT value=Technology>
<INPUT value=Crowds>
</TD>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// take the first matching <input> and read its value attribute
var value = htmlDoc.DocumentNode.SelectSingleNode("//td/input")
          .Attributes["value"].Value;
Console.WriteLine(value);

Output

  • Technology
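Because both SelectNodes and SelectSingleNode can return null when nothing matches, real scraping code usually checks for that before looping. The following is a minimal sketch of that guard; the `SafeSelectDemo` class, the `GetInputValues` helper and the sample strings are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SafeSelectDemo
{
    // Hypothetical helper: collects the value attribute of every //td/input,
    // returning an empty list instead of failing when nothing matches.
    public static List<string> GetInputValues(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//td/input");
        if (nodes == null)
        {
            return new List<string>(); // no match: SelectNodes gave us null
        }

        // GetAttributeValue falls back to the second argument if the attribute is absent.
        return nodes.Select(n => n.GetAttributeValue("value", string.Empty)).ToList();
    }

    public static void Main()
    {
        var found = GetInputValues("<td><input value=Technology><input value=Crowds></td>");
        Console.WriteLine(string.Join(", ", found));

        var none = GetInputValues("<div>no inputs here</div>");
        Console.WriteLine("Matches without inputs: " + none.Count);
    }
}
```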

✌️ Chapter 6: HTML Manipulation using HTML Agility Pack

Inner HTML

👉 Coding Snippet

var html =
@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>
<h1>.Net Core with Angular</h1>
<p>This is <b>HTML Agility Pack</b> sample</p>
</body>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/p");

foreach (var node in htmlNodes)
{
 // InnerHtml keeps the markup inside each <p>
 Console.WriteLine(node.InnerHtml);
}

Output

  • This is <b>C#, ASP.Net</b> paragraph
  • This is <b>HTML Agility Pack</b> sample

Inner Text

👉 Coding Snippet

var html =
@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>
<h1>.Net Core with Angular</h1>
<p>This is <b>HTML Agility Pack</b> sample</p>
</body>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/p");

foreach (var node in htmlNodes)
{
 // InnerText drops the tags and keeps only the text
 Console.WriteLine(node.InnerText);
}

Output

  • This is C#, ASP.Net paragraph
  • This is HTML Agility Pack sample

Outer Html

👉 Coding Snippet

var html =
@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>
   
<h1>.Net Core with Angular</h1>
<p>This is <b>HTML Agility Pack</b> sample</p>
</body>";

 var htmlDoc = new HtmlDocument();
 htmlDoc.LoadHtml(html);

 var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/p");

 foreach (var node in htmlNodes)
 {
  Console.WriteLine(node.OuterHtml);
 }

Output

  • <p>This is <b>C#, ASP.Net</b> paragraph</p>
  • <p>This is <b>HTML Agility Pack</b> sample</p>
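The difference between InnerHtml, InnerText and OuterHtml is easiest to see on a single node. Here is a small sketch that prints all three for one paragraph; the `NodeViewsDemo` class and the `Views` helper are hypothetical names used only for this comparison.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeViewsDemo
{
    // Hypothetical helper: returns InnerHtml, InnerText and OuterHtml of the first <p>.
    public static string[] Views(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode p = doc.DocumentNode.SelectSingleNode("//p");
        return new[] { p.InnerHtml, p.InnerText, p.OuterHtml };
    }

    public static void Main()
    {
        string[] views = Views("<p>This is <b>HTML Agility Pack</b> sample</p>");

        Console.WriteLine(views[0]); // This is <b>HTML Agility Pack</b> sample
        Console.WriteLine(views[1]); // This is HTML Agility Pack sample
        Console.WriteLine(views[2]); // <p>This is <b>HTML Agility Pack</b> sample</p>
    }
}
```

In short, InnerHtml is the markup inside the element, InnerText is that content with the tags removed, and OuterHtml also includes the element's own tags.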

Parent Node

👉 Coding Snippet

var html =
@"<body>
<h1>.Net Core</h1>
<p>This is <b>C#, ASP.Net</b> paragraph</p>   
<h1>.Net Core with Angular</h1>
<p>This is <b>HTML Agility Pack</b> sample</p>
</body>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

var node = htmlDoc.DocumentNode.SelectSingleNode("//body/h1");

HtmlNode parentNode = node.ParentNode;
Console.WriteLine(parentNode.Name);

Output

  • body

read more...

** Free Video Library: Learn HTML Agility Pack Step by Step **
