<a href="https://colab.research.google.com/github/byiringiroscar/NLP_FELLOWSHIP/blob/main/HTML_scrapping_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with HTML
A lot of data is available on the internet. There are two main techniques for collecting it:


*   Web scraping - Extracting the underlying data from a page's HTML code and storing it in a new file format
*   Web crawling - Using bots to follow URL links across many pages, collect the data from each page, and store it. Search engines such as Google and Bing work this way



## Web Scraping
In this session, we will look at web scraping. We will examine news websites and see how to extract their articles.

We will use a Python package called Beautiful Soup.

`pip install beautifulsoup4`

To import the package:

`from bs4 import BeautifulSoup`

In [2]:
from bs4 import BeautifulSoup

In [2]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [3]:
# Read the html doc
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [4]:
soup.head

<head><title>The Dormouse's story</title></head>

In [5]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [6]:
mainbody = soup.body

In [7]:
# find a particular tag
soup.find('p')

<p class="title"><b>The Dormouse's story</b></p>

In [8]:
# find all p
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [9]:
# get the text
soup.find('p').get_text()

"The Dormouse's story"

In [10]:
# loop through the matching tags to get the text of each one
sisters = soup.find_all('a', class_='sister')

[a.get_text() for a in sisters]

['Elsie', 'Lacie', 'Tillie']
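
Besides its text, each tag also carries attributes such as `href` and `id`, which matter once we start collecting links. The following is a minimal sketch on a shortened copy of the document above; the names `snippet`, `links_soup`, `hrefs`, and `first` are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document, kept small for illustration
snippet = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""

links_soup = BeautifulSoup(snippet, 'html.parser')

# Tag attributes can be read like dictionary entries
hrefs = [a['href'] for a in links_soup.find_all('a', class_='sister')]
print(hrefs)

# .get() returns None instead of raising KeyError when an attribute is missing
first = links_soup.find('a')
print(first.get('id'), first.get('title'))
```

Using `.get()` is the safer choice on real pages, where not every tag is guaranteed to have the attribute you expect.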

### Practical example
Website - English Premier League ResultDB

**URL** - http://www.resultdb.com/english-premier-league-tables/

**Goal**: *Get the aggregated details of each team for a particular season* 
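
Before working with the live page, the row-extraction idea can be tried on a small inline table. This is only a sketch: the HTML in `sample_html` and the helper `extract_rows` are hypothetical and are not taken from ResultDB. It assumes the header row uses `<th>` cells and the data rows use `<td>` cells.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for illustration
sample_html = """
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td>Team A</td><td>80 pts</td></tr>
  <tr><td>2</td><td>Team B</td><td>70 pts</td></tr>
</table>
"""

def extract_rows(html):
    """Return the text of every <td> cell, one list per table row."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # the header row has only <th> cells, so it yields nothing
            rows.append(cells)
    return rows

sample_rows = extract_rows(sample_html)
print(sample_rows)
```

Each inner list corresponds to one table row, which is the shape `pd.DataFrame` expects.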


In [14]:
import requests
import pandas as pd

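# Build the URL for one season and download the page
year = '2000'
page = requests.get("http://www.resultdb.com/english-premier-league-tables/" + year + "/")
# Parse the HTML so it can be searched by tag
maindetails = BeautifulSoup(page.text, 'html.parser')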

# soup = BeautifulSoup(page.text,'lxml')
# table = soup.find('table')

# data = []
# rows = table.find_all('tr')
# for row in rows:
#     cols = row.find_all('td')
#     cols = [ele.text.strip() for ele in cols]
#     data.append([ele for ele in cols if ele]) # Get rid of empty values

# columns= ['position','team name','games','won','draw','lost','goal scored','goals conceded','goal difference','points']
# season = pd.DataFrame(data[1:],columns=columns)

In [63]:
import requests
import pandas as pd

base_url = "http://www.resultdb.com/english-premier-league-tables/"
columns = ['position', 'team_name', 'games', 'won', 'draw', 'lost',
           'goal scored', 'goals conceded', 'goal difference', 'points', 'years']
year_all = ['2000', '2001']

frames = []
for year in year_all:
  page = requests.get(base_url + year + "/")
  soup = BeautifulSoup(page.text, 'html.parser')
  table = soup.find('table')

  data = []
  # Skip the first row, as the single-season cells below do with data[1:]
  for row in table.find_all('tr')[1:]:
    cols = [ele.text.strip() for ele in row.find_all('td')]
    cols = [ele for ele in cols if ele]  # Get rid of empty values
    cols.append(year)  # Record which season this row belongs to
    data.append(cols)

  frames.append(pd.DataFrame(data, columns=columns))

# Stack the per-season frames into a single DataFrame
result = pd.concat(frames, ignore_index=True)
print(type(result))

<class 'pandas.core.frame.DataFrame'>


In [None]:
season

In [None]:
print(maindetails.prettify())

In [13]:
# The details are in the table tag. Find the table
table = maindetails.find('table')

# Table has rows. Get all the table rows. the result will be a list
rows = table.find_all('tr')



In [14]:
# Get the details in each row
# Loop through each row and collect the text of its cells
data = []
for row in rows:
  details = row.find_all('td')
  cols = [ele.text.strip() for ele in details]
  data.append([ele for ele in cols if ele])  # Get rid of empty values




In [15]:
# Create a dataframe where the data will be placed and processed
columns= ['position','team name','games','won','draw','lost','goal scored','goals conceded','goal difference','points']
season = pd.DataFrame(data[1:],columns=columns)

In [16]:
season.head()

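|   | position | team name | games | won | draw | lost | goal scored | goals conceded | goal difference | points |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Manchester United | 38 | 24 | 8 | 6 | 79 | 31 | 48 | 80 pts |
| 1 | 2 | Arsenal | 38 | 20 | 10 | 8 | 63 | 38 | 25 | 70 pts |
| 2 | 3 | Liverpool | 38 | 20 | 9 | 9 | 71 | 39 | 32 | 69 pts |
| 3 | 4 | Leeds | 38 | 20 | 8 | 10 | 64 | 43 | 21 | 68 pts |
| 4 | 5 | Ipswich Town | 38 | 20 | 6 | 12 | 57 | 42 | 15 | 66 pts |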
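
Every cell scraped this way is a string, and the points column carries a " pts" suffix, so the values need converting before any arithmetic. The following is a minimal sketch of one way to do that, using two rows copied from the output above; the name `sample` is illustrative only.

```python
import pandas as pd

# Two rows copied from the output above; scraped cells arrive as strings
sample = pd.DataFrame({
    'team name': ['Manchester United', 'Arsenal'],
    'won': ['24', '20'],
    'points': ['80 pts', '70 pts'],
})

# Drop the " pts" suffix, then convert the remaining digits to integers
sample['points'] = sample['points'].str.replace(' pts', '', regex=False).astype(int)
sample['won'] = sample['won'].astype(int)

print(sample['points'].tolist())
print(sample['won'].sum())
```

The same two steps could be applied to the full `season` DataFrame, assuming every value in the column follows the same pattern.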


In [17]:
# TODO convert the above to a function. Then get the details from 2000-2015, place all the details in one dataframe, add a column called season
# ENTER CODE HERE

## Assignment
Based on the above, get the main articles from igihe from February 2022 to the present.

Steps:


1.   Get the links to the main pages from January and put them in a list.
2.   From each main page, get the links to all the main articles.
3.   For each article, find the main tag that holds the text.
4.   Extract the text and store it in a txt file. The data will be used in week 2.
5.   Give each article its own txt file, named following the pattern date_article_1.
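
Two parts of the steps above do not need the network and can be sketched on their own: turning the `href` values found on a page into full URLs, and writing each article to its own file. This is a rough sketch under stated assumptions. The helpers `absolute_links` and `save_articles`, the `example.com` URLs, the sample texts, and the date format are all hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.parse import urljoin

def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]

def save_articles(texts, date, out_dir):
    """Write each text to <date>_article_<n>.txt and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(texts, start=1):
        path = out_dir / f"{date}_article_{number}.txt"
        # UTF-8 keeps apostrophes and accented characters intact
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths

# Hypothetical page URL and hrefs, for illustration only
links = absolute_links("https://example.com/news/",
                       ["article-1.html", "/sport/a.html"])
print(links)

# Write two placeholder texts into a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    saved = save_articles(["First article text", "Second article text"],
                          "2022-02-01", tmp)
    saved_names = [p.name for p in saved]
print(saved_names)
```

`urljoin` is used instead of plain string concatenation because it handles both relative hrefs and hrefs that start with `/`.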



In [None]:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup


# Get the links to the main articles on the archived front page
url = "https://web.archive.org/web/20220201000813/https://igihe.com/"
page = requests.get(url)
maindetails = BeautifulSoup(page.text, 'html.parser')
articles = maindetails.find_all(class_='homenews')
all_links = []
for arti in articles:
  for title in arti.find_all('span', 'homenews-title'):
    link = title.find('a', href=True)
    if link is None:  # skip titles that carry no link
      continue
    # urljoin resolves the href against the page URL instead of gluing strings together
    full_path = urljoin(url, link['href'])
    all_links.append(full_path)
    print(full_path)
  

In [100]:
import requests
from bs4 import BeautifulSoup


# Get the tag that holds each headline and print its text
url = "https://web.archive.org/web/20220201000813/https://igihe.com/"
page = requests.get(url)
maindetails = BeautifulSoup(page.text, 'html.parser')
articles = maindetails.find_all(class_='homenews')
for arti in articles:
  for head_title in arti.find_all('span', 'homenews-title'):
    header_title = head_title.find('a')
    print(header_title.get_text())

RRA mu nzira yo gukumira inyerezwa ry’umusoro ukomoka mu bucuruzi mpuzamahanga
Kayonza: Abaturage babiri barimo n’utwite bagerageje kwiyahura mu munsi umwe
Manzi Thierry yaguzwe na AS FAR
Muri UR hatangijwe icyumweru cyahariwe imishinga irimo udushya dushingiye ku bushakashatsi
Purpose Rwanda yahamagariye Abanyarwanda kugira uruhare mu kuzahura ababaswe n’ibiyobyabwenge n’uburaya
Abareganwa na Rusesabagina bajuririye n’amatariki bafatiweho
Karongi: Inkuba yishe umusaza w’imyaka 58
Rihanna aratwite
Uburyo bwo guhangana n’umuhangayiko ukabije ushobora kwangiza ubuzima
Bamporiki yagaragarije urubyiruko igisobanuro cy’ubutwari
Umukino wa APR FC na Mukura VS wasubitswe ugeze hagati (Amafoto)
Abahanzi icyenda bagiye guhurira mu gitaramo cya Saint Valentin
Mali: Ambasaderi w’u Bufaransa yahawe amasaha 72 yo kuva muri icyo gihugu
Umusizi Nsanzabera yasohoye igitabo kiranga u Rwanda n’ibigwi byarwo
U Bufaransa: Umusaza w’imyaka 60 yihinduje igitsina, aba umugore
Dr Kizza Bes