# WEB SCRAPING PROJECT


![WEB IMAGE](web-scraping-with-python.png)

# INTRODUCTION
Web scraping is the process of extracting data from a website and transforming it into a more useful structure that can be
understood by a user or an API.
Web scraping is one of the required skills in the data collection stage, so this project is a simple task that I
have taken on to understand it.
It involves scraping Wikipedia and unche.or.ug pages and extracting some information, which is then stored in JSON files.


## IMPORTING PACKAGES
In this stage, I import all the packages needed for this project.

In [187]:
import pandas as pd
import numpy as np
import requests
from datetime import datetime
from bs4 import BeautifulSoup as bs


### REGIONS
In this first section, I scrape the Wikipedia page listing the regions of Uganda and save the result in a
JSON file containing:
- id
- Region
- area
- chief_town


In [188]:
url1 = 'https://en.wikipedia.org/wiki/Regions_of_Uganda'
url1_text = requests.get(url1).text
soup = bs(url1_text, 'html.parser')
# The first sortable wikitable on the page is used as the regions table.
table = soup.find('table', class_ = 'wikitable sortable')
headers = [th.text.strip() for th in table.tr.find_all('th')]

# Skip the header row and read the next four rows, one per region.
lst_data = []
for row in table.tbody.find_all('tr')[1:5]:
    data = [td.text.strip() for td in row.find_all('td')]
    lst_data.append(data)

dirty_df = pd.DataFrame(data=lst_data, columns=headers)
# Drop the population and district-count columns.
dirty_df.drop(columns=['Population(Census 1991)','Population(Census 2002)','Population(Census 2014)','Number of Districts'],inplace=True)
# Use the row position as the region ID.
Region_df = dirty_df.reset_index().rename(columns = {'index':'Reg_ID'})
Region_df.set_index('Reg_ID',inplace = True)
regions = Region_df.to_json(orient = 'columns')

with open('regions.json','w') as f:
    f.write(regions)
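
The header-and-row extraction used in the cell above can be tried offline. The following is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser`, so it runs without a network connection or third-party packages. The `TableExtractor` class and the inline HTML snippet with its placeholder values are assumptions made for illustration, not content taken from the Wikipedia page.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect header cells (<th>) and data rows (<td>) from one HTML table."""

    def __init__(self):
        super().__init__()
        self.headers = []    # text of every <th> cell
        self.rows = []       # one list of <td> texts per <tr>
        self._cell = None    # tag of the cell being read, or None
        self._text = []      # text fragments of the current cell
        self._row = []       # <td> texts of the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell == tag:
            text = "".join(self._text).strip()
            if tag == "th":
                self.headers.append(text)
            else:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row:
            # Rows without <td> cells (the header row) are not stored.
            self.rows.append(self._row)
            self._row = []


# Hypothetical snippet with placeholder values, standing in for the real page.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Region</th><th>Chief town</th></tr>
  <tr><td>Region A</td><td> Town A </td></tr>
  <tr><td>Region B</td><td>Town B</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)
parser.close()

# Pair each row with the headers, as the DataFrame constructor does by column.
records = [dict(zip(parser.headers, row)) for row in parser.rows]
print(parser.headers)
print(records)
```

Stripping each cell's text mirrors the `.text.strip()` calls in the notebook, which remove the newlines and padding that table cells usually carry.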
    


### DISTRICTS
In this section, I scrape a Wikipedia page containing several tables on the districts of Uganda and save the data
to a JSON file containing:
- id
- name
- region_id
- 2014_population
- 2023_population_est
- area

In [189]:
url2 = 'https://en.wikipedia.org/wiki/Districts_of_Uganda'
url2_text = requests.get(url2).text
soup = bs(url2_text,'html.parser')
dis_data = []
# The page has several district tables; collect the rows from each one.
for table in soup.find_all('table', class_ = 'wikitable sortable'):
    # The tables are assumed to share the same headers, so the last set read is used.
    headerss = [th.text.strip() for th in table.tbody.tr.find_all('th')]
    # Skip the header row and the final row of each table.
    for row in table.tbody.find_all('tr')[1:-1]:
        data = [td.text.strip() for td in row.find_all('td')]
        dis_data.append(data)
dirty_df2 = pd.DataFrame(data =dis_data,columns =headerss)
# The 'Map' column is treated as the district number and used as the district ID.
dirty_df2.rename(columns={'Map':'District_ID'},inplace = True)
dirty_df2['District_ID'] = dirty_df2['District_ID'].astype('int')
dirty_df2.sort_values(by='District_ID',ascending = True, inplace = True, ignore_index = True)

def Region_Identification(district):
    # Map a district ID to its region ID by ID range.
    if district <= 38:
        return '3'
    if district <= 75:
        return '2'
    if district <= 101:
        return '0'
    return '1'

dirty_df2['Reg_ID'] = dirty_df2['District_ID'].apply(Region_Identification)
dirty_df2.set_index('District_ID',inplace = True)
districts = dirty_df2.to_json(orient='columns')
with open('districts.json','w') as f:
    f.write(districts)
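
The ID ranges in `Region_Identification` can also be written as a sorted list of upper bounds. This is a minimal sketch using the standard library's `bisect` module, assuming the same boundaries (38, 75, 101) and the same string region IDs as the cell above. The names `UPPER_BOUNDS`, `REGION_IDS` and `region_for` are illustrative only.

```python
from bisect import bisect_left

# Inclusive upper bound of each district-ID range, in ascending order.
UPPER_BOUNDS = [38, 75, 101]
# Region ID for each range; the last entry covers every ID above 101.
REGION_IDS = ["3", "2", "0", "1"]


def region_for(district_id: int) -> str:
    """Return the region ID whose range contains district_id."""
    return REGION_IDS[bisect_left(UPPER_BOUNDS, district_id)]


# Check the values on both sides of each boundary.
for district_id in (1, 38, 39, 75, 76, 101, 102):
    print(district_id, region_for(district_id))
```

Keeping the boundaries in one list means a change to the ranges touches data rather than control flow.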


### INSTITUTIONS
In this section, I scrape the institutions page on unche.or.ug and save the following information in a
JSON file:
- id
- name
- district_id
- ownership_type

In [190]:
url3 = 'https://unche.or.ug/institutions/'
url3_text = requests.get(url3).text
soup = bs(url3_text,'html.parser')
# The institutions are listed in the table with id 'unche-table'.
insti_table = soup.find('table', id = 'unche-table')
headersss =[th.text.strip() for th in insti_table.tr.find_all('th')]

institution_data = []
for row in insti_table.tbody.find_all('tr'):
    data = [td.text.strip() for td in row.find_all('td')]
    institution_data.append(data)

dirty_df3 = pd.DataFrame(data = institution_data, columns = headersss)
# Use the row position as the institution ID.
dirty_df3 = dirty_df3.reset_index().rename(columns = {'index':'id'})
dirty_df3.drop(['Award Type','Programs'],axis =1 ,inplace = True)
institutions = dirty_df3.to_json(orient='columns')
with open('institutions.json','w') as f:
    f.write(institutions)
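
The field names listed for this section (`id`, `name`, `ownership_type`) differ from the scraped table headers, so a renaming step is needed at some point. Below is a minimal sketch of that step in plain Python. The header names in `FIELD_MAP` and the placeholder rows are assumptions, since the real headers depend on the page at scrape time, and `to_records` is a hypothetical helper rather than part of the notebook.

```python
import json

# Hypothetical mapping from scraped header text to the target field names.
FIELD_MAP = {"Institution": "name", "Ownership": "ownership_type"}


def to_records(headers, rows):
    """Turn header and row lists into dicts with a sequential integer id."""
    records = []
    for row_id, row in enumerate(rows):
        record = {"id": row_id}
        for header, value in zip(headers, row):
            if header in FIELD_MAP:  # columns without a mapping are dropped
                record[FIELD_MAP[header]] = value
        records.append(record)
    return records


# Placeholder rows, not real scraped data.
headers = ["Institution", "Ownership", "Programs"]
rows = [
    ["Example University", "Public", "12"],
    ["Sample Institute", "Private", "5"],
]

records = to_records(headers, rows)
print(json.dumps(records, indent=2))
```

This produces one JSON object per institution, which is the same shape pandas writes with `to_json(orient='records')`, whereas the notebook uses `orient='columns'`.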

### PROGRAMS
In this section, I scrape the academic programs page on unche.or.ug and save the following information
in a JSON file:
- id
- name
- level
- institution_id
- accredited_day
- accredited_month
- accredited_year
- expiry_day
- expiry_month
- expiry_year
- status

In [191]:
url4 = 'https://unche.or.ug/all-academic-programs/'
url4_text = requests.get(url4).text
soup = bs(url4_text,'html.parser')
# The programs are listed in the table with id 'unche-table'.
program_table = soup.find('table', id = 'unche-table')
headings =[th.text.strip() for th in program_table.tr.find_all('th')]

program_data = []
for row in program_table.tbody.find_all('tr'):
    data = [td.text.strip() for td in row.find_all('td')]
    program_data.append(data)

dirty_df4 = pd.DataFrame(data=program_data , columns = headings)
# Parse the date strings; infer_datetime_format is deprecated in pandas 2.x, so it is omitted.
dirty_df4['Accredited Date'] = pd.to_datetime(dirty_df4['Accredited Date'])
# Split each date into separate day, month and year columns.
dirty_df4['Accredited Day'] =dirty_df4['Accredited Date'].dt.day
dirty_df4['Accredited Month'] =dirty_df4['Accredited Date'].dt.month
dirty_df4['Accredited Year'] =dirty_df4['Accredited Date'].dt.year
dirty_df4['Expiry Date'] = pd.to_datetime(dirty_df4['Expiry Date'])
dirty_df4['Expiry Day'] =dirty_df4['Expiry Date'].dt.day
dirty_df4['Expiry Month'] =dirty_df4['Expiry Date'].dt.month
dirty_df4['Expiry Year'] =dirty_df4['Expiry Date'].dt.year
# Use the row position as the program ID.
dirty_df4 = dirty_df4.reset_index().rename(columns = {'index': 'Id'})
programs = dirty_df4.to_json(orient = 'columns')
with open('programs.json','w') as f:
    f.write(programs)
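
The day, month and year columns come from splitting each parsed date into its parts. The same step can be sketched with the standard library alone. The `DATE_FORMAT` below is an assumption about how the dates are written on the page, and `split_date` is an illustrative helper, not part of the notebook.

```python
from datetime import datetime

# Assumed date format; adjust it to match the scraped date strings.
DATE_FORMAT = "%Y-%m-%d"


def split_date(text, fmt=DATE_FORMAT):
    """Parse a date string and return its (day, month, year) parts."""
    parsed = datetime.strptime(text.strip(), fmt)
    return parsed.day, parsed.month, parsed.year


# Surrounding whitespace is stripped before parsing.
day, month, year = split_date(" 2021-03-15 ")
print(day, month, year)
```

Unlike `pd.to_datetime`, `datetime.strptime` raises a `ValueError` when a string does not match the given format, which makes unexpected date formats visible early.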
