# Automation & Web Scraping in Python
In this project, I do a full walkthrough of freeCodeCamp's [*Automate with Python - Full Course for Beginners*](https://www.youtube.com/watch?v=PXMJ6FS7llk) tutorial, which covers the essential topics of web scraping across various formats (HTML, CSV, PDF) using different Python libraries and tools to help with automation (Path, Selenium, XPath, etc.). Note that the video is 3 years old, so the data sources, libraries, and methodologies may have changed slightly since then. Ultimately, I'm doing this for personal practice and education, and I hope to apply anything useful to my own workflow, whether that be automating news extraction, sending texts, generating Excel reports, or more.

In [1]:
#pip install wikipedia-api
#pip install pandas
#pip install tk
#pip install ghostscript
#pip install camelot-py
#pip install selenium

In [59]:
#Dependencies
import pandas as pd
import wikipediaapi as wik
import camelot
import os
import selenium

## Scraping Wikipedia Page

In [6]:
#Extract info from website - this no longer works (returns HTTP Error 403: Forbidden)
#   nba_2024 = pd.read_html("https://en.wikipedia.org/wiki/2024%E2%80%9325_NBA_season")

The [tutorial](https://www.youtube.com/watch?v=PXMJ6FS7llk&t=140s) suggested scraping the Wikipedia page using the pandas `.read_html` function, but the request is blocked and returns **HTTP Error 403: Forbidden**, so a workaround will be used instead.
> To scrape Wikipedia I will be using the following **wikipedia-api** library found [here](https://pypi.org/project/Wikipedia-API/)
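
For reference, here is a minimal sketch of an alternative that stays with pandas. It assumes the 403 comes from Wikipedia rejecting the default Python user agent, which I have not verified here, so it downloads the page with an explicit `User-Agent` header and hands the HTML to `pd.read_html`. The helper names `build_request` and `read_tables` and the contact address are placeholders of mine, not part of the tutorial.

```python
import io
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that identifies the script with a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def read_tables(url, user_agent):
    """Download the page with the custom header, then let pandas parse its tables."""
    import pandas as pd  # imported here so build_request works without pandas

    with urllib.request.urlopen(build_request(url, user_agent), timeout=30) as response:
        html = response.read().decode("utf-8")
    # pd.read_html needs an HTML parser installed (lxml, or bs4 + html5lib)
    return pd.read_html(io.StringIO(html))


request = build_request(
    "https://en.wikipedia.org/wiki/2024%E2%80%9325_NBA_season",
    "personalproj/0.1 (you@example.com)",  # placeholder contact, use your own
)
print(request.get_header("User-agent"))
```

Calling `read_tables(request.full_url, "personalproj/0.1 (you@example.com)")` would then return a list of DataFrames, one per table on the page, provided the request is accepted.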

#### Initializing & Getting Page
Here I'll start by initializing the Wikipedia object and specifying which page I want. The object requires a user-agent string that follows the general rules Wikimedia provides [here](https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy). They require this to guard against 'misbehaving' scripts that can harm the site (too many requests causing unnecessary load). The generic format is as follows: **"client name/version (contact information) library/framework name/version"**.

In [7]:
wiki_init = wik.Wikipedia(user_agent = 'personalproj (billyhdang@gmail.com) wikipedia-api/0.8.1', language = 'en')
page_parse = wiki_init.page('2024–25_NBA_season')
print(page_parse.exists()) #Checking page existence 

True


In [8]:
page_parse.sections[1].text

"In addition to regular preseason games hosted at NBA teams' own arenas, the NBA often hosts neutral site preseason games (either in domestic non-NBA markets or foreign markets) or against non-NBA teams. Listed below are only those neutral site or preseason games."

In [9]:
page_parse.sections[1]

Section: Preseason (1):
In addition to regular preseason games hosted at NBA teams' own arenas, the NBA often hosts neutral site preseason games (either in domestic non-NBA markets or foreign markets) or against non-NBA teams. Listed below are only those neutral site or preseason games.
Subsections (3):
Section: Domestic neutral site games (2):

Subsections (0):

Section: International games (2):

Subsections (0):

Section: Non-NBA opponents (2):

Subsections (0):

> Note: this API has limitations. Elements such as tables, which are often what we want, are not returned as structured data (at most as HTML markup), and in the output above the three subsections come back with no text.
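
To see at a glance which sections carry text and which come back empty, the nested section output above can be flattened into an outline. This is a rough sketch: `StubSection` is a hypothetical stand-in I made up so the example runs offline, and `outline` only assumes each section exposes `title`, `text`, and `sections`, as the objects printed above appear to.

```python
from dataclasses import dataclass, field


@dataclass
class StubSection:
    """Hypothetical stand-in for a wikipedia-api section, so this runs offline."""
    title: str
    text: str = ""
    sections: list = field(default_factory=list)


def outline(sections, depth=0):
    """Return one indented line per section, flagging sections with no text."""
    lines = []
    for section in sections:
        marker = "" if section.text.strip() else "  [no text]"
        lines.append("  " * depth + section.title + marker)
        # Recurse into subsections, indenting one level deeper
        lines.extend(outline(section.sections, depth + 1))
    return lines


# Mirrors the 'Preseason' section shown above (text shortened for the example)
preseason = StubSection(
    "Preseason",
    "In addition to regular preseason games hosted at NBA teams' own arenas",
    [
        StubSection("Domestic neutral site games"),
        StubSection("International games"),
        StubSection("Non-NBA opponents"),
    ],
)
print("\n".join(outline([preseason])))
```

On the real page, something like `outline(page_parse.sections)` should give the same kind of listing for the whole article.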

## Reading .csv Files from URL using Pandas

In [10]:
#Read CSV file from Football-data
premleague_2024 = pd.read_csv("https://www.football-data.co.uk/mmz4281/2425/E0.csv")
premleague_2024.head(3)

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA,BFECAHH,BFECAHA
0,E0,16/08/2024,20:00,Man United,Fulham,1,0,H,0,0,...,1.86,2.07,1.83,2.11,1.88,2.11,1.82,2.05,1.9,2.08
1,E0,17/08/2024,12:30,Ipswich,Liverpool,0,2,A,0,0,...,2.05,1.88,2.04,1.9,2.2,2.0,1.99,1.88,2.04,1.93
2,E0,17/08/2024,15:00,Arsenal,Wolves,2,0,H,1,0,...,2.02,1.91,2.0,1.9,2.05,1.93,1.99,1.87,2.02,1.96


In [11]:
features = (premleague_2024.columns.values)
print(features)

['Div' 'Date' 'Time' 'HomeTeam' 'AwayTeam' 'FTHG' 'FTAG' 'FTR' 'HTHG'
 'HTAG' 'HTR' 'Referee' 'HS' 'AS' 'HST' 'AST' 'HF' 'AF' 'HC' 'AC' 'HY'
 'AY' 'HR' 'AR' 'B365H' 'B365D' 'B365A' 'BWH' 'BWD' 'BWA' 'BFH' 'BFD'
 'BFA' 'PSH' 'PSD' 'PSA' 'WHH' 'WHD' 'WHA' '1XBH' '1XBD' '1XBA' 'MaxH'
 'MaxD' 'MaxA' 'AvgH' 'AvgD' 'AvgA' 'BFEH' 'BFED' 'BFEA' 'B365>2.5'
 'B365<2.5' 'P>2.5' 'P<2.5' 'Max>2.5' 'Max<2.5' 'Avg>2.5' 'Avg<2.5'
 'BFE>2.5' 'BFE<2.5' 'AHh' 'B365AHH' 'B365AHA' 'PAHH' 'PAHA' 'MaxAHH'
 'MaxAHA' 'AvgAHH' 'AvgAHA' 'BFEAHH' 'BFEAHA' 'B365CH' 'B365CD' 'B365CA'
 'BWCH' 'BWCD' 'BWCA' 'BFCH' 'BFCD' 'BFCA' 'PSCH' 'PSCD' 'PSCA' 'WHCH'
 'WHCD' 'WHCA' '1XBCH' '1XBCD' '1XBCA' 'MaxCH' 'MaxCD' 'MaxCA' 'AvgCH'
 'AvgCD' 'AvgCA' 'BFECH' 'BFECD' 'BFECA' 'B365C>2.5' 'B365C<2.5' 'PC>2.5'
 'PC<2.5' 'MaxC>2.5' 'MaxC<2.5' 'AvgC>2.5' 'AvgC<2.5' 'BFEC>2.5'
 'BFEC<2.5' 'AHCh' 'B365CAHH' 'B365CAHA' 'PCAHH' 'PCAHA' 'MaxCAHH'
 'MaxCAHA' 'AvgCAHH' 'AvgCAHA' 'BFECAHH' 'BFECAHA']


> There are a lot of ambiguous column names here, so I'll rename a few to clarify.

In [12]:
premleague_2024.rename(columns = {'FTHG':'home_goals',
                                 'FTAG': 'away_goals', 
                                 'FTR': 'winner', 
                                 }, inplace = True)
premleague_2024.head(2)

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,home_goals,away_goals,winner,HTHG,HTAG,...,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA,BFECAHH,BFECAHA
0,E0,16/08/2024,20:00,Man United,Fulham,1,0,H,0,0,...,1.86,2.07,1.83,2.11,1.88,2.11,1.82,2.05,1.9,2.08
1,E0,17/08/2024,12:30,Ipswich,Liverpool,0,2,A,0,0,...,2.05,1.88,2.04,1.9,2.2,2.0,1.99,1.88,2.04,1.93
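
Even after the rename, the `winner` column still holds result codes (`H` for a home win, `A` for an away win, `D` for a draw) and `Date` is a day-first string. A small sketch of how both could be made more readable, using two rows copied from the output above; the `winning_team` column and helper are names I made up for illustration.

```python
import io

import pandas as pd

# Two rows copied from the output above, trimmed to the columns used here
sample_csv = io.StringIO(
    "Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR\n"
    "16/08/2024,Man United,Fulham,1,0,H\n"
    "17/08/2024,Ipswich,Liverpool,0,2,A\n"
)
matches = pd.read_csv(sample_csv).rename(
    columns={"FTHG": "home_goals", "FTAG": "away_goals", "FTR": "winner"}
)

# Dates in this file are day-first (dd/mm/yyyy), so state the format explicitly
matches["Date"] = pd.to_datetime(matches["Date"], format="%d/%m/%Y")


def winning_team(row):
    """Translate the H/A/D result code into a team name (or 'Draw')."""
    if row["winner"] == "H":
        return row["HomeTeam"]
    if row["winner"] == "A":
        return row["AwayTeam"]
    return "Draw"


matches["winning_team"] = matches.apply(winning_team, axis=1)
print(matches[["Date", "HomeTeam", "AwayTeam", "winning_team"]])
```

The same two steps should work on the full `premleague_2024` frame, since it uses the same column names after the rename.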


## Extracting Tables from PDFs
Here we'll use the camelot-py library to grab tables from PDFs, which can be useful for research papers that contain clearly laid-out tables in various types and formats.

In [46]:
tables = camelot.read_pdf('data/player_research_paper.pdf', pages = '5')
print(tables)

<TableList n=1>


In [47]:
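#Export every detected table as CSV, bundled into a zip archive
tables.export('research_table', f='csv', compress = True)
#Save the first (and only) table to its own CSV file
tables[0].to_csv('movement_analysis_table.csv')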

In [51]:
research_table = pd.read_csv('movement_analysis_table.csv')
research_table.head(3)

Unnamed: 0,Complex\nDeﬁnition\nAnalytical methodology\nExample\nfrom\nHow does it\ninﬂuence practice?\nsystems\nthe literature\nfeature,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,Feedback,When the output of a process inﬂuences an\ninp...,Decision tree classiﬁcation,Shot classiﬁcation\n(157),An increase in running load during\ncompetitio...
1,Emergence,"New, unexpected properties that arise from\nth...",Shannon entropy,Particles (158),The development of novel ball movement\npatter...
2,Self-organisation,The autonomous organisation of subgroups\nor i...,"Centrality, ﬂocking motion\nmodels",Ornithology (159),An attacking unit alters their movement\nwitho...


> It clearly did not parse the headers correctly, so further manual adjustment would be needed.
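
As a starting point for that cleanup, here is a minimal sketch that uses only the standard library. It expands the PDF ligatures (such as `ﬁ`) and collapses the stray line breaks inside a cell. The header names in `CLEAN_HEADERS` are my own reading of the garbled header cell above and should be checked against the PDF before use.

```python
import unicodedata

# Assumed header names, read off the garbled header cell above
CLEAN_HEADERS = [
    "Complex systems feature",
    "Definition",
    "Analytical methodology",
    "Example from the literature",
    "How does it influence practice?",
]


def clean_cell(text):
    """Expand PDF ligatures (e.g. 'ﬁ' to 'fi') and collapse line breaks and extra spaces."""
    normalized = unicodedata.normalize("NFKC", str(text))
    return " ".join(normalized.split())


# Two cells copied from the first row of the output above
print(clean_cell("Shot classiﬁcation\n(157)"))
print(clean_cell("Decision tree classiﬁcation"))
```

With pandas, the headers could then be assigned with `research_table.columns = CLEAN_HEADERS` and `clean_cell` applied to each text cell.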

## Scraping Websites using Selenium
This will require installing ChromeDriver and Selenium. For ChromeDriver, head [here](https://developer.chrome.com/docs/chromedriver) for download instructions after checking which version of Chrome you are on (found in the 'Help' section of your browser). Selenium can be installed with a simple pip install.