Web Scraper in Python

Basic Function

Takes a URL and outputs a JSON data file of the most common words in the text at that URL, along with two other language-processing measures: lexical diversity and a frequency distribution.

Install Libraries and Initialise Program

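To initialise the program, run app.py, for example within a Python IDE. The program uses urlopen, NLTK and BeautifulSoup. urlopen ships with Python's standard library (urllib); if you have not already done so, you will need to install NLTK and BeautifulSoup, for example with pip, before the imports will work.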

These can be found here:

NLTK

Beautiful Soup
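Before running app.py, it can help to confirm that the libraries are importable. The following is a minimal sketch, not part of the project: the helper name missing_modules is hypothetical, and it assumes the usual import names (nltk for NLTK, bs4 for Beautiful Soup 4).

```python
import importlib.util

# Assumed import names: NLTK imports as "nltk", Beautiful Soup 4 as "bs4".
# urlopen lives in the standard-library module "urllib.request".
REQUIRED_MODULES = ["urllib.request", "nltk", "bs4"]


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is absent or a spec is malformed.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Any module reported as missing would need to be installed before app.py can import it.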

I/O

The program prompts you to enter a URL, defaulting to the bbc.co.uk main page if you do not supply one, and then processes the text at that URL. Please take care to follow the target website's guidance on scraping. The program outputs the lexical diversity (the proportion of unique words across the whole text), the frequency distribution (the most common words and the number of times each is used) and a JSON file of the frequency distribution.
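The measures described above can be illustrated with a short sketch. This is not the code in app.py or engine.py: the function names, the default URL string and the regular-expression tokeniser are assumptions made for illustration, and collections.Counter stands in for NLTK's frequency distribution so that the example runs with the standard library alone.

```python
import json
import re
from collections import Counter

# Assumed form of the default; the project only states "the bbc.co.uk main page".
DEFAULT_URL = "https://www.bbc.co.uk"


def resolve_url(user_input, default=DEFAULT_URL):
    """Return the entered URL, or the default when the input is blank."""
    entered = user_input.strip()
    return entered if entered else default


def analyse(text, top_n=10):
    """Compute lexical diversity and the most common words in `text`."""
    # Simple tokeniser (an assumption): lowercase words, optional apostrophe part.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return {"lexical_diversity": 0.0, "most_common": []}
    counts = Counter(words)
    return {
        # Proportion of unique words across the whole text.
        "lexical_diversity": len(counts) / len(words),
        # List of (word, count) pairs, most frequent first.
        "most_common": counts.most_common(top_n),
    }


def frequency_json(most_common):
    """Serialise (word, count) pairs as a JSON object string."""
    return json.dumps(dict(most_common), indent=2)


if __name__ == "__main__":
    result = analyse("the cat and the hat and the bat", top_n=2)
    print(result["lexical_diversity"])
    print(frequency_json(result["most_common"]))
```

Here the sample sentence has 8 words, 5 of them unique, so the lexical diversity is 5 / 8. A real run would first fetch the page and strip the HTML before calling analyse.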

HTML/JS files

The d3.js and wordcloud JavaScript files were obtained from the internet and were not the focus of this project. When used by the wordcloud.html file, they create a word cloud of the most common words at the target URL. Parameters can be altered within the .py files to change the word cloud's size and shape. The final prompt in the app.py output is to open wordcloud.html, which the program tries to do automatically.
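One way a script can try to open a local HTML file automatically is the standard-library webbrowser module. The sketch below is an assumption about how this could be done, not a copy of what app.py does; the function names are hypothetical.

```python
import webbrowser
from pathlib import Path


def wordcloud_uri(html_path="wordcloud.html"):
    """Return a file:// URI for the page, resolved against the current directory."""
    return Path(html_path).resolve().as_uri()


def open_wordcloud(html_path="wordcloud.html"):
    """Try to open the page in a browser; return False if that is not possible."""
    if not Path(html_path).is_file():
        # Nothing to open, so report failure rather than launching a browser.
        return False
    # webbrowser.open returns True if a browser could be launched.
    return webbrowser.open(wordcloud_uri(html_path))
```

If open_wordcloud returns False, the file can still be opened by hand from the project folder.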

TO DO

Add different visualisations
Add inheritance for different types of operation
