ekbrown/python_scripts

python_scripts

Python scripts, mostly for corpus linguistic research.

Script_OCR_with_Tesseract.py

Python script to OCR .pdf files with Google's Tesseract and save the text in .txt files. Before running this script, install Tesseract, ImageMagick (plus Ghostscript on a Mac), and the Pillow Python module.

Script_YTcomments.py

Python script to webscrape comments below YouTube videos. Uses YouTube's API rather than Selenium.

Script_Youtube_comments.py

Python script to control Selenium in order to webscrape comments below a YouTube video.

Script_extract_text_PDF.py

Python function to extract text from PDF files, whether or not they are already indexed.

Script_scrape_domain.py

Python script to scrape a website, until a user-given number of webpages have been scraped or the entire domain has been scraped, whichever comes first.
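The stopping rule described above (a page limit or an exhausted domain, whichever comes first) can be sketched as a breadth-first crawl. This is a minimal illustration using only the standard library, not the code in Script_scrape_domain.py; the names `LinkParser`, `fetch_html`, and `crawl_domain` are hypothetical, and it assumes "same domain" means an identical network location.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page and decode it as UTF-8 (assumed encoding)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_domain(start_url, max_pages, fetch=fetch_html):
    """Breadth-first crawl that stops at max_pages or when no links remain."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `fetch` parameter is there so the crawl logic can be tried against an in-memory dictionary of pages before pointing it at a live site.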

Script_webscrape_KWIC_CdE.py

A Python script to control the web browser automating software Selenium to webscrape keyword-in-context results from the Corpus del Español (Web / Dialects).

Script_webscrape_Yelp.py

Use Selenium to webscrape Yelp reviews.

Script_webscrape_freq_CdE.py

A Python script to webscrape the frequency of a search term in the Corpus del Español with the Python module requests.

Script_webscrape_freqs_CdE.py

A Python script to control the web browser automating software Selenium to webscrape frequencies from the Corpus del Español (Web / Dialects).

Script_webscrape_twitter_advanced_search.py

Python script (still a rough draft) to control Selenium to webscrape tweets from Twitter's Advanced Search webpage.

cmd_html2txt.py

Command-line tool to scrape text from a webpage and save it to a TXT file in the working directory.
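As a rough idea of the text-extraction step, the sketch below strips tags from an HTML string with the standard library's `html.parser`. It is not the code in cmd_html2txt.py, which may well use a different library; `TextExtractor` and `html_to_text` are hypothetical names, and downloading the page and writing the TXT file are left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text chunks of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Note that this simple version puts every text node on its own line, so inline elements such as `<b>` split a sentence across lines.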

cmd_scrape_speech.py

Python command-line tool to download a speech from https://factba.se/.

freq_list.py

A Python executable file to create a frequency list of single words from a directory of .txt files. Returns a .csv file with two columns, one with the words and the other with their corresponding frequency.
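A minimal sketch of the same idea, assuming UTF-8 files, lowercased words, and a simple regex definition of a "word"; the function name `build_freq_list` and the column headers are illustrative and are not taken from freq_list.py.

```python
import csv
import re
from collections import Counter
from pathlib import Path


def build_freq_list(corpus_dir, out_csv):
    """Count words in every .txt file of corpus_dir and write a two-column CSV."""
    counts = Counter()
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        # Assumption: a word is a run of word characters, optionally
        # joined by internal apostrophes or hyphens (e.g. "cat's")
        counts.update(re.findall(r"\w+(?:['-]\w+)*", text))
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["word", "freq"])
        # most_common() yields (word, frequency) pairs, most frequent first
        writer.writerows(counts.most_common())
    return counts
```

Calling `build_freq_list("my_corpus", "freqs.csv")` would write the list sorted by descending frequency; the directory and file names here are placeholders.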

get_collocates.py

Get collocates of a node word, specifying how many words to the right and left to look.
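The windowing idea can be sketched as follows. This is an illustration only: the function name `get_collocates`, the `left` and `right` parameters, and the simple `\w+` tokenization are assumptions, not the interface of get_collocates.py.

```python
import re
from collections import Counter


def get_collocates(text, node, left=4, right=4):
    """Count words within `left`/`right` tokens of each occurrence of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    node = node.lower()
    collocates = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Window to the left, clipped at the start of the text
        collocates.update(tokens[max(0, i - left):i])
        # Window to the right; slicing clips at the end automatically
        collocates.update(tokens[i + 1:i + 1 + right])
    return collocates
```

Raw counts like these are usually only a first step; ranking collocates by an association measure would be a natural extension.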

get_concordance.py

A Python script to create a concordance display of a regex in a directory of .txt files. Returns a .csv file with five columns: FILE (name of file), PARA (paragraph number), PRE (preceding context within paragraph), HIT (the match), FOL (following context within paragraph).
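A sketch of how such a five-column concordance might be produced, assuming UTF-8 files and paragraphs separated by blank lines; `concordance_rows` and `write_concordance` are hypothetical names, and the paragraph-splitting rule is an assumption that may differ from get_concordance.py.

```python
import csv
import re
from pathlib import Path


def concordance_rows(corpus_dir, pattern, flags=re.IGNORECASE):
    """Yield [FILE, PARA, PRE, HIT, FOL] for every regex match."""
    regex = re.compile(pattern, flags)
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Assumption: paragraphs are separated by one or more blank lines
        paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
        for para_num, para in enumerate(paragraphs, start=1):
            for m in regex.finditer(para):
                # Context is limited to the paragraph containing the match
                yield [path.name, para_num, para[:m.start()], m.group(0), para[m.end():]]


def write_concordance(corpus_dir, pattern, out_csv):
    """Write the concordance to out_csv and return the number of hits."""
    hits = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["FILE", "PARA", "PRE", "HIT", "FOL"])
        for row in concordance_rows(corpus_dir, pattern):
            writer.writerow(row)
            hits += 1
    return hits
```

For example, `write_concordance("my_corpus", r"\bcat\b", "conc.csv")` would record each whole-word match of "cat"; the paths are placeholders.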

get_keywords.py

Get keywords in a corpus, specifying a reference corpus, both on the user's hard drive.
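The description above does not say which keyness statistic the script uses. The sketch below assumes log-likelihood, a common choice in corpus linguistics, and works on already tokenized corpora; the function names and the `top_n` parameter are illustrative only.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word.

    a, b: frequency of the word in the target and reference corpus
    c, d: total number of tokens in the target and reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def get_keywords(target_tokens, reference_tokens, top_n=20):
    """Return the top_n (word, score) pairs overrepresented in the target."""
    target = Counter(target_tokens)
    ref = Counter(reference_tokens)
    c, d = sum(target.values()), sum(ref.values())
    if c == 0 or d == 0:
        raise ValueError("both corpora must contain at least one token")
    scored = []
    for word, a in target.items():
        b = ref.get(word, 0)
        # Keep only words relatively more frequent in the target corpus
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Reading the two corpora from directories on the hard drive and tokenizing them would happen before this step.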

get_many_ytCC.py

Python script to extract the closed captions generated by automated speech recognition on YouTube videos and save the transcripts to the hard drive as CSV files.

get_ytCC.py

Get the closed captions of a YouTube video, as generated by YouTube's automated speech recognition software.
