- My assorted industry tutorials for working with annoying `python3` packages in Windows 10 or WSL2 (Ubuntu 20.04) environments.
- Also has some "life saving" concepts for dealing with niche problems or libraries.
- Check the `Scripts` folder for some other stuff.
- PyODBC/SQLAlchemy for WSL
- Offline package installations
- Geopandas for Windows 10
- Installing WSL2 (20.04) without Windows Store
- Installing multiple instances of SQL server onto a single server/VM/RDP
- Download Files with Python
- Using command line git on a VM
- SalesForce API
- Address Regex/Cleaning/Dedupe
- Automated Emails
- Automating Jupyter Notebooks
- Life Saving Regex for data cleansing
- HTCondor and PostgreSQL
- Webscraping dynamically loaded sites
- Install nbgrader
- Automatically Launching Jupyter Notebook in WSL2
- Installing PySpark using Docker
Mitigates the error when you are missing the `SQL.h` header.
sudo su
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
exit
sudo apt-get update && sudo apt-get upgrade
sudo ACCEPT_EULA=Y apt-get install msodbcsql17
sudo apt-get install build-essential libssl-dev libffi-dev python3-dev unixodbc unixodbc-dev
pip3 install sqlalchemy pyodbc
Usage:
import pandas as pd
import sqlalchemy
from sqlalchemy import event
username = r"user"
password = r"pw"
dsnname = r"DB_NAME" # the pyodbc connection name
# mysql would be f"mysql+pyodbc://{username}:{password}@{dsnname}" and remove fast_executemany=True
connection_string = f"mssql+pyodbc://{username}:{password}@{dsnname}"
engine = sqlalchemy.create_engine(connection_string, pool_pre_ping=True, fast_executemany=True)

# Event hook that speeds up pandas to_sql inserts
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
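With the engine and event hook in place, a bulk insert is just `to_sql`; a minimal sketch (the dataframe and table name are placeholders):

```python
# placeholder dataframe and table name - reuses `engine` from above
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df.to_sql("my_table", engine, schema="dbo", index=False, if_exists="append", chunksize=10000)

# quick read-back to confirm the insert worked
print(pd.read_sql("SELECT TOP 5 * FROM dbo.my_table", engine))
```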
How to manually save packages and install them in an offline environment
- Create a `requirements.txt` file with the list of packages you need, then download them with `pip3 download -r requirements.txt`
- Copy the downloaded files to the offline environment
- Install with `pip3 install --no-index --find-links /path/to/download/dir/ -r requirements.txt`
How to install Geopandas on Windows 10 by manually downloading its dependencies.
- You will need 2 different `.whl` (wheel) files: `GDAL` and `Fiona`.
- Find both of them and download the version corresponding to your device OS and Python. For example, a 64-bit Windows 10 device running Python 3.7.X will require this specific file for GDAL. Repeat this for Fiona as well.
    - `cp37` stands for C-Python 3.7.X
    - `win_amd64` denotes Windows devices with a 64-bit architecture
- Once they are downloaded, open up `cmd` and `cd` into the directory.
    - Example: If I downloaded the `.whl` files to the `geopandas_dependencies` folder, then I would use this command: `cd C:\Users\USERNAME\Downloads\geopandas_dependencies`
- Install the dependencies in this specific order:
    - `pip install GDAL-3.1.2-cp37-cp37m-win_amd64.whl` (or the corresponding file you downloaded)
    - `pip install Fiona-1.8.13-cp37-cp37m-win_amd64.whl`
- `pip install geopandas`
- Done!
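A quick way to check the install actually worked; a minimal sketch, assuming you have any shapefile handy (the path is a placeholder):

```python
import geopandas as gpd

# path is a placeholder - point it at any shapefile you have
gdf = gpd.read_file(r"C:\data\some_shapefile.shp")
print(gdf.crs)
print(gdf.head())
```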
How to install WSL2 if you have debloated Windows 10 and got rid of the pesky Windows Store.
- Download the new kernel update here (https://docs.microsoft.com/en-us/windows/wsl/wsl2-kernel).
- Direct link is (https://wslstorestorage.blob.core.windows.net/wslblob/wsl_update_x64.msi)
- If you have a previous installation, you will need to unregister it first. For example, if I previously had Ubuntu 18.04:
    - `wsl --list --all`
    - `wsl --unregister Ubuntu-18.04`
- Create a PowerShell script (`.ps1`) with the following code:
# this is for wsl2
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
Enable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform -NoRestart
cd c:\
# install 2004 (different url)
# 1804 is https://aka.ms/wsl-ubuntu-1804
Invoke-WebRequest -Uri https://aka.ms/wslubuntu2004 -OutFile Ubuntu.appx -UseBasicParsing
Rename-Item ./Ubuntu.appx ./Ubuntu.zip
Expand-Archive ./Ubuntu.zip ./Ubuntu
cd ./Ubuntu
.\ubuntu2004.exe
$userenv = [System.Environment]::GetEnvironmentVariable("Path", "User")
[System.Environment]::SetEnvironmentVariable("PATH", $userenv + ";C:\Ubuntu", "User")
- Run as admin and you are done!
Issue: SQL Server assumes it is the only server on the machine. That is fine when you only have a single instance on a Server/VM/RDP, but if you want multiple instances of SQL Server, there are specific parameters you need to set to throttle each instance's resource usage.
Initialisation:
- Launch SQL Server Installation
- Click on Installation -> New SQL Server stand-alone installation or add features to an existing installation
- Install updates if required (optional)
- Create a new installation with no features
- Enter product key (or a pre-filled product key)
Parameter Set:
- From the features checkbox, check Database Engine Services and other services where required. You can also change the location of the root instance here.
- From the Server Configuration Step:
- Add a suitable engine account and password
- Tick the Grant Perform Volume Maintenance Task privilege to SQL Server Database Engine Service
- From the Database Engine Configuration Step:
- Server Configuration -> Tick Mixed Mode and enter the default sysadmin password
- Server Configuration -> Specify the user(s) who will be the admins of this SQL server
- Data Directories -> Ensure they are located in the instance root (C Drive by default)
- Temp Database -> Ensure that they are located where required (some directory or drive with plenty of free space, as this can grow)
- Memory -> Change the max server memory to something like 16GB (16000mb) [Example: 64GB server with 3 SQL Server instances]
- Accept Terms
- Install
- Once completed, verify it works by creating and querying a sample database.
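A hedged sketch of that verification step from Python, reusing the pyodbc/SQLAlchemy setup from earlier (server, instance, and credentials are placeholders):

```python
import sqlalchemy
from sqlalchemy import text

# placeholders: SERVERNAME\INSTANCENAME is whatever you named the new instance
connection_string = (
    "mssql+pyodbc://sa:YourStrongPassword@SERVERNAME\\INSTANCENAME/master"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
engine = sqlalchemy.create_engine(connection_string)

with engine.connect() as conn:
    print(conn.execute(text("SELECT @@SERVERNAME, @@VERSION")).fetchall())
```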
How to download files using urllib
import urllib.request

url = 'https://www.someurl.com/dataset.csv'  # placeholder URL
fname = 'some_dataset.csv'
urllib.request.urlretrieve(url, fname)
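An alternative sketch using `requests` (not part of the original snippet, just another common option) that makes it easy to add headers or stream large files:

```python
import requests

url = 'https://www.someurl.com/dataset.csv'  # placeholder URL
fname = 'some_dataset.csv'

# stream=True avoids holding the whole file in memory
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open(fname, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
```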
Basics of cloning/pushing using commandline git
Cloning:
- Open terminal
- `git clone HTTPS` where HTTPS is the https URL to your GitHub/GitLab repo
- Enter your credentials
- Done
Pushing:
- `cd` into the repo directory
- `git add .` (add all files in the current `.` directory - you can specify specific files/folders instead, e.g. `git add ./updates`)
- `git commit` (creates a commit) and add your commit description. Use `ctrl+x` to save and `y` to confirm saving
- `git push` (push to cloud repo)
- Enter your credentials
- Done
Setting up SF API for Python
pip3 install simple_salesforce
Usage:
import pandas as pd
from datetime import datetime
from simple_salesforce import Salesforce as sf
# credentials
sf_api = sf(username='',
password='',
security_token='',
domain=''
)
# upsert (update + insert)
data = df.to_dict(orient='records')
response = sf_api.bulk.OBJECT_NAME.upsert(data, EXTERNAL_ID_FIELD, batch_size=2000)
results = pd.DataFrame(columns=['success', 'created', 'SFID', 'errors'], index=df.index)
results['OriginalID'] = df['ID']
for col in ('success', 'created', 'SFID'):
    results[col] = [i[col] for i in response]
results['errors'] = [i['errors'][0]['message'] if i.get('errors') else '' for i in response]
results['RunTime'] = datetime.now().strftime("%Y-%m-%d %H:%M")
results.to_sql(TABLE_NAME, engine, index=False, schema='dbo', chunksize=int(3e5), if_exists='append')
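Queries work the same way; a minimal sketch using `query_all` (the SOQL object and fields are placeholders):

```python
# query_all pages through every result for you
records = sf_api.query_all("SELECT Id, Name FROM Account LIMIT 10")['records']
accounts = pd.DataFrame(records).drop(columns='attributes')
print(accounts.head())
```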
Nice packages for address stuff on top of regex:
- https://github.com/jasonrig/address-net
- https://data.gov.au/dataset/ds-dga-19432f89-dc3a-4ef3-b943-5326ef1dbecc/details
- https://github.com/seatgeek/fuzzywuzzy
- https://github.com/ethanzhao6/au-addr-parser
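For example, a minimal fuzzy-matching sketch with `fuzzywuzzy` for scoring near-duplicate address strings (the addresses are made up):

```python
from fuzzywuzzy import fuzz, process

addresses = [
    "12 Example St, Sydney NSW 2000",
    "12 example street sydney nsw 2000",
    "45 Sample Rd, Melbourne VIC 3000",
]

# similarity score between two strings (0-100)
print(fuzz.token_sort_ratio(addresses[0], addresses[1]))

# best match for a query string against a list of candidates
print(process.extractOne("12 Example Street Sydney", addresses))
```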
How to easily send emails. First, set your credentials as environment variables:
import os
os.environ['EMAIL_USER'] = 'example@example.com' # email address
os.environ['EMAIL_PASSWORD'] = 'test' # password
Usage (you might have to enable developer access on Gmail or whichever SMTP server you use first):
import smtplib, os
from email.message import EmailMessage
smtp_server = "smpt.gmail.com:587" # server smtp
from_email, password = os.getenv('EMAIL_USER'), os.environ.get('EMAIL_PASSWORD')
def setup(from_email, pw, smtp):
    server = smtplib.SMTP(smtp)
    server.ehlo()
    server.starttls()
    server.login(from_email, pw)
    server.ehlo()
    return server
server = setup(from_email, password, smtp_server)
# let's say you've got a df with a column for a person's name and their email
for name, to_email in df.values:
    email_msg = EmailMessage()
    email_msg['X-Priority'] = '2'  # priority of email, 2 is important
    email_msg['From'] = 'Automated Email Bot'  # sender name
    email_msg['Subject'] = 'Automated Email'  # subject field
    email_msg['To'] = to_email  # recipient email, e.g. 'someone@gmail.com'
    # html formatted email body
    msg = f"""
    <p>Hi {name},</p>
    <p>here's some random text you can write in html</p>
    """
    email_msg.add_alternative(msg, subtype='html')
    server.send_message(email_msg, from_addr=from_email)
server.quit()
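If you also need to attach a file, `EmailMessage.add_attachment` handles it; a small sketch to drop inside the loop before `send_message` (the filename is a placeholder):

```python
# attach a CSV report before sending (path/filename is a placeholder)
with open("report.csv", "rb") as f:
    email_msg.add_attachment(
        f.read(),
        maintype="text",
        subtype="csv",
        filename="report.csv",
    )
```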
How to automate Jupyter notebooks. Better alternatives include `luigi` and `apache airflow`, but those are not as easy to get working.
pip3 install papermill
Usage:
- Create notebook(s) for your pipeline
- Go to View -> Cell Toolbar and enable Tags
- Create a cell with `parameters` as the Tag
- Add your parameters here (see the example cell below)
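The tagged `parameters` cell is just ordinary variable assignments; papermill overrides these defaults at run time. A minimal sketch (variable names are placeholders matching the driver script below):

```python
# cell tagged "parameters" inside the notebook - these defaults are used when run manually
variable1 = 'value1'
variable2 = 'value2'
```

The driver script below then executes each notebook in the pipeline and writes the outputs to a timestamped folder: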
import papermill as pm
from datetime import datetime
import os
PIPELINE_NOTEBOOKS = [
    'notebook1.ipynb',
    'notebook2.ipynb',
    # ... more notebooks
    'notebookN.ipynb'
]

def run(nb):
    timestamp = str(datetime.now().date())
    nb_dir = os.getcwd().replace('\\', '/')
    pm_dir = "./" + timestamp  # directory of papermill output notebooks
    params = {
        'variable1': 'value1',
        'variable2': 'value2',
        # ... more parameters
        'variableN': 'valueN'
    }
    if not os.path.exists(pm_dir):
        os.mkdir(pm_dir)
    print(f"""
    Notebook: {nb_dir}/{nb}
    will be executed to: {pm_dir}/{nb}
    """)
    pm.execute_notebook(input_path=f"{nb_dir}/{nb}",
                        output_path=f"{pm_dir}/{nb}",
                        parameters=params,
                        progress_bar=True
                        )

if __name__ == "__main__":
    nb_dir = "./notebooks/pipeline"  # notebook dir
    os.chdir(nb_dir)
    print("Running pipeline")
    for nb in PIPELINE_NOTEBOOKS:
        run(nb)
    print("Completed.")
The "reg" in regex is prounounced like "regular", not "registry" -> reg(ular)ex(pression)
.
Best ones of note:
- Emails: `"^[\w\.-]+@([\w-]+\.)+[\w-]{2,4}$"`
- Names: `"^[a-zA-Z]+(([\'\,\.\- ][a-zA-Z ])?[a-zA-Z]*)*$"`
- Australian Phone Numbers: `"(^1300\d{6}$)|(^1800|1900|1902\d{6}$)|(^0[2|3|7|8]{1}[0-9]{8}$)|(^13\d{4}$)|(^04\d{2,3}\d{6}$)"`
- Websites: `"^((http|https|ftp):\/\/(www\.)?|www\.)[a-zA-Z0-9\_\-]+\.([a-zA-Z]{2,4}|[a-zA-Z]{2}\.[a-zA-Z]{2})(\/[a-zA-Z0-9\-\._\?\&=,'\+%\$#~]*)*$"`
- Alphanumeric: `"[a-zA-Z0-9]+"`
Docs: https://htcondor.readthedocs.io/en/latest/apis/python-bindings/tutorials/index.html
Installation:
pip3 install htcondor classad
python3 -m pip install psycopg2-binary
touch condor_config
export CONDOR_CONFIG=`pwd`/condor_config
Python3 for `htcondor`:
- https://research.cs.wisc.edu/htcondor/HTCondorWeek2014/presentations/TheisenT-Python.pdf
- https://htcondor.readthedocs.io/en/latest/users-manual/index.html
- https://research.cs.wisc.edu/htcondor/manual/v8.1/6_7Python_Bindings.html
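A minimal submit sketch with the Python bindings (executable and file names are placeholders; the submit API shifted between HTCondor versions, so treat this as the classic-bindings pattern rather than gospel):

```python
import htcondor

# describe the job - roughly equivalent to a classic submit file
sub = htcondor.Submit({
    "executable": "/bin/echo",   # placeholder job
    "arguments": "hello condor",
    "output": "hello.out",
    "error": "hello.err",
    "log": "hello.log",
})

schedd = htcondor.Schedd()  # talk to the local schedd
with schedd.transaction() as txn:
    cluster_id = sub.queue(txn)
print(f"Submitted cluster {cluster_id}")
```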
PostgreSQL with `SQLAlchemy`:
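A minimal connection sketch using the `psycopg2-binary` driver installed above (host, database, and credentials are placeholders):

```python
import pandas as pd
import sqlalchemy

# placeholders: swap in your own user/password/host/database
connection_string = "postgresql+psycopg2://user:pw@localhost:5432/mydb"
engine = sqlalchemy.create_engine(connection_string, pool_pre_ping=True)

df = pd.read_sql("SELECT 1 AS test", engine)
print(df)
```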
Sometimes you'll find a website that has data you want to scrape, but when you simply load the html there's no table data. This is because the data is loaded dynamically using JS.
Some options:
- Use `selenium`. This requires you to have a web driver attached to a path you can run; Chromedriver is pretty good, it's what I've used.
Selenium essentially acts as a way to interact with a website through Python. You can click on elements, copy, paste etc.
Here is a simple example using weather data from New York City:
import time
from selenium import webdriver

d = webdriver.Chrome(r"<path_to_driver>\chromedriver")
d.get('https://www.timeanddate.com/weather/usa/new-york/historic?month=1&year=2019')
w = {}
# This gets all values in a drop down menu
for i in d.find_element_by_id('wt-his-select').find_elements_by_tag_name('option'):
    i.click()
    print(i.text)
    # will print 1 January 2019, 2 January 2019 etc...
    # added sleep time as the webpage itself is really shit and if it updates too fast elements are not loaded correctly
    time.sleep(3)
    # add weather (get_weather_data is a user-defined parser, not shown here)
    with get_weather_data(d.page_source, False) as weather:
        w[i.text] = weather
Quick tip: Depending on your system speed, you'll loop through faster than you'll be able to harvest data if you save. I recommend adding around a 2-3 second sleep each iteration in order to allow time for all the data in the website to fully load.
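Since `get_weather_data` above is a user-defined helper that isn't shown, here is one hypothetical way to implement that parsing step with `pandas.read_html` (a plain function rather than the context manager used above):

```python
import pandas as pd

def parse_weather_tables(page_source):
    # read_html returns a list of DataFrames, one per <table> in the HTML
    # (it raises ValueError if the page has no tables)
    tables = pd.read_html(page_source)
    return tables[0]
```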
- Load the data directly from the ajax request the page makes. This thread pretty much explains how to grab the data you want using Chrome.
Once you have the URL it's simply a matter of parsing it, after which you can work with the data. The function below parses AJAX-requested data into a dictionary, which can then be interrogated and worked with.
import requests
import json
def get_json_data_from_url(url):
    # set headers for get request (prevents 403 error)
    headers = {'User-Agent': 'Mozilla/5.0'}
    # send get request and parse response as json
    text = requests.get(url, headers=headers).text
    data = json.loads(text)
    return data
Normal:
pip3 install jupyter_contrib_nbextensions nbgrader
jupyter nbextension install --sys-prefix --py nbgrader --overwrite
jupyter nbextension enable --sys-prefix --py nbgrader
jupyter serverextension enable --sys-prefix --py nbgrader
If `PermissionDenied` occurs, replace `--sys-prefix` with `--user`.
To automatically launch Jupyter Notebook in your Windows browser from WSL2:
- Generate a config using `jupyter notebook --generate-config`
- Open the config `jupyter_notebook_config.py` using `code` or another editor of your choice
- Brave Browser example - add the following lines:
c.NotebookApp.browser = u'/mnt/c/Program\ Files/BraveSoftware/Brave-Browser/Application/brave.exe %s'
c.NotebookApp.use_redirect_file = False
- Download docker and install it.
- (Recommended) If you have WSL2 on a Windows 10 device, set this as your main backend instead of Windows.
- Grab the AWS Glue package (https://hub.docker.com/r/amazon/aws-glue-libs) by running `docker pull amazon/aws-glue-libs:glue_libs_1.0.0_image_01` in terminal (tag = `glue_libs_1.0.0_image_01` as of March 2021).
- Initialise and run the container:
docker run -itd -p 8888:8888 -p 4040:4040 -v ~/.aws:/root/.aws:ro --name glue_jupyter amazon/aws-glue-libs:glue_libs_1.0.0_image_01 /home/jupyter/jupyter_start.sh
- Check that the container is up and running using `docker ps` (you should see `glue_jupyter` running).
- Launch the notebook at http://localhost:8888 and code away!
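A quick sanity check to run in the Glue notebook; a minimal sketch assuming nothing beyond a working PySpark kernel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanity_check").getOrCreate()

# tiny in-memory DataFrame just to prove Spark is wired up
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
print(spark.version)
```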
Other info:
- The container has most libraries and can be shut down / restarted using the Docker UI (Windows 10).
- Read more about Spark with "Learning Spark" from O'Reilly (Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee)