In data projects we keep running into the same design problems. Abstract methods, data models and data validators help to solve them. That is what I try to show here.
Classes depend on abstract classes (Python Protocols), not on specific classes.
Protocols allow structural subtyping: checking whether two classes are compatible based on their available attributes and functions alone.
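To make structural subtyping concrete, here is a minimal sketch using only the standard library. `FakeScrap` and `NotAScraper` are hypothetical classes invented for this illustration, and `runtime_checkable` is added only so the match can be checked with `isinstance`; it is not needed for static type checking.

```python
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class WebScrap(Protocol):
    """Anything that has a download_json() method counts as a WebScrap."""

    def download_json(self) -> Iterator[dict]:
        ...


class FakeScrap:
    """Hypothetical scraper: it does NOT inherit from WebScrap."""

    def download_json(self) -> Iterator[dict]:
        yield {"title": "Example University", "rank": "1"}


class NotAScraper:
    """Has no download_json(), so it does not match the protocol."""


def first_item(scraper: WebScrap) -> dict:
    # A static type checker accepts any object with a matching download_json().
    return next(iter(scraper.download_json()))


item = first_item(FakeScrap())
# isinstance() on a runtime_checkable Protocol only checks that the method exists.
matches = isinstance(FakeScrap(), WebScrap)
no_match = isinstance(NotAScraper(), WebScrap)
```

`FakeScrap` is accepted without any inheritance, because only the presence of `download_json` matters.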
-
Specific class
Example by https://www.youtube.com/watch?v=UvFphlHWchU&list=WL&index=6
import httpx
import json


class ScrapTopUniversity:
    '''Web Scraping'''

    def __init__(self, url):
        self.url = url

    def download_json(self):
        self.resp = httpx.get(self.url)
        for node in self.resp.json()['score_nodes']:
            yield node
-
Protocol (Abstract Class)
from typing import Protocol


class WebScrap(Protocol):
    '''Protocol for Scraping classes'''

    def download_json(self):
        '''Download data from web API'''
-
How to use
The processor depends on the abstract class (Protocol) instead of the concrete class:
from rich import print

from libs.protocols import WebScrap
from libs.modules import ScrapTopUniversity


class ScrapProcessor:
    def download_json(self, webS: WebScrap):
        return webS.download_json()


def main():
    url = "https://www.topuniversities.com/rankings/endpoint?nid=3846212&page=4&items_per_page=15&tab=&region=&countries=&cities=&search=&star=&sort_by=&order_by=&program_type="
    scrap = ScrapProcessor()
    top = scrap.download_json(ScrapTopUniversity(url))
    for item in top:
        print(item)


if __name__ == "__main__":
    main()
Pydantic allows custom validators and serializers to alter how data is processed in many powerful ways. More information https://docs.pydantic.dev/latest/
In the scraping example, one field sometimes comes back blank from the API, so we need to set a default value for it. We can define a BaseModel, a feature of the Pydantic library, and add a validator for that field.
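from pydantic import BaseModel, validator


class DataUni(BaseModel):
    title: str
    region: str
    stars: str
    country: str
    city: str
    rank: str

    # Pydantic v1-style validator; in Pydantic v2 the equivalent decorator is field_validator.
    @validator('stars')
    @classmethod
    def stars_default(cls, value):
        # Replace a blank value from the API with a default.
        if value == '':
            return '0'
        # Return every other value unchanged; without this the field would become None.
        return value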
-
main
from rich import print

from libs.protocols import WebScrap
from libs.modules import ScrapTopUniversity
from libs.models import DataUni


class ScrapProcessor:
    def download_json(self, webS: WebScrap):
        return webS.download_json()


def main():
    url = "https://www.topuniversities.com/rankings/endpoint?nid=3846212&page=4&items_per_page=15&tab=&region=&countries=&cities=&search=&star=&sort_by=&order_by=&program_type="
    scrap = ScrapProcessor()
    top = scrap.download_json(ScrapTopUniversity(url))
    item = [DataUni(**t) for t in top]
    for row in item:
        print(row.dict())


if __name__ == "__main__":
    main()
With dependency injection, an object or function receives the objects or functions it depends on instead of creating them.
It helps to decrease coupling and increase cohesion. These two metrics are often inversely correlated, and we should aim for low coupling and high cohesion.
The previous example already applies it: the download_json method of ScrapProcessor receives a WebScrap instead of creating one.
from libs.protocols import WebScrap


class ScrapProcessor:
    def download_json(self, webS: WebScrap):
        return webS.download_json()
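One practical benefit of receiving the scraper from outside is that it can be swapped for a stand-in, for example in a test that should not hit the network. The sketch below redefines `WebScrap` and `ScrapProcessor` so that it runs on its own; `InMemoryScrap` and its sample row are hypothetical and exist only for this illustration.

```python
from typing import Iterable, Protocol


class WebScrap(Protocol):
    """Protocol for scraping classes."""

    def download_json(self) -> Iterable[dict]:
        ...


class ScrapProcessor:
    def download_json(self, webS: WebScrap) -> Iterable[dict]:
        # The processor never builds a scraper itself; it uses the one it is given.
        return webS.download_json()


class InMemoryScrap:
    """Hypothetical stand-in that returns canned rows instead of calling an API."""

    def __init__(self, rows: list) -> None:
        self.rows = rows

    def download_json(self) -> Iterable[dict]:
        return iter(self.rows)


rows = [{"title": "Example University", "rank": "1"}]
result = list(ScrapProcessor().download_json(InMemoryScrap(rows)))
```

`ScrapProcessor` is unchanged whether it receives the real scraper or the stand-in, which is the low coupling described above.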
PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. For more information: https://spark.apache.org/docs/latest/api/python/index.html#:~:text=PySpark%20is%20the%20Python%20API,for%20interactively%20analyzing%20your%20data.
-
Config
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf().setAppName("MyScraper") \
    .setMaster("local[2]") \
    .set("spark.executor.memory", "2g") \
    .set("spark.executor.cores", "2")

sc = SparkSession.builder.config(conf=conf).getOrCreate()

print(SparkConf().getAll())
-
Using with Scraper
def main():
    url = "https://www.topuniversities.com/rankings/endpoint?nid=3846212&page=4&items_per_page=15&tab=&region=&countries=&cities=&search=&star=&sort_by=&order_by=&program_type="
    scrap = ScrapProcessor()
    top = scrap.download_json(ScrapTopUniversity(url))
    item = [DataUni(**t) for t in top]

    # for row in item:
    #     print(row.dict())

    # Load the validated rows into a Spark DataFrame.
    df = sc.createDataFrame(data=item)
    df.show(truncate=False)

    # Register a temporary view so the data can be queried with SQL.
    df.createOrReplaceTempView("table")
    sc.sql('select title, rank from table order by rank desc').show(20, False)
-
Example
+---------------+----------------+----+-------------+-----+----------------------------------------------------------+
|city           |country         |rank|region       |stars|title                                                     |
+---------------+----------------+----+-------------+-----+----------------------------------------------------------+
|Mexico City    |Mexico          |61  |Latin America|0    |Universidad Nacional Autónoma de México (UNAM)            |
|Seattle        |United States   |62  |North America|0    |University of Washington                                  |
|Dhahran        |Saudi Arabia    |63  |Asia         |0    |King Fahd University of Petroleum & Minerals              |
|Paris          |France          |64  |Europe       |0    |Sorbonne University                                       |
|Barcelona      |Spain           |65  |Europe       |0    |Universitat Politècnica de Catalunya · BarcelonaTech (UPC)|
|Kuala Lumpur   |Malaysia        |66  |Asia         |0    |Universiti Malaya (UM)                                    |
|Kyoto          |Japan           |67  |Asia         |0    |Kyoto University                                          |
|Chennai        |India           |68  |Asia         |0    |Indian Institute of Technology Madras (IITM)              |
|São Paulo      |Brazil          |69  |Latin America|0    |Universidade de São Paulo                                 |
|Melbourne      |Australia       |70  |Oceania      |0    |Monash University                                         |
|New Haven      |United States   |71  |North America|0    |Yale University                                           |
|Harbin         |China (Mainland)|72  |Asia         |0    |Harbin Institute of Technology                            |
|University Park|United States   |73  |North America|0    |Pennsylvania State University                             |
|Pohang         |South Korea     |74  |Asia         |0    |Pohang University of Science And Technology (POSTECH)     |
|Monterrey      |Mexico          |75  |Latin America|NULL |Tecnológico de Monterrey                                  |
+---------------+----------------+----+-------------+-----+----------------------------------------------------------+

+----------------------------------------------------------+----+
|title                                                     |rank|
+----------------------------------------------------------+----+
|Tecnológico de Monterrey                                  |75  |
|Pohang University of Science And Technology (POSTECH)     |74  |
|Pennsylvania State University                             |73  |
|Harbin Institute of Technology                            |72  |
|Yale University                                           |71  |
|Monash University                                         |70  |
|Universidade de São Paulo                                 |69  |
|Indian Institute of Technology Madras (IITM)              |68  |
|Kyoto University                                          |67  |
|Universiti Malaya (UM)                                    |66  |
|Universitat Politècnica de Catalunya · BarcelonaTech (UPC)|65  |
|Sorbonne University                                       |64  |
|King Fahd University of Petroleum & Minerals              |63  |
|University of Washington                                  |62  |
|Universidad Nacional Autónoma de México (UNAM)            |61  |
+----------------------------------------------------------+----+
-