This project focuses on data engineering and exploratory data analysis of a large-scale streaming platform catalog. The main objective is to process compressed columnar data to extract insights about content distribution, release trends, genres, and production origins.
The pipeline demonstrates advanced handling of optimized big data storage formats, schema enforcement, and structural data manipulation.
Dataset: the analysis uses a catalog stored as a Snappy-compressed Parquet file (snappy.parquet). It contains details on movies and TV shows, including directors, cast, countries, and nested arrays for genres.
- Python: core language for dataset processing.
- Apache Spark / PySpark: distributed computing framework used for the initial heavy transformation and metadata management.
- Pandas: utilized for local aggregation and data profiling.
- PyArrow / Fastparquet: Parquet engines for Pandas; either one can read the Snappy-compressed columnar file.
- Parquet integration: implementation of efficient data loading using binary columnar storage optimized with snappy compression.
- Metadata inspection: validation of explicit data types, ensuring dates, integers, and nested array structures are correctly mapped without information loss.
- Duration parsing: isolation of runtime metrics by splitting raw string descriptions into independent numerical attributes for movie runtimes (`duration_min`) and series lengths (`duration_seasons`).
- Geographical classification: development of logical triggers to identify multi-country co-productions (`its_multicountry`) versus single-origin titles.
- Array unpacking: extraction and structuring of nested categorical variables contained within the genre arrays.
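The three transformations above could be sketched in Pandas roughly as follows. This is a minimal illustration, not the project's actual code: the source column names (`type`, `duration`, `country`, `genres`) and the toy rows are assumptions, while `duration_min`, `duration_seasons`, and `its_multicountry` are the derived attributes named above.

```python
import pandas as pd

# Toy rows standing in for the real catalog; source column names are assumed.
df = pd.DataFrame(
    {
        "type": ["Movie", "TV Show", "Movie"],
        "duration": ["90 min", "2 Seasons", "125 min"],
        "country": ["Spain, France", "United States", None],
        "genres": [["Dramas", "International Movies"], ["TV Comedies"], ["Documentaries"]],
    }
)

# Duration parsing: take the leading integer, then route it by its unit.
amount = pd.to_numeric(df["duration"].str.extract(r"(\d+)", expand=False), errors="coerce")
is_minutes = df["duration"].str.contains("min", na=False)
is_seasons = df["duration"].str.contains("Season", na=False)
df["duration_min"] = amount.where(is_minutes).astype("Int64")
df["duration_seasons"] = amount.where(is_seasons).astype("Int64")

# Geographical classification: a comma in the country string is treated
# as the marker of a co-production; missing countries count as single-origin.
df["its_multicountry"] = df["country"].fillna("").str.contains(",")

# Array unpacking: one row per (title, genre) pair, then a frequency table.
genres_long = df.explode("genres").rename(columns={"genres": "genre"})
genre_counts = genres_long["genre"].value_counts()
```

The nullable `Int64` dtype keeps the two duration columns as integers even though each is empty for the other content type.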
- Content distribution: evaluation of the ratio between movies and TV shows across different release eras.
- Regional analysis: tracking dominant production hubs and identifying key clusters of international collaboration.
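One possible way to compute the two analyses above is sketched here. It assumes hypothetical `type`, `release_year`, and `country` columns, treats a decade as a "release era", and uses made-up rows; the real notebook may bucket eras or count collaborations differently.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy rows; column names and values are illustrative assumptions.
df = pd.DataFrame(
    {
        "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
        "release_year": [1995, 2012, 2015, 2021, 2021],
        "country": [
            "United States",
            "Spain, France",
            "United States, Canada",
            "Spain",
            "France, Spain",
        ],
    }
)

# Content distribution: share of each content type within every decade.
df["decade"] = df["release_year"] // 10 * 10
type_share = pd.crosstab(df["decade"], df["type"], normalize="index")

# Regional analysis: split the country string into a sorted, de-duplicated list.
countries = (
    df["country"]
    .fillna("")
    .str.split(",")
    .apply(lambda parts: sorted({p.strip() for p in parts if p.strip()}))
)

# Production hubs: how many titles each country took part in.
country_counts = countries.explode().value_counts()

# Collaboration clusters: how often each pair of countries co-produced a title.
pair_counts = Counter(pair for names in countries for pair in combinations(names, 2))
```

Sorting the country names before pairing makes "Spain, France" and "France, Spain" count as the same collaboration.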
- Ensure you have Python 3.x installed along with an environment capable of reading parquet files.
- Place the data file `part-00000-5512fdbf-3ba7-4778-92ca-0df880472840-c000.snappy.parquet` in your project data directory.
- Install the required dependencies via your terminal:
pip install pandas pyarrow fastparquet matplotlib seaborn
- Create or run your analysis script (or Jupyter Notebook) to query the parquet dataframe directly.
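As a starting point for the last step, a loading script might look like the sketch below. The `data` directory name and the helper names `load_catalog` and `profile` are hypothetical; only `pd.read_parquet` and its `engine` argument come from Pandas itself. Snappy decompression is handled by the Parquet engine, so no extra step is needed.

```python
from pathlib import Path

import pandas as pd


def load_catalog(data_dir="data", engine="pyarrow"):
    """Read every *.snappy.parquet part file under data_dir into one DataFrame."""
    parts = sorted(Path(data_dir).glob("*.snappy.parquet"))
    if not parts:
        raise FileNotFoundError(f"no .snappy.parquet files found in {data_dir!r}")
    frames = [pd.read_parquet(part, engine=engine) for part in parts]
    return pd.concat(frames, ignore_index=True)


def profile(df):
    """Summarise dtype and null count per column as a quick metadata check."""
    return pd.DataFrame({"dtype": df.dtypes.astype(str), "nulls": df.isna().sum()})


# Example usage (assumes the part file sits in ./data):
# catalog = load_catalog("data")
# print(profile(catalog))
```

Passing `engine="fastparquet"` instead should work the same way if that package is the one installed.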
This project was developed as part of the Big Data and Data Engineering modules of the master's degree, showcasing proficiency in modern storage formats, schema design, and analytical pipeline optimization.
Developed by: Alicia Santamaría Román