A data engineering project collecting data from various sources to analyze the representation of women in movies using the Bechdel test

dherzey/bechdel-movies-project

Bechdel Test in Movies

This project is part of the final requirement for DataTalks.Club's data engineering bootcamp.

The Bechdel test, or the Bechdel-Wallace test, is a simple measure of the representation of women in media. It uses a set of criteria to indicate how present women are in a piece of work. A work passes the Bechdel test if it meets all of the following criteria:

  1. the work has at least two [named] women
  2. the [named] women talk to each other
  3. the [named] women talk to each other about something besides a man
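As a minimal sketch, the three criteria can be read as cumulative steps, giving a score from 0 to 3 where 3 means the work passes. The function and parameter names below are hypothetical, and the cumulative 0 to 3 scale is assumed here to mirror how BechdelTest.com reports its ratings.

```python
def bechdel_score(two_named_women: bool,
                  talk_to_each_other: bool,
                  about_something_besides_a_man: bool) -> int:
    """Count the criteria met in order, stopping at the first failure."""
    score = 0
    for criterion in (two_named_women,
                      talk_to_each_other,
                      about_something_besides_a_man):
        if not criterion:
            break
        score += 1
    return score


def passes_bechdel(score: int) -> bool:
    """A work passes only when all three criteria are met."""
    return score == 3
```

Because each criterion builds on the previous one, a work with two named women who never talk to each other scores 1, regardless of the third argument.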

The most prominent source of movies with Bechdel scores is https://bechdeltest.com/. This project aims to build a pipeline that ingests data from this and other sources to support an up-to-date analysis of the representation of women in movies.

Data Architecture

[Figure: data architecture of the project]

Collecting Data From Source

Data is collected from the following web sources and databases:

  • BechdelTest.com API
  • Academy Awards database
  • IMDB available datasets
  • The Movie Database API (to be added)

See more info in datasets.
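Combining these sources requires a shared key. The sketch below joins Bechdel records to IMDb rows on the IMDb ID. It assumes that Bechdel records carry a bare numeric `imdbid` while the IMDb datasets use a `tt`-prefixed `tconst`; the helper names and the sample values in the usage note are made up for illustration and are not taken from the project's scripts.

```python
def to_tconst(imdbid: str) -> str:
    """Return the 'tt'-prefixed IMDb ID, zero-padded to at least 7 digits."""
    digits = imdbid.strip()
    if digits.startswith("tt"):
        digits = digits[2:]
    return "tt" + digits.zfill(7)


def join_on_imdb_id(bechdel_rows, imdb_rows):
    """Attach IMDb fields to each Bechdel record that has a matching title."""
    imdb_by_id = {row["tconst"]: row for row in imdb_rows}
    joined = []
    for row in bechdel_rows:
        match = imdb_by_id.get(to_tconst(row["imdbid"]))
        if match is not None:
            joined.append({**match,
                           "title": row["title"],
                           "bechdel_rating": int(row["rating"])})
    return joined
```

For example, a Bechdel record with `imdbid` `"0000001"` would be matched to an IMDb row whose `tconst` is `"tt0000001"`, and records without a match are dropped (an inner join).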

Configure cloud resources using Terraform

Resources are configured and provisioned using Terraform. It needs GCP service account credentials to create a Google Cloud Storage bucket and a BigQuery dataset in the specified GCP project. See the Terraform folder for more info.

To run Terraform, first change the path to the service account file and the project name in variables.tf. Then, execute the following commands:

  1. terraform init
  2. terraform plan
  3. terraform apply

Setting up Prefect flows

The Python files under the etl folder contain the scripts for the whole workflow. Using Prefect blocks and flows, we can create Prefect deployments that run the workflows for extracting and loading data to GCS and BigQuery.
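To illustrate how flows and tasks fit together, here is a minimal sketch that is not taken from the etl folder. The task and flow names are hypothetical, and the `try`/`except` fallback is only there so the example also runs as plain Python when Prefect is not installed.

```python
try:
    from prefect import flow, task
except ImportError:
    # Fallback: plain pass-through decorators when Prefect is unavailable.
    def task(fn):
        return fn
    flow = task


@task
def extract(raw_rows):
    """Stand-in for an extraction step; the real scripts pull from web sources."""
    return list(raw_rows)


@task
def keep_rated(rows):
    """Drop records that have no rating."""
    return [row for row in rows if row.get("rating") is not None]


@flow
def mini_etl(raw_rows):
    """Chain the two tasks the way a flow chains extraction and loading steps."""
    return keep_rated(extract(raw_rows))
```

Calling `mini_etl([...])` runs both steps in order; with Prefect installed, each call is also recorded as a flow run that can be inspected in the UI.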

Create virtual environment

First, create a Python virtual environment to hold all the packages needed for deploying through Prefect.

# install virtualenv (optional: the built-in venv module used below does not need it)
pip install virtualenv

# create virtual environment named project-venv
python3 -m venv project-venv

# activate virtual environment
source ./project-venv/bin/activate

# install all needed packages
pip install -r requirements.txt

NOTE: Selenium requires a webdriver, which must be installed (install webdriver) before running the primary deployment in Prefect. Alternatively, run the deployment that does not use Selenium and reads files from datasets instead, as shown in running full ETL workflow.

Connect to Prefect cloud

We can view our workflows in either Prefect Orion or Prefect cloud. To connect to Prefect cloud, first create an account, then generate an API key from your profile.

# Make sure Prefect is successfully installed
prefect --help

# (optional) you can create a new profile and set it as your active account
prefect profile create cloud-user
prefect profile use cloud-user

# Login to Prefect cloud. This will prompt for the generated API_KEY
prefect cloud login

# Set configuration for Prefect account
prefect config set PREFECT_API_URL="<API_URL>"

Create Prefect blocks and deployments

Before running the script to create blocks, make sure the service account file is saved as ~/keys/project_service_key.json, or change the path to the file in create_prefect_blocks.py. Also change the bucket_name variable to the appropriate GCS bucket:

if __name__=="__main__":

    # create gcp credentials block
    service_key_path = "~/keys/project_service_key.json" #change service account file path
    gcp_cred_block = create_gcp_cred_block(service_key_path, 
                                           "bechdel-project-gcp-cred")
    
    # create gcs bucket block
    bucket_name = "bechdel-project_data-lake" #change bucket name
    create_gcs_bucket(gcp_cred_block, bucket_name, "bechdel-project-gcs")
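Before creating the blocks, it can help to confirm that the key file exists and looks like a service account key. The check below is a sketch under stated assumptions: `check_service_key` is a hypothetical helper, not part of create_prefect_blocks.py, and it only inspects the standard fields of a GCP service account JSON key.

```python
import json
from pathlib import Path

# Fields expected in a GCP service account JSON key.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def check_service_key(path_str: str) -> Path:
    """Expand '~', then verify the file looks like a service account key."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Service account file not found: {path}")
    data = json.loads(path.read_text())
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Key file is missing fields: {sorted(missing)}")
    if data["type"] != "service_account":
        raise ValueError(f"Unexpected credential type: {data['type']!r}")
    return path
```

Calling `check_service_key("~/keys/project_service_key.json")` returns the expanded path on success, so a typo in the path fails early with a clear message instead of surfacing later during block creation.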

Additionally, make sure to change the bucket_name for the following files:

# create blocks
python3 etl/create_prefect_blocks.py

# create deployments
python3 etl/create_prefect_deployments.py

Run full ETL workflow

Trigger the alternative full ETL deployment to avoid overusing the BechdelTest.com API (as advised by the site's owner), or if you have trouble installing Selenium.

# start Prefect agent
prefect agent start -q default

# trigger alternative full ETL workflow
prefect deployment run full-etl-flow-alt/bechdel-etl-full-alt

Running the full script takes approximately 2 hours on an e2-standard-4 instance in GCP.

Transform data using dbt

Before triggering data transformation of BigQuery tables, make sure to update the service account file path and the project name in profiles.yml for both dev and prod targets. Do the same for the database/project name in schema.yml under staging models. Then, we can run the following:

# trigger dbt development for testing
dbt build

# trigger dbt production through Prefect
# this deployment is scheduled to run every month
prefect deployment run dbt-prod-flow/trigger-dbt-prod

Dashboard and data analysis

The dashboard is created using Looker with a data connection to BigQuery. View the dashboard here.

Recommendations

  • add further analyses and measures, such as whether having more women in the cast/crew affects the Bechdel test score of a movie
  • add additional charts in the dashboard and enhance visualization
  • try to incorporate other tests to compare with the Bechdel test
  • further develop and organize dbt models and configurations
  • store variables in a single file for easier update or changes
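The last recommendation could look something like the sketch below: one module that holds the values currently edited in several files. The class name, function name, and environment variable names are hypothetical; the defaults for the bucket and key path are the ones used earlier in this README, and `<GCP_PROJECT>` is a placeholder.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectSettings:
    gcp_project: str
    bucket_name: str
    service_key_path: str


def load_settings(env=os.environ) -> ProjectSettings:
    """Read settings from environment variables, falling back to README defaults."""
    return ProjectSettings(
        gcp_project=env.get("BECHDEL_GCP_PROJECT", "<GCP_PROJECT>"),
        bucket_name=env.get("BECHDEL_BUCKET_NAME", "bechdel-project_data-lake"),
        # expanduser turns a leading '~' into the user's home directory.
        service_key_path=os.path.expanduser(
            env.get("BECHDEL_SERVICE_KEY", "~/keys/project_service_key.json")
        ),
    )
```

Scripts would then import `load_settings()` instead of hard-coding the bucket name or key path, so a change is made in one place.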
