Bechdel Test in Movies

This project is part of the final requirement for DataTalks.Club's data engineering bootcamp.

The Bechdel test, or the Bechdel-Wallace test, is a simple measure of the representation of women in media. It uses a set of criteria to indicate how present women are in a piece of work. A work passes the Bechdel test if it meets all of these criteria:

the work has at least two [named] women
the [named] women talk to each other
the [named] women talk to each other about something besides a man
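
The criteria are cumulative: each one only counts if the previous one holds, which is why BechdelTest.com reports a rating from 0 to 3 rather than a plain pass/fail. A minimal sketch of that scoring logic, assuming the three criteria are available as booleans (the function names and arguments are hypothetical and not part of this project's code):

```python
def bechdel_rating(two_named_women: bool,
                   talk_to_each_other: bool,
                   talk_about_non_man: bool) -> int:
    """Return how many criteria are met in order (0 to 3)."""
    rating = 0
    for criterion in (two_named_women, talk_to_each_other, talk_about_non_man):
        # stop at the first unmet criterion; later ones cannot count
        if not criterion:
            break
        rating += 1
    return rating


def passes_bechdel(rating: int) -> bool:
    """A work passes only when all three criteria are met."""
    return rating == 3
```

For example, a movie with two named women who never speak to each other would get a rating of 1 and would not pass.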

The main source of movies with a Bechdel score is https://bechdeltest.com/. This project builds a pipeline that ingests data from this and other sources in order to keep an analysis of the representation of women in movies up to date.

Data Architecture

Collecting Data From Source

Data is collected from the following web sources and database:

BechdelTest.com API
Academy Awards database
Available IMDb datasets
The Movie Database API (to be added)

See more info in datasets.
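
As an illustration of the first source, the sketch below normalizes movie records of the kind the BechdelTest.com API returns as JSON. The field names (imdbid, title, year, rating) and the sample payload are assumptions for illustration and should be checked against the API documentation. The function name is hypothetical, and this is not the project's actual extraction code. It parses a string instead of calling the API, to avoid adding load on the site.

```python
import json


def parse_bechdel_records(payload: str) -> list:
    """Convert a JSON array of movie records into typed rows.

    Assumed fields: imdbid (str), title (str), year (int), rating (0 to 3).
    """
    rows = []
    for record in json.loads(payload):
        rating = int(record["rating"])
        rows.append({
            # assumption: the ID comes without the "tt" prefix used in IMDb datasets
            "imdb_id": "tt" + str(record["imdbid"]),
            "title": record["title"].strip(),
            "year": int(record["year"]),
            "rating": rating,
            "passes": rating == 3,
        })
    return rows


# made-up sample payload, not real API output
sample = '[{"imdbid": "0000001", "title": " Example Movie ", "year": 1999, "rating": "3"}]'
rows = parse_bechdel_records(sample)
```

Normalizing the ID at ingestion time would make later joins against the IMDb datasets simpler, if the two sources do differ in this way.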

Configure cloud resources using Terraform

Resources are configured and provisioned using Terraform. This requires GCP service account credentials in order to create a Google Cloud Storage bucket and a BigQuery dataset in the specified GCP project. See the Terraform folder for more info.

To run Terraform, make sure to change the path to the service account file and the project name in variables.tf. Then, execute the following commands:

terraform init
terraform plan
terraform apply

Setting up Prefect flows

The Python files under the etl folder contain the scripts for the whole workflow. Using Prefect blocks and flows, we can create Prefect deployments that run the workflows for extracting data and loading it to GCS and BigQuery.

Create virtual environment

First, create a Python virtual environment to hold all the packages needed for deploying through Prefect.

# create virtual environment named project-venv
# (venv is part of the Python 3 standard library, so virtualenv is not required)
python3 -m venv project-venv

# activate virtual environment
source ./project-venv/bin/activate

# install all needed packages
pip install -r requirements.txt

NOTE: Selenium needs a separate webdriver to be installed (see install webdriver) before the primary deployment is run in Prefect. Alternatively, a deployment that does not use Selenium and instead uses the files from datasets can be run, as shown in Run full ETL workflow.

Connect to Prefect cloud

We can view our workflows either in Prefect Orion or in Prefect Cloud. To connect to Prefect Cloud, first create an account, then generate an API key from your profile.

# Make sure Prefect is successfully installed
prefect --help

# (optional) you can create a new profile and set it as your active profile
prefect profile create cloud-user
prefect profile use cloud-user

# Login to Prefect cloud. This will prompt for the generated API_KEY
prefect cloud login

# Set configuration for Prefect account (no spaces around "=")
prefect config set PREFECT_API_URL="<API_URL>"

Create Prefect blocks and deployments

Before running the script to create blocks, make sure the service account file is saved as ~/keys/project_service_key.json, or change the path to the file in create_prefect_blocks.py. Also change the bucket_name variable to the appropriate GCS bucket:

if __name__ == "__main__":

    # create gcp credentials block
    service_key_path = "~/keys/project_service_key.json"  # change service account file path
    gcp_cred_block = create_gcp_cred_block(service_key_path,
                                           "bechdel-project-gcp-cred")

    # create gcs bucket block
    bucket_name = "bechdel-project_data-lake"  # change bucket name
    create_gcs_bucket(gcp_cred_block, bucket_name, "bechdel-project-gcs")
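
One detail worth checking: Python does not expand ~ in file paths on its own, so a path such as ~/keys/project_service_key.json only works if it is expanded somewhere before the file is opened. Whether create_gcp_cred_block already does this is not shown here. A minimal sketch of a defensive check, using a hypothetical helper name:

```python
from pathlib import Path


def resolve_service_key(path: str) -> Path:
    """Expand "~" and confirm that the service account file exists."""
    key_path = Path(path).expanduser()
    if not key_path.is_file():
        raise FileNotFoundError(f"Service account file not found: {key_path}")
    return key_path
```

Calling a helper like this before creating the credentials block would make a wrong path fail early with a clear message, not later inside the block creation.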

Additionally, make sure to change the bucket_name for the following files:

# create blocks
python3 etl/create_prefect_blocks.py

# create deployments
python3 etl/create_prefect_deployments.py

Run full ETL workflow

Trigger the alternative full ETL deployment to avoid overusing the BechdelTest.com API (as advised by the site's owner), or if you have trouble installing Selenium.

# start Prefect agent
prefect agent start -q default

# trigger alternative full ETL workflow
prefect deployment run full-etl-flow-alt/bechdel-etl-full-alt

It takes approximately 2 hours to run the full script using an e2-standard-4 instance in GCP.

Transform data using dbt

Before triggering data transformation of BigQuery tables, make sure to update the service account file path and the project name in profiles.yml for both dev and prod targets. Do the same for the database/project name in schema.yml under staging models. Then, we can run the following:

# trigger dbt development for testing
dbt build

# trigger dbt production through Prefect
# this deployment is scheduled to run every month
prefect deployment run dbt-prod-flow/trigger-dbt-prod

Dashboard and data analysis

The dashboard is created using Looker with a data connection to BigQuery. View the dashboard here.

Recommendations

add further analyses and measures, such as whether having more women in the cast/crew affects a movie's Bechdel test score
add more charts to the dashboard and improve the visualizations
incorporate other tests to compare with the Bechdel test
further develop and organize dbt models and configurations
store variables in a single file so they are easier to update or change
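
For the last recommendation, one possible shape is a small settings module that every script imports. The sketch below is an assumption about how this could look, not existing project code. The class name, function name and environment variable names are hypothetical; the defaults reuse the key path and bucket name from create_prefect_blocks.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectSettings:
    """Single place for values that are currently edited in several files."""
    service_key_path: str
    bucket_name: str


def load_settings(env=None) -> ProjectSettings:
    """Read settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    return ProjectSettings(
        service_key_path=env.get("SERVICE_KEY_PATH", "~/keys/project_service_key.json"),
        bucket_name=env.get("GCS_BUCKET_NAME", "bechdel-project_data-lake"),
    )
```

With something like this in place, changing the bucket would mean setting one environment variable (or editing one default) instead of touching several scripts.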
