This repository has been archived by the owner on Mar 4, 2019. It is now read-only.

A multivariate regression model using pre-release film data to estimate the number of torrent copies that might become available online

bryantbiggs/movie_torrents

Estimating the Number of Movie Torrents


Project Description

A multivariate regression model using pre-release film data to estimate the number of torrent copies that might become available online.


Motivation

As more media content (films, TV shows, music, etc.) moves online to streaming providers and other digital sources, pirating that content has become easier through torrent sites. This investigation looks at which pre-release factors best predict the number of torrent copies that will become readily available online for a given movie.


Data Sources

  • Film Data

    • The Numbers - The Numbers provides detailed movie financial analysis, including box office, DVD and Blu-ray sales reports, and release schedules
    • OMDb API - The OMDb API is a free web service to obtain movie information
  • Torrent Data

    • Kickass Torrents - KAT provided a directory for torrent files and magnet links to facilitate peer-to-peer file sharing using the BitTorrent protocol.

    • Torrentz - Torrentz is a meta-search engine for BitTorrent that indexed torrents from various major torrent websites, and offered compilations of various trackers.

    • Note - The torrent data sources for this project have since been either seized or voluntarily shut down.


Libraries Utilized

  • beautifulsoup4, requests - Retrieve and extract data sources from the web
  • s3fs - Data storage in AWS (access available upon request)
  • numpy, pandas - Clean and aggregate data sources
  • scikit-learn - Estimator models

Process

  1. Crawl web pages to retrieve data sources
  2. ETL - Extract data from the web page responses, clean it and cast each field to its proper data type (int, string, etc.), then load it into S3 for long-term storage
  3. Combine/join data sources into one table/multi-dimensional array
  4. Train and test regression models to make future predictions
  5. Tune/optimize model parameters and perform feature engineering - repeat step 4
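
Steps 2 through 4 above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code. The column names (`title`, `budget`, `runtime`, `torrent_count`), the helper functions, and the synthetic data are all assumptions made for the example, and `LinearRegression` stands in for whichever scikit-learn estimator the project used. The scraping step and the S3 load are omitted so that the sketch runs offline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def make_synthetic_data(n=40, seed=0):
    """Build fake 'scraped' tables (hypothetical columns, not real film data)."""
    rng = np.random.default_rng(seed)
    budgets = rng.integers(1_000_000, 200_000_000, size=n).tolist()
    runtimes = rng.integers(80, 180, size=n).tolist()
    titles = [f"film_{i}" for i in range(n)]
    # Raw film fields arrive as strings, as they would from parsed HTML.
    raw_films = pd.DataFrame({
        "title": titles,
        "budget": [f"${b:,}" for b in budgets],
        "runtime": [f"{r} min" for r in runtimes],
    })
    # Noise-free linear target, chosen only to keep the example deterministic.
    counts = [1e-6 * b + 0.5 * r for b, r in zip(budgets, runtimes)]
    torrents = pd.DataFrame({"title": titles, "torrent_count": counts})
    # Drop the last five films to mimic titles missing from the torrent source.
    return raw_films, torrents.iloc[: n - 5]


def clean_films(raw):
    """Step 2 (transform): cast string fields such as '$1,000,000' to numbers."""
    films = raw.copy()
    films["budget"] = (
        films["budget"].str.replace(r"[$,]", "", regex=True).astype(float)
    )
    films["runtime"] = (
        films["runtime"].str.replace(" min", "", regex=False).astype(int)
    )
    return films


def build_table(films, torrents):
    """Step 3: join the film and torrent sources into one table on title."""
    return films.merge(torrents, on="title", how="inner")


def fit_model(table, features, target, seed=0):
    """Step 4: hold out a test split, fit a regression, return model and test R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        table[features], table[target], test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    raw_films, torrents = make_synthetic_data()
    table = build_table(clean_films(raw_films), torrents)
    model, r2 = fit_model(table, ["budget", "runtime"], "torrent_count")
    print(f"rows: {len(table)}, test R^2: {r2:.3f}")
```

The inner join keeps only films present in both sources, which seems a reasonable default when the target value is missing for unmatched titles. Step 5 would then iterate on `fit_model` with different features or estimators.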

Results
