A multivariate regression model using pre-release film data to estimate the number of torrent copies that might become available online.
As more media content (films, TV shows, music, etc.) moves online to streaming providers and other digital sources, pirating that content has become easier through the use of torrent sites. This investigation looks at which factors best predict the number of torrent copies that will become readily available online for a given movie.
Film Data
- The Numbers - The Numbers provides detailed movie financial analysis, including box office, DVD and Blu-ray sales reports, and release schedules
- OMDb API - The OMDb API is a free web service to obtain movie information
Torrent Data
- Kickass Torrents - KAT provided a directory for torrent files and magnet links to facilitate peer-to-peer file sharing using the BitTorrent protocol.
- Torrentz - Torrentz was a meta-search engine for BitTorrent that indexed torrents from various major torrent websites and offered compilations of various trackers.
- Note - The torrent data sources for this project have since been either seized or voluntarily shut down.
- Kickass Torrents was seized by the US Justice Department on July 20th, 2016.
- Torrentz was voluntarily shut down by its owners on August 5th, 2016.
- beautifulsoup4, requests - Retrieve and extract data sources from the web
- s3fs - Data storage in AWS (access available upon request)
- numpy, pandas - Clean and aggregate data sources
- scikit-learn - Estimator models
1. Crawl web pages to retrieve data sources
2. ETL - Extract data from web page requests, clean it and cast each field to its proper data type (int, string, etc.), then load it into S3 for long-term storage
3. Combine/join data sources into one table/multi-dimensional array
4. Train and test regression models to make future predictions
5. Tune/optimize model parameters and perform feature engineering, then repeat step 4
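
The cleaning, joining, and train/test steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline. The column names (`budget`, `theaters`, `torrent_count`), the `to_int` helper, and all data values are synthetic placeholders. `LinearRegression` is an assumed choice of estimator, since the source says only that scikit-learn estimator models were used.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def to_int(text):
    """Turn a scraped string such as '$1,250,000' into an int (hypothetical helper)."""
    return int(text.replace("$", "").replace(",", "").strip())


titles = [f"film_{i}" for i in range(8)]

# Synthetic stand-in for scraped film data; values are still raw strings.
films = pd.DataFrame({
    "title": titles,
    "budget": ["$1,000,000", "$5,500,000", "$12,000,000", "$20,000,000",
               "$35,000,000", "$50,000,000", "$80,000,000", "$150,000,000"],
    "theaters": ["120", "450", "1,200", "2,000", "2,800", "3,100", "3,600", "4,100"],
})

# Synthetic stand-in for scraped torrent counts per film.
torrents = pd.DataFrame({
    "title": titles,
    "torrent_count": [3, 8, 15, 22, 40, 55, 90, 160],
})

# Step 2: clean string fields and cast them to integers.
films["budget"] = films["budget"].map(to_int)
films["theaters"] = films["theaters"].map(to_int)

# Step 3: join the two sources into one table keyed on film title.
data = films.merge(torrents, on="title", how="inner")

# Step 4: hold out part of the data, fit on the rest, and score on the held-out rows.
X = data[["budget", "theaters"]]
y = data["torrent_count"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)  # R^2 on the held-out rows
```

Step 5 would then repeat the fit-and-score part with different features or estimator settings and compare the held-out scores.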
- Original presentation delivered on 07/15/2016