Problem: How profitable is a YouTube channel?
Solution: I defined a Python function to extract the data of a channel from the YouTube API, stored, processed and cleaned the data collected. After that, I estimated the channel revenue and did an exploratory data analysis to get insights from the data.
Result: The analysis showed that the channel raised more than 1500% in two years reaching a revenue from YouTube ads that could reach US$ 6 million.
During the Covid-19 pandemic and the social displacement adopted, some people increased their time watching educational and entertaining videos on YouTube, which caused a rapid growing of some channels during this period. In this project we will analyze how the YouTube chanell "Inventor 101" raised from near 100 subscribers to more than 4.5 million in 3 years. Inventor 101 is a DIY irish channel where the author publishes inventions and science experiments made by himself, every weeky. We will start extracting the data from the YouTube API and preprocessing/cleaning the data collected. After that, we will create data visualizations to get some insights and we will do a exploratory data analysis (EDA) to summarize their main characteristics. Finally, we will discuss the agregated value of our analysis to the Inventor 101 channel and other similar content.
- 1. Extracting and Cleaning the Data
- 2. Importing the Data
- 3. Getting Insights: Exploratory Data Analysis
- 4. Conclusion
- 5. References
We will start creating a function to request the data about the Inventor 101 channel from the YouTube API. Firstly, it is necessary to inform your YouTube API key, which can be obtained from Google Developers using your Google account. If you want to get your own key, follow the tutorial here. Also, the channel id, which appears at the end of the YouTube channel URL is needed. If you want to apply this function to analyze other channels, just change the channel id field.
To avoid the necesity of requesting data from API every time the notebook is executed, its appropriate to import the data from the csv file.
In this section we will do a exploratory data analysis (EDA) to summarize the main features of the data and calculate some basical statistical quantities. In the YouTube about page we can find that Inventor 01 is a DIY irish channel where the author publishes inventions and science experiments every weeky. Yet, it contains the following subscribers milestones:
- 100 - 01/4/19
- 5k - 26/4/19
- 50k - 03/7/19
- 100k - 13/11/19
- 200k - 23/1/20
- 300k - 04/2/20
Today (3/11/22) the channel have 4,54M subscribers, so we can conclude that the channel raised
First we will take a look at basical statistical quantities of the data set:
The above table inform us that the mean number of visualizations is $4.4$M and
The current minimum estimated revenue is: $ 541567.85 The current maximum estimated revenue is: $ 6498814.20
This plot illustrate that the number of views, likes and comments obeys the same pattern, these quantities are almost proportionals. It's a interesting question why the the fraction of likes and comments is approximately the same for all videos. This could be used to predict the number of likes a video will receive, given the expected number of visualizations.
The number of views, comments and likes are positive skewed i. e. a distribution where the mean, median, and mode of the distribution are positive rather than negative or zero. In this case data distribution occurs more on the one side of the scale with a long tail on the right side. It is also known as the right-skewed distribution, where the mean is generally to the right side of the data median. The cause of this skewness is that the views cannot be less than zero and some videos have a very high number of visualizations. The same occurs with likes and comments.
Scatter plots confirms the linear relationship between likes, views and comments like we stated above.
We can see that the revenue decreased from 2020 but the videos published in 2021 and 2022 are still being watched, so this income can grow up in the future.
This is the Pearson correlation coefficient matrix, with is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations normalized such that the result always has a value between −1 and 1. This measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations.
We can see that the number of likes, views, comments and the income are highly correlated with each other. As we stated before, this quantities are nearly proportional to each other.
In this project we buit a function to interact with an API to request data, starting from the request of a JSON file and storing the information in a Pandas data frame. We also explored some tools from the packages Seaborn and Pandas and extract informations and insights from data.
The data analysis showed that the Inventor 101 channel, raised more than