### The objective of this case study is to find insights through visualization and to provide guidance for anyone who wants to rank among the top YouTube channels.

We will be looking into the following:

Which channels are the most popular?

Which categories are the highest earning?

Which are the fastest-growing channel types?

In [1]:
# Importing the libraries

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

In [6]:
# Loading the csv into pandas DataFrame

df = pd.read_csv("/content/Global YouTube Statistics.csv", encoding = 'LATIN-1')

In [7]:
df.head()

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,...,2000000.0,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,...,,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,...,8000000.0,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,1000000.0,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,...,1000000.0,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288


In [8]:
# Getting an overview of the data

# df = df.convert_dtypes()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n

### Cleaning the dataset

In [None]:
# 1. Standardize the string case to lower

for i in df.columns :
    if df[i].dtype == 'O' :
        df[i] = df[i].str.lower()

df.head()

In [None]:
# 2. Drop duplicate rows

df.drop_duplicates(inplace=True)

df.shape

In [None]:
# 3. Finding missing values

df.isnull().sum() / len(df) * 100

`subscribers_for_last_30_days` has 33% of its values missing.

We also notice several features that are not useful for this analysis.

Dropping redundant features

In [None]:
df.drop(['Abbreviation',
         'video_views_for_the_last_30_days',
         'subscribers_for_last_30_days',
         'created_month',
         'created_date'],
          inplace=True, axis=1)

df.isnull().sum() / len(df) * 100

About 12% of the channels have `NaN` in the `Country` column.

The other country-dependent metrics also have ~12% missing values, which suggests that they are missing in the same rows.
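The assumption that these values are missing in the same rows can be checked directly rather than inferred from matching percentages. The sketch below is one possible way to do it; the helper name `missing_together` and the tiny `demo` frame are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd


def missing_together(frame, key, others):
    """Among rows where `key` is missing, return the share of rows in which
    each column in `others` is missing as well (1.0 = always missing together)."""
    key_missing = frame[key].isnull()
    return frame.loc[key_missing, others].isnull().mean()


# Tiny made-up frame for illustration only (not the real dataset)
demo = pd.DataFrame({
    "Country": ["india", None, "brazil", None],
    "Population": [1.37e9, np.nan, 2.1e8, np.nan],
    "Latitude": [20.6, np.nan, -14.2, 10.0],
})

shares = missing_together(demo, "Country", ["Population", "Latitude"])
print(shares)
```

On the notebook's `df`, the same helper could be called with `'Country'` as the key and columns such as `'Population'`, `'Unemployment rate'`, `'Urban_population'`, `'Latitude'` and `'Longitude'` as the others. A share of 1.0 for every column would support dropping all of these gaps by filtering on `Country` alone.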

In [None]:
df.drop(df[df['Country'].isnull()].index , inplace=True)

df.isnull().sum() / len(df) * 100

In [None]:
df.shape

**Looking at the difference between the columns 'channel_type' and 'category'.**

That might be useful, since both columns can be used for similar analyses.

In [None]:
# Looking at the difference between channel type and channel category

unique_categories = df['category'].unique()

unique_channel_types = df['channel_type'].unique()

print("Unique Categories:", unique_categories)

print("\n\nUnique Channel Types:", unique_channel_types)

In [None]:
print(df['category'].isnull().sum())

print(df['channel_type'].isnull().sum())

**We decide to use 'channel_type' in the analysis because it has fewer categories, so each category contains more channels to analyze.**
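A cross-tabulation is one way to see how much the two columns overlap before one of them is dropped. The sketch below uses only the five rows shown by `df.head()` earlier (lower-cased), so it illustrates the technique rather than the result for the full dataset; the variable names `sample`, `overlap` and `agreement` are made up for this example.

```python
import pandas as pd

# The five rows shown by df.head(), after lower-casing
sample = pd.DataFrame({
    "category": ["music", "film & animation", "entertainment", "education", "shows"],
    "channel_type": ["music", "games", "entertainment", "education", "entertainment"],
})

# Rows = category, columns = channel_type, cells = number of channels
overlap = pd.crosstab(sample["category"], sample["channel_type"])

# Share of channels where both columns carry the same label
agreement = (sample["category"] == sample["channel_type"]).mean()

print(overlap)
print(f"Same label in both columns: {agreement:.0%}")
```

Running the same two statements on `df['category']` and `df['channel_type']` would show which categories are folded into which channel types.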

In [None]:
df.drop('category', axis=1, inplace=True)

df.drop(df[df['channel_type'].isnull()].index , inplace=True)

df.isnull().sum()

In [None]:
df.describe()

**We use the average of highest yearly earnings and lowest yearly earnings as a proxy for yearly earnings**

In [None]:
# We don't have data for every year, so we average the lowest and highest yearly earnings estimates
# Create a new column 'yearly_earnings' with the average of 'lowest_yearly_earnings' and 'highest_yearly_earnings'
# (true division, so the midpoint is not rounded down)
df['yearly_earnings'] = (df['lowest_yearly_earnings'] + df['highest_yearly_earnings']) / 2

# Relationship Between Views and Subscribers
Here we will be looking at the relationship between views and subscribers

In [None]:
plt.figure(figsize=[14, 6])
# Create the scatter plot
sns.scatterplot(
    x=df["subscribers"],
    y=df["video views"],
    hue=df["channel_type"],  # Color-code points by the "channel_type" column
    size=df["video views"]   # Scale point sizes by the "video views" column
)

# Set labels and title
plt.xlabel("Subscribers")
plt.ylabel("Views")
plt.title("Relationship between Views and Subscribers")

# Show the plot
plt.show()

Views and subscribers appear to be linearly correlated, which seems reasonable and hints that the data could be reliable.
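The visual impression of a linear relationship can be backed by a number. The sketch below computes the Pearson coefficient (linear association) and, because a few very large channels can dominate it, the Spearman coefficient (rank association) as a cross-check. The helper name and the `demo` values are made up for illustration.

```python
import pandas as pd


def linear_and_rank_corr(x, y):
    """Return (Pearson r, Spearman rho) for two numeric Series."""
    pearson = x.corr(y)                 # Series.corr defaults to Pearson
    spearman = x.rank().corr(y.rank())  # Spearman = Pearson on the ranks
    return pearson, spearman


# Made-up numbers for illustration only (not the real dataset)
demo = pd.DataFrame({
    "subscribers": [10, 20, 30, 40, 50],
    "video views": [120, 190, 310, 405, 480],
})

pearson, spearman = linear_and_rank_corr(demo["subscribers"], demo["video views"])
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

On the notebook's data this would be called as `linear_and_rank_corr(df["subscribers"], df["video views"])`. Values close to 1 would support the reading of the scatter plot.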

**Scatterplot showing the relationship between earnings and subscribers**

In [None]:
plt.figure(figsize=[14, 6])
# Create the scatter plot
sns.scatterplot(
    x=df["subscribers"],
    y=df["yearly_earnings"],
    hue=df["channel_type"],      # Color-code points by the "channel_type" column
    size=df["yearly_earnings"]   # Scale point sizes by the "yearly_earnings" column
)

# Set labels and title
plt.xlabel("Subscribers")
plt.ylabel("Earnings")
plt.title("Relationship between Earnings and Subscribers")

# Show the plot
plt.show()

In [None]:
plt.figure(figsize=[14, 6])
# Create the scatter plot
sns.scatterplot(
    x=df["video views"],
    y=df["yearly_earnings"],
    hue=df["channel_type"],      # Color-code points by the "channel_type" column
    size=df["yearly_earnings"]   # Scale point sizes by the "yearly_earnings" column
)

# Set labels and title
plt.xlabel("Views")
plt.ylabel("Earnings")
plt.title("Relationship between Earnings and Views")

# Show the plot
plt.show()

# Distribution of channels by their types

In [None]:
# Count the channels of each type (value_counts already sorts in descending order)
channel_type = df['channel_type'].value_counts()

# Creating a pie chart using Matplotlib
plt.figure(figsize=(8, 8))
plt.pie(channel_type.values, labels=channel_type.index, autopct='%1.1f%%', textprops={'fontsize': 8})
plt.title("Distribution of channel types", fontsize=16)

# Show the plot
plt.show()

In [None]:
# Get the top 10 channel types
top_10_channel_types = df['channel_type'].value_counts().head(10)

# Create a countplot with seaborn
plt.figure(figsize=(10, 5))  # Adjust the figure size as per your preference
sns.countplot(
    data=df,
    x='channel_type',
    order=top_10_channel_types.index  # Order the categories based on their counts
)

# Add title and labels
plt.title('10 Most Popular Channel Types')
plt.xlabel('Channel Type')
plt.ylabel('Number of Channels')

# Rotate x-axis labels for better readability if needed
plt.xticks(rotation=45)

# Display the plot
plt.show()

# Category-Wise Average Subscribers Growth
We see that entertainment and music are popular, but which categories are growing the fastest on average?

In [None]:
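# Calculate channel age in years, relative to 2023.
# An age of 0 (channel created in 2023) is set to NaN to avoid dividing by zero below.
df['channel_age'] = (2023 - df['created_year']).replace(0, np.nan)

# Calculate average subscriber growth per year
df['annual_subscriber_growth'] = df['subscribers'] / df['channel_age']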

In [None]:
# Calculate the mean annual subscriber growth for each channel type
channel_type_growth = df.groupby('channel_type')['annual_subscriber_growth'].mean().reset_index()

channel_type_growth.sort_values(by='annual_subscriber_growth', ascending=False)

In [None]:
# Create the bar plot with Seaborn
plt.figure(figsize=(10, 6))  # Adjust the figure size as per your preference
sns.barplot(
    data=channel_type_growth,
    x='channel_type',
    y='annual_subscriber_growth',
)

# Add title and labels
plt.title('Average Annual Subscriber Growth by Channel Type')
plt.xlabel('Channel Type')
plt.ylabel('Annual Subscriber Growth')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Display the plot
plt.show()

# Average Yearly Earnings by channel_type

In [None]:
# Group by 'channel_type' and calculate the mean of yearly earnings for each type
category_earnings = df.groupby('channel_type')['yearly_earnings'].mean().reset_index()

# Sort the channel types by yearly earnings in descending order
sorted_categories_earnings = category_earnings.sort_values(by='yearly_earnings', ascending=False)

# Choose the top 10 channel types to display
top_categories_earnings = sorted_categories_earnings.head(10)

In [None]:
# Create the bar plot with Seaborn
plt.figure(figsize=(12, 6))  # Adjust the figure size as per your preference
sns.barplot(
    data=top_categories_earnings,
    x='channel_type',
    y='yearly_earnings',
)

# Add title and labels
plt.title('Average Yearly Earnings by Channel Type')
plt.xlabel('Channel Type')
plt.ylabel('Average Yearly Earnings')

# Display the plot
plt.show()

### Geographical Analysis
Now we can take another view of the world of top YouTube channels.

Literally talking about the world!

What interesting insights can we get by looking at the relationship between the location of a top-performing YouTube channel and its other characteristics?

Let's take a look at that.

In [None]:
df_geo_subscribers = df.groupby('Country')['subscribers'].mean().reset_index()

df_geo_subscribers.sort_values( 'subscribers', inplace=True, ascending=False )

df_geo_subscribers = df_geo_subscribers.head(20)

In [None]:
plt.figure(figsize=(15, 6))
sns.barplot(
    data=df_geo_subscribers,
    x='Country',
    y='subscribers')

# Add title and labels
plt.title('Average number of Subscribers by Country')
plt.xlabel('Country')
plt.ylabel('Average number of Subscribers')

# Rotate x-axis labels for better readability if needed
plt.xticks(rotation=90)

# Show the plot
plt.show()

In [None]:
import plotly.express as px

px.choropleth(df_geo_subscribers,
                    locations="Country",
                    locationmode='country names',
                    color='subscribers',
                    title="Average number of Subscribers by Country")

The comparison of average subscribers between countries shows no relevant difference between them among high-achieving YouTube channels.

However, we can get a more complete view of the situation by looking at more than just the mean.

By looking at the number of channels in each country, we can better assess how biased this dataset may be regarding the sample size for each country.
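Mean and sample size can also be placed side by side in one table, which makes it easier to see which country averages rest on only a handful of channels. The sketch below is one possible way to do this; the helper name `country_summary`, the `min_channels` threshold and the `demo` values are assumptions made for illustration.

```python
import pandas as pd


def country_summary(frame, min_channels=5):
    """Per-country mean, median and channel count, with a flag for small samples."""
    summary = (
        frame.groupby("Country")["subscribers"]
        .agg(mean_subscribers="mean", median_subscribers="median", channels="count")
        .reset_index()
    )
    # Flag countries whose average is based on fewer than `min_channels` rows
    summary["small_sample"] = summary["channels"] < min_channels
    return summary.sort_values("channels", ascending=False)


# Made-up numbers for illustration only (not the real dataset)
demo = pd.DataFrame({
    "Country": ["india", "india", "india", "japan"],
    "subscribers": [10, 20, 30, 100],
})

summary = country_summary(demo, min_channels=2)
print(summary)
```

In this toy example the country with the highest mean is also the one with a single channel, which is the kind of pattern the flag is meant to expose.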

In [None]:
df_geo_count = df.groupby('Country')['subscribers'].count().reset_index()
df_geo_count = df_geo_count.rename(columns = {'subscribers': "Count of channels"})

px.choropleth(df_geo_count,
                    locations="Country",
                    locationmode='country names',
                    color='Count of channels',
                    color_continuous_scale=px.colors.sequential.Emrld ,
                    title="Number of channels per country")

The US, Brazil and India contain the majority of the channels.

The other countries are represented, but only by a few entries each.
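The concentration read off the map can be expressed as a single number: the share of all channels that belong to the top few countries. The sketch below assumes a hypothetical helper `top_country_share` and uses made-up rows.

```python
import pandas as pd


def top_country_share(frame, n=3):
    """Share of all channels that come from the n countries with the most channels."""
    shares = frame["Country"].value_counts(normalize=True)  # sorted, largest first
    return shares.head(n).sum()


# Made-up rows for illustration only (not the real dataset)
demo = pd.DataFrame({
    "Country": ["united states"] * 4 + ["india"] * 3 + ["brazil"] * 2 + ["japan"],
})

share = top_country_share(demo, n=3)
print(f"Top 3 countries hold {share:.0%} of the channels")
```

Calling `top_country_share(df, n=3)` on the cleaned data would show how large the majority actually is.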

# Conclusion
- The data indicates that 'Animals' is the fastest-growing channel type.

- It is also the highest-earning channel type on average.

- But among the high-performing channels, the music and entertainment categories stand out as the most dominant ones.

- So 'Music' or 'Entertainment' would be a strong category to go with if aiming for a popular YouTube channel in the longer run.

- Our analysis revealed a notable concentration of channels in the dataset from the United States and India.

- The results could be affected by insufficient data, as we only have a few rows per category.

- Also, the dataset may not be fully representative of the diverse audience behaviors present in every different geographical location.
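The caveat about small groups can be made visible by reporting each group's size and the standard error of its mean next to the mean itself. The sketch below is a minimal illustration under that assumption; the helper name `growth_with_uncertainty` and the `demo` values are made up and do not come from the dataset.

```python
import pandas as pd


def growth_with_uncertainty(frame, value="annual_subscriber_growth"):
    """Mean, group size and standard error of the mean per channel type."""
    return (
        frame.groupby("channel_type")[value]
        .agg(["mean", "count", "sem"])
        .sort_values("mean", ascending=False)
    )


# Made-up numbers for illustration only (not the real dataset)
demo = pd.DataFrame({
    "channel_type": ["animals", "music", "music", "music", "music"],
    "annual_subscriber_growth": [9.0, 2.0, 4.0, 6.0, 8.0],
})

table = growth_with_uncertainty(demo)
print(table)
```

A channel type that leads the ranking with a very small `count`, or with a missing or large `sem`, should be read with more caution than one whose mean is based on many channels.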