Source: https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023


## Dataset Description:




| Field                                    | Description                                                         |
|------------------------------------------|---------------------------------------------------------------------|
| `rank`                                   | Position of the YouTube channel based on the number of subscribers  |
| `Youtuber`                               | Name of the YouTube channel                                         |
| `subscribers`                            | Number of subscribers to the channel                                |
| `video views`                            | Total views across all videos on the channel                        |
| `category`                               | Category or niche of the channel                                    |
| `Title`                                  | Title of the YouTube channel                                        |
| `uploads`                                | Total number of videos uploaded on the channel                      |
| `Country`                                | Country where the YouTube channel originates                        |
| `Abbreviation`                           | Abbreviation of the country                                         |
| `channel_type`                           | Type of the YouTube channel (e.g., individual, brand)               |
| `video_views_rank`                       | Ranking of the channel based on total video views                   |
| `country_rank`                           | Ranking of the channel based on the number of subscribers within its country |
| `channel_type_rank`                      | Ranking of the channel based on its type (individual or brand)      |
| `video_views_for_the_last_30_days`       | Total video views in the last 30 days                               |
| `lowest_monthly_earnings`                | Lowest estimated monthly earnings from the channel                  |
| `highest_monthly_earnings`               | Highest estimated monthly earnings from the channel                 |
| `lowest_yearly_earnings`                 | Lowest estimated yearly earnings from the channel                   |
| `highest_yearly_earnings`                | Highest estimated yearly earnings from the channel                  |
| `subscribers_for_last_30_days`           | Number of new subscribers gained in the last 30 days                |
| `created_year`                           | Year when the YouTube channel was created                           |
| `created_month`                          | Month when the YouTube channel was created                          |
| `created_date`                           | Exact date of the YouTube channel's creation                        |
| `Gross tertiary education enrollment (%)`| Percentage of the population enrolled in tertiary education in the country |
| `Population`                             | Total population of the country                                     |
| `Unemployment rate`                      | Unemployment rate in the country                                    |
| `Urban_population`                       | Percentage of the population living in urban areas                  |
| `Latitude`                               | Latitude coordinate of the country's location                       |
| `Longitude`                              | Longitude coordinate of the country's location                      |


# 1. Import Libraries

In [7]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from scipy import stats
%matplotlib inline

# 2. Read Dataset

In [8]:
df = pd.read_csv('/Users/dooinnkim/jupyter_notebook/2023_data_portfolio/youtube_2023/dataset.csv', encoding='latin1')
df.head()

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,...,2000000.0,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,...,,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,...,8000000.0,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,1000000.0,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,...,1000000.0,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288


# 3. Dataset Overview

## 3.1. Dataset Basic Information

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n

### Conclusion:

- The dataset contains **995 entries (rows)** and **28 columns**.


- The columns are of different data types:


    - integer (int64)
    - float (float64)
    - object (usually representing string or categorical data).
    
    
- The dataset contains some missing values. Columns with missing values include category, Country, Abbreviation, channel_type, video_views_rank, country_rank, channel_type_rank, video_views_for_the_last_30_days, subscribers_for_last_30_days, created_year, created_month, created_date, Gross tertiary education enrollment (%), Population, Unemployment rate, Urban_population, Latitude, and Longitude.

- Looking at the features, columns such as Population, Unemployment rate, and Urban_population appear irrelevant to this project.
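
The missing-value picture can also be tabulated directly instead of being read off `df.info()`. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `missing_report` name are illustrative assumptions, not part of the original notebook); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "category": ["Music", None, "Education", "Shows"],
    "Country": ["India", "United States", None, None],
})

# Count and percentage of missing values per column.
missing = toy.isna().sum()
missing_report = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
})

# Keep only the columns that actually have gaps.
missing_report = missing_report[missing_report["missing"] > 0]
print(missing_report)
```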


## 3.2. Summary Statistics for Numerical Values

In [19]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rank,995.0,498.0,287.3761,1.0,249.5,498.0,746.5,995.0
subscribers,995.0,22982410.0,17526110.0,12300000.0,14500000.0,17700000.0,24600000.0,245000000.0
video_views,995.0,11039540000.0,14110840000.0,0.0,4288145000.0,7760820000.0,13554700000.0,228000000000.0
uploads,995.0,9187.126,34151.35,0.0,194.5,729.0,2667.5,301308.0
video_views_rank,994.0,554248.9,1362782.0,1.0,323.0,915.5,3584.5,4057944.0
country_rank,879.0,386.0535,1232.245,1.0,11.0,51.0,123.0,7741.0
channel_type_rank,962.0,745.7193,1944.387,1.0,27.0,65.5,139.75,7741.0
video_views_for_the_last_30_days,939.0,175610300.0,416378200.0,1.0,20137500.0,64085000.0,168826500.0,6589000000.0
lowest_monthly_earnings,995.0,36886.15,71858.72,0.0,2700.0,13300.0,37900.0,850900.0
highest_monthly_earnings,995.0,589807.8,1148622.0,0.0,43500.0,212700.0,606800.0,13600000.0


Rank (995 entries): This represents the position of the YouTube channel based on subscribers. It ranges from 1 to 995, and the mean and median (50% mark) are both 498, which is consistent with each rank appearing exactly once.
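
The rank figures in the table match what a column holding each integer from 1 to 995 exactly once would produce. A quick check, assuming that is how the column is built (the `ranks` array below is synthetic, not read from the dataset):

```python
import numpy as np

# Ranks 1..995, each appearing exactly once (an assumption about the column).
ranks = np.arange(1, 996)

mean_rank = ranks.mean()
median_rank = np.median(ranks)
# pandas' describe() reports the sample standard deviation, hence ddof=1.
std_rank = ranks.std(ddof=1)

print(mean_rank, median_rank, round(std_rank, 4))
```

The mean and median both come out at 498, and the sample standard deviation agrees with the 287.3761 reported by `describe()`.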

Subscribers (995 entries): YouTube channels in this dataset have, on average, around 22.98 million subscribers. There's a significant variation (standard deviation of 17.53 million), and the values range from 12.3 million to a whopping 245 million!

Video Views (995 entries): The channels have an average of about 11 billion views, but the spread is substantial, ranging from 0 to 228 billion views. Some channels are way more popular than others!

Uploads (995 entries): On average, channels have uploaded around 9,187 videos. Some channels have not uploaded any videos (min 0), while some have uploaded up to 301,308!

Video Views Rank (994 entries): This represents the ranking of the channel based on video views. The values range widely from 1 to over 4 million, showing that there's a huge variation in how channels are ranked by views.

Country Rank (879 entries): This ranks channels within their countries, ranging from 1 to 7,741, with a mean of 386.05. It suggests a wide variation within countries too.

Channel Type Rank (962 entries): With values ranging from 1 to 7,741 and an average of 745.72, this describes the ranking of channels based on their types.

Video Views for the Last 30 Days (939 entries): Channels have an average of 175.6 million views in the last month. The data varies significantly, from just 1 view to 6.59 billion views.

Monthly Earnings (995 entries): The lowest estimated monthly earnings range from 0 to 850,900, and the highest estimated monthly earnings range from 0 to 13,600,000. That's a big range!

Yearly Earnings (995 entries): The lowest yearly earnings range from 0 to 10.2 million, and the highest range from 0 to 163.4 million.

Subscribers for Last 30 Days (658 entries): Channels have gained an average of 349,079 new subscribers in the last month. Some gained only 1 while others gained up to 8 million!

Created Year (990 entries): Most channels were created around the year 2012, with the oldest being from 1970 and the newest from 2022. Since YouTube launched in 2005, the 1970 value is most likely a data error.
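
Several columns above have a mean far above the median (for uploads, roughly 9,187 against 729), which points to a long right tail. A minimal sketch of how that could be quantified, using made-up values rather than the real data (the `toy` frame and the `mean_to_median` column are illustrative assumptions):

```python
import pandas as pd

# Made-up values: one heavy-tailed column and one evenly spaced column.
toy = pd.DataFrame({
    "uploads": [10, 200, 700, 750, 2600, 300000],
    "rank": [1, 2, 3, 4, 5, 6],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})

# A mean well above the median signals a long right tail.
summary["mean_to_median"] = summary["mean"] / summary["median"]
print(summary)
```

On the real data, columns with a large ratio or a large positive skew are the ones where the median is a more representative "typical" value than the mean.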

## 3.3. Summary Statistics for Categorical Values

In [20]:
df.describe(include='object')

Unnamed: 0,youtuber,category,title,country,abbreviation,channel_type,created_month
count,995,949,995,873,873,965,990
unique,995,18,992,49,49,14,12
top,T-Series,Entertainment,Preston,United States,US,Entertainment,Jan
freq,1,241,2,313,313,304,101


# 4. Data Cleaning

In this stage, we apply only the basic data cleaning techniques needed for EDA. Since we are not training a machine learning model, preprocessing steps such as treating null values and outliers are not handled here.

1. Standardization of headers
2. Drop unnecessary columns: Gross tertiary education enrollment (%), Population, Unemployment rate, Urban_population
3. Assign correct data types
4. Check for inconsistent values
5. Data extraction


## 4.1. Standardization of Headers

Header standardization unifies the format of the column names. The most common approach is to convert them to lower case and either remove white space or replace it with an underscore ('_').

In [10]:
def standard_header(x):
    return x.lower().replace(' ','_')

df.columns = [standard_header(col) for col in df.columns]



In [13]:
df.columns

Index(['rank', 'youtuber', 'subscribers', 'video_views', 'category', 'title',
       'uploads', 'country', 'abbreviation', 'channel_type',
       'video_views_rank', 'country_rank', 'channel_type_rank',
       'video_views_for_the_last_30_days', 'lowest_monthly_earnings',
       'highest_monthly_earnings', 'lowest_yearly_earnings',
       'highest_yearly_earnings', 'subscribers_for_last_30_days',
       'created_year', 'created_month', 'created_date',
       'gross_tertiary_education_enrollment_(%)', 'population',
       'unemployment_rate', 'urban_population', 'latitude', 'longitude'],
      dtype='object')
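
As a quick illustration of what this transformation does and does not change, the sketch below applies the same `standard_header` function to a few of the original header names. The `strict_header` variant is a hypothetical alternative, not part of the original notebook, that also strips punctuation such as `(%)`.

```python
import re

def standard_header(x):
    # Same transformation as in the notebook: lower-case, spaces to underscores.
    return x.lower().replace(' ', '_')

def strict_header(x):
    # Hypothetical stricter variant: collapse every run of non-alphanumeric
    # characters into one underscore and trim underscores at the ends.
    return re.sub(r'[^0-9a-z]+', '_', x.strip().lower()).strip('_')

samples = ['video views', 'Youtuber', 'Gross tertiary education enrollment (%)']
cleaned = [standard_header(col) for col in samples]
strict = [strict_header(col) for col in samples]
print(cleaned)
print(strict)
```

Note that `standard_header` keeps the parentheses and percent sign, so the resulting name must be quoted exactly when it is referenced later. Switching to a stricter variant would change that column name and any code that refers to it.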

## 4.2. Drop Columns
As discussed in the dataset overview, we will drop the following columns: 'gross_tertiary_education_enrollment_(%)', 'population', 'unemployment_rate', 'urban_population'

In [14]:
drop_columns = ['gross_tertiary_education_enrollment_(%)', 'population', 'unemployment_rate', 'urban_population']
df.drop(columns=drop_columns, inplace=True)
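
One practical caveat in a notebook: running the drop cell a second time raises a `KeyError`, because the columns are already gone. A minimal sketch of a re-runnable version on a toy frame (the `toy` data is made up for illustration), using the `errors="ignore"` option of `DataFrame.drop`:

```python
import pandas as pd

# Toy frame with only some of the columns named in the drop list.
toy = pd.DataFrame({
    "rank": [1, 2],
    "population": [1_366_418_000, 328_239_500],
    "unemployment_rate": [5.36, 14.7],
})

drop_columns = ["population", "unemployment_rate", "urban_population"]

# errors="ignore" skips labels that are not present, so the cell can be re-run.
toy = toy.drop(columns=drop_columns, errors="ignore")
toy = toy.drop(columns=drop_columns, errors="ignore")  # second run is a no-op
print(toy.columns.tolist())
```

The trade-off is that a misspelled column name is silently skipped, so it is worth checking `df.columns` afterwards.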

In [15]:
df.head()

Unnamed: 0,rank,youtuber,subscribers,video_views,category,title,uploads,country,abbreviation,channel_type,...,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,latitude,longitude
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,...,564600.0,9000000.0,6800000.0,108400000.0,2000000.0,2006.0,Mar,13.0,20.593684,78.96288
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,...,0.0,0.05,0.04,0.58,,2006.0,Mar,5.0,37.09024,-95.712891
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,...,337000.0,5400000.0,4000000.0,64700000.0,8000000.0,2012.0,Feb,20.0,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,493800.0,7900000.0,5900000.0,94800000.0,1000000.0,2006.0,Sep,1.0,37.09024,-95.712891
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,...,455900.0,7300000.0,5500000.0,87500000.0,1000000.0,2006.0,Sep,20.0,20.593684,78.96288


## 4.3. Assign Correct Data Types
Each column appears to have the correct data type except created_date, which is stored as a float and holds only the day of the month.

In [17]:
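# created_date holds only the day of the month, so combine it with created_year and created_month; rows with missing parts become NaT
df['created_date'] = pd.to_datetime(df['created_year'].astype('Int64').astype(str) + '-' + df['created_month'] + '-' + df['created_date'].astype('Int64').astype(str), format='%Y-%b-%d', errors='coerce')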


In [18]:
df.dtypes

rank                                         int64
youtuber                                    object
subscribers                                  int64
video_views                                float64
category                                    object
title                                       object
uploads                                      int64
country                                     object
abbreviation                                object
channel_type                                object
video_views_rank                           float64
country_rank                               float64
channel_type_rank                          float64
video_views_for_the_last_30_days           float64
lowest_monthly_earnings                    float64
highest_monthly_earnings                   float64
lowest_yearly_earnings                     float64
highest_yearly_earnings                    float64
subscribers_for_last_30_days               float64
created_year                   

Now all data types are correctly assigned.
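
A point worth checking when converting this column: `pd.to_datetime` reads bare numbers as offsets from 1970-01-01 (in nanoseconds by default), not as days of the month. The sketch below, on a made-up toy frame, contrasts that behaviour with building a full date from the year, month, and day columns. The `created_on` column name is an illustrative assumption, and the month abbreviations are assumed to be English (`Jan`, `Feb`, ...).

```python
import pandas as pd

toy = pd.DataFrame({
    "created_year": [2006.0, 2012.0, None],
    "created_month": ["Mar", "Feb", None],
    "created_date": [13.0, 20.0, None],
})

# A bare float is read as nanoseconds since 1970-01-01, not as a day of the month.
naive = pd.to_datetime(toy["created_date"], errors="coerce")

# Build "2006-Mar-13"-style strings, then parse them; incomplete rows become NaT.
parts = (
    toy["created_year"].astype("Int64").astype(str)
    + "-" + toy["created_month"]
    + "-" + toy["created_date"].astype("Int64").astype(str)
)
toy["created_on"] = pd.to_datetime(parts, format="%Y-%b-%d", errors="coerce")

print(naive.dt.year.tolist())
print(toy["created_on"].tolist())
```

Every non-missing value in `naive` lands in 1970, whereas `created_on` carries the intended calendar date.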

## 4.4. Inconsistent Data Values
We also need to check whether the categorical columns contain inconsistent values. Let's look at the unique values of each categorical feature.
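
One possible way to run such a check is sketched below on a made-up toy frame. The `has_variants` helper and the `text_columns` list are hypothetical names introduced for illustration; the helper flags a column when two values differ only by letter case or surrounding whitespace.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["Music", "music ", "Education", None],
    "country": ["India", "India", "United States", "United States"],
    "subscribers": [245, 170, 166, 162],
})

text_columns = ["category", "country"]

# List the distinct values of every text column.
for col in text_columns:
    print(col, toy[col].unique())

def has_variants(series):
    # True when values collide once case and surrounding whitespace are ignored.
    values = series.dropna().astype(str)
    return values.str.strip().str.lower().nunique() < values.nunique()

suspect = [col for col in text_columns if has_variants(toy[col])]
print(suspect)
```

Here only `category` is flagged, because "Music" and "music " collapse to the same value after normalization.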