<a href="https://colab.research.google.com/github/geonextgis/Mastering-Machine-Learning-and-GEE-for-Earth-Science/blob/main/01_Data_Gathering/01_Understanding_the_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Understanding the Data**
Understanding the data is a critical step in the data science process. It involves gaining insights into the structure, content, quality, and characteristics of the data you're working with. Properly understanding the data sets the foundation for making informed decisions, building accurate models, and deriving meaningful insights.

## **Import Required Libraries**

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
import pandas as pd

## **Read the Data**

In [6]:
df = pd.read_csv(r"/content/drive/MyDrive/Colab Notebooks/GitHub Repo/Mastering-Machine-Learning-and-GEE-for-Earth-Science/Datasets/Global YouTube Statistics.csv", encoding="latin")

## **How Big is the Data?**

In [7]:
df.shape

(995, 28)

## **What does the Data look like?**

In [8]:
# Print first 5 rows of the dataframe
df.head()

,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,...,2000000.0,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,...,,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,...,8000000.0,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,1000000.0,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,...,1000000.0,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288


In [9]:
# Randomly choose 5 rows and print them
df.sample(5)

,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
560,561,Jordan Matter,16600000,5819509000.0,Entertainment,Jordan Matter,413,United States,US,Entertainment,...,300000.0,2006.0,Dec,21.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
831,832,Acenix,13600000,2122062000.0,Gaming,Acenix,368,Spain,ES,Games,...,200000.0,2014.0,Jan,2.0,88.9,47076780.0,13.96,37927409.0,40.463667,-3.74922
844,845,Sanjoy Das Official,13500000,3912334000.0,Entertainment,Sanjoy Das Official,1793,India,IN,Games,...,400000.0,2015.0,Apr,29.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
275,276,That Little Puff,23700000,20289690000.0,Pets & Animals,That Little Puff,769,United States,US,Animals,...,1100000.0,2020.0,Aug,29.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
710,711,Major Lazer Official,14800000,9383431000.0,Music,MajorLazerOfficial,0,,,,...,,2013.0,Jun,14.0,,,,,,
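
Because `df.sample()` draws different rows on every run, the rows above will not be reproduced exactly when the cell is re-executed. A minimal sketch of a repeatable draw is shown here. It uses a small toy frame (values copied from the first rows above, not the full dataset) and an arbitrary seed of 42 passed through the `random_state` parameter:

```python
import pandas as pd

# Toy stand-in for the real dataframe: five rows taken from the head() output.
toy = pd.DataFrame({
    "Youtuber": ["T-Series", "YouTube Movies", "MrBeast",
                 "Cocomelon - Nursery Rhymes", "SET India"],
    "subscribers": [245000000, 170000000, 166000000, 162000000, 159000000],
})

# The same seed yields the same rows on every call (42 is an arbitrary choice).
first_draw = toy.sample(3, random_state=42)
second_draw = toy.sample(3, random_state=42)

print(first_draw)
print(first_draw.equals(second_draw))
```

On the real data the equivalent call would be `df.sample(5, random_state=42)`.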


## **What are the Data Types of the Columns?**

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n
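
Once the dtypes are known, it is often convenient to split the columns into numeric and non-numeric groups before further analysis. The sketch below does this with `select_dtypes` on a toy frame that reuses a few values from the first rows shown earlier. The frame and the variable names `numeric_cols` and `text_cols` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for the real dataframe (three rows from the head() output).
sample = pd.DataFrame({
    "Youtuber": ["T-Series", "YouTube Movies", "MrBeast"],
    "subscribers": [245000000, 170000000, 166000000],
    "video views": [2.28e11, 0.0, 2.836884e10],
    "Country": ["India", "United States", "United States"],
})

# Columns with an integer or float dtype.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else (text columns in this example).
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```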

## **Are there any Missing Values?**

In [11]:
# Checking the number of missing values for each column
df.isnull().sum()

rank                                         0
Youtuber                                     0
subscribers                                  0
video views                                  0
category                                    46
Title                                        0
uploads                                      0
Country                                    122
Abbreviation                               122
channel_type                                30
video_views_rank                             1
country_rank                               116
channel_type_rank                           33
video_views_for_the_last_30_days            56
lowest_monthly_earnings                      0
highest_monthly_earnings                     0
lowest_yearly_earnings                       0
highest_yearly_earnings                      0
subscribers_for_last_30_days               337
created_year                                 5
created_month                                5
created_date 
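
Raw counts are easier to compare when expressed as a share of all rows. One possible way to do that, sketched on a toy frame built from rows shown earlier (not the full dataset), is to take the mean of the boolean mask returned by `isnull()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; NaN marks the values missing in the
# rows displayed earlier.
toy = pd.DataFrame({
    "Youtuber": ["T-Series", "YouTube Movies", "MrBeast", "Major Lazer Official"],
    "Country": ["India", "United States", "United States", np.nan],
    "subscribers_for_last_30_days": [2000000.0, np.nan, 8000000.0, np.nan],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

print(missing_pct)
```

On the real data the same expression would be `df.isnull().mean().mul(100)`.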

## **What does the Data look like Statistically?**

In [12]:
df.describe()

,rank,subscribers,video views,uploads,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,...,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
count,995.0,995.0,995.0,995.0,994.0,879.0,962.0,939.0,995.0,995.0,...,995.0,658.0,990.0,990.0,872.0,872.0,872.0,872.0,872.0,872.0
mean,498.0,22982410.0,11039540000.0,9187.125628,554248.9,386.05347,745.719335,175610300.0,36886.148281,589807.8,...,7081814.0,349079.1,2012.630303,15.746465,63.627752,430387300.0,9.279278,224215000.0,26.632783,-14.128146
std,287.37606,17526110.0,14110840000.0,34151.352254,1362782.0,1232.244746,1944.386561,416378200.0,71858.724092,1148622.0,...,13797040.0,614355.4,4.512503,8.77752,26.106893,472794700.0,4.888354,154687400.0,20.560533,84.760809
min,1.0,12300000.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,1.0,1970.0,1.0,7.6,202506.0,0.75,35588.0,-38.416097,-172.104629
25%,249.5,14500000.0,4288145000.0,194.5,323.0,11.0,27.0,20137500.0,2700.0,43500.0,...,521750.0,100000.0,2009.0,8.0,36.3,83355410.0,5.27,55908320.0,20.593684,-95.712891
50%,498.0,17700000.0,7760820000.0,729.0,915.5,51.0,65.5,64085000.0,13300.0,212700.0,...,2600000.0,200000.0,2013.0,16.0,68.0,328239500.0,9.365,270663000.0,37.09024,-51.92528
75%,746.5,24600000.0,13554700000.0,2667.5,3584.5,123.0,139.75,168826500.0,37900.0,606800.0,...,7300000.0,400000.0,2016.0,23.0,88.2,328239500.0,14.7,270663000.0,37.09024,78.96288
max,995.0,245000000.0,228000000000.0,301308.0,4057944.0,7741.0,7741.0,6589000000.0,850900.0,13600000.0,...,163400000.0,8000000.0,2022.0,31.0,113.1,1397715000.0,14.72,842934000.0,61.92411,138.252924


## **Are there any Duplicate Rows?**

In [13]:
df.duplicated().sum()

0
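
Note that `df.duplicated()` compares entire rows, so two records that differ only in an identifier-like column are not flagged. A hedged sketch of checking a subset of columns instead is given below. The toy frame and its channel names are made up purely for illustration and do not come from the dataset:

```python
import pandas as pd

# Hypothetical data: "Channel A" appears twice with different ranks.
toy = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "Youtuber": ["Channel A", "Channel B", "Channel A", "Channel C"],
})

# No full-row duplicates, because the rank differs in every row.
full_row_dupes = toy.duplicated().sum()

# Restrict the comparison to one column; keep=False marks every occurrence.
name_dupes = toy.duplicated(subset=["Youtuber"], keep=False)

print(full_row_dupes)
print(toy[name_dupes])
```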

## **How are the Columns Correlated?**

In [14]:
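# Compute the pairwise correlation between all numeric columns
# (numeric_only=True is required in pandas >= 2.0 when text columns are present)
df.corr(numeric_only=True)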

,rank,subscribers,video views,uploads,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,...,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
rank,1.0,-0.640608,-0.453363,-0.051036,-0.059455,0.016776,-0.029554,-0.186339,-0.248394,-0.24805,...,-0.248392,-0.188571,0.106025,-0.006256,-0.037491,-0.025475,-0.01486,-0.038807,3.6e-05,0.019003
subscribers,-0.640608,1.0,0.750958,0.077136,0.057202,0.032683,0.027393,0.278846,0.388941,0.388579,...,0.388935,0.309527,-0.141827,-0.011836,-0.006804,0.082219,-0.008251,0.083521,0.01945,0.022443
video views,-0.453363,0.750958,1.0,0.165928,-0.061807,-0.068277,-0.050194,0.361856,0.552096,0.551455,...,0.552091,0.187384,-0.127068,-0.03818,-0.015232,0.080214,-0.000729,0.076649,0.037334,0.031268
uploads,-0.051036,0.077136,0.165928,1.0,-0.108988,-0.078394,-0.09845,0.101521,0.166922,0.167283,...,0.166904,0.008933,-0.154904,0.0349,-0.218396,0.143122,-0.188101,0.072807,-0.067868,0.233169
video_views_rank,-0.059455,0.057202,-0.061807,-0.108988,1.0,0.877504,0.949936,-0.067193,-0.208863,-0.208935,...,-0.208851,-0.167295,0.006671,0.031231,0.046934,-0.103178,-0.029276,-0.122747,0.015932,-0.016492
country_rank,0.016776,0.032683,-0.068277,-0.078394,0.877504,1.0,0.898442,-0.098737,-0.148947,-0.14896,...,-0.148946,-0.126175,-0.037807,-0.012699,0.10329,-0.053181,0.066697,-0.024578,0.048323,-0.072476
channel_type_rank,-0.029554,0.027393,-0.050194,-0.09845,0.949936,0.898442,1.0,-0.129051,-0.187908,-0.18797,...,-0.187896,-0.154021,-0.014002,0.038299,0.062484,-0.116254,0.003697,-0.123852,0.010195,-0.055144
video_views_for_the_last_30_days,-0.186339,0.278846,0.361856,0.101521,-0.067193,-0.098737,-0.129051,1.0,0.68033,0.680289,...,0.68033,0.451523,0.053123,-0.01367,-0.03561,0.053859,-0.002323,0.051126,-0.026864,0.049033
lowest_monthly_earnings,-0.248394,0.388941,0.552096,0.166922,-0.208863,-0.148947,-0.187908,0.68033,1.0,0.999955,...,0.999998,0.67936,0.072316,-0.040269,-0.06219,0.104812,-0.042874,0.081206,0.006583,0.100379
highest_monthly_earnings,-0.24805,0.388579,0.551455,0.167283,-0.208935,-0.14896,-0.18797,0.680289,0.999955,1.0,...,0.999953,0.679699,0.072289,-0.039959,-0.061973,0.104785,-0.042627,0.081226,0.006873,0.100299


In [15]:
# Extract the correlation between 'subscribers' and the other numeric columns
df.corr(numeric_only=True)["subscribers"]

rank                                      -0.640608
subscribers                                1.000000
video views                                0.750958
uploads                                    0.077136
video_views_rank                           0.057202
country_rank                               0.032683
channel_type_rank                          0.027393
video_views_for_the_last_30_days           0.278846
lowest_monthly_earnings                    0.388941
highest_monthly_earnings                   0.388579
lowest_yearly_earnings                     0.389072
highest_yearly_earnings                    0.388935
subscribers_for_last_30_days               0.309527
created_year                              -0.141827
created_date                              -0.011836
Gross tertiary education enrollment (%)   -0.006804
Population                                 0.082219
Unemployment rate                         -0.008251
Urban_population                           0.083521
Latitude    
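
A correlation column is easier to read when it is ordered by strength rather than by column position. The following sketch ranks the correlations with `subscribers` by absolute value. It runs on a toy frame built from the five rows shown by `head()`, so its numbers will differ from the full-dataset output above, and the name `ranked` is an illustrative assumption:

```python
import pandas as pd

# Toy stand-in for the real dataframe (first five rows of the dataset).
toy = pd.DataFrame({
    "category": ["Music", "Film & Animation", "Entertainment", "Education", "Shows"],
    "subscribers": [245000000, 170000000, 166000000, 162000000, 159000000],
    "video views": [2.28e11, 0.0, 2.836884e10, 1.64e11, 1.48e11],
    "uploads": [20082, 1, 741, 966, 116536],
})

ranked = (
    toy.corr(numeric_only=True)["subscribers"]  # ignore the text column
    .drop("subscribers")                        # remove the trivial self-correlation
    .sort_values(key=lambda s: s.abs(), ascending=False)  # strongest first
)

print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as the one with `rank` in the full output, near the top of the list.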