### Exploratory Data Analysis: % of individuals using the internet

Years: 2000, 2005, 2010, 2015, 2019, 2020, 2021

Data source: [UN](https://data.un.org/)

Data URL: [Internet Usage Data](https://data.un.org/_Docs/SYB/CSV/SYB66_314_202310_Internet%20Usage.csv)

Objective: Use quantitative and graphical methods to understand the data, detect anomalies, identify patterns, and draw preliminary insights.

1) Understand the dataset.
2) Define analytics objectives or questions.
3) Understand the attributes/variables, their types, relevance and significance, and their relationships.
4) Prepare and clean the dataset in readiness for analytics.
5) Perform exploratory data analysis.
6) Communicate insights.

### Preliminary Questions

- How has the global percentage of individuals using the internet changed over time?
- Which regions or countries have shown the most significant increase in internet usage?
- Are there any noticeable patterns or trends in internet adoption rates across different geographical areas?
- What is the average internet usage percentage for all countries?
- How does internet usage compare between developed and developing countries?
- Are there any countries or regions that show unusually high or low internet adoption rates compared to their neighbors or global averages?
- What is the rate of growth in internet usage for different time periods (e.g., 2000-2005, 2005-2010, 2010-2015)?
- Is there a correlation between a country's economic status and its internet usage percentage?
- How does the distribution of internet usage percentages change over the years?
- Are there any notable outliers in terms of internet usage, and what factors might contribute to their exceptional status?
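
The question about growth across time periods can be made concrete with a small sketch. It assumes the long layout used in this notebook (one row per year, with a `value` column) and reuses the global totals for 2000 to 2019 that appear in the dataset preview under Data Loading. The helper `growth_by_period` and its output column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Global totals ("Total, all countries or areas") copied from the dataset preview.
global_usage = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015, 2019],
    "value": [5.3, 15.6, 28.5, 40.0, 53.7],
})


def growth_by_period(df: pd.DataFrame) -> pd.DataFrame:
    """Change in percentage points between consecutive observation years.

    Hypothetical helper: expects `year` and `value` columns for a single
    country or region.
    """
    out = df.sort_values("year").reset_index(drop=True)
    result = pd.DataFrame({
        "start_year": out["year"].shift(1),
        "end_year": out["year"],
        # absolute change, in percentage points
        "pp_change": out["value"].diff(),
        # periods differ in length (5 vs. 4 years), so also normalise per year
        "pp_per_year": out["value"].diff() / out["year"].diff(),
    })
    # the first year has no predecessor, so drop it
    result = result.iloc[1:].reset_index(drop=True)
    return result.astype({"start_year": int})


growth = growth_by_period(global_usage)
print(growth.round(2))
```

Reporting the change in percentage points, rather than as a relative percentage, keeps early years with very small bases from dominating the comparison; either definition could be used, depending on the question being asked.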

### Python Libraries

In [61]:
# import packages
import pandas as pd

### Data Loading

In [62]:
# create a list of column names for the dataset
dataset_columns = ["idx", "region_country_area", "year", "series", "value", "footnotes", "source"]

# read the dataset into a pandas dataframe; skiprows=1 drops the file's first
# line so that the second line is parsed as the header row
internet_usage = pd.read_csv('InternetUsage.csv', skiprows=1, encoding='ISO-8859-1')

# set the column names
internet_usage.columns = dataset_columns

# display the first 5 rows of the dataset
internet_usage.head()

|   | idx | region_country_area | year | series | value | footnotes | source |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Total, all countries or areas | 2000 | Percentage of individuals using the internet | 5.3 |  | International Telecommunication Union (ITU), G... |
| 1 | 1 | Total, all countries or areas | 2005 | Percentage of individuals using the internet | 15.6 |  | International Telecommunication Union (ITU), G... |
| 2 | 1 | Total, all countries or areas | 2010 | Percentage of individuals using the internet | 28.5 |  | International Telecommunication Union (ITU), G... |
| 3 | 1 | Total, all countries or areas | 2015 | Percentage of individuals using the internet | 40.0 |  | International Telecommunication Union (ITU), G... |
| 4 | 1 | Total, all countries or areas | 2019 | Percentage of individuals using the internet | 53.7 |  | International Telecommunication Union (ITU), G... |


### Check datatypes and identify useful columns 

<pre>
Attribute               Description
idx                     int64: numeric code identifying the country, region, or area
region_country_area     text: name of the country, region, area, or group of countries
year                    int64: the year the value represents
series                  text: descriptive column
value                   float64: percentage of individuals using the internet
footnotes               text: descriptive column
source                  text: descriptive column
</pre>


<pre>
Total records: 1528
Significant fields: idx, region_country_area, year, value
</pre>

In [63]:
internet_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1528 entries, 0 to 1527
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   idx                  1528 non-null   int64  
 1   region_country_area  1528 non-null   object 
 2   year                 1528 non-null   int64  
 3   series               1528 non-null   object 
 4   value                1528 non-null   float64
 5   footnotes            966 non-null    object 
 6   source               1528 non-null   object 
dtypes: float64(1), int64(2), object(4)
memory usage: 83.7+ KB


### Drop columns not useful for analysis

In [64]:
internet_usage.drop(columns=['series', 'footnotes', 'source'], inplace=True)

internet_usage.head()

|   | idx | region_country_area | year | value |
|---|---|---|---|---|
| 0 | 1 | Total, all countries or areas | 2000 | 5.3 |
| 1 | 1 | Total, all countries or areas | 2005 | 15.6 |
| 2 | 1 | Total, all countries or areas | 2010 | 28.5 |
| 3 | 1 | Total, all countries or areas | 2015 | 40.0 |
| 4 | 1 | Total, all countries or areas | 2019 | 53.7 |


### Drop the LDC§ records, which are not needed for this analysis

In [65]:
# show the LDC§ records (idx == 199), then drop them from the dataset

print(internet_usage[internet_usage['idx'] == 199])

ldc_index = internet_usage[internet_usage['idx'] == 199].index
internet_usage.drop(ldc_index, inplace=True)

# check if the LDC records have been dropped
if not internet_usage[internet_usage['idx'] == 199].empty:
    print("LDC records still exist")
else:
    print("LDC records have been dropped")


      idx region_country_area  year  value
1521  199                LDC§  2000    0.4
1522  199                LDC§  2005    0.7
1523  199                LDC§  2010    3.3
1524  199                LDC§  2015   10.7
1525  199                LDC§  2019   23.5
1526  199                LDC§  2020   27.6
1527  199                LDC§  2021   31.2
LDC records have been dropped


### Check for missing values 

In [66]:
internet_usage.isnull().sum()

idx                    0
region_country_area    0
year                   0
value                  0
dtype: int64
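
A null count of zero does not rule out other data-quality problems. The sketch below shows two further checks that fit the stated goal of detecting anomalies: percentages outside the 0 to 100 range, and repeated `(idx, year)` pairs. It assumes the notebook's column names; `sanity_report` is a hypothetical helper, and the toy rows reuse the global totals from the preview so that the block runs on its own.

```python
import pandas as pd

# Toy frame in the notebook's layout; values are the global totals from the preview.
sample = pd.DataFrame({
    "idx": [1, 1, 1],
    "region_country_area": ["Total, all countries or areas"] * 3,
    "year": [2000, 2005, 2010],
    "value": [5.3, 15.6, 28.5],
})


def sanity_report(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count rows that break basic expectations."""
    return {
        # a percentage should lie in [0, 100]; NaN also fails this test
        "out_of_range": int((~df["value"].between(0, 100)).sum()),
        # each (idx, year) pair should occur only once
        "duplicate_idx_year": int(df.duplicated(subset=["idx", "year"]).sum()),
    }


report = sanity_report(sample)
print(report)
```

On the notebook's data the call would be `sanity_report(internet_usage)`; any non-zero count would point to rows worth inspecting before further analysis.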

### Split the data into region-level and country-level records

In [67]:
region_values = [
    "Total, all countries or areas", "Northern Africa", "Sub-Saharan Africa", 
    "Eastern Africa", "Middle Africa", "Southern Africa", "Western Africa", 
    "Northern America", "Latin America & the Caribbean", "Caribbean", 
    "Central Asia", "Eastern Asia", "South-central Asia", 
    "South-eastern Asia", "Southern Asia", "Western Asia", 
    "Europe", "Oceania", "Australia and New Zealand", "Micronesia"
]

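# boolean mask: True for rows whose label is one of the aggregate regions
is_region = internet_usage['region_country_area'].isin(region_values)

# .copy() makes each subset an independent dataframe, so later edits to one
# subset do not trigger pandas' SettingWithCopyWarning
region_internet_usage = internet_usage[is_region].copy()

country_internet_usage = internet_usage[~is_region].copy()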

# print shapes of the region and country dataframes

print("Region dataframe shape: ", region_internet_usage.shape)
print("Country dataframe shape: ", country_internet_usage.shape)



Region dataframe shape:  (139, 4)
Country dataframe shape:  (1382, 4)
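
Because the split relies on exact string matches, a label in `region_values` that is misspelled or absent from the data matches nothing, and any rows for that aggregate stay in the country dataframe unnoticed. The sketch below reports such labels. `unmatched_labels` is a hypothetical helper, "Country A" is a placeholder name, and "Europ" is a deliberate misspelling used to show the check firing.

```python
import pandas as pd


def unmatched_labels(df: pd.DataFrame, labels: list,
                     column: str = "region_country_area") -> list:
    """Hypothetical helper: return the labels that match no row in `column`."""
    present = set(df[column].unique())
    return [label for label in labels if label not in present]


# Toy data in the notebook's layout; "Country A" is a placeholder, not a real entry.
toy = pd.DataFrame({
    "region_country_area": ["Total, all countries or areas", "Europe", "Country A"],
    "year": [2000, 2000, 2000],
})

# "Europ" is misspelled on purpose, so it should be reported.
toy_labels = ["Total, all countries or areas", "Europe", "Europ"]

missing = unmatched_labels(toy, toy_labels)
print(missing)
```

On the notebook's data the call would be `unmatched_labels(internet_usage, region_values)`; an empty list would mean that every label in the list occurs at least once.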


In [68]:
internet_usage.describe()

|   | idx | year | value |
|---|---|---|---|
| count | 1521.0 | 1521.0 | 1521.0 |
| mean | 391.798817 | 2012.477975 | 42.882643 |
| std | 262.001806 | 7.52721 | 32.444772 |
| min | 1.0 | 2000.0 | 0.0 |
| 25% | 156.0 | 2005.0 | 10.1 |
| 50% | 384.0 | 2015.0 | 40.0 |
| 75% | 620.0 | 2020.0 | 73.4 |
| max | 894.0 | 2021.0 | 100.0 |
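
The summary above pools all years and includes `idx`, an identifier whose mean and quartiles carry no meaning. A per-year summary of `value` alone is closer to the question of how the distribution changes over time. The following is a minimal sketch under the same long layout; its rows are the global-total and LDC§ rows printed earlier in this notebook, reused only so that the example runs on its own.

```python
import pandas as pd

# Rows copied from the notebook's printed output (global totals and LDC§).
sample = pd.DataFrame({
    "region_country_area": ["Total, all countries or areas"] * 5 + ["LDC§"] * 5,
    "year": [2000, 2005, 2010, 2015, 2019] * 2,
    "value": [5.3, 15.6, 28.5, 40.0, 53.7, 0.4, 0.7, 3.3, 10.7, 23.5],
})

# One row of summary statistics per year, for the value column only.
yearly_summary = sample.groupby("year")["value"].describe()
print(yearly_summary[["count", "mean", "min", "max"]])
```

On the notebook's data, `country_internet_usage.groupby("year")["value"].describe()` would give the same kind of table for countries only, which avoids mixing regional aggregates with individual countries.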
