### File check
First, I check that the data_raw/ directory is set up correctly and contains the raw data downloaded from the sources.

In [None]:
from pathlib import Path 

DATA_RAW = Path("..") / "data_raw"  # path to the data_raw/ folder, relative to the notebook

print("data_raw exists:", DATA_RAW.exists())  # checks whether the folder exists
print("files:", [p.name for p in DATA_RAW.glob("*")])  # lists everything in data_raw/

data_raw exists: True
files: ['DT.csv', 'numberofinternetusers new.csv', 'README.md', 'social_media_vs_productivity.csv', 'Time-Wasters on Social Media.csv', 'worldbank_IT.NET.USER.ZS.json', 'worldbank_IT.NET.USER.ZS_sample.csv']
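
The listing above only shows what is in the folder. A minimal sketch of an explicit check against the CSV files loaded later in this notebook could look like the following. The helper name `find_missing_files` and the `EXPECTED_FILES` list are my own assumptions, and the demo runs on a temporary folder rather than on the real data_raw/.

```python
from pathlib import Path
import tempfile

# File names taken from the listing above (assumed to be the required inputs).
EXPECTED_FILES = [
    "DT.csv",
    "numberofinternetusers new.csv",
    "social_media_vs_productivity.csv",
    "Time-Wasters on Social Media.csv",
]

def find_missing_files(folder: Path, expected: list[str]) -> list[str]:
    """Return the expected file names that are not present in folder."""
    present = {p.name for p in folder.glob("*") if p.is_file()}
    return [name for name in expected if name not in present]

# Demo on a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "DT.csv").write_text("Year\n2012\n")
    demo_missing = find_missing_files(demo_dir, EXPECTED_FILES)

print("missing:", demo_missing)
```

In the notebook itself the call would be `find_missing_files(DATA_RAW, EXPECTED_FILES)`, and an empty list would mean that every expected file is in place.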


### Loading the CSV files

I define a helper function inspect_df(df, name, head_n=5) for quickly checking the loaded tables.

For a given DataFrame, the function prints:
- the name of the dataset (name),
- the dimensions (shape: number of rows and columns),
- the list of columns,
- the first few rows (head),
- the data types (dtypes),
- the number of missing values per column (only columns that have missing values),
- a basic numeric summary (describe) for the numeric columns

In [None]:
import pandas as pd

def inspect_df(df: pd.DataFrame, name: str, head_n: int = 5):  # prints basic information about a DataFrame
    print(f"DATASET: {name}") 
    print("-" * (9 + len(name)))
    print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns") 

    print("\nColumns:")
    for c in df.columns:
        print(f" - {c}")

    print("\nPreview (head):")
    display(df.head(head_n))

    print("\nDtypes:")
    display(df.dtypes)

    missing = df.isna().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    print("\nMissing values (only columns with missing):")
    if len(missing) == 0:
        print(" - None")
    else:
        display(missing.to_frame("missing_count"))

    print("\nNumeric summary (describe):")
    display(df.select_dtypes(include="number").describe().T)
   

## 1) DT – Time the Internet
An overview of the basic structure and variables (year and average daily time).

- I load `DT.csv` from `data_raw/`.
- I define `PROJECT_ROOT` and `DATA_RAW` so that paths resolve correctly from the notebook.
- `pd.read_csv(...)` loads the data into the DataFrame `dt`.
- `inspect_df(...)` prints the basic information (shape, columns, types, missing values, preview).


In [3]:
PROJECT_ROOT = Path.cwd().parent 
DATA_RAW = PROJECT_ROOT / "data_raw"

dt = pd.read_csv(DATA_RAW / "DT.csv")
inspect_df(dt, "DT - Time the Internet")

DATASET: DT - Time the Internet
-------------------------------
Shape: 13 rows x 2 columns

Columns:
 - Year
 - Daily Time (Hours:Minutes)

Preview (head):


Unnamed: 0,Year,Daily Time (Hours:Minutes)
0,2012,01:30
1,2013,01:35
2,2014,01:44
3,2015,01:51
4,2016,02:08



Dtypes:


Year                           int64
Daily Time (Hours:Minutes)    object
dtype: object


Missing values (only columns with missing):
 - None

Numeric summary (describe):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,13.0,2018.0,3.89444,2012.0,2015.0,2018.0,2021.0,2024.0
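
The dtypes above show that `Daily Time (Hours:Minutes)` is loaded as `object`, so it does not appear in the numeric summary. A minimal sketch of converting it to minutes follows. The helper `hhmm_to_minutes` and the column name `daily_minutes` are assumptions of mine, and the demo frame only reproduces the five preview rows.

```python
import pandas as pd

def hhmm_to_minutes(value: str) -> int:
    """Convert an 'HH:MM' string such as '01:30' to total minutes."""
    hours, minutes = value.strip().split(":")
    return int(hours) * 60 + int(minutes)

# First rows of DT.csv as shown in the preview above.
dt_demo = pd.DataFrame({
    "Year": [2012, 2013, 2014, 2015, 2016],
    "Daily Time (Hours:Minutes)": ["01:30", "01:35", "01:44", "01:51", "02:08"],
})

# New numeric column (hypothetical name) that describe() can summarise.
dt_demo["daily_minutes"] = dt_demo["Daily Time (Hours:Minutes)"].map(hhmm_to_minutes)
print(dt_demo)
```

The same `map` call should work on the full `dt` frame, provided every value follows the `HH:MM` pattern.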


## 2) Social Media vs Productivity 

Data on habits, notifications, working hours and (self-assessed/actual) productivity.

I load the dataset social_media_vs_productivity.csv from data_raw/:

- pd.read_csv(...) loads the data into smp
- inspect_df(smp, "Social Media vs Productivity") gives a quick overview of the structure

This dataset contains individual-level variables related to habits and productivity (e.g. time spent on social media and productivity scores).


In [4]:
PROJECT_ROOT = Path.cwd().parent  
DATA_RAW = PROJECT_ROOT / "data_raw"

smp = pd.read_csv(DATA_RAW / "social_media_vs_productivity.csv")
inspect_df(smp, "Social Media vs Productivity")


DATASET: Social Media vs Productivity
-------------------------------------
Shape: 30000 rows x 19 columns

Columns:
 - age
 - gender
 - job_type
 - daily_social_media_time
 - social_platform_preference
 - number_of_notifications
 - work_hours_per_day
 - perceived_productivity_score
 - actual_productivity_score
 - stress_level
 - sleep_hours
 - screen_time_before_sleep
 - breaks_during_work
 - uses_focus_apps
 - has_digital_wellbeing_enabled
 - coffee_consumption_per_day
 - days_feeling_burnout_per_month
 - weekly_offline_hours
 - job_satisfaction_score

Preview (head):


Unnamed: 0,age,gender,job_type,daily_social_media_time,social_platform_preference,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,sleep_hours,screen_time_before_sleep,breaks_during_work,uses_focus_apps,has_digital_wellbeing_enabled,coffee_consumption_per_day,days_feeling_burnout_per_month,weekly_offline_hours,job_satisfaction_score
0,56,Male,Unemployed,4.18094,Facebook,61,6.753558,8.040464,7.291555,4.0,5.116546,0.419102,8,False,False,4,11,21.927072,6.336688
1,46,Male,Health,3.249603,Twitter,59,9.169296,5.063368,5.165093,7.0,5.103897,0.671519,7,True,True,2,25,0.0,3.412427
2,32,Male,Finance,,Twitter,57,7.910952,3.861762,3.474053,4.0,8.583222,0.624378,0,True,False,3,17,10.322044,2.474944
3,60,Female,Unemployed,,Facebook,59,6.355027,2.916331,1.774869,6.0,6.052984,1.20454,1,False,False,0,4,23.876616,1.73367
4,25,Male,IT,,Telegram,66,6.214096,8.868753,,7.0,5.405706,1.876254,1,False,True,1,30,10.653519,9.69306



Dtypes:


age                                 int64
gender                             object
job_type                           object
daily_social_media_time           float64
social_platform_preference         object
number_of_notifications             int64
work_hours_per_day                float64
perceived_productivity_score      float64
actual_productivity_score         float64
stress_level                      float64
sleep_hours                       float64
screen_time_before_sleep          float64
breaks_during_work                  int64
uses_focus_apps                      bool
has_digital_wellbeing_enabled        bool
coffee_consumption_per_day          int64
days_feeling_burnout_per_month      int64
weekly_offline_hours              float64
job_satisfaction_score            float64
dtype: object


Missing values (only columns with missing):


Unnamed: 0,missing_count
daily_social_media_time,2765
job_satisfaction_score,2730
sleep_hours,2598
actual_productivity_score,2365
screen_time_before_sleep,2211
stress_level,1904
perceived_productivity_score,1614



Numeric summary (describe):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,30000.0,41.486867,13.835221,18.0,30.0,41.0,53.0,65.0
daily_social_media_time,27235.0,3.113418,2.074813,0.0,1.639566,3.025913,4.368917,17.973256
number_of_notifications,30000.0,59.958767,7.723772,30.0,55.0,60.0,65.0,90.0
work_hours_per_day,30000.0,6.990792,1.997736,0.0,5.643771,6.990641,8.354725,12.0
perceived_productivity_score,28386.0,5.510488,2.02347,2.000252,3.757861,5.525005,7.265776,8.999376
actual_productivity_score,27635.0,4.951805,1.883378,0.296812,3.373284,4.951742,6.526342,9.846258
stress_level,28096.0,5.514059,2.866344,1.0,3.0,6.0,8.0,10.0
sleep_hours,27402.0,6.500247,1.464004,3.0,5.493536,6.49834,7.504143,10.0
screen_time_before_sleep,27789.0,1.025568,0.653355,0.0,0.52849,1.006159,1.477221,3.0
breaks_during_work,30000.0,4.9922,3.173737,0.0,2.0,5.0,8.0,10.0
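
The missing-value counts above are easier to compare as shares of the 30000 rows. A small sketch of that arithmetic follows. The counts are copied from the output of this cell, while the names `missing_counts` and `missing_pct` are my own.

```python
N_ROWS = 30000  # number of rows reported by inspect_df

# Missing counts copied from the output above.
missing_counts = {
    "daily_social_media_time": 2765,
    "job_satisfaction_score": 2730,
    "sleep_hours": 2598,
    "actual_productivity_score": 2365,
    "screen_time_before_sleep": 2211,
    "stress_level": 1904,
    "perceived_productivity_score": 1614,
}

# Share of missing values per column, in percent.
missing_pct = {col: round(100 * n / N_ROWS, 2) for col, n in missing_counts.items()}

for col, pct in missing_pct.items():
    print(f"{col:<30} {pct:5.2f}%")
```

On the real frame, `smp.isna().mean()` gives the same shares as fractions without typing in the counts.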


## 3) Time-Wasters on Social Media 

Data on social media activities and behaviours that "steal time".

I load Time-Wasters on Social Media.csv:

- I store the path in tw_path for readability
- pd.read_csv(tw_path) loads the data into tw
- inspect_df(tw, "Time-Wasters on Social Media") checks the dimensions and columns

This dataset describes social media activities and behaviours that can act as "time-wasters".


In [5]:
PROJECT_ROOT = Path.cwd().parent  
DATA_RAW = PROJECT_ROOT / "data_raw"
tw_path = DATA_RAW / "Time-Wasters on Social Media.csv"

tw = pd.read_csv(tw_path)
inspect_df(tw, "Time-Wasters on Social Media")


DATASET: Time-Wasters on Social Media
-------------------------------------
Shape: 1000 rows x 31 columns

Columns:
 - UserID
 - Age
 - Gender
 - Location
 - Income
 - Debt
 - Owns Property
 - Profession
 - Demographics
 - Platform
 - Total Time Spent
 - Number of Sessions
 - Video ID
 - Video Category
 - Video Length
 - Engagement
 - Importance Score
 - Time Spent On Video
 - Number of Videos Watched
 - Scroll Rate
 - Frequency
 - ProductivityLoss
 - Satisfaction
 - Watch Reason
 - DeviceType
 - OS
 - Watch Time
 - Self Control
 - Addiction Level
 - CurrentActivity
 - ConnectionType

Preview (head):


Unnamed: 0,UserID,Age,Gender,Location,Income,Debt,Owns Property,Profession,Demographics,Platform,...,ProductivityLoss,Satisfaction,Watch Reason,DeviceType,OS,Watch Time,Self Control,Addiction Level,CurrentActivity,ConnectionType
0,1,56,Male,Pakistan,82812,True,True,Engineer,Rural,Instagram,...,3,7,Procrastination,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
1,2,46,Female,Mexico,27999,False,True,Artist,Urban,Instagram,...,5,5,Habit,Computer,Android,5:00 PM,7,3,At school,Wi-Fi
2,3,32,Female,United States,42436,False,True,Engineer,Rural,Facebook,...,6,4,Entertainment,Tablet,Android,2:00 PM,8,2,At home,Mobile Data
3,4,60,Male,Barzil,62963,True,False,Waiting staff,Rural,YouTube,...,3,7,Habit,Smartphone,Android,9:00 PM,5,5,Commuting,Mobile Data
4,5,25,Male,Pakistan,22096,False,True,Manager,Urban,TikTok,...,8,2,Boredom,Smartphone,iOS,8:00 AM,10,0,At home,Mobile Data



Dtypes:


UserID                       int64
Age                          int64
Gender                      object
Location                    object
Income                       int64
Debt                          bool
Owns Property                 bool
Profession                  object
Demographics                object
Platform                    object
Total Time Spent             int64
Number of Sessions           int64
Video ID                     int64
Video Category              object
Video Length                 int64
Engagement                   int64
Importance Score             int64
Time Spent On Video          int64
Number of Videos Watched     int64
Scroll Rate                  int64
Frequency                   object
ProductivityLoss             int64
Satisfaction                 int64
Watch Reason                object
DeviceType                  object
OS                          object
Watch Time                  object
Self Control                 int64
Addiction Level              int64
CurrentActivity             object
ConnectionType              object
dtype: object


Missing values (only columns with missing):
 - None

Numeric summary (describe):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
UserID,1000.0,500.5,288.819436,1.0,250.75,500.5,750.25,1000.0
Age,1000.0,40.986,13.497852,18.0,29.0,42.0,52.0,64.0
Income,1000.0,59524.213,23736.212925,20138.0,38675.25,58805.0,79792.25,99676.0
Total Time Spent,1000.0,151.406,83.952637,10.0,78.0,152.0,223.0,298.0
Number of Sessions,1000.0,10.013,5.380314,1.0,6.0,10.0,15.0,19.0
Video ID,1000.0,4891.738,2853.144258,11.0,2542.0,4720.5,7346.0,9997.0
Video Length,1000.0,15.214,8.224953,1.0,8.0,15.0,22.0,29.0
Engagement,1000.0,4997.159,2910.053701,15.0,2415.75,5016.0,7540.25,9982.0
Importance Score,1000.0,5.129,2.582834,1.0,3.0,5.0,7.0,9.0
Time Spent On Video,1000.0,14.973,8.200092,1.0,8.0,15.0,22.0,29.0


## 4) Number of Internet Users 

I load numberofinternetusers new.csv:
- pd.read_csv(...) loads the data into iu
- inspect_df(iu, "Number of Internet Users") shows the basic information

This dataset contains macro-level data on the number of internet users by country and year (used as additional context).

In [6]:
PROJECT_ROOT = Path.cwd().parent
DATA_RAW = PROJECT_ROOT / "data_raw"

iu = pd.read_csv(DATA_RAW / "numberofinternetusers new.csv")
inspect_df(iu, "Number of Internet Users")

DATASET: Number of Internet Users
---------------------------------
Shape: 6192 rows x 4 columns

Columns:
 - Entity
 - Code
 - Year
 - Number of Internet users

Preview (head):


Unnamed: 0,Entity,Code,Year,Number of Internet users
0,Afghanistan,AFG,1990,0
1,Albania,ALB,1990,0
2,Algeria,DZA,1990,0
3,American Samoa,ASM,1990,0
4,Andorra,AND,1990,0



Dtypes:


Entity                      object
Code                        object
Year                         int64
Number of Internet users     int64
dtype: object


Missing values (only columns with missing):
 - None

Numeric summary (describe):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,6192.0,2004.729,8.84573,1990.0,1997.0,2005.0,2012.0,2020.0
Number of Internet users,6192.0,22135060.0,172629300.0,0.0,7538.25,199903.5,2946347.75,4699886000.0


### Top categories

I define a helper function top_categories(df, top_n=5) that:
- finds all categorical (object) columns
- prints the top_n most frequent values for each of them (value_counts)

I use this to quickly see the distribution of categories (e.g. gender, platform, job type), as well as potential rare values and NaN.

In [7]:
def top_categories(df, top_n=5):
    obj_cols = df.select_dtypes(include="object").columns
    for c in obj_cols:
        print(f"\nTop {top_n} values for: {c}")
        display(df[c].value_counts(dropna=False).head(top_n))

top_categories(smp, top_n=7)



Top 7 values for: gender


gender
Male      14452
Female    14370
Other      1178
Name: count, dtype: int64


Top 7 values for: job_type


job_type
Education     5055
IT            5026
Finance       5017
Student       5012
Unemployed    4958
Health        4932
Name: count, dtype: int64


Top 7 values for: social_platform_preference


social_platform_preference
TikTok       6096
Telegram     6013
Instagram    6006
Twitter      5964
Facebook     5921
Name: count, dtype: int64

## 5) World Bank API (JSON) – internet usage indicator

As a heterogeneous source I use the World Bank API, which returns JSON. I fetch the time series by country and year and convert it to tabular form.

- I define the function fetch_worldbank_indicator(indicator, pages_limit, per_page)
- I send a GET request to the endpoint for the indicator
- I use paging (multiple pages), but set pages_limit as a safety cap so that too much data is not downloaded at once
- from the JSON response I build a flat table with the key fields: country, country_iso3, year, value (the indicator value), indicator_id and indicator_name

Finally, I store the result in the DataFrame wb_internet_pct and print the number of rows.
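
To make the flattening step concrete, here is a minimal offline sketch that turns one record into one flat row. `sample_item` is written by hand and contains only the fields the function reads, with values taken from the first row of the output in this section. The helper name `flatten_record` is an assumption of mine, and a real API record carries additional fields.

```python
from typing import Any

def flatten_record(item: dict[str, Any], indicator: str) -> dict[str, Any]:
    """Map one nested indicator record to a flat row (one dict per table row)."""
    date = item.get("date")
    return {
        "country": (item.get("country") or {}).get("value"),
        "country_iso3": item.get("countryiso3code"),
        "year": int(date) if date else None,
        "value": item.get("value"),
        "indicator_id": indicator,
        "indicator_name": (item.get("indicator") or {}).get("value"),
    }

# Hand-written sample record (illustrative, not a verbatim API response).
sample_item = {
    "indicator": {"id": "IT.NET.USER.ZS",
                  "value": "Individuals using the Internet (% of population)"},
    "country": {"value": "Africa Eastern and Southern"},
    "countryiso3code": "AFE",
    "date": "2024",
    "value": 28.8,
}

flat = flatten_record(sample_item, "IT.NET.USER.ZS")
print(flat)
```

A list of such dictionaries can be passed straight to `pd.DataFrame(...)`, which is how the flat table is built.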


In [8]:
import requests
import pandas as pd

def fetch_worldbank_indicator(indicator: str, pages_limit: int = 5, per_page: int = 20000):
    """
    Fetches World Bank indicator as JSON and returns a flat DataFrame.
    pages_limit: safety cap to avoid huge downloads.
    """
    url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator}"
    params = {"format": "json", "per_page": per_page, "page": 1}

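    # First request: returns the paging metadata together with page 1 of the data.
    r = requests.get(url, params=params, timeout=60)
    r.raise_for_status()
    meta, data = r.json()  # the response is a two-element list: [metadata, records]

    # Cap the number of pages so a large indicator cannot trigger a huge download.
    pages = min(meta.get("pages", 1), pages_limit)

    rows = []
    for page in range(1, pages + 1):
        if page > 1:  # page 1 was already fetched above, so it is not requested again
            params["page"] = page
            rp = requests.get(url, params=params, timeout=60)
            rp.raise_for_status()
            _, data = rp.json()

        for item in data or []:  # an empty page may come back as null
            if item is None:
                continue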
            rows.append({
                "country": item.get("country", {}).get("value"),
                "country_iso3": item.get("countryiso3code"),
                "year": int(item.get("date")) if item.get("date") else None,
                "value": item.get("value"),
                "indicator_id": indicator,
                "indicator_name": item.get("indicator", {}).get("value")
            })

    df = pd.DataFrame(rows)
    return df

# Indicator: Individuals using the Internet (% of population)
wb_internet_pct = fetch_worldbank_indicator("IT.NET.USER.ZS", pages_limit=3)

print("World Bank rows:", len(wb_internet_pct))
display(wb_internet_pct.head())

World Bank rows: 17290


Unnamed: 0,country,country_iso3,year,value,indicator_id,indicator_name
0,Africa Eastern and Southern,AFE,2024,28.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
1,Africa Eastern and Southern,AFE,2023,27.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
2,Africa Eastern and Southern,AFE,2022,26.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
3,Africa Eastern and Southern,AFE,2021,25.0,IT.NET.USER.ZS,Individuals using the Internet (% of population)
4,Africa Eastern and Southern,AFE,2020,23.5,IT.NET.USER.ZS,Individuals using the Internet (% of population)


In [9]:
# basic check: year range and share of missing values
print(wb_internet_pct["year"].min(), wb_internet_pct["year"].max())
print("Missing value:", wb_internet_pct["value"].isna().mean())

# filtering example: from 2005 onwards
wb_internet_pct_2005 = wb_internet_pct[wb_internet_pct["year"] >= 2005].copy()

display(wb_internet_pct_2005.head())


1960 2024
Missing value: 0.6102949681897051


Unnamed: 0,country,country_iso3,year,value,indicator_id,indicator_name
0,Africa Eastern and Southern,AFE,2024,28.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
1,Africa Eastern and Southern,AFE,2023,27.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
2,Africa Eastern and Southern,AFE,2022,26.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
3,Africa Eastern and Southern,AFE,2021,25.0,IT.NET.USER.ZS,Individuals using the Internet (% of population)
4,Africa Eastern and Southern,AFE,2020,23.5,IT.NET.USER.ZS,Individuals using the Internet (% of population)
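
About 61% of the `value` entries are missing, and the series starts in 1960. One way to see where the gaps sit is to compute the missing share per year. The sketch below does that on a tiny made-up frame (`wb_demo`, with placeholder codes and values), so it shows only the technique and says nothing about the real data.

```python
import pandas as pd

# Made-up placeholder data with the same column names as wb_internet_pct.
wb_demo = pd.DataFrame({
    "country_iso3": ["AAA", "AAA", "BBB", "BBB"],
    "year": [1960, 2020, 1960, 2020],
    "value": [None, 50.0, None, None],
})

# Share of missing values for each year (mean of a boolean mask per group).
missing_by_year = wb_demo["value"].isna().groupby(wb_demo["year"]).mean()
print(missing_by_year)
```

Applied to `wb_internet_pct`, the same expression would show whether the missing values are concentrated in the early years.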


### Saving the World Bank data to data_raw/

In [10]:
from pathlib import Path

DATA_RAW = Path("..") / "data_raw"   # because the notebook runs from notebooks/
DATA_RAW.mkdir(exist_ok=True)

out_csv = DATA_RAW / "worldbank_IT.NET.USER.ZS_sample.csv"
wb_internet_pct_2005.to_csv(out_csv, index=False)

json_path = DATA_RAW / "worldbank_IT.NET.USER.ZS.json"
wb_internet_pct.to_json(json_path, orient="records", force_ascii=False, indent=2)
print("Saved WorldBank JSON to:", json_path)

print("Saved:", out_csv)


Saved WorldBank JSON to: ..\data_raw\worldbank_IT.NET.USER.ZS.json
Saved: ..\data_raw\worldbank_IT.NET.USER.ZS_sample.csv


#### Aligning years (DT vs World Bank)

In [11]:
# years from the DT dataset
dt_years = sorted(dt["Year"].dropna().unique())
print("DT years:", dt_years[:10], "...", dt_years[-5:])

wb_aligned = wb_internet_pct[wb_internet_pct["year"].isin(dt_years)].copy()
print("WB aligned rows:", len(wb_aligned))
display(wb_aligned.head())


DT years: [np.int64(2012), np.int64(2013), np.int64(2014), np.int64(2015), np.int64(2016), np.int64(2017), np.int64(2018), np.int64(2019), np.int64(2020), np.int64(2021)] ... [np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024)]
WB aligned rows: 3458


Unnamed: 0,country,country_iso3,year,value,indicator_id,indicator_name
0,Africa Eastern and Southern,AFE,2024,28.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
1,Africa Eastern and Southern,AFE,2023,27.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
2,Africa Eastern and Southern,AFE,2022,26.8,IT.NET.USER.ZS,Individuals using the Internet (% of population)
3,Africa Eastern and Southern,AFE,2021,25.0,IT.NET.USER.ZS,Individuals using the Internet (% of population)
4,Africa Eastern and Southern,AFE,2020,23.5,IT.NET.USER.ZS,Individuals using the Internet (% of population)
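
The printed year list above is cluttered with `np.int64(...)` wrappers. A minimal sketch of the same alignment with plain Python integers follows. The demo frames are assumptions of mine: `dt_demo` holds only the 13 years 2012 to 2024, and `wb_demo` reuses two rows from the output plus one placeholder row for 2011.

```python
import pandas as pd

# Stand-in for dt: only the Year column, 2012 to 2024.
dt_demo = pd.DataFrame({"Year": range(2012, 2025)})

# Stand-in for wb_internet_pct; the 2011 row is a placeholder with no value.
wb_demo = pd.DataFrame({
    "country_iso3": ["AFE", "AFE", "AFE"],
    "year": [2024, 2023, 2011],
    "value": [28.8, 27.8, None],
})

# tolist() converts NumPy integers to plain Python ints, which print cleanly.
dt_years = sorted(dt_demo["Year"].dropna().astype(int).tolist())
print("DT years:", dt_years[0], "to", dt_years[-1], f"({len(dt_years)} values)")

# Keep only the World Bank rows whose year also occurs in DT.
wb_aligned_demo = wb_demo[wb_demo["year"].isin(dt_years)].copy()
print(wb_aligned_demo)
```

The filter keeps the 2024 and 2023 rows and drops the 2011 placeholder, mirroring what `isin(dt_years)` does on the full table.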
