# 02 – Data preprocessing and integration

In this notebook we carry out:
- cleaning and type conversion (CSV + JSON->table)
- standardization of column names
- integration into a single dataset for storage in a database (SQLite)


Loading the data (from data_raw)

In [26]:
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path.cwd().parent  # notebooks/ -> project root
DATA_RAW = PROJECT_ROOT / "data_raw"

dt = pd.read_csv(DATA_RAW / "DT.csv")
smp = pd.read_csv(DATA_RAW / "social_media_vs_productivity.csv")
tw = pd.read_csv(DATA_RAW / "Time-Wasters on Social Media.csv")
iu = pd.read_csv(DATA_RAW / "numberofinternetusers new.csv")

print("Loaded:")
print("DT:", dt.shape)
print("SMP:", smp.shape)
print("TW:", tw.shape)
print("IU:", iu.shape)


Loaded:
DT: (13, 2)
SMP: (30000, 19)
TW: (1000, 31)
IU: (6192, 4)


### DT: Hours:Minutes -> minutes

Goal: derive a numeric value in minutes from the column Daily Time (Hours:Minutes).

I define the function hhmm_to_minutes(x), which:
- expects the format "HH:MM"
- returns HH*60 + MM
- returns None for invalid / empty values

In [27]:
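def hhmm_to_minutes(x):
    """Convert an "HH:MM" string to total minutes; return None if it cannot be parsed."""
    if pd.isna(x):
        return None
    s = str(x).strip()
    # we expect "HH:MM"
    if ":" not in s:
        return None
    hh, mm = s.split(":", 1)
    try:
        return int(hh) * 60 + int(mm)
    except ValueError:
        # non-numeric hour or minute part
        return None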

dt_clean = dt.copy()
dt_clean["daily_minutes"] = dt_clean["Daily Time (Hours:Minutes)"].apply(hhmm_to_minutes)
dt_clean = dt_clean.drop(columns=["Daily Time (Hours:Minutes)"])

display(dt_clean.head())
print(dt_clean[["Year","daily_minutes"]].isna().sum())


Unnamed: 0,Year,daily_minutes
0,2012,90
1,2013,95
2,2014,104
3,2015,111
4,2016,128


Year             0
daily_minutes    0
dtype: int64
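
The same conversion can also be written without a row-by-row apply. The sketch below is only an illustration: hhmm_series_to_minutes is a hypothetical helper that is not part of this notebook, and it assumes a stricter format than hhmm_to_minutes (digits, a colon, one or two digits), so anything else becomes a missing value.

```python
import pandas as pd

def hhmm_series_to_minutes(s: pd.Series) -> pd.Series:
    # Hypothetical helper (assumption): accepts only "digits:digits" after trimming
    parts = s.str.strip().str.extract(r"^(\d+):(\d{1,2})$")
    hours = pd.to_numeric(parts[0], errors="coerce")
    minutes = pd.to_numeric(parts[1], errors="coerce")
    # nullable integer dtype keeps missing values as <NA> instead of forcing floats
    return (hours * 60 + minutes).astype("Int64")

# toy input: two valid values, one missing, one malformed
sample = pd.Series(["1:30", "02:08", None, "abc"])
sample_minutes = hhmm_series_to_minutes(sample)
print(sample_minutes)
```

With only 13 rows in DT the apply-based version is perfectly adequate; a vectorized form mainly pays off on larger tables.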


### SMP: standardization and cleaning

Goal: prepare the “core” data for the later productivity analysis.

I make a copy: smp_clean = smp.copy()

I standardize the column names (trim + lowercase) for consistency.

I define the key columns; rows with a missing value in any of them are dropped:

- daily_social_media_time
- perceived_productivity_score
- actual_productivity_score

In [28]:
smp_clean = smp.copy()

# standardize column names (trim + lowercase)
smp_clean.columns = [c.strip().lower() for c in smp_clean.columns]

key_cols = ["daily_social_media_time", "perceived_productivity_score", "actual_productivity_score"]
existing_key_cols = [c for c in key_cols if c in smp_clean.columns]

if existing_key_cols:
    before = len(smp_clean)
    smp_clean = smp_clean.dropna(subset=existing_key_cols)
    print("Dropped rows (missing key cols):", before - len(smp_clean))

display(smp_clean.head())


Dropped rows (missing key cols): 6270


Unnamed: 0,age,gender,job_type,daily_social_media_time,social_platform_preference,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,sleep_hours,screen_time_before_sleep,breaks_during_work,uses_focus_apps,has_digital_wellbeing_enabled,coffee_consumption_per_day,days_feeling_burnout_per_month,weekly_offline_hours,job_satisfaction_score
0,56,Male,Unemployed,4.18094,Facebook,61,6.753558,8.040464,7.291555,4.0,5.116546,0.419102,8,False,False,4,11,21.927072,6.336688
1,46,Male,Health,3.249603,Twitter,59,9.169296,5.063368,5.165093,7.0,5.103897,0.671519,7,True,True,2,25,0.0,3.412427
6,56,Female,Unemployed,4.38107,TikTok,60,3.902309,6.420989,5.976408,7.0,7.549849,2.252624,4,False,False,4,20,24.084905,5.501373
7,36,Female,Education,4.089168,Twitter,49,6.560467,2.68183,2.446927,4.0,6.325507,0.747998,2,False,False,4,29,8.419648,3.444376
8,40,Female,Education,4.097401,Instagram,57,5.83959,3.219022,3.00424,4.0,,0.0,10,False,True,2,10,0.0,1.960131
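
The SMP column names already use underscores, so trimming and lowercasing is enough here. For a table whose headers contain spaces (such as TW), a slightly more general helper might look like the sketch below. standardize_columns is a hypothetical name and is not used elsewhere in this notebook.

```python
import re
import pandas as pd

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper (assumption): trim, lowercase, and turn whitespace runs into "_"
    out = df.copy()
    out.columns = [re.sub(r"\s+", "_", str(c).strip().lower()) for c in out.columns]
    return out

# toy frame with headers in the style of the TW dataset
toy_cols = pd.DataFrame({" Total Time Spent ": [100], "Age": [40]})
toy_cols_std = standardize_columns(toy_cols)
print(list(toy_cols_std.columns))
```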


### World Bank data from the API, aggregated by year

Goal: fetch the indicator IT.NET.USER.ZS from the API and turn it into a yearly table.
I define fetch_worldbank_indicator(...) (pagination + a page limit).
The fetched data is stored in wb.

- I aggregate by year (groupby("year").mean()), ignoring NaN
- I rename the column year to Year so it can be merged with DT (the merge uses the same key)
- The first few rows and the range of years are printed (here: 1960–2024)

In [29]:
import requests

def fetch_worldbank_indicator(indicator: str, pages_limit: int = 3, per_page: int = 20000):
    url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator}"
    params = {"format": "json", "per_page": per_page, "page": 1}

    rows = []
    page, pages = 1, 1
    while page <= pages:
        # fetch one page; the metadata of page 1 also tells us how many pages exist
        params["page"] = page
        rp = requests.get(url, params=params, timeout=60)
        rp.raise_for_status()
        meta_p, data_p = rp.json()
        pages = min(meta_p.get("pages", 1), pages_limit)
        page += 1

        for item in data_p:
            if item is None:
                continue
            rows.append({
                "country": item.get("country", {}).get("value"),
                "country_iso3": item.get("countryiso3code"),
                "year": int(item.get("date")) if item.get("date") else None,
                "internet_pct": item.get("value")
            })

    return pd.DataFrame(rows)

wb = fetch_worldbank_indicator("IT.NET.USER.ZS", pages_limit=3)

# aggregation by year: unweighted mean over all returned entries (NaN ignored)
wb_year = (
    wb.dropna(subset=["year"])
      .groupby("year", as_index=False)["internet_pct"]
      .mean()
      .rename(columns={"year": "Year"})
)

display(wb_year.head())
print("WB years:", wb_year["Year"].min(), wb_year["Year"].max())


Unnamed: 0,Year,internet_pct
0,1960,
1,1961,
2,1962,
3,1963,
4,1964,


WB years: 1960 2024
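
The empty internet_pct values for 1960–1964 follow from how the mean treats missing data: NaN values are skipped within a group, but a year in which every value is NaN still yields NaN. The toy example below (made-up numbers, not World Bank data) illustrates this, together with one possible way to drop such years.

```python
import numpy as np
import pandas as pd

# toy data (assumption): 1960 has no observations at all, 2012 has one gap
toy = pd.DataFrame({
    "year": [1960, 1960, 2012, 2012, 2012],
    "internet_pct": [np.nan, np.nan, 30.0, np.nan, 50.0],
})

toy_year = (
    toy.groupby("year", as_index=False)["internet_pct"]
       .mean()  # NaN is skipped inside each group
       .rename(columns={"year": "Year"})
)

# optional: keep only years that have at least one observed value
toy_year_valid = toy_year.dropna(subset=["internet_pct"])
print(toy_year)
```

Because the later merge with DT is a left join on Year, these all-NaN years do not affect integrated_year; dropping them is purely cosmetic.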


### Integrating DT + WB by year

Goal: build a yearly integrated table.

We merge dt_clean and wb_year on the Year column:

- integrated_year = dt_clean.merge(wb_year, on="Year", how="left")
- We compute the share of missing values in internet_pct after the merge (isna().mean())

This table is later used for analyses at the yearly level (e.g. DT minutes vs. internet %).

In [30]:
integrated_year = dt_clean.merge(wb_year, on="Year", how="left")

display(integrated_year.head())
print("Integrated shape:", integrated_year.shape)
print("Missing internet_pct:", integrated_year["internet_pct"].isna().mean())


Unnamed: 0,Year,daily_minutes,internet_pct
0,2012,90,38.020353
1,2013,95,40.499988
2,2014,104,43.31502
3,2015,111,46.12328
4,2016,128,49.483999


Integrated shape: (13, 3)
Missing internet_pct: 0.0
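
As an optional safeguard, the merge can check its own assumptions. The sketch below uses toy frames (illustrative values only) and the validate and indicator parameters of DataFrame.merge: validate raises an error if Year is duplicated on either side, and indicator adds a _merge column showing which rows found a match.

```python
import pandas as pd

# toy frames (assumption): 2014 is deliberately missing on the right side
left = pd.DataFrame({"Year": [2012, 2013, 2014], "daily_minutes": [90, 95, 104]})
right = pd.DataFrame({"Year": [2012, 2013], "internet_pct": [38.0, 40.5]})

checked = left.merge(
    right,
    on="Year",
    how="left",
    validate="one_to_one",  # raises MergeError if Year is not unique on both sides
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

unmatched_years = checked.loc[checked["_merge"] == "left_only", "Year"].tolist()
print(unmatched_years)
```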


### SMP table + basic feature set

Goal: select the most important columns describing habits and productivity.

We define the list core_cols (age, gender, job, time on social media, notifications, work hours, perceived/actual productivity, stress, platform preference).

We keep only the columns from that list that actually exist (safe if one is missing).

The result is stored in smp_core.


In [31]:
# SMP: core (individual-level)
core_cols = [
    "age", "gender", "job_type",
    "daily_social_media_time", "number_of_notifications", "work_hours_per_day",
    "perceived_productivity_score", "actual_productivity_score", "stress_level",
    "social_platform_preference"
]
smp_core = smp_clean[[c for c in core_cols if c in smp_clean.columns]].copy()

print("SMP core shape:", smp_core.shape)
display(smp_core.head())


SMP core shape: (23730, 10)


Unnamed: 0,age,gender,job_type,daily_social_media_time,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,social_platform_preference
0,56,Male,Unemployed,4.18094,61,6.753558,8.040464,7.291555,4.0,Facebook
1,46,Male,Health,3.249603,59,9.169296,5.063368,5.165093,7.0,Twitter
6,56,Female,Unemployed,4.38107,60,3.902309,6.420989,5.976408,7.0,TikTok
7,36,Female,Education,4.089168,49,6.560467,2.68183,2.446927,4.0,Twitter
8,40,Female,Education,4.097401,57,5.83959,3.219022,3.00424,4.0,Instagram


### Computing aggregates from DT+WB and adding them to SMP

Goal: add contextual (aggregated) values to the individual records. Each aggregate is a single number, so every row receives the same value.

In [32]:
# period aggregates (mean over the DT years)
dt_daily_minutes_avg = dt_clean["daily_minutes"].mean()

# mean internet_pct over the same DT years
internet_pct_avg = integrated_year["internet_pct"].mean()

print("DT avg daily minutes:", dt_daily_minutes_avg)
print("WB avg internet pct (DT years):", internet_pct_avg)

integrated_df = smp_core.copy()
integrated_df["dt_daily_minutes_avg"] = dt_daily_minutes_avg
integrated_df["internet_pct_avg_dt_years"] = internet_pct_avg

display(integrated_df.head())
print("Integrated (individual-level) shape:", integrated_df.shape)


DT avg daily minutes: 129.23076923076923
WB avg internet pct (DT years): 56.20038915557802


Unnamed: 0,age,gender,job_type,daily_social_media_time,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,social_platform_preference,dt_daily_minutes_avg,internet_pct_avg_dt_years
0,56,Male,Unemployed,4.18094,61,6.753558,8.040464,7.291555,4.0,Facebook,129.230769,56.200389
1,46,Male,Health,3.249603,59,9.169296,5.063368,5.165093,7.0,Twitter,129.230769,56.200389
6,56,Female,Unemployed,4.38107,60,3.902309,6.420989,5.976408,7.0,TikTok,129.230769,56.200389
7,36,Female,Education,4.089168,49,6.560467,2.68183,2.446927,4.0,Twitter,129.230769,56.200389
8,40,Female,Education,4.097401,57,5.83959,3.219022,3.00424,4.0,Instagram,129.230769,56.200389


Integrated (individual-level) shape: (23730, 12)
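
Because the context values are scalars, the added columns are constant across all rows. They document the context but carry no row-level variation, which is worth keeping in mind for later per-row analyses. A small sketch with toy data (the names people and context are illustrative assumptions):

```python
import pandas as pd

# toy individual-level table and toy context values (assumption)
people = pd.DataFrame({"age": [56, 46, 40]})
context = {"dt_daily_minutes_avg": 129.23, "internet_pct_avg_dt_years": 56.20}

# assign broadcasts each scalar to every row
enriched = people.assign(**context)

# columns with a single distinct value cannot explain differences between rows
constant_cols = [c for c in enriched.columns if enriched[c].nunique(dropna=False) == 1]
print(constant_cols)
```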


Time-Wasters (TW) summarized through 1–2 aggregates

In [33]:
# take all numeric columns from TW and compute their mean (global time-wasters profile)
tw_num = tw.select_dtypes(include="number")
tw_profile = tw_num.mean(numeric_only=True)

print("TW numeric columns:", list(tw_num.columns)[:20], "...")
display(tw_profile.head(10))


TW numeric columns: ['UserID', 'Age', 'Income', 'Total Time Spent', 'Number of Sessions', 'Video ID', 'Video Length', 'Engagement', 'Importance Score', 'Time Spent On Video', 'Number of Videos Watched', 'Scroll Rate', 'ProductivityLoss', 'Satisfaction', 'Self Control', 'Addiction Level'] ...


UserID                   500.500
Age                       40.986
Income                 59524.213
Total Time Spent         151.406
Number of Sessions        10.013
Video ID                4891.738
Video Length              15.214
Engagement              4997.159
Importance Score           5.129
Time Spent On Video       14.973
dtype: float64
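
The numeric selection also picks up identifier columns such as UserID and Video ID, whose means have no interpretation. One possible way to exclude them is sketched below on a toy frame; the list id_cols is an assumption about which columns are identifiers.

```python
import pandas as pd

# toy frame in the style of TW (values are made up)
tw_toy = pd.DataFrame({
    "UserID": [1, 2, 3],
    "Video ID": [10, 20, 30],
    "Total Time Spent": [100, 150, 200],
    "Platform": ["a", "b", "c"],
})

id_cols = ["UserID", "Video ID"]  # assumption: identifier columns, not measurements

tw_toy_profile = (
    tw_toy.select_dtypes(include="number")
          .drop(columns=id_cols, errors="ignore")  # ignore names that are absent
          .mean()
)
print(tw_toy_profile)
```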

### TW features

The goal is to obtain simple aggregates that can serve as additional “context” in the integrated dataset.

In [34]:
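tw_features = ["Total Time Spent", "Number of Sessions", "Video Length", "Engagement", "Importance Score", "Time Spent On"]

# keep only names that exist in tw; "Time Spent On" matches no column
# (the dataset has "Time Spent On Video"), so it is dropped here
tw_features = [c for c in tw_features if c in tw.columns]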
print("TW features used:", tw_features)

tw_summary = tw[tw_features].agg(["mean", "median", "std"]).T
display(tw_summary)


TW features used: ['Total Time Spent', 'Number of Sessions', 'Video Length', 'Engagement', 'Importance Score']


Unnamed: 0,mean,median,std
Total Time Spent,151.406,152.0,83.952637
Number of Sessions,10.013,10.0,5.380314
Video Length,15.214,15.0,8.224953
Engagement,4997.159,5016.0,2910.053701
Importance Score,5.129,5.0,2.582834


### Adding TW aggregates to integrated_df

In [35]:
if "Total Time Spent" in tw.columns:
    integrated_df["tw_total_time_spent_mean"] = tw["Total Time Spent"].mean()

if "Number of Sessions" in tw.columns:
    integrated_df["tw_sessions_mean"] = tw["Number of Sessions"].mean()

if "Engagement" in tw.columns:
    integrated_df["tw_engagement_mean"] = tw["Engagement"].mean()

display(integrated_df.head())


Unnamed: 0,age,gender,job_type,daily_social_media_time,number_of_notifications,work_hours_per_day,perceived_productivity_score,actual_productivity_score,stress_level,social_platform_preference,dt_daily_minutes_avg,internet_pct_avg_dt_years,tw_total_time_spent_mean,tw_sessions_mean,tw_engagement_mean
0,56,Male,Unemployed,4.18094,61,6.753558,8.040464,7.291555,4.0,Facebook,129.230769,56.200389,151.406,10.013,4997.159
1,46,Male,Health,3.249603,59,9.169296,5.063368,5.165093,7.0,Twitter,129.230769,56.200389,151.406,10.013,4997.159
6,56,Female,Unemployed,4.38107,60,3.902309,6.420989,5.976408,7.0,TikTok,129.230769,56.200389,151.406,10.013,4997.159
7,36,Female,Education,4.089168,49,6.560467,2.68183,2.446927,4.0,Twitter,129.230769,56.200389,151.406,10.013,4997.159
8,40,Female,Education,4.097401,57,5.83959,3.219022,3.00424,4.0,Instagram,129.230769,56.200389,151.406,10.013,4997.159


### Saving the integrated dataset

This file is later used to create the SQLite database and for analysis.

In [36]:
from pathlib import Path

DATA_PROCESSED = PROJECT_ROOT / "data_processed"
DATA_PROCESSED.mkdir(exist_ok=True)

out_path = DATA_PROCESSED / "integrated_individual_level.csv"
integrated_df.to_csv(out_path, index=False)

print("Saved:", out_path)


Saved: d:\Preuzimanja\PZAP_PROJEKT\PAP_PROJEKT\data_processed\integrated_individual_level.csv
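
As a preview of the database step, a table like this can be written to SQLite with DataFrame.to_sql. This is a minimal sketch under stated assumptions: it uses an in-memory database and a toy frame, and the table name integrated_individual_level is chosen here only to mirror the CSV file name.

```python
import sqlite3
import pandas as pd

# toy frame standing in for integrated_df (assumption)
toy_integrated = pd.DataFrame({
    "age": [56, 46],
    "daily_social_media_time": [4.18, 3.25],
})

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
try:
    # table name is an assumption that mirrors the CSV file name
    toy_integrated.to_sql("integrated_individual_level", conn, if_exists="replace", index=False)
    row_count = pd.read_sql_query(
        "SELECT COUNT(*) AS n FROM integrated_individual_level", conn
    )["n"].iloc[0]
finally:
    conn.close()

print("Rows in table:", row_count)
```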
