# Lesson 2: Estimates of Location

In this lesson we work with the **Students Performance Dataset**.
We will look at several ways to load the dataset in Colab:
- GitHub RAW link
- wget
- curl
- URL via pandas
- requests
- Google Drive (gdown or mount)
- Kaggle API
- Manual upload from your computer
- Seaborn built-in datasets
- Creating a mini sample inline


In [None]:
# Import the libraries used in this notebook
import os, io, zipfile
import numpy as np
import pandas as pd
from pathlib import Path

DATA_DIR = Path("/content/data")
DATA_DIR.mkdir(parents=True, exist_ok=True)
print("Data dir:", DATA_DIR)

## Method 1: Load via a GitHub RAW link

In [None]:
# Replace the <user>/<repo>/<branch> placeholders with a real repository path
RAW_URL = "https://raw.githubusercontent.com/<user>/<repo>/<branch>/path/to/StudentsPerformance.csv"
try:
    df = pd.read_csv(RAW_URL)
    # Keep a local copy so later cells can read it from disk
    df.to_csv(DATA_DIR / "students_performance.csv", index=False)
    print("Loaded:", df.shape)
except Exception as e:
    print("Error:", e)
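
Method 1 saves a local copy of the file, so a later run could read that copy instead of downloading again. The sketch below is one possible way to do this, not part of the original lesson: the helper name `load_cached_csv` and the `example.invalid` URL are hypothetical, and the demo uses a throwaway temporary file in place of a real cached dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_cached_csv(url: str, cache_path: Path) -> pd.DataFrame:
    """Read cache_path if it exists; otherwise download url and save a copy."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_csv(cache_path)  # local copy found, no network access
    df = pd.read_csv(url)  # pandas can read http(s) URLs directly
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df


# Demo: a temporary file stands in for the cached copy, so the URL is never fetched.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "students_performance.csv"
    cache.write_text("math,reading\n72,72\n69,90\n")
    demo = load_cached_csv("https://example.invalid/unused.csv", cache)

print(demo.shape)
```

In the notebook itself the call would presumably look like `load_cached_csv(RAW_URL, DATA_DIR / "students_performance.csv")`.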

## Method 2: Download with wget

In [None]:
!wget -O /content/data/students_performance.csv "https://raw.githubusercontent.com/<user>/<repo>/<branch>/path/to/StudentsPerformance.csv"
df = pd.read_csv("/content/data/students_performance.csv")
df.head()

## Method 3: Download with curl

In [None]:
!curl -L "https://raw.githubusercontent.com/<user>/<repo>/<branch>/path/to/StudentsPerformance.csv" -o /content/data/students_performance.csv
df = pd.read_csv("/content/data/students_performance.csv")
df.head()

## Method 4: Read from a URL with pandas.read_csv()

In [None]:
URL = "https://raw.githubusercontent.com/<user>/<repo>/<branch>/path/to/StudentsPerformance.csv"
df = pd.read_csv(URL)
df.head()

## Method 5: Using requests + io

In [None]:
import requests
from io import StringIO

URL = "https://raw.githubusercontent.com/<user>/<repo>/<branch>/path/to/StudentsPerformance.csv"
resp = requests.get(URL, timeout=30)
resp.raise_for_status()
df = pd.read_csv(StringIO(resp.text))
df.head()

## Method 6: From Google Drive with gdown

In [None]:
!pip -q install gdown
import gdown

FILE_ID = "1AbCdEfGhIjKlMnOP-YourFileIDHere"
gdown.download(f"https://drive.google.com/uc?id={FILE_ID}", str(DATA_DIR / "students_performance.csv"), quiet=False)
df = pd.read_csv(DATA_DIR / "students_performance.csv")
df.head()

## Method 7: Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
csv_path = "/content/drive/MyDrive/datasets/StudentsPerformance.csv"
df = pd.read_csv(csv_path)
df.head()

## Method 8: Using the Kaggle API

In [None]:
# Assumes your Kaggle API token (kaggle.json) has already been uploaded to /content
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d spscientist/students-performance-in-exams -p /content/data
!unzip -o /content/data/*.zip -d /content/data

df = pd.read_csv("/content/data/StudentsPerformance.csv")
df.head()

## Method 9: Manual upload from your computer

In [None]:
from google.colab import files
uploaded = files.upload()  # opens a file picker; returns {filename: file bytes}
csv_path = next(iter(uploaded))  # name of the first uploaded file
df = pd.read_csv(csv_path)
df.head()

## Method 10: Creating a mini dataset inline

In [None]:
csv_text = """gender,race,parent_edu,lunch,prep,math,reading,writing
female,A,bachelor,standard,completed,72,72,74
female,C,some college,standard,none,69,90,88
female,B,master,free/discount,completed,90,95,93
male,B,associate,standard,none,47,57,44
male,E,high school,standard,none,76,78,75
female,D,associate,free/discount,completed,71,83,78
female,E,some college,standard,completed,88,95,92
male,C,some college,free/discount,none,40,43,39
male,D,high school,standard,completed,64,64,67
female,B,bachelor,standard,none,38,60,50
male,C,bachelor,standard,completed,58,54,52
female,C,master,free/discount,completed,95,99,98
male,E,associate,free/discount,none,45,52,49
female,A,high school,standard,none,74,72,75
male,B,bachelor,standard,completed,66,68,62
"""
df = pd.read_csv(io.StringIO(csv_text))
df.head()
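
The mini sample uses short column names (`math`, `prep`, `parent_edu`), and the later cells look for exactly those names. A CSV loaded by one of the other methods may use longer headers, in which case those cells would skip their work. The sketch below shows one way to bring such headers in line. The long header names in `RENAME_MAP` are an assumption about what the downloaded file contains, and `normalize_columns` is a hypothetical helper, so check `df.columns` before relying on it.

```python
import pandas as pd

# Assumed long-form headers -> short names used in this notebook.
# Verify against df.columns; headers not listed here are left unchanged.
RENAME_MAP = {
    "race/ethnicity": "race",
    "parental level of education": "parent_edu",
    "test preparation course": "prep",
    "math score": "math",
    "reading score": "reading",
    "writing score": "writing",
}


def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with known long headers renamed to the short names."""
    return frame.rename(columns=RENAME_MAP)


# Tiny stand-in frame with long headers
raw = pd.DataFrame(
    {"gender": ["female"], "math score": [72], "test preparation course": ["none"]}
)
short = normalize_columns(raw)
print(list(short.columns))  # ['gender', 'math', 'prep']
```

Applied to the lesson's data this would be `df = normalize_columns(df)`; on the mini sample it changes nothing.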

## 2) EDA: Questions and Short Exercises

**Goal:** Understand the overall shape of the dataset: column types, missing values, duplicates, and the main distributions.

**Questions (discussion):**
1. Which columns are **numeric** and which are **categorical**?
2. Does the dataset contain **missing (NaN)** values? Which columns have the most?
3. Which columns look **skewed** in their distribution?
4. How are the `gender`, `lunch`, and `prep` (test preparation) columns distributed?

> Note: `df` was created in the previous section. If it does not exist, the cell below builds the mini sample.

In [None]:
# Check that df exists; if not, build the mini sample
try:
    _ = df.shape
except NameError:
    import pandas as pd, io
    csv_text = """gender,race,parent_edu,lunch,prep,math,reading,writing
female,A,bachelor,standard,completed,72,72,74
female,C,some college,standard,none,69,90,88
female,B,master,free/discount,completed,90,95,93
male,B,associate,standard,none,47,57,44
male,E,high school,standard,none,76,78,75
female,D,associate,free/discount,completed,71,83,78
female,E,some college,standard,completed,88,95,92
male,C,some college,free/discount,none,40,43,39
male,D,high school,standard,completed,64,64,67
female,B,bachelor,standard,none,38,60,50
male,C,bachelor,standard,completed,58,54,52
female,C,master,free/discount,completed,95,99,98
male,E,associate,free/discount,none,45,52,49
female,A,high school,standard,none,74,72,75
male,B,bachelor,standard,completed,66,68,62
"""
    df = pd.read_csv(io.StringIO(csv_text))

print("Shape:", df.shape)
display(df.head())

In [None]:
# General information
df.info()  # prints its report directly and returns None, so no display() wrapper
display(df.describe(include='all').T.head(15))
print('\nMissing values per column:\n', df.isna().sum())
print('\nDuplicate rows:', df.duplicated().sum())
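
For question 3 (which columns look skewed), a quick numeric check can complement the plots. The sketch below is only an illustration on a small made-up frame named `scores`: it puts the mean, the median, and pandas' sample skewness side by side. A mean below the median together with a negative skew value hints at a longer left tail, though with this few rows the numbers are rough indications at best.

```python
import pandas as pd

# Small illustrative frame; in the notebook the numeric columns of df would be used
scores = pd.DataFrame({
    "math": [72, 69, 90, 47, 76, 71, 88, 40],
    "reading": [72, 90, 95, 57, 78, 83, 95, 43],
})

shape_check = pd.DataFrame({
    "mean": scores.mean(),
    "median": scores.median(),
    "skew": scores.skew(),  # sample skewness per column
})
# A negative gap means the mean sits below the median
shape_check["mean_minus_median"] = shape_check["mean"] - shape_check["median"]
print(shape_check.round(2))
```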

### 🧪 EDA: Exercises
1) List the dataset's **column names** and **types**.
2) Show the **value_counts()** results for `gender`, `lunch`, and `prep`.
3) Show the **min, max, mean** of the numeric columns (for example `math`, `reading`, `writing`).
4) If there are any, show the **NaN** values per column as a percentage.

Complete them in the **TODO** cells below.

In [None]:
# TODO: 1) Column names and dtypes
print("Columns:", list(df.columns))
print("\nDTypes:\n", df.dtypes)

In [None]:
# TODO: 2) value_counts for categorical columns
# Select by "not numeric" so both object and newer string dtypes are covered
for col in df.select_dtypes(exclude='number').columns:
    print(f"\n==== {col} ====")
    print(df[col].value_counts(dropna=False))

In [None]:
# TODO: 3) Basic stats for numeric columns
numeric_cols = df.select_dtypes(include='number').columns  # any numeric dtype, not only int64/float64
display(df[numeric_cols].agg(['min','max','mean']).T)

In [None]:
# TODO: 4) Missing values in percent
mv = df.isna().mean().sort_values(ascending=False)*100
display(mv.to_frame('missing_%').T if mv.sum()==0 else mv.to_frame('missing_%'))

## 3) Estimates of Location: Theory and Practice

**Key concepts:**
- **Mean:** the sum of all values divided by their count. Sensitive to outliers.
- **Median:** the middle point of the sorted values. Robust to outliers.
- **Mode:** the most frequently occurring value.
- **Quantiles/Percentiles:** cut points at given percentages of the distribution (for example 25%, 50%, 75%).
- **Weighted Mean:** a mean in which each value is given a weight.
- **Trimmed Mean:** a mean computed after dropping a fixed percentage of the smallest and largest values.
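
To make the last two definitions concrete, here is a minimal hand-written sketch. The function names `weighted_mean` and `trimmed_mean` and the sample numbers are illustrative only. The trimmed version drops `int(proportion * n)` values from each end of the sorted data, which is the same per-end convention that `scipy.stats.trim_mean` documents.

```python
import numpy as np


def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (values * weights).sum() / weights.sum()


def trimmed_mean(values, proportion):
    """Drop int(proportion * n) values from each end of the sorted data, then average."""
    data = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * data.size)
    return data[k:data.size - k].mean()


scores = [40, 47, 58, 64, 66, 69, 71, 72, 76, 200]  # 200 is an artificial outlier
wm = weighted_mean([80, 60], [0.6, 0.4])
tm = trimmed_mean(scores, 0.1)
print(wm)                # 72.0
print(tm)                # 65.375 (40 and 200 are dropped)
print(np.mean(scores))   # 76.3 for comparison
```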

> When to use which?
- Many outliers → **Median / Trimmed Mean** work well.
- Categorical data / discrete values → **Mode** is useful.
- Weighted scoring is needed (for example, subjects differ in importance) → **Weighted Mean**.
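
The first rule of thumb can be illustrated with a toy array (the numbers are made up, not taken from the dataset): adding a single extreme value moves the mean a long way, while the median barely changes.

```python
import numpy as np

scores = np.array([58, 64, 66, 69, 71, 72, 76])
with_outlier = np.append(scores, 1000)  # one implausibly large value

for label, data in [("original", scores), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean={data.mean():.2f}, median={np.median(data):.2f}")
# original:     mean=68.00,  median=69.00
# with outlier: mean=184.50, median=70.00
```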

In [None]:
# Work with the numeric columns
import numpy as np
numeric_cols = [c for c in df.columns if np.issubdtype(df[c].dtype, np.number)]
print("Numeric columns:", numeric_cols)

In [None]:
# Mean, Median, Mode (overall)
summary = {}
for col in numeric_cols:
    modes = df[col].mode(dropna=True)
    summary[col] = {
        "mean": df[col].mean(),
        "median": df[col].median(),
        # mode() returns every tied value in ascending order; keep the smallest, or None if the column is empty
        "mode": modes.iloc[0] if not modes.empty else None,
    }

display(pd.DataFrame(summary).T)

In [None]:
# Quantiles & Percentiles
q = df[numeric_cols].quantile([0.1, 0.25, 0.5, 0.75, 0.9])
display(q)
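
One way to read a quantile table: the q-quantile is a cut point with roughly a share q of the observations at or below it. The sketch below checks this on the integers 1 to 100, an artificial series chosen so that the shares come out exact. On real data with ties or few rows the shares are only approximate.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1, 101))  # the integers 1..100
quartiles = values.quantile([0.25, 0.5, 0.75])  # default linear interpolation

for q, cut in quartiles.items():
    share = (values <= cut).mean()  # fraction of observations at or below the cut
    print(f"q={q:.2f}: cut={cut:.2f}, share at or below={share:.2f}")
# q=0.25: cut=25.75, share at or below=0.25
# q=0.50: cut=50.50, share at or below=0.50
# q=0.75: cut=75.25, share at or below=0.75
```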

In [None]:
# Weighted Mean (example: score math and reading at a 60/40 ratio)
if {"math", "reading"}.issubset(df.columns):
    w = np.array([0.6, 0.4])
    values = df[["math", "reading"]].to_numpy()
    # Per-student weighted score, then the average over all students
    weighted_scores = (values * w).sum(axis=1) / w.sum()
    print("Weighted mean (math=0.6, reading=0.4):", weighted_scores.mean())
else:
    print("Columns 'math' and 'reading' not found. Skipping the weighted mean demo.")

In [None]:
# Trimmed Mean (10% cut from each end)
try:
    from scipy.stats import trim_mean
    trimmed = {col: trim_mean(df[col].dropna(), 0.1) for col in numeric_cols}
    display(pd.Series(trimmed, name="trimmed_mean_10%"))
except Exception as e:
    print("SciPy is not available, or an error occurred:", e)

### 🧪 Exercises: Estimates of Location
1) Compute the **mean**, **median**, and **mode** separately for the `math`, `reading`, and `writing` columns and compare them.
2) Show the 25%, 50%, and 75% **quantile** values for each subject.
3) Compute the **weighted mean** (math 50%, reading 25%, writing 25%).
4) Compute the **trimmed mean** (5% and 20%) and compare it with the plain mean.
5) Group by `gender` and compare the **mean/median**.


In [None]:
# TODO: 1) mean/median/mode for math, reading, writing
cols = [c for c in ["math","reading","writing"] if c in df.columns]
for c in cols:
    print(f"\n-- {c.upper()} --")
    print("mean:", df[c].mean())
    print("median:", df[c].median())
    try:
        print("mode:", df[c].mode().iloc[0])
    except Exception:
        print("mode: N/A")

In [None]:
# TODO: 2) quantiles for math, reading, writing
quantiles = df[cols].quantile([0.25, 0.5, 0.75]) if cols else None
display(quantiles)

In [None]:
# TODO: 3) Weighted mean with weights 0.5, 0.25, 0.25
import numpy as np
if set(["math","reading","writing"]).issubset(set(df.columns)):
    w = np.array([0.5, 0.25, 0.25])
    arr = df[["math","reading","writing"]].to_numpy()
    weighted = (arr * w).sum(axis=1) / w.sum()
    print("Weighted mean (0.5, 0.25, 0.25):", weighted.mean())
else:
    print("Required columns not found.")

In [None]:
# TODO: 4) Trimmed mean at 5% and 20%
try:
    from scipy.stats import trim_mean
    for p in [0.05, 0.2]:
        vals = {c: trim_mean(df[c].dropna(), p) for c in cols}
        print(f"Trimmed mean (p={p}):", vals)
except Exception as e:
    print("SciPy is required.", e)

In [None]:
# TODO: 5) mean/median by gender
if "gender" in df.columns and cols:
    grp_mean = df.groupby("gender")[cols].mean()
    grp_median = df.groupby("gender")[cols].median()
    display(grp_mean)
    display(grp_median)
else:
    print("'gender' or the subject columns were not found.")

## 4) Visualization
- **Histogram**: for example, the distribution of `math` scores (with mean/median lines)
- **Boxplot**: a box plot of `math` scores split by gender

> Note: as required, we use **matplotlib only**, with each figure drawn separately.

In [None]:
import matplotlib.pyplot as plt
if "math" in df.columns:
    plt.figure()
    plt.hist(df["math"].dropna(), bins=15)
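    # Explicit colors keep the lines visible against the default-colored bars
    plt.axvline(df["math"].mean(), color='red', linestyle='--', label='Mean')
    plt.axvline(df["math"].median(), color='black', linestyle=':', label='Median')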
    plt.title("Math score histogram")
    plt.xlabel("Score")
    plt.ylabel("Count")
    plt.legend()
    plt.show()
else:
    print("Column 'math' not found.")

In [None]:
import matplotlib.pyplot as plt
if "math" in df.columns and "gender" in df.columns:
    plt.figure()
    labels = list(df["gender"].dropna().unique())
    groups = [df.loc[df["gender"] == g, "math"].dropna().values for g in labels]
    plt.boxplot(groups)
    # Label the boxes via xticks: the boxplot `labels` keyword is deprecated in newer matplotlib releases
    plt.xticks(range(1, len(labels) + 1), labels)
    plt.title("Math score by gender")
    plt.ylabel("Score")
    plt.show()
else:
    print("'math' or 'gender' not found.")

## 5) Review: Q&A
1. What is the difference between the mean and the median, and when is the median preferable?
2. In which situations is the mode a useful metric?
3. What does a quantile/percentile mean? How are the 25% and 75% quantiles interpreted?
4. What are the advantages of the weighted mean and the trimmed mean?
5. What should you pay attention to during EDA (missing values, duplicates, skewness)?

## 6) Mini-project: **Top 10% Students Report**
**Task:** Identify the top 10% of students in each subject and compare them by `gender`.

**Steps:**
1) Find the 90th-percentile values for `math`, `reading`, and `writing`.
2) For each subject, list the students at or above the 90th percentile.
3) Count how many students of each `gender` appear in each list.
4) (optional) Find the super-top students who are at 90%+ in all three subjects.


In [None]:
# TODO: Mini-project solution
target_cols = [c for c in ["math","reading","writing"] if c in df.columns]
if target_cols:
    p90 = df[target_cols].quantile(0.9)
    print("P90 thresholds:\n", p90)

    top = {}
    for c in target_cols:
        top[c] = df[df[c] >= p90[c]].copy()
        print(f"\nTop 10% for {c}:", top[c].shape)
        if "gender" in top[c].columns:
            print(top[c]["gender"].value_counts())

    # Optional: at or above the 90th percentile in all three subjects
    if set(["math","reading","writing"]).issubset(set(target_cols)):
        super_top = df[(df["math"]>=p90["math"]) & (df["reading"]>=p90["reading"]) & (df["writing"]>=p90["writing"])].copy()
        print("\nSuper-top (90%+ in all 3 subjects):", super_top.shape)
        display(super_top.head())
else:
    print("Required columns not found.")

## 7) Homework
1) Extend the EDA: build a **mean/median** comparison table for each categorical column (`gender`, `lunch`, `prep`, `parent_edu`).
2) Draw a **histogram** of the `writing` scores and add **mean/median** lines.
3) Draw a **bar chart** of the `math` medians by `race` (if present).
4) Test the effect of an outlier: add one very large value to `math` and compare how the **mean vs median** change.
5) (optional) Compare `weighted mean` weights under different scenarios (for example, giving `math` a weight of 0.7).