# General exercises: Lessons 1-5 combined

**Goal:** A combined set of exercises and notes on **data cleaning and preparation** (missing values, encoding, EDA), covering the material from Lesson 1 through Lesson 5.

**Contents:** Import & CSV loading → Inspecting the data (head, tail, info, describe) → Checking for missing values → Filling/dropping missing values → Encoding (Label, One-Hot) → Concat.


## Lesson 1: Import and loading data

- **import** = loads an external library (module) so it can be used
- **as** = gives the library a shorter alias (such as pd, np)
- **from ... import** = uses only a specific part of a library
- If needed: `pip install pandas scikit-learn`

In [11]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# LabelEncoder is used for encoding in Lesson 5

In [10]:
# df = DataFrame (a pandas table). read_csv() reads a CSV file.
# encoding='euc-kr' makes Korean (Hangul) characters decode correctly
df = pd.read_csv("./Data/법무부_외국인체류데이터_20241231.csv", encoding='euc-kr')
df2 = pd.read_csv('./Data/titanic.csv')
df3 = pd.read_csv('./Data/Iris.csv')
df4 = pd.read_csv('./Data/housing.csv')
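
To show why the `encoding` argument matters, here is a small self-contained sketch that writes a tiny Korean CSV in EUC-KR to a temporary folder and reads it back. The file name `sample_euckr.csv` and its contents are made up for this example and are not part of the course data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, not taken from the course files.
sample = pd.DataFrame({"국적": ["베트남", "기타"], "인원": [10, 3]})

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample_euckr.csv")

# Save the file in EUC-KR, the encoding passed to read_csv in the cell above.
sample.to_csv(path, index=False, encoding="euc-kr")

# Reading with the matching encoding restores the Korean text.
restored = pd.read_csv(path, encoding="euc-kr")
print(restored)

# Reading with the default UTF-8 fails, because the bytes are not valid UTF-8.
try:
    pd.read_csv(path)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print("UTF-8 read failed:", utf8_failed)
```

If a file's encoding is unknown, trying `utf-8`, then `cp949` or `euc-kr` is a common first step for Korean data.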

## Lesson 2: Getting to know the data (EDA basics)

Check what the data looks like: rows, columns, data types, and unique values.

In [22]:
df.head(1)   # first 1 row (default is 5); check the start of the data

Unnamed: 0,대륙,국적,성별,D1(문화예술),D2(유학),D3(기술연수),D4(일반연수),D5(취재),D6(종교),D7(주재),...,E10(선원취업),F1(방문동거),F2(거주),F3(동반),F5(영주),F6(결혼이민),G1(기타),H1(관광취업),H2(방문취업),기타(Other)
0,아시아주,베트남,남성,2,17385,280,27156,2,45,18,...,9540,12857,2960,1408,831,4958,713,0,0,1203


In [23]:
df.tail(1)   # last 1 row (default is 5); check the end of the data

Unnamed: 0,대륙,국적,성별,D1(문화예술),D2(유학),D3(기술연수),D4(일반연수),D5(취재),D6(종교),D7(주재),...,E10(선원취업),F1(방문동거),F2(거주),F3(동반),F5(영주),F6(결혼이민),G1(기타),H1(관광취업),H2(방문취업),기타(Other)
379,기타,기타,여성,0,1,0,0,0,0,0,...,0,45,14,1,2,7,13,0,0,8


In [24]:
df.info()   # columns, non-null counts, Dtype (int64, object, float64); use it to check data types and spot missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 32 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   대륙         380 non-null    object
 1   국적         380 non-null    object
 2   성별         380 non-null    object
 3   D1(문화예술)   380 non-null    int64 
 4   D2(유학)     380 non-null    int64 
 5   D3(기술연수)   380 non-null    int64 
 6   D4(일반연수)   380 non-null    int64 
 7   D5(취재)     380 non-null    int64 
 8   D6(종교)     380 non-null    int64 
 9   D7(주재)     380 non-null    int64 
 10  D8(기업투자)   380 non-null    int64 
 11  D9(무역경영)   380 non-null    int64 
 12  D10(구직)    380 non-null    int64 
 13  E1(교수)     380 non-null    int64 
 14  E2(회화)     380 non-null    int64 
 15  E3(연구)     380 non-null    int64 
 16  E4(기술지도)   380 non-null    int64 
 17  E5(전문직업)   380 non-null    int64 
 18  E6(예술흥행)   380 non-null    int64 
 19  E7(특정활동)   380 non-null    int64 
 20  E8(계절근로)   380 non-null    int64

In [25]:
df.nunique()   # number of unique (distinct) values per column; for categorical columns, the number of classes

대륙             7
국적           197
성별             2
D1(문화예술)       6
D2(유학)       144
D3(기술연수)      17
D4(일반연수)      88
D5(취재)        11
D6(종교)        33
D7(주재)        23
D8(기업투자)      62
D9(무역경영)      43
D10(구직)       60
E1(교수)        34
E2(회화)        30
E3(연구)        39
E4(기술지도)      13
E5(전문직업)      15
E6(예술흥행)      43
E7(특정활동)      84
E8(계절근로)      26
E9(비전문취업)     42
E10(선원취업)      9
F1(방문동거)      79
F2(거주)        92
F3(동반)       113
F5(영주)        80
F6(결혼이민)     120
G1(기타)       105
H1(관광취업)      37
H2(방문취업)      17
기타(Other)    105
dtype: int64

In [26]:
df["대륙"].nunique()   # number of unique values in the "대륙" (continent) column only

7

In [None]:
# describe() gives statistics for numeric columns: count, mean, std, min, 25%, 50% (median), 75%, max
# Useful in Lesson 4 when choosing between fillna(median) and fillna(mean)
df.describe()

In [None]:
# shape = (number_of_rows, number_of_columns), columns = list of column names
df.shape, df.columns.tolist()[:5]
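
The note about `describe()` helping with the choice between mean and median can be illustrated with a toy Series. The ages below are invented for this sketch; one extreme value is enough to pull the mean away from the median.

```python
import pandas as pd

# Made-up ages with one extreme value, for illustration only.
ages = pd.Series([20, 22, 24, 26, 98], name="Age")

summary = ages.describe()
print(summary)

# The "50%" row of describe() is the median.
print("mean:", summary["mean"])     # 38.0
print("median:", summary["50%"])    # 24.0
```

When the two differ this much, the median is usually the safer fill value, because it is not dragged by outliers.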

## Lesson 3: Checking for missing values

Find out whether the data contains NaN or empty values, and how many there are in each column.

In [None]:
# Number of missing values (NaN) in each column. Lesson 3
df.isnull().sum()

대륙           0
국적           0
성별           0
D1(문화예술)     0
D2(유학)       0
D3(기술연수)     0
D4(일반연수)     0
D5(취재)       0
D6(종교)       0
D7(주재)       0
D8(기업투자)     0
D9(무역경영)     0
D10(구직)      0
E1(교수)       0
E2(회화)       0
E3(연구)       0
E4(기술지도)     0
E5(전문직업)     0
E6(예술흥행)     0
E7(특정활동)     0
E8(계절근로)     0
E9(비전문취업)    0
E10(선원취업)    0
F1(방문동거)     0
F2(거주)       0
F3(동반)       0
F5(영주)       0
F6(결혼이민)     0
G1(기타)       0
H1(관광취업)     0
H2(방문취업)     0
기타(Other)    0
dtype: int64

In [28]:
# Does the column contain at least one NaN? True/False
df.isnull().any()


대륙           False
국적           False
성별           False
D1(문화예술)     False
D2(유학)       False
D3(기술연수)     False
D4(일반연수)     False
D5(취재)       False
D6(종교)       False
D7(주재)       False
D8(기업투자)     False
D9(무역경영)     False
D10(구직)      False
E1(교수)       False
E2(회화)       False
E3(연구)       False
E4(기술지도)     False
E5(전문직업)     False
E6(예술흥행)     False
E7(특정활동)     False
E8(계절근로)     False
E9(비전문취업)    False
E10(선원취업)    False
F1(방문동거)     False
F2(거주)       False
F3(동반)       False
F5(영주)       False
F6(결혼이민)     False
G1(기타)       False
H1(관광취업)     False
H2(방문취업)     False
기타(Other)    False
dtype: bool

In [None]:
# Is there a NaN in one specific column? (from Lesson 4)
df["F5(영주)"].isnull().any()

np.False_
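
The dataset above has no missing values, so every count is 0. To see the same calls return something non-trivial, here is a sketch on a small invented DataFrame with deliberate gaps; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up data with deliberate gaps; not from the course files.
toy = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "age": [21.0, np.nan, 35.0, np.nan],
    "city": ["Seoul", "Busan", "Incheon", "Daegu"],
})

na_counts = toy.isnull().sum()        # NaN count per column
na_flags = toy.isnull().any()         # True if the column has at least one NaN
na_share = toy.isnull().mean() * 100  # percentage of missing rows per column

print(na_counts)
print(na_flags)
print(na_share)
```

The percentage view is often more useful than the raw count when deciding whether a column is worth keeping.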

## Lesson 4: Filling and dropping missing values

- **fillna**: fills NaN with another value (mean, median, mode, or a fixed value)
- **dropna**: deletes rows that contain NaN (use with care)
- **drop**: deletes a column or a row

In [None]:
# fillna fills missing values. Mean = average, Median = middle value, Mode = most frequent value
# Example: fill NaN in the titanic Age column with the median (Lesson 5 note)
df2_copy = df2.copy()
df2_copy['Age'] = df2_copy['Age'].fillna(df2_copy['Age'].median())
df2_copy['Age'].isnull().sum()   # should be 0

In [None]:
# Other options: mean, mode (for object columns), or a fixed value
# df['col'] = df['col'].fillna(df['col'].mean())
# df['col'] = df['col'].fillna("Unknown")   # assign back; inplace=True on a selected column is unreliable in recent pandas
# dropna drops rows that contain NaN (data is lost, so use with care)
# df_clean = df.dropna()   or   df.dropna(subset=['Age'], inplace=True)
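
The commented options above can be tried on a small invented DataFrame. This is a sketch only: the `score` and `grade` columns are placeholders, not columns from the course datasets.

```python
import numpy as np
import pandas as pd

# Made-up data; the column names are placeholders.
toy = pd.DataFrame({
    "score": [10.0, np.nan, 30.0, 50.0],
    "grade": ["A", "B", None, "A"],
})

filled = toy.copy()
# Numeric column: fill with the mean of the observed values.
filled["score"] = filled["score"].fillna(filled["score"].mean())
# Categorical column: mode() returns a Series, so take its first entry.
filled["grade"] = filled["grade"].fillna(filled["grade"].mode()[0])

# dropna() removes every row that contains at least one NaN.
dropped_any = toy.dropna()
# subset= limits the check to the listed columns.
dropped_score = toy.dropna(subset=["score"])

print(filled)
print(len(dropped_any), len(dropped_score))
```

Here `dropna()` keeps only 2 of the 4 rows, while filling keeps all 4, which is the trade-off the "use with care" warning refers to.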

In [None]:
# drop removes a column. axis=1 means columns; inplace=True modifies the DataFrame itself
# df.drop('column_name', axis=1, inplace=True)
# Several at once: df.drop(['col1','col2'], axis=1, inplace=True)
print("df2 columns:", df2.columns.tolist()[:6])
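
A runnable version of the `drop` calls, using an invented three-column DataFrame. The sketch uses the `columns=` and `index=` keywords, which are an alternative spelling of `axis=1` and `axis=0`, and leaves `inplace` off so the original stays intact.

```python
import pandas as pd

# Made-up data; the column names are placeholders.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "memo": ["x", "y", "z"],
})

# columns= is equivalent to passing axis=1.
no_memo = toy.drop(columns=["memo"])
# index= drops rows by index label.
no_first_row = toy.drop(index=[0])

print(no_memo.columns.tolist())
print(no_first_row.index.tolist())
print(toy.columns.tolist())  # the original is unchanged
```

Returning a new DataFrame instead of using `inplace=True` makes it easier to rerun a notebook cell without hitting a "column not found" error on the second run.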

## Lesson 5: Encoding (converting categories to numbers)

- **Label Encoding**: assigns 0, 1, 2, ... to each category (sklearn LabelEncoder, fit_transform)
- **One-Hot (get_dummies)**: creates a separate 0/1 column for each category (for categories with no natural order)
- **concat**: joins two DataFrames together (pandas)

In [None]:
# Label Encoding: sklearn.preprocessing.LabelEncoder, fit_transform()
# Example: convert df["성별"] (남성/여성, i.e. male/female) to 0/1
le = LabelEncoder()
df['성별_encoded'] = le.fit_transform(df['성별'])
df[['성별', '성별_encoded']].drop_duplicates()
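
To see which number each category receives, the fitted encoder can be inspected. The sketch below uses invented color labels rather than the course data; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration.
colors = pd.Series(["red", "blue", "green", "blue"])

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's position is its code.
print(list(le.classes_))
print(codes.tolist())

# inverse_transform maps codes back to the original labels.
print(le.inverse_transform(codes).tolist())
```

Because the codes follow sorted order, they carry no real meaning: `green` is not "between" `blue` and `red`. That is the reason one-hot encoding is preferred for unordered categories.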

In [None]:
# One-Hot Encoding: pd.get_dummies(). Each category becomes its own column with 0 or 1
# prefix is the name prepended to the new columns
dummies = pd.get_dummies(df['대륙'], prefix='대륙')
dummies.head(3)
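
`get_dummies` can also be applied to a whole DataFrame, which keeps the other columns in place. The sketch below uses an invented two-column frame; `dtype=int` is passed because pandas 2.0 and later return boolean dummy columns by default.

```python
import pandas as pd

# Made-up data; the column names are placeholders.
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia"],
    "count": [5, 2, 7],
})

# columns= encodes only the listed columns and keeps the rest;
# dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["continent"], prefix="continent", dtype=int)
print(encoded)
```

The original `continent` column is replaced by `continent_Asia` and `continent_Europe`, so no separate drop step is needed.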

In [None]:
# concat joins two (or more) DataFrames into one
# axis=0 stacks them vertically (rows), axis=1 places them side by side (columns)
# pd.concat([df_a, df_b], axis=0, ignore_index=True)
top = df.head(2)
bottom = df.tail(2)
pd.concat([top, bottom], axis=0, ignore_index=True)
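
The cell above shows `axis=0`. For `axis=1`, the key point is that rows are matched by index label, not by position. A sketch with two invented single-column frames:

```python
import pandas as pd

# Made-up data for illustration.
left = pd.DataFrame({"name": ["A", "B", "C"]})
right = pd.DataFrame({"score": [90, 80, 70]})

# axis=1 aligns rows on the index and places the columns side by side.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# With mismatched indexes, positions without a partner are filled with NaN.
shifted = right.set_index(pd.Index([1, 2, 3]))
misaligned = pd.concat([left, shifted], axis=1)
print(misaligned)
```

When joining one-hot columns back onto a filtered DataFrame, calling `reset_index(drop=True)` on both sides first avoids this kind of accidental NaN.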

## Extra: loc and iloc

- **loc**: selects by label, that is, index label or column name (for example `loc[9]` or `loc[0:5, '대륙']`); a label slice includes its end label
- **iloc**: selects by integer position: `iloc[0:5, 0:3]` gives the first 5 rows and the first 3 columns; the end position is excluded

In [43]:
# loc[9] returns the row whose index label is 9 (a single row is a Series)
uzb = df.loc[9]
uzb
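
The difference between the two is easiest to see when the index labels do not match the positions. The sketch below uses an invented frame whose index starts at 10; the column names are placeholders.

```python
import pandas as pd

toy = pd.DataFrame(
    {"continent": ["Asia", "Asia", "Europe", "Africa"], "count": [5, 7, 2, 1]},
    index=[10, 11, 12, 13],  # labels deliberately differ from positions
)

by_label = toy.loc[10:12, "continent"]  # label slice: end label 12 is included
by_position = toy.iloc[0:2, 0]          # position slice: end position 2 is excluded

print(by_label.tolist())     # ['Asia', 'Asia', 'Europe']
print(by_position.tolist())  # ['Asia', 'Asia']
```

With the default `RangeIndex`, as in `df` above, labels and positions coincide, so `loc[9]` and `iloc[9]` return the same row.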